CN109960799B - Short text-oriented optimization classification method - Google Patents


Info

Publication number
CN109960799B
CN109960799B (application CN201910182364.9A)
Authority
CN
China
Prior art keywords
feature
word
vector
cluster
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910182364.9A
Other languages
Chinese (zh)
Other versions
CN109960799A (en)
Inventor
李芳芳
尹垚
毛星亮
施荣华
石金晶
胡超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CHANGSHA ZHIWEI INFORMATION TECHNOLOGY Co.,Ltd.
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University filed Critical Central South University
Priority to CN201910182364.9A priority Critical patent/CN109960799B/en
Publication of CN109960799A publication Critical patent/CN109960799A/en
Application granted granted Critical
Publication of CN109960799B publication Critical patent/CN109960799B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F18/22 Pattern recognition — analysing; matching criteria, e.g. proximity measures
    • G06F18/23213 Pattern recognition — non-hierarchical clustering using statistics or function optimisation with a fixed number of clusters, e.g. K-means clustering
    • G06F18/2411 Pattern recognition — classification based on the proximity to a decision surface, e.g. support vector machines
    • G06F40/289 Natural language analysis — phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/30 Handling natural language data — semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a short-text-oriented optimized classification method comprising the following steps. Step one: acquire an original data set and preprocess it. Step two: select a feature item set from the preprocessed data set. Step three: train a word vector model on a collected large-scale corpus with a word vector tool. Step four: represent each feature item in the feature item set as a word vector using the word vector model, and perform one-stage preliminary clustering on the feature item word vectors to obtain a plurality of preliminary feature clusters. Step five: perform two-stage loose clustering inside each preliminary feature cluster to obtain a plurality of similar feature clusters. Step six: replace the feature words obtained in step four with the similar feature clusters obtained in step five, then classify the short texts with a classifier. Traditional short text classification mostly lacks semantic expression capability and works in a high-dimensional feature space; the present method expresses the semantic information of short texts better while reducing the dimensionality of the feature space, thereby improving both the precision and the efficiency of short text classification. It can be used in short text classification tasks in various fields, such as spam message classification and microblog topic classification.

Description

Short text-oriented optimization classification method
Technical Field
The invention belongs to the technical field of Chinese short text classification, relates to an optimized classification method for short texts, and particularly relates to a classification method for network short texts.
Background
In the information age of data explosion, intelligent mobile terminals and the rapid development of Internet technology have made communication on the mobile Internet ever more frequent, generating large amounts of information data. Most of this data is carried as short texts, such as microblogs and instant news pushes, which are concise in content, rich in meaning, and of high research value. How to classify short texts automatically, which helps in understanding the rich meanings they express, has therefore become a hot and difficult topic in natural language processing and machine learning research.
Short texts are typically under one hundred words long, contain a rich vocabulary, and are structurally flexible. Conventional long-text classification methods typically segment an entire document into words, phrases, and sentences, and then represent the document vector with a Vector Space Model (VSM). Applied to short text classification, this approach has the following problems: (1) the VSM ignores the influence of synonyms on similarity when computing the semantic similarity between sentences; synonyms have similar meanings, but the VSM counts them separately, which distorts the sentence-similarity computation and thus the accuracy of text classification. (2) When the amount of text data is large, representing the texts with the VSM leads to a severe curse-of-dimensionality problem. (3) Short texts are usually brief and contain many ambiguous and noisy words; the effective features extracted by traditional methods are often insufficient, so the representation carries little semantic information, which hampers subsequent classification.
In response to the above problems in short text classification, many researchers have proposed improved methods. Patent 201810447731.9 extracts the contextual emotional feature values and prior emotional feature values of a short text, then combines emotional feature vectors from different layers to assist classification. Patent 201810090256.4 feeds each item in the short text training set to a search engine as the query, then selects the top search result with the highest similarity as an expansion corpus to achieve semantic enhancement. Patent 201710994366.9 first obtains an external corpus from a knowledge base to build a topic model, then divides the short text data stream into data blocks with a sliding-window mechanism and expands the short texts in each block with the topic model to obtain an expanded data stream; finally, it builds a topic model for each block of the expanded stream to obtain a topic representation of each short text. Patent 201710686945.7 computes, after a hyperplane separates the two sample classes, the geometric margin between each sample to be classified and the hyperplane, divides several sub-domains according to that margin, and assigns each sub-domain interval a different weight; in the undersampling stage the data are undersampled according to those weights to obtain a new text vector represented via the support vector machine. Patent 201710337594.9 builds word vectors for feature words and phonetic-character vectors for their corresponding n-gram phonetic characters, then trains both; it uses the n-gram phonetic characters corresponding to a word to express the word's features.
Existing patented short text classification techniques mainly improve on semantic expansion or word vector representation. Semantic expansion methods need large external corpora, increase computational overhead, bring dimensionality disasters, and are often limited in their application scenarios, while methods based solely on word vector representation yield limited gains in classification accuracy. The main reason is that word representations obtained with traditional word embedding or TF-IDF methods only contain the semantic or statistical information of the current corpus, whereas short texts are brief and full of polysemous and noisy words, so few effective features can be extracted and sufficient semantic information is hard to obtain. The invention therefore provides a short text optimized classification method based on two-stage feature clustering: it uses a word2vec model to compute the similarity between features and builds similar feature clusters through effective feature clustering and optimization, enhancing the semantic representation of short texts and improving classification precision; at the same time, it improves classification efficiency by reducing the computation needed to build the similar feature clusters and the dimensionality of the feature space.
Disclosure of Invention
The invention aims to overcome the defects of the technology and provide an optimized classification method for short texts. On one hand, the method expands and uses more semantic features to classify the short text so as to improve the accuracy of short text classification; on the other hand, similar feature clusters obtained by clustering the two-stage features are used for replacing the original features for classification, so that the dimensionality of the features is reduced, and the efficiency of short text classification is improved. The invention can better support related application research, such as spam classification, mail classification, microblog topic classification and the like.
The specific technical scheme is as follows:
acquiring an original data set and preprocessing the original data set;
A. the original data set comes from open-source news corpora published on the web (e.g., by the Fudan and Sogou laboratories);
B. a collected and curated dictionary of web words is added to improve the precision of subsequent word segmentation;
C. the original data set is segmented into words and stop words are removed, completing preprocessing. The segmentation uses the Chinese Academy of Sciences segmentation tool ICTCLAS 2018.
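A minimal preprocessing sketch under stated assumptions: the patent uses the ICTCLAS 2018 segmenter, for which the freely available jieba segmenter is substituted here, and the dictionary and stopword file names are illustrative placeholders.

```python
import jieba

# Hypothetical paths: a curated web-word dictionary and a stopword list.
jieba.load_userdict("web_words.dict")
with open("stopwords.txt", encoding="utf-8") as f:
    stopwords = {line.strip() for line in f}

def preprocess(text: str) -> list[str]:
    """Segment a raw short text into words and drop stop words (step one, C)."""
    return [w for w in jieba.cut(text) if w.strip() and w not in stopwords]
```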
Selecting a feature item set from the preprocessed data set obtained in step one;
specifically, each word in the preprocessed original data set is traversed, and non-repeating words whose word frequency exceeds a set threshold are selected as feature words, from which the feature item set is constructed.
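A sketch of this frequency-threshold selection, assuming the preprocessed corpus is available as one token list per document; the default threshold of 3 follows embodiment 1 (embodiment 2 uses 2).

```python
from collections import Counter

def build_feature_set(corpus_tokens, min_freq=3):
    """Select non-repeating words whose frequency exceeds min_freq.

    corpus_tokens: iterable of token lists, one per preprocessed document.
    Returning a set both deduplicates and gives O(1) membership tests later.
    """
    counts = Counter(tok for doc in corpus_tokens for tok in doc)
    return {w for w, c in counts.items() if c > min_freq}
```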
Training the collected large-scale corpus by using a word vector tool to obtain a word vector model;
A. obtaining open-source news corpora from the web (the Fudan and Sogou laboratories) and Chinese corpora from Wikipedia, then preprocessing the data in the same way as in step one;
B. performing word vector training on the preprocessed corpus with the word2vec word vector tool. Word2Vec is an open-source word vector tool from Google; it trains efficiently on dictionaries of millions of words and data sets of billions of tokens, and the resulting word vectors (word embeddings) measure the similarity between words well;
C. storing the word vector model obtained in step B.
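A training sketch with gensim's Word2Vec implementation. The 300-dimensional vector size follows the embodiments; the window and min_count values are assumptions, as the patent does not specify them, and `sentences` is assumed to hold the token lists produced by the preprocessing above.

```python
from gensim.models import Word2Vec

# `sentences` is assumed available: one token list per preprocessed document
# from the news/Wikipedia corpora.
model = Word2Vec(
    sentences=sentences,
    vector_size=300,  # the embodiments train 300-dimensional vectors
    window=5,         # assumed context window (not specified in the patent)
    min_count=5,      # assumed frequency floor (not specified in the patent)
    workers=4,
)
model.save("word2vec.model")  # step three, C: persist the word vector model
```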
Performing word vector representation on each feature item in the feature item set using the word vector model, and performing one-stage preliminary clustering on the feature item word vectors to obtain a plurality of preliminary feature clusters;
the first-stage preliminary clustering is specifically carried out by adopting a spectral clustering algorithm according to the following steps:
A. Determine the number of cluster centers K using the elbow method. As K grows, the samples are divided more finely, the cohesion of each cluster gradually improves, and the sum of squared errors (SSE) gradually decreases. While K is below the true number of clusters, increasing K sharply improves each cluster's cohesion, so the SSE falls steeply; once K reaches the true number of clusters, further increases in K bring rapidly diminishing returns in cohesion, so the SSE's rate of decline drops abruptly and then flattens as K keeps growing. The plot of SSE against K therefore has the shape of an elbow, and the K value at the elbow is the true number of clusters in the data.
B. Call the machine-learning toolkit scikit-learn to perform spectral clustering on the feature items in the feature item set of step four;
C. Obtain K preliminary feature clusters.
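A sketch of steps A–C under stated assumptions: the elbow data is computed from K-means inertia (one common reading of "sum of squared errors"), and the spectral clustering affinity is an assumed choice, since the patent names only the toolkit and the algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans, SpectralClustering

def sse_per_k(vectors: np.ndarray, k_min: int = 2, k_max: int = 20) -> dict:
    """Elbow-method data: K-means sum of squared errors (inertia) per K.
    The K where the SSE curve flattens (the 'elbow') is the cluster count."""
    return {k: KMeans(n_clusters=k, n_init=10).fit(vectors).inertia_
            for k in range(k_min, k_max + 1)}

def preliminary_clusters(vectors: np.ndarray, k: int) -> np.ndarray:
    """One-stage preliminary clustering of feature-word vectors; returns a
    cluster label per feature word."""
    return SpectralClustering(
        n_clusters=k, affinity="nearest_neighbors"  # assumed affinity choice
    ).fit_predict(vectors)

# Usage sketch: `vectors` is an (n_features, 300) array of feature-word vectors.
# sse = sse_per_k(vectors)                    # inspect to locate the elbow
# labels = preliminary_clusters(vectors, 11)  # K = 11 in embodiment 1
```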
Step five, performing two-stage loose clustering inside each preliminary feature cluster obtained in the step four, thereby obtaining a plurality of similar feature clusters;
the two-stage loose clustering specifically comprises the following steps:
A. deriving the word vector of each feature item from the word vector model obtained in step three;
B. the cosine similarity $\mathrm{Vsim}(V_i, V_j)$ between the feature vectors in each preliminary feature cluster is calculated using the following formula:

$$\mathrm{Vsim}(V_i, V_j) = \frac{\sum_{k=1}^{n} V_i^{k} \, V_j^{k}}{\sqrt{\sum_{k=1}^{n} \left(V_i^{k}\right)^{2}} \; \sqrt{\sum_{k=1}^{n} \left(V_j^{k}\right)^{2}}}$$

where $V_i$ is the vector of feature word $i$, $V_j$ is the vector of feature word $j$, $k$ indexes the dimensions of the feature vector, $n$ is the total number of dimensions, $V_i^{k}$ is the $k$-th component of the vector of feature word $i$, and $V_j^{k}$ is the $k$-th component of the vector of feature word $j$;
C. matching and connecting feature words whose cosine similarity exceeds a set threshold;
D. within each preliminary feature cluster obtained in step four, constructing a feature-similarity adjacency list from the matched feature words, performing a depth-first traversal of the adjacency list, and forming one similar feature cluster per traversal iteration, finally obtaining a plurality of similar feature clusters. We call this way of clustering, in which similar features are connected directly, loose clustering: it is not based on any model-driven clustering algorithm; instead, semantically similar feature words are selected and paired, and the matched pairs are merged into similar feature clusters.
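A sketch of the loose clustering of steps A–D: pair feature words above a cosine threshold, then take connected components of the resulting adjacency list by depth-first traversal. The default threshold of 0.5 follows the optimum reported in embodiment 1; the maximum-cluster-size control mentioned later is omitted for brevity.

```python
import numpy as np
from collections import defaultdict

def loose_cluster(words, vecs, threshold=0.5):
    """Return similar feature clusters as lists of words.

    words: list of feature words; vecs: matching (n, dim) array of vectors.
    """
    unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sim = unit @ unit.T                       # pairwise cosine similarities
    adj = defaultdict(list)
    n = len(words)
    for i in range(n):
        for j in range(i + 1, n):
            if sim[i, j] > threshold:         # step C: match and connect
                adj[i].append(j)
                adj[j].append(i)
    seen, clusters = set(), []
    for start in range(n):                    # step D: DFS per component
        if start in seen or start not in adj:
            continue
        stack, component = [start], []
        while stack:
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            component.append(words[v])
            stack.extend(adj[v])
        clusters.append(component)
    return clusters
```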
Replacing the feature words obtained in step four with the similar feature clusters obtained in step five, and then classifying the short texts with a classifier, specifically as follows:
A. traversing the feature words obtained in step four; if a feature word belongs to some similar feature cluster, it is replaced with that similar feature cluster; if it belongs to no similar feature cluster, it is kept. This finally yields a replacement data set T of the original data set;
B. calculating the cluster vector of each similar feature cluster, namely the mean of the word vectors of all feature items in that cluster;
C. calculating the space vectors of the replacement data set T, obtained by applying the TF-IDF algorithm to T for the vector space computation;
D. the final vector representation of the replacement data set T is the concatenation of the cluster vectors from B and the space vectors from C;
E. classifying the final vectors of the replacement data set T with the text classifier SVM to obtain the classification result.
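A sketch of steps A–E under stated assumptions: `word_to_cluster` and `cluster_vecs` are assumed precomputed from steps four and five, per-document `labels` are assumed available for training, each document's cluster vector is taken as the mean of the cluster vectors it contains (one reasonable reading of the splicing in step D), and LinearSVC stands in for the SVM classifier.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

def classify_with_clusters(corpus_tokens, labels, word_to_cluster, cluster_vecs,
                           dim=300):
    """Replace feature words with cluster tags, splice TF-IDF and cluster
    vectors, and train an SVM (steps A-E)."""
    def replace(tokens):                       # step A: cluster-tag replacement
        return [f"CLUSTER_{word_to_cluster[w]}" if w in word_to_cluster else w
                for w in tokens]

    docs = [" ".join(replace(t)) for t in corpus_tokens]
    space = TfidfVectorizer().fit_transform(docs).toarray()  # step C

    def doc_cluster_vec(tokens):               # step B, averaged per document
        vs = [cluster_vecs[word_to_cluster[w]] for w in tokens
              if w in word_to_cluster]
        return np.mean(vs, axis=0) if vs else np.zeros(dim)

    cluster_part = np.array([doc_cluster_vec(t) for t in corpus_tokens])
    final = np.hstack([space, cluster_part])   # step D: splice the two parts
    return LinearSVC().fit(final, labels)      # step E: SVM classification
```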
The invention discloses a short-text-oriented optimized classification method that replaces the features of the original data set with features obtained by two-stage clustering before classifying the short texts. One-stage preliminary clustering with spectral clustering yields several large preliminary feature clusters; within each of them, second-stage loose clustering yields several small similar feature clusters.
Compared with traditional short text classification, the invention has the following advantages:
Firstly, the loose clustering used by the invention can control the similarity level of the feature clusters and cover more short text features, thereby enhancing the semantic representation of short texts and improving the precision of subsequent classification. Specifically, the method controls cluster similarity by varying the similarity pairing threshold used during pairing and the size of the largest cluster. With a low pairing threshold, similar features are gathered into a feature sub-cluster as far as possible; with a high threshold, the probability of outliers appearing in each cluster drops, which improves the quality of each similar feature cluster and hence the precision of subsequent classification. For example, in our word vector model, a similarity pairing threshold of 0.35 yields feature clusters covering 66% of the whole feature item set; at 0.6, this ratio drops to 25%.
Secondly, the two-stage clustering method reduces both the computation needed to build the similar feature clusters and the dimensionality of the feature space, which improves the efficiency of subsequent classification. The reason is that building similar feature sub-clusters requires similarity calculations between every pair of words in a feature set; if the sub-clusters were built directly on an overly large feature set, the number of pairwise comparisons, and with it the overall computational overhead, would grow dramatically. Splitting the original feature item set into several preliminary feature clusters through one-stage clustering and building sub-clusters within each of them therefore reduces the amount of computation.
In addition, the vector space built by splicing the TF-IDF vectors with the word vectors has a much lower dimensionality than one built directly with TF-IDF. For example, after the clustering process, the pairs "apple-banana", "apple-pear", and "banana-pear" are merged into the similar feature cluster "apple, banana, pear", reducing the feature dimensionality from 3 to 1. Replacing the feature words contained in a short text with their feature clusters also improves classification precision: a short text contains few feature words, the semantic information carried by so few words is limited, and an adequate classifier model is hard to train on it. Representing the text with feature clusters raises the semantic expressiveness of each individual feature word and hence of the whole short text.
Thirdly, the splicing of the feature vectors is optimized. TF-IDF vectors are of very high dimensionality, usually thousands or even tens of thousands of dimensions, while the corresponding word vectors are relatively low-dimensional, typically 200 to 400 dimensions. If the two are concatenated directly, the word vectors' small share makes their effect hard to express, so such direct concatenation rarely gives good classification results. Using the TF-IDF vectors obtained after feature-cluster replacement reduces the feature vector dimensionality by 20%-66% and amplifies the effect of the word vectors, so this splicing scheme achieves better results.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is a diagram of an embodiment of the method of the present invention.
FIG. 3 is a visualization diagram of similar feature clusters of the method of the present invention.
Detailed Description
Comparative example 1:
Chinese patent CN107368611A, "A short text classification method", adopts a method that first separates the two sample classes with a hyperplane, then computes the geometric margin between each class of samples and the hyperplane, divides several sub-domains according to that margin, and assigns each sub-domain interval a different weight, with sub-domains farther from the hyperplane receiving smaller weights; in the undersampling stage the data are undersampled according to these weights, and the sampled samples are finally fed to an SVM classifier. In practice, using only an SVM classifier has limited effect, because SVM classification of large-scale data is slow and inefficient, so the time efficiency of that invention needs improvement.
Comparative example 2:
Chinese patent CN108080206A, "A short text classification method based on semantic enhancement", performs semantic expansion of short texts using external corpus resources. Addressing the small information content and sparse semantics of short texts, it enhances their semantic representation with high-quality expansion corpora and high-precision word vectors, while an efficient text classification algorithm captures the limited text features as fully as possible and shortens classifier training time.
Such short text semantic expansion based on external corpora requires large amounts of external data, and the resource and time overhead of building the external corpus is huge. Moreover, the method does not design a correspondence between specific feature words and the external corpus, so its classification precision may be low.
Fig. 1 is a flow chart of the method of the present invention, fig. 2 is a diagram of an embodiment of the method of the present invention, and fig. 3 is a diagram of a similar feature cluster visualization of the method of the present invention.
The following describes embodiments of the present invention in further detail with reference to the accompanying drawings.
Example 1:
the embodiment is a specific embodiment of a short text optimization classification method based on feature clustering. The invention mainly comprises six steps:
acquiring training data, and preprocessing the training data by adopting the following steps:
A. the training data come from open-source news corpora released by the Fudan and Sogou laboratories, comprising more than 200,000 entries in 6 categories: sports, Internet, economy, politics, art, and military;
B. adding a collected and sorted network real word dictionary for improving the precision of subsequent word segmentation;
C. removing stop words;
D. performing word segmentation on the training data, completing preprocessing.
Step two: for the training data obtained in step one, traversing each feature word in the segmented data set and selecting non-repeating feature words whose frequency exceeds a set threshold to construct the feature item set. The threshold is set to 3 here;
step three, training the collected large-scale corpus, and specifically comprising the following steps:
A. obtaining open-source news corpora from the Fudan and Sogou laboratories, and preprocessing the data in the same way as in step one;
B. performing word vector training on the preprocessed corpus by using a word2vec tool, wherein the dimension of the trained word vector is 300 dimensions;
C. storing the word vector model obtained in step B, and using the model to produce word vector representations of the feature item set obtained in step two.
Step four, performing primary clustering on the word vectors of the feature items obtained in the step three by adopting a spectral clustering algorithm according to the following steps to obtain a plurality of primary feature clusters:
A. Determine the number of cluster centers K with the elbow method: set the initial K to 5, then keep increasing K on a grid and inspect the clustering evaluation index. For this data set, the appropriate K was finally determined to be 11 following the elbow-method procedure.
B. Call the toolkit scikit-learn to perform spectral clustering on the feature items in the feature item set of step four;
C. Obtain K = 11 preliminary feature clusters.
Step five, performing two-stage loose clustering inside each preliminary feature cluster obtained in the step four to obtain a plurality of similar feature clusters, and specifically performing the following steps:
A. deriving the word vector of each feature item from the word vector model obtained in step three;
B. the cosine similarity $\mathrm{Vsim}(V_i, V_j)$ between the feature vectors in each preliminary feature cluster is calculated using the following formula:

$$\mathrm{Vsim}(V_i, V_j) = \frac{\sum_{k=1}^{n} V_i^{k} \, V_j^{k}}{\sqrt{\sum_{k=1}^{n} \left(V_i^{k}\right)^{2}} \; \sqrt{\sum_{k=1}^{n} \left(V_j^{k}\right)^{2}}}$$

where $V_i$ is the vector of feature word $i$, $V_j$ is the vector of feature word $j$, $k$ indexes the dimensions of the feature vector, $n$ is the total number of dimensions, $V_i^{k}$ is the $k$-th component of the vector of feature word $i$, and $V_j^{k}$ is the $k$-th component of the vector of feature word $j$;
C. matching and connecting the feature words with cosine similarity larger than a set threshold;
D. first, within each preliminary feature cluster obtained in step four, a feature-similarity adjacency list is built from the matched feature words. A depth-first traversal of the adjacency list is then performed with the loose clustering strategy, each traversal iteration forming one similar feature cluster, finally yielding a plurality of similar feature clusters.
The similar feature cluster sizes of the invention are shown in Table 1. When the similarity pairing threshold is set low, the similar feature clusters become large, but such large clusters may have negative effects in text classification. When the threshold is too high, each similar feature cluster has the strongest semantic cohesion, but the overall size of the similar feature clusters drops sharply. With the word vector model used in the invention, the optimal similarity pairing threshold is 50%.
Table 1. Similar feature cluster results [table reproduced as an image in the original; content not recoverable]
Step six: replacing the feature words obtained in step four with the similar feature clusters obtained in step five, representing the features of step five by splicing the TF-IDF vectors of the similar-feature-cluster corpus with the similar-feature-cluster word vectors, and then classifying the short texts with a classifier, specifically as follows:
A. traversing the original data set and replacing every feature word contained in a similar feature cluster with that cluster, obtaining the replacement data set T of the original corpus;
B. calculating the cluster vector of each similar feature cluster, namely the mean of the vectors of all feature items in that cluster;
C. applying the TF-IDF algorithm to T to obtain the corresponding vector space;
D. the final vector representation of the replacement data set T is the direct concatenation of the word vectors from B and the TF-IDF vectors from C. When the similarity pairing threshold is too high, only a few features are affected, so its impact on class recall and feature reduction is very limited; as the pairing threshold decreases, the similar feature clusters keep growing and the feature-reduction effect improves, reaching up to 36%.
E. classifying the replacement data set T by text topic with a text classifier to obtain the classification result. Experiments were run with several text classifiers, including support vector machine, naive Bayes, and logistic regression. The classification accuracy baseline using TF-IDF is 78.12%; over this baseline, the classification accuracy of the invention improves by 2.5%-5%, 1.6%-4.1%, and 4.1%-5.3%, respectively.
Example 2:
the embodiment is a specific embodiment of a short text optimization classification method based on feature clustering. The invention mainly comprises six steps:
acquiring training data, and preprocessing the training data by adopting the following steps:
A. the training data come from a China Mobile SMS data set of about 100,000 entries in 5 categories: normal, marketing, advertisement, credit card, and others;
B. removing stop words;
C. performing word segmentation on the training data, completing preprocessing.
Step two: for the training data obtained in step one, traversing each feature word in the segmented data set and selecting non-repeating feature words whose frequency exceeds a set threshold (set to 2 here) to construct the feature item set;
step three, training the collected large-scale corpus, and specifically comprising the following steps:
A. collecting open-source Chinese corpora from Wikipedia, and preprocessing the data in the same way as in step one;
B. performing word vector training on the preprocessed corpus by using a word2vec tool, wherein the dimension of the trained word vector is 300 dimensions;
C. storing the word vector model obtained in step B, and using the model to produce word vector representations of the feature item set obtained in step two.
Step four, performing primary clustering on the word vectors of the feature items obtained in the step three by adopting a spectral clustering algorithm according to the following steps to obtain a plurality of primary feature clusters:
A. Determine the number of cluster centers K with the elbow method: set the initial K to 3, then keep increasing K on a grid and inspect the clustering evaluation index. For this data set, the appropriate K was finally determined to be 9 following the elbow-method procedure.
B. Call a toolkit (scikit-learn is used here) to perform spectral clustering on the feature items in the feature item set of step four;
C. Obtain K = 9 preliminary feature clusters.
Step five, performing two-stage loose clustering inside each preliminary feature cluster obtained in the step four to obtain a plurality of similar feature clusters, and specifically performing the following steps:
A. deriving the word vector of each feature item from the word vector model obtained in step three;
B. calculating cosine similarity between the feature vectors in each preliminary feature cluster;
C. matching and connecting the feature words with cosine similarity larger than a set threshold;
D. first, within each preliminary feature cluster obtained in step four, a feature-similarity adjacency list is built from the matched feature words. A depth-first traversal of the adjacency list is then performed with the loose clustering strategy, each traversal iteration forming one similar feature cluster, finally yielding a plurality of similar feature clusters.
When the similarity pairing threshold is set low, the similar feature clusters become large, but such large clusters may have negative effects in text classification. Because this data set is small, the appropriate pairing threshold can fluctuate widely, so multiple groups of experiments were set up for verification. At a threshold of 65% or higher, each similar feature cluster has the strongest semantic cohesion, but the overall size of the similar feature clusters drops sharply; at 50% or lower, the improvement in classification accuracy is not significant. With the word vector model used in the invention, the optimal similarity pairing threshold is 55%.
Step six: replacing the feature words obtained in step four with the similar feature clusters obtained in step five, representing the features of step five by splicing the TF-IDF vectors of the similar-feature-cluster corpus with the similar-feature-cluster word vectors, and then classifying the short texts with a classifier, specifically as follows:
A. traversing the original data set and replacing every feature word contained in a similar feature cluster with that cluster, obtaining the replacement data set T of the original corpus;
B. calculating the cluster vector of each similar feature cluster, namely the mean of the vectors of all feature items in that cluster;
C. applying the TF-IDF algorithm to T to obtain the corresponding vector space;
D. the final vector representation of the replacement data set T is the direct concatenation of the word vectors from B and the TF-IDF vectors from C;
E. classifying the replacement data set T by text topic with a text classifier to obtain the classification result. Experiments were run with several text classifiers, including support vector machine (SVM), naive Bayes (NB), and logistic regression (LR). The classification accuracy baseline using TF-IDF is 85.12%; on the current data set, the classification accuracy of the invention improves by up to 2.9%-4%, 2.6%-3.1%, and 1.9%-3.3%, respectively. Table 2 shows the specific similar feature clustering results.
Table 2. Similar feature cluster results [table reproduced as an image in the original; content not recoverable]

Claims (1)

1. A short text-oriented optimization classification method is characterized by comprising the following steps:
acquiring an original data set and preprocessing the original data set;
selecting a feature item set from the preprocessed data set obtained in step one;
training the collected large-scale corpus by using a word vector tool to obtain a word vector model;
performing word vector representation on each feature item in the feature item set by using the word vector model to obtain feature words, and performing one-stage preliminary clustering on the feature words to obtain a plurality of preliminary feature clusters;
step five, performing two-stage loose clustering inside each preliminary feature cluster obtained in the step four, thereby obtaining a plurality of similar feature clusters;
replacing the feature words obtained in the fourth step with the similar feature clusters obtained in the fifth step, and then using a classifier to classify the short texts, wherein the feature words refer to word vectors of feature items;
step five, performing two-stage loose clustering inside each preliminary feature cluster obtained in step four to obtain a plurality of similar feature clusters, specifically:
deriving the word vector of each feature item from the word vector model obtained in step three;
calculating cosine similarity between the feature vectors in each preliminary feature cluster;
matching and connecting the feature words with cosine similarity larger than a set threshold;
constructing a feature similarity adjacency list by using the matched feature words in each preliminary feature cluster obtained in the step four, performing depth-first traversal on the adjacency list, and forming a similar feature cluster through each iteration traversal to finally obtain a plurality of similar feature clusters;
replacing the feature words obtained in the fourth step with the similar feature clusters obtained in the fifth step, and then using a classifier to classify the short texts, wherein the method specifically comprises the following steps:
A. traversing the feature words obtained in the fourth step, and if the feature words belong to a certain similar feature cluster, replacing the feature words with the similar feature cluster; if the feature words do not belong to any similar feature cluster, the feature words are reserved, and finally a replacement data set T of the original data set is obtained;
B. calculating a cluster vector of each similar feature cluster, wherein the cluster vector is a mean vector of word vectors of all feature items in each similar feature cluster;
C. calculating a space vector of the replacement data set T, wherein the space vector is obtained by performing vector space calculation on the T by using a TF-IDF algorithm;
D. the final vector representation of the replacement data set T is formed by splicing the cluster vector in B and the space vector in C;
E. performing text classification on the final vector of the replacement data set T by using a text classifier SVM to obtain a classification result;
step one, acquiring an original data set and preprocessing it, specifically comprises the following steps:
the original data set is from published open source news corpora;
adding a collected and sorted network real word dictionary for improving the precision of subsequent word segmentation;
performing word segmentation on the original data set, removing stop words, and finishing preprocessing;
step two, selecting a feature item set from the preprocessed data set obtained in step one, specifically: traversing each word in the preprocessed original data set and selecting non-repeating words whose word frequency exceeds a set threshold as feature words, thereby constructing the feature item set;
step three, training the collected large-scale corpus by using a word vector tool to obtain a word vector model, specifically, training by adopting the following steps:
searching open source news corpora and open Chinese corpora from a network path, and carrying out data preprocessing on the open source news corpora and the open Chinese corpora, wherein the preprocessing step is the same as the preprocessing step in the step one;
performing word vector training on the preprocessed corpus by using a word2vec tool;
C. storing the word vector model obtained in step B;
performing word vector representation on each feature item in the feature item set by using the word vector model, and performing one-stage preliminary clustering on the word vectors of the feature items to obtain a plurality of preliminary feature clusters, specifically performing the following steps with a spectral clustering algorithm:
determining the number of cluster centers K by the elbow method, wherein as K increases the sample division becomes finer, the cohesion of each cluster gradually improves, and the sum of squared errors gradually decreases, so that the optimal K value can be determined from the change of the sum of squared errors;
calling the machine-learning toolkit scikit-learn to perform spectral clustering on the feature items in the feature item set in step four;
obtaining K preliminary feature clusters;
calculating the cosine similarity $\mathrm{Vsim}(V_i, V_j)$ between the feature vectors in each preliminary feature cluster using the following formula:

$$\mathrm{Vsim}(V_i, V_j) = \frac{\sum_{k=1}^{n} V_i^{k} \, V_j^{k}}{\sqrt{\sum_{k=1}^{n} \left(V_i^{k}\right)^{2}} \; \sqrt{\sum_{k=1}^{n} \left(V_j^{k}\right)^{2}}}$$

where $V_i$ is the vector of feature word $i$, $V_j$ is the vector of feature word $j$, $k$ indexes the dimensions of the feature vector, $n$ is the total number of dimensions, $V_i^{k}$ is the $k$-th component of the vector of feature word $i$, and $V_j^{k}$ is the $k$-th component of the vector of feature word $j$.
CN201910182364.9A 2019-03-12 2019-03-12 Short text-oriented optimization classification method Active CN109960799B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910182364.9A CN109960799B (en) 2019-03-12 2019-03-12 Short text-oriented optimization classification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910182364.9A CN109960799B (en) 2019-03-12 2019-03-12 Short text-oriented optimization classification method

Publications (2)

Publication Number Publication Date
CN109960799A CN109960799A (en) 2019-07-02
CN109960799B true CN109960799B (en) 2021-07-27

Family

ID=67024233

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910182364.9A Active CN109960799B (en) 2019-03-12 2019-03-12 Short text-oriented optimization classification method

Country Status (1)

Country Link
CN (1) CN109960799B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110825850B (en) * 2019-11-07 2022-07-08 哈尔滨工业大学(深圳) Natural language theme classification method and device
CN111104511B (en) * 2019-11-18 2023-09-29 腾讯科技(深圳)有限公司 Method, device and storage medium for extracting hot topics
CN111488429A (en) * 2020-03-19 2020-08-04 杭州叙简科技股份有限公司 Short text clustering system based on search engine and short text clustering method thereof
CN111553173B (en) * 2020-04-23 2023-09-15 思必驰科技股份有限公司 Natural language generation training method and device
TWI807203B (en) 2020-07-28 2023-07-01 華碩電腦股份有限公司 Voice recognition method and electronic device using the same
CN112328790A (en) * 2020-11-06 2021-02-05 渤海大学 Fast text classification method of corpus
CN112860898B (en) * 2021-03-16 2022-05-27 哈尔滨工业大学(威海) Short text box clustering method, system, equipment and storage medium
CN113377607A (en) * 2021-05-13 2021-09-10 长沙理工大学 Method and device for detecting log abnormity based on Word2Vec and electronic equipment
CN113486176B (en) * 2021-07-08 2022-11-04 桂林电子科技大学 News classification method based on secondary feature amplification
CN114357121B (en) * 2022-03-10 2022-07-15 四川大学 Innovative scheme design method and system based on data driving
CN115329078B (en) * 2022-08-11 2024-03-12 北京百度网讯科技有限公司 Text data processing method, device, equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030236659A1 (en) * 2002-06-20 2003-12-25 Malu Castellanos Method for categorizing documents by multilevel feature selection and hierarchical clustering based on parts of speech tagging
CN104239436A (en) * 2014-08-27 2014-12-24 南京邮电大学 Network hot event detection method based on text classification and clustering analysis
CN105426426A (en) * 2015-11-04 2016-03-23 北京工业大学 KNN text classification method based on improved K-Medoids
CN107092679A (en) * 2017-04-21 2017-08-25 北京邮电大学 A kind of feature term vector preparation method, file classification method and device
CN108363810A (en) * 2018-03-09 2018-08-03 南京工业大学 A kind of file classification method and device
CN108664633A (en) * 2018-05-15 2018-10-16 南京大学 A method of carrying out text classification using diversified text feature
CN108763348A (en) * 2018-05-15 2018-11-06 南京邮电大学 A kind of classification improved method of extension short text word feature vector

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106599029B (en) * 2016-11-02 2021-04-06 焦点科技股份有限公司 Chinese short text clustering method
CN109033307B (en) * 2018-07-17 2021-08-31 华北水利水电大学 CRP clustering-based word multi-prototype vector representation and word sense disambiguation method

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030236659A1 (en) * 2002-06-20 2003-12-25 Malu Castellanos Method for categorizing documents by multilevel feature selection and hierarchical clustering based on parts of speech tagging
CN104239436A (en) * 2014-08-27 2014-12-24 南京邮电大学 Network hot event detection method based on text classification and clustering analysis
CN105426426A (en) * 2015-11-04 2016-03-23 北京工业大学 KNN text classification method based on improved K-Medoids
CN107092679A (en) * 2017-04-21 2017-08-25 北京邮电大学 A kind of feature term vector preparation method, file classification method and device
CN108363810A (en) * 2018-03-09 2018-08-03 南京工业大学 A kind of file classification method and device
CN108664633A (en) * 2018-05-15 2018-10-16 南京大学 A method of carrying out text classification using diversified text feature
CN108763348A (en) * 2018-05-15 2018-11-06 南京邮电大学 A kind of classification improved method of extension short text word feature vector

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhou Qingping et al., "An improved KNN text classification algorithm based on clustering," Application Research of Computers, 2016, Vol. 33, No. 11, pp. 3374-3377, 3382. *

Also Published As

Publication number Publication date
CN109960799A (en) 2019-07-02

Similar Documents

Publication Publication Date Title
CN109960799B (en) Short text-oriented optimization classification method
CN108108351B (en) Text emotion classification method based on deep learning combination model
CN107609121B (en) News text classification method based on LDA and word2vec algorithm
CN107193801B (en) Short text feature optimization and emotion analysis method based on deep belief network
CN110321925B (en) Text multi-granularity similarity comparison method based on semantic aggregated fingerprints
CN108280206B (en) Short text classification method based on semantic enhancement
CN110532554A (en) A kind of Chinese abstraction generating method, system and storage medium
CN109002473B (en) Emotion analysis method based on word vectors and parts of speech
CN107608999A (en) A kind of Question Classification method suitable for automatically request-answering system
CN106528642A (en) TF-IDF feature extraction based short text classification method
CN108733647B (en) Word vector generation method based on Gaussian distribution
CN111506728B (en) Hierarchical structure text automatic classification method based on HD-MSCNN
CN109522544A (en) Sentence vector calculation, file classification method and system based on Chi-square Test
CN107220293B (en) Emotion-based text classification method
CN110705247A (en) Based on x2-C text similarity calculation method
CN113672718A (en) Dialog intention recognition method and system based on feature matching and field self-adaption
Zhang et al. Research on keyword extraction of Word2vec model in Chinese corpus
CN114491062B (en) Short text classification method integrating knowledge graph and topic model
CN115577080A (en) Question reply matching method, system, server and storage medium
Song Sentiment analysis of Japanese text and vocabulary learning based on natural language processing and SVM
CN116756303A (en) Automatic generation method and system for multi-topic text abstract
CN111460147A (en) Title short text classification method based on semantic enhancement
CN115168580A (en) Text classification method based on keyword extraction and attention mechanism
Liu et al. Internet news headlines classification method based on the n-gram language model
CN114065749A (en) Text-oriented Guangdong language recognition model and training and recognition method of system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Li Fangfang

Inventor after: Yin Yao

Inventor after: Mao Xingliang

Inventor after: Shi Ronghua

Inventor after: Shi Jinjing

Inventor after: Hu Chao

Inventor before: Yin Yao

Inventor before: Li Fangfang

Inventor before: Mao Xingliang

Inventor before: Shi Ronghua

Inventor before: Shi Jinjing

Inventor before: Hu Chao

CB03 Change of inventor or designer information
GR01 Patent grant
GR01 Patent grant
CB03 Change of inventor or designer information

Inventor after: Li Fangfang

Inventor after: Yin Yao

Inventor after: Mao Xingliang

Inventor after: Shi Ronghua

Inventor after: Shi Jinjing

Inventor after: Hu Chao

Inventor after: Huang Wei

Inventor before: Li Fangfang

Inventor before: Yin Yao

Inventor before: Mao Xingliang

Inventor before: Shi Ronghua

Inventor before: Shi Jinjing

Inventor before: Hu Chao

CB03 Change of inventor or designer information
TR01 Transfer of patent right

Effective date of registration: 20211119

Address after: 410221 floor 5, building E6, Lugu enterprise Plaza, No. 27, Wenxuan Road, high tech Zone, Changsha City, Hunan Province

Patentee after: CHANGSHA ZHIWEI INFORMATION TECHNOLOGY Co.,Ltd.

Address before: 410000 Hunan province Changsha Lushan Road No. 932

Patentee before: CENTRAL SOUTH University

TR01 Transfer of patent right