Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide an optimized classification method for short texts. On one hand, the method expands and exploits richer semantic features to classify short texts, improving classification accuracy; on the other hand, the similar feature clusters obtained by two-stage clustering replace the original features for classification, reducing feature dimensionality and improving the efficiency of short text classification. The invention can better support related applications such as spam classification, mail classification, and microblog topic classification.
The specific technical scheme is as follows:
Step one, acquiring an original data set and preprocessing it;
A. the original data set comes from open source news corpora published on the web (e.g., by the Fudan and Sogou laboratories);
B. adding a collected and sorted network real word dictionary for improving the precision of subsequent word segmentation;
C. performing word segmentation on the original data set and removing stop words to finish preprocessing. The segmentation uses the Chinese Academy of Sciences segmentation tool ICTCLAS 2018.
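A minimal sketch of this preprocessing pipeline, assuming Python. jieba stands in for ICTCLAS 2018, whose API is not shown in the source; the dictionary and stop-word file names are hypothetical.

```python
# Hedged preprocessing sketch: jieba stands in for ICTCLAS 2018;
# "network_word_dict.txt" and "stopwords.txt" are hypothetical file names.
import jieba

jieba.load_userdict("network_word_dict.txt")   # collected network real-word dictionary

with open("stopwords.txt", encoding="utf-8") as f:
    stopwords = {line.strip() for line in f if line.strip()}

def preprocess(text):
    """Segment one raw document and drop stop words."""
    return [w for w in jieba.lcut(text) if w.strip() and w not in stopwords]

raw_documents = ["...", "..."]                 # the original news corpus
tokenized_docs = [preprocess(doc) for doc in raw_documents]
```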
Step two, selecting a feature item set from the preprocessed data set obtained in step one;
specifically, each word in the preprocessed original data set is traversed, and the distinct words whose frequency exceeds a set threshold are selected as feature words to construct the feature item set.
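A sketch of this selection step, assuming the `tokenized_docs` list from the preprocessing sketch above; the threshold value is illustrative (Example 1 uses 3).

```python
# Hedged sketch of step two: keep each distinct word whose corpus
# frequency exceeds a threshold.
from collections import Counter

def build_feature_set(tokenized_docs, min_freq=3):
    freq = Counter(w for doc in tokenized_docs for w in doc)
    return sorted(w for w, c in freq.items() if c > min_freq)

features = build_feature_set(tokenized_docs)
```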
Step three, training the collected large-scale corpus with a word vector tool to obtain a word vector model;
A. collecting open source news corpora from the web (the Fudan and Sogou laboratories) and Chinese corpora from Wikipedia, and preprocessing the data; the preprocessing is the same as in step one;
B. performing word vector training on the preprocessed corpus with the word2vec word vector tool. Word2Vec is an open-source word vector computation tool from Google. It trains efficiently on million-word dictionaries and billion-token data sets, and the resulting word vectors (word embeddings) measure the similarity between words well;
C. storing the word vector model obtained in step B.
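A sketch of this training step with gensim's word2vec implementation; hyperparameters other than the 300-dimension setting (used in the embodiments below) are illustrative assumptions.

```python
# Hedged sketch of step three using gensim's Word2Vec.
from gensim.models import Word2Vec

model = Word2Vec(
    sentences=tokenized_docs,  # preprocessed corpus: one token list per document
    vector_size=300,           # 300-dimensional word vectors, as in the embodiments
    window=5,
    min_count=3,
    workers=4,
)
model.save("word2vec.model")   # step C: persist the trained model
```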
Step four, representing each feature item in the feature item set as a word vector using the word vector model, and performing one-stage preliminary clustering on the feature item word vectors to obtain a plurality of preliminary feature clusters;
the one-stage preliminary clustering is performed with a spectral clustering algorithm as follows:
A. determining the number K of cluster centers using the elbow method. As K increases, the samples are partitioned more finely, the cohesion of each cluster gradually improves, and the sum of squared errors (SSE) gradually decreases. When K is smaller than the true number of clusters, increasing K greatly improves the cohesion of each cluster, so the SSE drops steeply; once K reaches the true number of clusters, the return on cohesion from further increases in K diminishes rapidly, so as K keeps growing the SSE curve first drops sharply and then flattens out. The plot of SSE against K therefore has the shape of an elbow, and the K value at the elbow is the true number of clusters in the data.
B. calling the machine learning toolkit scikit-learn to perform spectral clustering on the feature items in the feature item set;
C. obtaining K preliminary feature clusters.
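A sketch of step four, assuming the `features` list and `model` from the earlier sketches. KMeans inertia is used here as the sum-of-squared-errors quantity for the elbow method (an assumption; the source does not name a tool for this step), and the final K = 11 is the value from Example 1.

```python
# Hedged sketch of step four: elbow method, then spectral clustering
# with scikit-learn. The K range and final K = 11 are illustrative.
import numpy as np
from sklearn.cluster import KMeans, SpectralClustering

X = np.array([model.wv[w] for w in features])  # word vectors of the feature items

sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in range(2, 20)]                  # plot sse vs. k; pick the elbow

K = 11
labels = SpectralClustering(n_clusters=K, random_state=0).fit_predict(X)
preliminary_clusters = [[w for w, lab in zip(features, labels) if lab == c]
                        for c in range(K)]
```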
Step five, performing two-stage loose clustering inside each preliminary feature cluster obtained in step four, thereby obtaining a plurality of similar feature clusters;
the two-stage loose clustering specifically comprises the following steps:
A. deriving the word vector of each feature item from the word vector model obtained in step three;
B. the cosine similarity $Vsim(V_i, V_j)$ between the feature vectors in each preliminary feature cluster is calculated using the following formula:

$$Vsim(V_i, V_j) = \frac{\sum_{k=1}^{n} V_{ik}\, V_{jk}}{\sqrt{\sum_{k=1}^{n} V_{ik}^2}\,\sqrt{\sum_{k=1}^{n} V_{jk}^2}}$$

where $V_i$ is the vector of feature word $i$, $V_j$ is the vector of feature word $j$, $k$ indexes the vector dimensions, $n$ is the total dimension of the feature vectors, $V_{ik}$ is the $k$-th component of the vector of feature word $i$, and $V_{jk}$ is the $k$-th component of the vector of feature word $j$;
C. matching and connecting the feature words with cosine similarity larger than a set threshold;
D. within each preliminary feature cluster obtained in step four, constructing a feature-similarity adjacency list from the matched feature words and performing a depth-first traversal of the adjacency list; each traversal iteration forms one similar feature cluster, finally yielding a plurality of similar feature clusters. We call this way of clustering, in which similar features are connected directly, loose clustering: it is not based on any model-based clustering algorithm, but selects semantically similar feature words, matches them, and merges the matched pairs into similar feature clusters, as sketched below.
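A minimal sketch of the loose clustering just described, assuming word vectors are available as a word-to-array mapping `wv` (e.g., gensim's `model.wv`) and that `preliminary_clusters` comes from the step-four sketch; the threshold 0.5 is illustrative.

```python
# Hedged sketch of two-stage loose clustering inside one preliminary cluster:
# pair feature words whose cosine similarity exceeds a threshold, build an
# adjacency list, and read off connected components by depth-first traversal.
import numpy as np

def cosine(u, v):
    """Cosine similarity Vsim(Vi, Vj) from step five B."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def loose_cluster(words, wv, threshold=0.5):
    """Loosely cluster the feature words of one preliminary cluster."""
    adj = {w: [] for w in words}
    for i, wi in enumerate(words):          # step C: pairwise matching
        for wj in words[i + 1:]:
            if cosine(wv[wi], wv[wj]) > threshold:
                adj[wi].append(wj)
                adj[wj].append(wi)
    clusters, seen = [], set()
    for w in words:                         # step D: DFS over the adjacency list
        if w in seen:
            continue
        stack, component = [w], []
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.append(node)
            stack.extend(adj[node])
        if len(component) > 1:              # unmatched words stay single features
            clusters.append(component)
    return clusters

similar_clusters = [c for pc in preliminary_clusters
                    for c in loose_cluster(pc, model.wv)]
```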
Step six, replacing the feature words obtained in step four with the similar feature clusters obtained in step five, and then classifying the short texts with a classifier, specifically as follows:
A. traversing the feature words obtained in step four; if a feature word belongs to a similar feature cluster, it is replaced by that cluster, otherwise it is retained, finally yielding a replacement data set T of the original data set;
B. calculating a cluster vector of each similar feature cluster, wherein the cluster vector is a mean vector of word vectors of all feature items in each similar feature cluster;
C. calculating a space vector for the replacement data set T by applying the TF-IDF algorithm to T;
D. the final vector representation of the replacement data set T is the concatenation of the cluster vectors from B and the space vector from C;
E. performing text classification on the final vectors of the replacement data set T with an SVM text classifier to obtain the classification result.
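A sketch of the whole of step six, assuming `similar_clusters`, `tokenized_docs`, and `model` from the earlier sketches plus a hypothetical per-document label array `y`; the cluster token names are invented for illustration.

```python
# Hedged sketch of step six: replace feature words with a cluster token,
# compute TF-IDF on the replaced corpus, append mean cluster vectors,
# and classify with a linear SVM.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

word2cluster = {w: f"CLUSTER_{i}"
                for i, cluster in enumerate(similar_clusters) for w in cluster}
replaced = [[word2cluster.get(w, w) for w in doc] for doc in tokenized_docs]  # A

cluster_vec = {f"CLUSTER_{i}": np.mean([model.wv[w] for w in cluster], axis=0)
               for i, cluster in enumerate(similar_clusters)}                 # B

tfidf = TfidfVectorizer(analyzer=lambda doc: doc)  # documents are token lists
X_tfidf = tfidf.fit_transform(replaced).toarray()                             # C

def doc_embedding(doc):
    vecs = [cluster_vec[t] for t in doc if t in cluster_vec]
    return np.mean(vecs, axis=0) if vecs else np.zeros(300)

X = np.hstack([X_tfidf, np.array([doc_embedding(d) for d in replaced])])      # D
classifier = LinearSVC().fit(X, y)   # E: y is a hypothetical label array
```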
The invention discloses a short-text-oriented optimized classification method that replaces the features of the original data set with two-stage clustered features before short text classification. In the first stage, spectral clustering preliminarily groups the features into several large preliminary feature clusters; in the second stage, loose clustering inside each preliminary feature cluster yields several smaller similar feature clusters.
Compared with traditional short text classification, the invention has the following advantages:
First, the loose clustering used by the invention can control the similarity level of the feature clusters and cover more short text features, enhancing the semantic representation of short texts and improving the precision of subsequent classification. Specifically, the method controls cluster similarity by varying the similarity pairing threshold and the maximum cluster size. When the pairing threshold is low, as many similar features as possible are gathered into one feature sub-cluster; when it is high, the probability of outliers in each cluster is reduced, improving the quality of each similar feature cluster and hence the precision of subsequent classification. For example, in our word vector model, a pairing threshold of 0.35 yields feature clusters covering 66% of the whole feature item set; at a threshold of 0.6 the coverage drops to 25%. A sketch for measuring this coverage follows.
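To make the coverage claim concrete, a hedged sketch (reusing the `loose_cluster` helper from the step-five sketch) of how the covered fraction of the feature item set could be measured at a given pairing threshold. The figures above (66% at 0.35, 25% at 0.6) are the source's reported values, not outputs of this code.

```python
# Hedged sketch: fraction of the feature item set that ends up inside
# some similar feature cluster at a given pairing threshold.
def coverage(preliminary_clusters, wv, threshold):
    covered = sum(len(c)
                  for words in preliminary_clusters
                  for c in loose_cluster(words, wv, threshold))
    total = sum(len(words) for words in preliminary_clusters)
    return covered / total

# e.g. coverage(preliminary_clusters, model.wv, 0.35) vs. threshold 0.6
```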
Second, the two-stage clustering reduces both the computation needed to construct the similar feature clusters and the dimensionality of the feature space, improving the efficiency of subsequent classification. The reason is as follows: constructing similar feature sub-clusters requires a similarity calculation for every pair of words in the feature set; if sub-clusters were built directly on a large feature set, the number of pairwise comparisons, and with it the overall computational overhead, would grow dramatically (splitting n features into K roughly equal preliminary clusters cuts the roughly n² pairwise comparisons to about n²/K). Dividing the original feature item set into several preliminary feature clusters in the first stage and building sub-clusters inside them therefore reduces the amount of computation.
In addition, the vector space built by concatenating the cluster-replaced TF-IDF vector with word vectors has a much smaller dimension than one built directly from TF-IDF. For example, after clustering, the pairs "apple-banana", "apple-pear" and "banana-pear" merge into the similar feature cluster {apple, banana, pear}, whose feature dimension drops from 3 to 1. Replacing the feature words contained in a short text with their feature clusters also improves classification precision: a short text contains few feature words, and the limited semantic information those few words carry makes it hard to train a suitable classifier model. Representing text with feature clusters strengthens the semantic expression of each individual feature word and hence of the whole short text.
Third, the concatenation of the feature vectors is optimized. TF-IDF vectors have very high dimension, usually thousands or even tens of thousands, whereas the corresponding word vectors are comparatively low-dimensional, typically 200 to 400 dimensions. If the TF-IDF vector and the word vector are concatenated directly, the word vector's small share makes its effect hard to realize, so such concatenation rarely yields good classification results. Using the TF-IDF vector after feature cluster replacement reduces the feature vector dimension by 20%-66% and amplifies the effect of the word vector, so this concatenation achieves better results.
Detailed Description
Comparative example 1:
Chinese patent CN107368611A, "A short text classification method", first separates the two classes of samples with a hyperplane, then computes the geometric distance between each class of samples and the hyperplane, divides several sub-domains by that distance, and assigns each sub-domain interval a different weight, with sub-domains farther from the hyperplane receiving smaller weights; the data are under-sampled according to these weights in the under-sampling stage, and the sampled examples are finally fed to an SVM classifier. In practice, using only an SVM classifier has limited effect, because SVM classification of large-scale data is slow and inefficient, so the time efficiency of that invention needs improvement.
Comparative example 2:
Chinese patent CN108080206A, "A short text classification method based on semantic enhancement", performs semantic expansion of short texts using external corpus resources. Aimed at the small information content and sparse semantics of short texts, it uses a high-quality expansion corpus and high-precision word vectors for semantically enhanced representation, together with an efficient text classification algorithm that captures the limited text features as fully as possible and effectively shortens classifier training time.
Such semantic expansion based on an external corpus requires a large amount of external data, and the resource and time overhead of building the external corpus is huge. Moreover, the method does not design a correspondence between specific feature words and the external corpus, so its classification precision may be low.
Fig. 1 is a flow chart of the method of the present invention, fig. 2 is a diagram of an embodiment of the method of the present invention, and fig. 3 is a diagram of a similar feature cluster visualization of the method of the present invention.
The following describes embodiments of the present invention in further detail with reference to the accompanying drawings.
Example 1:
This embodiment is a specific implementation of the short text optimization classification method based on feature clustering. The method comprises six steps:
Step one, acquiring training data and preprocessing it as follows:
A. the training data come from the open source news corpora released by the Fudan and Sogou laboratories, contain more than 200,000 items, and comprise 6 categories: sports, Internet, economy, politics, art, and military;
B. adding a collected and sorted network real word dictionary for improving the precision of subsequent word segmentation;
C. removing stop words;
D. performing word segmentation on the training data to finish preprocessing.
Step two, for the training data obtained in step one, traversing each word in the segmented data set and selecting the distinct feature words whose frequency exceeds a set threshold to construct the feature item set. The threshold is set here to 3;
Step three, training the collected large-scale corpus, specifically as follows:
A. collecting open source news corpora from the Fudan and Sogou laboratories and preprocessing them; the preprocessing is the same as in step one;
B. performing word vector training on the preprocessed corpus by using a word2vec tool, wherein the dimension of the trained word vector is 300 dimensions;
C. storing the word vector model obtained in step B and using it to produce word vector representations of the feature item set obtained in step two.
Step four, performing one-stage preliminary clustering on the feature item word vectors obtained in step three with a spectral clustering algorithm, as follows, to obtain a plurality of preliminary feature clusters:
A. determining the number K of cluster centers using the elbow method: the initial value of K is set to 5, K is then increased step by step on a grid, and the clustering evaluation index is checked at each value; for this data set, the appropriate K determined by the elbow method is 11.
B. calling the scikit-learn toolkit to perform spectral clustering on the feature items in the feature item set;
C. obtaining K = 11 preliminary feature clusters.
Step five, performing two-stage loose clustering inside each preliminary feature cluster obtained in the step four to obtain a plurality of similar feature clusters, and specifically performing the following steps:
A. deriving the word vector of each feature item from the word vector model obtained in step three;
B. the cosine similarity $Vsim(V_i, V_j)$ between the feature vectors in each preliminary feature cluster is calculated using the following formula:

$$Vsim(V_i, V_j) = \frac{\sum_{k=1}^{n} V_{ik}\, V_{jk}}{\sqrt{\sum_{k=1}^{n} V_{ik}^2}\,\sqrt{\sum_{k=1}^{n} V_{jk}^2}}$$

where $V_i$ is the vector of feature word $i$, $V_j$ is the vector of feature word $j$, $k$ indexes the vector dimensions, $n$ is the total dimension of the feature vectors, $V_{ik}$ is the $k$-th component of the vector of feature word $i$, and $V_{jk}$ is the $k$-th component of the vector of feature word $j$;
C. matching and connecting the feature words with cosine similarity larger than a set threshold;
D. first, within each preliminary feature cluster obtained in step four, a feature-similarity adjacency list is constructed from the matched feature words; the adjacency list is then traversed depth-first using the loose clustering strategy, each traversal iteration forming one similar feature cluster, finally yielding a plurality of similar feature clusters.
The sizes of the similar feature clusters of the invention are shown in Table 1. When the similarity pairing threshold is set small, the similar feature clusters become large, but such large clusters may have negative effects on text classification. When the threshold is too high, each similar feature cluster has the strongest internal semantic relation, but the overall size of the similar feature clusters drops sharply. For the word vector model used in the invention, the optimal similarity pairing threshold is 50%.
Table 1: Similar feature cluster results
Step six, replacing the feature words obtained in step four with the similar feature clusters obtained in step five, representing the features of step five by concatenating the TF-IDF vectors of the cluster-replaced corpus with the similar feature cluster word vectors, and then classifying the short texts with a classifier, specifically as follows:
A. traversing the original data set, and replacing all feature words contained in the similar feature clusters with the similar feature clusters to obtain a replacement data set T of the original corpus;
B. calculating a cluster vector of each similar feature cluster, wherein the cluster vector is a vector mean value of all feature items in each corresponding similar feature cluster;
C. performing vector space calculation on T using the TF-IDF algorithm to obtain the corresponding vector space.
D. the final vector representation of the replacement data set T is the direct concatenation of the cluster vectors from B and the TF-IDF vectors from C. When the similarity pairing threshold is very high, only a few features are affected, so its impact on class recall and on feature reduction is very limited; as the threshold decreases, the similar feature clusters keep growing and the feature reduction improves, reaching at best 36%.
E. classifying the text topics of the replacement data set T with a text classifier to obtain the classification result. Experiments were run with several text classifiers: support vector machine, naive Bayes, and logistic regression. The TF-IDF classification accuracy baseline is 78.12%; on that baseline, the classification accuracy of the invention improves by 2.5%-5%, 1.6%-4.1%, and 4.1%-5.3%, respectively.
Example 2:
This embodiment is a specific implementation of the short text optimization classification method based on feature clustering. The method comprises six steps:
Step one, acquiring training data and preprocessing it as follows:
A. the training data come from a China Mobile SMS data set of about 100,000 items in 5 categories: normal, marketing, advertisement, credit card, and others;
B. removing stop words;
C. performing word segmentation on the training data to finish preprocessing.
Step two, for the training data obtained in step one, traversing each word in the segmented data set and selecting the distinct feature words whose frequency exceeds a set threshold (here 2) to construct the feature item set;
Step three, training the collected large-scale corpus, specifically as follows:
A. collecting open source Chinese corpora from Wikipedia and preprocessing them; the preprocessing is the same as in step one;
B. performing word vector training on the preprocessed corpus by using a word2vec tool, wherein the dimension of the trained word vector is 300 dimensions;
C. storing the word vector model obtained in step B and using it to produce word vector representations of the feature item set obtained in step two.
Step four, performing one-stage preliminary clustering on the feature item word vectors obtained in step three with a spectral clustering algorithm, as follows, to obtain a plurality of preliminary feature clusters:
A. determining the number K of cluster centers using the elbow method: the initial value of K is set to 3, K is then increased step by step on a grid, and the clustering evaluation index is checked at each value; for this data set, the appropriate K determined by the elbow method is 9.
B. calling a toolkit (here scikit-learn) to perform spectral clustering on the feature items in the feature item set;
C. obtaining K = 9 preliminary feature clusters.
Step five, performing two-stage loose clustering inside each preliminary feature cluster obtained in the step four to obtain a plurality of similar feature clusters, and specifically performing the following steps:
A. deriving the word vector of each feature item from the word vector model obtained in step three;
B. calculating cosine similarity between the feature vectors in each preliminary feature cluster;
C. matching and connecting the feature words with cosine similarity larger than a set threshold;
D. first, within each preliminary feature cluster obtained in step four, a feature-similarity adjacency list is constructed from the matched feature words; the adjacency list is then traversed depth-first using the loose clustering strategy, each traversal iteration forming one similar feature cluster, finally yielding a plurality of similar feature clusters.
When the similarity pairing threshold is set small, the similar feature clusters become large, but such large clusters may have negative effects on text classification. Because this data set is small, the appropriate pairing threshold may fluctuate widely, so several groups of experiments need to be set up for verification. At a threshold of 65% or higher, each similar feature cluster has the strongest internal semantic relation, but the overall size of the similar feature clusters drops sharply; at 50% or lower, the improvement in classification accuracy is not significant. For the word vector model used in the invention, the optimal similarity pairing threshold is 55%.
Step six, replacing the feature words obtained in step four with the similar feature clusters obtained in step five, representing the features of step five by concatenating the TF-IDF vectors of the cluster-replaced corpus with the similar feature cluster word vectors, and then classifying the short texts with a classifier, specifically as follows:
A. traversing the original data set, and replacing all feature words contained in the similar feature clusters with the similar feature clusters to obtain a replacement data set T of the original corpus;
B. calculating a cluster vector of each similar feature cluster, wherein the cluster vector is a vector mean value of all feature items in each corresponding similar feature cluster;
C. performing vector space calculation on T using the TF-IDF algorithm to obtain the corresponding vector space.
D. the final vector representation of the replacement data set T is the direct concatenation of the cluster vectors from B and the TF-IDF vectors from C.
E. classifying the text topics of the replacement data set T with a text classifier to obtain the classification result. Experiments were run with several text classifiers: support vector machine (SVM), naive Bayes (NB), and logistic regression (LR). The classification accuracy baseline using TF-IDF is 85.12%; under the current data set, the classification accuracy of the invention improves by at most 2.9%-4%, 2.6%-3.1%, and 1.9%-3.3%, respectively. Table 2 shows the specific similar feature clustering results.
Table 2: Similar feature cluster results