Microblog topic clustering method based on word vector and single-pass fusion
Technical Field
The invention belongs to the field of natural language processing, and particularly relates to a microblog topic clustering method based on word vector and single-pass fusion.
Background
With the rapid development of network technology and the wide adoption of the mobile internet, traditional news media such as newspapers, television and magazines can no longer fully satisfy audiences' demand for information, and internet-based electronic media attract ever more netizens. Microblogs are an emerging platform among these new electronic media, favored by more and more users for their unique flexibility and convenience. As the number of users grows, the volume of microblog data increases day by day; at the same time, because microblogs and other emerging electronic media currently lack supervision, sensitive information such as false messages, violence, subversive content and terrorism spreads unchecked through the network, with serious negative effects on healthy social development and the long-term security of the country. Therefore, carrying out the necessary topic monitoring and tracking research on microblog data makes it possible to effectively grasp the dynamics of network public opinion and to improve the early-warning capability of supervisory departments or government security departments for real-world emergencies.
Because microblog texts have nonstandard wording and arbitrary length and word count, traditional clustering algorithms and topic detection and tracking techniques perform poorly in microblog public-opinion analysis. Building on traditional text-based public-opinion analysis, this work makes improvements by analyzing the structural characteristics of microblog text and combining a word-vector space model with deep learning methods.
A text clustering algorithm is an unsupervised algorithm that groups texts according to certain rules and strategies; the number of clusters may be fixed in advance from prior information, or left undetermined and produced by the clustering algorithm itself. The text clustering algorithms commonly used in the natural language field mainly comprise partition clustering, hierarchical clustering, grid clustering and Single-pass incremental clustering of texts.
Partition clustering is a simple and efficient clustering approach that divides the N texts of a corpus into K classes according to text similarity, where K is the number of result clusters; it requires that the number of texts be greater than the number of clusters and that data in different clusters be mutually exclusive. The common K-means and K-medoids clustering algorithms are based on the idea of partition clustering.
The K-means clustering algorithm is a typical representative of partition clustering. Its training process follows the idea of the EM algorithm in machine learning: given N input texts and the number K of clusters, the E-step assigns each text to an implicit class C, and the M-step updates the training parameters until the loss is minimal, that is, until the cluster center points no longer change. The K-means algorithm uses the squared error as its loss function:

E = Σ_{i=1}^{k} Σ_{x∈C_i} ||x − μ_i||²

where C_i denotes the i-th cluster, k denotes the number of clusters, and μ_i is the mean vector of C_i, which may be called the centroid. The above technique has the following disadvantages:
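To make the E-step/M-step iteration concrete, the following is a minimal sketch of K-means on toy 2-D points; the function names and the sample data are illustrative, not part of the invention.

```python
import math
import random

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(pts):
    """Component-wise mean of a non-empty list of points."""
    n = len(pts)
    return tuple(sum(xs) / n for xs in zip(*pts))

def sse(centers, clusters):
    """Squared-error loss E = sum_i sum_{x in C_i} ||x - mu_i||^2."""
    return sum(dist2(p, c) for c, cl in zip(centers, clusters) for p in cl)

def kmeans(points, k, iters=100, seed=0):
    """Toy K-means illustrating the E-step (assign) / M-step (update) loop."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # random initial centroids (drawback 2)
    for _ in range(iters):
        # E-step: assign each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: dist2(p, centers[i]))
            clusters[j].append(p)
        # M-step: recompute each centroid as its cluster mean
        new_centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:           # converged: centroids stopped moving
            break
        centers = new_centers
    return centers, clusters
```

On two well-separated groups of points the loop converges to one cluster per group, and the squared-error loss is then small; note that k must still be supplied by the caller, which is exactly disadvantage 1 below.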
1. The k value must be given in advance
When running the k-means algorithm, the number of clusters must be specified. But often we do not know how many classes the data should be clustered into and would rather have the algorithm produce a reasonable number of clusters itself; in practice the k value is difficult to estimate and supply up front.
2. Random k center points influence the results
In the k-means algorithm the first k center points are chosen at random and recomputed in subsequent iterations until convergence. As the algorithm's steps make clear, the final result therefore depends largely on the positions of those first k center points. This makes the result highly random: each run can produce a different outcome because the initially chosen center points differ.
3. Performance of computation
The algorithm must repeatedly reassign objects and recompute the new cluster center points after each adjustment, so its time overhead becomes very large when the data volume is large.
Single-pass is an incremental unsupervised clustering algorithm, originally an algorithm model for clustering topic clusters in large numbers of news reports. Its basic idea is to perform cluster analysis on the feature-vector representations of the texts one by one, i.e., clustering in a single pass. The algorithm takes the first piece of data as a new topic cluster; each subsequently input text vector is compared for similarity against the existing topic clusters. If the resulting similarity exceeds a preset threshold, the text is assigned to that topic cluster; otherwise the text is established as a new topic cluster, until all texts have been read. The drawback of this technique is that the traditional Single-pass incremental clustering algorithm usually computes feature weights (TF-IDF) from Term Frequency (TF) statistics combined with Inverse Document Frequency (IDF) to form space-vector representations of the words, which have high dimensionality and high computational cost; it cannot resolve the semantic ambiguity of natural language and ignores the semantic information among word sequences of the text; at the same time it ignores the influence of context, which harms the recall and precision of information retrieval results.
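The one-pass procedure just described can be sketched as follows; here documents are represented as sparse term-weight dictionaries and a cluster is summarized by the running mean of its members. The representation and function names are illustrative assumptions, not the invention's exact implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def single_pass(vectors, threshold):
    """One-pass clustering: each vector joins the most similar existing
    cluster if the similarity reaches the threshold, else it founds a
    new cluster. Returns the member indices of each cluster."""
    clusters = []  # list of (centroid dict, member index list)
    for idx, vec in enumerate(vectors):
        best, best_sim = None, threshold
        for c in clusters:
            sim = cosine(vec, c[0])
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append(({t: w for t, w in vec.items()}, [idx]))
        else:
            centroid, members = best
            n = len(members)
            # incremental centroid update: new mean over n + 1 members
            for t in set(centroid) | set(vec):
                centroid[t] = (centroid.get(t, 0.0) * n + vec.get(t, 0.0)) / (n + 1)
            members.append(idx)
    return [members for _, members in clusters]
```

With a threshold of 0.4, two documents sharing a term merge into one topic cluster while an unrelated document starts a second cluster; the result order follows arrival order, which is why Single-pass is sensitive to the input sequence.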
Because the traditional single-pass clustering algorithm computes semantic similarity from the space vectors of feature words, it easily suffers from problems such as excessive data dimensionality and loss of contextual semantics; a clustering algorithm is therefore proposed that combines the single-pass algorithm with word2vec word embeddings trained on a Wikipedia corpus.
Disclosure of Invention
1. Objects of the invention
The invention provides a microblog topic clustering method based on word vector and single-pass fusion to solve the problems.
2. The technical scheme adopted by the invention
The invention provides a microblog topic clustering method based on word vector and single-pass fusion, which comprises the following steps:
preprocessing the acquired microblog data and constructing a word list library;
carrying out Word2vec Word vector mapping on the feature words;
clustering the microblog texts by using single-pass fused with Word2vec Word vectors;
and clustering the clusters by using an LDA topic model to find topics.
Furthermore, the mapping step is a Skip-gram step, which is the inverse process of the CBOW algorithm model: its input is the word vector of the current feature word, and its output is the context-related word vectors corresponding to that feature word.
Furthermore, in the Single-pass incremental clustering step, the text data represented by the feature vectors are subjected to clustering analysis item by item, namely one-pass clustering.
Further, clustering the microblog texts by using single-pass fused with Word2vec Word vectors:
taking the first piece of data as a new topic cluster, and respectively calculating the similarity between the subsequently input text vector and the existing topic cluster;
if the obtained similarity exceeds a preset threshold value, the text belongs to the topic cluster;
and if not, establishing the text as a new topic cluster until all the texts are read.
Further, the method includes, when more than one topic cluster's similarity exceeds the preset threshold, assigning the text to the cluster with the highest similarity.
Further, in the LDA multinomial-distribution step, each word is considered to select a specific topic in the article with a certain probability, and the topic likewise selects a word with a certain probability, so that a multinomial distribution holds between documents and topics, and another multinomial distribution holds between topics and words.
Still further, a preprocessing operation is included: noise data filtering step, Chinese word segmentation step, stop word filtering step and text feature selection step.
Further, a step of filtering noise data, wherein the data noise mainly comprises advertisements, emoticons, special characters, pictures and hyperlinks which appear in the text.
Furthermore, in the Chinese word segmentation step, the Jieba segmenter is selected and used.
And further, a stop word filtering step, namely performing stop word removing processing on the segmentation result by constructing a stop word table.
3. Advantageous effects adopted by the present invention
(1) The topic detection is carried out on the microblog short texts as the targets;
(2) in the invention, the single-pass incremental clustering algorithm and the Word2vec mapped feature Word vector are fused, so that the integrity of the context semantic information of the short text is increased while the feature dimension is reduced, and the effect of coarse topic clustering is improved;
(3) the text topic detection of the invention first clusters topic clusters and then discovers topics; fusing the Word2vec-mapped feature word vectors with single-pass in the topic clustering stands out in reduced feature dimensionality and algorithm timeliness.
Drawings
FIG. 1 is a CBOW model;
FIG. 2 is a Skip-gram model;
FIG. 3 is a flow diagram of topic cluster discovery;
FIG. 4 is an LDA topic model.
Detailed Description
The technical solutions in the examples of the present invention are clearly and completely described below with reference to the drawings in the examples of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without inventive step, are within the scope of the present invention.
The present invention will be described in further detail with reference to the accompanying drawings.
In order to improve the clustering of microblog text topics, word vectors obtained from a Word2vec model are fused into the single-pass incremental clustering algorithm so as to mine deep semantic features of the text. Clustering microblog topics first requires preprocessing the microblog texts, including denoising, Chinese word segmentation, stop-word removal and the like; a sentence-based space vector model is then constructed from the segmentation results. The single-pass incremental clustering algorithm fused with Word2vec word vectors can effectively mine deep semantic information of the text, avoids the impact of an excessively high VSM dimensionality on computer processing speed, and effectively addresses the traditional TF-IDF statistical method's neglect of the distribution of feature words among classes and within the documents inside a class.
Example 1
The LDA topic model is relatively sensitive in topic discovery and performs well on long texts such as news reports, but microblog short texts have few words, relatively much irrelevant information including noise, and few feature words, so its performance there is poor. The single-pass algorithm is therefore improved: the improved single-pass algorithm clusters the microblog texts into topic clusters, and the LDA topic model then performs topic discovery on the texts within each cluster. The implementation process is as follows: preprocess the acquired microblog data and construct a word-list library; perform Word2vec word-vector mapping on the feature words; cluster the microblog texts with single-pass fused with Word2vec word vectors; and cluster the clusters with the LDA topic model to find topics.
1. Filtering noisy data
The data noise mainly comprises advertisements, emoticons, special characters, pictures, hyperlinks and the like appearing in the text. Such information occurs frequently but carries little useful content; it seriously interferes with the learning of the algorithm model and greatly affects the accuracy of topic detection and tracking.
2. Chinese word segmentation
Chinese data differ from English data, in which words are separated by spaces, so Chinese cannot be segmented the same way as English. Chinese word segmentation belongs to the field of natural language processing, and existing segmentation methods fall mainly into three categories: 1. methods based on string matching; 2. understanding-based segmentation methods; 3. statistics-based segmentation methods. Chinese word segmentation technology is now mature; commonly used segmentation tools mainly include the Stanford Chinese segmenter, ICTCLAS, Paoding and Jieba, and Jieba is used by many researchers because it is simple to install, convenient to operate and gives satisfactory segmentation results. Here the Jieba segmenter in accurate mode is selected to segment the microblog text.
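As an illustration of the first category above (string matching), the following is a minimal forward-maximum-matching sketch: at each position the segmenter greedily takes the longest dictionary word that matches. The tiny dictionary is hypothetical; real segmenters such as Jieba combine a large dictionary with statistical methods.

```python
def forward_max_match(text, word_dict, max_len=4):
    """Forward maximum matching: the simplest string-matching segmentation.
    At each position, try the longest candidate first; unmatched single
    characters are emitted as one-character tokens."""
    tokens, i = [], 0
    while i < len(text):
        for l in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + l]
            if l == 1 or piece in word_dict:
                tokens.append(piece)
                i += l
                break
    return tokens
```

For example, with a dictionary containing both "北京大学" and its parts, the greedy longest match keeps "北京大学" whole rather than splitting it into "北京" + "大学"; this preference for the longest match is the method's defining design choice, and also the source of its ambiguity errors.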
3. Filter stop words
Stop Words are to some extent equivalent to Filter Words and mainly show the following two features: 1. words in very wide or very frequent use, such as words of the type "I", "you", "he", "is" in the text; 2. words that occur frequently but carry no specific meaning, mainly adverbs, prepositions, modal particles, interjections and the like. Filtering stop words from the text reduces the interference of noise with the effective information and improves algorithm performance to a certain extent. The method removes stop words from the segmentation results using a constructed stop-word table of 1210 words.
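The filtering step itself is a simple set-membership test over the segmentation output, as in the sketch below. The ten-entry stop list here is only an illustrative sample standing in for the 1210-word table described above.

```python
# Tiny illustrative stop-word sample; the method described above uses a
# constructed 1210-word Chinese stop-word table.
STOP_WORDS = {"的", "了", "是", "我", "你", "他", "啊", "吗", "在", "和"}

def filter_stop_words(tokens):
    """Drop stop words and bare-whitespace tokens from a segmented text."""
    return [t for t in tokens if t not in STOP_WORDS and t.strip()]
```

Applied to a segmented microblog sentence, pronouns and particles are removed while content words survive, which is what reduces noise before feature selection.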
4. Text feature selection
The Word2Vector algorithm is an algorithm model that uses ideas from deep learning to represent feature items as efficient real-valued vectors: by training on text, each feature item is represented as a K-dimensional space vector, and vector similarity in this space expresses semantic similarity in the text. The Word2Vector model contains two different methods, CBOW (Continuous Bag of Words) and Skip-gram. By training on corpus data with the CBOW or Skip-gram method, an optimized feature-vector representation of each word can be obtained.
The input of the CBOW algorithm model is word vectors corresponding to the relevant words of the context of a certain feature item, the word vectors corresponding to the feature words are obtained through the training of the model, and the algorithm model is shown in fig. 1:
the Skip-gram algorithm model is the inverse process of the CBOW algorithm model, and the input of the Skip-gram algorithm model is a word vector of a current characteristic word, and the output of the Skip-gram algorithm model is a context-related word vector corresponding to the characteristic word. A Skip-gram algorithm model is shown in FIG. 2:
5. single-pass incremental clustering
The basic idea of the clustering algorithm is as follows: and carrying out clustering analysis on the text data represented by the feature vectors one by one, namely clustering in one pass. The algorithm takes the first piece of data as a new topic cluster, and similarity calculation is respectively carried out on the subsequently input text vector and the existing topic cluster. If the obtained similarity exceeds a preset threshold value, the text belongs to the topic cluster, and if not, the text is established as a new topic cluster until all texts are read. The flow chart of the algorithm is shown in fig. 3:
6. LDA topic model
LDA (Latent Dirichlet Allocation), proposed in 2003 by David M. Blei, Andrew Y. Ng and Michael I. Jordan, is a document generation model, also called a three-layer Bayesian probability model over documents, topics and words. Its basic idea is that each word in an article selects a specific topic with a certain probability, and that topic in turn selects a word with a certain probability, so the document-topic relationship follows a multinomial distribution and the topic-word relationship likewise follows a multinomial distribution, as shown in fig. 4.
In fig. 4, the hyper-parameter α represents the prior probability distribution of document d, the hyper-parameter β is the control parameter of the word prior for the current topic, θ represents the probability distribution of the text over topics, and ψ represents the probability distribution of words under the current topic.
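The generative story behind fig. 4 can be simulated directly: for every word position, first draw a topic z from the document's topic distribution θ, then draw a word from that topic's word distribution. The toy vocabulary and distributions below are purely illustrative; real LDA inference estimates θ and the topic-word distributions from data rather than being given them.

```python
import random

def generate_document(theta, phi, vocab, n_words, seed=0):
    """Sample a document following LDA's generative story:
    for each position, z ~ Multinomial(theta), then w ~ Multinomial(phi[z]).
    theta: topic probabilities for this document;
    phi:   one word distribution over vocab per topic."""
    rng = random.Random(seed)
    topics = list(range(len(theta)))
    words = []
    for _ in range(n_words):
        z = rng.choices(topics, weights=theta)[0]   # pick a topic
        w = rng.choices(vocab, weights=phi[z])[0]   # pick a word from that topic
        words.append(w)
    return words
```

A document whose θ puts all mass on a "sports" topic will contain only sports vocabulary, which is exactly the multinomial document-topic-word structure the model exploits in reverse when discovering topics.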
The experimental environment is shown in the following table:
System environment
7. Test evaluation index
The confusion matrix is an index for judging model results and is often used to assess classifier performance; it is applicable to data of different classes. The accuracy, recall rate, miss rate and false detection rate derived from the confusion matrix are used to evaluate the experimental performance. Taking binary classification as an example, the evaluation indexes of the confusion matrix are described in the following table:
in the table, a (TP) represents the sample which is actually a positive example and is correctly classified; b (FP) represents samples that are actually positive examples and are misclassified; c (tn) represents samples that are actually negative and divided into positive examples; d (FN) represents samples that are actually negative and are divided into negative examples.
The evaluation index formulas are as follows:

accuracy P = a / (a + b)
recall rate R = a / (a + d)
miss rate Pmiss = d / (a + d)
false detection rate Pfalse = b / (b + c)
Analysis of experimental results
In order to evaluate the performance of the LDA-based topic detection algorithm, the accuracy P, recall rate R, false detection rate Pfalse, miss rate Pmiss and the like are used to test it.
8. Performance evaluation analysis of topic cluster clustering algorithm
For experimental comparison, the K-means clustering algorithm based on TF-IDF feature values, the single-pass algorithm based on TF-IDF feature values, and the single-pass algorithm combined with Wikipedia word2vec are used. The experimental data comprise 14 topics with 800 items each; 100 items are randomly drawn from the topics as test data. The experimental performance is compared as follows:
as can be seen from the above table, the performance of the word2vec + single-pass algorithm based on the combination of Wikipedia presented herein is significantly better than that of the traditional single-pass algorithm and the k-means algorithm. The main reason for improving the performance of the algorithm is that potential semantic information among feature words is combined in word vectors, so that the Euclidean distance of similar texts is closer.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.