CN111368072A - Microblog hot topic discovery algorithm based on linear fusion of BTM and GloVe similarity - Google Patents

Microblog hot topic discovery algorithm based on linear fusion of BTM and GloVe similarity

Info

Publication number
CN111368072A
CN111368072A CN201910770568.4A
Authority
CN
China
Prior art keywords
word
text
btm
modeling
topic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201910770568.4A
Other languages
Chinese (zh)
Inventor
吴迪
张梦甜
生龙
黄竹韵
杨瑞欣
孙雷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hebei University of Engineering
Original Assignee
Hebei University of Engineering
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hebei University of Engineering filed Critical Hebei University of Engineering
Priority to CN201910770568.4A priority Critical patent/CN111368072A/en
Publication of CN111368072A publication Critical patent/CN111368072A/en
Withdrawn legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34Browsing; Visualisation therefor
    • G06F16/345Summarisation for human users
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a microblog hot topic discovery algorithm based on linear fusion of BTM and GloVe similarity. The algorithm comprises three stages: data acquisition and preprocessing, modeling, and clustering. Data are first acquired and preprocessed, the obtained data are then modeled, and the modeled data are finally clustered. The invention addresses the problem that the distance function of the K-means algorithm affects the microblog hot-topic clustering result. The GloVe model trains only the non-zero elements of the word-word co-occurrence matrix rather than the whole sparse matrix to exploit statistical information, which effectively alleviates the sparsity problem of the TF-IDF algorithm in constructing the document-word vector matrix. The GloVe model combines a global matrix factorization method with a local context-window method; the trained word vectors carry more semantic information, which to some extent relieves the polysemy problem that the BTM topic model cannot handle well.

Description

Microblog hot topic discovery algorithm based on linear fusion of BTM and GloVe similarity
Technical Field
The invention relates to the technical field of topic discovery and tracking in natural language processing, and in particular to a microblog hot topic discovery algorithm based on linear fusion of BTM and GloVe similarity.
Background
With the rapid development of the traditional internet and the mobile internet, microblogging has grown vigorously. Microblogs allow users to publish messages through web pages, external programs, mobile-phone clients and other channels, thereby sharing information. The brevity, timeliness and interactivity of microblogs have been accepted by the public, and microblogs have gradually become an important tool for people to obtain and publish information. How to mine hot topics from massive, unordered microblog data has therefore become a problem to be solved urgently.
To solve this problem, several methods already exist: representing the microblog text set with TF-IDF vectors and then discovering hot topics with a clustering algorithm; vectorizing the text set with topic models such as LDA and BTM and then discovering hot topics with a clustering algorithm; and modeling microblog short texts with a BTM topic model together with an improved TF-IDF algorithm, computing text similarity based on BTM modeling with JS divergence, computing text similarity based on the improved TF-IDF algorithm with cosine distance, and finally linearly fusing the two similarities and clustering with K-means to obtain hot topics. These methods are briefly described below:
(1) BTM topic model
LDA is a document topic generation model mainly used to identify latent topics in large document collections without supervision. However, traditional topic models (such as LDA and PLSA) implicitly capture document-level word co-occurrence patterns to reveal topics; modeling a large number of short texts therefore causes a severe data-sparsity problem, so these models suit long texts well but adapt poorly to short texts. To address this, Yan Xiaohui et al. proposed the Biterm Topic Model (BTM) for short texts, which explicitly models word co-occurrence patterns (word pairs, or biterms) to improve the ability to learn topics and, at the same time, learns topics over the aggregated word pairs of the whole corpus to overcome the data-sparsity problem of document-level word co-occurrence patterns.
In the BTM graphical model, α and β are hyper-parameters, |B| denotes the number of word pairs in the whole corpus, K denotes the number of topics, θ denotes the topic distribution of the whole corpus, φ denotes the topic-feature-word distributions of the whole corpus, Z denotes the set of topic assignments, and w_i, w_j denote the two different words that make up a word pair.
The process by which the BTM topic model generates a corpus is as follows:
For each topic z, sample the topic-feature-word distribution: φ_z ~ Dirichlet(β);
For the whole corpus, sample a global topic distribution: θ ~ Dirichlet(α);
For each word pair b = (w_i, w_j) in the word-pair set B:
sample one topic z from the corpus topic distribution θ: z ~ Mult(θ);
draw two words w_i, w_j from the sampled topic z to form the word pair: b = (w_i, w_j) ~ Mult(φ_z).
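A minimal sketch of this generative process, using numpy; the values of K, G (vocabulary size), the number of word pairs, α and β below are example assumptions, not values prescribed by the patent.

import numpy as np

rng = np.random.default_rng(0)
K, G, n_biterms, alpha, beta = 3, 50, 200, 1.0, 0.01

theta = rng.dirichlet([alpha] * K)                 # global topic distribution θ ~ Dirichlet(α)
phi = rng.dirichlet([beta] * G, size=K)            # topic-word distributions φ_z ~ Dirichlet(β), one row per topic

biterms = []
for _ in range(n_biterms):
    z = rng.choice(K, p=theta)                     # sample a topic z ~ Mult(θ)
    w_i, w_j = rng.choice(G, size=2, p=phi[z])     # draw two words from φ_z to form the word pair
    biterms.append((z, w_i, w_j))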
From the corpus generation process, the joint probability of a word pair b = (w_i, w_j) is:

P(b) = Σ_z P(z) P(w_i|z) P(w_j|z) = Σ_z θ_z · φ_{i|z} · φ_{j|z}    (1)

where P(z) = θ_z is the probability of topic z, P(w_i|z) = φ_{i|z} is the probability of feature word w_i under topic z, and P(w_j|z) = φ_{j|z} is the probability of feature word w_j under topic z. The probability of the whole corpus B is:

P(B) = Π_{(i,j)} Σ_z θ_z · φ_{i|z} · φ_{j|z}    (2)
Because BTM does not directly yield the topic proportions of a document, in order to infer the topics contained in a document it is assumed that the topic proportion of a document equals the topic proportion of the word pairs generated by that document:

P(z|d) = Σ_b P(z|b) P(b|d)    (3)

where P(z|d) is the document-topic distribution, P(b|d) is the document-word-pair distribution, and P(z|b) is the word-pair-topic distribution. P(z|b) can be calculated with the Bayesian formula from the parameter values estimated by BTM:

P(z|b) = θ_z φ_{i|z} φ_{j|z} / Σ_{z'} θ_{z'} φ_{i|z'} φ_{j|z'}    (4)

The value of P(b|d) is estimated with the empirical distribution of word pairs in the document:

P(b|d) = n_d(b) / Σ_b n_d(b)    (5)

where n_d(b) denotes the frequency with which word pair b appears in document d.
BTM also uses Gibbs sampling when solving for θ and φ. Applying the chain rule to the joint probability of the whole data set gives the conditional probability:

P(z | z_{-b}, B) ∝ (n_z + α) · (n_{w_i|z} + β)(n_{w_j|z} + β) / (Σ_w n_{w|z} + Gβ)²    (6)

where n_z is the number of word pairs assigned to topic z, z_{-b} is the topic assignment of all word pairs except word pair b, n_{w|z} is the number of times feature word w is assigned to topic z, and G is the vocabulary size. Finally, the topic distribution θ_z and the topic-word distribution φ_{w|z} can be estimated as:

θ_z = (n_z + α) / (|B| + Kα)    (7)

φ_{w|z} = (n_{w|z} + β) / (Σ_w n_{w|z} + Gβ)    (8)
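A compact illustrative sketch of the collapsed Gibbs sampler implied by formulas (6)-(8); `biterms` is assumed to be a list of (w_i, w_j) word-index pairs, K the number of topics and G the vocabulary size. This is an assumed implementation for illustration, not the patent's code.

import numpy as np

def btm_gibbs(biterms, K, G, alpha, beta, n_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    z = rng.integers(K, size=len(biterms))            # random initial topic for every word pair
    n_z = np.bincount(z, minlength=K).astype(float)   # word pairs assigned to each topic
    n_wz = np.zeros((K, G))                           # word counts per topic
    for (wi, wj), zb in zip(biterms, z):
        n_wz[zb, wi] += 1
        n_wz[zb, wj] += 1
    for _ in range(n_iter):
        for idx, (wi, wj) in enumerate(biterms):
            zb = z[idx]                               # remove the current assignment
            n_z[zb] -= 1
            n_wz[zb, wi] -= 1
            n_wz[zb, wj] -= 1
            denom = (n_wz.sum(axis=1) + G * beta) ** 2
            p = (n_z + alpha) * (n_wz[:, wi] + beta) * (n_wz[:, wj] + beta) / denom
            zb = rng.choice(K, p=p / p.sum())         # resample the topic from formula (6)
            z[idx] = zb
            n_z[zb] += 1
            n_wz[zb, wi] += 1
            n_wz[zb, wj] += 1
    theta = (n_z + alpha) / (len(biterms) + K * alpha)                    # formula (7)
    phi = (n_wz + beta) / (n_wz.sum(axis=1, keepdims=True) + G * beta)    # formula (8)
    return theta, phi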
(2) JS divergence
KL divergence is a commonly used text-similarity measure, calculated as:

D_KL(p ‖ q) = Σ_y p_y log(p_y / q_y)    (9)

where p and q are two probability distributions and p_y, q_y are their probabilities at value y.
Because KL divergence is asymmetric, it introduces some inaccuracy when measuring text similarity. JS divergence improves on KL divergence by removing the asymmetry, and is calculated as:

D_JS(p ‖ q) = ½ D_KL(p ‖ (p+q)/2) + ½ D_KL(q ‖ (p+q)/2)    (10)
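A small helper illustrating formulas (9) and (10), assuming p and q are discrete distributions over the same support; the epsilon smoothing is an added assumption to avoid division by zero.

import numpy as np

def kl_divergence(p, q, eps=1e-12):
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    return float(np.sum(p * np.log(p / q)))       # Σ p_y log(p_y / q_y)

def js_divergence(p, q):
    m = 0.5 * (np.asarray(p, dtype=float) + np.asarray(q, dtype=float))
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)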
(3) GloVe global vector model
Macroscopically, word-vector methods fall into two categories. One relies on matrix factorization, such as LSA, which uses a word co-occurrence matrix to capture similarity between words but cannot solve the polysemy problem; the other is shallow window-based methods, such as skip-gram and CBOW, which only scan context windows over the corpus and cannot exploit global statistical information. The GloVe global vector model proposed at Stanford University is a global log-bilinear regression model that combines the advantages of global matrix factorization and the local context-window method: it exploits statistical information by training only the non-zero elements of the word co-occurrence matrix, produces a vector space with meaningful substructure, and expresses the semantic similarity of words through differences between word vectors along their dimensions.
Considering that two semantically similar words tend to have a high co-occurrence ratio, the GloVe model learns semantic similarity between words from ratios of co-occurrence probabilities rather than from the probabilities themselves. For example, "ice" and "steam" are both states of water, but "ice" is semantically close to "solid" while "steam" is not, so the co-occurrence ratio of "ice" with "solid" is much higher than that of "steam" with "solid". The relationship between word vectors and the co-occurrence matrix is:
v_s^T ṽ_t + b_s + b̃_t = log(X_st)    (11)

where X_st is the number of times word t appears in the context of word s, v_s is the word vector of word s, ṽ_t is the separate context-word vector produced by another instance of the model, and b_s and b̃_t are the biases of the two word vectors, used to keep the equation symmetric.
To reduce the influence of noisy data such as low-frequency co-occurring words, equation (11) is converted into a least-squares problem and a weighting function f(X_st) is introduced, giving the loss function:

J = Σ_{s,t=1}^{G} f(X_st) (v_s^T ṽ_t + b_s + b̃_t − log X_st)²    (12)

where G is the vocabulary size. The weighting function f(x) is defined as:

f(x) = (x / x_max)^η if x < x_max, and f(x) = 1 otherwise    (13)

The authors give the parameter values x_max = 100 and η = 3/4.
(4) WMD (Word Mover's Distance)
The WMD distance takes the minimum cumulative distance that the feature-word vectors of one short text must travel to reach the feature-word vectors of another short text as the similarity between the two short texts.
WMD measures the similarity between documents through weight transfer amounts and word conversion costs; the WMD distance between short texts d_i and d_j can be expressed as:

Dis_WMD(d_i, d_j) = min_{T ≥ 0} Σ_{s,t=1}^{G} T_st · c(s, t)    (14)
subject to the following constraints:

Σ_{t=1}^{G} T_st = d_{i,s}, for all s ∈ {1, 2, …, G}    (15)

Σ_{s=1}^{G} T_st = d_{j,t}, for all t ∈ {1, 2, …, G}    (16)

where d_{i,s} is the weight of word s as a feature word of short text d_i, d_{j,t} is the weight of word t as a feature word of short text d_j, and G is the vocabulary size.
In formula (14), c(s, t) is the word conversion cost, i.e. the cost of moving word s in d_i to word t in d_j, calculated as:

c(s, t) = ||v_s − v_t||_2    (17)

where v_s and v_t are the word vectors of words s and t, respectively.
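A minimal sketch of the word conversion cost (17), assuming `vectors` is a dict mapping each word to its word vector; the helper name is illustrative only.

import numpy as np

def conversion_cost(vectors, words_i, words_j):
    """Return the cost matrix C[s, t] = ||v_s - v_t||_2 between the words of two short texts."""
    return np.array([[np.linalg.norm(vectors[s] - vectors[t]) for t in words_j]
                     for s in words_i])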
(5) K-means clustering algorithm
K-means is a partition-based, centroid-based clustering algorithm. It was first proposed by MacQueen in 1967, who defined the centroid of a cluster as the mean of the points within the cluster. The algorithm works as follows: randomly select k objects of the data set D as the initial centers of k clusters; compute the distance between every other object and each cluster center with a distance function and assign each object to the most similar cluster; then update the cluster centers and reassign the objects; iterate until the cluster centers converge.
The K-means algorithm uses the Euclidean distance as its distance function:

Dis(d_i, c_e) = ||d_i − c_e||_2    (18)

where d_i is a text in data set D = {d_1, d_2, …, d_n} and c_e is the center of cluster C_e. The cluster centers are updated as:

c_e = (1 / |C_e|) Σ_{d_i ∈ C_e} d_i, e = 1, 2, …, K    (19)

where n is the total number of texts in the data set and K is the number of clusters.
To judge whether the cluster centers have converged, the algorithm uses the sum of squared errors as the criterion function:

SSE = Σ_{e=1}^{K} Σ_{d_i ∈ C_e} ||d_i − c_e||²    (20)
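A compact sketch of this K-means procedure, parameterised by an arbitrary distance function so that a fused distance such as formula (31) can be plugged in later; the mean-based centre update assumes the texts are represented as vectors and that every cluster stays non-empty. The function name and interface are assumptions for illustration.

import numpy as np

def kmeans(X, k, dist_fn, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]        # random initial centers
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # assign each text to the most similar (closest) cluster
        labels = np.array([np.argmin([dist_fn(x, c) for c in centers]) for x in X])
        new_centers = np.array([X[labels == e].mean(axis=0) for e in range(k)])
        if np.allclose(new_centers, centers):                     # centers no longer change
            break
        centers = new_centers
    sse = sum(dist_fn(x, centers[l]) ** 2 for x, l in zip(X, labels))   # criterion function
    return labels, centers, sse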
the existing common method is that a BTM topic model and an improved TF-IDF algorithm are used for respectively modeling microblog short texts, then JS divergence is used for calculating text similarity based on BTM modeling, cosine distance is used for calculating text similarity based on the improved TF-IDF algorithm, and finally the two similarities are subjected to linear fusion and K-means clustering is used for obtaining hot topics.
However, this method has the following problems:
1. TF-IDF is based only on statistical information such as word frequency, so it ignores the semantic information of words and harms the precision of topic clustering.
2. The research object here is a large number of microblog short texts, and the document-word vector matrix constructed by the TF-IDF algorithm is highly sparse, which hurts the running efficiency of the algorithm.
Disclosure of Invention
In view of the above technical problems, the invention provides a microblog hot topic discovery algorithm based on linear fusion of BTM and GloVe similarity. The algorithm comprises three stages: data acquisition and preprocessing, modeling, and clustering. Data are first acquired and preprocessed, the obtained data are then modeled, and the modeled data are finally clustered;
the processing method of the data acquisition and preprocessing stage is as follows: the microblog open platform provides a microblog API; by creating an application on the open platform, a developer can use his or her own App key and App secret to pass OAuth 2.0 user authorization and then call the API to obtain microblog data in Java, C++, Python and other environments; microblog text preprocessing mainly comprises four parts: microblog text filtering, word segmentation and part-of-speech tagging, stop-word removal, and feature selection;
the modeling stage is divided into two parts: the first uses the BTM topic model for topic modeling, represents the original text set with the modeling result, and finally calculates text similarity with JS divergence; the second uses the GloVe global vector model for word-vector modeling and then calculates text similarity with the WMD (Word Mover's Distance) with improved word weights;
the clustering stage linearly fuses the similarities calculated from the two modeling approaches, BTM topic modeling and GloVe word-vector modeling, so as to improve the distance function of the K-means algorithm; the distance function based on the fused similarity is shown in formula (31), and the corresponding clustering criterion function is:

SSE = Σ_{e=1}^{K} Σ_{d_i ∈ C_e} Dis(d_i, c_e)²    (27)

The execution order of the microblog hot topic discovery algorithm based on linear fusion of BTM and GloVe similarity is: BTM-JS → GloVe-WMD → BG&SLF-Kmeans. The specific method is as follows:
first, acquire microblog data and perform text preprocessing; second, model the preprocessed microblog text set with the BTM topic model and, after text representation, calculate the topic-based text similarity after BTM modeling with JS divergence (BTM-JS); third, model the preprocessed microblog text set with the GloVe global vector model and calculate the word-vector-based text similarity after GloVe modeling with the improved WMD distance (GloVe-WMD); finally, linearly fuse the two similarities and apply the fused-similarity distance function to the K-means clustering algorithm to discover microblog hot topics (BG&SLF-Kmeans).
The specific flow of microblog text preprocessing is as follows:
(1) microblog text filtering
First, delete useless information such as emoticons, links and mentions, e.g. "@user", "tap the picture ↓ to learn more", "more >>>", which would seriously interfere with the subsequent text analysis; second, delete ultra-short microblogs of fewer than 10 words, which are generally used to express a user's mood rather than to describe a hot topic; finally, remove all punctuation marks in the data set;
(2) word segmentation and part-of-speech tagging
Chinese word segmentation is a very important technique in natural language processing; it cuts a sequence of Chinese characters into individual words according to certain norms. Commonly used Chinese word-segmentation tools include ICTCLAS (NLPIR), Ansj, Jieba and THULAC. Before segmentation, a user-defined dictionary (generally including new internet words, hot words, professional terms, person names, place names, etc.) is added according to the particularity of the data set to improve segmentation accuracy;
(3) stop-word removal
After word segmentation, the text set becomes a word set. Some words have no actual meaning and only serve as sentence connectors or to express the author's mood, such as the particles "o" and "ya"; if retained, they increase the dimensionality of the data set, raise the running cost of the algorithm and degrade the final short-text clustering result, so a stop-word list is used to delete these meaningless words;
(4) feature selection
After part-of-speech tagging, the part of speech of each word is appended after the word, e.g. "word/v" for a verb. Words of different parts of speech carry different amounts of information; since this work studies hot-topic discovery, words of some parts of speech, such as adjectives and adverbs, contribute less. To improve the running efficiency of the algorithm, the experimental data keeps only nouns and verbs, and words of the remaining parts of speech are filtered out as useless words.
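A minimal preprocessing sketch using the Jieba segmenter mentioned above: keep only nouns and verbs after POS tagging and drop stop words. The stop-word set and the user-dictionary path are assumptions to be supplied by the caller; the function name is illustrative.

import jieba
import jieba.posseg as pseg

def preprocess(text, stopwords=frozenset(), user_dict=None):
    if user_dict:
        jieba.load_userdict(user_dict)      # new internet words, hot words, person/place names, ...
    words = []
    for pair in pseg.cut(text):
        # keep nouns (POS flags starting with 'n') and verbs (flags starting with 'v')
        if pair.flag and pair.flag[0] in ("n", "v") and pair.word not in stopwords:
            words.append(pair.word)
    return words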
The concrete flow of the modeling stage is as follows:
(1) text similarity measurement based on BTM topic modeling
The text-similarity measurement process based on BTM topic modeling is divided into two parts. First part: calculate the optimal number of topics K with the perplexity formula, perform BTM topic modeling on the preprocessed microblog text set, and finally represent the texts according to the modeling result. Second part: calculate text similarity with JS divergence;
1) BTM topic modeling
Because the choice of the number of topics K directly affects the BTM modeling result, the K that optimizes the modeling result must be determined before modeling. The optimal K can be determined with the perplexity, which evaluates the generalization ability of the model; the lower the perplexity, the better the modeling effect. The perplexity is calculated as:

Perplexity(B) = exp( − Σ_{b∈B} ln p(b) / |B| )    (21)
where B denotes the word-pair set and p(b) denotes the probability of generating word pair b:

p(b) = Σ_z θ_z · φ_{i|z} · φ_{j|z}    (22)

where P(z) = θ_z is the probability of topic z, P(w_i|z) = φ_{i|z} is the probability of feature word w_i under topic z, and P(w_j|z) = φ_{j|z} is the probability of feature word w_j under topic z;
after the number of topics K is determined, α = 50/K and β = 0.01 are set empirically, and the topic distribution θ_z and topic-word distribution φ_{w|z} are then obtained from formulas (7) and (8).
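A sketch of the perplexity formula (21), assuming `theta` (shape K) and `phi` (shape K x G) are the distributions estimated by a BTM Gibbs sampler such as the sketch above, and `biterms` is a list of (w_i, w_j) word-index pairs; the K with the smallest perplexity over the candidate values is then chosen.

import numpy as np

def btm_perplexity(theta, phi, biterms):
    log_p = 0.0
    for w_i, w_j in biterms:
        p_b = float(np.sum(theta * phi[:, w_i] * phi[:, w_j]))   # p(b) = Σ_z θ_z φ_{i|z} φ_{j|z}
        log_p += np.log(p_b)
    return float(np.exp(-log_p / len(biterms)))                  # lower perplexity is better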
2) JS divergence
After BTM modeling is completed, for each document the 6 feature words with the highest probability in the topic-word distribution p(w|z) under the maximum-probability topic of the document-topic distribution p(z|d) are selected as the document's feature words; this reduces dimensionality, and hence algorithm complexity, while preserving document semantics as far as possible. For document d_i, the document vector based on BTM topic modeling can be represented by the K document-topic probabilities:

d_{i_BTM} = {p(z_1|d_i), p(z_2|d_i), ..., p(z_K|d_i)}    (23)

Calculating the similarity between two documents d_i and d_j is thus converted into calculating the similarity between the two document-topic vectors d_{i_BTM} and d_{j_BTM}, measured here with the commonly used JS divergence. The text-similarity formula based on BTM topic modeling and JS divergence is:

Dis_BTM(d_i, d_j) = ½ D_KL(d_{i_BTM} ‖ m) + ½ D_KL(d_{j_BTM} ‖ m), where m = (d_{i_BTM} + d_{j_BTM}) / 2    (24)
where the KL divergence is calculated as:

D_KL(p ‖ q) = Σ_h p_h log(p_h / q_h)    (25)

where p and q are two probability distributions and p_h, q_h are the probabilities of the first 6 feature words;
(2) text similarity measurement based on GloVe word vector modeling
The text similarity measurement process based on GloVe word vector modeling is divided into two parts, firstly, GloVe word vector modeling is carried out on a preprocessed microblog text set, and then the similarity between texts is calculated by using WMD distance;
1) GloVe word vector modeling
Before training word vectors, the GloVe model counts, within a context window of a given size, the number of times each target word v_s co-occurs with its context words ṽ_t over the whole corpus to construct the word co-occurrence matrix X_st. In the original paper, two parameters of the model, the word-vector dimension vector_size and the window size window_size, were compared experimentally using three indicators (semantic accuracy, syntactic accuracy and overall accuracy), with the conclusion that vector_size = 300 performs best and the best window_size lies roughly between 6 and 10. Since the data set used by the source code is an English data set, the parameters are set to vector_size = 300 and window_size = 8 here, considering the particularity of the Chinese microblog short-text data set;
2) WMD distance with improved word weights
After GloVe word-vector modeling, the similarity between texts is calculated with the WMD distance. The text-similarity formula for short texts d_i and d_j based on GloVe word-vector modeling and the WMD distance is:

Dis_GloVe(d_i, d_j) = min_{T ≥ 0} Σ_{s,t=1}^{G} T_st · c(s, t)    (26)

where G is the vocabulary size and T_st is a G-order weight transfer matrix whose entries indicate how much of the weight of word s in d_i is transferred to word t in d_j. When word s of text d_i is completely transferred to text d_j, Σ_t T_st = weight_s, where the weight transfer amount weight_s is measured by the weight calculation formula that fuses word-position contribution; the formula above must also satisfy the two constraints of formulas (15) and (16).
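A sketch, under assumptions, of computing the minimum-cost weight transfer in formula (26) as a small transport linear program with SciPy; `w_i` and `w_j` are taken to be the weight_s vectors of the two texts (normalised so that their sums match, otherwise the program is infeasible), and `C` is the cost matrix of formula (17). The function name is illustrative, not from the patent.

import numpy as np
from scipy.optimize import linprog

def wmd_distance(w_i, w_j, C):
    n, m = len(w_i), len(w_j)
    # equality constraints: sum_t T_st = w_i[s]  and  sum_s T_st = w_j[t]
    A_eq = np.zeros((n + m, n * m))
    for s in range(n):
        A_eq[s, s * m:(s + 1) * m] = 1.0
    for t in range(m):
        A_eq[n + t, t::m] = 1.0
    b_eq = np.concatenate([w_i, w_j])
    res = linprog(C.reshape(-1), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun   # the minimum cumulative transfer cost, i.e. the WMD distance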
The text similarity measurement based on BTM topic modeling and JS divergence (BTM-JS) is as follows:
Input: microblog short text set D = {d_1, d_2, …, d_n}, hyper-parameters α and β
Output: topic-based text similarity Dis_BTM(d_i, d_j)
Step1. Determine the optimal number of topics K according to formula (21)
Step2. Randomly assign initial topics to all word pairs
Step3. for d_i ∈ D do
Step4. for b ∈ B do
Step5. Assign a topic z_b to each word pair according to formula (6), and update n_z, n_{w_i|z} and n_{w_j|z}
Step6. Calculate the topic distribution θ_z and topic-word distribution φ_{w|z} according to formulas (7) and (8)
Step7. Select feature words for each document according to formula (23) and represent the texts as vectors
Step8. Calculate the topic-based text similarity Dis_BTM(d_i, d_j) according to formula (24) and output it.
The text similarity measurement based on GloVe word vector modeling and WMD distance (GloVe-WMD) is as follows:
Input: microblog short text set D = {d_1, d_2, …, d_n}
Output: word-vector-based text similarity Dis_GloVe(d_i, d_j)
Step1. Construct the word co-occurrence matrix X_st of the microblog short-text set
Step2. Model with GloVe based on the word co-occurrence matrix X_st to obtain the word-vector set V = {v_1, v_2, …, v_G}
Step3. for s = 1 to G do
Step4. Judge whether word s is a title word or a text word according to l_s < 10, and update c_{s_title} and c_{s_text}
Step5. Take formula (28) as the word-weight calculation formula and calculate the weight transfer amount weight_s
Step6. Calculate the weight transfer matrix T_st according to Σ_t T_st = weight_s
Step7. for v_s, v_t ∈ V do
Step8. Calculate the word conversion cost c(s, t) according to formula (17)
Step9. Calculate the word-vector-based text similarity Dis_GloVe(d_i, d_j) according to formula (26) and output it.
The hot topic discovery algorithm based on BG&SLF-Kmeans is as follows:
Input: microblog short text set D = {d_1, d_2, …, d_n}, the number of clusters K calculated with the perplexity formula (21)
Output: K clusters
Step1. Randomly select K short texts from data set D = {d_1, d_2, …, d_n} as the initial cluster centers c_e, e = 1, 2, …, K
Step2. Take formula (31) as the distance function
Step3. repeat
Step4. Calculate the distance Dis(d_i, c_e) between each remaining short text d_i and each cluster center c_e, and assign each short text to the most similar cluster
Step5. Recalculate the K cluster centers c_e according to formula (19)
Step6. until the clustering criterion function (27) converges.
The beneficial effects of the invention are as follows: aiming at the problem that the distance function of the K-means algorithm affects the microblog hot-topic clustering result, the invention provides a microblog hot topic discovery algorithm based on linear fusion of BTM and GloVe similarity.
The GloVe model trains only the non-zero elements of the word-word co-occurrence matrix rather than the whole sparse matrix to exploit statistical information, which effectively alleviates the sparsity problem of the TF-IDF algorithm in constructing the document-word vector matrix. The GloVe model combines a global matrix factorization method with a local context-window method; the trained word vectors carry more semantic information, which to some extent relieves the polysemy problem that the BTM topic model cannot handle well.
Drawings
FIG. 1 is a flow chart of microblog hot topic discovery based on BTM and GloVe similarity linear fusion.
FIG. 2 is a flow chart of microblog text preprocessing according to the invention.
FIG. 3 is a diagram showing, in an embodiment of the invention, the similarity between the short texts "Kobe meets fans in Los Angeles" and "The NBA star meets American fans".
Detailed Description
Example 1
For a microblog news short text, the news title is located at the very front and is marked with double "#" signs or brackets; the remaining content is the body. The news title serves to summarize the news content.
Definition 1 (title words and text words). Take the first 10 words of the preprocessed microblog short text as the title and the rest as the body; that is, if the column index l_s of word s is less than 10, s is a title word, otherwise s is a text word.
Example 1: "The primary-school volunteers of Hongkou Primary School walk onto the street to help the elderly cross the road, starting from themselves, caring for everyone and everything, and spreading the Lei Feng spirit." In the preprocessed microblog news short text, the first 6 words, from "Hongkou Primary School" to "elder", are title words and the following words are text words.
When calculating the weight transfer amount of a word, the WMD distance is measured only by the TF value, which is relatively crude: some words occur frequently but contribute little to topic discovery, and the TF value alone can hardly reflect such differences accurately. Meanwhile, title words and text words differ in importance, so the positional factor of a word should also be taken into account.
Definition 2 (weight calculation formula fusing word-position contribution). The weight transfer amount of a word is calculated with the TF-IDF value of the word, setting the position contribution of title words to γ_1 = 1.5 and the position contribution of text words to γ_2 = 1. Some words may be both title words and text words, so the weight calculation formula fusing word-position contribution is:

weight_s = ((γ_1 · c_{s_title} + γ_2 · c_{s_text}) / c_s) · tf_s · idf_s    (28)
where c_s denotes the total number of occurrences of word s, c_{s_title} the number of times s occurs as a title word, c_{s_text} the number of times s occurs as a text word, and c_{s_title} + c_{s_text} = c_s. tf_s and idf_s are calculated as:

tf_s = c_s / Σ_{t=1}^{G} c_t    (29)

idf_s = log( |D| / |{i : s ∈ d_i}| )    (30)
in the tf_s formula, G is the vocabulary size; in the idf_s formula, |D| denotes the number of texts in the short-text set and |{i : s ∈ d_i}| denotes the number of texts containing word s.
Example 2: A data set has 100 microblog short texts with 1000 words in total, in which the word "explosion" appears in 9 short texts and 20 times in total, 15 times in titles and 5 times in bodies. If the TF value is used, weight_explosion = 20/1000 = 0.02; if the weight calculation formula fusing word-position contribution is used, weight_explosion = (1.5 × 15 + 1 × 5)/1000 × idf_explosion = 0.0275 × ln(100/9) ≈ 0.066.
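A sketch of the position-weighted transfer amount, following formula (28) and the tf/idf definitions as reconstructed above (natural logarithm assumed); the γ values are from Definition 2, and the function signature is an assumption for illustration.

import math

def fused_weight(c_title, c_text, total_words, docs_total, docs_with_word,
                 gamma1=1.5, gamma2=1.0):
    c_s = c_title + c_text                                   # total occurrences of word s
    tf = c_s / total_words                                   # formula (29)
    idf = math.log(docs_total / docs_with_word)              # formula (30)
    position = (gamma1 * c_title + gamma2 * c_text) / c_s    # fused position contribution
    return position * tf * idf                               # formula (28)

# Example 2 above: 15 title and 5 body occurrences out of 1000 words, in 9 of 100 texts:
# fused_weight(15, 5, 1000, 100, 9) ≈ 0.066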
For a clustering algorithm, accurately calculating the distance between a text and each cluster center, so as to determine the cluster to which the text belongs, is very important; the choice of distance function therefore plays a significant role in the clustering result.
Definition 3 (distance function fusing similarity). Given the text similarity Dis_BTM(d_i, d_j) based on BTM topic modeling and JS divergence and the text similarity Dis_GloVe(d_i, d_j) based on GloVe word-vector modeling and the improved WMD distance, the distance function fusing the similarities is:

Dis(d_i, c_e) = λ · Dis_BTM(d_i, c_e) + (1 − λ) · Dis_GloVe(d_i, c_e), i = 1, 2, …, n; e = 1, 2, …, K    (31)
where d_i is a text in data set D = {d_1, d_2, …, d_n}, c_e is a cluster center (at the start of the algorithm K texts are randomly selected from the data set as the initial cluster centers c_e), and λ is the fusion coefficient with 0 < λ < 1; the value of λ is determined by the clustering effect.
Example 3: Assume the text similarity based on BTM topic modeling and JS divergence is Dis_BTM(d_i, c_e) = 0.76 and the text similarity based on GloVe word-vector modeling and the improved WMD distance is Dis_GloVe(d_i, c_e) = 0.64. With fusion coefficient λ = 0.7, the fused-similarity distance is Dis(d_i, c_e) = 0.7 × 0.76 + 0.3 × 0.64 = 0.724.
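A minimal sketch of the fused distance of formula (31); dis_btm and dis_glove are assumed to be the two precomputed similarity values for a text-centre pair, and the function name is illustrative.

def fused_distance(dis_btm, dis_glove, lam=0.7):
    assert 0.0 < lam < 1.0           # fusion coefficient, chosen by clustering effect
    return lam * dis_btm + (1.0 - lam) * dis_glove

# Example 3 above: fused_distance(0.76, 0.64, 0.7) == 0.724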
Example 2
The collected microblog text set comprises:
[Official announcement!] The playbill of the # Tribute to the Angels in White public-welfare cloud concert # is formally released! Every song is requested by the medical staff. There is also a mystery surprise performance that will not disappoint!
[A blockbuster line-up!] The poster for the Angels in White concert: tonight at 19:30, the # Tribute to the Angels in White public-welfare cloud concert # goes live! A strong guest line-up sings songs of love. Every song is requested by the medical staff! Thank you for risking your lives for us; let us sing a song for you! Forward and look forward to it!
The primary-school volunteers of Hongkou Primary School walk onto the street to help the elderly cross the road, starting from themselves, caring for everyone and everything, and spreading the Lei Feng spirit.
# Mother's Day # While growing up, she always wants to go far away, yet for years she has relied on her mother to listen to her.
# Mother's Day contest # They say the love between mother and us is a gradual journey apart. If that is the cost of growing up, then at least today, please pause your steps away and look back at the mother we leave behind.
"Happy Mother's Day" # No matter how old you are, you are always a child in your mother's eyes. With your mother there, you always have a home! Don't forget to call her and tell her: "Mom, I love you!" Then multiply the care you give her by 365.
Pre-processed data set:
official announce/vocation/custom/angel/concert/playbill/release/song/medic/request/show/expect lineup/vocation/angel/concert/playbill/poster/tonight/vocation/angel/public welfare/cloud/concert/live broadcast/start/guest/lineup/song/medic/request/thank/award/give/offer/song/medic/request/thank/award/show/offer/song/forward/expect/ask/forward/ask
Hongkou Primary School/volunteer/walk/street/help/elder/cross/road/self/care/person/care/thing/spread/Lei Feng spirit
Mother's day/small time/want/last life/want/in/mom/body/listen/fine year mother's day/tell/show/mother/love/separate/grow/cost/in/today/pause/step/look/behind/watch/mother/day
Mother's festival/happy/in/mom/eye/child/mother/in/present/home/forget/make/call/say/mom/love/hold/love/multiply/care
BTM topic modeling:
Running the model-perplexity code, assume the algorithm yields the following results:
When K = 1, Perplexity = 193;
When K = 2, Perplexity = 162;
When K = 3, Perplexity = 151;
When K = 4, Perplexity = 165;
When K = 5, Perplexity = 186;
Thus the model perplexity is smallest when the number of topics K = 3, so the optimal number of topics is K = 3.
When K is 3, it is assumed that running the code of the BTM model yields the following results:
(1) document-topic distribution
(2) Topic distribution, topic-word distribution (only the first 6 words with the largest proportion are retained)
Calculating JS divergence:
The texts are represented by their feature words as follows:
angel/offer/concert/medical staff/public welfare/expectation
Angel/offer/concert/medical staff/public welfare/expectation
Hongkou Primary School/volunteer/help/elder/road/Lei Feng spirit
Mother's festival/love/happy/caucasian/mother/child
Mother's festival/love/happy/caucasian/mother/child
Mother's festival/love/happy/caucasian/mother/child
For example, calculate the JS divergence between the first document and the third document.
The probability distribution of d_1 is: 0.105416, 0.073215, 0.060613, 0.042846, 0.036125, 0.021653.
The probability distribution of d_3 is: 0.1158155, 0.09243, 0.0846325, 0.073604, 0.06388, 0.038933.
The JS divergence of the first and third documents is then obtained from formula (24).
GloVe word vector modeling:
Running the GloVe code, a word vector for each word can be obtained; assume the approximate result is as follows (in part):
Okinawa: -0.004115362 0.002320011 0.0010127611 -0.00042312752 -0.004730146
Rhizoma bletilla: 0.003889411 -0.0027805932 -0.0019863483 0.8018722716 -0.00267
Concert: -0.0048637153 -0.004752655 0.0040961024 -0.0034338231 -0.001136
Song: 0.0031873875 0.0027359112 -0.0035361357 0.0035860823 -0.0028235812
Mother's day: 0.0002723677 -0.0013081346 0.00116098 -0.0015890214 -0.00448590
Mother: 0.0006203476 0.0030022748 -0.004656621 0.0019886396 -0.0003756053
Mom: -0.0008758595 -0.0037317497 0.0019768209 -0.0020630183 0.00242962
Medical personnel: -0.00038896065 -0.0025833468 0.0021852236 -0.0037763647 0.000
Request a song: -2.3274260e-05 0.0017746754 0.0022264344 -0.00090659875 -2.9263224
Calculating the WMD distance:
Using formula (26), the WMD distance can be calculated, where the word conversion cost is c(s, t) = ||v_s − v_t||_2 (v_s and v_t are the GloVe word vectors of words s and t, respectively), and the weight transfer matrix T_st satisfies Σ_t T_st = weight_s, with the weight transfer amount weight_s calculated by the weight calculation formula fusing word-position contribution of Definition 2.
An example is as follows:
Assume there are several short texts; calculate the similarity between the two short texts "Kobe meets fans in Los Angeles" and "The NBA star meets American fans", as shown in FIG. 3.
The weight transfer amount weight_s of each word is calculated by formula (28); for ease of explanation, a simple number is chosen here as weight_s and labelled next to each word.
According to the constraints, since weight_Kobe = 0.5, the total weight transferred from "Kobe" to the four words "NBA star", "America", "fans" and "meets" should be 0.5. At the same time, since weight_NBA star = 0.6, the sum of the weights transferred from the four words "Kobe", "Los Angeles", "meets" and "fans" to the word "NBA star" should be 0.6. In fact, according to the constraints Σ_t T_st = weight_s and Σ_s T_st = weight_t, the components of the weight transfer matrix T_st can be obtained; since there are multiple feasible solutions, Σ_{s,t} T_st · c(s, t) is computed for each and the minimum value is taken as the WMD distance between the two documents.
K-means clustering based on linearly fused similarity:
1) randomly selecting three short texts as initial cluster centers
2) Using the fused-similarity distance function (31), calculate the distances between the remaining short texts and the three initial cluster centers, and assign each short text to the cluster whose center it is most similar to.
3) Update the cluster centers according to formula (19), e = 1, 2, …, K.
4) Repeat 2) and 3) until the criterion function converges (cluster center no longer changes)
The following results will eventually be approximately obtained:
1. clusters corresponding to each short text (the clusters are represented by the labels 1-3, the label 1 corresponds to the first cluster, and so on):
2. To display the results more intuitively, several feature words may also be used to represent each cluster:
Label  Feature words
1  Hongkou Primary School, volunteer, help, elder, road, Lei Feng spirit
2  angel, tribute, concert, medical staff, public welfare, expectation
3  Mother's Day, love, happy, caucasian, mother, child.
The foregoing shows and describes the basic principles and main features of the invention and its advantages. Those skilled in the art will understand that the invention is not limited to the embodiments described above; the embodiments and the description merely illustrate the principles of the invention, and various changes and modifications may be made without departing from the spirit and scope of the invention, all of which fall within the scope of the claimed invention. The scope of the invention is defined by the appended claims and their equivalents.

Claims (7)

1. A microblog hot topic discovery algorithm based on linear fusion of BTM and GloVe similarity, characterized by comprising three stages: data acquisition and preprocessing, modeling, and clustering, wherein data are first acquired and preprocessed, the obtained data are then modeled, and the modeled data are clustered;
the processing method of the data acquisition and preprocessing stage is as follows: the microblog open platform provides a microblog API; by creating an application on the open platform, a developer can use his or her own App key and App secret to pass OAuth 2.0 user authorization and then call the API to obtain microblog data in Java, C++, Python and other environments; microblog text preprocessing mainly comprises four parts: microblog text filtering, word segmentation and part-of-speech tagging, stop-word removal, and feature selection;
the modeling stage is divided into two parts: the first uses the BTM topic model for topic modeling, represents the original text set with the modeling result, and finally calculates text similarity with JS divergence; the second uses the GloVe global vector model for word-vector modeling and then calculates text similarity with the WMD (Word Mover's Distance) with improved word weights;
the clustering stage linearly fuses the similarities calculated from the two modeling approaches, BTM topic modeling and GloVe word-vector modeling, so as to improve the distance function of the K-means algorithm, wherein the distance function based on the fused similarity is shown in formula (31), and the corresponding clustering criterion function is:

SSE = Σ_{e=1}^{K} Σ_{d_i ∈ C_e} Dis(d_i, c_e)²    (27)

The execution order of the microblog hot topic discovery algorithm based on linear fusion of BTM and GloVe similarity is: BTM-JS → GloVe-WMD → BG&SLF-Kmeans; the specific method is as follows:
first, acquire microblog data and perform text preprocessing; second, model the preprocessed microblog text set with the BTM topic model and, after text representation, calculate the topic-based text similarity after BTM modeling with JS divergence (BTM-JS); third, model the preprocessed microblog text set with the GloVe global vector model and calculate the word-vector-based text similarity after GloVe modeling with the improved WMD distance (GloVe-WMD); finally, linearly fuse the two similarities and apply the fused-similarity distance function to the K-means clustering algorithm to discover microblog hot topics (BG&SLF-Kmeans).
2. The microblog hot topic discovery algorithm based on BTM and GloVe similarity linear fusion as claimed in claim 1, wherein the microblog text preprocessing comprises the following specific procedures:
(1) microblog text filtering
First, delete useless information such as emoticons, links and mentions, e.g. "@user", "tap the picture ↓ to learn more", "more >>", which would seriously interfere with the subsequent text analysis; second, delete ultra-short microblogs of fewer than 10 words, which are generally used to express a user's mood rather than to describe a hot topic; finally, remove all punctuation marks in the data set;
(2) word segmentation and part-of-speech tagging
Chinese word segmentation is an important technique in natural language processing; it divides a sequence of Chinese characters into individual words according to a certain specification. Commonly used Chinese word-segmentation tools include ICTCLAS (NLPIR), Ansj, Jieba and THULAC, and a user-defined dictionary, generally including new internet words, hot words, professional terms, person names and place names, is added before segmentation according to the particularity of the data set to improve segmentation accuracy;
(3) stop-word removal
after word segmentation, the text set becomes a word set; some words have no actual meaning and only serve as sentence connectors or to express the author's mood, such as the particles "o" and "ya"; if these words are retained, the high dimensionality of the data set increases the running cost of the algorithm and also degrades the final short-text clustering result, so the Harbin Institute of Technology stop-word list is selected for the experiments to delete these meaningless words;
(4) feature selection
after part-of-speech tagging, the part of speech of each word is appended after the word, e.g. "word/v" for a verb; words of different parts of speech carry different amounts of information, and since this work studies hot-topic discovery, words of some parts of speech, such as adjectives and adverbs, contribute less; to improve the running efficiency of the algorithm, the experimental data keeps only nouns and verbs, and words of the remaining parts of speech are filtered out as useless words.
3. The microblog hot topic discovery algorithm based on BTM and GloVe similarity linear fusion as claimed in claim 1, characterized in that the specific flow of the modeling stage is as follows:
(1) text similarity measurement based on BTM topic modeling
The text similarity metric process based on BTM topic modeling is divided into two parts,
a first part:
firstly, calculating an optimal theme number K by using a confusion formula, then carrying out BTM theme modeling on a preprocessed microblog text set, and finally carrying out text representation according to a modeling result;
a second part: calculating text similarity by using JS divergence;
1) BTM topic modeling
because the choice of the number of topics K directly affects the BTM modeling result, the K that optimizes the modeling result must be determined before modeling; the optimal K can be determined with the perplexity, which evaluates the generalization ability of the model, and the smaller the perplexity, the better the modeling effect; the perplexity is calculated as:

Perplexity(B) = exp( − Σ_{b∈B} ln p(b) / |B| )    (21)
in the formula, B represents a word pair set, and p (B) represents the probability generated by the word pair B, and the formula is as follows:
p(b) = Σ_z θ_z · φ_{i|z} · φ_{j|z}    (22)

where P(z) = θ_z is the probability of topic z, P(w_i|z) = φ_{i|z} is the probability of feature word w_i under topic z, and P(w_j|z) = φ_{j|z} is the probability of feature word w_j under topic z;
after the number of topics K is determined, α = 50/K and β = 0.01 are set empirically, and the topic distribution θ_z and topic-word distribution φ_{w|z} are then obtained according to formulas (7) and (8) mentioned in the background art;
2) JS divergence
After BTM modeling is completed, for each document, 6 feature words with the first probability in the topic-word distribution p (w | z) under the maximum probability topic in the document-topic probability distribution p (z | d) are selected as the feature words of the document, so that dimension reduction can be realized on the basis of maximally preserving document semantics, and the algorithm complexity is reducediThe document vector based on BTM topic modeling can be represented by K document-topic probability distributions:
di_BTM={p(z1|di),p(z2|di),...,p(zK|di)} (23)
two pieces are to be calculatedDocument diAnd djThe similarity between the two is converted into the calculation di_BTMAnd dj_BTMThe similarity between these two document-topic vectors, calculated herein using the commonly used text similarity metric — JS divergence, is based on the BTM topic modeling and the text similarity calculation formula for JS divergence as follows:
Figure RE-FDA0002217272120000051
wherein, the calculation formula of the KL divergence is as follows:
D_KL(p ‖ q) = Σ_h p_h log(p_h / q_h)    (25)

where p and q are two probability distributions and p_h, q_h are the probabilities of the first 6 feature words;
(2) text similarity measurement based on GloVe word vector modeling
The text similarity measurement process based on GloVe word vector modeling is divided into two parts, firstly, GloVe word vector modeling is carried out on a preprocessed microblog text set, and then the similarity between texts is calculated by using WMD distance;
1) GloVe word vector modeling
before training word vectors, the GloVe model counts, within a context window of a given size, the number of times each target word v_s co-occurs with its context words ṽ_t over the whole corpus to construct the word co-occurrence matrix X_st; in the original paper, two parameters of the model, the word-vector dimension vector_size and the window size window_size, were compared experimentally using three indicators (semantic accuracy, syntactic accuracy and overall accuracy), with the conclusion that vector_size = 300 performs best and the best window_size lies roughly between 6 and 10; since the data set used by the source code is an English data set, the parameters are set to vector_size = 300, window_size = 8, considering the particularity of the Chinese microblog short-text data set;
2) WMD distance with improved word weights
after GloVe word-vector modeling, the similarity between texts is calculated with the WMD distance; the text-similarity formula for short texts d_i and d_j based on GloVe word-vector modeling and the WMD distance is:

Dis_GloVe(d_i, d_j) = min_{T ≥ 0} Σ_{s,t=1}^{G} T_st · c(s, t)    (26)

where G is the vocabulary size and T_st is a G-order weight transfer matrix whose entries indicate how much of the weight of word s in d_i is transferred to word t in d_j; when word s of text d_i is completely transferred to text d_j, Σ_t T_st = weight_s, where the weight transfer amount weight_s is measured by the weight calculation formula fusing word-position contribution, and the formula above must satisfy the two constraints of formulas (15) and (16) in the background art.
4. The microblog hot topic discovery algorithm based on linear fusion of BTM and GloVe similarity according to claim 1, characterized in that the hot topic discovery algorithm based on BG&SLF-Kmeans is as follows:
Input: microblog short text set D = {d_1, d_2, …, d_n}, the number of clusters K calculated with the perplexity formula (21);
Output: K clusters;
Step1. Randomly select K short texts from data set D = {d_1, d_2, …, d_n} as the initial cluster centers c_e, e = 1, 2, …, K;
Step2. Take formula (31) as the distance function;
Step3. repeat;
Step4. Calculate the distance Dis(d_i, c_e) between each remaining short text d_i and each cluster center c_e, and assign each short text to the most similar cluster;
Step5. Recalculate the K cluster centers c_e according to formula (19);
Step6. until the clustering criterion function (27) converges.
5. The microblog hot topic discovery algorithm based on linear fusion of BTM and GloVe similarity according to claim 3, characterized in that the text similarity measurement based on BTM topic modeling and JS divergence (BTM-JS) is as follows:
Input: microblog short text set D = {d_1, d_2, …, d_n}, hyper-parameters α and β;
Output: topic-based text similarity Dis_BTM(d_i, d_j);
Step1. Determine the optimal number of topics K according to formula (21);
Step2. Randomly assign initial topics to all word pairs;
Step3. for d_i ∈ D do;
Step4. for b ∈ B do;
Step5. Assign a topic z_b to each word pair according to formula (22), and update n_z, n_{w_i|z} and n_{w_j|z};
Step6. Calculate the topic distribution θ_z and topic-word distribution φ_{w|z} according to formulas (7) and (8);
Step7. Select feature words for each document according to formula (23) and represent the texts as vectors;
Step8. Calculate the topic-based text similarity Dis_BTM(d_i, d_j) according to formula (24) and output it.
6. The microblog hot topic discovery algorithm based on linear fusion of BTM and GloVe similarity according to claim 3, characterized in that the text similarity measurement based on GloVe word vector modeling and WMD distance (GloVe-WMD) is as follows:
Input: microblog short text set D = {d_1, d_2, …, d_n};
Output: word-vector-based text similarity Dis_GloVe(d_i, d_j);
Step1. Construct the word co-occurrence matrix X_st of the microblog short-text set;
Step2. Model with GloVe based on the word co-occurrence matrix X_st to obtain the word-vector set V = {v_1, v_2, …, v_G};
Step3. for s = 1 to G do;
Step4. Judge whether word s is a title word or a text word according to l_s < 10, and update c_{s_title} and c_{s_text};
Step5. Take formula (28) as the word-weight calculation formula and calculate the weight transfer amount weight_s;
Step6. Calculate the weight transfer matrix T_st according to Σ_t T_st = weight_s;
Step7. for v_s, v_t ∈ V do;
Step8. Calculate the word conversion cost c(s, t) according to formula (17);
Step9. Calculate the word-vector-based text similarity Dis_GloVe(d_i, d_j) according to formula (26) and output it.
7. The microblog hot topic discovery algorithm based on linear fusion of BTM and GloVe similarity as claimed in claim 1, characterized in that the specific processing method is as follows:
For a microblog news short text, the news title appears at the very beginning, marked by a pair of # signs or brackets; the title generally serves as a summary of the news, and the remaining content constitutes the text part;
Definition 1: take the first 10 words of the processed microblog short text as the title and the remaining words as the text; that is, if the column index l_s of a word s is less than 10, s is a title word, otherwise s is a text word;
Definition 2: weight calculation formula fusing the word position contribution degree. The weight transfer amount of a word is calculated from its TF-IDF value, with the position contribution degree of a title word set to γ_1 = 1.5 and that of a text word set to γ_2 = 1. Since a word may be both a title word and a text word, the weight calculation formula fusing the word position contribution degree is as follows:
[formula image: equation (28), the weight weight_s fusing the word position contribution degree]
wherein c_s denotes the total number of occurrences of the word s, c_s_title denotes the number of times s occurs as a title word, c_s_text denotes the number of times s occurs as a text word, and c_s_title + c_s_text = c_s; the calculation formulas of tf_s and idf_s are as follows:
[formula image: equation (22), the term frequency tf_s]
[formula image: equation (23), the inverse document frequency idf_s]
in formula (22), G is the vocabulary size; in formula (23), |D| represents the number of texts in the short text set and |{i: s ∈ d_i}| denotes the number of texts containing the word s;
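Formulas (22), (23) and (28) are published only as images, so the sketch below is a minimal reading of Definitions 1 and 2: a TF-IDF weight scaled by a position factor that mixes γ_1 = 1.5 for title occurrences and γ_2 = 1 for text occurrences in proportion to c_s_title and c_s_text. The function name position_weights, the corpus layout, and the exact normalization are assumptions, not the patented formula.

```python
import math
from collections import Counter

GAMMA_TITLE, GAMMA_TEXT = 1.5, 1.0   # position contribution degrees from Definition 2

def position_weights(docs, title_len=10):
    """Sketch of a TF-IDF weight fused with word position contribution.

    docs: list of tokenized short texts; the first `title_len` tokens of each
    text are treated as the title (Definition 1).
    Returns {word: weight_s} over the whole corpus.
    NOTE: formulas (22), (23) and (28) are published only as images, so the
    normalization used here is an assumed reading, not the patented formula.
    """
    c_total, c_title, c_text, df = Counter(), Counter(), Counter(), Counter()
    for doc in docs:
        for i, w in enumerate(doc):
            c_total[w] += 1
            if i < title_len:
                c_title[w] += 1                 # occurrence as a title word
            else:
                c_text[w] += 1                  # occurrence as a text word
        for w in set(doc):
            df[w] += 1                          # document frequency

    n_docs = len(docs)
    vocab_size = len(c_total)                   # G, the vocabulary size
    weights = {}
    for w, c in c_total.items():
        tf = c / vocab_size                     # assumed reading of formula (22)
        idf = math.log(n_docs / df[w])          # assumed reading of formula (23)
        # position factor: average contribution over title/text occurrences
        pos = (GAMMA_TITLE * c_title[w] + GAMMA_TEXT * c_text[w]) / c
        weights[w] = pos * tf * idf
    return weights
```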
Definition 3: distance function of the fused similarity. Given the text similarity Dis_BTM(d_i, d_j) based on BTM topic modeling and JS divergence, and the text similarity Dis_GloVe(d_i, d_j) based on GloVe word vector modeling and WMD distance, the distance function of the fused similarity is as follows:
Dis(d_i, c_e) = λ·Dis_BTM(d_i, c_e) + (1 − λ)·Dis_GloVe(d_i, c_e),
i = 1, 2, …, n; e = 1, 2, …, K    (31)
wherein d_i is a text in the dataset D = {d_1, d_2, …, d_n}, c_e is a clustering center (at the start of the algorithm, K texts are randomly selected from the dataset as the initial clustering centers c_e), λ is a fusion coefficient with 0 < λ < 1, and the value of λ is determined by the clustering effect.
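A minimal sketch of the fused distance of equation (31), assuming the two component similarities of claims 5 and 6 are already available as callables; fused_distance and lam are illustrative names, and the choice of λ (which the patent only says is tuned by the clustering effect) is left to the caller.

```python
def fused_distance(dis_btm, dis_glove, lam=0.5):
    """Return the fused distance function of equation (31).

    dis_btm(d_i, c_e)   -> topic-based distance (BTM + JS divergence, claim 5)
    dis_glove(d_i, c_e) -> word-vector distance (GloVe + WMD, claim 6)
    lam                 -> fusion coefficient λ, 0 < λ < 1 (tuned empirically)
    """
    assert 0.0 < lam < 1.0, "the fusion coefficient must lie strictly between 0 and 1"

    def dis(d_i, c_e):
        return lam * dis_btm(d_i, c_e) + (1.0 - lam) * dis_glove(d_i, c_e)

    return dis
```

The returned closure can then be handed to the clustering loop of claim 4 as the distance function designated in its Step2.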
CN201910770568.4A 2019-08-20 2019-08-20 Microblog hot topic discovery algorithm based on linear fusion of BTM and GloVe similarity Withdrawn CN111368072A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910770568.4A CN111368072A (en) 2019-08-20 2019-08-20 Microblog hot topic discovery algorithm based on linear fusion of BTM and GloVe similarity

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910770568.4A CN111368072A (en) 2019-08-20 2019-08-20 Microblog hot topic discovery algorithm based on linear fusion of BTM and GloVe similarity

Publications (1)

Publication Number Publication Date
CN111368072A true CN111368072A (en) 2020-07-03

Family

ID=71210025

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910770568.4A Withdrawn CN111368072A (en) 2019-08-20 2019-08-20 Microblog hot topic discovery algorithm based on linear fusion of BTM and GloVe similarity

Country Status (1)

Country Link
CN (1) CN111368072A (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112328735A (en) * 2020-11-11 2021-02-05 河北工程大学 Hot topic determination method and device and terminal equipment
CN112860883A (en) * 2021-02-08 2021-05-28 国网河北省电力有限公司营销服务中心 Electric power work order short text hot topic identification method and device and terminal
CN112860883B (en) * 2021-02-08 2022-06-24 国网河北省电力有限公司营销服务中心 Electric power work order short text hot topic identification method, device and terminal
CN113139599A (en) * 2021-04-22 2021-07-20 北方工业大学 Service distributed clustering method fusing word vector expansion and topic model
CN113139599B (en) * 2021-04-22 2023-08-08 北方工业大学 Service distributed clustering method integrating word vector expansion and topic model
CN113191147A (en) * 2021-05-27 2021-07-30 中国人民解放军军事科学院评估论证研究中心 Unsupervised automatic term extraction method, apparatus, device and medium
CN113591473A (en) * 2021-07-21 2021-11-02 西北工业大学 Text similarity calculation method based on BTM topic model and Doc2vec
CN113591473B (en) * 2021-07-21 2024-03-12 西北工业大学 Text similarity calculation method based on BTM topic model and Doc2vec
TWI825535B (en) * 2021-12-22 2023-12-11 中華電信股份有限公司 System, method and computer-readable medium for formulating potential hot word degree
CN116432639A (en) * 2023-05-31 2023-07-14 华东交通大学 News element word mining method based on improved BTM topic model
CN116432639B (en) * 2023-05-31 2023-08-25 华东交通大学 News element word mining method based on improved BTM topic model
CN117195004A (en) * 2023-11-03 2023-12-08 苏州市吴江区盛泽镇人民政府 Policy matching method integrating industry classification and wvLDA theme model
CN117195004B (en) * 2023-11-03 2024-02-06 苏州市吴江区盛泽镇人民政府 Policy matching method integrating industry classification and wvLDA theme model

Similar Documents

Publication Publication Date Title
CN111368072A (en) Microblog hot topic discovery algorithm based on linear fusion of BTM and GloVe similarity
CN107220352B (en) Method and device for constructing comment map based on artificial intelligence
CN106776711B (en) Chinese medical knowledge map construction method based on deep learning
Vitevitch What can graph theory tell us about word learning and lexical retrieval?
US11704501B2 (en) Providing a response in a session
CN111539197B (en) Text matching method and device, computer system and readable storage medium
Gómez-Adorno et al. Automatic authorship detection using textual patterns extracted from integrated syntactic graphs
CN111723295B (en) Content distribution method, device and storage medium
Graff et al. Evomsa: A multilingual evolutionary approach for sentiment analysis [application notes]
Cui et al. KNET: A general framework for learning word embedding using morphological knowledge
Marujo et al. Hourly traffic prediction of news stories
Kalabikhina et al. The measurement of demographic temperature using the sentiment analysis of data from the social network VKontakte
Yordanova et al. Automatic detection of everyday social behaviours and environments from verbatim transcripts of daily conversations
CN113934835A (en) Retrieval type reply dialogue method and system combining keywords and semantic understanding representation
CN116882414B (en) Automatic comment generation method and related device based on large-scale language model
CN111046662B (en) Training method, device and system of word segmentation model and storage medium
O'Connor Statistical Text Analysis for Social Science.
CN114048395B (en) User forwarding prediction method and system based on time perception and key information extraction
Ling Coronavirus public sentiment analysis with BERT deep learning
US20220269704A1 (en) Irrelevancy filtering
KR20070118154A (en) Information processing device and method, and program recording medium
Brown et al. On the problem of small objects
Wang et al. Emotional tagging of videos by exploring multiple emotions' coexistence
ElGindy et al. Capturing place semantics on the geosocial web
Zhang et al. On the need of hierarchical emotion classification: detecting the implicit feature using constrained topic model

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication
WW01 Invention patent application withdrawn after publication

Application publication date: 20200703