CN111368072A - Microblog hot topic discovery algorithm based on linear fusion of BTM and GloVe similarity - Google Patents

Microblog hot topic discovery algorithm based on linear fusion of BTM and GloVe similarity

Info

Publication number
CN111368072A
CN111368072A CN201910770568.4A
Authority
CN
China
Prior art keywords
word
text
btm
modeling
topic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201910770568.4A
Other languages
Chinese (zh)
Inventor
吴迪
张梦甜
生龙
黄竹韵
杨瑞欣
孙雷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hebei University of Engineering
Original Assignee
Hebei University of Engineering
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hebei University of Engineering filed Critical Hebei University of Engineering
Priority to CN201910770568.4A priority Critical patent/CN111368072A/en
Publication of CN111368072A publication Critical patent/CN111368072A/en
Withdrawn legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34Browsing; Visualisation therefor
    • G06F16/345Summarisation for human users
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a microblog hot topic discovery algorithm based on linear fusion of BTM and GloVe similarity. The algorithm comprises three stages: data acquisition and preprocessing, modeling, and clustering. Data are first acquired and preprocessed, the obtained data are then modeled, and the modeled data are finally clustered. The invention addresses the problem that the distance function of the K-means algorithm affects the microblog hot-topic clustering result. The GloVe model trains only the non-zero elements of the word-word co-occurrence matrix rather than the whole sparse matrix to exploit statistical information, which effectively alleviates the sparsity problem of the TF-IDF algorithm in constructing the document-word vector matrix. The GloVe model combines a global matrix factorization method with a local context-window method; the trained word vectors carry more semantic information, which to some extent relieves the polysemy problem that the BTM topic model cannot handle well.

Description

Microblog hot topic discovery algorithm based on linear fusion of BTM and GloVe similarity
Technical Field
The invention relates to the technical field of topic discovery and tracking in natural language processing, and in particular to a microblog hot topic discovery algorithm based on linear fusion of BTM and GloVe similarity.
Background
With the rapid development of the traditional internet and the mobile internet, microblogging has grown vigorously. Microblogs allow users to publish messages through web pages, external programs, mobile-phone clients and other channels, thereby sharing information. The brevity, timeliness and interactivity of microblogs have been accepted by the public, and microblogs have gradually become an important tool for people to obtain and publish information. How to mine hot topics from massive, unordered microblog data has therefore become a problem to be solved urgently.
To solve this problem, several methods already exist: representing the microblog text set with TF-IDF vectors and then discovering hot topics with a clustering algorithm; vectorizing the text set with topic models such as LDA and BTM and then discovering hot topics with a clustering algorithm; and modeling microblog short texts with a BTM topic model together with an improved TF-IDF algorithm, computing text similarity based on BTM modeling with JS divergence, computing text similarity based on the improved TF-IDF algorithm with cosine distance, and finally linearly fusing the two similarities and clustering with K-means to obtain hot topics. These methods are briefly described below:
(1) BTM topic model
LDA is a document topic generation model mainly used to identify latent topics in large document collections without supervision. However, traditional topic models (such as LDA and PLSA) implicitly capture document-level word co-occurrence patterns to reveal topics; modeling a large number of short texts therefore causes a severe data-sparsity problem, so these models suit long texts well but adapt poorly to short texts. To address this, Yan Xiaohui et al. proposed the Biterm Topic Model (BTM) for short texts, which explicitly models word co-occurrence patterns (word pairs, or biterms) to improve the ability to learn topics and, at the same time, learns topics over the aggregated word pairs of the whole corpus to overcome the data-sparsity problem of document-level word co-occurrence patterns.
In the BTM graphical model, α and β are hyper-parameters, |B| denotes the number of word pairs in the whole corpus, K denotes the number of topics, θ denotes the topic distribution of the whole corpus, φ denotes the topic-feature-word distributions of the whole corpus, Z denotes the set of topic assignments, and w_i, w_j denote the two different words that make up a word pair.
The process by which the BTM topic model generates a corpus is as follows:
For each topic z, sample the topic-feature-word distribution: φ_z ~ Dirichlet(β);
For the whole corpus, sample a global topic distribution: θ ~ Dirichlet(α);
For each word pair b = (w_i, w_j) in the word-pair set B:
sample one topic z from the corpus topic distribution θ: z ~ Mult(θ);
draw two words w_i, w_j from the sampled topic z to form the word pair: b = (w_i, w_j) ~ Mult(φ_z).
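A minimal sketch of this generative process, using numpy; the values of K, G (vocabulary size), the number of word pairs, α and β below are example assumptions, not values prescribed by the patent.

import numpy as np

rng = np.random.default_rng(0)
K, G, n_biterms, alpha, beta = 3, 50, 200, 1.0, 0.01

theta = rng.dirichlet([alpha] * K)                 # global topic distribution θ ~ Dirichlet(α)
phi = rng.dirichlet([beta] * G, size=K)            # topic-word distributions φ_z ~ Dirichlet(β), one row per topic

biterms = []
for _ in range(n_biterms):
    z = rng.choice(K, p=theta)                     # sample a topic z ~ Mult(θ)
    w_i, w_j = rng.choice(G, size=2, p=phi[z])     # draw two words from φ_z to form the word pair
    biterms.append((z, w_i, w_j))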
From the corpus generation process, the joint probability of a word pair b = (w_i, w_j) is:

P(b) = Σ_z P(z) P(w_i|z) P(w_j|z) = Σ_z θ_z · φ_{i|z} · φ_{j|z}    (1)

where P(z) = θ_z is the probability of topic z, P(w_i|z) = φ_{i|z} is the probability of feature word w_i under topic z, and P(w_j|z) = φ_{j|z} is the probability of feature word w_j under topic z. The probability of the whole corpus B is:

P(B) = Π_{(i,j)} Σ_z θ_z · φ_{i|z} · φ_{j|z}    (2)
Because BTM does not directly yield the topic proportions of a document, in order to infer the topics contained in a document it is assumed that the topic proportion of a document equals the topic proportion of the word pairs generated by that document:

P(z|d) = Σ_b P(z|b) P(b|d)    (3)

where P(z|d) is the document-topic distribution, P(b|d) is the document-word-pair distribution, and P(z|b) is the word-pair-topic distribution. P(z|b) can be calculated with the Bayesian formula from the parameter values estimated by BTM:

P(z|b) = θ_z φ_{i|z} φ_{j|z} / Σ_{z'} θ_{z'} φ_{i|z'} φ_{j|z'}    (4)

The value of P(b|d) is estimated with the empirical distribution of word pairs in the document:

P(b|d) = n_d(b) / Σ_b n_d(b)    (5)

where n_d(b) denotes the frequency with which word pair b appears in document d.
BTM also uses Gibbs sampling when solving for θ and φ. Applying the chain rule to the joint probability of the whole data set gives the conditional probability:

P(z | z_{-b}, B) ∝ (n_z + α) · (n_{w_i|z} + β)(n_{w_j|z} + β) / (Σ_w n_{w|z} + Gβ)²    (6)

where n_z is the number of word pairs assigned to topic z, z_{-b} is the topic assignment of all word pairs except word pair b, n_{w|z} is the number of times feature word w is assigned to topic z, and G is the vocabulary size. Finally, the topic distribution θ_z and the topic-word distribution φ_{w|z} can be estimated as:

θ_z = (n_z + α) / (|B| + Kα)    (7)

φ_{w|z} = (n_{w|z} + β) / (Σ_w n_{w|z} + Gβ)    (8)
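A compact illustrative sketch of the collapsed Gibbs sampler implied by formulas (6)-(8); `biterms` is assumed to be a list of (w_i, w_j) word-index pairs, K the number of topics and G the vocabulary size. This is an assumed implementation for illustration, not the patent's code.

import numpy as np

def btm_gibbs(biterms, K, G, alpha, beta, n_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    z = rng.integers(K, size=len(biterms))            # random initial topic for every word pair
    n_z = np.bincount(z, minlength=K).astype(float)   # word pairs assigned to each topic
    n_wz = np.zeros((K, G))                           # word counts per topic
    for (wi, wj), zb in zip(biterms, z):
        n_wz[zb, wi] += 1
        n_wz[zb, wj] += 1
    for _ in range(n_iter):
        for idx, (wi, wj) in enumerate(biterms):
            zb = z[idx]                               # remove the current assignment
            n_z[zb] -= 1
            n_wz[zb, wi] -= 1
            n_wz[zb, wj] -= 1
            denom = (n_wz.sum(axis=1) + G * beta) ** 2
            p = (n_z + alpha) * (n_wz[:, wi] + beta) * (n_wz[:, wj] + beta) / denom
            zb = rng.choice(K, p=p / p.sum())         # resample the topic from formula (6)
            z[idx] = zb
            n_z[zb] += 1
            n_wz[zb, wi] += 1
            n_wz[zb, wj] += 1
    theta = (n_z + alpha) / (len(biterms) + K * alpha)                    # formula (7)
    phi = (n_wz + beta) / (n_wz.sum(axis=1, keepdims=True) + G * beta)    # formula (8)
    return theta, phi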
(2) JS divergence
KL divergence is a commonly used text-similarity measure, calculated as:

D_KL(p ‖ q) = Σ_y p_y log(p_y / q_y)    (9)

where p and q are two probability distributions and p_y, q_y are their probabilities at value y.
Because KL divergence is asymmetric, it introduces some inaccuracy when measuring text similarity. JS divergence improves on KL divergence by removing the asymmetry, and is calculated as:

D_JS(p ‖ q) = ½ D_KL(p ‖ (p+q)/2) + ½ D_KL(q ‖ (p+q)/2)    (10)
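A small helper illustrating formulas (9) and (10), assuming p and q are discrete distributions over the same support; the epsilon smoothing is an added assumption to avoid division by zero.

import numpy as np

def kl_divergence(p, q, eps=1e-12):
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    return float(np.sum(p * np.log(p / q)))       # Σ p_y log(p_y / q_y)

def js_divergence(p, q):
    m = 0.5 * (np.asarray(p, dtype=float) + np.asarray(q, dtype=float))
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)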
(3) GloVe global vector model
Macroscopically, word-vector methods fall into two categories. One relies on matrix factorization, such as LSA, which uses a word co-occurrence matrix to capture similarity between words but cannot solve the polysemy problem; the other is shallow window-based methods, such as skip-gram and CBOW, which only scan context windows over the corpus and cannot exploit global statistical information. The GloVe global vector model proposed at Stanford University is a global log-bilinear regression model that combines the advantages of global matrix factorization and the local context-window method: it exploits statistical information by training only the non-zero elements of the word co-occurrence matrix, produces a vector space with meaningful substructure, and expresses the semantic similarity of words through differences between word vectors along their dimensions.
Considering that two semantically similar words tend to have a high co-occurrence ratio, the GloVe model learns semantic similarity between words from ratios of co-occurrence probabilities rather than from the probabilities themselves. For example, "ice" and "steam" are both states of water, but "ice" is semantically close to "solid" while "steam" is not, so the co-occurrence ratio of "ice" with "solid" is much higher than that of "steam" with "solid". The relationship between word vectors and the co-occurrence matrix is:
v_s^T ṽ_t + b_s + b̃_t = log(X_st)    (11)

where X_st is the number of times word t appears in the context of word s, v_s is the word vector of word s, ṽ_t is the separate context-word vector produced by another instance of the model, and b_s and b̃_t are the biases of the two word vectors, used to keep the equation symmetric.
To reduce the influence of noisy data such as low-frequency co-occurring words, equation (11) is converted into a least-squares problem and a weighting function f(X_st) is introduced, giving the loss function:

J = Σ_{s,t=1}^{G} f(X_st) (v_s^T ṽ_t + b_s + b̃_t − log X_st)²    (12)

where G is the vocabulary size. The weighting function f(x) is defined as:

f(x) = (x / x_max)^η if x < x_max, and f(x) = 1 otherwise    (13)

The authors give the parameter values x_max = 100 and η = 3/4.
(4) WMD (Word Mover's Distance)
The WMD distance takes the minimum cumulative distance that the feature-word vectors of one short text must travel to reach the feature-word vectors of another short text as the similarity between the two short texts.
WMD measures the similarity between documents through weight transfer amounts and word conversion costs; the WMD distance between short texts d_i and d_j can be expressed as:

Dis_WMD(d_i, d_j) = min_{T ≥ 0} Σ_{s,t=1}^{G} T_st · c(s, t)    (14)
subject to the following constraints:

Σ_{t=1}^{G} T_st = d_{i,s}, for all s ∈ {1, 2, …, G}    (15)

Σ_{s=1}^{G} T_st = d_{j,t}, for all t ∈ {1, 2, …, G}    (16)

where d_{i,s} is the weight of word s as a feature word of short text d_i, d_{j,t} is the weight of word t as a feature word of short text d_j, and G is the vocabulary size.
In formula (14), c(s, t) is the word conversion cost, i.e. the cost of moving word s in d_i to word t in d_j, calculated as:

c(s, t) = ||v_s − v_t||_2    (17)

where v_s and v_t are the word vectors of words s and t, respectively.
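A minimal sketch of the word conversion cost (17), assuming `vectors` is a dict mapping each word to its word vector; the helper name is illustrative only.

import numpy as np

def conversion_cost(vectors, words_i, words_j):
    """Return the cost matrix C[s, t] = ||v_s - v_t||_2 between the words of two short texts."""
    return np.array([[np.linalg.norm(vectors[s] - vectors[t]) for t in words_j]
                     for s in words_i])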
(5) K-means clustering algorithm
K-means is a partition-based, centroid-based clustering algorithm. It was first proposed by MacQueen in 1967, who defined the centroid of a cluster as the mean of the points within the cluster. The algorithm works as follows: randomly select k objects of the data set D as the initial centers of k clusters; compute the distance between every other object and each cluster center with a distance function and assign each object to the most similar cluster; then update the cluster centers and reassign the objects; iterate until the cluster centers converge.
The K-means algorithm uses the Euclidean distance as its distance function:

Dis(d_i, c_e) = ||d_i − c_e||_2    (18)

where d_i is a text in data set D = {d_1, d_2, …, d_n} and c_e is the center of cluster C_e. The cluster centers are updated as:

c_e = (1 / |C_e|) Σ_{d_i ∈ C_e} d_i, e = 1, 2, …, K    (19)

where n is the total number of texts in the data set and K is the number of clusters.
To judge whether the cluster centers have converged, the algorithm uses the sum of squared errors as the criterion function:

SSE = Σ_{e=1}^{K} Σ_{d_i ∈ C_e} ||d_i − c_e||²    (20)
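A compact sketch of this K-means procedure, parameterised by an arbitrary distance function so that a fused distance such as formula (31) can be plugged in later; the mean-based centre update assumes the texts are represented as vectors and that every cluster stays non-empty. The function name and interface are assumptions for illustration.

import numpy as np

def kmeans(X, k, dist_fn, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]        # random initial centers
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # assign each text to the most similar (closest) cluster
        labels = np.array([np.argmin([dist_fn(x, c) for c in centers]) for x in X])
        new_centers = np.array([X[labels == e].mean(axis=0) for e in range(k)])
        if np.allclose(new_centers, centers):                     # centers no longer change
            break
        centers = new_centers
    sse = sum(dist_fn(x, centers[l]) ** 2 for x, l in zip(X, labels))   # criterion function
    return labels, centers, sse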
the existing common method is that a BTM topic model and an improved TF-IDF algorithm are used for respectively modeling microblog short texts, then JS divergence is used for calculating text similarity based on BTM modeling, cosine distance is used for calculating text similarity based on the improved TF-IDF algorithm, and finally the two similarities are subjected to linear fusion and K-means clustering is used for obtaining hot topics.
However, this method has the following problems:
1. TF-IDF is based only on statistical information such as word frequency, so it ignores the semantic information of words and harms the precision of topic clustering.
2. The research object here is a large number of microblog short texts, and the document-word vector matrix constructed by the TF-IDF algorithm is highly sparse, which hurts the running efficiency of the algorithm.
Disclosure of Invention
In view of the above technical problems, the invention provides a microblog hot topic discovery algorithm based on linear fusion of BTM and GloVe similarity. The algorithm comprises three stages: data acquisition and preprocessing, modeling, and clustering. Data are first acquired and preprocessed, the obtained data are then modeled, and the modeled data are finally clustered;
the processing method of the data acquisition and preprocessing stage is as follows: the microblog open platform provides a microblog API; by creating an application on the open platform, a developer can use his or her own App key and App secret to pass OAuth 2.0 user authorization and then call the API to obtain microblog data in Java, C++, Python and other environments; microblog text preprocessing mainly comprises four parts: microblog text filtering, word segmentation and part-of-speech tagging, stop-word removal, and feature selection;
the modeling stage is divided into two parts: the first uses the BTM topic model for topic modeling, represents the original text set with the modeling result, and finally calculates text similarity with JS divergence; the second uses the GloVe global vector model for word-vector modeling and then calculates text similarity with the WMD (Word Mover's Distance) with improved word weights;
the clustering stage linearly fuses the similarities calculated from the two modeling approaches, BTM topic modeling and GloVe word-vector modeling, so as to improve the distance function of the K-means algorithm; the distance function based on the fused similarity is shown in formula (31), and the corresponding clustering criterion function is:

SSE = Σ_{e=1}^{K} Σ_{d_i ∈ C_e} Dis(d_i, c_e)²    (27)

The execution order of the microblog hot topic discovery algorithm based on linear fusion of BTM and GloVe similarity is: BTM-JS → GloVe-WMD → BG&SLF-Kmeans. The specific method is as follows:
first, acquire microblog data and perform text preprocessing; second, model the preprocessed microblog text set with the BTM topic model and, after text representation, calculate the topic-based text similarity after BTM modeling with JS divergence (BTM-JS); third, model the preprocessed microblog text set with the GloVe global vector model and calculate the word-vector-based text similarity after GloVe modeling with the improved WMD distance (GloVe-WMD); finally, linearly fuse the two similarities and apply the fused-similarity distance function to the K-means clustering algorithm to discover microblog hot topics (BG&SLF-Kmeans).
The specific flow of microblog text preprocessing is as follows:
(1) microblog text filtering
First, delete useless information such as emoticons, links and mentions, e.g. "@user", "tap the picture ↓ to learn more", "more >>>", which would seriously interfere with the subsequent text analysis; second, delete ultra-short microblogs of fewer than 10 words, which are generally used to express a user's mood rather than to describe a hot topic; finally, remove all punctuation marks in the data set;
(2) word segmentation and part-of-speech tagging
Chinese word segmentation is a very important technique in natural language processing; it cuts a sequence of Chinese characters into individual words according to certain norms. Commonly used Chinese word-segmentation tools include ICTCLAS (NLPIR), Ansj, Jieba and THULAC. Before segmentation, a user-defined dictionary (generally including new internet words, hot words, professional terms, person names, place names, etc.) is added according to the particularity of the data set to improve segmentation accuracy;
(3) stop-word removal
After word segmentation, the text set becomes a word set. Some words have no actual meaning and only serve as sentence connectors or to express the author's mood, such as the particles "o" and "ya"; if retained, they increase the dimensionality of the data set, raise the running cost of the algorithm and degrade the final short-text clustering result, so a stop-word list is used to delete these meaningless words;
(4) feature selection
After part-of-speech tagging, the part of speech of each word is appended after the word, e.g. "word/v" for a verb. Words of different parts of speech carry different amounts of information; since this work studies hot-topic discovery, words of some parts of speech, such as adjectives and adverbs, contribute less. To improve the running efficiency of the algorithm, the experimental data keeps only nouns and verbs, and words of the remaining parts of speech are filtered out as useless words.
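A minimal preprocessing sketch using the Jieba segmenter mentioned above: keep only nouns and verbs after POS tagging and drop stop words. The stop-word set and the user-dictionary path are assumptions to be supplied by the caller; the function name is illustrative.

import jieba
import jieba.posseg as pseg

def preprocess(text, stopwords=frozenset(), user_dict=None):
    if user_dict:
        jieba.load_userdict(user_dict)      # new internet words, hot words, person/place names, ...
    words = []
    for pair in pseg.cut(text):
        # keep nouns (POS flags starting with 'n') and verbs (flags starting with 'v')
        if pair.flag and pair.flag[0] in ("n", "v") and pair.word not in stopwords:
            words.append(pair.word)
    return words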
The concrete flow of the modeling stage is as follows:
(1) text similarity measurement based on BTM topic modeling
The text-similarity measurement process based on BTM topic modeling is divided into two parts. First part: calculate the optimal number of topics K with the perplexity formula, perform BTM topic modeling on the preprocessed microblog text set, and finally represent the texts according to the modeling result. Second part: calculate text similarity with JS divergence;
1) BTM topic modeling
Because the choice of the number of topics K directly affects the BTM modeling result, the K that optimizes the modeling result must be determined before modeling. The optimal K can be determined with the perplexity, which evaluates the generalization ability of the model; the lower the perplexity, the better the modeling effect. The perplexity is calculated as:

Perplexity(B) = exp( − Σ_{b∈B} ln p(b) / |B| )    (21)
where B denotes the word-pair set and p(b) denotes the probability of generating word pair b:

p(b) = Σ_z θ_z · φ_{i|z} · φ_{j|z}    (22)

where P(z) = θ_z is the probability of topic z, P(w_i|z) = φ_{i|z} is the probability of feature word w_i under topic z, and P(w_j|z) = φ_{j|z} is the probability of feature word w_j under topic z;
after the number of topics K is determined, α = 50/K and β = 0.01 are set empirically, and the topic distribution θ_z and topic-word distribution φ_{w|z} are then obtained from formulas (7) and (8).
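A sketch of the perplexity formula (21), assuming `theta` (shape K) and `phi` (shape K x G) are the distributions estimated by a BTM Gibbs sampler such as the sketch above, and `biterms` is a list of (w_i, w_j) word-index pairs; the K with the smallest perplexity over the candidate values is then chosen.

import numpy as np

def btm_perplexity(theta, phi, biterms):
    log_p = 0.0
    for w_i, w_j in biterms:
        p_b = float(np.sum(theta * phi[:, w_i] * phi[:, w_j]))   # p(b) = Σ_z θ_z φ_{i|z} φ_{j|z}
        log_p += np.log(p_b)
    return float(np.exp(-log_p / len(biterms)))                  # lower perplexity is better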
2) JS divergence
After BTM modeling is completed, for each document the 6 feature words with the highest probability in the topic-word distribution p(w|z) under the maximum-probability topic of the document-topic distribution p(z|d) are selected as the document's feature words; this reduces dimensionality, and hence algorithm complexity, while preserving document semantics as far as possible. For document d_i, the document vector based on BTM topic modeling can be represented by the K document-topic probabilities:

d_{i_BTM} = {p(z_1|d_i), p(z_2|d_i), ..., p(z_K|d_i)}    (23)

Calculating the similarity between two documents d_i and d_j is thus converted into calculating the similarity between the two document-topic vectors d_{i_BTM} and d_{j_BTM}, measured here with the commonly used JS divergence. The text-similarity formula based on BTM topic modeling and JS divergence is:

Dis_BTM(d_i, d_j) = ½ D_KL(d_{i_BTM} ‖ m) + ½ D_KL(d_{j_BTM} ‖ m), where m = (d_{i_BTM} + d_{j_BTM}) / 2    (24)
where the KL divergence is calculated as:

D_KL(p ‖ q) = Σ_h p_h log(p_h / q_h)    (25)

where p and q are two probability distributions and p_h, q_h are the probabilities of the first 6 feature words;
(2) text similarity measurement based on GloVe word vector modeling
The text similarity measurement process based on GloVe word vector modeling is divided into two parts, firstly, GloVe word vector modeling is carried out on a preprocessed microblog text set, and then the similarity between texts is calculated by using WMD distance;
1) GloVe word vector modeling
Before training word vectors, the GloVe model counts, within a context window of a given size, the number of times each target word v_s co-occurs with its context words ṽ_t over the whole corpus to construct the word co-occurrence matrix X_st. In the original paper, two parameters of the model, the word-vector dimension vector_size and the window size window_size, were compared experimentally using three indicators (semantic accuracy, syntactic accuracy and overall accuracy), with the conclusion that vector_size = 300 performs best and the best window_size lies roughly between 6 and 10. Since the data set used by the source code is an English data set, the parameters are set to vector_size = 300 and window_size = 8 here, considering the particularity of the Chinese microblog short-text data set;
2) WMD distance with improved word weights
After GloVe word-vector modeling, the similarity between texts is calculated with the WMD distance. The text-similarity formula for short texts d_i and d_j based on GloVe word-vector modeling and the WMD distance is:

Dis_GloVe(d_i, d_j) = min_{T ≥ 0} Σ_{s,t=1}^{G} T_st · c(s, t)    (26)

where G is the vocabulary size and T_st is a G-order weight transfer matrix whose entries indicate how much of the weight of word s in d_i is transferred to word t in d_j. When word s of text d_i is completely transferred to text d_j, Σ_t T_st = weight_s, where the weight transfer amount weight_s is measured by the weight calculation formula that fuses word-position contribution; the formula above must also satisfy the two constraints of formulas (15) and (16).
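A sketch, under assumptions, of computing the minimum-cost weight transfer in formula (26) as a small transport linear program with SciPy; `w_i` and `w_j` are taken to be the weight_s vectors of the two texts (normalised so that their sums match, otherwise the program is infeasible), and `C` is the cost matrix of formula (17). The function name is illustrative, not from the patent.

import numpy as np
from scipy.optimize import linprog

def wmd_distance(w_i, w_j, C):
    n, m = len(w_i), len(w_j)
    # equality constraints: sum_t T_st = w_i[s]  and  sum_s T_st = w_j[t]
    A_eq = np.zeros((n + m, n * m))
    for s in range(n):
        A_eq[s, s * m:(s + 1) * m] = 1.0
    for t in range(m):
        A_eq[n + t, t::m] = 1.0
    b_eq = np.concatenate([w_i, w_j])
    res = linprog(C.reshape(-1), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun   # the minimum cumulative transfer cost, i.e. the WMD distance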
The text similarity measurement based on BTM topic modeling and JS divergence (BTM-JS) is as follows:
Input: microblog short text set D = {d_1, d_2, …, d_n}, hyper-parameters α and β
Output: topic-based text similarity Dis_BTM(d_i, d_j)
Step1. Determine the optimal number of topics K according to formula (21)
Step2. Randomly assign initial topics to all word pairs
Step3. for d_i ∈ D do
Step4. for b ∈ B do
Step5. Assign a topic z_b to each word pair according to formula (6), and update n_z, n_{w_i|z} and n_{w_j|z}
Step6. Calculate the topic distribution θ_z and topic-word distribution φ_{w|z} according to formulas (7) and (8)
Step7. Select feature words for each document according to formula (23) and represent the texts as vectors
Step8. Calculate the topic-based text similarity Dis_BTM(d_i, d_j) according to formula (24) and output it.
The text similarity measurement based on GloVe word vector modeling and WMD distance (GloVe-WMD) is as follows:
Input: microblog short text set D = {d_1, d_2, …, d_n}
Output: word-vector-based text similarity Dis_GloVe(d_i, d_j)
Step1. Construct the word co-occurrence matrix X_st of the microblog short-text set
Step2. Model with GloVe based on the word co-occurrence matrix X_st to obtain the word-vector set V = {v_1, v_2, …, v_G}
Step3. for s = 1 to G do
Step4. Judge whether word s is a title word or a text word according to l_s < 10, and update c_{s_title} and c_{s_text}
Step5. Take formula (28) as the word-weight calculation formula and calculate the weight transfer amount weight_s
Step6. Calculate the weight transfer matrix T_st according to Σ_t T_st = weight_s
Step7. for v_s, v_t ∈ V do
Step8. Calculate the word conversion cost c(s, t) according to formula (17)
Step9. Calculate the word-vector-based text similarity Dis_GloVe(d_i, d_j) according to formula (26) and output it.
The hot topic discovery algorithm based on BG&SLF-Kmeans is as follows:
Input: microblog short text set D = {d_1, d_2, …, d_n}, the number of clusters K calculated with the perplexity formula (21)
Output: K clusters
Step1. Randomly select K short texts from data set D = {d_1, d_2, …, d_n} as the initial cluster centers c_e, e = 1, 2, …, K
Step2. Take formula (31) as the distance function
Step3. repeat
Step4. Calculate the distance Dis(d_i, c_e) between each remaining short text d_i and each cluster center c_e, and assign each short text to the most similar cluster
Step5. Recalculate the K cluster centers c_e according to formula (19)
Step6. until the clustering criterion function (27) converges.
The beneficial effects of the invention are as follows: aiming at the problem that the distance function of the K-means algorithm affects the microblog hot-topic clustering result, the invention provides a microblog hot topic discovery algorithm based on linear fusion of BTM and GloVe similarity.
The GloVe model trains only the non-zero elements of the word-word co-occurrence matrix rather than the whole sparse matrix to exploit statistical information, which effectively alleviates the sparsity problem of the TF-IDF algorithm in constructing the document-word vector matrix. The GloVe model combines a global matrix factorization method with a local context-window method; the trained word vectors carry more semantic information, which to some extent relieves the polysemy problem that the BTM topic model cannot handle well.
Drawings
FIG. 1 is a flow chart of microblog hot topic discovery based on BTM and GloVe similarity linear fusion.
FIG. 2 is a flow chart of microblog text preprocessing according to the invention.
FIG. 3 is a diagram showing, in an embodiment of the invention, the similarity between the short texts "Kobe meets fans in Los Angeles" and "The NBA star meets American fans".
Detailed Description
Example 1
For a microblog news short text, the news title is located at the very front and is marked with double "#" signs or brackets; the remaining content is the body. The news title serves to summarize the news content.
Definition 1 (title words and text words). Take the first 10 words of the preprocessed microblog short text as the title and the rest as the body; that is, if the column index l_s of word s is less than 10, s is a title word, otherwise s is a text word.
Example 1: "The primary-school volunteers of Hongkou Primary School walk onto the street to help the elderly cross the road, starting from themselves, caring for everyone and everything, and spreading the Lei Feng spirit." In the preprocessed microblog news short text, the first 6 words, from "Hongkou Primary School" to "elder", are title words and the following words are text words.
When calculating the weight transfer amount of a word, the WMD distance is measured only by the TF value, which is relatively crude: some words occur frequently but contribute little to topic discovery, and the TF value alone can hardly reflect such differences accurately. Meanwhile, title words and text words differ in importance, so the positional factor of a word should also be taken into account.
Definition 2 (weight calculation formula fusing word-position contribution). The weight transfer amount of a word is calculated with the TF-IDF value of the word, setting the position contribution of title words to γ_1 = 1.5 and the position contribution of text words to γ_2 = 1. Some words may be both title words and text words, so the weight calculation formula fusing word-position contribution is:

weight_s = ((γ_1 · c_{s_title} + γ_2 · c_{s_text}) / c_s) · tf_s · idf_s    (28)
where c_s denotes the total number of occurrences of word s, c_{s_title} the number of times s occurs as a title word, c_{s_text} the number of times s occurs as a text word, and c_{s_title} + c_{s_text} = c_s. tf_s and idf_s are calculated as:

tf_s = c_s / Σ_{t=1}^{G} c_t    (29)

idf_s = log( |D| / |{i : s ∈ d_i}| )    (30)
in the tf_s formula, G is the vocabulary size; in the idf_s formula, |D| denotes the number of texts in the short-text set and |{i : s ∈ d_i}| denotes the number of texts containing word s.
Example 2: A data set has 100 microblog short texts with 1000 words in total, in which the word "explosion" appears in 9 short texts and 20 times in total, 15 times in titles and 5 times in bodies. If the TF value is used, weight_explosion = 20/1000 = 0.02; if the weight calculation formula fusing word-position contribution is used, weight_explosion = (1.5 × 15 + 1 × 5)/1000 × idf_explosion = 0.0275 × ln(100/9) ≈ 0.066.
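A sketch of the position-weighted transfer amount, following formula (28) and the tf/idf definitions as reconstructed above (natural logarithm assumed); the γ values are from Definition 2, and the function signature is an assumption for illustration.

import math

def fused_weight(c_title, c_text, total_words, docs_total, docs_with_word,
                 gamma1=1.5, gamma2=1.0):
    c_s = c_title + c_text                                   # total occurrences of word s
    tf = c_s / total_words                                   # formula (29)
    idf = math.log(docs_total / docs_with_word)              # formula (30)
    position = (gamma1 * c_title + gamma2 * c_text) / c_s    # fused position contribution
    return position * tf * idf                               # formula (28)

# Example 2 above: 15 title and 5 body occurrences out of 1000 words, in 9 of 100 texts:
# fused_weight(15, 5, 1000, 100, 9) ≈ 0.066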
For a clustering algorithm, accurately calculating the distance between a text and each cluster center, so as to determine the cluster to which the text belongs, is very important; the choice of distance function therefore plays a significant role in the clustering result.
Definition 3 (distance function fusing similarity). Given the text similarity Dis_BTM(d_i, d_j) based on BTM topic modeling and JS divergence and the text similarity Dis_GloVe(d_i, d_j) based on GloVe word-vector modeling and the improved WMD distance, the distance function fusing the similarities is:

Dis(d_i, c_e) = λ · Dis_BTM(d_i, c_e) + (1 − λ) · Dis_GloVe(d_i, c_e), i = 1, 2, …, n; e = 1, 2, …, K    (31)
where d_i is a text in data set D = {d_1, d_2, …, d_n}, c_e is a cluster center (at the start of the algorithm K texts are randomly selected from the data set as the initial cluster centers c_e), and λ is the fusion coefficient with 0 < λ < 1; the value of λ is determined by the clustering effect.
Example 3: Assume the text similarity based on BTM topic modeling and JS divergence is Dis_BTM(d_i, c_e) = 0.76 and the text similarity based on GloVe word-vector modeling and the improved WMD distance is Dis_GloVe(d_i, c_e) = 0.64. With fusion coefficient λ = 0.7, the fused-similarity distance is Dis(d_i, c_e) = 0.7 × 0.76 + 0.3 × 0.64 = 0.724.
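A minimal sketch of the fused distance of formula (31); dis_btm and dis_glove are assumed to be the two precomputed similarity values for a text-centre pair, and the function name is illustrative.

def fused_distance(dis_btm, dis_glove, lam=0.7):
    assert 0.0 < lam < 1.0           # fusion coefficient, chosen by clustering effect
    return lam * dis_btm + (1.0 - lam) * dis_glove

# Example 3 above: fused_distance(0.76, 0.64, 0.7) == 0.724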
Example 2
The collected microblog text set comprises:
[Official announcement!] The playbill of the # Tribute to the Angels in White public-welfare cloud concert # is formally released! Every song is requested by the medical staff. There is also a mystery surprise performance that will not disappoint!
[A blockbuster line-up!] The poster for the Angels in White concert: tonight at 19:30, the # Tribute to the Angels in White public-welfare cloud concert # goes live! A strong guest line-up sings songs of love. Every song is requested by the medical staff! Thank you for risking your lives for us; let us sing a song for you! Forward and look forward to it!
The primary-school volunteers of Hongkou Primary School walk onto the street to help the elderly cross the road, starting from themselves, caring for everyone and everything, and spreading the Lei Feng spirit.
# Mother's Day # While growing up, she always wants to go far away, yet for years she has relied on her mother to listen to her.
# Mother's Day contest # They say the love between mother and us is a gradual journey apart. If that is the cost of growing up, then at least today, please pause your steps away and look back at the mother we leave behind.
"Happy Mother's Day" # No matter how old you are, you are always a child in your mother's eyes. With your mother there, you always have a home! Don't forget to call her and tell her: "Mom, I love you!" Then multiply the care you give her by 365.
Pre-processed data set:
official announce/vocation/custom/angel/concert/playbill/release/song/medic/request/show/expect lineup/vocation/angel/concert/playbill/poster/tonight/vocation/angel/public welfare/cloud/concert/live broadcast/start/guest/lineup/song/medic/request/thank/award/give/offer/song/medic/request/thank/award/show/offer/song/forward/expect/ask/forward/ask
Hongkou Primary School/volunteer/walk/street/help/elder/cross/road/self/care/person/care/thing/spread/Lei Feng spirit
Mother's day/small time/want/last life/want/in/mom/body/listen/fine year mother's day/tell/show/mother/love/separate/grow/cost/in/today/pause/step/look/behind/watch/mother/day
Mother's festival/happy/in/mom/eye/child/mother/in/present/home/forget/make/call/say/mom/love/hold/love/multiply/care
BTM topic modeling:
Running the model-perplexity code, assume the algorithm yields the following results:
When K = 1, Perplexity = 193;
When K = 2, Perplexity = 162;
When K = 3, Perplexity = 151;
When K = 4, Perplexity = 165;
When K = 5, Perplexity = 186;
Thus the model perplexity is smallest when the number of topics K = 3, so the optimal number of topics is K = 3.
When K is 3, it is assumed that running the code of the BTM model yields the following results:
(1) document-topic distribution
(2) Topic distribution, topic-word distribution (only the first 6 words with the largest proportion are retained)
Calculating JS divergence:
The texts are represented by their feature words as follows:
angel/offer/concert/medical staff/public welfare/expectation
Angel/offer/concert/medical staff/public welfare/expectation
Hongkou Primary School/volunteer/help/elder/road/Lei Feng spirit
Mother's festival/love/happy/caucasian/mother/child
Mother's festival/love/happy/caucasian/mother/child
Mother's festival/love/happy/caucasian/mother/child
For example, calculate the JS divergence between the first document and the third document.
The probability distribution of d_1 is: 0.105416, 0.073215, 0.060613, 0.042846, 0.036125, 0.021653.
The probability distribution of d_3 is: 0.1158155, 0.09243, 0.0846325, 0.073604, 0.06388, 0.038933.
The JS divergence of the first and third documents is then obtained from formula (24).
GloVe word vector modeling:
Running the GloVe code, a word vector for each word can be obtained; assume the approximate result is as follows (in part):
Okinawa: -0.004115362 0.002320011 0.0010127611 -0.00042312752 -0.004730146
Rhizoma bletilla: 0.003889411 -0.0027805932 -0.0019863483 0.8018722716 -0.00267
Concert: -0.0048637153 -0.004752655 0.0040961024 -0.0034338231 -0.001136
Song: 0.0031873875 0.0027359112 -0.0035361357 0.0035860823 -0.0028235812
Mother's day: 0.0002723677 -0.0013081346 0.00116098 -0.0015890214 -0.00448590
Mother: 0.0006203476 0.0030022748 -0.004656621 0.0019886396 -0.0003756053
Mom: -0.0008758595 -0.0037317497 0.0019768209 -0.0020630183 0.00242962
Medical personnel: -0.00038896065 -0.0025833468 0.0021852236 -0.0037763647 0.000
Request a song: -2.3274260e-05 0.0017746754 0.0022264344 -0.00090659875 -2.9263224
Calculating the WMD distance:
Using formula (26), the WMD distance can be calculated, where the word conversion cost is c(s, t) = ||v_s − v_t||_2 (v_s and v_t are the GloVe word vectors of words s and t, respectively), and the weight transfer matrix T_st satisfies Σ_t T_st = weight_s, with the weight transfer amount weight_s calculated by the weight calculation formula fusing word-position contribution of Definition 2.
An example is as follows:
Assume there are several short texts; calculate the similarity between the two short texts "Kobe meets fans in Los Angeles" and "The NBA star meets American fans", as shown in FIG. 3.
The weight transfer amount weight_s of each word is calculated by formula (28); for ease of explanation, a simple number is chosen here as weight_s and labelled next to each word.
According to the constraints, since weight_Kobe = 0.5, the total weight transferred from "Kobe" to the four words "NBA star", "America", "fans" and "meets" should be 0.5. At the same time, since weight_NBA star = 0.6, the sum of the weights transferred from the four words "Kobe", "Los Angeles", "meets" and "fans" to the word "NBA star" should be 0.6. In fact, according to the constraints Σ_t T_st = weight_s and Σ_s T_st = weight_t, the components of the weight transfer matrix T_st can be obtained; since there are multiple feasible solutions, Σ_{s,t} T_st · c(s, t) is computed for each and the minimum value is taken as the WMD distance between the two documents.
K-means clustering based on linearly fused similarity:
1) randomly selecting three short texts as initial cluster centers
2) Using the fused-similarity distance function (31), calculate the distances between the remaining short texts and the three initial cluster centers, and assign each short text to the cluster whose center it is most similar to.
3) Update the cluster centers according to formula (19), e = 1, 2, …, K.
4) Repeat 2) and 3) until the criterion function converges (cluster center no longer changes)
The following results will eventually be approximately obtained:
1. clusters corresponding to each short text (the clusters are represented by the labels 1-3, the label 1 corresponds to the first cluster, and so on):
2. To display the results more intuitively, several feature words may also be used to represent each cluster:
Label  Feature words
1  Hongkou Primary School, volunteer, help, elder, road, Lei Feng spirit
2  angel, tribute, concert, medical staff, public welfare, expectation
3  Mother's Day, love, happy, caucasian, mother, child.
The foregoing shows and describes the basic principles and main features of the invention and its advantages. Those skilled in the art will understand that the invention is not limited to the embodiments described above; the embodiments and the description merely illustrate the principles of the invention, and various changes and modifications may be made without departing from the spirit and scope of the invention, all of which fall within the scope of the claimed invention. The scope of the invention is defined by the appended claims and their equivalents.

Claims (7)

1. A microblog hot topic discovery algorithm based on linear fusion of BTM and GloVe similarity, characterized by comprising three stages: data acquisition and preprocessing, modeling, and clustering, wherein data are first acquired and preprocessed, the obtained data are then modeled, and the modeled data are clustered;
the processing method of the data acquisition and preprocessing stage is as follows: the microblog open platform provides a microblog API; by creating an application on the open platform, a developer can use his or her own App key and App secret to pass OAuth 2.0 user authorization and then call the API to obtain microblog data in Java, C++, Python and other environments; microblog text preprocessing mainly comprises four parts: microblog text filtering, word segmentation and part-of-speech tagging, stop-word removal, and feature selection;
the modeling stage is divided into two parts: the first uses the BTM topic model for topic modeling, represents the original text set with the modeling result, and finally calculates text similarity with JS divergence; the second uses the GloVe global vector model for word-vector modeling and then calculates text similarity with the WMD (Word Mover's Distance) with improved word weights;
the clustering stage linearly fuses the similarities calculated from the two modeling approaches, BTM topic modeling and GloVe word-vector modeling, so as to improve the distance function of the K-means algorithm, wherein the distance function based on the fused similarity is shown in formula (31), and the corresponding clustering criterion function is:

SSE = Σ_{e=1}^{K} Σ_{d_i ∈ C_e} Dis(d_i, c_e)²    (27)

The execution order of the microblog hot topic discovery algorithm based on linear fusion of BTM and GloVe similarity is: BTM-JS → GloVe-WMD → BG&SLF-Kmeans; the specific method is as follows:
first, acquire microblog data and perform text preprocessing; second, model the preprocessed microblog text set with the BTM topic model and, after text representation, calculate the topic-based text similarity after BTM modeling with JS divergence (BTM-JS); third, model the preprocessed microblog text set with the GloVe global vector model and calculate the word-vector-based text similarity after GloVe modeling with the improved WMD distance (GloVe-WMD); finally, linearly fuse the two similarities and apply the fused-similarity distance function to the K-means clustering algorithm to discover microblog hot topics (BG&SLF-Kmeans).
2. The microblog hot topic discovery algorithm based on BTM and GloVe similarity linear fusion as claimed in claim 1, wherein the microblog text preprocessing comprises the following specific procedures:
(1) microblog text filtering
First, delete useless information such as emoticons, links and mentions, e.g. "@user", "tap the picture ↓ to learn more", "more >>", which would seriously interfere with the subsequent text analysis; second, delete ultra-short microblogs of fewer than 10 words, which are generally used to express a user's mood rather than to describe a hot topic; finally, remove all punctuation marks in the data set;
(2) word segmentation and part-of-speech tagging
Chinese word segmentation is an important technique in natural language processing; it divides a sequence of Chinese characters into individual words according to a certain specification. Commonly used Chinese word-segmentation tools include ICTCLAS (NLPIR), Ansj, Jieba and THULAC, and a user-defined dictionary, generally including new internet words, hot words, professional terms, person names and place names, is added before segmentation according to the particularity of the data set to improve segmentation accuracy;
(3) stop-word removal
after word segmentation, the text set becomes a word set; some words have no actual meaning and only serve as sentence connectors or to express the author's mood, such as the particles "o" and "ya"; if these words are retained, the high dimensionality of the data set increases the running cost of the algorithm and also degrades the final short-text clustering result, so the Harbin Institute of Technology stop-word list is selected for the experiments to delete these meaningless words;
(4) feature selection
after part-of-speech tagging, the part of speech of each word is appended after the word, e.g. "word/v" for a verb; words of different parts of speech carry different amounts of information, and since this work studies hot-topic discovery, words of some parts of speech, such as adjectives and adverbs, contribute less; to improve the running efficiency of the algorithm, the experimental data keeps only nouns and verbs, and words of the remaining parts of speech are filtered out as useless words.
3. The microblog hot topic discovery algorithm based on BTM and GloVe similarity linear fusion as claimed in claim 1, characterized in that the specific flow of the modeling stage is as follows:
(1) text similarity measurement based on BTM topic modeling
The text similarity metric process based on BTM topic modeling is divided into two parts,
a first part:
firstly, calculating an optimal theme number K by using a confusion formula, then carrying out BTM theme modeling on a preprocessed microblog text set, and finally carrying out text representation according to a modeling result;
a second part: calculating text similarity by using JS divergence;
1) BTM topic modeling
because the choice of the number of topics K directly affects the BTM modeling result, the K that optimizes the modeling result must be determined before modeling; the optimal K can be determined with the perplexity, which evaluates the generalization ability of the model, and the smaller the perplexity, the better the modeling effect; the perplexity is calculated as:

Perplexity(B) = exp( − Σ_{b∈B} ln p(b) / |B| )    (21)
in the formula, B represents a word pair set, and p (B) represents the probability generated by the word pair B, and the formula is as follows:
p(b) = Σ_z θ_z · φ_{i|z} · φ_{j|z}    (22)

where P(z) = θ_z is the probability of topic z, P(w_i|z) = φ_{i|z} is the probability of feature word w_i under topic z, and P(w_j|z) = φ_{j|z} is the probability of feature word w_j under topic z;
after the number of topics K is determined, α = 50/K and β = 0.01 are set empirically, and the topic distribution θ_z and topic-word distribution φ_{w|z} are then obtained according to formulas (7) and (8) mentioned in the background art;
2) JS divergence
After BTM modeling is completed, for each document, 6 feature words with the first probability in the topic-word distribution p (w | z) under the maximum probability topic in the document-topic probability distribution p (z | d) are selected as the feature words of the document, so that dimension reduction can be realized on the basis of maximally preserving document semantics, and the algorithm complexity is reducediThe document vector based on BTM topic modeling can be represented by K document-topic probability distributions:
di_BTM={p(z1|di),p(z2|di),...,p(zK|di)} (23)
two pieces are to be calculatedDocument diAnd djThe similarity between the two is converted into the calculation di_BTMAnd dj_BTMThe similarity between these two document-topic vectors, calculated herein using the commonly used text similarity metric — JS divergence, is based on the BTM topic modeling and the text similarity calculation formula for JS divergence as follows:
Figure RE-FDA0002217272120000051
wherein, the calculation formula of the KL divergence is as follows:
D_KL(p ‖ q) = Σ_h p_h log(p_h / q_h)    (25)

where p and q are two probability distributions and p_h, q_h are the probabilities of the first 6 feature words;
(2) text similarity measurement based on GloVe word vector modeling
The text similarity measurement process based on GloVe word vector modeling is divided into two parts, firstly, GloVe word vector modeling is carried out on a preprocessed microblog text set, and then the similarity between texts is calculated by using WMD distance;
1) GloVe word vector modeling
before training word vectors, the GloVe model counts, within a context window of a given size, the number of times each target word v_s co-occurs with its context words ṽ_t over the whole corpus to construct the word co-occurrence matrix X_st; in the original paper, two parameters of the model, the word-vector dimension vector_size and the window size window_size, were compared experimentally using three indicators (semantic accuracy, syntactic accuracy and overall accuracy), with the conclusion that vector_size = 300 performs best and the best window_size lies roughly between 6 and 10; since the data set used by the source code is an English data set, the parameters are set to vector_size = 300, window_size = 8, considering the particularity of the Chinese microblog short-text data set;
2) WMD distance with improved word weights
after GloVe word-vector modeling, the similarity between texts is calculated with the WMD distance; the text-similarity formula for short texts d_i and d_j based on GloVe word-vector modeling and the WMD distance is:

Dis_GloVe(d_i, d_j) = min_{T ≥ 0} Σ_{s,t=1}^{G} T_st · c(s, t)    (26)

where G is the vocabulary size and T_st is a G-order weight transfer matrix whose entries indicate how much of the weight of word s in d_i is transferred to word t in d_j; when word s of text d_i is completely transferred to text d_j, Σ_t T_st = weight_s, where the weight transfer amount weight_s is measured by the weight calculation formula fusing word-position contribution, and the formula above must satisfy the two constraints of formulas (15) and (16) in the background art.
4. The microblog hot topic discovery algorithm based on linear fusion of BTM and GloVe similarity according to claim 1, characterized in that the hot topic discovery algorithm based on BG&SLF-Kmeans is as follows:
Input: microblog short text set D = {d_1, d_2, …, d_n}, the number of clusters K calculated with the perplexity formula (21);
Output: K clusters;
Step1. Randomly select K short texts from data set D = {d_1, d_2, …, d_n} as the initial cluster centers c_e, e = 1, 2, …, K;
Step2. Take formula (31) as the distance function;
Step3. repeat;
Step4. Calculate the distance Dis(d_i, c_e) between each remaining short text d_i and each cluster center c_e, and assign each short text to the most similar cluster;
Step5. Recalculate the K cluster centers c_e according to formula (19);
Step6. until the clustering criterion function (27) converges.
5. The microblog hot topic discovery algorithm based on linear fusion of BTM and GloVe similarity according to claim 3, characterized in that the text similarity measurement based on BTM topic modeling and JS divergence (BTM-JS) is as follows:
Input: microblog short text set D = {d_1, d_2, …, d_n}, hyper-parameters α and β;
Output: topic-based text similarity Dis_BTM(d_i, d_j);
Step1. Determine the optimal number of topics K according to formula (21);
Step2. Randomly assign initial topics to all word pairs;
Step3. for d_i ∈ D do;
Step4. for b ∈ B do;
Step5. Assign a topic z_b to each word pair according to formula (22), and update n_z, n_{w_i|z} and n_{w_j|z};
Step6. Calculate the topic distribution θ_z and topic-word distribution φ_{w|z} according to formulas (7) and (8);
Step7. Select feature words for each document according to formula (23) and represent the texts as vectors;
Step8. Calculate the topic-based text similarity Dis_BTM(d_i, d_j) according to formula (24) and output it.
6. The microblog hot topic discovery algorithm based on linear fusion of BTM and GloVe similarity according to claim 3, characterized in that the text similarity measurement based on GloVe word vector modeling and WMD distance (GloVe-WMD) is as follows:
Input: microblog short text set D = {d_1, d_2, …, d_n};
Output: word-vector-based text similarity Dis_GloVe(d_i, d_j);
Step1. Construct the word co-occurrence matrix X_st of the microblog short-text set;
Step2. Model with GloVe based on the word co-occurrence matrix X_st to obtain the word-vector set V = {v_1, v_2, …, v_G};
Step3. for s = 1 to G do;
Step4. Judge whether word s is a title word or a text word according to l_s < 10, and update c_{s_title} and c_{s_text};
Step5. Take formula (28) as the word-weight calculation formula and calculate the weight transfer amount weight_s;
Step6. Calculate the weight transfer matrix T_st according to Σ_t T_st = weight_s;
Step7. for v_s, v_t ∈ V do;
Step8. Calculate the word conversion cost c(s, t) according to formula (17);
Step9. Calculate the word-vector-based text similarity Dis_GloVe(d_i, d_j) according to formula (26) and output it.
7. The microblog hot topic discovery algorithm based on linear fusion of BTM and GloVe similarity as claimed in claim 1, characterized in that the specific processing method is as follows:
For a microblog news short text, the news title appears at the very beginning, marked by a pair of # signs or brackets; the title generally serves as a summary of the news, and the remaining content constitutes the text part;
Definition 1: take the first 10 words of the processed microblog short text as the title and the remaining words as the text; that is, if the column index l_s of a word s is less than 10, s is a title word, otherwise s is a text word;
Definition 2: weight calculation formula fusing the word position contribution degree. The weight transfer amount of a word is calculated from its TF-IDF value, with the position contribution degree of a title word set to γ_1 = 1.5 and that of a text word set to γ_2 = 1. Since a word may be both a title word and a text word, the weight calculation formula fusing the word position contribution degree is as follows:
[formula image: equation (28), the weight weight_s fusing the word position contribution degree]
wherein c_s denotes the total number of occurrences of the word s, c_s_title denotes the number of times s occurs as a title word, c_s_text denotes the number of times s occurs as a text word, and c_s_title + c_s_text = c_s; the calculation formulas of tf_s and idf_s are as follows:
[formula image: equation (22), the term frequency tf_s]
[formula image: equation (23), the inverse document frequency idf_s]
in formula (22), G is the vocabulary size; in formula (23), |D| represents the number of texts in the short text set and |{i: s ∈ d_i}| denotes the number of texts containing the word s;
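Formulas (22), (23) and (28) are published only as images, so the sketch below is a minimal reading of Definitions 1 and 2: a TF-IDF weight scaled by a position factor that mixes γ_1 = 1.5 for title occurrences and γ_2 = 1 for text occurrences in proportion to c_s_title and c_s_text. The function name position_weights, the corpus layout, and the exact normalization are assumptions, not the patented formula.

```python
import math
from collections import Counter

GAMMA_TITLE, GAMMA_TEXT = 1.5, 1.0   # position contribution degrees from Definition 2

def position_weights(docs, title_len=10):
    """Sketch of a TF-IDF weight fused with word position contribution.

    docs: list of tokenized short texts; the first `title_len` tokens of each
    text are treated as the title (Definition 1).
    Returns {word: weight_s} over the whole corpus.
    NOTE: formulas (22), (23) and (28) are published only as images, so the
    normalization used here is an assumed reading, not the patented formula.
    """
    c_total, c_title, c_text, df = Counter(), Counter(), Counter(), Counter()
    for doc in docs:
        for i, w in enumerate(doc):
            c_total[w] += 1
            if i < title_len:
                c_title[w] += 1                 # occurrence as a title word
            else:
                c_text[w] += 1                  # occurrence as a text word
        for w in set(doc):
            df[w] += 1                          # document frequency

    n_docs = len(docs)
    vocab_size = len(c_total)                   # G, the vocabulary size
    weights = {}
    for w, c in c_total.items():
        tf = c / vocab_size                     # assumed reading of formula (22)
        idf = math.log(n_docs / df[w])          # assumed reading of formula (23)
        # position factor: average contribution over title/text occurrences
        pos = (GAMMA_TITLE * c_title[w] + GAMMA_TEXT * c_text[w]) / c
        weights[w] = pos * tf * idf
    return weights
```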
Definition 3: distance function of the fused similarity. Given the text similarity Dis_BTM(d_i, d_j) based on BTM topic modeling and JS divergence, and the text similarity Dis_GloVe(d_i, d_j) based on GloVe word vector modeling and WMD distance, the distance function of the fused similarity is as follows:
Dis(d_i, c_e) = λ·Dis_BTM(d_i, c_e) + (1 − λ)·Dis_GloVe(d_i, c_e),
i = 1, 2, …, n; e = 1, 2, …, K    (31)
wherein d_i is a text in the dataset D = {d_1, d_2, …, d_n}, c_e is a clustering center (at the start of the algorithm, K texts are randomly selected from the dataset as the initial clustering centers c_e), λ is a fusion coefficient with 0 < λ < 1, and the value of λ is determined by the clustering effect.
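A minimal sketch of the fused distance of equation (31), assuming the two component similarities of claims 5 and 6 are already available as callables; fused_distance and lam are illustrative names, and the choice of λ (which the patent only says is tuned by the clustering effect) is left to the caller.

```python
def fused_distance(dis_btm, dis_glove, lam=0.5):
    """Return the fused distance function of equation (31).

    dis_btm(d_i, c_e)   -> topic-based distance (BTM + JS divergence, claim 5)
    dis_glove(d_i, c_e) -> word-vector distance (GloVe + WMD, claim 6)
    lam                 -> fusion coefficient λ, 0 < λ < 1 (tuned empirically)
    """
    assert 0.0 < lam < 1.0, "the fusion coefficient must lie strictly between 0 and 1"

    def dis(d_i, c_e):
        return lam * dis_btm(d_i, c_e) + (1.0 - lam) * dis_glove(d_i, c_e)

    return dis
```

The returned closure can then be handed to the clustering loop of claim 4 as the distance function designated in its Step2.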
CN201910770568.4A 2019-08-20 2019-08-20 Microblog hot topic discovery algorithm based on linear fusion of BTM and GloVe similarity Withdrawn CN111368072A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910770568.4A CN111368072A (en) 2019-08-20 2019-08-20 Microblog hot topic discovery algorithm based on linear fusion of BTM and GloVe similarity

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910770568.4A CN111368072A (en) 2019-08-20 2019-08-20 Microblog hot topic discovery algorithm based on linear fusion of BTM and GloVe similarity

Publications (1)

Publication Number Publication Date
CN111368072A true CN111368072A (en) 2020-07-03

Family

ID=71210025

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910770568.4A Withdrawn CN111368072A (en) 2019-08-20 2019-08-20 Microblog hot topic discovery algorithm based on linear fusion of BTM and GloVe similarity

Country Status (1)

Country Link
CN (1) CN111368072A (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112328735A (en) * 2020-11-11 2021-02-05 河北工程大学 Hot topic determination method and device and terminal equipment
CN112860883A (en) * 2021-02-08 2021-05-28 国网河北省电力有限公司营销服务中心 Electric power work order short text hot topic identification method and device and terminal
CN112860883B (en) * 2021-02-08 2022-06-24 国网河北省电力有限公司营销服务中心 Electric power work order short text hot topic identification method, device and terminal
CN113139599A (en) * 2021-04-22 2021-07-20 北方工业大学 Service distributed clustering method fusing word vector expansion and topic model
CN113139599B (en) * 2021-04-22 2023-08-08 北方工业大学 Service distributed clustering method integrating word vector expansion and topic model
CN113191147A (en) * 2021-05-27 2021-07-30 中国人民解放军军事科学院评估论证研究中心 Unsupervised automatic term extraction method, apparatus, device and medium
CN113591473A (en) * 2021-07-21 2021-11-02 西北工业大学 Text similarity calculation method based on BTM topic model and Doc2vec
CN113591473B (en) * 2021-07-21 2024-03-12 西北工业大学 Text similarity calculation method based on BTM topic model and Doc2vec
TWI825535B (en) * 2021-12-22 2023-12-11 中華電信股份有限公司 System, method and computer-readable medium for formulating potential hot word degree
CN116432639A (en) * 2023-05-31 2023-07-14 华东交通大学 News element word mining method based on improved BTM topic model
CN116432639B (en) * 2023-05-31 2023-08-25 华东交通大学 News element word mining method based on improved BTM topic model
CN117195004A (en) * 2023-11-03 2023-12-08 苏州市吴江区盛泽镇人民政府 Policy matching method integrating industry classification and wvLDA theme model
CN117195004B (en) * 2023-11-03 2024-02-06 苏州市吴江区盛泽镇人民政府 Policy matching method integrating industry classification and wvLDA theme model

Similar Documents

Publication Publication Date Title
CN111368072A (en) Microblog hot topic discovery algorithm based on linear fusion of BTM and GloVe similarity
CN107220352B (en) Method and device for constructing comment map based on artificial intelligence
CN106776711B (en) Chinese medical knowledge map construction method based on deep learning
Vitevitch What can graph theory tell us about word learning and lexical retrieval?
US11704501B2 (en) Providing a response in a session
CN111539197B (en) Text matching method and device, computer system and readable storage medium
Gómez-Adorno et al. Automatic authorship detection using textual patterns extracted from integrated syntactic graphs
CN111723295B (en) Content distribution method, device and storage medium
Graff et al. Evomsa: A multilingual evolutionary approach for sentiment analysis [application notes]
Cui et al. KNET: A general framework for learning word embedding using morphological knowledge
Marujo et al. Hourly traffic prediction of news stories
Kalabikhina et al. The measurement of demographic temperature using the sentiment analysis of data from the social network VKontakte
Yordanova et al. Automatic detection of everyday social behaviours and environments from verbatim transcripts of daily conversations
CN113934835A (en) Retrieval type reply dialogue method and system combining keywords and semantic understanding representation
CN116882414B (en) Automatic comment generation method and related device based on large-scale language model
CN111046662B (en) Training method, device and system of word segmentation model and storage medium
O'Connor Statistical Text Analysis for Social Science.
CN114048395B (en) User forwarding prediction method and system based on time perception and key information extraction
Ling Coronavirus public sentiment analysis with BERT deep learning
US20220269704A1 (en) Irrelevancy filtering
KR20070118154A (en) Information processing device and method, and program recording medium
Brown et al. On the problem of small objects
Wang et al. Emotional tagging of videos by exploring multiple emotions' coexistence
ElGindy et al. Capturing place semantics on the geosocial web
Zhang et al. On the need of hierarchical emotion classification: detecting the implicit feature using constrained topic model

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication
WW01 Invention patent application withdrawn after publication

Application publication date: 20200703