CN106599181A - Hot news detecting method based on topic model - Google Patents

Hot news detecting method based on topic model

Info

Publication number
CN106599181A
Authority
CN
China
Prior art keywords
theme
article
word
similarity
topic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201611145855.9A
Other languages
Chinese (zh)
Other versions
CN106599181B (en)
Inventor
庄郭冕
黄乔
彭志宇
付晗
王忆诗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Insigma Hengtian Software Ltd
Original Assignee
Insigma Hengtian Software Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Insigma Hengtian Software Ltd filed Critical Insigma Hengtian Software Ltd
Priority to CN201611145855.9A priority Critical patent/CN106599181B/en
Publication of CN106599181A publication Critical patent/CN106599181A/en
Application granted granted Critical
Publication of CN106599181B publication Critical patent/CN106599181B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques

Abstract

The invention discloses a hot news detection method based on a topic model. In the method, news streams are crawled in a directed manner by a web crawler; each article is first preprocessed by word segmentation and removal of stop words and meaningless character strings; features are then extracted from the preprocessed articles to build a text model; texts with high similarity are assigned to the most similar class by a text clustering algorithm to obtain a topic base; next, the similarity between new and old topics is computed and new and old topics with high similarity are merged; finally, topic heat is calculated and the hottest topics are selected by ranking. The method creatively applies the LDA (Latent Dirichlet Allocation) algorithm to hot topic discovery and introduces a burstiness concept, so that the hottest news can be discovered in a timely and effective manner. It also introduces a topic heat decay concept, so that topic heat can be recorded and tracked in real time, truly reflecting the development and change of hot news, which is of great significance for tracking and displaying hot news.

Description

Hot news detection method based on a topic model
Technical field
The invention provides a hot news detection method based on a topic model. It involves core technologies and algorithms such as web crawling, cluster analysis and text similarity computation, detects hot news in a timely and effective manner, and tracks the development of hot news.
Background technology
With the development of Internet technology, the era of massive information has arrived. The Internet is filled with all kinds of information, but only a small amount of news causes a sensation, namely so-called headline or hot news, and timely discovery of hot news helps people follow the state of society in real time.
On the other hand, a piece of hot news does not erupt and die in an instant; it is usually accompanied by an evolving development process and may cause other potential problems to arise, so tracking the evolution of hot news is of great significance for the study of social problems.
With the development of the Internet and the rise of big data, the Internet is flooded with massive amounts of information, and discovering hot news within this mass of low-quality information becomes particularly important.
Summary of the invention
In view of the complexity of present-day Internet information, the object of the present invention is to provide a hot news detection method based on web crawling, cluster analysis and a topic model.
The object of the present invention is achieved through the following technical solution: a hot news detection method based on a topic model, in which news streams are crawled in a directed manner by a web crawler; each article is first preprocessed by word segmentation and removal of stop words and meaningless character strings; features are then extracted from the preprocessed articles to build a text model; texts with high similarity are then added to the most similar class by a text clustering algorithm to obtain a topic base; the similarity between new and old topics is then computed and new and old topics with high similarity are merged; finally, topic heat is calculated and the hottest topics are selected by ranking. The method specifically comprises the following steps:
(1) News streams are crawled in a directed manner by a web crawler; a batch is processed each time N new articles arrive; the crawled data are cleaned and the articles are segmented into words to obtain the preprocessed articles.
(2) Building a vector space model: after preprocessing, an original document can be regarded as a collection of words. If the document is regarded as a vector, each word is one feature dimension; by converting documents into vectors, text data become structured data that a computer can process subsequently, and the similarity problem between two documents is converted into the similarity problem between two vectors. The weight of each dimension of a document vector is computed with an improved B-TFIDF algorithm, whose formulas are as follows:

b(w) = (A + B + C + D) · (A·D - B·C)² / [(A + B) · (C + D) · (A + C) · (B + D)]    (1)

weight(d_i, w) = tf(d_i, w) · log((N + 1) / (df(w) + 0.5)) · b(w) / sqrt( Σ_{w'∈d_i} [ tf(d_i, w') · log((N + 1) / (df(w') + 0.5)) · b(w') ]² )    (2)

In formula (1), w denotes a word, A denotes the number of new articles containing word w, B the number of new articles not containing word w, C the number of history articles containing word w, and D the number of history articles not containing word w. In formula (2), d_i denotes the i-th new article, N the total number of new articles, tf(d, w) the term frequency of word w in article d, and df(w) the number of articles containing word w. The algorithm takes the burstiness of a word into account, where burstiness means that a word suddenly appears in large quantities within a short period. The weight of every word making up a document is computed by the above algorithm, yielding the article's vector space model D_i = (weight(d_i, w_1), weight(d_i, w_2), weight(d_i, w_3), ..., weight(d_i, w_n)), where n is the total number of words.
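As an illustration only (not part of the original disclosure), the B-TFIDF weighting of formulas (1) and (2) could be sketched in Python roughly as follows; the function names and the dictionaries doc_tf, df and burst are assumed bookkeeping, and the square-root normalisation in the denominator follows the usual cosine-style normalisation:

```python
import math

def burstiness(A, B, C, D):
    """Formula (1): burstiness of a word, from the counts of new articles
    containing it (A), new articles without it (B), history articles
    containing it (C) and history articles without it (D)."""
    den = (A + B) * (C + D) * (A + C) * (B + D)
    return (A + B + C + D) * (A * D - B * C) ** 2 / den if den else 0.0

def b_tfidf_vector(doc_tf, df, N, burst):
    """Formula (2): B-TFIDF weight of every word of one new article.
    doc_tf: {word: term frequency in this article}
    df:     {word: number of articles containing the word}
    N:      total number of new articles
    burst:  {word: burstiness from formula (1)}"""
    raw = {w: tf * math.log((N + 1) / (df[w] + 0.5)) * burst[w]
           for w, tf in doc_tf.items()}
    norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
    return {w: v / norm for w, v in raw.items()}
```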
(3) Article clustering: after step 2, each text is represented as a vector, and the text vectors are clustered using an LDA topic-model clustering algorithm, specifically:
LDA clustering process: LDA is a three-layer Bayesian probability model comprising the word, topic and document layers. The generation of an article is regarded as the following process: a topic is selected with a certain probability, and a word is selected from that topic with a certain probability; documents obey a multinomial distribution over topics and topics obey a multinomial distribution over words. LDA clustering yields the "topic-word" probability matrix phi and the "document-topic" probability matrix theta. From theta, the probabilities relating the m topics to the N articles are obtained: each row i of theta represents an article, each column j represents a topic, and the matrix entry theta_ij is the probability that article i belongs to topic j. A screening threshold thresholdT is set; if theta_ij > thresholdT, article i is considered to belong to topic j, and in this way the articles corresponding to each topic are selected.
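A minimal sketch of the article-to-topic screening, assuming the document-topic matrix theta has already been produced by some LDA implementation (the function name and the use of NumPy are illustrative assumptions):

```python
import numpy as np

def articles_per_topic(theta, threshold=0.32):
    """theta: (N articles x m topics) document-topic probability matrix.
    Returns {topic index: article indices with theta[i, j] > threshold};
    0.32 is the preferred thresholdT value given in the embodiment."""
    theta = np.asarray(theta)
    return {j: np.where(theta[:, j] > threshold)[0].tolist()
            for j in range(theta.shape[1])}
```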
Determination of the number m of LDA clusters: the LDA clustering algorithm is run repeatedly with the cluster number set, in turn, to each value from N/10 to N/5; the inter-topic similarity of each run is then computed, and the cluster number of the run with the minimum inter-topic similarity is selected. The inter-topic similarity is computed from the "topic-word" probability matrix phi produced by the LDA clustering: each row j of phi represents a topic T_j, each column k represents a word w_k, and phi_jk is the probability that topic T_j contains word w_k. A row of phi can thus be regarded as the vector form of topic T_j, T_j = (w_1, w_2, w_3, ... w_k ... w_n), where n is the total number of words. The similarity of every pair of topics is computed and averaged to give the run's inter-topic similarity, and the minimum across the runs determines the final choice. Similarity is computed with the cosine similarity measure:

sim(T_i, T_j) = Σ_{k=1..n} ω_k(T_i) · ω_k(T_j) / sqrt( ( Σ_{k=1..n} ω_k²(T_i) ) · ( Σ_{k=1..n} ω_k²(T_j) ) )    (3)

In formula (3), T_i and T_j denote two topics, ω_k(T_i) denotes the value of topic T_i in dimension k, and n denotes the total number of words.
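A minimal sketch of this model-selection loop, assuming a helper run_lda(m) that trains LDA with m topics and returns its topic-word matrix phi (the helper and all names are illustrative, not part of the patent):

```python
from itertools import combinations
import numpy as np

def cosine(u, v):
    """Formula (3): cosine similarity of two topic vectors (rows of phi)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def mean_topic_similarity(phi):
    """Average pairwise cosine similarity between the topic rows of phi."""
    pairs = combinations(range(len(phi)), 2)
    return float(np.mean([cosine(phi[i], phi[j]) for i, j in pairs]))

def choose_num_topics(run_lda, N):
    """Try every cluster number in [N/10, N/5] and keep the one whose
    topics are least similar to each other on average."""
    candidates = range(max(2, N // 10), N // 5 + 1)
    return min(candidates, key=lambda m: mean_topic_similarity(run_lda(m)))
```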
(4) Topic keyword extraction: keywords are extracted from the titles of all articles under a topic. The article titles are first segmented into words; stop words, meaningless words and punctuation are filtered out, and the remaining words serve as the topic keywords.
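A minimal sketch of the keyword-extraction step for Chinese titles; the jieba segmenter and the externally supplied stop-word list are assumptions, not specified by the patent:

```python
import string
import jieba  # a common Chinese word segmenter; any tokenizer would do

def topic_keywords(titles, stopwords):
    """Segment the titles of a topic's articles and keep the content words."""
    keywords = set()
    for title in titles:
        for token in jieba.lcut(title):
            token = token.strip()
            if (token and token not in stopwords
                    and not all(ch in string.punctuation for ch in token)):
                keywords.add(token)
    return keywords
```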
(5) Topic merging: step 3 yields m topics and their corresponding articles. Next the m new topics are merged with the old topics by computing the inter-topic similarity f1; if f1 > 0.5 the two topics are considered similar and are merged. The inter-topic similarity f1 is computed as:

f1 = 2 * vectorSim * keywordSim / (vectorSim + keywordSim)    (4)

In formula (4), vectorSim denotes the topic cosine similarity computed with all words contained in the topics as dimensions, and keywordSim denotes the topic cosine similarity computed with the topic keywords as dimensions; both cosine similarities are computed as in formula (3).
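A minimal sketch of the merge test of formula (4), assuming each topic is kept as a small dictionary holding its word-weight vector and its keyword set (these structures and names are illustrative):

```python
import math

def cosine_dict(a, b):
    """Cosine similarity of two sparse {dimension: weight} vectors."""
    dot = sum(a[k] * b[k] for k in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def should_merge(new_topic, old_topic, threshold=0.5):
    """Formula (4): harmonic mean of word-vector and keyword similarities."""
    vector_sim = cosine_dict(new_topic["word_weights"], old_topic["word_weights"])
    keyword_sim = cosine_dict({k: 1.0 for k in new_topic["keywords"]},
                              {k: 1.0 for k in old_topic["keywords"]})
    if vector_sim + keyword_sim == 0:
        return False
    f1 = 2 * vector_sim * keyword_sim / (vector_sim + keyword_sim)
    return f1 > threshold
```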
(6) Heat calculation: step 5 yields the final set of topics. The heat h of each topic is then calculated, topics with high heat are selected, and topics with low heat, i.e. outdated topics, are removed. Based on the characteristic that hot topics have a high news concentration, the heat is computed as:

h_t = Σ sim(d_i, t)    (5)

In formula (5), d_i denotes an article contained in topic T; the heat h_t of topic T equals the sum of the similarities between the articles under the topic and the topic itself, where sim is as in formula (3).
As time goes on, the heat of a topic decays continuously until it falls below a threshold and the topic is removed. Heat decay: in each batch process, if new articles arrive under topic T, the heat h_t of topic T is increased accordingly, h_t = h_t * Up; if no new article is added to topic T, the heat decays, h_t = h_t * Down, where Up > 1 and Down < 1.
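A minimal sketch of the per-batch heat bookkeeping, using the preferred Up = 1.05 and Down = 0.9 given in the embodiment; the topic dictionary layout and the removal-threshold value are illustrative assumptions:

```python
def initial_heat(topic_vector, article_vectors, sim):
    """Formula (5): topic heat as the sum of article-topic similarities."""
    return sum(sim(d, topic_vector) for d in article_vectors)

def update_heat(topics, topics_with_new_articles, up=1.05, down=0.9, drop_below=0.1):
    """Per batch: multiply the heat of topics that received new articles by Up,
    multiply the others by Down, and drop topics whose heat falls below the
    (assumed) removal threshold."""
    for tid in list(topics):
        topics[tid]["heat"] *= up if tid in topics_with_new_articles else down
        if topics[tid]["heat"] < drop_below:
            del topics[tid]
    return topics
```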
The beneficial effects of the invention are as follows: the invention innovatively applies the LDA algorithm to hot topic discovery and introduces a burstiness concept, so that the latest hot news can be discovered in a timely and effective manner. It also introduces the concept of topic heat decay, so that topic heat can be recorded and tracked in real time, truly reflecting the development and change of hot news, which is of great significance for tracking and displaying hot news.
Description of the drawings
Fig. 1 is a schematic diagram of the hot news detection process based on the topic model;
Fig. 2 is a schematic diagram of the article modeling process;
Fig. 3 is a schematic diagram of the LDA clustering process;
Fig. 4 is a schematic diagram of merging new and old topics;
Fig. 5 is a schematic diagram of the topic heat calculation.
Specific embodiment
The present invention is described in further detail below with reference to the accompanying drawings and a specific embodiment.
As shown in Fig. 1, the hot news detection method based on a topic model proposed by the present invention comprises the following steps:
(1) News streams are crawled in a directed manner by a web crawler; a batch is processed each time N articles arrive; the crawled data are cleaned and the articles are segmented into words to obtain the preprocessed articles.
(2) Building a vector space model: as shown in Fig. 2, after preprocessing an original document can be regarded as a collection of words. If the document is regarded as a vector, each word is one feature dimension; by converting documents into vectors, text data become structured data that a computer can process subsequently, and the similarity problem between two documents is converted into the similarity problem between two vectors. The weight of each dimension of a document vector is computed with the improved B-TFIDF algorithm of formulas (1) and (2) above.
In formula (1), w denotes a word, A denotes the number of new articles containing word w, B the number of new articles not containing word w, C the number of history articles containing word w, and D the number of history articles not containing word w. In formula (2), d_i denotes the i-th new article, N the total number of new articles, tf(d, w) the term frequency of word w in article d, and df(w) the number of articles containing word w. The algorithm takes the burstiness of a word into account, where burstiness means that a word suddenly appears in large quantities within a short period. The weight of every word making up a document is computed by the above algorithm, yielding the article's vector space model D_i = (weight(d_i, w_1), weight(d_i, w_2), weight(d_i, w_3), ..., weight(d_i, w_n)), where n is the total number of words.
(3) Article clustering: after step 2, each text is represented as a vector, and the text vectors are clustered. As shown in Fig. 3, an LDA topic-model clustering algorithm is used here. LDA is a three-layer Bayesian probability model comprising the word, topic and document layers; the generation of an article is regarded as the following process: a topic is selected with a certain probability, and a word is selected from that topic with a certain probability; documents obey a multinomial distribution over topics and topics obey a multinomial distribution over words. LDA cluster analysis yields the "topic-word" probability matrix and the "document-topic" probability matrix; the detailed process is described below.
LDA clustering process: LDA clustering yields the "topic-word" probability matrix phi and the "document-topic" probability matrix theta. From theta, the probabilities relating the m topics to the N articles are obtained: each row i of theta represents an article, each column j represents a topic, and theta_ij is the probability that article i belongs to topic j. A screening threshold thresholdT (preferred value 0.32) is set; if theta_ij > thresholdT, article i is considered to belong to topic j, and in this way the articles corresponding to each topic are selected.
Determination of the number m of LDA clusters: since for N articles a cluster number between N/10 and N/5 best matches reality (for example, when the number of new articles N is 150, a cluster number between 15 and 30 best matches reality), the LDA clustering algorithm is run repeatedly with the cluster number set, in turn, to each value from N/10 to N/5; the inter-topic similarity of each run is then computed, and the cluster number of the run with the minimum inter-topic similarity is selected. The inter-topic similarity is computed from the "topic-word" probability matrix phi produced by the LDA clustering: each row j of phi represents a topic T_j, each column k represents a word w_k, and phi_jk is the probability that topic T_j contains word w_k. A row of phi can thus be regarded as the vector form of topic T_j, T_j = (w_1, w_2, w_3, ... w_k ... w_n), where n is the total number of words. The similarity of every pair of topics is computed and averaged to give the run's inter-topic similarity, and the minimum across the runs determines the final choice. Similarity is computed with the cosine similarity measure of formula (3) above.
In formula (3), T_i and T_j denote two topics, ω_k(T_i) denotes the value of topic T_i in dimension k, and n denotes the total number of words.
(4) Topic keyword extraction: keywords are extracted from the titles of all articles under a topic. The article titles are first segmented into words; stop words, meaningless words and punctuation are filtered out, and the remaining words serve as the topic keywords.
(5) Topic merging: step 3 yields m topics and their corresponding articles. Next the m new topics are merged with the old topics, as shown in Fig. 4: the inter-topic similarity f1 is computed, and if f1 > 0.5 the two topics are considered similar and are merged. The inter-topic similarity f1 is computed as:
f1 = 2 * vectorSim * keywordSim / (vectorSim + keywordSim)    (4)
In formula (4), vectorSim denotes the topic cosine similarity computed with all words contained in the topics as dimensions, and keywordSim denotes the topic cosine similarity computed with the topic keywords as dimensions; the cosine similarity is computed as in formula (3).
(6) Heat calculation: as shown in Fig. 5, step 5 yields the final set of topics. The heat h of each topic is then calculated, topics with high heat are selected, and topics with low heat, i.e. outdated topics, are removed. Based on the characteristic that hot topics have a high news concentration, the heat is computed as:
h_t = Σ sim(d_i, t)    (5)
In formula (5), d_i denotes an article contained in topic T; the heat h_t of topic T equals the sum of the similarities between the articles under the topic and the topic itself, where sim is as in formula (3).
As time goes on, the heat of a topic decays continuously until it falls below a threshold and the topic is removed. Heat decay: in each batch process, if new articles arrive under topic T, the heat h_t of topic T is increased accordingly, h_t = h_t * Up (preferred value 1.05); if no new article is added to topic T, the heat decays, h_t = h_t * Down (preferred value 0.9).

Claims (1)

1. A hot news detection method based on a topic model, characterized in that it comprises the following steps:
(1) News streams are crawled in a directed manner by a web crawler; a batch is processed each time N new articles arrive; the crawled data are cleaned and the articles are segmented into words to obtain the preprocessed articles;
(2) building a vector space model: after preprocessing, an original document can be regarded as a collection of words; if the document is regarded as a vector, each word is one feature dimension; by converting documents into vectors, text data become structured data that a computer can process subsequently, and the similarity problem between two documents is converted into the similarity problem between two vectors; the weight of each dimension of a document vector is computed with an improved B-TFIDF algorithm, whose formulas are as follows:
b(w) = (A + B + C + D) · (A·D - B·C)² / [(A + B) · (C + D) · (A + C) · (B + D)]    (1)
weight(d_i, w) = tf(d_i, w) · log((N + 1) / (df(w) + 0.5)) · b(w) / sqrt( Σ_{w'∈d_i} [ tf(d_i, w') · log((N + 1) / (df(w') + 0.5)) · b(w') ]² )    (2)
In formula (1), w denotes a word, A denotes the number of new articles containing word w, B the number of new articles not containing word w, C the number of history articles containing word w, and D the number of history articles not containing word w; in formula (2), d_i denotes the i-th new article, N the total number of new articles, tf(d, w) the term frequency of word w in article d, and df(w) the number of articles containing word w; the algorithm takes the burstiness of a word into account, where burstiness means that a word suddenly appears in large quantities within a short period; the weight of every word making up a document is computed by the above algorithm, yielding the article's vector space model D_i = (weight(d_i, w_1), weight(d_i, w_2), weight(d_i, w_3), ..., weight(d_i, w_n)), where n is the total number of words;
(3) article clustering: after step 2, each text is represented as a vector, and the text vectors are clustered using an LDA topic-model clustering algorithm, specifically:
LDA clustering process: LDA is a three-layer Bayesian probability model comprising the word, topic and document layers; the generation of an article is regarded as the following process: a topic is selected with a certain probability, and a word is selected from that topic with a certain probability; documents obey a multinomial distribution over topics and topics obey a multinomial distribution over words; LDA clustering yields the "topic-word" probability matrix phi and the "document-topic" probability matrix theta; from theta, the probabilities relating the m topics to the N articles are obtained: each row i of theta represents an article, each column j represents a topic, and theta_ij is the probability that article i belongs to topic j; a screening threshold thresholdT is set, and if theta_ij > thresholdT, article i is considered to belong to topic j, whereby the articles corresponding to each topic are selected;
determination of the number m of LDA clusters: the LDA clustering algorithm is run repeatedly with the cluster number set, in turn, to each value from N/10 to N/5, the inter-topic similarity of each run is computed, and the cluster number of the run with the minimum inter-topic similarity is selected; the inter-topic similarity is computed from the "topic-word" probability matrix phi produced by the LDA clustering: each row j of phi represents a topic T_j, each column k represents a word w_k, and phi_jk is the probability that topic T_j contains word w_k; a row of phi can thus be regarded as the vector form of topic T_j, T_j = (w_1, w_2, w_3, ... w_k ... w_n), where n is the total number of words; the similarity of every pair of topics is computed and averaged to give the run's inter-topic similarity, and the minimum across the runs determines the final choice; similarity is computed with the cosine similarity measure:
sim(T_i, T_j) = Σ_{k=1..n} ω_k(T_i) · ω_k(T_j) / sqrt( ( Σ_{k=1..n} ω_k²(T_i) ) · ( Σ_{k=1..n} ω_k²(T_j) ) )    (3)
In formula (3), T_i and T_j denote two topics, ω_k(T_i) denotes the value of topic T_i in dimension k, and n denotes the total number of words;
(4) topic keyword extraction: keywords are extracted from the titles of all articles under a topic; the article titles are first segmented into words, stop words, meaningless words and punctuation are filtered out, and the remaining words serve as the topic keywords;
(5) topic merging: step 3 yields m topics and their corresponding articles; next the m new topics are merged with the old topics by computing the inter-topic similarity f1, and if f1 > 0.5 the two topics are considered similar and are merged; the inter-topic similarity f1 is computed as:
f1 = 2 * vectorSim * keywordSim / (vectorSim + keywordSim)    (4)
in formula (4), vectorSim denotes the topic cosine similarity computed with all words contained in the topics as dimensions, and keywordSim denotes the topic cosine similarity computed with the topic keywords as dimensions, the cosine similarity being computed as in formula (3);
(6) heat calculation: step 5 yields the final set of topics; the heat h of each topic is then calculated, topics with high heat are selected, and topics with low heat, i.e. outdated topics, are removed; based on the characteristic that hot topics have a high news concentration, the heat is computed as:
h_t = Σ sim(d_i, t)    (5)
in formula (5), d_i denotes an article contained in topic T, and the heat h_t of topic T equals the sum of the similarities between the articles under the topic and the topic itself, sim being as in formula (3);
as time goes on, the heat of a topic decays continuously until it falls below a threshold and the topic is removed; heat decay: in each batch process, if new articles arrive under topic T, the heat h_t of topic T is increased accordingly, h_t = h_t * Up; if no new article is added to topic T, the heat decays, h_t = h_t * Down, where Up > 1 and Down < 1.
CN201611145855.9A 2016-12-13 2016-12-13 A kind of hot news detection method based on topic model Active CN106599181B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611145855.9A CN106599181B (en) 2016-12-13 2016-12-13 A kind of hot news detection method based on topic model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611145855.9A CN106599181B (en) 2016-12-13 2016-12-13 A kind of hot news detection method based on topic model

Publications (2)

Publication Number Publication Date
CN106599181A true CN106599181A (en) 2017-04-26
CN106599181B CN106599181B (en) 2019-06-18

Family

ID=58802054

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611145855.9A Active CN106599181B (en) 2016-12-13 2016-12-13 A kind of hot news detection method based on topic model

Country Status (1)

Country Link
CN (1) CN106599181B (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140019460A1 (en) * 2012-07-12 2014-01-16 Yahoo! Inc. Targeted search suggestions
CN104699814A (en) * 2015-03-24 2015-06-10 清华大学 Searching method and system of hot spot information
CN106156276A (en) * 2016-06-25 2016-11-23 贵州大学 Hot news discovery method based on Pitman Yor process

Cited By (36)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107423337A (en) * 2017-04-27 2017-12-01 天津大学 News topic detection method based on LDA Fusion Models and multi-level clustering
CN107239497A (en) * 2017-05-02 2017-10-10 广东万丈金数信息技术股份有限公司 Hot content searching method and system
CN107239497B (en) * 2017-05-02 2020-11-03 广东万丈金数信息技术股份有限公司 Hot content search method and system
CN107203632A (en) * 2017-06-01 2017-09-26 中国人民解放军国防科学技术大学 Topic Popularity prediction method based on similarity relation and cooccurrence relation
CN107330049A (en) * 2017-06-28 2017-11-07 北京搜狐新媒体信息技术有限公司 A kind of news temperature predictor method and system
CN107330049B (en) * 2017-06-28 2020-05-22 北京搜狐新媒体信息技术有限公司 News popularity estimation method and system
CN107835113A (en) * 2017-07-05 2018-03-23 中山大学 Abnormal user detection method in a kind of social networks based on network mapping
CN107835113B (en) * 2017-07-05 2020-09-08 中山大学 Method for detecting abnormal user in social network based on network mapping
CN107451224A (en) * 2017-07-17 2017-12-08 广州特道信息科技有限公司 A kind of clustering method and system based on big data parallel computation
CN107563725B (en) * 2017-08-25 2021-04-06 浙江网新恒天软件有限公司 Recruitment system for optimizing fussy talent recruitment process
CN107563725A (en) * 2017-08-25 2018-01-09 浙江网新恒天软件有限公司 A kind of recruitment system for optimizing cumbersome personnel recruitment process
CN107656919B (en) * 2017-09-12 2018-10-26 中国软件与技术服务股份有限公司 A kind of optimal L DA Automatic Model Selection methods based on minimum average B configuration similarity between theme
CN107656919A (en) * 2017-09-12 2018-02-02 中国软件与技术服务股份有限公司 A kind of optimal L DA Automatic Model Selection methods based on minimum average B configuration similarity between theme
CN107918644A (en) * 2017-10-31 2018-04-17 北京锐思爱特咨询股份有限公司 News subject under discussion analysis method and implementation system in reputation Governance framework
CN107832418A (en) * 2017-11-08 2018-03-23 郑州云海信息技术有限公司 A kind of much-talked-about topic finds method, system and a kind of much-talked-about topic discovering device
CN107992542A (en) * 2017-11-27 2018-05-04 中山大学 A kind of similar article based on topic model recommends method
CN108153818B (en) * 2017-11-29 2021-08-10 成都东方盛行电子有限责任公司 Big data based clustering method
CN108153818A (en) * 2017-11-29 2018-06-12 成都东方盛行电子有限责任公司 A kind of clustering method based on big data
CN107784127A (en) * 2017-11-30 2018-03-09 杭州数梦工场科技有限公司 A kind of focus localization method and device
CN107862089B (en) * 2017-12-02 2020-03-13 北京工业大学 Label extraction method based on perception data
CN107862089A (en) * 2017-12-02 2018-03-30 北京工业大学 A kind of tag extraction method based on perception data
CN108090157A (en) * 2017-12-12 2018-05-29 百度在线网络技术(北京)有限公司 A kind of hot news method for digging, device and server
CN110888978A (en) * 2018-09-06 2020-03-17 北京京东金融科技控股有限公司 Article clustering method and device, electronic equipment and storage medium
CN110096649B (en) * 2019-05-14 2021-07-30 武汉斗鱼网络科技有限公司 Post extraction method, device, equipment and storage medium
CN110096649A (en) * 2019-05-14 2019-08-06 武汉斗鱼网络科技有限公司 A kind of model extracting method, device, equipment and storage medium
WO2021027116A1 (en) * 2019-08-15 2021-02-18 平安科技(深圳)有限公司 Method and apparatus for discovering text hotspot and computer-readable storage medium
CN110532388A (en) * 2019-08-15 2019-12-03 苏州朗动网络科技有限公司 Method, equipment and the storage medium of text cluster
CN110532388B (en) * 2019-08-15 2022-07-01 企查查科技有限公司 Text clustering method, equipment and storage medium
CN113127611A (en) * 2019-12-31 2021-07-16 北京中关村科金技术有限公司 Method and device for processing question corpus and storage medium
CN111343467A (en) * 2020-02-10 2020-06-26 腾讯科技(深圳)有限公司 Live broadcast data processing method and device, electronic equipment and storage medium
CN111343467B (en) * 2020-02-10 2021-10-26 腾讯科技(深圳)有限公司 Live broadcast data processing method and device, electronic equipment and storage medium
WO2022037446A1 (en) * 2020-08-20 2022-02-24 西南电子技术研究所(中国电子科技集团公司第十研究所) Front-page news prediction and classification method
US11436287B2 (en) 2020-12-07 2022-09-06 International Business Machines Corporation Computerized grouping of news articles by activity and associated phase of focus
CN112612889A (en) * 2020-12-28 2021-04-06 中科院计算技术研究所大数据研究院 Multilingual document classification method and device and storage medium
CN112612889B (en) * 2020-12-28 2021-10-29 中科院计算技术研究所大数据研究院 Multilingual document classification method and device and storage medium
CN113360600A (en) * 2021-06-03 2021-09-07 中国科学院计算机网络信息中心 Method and system for screening enterprise performance prediction indexes based on signal attenuation

Also Published As

Publication number Publication date
CN106599181B (en) 2019-06-18

Similar Documents

Publication Publication Date Title
CN106599181B (en) A kind of hot news detection method based on topic model
CN104216954B (en) The prediction meanss and Forecasting Methodology of accident topic state
US20170185680A1 (en) Chinese website classification method and system based on characteristic analysis of website homepage
CN104572977B (en) A kind of agricultural product quality and safety event online test method
Zhang et al. Multiresolution graph attention networks for relevance matching
CN106250513A (en) A kind of event personalization sorting technique based on event modeling and system
TWI695277B (en) Automatic website data collection method
CN102999615B (en) Based on variety of images mark and the search method of radial basis function neural network
Lee Unsupervised and supervised learning to evaluate event relatedness based on content mining from social-media streams
Yu et al. Question classification based on co-training style semi-supervised learning
Rafea et al. Topic detection approaches in identifying topics and events from arabic corpora
Papadopoulos et al. Image clustering through community detection on hybrid image similarity graphs
CN110222172A (en) A kind of multi-source network public sentiment Topics Crawling method based on improvement hierarchical clustering
CN103761286B (en) A kind of Service Source search method based on user interest
Hossain et al. A study towards Bangla fake news detection using machine learning and deep learning
Yuan et al. Research of deceptive review detection based on target product identification and metapath feature weight calculation
CN106844765B (en) Significant information detection method and device based on convolutional neural network
Konagala et al. Fake news detection using deep learning: supervised fake news detection analysis in social media with semantic similarity method
Huang et al. Tag refinement of micro-videos by learning from multiple data sources
Lin et al. GIF video sentiment detection using semantic sequence
CN107423294A (en) A kind of community image search method and system
Zhao et al. Lsif: A system for large-scale information flow detection based on topic-related semantic similarity measurement
Meng Text clustering and economic analysis of free trade zone governance strategies based on random matrix and subject analysis
Yang et al. Web service clustering method based on word vector and biterm topic model
Vahidnia et al. Document Clustering and Labeling for Research Trend Extraction and Evolution Mapping.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant