CN106599181B - Hot news detection method based on a topic model - Google Patents

Hot news detection method based on a topic model

Info

Publication number
CN106599181B
Authority
CN
China
Prior art keywords
theme
word
article
similarity
temperature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611145855.9A
Other languages
Chinese (zh)
Other versions
CN106599181A (en)
Inventor
庄郭冕
黄乔
彭志宇
付晗
王忆诗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Insigma Hengtian Software Ltd
Original Assignee
Insigma Hengtian Software Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Insigma Hengtian Software Ltd filed Critical Insigma Hengtian Software Ltd
Priority to CN201611145855.9A priority Critical patent/CN106599181B/en
Publication of CN106599181A publication Critical patent/CN106599181A/en
Application granted
Publication of CN106599181B publication Critical patent/CN106599181B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a hot news detection method based on a topic model. A news stream is crawled in a directed manner by a web crawler. The articles are first preprocessed: they are segmented into words, and stop words and meaningless character strings are removed. Features are then extracted from the preprocessed articles to build a text model, after which a text clustering algorithm adds highly similar texts to their most similar class, yielding a topic library. The similarity between new and old topics is then computed, highly similar topic pairs are merged, and finally the topic heat is computed and the hottest topics are selected by ranking. The invention innovatively applies the LDA algorithm to hot topic discovery and introduces the concept of burstiness, so that the newest hot news can be found promptly and effectively. It also introduces the concept of topic heat decay, so that topic heat can be recorded and tracked in real time, truly reflecting the development of hot news, which is of great significance for tracking and displaying hot news.

Description

Hot news detection method based on a topic model
Technical field
The present invention provides a hot news detection method based on a topic model. It involves core technologies and algorithms such as web crawling, cluster analysis, and text similarity computation, detects hot news promptly and effectively, and tracks the development of hot news.
Background art
With the development of Internet technology, the era of massive information has arrived and the Internet is flooded with all kinds of information, yet only a small amount of news causes a sensation, the so-called headline or hot news. Timely discovery of hot news can help people follow the state of society in real time.
On the other hand, a hot news story does not burst out and die in a flash; it is usually accompanied by a gradual development process and triggers other latent issues, so tracking the development of hot news is significant for research on social problems.
With the development of the Internet and the rise of big data, the Internet is awash in information, and discovering hot news within this large volume of low-quality information becomes crucially important.
Summary of the invention
In view of the complexity of today's Internet information, the object of the present invention is to provide a hot news detection method based on web crawling, cluster analysis, and a topic model.
The object of the invention is achieved through the following technical solution: a hot news detection method based on a topic model. A news stream is crawled in a directed manner by a web crawler. The articles are first preprocessed: segmented into words, with stop words and meaningless character strings removed. Features are then extracted from the preprocessed articles to build a text model, after which a text clustering algorithm adds highly similar texts to their most similar class, yielding a topic library. The similarity between new and old topics is then computed, highly similar topic pairs are merged, and finally the topic heat is computed and the hottest topics are selected by ranking. The method specifically includes the following steps:
(1) A news stream is crawled in a directed manner by a web crawler, and a batch is processed every time N new articles arrive; the crawled data are cleaned and the articles are segmented into words, yielding the preprocessed articles;
(2) Construct a vector space model: after preprocessing, an original document can be regarded as a bag of words. If the document is regarded as a vector, each word is one feature dimension; by converting documents into vectors, the text becomes structured data that a computer can process, and the similarity between two documents reduces to the similarity between two vectors. The weight of each dimension of a document vector is computed with an improved B-TFIDF algorithm, whose formulas are as follows:
In formulas (1) and (2), w denotes a word; A is the number of new articles containing w, B the number of new articles not containing w, C the number of historical articles containing w, and D the number of historical articles not containing w; d_i denotes the i-th new article, N the total number of new articles, tf(d, w) the term frequency of w in article d, and df(w) the number of articles containing w. The B-TFIDF algorithm accounts for the burstiness of a word, i.e., a word suddenly appearing in large quantities within a short period. The weight of every word in a document is computed by the algorithm above, yielding the document's vector space model D_i = (weight(d_i, w_1), weight(d_i, w_2), weight(d_i, w_3), ..., weight(d_i, w_n)), where n is the total number of words;
(3) Article clustering: after step 2 every text is represented as a vector, and the text vectors are clustered using the LDA topic model clustering algorithm, specifically:
LDA clustering: LDA is a three-layer Bayesian probability model comprising word, topic, and document layers. The generation of an article is regarded as the following process: a topic is chosen with a certain probability, and a word within that topic is chosen with a certain probability; the document-to-topic distribution and the topic-to-word distribution are both multinomial. LDA clustering yields a "topic-word" probability matrix phi and a "document-topic" probability matrix theta. From theta, m topics and the probability of each of the N articles under each topic are obtained: row i of theta represents an article and column j a topic, and the matrix entry theta_ij is the probability that article i belongs to topic j. A screening threshold thresholdT is set; if theta_ij > thresholdT, article i is considered to belong to topic j, and the articles of each topic are selected accordingly;
Determining the LDA cluster number m: LDA clustering is run repeatedly with the cluster number set from N/10 to N/5; for each run the average inter-topic similarity is computed, and the cluster number of the run with the lowest inter-topic similarity is selected. Inter-topic similarity is computed from the "topic-word" probability matrix phi produced by LDA clustering: row j of phi represents a topic T_j, column k represents a word w_k, and phi_jk is the probability that topic T_j contains word w_k. A row of phi can thus be regarded as the vector form of topic T_j, T_j = (w_1, w_2, w_3, ..., w_k, ..., w_n), where n is the total number of words. The pairwise similarities of all topics are computed and averaged, and the minimum over the runs is taken as the final inter-topic similarity. Similarity is computed as cosine similarity:
sim(T_i, T_j) = Σ_{k=1..n} ω_k(T_i)·ω_k(T_j) / ( sqrt(Σ_{k=1..n} ω_k(T_i)²) · sqrt(Σ_{k=1..n} ω_k(T_j)²) ) (3)
In formula (3), T_i and T_j denote two topics, ω_k(T_i) is the value of topic T_i on dimension k, and n is the total number of words;
(4) Topic keyword extraction: keywords are extracted from the titles of all articles under a topic. The titles are first segmented into words; stop words, meaningless words, and punctuation marks are filtered out, and the remaining words serve as the topic keywords;
(5) Topic merging: step 3 yields m topics and their corresponding articles; the m new topics are then merged with the old topics. The inter-topic similarity f1 is computed; if f1 > 0.5 the two topics are considered similar and are merged. The inter-topic similarity f1 is computed as:
f1 = 2·vectorSim·keywordSim / (vectorSim + keywordSim) (4)
In formula (4), vectorSim denotes the topic cosine similarity computed with all the words a topic contains as dimensions, and keywordSim denotes the topic cosine similarity computed with the topic keywords as dimensions; cosine similarity is computed as in formula (3);
(6) Heat computation: step 5 yields the final set of topics; the topic heat h is then computed, topics with high heat are retained, and topics with low heat, i.e., outdated topics, are removed. Based on the characteristic that hot topics have a high news concentration s, the heat is computed as:
h_t = Σ_i sim(d_i, t) (5)
In formula (5), d_i denotes an article contained in topic T; the heat h_t of topic T equals the sum of the similarities between the topic and the articles under it, where sim is as in formula (3);
As time goes on, the heat of a topic keeps decaying until it falls below the threshold and the topic is discarded. Heat decay: in each batch, if new articles arrive under topic T, its heat h_t increases accordingly, h_t = h_t · Up; if no new article joins topic T, its heat decays, h_t = h_t · Down, where Up > 1 and Down < 1.
The beneficial effects of the present invention are: the invention innovatively applies the LDA algorithm to hot topic discovery and introduces the concept of burstiness, so that the newest hot news can be found promptly and effectively. It also introduces the concept of topic heat decay, so that topic heat can be recorded and tracked in real time, truly reflecting the development of hot news, which is of great significance for tracking and displaying hot news.
Brief description of the drawings
Fig. 1 is a schematic diagram of the hot news detection process based on a topic model;
Fig. 2 is a schematic diagram of the article modeling process;
Fig. 3 is a schematic diagram of the LDA clustering process;
Fig. 4 is a schematic diagram of merging new and old topics;
Fig. 5 is a schematic diagram of topic heat computation.
Detailed description of the embodiments
The invention is described in further detail below with reference to the drawings and a specific embodiment.
As shown in Fig. 1, the hot news detection method based on a topic model proposed by the present invention includes the following steps:
(1) A news stream is crawled in a directed manner by a web crawler, and a batch is processed every time N articles arrive; the crawled data are cleaned and the articles are segmented into words, yielding the preprocessed articles;
(2) Construct a vector space model: as shown in Fig. 2, after preprocessing, an original document can be regarded as a bag of words. If the document is regarded as a vector, each word is one feature dimension; by converting documents into vectors, the text becomes structured data that a computer can process, and the similarity between two documents reduces to the similarity between two vectors. The weight of each dimension of a document vector is computed with an improved B-TFIDF algorithm, whose formulas are as follows:
In formulas 1 and 2, w denotes a word; A is the number of new articles containing w, B the number of new articles not containing w, C the number of historical articles containing w, and D the number of historical articles not containing w; d_i denotes the i-th new article, N the total number of new articles, tf(d, w) the term frequency of w in article d, and df(w) the number of articles containing w. The algorithm accounts for the burstiness of a word, i.e., a word suddenly appearing in large quantities within a short period. The weight of every word in a document is computed by the algorithm above, yielding the document's vector space model D_i = (weight(d_i, w_1), weight(d_i, w_2), weight(d_i, w_3), ..., weight(d_i, w_n)), where n is the total number of words.
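The exact B-TFIDF formulas 1 and 2 appear only in the patent drawings and are not reproduced here. As a rough illustration of burst-aware weighting, the sketch below multiplies a standard TF-IDF core by a hypothetical burst factor built from the counts A and C defined above; the `burst` ratio is an assumption for illustration, not the patent's formula:

```python
import math

def b_tfidf_weight(tf, df, n_new, a, c, n_hist):
    """Hypothetical burst-aware TF-IDF weight for one word in one article.

    tf: term frequency of the word in the article
    df: number of articles containing the word
    n_new: total number of new articles in the batch
    a: new articles containing the word (A); c: historical articles containing it (C)
    n_hist: total number of historical articles
    """
    tfidf = tf * math.log((n_new + 1) / (df + 1))      # standard TF-IDF core
    # Assumed burstiness: rate of the word in the new batch vs. its historical rate.
    burst = (a / n_new) / ((c + 1) / (n_hist + 1))
    return tfidf * burst

def doc_vector(word_stats, n_new, n_hist):
    """Build the vector space model D_i = (weight(d_i, w_1), ..., weight(d_i, w_n)).

    word_stats: one (tf, df, a, c) tuple per vocabulary word for this article.
    """
    return [b_tfidf_weight(tf, df, n_new, a, c, n_hist)
            for (tf, df, a, c) in word_stats]
```

Under this assumed factor, a word that suddenly appears in many new articles but few historical ones receives a much larger weight than an equally frequent non-bursty word.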
(3) Article clustering: after step 2 every text is represented as a vector, and the text vectors are clustered. As shown in Fig. 3, the LDA topic model clustering algorithm is used here. LDA is a three-layer Bayesian probability model comprising word, topic, and document layers; the generation of an article is regarded as the following process: a topic is chosen with a certain probability, and a word within that topic is chosen with a certain probability; the document-to-topic distribution and the topic-to-word distribution are both multinomial. LDA clustering yields a "topic-word" probability matrix and a "document-topic" probability matrix; the detailed process is described below.
LDA clustering: LDA clustering yields the "topic-word" probability matrix phi and the "document-topic" probability matrix theta. From theta, m topics and the probability of each of the N articles under each topic are obtained: row i of theta represents an article and column j a topic, and the matrix entry theta_ij is the probability that article i belongs to topic j. A screening threshold thresholdT is set (preferred value 0.32); if theta_ij > thresholdT, article i is considered to belong to topic j, and the articles of each topic are selected accordingly.
Determining the LDA cluster number m: since for N articles a cluster number between N/10 and N/5 fits reality well (for example, with a total of N = 150 new articles, a cluster number between 15 and 30 fits reality), LDA clustering is run repeatedly with the cluster number set from N/10 to N/5; for each run the average inter-topic similarity is computed, and the cluster number of the run with the lowest inter-topic similarity is selected. Inter-topic similarity is computed from the "topic-word" probability matrix phi produced by LDA clustering: row j of phi represents a topic T_j, column k represents a word w_k, and phi_jk is the probability that topic T_j contains word w_k. A row of phi can thus be regarded as the vector form of topic T_j, T_j = (w_1, w_2, w_3, ..., w_k, ..., w_n), where n is the total number of words. The pairwise similarities of all topics are computed and averaged, and the minimum over the runs is taken as the final inter-topic similarity. Similarity is computed as cosine similarity:
sim(T_i, T_j) = Σ_{k=1..n} ω_k(T_i)·ω_k(T_j) / ( sqrt(Σ_{k=1..n} ω_k(T_i)²) · sqrt(Σ_{k=1..n} ω_k(T_j)²) ) (3)
In formula 3, T_i and T_j denote two topics, ω_k(T_i) is the value of topic T_i on dimension k, and n is the total number of words.
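The model-selection loop above amounts to: compute the cosine similarity of formula 3 for every pair of topic rows of phi, average the pairwise values, and keep the cluster count whose run has the smallest average. A minimal sketch:

```python
import math
from itertools import combinations

def cosine(t_i, t_j):
    """Formula 3: cosine similarity of two topic vectors (rows of phi)."""
    dot = sum(a * b for a, b in zip(t_i, t_j))
    norm = math.sqrt(sum(a * a for a in t_i)) * math.sqrt(sum(b * b for b in t_j))
    return dot / norm if norm else 0.0

def avg_topic_similarity(phi):
    """Average pairwise cosine similarity over all topic rows of phi."""
    pairs = list(combinations(phi, 2))
    if not pairs:
        return 0.0
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

def best_cluster_count(runs):
    """runs: {m: phi matrix from the LDA run with m clusters}.

    Picks the m whose topics are least similar to one another.
    """
    return min(runs, key=lambda m: avg_topic_similarity(runs[m]))
```

Low average inter-topic similarity means the topics are well separated, which is why the minimum is preferred.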
(4) Topic keyword extraction: keywords are extracted from the titles of all articles under a topic. The titles are first segmented into words; stop words, meaningless words, and punctuation marks are filtered out, and the remaining words serve as the topic keywords.
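Topic keyword extraction then reduces to title segmentation plus a stop-word and punctuation filter. A sketch with a placeholder tokenizer and stop-word list; a real system would use a Chinese word segmenter and a proper stop-word dictionary:

```python
import re

STOP_WORDS = {"the", "a", "of", "in", "to", "and"}  # placeholder stop-word list

def topic_keywords(titles):
    """Segment every article title under a topic and keep the content words,
    deduplicated in first-seen order."""
    keywords = []
    for title in titles:
        for tok in re.findall(r"\w+", title.lower()):  # crude tokenizer stand-in
            if tok not in STOP_WORDS and tok not in keywords:
                keywords.append(tok)
    return keywords
```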
(5) Topic merging: step 3 yields m topics and their corresponding articles; the m new topics are then merged with the old topics. As shown in Fig. 4, the inter-topic similarity f1 is computed; if f1 > 0.5 the two topics are considered similar and are merged. The inter-topic similarity f1 is computed as:
f1 = 2·vectorSim·keywordSim / (vectorSim + keywordSim) (4)
In formula 4, vectorSim denotes the topic cosine similarity computed with all the words a topic contains as dimensions, and keywordSim denotes the topic cosine similarity computed with the topic keywords as dimensions; cosine similarity is computed as in formula 3.
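Formula 4 is the harmonic mean of the two cosine similarities, so the merge decision can be sketched as:

```python
def merge_score(vector_sim, keyword_sim):
    """Formula 4: harmonic mean of the full-vocabulary cosine similarity
    and the keyword-only cosine similarity."""
    if vector_sim + keyword_sim == 0:
        return 0.0
    return 2 * vector_sim * keyword_sim / (vector_sim + keyword_sim)

def should_merge(vector_sim, keyword_sim, threshold=0.5):
    """Two topics are merged when f1 exceeds the 0.5 threshold stated above."""
    return merge_score(vector_sim, keyword_sim) > threshold
```

The harmonic mean penalizes disagreement: both similarities must be reasonably high for a merge, so one high value cannot mask a low one.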
(6) Heat computation: as shown in Fig. 5, step 5 yields the final set of topics; the topic heat h is then computed, topics with high heat are retained, and topics with low heat, i.e., outdated topics, are removed. Based on the characteristic that hot topics have a high news concentration s, the heat is computed as:
h_t = Σ_i sim(d_i, t) (5)
In formula 5, d_i denotes an article contained in topic T; the heat h_t of topic T equals the sum of the similarities between the topic and the articles under it, where sim is as in formula 3.
As time goes on, the heat of a topic keeps decaying until it falls below the threshold and the topic is discarded. Heat decay: in each batch, if new articles arrive under topic T, its heat h_t increases accordingly, h_t = h_t · Up (preferred value 1.05); if no new article joins topic T, its heat decays, h_t = h_t · Down (preferred value 0.9).
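The per-batch heat bookkeeping, i.e. formula 5 plus the multiplicative Up/Down update with the preferred values 1.05 and 0.9, can be sketched as:

```python
def initial_heat(similarities):
    """Formula 5: h_t = sum of similarities between the topic and its articles."""
    return sum(similarities)

def update_heat(heat, got_new_articles, up=1.05, down=0.9):
    """Per-batch update: boost the heat when new articles join the topic,
    decay it otherwise (Up > 1, Down < 1)."""
    return heat * (up if got_new_articles else down)

def prune(topics_heat, threshold):
    """Drop topics whose heat has fallen below the threshold."""
    return {t: h for t, h in topics_heat.items() if h >= threshold}
```

With Down = 0.9, a topic that stops receiving articles loses about two thirds of its heat within ten batches, which is how outdated topics eventually fall below the rejection threshold.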

Claims (1)

1. A hot news detection method based on a topic model, characterized by comprising the following steps:
(1) A news stream is crawled in a directed manner by a web crawler, and a batch is processed every time N new articles arrive; the crawled data are cleaned and the articles are segmented into words, yielding the preprocessed articles;
(2) Construct a vector space model: after preprocessing, an original document can be regarded as a bag of words. If the document is regarded as a vector, each word is one feature dimension; by converting documents into vectors, the text becomes structured data that a computer can process, and the similarity between two documents reduces to the similarity between two vectors. The weight of each dimension of a document vector is computed with an improved B-TFIDF algorithm, whose formulas are as follows:
In formulas (1) and (2), w denotes a word; A is the number of new articles containing w, B the number of new articles not containing w, C the number of historical articles containing w, and D the number of historical articles not containing w; d_i denotes the i-th new article, N the total number of new articles, tf(d, w) the term frequency of w in article d, and df(w) the number of articles containing w; the B-TFIDF algorithm accounts for the burstiness of a word, i.e., a word suddenly appearing in large quantities within a short period; the weight of every word in a document is computed by the algorithm above, yielding the document's vector space model D_i = (weight(d_i, w_1), weight(d_i, w_2), weight(d_i, w_3), ..., weight(d_i, w_n)), where n is the total number of words;
(3) Article clustering: after step 2 every text is represented as a vector, and the text vectors are clustered using the LDA topic model clustering algorithm, specifically:
LDA clustering: LDA is a three-layer Bayesian probability model comprising word, topic, and document layers; the generation of an article is regarded as the following process: a topic is chosen with a certain probability, and a word within that topic is chosen with a certain probability; the document-to-topic distribution and the topic-to-word distribution are both multinomial; LDA clustering yields a "topic-word" probability matrix phi and a "document-topic" probability matrix theta; from theta, m topics and the probability of each of the N articles under each topic are obtained: row i of theta represents an article and column j a topic, and the matrix entry theta_ij is the probability that article i belongs to topic j; a screening threshold thresholdT is set; if theta_ij > thresholdT, article i is considered to belong to topic j, and the articles of each topic are selected accordingly;
Determining the LDA cluster number m: LDA clustering is run repeatedly with the cluster number set from N/10 to N/5; for each run the average inter-topic similarity is computed, and the cluster number of the run with the lowest inter-topic similarity is selected; inter-topic similarity is computed from the "topic-word" probability matrix phi produced by LDA clustering: row j of phi represents a topic T_j, column k represents a word w_k, and phi_jk is the probability that topic T_j contains word w_k; a row of phi can thus be regarded as the vector form of topic T_j, T_j = (w_1, w_2, w_3, ..., w_k, ..., w_n), where n is the total number of words; the pairwise similarities of all topics are computed and averaged, and the minimum over the runs is taken as the final inter-topic similarity; similarity is computed as cosine similarity:
sim(T_i, T_j) = Σ_{k=1..n} ω_k(T_i)·ω_k(T_j) / ( sqrt(Σ_{k=1..n} ω_k(T_i)²) · sqrt(Σ_{k=1..n} ω_k(T_j)²) ) (3)
In formula (3), T_i and T_j denote two topics, ω_k(T_i) is the value of topic T_i on dimension k, and n is the total number of words;
(4) Topic keyword extraction: keywords are extracted from the titles of all articles under a topic; the titles are first segmented into words, stop words, meaningless words, and punctuation marks are filtered out, and the remaining words serve as the topic keywords;
(5) Topic merging: step 3 yields m topics and their corresponding articles; the m new topics are then merged with the old topics; the inter-topic similarity f1 is computed; if f1 > 0.5 the two topics are considered similar and are merged; the inter-topic similarity f1 is computed as:
f1 = 2·vectorSim·keywordSim / (vectorSim + keywordSim) (4)
In formula (4), vectorSim denotes the topic cosine similarity computed with all the words a topic contains as dimensions, and keywordSim denotes the topic cosine similarity computed with the topic keywords as dimensions; cosine similarity is computed as in formula (3);
(6) Heat computation: step 5 yields the final set of topics; the topic heat h is then computed, topics with high heat are retained, and topics with low heat, i.e., outdated topics, are removed; based on the characteristic that hot topics have a high news concentration s, the heat is computed as:
h_t = Σ_i sim(d_i, t) (5)
In formula (5), d_i denotes an article contained in topic T; the heat h_t of topic T equals the sum of the similarities between the topic and the articles under it, where sim is as in formula (3);
As time goes on, the heat of a topic keeps decaying until it falls below the threshold and the topic is discarded; heat decay: in each batch, if new articles arrive under topic T, its heat h_t increases accordingly, h_t = h_t · Up; if no new article joins topic T, its heat decays, h_t = h_t · Down, where Up > 1 and Down < 1.
CN201611145855.9A 2016-12-13 2016-12-13 A kind of hot news detection method based on topic model Active CN106599181B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611145855.9A CN106599181B (en) 2016-12-13 2016-12-13 A kind of hot news detection method based on topic model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611145855.9A CN106599181B (en) 2016-12-13 2016-12-13 A kind of hot news detection method based on topic model

Publications (2)

Publication Number Publication Date
CN106599181A CN106599181A (en) 2017-04-26
CN106599181B true CN106599181B (en) 2019-06-18

Family

ID=58802054

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611145855.9A Active CN106599181B (en) 2016-12-13 2016-12-13 A kind of hot news detection method based on topic model

Country Status (1)

Country Link
CN (1) CN106599181B (en)

Families Citing this family (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107423337A (en) * 2017-04-27 2017-12-01 天津大学 News topic detection method based on LDA Fusion Models and multi-level clustering
CN107239497B (en) * 2017-05-02 2020-11-03 广东万丈金数信息技术股份有限公司 Hot content search method and system
CN107203632B (en) * 2017-06-01 2019-08-16 中国人民解放军国防科学技术大学 Topic Popularity prediction method based on similarity relation and cooccurrence relation
CN107330049B (en) * 2017-06-28 2020-05-22 北京搜狐新媒体信息技术有限公司 News popularity estimation method and system
CN107835113B (en) * 2017-07-05 2020-09-08 中山大学 Method for detecting abnormal user in social network based on network mapping
CN107451224A (en) * 2017-07-17 2017-12-08 广州特道信息科技有限公司 A kind of clustering method and system based on big data parallel computation
CN107563725B (en) * 2017-08-25 2021-04-06 浙江网新恒天软件有限公司 Recruitment system for optimizing fussy talent recruitment process
CN107656919B (en) * 2017-09-12 2018-10-26 中国软件与技术服务股份有限公司 A kind of optimal L DA Automatic Model Selection methods based on minimum average B configuration similarity between theme
CN107918644B (en) * 2017-10-31 2020-12-08 北京锐思爱特咨询股份有限公司 News topic analysis method and implementation system in reputation management framework
CN107832418A (en) * 2017-11-08 2018-03-23 郑州云海信息技术有限公司 A kind of much-talked-about topic finds method, system and a kind of much-talked-about topic discovering device
CN107992542A (en) * 2017-11-27 2018-05-04 中山大学 A kind of similar article based on topic model recommends method
CN108153818B (en) * 2017-11-29 2021-08-10 成都东方盛行电子有限责任公司 Big data based clustering method
CN107784127A (en) * 2017-11-30 2018-03-09 杭州数梦工场科技有限公司 A kind of focus localization method and device
CN107862089B (en) * 2017-12-02 2020-03-13 北京工业大学 Label extraction method based on perception data
CN108090157B (en) * 2017-12-12 2018-11-06 百度在线网络技术(北京)有限公司 A kind of hot news method for digging, device and server
CN110888978A (en) * 2018-09-06 2020-03-17 北京京东金融科技控股有限公司 Article clustering method and device, electronic equipment and storage medium
CN110096649B (en) * 2019-05-14 2021-07-30 武汉斗鱼网络科技有限公司 Post extraction method, device, equipment and storage medium
CN110532388B (en) * 2019-08-15 2022-07-01 企查查科技有限公司 Text clustering method, equipment and storage medium
CN110609938A (en) * 2019-08-15 2019-12-24 平安科技(深圳)有限公司 Text hotspot discovery method and device and computer-readable storage medium
CN113127611B (en) * 2019-12-31 2024-05-14 北京中关村科金技术有限公司 Method, device and storage medium for processing question corpus
CN111343467B (en) * 2020-02-10 2021-10-26 腾讯科技(深圳)有限公司 Live broadcast data processing method and device, electronic equipment and storage medium
CN112100372B (en) * 2020-08-20 2022-08-30 西南电子技术研究所(中国电子科技集团公司第十研究所) Head news prediction classification method
US11436287B2 (en) 2020-12-07 2022-09-06 International Business Machines Corporation Computerized grouping of news articles by activity and associated phase of focus
CN112612889B (en) * 2020-12-28 2021-10-29 中科院计算技术研究所大数据研究院 Multilingual document classification method and device and storage medium
CN112784042A (en) * 2021-01-12 2021-05-11 北京明略软件系统有限公司 Text similarity calculation method and system combining article structure and aggregated word vector
CN113360600A (en) * 2021-06-03 2021-09-07 中国科学院计算机网络信息中心 Method and system for screening enterprise performance prediction indexes based on signal attenuation

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140019460A1 (en) * 2012-07-12 2014-01-16 Yahoo! Inc. Targeted search suggestions
CN104699814A (en) * 2015-03-24 2015-06-10 清华大学 Searching method and system of hot spot information
CN106156276B (en) * 2016-06-25 2019-07-19 贵州大学 Hot news based on Pitman-Yor process finds method

Also Published As

Publication number Publication date
CN106599181A (en) 2017-04-26

Similar Documents

Publication Publication Date Title
CN106599181B (en) A kind of hot news detection method based on topic model
Wu et al. A posterior-neighborhood-regularized latent factor model for highly accurate web service QoS prediction
Zhang et al. Multiresolution graph attention networks for relevance matching
Lee Unsupervised and supervised learning to evaluate event relatedness based on content mining from social-media streams
Kumar et al. ESUMM: event summarization on scale-free networks
Tzelepis et al. Learning to detect video events from zero or very few video examples
Trattner et al. Exploring the differences and similarities between hierarchical decentralized search and human navigation in information networks
KR20190124986A (en) Searching Method for Related Law
Trokhymovych et al. Wikicheck: An end-to-end open source automatic fact-checking api based on wikipedia
CN110309355A (en) Generation method, device, equipment and the storage medium of content tab
Huang et al. Tag refinement of micro-videos by learning from multiple data sources
Guo [Retracted] Intelligent Sports Video Classification Based on Deep Neural Network (DNN) Algorithm and Transfer Learning
Zhao et al. Lsif: A system for large-scale information flow detection based on topic-related semantic similarity measurement
Yang et al. Web service clustering method based on word vector and biterm topic model
Yang et al. A hot topic detection approach on Chinese microblogging
Ren et al. Role-explicit query extraction and utilization for quantifying user intents
Yu et al. Learning cross space mapping via DNN using large scale click-through logs
Toriah et al. Semantic-based video retrieval survey
Xu et al. BigVid at MediaEval 2016: predicting interestingness in images and videos
Ksibi et al. Flickr-based semantic context to refine automatic photo annotation
Feng et al. Implementation of Short Video Click‐Through Rate Estimation Model Based on Cross‐Media Collaborative Filtering Neural Network
Zezula Similarity searching for database applications
Yu et al. Interpretative topic categorization via deep multiple instance learning
Xu et al. A transformer based multimodal fine-fusion model for false information detection
Hai et al. Improving The Efficiency of Semantic Image Retrieval using A Combined Graph and SOM Model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant