CN107423337A - News topic detection method based on an LDA fusion model and multi-level clustering

Info

Publication number: CN107423337A
Application number: CN201710289343.8A
Authority: CN
Inventors: 喻梅; 安永利; 于健; 于瑞国; 赵满坤; 谢晓东
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2017-04-27
Filing date: 2017-04-27
Publication date: 2017-12-01

Abstract

The invention belongs to the fields of data mining, natural language processing and information retrieval, and proposes a news topic detection method. It addresses the weak semantic sensitivity of vector space algorithms based on TF-IDF, as well as the time-complexity and accuracy shortcomings of hierarchical text clustering, by improving feature extraction, representation modeling, similarity calculation and fast, accurate text clustering for large volumes of news text. The news topic detection method based on an LDA fusion model and multi-level clustering comprises the following steps. Step 1: build a similarity model using the vector space model. Step 2: build a topic model using LDA and obtain accurate parameter settings. Step 3: fuse the two text models organically. Step 4: judge whether a topic is a new topic. Step 5: calculate similarity until all documents have been clustered. Step 6: on the basis of step 5, apply the ISP&AH clustering algorithm with AHC added. The invention is mainly applied in design and manufacturing settings.

Description

News topic detection method based on an LDA fusion model and multi-level clustering

Technical field

The invention belongs to the fields of data mining, natural language processing and information retrieval, and relates to monitoring technology and network information filtering technology, in particular to text analysis and topic detection methods. Specifically, it relates to a news topic detection method based on a Latent Dirichlet Allocation (LDA) fusion model and multi-level clustering.

Background art

Topic Detection and Tracking (TDT) developed in its early years from Event Detection and Tracking (EDT). It is a technology that automatically recognizes, mines and organizes news reports by content, without manual intervention. The vector space model (VSM) based on term frequency-inverse document frequency (TF-IDF) has shown strong capability in text representation. The vector space model is an algebraic model for representing text, and it is applied to information filtering, information retrieval, indexing and relevance ranking. Compared with the standard Boolean model, the vector space model is a simple model based on linear algebra: its term weights are not binary, it allows a continuous degree of similarity between documents and queries to be computed, it allows documents to be ranked by likely relevance, and it allows partial matching.

However, the vector space model also has shortcomings. It is not well suited to long documents, because their similarity values are poor owing to small inner products and high dimensionality. Because it starts from statistics, it ignores the semantic relations between texts, so its semantic sensitivity is poor. In addition, the order in which terms occur in a document cannot be represented in the vector, and its weights are obtained intuitively rather than formally.

Topic detection based on the single-pass clustering algorithm (Single-Pass) laid the foundation for research on the TDT framework. The Single-Pass algorithm clusters incrementally: it compares each text vector with the reports in the existing topics and matches them by text similarity. If the text matches a topic category, it is assigned to that topic; if its similarity to every topic category is below a given threshold, the text is established as a new seed topic.

The single-pass clustering algorithm also has defects. Because Single-Pass is sensitive to the input order of the news texts, its clustering quality declines as the number of news texts grows, and its accuracy is somewhat weak. Hierarchical text clustering gives good results, but its O(n²) time complexity and very high memory consumption limit the algorithm.

Summary of the invention

To overcome the deficiencies of the prior art, the invention proposes a news topic detection method based on an LDA fusion model and multi-level clustering. It addresses the semantic shortcomings of vector space algorithms based on TF-IDF, and the time-complexity and accuracy shortcomings of hierarchical text clustering, by improving feature extraction, representation modeling, similarity calculation and fast, accurate text clustering for large volumes of news text. The technical solution adopted by the invention is a news topic detection method based on an LDA fusion model and multi-level clustering, with the following steps:

Step 1: Build a similarity model using the vector space model. Each dimension of a VSM vector holds the corresponding term weight, and for two vectors d₁ and d₂ the similarity between them is computed with cosine similarity. The closer the cosine value is to 1, the smaller the angle between the two vectors, meaning their directions are more consistent and the similarity is higher; the closer the cosine value is to 0, the closer the two vectors are to orthogonal and the lower the similarity;

Step 2: Build a topic model using LDA. Gibbs sampling is used to compute the parameters of the model: a Markov chain is constructed by iteratively sampling values until it converges, which yields accurate parameter settings;

Step 3: Combine the LDA latent topic model with the VSM vector space model. Before the clustering algorithm as a whole is run, the text-topic relation matrix is fused with the VSM model based on TF-IDF weighting: the similarity obtained from the VSM model and the similarity obtained from the LDA model are combined linearly, and their weighted sum gives the final similarity value, so that the two text models are fused organically;

Step 4: Use the single-pass clustering algorithm Single-Pass. The text data is modeled with VSM and feature-term weights are assigned with TF-IDF, so that each report is represented as a vector. Each document in the stream is then compared for similarity against all topics formed so far in the clustering process, and the computed similarity is compared with a preset threshold to judge whether the document belongs to a new topic;

Step 5: Use the ISP clustering algorithm. A cached document stream is added to the Single-Pass algorithm of step 4: documents whose similarity in step 4 is below the preset threshold are placed in the cached document stream and their similarity is recalculated, until all documents have been clustered;

Step 6: On the basis of step 5, apply the ISP&AH clustering algorithm with AHC added. The similarity between every pair of documents is calculated and a document-by-document similarity matrix is built; the two documents with the highest similarity in the matrix are merged into one topic set, the new topic class replaces the two old documents that were merged, and the similarity matrix is recomputed and merging repeated iteratively until the stop condition is met.
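As an illustration of Step 1, the following minimal sketch builds TF-IDF vectors for tokenised texts and compares them with cosine similarity. The smoothed IDF formula, the helper names `tfidf_vectors` and `cosine`, and the toy documents are assumptions made for the example; the text does not specify them.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one L2-normalised sparse TF-IDF vector (a dict) per tokenised document."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumption: smoothed IDF, log((1 + n) / (1 + df)) + 1.
        vec = {t: (c / len(doc)) * (math.log((1 + n) / (1 + df[t])) + 1)
               for t, c in tf.items()}
        # Normalise so that texts of different length are comparable.
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [
    "earthquake rescue teams arrive".split(),
    "rescue teams reach earthquake zone".split(),
    "stock market closes higher".split(),
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # shared terms: clearly above 0
print(cosine(vecs[0], vecs[2]))  # no shared terms: 0.0
```

A value near 1 indicates a small angle between the vectors (high similarity), while a value of 0 indicates that the texts share no weighted terms.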

The method also includes a verification step: building a similarity model with VSM alone, building a topic model with LDA alone, and combining LDA with VSM are compared, and the three methods are evaluated by calculating the F-Measure, as shown in formula (1):

F-Measure=2 × Precision × Recall/ (Precision+Recall) (1)

In formula (1), Precision denotes the precision and Recall denotes the recall. Precision is the ratio of the number of relevant documents correctly retrieved to the total number of documents retrieved, and Recall is the ratio of the number of relevant documents correctly retrieved to the actual number of relevant documents. The larger the F-Measure, the better the result.
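Formula (1) can be checked with a small worked example. The sketch below is only illustrative: the function name `f_measure` and the document identifiers are assumptions, and the handling of empty sets (returning 0) is a convention chosen here, not something the text prescribes.

```python
def f_measure(retrieved, relevant):
    """Return (precision, recall, F-Measure) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    correct = len(retrieved & relevant)  # relevant documents correctly retrieved
    precision = correct / len(retrieved) if retrieved else 0.0
    recall = correct / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# 4 documents assigned to a topic, 5 truly belong to it, 3 of them overlap.
p, r, f = f_measure(retrieved=[1, 2, 3, 4], relevant=[2, 3, 4, 5, 6])
print(p, r, round(f, 4))  # 0.75 0.6 0.6667
```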

The specific steps in one example are as follows:

Step S0101: Build the VSM similarity model using TF-IDF. Texts of differing length produce an unbalanced weight distribution, which in turn biases the similarity calculation, so the text vectors must also be normalized;

Step S0201: Build the topic model using LDA. The parameters of the model are computed with Gibbs sampling by constructing a Markov chain, which yields accurate parameter settings; then, for two different texts d_i and d_j, the LDA topic model similarity Sim_LDA(d_i,d_j) based on their latent topic vectors is calculated;

Step S0301: Combine the LDA latent topic model with the VSM vector space model. The similarity Sim_TFIDF(d_i,d_j) of the TF-IDF weighted vector model is calculated and combined linearly with Sim_LDA(d_i,d_j), giving a final similarity that fuses the two results, as shown in formula (2);

Sim(d_i,d_j)=λ × Sim_TFIDF(d_i,d_j)+(1-λ)×Sim_LDA(d_i,d_j) (2)

Here λ is a user-defined linear weighting factor; its value sets the proportion in which the TF-IDF weighted VSM model and the topic-based LDA model are linearly scaled and summed;

Step S0401: Use the Single-Pass clustering algorithm. The text data is modeled with VSM, feature-term weights are assigned with TF-IDF, and each text is represented as a vector;

Step S0402: Compute the similarity between each text in the stream and all documents already clustered, obtain the maximum similarity MaxSim and record the corresponding topic TopicMax. Compare MaxSim with the preset threshold: if MaxSim is greater than the threshold, the text belongs to TopicMax; otherwise it is a new topic;

Step S0501: Use the ISP clustering algorithm. A cached document stream is added on the basis of step S0402: documents whose similarity is below the threshold are added to the cache stream, and the articles in the cache stream are clustered again. If the recalculated similarity is greater than the threshold, the topic is updated; otherwise the document is treated as a new topic, until all documents have been clustered;

Step S0601: Apply the ISP&AH clustering algorithm with AHC added. Topics with high similarity in the news texts are first gathered together; a second clustering pass is then run on the preliminary clustering result using hierarchical clustering, which further merges highly similar topics and so improves precision and recall.
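The linear fusion of formula (2) can be sketched as follows. This is a minimal illustration under stated assumptions: Sim_LDA is computed here as the cosine of two document-topic distributions (the text only says it is based on latent topic vectors), λ is assumed to lie in [0, 1], and the names `fused_similarity`, `lam` and the toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def fused_similarity(tfidf_i, tfidf_j, theta_i, theta_j, lam=0.5):
    """Formula (2): lam * Sim_TFIDF + (1 - lam) * Sim_LDA."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam is assumed to lie in [0, 1]")
    sim_tfidf = cosine(tfidf_i, tfidf_j)  # VSM similarity on TF-IDF vectors
    sim_lda = cosine(theta_i, theta_j)    # assumption: cosine on topic vectors
    return lam * sim_tfidf + (1.0 - lam) * sim_lda

# Two texts with no shared vocabulary but a similar topic mixture.
tfidf_a, tfidf_b = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]
theta_a, theta_b = [0.9, 0.1], [0.8, 0.2]
for lam in (0.0, 0.5, 1.0):
    print(lam, fused_similarity(tfidf_a, tfidf_b, theta_a, theta_b, lam))
```

With λ = 1 the toy pair scores 0 because the texts share no terms; lowering λ lets the topic-based similarity contribute, which is the semantic gap the fusion is meant to close.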

The features and beneficial effects of the invention are:

The fusion method adopted by the invention clearly helps in building an accurate model.

The invention combines the statistics-based method with the semantic-topic-based method so that each complements the other, improving the quality of text clustering. Multi-level news topic detection combines the ISP clustering algorithm with a hierarchical clustering algorithm to cluster at multiple levels and in greater depth. By improving the Single-Pass clustering algorithm, preliminary topic clustering is performed on the news texts, yielding a highly aggregated, low-granularity topic result that both meets the requirements of the next clustering pass and improves the clustering quality to some extent.

Brief description of the drawings

Fig. 1: Overall schematic.

Fig. 2: F-Measure comparison line chart for the three groups of modeling and clustering experiments.

Embodiment

The invention proposes a news topic detection method based on an LDA fusion model and multi-level clustering, comprising the following steps:

Step 1: Build a similarity model using VSM. Each dimension of a VSM vector holds the corresponding term weight, and for two vectors d₁ and d₂ the similarity between them is computed with cosine similarity. The closer the cosine value is to 1, the smaller the angle between the two vectors, meaning their directions are more consistent and the similarity is higher; the closer the cosine value is to 0, the closer the two vectors are to orthogonal and the lower the similarity.

Step 2: Build a topic model using LDA. Gibbs sampling is a method of generating a Markov chain. The parameters of the model are computed with Gibbs sampling: a Markov chain is constructed by iteratively sampling values until it converges, which yields accurate parameter settings.

Step 3: Combine the LDA latent topic model with the VSM vector space model. Before the clustering algorithm as a whole is run, the text-topic relation matrix is fused with the VSM model based on TF-IDF weighting: the similarity obtained from the VSM model and the similarity obtained from the LDA model are combined linearly, and their weighted sum gives the final similarity value, so that the two text models are fused organically.

Step 4: Use the traditional Single-Pass clustering algorithm. The text data is modeled with VSM and feature-term weights are assigned with TF-IDF, so that each report is represented as a vector. Each document in the stream is then compared for similarity against all topics formed so far in the clustering process, and the computed similarity is compared with a preset threshold to judge whether the document belongs to a new topic.

Step 5: Use the ISP clustering algorithm. A cached document stream is added to the Single-Pass algorithm of step 4: documents whose similarity in step 4 is below the preset threshold are placed in the cached document stream and their similarity is recalculated, until all documents have been clustered.

Step 6: On the basis of step 5, apply the ISP&AH clustering algorithm with AHC added. The similarity between every pair of documents is calculated and a document-by-document similarity matrix is built; the two documents with the highest similarity in the matrix are merged into one topic set, the new topic class replaces the two old documents that were merged, and the similarity matrix is recomputed and merging repeated iteratively until the stop condition is met.
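Steps 4 and 5 can be sketched together as one possible reading of Single-Pass with a cached document stream. Several details below are assumptions rather than the patented procedure: a topic is represented by the centroid of its member vectors, the first document seeds the first topic, cached documents are revisited once after the stream is exhausted, and the names `isp_cluster` and `best_topic` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def best_topic(vec, topics):
    """Return (MaxSim, index of TopicMax); (0.0, None) when there are no topics."""
    best_sim, best_idx = 0.0, None
    for idx, members in enumerate(topics):
        sim = cosine(vec, centroid(members))
        if best_idx is None or sim > best_sim:
            best_sim, best_idx = sim, idx
    return best_sim, best_idx

def isp_cluster(stream, threshold):
    """Single-Pass clustering with a cache for below-threshold documents."""
    topics, cache = [], []
    for vec in stream:
        if not topics:
            topics.append([vec])        # assumption: first document seeds a topic
            continue
        max_sim, idx = best_topic(vec, topics)
        if max_sim > threshold:
            topics[idx].append(vec)     # document joins TopicMax
        else:
            cache.append(vec)           # defer the decision instead of opening a topic
    for vec in cache:                   # re-cluster the cached document stream
        max_sim, idx = best_topic(vec, topics)
        if max_sim > threshold:
            topics[idx].append(vec)
        else:
            topics.append([vec])        # still dissimilar: treat as a new topic
    return topics

stream = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
topics = isp_cluster(stream, threshold=0.8)
print(len(topics), [len(t) for t in topics])  # 2 [2, 2]
```

Deferring below-threshold documents means they are compared against topics that have already absorbed their obvious members, which is one way to reduce the sensitivity to input order noted in the background section.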

The experiment compares three methods: building a similarity model with VSM alone, building a topic model with LDA alone, and combining LDA with VSM. The three methods are evaluated by calculating the F-Measure, as shown in formula (1).

F-Measure=2 × Precision × Recall/ (Precision+Recall) (1)

As shown in Fig. 2, on the 5 topics the F-Measure of the similarity model built with VSM is sometimes higher and sometimes lower than that of the topic model built with LDA, which shows that each modeling method has its own emphasis; the F-Measure of the VSM+LDA fusion model, however, is the highest. The experiment shows that the fusion method clearly helps in building an accurate model.

Meanwhile, to measure the effect of the improved algorithms on clustering quality, the invention calculates precision, recall and F-Measure for three groups of experiments: the traditional Single-Pass clustering algorithm alone, the ISP clustering algorithm alone, and the ISP&AH clustering algorithm with AHC added.

The LDA fusion model brings the semantic relations of the LDA topic model into the news text domain. The statistics-based method and the semantic-topic-based method are combined so that each complements the other, improving the quality of text clustering. Multi-level news topic detection combines the ISP clustering algorithm with a hierarchical clustering algorithm to cluster at multiple levels and in greater depth. By improving the Single-Pass clustering algorithm, preliminary topic clustering is performed on the news texts, yielding a highly aggregated, low-granularity topic result that both meets the requirements of the next clustering pass and improves the clustering quality to some extent.

The invention provides a news topic detection method based on an LDA fusion model and multi-level clustering. Fig. 1 is the overall schematic of a specific embodiment of the invention, which includes:

Step S0101: Build the VSM similarity model using TF-IDF. Texts of differing length produce an unbalanced weight distribution, which in turn biases the similarity calculation, so the text vectors must also be normalized, as shown in formula (2).

Step S0201: Build the topic model using LDA. The parameters of the model are computed with Gibbs sampling by constructing a Markov chain, which yields accurate parameter settings. Then, for two different texts d_i and d_j, the LDA topic model similarity Sim_LDA(d_i,d_j) based on their latent topic vectors is calculated.

Step S0301: Combine the LDA latent topic model with the VSM vector space model. The similarity Sim_TFIDF(d_i,d_j) of the TF-IDF weighted vector model is calculated and combined linearly with Sim_LDA(d_i,d_j), giving a final similarity that fuses the two results, as shown in formula (2).

Sim(d_i,d_j)=λ × Sim_TFIDF(d_i,d_j)+(1-λ)×Sim_LDA(d_i,d_j) (2)

Here λ is a user-defined linear weighting factor; its value sets the proportion in which the TF-IDF weighted VSM model and the topic-based LDA model are linearly scaled and summed.

Step S0401: Use the traditional Single-Pass clustering algorithm. The text data is modeled with VSM, feature-term weights are assigned with TF-IDF, and each text is represented as a vector.

Step S0402: Compute the similarity between each text in the stream and all documents already clustered, obtain the maximum similarity MaxSim and record the corresponding topic TopicMax. Compare MaxSim with the preset threshold: if MaxSim is greater than the threshold, the text belongs to TopicMax; otherwise it is a new topic.

Step S0501: Use the ISP clustering algorithm. A cached document stream is added on the basis of step S0402: documents whose similarity is below the threshold are added to the cache stream, and the articles in the cache stream are clustered again. If the recalculated similarity is greater than the threshold, the topic is updated; otherwise the document is treated as a new topic, until all documents have been clustered.

Step S0601: Apply the ISP&AH clustering algorithm with AHC added. Topics with high similarity in the news texts are first gathered together. A second clustering pass is then run on the preliminary clustering result using hierarchical clustering, which further merges highly similar topics and so improves precision and recall.
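The secondary hierarchical pass of step S0601 can be sketched as repeated merging of the most similar pair of preliminary topics. The details are assumptions made for illustration: topics are compared by the cosine of their centroids, the stop condition is the best pairwise similarity falling below `stop_threshold`, and the function name `merge_topics` is hypothetical; the text leaves the stop condition unspecified.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def merge_topics(topics, stop_threshold):
    """Merge the most similar pair of topics until none exceeds stop_threshold."""
    topics = [list(t) for t in topics]  # work on a copy
    while len(topics) > 1:
        cents = [centroid(t) for t in topics]
        # Scan the pairwise similarity matrix for its largest entry.
        sim, i, j = max((cosine(cents[a], cents[b]), a, b)
                        for a, b in combinations(range(len(topics)), 2))
        if sim < stop_threshold:        # assumed stop condition
            break
        merged = topics[i] + topics[j]  # new topic class replaces the two old ones
        topics = [t for k, t in enumerate(topics) if k not in (i, j)] + [merged]
    return topics

preliminary = [[[1.0, 0.0]], [[0.95, 0.05]], [[0.0, 1.0]]]
final = merge_topics(preliminary, stop_threshold=0.9)
print(sorted(len(t) for t in final))  # [1, 2]
```

Each iteration scans all topic pairs, so the cost grows quadratically with the number of inputs; feeding it the preliminary topics instead of every raw document keeps that matrix small.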

The news topic detection method based on an LDA fusion model and multi-level clustering of the invention compensates for the shortcoming that the TF-IDF-based vector space model ignores the semantic relations between texts in text representation, and improves the quality of text clustering. Meanwhile, by improving the Single-Pass clustering algorithm and performing preliminary topic clustering followed by hierarchical clustering on the news texts, it compensates for the high time complexity of hierarchical clustering and the low clustering accuracy of the traditional Single-Pass algorithm. It provides an effective method for text analysis and topic detection.

Claims

1. A news topic detection method based on an LDA fusion model and multi-level clustering, characterized in that the steps are as follows:

Step 1: Build a similarity model using the vector space model. Each dimension of a VSM vector holds the corresponding term weight, and for two vectors d₁ and d₂ the similarity between them is computed with cosine similarity. The closer the cosine value is to 1, the smaller the angle between the two vectors, meaning their directions are more consistent and the similarity is higher; the closer the cosine value is to 0, the closer the two vectors are to orthogonal and the lower the similarity;

Step 2: Build a topic model using LDA. Gibbs sampling is used to compute the parameters of the model: a Markov chain is constructed by iteratively sampling values until it converges, which yields accurate parameter settings;

Step 3: Combine the LDA latent topic model with the VSM vector space model. Before the clustering algorithm as a whole is run, the text-topic relation matrix is fused with the VSM model based on TF-IDF weighting: the similarity obtained from the VSM model and the similarity obtained from the LDA model are combined linearly, and their weighted sum gives the final similarity value, so that the two text models are fused organically;

Step 4: Use the single-pass clustering algorithm Single-Pass. The text data is modeled with VSM and feature-term weights are assigned with TF-IDF, so that each report is represented as a vector. Each document in the stream is then compared for similarity against all topics formed so far in the clustering process, and the computed similarity is compared with a preset threshold to judge whether the document belongs to a new topic;

Step 5: Use the ISP clustering algorithm. A cached document stream is added to the Single-Pass algorithm of step 4: documents whose similarity in step 4 is below the preset threshold are placed in the cached document stream and their similarity is recalculated, until all documents have been clustered;

Step 6: On the basis of step 5, apply the ISP&AH clustering algorithm with AHC added. The similarity between every pair of documents is calculated and a document-by-document similarity matrix is built; the two documents with the highest similarity in the matrix are merged into one topic set, the new topic class replaces the two old documents that were merged, and the similarity matrix is recomputed and merging repeated iteratively until the stop condition is met.

2. The news topic detection method based on an LDA fusion model and multi-level clustering as claimed in claim 1, characterized in that it also includes a verification step: building a similarity model with VSM alone, building a topic model with LDA alone, and combining LDA with VSM are compared, and the three methods are evaluated by calculating the F-Measure, as shown in formula (1):

F-Measure=2 × Precision × Recall/ (Precision+Recall) (1)

In formula (1), Precision denotes the precision and Recall denotes the recall. Precision is the ratio of the number of relevant documents correctly retrieved to the total number of documents retrieved, and Recall is the ratio of the number of relevant documents correctly retrieved to the actual number of relevant documents. The larger the F-Measure, the better the result.

3. The news topic detection method based on an LDA fusion model and multi-level clustering as claimed in claim 1, characterized in that the specific steps in one example are as follows:

Step S0101: Build the VSM similarity model using TF-IDF. Texts of differing length produce an unbalanced weight distribution, which in turn biases the similarity calculation, so the text vectors must also be normalized;

Step S0201: Build the topic model using LDA. The parameters of the model are computed with Gibbs sampling by constructing a Markov chain, which yields accurate parameter settings; then, for two different texts d_i and d_j, the LDA topic model similarity Sim_LDA(d_i,d_j) based on their latent topic vectors is calculated;

Step S0301: Combine the LDA latent topic model with the VSM vector space model. The similarity Sim_TFIDF(d_i,d_j) of the TF-IDF weighted vector model is calculated and combined linearly with Sim_LDA(d_i,d_j), giving a final similarity that fuses the two results, as shown in formula (2);

Sim(d_i,d_j)=λ × Sim_TFIDF(d_i,d_j)+(1-λ)×Sim_LDA(d_i,d_j) (2)

Here λ is a user-defined linear weighting factor; its value sets the proportion in which the TF-IDF weighted VSM model and the topic-based LDA model are linearly scaled and summed;

Step S0401: Use the Single-Pass clustering algorithm. The text data is modeled with VSM, feature-term weights are assigned with TF-IDF, and each text is represented as a vector;

Step S0501: Use the ISP clustering algorithm. A cached document stream is added on the basis of step S0402: documents whose similarity is below the threshold are added to the cache stream, and the articles in the cache stream are clustered again. If the recalculated similarity is greater than the threshold, the topic is updated; otherwise the document is treated as a new topic, until all documents have been clustered;

Step S0601: Apply the ISP&AH clustering algorithm with AHC added. Topics with high similarity in the news texts are first gathered together; a second clustering pass is then run on the preliminary clustering result using hierarchical clustering, which further merges highly similar topics and so improves precision and recall.