CN107423337A - News topic detection method based on LDA Fusion Models and multi-level clustering - Google Patents

News topic detection method based on LDA Fusion Models and multi-level clustering

Info

Publication number
CN107423337A
Authority
CN
China
Prior art keywords
similarity
topic
lda
models
clustering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710289343.8A
Other languages
Chinese (zh)
Inventor
喻梅
安永利
于健
于瑞国
赵满坤
谢晓东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2017-04-27
Filing date
2017-04-27
Publication date
2017-12-01
Application filed by Tianjin University
Priority to CN201710289343.8A
Publication of CN107423337A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Abstract

The invention belongs to the fields of data mining, natural language processing and information retrieval, and proposes a news topic detection method. It addresses the semantic weakness of TF-IDF-based vector space algorithms and the time-complexity and accuracy problems of hierarchical text clustering, improving feature extraction, representation modeling, similarity computation and fast, accurate text clustering for large volumes of news text. The news topic detection method based on an LDA fusion model and multi-level clustering comprises the following steps. Step 1: build a similarity model using the vector space model. Step 2: obtain accurate parameter settings for the LDA topic model. Step 3: fuse the two text models organically. Step 4: judge whether a topic is a new topic. Step 5: compute similarity until all documents have been clustered. Step 6: on the basis of step 5, apply the ISP&AH clustering algorithm, which adds AHC. The invention is mainly applicable to design and manufacturing settings.

Description

News topic detection method based on LDA Fusion Models and multi-level clustering
Technical field
The invention belongs to the fields of data mining, natural language processing and information retrieval, and relates to monitoring technology and network information filtering technology, in particular to text analysis and topic detection methods. Specifically, it relates to a news topic detection method based on a Latent Dirichlet Allocation (LDA) fusion model and multi-level clustering.
Background art
Topic detection and tracking (TDT) grew out of the earlier event detection and tracking (EDT). It is a technology that automatically identifies, mines and organizes the content of news reports without manual intervention. The vector space model (VSM) based on term frequency-inverse document frequency (TF-IDF) has shown strong capability in text representation. The vector space model is an algebraic model for representing text, used in information filtering, information retrieval, indexing and relevance ranking. Compared with the standard Boolean model, the vector space model is a simple model based on linear algebra whose term weights are not binary. It allows a continuous degree of similarity between documents and the index to be computed, allows documents to be ranked by likely relevance, and allows partial matching.
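The TF-IDF weighting behind the vector space model can be sketched as follows. This is a minimal illustration, assuming documents are already tokenized and non-empty, and using the common tf × log(N/df) variant; the patent does not state which TF-IDF variant it uses, and the function name `tfidf_vectors` and the sample tokens are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized, non-empty document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            # Relative term frequency times log inverse document frequency
            # (one common variant, assumed here).
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors


# Hypothetical toy corpus of three tokenized news snippets.
docs = [
    ["election", "vote", "result"],
    ["election", "candidate", "debate"],
    ["storm", "flood", "rescue"],
]
vectors = tfidf_vectors(docs)
```

With this variant, a term that occurs in every document receives weight 0, because log(N/df) is then log(1).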
However, the vector space model also has shortcomings. It is poorly suited to long documents, whose similarity values are unsatisfactory because of small inner products and high dimensionality. Because it is purely statistical, it ignores the semantic relations between texts, so its semantic sensitivity is poor. In addition, the order in which terms appear in a document cannot be represented in the vector, and the weights are obtained intuitively rather than formally.
Topic detection and tracking based on the single-pass clustering algorithm (Single-Pass) laid the foundation for research on the TDT framework. The Single-Pass algorithm clusters incrementally: it compares each incoming text vector with the reports in the existing topics and matches them by text similarity. If the text matches a topic category, it is assigned to that topic; if its similarity to every existing topic category is below a certain threshold, the text becomes a new seed topic.
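The incremental behaviour described above can be sketched in a few lines. This is an illustration under stated assumptions, not the patent's implementation: vectors are assumed to be already L2-normalized (so the dot product equals cosine similarity), and a document's similarity to a topic is taken as its maximum similarity to any member, which the text does not specify. The names `dot` and `single_pass` are hypothetical.

```python
def dot(a, b):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def single_pass(vectors, threshold):
    """Assign each document, in arrival order, to a topic or open a new one."""
    topics = []  # each topic is a list of document indices
    for i, vec in enumerate(vectors):
        best_sim, best_topic = 0.0, None
        for topic in topics:
            # Assumption: topic similarity = best match among its members.
            sim = max(dot(vec, vectors[j]) for j in topic)
            if sim > best_sim:
                best_sim, best_topic = sim, topic
        if best_topic is not None and best_sim >= threshold:
            best_topic.append(i)
        else:
            topics.append([i])  # below threshold everywhere: new seed topic
    return topics


# Four unit vectors arriving as a stream (toy data).
stream = [(1.0, 0.0), (0.8, 0.6), (0.0, 1.0), (0.6, 0.8)]
topics = single_pass(stream, threshold=0.75)
```

Because each decision depends only on the topics built so far, feeding the same documents in a different order can produce a different partition.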
The single-pass clustering algorithm also has defects. Because Single-Pass is sensitive to the input order of the news texts, its clustering quality declines as the number of news texts grows, and its accuracy is somewhat weak. Hierarchical text clustering gives good results, but its O(n²) time complexity and very high memory consumption limit its use.
Summary of the invention
To overcome the deficiencies of the prior art, the invention proposes a news topic detection method based on an LDA fusion model and multi-level clustering. It addresses the semantic weakness of TF-IDF-based vector space algorithms and the time-complexity and accuracy problems of hierarchical text clustering, improving feature extraction, representation modeling, similarity computation and fast, accurate text clustering for large volumes of news text. The technical solution adopted by the invention is a news topic detection method based on an LDA fusion model and multi-level clustering, with the following steps:
Step 1: Build a similarity model using the vector space model. Each dimension of the VSM model represents a term weight. For two vectors d1 and d2, their similarity is computed with cosine similarity: the closer the cosine value is to 1, the smaller the angle between the two vectors, meaning their directions are more consistent and the similarity is higher; the closer the cosine value is to 0, the lower the similarity;
Step 2: Build a topic model using LDA. The model parameters are computed by Gibbs sampling: a Markov chain is constructed by iteratively sampling values and run until it converges, which yields accurate parameter settings;
Step 3: Combine the LDA latent topic model with the VSM vector space model. Before the overall clustering algorithm runs, the text-topic relation matrix is fused with the VSM model based on TF-IDF weights: the similarity obtained from the VSM model and the similarity obtained from the LDA model are combined linearly, and the weighted sum gives the final similarity value, so that the two text models are fused organically;
Step 4: Apply the single-pass clustering algorithm Single-Pass. The text data is modeled with VSM, with feature-word weights assigned by TF-IDF, so that each report is represented as a vector. The similarity between each document in the document stream and all topics in the clustering process is then computed, and the computed similarity is compared with a preset threshold to judge whether the document belongs to a new topic;
Step 5: Use the ISP clustering algorithm. A cached document stream is added on top of the Single-Pass algorithm of step 4. Documents whose similarity in step 4 is below the preset threshold are placed in the cached document stream and their similarity is recomputed, until all documents have been clustered;
Step 6: On the basis of step 5, apply the ISP&AH clustering algorithm, which adds AHC. The similarity between every pair of documents is computed and a document-by-document similarity matrix is built. The two documents with the highest similarity in the matrix are merged into one topic set, this new topic class replaces the two merged documents, and the similarity matrix is recomputed and merging repeated iteratively until the stop condition is met.
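The cosine similarity used in step 1 can be sketched as follows, assuming dense weight vectors of equal length; the function name `cosine_similarity` and the sample vectors are illustrative only.

```python
import math


def cosine_similarity(d1, d2):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(x * y for x, y in zip(d1, d2))
    norm1 = math.sqrt(sum(x * x for x in d1))
    norm2 = math.sqrt(sum(y * y for y in d2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm1 * norm2)


# Same direction, different length: cosine is (up to rounding) 1.
same_direction = cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0])
# No shared non-zero dimension: cosine is 0.
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 3.0, 0.0])
```

Dividing by the two norms removes the effect of vector length, so a long and a short text with the same term proportions still score close to 1.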
The method also includes a verification step: using VSM alone to build a similarity model, using LDA alone to build a topic model, and combining LDA with VSM are compared, and the three methods are evaluated by computing the F-Measure, as shown in formula (1):
F-Measure = 2 × Precision × Recall / (Precision + Recall)    (1)
In formula (1), Precision denotes precision and Recall denotes recall. Precision is the ratio of the number of relevant documents correctly retrieved to the total number of documents retrieved, and Recall is the ratio of the number of relevant documents correctly retrieved to the actual number of relevant documents. The larger the F-Measure, the better the prediction result.
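Formula (1) can be worked through on a small case. The sketch below assumes the retrieved and relevant documents are given as sets of identifiers; the function name `f_measure` and the document IDs are hypothetical.

```python
def f_measure(retrieved, relevant):
    """Return (precision, recall, F-Measure) as defined in formula (1)."""
    retrieved, relevant = set(retrieved), set(relevant)
    correct = len(retrieved & relevant)  # relevant documents correctly retrieved
    precision = correct / len(retrieved) if retrieved else 0.0
    recall = correct / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# 4 documents retrieved, 3 actually relevant, 2 in common:
# precision = 2/4, recall = 2/3, F-Measure = 4/7.
p, r, f = f_measure(retrieved={"d1", "d2", "d3", "d4"},
                    relevant={"d1", "d2", "d5"})
```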
The specific steps in one example are as follows:
Step S0101: Build the VSM similarity model using TF-IDF. Differences in text length make the weight distribution uneven, which in turn biases the similarity computation, so the text vectors must also be normalized;
Step S0201: Build the topic model using LDA. The model parameters are computed by Gibbs sampling, which constructs a Markov chain and yields accurate parameter settings. Then, for two different texts d_i and d_j, the LDA topic model similarity Sim_LDA(d_i, d_j) based on their latent topic vectors is computed;
Step S0301: Combine the LDA latent topic model with the VSM vector space model. The similarity Sim_TFIDF(d_i, d_j) of the TF-IDF weight vector model is computed and combined linearly with Sim_LDA(d_i, d_j) to obtain the final similarity that fuses the two results, as shown in formula (2);
Sim(d_i, d_j) = λ × Sim_TFIDF(d_i, d_j) + (1 - λ) × Sim_LDA(d_i, d_j)    (2)
where λ is a user-defined linear weighting factor whose value sets the ratio in which the TF-IDF-based VSM model and the topic-based LDA model are combined in the weighted sum;
Step S0401: Apply the Single-Pass clustering algorithm. The text data is modeled with VSM, feature-word weights are assigned by TF-IDF, and each text is represented as a vector;
Step S0402: Compute the similarity between each text in the text stream and all existing topics in the clustering process, obtain the maximum similarity MaxSim and record the corresponding topic TopicMax. Compare MaxSim with the preset threshold: if MaxSim is greater than the threshold, the text is assigned to TopicMax; otherwise it is a new topic;
Step S0501: Use the ISP clustering algorithm. A cached document stream is added on top of step S0402: documents whose similarity is below the threshold are added to the cache stream, and the documents in the cache stream are clustered again. If the recomputed similarity is greater than the threshold, the topic is updated; otherwise the document is treated as a new topic. This continues until all documents have been clustered;
Step S0601: Apply the ISP&AH clustering algorithm, which adds AHC. Highly similar topics in the news texts are first gathered together; hierarchical clustering then performs a secondary clustering on the preliminary clustering result, further merging topics with high similarity to improve precision and recall.
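The weighted sum of formula (2) in step S0301 is simple enough to sketch directly. The check that λ lies in [0, 1] is an assumption added here (the text only calls λ a user-defined factor), and the function name `fused_similarity` and the sample similarity values are hypothetical.

```python
def fused_similarity(sim_tfidf, sim_lda, lam):
    """Formula (2): lam * Sim_TFIDF + (1 - lam) * Sim_LDA."""
    # Assumption: lam is restricted to [0, 1] so the result stays a
    # convex combination of the two similarities.
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam is assumed to lie in [0, 1]")
    return lam * sim_tfidf + (1.0 - lam) * sim_lda


# Two texts with little word overlap (low TF-IDF similarity) but a
# similar topic mix (high LDA similarity); values are made up.
equal_weight = fused_similarity(sim_tfidf=0.2, sim_lda=0.8, lam=0.5)
vsm_only = fused_similarity(sim_tfidf=0.2, sim_lda=0.8, lam=1.0)
lda_only = fused_similarity(sim_tfidf=0.2, sim_lda=0.8, lam=0.0)
```

Setting λ to 1 or 0 recovers the VSM-only and LDA-only similarities, which correspond to the two single-model baselines that the fusion model is compared against.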
The features and beneficial effects of the invention are:
The fusion method adopted by the invention clearly helps build an accurate model.
The invention combines the statistics-based method with the semantic-topic-based method so that each compensates for the other, improving text clustering quality. Multi-level news topic detection combines the ISP clustering algorithm with a hierarchical clustering algorithm to cluster at multiple levels and in greater depth. By improving the Single-Pass clustering algorithm, preliminary topic clustering is performed on the news texts, yielding a highly cohesive, low-granularity topic aggregation result that both meets the requirements of the next clustering stage and improves the clustering effect to some extent.
Brief description of the drawings:
Fig. 1 is an overall schematic diagram.
Fig. 2 is a line chart comparing the F-Measure of the three groups of modeling and clustering experiments.
Detailed description of the embodiments
The invention proposes a news topic detection method based on an LDA fusion model and multi-level clustering, comprising the following steps:
Step 1: Build a similarity model using VSM. Each dimension of the VSM model represents a term weight. For two vectors d1 and d2, their similarity is computed with cosine similarity. The closer the cosine value is to 1, the smaller the angle between the two vectors, meaning their directions are more consistent and the similarity is higher; the closer the cosine value is to 0, the lower the similarity.
Step 2: Build a topic model using LDA. Gibbs sampling is a method of generating a Markov chain. The model parameters are computed by Gibbs sampling: the Markov chain is constructed by iteratively sampling values and run until it converges, which yields accurate parameter settings.
Step 3: Combine the LDA latent topic model with the VSM vector space model. Before the overall clustering algorithm runs, the text-topic relation matrix is fused with the VSM model based on TF-IDF weights: the similarity obtained from the VSM model and the similarity obtained from the LDA model are combined linearly, and the weighted sum gives the final similarity value, so that the two text models are fused organically.
Step 4: Use the traditional Single-Pass clustering algorithm. The text data is modeled with VSM, with feature-word weights assigned by TF-IDF, so that each report is represented as a vector. The similarity between each document in the document stream and all topics in the clustering process is then computed. The computed similarity is compared with a preset threshold to judge whether the document belongs to a new topic.
Step 5: Use the ISP clustering algorithm. A cached document stream is added on top of the Single-Pass algorithm of step 4. Documents whose similarity in step 4 is below the preset threshold are placed in the cached document stream and their similarity is recomputed, until all documents have been clustered.
Step 6: On the basis of step 5, apply the ISP&AH clustering algorithm, which adds AHC. The similarity between every pair of documents is computed and a document-by-document similarity matrix is built. The two documents with the highest similarity in the matrix are merged into one topic set, this new topic class replaces the two merged documents, and the similarity matrix is recomputed and merging repeated iteratively until the stop condition is met.
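The merge loop of step 6 can be sketched as follows. AHC is read here as agglomerative hierarchical clustering, which is an interpretation of the acronym. Two further choices are assumptions, since the text leaves them open: similarity between merged clusters uses average linkage, and the stop condition is a minimum similarity `stop_threshold`. The function name and the toy matrix are hypothetical.

```python
def agglomerative_merge(sim, stop_threshold):
    """Repeatedly merge the two most similar clusters.

    sim is a symmetric document-by-document similarity matrix
    (list of lists). Merging stops when the best pair falls below
    stop_threshold (assumed stop condition).
    """
    clusters = [[i] for i in range(len(sim))]

    def cluster_sim(a, b):
        # Average linkage (assumption): mean pairwise similarity.
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best, pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = cluster_sim(clusters[x], clusters[y])
                if s > best:
                    best, pair = s, (x, y)
        if best < stop_threshold:
            break
        x, y = pair
        merged = clusters[x] + clusters[y]
        # The new topic class replaces the two clusters it was built from.
        clusters = [c for k, c in enumerate(clusters) if k not in pair]
        clusters.append(merged)
    return clusters


# Toy similarity matrix: documents 0 and 1 are close, as are 2 and 3.
sim = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
result = agglomerative_merge(sim, stop_threshold=0.5)
```

Every round scans all cluster pairs, which is where the quadratic cost of hierarchical clustering comes from; running it on the topics left by an earlier pass rather than on raw documents keeps that input small.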
The experiments compare three methods: using VSM alone to build a similarity model, using LDA alone to build a topic model, and combining LDA with VSM. The three methods are evaluated by computing the F-Measure, as shown in formula (1).
F-Measure = 2 × Precision × Recall / (Precision + Recall)    (1)
In formula (1), Precision denotes precision and Recall denotes recall. Precision is the ratio of the number of relevant documents correctly retrieved to the total number of documents retrieved, and Recall is the ratio of the number of relevant documents correctly retrieved to the actual number of relevant documents. The larger the F-Measure, the better the prediction result.
As shown in Fig. 2, across the 5 topics the F-Measure of the VSM similarity model and that of the LDA topic model are each higher on some topics and lower on others, which shows that the two modeling methods have different strengths, while the F-Measure of the VSM+LDA fusion model is the highest. The experiments show that the fusion method clearly helps build an accurate model.
Meanwhile, to assess how the improved algorithm affects clustering, the invention computes precision, recall and F-Measure to measure the performance of three groups of experiments: using only the traditional Single-Pass clustering algorithm, using only the ISP clustering algorithm, and using the ISP&AH clustering algorithm with AHC added.
The LDA fusion model brings the semantic relations captured by the LDA topic model into the news text domain. The statistics-based method and the semantic-topic-based method are combined so that each compensates for the other, improving text clustering quality. Multi-level news topic detection combines the ISP clustering algorithm with a hierarchical clustering algorithm to cluster at multiple levels and in greater depth. By improving the Single-Pass clustering algorithm, preliminary topic clustering is performed on the news texts, yielding a highly cohesive, low-granularity topic aggregation result that both meets the requirements of the next clustering stage and improves the clustering effect to some extent.
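The Gibbs sampling that step 2 relies on to estimate the LDA parameters can be illustrated with a small collapsed Gibbs sampler. This is a sketch, not the patent's implementation: the symmetric priors `alpha` and `beta`, the iteration count, the fixed seed and the toy corpus are all assumptions, and the function name `lda_gibbs` is hypothetical.

```python
import random


def lda_gibbs(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; returns document-topic distributions."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]              # topic counts per document
    topic_word = [[0] * n_words for _ in range(n_topics)]   # word counts per topic
    topic_total = [0] * n_topics                            # total words per topic
    assign = []
    for d, doc in enumerate(docs):  # random initial topic for every token
        z = [rng.randrange(n_topics) for _ in doc]
        assign.append(z)
        for w, k in zip(doc, z):
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
    for _ in range(n_iters):  # each sweep is one step of the Markov chain
        for d, doc in enumerate(docs):
            for pos, w in enumerate(doc):
                v, k = word_id[w], assign[d][pos]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha) * (topic_word[t][v] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][pos] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1
    # Smoothed document-topic proportions (the latent topic vectors).
    return [
        [(doc_topic[d][t] + alpha) / (len(doc) + n_topics * alpha)
         for t in range(n_topics)]
        for d, doc in enumerate(docs)
    ]


docs = [
    ["election", "vote", "candidate", "vote"],
    ["storm", "flood", "rescue", "flood"],
    ["candidate", "debate", "election"],
]
theta = lda_gibbs(docs, n_topics=2)
```

Each row of `theta` is a document's latent topic vector. Two such rows can then be compared, for example with cosine similarity, to obtain a topic-level similarity; the text does not state which measure is used for this.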
The invention provides a news topic detection method based on an LDA fusion model and multi-level clustering. Fig. 1 is the overall schematic diagram of a specific embodiment of the invention, which includes:
Step S0101: Build the VSM similarity model using TF-IDF. Differences in text length make the weight distribution uneven, which in turn biases the similarity computation, so the text vectors must also be normalized.
Step S0201: Build the topic model using LDA. The model parameters are computed by Gibbs sampling, which constructs a Markov chain and yields accurate parameter settings. Then, for two different texts d_i and d_j, the LDA topic model similarity Sim_LDA(d_i, d_j) based on their latent topic vectors is computed.
Step S0301: Combine the LDA latent topic model with the VSM vector space model. The similarity Sim_TFIDF(d_i, d_j) of the TF-IDF weight vector model is computed and combined linearly with Sim_LDA(d_i, d_j) to obtain the final similarity that fuses the two results, as shown in formula (2).
Sim(d_i, d_j) = λ × Sim_TFIDF(d_i, d_j) + (1 - λ) × Sim_LDA(d_i, d_j)    (2)
where λ is a user-defined linear weighting factor whose value sets the ratio in which the TF-IDF-based VSM model and the topic-based LDA model are combined in the weighted sum.
Step S0401: Use the traditional Single-Pass clustering algorithm. The text data is modeled with VSM, feature-word weights are assigned by TF-IDF, and each text is represented as a vector.
Step S0402: Compute the similarity between each text in the text stream and all existing topics in the clustering process, obtain the maximum similarity MaxSim and record the corresponding topic TopicMax. Compare MaxSim with the preset threshold: if MaxSim is greater than the threshold, the text is assigned to TopicMax; otherwise it is a new topic.
Step S0501: Use the ISP clustering algorithm. A cached document stream is added on top of step S0402: documents whose similarity is below the threshold are added to the cache stream, and the documents in the cache stream are clustered again. If the recomputed similarity is greater than the threshold, the topic is updated; otherwise the document is treated as a new topic. This continues until all documents have been clustered.
Step S0601: Apply the ISP&AH clustering algorithm, which adds AHC. Highly similar topics in the news texts are first gathered together. Hierarchical clustering then performs a secondary clustering on the preliminary clustering result, further merging topics with high similarity to improve precision and recall.
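The cache mechanism of step S0501 can be sketched as a two-pass variant of Single-Pass. The text leaves several details open, so the following are assumptions: the first document seeds the first topic, a document's similarity to a topic is its best match among the members, vectors are unit-length so the dot product acts as cosine similarity, and the cache is revisited once after the first pass. The names `dot` and `isp_cluster` are hypothetical.

```python
def dot(a, b):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def isp_cluster(vectors, threshold, similarity=dot):
    """Single-Pass with a cache for documents that match no topic yet."""
    topics = []  # each topic is a list of document indices
    cache = []   # documents deferred during the first pass

    def best_match(i):
        max_sim, topic_max = 0.0, None
        for topic in topics:
            sim = max(similarity(vectors[i], vectors[j]) for j in topic)
            if sim > max_sim:
                max_sim, topic_max = sim, topic
        return max_sim, topic_max

    # First pass: assign confident matches, defer everything else.
    for i in range(len(vectors)):
        max_sim, topic_max = best_match(i)
        if topic_max is not None and max_sim > threshold:
            topic_max.append(i)
        elif not topics:
            topics.append([i])  # assumption: first document seeds a topic
        else:
            cache.append(i)

    # Second pass: re-cluster cached documents against the richer topics.
    for i in cache:
        max_sim, topic_max = best_match(i)
        if topic_max is not None and max_sim > threshold:
            topic_max.append(i)
        else:
            topics.append([i])  # still unmatched: treat as a new topic
    return topics


stream = [(1.0, 0.0), (0.0, 1.0), (0.6, 0.8), (0.8, 0.6)]
topics = isp_cluster(stream, threshold=0.7)
```

In this toy stream, document 2 is deferred because it is not yet close enough to the only topic; once document 3 has joined that topic, the second pass finds a strong match for it. Deferring uncertain documents in this way is what reduces the dependence on arrival order.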
The news topic detection method based on an LDA fusion model and multi-level clustering of the invention compensates for the TF-IDF-based vector space model's neglect of semantic relations between texts in text representation, and improves text clustering quality. Meanwhile, by improving the Single-Pass clustering algorithm and applying preliminary topic clustering followed by hierarchical clustering to the news texts, it compensates for the high time complexity of hierarchical clustering and the relatively low clustering accuracy of the traditional Single-Pass algorithm. It provides an effective method for text analysis and topic detection.

Claims (3)

1. A news topic detection method based on an LDA fusion model and multi-level clustering, characterized in that the steps are as follows:
Step 1: Build a similarity model using the vector space model. Each dimension of the VSM model represents a term weight. For two vectors d1 and d2, their similarity is computed with cosine similarity: the closer the cosine value is to 1, the smaller the angle between the two vectors, meaning their directions are more consistent and the similarity is higher; the closer the cosine value is to 0, the lower the similarity;
Step 2: Build a topic model using LDA. The model parameters are computed by Gibbs sampling: a Markov chain is constructed by iteratively sampling values and run until it converges, which yields accurate parameter settings;
Step 3: Combine the LDA latent topic model with the VSM vector space model. Before the overall clustering algorithm runs, the text-topic relation matrix is fused with the VSM model based on TF-IDF weights: the similarity obtained from the VSM model and the similarity obtained from the LDA model are combined linearly, and the weighted sum gives the final similarity value, so that the two text models are fused organically;
Step 4: Apply the single-pass clustering algorithm Single-Pass. The text data is modeled with VSM, with feature-word weights assigned by TF-IDF, so that each report is represented as a vector. The similarity between each document in the document stream and all topics in the clustering process is then computed, and the computed similarity is compared with a preset threshold to judge whether the document belongs to a new topic;
Step 5: Use the ISP clustering algorithm. A cached document stream is added on top of the Single-Pass algorithm of step 4. Documents whose similarity in step 4 is below the preset threshold are placed in the cached document stream and their similarity is recomputed, until all documents have been clustered;
Step 6: On the basis of step 5, apply the ISP&AH clustering algorithm, which adds AHC. The similarity between every pair of documents is computed and a document-by-document similarity matrix is built. The two documents with the highest similarity in the matrix are merged into one topic set, this new topic class replaces the two merged documents, and the similarity matrix is recomputed and merging repeated iteratively until the stop condition is met.
2. The news topic detection method based on an LDA fusion model and multi-level clustering as claimed in claim 1, characterized in that it also includes a verification step: using VSM alone to build a similarity model, using LDA alone to build a topic model, and combining LDA with VSM are compared, and the three methods are evaluated by computing the F-Measure, as shown in formula (1):
F-Measure = 2 × Precision × Recall / (Precision + Recall)    (1)
In formula (1), Precision denotes precision and Recall denotes recall. Precision is the ratio of the number of relevant documents correctly retrieved to the total number of documents retrieved, and Recall is the ratio of the number of relevant documents correctly retrieved to the actual number of relevant documents. The larger the F-Measure, the better the prediction result.
3. The news topic detection method based on an LDA fusion model and multi-level clustering as claimed in claim 1, characterized in that the specific steps in one example are as follows:
Step S0101: Build the VSM similarity model using TF-IDF. Differences in text length make the weight distribution uneven, which in turn biases the similarity computation, so the text vectors must also be normalized;
Step S0201: Build the topic model using LDA. The model parameters are computed by Gibbs sampling, which constructs a Markov chain and yields accurate parameter settings. Then, for two different texts d_i and d_j, the LDA topic model similarity Sim_LDA(d_i, d_j) based on their latent topic vectors is computed;
Step S0301: Combine the LDA latent topic model with the VSM vector space model. The similarity Sim_TFIDF(d_i, d_j) of the TF-IDF weight vector model is computed and combined linearly with Sim_LDA(d_i, d_j) to obtain the final similarity that fuses the two results, as shown in formula (2);
Sim(d_i, d_j) = λ × Sim_TFIDF(d_i, d_j) + (1 - λ) × Sim_LDA(d_i, d_j)    (2)
where λ is a user-defined linear weighting factor whose value sets the ratio in which the TF-IDF-based VSM model and the topic-based LDA model are combined in the weighted sum;
Step S0401: Apply the Single-Pass clustering algorithm. The text data is modeled with VSM, feature-word weights are assigned by TF-IDF, and each text is represented as a vector;
Step S0402: Compute the similarity between each text in the text stream and all existing topics in the clustering process, obtain the maximum similarity MaxSim and record the corresponding topic TopicMax. Compare MaxSim with the preset threshold: if MaxSim is greater than the threshold, the text is assigned to TopicMax; otherwise it is a new topic;
Step S0501: Use the ISP clustering algorithm. A cached document stream is added on top of step S0402: documents whose similarity is below the threshold are added to the cache stream, and the documents in the cache stream are clustered again. If the recomputed similarity is greater than the threshold, the topic is updated; otherwise the document is treated as a new topic. This continues until all documents have been clustered;
Step S0601: Apply the ISP&AH clustering algorithm, which adds AHC. Highly similar topics in the news texts are first gathered together; hierarchical clustering then performs a secondary clustering on the preliminary clustering result, further merging topics with high similarity to improve precision and recall.
CN201710289343.8A 2017-04-27 2017-04-27 News topic detection method based on LDA Fusion Models and multi-level clustering Pending CN107423337A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710289343.8A CN107423337A (en) 2017-04-27 2017-04-27 News topic detection method based on LDA Fusion Models and multi-level clustering

Publications (1)

Publication Number Publication Date
CN107423337A true CN107423337A (en) 2017-12-01

Family

ID=60425684

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102194012A (en) * 2011-06-17 2011-09-21 清华大学 Microblog topic detecting method and system
CN103823848A (en) * 2014-02-11 2014-05-28 浙江大学 Similar Chinese herbal medicine literature recommendation method based on LDA (latent Dirichlet allocation) and VSM (vector space model)
CN104915446A (en) * 2015-06-29 2015-09-16 华南理工大学 News-based method and system for automatically extracting event evolution relationships
CN106599181A (en) * 2016-12-13 2017-04-26 浙江网新恒天软件有限公司 Hot news detecting method based on topic model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
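Li Yong (李勇) et al.: "Research on microblog hot topic discovery using LDA and VSM models" (面向LDA和VSM模型的微博热点话题发现研究), Techniques of Automation and Applications (《自动化技术与应用》) *
Li Wenkun (李文坤): "Research on new word discovery and topic detection techniques for microblogs" (面向微博的新词发现和话题检测技术研究), China Master's Theses Full-text Database, Information Science and Technology (《中国优秀硕士学位论文全文数据库 信息科技辑》) *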

Cited By (36)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107992596A (en) * 2017-12-12 2018-05-04 百度在线网络技术(北京)有限公司 Text clustering method, device, server and storage medium
CN110019556A (en) * 2017-12-27 2019-07-16 阿里巴巴集团控股有限公司 Topic news acquisition method, device and equipment thereof
CN110019556B (en) * 2017-12-27 2023-08-15 阿里巴巴集团控股有限公司 Topic news acquisition method, device and equipment thereof
CN108664633A (en) * 2018-05-15 2018-10-16 南京大学 Method for classifying texts by using diversified text characteristics
CN108664633B (en) * 2018-05-15 2020-12-04 南京大学 Method for classifying texts by using diversified text characteristics
CN108932228B (en) * 2018-06-06 2023-08-08 广东南方报业移动媒体有限公司 Live broadcast industry news and partition matching method and device, server and storage medium
CN108932228A (en) * 2018-06-06 2018-12-04 武汉斗鱼网络科技有限公司 Live broadcast industry news and partition matching method and device, server and storage medium
CN110909021A (en) * 2018-09-12 2020-03-24 北京奇虎科技有限公司 Construction method and device of query rewriting model and application thereof
CN109684474B (en) * 2018-11-19 2021-01-01 北京百度网讯科技有限公司 Method, device, equipment and storage medium for providing written topics
CN109684474A (en) * 2018-11-19 2019-04-26 北京百度网讯科技有限公司 Method, device, equipment and storage medium for providing written topics
CN109857869A (en) * 2019-01-26 2019-06-07 北京工业大学 Ap incremental clustering and network element-based hot topic prediction method
CN109857869B (en) * 2019-01-26 2021-07-30 北京工业大学 Ap incremental clustering and network element-based hot topic prediction method
CN110245275B (en) * 2019-06-18 2023-09-01 中电科大数据研究院有限公司 Large-scale similar news headline rapid normalization method
CN110245275A (en) * 2019-06-18 2019-09-17 中电科大数据研究院有限公司 Large-scale similar news headline rapid normalization method
CN110297988A (en) * 2019-07-06 2019-10-01 四川大学 Hot topic detection method based on weighted LDA and improved Single-Pass clustering algorithm
CN110428102A (en) * 2019-07-31 2019-11-08 杭州电子科技大学 Major event trend forecasting method based on HC-TC-LDA
CN110428102B (en) * 2019-07-31 2021-11-09 杭州电子科技大学 HC-TC-LDA-based major event trend prediction method
CN110851592A (en) * 2019-09-19 2020-02-28 昆明理工大学 Clustering-based news text optimal theme number calculation method
CN110851592B (en) * 2019-09-19 2022-04-05 昆明理工大学 Clustering-based news text optimal theme number calculation method
CN110795533A (en) * 2019-10-22 2020-02-14 王帅 Long text-oriented theme detection method
CN110765942A (en) * 2019-10-23 2020-02-07 睿魔智能科技(深圳)有限公司 Image data labeling method, device, equipment and storage medium
CN111198946A (en) * 2019-12-25 2020-05-26 北京邮电大学 Network news hotspot mining method and device
CN111026835A (en) * 2019-12-26 2020-04-17 厦门市美亚柏科信息股份有限公司 Chat subject detection method, device and storage medium
CN111026835B (en) * 2019-12-26 2022-06-10 厦门市美亚柏科信息股份有限公司 Chat subject detection method, device and storage medium
CN111444336A (en) * 2020-02-25 2020-07-24 桂林电子科技大学 Topic detection method based on Siamese network
CN111814016B (en) * 2020-07-13 2022-07-12 重庆邮电大学 Mixed-granularity multi-view news data clustering method
CN111814016A (en) * 2020-07-13 2020-10-23 重庆邮电大学 Mixed-granularity multi-view news data clustering method
US11436287B2 (en) 2020-12-07 2022-09-06 International Business Machines Corporation Computerized grouping of news articles by activity and associated phase of focus
CN112580355A (en) * 2020-12-30 2021-03-30 中科院计算技术研究所大数据研究院 News information topic detection and real-time aggregation method
CN113064990A (en) * 2021-01-04 2021-07-02 上海金融期货信息技术有限公司 Hot event identification method and system based on multi-level clustering
CN113157857B (en) * 2021-03-13 2023-06-02 中国科学院新疆理化技术研究所 Hot topic detection method, device and equipment for news
CN113157857A (en) * 2021-03-13 2021-07-23 中国科学院新疆理化技术研究所 Hot topic detection method, device and equipment for news
CN112905751A (en) * 2021-03-19 2021-06-04 常熟理工学院 Topic evolution tracking method combining topic model and twin network model
CN112905751B (en) * 2021-03-19 2024-03-29 常熟理工学院 Topic evolution tracking method combining topic model and twin network model
CN113792125A (en) * 2021-08-25 2021-12-14 北京库睿科技有限公司 Intelligent retrieval sorting method and device based on text relevance and user intention
CN113792125B (en) * 2021-08-25 2024-04-02 北京库睿科技有限公司 Intelligent retrieval ordering method and device based on text relevance and user intention

Similar Documents

Publication Publication Date Title
CN107423337A (en) News topic detection method based on LDA Fusion Models and multi-level clustering
CN109886020B (en) Software vulnerability automatic classification method based on deep neural network
CN105760507B (en) Cross-modal topic correlation modeling method based on deep learning
CN102929937B (en) Data processing method for commodity classification based on a text topic model
Zhang et al. Automatic text summarization based on sentences clustering and extraction
Yi et al. Topic modeling for short texts via word embedding and document correlation
CN104392006B (en) Event query processing method and device
CN109213843A (en) Spam text detection method and device
Kaviani et al. Emhash: Hashtag recommendation using neural network based on bert embedding
Wahid et al. Topic2Labels: A framework to annotate and classify the social media data through LDA topics and deep learning models for crisis response
CN112949713B (en) Text sentiment classification method based on complex-network ensemble learning
Zhang et al. Continuous word embeddings for detecting local text reuses at the semantic level
Li et al. Dirichlet multinomial mixture with variational manifold regularization: Topic modeling over short texts
CN107832467A (en) Microblog topic detection method based on an improved Single-Pass clustering algorithm
CN110287321A (en) Electric power text classification method based on improved feature selection
Saleh et al. A genetic based optimization model for extractive multi-document text summarization
Zhang et al. Clustering based behavior sampling with long sequential data for CTR prediction
CN111259156A (en) Time-series-oriented hot spot clustering method
Wang et al. Improving short text classification through better feature space selection
Shang A computational intelligence model for legal prediction and decision support
CN108334573A (en) Highly relevant microblog retrieval method based on clustering information
CN110020034B (en) Information quotation analysis method and system
CN109783586B (en) Paid-poster ("water army") comment detection method based on clustering resampling
Sun et al. Chinese microblog sentiment classification based on convolution neural network with content extension method
Ma et al. A novel keyword generation model based on topic-aware and title-guide

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 2017-12-01)