CN107423337A - News topic detection method based on LDA Fusion Models and multi-level clustering - Google Patents
News topic detection method based on LDA Fusion Models and multi-level clustering
- Publication number
- CN107423337A (application CN201710289343.8A)
- Authority
- CN
- China
- Prior art keywords
- similarity
- topic
- lda
- models
- clustering
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Class or cluster creation or modification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Abstract
The invention belongs to the fields of data mining, natural language processing and information retrieval, and proposes a news topic detection method. It addresses the semantic shortcomings of TF-IDF-based vector space algorithms and the time-complexity and accuracy shortcomings of hierarchical text clustering, and improves feature extraction, representation modeling, similarity computation, and fast, accurate text clustering for large volumes of news text. The news topic detection method based on an LDA fusion model and multi-level clustering has the following steps. Step 1: build a similarity model using the vector space model. Step 2: obtain accurate parameter settings. Step 3: fuse the two text models organically. Step 4: judge whether a topic is a new topic. Step 5: compute similarity until all documents have been clustered. Step 6: on top of Step 5, apply the ISP&AH clustering algorithm, which adds AHC. The invention is mainly applied in design and manufacturing settings.
Description
Technical field
The invention belongs to the fields of data mining, natural language processing and information retrieval. It relates to monitoring technology and network information filtering technology, in particular to text analysis and topic detection methods. Specifically, it relates to a news topic detection method based on a latent Dirichlet allocation (LDA) fusion model and multi-level clustering.
Background
Topic detection and tracking (TDT) developed from the earlier event detection and tracking (EDT). It is a technology that automatically recognizes, mines, and organizes the content of news reports without manual intervention. The vector space model (VSM) based on term frequency-inverse document frequency (TF-IDF) has shown strong capability in text representation. The vector space model is an algebraic model for representing text, and it is applied to information filtering, information retrieval, indexing, and relevance ranking. Compared with the standard Boolean model, the vector space model is a simple model based on linear algebra whose term weights are not binary. It allows a continuous degree of similarity between documents and queries to be computed, allows documents to be ranked by likely relevance, and allows partial matching.
However, the vector space model also has shortcomings. It is poorly suited to long documents, whose similarity values are unsatisfactory because of small inner products and very high dimensionality. Because it starts from statistics, it ignores the relationships between text semantics, which leads to poor semantic sensitivity. In addition, the order in which terms appear in a document cannot be represented in the vector, and its weights are obtained intuitively rather than formally.
Topic detection based on the Single-Pass clustering algorithm laid the foundation for research on the TDT framework. The Single-Pass algorithm uses incremental clustering: it compares each text vector with the reports in existing topics and matches them by computed text similarity. If the text matches some topic class, it is assigned to that topic. If its similarity to every topic class is below a certain threshold, the text becomes a new seed topic.
The Single-Pass clustering algorithm also has defects. Because it is sensitive to the input order of news texts, its clustering quality declines as the number of news texts grows, and its accuracy is somewhat weak. Hierarchical text clustering performs well, but its O(n²) time complexity and very high memory consumption limit the algorithm.
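The Single-Pass procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the prior art: representing each topic by the sum of its member vectors, the function names `cosine` and `single_pass`, the threshold of 0.5, and the toy documents are all assumptions made for the example.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def single_pass(docs, threshold):
    """Assign each document, in arrival order, to its most similar topic,
    or open a new seed topic when no topic reaches the threshold."""
    topics = []  # each topic: {"centroid": Counter, "members": [doc indices]}
    for idx, vec in enumerate(docs):
        best, best_sim = None, 0.0
        for topic in topics:
            sim = cosine(vec, topic["centroid"])
            if sim > best_sim:
                best, best_sim = topic, sim
        if best is not None and best_sim >= threshold:
            best["members"].append(idx)
            # Assumption: a topic is summarized by the sum of its member vectors.
            best["centroid"].update(vec)
        else:
            topics.append({"centroid": Counter(vec), "members": [idx]})
    return [t["members"] for t in topics]

# Hypothetical toy documents, already tokenized into term counts.
docs = [
    Counter("flood river rescue".split()),
    Counter("election vote ballot".split()),
    Counter("flood rescue boat".split()),
]
clusters = single_pass(docs, threshold=0.5)
```

Because every document is judged only against the topics that exist when it arrives, shuffling `docs` can change the result, which is the order sensitivity noted above.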
Summary of the invention
To overcome the deficiencies of the prior art, the present invention proposes a news topic detection method based on an LDA fusion model and multi-level clustering. It addresses the semantic shortcomings of TF-IDF-based vector space algorithms and the time-complexity and accuracy shortcomings of hierarchical text clustering, and improves feature extraction, representation modeling, similarity computation, and fast, accurate text clustering for large volumes of news text. The technical solution adopted by the invention is a news topic detection method based on an LDA fusion model and multi-level clustering, with the following steps:
Step 1: Build a similarity model using the vector space model. Each dimension of the VSM represents an equivalent weight vector. For two vectors $d_1$ and $d_2$, their similarity is computed with cosine similarity. The closer the cosine value is to 1, the smaller the angle between the two vectors, meaning that their directions are more consistent and the similarity is higher; the closer the cosine value is to 0, the larger the angle and the lower the similarity;
Step 2: Build a topic model using LDA. Gibbs sampling is used to compute the parameters of the model: a Markov chain is constructed by iterative sampling and run until it converges, finally yielding accurate parameter settings;
Step 3: Combine the LDA latent topic model with the VSM. Before the overall clustering algorithm runs, the text-topic relation matrix is fused with the VSM based on TF-IDF weights: the similarity obtained from the VSM and the similarity obtained from the LDA model are combined linearly, and their weighted sum gives the final similarity value, so that the two text models are organically fused;
Step 4: Using the Single-Pass clustering algorithm, model the text data with the VSM, assigning feature-word weights by the TF-IDF method so that each report is represented in vector form. Then compute the similarity between each document in the document stream and all topics in the clustering process, and compare the computed similarity with a preset threshold to judge whether the document belongs to a new topic;
Step 5: Use the ISP clustering algorithm. A cached document stream is added to the Single-Pass algorithm of Step 4. Documents whose similarity in Step 4 is below the preset threshold are placed in the cached document stream and their similarity is recomputed, until all documents have been clustered;
Step 6: On top of Step 5, apply the ISP&AH clustering algorithm, which adds AHC. Compute the similarity between every pair of documents and build a document-by-document similarity matrix. Then merge the two documents with the highest similarity value in the matrix into one topic set, replace the two merged documents with this new topic class, and iteratively recompute the similarity matrix and merge again, stopping when the stop condition is met.
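Step 1 can be illustrated with a short Python sketch that builds TF-IDF vectors and compares them by cosine similarity. The smoothed IDF formula, the L2 normalization, the toy documents, and the function names `tfidf_vectors` and `cosine` are assumptions made for this example; the text above does not fix a particular TF-IDF variant.

```python
import math
from collections import Counter

def tfidf_vectors(tokenized_docs):
    """Return one L2-normalized TF-IDF vector (dict) per tokenized document."""
    n = len(tokenized_docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    vectors = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        # Assumed weighting: raw term frequency times a smoothed log IDF.
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else vec)
    return vectors

def cosine(a, b):
    """Cosine of the angle between two sparse vectors; 1 means same direction."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical toy corpus.
docs = [
    "flood river rescue".split(),
    "flood rescue boat".split(),
    "election vote ballot".split(),
]
vecs = tfidf_vectors(docs)
sim_related = cosine(vecs[0], vecs[1])    # documents sharing two terms
sim_unrelated = cosine(vecs[0], vecs[2])  # documents sharing no terms
```

Documents with no terms in common score exactly 0 here, which is the semantic blind spot that the LDA side of the fusion is meant to cover.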
The method also includes a verification step. Three methods are compared: using the VSM alone to build the similarity model, using LDA alone to build the topic model, and combining LDA with the VSM. The effectiveness of the three methods is evaluated by computing the F-Measure, as shown in formula (1):
$$\text{F-Measure} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{1}$$
In formula (1), Precision denotes the precision and Recall denotes the recall. Precision is the ratio of the number of correctly retrieved relevant documents to the total number of retrieved documents, and Recall is the ratio of the number of correctly retrieved relevant documents to the actual number of relevant documents. The larger the F-Measure value, the better the prediction result.
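Formula (1) translates directly into code. The sketch below is a minimal Python version; the function name `f_measure`, the use of document-id sets, and the sample ids are assumptions for illustration only.

```python
def f_measure(retrieved, relevant):
    """Return (precision, recall, F-Measure) for one topic, per formula (1)."""
    retrieved, relevant = set(retrieved), set(relevant)
    correct = len(retrieved & relevant)  # correctly retrieved relevant documents
    precision = correct / len(retrieved) if retrieved else 0.0
    recall = correct / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical document ids: 4 retrieved, 3 actually relevant, 2 in common.
p, r, f = f_measure(retrieved=[1, 2, 3, 4], relevant=[2, 3, 5])
```

With these sample ids, Precision is 2/4, Recall is 2/3, and the F-Measure is 4/7.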
The specific steps in one example are as follows:
Step S0101: Build the VSM similarity model using TF-IDF. Because texts differ in length, the weight distribution becomes unbalanced, which biases the similarity computation, so the text vectors must also be normalized;
Step S0201: Build the topic model using LDA. The parameters of the model are computed by Gibbs sampling, which constructs a Markov chain and finally yields accurate parameter settings. Then, for two different texts $d_i$ and $d_j$, compute the LDA topic-model similarity $\mathrm{Sim}_{\mathrm{LDA}}(d_i,d_j)$ based on their latent topic vectors;
Step S0301: Combine the LDA latent topic model with the VSM. Compute the similarity $\mathrm{Sim}_{\mathrm{TFIDF}}(d_i,d_j)$ of the TF-IDF weight vector model and combine it linearly with $\mathrm{Sim}_{\mathrm{LDA}}(d_i,d_j)$ to obtain a final similarity that fuses the two results, as shown in formula (2);
$$\mathrm{Sim}(d_i,d_j)=\lambda\times\mathrm{Sim}_{\mathrm{TFIDF}}(d_i,d_j)+(1-\lambda)\times\mathrm{Sim}_{\mathrm{LDA}}(d_i,d_j) \tag{2}$$
where $\lambda$ is a user-defined linear influence factor. Its value sets the specific proportion in which the TF-IDF-weighted VSM and the topic-based LDA model are linearly scaled and summed;
Step S0401: Using the Single-Pass clustering algorithm, model the text data with the VSM, assign feature-word weights with the TF-IDF method, and represent each text in vector form;
Step S0402: Compute the similarity between each text in the text stream and all documents in the clustering process, obtain the maximum similarity MaxSim, and record the corresponding topic TopicMax. Compare MaxSim with the preset threshold: if MaxSim exceeds the threshold, the text is assigned to TopicMax; otherwise it is a new topic;
Step S0501: Use the ISP clustering algorithm. A cached document stream is added to Step S0402: documents whose similarity is below the threshold are added to the cache stream, and the articles in the cache stream are clustered again. If the recomputed similarity exceeds the threshold, the topic is updated; otherwise the document is treated as a new topic. This continues until all documents have been clustered;
Step S0601: Apply the ISP&AH clustering algorithm, which adds AHC. First gather the highly similar topics in the news texts together. Then perform a second clustering on the preliminary clustering result by hierarchical clustering, further merging highly similar topics to improve precision and recall.
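The linear fusion of formula (2) can be sketched in Python as follows. Several points are assumptions rather than statements from the text: that $\mathrm{Sim}_{\mathrm{LDA}}$ is a cosine over document-topic vectors, that $\lambda$ lies in [0, 1], and the names `fused_similarity`, `lam`, and the sample vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0

def fused_similarity(tfidf_i, tfidf_j, theta_i, theta_j, lam):
    """Formula (2): lam * Sim_TFIDF + (1 - lam) * Sim_LDA."""
    # Assumption: lam is a weight in [0, 1], so the result stays in [0, 1].
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam is expected to lie in [0, 1]")
    sim_tfidf = cosine(tfidf_i, tfidf_j)
    # Assumption: the LDA similarity is a cosine over document-topic vectors.
    sim_lda = cosine(theta_i, theta_j)
    return lam * sim_tfidf + (1.0 - lam) * sim_lda

# Hypothetical pair: no shared terms, but an identical topic mixture.
tfidf_a, tfidf_b = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]
theta_a, theta_b = [0.5, 0.5], [0.5, 0.5]
sim = fused_similarity(tfidf_a, tfidf_b, theta_a, theta_b, lam=0.6)
```

In this example the TF-IDF term contributes nothing and the whole score comes from the (1 - λ) share given to the topic model, which shows how the fusion recovers similarity between texts that use different vocabulary.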
The features and beneficial effects of the invention are as follows:
The fusion method adopted by the invention clearly promotes the accurate construction of the model.
The invention fuses the statistics-based method with the semantic-topic-based method so that each complements the other, improving text clustering quality. Multi-level news topic detection combines the ISP clustering algorithm with the hierarchical clustering algorithm to cluster at multiple levels and in greater depth. By improving the Single-Pass clustering algorithm, preliminary topic clustering is performed on the news texts, yielding highly cohesive, low-granularity topic aggregation results that both meet the requirements of the next clustering stage and improve the clustering effect to some extent.
Brief description of the drawings:
Fig. 1 is the overall schematic diagram.
Fig. 2 is a line chart comparing the F-Measure of the three groups of experimental modeling and clustering.
Detailed description
The invention proposes a news topic detection method based on an LDA fusion model and multi-level clustering, comprising the following steps:
Step 1: Build a similarity model using the VSM. Each dimension of the VSM represents an equivalent weight vector. For two vectors $d_1$ and $d_2$, their similarity is computed with cosine similarity. The closer the cosine value is to 1, the smaller the angle between the two vectors, meaning that their directions are more consistent and the similarity is higher. The closer the cosine value is to 0, the larger the angle and the lower the similarity.
Step 2: Build a topic model using LDA. Gibbs sampling is a method of generating a Markov chain. It is used to compute the parameters of the model: the Markov chain is constructed by iterative sampling and run until it converges, finally yielding accurate parameter settings.
Step 3: Combine the LDA latent topic model with the VSM. Before the overall clustering algorithm runs, the text-topic relation matrix is fused with the VSM based on TF-IDF weights: the similarity obtained from the VSM and the similarity obtained from the LDA model are combined linearly, and their weighted sum gives the final similarity value, so that the two text models are organically fused.
Step 4: Use the traditional Single-Pass clustering algorithm. Model the text data with the VSM, assigning feature-word weights by the TF-IDF method so that each report is represented in vector form. Then compute the similarity between each document in the document stream and all topics in the clustering process. Compare the computed similarity with a preset threshold to judge whether the document belongs to a new topic.
Step 5: Use the ISP clustering algorithm. A cached document stream is added to the Single-Pass algorithm of Step 4. Documents whose similarity in Step 4 is below the preset threshold are placed in the cached document stream and their similarity is recomputed, until all documents have been clustered.
Step 6: On top of Step 5, apply the ISP&AH clustering algorithm, which adds AHC. Compute the similarity between every pair of documents and build a document-by-document similarity matrix. Then merge the two documents with the highest similarity value in the matrix into one topic set, replace the two merged documents with this new topic class, and iteratively recompute the similarity matrix and merge again, stopping when the stop condition is met.
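The merge loop of Step 6 can be sketched in Python as below. The sketch assumes that AHC denotes agglomerative hierarchical clustering, that cluster-to-cluster similarity uses average linkage, and that the stop condition is "no pair of clusters reaches `stop_threshold`"; the text does not specify the linkage or the stop condition, and the function name and sample matrix are hypothetical.

```python
def agglomerate(sim, stop_threshold):
    """Repeatedly merge the most similar pair of clusters.

    sim: symmetric matrix (list of lists) of pairwise document similarities.
    Stops when no pair of clusters reaches stop_threshold.
    """
    clusters = [[i] for i in range(len(sim))]

    def linkage(a, b):
        # Assumption: average linkage over all cross-cluster document pairs.
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best_pair, best_sim = None, stop_threshold
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = linkage(clusters[x], clusters[y])
                if s >= best_sim:
                    best_pair, best_sim = (x, y), s
        if best_pair is None:
            break  # assumed stop condition: nothing similar enough to merge
        x, y = best_pair
        merged = clusters[x] + clusters[y]
        # Replace the two merged clusters with the new topic class.
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    return clusters

# Hypothetical similarity matrix for four documents.
sim = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
topics = agglomerate(sim, stop_threshold=0.5)
```

Each pass scans all cluster pairs, which is where the quadratic cost of hierarchical clustering comes from; running it only on the output of the preliminary clustering keeps the number of items small.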
The experiment compares three methods: using the VSM alone to build the similarity model, using LDA alone to build the topic model, and combining LDA with the VSM. The effectiveness of the three methods is evaluated by computing the F-Measure, as shown in formula (1).
$$\text{F-Measure} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{1}$$
In formula (1), Precision denotes the precision and Recall denotes the recall. Precision is the ratio of the number of correctly retrieved relevant documents to the total number of retrieved documents, and Recall is the ratio of the number of correctly retrieved relevant documents to the actual number of relevant documents. The larger the F-Measure value, the better the prediction result.
As shown in Fig. 2, on the 5 topics the F-Measure of the VSM similarity model and that of the LDA topic model are each higher on some topics and lower on others, which indicates that the two modeling methods have different strengths, while the F-Measure of the VSM+LDA fusion model is the highest. The experiment shows that the fusion method clearly promotes the accurate construction of the model.
Meanwhile, to assess the effect of the improved algorithm on clustering, the invention computes precision, recall, and F-Measure to measure the performance of three groups of experiments: the traditional Single-Pass clustering algorithm alone, the ISP clustering algorithm alone, and the ISP&AH clustering algorithm with AHC added.
The LDA fusion model brings the semantic relations of the LDA topic model into the news text domain. It fuses the statistics-based method with the semantic-topic-based method so that each complements the other, improving text clustering quality. Multi-level news topic detection combines the ISP clustering algorithm with the hierarchical clustering algorithm to cluster at multiple levels and in greater depth. By improving the Single-Pass clustering algorithm, preliminary topic clustering is performed on the news texts, yielding highly cohesive, low-granularity topic aggregation results that both meet the requirements of the next clustering stage and improve the clustering effect to some extent.
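The text-topic relation matrix on which the LDA side of the fusion relies comes from the Gibbs sampling of Step 2. A minimal collapsed Gibbs sampler is sketched below in Python. It assumes symmetric Dirichlet priors `alpha` and `beta`, a fixed iteration count in place of a convergence test, and documents already encoded as lists of integer word ids; none of these choices is specified in the text, and the function name `lda_gibbs` is hypothetical.

```python
import random

def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA; returns the document-topic matrix theta."""
    rng = random.Random(seed)
    ndk = [[0] * num_topics for _ in docs]               # document-topic counts
    nkw = [[0] * vocab_size for _ in range(num_topics)]  # topic-word counts
    nk = [0] * num_topics                                # words per topic
    z = []                                               # topic of each token
    for d, doc in enumerate(docs):
        assignments = []
        for w in doc:
            k = rng.randrange(num_topics)  # random initial topic
            assignments.append(k)
            ndk[d][k] += 1
            nkw[k][w] += 1
            nk[k] += 1
        z.append(assignments)

    for _ in range(iters):  # each sweep is one step of the Markov chain
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                ndk[d][k] -= 1
                nkw[k][w] -= 1
                nk[k] -= 1
                # Conditional distribution of this token's topic given all others.
                weights = [
                    (ndk[d][t] + alpha) * (nkw[t][w] + beta) / (nk[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1
                nkw[k][w] += 1
                nk[k] += 1

    return [
        [(ndk[d][t] + alpha) / (len(doc) + num_topics * alpha) for t in range(num_topics)]
        for d, doc in enumerate(docs)
    ]

# Hypothetical corpus over a 4-word vocabulary (word ids 0..3).
docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 1, 0]]
theta = lda_gibbs(docs, num_topics=2, vocab_size=4)
```

Each row of `theta` is one document's topic distribution; comparing two rows gives a topic-based similarity of the kind that the fusion combines with the TF-IDF similarity.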
The invention provides a news topic detection method based on an LDA fusion model and multi-level clustering. Fig. 1 shows the overall schematic diagram of a specific embodiment of the invention, which includes:
Step S0101: Build the VSM similarity model using TF-IDF. Because texts differ in length, the weight distribution becomes unbalanced, which biases the similarity computation, so the text vectors must also be normalized, as shown in formula (2).
Step S0201: Build the topic model using LDA. The parameters of the model are computed by Gibbs sampling, which constructs a Markov chain and finally yields accurate parameter settings. Then, for two different texts $d_i$ and $d_j$, compute the LDA topic-model similarity $\mathrm{Sim}_{\mathrm{LDA}}(d_i,d_j)$ based on their latent topic vectors.
Step S0301: Combine the LDA latent topic model with the VSM. Compute the similarity $\mathrm{Sim}_{\mathrm{TFIDF}}(d_i,d_j)$ of the TF-IDF weight vector model and combine it linearly with $\mathrm{Sim}_{\mathrm{LDA}}(d_i,d_j)$ to obtain a final similarity that fuses the two results, as shown in formula (2).
$$\mathrm{Sim}(d_i,d_j)=\lambda\times\mathrm{Sim}_{\mathrm{TFIDF}}(d_i,d_j)+(1-\lambda)\times\mathrm{Sim}_{\mathrm{LDA}}(d_i,d_j) \tag{2}$$
where $\lambda$ is a user-defined linear influence factor. Its value sets the specific proportion in which the TF-IDF-weighted VSM and the topic-based LDA model are linearly scaled and summed.
Step S0401: Use the traditional Single-Pass clustering algorithm. Model the text data with the VSM, assign feature-word weights with the TF-IDF method, and represent each text in vector form.
Step S0402: Compute the similarity between each text in the text stream and all documents in the clustering process, obtain the maximum similarity MaxSim, and record the corresponding topic TopicMax. Compare MaxSim with the preset threshold: if MaxSim exceeds the threshold, the text is assigned to TopicMax; otherwise it is a new topic.
Step S0501: Use the ISP clustering algorithm. A cached document stream is added to Step S0402: documents whose similarity is below the threshold are added to the cache stream, and the articles in the cache stream are clustered again. If the recomputed similarity exceeds the threshold, the topic is updated; otherwise the document is treated as a new topic. This continues until all documents have been clustered.
Step S0601: Apply the ISP&AH clustering algorithm, which adds AHC. First gather the highly similar topics in the news texts together. Then perform a second clustering on the preliminary clustering result by hierarchical clustering, further merging highly similar topics to improve precision and recall.
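One possible reading of the cache mechanism in Steps S0402 and S0501 is sketched below in Python. The text leaves several details open, so the sketch makes explicit assumptions: the first document seeds the first topic, a topic is summarized by the sum of its member vectors, the cache is revisited once after the first pass, and the names `isp_cluster`, `best_match`, and the toy documents are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def isp_cluster(docs, threshold):
    """Single-Pass with a cache: low-similarity documents are deferred and
    re-clustered after the first pass instead of opening a topic at once."""
    topics = []  # each topic: {"centroid": Counter, "members": [doc indices]}
    cache = []   # indices of deferred documents (the cached document stream)

    def best_match(vec):
        topic_max, max_sim = None, 0.0
        for topic in topics:
            s = cosine(vec, topic["centroid"])
            if s > max_sim:
                topic_max, max_sim = topic, s
        return topic_max, max_sim

    def assign(topic, idx):
        topic["members"].append(idx)
        topic["centroid"].update(docs[idx])  # assumed topic update rule

    for idx, vec in enumerate(docs):
        topic_max, max_sim = best_match(vec)
        if topic_max is not None and max_sim > threshold:
            assign(topic_max, idx)
        elif not topics:
            # Assumption: the very first document seeds the first topic.
            topics.append({"centroid": Counter(vec), "members": [idx]})
        else:
            cache.append(idx)

    for idx in cache:  # second look, against topics that have grown meanwhile
        topic_max, max_sim = best_match(docs[idx])
        if topic_max is not None and max_sim > threshold:
            assign(topic_max, idx)
        else:
            topics.append({"centroid": Counter(docs[idx]), "members": [idx]})
    return [t["members"] for t in topics]

# Hypothetical toy stream, already tokenized into term counts.
docs = [
    Counter("flood river rescue".split()),
    Counter("boat rescue team".split()),
    Counter("flood rescue boat team".split()),
    Counter("election vote ballot".split()),
]
clusters = isp_cluster(docs, threshold=0.5)
```

In this toy stream, document 1 is too dissimilar to the first topic when it arrives, so plain Single-Pass would open a separate topic for it. Here it waits in the cache and joins the flood topic once document 2 has enriched that topic.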
The news topic detection method based on an LDA fusion model and multi-level clustering of the present invention compensates for the shortcoming that the TF-IDF-based vector space model ignores the relationships between text semantics in text representation, and improves text clustering quality. Meanwhile, by improving the Single-Pass clustering algorithm and performing preliminary topic clustering followed by hierarchical clustering on news texts, it compensates for the high time complexity of hierarchical clustering and the relatively low clustering accuracy of the traditional Single-Pass algorithm. It provides an effective method for text analysis and topic detection.
Claims (3)
1. A news topic detection method based on an LDA fusion model and multi-level clustering, characterized in that the steps are as follows:
Step 1: Build a similarity model using the vector space model. Each dimension of the VSM represents an equivalent weight vector. For two vectors $d_1$ and $d_2$, their similarity is computed with cosine similarity. The closer the cosine value is to 1, the smaller the angle between the two vectors, meaning that their directions are more consistent and the similarity is higher; the closer the cosine value is to 0, the larger the angle and the lower the similarity;
Step 2: Build a topic model using LDA. Gibbs sampling is used to compute the parameters of the model: a Markov chain is constructed by iterative sampling and run until it converges, finally yielding accurate parameter settings;
Step 3: Combine the LDA latent topic model with the VSM. Before the overall clustering algorithm runs, the text-topic relation matrix is fused with the VSM based on TF-IDF weights: the similarity obtained from the VSM and the similarity obtained from the LDA model are combined linearly, and their weighted sum gives the final similarity value, so that the two text models are organically fused;
Step 4: Using the Single-Pass clustering algorithm, model the text data with the VSM, assigning feature-word weights by the TF-IDF method so that each report is represented in vector form; then compute the similarity between each document in the document stream and all topics in the clustering process, and compare the computed similarity with a preset threshold to judge whether the document belongs to a new topic;
Step 5: Use the ISP clustering algorithm: a cached document stream is added to the Single-Pass algorithm of Step 4; documents whose similarity in Step 4 is below the preset threshold are placed in the cached document stream and their similarity is recomputed, until all documents have been clustered;
Step 6: On top of Step 5, apply the ISP&AH clustering algorithm, which adds AHC: compute the similarity between every pair of documents and build a document-by-document similarity matrix; then merge the two documents with the highest similarity value in the matrix into one topic set, replace the two merged documents with this new topic class, and iteratively recompute the similarity matrix and merge again, stopping when the stop condition is met.
2. The news topic detection method based on an LDA fusion model and multi-level clustering according to claim 1, characterized in that it further comprises a verification step: using the VSM alone to build the similarity model, using LDA alone to build the topic model, and combining LDA with the VSM are compared, and the effectiveness of the three methods is evaluated by computing the F-Measure, as shown in formula (1):
$$\text{F-Measure} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{1}$$
In formula (1), Precision denotes the precision and Recall denotes the recall. Precision is the ratio of the number of correctly retrieved relevant documents to the total number of retrieved documents, and Recall is the ratio of the number of correctly retrieved relevant documents to the actual number of relevant documents. The larger the F-Measure value, the better the prediction result.
3. The news topic detection method based on an LDA fusion model and multi-level clustering according to claim 1, characterized in that the specific steps in one example are as follows:
Step S0101: Build the VSM similarity model using TF-IDF. Because texts differ in length, the weight distribution becomes unbalanced, which biases the similarity computation, so the text vectors must also be normalized;
Step S0201: Build the topic model using LDA: the parameters of the model are computed by Gibbs sampling, which constructs a Markov chain and finally yields accurate parameter settings; then, for two different texts $d_i$ and $d_j$, compute the LDA topic-model similarity $\mathrm{Sim}_{\mathrm{LDA}}(d_i,d_j)$ based on their latent topic vectors;
Step S0301: Combine the LDA latent topic model with the VSM. Compute the similarity $\mathrm{Sim}_{\mathrm{TFIDF}}(d_i,d_j)$ of the TF-IDF weight vector model and combine it linearly with $\mathrm{Sim}_{\mathrm{LDA}}(d_i,d_j)$ to obtain a final similarity that fuses the two results, as shown in formula (2);
$$\mathrm{Sim}(d_i,d_j)=\lambda\times\mathrm{Sim}_{\mathrm{TFIDF}}(d_i,d_j)+(1-\lambda)\times\mathrm{Sim}_{\mathrm{LDA}}(d_i,d_j) \tag{2}$$
where $\lambda$ is a user-defined linear influence factor; its value sets the specific proportion in which the TF-IDF-weighted VSM and the topic-based LDA model are linearly scaled and summed;
Step S0401: Using the Single-Pass clustering algorithm, model the text data with the VSM, assign feature-word weights with the TF-IDF method, and represent each text in vector form;
Step S0402: Compute the similarity between each text in the text stream and all documents in the clustering process, obtain the maximum similarity MaxSim, and record the corresponding topic TopicMax; compare MaxSim with the preset threshold: if MaxSim exceeds the threshold, the text is assigned to TopicMax; otherwise it is a new topic;
Step S0501: Use the ISP clustering algorithm: a cached document stream is added to Step S0402; documents whose similarity is below the threshold are added to the cache stream, and the articles in the cache stream are clustered again; if the recomputed similarity exceeds the threshold, the topic is updated; otherwise the document is treated as a new topic, until all documents have been clustered;
Step S0601: Apply the ISP&AH clustering algorithm, which adds AHC: first gather the highly similar topics in the news texts together; then perform a second clustering on the preliminary clustering result by hierarchical clustering, further merging highly similar topics to improve precision and recall.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710289343.8A CN107423337A (en) | 2017-04-27 | 2017-04-27 | News topic detection method based on LDA Fusion Models and multi-level clustering |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107423337A (en) | 2017-12-01 |
Family ID: 60425684
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710289343.8A Pending CN107423337A (en) | 2017-04-27 | 2017-04-27 | News topic detection method based on LDA Fusion Models and multi-level clustering |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107423337A (en) |
Cited By (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107992596A (en) * | 2017-12-12 | 2018-05-04 | 百度在线网络技术(北京)有限公司 | A kind of Text Clustering Method, device, server and storage medium |
CN108664633A (en) * | 2018-05-15 | 2018-10-16 | 南京大学 | A method of carrying out text classification using diversified text feature |
CN108932228A (en) * | 2018-06-06 | 2018-12-04 | 武汉斗鱼网络科技有限公司 | INDUSTRY OVERVIEW and subregion matching process, device, server and storage medium is broadcast live |
CN109684474A (en) * | 2018-11-19 | 2019-04-26 | 北京百度网讯科技有限公司 | For providing the method, apparatus, equipment and storage medium of subject matter |
CN109857869A (en) * | 2019-01-26 | 2019-06-07 | 北京工业大学 | A kind of hot topic prediction technique based on Ap increment cluster and network primitive |
CN110019556A (en) * | 2017-12-27 | 2019-07-16 | 阿里巴巴集团控股有限公司 | A kind of topic news acquisition methods, device and its equipment |
CN110245275A (en) * | 2019-06-18 | 2019-09-17 | 中电科大数据研究院有限公司 | A kind of extensive similar quick method for normalizing of headline |
CN110297988A (en) * | 2019-07-06 | 2019-10-01 | 四川大学 | Hot topic detection method based on weighting LDA and improvement Single-Pass clustering algorithm |
CN110428102A (en) * | 2019-07-31 | 2019-11-08 | 杭州电子科技大学 | Major event trend forecasting method based on HC-TC-LDA |
CN110765942A (en) * | 2019-10-23 | 2020-02-07 | 睿魔智能科技(深圳)有限公司 | Image data labeling method, device, equipment and storage medium |
CN110795533A (en) * | 2019-10-22 | 2020-02-14 | 王帅 | Long text-oriented theme detection method |
CN110851592A (en) * | 2019-09-19 | 2020-02-28 | 昆明理工大学 | Clustering-based news text optimal theme number calculation method |
CN110909021A (en) * | 2018-09-12 | 2020-03-24 | 北京奇虎科技有限公司 | Construction method and device of query rewriting model and application thereof |
CN111026835A (en) * | 2019-12-26 | 2020-04-17 | 厦门市美亚柏科信息股份有限公司 | Chat subject detection method, device and storage medium |
CN111198946A (en) * | 2019-12-25 | 2020-05-26 | 北京邮电大学 | Network news hotspot mining method and device |
CN111444336A (en) * | 2020-02-25 | 2020-07-24 | 桂林电子科技大学 | Topic detection method based on Siamese network |
CN111814016A (en) * | 2020-07-13 | 2020-10-23 | 重庆邮电大学 | Mixed-granularity multi-view news data clustering method |
CN112580355A (en) * | 2020-12-30 | 2021-03-30 | 中科院计算技术研究所大数据研究院 | News information topic detection and real-time aggregation method |
CN112905751A (en) * | 2021-03-19 | 2021-06-04 | 常熟理工学院 | Topic evolution tracking method combining topic model and twin network model |
CN113064990A (en) * | 2021-01-04 | 2021-07-02 | 上海金融期货信息技术有限公司 | Hot event identification method and system based on multi-level clustering |
CN113157857A (en) * | 2021-03-13 | 2021-07-23 | 中国科学院新疆理化技术研究所 | Hot topic detection method, device and equipment for news |
CN113792125A (en) * | 2021-08-25 | 2021-12-14 | 北京库睿科技有限公司 | Intelligent retrieval sorting method and device based on text relevance and user intention |
US11436287B2 (en) | 2020-12-07 | 2022-09-06 | International Business Machines Corporation | Computerized grouping of news articles by activity and associated phase of focus |
2017-04-27: CN application CN201710289343.8A (publication CN107423337A), status: Pending
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102194012A (en) * | 2011-06-17 | 2011-09-21 | 清华大学 | Microblog topic detecting method and system |
CN103823848A (en) * | 2014-02-11 | 2014-05-28 | 浙江大学 | LDA (latent dirichlet allocation) and VSM (vector space model) based similar Chinese herb literature recommendation method |
CN104915446A (en) * | 2015-06-29 | 2015-09-16 | 华南理工大学 | Automatic extracting method and system of event evolving relationship based on news |
CN106599181A (en) * | 2016-12-13 | 2017-04-26 | 浙江网新恒天软件有限公司 | Hot news detecting method based on topic model |
Non-Patent Citations (2)
Title |
---|
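李勇 (Li Yong) et al.: "Research on microblog hot topic discovery oriented toward LDA and VSM models", 《自动化技术与应用》 (Techniques of Automation and Applications) * |
李文坤 (Li Wenkun): "Research on new word discovery and topic detection techniques for microblogs", 《中国优秀硕士学位论文全文数据库 信息科技辑》 (China Master's Theses Full-text Database, Information Science and Technology) * |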
Cited By (36)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107992596A (en) * | 2017-12-12 | 2018-05-04 | 百度在线网络技术(北京)有限公司 | Text clustering method, device, server and storage medium |
CN110019556A (en) * | 2017-12-27 | 2019-07-16 | 阿里巴巴集团控股有限公司 | Topic news acquisition method, device and equipment thereof |
CN110019556B (en) * | 2017-12-27 | 2023-08-15 | 阿里巴巴集团控股有限公司 | Topic news acquisition method, device and equipment thereof |
CN108664633A (en) * | 2018-05-15 | 2018-10-16 | 南京大学 | Method for classifying texts by using diversified text characteristics |
CN108664633B (en) * | 2018-05-15 | 2020-12-04 | 南京大学 | Method for classifying texts by using diversified text characteristics |
CN108932228B (en) * | 2018-06-06 | 2023-08-08 | 广东南方报业移动媒体有限公司 | Live broadcast industry news and partition matching method and device, server and storage medium |
CN108932228A (en) * | 2018-06-06 | 2018-12-04 | 武汉斗鱼网络科技有限公司 | Live broadcast industry news and partition matching method and device, server and storage medium |
CN110909021A (en) * | 2018-09-12 | 2020-03-24 | 北京奇虎科技有限公司 | Construction method and device of query rewriting model and application thereof |
CN109684474B (en) * | 2018-11-19 | 2021-01-01 | 北京百度网讯科技有限公司 | Method, device, equipment and storage medium for providing written topics |
CN109684474A (en) * | 2018-11-19 | 2019-04-26 | 北京百度网讯科技有限公司 | Method, device, equipment and storage medium for providing written topics |
CN109857869A (en) * | 2019-01-26 | 2019-06-07 | 北京工业大学 | Ap incremental clustering and network element-based hot topic prediction method |
CN109857869B (en) * | 2019-01-26 | 2021-07-30 | 北京工业大学 | Ap incremental clustering and network element-based hot topic prediction method |
CN110245275B (en) * | 2019-06-18 | 2023-09-01 | 中电科大数据研究院有限公司 | Large-scale similar news headline rapid normalization method |
CN110245275A (en) * | 2019-06-18 | 2019-09-17 | 中电科大数据研究院有限公司 | Large-scale similar news headline rapid normalization method |
CN110297988A (en) * | 2019-07-06 | 2019-10-01 | 四川大学 | Hot topic detection method based on weighted LDA and improved Single-Pass clustering algorithm |
CN110428102A (en) * | 2019-07-31 | 2019-11-08 | 杭州电子科技大学 | Major event trend forecasting method based on HC-TC-LDA |
CN110428102B (en) * | 2019-07-31 | 2021-11-09 | 杭州电子科技大学 | HC-TC-LDA-based major event trend prediction method |
CN110851592A (en) * | 2019-09-19 | 2020-02-28 | 昆明理工大学 | Clustering-based news text optimal theme number calculation method |
CN110851592B (en) * | 2019-09-19 | 2022-04-05 | 昆明理工大学 | Clustering-based news text optimal theme number calculation method |
CN110795533A (en) * | 2019-10-22 | 2020-02-14 | 王帅 | Long text-oriented theme detection method |
CN110765942A (en) * | 2019-10-23 | 2020-02-07 | 睿魔智能科技(深圳)有限公司 | Image data labeling method, device, equipment and storage medium |
CN111198946A (en) * | 2019-12-25 | 2020-05-26 | 北京邮电大学 | Network news hotspot mining method and device |
CN111026835A (en) * | 2019-12-26 | 2020-04-17 | 厦门市美亚柏科信息股份有限公司 | Chat subject detection method, device and storage medium |
CN111026835B (en) * | 2019-12-26 | 2022-06-10 | 厦门市美亚柏科信息股份有限公司 | Chat subject detection method, device and storage medium |
CN111444336A (en) * | 2020-02-25 | 2020-07-24 | 桂林电子科技大学 | Topic detection method based on Siamese network |
CN111814016B (en) * | 2020-07-13 | 2022-07-12 | 重庆邮电大学 | Mixed-granularity multi-view news data clustering method |
CN111814016A (en) * | 2020-07-13 | 2020-10-23 | 重庆邮电大学 | Mixed-granularity multi-view news data clustering method |
US11436287B2 (en) | 2020-12-07 | 2022-09-06 | International Business Machines Corporation | Computerized grouping of news articles by activity and associated phase of focus |
CN112580355A (en) * | 2020-12-30 | 2021-03-30 | 中科院计算技术研究所大数据研究院 | News information topic detection and real-time aggregation method |
CN113064990A (en) * | 2021-01-04 | 2021-07-02 | 上海金融期货信息技术有限公司 | Hot event identification method and system based on multi-level clustering |
CN113157857B (en) * | 2021-03-13 | 2023-06-02 | 中国科学院新疆理化技术研究所 | Hot topic detection method, device and equipment for news |
CN113157857A (en) * | 2021-03-13 | 2021-07-23 | 中国科学院新疆理化技术研究所 | Hot topic detection method, device and equipment for news |
CN112905751A (en) * | 2021-03-19 | 2021-06-04 | 常熟理工学院 | Topic evolution tracking method combining topic model and twin network model |
CN112905751B (en) * | 2021-03-19 | 2024-03-29 | 常熟理工学院 | Topic evolution tracking method combining topic model and twin network model |
CN113792125A (en) * | 2021-08-25 | 2021-12-14 | 北京库睿科技有限公司 | Intelligent retrieval sorting method and device based on text relevance and user intention |
CN113792125B (en) * | 2021-08-25 | 2024-04-02 | 北京库睿科技有限公司 | Intelligent retrieval ordering method and device based on text relevance and user intention |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107423337A (en) | News topic detection method based on LDA Fusion Models and multi-level clustering | |
CN109886020B (en) | Software vulnerability automatic classification method based on deep neural network | |
CN105760507B (en) | Cross-modal topic correlation modeling method based on deep learning | |
CN102929937B (en) | Data processing method for commodity classification based on a text topic model | |
Zhang et al. | Automatic text summarization based on sentences clustering and extraction | |
Yi et al. | Topic modeling for short texts via word embedding and document correlation | |
CN104392006B (en) | Event query processing method and processing device | |
CN109213843A (en) | Detection method and device for spam text information | |
Kaviani et al. | Emhash: Hashtag recommendation using neural network based on bert embedding | |
Wahid et al. | Topic2Labels: A framework to annotate and classify the social media data through LDA topics and deep learning models for crisis response | |
CN112949713B (en) | Text emotion classification method based on complex network integrated learning | |
Zhang et al. | Continuous word embeddings for detecting local text reuses at the semantic level | |
Li et al. | Dirichlet multinomial mixture with variational manifold regularization: Topic modeling over short texts | |
CN107832467A (en) | Microblog topic detection method based on an improved Single-Pass clustering algorithm | |
CN110287321A (en) | Electric power file classification method based on improved feature selection | |
Saleh et al. | A genetic based optimization model for extractive multi-document text summarization | |
Zhang et al. | Clustering based behavior sampling with long sequential data for CTR prediction | |
CN111259156A (en) | Hot spot clustering method facing time sequence | |
Wang et al. | Improving short text classification through better feature space selection | |
Shang | A computational intelligence model for legal prediction and decision support | |
CN108334573A (en) | High relevant microblog search method based on clustering information | |
CN110020034B (en) | Information quotation analysis method and system | |
CN109783586B (en) | Water army comment detection method based on clustering resampling | |
Sun et al. | Chinese microblog sentiment classification based on convolution neural network with content extension method | |
Ma et al. | A novel keyword generation model based on topic-aware and title-guide |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20171201 |