CN106599181A - Hot news detecting method based on topic model - Google Patents
Hot news detecting method based on topic model
- Publication number
- CN106599181A, CN106599181B (application CN201611145855.9A)
- Authority
- CN
- China
- Prior art keywords
- theme
- article
- word
- similarity
- topic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Abstract
The invention discloses a hot news detection method based on a topic model. In the method, a news stream is crawled in a directed manner by a web crawler; each article is first preprocessed by word segmentation and removal of stop words and meaningless character strings; features are then extracted from the preprocessed articles and a text model is built; a text clustering algorithm then adds each highly similar text to the most similar class, yielding a topic base; next, similarity between new and old topics is computed, and new and old topics with high similarity are merged; finally, topic heat is computed and the hottest topics are selected by ranking. An LDA (Latent Dirichlet Allocation) algorithm is innovatively applied to hot topic discovery, and a burstiness concept is introduced, so that the hottest news can be discovered in a timely and effective manner. A topic heat decay concept is also introduced, so that topic heat can be recorded and tracked in real time, truly reflecting how hot news develops and changes, which is of great significance for tracking and displaying hot news.
Description
Technical field
The invention provides a hot news detection method based on a topic model. It involves core technologies and algorithms such as web crawling, cluster analysis and text similarity computation, detects hot news in a timely and effective manner, and tracks how hot news develops.
Background technology
With the development of Internet technology, the era of massive information has arrived. The Internet is flooded with all kinds of information, but only a small minority of news items cause a great stir: the so-called top news, or hot news. Timely discovery of hot news helps people follow the state of society in real time.
On the other hand, a hot news item does not burst and then vanish in a flash; it is usually accompanied by a gradual development process and can trigger other latent problems. Tracking the evolution of hot news is therefore of great significance for the study of social problems.
With the development of the Internet and the rise of big data, the network is awash in bulk information, and discovering hot news among this low-quality information has become crucially important.
The content of the invention
In view of the complexity of today's Internet information, the present invention provides a hot news detection method based on web crawling, cluster analysis and a topic model.
The purpose of the present invention is achieved through the following technical solution: a hot news detection method based on a topic model. A news stream is crawled in a directed manner by a web crawler. Each article is first preprocessed by word segmentation and removal of stop words and meaningless character strings; features are then extracted from the preprocessed articles to build a text model. A text clustering algorithm then adds highly similar texts to the most similar class, yielding a topic base. Next, similarity between new and old topics is computed, and new and old topics with high similarity are merged. Finally, topic heat is computed and the hottest topics are selected by ranking. The method specifically comprises the following steps:
(1) Crawl the news stream in a directed manner with a web crawler; each time N new articles arrive, run one batch process, in which the crawled data are cleaned and the articles are segmented into words to obtain preprocessed articles;
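The batch preprocessing of step (1) can be sketched as follows. This is a minimal illustration: the regex tokenizer and the tiny stop-word list are stand-ins (not from the patent) for a real Chinese word segmenter such as jieba and a full stop-word dictionary.

```python
import re

# Hypothetical stop-word list for illustration only; a production system
# would use a Chinese segmenter and a full stop-word dictionary.
STOPWORDS = {"the", "a", "an", "of", "in", "is"}

def preprocess(article):
    """Clean one crawled article: tokenize, lowercase, and drop stop words
    and meaningless character strings (here: non-word tokens)."""
    tokens = re.findall(r"[A-Za-z]+", article.lower())
    return [t for t in tokens if t not in STOPWORDS]

def batch_preprocess(articles):
    """Run one batch: preprocess every article that arrived in this batch."""
    return [preprocess(a) for a in articles]
```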
(2) Build a vector space model: after preprocessing, the original document can be regarded as being composed of a bag of words. If the document is regarded as a vector, then each word is one feature dimension; by converting documents to vectors, the text data become structured data that a computer can process, and the similarity problem between two documents is converted into the similarity problem between two vectors. The weight of each dimension of the document vector is computed with an improved B-TFIDF algorithm, whose formula is as follows:
In formula (1), w denotes a word, A denotes the number of new articles containing word w, B denotes the number of new articles not containing word w, C denotes the number of historical articles containing word w, and D denotes the number of historical articles not containing word w. In formula (2), di denotes the i-th new article, N denotes the total number of new articles, tf(d, w) denotes the term frequency of word w in article d, and df(w) denotes the number of articles containing word w. The algorithm takes the burstiness of a word into account; burstiness means that a word suddenly appears in large numbers within a short period. The weight of each word of a document is computed with the above algorithm, generating the vector space model of the article, Di = (weight(di, w1), weight(di, w2), weight(di, w3), ..., weight(di, wn)), where n is the total number of words.
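Since the B-TFIDF formula itself appears only as an image in the original and is not recoverable here, the following sketch uses standard TF-IDF weighting as a stand-in to show how a batch of segmented articles becomes the vector space model Di; the function name and structure are illustrative assumptions, not the patent's exact weighting.

```python
import math

def tfidf_weights(new_articles):
    """Standard TF-IDF weighting (a stand-in for the patent's B-TFIDF):
    each article becomes a vector with one dimension per vocabulary word."""
    N = len(new_articles)
    vocab = sorted({w for art in new_articles for w in art})
    # df(w): number of articles containing word w
    df = {w: sum(1 for art in new_articles if w in art) for w in vocab}
    vectors = []
    for art in new_articles:
        # tf(d, w): term frequency of word w in article d
        tf = {w: art.count(w) / len(art) for w in set(art)}
        vectors.append([tf.get(w, 0.0) * math.log(N / df[w]) for w in vocab])
    return vocab, vectors
```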
(3) Article clustering: after step 2, each text is represented as a vector, and the text vectors are clustered using an LDA topic-model clustering algorithm, specifically:
LDA clustering process: LDA is a three-layer Bayesian probability model with word, topic and document layers. The generation of an article is regarded as the following process: a topic is selected with a certain probability, and a word is selected from that topic with a certain probability; the document-to-topic distribution is multinomial, and the topic-to-word distribution is multinomial. LDA clustering yields a "topic-word" probability matrix phi and a "document-topic" probability matrix theta. From theta the probabilities of the m topics for each of the N articles are obtained: each row i of theta represents an article, each column j represents a topic, and the matrix entry theta_ij is the probability that article i belongs to topic j. A screening threshold thresholdT is set; if theta_ij > thresholdT, article i is considered to belong to topic j, and the articles corresponding to each topic are selected in this way.
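The thresholding of the document-topic matrix theta can be sketched as follows, assuming theta is given as a list of per-article probability rows (a simplification of the LDA output format):

```python
def assign_articles(theta, threshold):
    """Assign each article to every topic whose membership probability
    exceeds the screening threshold (theta_ij > thresholdT). An article
    may belong to several topics, or to none."""
    m = len(theta[0])
    topics = {j: [] for j in range(m)}
    for i, row in enumerate(theta):
        for j, p in enumerate(row):
            if p > threshold:
                topics[j].append(i)
    return topics
```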
Determination of the number of clusters m: the cluster number is set to each value from N/10 to N/5 and the LDA clustering algorithm is run repeatedly; the inter-topic similarity of each run is then computed, and the topic number of the run with the minimum inter-topic similarity is selected. The inter-topic similarity is computed from the "topic-word" probability matrix phi obtained by LDA clustering: each row j of phi represents a topic Tj, each column k represents a word wk, and phi_jk is the probability that topic Tj contains word wk. A row of phi can be regarded as the vector form of topic Tj, Tj = (w1, w2, w3, ..., wk, ..., wn), where n is the total number of words. The pairwise similarities between topics are computed, their mean is taken, and the minimum is taken as the final inter-topic similarity. Similarity is computed with the cosine similarity method:

sim(Ti, Tj) = Σk ωk(Ti)·ωk(Tj) / ( sqrt(Σk ωk(Ti)²) · sqrt(Σk ωk(Tj)²) )   (3)

In formula (3), Ti and Tj denote two topics, ωk(Ti) denotes the value of topic Ti in dimension k, and n denotes the total number of words.
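The model-selection rule above (run LDA for several cluster numbers, score each run by the cosine similarity among its topic rows, keep the run whose topics are least similar to each other) might be sketched as follows; the dictionary of candidate runs is an illustrative assumption about how the repeated runs are stored.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (formula (3))."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def mean_topic_similarity(phi):
    """Mean pairwise cosine similarity between the topic rows of phi."""
    sims = [cosine(phi[i], phi[j])
            for i in range(len(phi)) for j in range(i + 1, len(phi))]
    return sum(sims) / len(sims)

def best_topic_number(runs):
    """Among candidate runs {m: phi}, pick the m whose topics are least
    similar to each other (minimum mean inter-topic similarity)."""
    return min(runs, key=lambda m: mean_topic_similarity(runs[m]))
```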
(4) Subject keyword extraction: keywords are extracted from the titles of all articles under a topic. The article titles are first segmented into words; stop words, meaningless words and punctuation marks are filtered out, and the remaining words serve as the subject keywords.
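A minimal sketch of the title-based keyword extraction; as with the preprocessing sketch, the regex tokenizer and the small stop-word list are illustrative stand-ins (assumptions) for a Chinese segmenter and a full stop-word dictionary.

```python
import re

# Hypothetical stop-word list for illustration only.
STOPWORDS = {"the", "a", "of", "in", "on"}

def title_keywords(titles):
    """Segment article titles, drop stop words and punctuation, and
    return the remaining (deduplicated) words as the topic's keywords."""
    keywords = []
    for title in titles:
        for tok in re.findall(r"\w+", title.lower()):
            if tok not in STOPWORDS and tok not in keywords:
                keywords.append(tok)
    return keywords
```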
(5) Topic merging: step 3 yields m topics and their corresponding articles. Next, the m new topics are merged with the old topics: the inter-topic similarity f1 is computed, and if f1 > 0.5 the two topics are considered similar and are merged. The inter-topic similarity f1 is computed as follows:
f1 = 2*vectorSim*keywordSim / (vectorSim + keywordSim)   (4)
In formula (4), vectorSim denotes the topic cosine similarity computed with all words contained in the topics as dimensions, and keywordSim denotes the topic cosine similarity computed with the subject keywords as dimensions; cosine similarity is computed as in formula (3).
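Formula (4) is the harmonic mean of the two cosine similarities; a direct transcription, with the zero-similarity guard added as an assumption:

```python
def merge_score(vector_sim, keyword_sim):
    """Harmonic mean of the full-vocabulary and keyword-only cosine
    similarities (formula (4))."""
    if vector_sim + keyword_sim == 0:
        return 0.0
    return 2 * vector_sim * keyword_sim / (vector_sim + keyword_sim)

def should_merge(vector_sim, keyword_sim, threshold=0.5):
    """Two topics merge when their f1 score exceeds the 0.5 threshold."""
    return merge_score(vector_sim, keyword_sim) > threshold
```

The harmonic mean only scores high when both similarities are high, so a topic pair that agrees on the full vocabulary but not on keywords (or vice versa) is not merged.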
(6) Heat computation: step 5 yields the final set of topics. Next, the topic heat h is computed; topics with high heat are kept, and topics with low heat, i.e. outdated topics, are removed. Since hot topics are characterized by a high news concentration s, the heat is computed as:
ht = Σ sim(di, t)   (5)
In formula (5), di denotes an article contained in topic T; the heat ht of topic T equals the sum of the similarities between the articles under the topic and the topic itself, where sim is as in formula (3).
As time goes on, the heat of a topic keeps decaying until it falls below a threshold and the topic is removed. Heat decay: in each batch process, if new articles arrive under topic T, the heat ht of topic T increases accordingly, ht = ht * Up; if no new article is added to topic T, the heat decays, ht = ht * Down, where Up > 1 and Down < 1.
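The multiplicative heat update can be transcribed directly; the default factors 1.05 and 0.9 are the preferred values for Up and Down stated in the detailed description:

```python
def update_heat(heat, got_new_articles, up=1.05, down=0.9):
    """Per-batch heat update: boost a topic that received new articles
    (Up > 1), decay one that did not (Down < 1)."""
    return heat * (up if got_new_articles else down)
```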
The beneficial effects of the invention are as follows: the invention innovatively applies the LDA algorithm to hot topic discovery and introduces a burstiness concept, so that the latest hot news can be discovered in a timely and effective manner. It also introduces a topic heat decay concept, so that topic heat can be recorded and tracked in real time, truly reflecting the development and change of hot news, which is of great significance for tracking and displaying hot news.
Description of the drawings
Fig. 1 is a schematic diagram of the hot news detection process based on the topic model;
Fig. 2 is a schematic diagram of the article modeling process;
Fig. 3 is a schematic diagram of the LDA clustering process;
Fig. 4 is a schematic diagram of the merging of new and old topics;
Fig. 5 is a schematic diagram of the topic heat computation.
Specific embodiment
The present invention is described in further detail below with reference to the accompanying drawings and a specific embodiment.
As shown in Fig. 1, the hot news detection method based on a topic model proposed by the present invention comprises the following steps:
(1) Crawl the news stream in a directed manner with a web crawler; each time N articles arrive, run one batch process, in which the crawled data are cleaned and the articles are segmented into words to obtain preprocessed articles;
(2) Build a vector space model: as shown in Fig. 2, after preprocessing, the original document can be regarded as being composed of a bag of words. If the document is regarded as a vector, then each word is one feature dimension; by converting documents to vectors, the text data become structured data that a computer can process, and the similarity problem between two documents is converted into the similarity problem between two vectors. The weight of each dimension of the document vector is computed with the improved B-TFIDF algorithm:
In formula (1), w denotes a word, A denotes the number of new articles containing word w, B denotes the number of new articles not containing word w, C denotes the number of historical articles containing word w, and D denotes the number of historical articles not containing word w. In formula (2), di denotes the i-th new article, N denotes the total number of new articles, tf(d, w) denotes the term frequency of word w in article d, and df(w) denotes the number of articles containing word w. The algorithm takes the burstiness of a word into account; burstiness means that a word suddenly appears in large numbers within a short period. The weight of each word of a document is computed with the above algorithm, generating the vector space model of the article, Di = (weight(di, w1), weight(di, w2), weight(di, w3), ..., weight(di, wn)), where n is the total number of words.
(3) Article clustering: after step 2, each text is represented as a vector, and the text vectors are clustered. As shown in Fig. 3, an LDA topic-model clustering algorithm is employed here. LDA is a three-layer Bayesian probability model with word, topic and document layers; the generation of an article is regarded as the following process: a topic is selected with a certain probability, and a word is selected from that topic with a certain probability. The document-to-topic distribution is multinomial, and the topic-to-word distribution is multinomial. LDA cluster analysis yields the "topic-word" probability matrix and the "document-topic" probability matrix; the detailed process is described below.
LDA clustering process: LDA clustering yields the "topic-word" probability matrix phi and the "document-topic" probability matrix theta. From theta the probabilities of the m topics for each of the N articles are obtained: each row i of theta represents an article, each column j represents a topic, and theta_ij is the probability that article i belongs to topic j. A screening threshold thresholdT (preferred value 0.32) is set; if theta_ij > thresholdT, article i is considered to belong to topic j, and the articles corresponding to each topic are selected in this way.
Determination of the number of clusters m: since for N articles a cluster number between N/10 and N/5 matches reality well (for example, when the total number of new articles N is 150, a cluster number between 15 and 30 matches reality well), the cluster number is set to each value from N/10 to N/5 and the LDA clustering algorithm is run repeatedly; the inter-topic similarity of each run is then computed, and the topic number of the run with the minimum inter-topic similarity is selected. Computing the inter-topic similarity requires the "topic-word" probability matrix phi obtained by LDA clustering: each row j of phi represents a topic Tj, each column k represents a word wk, and phi_jk is the probability that topic Tj contains word wk. A row of phi can be regarded as the vector form of topic Tj, Tj = (w1, w2, w3, ..., wk, ..., wn), where n is the total number of words. The pairwise similarities between topics are computed, their mean is taken, and the minimum is taken as the final inter-topic similarity. Similarity is computed with the cosine similarity method:

sim(Ti, Tj) = Σk ωk(Ti)·ωk(Tj) / ( sqrt(Σk ωk(Ti)²) · sqrt(Σk ωk(Tj)²) )   (3)

In formula (3), Ti and Tj denote two topics, ωk(Ti) denotes the value of topic Ti in dimension k, and n denotes the total number of words.
(4) Subject keyword extraction: keywords are extracted from the titles of all articles under a topic. The article titles are first segmented into words; stop words, meaningless words and punctuation marks are filtered out, and the remaining words serve as the subject keywords.
(5) Topic merging: step 3 yields m topics and their corresponding articles. Next, the m new topics are merged with the old topics, as shown in Fig. 4: the inter-topic similarity f1 is computed, and if f1 > 0.5 the two topics are considered similar and are merged. The inter-topic similarity f1 is computed as follows:
f1 = 2*vectorSim*keywordSim / (vectorSim + keywordSim)   (4)
In formula (4), vectorSim denotes the topic cosine similarity computed with all words contained in the topics as dimensions, and keywordSim denotes the topic cosine similarity computed with the subject keywords as dimensions; cosine similarity is computed as in formula (3).
(6) Heat computation: as shown in Fig. 5, step 5 yields the final set of topics. Next, the topic heat h is computed; topics with high heat are kept, and topics with low heat, i.e. outdated topics, are removed. Since hot topics are characterized by a high news concentration s, the heat is computed as:
ht = Σ sim(di, t)   (5)
In formula (5), di denotes an article contained in topic T; the heat ht of topic T equals the sum of the similarities between the articles under the topic and the topic itself, where sim is as in formula (3).
As time goes on, the heat of a topic keeps decaying until it falls below a threshold and the topic is removed. Heat decay: in each batch process, if new articles arrive under topic T, the heat ht of topic T increases accordingly, ht = ht * Up (preferred value 1.05); if no new article is added to topic T, the heat decays, ht = ht * Down (preferred value 0.9).
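With the preferred decay factor, one can check how quickly an idle topic decays below a removal threshold; the threshold value itself is not specified in the patent, so it is a free parameter here:

```python
def batches_until_removed(heat, removal_threshold, down=0.9):
    """With the preferred decay factor Down = 0.9, count how many idle
    batches pass before a topic's heat falls below the removal threshold."""
    batches = 0
    while heat >= removal_threshold:
        heat *= down
        batches += 1
    return batches
```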
Claims (1)
1. A hot news detection method based on a topic model, characterized by comprising the following steps:
(1) Crawl the news stream in a directed manner with a web crawler; each time N new articles arrive, run one batch process, in which the crawled data are cleaned and the articles are segmented into words to obtain preprocessed articles;
(2) Build a vector space model: after preprocessing, the original document can be regarded as being composed of a bag of words. If the document is regarded as a vector, each word is one feature dimension; by converting documents to vectors, the text data become structured data that a computer can process, and the similarity problem between two documents is converted into the similarity problem between two vectors. The weight of each dimension of the document vector is computed with an improved B-TFIDF algorithm, whose formula is as follows:
In formula (1), w denotes a word, A denotes the number of new articles containing word w, B denotes the number of new articles not containing word w, C denotes the number of historical articles containing word w, and D denotes the number of historical articles not containing word w. In formula (2), di denotes the i-th new article, N denotes the total number of new articles, tf(d, w) denotes the term frequency of word w in article d, and df(w) denotes the number of articles containing word w. The algorithm takes the burstiness of a word into account; burstiness means that a word suddenly appears in large numbers within a short period. The weight of each word of a document is computed with the above algorithm, generating the vector space model of the article, Di = (weight(di, w1), weight(di, w2), weight(di, w3), ..., weight(di, wn)), where n is the total number of words.
(3) Article clustering: after step 2, each text is represented as a vector, and the text vectors are clustered using an LDA topic-model clustering algorithm, specifically:
LDA clustering process: LDA is a three-layer Bayesian probability model with word, topic and document layers. The generation of an article is regarded as the following process: a topic is selected with a certain probability, and a word is selected from that topic with a certain probability; the document-to-topic distribution is multinomial, and the topic-to-word distribution is multinomial. LDA clustering yields the "topic-word" probability matrix phi and the "document-topic" probability matrix theta. From theta the probabilities of the m topics for each of the N articles are obtained: each row i of theta represents an article, each column j represents a topic, and theta_ij is the probability that article i belongs to topic j. A screening threshold thresholdT is set; if theta_ij > thresholdT, article i is considered to belong to topic j, and the articles corresponding to each topic are selected in this way.
Determination of the number of clusters m: the cluster number is set to each value from N/10 to N/5 and the LDA clustering algorithm is run repeatedly; the inter-topic similarity of each run is then computed, and the topic number of the run with the minimum inter-topic similarity is selected. The inter-topic similarity is computed from the "topic-word" probability matrix phi obtained by LDA clustering: each row j of phi represents a topic Tj, each column k represents a word wk, and phi_jk is the probability that topic Tj contains word wk. A row of phi can be regarded as the vector form of topic Tj, Tj = (w1, w2, w3, ..., wk, ..., wn), where n is the total number of words. The pairwise similarities between topics are computed, their mean is taken, and the minimum is taken as the final inter-topic similarity. Similarity is computed with the cosine similarity method:

sim(Ti, Tj) = Σk ωk(Ti)·ωk(Tj) / ( sqrt(Σk ωk(Ti)²) · sqrt(Σk ωk(Tj)²) )   (3)

In formula (3), Ti and Tj denote two topics, ωk(Ti) denotes the value of topic Ti in dimension k, and n denotes the total number of words.
(4) Subject keyword extraction: keywords are extracted from the titles of all articles under a topic. The article titles are first segmented into words; stop words, meaningless words and punctuation marks are filtered out, and the remaining words serve as the subject keywords.
(5) Topic merging: step 3 yields m topics and their corresponding articles. Next, the m new topics are merged with the old topics: the inter-topic similarity f1 is computed, and if f1 > 0.5 the two topics are considered similar and are merged. The inter-topic similarity f1 is computed as follows:
f1 = 2*vectorSim*keywordSim / (vectorSim + keywordSim)   (4)
In formula (4), vectorSim denotes the topic cosine similarity computed with all words contained in the topics as dimensions, and keywordSim denotes the topic cosine similarity computed with the subject keywords as dimensions; cosine similarity is computed as in formula (3).
(6) Heat computation: step 5 yields the final set of topics. Next, the topic heat h is computed; topics with high heat are kept, and topics with low heat, i.e. outdated topics, are removed. Since hot topics are characterized by a high news concentration s, the heat is computed as:
ht = Σ sim(di, t)   (5)
In formula (5), di denotes an article contained in topic T; the heat ht of topic T equals the sum of the similarities between the articles under the topic and the topic itself, where sim is as in formula (3).
As time goes on, the heat of a topic keeps decaying until it falls below a threshold and the topic is removed. Heat decay: in each batch process, if new articles arrive under topic T, the heat ht of topic T increases accordingly, ht = ht * Up; if no new article is added to topic T, the heat decays, ht = ht * Down, where Up > 1 and Down < 1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611145855.9A CN106599181B (en) | 2016-12-13 | 2016-12-13 | A kind of hot news detection method based on topic model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106599181A true CN106599181A (en) | 2017-04-26 |
CN106599181B CN106599181B (en) | 2019-06-18 |
Family
ID=58802054
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611145855.9A Active CN106599181B (en) | 2016-12-13 | 2016-12-13 | A kind of hot news detection method based on topic model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106599181B (en) |
Cited By (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107203632A (en) * | 2017-06-01 | 2017-09-26 | 中国人民解放军国防科学技术大学 | Topic Popularity prediction method based on similarity relation and cooccurrence relation |
CN107239497A (en) * | 2017-05-02 | 2017-10-10 | 广东万丈金数信息技术股份有限公司 | Hot content searching method and system |
CN107330049A (en) * | 2017-06-28 | 2017-11-07 | 北京搜狐新媒体信息技术有限公司 | A kind of news temperature predictor method and system |
CN107423337A (en) * | 2017-04-27 | 2017-12-01 | 天津大学 | News topic detection method based on LDA Fusion Models and multi-level clustering |
CN107451224A (en) * | 2017-07-17 | 2017-12-08 | 广州特道信息科技有限公司 | A kind of clustering method and system based on big data parallel computation |
CN107563725A (en) * | 2017-08-25 | 2018-01-09 | 浙江网新恒天软件有限公司 | A kind of recruitment system for optimizing cumbersome personnel recruitment process |
CN107656919A (en) * | 2017-09-12 | 2018-02-02 | 中国软件与技术服务股份有限公司 | A kind of optimal L DA Automatic Model Selection methods based on minimum average B configuration similarity between theme |
CN107784127A (en) * | 2017-11-30 | 2018-03-09 | 杭州数梦工场科技有限公司 | A kind of focus localization method and device |
CN107835113A (en) * | 2017-07-05 | 2018-03-23 | 中山大学 | Abnormal user detection method in a kind of social networks based on network mapping |
CN107832418A (en) * | 2017-11-08 | 2018-03-23 | 郑州云海信息技术有限公司 | A kind of much-talked-about topic finds method, system and a kind of much-talked-about topic discovering device |
CN107862089A (en) * | 2017-12-02 | 2018-03-30 | 北京工业大学 | A kind of tag extraction method based on perception data |
CN107918644A (en) * | 2017-10-31 | 2018-04-17 | 北京锐思爱特咨询股份有限公司 | News subject under discussion analysis method and implementation system in reputation Governance framework |
CN107992542A (en) * | 2017-11-27 | 2018-05-04 | 中山大学 | A kind of similar article based on topic model recommends method |
CN108090157A (en) * | 2017-12-12 | 2018-05-29 | 百度在线网络技术(北京)有限公司 | A kind of hot news method for digging, device and server |
CN108153818A (en) * | 2017-11-29 | 2018-06-12 | 成都东方盛行电子有限责任公司 | A kind of clustering method based on big data |
CN110096649A (en) * | 2019-05-14 | 2019-08-06 | 武汉斗鱼网络科技有限公司 | A kind of model extracting method, device, equipment and storage medium |
CN110532388A (en) * | 2019-08-15 | 2019-12-03 | 苏州朗动网络科技有限公司 | Method, equipment and the storage medium of text cluster |
CN110888978A (en) * | 2018-09-06 | 2020-03-17 | 北京京东金融科技控股有限公司 | Article clustering method and device, electronic equipment and storage medium |
CN111343467A (en) * | 2020-02-10 | 2020-06-26 | 腾讯科技(深圳)有限公司 | Live broadcast data processing method and device, electronic equipment and storage medium |
WO2021027116A1 (en) * | 2019-08-15 | 2021-02-18 | 平安科技(深圳)有限公司 | Method and apparatus for discovering text hotspot and computer-readable storage medium |
CN112612889A (en) * | 2020-12-28 | 2021-04-06 | 中科院计算技术研究所大数据研究院 | Multilingual document classification method and device and storage medium |
CN113127611A (en) * | 2019-12-31 | 2021-07-16 | 北京中关村科金技术有限公司 | Method and device for processing question corpus and storage medium |
CN113360600A (en) * | 2021-06-03 | 2021-09-07 | 中国科学院计算机网络信息中心 | Method and system for screening enterprise performance prediction indexes based on signal attenuation |
WO2022037446A1 (en) * | 2020-08-20 | 2022-02-24 | 西南电子技术研究所(中国电子科技集团公司第十研究所) | Front-page news prediction and classification method |
US11436287B2 (en) | 2020-12-07 | 2022-09-06 | International Business Machines Corporation | Computerized grouping of news articles by activity and associated phase of focus |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140019460A1 (en) * | 2012-07-12 | 2014-01-16 | Yahoo! Inc. | Targeted search suggestions |
CN104699814A (en) * | 2015-03-24 | 2015-06-10 | 清华大学 | Searching method and system of hot spot information |
CN106156276A (en) * | 2016-06-25 | 2016-11-23 | 贵州大学 | Hot news discovery method based on Pitman Yor process |
Cited By (36)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107423337A (en) * | 2017-04-27 | 2017-12-01 | 天津大学 | News topic detection method based on LDA Fusion Models and multi-level clustering |
CN107239497A (en) * | 2017-05-02 | 2017-10-10 | 广东万丈金数信息技术股份有限公司 | Hot content searching method and system |
CN107239497B (en) * | 2017-05-02 | 2020-11-03 | 广东万丈金数信息技术股份有限公司 | Hot content search method and system |
CN107203632A (en) * | 2017-06-01 | 2017-09-26 | 中国人民解放军国防科学技术大学 | Topic Popularity prediction method based on similarity relation and cooccurrence relation |
CN107330049A (en) * | 2017-06-28 | 2017-11-07 | 北京搜狐新媒体信息技术有限公司 | A kind of news temperature predictor method and system |
CN107330049B (en) * | 2017-06-28 | 2020-05-22 | 北京搜狐新媒体信息技术有限公司 | News popularity estimation method and system |
CN107835113A (en) * | 2017-07-05 | 2018-03-23 | 中山大学 | Abnormal user detection method in a kind of social networks based on network mapping |
CN107835113B (en) * | 2017-07-05 | 2020-09-08 | 中山大学 | Method for detecting abnormal user in social network based on network mapping |
CN107451224A (en) * | 2017-07-17 | 2017-12-08 | 广州特道信息科技有限公司 | A kind of clustering method and system based on big data parallel computation |
CN107563725B (en) * | 2017-08-25 | 2021-04-06 | 浙江网新恒天软件有限公司 | Recruitment system for optimizing fussy talent recruitment process |
CN107563725A (en) * | 2017-08-25 | 2018-01-09 | 浙江网新恒天软件有限公司 | A kind of recruitment system for optimizing cumbersome personnel recruitment process |
CN107656919B (en) * | 2017-09-12 | 2018-10-26 | 中国软件与技术服务股份有限公司 | A kind of optimal L DA Automatic Model Selection methods based on minimum average B configuration similarity between theme |
CN107656919A (en) * | 2017-09-12 | 2018-02-02 | 中国软件与技术服务股份有限公司 | A kind of optimal L DA Automatic Model Selection methods based on minimum average B configuration similarity between theme |
CN107918644A (en) * | 2017-10-31 | 2018-04-17 | 北京锐思爱特咨询股份有限公司 | News subject under discussion analysis method and implementation system in reputation Governance framework |
CN107832418A (en) * | 2017-11-08 | 2018-03-23 | 郑州云海信息技术有限公司 | Hot topic discovery method, system and device |
CN107992542A (en) * | 2017-11-27 | 2018-05-04 | 中山大学 | Similar article recommendation method based on topic model |
CN108153818B (en) * | 2017-11-29 | 2021-08-10 | 成都东方盛行电子有限责任公司 | Big data based clustering method |
CN108153818A (en) * | 2017-11-29 | 2018-06-12 | 成都东方盛行电子有限责任公司 | Clustering method based on big data |
CN107784127A (en) * | 2017-11-30 | 2018-03-09 | 杭州数梦工场科技有限公司 | Hotspot positioning method and device |
CN107862089B (en) * | 2017-12-02 | 2020-03-13 | 北京工业大学 | Label extraction method based on perception data |
CN107862089A (en) * | 2017-12-02 | 2018-03-30 | 北京工业大学 | Label extraction method based on perception data |
CN108090157A (en) * | 2017-12-12 | 2018-05-29 | 百度在线网络技术(北京)有限公司 | Hot news mining method, device and server |
CN110888978A (en) * | 2018-09-06 | 2020-03-17 | 北京京东金融科技控股有限公司 | Article clustering method and device, electronic equipment and storage medium |
CN110096649B (en) * | 2019-05-14 | 2021-07-30 | 武汉斗鱼网络科技有限公司 | Post extraction method, device, equipment and storage medium |
CN110096649A (en) * | 2019-05-14 | 2019-08-06 | 武汉斗鱼网络科技有限公司 | Model extraction method, device, equipment and storage medium |
WO2021027116A1 (en) * | 2019-08-15 | 2021-02-18 | 平安科技(深圳)有限公司 | Method and apparatus for discovering text hotspot and computer-readable storage medium |
CN110532388A (en) * | 2019-08-15 | 2019-12-03 | 苏州朗动网络科技有限公司 | Text clustering method, equipment and storage medium |
CN110532388B (en) * | 2019-08-15 | 2022-07-01 | 企查查科技有限公司 | Text clustering method, equipment and storage medium |
CN113127611A (en) * | 2019-12-31 | 2021-07-16 | 北京中关村科金技术有限公司 | Method and device for processing question corpus and storage medium |
CN111343467A (en) * | 2020-02-10 | 2020-06-26 | 腾讯科技(深圳)有限公司 | Live broadcast data processing method and device, electronic equipment and storage medium |
CN111343467B (en) * | 2020-02-10 | 2021-10-26 | 腾讯科技(深圳)有限公司 | Live broadcast data processing method and device, electronic equipment and storage medium |
WO2022037446A1 (en) * | 2020-08-20 | 2022-02-24 | 西南电子技术研究所(中国电子科技集团公司第十研究所) | Front-page news prediction and classification method |
US11436287B2 (en) | 2020-12-07 | 2022-09-06 | International Business Machines Corporation | Computerized grouping of news articles by activity and associated phase of focus |
CN112612889A (en) * | 2020-12-28 | 2021-04-06 | 中科院计算技术研究所大数据研究院 | Multilingual document classification method and device and storage medium |
CN112612889B (en) * | 2020-12-28 | 2021-10-29 | 中科院计算技术研究所大数据研究院 | Multilingual document classification method and device and storage medium |
CN113360600A (en) * | 2021-06-03 | 2021-09-07 | 中国科学院计算机网络信息中心 | Method and system for screening enterprise performance prediction indexes based on signal attenuation |
Also Published As
Publication number | Publication date |
---|---|
CN106599181B (en) | 2019-06-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106599181B (en) | Hot news detection method based on topic model | |
CN104216954B (en) | Prediction device and prediction method for accident topic state | |
US20170185680A1 (en) | Chinese website classification method and system based on characteristic analysis of website homepage | |
CN104572977B (en) | Online detection method for agricultural product quality and safety events | |
Zhang et al. | Multiresolution graph attention networks for relevance matching | |
CN106250513A (en) | Event personalized classification method and system based on event modeling | |
TWI695277B (en) | Automatic website data collection method | |
CN102999615B (en) | Image annotation and search method based on radial basis function neural network | |
Lee | Unsupervised and supervised learning to evaluate event relatedness based on content mining from social-media streams | |
Yu et al. | Question classification based on co-training style semi-supervised learning | |
Rafea et al. | Topic detection approaches in identifying topics and events from Arabic corpora
Papadopoulos et al. | Image clustering through community detection on hybrid image similarity graphs | |
CN110222172A (en) | Multi-source network public opinion topic mining method based on improved hierarchical clustering | |
CN103761286B (en) | Service resource search method based on user interest | |
Hossain et al. | A study towards Bangla fake news detection using machine learning and deep learning | |
Yuan et al. | Research of deceptive review detection based on target product identification and metapath feature weight calculation | |
CN106844765B (en) | Significant information detection method and device based on convolutional neural network | |
Konagala et al. | Fake news detection using deep learning: supervised fake news detection analysis in social media with semantic similarity method | |
Huang et al. | Tag refinement of micro-videos by learning from multiple data sources | |
Lin et al. | GIF video sentiment detection using semantic sequence | |
CN107423294A (en) | Community image search method and system | |
Zhao et al. | Lsif: A system for large-scale information flow detection based on topic-related semantic similarity measurement | |
Meng | Text clustering and economic analysis of free trade zone governance strategies based on random matrix and subject analysis | |
Yang et al. | Web service clustering method based on word vector and biterm topic model | |
Vahidnia et al. | Document Clustering and Labeling for Research Trend Extraction and Evolution Mapping. |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||