CN106599181A - Hot news detecting method based on topic model - Google Patents

Hot news detecting method based on topic model

Info

Publication number
CN106599181A
Authority
CN
China
Prior art keywords
theme
article
word
similarity
topic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201611145855.9A
Other languages
Chinese (zh)
Other versions
CN106599181B (en)
Inventor
庄郭冕
黄乔
彭志宇
付晗
王忆诗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Insigma Hengtian Software Ltd
Original Assignee
Insigma Hengtian Software Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Insigma Hengtian Software Ltd filed Critical Insigma Hengtian Software Ltd
Priority to CN201611145855.9A priority Critical patent/CN106599181B/en
Publication of CN106599181A publication Critical patent/CN106599181A/en
Application granted granted Critical
Publication of CN106599181B publication Critical patent/CN106599181B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques

Abstract

The invention discloses a hot news detection method based on a topic model. In the method, news streams are crawled in a directed manner by a web crawler; each article is first preprocessed by word segmentation and removal of stop words and meaningless character strings; features are then extracted from the preprocessed articles to build a text model; texts with high similarity are assigned to the most similar class by a text clustering algorithm to obtain a topic base; next, the similarity between new and old topics is computed and new and old topics with high similarity are merged; finally, topic heat is calculated and the hottest topics are selected by ranking. The method creatively applies the LDA (Latent Dirichlet Allocation) algorithm to hot topic discovery and introduces a burstiness concept, so that the hottest news can be discovered in a timely and effective manner. It also introduces a topic heat decay concept, so that topic heat can be recorded and tracked in real time, truly reflecting the development and change of hot news, which is of great significance for tracking and displaying hot news.

Description

Hot news detection method based on a topic model
Technical field
The invention provides a hot news detection method based on a topic model. It involves core technologies and algorithms such as web crawling, cluster analysis and text similarity computation, detects hot news in a timely and effective manner, and tracks the development of hot news.
Background technology
With the development of Internet technology, the era of massive information has arrived. The Internet is filled with all kinds of information, but only a small amount of news causes a sensation, namely so-called headline or hot news, and timely discovery of hot news helps people follow the state of society in real time.
On the other hand, a piece of hot news does not erupt and die in an instant; it is usually accompanied by an evolving development process and may cause other potential problems to arise, so tracking the evolution of hot news is of great significance for the study of social problems.
With the development of the Internet and the rise of big data, the Internet is flooded with massive amounts of information, and discovering hot news within this mass of low-quality information becomes particularly important.
Summary of the invention
In view of the complexity of present-day Internet information, the object of the present invention is to provide a hot news detection method based on web crawling, cluster analysis and a topic model.
The object of the present invention is achieved through the following technical solution: a hot news detection method based on a topic model, in which news streams are crawled in a directed manner by a web crawler; each article is first preprocessed by word segmentation and removal of stop words and meaningless character strings; features are then extracted from the preprocessed articles to build a text model; texts with high similarity are then added to the most similar class by a text clustering algorithm to obtain a topic base; the similarity between new and old topics is then computed and new and old topics with high similarity are merged; finally, topic heat is calculated and the hottest topics are selected by ranking. The method specifically comprises the following steps:
(1) News streams are crawled in a directed manner by a web crawler; a batch is processed each time N new articles arrive; the crawled data are cleaned and the articles are segmented into words to obtain the preprocessed articles.
(2) Building a vector space model: after preprocessing, an original document can be regarded as a collection of words. If the document is regarded as a vector, each word is one feature dimension; by converting documents into vectors, text data become structured data that a computer can process subsequently, and the similarity problem between two documents is converted into the similarity problem between two vectors. The weight of each dimension of a document vector is computed with an improved B-TFIDF algorithm, whose formulas are as follows:

b(w) = (A + B + C + D) · (A·D - B·C)² / [(A + B) · (C + D) · (A + C) · (B + D)]    (1)

weight(d_i, w) = tf(d_i, w) · log((N + 1) / (df(w) + 0.5)) · b(w) / sqrt( Σ_{w'∈d_i} [ tf(d_i, w') · log((N + 1) / (df(w') + 0.5)) · b(w') ]² )    (2)

In formula (1), w denotes a word, A denotes the number of new articles containing word w, B the number of new articles not containing word w, C the number of history articles containing word w, and D the number of history articles not containing word w. In formula (2), d_i denotes the i-th new article, N the total number of new articles, tf(d, w) the term frequency of word w in article d, and df(w) the number of articles containing word w. The algorithm takes the burstiness of a word into account, where burstiness means that a word suddenly appears in large quantities within a short period. The weight of every word making up a document is computed by the above algorithm, yielding the article's vector space model D_i = (weight(d_i, w_1), weight(d_i, w_2), weight(d_i, w_3), ..., weight(d_i, w_n)), where n is the total number of words.
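As an illustration only (not part of the original disclosure), the B-TFIDF weighting of formulas (1) and (2) could be sketched in Python roughly as follows; the function names and the dictionaries doc_tf, df and burst are assumed bookkeeping, and the square-root normalisation in the denominator follows the usual cosine-style normalisation:

```python
import math

def burstiness(A, B, C, D):
    """Formula (1): burstiness of a word, from the counts of new articles
    containing it (A), new articles without it (B), history articles
    containing it (C) and history articles without it (D)."""
    den = (A + B) * (C + D) * (A + C) * (B + D)
    return (A + B + C + D) * (A * D - B * C) ** 2 / den if den else 0.0

def b_tfidf_vector(doc_tf, df, N, burst):
    """Formula (2): B-TFIDF weight of every word of one new article.
    doc_tf: {word: term frequency in this article}
    df:     {word: number of articles containing the word}
    N:      total number of new articles
    burst:  {word: burstiness from formula (1)}"""
    raw = {w: tf * math.log((N + 1) / (df[w] + 0.5)) * burst[w]
           for w, tf in doc_tf.items()}
    norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
    return {w: v / norm for w, v in raw.items()}
```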
(3) Article clustering: after step 2, each text is represented as a vector, and the text vectors are clustered using an LDA topic-model clustering algorithm, specifically:
LDA clustering process: LDA is a three-layer Bayesian probability model comprising the word, topic and document layers. The generation of an article is regarded as the following process: a topic is selected with a certain probability, and a word is selected from that topic with a certain probability; documents obey a multinomial distribution over topics and topics obey a multinomial distribution over words. LDA clustering yields the "topic-word" probability matrix phi and the "document-topic" probability matrix theta. From theta, the probabilities relating the m topics to the N articles are obtained: each row i of theta represents an article, each column j represents a topic, and the matrix entry theta_ij is the probability that article i belongs to topic j. A screening threshold thresholdT is set; if theta_ij > thresholdT, article i is considered to belong to topic j, and in this way the articles corresponding to each topic are selected.
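A minimal sketch of the article-to-topic screening, assuming the document-topic matrix theta has already been produced by some LDA implementation (the function name and the use of NumPy are illustrative assumptions):

```python
import numpy as np

def articles_per_topic(theta, threshold=0.32):
    """theta: (N articles x m topics) document-topic probability matrix.
    Returns {topic index: article indices with theta[i, j] > threshold};
    0.32 is the preferred thresholdT value given in the embodiment."""
    theta = np.asarray(theta)
    return {j: np.where(theta[:, j] > threshold)[0].tolist()
            for j in range(theta.shape[1])}
```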
Determination of the number m of LDA clusters: the LDA clustering algorithm is run repeatedly with the cluster number set, in turn, to each value from N/10 to N/5; the inter-topic similarity of each run is then computed, and the cluster number of the run with the minimum inter-topic similarity is selected. The inter-topic similarity is computed from the "topic-word" probability matrix phi produced by the LDA clustering: each row j of phi represents a topic T_j, each column k represents a word w_k, and phi_jk is the probability that topic T_j contains word w_k. A row of phi can thus be regarded as the vector form of topic T_j, T_j = (w_1, w_2, w_3, ... w_k ... w_n), where n is the total number of words. The similarity of every pair of topics is computed and averaged to give the run's inter-topic similarity, and the minimum across the runs determines the final choice. Similarity is computed with the cosine similarity measure:

sim(T_i, T_j) = Σ_{k=1..n} ω_k(T_i) · ω_k(T_j) / sqrt( ( Σ_{k=1..n} ω_k²(T_i) ) · ( Σ_{k=1..n} ω_k²(T_j) ) )    (3)

In formula (3), T_i and T_j denote two topics, ω_k(T_i) denotes the value of topic T_i in dimension k, and n denotes the total number of words.
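A minimal sketch of this model-selection loop, assuming a helper run_lda(m) that trains LDA with m topics and returns its topic-word matrix phi (the helper and all names are illustrative, not part of the patent):

```python
from itertools import combinations
import numpy as np

def cosine(u, v):
    """Formula (3): cosine similarity of two topic vectors (rows of phi)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def mean_topic_similarity(phi):
    """Average pairwise cosine similarity between the topic rows of phi."""
    pairs = combinations(range(len(phi)), 2)
    return float(np.mean([cosine(phi[i], phi[j]) for i, j in pairs]))

def choose_num_topics(run_lda, N):
    """Try every cluster number in [N/10, N/5] and keep the one whose
    topics are least similar to each other on average."""
    candidates = range(max(2, N // 10), N // 5 + 1)
    return min(candidates, key=lambda m: mean_topic_similarity(run_lda(m)))
```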
(4) Topic keyword extraction: keywords are extracted from the titles of all articles under a topic. The article titles are first segmented into words; stop words, meaningless words and punctuation are filtered out, and the remaining words serve as the topic keywords.
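A minimal sketch of the keyword-extraction step for Chinese titles; the jieba segmenter and the externally supplied stop-word list are assumptions, not specified by the patent:

```python
import string
import jieba  # a common Chinese word segmenter; any tokenizer would do

def topic_keywords(titles, stopwords):
    """Segment the titles of a topic's articles and keep the content words."""
    keywords = set()
    for title in titles:
        for token in jieba.lcut(title):
            token = token.strip()
            if (token and token not in stopwords
                    and not all(ch in string.punctuation for ch in token)):
                keywords.add(token)
    return keywords
```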
(5) Topic merging: step 3 yields m topics and their corresponding articles. Next the m new topics are merged with the old topics by computing the inter-topic similarity f1; if f1 > 0.5 the two topics are considered similar and are merged. The inter-topic similarity f1 is computed as:

f1 = 2 * vectorSim * keywordSim / (vectorSim + keywordSim)    (4)

In formula (4), vectorSim denotes the topic cosine similarity computed with all words contained in the topics as dimensions, and keywordSim denotes the topic cosine similarity computed with the topic keywords as dimensions; both cosine similarities are computed as in formula (3).
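A minimal sketch of the merge test of formula (4), assuming each topic is kept as a small dictionary holding its word-weight vector and its keyword set (these structures and names are illustrative):

```python
import math

def cosine_dict(a, b):
    """Cosine similarity of two sparse {dimension: weight} vectors."""
    dot = sum(a[k] * b[k] for k in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def should_merge(new_topic, old_topic, threshold=0.5):
    """Formula (4): harmonic mean of word-vector and keyword similarities."""
    vector_sim = cosine_dict(new_topic["word_weights"], old_topic["word_weights"])
    keyword_sim = cosine_dict({k: 1.0 for k in new_topic["keywords"]},
                              {k: 1.0 for k in old_topic["keywords"]})
    if vector_sim + keyword_sim == 0:
        return False
    f1 = 2 * vector_sim * keyword_sim / (vector_sim + keyword_sim)
    return f1 > threshold
```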
(6) Heat calculation: step 5 yields the final set of topics. The heat h of each topic is then calculated, topics with high heat are selected, and topics with low heat, i.e. outdated topics, are removed. Based on the characteristic that hot topics have a high news concentration, the heat is computed as:

h_t = Σ sim(d_i, t)    (5)

In formula (5), d_i denotes an article contained in topic T; the heat h_t of topic T equals the sum of the similarities between the articles under the topic and the topic itself, where sim is as in formula (3).
As time goes on, the heat of a topic decays continuously until it falls below a threshold and the topic is removed. Heat decay: in each batch process, if new articles arrive under topic T, the heat h_t of topic T is increased accordingly, h_t = h_t * Up; if no new article is added to topic T, the heat decays, h_t = h_t * Down, where Up > 1 and Down < 1.
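A minimal sketch of the per-batch heat bookkeeping, using the preferred Up = 1.05 and Down = 0.9 given in the embodiment; the topic dictionary layout and the removal-threshold value are illustrative assumptions:

```python
def initial_heat(topic_vector, article_vectors, sim):
    """Formula (5): topic heat as the sum of article-topic similarities."""
    return sum(sim(d, topic_vector) for d in article_vectors)

def update_heat(topics, topics_with_new_articles, up=1.05, down=0.9, drop_below=0.1):
    """Per batch: multiply the heat of topics that received new articles by Up,
    multiply the others by Down, and drop topics whose heat falls below the
    (assumed) removal threshold."""
    for tid in list(topics):
        topics[tid]["heat"] *= up if tid in topics_with_new_articles else down
        if topics[tid]["heat"] < drop_below:
            del topics[tid]
    return topics
```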
The beneficial effects of the invention are as follows: the invention innovatively applies the LDA algorithm to hot topic discovery and introduces a burstiness concept, so that the latest hot news can be discovered in a timely and effective manner. It also introduces the concept of topic heat decay, so that topic heat can be recorded and tracked in real time, truly reflecting the development and change of hot news, which is of great significance for tracking and displaying hot news.
Description of the drawings
Fig. 1 is a schematic diagram of the hot news detection process based on the topic model;
Fig. 2 is a schematic diagram of the article modeling process;
Fig. 3 is a schematic diagram of the LDA clustering process;
Fig. 4 is a schematic diagram of merging new and old topics;
Fig. 5 is a schematic diagram of the topic heat calculation.
Specific embodiment
The present invention is described in further detail below with reference to the accompanying drawings and a specific embodiment.
As shown in Fig. 1, the hot news detection method based on a topic model proposed by the present invention comprises the following steps:
(1) News streams are crawled in a directed manner by a web crawler; a batch is processed each time N articles arrive; the crawled data are cleaned and the articles are segmented into words to obtain the preprocessed articles.
(2) Building a vector space model: as shown in Fig. 2, after preprocessing an original document can be regarded as a collection of words. If the document is regarded as a vector, each word is one feature dimension; by converting documents into vectors, text data become structured data that a computer can process subsequently, and the similarity problem between two documents is converted into the similarity problem between two vectors. The weight of each dimension of a document vector is computed with the improved B-TFIDF algorithm of formulas (1) and (2) above.
In formula (1), w denotes a word, A denotes the number of new articles containing word w, B the number of new articles not containing word w, C the number of history articles containing word w, and D the number of history articles not containing word w. In formula (2), d_i denotes the i-th new article, N the total number of new articles, tf(d, w) the term frequency of word w in article d, and df(w) the number of articles containing word w. The algorithm takes the burstiness of a word into account, where burstiness means that a word suddenly appears in large quantities within a short period. The weight of every word making up a document is computed by the above algorithm, yielding the article's vector space model D_i = (weight(d_i, w_1), weight(d_i, w_2), weight(d_i, w_3), ..., weight(d_i, w_n)), where n is the total number of words.
(3) Article clustering: after step 2, each text is represented as a vector, and the text vectors are clustered. As shown in Fig. 3, an LDA topic-model clustering algorithm is used here. LDA is a three-layer Bayesian probability model comprising the word, topic and document layers; the generation of an article is regarded as the following process: a topic is selected with a certain probability, and a word is selected from that topic with a certain probability; documents obey a multinomial distribution over topics and topics obey a multinomial distribution over words. LDA cluster analysis yields the "topic-word" probability matrix and the "document-topic" probability matrix; the detailed process is described below.
LDA clustering process: LDA clustering yields the "topic-word" probability matrix phi and the "document-topic" probability matrix theta. From theta, the probabilities relating the m topics to the N articles are obtained: each row i of theta represents an article, each column j represents a topic, and theta_ij is the probability that article i belongs to topic j. A screening threshold thresholdT (preferred value 0.32) is set; if theta_ij > thresholdT, article i is considered to belong to topic j, and in this way the articles corresponding to each topic are selected.
Determination of the number m of LDA clusters: since for N articles a cluster number between N/10 and N/5 best matches reality (for example, when the number of new articles N is 150, a cluster number between 15 and 30 best matches reality), the LDA clustering algorithm is run repeatedly with the cluster number set, in turn, to each value from N/10 to N/5; the inter-topic similarity of each run is then computed, and the cluster number of the run with the minimum inter-topic similarity is selected. The inter-topic similarity is computed from the "topic-word" probability matrix phi produced by the LDA clustering: each row j of phi represents a topic T_j, each column k represents a word w_k, and phi_jk is the probability that topic T_j contains word w_k. A row of phi can thus be regarded as the vector form of topic T_j, T_j = (w_1, w_2, w_3, ... w_k ... w_n), where n is the total number of words. The similarity of every pair of topics is computed and averaged to give the run's inter-topic similarity, and the minimum across the runs determines the final choice. Similarity is computed with the cosine similarity measure of formula (3) above.
In formula (3), T_i and T_j denote two topics, ω_k(T_i) denotes the value of topic T_i in dimension k, and n denotes the total number of words.
(4) Topic keyword extraction: keywords are extracted from the titles of all articles under a topic. The article titles are first segmented into words; stop words, meaningless words and punctuation are filtered out, and the remaining words serve as the topic keywords.
(5) Topic merging: step 3 yields m topics and their corresponding articles. Next the m new topics are merged with the old topics, as shown in Fig. 4: the inter-topic similarity f1 is computed, and if f1 > 0.5 the two topics are considered similar and are merged. The inter-topic similarity f1 is computed as:
f1 = 2 * vectorSim * keywordSim / (vectorSim + keywordSim)    (4)
In formula (4), vectorSim denotes the topic cosine similarity computed with all words contained in the topics as dimensions, and keywordSim denotes the topic cosine similarity computed with the topic keywords as dimensions; the cosine similarity is computed as in formula (3).
(6) Heat calculation: as shown in Fig. 5, step 5 yields the final set of topics. The heat h of each topic is then calculated, topics with high heat are selected, and topics with low heat, i.e. outdated topics, are removed. Based on the characteristic that hot topics have a high news concentration, the heat is computed as:
h_t = Σ sim(d_i, t)    (5)
In formula (5), d_i denotes an article contained in topic T; the heat h_t of topic T equals the sum of the similarities between the articles under the topic and the topic itself, where sim is as in formula (3).
As time goes on, the heat of a topic decays continuously until it falls below a threshold and the topic is removed. Heat decay: in each batch process, if new articles arrive under topic T, the heat h_t of topic T is increased accordingly, h_t = h_t * Up (preferred value 1.05); if no new article is added to topic T, the heat decays, h_t = h_t * Down (preferred value 0.9).

Claims (1)

1. A hot news detection method based on a topic model, characterized in that it comprises the following steps:
(1) News streams are crawled in a directed manner by a web crawler; a batch is processed each time N new articles arrive; the crawled data are cleaned and the articles are segmented into words to obtain the preprocessed articles;
(2) building a vector space model: after preprocessing, an original document can be regarded as a collection of words; if the document is regarded as a vector, each word is one feature dimension; by converting documents into vectors, text data become structured data that a computer can process subsequently, and the similarity problem between two documents is converted into the similarity problem between two vectors; the weight of each dimension of a document vector is computed with an improved B-TFIDF algorithm, whose formulas are as follows:
b(w) = (A + B + C + D) · (A·D - B·C)² / [(A + B) · (C + D) · (A + C) · (B + D)]    (1)
weight(d_i, w) = tf(d_i, w) · log((N + 1) / (df(w) + 0.5)) · b(w) / sqrt( Σ_{w'∈d_i} [ tf(d_i, w') · log((N + 1) / (df(w') + 0.5)) · b(w') ]² )    (2)
In formula (1), w denotes a word, A denotes the number of new articles containing word w, B the number of new articles not containing word w, C the number of history articles containing word w, and D the number of history articles not containing word w; in formula (2), d_i denotes the i-th new article, N the total number of new articles, tf(d, w) the term frequency of word w in article d, and df(w) the number of articles containing word w; the algorithm takes the burstiness of a word into account, where burstiness means that a word suddenly appears in large quantities within a short period; the weight of every word making up a document is computed by the above algorithm, yielding the article's vector space model D_i = (weight(d_i, w_1), weight(d_i, w_2), weight(d_i, w_3), ..., weight(d_i, w_n)), where n is the total number of words;
(3) article clustering: after step 2, each text is represented as a vector, and the text vectors are clustered using an LDA topic-model clustering algorithm, specifically:
LDA clustering process: LDA is a three-layer Bayesian probability model comprising the word, topic and document layers; the generation of an article is regarded as the following process: a topic is selected with a certain probability, and a word is selected from that topic with a certain probability; documents obey a multinomial distribution over topics and topics obey a multinomial distribution over words; LDA clustering yields the "topic-word" probability matrix phi and the "document-topic" probability matrix theta; from theta, the probabilities relating the m topics to the N articles are obtained: each row i of theta represents an article, each column j represents a topic, and theta_ij is the probability that article i belongs to topic j; a screening threshold thresholdT is set, and if theta_ij > thresholdT, article i is considered to belong to topic j, whereby the articles corresponding to each topic are selected;
determination of the number m of LDA clusters: the LDA clustering algorithm is run repeatedly with the cluster number set, in turn, to each value from N/10 to N/5, the inter-topic similarity of each run is computed, and the cluster number of the run with the minimum inter-topic similarity is selected; the inter-topic similarity is computed from the "topic-word" probability matrix phi produced by the LDA clustering: each row j of phi represents a topic T_j, each column k represents a word w_k, and phi_jk is the probability that topic T_j contains word w_k; a row of phi can thus be regarded as the vector form of topic T_j, T_j = (w_1, w_2, w_3, ... w_k ... w_n), where n is the total number of words; the similarity of every pair of topics is computed and averaged to give the run's inter-topic similarity, and the minimum across the runs determines the final choice; similarity is computed with the cosine similarity measure:
sim(T_i, T_j) = Σ_{k=1..n} ω_k(T_i) · ω_k(T_j) / sqrt( ( Σ_{k=1..n} ω_k²(T_i) ) · ( Σ_{k=1..n} ω_k²(T_j) ) )    (3)
In formula (3), T_i and T_j denote two topics, ω_k(T_i) denotes the value of topic T_i in dimension k, and n denotes the total number of words;
(4) topic keyword extraction: keywords are extracted from the titles of all articles under a topic; the article titles are first segmented into words, stop words, meaningless words and punctuation are filtered out, and the remaining words serve as the topic keywords;
(5) topic merging: step 3 yields m topics and their corresponding articles; next the m new topics are merged with the old topics by computing the inter-topic similarity f1, and if f1 > 0.5 the two topics are considered similar and are merged; the inter-topic similarity f1 is computed as:
f1 = 2 * vectorSim * keywordSim / (vectorSim + keywordSim)    (4)
in formula (4), vectorSim denotes the topic cosine similarity computed with all words contained in the topics as dimensions, and keywordSim denotes the topic cosine similarity computed with the topic keywords as dimensions, the cosine similarity being computed as in formula (3);
(6) heat calculation: step 5 yields the final set of topics; the heat h of each topic is then calculated, topics with high heat are selected, and topics with low heat, i.e. outdated topics, are removed; based on the characteristic that hot topics have a high news concentration, the heat is computed as:
h_t = Σ sim(d_i, t)    (5)
in formula (5), d_i denotes an article contained in topic T, and the heat h_t of topic T equals the sum of the similarities between the articles under the topic and the topic itself, sim being as in formula (3);
as time goes on, the heat of a topic decays continuously until it falls below a threshold and the topic is removed; heat decay: in each batch process, if new articles arrive under topic T, the heat h_t of topic T is increased accordingly, h_t = h_t * Up; if no new article is added to topic T, the heat decays, h_t = h_t * Down, where Up > 1 and Down < 1.
CN201611145855.9A 2016-12-13 2016-12-13 A kind of hot news detection method based on topic model Active CN106599181B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611145855.9A CN106599181B (en) 2016-12-13 2016-12-13 A kind of hot news detection method based on topic model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611145855.9A CN106599181B (en) 2016-12-13 2016-12-13 A kind of hot news detection method based on topic model

Publications (2)

Publication Number Publication Date
CN106599181A true CN106599181A (en) 2017-04-26
CN106599181B CN106599181B (en) 2019-06-18

Family

ID=58802054

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611145855.9A Active CN106599181B (en) 2016-12-13 2016-12-13 A kind of hot news detection method based on topic model

Country Status (1)

Country Link
CN (1) CN106599181B (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140019460A1 (en) * 2012-07-12 2014-01-16 Yahoo! Inc. Targeted search suggestions
CN104699814A (en) * 2015-03-24 2015-06-10 清华大学 Searching method and system of hot spot information
CN106156276A (en) * 2016-06-25 2016-11-23 贵州大学 Hot news discovery method based on Pitman Yor process

Cited By (36)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107423337A (en) * 2017-04-27 2017-12-01 天津大学 News topic detection method based on LDA Fusion Models and multi-level clustering
CN107239497A (en) * 2017-05-02 2017-10-10 广东万丈金数信息技术股份有限公司 Hot content searching method and system
CN107239497B (en) * 2017-05-02 2020-11-03 广东万丈金数信息技术股份有限公司 Hot content search method and system
CN107203632A (en) * 2017-06-01 2017-09-26 中国人民解放军国防科学技术大学 Topic Popularity prediction method based on similarity relation and cooccurrence relation
CN107330049A (en) * 2017-06-28 2017-11-07 北京搜狐新媒体信息技术有限公司 A kind of news temperature predictor method and system
CN107330049B (en) * 2017-06-28 2020-05-22 北京搜狐新媒体信息技术有限公司 News popularity estimation method and system
CN107835113A (en) * 2017-07-05 2018-03-23 中山大学 Abnormal user detection method in a kind of social networks based on network mapping
CN107835113B (en) * 2017-07-05 2020-09-08 中山大学 Method for detecting abnormal user in social network based on network mapping
CN107451224A (en) * 2017-07-17 2017-12-08 广州特道信息科技有限公司 A kind of clustering method and system based on big data parallel computation
CN107563725B (en) * 2017-08-25 2021-04-06 浙江网新恒天软件有限公司 Recruitment system for optimizing fussy talent recruitment process
CN107563725A (en) * 2017-08-25 2018-01-09 浙江网新恒天软件有限公司 A kind of recruitment system for optimizing cumbersome personnel recruitment process
CN107656919B (en) * 2017-09-12 2018-10-26 中国软件与技术服务股份有限公司 A kind of optimal L DA Automatic Model Selection methods based on minimum average B configuration similarity between theme
CN107656919A (en) * 2017-09-12 2018-02-02 中国软件与技术服务股份有限公司 A kind of optimal L DA Automatic Model Selection methods based on minimum average B configuration similarity between theme
CN107918644A (en) * 2017-10-31 2018-04-17 北京锐思爱特咨询股份有限公司 News subject under discussion analysis method and implementation system in reputation Governance framework
CN107832418A (en) * 2017-11-08 2018-03-23 郑州云海信息技术有限公司 A kind of much-talked-about topic finds method, system and a kind of much-talked-about topic discovering device
CN107992542A (en) * 2017-11-27 2018-05-04 中山大学 A kind of similar article based on topic model recommends method
CN108153818B (en) * 2017-11-29 2021-08-10 成都东方盛行电子有限责任公司 Big data based clustering method
CN108153818A (en) * 2017-11-29 2018-06-12 成都东方盛行电子有限责任公司 A kind of clustering method based on big data
CN107784127A (en) * 2017-11-30 2018-03-09 杭州数梦工场科技有限公司 A kind of focus localization method and device
CN107862089B (en) * 2017-12-02 2020-03-13 北京工业大学 Label extraction method based on perception data
CN107862089A (en) * 2017-12-02 2018-03-30 北京工业大学 A kind of tag extraction method based on perception data
CN108090157A (en) * 2017-12-12 2018-05-29 百度在线网络技术(北京)有限公司 A kind of hot news method for digging, device and server
CN110888978A (en) * 2018-09-06 2020-03-17 北京京东金融科技控股有限公司 Article clustering method and device, electronic equipment and storage medium
CN110096649B (en) * 2019-05-14 2021-07-30 武汉斗鱼网络科技有限公司 Post extraction method, device, equipment and storage medium
CN110096649A (en) * 2019-05-14 2019-08-06 武汉斗鱼网络科技有限公司 A kind of model extracting method, device, equipment and storage medium
WO2021027116A1 (en) * 2019-08-15 2021-02-18 平安科技(深圳)有限公司 Method and apparatus for discovering text hotspot and computer-readable storage medium
CN110532388A (en) * 2019-08-15 2019-12-03 苏州朗动网络科技有限公司 Method, equipment and the storage medium of text cluster
CN110532388B (en) * 2019-08-15 2022-07-01 企查查科技有限公司 Text clustering method, equipment and storage medium
CN113127611A (en) * 2019-12-31 2021-07-16 北京中关村科金技术有限公司 Method and device for processing question corpus and storage medium
CN111343467A (en) * 2020-02-10 2020-06-26 腾讯科技(深圳)有限公司 Live broadcast data processing method and device, electronic equipment and storage medium
CN111343467B (en) * 2020-02-10 2021-10-26 腾讯科技(深圳)有限公司 Live broadcast data processing method and device, electronic equipment and storage medium
WO2022037446A1 (en) * 2020-08-20 2022-02-24 西南电子技术研究所(中国电子科技集团公司第十研究所) Front-page news prediction and classification method
US11436287B2 (en) 2020-12-07 2022-09-06 International Business Machines Corporation Computerized grouping of news articles by activity and associated phase of focus
CN112612889A (en) * 2020-12-28 2021-04-06 中科院计算技术研究所大数据研究院 Multilingual document classification method and device and storage medium
CN112612889B (en) * 2020-12-28 2021-10-29 中科院计算技术研究所大数据研究院 Multilingual document classification method and device and storage medium
CN113360600A (en) * 2021-06-03 2021-09-07 中国科学院计算机网络信息中心 Method and system for screening enterprise performance prediction indexes based on signal attenuation

Also Published As

Publication number Publication date
CN106599181B (en) 2019-06-18

Similar Documents

Publication Publication Date Title
CN106599181B (en) A kind of hot news detection method based on topic model
CN104216954B (en) The prediction meanss and Forecasting Methodology of accident topic state
US20170185680A1 (en) Chinese website classification method and system based on characteristic analysis of website homepage
CN104572977B (en) A kind of agricultural product quality and safety event online test method
Zhang et al. Multiresolution graph attention networks for relevance matching
CN106250513A (en) A kind of event personalization sorting technique based on event modeling and system
TWI695277B (en) Automatic website data collection method
CN102999615B (en) Based on variety of images mark and the search method of radial basis function neural network
Lee Unsupervised and supervised learning to evaluate event relatedness based on content mining from social-media streams
Yu et al. Question classification based on co-training style semi-supervised learning
Rafea et al. Topic detection approaches in identifying topics and events from arabic corpora
Papadopoulos et al. Image clustering through community detection on hybrid image similarity graphs
CN110222172A (en) A kind of multi-source network public sentiment Topics Crawling method based on improvement hierarchical clustering
CN103761286B (en) A kind of Service Source search method based on user interest
Hossain et al. A study towards Bangla fake news detection using machine learning and deep learning
Yuan et al. Research of deceptive review detection based on target product identification and metapath feature weight calculation
CN106844765B (en) Significant information detection method and device based on convolutional neural network
Konagala et al. Fake news detection using deep learning: supervised fake news detection analysis in social media with semantic similarity method
Huang et al. Tag refinement of micro-videos by learning from multiple data sources
Lin et al. GIF video sentiment detection using semantic sequence
CN107423294A (en) A kind of community image search method and system
Zhao et al. Lsif: A system for large-scale information flow detection based on topic-related semantic similarity measurement
Meng Text clustering and economic analysis of free trade zone governance strategies based on random matrix and subject analysis
Yang et al. Web service clustering method based on word vector and biterm topic model
Vahidnia et al. Document Clustering and Labeling for Research Trend Extraction and Evolution Mapping.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant