CN102662965A - Method and system of automatically discovering hot news theme on the internet

Info

Publication number
CN102662965A
Authority
CN
China
Prior art keywords
news
internet
classification
article
automatically
Prior art date
Application number
CN2012100601339A
Other languages
Chinese (zh)
Inventor
Inventor not disclosed (不公告发明人)
Original Assignee
上海引跑信息科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海引跑信息科技有限公司
Priority to CN2012100601339A
Publication of CN102662965A

Abstract

The invention provides a method of automatically discovering hot news themes (or hot events) on the internet, which can automatically discover various news events on the internet and automatically give a brief explanation of each event. The method comprises the following steps: downloading recent web pages such as news, blogs and microblogs from the internet; extracting the titles and texts from these pages; performing word segmentation on each extracted title and text to obtain a word frequency vector, and combining the vectors of all texts into a word frequency matrix; performing cluster analysis on the word frequency matrix with a clustering algorithm, so that texts with the same theme are grouped together and clusters for the various themes are obtained; and extracting the title of the central text and the keywords of each cluster to explain the cluster, i.e., to explain the news theme.

Description

A method and system for automatically discovering hot news themes on the internet

Technical field

The present invention relates to the field of automatic hot news discovery in internet public opinion analysis.

Background

With the rapid growth of the internet, online media have gained enormous power to steer public opinion and influence audiences. Online opinion hot spots emerge continuously, frequently turn into social hot spots, and can even trigger major public opinion crises. Discovering hot events on the internet by manual effort tends to lag behind: the hot spots are not found promptly, so countermeasures cannot be taken in time, and the course of an event cannot be controlled or properly guided before the situation deteriorates further. For society as a whole, the result can undermine stability and unity; for an enterprise, it can damage the corporate image.

Summary of the invention

The purpose of the present invention is to provide a method of automatically discovering hot news themes on the internet, so that current public opinion hot spots can be found accurately and promptly and countermeasures can be taken at once. This makes it possible to control or properly guide the course of an event and to reach an outcome that complies with laws and regulations and is acceptable to the public, supporting the harmonious development of society. The method replaces the earlier needle-in-a-haystack way of obtaining information.

To achieve the above purpose, the present invention provides a method of automatically discovering hot news themes on the internet. First, recent web pages such as news, blogs and microblogs are downloaded from the internet, and their titles and texts are extracted. Second, word segmentation is performed on each extracted title and text to obtain a word frequency vector, and the vectors of all articles are combined into a word frequency matrix. Third, a clustering algorithm is applied to the word frequency matrix so that articles on the same theme are grouped together, which yields a number of theme clusters. Finally, the title of each cluster's central article and the keywords of each cluster are extracted as the explanation of that cluster (that is, of the news theme). The user can then revise the machine-generated explanations where appropriate. Combining automatic analysis with manual intervention in this way provides functionality that is more efficient and closer to the users' needs.
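The first step, extracting a title and body text from a downloaded page, could be sketched as follows. This is only an illustration under stated assumptions: the patent does not say how extraction is done, so the sketch assumes the title sits in the `<title>` element and the article text in `<p>` elements. The names `TitleTextExtractor` and `extract_title_and_text` are hypothetical, and real news pages usually need more careful boilerplate removal.

```python
from html.parser import HTMLParser


class TitleTextExtractor(HTMLParser):
    """Collects the <title> text and the text inside <p> elements of one page."""

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.body_parts = []
        self._stack = []  # currently open <title>/<p> tags

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "p"):
            self._stack.append(tag)

    def handle_endtag(self, tag):
        # Only pop when the closing tag matches the tracked tag on top.
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        if not self._stack:
            return  # text outside <title>/<p> is ignored (assumed to be page chrome)
        text = data.strip()
        if not text:
            return
        if self._stack[-1] == "title":
            self.title_parts.append(text)
        else:
            self.body_parts.append(text)


def extract_title_and_text(html):
    """Return (title, body_text) for a single HTML page given as a string."""
    parser = TitleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.title_parts), " ".join(parser.body_parts)
```

Calling `extract_title_and_text(page_source)` on each downloaded page gives the (title, text) pairs that the later word segmentation step consumes.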

Description of drawings

Fig. 1 is a schematic diagram of the workflow of the present invention.

Detailed description

The technical scheme of the present invention is described in further detail below with reference to Fig. 1.

Fig. 1 is a block diagram of the modules involved in the method of automatically discovering hot news themes on the internet. It comprises three parts: a data preprocessor, a cluster analyzer and an automatic category interpreter. The data preprocessor has two parts, news content collection and word frequency matrix generation. The news content collection part obtains news web pages from the internet and extracts the text from them. The word frequency matrix generation part produces a word frequency vector for each article, and all the word frequency vectors together form the word frequency matrix. The cluster analyzer, one of the core parts of the method, groups the articles into categories. The automatic category interpreter produces an automatic explanation for each category obtained by clustering.
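The patent does not state how the entries of the word frequency matrix are computed beyond naming TFIDF. As a point of reference only, the common textbook definition is given below; the symbols $n_{t,d}$ (count of word $t$ in article $d$), $N$ (number of articles), $\mathrm{df}(t)$ (number of articles containing $t$) and $V$ (the vocabulary) are notation introduced here, not taken from the source.

```latex
\[
\mathrm{tf}(t,d) = \frac{n_{t,d}}{\sum_{t'} n_{t',d}}, \qquad
\mathrm{idf}(t) = \log\frac{N}{\mathrm{df}(t)}, \qquad
w_{t,d} = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)
\]
\[
X \in \mathbb{R}^{N \times |V|}, \qquad X_{d,t} = w_{t,d}
\]
```

Each row of $X$ is the word frequency vector of one article, and stacking the rows of all $N$ articles gives the word frequency matrix passed to the cluster analyzer.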

The main steps of the web page content analysis are as follows:

1) Collect current web pages on various themes from the internet.

2) Extract the title and text of each news web page.

3) Obtain the word frequency matrix.

A) Segment each article into words, and compute the term frequency and inverse document frequency of each word.

B) Compute the TFIDF value (term frequency times inverse document frequency) of each word.

C) Combine the computed TFIDF values into a word frequency vector (representing one article).

D) Combine all the word frequency vectors into a word frequency matrix.

4) Perform cluster analysis to obtain a number of categories.

5) Explain each category automatically.

A) Compute the central article and obtain its title.

B) Extract the keywords of each category.
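Steps 3) to 5) above can be sketched in a few dozen lines of Python. This is a minimal illustration, not the patent's implementation: the source names neither the clustering algorithm nor the way the central article and keywords are chosen. The sketch assumes k-means with deterministic farthest-first seeding, a smoothed IDF variant, the member closest to the centroid as the central article, and the highest-weighted centroid terms as keywords. All function names are hypothetical, and the input is assumed to be already segmented into word lists (Chinese text would first need a word segmenter).

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """docs: list of token lists. Returns (vocabulary, list of unit-length TFIDF rows)."""
    vocab = sorted({tok for doc in docs for tok in doc})
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed IDF (an assumed variant) keeps every weight positive.
    idf = {t: math.log((1 + len(docs)) / (1 + doc_freq[t])) + 1.0 for t in vocab}
    matrix = []
    for doc in docs:
        counts = Counter(doc)
        row = [counts[t] / max(len(doc), 1) * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        matrix.append([v / norm for v in row])
    return vocab, matrix


def sqdist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(matrix, k, iterations=20):
    """Plain k-means; returns (labels, centroids)."""
    # Farthest-first seeding: start at row 0, then add the row farthest
    # from the centroids chosen so far.
    centroids = [matrix[0]]
    while len(centroids) < k:
        centroids.append(max(matrix, key=lambda r: min(sqdist(r, c) for c in centroids)))
    labels = [0] * len(matrix)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda j: sqdist(row, centroids[j])) for row in matrix]
        for j in range(k):
            members = [row for row, lab in zip(matrix, labels) if lab == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def explain_clusters(titles, vocab, matrix, labels, centroids, top_n=3):
    """Map each cluster id to (title of its central article, top keywords)."""
    explanations = {}
    for j, centroid in enumerate(centroids):
        members = [i for i, lab in enumerate(labels) if lab == j]
        if not members:
            continue
        center = min(members, key=lambda i: sqdist(matrix[i], centroid))
        ranked = sorted(range(len(vocab)), key=lambda t: centroid[t], reverse=True)
        explanations[j] = (titles[center], [vocab[t] for t in ranked[:top_n]])
    return explanations
```

A typical call chain is `vocab, X = tfidf_matrix(docs)`, then `labels, centroids = kmeans(X, k)`, then `explain_clusters(titles, vocab, X, labels, centroids)`. The number of clusters `k` has to be chosen by the caller, a point the patent leaves open.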

Claims (6)

1. A method and system for automatically discovering hot news themes on the internet, characterized in that:
1) web pages such as news, blogs and microblogs are obtained from the internet, and the text and title are extracted from them;
2) TFIDF values are calculated from the extracted text and title to obtain word frequency vectors and a word frequency matrix;
3) cluster analysis is performed on the word frequency matrix;
4) for each category obtained by the cluster analysis, the central article is computed to obtain its title, and the keywords of the category are extracted, to serve as the explanation of that category;
5) the automatically generated category explanations are adjusted manually.
2. The method and system for automatically discovering hot news themes on the internet according to claim 1, characterized in that:
the data come from the internet.
3. The method and system for automatically discovering hot news themes on the internet according to claim 2, characterized in that:
TFIDF values are used to represent the weight of each word in an article, so that each article is quantified as a vector and a group of articles is represented as a matrix.
4. The method and system for automatically discovering hot news themes on the internet according to claim 3, characterized in that:
the articles are classified automatically with a clustering algorithm.
5. The method and system for automatically discovering hot news themes on the internet according to claim 4, characterized in that:
each category is explained automatically by extracting the title of the category's central article and the category's keywords.
6. The method and system for automatically discovering hot news themes on the internet according to claim 5, characterized in that:
although the categories produced by the system are very accurate, the categories and automatic explanations produced by the system can still be adjusted manually.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2012100601339A CN102662965A (en) 2012-03-07 2012-03-07 Method and system of automatically discovering hot news theme on the internet

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2012100601339A CN102662965A (en) 2012-03-07 2012-03-07 Method and system of automatically discovering hot news theme on the internet

Publications (1)

Publication Number Publication Date
CN102662965A (en) 2012-09-12

Family

ID=46772456

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2012100601339A CN102662965A (en) 2012-03-07 2012-03-07 Method and system of automatically discovering hot news theme on the internet

Country Status (1)

Country Link
CN (1) CN102662965A (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101694659A (en) * 2009-10-20 2010-04-14 浙江大学 Individual network news recommending method based on multitheme tracing

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20120912