CN107122481B - Real-time online prediction method for news popularity - Google Patents
- Publication number
- CN107122481B (application number CN201710308998.5A)
- Authority
- CN
- China
- Prior art keywords
- hot
- news
- heat
- vocabulary
- words
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/04—Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/01—Social networking
Abstract
The invention discloses a real-time online prediction method for news popularity, comprising two parts: hot event analysis and modeling, and latest news popularity prediction. In the first part, the hot words and hot word pairs from all events are combined into a heat value table, and the heat values of each hot word and hot word pair across all events are added to give its current heat value; the table is updated continuously. In the second part, the words and word combinations extracted from a news item are scored against this heat value table: the heat value of each word and word combination is looked up in the table, and the values for identical words and word combinations are accumulated to give the heat value of each word and word combination in the current news item. The heat values of all words and word combinations in the news item are then added to give the news heat, which is the predicted news popularity. The method can analyze hot topics comprehensively and update hot news in a timely manner.
Description
Technical Field
The invention relates to the field of news information, in particular to a real-time online prediction method for news popularity.
Background
With the rapid development of internet technology, online public opinion increasingly affects social stability, and monitoring it is an important part of how governments maintain that stability. As one link in public opinion monitoring, the prediction of hot news is particularly critical. With its distinctive propagation and real-time interaction characteristics, the microblog has changed how traditional news information spreads. In particular, combining microblogs with mobile terminals allows microblog information to be forwarded or commented on more quickly, and the large volume of user comments and exchanges on a microblog platform can quickly aggregate into viewpoints, forming a public opinion trend. The openness, real-time nature, interactivity, mass participation, and ease of detection of microblogs form the basis of hot news prediction. The popularity of a news item is judged by comprehensively analyzing the volume of topics about it on the microblog platform.
Traditionally, hot public opinion topics are judged only by data such as the number of clicks, forwards, and comments. Such prediction techniques cannot comprehensively analyze the characteristics of hot topics, and they do not extract hot news in a timely manner.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a news popularity real-time online prediction method which can comprehensively analyze hot topics and update hot news in time.
The purpose of the invention is realized by the following technical scheme:
A real-time online prediction method for news popularity comprises the following two parts:
the hot event analysis and modeling part comprises the following steps:
S01: manually determining keywords for hot events that have occurred, and crawling information related to those events from the network based on the manually determined keywords;
S02: evaluating the event heat value, that is, scoring the heat of the event by the total amount of information crawled from the network, where a larger total gives a higher score and the score has no upper limit;
S03: performing hot word analysis on the crawled information and identifying the top 20% of words by heat;
S04: modeling the known hot events, and analyzing the contribution rate of each term to the event heat and the joint contribution rate of term combinations to the event heat;
S05: calculating the heat value of each hot word from the event heat value, the contribution rate of the hot word to the event, and the contribution rate of hot word combinations to the event, using the formula: hot word heat value = event heat value × hot word frequency / sum of the frequencies of all hot words;
S06: combining the hot words and hot word pairs from all events into a heat value table of hot words and hot word pairs, and adding the heat values of each hot word across all events to obtain the current heat values of the hot words and hot word pairs;
S07: continuously updating the heat value table of hot words and hot word pairs;
the latest news popularity prediction part comprises the following steps:
S11: collecting information from various sources in real time, including but not limited to the content of news, microblogs, forums, and post bars;
S12: performing word segmentation on the collected information and removing stop words to obtain the words related to the news;
S13: scoring the heat of the obtained words and word combinations against the heat value table of hot words and hot word pairs produced in the hot event analysis and modeling part, that is, looking up the heat values of the words and word combinations in the heat value table, and accumulating the values for identical words and word combinations to give the heat value of each word and word combination in the current news item;
S14: adding the heat values of all words and word combinations in the news item to obtain the news heat, which is the predicted news popularity.
Further, the network in step S01 comprises content such as articles, microblog posts, and WeChat posts that contain the keywords, drawn from different channels such as news websites, microblogs, WeChat, forums, post bars, and government websites.
Further, the event heat value in step S02 is calculated as Hotvalue = sum[count × k], where count is the total number of public opinion items and k is a weight with a value from 1 to 100.
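The formula can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sum is taken over information sources, each source contributes its item count multiplied by its own weight k, and a source with no configured weight falls back to k = 1. The function name `event_heat`, the source names, and the sample numbers are hypothetical and do not come from the patent.

```python
from typing import Mapping


def event_heat(counts: Mapping[str, int], weights: Mapping[str, int]) -> int:
    """Hotvalue = sum(count * k), summed over information sources.

    counts:  number of crawled items per source (hypothetical source names)
    weights: weight k per source, expected to lie in the range 1..100
    """
    total = 0
    for source, count in counts.items():
        k = weights.get(source, 1)  # assumption: unlisted sources get the minimum weight
        if not 1 <= k <= 100:
            raise ValueError(f"weight for {source!r} must be in 1..100, got {k}")
        total += count * k
    return total  # no cap is applied, since the score has no upper limit


# Illustrative numbers only, not taken from the patent.
counts = {"news_site": 12, "microblog": 340}
weights = {"news_site": 100, "microblog": 1}
print(event_heat(counts, weights))  # 12*100 + 340*1 = 1540
```

Keeping the weights in a separate mapping makes it easy to adjust them per service scenario without touching the crawled counts.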
Further, the hot word analysis in step S03 comprises the following steps: first removing stop words, then scoring the words by frequency of occurrence, and identifying the top 20% of words by heat according to the score.
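A minimal sketch of this hot word selection follows. It assumes the crawled documents have already been segmented into word lists, that the "top 20%" is taken over distinct words, and that at least one word is always kept; the patent does not specify rounding or tie-breaking, so those choices are assumptions. The name `top_hot_words` and the sample data are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def top_hot_words(
    documents: Iterable[List[str]],
    stop_words: Set[str],
    percent: int = 20,
) -> List[Tuple[str, int]]:
    """Return the top `percent` of distinct words, ranked by total frequency."""
    freq: Counter = Counter()
    for tokens in documents:
        # Remove stop words first, then count occurrences across all content.
        freq.update(t for t in tokens if t not in stop_words)
    if not freq:
        return []
    # Assumption: integer arithmetic, rounded down, but never fewer than one word.
    keep = max(1, len(freq) * percent // 100)
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ranked = sorted(freq.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:keep]


docs = [
    ["football", "match", "the", "football", "goal"],
    ["football", "coach", "the", "team", "goal"],
]
print(top_hot_words(docs, {"the"}))  # [('football', 3)]
```

With five distinct words after stop word removal, 20% keeps one word, so only the most frequent word is returned.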
The invention has the following beneficial effects: current hot events are analyzed and each is scored; the events are sorted by heat to form a hot event list; the collected hot words are then scored against that list, and the corresponding heat values are obtained by calculation. Current hot events are thereby analyzed comprehensively and in real time, and news hot spots are updated promptly.
Detailed Description
A real-time online prediction method for news popularity comprises the following two parts:
the hot event analysis and modeling part comprises the following steps:
S01: manually determining keywords for hot events that have occurred, and crawling information related to those events from the network based on the manually determined keywords. The information comprises content such as articles, microblog posts, and WeChat posts that contain the keywords, drawn from different channels such as news websites, microblogs, WeChat, forums, post bars, and government websites.
S02: evaluating the event heat value, that is, scoring the heat of the event by the total amount of information crawled from the network, where a larger total gives a higher score and the score has no upper limit. Different information sources can be manually assigned weights: the more important the source, the greater its weight in the score. The weights can be adjusted continually to suit the service scenario. For example, for news events, information from websites with high public credibility, such as People's Network and Newcastle Network, can be given a higher weight, while for entertainment events, microblog posts from verified celebrity accounts ("big V" accounts) can be given a higher value.
S03: performing hot word analysis on the crawled information and identifying the top 20% of words by heat. The hot word analysis first removes stop words and then scores the words by frequency of occurrence, where the word frequency is the number of times a word appears in all the content; the words with the highest heat are identified from the scores. For example, if the word "football" appears 10 times in one sports news article, its counts across all news items under that topic are summed; if the total places "football" in the top 20% of words for the topic, "football" is one of the hot words of that topic.
S04: modeling the known hot events, and analyzing the contribution rate of each term to the event heat and the joint contribution rate of term combinations to the event heat;
S05: calculating the heat value of each hot word from the event heat value, the contribution rate of the hot word to the event, and the contribution rate of hot word combinations to the event, using the formula: hot word heat value = event heat value × hot word frequency / sum of the frequencies of all hot words;
S06: combining the hot words and hot word pairs from all events into a heat value table of hot words and hot word pairs, and adding the heat values of each hot word across all events to obtain the current heat values of the hot words and hot word pairs;
S07: continuously updating the heat value table of hot words and hot word pairs;
Further, the event heat value in step S02 is calculated as Hotvalue = sum[count × k], where count is the total number of public opinion items and k is a weight with a value from 1 to 100. For example, k is 100 for news from high-weight websites such as People's Network and Newcastle Network, and 1 for news from self-media channels such as microblogs.
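Steps S05 and S06 above can be sketched as follows. The S05 function applies the stated formula (event heat value × hot word frequency / sum of the frequencies of all hot words) to one event, and the S06 function adds the per-event values into one heat value table. The patent gives no separate formula for hot word pairs, so treating a pair as just another table key (for example a tuple of two words with its own frequency) is an assumption. The names `hot_word_heat` and `merge_heat_tables` and all numbers are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Mapping


def hot_word_heat(event_heat: float, freqs: Mapping[Hashable, int]) -> Dict[Hashable, float]:
    """S05: heat(term) = event heat * term frequency / sum of all term frequencies."""
    total = sum(freqs.values())
    if total == 0:
        return {}
    return {term: event_heat * freq / total for term, freq in freqs.items()}


def merge_heat_tables(tables: Iterable[Mapping[Hashable, float]]) -> Dict[Hashable, float]:
    """S06: add the heat values of each term across all events."""
    merged: Dict[Hashable, float] = defaultdict(float)
    for table in tables:
        for term, heat in table.items():
            merged[term] += heat
    return dict(merged)


# Illustrative numbers only, not taken from the patent.
event_a = hot_word_heat(1000, {"football": 30, "goal": 10, "coach": 10})
event_b = hot_word_heat(500, {"football": 5, "transfer": 20})
heat_table = merge_heat_tables([event_a, event_b])
print(heat_table["football"])  # 600.0 + 100.0 = 700.0
```

One simple way to approximate the continuous update of S07 would be to recompute the per-event tables as new information is crawled and merge them again; the patent does not describe the update mechanism in more detail.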
The latest news popularity prediction part comprises the following steps:
S11: collecting information from various sources in real time, including but not limited to the content of news, microblogs, forums, and post bars;
S12: performing word segmentation on the collected information and removing stop words to obtain the words related to the news;
S13: scoring the heat of the obtained words and word combinations against the heat value table of hot words and hot word pairs produced in the hot event analysis and modeling part, where the calculation formula is Hotvalue = sum[count × k], count is the total number of public opinion items, and k is a weight with a value from 1 to 100; that is, the heat values of the words and word combinations are looked up in the heat value table, and the values for identical words and word combinations are accumulated to give the heat value of each word and word combination in the current news item;
S14: adding the heat values of all words and word combinations in the news item to obtain the news heat, which is the predicted news popularity.
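The lookup and accumulation in S13 and the final sum in S14 can be sketched as below. Several points are interpretations rather than statements from the patent: the news item is assumed to be already segmented into tokens, a repeated word is assumed to contribute its table value once per occurrence, and a "word combination" is assumed to be an unordered pair of distinct words that co-occur in the item, counted once and stored in the table as an alphabetically sorted tuple. The names `score_terms` and `news_heat` and the sample table are hypothetical.

```python
from itertools import combinations
from typing import Dict, Hashable, List, Mapping, Set


def score_terms(
    tokens: List[str],
    stop_words: Set[str],
    heat_table: Mapping[Hashable, float],
) -> Dict[Hashable, float]:
    """S13: look up words and word pairs in the heat value table and accumulate."""
    words = [t for t in tokens if t not in stop_words]
    per_term: Dict[Hashable, float] = {}
    # Assumption: each occurrence of a hot word adds its table value again.
    for word in words:
        if word in heat_table:
            per_term[word] = per_term.get(word, 0.0) + heat_table[word]
    # Assumption: pairs are unordered, keyed as sorted tuples, and counted once.
    for pair in combinations(sorted(set(words)), 2):
        if pair in heat_table:
            per_term[pair] = heat_table[pair]
    return per_term


def news_heat(tokens: List[str], stop_words: Set[str], heat_table: Mapping[Hashable, float]) -> float:
    """S14: the predicted popularity is the sum of all term heat values."""
    return sum(score_terms(tokens, stop_words, heat_table).values())


# Illustrative table and news item, not taken from the patent.
table = {"football": 700.0, "transfer": 400.0, ("football", "transfer"): 150.0}
tokens = ["the", "football", "club", "confirmed", "the", "transfer", "football"]
print(news_heat(tokens, {"the"}, table))  # 2*700 + 400 + 150 = 1950.0
```

Words that are absent from the heat value table, such as "club" here, contribute nothing to the score.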
The foregoing illustrates the preferred embodiments of the invention. The invention is not limited to the precise form disclosed herein; various other combinations, modifications, and environments may be used within the scope of the concept disclosed herein, whether as described above or as apparent to those skilled in the relevant art. Modifications and variations may be made by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (4)
1. A real-time online prediction method for news popularity, characterized by comprising the following two parts:
the hot event analysis and modeling part comprises the following steps:
S01: manually determining keywords for hot events that have occurred, and crawling information related to those events from the network based on the manually determined keywords;
S02: evaluating the event heat value, that is, scoring the heat of the event by the total amount of information crawled from the network, where a larger total gives a higher score and the score has no upper limit;
S03: performing hot word analysis on the crawled information and identifying the top 20% of words by heat;
S04: modeling the known hot events, and analyzing the contribution rate of each term to the event heat and the joint contribution rate of term combinations to the event heat;
S05: calculating the heat value of each hot word from the event heat value, the contribution rate of the hot word to the event, and the contribution rate of hot word combinations to the event, using the formula: hot word heat value = event heat value × hot word frequency / sum of the frequencies of all hot words;
S06: combining the hot words and hot word pairs from all events into a heat value table of hot words and hot word pairs, and adding the heat values of each hot word across all events to obtain the current heat values of the hot words and hot word pairs;
S07: continuously updating the heat value table of hot words and hot word pairs;
the latest news popularity prediction part comprises the following steps:
S11: collecting information from various sources in real time, including but not limited to the content of news, microblogs, forums, and post bars;
S12: performing word segmentation on the collected information and removing stop words to obtain the words and word combinations related to the information content;
S13: scoring the heat of the obtained words and word combinations against the heat value table of hot words and hot word pairs produced in the hot event analysis and modeling part, that is, looking up the heat values of the words and word combinations in the heat value table, and accumulating the values for identical words and word combinations to give the heat value of each word and word combination in the current news item;
S14: adding the heat values of all words and word combinations in the news item to obtain the news heat, which is the predicted news popularity.
2. The real-time online prediction method for news popularity according to claim 1, wherein the network in step S01 comprises articles, microblog posts, and WeChat content containing the keywords, from the following channels: news websites, microblogs, WeChat, forums, post bars, and government websites.
3. The real-time online prediction method for news popularity according to claim 1, wherein the event heat value in step S02 is calculated as Hotvalue = sum[count × k], where count is the total number of public opinion items and k is a weight with a value from 1 to 100.
4. The real-time online prediction method for news popularity according to claim 1, wherein the hot word analysis in step S03 comprises: first removing stop words, then scoring the words by frequency of occurrence, where the frequency of occurrence is the number of times a word appears in all the content, and identifying the top 20% of words by heat according to the score.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710308998.5A CN107122481B (en) | 2017-05-04 | 2017-05-04 | Real-time online prediction method for news popularity |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710308998.5A CN107122481B (en) | 2017-05-04 | 2017-05-04 | Real-time online prediction method for news popularity |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107122481A (en) | 2017-09-01 |
CN107122481B (en) | 2020-06-30 |
Family
ID=59726634
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710308998.5A Active CN107122481B (en) | 2017-05-04 | 2017-05-04 | Real-time online prediction method for news popularity |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107122481B (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110750682B (en) * | 2018-07-06 | 2022-08-16 | 武汉斗鱼网络科技有限公司 | Title hot word automatic metering method, storage medium, electronic equipment and system |
CN109344316B (en) * | 2018-08-14 | 2022-04-29 | 阿里巴巴(中国)有限公司 | News popularity calculation method and device |
CN109376231A (en) * | 2018-09-29 | 2019-02-22 | 杭州凡闻科技有限公司 | A kind of media hotspot tracking and system |
CN109885656B (en) * | 2019-02-18 | 2021-06-29 | 国家计算机网络与信息安全管理中心 | Microblog forwarding prediction method and device based on quantification heat degree |
CN110457594B (en) * | 2019-08-01 | 2021-06-01 | 深圳市顶尖传诚科技有限公司 | Big data-based public opinion hotspot prediction method |
CN112597280A (en) * | 2020-12-28 | 2021-04-02 | 上海朝阳永续信息技术股份有限公司 | Method for automatically discovering hot keywords and hot news |
CN113535956A (en) * | 2021-07-26 | 2021-10-22 | 北京清博智能科技有限公司 | News hotspot prediction method based on medium contribution degree |
CN114938477B (en) * | 2022-06-23 | 2024-05-03 | 阿里巴巴(中国)有限公司 | Video topic determination method, device and equipment |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101923544A (en) * | 2009-06-15 | 2010-12-22 | 北京百分通联传媒技术有限公司 | Method for monitoring and displaying Internet hot spots |
CN102945290A (en) * | 2012-12-03 | 2013-02-27 | 北京奇虎科技有限公司 | Hot microblog topic digging device and method |
CN104035960A (en) * | 2014-05-08 | 2014-09-10 | 东莞市巨细信息科技有限公司 | Internet information hotspot predicting method |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8429170B2 (en) * | 2010-02-05 | 2013-04-23 | Yahoo! Inc. | System and method for discovering story trends in real time from user generated content |
Also Published As
Publication number | Publication date |
---|---|
CN107122481A (en) | 2017-09-01 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||