CN111666749A - Hot article identification method - Google Patents

Publication number
CN111666749A
Authority
CN
China
Prior art keywords
article
hot
word
storage
articles
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010502429.6A
Other languages
Chinese (zh)
Other versions
CN111666749B (en)
Inventor
姚洲鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Fanews Technology Co ltd
Original Assignee
Hangzhou Fanews Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2020-06-03
Filing date
2020-06-03
Publication date
2020-09-15

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G06F40/216: Parsing using statistical methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis

Abstract

The invention discloses a hot article identification method comprising the following steps: acquiring a plurality of hot articles to form a hot article library; extracting the word segments of each hot article in the hot article library and counting the word frequency of each word segment to form a hot word library; extracting the word segments of an article to be stored and counting the word frequency of each word segment; calculating the heat value of the article to be stored; and judging, according to that heat value, whether the article to be stored is a hot article. The advantage of the method is that it extracts a hot word library from existing hot articles, calculates the heat value of a newly stored article against that library, and quickly judges from the heat value whether the newly stored article is a hot article.

Description

Hot article identification method
Technical Field
The invention relates to a hot article identification method.
Background
With the development of the internet industry, news workers need to find and identify hot articles promptly so that they can learn from them which trends currently hold the public's attention. At present, news workers generally identify hot articles from the click-ranking lists of major websites. However, this approach depends on each website's statistics of user click data and therefore lags in time. Because the hot news in articles ranked by the clicks of a large number of users has already been viewed by most users, hot news extracted from such articles has little value. A method is therefore needed that identifies hot articles quickly without relying on the click data of the major websites.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention provides a hot article identification method that solves the problems described above.
To achieve this object, the invention adopts the following technical solution:
A hot article identification method comprises the following steps:
acquiring a plurality of hot articles to form a hot article library;
extracting the word segments of each hot article in the hot article library and counting the word frequency of each word segment to form a hot word library;
extracting the word segments of the article to be stored and counting the word frequency of each word segment;
calculating the heat value of the article to be stored;
and judging whether the article to be stored is a hot article according to its heat value.
Further, the specific method for calculating the heat value of the article to be stored comprises the following steps:
calculating the heat value of each word segment of the article to be stored by the following formula,
score=(subsetFreq/subsetSize-superFreq/superSize)*((subsetFreq/subsetSize)/(superFreq/superSize))*natureBoost*fieldBoost,
wherein score represents the heat value of the word segment, subsetFreq represents the word frequency of that word segment in the article to be stored, subsetSize represents the sum of the word frequencies of all word segments of the article to be stored, superFreq represents the word frequency of that word segment in the hot word library, superSize represents the sum of the word frequencies of all word segments in the hot word library, natureBoost represents the part-of-speech weight of the word segment, and fieldBoost represents the field weight of the word segment;
and averaging the calculated heat values of the word segments to obtain the heat value of the article to be stored.
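For readability, the formula can be restated in conventional notation. The shorthand symbols p and q are introduced here only for illustration and do not appear in the source.

```latex
\[
p = \frac{\mathrm{subsetFreq}}{\mathrm{subsetSize}}, \qquad
q = \frac{\mathrm{superFreq}}{\mathrm{superSize}}, \qquad
\mathrm{score} = (p - q)\cdot\frac{p}{q}\cdot\mathrm{natureBoost}\cdot\mathrm{fieldBoost}
\]
```

Read this way, p is the word segment's share of the article and q is its share of the hot word library. The expression is undefined when q = 0; the source does not state how that case is handled.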
Further, a subset of word segments with higher word frequency is selected from the word segments of the article to be stored according to their word frequency;
and when the heat value of the article to be stored is calculated, only the selected word segments are used.
Further, the 100 word segments with the highest word frequency are selected from the word segments of the article to be stored.
Further, the specific method for obtaining natureBoost is as follows:
averaging the part-of-speech weights over the parts of speech with which the word segment occurs in the article to be stored.
Further, the specific method for obtaining fieldBoost is as follows:
averaging the field weights over the fields in which the word segment occurs in the article to be stored.
Further, the specific method for judging whether the article to be stored is a hot article according to its heat value comprises the following steps:
when the heat value of the article to be stored is greater than a preset threshold, judging that the article to be stored is a hot article.
Further, the hot article identification method further comprises the following steps:
when the heat value of the article to be stored is greater than the preset threshold, adding the article to the hot article library to update the hot article library;
and extracting the word segments of each hot article in the updated hot article library and counting the word frequency of each word segment to update the hot word library.
Further, the specific method for acquiring a plurality of hot articles to form a hot article library comprises the following steps:
acquiring hot articles within a first preset time from the network to form the hot article library.
Further, hot articles within the first preset time are acquired from the network again every second preset time to form a new hot article library.
The advantage of the method is that it extracts a hot word library from existing hot articles, calculates the heat value of a newly stored article against that library, and quickly judges from the heat value whether the newly stored article is a hot article.
A further advantage is that, once a newly stored article is recognized as a hot article, it is added to the hot article library, and a new hot word library is built from the updated hot article library. The hot word library is thereby kept more complete, which improves the efficiency and accuracy of hot article identification.
Drawings
FIG. 1 is a flow chart of a hot article identification method of the present invention.
Detailed Description
The invention is described in detail below with reference to the figures and the embodiments.
Fig. 1 shows a hot article identification method according to the present invention, which includes the following steps:
S1, acquiring a plurality of hot articles to form a hot article library.
S2, extracting the word segments of each hot article in the hot article library and counting the word frequency of each word segment to form a hot word library.
S3, extracting the word segments of the article to be stored and counting the word frequency of each word segment.
S4, calculating the heat value of the article to be stored.
S5, judging whether the article to be stored is a hot article according to its heat value.
Through these steps, articles that are currently hot are obtained first, and a hot word library is extracted from them. The heat value of a newly stored article is then calculated against the hot word library, and whether the newly stored article is a hot article is judged from the heat value. The steps are described in detail below.
For S1, a plurality of hot articles are acquired to form a hot article library.
Specifically, hot articles within a first preset time are acquired from the network to form the hot article library. In the invention, hot articles are obtained from online media such as Sina, NetEase, and Toutiao by data acquisition software. Articles with a high click count may be taken directly from these websites as hot articles; preferably, the highest-ranked articles are taken from each website's hot article ranking list. Hot articles are time-sensitive: an article that had a high click count a year ago is no longer hot. To avoid capturing articles from long ago, a time limit is set during data acquisition, and only hot articles within the first preset time are acquired to form the hot article library. In the invention, the first preset time is set to one month. It is understood that the first preset time can be set freely as needed.
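The time limit in S1 can be sketched as a simple date filter. This is a minimal illustration, not the patent's implementation: the record shape `(title, body, published_at)`, the function name `within_window`, and the approximation of "one month" as 30 days are all assumptions made for the example.

```python
from datetime import datetime, timedelta

# Hypothetical record shape: (title, body, published_at). The source does not
# specify how collected articles are represented.
def within_window(articles, now, window=timedelta(days=30)):
    """Keep only articles published within `window` before `now`.

    The 30-day default approximates the "one month" first preset time.
    """
    cutoff = now - window
    return [a for a in articles if cutoff <= a[2] <= now]

now = datetime(2020, 6, 3)
collected = [
    ("recent", "body text", datetime(2020, 5, 20)),
    ("stale", "body text", datetime(2019, 6, 1)),
]
hot_article_library = within_window(collected, now)
```

Only the article published on 2020-05-20 falls inside the window here; the one from 2019 is dropped.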
As a preferred embodiment, hot articles within the first preset time are acquired from the network again at intervals of a second preset time to form a new hot article library.
It can be understood that hot articles are time-sensitive, and only if the hot article library is refreshed regularly can its articles accurately reflect the current hot topics.
For S2, the word segments of each hot article in the hot article library are extracted and the word frequency of each word segment is counted to form the hot word library.
In step S1, the articles that are hot at that time are acquired to form the hot article library; in step S2, a hot word library is extracted from it. Specifically, each hot article in the library is processed by methods such as semantic analysis to obtain a plurality of word segments, and the total number of times each word segment appears across the hot articles is counted as its word frequency. The word segments together with their word frequencies form the hot word library.
It is understood that each time the hot articles are acquired again in step S1, the hot word library is updated as well.
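Step S2 amounts to counting word segments across all hot articles. The sketch below shows that counting under a loud assumption: `segment` is a placeholder that lowercases and splits on whitespace, whereas the source refers to semantic-analysis-based word segmentation of Chinese text. The names `segment` and `build_hot_word_library` are hypothetical.

```python
from collections import Counter

def segment(text):
    # Placeholder segmenter (assumption): lowercase whitespace split.
    # A real system would use a proper word-segmentation tool here.
    return text.lower().split()

def build_hot_word_library(hot_articles):
    """Map each word segment to its total count across all hot articles."""
    library = Counter()
    for text in hot_articles:
        library.update(segment(text))
    return library

hot_articles = ["flood warning issued", "flood relief begins"]
hot_words = build_hot_word_library(hot_articles)
```

With this representation, the sum of all counts in `hot_words` is the quantity the formula calls superSize.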
For S3, the word segments of the article to be stored are extracted and the word frequency of each word segment is counted.
When an article to be stored is collected, it is processed by methods such as semantic analysis to obtain its word segments, and the number of times each word segment appears in the article is counted as that word segment's word frequency in the article to be stored.
For S4, the heat value of the article to be stored is calculated.
In step S4, the heat value of the article to be stored is calculated from the data counted in the previous steps. The specific method is as follows: the heat value of each word segment of the article to be stored is calculated by the following formula,
score=(subsetFreq/subsetSize-superFreq/superSize)*((subsetFreq/subsetSize)/(superFreq/superSize))*natureBoost*fieldBoost,
wherein score represents the heat value of a word segment. subsetFreq represents the word frequency of that word segment in the article to be stored. subsetSize represents the sum of the word frequencies of all word segments of the article to be stored, obtained by adding them together. superFreq represents the word frequency of that word segment in the hot word library, obtained by looking the word segment up in the hot word library. superSize represents the sum of the word frequencies of all word segments in the hot word library, obtained by adding them together. natureBoost represents the part-of-speech weight of the word segment. fieldBoost represents the field weight of the word segment. The calculated heat values of the word segments are averaged to obtain the heat value of the article to be stored.
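The formula and the averaging step can be sketched in Python as follows. The function names, the dictionary-based inputs, and the default boost of 1.0 are assumptions for illustration. The source does not say what to do when a word segment is absent from the hot word library (superFreq = 0, which would divide by zero); this sketch assigns such a segment a score of 0.0, which is a choice made here, not something the patent states.

```python
def segment_score(subset_freq, subset_size, super_freq, super_size,
                  nature_boost=1.0, field_boost=1.0):
    """Per-segment heat value, following the formula in the text."""
    if super_freq == 0 or super_size == 0 or subset_size == 0:
        # Assumption: the source does not cover segments missing from the
        # hot word library; this sketch scores them 0.0.
        return 0.0
    p = subset_freq / subset_size   # segment's share of the article
    q = super_freq / super_size     # segment's share of the hot word library
    return (p - q) * (p / q) * nature_boost * field_boost

def article_heat(article_freqs, hot_words, boosts=None):
    """Average the per-segment scores over the article's word segments.

    `boosts` optionally maps a segment to (natureBoost, fieldBoost).
    """
    boosts = boosts or {}
    subset_size = sum(article_freqs.values())
    super_size = sum(hot_words.values())
    scores = []
    for word, freq in article_freqs.items():
        nature, field = boosts.get(word, (1.0, 1.0))
        scores.append(segment_score(freq, subset_size,
                                    hot_words.get(word, 0), super_size,
                                    nature, field))
    return sum(scores) / len(scores) if scores else 0.0
```

For instance, a segment with subsetFreq = 2, subsetSize = 10, superFreq = 5, and superSize = 100 has p = 0.2 and q = 0.05, so the unweighted score is (0.2 - 0.05) * (0.2 / 0.05) = 0.6.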
It can be understood that the heat value need not be calculated for every word segment of the article to be stored. As a preferred embodiment, a subset of word segments with higher word frequency can be selected from the word segments of the article to be stored according to their word frequency, and only the selected word segments are used when the heat value of the article is calculated. In the invention, the 100 word segments with the highest word frequency are selected.
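Selecting the most frequent word segments is a one-liner with the standard library. The function name `top_segments` is hypothetical, and the sketch takes no position on whether subsetSize should then be summed over all segments or only the selected ones, since the source does not address that.

```python
from collections import Counter

def top_segments(article_freqs, k=100):
    """Return the k most frequent word segments with their counts."""
    return dict(Counter(article_freqs).most_common(k))

freqs = {"flood": 3, "the": 1, "dam": 2}
selected = top_segments(freqs, k=2)
```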
natureBoost represents the part-of-speech weight of a word segment of the article to be stored. It is obtained by averaging the part-of-speech weights over the parts of speech with which the word segment occurs in the article.
It can be understood that word segments contribute differently to the heat value depending on their part of speech. Generally, the part-of-speech weight of a noun is 0.85 or more and 0.95 or less, that of a verb is 0.65 or more and 0.85 or less, that of an adjective is 0.5 or more and 0.7 or less, and that of an adverb is 0.35 or more and 0.5 or less.
In this embodiment, the part-of-speech weight is 0.9 for a noun, 0.8 for a verb, 0.6 for an adjective, and 0.4 for an adverb. When a word segment is a noun, natureBoost is 0.9. When a word segment can be either a noun or a verb, semantic analysis determines that it appears m times as a noun and n times as a verb in the article to be stored, and natureBoost = (0.9m + 0.8n)/(m + n); other cases follow by analogy. In other words, natureBoost is the average of the part-of-speech weights over the word segment's occurrences in the article.
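The natureBoost average can be written as an occurrence-weighted mean. The weights are the embodiment's values; the function name, the `pos_counts` input shape, and the English part-of-speech labels are assumptions for illustration, and the part-of-speech tagging itself is taken as given.

```python
# Part-of-speech weights from the embodiment described in the text.
POS_WEIGHTS = {"noun": 0.9, "verb": 0.8, "adjective": 0.6, "adverb": 0.4}

def nature_boost(pos_counts, weights=POS_WEIGHTS):
    """Occurrence-weighted mean of part-of-speech weights for one segment.

    `pos_counts` maps a part of speech to how often the segment occurs
    with that part of speech in the article, e.g. {"noun": m, "verb": n}.
    """
    total = sum(pos_counts.values())
    if total == 0:
        raise ValueError("segment has no occurrences")
    return sum(weights[pos] * n for pos, n in pos_counts.items()) / total
```

With m = 3 noun occurrences and n = 1 verb occurrence, this gives (0.9 * 3 + 0.8 * 1) / 4 = 0.875.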
fieldBoost represents the field weight of a word segment of the article to be stored. It is obtained by averaging the field weights over the fields in which the word segment occurs in the article.
It will be appreciated that the same word segment contributes differently to the heat value depending on whether it occurs in the title or the body of an article. In general, when a word segment occurs in the title of an article, the field weight is 0.85 or more and less than 0.95, and when it occurs in the body of an article, the field weight is 0.6 or more and 0.8 or less.
In this embodiment, the field weight is 0.9 when a word segment appears in the title of an article and 0.7 when it appears in the body. As with natureBoost, if analysis shows that the word segment appears a times in the title and b times in the body of the article to be stored, then fieldBoost = (0.9a + 0.7b)/(a + b).
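A corresponding sketch for fieldBoost counts occurrences in the title and the body directly. The weights are the embodiment's values; the function name and the use of pre-segmented lists for the title and body are assumptions made for the example.

```python
# Field weights from the embodiment described in the text.
FIELD_WEIGHTS = {"title": 0.9, "body": 0.7}

def field_boost(word, title_segments, body_segments, weights=FIELD_WEIGHTS):
    """Occurrence-weighted mean of field weights for one word segment."""
    a = title_segments.count(word)  # occurrences in the title
    b = body_segments.count(word)   # occurrences in the body
    if a + b == 0:
        raise ValueError("segment does not occur in the article")
    return (weights["title"] * a + weights["body"] * b) / (a + b)
```

A segment that appears once in the title and twice in the body gets (0.9 * 1 + 0.7 * 2) / 3, roughly 0.767.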
For S5, whether the article to be stored is a hot article is judged according to its heat value.
Specifically, when the heat value of the article to be stored is greater than a preset threshold, the article is judged to be a hot article.
As a preferred embodiment, the hot article identification method further comprises the following steps:
when the heat value of the article to be stored is greater than the preset threshold, the article is added to the hot article library to update the hot article library. The word segments of each hot article in the updated hot article library are then extracted and their word frequencies counted to update the hot word library.
It can be understood that when the article to be stored is judged to be a hot article, it is added to the hot article library, and the updated hot article library is processed to obtain a new hot word library. The hot word library is thereby kept more complete, which improves the efficiency and accuracy of hot article identification.
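The threshold test and the feedback update can be sketched together. The source describes re-extracting word frequencies from the whole updated hot article library; because word counts are additive, this sketch instead adds the new article's counts to the existing library, which yields the same totals. The function name `ingest` and the list-of-segments representation are assumptions, and the threshold value is left to the caller because the source does not give one.

```python
from collections import Counter

def ingest(article_segments, heat, threshold, hot_article_library, hot_words):
    """Judge one article and, if it is hot, update both libraries in place.

    Returns True when the heat value exceeds the threshold.
    """
    if heat <= threshold:
        return False
    hot_article_library.append(article_segments)
    # Incremental update: equivalent to recounting the whole library,
    # since word frequencies simply add up across articles.
    hot_words.update(Counter(article_segments))
    return True

library = []
words = Counter({"flood": 2})
is_hot = ingest(["flood", "dam"], heat=0.5, threshold=0.1,
                hot_article_library=library, hot_words=words)
```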
The foregoing illustrates and describes the principles, general features, and advantages of the present invention. It should be understood by those skilled in the art that the above embodiments do not limit the present invention in any way, and all technical solutions obtained by using equivalent alternatives or equivalent variations fall within the scope of the present invention.

Claims (10)

1. A hot article identification method, characterized by comprising the following steps:
acquiring a plurality of hot articles to form a hot article library;
extracting the word segments of each hot article in the hot article library and counting the word frequency of each word segment to form a hot word library;
extracting the word segments of the article to be stored and counting the word frequency of each word segment;
calculating the heat value of the article to be stored;
and judging whether the article to be stored is a hot article according to its heat value.
2. The hot article identification method of claim 1, wherein
the specific method for calculating the heat value of the article to be stored comprises the following steps:
calculating the heat value of each word segment of the article to be stored by the following formula,
score=(subsetFreq/subsetSize-superFreq/superSize)*((subsetFreq/subsetSize)/(superFreq/superSize))*natureBoost*fieldBoost,
wherein score represents the heat value of the word segment, subsetFreq represents the word frequency of that word segment in the article to be stored, subsetSize represents the sum of the word frequencies of all word segments of the article to be stored, superFreq represents the word frequency of that word segment in the hot word library, superSize represents the sum of the word frequencies of all word segments in the hot word library, natureBoost represents the part-of-speech weight of the word segment, and fieldBoost represents the field weight of the word segment;
and averaging the calculated heat values of the word segments to obtain the heat value of the article to be stored.
3. The hot article identification method of claim 2, wherein
a subset of word segments with higher word frequency is selected from the word segments of the article to be stored according to their word frequency;
and only the selected word segments are used when calculating the heat value of the article to be stored.
4. The hot article identification method of claim 3, wherein
the 100 word segments with the highest word frequency are selected from the word segments of the article to be stored.
5. The hot article identification method of claim 2, wherein
the specific method for obtaining natureBoost is as follows:
averaging the part-of-speech weights over the parts of speech with which the word segment occurs in the article to be stored.
6. The hot article identification method of claim 5, wherein
the specific method for obtaining fieldBoost is as follows:
averaging the field weights over the fields in which the word segment occurs in the article to be stored.
7. The hot article identification method of claim 2, wherein
the specific method for judging whether the article to be stored is a hot article according to its heat value comprises the following steps:
when the heat value of the article to be stored is greater than a preset threshold, judging that the article to be stored is a hot article.
8. The hot article identification method of claim 7, wherein
the hot article identification method further comprises the following steps:
when the heat value of the article to be stored is greater than the preset threshold, adding the article to the hot article library to update the hot article library;
extracting the word segments of each hot article in the updated hot article library and counting the word frequency of each word segment to update the hot word library.
9. The hot article identification method of claim 1, wherein
the specific method for acquiring the hot articles to form the hot article library comprises the following steps:
acquiring hot articles within a first preset time from the network to form the hot article library.
10. The hot article identification method of claim 9, wherein
hot articles within the first preset time are acquired from the network again at intervals of a second preset time to form a new hot article library.
CN202010502429.6A 2020-06-03 2020-06-03 Hot article identification method Active CN111666749B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010502429.6A CN111666749B (en) 2020-06-03 2020-06-03 Hot article identification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010502429.6A CN111666749B (en) 2020-06-03 2020-06-03 Hot article identification method

Publications (2)

Publication Number Publication Date
CN111666749A (en) 2020-09-15
CN111666749B (en) 2023-09-19

Family

ID=72386400

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010502429.6A Active CN111666749B (en) 2020-06-03 2020-06-03 Hot article identification method

Country Status (1)

Country Link
CN (1) CN111666749B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113095073A (en) * 2021-03-12 2021-07-09 深圳索信达数据技术有限公司 Corpus tag generation method and device, computer equipment and storage medium
CN113095073B (en) * 2021-03-12 2022-04-19 深圳索信达数据技术有限公司 Corpus tag generation method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN111666749B (en) 2023-09-19

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant