CN111666749A

CN111666749A - Hot article identification method

Info

Publication number: CN111666749A
Application number: CN202010502429.6A
Authority: CN
Inventors: 姚洲鹏
Original assignee: Hangzhou Fanews Technology Co ltd
Current assignee: Hangzhou Fanews Technology Co ltd
Priority date: 2020-06-03
Filing date: 2020-06-03
Publication date: 2020-09-15
Anticipated expiration: 2040-06-03
Also published as: CN111666749B

Abstract

The invention discloses a hot article identification method comprising the following steps: acquiring a plurality of hot articles to form a hot article library; extracting the word segments of each hot article in the hot article library and counting the word frequency of each word segment to form a hot word library; extracting the word segments of an incoming article and counting the word frequency of each word segment; calculating the heat value of the incoming article; and judging, according to that heat value, whether the incoming article is a hot article. The advantage of the method is that a hot word library can be extracted from existing hot articles, the heat value of a newly stored article can then be calculated against that hot word library, and the article can be quickly judged to be a hot article or not from its heat value.

Description

Hot article identification method

Technical Field

The invention relates to a hot article identification method.

Background

With the development of the internet industry, news workers need to find and identify hot articles promptly, so that they can learn from them which topics the public currently cares about. At present, news workers generally identify hot articles from the click-ranking lists of large websites. However, this approach relies on each website's statistics of user click data, so it lags in timeliness. By the time an article has been flagged as hot on the basis of a large volume of clicks, most users have already read it, so the news extracted from it has little remaining value. A method is therefore needed that identifies hot articles quickly without relying on the click data of large websites.

Disclosure of Invention

To overcome the shortcomings of the prior art, the invention provides a hot article identification method that solves the above problems.

In order to achieve the above object, the present invention adopts the following technical solutions:

A hot article identification method, comprising the following steps:

acquiring a plurality of hot articles to form a hot article library;

extracting the word segments of each hot article in the hot article library and counting the word frequency of each word segment to form a hot word library;

extracting the word segments of an incoming article and counting the word frequency of each word segment;

calculating the heat value of the incoming article;

and judging whether the incoming article is a hot article according to the heat value of the incoming article.

Further, the specific method for calculating the heat value of the incoming article comprises the following steps:

calculating the word segment heat value of each word segment of the incoming article through the following formula,

score = (subsetFreq/subsetSize - superFreq/superSize) * ((subsetFreq/subsetSize) / (superFreq/superSize)) * natureBoost * fieldBoost,

wherein score represents the word segment heat value, subsetFreq represents the word frequency of the word segment in the incoming article, subsetSize represents the sum of the word frequencies of all word segments of the incoming article, superFreq represents the word frequency of the same word segment in the hot word library, superSize represents the sum of the word frequencies of all word segments in the hot word library, natureBoost represents the part-of-speech weight of the word segment, and fieldBoost represents the field weight of the word segment;

and averaging the calculated word segment heat values to obtain the heat value of the incoming article.

Further, a subset of word segments with higher word frequency is selected from the word segments of the incoming article according to their word frequency;

and when the heat value of the incoming article is calculated, only the selected word segments are used.

Further, the 100 word segments with the highest word frequency are selected from the word segments of the incoming article.

Further, the specific method for obtaining natureBoost is as follows:

averaging the part-of-speech weights over the occurrences of the word segment in the incoming article, according to the part of speech of each occurrence.

Further, the specific method for obtaining fieldBoost is as follows:

averaging the field weights over the occurrences of the word segment in the incoming article, according to the field in which each occurrence appears.

Further, the specific method for judging whether the incoming article is a hot article according to its heat value comprises the following steps:

when the heat value of the incoming article is greater than a preset threshold, judging that the incoming article is a hot article.

Further, the hot article identification method further comprises the following steps:

when the heat value of the incoming article is greater than the preset threshold, adding the incoming article to the hot article library to update the hot article library;

and extracting the word segments of each hot article in the updated hot article library and counting the word frequency of each word segment to update the hot word library.

Further, the specific method for acquiring a plurality of hot articles to form a hot article library comprises the following steps:

acquiring the hot articles within a first preset time from the network to form the hot article library.

Further, at intervals of a second preset time, the hot articles within the first preset time are acquired from the network again to form a new hot article library.

One advantage of the method is that a hot word library can be extracted from existing hot articles, the heat value of a newly stored article can then be calculated against that hot word library, and the article can be quickly judged to be a hot article or not from its heat value.

A further advantage is that, once a newly stored article is recognized as a hot article, it is added to the hot article library, and a new hot word library is built from the updated hot article library. The hot word library is thus kept more complete, which improves both the efficiency and the accuracy of hot article identification.

Drawings

FIG. 1 is a flow chart of a hot article identification method of the present invention.

Detailed Description

The invention is described in detail below with reference to the figures and the embodiments.

Fig. 1 shows the hot article identification method of the invention, which comprises the following steps: S1, acquiring a plurality of hot articles to form a hot article library; S2, extracting the word segments of each hot article in the hot article library and counting the word frequency of each word segment to form a hot word library; S3, extracting the word segments of an incoming article and counting the word frequency of each word segment; S4, calculating the heat value of the incoming article; S5, judging whether the incoming article is a hot article according to its heat value. Through these steps, currently hot articles are first acquired and a hot word library is extracted from them. The heat value of a newly stored article is then calculated against the hot word library, and the article is judged to be a hot article or not according to that heat value. The steps are described in detail below.

For S1, acquiring a plurality of hot articles to form a hot article library.

Specifically, hot articles within a first preset time are acquired from the network to form the hot article library. In the invention, hot articles are collected by data acquisition software from online media such as Sina, NetEase, and Toutiao. Articles with a high click rate may be taken directly from these websites as hot articles; preferably, the top-ranked articles are taken directly from the websites' hot article ranking lists. Hot articles are time-sensitive: an article that had a high click rate a year ago is no longer hot. To avoid capturing articles from long ago, a time limit is set during data acquisition, and only hot articles within the first preset time are acquired to form the hot article library. In the invention, the first preset time is set to one month. It is understood that the first preset time can be set freely as needed.

As a preferred embodiment, at intervals of a second preset time, the hot articles within the first preset time are acquired from the network again to form a new hot article library.

It can be understood that hot articles are time-sensitive, and only if the hot article library is refreshed regularly can its articles accurately reflect the current hot topics.
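The time window of step S1 and the periodic refresh can be sketched as follows. This is only an illustration: the `Article` record, the function names, the 30-day reading of "one month", and the one-day value for the second preset time are assumptions and are not specified in the description.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Article:
    # Hypothetical record for a collected article.
    title: str
    body: str
    published: datetime


def build_hot_article_library(candidates, now, first_preset_time=timedelta(days=30)):
    """Keep only hot articles published within the first preset time before `now`."""
    cutoff = now - first_preset_time
    return [a for a in candidates if cutoff <= a.published <= now]


def refresh_due(last_refresh, now, second_preset_time=timedelta(days=1)):
    """Return True when the second preset time has elapsed since the last rebuild."""
    return now - last_refresh >= second_preset_time
```

A scheduler would call `refresh_due` periodically and, when it returns True, re-collect the candidates and rebuild the library with `build_hot_article_library`.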

For S2, extracting the word segments of each hot article in the hot article library and counting the word frequency of each word segment to form the hot word library.

In step S1, the currently hot articles are acquired to form the hot article library; in step S2, a hot word library is extracted from that hot article library. Specifically, each hot article in the hot article library is analyzed by methods such as semantic analysis to obtain a number of word segments, and the total number of times each word segment appears across the hot articles is counted as the word frequency of that word segment. The word segments, together with their word frequencies, form the hot word library.

It is understood that each time the hot articles are re-acquired in step S1, the hot word library is updated as well.
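A minimal sketch of the word frequency count in step S2 is given below. The `segment` function is a stand-in that simply splits on whitespace; the description only says "methods such as semantic analysis", so a real system would substitute a proper word segmenter. All names here are illustrative assumptions.

```python
from collections import Counter


def segment(text):
    """Placeholder segmenter (assumption): lowercase and split on whitespace."""
    return text.lower().split()


def build_hot_word_library(hot_articles, segmenter=segment):
    """Map each word segment to its total count across all hot articles."""
    library = Counter()
    for text in hot_articles:
        library.update(segmenter(text))
    return library
```

The same counting applied to a single article gives the per-article word frequencies used in step S3.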

For S3, extracting the word segments of the incoming article and counting the word frequency of each word segment.

When an incoming article is collected, it is analyzed by methods such as semantic analysis to obtain its word segments, and the number of times each word segment appears in the incoming article is counted as the word frequency of that word segment.

For S4, calculating the heat value of the incoming article.

In step S4, the heat value of the incoming article is calculated from the data counted in the previous steps. The specific method is as follows: the word segment heat value of each word segment of the incoming article is calculated through the following formula,

score = (subsetFreq/subsetSize - superFreq/superSize) * ((subsetFreq/subsetSize) / (superFreq/superSize)) * natureBoost * fieldBoost,

wherein score represents the word segment heat value. subsetFreq represents the word frequency of the word segment in the incoming article. subsetSize represents the sum of the word frequencies of all word segments of the incoming article, obtained by adding those word frequencies together. superFreq represents the word frequency of the same word segment in the hot word library, which can be found directly by looking the word segment up in the hot word library. superSize represents the sum of the word frequencies of all word segments in the hot word library, obtained by adding those word frequencies together. natureBoost represents the part-of-speech weight of the word segment. fieldBoost represents the field weight of the word segment. The calculated word segment heat values are then averaged to obtain the heat value of the incoming article.
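The calculation in step S4 can be sketched as below. Two points are assumptions rather than part of the description: word segments that do not occur in the hot word library are skipped (the formula would otherwise divide by zero), and the average is taken over the segments that were actually scored. The function and parameter names are illustrative.

```python
def segment_score(subset_freq, subset_size, super_freq, super_size,
                  nature_boost=1.0, field_boost=1.0):
    """Word segment heat value, following the formula in the description."""
    article_ratio = subset_freq / subset_size   # share of the segment in the article
    library_ratio = super_freq / super_size     # share of the segment in the hot word library
    return ((article_ratio - library_ratio)
            * (article_ratio / library_ratio)
            * nature_boost * field_boost)


def article_heat(article_freqs, hot_library, boosts=None):
    """Average the word segment heat values of one article.

    `boosts` optionally maps a segment to (natureBoost, fieldBoost); 1.0 is used otherwise.
    """
    boosts = boosts or {}
    subset_size = sum(article_freqs.values())
    super_size = sum(hot_library.values())
    scores = []
    for word, freq in article_freqs.items():
        super_freq = hot_library.get(word, 0)
        if super_freq == 0:
            continue  # assumption: the description does not cover segments absent from the library
        nature, field = boosts.get(word, (1.0, 1.0))
        scores.append(segment_score(freq, subset_size, super_freq, super_size, nature, field))
    return sum(scores) / len(scores) if scores else 0.0
```

For example, with article frequencies `{"rate": 2, "cut": 2}` and a hot word library `{"rate": 1, "cut": 1, "bank": 2}`, each segment has an article share of 0.5 and a library share of 0.25, so each scores (0.5 - 0.25) * (0.5 / 0.25) = 0.5, and the article heat value is 0.5.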

It can be understood that, when the heat value of the incoming article is calculated, a word segment heat value need not be computed for every word segment in the article. As a preferred embodiment, a subset of word segments with higher word frequency is selected from the word segments of the incoming article, and only the selected word segments are used when the heat value of the incoming article is calculated. In the invention, the 100 word segments with the highest word frequency are selected.
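Selecting the highest-frequency word segments can be sketched with the standard library as follows. The function name and the dictionary representation of word frequencies are assumptions; how ties at the cut-off are resolved is not stated in the description, and here they follow the order in which segments were first seen.

```python
from collections import Counter


def top_segments(article_freqs, limit=100):
    """Keep the `limit` word segments with the highest word frequency."""
    return dict(Counter(article_freqs).most_common(limit))
```

Only the segments returned by `top_segments` would then be passed on to the heat value calculation.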

natureBoost represents the part-of-speech weight of a word segment of the incoming article. The specific method for obtaining natureBoost is as follows: the part-of-speech weights are averaged over the occurrences of the word segment in the incoming article, according to the part of speech of each occurrence.

It can be understood that word segments in the incoming article contribute differently to the heat value depending on their part of speech. Generally, the part-of-speech weight of a noun is 0.85 or more and 0.95 or less, that of a verb is 0.65 or more and 0.85 or less, that of an adjective is 0.5 or more and 0.7 or less, and that of an adverb is 0.35 or more and 0.5 or less.

In this embodiment, the part-of-speech weight is 0.9 for a noun, 0.8 for a verb, 0.6 for an adjective, and 0.4 for an adverb. When a word segment is only a noun, natureBoost is 0.9. When a word segment can be either a noun or a verb, semantic analysis determines that it appears m times as a noun and n times as a verb in the incoming article, in which case natureBoost = (0.9m + 0.8n)/(m + n); other cases follow by analogy. That is, the weight is averaged over the parts of speech with which the word segment appears in the incoming article.
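The part-of-speech averaging can be sketched as below, using the weights of this embodiment. The function name, the dictionary keys, and the error raised for an empty count are illustrative assumptions.

```python
# Part-of-speech weights of the embodiment.
POS_WEIGHTS = {"noun": 0.9, "verb": 0.8, "adjective": 0.6, "adverb": 0.4}


def nature_boost(pos_counts, weights=POS_WEIGHTS):
    """Average the part-of-speech weight over all occurrences of one word segment.

    `pos_counts` maps a part of speech to the number of occurrences with that part of speech.
    """
    total = sum(pos_counts.values())
    if total == 0:
        raise ValueError("the word segment does not occur in the article")
    return sum(weights[pos] * count for pos, count in pos_counts.items()) / total
```

For a word segment that appears three times as a noun and once as a verb, this gives (0.9 * 3 + 0.8 * 1) / 4, which is approximately 0.875.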

fieldBoost represents the field weight of a word segment of the incoming article. The specific method for obtaining fieldBoost is as follows: the field weights are averaged over the occurrences of the word segment in the incoming article, according to the field in which each occurrence appears.

It will be appreciated that the same word segment contributes differently to the heat value depending on whether it occurs in the title or in the body of an article. In general, when a word segment appears in the title of an article, the field weight is 0.85 or more and less than 0.95, and when it appears in the body of an article, the field weight is 0.6 or more and 0.8 or less.

In this embodiment, the field weight is 0.9 when a word segment appears in the title of an article and 0.7 when it appears in the body. As with natureBoost, if analysis shows that the word segment appears a times in the title and b times in the body of the incoming article, then fieldBoost = (0.9a + 0.7b)/(a + b).
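The field weighting follows the same pattern and can be sketched as below, with the embodiment's weights as defaults. The function and parameter names are assumptions.

```python
def field_boost(title_count, body_count, title_weight=0.9, body_weight=0.7):
    """Average the field weight over title and body occurrences of one word segment."""
    total = title_count + body_count
    if total == 0:
        raise ValueError("the word segment does not occur in the article")
    return (title_weight * title_count + body_weight * body_count) / total
```

For a word segment that appears once in the title and three times in the body, this gives (0.9 * 1 + 0.7 * 3) / 4, which is approximately 0.75.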

For S5, judging whether the incoming article is a hot article according to the heat value of the incoming article.

Specifically, when the heat value of the incoming article is greater than a preset threshold, the incoming article is judged to be a hot article.

As a preferred embodiment, the hot article identification method further comprises the following steps:

when the heat value of the incoming article is greater than the preset threshold, adding the incoming article to the hot article library to update the hot article library; and extracting the word segments of each hot article in the updated hot article library and counting the word frequency of each word segment to update the hot word library.

It can be understood that, when the incoming article is judged to be a hot article, it is added to the hot article library, and the updated hot article library is processed to obtain a new hot word library. The hot word library is thus kept more complete, which improves both the efficiency and the accuracy of hot article identification.
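The threshold test of step S5 and the feedback update can be sketched as follows. The description re-extracts and recounts the whole updated hot article library; this sketch instead adds the new article's counts to the existing hot word library, which yields the same totals when the library is otherwise unchanged. That shortcut, the function name, and the representation of an article as a list of word segments are assumptions.

```python
from collections import Counter


def judge_and_update(article_segments, heat, threshold, hot_articles, hot_library):
    """Return True if the article is hot; if so, add it to both libraries in place."""
    is_hot = heat > threshold
    if is_hot:
        hot_articles.append(article_segments)   # update the hot article library
        hot_library.update(article_segments)    # add the article's word frequencies
    return is_hot
```

Here `hot_library` is expected to be a `collections.Counter`, so that `update` adds counts rather than overwriting them. The threshold itself is a preset value that the description leaves open.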

The foregoing illustrates and describes the principles, general features, and advantages of the present invention. It should be understood by those skilled in the art that the above embodiments do not limit the present invention in any way, and all technical solutions obtained by using equivalent alternatives or equivalent variations fall within the scope of the present invention.

Claims

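1. A hot article identification method, characterized by comprising the following steps:

acquiring a plurality of hot articles to form a hot article library;

extracting the word segments of each hot article in the hot article library and counting the word frequency of each word segment to form a hot word library;

extracting the word segments of an incoming article and counting the word frequency of each word segment;

calculating the heat value of the incoming article;

and judging whether the incoming article is a hot article according to the heat value of the incoming article.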

2. The hot article identification method of claim 1,

the specific method for calculating the heat value of the incoming article comprises the following steps:

calculating the word segment heat value of each word segment of the incoming article through the following formula,

score = (subsetFreq/subsetSize - superFreq/superSize) * ((subsetFreq/subsetSize) / (superFreq/superSize)) * natureBoost * fieldBoost,

wherein score represents the word segment heat value, subsetFreq represents the word frequency of the word segment in the incoming article, subsetSize represents the sum of the word frequencies of all word segments of the incoming article, superFreq represents the word frequency of the same word segment in the hot word library, superSize represents the sum of the word frequencies of all word segments in the hot word library, natureBoost represents the part-of-speech weight of the word segment, and fieldBoost represents the field weight of the word segment;

and averaging the calculated word segment heat values to obtain the heat value of the incoming article.

3. The hot article identification method of claim 2,

selecting a subset of word segments with higher word frequency from the word segments of the incoming article according to their word frequency;

and using only the selected word segments when calculating the heat value of the incoming article.

4. The hot article identification method of claim 3,

selecting the 100 word segments with the highest word frequency from the word segments of the incoming article.

5. The hot article identification method of claim 2,

the specific method for obtaining natureBoost comprises the following steps:

averaging the part-of-speech weights over the occurrences of the word segment in the incoming article, according to the part of speech of each occurrence.

6. The hot article identification method of claim 5,

the specific method for obtaining fieldBoost is as follows:

averaging the field weights over the occurrences of the word segment in the incoming article, according to the field in which each occurrence appears.

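7. The hot article identification method of claim 2,

the specific method for judging whether the incoming article is a hot article according to the heat value of the incoming article comprises the following steps:

when the heat value of the incoming article is greater than a preset threshold, judging that the incoming article is a hot article.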

8. The hot article identification method of claim 7,

the hot article identification method further comprises the following steps:

when the heat value of the incoming article is greater than the preset threshold, adding the incoming article to the hot article library to update the hot article library;

and extracting the word segments of each hot article in the updated hot article library and counting the word frequency of each word segment to update the hot word library.

9. The hot article identification method of claim 1,

the specific method for acquiring the plurality of hot articles to form the hot article library comprises the following steps:

acquiring the hot articles within a first preset time from the network to form the hot article library.

10. The hot article identification method of claim 9,

at intervals of a second preset time, acquiring the hot articles within the first preset time from the network again to form a new hot article library.