WO2022141803A1 - Method for automatically discovering hot keywords and hot news - Google Patents

Method for automatically discovering hot keywords and hot news

Info

Publication number
WO2022141803A1
Authority
WO
WIPO (PCT)
Prior art keywords
hot
news
proportion
keywords
preset
Prior art date
Application number
PCT/CN2021/080154
Other languages
English (en)
French (fr)
Inventor
尹扬
Original Assignee
上海朝阳永续信息技术股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海朝阳永续信息技术股份有限公司
Publication of WO2022141803A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/3332: Query translation
    • G06F16/3334: Selection or weighting of terms from queries, including natural language queries
    • G06F16/335: Filtering based on additional data, e.g. user or group profiles
    • G06F16/35: Clustering; Classification
    • G06F16/353: Clustering; Classification into predefined classes

Definitions

  • The invention relates to the technical field of Internet applications, and in particular to a method for automatically discovering hot keywords and hot news.
  • Existing methods for discovering hot news fall mainly into two groups: manual editing, and computing hot information from user behavior data.
  • Manual editing requires hiring professional editors to read, organize and edit massive amounts of news every day, which is time-consuming, labor-intensive and expensive.
  • Computation from user behavior data is the approach used by large Internet search companies such as Baidu and Google, which derive current hot topics from large volumes of user behavior data such as ranked search records, clicks, page views and sharing rates. Most companies and individuals do not have enough user behavior data to obtain current hot information this way.
  • The purpose of the present invention is to provide a method for automatically discovering hot keywords and hot news, so as to solve the problem that small and medium-sized enterprises find it difficult to obtain hot keywords and hot news automatically and therefore lose the initiative in investment and decision-making.
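  • The present invention provides a method for automatically discovering hot keywords and hot news, comprising the following steps: extracting the topic keywords of each news item; calculating the ratio of the number of news items corresponding to each topic keyword in a preset period to the number of news items added in the preset period, to obtain the proportion of news corresponding to each topic keyword; calculating the average proportion and the proportion standard deviation of each topic keyword over a preset historical time period; calculating the heat value of each topic keyword from these two quantities; if a heat value is greater than a preset hot threshold, determining that the topic keyword corresponding to that heat value is a hot keyword; and finding the corresponding hot news according to the hot keywords.
  • The average proportion is calculated as:
  • $M = (P_1 + P_2 + \dots + P_n)/n$, where $M$ is the average proportion of any topic keyword over the preset historical time period, $P_1$ to $P_n$ are the proportions of news corresponding to that topic keyword calculated within the preset historical time period, and $n$ is the number of such proportions.
  • The proportion standard deviation is calculated as:
  • $\mathrm{Std} = \sqrt{((P_1 - M)^2 + (P_2 - M)^2 + \dots + (P_n - M)^2)/n}$, where $\mathrm{Std}$ is the proportion standard deviation of any topic keyword over the preset historical time period, $P_1$ to $P_n$ are the proportions of news corresponding to that topic keyword calculated within the preset historical time period, $M$ is the average proportion of that topic keyword over the preset historical time period, and $n$ is the number of such proportions.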
  • The proportion of news corresponding to each topic keyword in a preset period is calculated at a preset frequency, so that the proportion is updated in a timely manner.
  • The preset frequency includes: 30 minutes, 1 hour or 2 hours;
  • the preset period includes: 1 day, 1 week or 1 month;
  • the preset historical time period includes: 1 month, 1 quarter or 2 quarters.
  • The preset hot threshold includes: 2.8, 3.0 or 3.2.
  • Extracting the topic keywords includes the following steps: using the TextRank algorithm to obtain the keywords in each news topic; using a machine learning classifier to classify the obtained keywords; and obtaining topic keywords of different categories.
  • The topic keywords of each news item are extracted from a massive amount of news.
  • The extracted topic keywords are stored in the database as tags of the corresponding news for later use;
  • the proportion of news corresponding to each topic keyword in the preset period is stored in the database for later use.
  • In the method for automatically discovering hot keywords and hot news provided by the present invention, the proportion, average proportion, proportion standard deviation and heat value of the news corresponding to each topic keyword in a preset period are calculated, so that the current hot keywords can be computed fully automatically and in a timely manner from the massive, disorganized news in the database, and the corresponding hot news can be found from these hot keywords.
  • The whole process requires no manual intervention and does not need to collect or use any user behavior data. This saves labor costs and lowers the threshold for small and medium-sized enterprises and individuals to obtain hot keywords and hot news automatically and in a timely manner.
  • FIG. 1 is a flowchart of the method for automatically discovering hot keywords and hot news provided by an embodiment of the present invention;
  • FIG. 2 is a trend chart of the proportion of news corresponding to a topic keyword, provided by an embodiment of the present invention;
  • FIG. 3 shows the hot news corresponding to a hot keyword, provided by an embodiment of the present invention.
  • Existing methods for discovering hot news fall mainly into two groups: manual editing, and computing hot information from user behavior data.
  • Manual editing is time-consuming, labor-intensive and expensive. Computation from user behavior data is out of reach for most companies and individuals, who do not have enough user behavior data to derive current hot information.
  • As shown in FIG. 1, the method for automatically discovering hot keywords and hot news includes the following steps: extracting the topic keywords of each news item; calculating the proportion of news corresponding to each topic keyword in a preset period; calculating the average proportion and the proportion standard deviation of each topic keyword over a preset historical time period; calculating the heat value of each topic keyword; if a heat value is greater than a preset hot threshold, determining that the topic keyword corresponding to that heat value is a hot keyword; and finding the corresponding hot news according to the hot keywords.
  • By calculating these quantities, the present invention can compute the current hot keywords fully automatically and in a timely manner from the massive, disorganized news in the database, and find the corresponding hot news from these hot keywords.
  • The whole process requires no manual intervention and does not need to collect or use any user behavior data. This saves labor costs and lowers the threshold for small and medium-sized enterprises and individuals to obtain hot keywords and hot news automatically and in a timely manner.
  • Extracting the topic keywords includes the following steps: using the TextRank algorithm to obtain the keywords in each news topic; using a machine learning classifier to classify the obtained keywords; and obtaining topic keywords of different categories.
  • Typically, the TextRank algorithm and the machine learning classifier are used to extract the topic keywords of each news item from a massive amount of news, which is generally stored in a storage device such as a news information database.
  • Whenever a news item is added to the news information database, the TextRank algorithm and the machine learning classifier can be used to extract the topic keywords of the new item, and the extracted topic keywords are stored in the database as tags of the corresponding news for later use.
  • The proportion of news corresponding to each topic keyword in the preset period is calculated at the preset frequency, so that the proportion is updated in a timely manner.
  • The preset frequency includes: 30 minutes, 1 hour or 2 hours, and the preset period includes: 1 day, 1 week or 1 month.
  • For example, with a preferred preset frequency of 1 hour and a preset period of 1 day, the proportion is calculated once every hour as $P = T/N$, where $P$ is the proportion of news corresponding to any topic keyword within 1 day, $T$ is the number of news items corresponding to that topic keyword within 1 day, and $N$ is the number of news items added within 1 day. This gives the proportion of news corresponding to each topic keyword in the preset period.
  • The proportion of news corresponding to each topic keyword in the preset period is stored in the database for later use.
  • The present invention first calculates the historical distribution, over a preset historical time period, of the proportion of news corresponding to each topic keyword, and then calculates the heat value of each topic keyword's current proportion relative to that historical distribution.
  • The heat value is calculated as $\mathrm{Hot}(w) = (\mathrm{Proportion}(w) - \mathrm{Mean}(w)) / \mathrm{Std}(w)$, where $w$ is the topic keyword whose heat value is to be calculated, $\mathrm{Proportion}(w)$ is the current proportion of news corresponding to the topic keyword in the preset period, $\mathrm{Mean}(w)$ is the average proportion of the topic keyword over the preset historical time period, and $\mathrm{Std}(w)$ is the proportion standard deviation of the topic keyword over the preset historical time period.
  • The average proportion is calculated in the same way for every topic keyword, and so is the proportion standard deviation.
  • The preset historical time period includes: 1 month, 1 quarter or 2 quarters; preferably, the preset historical time period is 1 month.
  • Generally, if no hot event related to a topic keyword has occurred, the word frequency of that topic keyword follows its usual distribution; if a related hot event occurs, the distribution changes. If, in the current period, the word frequency of the topic keyword deviates from the mean by Hot times the standard deviation, then the larger the deviation, the lower the probability that the word frequency comes from the original distribution, that is, the less likely it is that no hot event has occurred and the more likely it is that a hot event related to the topic keyword has occurred. Therefore, the larger the heat value of a topic keyword, the hotter the topic keyword.
  • After the heat values are calculated, the topic keywords are sorted by heat value in descending order, and either the top few are taken or a preset hot threshold is used for screening. With threshold screening, if a heat value is greater than the preset hot threshold, the topic keyword corresponding to that heat value is determined to be a hot keyword. The news corresponding to the hot keywords is then queried in the database, and the news returned by the query is the current hot news.
  • The preset hot threshold includes: 2.8, 3.0 or 3.2; preferably, the preset hot threshold may be 3.0.
  • The preset frequency, the preset period, the preset historical time period and the preset hot threshold can all be set according to requirements such as news timeliness and hotspot accuracy.
  • In one embodiment, the present invention is used to calculate the historical distribution of the proportion of news corresponding to the topic keyword "Douyu", thereby discovering the hot news of the merger of Douyu Live and Huya Live that occurred on October 13, 2020.
  • FIG. 2 is a trend chart of the proportion of news corresponding to the topic keyword, provided by the embodiment of the present invention. It shows that before October 13, 2020, the proportion of news corresponding to the topic keyword "Douyu" fluctuated basically within 0.001; on October 13, 2020, however, it suddenly rose above 0.007.
  • The heat value Hot["Douyu"] of the topic keyword "Douyu" on October 13, 2020, calculated by the algorithm of the present invention, is 11.24, far exceeding the preset hot threshold of the heat value (within 3.0), which indicates that a hot event concerning the topic keyword "Douyu" has occurred.
  • News related to "Douyu" is then queried in the database. As shown in FIG. 3, a large amount of news on October 13, 2020 concerns the merger of Douyu Live and Huya Live, and the hot news is thereby discovered.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

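A method for automatically discovering hot keywords and hot news, comprising the following steps: extracting the topic keywords of each news item; calculating the ratio of the number of news items corresponding to each topic keyword in a preset period to the number of news items added in the preset period, to obtain the proportion of news corresponding to each topic keyword in the preset period; calculating the average proportion and the proportion standard deviation of each topic keyword over a preset historical time period; calculating the heat value of each topic keyword from its average proportion and proportion standard deviation over the preset historical time period; if a heat value is greater than a preset hot threshold, determining that the topic keyword corresponding to that heat value is a hot keyword; and finding the corresponding hot news according to the hot keywords. The method obtains hot keywords and hot news through automatic acquisition, calculation and screening, which saves labor costs and lowers the threshold for small and medium-sized enterprises and individuals to obtain hot keywords and hot news automatically and in a timely manner.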

Description

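Method for automatically discovering hot keywords and hot news
Technical Field
The present invention relates to the technical field of Internet applications, and in particular to a method for automatically discovering hot keywords and hot news.
Background
In today's Internet era, a massive amount of news and information is produced every day and spreads around the world over the Internet at great speed. Quickly extracting valuable hot information from this mass of information has become critical in fields such as financial investment and management decision-making.
Existing methods for discovering hot news fall mainly into two groups: manual editing, and computing hot information from user behavior data. Manual editing requires hiring professional editors to read, organize and edit massive amounts of news every day, which is time-consuming, labor-intensive and expensive. Computation from user behavior data is the approach used by large Internet search companies such as Baidu and Google, which derive the topics people currently care about from large volumes of user behavior data such as ranked search records, clicks, page views and sharing rates; most companies and individuals, however, do not have enough user behavior data to obtain current hot information this way.
It is therefore necessary to provide a method for automatically discovering hot keywords and hot news, so as to solve the problem that small and medium-sized enterprises find it difficult to obtain hot keywords and hot news automatically and consequently lose the initiative in investment, decision-making and the like.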
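Summary of the Invention
The purpose of the present invention is to provide a method for automatically discovering hot keywords and hot news, so as to solve the problem that small and medium-sized enterprises find it difficult to obtain hot keywords and hot news automatically and consequently lose the initiative in investment, decision-making and the like.
To solve the problems in the prior art, the present invention provides a method for automatically discovering hot keywords and hot news, comprising the following steps:
extracting the topic keywords of each news item;
calculating the ratio of the number of news items corresponding to each topic keyword in a preset period to the number of news items added in the preset period, to obtain the proportion of news corresponding to each topic keyword in the preset period;
calculating the average proportion and the proportion standard deviation of each topic keyword over a preset historical time period;
calculating the heat value of each topic keyword from its average proportion and proportion standard deviation over the preset historical time period;
if a heat value is greater than a preset hot threshold, determining that the topic keyword corresponding to that heat value is a hot keyword;
finding the corresponding hot news according to the hot keywords.
Optionally, in the method for automatically discovering hot keywords and hot news, the heat value is calculated as: $\mathrm{Hot}(w) = (\mathrm{Proportion}(w) - \mathrm{Mean}(w)) / \mathrm{Std}(w)$, where $w$ is the topic keyword whose heat value is to be calculated, $\mathrm{Hot}(w)$ is the heat value of that topic keyword, $\mathrm{Proportion}(w)$ is the current proportion of news corresponding to that topic keyword in the preset period, $\mathrm{Mean}(w)$ is the average proportion of that topic keyword over the preset historical time period, and $\mathrm{Std}(w)$ is the proportion standard deviation of that topic keyword over the preset historical time period.
Optionally, in the method for automatically discovering hot keywords and hot news, the average proportion is calculated as:
$M = (P_1 + P_2 + \dots + P_n)/n$, where $M$ is the average proportion of any topic keyword over the preset historical time period, $P_1$ to $P_n$ are the proportions of news corresponding to that topic keyword calculated within the preset historical time period, and $n$ is the number of such proportions within the preset historical time period.
Optionally, in the method for automatically discovering hot keywords and hot news, the proportion standard deviation is calculated as:
$\mathrm{Std} = \sqrt{((P_1 - M)^2 + (P_2 - M)^2 + \dots + (P_n - M)^2)/n}$, where $\mathrm{Std}$ is the proportion standard deviation of any topic keyword over the preset historical time period, $P_1$ to $P_n$ are the proportions of news corresponding to that topic keyword calculated within the preset historical time period, $M$ is the average proportion of that topic keyword over the preset historical time period, and $n$ is the number of such proportions within the preset historical time period.
Optionally, in the method for automatically discovering hot keywords and hot news, the proportion of news corresponding to each topic keyword in the preset period is calculated at a preset frequency, so that the proportion is updated in a timely manner.
Optionally, in the method for automatically discovering hot keywords and hot news,
the preset frequency includes: 30 minutes, 1 hour or 2 hours;
the preset period includes: 1 day, 1 week or 1 month;
the preset historical time period includes: 1 month, 1 quarter or 2 quarters.
Optionally, in the method for automatically discovering hot keywords and hot news, the preset hot threshold includes: 2.8, 3.0 or 3.2.
Optionally, in the method for automatically discovering hot keywords and hot news, extracting the topic keywords includes the following steps:
using the TextRank algorithm to obtain the keywords in each news topic;
using a machine learning classifier to classify the obtained keywords;
obtaining topic keywords of different categories.
Optionally, in the method for automatically discovering hot keywords and hot news, the topic keywords of each news item are extracted from a massive amount of news.
Optionally, in the method for automatically discovering hot keywords and hot news,
the extracted topic keywords are stored in a database as tags of the corresponding news for later use;
the proportion of news corresponding to each topic keyword in the preset period is stored in the database for later use.
In the method for automatically discovering hot keywords and hot news provided by the present invention, the proportion, average proportion, proportion standard deviation and heat value of the news corresponding to each topic keyword in a preset period are calculated, so that the current hot keywords can be computed fully automatically and in a timely manner from the massive, disorganized news in the database, and the corresponding hot news can be found from these hot keywords. The whole process requires no manual intervention and does not need to collect or use any user behavior data. This saves labor costs and lowers the threshold for small and medium-sized enterprises and individuals to obtain hot keywords and hot news automatically and in a timely manner.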
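Brief Description of the Drawings
FIG. 1 is a flowchart of the method for automatically discovering hot keywords and hot news provided by an embodiment of the present invention;
FIG. 2 is a trend chart of the proportion of news corresponding to a topic keyword, provided by an embodiment of the present invention;
FIG. 3 shows the hot news corresponding to a hot keyword, provided by an embodiment of the present invention.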
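Detailed Description
Specific embodiments of the present invention are described in more detail below with reference to the schematic drawings. The advantages and features of the present invention will become clearer from the following description. Note that the drawings are highly simplified and not drawn to precise scale; they serve only to illustrate the embodiments of the present invention conveniently and clearly.
In the following, where a method described herein comprises a series of steps, the order in which the steps are presented is not necessarily the only order in which they can be performed; some of the described steps may be omitted and/or other steps not described herein may be added to the method.
Existing methods for discovering hot news fall mainly into two groups: manual editing, and computing hot information from user behavior data. Manual editing is time-consuming, labor-intensive and expensive; computation from user behavior data is out of reach for most companies and individuals, who do not have enough user behavior data to derive current hot information.
It is therefore necessary to provide a method for automatically discovering hot keywords and hot news. As shown in FIG. 1, which is a flowchart of the method provided by an embodiment of the present invention, the method for automatically discovering hot keywords and hot news includes the following steps:
extracting the topic keywords of each news item;
calculating the ratio of the number of news items corresponding to each topic keyword in a preset period to the number of news items added in the preset period, to obtain the proportion of news corresponding to each topic keyword in the preset period;
calculating the average proportion and the proportion standard deviation of each topic keyword over a preset historical time period;
calculating the heat value of each topic keyword from its average proportion and proportion standard deviation over the preset historical time period;
if a heat value is greater than a preset hot threshold, determining that the topic keyword corresponding to that heat value is a hot keyword;
finding the corresponding hot news according to the hot keywords.
By calculating the proportion, average proportion, proportion standard deviation and heat value of the news corresponding to each topic keyword in a preset period, the present invention can compute the current hot keywords fully automatically and in a timely manner from the massive, disorganized news in the database, and find the corresponding hot news from these hot keywords. The whole process requires no manual intervention and does not need to collect or use any user behavior data. This saves labor costs and lowers the threshold for small and medium-sized enterprises and individuals to obtain hot keywords and hot news automatically and in a timely manner.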
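Specifically, in the method for automatically discovering hot keywords and hot news, extracting the topic keywords includes the following steps:
using the TextRank algorithm to obtain the keywords in each news topic;
using a machine learning classifier to classify the obtained keywords;
obtaining topic keywords of different categories.
Typically, the TextRank algorithm and the machine learning classifier are used to extract the topic keywords of each news item from a massive amount of news, which is generally stored in a storage device such as a news information database. Whenever a news item is added to the news information database, the TextRank algorithm and the machine learning classifier can be used to extract the topic keywords of the new item, and the extracted topic keywords are stored in the database as tags of the corresponding news for later use.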
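The keyword-extraction step can be illustrated with a minimal, pure-Python TextRank sketch. It assumes the news text has already been tokenized and stop-word filtered (Chinese text would first need a word segmenter), links words that co-occur within a sliding window, and ranks them with the usual damped PageRank-style iteration. The function name `textrank_keywords`, the window size, the damping factor and the iteration count are illustrative assumptions rather than values from the patent, and the machine learning classifier that follows this step is not shown.

```python
from collections import defaultdict


def textrank_keywords(tokens, top_k=3, window=3, damping=0.85, iterations=50):
    """Rank tokens with a simplified TextRank and return the top_k words."""
    # Build an undirected co-occurrence graph: two distinct words are linked
    # when one appears within the next (window - 1) tokens after the other.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1:i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Damped PageRank-style iteration; every node in the graph has at least
    # one neighbor, so the division below is always defined.
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - damping) + damping * sum(
                scores[n] / len(neighbors[n]) for n in neighbors[word]
            )
            for word in neighbors
        }

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]


# Hypothetical, already-tokenized news text.
tokens = ["douyu", "huya", "merger", "live", "douyu", "streaming",
          "merger", "douyu", "tencent", "huya", "douyu", "deal"]
keywords = textrank_keywords(tokens)
```

The returned keywords would then be passed to a classifier and stored as tags of the news item, as the text describes.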
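Further, the proportion of news corresponding to each topic keyword in the preset period is calculated at a preset frequency, so that the proportion is updated in a timely manner. The preset frequency includes: 30 minutes, 1 hour or 2 hours, and the preset period includes: 1 day, 1 week or 1 month. For example, the preset frequency is preferably 1 hour and the preset period 1 day, that is, the proportion is calculated once every hour using the formula $P = T/N$, where $P$ is the proportion of news corresponding to any topic keyword within 1 day, $T$ is the number of news items corresponding to that topic keyword within 1 day, and $N$ is the number of news items added within 1 day. This gives the proportion of news corresponding to each topic keyword in the preset period, which is stored in the database for later use.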
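A minimal sketch of the proportion calculation $P = T/N$ over a sliding one-day period follows. The record layout (a list of `(published_at, keywords)` pairs), the function name `keyword_proportions` and the sample data are assumptions made for illustration; in the described method the tags would come from the news database.

```python
from collections import Counter
from datetime import datetime, timedelta


def keyword_proportions(news_items, now, period=timedelta(days=1)):
    """Return {keyword: T / N} for news published within `period` before `now`."""
    recent = [kws for ts, kws in news_items if now - period < ts <= now]
    total = len(recent)            # N: news items added in the period
    if total == 0:
        return {}
    counts = Counter()             # T: news items per topic keyword
    for kws in recent:
        counts.update(set(kws))    # count each news item once per keyword
    return {kw: n / total for kw, n in counts.items()}


# Hypothetical tagged news items.
now = datetime(2020, 10, 13, 12, 0)
news = [
    (datetime(2020, 10, 13, 9, 0), ["douyu", "merger"]),
    (datetime(2020, 10, 13, 10, 0), ["stocks"]),
    (datetime(2020, 10, 12, 18, 0), ["douyu"]),
    (datetime(2020, 10, 12, 13, 0), ["stocks", "investment"]),
    (datetime(2020, 10, 11, 8, 0), ["douyu"]),  # outside the 1-day period
]
proportions = keyword_proportions(news, now)
```

Calling the function once per preset frequency (for example hourly) with the current time and storing each result would build up the series of proportions used later for the historical mean and standard deviation.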
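Next, the topic keywords do not all occur with the same probability across the corpus. For example, in financial news, keywords such as "investment" and "stocks" always account for a higher proportion of news than other topic keywords, so hot topics cannot be found simply by sorting keywords by their proportion of news. For this reason, the present invention first calculates the historical distribution, over a preset historical time period, of the proportion of news corresponding to each topic keyword, and then calculates the heat value of each topic keyword's current proportion relative to that historical distribution.
Specifically, the heat value of each topic keyword is calculated from data such as the proportions of news corresponding to each topic keyword in each preset period stored in the database. The heat value is calculated as: $\mathrm{Hot}(w) = (\mathrm{Proportion}(w) - \mathrm{Mean}(w)) / \mathrm{Std}(w)$, where $w$ is the topic keyword whose heat value is to be calculated, $\mathrm{Hot}(w)$ is the heat value of that topic keyword, $\mathrm{Proportion}(w)$ is the current proportion of news corresponding to that topic keyword in the preset period, $\mathrm{Mean}(w)$ is the average proportion of that topic keyword over the preset historical time period, and $\mathrm{Std}(w)$ is the proportion standard deviation of that topic keyword over the preset historical time period.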
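The heat value is a standard score (z-score) of the current proportion against its history. A minimal sketch, with hypothetical input values and an assumed handling of the zero-variance case that the source does not address:

```python
def heat_value(current, mean, std):
    """Hot(w) = (Proportion(w) - Mean(w)) / Std(w)."""
    if std == 0:
        # The source does not say what to do when the history has no
        # variance; returning 0.0 is an assumption of this sketch.
        return 0.0
    return (current - mean) / std


# Hypothetical values: current proportion 0.007, historical mean 0.001,
# historical standard deviation 0.002.
score = heat_value(current=0.007, mean=0.001, std=0.002)
```

Because the score is relative to each keyword's own history, a keyword that is always frequent (such as "stocks" in financial news) does not score highly unless its proportion rises well above its usual level.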
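Further, the average proportion over the preset historical time period is calculated in the same way for every topic keyword, and so is the proportion standard deviation. The average proportion is calculated as: $M = (P_1 + P_2 + \dots + P_n)/n$, where $M$ is the average proportion of any topic keyword over the preset historical time period, $P_1$ to $P_n$ are the proportions of news corresponding to that topic keyword calculated within the preset historical time period (each obtained by calculating, at the preset frequency, the proportion of news corresponding to the topic keyword in the preset period), and $n$ is the number of such proportions within the preset historical time period. The proportion standard deviation is calculated as: $\mathrm{Std} = \sqrt{((P_1 - M)^2 + (P_2 - M)^2 + \dots + (P_n - M)^2)/n}$, where $\mathrm{Std}$ is the proportion standard deviation of any topic keyword over the preset historical time period, $P_1$ to $P_n$ are the proportions defined above, $M$ is the average proportion of that topic keyword over the preset historical time period, and $n$ is the number of such proportions within the preset historical time period. Typically, the preset historical time period includes: 1 month, 1 quarter or 2 quarters; preferably, the preset historical time period is 1 month.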
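The two formulas can be sketched directly. Note that the divisor is $n$, so this is the population standard deviation (what Python's `statistics.pstdev` computes), not the sample standard deviation with divisor $n - 1$. The function name and the sample history are illustrative assumptions.

```python
import math


def mean_and_std(proportions):
    """Return (M, Std) for a non-empty list of historical proportions."""
    n = len(proportions)
    if n == 0:
        raise ValueError("at least one historical proportion is required")
    mean = sum(proportions) / n
    # Population variance: squared deviations from the mean, divided by n.
    variance = sum((p - mean) ** 2 for p in proportions) / n
    return mean, math.sqrt(variance)


# Hypothetical history of proportions for one topic keyword.
history = [0.002, 0.004, 0.004, 0.004, 0.005, 0.005, 0.007, 0.009]
mean, std = mean_and_std(history)
```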
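Generally, for a given topic keyword, if no hot event related to it has occurred, the word frequency of the topic keyword follows its usual distribution; if a hot event related to the topic keyword occurs, the distribution changes. If, in the current period, the word frequency of the topic keyword deviates from the mean by Hot times the standard deviation, then the larger the deviation, the lower the probability that the word frequency comes from the original distribution, that is, the less likely it is that no hot event has occurred and the more likely it is that a hot event related to the topic keyword has occurred. Therefore, the larger the heat value of a topic keyword, the hotter the topic keyword.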
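To make this reasoning concrete, the sketch below computes how unlikely a given heat value would be if the historical proportions were Gaussian. The Gaussian shape is an assumption made here for illustration; the text only says the word frequency follows its usual distribution when no hot event occurs.

```python
import math


def upper_tail_probability(hot):
    """P(Z > hot) for a standard normal variable Z."""
    return 0.5 * math.erfc(hot / math.sqrt(2))


# Tail probabilities at the preset hot thresholds listed in the text.
for threshold in (2.8, 3.0, 3.2):
    print(threshold, upper_tail_probability(threshold))
```

Under that assumption, a heat value above 3.0 would arise by chance in roughly 0.13% of periods, which is why a large heat value is read as evidence of a hot event rather than ordinary fluctuation.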
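Further, after the heat value of each topic keyword has been calculated, the topic keywords are sorted by heat value in descending order, and either the top few topic keywords are taken or a preset hot threshold is used to screen them. With threshold screening, if a heat value is greater than the preset hot threshold, the topic keyword corresponding to that heat value is determined to be a hot keyword. The news corresponding to the hot keywords is then queried in the database, and the news returned by the query is the current hot news. The preset hot threshold includes: 2.8, 3.0 or 3.2; preferably, the preset hot threshold may be 3.0.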
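Both screening options (top few, or threshold) and the final news lookup can be sketched as follows. An in-memory dictionary stands in for the database query, and the function names and all sample values except the 11.24 heat value for "douyu" are hypothetical.

```python
def hot_keywords(heat_values, threshold=3.0, top_k=None):
    """Select hot keywords by rank (top_k) or by a preset hot threshold."""
    ranked = sorted(heat_values.items(), key=lambda kv: kv[1], reverse=True)
    if top_k is not None:
        return [kw for kw, _ in ranked[:top_k]]
    return [kw for kw, score in ranked if score > threshold]


def hot_news(news_tags, keywords):
    """Return ids of news items tagged with at least one hot keyword."""
    wanted = set(keywords)
    return [news_id for news_id, tags in news_tags.items() if wanted & set(tags)]


# Heat values per topic keyword (illustrative apart from "douyu").
heat = {"douyu": 11.24, "stocks": 0.4, "investment": -0.2, "huya": 3.5}
# News id -> topic keyword tags, standing in for the news database.
news_tags = {
    "n1": ["douyu", "huya"],
    "n2": ["stocks"],
    "n3": ["huya"],
    "n4": ["investment"],
}
keywords = hot_keywords(heat)
news = hot_news(news_tags, keywords)
```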
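Preferably, the preset frequency, the preset period, the preset historical time period and the preset hot threshold can all be set according to requirements such as news timeliness and hotspot accuracy.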
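One way to keep these four tunable parameters together is a small configuration object. The sketch below uses the preferred values stated in the text as defaults; the class name, the field names and the approximation of "1 month" as 30 days are assumptions.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class HotspotConfig:
    frequency: timedelta = timedelta(hours=1)  # how often P is recalculated
    period: timedelta = timedelta(days=1)      # window over which T and N are counted
    history: timedelta = timedelta(days=30)    # "1 month", approximated as 30 days
    hot_threshold: float = 3.0                 # heat value above which a keyword is hot

    def history_samples(self) -> int:
        """Approximate n: proportions accumulated over the history window."""
        return int(self.history / self.frequency)


config = HotspotConfig()
samples = config.history_samples()
```

A shorter frequency reacts faster to breaking news at the cost of more computation, while a higher threshold yields fewer but more reliable hot keywords.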
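In one embodiment, the present invention is used to calculate the historical distribution of the proportion of news corresponding to the topic keyword "Douyu" (斗鱼), thereby discovering the hot news of the merger of Douyu Live and Huya Live that occurred on October 13, 2020. As shown in FIG. 2, which is a trend chart of the proportion of news corresponding to the topic keyword provided by the embodiment of the present invention, before October 13, 2020 the proportion of news corresponding to the topic keyword "Douyu" fluctuated basically within 0.001; on October 13, 2020, however, it suddenly surged above 0.007. The heat value Hot["Douyu"] of the topic keyword "Douyu" on October 13, 2020, calculated by the algorithm of the present invention, is 11.24, far exceeding the preset hot threshold of the heat value (within 3.0), which indicates that a hot event concerning the topic keyword "Douyu" has occurred.
News related to "Douyu" is then queried in the database. As shown in FIG. 3, which shows the hot news corresponding to the hot keyword provided by the embodiment of the present invention, a large amount of news on October 13, 2020 concerns the merger of Douyu Live and Huya Live, and the hot news is thereby discovered.
In the method for automatically discovering hot keywords and hot news provided by the present invention, the proportion, average proportion, proportion standard deviation and heat value of the news corresponding to each topic keyword in a preset period are calculated, so that the current hot keywords can be computed fully automatically and in a timely manner from the massive, disorganized news in the database, and the corresponding hot news can be found from these hot keywords. The whole process requires no manual intervention and does not need to collect or use any user behavior data. This saves labor costs and lowers the threshold for small and medium-sized enterprises and individuals to obtain hot keywords and hot news automatically and in a timely manner.
The above is merely a preferred embodiment of the present invention and does not limit the present invention in any way. Any equivalent replacement, modification or other change made by a person skilled in the art to the technical solution and technical content disclosed herein, without departing from the scope of the technical solution of the present invention, does not depart from the content of the technical solution of the present invention and still falls within its scope of protection.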

Claims (10)

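  1. A method for automatically discovering hot keywords and hot news, characterized by comprising the following steps:
    extracting the topic keywords of each news item;
    calculating the ratio of the number of news items corresponding to each topic keyword in a preset period to the number of news items added in the preset period, to obtain the proportion of news corresponding to each topic keyword in the preset period;
    calculating the average proportion and the proportion standard deviation of each topic keyword over a preset historical time period;
    calculating the heat value of each topic keyword from its average proportion and proportion standard deviation over the preset historical time period;
    if a heat value is greater than a preset hot threshold, determining that the topic keyword corresponding to that heat value is a hot keyword;
    finding the corresponding hot news according to the hot keywords.
  2. The method for automatically discovering hot keywords and hot news according to claim 1, characterized in that the heat value is calculated as: $\mathrm{Hot}(w) = (\mathrm{Proportion}(w) - \mathrm{Mean}(w)) / \mathrm{Std}(w)$, where $w$ is the topic keyword whose heat value is to be calculated, $\mathrm{Hot}(w)$ is the heat value of that topic keyword, $\mathrm{Proportion}(w)$ is the current proportion of news corresponding to that topic keyword in the preset period, $\mathrm{Mean}(w)$ is the average proportion of that topic keyword over the preset historical time period, and $\mathrm{Std}(w)$ is the proportion standard deviation of that topic keyword over the preset historical time period.
  3. The method for automatically discovering hot keywords and hot news according to claim 1, characterized in that the average proportion is calculated as:
    $M = (P_1 + P_2 + \dots + P_n)/n$, where $M$ is the average proportion of any topic keyword over the preset historical time period, $P_1$ to $P_n$ are the proportions of news corresponding to that topic keyword calculated within the preset historical time period, and $n$ is the number of such proportions within the preset historical time period.
  4. The method for automatically discovering hot keywords and hot news according to claim 1, characterized in that the proportion standard deviation is calculated as:
    $\mathrm{Std} = \sqrt{((P_1 - M)^2 + (P_2 - M)^2 + \dots + (P_n - M)^2)/n}$, where $\mathrm{Std}$ is the proportion standard deviation of any topic keyword over the preset historical time period, $P_1$ to $P_n$ are the proportions of news corresponding to that topic keyword calculated within the preset historical time period, $M$ is the average proportion of that topic keyword over the preset historical time period, and $n$ is the number of such proportions within the preset historical time period.
  5. The method for automatically discovering hot keywords and hot news according to claim 1, characterized in that the proportion of news corresponding to each topic keyword in the preset period is calculated at a preset frequency, so that the proportion is updated in a timely manner.
  6. The method for automatically discovering hot keywords and hot news according to claim 5, characterized in that
    the preset frequency includes: 30 minutes, 1 hour or 2 hours;
    the preset period includes: 1 day, 1 week or 1 month;
    the preset historical time period includes: 1 month, 1 quarter or 2 quarters.
  7. The method for automatically discovering hot keywords and hot news according to claim 1, characterized in that the preset hot threshold includes: 2.8, 3.0 or 3.2.
  8. The method for automatically discovering hot keywords and hot news according to claim 1, characterized in that extracting the topic keywords includes the following steps:
    using the TextRank algorithm to obtain the keywords in each news topic;
    using a machine learning classifier to classify the obtained keywords;
    obtaining topic keywords of different categories.
  9. The method for automatically discovering hot keywords and hot news according to claim 8, characterized in that the topic keywords of each news item are extracted from a massive amount of news.
  10. The method for automatically discovering hot keywords and hot news according to claim 1, characterized in that
    the extracted topic keywords are stored in a database as tags of the corresponding news for later use;
    the proportion of news corresponding to each topic keyword in the preset period is stored in the database for later use.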
PCT/CN2021/080154 2020-12-28 2021-03-11 Method for automatically discovering hot keywords and hot news WO2022141803A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011580056.0 2020-12-28
CN202011580056.0A CN112597280A (zh) 2020-12-28 2020-12-28 Method for automatically discovering hot keywords and hot news

Publications (1)

Publication Number Publication Date
WO2022141803A1 (zh) 2022-07-07

Family

ID=75202798

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/080154 WO2022141803A1 (zh) 2020-12-28 2021-03-11 Method for automatically discovering hot keywords and hot news

Country Status (2)

Country Link
CN (1) CN112597280A (zh)
WO (1) WO2022141803A1 (zh)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
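CN113127576B (zh) * 2021-04-15 2024-05-24 微梦创科网络科技(中国)有限公司 Hotspot discovery method and system based on user content consumption analysis
CN113127743B (zh) * 2021-05-06 2023-01-10 数库(上海)科技有限公司 Method, apparatus, device and storage medium for calculating and ranking news subject popularity
CN113420093A (zh) * 2021-06-30 2021-09-21 北京小米移动软件有限公司 Hotspot detection method and apparatus, storage server and storage medium
CN113489776A (zh) * 2021-06-30 2021-10-08 北京小米移动软件有限公司 Hotspot detection method and apparatus, monitoring server and storage medium
CN115795175B (zh) * 2023-02-15 2023-04-25 铭台(北京)科技有限公司 Multi-dimensional hotspot extraction method based on data analysis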

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102662965A (zh) * 2012-03-07 2012-09-12 上海引跑信息科技有限公司 Method and system for automatically discovering hot Internet news topics
CN103593444A (zh) * 2013-11-15 2014-02-19 北京国双科技有限公司 Method and apparatus for identifying and processing network keywords
CN107122481A (zh) * 2017-05-04 2017-09-01 成都华栖云科技有限公司 Real-time online prediction method for news popularity
US20180260484A1 (en) * 2017-03-06 2018-09-13 Guangzhou Shenma Mobile Information Technology Co., Ltd. Method, Apparatus, and Device for Generating Hot News
CN111737555A (zh) * 2020-06-18 2020-10-02 苏州朗动网络科技有限公司 Method, device and storage medium for selecting hot keywords

Also Published As

Publication number Publication date
CN112597280A (zh) 2021-04-02

Similar Documents

Publication Publication Date Title
WO2022141803A1 (zh) Method for automatically discovering hot keywords and hot news
US8645385B2 (en) System and method for automating categorization and aggregation of content from network sites
US6389412B1 (en) Method and system for constructing integrated metadata
TWI652584B (zh) Method and apparatus for matching text information and pushing business objects
WO2021175009A1 (zh) Method, apparatus, device and storage medium for constructing an early-warning event graph
US7814089B1 (en) System and method for presenting categorized content on a site using programmatic and manual selection of content items
US8078629B2 (en) Detecting spam documents in a phrase based information retrieval system
EP2192500B1 (en) System and method for providing robust topic identification in social indexes
US20100191742A1 (en) System And Method For Managing User Attention By Detecting Hot And Cold Topics In Social Indexes
CN111026965B (zh) Method and apparatus for tracing hot topics based on a knowledge graph
CN112035658B (zh) Enterprise public opinion monitoring method based on deep learning
CN111008265A (zh) Enterprise information search method and apparatus
CN111506727B (zh) Method, apparatus, computer device and storage medium for obtaining text content categories
Lu et al. How do author-selected keywords function semantically in scientific manuscripts?
CN116541480B (zh) Multi-label-driven thematic data construction method and system
CN111369294B (zh) Software cost estimation method and apparatus
US20060253433A1 (en) Method and apparatus for knowledge-based music searching and method and apparatus for managing music file
JP4375626B2 (ja) Search service system and method for providing input ranking of keywords by category
CN110795613A (zh) Commodity search method, apparatus, system and electronic device
CN111046281A (zh) Method and apparatus for constructing hot topics
US20160246794A1 (en) Method for entity-driven alerts based on disambiguated features
Neubarth et al. Modelling pattern interestingness in comparative music corpus analysis
JP5292336B2 (ja) Device, method and program for estimating the knowledge level of search system users by field
Li et al. A hybrid news recommendation algorithm based on user's browsing path
CN111026990B (zh) Method and apparatus for displaying hot topic log information

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 21912573; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 21912573; Country of ref document: EP; Kind code of ref document: A1)