WO2021027116A1 - Method and apparatus for discovering text hotspot and computer-readable storage medium - Google Patents

Method and apparatus for discovering text hotspot and computer-readable storage medium

Info

Publication number
WO2021027116A1
Authority
WO
WIPO (PCT)
Prior art keywords
data set
text data
feature
feature word
text
Prior art date
Application number
PCT/CN2019/116550
Other languages
French (fr)
Chinese (zh)
Inventor
苏智辉
侯丽
姚飞
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021027116A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to a method, device and computer-readable storage medium for extracting keywords in a text data set to discover text hotspots.
  • This application provides a method, device and computer-readable storage medium for discovering text hotspots, the main purpose of which is to discover text hotspots by extracting keywords in a text data set.
  • a method for discovering text hotspots includes:
  • preprocessing operations including word segmentation, part-of-speech tagging, and removal of heteromorphic words on the original text data set to obtain a primary text data set;
  • the present application also provides a text hotspot discovery device, which includes a memory and a processor.
  • the memory stores a text hotspot discovery program that can run on the processor.
  • when the text hotspot discovery program is executed by the processor, the following steps are implemented:
  • preprocessing operations including word segmentation, part-of-speech tagging, and removal of heteromorphic words on the original text data set to obtain a primary text data set;
  • the present application also provides a computer-readable storage medium on which a text hotspot discovery program is stored.
  • the text hotspot discovery program can be executed by one or more processors to perform the steps of the method for discovering text hotspots described above.
  • This application first crawls real-time text data from a news forum. Preprocessing with accurate word segmentation and part-of-speech tagging effectively extracts the words that may be hot keywords. Converting those words into word vectors then lets a computer analyze them efficiently without losing feature accuracy. Finally, the hot keywords are found by traversal based on feature-similarity calculation, which yields the current text hotspots. The method, device, and computer-readable storage medium proposed in this application can therefore discover text hotspots accurately and efficiently.
  • FIG. 1 is a schematic flowchart of a method for discovering text hotspots according to an embodiment of this application
  • FIG. 2 is a schematic diagram of the internal structure of a text hotspot discovery device provided by an embodiment of this application;
  • FIG. 3 is a schematic diagram of modules of a text hotspot discovery program in a text hotspot discovery device provided by an embodiment of the application.
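  • This application provides a method for discovering text hotspots.
  • Refer to FIG. 1, which is a schematic flowchart of a method for discovering text hotspots according to an embodiment of this application.
  • the method can be executed by a device, and the device can be implemented by software and/or hardware.
  • in this embodiment, the method for discovering text hotspots includes:
  • S1. Crawling an original text data set and a tag set from a news forum website, the tag set recording the publication time of the text in the original text data set.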
  • the crawling can use crawler technology.
  • the crawler first creates a URL queue containing several URLs, then reads the URLs in the queue in turn and resolves each one to an IP address. Finally, it downloads the webpage data specified by the IP address over the HTTP communication protocol and analyzes the webpage data to obtain the original text data set and the tag set.
  • the URL, or uniform resource locator, is a concise representation of the location and access method of a resource on the news forum website, and is also called the address of that resource.
  • the URL is composed of a protocol, hostname, port, path, query string, hash element, etc.
  • the protocol is the protocol used to access the resource or service, such as http, ftp, mailto, or file.
  • the hostname is the fully qualified domain name of the host where the resource is located, such as www.baidu.com.
  • the port is the TCP port number used by the protocol; the usual port for the HTTP communication protocol is 80, which is the default and is generally omitted.
  • the path is the directory/file path name of the resource.
  • the query string is the query string passed in the URL.
  • the hash element is the offset within the file specified by the URL, written as a hash (#) followed by the location of that offset.
  • resolving to an IP address means extracting the protocol, hostname, port, path, etc. from the URL to obtain the IP address.
  • the URLs are generally those of designated news sites, microblogs, and the like, because the webpage data at such URLs contains text data and its release time. The text data is collected into the original text data set, and the publication time of each text in the original text data set is recorded in the tag set.
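The URL components listed above can be illustrated with a minimal Python sketch. It uses only the standard library's `urlsplit`; the helper names `split_url` and `drain_queue`, the `DEFAULT_PORTS` table, and the example URL are assumptions made for illustration and do not come from the application. DNS resolution and the HTTP download are left out so that the sketch runs without network access.

```python
from collections import deque
from urllib.parse import urlsplit

# Default ports for a few schemes, used when the URL omits the port.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}


def split_url(url):
    """Break a URL into the components named in the text."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,
        "hostname": parts.hostname,
        "port": parts.port or DEFAULT_PORTS.get(parts.scheme),
        "path": parts.path,
        "query": parts.query,
        "hash": parts.fragment,
    }


def drain_queue(urls):
    """Read the URLs of a queue in order and yield their components."""
    queue = deque(urls)
    while queue:
        yield split_url(queue.popleft())


# Hypothetical URL, for illustration only.
components = list(drain_queue(["http://www.baidu.com/news/list?page=2#top"]))
```

In a real crawler the hostname would next be resolved to an IP address (for example with `socket.gethostbyname`) before the page is fetched.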
  • word segmentation is performed on the original text data set.
  • the word segmentation uses jieba, a word segmentation tool available for Python, Java and other programming languages.
  • for example, the original text data set contains the text: "Yang Yubin is a well-known entrepreneurial youth who started his own business locally with solid knowledge and hard work."
  • the segmentation result is: [⁇][⁇][one][⁇][⁇][⁇][,][rely on][solid][knowledge][⁇][⁇ Work hard][in][local][start][up][own][career].
  • the part-of-speech tagging uses a pre-built part-of-speech tagging template to tag the nouns and verbs in the segmented original text data set.
  • the part-of-speech tagging template is a recognizer for the characteristics of nouns and verbs; it identifies nouns and verbs by recognizing the characteristics of words.
  • words that are longer than the preset length and contain "⁇" or "⁇" are adjectives or adverbs. For example, for [⁇][⁇][⁇][⁇][Entrepreneurship][Youth][,][Rely on][Solid][Knowledge][and][Diligence][In][Local][Start][⁇][Own][Career], the part-of-speech tagging template identifies the nouns as [⁇], [Knowledge], [Local], and [Career] and the verbs as [Entrepreneurship], [Rely on], and [Start]. It recognizes that the words longer than two characters and containing "⁇" or "⁇" are [⁇], [⁇], and [⁇], and determines that nouns or verbs appear before and after those words, such as [Entrepreneurship], [Rely on], and [Knowledge].
  • the tags can be written as symbols appended to the words, for example [Yang Yubin is a famous adj entrepreneurship v youth n, relying on v solid adj knowledge n and diligent hard work, in the local n starting v own adj business n].
  • the heteromorphic words include English letters, Arabic numerals, Chinese numerals, punctuation marks, stop words, etc.
  • the stop words include words such as "⁇", "⁇", etc.
  • after the heteromorphic words are removed, the example above becomes [famous adj entrepreneurship v youth n relying on v solid adj knowledge n local n starting v own adj business n].
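The filtering part of this preprocessing can be sketched in Python as follows. The sketch assumes the text has already been segmented and tagged (for instance by a tool such as jieba) and works on (word, tag) pairs. The tag names, the stop-word list, and the helper names `is_heteromorphic` and `preprocess` are illustrative assumptions. Because the example is in English, the rule that removes English letters is not applied here.

```python
import string

# Tags kept as candidate hot-keyword features (assumed tag names).
KEPT_TAGS = {"n", "v", "adj"}
# Illustrative stop words; a real list would be language-specific.
STOP_WORDS = {"is", "a", "and", "in", "the", "who"}


def is_heteromorphic(word):
    """True for stop words and for tokens made only of digits or punctuation."""
    if word.lower() in STOP_WORDS:
        return True
    return all(ch.isdigit() or ch in string.punctuation for ch in word)


def preprocess(tagged_tokens):
    """Keep (word, tag) pairs that carry a kept tag and are not heteromorphic."""
    return [
        (word, tag)
        for word, tag in tagged_tokens
        if tag in KEPT_TAGS and not is_heteromorphic(word)
    ]


# Tagged tokens modelled on the example sentence; untagged words carry None.
tagged = [
    ("Yang Yubin", None), ("is", None), ("famous", "adj"),
    ("entrepreneurship", "v"), ("youth", "n"), (",", None),
    ("relying on", "v"), ("solid", "adj"), ("knowledge", "n"),
    ("and", None), ("local", "n"), ("starting", "v"),
    ("own", "adj"), ("business", "n"), ("2019", "n"),
]
primary = preprocess(tagged)
```

The last token shows a numeral being dropped even though it carries a noun tag.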
  • the feature extraction formula uses the following terms:
  • DF_t is the number of texts in the primary text data set in which the feature word t appears;
  • N_c is the total number of text data items in the primary text data set;
  • c is the primary text data set;
  • lg is the logarithm to base 10.
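The formula itself is not reproduced in this text. The terms defined above match the common inverse document frequency, lg(N_c / DF_t), so the following Python sketch assumes that form. It should be read as an illustration of how DF_t, N_c and lg fit together, not as the formula of the application. The function names and the sample data are hypothetical.

```python
import math


def document_frequency(term, texts):
    """DF_t: number of texts in which the term appears at least once."""
    return sum(1 for words in texts if term in words)


def idf(term, texts):
    """lg(N_c / DF_t), with N_c the number of texts (assumed form)."""
    df = document_frequency(term, texts)
    if df == 0:
        raise ValueError(f"{term!r} does not occur in the data set")
    return math.log10(len(texts) / df)


# Ten tokenised texts: every text contains "entrepreneurship",
# only the first contains "knowledge".
texts = [["entrepreneurship", "knowledge"]] + [["entrepreneurship", "business"]] * 9
rare = idf("knowledge", texts)
common = idf("entrepreneurship", texts)
```

A word that appears in every text scores 0, while a word confined to one text in ten scores 1, which is why such a measure favours distinctive words.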
  • converting the feature data set into a feature word vector set includes assuming a weight relationship between the features in the feature data set and the feature word vectors in the feature word vector set, calculating the weights based on that relationship, and completing the conversion.
  • the weight relationship uses the following terms:
  • d is the feature word vector set;
  • t_1, t_2, ..., t_n are the features in the feature data set, such as the aforementioned [Famous], [Venture], etc.;
  • w_1, w_2, ..., w_n are the weights of the corresponding features;
  • f_i is the number of times the feature word appears in the primary text data set;
  • N is the total number of documents in the document collection;
  • N_j is the total number of feature words in the primary text data set;
  • N_i is the number of occurrences of the feature word i in the primary text data set;
  • F_m is a weighting factor whose value is generally less than 1.
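The weight formula is likewise missing from this text. The sketch below therefore makes several assumptions: d is represented as a list of (t_i, w_i) pairs, each weight is a term frequency f_i / N_j multiplied by lg(N / N_i), f_i and N_j are counted within a single text, N_i is taken as the number of texts containing the word, and F_m is applied as a plain multiplier. None of these choices is confirmed by the source, and the name `feature_vector` is hypothetical.

```python
import math
from collections import Counter


def feature_vector(text, texts, f_m=1.0):
    """Return d = [(t_1, w_1), ..., (t_n, w_n)] for one tokenised text.

    `text` is expected to be one of the entries of `texts`, so that every
    word occurs in at least one text.
    """
    counts = Counter(text)        # f_i for each feature word
    total = sum(counts.values())  # N_j, feature words in this text
    n_texts = len(texts)          # N
    vector = []
    for term, f_i in sorted(counts.items()):
        n_i = sum(1 for other in texts if term in other)  # texts containing the term
        weight = f_m * (f_i / total) * math.log10(n_texts / n_i)
        vector.append((term, weight))
    return vector


texts = [["famous", "venture", "venture"]] + [["famous", "youth"]] * 9
d = feature_vector(texts[0], texts, f_m=0.5)
```

Here "famous" occurs in every text and receives weight 0, while "venture" occurs twice in a single text and receives a positive weight.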
  • the similarity calculation uses the following terms:
  • sim(d, t) is the similarity between the feature word vectors d and t;
  • w denotes the weight coefficients of the feature word vectors d and t and of the other feature word vectors k in the feature word vector set;
  • n is the total number of feature word vectors in the feature word vector set.
  • the time distance function T uses the following terms:
  • t_d is the publishing time, recorded in the tag set, of the text in which the feature word vector d is located;
  • t_Ts is the earliest publishing time of the text data in the tag set;
  • t_Te is the latest publishing time in the tag set.
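Since neither formula survives in this text, the following Python sketch substitutes two widely used forms: cosine similarity between sparse weight vectors, and a time distance that scales t_d linearly between t_Ts and t_Te. Both are assumptions, as is the way they would be combined, which is why the sketch keeps them separate. The function names are hypothetical. The `top_k` helper illustrates the sorting and selection of a specified number of entries.

```python
import math


def cosine_similarity(d, t):
    """Cosine of the angle between two sparse weight vectors given as dicts."""
    dot = sum(w * t.get(term, 0.0) for term, w in d.items())
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_t = math.sqrt(sum(w * w for w in t.values()))
    if norm_d == 0.0 or norm_t == 0.0:
        return 0.0
    return dot / (norm_d * norm_t)


def time_distance(t_d, t_ts, t_te):
    """Position of t_d within [t_ts, t_te], scaled to [0, 1] (assumed form)."""
    if t_te == t_ts:
        return 0.0
    return (t_d - t_ts) / (t_te - t_ts)


def top_k(similarities, k):
    """Sort (label, similarity) pairs in descending order and keep the first k."""
    return sorted(similarities, key=lambda pair: pair[1], reverse=True)[:k]


same_direction = cosine_similarity({"a": 1.0, "b": 0.0}, {"a": 2.0})
orthogonal = cosine_similarity({"a": 1.0}, {"b": 1.0})
midpoint = time_distance(5, 0, 10)
best = top_k([("x", 0.2), ("y", 0.9), ("z", 0.5)], 2)
```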
  • for example, the original text data set includes "Yang Yubin is a well-known entrepreneurial youth who started his own business locally with solid knowledge and hard work."
  • the hot keyword of the original text data set is then "Entrepreneurship".
  • this application also provides a text hotspot discovery device.
  • Refer to FIG. 2, which is a schematic diagram of the internal structure of a text hotspot discovery device provided by an embodiment of this application.
  • the text hotspot discovery apparatus 1 may be a PC (Personal Computer, personal computer), or a terminal device such as a smart phone, a tablet computer, or a portable computer, or a server.
  • the text hotspot discovery device 1 at least includes a memory 11, a processor 12, a communication bus 13, and a network interface 14.
  • the memory 11 includes at least one type of readable storage medium, and the readable storage medium includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory, etc.), magnetic memory, magnetic disk, optical disk, etc.
  • the memory 11 may be an internal storage unit of the text hotspot discovery device 1, for example, the hard disk of the text hotspot discovery device 1.
  • the memory 11 may also be an external storage device of the text hotspot discovery device 1, such as a plug-in hard disk equipped on the text hotspot discovery device 1, a smart media card (SMC), and a secure digital (Secure Digital, SD) card, flash card (Flash Card), etc.
  • the memory 11 may also include both the internal storage unit of the text hotspot discovery apparatus 1 and an external storage device.
  • the memory 11 can be used not only to store application software installed in the text hotspot discovery device 1 and various data, such as the code of the text hotspot discovery program 01, but also to temporarily store data that has been output or will be output.
  • the processor 12 may be a central processing unit (CPU), controller, microcontroller, microprocessor, or other data processing chip, and is used to run the program code stored in the memory 11 or to process data, for example to execute the text hotspot discovery program 01.
  • the communication bus 13 is used to realize the connection and communication between these components.
  • the network interface 14 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface), and is usually used to establish a communication connection between the device 1 and other electronic devices.
  • the device 1 may also include a user interface.
  • the user interface may include a display (Display) and an input unit such as a keyboard (Keyboard).
  • optionally, the user interface may also include a standard wired interface and a wireless interface.
  • the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, etc.
  • the display may also be called a display screen or a display unit, and is used to display the information processed in the text hotspot discovery device 1 and to display a visualized user interface.
  • FIG. 2 shows only the text hotspot discovery device 1 with components 11-14 and the text hotspot discovery program 01. Those skilled in the art will understand that the structure shown in FIG. 2 does not limit the text hotspot discovery device 1, which may include fewer or more components than shown, a combination of certain components, or a different arrangement of components.
  • the memory 11 stores a text hotspot discovery program 01; when the processor 12 executes the text hotspot discovery program 01 stored in the memory 11, the following steps are implemented:
  • Step 1: Crawl the original text data set and the tag set from the news forum website; the tag set records the publication time of the text in the original text data set.
  • Step 2: Perform preprocessing operations including word segmentation, part-of-speech tagging, and removal of heteromorphic words on the original text data set to obtain a primary text data set.
  • Step 3: Perform feature extraction on the primary text data set based on the tag set to obtain a feature data set, and convert the feature data set into a feature word vector set.
  • Step 4: Calculate the similarity between the features in the feature word vector set to obtain a similarity set, sort the similarity set, select a specified number of feature word vectors from the sorted similarity set, find hot keywords based on the specified number of feature word vectors, and output the hotspots of the original text data set.
  • the text hotspot discovery program can also be divided into one or more modules, which are stored in the memory 11 and executed by one or more processors (in this embodiment, by the processor 12) to carry out this application.
  • a module in this application is a series of computer program instruction segments that can complete a specific function, and is used to describe the execution of the text hotspot discovery program in the text hotspot discovery device.
  • Refer to FIG. 3, which is a schematic diagram of the program modules of the text hotspot discovery program in an embodiment of the text hotspot discovery device of this application.
  • in this embodiment, the text hotspot discovery program can be divided into a data receiving module 10, a data processing module 20, a word vector conversion module 30, and a text hotspot output module 40. By way of example:
  • the data receiving module 10 is used to crawl an original text data set and a tag set from a news forum website, and the tag set records the publication time of the text in the original text data set.
  • the data processing module 20 is configured to: perform preprocessing operations including word segmentation, part-of-speech tagging, and removal of heteromorphic words on the original text data set to obtain a primary text data set;
  • the word vector conversion module 30 is configured to perform a feature extraction operation on the primary text data set based on the tag set to obtain a feature data set, and convert the feature data set into a feature word vector set.
  • the text hotspot output module 40 is configured to calculate the similarity between the features in the feature word vector set to obtain a similarity set, sort the similarity set, select a specified number of feature word vectors from the sorted similarity set, find hot keywords based on the specified number of feature word vectors, and output the hotspots of the original text data set.
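The division into four modules can be pictured as a simple chain of stages. The Python sketch below is one possible arrangement, not the program 01 itself: the class name `HotspotPipeline`, its field names, and the stub stages used in the example are all hypothetical, and each stage is supplied by the caller.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Texts = List[str]
Times = List[float]


@dataclass
class HotspotPipeline:
    """Chains the four modules; each stage is supplied by the caller."""
    receive: Callable[[], Tuple[Texts, Times]]           # data receiving module 10
    preprocess: Callable[[Texts], List[List[str]]]       # data processing module 20
    vectorize: Callable[[List[List[str]], Times], list]  # word vector conversion module 30
    output: Callable[[list], List[str]]                  # text hotspot output module 40

    def run(self) -> List[str]:
        raw_texts, tags = self.receive()
        primary = self.preprocess(raw_texts)
        vectors = self.vectorize(primary, tags)
        return self.output(vectors)


# Stub stages, for illustration only: the "vector" is just the flat word list
# and the "hotspot" is the most frequent word.
pipeline = HotspotPipeline(
    receive=lambda: (["entrepreneurship youth", "entrepreneurship business"], [1.0, 2.0]),
    preprocess=lambda texts: [t.split() for t in texts],
    vectorize=lambda primary, tags: [w for words in primary for w in words],
    output=lambda words: [max(set(words), key=words.count)],
)
hotspots = pipeline.run()
```

Keeping the stages as separate callables mirrors the module boundaries, so any one stage can be replaced without touching the others.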
  • an embodiment of the present application also proposes a computer-readable storage medium that stores a text hotspot discovery program, and the text hotspot discovery program can be executed by one or more processors to implement the following operations:
  • preprocessing operations including word segmentation, part-of-speech tagging, and removal of heteromorphic words on the original text data set to obtain a primary text data set;

Abstract

The present application relates to artificial intelligence technology, and disclosed is a method for discovering a text hotspot, comprising: receiving an original text data set and a tag set, the tag set recording the publication time of text in the original text data set; performing a preprocessing operation comprising word segmentation, part-of-speech tagging, and heteromorphic word removal on the original text data set to obtain a primary text data set; performing a feature extraction operation on the primary text data set on the basis of the tag set to obtain a feature data set, and converting the feature data set into a feature word vector set; calculating the similarity between features in the feature word vector set to obtain a similarity set; selecting a specified quantity of feature word vectors from within the similarity set; and discovering hotspot keywords on the basis of the specified quantity of feature word vectors and outputting a hotspot. Further proposed by the present application are an apparatus for discovering a text hotspot and a computer-readable storage medium. The present application may achieve the function of accurately and efficiently discovering a text hotspot.

Description

Method, device and computer-readable storage medium for discovering text hotspots
Under the Paris Convention, this application claims priority to the Chinese patent application No. CN201910768143.X, filed on August 15, 2019 and titled "Method, device and computer-readable storage medium for discovering text hotspots", the entire content of which is incorporated into this application by reference.
Technical field
This application relates to the field of artificial intelligence technology, and in particular to a method, device and computer-readable storage medium for extracting keywords in a text data set to discover text hotspots.
Background
With the rapid development of Internet technology, major portal websites have emerged, and most of them have become the main channels through which people obtain information. However, the complexity and redundancy of the network, together with the speed at which content is updated and spread, make it difficult for people to obtain the key information they need quickly and accurately, and also hinder the monitoring of online public opinion. Timely discovery of hot keywords on the network has therefore become a focus of current research. Text clustering algorithms based on single-pass are widely used to discover hot keywords on the network because they are simple to implement, have low time and space complexity, and cluster well. The single-pass algorithm has limitations, however: similar keywords are matched and classified only against an empirical threshold, which not only makes topic analysis of each piece of text data on the network slow but, because the amount of data and the processing time are in an exponential positive correlation, further affects accuracy.
Summary of the invention
This application provides a method, device and computer-readable storage medium for discovering text hotspots, the main purpose of which is to discover text hotspots by extracting keywords in a text data set.
To achieve the above purpose, a method for discovering text hotspots provided by this application includes:
crawling an original text data set and a tag set from a news forum website, the tag set recording the publication time of the text in the original text data set;
performing preprocessing operations including word segmentation, part-of-speech tagging, and removal of heteromorphic words on the original text data set to obtain a primary text data set;
performing a feature extraction operation on the primary text data set based on the tag set to obtain a feature data set, and converting the feature data set into a feature word vector set;
calculating the similarity between the features in the feature word vector set to obtain a similarity set, sorting the similarity set, selecting a specified number of feature word vectors from the sorted similarity set, finding hot keywords based on the specified number of feature word vectors, and outputting the hotspots of the original text data set.
In addition, to achieve the above purpose, this application also provides a text hotspot discovery device, which includes a memory and a processor. The memory stores a text hotspot discovery program that can run on the processor, and when the text hotspot discovery program is executed by the processor, the following steps are implemented:
crawling an original text data set and a tag set from a news forum website, the tag set recording the publication time of the text in the original text data set;
performing preprocessing operations including word segmentation, part-of-speech tagging, and removal of heteromorphic words on the original text data set to obtain a primary text data set;
performing a feature extraction operation on the primary text data set based on the tag set to obtain a feature data set, and converting the feature data set into a feature word vector set;
calculating the similarity between the features in the feature word vector set to obtain a similarity set, sorting the similarity set, selecting a specified number of feature word vectors from the sorted similarity set, finding hot keywords based on the specified number of feature word vectors, and outputting the hotspots of the original text data set.
In addition, to achieve the above purpose, this application also provides a computer-readable storage medium on which a text hotspot discovery program is stored. The text hotspot discovery program can be executed by one or more processors to implement the steps of the method for discovering text hotspots described above.
This application first crawls real-time text data from a news forum. Preprocessing with accurate word segmentation and part-of-speech tagging effectively extracts the words that may be hot keywords. Converting those words into word vectors then lets a computer analyze them efficiently without losing feature accuracy. Finally, the hot keywords are found by traversal based on feature-similarity calculation, which yields the current text hotspots. The method, device, and computer-readable storage medium proposed in this application can therefore discover text hotspots accurately and efficiently.
Description of the drawings
FIG. 1 is a schematic flowchart of a method for discovering text hotspots according to an embodiment of this application;
FIG. 2 is a schematic diagram of the internal structure of a text hotspot discovery device according to an embodiment of this application;
FIG. 3 is a schematic diagram of the modules of the text hotspot discovery program in the text hotspot discovery device according to an embodiment of this application.
The realization of the objectives, the functional characteristics, and the advantages of this application are further described below in conjunction with the embodiments and with reference to the accompanying drawings.
Detailed description
It should be understood that the specific embodiments described here only explain this application and do not limit it.
This application provides a method for discovering text hotspots. FIG. 1 is a schematic flowchart of the method according to an embodiment of this application. The method can be executed by a device, and the device can be implemented in software and/or hardware.
In this embodiment, the method for discovering text hotspots includes:
S1. Crawl an original text data set and a tag set from a news forum website, where the tag set records the publication time of each text in the original text data set.
Preferably, the crawling uses crawler technology: first create a URL queue containing several URLs, then read the URLs in the queue one by one and resolve each to an IP address, and finally download the web page data at that IP address over the HTTP communication protocol and analyze the page data to obtain the original text data set and the tag set.
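The crawl loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: `crawl`, `fetch_over_http`, and the site-specific `parse_page` callable are hypothetical names, and hostname-to-IP resolution is left to the standard library's `urlopen` rather than performed explicitly.

```python
from collections import deque
from urllib.parse import urlsplit
from urllib.request import urlopen


def fetch_over_http(url, timeout=10):
    """Download one page over HTTP(S) and return its body as text."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def crawl(seed_urls, parse_page, fetch=fetch_over_http):
    """Read URLs from a queue in order and collect texts and publication times.

    parse_page is a hypothetical, site-specific callable that turns a raw
    page into a (text, publication_time) pair.
    """
    queue = deque(seed_urls)      # the URL queue
    texts, tags = [], []          # original text data set, tag set
    while queue:
        url = queue.popleft()
        if urlsplit(url).scheme not in ("http", "https"):
            continue              # only HTTP(S) pages are downloaded
        text, published = parse_page(fetch(url))
        texts.append(text)
        tags.append(published)
    return texts, tags
```

Passing the fetch function as a parameter keeps the queue logic testable without network access; a real crawler would also need error handling and rate limiting, which are omitted here.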
Preferably, the URL, or uniform resource locator, is a concise representation of the location of a resource on the news forum website and of the method used to access it; it is also called the address of the resource. A URL consists of a protocol, a hostname, a port, a path, a query string, a hash element, and so on. The protocol is the protocol used to access the resource or service, for example http, ftp, mailto, or file. The hostname is the fully qualified domain name of the host where the resource is located, for example www.baidu.com. The port is the TCP port number used by the protocol; the usual port for the HTTP communication protocol is 80, which is the default and is generally omitted. The path is the directory/file path name of the resource. The query string is the query string passed in the URL. The hash element indicates an offset within the file specified by the URL and consists of the hash sign (#) followed by the position associated with that offset.
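As an illustration of these components, Python's standard `urllib.parse.urlsplit` splits a URL into the same parts. The URL below is hypothetical and chosen only so that every component is present.

```python
from urllib.parse import urlsplit

# Hypothetical URL used only for illustration.
url = "http://www.baidu.com:80/news/list.html?page=2#top"
parts = urlsplit(url)

print(parts.scheme)    # protocol
print(parts.hostname)  # hostname
print(parts.port)      # port
print(parts.path)      # path
print(parts.query)     # query string
print(parts.fragment)  # position after the hash sign
```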
Further, resolving to an IP address means extracting the protocol, hostname, port, path, and so on from the URL to obtain the IP address.
Preferably, the URLs are generally designated URLs of news sites, microblogs, and the like, because the web page data at such URLs contains both text data and publication times. The text data is assembled into the original text data set, and the publication times of the text data are stored in the tag set.
S2. Perform preprocessing operations, including word segmentation, part-of-speech tagging, and stop-word removal, on the original text data set to obtain a primary text data set.
Preferably, because written Chinese has no explicit separator between words, the original text data set must be segmented into words. The segmentation uses the jieba word segmenter, which is available for programming languages such as Python and Java. For example, suppose the original text data set contains the text "杨宇彬是一位有名的创业青年,靠着扎实的知识和勤劳实干在当地开始了自己的事业" ("Yang Yubin is a well-known young entrepreneur who started his own business locally through solid knowledge and hard work"). Segmenting it with jieba gives: [杨宇彬][是][一位][有名的][创业][青年][,][靠着][扎实的][知识][和][勤劳实干][在][当地][开始][了][自己的][事业].
Further, the part-of-speech tagging marks the nouns and verbs in the segmented original text data set using a pre-built part-of-speech tagging template. The template is a recognizer of noun and verb features: it determines whether a word is a noun or a verb by recognizing the features of the word. For the segmented sentence above, the template marks [青年], [知识], [当地], and [事业] as nouns and [创业], [靠着], and [开始] as verbs.
Next, search the original text data set for words that are longer than a preset length, for example two characters, and that contain "的" or "地", and check whether the words before and after each such word in the text are nouns or verbs. If they are, the word is an adjective or adverb. In the example above, the template first identifies the nouns [青年], [知识], [当地], and [事业] and the verbs [创业], [靠着], and [开始]. The words longer than two characters that contain "的" or "地" are [有名的], [扎实的], and [自己的]; nouns or verbs such as [创业], [靠着], and [知识] are found adjacent to them, so they are marked as adjectives or adverbs. Preferably, the marks are written as tag symbols after the words, for example: [杨宇彬是一位有名的/adj 创业/v 青年/n,靠着/v 扎实的/adj 知识/n 和勤劳实干在当地/n 开始/v 了自己的/adj 事业/n].
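The adjective/adverb rule can be sketched as below. This is an illustrative reading, not the applicant's code: `tag_modifiers` is a hypothetical helper, the noun/verb dictionary stands in for the output of the part-of-speech tagging template, and the sketch assumes that one adjacent noun or verb is enough to satisfy the rule, since that is what the worked example requires.

```python
def tag_modifiers(tokens, noun_verb_tags, min_length=2):
    """Tag words as adjective/adverb using the length and neighbour rule.

    tokens: segmented words in sentence order.
    noun_verb_tags: dict mapping a word to "n" or "v" (assumed to come
    from the part-of-speech tagging template).
    """
    tagged = []
    for i, word in enumerate(tokens):
        tag = noun_verb_tags.get(word)
        is_candidate = (
            tag is None
            and len(word) > min_length
            and ("的" in word or "地" in word)
        )
        if is_candidate:
            # Words immediately before and after the candidate.
            neighbours = tokens[max(i - 1, 0):i] + tokens[i + 1:i + 2]
            if any(n in noun_verb_tags for n in neighbours):
                tag = "adj/adv"
        tagged.append((word, tag))
    return tagged


tokens = ["杨宇彬", "是", "一位", "有名的", "创业", "青年", ",", "靠着",
          "扎实的", "知识", "和", "勤劳实干", "在", "当地", "开始", "了",
          "自己的", "事业"]
noun_verb = {"青年": "n", "知识": "n", "当地": "n", "事业": "n",
             "创业": "v", "靠着": "v", "开始": "v"}
tagged = tag_modifiers(tokens, noun_verb)
```

With the example sentence, the words tagged `adj/adv` are 有名的, 扎实的, and 自己的, matching the text; 当地 contains "地" but is only two characters long and is already tagged as a noun, so the rule leaves it alone.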
Further, the heteromorphic words include all English letters, Arabic numerals, Chinese numerals, punctuation marks, stop words, and so on, where the stop words include words such as "了" and "于". Removing the heteromorphic words from the example above gives: [有名的/adj 创业/v 青年/n 靠着/v 扎实的/adj 知识/n 当地/n 开始/v 自己的/adj 事业/n].
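A filter for the token categories listed above might look like the following sketch. The stop-word list, the numeral set, and the punctuation set are assumptions made for illustration (only "了" and "于" come from the text), and the sketch does not attempt to reproduce the exact example result, which also depends on the part-of-speech tags.

```python
import re
import string

# "了" and "于" come from the text; the other entries are assumed.
STOP_WORDS = {"了", "于", "是", "和", "在"}
CHINESE_NUMERALS = set("零一二三四五六七八九十百千万亿两")
PUNCTUATION = set(string.punctuation) | set(",。!?;:、“”‘’()《》")


def is_heteromorphic(token):
    """Return True for stop words, Latin letters/digits, numerals, punctuation."""
    if token in STOP_WORDS:
        return True
    if re.fullmatch(r"[A-Za-z0-9]+", token):
        return True
    if all(ch in CHINESE_NUMERALS for ch in token):
        return True
    return all(ch in PUNCTUATION for ch in token)


def remove_heteromorphic(tokens):
    """Keep only the tokens that are not heteromorphic words."""
    return [t for t in tokens if not is_heteromorphic(t)]
```

For instance, `remove_heteromorphic(["有名的", "创业", ",", "了", "2019", "AI", "三十", "青年"])` keeps only 有名的, 创业, and 青年.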
S3. Perform feature extraction on the primary text data set based on the tag set to obtain a feature data set, and convert the feature data set into a feature word vector set.
Preferably, the feature extraction is:
Figure PCTCN2019116550-appb-000001
Here, DF_t is the number of texts in the primary text data set in which the feature word t appears, N_c is the total number of data items in the primary text data set, c is the primary text data set, and lg is the base-10 logarithm. For example, in [有名的/adj 创业/v 青年/n 靠着/v 扎实的/adj 知识/n 当地/n 开始/v 自己的/adj 事业/n] above, [有名的], [创业], and so on are all feature words.
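The quantity DF_t defined above can be computed as in the sketch below. The extraction formula itself is available here only as an image reference, so how DF_t, N_c, and lg are combined is not known from the text; the `lg_ratio` helper uses the common inverse-document-frequency form lg(N_c / DF_t) purely as an assumption, and both function names are hypothetical.

```python
import math


def document_frequency(documents):
    """Map each feature word t to DF_t, the number of texts containing it."""
    df = {}
    for words in documents:
        for word in set(words):   # count each text at most once per word
            df[word] = df.get(word, 0) + 1
    return df


def lg_ratio(df_t, n_c):
    """Assumed IDF-style score lg(N_c / DF_t); not confirmed by the source."""
    return math.log10(n_c / df_t)


# Hypothetical primary text data set with three segmented texts.
documents = [["创业", "青年"], ["创业", "知识"], ["事业"]]
df = document_frequency(documents)
n_c = len(documents)
```

Here 创业 appears in two of the three texts, so its DF_t is 2, while 事业 has a DF_t of 1.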
Further, converting the feature data set into the feature word vector set includes positing a weight relationship between the features in the feature data set and the feature word vectors in the feature word vector set, and then calculating the weights based on that relationship, which completes the conversion.
Specifically, the weight relationship is:
d = {(t_1, w_1), (t_2, w_2), ..., (t_i, w_i), ..., (t_n, w_n)}
Here, d is the feature word vector set; t_1, t_2, ..., t_n are the features in the feature data set, such as [有名的] and [创业] above; and w_1, w_2, ..., w_n are the weights of the corresponding features.
Further, the weights are calculated as:
Figure PCTCN2019116550-appb-000002
Here, f_i is the number of times the feature word appears in the primary text data set, N is the total number of documents in the document collection, N_j is the total number of feature words in the primary text data set, N_i is the number of occurrences of feature word i in the primary text data set, and F_m is a weighting factor that generally takes a value less than 1.
S4. Calculate the similarity between the features in the feature word vector set to obtain a similarity set, sort the similarity set, select a specified number of feature word vectors from the sorted similarity set, discover hotspot keywords based on the specified number of feature word vectors, and output the hotspots of the original text data set.
Preferably, the similarity is calculated as:
Figure PCTCN2019116550-appb-000003
Here, sim(d, t) is the similarity between feature word vectors d and t, w is the weight coefficient between the feature word vectors d, t and the other feature word vectors k in the feature word vector set, n is the total number of data items in the feature word vector set, α and β are bias coefficients with α + β = 1, and T is the time distance function.
Further, the time distance function T is:
Figure PCTCN2019116550-appb-000004
Here, t_d is the publication time, recorded in the tag set, of the text containing the feature word vector d, t_Ts is the earliest publication time of the text data in the tag set, and t_Te is the latest publication time in the tag set.
Preferably, the similarity set is traversed and sorted in descending order of similarity, and the feature word vectors with the highest similarity are selected to obtain the final feature words. For example, if the original text data set includes text such as "杨宇彬是一位有名的创业青年,靠着扎实的知识和勤劳实干在当地开始了自己的事业", analysis by the method of this application yields "创业" (entrepreneurship) as the text hotspot keyword of the original text data set.
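The sort-and-select step can be sketched as follows, assuming the similarity set is held as a mapping from feature word to similarity score. The function name and the scores in the example are invented for illustration; only the descending sort and the selection of a specified number of entries come from the text.

```python
def top_feature_words(similarities, count):
    """Return the `count` feature words with the highest similarity.

    similarities: dict mapping a feature word to its similarity score.
    """
    ranked = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:count]]


# Hypothetical scores, not taken from the application.
scores = {"创业": 0.9, "青年": 0.4, "知识": 0.7}
hot_keywords = top_feature_words(scores, 2)
```

With these made-up scores the two selected keywords are 创业 and 知识, in that order.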
This application also provides a device for discovering text hotspots. FIG. 2 is a schematic diagram of the internal structure of the text hotspot discovery device according to an embodiment of this application.
In this embodiment, the text hotspot discovery device 1 may be a PC (personal computer), a terminal device such as a smartphone, tablet computer, or portable computer, or a server. The text hotspot discovery device 1 includes at least a memory 11, a processor 12, a communication bus 13, and a network interface 14.
The memory 11 includes at least one type of readable storage medium, such as flash memory, a hard disk, a multimedia card, card-type memory (for example, SD or DX memory), magnetic memory, a magnetic disk, or an optical disc. In some embodiments, the memory 11 may be an internal storage unit of the text hotspot discovery device 1, for example its hard disk. In other embodiments, the memory 11 may be an external storage device of the text hotspot discovery device 1, for example a plug-in hard disk, a smart media card (SMC), a Secure Digital (SD) card, or a flash card fitted to the device. Further, the memory 11 may include both an internal storage unit and an external storage device of the text hotspot discovery device 1. The memory 11 can be used not only to store the application software installed on the text hotspot discovery device 1 and various data, such as the code of the text hotspot discovery program 01, but also to temporarily store data that has been output or is about to be output.
In some embodiments, the processor 12 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. It runs the program code stored in the memory 11 or processes data, for example by executing the text hotspot discovery program 01.
The communication bus 13 provides connection and communication between these components.
The network interface 14 may optionally include a standard wired interface and a wireless interface (such as a Wi-Fi interface), and is usually used to establish a communication connection between the device 1 and other electronic devices.
Optionally, the device 1 may further include a user interface. The user interface may include a display and an input unit such as a keyboard, and may optionally also include a standard wired interface and a wireless interface. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (organic light-emitting diode) touch device, or the like. The display may also be called a display screen or display unit, and is used to display the information processed in the text hotspot discovery device 1 and to display a visual user interface.
FIG. 2 shows only the text hotspot discovery device 1 with components 11 to 14 and the text hotspot discovery program 01. Those skilled in the art will understand that the structure shown in FIG. 2 does not limit the text hotspot discovery device 1, which may include fewer or more components than shown, combine certain components, or arrange the components differently.
In the embodiment of the device 1 shown in FIG. 2, the memory 11 stores the text hotspot discovery program 01, and the processor 12 implements the following steps when executing the text hotspot discovery program 01 stored in the memory 11:
Step 1. Crawl an original text data set and a tag set from a news forum website, where the tag set records the publication time of each text in the original text data set.
Preferably, the crawling uses crawler technology: first create a URL queue containing several URLs, then read the URLs in the queue one by one and resolve each to an IP address, and finally download the web page data at that IP address over the HTTP communication protocol and analyze the page data to obtain the original text data set and the tag set.
Preferably, the URL, or uniform resource locator, is a concise representation of the location of a resource on the news forum website and of the method used to access it; it is also called the address of the resource. A URL consists of a protocol, a hostname, a port, a path, a query string, a hash element, and so on. The protocol is the protocol used to access the resource or service, for example http, ftp, mailto, or file. The hostname is the fully qualified domain name of the host where the resource is located, for example www.baidu.com. The port is the TCP port number used by the protocol; the usual port for the HTTP communication protocol is 80, which is the default and is generally omitted. The path is the directory/file path name of the resource. The query string is the query string passed in the URL. The hash element indicates an offset within the file specified by the URL and consists of the hash sign (#) followed by the position associated with that offset.
Further, resolving to an IP address means extracting the protocol, hostname, port, path, and so on from the URL to obtain the IP address.
Preferably, the URLs are generally designated URLs of news sites, microblogs, and the like, because the web page data at such URLs contains both text data and publication times. The text data is assembled into the original text data set, and the publication times of the text data are stored in the tag set.
Step 2. Perform preprocessing operations, including word segmentation, part-of-speech tagging, and stop-word removal, on the original text data set to obtain a primary text data set.
Preferably, because written Chinese has no explicit separator between words, the original text data set must be segmented into words. The segmentation uses the jieba word segmenter, which is available for programming languages such as Python and Java. For example, suppose the original text data set contains the text "杨宇彬是一位有名的创业青年,靠着扎实的知识和勤劳实干在当地开始了自己的事业" ("Yang Yubin is a well-known young entrepreneur who started his own business locally through solid knowledge and hard work"). Segmenting it with jieba gives: [杨宇彬][是][一位][有名的][创业][青年][,][靠着][扎实的][知识][和][勤劳实干][在][当地][开始][了][自己的][事业].
Further, the part-of-speech tagging marks the nouns and verbs in the segmented original text data set using a pre-built part-of-speech tagging template. The template is a recognizer of noun and verb features: it determines whether a word is a noun or a verb by recognizing the features of the word. For the segmented sentence above, the template marks [青年], [知识], [当地], and [事业] as nouns and [创业], [靠着], and [开始] as verbs.
Next, search the original text data set for words that are longer than a preset length, for example two characters, and that contain "的" or "地", and check whether the words before and after each such word in the text are nouns or verbs. If they are, the word is an adjective or adverb. In the example above, the template first identifies the nouns [青年], [知识], [当地], and [事业] and the verbs [创业], [靠着], and [开始]. The words longer than two characters that contain "的" or "地" are [有名的], [扎实的], and [自己的]; nouns or verbs such as [创业], [靠着], and [知识] are found adjacent to them, so they are marked as adjectives or adverbs. Preferably, the marks are written as tag symbols after the words, for example: [杨宇彬是一位有名的/adj 创业/v 青年/n,靠着/v 扎实的/adj 知识/n 和勤劳实干在当地/n 开始/v 了自己的/adj 事业/n].
Further, the heteromorphic words include all English letters, Arabic numerals, Chinese numerals, punctuation marks, stop words, and so on, where the stop words include words such as "了" and "于". Removing the heteromorphic words from the example above gives: [有名的/adj 创业/v 青年/n 靠着/v 扎实的/adj 知识/n 当地/n 开始/v 自己的/adj 事业/n].
Step 3. Perform feature extraction on the primary text data set based on the tag set to obtain a feature data set, and convert the feature data set into a feature word vector set.
Preferably, the feature extraction is:
Figure PCTCN2019116550-appb-000005
Here, DF_t is the number of texts in the primary text data set in which the feature word t appears, N_c is the total number of data items in the primary text data set, c is the primary text data set, and lg is the base-10 logarithm. For example, in [有名的/adj 创业/v 青年/n 靠着/v 扎实的/adj 知识/n 当地/n 开始/v 自己的/adj 事业/n] above, [有名的], [创业], and so on are all feature words.
Further, converting the feature data set into the feature word vector set includes positing a weight relationship between the features in the feature data set and the feature word vectors in the feature word vector set, and then calculating the weights based on that relationship, which completes the conversion.
Specifically, the weight relationship is:
d = {(t_1, w_1), (t_2, w_2), ..., (t_i, w_i), ..., (t_n, w_n)}
Here, d is the feature word vector set; t_1, t_2, ..., t_n are the features in the feature data set, such as [有名的] and [创业] above; and w_1, w_2, ..., w_n are the weights of the corresponding features.
Further, the weights are calculated as:
Figure PCTCN2019116550-appb-000006
Here, f_i is the number of times the feature word appears in the primary text data set, N is the total number of documents in the document collection, N_j is the total number of feature words in the primary text data set, N_i is the number of occurrences of feature word i in the primary text data set, and F_m is a weighting factor that generally takes a value less than 1.
Step 4. Calculate the similarity between the features in the feature word vector set to obtain a similarity set, sort the similarity set, select a specified number of feature word vectors from the sorted similarity set, discover hotspot keywords based on the specified number of feature word vectors, and output the hotspots of the original text data set.
Preferably, the similarity is calculated as:
Figure PCTCN2019116550-appb-000007
Here, sim(d, t) is the similarity between feature word vectors d and t, w is the weight coefficient between the feature word vectors d, t and the other feature word vectors k in the feature word vector set, n is the total number of data items in the feature word vector set, α and β are bias coefficients with α + β = 1, and T is the time distance function.
Further, the time distance function T is:
Figure PCTCN2019116550-appb-000008
Here, t_d is the publication time, recorded in the tag set, of the text containing the feature word vector d, t_Ts is the earliest publication time of the text data in the tag set, and t_Te is the latest publication time in the tag set.
Preferably, the similarity set is traversed and sorted in descending order of similarity, and the feature word vectors with the highest similarity are selected to obtain the final feature words. For example, if the original text data set includes text such as "杨宇彬是一位有名的创业青年,靠着扎实的知识和勤劳实干在当地开始了自己的事业", analysis by the method of this application yields "创业" (entrepreneurship) as the text hotspot keyword of the original text data set.
Optionally, in other embodiments, the text hotspot discovery program may be divided into one or more modules, which are stored in the memory 11 and executed by one or more processors (the processor 12 in this embodiment) to carry out this application. A module in this application is a series of computer program instruction segments that performs a specific function, and is used to describe how the text hotspot discovery program executes in the text hotspot discovery device.
For example, FIG. 3 is a schematic diagram of the program modules of the text hotspot discovery program in an embodiment of the text hotspot discovery device of this application. In this embodiment, the text hotspot discovery program can be divided into a data receiving module 10, a data processing module 20, a word vector conversion module 30, and a text hotspot output module 40. For example:
The data receiving module 10 is configured to crawl an original text data set and a tag set from a news forum website, where the tag set records the publication time of each text in the original text data set.
The data processing module 20 is configured to perform preprocessing operations, including word segmentation, part-of-speech tagging, and removal of heteromorphic words, on the original text data set to obtain a primary text data set.
The word vector conversion module 30 is configured to perform a feature extraction operation on the primary text data set based on the tag set to obtain a feature data set, and to convert the feature data set into a feature word vector set.
The text hotspot output module 40 is configured to calculate the similarity between the features in the feature word vector set to obtain a similarity set, sort the similarity set, select a specified number of feature word vectors from the sorted similarity set, discover hotspot keywords based on the specified number of feature word vectors, and output the hotspots of the original text data set.
The functions or operation steps implemented when program modules such as the data receiving module 10, the data processing module 20, the word vector conversion module 30, and the text hotspot output module 40 are executed are substantially the same as those of the foregoing embodiments and are not repeated here.
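The four-module division can be sketched as a pipeline of interchangeable callables. This is only an illustrative skeleton under stated assumptions: the class name `HotspotPipeline`, the field names, and the toy stand-ins for each module are hypothetical and do not reproduce the application's formulas.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Tokens = List[str]
Vectors = Dict[str, List[float]]


@dataclass
class HotspotPipeline:
    # Each field stands in for one program module of FIG. 3.
    receive: Callable[[], Tuple[List[str], List[float]]]        # module 10: texts and publication times
    preprocess: Callable[[List[str]], List[Tokens]]             # module 20: primary text data set
    vectorize: Callable[[List[Tokens], List[float]], Vectors]   # module 30: feature word vectors
    output: Callable[[Vectors], List[str]]                      # module 40: hotspot keywords

    def run(self) -> List[str]:
        texts, times = self.receive()
        primary = self.preprocess(texts)
        vectors = self.vectorize(primary, times)
        return self.output(vectors)


# Toy stand-ins: whitespace splitting and a one-dimensional document count.
pipeline = HotspotPipeline(
    receive=lambda: (["start a business", "business news"], [1.0, 2.0]),
    preprocess=lambda texts: [t.split() for t in texts],
    vectorize=lambda docs, times: {
        w: [float(sum(w in d for d in docs))] for d in docs for w in d
    },
    output=lambda vecs: sorted(vecs, key=lambda w: vecs[w][0], reverse=True)[:1],
)
print(pipeline.run())
```

Keeping each module behind a plain callable means any one stage, for example the preprocessing, can be replaced without touching the others.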
In addition, an embodiment of this application further provides a computer-readable storage medium on which a text hotspot discovery program is stored. The text hotspot discovery program can be executed by one or more processors to implement the following operations:
crawling an original text data set and a tag set from a news forum website, the tag set recording the publication time of the texts in the original text data set;
performing preprocessing operations, including word segmentation, part-of-speech tagging, and removal of heteromorphic words, on the original text data set to obtain a primary text data set;
performing a feature extraction operation on the primary text data set based on the tag set to obtain a feature data set, and converting the feature data set into a feature word vector set;
calculating the similarity between the features in the feature word vector set to obtain a similarity set, sorting the similarity set, selecting a specified number of feature word vectors from the sorted similarity set, discovering hotspot keywords based on the specified number of feature word vectors, and outputting the hotspots of the original text data set.
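The crawling operation, which the claims break down into creating a URL queue, resolving each URL to an IP address, and downloading the page over HTTP, might look roughly like the sketch below. The function names, the injectable `resolver` and `fetcher` parameters, and the example URLs are assumptions made for illustration; the page parsing that yields the text and tag sets is omitted.

```python
import socket
from collections import deque
from typing import Callable, Deque, Iterable, List, Tuple
from urllib.parse import urlsplit
from urllib.request import urlopen


def resolve_ip(url: str) -> str:
    """Resolve the host part of a URL to an IPv4 address via DNS."""
    host = urlsplit(url).hostname
    if host is None:
        raise ValueError(f"URL has no host: {url!r}")
    return socket.gethostbyname(host)


def download(url: str, timeout: float = 10.0) -> bytes:
    """Download the raw page body over HTTP(S)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def crawl(
    urls: Iterable[str],
    resolver: Callable[[str], str] = resolve_ip,
    fetcher: Callable[[str], bytes] = download,
) -> List[Tuple[str, str, bytes]]:
    """Read URLs from a FIFO queue in order; return (url, ip, page bytes) triples."""
    queue: Deque[str] = deque(urls)
    pages: List[Tuple[str, str, bytes]] = []
    while queue:
        url = queue.popleft()
        ip = resolver(url)
        pages.append((url, ip, fetcher(url)))
    return pages


# Offline demonstration with stub resolver and fetcher (no network access).
pages = crawl(
    ["http://news.example/a", "http://news.example/b"],
    resolver=lambda url: "192.0.2.10",
    fetcher=lambda url: b"<html>stub</html>",
)
print([(url, ip) for url, ip, _ in pages])
```

The sketch records the resolved IP address but fetches by URL, since servers that use name-based virtual hosting need the host name in the request.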
It should be noted that the serial numbers of the above embodiments of this application are for description only and do not indicate the relative merits of the embodiments. The terms "include", "comprise", and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, device, article, or method that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, device, article, or method. Without further limitation, an element defined by the phrase "including a..." does not exclude the existence of additional identical elements in the process, device, article, or method that includes the element.
From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus the necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium as described above (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods described in the embodiments of this application.
The above are only preferred embodiments of this application and do not limit its patent scope. Any equivalent structure or equivalent process transformation made using the content of the description and drawings of this application, or any direct or indirect application in other related technical fields, is likewise included in the scope of patent protection of this application.

Claims (20)

  1. A method for discovering text hotspots, characterized in that the method comprises:
    crawling an original text data set and a tag set from a news forum website, the tag set recording the publication time of the texts in the original text data set;
    performing preprocessing operations, including word segmentation, part-of-speech tagging, and removal of heteromorphic words, on the original text data set to obtain a primary text data set;
    performing a feature extraction operation on the primary text data set based on the tag set to obtain a feature data set, and converting the feature data set into a feature word vector set;
    calculating the similarity between the features in the feature word vector set to obtain a similarity set, sorting the similarity set, selecting a specified number of feature word vectors from the sorted similarity set, discovering hotspot keywords based on the specified number of feature word vectors, and outputting the hotspots of the original text data set.
  2. The method for discovering text hotspots according to claim 1, characterized in that crawling the original text data set and the tag set from the news forum website comprises:
    creating a URL queue, wherein the URL queue includes a plurality of URLs;
    reading the URLs in the URL queue in sequence and resolving them into IP addresses;
    downloading the web page data designated by the IP addresses based on the HTTP communication protocol, and analyzing the web page data to obtain the original text data set and the tag set.
  3. The method for discovering text hotspots according to claim 2, characterized in that the feature extraction operation is:
    Figure PCTCN2019116550-appb-100001
    where DF(t,c) is the feature data set, DF_t is the number of texts in the primary text data set in which the feature word t appears, N_c is the total number of data items in the primary text data set, and c is the primary text data set.
  4. The method for discovering text hotspots according to claim 3, characterized in that the similarity between the features in the feature word vector set is calculated as:
    Figure PCTCN2019116550-appb-100002
    where sim(d,t) is the similarity between the feature word vectors d and t, w is the weight coefficient between the feature word vectors d, t and the other feature word vectors in the feature word vector set, n is the total number of data items in the feature word vector set, α and β are bias coefficients with α+β=1, and T is the time distance function.
  5. The method for discovering text hotspots according to claim 4, characterized in that the time distance function T is:
    Figure PCTCN2019116550-appb-100003
    where t_d is the publication time, recorded in the tag set, of the text containing the feature word vector d, t_Ts is the earliest publication time in the tag set, and t_Te is the latest publication time in the tag set.
  6. The method for discovering text hotspots according to claim 1, characterized in that converting the feature data set into the feature word vector set comprises:
    setting a weight relationship between the features in the feature data set and the feature word vectors in the feature word vector set, and calculating the weights based on the weight relationship to complete the conversion.
  7. The method for discovering text hotspots according to claim 6, characterized in that the weights are calculated as:
    Figure PCTCN2019116550-appb-100004
    where f_i is the number of times the feature word appears in the primary text data set, N is the total number of documents in the document collection, N_j is the total number of feature words in the primary text data set, N_i is the number of occurrences of the feature word i in the primary text data set, and F_m is a weighting factor whose value is less than 1.
  8. A device for discovering text hotspots, characterized in that the device comprises a memory and a processor, the memory stores a text hotspot discovery program that can run on the processor, and the text hotspot discovery program, when executed by the processor, implements the following steps:
    crawling an original text data set and a tag set from a news forum website, the tag set recording the publication time of the texts in the original text data set;
    performing preprocessing operations, including word segmentation, part-of-speech tagging, and removal of heteromorphic words, on the original text data set to obtain a primary text data set;
    performing a feature extraction operation on the primary text data set based on the tag set to obtain a feature data set, and converting the feature data set into a feature word vector set;
    calculating the similarity between the features in the feature word vector set to obtain a similarity set, sorting the similarity set, selecting a specified number of feature word vectors from the sorted similarity set, discovering hotspot keywords based on the specified number of feature word vectors, and outputting the hotspots of the original text data set.
  9. The device for discovering text hotspots according to claim 8, characterized in that crawling the original text data set and the tag set from the news forum website comprises:
    creating a URL queue, wherein the URL queue includes a plurality of URLs;
    reading the URLs in the URL queue in sequence and resolving them into IP addresses;
    downloading the web page data designated by the IP addresses based on the HTTP communication protocol, and analyzing the web page data to obtain the original text data set and the tag set.
  10. The device for discovering text hotspots according to claim 9, characterized in that the feature extraction operation is:
    Figure PCTCN2019116550-appb-100005
    where DF(t,c) is the feature data set, DF_t is the number of texts in the primary text data set in which the feature word t appears, N_c is the total number of data items in the primary text data set, and c is the primary text data set.
  11. The device for discovering text hotspots according to claim 10, characterized in that the similarity between the features in the feature word vector set is calculated as:
    Figure PCTCN2019116550-appb-100006
    where sim(d,t) is the similarity between the feature word vectors d and t, w is the weight coefficient between the feature word vectors d, t and the other feature word vectors in the feature word vector set, n is the total number of data items in the feature word vector set, α and β are bias coefficients with α+β=1, and T is the time distance function.
  12. The device for discovering text hotspots according to claim 11, characterized in that the time distance function T is:
    Figure PCTCN2019116550-appb-100007
    where t_d is the publication time, recorded in the tag set, of the text containing the feature word vector d, t_Ts is the earliest publication time in the tag set, and t_Te is the latest publication time in the tag set.
  13. The device for discovering text hotspots according to claim 8, characterized in that converting the feature data set into the feature word vector set comprises:
    setting a weight relationship between the features in the feature data set and the feature word vectors in the feature word vector set, and calculating the weights based on the weight relationship to complete the conversion.
  14. The device for discovering text hotspots according to claim 13, characterized in that the weights are calculated as:
    Figure PCTCN2019116550-appb-100008
    where f_i is the number of times the feature word appears in the primary text data set, N is the total number of documents in the document collection, N_j is the total number of feature words in the primary text data set, N_i is the number of occurrences of the feature word i in the primary text data set, and F_m is a weighting factor whose value is less than 1.
  15. A computer-readable storage medium, characterized in that a text hotspot discovery program is stored on the computer-readable storage medium, and the text hotspot discovery program, when executed by one or more processors, implements the following steps:
    crawling an original text data set and a tag set from a news forum website, the tag set recording the publication time of the texts in the original text data set;
    performing preprocessing operations, including word segmentation, part-of-speech tagging, and removal of heteromorphic words, on the original text data set to obtain a primary text data set;
    performing a feature extraction operation on the primary text data set based on the tag set to obtain a feature data set, and converting the feature data set into a feature word vector set;
    calculating the similarity between the features in the feature word vector set to obtain a similarity set, sorting the similarity set, selecting a specified number of feature word vectors from the sorted similarity set, discovering hotspot keywords based on the specified number of feature word vectors, and outputting the hotspots of the original text data set.
  16. The computer-readable storage medium according to claim 15, characterized in that crawling the original text data set and the tag set from the news forum website comprises:
    creating a URL queue, wherein the URL queue includes a plurality of URLs;
    reading the URLs in the URL queue in sequence and resolving them into IP addresses;
    downloading the web page data designated by the IP addresses based on the HTTP communication protocol, and analyzing the web page data to obtain the original text data set and the tag set.
  17. The computer-readable storage medium according to claim 16, characterized in that the feature extraction operation is:
    Figure PCTCN2019116550-appb-100009
    where DF(t,c) is the feature data set, DF_t is the number of texts in the primary text data set in which the feature word t appears, N_c is the total number of data items in the primary text data set, and c is the primary text data set.
  18. The computer-readable storage medium according to claim 17, characterized in that the similarity between the features in the feature word vector set is calculated as:
    Figure PCTCN2019116550-appb-100010
    where sim(d,t) is the similarity between the feature word vectors d and t, w is the weight coefficient between the feature word vectors d, t and the other feature word vectors in the feature word vector set, n is the total number of data items in the feature word vector set, α and β are bias coefficients with α+β=1, and T is the time distance function.
  19. The computer-readable storage medium according to claim 18, characterized in that the time distance function T is:
    Figure PCTCN2019116550-appb-100011
    where t_d is the publication time, recorded in the tag set, of the text containing the feature word vector d, t_Ts is the earliest publication time in the tag set, and t_Te is the latest publication time in the tag set.
  20. The computer-readable storage medium according to claim 15, characterized in that converting the feature data set into the feature word vector set comprises:
    setting a weight relationship between the features in the feature data set and the feature word vectors in the feature word vector set, and calculating the weights based on the weight relationship to complete the conversion.
PCT/CN2019/116550 2019-08-15 2019-11-08 Method and apparatus for discovering text hotspot and computer-readable storage medium WO2021027116A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910768143.XA CN110609938A (en) 2019-08-15 2019-08-15 Text hotspot discovery method and device and computer-readable storage medium
CN201910768143.X 2019-08-15

Publications (1)

Publication Number Publication Date
WO2021027116A1 true WO2021027116A1 (en) 2021-02-18

Family

ID=68890661

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/116550 WO2021027116A1 (en) 2019-08-15 2019-11-08 Method and apparatus for discovering text hotspot and computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN110609938A (en)
WO (1) WO2021027116A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101488150A (en) * 2009-03-04 2009-07-22 哈尔滨工程大学 Real-time multi-view network focus event analysis apparatus and analysis method
CN102937960A (en) * 2012-09-06 2013-02-20 北京邮电大学 Device and method for identifying and evaluating emergency hot topic
CN106599181A (en) * 2016-12-13 2017-04-26 浙江网新恒天软件有限公司 Hot news detecting method based on topic model
CN107895053A (en) * 2017-12-13 2018-04-10 福州大学 Emerging much-talked-about topic detecting system and method based on topic cluster momentum model

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106682149A (en) * 2016-12-22 2017-05-17 湖南科技学院 Label automatic generation method based on meta-search engine
CN107368595A (en) * 2017-07-26 2017-11-21 中国华戎科技集团有限公司 network hotspot information mining method and system
CN108491429A (en) * 2018-02-09 2018-09-04 湖北工业大学 A kind of feature selection approach based on document frequency and word frequency statistics between class in class

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHANG TING: "Research on Interactive Timeline System of News Report", CHINESE MASTER'S THESES FULL-TEXT DATABASE, 15 May 2018 (2018-05-15), pages 1 - 82, XP055780919 *

Also Published As

Publication number Publication date
CN110609938A (en) 2019-12-24

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19941663

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19941663

Country of ref document: EP

Kind code of ref document: A1