WO2021027116A1

WO2021027116A1 - Method and apparatus for discovering text hotspot and computer-readable storage medium

Info

Publication number: WO2021027116A1
Application number: PCT/CN2019/116550
Authority: WO
Inventors: 苏智辉; 侯丽; 姚飞
Original assignee: 平安科技（深圳）有限公司
Priority date: 2019-08-15
Filing date: 2019-11-08
Publication date: 2021-02-18
Also published as: CN110609938A

Abstract

The present application relates to artificial intelligence technology, and disclosed is a method for discovering a text hotspot, comprising: receiving an original text data set and a tag set, the tag set recording the publication time of text in the original text data set; performing a preprocessing operation comprising word segmentation, part-of-speech tagging, and heteromorphic word removal on the original text data set to obtain a primary text data set; performing a feature extraction operation on the primary text data set on the basis of the tag set to obtain a feature data set, and converting the feature data set into a feature word vector set; calculating the similarity between features in the feature word vector set to obtain a similarity set; selecting a specified quantity of feature word vectors from within the similarity set; and discovering hotspot keywords on the basis of the specified quantity of feature word vectors and outputting a hotspot. Further proposed by the present application are an apparatus for discovering a text hotspot and a computer-readable storage medium. The present application may achieve the function of accurately and efficiently discovering a text hotspot.

Description

Method, device and computer readable storage medium for discovering text hotspot

This application claims priority under the Paris Convention to the Chinese patent application filed on August 15, 2019, with application number CN201910768143.X and titled "Method, device and computer-readable storage medium for discovering text hotspots". The entire content of that Chinese patent application is incorporated into this application by reference.

Technical field

This application relates to the field of artificial intelligence technology, and in particular to a method, device and computer-readable storage medium for extracting keywords in a text data set to discover text hotspots.

Background technique

With the rapid development of Internet technology, major portals have emerged, and most of them have become the main channels through which people obtain information. However, because information on the network is complex, redundant, and rapidly updated and disseminated, it is difficult for people to obtain the key information they need quickly and accurately, which also hinders the monitoring of online public opinion. Timely discovery of Internet hot keywords has therefore become a focus of current research. At present, a text clustering algorithm based on single-pass is widely used to discover network hot keywords because it is simple and easy to implement, has low time and space complexity, and clusters well. However, the single-pass algorithm has limitations. For example, keyword similarity matches are classified only according to empirical thresholds, which not only makes topic analysis of each piece of text data in the network slow, but also degrades accuracy, because the processing time grows exponentially with the amount of data.

Summary of the invention

This application provides a method, device and computer-readable storage medium for discovering text hotspots, the main purpose of which is to discover text hotspots by extracting keywords in a text data set.

To achieve the above-mentioned purpose, a method for discovering text hotspots provided by this application includes:

Crawling the original text data set and tag set from the news forum website, the tag set recording the publication time of the text in the original text data set;

Performing preprocessing operations including word segmentation, part-of-speech tagging, and removal of heteromorphic words on the original text data set to obtain a primary text data set;

Performing a feature extraction operation on the primary text data set based on the tag set to obtain a feature data set, and converting the feature data set into a feature word vector set;

Calculating the similarity between the features in the feature word vector set to obtain a similarity set, performing a sorting operation on the similarity set, selecting a specified number of feature word vectors from the sorted similarity set, finding hot keywords based on the specified number of feature word vectors, and outputting the hotspots of the original text data set.

In addition, in order to achieve the above-mentioned object, the present application also provides a text hotspot discovery device, which includes a memory and a processor. The memory stores a text hotspot discovery program that can run on the processor. When the text hotspot discovery program is executed by the processor, the steps of the method for discovering text hotspots described above are implemented.

In addition, in order to achieve the above-mentioned object, the present application also provides a computer-readable storage medium on which a text hotspot discovery program is stored. The text hotspot discovery program can be executed by one or more processors to perform the steps of the method for discovering text hotspots described above.

This application first crawls real-time text data from a news forum. Preprocessing with accurate word segmentation and part-of-speech tagging at an early stage effectively extracts the words that may be hot keywords. Converting those words into word vectors then allows efficient analysis by computer without losing features. Finally, the hot keywords are found by traversal based on the calculated feature similarity, which yields the current text hotspots. Therefore, the method, device, and computer-readable storage medium for discovering text hotspots proposed in this application can discover text hotspots accurately and efficiently.

Description of the drawings

FIG. 1 is a schematic flowchart of a method for discovering text hotspots according to an embodiment of this application;

FIG. 2 is a schematic diagram of the internal structure of a text hotspot discovery device provided by an embodiment of this application;

FIG. 3 is a schematic diagram of modules of a text hotspot discovery program in a text hotspot discovery device provided by an embodiment of the application.

The realization, functional characteristics, and advantages of the purpose of this application will be further described in conjunction with the embodiments and with reference to the accompanying drawings.

Detailed description

It should be understood that the specific embodiments described here are only used to explain the application, and are not used to limit the application.

This application provides a method for discovering text hotspots. Referring to FIG. 1, it is a schematic flowchart of a method for discovering text hotspots according to an embodiment of this application. The method can be executed by a device, and the device can be implemented by software and/or hardware.

In this embodiment, the method for discovering text hot spots includes:

S1. Crawling an original text data set and a tag set from a news forum website, the tag set recording the publication time of the text in the original text data set.

Preferably, the crawling can use crawler technology: first create a URL queue containing several URLs, then read the URLs in the URL queue in turn and resolve them to IP addresses, and finally download the webpage data specified by each IP address based on the HTTP communication protocol and parse the webpage data to obtain the original text data set and the tag set.

Preferably, the URL (uniform resource locator) is a concise representation of the location and access method of a resource on the news forum website, and is also called the address of that resource. A URL is composed of a protocol, hostname, port, path, query string, hash element, etc. The protocol is the protocol used to access the resource or service, such as http, ftp, mailto, or file; the hostname is the fully qualified domain name of the host where the resource is located, such as www.baidu.com; the port is the TCP port number used by the protocol, where the common port of the HTTP communication protocol is 80, which is generally omitted by default; the path is the directory/file path name of the resource; the query string is the query string passed in the URL; and the hash element is the offset within the file specified by the URL, consisting of a hash (#) followed by the position related to that offset.

Further, parsing a URL into an IP address means extracting the protocol, hostname, port, path, etc. from the URL in order to obtain the IP address.

Preferably, the URLs are generally those of designated news sites, microblogs, and the like, because the webpage data at such URLs contains both text data and its release time. The text data is collected into the original text data set, and the release time of each text in the original text data set is recorded in the tag set.
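The crawling steps described above (a URL queue, splitting each URL into its components, resolving the host to an IP address, and downloading over HTTP) can be sketched in Python as follows. This is a minimal sketch rather than the applicant's implementation: the names `split_url`, `resolve_ip`, and `crawl`, the injected `fetch` callable, and the default-port table are hypothetical, and extracting the text and its publication time from the downloaded HTML is left out.

```python
from collections import deque
from urllib.parse import urlsplit
import socket

# Assumed default ports, used only when the URL omits the port.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}

def split_url(url):
    """Split a URL into the components named in the description."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,
        "hostname": parts.hostname,
        "port": parts.port or DEFAULT_PORTS.get(parts.scheme),
        "path": parts.path,
        "query": parts.query,
        "hash": parts.fragment,
    }

def resolve_ip(hostname):
    """Resolve a host name to an IPv4 address via DNS (needs network access)."""
    return socket.gethostbyname(hostname)

def crawl(urls, fetch, resolve=resolve_ip):
    """Read URLs from a queue in turn, resolve each host, and download the page.

    `fetch` is any callable that takes a URL and returns the page content.
    """
    queue = deque(urls)
    pages = []
    while queue:
        url = queue.popleft()
        info = split_url(url)
        ip = resolve(info["hostname"])
        pages.append((url, ip, fetch(url)))
    return pages
```

With network access, `fetch` could be something like `lambda u: urllib.request.urlopen(u).read()`; passing it in as a parameter keeps the queue logic testable offline.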

S2. Perform preprocessing operations including word segmentation, part-of-speech tagging, and stop word removal on the original text data set to obtain a primary text data set.

Preferably, because Chinese text has no explicit separator between words, word segmentation is performed on the original text data set. The word segmentation uses jieba, which is available for programming languages such as Python and Java. For example, the original text data set contains the text "Yang Yubin is a well-known entrepreneurial youth who started his own career in the local area by relying on solid knowledge and hard work". After jieba segmentation, the result is: [Yang Yubin][is][a][well-known (有名的)][entrepreneurship][youth][,][relying on][solid][knowledge][and][hard work][in][local][started][了][own (自己的)][career].

Further, the part-of-speech tagging uses a pre-built part-of-speech tagging template to tag the nouns and verbs in the segmented original text data set. The part-of-speech tagging template is a recognizer of the characteristics of nouns and verbs, and identifies nouns and verbs by recognizing those characteristics in words. For the segmented sentence above, the part-of-speech tagging template marks [youth], [knowledge], [local], and [career] as nouns, and [entrepreneurship], [relying on], and [started] as verbs.

Next, the original text data set is searched for words that are longer than a preset length, such as two characters, and contain "的" or "地", and it is determined whether the words before and after each such word in the text data are nouns or verbs. If so, the word is an adjective or adverb. In the example above, the words longer than two characters that contain "的" or "地" are [well-known (有名的)], [solid (强实的)], and [own (自己的)]; nouns or verbs such as [entrepreneurship], [relying on], and [knowledge] occur before or after them, so they are adjectives or adverbs and are marked as such. Preferably, the marking takes the form of a superscript on each word, for example: [Yang Yubin is a well-known^adj entrepreneurship^v youth^n, relying on^v solid^adj knowledge^n and hard work in local^n started^v 了 own^adj career^n].

Further, the heteromorphic words include all English letters, Arabic numerals, Chinese numerals, punctuation marks, stop words, and the like, where the stop words include words such as "了" and "于". After the heteromorphic words are removed from the example above, the result is [well-known^adj entrepreneurship^v youth^n relying on^v solid^adj knowledge^n local^n started^v own^adj career^n].
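The "的"/"地" rule and the removal of heteromorphic words can be illustrated with a small Python sketch that works on tokens that have already been segmented and noun/verb-tagged. Several points are assumptions made for illustration: the function names, the tiny stop-word and numeral sets, the reading that one tagged neighbour on either side is enough (which is what the example sentence suggests), and the choice to keep only tagged words in the output.

```python
import string

# Assumptions for illustration: a tiny stop-word list and numeral set.
STOP_WORDS = {"了", "于"}
CHINESE_NUMERALS = set("零一二三四五六七八九十百千万亿")
PUNCTUATION = set(string.punctuation) | set("，。！？；：、")

def is_heteromorphic(token):
    """True for tokens the description removes: English letters, Arabic
    numerals, Chinese numerals, punctuation marks, and stop words."""
    if not token or token in STOP_WORDS:
        return True
    if token.isascii() and token.isalnum():
        return True
    return all(ch in CHINESE_NUMERALS or ch in PUNCTUATION for ch in token)

def tag_modifiers(tokens, pos_tags, min_len=2):
    """Tag tokens longer than min_len that contain 的 or 地 as 'adj' when a
    neighbouring token is already tagged as a noun ('n') or verb ('v')."""
    tagged = dict(pos_tags)
    for i, tok in enumerate(tokens):
        if len(tok) > min_len and ("的" in tok or "地" in tok):
            neighbours = tokens[max(i - 1, 0):i] + tokens[i + 1:i + 2]
            if any(pos_tags.get(n) in ("n", "v") for n in neighbours):
                tagged[tok] = "adj"
    return tagged

def preprocess(tokens, pos_tags):
    """Return (word, tag) pairs for tagged words that are not heteromorphic."""
    tagged = tag_modifiers(tokens, pos_tags)
    return [(t, tagged[t]) for t in tokens
            if not is_heteromorphic(t) and t in tagged]
```

For instance, `preprocess(["一个", "有名的", "创业", "青年", "，", "了", "2019"], {"创业": "v", "青年": "n"})` keeps only the three tagged words and marks 有名的 as an adjective.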

S3. Perform feature extraction on the primary text data set based on the tag set to obtain a feature data set, and convert the feature data set into a feature word vector set.

Preferably, the feature extraction is:

Wherein, $DF_t$ is the number of texts in the primary text data set in which the feature word $t$ appears, $N_c$ is the total number of texts in the primary text data set, $c$ is the primary text data set, and lg denotes the base-10 logarithm. For example, in the above [well-known^adj entrepreneurship^v youth^n relying on^v solid^adj knowledge^n local^n started^v own^adj career^n], [well-known], [entrepreneurship], etc. are all feature words.

Further, converting the feature data set into a feature word vector set includes setting a weight relationship between the features in the feature data set and the feature word vectors in the feature word vector set, calculating the weights based on the weight relationship, and completing the conversion.
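The two counts defined here can be computed directly, as in the hedged Python sketch below. The feature extraction equation itself is not reproduced in this text, so the way the sketch combines the counts, $\lg(DF_t / N_c)$, is only an assumption based on the symbols that are defined; the function names are hypothetical as well.

```python
import math
from collections import Counter

def document_frequency(texts):
    """texts: list of token lists (the primary text data set c).

    Returns (df, n_c), where df[t] is the number of texts containing the
    feature word t (DF_t) and n_c is the total number of texts (N_c).
    """
    df = Counter()
    for tokens in texts:
        df.update(set(tokens))  # count each word at most once per text
    return df, len(texts)

def df_score(t, df, n_c):
    """ASSUMPTION: lg(DF_t / N_c); the source equation is not reproduced."""
    if df[t] == 0:
        return float("-inf")
    return math.log10(df[t] / n_c)
```

A word that occurs in every text scores 0 under this assumed form, and rarer words score lower.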

Specifically, the weight relationship is:

$$d = \{(t_1, w_1), (t_2, w_2), \ldots, (t_i, w_i), \ldots, (t_n, w_n)\}$$

Here, $d$ is the feature word vector set, $t_1, t_2, \ldots, t_n$ are the features in the feature data set, such as the aforementioned [well-known] and [entrepreneurship], and $w_1, w_2, \ldots, w_n$ are the weights of the corresponding features.

Further, the calculation method of the weight is:

Wherein, $f_i$ is the number of times the feature word $i$ appears in the primary text data set, $N$ is the total number of documents in the document collection, $N_j$ is the total number of feature words in the primary text data set, $N_i$ is the number of texts in the primary text data set in which the feature word $i$ appears, and $F_m$ is a weighting factor whose value is generally less than 1.
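To make the structure $d = \{(t_i, w_i)\}$ concrete, the Python sketch below builds a list of (feature word, weight) pairs. The weight equation of the application is not reproduced in this text, so the sketch swaps in a standard TF-IDF style weight as a stand-in; applying $F_m$ as a plain multiplier, the smoothing term `+ 1`, and the function name are all assumptions.

```python
import math
from collections import Counter

def feature_word_vector(texts, f_m=1.0):
    """Return d as a list of (t_i, w_i) pairs for a list of token lists.

    Stand-in weight (NOT the application's formula):
        w_i = F_m * (f_i / N_j) * lg(N / N_i + 1)
    """
    all_tokens = [tok for text in texts for tok in text]
    f = Counter(all_tokens)                       # f_i: occurrences of word i
    n_j = len(all_tokens)                         # N_j: total feature words
    n = len(texts)                                # N: total number of texts
    n_i = Counter(tok for text in texts for tok in set(text))  # N_i
    d = []
    for t in sorted(f):
        w = f_m * (f[t] / n_j) * math.log10(n / n_i[t] + 1)
        d.append((t, w))
    return d
```

Keeping $d$ as explicit pairs mirrors the weight relationship above; `dict(d)` gives a word-to-weight lookup when that is more convenient.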

S4. Calculate the similarity between the features in the feature word vector set to obtain a similarity set, perform a sorting operation on the similarity set, select a specified number of feature word vectors from the sorted similarity set, find hot keywords based on the specified number of feature word vectors, and output the hotspots of the original text data set.

Preferably, the calculation method of the similarity is:

Here, $sim(d, t)$ is the similarity between the feature word vectors $d$ and $t$, $w$ denotes the weight coefficients of the feature word vectors $d$, $t$, and the other feature word vectors $k$ in the feature word vector set, $n$ is the total number of data items in the feature word vector set, $\alpha$ and $\beta$ are bias coefficients with $\alpha + \beta = 1$, and $T$ is the time distance function.

Further, the time distance function T is:

Wherein, $t_d$ is the publication time, recorded in the tag set, of the text in which the feature word vector $d$ is located, $t_{Ts}$ is the earliest publication time in the tag set, and $t_{Te}$ is the latest publication time in the tag set.

Preferably, the similarity set is traversed and sorted from the largest similarity to the smallest, the feature word vectors with high similarity are selected, and the corresponding feature words are finally obtained. For example, the original text data set includes "Yang Yubin is a well-known entrepreneurial youth who started his own career in the local area by relying on solid knowledge and hard work". After analysis by the method described in this application, the hot keyword of the original text data set is "entrepreneurship".
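Step S4 can be sketched in Python as below. The similarity and time distance equations are not reproduced in this text, so the sketch relies on stated assumptions: a cosine term over weight vectors combined as $\alpha \cdot \cos + \beta \cdot T$, and a min-max normalised time distance $T = (t_d - t_{Ts}) / (t_{Te} - t_{Ts})$. Only $\alpha + \beta = 1$ and the descending sort with a specified number of results come from the description; the function names and the default `alpha` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def time_distance(t_d, t_ts, t_te):
    """ASSUMED form of T: position of t_d between earliest and latest time."""
    return (t_d - t_ts) / (t_te - t_ts) if t_te > t_ts else 0.0

def similarity(u, v, t_d, t_ts, t_te, alpha=0.7):
    """ASSUMED combination: alpha * cosine + beta * T, with alpha + beta = 1."""
    beta = 1.0 - alpha
    return alpha * cosine(u, v) + beta * time_distance(t_d, t_ts, t_te)

def top_k(scores, k):
    """Sort feature words by similarity, largest first, and keep the first k."""
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Publication times are treated as plain numbers here, for example POSIX timestamps, so that later texts receive a larger time term under the assumed form of $T$.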

The invention also provides a text hot spot discovery device. Referring to FIG. 2, it is a schematic diagram of the internal structure of a text hotspot discovery apparatus provided by an embodiment of this application.

In this embodiment, the text hotspot discovery apparatus 1 may be a PC (Personal Computer, personal computer), or a terminal device such as a smart phone, a tablet computer, or a portable computer, or a server. The text hotspot discovery device 1 at least includes a memory 11, a processor 12, a communication bus 13, and a network interface 14.

Wherein, the memory 11 includes at least one type of readable storage medium, and the readable storage medium includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory, etc.), magnetic memory, magnetic disk, optical disk, etc. In some embodiments, the memory 11 may be an internal storage unit of the text hotspot discovery device 1, for example, the hard disk of the text hotspot discovery device 1. In other embodiments, the memory 11 may also be an external storage device of the text hotspot discovery device 1, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the text hotspot discovery device 1. Further, the memory 11 may include both an internal storage unit and an external storage device of the text hotspot discovery device 1. The memory 11 can be used not only to store application software installed in the text hotspot discovery device 1 and various data, such as the code of the text hotspot discovery program 01, but also to temporarily store data that has been output or will be output.

In some embodiments, the processor 12 may be a central processing unit (CPU), controller, microcontroller, microprocessor, or other data processing chip, and is used to run program code stored in the memory 11 or to process data, for example to execute the text hotspot discovery program 01.

The communication bus 13 is used to realize the connection and communication between these components.

The network interface 14 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface), and is usually used to establish a communication connection between the device 1 and other electronic devices.

Optionally, the device 1 may also include a user interface. The user interface may include a display (Display) and an input unit such as a keyboard (Keyboard). The optional user interface may also include a standard wired interface and a wireless interface. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode, organic light emitting diode) touch device, etc. Among them, the display can also be appropriately called a display screen or a display unit, which is used to display the information processed in the text hotspot discovery device 1 and to display a visualized user interface.

FIG. 2 only shows the text hotspot discovery device 1 with components 11-14 and the text hotspot discovery program 01. Those skilled in the art can understand that the structure shown in FIG. 2 does not constitute a limitation on the text hotspot discovery device 1, which may include fewer or more components than shown, a combination of certain components, or a different arrangement of components.

In the embodiment of the apparatus 1 shown in FIG. 2, the memory 11 stores a text hotspot discovery program 01; when the processor 12 executes the text hotspot discovery program 01 stored in the memory 11, the following steps are implemented:

Step 1: Crawling the original text data set and tag set from the news forum website, the tag set recording the publication time of the text in the original text data set.

Preferably, the URL (uniform resource locator) is a concise representation of the location and access method of a resource on the news forum website, and is also called the address of that resource. A URL is composed of a protocol, hostname, port, path, query string, hash element, etc. The protocol is the protocol used to access the resource or service, such as http, ftp, mailto, or file; the hostname is the fully qualified domain name of the host where the resource is located, such as www.baidu.com; the port is the TCP port number used by the protocol, where the common port of the HTTP communication protocol is 80, which is generally omitted by default; the path is the directory/file path name of the resource; the query string is the query string passed in the URL; and the hash element is the offset within the file specified by the URL, consisting of a hash (#) followed by the position related to that offset.

Step 2: Perform preprocessing operations including word segmentation, part-of-speech tagging, and stop word removal on the original text data set to obtain a primary text data set.

Step 3: Perform feature extraction on the primary text data set based on the tag set to obtain a feature data set, and convert the feature data set into a feature word vector set.

Preferably, the feature extraction is:

Specifically, the weight relationship is:

$$d = \{(t_1, w_1), (t_2, w_2), \ldots, (t_i, w_i), \ldots, (t_n, w_n)\}$$

Further, the calculation method of the weight is:

Step 4: Calculate the similarity between the features in the feature word vector set to obtain a similarity set, perform a sorting operation on the similarity set, select a specified number of feature word vectors from the sorted similarity set, find hot keywords based on the specified number of feature word vectors, and output the hotspots of the original text data set.

Preferably, the calculation method of the similarity is:

Further, the time distance function T is:

Optionally, in other embodiments, the text hotspot discovery program can also be divided into one or more modules, which are stored in the memory 11 and executed by one or more processors (in this embodiment, the processor 12) to complete this application. A module in this application refers to a series of computer program instruction segments that can complete a specific function, and is used to describe the execution process of the text hotspot discovery program in the text hotspot discovery device.

For example, FIG. 3 is a schematic diagram of the program modules of the text hotspot discovery program in an embodiment of the text hotspot discovery device of this application. In this embodiment, the text hotspot discovery program can be divided into a data receiving module 10, a data processing module 20, a word vector conversion module 30, and a text hotspot output module 40. Exemplarily:

The data receiving module 10 is used to crawl an original text data set and a tag set from a news forum website, and the tag set records the publication time of the text in the original text data set.

The data processing module 20 is configured to: perform preprocessing operations including word segmentation, part-of-speech tagging, and removal of heteromorphic words on the original text data set to obtain a primary text data set;

The word vector conversion module 30 is configured to perform a feature extraction operation on the primary text data set based on the tag set to obtain a feature data set, and convert the feature data set into a feature word vector set.

The text hotspot output module 40 is configured to calculate the similarity between the features in the feature word vector set to obtain a similarity set, perform a sorting operation on the similarity set, select a specified number of feature word vectors from the sorted similarity set, find hot keywords based on the specified number of feature word vectors, and output the hotspots of the original text data set.

The functions or operation steps implemented by the program modules such as the data receiving module 10, the data processing module 20, the word vector conversion module 30, and the text hotspot output module 40 when executed are substantially the same as those in the foregoing embodiment, and will not be repeated here.

In addition, an embodiment of the present application also proposes a computer-readable storage medium that stores a text hotspot discovery program, and the text hotspot discovery program can be executed by one or more processors to implement the steps of the method for discovering text hotspots described above.

It should be noted that the serial numbers of the above embodiments of the present application are for description only and do not represent the relative merits of the embodiments. The terms "include", "comprise", or any other variants thereof herein are intended to cover non-exclusive inclusion, so that a process, device, article, or method that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to the process, device, article, or method. Without further restrictions, an element defined by the phrase "including a..." does not exclude the existence of other identical elements in the process, device, article, or method that includes the element.

Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus the necessary general hardware platform. Of course, they can also be implemented by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part that contributes to the existing technology, can be embodied in the form of a software product. The computer software product is stored in a storage medium as described above (such as ROM/RAM, magnetic disk, or optical disk) and includes several instructions that cause a terminal device (which can be a mobile phone, a computer, a server, a network device, etc.) to execute the method described in each embodiment of the present application.

The above are only preferred embodiments of this application and do not limit its patent scope. Any equivalent structure or equivalent process transformation made using the content of the description and drawings of this application, or any direct or indirect use in other related technical fields, is likewise included in the scope of patent protection of this application.

Claims

A method for discovering text hotspots, characterized in that the method includes:

Crawling the original text data set and tag set from the news forum website, the tag set recording the publication time of the text in the original text data set;

Performing preprocessing operations including word segmentation, part-of-speech tagging, and removal of heteromorphic words on the original text data set to obtain a primary text data set;

Performing a feature extraction operation on the primary text data set based on the tag set to obtain a feature data set, and converting the feature data set into a feature word vector set;

Calculating the similarity between the features in the feature word vector set to obtain a similarity set, performing a sorting operation on the similarity set, selecting a specified number of feature word vectors from the sorted similarity set, finding hot keywords based on the specified number of feature word vectors, and outputting the hotspots of the original text data set.
The method for discovering text hotspots according to claim 1, wherein crawling the original text data set and tag set from the news forum website comprises:

Creating a URL queue, where the URL queue includes several URLs;

Sequentially read the URLs in the URL queue and parse them into IP addresses;

Downloading the webpage data specified by the IP address based on the HTTP communication protocol, and analyzing the webpage data to obtain the original text data set and tag set.
The method for discovering text hotspots according to claim 2, wherein the feature extraction operation is:

Wherein, $DF(t,c)$ is the feature data set, $DF_t$ is the number of texts in the primary text data set in which the feature word $t$ appears, $N_c$ is the total number of texts in the primary text data set, and $c$ is the primary text data set.
The method for discovering text hot spots according to claim 3, wherein the calculation method for calculating the similarity between the features in the feature word vector set is:

Wherein, $sim(d, t)$ is the similarity between the feature word vectors $d$ and $t$, $w$ denotes the weight coefficients of the feature word vectors $d$, $t$, and the other feature word vectors in the feature word vector set, $n$ is the total number of data items in the feature word vector set, $\alpha$ and $\beta$ are bias coefficients with $\alpha + \beta = 1$, and $T$ is the time distance function.
The method for discovering text hot spots according to claim 4, wherein the time distance function T is:

Wherein, $t_d$ is the publication time, recorded in the tag set, of the text in which the feature word vector $d$ is located, $t_{Ts}$ is the earliest publication time in the tag set, and $t_{Te}$ is the latest publication time in the tag set.
The method for discovering text hotspots according to claim 1, wherein said converting said feature data set into a feature word vector set comprises:

Set the weight relationship between the features in the feature data set and the feature word vectors in the feature word vector set, calculate the weight based on the weight relationship, and complete the conversion process.
The method for discovering text hotspots according to claim 6, wherein the weight calculation method comprises:

Wherein, $f_i$ is the number of times the feature word $i$ appears in the primary text data set, $N$ is the total number of documents in the document collection, $N_j$ is the total number of feature words in the primary text data set, $N_i$ is the number of texts in the primary text data set in which the feature word $i$ appears, and $F_m$ is a weighting factor whose value is less than 1.
A text hotspot discovery device, characterized in that the device includes a memory and a processor, the memory stores a text hotspot discovery program that can run on the processor, and when the text hotspot discovery program is executed by the processor, the following steps are implemented:

Crawling the original text data set and tag set from the news forum website, the tag set recording the publication time of the text in the original text data set;

Performing preprocessing operations including word segmentation, part-of-speech tagging, and removal of heteromorphic words on the original text data set to obtain a primary text data set;

Performing a feature extraction operation on the primary text data set based on the tag set to obtain a feature data set, and converting the feature data set into a feature word vector set;

Calculate the similarity between the features in the feature word vector set to obtain a similarity set, perform a sorting operation on the similarity set, and select a specified number of feature word vectors from the similarity set after the sorting operation, based on all The specified number of feature word vectors are used to find hot keywords and output the hot spots of the original text data set.
9. The device for discovering text hotspots according to claim 8, wherein crawling the original text data set and tag set from the news forum website comprises:

Creating a URL queue, where the URL queue includes several URLs;

Sequentially reading the URLs in the URL queue and resolving them into IP addresses;

Downloading the webpage data specified by the IP address based on the HTTP communication protocol, and analyzing the webpage data to obtain the original text data set and tag set.
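The three crawling steps above can be sketched with the Python standard library. This is an illustrative outline, not the claimed implementation: the function names are hypothetical, the DNS lookup is done explicitly only to mirror the claimed step (`urlopen` resolves the host again internally), and extracting the text and publication time from each page is site-specific and left out.

```python
import socket
from collections import deque
from urllib.parse import urlsplit
from urllib.request import urlopen


def resolve_host(url):
    """Resolve the host name of `url` to an IPv4 address via DNS."""
    return socket.gethostbyname(urlsplit(url).hostname)


def fetch(url, timeout=10):
    """Download the page body over HTTP and decode it as text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(urls, resolve=resolve_host, download=fetch):
    """Visit the queued URLs in order; return (url, ip, page_text) tuples."""
    queue = deque(urls)          # step 1: the URL queue
    pages = []
    while queue:
        url = queue.popleft()    # step 2: read URLs sequentially
        ip = resolve(url)        #         and resolve each to an IP address
        pages.append((url, ip, download(url)))  # step 3: download the page
    return pages
```

Passing `resolve` and `download` as parameters keeps the queue logic testable without network access; a real crawler would also need error handling, rate limiting, and a parser for the downloaded pages.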
10. The text hotspot discovery device according to claim 9, wherein the feature extraction operation is:

Wherein, DF(t,c) is the feature data set, DF_t represents the number of texts in the primary text data set in which the feature word t appears, N_c is the total number of texts in the primary text data set, and c is the primary text data set.
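The formula for this claim is not reproduced in the text. Given the definitions, one plausible reading is a document-frequency ratio, DF(t,c) = DF_t / N_c. The sketch below computes that ratio under this assumption, with each text represented as a list of tokens; the function name and sample corpus are illustrative.

```python
def document_frequency(term, documents):
    """Return DF_t / N_c: the share of texts that contain `term`.

    `documents` is a list of token lists (the segmented primary text
    data set). The ratio form is an assumed reading of the claim.
    """
    if not documents:
        raise ValueError("documents must not be empty")
    df_t = sum(1 for tokens in documents if term in tokens)
    return df_t / len(documents)


corpus = [
    ["flood", "rescue", "city"],
    ["flood", "warning"],
    ["market", "stocks"],
    ["city", "council"],
]
flood_df = document_frequency("flood", corpus)  # 2 of 4 texts -> 0.5
```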
11. The text hotspot discovery device according to claim 10, wherein the similarity between the features in the feature word vector set is calculated as:

Wherein, sim(d, t) represents the similarity between the feature word vectors d and t, w represents the weight coefficients of the feature word vectors d, t and other feature word vectors in the feature word vector set, n is the total number of data in the feature word vector set, α and β are bias coefficients with α+β=1, and T is the time distance function.
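The similarity formula is likewise missing from the text. The definitions suggest a weighted combination of a vector similarity and the time distance function. The sketch below assumes the form sim(d, t) = α · cos(w_d, w_t) + β · T with β = 1 − α; the use of cosine similarity and the names `cosine` and `similarity` are assumptions, not the claimed formula.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity(w_d, w_t, time_term, alpha=0.7):
    """Assumed form: alpha * cosine(w_d, w_t) + beta * T, with alpha + beta = 1.

    `time_term` is the value of the time distance function T, computed
    elsewhere and passed in as a number.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1] so that alpha + beta = 1")
    beta = 1.0 - alpha
    return alpha * cosine(w_d, w_t) + beta * time_term


score = similarity([1.0, 0.0], [1.0, 0.0], time_term=0.5, alpha=0.5)
```

With α close to 1 the score is driven by the word weights; with α close to 0 it is driven by publication time.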
12. The device for discovering text hotspots according to claim 11, wherein the time distance function T is:

Wherein, t_d represents the publication time of the text in the tag set where the feature word vector d is located, t_Ts is the earliest publication time in the tag set, and t_Te is the latest publication time in the tag set.
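The expression for T is not reproduced in the text. One form consistent with the three quantities defined is a min-max normalization, T = (t_d − t_Ts) / (t_Te − t_Ts), which maps the earliest publication time to 0 and the latest to 1. The sketch below implements that assumed form; the claimed function may be defined differently.

```python
from datetime import datetime


def time_distance(t_d, t_earliest, t_latest):
    """Assumed min-max form of T: position of t_d within the time span."""
    span = (t_latest - t_earliest).total_seconds()
    if span <= 0:
        return 0.0  # all texts share one timestamp; no spread to measure
    return (t_d - t_earliest).total_seconds() / span


t_value = time_distance(
    datetime(2019, 8, 6),
    t_earliest=datetime(2019, 8, 1),
    t_latest=datetime(2019, 8, 11),
)
# t_value == 0.5 (five days into a ten-day span)
```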
13. The text hotspot discovery device according to claim 8, wherein converting the feature data set into a feature word vector set comprises:

Setting the weight relationship between the features in the feature data set and the feature word vectors in the feature word vector set, calculating the weight based on the weight relationship, and completing the conversion process.
14. The device for discovering text hotspots according to claim 13, wherein the weight calculation method comprises:

Wherein, f_i represents the number of times the feature word i appears in the primary text data set, N represents the total number of documents in the document collection, N_j represents the total number of feature words in the primary text data set, N_i represents the number of occurrences of the feature word i in the primary text data set, F_m represents the weighting factor, and the value of F_m is less than 1.
15. A computer-readable storage medium, characterized in that a text hotspot discovery program is stored on the computer-readable storage medium, and the text hotspot discovery program, when executed by one or more processors, implements the following steps:

Crawling an original text data set and a tag set from a news forum website, the tag set recording the publication time of the text in the original text data set;

Performing preprocessing operations including word segmentation, part-of-speech tagging, and removal of heteromorphic words on the original text data set to obtain a primary text data set;

Performing a feature extraction operation on the primary text data set based on the tag set to obtain a feature data set, and converting the feature data set into a feature word vector set;

Calculating the similarity between the features in the feature word vector set to obtain a similarity set, performing a sorting operation on the similarity set, selecting a specified number of feature word vectors from the sorted similarity set, and discovering hotspot keywords based on the specified number of feature word vectors to output the hotspots of the original text data set.
16. The computer-readable storage medium of claim 15, wherein crawling the original text data set and tag set from the news forum website comprises:

Creating a URL queue, where the URL queue includes several URLs;

Sequentially reading the URLs in the URL queue and resolving them into IP addresses;

Downloading the webpage data specified by the IP address based on the HTTP communication protocol, and analyzing the webpage data to obtain the original text data set and tag set.
17. The computer-readable storage medium of claim 16, wherein the feature extraction operation is:

Wherein, DF(t,c) is the feature data set, DF_t represents the number of texts in the primary text data set in which the feature word t appears, N_c is the total number of texts in the primary text data set, and c is the primary text data set.
18. The computer-readable storage medium of claim 17, wherein the similarity between the features in the feature word vector set is calculated as:

Wherein, sim(d, t) represents the similarity between the feature word vectors d and t, w represents the weight coefficients of the feature word vectors d, t and other feature word vectors in the feature word vector set, n is the total number of data in the feature word vector set, α and β are bias coefficients with α+β=1, and T is the time distance function.
19. The computer-readable storage medium of claim 18, wherein the time distance function T is:

Wherein, t_d represents the publication time of the text in the tag set where the feature word vector d is located, t_Ts is the earliest publication time in the tag set, and t_Te is the latest publication time in the tag set.
20. The computer-readable storage medium of claim 15, wherein converting the feature data set into a feature word vector set comprises:

Setting the weight relationship between the features in the feature data set and the feature word vectors in the feature word vector set, calculating the weight based on the weight relationship, and completing the conversion process.