CN111460153A - Hot topic extraction method and device, terminal device and storage medium - Google Patents

Hot topic extraction method and device, terminal device and storage medium

Info

Publication number
CN111460153A
CN111460153A (application CN202010231954.9A)
Authority
CN
China
Prior art keywords
news
cluster
text
news text
clustering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010231954.9A
Other languages
Chinese (zh)
Other versions
CN111460153B (en)
Inventor
赵洋
包荣鑫
王宇
魏世胜
朱继刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Valueonline Technology Co ltd
Original Assignee
Shenzhen Valueonline Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Valueonline Technology Co ltd filed Critical Shenzhen Valueonline Technology Co ltd
Priority to CN202010231954.9A priority Critical patent/CN111460153B/en
Publication of CN111460153A publication Critical patent/CN111460153A/en
Application granted granted Critical
Publication of CN111460153B publication Critical patent/CN111460153B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/2433Single-class perspective, e.g. one-against-all classification; Novelty detection; Outlier detection
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application is applicable to the technical field of information, and provides a hot topic extraction method and apparatus, a terminal device and a storage medium. The method includes: collecting a plurality of news texts; for any news text, extracting a plurality of feature words of the news text; generating a sentence vector corresponding to the news text according to the plurality of feature words; clustering the plurality of news texts based on the sentence vectors corresponding to the news texts to obtain a plurality of clusters; and extracting hot topics from the plurality of clusters. The method improves both the accuracy and the real-time performance of hot topic extraction.

Description

Hot topic extraction method and device, terminal device and storage medium
Technical Field
The application belongs to the technical field of information, and particularly relates to a hot topic extraction method and device, a terminal device and a storage medium.
Background
The progress of internet technology has greatly promoted the development of news media and web portals. The way people obtain information has shifted from traditional channels such as television and newspapers to reading news on the internet anytime and anywhere through computers and mobile phones.
Faced with this endless stream of news content, extracting hot news topics makes it possible to present to users the content that is currently popular or widely followed. For some organizations, hot topics help analyze public opinion and provide suggestions for government policies; for enterprises, hot topics help decision makers grasp development directions and make sound decisions; for individuals, hot topics help people keep up with social events and broaden their knowledge. Therefore, how to analyze and extract hot topics in real time has important research value.
Disclosure of Invention
In view of this, embodiments of the present application provide a hot topic extraction method and apparatus, a terminal device, and a storage medium, so as to solve the prior-art problems that hot topic extraction accuracy is low and real-time performance is difficult to achieve.
A first aspect of an embodiment of the present application provides a method for extracting a hot topic, including:
collecting a plurality of news texts;
for any news text, extracting a plurality of feature words of the news text;
generating a sentence vector corresponding to the news text according to the plurality of feature words;
clustering the plurality of news texts based on the sentence vectors corresponding to the news texts to obtain a plurality of clusters;
extracting hot topics from the plurality of clusters.
A second aspect of the embodiments of the present application provides a hot topic extraction device, including:
a news text collection module, configured to collect a plurality of news texts;
a feature word extraction module, configured to extract, for any news text, a plurality of feature words of the news text;
a sentence vector generation module, configured to generate a sentence vector corresponding to the news text according to the plurality of feature words;
a news text clustering module, configured to cluster the plurality of news texts based on the sentence vectors corresponding to the news texts to obtain a plurality of clusters;
and a hot topic extraction module, configured to extract hot topics from the plurality of clusters.
A third aspect of the embodiments of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the hot topic extraction method described in the first aspect when executing the computer program.
A fourth aspect of embodiments of the present application provides a computer-readable storage medium, where a computer program is stored, and the computer program, when executed by a processor, implements the hot topic extraction method described in the first aspect.
A fifth aspect of embodiments of the present application provides a computer program product, which when running on a terminal device, causes the terminal device to execute the method for extracting a hot topic in the first aspect.
Compared with the prior art, the embodiment of the application has the following advantages:
according to the embodiment of the application, the outliers can be detected after the nodes are added every time based on the improved SinglePass clustering algorithm, if the distance is too large, the outliers are removed from the current cluster, the representativeness of a clustering center and the accuracy of a clustering result are guaranteed, secondly, the historical hotspot recalling algorithm provided by the embodiment can effectively judge the relation between a new hotspot and a historical hotspot, hotspots with the same theme and similar news are combined, and the accuracy of real-time pushing is guaranteed. Thirdly, in the embodiment, by using word2vec and TF-IDF to perform vectorization processing on the sentence, the global characteristics of the sentence vector can be more accurately represented, interference of irrelevant words is eliminated, and meanwhile, real-time quantization processing is supported, so that the time requirement of practical application can be met. The hot topic extraction method provided by the embodiment of the application realizes the functions of news sentence vector representation, hot topic clustering, hot title screening, historical hot recall and the like, improves the problems that sentence vector representation is not accurate and incremental clustering is not supported in the existing algorithm, does not need prior knowledge for large-scale dynamic news data, does not need obvious characteristics for news, and has better universality on the whole algorithm.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the embodiments or the description of the prior art will be briefly described below. It is obvious that the drawings in the following description are only some embodiments of the application, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
Fig. 1 is a schematic flowchart illustrating steps of a hot topic extraction method according to an embodiment of the present application;
FIG. 2 is a schematic flow chart diagram illustrating steps of another hot topic extraction method according to an embodiment of the present application;
FIG. 3 is a flow chart of an improved SinglePass clustering algorithm according to one embodiment of the present application;
FIG. 4 is a flowchart of a historical hot topic recall algorithm according to one embodiment of the present application;
FIG. 5 is a schematic diagram of a hot topic extraction device according to an embodiment of the application;
fig. 6 is a schematic diagram of a terminal device according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. However, it will be apparent to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
The embodiments of the present application address the problems of the existing topic extraction algorithms in the prior art. News texts are first cleaned and segmented into words, the segmentation results are then filtered, and only a representative subset of the segmented words is selected for feature extraction. A first language model (a word2vec model) is trained on a large historical news corpus (e.g., 20 GB). The segmented words are mapped to vectors by the first language model, weighted by their TF-IDF values, and combined into the corresponding sentence vector. The generated sentence vectors are then clustered by an improved SinglePass algorithm to produce a plurality of clusters. Finally, hotspots are generated according to cluster size, and the text at the cluster center is selected as the final hot topic. The method accurately extracts the global features of news text; experimental results show that hot topics extracted according to this embodiment achieve high precision and recall, incremental hot topic extraction is supported, and the requirement of extracting hot topics online in real time can be met.
The technical solution of the present application will be described below by way of specific examples.
Referring to fig. 1, a schematic flow chart illustrating steps of a method for extracting a hot topic according to an embodiment of the present application is shown, and the method specifically includes the following steps:
s101, collecting a plurality of news texts;
in this embodiment of the present application, the plurality of news texts may be used for clustering, and news information or news reports of corresponding news hot topics may be extracted according to the clustering result, and the specific type of the news text is not limited in this embodiment.
In a particular implementation, news text may be crawled from various news websites, web portals, etc. by web crawlers or other forms.
Generally, in order to guarantee timeliness of subsequent topic extraction, news within a certain time period can be captured according to release or online time of the news. For example, news released within the past hour or two is captured.
S102, aiming at any news text, extracting a plurality of characteristic words of the news text;
In the embodiment of the application, all collected news texts can be processed one by one, and each news item is converted into a format that can be fed into the subsequent models for processing.
In a specific implementation, for any news text, a plurality of feature words of the text may be extracted first.
In general, for a news report, the headline should be a summary of the content of the entire report; in addition, the first few paragraphs of a news article usually include a brief introduction to the whole story. Therefore, the feature words of a news text can be extracted mainly from the headline and the opening part of the article.
In a specific implementation, the news headline and the body are concatenated in order, i.e., combined into the form "headline + body", and a plurality of feature words are then extracted from the front part of the combined text. A feature word may be any word in that part of the content, or any word remaining after data cleaning of that part, in which stop words and single characters without actual meaning are deleted.
To ensure that the feature windows of all news texts are comparable, the lengths of all news items are kept the same, so the title and body of a news text can be intercepted as the string "title + first N characters of the body".
For example, the first 500 characters may be taken from "title + body" and the feature words extracted from those 500 characters. Alternatively, the first 500 characters are taken from "title + body", stop words, numbers and single characters without actual meaning are deleted by data cleaning, and the remaining words are taken as the feature words.
S103, generating a sentence vector corresponding to the news text according to the plurality of feature words;
Because the clustering algorithm cannot operate on raw text input, the news text needs to be vectorized before clustering.
In this embodiment of the present application, the feature words extracted in the foregoing steps may be represented as a sentence vector, and each value in the sentence vector corresponds to one feature word.
Each collected news text can be processed according to the method, and a sentence vector corresponding to the news text is obtained and used for subsequent clustering.
S104, clustering the plurality of news texts based on the sentence vectors corresponding to the news texts to obtain a plurality of clusters;
In the embodiment of the application, after all news texts have been represented as vectors, the sentence vectors corresponding to all the news texts can be used as input data of the clustering algorithm, and the output of the algorithm is the plurality of clusters obtained by clustering.
In a specific implementation, a SinglePass algorithm may be used to cluster the respective news texts that are vectorized.
The SinglePass clustering algorithm is simple in idea and fast in operation. As its name suggests, the algorithm makes only a single pass over all the data, is relatively sensitive to the input order of the data, and has time complexity O(n). During clustering, each cluster has a dynamically updated cluster center, which is the mean of all vectors in the cluster and can serve as a global feature representing the cluster.
S105, extracting hot topics from the plurality of clusters.
In the embodiment of the application, the cluster center of each cluster obtained by clustering can be used as a global feature representing the cluster, so that a final topic generation result can be obtained according to news texts corresponding to cluster center vectors of a plurality of clusters.
In a specific implementation, the title of the news text corresponding to the cluster center vector may be used directly as the final hot topic. Alternatively, the distances between the other vectors in the cluster and the cluster center vector may be calculated, and the title of the news text corresponding to the vector with the minimum distance selected as the final hot topic. The title of the news text corresponding to the cluster center vector and the title of the news text corresponding to the vector with the minimum distance may also be combined in some way, and the combined content used as the final hot topic; this embodiment does not limit the choice.
In the embodiment of the application, for a plurality of collected news texts, a plurality of feature words can be extracted from any news text, a sentence vector corresponding to the news text is generated from the feature words, and the plurality of news texts are then clustered based on their sentence vectors to obtain a plurality of clusters, from which hot topics can conveniently be extracted, improving both the accuracy and the real-time performance of hot topic extraction.
Referring to fig. 2, a schematic flow chart illustrating steps of another hot topic extraction method according to an embodiment of the present application is shown, and specifically, the method may include the following steps:
s201, collecting a plurality of news texts, and extracting a plurality of characteristic words of any news text;
in the embodiment of the application, the plurality of news texts may be used for clustering, and news information or news reports of corresponding news hot topics can be extracted according to clustering results. The news text can be obtained by crawling from various news websites, portal websites and the like through webpage crawlers or other forms.
In a specific implementation, all collected news texts can be processed one by one. For any news text, the text may first be segmented into words; for example, the jieba word segmentation tool may be used, and the segmentation results stored in a list.
The word segmentation result contains some interfering information that does not help represent the global features of the sentence, and the word vectors of such words are not representative, which would greatly affect the accuracy of the generated sentence vectors. Therefore, non-target words such as stop words, pure numbers, and single characters can be deleted from the segmentation result.
For example, for a news text whose headline reads "Company news focus for the 16th: electric appliance maker's equity transfer approved", the result "company / news / focus / electric appliance / equity / transfer / approved" can be obtained after segmentation and filtering.
For a target text obtained after word segmentation and deletion of non-target words, a plurality of words in preset text positions of the target text can be extracted as feature words. The preset position may be a position at the front of the target text, for example, the first 100 words, and so on.
In the embodiment of the application, the "title + first 500 characters of the body" may be intercepted first, the 500 characters then segmented into words, and the non-target words that might affect subsequent processing deleted to obtain the plurality of feature words. Alternatively, the whole "title + body" may be segmented first, the non-target words deleted, and a certain number of leading words, for example the first 100 words, extracted from the remaining words as the feature words. The order of the text interception, word segmentation and feature word extraction steps is not limited in this embodiment.
The plurality of feature words may be represented in the form of a sentence X, for example X = [x0, x1, ..., xn].
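For illustration only, the feature word extraction of step S201 might be implemented as in the following sketch. The use of jieba follows the description above; the placeholder stop-word list and the function name are assumptions, not part of the original disclosure.

```python
import re
import jieba

STOP_WORDS = {"的", "了", "在", "是", "和"}  # placeholder stop-word list (assumption)

def extract_feature_words(title: str, body: str, max_chars: int = 500) -> list[str]:
    """Segment "title + first 500 characters of the body" and drop non-target words."""
    text = (title + body)[:max_chars]        # intercept the "title + body" prefix
    words = jieba.lcut(text)                 # word segmentation, results kept in a list
    features = []
    for w in words:
        w = w.strip()
        if not w or w in STOP_WORDS:         # drop stop words
            continue
        if re.fullmatch(r"[0-9.%]+", w):     # drop pure numbers
            continue
        if len(w) < 2:                       # drop single characters
            continue
        features.append(w)
    return features                          # the sentence X = [x0, x1, ..., xn]
```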
S202, mapping each feature word into a dense vector of a preset dimension according to a preset first language model, wherein the first language model is obtained by training on sample news texts with a preset skip-gram model;
In an embodiment of the application, the first language model may be trained based on the word2vec model. word2vec is a deep learning algorithm proposed by Mikolov in 2013. It is based on the language-model hypothesis that the meaning of a word can be inferred from its context, and it turns words into dense vector representations learned from the corpus. It includes two word vectorization modes: the continuous bag-of-words model (CBOW) and the skip-gram model (Skip-Gram).
In a specific implementation, a word2vec model may be trained using a certain amount, for example 20 GB, of full-web historical news as the sample news texts. The training parameters may be chosen as follows: the word vector dimension is 100, the window size is 10, the minimum word frequency is 8, the model is the Skip-Gram model, the number of training epochs is 20, and the remaining parameters take their default values.
This finally yields a 2.1 GB first language model; using this model, each word that occurs more than 8 times in the corpus can be represented as a 100-dimensional dense vector W(x) = [w0, w1, ..., w99].
Therefore, for each feature word, the trained first language model can be used to map the feature word into a dense vector of 100 dimensions.
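A minimal training sketch using the gensim library is shown below for illustration; gensim itself, the corpus file name, and the line format of the corpus are assumptions not stated in the patent, while the hyperparameters follow the values given above.

```python
from gensim.models import Word2Vec

# Each line of the corpus file is assumed to be one pre-segmented news text,
# with words separated by spaces.
class NewsCorpus:
    def __init__(self, path: str):
        self.path = path
    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                yield line.split()

model = Word2Vec(
    sentences=NewsCorpus("historical_news_segmented.txt"),
    vector_size=100,   # word vector dimension
    window=10,         # context window size
    min_count=8,       # minimum word frequency
    sg=1,              # 1 = Skip-Gram
    epochs=20,         # number of training passes
)
model.save("word2vec_news.model")

# A word seen at least 8 times can now be looked up as a 100-dimensional dense vector,
# e.g. model.wv["股权"] (hypothetical lookup).
```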
S203, determining the weight value of each feature word according to a preset second language model, wherein the second language model is obtained by counting the inverse document frequency of each word in a sample news text;
in the embodiment of the present application, the second language model may be a term Frequency-Inverse Document Frequency index (TF-IDF) model. TF refers to word frequency and IDF refers to the inverse document frequency index, both of which are commonly used in combination to assess how important a word is to the entire document. If a word occurs more frequently in a certain text and rarely occurs in other documents, the word has better distinguishing capability and is suitable for representing the global characteristics of the text.
In this embodiment, the inverse document frequency of each word can be computed from the sample news texts and their word segmentation results to form a dictionary, and the IDF values are then modeled and stored. The sample news texts may be news texts in a historical corpus.
In the embodiment of the present application, the IDF may be calculated as:

IDF(x) = log( N / N(x) )

where N represents the total number of documents and N(x) represents the number of documents containing the term x. The stored dictionary may measure how important each word is in an article, for the later weighted generation of sentence vectors.
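As an illustrative sketch of this step, the IDF dictionary might be built from a segmented sample corpus as follows; the function name and input format are assumptions.

```python
import math
from collections import Counter

def build_idf_dictionary(segmented_docs: list[list[str]]) -> dict[str, float]:
    """Count, for every word, the number of documents containing it and store IDF values."""
    total_docs = len(segmented_docs)
    doc_freq = Counter()
    for words in segmented_docs:
        doc_freq.update(set(words))          # each document counts a word at most once
    # IDF(x) = log(N / N(x)), following the formula above
    return {word: math.log(total_docs / df) for word, df in doc_freq.items()}
```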
S204, generating a sentence vector corresponding to the news text according to the dense vector of each feature word and the weight value;
Because the clustering algorithm cannot operate on raw text input, the news text needs to be vectorized before clustering.
In the embodiment of the present application, the IDF value may be used to weight the word vectors: the higher a word's IDF value (i.e., its importance), the higher its weight. Using the first language model (the word2vec model) and the second language model (the TF-IDF model) obtained above, each news text can be represented as a 100-dimensional dense sentence vector S(x) = [s0, s1, ..., s99]. For each dimension of the sentence vector S(x), the value equals the sum, over the feature words, of the word vector's value in that dimension multiplied by the word's IDF value, averaged over the number of words contained in the sentence.
Therefore, in a specific implementation, for each feature word, the product of the value of its dense vector in a given dimension and its weight value (the IDF value of the feature word) may be calculated. The sum of these products is then divided by the number of feature words, and the resulting ratio is taken as the vector value of that dimension of the sentence vector, yielding the sentence vector corresponding to the current news text.
The above calculation process can be expressed as the following formula:

s_j = (1 / n) · Σ_{i=1}^{n} w_j(x_i) · IDF(x_i),  j = 0, 1, ..., 99

where n is the number of feature words, w_j(x_i) is the value of the j-th dimension of the dense vector of feature word x_i, and IDF(x_i) is the weight value of x_i.
It should be noted that uncommon words that do not exist in the first language model or the second language model may simply be skipped without the weighting operation, so that every news text can be represented as a 100-dimensional dense sentence vector and used as input data for the clustering algorithm.
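Combining the two models, the sentence vector generation of steps S202 to S204 might look like the following sketch; the gensim model and the IDF dictionary from the sketches above, and the function name, are assumptions.

```python
import numpy as np
from gensim.models import Word2Vec

def sentence_vector(feature_words: list[str],
                    w2v: Word2Vec,
                    idf: dict[str, float],
                    dim: int = 100) -> np.ndarray:
    """Weight each word vector by its IDF value and average over the feature words."""
    total = np.zeros(dim)
    n = 0
    for word in feature_words:
        if word not in w2v.wv or word not in idf:
            continue                       # skip uncommon words absent from either model
        total += w2v.wv[word] * idf[word]  # IDF-weighted word vector
        n += 1
    return total / n if n else total       # s_j = (1/n) * sum_i w_j(x_i) * IDF(x_i)
```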
S205, clustering the plurality of news texts based on the sentence vectors corresponding to the news texts to obtain a plurality of clusters;
In the embodiment of the application, clustering can be performed based on an improved SinglePass algorithm. In the clusters obtained by the SinglePass algorithm, each cluster has a dynamically updated cluster center, which is the mean of all vectors in the cluster and can serve as a global feature representing the cluster. The cluster to which a node belongs is determined by calculating distances; in this embodiment the Euclidean distance may be used as the similarity measure between nodes.
Generally, in some clustering algorithms, clustering accuracy cannot be guaranteed because of the interference caused by outliers. In this embodiment, to eliminate such interference, outlier detection is performed each time a new node is inserted, on the basis of SinglePass clustering, so as to reduce the influence of outliers on the final clustering result.
As shown in fig. 3, which is a flowchart of an improved SinglePass clustering algorithm provided in this embodiment of the present application, according to the flowchart shown in fig. 3, a process of clustering news texts after being represented in a vector may include the following steps:
Algorithm input: the clustering threshold and the text feature vectors;
Step 1: add the feature vector of the first text to the first cluster and set it as the cluster center;
Step 2: traverse all text feature vectors;
Step 3: traverse all cluster centers;
Step 4: calculate the Euclidean distance between the current text feature vector and each cluster center;
Step 5: record the cluster with the minimum distance to the current text, and record the value of that distance;
Step 6: if the distance is smaller than the clustering threshold, add the text feature vector to the cluster with the minimum distance, update the center of that cluster, and execute Step 7;
Step 7: traverse the current cluster; if the distance from any vector to the center is greater than the threshold, that vector is determined to be an outlier and is removed from the current cluster, and Step 4 is executed for that vector;
Step 8: if the distance is greater than the clustering threshold, create a new cluster, insert the vector into it, and update the cluster center;
Algorithm output: a plurality of clusters, all vectors of each cluster, and the center vector of each cluster.
According to the improved SinglePass clustering algorithm, during clustering any sentence vector may be taken as the first cluster and set as that cluster's center. The Euclidean distances between the remaining sentence vectors and the cluster centers are then calculated in turn: if the distance is smaller than the clustering threshold, the vector is added to that cluster and the cluster center is updated; if the distance is larger than the clustering threshold, a new cluster is created and the vector is added to it. The Euclidean distance between each sentence vector and each existing cluster center is calculated in this loop until every sentence vector has been added to some cluster, completing the clustering of all news texts.
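For illustration only, the improved SinglePass procedure of Fig. 3 might be sketched as follows; the class and method names are assumptions and this is not the patent's reference implementation.

```python
import numpy as np

class ImprovedSinglePass:
    """SinglePass clustering with outlier removal after each insertion (sketch)."""

    def __init__(self, threshold: float):
        self.threshold = threshold
        self.clusters: list[list[np.ndarray]] = []   # member vectors per cluster
        self.centers: list[np.ndarray] = []          # dynamically updated cluster centers

    def _nearest(self, vec: np.ndarray) -> tuple[int, float]:
        dists = [np.linalg.norm(vec - c) for c in self.centers]
        idx = int(np.argmin(dists))
        return idx, dists[idx]

    def add(self, vec: np.ndarray) -> int:
        """Insert one sentence vector; return the index of the cluster it joined."""
        if not self.clusters:                         # Step 1: first vector starts cluster 0
            self.clusters.append([vec])
            self.centers.append(vec.copy())
            return 0
        idx, dist = self._nearest(vec)                # Steps 3-5: closest cluster center
        if dist < self.threshold:                     # Step 6: join the closest cluster
            self.clusters[idx].append(vec)
            self.centers[idx] = np.mean(self.clusters[idx], axis=0)
            self._remove_outliers(idx)                # Step 7: outlier detection
            return idx
        self.clusters.append([vec])                   # Step 8: open a new cluster
        self.centers.append(vec.copy())
        return len(self.clusters) - 1

    def _remove_outliers(self, idx: int) -> None:
        center = self.centers[idx]
        members = self.clusters[idx]
        keep = [v for v in members if np.linalg.norm(v - center) <= self.threshold]
        outliers = [v for v in members if np.linalg.norm(v - center) > self.threshold]
        if not outliers or not keep:                  # nothing to remove, or removal would empty the cluster
            return
        self.clusters[idx] = keep
        self.centers[idx] = np.mean(keep, axis=0)
        for v in outliers:                            # re-insert each removed vector (back to Step 4)
            self.add(v)
```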
In the embodiment of the application, for newly acquired news texts, the cluster to which the newly added news text belongs can be determined according to the clustering mode.
In a specific implementation, when a new news text is collected, the distances between the sentence vector corresponding to the new news text and the cluster center vectors of the existing clusters can be calculated one by one. When the distance between the sentence vector of the newly added news text and the cluster center vector of a target cluster is smaller than the preset threshold, the newly added news text is added to the target cluster and the calculation of distances to the other cluster centers is stopped; the target cluster may be any one of the plurality of clusters.
If the distances between the sentence vector of the newly added news text and the cluster center vectors of all existing clusters are larger than the preset threshold, the newly added news text is considered not to belong to any existing cluster; a new cluster is then created and the newly added news text is inserted into it.
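Under the same assumptions as the sketch above, the incremental assignment of a newly collected news text reduces to another call to add; the input vectors below are placeholder data for illustration.

```python
import numpy as np

# Illustrative inputs (assumed): 100-dimensional sentence vectors produced as in S202-S204.
sentence_vectors = [np.random.rand(100) for _ in range(1000)]   # placeholder data
new_sentence_vector = np.random.rand(100)                       # a newly collected news text

sp = ImprovedSinglePass(threshold=1.0)          # the clustering threshold value is an assumption
labels = [sp.add(v) for v in sentence_vectors]  # initial clustering of collected texts

# Incremental assignment: the new text either joins the nearest existing cluster
# (distance below the threshold) or starts a new cluster.
new_label = sp.add(new_sentence_vector)
print(f"new news text assigned to cluster {new_label} of {len(sp.clusters)}")
```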
S206, aiming at any cluster, determining a cluster center vector of the cluster; respectively calculating the distance between each sentence vector in the clustering cluster and the cluster center vector;
after the clustering of the news texts is completed, corresponding hot topics can be generated according to the clusters obtained by clustering.
S207, extracting a target news text corresponding to the sentence vector with the minimum distance from the cluster center vector;
In a specific implementation, for any cluster, the cluster center vector of the cluster may be determined first, the Euclidean distances between each vector in the cluster and the center vector are then calculated, and the target news text corresponding to the vector with the minimum distance is selected; this target news text is the reference text subsequently used to generate the hot topic.
S208, determining hot topics according to the news titles of the target news texts.
In the embodiment of the application, for the identified target news text, the news title of the target news text can be directly used as the finally determined hot topic.
On the other hand, after the hot topic is generated, the news titles corresponding to the other vectors in the cluster can be selected to generate a similar-news list.
It should be noted that, because news is often reprinted, the generated similar-news list may contain several news items with the same title. For news items with the same title, only one is retained.
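A possible realization of steps S206 to S208 is sketched below; the function name is illustrative, and it is assumed that each news item in a cluster carries a title.

```python
import numpy as np

def extract_hot_topic(cluster_vectors: list[np.ndarray],
                      cluster_titles: list[str]) -> tuple[str, list[str]]:
    """Pick the title closest to the cluster center as the hot topic,
    and build a de-duplicated similar-news list from the rest."""
    center = np.mean(cluster_vectors, axis=0)              # cluster center vector
    dists = [np.linalg.norm(v - center) for v in cluster_vectors]
    best = int(np.argmin(dists))                           # text nearest to the center
    topic = cluster_titles[best]

    similar, seen = [], {topic}
    for i in np.argsort(dists):
        title = cluster_titles[i]
        if title in seen:                                  # reprints share a title; keep one
            continue
        seen.add(title)
        similar.append(title)
    return topic, similar
```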
In this embodiment of the present application, each of the generated clusters may have a corresponding time attribute, which indicates the specific time period in which the news forming the cluster was collected.
That is, the hot topic may be generated in a sliding manner according to a certain time window, for example, one hour may be selected as the time window, and hot topic extraction may be performed on news of the last hour each time.
Since topics in different time windows may be repeated or similar, it is necessary to recall the historical hot spots and classify the new news text into a certain hot topic that has been extracted before.
In the embodiment of the application, the historical clusters to be processed can be determined according to the time attribute; then, for any cluster, the similarity between the cluster and each historical cluster is calculated; if the similarity is smaller than the similarity threshold, the cluster can be merged with the corresponding historical cluster.
As shown in fig. 4, which is a flowchart of a historical hot topic recall algorithm according to an embodiment of the present application, recalling a historical hot topic according to the flowchart shown in fig. 4 may include the following steps:
Algorithm input: the center vectors of the historical hotspot clusters, the center vectors of the new hotspot clusters, the similar news of the new hotspot clusters, and the similarity threshold;
Step 1: traverse all new hotspot center vectors;
Step 2: traverse all historical hotspot cluster center vectors;
Step 3: calculate the Euclidean distances between the center vectors, then record and sort the distances;
Step 4: select the historical hotspot cluster with the minimum distance to the new hotspot center vector;
Step 5: if the distance is smaller than the similarity threshold, recall the new hotspot into the historical hotspot and merge the similar news;
Step 6: if the distance is larger than the similarity threshold, the recall fails and a new hotspot is generated;
Algorithm output: the historical hotspot list and the new hotspot list.
According to this algorithm, for each new hot topic generated in a time window, the distance between the cluster to which the new hot topic belongs and each historical cluster, i.e. the distance between the center vectors of the two clusters, is calculated; if this distance is smaller than the preset similarity threshold, the new hot topic is merged with the hot topic of the similar historical cluster, thereby ensuring the accuracy of hot topic pushing.
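The recall procedure of Fig. 4 might be sketched as follows for illustration; the dictionary-based cluster representation and the function name are assumptions.

```python
import numpy as np

def recall_history(new_clusters: list[dict],
                   history_clusters: list[dict],
                   sim_threshold: float) -> tuple[list[dict], list[dict]]:
    """Merge each new hotspot into the closest historical hotspot when the
    center-to-center Euclidean distance is below the similarity threshold.
    Each cluster dict is assumed to hold a 'center' vector and a 'news' list."""
    remaining_new = []
    for new in new_clusters:
        if not history_clusters:
            remaining_new.append(new)
            continue
        dists = [np.linalg.norm(new["center"] - h["center"]) for h in history_clusters]
        best = int(np.argmin(dists))
        if dists[best] < sim_threshold:            # recall succeeded: merge similar news
            history_clusters[best]["news"].extend(new["news"])
        else:                                      # recall failed: keep as a new hotspot
            remaining_new.append(new)
    return history_clusters, remaining_new
```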
Based on the improved SinglePass clustering algorithm of the embodiments of the present application, outlier detection is performed each time a node is added; if a vector is too far from the cluster center, it is removed from the current cluster, which preserves the representativeness of the cluster center and the accuracy of the clustering result. Secondly, the historical hotspot recall algorithm provided in this embodiment can effectively judge the relation between a new hotspot and a historical hotspot, and merges hotspots with the same theme and similar news, ensuring the accuracy of real-time pushing. Thirdly, by vectorizing sentences with word2vec and TF-IDF, this embodiment represents the global features of a sentence vector more accurately, eliminates the interference of irrelevant words, and supports real-time vectorization, so the time requirements of practical applications can be met. The hot topic extraction method provided by the embodiments of the present application implements news sentence vector representation, hot topic clustering, hot title screening, historical hotspot recall and other functions; it alleviates the problems that sentence vector representation is inaccurate and incremental clustering is not supported in existing algorithms, requires no prior knowledge for large-scale dynamic news data, requires no obvious features in the news, and the overall algorithm has good universality.
It should be noted that, the sequence numbers of the steps in the foregoing embodiments do not mean the execution sequence, and the execution sequence of each process should be determined by the function and the inherent logic of the process, and should not constitute any limitation on the implementation process of the embodiments of the present application.
Referring to fig. 5, a schematic diagram of a hot topic extraction apparatus according to an embodiment of the present application is shown, and specifically, the hot topic extraction apparatus may include the following modules:
a news text collection module 501, configured to collect a plurality of news texts;
a feature word extracting module 502, configured to extract, for any news text, a plurality of feature words of the news text;
a sentence vector generating module 503, configured to generate a sentence vector corresponding to the news text according to the plurality of feature words;
a news text clustering module 504, configured to cluster the plurality of news texts based on the sentence vectors corresponding to the news texts to obtain a plurality of clusters;
a hot topic extraction module 505, configured to extract hot topics from the plurality of clusters.
In this embodiment of the present application, the feature word extraction module 502 may specifically include the following sub-modules:
the target text acquisition submodule is used for segmenting any news text, deleting non-target words after segmentation and acquiring the target text, wherein the non-target words comprise at least one of stop words, numbers or single words;
and the characteristic word extraction submodule is used for extracting a plurality of characteristic words in the preset text position of the target text.
In this embodiment of the present application, the sentence vector generating module 503 may specifically include the following sub-modules:
the dense vector mapping submodule is used for mapping each feature word into a dense vector of a preset dimension according to a preset first language model, and the first language model is obtained by training a sample news text by adopting a preset skip word model;
the weight value determining submodule is used for determining the weight value of each feature word according to a preset second language model, and the second language model is obtained by counting the inverse document frequency of each word in a sample news text;
and the sentence vector generation submodule is used for generating a sentence vector corresponding to the news text according to the dense vector of each feature word and the weight value.
In this embodiment of the present application, the sentence vector generation submodule may specifically include the following units:
the product calculation unit is used for calculating products of the values of the dense vectors corresponding to the feature words and the weight values of the feature words respectively aiming at any feature word;
and the sentence vector generating unit is used for calculating the ratio between the product and the number of all the characteristic words, and taking the ratio as the vector value of the dimensionality corresponding to the characteristic words in the sentence vector to obtain the sentence vector corresponding to the news text.
In this embodiment of the application, the hot topic extraction module 505 may specifically include the following sub-modules:
the cluster center vector determining submodule is used for determining a cluster center vector of any cluster;
the distance calculation submodule is used for calculating the distance between each sentence vector in the clustering cluster and the cluster center vector respectively;
the target news text extraction submodule is used for extracting a target news text corresponding to a sentence vector with the minimum distance from the cluster center vector;
and the hot topic determining submodule is used for determining the hot topic according to the news title of the target news text.
In this embodiment, the apparatus may further include the following modules:
a newly added news text distance calculation module, configured to calculate, one by one, the distances between the sentence vector corresponding to a newly added news text and the cluster center vectors of the plurality of clusters when the newly added news text is collected;
a newly added news text classification module, configured to add the newly added news text to a target cluster when the distance between the sentence vector corresponding to the newly added news text and the cluster center vector of the target cluster is smaller than a preset threshold, and to stop calculating the distances between the sentence vector corresponding to the newly added news text and the cluster center vectors of the other clusters; and, if the distances between the sentence vector corresponding to the newly added news text and the cluster center vectors of all the clusters are larger than the preset threshold, to create a new cluster and insert the newly added news text into the new cluster, wherein the target cluster is any one of the plurality of clusters.
In this embodiment of the present application, the plurality of clusters each have a corresponding time attribute, and the apparatus may further include the following modules:
a historical cluster determination module, configured to determine the historical clusters to be processed according to the time attributes;
a similarity calculation module, configured to calculate, for any cluster, the similarity between the cluster and each historical cluster;
and a cluster merging module, configured to merge a cluster whose similarity is smaller than a similarity threshold with the corresponding historical cluster.
For the apparatus embodiment, since it is substantially similar to the method embodiment, it is described relatively simply, and reference may be made to the description of the method embodiment section for relevant points.
Referring to fig. 6, a schematic diagram of a terminal device according to an embodiment of the present application is shown. As shown in fig. 6, the terminal device 600 of the present embodiment includes: a processor 610, a memory 620, and a computer program 621 stored in the memory 620 and operable on the processor 610. When the processor 610 executes the computer program 621, the steps in the embodiments of the hot topic extraction method described above, such as steps S101 to S105 shown in fig. 1, are implemented. Alternatively, the processor 610, when executing the computer program 621, implements the functions of each module/unit in each device embodiment described above, such as the functions of the modules 501 to 505 shown in fig. 5.
Illustratively, the computer program 621 may be divided into one or more modules/units, which are stored in the memory 620 and executed by the processor 610 to accomplish the present application. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which may be used to describe the execution process of the computer program 621 in the terminal device 600. For example, the computer program 621 may be divided into a news text collection module, a feature word extraction module, a sentence vector generation module, a news text clustering module, and a hot topic extraction module, and the specific functions of each module are as follows:
a news text collection module, configured to collect a plurality of news texts;
a feature word extraction module, configured to extract, for any news text, a plurality of feature words of the news text;
a sentence vector generation module, configured to generate a sentence vector corresponding to the news text according to the plurality of feature words;
a news text clustering module, configured to cluster the plurality of news texts based on the sentence vectors corresponding to the news texts to obtain a plurality of clusters;
and a hot topic extraction module, configured to extract hot topics from the plurality of clusters.
The terminal device 600 may be a desktop computer, a notebook, a palm computer, a cloud server, or other computing devices. The terminal device 600 may include, but is not limited to, a processor 610, a memory 620. Those skilled in the art will appreciate that fig. 6 is only one example of a terminal device 600 and does not constitute a limitation of the terminal device 600 and may include more or less components than those shown, or combine certain components, or different components, for example, the terminal device 600 may also include input and output devices, network access devices, buses, etc.
The Processor 610 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), an off-the-shelf Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component, etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The storage 620 may be an internal storage unit of the terminal device 600, such as a hard disk or a memory of the terminal device 600. The memory 620 may also be an external storage device of the terminal device 600, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and so on, provided on the terminal device 600. Further, the memory 620 may also include both an internal storage unit and an external storage device of the terminal device 600. The memory 620 is used for storing the computer program 621 and other programs and data required by the terminal device 600. The memory 620 may also be used to temporarily store data that has been output or is to be output.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same. Although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. A hot topic extraction method is characterized by comprising the following steps:
collecting a plurality of news texts;
for any news text, extracting a plurality of feature words of the news text;
generating a sentence vector corresponding to the news text according to the plurality of feature words;
clustering the plurality of news texts based on the sentence vectors corresponding to the news texts to obtain a plurality of clusters;
extracting hot topics from the plurality of clusters.
2. The method of claim 1, wherein the extracting a plurality of feature words of the news text for any news text comprises:
for any news text, segmenting the news text, and deleting non-target words after segmentation to obtain a target text, wherein the non-target words comprise at least one of stop words, numbers or single words;
and extracting a plurality of feature words in preset text positions of the target text.
3. The method of claim 1 or 2, wherein the generating a sentence vector corresponding to the news text from the plurality of feature words comprises:
mapping each feature word into a dense vector of a preset dimension according to a preset first language model, wherein the first language model is obtained by training on sample news texts with a preset skip-gram model;
determining the weight value of each feature word according to a preset second language model, wherein the second language model is obtained by counting the inverse document frequency of each word in a sample news text;
and generating a sentence vector corresponding to the news text according to the dense vector of each feature word and the weight value.
4. The method of claim 3, wherein generating a sentence vector corresponding to the news text according to the dense vector of each feature word and the weight value comprises:
aiming at any feature word, respectively calculating the product of the value of the dense vector corresponding to the feature word and the weight value of the feature word;
and calculating the ratio between the product and the number of all the feature words, and taking the ratio as the vector value of the dimensionality corresponding to the feature words in the sentence vector to obtain the sentence vector corresponding to the news text.
5. The method of claim 1, 2 or 4, wherein the extracting of the hot topics from the plurality of cluster clusters comprises:
for any cluster, determining a cluster center vector of the cluster;
respectively calculating the distance between each sentence vector in the clustering cluster and the cluster center vector;
extracting a target news text corresponding to a sentence vector with the minimum distance to the cluster center vector;
and determining hot topics according to the news titles of the target news texts.
6. The method of claim 5, further comprising:
when a newly added news text is collected, calculating, one by one, the distances between the sentence vector corresponding to the newly added news text and the cluster center vectors of the plurality of clusters;
when the distance between the sentence vector corresponding to the newly added news text and the cluster center vector of a target cluster is smaller than a preset threshold, adding the newly added news text to the target cluster, and stopping calculating the distances between the sentence vector corresponding to the newly added news text and the cluster center vectors of the other clusters, wherein the target cluster is any one of the clusters;
and if the distances between the sentence vector corresponding to the newly added news text and the cluster center vectors of all the clusters are larger than the preset threshold, creating a new cluster and inserting the newly added news text into the new cluster.
7. The method of claim 1 or 2 or 4 or 6, wherein the plurality of clusters each have a respective temporal attribute, the method further comprising:
determining a historical clustering cluster to be processed according to the time attribute;
for any cluster, respectively calculating the similarity between the cluster and the historical clusters;
if the similarity is smaller than the similarity threshold, merging the cluster smaller than the similarity threshold with the historical cluster.
8. A hot topic extraction device, comprising:
a news text collection module, configured to collect a plurality of news texts;
a feature word extraction module, configured to extract, for any news text, a plurality of feature words of the news text;
a sentence vector generation module, configured to generate a sentence vector corresponding to the news text according to the plurality of feature words;
a news text clustering module, configured to cluster the plurality of news texts based on the sentence vectors corresponding to the news texts to obtain a plurality of clusters;
and a hot topic extraction module, configured to extract hot topics from the plurality of clusters.
9. A terminal device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the hot topic extraction method as recited in any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the hot topic extraction method as recited in any one of claims 1 to 7.
CN202010231954.9A 2020-03-27 2020-03-27 Hot topic extraction method, device, terminal equipment and storage medium Active CN111460153B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010231954.9A CN111460153B (en) 2020-03-27 2020-03-27 Hot topic extraction method, device, terminal equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111460153A true CN111460153A (en) 2020-07-28
CN111460153B CN111460153B (en) 2023-09-22

Family

ID=71681517

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010231954.9A Active CN111460153B (en) 2020-03-27 2020-03-27 Hot topic extraction method, device, terminal equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111460153B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080091405A1 (en) * 2006-10-10 2008-04-17 Konstantin Anisimovich Method and system for analyzing various languages and constructing language-independent semantic structures
CN102937960A (en) * 2012-09-06 2013-02-20 北京邮电大学 Device and method for identifying and evaluating emergency hot topic
CN109710728A (en) * 2018-11-26 2019-05-03 西南电子技术研究所(中国电子科技集团公司第十研究所) News topic automatic discovering method

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111914536B (en) * 2020-08-06 2021-12-17 北京嘀嘀无限科技发展有限公司 Viewpoint analysis method, viewpoint analysis device, viewpoint analysis equipment and storage medium
CN111914536A (en) * 2020-08-06 2020-11-10 北京嘀嘀无限科技发展有限公司 Viewpoint analysis method, viewpoint analysis device, viewpoint analysis equipment and storage medium
CN112257801A (en) * 2020-10-30 2021-01-22 浙江商汤科技开发有限公司 Incremental clustering method and device for images, electronic equipment and storage medium
CN112613296A (en) * 2020-12-07 2021-04-06 深圳价值在线信息科技股份有限公司 News importance degree acquisition method and device, terminal equipment and storage medium
CN112989042A (en) * 2021-03-15 2021-06-18 平安科技(深圳)有限公司 Hot topic extraction method and device, computer equipment and storage medium
CN112989042B (en) * 2021-03-15 2024-03-15 平安科技(深圳)有限公司 Hot topic extraction method and device, computer equipment and storage medium
CN113407679B (en) * 2021-06-30 2023-10-03 竹间智能科技(上海)有限公司 Text topic mining method and device, electronic equipment and storage medium
CN113407679A (en) * 2021-06-30 2021-09-17 竹间智能科技(上海)有限公司 Text topic mining method and device, electronic equipment and storage medium
CN113420723A (en) * 2021-07-21 2021-09-21 北京有竹居网络技术有限公司 Method and device for acquiring video hotspot, readable medium and electronic equipment
WO2023000782A1 (en) * 2021-07-21 2023-01-26 北京有竹居网络技术有限公司 Method and apparatus for acquiring video hotspot, readable medium, and electronic device
WO2023009256A1 (en) * 2021-07-26 2023-02-02 Microsoft Technology Licensing, Llc Computing system for news aggregation
CN113761196B (en) * 2021-07-28 2024-02-20 北京中科模识科技有限公司 Text clustering method and system, electronic equipment and storage medium
CN113761196A (en) * 2021-07-28 2021-12-07 北京中科模识科技有限公司 Text clustering method and system, electronic device and storage medium
CN116361470A (en) * 2023-04-03 2023-06-30 北京中科闻歌科技股份有限公司 Text clustering cleaning and merging method based on topic description
CN116049414A (en) * 2023-04-03 2023-05-02 北京中科闻歌科技股份有限公司 Topic description-based text clustering method, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN111460153B (en) 2023-09-22

Similar Documents

Publication Publication Date Title
CN111460153B (en) Hot topic extraction method, device, terminal equipment and storage medium
CN110162695B (en) Information pushing method and equipment
US11238310B2 (en) Training data acquisition method and device, server and storage medium
TWI653542B (en) Method, system and device for discovering and tracking hot topics based on network media data flow
Li et al. Filtering out the noise in short text topic modeling
CN106874292B (en) Topic processing method and device
WO2022095374A1 (en) Keyword extraction method and apparatus, and terminal device and storage medium
CN111814770B (en) Content keyword extraction method of news video, terminal device and medium
WO2015149533A1 (en) Method and device for word segmentation processing on basis of webpage content classification
CN110851598B (en) Text classification method and device, terminal equipment and storage medium
CN112148889A (en) Recommendation list generation method and device
CN110413787B (en) Text clustering method, device, terminal and storage medium
US10482146B2 (en) Systems and methods for automatic customization of content filtering
CN108090216B (en) Label prediction method, device and storage medium
CN110210028A (en) For domain feature words extracting method, device, equipment and the medium of speech translation text
CN111737997A (en) Text similarity determination method, text similarity determination equipment and storage medium
CN113660541B (en) Method and device for generating abstract of news video
CN111767796A (en) Video association method, device, server and readable storage medium
CN109271624B (en) Target word determination method, device and storage medium
CN114416998A (en) Text label identification method and device, electronic equipment and storage medium
CN112559747A (en) Event classification processing method and device, electronic equipment and storage medium
CN110019763B (en) Text filtering method, system, equipment and computer readable storage medium
CN110825868A (en) Topic popularity based text pushing method, terminal device and storage medium
JP7395377B2 (en) Content search methods, devices, equipment, and storage media
CN112163415A (en) User intention identification method and device for feedback content and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant