CN113987192A - Hot topic detection method based on RoBERTA-WWM and HDBSCAN algorithm

Info

Publication number: CN113987192A
Application number: CN202111615836.9A
Authority: CN
Inventors: 刘锟; 曾曦; 邱梓珩; 陈天莹; 王效武; 魏刚
Original assignee: Shenzhen Wanglian Anrui Network Technology Co ltd; China Electronic Technology Cyber Security Co Ltd
Current assignee: Shenzhen Wanglian Anrui Network Technology Co ltd; China Electronic Technology Cyber Security Co Ltd
Priority date: 2021-12-28
Filing date: 2021-12-28
Publication date: 2022-01-28
Anticipated expiration: 2041-12-28
Also published as: CN113987192B

Abstract

The invention discloses a hot topic detection method based on RoBERTa-WWM and the HDBSCAN algorithm, which comprises offline hot topic detection and online hot topic detection. The offline hot topic detection detects hot topics contained in existing data in a database, and the online hot topic detection detects hot topics occurring on an internet media platform within a given time interval. The method avoids the poor distinguishability between vectors that arises in the traditional technology when topics are represented by keyword vectors, and fundamentally improves the accuracy of topic detection.

Description

Hot topic detection method based on RoBERTA-WWM and HDBSCAN algorithm

Technical Field

The invention belongs to the technical field of natural language processing and network cognitive security, and particularly relates to a hotspot topic detection method based on RoBERTA-WWM and HDBSCAN algorithms.

Background

Hot topic detection is a technology for mining, from massive online public opinion data, the hot topics or events that people are currently following and discussing. Traditional hot topic detection comprises topic detection based on a topic model and topic detection based on text clustering.

With the development of natural language processing technology, topic detection based on text clustering has become the most common approach. It first represents the text data as vectors that are convenient for mathematical computation, then divides the collected text data into clusters by computing the similarity between texts, then ranks all clusters according to the interaction information (such as forwards and likes) attached to the posts in each cluster, and finally selects the highest-ranked clusters as the hot topics.

The topic detection technology based on the text clustering algorithm has the following defects at present:

(1) Topic detection based on a text clustering algorithm must first convert the text data into a vector form that is convenient for mathematical computation. The bag-of-words model, Word2Vec and similar methods commonly used for this purpose share the following main idea: all texts are first preprocessed and segmented into words, the keywords of each text are then collected into a corpus, and the vector representation of each text is finally obtained by mapping its keywords onto the corpus. However, data on current internet media platforms is large in volume, short in text length, non-standard in wording, heavily fragmented and noisy, so the text vectors obtained with existing text representation algorithms have very high dimensionality and very poor distinguishability.

(2) The clustering algorithms commonly used for topic detection at present include the density-based DBSCAN algorithm and the hierarchical HAC algorithm. Both have limitations: the parameters of the DBSCAN algorithm are difficult to tune and the algorithm is difficult to converge when the data volume is large, while the HAC algorithm has high computational complexity. In practical applications, therefore, neither algorithm readily achieves a good topic detection effect.

(3) When representing an obtained topic as a vector, conventional topic detection algorithms use the TF-IDF (term frequency-inverse document frequency) values of the text keywords contained in the topic. However, the high-frequency keywords of two similar events are generally almost the same, so this method cannot distinguish the two events and may even assign them to one topic. In addition, the keyword-based TF-IDF algorithm cannot cope with the evolution and drift of topics. Both problems affect the accuracy of the final topic detection result.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provides a hot topic detection method based on RoBERTA-WWM and HDBSCAN algorithms.

The purpose of the invention is realized by the following technical scheme:

A hot topic detection method based on RoBERTa-WWM and the HDBSCAN algorithm comprises offline hot topic detection and online hot topic detection. The offline hot topic detection detects the hot topics contained in existing data in a database, where the data volume and the number of topics do not change; the online hot topic detection detects the hot topics arising on an internet media platform within a given time interval, where the data volume and the number of topics increase continuously over time;

the offline hot topic detection method comprises the following steps:

A1. a data cleaning step, namely performing data cleaning on the existing text data in the database to remove interference information in the text;

A2. a text vectorization representation step, namely fine-tuning a RoBERTa-WWM model with an attached fine-tuning structure on a data set of labeled similar and dissimilar sentences, and inputting the cleaned text data into the fine-tuned (or trained) RoBERTa-WWM model with the attached fine-tuning structure to obtain vector representations of all the text data;

A3. a clustering step, namely clustering the text vectors obtained in step A2 by using the HDBSCAN algorithm to obtain the topic distribution of the text data;

A4. an effect evaluation and parameter adjustment step, namely evaluating the effect of the offline topic detection model by using two indexes, the silhouette coefficient and the mutual information index, and if the preset effect is not reached, adjusting the parameters of the RoBERTa-WWM model and the HDBSCAN algorithm until an optimal solution is reached;

A5. a result generation step, namely calculating the heat value of each post and the heat value of each topic according to the interaction information of the posts in each topic, sorting by heat value, and determining a hot topic list; and selecting the posts ranked in the top M% by post heat in each hot topic to represent the topic, and calculating the average of the text vectors of these posts as the vector representation of the topic.

According to a preferred embodiment, the interference information in the text in step A1 includes news links and symbols.

According to a preferred embodiment, in step A5, the hot topics are the top N topics whose heat value is greater than a set threshold.

According to a preferred embodiment, in step A5,

the heat calculation formula of the post is as follows:

$$H_i = x \cdot L_i + y \cdot F_i + z \cdot C_i$$

wherein $H_i$ is the heat value of the i-th post, $L_i$ is the number of likes of the i-th post, $F_i$ is the number of forwards of the i-th post, $C_i$ is the number of comments on the i-th post, and x, y and z are weight coefficients obtained by the entropy weight method.
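As a rough illustration of how the weights and the post heat might be computed, the following sketch assumes that the post heat is the weighted sum of likes, forwards and comments with weights x, y and z, and uses a common textbook variant of the entropy weight method applied directly to the raw counts. The function names `entropy_weights` and `post_heat` are hypothetical and do not come from the source.

```python
import math


def entropy_weights(rows):
    """Entropy weight method over rows of [likes, forwards, comments].

    Assumption: raw non-negative counts are used as-is, without min-max scaling.
    """
    n = len(rows)
    if n < 2:
        raise ValueError("the entropy weight method needs at least two posts")
    divergences = []
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        total = sum(column)
        if total == 0:
            # An all-zero indicator carries no information.
            divergences.append(0.0)
            continue
        # Shannon entropy of each post's share of indicator j (0 * log 0 := 0).
        entropy = -sum((v / total) * math.log(v / total) for v in column if v > 0)
        entropy /= math.log(n)  # normalise to the range [0, 1]
        # Indicators that vary more across posts get a larger divergence.
        divergences.append(max(0.0, 1.0 - entropy))
    spread = sum(divergences)
    if spread == 0:
        # Every indicator is uniform: fall back to equal weights.
        return [1.0 / len(divergences)] * len(divergences)
    return [d / spread for d in divergences]


def post_heat(likes, forwards, comments, weights):
    """Weighted sum of the three interaction counts (assumed form of the formula)."""
    x, y, z = weights
    return x * likes + y * forwards + z * comments
```

In this variant an indicator that is identical for every post receives a weight close to zero, because it does not help to tell posts apart.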

According to a preferred embodiment, the heat value of the j-th topic is calculated from the heat values of the posts in the topic, wherein n denotes the number of posts in the topic.
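The selection of hot topics can be sketched as follows. The topic heat formula is not reproduced in this text, so the sketch assumes, purely for illustration, that the heat of a topic is the sum of the heat values of its n posts; an average would be an equally plausible reading. The names `topic_heat` and `hot_topics` are hypothetical.

```python
def topic_heat(post_heats):
    """Heat of one topic. Assumption: the sum of the heat values of its posts."""
    return sum(post_heats)


def hot_topics(topic_to_post_heats, threshold, top_n):
    """Return the top N (topic, heat) pairs whose heat exceeds the threshold."""
    heats = {topic: topic_heat(h) for topic, h in topic_to_post_heats.items()}
    above = [(topic, heat) for topic, heat in heats.items() if heat > threshold]
    # Highest heat first.
    above.sort(key=lambda pair: pair[1], reverse=True)
    return above[:top_n]
```

For example, `hot_topics({"a": [1.0, 2.0], "b": [5.0], "c": [0.5]}, threshold=1.0, top_n=2)` keeps topics `b` and `a` and drops `c`, whose heat is below the threshold.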

According to a preferred embodiment, the online hot topic detection comprises the following steps:

B1. a data acquisition step, namely acquiring network public opinion data in an internet media platform in real time;

B2. an off-line topic detection step, namely selecting the network public opinion data crawled in a fixed time window each time, and performing topic detection on the collected data by using an off-line topic detection method;

B3. a similarity calculation, new topic classification and fusion step, namely sequentially calculating the similarity between each topic newly obtained in step B2 and the existing topics;

if the similarity is greater than a preset threshold, merging the newly obtained topic into the existing topic with the highest similarity, and re-ranking the posts by heat value to update the representation vector of the merged topic; if the similarity is less than the preset threshold, treating the topic as a new topic, and adding its representation vector to the existing topics once that vector has been obtained;

B4. a result generation step, namely obtaining all topics in the fixed time window, sorting them by heat value to obtain a heat ranking list of the topics, and finally selecting the top P topics as the hot topics that people are following and discussing in that time period.
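The merge-or-add logic of step B3 might look like the sketch below. The source only speaks of "similarity", so cosine similarity is an assumption, as are the default threshold of 0.8, the default fraction of 0.5, the dictionary layout of a topic and the function names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def representation(posts, fraction):
    """Mean vector of the hottest `fraction` of posts; posts are (heat, vector)."""
    ranked = sorted(posts, key=lambda p: p[0], reverse=True)
    keep = max(1, math.ceil(len(ranked) * fraction))
    chosen = [vec for _, vec in ranked[:keep]]
    return [sum(vec[d] for vec in chosen) / len(chosen) for d in range(len(chosen[0]))]


def merge_window(existing, new_topics, threshold=0.8, fraction=0.5):
    """Fuse the topics of one time window into the list of existing topics.

    Each topic is a dict {"posts": [(heat, vector), ...], "vector": [...]}.
    """
    for topic in new_topics:
        best, best_sim = None, -1.0
        for old in existing:
            sim = cosine(topic["vector"], old["vector"])
            if sim > best_sim:
                best, best_sim = old, sim
        if best is not None and best_sim > threshold:
            # Merge into the most similar existing topic and refresh its vector.
            best["posts"].extend(topic["posts"])
            best["vector"] = representation(best["posts"], fraction)
        else:
            # No sufficiently similar topic exists: register a new one.
            existing.append(topic)
    return existing
```

Recomputing the vector from the hottest posts after every merge is what lets the representation follow a topic as it drifts.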

The aforementioned main aspects of the invention and their respective further alternatives can be freely combined to form a plurality of aspects, all of which are aspects that can be adopted and claimed by the present invention. The skilled person in the art can understand that there are many combinations, which are all the technical solutions to be protected by the present invention, according to the prior art and the common general knowledge after understanding the scheme of the present invention, and the technical solutions are not exhaustive herein.

The invention has the beneficial effects that:

The method of the invention represents texts with RoBERTa-WWM (A Robustly Optimized BERT Pretraining Approach, Whole Word Masking), a pre-trained language model for the Chinese language environment, and adds a fine-tuning structure on top of the model, so that the text vectors obtained from the RoBERTa-WWM model retain the semantic and contextual information of the texts more completely. This avoids the poor distinguishability between vectors caused by representing topics with keyword vectors, and fundamentally improves the accuracy of topic detection.

The method of the invention innovatively uses the HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) algorithm to cluster the text vectors. The algorithm is better suited to the characteristics of data on current internet media platforms, and it also reduces the complexity and computational cost of the topic detection algorithm.

The method updates the representation vector of each topic by using the interaction information of the posts the topic contains. Because the influence and propagation capacity of each post in the topic are taken into account, the model can represent the topic more accurately, and the influence of topic drift and evolution is avoided.

Drawings

FIG. 1 is a schematic flow chart of an offline hot topic detection algorithm in the hot topic detection method of the present invention;

FIG. 2 is a schematic diagram of a fine-tuning structure of a RoBERTA-WWM model in the hot topic detection method of the present invention;

FIG. 3 is a schematic diagram of an online hot topic detection process in the hot topic detection method of the present invention.

Detailed Description

The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.

It should be noted that, in order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are clearly and completely described below, and it is obvious that the described embodiments are some embodiments of the present invention, but not all embodiments.

Example 1:

Referring to FIG. 1, the invention discloses a hot topic detection method based on RoBERTa-WWM and HDBSCAN algorithms, which includes offline hot topic detection and online hot topic detection.

The offline hot topic detection is used for detecting the hot topics contained in the existing data in the database, and in the processing process, the data is fixed and new topics cannot be generated.

The online hot topic detection is used for detecting the hot topics occurring in the Internet media platform in a certain time interval. In the processing process, data are continuously updated, the similarity between newly arrived reports and existing topics and the influence of topic drift and evolution on topic detection results need to be considered, and besides, the calculation efficiency of an algorithm needs to be considered, so that the real-time performance of calculation results is guaranteed.

Preferably, the offline hot topic detection comprises the following steps:

A1. A data cleaning step, namely performing data cleaning on the existing text data in the database to remove the interference information in the text.

Specifically, news links, symbols, and other distracting information in the text are removed.
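A minimal sketch of such a cleaning step is given below. The regular expressions and the function name `clean_text` are assumptions for illustration: links are taken to be `http(s)://` or `www.` URLs made of ASCII characters, and "symbols" are taken to be everything other than Chinese characters, ASCII letters, digits and whitespace.

```python
import re

# Assumption: links consist of ASCII URL characters only, so adjacent
# Chinese text is not swallowed by the match.
URL_RE = re.compile(r"(?:https?://|www\.)[A-Za-z0-9./?=&_%#:-]+")
# Keep CJK ideographs, ASCII letters, digits and whitespace; drop the rest.
SYMBOL_RE = re.compile(r"[^\u4e00-\u9fffA-Za-z0-9\s]")


def clean_text(text):
    """Remove links and symbols, then collapse runs of whitespace."""
    text = URL_RE.sub(" ", text)
    text = SYMBOL_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()
```

A real deployment would likely tune both patterns to the platform in question, for example to keep hashtags or emoji if they turn out to carry topic information.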

A2. A text vectorization representation step, namely fine-tuning the RoBERTa-WWM model with an attached fine-tuning structure on a data set of labeled similar and dissimilar sentences, and inputting the cleaned text data into the fine-tuned (or trained) RoBERTa-WWM model with the attached fine-tuning structure to obtain vector representations of all the text data.

The fine-tuning process is a model retraining process, as shown in fig. 2. For example, two labeled similar sentences are each input into the original RoBERTa-WWM model, and their sentence vectors are each obtained in the pooling layer of the fine-tuning structure. The two sentence vectors and their difference vector are then concatenated, and the result finally enters a Softmax classifier, which performs logistic regression to obtain the similarity of the two sentences; this completes one retraining pass. The fine-tuning of the RoBERTa-WWM model with the attached fine-tuning structure is completed through multiple rounds of such training.
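The forward pass through such a fine-tuning structure can be sketched in plain Python. This is only an illustration under several assumptions: the token vectors stand in for the outputs of the RoBERTa-WWM encoder, the pooling layer is taken to be mean pooling, the difference vector is taken to be the element-wise absolute difference, and the classifier has two classes (similar, dissimilar). All names are hypothetical, and no training loop is shown.

```python
import math


def mean_pool(token_vectors):
    """Average the token vectors of one sentence into a sentence vector."""
    dim = len(token_vectors[0])
    return [sum(tok[d] for tok in token_vectors) / len(token_vectors) for d in range(dim)]


def pair_features(u, v):
    """Concatenate both sentence vectors with their element-wise difference."""
    return u + v + [abs(a - b) for a, b in zip(u, v)]


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_pair(tokens_a, tokens_b, weights, bias):
    """Return class probabilities for a sentence pair.

    `weights` holds one row per class, each as long as the feature vector.
    """
    feats = pair_features(mean_pool(tokens_a), mean_pool(tokens_b))
    logits = [sum(w * f for w, f in zip(row, feats)) + b for row, b in zip(weights, bias)]
    return softmax(logits)
```

During fine-tuning, the classifier output would be compared with the similar or dissimilar label and the loss propagated back into the encoder; at inference time only the pooled sentence vector is needed.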

This step represents texts with RoBERTa-WWM (A Robustly Optimized BERT Pretraining Approach, Whole Word Masking), a pre-trained language model for the Chinese language environment, and adds a fine-tuning structure on top of the model, so that the text vectors obtained from the RoBERTa-WWM model retain the semantic and contextual information of the texts more completely. This avoids the poor distinguishability between vectors caused by representing topics with keyword vectors, and fundamentally improves the accuracy of topic detection.

A3. A clustering step, namely clustering the text vectors obtained in step A2 by using the HDBSCAN algorithm to obtain the topic distribution of the text data.

This step innovatively uses the HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) algorithm to cluster the text vectors. The algorithm is better suited to the characteristics of data on current internet media platforms, and it reduces the complexity and computational cost of the topic detection algorithm.
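In practice the clustering would normally be delegated to an existing HDBSCAN implementation. To give a feel for why the algorithm tolerates noisy data, the sketch below computes only the mutual reachability distance on which HDBSCAN builds its cluster hierarchy; it is not a full HDBSCAN implementation. Euclidean distance and the convention that the core distance is the distance to the k-th nearest other point are assumptions, and the function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def core_distances(points, k):
    """Distance from each point to its k-th nearest other point."""
    cores = []
    for i, p in enumerate(points):
        dists = sorted(euclidean(p, q) for j, q in enumerate(points) if j != i)
        cores.append(dists[k - 1])
    return cores


def mutual_reachability(points, k):
    """Matrix of max(core(a), core(b), d(a, b)) for every pair of points."""
    cores = core_distances(points, k)
    n = len(points)
    return [
        [
            0.0 if i == j else max(cores[i], cores[j], euclidean(points[i], points[j]))
            for j in range(n)
        ]
        for i in range(n)
    ]
```

Points in sparse regions have large core distances, so they are pushed away from every other point in this metric and tend to end up labeled as noise instead of distorting a cluster.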

A4. An effect evaluation and parameter adjustment step, namely evaluating the effect of the offline topic detection model by using two indexes, the silhouette coefficient and the mutual information index, and if the preset effect is not achieved, adjusting the parameters of the RoBERTa-WWM model and the HDBSCAN algorithm until an optimal solution is achieved.
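Of the two indexes, the silhouette coefficient needs no ground-truth labels and is easy to sketch. The version below assumes Euclidean distance and assumes that noise points carry the label -1 and are skipped, which is how common HDBSCAN implementations mark noise; the function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all clustered (non-noise) points."""
    clusters = {}
    for point, label in zip(points, labels):
        if label != -1:  # assumption: -1 marks noise
            clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the silhouette coefficient needs at least two clusters")
    scores = []
    for label, members in clusters.items():
        for p in members:
            if len(members) == 1:
                scores.append(0.0)  # usual convention for singleton clusters
                continue
            # a: mean distance to the other members of the same cluster.
            a = sum(euclidean(p, q) for q in members) / (len(members) - 1)
            # b: smallest mean distance to the members of any other cluster.
            b = min(
                sum(euclidean(p, q) for q in other) / len(other)
                for other_label, other in clusters.items()
                if other_label != label
            )
            denom = max(a, b)
            scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Values close to 1 indicate compact, well-separated topics; values near 0 or below suggest that the clustering parameters should be adjusted.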

A5. A result generation step, namely calculating the heat value of each post and the heat value of each topic according to the interaction information of the posts in each topic, sorting by heat value, and determining the hot topics; and selecting the posts ranked in the top M% by post heat in each hot topic to represent the topic (for example, the posts ranked in the top 50%), and calculating the average of the text vectors of these posts as the vector representation of the topic.

The interaction information in the posts contained by each topic is used to update the topic's representation vector. The influence and the propagation capacity of each post in the topic are considered, so that the topic can be more accurately represented by the model, and the influence caused by topic drift and evolution is avoided.
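The topic representation described in step A5 can be sketched as follows. The name `topic_vector`, the `(heat, vector)` layout of a post, and rounding the number of selected posts up to at least one are assumptions made for the example.

```python
import math


def topic_vector(posts, top_percent=50):
    """Mean text vector of the posts ranked in the top M% by heat.

    `posts` is a list of (heat, vector) pairs belonging to one topic.
    """
    ranked = sorted(posts, key=lambda p: p[0], reverse=True)
    # Assumption: round up, and always keep at least one post.
    keep = max(1, math.ceil(len(ranked) * top_percent / 100))
    chosen = [vec for _, vec in ranked[:keep]]
    dim = len(chosen[0])
    return [sum(vec[d] for vec in chosen) / len(chosen) for d in range(dim)]
```

With four posts and M = 50, only the two hottest posts contribute, so a low-heat outlier post has no effect on the topic vector.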

In step A5, the hot topics are the top N topics whose heat value is greater than the set threshold.

In step A5, the heat calculation formula of the post is as follows:

$$H_i = x \cdot L_i + y \cdot F_i + z \cdot C_i$$

wherein $H_i$ is the heat value of the i-th post, $L_i$ is the number of likes of the i-th post, $F_i$ is the number of forwards of the i-th post, $C_i$ is the number of comments on the i-th post, and x, y and z are weight coefficients obtained by the entropy weight method.

The heat value of each topic is calculated from the heat values of the posts in the topic.

Preferably, as shown in fig. 3, the online hot topic detection includes steps B1 to B4 set out above.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims

1. A hot topic detection method based on RoBERTA-WWM and HDBSCAN algorithm is characterized in that the hot topic detection method comprises off-line hot topic detection and on-line hot topic detection;

the off-line hot topic detection is used for detecting hot topics contained in existing data in a database, the data volume and the topic number are not changed, the on-line hot topic detection is used for detecting hot topics occurring in an internet media platform in real time in a certain time interval, and the data volume and the topic number are continuously increased;

the offline hot topic detection method comprises the following steps:

A2. a text vectorization representation step, namely fine-tuning a RoBERTa-WWM model with an attached three-layer fine-tuning structure on a data set of labeled similar and dissimilar sentences, inputting the cleaned text data into the fine-tuned RoBERTa-WWM model with the attached fine-tuning structure, and obtaining vector representations of all the text data;

2. The hot topic detection method based on RoBERTa-WWM and HDBSCAN algorithms of claim 1, wherein in step A1, the interference information in the text includes news links and symbols.

3. The hot topic detection method based on RoBERTa-WWM and HDBSCAN algorithms of claim 1, wherein in step A5, the hot topics are the top N topics whose heat value is greater than the set threshold.

4. The hot topic detection method based on RoBERTa-WWM and HDBSCAN algorithms of claim 1, wherein, in step A5,

the heat calculation formula of the post is as follows:

$$H_i = x \cdot L_i + y \cdot F_i + z \cdot C_i$$

wherein $H_i$ is the heat value of the i-th post, $L_i$ is the number of likes of the i-th post, $F_i$ is the number of forwards of the i-th post, $C_i$ is the number of comments on the i-th post, and x, y and z are weight coefficients obtained by the entropy weight method.

5. The hot topic detection method based on RoBERTa-WWM and HDBSCAN algorithms of claim 4, wherein the heat value of the j-th topic is calculated from the heat values of the n posts in the topic.

6. The method for detecting hot topics based on RoBERTa-WWM and HDBSCAN algorithms of any one of claims 1 to 5, wherein the online hot topic detection comprises the steps of:

if the similarity is greater than a preset threshold, merging the newly obtained topic into the existing topic with the highest similarity, re-ranking by post heat value and updating the representation vector of the merged topic; if the similarity is less than the preset threshold, treating the topic as a new topic, and adding its representation vector to the existing topic list once that vector has been obtained;