CN113987192A - Hot topic detection method based on RoBERTa-WWM and HDBSCAN algorithm

Publication number
CN113987192A
Authority
CN
China
Prior art keywords
topic
hot
topics
data
roberta
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111615836.9A
Other languages
Chinese (zh)
Other versions
CN113987192B (en)
Inventor
刘锟
曾曦
邱梓珩
陈天莹
王效武
魏刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Wanglian Anrui Network Technology Co ltd
China Electronic Technology Cyber Security Co Ltd
Original Assignee
Shenzhen Wanglian Anrui Network Technology Co ltd
China Electronic Technology Cyber Security Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2021-12-28
Filing date
2021-12-28
Publication date
2022-01-28
Application filed by Shenzhen Wanglian Anrui Network Technology Co ltd and China Electronic Technology Cyber Security Co Ltd
Priority to CN202111615836.9A
Publication of CN113987192A
Application granted
Publication of CN113987192B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34 Browsing; Visualisation therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a hot topic detection method based on RoBERTa-WWM and the HDBSCAN algorithm, comprising offline hot topic detection and online hot topic detection. Offline hot topic detection detects the hot topics contained in data already stored in a database; online hot topic detection detects the hot topics that arise on internet media platforms within a given time interval. The method avoids the poor distinguishability between vectors that results from representing topics with keyword vectors in conventional techniques, and fundamentally improves the accuracy of topic detection.

Description

Hot topic detection method based on RoBERTa-WWM and HDBSCAN algorithm
Technical Field
The invention belongs to the technical field of natural language processing and network cognitive security, and particularly relates to a hot topic detection method based on RoBERTa-WWM and the HDBSCAN algorithm.
Background
Hot topic detection is a technology that mines, from massive volumes of current online public opinion data, the hot topics or events that people are following and discussing. Traditional hot topic detection includes topic detection based on topic models and topic detection based on text clustering.
With the development of natural language processing, topic detection based on text clustering has become the most common approach. It first represents the text data as vectors suitable for mathematical computation, then partitions the collected texts into clusters by computing the similarity between them, and finally ranks the clusters by a combined score of the interaction information (such as forwards and likes) attached to the posts in each cluster. The highest-ranked clusters are selected as the detected hot topics.
Topic detection based on text clustering currently has the following defects:
(1) Topic detection based on text clustering must first convert text data into vectors suitable for mathematical computation. The representations commonly used for this purpose, such as the bag-of-words model and Word2Vec, share the following main idea: first, all texts are preprocessed and segmented into words; then the keywords of every text are combined into a corpus; finally, the vector representation of each text is obtained by mapping its keywords onto the corpus. However, data on current internet media platforms is large in volume, short in length, non-standard in wording, heavily fragmented and noisy, so the text vectors produced by existing text representation algorithms have very high dimensionality and very poor distinguishability.
(2) The clustering algorithms commonly used for topic detection include the density-based DBSCAN algorithm and the hierarchical HAC algorithm. Both have limitations: the parameters of DBSCAN are difficult to tune and the algorithm is difficult to converge when the data volume is large, while HAC has high computational complexity. In practice, therefore, neither algorithm achieves a good topic detection effect.
(3) When representing a detected topic as a vector, conventional topic detection algorithms use the TF-IDF (term frequency-inverse document frequency) values of the keywords of the texts in the topic. However, two similar events generally share most of their high-frequency keywords, so this method cannot tell the two events apart and may even assign them to a single topic. In addition, keyword-based TF-IDF values cannot cope with the evolution and drift of topics. Both problems reduce the accuracy of the final topic detection result.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a hot topic detection method based on RoBERTa-WWM and the HDBSCAN algorithm.
The purpose of the invention is achieved by the following technical scheme:
A hot topic detection method based on RoBERTa-WWM and the HDBSCAN algorithm comprises offline hot topic detection and online hot topic detection. Offline hot topic detection detects the hot topics contained in data already stored in the database, where the data volume and the number of topics do not change. Online hot topic detection detects the hot topics arising on internet media platforms within a given time interval, where the data volume and the number of topics keep growing over time;
the offline hot topic detection comprises the following steps:
A1. a data cleaning step: clean the existing text data in the database to remove interference information from the text;
A2. a text vectorization step: fine-tune a RoBERTa-WWM model with an attached fine-tuning structure on a data set of labeled similar and dissimilar sentences, then input the cleaned text data into the fine-tuned (trained) model to obtain vector representations of all the text data;
A3. a clustering step: cluster the text vectors obtained in step A2 with the HDBSCAN algorithm to obtain the topic distribution of the text data;
A4. an effect evaluation and parameter adjustment step: evaluate the offline topic detection model with two indexes, the silhouette coefficient and the mutual information index; if the preset effect is not reached, adjust the parameters of the RoBERTa-WWM model and the HDBSCAN algorithm until an optimal solution is reached;
A5. a result generation step: calculate the heat value of each post and of each topic from the interaction information of the posts in each topic, sort by heat value, and determine the hot topic list; then select the posts ranked in the top M% by heat within each hot topic to represent the topic, and take the mean of their text vectors as the vector representation of the topic.
According to a preferred embodiment, the interference information in the text in step A1 includes news links and symbols.
According to a preferred embodiment, in step A5, the hot topics are the top N topics whose heat value is greater than a set threshold.
According to a preferred embodiment, in step A5,
the heat calculation formula of the post is as follows:
[formula image not reproduced]
where the formula is defined over the heat value of the i-th post, the number of likes of the i-th post, the number of forwards of the i-th post and the number of comments on the i-th post, and x, y and z are weight coefficients obtained by the entropy weight method.
According to a preferred embodiment, the heat of the topic is calculated as:
[formula image not reproduced]
where the formula yields the heat value of the j-th topic, and n denotes the number of posts in the topic.
According to a preferred embodiment, the online hot topic detection comprises the following steps:
B1. a data acquisition step: collect online public opinion data from internet media platforms in real time;
B2. an offline topic detection step: each time, select the online public opinion data crawled within a fixed time window and perform topic detection on it with the offline topic detection method;
B3. a similarity calculation and new-topic classification and fusion step: calculate, in turn, the similarity between each topic newly obtained in step B2 and the existing topics;
if the similarity is greater than a preset threshold, merge the newly obtained topic into the existing topic with which it has the highest similarity, re-sort the posts by heat value and update the representation vector of the merged topic; if the similarity is less than the preset threshold, treat the topic as a new topic, obtain its representation vector and add it to the existing topics;
B4. a result generation step: obtain all topics within the fixed time window, sort them by heat value to obtain a heat ranking of the topics, and select the top P topics as the hot topics that people are following and discussing in that time period.
The foregoing main scheme of the invention and its further alternatives can be freely combined to form multiple schemes, all of which can be adopted and are claimed by the invention. After understanding the scheme of the invention, a person skilled in the art will recognize from the prior art and common general knowledge that many such combinations exist, all of which are technical solutions to be protected by the invention; they are not listed exhaustively here.
The invention has the beneficial effects that:
The method represents text with RoBERTa-WWM (A Robustly Optimized BERT Pretraining Approach, Whole Word Masking), a pre-trained language model for Chinese, and adds a fine-tuning structure on top of it, so that the text vectors obtained from the RoBERTa-WWM model retain the semantic and contextual information of the text more completely. This avoids the poor distinguishability between vectors that results from representing topics with keyword vectors, and fundamentally improves the accuracy of topic detection.
The method innovatively clusters the resulting text vectors with the HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) algorithm, which is better suited to the characteristics of data on current internet media platforms and also reduces the complexity and computational cost of the topic detection algorithm.
The representation vector of each topic is updated using the interaction information of the posts it contains. Because the influence and propagation capacity of each post in the topic are taken into account, the model represents the topic more accurately and avoids the effects of topic drift and evolution.
Drawings
FIG. 1 is a schematic flow chart of the offline hot topic detection algorithm in the hot topic detection method of the invention;
FIG. 2 is a schematic diagram of the fine-tuning structure of the RoBERTa-WWM model in the hot topic detection method of the invention;
FIG. 3 is a schematic diagram of the online hot topic detection process in the hot topic detection method of the invention.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.
It should be noted that, to make the objects, technical solutions and advantages of the embodiments of the invention clearer, the technical solutions in the embodiments are described clearly and completely below. The described embodiments are some, but not all, of the embodiments of the invention.
Example 1:
Referring to FIG. 1, the invention discloses a hot topic detection method based on RoBERTa-WWM and the HDBSCAN algorithm, which comprises offline hot topic detection and online hot topic detection.
Offline hot topic detection detects the hot topics contained in data already stored in the database. During processing, the data is fixed and no new topics are generated.
Online hot topic detection detects the hot topics arising on internet media platforms within a given time interval. During processing, the data is continuously updated, so the similarity between newly arrived reports and existing topics, and the influence of topic drift and evolution on the detection results, must be considered. The computational efficiency of the algorithm must also be considered, so that the results remain timely.
Preferably, the offline hot topic detection comprises the following steps:
A1. A data cleaning step: clean the existing text data in the database to remove interference information from the text.
Specifically, news links, symbols and other interference information are removed from the text.
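The cleaning in step A1 can be sketched as follows. This is only an illustration under stated assumptions: the patent names news links and symbols as interference information but gives no concrete rules, so the regular expressions and the function name clean_text are hypothetical choices.

```python
import re

# Hypothetical patterns: the source only says "news links, symbols and other
# interference information", so these rules are illustrative assumptions.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Keep common CJK characters, ASCII letters, digits and whitespace;
# every other character is treated as a symbol and replaced by a space.
SYMBOL_RE = re.compile(r"[^\u4e00-\u9fffA-Za-z0-9\s]")
SPACE_RE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Remove links and symbols, then collapse runs of whitespace."""
    text = URL_RE.sub(" ", text)      # links first, before their punctuation is stripped
    text = SYMBOL_RE.sub(" ", text)   # then punctuation, hashtags marks, emoji and so on
    return SPACE_RE.sub(" ", text).strip()


if __name__ == "__main__":
    print(clean_text("Breaking!!! see https://t.cn/abc123 #热点# 今日新闻"))
```

Note that the URL pattern consumes everything up to the next whitespace, so text glued directly onto a link would be removed with it; a production cleaner would likely need platform-specific rules.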
A2. A text vectorization step: fine-tune a RoBERTa-WWM model with an attached fine-tuning structure on a data set of labeled similar and dissimilar sentences, then input the cleaned text data into the fine-tuned (trained) model to obtain vector representations of all the text data.
Fine-tuning is a process of retraining the model, as shown in FIG. 2. For example, two labeled similar sentences are each fed into the original RoBERTa-WWM model; the pooling layer of the fine-tuning structure then produces a sentence vector for each; the two sentence vectors and their difference vector are concatenated; and finally the concatenated vector enters a softmax classifier, which performs logistic regression to output the similarity of the two sentences. This completes one retraining pass. Fine-tuning of the RoBERTa-WWM model with the attached fine-tuning structure is completed through multiple rounds of such training.
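The forward pass of the fine-tuning structure described above can be sketched in plain Python. Several details are assumptions rather than statements from the patent: mean pooling is used for the pooling layer, the difference vector is taken as the element-wise absolute difference, and the classifier weights are untrained placeholders. In practice the token vectors would come from the RoBERTa-WWM encoder; here they are toy values.

```python
import math


def mean_pool(token_vectors):
    """Pooling layer (assumed to be mean pooling): average token vectors."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(tok[d] for tok in token_vectors) / n for d in range(dim)]


def pair_features(u, v):
    """Concatenate u, v and their element-wise absolute difference."""
    return u + v + [abs(a - b) for a, b in zip(u, v)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_pair(tokens_a, tokens_b, weights, bias):
    """Return class probabilities (e.g. dissimilar, similar) for two sentences."""
    feats = pair_features(mean_pool(tokens_a), mean_pool(tokens_b))
    logits = [sum(w * f for w, f in zip(row, feats)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)


if __name__ == "__main__":
    tokens_a = [[1.0, 2.0], [3.0, 4.0]]    # toy token vectors for sentence A
    tokens_b = [[2.0, 3.0]]                # toy token vectors for sentence B
    weights = [[0.1] * 6, [-0.1] * 6]      # placeholder weights, not trained
    print(classify_pair(tokens_a, tokens_b, weights, [0.0, 0.0]))
```

During training, the cross-entropy loss on these probabilities would be back-propagated through both the classifier and the encoder; after training, only the pooled sentence vector is needed for clustering.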
This step represents text with RoBERTa-WWM (A Robustly Optimized BERT Pretraining Approach, Whole Word Masking), a pre-trained language model for Chinese, and adds a fine-tuning structure on top of it, so that the text vectors obtained from the RoBERTa-WWM model retain the semantic and contextual information of the text more completely. This avoids the poor distinguishability between vectors that results from representing topics with keyword vectors, and fundamentally improves the accuracy of topic detection.
A3. A clustering step: cluster the text vectors obtained in step A2 with the HDBSCAN algorithm to obtain the topic distribution of the text data.
This step innovatively clusters the text vectors with the HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) algorithm, which is better suited to the characteristics of data on current internet media platforms and reduces the complexity and computational cost of the topic detection algorithm.
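HDBSCAN builds its cluster hierarchy on the mutual reachability distance rather than on the raw distance between points. The sketch below computes only that first ingredient, not the full algorithm (the minimum spanning tree, condensed tree and cluster selection are omitted). Euclidean distance and the parameter name k are assumptions for illustration; the patent does not state a metric or parameter values.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def core_distances(points, k):
    """Distance from each point to its k-th nearest other point."""
    cores = []
    for i, p in enumerate(points):
        dists = sorted(euclidean(p, q) for j, q in enumerate(points) if j != i)
        cores.append(dists[k - 1])
    return cores


def mutual_reachability(points, k):
    """Matrix of max(core(a), core(b), d(a, b)) for every pair of points."""
    cores = core_distances(points, k)
    n = len(points)
    return [[0.0 if i == j
             else max(cores[i], cores[j], euclidean(points[i], points[j]))
             for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0)]
    for row in mutual_reachability(pts, k=2):
        print([round(v, 3) for v in row])
```

Points in sparse regions get a large core distance, which pushes them away from every other point in this metric; that is what lets the algorithm treat isolated posts as noise instead of forcing them into a topic.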
A4. An effect evaluation and parameter adjustment step: evaluate the offline topic detection model with two indexes, the silhouette coefficient and the mutual information index; if the preset effect is not reached, adjust the parameters of the RoBERTa-WWM model and the HDBSCAN algorithm until an optimal solution is reached.
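As an illustration of the first index, the silhouette coefficient can be computed as below. This is a minimal sketch assuming Euclidean distance; it does not cover the mutual information index, which additionally requires reference topic labels. The function names are hypothetical, and noise points from the clustering step (if any) are assumed to have been dropped before scoring.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette_score(points, labels):
    """Mean silhouette over all points; singleton-cluster points score 0."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette coefficient needs at least two clusters")
    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            continue  # a point alone in its cluster contributes 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(euclidean(points[i], points[j]) for j in members) / len(members)
                for other, members in clusters.items() if other != lab)
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0
    return total / len(points)


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
    print(silhouette_score(pts, [0, 0, 1, 1]))
```

Values near 1 indicate compact, well-separated topics, while values near or below 0 suggest that the clustering parameters should be adjusted.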
A5. A result generation step: calculate the heat value of each post and of each topic from the interaction information of the posts in each topic, sort by heat value, and determine the hot topics; then select the posts ranked in the top M% by heat within each hot topic to represent the topic (for example, the top 50% of posts by heat), and take the mean of their text vectors as the vector representation of the topic.
The representation vector of each topic is updated using the interaction information of the posts it contains. Because the influence and propagation capacity of each post in the topic are taken into account, the model represents the topic more accurately and avoids the effects of topic drift and evolution.
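The topic representation in step A5 can be sketched as follows. The function name topic_vector and the rounding rule (round up, keep at least one post) are assumptions; the patent only specifies taking the top M% of posts by heat and averaging their text vectors.

```python
import math


def topic_vector(post_vectors, post_heats, top_percent=50.0):
    """Mean of the text vectors of the top `top_percent` % hottest posts."""
    # Indices of posts sorted from hottest to coldest.
    ranked = sorted(range(len(post_heats)), key=lambda i: post_heats[i], reverse=True)
    # Assumption: round up and always keep at least one post.
    keep = max(1, math.ceil(len(ranked) * top_percent / 100.0))
    chosen = ranked[:keep]
    dim = len(post_vectors[0])
    return [sum(post_vectors[i][d] for i in chosen) / keep for d in range(dim)]


if __name__ == "__main__":
    vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [5.0, 5.0]]
    heats = [10.0, 1.0, 8.0, 0.5]
    print(topic_vector(vectors, heats, top_percent=50.0))
```

Because low-heat posts are excluded from the average, an outlying post with little interaction (the last one in the example) does not pull the topic vector away from what most readers actually engaged with.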
In step A5, the hot topics are the top N topics whose heat value is greater than a set threshold.
In step A5,
the heat calculation formula of the post is as follows:
[formula image not reproduced]
where the formula is defined over the heat value of the i-th post, the number of likes of the i-th post, the number of forwards of the i-th post and the number of comments on the i-th post, and x, y and z are weight coefficients obtained by the entropy weight method.
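The entropy weight method that yields x, y and z can be sketched as below. Two points are assumptions, since the formula itself is not available in this text: the indicator columns are normalized as simple proportions of the raw counts, and the post heat is taken to be the weighted sum of likes, forwards and comments. The function names are hypothetical.

```python
import math


def entropy_weights(rows):
    """Entropy weights for indicator columns; rows are (likes, forwards, comments)."""
    n, m = len(rows), len(rows[0])
    divergences = []
    for j in range(m):
        column = [row[j] for row in rows]
        total = sum(column)
        if total == 0 or n < 2:
            divergences.append(0.0)  # a constant or empty indicator carries no information
            continue
        entropy = 0.0
        for value in column:
            if value > 0:            # 0 * log(0) is taken as 0
                p = value / total
                entropy -= p * math.log(p)
        entropy /= math.log(n)       # scale entropy into [0, 1]
        divergences.append(max(0.0, 1.0 - entropy))
    d_sum = sum(divergences)
    if d_sum == 0:
        return [1.0 / m] * m         # fall back to equal weights
    return [d / d_sum for d in divergences]


def post_heat(row, weights):
    """Assumed form of the heat formula: weighted sum of the interaction counts."""
    return sum(w * v for w, v in zip(weights, row))


if __name__ == "__main__":
    posts = [(10, 1, 5), (10, 2, 5), (10, 50, 5)]
    x, y, z = entropy_weights(posts)
    print([round(w, 4) for w in (x, y, z)])
    print([round(post_heat(p, (x, y, z)), 2) for p in posts])
```

The method gives more weight to indicators that vary strongly between posts: in the example, likes and comments are identical across posts and receive almost no weight, while forwards dominate the heat value.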
The heat calculation formula of the topic is as follows:
[formula image not reproduced]
where the formula yields the heat value of the j-th topic, and n denotes the number of posts in the topic.
Preferably, as shown in FIG. 3, the online hot topic detection comprises the following steps:
B1. A data acquisition step: collect online public opinion data from internet media platforms in real time;
B2. An offline topic detection step: each time, select the online public opinion data crawled within a fixed time window and perform topic detection on it with the offline topic detection method;
B3. A similarity calculation and new-topic classification and fusion step: calculate, in turn, the similarity between each topic newly obtained in step B2 and the existing topics;
if the similarity is greater than a preset threshold, merge the newly obtained topic into the existing topic with which it has the highest similarity, re-sort the posts by heat value and update the representation vector of the merged topic; if the similarity is less than the preset threshold, treat the topic as a new topic, obtain its representation vector and add it to the existing topics;
B4. A result generation step: obtain all topics within the fixed time window, sort them by heat value to obtain a heat ranking of the topics, and select the top P topics as the hot topics that people are following and discussing in that time period.
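Steps B3 and B4 can be sketched as follows. The sketch rests on assumptions that the patent does not fix: cosine similarity as the similarity measure, a threshold of 0.8, a topic heat equal to the sum of its post heats, and the top 50% of posts for the representation vector. All function and field names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def represent(posts, top_percent=50.0):
    """Mean vector of the top M% hottest posts; posts are (vector, heat) pairs."""
    ranked = sorted(posts, key=lambda p: p[1], reverse=True)
    keep = max(1, math.ceil(len(ranked) * top_percent / 100.0))
    dim = len(ranked[0][0])
    return [sum(vec[d] for vec, _ in ranked[:keep]) / keep for d in range(dim)]


def merge_window(existing, new_topics, threshold=0.8):
    """Step B3: merge each new topic into its best match or register it as new."""
    for posts in new_topics:
        vector = represent(posts)
        best = max(existing, key=lambda t: cosine(vector, t["vector"]), default=None)
        if best is not None and cosine(vector, best["vector"]) > threshold:
            best["posts"].extend(posts)
            best["vector"] = represent(best["posts"])  # re-rank by heat, re-average
        else:
            existing.append({"posts": list(posts), "vector": vector})
    return existing


def topic_heat(topic):
    """Assumed topic heat: the sum of the heat values of its posts."""
    return sum(heat for _, heat in topic["posts"])


def top_topics(topics, p):
    """Step B4: the P hottest topics."""
    return sorted(topics, key=topic_heat, reverse=True)[:p]


if __name__ == "__main__":
    topics = merge_window([], [[([1.0, 0.0], 5.0)], [([0.0, 1.0], 2.0)]])
    topics = merge_window(topics, [[([0.9, 0.1], 4.0)], [([0.6, -0.8], 1.0)]])
    for t in top_topics(topics, 2):
        print(t["vector"], topic_heat(t))
```

Recomputing the vector from the hottest posts after every merge is what lets the representation follow a topic as its focus shifts, instead of being anchored to the posts seen in the first window.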
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (6)

1. A hot topic detection method based on RoBERTa-WWM and the HDBSCAN algorithm, characterized in that the hot topic detection method comprises offline hot topic detection and online hot topic detection;
the offline hot topic detection detects the hot topics contained in data already stored in a database, where the data volume and the number of topics do not change; the online hot topic detection detects, in real time, the hot topics arising on internet media platforms within a given time interval, where the data volume and the number of topics keep growing;
the offline hot topic detection comprises the following steps:
A1. a data cleaning step: clean the existing text data in the database to remove interference information from the text;
A2. a text vectorization step: fine-tune a RoBERTa-WWM model with an attached three-layer fine-tuning structure on a data set of labeled similar and dissimilar sentences, then input the cleaned text data into the fine-tuned model to obtain vector representations of all the text data;
A3. a clustering step: cluster the text vectors obtained in step A2 with the HDBSCAN algorithm to obtain the topic distribution of the text data;
A4. an effect evaluation and parameter adjustment step: evaluate the offline topic detection model with two indexes, the silhouette coefficient and the mutual information index; if the preset effect is not reached, adjust the parameters of the RoBERTa-WWM model and the HDBSCAN algorithm until an optimal solution is reached;
A5. a result generation step: calculate the heat value of each post and of each topic from the interaction information of the posts in each topic, sort by heat value, and determine the hot topic list; then select the posts ranked in the top M% by heat within each hot topic to represent the topic, and take the mean of their text vectors as the vector representation of the topic.
2. The hot topic detection method based on RoBERTa-WWM and the HDBSCAN algorithm of claim 1, wherein in step A1 the interference information in the text includes news links and symbols.
3. The hot topic detection method based on RoBERTa-WWM and the HDBSCAN algorithm of claim 1, wherein in step A5 the hot topics are the top N topics whose heat value is greater than a set threshold.
4. The hot topic detection method based on RoBERTa-WWM and the HDBSCAN algorithm of claim 1, wherein, in step A5,
the heat calculation formula of the post is as follows:
[formula image not reproduced]
where the formula is defined over the heat value of the i-th post, the number of likes of the i-th post, the number of forwards of the i-th post and the number of comments on the i-th post, and x, y and z are weight coefficients obtained by the entropy weight method.
5. The hot topic detection method based on RoBERTa-WWM and the HDBSCAN algorithm of claim 4, wherein
the heat calculation formula of the topic is as follows:
[formula image not reproduced]
where the formula yields the heat value of the j-th topic, and n denotes the number of posts in the topic.
6. The hot topic detection method based on RoBERTa-WWM and the HDBSCAN algorithm of any one of claims 1 to 5, wherein the online hot topic detection comprises the following steps:
B1. a data acquisition step: collect online public opinion data from internet media platforms in real time;
B2. an offline topic detection step: each time, select the online public opinion data crawled within a fixed time window and perform topic detection on it with the offline topic detection method;
B3. a similarity calculation and new-topic classification and fusion step: calculate, in turn, the similarity between each topic newly obtained in step B2 and the existing topics;
if the similarity is greater than a preset threshold, merge the newly obtained topic into the existing topic with which it has the highest similarity, re-sort the posts by heat value and update the representation vector of the merged topic; if the similarity is less than the preset threshold, treat the topic as a new topic, obtain its representation vector and add it to the existing topic list;
B4. a result generation step: obtain all topics within the fixed time window, sort them by heat value to obtain a heat ranking of the topics, and select the top P topics as the hot topics that people are following and discussing in that time period.
CN202111615836.9A 2021-12-28 2021-12-28 Hot topic detection method based on RoBERTA-WWM and HDBSCAN algorithm Active CN113987192B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111615836.9A CN113987192B (en) 2021-12-28 2021-12-28 Hot topic detection method based on RoBERTA-WWM and HDBSCAN algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111615836.9A CN113987192B (en) 2021-12-28 2021-12-28 Hot topic detection method based on RoBERTA-WWM and HDBSCAN algorithm

Publications (2)

Publication Number Publication Date
CN113987192A (en) 2022-01-28
CN113987192B (en) 2022-04-01

Family

ID=79734569

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111615836.9A Active CN113987192B (en) 2021-12-28 2021-12-28 Hot topic detection method based on RoBERTA-WWM and HDBSCAN algorithm

Country Status (1)

Country Link
CN (1) CN113987192B (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107894994A (en) * 2017-10-18 2018-04-10 北京京东尚科信息技术有限公司 A kind of method and apparatus for detecting much-talked-about topic classification
CN110209813A (en) * 2019-05-14 2019-09-06 天津大学 A kind of incident detection and prediction technique based on autocoder
CN110297988A (en) * 2019-07-06 2019-10-01 四川大学 Hot topic detection method based on weighting LDA and improvement Single-Pass clustering algorithm
CN111125380A (en) * 2019-12-30 2020-05-08 华南理工大学 Entity linking method based on RoBERTA and heuristic algorithm
CN111191453A (en) * 2019-12-25 2020-05-22 中国电子科技集团公司第十五研究所 Named entity recognition method based on confrontation training
CN111339784A (en) * 2020-03-06 2020-06-26 支付宝(杭州)信息技术有限公司 Automatic new topic mining method and system
CN111626056A (en) * 2020-04-11 2020-09-04 中国人民解放军战略支援部队信息工程大学 Chinese named entity identification method and device based on RoBERTA-BiGRU-LAN model
CN113076734A (en) * 2021-04-15 2021-07-06 云南电网有限责任公司电力科学研究院 Similarity detection method and device for project texts
CN113204643A (en) * 2021-06-23 2021-08-03 北京明略软件系统有限公司 Entity alignment method, device, equipment and medium
CN113380418A (en) * 2021-06-22 2021-09-10 浙江工业大学 System for analyzing and identifying depression through dialog text
CN113515593A (en) * 2021-04-23 2021-10-19 平安科技(深圳)有限公司 Topic detection method and device based on clustering model and computer equipment
CN113657113A (en) * 2021-08-24 2021-11-16 北京字跳网络技术有限公司 Text processing method and device and electronic equipment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
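Chuanzhen Li et al.: "Topic Detection and Tracking Based on Windowed DBSCAN and Parallel KNN", IEEE Access *
Zhu Yan et al.: "Named Entity Recognition for Chinese Electronic Medical Records Based on RoBERTa-WWM", Computer and Modernization *
Chen Xi et al.: "Research on BERT Embeddings for Chinese-Uyghur Machine Translation", Computer Engineering *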

Also Published As

Publication number Publication date
CN113987192B (en) 2022-04-01

Similar Documents

Publication Publication Date Title
Rossi et al. Optimization and label propagation in bipartite heterogeneous networks to improve transductive classification of texts
CN113239181A (en) Scientific and technological literature citation recommendation method based on deep learning
Noori Classification of customer reviews using machine learning algorithms
CN110264372B (en) Topic community discovery method based on node representation
CN111046171B (en) Emotion discrimination method based on fine-grained labeled data
CN113255366B (en) Aspect-level text emotion analysis method based on heterogeneous graph neural network
Devipriya et al. Deep learning sentiment analysis for recommendations in social applications
Nakatsuji et al. Semantic sensitive tensor factorization
Tembusai et al. K-nearest neighbor with K-fold cross validation and analytic hierarchy process on data classification
CN115329215A (en) Recommendation method and system based on self-adaptive dynamic knowledge graph in heterogeneous network
Daniel et al. A novel sentiment analysis for amazon data with TSA based feature selection
CN108491477B (en) Neural network recommendation method based on multi-dimensional cloud and user dynamic interest
CN113987192B (en) Hot topic detection method based on RoBERTA-WWM and HDBSCAN algorithm
Gu et al. Fuzzy time series forecasting based on information granule and neural network
Li et al. Capsule neural tensor networks with multi-aspect information for Few-shot Knowledge Graph Completion
CN116662564A (en) Service recommendation method based on depth matrix decomposition and knowledge graph
CN115840853A (en) Course recommendation system based on knowledge graph and attention network
CN108694165B (en) Cross-domain dual emotion analysis method for product comments
Kim Research on Text Classification Based on Deep Neural Network
Sharma et al. A Review On Collaborative Filtering Using Knn Algorithm
Fan et al. Topic modeling methods for short texts: A survey
Thakur et al. OKO-SVM: Online kernel optimization-based support vector machine for the incremental learning and classification of the sentiments in the train reviews
Chebil et al. Clustering social media data for marketing strategies: Literature review using topic modelling techniques
CN116244501B (en) Cold start recommendation method based on first-order element learning and multi-supervisor association network
Wu et al. Research on the Prediction of Popular Opinion Trend of Web News based on BP neural Network and LSTM

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant