CN113987192A - Hot topic detection method based on RoBERTA-WWM and HDBSCAN algorithm - Google Patents
- Publication number
- CN113987192A
- Authority
- CN
- China
- Prior art keywords
- topic
- hot
- topics
- data
- roberta
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/34—Browsing; Visualisation therefor
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/194—Calculation of difference between files
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a hot topic detection method based on the RoBERTa-WWM model and the HDBSCAN algorithm, comprising offline hot topic detection and online hot topic detection. Offline hot topic detection detects the hot topics contained in existing data in a database; online hot topic detection detects the hot topics that emerge on an internet media platform within a given time interval. The method avoids the poor distinguishability between vectors that arises in conventional techniques when topics are represented by keyword vectors, and fundamentally improves the accuracy of topic detection.
Description
Technical Field
The invention belongs to the technical field of natural language processing and network cognitive security, and particularly relates to a hot topic detection method based on the RoBERTa-WWM model and the HDBSCAN algorithm.
Background
Hot topic detection is a technology for mining, from massive volumes of current online public opinion data, the hot topics or events that people are following and discussing. Traditional hot topic detection comprises topic detection based on topic models and topic detection based on text clustering.
With the development of natural language processing technology, the most common approach at present is topic detection based on text clustering. It first represents the text data as vectors that are convenient for mathematical computation, then divides the collected text data into clusters by computing the similarity between texts, and finally ranks all clusters by a combined score of the interaction information (forwards, likes and so on) attached to the posts in each cluster, selecting the highest-ranked clusters as the detected hot topics.
Topic detection based on text clustering currently has the following defects:
(1) Topic detection based on text clustering first requires converting text data into a vector form that is convenient for mathematical computation. The commonly used text representations, such as the bag-of-words model and Word2Vec, follow the same main idea: all texts are preprocessed and segmented into words, the keywords of each text are combined into a corpus, and the vector representation of each text is obtained by mapping its keywords onto that corpus. However, data on current internet media platforms is large in volume, short in length, non-standard in wording, heavily fragmented and noisy, so text vectors obtained with existing text representation algorithms have very high dimensionality and very poor distinguishability.
(2) The clustering algorithms commonly used for topic detection at present include the density-based DBSCAN algorithm and the hierarchical HAC algorithm. Both have limitations: the parameters of DBSCAN are hard to tune and the algorithm has difficulty converging on large data sets, while HAC has high computational complexity. In practice, therefore, neither algorithm readily achieves a good topic detection effect.
(3) When representing a detected topic as a vector, conventional topic detection algorithms use the tf-idf (term frequency-inverse document frequency) values of the keywords in the topic's texts. However, two similar events generally share essentially the same high-frequency keywords, so this method cannot tell them apart and may even merge them into one topic. In addition, a keyword-based tf-idf representation cannot cope with the evolution and drift of topics. Both problems reduce the accuracy of the final topic detection result.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a hot topic detection method based on RoBERTA-WWM and HDBSCAN algorithms.
The purpose of the invention is realized by the following technical scheme:
A hot topic detection method based on the RoBERTa-WWM model and the HDBSCAN algorithm comprises offline hot topic detection and online hot topic detection. Offline hot topic detection detects the hot topics contained in existing data in a database, where the data volume and the number of topics do not change. Online hot topic detection detects the hot topics that emerge on an internet media platform within a given time interval, where the data volume and the number of topics grow continuously over time.
The offline hot topic detection comprises the following steps:
A1. A data cleaning step: clean the existing text data in the database to remove interference information from the text.
A2. A text vectorization step: fine-tune a RoBERTa-WWM model with an attached fine-tuning structure on a data set of labeled similar and dissimilar sentences, then input the cleaned text data into the fine-tuned (that is, trained) model to obtain vector representations of all the text data.
A3. A clustering step: cluster the text vectors obtained in step A2 using the HDBSCAN algorithm to obtain the topic distribution of the text data.
A4. An effect evaluation and parameter adjustment step: evaluate the offline topic detection model using two indexes, the silhouette coefficient and the mutual information index; if the preset effect is not reached, adjust the parameters of the RoBERTa-WWM model and the HDBSCAN algorithm until an optimal solution is reached.
A5. A result generation step: calculate the heat value of each post and of each topic from the interaction information of the posts in each topic, sort by heat value, and determine the hot topic list; then select the posts ranked in the top M% by heat value within each hot topic to represent that topic, and take the mean of their text vectors as the vector representation of the topic.
According to a preferred embodiment, the interference information in the text in step A1 includes news links and symbols.
According to a preferred embodiment, in step A5 the hot topics are the top N topics whose heat value exceeds a set threshold.
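The selection rule in this embodiment can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `select_hot_topics`, the dictionary layout and the sample heat values are all hypothetical.

```python
def select_hot_topics(topic_heat, threshold, top_n):
    """Return up to top_n topic ids whose heat exceeds threshold, hottest first."""
    # Keep only topics above the threshold (hypothetical data layout: id -> heat).
    above = [(tid, heat) for tid, heat in topic_heat.items() if heat > threshold]
    # Sort the survivors by heat value in descending order.
    above.sort(key=lambda item: item[1], reverse=True)
    return [tid for tid, _ in above[:top_n]]


# Illustrative heat values, not taken from the source.
topic_heat = {"t1": 12.0, "t2": 3.5, "t3": 40.2, "t4": 8.1}
hot = select_hot_topics(topic_heat, threshold=5.0, top_n=2)
```

Applying the threshold before truncating to N means that fewer than N topics are returned when few topics are hot enough.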
According to a preferred embodiment, in step A5,
the heat calculation formula of the post is as follows:
[formula not preserved in this text]
where the terms of the formula denote, respectively, the heat value of the i-th post, the number of likes of the i-th post, the number of forwards of the i-th post, and the number of comments on the i-th post, and x, y and z are weight coefficients obtained by the entropy weight method.
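The source states only that x, y and z come from an entropy weight method. The sketch below shows one common formulation of that method (column proportions, normalised Shannon entropy, weights proportional to one minus entropy); whether the patent uses exactly this variant is an assumption, and `entropy_weights` and the sample counts are hypothetical.

```python
import math


def entropy_weights(rows):
    """Entropy weight method: rows are samples, columns are non-negative indicators."""
    n, m = len(rows), len(rows[0])
    if n < 2:
        raise ValueError("need at least two samples")
    divergences = []
    for j in range(m):
        column = [row[j] for row in rows]
        total = sum(column)
        if total == 0:
            # An all-zero indicator carries no information.
            divergences.append(0.0)
            continue
        # Shannon entropy of the column proportions, normalised to [0, 1].
        entropy = -sum((v / total) * math.log(v / total) for v in column if v > 0)
        entropy /= math.log(n)
        divergences.append(max(0.0, 1.0 - entropy))
    total_div = sum(divergences)
    if total_div == 0:
        return [1.0 / m] * m  # fall back to equal weights
    return [d / total_div for d in divergences]


# Hypothetical per-post counts: (likes, forwards, comments).
counts = [[120, 30, 12], [15, 2, 1], [300, 80, 45], [8, 0, 3]]
x, y, z = entropy_weights(counts)
```

An indicator whose values are spread evenly across posts has entropy close to 1 and therefore receives a weight close to 0, since it does little to distinguish posts.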
According to a preferred embodiment, the heat of the topic is calculated as:
[formula not preserved in this text]
According to a preferred embodiment, the online hot topic detection comprises the following steps:
B1. A data acquisition step: collect online public opinion data from the internet media platform in real time.
B2. An offline topic detection step: each time, take the online public opinion data crawled within a fixed time window and perform topic detection on it using the offline topic detection method.
B3. A similarity calculation and new-topic classification and fusion step: compute, in turn, the similarity between each topic newly obtained in step B2 and the existing topics.
If the similarity is greater than a preset threshold, merge the newly obtained topic into the existing topic with the highest similarity, re-sort the posts by heat value, and update the representation vector of the merged topic. If the similarity is below the preset threshold, the topic is a new topic; obtain its representation vector and add it to the existing topics.
B4. A result generation step: obtain all topics in the fixed time window, sort them by heat value to obtain a topic heat ranking, and select the top P topics as the hot topics that people are following and discussing in that period.
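The fixed time window of step B2 can be illustrated with a small batching helper. This is only a sketch under assumptions: posts are taken to be dictionaries carrying an integer Unix `timestamp` in seconds, and `bucket_by_window` is a hypothetical name; the source does not say how windows are delimited.

```python
from collections import defaultdict


def bucket_by_window(posts, window_seconds):
    """Group posts into consecutive fixed-size time windows keyed by window index."""
    windows = defaultdict(list)
    for post in posts:
        # Integer division maps every timestamp in a window to the same index.
        windows[post["timestamp"] // window_seconds].append(post)
    return dict(windows)


# Hypothetical crawled posts with timestamps in seconds.
posts = [
    {"id": "p1", "timestamp": 0},
    {"id": "p2", "timestamp": 1800},
    {"id": "p3", "timestamp": 3600},
    {"id": "p4", "timestamp": 7199},
    {"id": "p5", "timestamp": 7200},
]
windows = bucket_by_window(posts, window_seconds=3600)
```

Each resulting batch would then be passed through the offline steps A1 to A5 before the similarity comparison of step B3.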
The aforementioned main aspects of the invention and their respective further alternatives can be freely combined to form a plurality of aspects, all of which are aspects that can be adopted and claimed by the present invention. The skilled person in the art can understand that there are many combinations, which are all the technical solutions to be protected by the present invention, according to the prior art and the common general knowledge after understanding the scheme of the present invention, and the technical solutions are not exhaustive herein.
The invention has the beneficial effects that:
The method represents text with RoBERTa-WWM (A Robustly Optimized BERT Pretraining Approach, Whole Word Masking), a pre-trained language model for Chinese, and adds a fine-tuning structure on top of the model, so that the text vectors obtained from the RoBERTa-WWM model retain the semantic and contextual information of the text more completely. This avoids the poor distinguishability between vectors caused by representing topics with keyword vectors and fundamentally improves the accuracy of topic detection.
The method clusters the text vectors with the HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) algorithm, which is better suited to the characteristics of data on current internet media platforms and also reduces the complexity and computational cost of the topic detection algorithm.
The method updates the representation vector of each topic using the interaction information of the posts the topic contains. Because the influence and propagation capacity of each post within the topic are taken into account, the model represents the topic more accurately and avoids the effects of topic drift and evolution.
Drawings
FIG. 1 is a schematic flow chart of an offline hot topic detection algorithm in the hot topic detection method of the present invention;
FIG. 2 is a schematic diagram of a fine-tuning structure of a RoBERTA-WWM model in the hot topic detection method of the present invention;
FIG. 3 is a schematic diagram of an online hot topic detection process in the hot topic detection method of the present invention.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.
It should be noted that, in order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are clearly and completely described below, and it is obvious that the described embodiments are some embodiments of the present invention, but not all embodiments.
Example 1:
Referring to FIG. 1, the invention discloses a hot topic detection method based on the RoBERTa-WWM model and the HDBSCAN algorithm, which includes offline hot topic detection and online hot topic detection.
The offline hot topic detection is used for detecting the hot topics contained in the existing data in the database, and in the processing process, the data is fixed and new topics cannot be generated.
The online hot topic detection is used for detecting the hot topics occurring in the Internet media platform in a certain time interval. In the processing process, data are continuously updated, the similarity between newly arrived reports and existing topics and the influence of topic drift and evolution on topic detection results need to be considered, and besides, the calculation efficiency of an algorithm needs to be considered, so that the real-time performance of calculation results is guaranteed.
Preferably, the offline hot topic detection comprises the following steps:
A1. A data cleaning step: clean the existing text data in the database to remove interference information from the text.
Specifically, news links, symbols and other interference information are removed from the text.
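A minimal sketch of such a cleaning pass is given below. The patterns are assumptions chosen for illustration (the source does not list which symbols are removed), and `clean_text` is a hypothetical helper.

```python
import re

# URLs are matched over printable ASCII only, so adjacent Chinese text is not swallowed.
URL_PATTERN = re.compile(r"(?:https?://|www\.)[\x21-\x7e]+")
# Anything that is neither a word character (letters, digits, CJK) nor whitespace.
SYMBOL_PATTERN = re.compile(r"[^\w\s]")
WHITESPACE_PATTERN = re.compile(r"\s+")


def clean_text(text):
    """Strip links and symbols, then collapse the remaining whitespace."""
    text = URL_PATTERN.sub(" ", text)
    text = SYMBOL_PATTERN.sub(" ", text)
    return WHITESPACE_PATTERN.sub(" ", text).strip()


cleaned_en = clean_text("Breaking news!!! http://t.cn/A6xyz #topic# @user")
cleaned_zh = clean_text("突发新闻！详情见http://t.cn/abc转发")
```

Links are removed before symbols so that the punctuation inside a URL does not leave fragments of the address behind.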
A2. A text vectorization step: fine-tune the RoBERTa-WWM model with an attached fine-tuning structure on a data set of labeled similar and dissimilar sentences, then input the cleaned text data into the fine-tuned (that is, trained) model to obtain vector representations of all the text data.
Fine-tuning is a model retraining process, as shown in FIG. 2. For example, two labeled similar sentences are each input into the original RoBERTa-WWM model; a sentence vector is obtained for each in the pooling layer of the fine-tuning structure; the two sentence vectors and their difference vector are concatenated; and the result is passed to a Softmax classifier, which performs logistic regression to obtain the similarity of the two sentences. This completes one retraining pass. The fine-tuning of the RoBERTa-WWM model with the attached fine-tuning structure is completed through multiple such training passes.
This step represents text with RoBERTa-WWM (A Robustly Optimized BERT Pretraining Approach, Whole Word Masking), a pre-trained language model for Chinese, and adds a fine-tuning structure on top of the model, so that the text vectors obtained from the RoBERTa-WWM model retain the semantic and contextual information of the text more completely. This avoids the poor distinguishability between vectors caused by representing topics with keyword vectors and fundamentally improves the accuracy of topic detection.
A3. A clustering step: cluster the text vectors obtained in step A2 using the HDBSCAN algorithm to obtain the topic distribution of the text data.
This step clusters the text vectors with the HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) algorithm, which is better suited to the characteristics of data on current internet media platforms and reduces the complexity and computational cost of the topic detection algorithm.
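To give a sense of how HDBSCAN treats sparse, noisy points, the sketch below computes the mutual reachability distance, the density-aware distance on which the algorithm builds its cluster hierarchy. It covers only this first stage, not the full algorithm; the Euclidean metric, the convention that a point is not its own neighbour, and the sample points are assumptions for illustration.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def core_distances(points, k):
    """Distance from each point to its k-th nearest other point."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(euclidean(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return result


def mutual_reachability(points, k):
    """d_mreach(a, b) = max(core_k(a), core_k(b), d(a, b)); zero on the diagonal."""
    core = core_distances(points, k)
    n = len(points)
    return [
        [0.0 if i == j else max(core[i], core[j], euclidean(points[i], points[j]))
         for j in range(n)]
        for i in range(n)
    ]


# Three nearby points and one isolated point standing in for a noisy post.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0)]
mreach = mutual_reachability(points, k=2)
```

Points in sparse regions have large core distances, so they are pushed away from every cluster, which is how isolated posts end up labeled as noise rather than forced into a topic.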
A4. An effect evaluation and parameter adjustment step: evaluate the offline topic detection model using two indexes, the silhouette coefficient and the mutual information index; if the preset effect is not reached, adjust the parameters of the RoBERTa-WWM model and the HDBSCAN algorithm until an optimal solution is reached.
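The silhouette coefficient used in step A4 can be computed as in the sketch below, using the standard definition s = (b - a) / max(a, b) averaged over points. Excluding points labeled -1 (the usual noise label in density-based clustering) and using Euclidean distance are assumptions of this sketch; the mutual information index is not covered because it needs reference labels.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette_score(points, labels):
    """Mean silhouette over clustered points; points labeled -1 are skipped as noise."""
    clusters = {}
    for idx, label in enumerate(labels):
        if label != -1:
            clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for label, members in clusters.items():
        for i in members:
            if len(members) == 1:
                scores.append(0.0)  # convention for singleton clusters
                continue
            # a: mean distance to the other members of the same cluster.
            a = sum(euclidean(points[i], points[j]) for j in members if j != i) / (len(members) - 1)
            # b: smallest mean distance to the members of any other cluster.
            b = min(
                sum(euclidean(points[i], points[j]) for j in other) / len(other)
                for other_label, other in clusters.items()
                if other_label != label
            )
            denom = max(a, b)
            scores.append(0.0 if denom == 0 else (b - a) / denom)
    return sum(scores) / len(scores)


# Two tight clusters plus one noise point; illustrative data only.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (50.0, 50.0)]
labels = [0, 0, 1, 1, -1]
score = silhouette_score(points, labels)
```

Values near 1 indicate compact, well-separated clusters, so a low score would be the signal to adjust the model or clustering parameters as the step describes.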
A5. A result generation step: calculate the heat value of each post and of each topic from the interaction information of the posts in each topic, sort by heat value, and determine the hot topics; then select the posts ranked in the top M% by heat value within each hot topic to represent that topic (for example, the top 50%), and take the mean of their text vectors as the vector representation of the topic.
The interaction information of the posts contained in each topic is used to update the topic's representation vector. Because the influence and propagation capacity of each post within the topic are taken into account, the model represents the topic more accurately and avoids the effects of topic drift and evolution.
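The topic representation described in step A5 can be sketched as follows. The post layout (a `heat` value and a `vector` per post), the helper name `topic_vector`, and rounding the cut-off up with at least one post kept are assumptions of this illustration.

```python
import math


def topic_vector(posts, top_percent):
    """Mean text vector of the posts ranked in the top `top_percent`% by heat."""
    ranked = sorted(posts, key=lambda p: p["heat"], reverse=True)
    # Round up and keep at least one post so small topics still get a vector.
    keep = max(1, math.ceil(len(ranked) * top_percent / 100))
    selected = ranked[:keep]
    dim = len(selected[0]["vector"])
    return [sum(p["vector"][d] for p in selected) / keep for d in range(dim)]


# Hypothetical posts of one topic with 2-dimensional text vectors.
posts = [
    {"heat": 10.0, "vector": [1.0, 0.0]},
    {"heat": 2.0, "vector": [0.0, 1.0]},
    {"heat": 7.0, "vector": [3.0, 2.0]},
    {"heat": 1.0, "vector": [9.0, 9.0]},
]
vec = topic_vector(posts, top_percent=50)
```

With M = 50, only the two hottest posts contribute, so low-heat posts such as the last one do not pull the topic vector away from what most readers actually saw.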
In step A5, the hot topics are the top N topics whose heat value exceeds the set threshold.
In step A5,
the heat calculation formula of the post is as follows:
[formula not preserved in this text]
where the terms of the formula denote, respectively, the heat value of the i-th post, the number of likes of the i-th post, the number of forwards of the i-th post, and the number of comments on the i-th post, and x, y and z are weight coefficients obtained by the entropy weight method.
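Since the formula itself is missing from this text, the sketch below assumes the simplest form consistent with the description: a weighted sum of likes, forwards and comments. The linear form, the function name `post_heat`, the weight values and the sample posts are all assumptions, not the patent's stated formula.

```python
def post_heat(likes, forwards, comments, x, y, z):
    # Assumed linear combination; the source only names the inputs and the weights.
    return x * likes + y * forwards + z * comments


# Hypothetical posts and weights (in practice the weights would come from
# the entropy weight method).
posts = [
    {"id": "p1", "likes": 100, "forwards": 20, "comments": 10},
    {"id": "p2", "likes": 5, "forwards": 60, "comments": 2},
    {"id": "p3", "likes": 40, "forwards": 0, "comments": 30},
]
x, y, z = 0.5, 0.3, 0.2
for post in posts:
    post["heat"] = post_heat(post["likes"], post["forwards"], post["comments"], x, y, z)

# Rank posts from hottest to coldest.
ranked_ids = [p["id"] for p in sorted(posts, key=lambda p: p["heat"], reverse=True)]
```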
The heat calculation formula of the topic is as follows:
[formula not preserved in this text]
Preferably, as shown in FIG. 3, the online hot topic detection includes the following steps:
B1. A data acquisition step: collect online public opinion data from the internet media platform in real time.
B2. An offline topic detection step: each time, take the online public opinion data crawled within a fixed time window and perform topic detection on it using the offline topic detection method.
B3. A similarity calculation and new-topic classification and fusion step: compute, in turn, the similarity between each topic newly obtained in step B2 and the existing topics.
If the similarity is greater than a preset threshold, merge the newly obtained topic into the existing topic with the highest similarity, re-sort the posts by heat value, and update the representation vector of the merged topic. If the similarity is below the preset threshold, the topic is a new topic; obtain its representation vector and add it to the existing topics.
B4. A result generation step: obtain all topics in the fixed time window, sort them by heat value to obtain a topic heat ranking, and select the top P topics as the hot topics that people are following and discussing in that period.
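The merge-or-create decision of step B3 can be sketched as below. Cosine similarity is an assumption (the source does not name the similarity measure), and `assign_topic`, the topic ids and the vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def assign_topic(new_vector, existing, threshold):
    """Return the id of the most similar existing topic, or None for a new topic."""
    best_id, best_sim = None, -1.0
    for topic_id, vector in existing.items():
        sim = cosine_similarity(new_vector, vector)
        if sim > best_sim:
            best_id, best_sim = topic_id, sim
    # Merge only when the best match clears the preset threshold.
    return best_id if best_sim > threshold else None


# Hypothetical representation vectors of existing topics.
existing = {"A": [1.0, 0.0], "B": [0.0, 1.0]}
merge_target = assign_topic([0.9, 0.1], existing, threshold=0.8)
new_topic = assign_topic([0.7, 0.7], existing, threshold=0.8)
```

A return value of `None` signals that the topic should be added to the existing topics with its own representation vector; otherwise the posts of the two topics are pooled, re-sorted by heat value, and the merged topic's vector is recomputed.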
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.
Claims (6)
1. A hot topic detection method based on the RoBERTa-WWM model and the HDBSCAN algorithm, characterized in that the hot topic detection method comprises offline hot topic detection and online hot topic detection;
the offline hot topic detection is used for detecting hot topics contained in existing data in a database, where the data volume and the number of topics do not change; the online hot topic detection is used for detecting, in real time, hot topics that emerge on an internet media platform within a given time interval, where the data volume and the number of topics grow continuously;
the offline hot topic detection comprises the following steps:
A1. a data cleaning step: cleaning the existing text data in the database to remove interference information from the text;
A2. a text vectorization step: fine-tuning a RoBERTa-WWM model with an attached three-layer fine-tuning structure on a data set of labeled similar and dissimilar sentences, and inputting the cleaned text data into the fine-tuned model to obtain vector representations of all the text data;
A3. a clustering step: clustering the text vectors obtained in step A2 using the HDBSCAN algorithm to obtain the topic distribution of the text data;
A4. an effect evaluation and parameter adjustment step: evaluating the offline topic detection model using two indexes, the silhouette coefficient and the mutual information index, and, if the preset effect is not reached, adjusting the parameters of the RoBERTa-WWM model and the HDBSCAN algorithm until an optimal solution is reached;
A5. a result generation step: calculating the heat value of each post and of each topic from the interaction information of the posts in each topic, sorting by heat value, and determining a hot topic list; and selecting the posts ranked in the top M% by heat value within each hot topic to represent that topic, and taking the mean of their text vectors as the vector representation of the topic.
2. The hot topic detection method based on the RoBERTa-WWM model and the HDBSCAN algorithm of claim 1, wherein in step A1 the interference information in the text includes news links and symbols.
3. The hot topic detection method based on the RoBERTa-WWM model and the HDBSCAN algorithm of claim 1, wherein in step A5 the hot topics are the top N topics whose heat value exceeds a set threshold.
4. The hot topic detection method based on the RoBERTa-WWM model and the HDBSCAN algorithm of claim 1, wherein, in step A5,
the heat calculation formula of the post is as follows:
6. The hot topic detection method based on the RoBERTa-WWM model and the HDBSCAN algorithm of any one of claims 1 to 5, wherein the online hot topic detection comprises the following steps:
B1. a data acquisition step: collecting online public opinion data from the internet media platform in real time;
B2. an offline topic detection step: each time, taking the online public opinion data crawled within a fixed time window and performing topic detection on it using the offline topic detection method;
B3. a similarity calculation and new-topic classification and fusion step: computing, in turn, the similarity between each topic newly obtained in step B2 and the existing topics;
if the similarity is greater than a preset threshold, merging the newly obtained topic into the existing topic with the highest similarity, re-sorting the posts by heat value, and updating the representation vector of the merged topic; if the similarity is below the preset threshold, treating the topic as a new topic, obtaining its representation vector, and adding it to the existing topic list;
B4. a result generation step: obtaining all topics in the fixed time window, sorting them by heat value to obtain a topic heat ranking, and selecting the top P topics as the hot topics that people are following and discussing in that period.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111615836.9A CN113987192B (en) | 2021-12-28 | 2021-12-28 | Hot topic detection method based on RoBERTA-WWM and HDBSCAN algorithm |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111615836.9A CN113987192B (en) | 2021-12-28 | 2021-12-28 | Hot topic detection method based on RoBERTA-WWM and HDBSCAN algorithm |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113987192A (en) | 2022-01-28 |
CN113987192B (en) | 2022-04-01 |
Family
ID=79734569
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111615836.9A Active CN113987192B (en) | 2021-12-28 | 2021-12-28 | Hot topic detection method based on RoBERTA-WWM and HDBSCAN algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113987192B (en) |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107894994A (en) * | 2017-10-18 | 2018-04-10 | 北京京东尚科信息技术有限公司 | A kind of method and apparatus for detecting much-talked-about topic classification |
CN110209813A (en) * | 2019-05-14 | 2019-09-06 | 天津大学 | A kind of incident detection and prediction technique based on autocoder |
CN110297988A (en) * | 2019-07-06 | 2019-10-01 | 四川大学 | Hot topic detection method based on weighting LDA and improvement Single-Pass clustering algorithm |
CN111125380A (en) * | 2019-12-30 | 2020-05-08 | 华南理工大学 | Entity linking method based on RoBERTA and heuristic algorithm |
CN111191453A (en) * | 2019-12-25 | 2020-05-22 | 中国电子科技集团公司第十五研究所 | Named entity recognition method based on confrontation training |
CN111339784A (en) * | 2020-03-06 | 2020-06-26 | 支付宝(杭州)信息技术有限公司 | Automatic new topic mining method and system |
CN111626056A (en) * | 2020-04-11 | 2020-09-04 | 中国人民解放军战略支援部队信息工程大学 | Chinese named entity identification method and device based on RoBERTA-BiGRU-LAN model |
CN113076734A (en) * | 2021-04-15 | 2021-07-06 | 云南电网有限责任公司电力科学研究院 | Similarity detection method and device for project texts |
CN113204643A (en) * | 2021-06-23 | 2021-08-03 | 北京明略软件系统有限公司 | Entity alignment method, device, equipment and medium |
CN113380418A (en) * | 2021-06-22 | 2021-09-10 | 浙江工业大学 | System for analyzing and identifying depression through dialog text |
CN113515593A (en) * | 2021-04-23 | 2021-10-19 | 平安科技(深圳)有限公司 | Topic detection method and device based on clustering model and computer equipment |
CN113657113A (en) * | 2021-08-24 | 2021-11-16 | 北京字跳网络技术有限公司 | Text processing method and device and electronic equipment |
Non-Patent Citations (3)
Title |
---|
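CHUANZHEN LI et al.: "Topic Detection and Tracking Based on Windowed DBSCAN and Parallel KNN", 《IEEE ACCESS》 * |
ZHU Yan et al.: "Named entity recognition for Chinese electronic medical records based on RoBERTa-WWM", 《Computer and Modernization》 * |
CHEN Xi et al.: "Research on BERT embedding for Chinese-Uyghur machine translation", 《Computer Engineering》 * |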
Also Published As
Publication number | Publication date |
---|---|
CN113987192B (en) | 2022-04-01 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||