CN111694958A - Microblog topic clustering method based on word vector and single-pass fusion - Google Patents

Publication number
CN111694958A
CN111694958A CN202010504392.0A CN202010504392A CN111694958A CN 111694958 A CN111694958 A CN 111694958A CN 202010504392 A CN202010504392 A CN 202010504392A CN 111694958 A CN111694958 A CN 111694958A
Authority
CN
China
Prior art keywords
word
topic
microblog
clustering
fusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010504392.0A
Other languages
Chinese (zh)
Inventor
陈海波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenlan Artificial Intelligence Application Research Institute (Shandong) Co.,Ltd.
Original Assignee
DeepBlue AI Chips Research Institute Jiangsu Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by DeepBlue AI Chips Research Institute Jiangsu Co Ltd
Priority to CN202010504392.0A
Publication of CN111694958A
Current legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2132 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on discrimination criteria, e.g. discriminant analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a microblog topic clustering method based on word vector and single-pass fusion, which comprises the following steps: preprocessing the acquired microblog data and constructing a word list library; carrying out Word2vec word vector mapping on the feature words; clustering the microblog texts by using single-pass fused with Word2vec word vectors; and performing topic discovery on the clusters by using an LDA topic model. The single-pass incremental clustering algorithm fused with Word2vec word vectors can effectively mine deep semantic information of the text, avoids the slowdown in processing caused by an excessively high VSM dimension, and addresses the problem that the traditional TF-IDF statistical method ignores the distribution of feature words among classes and among documents within a class.

Description

Microblog topic clustering method based on word vector and single-pass fusion
Technical Field
The invention belongs to the field of natural language processing, and particularly relates to a microblog topic clustering method based on word vector and single-pass fusion.
Background
With the rapid development of network technology and the widespread adoption of the mobile Internet, traditional news media such as newspapers, television and magazines can no longer fully meet audiences' demand for information, and new electronic media built on the Internet attract growing attention from netizens. Microblogs are an emerging new-media platform favored by more and more users for their flexibility and convenience. As the number of users grows, the volume of microblog data increases day by day. At the same time, because microblogs and other emerging electronic media currently lack supervision, sensitive information such as false messages and violent, reactionary and terrorist content spreads freely across the network, with serious negative effects on the healthy development of society and the long-term security of the country. Topic monitoring and tracking on microblog data therefore makes it possible to follow online public opinion effectively and to improve the ability of regulatory and government security departments to give early warning of real-world emergencies.
Because microblog texts use non-standard wording and vary arbitrarily in word count and length, traditional clustering algorithms and topic monitoring and tracking techniques perform poorly in microblog public opinion analysis. The present work analyzes the structural characteristics of microblog texts and improves on traditional text-based public opinion analysis algorithms by combining a word vector space model with deep learning methods.
Text clustering is an unsupervised technique that groups texts according to certain rules and strategies. The number of clusters may be fixed in advance from prior information, or left undetermined and produced by the clustering algorithm itself. Text clustering algorithms commonly used in natural language processing mainly include partition clustering, hierarchical clustering, grid clustering and Single-pass incremental clustering.
Partition clustering is a simple and efficient approach that divides the N texts of a corpus into K classes according to text similarity, where K is the number of resulting clusters. It requires that the number of texts exceed the number of clusters and that the data in different clusters be mutually exclusive. The common K-means and K-medoids clustering algorithms are both based on the idea of partition clustering.
The K-means clustering algorithm is a typical representative of partition clustering. Its training process follows the idea of the EM algorithm in machine learning: given N input texts and the number K of clusters, the E-step iteratively determines the latent category C of each text, and the M-step updates the parameters to minimize the loss, until the cluster centers no longer change. The K-means algorithm uses the sum of squared errors as its loss function:
$$E = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2$$
where C_i denotes the i-th cluster, k denotes the number of clusters, and μ_i is the mean vector of C_i, also called its centroid. This technique has the following disadvantages:
1. The value of k must be given
When running the k-means algorithm, the number of clusters must be specified. Often, however, we do not know how many classes the data should be grouped into and would rather have the algorithm produce a reasonable number of clusters; the value of k is usually hard to estimate in advance.
2. Random initial center points influence the result
In the k-means algorithm, the initial k center points are selected at random and recomputed in subsequent iterations until convergence. From the steps of the algorithm it is easy to see that the final result depends heavily on the positions of these initial k center points. The result is therefore highly random: each run can give a different result because the randomly selected initial centers differ.
3. Computational performance
The algorithm must repeatedly reassign objects and recompute the cluster centers after each adjustment, so its time overhead becomes very large when the data volume is very large.
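The dependence on the initial centers described in point 2 can be illustrated with a minimal pure-Python sketch. The function names `sq_dist` and `kmeans` and the sample points used below are assumptions made for illustration only and do not come from the patent; the loss returned is the sum of squared distances from each point to its nearest center.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, initial_centers, max_iter=100):
    """Plain Lloyd iteration started from the given centers (illustrative)."""
    centers = [tuple(c) for c in initial_centers]
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its members.
        new_centers = []
        for old, members in zip(centers, clusters):
            if members:
                new_centers.append(
                    tuple(sum(xs) / len(members) for xs in zip(*members)))
            else:
                new_centers.append(old)  # keep the center of an empty cluster
        if new_centers == centers:
            break  # converged: the centers no longer change
        centers = new_centers
    loss = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return centers, loss
```

For the four corners of a 4 x 1 rectangle, `(0, 0), (0, 1), (4, 0), (4, 1)`, starting from centers at the midpoints of the short sides converges to a loss of 1.0, whereas starting from the midpoints of the long sides stays at a loss of 16.0, so the outcome depends entirely on the initial choice.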
Single-pass is an incremental unsupervised clustering algorithm, a model for clustering topics in large volumes of news reports. Its basic idea is to cluster the texts, represented as feature vectors, one at a time in a single pass. The algorithm takes the first item as a new topic cluster and computes the similarity between each subsequently input text vector and every existing topic cluster. If the similarity exceeds a preset threshold, the text is assigned to that topic cluster; otherwise, the text is established as a new topic cluster. This continues until all texts have been read. The drawback of this technique is that the traditional Single-pass incremental clustering algorithm usually computes feature weights (TF-IDF) from term frequency (TF) statistics combined with inverse document frequency (IDF) to form a space vector representation of words, which has high dimensionality and high computational cost. It cannot resolve the semantic ambiguity of natural language and ignores the semantic information carried by word order; it also ignores the influence of context, which harms the recall and precision of information retrieval results.
The traditional single-pass clustering algorithm computes semantic similarity from the space vectors of feature words, which easily leads to excessive data dimensionality and loss of contextual semantics. A clustering algorithm is therefore proposed that combines Wikipedia-based word2vec word embeddings with the single-pass algorithm.
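To make the dimensionality issue of TF-IDF concrete, the following sketch builds TF-IDF vectors for tokenized texts. It assumes the common weighting tf x log(N/df), which the patent does not specify, and the name `tfidf_vectors` is hypothetical. Documents are assumed to be non-empty token lists.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) for a list of non-empty token lists.

    Assumed weighting: (term count / document length) * log(N / document frequency).
    """
    vocab = sorted({w for doc in docs for w in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each word occurs.
    df = Counter(w for doc in docs for w in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            [tf[w] / len(doc) * math.log(n_docs / df[w]) for w in vocab])
    return vocab, vectors
```

Every vector has one component per vocabulary word, so its length grows with the vocabulary of the corpus, and most components of a short text are zero.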
Disclosure of Invention
1. Objects of the invention
The invention provides a microblog topic clustering method based on word vector and single-pass fusion to solve the above problems.
2. The technical scheme adopted by the invention
The invention provides a microblog topic clustering method based on word vector and single-pass fusion, which comprises the following steps:
preprocessing the acquired microblog data and constructing a word list library;
carrying out Word2vec word vector mapping on the feature words;
clustering the microblog texts by using single-pass fused with Word2vec word vectors;
and performing topic discovery on the clusters by using an LDA topic model.
Furthermore, the mapping step uses the Skip-gram model, which is the inverse process of the CBOW algorithm model: its input is the word vector of the current feature word, and its output is the word vectors of the context words corresponding to that feature word.
Furthermore, in the Single-pass incremental clustering step, the text data represented by feature vectors are clustered item by item, that is, in a single pass.
Further, the microblog texts are clustered by using single-pass fused with Word2vec word vectors as follows:
taking the first piece of data as a new topic cluster, and calculating the similarity between each subsequently input text vector and each existing topic cluster;
if the obtained similarity exceeds a preset threshold value, the text belongs to the topic cluster;
and if not, establishing the text as a new topic cluster, until all the texts are read.
Further, if the similarities to several clusters exceed the preset threshold, the text is assigned to the cluster with the highest similarity.
Further, in the LDA multinomial distribution step, each word in a document is regarded as being obtained by choosing a topic with a certain probability and then choosing a word from that topic with a certain probability, so that documents and topics are related by a multinomial distribution, and topics and words are likewise related by a multinomial distribution.
Still further, the preprocessing operation includes a noise data filtering step, a Chinese word segmentation step, a stop word filtering step and a text feature selection step.
Further, in the noise data filtering step, the data noise mainly comprises advertisements, emoticons, special characters, pictures and hyperlinks appearing in the text.
Furthermore, in the Chinese word segmentation step, jieba word segmentation is used.
Further, in the stop word filtering step, stop words are removed from the segmentation result by constructing a stop word list.
3. Advantageous effects of the invention
(1) The invention performs topic detection with microblog short texts as the target;
(2) the invention fuses the single-pass incremental clustering algorithm with feature word vectors mapped by Word2vec, which preserves more of the contextual semantic information of short texts while reducing the feature dimension, and improves coarse-grained topic clustering;
(3) topic detection in the invention consists first of clustering texts into topic clusters and second of discovering topics; fusing the Word2vec-mapped feature word vectors with single-pass in topic clustering is particularly effective at reducing the feature dimension and improving the timeliness of the algorithm.
Drawings
FIG. 1 is a CBOW model;
FIG. 2 is a Skip-gram model;
FIG. 3 is a flow diagram of topic cluster discovery;
FIG. 4 is an LDA topic model.
Detailed Description
The technical solutions in the examples of the present invention are clearly and completely described below with reference to the drawings in the examples of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without inventive step, are within the scope of the present invention.
The present invention will be described in further detail with reference to the accompanying drawings.
To improve the clustering of microblog text topics, the word vectors obtained from a Word2vec model are fused into the single-pass incremental clustering algorithm to mine deep semantic features of the text. Microblog topic clustering first preprocesses the microblog texts, including denoising, Chinese word segmentation and stop word removal, and then constructs a sentence-based space vector model from the word segmentation results. The single-pass incremental clustering algorithm fused with Word2vec word vectors can effectively mine deep semantic information of the text, avoids the slowdown in processing caused by an excessively high VSM dimension, and addresses the problem that the traditional TF-IDF statistical method ignores the distribution of feature words among classes and among documents within a class.
Example 1
The LDA topic model is sensitive in topic discovery and works well on long texts such as news reports. Microblog short texts, however, contain few words, a large amount of irrelevant information such as noise, and few feature words, so LDA performs poorly on them. The single-pass algorithm is therefore improved: the improved single-pass algorithm first clusters the microblog texts into topic clusters, and the LDA topic model is then used for topic discovery on the texts within each cluster. The implementation process is as follows: preprocess the acquired microblog data and construct a word list library; perform Word2vec word vector mapping on the feature words; cluster the microblog texts using single-pass fused with Word2vec word vectors; and perform topic discovery on the clusters with the LDA topic model.
1. Filtering noisy data
Data noise mainly comprises advertisements, emoticons, special characters, pictures, hyperlinks and the like appearing in the text. Such content appears frequently but carries little useful information; it seriously interferes with the learning of the algorithm model and greatly affects the accuracy of topic detection and tracking.
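A minimal sketch of such a noise filter, using only Python's standard `re` module, is given below. The patterns are assumptions for illustration: the patent does not state how noise is detected, and the bracketed form `[...]` for emoticon codes is an assumed text encoding. Advertisements and pictures would need separate handling that is not shown.

```python
import re

# Hyperlinks: limited to typical URL characters so that Chinese text
# written directly after a link is not swallowed.
URL_RE = re.compile(r"https?://[A-Za-z0-9./?=&_%#-]+")
# Emoticon codes, assumed here to appear as short bracketed tags such as [lol].
EMOTICON_RE = re.compile(r"\[[^\[\]]{1,8}\]")
# Special characters: anything that is neither a word character nor whitespace.
SPECIAL_RE = re.compile(r"[^\w\s]")


def strip_noise(text):
    """Remove hyperlinks, bracketed emoticon codes and special characters."""
    text = URL_RE.sub(" ", text)
    text = EMOTICON_RE.sub(" ", text)
    text = SPECIAL_RE.sub(" ", text)
    return " ".join(text.split())  # collapse the whitespace left behind
```

The order matters: links and emoticon codes are removed before the generic special-character pass, which would otherwise break them into fragments.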
2. Chinese word segmentation
Unlike English text, in which words are separated by spaces, Chinese text has no explicit word boundaries and therefore cannot be segmented in the same way as English. Chinese word segmentation belongs to natural language processing, and existing word segmentation methods fall into three main categories: 1. methods based on string matching; 2. methods based on understanding; 3. methods based on statistics. Chinese word segmentation technology is now mature. Commonly used tools include the Stanford Chinese word segmenter, ICTCLAS, Paoding Jieniu and jieba; jieba is used by many researchers because it is simple to install, convenient to operate and gives satisfactory segmentation results. Here, jieba is used in accurate mode to segment the microblog texts.
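As an illustration of the first category (string matching), the sketch below implements forward maximum matching against a small dictionary. It is not the segmentation tool selected by the method; the function name, the toy dictionary and the `max_len` parameter are assumptions made for illustration.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Segment text by repeatedly taking the longest dictionary word.

    Characters that start no dictionary word are emitted as single-character
    tokens, so the loop always advances.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens
```

With the toy dictionary `{"微博", "话题", "聚类", "方法"}`, the string "微博话题聚类方法" is split into those four words. Statistical segmenters handle ambiguous and out-of-dictionary words better, which is one reason dedicated tools are preferred in practice.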
3. Filter stop words
Stop words are to some extent equivalent to filter words and have two main features: 1. they are widely or very frequently used words, such as "I", "you", "he" and "is" in a text; 2. they are frequently used but carry no specific meaning in the text, mainly adverbs, prepositions, modal particles, interjections and the like. Filtering stop words reduces the interference of noise with useful information and improves the performance of the algorithm to some extent. Stop words are removed from the word segmentation result using a constructed stop word list of 1210 words.
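Stop word removal and the construction of the word list can be sketched as follows. The function names and the `min_count` parameter are hypothetical, and the three-word stop list in the usage note stands in for the 1210-word list, whose contents are not given in the patent.

```python
from collections import Counter


def remove_stop_words(tokens, stop_words):
    """Drop every token that appears in the stop word collection."""
    return [t for t in tokens if t not in stop_words]


def build_vocabulary(token_lists, min_count=1):
    """Collect the words that occur at least min_count times overall."""
    counts = Counter(t for tokens in token_lists for t in tokens)
    return sorted(w for w, c in counts.items() if c >= min_count)
```

For example, `remove_stop_words(["我", "喜欢", "微博", "的", "话题"], {"我", "的", "是"})` keeps only the three content words. Holding the stop words in a `set` keeps each membership test constant-time on average.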
4. Text feature selection
The Word2vec algorithm is a model that uses ideas from deep learning to represent feature items as dense real-valued vectors. By training on text, each feature item is represented as a K-dimensional vector, and vector similarity in this space can represent semantic similarity in the text. The Word2vec model includes two methods, CBOW (Continuous Bag of Words) and Skip-gram. By training on the corpus with either the CBOW or the Skip-gram method, an optimized feature vector representation of each word is obtained.
The input of the CBOW algorithm model is the word vectors of the context words of a given feature item, and the word vector of the feature word itself is obtained through training. The model is shown in FIG. 1:
The Skip-gram algorithm model is the inverse process of the CBOW algorithm model: its input is the word vector of the current feature word, and its output is the word vectors of the context words corresponding to that feature word. The Skip-gram model is shown in FIG. 2:
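The inverse relationship between the two models is easiest to see in the training examples each one consumes. The sketch below only generates those examples from a token sequence; it does not train a network. The function names and the `window` parameter are assumptions for illustration.

```python
def context_windows(tokens, window=2):
    """Yield (center word, list of context words) for every position."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right


def skipgram_pairs(tokens, window=2):
    """Skip-gram: the center word is the input, each context word a target."""
    return [(center, ctx)
            for center, context in context_windows(tokens, window)
            for ctx in context]


def cbow_examples(tokens, window=2):
    """CBOW: the context words are the input, the center word is the target."""
    return [(context, center)
            for center, context in context_windows(tokens, window)]
```

For the tokens `["a", "b", "c"]` with `window=1`, Skip-gram yields the pairs `("a", "b")`, `("b", "a")`, `("b", "c")`, `("c", "b")`, while CBOW yields `(["b"], "a")`, `(["a", "c"], "b")`, `(["b"], "c")`.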
5. single-pass incremental clustering
The basic idea of the clustering algorithm is to cluster the texts, represented as feature vectors, one at a time in a single pass. The algorithm takes the first item as a new topic cluster and computes the similarity between each subsequently input text vector and every existing topic cluster. If the similarity exceeds a preset threshold, the text is assigned to that topic cluster; otherwise, the text is established as a new topic cluster. This continues until all texts have been read. The flow chart of the algorithm is shown in FIG. 3.
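The single-pass procedure can be sketched in pure Python as follows, including the rule from the disclosure that a text whose similarity to several clusters exceeds the threshold joins the most similar one. Several details are assumptions that the patent does not fix: a text vector is taken as the mean of its word vectors, similarity is cosine similarity, and each cluster is represented by the running mean of its members. All function names are hypothetical.

```python
import math


def mean_vector(tokens, word_vectors):
    """Average the vectors of the known words; None if no word is known."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vecs:
        return None
    return [sum(column) / len(vecs) for column in zip(*vecs)]


def cosine(u, v):
    """Cosine similarity; 0.0 when either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def single_pass(text_vectors, threshold):
    """Cluster vectors in one pass; returns a list of cluster dictionaries."""
    clusters = []
    for idx, vec in enumerate(text_vectors):
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim > best_sim:  # must exceed the threshold and the best so far
                best, best_sim = cluster, sim
        if best is None:
            # No existing cluster is similar enough: start a new topic cluster.
            clusters.append({"centroid": list(vec), "members": [idx]})
        else:
            # Assumed update rule: running mean of the member vectors.
            n = len(best["members"])
            best["centroid"] = [(c * n + x) / (n + 1)
                                for c, x in zip(best["centroid"], vec)]
            best["members"].append(idx)
    return clusters
```

Because each text is compared only with the clusters that exist when it arrives, the result depends on the input order and on the threshold, which both need to be chosen with care.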
6. LDA topic model
LDA (Latent Dirichlet Allocation), proposed in 2003 by David M. Blei, Andrew Ng et al., is a document generation model, also described as a three-layer Bayesian probability model over documents, topics and words. Its basic idea is that each word in a document is obtained by choosing a topic with a certain probability and then choosing a word from that topic with a certain probability, so that documents and topics are related by a multinomial distribution, and topics and words are likewise related by a multinomial distribution, as shown in FIG. 4.
In FIG. 4, the hyperparameter α governs the prior distribution over topics for a document d, the hyperparameter β governs the prior distribution over words for each topic, θ denotes the topic distribution of a document, and ψ denotes the word distribution of a topic.
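The generative process behind this model can be sketched with the standard library alone. The code below samples documents from the model rather than fitting it to data. It assumes symmetric Dirichlet priors with scalar `alpha` and `beta`, and the function names are hypothetical; `theta` and `psi` follow the symbols used for FIG. 4.

```python
import random


def sample_dirichlet(concentration, rng):
    """Draw from a Dirichlet distribution by normalizing gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in concentration]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(n_docs, n_words, vocab, n_topics, alpha, beta, seed=0):
    """Sample documents as lists of (topic index, word) pairs."""
    rng = random.Random(seed)
    # psi[k]: word distribution of topic k, drawn from the beta prior.
    psi = [sample_dirichlet([beta] * len(vocab), rng) for _ in range(n_topics)]
    corpus = []
    for _ in range(n_docs):
        # theta: topic distribution of this document, drawn from the alpha prior.
        theta = sample_dirichlet([alpha] * n_topics, rng)
        doc = []
        for _ in range(n_words):
            z = rng.choices(range(n_topics), weights=theta)[0]  # pick a topic
            w = rng.choices(vocab, weights=psi[z])[0]           # pick a word
            doc.append((z, w))
        corpus.append((theta, doc))
    return psi, corpus
```

Topic discovery runs this process in reverse: given only the words, inference estimates `theta` for each document and `psi` for each topic.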
The experimental environment can be seen in the following table
System environment
Figure BDA0002525990920000071
7. Test evaluation index
The confusion matrix is a tool for judging the results of a model and is often used to assess classifier performance; it applies to models with different numbers of classes. The precision, recall, miss rate and false detection rate derived from the confusion matrix are used to evaluate experimental performance. The evaluation indices are described taking binary classification as an example, as shown in the following table:
                     Classified as positive    Classified as negative
Actually positive    A (TP)                    D (FN)
Actually negative    B (FP)                    C (TN)
In the table, A (TP) denotes samples that are actually positive and classified as positive; B (FP) denotes samples that are actually negative but classified as positive; C (TN) denotes samples that are actually negative and classified as negative; D (FN) denotes samples that are actually positive but classified as negative.
The evaluation index formula is as follows:
the precision P is:
$$P = \frac{TP}{TP + FP}$$
the recall R is:
$$R = \frac{TP}{TP + FN}$$
the false detection rate P_false is:
$$P_{false} = \frac{FP}{FP + TN}$$
the miss rate P_miss is:
$$P_{miss} = \frac{FN}{TP + FN}$$
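The four indices can be computed from the confusion-matrix counts as in the sketch below. It uses the standard meanings of the counts: TP and FN are actual positives classified as positive and as negative, and FP and TN are actual negatives classified as positive and as negative. The function name and the convention of returning 0.0 for an empty denominator are assumptions for illustration.

```python
def detection_metrics(tp, fp, tn, fn):
    """Return precision, recall, false detection rate and miss rate."""
    def ratio(numerator, denominator):
        # Assumed convention: an undefined ratio is reported as 0.0.
        return numerator / denominator if denominator else 0.0

    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "false_detection_rate": ratio(fp, fp + tn),
        "miss_rate": ratio(fn, tp + fn),
    }
```

Recall and miss rate share the denominator TP + FN, so they always sum to 1 whenever at least one actual positive exists.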
8. Analysis of experimental results
To evaluate the performance of the LDA-based topic detection algorithm, the precision P, the recall R, the false detection rate P_false and the miss rate P_miss are used.
9. Performance evaluation and analysis of the topic cluster clustering algorithm
Three methods are compared experimentally: the K-means clustering algorithm with TF-IDF feature values, the single-pass algorithm with TF-IDF feature values, and the single-pass algorithm combined with Wikipedia word2vec. The experimental data cover 14 topics with 800 items in total under each topic, from which 100 items are randomly drawn as test data. The experimental performance is compared as follows:
Figure BDA0002525990920000082
As can be seen from the above table, the word2vec + single-pass algorithm combined with Wikipedia presented here performs significantly better than the traditional single-pass algorithm and the k-means algorithm. The main reason for the improvement is that the word vectors incorporate latent semantic information among feature words, so that similar texts lie closer together in Euclidean distance.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A microblog topic clustering method based on word vector and single-pass fusion, characterized by comprising the following steps:
preprocessing the acquired microblog data and constructing a word list library;
carrying out Word2vec word vector mapping on the feature words;
clustering the microblog texts by using single-pass fused with Word2vec word vectors;
and performing topic discovery on the clusters by using an LDA topic model.
2. The microblog topic clustering method based on the fusion of the word vector and the single-pass according to claim 1, characterized in that the mapping step is a Skip-gram step, which is the inverse process of the CBOW algorithm model: the input is the word vector of the current feature word and the output is the word vectors of the context words corresponding to the feature word.
3. The microblog topic clustering method based on the fusion of the word vector and the single-pass according to claim 1, characterized in that in the Single-pass incremental clustering step, the text data represented by the feature vectors are clustered one by one, that is, in a single pass.
4. The microblog topic clustering method based on the fusion of the word vector and the single-pass according to claim 3, characterized in that the microblog texts are clustered by single-pass fused with Word2vec word vectors as follows:
taking the first piece of data as a new topic cluster, and calculating the similarity between each subsequently input text vector and each existing topic cluster;
if the obtained similarity exceeds a preset threshold value, the text belongs to the topic cluster;
otherwise, the text is established as a new topic cluster, until all the texts are read.
5. The microblog topic clustering method based on the fusion of the word vector and the single-pass according to claim 4, characterized in that: if the similarities to multiple clusters exceed the preset threshold value, the text is assigned to the cluster with the maximum similarity.
6. The microblog topic clustering method based on the fusion of the word vector and the single-pass according to claim 2, characterized in that: in the LDA multinomial distribution step, each word in a document is regarded as being obtained by choosing a topic with a certain probability and then choosing a word from that topic with a certain probability, so that documents and topics are related by a multinomial distribution, and topics and words are likewise related by a multinomial distribution.
7. The microblog topic clustering method based on the fusion of the word vector and the single-pass according to claim 1, characterized by further comprising preprocessing operations:
a noise data filtering step, a Chinese word segmentation step, a stop word filtering step and a text feature selection step.
8. The microblog topic clustering method based on the fusion of the word vector and the single-pass according to claim 7, characterized in that: in the noise data filtering step, the data noise comprises advertisements, emoticons, special characters, pictures and hyperlinks which appear in the text.
9. The microblog topic clustering method based on the fusion of the word vector and the single-pass according to claim 7, characterized in that: in the Chinese word segmentation step, jieba word segmentation is used.
10. The microblog topic clustering method based on the fusion of the word vector and the single-pass according to claim 7, characterized in that: in the stop word filtering step, stop words are removed from the word segmentation result by constructing a stop word list.
CN202010504392.0A 2020-06-05 2020-06-05 Microblog topic clustering method based on word vector and single-pass fusion Pending CN111694958A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010504392.0A CN111694958A (en) 2020-06-05 2020-06-05 Microblog topic clustering method based on word vector and single-pass fusion

Publications (1)

Publication Number Publication Date
CN111694958A true CN111694958A (en) 2020-09-22

Family

ID=72479496

Country Status (1)

Country Link
CN (1) CN111694958A (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112579780A (en) * 2020-12-25 2021-03-30 青牛智胜(深圳)科技有限公司 Single-pass based clustering method, system, device and storage medium
CN112612873A (en) * 2020-12-25 2021-04-06 上海德拓信息技术股份有限公司 NLP technology-based centralized event mining method
CN112632229A (en) * 2020-12-30 2021-04-09 语联网(武汉)信息技术有限公司 Text clustering method and device
CN112632965A (en) * 2020-12-25 2021-04-09 上海德拓信息技术股份有限公司 Work order automatic classification method for government service hotline field
CN112926297A (en) * 2021-02-26 2021-06-08 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for processing information
CN113064990A (en) * 2021-01-04 2021-07-02 上海金融期货信息技术有限公司 Hot event identification method and system based on multi-level clustering
CN113065341A (en) * 2021-03-14 2021-07-02 北京工业大学 Automatic labeling and classifying method for environmental complaint report text
CN113139599A (en) * 2021-04-22 2021-07-20 北方工业大学 Service distributed clustering method fusing word vector expansion and topic model
CN113378950A (en) * 2021-06-22 2021-09-10 深圳市查策网络信息技术有限公司 Unsupervised classification method for long texts
CN113515593A (en) * 2021-04-23 2021-10-19 平安科技(深圳)有限公司 Topic detection method and device based on clustering model and computer equipment
CN114691861A (en) * 2020-12-28 2022-07-01 北京市博汇科技股份有限公司 Topic clustering method based on subject term semantic similarity
CN114896393A (en) * 2022-04-15 2022-08-12 中国电子科技集团公司第十研究所 Data-driven text incremental clustering method
CN115099373A (en) * 2022-08-26 2022-09-23 南京中孚信息技术有限公司 Single-pass-based text clustering method and device

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110069703A (en) * 2019-03-19 2019-07-30 南京大学 Microblog topic detection method based on feature enhancement

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110069703A (en) * 2019-03-19 2019-07-30 南京大学 Microblog topic detection method based on feature enhancement

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
丁晓庆: "Research and Implementation of Microblog Hot Topic Discovery" *
周茜: "Research on a Microblog Topic Detection Method Fusing word2vec and Single-Pass" *

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112612873A (en) * 2020-12-25 2021-04-06 上海德拓信息技术股份有限公司 NLP technology-based centralized event mining method
CN112632965A (en) * 2020-12-25 2021-04-09 上海德拓信息技术股份有限公司 Work order automatic classification method for government service hotline field
CN112612873B (en) * 2020-12-25 2023-07-07 上海德拓信息技术股份有限公司 Centralized event mining method based on NLP technology
CN112579780A (en) * 2020-12-25 2021-03-30 青牛智胜(深圳)科技有限公司 Single-pass based clustering method, system, device and storage medium
CN112632965B (en) * 2020-12-25 2024-05-03 上海德拓信息技术股份有限公司 Work order automatic classification method for government service hotline field
CN112579780B (en) * 2020-12-25 2022-02-15 青牛智胜(深圳)科技有限公司 Single-pass based clustering method, system, device and storage medium
CN114691861A (en) * 2020-12-28 2022-07-01 北京市博汇科技股份有限公司 Topic clustering method based on subject term semantic similarity
CN112632229A (en) * 2020-12-30 2021-04-09 语联网(武汉)信息技术有限公司 Text clustering method and device
CN113064990A (en) * 2021-01-04 2021-07-02 上海金融期货信息技术有限公司 Hot event identification method and system based on multi-level clustering
CN112926297A (en) * 2021-02-26 2021-06-08 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for processing information
CN112926297B (en) * 2021-02-26 2023-06-30 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for processing information
CN113065341A (en) * 2021-03-14 2021-07-02 北京工业大学 Automatic labeling and classifying method for environmental complaint report text
CN113139599B (en) * 2021-04-22 2023-08-08 北方工业大学 Service distributed clustering method integrating word vector expansion and topic model
CN113139599A (en) * 2021-04-22 2021-07-20 北方工业大学 Service distributed clustering method fusing word vector expansion and topic model
CN113515593A (en) * 2021-04-23 2021-10-19 平安科技(深圳)有限公司 Topic detection method and device based on clustering model and computer equipment
CN113378950A (en) * 2021-06-22 2021-09-10 深圳市查策网络信息技术有限公司 Unsupervised classification method for long texts
CN114896393A (en) * 2022-04-15 2022-08-12 中国电子科技集团公司第十研究所 Data-driven text incremental clustering method
CN115099373A (en) * 2022-08-26 2022-09-23 南京中孚信息技术有限公司 Single-pass-based text clustering method and device

Similar Documents

Publication Publication Date Title
CN111694958A (en) Microblog topic clustering method based on word vector and single-pass fusion
CN110297988B (en) Hot topic detection method based on weighted LDA and improved Single-Pass clustering algorithm
CN112581006B (en) Public opinion information screening and enterprise subject risk level monitoring public opinion system and method
Faguo et al. Research on short text classification algorithm based on statistics and rules
CN110825877A (en) Semantic similarity analysis method based on text clustering
CN111914087B (en) Public opinion analysis method
CN108932311B (en) Method for detecting and predicting emergency
CN105335352A (en) Entity identification method based on Weibo emotion
CN110705247B (en) Text similarity calculation method based on χ2-C
CN113806547B (en) Deep learning multi-label text classification method based on graph model
Patil et al. Machine learning techniques for the classification of fake news
CN109885675A (en) Text sub-topic discovery method based on improved LDA
Aggarwal Mining text and social streams: A review
Ahmed et al. Natural language processing and machine learning based cyberbullying detection for Bangla and Romanized Bangla texts
Yang et al. Understanding online consumer review opinions with sentiment analysis using machine learning
CN111259156A (en) Hot spot clustering method facing time sequence
Aggarwal Mining text streams
CN114611491A (en) Intelligent government affair public opinion analysis research method based on text mining technology
Kandhro et al. Classification of Sindhi headline news documents based on TF-IDF text analysis scheme
Yang et al. News topic detection based on capsule semantic graph
CN108920475B (en) Short text similarity calculation method
Zong et al. Topic detection and tracking
CN114491062A (en) Short text classification method fusing knowledge graph and topic model
CN112905751A (en) Topic evolution tracking method combining topic model and twin network model
CN116881451A (en) Text classification method based on machine learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220418

Address after: 250000 13th floor, Hanyu Golden Valley artificial intelligence building, Jingshi Road, Jinan area, China (Shandong) pilot Free Trade Zone, Jinan City, Shandong Province

Applicant after: Shenlan Artificial Intelligence Application Research Institute (Shandong) Co.,Ltd.

Address before: 213000 No.103, building 4, Chuangyan port, Changzhou science and Education City, No.18, middle Changwu Road, Wujin District, Changzhou City, Jiangsu Province

Applicant before: SHENLAN ARTIFICIAL INTELLIGENCE CHIP RESEARCH INSTITUTE (JIANGSU) Co.,Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20200922