CN106383877A - On-line short text clustering and topic detection method of social media - Google Patents
- Publication number
- CN106383877A (CN201610818311.8A)
- Authority
- CN
- China
- Prior art keywords
- class
- text
- word
- clustering
- classes
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses an online short text clustering and topic detection method for social media. Through text preprocessing, online text clustering, detection and merging of similar classes, and hot topic identification, the method alleviates, to a certain degree, the insufficient clustering caused by the high-dimensional sparse word vector space in existing online short text clustering methods, and achieves effective clustering of large-scale online short texts. The expandable word vector space proposed by the method addresses the storage of high-dimensional sparse word vectors and lowers computational complexity. The method uses a word index to accelerate clustering; an improved clustering mode, "similar merged winner-take-all", and a "no entropy increase" rule for merging similar classes alleviate the insufficient merging of same-topic classes caused by high-dimensional sparse short text features. The hot topic detection and identification method adopted by the invention classifies topics as valuable or valueless in a simple but effective way, so that valuable topics can be mined and tracked.
Description
Technical Field
The invention belongs to the field of data mining, and particularly relates to a social media data mining technology.
Background
Topic detection on social media began within the last decade. With the explosive growth of platforms such as Twitter and Facebook abroad and of microblog-style social platforms domestically, social media has become a huge real-time information exchange platform and commercial market, and mining social media data is highly valuable. Social media products changed the social mode of traditional long-text blogs by limiting the number of words per text, so that information spreads more quickly and efficiently. Although social networks such as Twitter and microblogs have recently begun to relax these length limits, users in fast-paced modern life remain accustomed to communicating in short texts, so the information on these platforms still consists mainly of short texts.
Text clustering is an important means of mining text information. It matters for simplifying text data, accelerating text retrieval, and analyzing text information and semantics. Because current social networks contain a great amount of short text, cluster analysis of short texts is a key point of social media data mining. Social media short texts are characterized by incomplete information (content is omitted because of the word limit), irregular expression (colloquialisms, homophone substitutions, irregular abbreviations, catchphrases and symbolic expressions), and few usable features (the text is short), so clustering them is harder than clustering traditional long texts. Existing text clustering methods are mainly based on traditional clustering methods and, like them, can be divided into hierarchical methods, partition-based methods, density-based methods, grid-based methods and model-based methods. These methods can also be applied to short text clustering.
Hierarchical clustering algorithms are divided, according to the clustering direction, into agglomerative and divisive hierarchical clustering. Agglomerative clustering first treats each data object as a cluster, calculates the pairwise similarity between clusters, merges the two clusters with the highest similarity, recalculates the similarity between the newly formed cluster and the other clusters, and iterates until all objects are gathered into one class or the maximum similarity falls below a set threshold. Divisive hierarchical clustering is the inverse process of the agglomerative algorithm. Both finally form a hierarchical tree, and the distance between clusters must be calculated at every merge or split, so the computational complexity is high and the approach is unsuitable for clustering a large-scale text collection as a whole; BIRCH and CURE are representative algorithms. Partition-based methods are frequently used in text clustering, and the representative algorithm is K-means, which divides the sample set into K clusters by assigning each sample to the cluster whose center is closest, recalculates the center of every newly obtained cluster, and iterates toward an optimum. The algorithm needs many iterations and is unsuitable for clustering large-scale texts. Density-based methods cluster according to the density of the samples in space; a representative algorithm is DBSCAN. Grid-based methods divide the data space into a grid structure with a finite number of cells and perform all processing on individual cells; typical algorithms include STING, CLIQUE and WaveCluster. Model-based methods are represented by LDA model clustering and SOM neural network clustering.
Although many methods for clustering texts exist, the existing methods have three problems in short text clustering application:
First, most methods are difficult to extend to online incremental clustering. On a social platform, the data volume grows continuously as users post, so obtaining all data and clustering it at once is impractical; an incremental online clustering method is more meaningful for processing social media text data. However, existing text clustering methods are poorly suited to incremental clustering. For example, the hierarchical trees generated by most hierarchical clustering methods do not allow nodes to be inserted and adjusted sequentially, and the methods that do support incremental hierarchical clustering have high computational complexity. In K-means clustering, adding a new sample is equivalent to running the iteration process again, and processing is very slow when many iterations are needed. In model-based clustering, the completeness of the sample set affects the probabilistic model, so the samples must be input all at once and incremental clustering is difficult to realize.
Second, the dimension of the word vector space cannot change dynamically. Clustering typically begins by constructing a word vector space to describe texts or text classes, usually using the vocabulary and other features of the texts. On a social platform, the constant emergence of new things and of individual expression makes the set of features in the texts keep growing. If a feature space of fixed dimensionality is adopted, more information is lost than for an ordinary long text, and for a short text on social media, losing a large part of the feature space can greatly affect clustering. However, the word vector space of most existing incremental clustering algorithms cannot grow: these methods take all words in the sample texts as the dimensions of the word vector to establish a fixed word vector space, and calculate the similarity between word vectors of that fixed dimension during clustering. Examples include the density-based DBSCAN algorithm and the incremental SOM (self-organizing map) algorithm.
Third, the sparsity of the word vector space is severe. Sparsity means that the non-zero components of each text's feature vector make up only a small proportion of the total vector space. A single short text in social media contains few words, but the total number of distinct words is large, so the word vectors are high-dimensional and sparse. This high-dimensional sparsity makes word vectors costly to store, makes similarity calculation between word vectors costly, and degrades the text clustering effect.
Disclosure of Invention
To solve the above technical problems, the invention provides a social media online short text clustering and topic detection method. Through text preprocessing, online text clustering, detection and merging of similar classes, and identification of hot topics, it alleviates to a certain extent the insufficient clustering caused by the high-dimensional sparsity of the word vector space in existing online short text clustering methods.
The technical scheme adopted by the invention is as follows: a social media online short text clustering and topic detection method comprises the following steps:
S1, preprocessing the social media short text to obtain a pure word sequence with marks;
S2, carrying out online clustering on the social media short texts preprocessed in step S1, wherein the online clustering comprises the following steps:
S21, constructing an expandable word vector of the short text;
S22, calculating the cosine similarity of the short text and the class;
S23, calculating the cosine similarity of the new text and each class according to step S22, and selecting the classes whose cosine similarity with the new text is higher than a first threshold; the class with the highest cosine similarity with the new text is called the most similar class, and all other classes whose cosine similarity with the new text is higher than the first threshold are called candidate similar classes; for each candidate similar class, adding its word frequency number vector to that of the most similar class and calculating the cosine similarity between the summed vector and the new text word vector; if this cosine similarity increases, or decreases by less than a third threshold, compared with the cosine similarity between the most similar class before merging and the new text, merging the candidate similar class into the most similar class;
S24, detecting and merging similar short text classes, wherein classes that have not been fully merged are detected and identified according to the similarity of the two classes;
S3, detecting hot topics, wherein the popularity of a topic is estimated from the total number of texts in the class, the average arrival rate of the class texts and the current average arrival rate of the class texts, and hot topics are identified.
Further, the step S1 includes: text standardization, text word segmentation, named entity labeling, part-of-speech tagging, lemmatization and stop-word removal;
the text standardization converts the short text into a standard format, including filtering out characters other than English letters and some Latin letters, and filtering out all symbols except relevant punctuation; the text word segmentation splits the standardized text on spaces to obtain an ordered sequence of words and some symbols; the named entity labeling mainly extracts person names, place names and organization names using an existing named entity labeling method; the part-of-speech tagging uses an existing method to tag words simply as nouns, verbs, adjectives, adverbs and so on; the lemmatization uses an existing method to convert words into their base forms, which reduces the dimensionality of the word vector space; and the stop-word removal converts the text to lower case and removes the stop words and all symbols, so as to obtain a pure word sequence with marks.
Further, the step S21 includes: constructing a single short text word frequency number vector and constructing a word frequency number vector of a short text class;
constructing a single short text word frequency number vector: converting the words obtained by preprocessing into weighted word frequency number vectors according to corresponding parts of speech or named entities;
constructing a word frequency number vector of the short text class: when a new short text is aggregated into a class, the word frequency number vector of the new text and the word frequency number vector of the class are added.
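Further, the calculation formula of step S22 is:
$$\mathrm{sim}(\vec{d}, \vec{c}) = \frac{\langle \vec{d}, \vec{c} \rangle_{\vec{d} \cap \vec{c}}}{\lVert \vec{d} \rVert_2 \, \lVert \vec{c} \rVert_2}$$
wherein $\vec{d}$ denotes the word frequency number vector of a single text, $\vec{c}$ denotes the word frequency number vector of a short text class, $\mathrm{sim}(\vec{d}, \vec{c})$ denotes the cosine similarity of a single text to a class, $\vec{d} \cap \vec{c}$ denotes the intersection space of $\vec{d}$ and $\vec{c}$, $\langle \vec{d}, \vec{c} \rangle_{\vec{d} \cap \vec{c}}$ denotes the inner product of $\vec{d}$ and $\vec{c}$ taken over that space, and $\lVert \cdot \rVert_2$ denotes the 2-norm of $\vec{d}$ and of $\vec{c}$.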
Further, the step S24 includes: after a certain amount of text has been clustered, detecting the degree of similarity between each existing class and the other classes; for a class $c_1$, looking up all of its feature words in the word-to-class index to find the set $C$ of classes that share the most words with $c_1$; calculating, according to step S22, the cosine similarity between the word frequency number vector of $c_1$ and that of each class in $C$; when the similarity is higher than a fourth threshold, calculating the information entropy of each of the two classes and then the information entropy of the class obtained by merging them; and merging the two classes when the information entropy of the merged class exceeds the maximum information entropy before merging by no more than a fifth threshold h:
$$H(c_1 \cap c_2) - \max(H(c_1), H(c_2)) \le h, \quad c_2 \in C$$
where $H(\cdot)$ computes the information entropy, $c_1 \cap c_2$ denotes the class obtained by merging $c_1$ and $c_2$, and $c_2$ is a class in the set $C$ other than $c_1$.
Further, the step S3 includes:
if the number of short texts in the class reaches the sixth threshold α or above and the current average arrival rate of the class texts reaches the seventh threshold θ or above, the class is inferred to be a current hot topic;
if the number of short texts in the class reaches the sixth threshold α or above but the current average arrival rate of the class texts is below the seventh threshold θ, it is judged whether the average arrival rate of the class texts reaches the eighth threshold β or above; if so, the class is inferred to be a past hot topic; otherwise, the topic has not reached a notable popularity and is regarded as a common topic;
if the number of short texts in the class is less than or equal to the sixth threshold α but the current average arrival rate of the class texts reaches the seventh threshold θ or above, the topic is estimated to have the potential to become a hot topic and is judged to be a sudden topic.
Still further, the method includes: tracking and monitoring the sudden topics, and if the number of short texts in the class reaches the sixth threshold α or above within a certain time, inferring that the topic has become a current hot topic.
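The decision rules of step S3 can be sketched as a small function. This is a minimal illustration, not the patented implementation: the function name, argument names and returned labels are hypothetical, a count exactly equal to α is treated as reaching the threshold, and the case the text does not cover (few texts and a low current arrival rate) is assumed here to be a common topic.

```python
def classify_topic(count, current_rate, average_rate, alpha, theta, beta):
    """Classify a class of short texts by size and arrival rates (step S3)."""
    if count >= alpha:
        if current_rate >= theta:
            return "current hot topic"
        # Large class that is no longer receiving texts quickly.
        if average_rate >= beta:
            return "past hot topic"
        return "common topic"
    # Small class that is receiving texts quickly: potential hot topic.
    if current_rate >= theta:
        return "sudden topic"
    # Assumption: this case is not described in the source text.
    return "common topic"


print(classify_topic(20, 5.0, 0.5, alpha=100, theta=3.0, beta=1.0))
```

A sudden topic would then be re-checked as its count grows; once `count` reaches `alpha` while the current rate stays at or above `theta`, the same function returns the current hot topic label.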
The beneficial effects of the invention are as follows: through text preprocessing, online text clustering, detection and merging of similar classes, and identification of hot topics, the invention alleviates to a certain extent the insufficient clustering caused by the high-dimensional sparsity of the word vector space in existing online short text clustering methods, and achieves effective clustering of large-scale online short texts. The method provided by the invention has the following advantages:
1. The invention provides an incremental short text clustering method, which addresses the problem that traditional short text clustering methods cannot work incrementally or that traditional incremental methods are unsuitable for large-scale short text clustering.
2. Describing social media short texts with an expandable word vector space, and calculating the similarity between a short text and a class from these word vectors, addresses the complexity of storing and computing high-dimensional sparse word vectors and the inability of traditional incremental text clustering methods to grow the feature word vector space.
3. Using a word index to reduce the amount of cosine similarity calculation speeds up that calculation, and therefore the clustering method, while preserving correctness. The improved "similar merged winner-take-all" clustering mode and the "no entropy increase" criterion for merging similar classes alleviate the insufficient merging of same-topic classes caused by the high-dimensional sparsity of short text features.
4. The hot topic detection and identification method adopted by the invention classifies valuable and valueless topics simply and effectively, and mines and tracks the valuable topics.
Drawings
FIG. 1 is a flow chart of a scheme provided by the present invention;
FIG. 2 is a flow chart of social media short text preprocessing provided by the present invention;
FIG. 3 is an index diagram of words and text classes provided by the present invention.
Detailed Description
To help those skilled in the art understand the technical content of the present invention, the following thresholds are defined:
first threshold st: cosine similarity threshold between a text and a class;
second threshold sw: threshold on the number of shared words between a text and a class;
third threshold sd: similarity reduction threshold for class merging;
fourth threshold sc: cosine similarity threshold between two classes;
fifth threshold h: information entropy increase threshold for a class;
sixth threshold α: text quantity threshold for a class;
seventh threshold θ: current average arrival rate threshold of the class texts;
eighth threshold β: average arrival rate threshold of the class texts.
The present invention will be further explained with reference to the accompanying drawings.
FIG. 1 shows the flow chart of the scheme of the present application. The technical scheme is as follows: a social media online short text clustering and topic detection method comprises the following steps:
S1, preprocessing the social media short text to obtain a pure word sequence with marks. Text preprocessing already has many mature technical solutions and a complete set of process flows; individual flows differ slightly according to subsequent requirements, but they are broadly similar. The preprocessing method adopted by the invention is shown in FIG. 2, and the specific process comprises: text standardization, text word segmentation, named entity labeling, part-of-speech tagging, lemmatization and stop-word removal.
The method mainly targets short English social media texts, whose words and text formats are non-standard. Standardization converts the short text into the format required by subsequent preprocessing, including filtering out characters other than English letters and some Latin letters, and filtering out all symbols except those that mark sentence breaks. In the invention, text word segmentation directly splits the standardized text on spaces to obtain an ordered sequence of words and some symbols. Named entity labeling mainly extracts person names, place names and organization names using an existing named entity labeling method. Part-of-speech tagging uses an existing method to tag words simply as nouns, verbs, adjectives, adverbs and so on. Lemmatization uses an existing method to convert words into their base forms, which reduces the dimensionality of the word vector space. Finally, the text is converted entirely to lower case, and stop words and all symbols are removed, giving a pure word sequence with marks.
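The standardization, segmentation and stop-word steps can be sketched with the standard library alone. This is a simplified illustration under stated assumptions: the stop-word list and the punctuation kept as sentence breaks are hypothetical, and named entity labeling, part-of-speech tagging and lemmatization (which the text delegates to existing methods) are omitted, so the output carries no marks.

```python
import re

# Tiny illustrative stop-word list (an assumption; a real list is much longer).
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "in", "on", "and", "at"}


def preprocess(text):
    """Standardize, split on spaces, lower-case, and drop stop words and symbols."""
    # Standardization: keep English letters, accented Latin letters, whitespace
    # and sentence-breaking punctuation; replace everything else with a space.
    text = re.sub(r"[^A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF\s.,!?]", " ", text)
    # Word segmentation: whitespace is the only delimiter.
    tokens = text.split()
    words = []
    for token in tokens:
        # Final step: strip the remaining symbols, lower-case, drop stop words.
        word = re.sub(r"[.,!?]", "", token).lower()
        if word and word not in STOP_WORDS:
            words.append(word)
    return words


print(preprocess("The storm hits Paris, again!"))
```

In a fuller pipeline the tagging steps would run between segmentation and the final lower-casing, because named entity and part-of-speech taggers usually rely on capitalization and punctuation.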
S2, carrying out online clustering on the social media short texts preprocessed in step S1. Online clustering of short texts is the core part of the method; the clustering method is an enhancement and improvement of the Leader-Follower method. It specifically comprises the following steps:
S21, constructing an expandable word vector of the short text, which comprises: constructing the word frequency number vector of a single short text and constructing the word frequency number vector of a short text class;
Constructing the word frequency number vector of a single short text: the words obtained by preprocessing are converted into a weighted word frequency number vector according to their parts of speech or named entity labels. Specifically, named entity words receive a higher weight, verbs and nouns receive a medium weight, and other words such as adverbs and adjectives receive a lower weight. The values of identical words are accumulated to obtain the weighted vector. The vector used in this application holds word frequency numbers (weighted counts), not normalized word frequencies. It has the form { word1:2.5, word2:2, word3:2, word4:1, }. Stored as key-value pairs, the vector has low dimensionality, avoids the difficulty of storing and converting high-dimensional data, and reduces overhead in word vector computation. Word frequency numbers are used because they are not normalized, which reduces the computation required when the word frequency number vector of a class changes. In the traditional method, by contrast, all words in the sample set are counted and used as feature dimensions, and a word frequency vector matrix is computed over all texts, yielding high-dimensional sparse vectors and matrices.
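A sparse weighted count vector of this kind maps naturally onto a dictionary. The sketch below is illustrative only: the tag names and the numeric weights are hypothetical, since the text states only the ordering of the weights (named entities above nouns and verbs, which are above other words).

```python
from collections import defaultdict

# Hypothetical weights: the source gives only their relative order.
TAG_WEIGHTS = {"ENTITY": 2.5, "NOUN": 2.0, "VERB": 2.0}
DEFAULT_WEIGHT = 1.0  # adverbs, adjectives and all other words


def text_vector(tagged_words):
    """Build a sparse weighted count vector from (word, tag) pairs."""
    vector = defaultdict(float)
    for word, tag in tagged_words:
        # Identical words accumulate their weights.
        vector[word] += TAG_WEIGHTS.get(tag, DEFAULT_WEIGHT)
    return dict(vector)


tagged = [("paris", "ENTITY"), ("storm", "NOUN"), ("hit", "VERB"),
          ("badly", "ADV"), ("storm", "NOUN")]
print(text_vector(tagged))
```

Only words that actually occur are stored, so the vector's size follows the length of the text rather than the size of the vocabulary.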
Constructing the word frequency number vector of a short text class: the word vectors of short text classes have the form shown in Table 1, where Cluster denotes a short text class and word denotes a feature word of its short texts. When a new short text is aggregated into a class, the word frequency number vector of the new text is added to the word frequency number vector of the class. The word frequency number vector of a class resembles that of a single text, but the number of feature words per class is capped at a reasonable limit NUM. When the number of feature words of a class exceeds NUM, the feature words are sorted by word frequency, the top X%·NUM words by feature weight are retained, and the remaining (1-X%)·NUM low-frequency feature words are deleted, leaving room for new high-frequency feature words that may appear. This prevents the class word vector from drifting as feature words grow without bound, prevents later-appearing feature words from being squeezed out of the word vector, and reduces erroneous aggregation. The values of NUM and X must be determined through experimental experience.
Table 1. Word frequency number vector space of classes
Cluster1 | word1:5.2, word2:2.2, word3:1.0, ... |
Cluster2 | word2:10.5, word5:5.8, word6:4.6, ... |
Cluster3 | word1:7.4, word3:6.6, word4:5.0, ... |
Cluster4 | word2:9.0, word7:3.2, word8:3.2, ... |
Cluster5 | word3:3.2, word5:2.7, word9:0.5, ... |
... | ... |
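Updating a class vector and capping its size can be sketched as follows. This is a minimal illustration under assumptions: the function names are hypothetical, `keep_ratio` stands for X% expressed as a fraction, and the pruning keeps the top `int(num * keep_ratio)` words and drops everything else once the cap is exceeded.

```python
def add_to_class(class_vector, text_vector):
    """Add a text's weighted counts to the class vector in place."""
    for word, weight in text_vector.items():
        class_vector[word] = class_vector.get(word, 0.0) + weight
    return class_vector


def prune(class_vector, num, keep_ratio):
    """If the class has more than `num` feature words, keep only the heaviest ones."""
    if len(class_vector) <= num:
        return class_vector
    keep = int(num * keep_ratio)
    ranked = sorted(class_vector.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:keep])


cluster = {"a": 5.0, "b": 3.0, "c": 1.0}
add_to_class(cluster, {"b": 2.5, "d": 0.5, "e": 0.2})
cluster = prune(cluster, num=4, keep_ratio=0.5)
print(cluster)
```

Because the counts are not normalized, adding a text touches only the words that the text contains; no other entry of the class vector needs to be rescaled.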
S22, calculating the similarity of the short text and the class
Each short text class stores a word frequency number vector of the form shown in Table 1, so the cosine similarity between a new short text and an existing class can be calculated from the word vector of the new short text and the word vector of the class. When calculating the cosine similarity, the intersection space of the two word frequency number vectors is taken first, and the inner product is computed only over that intersection, which speeds up the calculation. This is the numerator of formula 2-1:
$$\mathrm{sim}(\vec{d}, \vec{c}) = \frac{\langle \vec{d}, \vec{c} \rangle_{\vec{d} \cap \vec{c}}}{\lVert \vec{d} \rVert_2 \, \lVert \vec{c} \rVert_2} \tag{2-1}$$
where $\vec{d}$ is the word vector of a single text and $\vec{c}$ is the word vector of a short text class; the two word vectors first take their intersection space, and the inner product of the two vectors is then computed over that space. The denominator is the product of the 2-norms of the two word vectors. Compared with the traditional numerical matrix of word frequency vectors, the feature dimension of the input word vector is flexible and variable, which allows the word vector dimension to grow dynamically.
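With dictionary-based vectors, the restricted inner product amounts to iterating over the shared keys. The following is a minimal sketch, assuming vectors stored as word-to-weight dictionaries; the function name is hypothetical, and iterating over the smaller dictionary is an implementation choice rather than something the text prescribes.

```python
import math


def cosine_similarity(text_vec, class_vec):
    """Cosine similarity of two sparse vectors stored as word -> weight dicts."""
    if not text_vec or not class_vec:
        return 0.0
    # Iterate over the smaller vector; only words present in both contribute.
    if len(text_vec) <= len(class_vec):
        small, large = text_vec, class_vec
    else:
        small, large = class_vec, text_vec
    dot = sum(w * large[word] for word, w in small.items() if word in large)
    norm_text = math.sqrt(sum(w * w for w in text_vec.values()))
    norm_class = math.sqrt(sum(w * w for w in class_vec.values()))
    if norm_text == 0.0 or norm_class == 0.0:
        return 0.0
    return dot / (norm_text * norm_class)


print(cosine_similarity({"storm": 1.0}, {"storm": 1.0, "rain": 1.0}))
```

Words outside the intersection contribute zero to the inner product, so skipping them changes nothing in the result; they still count in the norms of the denominator.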
S23 short text clustering method
First, the classes that share words with the short text to be clustered are looked up in the index from words to text classes. The form of the index is shown in FIG. 3. The index is many-to-many: one word may be contained in several classes (in FIG. 3, one word points to several classes), and one class contains several words (in FIG. 3, several words point to the same class). The index is initially empty. If the new text is clustered into an existing class, the index is updated for the words that occur in the new text but not yet in the class, so that these words point to the class, and the word vector of the class is updated. If the text creates a new class, index entries from all words of the new text to the new class are created directly, and the word vector of the class is created.
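The many-to-many index can be sketched as a mapping from each word to the set of class identifiers that contain it. The class name and method names below are hypothetical, and integer class identifiers are an assumption.

```python
from collections import defaultdict


class WordIndex:
    """Inverted index from words to the ids of the classes that contain them."""

    def __init__(self):
        self.word_to_classes = defaultdict(set)  # initially empty

    def add(self, class_id, words):
        """Point every word at the class; words already indexed are unchanged."""
        for word in words:
            self.word_to_classes[word].add(class_id)

    def classes_for(self, word):
        """Return the ids of all classes that contain the word."""
        return self.word_to_classes.get(word, set())


index = WordIndex()
index.add(1, ["storm", "paris"])   # a text creates class 1
index.add(2, ["storm", "rain"])    # another text creates class 2
print(index.classes_for("storm"))
```

The same `add` call serves both cases described above: when a text joins an existing class, adding its words inserts only the missing entries, because set insertion ignores duplicates.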
The method approximates and accelerates Leader-Follower clustering by using the index to find the short text classes that share words with the new short text. For each word of the new text, every class that the index lists for that word has its count increased by one; the classes with the highest count are those that share the most words with the new short text. If this highest number of shared words is lower than the second threshold (determined by the length of the new short text and a manually set ratio, where the ratio is chosen according to experimental experience and the cosine similarity of the next step), no cosine similarity is calculated and the new short text forms a new class. If the highest number of shared words is higher than the second threshold, the cosine similarity is calculated for all classes that share the most words. This approximation reduces the number of cosine similarity calculations without deviating much from the exact cosine similarity computation, speeds up short text processing, and suits large-scale data.
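The screening step can be sketched with a counter over the index entries. This is an illustration under assumptions: the function name is hypothetical, the index is a plain dictionary from words to sets of class ids, the second threshold is computed as `ratio` times the number of distinct words in the text, and a count exactly equal to the threshold is allowed to pass.

```python
from collections import Counter


def screen_candidates(text_words, word_to_classes, ratio):
    """Return the classes sharing the most words with the text, or [] if too few."""
    distinct = set(text_words)
    hits = Counter()
    for word in distinct:
        # Every class that contains this word gets one more shared word.
        for class_id in word_to_classes.get(word, ()):
            hits[class_id] += 1
    if not hits:
        return []
    best = max(hits.values())
    second_threshold = ratio * len(distinct)  # assumption: scales with text length
    if best < second_threshold:
        return []  # caller creates a new class without computing any cosine
    return sorted(cid for cid, count in hits.items() if count == best)


index = {"storm": {1, 2}, "paris": {1}, "rain": {2, 3}, "hits": {1}}
print(screen_candidates(["storm", "hits", "paris", "today"], index, ratio=0.5))
```

Only the classes returned here are passed on to the cosine similarity calculation, which is where the saving in computation comes from.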
The traditional Leader-Follower algorithm is "winner-take-all": each new text is assigned only to its most similar class. As a result, the outcome depends on the order of data input, and classes are insufficiently aggregated. To reduce this negative effect, the method adopts a "similar merged winner-take-all" mode. The cosine similarity between the new text and every class screened in the previous step is calculated according to step S22, and the classes whose cosine similarity with the new text is higher than the first threshold (set according to experiments and experience) are selected. The class with the highest cosine similarity with the new text is called the most similar class; all other classes whose cosine similarity is higher than the first threshold are called candidate similar classes. For each candidate similar class, its word vector is added to the word vector of the most similar class, and the cosine similarity between the summed vector and the new text word vector is calculated. If, compared with the cosine similarity between the most similar class before merging and the new text, this similarity increases or decreases by less than the third threshold (chosen according to experience), the candidate similar class is merged into the most similar class. That is, two classes $c_c$ and $c_m$ are merged when formula 2-2 is satisfied:
$$\cos(\vec{c}_m, \vec{d}) - \cos(\vec{c}_c + \vec{c}_m, \vec{d}) < sd \tag{2-2}$$
where $\vec{c}_c$ is the candidate similar class word vector, $\vec{c}_m$ is the most similar class word vector, $\vec{d}$ is the new text word vector and sd is the third threshold. The method uses the near-invariance of the cosine similarity between the merged class and the new text as a measure of whether the class features remain unchanged, so that classes are merged correctly. Finally, the new text is merged into the most similar class.
And if the similarity of the new text and all classes does not reach the threshold value, the existing classes are considered to have no class similar to the new text, and the new text creates a new class.
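The 'merging similar winners' step can be sketched as below. This is a hedged illustration only: the names `cosine` and `assign`, the sparse `Counter` representation, the default threshold values, and the choice to test every candidate against the most similar class as it stood before any merge are assumptions, since the text does not fix these details.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity of two sparse word-frequency vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def assign(text_vec, classes, candidates, first_threshold=0.3, sd=0.05):
    """Cluster one new text; classes (id -> Counter) is updated in place.

    Returns the id of the class that received the text.
    """
    sims = {cid: cosine(text_vec, classes[cid]) for cid in candidates}
    similar = {cid: s for cid, s in sims.items() if s > first_threshold}
    if not similar:
        # No class is similar enough: the text founds a new class.
        new_id = max(classes, default=-1) + 1
        classes[new_id] = Counter(text_vec)
        return new_id
    most = max(similar, key=similar.get)
    base = similar[most]          # similarity before any merge
    original = classes[most]
    merged = Counter(original)
    for cid in similar:
        if cid == most:
            continue
        trial = original + classes[cid]
        # Merge if the similarity rises, or falls by less than sd.
        if base - cosine(text_vec, trial) < sd:
            merged += classes[cid]
            del classes[cid]
    merged += text_vec            # finally absorb the new text itself
    classes[most] = merged
    return most
```

In this sketch a candidate class that would pull the merged vector away from the new text by `sd` or more is left untouched, which is one way to read the requirement that merging should not change the characteristics of the class.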
S24, detecting and merging similar short text classes: classes that have not been fully merged are detected and identified according to the similarity of the two classes.
In the improved 'merging similar winners' clustering method adopted in step S23 of the present application, there are still cases in which similar classes cannot be merged. Short texts have few word features, and during initial clustering the short texts in a topic class may not contain all the word features of the topic, so that other short texts on the same topic but containing many different word features form a new class. This problem can be solved once the classes have grown to a certain scale, that is, once the classes of similar topics have gathered a certain number of texts and contain a large number of topic word features: the insufficiently merged classes are then detected and identified according to the similarity of the two classes. Based on this principle, the invention provides a method for detecting and merging similar classes.
The method of the application provides that, after a certain amount of text has been clustered, the degree of similarity between each existing class and the other classes is detected, using a scheme similar to the clustering and merging of a new text. All feature words of a class c1 are looked up in the index from words to classes to find the set C of classes that contain the most words in common with c1. The similarity between class c1 and each class in C is calculated according to step S22. When the similarity is higher than a fourth threshold (set according to experiments and experience), the information entropies of the two classes are calculated respectively, and then the information entropy of the merged class is calculated. H() denotes the calculation of information entropy; the word vector of the class is again used as the basis for the class information entropy, and the occurrence probability of each word is calculated from its weighted word frequency in the class. When formula 2-3 is satisfied, that is, when the information entropy of the merged class exceeds the maximum information entropy before merging by no more than a fifth threshold h, the two classes are merged.
H(c1∩c2)-max(H(c1),H(c2))≤h,c2∈C (2-3)
As a class grows and the number of texts containing the same topic words increases, the word frequency becomes concentrated on a few topic words, so the information entropy of the class, that is, its uncertainty, does not increase significantly. However, newly added feature words may cause the entropy to rise slightly and fluctuate, so formula 2-3 allows an increase of up to the fifth threshold h; a suitable value of h can be obtained from practical experiments. As the number of clustered texts increases, correct clustering should reduce the uncertainty of the classes and tend toward stability, so the present application merges similar classes on the principle that the entropy does not increase.
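The entropy criterion of formula 2-3 can be illustrated with a short sketch. The function names `entropy` and `should_merge`, the base-2 logarithm and the default value of `h` are assumptions made for the example; the text only states that word probabilities come from the weighted word frequencies of the class.

```python
import math
from collections import Counter

def entropy(vec):
    """Information entropy H of a class, from its word-frequency vector."""
    total = sum(vec.values())
    return -sum((w / total) * math.log2(w / total)
                for w in vec.values() if w > 0)

def should_merge(c1, c2, h=0.2):
    """Formula 2-3: merge when H(merged) - max(H(c1), H(c2)) <= h."""
    merged = c1 + c2  # adding the vectors gives the merged class
    return entropy(merged) - max(entropy(c1), entropy(c2)) <= h
```

For two classes on the same topic, the merged frequencies stay concentrated on the shared topic words and the entropy barely moves; for unrelated classes, the merged vocabulary is spread over both word sets and the entropy rises by more than `h`.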
S3, detecting the trending topics: the popularity of each topic is estimated according to the total number of the class texts, the average arrival rate of the class texts and the current average arrival rate of the class texts, and the trending topics are identified.
Short texts of social media share a common feature: each includes the exact time at which it was posted. The method identifies the hot topic classes among the short text classes of social media by analyzing the time characteristics of each class. The method of the present application defines two criteria: the average text arrival rate of a class and the current average text arrival rate of a class. The average arrival rate of a class is the average number of texts per unit time in the period from the earliest text time to the latest text time in the class, and the current average arrival rate of a class is the average number of texts per unit time in the period from the earliest text time in the class to the current text time;
wherein Rate_average(c) is the average arrival rate of class c, |c| is the number of texts in class c, T_c is the set of all text timestamps of class c, Rate_average_now(c) is the current average text arrival rate of class c, and T is the set of timestamps of the texts of all classes.
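The two rates can be sketched from the verbal definitions above. This is an assumed reading rather than the formulas of the application: the function names, the `min_span` guard against a zero-length period, and the use of the latest timestamp over all classes as the current time are hypothetical.

```python
def average_rate(class_times, min_span=1.0):
    """|c| divided by the span from the earliest to the latest text in c."""
    span = max(class_times) - min(class_times)
    # min_span (an assumption) avoids division by zero for one-text classes.
    return len(class_times) / max(span, min_span)

def current_average_rate(class_times, now, min_span=1.0):
    """|c| divided by the span from the earliest text in c to the current time.

    now is assumed to be the latest timestamp among the texts of all classes.
    """
    span = now - min(class_times)
    return len(class_times) / max(span, min_span)
```

A class whose texts all arrived long ago keeps its average rate but sees its current average rate fall as `now` advances, which is what allows a past hot topic to be separated from a current one.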
According to the method, the popularity of the topic is estimated according to the total number of the class texts, the average arrival rate of the class texts and the current average arrival rate of the class texts, and the trending topic is identified.
The specific method comprises the following steps:
If the number of short texts in the class reaches the sixth threshold α or above and the current average arrival rate of the class texts reaches the seventh threshold θ or above, the class is inferred to be a current hot topic, on the grounds that the topic has a considerable number of texts and is currently popular.
If the number of short texts in the class reaches the sixth threshold α or above but the current average arrival rate of the class texts is below the seventh threshold θ, it is judged whether the average arrival rate of the class texts is at or above the eighth threshold β. If so, the class is inferred to be a past hot topic whose popularity has already subsided, on the grounds that the texts participating in the topic have reached a certain scale and the topic had a certain popularity in the past but its current popularity is low; otherwise, the topic does not have a certain popularity and is regarded as a common topic.
If the number of short texts in the class is less than or equal to the sixth threshold α but the current average arrival rate of the class texts is greater than or equal to the seventh threshold θ, the topic is inferred to have the potential to become a hot topic and is judged to be a sudden topic: it has high current popularity and has the characteristics of a hot topic that has just emerged or of a small-scale sudden topic. Such a topic is tracked and monitored, and if its number of short texts reaches the sixth threshold α or above within a certain time, the topic is inferred to have become a current hot topic. All other topic classes are inferred to be common, non-trending topics.
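The decision rules above can be summarised in a small sketch. The function name `classify_topic` and the label strings are hypothetical, and because the text leaves the case of a count exactly equal to α ambiguous, the sketch assumes that "reaches α or above" takes precedence.

```python
def classify_topic(n_texts, rate_now, rate_avg, alpha, theta, beta):
    """Label a class from its size and its two arrival rates."""
    if n_texts >= alpha:
        if rate_now >= theta:
            return "current hot topic"
        if rate_avg >= beta:
            return "past hot topic"
        return "common topic"
    if rate_now >= theta:
        # Small but fast-growing: track it until it reaches alpha texts.
        return "sudden topic"
    return "common topic"
```

Re-running the function on a tracked sudden topic after more texts arrive yields "current hot topic" once its size reaches `alpha`, which corresponds to the monitoring step described above.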
The sixth threshold α, the eighth threshold β and the seventh threshold θ used in the method of the present application are selected according to experimental results and experience; if they were set as constants, the method would not have good generality. The invention therefore adopts variable thresholds: each of the three thresholds is set as a certain proportion of the total number of short texts that arrived in the period from one hour ago to the present, and a lower limit is set for each threshold. The proportions and the lower limits are selected according to experiments and experience.
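A variable threshold of this kind might be computed as in the following sketch. The function name `variable_thresholds`, the dictionary layout and the numeric values in the usage note are illustrative assumptions; the application states only that each threshold is a proportion of the arrivals in the last hour with a lower limit.

```python
def variable_thresholds(arrivals_last_hour, ratios, floors):
    """Scale alpha, theta and beta with recent traffic, never below a floor.

    ratios and floors map the names "alpha", "theta", "beta" to the
    experimentally chosen proportion and lower limit of each threshold.
    """
    return {
        name: max(ratios[name] * arrivals_last_hour, floors[name])
        for name in ("alpha", "theta", "beta")
    }
```

For example, with 1000 arrivals in the last hour, a proportion of 0.01 and a lower limit of 5 for α, the threshold would be 10 texts; with only 100 arrivals the lower limit of 5 would apply instead.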
It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to assist the reader in understanding the principles of the invention and are to be construed as being without limitation to such specifically recited embodiments and examples. Various modifications and alterations to this invention will become apparent to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the claims of the present invention.
Claims (7)
1. A social media online short text clustering and topic detection method is characterized by comprising the following steps:
S1, preprocessing the social media short text to obtain a pure word sequence with marks;
S2, carrying out online clustering on the social media short texts preprocessed in the step S1, wherein the online clustering comprises the following steps:
S21, constructing an expandable word vector of the short text;
S22, calculating the cosine similarity of the short text and the class;
S23, calculating the cosine similarity of the new text and the class according to the step S22; selecting the classes whose cosine similarity with the new text is higher than a first threshold value; calling the class with the highest cosine similarity with the new text the most similar class, and calling all other classes whose cosine similarity with the new text is higher than the first threshold value candidate similar classes; for each candidate similar class, adding its word frequency number vector to the word frequency number vector of the most similar class and calculating the cosine similarity of the merged vector and the new text word vector; and, if this cosine similarity, compared with the cosine similarity of the new text and the most similar class before merging, decreases by less than a third threshold value or increases, merging the candidate similar class into the most similar class;
S24, detecting and merging similar short text classes, and detecting and identifying the classes which are not fully merged according to the similarity of the two classes;
and S3, detecting the trending topics, estimating the popularity of the topics according to the total number of the class texts, the average arrival rate of the class texts and the current average arrival rate of the class texts, and identifying the trending topics.
2. The method for social media online short text clustering and topic detection according to claim 1, wherein the step S1 comprises: text standardization, text word segmentation, named entity labeling, part-of-speech labeling, word form reduction and stop word removal;
the text standardization converts the short text into a standard format, including filtering out characters other than English letters and some Latin letters and filtering out all symbols except relevant punctuation; the text word segmentation splits the standardized text on blank spaces to obtain an ordered sequence of words and some symbols; the named entity labeling mainly extracts person names, place names and organization names using an existing named entity labeling method; the part-of-speech labeling uses an existing method to label words as nouns, verbs, adjectives, adverbs and the like; the word form reduction uses an existing method to convert words into their original forms, which reduces the dimensionality of the word vector space; and the stop word removal converts the text into lower case and removes the stop words and all symbols, so as to obtain a pure word sequence with marks.
3. The method for social media online short text clustering and topic detection according to claim 1, wherein the step S21 comprises: constructing a single short text word frequency number vector and constructing a word frequency number vector of a short text class;
constructing a single short text word frequency number vector: converting the words obtained by preprocessing into weighted word frequency number vectors according to corresponding parts of speech or named entities;
constructing a word frequency number vector of the short text class: when a new short text is aggregated into a class, the word frequency number vector of the new text and the word frequency number vector of the class are added.
4. The method for social media online short text clustering and topic detection according to claim 1, wherein the step S22 is calculated as:
wherein the symbols of the formula denote, in order: the word frequency number vector of a single text; the word frequency number vector of a short text class; the cosine similarity of the single text to the class; the intersection space of the two vectors; the inner product of the two vectors; and the 2-norms of the two vectors.
5. The method for social media online short text clustering and topic detection according to claim 1, wherein the step S24 comprises: after a certain amount of text has been clustered, detecting the degree of similarity between each existing class and the other classes; looking up all feature words of a class c1 in the index from words to classes to find the set C of classes containing the most words in common with c1; calculating the similarity between class c1 and each class in C according to step S22; when the similarity is higher than a fourth threshold value, calculating the information entropies of the two classes respectively, and then calculating the information entropy of the merged class; and merging the two classes when the information entropy of the merged class exceeds the maximum information entropy before merging by no more than a fifth threshold h;
H(c1∩c2)-max(H(c1),H(c2))≤h,c2∈C
where H() denotes the calculation of information entropy, and c2 denotes a class in the set C other than c1.
6. The method for social media online short text clustering and topic detection according to claim 1, wherein the step S3 comprises:
if the number of the short texts of the category reaches above a sixth threshold value alpha and the current average arrival rate of the text of the category reaches above a seventh threshold value theta, the category is inferred to be a current hot topic;
if the number of the short texts in the class reaches the sixth threshold value alpha or more, but the current average arrival rate of the class texts is below a seventh threshold value theta, judging whether the average arrival rate of the class texts is above an eighth threshold value beta or not; if yes, the class is inferred to be a hot topic in the past; otherwise, the topic does not have a certain popularity and is regarded as a common topic;
if the number of short texts in the category is less than or equal to the sixth threshold value alpha, but the average current arrival rate of the category texts is greater than or equal to the seventh threshold value theta, the topic is estimated to have the potential of a hot topic, and the topic is judged to be a sudden topic.
7. The method of claim 6, further comprising: tracking and monitoring the sudden topics, and if the number of the short texts of the category reaches above a sixth threshold value alpha within a certain time, deducing that the topics are changed into the current hot topics.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610818311.8A CN106383877B (en) | 2016-09-12 | 2016-09-12 | Social media online short text clustering and topic detection method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106383877A (en) | 2017-02-08 |
CN106383877B (en) | 2020-10-27 |
Family
ID=57936610
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610818311.8A Active CN106383877B (en) | 2016-09-12 | 2016-09-12 | Social media online short text clustering and topic detection method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106383877B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103177024A (en) * | 2011-12-23 | 2013-06-26 | 微梦创科网络科技(中国)有限公司 | Method and device of topic information show |
US20130346082A1 (en) * | 2012-06-20 | 2013-12-26 | Microsoft Corporation | Low-dimensional structure from high-dimensional data |
CN104091054A (en) * | 2014-06-26 | 2014-10-08 | 中国科学院自动化研究所 | Mass disturbance warning method and system applied to short texts |
Non-Patent Citations (2)
Title |
---|
MENG WANG: "An incremental clustering method of micro-blog topic detection", 《IEEE》 * |
陶舒怡 (Tao Shuyi): "An incremental text clustering algorithm based on cluster consistency", 《计算机工程》 (Computer Engineering) * |
Cited By (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106934005A (en) * | 2017-03-07 | 2017-07-07 | 重庆邮电大学 | A kind of Text Clustering Method based on density |
CN110008334A (en) * | 2017-08-04 | 2019-07-12 | 腾讯科技(北京)有限公司 | A kind of information processing method, device and storage medium |
CN107609102A (en) * | 2017-09-12 | 2018-01-19 | 电子科技大学 | A kind of short text on-line talking method |
CN107609103A (en) * | 2017-09-12 | 2018-01-19 | 电子科技大学 | It is a kind of based on push away spy event detecting method |
CN108170773A (en) * | 2017-12-26 | 2018-06-15 | 百度在线网络技术(北京)有限公司 | Media event method for digging, device, computer equipment and storage medium |
CN108417210A (en) * | 2018-01-10 | 2018-08-17 | 苏州思必驰信息科技有限公司 | A kind of word insertion language model training method, words recognition method and system |
CN108417210B (en) * | 2018-01-10 | 2020-06-26 | 苏州思必驰信息科技有限公司 | Word embedding language model training method, word recognition method and system |
CN108334628A (en) * | 2018-02-23 | 2018-07-27 | 北京东润环能科技股份有限公司 | A kind of method, apparatus, equipment and the storage medium of media event cluster |
CN109101633A (en) * | 2018-08-15 | 2018-12-28 | 北京神州泰岳软件股份有限公司 | A kind of hierarchy clustering method and device |
CN109086443A (en) * | 2018-08-17 | 2018-12-25 | 电子科技大学 | Social media short text on-line talking method based on theme |
CN109189912A (en) * | 2018-10-09 | 2019-01-11 | 阿里巴巴集团控股有限公司 | The update method and device of user's consulting statement library |
CN111782801B (en) * | 2019-05-17 | 2024-02-06 | 北京京东尚科信息技术有限公司 | Method and device for grouping keywords |
CN111782801A (en) * | 2019-05-17 | 2020-10-16 | 北京京东尚科信息技术有限公司 | Method and device for grouping keywords |
CN110245355A (en) * | 2019-06-24 | 2019-09-17 | 深圳市腾讯网域计算机网络有限公司 | Text topic detecting method, device, server and storage medium |
CN110245355B (en) * | 2019-06-24 | 2024-02-13 | 深圳市腾讯网域计算机网络有限公司 | Text topic detection method, device, server and storage medium |
CN110348529B (en) * | 2019-07-16 | 2021-10-22 | 上海惟也新文化科技有限公司 | Intelligent clothes fashion style prediction method and system |
CN110348529A (en) * | 2019-07-16 | 2019-10-18 | 韶关市启之信息技术有限公司 | A kind of intelligent clothes Trend of fashion prediction technique and system |
CN110597980A (en) * | 2019-09-12 | 2019-12-20 | 腾讯科技(深圳)有限公司 | Data processing method and device and computer readable storage medium |
CN111339784A (en) * | 2020-03-06 | 2020-06-26 | 支付宝(杭州)信息技术有限公司 | Automatic new topic mining method and system |
CN111488429A (en) * | 2020-03-19 | 2020-08-04 | 杭州叙简科技股份有限公司 | Short text clustering system based on search engine and short text clustering method thereof |
CN114547316A (en) * | 2022-04-27 | 2022-05-27 | 深圳市网联安瑞网络科技有限公司 | System, method, device, medium, and terminal for optimizing aggregation-type hierarchical clustering algorithm |
CN115310429A (en) * | 2022-08-05 | 2022-11-08 | 厦门靠谱云股份有限公司 | Data compression and high-performance calculation method in multi-turn listening dialogue model |
CN117974340A (en) * | 2024-03-29 | 2024-05-03 | 昆明理工大学 | Social media event detection method combining deep learning classification and graph clustering |
Also Published As
Publication number | Publication date |
---|---|
CN106383877B (en) | 2020-10-27 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |