CN108197144B - Hot topic discovery method based on BTM and Single-pass - Google Patents

Hot topic discovery method based on BTM and Single-pass

Info

Publication number
CN108197144B
Authority
CN
China
Prior art keywords
data
topic
clustering
btm
pass
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711210195.2A
Other languages
Chinese (zh)
Other versions
CN108197144A (en)
Inventor
许国艳
夭荣朋
张网娟
平萍
朱帅
李敏佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hohai University HHU
Original Assignee
Hohai University HHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hohai University HHU filed Critical Hohai University HHU
Priority to CN201711210195.2A priority Critical patent/CN108197144B/en
Publication of CN108197144A publication Critical patent/CN108197144A/en
Application granted granted Critical
Publication of CN108197144B publication Critical patent/CN108197144B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes

Abstract

The invention provides a hot topic discovery method based on the BTM (Biterm Topic Model) and the Single-pass algorithm. The method first performs topic modeling with the BTM topic model to obtain the topic distribution of a corpus data set, then vectorizes the result with a VSM (Vector Space Model), clusters the vectors with an improved Single-pass algorithm, and sorts the clusters into a new clustering result; in addition, the whole hot topic discovery method is parallelized to raise topic-mining speed under large data volumes. The method copes well with the sparseness of microblog data and scales to mass data; the improved Single-pass algorithm reduces computational complexity, keeps the algorithm stable, processes new data effectively, and analyzes the continuing influence of hot topics well; and through the MapReduce framework the data-processing efficiency of the data set improves while topic discovery quality is maintained.

Description

Hot topic discovery method based on BTM and Single-pass
Technical Field
The invention relates to a hot topic discovery method based on BTM and Single-pass, belonging to text clustering in the field of data mining.
Background
With the popularization of smartphones and networks, people follow the latest national and social events through microblog apps. Discovering and studying microblog hot topics therefore has great value in fields such as commerce and scientific research, and more and more scholars conduct related research on microblogs.
Traditional hot topic discovery generally uses the LDA topic model, K-Means, and similar algorithms. However, the traditional LDA model mainly targets long texts and handles short texts such as microblogs poorly; moreover, microblog data is sparse and strongly context-dependent, characteristics that the LDA model has difficulty coping with.
Traditional hot topic discovery techniques also hit bottlenecks on large data sets. First, the amount of data to process during hot topic discovery is huge, and a single host with a single processor is slow and laborious. Second, topic mining with a pure BTM model is too slow. Finally, modeling with the BTM model alone ignores the streaming nature of microblog data, so the classification effect leaves room for improvement. Discovering microblog hot topics in a distributed environment is therefore particularly important.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the defects of the prior art, the invention provides a hot topic discovery method based on BTM and Single-pass that suits short, sparse streaming text, adapts to large data volumes, and accelerates topic mining.
The technical scheme is as follows: the invention provides a hot topic discovery method based on BTM and Single-pass, which comprises a Mapper stage and a Reducer stage of a MapReduce framework;
the Mapper stage specifically comprises:
(1) preprocessing the input data set D;
(2) evenly dividing the preprocessed data set D over C nodes, each node holding a certain number of word pairs, and randomly assigning a topic to each word pair;
(3) performing topic modeling by using a BTM topic model to obtain topic distribution of a corpus data set;
(4) vectorizing the result by adopting VSM on each node;
(5) performing cluster analysis on each node by adopting an improved Single-pass algorithm to obtain a local topic;
(6) outputting the local topic;
the Reducer phase specifically comprises:
(1) inputting an initialized clustering result on each node;
(2) selecting the clustering result of the main node as an initial clustering center;
(3) merging the clustering results of the other nodes with the clustering result of the main node using an improved Single-pass algorithm, finally obtaining the hot topics;
(4) and outputting the hot topic.
Preferably, in the step (5), the cluster analysis specifically includes:
(51) dividing the data of node C into n data slices C1, C2, ..., Cn at a certain scale and using the slices sequentially as input data; each slice is clustered internally on its own to obtain its own clustering result;
(52) selecting data slice C1 as the first part and clustering C1 internally to obtain the clustering result of the first part;
(53) for data slices C2, ..., Cn, clustering each slice internally on its own before input to obtain each slice's clustering result;
(54) computing the similarity sim(C2ci, C1ci) between each cluster center obtained from data slice C2 and each cluster center of the already-existing slice C1, where C1ci and C2ci respectively denote different word pairs in slices C1 and C2;
(55) selecting the maximum similarity value max = max(sim(C2ci, C1ci)) and judging whether max exceeds the threshold c; if max = max(sim(C2ci, C1ci)) > c, classifying C2ci into the class with the maximum similarity value; if max = max(sim(C2ci, C1ci)) < c, creating a new topic with C2ci as its cluster center;
(56) sorting the clustering results of slices C1 and C2 to obtain a new clustering result;
(57) repeating steps (54), (55) and (56) until all data in the data set are processed, obtaining the final result.
Preferably, the internal clustering uses a classical Single-pass algorithm.
Preferably, the word pair refers to two random, unordered, different words that co-occur in the same data slice after the data set has been preprocessed.
Preferably, in step (3), the topic modeling uses Gibbs sampling, and each iteration of the sampling obtains the topic matrix from the word-pair tuple set of each node, i.e. for each word pair b = (wi, wj) ∈ BC, the probability of b under topic k in the BTM model is computed.
Advantageous effects: compared with the prior art, the invention has the following notable advantages: 1. the BTM topic model handles data sparsity well through word pairs and processes such data better than the LDA model; 2. K-Means is common in traditional document clustering, but microblog data arrives as a stream that K-Means cannot cluster well, so the Single-pass incremental clustering algorithm is adopted to analyze microblog data; 3. the improved Single-pass algorithm reduces computational complexity, keeps the algorithm stable, processes new data effectively, and analyzes the continuing influence of hot topics well; 4. the invention also performs distributed parallelization on the BTM and Single-pass based hot topic discovery method, and through the MapReduce framework the data-processing efficiency of the data set improves while topic discovery quality is maintained.
Drawings
FIG. 1 is a flow chart of the modified Single-pass algorithm;
FIG. 2 is a parallelization flow diagram of the hot topic discovery algorithm of the present invention;
FIG. 3 is a diagram of a BTM model topology;
FIG. 4 is a basic flowchart of the BTM and Single-pass based hot topic discovery method algorithm of the present invention;
FIG. 5 is a graph of the F values of the BTM at different K values;
FIG. 6 is a graph of beta value analysis in the BTM model;
FIG. 7 is a graph of different threshold performance;
FIG. 8 is a partial topic experiment result;
FIG. 9 is a comparison of three experimental methods;
FIG. 10 is a hotword trend graph as used by the present invention;
FIG. 11 is a graph of stand-alone versus distributed time;
FIG. 12 is a graph of server versus elapsed time.
Detailed Description
The invention provides a hot topic discovery method based on BTM and Single-pass that suits short, sparse streaming text; to adapt to large data volumes and accelerate topic mining, the computation is parallelized. The method mainly comprises: (1) cluster analysis with an improved Single-pass algorithm; (2) MapReduce distributed parallelization of the BTM and Single-pass based hot topic discovery method.
(1) Cluster analysis by improving Single-pass algorithm
As shown in FIG. 1, a data set D is divided at a certain scale into data slices D1, D2, ..., Dn, which are input sequentially in order.
1) The slices D1, D2, ..., Dn serve in turn as input data; each part clusters itself internally, with a method similar to the classic Single-pass algorithm, yielding each part's own clustering result;
2) slice D1 is taken as the first part and clustered with the classic Single-pass algorithm to obtain its clustering result;
3) slices D2, ..., Dn are each clustered internally before input, likewise with the classic Single-pass algorithm, to obtain each part's clustering result;
4) each cluster center obtained from D2 is compared for similarity with each cluster center of the already-existing parts, where D1di and D2di respectively denote different word pairs in D1 and D2; a word pair here is two random, unordered, different words that co-occur in one and the same segment after word segmentation and other preprocessing of the data set;
5) the maximum similarity value max = max(sim(D2di, D1di)) is selected and compared with the threshold c, which is chosen differently for different situations; if the maximum similarity value is at least the threshold, i.e. max = max(sim(D2di, D1di)) > c, the center is classified into the class with the maximum similarity value; if the maximum similarity value is smaller than the threshold, i.e. max = max(sim(D2di, D1di)) < c, a new topic is created with that cluster center;
6) the two clustering results of D1 and D2 are sorted to obtain a new clustering result;
7) steps 4), 5) and 6) are repeated until all data are processed, giving the final result.
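The steps above can be sketched in runnable form. This is an illustrative sketch only, not the patent's implementation: it assumes documents are already topic-distribution vectors, uses cosine similarity as sim(·,·), and takes a cluster's first member as its center; all function names are ours.

```python
import math

def cosine(u, v):
    # cosine similarity between two equal-length vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def single_pass(vectors, threshold):
    """Classic Single-pass: assign each vector to the most similar existing
    cluster, or open a new cluster when no similarity exceeds the threshold."""
    clusters = []  # each cluster is a list of vectors; first member = center
    for vec in vectors:
        best_sim, best = -1.0, None
        for cluster in clusters:
            sim = cosine(vec, cluster[0])
            if sim > best_sim:
                best_sim, best = sim, cluster
        if best is not None and best_sim > threshold:
            best.append(vec)
        else:
            clusters.append([vec])
    return clusters

def improved_single_pass(data_slices, threshold):
    """Improved Single-pass (steps 1-7 above): cluster each slice internally,
    then merge each later slice's cluster centers into the accumulated result."""
    result = single_pass(data_slices[0], threshold)          # slice D1
    for piece in data_slices[1:]:                            # slices D2..Dn
        local = single_pass(piece, threshold)                # internal clustering
        for cluster in local:
            center = cluster[0]
            sims = [cosine(center, c[0]) for c in result]
            if sims and max(sims) > threshold:
                result[sims.index(max(sims))].extend(cluster)  # merge into best class
            else:
                result.append(cluster)                         # new topic
    return result
```

Because later slices are pre-clustered internally, only their cluster centers are compared against the accumulated result, which is where the reduction in computation comes from.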
(2) Method for conducting MapReduce distributed parallelization processing on hot topic discovery method based on BTM and Single-pass
1) As shown in FIG. 2, all preprocessed data D is divided evenly over C nodes, each node handling roughly D/C of the data volume. Each node contains a certain number of word pairs b = (wi, wj) ∈ BC; for the whole data set D, the global word-pair set is then B = {B1, B2, ..., BC}.
2) Local data is sampled with the Gibbs sampling method of the BTM; each iteration obtains the topic matrix from the word-pair tuple set of each node, i.e. for each word pair b = (wi, wj) ∈ BC the probability of b under the current topic k is determined.
3) For each node, the results obtained above are vectorized on the respective host in preparation for subsequent clustering.
4) Clustering is carried out on each node, the clustering result of the main node is selected as an initial clustering center, and the clustering results of the other nodes and the clustering results of the main node are merged and sorted by adopting an improved Single-pass algorithm to obtain a result.
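The parallel flow of steps 1)-4) can be simulated minimally as follows. This is an illustrative sketch under stated assumptions, not the patent's Hadoop code: each node's posts are assumed to already be BTM-plus-VSM topic vectors, similarity is cosine, and the two plain functions only mirror the Mapper/Reducer data flow.

```python
import math

def cos(u, v):
    # cosine similarity
    d = sum(a * b for a, b in zip(u, v))
    n = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return d / n if n else 0.0

def mapper(node_vectors, threshold):
    """Mapper stage: local Single-pass clustering; emit the local cluster centers."""
    centers = []
    for vec in node_vectors:
        sims = [cos(vec, c) for c in centers]
        if not sims or max(sims) <= threshold:
            centers.append(vec)            # open a new local topic
    return centers

def reducer(per_node_centers, threshold):
    """Reducer stage: the master node's clusters seed the result; the other
    nodes' centers are merged in with the same similarity test."""
    result = list(per_node_centers[0])     # master node as initial centers
    for centers in per_node_centers[1:]:
        for c in centers:
            sims = [cos(c, r) for r in result]
            if not sims or max(sims) <= threshold:
                result.append(c)           # genuinely new hot topic
    return result
```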
In BTM, all word pairs in the data set share the same topic distribution, where a topic is understood as a probability distribution over words. BTM builds on the LDA model and the unigram mixture model: it learns the topics of short, sparse text from word pairs generated over the entire data set and from the associations between words expressed by word co-occurrence. The BTM topic model therefore models topics with word pairs that co-occur across the whole data set.
The structure of the BTM topic model is shown in FIG. 3, where θ represents the global topic distribution of the data set, φ represents the probability distribution of words under a single topic, |B| is the number of word pairs in the data set, K is the number of latent topics, and Wi, Wj are the two different words of a word pair b. α and β are the hyperparameters of the Dirichlet prior distributions of θ and φ respectively.
Referring to FIG. 4, the hot topic discovery algorithm based on BTM and Single-pass comprises the following steps:
step 1, preprocessing the acquired data set, and removing noise data which do not contribute to the discovery of hot topics, such as stop words, hyperlinks, special characters and the like;
step 2, segmenting the data with the Chinese word segmentation tool NLPIR provided by the Institute of Computing Technology, Chinese Academy of Sciences;
step 3, obtaining relevant information of the required characteristic items, keeping relevant verbs, nouns, topic labels, time and the like, using the relevant verbs, the nouns, the topic labels, the time and the like as the characteristic items, sorting repeated words and counting word frequency;
step 4, modeling the data with the BTM model and computing p(z|d) and p(w|z) (document-topic and topic-word) for the data set; for each piece of data di in the data set, the formula Inf(di) computes its topic influence, and the formula wk(di) computes the weight values in a loop from k = 0 to K, where K is the number of feature words.
(1) Inf(di) quantifies the influence of one piece of microblog information:

Inf(di) = α·Ncom(di)/MAX{Ncom(dj)} + β·Nrep(di)/MAX{Nrep(dj)} + γ·Nsup(di)/MAX{Nsup(dj)}   (Equation 1)

where Ncom is the number of comments of a piece of microblog information, Nrep its number of reposts, and Nsup its number of likes; MAX{Ncom(dj)}, MAX{Nrep(dj)} and MAX{Nsup(dj)} are respectively the maximum comment, repost and like counts in the document set. α, β, γ are parameters with α + β + γ = 1.
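Under the reading above (a weighted sum of comment, repost and like counts, each normalized by its corpus maximum, with weights summing to 1), the influence Inf(di) can be sketched as follows; the field names and default weights are hypothetical.

```python
def influence(post, max_com, max_rep, max_sup, a=0.4, b=0.4, g=0.2):
    """Inf(di) = a*Ncom/max_com + b*Nrep/max_rep + g*Nsup/max_sup, a+b+g = 1.
    post is a dict with hypothetical keys 'comments', 'reposts', 'likes'."""
    assert abs(a + b + g - 1.0) < 1e-9   # the weights must sum to 1
    return (a * post["comments"] / max_com
            + b * post["reposts"] / max_rep
            + g * post["likes"]   / max_sup)
```

By construction, a post that holds the corpus maximum in all three counts scores exactly 1, and every other post scores strictly between 0 and 1.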
(2) TF-IDF is a commonly used feature-weighting technique, calculated as

TF-IDFij = TFij × IDFi   (Equation 2)

where the term frequency TF is the frequency of the given feature word in the document (the larger the value, the more important the word), and the inverse document frequency IDF reflects how rarely the given feature word appears across the entire data set. TF and IDF are computed as:

TFij = nij / Σk nkj   (Equation 3)

IDFi = log( |D| / ( |{ j : ti ∈ dj }| + 1 ) )   (Equation 4)

where nij is the frequency of the given word in document j, Σk nkj is the total number of words in that document, |D| is the total number of documents, and |{ j : ti ∈ dj }| is the number of documents containing the feature word, incremented by 1 to avoid a zero denominator.
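The TF, IDF and TF-IDF formulas as described can be sketched directly; this assumes documents are lists of tokens and uses the +1 smoothing in the IDF denominator exactly as stated.

```python
import math

def tf(word, doc):
    """TFij = nij / sum_k nkj: frequency of the word within one document."""
    return doc.count(word) / len(doc)

def idf(word, docs):
    """IDFi = log(|D| / (|{j : ti in dj}| + 1)); the +1 avoids a zero denominator."""
    containing = sum(1 for d in docs if word in d)
    return math.log(len(docs) / (containing + 1))

def tf_idf(word, doc, docs):
    """TF-IDFij = TFij * IDFi."""
    return tf(word, doc) * idf(word, docs)
```

Note that with this smoothing, a word occurring in every document gets a slightly negative IDF, which is a known side effect of the +1 form.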
Step 5, vectorizing the obtained result with the VSM vector space model (the weighting formula is given only as an image in the original) to obtain a result matrix M.
And 6, performing incremental clustering on the matrix M by using the improved Single-pass algorithm to obtain a final result.
A hot topic discovery parallelization algorithm based on BTM and Single-pass comprises the following steps:
parallelization targets the large data volume characteristic of microblogs and mainly designs the algorithms of the Mapper stage and the Reducer stage of the MapReduce framework.
(1) Mapper phase
Step 1, inputting a data set D, a theme number K and parameters alpha and beta;
step 2, preprocessing the acquired data set, and removing noise data which do not contribute to the hot topic discovery, such as stop words, hyperlinks, special characters and the like;
step 3, dividing all preprocessed data D evenly over C nodes, each node containing a certain number of word pairs b = (wi, wj) ∈ BC, and randomly assigning a topic k to each word pair b;
step 4, Zi ← Zi + Zj: on host i, the global topic-word matrix is the sum of the local matrix and those of the other hosts; all C nodes are traversed to obtain the global topic-word matrix of each host.
Step 5, each iteration adopts a word pair element group set of each node to obtain a topic matrix, namely, each b is (w)i,wj)∈BCThe probability of it under topic k is:
Figure GDA0002616433950000061
wherein n isk|CRepresents the number of word pair tuples divided into k topic numbers in process C, nwi|kAnd nwj|kRespectively represent words wi,wjThe number of the subject numbers divided into k is N, the N represents the size of a total dictionary in the data set, and alpha and beta represent prior parameters;
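The conditional probability of step 5 can be sketched with the counts defined above. This follows the standard collapsed-Gibbs form for BTM and returns the normalized weights rather than drawing the sample; the container layout (a list of per-topic counts and per-topic word-count dicts) is our assumption.

```python
def biterm_topic_weights(wi, wj, n_k, n_w_k, K, N, alpha, beta):
    """Normalized probability of assigning word pair (wi, wj) to each topic k.
    n_k[k]: word pairs currently assigned to topic k on this node;
    n_w_k[k][w]: count of word w under topic k; N: dictionary size."""
    weights = []
    for k in range(K):
        total_k = sum(n_w_k[k].values())              # sum_w n_w|k
        w = ((n_k[k] + alpha)
             * (n_w_k[k].get(wi, 0) + beta)
             * (n_w_k[k].get(wj, 0) + beta)
             / (total_k + N * beta) ** 2)
        weights.append(w)
    s = sum(weights)
    return [w / s for w in weights]                   # normalize before sampling
```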
step 6, adopting VSM to carry out vectorization on the results of the respective host on each node;
step 7, clustering each node by adopting an improved Single-pass algorithm to obtain local topics;
specifically, (71) dividing the data of node C into n data slices C1, C2, ..., Cn at a certain scale and using the slices sequentially as input data; each slice is clustered internally on its own to obtain its own clustering result;
(72) selecting data slice C1 as the first part and clustering C1 internally to obtain the clustering result of the first part;
(73) for data slices C2, ..., Cn, clustering each slice internally on its own before input to obtain each slice's clustering result;
(74) computing the similarity sim(C2ci, C1ci) between each cluster center obtained from data slice C2 and each cluster center of the already-existing slice C1, where C1ci and C2ci respectively denote different word pairs in slices C1 and C2;
(75) selecting the maximum similarity value max = max(sim(C2ci, C1ci)) and judging whether max exceeds the threshold c; if max = max(sim(C2ci, C1ci)) > c, classifying C2ci into the class with the maximum similarity value; if max = max(sim(C2ci, C1ci)) < c, creating a new topic with C2ci as its cluster center;
(76) sorting the clustering results of slices C1 and C2 to obtain a new clustering result;
(77) looping steps (74), (75) and (76) until all data in the data set are processed, obtaining the final result;
step 8, outputting the local topics.
(2) Reducer phase
Step 1, inputting an initialization clustering result on each node;
step 2, selecting a clustering result of the main node as an initial clustering center;
step 3, merging the clustering results of the other nodes with the clustering result of the main node using an improved Single-pass algorithm, finally obtaining the hot topics;
step 4, outputting the hot topics.
Analysis of experiments
The experimental data of the invention is a real data set collected from Sina Weibo by a crawler program; the crawled content mainly comprises 20,000 posts on topics such as "AlphaGo" and the "Rio Olympic Games".
At present there is no unified evaluation standard for microblog hot topic discovery; the invention adopts the following evaluation indexes for the experiments.
In terms of accuracy, the evaluation indexes published by NIST (National Institute of Standards and Technology) are used, including the precision P, recall R, F value, miss rate Pmiss and false-alarm rate PFA. The formulas are as follows:
P = a / (a + b)

R = a / (a + c)

F = 2PR / (P + R)

Pmiss = c / (a + c)

PFA = b / (b + d)
where a is the number of related microblog texts found by detection; b is the number of unrelated microblog texts found by detection; c is the number of related microblog texts not detected; and d is the number of unrelated microblog texts not detected.
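The five indexes follow directly from the a/b/c/d counts defined above; a small helper makes the relationships explicit.

```python
def topic_metrics(a, b, c, d):
    """NIST-style indexes from the detection contingency counts:
    a = related & detected, b = unrelated & detected,
    c = related & missed,   d = unrelated & missed."""
    p = a / (a + b)             # precision P
    r = a / (a + c)             # recall R
    f = 2 * p * r / (p + r)     # F value
    p_miss = c / (a + c)        # miss rate Pmiss
    p_fa = b / (b + d)          # false-alarm rate PFA
    return p, r, f, p_miss, p_fa
```

Note that recall and miss rate are complementary: R + Pmiss = 1.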
Value analysis of each parameter in BTM algorithm model
In the hot topic discovery method, the BTM model is the first step of topic identification on the microblog corpus, and the result obtained with BTM strongly influences the subsequent incremental clustering, so the invention first analyzes the parameter values of the BTM topic model. The BTM topic model is unsupervised; the topic number K of the data set must be set before modeling, and different topic numbers give different estimates and affect BTM performance, so the value of K must be determined first. In this experiment the K values were set to 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13 and 15, with the F value as the evaluation index for comparison; the results are shown in FIG. 5.
It can be seen that as the topic number K increases, the F value tends to decrease, with the best effect at K = 7. Because microblogs are short, sparse texts containing few words, relatively few word pairs form; during modeling each word pair is assigned a topic, and setting too many topics may split the word pair-topic probability, which in turn affects the document-topic probability and the final result. The invention therefore selects K = 7.
With the topic number K determined, the parameters α and β of the BTM are analyzed. By empirical values, α = 50/K, so α = 50/7. With α determined, β is analyzed; the experimental result is shown in FIG. 6, where each histogram group shows, from left to right, the precision, the recall and the F value.
It can be seen that the overall effect is relatively stable when β = 0.01, so the invention takes β = 0.01. Summarizing the two experiments, the parameter values of the proposed algorithm are K = 7, α = 50/7 and β = 0.01.
Threshold selection analysis
The threshold setting has an important influence on the clustering effect of the Single-pass incremental clustering method, and the clustering result in turn strongly affects the final topic discovery, so threshold setting is an important aspect. Threshold selection is evaluated here from two angles, the miss rate Pmiss and the false-alarm rate PFA. A part of the topic data above is selected for the experiment, with 1000 pieces of data per topic. The results for the different thresholds are shown in FIG. 7.
The experiment shows that the false-alarm rate is inversely related to the threshold while the miss rate is directly related to it: as the threshold increases, the false-alarm rate gradually decreases and the miss rate gradually increases. The overall effect is relatively good at a threshold of 0.4, so the threshold is set to 0.4.
Comparison with other hot topic discovery methods
The invention mines hot topics based on the BTM and the improved Single-pass algorithm and represents related topics by keyword sets; owing to limited space, part of the "AlphaGo" experiment results are shown in FIG. 8.
In order to verify the effectiveness of the proposed method, the BTM topic model and the BTK method are used for comparison; since the F value combines the P and R indexes, the F value serves as the evaluation index in this experiment. The result is shown in FIG. 9.
As can be seen from FIG. 9, the clustering results differ across topics, mainly because of the characteristics of the topics themselves. For example, the clustering result of the "Rio Olympic Games" topic is relatively poor because the Games involve many different events, such as "diving", "table tennis" and "volleyball"; with so many events, word co-occurrence within the BTM's word pairs is relatively reduced, so the clustering effect is not ideal.
The invention is also compared against the hot-word trends in the micro-index provided by Weibo, which visually display the trend and frequency of a given keyword over a period of time. The "Rio Olympic Games" data contains many different hot topics and is better suited for comparison, so the data of March 1 to 10, 2016 are selected. The top ten keywords obtained by running the method on the "Rio Olympic Games" data are as follows:
TABLE 1 Rio Olympic Games keywords
FIG. 10 shows the hot-word trends during the Rio Olympic Games; since "women's volleyball" on August 21 appeared far more often than the other hot words, its data is plotted on the secondary axis so the trends of the other hot words show better. From FIG. 10 it can be seen that the popularity of hot words such as "Horton", "prehistoric powers", "women's volleyball" and "Zhang Jike" rose sharply during certain periods of the Games and became hot topics, while "mudslide" and "tomato fried eggs", which are not keywords specific to the Games, appeared with low frequency. This intuitively demonstrates the effectiveness of the proposed algorithm.
The experimental results show that the hot topic discovery algorithm based on BTM and Single-pass achieves a better F value than the other methods, which demonstrates that the algorithm is effective and feasible.
Hot topic discovery algorithm parallelization experimental analysis based on BTM and Single-pass
The experiment uses 4 servers: one serves as the Master node running the NameNode and JobTracker, and the other 3 serve as Slave nodes, each running a DataNode and TaskTracker. All servers run Ubuntu 14.04; the Hadoop version is 2.6.5 and the JDK is jdk-8u121-linux-i586.
The experimental data set consists of five days of microblog data collected by the crawler tool between January 5 and January 10, 2017, about 1.1 GB in size. The crawled data is in XML format; its content is extracted and uploaded to HDFS through the open-source Java toolkit for distributed storage.
Since this experiment parallelizes the hot topic discovery algorithm based on BTM and Single-pass, the evaluation measures both the quality of hot topic discovery and the speed of data processing. The Coherence value serves as the index of hot topic discovery quality.
The Coherence value is computed as

C(z; V(z)) = Σ_{t=2..T} Σ_{l=1..t-1} log( (D(vt(z), vl(z)) + 1) / D(vl(z)) )

where V(z) = (v1(z), v2(z), ..., vT(z)) are the first T words under a known topic z, sorted by the probability p(w|z) from high to low; D(v) is the number of documents in which word v occurs, and D(v, v') is the number of documents in which both words occur together. The quality of hot topic discovery is proportional to the magnitude of C(z, V(z)): the larger the value, the better the quality.
To evaluate topic discovery quality in the single-machine and distributed environments, the invention takes the average Coherence over all topics in the current environment as the evaluation standard:

C̄ = (1/K) Σ_{z=1..K} C(z; V(z))
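A sketch of the Coherence computation as described, assuming the caller supplies the top-T word list already sorted by p(w|z), the document frequencies D(v), and the co-document frequencies D(v, v') as an unordered-pair dict:

```python
import math

def coherence(top_words, doc_freq, co_freq):
    """C(z; V) = sum_{t=2..T} sum_{l=1..t-1} log((D(vt, vl) + 1) / D(vl)).
    top_words: top-T words of the topic sorted by p(w|z) descending;
    doc_freq[v] = D(v); co_freq[(v, v')] = D(v, v') (either key order)."""
    score = 0.0
    for t in range(1, len(top_words)):
        for l in range(t):
            vt, vl = top_words[t], top_words[l]
            d_joint = co_freq.get((vt, vl), co_freq.get((vl, vt), 0))
            score += math.log((d_joint + 1) / doc_freq[vl])
    return score
```

Averaging `coherence` over all K topics gives the per-environment evaluation value described above.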
When the quality of topic discovery is similar, the speed of data processing is measured as the time taken to process the data.
Evaluation index analysis
Topic discovery quality
Under the same configuration, the invention sets T to 30 and compares the quality of topic discovery in the single-machine and distributed environments using Coherence values; the experimental results are shown in the following table:
TABLE 2 Coherference value comparison
As the results of Table 2 show, the Coherence values in the single-machine and distributed environments are quite close, and in the distributed environment the Coherence value decreases only slowly as the number of nodes increases, so overall the quality of topic discovery does not differ much.
Speed of topic discovery
In order to better illustrate the influence of the distributed environment on topic discovery speed, the invention compares the speed through two groups of experiments: the first keeps the number of nodes in the distributed environment unchanged and varies the data volume, and the second keeps the data volume unchanged and varies the number of nodes.
For the first group of experiments, the environment is set to 4 servers, 1 serving as the Master node and the rest as Slave nodes. The experimental data are divided into four groups of 256 MB, 512 MB, 768 MB and 1024 MB, and time is measured in minutes. The results are shown in the following table:
TABLE 3 Single machine and Hadoop distributed topic discovery time
[Table 3 data are provided as an image in the original document.]
To display more intuitively how the gap between the stand-alone and distributed environments changes as the data volume grows, the results are also presented as a graph in Fig. 11.
As can be seen from Table 3, as the amount of processed data increases, the gap between the time used in the Hadoop distributed environment and that on a single machine gradually widens, so the Hadoop distributed environment is more effective for operations on large data volumes.
The second group of experiments keeps the data size constant and changes the number of nodes in the cluster to compare the time spent. The selected data size is 1024 MB, the number of nodes is increased from 1 to 4, and the experimental result is shown in Fig. 12.
As can be seen from Fig. 12, when processing data of the same size, the time used decreases as the number of Hadoop nodes increases. Of course, as the number of nodes grows, the amount of data exchanged between the Slave nodes and the Master node also grows, so the reduction in time slows down; a reasonable number of nodes should therefore be set.
Experiments verify that, with little difference in topic discovery quality, the parallelized algorithm significantly improves the speed of topic discovery and reduces the time needed to process the data.

Claims (4)

1. A hot topic discovery method based on BTM and Single-pass is characterized in that MapReduce is adopted for distributed parallelization processing, including a Mapper stage and a Reducer stage of a MapReduce framework;
the Mapper stage specifically comprises:
(1) preprocessing an input data set D;
(2) evenly dividing the preprocessed data set D among C nodes, each node containing a fixed number of word pairs, and randomly assigning a topic to each word pair;
(3) performing topic modeling by using a BTM topic model to obtain topic distribution of a corpus data set;
(4) vectorizing the result by adopting VSM on each node;
(5) performing cluster analysis on each node by adopting an improved Single-pass algorithm to obtain a local topic;
the cluster analysis specifically includes:
(51) dividing the data on each node into n data slices C1, C2, ..., Cn of a certain size, which are used as input data in sequence; each data slice is independently clustered internally to obtain the clustering result of each data slice;
(52) selecting the data slice C1 as the first part, and performing internal clustering on C1 to obtain the clustering result of the first part;
(53) for the data slices C2, ..., Cn, performing internal clustering on each data slice independently before it is input, to obtain the clustering result of each data slice;
(54) calculating the similarity between each cluster center obtained from the data slice C2 and the existing clusters of the data slice C1 as sim(C2ci, C1ci), wherein C1ci and C2ci respectively represent different word pairs in the data slices C1 and C2;
(55) selecting the maximum similarity value max = max(sim(C2ci, C1ci)) and comparing it with a threshold C: if max(sim(C2ci, C1ci)) > C, classifying C2ci into the class with the maximum similarity value; if max(sim(C2ci, C1ci)) < C, creating a new topic with C2ci as its cluster center;
(56) combining the clustering results of the data slices C1 and C2 to obtain a new clustering result;
(57) repeating steps (54), (55) and (56) until all data in the data set are processed, to obtain the final result;
(6) outputting the local topic;
the Reducer phase specifically comprises:
(1) inputting an initialized clustering result on each node;
(2) selecting the clustering result of the main node as an initial clustering center;
(3) clustering results of other nodes and clustering results of the main node by adopting an improved Single-pass algorithm to finally obtain hot topics;
(4) and outputting the hot topic.
2. The BTM and Single-pass based hot topic discovery method as claimed in claim 1, wherein in step (2) of the Mapper stage, a word pair refers to any two different, unordered words that co-occur in the same data slice after the data set is preprocessed.
3. The BTM and Single-pass based hot topic discovery method as claimed in claim 1, wherein in step (3) of the Mapper stage, the topic modeling adopts Gibbs sampling as the sampling method, and the iterative sampling process uses the word-pair set of each node to obtain the topic matrix; that is, for each word pair b = (wi, wj) ∈ BC, the probability that b belongs to topic k under the BTM model is solved.
4. The BTM and Single-pass based hot topic discovery method of claim 1, characterized in that the inner clustering uses the Single-pass algorithm.
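The slice-merging Single-pass procedure of steps (51)-(57) can be sketched as follows. This is a minimal single-process illustration, not the patent's parallel implementation: cluster centers are assumed to be sparse term-weight vectors, cosine similarity stands in for the similarity measure, and the threshold value and all names are illustrative.

```python
# Sketch of the improved Single-pass merge in claim 1, steps (51)-(57):
# each data slice is clustered internally first, then each slice's cluster
# centers are merged against the existing clusters by maximum similarity.
import math

def cosine(a, b):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    dot = sum(a[k] * b.get(k, 0.0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def merge_slices(slice_clusters, threshold=0.5):
    """slice_clusters: per-slice lists of cluster centers, in slice order.
    The first slice seeds the global clusters (step 52); each later center
    joins its most similar cluster if the similarity exceeds the threshold
    (step 55), otherwise it opens a new topic."""
    clusters = [dict(c) for c in slice_clusters[0]]   # step (52)
    for piece in slice_clusters[1:]:                  # steps (54)-(57)
        for center in piece:
            sims = [cosine(center, c) for c in clusters]
            best = max(range(len(sims)), key=lambda i: sims[i])
            if sims[best] > threshold:                # join the closest topic
                for k, v in center.items():           # simple centroid update
                    clusters[best][k] = clusters[best].get(k, 0.0) + v
            else:                                     # start a new topic
                clusters.append(dict(center))
    return clusters
```

Because each incoming center is compared only against the current cluster centers rather than all previous documents, the merge stays single-pass, which is what makes the per-slice and per-node Reducer-side combination cheap.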
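The Gibbs sampling referenced in claim 3 can be illustrated with a rough single-node sketch of standard collapsed Gibbs sampling for BTM: each biterm b = (wi, wj) is reassigned a topic z with probability proportional to (n_z + α)(n_{wi|z} + β)(n_{wj|z} + β) / (Σ_w n_{w|z} + Wβ)². The hyperparameter values, iteration count, and function name here are illustrative assumptions, not taken from the patent.

```python
# Illustrative collapsed Gibbs sampler for BTM (single node, not the
# patent's MapReduce version). biterms: list of (wi, wj) word-id pairs,
# K: number of topics, W: vocabulary size.
import random
from collections import defaultdict

def btm_gibbs(biterms, K, W, alpha=0.1, beta=0.01, iters=50, seed=0):
    rng = random.Random(seed)
    nz = [0] * K                                 # biterms assigned to each topic
    nwz = [defaultdict(int) for _ in range(K)]   # per-topic word counts
    z_assign = []
    for wi, wj in biterms:                       # random initial topic assignment
        z = rng.randrange(K)
        z_assign.append(z)
        nz[z] += 1; nwz[z][wi] += 1; nwz[z][wj] += 1
    for _ in range(iters):
        for i, (wi, wj) in enumerate(biterms):
            z = z_assign[i]                      # remove current assignment
            nz[z] -= 1; nwz[z][wi] -= 1; nwz[z][wj] -= 1
            probs = []
            for k in range(K):                   # conditional p(z_b = k | rest)
                total = sum(nwz[k].values()) + W * beta
                probs.append((nz[k] + alpha)
                             * (nwz[k][wi] + beta) * (nwz[k][wj] + beta)
                             / (total * total))
            r = rng.random() * sum(probs)        # sample from the conditional
            z = 0
            while r > probs[z]:
                r -= probs[z]; z += 1
            z_assign[i] = z                      # record the new assignment
            nz[z] += 1; nwz[z][wi] += 1; nwz[z][wj] += 1
    # corpus-level topic distribution theta, estimated from the counts
    theta = [(nz[k] + alpha) / (len(biterms) + K * alpha) for k in range(K)]
    return theta, z_assign
```

In the parallel setting of claim 1, each Mapper node would run this kind of sampling over its own biterm set B_C, and the resulting per-node topic distributions are what the Reducer-side clustering consumes.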
CN201711210195.2A 2017-11-28 2017-11-28 Hot topic discovery method based on BTM and Single-pass Active CN108197144B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711210195.2A CN108197144B (en) 2017-11-28 2017-11-28 Hot topic discovery method based on BTM and Single-pass


Publications (2)

Publication Number Publication Date
CN108197144A CN108197144A (en) 2018-06-22
CN108197144B true CN108197144B (en) 2021-02-09

Family

ID=62573247

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711210195.2A Active CN108197144B (en) 2017-11-28 2017-11-28 Hot topic discovery method based on BTM and Single-pass

Country Status (1)

Country Link
CN (1) CN108197144B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109509110B (en) * 2018-07-27 2021-08-31 福州大学 Microblog hot topic discovery method based on improved BBTM model
CN109726394A (en) * 2018-12-18 2019-05-07 电子科技大学 Short text Subject Clustering method based on fusion BTM model
CN110046260B (en) * 2019-04-16 2021-06-08 广州大学 Knowledge graph-based hidden network topic discovery method and system
CN110134958B (en) * 2019-05-14 2021-05-18 南京大学 Short text topic mining method based on semantic word network
CN110297988B (en) * 2019-07-06 2020-05-01 四川大学 Hot topic detection method based on weighted LDA and improved Single-Pass clustering algorithm
CN111090811B (en) * 2019-12-24 2023-09-01 北京理工大学 Massive news hot topic extraction method and system
CN113378558B (en) * 2021-05-25 2024-04-16 浙江工业大学 RESTful API document theme distribution extraction method based on representative word pairs
CN115718680B (en) * 2023-01-09 2023-06-06 江铃汽车股份有限公司 Data reading method, system, computer and readable storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103745000A (en) * 2014-01-24 2014-04-23 福州大学 Hot topic detection method of Chinese micro-blogs
EP2068526B1 (en) * 2007-11-06 2014-04-30 Intel Corporation End-to-end network security with traffic visibility
CN104834632A (en) * 2015-05-13 2015-08-12 北京工业大学 Microblog topic detection and hotspot evaluation method based on semantic expansion
CN106599029A (en) * 2016-11-02 2017-04-26 焦点科技股份有限公司 Chinese short text clustering method


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"Microblog Topic Detection Based on LDA Model and Single-Pass Clustering"; Bo Huang et al.; International Conference on Rough Sets and Current Trends in Computing; 20120820; pp. 166-171 *
"Research on MapReduce-based hot topic discovery and evolution analysis methods"; Tan Zhen; Wanfang Data Knowledge Service Platform; 20170815; p. 27 *
"Topic model-based microblog topic discovery"; Liang Yanan et al.; China Master's Theses Full-text Database, Information Science and Technology; 20151115; pp. 28-36 *

Also Published As

Publication number Publication date
CN108197144A (en) 2018-06-22

Similar Documents

Publication Publication Date Title
CN108197144B (en) Hot topic discovery method based on BTM and Single-pass
CN106815369B (en) A kind of file classification method based on Xgboost sorting algorithm
US10579661B2 (en) System and method for machine learning and classifying data
WO2017097231A1 (en) Topic processing method and device
CN108763348B (en) Classification improvement method for feature vectors of extended short text words
CN108647322B (en) Method for identifying similarity of mass Web text information based on word network
Eluri et al. A comparative study of various clustering techniques on big data sets using Apache Mahout
CN107066555A (en) Towards the online topic detection method of professional domain
Cao et al. HitFraud: a broad learning approach for collective fraud detection in heterogeneous information networks
Kim et al. A web service for author name disambiguation in scholarly databases
Dovgopol et al. Twitter hash tag recommendation
CN103761286B (en) A kind of Service Source search method based on user interest
Mathivanan et al. A comparative study on dimensionality reduction between principal component analysis and k-means clustering
Long et al. Tcsst: transfer classification of short & sparse text using external data
Jayanthi et al. Clustering approach for classification of research articles based on keyword search
Ła̧giewka et al. Distributed image retrieval with colour and keypoint features
Pita et al. Strategies for short text representation in the word vector space
Xu et al. Research on topic discovery technology for Web news
CN115098690A (en) Multi-data document classification method and system based on cluster analysis
Kaleel et al. Event detection and trending in multiple social networking sites
Qiang et al. Lifelong learning augmented short text stream clustering method
Yang et al. Mining hidden concepts: Using short text clustering and wikipedia knowledge
CN113157915A (en) Naive Bayes text classification method based on cluster environment
Zhao et al. MapReduce-based clustering for near-duplicate image identification
JP2020113267A (en) System and method for creating reading list

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant