CN108197144B - Hot topic discovery method based on BTM and Single-pass - Google Patents

Hot topic discovery method based on BTM and Single-pass

Info

Publication number
CN108197144B
Authority
CN
China
Prior art keywords
data
topic
clustering
btm
pass
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711210195.2A
Other languages
Chinese (zh)
Other versions
CN108197144A (en)
Inventor
许国艳
夭荣朋
张网娟
平萍
朱帅
李敏佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hohai University HHU
Original Assignee
Hohai University HHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hohai University HHU filed Critical Hohai University HHU
Priority to CN201711210195.2A priority Critical patent/CN108197144B/en
Publication of CN108197144A publication Critical patent/CN108197144A/en
Application granted granted Critical
Publication of CN108197144B publication Critical patent/CN108197144B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes

Abstract

The invention provides a hot topic discovery method based on the BTM (Biterm Topic Model) and the Single-pass algorithm. The method first performs topic modeling with the BTM topic model to obtain the topic distribution of a corpus data set, then vectorizes the result with a VSM (Vector Space Model), clusters the vectors with an improved Single-pass algorithm, and sorts the clusters into a new clustering result; in addition, the whole hot topic discovery method is parallelized to raise topic-mining speed under large data volumes. The method copes well with the sparseness of microblog data and scales to mass data; the improved Single-pass algorithm reduces computational complexity, keeps the algorithm stable, processes new data effectively, and analyzes the continuing influence of hot topics well; and through the MapReduce framework the data-processing efficiency of the data set improves while topic discovery quality is maintained.

Description

Hot topic discovery method based on BTM and Single-pass
Technical Field
The invention relates to a hot topic discovery method based on BTM and Single-pass, belonging to text clustering in the field of data mining.
Background
With the popularization of smartphones and networks, people follow the latest national and social events through microblog apps. Discovering and studying microblog hot topics therefore has great value in fields such as commerce and scientific research, and more and more scholars conduct related research on microblogs.
Traditional hot topic discovery generally uses the LDA topic model, K-Means, and similar algorithms. However, the traditional LDA model mainly targets long texts and handles short texts such as microblogs poorly; moreover, microblog data is sparse and strongly context-dependent, characteristics that the LDA model has difficulty coping with.
Traditional hot topic discovery techniques also hit bottlenecks on large data sets. First, the amount of data to process during hot topic discovery is huge, and a single host with a single processor is slow and laborious. Second, topic mining with a pure BTM model is too slow. Finally, modeling with the BTM model alone ignores the streaming nature of microblog data, so the classification effect leaves room for improvement. Discovering microblog hot topics in a distributed environment is therefore particularly important.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the defects of the prior art, the invention provides a hot topic discovery method based on BTM and Single-pass that suits short, sparse streaming text, adapts to large data volumes, and accelerates topic mining.
The technical scheme is as follows: the invention provides a hot topic discovery method based on BTM and Single-pass, which comprises a Mapper stage and a Reducer stage of a MapReduce framework;
the Mapper stage specifically comprises:
(1) preprocessing the input data set D;
(2) evenly dividing the preprocessed data set D over C nodes, each node holding a certain number of word pairs, and randomly assigning a topic to each word pair;
(3) performing topic modeling by using a BTM topic model to obtain topic distribution of a corpus data set;
(4) vectorizing the result by adopting VSM on each node;
(5) performing cluster analysis on each node by adopting an improved Single-pass algorithm to obtain a local topic;
(6) outputting the local topic;
the Reducer phase specifically comprises:
(1) inputting an initialized clustering result on each node;
(2) selecting the clustering result of the main node as an initial clustering center;
(3) merging the clustering results of the other nodes with the clustering result of the main node using an improved Single-pass algorithm, finally obtaining the hot topics;
(4) and outputting the hot topic.
Preferably, in the step (5), the cluster analysis specifically includes:
(51) dividing the data of node C into n data slices C1, C2, ..., Cn at a certain scale and using the slices sequentially as input data; each slice is clustered internally on its own to obtain its own clustering result;
(52) selecting data slice C1 as the first part and clustering C1 internally to obtain the clustering result of the first part;
(53) for data slices C2, ..., Cn, clustering each slice internally on its own before input to obtain each slice's clustering result;
(54) computing the similarity sim(C2ci, C1ci) between each cluster center obtained from data slice C2 and each cluster center of the already-existing slice C1, where C1ci and C2ci respectively denote different word pairs in slices C1 and C2;
(55) selecting the maximum similarity value max = max(sim(C2ci, C1ci)) and judging whether max exceeds the threshold c; if max = max(sim(C2ci, C1ci)) > c, classifying C2ci into the class with the maximum similarity value; if max = max(sim(C2ci, C1ci)) < c, creating a new topic with C2ci as its cluster center;
(56) sorting the clustering results of slices C1 and C2 to obtain a new clustering result;
(57) repeating steps (54), (55) and (56) until all data in the data set are processed, obtaining the final result.
Preferably, the internal clustering uses a classical Single-pass algorithm.
Preferably, the word pair refers to two random, unordered, different words that co-occur in the same data slice after the data set has been preprocessed.
Preferably, in step (3), the topic modeling uses Gibbs sampling, and each iteration of the sampling obtains the topic matrix from the word-pair tuple set of each node, i.e. for each word pair b = (wi, wj) ∈ BC, the probability of b under topic k in the BTM model is computed.
Advantageous effects: compared with the prior art, the invention has the following notable advantages: 1. the BTM topic model handles data sparsity well through word pairs and processes such data better than the LDA model; 2. K-Means is common in traditional document clustering, but microblog data arrives as a stream that K-Means cannot cluster well, so the Single-pass incremental clustering algorithm is adopted to analyze microblog data; 3. the improved Single-pass algorithm reduces computational complexity, keeps the algorithm stable, processes new data effectively, and analyzes the continuing influence of hot topics well; 4. the invention also performs distributed parallelization on the BTM and Single-pass based hot topic discovery method, and through the MapReduce framework the data-processing efficiency of the data set improves while topic discovery quality is maintained.
Drawings
FIG. 1 is a flow chart of the modified Single-pass algorithm;
FIG. 2 is a parallelization flow diagram of the hot topic discovery algorithm of the present invention;
FIG. 3 is a diagram of a BTM model topology;
FIG. 4 is a basic flowchart of the BTM and Single-pass based hot topic discovery method algorithm of the present invention;
FIG. 5 is a graph of the F values of the BTM at different K values;
FIG. 6 is a graph of beta value analysis in the BTM model;
FIG. 7 is a graph of different threshold performance;
FIG. 8 is a partial topic experiment result;
FIG. 9 is a comparison of three experimental methods;
FIG. 10 is a hotword trend graph as used by the present invention;
FIG. 11 is a graph of stand-alone versus distributed time;
FIG. 12 is a graph of server versus elapsed time.
Detailed Description
The invention provides a hot topic discovery method based on BTM and Single-pass that suits short, sparse streaming text; to adapt to large data volumes and accelerate topic mining, the computation is parallelized. The method mainly comprises: (1) cluster analysis with an improved Single-pass algorithm; (2) MapReduce distributed parallelization of the BTM and Single-pass based hot topic discovery method.
(1) Cluster analysis by improving Single-pass algorithm
As shown in FIG. 1, a data set D is divided at a certain scale into data slices D1, D2, ..., Dn, which are input sequentially in order.
1) The slices D1, D2, ..., Dn serve in turn as input data; each part clusters itself internally, with a method similar to the classic Single-pass algorithm, yielding each part's own clustering result;
2) slice D1 is taken as the first part and clustered with the classic Single-pass algorithm to obtain its clustering result;
3) slices D2, ..., Dn are each clustered internally before input, likewise with the classic Single-pass algorithm, to obtain each part's clustering result;
4) each cluster center obtained from D2 is compared for similarity with each cluster center of the already-existing parts, where D1di and D2di respectively denote different word pairs in D1 and D2; a word pair here is two random, unordered, different words that co-occur in one and the same segment after word segmentation and other preprocessing of the data set;
5) the maximum similarity value max = max(sim(D2di, D1di)) is selected and compared with the threshold c, which is chosen differently for different situations; if the maximum similarity value is at least the threshold, i.e. max = max(sim(D2di, D1di)) > c, the center is classified into the class with the maximum similarity value; if the maximum similarity value is smaller than the threshold, i.e. max = max(sim(D2di, D1di)) < c, a new topic is created with that cluster center;
6) the two clustering results of D1 and D2 are sorted to obtain a new clustering result;
7) steps 4), 5) and 6) are repeated until all data are processed, giving the final result.
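The steps above can be sketched in runnable form. This is an illustrative sketch only, not the patent's implementation: it assumes documents are already topic-distribution vectors, uses cosine similarity as sim(·,·), and takes a cluster's first member as its center; all function names are ours.

```python
import math

def cosine(u, v):
    # cosine similarity between two equal-length vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def single_pass(vectors, threshold):
    """Classic Single-pass: assign each vector to the most similar existing
    cluster, or open a new cluster when no similarity exceeds the threshold."""
    clusters = []  # each cluster is a list of vectors; first member = center
    for vec in vectors:
        best_sim, best = -1.0, None
        for cluster in clusters:
            sim = cosine(vec, cluster[0])
            if sim > best_sim:
                best_sim, best = sim, cluster
        if best is not None and best_sim > threshold:
            best.append(vec)
        else:
            clusters.append([vec])
    return clusters

def improved_single_pass(data_slices, threshold):
    """Improved Single-pass (steps 1-7 above): cluster each slice internally,
    then merge each later slice's cluster centers into the accumulated result."""
    result = single_pass(data_slices[0], threshold)          # slice D1
    for piece in data_slices[1:]:                            # slices D2..Dn
        local = single_pass(piece, threshold)                # internal clustering
        for cluster in local:
            center = cluster[0]
            sims = [cosine(center, c[0]) for c in result]
            if sims and max(sims) > threshold:
                result[sims.index(max(sims))].extend(cluster)  # merge into best class
            else:
                result.append(cluster)                         # new topic
    return result
```

Because later slices are pre-clustered internally, only their cluster centers are compared against the accumulated result, which is where the reduction in computation comes from.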
(2) Method for conducting MapReduce distributed parallelization processing on hot topic discovery method based on BTM and Single-pass
1) As shown in FIG. 2, all preprocessed data D is divided evenly over C nodes, each node handling roughly D/C of the data volume. Each node contains a certain number of word pairs b = (wi, wj) ∈ BC; for the whole data set D, the global word-pair set is then B = {B1, B2, ..., BC}.
2) Local data is sampled with the Gibbs sampling method of the BTM; each iteration obtains the topic matrix from the word-pair tuple set of each node, i.e. for each word pair b = (wi, wj) ∈ BC the probability of b under the current topic k is determined.
3) For each node, the results obtained above are vectorized on the respective host in preparation for subsequent clustering.
4) Clustering is carried out on each node, the clustering result of the main node is selected as an initial clustering center, and the clustering results of the other nodes and the clustering results of the main node are merged and sorted by adopting an improved Single-pass algorithm to obtain a result.
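The parallel flow of steps 1)-4) can be simulated minimally as follows. This is an illustrative sketch under stated assumptions, not the patent's Hadoop code: each node's posts are assumed to already be BTM-plus-VSM topic vectors, similarity is cosine, and the two plain functions only mirror the Mapper/Reducer data flow.

```python
import math

def cos(u, v):
    # cosine similarity
    d = sum(a * b for a, b in zip(u, v))
    n = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return d / n if n else 0.0

def mapper(node_vectors, threshold):
    """Mapper stage: local Single-pass clustering; emit the local cluster centers."""
    centers = []
    for vec in node_vectors:
        sims = [cos(vec, c) for c in centers]
        if not sims or max(sims) <= threshold:
            centers.append(vec)            # open a new local topic
    return centers

def reducer(per_node_centers, threshold):
    """Reducer stage: the master node's clusters seed the result; the other
    nodes' centers are merged in with the same similarity test."""
    result = list(per_node_centers[0])     # master node as initial centers
    for centers in per_node_centers[1:]:
        for c in centers:
            sims = [cos(c, r) for r in result]
            if not sims or max(sims) <= threshold:
                result.append(c)           # genuinely new hot topic
    return result
```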
In BTM, all word pairs in the data set share the same topic distribution, where a topic is understood as a probability distribution over words. BTM builds on the LDA model and the unigram mixture model: it learns the topics of short, sparse text from word pairs generated over the entire data set and from the associations between words expressed by word co-occurrence. The BTM topic model therefore models topics with word pairs that co-occur across the whole data set.
The structure of the BTM topic model is shown in FIG. 3, where θ represents the global topic distribution of the data set, φ represents the probability distribution of words under a single topic, |B| is the number of word pairs in the data set, K is the number of latent topics, and Wi, Wj are the two different words of a word pair b. α and β are the hyperparameters of the Dirichlet prior distributions of θ and φ respectively.
Referring to FIG. 4, the hot topic discovery algorithm based on BTM and Single-pass comprises the following steps:
step 1, preprocessing the acquired data set, and removing noise data which do not contribute to the discovery of hot topics, such as stop words, hyperlinks, special characters and the like;
step 2, segmenting the data with the Chinese word segmentation tool NLPIR provided by the Institute of Computing Technology, Chinese Academy of Sciences;
step 3, obtaining relevant information of the required characteristic items, keeping relevant verbs, nouns, topic labels, time and the like, using the relevant verbs, the nouns, the topic labels, the time and the like as the characteristic items, sorting repeated words and counting word frequency;
step 4, modeling the data with the BTM model and computing p(z|d) and p(w|z) (document-topic and topic-word) for the data set; for each piece of data di in the data set, the formula Inf(di) computes its topic influence, and the formula wk(di) computes the weight values in a loop from k = 0 to K, where K is the number of feature words.
(1) Inf(di) quantifies the influence of one piece of microblog information:

Inf(di) = α·Ncom(di)/MAX{Ncom(dj)} + β·Nrep(di)/MAX{Nrep(dj)} + γ·Nsup(di)/MAX{Nsup(dj)}   (Equation 1)

where Ncom is the number of comments of a piece of microblog information, Nrep its number of reposts, and Nsup its number of likes; MAX{Ncom(dj)}, MAX{Nrep(dj)} and MAX{Nsup(dj)} are respectively the maximum comment, repost and like counts in the document set. α, β, γ are parameters with α + β + γ = 1.
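Under the reading above (a weighted sum of comment, repost and like counts, each normalized by its corpus maximum, with weights summing to 1), the influence Inf(di) can be sketched as follows; the field names and default weights are hypothetical.

```python
def influence(post, max_com, max_rep, max_sup, a=0.4, b=0.4, g=0.2):
    """Inf(di) = a*Ncom/max_com + b*Nrep/max_rep + g*Nsup/max_sup, a+b+g = 1.
    post is a dict with hypothetical keys 'comments', 'reposts', 'likes'."""
    assert abs(a + b + g - 1.0) < 1e-9   # the weights must sum to 1
    return (a * post["comments"] / max_com
            + b * post["reposts"] / max_rep
            + g * post["likes"]   / max_sup)
```

By construction, a post that holds the corpus maximum in all three counts scores exactly 1, and every other post scores strictly between 0 and 1.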
(2) TF-IDF is a commonly used feature-weighting technique, calculated as

TF-IDFij = TFij × IDFi   (Equation 2)

where the term frequency TF is the frequency of the given feature word in the document (the larger the value, the more important the word), and the inverse document frequency IDF reflects how rarely the given feature word appears across the entire data set. TF and IDF are computed as:

TFij = nij / Σk nkj   (Equation 3)

IDFi = log( |D| / ( |{ j : ti ∈ dj }| + 1 ) )   (Equation 4)

where nij is the frequency of the given word in document j, Σk nkj is the total number of words in that document, |D| is the total number of documents, and |{ j : ti ∈ dj }| is the number of documents containing the feature word, incremented by 1 to avoid a zero denominator.
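The TF, IDF and TF-IDF formulas as described can be sketched directly; this assumes documents are lists of tokens and uses the +1 smoothing in the IDF denominator exactly as stated.

```python
import math

def tf(word, doc):
    """TFij = nij / sum_k nkj: frequency of the word within one document."""
    return doc.count(word) / len(doc)

def idf(word, docs):
    """IDFi = log(|D| / (|{j : ti in dj}| + 1)); the +1 avoids a zero denominator."""
    containing = sum(1 for d in docs if word in d)
    return math.log(len(docs) / (containing + 1))

def tf_idf(word, doc, docs):
    """TF-IDFij = TFij * IDFi."""
    return tf(word, doc) * idf(word, docs)
```

Note that with this smoothing, a word occurring in every document gets a slightly negative IDF, which is a known side effect of the +1 form.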
Step 5, vectorizing the obtained result with the VSM vector space model (the weighting formula is given only as an image in the original) to obtain a result matrix M.
And 6, performing incremental clustering on the matrix M by using the improved Single-pass algorithm to obtain a final result.
A hot topic discovery parallelization algorithm based on BTM and Single-pass comprises the following steps:
parallelization targets the large data volume characteristic of microblogs and mainly designs the algorithms of the Mapper stage and the Reducer stage of the MapReduce framework.
(1) Mapper phase
Step 1, inputting a data set D, a theme number K and parameters alpha and beta;
step 2, preprocessing the acquired data set, and removing noise data which do not contribute to the hot topic discovery, such as stop words, hyperlinks, special characters and the like;
step 3, dividing all preprocessed data D evenly over C nodes, each node containing a certain number of word pairs b = (wi, wj) ∈ BC, and randomly assigning a topic k to each word pair b;
step 4, Zi ← Zi + Zj: on host i, the global topic-word matrix is the sum of the local matrix and those of the other hosts; all C nodes are traversed to obtain the global topic-word matrix of each host.
Step 5, each iteration adopts a word pair element group set of each node to obtain a topic matrix, namely, each b is (w)i,wj)∈BCThe probability of it under topic k is:
Figure GDA0002616433950000061
wherein n isk|CRepresents the number of word pair tuples divided into k topic numbers in process C, nwi|kAnd nwj|kRespectively represent words wi,wjThe number of the subject numbers divided into k is N, the N represents the size of a total dictionary in the data set, and alpha and beta represent prior parameters;
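The conditional probability of step 5 can be sketched with the counts defined above. This follows the standard collapsed-Gibbs form for BTM and returns the normalized weights rather than drawing the sample; the container layout (a list of per-topic counts and per-topic word-count dicts) is our assumption.

```python
def biterm_topic_weights(wi, wj, n_k, n_w_k, K, N, alpha, beta):
    """Normalized probability of assigning word pair (wi, wj) to each topic k.
    n_k[k]: word pairs currently assigned to topic k on this node;
    n_w_k[k][w]: count of word w under topic k; N: dictionary size."""
    weights = []
    for k in range(K):
        total_k = sum(n_w_k[k].values())              # sum_w n_w|k
        w = ((n_k[k] + alpha)
             * (n_w_k[k].get(wi, 0) + beta)
             * (n_w_k[k].get(wj, 0) + beta)
             / (total_k + N * beta) ** 2)
        weights.append(w)
    s = sum(weights)
    return [w / s for w in weights]                   # normalize before sampling
```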
step 6, adopting VSM to carry out vectorization on the results of the respective host on each node;
step 7, clustering each node by adopting an improved Single-pass algorithm to obtain local topics;
specifically, (71) dividing the data of node C into n data slices C1, C2, ..., Cn at a certain scale and using the slices sequentially as input data; each slice is clustered internally on its own to obtain its own clustering result;
(72) selecting data slice C1 as the first part and clustering C1 internally to obtain the clustering result of the first part;
(73) for data slices C2, ..., Cn, clustering each slice internally on its own before input to obtain each slice's clustering result;
(74) computing the similarity sim(C2ci, C1ci) between each cluster center obtained from data slice C2 and each cluster center of the already-existing slice C1, where C1ci and C2ci respectively denote different word pairs in slices C1 and C2;
(75) selecting the maximum similarity value max = max(sim(C2ci, C1ci)) and judging whether max exceeds the threshold c; if max = max(sim(C2ci, C1ci)) > c, classifying C2ci into the class with the maximum similarity value; if max = max(sim(C2ci, C1ci)) < c, creating a new topic with C2ci as its cluster center;
(76) sorting the clustering results of slices C1 and C2 to obtain a new clustering result;
(77) looping steps (74), (75) and (76) until all data in the data set are processed, obtaining the final result;
step 8, outputting the local topics.
(2) Reducer phase
Step 1, inputting an initialization clustering result on each node;
step 2, selecting a clustering result of the main node as an initial clustering center;
step 3, merging the clustering results of the other nodes with the clustering result of the main node using an improved Single-pass algorithm, finally obtaining the hot topics;
step 4, outputting the hot topics.
Analysis of experiments
The experimental data of the invention is a real data set collected from Sina Weibo by a crawler program; the crawled content mainly comprises 20,000 posts on topics such as "AlphaGo" and the "Rio Olympic Games".
At present there is no unified evaluation standard for microblog hot topic discovery; the invention adopts the following evaluation indexes for the experiments.
In terms of accuracy, the evaluation indexes published by NIST (National Institute of Standards and Technology) are used, including the precision P, recall R, F value, miss rate Pmiss and false-alarm rate PFA. The formulas are as follows:
P = a / (a + b)

R = a / (a + c)

F = 2PR / (P + R)

Pmiss = c / (a + c)

PFA = b / (b + d)
where a is the number of related microblog texts found by detection; b is the number of unrelated microblog texts found by detection; c is the number of related microblog texts not detected; and d is the number of unrelated microblog texts not detected.
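The five indexes follow directly from the a/b/c/d counts defined above; a small helper makes the relationships explicit.

```python
def topic_metrics(a, b, c, d):
    """NIST-style indexes from the detection contingency counts:
    a = related & detected, b = unrelated & detected,
    c = related & missed,   d = unrelated & missed."""
    p = a / (a + b)             # precision P
    r = a / (a + c)             # recall R
    f = 2 * p * r / (p + r)     # F value
    p_miss = c / (a + c)        # miss rate Pmiss
    p_fa = b / (b + d)          # false-alarm rate PFA
    return p, r, f, p_miss, p_fa
```

Note that recall and miss rate are complementary: R + Pmiss = 1.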
Value analysis of each parameter in BTM algorithm model
In the hot topic discovery method, the BTM model is the first step of topic identification on the microblog corpus, and the result obtained with BTM strongly influences the subsequent incremental clustering, so the invention first analyzes the parameter values of the BTM topic model. The BTM topic model is unsupervised; the topic number K of the data set must be set before modeling, and different topic numbers give different estimates and affect BTM performance, so the value of K must be determined first. In this experiment the K values were set to 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13 and 15, with the F value as the evaluation index for comparison; the results are shown in FIG. 5.
It can be seen that as the topic number K increases, the F value tends to decrease, with the best effect at K = 7. Because microblogs are short, sparse texts containing few words, relatively few word pairs form; during modeling each word pair is assigned a topic, and setting too many topics may split the word pair-topic probability, which in turn affects the document-topic probability and the final result. The invention therefore selects K = 7.
With the topic number K determined, the parameters α and β of the BTM are analyzed. By empirical values, α = 50/K, so α = 50/7. With α determined, β is analyzed; the experimental result is shown in FIG. 6, where each histogram group shows, from left to right, the precision, the recall and the F value.
It can be seen that the overall effect is relatively stable when β = 0.01, so the invention takes β = 0.01. Summarizing the two experiments, the parameter values of the proposed algorithm are K = 7, α = 50/7 and β = 0.01.
Threshold selection analysis
The threshold setting has an important influence on the clustering effect of the Single-pass incremental clustering method, and the clustering result in turn strongly affects the final topic discovery, so threshold setting is an important aspect. Threshold selection is evaluated here from two angles, the miss rate Pmiss and the false-alarm rate PFA. A part of the topic data above is selected for the experiment, with 1000 pieces of data per topic. The results for the different thresholds are shown in FIG. 7.
The experiment shows that the false-alarm rate is inversely related to the threshold while the miss rate is directly related to it: as the threshold increases, the false-alarm rate gradually decreases and the miss rate gradually increases. The overall effect is relatively good at a threshold of 0.4, so the threshold is set to 0.4.
Comparison with other hot topic discovery methods
The invention mines hot topics based on the BTM and the improved Single-pass algorithm and represents related topics by keyword sets; owing to limited space, part of the "AlphaGo" experiment results are shown in FIG. 8.
In order to verify the effectiveness of the proposed method, the BTM topic model and the BTK method are used for comparison; since the F value combines the P and R indexes, the F value serves as the evaluation index in this experiment. The result is shown in FIG. 9.
As can be seen from FIG. 9, the clustering results differ across topics, mainly because of the characteristics of the topics themselves. For example, the clustering result of the "Rio Olympic Games" topic is relatively poor because the Games involve many different events, such as "diving", "table tennis" and "volleyball"; with so many events, word co-occurrence within the BTM's word pairs is relatively reduced, so the clustering effect is not ideal.
The invention is also compared against the hot-word trends in the micro-index provided by Weibo, which visually display the trend and frequency of a given keyword over a period of time. The "Rio Olympic Games" data contains many different hot topics and is better suited for comparison, so the data of March 1 to 10, 2016 are selected. The top ten keywords obtained by running the method on the "Rio Olympic Games" data are as follows:
TABLE 1 Rio Olympic Games keywords
FIG. 10 shows the hot-word trends during the Rio Olympic Games; since "women's volleyball" on August 21 appeared far more often than the other hot words, its data is plotted on the secondary axis so the trends of the other hot words show better. From FIG. 10 it can be seen that the popularity of hot words such as "Horton", "prehistoric powers", "women's volleyball" and "Zhang Jike" rose sharply during certain periods of the Games and became hot topics, while "mudslide" and "tomato fried eggs", which are not keywords specific to the Games, appeared with low frequency. This intuitively demonstrates the effectiveness of the proposed algorithm.
The experimental results show that the hot topic discovery algorithm based on BTM and Single-pass achieves a better F value than the other methods, which demonstrates that the algorithm is effective and feasible.
Hot topic discovery algorithm parallelization experimental analysis based on BTM and Single-pass
The experiment uses 4 servers: one serves as the Master node running the NameNode and JobTracker, and the other 3 serve as Slave nodes, each running a DataNode and TaskTracker. All servers run Ubuntu 14.04; the Hadoop version is 2.6.5 and the JDK is jdk-8u121-linux-i586.
The experimental data set consists of five days of microblog data collected by the crawler tool between January 5 and January 10, 2017, about 1.1 GB in size. The crawled data is in XML format; its content is extracted and uploaded to HDFS through the open-source Java toolkit for distributed storage.
Since this experiment parallelizes the hot topic discovery algorithm based on BTM and Single-pass, the evaluation measures both the quality of hot topic discovery and the speed of data processing. The Coherence value serves as the index of hot topic discovery quality.
The Coherence value is computed as

C(z; V(z)) = Σ_{t=2..T} Σ_{l=1..t-1} log( (D(vt(z), vl(z)) + 1) / D(vl(z)) )

where V(z) = (v1(z), v2(z), ..., vT(z)) are the first T words under a known topic z, sorted by the probability p(w|z) from high to low; D(v) is the number of documents in which word v occurs, and D(v, v') is the number of documents in which both words occur together. The quality of hot topic discovery is proportional to the magnitude of C(z, V(z)): the larger the value, the better the quality.
To evaluate topic discovery quality in the single-machine and distributed environments, the invention takes the average Coherence over all topics in the current environment as the evaluation standard:

C̄ = (1/K) Σ_{z=1..K} C(z; V(z))
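A sketch of the Coherence computation as described, assuming the caller supplies the top-T word list already sorted by p(w|z), the document frequencies D(v), and the co-document frequencies D(v, v') as an unordered-pair dict:

```python
import math

def coherence(top_words, doc_freq, co_freq):
    """C(z; V) = sum_{t=2..T} sum_{l=1..t-1} log((D(vt, vl) + 1) / D(vl)).
    top_words: top-T words of the topic sorted by p(w|z) descending;
    doc_freq[v] = D(v); co_freq[(v, v')] = D(v, v') (either key order)."""
    score = 0.0
    for t in range(1, len(top_words)):
        for l in range(t):
            vt, vl = top_words[t], top_words[l]
            d_joint = co_freq.get((vt, vl), co_freq.get((vl, vt), 0))
            score += math.log((d_joint + 1) / doc_freq[vl])
    return score
```

Averaging `coherence` over all K topics gives the per-environment evaluation value described above.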
When the quality of topic discovery is similar, the speed of data processing is measured as the time taken to process the data.
Evaluation index analysis
Topic discovery quality
Under the same configuration, the invention sets T to 30 and compares the quality of topic discovery in the single-machine and distributed environments using Coherence values; the experimental results are shown in the following table:
TABLE 2 Coherference value comparison
As the results of Table 2 show, the Coherence values in the single-machine and distributed environments are quite close, and in the distributed environment the Coherence value decreases only slowly as the number of nodes increases, so overall the quality of topic discovery does not differ much.
Speed of topic discovery
In order to better illustrate the influence of the distributed environment on topic discovery speed, the invention compares the speed through two groups of experiments: the first keeps the number of nodes in the distributed environment unchanged and varies the data volume, and the second keeps the data volume unchanged and varies the number of nodes.
For the first group of experiments, the environment is set to 4 servers, 1 serving as the Master node and the rest as Slave nodes. The experimental data are divided into four groups of 256 MB, 512 MB, 768 MB and 1024 MB, and time is measured in minutes. The results are shown in the following table:
TABLE 3 Single machine and Hadoop distributed topic discovery time
[Table 3 data are provided as an image in the original document.]
To display more intuitively how the gap between the stand-alone and distributed environments changes as the data volume grows, the results are also presented as a graph in Fig. 11.
As can be seen from Table 3, as the amount of processed data increases, the gap between the time used in the Hadoop distributed environment and that on a single machine gradually widens, so the Hadoop distributed environment is more effective for operations on large data volumes.
The second group of experiments keeps the data size constant and changes the number of nodes in the cluster to compare the time spent. The selected data size is 1024 MB, the number of nodes is increased from 1 to 4, and the experimental result is shown in Fig. 12.
As can be seen from Fig. 12, when processing data of the same size, the time used decreases as the number of Hadoop nodes increases. Of course, as the number of nodes grows, the amount of data exchanged between the Slave nodes and the Master node also grows, so the reduction in time slows down; a reasonable number of nodes should therefore be set.
Experiments verify that, with little difference in topic discovery quality, the parallelized algorithm significantly improves the speed of topic discovery and reduces the time needed to process the data.

Claims (4)

1. A hot topic discovery method based on BTM and Single-pass is characterized in that MapReduce is adopted for distributed parallelization processing, including a Mapper stage and a Reducer stage of a MapReduce framework;
the Mapper stage specifically comprises:
(1) preprocessing an input data set D;
(2) evenly dividing the preprocessed data set D among C nodes, each node containing a fixed number of word pairs, and randomly assigning a topic to each word pair;
(3) performing topic modeling by using a BTM topic model to obtain topic distribution of a corpus data set;
(4) vectorizing the result by adopting VSM on each node;
(5) performing cluster analysis on each node by adopting an improved Single-pass algorithm to obtain a local topic;
the cluster analysis specifically includes:
(51) dividing the data on each node into n data slices C1, C2, ..., Cn of a certain size, which are used as input data in sequence; each data slice is independently clustered internally to obtain the clustering result of each data slice;
(52) selecting the data slice C1 as the first part, and performing internal clustering on C1 to obtain the clustering result of the first part;
(53) for the data slices C2, ..., Cn, performing internal clustering on each data slice independently before it is input, to obtain the clustering result of each data slice;
(54) calculating the similarity between each cluster center obtained from the data slice C2 and the existing clusters of the data slice C1 as sim(C2ci, C1ci), wherein C1ci and C2ci respectively represent different word pairs in the data slices C1 and C2;
(55) selecting the maximum similarity value max = max(sim(C2ci, C1ci)) and comparing it with a threshold C: if max(sim(C2ci, C1ci)) > C, classifying C2ci into the class with the maximum similarity value; if max(sim(C2ci, C1ci)) < C, creating a new topic with C2ci as its cluster center;
(56) combining the clustering results of the data slices C1 and C2 to obtain a new clustering result;
(57) repeating steps (54), (55) and (56) until all data in the data set are processed, to obtain the final result;
(6) outputting the local topic;
the Reducer phase specifically comprises:
(1) inputting an initialized clustering result on each node;
(2) selecting the clustering result of the main node as an initial clustering center;
(3) clustering results of other nodes and clustering results of the main node by adopting an improved Single-pass algorithm to finally obtain hot topics;
(4) and outputting the hot topic.
2. The BTM and Single-pass based hot topic discovery method as claimed in claim 1, wherein in step (2) of the Mapper stage, a word pair refers to any two different, unordered words that co-occur in the same data slice after the data set is preprocessed.
3. The BTM and Single-pass based hot topic discovery method as claimed in claim 1, wherein in step (3) of the Mapper stage, the topic modeling adopts Gibbs sampling as the sampling method, and the iterative sampling process uses the word-pair set of each node to obtain the topic matrix; that is, for each word pair b = (wi, wj) ∈ BC, the probability that b belongs to topic k under the BTM model is solved.
4. The BTM and Single-pass based hot topic discovery method of claim 1, characterized in that the inner clustering uses the Single-pass algorithm.
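The slice-merging Single-pass procedure of steps (51)-(57) can be sketched as follows. This is a minimal single-process illustration, not the patent's parallel implementation: cluster centers are assumed to be sparse term-weight vectors, cosine similarity stands in for the similarity measure, and the threshold value and all names are illustrative.

```python
# Sketch of the improved Single-pass merge in claim 1, steps (51)-(57):
# each data slice is clustered internally first, then each slice's cluster
# centers are merged against the existing clusters by maximum similarity.
import math

def cosine(a, b):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    dot = sum(a[k] * b.get(k, 0.0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def merge_slices(slice_clusters, threshold=0.5):
    """slice_clusters: per-slice lists of cluster centers, in slice order.
    The first slice seeds the global clusters (step 52); each later center
    joins its most similar cluster if the similarity exceeds the threshold
    (step 55), otherwise it opens a new topic."""
    clusters = [dict(c) for c in slice_clusters[0]]   # step (52)
    for piece in slice_clusters[1:]:                  # steps (54)-(57)
        for center in piece:
            sims = [cosine(center, c) for c in clusters]
            best = max(range(len(sims)), key=lambda i: sims[i])
            if sims[best] > threshold:                # join the closest topic
                for k, v in center.items():           # simple centroid update
                    clusters[best][k] = clusters[best].get(k, 0.0) + v
            else:                                     # start a new topic
                clusters.append(dict(center))
    return clusters
```

Because each incoming center is compared only against the current cluster centers rather than all previous documents, the merge stays single-pass, which is what makes the per-slice and per-node Reducer-side combination cheap.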
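The Gibbs sampling referenced in claim 3 can be illustrated with a rough single-node sketch of standard collapsed Gibbs sampling for BTM: each biterm b = (wi, wj) is reassigned a topic z with probability proportional to (n_z + α)(n_{wi|z} + β)(n_{wj|z} + β) / (Σ_w n_{w|z} + Wβ)². The hyperparameter values, iteration count, and function name here are illustrative assumptions, not taken from the patent.

```python
# Illustrative collapsed Gibbs sampler for BTM (single node, not the
# patent's MapReduce version). biterms: list of (wi, wj) word-id pairs,
# K: number of topics, W: vocabulary size.
import random
from collections import defaultdict

def btm_gibbs(biterms, K, W, alpha=0.1, beta=0.01, iters=50, seed=0):
    rng = random.Random(seed)
    nz = [0] * K                                 # biterms assigned to each topic
    nwz = [defaultdict(int) for _ in range(K)]   # per-topic word counts
    z_assign = []
    for wi, wj in biterms:                       # random initial topic assignment
        z = rng.randrange(K)
        z_assign.append(z)
        nz[z] += 1; nwz[z][wi] += 1; nwz[z][wj] += 1
    for _ in range(iters):
        for i, (wi, wj) in enumerate(biterms):
            z = z_assign[i]                      # remove current assignment
            nz[z] -= 1; nwz[z][wi] -= 1; nwz[z][wj] -= 1
            probs = []
            for k in range(K):                   # conditional p(z_b = k | rest)
                total = sum(nwz[k].values()) + W * beta
                probs.append((nz[k] + alpha)
                             * (nwz[k][wi] + beta) * (nwz[k][wj] + beta)
                             / (total * total))
            r = rng.random() * sum(probs)        # sample from the conditional
            z = 0
            while r > probs[z]:
                r -= probs[z]; z += 1
            z_assign[i] = z                      # record the new assignment
            nz[z] += 1; nwz[z][wi] += 1; nwz[z][wj] += 1
    # corpus-level topic distribution theta, estimated from the counts
    theta = [(nz[k] + alpha) / (len(biterms) + K * alpha) for k in range(K)]
    return theta, z_assign
```

In the parallel setting of claim 1, each Mapper node would run this kind of sampling over its own biterm set B_C, and the resulting per-node topic distributions are what the Reducer-side clustering consumes.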
CN201711210195.2A 2017-11-28 2017-11-28 Hot topic discovery method based on BTM and Single-pass Active CN108197144B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711210195.2A CN108197144B (en) 2017-11-28 2017-11-28 Hot topic discovery method based on BTM and Single-pass


Publications (2)

Publication Number Publication Date
CN108197144A CN108197144A (en) 2018-06-22
CN108197144B true CN108197144B (en) 2021-02-09

Family

ID=62573247

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711210195.2A Active CN108197144B (en) 2017-11-28 2017-11-28 Hot topic discovery method based on BTM and Single-pass

Country Status (1)

Country Link
CN (1) CN108197144B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109509110B (en) * 2018-07-27 2021-08-31 福州大学 Microblog hot topic discovery method based on improved BBTM model
CN109726394A (en) * 2018-12-18 2019-05-07 电子科技大学 Short text Subject Clustering method based on fusion BTM model
CN110046260B (en) * 2019-04-16 2021-06-08 广州大学 Knowledge graph-based hidden network topic discovery method and system
CN110134958B (en) * 2019-05-14 2021-05-18 南京大学 Short text topic mining method based on semantic word network
CN110297988B (en) * 2019-07-06 2020-05-01 四川大学 Hot topic detection method based on weighted LDA and improved Single-Pass clustering algorithm
CN111090811B (en) * 2019-12-24 2023-09-01 北京理工大学 Massive news hot topic extraction method and system
CN113378558B (en) * 2021-05-25 2024-04-16 浙江工业大学 RESTful API document theme distribution extraction method based on representative word pairs
CN115718680B (en) * 2023-01-09 2023-06-06 江铃汽车股份有限公司 Data reading method, system, computer and readable storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103745000A (en) * 2014-01-24 2014-04-23 福州大学 Hot topic detection method of Chinese micro-blogs
EP2068526B1 (en) * 2007-11-06 2014-04-30 Intel Corporation End-to-end network security with traffic visibility
CN104834632A (en) * 2015-05-13 2015-08-12 北京工业大学 Microblog topic detection and hotspot evaluation method based on semantic expansion
CN106599029A (en) * 2016-11-02 2017-04-26 焦点科技股份有限公司 Chinese short text clustering method


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"Microblog Topic Detection Based on LDA Model and Single-Pass Clustering"; Bo Huang et al.; International Conference on Rough Sets and Current Trends in Computing; 20120820; pp. 166-171 *
"Research on MapReduce-based hot topic discovery and evolution analysis methods"; Tan Zhen; Wanfang Data Knowledge Service Platform; 20170815; p. 27 *
"Topic model-based microblog topic discovery"; Liang Yanan et al.; China Master's Theses Full-text Database, Information Science and Technology; 20151115; pp. 28-36 *

Also Published As

Publication number Publication date
CN108197144A (en) 2018-06-22

Similar Documents

Publication Publication Date Title
CN108197144B (en) Hot topic discovery method based on BTM and Single-pass
CN106815369B (en) A kind of file classification method based on Xgboost sorting algorithm
US10579661B2 (en) System and method for machine learning and classifying data
WO2017097231A1 (en) Topic processing method and device
CN108763348B (en) Classification improvement method for feature vectors of extended short text words
CN108647322B (en) Method for identifying similarity of mass Web text information based on word network
Eluri et al. A comparative study of various clustering techniques on big data sets using Apache Mahout
CN107066555A (en) Towards the online topic detection method of professional domain
Cao et al. HitFraud: a broad learning approach for collective fraud detection in heterogeneous information networks
Kim et al. A web service for author name disambiguation in scholarly databases
Dovgopol et al. Twitter hash tag recommendation
CN103761286B (en) A kind of Service Source search method based on user interest
Mathivanan et al. A comparative study on dimensionality reduction between principal component analysis and k-means clustering
Long et al. Tcsst: transfer classification of short & sparse text using external data
Jayanthi et al. Clustering approach for classification of research articles based on keyword search
Ła̧giewka et al. Distributed image retrieval with colour and keypoint features
Pita et al. Strategies for short text representation in the word vector space
Xu et al. Research on topic discovery technology for Web news
CN115098690A (en) Multi-data document classification method and system based on cluster analysis
Kaleel et al. Event detection and trending in multiple social networking sites
Qiang et al. Lifelong learning augmented short text stream clustering method
Yang et al. Mining hidden concepts: Using short text clustering and wikipedia knowledge
CN113157915A (en) Naive Bayes text classification method based on cluster environment
Zhao et al. MapReduce-based clustering for near-duplicate image identification
JP2020113267A (en) System and method for creating reading list

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant