CN108874974A - Parallelization Topic Tracking method based on frequent term set - Google Patents


Info

Publication number
CN108874974A
Authority
CN
China
Prior art keywords
text
frequent
term vector
topic
frequent term
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810585627.6A
Other languages
Chinese (zh)
Inventor
孙健
许强
陆川
张明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Cloud Future Information Science Co Ltd
Original Assignee
Chengdu Cloud Future Information Science Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Cloud Future Information Science Co Ltd filed Critical Chengdu Cloud Future Information Science Co Ltd
Priority to CN201810585627.6A priority Critical patent/CN108874974A/en
Publication of CN108874974A publication Critical patent/CN108874974A/en
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G06F40/216: Parsing using statistical methods

Abstract

The invention discloses a parallelized topic tracking method based on frequent term sets, comprising: clustering a given number of texts, or the texts within a given period, from a report stream into multiple topic text sets with a text clustering algorithm; mining frequent term sets from the topic text sets with parallel computation; converting the frequent term sets into frequent term-vector sets in parallel with a word-vector model, and computing the similarity between the frequent term-vector sets of the report stream and those of the prior reports; and comparing the similarity against a preset topic tracking threshold to decide topic membership and complete topic tracking. The invention represents a topic text set by a term set, which reduces the cost of similarity computation; it proposes a similarity measure based on the Word2vec word-vector model to compute the similarity between term sets, which improves the accuracy of the comparison; and it performs frequent term set mining and word-vector conversion in parallel, exploiting the advantages of parallel computation to improve the efficiency of topic tracking.

Description

Parallelization Topic Tracking method based on frequent term set
Technical field
The present invention relates to the field of network information processing, and in particular to a parallelized topic tracking method based on frequent term sets.
Background technique
With the rapid development of information network technology and the continued spread of the internet, the data on the network grows geometrically; the data "explosion" has become one of the defining features of the current network era. The massive amount of internet information makes it difficult for users to quickly sift out useful content and to keep following specific information of interest. To alleviate this information overload, a more efficient mode of information acquisition is urgently needed, one that lets people quickly obtain the current hot topics and the follow-up reports on the content they care about.
Topic tracking technology can collect content related to a known topic from a stream of follow-up texts, and thus help people obtain the follow-up reports of that topic. Topic tracking divides into traditional topic tracking and adaptive topic tracking.
Traditional topic tracking follows the subsequent related reports mainly according to a prior topic model, and splits into two research directions, knowledge-based and statistics-based. The former finds the topic a report belongs to using domain-specific knowledge; the latter mainly judges the degree of similarity between a report and a topic through the probability distribution of report features and statistical methods, and thereby decides topic membership.
Adaptive topic tracking, building on traditional topic tracking, dynamically updates the topic model according to the reports already traced, and then uses the adjusted model to process the follow-up reports.
As topic tracking technology has matured, it has played a major role in many fields such as network public-opinion monitoring, hot news recommendation, and financial market analysis. A traditional topic tracking task provides only one to four reports related to a topic as prior data. With no other correlated features of the topic available, topic tracking must build a topic model and a tracing model from such sparse prior data, then compute the similarity between each newly arrived report in the follow-up stream and the existing topic model, compare it against a threshold, and so identify the reports of the related topic. Current topic tracking commonly uses text classification algorithms, such as KNN, decision trees, and support vector machines, to decide the topic membership of a newly arrived report. Among these, the KNN classifier is theoretically mature and yields relatively satisfactory results even with little prior data. But when facing large-scale data, traditional topic tracking runs into the following problems:
(1) Under large-scale data, the number of reports related to a single topic is far more than one to four, which opens application scenarios for other classification algorithms that need comparatively large training data sets. For the traditional KNN algorithm, a tracked report entering the system must first run similarity computations against all prior data before the K nearest neighbors can be selected and the topic membership decided. When the system runs for a long time, or multiple topics are tracked simultaneously, the large volume of prior data makes the computational complexity rise steeply and the tracing task execute slowly.
(2) When facing a large-scale report stream, far more than one report enters the system within a short span of time. If these data are processed serially in the traditional manner, then first, the efficiency is low, since each item must wait for the previous one to finish before it can be handled; and second, the items processed earlier yield less information than those processed later, because the earlier items are handled without the information carried by the later ones, which makes the tracking result sensitive to the order in which reports arrive.
Summary of the invention
The object of the invention is to solve the above problems by providing a parallelized topic tracking method based on frequent term sets.
The present invention achieves the above object through the following technical solution:
A parallelized topic tracking method based on frequent term sets comprises the following steps:
S1, clustering a given number of texts, or multiple texts within a given period, from the report stream into multiple topic text sets with a text clustering algorithm;
S2, mining frequent term sets from the topic text sets with parallel computation;
S3, converting the frequent term sets into frequent term-vector sets in parallel with a word-vector model, and computing the similarity between the frequent term-vector sets of the report stream and those of the prior reports;
S4, comparing the similarity against the preset topic tracking threshold, deciding topic membership, and completing topic tracking.
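The four stages S1-S4 can be sketched as a small pipeline. The following is our illustrative Python sketch, not the patented implementation: the clustering and mining functions are trivial stand-ins for Single-Pass and FP-Growth, and a thread pool stands in for the distributed nodes; all names are ours.

```python
from concurrent.futures import ThreadPoolExecutor

def cluster_reports(reports):
    """S1 stand-in: group reports by their first word (a real system
    would use Single-Pass clustering over TF-IDF vectors)."""
    topics = {}
    for r in reports:
        topics.setdefault(r.split()[0], []).append(r)
    return list(topics.values())

def mine_frequent_terms(text_set):
    """S2 stand-in: keep words that appear in at least half of the texts."""
    counts = {}
    for text in text_set:
        for w in set(text.split()):
            counts[w] = counts.get(w, 0) + 1
    threshold = max(1, len(text_set) // 2)
    return {w for w, c in counts.items() if c >= threshold}

def track(reports):
    topic_sets = cluster_reports(reports)
    # S2 runs once per topic set, in parallel; threads stand in for nodes.
    with ThreadPoolExecutor() as pool:
        term_sets = list(pool.map(mine_frequent_terms, topic_sets))
    return topic_sets, term_sets

topic_sets, term_sets = track([
    "earthquake strikes city center",
    "earthquake rescue teams arrive",
    "election results announced today",
])
print(len(topic_sets))  # -> 2
```

The point of the design is that S2 (and likewise the S3 vector conversion) is independent per topic set, so it maps cleanly onto parallel workers.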
Specifically, the text clustering algorithm in step S1 comprises the following steps:
A1, segmenting the texts into words, and removing from the segmentation result punctuation marks, onomatopoeia, interjections, auxiliary words, conjunctions, prepositions, adverbs, numerals, and measure words;
A2, computing the feature vector of each text with a vector space model based on TF-IDF weighting;
A3, clustering the text feature vectors with the Single-Pass algorithm to obtain multiple topic text sets.
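A minimal sketch of step A2's feature vectors, assuming the standard tf x idf weighting (the patent only names "TF-IDF weight", so the exact formula here is our assumption):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build TF-IDF weight vectors over the shared vocabulary.

    tf = term count in the document; idf = log(N / df) + 1.  This standard
    weighting is an assumption; the patent does not fix the exact formula.
    """
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n = len(docs)
    df = {w: sum(1 for toks in tokenized if w in toks) for w in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append([tf[w] * (math.log(n / df[w]) + 1) for w in vocab])
    return vocab, vectors

vocab, vecs = tfidf_vectors([
    "flood warning issued",
    "flood waters rise",
    "market opens higher",
])
```

Each text becomes one row aligned to `vocab`; these rows are what the Single-Pass clustering of step A3 compares by cosine similarity.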
Specifically, obtaining the topic text sets in step A3 comprises the following steps:
a1, traversing all texts; if a text is the first text, creating a text set for it;
a2, if it is not the first text, computing the cosine similarity between the text and every already-processed text, with the formula:
sim(d_i, d_j) = (Σ_k ω_{k,i}·ω_{k,j}) / (√(Σ_k ω_{k,i}²)·√(Σ_k ω_{k,j}²))
where d_j = (ω_{1,j}, ω_{2,j}, …, ω_{i,j}, …, ω_{t,j}) and ω_{i,j} denotes the TF-IDF weight of the i-th feature word in text d_j;
a3, taking the maximum cosine similarity Max and comparing it with the preset clustering threshold;
a4, if Max exceeds the clustering threshold, assigning the two texts that yield Max to the same text set, otherwise creating a new text set for the text;
a5, repeating steps a1-a4 to obtain multiple topic text sets.
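The loop of steps a1-a5 can be sketched in pure Python as follows; the function names and the toy vectors are illustrative only:

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length weight vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def single_pass(vectors, threshold):
    """Steps a1-a5: attach each text to the cluster of its most similar
    processed text when that similarity exceeds the threshold, otherwise
    open a new cluster.  Returns clusters as lists of vector indices."""
    clusters = []
    for i, v in enumerate(vectors):
        best_sim, best_cluster = -1.0, None
        for c in clusters:
            for j in c:
                s = cosine(v, vectors[j])
                if s > best_sim:
                    best_sim, best_cluster = s, c
        if best_cluster is not None and best_sim > threshold:
            best_cluster.append(i)
        else:
            clusters.append([i])
    return clusters

vecs = [[1, 0, 0], [0.9, 0.1, 0], [0, 0, 1]]
print(single_pass(vecs, 0.8))  # -> [[0, 1], [2]]
```

Single-Pass is order-dependent by construction, which is consistent with the order-sensitivity the Background section attributes to serial processing.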
Specifically, mining the frequent term sets in step S2 comprises the following steps:
B1, distributing the text sets evenly across the distributed nodes;
B2, counting the word frequencies in each text; if a word also appears in the title, expanding its frequency to 1.5 times the original; finally retaining the n words with the highest frequencies;
B3, judging the number of texts in the text set; if the number is 1, taking the retained high-frequency words as the frequent term set;
B4, if the text set holds more than one text, treating each text as a transaction and its retained high-frequency words as the items of that transaction, and mining the frequent itemsets of the text set with the FP-Growth algorithm to obtain its frequent term set, specifically:
if at least five frequent itemsets are obtained, taking the product of the number of items in a frequent itemset and its support count as the measure, extracting from all frequent itemsets the three with the largest measure, and merging them into a single set as the final frequent term set representing the topic text set; otherwise rejecting the text set and excluding it from subsequent topic tracking.
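A hedged sketch of step B4's mining and scoring. FP-Growth itself is not reproduced here; a brute-force miner that returns the same frequent itemsets stands in for it, and the "at least five itemsets" rejection check is omitted for brevity:

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Brute-force frequent itemset mining.  Stands in for FP-Growth:
    same output on small inputs, far worse scaling."""
    n = len(transactions)
    items = sorted({w for t in transactions for w in t})
    result = {}
    for size in range(1, len(items) + 1):
        found = False
        for combo in combinations(items, size):
            count = sum(1 for t in transactions if set(combo) <= t)
            if count / n >= min_support:
                result[combo] = count
                found = True
        if not found:          # no frequent itemset at this size: stop
            break
    return result

def topic_term_set(transactions, min_support=0.5, top_k=3):
    """Score each itemset by (number of items) x (support count), keep the
    top_k, and merge them into one frequent term set, as in step B4."""
    sets = frequent_itemsets(transactions, min_support)
    ranked = sorted(sets, key=lambda s: len(s) * sets[s], reverse=True)
    merged = set()
    for s in ranked[:top_k]:
        merged |= set(s)
    return merged

docs = [{"storm", "coast", "evacuate"},
        {"storm", "coast", "damage"},
        {"storm", "rescue", "coast"}]
print(topic_term_set(docs))
```

With the embodiment's minimum support of 0.5, only {storm}, {coast}, and {coast, storm} survive here, and the merged term set is {storm, coast}.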
Specifically, computing the similarity in step S3 comprises the following steps:
C1, expressing each word of a frequent term set as a word vector with the Word2vec model, thereby converting the frequent term set into a frequent term-vector set;
C2, computing the cosine similarities between the frequent term-vector set of the report stream and that of a prior report, with the formula:
S_{i,j} = cos(x_i, y_j) = (x_i · y_j) / (‖x_i‖·‖y_j‖)
where X = (x_1, x_2, …, x_n) is the frequent term-vector set of the report stream, x_i denoting the i-th word vector in X, and Y = (y_1, y_2, …, y_m) is the frequent term-vector set of the prior report, y_j denoting the j-th word vector in Y;
C3, obtaining the similarity matrix S, where S_{i,j} is the cosine similarity of x_i and y_j;
C4, taking the maximum of each row of S and summing these maxima weighted by the word-vector weights of X to obtain l1:
l1 = Σ_{i=1}^{n} X_i · max_j S_{i,j}
where X_i denotes the weight of x_i within the frequent term-vector set X;
C5, taking the maximum of each column of S and summing these maxima weighted by the word-vector weights of Y to obtain l2:
l2 = Σ_{j=1}^{m} Y_j · max_i S_{i,j}
where Y_j denotes the weight of y_j within the frequent term-vector set Y;
C6, averaging l1 and l2 to obtain the similarity between the frequent term-vector set of the report stream and that of the prior report.
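Steps C2-C6 reduce to a similarity matrix plus weighted row and column maxima. A minimal sketch, assuming uniform term weights when none are supplied (the patent does not specify how the weights X_i and Y_j are chosen):

```python
import math

def cos(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def set_similarity(X, Y, wX=None, wY=None):
    """Steps C2-C6: pairwise cosine matrix S, row maxima weighted by X's
    term weights (l1), column maxima weighted by Y's term weights (l2),
    final score = (l1 + l2) / 2.  Uniform weights are assumed by default."""
    wX = wX or [1 / len(X)] * len(X)
    wY = wY or [1 / len(Y)] * len(Y)
    S = [[cos(x, y) for y in Y] for x in X]            # C2-C3
    l1 = sum(w * max(row) for w, row in zip(wX, S))    # C4: row maxima
    l2 = sum(w * max(S[i][j] for i in range(len(X)))   # C5: column maxima
             for j, w in enumerate(wY))
    return (l1 + l2) / 2                               # C6

X = [[1.0, 0.0], [0.0, 1.0]]
Y = [[1.0, 0.0]]
print(set_similarity(X, Y))  # row maxima 1 and 0 -> l1 = 0.5; l2 = 1 -> 0.75
```

Averaging the two directions makes the score symmetric in spirit: l1 asks how well each word of X is covered by Y, and l2 asks the converse.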
Preferably, the Word2vec word-vector model used in step C1 is obtained by training a large corpus with the Skip-gram model and hierarchical softmax.
Specifically, step S4 comprises the following steps:
D1, computing the similarity between one frequent term-vector set X from the report stream and every frequent term-vector set Y of the prior reports, and taking out the maximum similarity max_{X,Y};
D2, comparing max_{X,Y} with the topic tracking threshold; when max_{X,Y} exceeds the threshold, assigning the text set corresponding to X in the report stream to the text set of the prior report corresponding to Y, completing topic tracking.
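Steps D1-D2 are a max-and-threshold decision. A sketch with illustrative similarity values; returning None for "no matching topic" is our convention, since the patent leaves the no-match case to the adaptive variants:

```python
def assign_topic(similarities, threshold):
    """Steps D1-D2: pick the prior topic Y with the highest similarity to
    the incoming set X; merge only if that maximum clears the tracking
    threshold, otherwise report no match (None)."""
    best = max(range(len(similarities)), key=lambda i: similarities[i])
    return best if similarities[best] > threshold else None

print(assign_topic([0.2, 0.7, 0.4], threshold=0.5))  # -> 1
print(assign_topic([0.2, 0.3], threshold=0.5))       # -> None
```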
The beneficial effects of the parallelized topic tracking method based on frequent term sets are:
1. A topic text set is represented by a term set, which greatly reduces the cost of similarity computation.
2. A similarity measure based on the Word2vec word-vector model is proposed to compute the similarity between term sets, which improves the accuracy of the comparison between term sets.
3. Frequent term set mining and word-vector conversion are carried out in parallel, making full use of the advantages of parallel computation and improving the efficiency of topic tracking.
Detailed description of the invention
Fig. 1 is the flow chart of the parallelized topic tracking method based on frequent term sets of the present invention;
Fig. 2 is the flow chart of step A3 of the present invention.
Specific embodiment
The present invention will be further explained below with reference to the attached drawings:
As shown in Fig. 1, a parallelized topic tracking method based on frequent term sets of the present invention comprises the following steps:
1. Select a certain number of reports from the stream, or the reports within a period of time; the quantity or the period is set by the user and is not mandated here.
2. Segment the texts into words; remove punctuation marks, onomatopoeia, interjections, auxiliary words, conjunctions, prepositions, adverbs, numerals, and measure words from the segmentation result; then compute the feature vector of each text with a vector space model based on TF-IDF weighting.
3. Traverse all texts. If a text is the first text, create a text set for it; otherwise compute the cosine similarity between the text and every already-processed text, with the formula:
sim(d_i, d_j) = (Σ_k ω_{k,i}·ω_{k,j}) / (√(Σ_k ω_{k,i}²)·√(Σ_k ω_{k,j}²))
where d_j = (ω_{1,j}, ω_{2,j}, …, ω_{i,j}, …, ω_{t,j}) and ω_{i,j} denotes the TF-IDF weight of the i-th feature word in text d_j.
4. Take the maximum cosine similarity Max and compare it with the preset clustering threshold. If Max exceeds the threshold, assign the two texts that yield Max to the same text set; otherwise create a new text set for the text, as shown in Fig. 2.
5. Distribute the text sets evenly across the distributed nodes; each node counts the word frequencies of the texts assigned to it, realizing parallel processing. If a word also appears in the title, expand its frequency to 1.5 times the original; finally retain the n words with the highest frequencies, with n set to 10 in this embodiment.
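The title-boosted frequency count of step 5 can be sketched as follows; whitespace tokenization and the function name are our simplifications:

```python
from collections import Counter

def top_terms(title, body, n=10, title_boost=1.5):
    """Step 5 / B2: count word frequencies over title and body, multiply
    the count of any word that also appears in the title by 1.5, and keep
    the n highest-scoring words."""
    counts = Counter(title.split() + body.split())
    title_words = set(title.split())
    scored = {w: c * title_boost if w in title_words else c
              for w, c in counts.items()}
    return [w for w, _ in sorted(scored.items(),
                                 key=lambda kv: kv[1], reverse=True)[:n]]

print(top_terms("flood hits valley",
                "flood waters rose as the valley town was evacuated",
                n=3))
```

In this example "flood" and "valley" outrank every body-only word because the title boost lifts their scores above the raw counts.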
6. Judge the number of texts in the text set. If the number is 1, take the retained high-frequency words as the frequent term set. If the text set holds more than one text, treat each text as a transaction and its retained high-frequency words as the items of that transaction; with the minimum support set to 0.5 in this embodiment, mine the frequent itemsets of the text set with the FP-Growth algorithm to obtain its frequent term set, specifically: if at least five frequent itemsets are obtained, take the product of the number of items in a frequent itemset and its support count as the measure, extract from all frequent itemsets the three with the largest measure, and merge them into a single set as the final frequent term set representing the topic text set; otherwise reject the text set and exclude it from subsequent topic tracking.
7. Express each word of a frequent term set as a word vector with the Word2vec model, converting the frequent term set into a frequent term-vector set; the Word2vec model is obtained by training a large corpus with the Skip-gram model and hierarchical softmax.
8. Compute the cosine similarities between the frequent term-vector set of the report stream and that of a prior report, with the formula:
S_{i,j} = cos(x_i, y_j) = (x_i · y_j) / (‖x_i‖·‖y_j‖)
where X = (x_1, x_2, …, x_n) is the frequent term-vector set of the report stream, x_i denoting the i-th word vector in X, and Y = (y_1, y_2, …, y_m) is the frequent term-vector set of the prior report, y_j denoting the j-th word vector in Y.
9. Obtain the similarity matrix S, where S_{i,j} is the cosine similarity of x_i and y_j. Take the maximum of each row of S and sum these maxima weighted by the word-vector weights of X to obtain l1; take the maximum of each column of S and sum these maxima weighted by the word-vector weights of Y to obtain l2:
l1 = Σ_{i=1}^{n} X_i · max_j S_{i,j},  l2 = Σ_{j=1}^{m} Y_j · max_i S_{i,j}
where X_i denotes the weight of x_i within the frequent term-vector set X, and Y_j the weight of y_j within the frequent term-vector set Y.
10. Average l1 and l2 to obtain the similarity between the frequent term-vector set of the report stream and that of the prior report.
11. Compute the similarity between one frequent term-vector set X from the report stream and every frequent term-vector set Y of the prior reports, and take out the maximum similarity max_{X,Y}. Compare max_{X,Y} with the topic tracking threshold (set by the user as needed); when max_{X,Y} exceeds the threshold, assign the text set corresponding to X in the report stream to the text set of the prior report corresponding to Y, completing topic tracking.
The technical solution of the present invention is not limited to the above specific embodiments; every technical variation made according to the technical solution of the present invention falls within the scope of protection of the present invention.

Claims (7)

1. A parallelized topic tracking method based on frequent term sets, characterized by comprising the following steps:
S1, clustering a given number of texts, or multiple texts within a given period, from the report stream into multiple topic text sets with a text clustering algorithm;
S2, mining frequent term sets from the topic text sets with parallel computation;
S3, converting the frequent term sets into frequent term-vector sets in parallel with a word-vector model, and computing the similarity between the frequent term-vector sets of the report stream and those of the prior reports;
S4, comparing the similarity against the preset topic tracking threshold, deciding topic membership, and completing topic tracking.
2. The parallelized topic tracking method based on frequent term sets according to claim 1, characterized in that the text clustering algorithm in step S1 comprises the following steps:
A1, segmenting the texts into words, and removing from the segmentation result punctuation marks, onomatopoeia, interjections, auxiliary words, conjunctions, prepositions, adverbs, numerals, and measure words;
A2, computing the feature vector of each text with a vector space model based on TF-IDF weighting;
A3, clustering the text feature vectors with the Single-Pass algorithm to obtain multiple topic text sets.
3. The parallelized topic tracking method based on frequent term sets according to claim 2, characterized in that obtaining the topic text sets in step A3 comprises the following steps:
a1, traversing all texts; if a text is the first text, creating a text set for it;
a2, if it is not the first text, computing the cosine similarity between the text and every already-processed text, with the formula:
sim(d_i, d_j) = (Σ_k ω_{k,i}·ω_{k,j}) / (√(Σ_k ω_{k,i}²)·√(Σ_k ω_{k,j}²))
where d_j = (ω_{1,j}, ω_{2,j}, …, ω_{i,j}, …, ω_{t,j}) and ω_{i,j} denotes the TF-IDF weight of the i-th feature word in text d_j;
a3, taking the maximum cosine similarity Max and comparing it with the preset clustering threshold;
a4, if Max exceeds the clustering threshold, assigning the two texts that yield Max to the same text set, otherwise creating a new text set for the text;
a5, repeating steps a1-a4 to obtain multiple topic text sets.
4. The parallelized topic tracking method based on frequent term sets according to claim 1, characterized in that mining the frequent term sets in step S2 comprises the following steps:
B1, distributing the text sets evenly across the distributed nodes;
B2, counting the word frequencies in each text; if a word also appears in the title, expanding its frequency to 1.5 times the original; finally retaining the n words with the highest frequencies;
B3, judging the number of texts in the text set; if the number is 1, taking the retained high-frequency words as the frequent term set;
B4, if the text set holds more than one text, treating each text as a transaction and its retained high-frequency words as the items of that transaction, and mining the frequent itemsets of the text set with the FP-Growth algorithm to obtain its frequent term set, specifically:
if at least five frequent itemsets are obtained, taking the product of the number of items in a frequent itemset and its support count as the measure, extracting from all frequent itemsets the three with the largest measure, and merging them into a single set as the final frequent term set representing the topic text set; otherwise rejecting the text set and excluding it from subsequent topic tracking.
5. The parallelized topic tracking method based on frequent term sets according to claim 1, characterized in that computing the similarity in step S3 comprises the following steps:
C1, expressing each word of a frequent term set as a word vector with the Word2vec model, thereby converting the frequent term set into a frequent term-vector set;
C2, computing the cosine similarities between the frequent term-vector set of the report stream and that of a prior report, with the formula:
S_{i,j} = cos(x_i, y_j) = (x_i · y_j) / (‖x_i‖·‖y_j‖)
where X = (x_1, x_2, …, x_n) is the frequent term-vector set of the report stream, x_i denoting the i-th word vector in X, and Y = (y_1, y_2, …, y_m) is the frequent term-vector set of the prior report, y_j denoting the j-th word vector in Y;
C3, obtaining the similarity matrix S, where S_{i,j} is the cosine similarity of x_i and y_j;
C4, taking the maximum of each row of S and summing these maxima weighted by the word-vector weights of X to obtain l1:
l1 = Σ_{i=1}^{n} X_i · max_j S_{i,j}
where X_i denotes the weight of x_i within the frequent term-vector set X;
C5, taking the maximum of each column of S and summing these maxima weighted by the word-vector weights of Y to obtain l2:
l2 = Σ_{j=1}^{m} Y_j · max_i S_{i,j}
where Y_j denotes the weight of y_j within the frequent term-vector set Y;
C6, averaging l1 and l2 to obtain the similarity between the frequent term-vector set of the report stream and that of the prior report.
6. The parallelized topic tracking method based on frequent term sets according to claim 5, characterized in that the Word2vec word-vector model used in step C1 is obtained by training a large corpus with the Skip-gram model and hierarchical softmax.
7. The parallelized topic tracking method based on frequent term sets according to claim 6, characterized in that step S4 comprises the following steps:
D1, computing the similarity between one frequent term-vector set X from the report stream and every frequent term-vector set Y of the prior reports, and taking out the maximum similarity max_{X,Y};
D2, comparing max_{X,Y} with the topic tracking threshold; when max_{X,Y} exceeds the threshold, assigning the text set corresponding to X in the report stream to the text set of the prior report corresponding to Y, completing topic tracking.
CN201810585627.6A 2018-06-08 2018-06-08 Parallelization Topic Tracking method based on frequent term set Pending CN108874974A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810585627.6A CN108874974A (en) 2018-06-08 2018-06-08 Parallelization Topic Tracking method based on frequent term set

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810585627.6A CN108874974A (en) 2018-06-08 2018-06-08 Parallelization Topic Tracking method based on frequent term set

Publications (1)

Publication Number Publication Date
CN108874974A true CN108874974A (en) 2018-11-23

Family

ID=64338708

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810585627.6A Pending CN108874974A (en) 2018-06-08 2018-06-08 Parallelization Topic Tracking method based on frequent term set

Country Status (1)

Country Link
CN (1) CN108874974A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111444337A (en) * 2020-02-27 2020-07-24 桂林电子科技大学 Topic tracking method based on improved K L divergence
CN111767730A (en) * 2020-07-07 2020-10-13 腾讯科技(深圳)有限公司 Event type identification method and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106202518A (en) * 2016-07-22 2016-12-07 桂林电子科技大学 Based on CHI and the short text classification method of sub-category association rule algorithm
CN106886613A (en) * 2017-05-03 2017-06-23 成都云数未来信息科学有限公司 A kind of Text Clustering Method of parallelization
CN107862070A (en) * 2017-11-22 2018-03-30 华南理工大学 Online class based on text cluster discusses the instant group technology of short text and system

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106202518A (en) * 2016-07-22 2016-12-07 桂林电子科技大学 Based on CHI and the short text classification method of sub-category association rule algorithm
CN106886613A (en) * 2017-05-03 2017-06-23 成都云数未来信息科学有限公司 A kind of Text Clustering Method of parallelization
CN107862070A (en) * 2017-11-22 2018-03-30 华南理工大学 Online class based on text cluster discusses the instant group technology of short text and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
吕伟: "微博热点话题检测与跟踪技术研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111444337A (en) * 2020-02-27 2020-07-24 桂林电子科技大学 Topic tracking method based on improved K L divergence
CN111444337B (en) * 2020-02-27 2022-07-19 桂林电子科技大学 Topic tracking method based on improved KL divergence
CN111767730A (en) * 2020-07-07 2020-10-13 腾讯科技(深圳)有限公司 Event type identification method and device
CN111767730B (en) * 2020-07-07 2023-09-22 腾讯科技(深圳)有限公司 Event type identification method and device

Similar Documents

Publication Publication Date Title
CN110825877A (en) Semantic similarity analysis method based on text clustering
CN104462184B (en) A kind of large-scale data abnormality recognition method based on two-way sampling combination
CN109359172B (en) Entity alignment optimization method based on graph partitioning
CN109886294A (en) Knowledge fusion method, apparatus, computer equipment and storage medium
CN100416560C (en) Method and apparatus for clustered evolving data flow through on-line and off-line assembly
CN111639497A (en) Abnormal behavior discovery method based on big data machine learning
CN103294817A (en) Text feature extraction method based on categorical distribution probability
CN110633371A (en) Log classification method and system
CN102184186A (en) Multi-feature adaptive fusion-based image retrieval method
CN104239553A (en) Entity recognition method based on Map-Reduce framework
CN110619084B (en) Method for recommending books according to borrowing behaviors of library readers
CN106599915A (en) Vehicle-mounted laser point cloud classification method
CN112087316B (en) Network anomaly root cause positioning method based on anomaly data analysis
CN105320764A (en) 3D model retrieval method and 3D model retrieval apparatus based on slow increment features
CN103778206A (en) Method for providing network service resources
CN114997344B (en) Multi-source data planning method and system based on urban brain
CN113052225A (en) Alarm convergence method and device based on clustering algorithm and time sequence association rule
CN114817575B (en) Large-scale electric power affair map processing method based on extended model
CN111090811A (en) Method and system for extracting massive news hot topics
Ye et al. Hydrologic time series anomaly detection based on flink
CN108874974A (en) Parallelization Topic Tracking method based on frequent term set
CN109947948B (en) Knowledge graph representation learning method and system based on tensor
CN109977131A (en) A kind of house type matching system
CN110929509B (en) Domain event trigger word clustering method based on louvain community discovery algorithm
CN113344128A (en) Micro-cluster-based industrial Internet of things adaptive stream clustering method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20181123