CN107291886A

CN107291886A - Microblog topic detection method and system based on an incremental clustering algorithm

Info

Publication number: CN107291886A
Application number: CN201710473108.6A
Authority: CN
Inventors: 王萌; 王晓荣; 梁伟鄯
Original assignee: Guangxi University of Science and Technology; Lushan College of Guangxi University of Science and Technology
Current assignee: Guangxi University of Science and Technology; Lushan College of Guangxi University of Science and Technology
Priority date: 2017-06-21
Filing date: 2017-06-21
Publication date: 2017-10-24

Abstract

The present invention discloses a microblog topic detection method based on an incremental clustering algorithm, comprising the following steps: S1, obtaining a set of microblog information; S2, preprocessing the microblog information set; S3, after preprocessing, extracting feature words according to word occurrence frequency, the distribution of words across microblog texts, and the distribution of words across time windows; S4, assigning weights to the feature words and vectorizing the microblog texts according to the feature words and their weights; S5, merging topics using a similarity judgment based on the distance between vectors. The microblog topic detection system used by this method comprises a microblog information collection module, a microblog preprocessing module, a feature word extraction module, a microblog text vectorization module, and a microblog topic analysis and merging module. The invention achieves good results in recall, accuracy and other respects, and runs considerably faster than the k-means method.

Description

Microblog topic detection method and system based on an incremental clustering algorithm

Technical field

The invention belongs to the technical field of topic detection, and specifically relates to a microblog topic detection method and system based on an incremental clustering algorithm.

Background

With the rapid development of Internet technology and its applications, and particularly since the rise of Web 2.0, microblogs have attracted the attention and affection of more and more Internet users because they publish information promptly, spread it quickly, and support many modes of dissemination. A microblog is a platform for sharing, disseminating and obtaining information based on user relationships; information can be shared and disseminated in real time over the Internet, the mobile Internet, or various clients. A microblog post carries at most 140 characters and can be accompanied by pictures, audio and video files, giving users rich and diverse ways to share and spread information. Microblogs have become an important platform on which Internet users express all kinds of sentiment. Now that the state is steadily stepping up its crackdown on online rumors, how to manage microblogs effectively has become an important topic in information security, and a key part of it, the discovery and screening of microblog topics on the network, has become a hot research problem.

In the field of natural language understanding, topic detection and tracking (TDT) has a research history of many years. Its goal is to detect relevant information and track how events develop and change. Its research covers two main parts. The first is topic detection, which clusters documents on the same topic across multiple document collections. The second is topic tracking, which follows the related events under a given topic in chronological order. With the rapid development of microblogs, some researchers have carried TDT research over from traditional text to microblogs, using topic detection and tracking to discover hot microblog topics promptly and to follow their progress in real time. Compared with traditional text, microblog text is short, its users vary widely in background, its wording is non-standard, its format is loose, and its style is strongly colloquial. These features make microblog topic detection very difficult. For these reasons, although topic detection has been studied for many years, the diversity of data acquisition and the uncertainty of feature extraction mean that current work concentrates on related areas such as news reports, and research on microblog topic detection is comparatively scarce.

As microblogs spread rapidly and play an increasingly important role in online life, scholars at home and abroad have begun to study microblog data, particularly the detection of hot microblog topics. Rui Long et al. proposed a method for effective event detection and tracking in microblog data: they cluster microblog data features to determine topic words, and use these to detect and track events. Ramage et al. adopted a labeled latent language model that maps microblog text onto four latent dimensions; the analysis results are used to rank microblog posts, and hot topics are obtained from the ranking. Ma Bin et al. labeled microblog data features with thread trees, clustered the data with a bilateral clustering method, and obtained microblog topics from the clustering results. Zheng Feiran et al. detected keywords that appear in large numbers in Twitter messages online and obtained hot microblog topics by clustering the keywords. Xue Suzhi et al. studied the inherent regularities of microblogs, used the growth rates of different topic words within the same time window to mine the hot topic words of that period, and produced hot topics by clustering those topic words.

Traditional topic detection models perform poorly in the noisy environment of microblog topic discussion. The main reason is that a microblog post consists of no more than 140 characters and therefore carries far less content than traditional text, while also containing special forms such as "#topic#" and "@user". In addition, as social networking tools, microblogs contain a large amount of Internet slang that rarely appears in traditional text, such as "children's shoes", "old bird" and "sister paper" (literal renderings of the slang terms). Microblog text also differs greatly from traditional text in structure: because it is short, modeling it with the vector space model (VSM) inevitably leads to problems such as sparse feature vectors. Hot microblog topic detection therefore differs significantly from traditional topic detection models in text preprocessing, in feature extraction, and in the clustering of hot topics.

We found the following related documents on microblog topic detection:

1. Application No. 201110164560.7, title: Microblog topic detection method and system. The method comprises the steps of: S1, segmenting microblog text into words; S2, constructing microblog text threads and a microblog text forest; S3, for a specific microblog text thread, performing microblog topic analysis to find the main topic and the noise topics in the thread; S4, for each microblog text thread, merging the microblog texts in its main topic to generate one thread text per thread; S5, performing global microblog topic analysis to detect global microblog topics and form a microblog topic library. Shortcomings of that invention: it must first construct microblog text threads and form a microblog text forest, and must then perform a large amount of topic analysis to form the topic library. It may work well in a specific domain, but its processing is likely to be slow on the explosively growing mobile Internet, and it may not be effective at discovering timely hot topics.

2. Application No. 201310017814.1, title: Real-time microblog hot event detection method and system based on a monitoring subnet. The method comprises: 1) building a microblog monitoring subnet containing a small number of key users, based on user activity, influence and response time; 2) collecting microblog data in real time, gathering at regular intervals the new posts published by all users of the monitoring subnet; 3) segmenting the collected posts into words and merging topics; 4) building, querying and updating a topic list; 5) deciding on hot events, within a given time window, according to the change in the number of participants in each topic in the topic list. Shortcomings of that invention: its work concentrates mainly on monitoring microblog events and users; for topic discovery, it relies mainly on judging the monitored samples by different user characteristics.

3. Application No. 201310177797.8, title: Method and device for extracting hot information based on microblogs. The method comprises: obtaining a microblog data set; extracting feature information from the data set, the feature information including text features, temporal features and social network features; clustering into one or more topics according to the text, temporal and social network features; and extracting the key event factors of each topic and composing hot information from those factors. Shortcomings of that invention: it mainly extracts hot information from microblogs rather than performing microblog topic detection; hot information extraction can be regarded as one part of microblog topic detection.

It is therefore particularly necessary to provide a fast and accurate microblog topic detection method and system that improves the microblog search hit rate.

Summary of the invention

The object of the invention is to overcome the deficiencies of the prior art by providing a fast and accurate microblog topic detection method and system, thereby improving the microblog search hit rate.

The invention is achieved through the following technical solution:

A microblog topic detection method based on an incremental clustering algorithm, comprising the following steps:

S1, obtaining a set of microblog information;

S2, preprocessing the microblog information set;

S3, after preprocessing, extracting feature words according to word occurrence frequency, the distribution of words across microblog texts, and the distribution of words across time windows;

S4, assigning weights to the feature words, and vectorizing the microblog texts according to the feature words and their weights;

S5, merging topics using a similarity judgment based on the distance between vectors.

Further, the microblog information in steps S1 and S2 includes user information and microblog text.

Further, the preprocessing in S2 is carried out in the following steps:

(1) deleting the information of users whose follower count is below a threshold F;

(2) ignoring directed conversational interaction messages;

(3) segmenting the microblog text into words, retaining verbs, nouns and adjectives.

Further, the threshold F is set to 30.
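The three preprocessing steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: the dictionary keys `followers`, `text` and `tokens` are hypothetical, the tokens are assumed to be already segmented and tagged with ICTCLAS-style part-of-speech tags, and treating every tag that starts with `n`, `v` or `a` as a noun, verb or adjective is an assumption.

```python
F = 30  # follower-count threshold given in the text


def preprocess(posts, min_followers=F):
    """Return the retained words of each post that survives filtering.

    posts: list of dicts with hypothetical keys
      'followers' (int), 'text' (str), 'tokens' (list of (word, pos_tag)).
    """
    kept = []
    for post in posts:
        # Step (1): drop posts from users below the follower threshold.
        if post["followers"] < min_followers:
            continue
        # Step (2): ignore directed conversation, detected here by "@".
        if "@" in post["text"]:
            continue
        # Step (3): keep nouns, verbs and adjectives (assumed tag prefixes).
        words = [w for w, tag in post["tokens"] if tag[:1] in ("n", "v", "a")]
        # Assumption: posts left with no words are dropped as uninformative.
        if words:
            kept.append(words)
    return kept
```

In practice the `tokens` field would come from whatever segmenter is in use; the filter itself does not depend on it.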

Further, extracting feature words in step S3 specifically comprises:

① Establishing the word occurrence frequency model:

$$df_i = \frac{|\{d : t_i \in d\}|}{|D|}, \quad (1)$$

where |D| is the total number of preprocessed microblog texts, |{d : t_i ∈ d}| is the number of texts in which word i occurs, d denotes a document in the document collection, and t_i denotes the number of times a word occurs across the whole document collection.

The average word occurrence frequency E is determined as the mean of the word occurrence frequencies in the preprocessed microblog texts to be processed; E satisfies:

$$E = \frac{\sum_{i=1}^{k} df_i}{k_1}, \quad (2)$$

where k₁ is the number of words to be processed after preprocessing and k is the word currently being processed.

Words whose occurrence frequency is not less than the average E proceed to the next extraction step.
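As a hedged sketch of this first filter, the code below computes each word's document frequency as the fraction of posts that contain it, takes the mean over all distinct words as E, and keeps the words at or above that mean. The function names are illustrative, and taking k₁ to be the number of distinct words is an assumption.

```python
def document_frequencies(docs):
    """docs: non-empty list of token lists. Returns {word: df}."""
    total = len(docs)
    counts = {}
    for doc in docs:
        for word in set(doc):  # count each word once per post
            counts[word] = counts.get(word, 0) + 1
    return {word: c / total for word, c in counts.items()}


def select_frequent(docs):
    """Keep words whose document frequency is not less than the mean E."""
    df = document_frequencies(docs)
    mean_e = sum(df.values()) / len(df)  # assumes k1 = number of distinct words
    selected = {word for word, value in df.items() if value >= mean_e}
    return selected, mean_e
```

For a word that occurs in 3 of 7 posts, `document_frequencies` gives 3/7.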

② Establishing the distribution model of words over the microblog text collection, computed using maximum information entropy:

In the formula:

$$H(d \mid t_i) = -\sum_j p(j \mid i)\,\log p(j \mid i), \quad (4)$$

$$p(j \mid i) = \frac{tf_{ij}}{gf_i}, \quad (5)$$

$$H(d) = \log N_t, \quad (6)$$

where tf_ij is the number of times word i occurs in microblog text j, gf_i is the total number of times word i occurs in the microblog text collection after step ① of S3, and N_t is the total number of microblog texts after step ① of S3. The average G of the maximum information entropy satisfies:

$$G = \frac{\sum_{i=1}^{k} GT(i)}{k_2}, \quad (7)$$

where k₂ is the number of words to be processed after step ① of S3. Words whose distribution maximum information entropy is less than the average G proceed to the next extraction step.
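The two entropies used in this step can be sketched as below. The sketch stops at H(d|t_i) and H(d) because the expression that combines them into GT(i) is not reproduced in this text. Two assumptions are made: p(j|i) is taken as tf_ij / gf_i, which agrees with the worked value 0.2857 = 2/7 in the embodiment, and the logarithm is base 10, which agrees with the worked time-window entropy value given there.

```python
import math


def conditional_entropy(tf_i):
    """H(d|t_i) for one word, given its counts in each post (tf_ij for all j)."""
    gf = sum(tf_i)  # total occurrences of the word in the collection
    h = 0.0
    for tf in tf_i:
        if tf > 0:  # terms with p = 0 contribute nothing
            p = tf / gf
            h -= p * math.log10(p)
    return h


def collection_entropy(n_posts):
    """H(d) = log N_t for a collection of n_posts texts."""
    return math.log10(n_posts)
```

A word concentrated in a single post has conditional entropy 0, while a word spread evenly over all N_t posts reaches the maximum H(d).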

③ Establishing the distribution model of words over time windows, computed using maximum information entropy:

$$TT(i) = -\sum p_i \log p_i, \quad (8)$$

In the formula,

$$p_i = \frac{tf_i}{gf_i}, \quad (9)$$

where tf_i is the occurrence frequency of word i in a given time window, and gf_i is the total number of times word i occurs in the microblog text collection after step ② of S3. The average H of the maximum information entropy is determined as:

$$H = \frac{\sum_{i=1}^{k} TT(i)}{k_3}, \quad (10)$$

where k₃ is the number of words to be processed after step ② of S3. Words whose time-distribution maximum information entropy is less than the average H are selected as feature words.

Further, the vectorized microblog is expressed as D_i(t₁, W₁; t₂, W₂; …; t_i, W_i), where t_i denotes a feature item and W_i its weight, with W_i taking values in [0, 1].

In the weight formula, tf_ij is the number of times feature item t_i appears in microblog document D_i, N is the total number of microblog texts after step S3, and n_i is the number of texts containing feature item t_i.

Further, in step S5, topics are merged using a similarity judgment based on the distance between vectors. The specific operating steps are:

(1) A microblog post counter is set, with initial value 0. The vectorized posts are read one by one. If the counter is 0, that is, the post is the first one, a new topic class C_l is created directly and the counter is incremented by 1; otherwise, go to the next step.

(2) After a new post is read, its distance d(D_i, C_l) to every existing topic center is calculated first, and the minimum, denoted d_max(D_i, C_l), is selected. Whether d_max(D_i, C_l) < ρ holds is then checked. If it holds, the new post D_i is merged into topic class C_l and the new topic center point is recalculated. If it does not hold, microblog text D_i is treated as a new topic class. In the center update formula, the computed value is the topic center point, ρ is the distance threshold between microblog posts, μ₁ is a weighting parameter, and k′₁ is the number of microblog posts in cluster l. The counter is then incremented by 1.

(3) When the last microblog post has been processed, the algorithm terminates.
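The single-pass procedure in steps (1) to (3) can be sketched as follows. Two parts are assumptions rather than the patent's own formulas, which are not reproduced in this text: Euclidean distance stands in for d(D_i, C_l), and a plain running mean stands in for the weighted center update that uses μ₁.

```python
import math


def euclidean(a, b):
    """Assumed distance measure between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def incremental_cluster(vectors, rho):
    """Assign each vector to the nearest topic if closer than rho, else open a new topic."""
    centers = []  # topic center points
    sizes = []    # number of posts merged into each topic
    labels = []   # topic index assigned to each post, in reading order
    for vec in vectors:
        if centers:
            dists = [euclidean(vec, c) for c in centers]
            nearest = min(range(len(centers)), key=dists.__getitem__)
            if dists[nearest] < rho:
                n = sizes[nearest]
                # Assumed update: running mean of all posts in the topic.
                centers[nearest] = [
                    (c * n + x) / (n + 1) for c, x in zip(centers[nearest], vec)
                ]
                sizes[nearest] = n + 1
                labels.append(nearest)
                continue
        # First post, or no topic within rho: create a new topic class.
        centers.append(list(vec))
        sizes.append(1)
        labels.append(len(centers) - 1)
    return labels, centers
```

Each post is compared only against the current topic centers, so the whole collection is processed in one pass and the number of topics does not have to be fixed in advance.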

Further, the value of ρ is computed with one of two models: (1) for the case where historical data is available;

(2) for the case where no historical data is available.

In these models, m denotes the dimension of the microblog vector space, n denotes the number of existing clusters, and k denotes the number of microblog posts contained in cluster center C_l; W_ij denotes the weight of each term in microblog document D_i, and W_1j denotes the term weight in cluster C₁.

A microblog topic detection system used by the microblog topic detection method based on an incremental clustering algorithm, characterized in that it comprises the following modules:

(1) a microblog information collection module, which collects in real time all new posts on the microblog platform within a time period;

(2) a microblog preprocessing module, for filtering noise data from the microblog text;

(3) a feature word extraction module, for extracting feature words according to word occurrence frequency and word distribution;

(4) a microblog text vectorization module, for vectorizing the microblog text according to the feature words and their weights, in the form D_i(t₁, W₁; t₂, W₂; …; t_i, W_i), where t_i denotes a feature item and W_i its weight;

(5) a microblog topic analysis and merging module, for judging similarity from the distance between vectors and then merging topics that are close in distance.

Further, the microblog preprocessing module comprises a user information processing module, a directed conversational interaction message processing module, and a word segmentation module. The user information processing module deletes the information of users whose follower count is below the threshold F. The directed conversational interaction message processing module ignores directed conversational interaction messages. The word segmentation module segments the microblog text into words and retains or deletes words of specified parts of speech.

The feature word extraction module comprises a word frequency processing module and a word distribution processing module. The word frequency processing module counts the occurrence frequency of words and can filter words by a specified frequency. The word distribution processing module computes the distribution of words over microblog texts or time windows and can filter words by a specified distribution value.

Compared with the prior art, the beneficial effects of the invention are as follows. The invention reduces the amount of noise in microblog data by preprocessing microblog accounts and Internet slang; it extracts feature words using information such as part of speech, occurrence frequency and distribution; and it performs microblog topic detection with an incremental clustering model. Experiments show that the method achieves good results in recall, accuracy and other respects, and runs considerably faster than the k-means method.

Brief description of the drawings

Fig. 1 is a flow chart of the microblog topic detection method of the invention;

Fig. 2 is a structural diagram of the microblog topic detection system of the invention;

Fig. 3 compares the performance indicators of the incremental clustering algorithm of the invention with those of the k-means clustering method;

Fig. 4 compares the run time of the incremental clustering algorithm of the invention with that of the k-means clustering method.

Detailed description

The invention is described in further detail below with reference to an embodiment, but embodiments of the invention are not limited to the scope of this embodiment.

Embodiment:

A microblog topic detection method based on an incremental clustering algorithm, as shown in Figs. 1 and 2, comprises the following steps:

S1. For ease of description, a set of microblog information is obtained from the network, as shown in Table 1:

Table 1

S2. The microblog information set is preprocessed:

(1) The information of users whose follower count is below the threshold F is deleted. The 7th post is excluded from the posts to be processed because its author has fewer than 30 followers.

(2) Directed conversational interaction messages are ignored. In this embodiment, the 7th post is excluded from the posts to be processed because it contains "@user".

(3) The microblog text is segmented into words, retaining verbs, nouns and adjectives. The method performs segmentation with the 2013 version of ICTCLAS, the Chinese word segmentation system of the Chinese Academy of Sciences. For example, the sentence "China keeps developing, which fully conforms to U.S. interests." is segmented as "China/ns keep/v develop/v fully/ad conform/v U.S./ns interests/n ./w". The result contains both the segmented words and the part of speech of each word. After segmentation, the content is checked for noise such as Internet slang or colloquial words. Filtering out this noise leaves posts that are as genuine as possible for topic detection, which improves accuracy. In this embodiment, the 8th post is also excluded from the posts to be processed because it contains Internet slang and carries too little information.

After the posts have been preprocessed,

① the word occurrence frequency model is established:

$$df_i = \frac{|\{d : t_i \in d\}|}{|D|}, \quad (1)$$

where |D| is the total number of preprocessed microblog texts, |{d : t_i ∈ d}| is the number of texts in which word i occurs, d denotes a document in the document collection, and t_i denotes the number of times a word occurs across the whole document collection. The average E satisfies:

$$E = \frac{\sum_{i=1}^{k} df_i}{k_1}, \quad (2)$$

where k₁ is the number of words to be processed after preprocessing and k is the word currently being processed.

Words whose occurrence frequency is not less than the average E proceed to the next extraction step.

In this embodiment, |D| = 7 and the word "relation" occurs in 3 documents, so |{d : t_i ∈ d}| = 3 and df_i = 3/7 ≈ 0.4286, while E = 0.1853. The word's occurrence frequency is not less than the average E, so it proceeds to the next extraction step.

② The distribution model of words over the microblog text collection is established, computed using maximum information entropy. In the formula:

$$H(d \mid t_i) = -\sum_j p(j \mid i)\,\log p(j \mid i), \quad (4)$$

$$p(j \mid i) = \frac{tf_{ij}}{gf_i}, \quad (5)$$

$$H(d) = \log N_t, \quad (6)$$

where tf_ij is the number of times word i occurs in microblog text j, gf_i is the total number of times word i occurs in the microblog text collection after step ① of S3, and N_t is the total number of microblog texts after step ① of S3. The average G of the maximum information entropy satisfies:

$$G = \frac{\sum_{i=1}^{k} GT(i)}{k_2}, \quad (7)$$

where k₂ is the number of words to be processed after step ① of S3. Words whose distribution maximum information entropy is less than the average G proceed to the next extraction step.

In this embodiment, the word "relation" occurs tf_ij = 2 times in the first microblog text and gf_i = 7 times in total in the microblog text collection after step ① of S3, with N_t = 7. Then p(j|i) = 0.2857, H(d|t_i) = 0.4315 and GT(i) = 0.2622. The number of words to be processed after step ① of S3 is k₂ = 38, giving G = 0.689. Words whose distribution maximum information entropy is greater than the average G proceed to the next extraction step.

③ The distribution model of words over time windows is established:

$$TT(i) = -\sum p_i \log p_i, \quad (8)$$

In the formula,

$$p_i = \frac{tf_i}{gf_i}, \quad (9)$$

where tf_i is the occurrence frequency of word i in a given time window, and gf_i is the total number of times word i occurs in the microblog text collection after step ② of S3. The average H of the maximum information entropy is determined as:

$$H = \frac{\sum_{i=1}^{k} TT(i)}{k_3}, \quad (10)$$

where k₃ is the number of words to be processed after step ② of S3. Words whose time-distribution maximum information entropy is greater than the average H are selected as feature words.

In this embodiment, the word "China" occurs 5 times within a 4-hour window and 6 times in total in the microblog text collection after step ② of S3, so p_i = 0.8333 and TT(i) = 0.0659. The number of words to be processed after step ② of S3 is 33, giving H = 0.1031. "China", "problem" and "interests" have a time-distribution maximum information entropy less than the average H and are therefore selected as feature words.
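The worked value for "China" can be checked with a few lines of code. The sketch assumes a base-10 logarithm and p_i = tf_i / gf_i; both assumptions reproduce the figures quoted in the embodiment (5/6 = 0.8333 and an entropy of about 0.066).

```python
import math


def time_window_entropy(probabilities):
    """TT(i) = -sum(p * log10(p)) over the supplied window probabilities."""
    return -sum(p * math.log10(p) for p in probabilities if p > 0)


# "China": 5 occurrences in the 4-hour window out of 6 in total.
p_china = 5 / 6
tt_china = time_window_entropy([p_china])
```

Here `tt_china` evaluates to roughly 0.06598, which the embodiment reports truncated as 0.0659 and which lies below the stated average H = 0.1031.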

After the feature words are selected, a term vector space model is built for each post from its feature words. The microblog processing counter is set to 0 and the vectorized posts are read one by one. If a post is the first one, a new topic class is created directly; otherwise, the distance between the post and each existing topic class is calculated. After the calculation, the minimum distance d_max(D_i, C_l) is selected and checked against d_max(D_i, C_l) < ρ. If the condition holds, the new post D_i is merged into topic class C_l and the new topic center point is recalculated; if it does not hold, microblog text D_i is treated as a new topic class. The value of ρ is computed with one of the two models described above: (1) for the case where historical data is available; (2) for the case where no historical data is available. Once clustering is complete, posts 1, 2, 3, 4 and 5 in the example above are grouped into one class, which mainly concerns US Vice President Biden's visit to China. In this way, recent hot topics can be found among a large number of posts.

Each feature item of a text is given a weight W_k that represents the importance of that feature item in the text. A microblog text is expressed as D_i(t₁, W₁; t₂, W₂; …; t_n, W_n), where t_i is a feature item selected with the feature word extraction algorithm described above. The weight W_i of feature item t_i is computed with the classical TF-IDF weighting method, that is, W_ik = TF_ik · IDF_k. Because text length affects feature weights, the weights are normalized so that each lies in [0, 1].

In the normalization formula, tf_ij is the number of times feature item t_i appears in microblog document D_j, N is the total number of texts, and n_j is the number of texts containing feature item t_j.
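One common way to obtain TF-IDF weights in [0, 1] is to divide each document's raw weights by their Euclidean norm. The sketch below uses that variant as an assumption, since the exact normalization formula is not reproduced in this text; the function and parameter names are illustrative.

```python
import math


def tfidf_vectors(docs, features):
    """docs: list of token lists; features: ordered list of feature words.

    Returns one length-normalized TF-IDF vector per document.
    """
    n_docs = len(docs)
    # Number of documents that contain each feature word.
    n_with = {t: sum(1 for d in docs if t in d) for t in features}
    vectors = []
    for doc in docs:
        raw = []
        for t in features:
            tf = doc.count(t)
            idf = math.log(n_docs / n_with[t]) if n_with[t] else 0.0
            raw.append(tf * idf)
        norm = math.sqrt(sum(w * w for w in raw))
        # A document with no weighted feature words maps to the zero vector.
        vectors.append([w / norm if norm else 0.0 for w in raw])
    return vectors
```

With this normalization every non-zero vector has unit length, so posts of different lengths can be compared by distance on the same scale.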

For comparison, the k-means clustering method also builds a term vector space model once the feature words have been selected. Because the number of clusters is unknown, clustering generally has to start from 2 classes and be repeated for increasing numbers of classes up to an upper bound. Each clustering run proceeds as follows: (1) k objects are chosen arbitrarily from the |D| data objects as initial cluster centers; (2) the distance between every object and each cluster's mean (its center object) is calculated, and each object is reassigned to the nearest center; (3) the mean (center object) of every changed cluster is recalculated; (4) a criterion function is evaluated, and when a condition such as convergence of the function is met, the algorithm terminates; otherwise it returns to step (2). Because the initial centers are chosen at random, the results of k-means fluctuate considerably. Compared with incremental clustering, k-means also has to be run many times to obtain a result, so its run time is longer.
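For reference, the baseline loop described in steps (1) to (4) can be sketched as below. This is a generic textbook k-means, not the experimental code behind the figures; the stopping rule (assignments no longer change), the seed parameter and the iteration cap are assumptions.

```python
import math
import random


def kmeans(points, k, seed=0, max_iter=100):
    """Plain k-means on a list of equal-length numeric vectors."""
    rng = random.Random(seed)
    # Step (1): pick k objects at random as the initial centers.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Step (2): assign every object to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        # Step (4): stop once the assignment no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
        # Step (3): recompute the mean of every non-empty cluster.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers
```

When the number of topics is unknown, this routine has to be called once for every candidate k, and usually several times per k with different seeds, which is the repeated cost that a single incremental pass avoids.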

Fig. 3 shows that the incremental clustering algorithm improves considerably on the k-means algorithm across the indicators, with some performance indicators improving by more than 20%. This is mainly because the incremental clustering algorithm removes a large number of irrelevant words in the preprocessing stage, takes full account of the characteristics of microblog information in feature selection, and takes full account of historical information in the incremental clustering stage. This enriches the feature information and significantly improves the performance of the detection algorithm. The traditional k-means algorithm does not use the structured information of microblogs, and together with the sparsity of microblog features this prevents it from achieving satisfactory performance. At the same time, although introducing several kinds of feature information into the microblog topic detection algorithm adds complexity, the improved incremental clustering algorithm reduces the number of clustering runs and so increases computing speed.

Fig. 4 compares the run time of the two systems for microblog topic detection at various document counts. The different document counts were produced by randomly drawing the required number of documents from the corpus. The results show that, for the same number of documents, the incremental clustering algorithm takes less time than the k-means algorithm. This is because the improved incremental clustering algorithm avoids the repeated clustering from different initial points that k-means requires, which reduces the number of clustering runs and significantly reduces the time needed for topic detection.

Claims

1. A microblog topic detection method based on an incremental clustering algorithm, characterized in that the method comprises the following steps:

S1, obtaining a set of microblog information;

S2, preprocessing the microblog information set;

S3, after preprocessing, extracting feature words according to word occurrence frequency, the distribution of words across microblog texts, and the distribution of words across time windows;

2. The microblog topic detection method based on an incremental clustering algorithm according to claim 1, characterized in that the microblog information in steps S1 and S2 includes user information and microblog text.

3. The microblog topic detection method based on an incremental clustering algorithm according to claim 1, characterized in that the preprocessing in S2 is carried out in the following steps:

(2) ignoring directed conversational interaction messages;

4. The microblog topic detection method based on an incremental clustering algorithm according to claim 3, characterized in that the threshold F is set to 30.

5. the microblog topic detecting method according to claim 1 based on incremental clustering algorithm, it is characterised in that：The S3 The step of step extracts Feature Words specifically includes：

1. word occurrence frequency model is set up：

<mrow> <msub> <mi>df</mi> <mi>i</mi> </msub> <mo>=</mo> <mfrac> <mrow> <mo>|</mo> <mrow> <mo>(</mo> <mi>d</mi> <mo>:</mo> <msub> <mi>t</mi> <mi>i</mi> </msub> <mo>&Element;</mo> <mi>d</mi> <mo>)</mo> </mrow> <mo>|</mo> </mrow> <mrow> <mo>|</mo> <mi>D</mi> <mo>|</mo> </mrow> </mfrac> <mo>,</mo> <mo>-</mo> <mo>-</mo> <mo>-</mo> <mrow> <mo>(</mo> <mn>1</mn> <mo>)</mo> </mrow> </mrow>

Wherein, | D | pretreated microblogging text collection sum is represented, | (d:t_i∈ d) | represent word i textual data, d occur Represent certain document in collection of document；Ti represents the number of times that some word occurs in all collection of document；

Determine word occurrence frequency average value E, E value be according in pending microblogging text after pretreatment word occurrence number it is flat Average；E is met：

$$E = \frac{\sum_{i=1}^{k} df_i}{k_1} \tag{2}$$

where k denotes the words currently to be processed, and k₁ is the number of words to be processed after preprocessing;
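Formulas (1) and (2) can be illustrated with a minimal Python sketch. It assumes each preprocessed microblog is already a list of segmented words; the function names and the sample texts are hypothetical and not part of the claims.

```python
from collections import Counter

def document_frequencies(texts):
    """Return df_i per formula (1): the share of texts that contain each word."""
    total = len(texts)                      # |D|
    containing = Counter()
    for words in texts:
        containing.update(set(words))       # count each word once per text
    return {word: count / total for word, count in containing.items()}

def average_frequency(df):
    """Return E per formula (2): the mean of df_i over the candidate words."""
    return sum(df.values()) / len(df)

# Hypothetical, already segmented microblog texts.
texts = [["rain", "flood"], ["rain", "city"], ["rain", "flood", "city"], ["match"]]
df = document_frequencies(texts)
E = average_frequency(df)
```

Here the divisor k₁ is taken to be the number of distinct words that survive preprocessing, which is one reading of the definition above.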

In the formula:

$$H(d \mid t_i) = -\sum p(j \mid i) \log p(j \mid i) \tag{4}$$

$$H(d) = \log N_t \tag{6}$$

where tf_ij is the number of times word i occurs in microblog text j, gf_i is the total number of times word i occurs in the microblog text collection remaining after sub-step 1 of S3, and N_t is the total number of microblog texts in that collection; meanwhile, the average value G of the maximum information entropy is determined, satisfying:

$$G = \frac{\sum_{i=1}^{k} GT(i)}{k_2} \tag{7}$$

where k₂ is the number of words to be processed after sub-step 1 of S3; words whose distribution maximum information entropy is less than the average value G are selected and passed to the next extraction step;
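The two entropy terms of formulas (4) and (6) can be sketched in Python as follows. The sketch assumes p(j|i) = tf_ij / gf_i, which is suggested by the definitions of tf_ij and gf_i but is not stated in the text, and it uses the natural logarithm because no base is specified. It stops short of GT(i), whose formula is not given above; the function names are hypothetical.

```python
import math

def text_entropy(tf_row):
    """H(d|t_i) per formula (4) for one word, given its counts tf_ij per text j."""
    gf = sum(tf_row)                                  # gf_i: total occurrences of word i
    probs = [tf / gf for tf in tf_row if tf > 0]      # assumed p(j|i) = tf_ij / gf_i
    return -sum(p * math.log(p) for p in probs)

def max_entropy(num_texts):
    """H(d) per formula (6): log of the number of texts N_t."""
    return math.log(num_texts)

concentrated = text_entropy([4, 0, 0, 0])   # word confined to a single text
uniform = text_entropy([1, 1, 1, 1])        # word spread evenly over four texts
```

A word confined to one text has zero entropy, while a word spread evenly over all N_t texts reaches the bound H(d).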

$$TT(i) = -\sum p_i \log(p_i) \tag{8}$$

In the formula,

tf_i is the frequency of occurrence of word i in a given time window, and gf_i is the total number of times word i occurs in the microblog text collection remaining after sub-step 2 of S3; meanwhile, the average value H of the maximum information entropy is determined:

$$H = \frac{\sum_{i=1}^{k} TT(i)}{k_3} \tag{10}$$
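The time-window entropy of formula (8) and its average in formula (10) can be sketched as below. Three things are assumptions rather than part of the claims: p_i is taken as tf_i / gf_i, time windows are formed by integer division of a numeric timestamp by a fixed window length, and the logarithm is natural. All names and sample values are hypothetical.

```python
import math
from collections import Counter

def window_entropy(timestamps, window_seconds):
    """TT(i) per formula (8) for one word, from the times at which it occurred."""
    # tf_i per window: bucket each occurrence by fixed-length time window (assumed).
    per_window = Counter(int(t // window_seconds) for t in timestamps)
    total = sum(per_window.values())                  # gf_i
    return -sum((c / total) * math.log(c / total) for c in per_window.values())

def average_window_entropy(tt_values):
    """H per formula (10): the mean of TT(i) over the remaining words."""
    return sum(tt_values) / len(tt_values)

# Hypothetical occurrence times in seconds, with one-hour windows.
occurrences = {"flood": [10, 20, 30, 40], "weather": [10, 3700, 7300, 10900]}
tt = {word: window_entropy(ts, 3600) for word, ts in occurrences.items()}
H = average_window_entropy(list(tt.values()))
```

In this example "flood" falls entirely inside one window and gets zero entropy, while "weather" is spread evenly over four windows.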

6. The microblog topic detection method based on an incremental clustering algorithm according to claim 1, characterized in that the vectorized representation of a microblog is: D_i(t₁, W₁; t₂, W₂; …; t_i, W_i), where t_i denotes a feature item and W_i denotes its weight, W_i taking values in the range [0, 1],

$$W_i = \frac{tf_{ij} \times \log\left(\frac{N}{n_i} + 0.01\right)}{\sqrt{\sum_{i=1} \left[tf_{ij} \times \log\left(\frac{N}{n_i} + 0.01\right)\right]^2}} \tag{11}$$

where tf_ij is the number of times feature item t_i appears in microblog document D_i, N is the total number of microblog texts after step S3, and n_i is the number of texts containing feature item t_i.
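The weighting of formula (11) can be sketched in Python as follows. The sketch assumes the sum in the denominator runs over all feature items of the same microblog and uses the natural logarithm, since the text does not give a base; the function name, argument names, and sample counts are hypothetical.

```python
import math

def tfidf_weights(term_counts, doc_freq, num_docs):
    """Normalized weights W_i per formula (11) for one microblog.

    term_counts: {term: tf_ij}, doc_freq: {term: n_i}, num_docs: N.
    """
    raw = {
        term: tf * math.log(num_docs / doc_freq[term] + 0.01)
        for term, tf in term_counts.items()
    }
    # Denominator: Euclidean norm of the raw weights of this microblog.
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {term: value / norm for term, value in raw.items()} if norm else raw

weights = tfidf_weights({"flood": 2, "rain": 1}, {"flood": 5, "rain": 50}, 100)
```

Because N ≥ n_i, the logarithm argument is at least 1.01, so every raw weight is non-negative and the normalized weights fall in [0, 1].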

7. The microblog topic detection method based on an incremental clustering algorithm according to claim 1, characterized in that, in step S5, topic merging is carried out using the similarity judgment method based on the distance between vectors, with the following specific operation steps:

(1) Set up a microblog model counter with an initial value of 0; read the microblog vector models one by one; if the counter is 0, that is, this is the first microblog model, directly create a new topic class C_l and increment the counter by 1; otherwise, go to the next step;

(2) After a newly added microblog model is read, first calculate its distance d(D_i, C_l) to every existing topic center and select the minimum, denoted d_max(D_i, C_l); determine whether d_max(D_i, C_l) < ρ holds; if it holds, merge the newly added microblog model D_i into topic class C_l and recalculate the new topic center point; the calculation formula is:

If it does not hold, microblog text D_i is treated as a new topic class. In the formula, the terms are a topic center point; ρ, the distance between microblog models; μ₁, a weighting parameter; and k′₁, the number of microblogs in cluster l. The counter is then incremented by 1;
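Steps (1) and (2) can be sketched as a single-pass clustering loop. Two points are assumptions: the topic center is updated as a running mean of its member vectors (the center update formula is not reproduced above, and the weighting parameter μ₁ is not modeled), and the distance is Euclidean. Function names and the sample vectors are hypothetical.

```python
import math

def euclidean(a, b):
    """Distance between two equal-length weight vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def incremental_cluster(vectors, rho):
    """Single pass: join the nearest topic if its distance is below rho."""
    centers, sizes, labels = [], [], []
    for vec in vectors:
        best, best_dist = None, math.inf
        for idx, center in enumerate(centers):
            dist = euclidean(vec, center)
            if dist < best_dist:
                best, best_dist = idx, dist
        if best is not None and best_dist < rho:
            size = sizes[best]
            # Assumed update rule: running mean of the member vectors.
            centers[best] = [(c * size + x) / (size + 1)
                             for c, x in zip(centers[best], vec)]
            sizes[best] = size + 1
            labels.append(best)
        else:
            # First microblog, or no topic is close enough: open a new topic.
            centers.append(list(vec))
            sizes.append(1)
            labels.append(len(centers) - 1)
    return labels, centers

labels, centers = incremental_cluster(
    [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]], rho=0.5)
```

Each microblog is compared only with the current topic centers, so no pass over earlier microblogs is needed when new data arrives.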

8. The microblog topic detection method based on an incremental clustering algorithm according to claim 7, characterized in that the value of ρ is computed as follows: (1) when historical data is available,

$$\rho = \frac{\sqrt{m}}{n \times k} \sum_{i=1}^{n} \sum_{l=1}^{k} d(D_i, C_l) \tag{13}$$

$$d(D_i, C_l) = \sqrt{\sum_{j=1}^{m} \left(W_{ij} - W_{lj}\right)^2} \tag{14}$$

(2) when no historical data is available, ρ satisfies:

$$\rho = 0.3 \times \frac{\sqrt{m}}{\sqrt{2}} \tag{15}$$

where m is the dimension of the microblog vector space, n is the number of clusters already formed, and k is the number of microblog models contained in cluster center C_l; W_ij denotes the weight of each term in microblog document D_i, and W_lj denotes the weight of each term in cluster C_l.
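The threshold computation of formulas (13) to (15) can be sketched in Python as follows. As an assumption, the bounds n and k are read from the summation indices of formula (13), so n counts the historical microblog vectors D_i and k counts the cluster centers C_l; the function names and sample data are hypothetical.

```python
import math

def distance(doc, center):
    """Formula (14): Euclidean distance between weight vectors of length m."""
    return math.sqrt(sum((w_i - w_l) ** 2 for w_i, w_l in zip(doc, center)))

def rho_from_history(docs, centers):
    """Formula (13): sqrt(m) / (n * k) times the sum of all doc-to-center distances."""
    m = len(centers[0])
    n, k = len(docs), len(centers)      # bounds taken from the summation indices
    total = sum(distance(doc, center) for doc in docs for center in centers)
    return math.sqrt(m) / (n * k) * total

def rho_default(m):
    """Formula (15): fallback threshold when no historical data exists."""
    return 0.3 * math.sqrt(m) / math.sqrt(2)

history_rho = rho_from_history([[0.0, 0.0], [3.0, 4.0]], [[0.0, 0.0]])
cold_start_rho = rho_default(2)
```

With historical data the threshold scales with the average observed distance; without it, the threshold depends only on the vector dimension m.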

9. A microblog topic detection system used in the microblog topic detection method based on an incremental clustering algorithm according to any one of claims 1-8, characterized in that it comprises the following modules:

(4) a microblog text vectorization module, for vectorizing microblog text according to the feature words and their weights, in the form D_i(t₁, W₁; t₂, W₂; …; t_i, W_i), where t_i denotes a feature item and W_i denotes its weight;

(5) a microblog topic analysis and merging module, for judging the similarity of vectors by their distance and then merging topics that are close together.

10. The microblog topic detection system according to claim 9, characterized in that: the microblog preprocessing module includes a user information processing module, a directed conversational interaction information processing module, and a word segmentation module; the user information processing module is used to delete user information whose listener count is less than the threshold F; the directed conversational interaction information processing module is used to ignore directed conversational interaction information; the word segmentation module is used to segment the microblog text into words and to retain or delete words of specified parts of speech;

The feature word extraction module includes a word frequency processing module and a word distribution processing module; the word frequency processing module is used to count the occurrence frequency of words and can filter words of a specified frequency; the word distribution processing module is used to count the distribution of words across microblog texts or time windows and can filter words with a specified distribution value.
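The first two sub-modules of the microblog preprocessing module in claim 10 can be sketched as a simple filter. The record fields `listeners` and `text` are hypothetical, and treating a leading "@" as the marker of a directed conversational message is an assumption; the claims do not say how such messages are recognized. Word segmentation is omitted.

```python
F = 30  # threshold value given in claim 4

def preprocess(posts, min_listeners=F):
    """Drop posts from users below the listener threshold and directed replies."""
    kept = []
    for post in posts:
        if post["listeners"] < min_listeners:
            continue        # user information processing: too few listeners
        if post["text"].lstrip().startswith("@"):
            continue        # assumed marker of a directed conversational message
        kept.append(post)
    return kept

# Hypothetical input records.
posts = [
    {"listeners": 12, "text": "storm photos"},
    {"listeners": 450, "text": "@friend see you tonight"},
    {"listeners": 450, "text": "flood warning downtown"},
]
kept = preprocess(posts)
```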