CN104731811B - Clustering-based information evolution analysis method for large-scale dynamic short text - Google Patents
Clustering-based information evolution analysis method for large-scale dynamic short text
- Publication number
- CN104731811B CN104731811B CN201310716896.9A CN201310716896A CN104731811B CN 104731811 B CN104731811 B CN 104731811B CN 201310716896 A CN201310716896 A CN 201310716896A CN 104731811 B CN104731811 B CN 104731811B
- Authority
- CN
- China
- Prior art keywords
- clustering
- short text
- information
- neuron
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Abstract
The present invention relates to a clustering-based information evolution analysis method for large-scale dynamic short text. First, drawing on the neuron representation used in self-organizing clustering algorithms, document categories are represented by neurons. The neurons representing categories are then distributed evenly across machines, so that each machine holds a small-scale local neuron set. Next, based on the idea of iterative adjustment, local parallel adjustment is applied to the category partition; after several rounds of local parallel adjustment, one global synchronization adjustment is performed, completing fast clustering of massive network data. Finally, by analyzing and comparing changes of the clustering model across different time periods, the evolution of the different information contained in the short text data is obtained. By iterating the operations "local parallel adjustment" and "global synchronization adjustment", the present invention fuses feature selection and category partitioning, realizes fast clustering of large-scale dynamic network short text, and greatly improves operational efficiency.
Description
【Technical field】
The invention belongs to the field of social-network data mining, and more particularly relates to a clustering-based
information evolution analysis method for large-scale dynamic short text.
【Background technology】
With the arrival of the Web 2.0 era, the Internet industry is undergoing a huge change. Centered on social
networks, with the "microblog" as a typical representative, virtual communication platforms dedicated to networked
interpersonal interaction, leisure and entertainment, business marketing, study and discussion, and a series of other
activities won the enthusiasm of users as soon as they appeared.
A social network is a dynamic platform whose data are constantly updated. If the evolution of the different
information contained in the dynamic data can be obtained (analyzing which information is no longer of interest to
users, which information continues to hold users' attention, and which newest information users care about), one can
first grasp the overall trend of change in users' attention; one can also chart the development trend of information
and predict its direction of evolution, directing limited manpower and material resources to the focus of users'
attention and correctly guiding public opinion. By comparing the evolution of different information, the large
population of Internet users can also quickly find the information that interests them in the immense ocean of
information.
Popular information analysis problems today include "public opinion analysis", "hot spot discovery", "topic
evolution" and "hot spot tracking". The starting point of public opinion analysis and hot spot discovery is
"short-term analysis", that is, analyzing the information that bursts out in a short, concentrated period. Compared
with them, information evolution analysis emphasizes "long-term analysis": by comparing dynamic data over different
time periods, it derives the development trend of the information contained in the data. Topic evolution and hot-spot
tracking also process dynamic data, but they are mostly confined to the development trend of one or a few topics,
whereas information evolution analysis aims to present the overall change of information.
" news " or " blog " data are different from, the data being widely present in social network are a kind of typical " short
Text ", length are generally less than 140 words (by taking Sina weibo as an example).When text size is too short, it is with " vector space model "
The representation method of representative will produce " high dimension vector is sparse " problem, while be decided by the principal element of similarity between short text not
It is co-occurrence word frequency again, but the semantic similarity between text.Above-mentioned two problems to be widely used in " long text "
Analysis method can not be applied in " short text " analysis.Therefore, only realize that extensive dynamic short essay can be effectively treated in one kind
This clustering method, can cope with the arrival in Web2.0 epoch to huge caused by traditional text analysis method well
Big challenge.
【Invention content】
To solve the above technical problem, the present invention provides a method that introduces the idea of
parallelization, fuses feature selection and category partitioning through the iterated operations "local parallel
adjustment" and "global synchronization adjustment", and realizes fast clustering of large-scale dynamic short text,
greatly improving operational efficiency. The method further reveals the evolution of different information in the
network with visual tag sets, thereby reflecting the overall trend of users' attention across different time periods.
In order to solve the above technical problems, the present invention adopts the following technical scheme:
A clustering-based information evolution analysis method for large-scale dynamic short text: first, drawing on the
neuron representation in self-organizing clustering algorithms, document categories are represented by neurons; the
neurons representing categories are then distributed evenly across machines, so that each machine holds a small-scale
local neuron set; next, based on the iterative adjustment idea of self-organizing clustering, local parallel
adjustment is applied to the category partition; after several rounds of local parallel adjustment, one global
synchronization adjustment is performed, completing fast clustering of massive short text data; finally, on this
basis, by analyzing and comparing changes of the clustering model across different time periods, the evolution of
the different information contained in the short text data is obtained.
Further, the "local parallel adjustment of the category partition" specifically includes the following steps:
A1. Using the distributed word clustering method, randomly select a document from the short text data set to be clustered; denote it d_i;
A2. Using the iterative semantic similarity calculation method, compute the similarity between d_i and each neuron in the local neuron set on the current machine, and choose the neuron with the greatest similarity to d_i; denote it n_j;
A3. Adjust the feature weights in n_j, and use the iterative semantic similarity calculation method to find the neuron most similar to n_j in the local neuron set; denote it n_b;
A4. Detect whether an edge exists between n_j and n_b; if no edge exists, create one to connect them; denote the edge between n_j and n_b as l_jb;
A5. Update the weight of l_jb, and reset the update-time parameter of l_jb to 0;
A6. Add 1 to the update-time parameter of every edge between neurons in the local neuron set;
A7. Inspect all the above edges; if an edge's update-time parameter exceeds the average over all edges, delete that edge, and set the iteration count t = t + 1;
A8. Measure the average distance from the short texts to their cluster centers; when this distance falls below the convergence threshold u of the clustering process, stop clustering and enter the clustering model quantization process; otherwise judge whether t is an integral multiple of m: if "yes", go to the global synchronization adjustment step; if "no", return to the start.
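Steps A1–A8 can be sketched as a single per-machine loop. This is an illustrative sketch rather than the patented implementation: neurons and documents are sparse term-weight dictionaries, cosine similarity stands in for the iterative semantic similarity calculation method, and the learning rate `lr`, threshold `u`, and period `m` are assumed parameters.

```python
import random
from math import sqrt

def similarity(a, b):
    """Cosine similarity over sparse term-weight dicts; a stand-in for
    the patent's iterative semantic similarity calculation method."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = sqrt(sum(w * w for w in a.values()))
    nb = sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def local_parallel_adjust(docs, neurons, u=0.05, m=5, lr=0.1, max_iter=100):
    """One machine's local parallel adjustment (steps A1-A8, sketched).

    docs, neurons: lists of sparse term-weight dicts; needs >= 2 neurons.
    Returns ("converged", t), ("global_sync", t), or ("stopped", t).
    """
    edges = {}  # (i, j) -> update-time parameter (age)
    t = 0
    while t < max_iter:
        d = random.choice(docs)                                        # A1
        j = max(range(len(neurons)),
                key=lambda i: similarity(d, neurons[i]))               # A2
        for term, w in d.items():                                      # A3
            neurons[j][term] = neurons[j].get(term, 0.0) + lr * w
        b = max((i for i in range(len(neurons)) if i != j),
                key=lambda i: similarity(neurons[j], neurons[i]))
        key = (min(j, b), max(j, b))
        edges[key] = 0                                                 # A4, A5
        for e in edges:                                                # A6
            if e != key:
                edges[e] += 1
        avg = sum(edges.values()) / len(edges)                         # A7
        edges = {e: age for e, age in edges.items() if age <= avg}
        t += 1
        mean_dist = sum(1.0 - max(similarity(doc, n) for n in neurons)
                        for doc in docs) / len(docs)                   # A8
        if mean_dist < u:
            return "converged", t
        if t % m == 0:
            return "global_sync", t
    return "stopped", t
```

In the full method many such loops run in parallel, one per machine, and the `"global_sync"` return value hands control to the global synchronization adjustment.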
Further, the distributed word clustering method takes the mutual information theory of information theory as its
basis, and selects the word-class partition that minimizes information loss as the word clustering result. In
information theory, when one variable is used to encode another, the amount of information I transmitted is
calculated by the following formula:
I(V; N) = Σ_{t=1..g} π_t Σ_{i=1..l} p(v_i | n_t) log( p(v_i | n_t) / p(v_i) )
p(n_t) = f(n_t)/g;  p(v_i | n_t) = f(v_i, n_t)/f(n_t);
In the formula, V and N respectively represent sets of word variables, and l and g represent their sizes; I(V; N)
represents the amount of information transmitted when the variables in N are used to encode the variables in V; n_t
and v_i respectively represent variables in N and V; π_t represents the occurrence probability of variable n_t;
p(n_t) and p(v_i | n_t) respectively represent the probability that noun n_t appears in the text and the
co-occurrence probability of verb v_i with noun n_t in the text; f(n_t) and f(v_i, n_t) respectively represent the
frequency of n_t and the frequency with which v_i and n_t appear together.
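The transmitted information I(V; N) can be computed directly from co-occurrence counts. In this sketch the noun probabilities π_t are normalized by the total number of noun occurrences so that they sum to 1 (an assumption; the claim's f(n_t)/g normalization leaves g ambiguous), and the logarithm is taken base 2:

```python
from math import log

def mutual_information(cooc):
    """I(V;N) from verb-noun co-occurrence counts.

    cooc: dict mapping (verb, noun) -> count f(v_i, n_t).
    Assumes pi_t = f(n_t) / total noun occurrences, so the pi_t sum to 1.
    """
    f_n, f_v, total = {}, {}, 0
    for (v, n), c in cooc.items():
        f_n[n] = f_n.get(n, 0) + c
        f_v[v] = f_v.get(v, 0) + c
        total += c
    info = 0.0
    for (v, n), c in cooc.items():
        p_vn = c / f_n[n]            # p(v_i | n_t)
        pi_t = f_n[n] / total        # pi_t
        p_v = f_v[v] / total         # marginal p(v_i)
        info += pi_t * p_vn * log(p_vn / p_v, 2)
    return info
```

When verbs and nouns co-occur independently the transmitted information is 0; when each noun fully determines its verb it reaches its maximum.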
Further, the iterative semantic similarity calculation method realizes iterative semantic similarity calculation
with a gradient-descent iterative process, and specifically includes the following steps:
B1. Initialization: let V and N respectively represent the verb and noun sets, l and g represent their sizes, and k represent the number of word classes specified by the user;
B2. Determine class membership: re-determine the class membership of n_t;
B3. Determine word classes: the nouns mapped to the j-th class form the word class N_j;
N_j = { n_t : j*(n_t) = j }
B4. Determine class centers: recompute the class center for each word class;
B5. Iterate: repeat steps B2–B4 until the assignment of words among the classes no longer changes.
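Steps B1–B5 can be sketched as follows, with two stated assumptions: initialization is deterministic round-robin rather than the random splitting and merging of step B1, and the re-assignment of step B2 uses the KL distance that the description later adopts for distributional similarity.

```python
from math import log

def kl(p, q, eps=1e-12):
    """KL distance between two verb distributions of equal length."""
    return sum(pi * log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def iterate_word_classes(profiles, k, max_iter=50):
    """Steps B1-B5, sketched.

    profiles: dict mapping each noun n_t to its verb distribution
    p(v|n_t), a list summing to 1. Returns noun -> class id.
    """
    nouns = sorted(profiles)
    dim = len(profiles[nouns[0]])
    assign = {n: i % k for i, n in enumerate(nouns)}      # B1 (simplified)
    for _ in range(max_iter):
        centers = []                                      # B4: class centers
        for j in range(k):
            members = [profiles[n] for n in nouns if assign[n] == j]
            if not members:
                centers.append([1.0 / dim] * dim)         # reseed empty class
            else:
                centers.append([sum(col) / len(members)
                                for col in zip(*members)])
        new = {n: min(range(k),                           # B2-B3: reassign
                      key=lambda j: kl(profiles[n], centers[j]))
               for n in nouns}
        if new == assign:                                 # B5: converged
            break
        assign = new
    return assign
```

Nouns with similar verb distributions end up in the same word class, which is exactly the "distributionally similar words agglomerated into one class" behavior the method relies on.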
Further, the "global synchronization adjustment" adjusts the feature weights in the neurons according to the
distributions of the features, so as to raise clustering accuracy, and specifically includes the following steps:
C1. Compute the within-class distribution of each feature over all neurons in the neuron set N;
C2. Compute the between-class distribution of each feature over all neurons in the neuron set N;
C3. Merge each feature's within-class and between-class distributions to obtain its weight;
C4. Sort the features by weight, select the features whose weight exceeds the current feature-selection threshold as the representative features of the neurons, and then go to the local parallel adjustment step.
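Steps C1–C4 admit a simple illustration. The concrete statistics below are assumptions, since the claim does not fix them: the within-class distribution is taken as a feature's variance inside each neuron's documents, the between-class distribution as the variance of its per-neuron means, and the merge rule of step C3 as their ratio.

```python
def feature_weights(clusters, eps=1e-9):
    """Steps C1-C3, sketched. clusters: list of neurons, each a list of
    documents (feature -> value dicts). Weight = between-class variance
    over mean within-class variance (an assumed merge rule)."""
    feats = {f for docs in clusters for d in docs for f in d}
    weights = {}
    for f in feats:
        means, within = [], []
        for docs in clusters:
            vals = [d.get(f, 0.0) for d in docs]
            mu = sum(vals) / len(vals)
            means.append(mu)
            within.append(sum((v - mu) ** 2 for v in vals) / len(vals))  # C1
        gm = sum(means) / len(means)
        between = sum((m - gm) ** 2 for m in means) / len(means)         # C2
        weights[f] = between / (sum(within) / len(within) + eps)         # C3
    return weights

def select_features(weights, threshold):
    """Step C4: keep features whose weight exceeds the selection
    threshold, sorted by descending weight."""
    return sorted((f for f, w in weights.items() if w > threshold),
                  key=lambda f: -weights[f])
```

A feature that is uniform across all neurons (a stop-word-like "the") scores near 0 and is filtered out, while a feature that separates neurons scores high, matching the claim's goal of discarding features that cannot distinguish categories.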
Further, the clustering information evolution analysis method also uses a dynamic topological structure to carry out
the local parallel adjustment and the global synchronization adjustment. The dynamic topological structure connects
different neurons by edges, and then, as the short text changes, dynamically inserts and deletes edges between
neurons to adjust the neurons.
Further, "analyzing and comparing changes of the clustering model across different time periods to obtain the
evolution of the different information contained in the short text data" is specifically: using a grid to quantify
the magnitude of change of the clustering model across different time periods, and using labels to reveal the
changed content of the clustering model across different time periods, so as to quantify the evolution of the
different information contained in the massive short text data.
Further, "using a grid to quantify the magnitude of change of the clustering model across different time periods
and using labels to reveal the changed content of the clustering model across different time periods, so as to
quantify the evolution of the different information contained in the massive short text data" specifically includes
the following steps:
D1. Suppose the network formed by the short text data in period t1 is Gt1, and the network formed by the short text data in period t2 is Gt2.
D2. Compute the dense grids DGt1 and DGt2, store them in Gt1 and Gt2, and obtain the following three subsets: DGt1-DGt2, DGt2-DGt1, DGt1∩DGt2;
where DGt1-DGt2 represents the information contained only in the short text data of period t1, DGt2-DGt1 represents
the information contained only in the short text data of period t2, and DGt1∩DGt2 represents the information that
remains unchanged between periods t1 and t2.
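Step D2's three subsets are plain set operations. In this minimal sketch the dense grids are simply sets of information labels; how grid density is computed is outside the scope of this step:

```python
def evolution_partition(dg_t1, dg_t2):
    """Partition two periods' dense grids (step D2) into vanished,
    newly emerged, and persistent information."""
    return {
        "vanished": dg_t1 - dg_t2,     # DGt1 - DGt2: only in period t1
        "emerged": dg_t2 - dg_t1,      # DGt2 - DGt1: only in period t2
        "persistent": dg_t1 & dg_t2,   # DGt1 ∩ DGt2: unchanged t1 -> t2
    }
```

The sizes of the three subsets quantify the magnitude of change between the two periods, and their member labels are what the method displays to reveal the changed content.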
The beneficial effects of the invention are as follows:
Through the above technical scheme, the present invention introduces the idea of parallelization, fuses feature
selection and category partitioning through the iterated operations "local parallel adjustment" and "global
synchronization adjustment", and realizes fast clustering of large-scale dynamic network short text, rapidly finding
optimized category-representative features and the best category partition, greatly improving operational
efficiency. It also reveals the evolution of different information in the network with visual tag sets, thereby
reflecting the overall trend of users' attention across different time periods, which can guide the formulation of
relevant government or enterprise policies; it can further chart the development trend of information, directing
limited manpower and material resources to the focus of users' attention and correctly guiding public opinion. By
comparing the evolution of different information, the large population of Internet users can also quickly find the
information that interests them in the immense ocean of information and grasp its process of development and change.
【Description of the drawings】
Fig. 1 is a flow diagram of the clustering information evolution analysis method for large-scale dynamic short text
described in the embodiment of the present invention;
Fig. 2 is a flow diagram of the local parallel adjustment in the method described in the embodiment of the present
invention;
Fig. 3 is a flow diagram of the iterative semantic similarity calculation method in the method described in the
embodiment of the present invention;
Fig. 4 is a flow diagram of the global synchronization adjustment in the method described in the embodiment of the
present invention;
Fig. 5 is a flow diagram of the edge insertion and deletion approach of the local parallel adjustment in the method
described in the embodiment of the present invention;
Fig. 6 is a flow diagram of quantifying the evolution of massive network data in the method described in the
embodiment of the present invention.
【Specific implementation mode】
In order to make the purpose, technical scheme and advantages of the present invention clearer, the present
invention is further elaborated below with reference to the accompanying drawings and embodiments. It should be
appreciated that the specific examples described herein are only used to explain the present invention, not to
limit it.
As shown in Figure 1 to Figure 3:
The present invention provides a clustering-based information evolution analysis method for large-scale dynamic
short text. First, drawing on the neuron representation in self-organizing clustering algorithms, document
categories are represented by neurons; the neurons representing categories are then distributed evenly across
machines, so that each machine holds a small-scale local neuron set; next, based on the iterative adjustment idea of
self-organizing clustering, local parallel adjustment is applied to the category partition; after several rounds of
local parallel adjustment, one global synchronization adjustment is performed, completing fast clustering of massive
short text data; finally, on this basis, by analyzing and comparing changes of the clustering model across different
time periods, the evolution of the different information contained in the short text data is obtained.
The "local parallel adjustment of the category partition" specifically includes the following steps:
Step A1. Using the distributed word clustering method, randomly select a document from the short text data set to be clustered (such as "microblogs"); denote it d_i;
Step A2. Using the iterative semantic similarity calculation method, compute the similarity between d_i and each neuron in the local neuron set on the current machine, and choose the neuron with the greatest similarity to d_i; denote it n_j;
Step A3. Adjust the feature weights in n_j, and use the iterative semantic similarity calculation method to find the neuron most similar to n_j in the local neuron set; denote it n_b;
Step A4. Detect whether an edge exists between n_j and n_b; if no edge exists, create one to connect them; denote the edge between n_j and n_b as l_jb;
Step A5. Update the weight of l_jb, and reset the update-time parameter of l_jb to 0;
Step A6. Add 1 to the update-time parameter of every edge between neurons in the local neuron set;
Step A7. Inspect all the above edges; if an edge's update-time parameter exceeds the average over all edges, delete that edge, and set the iteration count t = t + 1;
Step A8. Measure the average distance from the short texts to their cluster centers; when this distance falls below the convergence threshold u of the clustering process, stop clustering and enter the clustering model quantization process; otherwise judge whether t is an integral multiple of m: if "yes", go to the global synchronization adjustment step; if "no", return to the start.
Here, the distributed word clustering method takes the mutual information theory of information theory as its basis
and selects the word-class partition that minimizes information loss as the word clustering result. In information
theory, when one variable is used to encode another, the amount of information I transmitted is calculated by the
following formula:
I(V; N) = Σ_{t=1..g} π_t Σ_{i=1..l} p(v_i | n_t) log( p(v_i | n_t) / p(v_i) )
In the formula, V and N respectively represent sets of word variables, and l and g represent their sizes; I(V; N)
represents the amount of information transmitted when the variables in N are used to encode the variables in V; n_t
and v_i respectively represent variables in N and V; π_t represents the occurrence probability of variable n_t, and
the probabilities of all variables sum to 1, i.e.: Σ_{t=1..g} π_t = 1.
In text clustering, the words that can represent a text's topic are usually nouns and verbs. Therefore the present
invention extracts only the nouns and verbs in the short text as short text features, and clusters the nouns and
verbs. The above V and N respectively represent the verb and noun sets; p(n_t) and p(v_i | n_t) respectively
represent the probability that noun n_t appears in the text and the co-occurrence probability of verb v_i with noun
n_t in the text. Assuming f(n_t) and f(v_i, n_t) respectively represent the frequency of n_t and the frequency with
which v_i and n_t appear together, then:
p(n_t) = f(n_t)/g;  p(v_i | n_t) = f(v_i, n_t)/f(n_t)
If N is compressed, that is, divided into k classes NC = {N_1, N_2, ..., N_k} with N_1 ∪ N_2 ∪ ... ∪ N_k = N, then
when V is encoded again using NC, by information theory this coding incurs an information loss. Distributed word
clustering follows the "information loss minimization principle" of information theory: the category partition
corresponding to the minimum loss is regarded as the best word clustering result.
From the above analysis, the objective function of distributed word clustering is obtained; the word-class partition
corresponding to the minimum of this function is the optimal word clustering result, and when this word-class set is
used for encoding, the amount of information lost is minimal:
Loss = I(V; N) − I(V; NC)
Here, π(N_j) represents the occurrence probability of word class N_j, calculated from the probabilities of the words
included in the class, and p(v_i | N_j) represents the co-occurrence probability of word class N_j with verb v_i,
computed as follows:
π(N_j) = Σ_{n_t ∈ N_j} π_t;  p(v_i | N_j) = Σ_{n_t ∈ N_j} (π_t / π(N_j)) · p(v_i | n_t)
Removing the common factor p(v_i), the loss can be written as a π_t-weighted sum of KL distances from each word's
verb distribution to that of its class. This objective function is similar to the objective function of the K-means
algorithm; the main difference is that K-means uses "Euclidean distance" as its similarity measure. Since "Euclidean
distance" cannot measure the similarity of probability distributions, this patent uses the KL distance to calculate
the distributional similarity between words; in addition, the KL distance also handles well the problem that the
classes in K-means are distributed on a "hypersphere".
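The information loss described above can be written out directly: each noun contributes its prior π_t times the KL distance from its verb distribution p(v|n_t) to its class distribution p(v|N_j). The sketch below assumes natural logarithms and an ε-smoothing term to avoid log 0:

```python
from math import log

def kl(p, q, eps=1e-12):
    """KL distance between two verb distributions of equal length."""
    return sum(pi * log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def clustering_loss(profiles, priors, assign, k):
    """Information loss I(V;N) - I(V;NC) of a word-class partition,
    written as the pi_t-weighted KL distance from each noun's verb
    distribution to its class distribution.

    profiles: noun -> p(v|n_t) (list); priors: noun -> pi_t;
    assign: noun -> class id in range(k).
    """
    dim = len(next(iter(profiles.values())))
    loss = 0.0
    for j in range(k):
        members = [n for n in profiles if assign[n] == j]
        if not members:
            continue
        pj = sum(priors[n] for n in members)                       # pi(N_j)
        center = [sum(priors[n] * profiles[n][i] for n in members) / pj
                  for i in range(dim)]                             # p(v|N_j)
        loss += sum(priors[n] * kl(profiles[n], center) for n in members)
    return loss
```

A partition that groups distributionally similar nouns has near-zero loss, while mixing dissimilar nouns into one class raises it, which is what makes the loss usable as the clustering objective.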
Through the distributed word clustering method, the present invention agglomerates similar words into word classes
to generate a "word-class representation model"; the model maps multiple dimensions of the "vector space model" into
a single dimension, mitigating the "sparse high-dimensional vector" problem.
The iterative semantic similarity calculation method realizes iterative semantic similarity calculation with a
gradient-descent iterative process, and specifically includes the following steps:
Step B1. Initialization: let V and N respectively represent the verb and noun sets, l and g represent their sizes, and k represent the number of word classes specified by the user. The noun set is first taken as 1 initial class, and the class of noun n_t is determined by j*(n_t) = argmax_j p(v_j | n_t). If k > 1, the initial class is split at random until the number of classes is k; otherwise word classes are merged at random until the number of classes is k;
Step B2. Determine class membership: re-determine the class membership of n_t as the class whose distribution is nearest, j*(n_t) = argmin_j KL( p(v | n_t) ∥ p(v | N_j) );
Step B3. Determine word classes: the nouns mapped to the j-th class form the word class N_j;
N_j = { n_t : j*(n_t) = j }
Step B4. Determine class centers: recompute the class center for each word class, p(v | N_j) = Σ_{n_t ∈ N_j} (π_t / π(N_j)) · p(v | n_t);
Step B5. Iterate: repeat steps B2–B4 until the assignment of words among the classes no longer changes.
" semantic similarity calculation method of iteration " of the present invention is similar between word and word by calculating for iteration
Similarity seeks an equalization point between the two between degree, document and document, and using the corresponding result of this equalization point as
Best similarity calculation result.
" the global synchronization adjustment " is the weights of feature in the distribution adjustment neuron according to feature to improve cluster
Accuracy rate specifically includes following steps:
Step C1. calculates the distribution within class of feature in all neurons in neuronal ensemble N;
Step C2. calculates the distribution between class of feature in all neurons in neuronal ensemble N;
Step C3. merges the distribution within class of feature and distribution between class obtains the weights of feature;
Step C4. is ranked up feature by the weights of feature, and weights is selected to be more than the spy that current signature selects threshold value
The representative feature as neuron is levied, is transferred to local parallel set-up procedure later.
" the global synchronization adjustment " goes actively to select classification by the distribution (distribution within class and distribution between class) of feature
Feature is represented with filtering characteristic, solve local parallel and adjust insurmountable classification represent may include in feature cannot be effective
The problem of distinguishing the uncorrelated features of classification, to promote the accuracy rate of category division in global scope.
The step "analyzing and comparing changes of the clustering model across different time periods to obtain the
evolution of the different information contained in the short text data" of the present invention is specifically:
using a grid to quantify the magnitude of change of the clustering model across different time periods, and using
labels to reveal the changed content of the clustering model across different time periods, so as to quantify the
evolution of the different information contained in the massive network data (such as microblog data).
In addition, the clustering information evolution analysis method of the present invention also uses a dynamic
topological structure: different neurons are connected by edges, and then, as the short text changes, edges between
neurons are dynamically inserted and deleted to adjust the topological structure of the neurons. That is, two
parameters are set on each edge between neurons to control edge insertion and deletion, namely the edge weight and
the edge update time. The edge weight indicates the tightness of the correlation between neurons; the edge update
time records the time elapsed since the edge was last updated. The edge weight w_ij is calculated as follows:
w_ij = Sim(n_i, n_j)
The above formula gives the calculation of the weight w_ij of the edge between neurons n_i and n_j, where
Sim(n_i, n_j) represents the similarity of n_i and n_j, which can be obtained from the Euclidean distance.
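The text leaves the exact form of Sim(n_i, n_j) open beyond "obtained from the Euclidean distance"; one common mapping, assumed here, turns the distance d into a similarity 1/(1 + d), so that identical neurons get weight 1 and the weight decays as the neurons drift apart:

```python
from math import sqrt

def edge_weight(n_i, n_j):
    """w_ij = Sim(n_i, n_j) for two neurons given as sparse feature
    dicts. The similarity is derived from the Euclidean distance via
    1/(1+d) -- an assumed mapping; the patent only states that the
    Euclidean distance is used."""
    feats = set(n_i) | set(n_j)
    d = sqrt(sum((n_i.get(f, 0.0) - n_j.get(f, 0.0)) ** 2 for f in feats))
    return 1.0 / (1.0 + d)
```

Under this mapping the weight is bounded in (0, 1], which makes it directly usable as a tightness-of-correlation score on the edges of the dynamic topology.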
Both the aforementioned local parallel adjustment and global synchronization adjustment operate on the dynamic
topological structure; the difference between the two is that the local parallel adjustment step is supplemented
with the process of inserting and deleting edges, so as to dynamically change the topological structure of the
neurons, adding the following steps:
Step D1. After the local parallel adjustment step has adjusted n_j, find the neuron most similar to n_j in the local neuron set; denote it n_b;
Step D2. Detect whether an edge exists between n_j and n_b; if no edge exists, create one to connect them; denote the edge between n_j and n_b as l_jb;
Step D3. Update the weight of l_jb, and reset the update-time parameter of l_jb to 0;
Step D4. Add 1 to the update-time parameter of every edge between neurons in the local neuron set;
Step D5. Inspect all the above edges; if an edge's update-time parameter exceeds the average over all edges, delete it.
The step "using a grid to quantify the magnitude of change of the clustering model across different time periods and
using labels to reveal the changed content of the clustering model across different time periods, so as to quantify
the evolution of the different information contained in the massive short text data" of the present invention
specifically includes the following steps:
Step E1. Suppose the network formed by the short text data in period t1 is Gt1, and the network formed by the short text data in period t2 is Gt2.
Step E2. Compute the dense grids DGt1 and DGt2, store them in Gt1 and Gt2, and obtain the following three subsets: DGt1-DGt2, DGt2-DGt1, DGt1∩DGt2;
where DGt1-DGt2 represents the information contained only in the short text data of period t1, DGt2-DGt1 represents
the information contained only in the short text data of period t2, and DGt1∩DGt2 represents the information that
remains unchanged between periods t1 and t2.
On the basis of realizing clustering of large-scale dynamic short text, the clustering information evolution
analysis method for large-scale dynamic short text of the present invention performs information evolution analysis
on the large-scale dynamic short text data in social networks, and reveals the evolution of different information in
the network with visual tag sets, thereby reflecting the overall trend of users' attention across different time
periods.
The clustering information evolution analysis method for large-scale dynamic short text of the present invention has the following characteristics:
1. As the scale of short texts grows, their "high-dimensional, sparse" characteristic becomes more pronounced, driving the similarity between large numbers of short texts toward 0 and thus limiting the performance of traditional text clustering algorithms. To address this, the present invention uses mutual information from information theory to realise "distributed term clustering", which gathers distributionally similar words into word classes, alleviating the "high-dimensional, sparse" characteristic of short texts. This not only describes text information more accurately, yielding better clustering results, but also achieves a low-dimensional, non-sparse representation that supports fast and accurate similarity computation.
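The densifying effect of mapping words onto word classes can be illustrated with a toy sketch; the word-to-class table below is an invented example, not a result produced by the patent's method.

```python
# Sketch: representing short texts over word classes instead of words
# turns a sparse high-dimensional vector into a dense low-dimensional
# one, so texts with no shared words can still be compared.
from collections import Counter

# Hypothetical word-class assignment (class 0: football, class 1: voting)
word_class = {"soccer": 0, "football": 0, "goal": 0,
              "vote": 1, "ballot": 1, "election": 1}

def class_vector(tokens, num_classes=2):
    counts = Counter(word_class[t] for t in tokens if t in word_class)
    return [counts[c] for c in range(num_classes)]

# Two short texts with zero word overlap get comparable class vectors:
v1 = class_vector(["soccer", "goal"])   # [2, 0]
v2 = class_vector(["football"])         # [1, 0]
```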
2. "Distributed term clustering" clusters verbs and nouns separately, so it cannot discover relations between verbs and nouns, such as the relation between "kick" and "ball"; that is, beyond narrow "synonymous similarity", it cannot discover similarity in the broader sense. The "iterative semantic similarity calculation method" is therefore used to find similarity between words in this broader sense: it computes word-word similarity and document-document similarity from each other, seeks an equalization point between the two, and takes the result at that equalization point as the best similarity estimate.
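The equalization-point idea can be illustrated with a deliberately simplified fixed-point iteration. The update rules, the 0.7 damping factor, and the three-document toy corpus are all assumptions for illustration, not the patent's formulas.

```python
# Sketch: word-word and document-document similarities are recomputed
# from each other until the values stop changing (an equalization point),
# letting "kick" and "play" become similar via shared contexts with "ball".
docs = {"d1": ["kick", "ball"], "d2": ["play", "ball"], "d3": ["vote"]}
words = sorted({w for ws in docs.values() for w in ws})

wsim = {(a, b): 1.0 if a == b else 0.0 for a in words for b in words}
dsim = {(a, b): 1.0 if a == b else 0.0 for a in docs for b in docs}

def doc_sim(d1, d2, wsim):
    # average best word match between the two documents
    return sum(max(wsim[(w1, w2)] for w2 in docs[d2])
               for w1 in docs[d1]) / len(docs[d1])

def word_sim(w1, w2, dsim):
    # words are similar when they occur in similar documents
    pairs = [(a, b) for a in docs for b in docs
             if w1 in docs[a] and w2 in docs[b]]
    return max(dsim[p] for p in pairs) if pairs else 0.0

for _ in range(10):  # iterate toward the fixed point
    dsim = {(a, b): doc_sim(a, b, wsim) for a in docs for b in docs}
    wsim = {(a, b): 1.0 if a == b else 0.7 * word_sim(a, b, dsim)
            for a in words for b in words}
```

After a few rounds the similarity of "kick" and "play" stabilises at a nonzero value even though they never co-occur, while unrelated documents keep similarity 0.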
3. The present invention fuses feature selection and category division through iteration, so as to progressively determine the best category feature representation and the best category division result. Fast clustering of large-scale dynamic short texts is achieved through the "local parallel adjustment" and "global synchronization adjustment" operations, and performance is further improved by combining the "term clustering representation" and the "iterative semantic similarity calculation method" described above.
4. The present invention quantifies the clustering models formed by the network before and after the short-text data is updated, and displays the quantized result through labels to present the evolution of information, providing intuitive analysis results.
In conclusion present invention introduces parallelization thought, pass through the operation " local parallel adjustment " of iteration and " global synchronization
Feature selecting and category division are fused together by adjustment ", realize the quick clustering to extensive dynamic short text;And this hair
The bright evolutionary process that different information in network are disclosed with visual tag set, reflects the focus of user in difference with this
Overall variation trend in period can provide guidance for the formulation of government or enterprise's relevant policies, secondly can also draw
Limited man power and material is invested user's focus of attention, the trend correctly to guide public opinion by the development trend figure of information;It is vast
Internet user also quickly can have found that its is interested by the evolutionary process of the different information of comparison from immense information ocean
Information, and grasp the development and change process of the information.
The above is a further description of the present invention in combination with specific preferred technical solutions, but the specific implementation of the present invention is not limited to these descriptions. For those of ordinary skill in the art to which the present invention belongs, simple deductions or substitutions made without departing from the concept of the present invention shall all be regarded as falling within the protection scope of the present invention.
Claims (7)
1. A clustering information evolution analysis method for large-scale dynamic short texts, which first combines the neuron representation of the self-organizing clustering algorithm, representing document categories by neurons; then distributes the neurons representing categories evenly across the machines, so that each machine holds a small-scale local neuron set; then performs local parallel adjustment of the category division result, based on the iterative adjustment idea of the self-organizing clustering algorithm; then performs one global synchronization adjustment after every several local parallel adjustments, thereby completing fast clustering of massive short-text data; and finally, on this basis, obtains the evolution of the different pieces of information contained in the short-text data by analysing and comparing how the clustering model changes across different time periods; characterized in that the said "local parallel adjustment of the category division result" specifically comprises the following steps:
A1. Randomly select a document, denoted di, from the short-text data set to be clustered, which is represented using the distributed term clustering method;
A2. Use the iterative semantic similarity calculation method to compute the similarity between di and each neuron in the local neuron set on the current machine, and select the neuron with maximum similarity to di, denoted nj;
A3. Adjust the feature weights in nj, and use the iterative semantic similarity calculation method to find the neuron most similar to nj in the local neuron set, denoted nb;
A4. Detect whether an edge exists between nj and nb; if no edge exists, create one to connect them; denote the edge between nj and nb as ljb;
A5. Update the weight of ljb, and set the update-time parameter of ljb to 0;
A6. Add 1 to the update-time parameter of every edge between neurons in the local neuron set;
A7. Detect all of the above edges; if an edge's update-time parameter exceeds the average over all edges, delete that edge; then set the iteration count t = t + 1;
A8. Detect the average distance from the short texts to their cluster centres; when this distance is below the convergence threshold u of the clustering process, stop the clustering process and enter the clustering model quantization process; otherwise enter the global synchronization adjustment step or return to the start.
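Steps A1 to A7 can be sketched as a single adjustment pass. This is hypothetical code: plain cosine similarity stands in for the iterative semantic similarity calculation, `lr` is an assumed learning rate, and neurons are simple feature-weight dictionaries.

```python
# Condensed sketch of one local-parallel-adjustment pass (A1-A7).
import math
import random

def cosine(d, n):
    common = set(d) & set(n)
    num = sum(d[k] * n[k] for k in common)
    den = (math.sqrt(sum(v * v for v in d.values()))
           * math.sqrt(sum(v * v for v in n.values())))
    return num / den if den else 0.0

def local_adjust(docs, neurons, edges, ages, lr=0.1):
    doc = random.choice(docs)                            # A1: pick a document
    sims = [cosine(doc, n) for n in neurons]
    j = max(range(len(neurons)), key=lambda i: sims[i])  # A2: winner n_j
    for k, v in doc.items():                             # A3: adjust n_j's weights
        neurons[j][k] = neurons[j].get(k, 0.0) + lr * (v - neurons[j].get(k, 0.0))
    b = max((i for i in range(len(neurons)) if i != j),
            key=lambda i: cosine(neurons[j], neurons[i]))  # A3: nearest neuron n_b
    edge = frozenset((j, b))
    edges.add(edge)                                      # A4: ensure edge l_jb exists
    ages[edge] = 0                                       # A5: reset its age
    for e in edges:                                      # A6: age the other edges
        if e != edge:
            ages[e] = ages.get(e, 0) + 1
    avg = sum(ages.values()) / len(ages)                 # A7: prune stale edges
    for e in [e for e in edges if ages[e] > avg]:
        edges.discard(e)
        ages.pop(e)
    return j
```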
2. The clustering information evolution analysis method for large-scale dynamic short texts according to claim 1, characterized in that the distributed term clustering method takes the mutual information theory of information theory as its basis and selects, as the term clustering result, the word-class division that minimizes the information loss; when one variable is encoded by another variable in information theory, the transmitted information content I is calculated by the following formula:
I(V;N) = Σt πt Σi p(vi|nt) log( p(vi|nt) / p(vi) );
πt = p(nt);
p(nt) = f(nt)/g;  p(vi|nt) = f(vi, nt)/f(nt);
where V and N represent the word variable sets and l and g their sizes; I(V;N) represents the information content transmitted when the variables in V are encoded by the variables in N; nt and vi represent the variables in N and V respectively; πt represents the occurrence probability; p(nt) and p(vi|nt) represent, respectively, the probability that the noun nt occurs in the text and the probability that the verb vi co-occurs with the noun nt in the text; and f(nt) and f(vi, nt) represent, respectively, the frequency of nt and the frequency with which vi and nt occur together.
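Under the definitions above, the transmitted information content I(V;N) can be computed directly from verb-noun co-occurrence counts, as in this sketch. The frequencies are toy data, and the probabilities here are normalised by the total co-occurrence count rather than by g, an assumption made for the illustration.

```python
# Sketch: mutual information I(V;N) from co-occurrence counts f(v_i, n_t).
import math

# Toy verb-noun co-occurrence frequencies f(v_i, n_t)
f = {("kick", "ball"): 8, ("throw", "ball"): 2,
     ("kick", "goal"): 1, ("cast", "vote"): 9}

total = sum(f.values())
f_n, f_v = {}, {}           # marginal counts f(n_t), f(v_i)
for (v, n), c in f.items():
    f_n[n] = f_n.get(n, 0) + c
    f_v[v] = f_v.get(v, 0) + c

def mutual_information():
    """I(V;N) = sum_t p(n_t) sum_i p(v_i|n_t) log( p(v_i|n_t) / p(v_i) )."""
    mi = 0.0
    for (v, n), c in f.items():
        p_n = f_n[n] / total          # p(n_t)
        p_v_given_n = c / f_n[n]      # p(v_i | n_t)
        p_v = f_v[v] / total          # p(v_i)
        mi += p_n * p_v_given_n * math.log(p_v_given_n / p_v)
    return mi

mi = mutual_information()   # positive: verbs carry information about nouns
```

Merging two nouns into one class can only lower this quantity; the method keeps the division whose loss of I is smallest.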
3. The clustering information evolution analysis method for large-scale dynamic short texts according to claim 2, characterized in that the iterative semantic similarity calculation method realises the iterative semantic similarity computation through a gradient-descent iterative process, specifically comprising the following steps:
B1. Initialisation: let V and N represent the verb set and the noun set, l and g their sizes, and k the number of word classes specified by the user;
B2. Determine category membership: re-determine the category membership j*(nt) of each nt according to the corresponding formula;
B3. Determine word classes: form the word class Nj from the nouns mapped to the j-th category:
Nj = {nt : j*(nt) = j}
B4. Determine word-class centres: recompute the class centre of each word class according to the corresponding formula;
B5. Iteration: repeat steps B2 to B4 until the distribution of words among the word classes no longer changes.
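Since the per-step formulas of B2 and B4 are not reproduced in this text, the B1-B5 loop can be approximated by a k-means-style iteration; the context vectors and squared-distance assignment below are illustrative assumptions.

```python
# Sketch of the B1-B5 loop: assign each noun to the nearest class centre
# (B2-B3), recompute the centres (B4), and repeat until the assignment
# stabilises (B5).

def cluster_words(vectors, k, max_iter=100):
    nouns = sorted(vectors)
    centres = [vectors[nouns[i]] for i in range(k)]      # B1: initialise
    assign = {}
    for _ in range(max_iter):                            # B5: iterate
        new_assign = {}
        for n in nouns:                                  # B2: nearest centre
            dists = [sum((a - b) ** 2 for a, b in zip(vectors[n], c))
                     for c in centres]
            new_assign[n] = dists.index(min(dists))
        if new_assign == assign:                         # B5: converged
            break
        assign = new_assign
        for j in range(k):                               # B3-B4: recompute centres
            members = [vectors[n] for n in nouns if assign[n] == j]
            if members:
                centres[j] = [sum(x) / len(members) for x in zip(*members)]
    return assign

# Toy context vectors for four nouns:
vectors = {"ball": [1.0, 0.0], "goal": [0.9, 0.1],
           "vote": [0.0, 1.0], "ballot": [0.1, 0.9]}
parts = cluster_words(vectors, k=2)
# "ball"/"goal" end up in one word class, "vote"/"ballot" in the other
```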
4. The clustering information evolution analysis method for large-scale dynamic short texts according to claim 1, characterized in that the said "global synchronization adjustment" adjusts the feature weights in the neurons according to the distribution of the features, so as to improve clustering accuracy, specifically comprising the following steps:
C1. Compute the within-class distribution of the features in all neurons of the neuron set N;
C2. Compute the between-class distribution of the features in all neurons of the neuron set N;
C3. Merge the within-class and between-class distributions of each feature to obtain the feature's weight;
C4. Sort the features by weight, select the features whose weight exceeds the current feature-selection threshold as the representative features of the neurons, and then enter the local parallel adjustment step.
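A hedged sketch of C1 to C4 follows. The concrete definitions used here (within-class = a feature's peak frequency in one neuron, between-class = the number of neurons containing it, weight = within / between) are illustrative assumptions, not the patent's formulas.

```python
# Sketch of the global-synchronization weighting (C1-C4): a feature scores
# high when concentrated in one neuron and rare across the others.

neurons = [  # feature -> frequency, one dict per neuron (toy data)
    {"ball": 9, "news": 1},
    {"vote": 8, "news": 2},
]

def feature_weights(neurons, threshold=1.0):
    features = {f for n in neurons for f in n}
    weights = {}
    for f in features:
        counts = [n.get(f, 0) for n in neurons]
        within = max(counts)                          # C1: peak frequency in one class
        between = sum(1 for c in counts if c > 0)     # C2: spread across classes
        weights[f] = within / between                 # C3: merge into one weight
    selected = {f for f, w in weights.items() if w > threshold}  # C4: threshold
    return weights, selected

weights, selected = feature_weights(neurons)
# "ball" and "vote" are kept as representative features; "news" is not
```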
5. The clustering information evolution analysis method for large-scale dynamic short texts according to any one of claims 1 to 4, characterized in that the method further uses a dynamic topological structure to carry out the local parallel adjustment and the global synchronization adjustment; the dynamic topological structure connects different neurons by edges and then, as the short texts change, dynamically adjusts the neurons by inserting and deleting edges between them.
6. The clustering information evolution analysis method for large-scale dynamic short texts according to any one of claims 1 to 4, characterized in that the said "analysing and comparing how the clustering model changes across different time periods to obtain the evolution of the different pieces of information contained in the short-text data" specifically means: using a grid to quantify the magnitude of change of the clustering model across different time periods, and using labels to reveal the content changes of the clustering model in different time periods, so as to quantify the evolution of the different pieces of information contained in the massive short-text data.
7. The clustering information evolution analysis method for large-scale dynamic short texts according to claim 6, characterized in that the said "using a grid to quantify the magnitude of change of the clustering model across different time periods, and using labels to reveal the content changes of the clustering model in different time periods, so as to quantify the evolution of the different pieces of information contained in the massive short-text data" specifically comprises the following steps:
D1. Let Gt1 be the network formed from the short-text data of time period t1, and Gt2 the network formed from the short-text data of time period t2;
D2. Compute the dense grids DGt1 and DGt2 stored in Gt1 and Gt2, and obtain the following three subsets: DGt1-DGt2, DGt2-DGt1, DGt1∩DGt2;
where DGt1-DGt2 represents the information contained only in the short-text data of period t1, DGt2-DGt1 represents the information contained only in the short-text data of period t2, and DGt1∩DGt2 represents the information that remains unchanged from period t1 to period t2.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310716896.9A CN104731811B (en) | 2013-12-20 | 2013-12-20 | A kind of clustering information evolution analysis method towards extensive dynamic short text |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104731811A CN104731811A (en) | 2015-06-24 |
CN104731811B true CN104731811B (en) | 2018-10-09 |
Family
ID=53455708
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310716896.9A Expired - Fee Related CN104731811B (en) | 2013-12-20 | 2013-12-20 | A kind of clustering information evolution analysis method towards extensive dynamic short text |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104731811B (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105183804B (en) * | 2015-08-26 | 2018-12-28 | 陕西师范大学 | A kind of cluster method of servicing based on ontology |
CN106067029B (en) * | 2016-05-24 | 2019-06-18 | 哈尔滨工程大学 | The entity classification method in data-oriented space |
CN106776748A (en) * | 2016-11-17 | 2017-05-31 | 天津大学 | A kind of solution for harmful information monitoring in social network data |
CN110008334B (en) * | 2017-08-04 | 2023-03-14 | 腾讯科技(北京)有限公司 | Information processing method, device and storage medium |
CN110276375B (en) * | 2019-05-14 | 2021-08-20 | 嘉兴职业技术学院 | Method for identifying and processing crowd dynamic clustering information |
CN114579739B (en) * | 2022-01-12 | 2023-05-30 | 中国电子科技集团公司第十研究所 | Topic detection and tracking method for text data stream |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1808474A (en) * | 2006-03-02 | 2006-07-26 | 哈尔滨工业大学 | Self-organized mapping network based document clustering method |
CN101408893A (en) * | 2008-11-26 | 2009-04-15 | 哈尔滨工业大学 | Method for rapidly clustering documents |
CN101751458A (en) * | 2009-12-31 | 2010-06-23 | 暨南大学 | Network public sentiment monitoring system and method |
EP2639749A1 (en) * | 2012-03-15 | 2013-09-18 | CEPT Systems GmbH | Methods, apparatus and products for semantic processing of text |
CN103390051A (en) * | 2013-07-25 | 2013-11-13 | 南京邮电大学 | Topic detection and tracking method based on microblog data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20181009 Termination date: 20191220 |