CN104731811B - A clustering-based information evolution analysis method for large-scale dynamic short text

A clustering-based information evolution analysis method for large-scale dynamic short text

Info

Publication number
CN104731811B
CN104731811B (application CN201310716896.9A)
Authority
CN
China
Prior art keywords
clustering
short text
information
neuron
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201310716896.9A
Other languages
Chinese (zh)
Other versions
CN104731811A (en)
Inventor
陈蕾
边晓鸿
冯文荣
赵宝瑾
逯登宇
林信惠
李楠
赵丽娜
马冰
马一冰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Normal University Zhuhai
Original Assignee
Beijing Normal University Zhuhai
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Normal University Zhuhai
Priority to CN201310716896.9A priority Critical patent/CN104731811B/en
Publication of CN104731811A publication Critical patent/CN104731811A/en
Application granted granted Critical
Publication of CN104731811B publication Critical patent/CN104731811B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The present invention relates to a clustering-based information evolution analysis method for large-scale dynamic short text. First, the neuron representation of self-organizing clustering algorithms is adopted, so that each document class is represented by a neuron. The neurons representing classes are then distributed evenly across the machines, so that each machine holds a small local neuron set. Next, following the iterative-adjustment idea, the class partition is refined by local parallel adjustment; after several rounds of local parallel adjustment, one global synchronization adjustment is performed, completing the fast clustering of massive network data. Finally, the evolution of the different pieces of information contained in the short text data is obtained by analyzing and comparing the changes of the clustering model across different time periods. By iterating the operations of "local parallel adjustment" and "global synchronization adjustment", the present invention fuses feature selection and class partitioning, realizes fast clustering of large-scale dynamic network short text, and greatly improves operating efficiency.

Description

A clustering-based information evolution analysis method for large-scale dynamic short text
【Technical field】
The invention belongs to the field of social network data mining, and in particular relates to a clustering-based information evolution analysis method for large-scale dynamic short text.
【Background technology】
With the arrival of the Web 2.0 era, the Internet industry is undergoing a profound change. Virtual communication platforms built on social networks, with the microblog as a typical representative, are dedicated to online interpersonal communication and host a wide range of activities such as leisure and entertainment, business ventures, and study discussions; since their appearance they have been eagerly embraced by users.
A social network is a dynamic platform whose data are constantly updated. If the evolution of the information contained in this dynamic data can be obtained (analyzing which information is no longer of interest to users, which information continues to receive attention, and which newly emerging information users care about), one can first grasp the overall trend of user attention; one can also draw the development trend of each piece of information so as to predict its direction of evolution, devote limited manpower and material resources to the focus of user attention, and correctly guide public opinion. By comparing the evolution of different pieces of information, the many Internet users can also quickly find the information that interests them in the vast ocean of information.
Popular information analysis problems today include "public opinion analysis", "hotspot discovery", "topic evolution", and "hotspot tracking". The starting point of public opinion analysis and hotspot discovery is "short-term analysis": they aim to analyze information that bursts out within a short, concentrated period. In contrast, information evolution analysis emphasizes "long-term analysis", comparing dynamic data across different time periods to obtain the development trend of the information contained in the data. Topic evolution and hotspot tracking also process dynamic data, but they are mostly confined to the development trend of one or a few topics, whereas information evolution analysis aims to show how the information changes as a whole.
" news " or " blog " data are different from, the data being widely present in social network are a kind of typical " short Text ", length are generally less than 140 words (by taking Sina weibo as an example).When text size is too short, it is with " vector space model " The representation method of representative will produce " high dimension vector is sparse " problem, while be decided by the principal element of similarity between short text not It is co-occurrence word frequency again, but the semantic similarity between text.Above-mentioned two problems to be widely used in " long text " Analysis method can not be applied in " short text " analysis.Therefore, only realize that extensive dynamic short essay can be effectively treated in one kind This clustering method, can cope with the arrival in Web2.0 epoch to huge caused by traditional text analysis method well Big challenge.
【Invention content】
To solve the above technical problems, the present invention introduces a parallelization idea: through the iterated operations of "local parallel adjustment" and "global synchronization adjustment", feature selection and class partitioning are fused together, realizing a fast clustering method for large-scale dynamic short text that greatly improves operating efficiency; the method further discloses, with a visual tag set, the evolution of the different pieces of information in the network, thereby reflecting the overall trend of user attention across different time periods.
In order to solve the above technical problems, the present invention adopts the following technical scheme that:
A clustering-based information evolution analysis method for large-scale dynamic short text: first, the neuron representation of self-organizing clustering algorithms is adopted, so that each document class is represented by a neuron; the neurons representing classes are then distributed evenly across the machines, so that each machine holds a small local neuron set; next, following the iterative-adjustment idea of self-organizing clustering algorithms, the class partition is refined by local parallel adjustment; after several rounds of local parallel adjustment, one global synchronization adjustment is performed, completing the fast clustering of massive short text data; finally, on this basis, the evolution of the different pieces of information contained in the short text data is obtained by analyzing and comparing the changes of the clustering model across different time periods.
Further, the step of "performing local parallel adjustment on the class partition result" specifically includes the following steps:
A1. Randomly select a document, denoted d_i, from the short text data set to be clustered, and represent it using the distributional term clustering method;
A2. Using the iterative semantic similarity calculation method, compute the similarity between d_i and each neuron in the local neuron set on the current machine, and select the neuron with maximum similarity to d_i; denote it n_j;
A3. Adjust the weights of the features in n_j, and, using the iterative semantic similarity calculation method, find the neuron most similar to n_j in the local neuron set; denote it n_b;
A4. Detect whether an edge exists between n_j and n_b; if no edge exists, create an edge to connect them; denote the edge between n_j and n_b as l_jb;
A5. Update the weight of l_jb, and reset the renewal time parameter of l_jb to 0;
A6. Add 1 to the renewal time parameter of every edge between neurons in the local neuron set;
A7. Examine all the above edges; if the renewal time parameter of an edge exceeds the average over all edges, delete that edge, and set the iteration count t = t + 1;
A8. Detect the average distance from the short texts to their cluster centers; when this distance is below the convergence threshold u of the clustering process, stop clustering and enter the clustering model quantization process; otherwise judge whether t is an integral multiple of m: if yes, go to the global synchronization adjustment step; if no, return to the start.
Further, the distributional term clustering method takes the mutual information theory of information theory as its basis and selects, as the term clustering result, the word-class partition that minimizes information loss. In information theory, when one variable is used to encode another, the amount of information transmitted, I, is calculated by the following formula:

I(V;N) = Σ_{t=1..g} Σ_{i=1..l} π_t · p(v_i|n_t) · log( p(v_i|n_t) / p(v_i) )

π_t = p(n_t);  p(n_t) = f(n_t)/g;  p(v_i|n_t) = f(v_i, n_t)/f(n_t);

In the formula, V and N respectively represent the verb and noun variable sets, and l and g represent their sizes; I(V;N) represents the amount of information transmitted when the variables in V are encoded using the variables in N; n_t and v_i respectively represent variables in N and V; π_t represents the probability that the variable occurs; p(n_t) and p(v_i|n_t) respectively represent the probability that the noun n_t occurs in the text and the probability that the verb v_i co-occurs with the noun n_t in the text; f(n_t) and f(v_i, n_t) respectively represent the frequency of n_t and the frequency with which v_i and n_t occur together.
Further, the iterative semantic similarity calculation method uses a gradient-descent-style iterative process to realize iterative semantic similarity calculation, and specifically includes the following steps:
B1. Initialization: let V and N respectively denote the verb set and the noun set, with sizes l and g, and let k denote the number of word classes specified by the user;
B2. Determine class membership: redefine the class membership of n_t according to the following formula:
j*(n_t) = argmin_j D_KL( p(V|n_t) ‖ p(V|N_j) )
B3. Determine word classes: the nouns mapped to the j-th class form the word class N_j, according to the following formula:
N_j = { n_t : j*(n_t) = j }
B4. Determine class centers: recalculate the center of each word class according to the following formula:
p(v_i|N_j) = ( Σ_{n_t∈N_j} π_t · p(v_i|n_t) ) / π(N_j),  where π(N_j) = Σ_{n_t∈N_j} π_t;
B5. Iterate: repeat steps B2~B4 until the assignment of words to word classes no longer changes.
Further, the "global synchronization adjustment" adjusts the weights of the features in the neurons according to the distributions of the features so as to improve clustering accuracy, and specifically includes the following steps:
C1. Calculate the within-class distribution of the features of all neurons in the neuron set N;
C2. Calculate the between-class distribution of the features of all neurons in the neuron set N;
C3. Merge the within-class and between-class distributions of each feature to obtain the feature's weight;
C4. Sort the features by weight, select the features whose weights exceed the current feature selection threshold as the representative features of the neuron, and then go to the local parallel adjustment step.
Further, the clustering information evolution analysis method also uses a dynamic topological structure to complete the local parallel adjustment and the global synchronization adjustment; the dynamic topological structure connects different neurons by edges and then, as the short texts change, dynamically inserts and deletes edges between neurons to adjust the neurons.
Further, the step of "analyzing and comparing the changes of the clustering model across different time periods to obtain the evolution of the different pieces of information contained in the short text data" is specifically: using a grid to quantify the magnitude of change of the clustering model across time periods, and using tags to disclose the content of those changes, so as to quantify the evolution of the different pieces of information contained in the massive short text data.
Further, the step of "using a grid to quantify the magnitude of change of the clustering model across time periods and using tags to disclose the content of those changes, so as to quantify the evolution of the different pieces of information contained in the massive short text data" specifically includes the following steps:
D1. Suppose the network formed by the short text data of time period t1 is Gt1, and the network formed by the short text data of time period t2 is Gt2;
D2. Calculate the dense grids DGt1 and DGt2 and store them in Gt1 and Gt2, obtaining the following three subsets: DGt1-DGt2, DGt2-DGt1, DGt1∩DGt2;
wherein DGt1-DGt2 represents the information contained only in the short text data of time period t1, DGt2-DGt1 represents the information contained only in the short text data of time period t2, and DGt1∩DGt2 represents the information that remains unchanged from time period t1 to t2.
The beneficial effects of the invention are as follows:
Through the above technical scheme, the present invention introduces a parallelization idea: by iterating the operations of "local parallel adjustment" and "global synchronization adjustment", feature selection and class partitioning are fused together, realizing fast clustering of large-scale dynamic network short text and rapidly finding the optimal class-representative features and the best class partition, thereby greatly improving operating efficiency. The invention further discloses, with a visual tag set, the evolution of the different pieces of information in the network, reflecting the overall trend of user attention across different time periods; this can guide the formulation of relevant government or enterprise policies, and the development trend of information can also be drawn so as to devote limited manpower and material resources to the focus of user attention and correctly guide public opinion. By comparing the evolution of different pieces of information, the many Internet users can also quickly find the information that interests them in the vast ocean of information and grasp its course of development and change.
【Description of the drawings】
Fig. 1 is a schematic flowchart of the clustering-based information evolution analysis method for large-scale dynamic short text according to an embodiment of the present invention;
Fig. 2 is a schematic flowchart of the local parallel adjustment in the method according to an embodiment of the present invention;
Fig. 3 is a schematic flowchart of the iterative semantic similarity calculation method in the method according to an embodiment of the present invention;
Fig. 4 is a schematic flowchart of the global synchronization adjustment in the method according to an embodiment of the present invention;
Fig. 5 is a schematic flowchart of the edge insertion and deletion approach of the local parallel adjustment in the method according to an embodiment of the present invention;
Fig. 6 is a schematic flowchart of quantifying the evolution of massive network data in the method according to an embodiment of the present invention.
【Specific implementation mode】
To make the purpose, technical scheme, and advantages of the present invention clearer, the present invention is further elaborated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present invention and are not intended to limit it.
As shown in Figure 1 to Figure 3:
The present invention provides a clustering-based information evolution analysis method for large-scale dynamic short text. First, the neuron representation of self-organizing clustering algorithms is adopted, so that each document class is represented by a neuron; the neurons representing classes are then distributed evenly across the machines, so that each machine holds a small local neuron set; next, following the iterative-adjustment idea of self-organizing clustering algorithms, the class partition is refined by local parallel adjustment; after several rounds of local parallel adjustment, one global synchronization adjustment is performed, completing the fast clustering of massive short text data; finally, on this basis, the evolution of the different pieces of information contained in the short text data is obtained by analyzing and comparing the changes of the clustering model across different time periods.
The step of "performing local parallel adjustment on the class partition result" specifically includes the following steps (a code sketch follows the list):
Step A1. Randomly select a document, denoted d_i, from the short text data set (for example, microblog posts) to be clustered, and represent it using the distributional term clustering method;
Step A2. Using the iterative semantic similarity calculation method, compute the similarity between d_i and each neuron in the local neuron set on the current machine, and select the neuron with maximum similarity to d_i; denote it n_j;
Step A3. Adjust the weights of the features in n_j, and, using the iterative semantic similarity calculation method, find the neuron most similar to n_j in the local neuron set; denote it n_b;
Step A4. Detect whether an edge exists between n_j and n_b; if no edge exists, create an edge to connect them; denote the edge between n_j and n_b as l_jb;
Step A5. Update the weight of l_jb, and reset the renewal time parameter of l_jb to 0;
Step A6. Add 1 to the renewal time parameter of every edge between neurons in the local neuron set;
Step A7. Examine all the above edges; if the renewal time parameter of an edge exceeds the average over all edges, delete that edge, and set the iteration count t = t + 1;
Step A8. Detect the average distance from the short texts to their cluster centers; when this distance is below the convergence threshold u of the clustering process, stop clustering and enter the clustering model quantization process; otherwise judge whether t is an integral multiple of m: if yes, go to the global synchronization adjustment step; if no, return to the start.
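By way of illustration, the following Python sketch shows one round of steps A2~A7 on a single machine. It is a minimal sketch under stated assumptions: neurons are feature-weight vectors, the similarity function stands in for the iterative semantic similarity calculation method described below, and the learning rate eta is an illustrative parameter not fixed by this description.

```python
import numpy as np

def local_parallel_round(doc, neurons, edges, eta=0.1):
    """One round of local parallel adjustment (steps A2~A7), a minimal sketch.

    doc     : feature vector of the selected document d_i (step A1)
    neurons : dict {id: np.ndarray} -- the local neuron set (at least two)
    edges   : dict {(i, j): age} with i < j -- renewal time parameter per edge
    eta     : illustrative learning rate for the weight adjustment in A3
    """
    # Stand-in for the iterative semantic similarity calculation method.
    sim = lambda a, b: 1.0 / (1.0 + np.linalg.norm(a - b))

    # A2: find the neuron n_j most similar to the document.
    j = max(neurons, key=lambda n: sim(doc, neurons[n]))
    # A3: adjust the feature weights of n_j toward the document, then find
    # the neuron n_b most similar to n_j.
    neurons[j] += eta * (doc - neurons[j])
    b = max((n for n in neurons if n != j),
            key=lambda n: sim(neurons[j], neurons[n]))
    # A4~A5: ensure the edge l_jb exists and reset its renewal time to 0
    # (the edge-weight update of A5 is omitted in this sketch).
    key = (min(j, b), max(j, b))
    edges[key] = 0
    # A6: add 1 to the renewal time parameter of every edge.
    for e in edges:
        edges[e] += 1
    # A7: delete edges whose renewal time exceeds the average over all edges.
    avg = sum(edges.values()) / len(edges)
    for e in [e for e in edges if edges[e] > avg]:
        del edges[e]
```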
Wherein, the distributional term clustering method takes the mutual information theory of information theory as its basis and selects, as the term clustering result, the word-class partition that minimizes information loss. In information theory, when one variable is used to encode another, the amount of information transmitted, I, is calculated using the following formula:

I(V;N) = Σ_{t=1..g} Σ_{i=1..l} π_t · p(v_i|n_t) · log( p(v_i|n_t) / p(v_i) )

In the formula, V and N respectively represent the verb and noun variable sets, and l and g represent their sizes; I(V;N) represents the amount of information transmitted when the variables in V are encoded using the variables in N; n_t and v_i respectively represent variables in N and V; π_t represents the probability that the variable occurs, and the probabilities of all variables sum to 1, i.e. Σ_{t=1..g} π_t = 1.

In text clustering, the words that can represent a text's topic are usually nouns and verbs. The present invention therefore extracts only the nouns and verbs in the short texts as short text features, and clusters the nouns and verbs. The V and N above respectively represent the verb set and the noun set; p(n_t) and p(v_i|n_t) respectively represent the probability that the noun n_t occurs in the text and the probability that the verb v_i co-occurs with the noun n_t in the text. Letting f(n_t) and f(v_i, n_t) respectively denote the frequency of n_t and the frequency with which v_i and n_t occur together, we have:

π_t = p(n_t);  p(n_t) = f(n_t)/g;  p(v_i|n_t) = f(v_i, n_t)/f(n_t)
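As an illustration, the following sketch estimates these probabilities from (verb, noun) co-occurrence pairs and evaluates I(V;N); the function name is illustrative, and the estimates follow the formulas above.

```python
from collections import Counter
import math

def mutual_information(pairs):
    """Estimate I(V;N) from a list of (verb, noun) co-occurrence pairs,
    following the frequency estimates given above (a minimal sketch)."""
    f_n = Counter(n for _, n in pairs)    # f(n_t)
    f_vn = Counter(pairs)                 # f(v_i, n_t)
    g = len(f_n)                          # size of the noun set
    p_v = Counter(v for v, _ in pairs)    # marginal p(v_i)
    for v in p_v:
        p_v[v] /= len(pairs)

    info = 0.0
    for (v, n), f in f_vn.items():
        pi_t = f_n[n] / g                 # pi_t = p(n_t) = f(n_t)/g
        p_v_given_n = f / f_n[n]          # p(v_i|n_t) = f(v_i, n_t)/f(n_t)
        info += pi_t * p_v_given_n * math.log(p_v_given_n / p_v[v])
    return info
```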
If N is compressed, that is, divided into k classes NC = {N_1, N_2, ..., N_k} with N = N_1 ∪ N_2 ∪ ... ∪ N_k, then, according to information theory, encoding V with NC incurs an information loss. Distributional term clustering follows the "information loss minimization principle" of information theory: the class partition at which the loss reaches its minimum is regarded as the best term clustering result.

From the above analysis, the objective function of distributional term clustering is obtained; the word-class partition at which this function reaches its minimum is the optimal term clustering result, and encoding with this word-class set loses the least information:

Loss = I(V;N) - I(V;NC) = Σ_{j=1..k} Σ_{n_t∈N_j} π_t · D_KL( p(V|n_t) ‖ p(V|N_j) )

Wherein π(N_j) represents the probability that the word class N_j occurs, calculated from the probabilities of the words it contains, and p(v_i|N_j) represents the co-occurrence probability of the word class N_j with the verb v_i; they are computed as follows:

π(N_j) = Σ_{n_t∈N_j} π_t;  p(v_i|N_j) = ( Σ_{n_t∈N_j} π_t · p(v_i|n_t) ) / π(N_j)

In the difference I(V;N) - I(V;NC), the terms involving p(v_i) cancel, which leaves exactly the weighted sum of KL distances above.
This objective function is similar to that of the K-means algorithm; the main difference is that K-means uses "Euclidean distance" as its similarity measure. Since "Euclidean distance" cannot measure the similarity of probability distributions, this patent uses the KL distance to calculate the distributional similarity between words; in addition, the KL distance alleviates the K-means problem that classes are assumed to be distributed on a "hypersphere".
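By way of illustration, the KL distance between two conditional verb distributions can be computed as below; the smoothing constant eps is an illustrative device to handle zero entries and is not specified in this description.

```python
import math

def kl_distance(p, q, vocab, eps=1e-9):
    """D_KL(p || q) between two verb distributions given as {verb: prob}
    dicts; eps smooths zero entries (an illustrative choice)."""
    d = 0.0
    for v in vocab:
        pv = p.get(v, 0.0) + eps
        qv = q.get(v, 0.0) + eps
        d += pv * math.log(pv / qv)
    return d
```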
Through the distributional term clustering method, the present invention agglomerates similar words into word classes to generate a "word-class representation model"; this model maps multiple dimensions of the "vector space model" to a single dimension, alleviating the problem of sparse high-dimensional vectors.
The iterative semantic similarity calculation method uses a gradient-descent-style iterative process to realize iterative semantic similarity calculation, and specifically includes the following steps (a code sketch follows the list):
Step B1. Initialization: let V and N respectively denote the verb set and the noun set, with sizes l and g, and let k denote the number of word classes specified by the user. The noun set is first divided into l initial classes, the class of noun n_t being determined by j*(n_t) = argmax_j p(v_j|n_t). If k > l, the initial classes are split at random until the number of classes is k; otherwise word classes are merged at random until the number of classes is k;
Step B2. Determine class membership: redefine the class membership of n_t according to the following formula:
j*(n_t) = argmin_j D_KL( p(V|n_t) ‖ p(V|N_j) )
Step B3. Determine word classes: the nouns mapped to the j-th class form the word class N_j:
N_j = { n_t : j*(n_t) = j }
Step B4. Determine class centers: recalculate the center p(v_i|N_j) of each word class as given above;
Step B5. Iterate: repeat steps B2~B4 until the assignment of words to word classes no longer changes.
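The following sketch puts steps B1~B5 together, reusing kl_distance from the sketch above. It is a simplified illustration: the random merging in B1 (splitting for k greater than the number of initial classes is omitted), the guard against empty classes, and the convergence test are assumptions of this sketch rather than details fixed by the description.

```python
import random

def distributional_term_clustering(p_v_given_n, pi, k, vocab, max_iter=100):
    """Iterative word clustering (steps B1~B5), a minimal sketch.

    p_v_given_n : dict {noun: {verb: prob}} -- conditional verb distributions
    pi          : dict {noun: prob} -- occurrence probability of each noun
    k           : number of word classes requested by the user
    """
    nouns = list(p_v_given_n)
    # B1: start from one class per noun, then merge random classes down to k.
    classes = [[n] for n in nouns]
    while len(classes) > k:
        a, b = random.sample(range(len(classes)), 2)
        classes[min(a, b)].extend(classes[max(a, b)])
        classes.pop(max(a, b))

    for _ in range(max_iter):
        # B4: recompute each class center p(v|N_j) as a pi-weighted mixture.
        centers = []
        for cls in classes:
            w = sum(pi[n] for n in cls) or 1.0   # guard against empty classes
            centers.append({v: sum(pi[n] * p_v_given_n[n].get(v, 0.0)
                                   for n in cls) / w for v in vocab})
        # B2~B3: reassign every noun to the class at minimal KL distance.
        new_classes = [[] for _ in classes]
        for n in nouns:
            j = min(range(len(centers)),
                    key=lambda j: kl_distance(p_v_given_n[n], centers[j], vocab))
            new_classes[j].append(n)
        # B5: stop once the assignment no longer changes.
        if new_classes == classes:
            break
        classes = new_classes
    return classes
```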
" semantic similarity calculation method of iteration " of the present invention is similar between word and word by calculating for iteration Similarity seeks an equalization point between the two between degree, document and document, and using the corresponding result of this equalization point as Best similarity calculation result.
" the global synchronization adjustment " is the weights of feature in the distribution adjustment neuron according to feature to improve cluster Accuracy rate specifically includes following steps:
Step C1. calculates the distribution within class of feature in all neurons in neuronal ensemble N;
Step C2. calculates the distribution between class of feature in all neurons in neuronal ensemble N;
Step C3. merges the distribution within class of feature and distribution between class obtains the weights of feature;
Step C4. is ranked up feature by the weights of feature, and weights is selected to be more than the spy that current signature selects threshold value The representative feature as neuron is levied, is transferred to local parallel set-up procedure later.
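By way of illustration, the sketch below realizes steps C1~C4. The description leaves open how the within-class and between-class distributions are merged into one weight; the between/within variance ratio used here is one plausible choice and is an assumption of this sketch, as is the shape of the input.

```python
import numpy as np

def global_synchronization(members, threshold):
    """Steps C1~C4, a minimal sketch.

    members   : dict {class_id: 2-D array of member feature vectors}
    threshold : current feature selection threshold
    Returns the indices of the selected representative features, best first.
    """
    all_rows = np.vstack(list(members.values()))
    overall_mean = all_rows.mean(axis=0)

    within = np.zeros(all_rows.shape[1])    # C1: within-class spread
    between = np.zeros(all_rows.shape[1])   # C2: between-class spread
    for rows in members.values():
        mean = rows.mean(axis=0)
        within += ((rows - mean) ** 2).sum(axis=0)
        between += len(rows) * (mean - overall_mean) ** 2

    # C3: merge the two distributions into one weight per feature
    # (between/within ratio -- an illustrative choice).
    weights = between / (within + 1e-9)
    # C4: sort by weight and keep the features above the threshold.
    order = np.argsort(-weights)
    return [int(i) for i in order if weights[i] > threshold]
```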
" the global synchronization adjustment " goes actively to select classification by the distribution (distribution within class and distribution between class) of feature Feature is represented with filtering characteristic, solve local parallel and adjust insurmountable classification represent may include in feature cannot be effective The problem of distinguishing the uncorrelated features of classification, to promote the accuracy rate of category division in global scope.
The step, described above, of "analyzing and comparing the changes of the clustering model across different time periods to obtain the evolution of the different pieces of information contained in the short text data" is specifically: using a grid to quantify the magnitude of change of the clustering model across time periods, and using tags to disclose the content of those changes, so as to quantify the evolution of the different pieces of information contained in the massive network data (for example, microblog data).
In addition, the clustering information evolution analysis method of the present invention also uses a dynamic topological structure: different neurons are connected by edges, and then, as the short texts change, edges between neurons are dynamically inserted and deleted to adjust the topological structure of the neurons. Two parameters control the insertion and deletion of edges between neurons: the edge weight and the edge renewal time. The edge weight indicates how closely two neurons are correlated, and the edge renewal time records how long ago the edge was last updated. The weight w_ij of the edge between neurons n_i and n_j is computed from their similarity Sim(n_i, n_j), which can be obtained from the Euclidean distance between them.
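As an illustration, the two edge parameters can be maintained as below; converting the Euclidean distance into a similarity-based weight via 1/(1+distance) is an assumption of this sketch, since the exact weight formula is not reproduced here.

```python
import numpy as np

class Edge:
    """Edge between two neurons, carrying the two control parameters:
    the weight (from Sim(n_i, n_j)) and the renewal time."""
    def __init__(self, ni, nj):
        self.touch(ni, nj)

    def touch(self, ni, nj):
        # Similarity from Euclidean distance (illustrative conversion).
        self.weight = 1.0 / (1.0 + np.linalg.norm(ni - nj))
        self.age = 0            # renewal time parameter: 0 = just updated
```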
The aforementioned local parallel adjustment and global synchronization adjustment are combined with the dynamic topological structure; the difference between the two is that the local parallel adjustment step is supplemented with a process of inserting and deleting edges, which dynamically changes the topological structure of the neurons by adding the following steps:
Step D1. After the local parallel adjustment step has adjusted n_j, find the neuron most similar to n_j in the local neuron set; denote it n_b;
Step D2. Detect whether an edge exists between n_j and n_b; if no edge exists, create an edge to connect them; denote the edge between n_j and n_b as l_jb;
Step D3. Update the weight of l_jb, and reset the renewal time parameter of l_jb to 0;
Step D4. Add 1 to the renewal time parameter of every edge between neurons in the local neuron set;
Step D5. Examine all the above edges; if the renewal time parameter of an edge exceeds the average over all edges, delete it.
The step, described above, of "using a grid to quantify the magnitude of change of the clustering model across time periods and using tags to disclose the content of those changes, so as to quantify the evolution of the different pieces of information contained in the massive short text data" specifically includes the following steps (a code sketch follows the list):
Step E1. Suppose the network formed by the short text data of time period t1 is Gt1, and the network formed by the short text data of time period t2 is Gt2;
Step E2. Calculate the dense grids DGt1 and DGt2 and store them in Gt1 and Gt2, obtaining the following three subsets: DGt1-DGt2, DGt2-DGt1, DGt1∩DGt2;
wherein DGt1-DGt2 represents the information contained only in the short text data of time period t1, DGt2-DGt1 represents the information contained only in the short text data of time period t2, and DGt1∩DGt2 represents the information that remains unchanged from time period t1 to t2.
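Step E2 reduces to plain set operations once each dense grid cell has a hashable identifier; below is a minimal sketch with illustrative cell names.

```python
def compare_periods(dg_t1, dg_t2):
    """Split dense grid cells of two time periods into vanished, newly
    appeared, and persistent information (step E2)."""
    return dg_t1 - dg_t2, dg_t2 - dg_t1, dg_t1 & dg_t2

# Usage with illustrative cell identifiers:
dg_t1 = {"cell_a", "cell_b", "cell_c"}
dg_t2 = {"cell_b", "cell_c", "cell_d"}
only_t1, only_t2, stable = compare_periods(dg_t1, dg_t2)
print(only_t1, only_t2, stable)   # cell_a; cell_d; cell_b and cell_c
```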
On the basis of realizing clustering for large-scale dynamic short text, the clustering-based information evolution analysis method of the present invention performs information evolution analysis on the large-scale dynamic short text data in social networks, and discloses, with a visual tag set, the evolution of the different pieces of information in the network, thereby reflecting the overall trend of user attention across different time periods.
The clustering-based information evolution analysis method for large-scale dynamic short text of the present invention has the following features:
1. As the scale of short text grows, the sparse high-dimensional vector characteristic of short text becomes more pronounced, driving the similarity between large numbers of short texts toward 0 and thus constraining the performance of traditional text clustering algorithms. For this problem, the present invention realizes "distributional term clustering" based on the "mutual information" theory of information theory, agglomerating distributionally similar words into word classes to relieve the sparse high-dimensional vector characteristic of short text; this both describes the text information more accurately, yielding good clustering results, and realizes a low-dimensional, non-sparse representation that permits fast and accurate similarity calculation.
2. "Distributional term clustering" clusters verbs and nouns separately and therefore cannot find relations between verbs and nouns, such as the relation between "kick" and "ball"; that is, beyond "synonymous similarity" in the narrow sense, it cannot find similarity in the broader sense. The "iterative semantic similarity calculation method" is therefore used to find similarity between words in the broader sense: it iteratively computes the similarity between words and the similarity between documents, seeks an equilibrium point between the two, and takes the calculation result corresponding to this equilibrium point as the best similarity result.
3. The present invention fuses feature selection and class partitioning through iteration, progressively determining the optimal class-feature representation and the best class partition. The operations "local parallel adjustment" and "global synchronization adjustment" realize fast clustering of large-scale dynamic short text, and the "term clustering representation" and the "iterative semantic similarity calculation method" above further improve its performance.
4. The present invention uses networks to quantify the clustering models formed before and after the short text data is updated, and shows the quantified result through tags to embody the evolution of the information, providing intuitive analysis results.
In conclusion present invention introduces parallelization thought, pass through the operation " local parallel adjustment " of iteration and " global synchronization Feature selecting and category division are fused together by adjustment ", realize the quick clustering to extensive dynamic short text;And this hair The bright evolutionary process that different information in network are disclosed with visual tag set, reflects the focus of user in difference with this Overall variation trend in period can provide guidance for the formulation of government or enterprise's relevant policies, secondly can also draw Limited man power and material is invested user's focus of attention, the trend correctly to guide public opinion by the development trend figure of information;It is vast Internet user also quickly can have found that its is interested by the evolutionary process of the different information of comparison from immense information ocean Information, and grasp the development and change process of the information.
The above content is a further detailed description of the present invention in combination with specific preferred technical schemes, and it cannot be concluded that the specific implementation of the present invention is confined to these descriptions. For those of ordinary skill in the art to which the present invention belongs, a number of simple deductions or replacements can be made without departing from the concept of the present invention, and all of these shall be regarded as falling within the protection scope of the present invention.

Claims (7)

1. A clustering-based information evolution analysis method for large-scale dynamic short text, in which: first, the neuron representation of self-organizing clustering algorithms is adopted, so that each document class is represented by a neuron; the neurons representing classes are then distributed evenly across the machines, so that each machine holds a small local neuron set; next, following the iterative-adjustment idea of self-organizing clustering algorithms, the class partition is refined by local parallel adjustment; after several rounds of local parallel adjustment, one global synchronization adjustment is performed, completing the fast clustering of massive short text data; finally, on this basis, the evolution of the different pieces of information contained in the short text data is obtained by analyzing and comparing the changes of the clustering model across different time periods; characterized in that the step of "performing local parallel adjustment on the class partition result" specifically includes the following steps:
A1. Randomly select a document, denoted d_i, from the short text data set to be clustered, and represent it using the distributional term clustering method;
A2. Using the iterative semantic similarity calculation method, compute the similarity between d_i and each neuron in the local neuron set on the current machine, and select the neuron with maximum similarity to d_i; denote it n_j;
A3. Adjust the weights of the features in n_j, and, using the iterative semantic similarity calculation method, find the neuron most similar to n_j in the local neuron set; denote it n_b;
A4. Detect whether an edge exists between n_j and n_b; if no edge exists, create an edge to connect them; denote the edge between n_j and n_b as l_jb;
A5. Update the weight of l_jb, and reset the renewal time parameter of l_jb to 0;
A6. Add 1 to the renewal time parameter of every edge between neurons in the local neuron set;
A7. Examine all the above edges; if the renewal time parameter of an edge exceeds the average over all edges, delete that edge, and set the iteration count t = t + 1;
A8. Detect the average distance from the short texts to their cluster centers; when this distance is below the convergence threshold u of the clustering process, stop clustering and enter the clustering model quantization process; otherwise go to the global synchronization adjustment step or return to the start.
2. The clustering-based information evolution analysis method for large-scale dynamic short text according to claim 1, characterized in that the distributional term clustering method takes the mutual information theory of information theory as its basis and selects, as the term clustering result, the word-class partition that minimizes information loss; in information theory, when one variable is used to encode another, the amount of information transmitted, I, is calculated by the following formula:
I(V;N) = Σ_{t=1..g} Σ_{i=1..l} π_t · p(v_i|n_t) · log( p(v_i|n_t) / p(v_i) )
π_t = p(n_t);
p(n_t) = f(n_t)/g;  p(v_i|n_t) = f(v_i, n_t)/f(n_t);
In the formula, V and N respectively represent the verb and noun variable sets, and l and g represent their sizes; I(V;N) represents the amount of information transmitted when the variables in V are encoded using the variables in N; n_t and v_i respectively represent variables in N and V; π_t represents the probability that the variable occurs; p(n_t) and p(v_i|n_t) respectively represent the probability that the noun n_t occurs in the text and the probability that the verb v_i co-occurs with the noun n_t in the text; f(n_t) and f(v_i, n_t) respectively represent the frequency of n_t and the frequency with which v_i and n_t occur together.
3. The clustering-based information evolution analysis method for large-scale dynamic short text according to claim 2, characterized in that the iterative semantic similarity calculation method uses a gradient-descent-style iterative process to realize iterative semantic similarity calculation, and specifically includes the following steps:
B1. Initialization: let V and N respectively denote the verb set and the noun set, with sizes l and g, and let k denote the number of word classes specified by the user;
B2. Determine class membership: redefine the class membership of n_t according to the following formula:
j*(n_t) = argmin_j D_KL( p(V|n_t) ‖ p(V|N_j) )
B3. Determine word classes: the nouns mapped to the j-th class form the word class N_j, according to the following formula:
N_j = { n_t : j*(n_t) = j }
B4. Determine class centers: recalculate the center of each word class according to the following formula:
p(v_i|N_j) = ( Σ_{n_t∈N_j} π_t · p(v_i|n_t) ) / π(N_j),  where π(N_j) = Σ_{n_t∈N_j} π_t;
B5. Iterate: repeat steps B2~B4 until the assignment of words to word classes no longer changes.
4. The clustering-based information evolution analysis method for large-scale dynamic short text according to claim 1, characterized in that the "global synchronization adjustment" adjusts the weights of the features in the neurons according to the distributions of the features so as to improve clustering accuracy, and specifically includes the following steps:
C1. Calculate the within-class distribution of the features of all neurons in the neuron set N;
C2. Calculate the between-class distribution of the features of all neurons in the neuron set N;
C3. Merge the within-class and between-class distributions of each feature to obtain the feature's weight;
C4. Sort the features by weight, select the features whose weights exceed the current feature selection threshold as the representative features of the neuron, and then go to the local parallel adjustment step.
5. The clustering-based information evolution analysis method for large-scale dynamic short text according to any one of claims 1 to 4, characterized in that the method also uses a dynamic topological structure to complete the local parallel adjustment and the global synchronization adjustment; the dynamic topological structure connects different neurons by edges and then, as the short texts change, dynamically inserts and deletes edges between neurons to adjust the neurons.
6. The clustering-based information evolution analysis method for large-scale dynamic short text according to any one of claims 1 to 4, characterized in that the step of "analyzing and comparing the changes of the clustering model across different time periods to obtain the evolution of the different pieces of information contained in the short text data" is specifically: using a grid to quantify the magnitude of change of the clustering model across time periods, and using tags to disclose the content of those changes, so as to quantify the evolution of the different pieces of information contained in the massive short text data.
7. The clustering-based information evolution analysis method for large-scale dynamic short text according to claim 6, characterized in that the step of "using a grid to quantify the magnitude of change of the clustering model across time periods and using tags to disclose the content of those changes, so as to quantify the evolution of the different pieces of information contained in the massive short text data" specifically includes the following steps:
D1. Suppose the network formed by the short text data of time period t1 is Gt1, and the network formed by the short text data of time period t2 is Gt2;
D2. Calculate the dense grids DGt1 and DGt2 and store them in Gt1 and Gt2, obtaining the following three subsets: DGt1-DGt2, DGt2-DGt1, DGt1∩DGt2;
wherein DGt1-DGt2 represents the information contained only in the short text data of time period t1, DGt2-DGt1 represents the information contained only in the short text data of time period t2, and DGt1∩DGt2 represents the information that remains unchanged from time period t1 to t2.
CN201310716896.9A 2013-12-20 2013-12-20 A clustering-based information evolution analysis method for large-scale dynamic short text Expired - Fee Related CN104731811B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310716896.9A CN104731811B (en) 2013-12-20 2013-12-20 A clustering-based information evolution analysis method for large-scale dynamic short text

Publications (2)

Publication Number Publication Date
CN104731811A CN104731811A (en) 2015-06-24
CN104731811B (en) 2018-10-09

Family

ID=53455708

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310716896.9A Expired - Fee Related CN104731811B (en) 2013-12-20 2013-12-20 A clustering-based information evolution analysis method for large-scale dynamic short text

Country Status (1)

Country Link
CN (1) CN104731811B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105183804B (en) * 2015-08-26 2018-12-28 陕西师范大学 A kind of cluster method of servicing based on ontology
CN106067029B (en) * 2016-05-24 2019-06-18 哈尔滨工程大学 The entity classification method in data-oriented space
CN106776748A (en) * 2016-11-17 2017-05-31 天津大学 A kind of solution for harmful information monitoring in social network data
CN110008334B (en) * 2017-08-04 2023-03-14 腾讯科技(北京)有限公司 Information processing method, device and storage medium
CN110276375B (en) * 2019-05-14 2021-08-20 嘉兴职业技术学院 Method for identifying and processing crowd dynamic clustering information
CN114579739B (en) * 2022-01-12 2023-05-30 中国电子科技集团公司第十研究所 Topic detection and tracking method for text data stream

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1808474A (en) * 2006-03-02 2006-07-26 哈尔滨工业大学 Self-organized mapping network based document clustering method
CN101408893A (en) * 2008-11-26 2009-04-15 哈尔滨工业大学 Method for rapidly clustering documents
CN101751458A (en) * 2009-12-31 2010-06-23 暨南大学 Network public sentiment monitoring system and method
EP2639749A1 (en) * 2012-03-15 2013-09-18 CEPT Systems GmbH Methods, apparatus and products for semantic processing of text
CN103390051A (en) * 2013-07-25 2013-11-13 南京邮电大学 Topic detection and tracking method based on microblog data

Also Published As

Publication number Publication date
CN104731811A (en) 2015-06-24

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20181009

Termination date: 20191220
