CN104731811B - Clustering-based information evolution analysis method for large-scale dynamic short text - Google Patents
Clustering-based information evolution analysis method for large-scale dynamic short text
- Publication number
- CN104731811B CN104731811B CN201310716896.9A CN201310716896A CN104731811B CN 104731811 B CN104731811 B CN 104731811B CN 201310716896 A CN201310716896 A CN 201310716896A CN 104731811 B CN104731811 B CN 104731811B
- Authority
- CN
- China
- Prior art keywords
- clustering
- short text
- information
- neuron
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Abstract
The present invention relates to a clustering-based information evolution analysis method for large-scale dynamic short text. First, drawing on the neuron representation used in self-organizing clustering algorithms, document categories are represented by neurons. The neurons representing categories are then distributed evenly across machines, so that each machine holds a small-scale local neuron set. Next, based on the idea of iterative adjustment, local parallel adjustment is applied to the category partition; after several rounds of local parallel adjustment, one global synchronization adjustment is performed, completing fast clustering of massive network data. Finally, by analyzing and comparing changes of the clustering model across different time periods, the evolution of the different information contained in the short text data is obtained. By iterating the operations "local parallel adjustment" and "global synchronization adjustment", the present invention fuses feature selection and category partitioning, realizes fast clustering of large-scale dynamic network short text, and greatly improves operational efficiency.
Description
【Technical field】
The invention belongs to the field of social-network data mining, and more particularly relates to a clustering-based
information evolution analysis method for large-scale dynamic short text.
【Background technology】
With the arrival of the Web 2.0 era, the Internet industry is undergoing a huge change. Centered on social
networks, with the "microblog" as a typical representative, virtual communication platforms dedicated to networked
interpersonal interaction, leisure and entertainment, business marketing, study and discussion, and a series of other
activities won the enthusiasm of users as soon as they appeared.
A social network is a dynamic platform whose data are constantly updated. If the evolution of the different
information contained in the dynamic data can be obtained (analyzing which information is no longer of interest to
users, which information continues to hold users' attention, and which newest information users care about), one can
first grasp the overall trend of change in users' attention; one can also chart the development trend of information
and predict its direction of evolution, directing limited manpower and material resources to the focus of users'
attention and correctly guiding public opinion. By comparing the evolution of different information, the large
population of Internet users can also quickly find the information that interests them in the immense ocean of
information.
Popular information analysis problems today include "public opinion analysis", "hot spot discovery", "topic
evolution" and "hot spot tracking". The starting point of public opinion analysis and hot spot discovery is
"short-term analysis", that is, analyzing the information that bursts out in a short, concentrated period. Compared
with them, information evolution analysis emphasizes "long-term analysis": by comparing dynamic data over different
time periods, it derives the development trend of the information contained in the data. Topic evolution and hot-spot
tracking also process dynamic data, but they are mostly confined to the development trend of one or a few topics,
whereas information evolution analysis aims to present the overall change of information.
" news " or " blog " data are different from, the data being widely present in social network are a kind of typical " short
Text ", length are generally less than 140 words (by taking Sina weibo as an example).When text size is too short, it is with " vector space model "
The representation method of representative will produce " high dimension vector is sparse " problem, while be decided by the principal element of similarity between short text not
It is co-occurrence word frequency again, but the semantic similarity between text.Above-mentioned two problems to be widely used in " long text "
Analysis method can not be applied in " short text " analysis.Therefore, only realize that extensive dynamic short essay can be effectively treated in one kind
This clustering method, can cope with the arrival in Web2.0 epoch to huge caused by traditional text analysis method well
Big challenge.
【Invention content】
To solve the above technical problem, the present invention provides a method that introduces the idea of
parallelization, fuses feature selection and category partitioning through the iterated operations "local parallel
adjustment" and "global synchronization adjustment", and realizes fast clustering of large-scale dynamic short text,
greatly improving operational efficiency. The method further reveals the evolution of different information in the
network with visual tag sets, thereby reflecting the overall trend of users' attention across different time periods.
In order to solve the above technical problems, the present invention adopts the following technical scheme:
A clustering-based information evolution analysis method for large-scale dynamic short text: first, drawing on the
neuron representation in self-organizing clustering algorithms, document categories are represented by neurons; the
neurons representing categories are then distributed evenly across machines, so that each machine holds a small-scale
local neuron set; next, based on the iterative adjustment idea of self-organizing clustering, local parallel
adjustment is applied to the category partition; after several rounds of local parallel adjustment, one global
synchronization adjustment is performed, completing fast clustering of massive short text data; finally, on this
basis, by analyzing and comparing changes of the clustering model across different time periods, the evolution of
the different information contained in the short text data is obtained.
Further, the "local parallel adjustment of the category partition" specifically includes the following steps:
A1. Using the distributed word clustering method, randomly select a document from the short text data set to be clustered; denote it d_i;
A2. Using the iterative semantic similarity calculation method, compute the similarity between d_i and each neuron in the local neuron set on the current machine, and choose the neuron with the greatest similarity to d_i; denote it n_j;
A3. Adjust the feature weights in n_j, and use the iterative semantic similarity calculation method to find the neuron most similar to n_j in the local neuron set; denote it n_b;
A4. Detect whether an edge exists between n_j and n_b; if no edge exists, create one to connect them; denote the edge between n_j and n_b as l_jb;
A5. Update the weight of l_jb, and reset the update-time parameter of l_jb to 0;
A6. Add 1 to the update-time parameter of every edge between neurons in the local neuron set;
A7. Inspect all the above edges; if an edge's update-time parameter exceeds the average over all edges, delete that edge, and set the iteration count t = t + 1;
A8. Measure the average distance from the short texts to their cluster centers; when this distance falls below the convergence threshold u of the clustering process, stop clustering and enter the clustering model quantization process; otherwise judge whether t is an integral multiple of m: if "yes", go to the global synchronization adjustment step; if "no", return to the start.
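Steps A1–A8 can be sketched as a single per-machine loop. This is an illustrative sketch rather than the patented implementation: neurons and documents are sparse term-weight dictionaries, cosine similarity stands in for the iterative semantic similarity calculation method, and the learning rate `lr`, threshold `u`, and period `m` are assumed parameters.

```python
import random
from math import sqrt

def similarity(a, b):
    """Cosine similarity over sparse term-weight dicts; a stand-in for
    the patent's iterative semantic similarity calculation method."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = sqrt(sum(w * w for w in a.values()))
    nb = sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def local_parallel_adjust(docs, neurons, u=0.05, m=5, lr=0.1, max_iter=100):
    """One machine's local parallel adjustment (steps A1-A8, sketched).

    docs, neurons: lists of sparse term-weight dicts; needs >= 2 neurons.
    Returns ("converged", t), ("global_sync", t), or ("stopped", t).
    """
    edges = {}  # (i, j) -> update-time parameter (age)
    t = 0
    while t < max_iter:
        d = random.choice(docs)                                        # A1
        j = max(range(len(neurons)),
                key=lambda i: similarity(d, neurons[i]))               # A2
        for term, w in d.items():                                      # A3
            neurons[j][term] = neurons[j].get(term, 0.0) + lr * w
        b = max((i for i in range(len(neurons)) if i != j),
                key=lambda i: similarity(neurons[j], neurons[i]))
        key = (min(j, b), max(j, b))
        edges[key] = 0                                                 # A4, A5
        for e in edges:                                                # A6
            if e != key:
                edges[e] += 1
        avg = sum(edges.values()) / len(edges)                         # A7
        edges = {e: age for e, age in edges.items() if age <= avg}
        t += 1
        mean_dist = sum(1.0 - max(similarity(doc, n) for n in neurons)
                        for doc in docs) / len(docs)                   # A8
        if mean_dist < u:
            return "converged", t
        if t % m == 0:
            return "global_sync", t
    return "stopped", t
```

In the full method many such loops run in parallel, one per machine, and the `"global_sync"` return value hands control to the global synchronization adjustment.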
Further, the distributed word clustering method takes the mutual information theory of information theory as its
basis, and selects the word-class partition that minimizes information loss as the word clustering result. In
information theory, when one variable is used to encode another, the amount of information I transmitted is
calculated by the following formula:
I(V; N) = Σ_{t=1..g} π_t Σ_{i=1..l} p(v_i | n_t) log( p(v_i | n_t) / p(v_i) )
p(n_t) = f(n_t)/g;  p(v_i | n_t) = f(v_i, n_t)/f(n_t);
In the formula, V and N respectively represent sets of word variables, and l and g represent their sizes; I(V; N)
represents the amount of information transmitted when the variables in N are used to encode the variables in V; n_t
and v_i respectively represent variables in N and V; π_t represents the occurrence probability of variable n_t;
p(n_t) and p(v_i | n_t) respectively represent the probability that noun n_t appears in the text and the
co-occurrence probability of verb v_i with noun n_t in the text; f(n_t) and f(v_i, n_t) respectively represent the
frequency of n_t and the frequency with which v_i and n_t appear together.
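The transmitted information I(V; N) can be computed directly from co-occurrence counts. In this sketch the noun probabilities π_t are normalized by the total number of noun occurrences so that they sum to 1 (an assumption; the claim's f(n_t)/g normalization leaves g ambiguous), and the logarithm is taken base 2:

```python
from math import log

def mutual_information(cooc):
    """I(V;N) from verb-noun co-occurrence counts.

    cooc: dict mapping (verb, noun) -> count f(v_i, n_t).
    Assumes pi_t = f(n_t) / total noun occurrences, so the pi_t sum to 1.
    """
    f_n, f_v, total = {}, {}, 0
    for (v, n), c in cooc.items():
        f_n[n] = f_n.get(n, 0) + c
        f_v[v] = f_v.get(v, 0) + c
        total += c
    info = 0.0
    for (v, n), c in cooc.items():
        p_vn = c / f_n[n]            # p(v_i | n_t)
        pi_t = f_n[n] / total        # pi_t
        p_v = f_v[v] / total         # marginal p(v_i)
        info += pi_t * p_vn * log(p_vn / p_v, 2)
    return info
```

When verbs and nouns co-occur independently the transmitted information is 0; when each noun fully determines its verb it reaches its maximum.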
Further, the iterative semantic similarity calculation method realizes iterative semantic similarity calculation
with a gradient-descent iterative process, and specifically includes the following steps:
B1. Initialization: let V and N respectively represent the verb and noun sets, l and g represent their sizes, and k represent the number of word classes specified by the user;
B2. Determine class membership: re-determine the class membership of n_t;
B3. Determine word classes: the nouns mapped to the j-th class form the word class N_j;
N_j = { n_t : j*(n_t) = j }
B4. Determine class centers: recompute the class center for each word class;
B5. Iterate: repeat steps B2–B4 until the assignment of words among the classes no longer changes.
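Steps B1–B5 can be sketched as follows, with two stated assumptions: initialization is deterministic round-robin rather than the random splitting and merging of step B1, and the re-assignment of step B2 uses the KL distance that the description later adopts for distributional similarity.

```python
from math import log

def kl(p, q, eps=1e-12):
    """KL distance between two verb distributions of equal length."""
    return sum(pi * log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def iterate_word_classes(profiles, k, max_iter=50):
    """Steps B1-B5, sketched.

    profiles: dict mapping each noun n_t to its verb distribution
    p(v|n_t), a list summing to 1. Returns noun -> class id.
    """
    nouns = sorted(profiles)
    dim = len(profiles[nouns[0]])
    assign = {n: i % k for i, n in enumerate(nouns)}      # B1 (simplified)
    for _ in range(max_iter):
        centers = []                                      # B4: class centers
        for j in range(k):
            members = [profiles[n] for n in nouns if assign[n] == j]
            if not members:
                centers.append([1.0 / dim] * dim)         # reseed empty class
            else:
                centers.append([sum(col) / len(members)
                                for col in zip(*members)])
        new = {n: min(range(k),                           # B2-B3: reassign
                      key=lambda j: kl(profiles[n], centers[j]))
               for n in nouns}
        if new == assign:                                 # B5: converged
            break
        assign = new
    return assign
```

Nouns with similar verb distributions end up in the same word class, which is exactly the "distributionally similar words agglomerated into one class" behavior the method relies on.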
Further, the "global synchronization adjustment" adjusts the feature weights in the neurons according to the
distributions of the features, so as to raise clustering accuracy, and specifically includes the following steps:
C1. Compute the within-class distribution of each feature over all neurons in the neuron set N;
C2. Compute the between-class distribution of each feature over all neurons in the neuron set N;
C3. Merge each feature's within-class and between-class distributions to obtain its weight;
C4. Sort the features by weight, select the features whose weight exceeds the current feature-selection threshold as the representative features of the neurons, and then go to the local parallel adjustment step.
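Steps C1–C4 admit a simple illustration. The concrete statistics below are assumptions, since the claim does not fix them: the within-class distribution is taken as a feature's variance inside each neuron's documents, the between-class distribution as the variance of its per-neuron means, and the merge rule of step C3 as their ratio.

```python
def feature_weights(clusters, eps=1e-9):
    """Steps C1-C3, sketched. clusters: list of neurons, each a list of
    documents (feature -> value dicts). Weight = between-class variance
    over mean within-class variance (an assumed merge rule)."""
    feats = {f for docs in clusters for d in docs for f in d}
    weights = {}
    for f in feats:
        means, within = [], []
        for docs in clusters:
            vals = [d.get(f, 0.0) for d in docs]
            mu = sum(vals) / len(vals)
            means.append(mu)
            within.append(sum((v - mu) ** 2 for v in vals) / len(vals))  # C1
        gm = sum(means) / len(means)
        between = sum((m - gm) ** 2 for m in means) / len(means)         # C2
        weights[f] = between / (sum(within) / len(within) + eps)         # C3
    return weights

def select_features(weights, threshold):
    """Step C4: keep features whose weight exceeds the selection
    threshold, sorted by descending weight."""
    return sorted((f for f, w in weights.items() if w > threshold),
                  key=lambda f: -weights[f])
```

A feature that is uniform across all neurons (a stop-word-like "the") scores near 0 and is filtered out, while a feature that separates neurons scores high, matching the claim's goal of discarding features that cannot distinguish categories.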
Further, the clustering information evolution analysis method also uses a dynamic topological structure to carry out
the local parallel adjustment and the global synchronization adjustment. The dynamic topological structure connects
different neurons by edges, and then, as the short text changes, dynamically inserts and deletes edges between
neurons to adjust the neurons.
Further, "analyzing and comparing changes of the clustering model across different time periods to obtain the
evolution of the different information contained in the short text data" is specifically: using a grid to quantify
the magnitude of change of the clustering model across different time periods, and using labels to reveal the
changed content of the clustering model across different time periods, so as to quantify the evolution of the
different information contained in the massive short text data.
Further, "using a grid to quantify the magnitude of change of the clustering model across different time periods
and using labels to reveal the changed content of the clustering model across different time periods, so as to
quantify the evolution of the different information contained in the massive short text data" specifically includes
the following steps:
D1. Suppose the network formed by the short text data in period t1 is Gt1, and the network formed by the short text data in period t2 is Gt2.
D2. Compute the dense grids DGt1 and DGt2, store them in Gt1 and Gt2, and obtain the following three subsets: DGt1-DGt2, DGt2-DGt1, DGt1∩DGt2;
where DGt1-DGt2 represents the information contained only in the short text data of period t1, DGt2-DGt1 represents
the information contained only in the short text data of period t2, and DGt1∩DGt2 represents the information that
remains unchanged between periods t1 and t2.
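Step D2's three subsets are plain set operations. In this minimal sketch the dense grids are simply sets of information labels; how grid density is computed is outside the scope of this step:

```python
def evolution_partition(dg_t1, dg_t2):
    """Partition two periods' dense grids (step D2) into vanished,
    newly emerged, and persistent information."""
    return {
        "vanished": dg_t1 - dg_t2,     # DGt1 - DGt2: only in period t1
        "emerged": dg_t2 - dg_t1,      # DGt2 - DGt1: only in period t2
        "persistent": dg_t1 & dg_t2,   # DGt1 ∩ DGt2: unchanged t1 -> t2
    }
```

The sizes of the three subsets quantify the magnitude of change between the two periods, and their member labels are what the method displays to reveal the changed content.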
The beneficial effects of the invention are as follows:
Through the above technical scheme, the present invention introduces the idea of parallelization, fuses feature
selection and category partitioning through the iterated operations "local parallel adjustment" and "global
synchronization adjustment", and realizes fast clustering of large-scale dynamic network short text, rapidly finding
optimized category-representative features and the best category partition, greatly improving operational
efficiency. It also reveals the evolution of different information in the network with visual tag sets, thereby
reflecting the overall trend of users' attention across different time periods, which can guide the formulation of
relevant government or enterprise policies; it can further chart the development trend of information, directing
limited manpower and material resources to the focus of users' attention and correctly guiding public opinion. By
comparing the evolution of different information, the large population of Internet users can also quickly find the
information that interests them in the immense ocean of information and grasp its process of development and change.
【Description of the drawings】
Fig. 1 is a flow diagram of the clustering information evolution analysis method for large-scale dynamic short text
described in the embodiment of the present invention;
Fig. 2 is a flow diagram of the local parallel adjustment in the method described in the embodiment of the present
invention;
Fig. 3 is a flow diagram of the iterative semantic similarity calculation method in the method described in the
embodiment of the present invention;
Fig. 4 is a flow diagram of the global synchronization adjustment in the method described in the embodiment of the
present invention;
Fig. 5 is a flow diagram of the edge insertion and deletion approach of the local parallel adjustment in the method
described in the embodiment of the present invention;
Fig. 6 is a flow diagram of quantifying the evolution of massive network data in the method described in the
embodiment of the present invention.
【Specific implementation mode】
In order to make the purpose, technical scheme and advantages of the present invention clearer, the present
invention is further elaborated below with reference to the accompanying drawings and embodiments. It should be
appreciated that the specific examples described herein are only used to explain the present invention, not to
limit it.
As shown in Figure 1 to Figure 3:
The present invention provides a clustering-based information evolution analysis method for large-scale dynamic
short text. First, drawing on the neuron representation in self-organizing clustering algorithms, document
categories are represented by neurons; the neurons representing categories are then distributed evenly across
machines, so that each machine holds a small-scale local neuron set; next, based on the iterative adjustment idea of
self-organizing clustering, local parallel adjustment is applied to the category partition; after several rounds of
local parallel adjustment, one global synchronization adjustment is performed, completing fast clustering of massive
short text data; finally, on this basis, by analyzing and comparing changes of the clustering model across different
time periods, the evolution of the different information contained in the short text data is obtained.
The "local parallel adjustment of the category partition" specifically includes the following steps:
Step A1. Using the distributed word clustering method, randomly select a document from the short text data set to be clustered (such as "microblogs"); denote it d_i;
Step A2. Using the iterative semantic similarity calculation method, compute the similarity between d_i and each neuron in the local neuron set on the current machine, and choose the neuron with the greatest similarity to d_i; denote it n_j;
Step A3. Adjust the feature weights in n_j, and use the iterative semantic similarity calculation method to find the neuron most similar to n_j in the local neuron set; denote it n_b;
Step A4. Detect whether an edge exists between n_j and n_b; if no edge exists, create one to connect them; denote the edge between n_j and n_b as l_jb;
Step A5. Update the weight of l_jb, and reset the update-time parameter of l_jb to 0;
Step A6. Add 1 to the update-time parameter of every edge between neurons in the local neuron set;
Step A7. Inspect all the above edges; if an edge's update-time parameter exceeds the average over all edges, delete that edge, and set the iteration count t = t + 1;
Step A8. Measure the average distance from the short texts to their cluster centers; when this distance falls below the convergence threshold u of the clustering process, stop clustering and enter the clustering model quantization process; otherwise judge whether t is an integral multiple of m: if "yes", go to the global synchronization adjustment step; if "no", return to the start.
Here, the distributed word clustering method takes the mutual information theory of information theory as its basis
and selects the word-class partition that minimizes information loss as the word clustering result. In information
theory, when one variable is used to encode another, the amount of information I transmitted is calculated by the
following formula:
I(V; N) = Σ_{t=1..g} π_t Σ_{i=1..l} p(v_i | n_t) log( p(v_i | n_t) / p(v_i) )
In the formula, V and N respectively represent sets of word variables, and l and g represent their sizes; I(V; N)
represents the amount of information transmitted when the variables in N are used to encode the variables in V; n_t
and v_i respectively represent variables in N and V; π_t represents the occurrence probability of variable n_t, and
the probabilities of all variables sum to 1, i.e.: Σ_{t=1..g} π_t = 1.
In text clustering, the words that can represent a text's topic are usually nouns and verbs. Therefore the present
invention extracts only the nouns and verbs in the short text as short text features, and clusters the nouns and
verbs. The above V and N respectively represent the verb and noun sets; p(n_t) and p(v_i | n_t) respectively
represent the probability that noun n_t appears in the text and the co-occurrence probability of verb v_i with noun
n_t in the text. Assuming f(n_t) and f(v_i, n_t) respectively represent the frequency of n_t and the frequency with
which v_i and n_t appear together, then:
p(n_t) = f(n_t)/g;  p(v_i | n_t) = f(v_i, n_t)/f(n_t)
If N is compressed, that is, divided into k classes NC = {N_1, N_2, ..., N_k} with N_1 ∪ N_2 ∪ ... ∪ N_k = N, then
when V is encoded again using NC, by information theory this coding incurs an information loss. Distributed word
clustering follows the "information loss minimization principle" of information theory: the category partition
corresponding to the minimum loss is regarded as the best word clustering result.
From the above analysis, the objective function of distributed word clustering is obtained; the word-class partition
corresponding to the minimum of this function is the optimal word clustering result, and when this word-class set is
used for encoding, the amount of information lost is minimal:
Loss = I(V; N) − I(V; NC)
Here, π(N_j) represents the occurrence probability of word class N_j, calculated from the probabilities of the words
included in the class, and p(v_i | N_j) represents the co-occurrence probability of word class N_j with verb v_i,
computed as follows:
π(N_j) = Σ_{n_t ∈ N_j} π_t;  p(v_i | N_j) = Σ_{n_t ∈ N_j} (π_t / π(N_j)) · p(v_i | n_t)
Removing the common factor p(v_i), the loss can be written as a π_t-weighted sum of KL distances from each word's
verb distribution to that of its class. This objective function is similar to the objective function of the K-means
algorithm; the main difference is that K-means uses "Euclidean distance" as its similarity measure. Since "Euclidean
distance" cannot measure the similarity of probability distributions, this patent uses the KL distance to calculate
the distributional similarity between words; in addition, the KL distance also handles well the problem that the
classes in K-means are distributed on a "hypersphere".
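The information loss described above can be written out directly: each noun contributes its prior π_t times the KL distance from its verb distribution p(v|n_t) to its class distribution p(v|N_j). The sketch below assumes natural logarithms and an ε-smoothing term to avoid log 0:

```python
from math import log

def kl(p, q, eps=1e-12):
    """KL distance between two verb distributions of equal length."""
    return sum(pi * log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def clustering_loss(profiles, priors, assign, k):
    """Information loss I(V;N) - I(V;NC) of a word-class partition,
    written as the pi_t-weighted KL distance from each noun's verb
    distribution to its class distribution.

    profiles: noun -> p(v|n_t) (list); priors: noun -> pi_t;
    assign: noun -> class id in range(k).
    """
    dim = len(next(iter(profiles.values())))
    loss = 0.0
    for j in range(k):
        members = [n for n in profiles if assign[n] == j]
        if not members:
            continue
        pj = sum(priors[n] for n in members)                       # pi(N_j)
        center = [sum(priors[n] * profiles[n][i] for n in members) / pj
                  for i in range(dim)]                             # p(v|N_j)
        loss += sum(priors[n] * kl(profiles[n], center) for n in members)
    return loss
```

A partition that groups distributionally similar nouns has near-zero loss, while mixing dissimilar nouns into one class raises it, which is what makes the loss usable as the clustering objective.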
Through the distributed word clustering method, the present invention agglomerates similar words into word classes
to generate a "word-class representation model"; the model maps multiple dimensions of the "vector space model" into
a single dimension, mitigating the "sparse high-dimensional vector" problem.
The iterative semantic similarity calculation method realizes iterative semantic similarity calculation with a
gradient-descent iterative process, and specifically includes the following steps:
Step B1. Initialization: let V and N respectively represent the verb and noun sets, l and g represent their sizes, and k represent the number of word classes specified by the user. The noun set is first taken as 1 initial class, and the class of noun n_t is determined by j*(n_t) = argmax_j p(v_j | n_t). If k > 1, the initial class is split at random until the number of classes is k; otherwise word classes are merged at random until the number of classes is k;
Step B2. Determine class membership: re-determine the class membership of n_t as the class whose distribution is nearest, j*(n_t) = argmin_j KL( p(v | n_t) ∥ p(v | N_j) );
Step B3. Determine word classes: the nouns mapped to the j-th class form the word class N_j;
N_j = { n_t : j*(n_t) = j }
Step B4. Determine class centers: recompute the class center for each word class, p(v | N_j) = Σ_{n_t ∈ N_j} (π_t / π(N_j)) · p(v | n_t);
Step B5. Iterate: repeat steps B2–B4 until the assignment of words among the classes no longer changes.
" semantic similarity calculation method of iteration " of the present invention is similar between word and word by calculating for iteration
Similarity seeks an equalization point between the two between degree, document and document, and using the corresponding result of this equalization point as
Best similarity calculation result.
" the global synchronization adjustment " is the weights of feature in the distribution adjustment neuron according to feature to improve cluster
Accuracy rate specifically includes following steps:
Step C1. calculates the distribution within class of feature in all neurons in neuronal ensemble N;
Step C2. calculates the distribution between class of feature in all neurons in neuronal ensemble N;
Step C3. merges the distribution within class of feature and distribution between class obtains the weights of feature;
Step C4. is ranked up feature by the weights of feature, and weights is selected to be more than the spy that current signature selects threshold value
The representative feature as neuron is levied, is transferred to local parallel set-up procedure later.
" the global synchronization adjustment " goes actively to select classification by the distribution (distribution within class and distribution between class) of feature
Feature is represented with filtering characteristic, solve local parallel and adjust insurmountable classification represent may include in feature cannot be effective
The problem of distinguishing the uncorrelated features of classification, to promote the accuracy rate of category division in global scope.
The step "analyzing and comparing changes of the clustering model across different time periods to obtain the
evolution of the different information contained in the short text data" of the present invention is specifically:
using a grid to quantify the magnitude of change of the clustering model across different time periods, and using
labels to reveal the changed content of the clustering model across different time periods, so as to quantify the
evolution of the different information contained in the massive network data (such as microblog data).
In addition, the clustering information evolution analysis method of the present invention also uses a dynamic
topological structure: different neurons are connected by edges, and then, as the short text changes, edges between
neurons are dynamically inserted and deleted to adjust the topological structure of the neurons. That is, two
parameters are set on each edge between neurons to control edge insertion and deletion, namely the edge weight and
the edge update time. The edge weight indicates the tightness of the correlation between neurons; the edge update
time records the time elapsed since the edge was last updated. The edge weight w_ij is calculated as follows:
w_ij = Sim(n_i, n_j)
The above formula gives the calculation of the weight w_ij of the edge between neurons n_i and n_j, where
Sim(n_i, n_j) represents the similarity of n_i and n_j, which can be obtained from the Euclidean distance.
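The text leaves the exact form of Sim(n_i, n_j) open beyond "obtained from the Euclidean distance"; one common mapping, assumed here, turns the distance d into a similarity 1/(1 + d), so that identical neurons get weight 1 and the weight decays as the neurons drift apart:

```python
from math import sqrt

def edge_weight(n_i, n_j):
    """w_ij = Sim(n_i, n_j) for two neurons given as sparse feature
    dicts. The similarity is derived from the Euclidean distance via
    1/(1+d) -- an assumed mapping; the patent only states that the
    Euclidean distance is used."""
    feats = set(n_i) | set(n_j)
    d = sqrt(sum((n_i.get(f, 0.0) - n_j.get(f, 0.0)) ** 2 for f in feats))
    return 1.0 / (1.0 + d)
```

Under this mapping the weight is bounded in (0, 1], which makes it directly usable as a tightness-of-correlation score on the edges of the dynamic topology.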
Both the aforementioned local parallel adjustment and global synchronization adjustment operate on the dynamic
topological structure; the difference between the two is that the local parallel adjustment step is supplemented
with the process of inserting and deleting edges, so as to dynamically change the topological structure of the
neurons, adding the following steps:
Step D1. After the local parallel adjustment step has adjusted n_j, find the neuron most similar to n_j in the local neuron set; denote it n_b;
Step D2. Detect whether an edge exists between n_j and n_b; if no edge exists, create one to connect them; denote the edge between n_j and n_b as l_jb;
Step D3. Update the weight of l_jb, and reset the update-time parameter of l_jb to 0;
Step D4. Add 1 to the update-time parameter of every edge between neurons in the local neuron set;
Step D5. Inspect all the above edges; if an edge's update-time parameter exceeds the average over all edges, delete it.
The step "using a grid to quantify the magnitude of change of the clustering model across different time periods and
using labels to reveal the changed content of the clustering model across different time periods, so as to quantify
the evolution of the different information contained in the massive short text data" of the present invention
specifically includes the following steps:
Step E1. Suppose the network formed by the short text data in period t1 is Gt1, and the network formed by the short text data in period t2 is Gt2.
Step E2. Compute the dense grids DGt1 and DGt2, store them in Gt1 and Gt2, and obtain the following three subsets: DGt1-DGt2, DGt2-DGt1, DGt1∩DGt2;
where DGt1-DGt2 represents the information contained only in the short text data of period t1, DGt2-DGt1 represents
the information contained only in the short text data of period t2, and DGt1∩DGt2 represents the information that
remains unchanged between periods t1 and t2.
On the basis of realizing clustering of large-scale dynamic short text, the clustering information evolution
analysis method for large-scale dynamic short text of the present invention performs information evolution analysis
on the large-scale dynamic short text data in social networks, and reveals the evolution of different information in
the network with visual tag sets, thereby reflecting the overall trend of users' attention across different time
periods.
The clustering information evolution analysis method for large-scale dynamic short text of the present invention has the following characteristics:
1. As the scale of short texts grows, their "high-dimensional, sparse" characteristic becomes more pronounced, driving the similarity between large numbers of short texts toward 0 and thus limiting the performance of traditional text clustering algorithms. To address this, the present invention uses mutual information from information theory to realise "distributed term clustering", which gathers distributionally similar words into word classes, alleviating the "high-dimensional, sparse" characteristic of short texts. This not only describes text information more accurately, yielding better clustering results, but also achieves a low-dimensional, non-sparse representation that supports fast and accurate similarity computation.
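The densifying effect of mapping words onto word classes can be illustrated with a toy sketch; the word-to-class table below is an invented example, not a result produced by the patent's method.

```python
# Sketch: representing short texts over word classes instead of words
# turns a sparse high-dimensional vector into a dense low-dimensional
# one, so texts with no shared words can still be compared.
from collections import Counter

# Hypothetical word-class assignment (class 0: football, class 1: voting)
word_class = {"soccer": 0, "football": 0, "goal": 0,
              "vote": 1, "ballot": 1, "election": 1}

def class_vector(tokens, num_classes=2):
    counts = Counter(word_class[t] for t in tokens if t in word_class)
    return [counts[c] for c in range(num_classes)]

# Two short texts with zero word overlap get comparable class vectors:
v1 = class_vector(["soccer", "goal"])   # [2, 0]
v2 = class_vector(["football"])         # [1, 0]
```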
2. "Distributed term clustering" clusters verbs and nouns separately, so it cannot discover relations between verbs and nouns, such as the relation between "kick" and "ball"; that is, beyond narrow "synonymous similarity", it cannot discover similarity in the broader sense. The "iterative semantic similarity calculation method" is therefore used to find similarity between words in this broader sense: it computes word-word similarity and document-document similarity from each other, seeks an equalization point between the two, and takes the result at that equalization point as the best similarity estimate.
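The equalization-point idea can be illustrated with a deliberately simplified fixed-point iteration. The update rules, the 0.7 damping factor, and the three-document toy corpus are all assumptions for illustration, not the patent's formulas.

```python
# Sketch: word-word and document-document similarities are recomputed
# from each other until the values stop changing (an equalization point),
# letting "kick" and "play" become similar via shared contexts with "ball".
docs = {"d1": ["kick", "ball"], "d2": ["play", "ball"], "d3": ["vote"]}
words = sorted({w for ws in docs.values() for w in ws})

wsim = {(a, b): 1.0 if a == b else 0.0 for a in words for b in words}
dsim = {(a, b): 1.0 if a == b else 0.0 for a in docs for b in docs}

def doc_sim(d1, d2, wsim):
    # average best word match between the two documents
    return sum(max(wsim[(w1, w2)] for w2 in docs[d2])
               for w1 in docs[d1]) / len(docs[d1])

def word_sim(w1, w2, dsim):
    # words are similar when they occur in similar documents
    pairs = [(a, b) for a in docs for b in docs
             if w1 in docs[a] and w2 in docs[b]]
    return max(dsim[p] for p in pairs) if pairs else 0.0

for _ in range(10):  # iterate toward the fixed point
    dsim = {(a, b): doc_sim(a, b, wsim) for a in docs for b in docs}
    wsim = {(a, b): 1.0 if a == b else 0.7 * word_sim(a, b, dsim)
            for a in words for b in words}
```

After a few rounds the similarity of "kick" and "play" stabilises at a nonzero value even though they never co-occur, while unrelated documents keep similarity 0.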
3. The present invention fuses feature selection and category division through iteration, so as to progressively determine the best category feature representation and the best category division result. Fast clustering of large-scale dynamic short texts is achieved through the "local parallel adjustment" and "global synchronization adjustment" operations, and performance is further improved by combining the "term clustering representation" and the "iterative semantic similarity calculation method" described above.
4. The present invention quantifies the clustering models formed by the network before and after the short-text data is updated, and displays the quantized result through labels to present the evolution of information, providing intuitive analysis results.
In conclusion present invention introduces parallelization thought, pass through the operation " local parallel adjustment " of iteration and " global synchronization
Feature selecting and category division are fused together by adjustment ", realize the quick clustering to extensive dynamic short text;And this hair
The bright evolutionary process that different information in network are disclosed with visual tag set, reflects the focus of user in difference with this
Overall variation trend in period can provide guidance for the formulation of government or enterprise's relevant policies, secondly can also draw
Limited man power and material is invested user's focus of attention, the trend correctly to guide public opinion by the development trend figure of information;It is vast
Internet user also quickly can have found that its is interested by the evolutionary process of the different information of comparison from immense information ocean
Information, and grasp the development and change process of the information.
The above is a further description of the present invention in combination with specific preferred technical solutions, but the specific implementation of the present invention is not limited to these descriptions. For those of ordinary skill in the art to which the present invention belongs, simple deductions or substitutions made without departing from the concept of the present invention shall all be regarded as falling within the protection scope of the present invention.
Claims (7)
1. A clustering information evolution analysis method for large-scale dynamic short texts, which first combines the neuron representation of the self-organizing clustering algorithm, representing document categories by neurons; then distributes the neurons representing categories evenly across the machines, so that each machine holds a small-scale local neuron set; then performs local parallel adjustment of the category division result, based on the iterative adjustment idea of the self-organizing clustering algorithm; then performs one global synchronization adjustment after every several local parallel adjustments, thereby completing fast clustering of massive short-text data; and finally, on this basis, obtains the evolution of the different pieces of information contained in the short-text data by analysing and comparing how the clustering model changes across different time periods; characterized in that the said "local parallel adjustment of the category division result" specifically comprises the following steps:
A1. Randomly select a document, denoted di, from the short-text data set to be clustered, which is represented using the distributed term clustering method;
A2. Use the iterative semantic similarity calculation method to compute the similarity between di and each neuron in the local neuron set on the current machine, and select the neuron with maximum similarity to di, denoted nj;
A3. Adjust the feature weights in nj, and use the iterative semantic similarity calculation method to find the neuron most similar to nj in the local neuron set, denoted nb;
A4. Detect whether an edge exists between nj and nb; if no edge exists, create one to connect them; denote the edge between nj and nb as ljb;
A5. Update the weight of ljb, and set the update-time parameter of ljb to 0;
A6. Add 1 to the update-time parameter of every edge between neurons in the local neuron set;
A7. Detect all of the above edges; if an edge's update-time parameter exceeds the average over all edges, delete that edge; then set the iteration count t = t + 1;
A8. Detect the average distance from the short texts to their cluster centres; when this distance is below the convergence threshold u of the clustering process, stop the clustering process and enter the clustering model quantization process; otherwise enter the global synchronization adjustment step or return to the start.
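Steps A1 to A7 can be sketched as a single adjustment pass. This is hypothetical code: plain cosine similarity stands in for the iterative semantic similarity calculation, `lr` is an assumed learning rate, and neurons are simple feature-weight dictionaries.

```python
# Condensed sketch of one local-parallel-adjustment pass (A1-A7).
import math
import random

def cosine(d, n):
    common = set(d) & set(n)
    num = sum(d[k] * n[k] for k in common)
    den = (math.sqrt(sum(v * v for v in d.values()))
           * math.sqrt(sum(v * v for v in n.values())))
    return num / den if den else 0.0

def local_adjust(docs, neurons, edges, ages, lr=0.1):
    doc = random.choice(docs)                            # A1: pick a document
    sims = [cosine(doc, n) for n in neurons]
    j = max(range(len(neurons)), key=lambda i: sims[i])  # A2: winner n_j
    for k, v in doc.items():                             # A3: adjust n_j's weights
        neurons[j][k] = neurons[j].get(k, 0.0) + lr * (v - neurons[j].get(k, 0.0))
    b = max((i for i in range(len(neurons)) if i != j),
            key=lambda i: cosine(neurons[j], neurons[i]))  # A3: nearest neuron n_b
    edge = frozenset((j, b))
    edges.add(edge)                                      # A4: ensure edge l_jb exists
    ages[edge] = 0                                       # A5: reset its age
    for e in edges:                                      # A6: age the other edges
        if e != edge:
            ages[e] = ages.get(e, 0) + 1
    avg = sum(ages.values()) / len(ages)                 # A7: prune stale edges
    for e in [e for e in edges if ages[e] > avg]:
        edges.discard(e)
        ages.pop(e)
    return j
```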
2. The clustering information evolution analysis method for large-scale dynamic short texts according to claim 1, characterized in that the distributed term clustering method takes the mutual information theory of information theory as its basis and selects, as the term clustering result, the word-class division that minimizes the information loss; when one variable is encoded by another variable in information theory, the transmitted information content I is calculated by the following formula:
I(V;N) = Σt πt Σi p(vi|nt) log( p(vi|nt) / p(vi) );
πt = p(nt);
p(nt) = f(nt)/g;  p(vi|nt) = f(vi, nt)/f(nt);
where V and N represent the word variable sets and l and g their sizes; I(V;N) represents the information content transmitted when the variables in V are encoded by the variables in N; nt and vi represent the variables in N and V respectively; πt represents the occurrence probability; p(nt) and p(vi|nt) represent, respectively, the probability that the noun nt occurs in the text and the probability that the verb vi co-occurs with the noun nt in the text; and f(nt) and f(vi, nt) represent, respectively, the frequency of nt and the frequency with which vi and nt occur together.
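Under the definitions above, the transmitted information content I(V;N) can be computed directly from verb-noun co-occurrence counts, as in this sketch. The frequencies are toy data, and the probabilities here are normalised by the total co-occurrence count rather than by g, an assumption made for the illustration.

```python
# Sketch: mutual information I(V;N) from co-occurrence counts f(v_i, n_t).
import math

# Toy verb-noun co-occurrence frequencies f(v_i, n_t)
f = {("kick", "ball"): 8, ("throw", "ball"): 2,
     ("kick", "goal"): 1, ("cast", "vote"): 9}

total = sum(f.values())
f_n, f_v = {}, {}           # marginal counts f(n_t), f(v_i)
for (v, n), c in f.items():
    f_n[n] = f_n.get(n, 0) + c
    f_v[v] = f_v.get(v, 0) + c

def mutual_information():
    """I(V;N) = sum_t p(n_t) sum_i p(v_i|n_t) log( p(v_i|n_t) / p(v_i) )."""
    mi = 0.0
    for (v, n), c in f.items():
        p_n = f_n[n] / total          # p(n_t)
        p_v_given_n = c / f_n[n]      # p(v_i | n_t)
        p_v = f_v[v] / total          # p(v_i)
        mi += p_n * p_v_given_n * math.log(p_v_given_n / p_v)
    return mi

mi = mutual_information()   # positive: verbs carry information about nouns
```

Merging two nouns into one class can only lower this quantity; the method keeps the division whose loss of I is smallest.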
3. The clustering information evolution analysis method for large-scale dynamic short texts according to claim 2, characterized in that the iterative semantic similarity calculation method realises the iterative semantic similarity computation through a gradient-descent iterative process, specifically comprising the following steps:
B1. Initialisation: let V and N represent the verb set and the noun set, l and g their sizes, and k the number of word classes specified by the user;
B2. Determine category membership: re-determine the category membership j*(nt) of each nt according to the corresponding formula;
B3. Determine word classes: form the word class Nj from the nouns mapped to the j-th category:
Nj = {nt : j*(nt) = j}
B4. Determine word-class centres: recompute the class centre of each word class according to the corresponding formula;
B5. Iteration: repeat steps B2 to B4 until the distribution of words among the word classes no longer changes.
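Since the per-step formulas of B2 and B4 are not reproduced in this text, the B1-B5 loop can be approximated by a k-means-style iteration; the context vectors and squared-distance assignment below are illustrative assumptions.

```python
# Sketch of the B1-B5 loop: assign each noun to the nearest class centre
# (B2-B3), recompute the centres (B4), and repeat until the assignment
# stabilises (B5).

def cluster_words(vectors, k, max_iter=100):
    nouns = sorted(vectors)
    centres = [vectors[nouns[i]] for i in range(k)]      # B1: initialise
    assign = {}
    for _ in range(max_iter):                            # B5: iterate
        new_assign = {}
        for n in nouns:                                  # B2: nearest centre
            dists = [sum((a - b) ** 2 for a, b in zip(vectors[n], c))
                     for c in centres]
            new_assign[n] = dists.index(min(dists))
        if new_assign == assign:                         # B5: converged
            break
        assign = new_assign
        for j in range(k):                               # B3-B4: recompute centres
            members = [vectors[n] for n in nouns if assign[n] == j]
            if members:
                centres[j] = [sum(x) / len(members) for x in zip(*members)]
    return assign

# Toy context vectors for four nouns:
vectors = {"ball": [1.0, 0.0], "goal": [0.9, 0.1],
           "vote": [0.0, 1.0], "ballot": [0.1, 0.9]}
parts = cluster_words(vectors, k=2)
# "ball"/"goal" end up in one word class, "vote"/"ballot" in the other
```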
4. The clustering information evolution analysis method for large-scale dynamic short texts according to claim 1, characterized in that the said "global synchronization adjustment" adjusts the feature weights in the neurons according to the distribution of the features, so as to improve clustering accuracy, specifically comprising the following steps:
C1. Compute the within-class distribution of the features in all neurons of the neuron set N;
C2. Compute the between-class distribution of the features in all neurons of the neuron set N;
C3. Merge the within-class and between-class distributions of each feature to obtain the feature's weight;
C4. Sort the features by weight, select the features whose weight exceeds the current feature-selection threshold as the representative features of the neurons, and then enter the local parallel adjustment step.
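A hedged sketch of C1 to C4 follows. The concrete definitions used here (within-class = a feature's peak frequency in one neuron, between-class = the number of neurons containing it, weight = within / between) are illustrative assumptions, not the patent's formulas.

```python
# Sketch of the global-synchronization weighting (C1-C4): a feature scores
# high when concentrated in one neuron and rare across the others.

neurons = [  # feature -> frequency, one dict per neuron (toy data)
    {"ball": 9, "news": 1},
    {"vote": 8, "news": 2},
]

def feature_weights(neurons, threshold=1.0):
    features = {f for n in neurons for f in n}
    weights = {}
    for f in features:
        counts = [n.get(f, 0) for n in neurons]
        within = max(counts)                          # C1: peak frequency in one class
        between = sum(1 for c in counts if c > 0)     # C2: spread across classes
        weights[f] = within / between                 # C3: merge into one weight
    selected = {f for f, w in weights.items() if w > threshold}  # C4: threshold
    return weights, selected

weights, selected = feature_weights(neurons)
# "ball" and "vote" are kept as representative features; "news" is not
```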
5. The clustering information evolution analysis method for large-scale dynamic short texts according to any one of claims 1 to 4, characterized in that the method further uses a dynamic topological structure to carry out the local parallel adjustment and the global synchronization adjustment; the dynamic topological structure connects different neurons by edges and then, as the short texts change, dynamically adjusts the neurons by inserting and deleting edges between them.
6. The clustering information evolution analysis method for large-scale dynamic short texts according to any one of claims 1 to 4, characterized in that the said "analysing and comparing how the clustering model changes across different time periods to obtain the evolution of the different pieces of information contained in the short-text data" specifically means: using a grid to quantify the magnitude of change of the clustering model across different time periods, and using labels to reveal the content changes of the clustering model in different time periods, so as to quantify the evolution of the different pieces of information contained in the massive short-text data.
7. The clustering information evolution analysis method for large-scale dynamic short texts according to claim 6, characterized in that the said "using a grid to quantify the magnitude of change of the clustering model across different time periods, and using labels to reveal the content changes of the clustering model in different time periods, so as to quantify the evolution of the different pieces of information contained in the massive short-text data" specifically comprises the following steps:
D1. Let Gt1 be the network formed from the short-text data of time period t1, and Gt2 the network formed from the short-text data of time period t2;
D2. Compute the dense grids DGt1 and DGt2 stored in Gt1 and Gt2, and obtain the following three subsets: DGt1-DGt2, DGt2-DGt1, DGt1∩DGt2;
where DGt1-DGt2 represents the information contained only in the short-text data of period t1, DGt2-DGt1 represents the information contained only in the short-text data of period t2, and DGt1∩DGt2 represents the information that remains unchanged from period t1 to period t2.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310716896.9A CN104731811B (en) | 2013-12-20 | 2013-12-20 | A kind of clustering information evolution analysis method towards extensive dynamic short text |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104731811A CN104731811A (en) | 2015-06-24 |
CN104731811B true CN104731811B (en) | 2018-10-09 |
Family
ID=53455708
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310716896.9A Expired - Fee Related CN104731811B (en) | 2013-12-20 | 2013-12-20 | A kind of clustering information evolution analysis method towards extensive dynamic short text |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104731811B (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105183804B (en) * | 2015-08-26 | 2018-12-28 | 陕西师范大学 | A kind of cluster method of servicing based on ontology |
CN106067029B (en) * | 2016-05-24 | 2019-06-18 | 哈尔滨工程大学 | The entity classification method in data-oriented space |
CN106776748A (en) * | 2016-11-17 | 2017-05-31 | 天津大学 | A kind of solution for harmful information monitoring in social network data |
CN110008334B (en) * | 2017-08-04 | 2023-03-14 | 腾讯科技(北京)有限公司 | Information processing method, device and storage medium |
CN110276375B (en) * | 2019-05-14 | 2021-08-20 | 嘉兴职业技术学院 | Method for identifying and processing crowd dynamic clustering information |
CN114579739B (en) * | 2022-01-12 | 2023-05-30 | 中国电子科技集团公司第十研究所 | Topic detection and tracking method for text data stream |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1808474A (en) * | 2006-03-02 | 2006-07-26 | 哈尔滨工业大学 | Self-organized mapping network based document clustering method |
CN101408893A (en) * | 2008-11-26 | 2009-04-15 | 哈尔滨工业大学 | Method for rapidly clustering documents |
CN101751458A (en) * | 2009-12-31 | 2010-06-23 | 暨南大学 | Network public sentiment monitoring system and method |
EP2639749A1 (en) * | 2012-03-15 | 2013-09-18 | CEPT Systems GmbH | Methods, apparatus and products for semantic processing of text |
CN103390051A (en) * | 2013-07-25 | 2013-11-13 | 南京邮电大学 | Topic detection and tracking method based on microblog data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20181009 Termination date: 20191220 |