CN107609102A - A short text online clustering method - Google Patents
A short text online clustering method
- Publication number
- CN107609102A CN107609102A CN201710816052.XA CN201710816052A CN107609102A CN 107609102 A CN107609102 A CN 107609102A CN 201710816052 A CN201710816052 A CN 201710816052A CN 107609102 A CN107609102 A CN 107609102A
- Authority
- CN
- China
- Prior art keywords
- class
- similarity
- short text
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention discloses a short text online clustering method. To address the low accuracy of existing online clustering methods, the application applies an improved incremental short-text clustering procedure in which the similarity threshold varies dynamically with the number of social short texts contained in each class, increasing the flexibility of clustering; it incorporates short-text semantic similarity to further refine the incremental clusters; and it introduces re-clustering, class merging, and class pruning to address the class-center drift inherent to online clustering and the poor aggregation behavior of short texts.
Description
Technical field
The invention belongs to the field of data mining, and in particular to short-text clustering techniques.
Background technology
With the arrival of the Web 2.0 era, microblogging services have attracted growing attention and large numbers of users. Twitter and Sina Weibo are among the most successful examples; statistics indicate that some 400 million short-text messages are generated on Twitter every day. Analyzing this content yields much valuable information, which helps organizations such as companies and government departments make important decisions. Short-text online clustering is an important means of extracting valuable information from these social short texts in real time.
There are many text clustering algorithms, falling mainly into non-incremental and incremental text clustering. Non-incremental clustering must process all data in a single batch; it cannot accept data in installments or produce an overall result incrementally. Incremental clustering allows data to arrive in multiple batches, and an incremental method must be able to continue clustering new data on top of the previous result. Non-incremental text clustering includes hierarchical, partition-based, density-based, grid-based, and model-based clustering, as well as methods such as self-organizing neural networks and ant-colony clustering. All of these can be used for text clustering, but hierarchical and partition-based clustering are the most common in text clustering tasks. Typical incremental text clustering methods fall into two kinds: algorithms obtained by converting a traditional non-incremental algorithm into an incremental one through extra computation, and single-pass greedy algorithms, the kind most often used in modern incremental text clustering, which process each data item only once.
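The single-pass greedy procedure can be sketched as follows. This is an illustrative Python sketch, not the patent's implementation: the fixed `threshold`, the sparse-dict vectors, and the toy documents are assumptions made for the example, and the application itself later replaces the fixed threshold with a dynamic one.

```python
import math

def cosine(a, b):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def single_pass(texts, threshold=0.5):
    """Assign each text vector to the most similar existing cluster,
    or open a new cluster if no similarity reaches the threshold."""
    centers, members = [], []            # running center vectors and member index lists
    for i, vec in enumerate(texts):
        sims = [cosine(vec, c) for c in centers]
        best = max(range(len(sims)), key=sims.__getitem__, default=None)
        if best is not None and sims[best] >= threshold:
            members[best].append(i)
            for t, w in vec.items():     # center = accumulated word-frequency vector
                centers[best][t] = centers[best].get(t, 0.0) + w
        else:
            centers.append(dict(vec))    # open a new cluster for this text
            members.append([i])
    return members

docs = [{"storm": 1, "city": 1}, {"storm": 1, "rain": 1}, {"football": 1, "match": 1}]
clusters = single_pass(docs, threshold=0.3)
```

Each document is seen exactly once, which is what makes the method suitable for streams but also what causes the order sensitivity and center drift discussed later.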
Hierarchical clustering divides into top-down divisive and bottom-up agglomerative methods. The divisive method initially places all data points in one class and splits classes level by level according to some distance criterion until a termination condition is met or every data point stands alone. The agglomerative method initially treats each data point as its own class and merges classes level by level according to some similarity criterion until all points belong to one class. The advantage of hierarchical clustering is a clear data organization whose granularity can be adjusted via a threshold or cluster entropy; its drawback is comparatively high computational cost, with time and space complexity that grow at least quadratically in the number of data points. Representative algorithms include BIRCH, CURE, and CHAMELEON.
The most typical partition-based algorithm is k-means. Its steps are: randomly select k initial class centers; assign each data point to the nearest center according to its distance to each class center; recompute the centroid of the points in each class as the new class center; and repeat the assignment and center computation until the result is stable or a fixed number of iterations is reached. K-means essentially combines a greedy strategy with the EM algorithm. Its drawbacks are also apparent: the number of classes must be specified but is usually hard to determine, the choice of initial centers affects the clustering result, and both partition-based and hierarchical clustering are highly sensitive to outlier data points.
The Self-Organizing Map (SOM) neural network is an unsupervised neural network algorithm proposed by Kohonen et al. The network consists of an input layer and a competition layer, and each competition-layer neuron is initially assigned a small random weight vector. A document vector, a vector of fixed dimension, is then selected from the training set and fed to the input; the similarity between the input vector and each neuron is computed, the most similar neuron wins, and the weights of the winning neuron and its neighbors are adjusted to make the winner more like the input. Through training on many texts, the output-layer neurons come to recognize distinct patterns in the data. The method is insensitive to noise points and yields high-quality clusters, but its time complexity is higher than that of k-means. The output-layer neurons have a fixed topology, and both how to design that topology and how to determine the number of output neurons are open problems. The output topology should be close to the inherent structure of the data classes, yet what kind of topological relation holds between text classes is itself an open question. The number of output neurons determines the number of classes and, as with k-means, a class count that is too large or too small severely degrades the clustering result, while in many cases the appropriate number of classes cannot be known in advance.
Summary of the invention
To solve the above technical problems, the present application proposes a short text online clustering method that clusters short texts through an improved greedy short-text clustering procedure and introduces re-clustering, class merging, and class pruning, further improving the accuracy of online clustering.
The technical scheme adopted by the application is a short text online clustering method comprising:
S1: preprocess the acquired social short texts and extract text features; the text features include the words in the short text, the part of speech corresponding to each word, and named-entity tags.
S2: compute short-text similarity from the text features using a vector space model.
S3: if the maximum similarity computed in step S2 exceeds the upper bound of a preset first threshold range, add the text to the known class corresponding to the maximum similarity; if the maximum similarity computed in step S2 is below the lower bound of the first threshold range, create a new class; otherwise perform step S4.
S4: compute, by a semantic method, the similarity between the text and the class corresponding to the maximum similarity of step S2; if this similarity exceeds the upper bound of the first threshold range, add the text to the known class corresponding to the maximum similarity; if the computed similarity is below the lower bound of the first threshold range, create a new class.
S5: for each class obtained in step S4, compute the similarity between each short text in the class and the current class center vector; for any short text whose similarity is below the lower bound of the first threshold range, return to step S2 to reclassify it: if a most similar class is found, delete the short text from its current class and add it to the most similar class; if no most similar class is found, create a new class.
S6: perform a merging operation on the classes obtained after step S5.
S7: perform a pruning operation on the classes processed by step S6.
Further, step S2 is specifically: build the short-text vector from the text features, then compute the cosine similarity between the short-text vector and each known class center vector.
Further, step S4 is specifically:
Text similarity is computed based on WordNet, with the expression:
Sim_Li = e^(-αl) · (e^(βh) - e^(-βh)) / (e^(βh) + e^(-βh))
where α is a smoothing factor, β is a scale factor adjusting the contribution of the two measures, distance and depth, to the similarity, h is the depth of the lowest common subsumer node of the two concept nodes, and l is the distance between the two concept nodes;
or:
Text similarity is computed based on Word2vec, specifically: each word is represented as a vector of fixed dimension, the text vector is then computed from the word vectors, and the similarity between texts is computed by cosine similarity.
Further, step S6 is specifically: compute the information entropy of each class produced by step S5 and the similarity of every pair of classes; if the similarity of two classes exceeds a second threshold, compute the information entropy of the class that would result from merging them; if the change in information entropy does not exceed 0.1, merge the two classes; otherwise do not merge.
Further, step S7 is specifically: sort all classes in ascending order of class center time after the processing of step S6, and delete any class beyond the class expiry time limit;
when the total number of classes exceeds the upper limit, first sort the classes in ascending order of the number of social short texts they contain, then sort again in ascending order of class update time, and delete the front-most classes of the second sort in turn until the total number of classes is below the limit.
Beneficial effects of the present invention: the short text online clustering method of the present invention clusters short texts through an improved incremental short-text clustering procedure in which the similarity threshold varies dynamically with the number of social short texts contained in each class, increasing the flexibility of clustering; it incorporates short-text semantic similarity to further refine the incremental clusters; and it introduces re-clustering, class merging, and class pruning to address the class-center drift inherent to online clustering and the poor aggregation behavior of short texts. The method of this application improves the accuracy of online clustering, helps obtain valuable information from social media faster and more accurately, and can yield direct or indirect economic benefit.
Brief description of the drawings
Fig. 1 is a flow chart of the scheme of the present invention.
Embodiments
To help those skilled in the art understand the technical content of the present invention, the present invention is further explained below with reference to the accompanying drawings.
As shown in Fig. 1, the flow chart of the application's scheme, the technical scheme of the application is a short text online clustering method comprising:
S1: preprocess the acquired social short texts and extract text features; the text features include the words in the short text, the part of speech corresponding to each word, and named-entity tags.
S2: compute short-text similarity from the text features using a vector space model.
S3: if the maximum similarity computed in step S2 exceeds the upper bound of the preset similarity threshold range, add the text to the known class corresponding to the maximum similarity; if the maximum similarity computed in step S2 is below the lower bound of the similarity threshold range, create a new class; otherwise perform step S4.
S4: compute, by a semantic method, the similarity between the text and the class corresponding to the maximum similarity of step S2; if this similarity exceeds the upper bound of the similarity threshold range, add the text to the known class corresponding to the maximum similarity; if the computed similarity is below the lower bound, create a new class.
S5: for each class obtained in step S4, compute the similarity between each short text in the class and the current class center vector; for any short text whose similarity is below the threshold, return to step S2 to reclassify it: if a most similar class is found, delete the short text from its current class and add it to that class; if none is found, create a new class.
S6: perform a merging operation on the classes obtained after step S5.
S7: perform a pruning operation on the classes processed by step S6.
Step S1 is specifically: social short texts are read one by one from the data stream; preprocessing converts each raw short text into canonical form, and the short-text features are then extracted, specifically the words in the short text, the part of speech corresponding to each word, and named-entity tags.
Step S2 is specifically: build the short-text vector from the features extracted in step S1. The short-text vector is a weighted word-frequency vector, weighted by each word's part of speech and named-entity type, so that words of different parts of speech, and named entities, carry different importance. Typical weights are set as follows: nouns and verbs are fixed at 1.0; person, place, and organization names among the named entities are weighted 1.2; other words such as adjectives and adverbs are weighted 0.5. The short-text vector is the basis of short-text similarity computation and is also used to build the weighted short-text semantic vector in semantic similarity computation.
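The weighting scheme just described (nouns and verbs 1.0, person/place/organization names 1.2, adjectives, adverbs, and other words 0.5) can be sketched as follows. The `(word, pos, is_named_entity)` token format is an assumption of this sketch; in practice these labels would come from the POS tagger and named-entity recognizer of step S1.

```python
# Illustrative weights, matching the values stated in the text.
POS_WEIGHT = {"noun": 1.0, "verb": 1.0, "adj": 0.5, "adv": 0.5}
NE_WEIGHT = 1.2   # person, place, and organization names

def build_text_vector(tokens):
    """tokens: list of (word, pos, is_named_entity) triples.
    Returns a weighted word-frequency vector as a {word: weight} dict."""
    vec = {}
    for word, pos, is_ne in tokens:
        w = NE_WEIGHT if is_ne else POS_WEIGHT.get(pos, 0.5)
        vec[word] = vec.get(word, 0.0) + w
    return vec

toks = [("storm", "noun", False), ("london", "noun", True), ("hits", "verb", False),
        ("heavy", "adj", False), ("storm", "noun", False)]
vec = build_text_vector(toks)
```

Repeated words accumulate weight, so the vector is a weighted term-frequency vector as the text describes.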
The class center vector is built by summing the vectors of the social short texts gathered in the class, so it too is based on word frequency. When a social short text is removed from a class, its vector must be subtracted from the class center vector. The class center vector is used to build the weighted class-center semantic vector and is the basis of similarity computation. The class center vector is length-limited, because the length of a typical social short text is itself restricted: it is usually capped at 25 words, since the content of one event can be described clearly within 25 words. This also keeps the class center vector length close to the social short-text vector length and prevents the growth of marginal words in the class from gradually increasing the dissimilarity between the social short texts and the class.
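A sketch of the class-center bookkeeping just described: text vectors are accumulated into and subtracted from the center vector, and the center is capped at 25 words. The text does not say which words are dropped when the cap is exceeded; keeping the highest-weight words is an assumption of this sketch.

```python
def add_to_center(center, text_vec, max_len=25):
    """Accumulate a short-text vector into the class center vector, then
    keep only the max_len highest-weight words (the 25-word cap; which
    words to drop is not specified in the text, so top-weight is assumed)."""
    for t, w in text_vec.items():
        center[t] = center.get(t, 0.0) + w
    if len(center) > max_len:
        top = sorted(center.items(), key=lambda kv: kv[1], reverse=True)[:max_len]
        center.clear()
        center.update(top)
    return center

def remove_from_center(center, text_vec):
    """Subtract a removed text's vector from the center (drop zeroed words)."""
    for t, w in text_vec.items():
        if t in center:
            center[t] -= w
            if center[t] <= 0:
                del center[t]
    return center

c = {}
add_to_center(c, {"storm": 1.0, "rain": 1.0})
add_to_center(c, {"storm": 1.0, "wind": 0.5})
remove_from_center(c, {"rain": 1.0})
```

After these calls the center retains only the words still supported by its members.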
The similarity threshold takes a value that varies with the size of each class. When a class contains few social short texts, its theme is generally not yet evident and the class center vector is spread relatively evenly over many words; social short texts with a similar theme then have low similarity to the class center, so lowering the threshold helps cluster them together. As the social short texts in the class increase, the theme becomes clear and the class center vector starts to tilt toward the topic words, so raising the threshold helps exclude social short texts with dissimilar themes. The threshold is computed by the piecewise function of formula (1): it has an upper bound and a lower bound, the upper bound being 0.6 and the lower bound 0.4 in this application, and, taking a class of 20 social short texts as the point of maximum threshold, the similarity threshold is stretched uniformly between 0.4 and 0.6.
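Formula (1) itself is not reproduced in this text; a piecewise function consistent with the description (lower bound 0.4, upper bound 0.6, reached at 20 social short texts, stretched uniformly in between) would be:

```python
def similarity_threshold(n_texts, lo=0.4, hi=0.6, n_max=20):
    """Dynamic per-class similarity threshold: stretched linearly between
    lo and hi as the class grows, capped at hi once n_max texts are reached.
    The exact form of the patent's formula (1) is not shown in the text;
    linear interpolation is one consistent reading, assumed here."""
    if n_texts >= n_max:
        return hi
    return lo + (hi - lo) * n_texts / n_max
```

A small new class is thus judged leniently (near 0.4), while a mature class with a clear theme is judged strictly (0.6).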
Similarity computation in clustering is mainly the computation of similarity between a short text and a class. The VSM method uses the cosine similarity of the social short-text vector and the class center vector, while the semantic method combines the WordNet and Word2vec semantic similarity computations mentioned above. The cosine similarity with each known class is computed first using the VSM method.
Step S3 is specifically: judge whether the maximum similarity computed in step S2 lies within the given similarity threshold range (0.46 to 0.61). If it exceeds the upper bound of the threshold, add the text to the known class corresponding to the maximum similarity; if it is below the lower bound of the threshold, create a new class; otherwise perform step S4.
Step S4 is specifically: computing text similarity based on WordNet is a semantic method, i.e. using WordNet as a tool to compute the similarity of words and thereby obtain text similarity. WordNet is an English lexical database based on cognitive science, developed at Princeton University. It organizes a large English vocabulary into a term network whose basic unit is the synset, a set of semantically equivalent words sharing the same concept. Synsets are connected by various relations to form the term network, with nouns, verbs, adjectives, and adverbs organized into essentially independent networks. Commonly used methods include the following:
Li similarity is computed from the distance between two concept nodes and the depth of the lowest common subsumer of the two concepts, as in formula (2). In the formula, α is a smoothing factor; β is a scale factor adjusting the contribution of the two measures, distance and depth, to the similarity; h is the depth of the lowest common subsumer node of the two concept nodes; and l is the distance between the two concept nodes. Li similarity increases with h and decreases with l: the deeper the lowest common subsumer, the more specific the shared concept, and since equal distances at a specific level imply greater closeness than at an abstract level, the measure is an improvement on shortest-path methods. By combining depth and distance with adjustable weighting, the Li method is among the better measures of word similarity.
Wu & Palmer similarity is computed from the depths in the network of the two concept nodes corresponding to the two words and the depth of their lowest common subsumer, i.e. their last common ancestor node, and therefore also reflects the hierarchical position of the nodes, as in formula (3).
The Word2vec-based method represents each word as a fixed-dimension vector, computes the text vector from the word vectors, and finally computes the similarity between texts by cosine similarity. Word2vec is an efficient open-source tool released by Google in 2013 for representing words as real-valued vectors. It applies ideas from deep learning: after training, the processing of text content is reduced to vector operations in a K-dimensional vector space, and similarity in that space can represent semantic similarity between texts. The word vectors Word2vec produces can be used for many natural-language-processing tasks, such as clustering, finding synonyms, and part-of-speech analysis.
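A sketch of the Word2vec-based text similarity just described: average the word vectors of a text, then compare texts by cosine similarity. The 2-dimensional "embeddings" below are invented toy values standing in for real Word2vec output.

```python
import math

def text_vector(words, wv):
    """Average the (pretrained) word vectors of the words present in wv."""
    vecs = [wv[w] for w in words if w in wv]
    if not vecs:
        return [0.0] * len(next(iter(wv.values())))
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Toy 2-D "embeddings" standing in for real Word2vec output (hypothetical values).
wv = {"storm": [1.0, 0.1], "rain": [0.9, 0.2],
      "football": [0.1, 1.0], "match": [0.2, 0.9]}
sim_weather = cos(text_vector(["storm", "rain"], wv), text_vector(["storm"], wv))
sim_cross = cos(text_vector(["storm", "rain"], wv), text_vector(["football", "match"], wv))
```

Texts about the same topic land close together in the vector space even when they share no exact words, which is what the semantic step of S4 exploits.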
Step S5 is specifically: the application uses re-clustering to solve the class-center drift problem. Center drift makes the result of greedy incremental clustering depend on the input order of the texts; the phenomenon arises mainly during the incremental process, where, as more and more social short texts accumulate, the class center gradually shifts until short texts gathered earlier are no longer similar to it. Incremental clustering therefore needs to adjust the members of each class after a certain volume of data or a certain interval of time.
The main procedure is: for each class, compute the similarity of each short text in the class to the current class center vector; for any short text below the lower bound 0.46 of the first threshold range, return to step 2 to find a more similar class. If a most similar class is found, delete the target short text from its current class and add it to that class; if no most similar class is found, create a new class.
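The re-clustering pass can be sketched as follows, with a toy shared-word similarity standing in for the VSM cosine. Note that this sketch does not update the center vectors as members move, which a full implementation would.

```python
def overlap(text_vec, center_vec):
    """Toy similarity (shared-word ratio); a stand-in for the VSM cosine."""
    shared = set(text_vec) & set(center_vec)
    return len(shared) / max(len(text_vec), len(center_vec))

def recluster(classes, centers, sim, lower=0.46):
    """Move any member whose similarity to its own class center falls below
    `lower` into the most similar other class (if that beats `lower`),
    otherwise into a new singleton class.  Vectors are {term: weight} dicts.
    Centers are left stale here for brevity; a real pass would update them."""
    for ci in range(len(centers)):
        for text in list(classes[ci]):
            if sim(text, centers[ci]) >= lower:
                continue
            classes[ci].remove(text)                       # drifted: no longer fits
            cands = [(sim(text, c), j) for j, c in enumerate(centers) if j != ci]
            best_sim, best_j = max(cands, default=(0.0, -1))
            if best_sim >= lower:
                classes[best_j].append(text)
            else:
                classes.append([text])                     # becomes a new class
                centers.append(dict(text))
    return classes

classes = [[{"storm": 1, "rain": 1}, {"football": 1, "match": 1}],
           [{"goal": 1, "match": 1}]]
centers = [{"storm": 1, "rain": 1}, {"goal": 1, "match": 1}]
recluster(classes, centers, overlap)
```

The misplaced sports text leaves the weather class and joins the class whose center it actually resembles.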
Step S6 is specifically: the application uses class merging to address the outliers introduced by the re-clustering method. In practice some outliers can be merged back into existing classes, while others cannot be merged into a suitable class no matter how many rounds of re-clustering and merging are performed; handling these outliers promptly helps improve processing speed during clustering.
Because length limits leave a social short text with few word features, a text describing a theme may cover only part of the theme's content, so short texts on the same theme may fail to cluster together owing to small differences in the words they contain. As the number of social short texts in each class gradually increases, however, each class center drifts toward the overall theme information, which may make classes on the same theme more and more similar. In addition, the re-clustering step adjusts dissimilar members within classes, changing the word tendency of some classes and generating some outliers. It is therefore necessary to merge similar classes after re-clustering.
The application uses the information entropy of a class to weigh the similarity of two classes, on the principle that when two classes are similar, merging one into the other will not increase the entropy much. Suppose B and C are the two classes to be merged and the entropy-change threshold is 0.1: if the entropy change of the merged class, computed after merging B and C, is below the threshold, the two classes are similar and may be merged; otherwise they are dissimilar and may not be merged.
The two indicators for merging similar classes are class similarity and entropy change. The entropy is obtained from the class center vector. The computation normalizes the class center vector, dividing each dimension by the vector's sum so that the dimensions total 1, as in formula (4); the normalized class center vector then approximates the probability distribution of word occurrence in the class, and the class entropy is computed from this approximate distribution, as in formula (5). An entropy increase after merging indicates that the theme of the class has become less definite, while a decrease indicates a more definite theme, so classes may be merged when the entropy increase is small.
An exception is that the entropy can also decrease when the center of the merged class becomes dominated by one or two words, but such a merge is generally incorrect, since one or two words typically cannot summarize a theme. To eliminate this error, the similarity of the two classes must be compared first, and merging is allowed only when the classes are similar enough. The merging method is therefore: first compute the similarity of the two classes; only when they are similar enough (greater than the class-similarity threshold 0.51, i.e. the second threshold) compute the entropy change of the merge; and merge the two classes when the entropy increases little or decreases. The entropy condition reduces the mixing of noise when classes are merged by similarity alone, keeping themes sharp rather than blurred, while the similarity condition reduces erroneous over-aggregation when merging by entropy alone; the two conditions complement each other.
Step S7 is specifically: the application removes noise from the clusters by pruning. Social short texts contain a great deal of noise, present both in ordinary classes and in outliers. Preprocessing has already filtered out many social short texts unrelated to any event, but noise unrelated to events still remains, part of which becomes outliers; some event-related social short texts also become outliers because they differ greatly from other related texts, and these outliers can likewise be treated as noise. Outliers are numerous in short-text clustering, and handling them speeds up clustering and increases its accuracy.
The specific pruning method is to prune classes after re-clustering and class merging, in two steps. First, all classes are sorted in ascending order of class center time, and classes beyond the class expiry time limit are deleted. Second, when the total number of classes exceeds the upper limit, the classes are sorted by the number of social short texts in each class and by update time, with ascending count of social short texts as the first ordering rule and ascending class update time as the second, and the front-most classes are deleted in turn until the number of classes falls below the limit.
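The two-step pruning can be sketched as follows; the class record fields (`center_time`, `update_time`, `texts`), the one-day expiry, and the cap value in the demo are illustrative assumptions.

```python
def prune(classes, now, expiry=86_400, max_classes=1000):
    """Two-step pruning: (1) drop classes whose center time is older than the
    expiry window (e.g. one day); (2) if still over the cap, drop classes in
    order of (member count ascending, update time ascending) until under it.
    `classes` is a list of dicts with 'center_time', 'update_time', 'texts'."""
    kept = [c for c in classes if now - c["center_time"] <= expiry]
    if len(kept) > max_classes:
        kept.sort(key=lambda c: (len(c["texts"]), c["update_time"]))
        kept = kept[len(kept) - max_classes:]   # delete the front-most classes
    return kept

now = 1_000_000
cs = [
    {"center_time": now - 90_000, "update_time": now - 90_000, "texts": [1]},        # expired
    {"center_time": now - 100, "update_time": now - 100, "texts": [1, 2, 3]},
    {"center_time": now - 200, "update_time": now - 200, "texts": [1]},              # smallest
    {"center_time": now - 50, "update_time": now - 50, "texts": [1, 2]},
]
survivors = prune(cs, now, expiry=86_400, max_classes=2)
```

The expired class and the smallest surviving class are dropped; the larger, fresher classes remain.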
Pruning is mainly based on the following indicators:
(1) Class expiry time limit. When the class center time exceeds the expiry limit, the event the class describes happened a significant time ago and has lost its value for event detection, so the class can be deleted. The expiry limit is generally set to a suitable value according to the time requirements of event detection; one or two days is appropriate.
(2) Class last-update time. Each class has its own update time, the computer time at which a member of the class last changed. The later the update time, the more likely the class is to be updated again, so classes with earlier update times should be deleted first when classes are deleted.
(3) Upper limit on the total number of classes. The upper limit keeps the total number of classes within range; too many classes may make clustering too slow or exceed the computer's processing and storage capacity. The limit is determined by the actual hardware performance.
Those of ordinary skill in the art will appreciate that the embodiments described here are intended to help the reader understand the principles of the present invention, and it should be understood that the protection scope of the present invention is not limited to these particular statements and embodiments. Various modifications and variations of the present invention are possible for those skilled in the art; any modification, equivalent substitution, or improvement made within the spirit and principles of the invention shall be included within the scope of the present claims.
Claims (5)
- 1. A short text online clustering method, characterized by comprising:
S1, preprocessing the acquired social short texts and extracting text features, the text features including the words in the short text, the part of speech corresponding to each word, and named-entity tags;
S2, computing short-text similarity from the text features using a vector space model;
S3, if the maximum similarity computed in step S2 exceeds the upper bound of a preset first threshold range, adding the text to the known class corresponding to the maximum similarity; if the maximum similarity computed in step S2 is below the lower bound of the first threshold range, creating a new class; otherwise performing step S4;
S4, computing, by a semantic method, the similarity between the text and the class corresponding to the maximum similarity computed in step S2; if this similarity exceeds the upper bound of the first threshold range, adding the text to the known class corresponding to the maximum similarity; if the computed similarity is below the lower bound of the first threshold range, creating a new class;
S5, for each class obtained in step S4, computing the similarity between each short text in the class and the current class center vector; for any short text whose similarity is below the lower bound of the first threshold range, returning to step S2 to reclassify it: if a most similar class is found, deleting the short text from the current class and adding it to the most similar class; if no most similar class is found, creating a new class;
S6, performing a merging operation on the classes obtained after step S5;
S7, performing a pruning operation on the classes processed by step S6.
- 2. The short text online clustering method according to claim 1, characterized in that step S2 is specifically: building the short-text vector from the text features; and computing the cosine similarity between the short-text vector and each known class center vector.
- 3. The short text online clustering method according to claim 1, characterized in that step S4 is specifically: computing text similarity based on WordNet, with the expression Sim_Li = e^(-αl) · (e^(βh) - e^(-βh)) / (e^(βh) + e^(-βh)), where α is a smoothing factor, β is a scale factor adjusting the contribution of the two measures, distance and depth, to the similarity, h is the depth of the lowest common subsumer node of the two concept nodes, and l is the distance between the two concept nodes; or computing text similarity based on Word2vec, specifically: representing each word as a vector of fixed dimension, computing the text vector from the word vectors, and computing the similarity between texts by cosine similarity.
- 4. The short text online clustering method according to claim 1, characterized in that step S6 is specifically: computing the information entropy of each class processed by step S5 and the similarity of every pair of classes; if the similarity of two classes exceeds a second threshold, computing the information entropy of the class that would result from merging them; if the change in information entropy does not exceed 0.1, merging the two classes; otherwise not merging.
- 5. The online short-text clustering method according to claim 1, characterized in that step S7 specifically comprises: sorting all classes in ascending order of class-center time after step S6, and deleting the expired classes that exceed the class time limit; when the total number of classes exceeds an upper limit, first sorting the classes in ascending order of the number of social short texts they contain, then sorting them again in ascending order of class update time, and finally deleting the front-most classes after the second sort, one by one, until the total number of classes falls below the upper limit.
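The two-stage pruning of claim 5 can be sketched as below. The `Cluster` record and its field names (`created` for the class-center time, `updated`, `size`) are hypothetical; the two consecutive stable sorts mirror the claim's "first sort by text count, then re-sort by update time" wording.

```python
from dataclasses import dataclass

@dataclass
class Cluster:
    created: float   # class-center time (illustrative field name)
    updated: float   # last update time
    size: int        # number of social short texts in the class

def prune(clusters, now, ttl, max_classes):
    """Drop expired classes, then trim to max_classes (claim 5's two stages)."""
    # Stage 1: delete classes whose class-center time has exceeded the limit.
    alive = [c for c in clusters if now - c.created <= ttl]
    # Stage 2: if still over the limit, sort by text count (ascending), then
    # stably re-sort by update time (ascending), so the front of the list holds
    # the least-recently-updated (and, on ties, smallest) classes to delete.
    if len(alive) > max_classes:
        alive.sort(key=lambda c: c.size)
        alive.sort(key=lambda c: c.updated)  # stable: ties keep size order
        alive = alive[len(alive) - max_classes:]
    return alive
```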
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710816052.XA CN107609102A (en) | 2017-09-12 | 2017-09-12 | A kind of short text on-line talking method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107609102A true CN107609102A (en) | 2018-01-19 |
Family
ID=61063065
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710816052.XA Pending CN107609102A (en) | 2017-09-12 | 2017-09-12 | A kind of short text on-line talking method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107609102A (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106383877A (en) * | 2016-09-12 | 2017-02-08 | 电子科技大学 | On-line short text clustering and topic detection method of social media |
CN106649853A (en) * | 2016-12-30 | 2017-05-10 | 儒安科技有限公司 | Short text clustering method based on deep learning |
Non-Patent Citations (5)
Title |
---|
YUHUA LI et al.: "An Approach for Measuring Semantic", IEEE Transactions on Knowledge and Data Engineering *
ZHANG MENG et al.: "A Text Clustering Method Based on Automatic Threshold Discovery", Journal of Computer Research and Development *
FANG XINGXING et al.: "Research on Network Public Opinion Topic Detection Based on Improved Single-Pass", Computer and Digital Engineering *
ZHAO XIAONAN et al.: "Design of a Single-Pass-Based Military Network Public Opinion Monitoring System", Electronic Design Engineering *
QIU YUNFEI et al.: "Research on Burst Topic Detection Methods for Microblogs", Development Research and Design Technology *
Cited By (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109033084A (en) * | 2018-07-26 | 2018-12-18 | 国信优易数据有限公司 | A kind of semantic hierarchies tree constructing method and device |
CN109145114A (en) * | 2018-08-29 | 2019-01-04 | 电子科技大学 | Social networks event detecting method based on Kleinberg presence machine |
CN109145114B (en) * | 2018-08-29 | 2021-08-03 | 电子科技大学 | Social network event detection method based on Kleinberg online state machine |
CN110888978A (en) * | 2018-09-06 | 2020-03-17 | 北京京东金融科技控股有限公司 | Article clustering method and device, electronic equipment and storage medium |
CN109189910A (en) * | 2018-09-18 | 2019-01-11 | 哈尔滨工程大学 | A kind of label auto recommending method towards mobile application problem report |
CN109492098A (en) * | 2018-10-24 | 2019-03-19 | 北京工业大学 | Target corpus base construction method based on Active Learning and semantic density |
CN109492098B (en) * | 2018-10-24 | 2022-05-06 | 北京工业大学 | Target language material library construction method based on active learning and semantic density |
CN109299270A (en) * | 2018-10-30 | 2019-02-01 | 云南电网有限责任公司信息中心 | A kind of text data unsupervised clustering based on convolutional neural networks |
CN109710762A (en) * | 2018-12-26 | 2019-05-03 | 南京云问网络技术有限公司 | A kind of short text clustering method merging various features weight |
CN109710762B (en) * | 2018-12-26 | 2023-08-01 | 南京云问网络技术有限公司 | Short text clustering method integrating multiple feature weights |
CN110362685A (en) * | 2019-07-22 | 2019-10-22 | 腾讯科技(武汉)有限公司 | Clustering method and cluster equipment |
CN110442726A (en) * | 2019-08-15 | 2019-11-12 | 电子科技大学 | Social media short text on-line talking method based on physical constraints |
CN110442726B (en) * | 2019-08-15 | 2022-03-04 | 电子科技大学 | Social media short text online clustering method based on entity constraint |
CN110765329A (en) * | 2019-10-28 | 2020-02-07 | 北京天融信网络安全技术有限公司 | Data clustering method and electronic equipment |
CN111124578A (en) * | 2019-12-23 | 2020-05-08 | 中国银行股份有限公司 | User interface icon generation method and device |
CN111124578B (en) * | 2019-12-23 | 2023-09-29 | 中国银行股份有限公司 | User interface icon generation method and device |
CN111414479A (en) * | 2020-03-16 | 2020-07-14 | 北京智齿博创科技有限公司 | Label extraction method based on short text clustering technology |
CN111414479B (en) * | 2020-03-16 | 2023-03-21 | 北京智齿博创科技有限公司 | Label extraction method based on short text clustering technology |
CN112100986A (en) * | 2020-11-10 | 2020-12-18 | 北京捷通华声科技股份有限公司 | Voice text clustering method and device |
WO2022100071A1 (en) * | 2020-11-10 | 2022-05-19 | 北京捷通华声科技股份有限公司 | Voice text clustering method and apparatus |
CN112579780A (en) * | 2020-12-25 | 2021-03-30 | 青牛智胜(深圳)科技有限公司 | Single-pass based clustering method, system, device and storage medium |
CN112579780B (en) * | 2020-12-25 | 2022-02-15 | 青牛智胜(深圳)科技有限公司 | Single-pass based clustering method, system, device and storage medium |
CN113159802A (en) * | 2021-04-15 | 2021-07-23 | 武汉白虹软件科技有限公司 | Algorithm model and system for realizing fraud-related application collection and feature extraction clustering |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107609102A (en) | A kind of short text on-line talking method | |
Babar et al. | Improving performance of text summarization | |
Schneider | Techniques for improving the performance of naive bayes for text classification | |
Weiss et al. | Text mining: predictive methods for analyzing unstructured information | |
CN110298032A (en) | Text classification corpus labeling training system | |
US20150199333A1 (en) | Automatic extraction of named entities from texts | |
CN105808524A (en) | Patent document abstract-based automatic patent classification method | |
Alhutaish et al. | Arabic text classification using k-nearest neighbour algorithm | |
Raś et al. | From data to classification rules and actions | |
Kumar et al. | Legal document summarization using latent dirichlet allocation | |
Hettinger et al. | Genre classification on German novels | |
Ghalehtaki et al. | A combinational method of fuzzy, particle swarm optimization and cellular learning automata for text summarization | |
Utiu et al. | Learning web content extraction with DOM features | |
Yang et al. | A survey on interpretable clustering | |
Debnath et al. | Extractive single document summarization using an archive-based micro genetic-2 | |
CN111859984A (en) | Intention mining method, device, equipment and storage medium | |
JPH06282587A (en) | Automatic classifying method and device for document and dictionary preparing method and device for classification | |
Rukmi et al. | Using k-means++ algorithm for researchers clustering | |
Thabtah et al. | Comparison of rule based classification techniques for the Arabic textual data | |
Ramakrishnan et al. | Hypergraph based clustering for document similarity using FP growth algorithm | |
Papagiannopoulou et al. | Keywords lie far from the mean of all words in local vector space | |
CN114443820A (en) | Text aggregation method and text recommendation method | |
Moulay Lakhdar et al. | Building an extractive Arabic text summarization using a hybrid approach | |
ElGhazaly | Automatic text classification using neural network and statistical approaches | |
Rajkumar et al. | An efficient feature extraction with subset selection model using machine learning techniques for Tamil documents classification |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20180119 |
|