CN103198103A - Microblog pushing method and device based on dense word clustering

Info

Publication number: CN103198103A
Application number: CN201310090524XA
Authority: CN
Inventors: 冯扬; 姜贵彬; 宋莉; 刘莹莹; 桑军
Original assignee: Weimeng Chuangke Network Technology China Co Ltd
Current assignee: Weimeng Chuangke Network Technology China Co Ltd
Priority date: 2013-03-20
Filing date: 2013-03-20
Publication date: 2013-07-10
Anticipated expiration: 2033-03-20
Also published as: CN103198103B

Abstract

The invention discloses a microblog pushing method and device based on dense word clustering, which address the problems of high server load and wasted network resources in the prior art. The method comprises the following steps: a server determines the word spacing between words; according to the word spacing, core words are determined and the words are divided into different word sets around the core words; the words in each word set are clustered by the OPTICS clustering algorithm to obtain several word clusters to be merged; the word clusters to be merged are merged to obtain merged word clusters; and finally, a microblog to be pushed is pushed according to the merged word cluster to which a user's interest words belong and the content of the microblog to be pushed. With this method, words with a general meaning can be excluded from the word sets during word set division, and the clustering is not affected by initial values, so clustering accuracy can be improved; the server can then push microblogs accurately according to the merged word clusters, which effectively reduces server load and saves network resources.

Description

Microblog pushing method and device based on dense word clustering

Technical field

The present invention relates to the field of network technology, and in particular to a microblog pushing method and device based on dense word clustering.

Background technology

At present, microblogs are increasingly influential among social media and have become one of the main means by which users publish, exchange, and obtain information.

For a microblog user, besides pushing to this user the microblogs published by the other users that this user follows, the server can also push microblogs related to this user's interests.

Specifically, the user's interest words can be set first (either by the user, or by the server according to the microblogs that the user has browsed, forwarded, bookmarked, or followed). The server then judges, according to the content of a microblog to be pushed, whether that microblog is related to the user's interest words, and if so, pushes it to the user.

For example, if the user's interest word is "computing machine", the server judges, according to the content of a microblog to be pushed, whether the microblog is related to "computing machine", and if so, pushes it to the user.

In practice, however, several different words often express the same or a similar meaning; in the example above, words close to the user's interest word "computing machine" include "computer", "notebook", and so on. If the decision to push a microblog to the user depends only on whether the microblog is related to the interest word itself, push accuracy will inevitably be low. The words in the dictionary therefore need to be clustered, that is, words with the same or similar meanings are grouped together to form word clusters. Then, when judging whether a microblog to be pushed should be pushed to a user, the server can judge whether the content of the microblog is related to the word cluster containing the user's interest word, and push according to the result. The accuracy of clustering the words in the dictionary thus directly determines the accuracy of microblog pushing.

In the prior art, the following two kinds of clustering algorithms are generally used to cluster words.

The first is partition-based clustering, such as the k-means clustering algorithm. The number k of word clusters is specified first; k words are then selected at random from the dictionary as the center words of the k word clusters; for each other word in the dictionary, the distance from this word to each of the k center words is calculated, the nearest center word is determined, and the word is assigned to that center word's cluster. After all other words have been processed, the center word of each word cluster is redetermined (again k center words), the distances from the other words to the k new center words are recalculated, and the word clusters are repartitioned according to these distances. The iteration continues until a termination condition is satisfied.

However, the accuracy of the first method is affected by the initially specified number k of word clusters. Moreover, the word clusters it produces are "spherical": in a resulting word cluster, the words farthest from the center word often have very low relevance to the cluster.

The second is agglomerative hierarchical clustering. In the initial stage each word is treated as a word cluster; the distances between word clusters are then calculated and nearby word clusters are merged to form larger ones; the distances between word clusters are recalculated and merging continues according to these distances. The iteration continues until a termination condition is satisfied.

However, the accuracy of the second method is affected by the many words that have a general meaning, such as "company", "enterprise", and "experience". Such words have some relevance to words of many different categories, so during merging they often cause two essentially unrelated word clusters to be merged.

In summary, the accuracy of word clustering in the prior art is low, so the server cannot push microblogs accurately. For a user, if the server fails to push a microblog that is related to the user's interests, the user will inevitably look for that microblog by searching or by other means, which increases server load; and if the server pushes a microblog that is unrelated to the user's interests, network resources are wasted.

Summary of the invention

Embodiments of the invention provide a microblog pushing method and device based on dense word clustering, to solve the prior-art problems of high server load and wasted network resources.

A microblog pushing method based on dense word clustering provided by an embodiment of the invention comprises:

a server determines the word spacing between words according to the co-occurrence word set of each word; and

determines core words according to the word spacing between words; and

for each determined core word, divides the N words having the smallest word spacing to this core word, together with this core word, into one word set, N being a preset first quantity;

for each divided word set, clusters the words in this word set using the OPTICS clustering algorithm to obtain several word clusters to be merged; and

merges the obtained word clusters to be merged according to the words they contain, to obtain merged word clusters;

the server pushes a microblog to be pushed according to the merged word cluster containing a user's interest word and the content of the microblog to be pushed.

A microblog pushing device based on dense word clustering provided by an embodiment of the invention comprises:

a word spacing determination module, configured to determine the word spacing between words according to the co-occurrence word set of each word;

a core word determination module, configured to determine core words according to the word spacing between words;

a word set division module, configured to, for each determined core word, divide the N words having the smallest word spacing to this core word, together with this core word, into one word set, N being a preset first quantity;

a clustering module, configured to, for each divided word set, cluster the words in this word set using the OPTICS clustering algorithm to obtain several word clusters to be merged;

a merging module, configured to merge the obtained word clusters to be merged according to the words they contain, to obtain merged word clusters;

a pushing module, configured to push a microblog to be pushed according to the merged word cluster containing a user's interest word and the content of the microblog to be pushed.

Embodiments of the invention provide a microblog pushing method and device based on dense word clustering. In the method, the server first determines the word spacing between words and determines core words accordingly; for each core word, it divides the N words having the smallest word spacing to this core word, together with this core word, into one word set; it then clusters the words in each word set using the OPTICS clustering algorithm to obtain several word clusters to be merged, merges these to obtain merged word clusters, and finally pushes a microblog to be pushed according to the merged word cluster containing a user's interest word and the content of the microblog. Because words with a general meaning can be excluded from the word sets when they are divided, and the clustering is not affected by initial values, clustering accuracy can be improved; the server can push microblogs accurately according to the merged word clusters, which effectively reduces server load and saves network resources.

Description of drawings

Fig. 1 shows the microblog pushing process based on dense word clustering provided by an embodiment of the invention;

Fig. 2 shows the process by which the server divides word sets, provided by an embodiment of the invention;

Fig. 3 is a schematic diagram of a word set divided with word p, provided by an embodiment of the invention;

Fig. 4 is a schematic diagram of two intersecting word sets, provided by an embodiment of the invention;

Fig. 5 is a schematic diagram of two mutually exclusive word sets, provided by an embodiment of the invention;

Fig. 6 is a schematic diagram of two excessively intersecting word sets, provided by an embodiment of the invention;

Fig. 7 shows the process of clustering the words in a word set using the OPTICS clustering algorithm, provided by an embodiment of the invention;

Fig. 8 is a structural schematic diagram of the microblog pushing device based on dense word clustering provided by an embodiment of the invention.

Embodiment

To prevent words with a general meaning, and the initial values supplied at clustering time (such as the number k of word clusters that must be specified in advance for the k-means clustering algorithm), from affecting the clustering result, the embodiments of the invention first divide the words in the dictionary into several word sets. Words with a general meaning (such as "company", "enterprise", and "experience") can be excluded when the word sets are divided. Each word set is then clustered using the OPTICS clustering algorithm, which requires no initial value to be input and can find word clusters of arbitrary "shape" according to the density of the words, to obtain word clusters to be merged. Finally, the word clusters to be merged are merged to obtain merged word clusters. Clustering accuracy is thereby improved, and the server can push microblogs accurately according to the merged word clusters, which effectively reduces server load and saves network resources.

The preferred embodiments of the invention are described in detail below with reference to the accompanying drawings.

Fig. 1 shows the microblog pushing process based on dense word clustering provided by an embodiment of the invention, which specifically comprises the following steps:

S101: The server determines the word spacing between words according to the co-occurrence word set of each word.

In the embodiments of the invention, for any two words in the dictionary, the server can determine the word spacing between them from the similarity of their context distributions in the corpus: the more similar the context distributions of the two words, the more similar the meanings they express and the smaller the word spacing between them; conversely, the larger the word spacing.

Specifically, taking a first word and a second word to denote any two words in the dictionary, the server can apply a formula to the first word and the second word to determine the word spacing D(i, j) between them, where i denotes the first word, j denotes the second word, D(i, j) is the word spacing between the first word and the second word, T_i is the co-occurrence word set of the first word, T_j is the co-occurrence word set of the second word, |T_i ∩ T_j| is the number of words in the intersection of the co-occurrence word set of the first word and the co-occurrence word set of the second word, |T_i| is the number of words in the co-occurrence word set of the first word, and |T_j| is the number of words in the co-occurrence word set of the second word.
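The following Python sketch illustrates how a word spacing could be computed from two co-occurrence word sets. The specific form used here, 1 - |T_i ∩ T_j| / sqrt(|T_i| · |T_j|), is only an assumption chosen because it uses exactly the three quantities named above and shrinks as the contexts become more similar; it is not taken from the text, and the function name `word_spacing` and the toy co-occurrence sets are hypothetical.

```python
from math import sqrt

def word_spacing(t_i: set, t_j: set) -> float:
    """Assumed spacing: 1 - |Ti ∩ Tj| / sqrt(|Ti| * |Tj|) (illustrative form only)."""
    if not t_i or not t_j:
        return 1.0  # no context information: treat the words as maximally distant
    return 1.0 - len(t_i & t_j) / sqrt(len(t_i) * len(t_j))

# Hypothetical co-occurrence word sets for three words.
cooc = {
    "computer": {"cpu", "keyboard", "screen", "software"},
    "notebook": {"cpu", "keyboard", "screen", "battery"},
    "banana": {"fruit", "yellow", "peel", "sweet"},
}

d_close = word_spacing(cooc["computer"], cooc["notebook"])  # 3 shared context words
d_far = word_spacing(cooc["computer"], cooc["banana"])      # no shared context words
print(d_close, d_far)
```

Under this assumed form, words with largely overlapping contexts receive a small spacing and words with disjoint contexts receive the maximum spacing of 1.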

S102: Core words are determined according to the word spacing between words.

In the embodiments of the invention, a candidate word can be determined to be a core word if many other words have a small word spacing to it.

Specifically, after determining the word spacing between the words in the dictionary, the server can treat every word in the dictionary as a candidate word and, for each candidate word, judge whether the number of other words whose word spacing to this candidate word is not greater than a preset neighborhood distance ε is greater than a preset second quantity M. If so, the candidate word is determined to be a core word; otherwise, it is determined not to be a core word. Both the preset neighborhood distance ε and the second quantity M can be set as required.
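The core word test of step S102 can be sketched as follows. The function name `is_core_word`, the one-dimensional toy positions, and the absolute-difference spacing are assumptions made for illustration; any word spacing function could be substituted.

```python
def is_core_word(word, vocabulary, spacing, eps, m):
    """A word is a core word if more than m other words lie within spacing eps of it."""
    near = sum(1 for other in vocabulary
               if other != word and spacing(word, other) <= eps)
    return near > m

# Toy data: each word is placed on a line and the spacing is the absolute difference.
pos = {"a": 0.0, "b": 0.1, "c": 0.2, "d": 0.9}

def spacing(u, v):
    return abs(pos[u] - pos[v])

core_words = [w for w in pos if is_core_word(w, pos, spacing, eps=0.15, m=1)]
print(core_words)
```

With ε = 0.15 and M = 1, only "b" has more than one neighbor within ε (namely "a" and "c"), so it is the only core word in this toy vocabulary.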

S103: For each determined core word, the N words having the smallest word spacing to this core word, together with this core word, are divided into one word set.

Here, N is the preset first quantity.

It should be noted that a divided word set may contain more than one core word. Moreover, because only the N words nearest to the core word are placed in the word set, words with a general meaning are mostly left out: their word spacing to any core word is usually not small enough for them to be placed in a word set. After the word sets have been divided, therefore, some words remain that belong to no word set, and these are the words with a general meaning. The subsequent clustering processes only the divided word sets and ignores the words that belong to no word set, which prevents words with a general meaning from affecting the accuracy of the subsequent clustering.
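A minimal sketch of the division in step S103, assuming a spacing function is already available. The name `divide_word_set` and the toy positions are hypothetical; the word "company" is placed far from everything else to stand in for a word with a general meaning.

```python
def divide_word_set(core, vocabulary, spacing, n):
    """Return the core word together with its n nearest words."""
    others = sorted((w for w in vocabulary if w != core),
                    key=lambda w: spacing(core, w))
    return {core, *others[:n]}

# Toy data: positions on a line; spacing is the absolute difference.
pos = {"a": 0.0, "b": 0.1, "c": 0.2, "d": 0.9, "company": 5.0}

def spacing(u, v):
    return abs(pos[u] - pos[v])

word_set = divide_word_set("b", pos, spacing, n=2)
leftover = set(pos) - word_set  # words that fall into no word set
print(word_set, leftover)
```

Words that are not among the N nearest words of any core word, such as "company" here, are simply never placed in a word set and so take no part in the later clustering.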

S104: For each divided word set, the words in this word set are clustered using the OPTICS clustering algorithm to obtain several word clusters to be merged.

Because the density-based OPTICS clustering algorithm achieves higher clustering accuracy than partition-based clustering algorithms (such as the k-means clustering algorithm) and agglomerative hierarchical clustering algorithms, in the embodiments of the invention, after dividing the word sets, the server can cluster each word set using the OPTICS clustering algorithm to obtain several word clusters to be merged.

S105: The obtained word clusters to be merged are merged according to the words they contain, to obtain merged word clusters.

Specifically, the server can add the obtained word clusters to be merged to a cluster queue. For each word cluster to be merged in the cluster queue, the server extracts this word cluster, determines the other word clusters to be merged in the cluster queue that share at least one word with the extracted word cluster, and merges them into the extracted word cluster to obtain an intermediate word cluster to be merged. The server deletes from the cluster queue the extracted word cluster and the other word clusters so determined, then continues to determine the word clusters in the current cluster queue that share at least one word with the intermediate word cluster and merges them, until the number of words in the intermediate word cluster no longer changes. When the number of words in the intermediate word cluster no longer changes, the intermediate word cluster is taken as a resulting merged word cluster.

For example, suppose 4 word clusters to be merged are obtained in total, namely C, C1, C2, and C3, where C and C1 share a word, C1 and C2 share a word, and the words in C3 all differ from those in C, C1, and C2. After the server adds these 4 word clusters to the cluster queue, it extracts C and determines that the other word cluster in the cluster queue sharing a word with C is C1, so C1 is merged into C to obtain the intermediate word cluster C ∪ C1, and C and C1 are deleted from the cluster queue.

At this point the cluster queue still contains 2 word clusters to be merged, namely C2 and C3. The server continues to determine the other word clusters sharing a word with the intermediate word cluster C ∪ C1, which is C2 (because C1 and C2 share a word, C2 also shares a word with C ∪ C1). C2 is therefore merged into C ∪ C1 to obtain the intermediate word cluster C ∪ C1 ∪ C2.

The server then deletes C2 from the cluster queue, leaving only C3. Because the words in C3 all differ from those in C ∪ C1 ∪ C2, the number of words in the intermediate word cluster C ∪ C1 ∪ C2 no longer changes, and C ∪ C1 ∪ C2 is taken as a merged word cluster.
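The merging procedure of step S105 and the example with C, C1, C2, and C3 can be sketched as below. The function name `merge_clusters` and the concrete words are assumptions; the sketch also treats a leftover cluster such as C3, which shares no word with any other cluster, as a merged word cluster on its own, which the text does not state explicitly.

```python
from collections import deque

def merge_clusters(clusters):
    """Repeatedly absorb every queued cluster that shares a word with the current one."""
    queue = deque(set(c) for c in clusters)
    merged = []
    while queue:
        current = queue.popleft()          # extract a cluster to be merged
        changed = True
        while changed:                     # stop when the word count no longer changes
            changed = False
            for other in list(queue):
                if current & other:        # at least one identical word
                    current |= other
                    queue.remove(other)
                    changed = True
        merged.append(current)
    return merged

C, C1, C2, C3 = {"a", "b"}, {"b", "c"}, {"c", "d"}, {"x", "y"}
merged = merge_clusters([C, C1, C2, C3])
print(merged)
```

Here C, C1, and C2 are chained together by shared words and end up in one merged word cluster, while C3 stays separate.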

S106: The server pushes the microblog to be pushed according to the merged word cluster containing the user's interest word and the content of the microblog to be pushed.

After the merged word clusters have been obtained, the server can push a microblog to be pushed according to the merged word cluster containing each user's interest word and the content of the microblog. For example, for a given user, the server can determine the merged word cluster containing this user's interest word and then judge, according to the content of the microblog to be pushed, whether the relevance between that content and the determined merged word cluster is greater than a set threshold. If so, the microblog is pushed to this user; otherwise, it is not pushed to this user.
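The push decision of step S106 might look like the following sketch. The text does not say how relevance is measured, so the measure used here (the fraction of the microblog's words that fall in the merged word cluster) is purely an assumption, as are the function name `should_push`, the pre-tokenized microblog content, and the threshold value.

```python
def should_push(interest_word, merged_clusters, microblog_words, threshold):
    """Push if the assumed relevance between microblog and cluster exceeds the threshold."""
    # Merged word cluster containing the interest word (fall back to the word alone).
    cluster = next((c for c in merged_clusters if interest_word in c), {interest_word})
    if not microblog_words:
        return False
    # Assumed relevance: share of microblog words that belong to the cluster.
    relevance = sum(1 for w in microblog_words if w in cluster) / len(microblog_words)
    return relevance > threshold

clusters = [{"computing machine", "computer", "notebook"}, {"fruit", "banana"}]
post_1 = ["new", "notebook", "computer", "review"]
post_2 = ["banana", "recipe"]

push_1 = should_push("computing machine", clusters, post_1, threshold=0.3)
push_2 = should_push("computing machine", clusters, post_2, threshold=0.3)
print(push_1, push_2)
```

The first microblog never mentions the interest word itself, yet it is pushed because it contains other words from the same merged word cluster.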

With the above method, the server can exclude words with a general meaning from the word sets when dividing them, and the clustering is not affected by initial values, so clustering accuracy can be improved. For a microblog to be pushed, if the server determines from the merged word clusters that the relevance between the microblog and the merged word cluster containing a user's interest word is high, it pushes the microblog to this user, so that the user need not look for the microblog by searching or by other means, which effectively reduces server load. If the server determines that this relevance is low, it does not push the microblog to this user, which saves network resources.

Preferably, to make the clustering in the above process as efficient as possible, in step S103 the server can divide the word sets using the method shown in Fig. 2.

Fig. 2 shows the process by which the server divides word sets, provided by an embodiment of the invention, which specifically comprises the following steps:

S1031: The server adds the words to an original queue in arbitrary order.

S1032: For each word in the original queue, the word is extracted and it is judged whether the extracted word is a core word; if so, step S1033 is executed; otherwise, step S1034 is executed.

S1033: The N words in the original queue having the smallest word spacing to the extracted word, together with the extracted word, are divided into one word set; the extracted word, and the core words whose word spacing to the extracted word is less than 2 times the preset neighborhood distance, are deleted from the original queue; and the process returns to step S1032.

The purpose of deleting from the original queue the core words whose word spacing to the extracted word is less than 2 times the preset neighborhood distance is as follows:

Suppose word p is a core word and is regarded as a point in space; the word set divided with word p (denoted S(p)) is then as shown in Fig. 3. Fig. 3 is a schematic diagram of a word set divided with word p, provided by an embodiment of the invention. In Fig. 3, the circle of radius ε represents the range in which the word spacing to word p is not greater than the preset neighborhood distance, that is, ε is the preset neighborhood distance. h(p) is the word spacing to p of the word in the word set that is farthest from p; h(p) is called the word set division distance, and the words within radius h(p) are contained in S(p).

Based on the word set shown in Fig. 3, two word sets may either intersect or be mutually exclusive, as shown in Fig. 4 and Fig. 5.

Fig. 4 is a schematic diagram of two intersecting word sets, provided by an embodiment of the invention. In Fig. 4, the word set S(p1) divided with word p1 and the word set S(p2) divided with word p2 both contain the same word r; word set S(p1) and word set S(p2) are then said to intersect.

Fig. 5 is a schematic diagram of two mutually exclusive word sets, provided by an embodiment of the invention. In Fig. 5, the word set S(p1) divided with word p1 and the word set S(p2) divided with word p2 contain no identical word; word set S(p1) and word set S(p2) are then said to be mutually exclusive.

Further, intersection of two word sets is divided into ordinary intersection and excessive intersection. The case shown in Fig. 4 is ordinary intersection, and the case shown in Fig. 6 is excessive intersection.

Fig. 6 is a schematic diagram of two excessively intersecting word sets, provided by an embodiment of the invention. In Fig. 6, the word set S(p1) divided with word p1 and the word set S(p2) divided with word p2 both contain the same word r, the word spacing between r and p1 is less than the preset neighborhood distance ε, and the word spacing between r and p2 is also less than the preset neighborhood distance ε; word set S(p1) and word set S(p2) are then said to intersect excessively.

Obviously, dividing a word set for every core word in the dictionary would sharply increase the number of word sets and reduce the efficiency of the subsequent OPTICS clustering. To improve the efficiency of the subsequent clustering, excessive intersection must therefore be avoided. So in step S1033, after a word set has been divided with a core word, the other core words in the original queue whose word spacing to this core word is less than 2ε are deleted. In this way, no subsequently divided word set will intersect this word set excessively.

S1034: The extracted word is put back into the original queue, and the process returns to step S1032.

A word set divided by the above method contains at least one core word, and in general a word set may contain several core words.
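Steps S1031 to S1034 can be sketched as the queue-based loop below. Two points are assumptions rather than statements of the text: a non-core word that is "put back" is appended to the end of the queue, and the loop stops once no core word remains in the queue (the text gives no explicit termination condition). The names `divide_word_sets`, `pos`, and `spacing` are hypothetical.

```python
from collections import deque

def divide_word_sets(vocabulary, spacing, eps, m, n):
    """Divide word sets around core words, dropping core words closer than 2*eps."""
    def is_core(w):
        return sum(1 for o in vocabulary if o != w and spacing(w, o) <= eps) > m

    core = {w for w in vocabulary if is_core(w)}
    queue = deque(vocabulary)                           # S1031
    word_sets = []
    while any(w in core for w in queue):                # assumed stop condition
        word = queue.popleft()                          # S1032
        if word not in core:
            queue.append(word)                          # S1034: put the word back
            continue
        nearest = sorted(queue, key=lambda o: spacing(word, o))[:n]
        word_sets.append({word, *nearest})              # S1033: new word set
        queue = deque(o for o in queue                  # delete nearby core words
                      if not (o in core and spacing(word, o) < 2 * eps))
    return word_sets

# Toy data: two dense groups and one isolated word "g" on a line.
pos = {"a": 0.0, "b": 0.1, "c": 0.2, "x": 5.0, "y": 5.1, "z": 5.2, "g": 20.0}

def spacing(u, v):
    return abs(pos[u] - pos[v])

word_sets = divide_word_sets(list(pos), spacing, eps=0.25, m=1, n=2)
print(word_sets)
```

Although all six grouped words are core words, only two word sets are produced, because the core words within 2ε of "a" and of "x" are removed from the queue after their word sets are divided.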

Further, the process of clustering the words in a word set using the OPTICS clustering algorithm in step S104 of Fig. 1 is shown in Fig. 7.

Fig. 7 shows the process of clustering the words in a word set using the OPTICS clustering algorithm, provided by an embodiment of the invention, which specifically comprises the following steps:

S1041: For each divided word set, all words in this word set are added to an ordered seed queue.

For example, suppose the word set contains n words, p1 to pn; initially the server can add these n words to the ordered seed queue in arbitrary order.

S1042: Following the order of the words in the ordered seed queue, the first word is extracted.

Continuing the example, suppose the n words were added in the order p1 to pn; the server then extracts the first word p1 from the ordered seed queue.

S1043: It is judged whether the extracted word is a core word; if so, step S1044 is executed; otherwise, step S1045 is executed.

S1044: The other words in the ordered seed queue are taken as candidate words of the extracted word; the intermediate reachability distance value of each candidate word is updated according to the reachability distance from that candidate word to the extracted word; the extracted word is appended to the end of the result queue and deleted from the ordered seed queue; the candidate words in the ordered seed queue are sorted in ascending order of their current intermediate reachability distance values; and step S1046 is executed.

The intermediate reachability distance value of each candidate word is updated as follows: for each candidate word, the reachability distance from this candidate word to the extracted word is determined. If the current intermediate reachability distance value of this candidate word is not greater than that reachability distance, the intermediate value is kept unchanged; if it is greater, the reachability distance from this candidate word to the extracted word becomes the new intermediate reachability distance value of this candidate word.

Continuing the example, after the first word p1 is extracted, the other words in the ordered seed queue, namely p2 to pn, are taken as candidate words of p1. The server determines the reachability distances from p2 to p1, from p3 to p1, ..., and from pn to p1.

Taking the reachability distance from p2 to p1 as an example, the method by which the server determines the reachability distance between two words is as follows. If p1 and p2 are core words, the reachability distance from p2 to p1 is the maximum of the core distance of p1 and the word spacing between p2 and p1.

The core distance of p1 is determined as follows: among the words in the dictionary, M words (M being the second quantity above) are selected in ascending order of word spacing to p1, and the largest word spacing from these M words to p1 is taken as the core distance of p1.
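The two definitions above can be written down directly. In this sketch the function names `core_distance` and `reachability_distance`, the toy positions, and the value M = 2 are assumptions for illustration, and the vocabulary is assumed to contain at least M other words.

```python
def core_distance(p, vocabulary, spacing, m):
    """Largest spacing to p among the m words nearest to p."""
    dists = sorted(spacing(p, o) for o in vocabulary if o != p)
    return dists[m - 1]

def reachability_distance(q, p, vocabulary, spacing, m):
    """Reachability distance from q to p: max(core distance of p, spacing(q, p))."""
    return max(core_distance(p, vocabulary, spacing, m), spacing(q, p))

# Toy data: positions on a line; spacing is the absolute difference.
pos = {"a": 0.0, "b": 0.1, "c": 0.3, "d": 1.0}

def spacing(u, v):
    return abs(pos[u] - pos[v])

cd_a = core_distance("a", pos, spacing, m=2)
r_ba = reachability_distance("b", "a", pos, spacing, m=2)
r_da = reachability_distance("d", "a", pos, spacing, m=2)
print(cd_a, r_ba, r_da)
```

A word that is closer to "a" than the core distance of "a" (such as "b") is assigned the core distance, whereas a word farther away (such as "d") keeps its actual word spacing.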

For p2, if the current intermediate reachability distance of p2 is not greater than the reachability distance from p2 to p1, the intermediate reachability distance of p2 is kept unchanged; if it is greater than the reachability distance from p2 to p1, it is updated to the reachability distance from p2 to p1. The intermediate reachability distances of p3 to pn can be updated in the same way. Initially, the server may set the intermediate reachability distances of p1 to pn to initial values. For example, for a core word the server may set the intermediate reachability distance to a large value (greater than the reachability distance between any two core words in the word set), and for a non-core word it may set the intermediate reachability distance to a small value (such as -1).

After the intermediate reachability distances of p2 to pn have all been updated, the server inserts the extracted p1 at the end of the result queue (because the result queue is empty at this point, p1 ranks first in the result queue after insertion), deletes p1 from the ordered seed queue, and re-sorts p2 to pn in the ordered seed queue in ascending order of intermediate reachability distance.

Suppose the order of p2 to pn after sorting is p3, p2, p4, p5, p6, ..., pn. After step S1046 is executed, it is determined that the ordered seed queue still contains words, so the process returns to step S1042 and continues by extracting and processing the first word in the ordered seed queue.

Because the first word in the ordered seed queue is now p3, the server extracts p3 and processes it in the manner described above.

In this way, once all words in the ordered seed queue have been processed, for any core word pi in the result queue it is guaranteed that the core word pj immediately following pi in the result queue has the smallest reachability distance to pi among all core words ranked after pi in the result queue.
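The ordering procedure described above (extract the first word, update the intermediate reachability distances of the remaining candidates, append to the result queue, re-sort) may be sketched as follows. This is a minimal sketch under stated assumptions: the names `optics_order`, `is_core` and `dist` are hypothetical, the core distance is computed inside the word set rather than over the whole lexicon for brevity, and the toy one-dimensional positions merely stand in for the word spacing.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple


def optics_order(word_set: Sequence[str],
                 dist: Callable[[str, str], float],
                 is_core: Callable[[str], bool],
                 m: int) -> Tuple[List[str], Dict[str, float]]:
    """Return (result queue, intermediate reachability distance of each word)."""
    seeds = list(word_set)                       # ordered seed queue
    # Initial values: large for core words, small (-1) for non-core words.
    reach = {w: (math.inf if is_core(w) else -1.0) for w in seeds}
    result: List[str] = []                       # result queue
    while seeds:
        current = seeds.pop(0)                   # extract the first word
        if is_core(current):
            others = [w for w in word_set if w != current]
            # Assumption: core distance taken within the word set.
            core_dist = sorted(dist(current, w) for w in others)[m - 1]
            for cand in seeds:
                r = max(core_dist, dist(current, cand))
                if reach[cand] > r:              # keep the smaller value
                    reach[cand] = r
            seeds.sort(key=reach.__getitem__)    # ascending, stable
        result.append(current)                   # insert at the end
    return result, reach


# Hypothetical toy data: two dense groups on a line.
POS = {"a": 0.0, "b": 0.125, "c": 0.25, "x": 0.75, "y": 0.875}
order, reach = optics_order(
    ["a", "x", "b", "y", "c"],
    dist=lambda u, v: abs(POS[u] - POS[v]),
    is_core=lambda w: True,
    m=2,
)
print(order)   # ['a', 'b', 'c', 'x', 'y']
```

Although the words are supplied in a mixed order, the result queue visits the first dense group before moving to the second, and the jump between groups shows up as a larger intermediate reachability distance for "x".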

S1045: Insert the extracted word at the end of the result queue, delete the extracted word from the ordered seed queue, and execute step S1046.

S1046: Determine whether the ordered seed queue still contains words; if so, return to step S1042; otherwise, execute step S1047.

S1047: In the order of the words in the result queue, search for a word, other than the first word, whose current intermediate reachability distance is greater than the preset neighborhood distance.

S1048: When such a word is found, cluster all words ranked before the found word in the result queue into one word cluster to be merged, and delete all words of this word cluster to be merged from the result queue.

Based on the result queue obtained above, words with a larger intermediate reachability distance are searched for from front to back in the order of the words in the result queue. A word found in this way lies in a region of lower density, that is, at the boundary of a word cluster. The server therefore searches the result queue from front to back for a word, other than the first word, whose current intermediate reachability distance is greater than the preset neighborhood distance ε. When such a word is found, all words ranked before it in the result queue are clustered into one word cluster to be merged, and all words of this word cluster to be merged are deleted from the result queue. The server then continues to search in the same way and to perform the above processing until the result queue is empty.

S1049: Determine whether the result queue still contains words; if so, return to step S1047; otherwise, execute step S1040.

S1040: The clustering of this word set ends.
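Steps S1047 to S1049 can be illustrated by the following minimal sketch. The function name `split_clusters` and the hand-written result queue and intermediate reachability distances are assumptions chosen for illustration; they are not values from the embodiment.

```python
import math
from typing import Dict, List, Sequence


def split_clusters(result_queue: Sequence[str],
                   reach: Dict[str, float],
                   eps: float) -> List[List[str]]:
    """Cut the result queue at every non-first word whose value exceeds eps."""
    queue = list(result_queue)
    clusters: List[List[str]] = []
    while queue:                                   # S1049: repeat until empty
        # S1047: first position (other than 0) whose value is greater than eps;
        # if there is none, the whole remaining queue forms the last cluster.
        cut = next((i for i in range(1, len(queue)) if reach[queue[i]] > eps),
                   len(queue))
        clusters.append(queue[:cut])               # S1048: words before the cut
        queue = queue[cut:]                        # delete them from the queue
    return clusters


# Hypothetical result queue and intermediate reachability distances.
result_queue = ["a", "b", "c", "x", "y"]
reach = {"a": math.inf, "b": 0.25, "c": 0.125, "x": 0.5, "y": 0.125}
print(split_clusters(result_queue, reach, eps=0.3))
# [['a', 'b', 'c'], ['x', 'y']]
```

The first word of the queue is skipped during the search even though its value is large, because it opens a word cluster rather than closing one.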

By clustering the words in a word set with the above method, several word clusters to be merged are obtained. Subsequently, in step S105 shown in Fig. 1, the word clusters to be merged obtained for the respective word sets are merged to obtain merged word clusters, and in step S106 the microblog to be pushed is pushed.

The prior art also includes methods that directly apply the OPTICS clustering algorithm to all words in the lexicon, that is, the clustering process shown in Fig. 7 is performed directly on all words in the lexicon. However, because the process shown in Fig. 7 contains two iterative loops, the clustering has to be carried out by a single server. Moreover, if that server fails during clustering and the clustering process is interrupted, all words in the lexicon must be clustered again. Directly applying the OPTICS clustering algorithm to all words in the lexicon therefore has low stability and efficiency.

In the method provided by the embodiment of the present invention, the words in the lexicon are first divided into word sets by the method shown in Fig. 2. The division does not destroy the word spacing between the words in the original lexicon, that is, it does not destroy the density relationship among the words in the lexicon. Each word set is then clustered by the method shown in Fig. 7. Besides achieving the same effect as directly applying the OPTICS clustering algorithm to all words in the lexicon, this also supports distributed processing: one server can cluster one or several word sets while another server clusters the other word sets. Furthermore, when a server fails during clustering and the clustering process is interrupted, clustering only needs to be restarted for the word set that was being processed at the time of the interruption, and there is no need to cluster all words in the lexicon again. The clustering method provided by the embodiment of the present invention can therefore effectively improve the efficiency and stability of clustering.
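The benefit of per-word-set clustering can be sketched as follows. This is only an illustration under stated assumptions: worker threads stand in for separate servers, `cluster_all`, `cluster_one` and `max_attempts` are hypothetical names, and the deliberately failing `flaky_cluster` function simulates a server fault.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence, Tuple

Cluster = List[str]


def cluster_all(word_sets: Sequence[Sequence[str]],
                cluster_one: Callable[[Sequence[str]], List[Cluster]],
                max_attempts: int = 2,
                workers: int = 2) -> List[List[Cluster]]:
    """Cluster every word set independently; retry only the set that failed."""
    def run(word_set: Sequence[str]) -> List[Cluster]:
        for attempt in range(1, max_attempts + 1):
            try:
                return cluster_one(word_set)
            except Exception:
                if attempt == max_attempts:
                    raise                      # give up on this word set only
        return []                              # not reached when max_attempts >= 1

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run, word_sets))  # results keep the input order


# Simulated fault: the first attempt on the word set ("c",) fails.
calls: Dict[Tuple[str, ...], int] = {}


def flaky_cluster(word_set: Sequence[str]) -> List[Cluster]:
    key = tuple(word_set)
    calls[key] = calls.get(key, 0) + 1
    if key == ("c",) and calls[key] == 1:
        raise RuntimeError("simulated server fault")
    return [list(word_set)]                    # trivial stand-in clustering


results = cluster_all([["a", "b"], ["c"]], flaky_cluster)
print(results)   # [[['a', 'b']], [['c']]]
print(calls)     # the word set ('c',) was processed twice, ('a', 'b') once
```

Only the interrupted word set is clustered again, while the result for the other word set is kept.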

The above is the microblog pushing method based on dense word clustering provided by the embodiment of the present invention. Based on the same inventive concept, the embodiment of the present invention also provides a microblog pushing device based on dense word clustering, as shown in Fig. 8.

Fig. 8 is a schematic structural diagram of the microblog pushing device based on dense word clustering provided by the embodiment of the present invention, which specifically comprises:

A word spacing determination module 801, configured to determine the word spacing between words according to the co-occurrence word set of each word;

A core word determination module 802, configured to determine core words according to the word spacing between words;

A word set division module 803, configured to, for each determined core word, group the N words with the smallest word spacing to the core word together with the core word into one word set, N being a preset first quantity;

A clustering module 804, configured to, for each word set obtained by division, cluster the words in the word set using the OPTICS clustering algorithm to obtain several word clusters to be merged;

A merging module 805, configured to merge the obtained word clusters to be merged according to the words in each obtained word cluster to be merged, to obtain merged word clusters;

A pushing module 806, configured to push the microblog to be pushed according to the merged word cluster to which a word of interest to the user belongs and the content of the microblog to be pushed.

The word spacing determination module 801 is specifically configured to, for a first word and a second word, determine the word spacing between the first word and the second word using a formula [formula not reproduced in the source text], where the first word and the second word are any two words, i denotes the first word, j denotes the second word, D(i, j) is the word spacing between the first word and the second word, T_i is the co-occurrence word set of the first word, T_j is the co-occurrence word set of the second word, |T_i ∩ T_j| is the number of words contained in the intersection of the co-occurrence word set of the first word and the co-occurrence word set of the second word, |T_i| is the number of words contained in the co-occurrence word set of the first word, and |T_j| is the number of words contained in the co-occurrence word set of the second word.

The core word determination module 802 is specifically configured to, for each word to be determined, judge whether the number of other words whose word spacing to the word to be determined is not greater than the preset neighborhood distance is greater than the preset second quantity; if so, determine that the word to be determined is a core word; otherwise, determine that the word to be determined is not a core word.
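By way of illustration only, the core word test can be sketched as follows. The name `is_core_word`, the parameter names `eps` (preset neighborhood distance) and `m` (preset second quantity), and the toy positions used in place of the word spacing are assumptions made for this example.

```python
from typing import Callable, Sequence


def is_core_word(word: str, words: Sequence[str],
                 dist: Callable[[str, str], float],
                 eps: float, m: int) -> bool:
    """True if more than m other words lie within eps of `word`."""
    neighbours = sum(1 for other in words
                     if other != word and dist(word, other) <= eps)
    return neighbours > m


# Hypothetical toy data: a dense group (a, b, c) and a sparse pair (x, y).
POS = {"a": 0.0, "b": 0.125, "c": 0.25, "x": 0.75, "y": 0.875}
words = list(POS)


def toy_dist(u: str, v: str) -> float:
    return abs(POS[u] - POS[v])


print(is_core_word("a", words, toy_dist, eps=0.25, m=1))   # True  (b and c)
print(is_core_word("x", words, toy_dist, eps=0.25, m=1))   # False (only y)
```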

The word set division module 803 is specifically configured to: add the words to an original queue in random order; for each word in the original queue, extract the word and judge whether the extracted word is a core word; if so, group the N words in the original queue with the smallest word spacing to the extracted word together with the extracted word into one word set, and delete from the original queue the extracted word and the core words whose word spacing to the extracted word is less than 2 times the preset neighborhood distance; otherwise, put the extracted word back into the original queue.
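The division procedure above may be sketched as follows. This is a minimal sketch rather than the embodiment's implementation: `divide_word_sets`, the `rng` parameter and the toy data are hypothetical, and each word is visited once in the shuffled order, with non-core words simply left in the queue.

```python
import random
from typing import Callable, List, Optional, Sequence


def divide_word_sets(words: Sequence[str],
                     dist: Callable[[str, str], float],
                     is_core: Callable[[str], bool],
                     n: int, eps: float,
                     rng: Optional[random.Random] = None) -> List[List[str]]:
    """Group each remaining core word with its n nearest words in the queue."""
    rng = rng or random.Random()
    queue = list(words)
    rng.shuffle(queue)                          # original queue, random order
    word_sets: List[List[str]] = []
    for word in list(queue):                    # visit each word once
        if word not in queue:                   # deleted as a nearby core word
            continue
        if not is_core(word):
            continue                            # non-core word stays in the queue
        others = [w for w in queue if w != word]
        nearest = sorted(others, key=lambda w: dist(word, w))[:n]
        word_sets.append([word] + nearest)
        # Delete the extracted word and core words closer than 2 * eps to it.
        queue = [w for w in others
                 if not (is_core(w) and dist(word, w) < 2 * eps)]
    return word_sets


# Hypothetical toy data: "a", "b" and "y" are assumed to be core words.
POS = {"a": 0.0, "b": 0.125, "c": 0.25, "x": 0.75, "y": 0.875, "z": 1.0}
CORE = {"a", "b", "y"}
sets_ = divide_word_sets(list(POS), lambda u, v: abs(POS[u] - POS[v]),
                         CORE.__contains__, n=2, eps=0.125,
                         rng=random.Random(0))
print(sorted(sorted(s) for s in sets_))   # [['a', 'b', 'c'], ['x', 'y', 'z']]
```

Because "a" and "b" are core words closer than 2 times the neighborhood distance, whichever of them is extracted first removes the other from the queue, so the dense region yields one word set instead of two nearly identical ones.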

The clustering module 804 is specifically configured to: for each word set obtained by division, add all words in the word set to an ordered seed queue; extract the first word according to the order of the words in the ordered seed queue; judge whether the extracted word is a core word; if so, take the other words in the ordered seed queue as candidate words of the extracted word and, for each candidate word, determine the reachability distance from the candidate word to the extracted word; if the current intermediate reachability distance of the candidate word is not greater than the reachability distance from the candidate word to the extracted word, keep the intermediate reachability distance of the candidate word unchanged, and if the current intermediate reachability distance of the candidate word is greater than the reachability distance from the candidate word to the extracted word, take the reachability distance from the candidate word to the extracted word as the new intermediate reachability distance of the candidate word; insert the extracted word at the end of a result queue, delete the extracted word from the ordered seed queue, and sort the candidate words in the ordered seed queue in ascending order of their current intermediate reachability distances; continue to extract and process the first word in the ordered seed queue until the ordered seed queue contains no words; otherwise, insert the extracted word at the end of the result queue, delete the extracted word from the ordered seed queue, and continue to extract and process the first word in the ordered seed queue until the ordered seed queue contains no words; when the ordered seed queue contains no words, search, in the order of the words in the result queue, for a word other than the first word whose current intermediate reachability distance is greater than the preset neighborhood distance; when such a word is found, cluster all words ranked before the found word in the result queue into one word cluster to be merged, delete all words of this word cluster to be merged from the result queue, and continue to search in the same manner and perform clustering until the result queue contains no words.

The merging module 805 is specifically configured to: add each obtained word cluster to be merged to a cluster queue; for each word cluster to be merged in the cluster queue, extract the word cluster to be merged, determine the other word clusters to be merged in the cluster queue that contain at least one word in common with the extracted word cluster to be merged, and merge the determined other word clusters to be merged into the extracted word cluster to be merged to obtain an intermediate word cluster to be merged; delete the extracted word cluster to be merged and the determined other word clusters to be merged from the cluster queue, and continue to determine the other word clusters to be merged in the current cluster queue that contain at least one word in common with the intermediate word cluster to be merged and merge them, until the number of words contained in the intermediate word cluster to be merged no longer changes; and when the number of words contained in the intermediate word cluster to be merged no longer changes, take the intermediate word cluster to be merged as an obtained merged word cluster.
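The merging rule (absorb every queued cluster that shares at least one word, and repeat until the intermediate cluster stops growing) can be sketched as follows. The name `merge_clusters` and the sample clusters are assumptions for illustration.

```python
from typing import Iterable, List, Set


def merge_clusters(clusters: Iterable[Iterable[str]]) -> List[Set[str]]:
    """Merge word clusters that share at least one word, transitively."""
    queue = [set(c) for c in clusters]          # cluster queue
    merged: List[Set[str]] = []
    while queue:
        current = queue.pop(0)                  # extract a cluster to be merged
        changed = True
        while changed:                          # until the size stops changing
            changed = False
            rest = []
            for other in queue:
                if current & other:             # at least one common word
                    current |= other
                    changed = True
                else:
                    rest.append(other)
            queue = rest                        # absorbed clusters are deleted
        merged.append(current)
    return merged


sample = [["a", "b"], ["c", "d"], ["b", "c"], ["e"]]
print(merge_clusters(sample))   # two clusters: {a, b, c, d} and {e}
```

In the sample, ["c", "d"] shares no word with ["a", "b"] on the first pass, but it is absorbed on the second pass once ["b", "c"] has been merged in, which is why the loop repeats until nothing changes.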

Specifically, the above microblog pushing device may be located in a server.

Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.

The present application is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to the embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, such that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Although preferred embodiments of the present application have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn of the basic inventive concept. Therefore, the appended claims are intended to be construed as including the preferred embodiments and all changes and modifications that fall within the scope of the present application.

Obviously, those skilled in the art can make various changes and variations to the embodiments of the present application without departing from the spirit and scope of the embodiments of the present application. Thus, if these modifications and variations of the embodiments of the present application fall within the scope of the claims of the present application and their technical equivalents, the present application is also intended to include these changes and variations.

Claims

1. A microblog pushing method based on dense word clustering, characterized by comprising:

determining core words according to the word spacing between words; and

2. the method for claim 1 is characterized in that, server is determined the word spacing between each vocabulary according to the co-occurrence word set of each vocabulary, specifically comprises:

Described server adopts formula at first vocabulary and second vocabulary Determine the word spacing between described first vocabulary and second vocabulary, wherein, first vocabulary and second vocabulary are any two vocabulary, and i represents described first vocabulary, and j represents described second vocabulary, and (i j) is word spacing between described first vocabulary and second vocabulary, T to D _iBe the co-occurrence word set of first vocabulary, T _jBe the co-occurrence word set of second vocabulary, | T _i∩ T _j| the quantity of the vocabulary that comprises in the common factor for the co-occurrence word set of the co-occurrence word set of described first vocabulary and described second vocabulary, | T _i| be the quantity of the concentrated vocabulary that comprises of co-occurrence word of first vocabulary, | T _j| be the quantity of the concentrated vocabulary that comprises of co-occurrence word of second vocabulary.

3. the method for claim 1 is characterized in that, determines core word according to the word spacing between each vocabulary, specifically comprises:

Described server is at each vocabulary undetermined, whether the word spacing of judgement and this vocabulary undetermined is not more than the quantity of other vocabulary of presetting the neighborhood distance greater than the second default quantity, if determine that then this vocabulary undetermined is core word, otherwise, determine that this vocabulary undetermined is not core word.

4. The method of claim 3, characterized in that, for each determined core word, grouping the N words with the smallest word spacing to the core word together with the core word into one word set specifically comprises:

adding the words to an original queue in random order;

for each word in the original queue, extracting the word and judging whether the extracted word is a core word;

if so, grouping the N words in the original queue with the smallest word spacing to the extracted word together with the extracted word into one word set, and deleting from the original queue the extracted word and the core words whose word spacing to the extracted word is less than 2 times the preset neighborhood distance;

otherwise, putting the extracted word back into the original queue.

5. the method for claim 1 is characterized in that, at each word set that marks off, adopts the OPTICS clustering algorithm that the vocabulary in this word set is carried out cluster, obtains several words to be combined bunch, specifically comprises:

At each word set that marks off, all vocabulary in this word set are added in the orderly seed formation;

Sequencing according to each vocabulary in the orderly seed formation extracts first vocabulary;

Judge whether this vocabulary that extracts is core word;

If then with the candidate word of other vocabulary in the orderly seed formation as this vocabulary that extracts, and at each candidate word, determine that this candidate word is to the reach distance of this vocabulary that extracts; If the reach distance intermediate value of current this candidate word is not more than this candidate word to the reach distance of this vocabulary that extracts, then keep the reach distance intermediate value of this candidate word constant, if the reach distance intermediate value of current this candidate word is greater than the reach distance of this candidate word to this vocabulary that extracts, then with this candidate word to the reach distance of this vocabulary that extracts again as the reach distance intermediate value of this candidate word; This vocabulary that extracts is inserted into the end of result queue, this vocabulary that deletion is extracted from orderly seed formation, and according to the current reach distance intermediate value order from small to large of each candidate word each candidate word in the orderly seed formation is sorted; First vocabulary that continue to extract in the orderly seed formation is handled, in orderly seed formation, do not have any vocabulary till;

Otherwise, this vocabulary that extracts is inserted into the end of result queue, this vocabulary that deletion is extracted from orderly seed formation, first vocabulary that continues to extract in the orderly seed formation is handled, in orderly seed formation, do not have any vocabulary till;

When not having any vocabulary in the orderly seed team row, according to the sequencing of each vocabulary in the result queue, search successively except first vocabulary and current reach distance intermediate value greater than the vocabulary of default neighborhood distance;

When finding, be a word to be combined bunch with coming all vocabulary clusters before this vocabulary of finding in the result queue, all vocabulary from result queue in deletion this word to be combined bunch, and continuation is according to the sequencing of each vocabulary in the result queue, search successively except first vocabulary and current reach distance intermediate value is carried out cluster greater than the vocabulary of default neighborhood distance, in result queue, do not exist till any vocabulary.

6. the method for claim 1 is characterized in that, according to the vocabulary in each word to be combined that obtains bunch, each word to be combined of obtaining bunch is merged processing, obtains combinatorial word bunch, specifically comprises:

Each word to be combined of obtaining bunch is added in bunch formation;

At each word to be combined in bunch formation bunch, extract this word to be combined bunch, determine in bunch formation bunch to comprise with this word to be combined that extracts other words to be combined bunch of at least one identical vocabulary, other words to be combined of determining bunch are incorporated in this word to be combined bunch of extraction, obtained centre word to be combined bunch;

This word to be combined that deletion is extracted from bunch formation bunch and definite other words to be combined bunch, continue to determine in the prevariety formation bunch to comprise with this centre word to be combined other words to be combined bunch of at least one identical vocabulary, and merge, till the quantity of the vocabulary that comprises in this centre word to be combined bunch no longer changes;

When the quantity of the vocabulary that comprises in this centre word to be combined bunch no longer changes, with this centre word to be combined bunch as the combinatorial word that obtains bunch.

7. A microblog pushing device based on dense word clustering, characterized by comprising:

8. The device of claim 7, characterized in that the word spacing determination module is specifically configured to, for a first word and a second word, determine the word spacing between the first word and the second word using a formula [formula not reproduced in the source text], where the first word and the second word are any two words, i denotes the first word, j denotes the second word, D(i, j) is the word spacing between the first word and the second word, T_i is the co-occurrence word set of the first word, T_j is the co-occurrence word set of the second word, |T_i ∩ T_j| is the number of words contained in the intersection of the co-occurrence word set of the first word and the co-occurrence word set of the second word, |T_i| is the number of words contained in the co-occurrence word set of the first word, and |T_j| is the number of words contained in the co-occurrence word set of the second word.

9. The device of claim 7, characterized in that the core word determination module is specifically configured to, for each word to be determined, judge whether the number of other words whose word spacing to the word to be determined is not greater than a preset neighborhood distance is greater than a preset second quantity; if so, determine that the word to be determined is a core word; otherwise, determine that the word to be determined is not a core word.

10. The device of claim 7, characterized in that the word set division module is specifically configured to: add the words to an original queue in random order; for each word in the original queue, extract the word and judge whether the extracted word is a core word; if so, group the N words in the original queue with the smallest word spacing to the extracted word together with the extracted word into one word set, and delete from the original queue the extracted word and the core words whose word spacing to the extracted word is less than 2 times the preset neighborhood distance; otherwise, put the extracted word back into the original queue.

11. The device of claim 7, characterized in that the clustering module is specifically configured to: for each word set obtained by division, add all words in the word set to an ordered seed queue; extract the first word according to the order of the words in the ordered seed queue; judge whether the extracted word is a core word; if so, take the other words in the ordered seed queue as candidate words of the extracted word and, for each candidate word, determine the reachability distance from the candidate word to the extracted word; if the current intermediate reachability distance of the candidate word is not greater than the reachability distance from the candidate word to the extracted word, keep the intermediate reachability distance of the candidate word unchanged, and if the current intermediate reachability distance of the candidate word is greater than the reachability distance from the candidate word to the extracted word, take the reachability distance from the candidate word to the extracted word as the new intermediate reachability distance of the candidate word; insert the extracted word at the end of a result queue, delete the extracted word from the ordered seed queue, and sort the candidate words in the ordered seed queue in ascending order of their current intermediate reachability distances; continue to extract and process the first word in the ordered seed queue until the ordered seed queue contains no words; otherwise, insert the extracted word at the end of the result queue, delete the extracted word from the ordered seed queue, and continue to extract and process the first word in the ordered seed queue until the ordered seed queue contains no words; when the ordered seed queue contains no words, search, in the order of the words in the result queue, for a word other than the first word whose current intermediate reachability distance is greater than a preset neighborhood distance; when such a word is found, cluster all words ranked before the found word in the result queue into one word cluster to be merged, delete all words of this word cluster to be merged from the result queue, and continue to search in the same manner and perform clustering until the result queue contains no words.

12. The device of claim 7, characterized in that the merging module is specifically configured to: add each obtained word cluster to be merged to a cluster queue; for each word cluster to be merged in the cluster queue, extract the word cluster to be merged, determine the other word clusters to be merged in the cluster queue that contain at least one word in common with the extracted word cluster to be merged, and merge the determined other word clusters to be merged into the extracted word cluster to be merged to obtain an intermediate word cluster to be merged; delete the extracted word cluster to be merged and the determined other word clusters to be merged from the cluster queue, and continue to determine the other word clusters to be merged in the current cluster queue that contain at least one word in common with the intermediate word cluster to be merged and merge them, until the number of words contained in the intermediate word cluster to be merged no longer changes; and when the number of words contained in the intermediate word cluster to be merged no longer changes, take the intermediate word cluster to be merged as an obtained merged word cluster.