CN101989281A - Clustering method and device - Google Patents

Info

Publication number
CN101989281A
Authority
CN
China
Prior art keywords
word
string
clustered
word string
document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2009100891768A
Other languages
Chinese (zh)
Other versions
CN101989281B (en)
Inventor
孙宏伟
胡珉
罗治国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd
Priority to CN2009100891768A
Publication of CN101989281A
Application granted
Publication of CN101989281B
Legal status: Expired - Fee Related
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a clustering method that overcomes a defect of the prior art, namely that it is difficult to generate cluster labels with good readability when clustering search results. The method comprises the following steps: selecting a first candidate string set from the documents to be clustered according to a preset selection policy; for each string in the first candidate string set, selecting a second candidate string from the first candidate string set according to parameters related to that string, wherein the parameters comprise at least one of: the total number of times the string appears in all documents to be clustered, the total number of times the string appears in designated documents, the number of characters in the string, and the number of documents to be clustered that contain the string; and determining the second candidate string as a cluster label for clustering the documents to be clustered, and assigning each document to be clustered to the cluster corresponding to the cluster label. The invention also discloses a clustering device.

Description

Clustering method and device
Technical field
The present invention relates to the field of information retrieval, and in particular to a clustering method and device.
Background art
Search-result clustering is the process of grouping similar results returned by a search engine into clusters. A cluster is a set of mutually similar search results: results within the same cluster are similar to one another, while results in different clusters tend to differ. Search-result clustering helps users make better use of a search engine; for example, it can help a user locate needed information more quickly or obtain more comprehensive information.
In the prior art, search-result clustering methods fall into two classes: document-based methods and label-based methods. A document-based method first groups the documents into categories using a traditional document clustering method, and then extracts a suitable cluster label from each category to mark it. Such methods often fail to produce readable cluster labels, and the resulting labels are poorly differentiated, so the user has difficulty finding results that meet his or her needs; these methods were therefore mostly used in early search-result clustering work. A label-based method first extracts representative words from the documents, then evaluates and screens the extracted words, and uses the words that survive as cluster labels for different categories of documents; the documents are subsequently classified on the basis of these labels. In this class of methods the choice of cluster labels is critical, yet the label selection approaches provided in the prior art likewise have difficulty producing readable cluster labels.
As can be seen from the above, the search-result clustering methods of the prior art all have difficulty generating readable cluster labels, which in turn makes it difficult for the user to find, by means of the cluster labels, the search results that meet his or her needs.
Summary of the invention
Embodiments of the invention provide a clustering method and device that address the difficulty of generating readable cluster labels with the search-result clustering methods of the prior art.
To this end, embodiments of the invention adopt the following technical solutions:
A clustering method comprises: selecting a first candidate string set from the documents to be clustered according to a preset selection policy; for each word string in the first candidate string set, selecting a second candidate string from the first candidate string set according to parameters related to that word string, the parameters being at least one of: the total number of times the word string appears in all documents to be clustered, the total number of times the word string appears in designated documents, the number of characters in the word string, and the number of documents to be clustered that contain the word string; and determining the second candidate string as a cluster label for clustering the documents to be clustered, and assigning each document to be clustered to the cluster corresponding to the cluster label.
Preferably, selecting the second candidate string from the first candidate string set according to the parameters related to each word string specifically comprises: for each word string in the first candidate string set, calculating the importance Score of the word string from the total number of times the word string appears in all documents to be clustered, the total number of times the word string appears in the designated documents, the number of characters in the word string, and the number of documents to be clustered that contain the word string, using the following formula:
$$\text{Score} = \frac{\text{word.tf}}{\text{word.normtf}} \times \text{word.df} \times \log(\text{word.length})$$
where word.tf is the total number of times the word string appears in the documents to be clustered, word.normtf is the total number of times the word string appears in the designated documents, word.df is the number of documents to be clustered that contain the word string, and word.length is the number of characters in the word string;
Once the importance Score of each word string in the first candidate string set has been calculated, the second candidate string is selected from the first candidate string set according to the importance Score.
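The scoring formula above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the names `StringStats`, `importance_score` and `select_labels` are hypothetical, the natural logarithm is an assumption (the text does not state the base), clamping word.normtf to at least 1 is an assumption added to avoid division by zero, and keeping the k highest-scoring strings is only one possible reading of "selecting according to the importance Score".

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class StringStats:
    tf: int       # word.tf: occurrences across all documents to be clustered
    normtf: int   # word.normtf: occurrences in the designated reference documents
    df: int       # word.df: number of documents to be clustered containing the string
    length: int   # word.length: number of characters in the string


def importance_score(s: StringStats) -> float:
    # Assumption: clamp normtf to 1 so that a string absent from the
    # designated documents does not cause a division by zero.
    # Assumption: natural logarithm; the source does not give the base.
    return s.tf / max(s.normtf, 1) * s.df * math.log(s.length)


def select_labels(stats: Dict[str, StringStats], k: int) -> List[str]:
    # Assumption: keep the k strings with the highest Score.
    ranked = sorted(stats, key=lambda w: importance_score(stats[w]), reverse=True)
    return ranked[:k]


# Made-up statistics for two candidate strings.
stats = {
    "ab": StringStats(tf=10, normtf=10, df=5, length=2),
    "abcd": StringStats(tf=40, normtf=10, df=20, length=4),
}
print(select_labels(stats, k=1))
```

Note that with this formula a one-character string always scores zero, because the logarithm of 1 is 0 in any base.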
Preferably, the method further comprises: arranging the determined cluster labels in descending order of their importance Score.
Preferably, selecting the first candidate string set from the documents to be clustered according to the preset selection policy specifically comprises: selecting, from the word strings contained in the documents to be clustered, those word strings whose character count matches a preset first character-count threshold; and selecting, from the selected word strings, a first candidate string set that satisfies preset rules, the preset rules being any one or any combination of the following: for each word string in the first candidate string set, the number of documents to be clustered that contain the word string is not less than a preset first threshold; for each word string in the first candidate string set, the number of distinct word strings that, in the documents to be clustered, are adjacent to and precede the word string and whose character count matches a preset second character-count threshold is not less than a preset second threshold; for each word string in the first candidate string set, the number of distinct word strings that, in the documents to be clustered, are adjacent to and follow the word string and whose character count matches the preset second character-count threshold is not less than a preset third threshold; for each word string in the first candidate string set, the total number of times the word string appears in all documents to be clustered, divided by the total number of times the characters of the word string appear in all documents to be clustered, is not less than a preset fourth threshold.
Preferably, a multi-pattern matching method is used to assign each document to be clustered to the cluster corresponding to the cluster label.
Preferably, the method further comprises:
For each of the determined cluster labels, determining the total number of times the cluster label appears in all documents to be clustered, and arranging the determined cluster labels in descending order of these totals; or, for each of the determined cluster labels, determining the number of documents to be clustered that contain the cluster label, and arranging the determined cluster labels in descending order of these document counts; or arranging the determined cluster labels in descending order of the frequency with which each is used as a query word in a search engine, wherein the documents to be clustered are search results found by the search engine.
A clustering device comprises: a first selection unit, configured to select a first candidate string set from the documents to be clustered according to a preset selection policy; a second selection unit, configured to select, for each word string in the first candidate string set selected by the first selection unit, a second candidate string from the first candidate string set according to parameters related to that word string, the parameters being at least one of: the total number of times the word string appears in all documents to be clustered, the total number of times the word string appears in designated documents, the number of characters in the word string, and the number of documents to be clustered that contain the word string; a label determining unit, configured to determine the second candidate string selected by the second selection unit as a cluster label for clustering the documents to be clustered; and a classification unit, configured to assign each document to be clustered to the cluster corresponding to the cluster label determined by the label determining unit.
In the clustering scheme provided by embodiments of the invention, a first candidate string set is first selected, according to a preset selection policy, from the word strings contained in the documents to be clustered. Then, for each word string in the first candidate string set, a second candidate string is selected from the set according to parameters that reflect, among other things, how frequently the word string appears in all documents to be clustered and how representative it is of a document category; these parameters include at least one of: the total number of times the word string appears in all documents to be clustered, the total number of times it appears in designated documents, the number of characters it contains, and the number of documents to be clustered that contain it. The selected second candidate string is determined as a cluster label for clustering the documents to be clustered, and each document to be clustered is assigned to the cluster corresponding to the cluster label, which accomplishes both the determination of cluster labels and the clustering of documents. Because the label selection process takes into account, for each word string in the first candidate string set, parameters reflecting its frequency in the documents to be clustered and its representativeness of a document category, the generated cluster labels reflect the categories of the documents to be clustered well and therefore have good readability.
Brief description of the drawings
Fig. 1 is a detailed flow diagram of a clustering method provided by an embodiment of the invention;
Fig. 2 is a detailed flow diagram of applying a clustering method provided by an embodiment of the invention to the clustering of search results;
Fig. 3 is a schematic diagram of two search results returned by a search engine in an embodiment of the invention;
Fig. 4 is a detailed structural diagram of a clustering device provided by an embodiment of the invention.
Detailed description of the embodiments
Embodiments of the invention provide a clustering scheme in which cluster labels for clustering the documents to be clustered are selected from the word strings contained in those documents, according to a preset selection policy and to parameters that reflect how frequently a word string appears in all documents to be clustered and how representative it is of a document category. The generated cluster labels therefore reflect the categories of the documents to be clustered well and have good readability.
The main implementation principle of the technical solution of the embodiments, its specific implementations, and the beneficial effects it can achieve are explained in detail below with reference to the accompanying drawings.
An embodiment of the invention first provides a clustering method. As shown in the flow diagram of Fig. 1, the method comprises the following steps:
Step 11: select a first candidate string set from the documents to be clustered according to a preset selection policy;
Step 12: for each word string in the first candidate string set, select a second candidate string from the first candidate string set according to parameters related to that word string, where the parameters are at least one of: the total number of times the word string appears in all documents to be clustered, the total number of times the word string appears in designated documents, the number of characters in the word string, and the number of documents to be clustered that contain the word string. It should be noted that, taking the clustering of search results obtained from an Internet search engine as an example, the "designated documents" may be arbitrary web pages on the Internet; typically they may be 100,000, 200,000, or some other (generally large) number of arbitrary web pages. In that case, the more times a word string appears in the designated web pages, the more likely it is to be a commonly used word string on the web;
Step 13: determine the selected second candidate string as a cluster label for clustering the documents to be clustered, and assign each document to be clustered to the cluster corresponding to the cluster label.
For step 11 above, the first candidate string set may be selected from the documents to be clustered according to the preset selection policy as follows:
First, from the word strings contained in the documents to be clustered, select those whose character count matches the preset first character-count threshold. For Chinese, if one Chinese character is regarded as one "char" in the sense of this embodiment, the first character-count threshold may be set to 2 to 6, to match the habits of the Chinese language; for English, if one word is regarded as one "char" in the sense of this embodiment, the first character-count threshold may be set to 1 to 4;
Then, from the word strings selected above, select a first candidate string set that satisfies preset rules, where the preset rules may be, without limitation, any one or any combination of the following four rules:
Rule one: for each word string in the first candidate string set, the number of documents to be clustered that contain the word string is not less than a preset first threshold. This rule selects, from the word strings selected above, a first candidate string set whose word strings occur frequently and are therefore strongly representative of a document category;
Rule two: for each word string in the first candidate string set, the number of distinct word strings that, in the documents to be clustered, are adjacent to and precede the word string and whose character count matches a preset second character-count threshold is not less than a preset second threshold. If a word string that is adjacent to and precedes the word string, and whose character count matches the preset second character-count threshold, is called an adjacent preceding word string, then the effect of rule two is to select, as the first candidate string set, word strings that have low correlation with their adjacent preceding word strings;
Rule three: for each word string in the first candidate string set, the number of distinct word strings that, in the documents to be clustered, are adjacent to and follow the word string and whose character count matches the preset second character-count threshold is not less than a preset third threshold. If a word string that is adjacent to and follows the word string, and whose character count matches the preset second character-count threshold, is called an adjacent following word string, then the effect of rule three is to select, as the first candidate string set, word strings that have low correlation with their adjacent following word strings;
Rule four: for each word string in the first candidate string set, the total number of times the word string appears in all documents to be clustered, divided by the total number of times the characters of the word string appear in all documents to be clustered, is not less than a preset fourth threshold. The effect of rule four is to select a first candidate string set whose word strings contain relatively few characters and are strongly representative of a document category.
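The two-stage selection described above (enumerate word strings of the permitted length, then apply rules one to four) might be sketched as follows. This is only an illustration under stated assumptions: documents are plain Python strings, one "char" is one string character, adjacent word strings are taken to be a single character long (that is, the second character-count threshold is assumed to be 1), and the denominator of rule four is read as the sum of the per-character totals. The function names `ngrams`, `occurrences` and `first_candidate_set` are hypothetical.

```python
from collections import Counter
from typing import Iterator, List


def ngrams(doc: str, min_len: int, max_len: int) -> List[str]:
    """All contiguous substrings of doc that are min_len..max_len characters long."""
    return [
        doc[i:i + n]
        for i in range(len(doc))
        for n in range(min_len, max_len + 1)
        if i + n <= len(doc)
    ]


def occurrences(doc: str, s: str) -> Iterator[int]:
    """Yield each start index of s in doc, overlapping matches included."""
    i = doc.find(s)
    while i != -1:
        yield i
        i = doc.find(s, i + 1)


def first_candidate_set(docs: List[str], min_len: int, max_len: int,
                        min_df: int, min_left: int, min_right: int,
                        min_ratio: float) -> List[str]:
    char_tf = Counter(ch for d in docs for ch in d)  # per-character totals
    candidates = sorted({s for d in docs for s in ngrams(d, min_len, max_len)})
    kept = []
    for s in candidates:
        tf, df = 0, 0
        left, right = set(), set()
        for d in docs:
            hits = list(occurrences(d, s))
            tf += len(hits)
            df += 1 if hits else 0
            for p in hits:
                if p > 0:
                    left.add(d[p - 1])          # adjacent preceding character
                if p + len(s) < len(d):
                    right.add(d[p + len(s)])    # adjacent following character
        # Rule four (assumed reading): string total over summed character totals.
        ratio = tf / sum(char_tf[ch] for ch in s)
        if (df >= min_df and len(left) >= min_left
                and len(right) >= min_right and ratio >= min_ratio):
            kept.append(s)
    return kept


docs = ["xabq", "yabr", "zabs"]
print(first_candidate_set(docs, 2, 2, min_df=2, min_left=2, min_right=2, min_ratio=0.1))
```

In this toy input only "ab" occurs in more than one document and has varied neighbours on both sides, so it is the only string that passes all four rules.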
Preferably, in step 13 of this embodiment, a multi-pattern matching method may be used, without limitation, to assign each document to be clustered to the cluster corresponding to the cluster label. Specifically, for each document to be clustered, the document is first scanned to determine which cluster labels it contains, and the document is then assigned to the cluster corresponding to each of those cluster labels.
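The assignment step can be illustrated with the short sketch below. It uses a plain substring test per label as a stand-in for a true multi-pattern matcher (an algorithm such as Aho-Corasick would find all labels in one pass over each document); the function name `assign_to_clusters` and the sample data are hypothetical.

```python
from typing import Dict, List


def assign_to_clusters(docs: List[str], labels: List[str]) -> Dict[str, List[int]]:
    """Map each cluster label to the indices of the documents that contain it."""
    clusters: Dict[str, List[int]] = {label: [] for label in labels}
    for idx, doc in enumerate(docs):
        for label in labels:
            # Naive scan: one substring test per label. A document that
            # contains several labels is placed in several clusters.
            if label in doc:
                clusters[label].append(idx)
    return clusters


docs = ["red car", "blue car", "red bike"]
print(assign_to_clusters(docs, ["red", "car"]))
```

Document 0 contains both labels and therefore appears in both clusters, which matches the observation in the text that one document may belong to several clusters.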
In addition, because the same document may be included in the different clusters corresponding to different cluster labels, the cluster labels may be ordered by how representative they are of a document category, so that a needed document can easily be found in the cluster corresponding to a given label. Specifically, embodiments of the invention may further arrange the determined cluster labels using the following steps:
First, for each of the determined cluster labels, determine the total number of times the cluster label appears in all documents to be clustered;
Then, because the more times a cluster label appears in the documents, the more strongly it represents a document category, arrange the determined cluster labels in descending order of these totals.
Alternatively, embodiments of the invention may arrange the determined cluster labels using the following steps:
First, for each of the determined cluster labels, determine the importance Score of the cluster label;
Then, because a larger importance Score indicates that the cluster label occurs more frequently and is more strongly representative of a document category, arrange the determined cluster labels in descending order of their importance Score.
Alternatively, embodiments of the invention may arrange the determined cluster labels using the following steps:
First, for each of the determined cluster labels, determine the number of documents to be clustered that contain the cluster label;
Then, because the more documents to be clustered contain a cluster label, the more frequently that label occurs, arrange the determined cluster labels in descending order of these document counts.
Because the scheme provided by embodiments of the invention can be applied to clustering the search results found by a search engine, the determined cluster labels may be arranged using any of the arrangements above, or alternatively in descending order of the frequency with which each determined cluster label is used as a query word in the search engine, so that users of the search engine can easily find the search results they need by means of the cluster labels.
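All four orderings described above amount to sorting the labels in descending order of some per-label number. A minimal sketch, assuming those numbers have already been computed and are supplied as a dictionary (the function name `order_labels` and the sample figures are made up for illustration):

```python
from typing import Dict, List


def order_labels(labels: List[str], weight: Dict[str, float]) -> List[str]:
    """Return labels in descending order of weight; ties keep their input order."""
    return sorted(labels, key=lambda label: weight.get(label, 0), reverse=True)


labels = ["a", "b", "c"]
total_occurrences = {"a": 2, "b": 5, "c": 2}   # order by total occurrences
query_frequency = {"a": 9, "b": 1, "c": 4}     # order by use as a query word
print(order_labels(labels, total_occurrences))
print(order_labels(labels, query_frequency))
```

The same function covers ordering by importance Score or by document count: only the dictionary passed as `weight` changes.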
The implementation flow of the scheme is described in detail below, taking a practical application of the scheme provided by embodiments of the invention as an example:
Fig. 2 is a detailed flow diagram of applying the scheme provided by embodiments of the invention to clustering the search results found by a search engine. This embodiment illustrates the scheme using Chinese web pages as the search results; applying the scheme to the clustering of web pages in English or other languages also falls within the scope of protection of the invention. Specifically, the scheme corresponding to the flow chart shown in Fig. 2 comprises the following steps:
Step 21: select candidate strings from the search results to be clustered, where the number of characters in each selected candidate string matches the preset first character-count threshold. In this embodiment, a search result to be clustered may be the web page found by the search engine, or the summary and/or title corresponding to that page. The number of search results to be clustered can be set as required; in this embodiment it may be set to 200. Because a Chinese word generally contains at least two characters (a character here meaning one Chinese character), if the first character-count threshold is set to 2 to 6 in step 21, every word string of 2 to 6 characters in the summary or title is selected as a candidate string. For the two search results shown in Fig. 3, the following candidate strings can be selected from the title of the first search result, "east wind honda automobile CRV", where the English word "honda", made up of lowercase letters, is treated as one character, and the English word "CRV", made up of capital letters, is counted as one character per letter, although its 3 characters are inseparable:
[Figure B2009100891768D0000091: list of the candidate strings selected from the title; image not reproduced in this text]
Step 22: remove "noise candidate strings" from the selected candidate strings. The candidate strings selected in step 21 usually include a large number of noise candidate strings, that is, candidate strings that are not themselves meaningful phrases, such as "wind honda vapour" and "honda vapour" above. Because such strings carry no meaning, they are unsuitable as cluster labels and need to be filtered out. The following filters may be used:
First, if the frequency with which a candidate string occurs in the search results to be clustered is less than a threshold f1, the candidate string is determined to be a noise candidate string and is filtered out; in this embodiment f1 may be set to 3.
Second, if the number of distinct Chinese characters that are adjacent to and follow a candidate string is less than a threshold f2, the candidate string is determined to be a noise candidate string and is filtered out; in this embodiment f2 may be set to 5. Under this filter, for example, "east wind honda vapour" is likely to be filtered out, because in the search results to be clustered the Chinese character following it is usually only "car"; "wind honda" and "wind honda vapour" may be filtered out for the same reason. By contrast, the second search result shown in Fig. 3 contains the text "east wind is beautiful ...", so the Chinese character following "east wind" may be "mark" as well as "honda", and "east wind" is therefore not necessarily a noise candidate string. Two distinct following words, "honda" and "mark", are of course not enough to keep "east wind" from being filtered out: the number of distinct words following "east wind" must reach 5 before it can be determined that "east wind" will not be filtered out.
Third, if the number of distinct Chinese characters that are adjacent to and precede a candidate string is less than a threshold f3, the candidate string is determined to be a noise candidate string and is filtered out; in this embodiment f3 may be set to 5. Under this filter, for example, "wind honda automobile" and "wind honda vapour" are very likely to be filtered out, because the Chinese character preceding these candidate strings is usually only "east".
Fourth, if the total number of times a candidate string appears in all search results to be clustered, divided by the sum of the total numbers of times each Chinese character in the candidate string appears in all search results to be clustered, is less than a threshold f4, the candidate string is determined to be a noise candidate string and is filtered out; in this embodiment f4 may be set to 0.1.
For convenience of description, the set of candidate strings remaining after the noise candidate strings have been filtered out is referred to below as the first candidate string set;
Step 23: calculate the importance Score of each candidate string in the first candidate string set. Because the first candidate string set obtained after steps 21 and 22 still contains many candidate strings, the following formula (1) is used to calculate the importance Score of each candidate string in the set, so that more readable cluster labels can be selected from it:
$$\mathrm{Score} = \frac{\mathrm{word.tf}}{\mathrm{word.normtf}} \times \mathrm{word.df} \times \log(\mathrm{word.length}) \qquad (1)$$
Here, word denotes any candidate string in the first candidate string set; word.tf is the total number of times the candidate string appears in all search results to be clustered; word.normtf is the total number of times the candidate string appears in specified search results; word.df is the number of search results to be clustered that contain the candidate string; and word.length is the number of characters the candidate string contains. The specified search results can be a large number of arbitrary general web pages, so that word.normtf reflects how frequently the candidate string occurs in those pages. The meaning of each term in formula (1) is as follows:
The ratio word.tf / word.normtf reflects that the importance Score of any candidate string in the first candidate string set is proportional to the frequency with which the candidate string appears in the search results to be clustered, and inversely proportional to the frequency with which it appears in the language environment represented by a large number of arbitrary general web pages. It measures how representative the candidate string is of a search-result category;
The factor word.df reflects that the importance Score of any candidate string in the first candidate string set is proportional to the number of search results to be clustered in which the candidate string appears. Its physical meaning is how widely the candidate string occurs across the search results to be clustered: if the candidate string appears in very few of them, it is not representative of a search-result category and is not suitable as a cluster label;
The factor log(word.length) reflects that the number of characters in a candidate string should be a suitable value. Comparatively speaking, the word.tf value of a candidate string with more characters is generally smaller than that of a candidate string with fewer characters, so formula (1) has to take the character count word.length into account. Because the influence of this factor on the importance Score would otherwise be too large, formula (1) takes the logarithm of word.length to reduce the influence of the character count on Score;
In this embodiment of the invention, for the two search results shown in Figure 3, the strings in the first candidate string set obtained after steps 21 and 22, together with the parameters corresponding to each string, are shown in Table 1 below. Substituting the parameter values in Table 1 into formula (1) gives the importance Score of each candidate string in the first candidate string set:
Table 1:
[Table 1 appears as image B2009100891768D0000121 in the original publication.]
It should be noted that, as long as the correspondence between changes in the importance of each string and changes in the above parameters is preserved (for example, from the meaning of the parameters, the importance decreases as word.df decreases, and the importance decreases as word.normtf increases), the embodiment of the invention can also, without limitation, use the following formula (2) to calculate the importance Score1 of each string:
$$\mathrm{Score1} = \frac{\mathrm{word.tf}}{\mathrm{word.normtf}} \times \mathrm{word.df} \times \log(\mathrm{word.length}) \qquad (2)$$
Alternatively, formulas (3) to (5) below can be used to calculate the importance values Score2 to Score4 of each string respectively:
$$\mathrm{Score2} = \frac{\mathrm{word.tf}}{\mathrm{word.normtf}} \qquad (3)$$
$$\mathrm{Score3} = \mathrm{word.df} \qquad (4)$$
$$\mathrm{Score4} = \log(\mathrm{word.length}) \qquad (5)$$
Score1 to Score4 above can likewise be used as measures of the importance of each candidate string in the first candidate string set.
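As an illustration of formulas (1) and (3) to (5), the following minimal Python sketch computes the importance Score from the four parameters. It is not part of the original disclosure: the function names are hypothetical, the natural logarithm is assumed because the text does not state a base, the rejection of a non-positive word.normtf is an added assumption, and the sample values are invented rather than taken from Table 1.

```python
import math


def importance(tf: int, normtf: int, df: int, length: int) -> float:
    """Formula (1): Score = (tf / normtf) * df * log(length)."""
    if normtf <= 0:
        # Assumption: the source does not say how a zero count is handled.
        raise ValueError("normtf must be positive")
    return (tf / normtf) * df * math.log(length)


def partial_scores(tf: int, normtf: int, df: int, length: int) -> tuple:
    """Formulas (3) to (5): Score2, Score3 and Score4, in that order."""
    return tf / normtf, float(df), math.log(length)


# Invented values: 6 occurrences in the results to be clustered, 3 in the
# reference pages, found in 2 results, 2 characters long.
score = importance(tf=6, normtf=3, df=2, length=2)
s2, s3, s4 = partial_scores(tf=6, normtf=3, df=2, length=2)
assert math.isclose(score, s2 * s3 * s4)  # formula (1) is the product of (3) to (5)
```

Under these definitions a single-character string always scores 0, because log(1) = 0, so the length factor only favours strings of two or more characters.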
Step 24: once the importance Score of each string in the first candidate string set has been calculated, select second candidate strings from the first candidate string set. For example, a preset number of second candidate strings can be selected from the first candidate string set in descending order of importance Score. In this embodiment of the invention the preset number can be set to 20, so that step 24 selects 20 second candidate strings from the first candidate string set; if the first candidate string set contains fewer than 20 candidate strings, all of them can be selected directly as second candidate strings. Alternatively, an importance threshold can be set, and only candidate strings whose importance Score is greater than the importance threshold are selected from the first candidate string set as second candidate strings. In this embodiment of the invention, after steps 21 to 24 have been carried out, the second candidate strings finally selected are "beautiful 207", "automobile", "east wind" and so on;
Step 25: determine the second candidate strings as the cluster labels used to cluster the search results to be clustered;
Step 26: using multi-pattern string matching, assign each search result to be clustered to the cluster corresponding to each cluster label it matches. For example, all search results containing "beautiful 207" are placed in the cluster whose cluster label is "beautiful 207", all search results containing "automobile" are placed in the cluster whose cluster label is "automobile", and all search results containing "east wind" are placed in the cluster whose cluster label is "east wind";
Step 27: arrange the cluster labels in descending order of the frequency with which each cluster label is used as a search-engine query word (hereinafter the "usage frequency"). For example, in keeping with reading habits, the cluster label with the highest usage frequency can be placed at the far left (or at the top) of the page on which the cluster labels are listed, and the cluster label with the lowest usage frequency at the far right (or at the bottom).
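The assignment in step 26 can be sketched in a few lines of Python. The text only says "multi-pattern string matching" and does not name an algorithm, so this sketch substitutes a plain substring scan for a dedicated multi-pattern matcher such as Aho-Corasick; the function name and the sample search results are hypothetical. It also shows that one search result can land in several clusters.

```python
def assign_to_clusters(results: list, labels: list) -> dict:
    """Map each cluster label to the indices of the search results that contain it."""
    clusters = {label: [] for label in labels}
    for index, text in enumerate(results):
        for label in labels:
            # Naive containment test; a real multi-pattern matcher would
            # scan each text once for all labels.
            if label in text:
                clusters[label].append(index)
    return clusters


# Invented search results, using the labels from the example in the text.
results = [
    "east wind beautiful 207 review",
    "east wind honda automobile price",
    "used automobile market",
]
labels = ["beautiful 207", "automobile", "east wind"]
clusters = assign_to_clusters(results, labels)
# clusters == {"beautiful 207": [0], "automobile": [1, 2], "east wind": [0, 1]}
```

The scan costs one containment test per label per search result, which is acceptable for about 20 labels; a single-pass matcher becomes worthwhile only when the label set is much larger.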
Correspondingly, an embodiment of the invention also provides a clustering apparatus, so that the cluster labels determined by the apparatus fully reflect the categories of the documents to be clustered and have good readability. Specifically, as shown in the structural diagram of Figure 4, the apparatus comprises the following functional units:
A first selection unit 41, configured to select a first candidate string set from the documents to be clustered according to a preset selection policy;
A second selection unit 42, configured to select, for each string in the first candidate string set selected by the first selection unit 41, second candidate strings from the first candidate string set according to parameters related to the string, where the parameters related to the string comprise at least one of: the total number of times the string appears in all documents to be clustered, the total number of times the string appears in specified documents, the number of characters the string contains, and the number of documents to be clustered that contain the string;
A label determining unit 43, configured to determine the second candidate strings selected by the second selection unit 42 as the cluster labels used to cluster the documents to be clustered;
A classification unit 44, configured to assign each document to be clustered to the cluster corresponding to a cluster label determined by the label determining unit 43.
Preferably, in one implementation of the functions of the first selection unit 41, the first selection unit 41 can be further divided into the following functional modules:
A first selection module, configured to select, from the strings contained in the documents to be clustered, strings whose number of characters equals a preset first character-count threshold;
A second selection module, configured to select, from the strings selected by the first selection module, a first candidate string set that satisfies preset rules, where the preset rules are any one of the following rules or any combination of them:
Rule one: for each string in the first candidate string set, the number of documents to be clustered that contain the string is not less than a preset first threshold;
Rule two: for each string in the first candidate string set, the number of distinct strings in the documents to be clustered that are adjacent to the string, precede it, and contain a number of characters equal to a preset second character-count threshold is not less than a preset second threshold;
Rule three: for each string in the first candidate string set, the number of distinct strings in the documents to be clustered that are adjacent to the string, follow it, and contain a number of characters equal to the preset second character-count threshold is not less than a preset third threshold;
Rule four: for each string in the first candidate string set, the total number of times the string appears in all documents to be clustered, divided by the total number of times the characters it contains appear in all documents to be clustered, is not less than a preset fourth threshold.
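Rules one to four can be sketched as a single pass over the documents followed by a filter. This is an illustrative reading rather than the patented implementation: the function and parameter names are hypothetical, all four rules are applied together although the text allows any combination, and the denominator of rule four is taken as the sum of the occurrence counts of the characters in the string, counted once per position.

```python
from collections import defaultdict


def first_candidate_set(docs, n, ctx_len, min_docs, min_left, min_right, min_ratio):
    """Return the n-character strings that satisfy rules one to four."""
    tf = defaultdict(int)          # total occurrences of each n-character string
    containing = defaultdict(set)  # ids of the documents containing each string
    left = defaultdict(set)        # distinct ctx_len-character strings just before
    right = defaultdict(set)       # distinct ctx_len-character strings just after
    char_tf = defaultdict(int)     # total occurrences of each single character

    for doc_id, text in enumerate(docs):
        for ch in text:
            char_tf[ch] += 1
        for i in range(len(text) - n + 1):
            s = text[i:i + n]
            tf[s] += 1
            containing[s].add(doc_id)
            if i >= ctx_len:
                left[s].add(text[i - ctx_len:i])
            if i + n + ctx_len <= len(text):
                right[s].add(text[i + n:i + n + ctx_len])

    kept = set()
    for s, count in tf.items():
        ratio = count / sum(char_tf[ch] for ch in s)   # rule four
        if (len(containing[s]) >= min_docs             # rule one
                and len(left[s]) >= min_left           # rule two
                and len(right[s]) >= min_right         # rule three
                and ratio >= min_ratio):
            kept.add(s)
    return kept


# Toy data: "ab" occurs in three documents with three distinct neighbours on each side.
toy = first_candidate_set(["xaby", "zabw", "qabk"], n=2, ctx_len=1,
                          min_docs=3, min_left=3, min_right=3, min_ratio=0.1)
# toy == {"ab"}
```

In the embodiment described earlier the neighbour thresholds are 5 and the ratio threshold is 0.1; the toy call uses 3 only so that a three-document example can pass.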
Where the first selection unit 41 is divided into the above modules, the apparatus provided by the embodiment of the invention can further comprise:
An importance Score determining unit, configured to determine, for each of the cluster labels determined by the label determining unit 43, the importance Score of the cluster label; and a label arranging unit, configured to arrange the determined cluster labels in descending order of the importance Scores determined by the importance Score determining unit.
Preferably, in one implementation of the functions of the second selection unit 42, the second selection unit 42 can be further divided into the following functional modules:
A calculation module, configured to calculate, for each string in the first candidate string set, the importance Score of the string using the following formula:
$$\mathrm{Score} = \frac{\mathrm{word.tf}}{\mathrm{word.normtf}} \times \mathrm{word.df} \times \log(\mathrm{word.length})$$
where word.tf is the total number of times the string appears in the documents to be clustered, word.normtf is the total number of times the string appears in the specified documents, word.df is the number of documents to be clustered that contain the string, and word.length is the number of characters the string contains;
A selection module, configured to select, once the calculation module has calculated the importance Score of each string in the first candidate string set, second candidate strings from the first candidate string set according to the importance Score. A preset number of second candidate strings can be selected from the first candidate string set in descending order of importance Score; alternatively, an importance threshold can be set, and only strings in the first candidate string set whose importance Score is greater than the importance threshold are selected as second candidate strings.
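The two selection strategies of the selection module (a preset number, or an importance threshold) can be sketched as follows. The function name and the sample scores are hypothetical, and the sketch lets both strategies be combined, which goes slightly beyond the text, where they are presented as alternatives.

```python
from typing import Dict, List, Optional


def select_second_candidates(scores: Dict[str, float],
                             top_n: Optional[int] = 20,
                             min_score: Optional[float] = None) -> List[str]:
    """Pick second candidate strings in descending order of importance Score."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    if min_score is not None:
        # Threshold strategy: keep only scores strictly greater than the threshold.
        ranked = [s for s in ranked if scores[s] > min_score]
    if top_n is not None:
        # Preset-number strategy: slicing returns everything when fewer than top_n remain.
        ranked = ranked[:top_n]
    return ranked


# Invented scores for three candidate strings.
scores = {"east wind": 3.0, "automobile": 1.0, "beautiful 207": 2.0}
by_count = select_second_candidates(scores, top_n=2)
by_threshold = select_second_candidates(scores, top_n=None, min_score=1.5)
# Both calls return ["east wind", "beautiful 207"].
```

The default of 20 mirrors the preset number given for step 24; when the first candidate string set holds fewer strings than that, all of them are returned.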
Preferably, the classification unit 44 can use multi-pattern matching to assign each document to be clustered to the cluster corresponding to a cluster label determined by the label determining unit 43.
In addition, because the same document may be included in the different clusters corresponding to different cluster labels, and so that a required document can be found easily in the cluster corresponding to a given cluster label, the cluster labels can be sorted according to how representative each cluster label is of a document category. Specifically, the apparatus provided by the embodiment of the invention can further comprise:
A count determining unit 45, configured to determine, for each of the cluster labels determined by the label determining unit 43, the total number of times the cluster label appears in all documents to be clustered;
A label arranging unit 46, configured to arrange the cluster labels determined by the label determining unit 43 in descending order of the totals determined by the count determining unit 45.
In addition, the cluster labels can also be sorted according to how frequently they occur in the documents to be clustered. Specifically, the apparatus provided by the embodiment of the invention can further comprise:
A document-count determining unit, configured to determine, for each of the cluster labels determined by the label determining unit, the number of documents to be clustered that contain the cluster label;
A label arranging unit, configured to arrange the determined cluster labels in descending order of the document counts determined by the document-count determining unit.
It should be noted that, when the apparatus provided by the embodiment of the invention is applied to clustering the search results found by a search engine, the apparatus can further comprise another label arranging unit, configured to arrange the cluster labels determined by the label determining unit 43 in descending order of the frequency with which each is used as a search-engine query word, so that a user of the search engine can easily find the required search results by cluster label.
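The three label orderings described above (by total occurrences, by number of containing documents, and by usage frequency as a query word) differ only in the sort key. A minimal Python sketch, with a hypothetical function name, hypothetical criterion names and invented data, might look like this; the query-frequency table is assumed to come from search-engine logs, which the text does not describe.

```python
def order_labels(labels, documents, by="tf", query_freq=None):
    """Return the cluster labels sorted from most to least frequent under one criterion."""
    query_freq = query_freq or {}

    def total_occurrences(label):
        # str.count counts non-overlapping occurrences.
        return sum(doc.count(label) for doc in documents)

    def document_count(label):
        return sum(1 for doc in documents if label in doc)

    def usage_frequency(label):
        return query_freq.get(label, 0)

    keys = {"tf": total_occurrences, "df": document_count, "query": usage_frequency}
    if by not in keys:
        raise ValueError(f"unknown criterion: {by}")
    # sorted() is stable, so labels with equal counts keep their input order.
    return sorted(labels, key=keys[by], reverse=True)


documents = [
    "east wind honda automobile",
    "automobile market",
    "east wind beautiful 207, east wind dealer",
]
labels = ["beautiful 207", "automobile", "east wind"]
ordered = order_labels(labels, documents, by="tf")
# ordered == ["east wind", "automobile", "beautiful 207"]
```

The first element of the returned list would then be displayed at the far left (or top) of the label area, and the last element at the far right (or bottom).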
Obviously, those skilled in the art can make various changes and modifications to the invention without departing from its spirit and scope. Thus, if these changes and modifications fall within the scope of the claims of the invention and their technical equivalents, the invention is also intended to cover them.

Claims (10)

1. A clustering method, characterized by comprising:
selecting a first candidate string set from the documents to be clustered according to a preset selection policy;
for each string in the first candidate string set, selecting second candidate strings from the first candidate string set according to parameters related to the string, wherein the parameters related to the string comprise at least one of: the total number of times the string appears in all of the documents to be clustered, the total number of times the string appears in specified documents, the number of characters the string contains, and the number of the documents to be clustered that contain the string; and
determining the second candidate strings as cluster labels for clustering the documents to be clustered, and assigning each of the documents to be clustered to the cluster corresponding to a cluster label.
2. the method for claim 1 is characterized in that, at each word string in described first candidate character string set, according to the parameter relevant with this word string, chooses second candidate character string and specifically comprise from described first candidate character string set:
At each word string in described first candidate character string set, appear at total degree, this word string in described all documents to be clustered according to this word string and appear at the document number that comprises this each word string in total degree in the specified documents, character number that this word string comprises and the described document to be clustered, adopt following formula to calculate the importance degree Score of this word string:
Score = word . tf word . normtf * word . df * log ( word . length )
Wherein, word.tf is the total degree in this word string appears at described each document to be clustered, word.normtf appears at total degree in the described specified documents for this word string, word.df is the document number described to be clustered that comprises this word string, the character number that word.length comprises for this word string;
In calculating described first candidate character string set, behind the importance degree Score of each word string,, from described first candidate character string set, choose second candidate character string according to described importance degree Score.
3. The method of claim 2, characterized by further comprising:
arranging the determined cluster labels in descending order of their importance Scores.
4. the method for claim 1 is characterized in that, according to the default strategy of choosing, chooses the set of first candidate character string and specifically comprise from each document to be clustered:
From the word string that each document comprised to be clustered, choose character number and the default consistent word string of the first character number threshold value that word string comprises;
From the described word string of choosing, choose first candidate character string set that meets preset rules, described preset rules be in the following rule any one or be the combination in any of following rule:
At each word string in described first candidate character string set, the number that comprises the document described to be clustered of this word string is not less than presetting first threshold;
At each word string in described first candidate character string set, in described each document to be clustered, adjacent with this word string, be positioned at before this word string and the number of the number of characters that comprises and the different word strings of the default second character number threshold value unanimity is not less than the second default threshold value;
At each word string in described first candidate character string set, in described each document to be clustered, adjacent with this word string, be positioned at after this word string and the number of the number of characters that comprises and the different word strings of the default second character number threshold value unanimity is not less than the 3rd default threshold value;
At each word string in the set of described first candidate character string, this word string appears at the numerical value that each character that the total degree in described all documents to be clustered comprises divided by this word string appears at the total degree gained in described all documents to be clustered and is not less than the 4th default threshold value.
5. The method of claim 1, 2 or 4, characterized in that multi-pattern matching is used to assign each of the documents to be clustered to the cluster corresponding to a cluster label.
6. The method of claim 1, 2 or 4, characterized by further comprising:
for each of the determined cluster labels, determining the total number of times the cluster label appears in all of the documents to be clustered, and arranging the determined cluster labels in descending order of the determined totals; or
for each of the determined cluster labels, determining the number of the documents to be clustered that contain the cluster label, and arranging the determined cluster labels in descending order of the determined document counts; or
arranging the determined cluster labels in descending order of the frequency with which each is used as a search-engine query word, wherein the documents to be clustered are search results found by a search engine.
7. A clustering apparatus, characterized by comprising:
a first selection unit, configured to select a first candidate string set from the documents to be clustered according to a preset selection policy;
a second selection unit, configured to select, for each string in the first candidate string set selected by the first selection unit, second candidate strings from the first candidate string set according to parameters related to the string, wherein the parameters related to the string comprise at least one of: the total number of times the string appears in all of the documents to be clustered, the total number of times the string appears in specified documents, the number of characters the string contains, and the number of the documents to be clustered that contain the string;
a label determining unit, configured to determine the second candidate strings selected by the second selection unit as cluster labels for clustering the documents to be clustered; and
a classification unit, configured to assign each of the documents to be clustered to the cluster corresponding to a cluster label determined by the label determining unit.
8. The apparatus of claim 7, characterized in that the second selection unit specifically comprises:
a calculation module, configured to calculate, for each string in the first candidate string set, the importance Score of the string from the total number of times the string appears in all of the documents to be clustered, the total number of times the string appears in the specified documents, the number of characters the string contains, and the number of the documents to be clustered that contain the string, using the following formula:
$$\mathrm{Score} = \frac{\mathrm{word.tf}}{\mathrm{word.normtf}} \times \mathrm{word.df} \times \log(\mathrm{word.length})$$
where word.tf is the total number of times the string appears in the documents to be clustered, word.normtf is the total number of times the string appears in the specified documents, word.df is the number of the documents to be clustered that contain the string, and word.length is the number of characters the string contains; and
a selection module, configured to select, once the calculation module has calculated the importance Score of each string in the first candidate string set, second candidate strings from the first candidate string set according to the importance Score.
9. The apparatus of claim 7, characterized in that the first selection unit specifically comprises:
a first selection module, configured to select, from the strings contained in the documents to be clustered, strings whose number of characters equals a preset first character-count threshold; and
a second selection module, configured to select, from the strings selected by the first selection module, a first candidate string set that satisfies preset rules, the preset rules being any one of the following rules or any combination of them:
for each string in the first candidate string set, the number of the documents to be clustered that contain the string is not less than a preset first threshold;
for each string in the first candidate string set, the number of distinct strings in the documents to be clustered that are adjacent to the string, precede it, and contain a number of characters equal to a preset second character-count threshold is not less than a preset second threshold;
for each string in the first candidate string set, the number of distinct strings in the documents to be clustered that are adjacent to the string, follow it, and contain a number of characters equal to the preset second character-count threshold is not less than a preset third threshold;
for each string in the first candidate string set, the total number of times the string appears in all of the documents to be clustered, divided by the total number of times the characters it contains appear in all of the documents to be clustered, is not less than a preset fourth threshold.
10. The apparatus of any one of claims 7 to 9, characterized by further comprising:
a count determining unit, configured to determine, for each of the cluster labels determined by the label determining unit, the total number of times the cluster label appears in all of the documents to be clustered; and
a label arranging unit, configured to arrange the determined cluster labels in descending order of the totals determined by the count determining unit; or
further comprising: a document-count determining unit, configured to determine, for each of the cluster labels determined by the label determining unit, the number of the documents to be clustered that contain the cluster label; and
a label arranging unit, configured to arrange the determined cluster labels in descending order of the document counts determined by the document-count determining unit; or
further comprising: a label arranging unit, configured to arrange the cluster labels determined by the label determining unit in descending order of the frequency with which each is used as a search-engine query word, wherein the documents to be clustered are search results found by a search engine.
CN2009100891768A 2009-08-03 2009-08-03 Clustering method and device Expired - Fee Related CN101989281B (en)

Publications (2)

Publication Number Publication Date
CN101989281A true CN101989281A (en) 2011-03-23
CN101989281B CN101989281B (en) 2012-06-27

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102760142A (en) * 2011-04-29 2012-10-31 北京百度网讯科技有限公司 Method and device for extracting subject label in search result aiming at searching query
CN103207896A (en) * 2013-03-14 2013-07-17 无锡清华信息科学与技术国家实验室物联网技术中心 Method and system for stable and efficient self-adaptive clustering
CN106033444A (en) * 2015-03-16 2016-10-19 北京国双科技有限公司 Method and device for clustering text content

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1158460A (en) * 1996-12-31 1997-09-03 复旦大学 Multiple languages automatic classifying and searching method
CN101458708B (en) * 2008-12-05 2012-07-04 北京大学 Searching result clustering method and device

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102760142A (en) * 2011-04-29 2012-10-31 北京百度网讯科技有限公司 Method and device for extracting subject label in search result aiming at searching query
CN103207896A (en) * 2013-03-14 2013-07-17 无锡清华信息科学与技术国家实验室物联网技术中心 Method and system for stable and efficient self-adaptive clustering
CN103207896B (en) * 2013-03-14 2017-02-01 无锡清华信息科学与技术国家实验室物联网技术中心 Method and system for stable and efficient self-adaptive clustering
CN106033444A (en) * 2015-03-16 2016-10-19 北京国双科技有限公司 Method and device for clustering text content
CN106033444B (en) * 2015-03-16 2019-12-10 北京国双科技有限公司 Text content clustering method and device

Also Published As

Publication number Publication date
CN101989281B (en) 2012-06-27

Similar Documents

Publication Publication Date Title
CN101246499B (en) Network information search method and system
US7424421B2 (en) Word collection method and system for use in word-breaking
Yu et al. Improving pseudo-relevance feedback in web information retrieval using web page segmentation
CN108829658B (en) Method and device for discovering new words
CN101251837B (en) Display handling method and system of electronic file list
CN101908071B (en) Method and device thereof for improving search efficiency of search engine
US20080086453A1 (en) Method and apparatus for correlating the results of a computer network text search with relevant multimedia files
CN100419755C (en) Systems and methods for document data analysis
CN101609459A (en) A kind of extraction system of affective characteristic words
CN101609450A (en) Web page classification method based on training set
CN110232126B (en) Hot spot mining method, server and computer readable storage medium
EP2425353A1 (en) Method and apparatus for identifying synonyms and using synonyms to search
CN102169496A (en) Anchor text analysis-based automatic domain term generating method
CN102144229A (en) System for extracting term from document containing text segment
CN101071422A (en) Music file search processing system and method
CN102236654A (en) Web useless link filtering method based on content relevancy
CN103186556A (en) Method for obtaining and searching structural semantic knowledge and corresponding device
CN104915422A (en) Webpage collecting method and device based on browser
CN101853298B (en) Event-oriented query expansion method
CN110276079A (en) A kind of dictionary method for building up, information retrieval method and corresponding system
WO2007113585A1 (en) Methods and systems of indexing and retrieving documents
KR100913733B1 (en) Method for Providing Search Result Using Template
CN101989281B (en) Clustering method and device
CN102722526B (en) Part-of-speech classification statistics-based duplicate webpage and approximate webpage identification method
KR101908073B1 (en) Sentence completion type search system and method that recommends words of high interest as search words

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 2012-06-27

Termination date: 2021-08-03
