CN101989281A - Clustering method and device - Google Patents
- Publication number
- CN101989281A CN2009100891768A CN200910089176A
- Authority
- CN
- China
- Prior art keywords
- word
- string
- clustered
- word string
- document
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a clustering method that overcomes a defect of the prior art, namely that retrieval result clustering has difficulty generating cluster labels with good readability. The method comprises the following steps: selecting a first candidate string set from the documents to be clustered according to a preset selection policy; for each string in the first candidate string set, selecting second candidate strings from the first candidate string set according to parameters related to that string, wherein the parameters comprise at least one of the total number of times the string appears in all documents to be clustered, the total number of times the string appears in specified documents, the number of characters in the string, and the number of documents to be clustered that contain the string; and determining the second candidate strings as the cluster labels for clustering the documents to be clustered, and classifying each document to be clustered into the cluster corresponding to the cluster label. The invention also discloses a clustering device.
Description
Technical field
The present invention relates to the field of information retrieval, and in particular to a clustering method and device.
Background art
Retrieval result clustering is the process of grouping similar results returned by a search engine into clusters. A cluster is a set of mutually similar retrieval results: results within the same cluster resemble one another, while results in different clusters usually differ. Retrieval result clustering helps users make better use of a search engine, for example by locating the needed information more quickly or by obtaining more comprehensive information.
In the prior art, retrieval result clustering methods fall into two classes: document-based (Documents-Based) methods and label-based (Label-Based) methods. A document-based method first groups the documents into several categories with a traditional document clustering method, and then extracts a suitable cluster label from each category to annotate it. Such methods often fail to generate readable cluster labels, and the resulting labels are poorly differentiated from one another, so users have difficulty finding the retrieval results they need; these methods were therefore used mostly in early work on retrieval result clustering. A label-based method first extracts representative words from the documents, then evaluates and filters the extracted words, and uses the words that survive as cluster labels for different categories of documents, so that the documents can subsequently be classified on the basis of those labels. In such methods the choice of cluster labels is critical, yet the label selection approaches provided in the prior art likewise have difficulty producing readable cluster labels.
As can be seen, the retrieval result clustering methods of the prior art all share the defect that they have difficulty generating readable cluster labels, so that users have difficulty finding the retrieval results they need from the cluster labels.
Summary of the invention
The embodiments of the invention provide a clustering method and device to overcome the defect that the retrieval result clustering methods of the prior art have difficulty generating readable cluster labels.
To this end, the embodiments of the invention adopt the following technical solutions:
A clustering method, comprising: choosing a first candidate character string set from the documents to be clustered according to a preset selection strategy; for each word string in the first candidate character string set, choosing second candidate character strings from the first candidate character string set according to parameters related to that word string, the parameters being at least one of: the total number of times the word string appears in all the documents to be clustered, the total number of times the word string appears in specified documents, the number of characters the word string contains, and the number of documents to be clustered that contain the word string; and determining the second candidate character strings as the cluster labels for clustering the documents to be clustered, and assigning each document to be clustered to the cluster corresponding to the cluster label.
Preferably, choosing the second candidate character strings from the first candidate character string set according to the parameters related to each word string specifically comprises: for each word string in the first candidate character string set, calculating the importance degree Score of the word string from the total number of times the word string appears in all the documents to be clustered, the total number of times it appears in the specified documents, the number of characters it contains, and the number of documents to be clustered that contain it, using the following formula:
where word.tf is the total number of times the word string appears in the documents to be clustered, word.normtf is the total number of times the word string appears in the specified documents, word.df is the number of documents to be clustered that contain the word string, and word.length is the number of characters the word string contains;
After the importance degree Score of each word string in the first candidate character string set has been calculated, the second candidate character strings are chosen from the first candidate character string set according to the importance degree Score.
Preferably, the method further comprises: arranging the determined cluster labels in descending order of their importance degree Score.
Preferably, choosing the first candidate character string set from the documents to be clustered according to the preset selection strategy specifically comprises: from the word strings contained in the documents to be clustered, choosing the word strings whose character count matches a preset first character number threshold; and, from the word strings so chosen, choosing a first candidate character string set that satisfies preset rules, the preset rules being any one of the following rules or any combination of them: for each word string in the first candidate character string set, the number of documents to be clustered that contain the word string is not less than a preset first threshold; for each word string in the first candidate character string set, the number of distinct word strings in the documents to be clustered that are adjacent to and precede the word string, and whose character count matches a preset second character number threshold, is not less than a preset second threshold; for each word string in the first candidate character string set, the number of distinct word strings in the documents to be clustered that are adjacent to and follow the word string, and whose character count matches the preset second character number threshold, is not less than a preset third threshold; for each word string in the first candidate character string set, the value obtained by dividing the total number of times the word string appears in all the documents to be clustered by the total number of times each character of the word string appears in all the documents to be clustered is not less than a preset fourth threshold.
Preferably, a multi-pattern matching method is used to assign each document to be clustered to the cluster corresponding to the cluster label.
Preferably, the method further comprises:
for each of the determined cluster labels, determining the total number of times the cluster label appears in all the documents to be clustered, and arranging the determined cluster labels in descending order of these total counts; or, for each of the determined cluster labels, determining the number of documents to be clustered that contain the cluster label, and arranging the determined cluster labels in descending order of these document counts; or arranging the determined cluster labels in descending order of the frequency with which each is used as a search engine query word, wherein the documents to be clustered are search results returned by a search engine.
A clustering device, comprising: a first selection unit, configured to choose a first candidate character string set from the documents to be clustered according to a preset selection strategy; a second selection unit, configured to choose, for each word string in the first candidate character string set chosen by the first selection unit, second candidate character strings from the first candidate character string set according to parameters related to that word string, the parameters being at least one of: the total number of times the word string appears in all the documents to be clustered, the total number of times the word string appears in specified documents, the number of characters the word string contains, and the number of documents to be clustered that contain the word string; a label determining unit, configured to determine the second candidate character strings chosen by the second selection unit as the cluster labels for clustering the documents to be clustered; and a classification unit, configured to assign each document to be clustered to the cluster corresponding to the cluster label determined by the label determining unit.
The clustering scheme provided by the embodiments of the invention first chooses a first candidate character string set from the word strings contained in the documents to be clustered, according to a preset selection strategy. Then, for each word string in this first candidate character string set, it chooses second candidate character strings from the set according to parameters that reflect how frequently the word string appears in all the documents to be clustered and how representative it is of a document category. These parameters comprise at least one of: the total number of times the word string appears in all the documents to be clustered, the total number of times it appears in specified documents, the number of characters it contains, and the number of documents to be clustered that contain it. The chosen second candidate character strings are determined as the cluster labels for clustering the documents to be clustered, and each document to be clustered is assigned to the cluster corresponding to the cluster label, which accomplishes both the determination of the cluster labels and the clustering of the documents. Because the selection of cluster labels takes into account, for each word string in the first candidate character string set, parameters that reflect its frequency in the documents to be clustered and its representativeness of a document category, the generated cluster labels reflect the categories of the documents to be clustered well, and the determined cluster labels therefore have good readability.
Description of the drawings
Fig. 1 is a flow diagram of a clustering method provided by an embodiment of the invention;
Fig. 2 is a flow diagram of applying a clustering method provided by an embodiment of the invention to the clustering of retrieval results;
Fig. 3 is a diagram of two retrieval results returned by a search engine in an embodiment of the invention;
Fig. 4 is a structural diagram of a clustering device provided by an embodiment of the invention.
Detailed description of the embodiments
The embodiments of the invention provide a clustering scheme that chooses, from the word strings contained in the documents to be clustered, the cluster labels used to cluster those documents. The choice is made according to a preset selection strategy and to parameters that reflect how frequently a word string appears in all the documents to be clustered and how representative it is of a document category, so that the generated cluster labels reflect the categories of the documents to be clustered well and achieve good readability.
The main implementation principle of the technical solution of the embodiments, its specific implementations, and the beneficial effects it can achieve are explained in detail below with reference to the accompanying drawings.
An embodiment of the invention first provides a clustering method. Its flow diagram is shown in Fig. 1 and comprises the following steps:
Step 12: for each word string in the first candidate character string set, choose second candidate character strings from the first candidate character string set according to parameters related to that word string, where the parameters are at least one of: the total number of times the word string appears in all the documents to be clustered, the total number of times the word string appears in specified documents, the number of characters the word string contains, and the number of documents to be clustered that contain the word string. It should be explained that, taking the clustering of search results obtained from an Internet search engine as an example, the "specified documents" can be arbitrary web pages on the Internet. Typically they are 100,000, 200,000 or some other (generally large) number of arbitrary web pages. In this case, the more times a word string appears in these specified web pages, the more likely it is to be a word string in common use on web pages;
For step 11 above, the following steps can be used to choose the first candidate character string set from the documents to be clustered according to the preset selection strategy:
First, from the word strings contained in the documents to be clustered, choose those whose character count matches the preset first character number threshold. For Chinese, if one Chinese character is regarded as one "char" in the sense of the embodiments, the first character number threshold can be set to 2~6, in keeping with Chinese usage; for English, if one word is regarded as one "char" in the sense of the embodiments, the first character number threshold can be set to 1~4;
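The length-based enumeration described above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the function name `enumerate_word_strings` and its parameters are hypothetical, each Python character is treated as one "char" (the Chinese case), and the range 2~6 is taken from the text as a default.

```python
from collections import Counter

def enumerate_word_strings(documents, min_len=2, max_len=6):
    """Count every substring whose character count lies in [min_len, max_len].

    Assumption: one Python character is one "char" (the Chinese case in the
    text); for English one would split each document into words first.
    """
    tf = Counter()  # total occurrences across all documents
    df = Counter()  # number of documents containing the string
    for doc in documents:
        seen = set()
        for n in range(min_len, max_len + 1):
            for i in range(len(doc) - n + 1):
                s = doc[i:i + n]
                tf[s] += 1
                seen.add(s)
        df.update(seen)  # each document counts once per distinct string
    return tf, df
```

The two counters returned here correspond to the quantities the text later calls word.tf and word.df, so they can feed the filtering rules and the importance calculation that follow.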
Then, from the word strings chosen above, choose a first candidate character string set that satisfies preset rules, where the preset rules can be, but are not limited to, any one of the following four rules or any combination of them:
Rule one: for each word string in the first candidate character string set, the number of documents to be clustered that contain the word string is not less than a preset first threshold. Rule one serves to select, from the word strings chosen above, a first candidate character string set whose word strings occur frequently and are strongly representative of a document category;
Rule two: for each word string in the first candidate character string set, the number of distinct word strings in the documents to be clustered that are adjacent to and precede the word string, and whose character count matches a preset second character number threshold, is not less than a preset second threshold. If a word string that is adjacent to and precedes this word string, and whose character count matches the preset second character number threshold, is called an adjacent preceding word string, then the effect of rule two is to select as the first candidate character string set those word strings that are only weakly correlated with their adjacent preceding word strings;
Rule three: for each word string in the first candidate character string set, the number of distinct word strings in the documents to be clustered that are adjacent to and follow the word string, and whose character count matches the preset second character number threshold, is not less than a preset third threshold. If a word string that is adjacent to and follows this word string, and whose character count matches the preset second character number threshold, is called an adjacent following word string, then the effect of rule three is to select as the first candidate character string set those word strings that are only weakly correlated with their adjacent following word strings;
Rule four: for each word string in the first candidate character string set, the value obtained by dividing the total number of times the word string appears in all the documents to be clustered by the total number of times each character of the word string appears in all the documents to be clustered is not less than a preset fourth threshold. The effect of rule four is to select a first candidate character string set whose word strings contain fewer characters and are strongly representative of a document category.
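The four rules above can be illustrated with a small Python check. This is a hedged sketch, not the patented implementation: the function names, the default threshold values, and the choice to apply all four rules together are assumptions (the text allows any one rule or any combination), and rule four is interpreted here as requiring the ratio to hold for every character of the word string.

```python
def count_occurrences(text, s):
    """Count possibly overlapping occurrences of s in text."""
    count, start = 0, text.find(s)
    while start != -1:
        count += 1
        start = text.find(s, start + 1)
    return count

def passes_rules(s, documents, min_df=2, min_left=2, min_right=2,
                 min_ratio=0.1, neighbor_len=1):
    """Return True if word string s satisfies rules one to four (all assumed on)."""
    # Rule one: enough documents contain s.
    if sum(1 for d in documents if s in d) < min_df:
        return False
    # Rules two and three: enough distinct neighbors before and after s.
    left, right = set(), set()
    for d in documents:
        start = d.find(s)
        while start != -1:
            if start >= neighbor_len:
                left.add(d[start - neighbor_len:start])
            end = start + len(s)
            if end + neighbor_len <= len(d):
                right.add(d[end:end + neighbor_len])
            start = d.find(s, start + 1)
    if len(left) < min_left or len(right) < min_right:
        return False
    # Rule four (assumed reading): tf(s) / tf(c) >= min_ratio for each char c.
    tf = sum(count_occurrences(d, s) for d in documents)
    for ch in set(s):
        ch_tf = sum(d.count(ch) for d in documents)
        if ch_tf == 0 or tf / ch_tf < min_ratio:
            return False
    return True
```

Many distinct neighbors on both sides suggest that the word string is a self-contained unit rather than a fragment of a longer phrase, which is the intuition the text gives for rules two and three.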
Preferably, step 13 of the embodiment can, but need not, use a multi-pattern matching method to assign each document to be clustered to the cluster corresponding to the cluster label. Concretely, for each document to be clustered, the document is first scanned to determine which cluster labels it contains, and the document is then assigned to the cluster of each cluster label found in it.
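The assignment step can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `assign_to_clusters` is hypothetical, and the naive per-label substring test stands in for a true multi-pattern matcher (an automaton-based matcher such as Aho-Corasick would scan each document only once).

```python
def assign_to_clusters(documents, labels):
    """Map each cluster label to the indices of the documents that contain it.

    Naive stand-in for multi-pattern matching: each label is tested
    separately with a substring check.
    """
    clusters = {label: [] for label in labels}
    for doc_id, text in enumerate(documents):
        for label in labels:
            if label in text:
                clusters[label].append(doc_id)
    return clusters
```

Because a document is appended to the cluster of every label it contains, one document can land in several clusters, which matches the behavior the text describes.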
In addition, because the same document may be included in the different clusters corresponding to different cluster labels, the cluster labels can be ordered by how representative they are of a document category, so that the needed document can be found conveniently from the cluster of a given cluster label. Specifically, the embodiment can further arrange the determined cluster labels with the following steps:
First, for each of the determined cluster labels, determine the total number of times the cluster label appears in all the documents to be clustered;
The more times a cluster label appears in the documents, the more strongly it represents a document category; the determined cluster labels can therefore be arranged in descending order of these total counts.
Alternatively, the embodiment can arrange the determined cluster labels with the following steps:
First, for each of the determined cluster labels, determine the importance degree Score of the cluster label;
The larger the importance degree Score of a cluster label, the higher its frequency of occurrence and the more strongly it represents a document category; the determined cluster labels can therefore be arranged in descending order of their importance degree Score.
Alternatively, the embodiment can also arrange the determined cluster labels with the following steps:
First, for each of the determined cluster labels, determine the number of documents to be clustered that contain the cluster label;
The more documents to be clustered contain a cluster label, the higher its frequency of occurrence; the determined cluster labels can therefore be arranged in descending order of these document counts.
Since the scheme provided by the embodiments can be applied to clustering the search results returned by a search engine, it can arrange the determined cluster labels in any of the ways above, or arrange them in descending order of the frequency with which each determined cluster label is used as a search engine query word, so that users of the search engine can conveniently find the retrieval results they need from the cluster labels.
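The alternative orderings above differ only in the value attached to each label, so they can share one routine. The following Python sketch assumes a hypothetical helper `order_labels` that takes a mapping from label to ranking value (total count, importance degree Score, document count, or query frequency).

```python
def order_labels(labels, ranking_values):
    """Arrange labels in descending order of their ranking value.

    ranking_values maps label -> number; labels missing from the mapping
    are treated as 0. sorted() is stable, so ties keep their input order.
    """
    return sorted(labels, key=lambda label: ranking_values.get(label, 0),
                  reverse=True)
```

For example, passing a mapping of label to document count yields the third ordering described in the text, while passing query-log frequencies yields the search-engine ordering.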
The implementation flow of the scheme is described in detail below, taking a practical application of the scheme provided by the embodiments as an example:
Fig. 2 is a flow diagram of applying the scheme provided by the embodiments to clustering the retrieval results returned by a search engine. This specific embodiment illustrates the scheme with retrieval results that are Chinese web pages; applying the scheme to the clustering of English or other-language web pages also falls within the protection scope of the invention. Specifically, the flow shown in Fig. 2 comprises the following steps:
where word denotes any candidate character string in the first candidate character string set; word.tf is the total number of times this candidate character string appears in all the retrieval results to be clustered; word.normtf is the total number of times it appears in the specified retrieval results; word.df is the number of retrieval results that contain it; and word.length is the number of characters it contains. The specified retrieval results can be a large number of arbitrary generic web pages, and word.normtf reflects how frequently the candidate character string occurs in those pages. The meaning of each term in formula (1) is as follows:
This term reflects that the importance degree Score of any candidate character string in the first candidate character string set is directly proportional to the frequency with which it appears in the retrieval results to be clustered, and inversely proportional to the frequency with which it appears in the language environment represented by a large number of arbitrary generic web pages; it measures how representative the candidate character string is of a retrieval result category;
word.df: this term reflects that the importance degree Score of any candidate character string in the first candidate character string set is directly proportional to how often the candidate character string appears across the retrieval results to be clustered. Its physical meaning is the frequency of the candidate character string in the retrieval results to be clustered: if the candidate character string appears in them only rarely, it is not representative of a retrieval result category and is unsuitable as a cluster label;
log(word.length): this term reflects that the number of characters in a candidate character string should take a suitable value. Comparatively, a candidate character string with more characters generally has a smaller word.tf than one with fewer characters, so formula (1) needs to take the character count word.length into account. Because the direct influence of this factor on the importance degree Score would be excessive, formula (1) takes the logarithm of word.length to reduce that influence;
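The following Python sketch shows one scoring function that is consistent with the three relationships described above. It is an assumption, not a reproduction of formula (1): the formula itself does not appear in this text, and the product form, the function name `importance_score`, and the use of the natural logarithm are all hypothetical choices.

```python
import math

def importance_score(tf, normtf, df, length):
    """Illustrative importance degree for one candidate character string.

    ASSUMPTION: a product of three terms, chosen only because it rises with
    tf and df, falls with normtf, and grows slowly (logarithmically) with
    length, as the text describes. normtf must be greater than zero.
    """
    return (tf / normtf) * df * math.log(length)
```

Under this assumed form a single-character string scores zero because log(1) = 0, which fits the text's observation that candidate strings should not be too short.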
In the embodiment, for the two retrieval results shown in Fig. 3, the word strings in the first candidate character string set obtained after steps 21 and 22, together with the parameters corresponding to each of them, are listed in Table 1 below. Substituting the parameter values from Table 1 into formula (1) yields the importance degree Score of each candidate character string in the first candidate character string set:
Table 1:
It should be noted that any formula can be used so long as it reflects the correspondence between the importance degree of each word string and the numerical change of the parameters above (for example, from the meanings of the parameters, the importance degree decreases as word.df decreases, and the importance degree decreases as word.normtf increases). The embodiment can therefore also, without limitation, use the following formula (2) to calculate the importance degree Score1 of each word string:
Alternatively, the following formulas (3)~(5) can be used to calculate the importance degrees Score2~Score4 of each word string, respectively:
Score3 = word.df (4)
Score4 = log(word.length) (5)
Score1~Score4 above can likewise be used as parameters for measuring the importance degree of each candidate character string in the first candidate character string set.
Step 24: after the importance degree Score of each word string in the first candidate character string set has been calculated, choose second candidate character strings from the first candidate character string set. For example, a preset number of second candidate character strings can be chosen from the first candidate character string set in descending order of importance degree Score. In the embodiment this preset number can be set to 20, in which case step 24 chooses 20 second candidate character strings from the first candidate character string set; if the set contains fewer than 20 candidate character strings, all of them can be chosen directly as second candidate character strings. Alternatively, an importance degree threshold can be set, and only the candidate character strings whose importance degree Score exceeds that threshold are chosen from the first candidate character string set as second candidate character strings. In the embodiment, following steps 21~24, the second candidate character strings finally chosen are "Peugeot 207", "automobile", "Dongfeng" and so on;
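The two selection options in step 24 can be sketched as follows. The function name `select_second_candidates` is hypothetical, the default of 20 comes from the text, and combining the count limit with the optional threshold in one function is an assumption made for brevity.

```python
def select_second_candidates(scores, top_n=20, min_score=None):
    """Pick second candidate strings from a mapping of string -> Score.

    Strings are ranked by Score in descending order. If min_score is given,
    only strings whose Score is strictly greater are kept. At most top_n
    strings are returned; if fewer exist, all of them are returned.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if min_score is not None:
        ranked = [(s, v) for s, v in ranked if v > min_score]
    return [s for s, _ in ranked[:top_n]]
```

Slicing with `ranked[:top_n]` covers the case described in the text where the first candidate set holds fewer than the preset number of strings: the slice simply returns everything available.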
Correspondingly, an embodiment of the invention also provides a clustering device, so that the cluster labels determined with the device reflect the categories of the documents to be clustered well and achieve good readability. Specifically, the structure of the device is shown in Fig. 4 and comprises the following functional units:
First selection unit 41, configured to choose a first candidate character string set from the documents to be clustered according to a preset selection strategy;
Second selection unit 42, configured to choose, for each word string in the first candidate character string set chosen by the first selection unit 41, second candidate character strings from the first candidate character string set according to parameters related to that word string, where the parameters are at least one of: the total number of times the word string appears in all the documents to be clustered, the total number of times the word string appears in specified documents, the number of characters the word string contains, and the number of documents to be clustered that contain the word string;
Label determining unit 43, configured to determine the second candidate character strings chosen by the second selection unit 42 as the cluster labels for clustering the documents to be clustered;
Classification unit 44, configured to assign each document to be clustered to the cluster corresponding to the cluster label determined by the label determining unit 43.
Preferably, in one implementation of the functions of the first selection unit 41, the first selection unit 41 may be further divided into the following functional modules:
A first selection module, configured to select, from the strings contained in the documents to be clustered, the strings whose number of characters equals a preset first character-number threshold;
A second selection module, configured to select, from the strings selected by the first selection module, a first candidate string set that satisfies preset rules, where the preset rules are any one of the following rules or any combination of them:
Rule one: for each string in the first candidate string set, the number of documents to be clustered that contain the string is not less than a preset first threshold;
Rule two: for each string in the first candidate string set, the number of distinct strings that, in the documents to be clustered, are adjacent to the string, located before it, and contain a number of characters equal to a preset second character-number threshold is not less than a preset second threshold;
Rule three: for each string in the first candidate string set, the number of distinct strings that, in the documents to be clustered, are adjacent to the string, located after it, and contain a number of characters equal to the preset second character-number threshold is not less than a preset third threshold;
Rule four: for each string in the first candidate string set, the value obtained by dividing the total number of times the string appears in all documents to be clustered by the total number of times each character contained in the string appears in all documents to be clustered is not less than a preset fourth threshold.
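As a non-limiting illustration, the two selection modules and rules one to four might be sketched in Python as follows. Several points are assumptions of this sketch rather than statements of the disclosure: the function and parameter names, the default threshold values, applying all four rules together (the text allows any one rule or any combination), and reading rule four as a ratio that must hold for every character of the string.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Set


def first_candidates(
    docs: List[str],
    n: int = 2,              # first character-number threshold (string length)
    neighbor_len: int = 1,   # second character-number threshold (neighbor length)
    min_df: int = 2,         # first threshold (rule one)
    min_left: int = 2,       # second threshold (rule two)
    min_right: int = 2,      # third threshold (rule three)
    min_ratio: float = 0.0,  # fourth threshold (rule four)
) -> Set[str]:
    tf: Counter = Counter()       # total occurrences of each n-character string
    char_tf: Counter = Counter()  # total occurrences of each single character
    df: Counter = Counter()       # number of documents containing each string
    left: Dict[str, Set[str]] = defaultdict(set)   # distinct preceding neighbors
    right: Dict[str, Set[str]] = defaultdict(set)  # distinct following neighbors

    for doc in docs:
        char_tf.update(doc)
        seen = set()
        # First selection module: every string of exactly n characters.
        for i in range(len(doc) - n + 1):
            s = doc[i:i + n]
            tf[s] += 1
            seen.add(s)
            if i >= neighbor_len:
                left[s].add(doc[i - neighbor_len:i])
            if i + n + neighbor_len <= len(doc):
                right[s].add(doc[i + n:i + n + neighbor_len])
        df.update(seen)  # each string counted once per document

    kept = set()
    for s in tf:
        rule1 = df[s] >= min_df
        rule2 = len(left[s]) >= min_left
        rule3 = len(right[s]) >= min_right
        # Assumed reading of rule four: tf(string) / tf(character) must reach
        # the threshold for each character the string contains.
        rule4 = all(tf[s] / char_tf[c] >= min_ratio for c in s)
        if rule1 and rule2 and rule3 and rule4:
            kept.add(s)
    return kept


print(first_candidates(["abcab", "xabyab", "zabq"]))
```

Rules two and three favor strings that occur in varied contexts, which is one common way to tell a self-contained phrase from a fragment of a longer one.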
Where the first selection unit 41 is divided into the above modules, the device provided by the embodiment of the present invention may further comprise:
An importance Score determination unit, configured to determine, for each of the cluster labels determined by the label determination unit 43, the importance Score of that cluster label; and a label ranking unit, configured to arrange the determined cluster labels in descending order of the importance Scores determined by the importance Score determination unit.
Preferably, in one implementation of the functions of the second selection unit 42, the second selection unit 42 may be further divided into the following functional modules:
A calculation module, configured to calculate, for each string in the first candidate string set, the importance Score of the string using the following formula:
where word.tf is the total number of times the string appears in all documents to be clustered, word.normtf is the total number of times the string appears in the specified document, word.df is the number of documents to be clustered that contain the string, and word.length is the number of characters the string contains;
A selection module, configured to select, once the calculation module has calculated the importance Score of each string in the first candidate string set, second candidate strings from the first candidate string set according to the importance Scores. Here, strings may be selected in descending order of importance Score until a preset number of second candidate strings has been chosen; alternatively, an importance threshold may be set so that only candidate strings in the first candidate string set whose importance Score is greater than the threshold are selected as second candidate strings.
Preferably, the classification unit 44 may use a multi-pattern matching method to assign each document to be clustered to the cluster corresponding to a cluster label determined by the label determination unit 43.
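As a non-limiting illustration, the assignment of documents to clusters might be sketched in Python as follows. The text does not name a particular multi-pattern matching algorithm, so this sketch substitutes a naive substring scan of each document against each label; a dedicated matcher such as Aho-Corasick would produce the same assignments in a single pass per document. The function name and the use of list indices as document identifiers are assumptions of the sketch.

```python
from typing import Dict, List


def assign_to_clusters(docs: List[str], labels: List[str]) -> Dict[str, List[int]]:
    """Map each cluster label to the indices of the documents that contain it."""
    clusters: Dict[str, List[int]] = {label: [] for label in labels}
    for doc_id, text in enumerate(docs):
        for label in labels:
            # A document joins every cluster whose label occurs in it, so the
            # same document may appear in several clusters.
            if label in text:
                clusters[label].append(doc_id)
    return clusters


print(assign_to_clusters(["red car", "blue car", "red bike"], ["red", "car"]))
```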
In addition, because the same document may be included in the different clusters corresponding to different cluster labels, the cluster labels may be ranked by how representative they are of the document categories, so that a needed document can be found easily in the cluster corresponding to a given cluster label. Specifically, the device provided by the embodiment of the present invention may further comprise:
A count determination unit 45, configured to determine, for each of the cluster labels determined by the label determination unit 43, the total number of times the cluster label appears in all documents to be clustered;
A label ranking unit 46, configured to arrange the cluster labels determined by the label determination unit 43 in descending order of the totals determined by the count determination unit 45.
Alternatively, the cluster labels may be ranked by how frequently they occur in the documents to be clustered. Specifically, the device provided by the embodiment of the present invention may further comprise:
A document-count determination unit, configured to determine, for each of the cluster labels determined by the label determination unit, the number of documents to be clustered that contain the cluster label;
A label ranking unit, configured to arrange the determined cluster labels in descending order of the document counts determined by the document-count determination unit.
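As a non-limiting illustration, ranking cluster labels by total occurrences and by document count might be sketched in Python as follows. The function names are assumptions of the sketch, and `str.count` counts non-overlapping occurrences, which is a simplification the text does not specify.

```python
from typing import List


def rank_by_total_occurrences(labels: List[str], docs: List[str]) -> List[str]:
    """Order labels by how many times each appears across all documents."""
    totals = {label: sum(doc.count(label) for doc in docs) for label in labels}
    # sorted() is stable, so labels with equal totals keep their input order.
    return sorted(labels, key=lambda label: -totals[label])


def rank_by_document_count(labels: List[str], docs: List[str]) -> List[str]:
    """Order labels by how many documents contain each of them."""
    counts = {label: sum(1 for doc in docs if label in doc) for label in labels}
    return sorted(labels, key=lambda label: -counts[label])


sample_docs = ["a a a a", "b", "b"]
print(rank_by_total_occurrences(["a", "b"], sample_docs))
print(rank_by_document_count(["a", "b"], sample_docs))
```

The two orderings can disagree, as in the sample above: a label repeated many times in one document ranks first by total occurrences but last by document count.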
It should be noted that, when the device provided by the embodiment of the present invention is applied to clustering the search results returned by a search engine, the device may further comprise another label ranking unit, configured to arrange the cluster labels determined by the label determination unit 43 in descending order of the frequency with which each cluster label is used as a query term submitted to the search engine, so that a user of the search engine can easily find the needed search results by means of the cluster labels.
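As a non-limiting illustration, ranking cluster labels by query frequency might be sketched in Python as follows. The `query_counts` mapping stands in for whatever query log the search engine keeps; its form, the function name, and treating a label absent from the log as having frequency zero are all assumptions of the sketch.

```python
from typing import Dict, List


def rank_by_query_frequency(labels: List[str], query_counts: Dict[str, int]) -> List[str]:
    """Order labels by how often each was used as a search-engine query term."""
    # Labels never seen as a query are treated as frequency 0 (an assumption).
    return sorted(labels, key=lambda label: -query_counts.get(label, 0))


print(rank_by_query_frequency(["x", "y", "z"], {"y": 10, "x": 3}))
```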
Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from its spirit and scope. Thus, if these changes and modifications fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to cover them.
Claims (10)
1. A clustering method, characterized by comprising:
selecting a first candidate string set from the documents to be clustered according to a preset selection policy;
for each string in the first candidate string set, selecting second candidate strings from the first candidate string set according to parameters related to the string, wherein the parameters related to the string comprise at least one of: the total number of times the string appears in all of the documents to be clustered, the total number of times the string appears in a specified document, the number of characters the string contains, and the number of the documents to be clustered that contain the string;
determining the second candidate strings as cluster labels for clustering the documents to be clustered, and assigning each of the documents to be clustered to the cluster corresponding to a cluster label.
2. the method for claim 1 is characterized in that, at each word string in described first candidate character string set, according to the parameter relevant with this word string, chooses second candidate character string and specifically comprise from described first candidate character string set:
At each word string in described first candidate character string set, appear at total degree, this word string in described all documents to be clustered according to this word string and appear at the document number that comprises this each word string in total degree in the specified documents, character number that this word string comprises and the described document to be clustered, adopt following formula to calculate the importance degree Score of this word string:
Wherein, word.tf is the total degree in this word string appears at described each document to be clustered, word.normtf appears at total degree in the described specified documents for this word string, word.df is the document number described to be clustered that comprises this word string, the character number that word.length comprises for this word string;
In calculating described first candidate character string set, behind the importance degree Score of each word string,, from described first candidate character string set, choose second candidate character string according to described importance degree Score.
3. The method of claim 2, characterized by further comprising:
arranging the determined cluster labels in descending order of their importance Scores.
4. the method for claim 1 is characterized in that, according to the default strategy of choosing, chooses the set of first candidate character string and specifically comprise from each document to be clustered:
From the word string that each document comprised to be clustered, choose character number and the default consistent word string of the first character number threshold value that word string comprises;
From the described word string of choosing, choose first candidate character string set that meets preset rules, described preset rules be in the following rule any one or be the combination in any of following rule:
At each word string in described first candidate character string set, the number that comprises the document described to be clustered of this word string is not less than presetting first threshold;
At each word string in described first candidate character string set, in described each document to be clustered, adjacent with this word string, be positioned at before this word string and the number of the number of characters that comprises and the different word strings of the default second character number threshold value unanimity is not less than the second default threshold value;
At each word string in described first candidate character string set, in described each document to be clustered, adjacent with this word string, be positioned at after this word string and the number of the number of characters that comprises and the different word strings of the default second character number threshold value unanimity is not less than the 3rd default threshold value;
At each word string in the set of described first candidate character string, this word string appears at the numerical value that each character that the total degree in described all documents to be clustered comprises divided by this word string appears at the total degree gained in described all documents to be clustered and is not less than the 4th default threshold value.
5. The method of claim 1, 2 or 4, characterized in that a multi-pattern matching method is used to assign each of the documents to be clustered to the cluster corresponding to a cluster label.
6. The method of claim 1, 2 or 4, characterized by further comprising:
for each of the determined cluster labels, determining the total number of times the cluster label appears in all of the documents to be clustered, and arranging the determined cluster labels in descending order of the determined totals; or
for each of the determined cluster labels, determining the number of the documents to be clustered that contain the cluster label, and arranging the determined cluster labels in descending order of the determined document counts; or
arranging the determined cluster labels in descending order of the frequency with which each determined cluster label is used as a query term submitted to a search engine, wherein the documents to be clustered are search results returned by the search engine.
7. A clustering device, characterized by comprising:
a first selection unit, configured to select a first candidate string set from the documents to be clustered according to a preset selection policy;
a second selection unit, configured to select, for each string in the first candidate string set selected by the first selection unit, second candidate strings from the first candidate string set according to parameters related to the string, wherein the parameters related to the string comprise at least one of: the total number of times the string appears in all of the documents to be clustered, the total number of times the string appears in a specified document, the number of characters the string contains, and the number of the documents to be clustered that contain the string;
a label determination unit, configured to determine the second candidate strings selected by the second selection unit as cluster labels for clustering the documents to be clustered;
a classification unit, configured to assign each of the documents to be clustered to the cluster corresponding to a cluster label determined by the label determination unit.
8. The device of claim 7, characterized in that the second selection unit specifically comprises:
a calculation module, configured to calculate, for each string in the first candidate string set, the importance Score of the string, according to the total number of times the string appears in all of the documents to be clustered, the total number of times the string appears in the specified document, the number of characters the string contains, and the number of the documents to be clustered that contain the string, using the following formula:
wherein word.tf is the total number of times the string appears in all of the documents to be clustered, word.normtf is the total number of times the string appears in the specified document, word.df is the number of the documents to be clustered that contain the string, and word.length is the number of characters the string contains;
a selection module, configured to select, once the calculation module has calculated the importance Score of each string in the first candidate string set, second candidate strings from the first candidate string set according to the importance Scores.
9. The device of claim 7, characterized in that the first selection unit specifically comprises:
a first selection module, configured to select, from the strings contained in the documents to be clustered, the strings whose number of characters equals a preset first character-number threshold;
a second selection module, configured to select, from the strings selected by the first selection module, a first candidate string set that satisfies preset rules, wherein the preset rules are any one of the following rules or any combination of them:
for each string in the first candidate string set, the number of the documents to be clustered that contain the string is not less than a preset first threshold;
for each string in the first candidate string set, the number of distinct strings that, in the documents to be clustered, are adjacent to the string, located before it, and contain a number of characters equal to a preset second character-number threshold is not less than a preset second threshold;
for each string in the first candidate string set, the number of distinct strings that, in the documents to be clustered, are adjacent to the string, located after it, and contain a number of characters equal to the preset second character-number threshold is not less than a preset third threshold;
for each string in the first candidate string set, the value obtained by dividing the total number of times the string appears in all of the documents to be clustered by the total number of times each character contained in the string appears in all of the documents to be clustered is not less than a preset fourth threshold.
10. The device of any one of claims 7 to 9, characterized by further comprising:
a count determination unit, configured to determine, for each of the cluster labels determined by the label determination unit, the total number of times the cluster label appears in all of the documents to be clustered;
a label ranking unit, configured to arrange the determined cluster labels in descending order of the totals determined by the count determination unit; or
further comprising: a document-count determination unit, configured to determine, for each of the cluster labels determined by the label determination unit, the number of the documents to be clustered that contain the cluster label;
a label ranking unit, configured to arrange the determined cluster labels in descending order of the document counts determined by the document-count determination unit; or
further comprising: a label ranking unit, configured to arrange the determined cluster labels in descending order of the frequency with which each cluster label determined by the label determination unit is used as a query term submitted to a search engine, wherein the documents to be clustered are search results returned by the search engine.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2009100891768A CN101989281B (en) | 2009-08-03 | 2009-08-03 | Clustering method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN101989281A (en) | 2011-03-23 |
CN101989281B (en) | 2012-06-27 |
Family
ID=43745818
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2009100891768A Expired - Fee Related CN101989281B (en) | 2009-08-03 | 2009-08-03 | Clustering method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN101989281B (en) |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1158460A (en) * | 1996-12-31 | 1997-09-03 | 复旦大学 | Multiple languages automatic classifying and searching method |
CN101458708B (en) * | 2008-12-05 | 2012-07-04 | 北京大学 | Searching result clustering method and device |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102760142A (en) * | 2011-04-29 | 2012-10-31 | 北京百度网讯科技有限公司 | Method and device for extracting subject label in search result aiming at searching query |
CN103207896A (en) * | 2013-03-14 | 2013-07-17 | 无锡清华信息科学与技术国家实验室物联网技术中心 | Method and system for stable and efficient self-adaptive clustering |
CN103207896B (en) * | 2013-03-14 | 2017-02-01 | 无锡清华信息科学与技术国家实验室物联网技术中心 | Method and system for stable and efficient self-adaptive clustering |
CN106033444A (en) * | 2015-03-16 | 2016-10-19 | 北京国双科技有限公司 | Method and device for clustering text content |
CN106033444B (en) * | 2015-03-16 | 2019-12-10 | 北京国双科技有限公司 | Text content clustering method and device |
Also Published As
Publication number | Publication date |
---|---|
CN101989281B (en) | 2012-06-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN101246499B (en) | Network information search method and system | |
US7424421B2 (en) | Word collection method and system for use in word-breaking | |
Yu et al. | Improving pseudo-relevance feedback in web information retrieval using web page segmentation | |
CN108829658B (en) | Method and device for discovering new words | |
CN101251837B (en) | Display handling method and system of electronic file list | |
CN101908071B (en) | Method and device thereof for improving search efficiency of search engine | |
US20080086453A1 (en) | Method and apparatus for correlating the results of a computer network text search with relevant multimedia files | |
CN100419755C (en) | Systems and methods for document data analysis | |
CN101609459A (en) | A kind of extraction system of affective characteristic words | |
CN101609450A (en) | Web page classification method based on training set | |
CN110232126B (en) | Hot spot mining method, server and computer readable storage medium | |
EP2425353A1 (en) | Method and apparatus for identifying synonyms and using synonyms to search | |
CN102169496A (en) | Anchor text analysis-based automatic domain term generating method | |
CN102144229A (en) | System for extracting term from document containing text segment | |
CN101071422A (en) | Musicfile search processing system and method | |
CN102236654A (en) | Web useless link filtering method based on content relevancy | |
CN103186556A (en) | Method for obtaining and searching structural semantic knowledge and corresponding device | |
CN104915422A (en) | Webpage collecting method and device based on browser | |
CN101853298B (en) | Event-oriented query expansion method | |
CN110276079A (en) | A kind of dictionary method for building up, information retrieval method and corresponding system | |
WO2007113585A1 (en) | Methods and systems of indexing and retrieving documents | |
KR100913733B1 (en) | Method for Providing Search Result Using Template | |
CN101989281B (en) | Clustering method and device | |
CN102722526B (en) | Part-of-speech classification statistics-based duplicate webpage and approximate webpage identification method | |
KR101908073B1 (en) | Sentence completion type search system and method that recommends words of high interest as search words |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20120627 Termination date: 20210803 |