CN101989281B - Clustering method and device - Google Patents


Info

Publication number
CN101989281B
Authority
CN
China
Prior art keywords
word
string
clustered
word string
document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN2009100891768A
Other languages
Chinese (zh)
Other versions
CN101989281A (en)
Inventor
孙宏伟
胡珉
罗治国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd
Priority to CN2009100891768A
Publication of CN101989281A
Application granted
Publication of CN101989281B
Expired - Fee Related
Anticipated expiration


Abstract

The invention discloses a clustering method that overcomes the defect that retrieval result clustering in the prior art has difficulty generating cluster labels with good readability. The method comprises the following steps: selecting a first candidate string set from the documents to be clustered according to a preset selection policy; for each string in the first candidate string set, selecting a second candidate string from the first candidate string set according to string-related parameters, wherein the string-related parameters comprise at least one of the total number of times the string appears in all documents to be clustered, the total number of times the string appears in designated documents, the number of characters the string contains, and the number of documents to be clustered that contain the string; and determining the second candidate string as a cluster label for clustering the documents to be clustered, and classifying the documents to be clustered into the cluster corresponding to the cluster label. The invention also discloses a clustering device.

Description

Clustering method and device
Technical field
The present invention relates to the field of information retrieval, and in particular to a clustering method and device.
Background art
Retrieval result clustering is the process of gathering similar results, among those returned by a search engine, into clusters. A cluster is a set of mutually similar retrieval results: results in the same cluster resemble one another, while results in different clusters generally differ. Retrieval result clustering helps users make better use of a search engine; for example, it can help a user locate the needed information more quickly, or obtain more comprehensive information.
In the prior art, retrieval result clustering methods fall into two types: document-based (Documents-Based) methods and label-based (Label-Based) methods. A document-based method first groups the documents into several categories with a conventional document clustering method, and then extracts a suitable cluster label from each category to annotate it. Document-based methods often fail to generate cluster labels with good readability, and the resulting labels are poorly distinguished from one another, so users have difficulty finding retrieval results that meet their needs among them; these methods were therefore used mainly in early work on retrieval result clustering. A label-based method first extracts representative words from the documents, then evaluates and screens the extracted words, and uses the words that survive as cluster labels for the different categories of documents, so that the documents can subsequently be classified on the basis of these labels. In such methods the selection of cluster labels is critical, yet the label selection approaches provided in the prior art likewise have difficulty producing cluster labels with good readability.
As can be seen from the above, the retrieval result clustering methods used in the prior art all have difficulty generating cluster labels with good readability, which makes it hard for users to find, by means of the cluster labels, retrieval results that meet their needs.
Summary of the invention
The embodiments of the present invention provide a clustering method and device, in order to overcome the defect that retrieval result clustering methods of the prior art have difficulty generating cluster labels with good readability.
To this end, the embodiments of the present invention adopt the following technical solutions:
A clustering method comprises: selecting a first candidate string set from the documents to be clustered according to a preset selection policy; for each word string in the first candidate string set, selecting a second candidate string from the first candidate string set according to parameters related to that word string, the parameters comprising at least one of the total number of times the word string appears in all documents to be clustered, the total number of times the word string appears in designated documents, the number of characters the word string contains, and the number of documents to be clustered that contain the word string; and determining the second candidate string as a cluster label for clustering the documents to be clustered, and classifying each document to be clustered into the cluster corresponding to the cluster label.
Preferably, selecting the second candidate string from the first candidate string set according to the parameters related to each word string specifically comprises: for each word string in the first candidate string set, calculating the importance Score of the word string from the total number of times the word string appears in all documents to be clustered, the total number of times it appears in the designated documents, the number of characters it contains, and the number of documents to be clustered that contain it, using the following formula:
$$\text{Score} = \frac{\text{word.tf}}{\text{word.normtf}} \times \text{word.df} \times \log(\text{word.length})$$
where word.tf is the total number of times the word string appears in the documents to be clustered, word.normtf is the total number of times the word string appears in the designated documents, word.df is the number of documents to be clustered that contain the word string, and word.length is the number of characters the word string contains;
once the importance Score of each word string in the first candidate string set has been calculated, the second candidate string is selected from the first candidate string set according to the importance Score.
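The importance formula can be illustrated with a minimal Python sketch. Several points are assumptions rather than part of the source: the function names `importance_score` and `pick_labels` are hypothetical, the logarithm is taken as the natural logarithm because the text does not state a base, the statistics are made-up illustration values, and keeping the top k word strings is only one possible reading of "selected according to the importance Score".

```python
import math

def importance_score(tf, normtf, df, length):
    """Score = tf / normtf * df * log(length), following the formula in the text."""
    if normtf <= 0:
        raise ValueError("normtf must be positive")
    if length < 1:
        raise ValueError("length must be at least 1")
    # Natural logarithm is an assumption; the source does not give the base.
    return tf / normtf * df * math.log(length)

def pick_labels(stats, k):
    """Return the k word strings with the highest Score (top-k is an assumption)."""
    ranked = sorted(stats, key=lambda w: importance_score(**stats[w]), reverse=True)
    return ranked[:k]

# Hypothetical statistics for three word strings (illustration only).
stats = {
    "alpha beta":     {"tf": 12, "normtf": 300,   "df": 8,  "length": 2},
    "gamma":          {"tf": 40, "normtf": 90000, "df": 15, "length": 1},
    "delta eps zeta": {"tf": 6,  "normtf": 50,    "df": 5,  "length": 3},
}
print(pick_labels(stats, 2))
```

Note that under this formula a one-character word string always scores zero, because log(1) = 0, whatever its frequency.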
Preferably, the method further comprises: arranging the determined cluster labels in descending order of their importance Score.
Preferably, selecting the first candidate string set from the documents to be clustered according to the preset selection policy specifically comprises: selecting, from the word strings contained in the documents to be clustered, the word strings whose number of characters matches a preset first character-count threshold; and selecting, from the selected word strings, a first candidate string set that satisfies preset rules, the preset rules being any one or any combination of the following rules: for each word string in the first candidate string set, the number of documents to be clustered that contain the word string is not less than a preset first threshold; for each word string in the first candidate string set, the number of distinct word strings in the documents to be clustered that are adjacent to and located before the word string, and whose number of characters matches a preset second character-count threshold, is not less than a preset second threshold; for each word string in the first candidate string set, the number of distinct word strings in the documents to be clustered that are adjacent to and located after the word string, and whose number of characters matches the preset second character-count threshold, is not less than a preset third threshold; for each word string in the first candidate string set, the total number of times the word string appears in all documents to be clustered, divided by the total number of times the characters it contains appear in all documents to be clustered, is not less than a preset fourth threshold.
Preferably, a multi-pattern matching method is used to classify each document to be clustered into the cluster corresponding to the cluster label.
Preferably, the method further comprises:
for each of the determined cluster labels, determining the total number of times the cluster label appears in all documents to be clustered, and arranging the determined cluster labels in descending order of these totals; or, for each of the determined cluster labels, determining the number of documents to be clustered that contain the cluster label, and arranging the determined cluster labels in descending order of these document counts; or arranging the determined cluster labels in descending order of the frequency with which each is used as a query word in a search engine, wherein the documents to be clustered are search results found by the search engine.
A clustering device comprises: a first selection unit, configured to select a first candidate string set from the documents to be clustered according to a preset selection policy; a second selection unit, configured to select, for each word string in the first candidate string set selected by the first selection unit, a second candidate string from the first candidate string set according to parameters related to that word string, the parameters comprising at least one of the total number of times the word string appears in all documents to be clustered, the total number of times the word string appears in designated documents, the number of characters the word string contains, and the number of documents to be clustered that contain the word string; a label determination unit, configured to determine the second candidate string selected by the second selection unit as a cluster label for clustering the documents to be clustered; and a classification unit, configured to classify each document to be clustered into the cluster corresponding to the cluster label determined by the label determination unit.
The clustering scheme provided by the embodiments of the present invention first selects a first candidate string set from the word strings contained in the documents to be clustered, according to a preset selection policy. Then, for each word string in the first candidate string set, it selects a second candidate string from the set according to parameters that reflect, among other things, how frequently the word string appears in all documents to be clustered and how representative it is of a document category. These parameters comprise at least one of the total number of times the word string appears in all documents to be clustered, the total number of times it appears in designated documents, the number of characters it contains, and the number of documents to be clustered that contain it. The selected second candidate string is determined as a cluster label for clustering the documents to be clustered, and each document to be clustered is classified into the cluster corresponding to the cluster label, so that both the determination of cluster labels and the clustering of documents are achieved. Because label selection takes into account, for each word string in the first candidate string set, parameters that reflect its frequency in the documents to be clustered and its representativeness of a document category, the generated cluster labels fully reflect the categories of the documents to be clustered and therefore have good readability.
Description of drawings
Fig. 1 is a flowchart of a clustering method provided by an embodiment of the present invention;
Fig. 2 is a flowchart of applying the clustering method provided by an embodiment of the present invention to the clustering of retrieval results;
Fig. 3 is a schematic diagram of two retrieval results obtained by a search engine in an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a clustering device provided by an embodiment of the present invention.
Embodiment
The embodiments of the present invention provide a clustering scheme in which cluster labels for clustering the documents to be clustered are selected from the word strings contained in those documents, according to a preset selection policy and to parameters that reflect how frequently a word string appears in all documents to be clustered and how representative it is of a document category. The generated cluster labels thus fully reflect the categories of the documents to be clustered and achieve good readability.
The main implementation principle, the specific embodiments, and the beneficial effects of the technical solution of the embodiments are set out in detail below with reference to the drawings.
An embodiment of the present invention first provides a clustering method, whose flowchart is shown in Fig. 1 and which comprises the following steps:
Step 11: select a first candidate string set from the documents to be clustered according to a preset selection policy;
Step 12: for each word string in the first candidate string set, select a second candidate string from the first candidate string set according to parameters related to the word string, where the parameters are at least one of the total number of times the word string appears in all documents to be clustered, the total number of times it appears in designated documents, the number of characters it contains, and the number of documents to be clustered that contain it. It should be noted that, taking the clustering of search results obtained from an Internet search engine as an example, the "designated documents" may be arbitrary web pages on the Internet. Typically they may be 100,000, 200,000, or some other (generally large) number of arbitrary web pages; in that case, the more often a word string appears in these designated web pages, the more likely it is to be a word string in common use on web pages;
Step 13: determine the selected second candidate string as a cluster label for clustering the documents to be clustered, and classify each document to be clustered into the cluster corresponding to the cluster label.
For step 11, selecting the first candidate string set from the documents to be clustered according to the preset selection policy can be implemented as follows:
First, from the word strings contained in the documents to be clustered, select the word strings whose number of characters matches the preset first character-count threshold. For Chinese, if one Chinese character is regarded as one "character" in the sense of this embodiment, the first character-count threshold may be set to 2 to 6, in keeping with Chinese usage; for English, if one word is regarded as one "character" in the sense of this embodiment, the first character-count threshold may be set to 1 to 4;
Then, from the word strings selected above, select a first candidate string set that satisfies preset rules, where the preset rules may be, but are not limited to, any one or any combination of the following four rules:
Rule one: for each word string in the first candidate string set, the number of documents to be clustered that contain the word string is not less than a preset first threshold. Rule one thus selects, from the word strings chosen above, a first candidate string set whose word strings occur frequently and are strongly representative of a document category;
Rule two: for each word string in the first candidate string set, the number of distinct word strings in the documents to be clustered that are adjacent to and located before the word string, and whose number of characters matches the preset second character-count threshold, is not less than a preset second threshold. If such a word string, adjacent to and located before the word string in question, is called an adjacent preceding word string, then the effect of rule two is to select, as the first candidate string set, word strings that are only weakly correlated with their adjacent preceding word strings;
Rule three: for each word string in the first candidate string set, the number of distinct word strings in the documents to be clustered that are adjacent to and located after the word string, and whose number of characters matches the preset second character-count threshold, is not less than a preset third threshold. If such a word string, adjacent to and located after the word string in question, is called an adjacent following word string, then the effect of rule three is to select, as the first candidate string set, word strings that are only weakly correlated with their adjacent following word strings;
Rule four: for each word string in the first candidate string set, the total number of times the word string appears in all documents to be clustered, divided by the total number of times the characters it contains appear in all documents to be clustered, is not less than a preset fourth threshold. The effect of rule four is to select a first candidate string set whose word strings contain few characters and are strongly representative of a document category.
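The selection of the first candidate string set, including the four rules above, can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: the function name `first_candidate_set` and its parameter names are hypothetical, each document is assumed to be already split into a list of "characters" (tokens), the adjacent preceding and following word strings are assumed to be one character long (a second character-count threshold of 1), and all four rules are applied together although the text allows any combination.

```python
from collections import Counter, defaultdict

def first_candidate_set(docs, min_len=2, max_len=6,
                        min_df=3, min_left=5, min_right=5, min_ratio=0.1):
    """Return word strings (tuples of tokens) that pass rules one to four."""
    char_tf = Counter(tok for doc in docs for tok in doc)  # per-character counts
    tf = Counter()              # total occurrences of each word string
    df = Counter()              # number of documents containing each word string
    left = defaultdict(set)     # distinct characters seen just before it
    right = defaultdict(set)    # distinct characters seen just after it
    for doc in docs:
        seen = set()
        for n in range(min_len, max_len + 1):
            for i in range(len(doc) - n + 1):
                s = tuple(doc[i:i + n])
                tf[s] += 1
                seen.add(s)
                if i > 0:
                    left[s].add(doc[i - 1])
                if i + n < len(doc):
                    right[s].add(doc[i + n])
        df.update(seen)
    kept = []
    for s in tf:
        ratio = tf[s] / sum(char_tf[tok] for tok in s)      # rule four
        if (df[s] >= min_df                                  # rule one
                and len(left[s]) >= min_left                 # rule two
                and len(right[s]) >= min_right               # rule three
                and ratio >= min_ratio):
            kept.append(s)
    return kept

# Tiny made-up example with relaxed thresholds.
docs = [list("xaby"), list("zabw"), list("qabr")]
print(first_candidate_set(docs, min_len=2, max_len=2,
                          min_df=3, min_left=3, min_right=3, min_ratio=0.1))
```

In the toy data only the word string ("a", "b") appears in all three documents with three distinct neighbours on each side, so it is the only one kept; every other two-character string fails rule one.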
Preferably, in step 13 of this embodiment, a multi-pattern matching method may be used, without limitation, to classify each document to be clustered into the cluster corresponding to the cluster label. Specifically, for each document to be clustered, the document is first scanned to determine the cluster labels it contains, and the document is then classified into the cluster corresponding to each of those cluster labels.
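The classification step can be sketched as below. The function name `assign_to_clusters` and the sample data are hypothetical. The sketch checks each label with a plain substring test for brevity; a genuine multi-pattern matcher (for example an Aho-Corasick automaton) would find all labels in a single scan of each document, and that is a swapped-in simplification here, not something the source specifies.

```python
def assign_to_clusters(docs, labels):
    """Map each cluster label to the indices of the documents that contain it."""
    clusters = {label: [] for label in labels}
    for doc_id, text in enumerate(docs):
        # Determine which labels this document contains, then file the
        # document under every one of them (a document may join several clusters).
        for label in labels:
            if label in text:
                clusters[label].append(doc_id)
    return clusters

# Made-up documents and labels for illustration.
docs = ["dongfeng honda crv price", "honda civic review", "weather today"]
labels = ["honda", "crv"]
print(assign_to_clusters(docs, labels))
```

Because a document is filed under every label it contains, the same document can belong to several clusters, and a document containing no label belongs to none.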
In addition, because the same document may be included in the different clusters corresponding to different cluster labels, the cluster labels can be ordered by how representative they are of a document category, so that the needed document can be found easily in the cluster corresponding to a given cluster label. Specifically, the embodiment may further arrange the determined cluster labels by the following steps:
First, for each of the determined cluster labels, determine the total number of times the cluster label appears in all documents to be clustered;
Because the more often a cluster label appears in the documents, the more strongly it represents a document category, the determined cluster labels can be arranged in descending order of these totals.
Alternatively, the embodiment may further arrange the determined cluster labels by the following steps:
First, for each of the determined cluster labels, determine the importance Score of the cluster label;
Because a larger importance Score indicates that a cluster label occurs more frequently and represents a document category more strongly, the determined cluster labels can be arranged in descending order of their importance Scores.
Alternatively, the embodiment may further arrange the determined cluster labels by the following steps:
First, for each of the determined cluster labels, determine the number of documents to be clustered that contain the cluster label;
Because the more documents to be clustered contain a cluster label, the more frequently that label occurs, the determined cluster labels can be arranged in descending order of these document counts.
Because the scheme provided by the embodiments can be applied to clustering the search results found by a search engine, the determined cluster labels may be arranged in any of the ways described above, or they may be arranged in descending order of the frequency with which each cluster label is used as a query word in the search engine, so that a user of the search engine can easily find the needed retrieval results by means of the cluster labels.
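The four orderings described above differ only in the quantity used as the sort key, which the following sketch makes explicit. The function name `order_labels`, the key names, and the numbers are hypothetical; in particular, how the query-word frequency would be obtained from a search engine is outside the sketch.

```python
ORDERINGS = {"tf", "score", "df", "query_freq"}

def order_labels(label_stats, criterion):
    """Return the labels in descending order of the chosen statistic."""
    if criterion not in ORDERINGS:
        raise ValueError(f"unknown criterion: {criterion}")
    return sorted(label_stats, key=lambda label: label_stats[label][criterion],
                  reverse=True)

# Made-up statistics for three cluster labels.
label_stats = {
    "label a": {"tf": 30, "score": 0.8, "df": 12, "query_freq": 500},
    "label b": {"tf": 45, "score": 0.3, "df": 20, "query_freq": 90},
    "label c": {"tf": 10, "score": 1.5, "df": 9,  "query_freq": 1200},
}
for criterion in ("tf", "score", "df", "query_freq"):
    print(criterion, order_labels(label_stats, criterion))
```

The same labels can come out in a different order under each criterion, which is why the choice of ordering affects which clusters a user sees first.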
The implementation flow of the scheme is described in detail below, taking a practical application of the scheme provided by the embodiments as an example:
Fig. 2 is a flowchart of applying the scheme provided by the embodiments to clustering the retrieval results found by a search engine. This specific embodiment explains the scheme using retrieval results that are Chinese web pages; applying the scheme to the clustering of web pages in English or other languages likewise falls within the scope of protection of the present invention. Specifically, the flow shown in Fig. 2 comprises the following steps:
Step 21: select candidate strings from the retrieval results to be clustered, where the number of characters in each selected candidate string matches the preset first character-count threshold. In this embodiment, a retrieval result to be clustered may be a web page found by the search engine, or the abstract and/or title corresponding to that web page, and the number of retrieval results to be clustered can be set as required; in this embodiment it may be set to 200. Because a Chinese word generally contains at least two characters (here, two Chinese characters), if the first character-count threshold is set to 2 to 6 in step 21, every word string of 2 to 6 characters in the abstract or title is selected as a candidate string. For the two retrieval results shown in Fig. 3, the following candidate strings can be selected from the title of the first retrieval result, "east wind honda automobile CRV". Here the English word "honda", written in lowercase letters, is treated as one character, while the English word "CRV", written in capital letters, is counted as one character per letter, although its three characters cannot be separated:
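[Figure G2009100891768D00091: list of the candidate strings selected from the title of the first retrieval result; image not reproduced]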
Step 22: remove "noise candidate strings" from the selected candidate strings. The candidate strings selected in step 21 usually include a large number of noise candidate strings, that is, candidate strings that are not themselves meaningful phrases, such as "wind honda vapour" and "honda vapour" above (these glosses render the Chinese title character by character; "vapour" and "car" are the two characters that together mean "automobile"). Because noise candidate strings carry no meaning of their own, they are unsuitable as cluster labels and need to be filtered out, which can be done as follows. First, if the frequency with which a candidate string occurs in the retrieval results to be clustered is less than a threshold f1, the candidate string is determined to be a noise candidate string and is filtered out; in this embodiment f1 may be set to 3. Second, if the number of distinct Chinese characters that are adjacent to a candidate string and located after it is less than a threshold f2, the candidate string is determined to be a noise candidate string and is filtered out; in this embodiment f2 may be set to 5. Under this filter, for example, the Chinese character that follows the candidate string "east wind honda vapour" in the retrieval results to be clustered is usually only "car", so "east wind honda vapour" is likely to be filtered out; similarly, "wind honda" and "wind honda vapour" may be filtered out for the same reason. By contrast, the second retrieval result shown in Fig. 3 contains the text "east wind is beautiful ...", in which the character following "east wind" is the one rendered here as "mark"; the character following the candidate string "east wind" may therefore be "mark" as well as "honda", so "east wind" is not necessarily a noise candidate string. Of course, two distinct following characters, "honda" and "mark", are not enough to guarantee that "east wind" is kept; the number of distinct characters following "east wind" must reach 5 before it is certain that "east wind" will not be filtered out. Third, if the number of distinct Chinese characters that are adjacent to a candidate string and located before it is less than a threshold f3, the candidate string is determined to be a noise candidate string and is filtered out; in this embodiment f3 may be set to 5. Under this filter, "wind honda automobile" and "wind honda vapour", for example, are very likely to be filtered out, because the Chinese character preceding these candidate strings is usually only "east". Finally, if the total number of times a candidate string appears in all retrieval results to be clustered, divided by the sum of the total numbers of times each Chinese character in the candidate string appears in all retrieval results to be clustered, is less than a threshold f4, the candidate string is determined to be a noise candidate string and is filtered out; in this embodiment f4 may be set to 0.1. For ease of description, the set of candidate strings that remain after noise filtering is referred to below as the first candidate string set;
Step 23; Calculate the importance degree Score of each candidate character string in the set of first candidate character string; Because the quantity of the candidate character string that comprises in first candidate character string set that obtains after above two steps 21 of process, 21 are handled is still a lot; In order further from the set of first candidate character string, to select readable better cluster label, the computing formula (1) below therefore needing further to adopt is calculated the importance degree Score of each candidate character string in the set of first candidate character string respectively:
$$\text{Score} = \frac{\text{word.tf}}{\text{word.normtf}} \times \text{word.df} \times \log(\text{word.length}) \qquad (1)$$
Here, word denotes any candidate character string in the first candidate character string set; word.tf is the total number of times this candidate character string appears in all retrieval results to be clustered; word.normtf is the total number of times it appears in specified retrieval results; word.df is the number of retrieval results to be clustered that contain it; and word.length is the number of characters it contains. The specified retrieval results can be a large number of arbitrary generic web pages, so word.normtf reflects how frequently the candidate character string occurs in such pages. The meaning of each term in formula (1) is as follows:
word.tf / word.normtf: this term reflects that the importance degree Score of a candidate character string in the first candidate character string set is directly proportional to the frequency with which it appears in the retrieval results to be clustered, and inversely proportional to the frequency with which it appears in the general language environment represented by a large number of arbitrary generic web pages. It measures how representative the candidate character string is of a retrieval result category;
word.df: this term reflects that the importance degree Score of a candidate character string in the first candidate character string set is directly proportional to the number of retrieval results to be clustered in which it appears. Its physical meaning is how widely the candidate character string occurs across the retrieval results to be clustered. If it appears in very few of them, it is not representative of a retrieval result category and is not suitable as a cluster label;
log(word.length): this term reflects that the number of characters in a candidate character string should be a suitable value. The word.tf value of a candidate character string with more characters is generally smaller than that of one with fewer characters, so formula (1) takes the character number word.length into account. Because using word.length directly would influence the importance degree Score too strongly, formula (1) takes its logarithm to reduce this influence;
In the embodiment of the present invention, for the two retrieval results shown in Fig. 3, the word strings in the first candidate character string set obtained after steps 21 and 22, together with the parameters corresponding to each word string, are shown in Table 1 below. Substituting the value of each parameter in Table 1 into formula (1) gives the importance degree Score of each candidate character string in the first candidate character string set:
Table 1:
Word string in the first candidate character string set    word.tf    word.normtf    word.df    word.length    Score
Automobile                                                  95         3500           52         2              0.42
East wind                                                   82         4300           46         2              0.26
Beautiful                                                   43         2000           32         2              0.20
Beautiful 207                                               40         1300           32         4              0.59
One                                                         92         10500          78         2              0.2
Constantly                                                  88         8300           67         2              0.21
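As an illustration only, formula (1) can be evaluated on the rows of Table 1 with the short Python sketch below. The function name `importance_score` is hypothetical, and the base of the logarithm is an assumption: the text does not state it, but a base-10 logarithm gives values that agree with the Score column of Table 1 to within 0.01.

```python
import math

def importance_score(tf, normtf, df, length):
    """Formula (1): (tf / normtf) * df * log(length).

    The base-10 logarithm is an assumption; the text does not name the base.
    """
    return tf / normtf * df * math.log10(length)

# Rows of Table 1: (word string, word.tf, word.normtf, word.df, word.length, tabulated Score)
TABLE_1 = [
    ("Automobile", 95, 3500, 52, 2, 0.42),
    ("East wind", 82, 4300, 46, 2, 0.26),
    ("Beautiful", 43, 2000, 32, 2, 0.20),
    ("Beautiful 207", 40, 1300, 32, 4, 0.59),
    ("One", 92, 10500, 78, 2, 0.2),
    ("Constantly", 88, 8300, 67, 2, 0.21),
]

# Compute the Score of every row and print it next to the tabulated value.
computed = {
    name: importance_score(tf, normtf, df, length)
    for name, tf, normtf, df, length, _ in TABLE_1
}
for name, *_, tabulated in TABLE_1:
    print(f"{name:15s} computed={computed[name]:.4f} table={tabulated}")
```

Under this assumption "Beautiful 207" receives the highest Score and the common words "One" and "Constantly" receive low Scores, because their large word.normtf values shrink the first term.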
It should be noted that any formula may be used as long as it embodies the correspondence between the importance degree of each word string and the magnitudes of the above parameters (from the meanings of the parameters, the importance degree decreases as word.df decreases, and also decreases as word.normtf increases). For example, the embodiment of the present invention may also, but is not limited to, use the following formula (2) to calculate the importance degree Score1 of each word string:
$$\text{Score1} = \frac{\text{word.tf}}{\text{word.normtf}} + \text{word.df} + \log(\text{word.length}) \qquad (2)$$
Alternatively, the following formulas (3) to (5) may be used to calculate the importance degrees Score2 to Score4 of each word string, respectively:
$$\text{Score2} = \frac{\text{word.tf}}{\text{word.normtf}} \qquad (3)$$
$$\text{Score3} = \text{word.df} \qquad (4)$$
$$\text{Score4} = \log(\text{word.length}) \qquad (5)$$
The above Score1 to Score4 can likewise serve as parameters for measuring the importance degree of each candidate character string in the first candidate character string set.
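The alternative measures of formulas (2) to (5) can be sketched in the same way. The function names below are hypothetical and the base-10 logarithm is again an assumption. The two sample rows are taken from Table 1 and suggest why a single term on its own may rank strings differently from the combined measures.

```python
import math

def score1(tf, normtf, df, length):
    # Formula (2): additive combination of the three terms.
    return tf / normtf + df + math.log10(length)

def score2(tf, normtf):
    # Formula (3): frequency in the results relative to generic pages.
    return tf / normtf

def score3(df):
    # Formula (4): number of results to be clustered that contain the string.
    return df

def score4(length):
    # Formula (5): logarithm of the number of characters.
    return math.log10(length)

# Two rows of Table 1: (word.tf, word.normtf, word.df, word.length)
automobile = (95, 3500, 52, 2)
one = (92, 10500, 78, 2)

print(score2(*automobile[:2]), score2(*one[:2]))  # "Automobile" ranks higher
print(score3(automobile[2]), score3(one[2]))      # "One" ranks higher
```

With these values Score2 favours "Automobile", whereas Score3 alone favours the common word "One", since Score3 ignores how often a string occurs in generic pages.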
Step 24: after the importance degree Score of each word string in the first candidate character string set has been calculated, choose second candidate character strings from the first candidate character string set. For example, candidate character strings may be chosen from the first candidate character string set in descending order of importance degree Score until a preset number of second candidate character strings has been chosen. In the embodiment of the present invention this preset number may be set to 20, so that 20 second candidate character strings are chosen in step 24; if the first candidate character string set contains fewer than 20 candidate character strings, all of them may be chosen directly as second candidate character strings. Alternatively, an importance degree threshold may be set, and only candidate character strings whose importance degree Score is greater than the threshold are chosen from the first candidate character string set as second candidate character strings. In the embodiment of the present invention, after steps 21 to 24, the second candidate character strings finally chosen are "beautiful 207", "automobile", "east wind", and so on;
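A minimal sketch of the two selection options of step 24 follows. The function name `choose_second_candidates` and its parameters are hypothetical. The text presents the preset number and the importance degree threshold as alternatives; allowing both in one call is a convenience of this sketch, not something the text prescribes.

```python
def choose_second_candidates(scores, preset_number=20, threshold=None):
    """Choose second candidate strings from a dict {string: Score}.

    preset_number: keep at most this many strings, highest Score first
                   (None keeps all of them).
    threshold:     if given, keep only strings whose Score is greater than it.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if threshold is not None:
        ranked = [(s, v) for s, v in ranked if v > threshold]
    return [s for s, _ in ranked[:preset_number]]

# Scores as tabulated in Table 1.
scores = {
    "automobile": 0.42, "east wind": 0.26, "beautiful": 0.20,
    "beautiful 207": 0.59, "one": 0.2, "constantly": 0.21,
}

print(choose_second_candidates(scores, preset_number=3))
print(choose_second_candidates(scores, preset_number=None, threshold=0.25))
```

Both calls keep "beautiful 207", "automobile" and "east wind", in that order. When the set holds fewer strings than the preset number, the slice simply returns all of them, which matches the fallback described in step 24.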
Step 25: determine the second candidate character strings as the cluster labels used to cluster the retrieval results to be clustered;
Step 26: using a multi-pattern string matching method, classify each retrieval result to be clustered into the cluster corresponding to each cluster label. For example, retrieval results containing "beautiful 207" are all placed in the cluster whose cluster label is "beautiful 207", retrieval results containing "automobile" are all placed in the cluster whose cluster label is "automobile", and retrieval results containing "east wind" are all placed in the cluster whose cluster label is "east wind";
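The classification of step 26 can be sketched as follows. The sketch uses a plain substring test for every label, which is a simple stand-in and not a true multi-pattern matching algorithm (an automaton-based method such as Aho-Corasick would scan each text once for all labels). The function name and the sample texts are hypothetical.

```python
from collections import defaultdict

def assign_to_clusters(documents, labels):
    """Map each cluster label to the ids of the documents that contain it.

    documents: dict {document id: text}
    labels:    iterable of cluster label strings
    """
    clusters = defaultdict(list)
    for doc_id, text in documents.items():
        for label in labels:
            if label in text:  # naive substring test, one label at a time
                clusters[label].append(doc_id)
    return dict(clusters)

documents = {
    1: "east wind beautiful 207 automobile review",
    2: "east wind honda automobile price",
    3: "weather report",
}
labels = ["beautiful 207", "automobile", "east wind"]
clusters = assign_to_clusters(documents, labels)
print(clusters)
```

Document 1 lands in all three clusters and document 3 in none, so one retrieval result may belong to several clusters at once.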
Step 27: arrange the cluster labels in descending order of the frequency with which each cluster label is used as a query word in the search engine (hereinafter, the frequency of use). For example, following reading habits, the cluster label with the highest frequency of use is placed at the far left (or top) of the page on which the cluster labels are arranged, and the cluster label with the lowest frequency of use is placed at the far right (or bottom).
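The ordering of step 27 might look like the sketch below. The query log is invented sample data, and the function name `order_by_query_frequency` is hypothetical; how a real search engine records query words is outside the scope of the text.

```python
from collections import Counter

def order_by_query_frequency(labels, query_log):
    """Arrange cluster labels by how often each was used as a query word."""
    frequency = Counter(query_log)
    # sorted() is stable, so labels with equal frequency keep their input order.
    return sorted(labels, key=lambda label: frequency[label], reverse=True)

query_log = ["automobile", "east wind", "automobile",
             "beautiful 207", "automobile", "east wind"]
labels = ["beautiful 207", "automobile", "east wind"]
ordered = order_by_query_frequency(labels, query_log)
print(ordered)
```

The first element of the returned list would be placed at the far left (or top) of the page, and the last element at the far right (or bottom).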
Correspondingly, the embodiment of the present invention also provides a clustering device, so that the cluster labels determined by the device fully reflect the categories of the documents to be clustered and have good readability. Specifically, the structure of the device is shown in Fig. 4 and comprises the following functional units:
First choosing unit 41, configured to choose a first candidate character string set from the documents to be clustered according to a preset choosing strategy;
Second choosing unit 42, configured to, for each word string in the first candidate character string set chosen by the first choosing unit 41, choose second candidate character strings from the first candidate character string set according to parameters related to the word string, wherein the parameters related to the word string are at least one of: the total number of times the word string appears in all documents to be clustered, the total number of times the word string appears in specified documents, the number of characters the word string contains, and the number of documents to be clustered that contain the word string;
Label determining unit 43, configured to determine the second candidate character strings chosen by the second choosing unit 42 as the cluster labels used to cluster the documents to be clustered;
Classifying unit 44, configured to classify each document to be clustered into the cluster corresponding to a cluster label determined by the label determining unit 43.
Preferably, as one implementation of the functions of the first choosing unit 41, the first choosing unit 41 may be further divided into the following functional modules:
First choosing module, configured to choose, from the word strings contained in the documents to be clustered, the word strings whose number of characters equals a preset first character number threshold;
Second choosing module, configured to choose, from the word strings chosen by the first choosing module, a first candidate character string set that satisfies preset rules, wherein the preset rules are any one of the following rules or any combination of them:
Rule one: for each word string in the first candidate character string set, the number of documents to be clustered that contain the word string is not less than a preset first threshold;
Rule two: for each word string in the first candidate character string set, the number of different word strings that, in the documents to be clustered, are adjacent to and located before the word string and whose number of characters equals a preset second character number threshold is not less than a preset second threshold;
Rule three: for each word string in the first candidate character string set, the number of different word strings that, in the documents to be clustered, are adjacent to and located after the word string and whose number of characters equals the preset second character number threshold is not less than a preset third threshold;
Rule four: for each word string in the first candidate character string set, the value obtained by dividing the total number of times the word string appears in all documents to be clustered by the total number of times the characters contained in the word string appear in all documents to be clustered is not less than a preset fourth threshold.
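The two choosing modules and rules one to four can be sketched together as below. This is an illustration under stated assumptions, not the patented implementation: the function name and parameter names are hypothetical, all four rules are applied at once although the text allows any single rule or combination, the default of 2 for the first threshold is a placeholder, and the defaults of 5 and 0.1 follow the threshold values mentioned in the description. Latin letters stand in for Chinese characters in the sample data.

```python
from collections import Counter, defaultdict

def first_candidate_set(docs, n=2, m=1, min_df=2,
                        min_left=5, min_right=5, min_ratio=0.1):
    """Return the n-character strings that pass rules one to four.

    n: first character number threshold (length of the candidate strings)
    m: second character number threshold (length of the neighbouring strings)
    """
    tf = Counter()            # total occurrences of each n-character string
    df = Counter()            # number of documents containing each string
    char_tf = Counter()       # total occurrences of each single character
    left = defaultdict(set)   # distinct m-character strings directly before
    right = defaultdict(set)  # distinct m-character strings directly after
    for text in docs:
        char_tf.update(text)
        seen = set()
        for i in range(len(text) - n + 1):
            s = text[i:i + n]
            tf[s] += 1
            seen.add(s)
            if i >= m:
                left[s].add(text[i - m:i])
            if i + n + m <= len(text):
                right[s].add(text[i + n:i + n + m])
        df.update(seen)
    kept = set()
    for s in tf:
        # Rule four: occurrences of the string divided by the summed
        # occurrences of its characters.
        ratio = tf[s] / sum(char_tf[c] for c in s)
        if (df[s] >= min_df                     # rule one
                and len(left[s]) >= min_left    # rule two
                and len(right[s]) >= min_right  # rule three
                and ratio >= min_ratio):        # rule four
            kept.add(s)
    return kept

docs = ["xabz", "yabw", "qabz"]
candidates = first_candidate_set(docs, min_left=2, min_right=2)
print(candidates)
```

In this toy corpus "ab" is preceded by three different characters and followed by two, so it survives, whereas "bz" is always preceded by "b"'s neighbour "a" only and is filtered out, much like a fragment that can only ever follow one particular character.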
For the case in which the first choosing unit 41 is divided into the above modules, the device provided by the embodiment of the present invention may further comprise:
Importance degree Score determining unit, configured to determine, for each cluster label determined by the label determining unit 43, the importance degree Score of the cluster label; and a label arranging unit, configured to arrange the determined cluster labels in descending order of the importance degree Scores determined by the importance degree Score determining unit.
Preferably, as one implementation of the functions of the second choosing unit 42, the second choosing unit 42 may be further divided into the following functional modules:
Computing module, configured to calculate, for each word string in the first candidate character string set, the importance degree Score of the word string using the following formula:
$$\text{Score} = \frac{\text{word.tf}}{\text{word.normtf}} \times \text{word.df} \times \log(\text{word.length})$$
where word.tf is the total number of times the word string appears in the documents to be clustered, word.normtf is the total number of times the word string appears in the specified documents, word.df is the number of documents to be clustered that contain the word string, and word.length is the number of characters the word string contains;
Choosing module, configured to choose second candidate character strings from the first candidate character string set according to the importance degree Score after the computing module has calculated the importance degree Score of each word string in the first candidate character string set. Here, candidate character strings may be chosen from the first candidate character string set in descending order of importance degree Score until a preset number of second candidate character strings has been chosen; alternatively, an importance degree threshold may be set, and only candidate character strings in the first candidate character string set whose importance degree Score is greater than the threshold are chosen as second candidate character strings.
Preferably, the classifying unit 44 may use a multi-pattern matching method to classify each document to be clustered into the cluster corresponding to a cluster label determined by the label determining unit 43.
In addition, because the same document may be included in different clusters corresponding to different cluster labels, the cluster labels may be sorted according to how representative they are of a document category, so that the required document can be found easily in the cluster corresponding to a given cluster label. Specifically, the device provided by the embodiment of the present invention may further comprise:
Count determining unit 45, configured to determine, for each cluster label determined by the label determining unit 43, the total number of times the cluster label appears in all documents to be clustered;
Label arranging unit 46, configured to arrange the cluster labels determined by the label determining unit 43 in descending order of the total numbers determined by the count determining unit 45.
In addition, the cluster labels may also be sorted according to how frequently they occur in the documents to be clustered. Specifically, the device provided by the embodiment of the present invention may further comprise:
Document number determining unit, configured to determine, for each cluster label determined by the label determining unit, the number of documents to be clustered that contain the cluster label;
Label arranging unit, configured to arrange the determined cluster labels in descending order of the document numbers determined by the document number determining unit.
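The two count-based orderings described above can be contrasted with a small sketch. The function name `order_labels` and its `by` parameter are hypothetical, and the documents are invented sample data. Note that `str.count` counts non-overlapping occurrences, which is an implementation detail of this sketch.

```python
def order_labels(labels, docs, by="total"):
    """Arrange cluster labels in descending order of a count.

    by="total": total number of occurrences in all documents to be clustered
    by="docs":  number of documents to be clustered that contain the label
    """
    def total(label):
        return sum(text.count(label) for text in docs)

    def doc_count(label):
        return sum(1 for text in docs if label in text)

    key = total if by == "total" else doc_count
    return sorted(labels, key=key, reverse=True)

docs = [
    "automobile automobile automobile",
    "east wind automobile",
    "east wind",
    "east wind",
]
labels = ["automobile", "east wind"]
print(order_labels(labels, docs, by="total"))
print(order_labels(labels, docs, by="docs"))
```

Here "automobile" occurs four times but in only two documents, while "east wind" occurs three times in three documents, so the two orderings differ.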
It should be noted that when the device provided by the embodiment of the present invention is applied to clustering search results retrieved by a search engine, the device may further comprise another label arranging unit, configured to arrange the cluster labels determined by the label determining unit 43 in descending order of the frequency with which each is used as a query word in the search engine, so that users of the search engine can easily find the retrieval results they need according to the cluster labels.
Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from its spirit and scope. Thus, if these changes and modifications fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to include them.

Claims (8)

1. A clustering method, characterized by comprising:
choosing a first candidate character string set from the documents to be clustered according to a preset choosing strategy;
for each word string in said first candidate character string set, choosing second candidate character strings from said first candidate character string set according to parameters related to the word string, said parameters related to the word string being at least one of: the total number of times the word string appears in all said documents to be clustered, the total number of times the word string appears in specified documents, the number of characters the word string contains, and the number of said documents to be clustered that contain the word string;
determining said second candidate character strings as cluster labels used to cluster said documents to be clustered, and classifying each said document to be clustered into the cluster corresponding to a said cluster label;
wherein choosing a first candidate character string set from the documents to be clustered according to a preset choosing strategy specifically comprises:
choosing, from the word strings contained in the documents to be clustered, the word strings whose number of characters equals a preset first character number threshold;
choosing, from said chosen word strings, a first candidate character string set that satisfies preset rules, said preset rules being any one of the following rules or any combination of them:
for each word string in said first candidate character string set, the number of said documents to be clustered that contain the word string is not less than a preset first threshold;
for each word string in said first candidate character string set, the number of different word strings that, in said documents to be clustered, are adjacent to and located before the word string and whose number of characters equals a preset second character number threshold is not less than a preset second threshold;
for each word string in said first candidate character string set, the number of different word strings that, in said documents to be clustered, are adjacent to and located after the word string and whose number of characters equals the preset second character number threshold is not less than a preset third threshold;
for each word string in said first candidate character string set, the value obtained by dividing the total number of times the word string appears in all said documents to be clustered by the total number of times the characters contained in the word string appear in all said documents to be clustered is not less than a preset fourth threshold.
2. the method for claim 1 is characterized in that, to each word string in said first candidate character string set, according to the parameter relevant with this word string, from said first candidate character string set, chooses second candidate character string and specifically comprises:
To each word string in said first candidate character string set; Appear at total degree, this word string in said all documents to be clustered according to this word string and appear at the document number that comprises this each word string in total degree in the specified documents, character number that this word string comprises and the said document to be clustered, adopt following formula to calculate the importance degree Score of this word string:
Score = word . tf word . normtf * word . df * log ( word . length )
Wherein, Word.tf is the total degree in this word string appears at said each document to be clustered; Word.normtf appears at the total degree in the said specified documents for this word string; Word.df is the document number said to be clustered that comprises this word string, the character number that word.length comprises for this word string;
In calculating said first candidate character string set, behind the importance degree Score of each word string,, from said first candidate character string set, choose second candidate character string according to said importance degree Score.
3. The method of claim 2, characterized by further comprising:
arranging said determined cluster labels in descending order of their importance degree Scores.
4. The method of claim 1 or 2, characterized in that a multi-pattern matching method is used to classify each said document to be clustered into the cluster corresponding to a said cluster label.
5. The method of claim 1 or 2, characterized by further comprising:
for each cluster label among said determined cluster labels, determining the total number of times the cluster label appears in all said documents to be clustered, and arranging said determined cluster labels in descending order of said determined total numbers; or
for each cluster label among said determined cluster labels, determining the number of said documents to be clustered that contain the cluster label, and arranging said determined cluster labels in descending order of said determined document numbers; or
arranging said determined cluster labels in descending order of the frequency with which each is used as a query word in a search engine, wherein said documents to be clustered are search results retrieved by the search engine.
6. A clustering device, characterized by comprising:
a first choosing unit, configured to choose a first candidate character string set from the documents to be clustered according to a preset choosing strategy;
a second choosing unit, configured to, for each word string in the first candidate character string set chosen by the first choosing unit, choose second candidate character strings from said first candidate character string set according to parameters related to the word string, said parameters related to the word string being at least one of: the total number of times the word string appears in all said documents to be clustered, the total number of times the word string appears in specified documents, the number of characters the word string contains, and the number of said documents to be clustered that contain the word string;
a label determining unit, configured to determine the second candidate character strings chosen by the second choosing unit as the cluster labels used to cluster said documents to be clustered;
a classifying unit, configured to classify each said document to be clustered into the cluster corresponding to a cluster label determined by said label determining unit;
wherein said first choosing unit specifically comprises:
a first choosing module, configured to choose, from the word strings contained in said documents to be clustered, the word strings whose number of characters equals a preset first character number threshold;
a second choosing module, configured to choose, from the word strings chosen by the first choosing module, a first candidate character string set that satisfies preset rules, said preset rules being any one of the following rules or any combination of them:
for each word string in said first candidate character string set, the number of said documents to be clustered that contain the word string is not less than a preset first threshold;
for each word string in said first candidate character string set, the number of different word strings that, in said documents to be clustered, are adjacent to and located before the word string and whose number of characters equals a preset second character number threshold is not less than a preset second threshold;
for each word string in said first candidate character string set, the number of different word strings that, in said documents to be clustered, are adjacent to and located after the word string and whose number of characters equals the preset second character number threshold is not less than a preset third threshold;
for each word string in said first candidate character string set, the value obtained by dividing the total number of times the word string appears in all said documents to be clustered by the total number of times the characters contained in the word string appear in all said documents to be clustered is not less than a preset fourth threshold.
7. The device of claim 6, characterized in that said second choosing unit specifically comprises:
a computing module, configured to, for each word string in said first candidate character string set, calculate the importance degree Score of the word string using the following formula, according to the total number of times the word string appears in all said documents to be clustered, the total number of times the word string appears in specified documents, the number of characters the word string contains, and the number of said documents to be clustered that contain the word string:
$$\text{Score} = \frac{\text{word.tf}}{\text{word.normtf}} \times \text{word.df} \times \log(\text{word.length})$$
where word.tf is the total number of times the word string appears in said documents to be clustered, word.normtf is the total number of times the word string appears in said specified documents, word.df is the number of said documents to be clustered that contain the word string, and word.length is the number of characters the word string contains;
a choosing module, configured to choose second candidate character strings from said first candidate character string set according to said importance degree Score after the computing module has calculated the importance degree Score of each word string in said first candidate character string set.
8. The device of claim 6 or 7, characterized by further comprising:
a count determining unit, configured to determine, for each cluster label determined by the label determining unit, the total number of times the cluster label appears in all said documents to be clustered;
a label arranging unit, configured to arrange said determined cluster labels in descending order of the total numbers determined by the count determining unit; or
further comprising: a document number determining unit, configured to determine, for each cluster label determined by the label determining unit, the number of said documents to be clustered that contain the cluster label;
a label arranging unit, configured to arrange said determined cluster labels in descending order of the document numbers determined by the document number determining unit; or
further comprising: a label arranging unit, configured to arrange said determined cluster labels in descending order of the frequency with which each cluster label determined by the label determining unit is used as a query word in a search engine, wherein said documents to be clustered are search results retrieved by the search engine.
CN2009100891768A 2009-08-03 2009-08-03 Clustering method and device Expired - Fee Related CN101989281B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2009100891768A CN101989281B (en) 2009-08-03 2009-08-03 Clustering method and device


Publications (2)

Publication Number Publication Date
CN101989281A CN101989281A (en) 2011-03-23
CN101989281B true CN101989281B (en) 2012-06-27

Family

ID=43745818



Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102760142A (en) * 2011-04-29 2012-10-31 北京百度网讯科技有限公司 Method and device for extracting subject label in search result aiming at searching query
CN103207896B (en) * 2013-03-14 2017-02-01 无锡清华信息科学与技术国家实验室物联网技术中心 Method and system for stable and efficient self-adaptive clustering
CN106033444B (en) * 2015-03-16 2019-12-10 北京国双科技有限公司 Text content clustering method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1158460A (en) * 1996-12-31 1997-09-03 复旦大学 Multiple languages automatic classifying and searching method
CN101458708A (en) * 2008-12-05 2009-06-17 北京大学 Searching result clustering method and device




Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20120627

Termination date: 20210803