CN102867006A - Method and system for batching and clustering - Google Patents

Method and system for batching and clustering

Info

Publication number
CN102867006A
Authority
CN
China
Prior art keywords
document
coherency
class
cluster
batches
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2011101895621A
Other languages
Chinese (zh)
Other versions
CN102867006B (en
Inventor
王新文
张姝
贾文杰
夏迎炬
孟遥
于浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujitsu Ltd
Priority to CN201110189562.1A
Publication of CN102867006A
Application granted
Publication of CN102867006B
Legal status: Expired - Fee Related

Abstract

An embodiment of the invention provides a batch clustering method and system. The method comprises: dividing the documents to be clustered into batches according to a predetermined policy; clustering the documents of each batch to obtain a clustering result for each batch; performing coherency processing on the clustering result of each batch to obtain a coherency result for each batch; and, for every batch other than the first, merging each class in that batch's coherency result with the classes in the coherency result of the previous batch, thereby obtaining the batch clustering result for the documents to be clustered. By batching and clustering the documents, applying coherency processing to each batch's clustering result, and merging the per-batch coherency results, the method improves clustering performance and realizes incremental clustering.

Description

Batch clustering method and system
Technical field
The present invention relates to clustering, and in particular to a batch clustering method and system.
Background
As networks develop, more and more duplicate information appears on them, and distinguishing similar web pages has become very important. Web pages are usually distinguished by clustering. Each of the general-purpose clustering algorithms now in use has its own bottlenecks and defects.
Traditional cluster analysis methods mainly comprise partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods. A representative partitioning algorithm is K-MEANS; a representative hierarchical algorithm is HAC (Hierarchical Agglomerative Clustering).
Traditional clustering methods have solved the problem of clustering low-dimensional data fairly successfully. Because real-world data is complex, however, existing algorithms often fail on many problems, particularly with high-dimensional data and large data sets. Traditional clustering methods run into two main problems on high-dimensional data sets: 1. a high-dimensional data set contains many irrelevant attributes, so the likelihood that clusters exist across all dimensions is almost zero; 2. data in a high-dimensional space is distributed more sparsely than in a low-dimensional space, and it is common for the distances between data points to be almost equal; since traditional clustering methods cluster on the basis of distance, they cannot build clusters from distances in a high-dimensional space.
Summary of the invention
The purpose of the embodiments of the invention is to provide a batch clustering method and system that improve clustering performance by dividing the documents to be clustered into batches, clustering each batch, applying coherency processing to each batch's clustering result, and merging the per-batch results.
According to one aspect of the embodiments of the invention, a batch clustering method is provided, the method comprising:
dividing the documents to be clustered into batches according to a predetermined policy;
clustering the documents of each batch to obtain a clustering result for each batch;
performing coherency processing on the clustering result of each batch to obtain a coherency result for each batch;
for each batch other than the first batch, merging each class in the coherency result of that batch with the classes in the coherency result of the previous batch, to obtain the batch clustering result for the documents to be clustered.
According to another aspect of the embodiments of the invention, a batch clustering system is also provided, the system comprising:
a batching unit, which divides the documents to be clustered into batches according to a predetermined policy;
a clustering unit, which clusters the documents of each batch to obtain a clustering result for each batch;
a first processing unit, which performs coherency processing on the clustering result of each batch to obtain a coherency result for each batch;
a merging unit, which, for each batch other than the first batch, merges each class in the coherency result of that batch with the classes in the coherency result of the previous batch, to obtain the batch clustering result for the documents to be clustered.
The beneficial effects of the embodiments of the invention are as follows. Applying coherency processing to the batch clustering results, and merging only after that processing, improves clustering performance. Merging (adding) subsequently processed documents into the existing coherency-processed clustering result realizes incremental clustering.
Specific embodiments of the invention are disclosed in detail with reference to the following description and drawings, which indicate the ways in which the principle of the invention may be employed. It should be understood that the scope of the embodiments is not thereby limited. Within the spirit and terms of the appended claims, the embodiments include many changes, modifications, and equivalents.
Features described and/or illustrated for one embodiment may be used in the same or a similar way in one or more other embodiments, combined with features of other embodiments, or substituted for features of other embodiments.
It should be emphasized that the term "comprises/comprising", as used herein, refers to the presence of a feature, integer, step, or component, but does not exclude the presence or addition of one or more other features, integers, steps, or components.
Description of drawings
The accompanying drawings are included to provide a further understanding of the invention. They form part of the specification, illustrate preferred embodiments of the invention, and together with the written description serve to explain its principle. The same reference numerals denote the same elements throughout.
In the drawings:
Fig. 1 is a flowchart of the batch clustering method of one embodiment of the invention;
Fig. 2 is a flowchart of the coherency processing applied to the clustering result of each batch in the embodiment shown in Fig. 1;
Fig. 3 is a flowchart of determining, in the embodiment shown in Fig. 2, whether each document in each class of the current batch is coherent with the class to which it belongs;
Fig. 4 is a flowchart of determining, in the embodiment shown in Fig. 2, whether a non-coherent document is coherent with the other classes of the current batch;
Fig. 5 is a flowchart of the batch clustering method of another embodiment of the invention;
Fig. 6 is a flowchart of the coherency processing applied, in the embodiment shown in Fig. 5, to each document other than those of the first batch in each class of the current batch after merging;
Fig. 7 is a flowchart of determining, in the embodiment shown in Fig. 6, whether each such document is coherent with the class to which it belongs;
Fig. 8 is a flowchart of determining, in the embodiment shown in Fig. 6, whether a non-coherent document is coherent with the other classes of the current batch;
Fig. 9 is a schematic diagram of the composition of the batch clustering system of an embodiment of the invention.
Detailed description
To make the purpose, technical solutions, and advantages of the embodiments of the invention clearer, the embodiments are described in further detail below with reference to specific embodiments and the drawings. The illustrative embodiments and their description serve to explain the invention and do not limit it.
Embodiment 1
Fig. 1 is a flowchart of a batch clustering method provided by an embodiment of the invention. Referring to Fig. 1, the method comprises:
Step 101: dividing the documents to be clustered into batches according to a predetermined policy;
The predetermined policy may be batching by proportion or some other policy; this embodiment places no restriction on it.
To distinguish web pages on the network, this embodiment takes into account a characteristic of search engine results, namely that higher-ranked results are more relevant. To improve the clustering of the returned results, the embodiment may first run a search with a given query keyword and use the results returned by the search engine as the documents to be clustered, dividing them into batches. With batching by proportion, the search results are divided according to fixed percentages and their rank order. For example, the top 40% of the search results form the first batch, and the remaining 60% are divided, in rank order, into four further batches: the second, third, fourth, and fifth. Because the higher-ranked results of a search engine tend to cohere well, the first batch is made relatively large and carries more weight; it may also be called the initial batch.
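The proportional batching example (40% in the first batch, the rest in four further batches) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `split_into_batches`, its parameters, and the choice to spread the remainder as evenly as possible are assumptions made here.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_batches(results: Sequence[T],
                       first_fraction: float = 0.4,
                       later_batches: int = 4) -> List[List[T]]:
    """Split ranked results into one large first batch plus later batches.

    `results` is assumed to be in search-engine rank order, best first.
    """
    if not 0.0 < first_fraction <= 1.0:
        raise ValueError("first_fraction must be in (0, 1]")
    if later_batches < 1:
        raise ValueError("later_batches must be at least 1")

    first_size = round(len(results) * first_fraction)
    batches = [list(results[:first_size])]

    # Spread the remaining results over the later batches, keeping rank order.
    rest = list(results[first_size:])
    base, extra = divmod(len(rest), later_batches)
    start = 0
    for i in range(later_batches):
        size = base + (1 if i < extra else 0)
        batches.append(rest[start:start + size])
        start += size
    return batches
```

For 100 ranked results and the default parameters, this yields batch sizes of 40, 15, 15, 15, and 15, with rank order preserved across batches.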
Search results returned directly by a search engine must also be processed before clustering. In this embodiment, the pre-clustering processing comprises web page preprocessing, feature vector extraction, and web page similarity calculation.
Web page preprocessing performs operations such as content extraction, valid URL extraction, and title extraction on each web page, and saves the pages to be clustered as XML files in a uniform format.
Feature vector extraction builds a set of feature vectors from the saved XML files and assigns a weight to each feature vector. The TF-IDF method or another method may be used to obtain the feature vectors.
Web page similarity may be calculated with classical formulas such as the Euclidean distance formula or the cosine distance formula.
The pre-clustering processing above consists of conventional steps whose details can be implemented by existing means and are not repeated here.
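As an illustration of the conventional pre-clustering steps named above, the following sketch computes TF-IDF weighted vectors for tokenized documents and compares them with cosine similarity and Euclidean distance. The source does not fix a particular TF-IDF variant, so the weighting used here (relative term frequency times the natural log of inverse document frequency), the sparse dictionary representation, and all function names are assumptions.

```python
import math
from collections import Counter
from typing import Dict, List

Vector = Dict[str, float]  # sparse vector: term -> weight


def tfidf_vectors(docs: List[List[str]]) -> List[Vector]:
    """Weight each term by term frequency times inverse document frequency."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two sparse vectors (0.0 if either is zero)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def euclidean_distance(a: Vector, b: Vector) -> float:
    """Straight-line distance between two sparse vectors."""
    terms = set(a) | set(b)
    return math.sqrt(sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in terms))
```

Note that cosine similarity grows as documents become more alike, while Euclidean distance shrinks, so any threshold applied later has to match the direction of the chosen measure.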
Step 102: clustering the documents of each batch to obtain a clustering result for each batch;
A conventional clustering method, such as the K-MEANS algorithm, a hierarchical agglomerative clustering algorithm, or a density-based clustering algorithm, may be used to cluster the documents of each batch and obtain the batch clustering results. This embodiment prefers the hierarchical agglomerative clustering algorithm.
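A small hierarchical agglomerative clustering sketch for one batch is given below. The source only names the algorithm family, so the average-linkage criterion, the `stop_distance` cut-off, and the use of a precomputed pairwise distance matrix are assumptions chosen for illustration.

```python
from typing import List, Optional, Tuple


def average_link(dist: List[List[float]], a: List[int], b: List[int]) -> float:
    """Mean pairwise distance between the members of two clusters."""
    return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))


def hac(dist: List[List[float]], stop_distance: float) -> List[List[int]]:
    """Merge the closest pair of clusters until none is within stop_distance.

    `dist` is a symmetric matrix of pairwise document distances; the result
    is a list of clusters, each a list of document indices.
    """
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 1:
        best: Optional[Tuple[float, int, int]] = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = average_link(dist, clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        if d > stop_distance:
            break  # the closest remaining pair is too far apart to merge
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters
```

This quadratic search over cluster pairs is fine for the small per-batch document counts of the example; a library implementation would be the practical choice for larger batches.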
Step 103: performing coherency processing on the clustering result of each batch to obtain a coherency result for each batch;
Coherency processing of the batch clustering results makes the classification of the documents more definite and improves clustering performance.
In one implementation of step 103, the coherency processing of the clustering result of each batch may be carried out by the method shown in Fig. 2. Referring to Fig. 2, the method comprises:
Step 201: generating a coherency threshold according to a predetermined rule from the mean similarity between all classes of the current batch;
The coherency threshold is the standard for judging the similarity between documents. The predetermined rule may be, for example, to multiply the mean similarity by a coefficient and add a smoothing value, the result serving as the coherency threshold for this batch. Other rules are also possible; this embodiment places no restriction on the rule.
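The example rule for step 201 (mean inter-class value times a coefficient, plus a smoothing value) can be sketched as below. How the value between two classes is obtained is not specified in the source; averaging the pairwise document values between the two classes is an assumption, as are the function and parameter names.

```python
from itertools import combinations
from typing import List


def coherency_threshold(classes: List[List[int]],
                        measure: List[List[float]],
                        coefficient: float = 1.0,
                        smoothing: float = 0.0) -> float:
    """Return coefficient * (mean inter-class value) + smoothing.

    `classes` holds non-empty lists of document indices and `measure` is the
    pairwise document similarity (or distance) matrix for the batch.
    """
    pair_values = []
    for a, b in combinations(classes, 2):
        total = sum(measure[i][j] for i in a for j in b)
        pair_values.append(total / (len(a) * len(b)))
    # With fewer than two classes there is no inter-class value to average.
    mean_value = sum(pair_values) / len(pair_values) if pair_values else 0.0
    return coefficient * mean_value + smoothing
```

For instance, with two classes whose mean inter-class value is 10, a coefficient of 0.5 and a smoothing value of 1.0 give a threshold of 6.0.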
Step 202: determining, according to the coherency threshold, whether each document in each class of the current batch is coherent with the class to which it belongs;
To determine whether a document is coherent with its class, the clustering result is analysed: count the document pairs, formed by the document and each other document in its class, whose similarity exceeds the coherency threshold, and compute the proportion of that count to the number of documents in the class. This proportion is called AgF. If AgF exceeds a certain threshold, the document is considered not coherent with the class (it does not cohere); otherwise it is considered coherent (it coheres).
In one implementation of step 202, whether each document is coherent with the class to which it belongs may be determined by the method shown in Fig. 3. Referring to Fig. 3, the method comprises:
Step 301: for each document in the current class, counting the document pairs, formed with the other documents in the current class, whose similarity exceeds the coherency threshold;
Step 302: computing the proportion of that count to the number of documents in the current class;
Step 303: if the proportion exceeds a certain threshold, determining that the document is not coherent with the current class; otherwise determining that the document is coherent with the current class.
Applying the method of Fig. 3 to each document in each class of the current batch determines whether the document is coherent with the class to which it belongs.
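Steps 301 to 303 can be sketched as follows. The sketch follows the source wording literally: it counts pairs whose value exceeds the coherency threshold and treats a high proportion (AgF) as non-coherent. That reading fits a distance-like measure such as Euclidean distance, where larger values mean less alike; this interpretation, the choice of the full class size as the denominator, and the names `agf`, `is_coherent`, and `agf_limit` are assumptions.

```python
from typing import List


def agf(doc: int, members: List[int],
        measure: List[List[float]], threshold: float) -> float:
    """Proportion of class documents whose pair value with `doc` exceeds threshold.

    Step 301 counts the exceeding pairs; step 302 divides the count by the
    number of documents in the class.
    """
    if not members:
        return 0.0
    exceeding = sum(1 for other in members
                    if other != doc and measure[doc][other] > threshold)
    return exceeding / len(members)


def is_coherent(doc: int, members: List[int], measure: List[List[float]],
                threshold: float, agf_limit: float) -> bool:
    """Step 303: the document is coherent unless its AgF exceeds agf_limit."""
    return agf(doc, members, measure, threshold) <= agf_limit
```

In a class of four documents where one document lies far from the other three, that outlier has an AgF of 0.75 while each of the others has an AgF of 0.25, so a limit of 0.5 marks only the outlier as non-coherent.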
Step 203: removing each non-coherent document from the class to which it belongs;
According to the result of step 202, a document that does not cohere with the other documents in its class is removed from that class.
Step 204: determining whether the non-coherent document is coherent with the other classes of the current batch;
The coherency of a removed document with each other class of the current batch is judged by a method similar to that of Fig. 3. The difference is that the count in step 301 is of the document pairs, formed by the non-coherent document and the documents in the other class of the current batch, whose similarity exceeds the coherency threshold.
In one implementation of step 204, whether the non-coherent document is coherent with the other classes of the current batch may be determined by the method shown in Fig. 4. Referring to Fig. 4, the method comprises:
Step 401: counting the document pairs, formed by the non-coherent document and every document in another class of the current batch, whose similarity exceeds the coherency threshold;
Step 402: computing the proportion of that count to the number of documents in the other class;
Step 403: if the proportion exceeds a certain threshold, determining that the non-coherent document is not coherent with the other class; otherwise determining that the non-coherent document is coherent with the other class.
The method of Fig. 4 determines whether a previously removed non-coherent document is coherent with the other classes of the current batch.
Step 205: if a class exists with which the non-coherent document coheres, adding the document to that class; otherwise making the document a class of its own in the current batch.
According to the result of step 204, if the non-coherent document coheres with some class, the document is merged into that class; if no such class exists, the document forms a class by itself.
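Steps 202 to 205 taken together can be sketched as a single pass over the classes of one batch. As in the source wording, a document is rejected when too large a proportion of its pair values exceeds the coherency threshold, which suits a distance-like measure; that reading is an assumption. Picking the first other class that accepts a rejected document, rather than the best one, is also an assumption, as are all names.

```python
from typing import List, Tuple


def exceed_ratio(doc: int, members: List[int],
                 measure: List[List[float]], threshold: float) -> float:
    """Share of class documents whose pair value with `doc` exceeds threshold."""
    if not members:
        return 0.0
    count = sum(1 for m in members if m != doc and measure[doc][m] > threshold)
    return count / len(members)


def coherency_process(classes: List[List[int]], measure: List[List[float]],
                      threshold: float, own_limit: float = 0.5,
                      other_limit: float = 0.5) -> List[List[int]]:
    kept: List[List[int]] = []
    rejected: List[Tuple[int, int]] = []  # (document, index of original class)

    # Steps 202-203: test each document against its own class and remove
    # the ones that do not cohere.
    for index, members in enumerate(classes):
        staying = []
        for doc in members:
            if exceed_ratio(doc, members, measure, threshold) > own_limit:
                rejected.append((doc, index))
            else:
                staying.append(doc)
        kept.append(staying)

    # Steps 204-205: try every other class; fall back to a singleton class.
    singletons: List[List[int]] = []
    for doc, origin in rejected:
        target = None
        for index, members in enumerate(kept):
            if index == origin or not members:
                continue
            if exceed_ratio(doc, members, measure, threshold) <= other_limit:
                target = index
                break
        if target is None:
            singletons.append([doc])
        else:
            kept[target].append(doc)
    return [c for c in kept if c] + singletons
```

A misplaced document is therefore either moved to a class it fits or isolated as its own class, which is what makes the per-batch classes more definite before merging.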
Step 104: for each batch other than the first batch, merging each class in the coherency result of that batch with the classes in the coherency result of the previous batch, to obtain the batch clustering result for the documents to be clustered.
After coherency processing of the batch clustering results, this embodiment merges the batches: each class of the current batch is merged with the classes in the batch clustering result obtained in the previous step. For example, each class of the second batch is merged with the classes of the first batch; each class of the third batch is then merged with the classes produced by merging the second and first batches; and so on, until the last batch has been merged. This yields the batch clustering result, and clustering ends.
The merging method may be the same as the clustering method described above.
Because the first batch has no batch clustering result from a previous step, this step applies to every batch other than the first.
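The incremental merge of step 104 can be sketched as a fold over the per-batch results. The source only says that merging may reuse the clustering method, so the rule used here (attach each class of the current batch to the nearest accumulated class by average pairwise distance if it is within `merge_distance`, otherwise keep it as a new class) is an assumption, as are the function names.

```python
from typing import List

Classes = List[List[int]]


def average_distance(measure: List[List[float]],
                     a: List[int], b: List[int]) -> float:
    """Mean pairwise distance between the members of two classes."""
    return sum(measure[i][j] for i in a for j in b) / (len(a) * len(b))


def merge_batch(accumulated: Classes, current: Classes,
                measure: List[List[float]], merge_distance: float) -> Classes:
    """Merge each class of the current batch into the accumulated classes."""
    merged = [list(c) for c in accumulated]
    existing = len(merged)  # only classes from earlier batches are candidates
    for cls in current:
        best_index, best_value = None, None
        for index in range(existing):
            value = average_distance(measure, merged[index], cls)
            if best_value is None or value < best_value:
                best_index, best_value = index, value
        if best_index is not None and best_value <= merge_distance:
            merged[best_index].extend(cls)
        else:
            merged.append(list(cls))
    return merged


def merge_all(batch_results: List[Classes], measure: List[List[float]],
              merge_distance: float) -> Classes:
    """Start from the first batch and merge every later batch in order."""
    if not batch_results:
        return []
    result = [list(c) for c in batch_results[0]]
    for current in batch_results[1:]:
        result = merge_batch(result, current, measure, merge_distance)
    return result
```

Because each later batch is folded into the existing result without reclustering earlier documents, new documents can be added in the same way after the initial run.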
By dividing the documents to be clustered into batches, clustering each batch, applying coherency processing to the batch clustering results, and merging the per-batch coherency results, the batch clustering method of this embodiment improves clustering performance and realizes incremental clustering.
Embodiment 2
Fig. 5 is a flowchart of a batch clustering method provided by another embodiment of the invention. This method builds on the batch clustering method shown in Fig. 1 by applying coherency processing again to the merged documents, which yields a more reliable batch clustering result. Referring to Fig. 5, the method comprises:
Step 501: dividing the documents to be clustered into batches according to a predetermined policy;
Step 502: clustering the documents of each batch to obtain a clustering result for each batch;
Step 503: performing coherency processing on the clustering result of each batch to obtain a coherency result for each batch;
Step 504: for each batch other than the first batch, merging each class in the coherency result of that batch with the classes in the coherency result of the previous batch;
Step 505: performing coherency processing on each document, other than the documents of the first batch, in each class of the current batch after merging, to obtain the batch clustering result for the documents to be clustered.
Steps 501 to 504 are the same as steps 101 to 104 of Embodiment 1; each step can be implemented as described for the method of Embodiment 1, and the repeated details are omitted.
Step 505 is similar to step 103 of Embodiment 1; it can be implemented with reference to the steps of Fig. 2 to Fig. 4 of Embodiment 1, and the repeated details are omitted.
In one implementation of step 505, the coherency processing of each document, other than the documents of the first batch, in each class of the current batch after merging may be carried out by the method shown in Fig. 6. Referring to Fig. 6, the method comprises:
Step 601: generating a coherency threshold according to a predetermined rule from the mean similarity between all classes of the current batch;
Step 602: determining, according to the coherency threshold, whether each such document is coherent with the class to which it belongs;
Step 603: removing each non-coherent document from the class to which it belongs;
Step 604: determining whether the non-coherent document is coherent with the other classes of the current batch;
Step 605: if a class exists with which the non-coherent document coheres, adding the document to that class; otherwise making the document a class of its own in the current batch.
In one implementation of step 602, whether each such document is coherent with the class to which it belongs may be determined, according to the coherency threshold, by the method shown in Fig. 7. Referring to Fig. 7, the method comprises:
Step 701: counting the document pairs, formed by the document and the other documents in the current class, whose similarity exceeds the coherency threshold;
Step 702: computing the proportion of that count to the number of documents in the current class;
Step 703: if the proportion exceeds a certain threshold, determining that the document is not coherent with the current class; otherwise determining that the document is coherent with the current class.
In one implementation of step 604, whether the non-coherent document is coherent with the other classes of the current batch may be determined by the method shown in Fig. 8. Referring to Fig. 8, the method comprises:
Step 801: counting the document pairs, formed by the non-coherent document and every document in another class of the current batch, whose similarity exceeds the coherency threshold;
Step 802: computing the proportion of that count to the number of documents in the other class;
Step 803: if the proportion exceeds a certain threshold, determining that the non-coherent document is not coherent with the other class; otherwise determining that the non-coherent document is coherent with the other class.
By dividing the documents to be clustered into batches, clustering each batch, applying coherency processing to the batch clustering results, merging the per-batch coherency results, and applying coherency processing again after merging, the batch clustering method of this embodiment further improves clustering performance and realizes incremental clustering.
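The overall flow of Embodiment 2 (steps 502 to 505, applied to already formed batches) can be sketched as a driver that takes the individual stages as callables. The stage signatures and the name `batch_cluster` are assumptions. The sketch also simplifies step 505 by re-applying coherency processing to all merged classes, whereas the source limits it to documents outside the first batch.

```python
from typing import Callable, List, Sequence

Classes = List[List[int]]


def batch_cluster(batches: Sequence[Sequence[int]],
                  cluster: Callable[[Sequence[int]], Classes],
                  cohere: Callable[[Classes], Classes],
                  merge: Callable[[Classes, Classes], Classes],
                  recohere_after_merge: bool = True) -> Classes:
    """Cluster each batch, cohere it, then fold it into the running result."""
    result: Classes = []
    for number, batch in enumerate(batches):
        classes = cohere(cluster(batch))      # steps 502 and 503
        if number == 0:
            result = classes                  # the first batch has nothing to merge with
        else:
            result = merge(result, classes)   # step 504
            if recohere_after_merge:
                result = cohere(result)       # step 505 (simplified, see lead-in)
    return result
```

Setting `recohere_after_merge=False` reduces the driver to the flow of Embodiment 1, which makes the difference between the two embodiments a single switch.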
An embodiment of the invention also provides a batch clustering system, described in Embodiment 3 below. Because the system solves the problem on a principle similar to that of the methods of Embodiments 1 and 2, its implementation can follow the implementation of those methods, and the repeated details are omitted.
Embodiment 3
Fig. 9 is a schematic diagram of the composition of a batch clustering system provided by an embodiment of the invention. Referring to Fig. 9, the batch clustering system comprises:
A batching unit 901, which divides the documents to be clustered into batches according to a predetermined policy;
A clustering unit 902, which clusters the documents of each batch to obtain a clustering result for each batch;
A first processing unit 903, which performs coherency processing on the clustering result of each batch to obtain a coherency result for each batch;
A merging unit 904, which, for each batch other than the first batch, merges each class in the coherency result of that batch with the classes in the coherency result of the previous batch, to obtain the batch clustering result for the documents to be clustered.
In one embodiment, the batch clustering system further comprises:
A search unit 905, which obtains search results from a search engine and supplies them to the batching unit as the documents to be clustered.
In this embodiment, the batching unit 901 is specifically configured to divide the search results returned by the search engine into batches according to fixed percentages and their rank order.
In one embodiment, the first processing unit 903 comprises:
A first generation module 9031, which generates a coherency threshold according to a predetermined rule from the mean similarity between all classes of the current batch;
A first determination module 9032, which determines, according to the coherency threshold generated by the first generation module 9031, whether each document in each class of the current batch is coherent with the class to which it belongs;
A first processing module 9033, which removes each non-coherent document identified by the first determination module 9032 from the class to which it belongs;
A second determination module 9034, which determines whether a non-coherent document identified by the first determination module 9032 is coherent with the other classes of the current batch;
A second processing module 9035, which adds the non-coherent document to the class when the second determination module 9034 determines yes, and makes the non-coherent document a class of its own in the current batch when the second determination module 9034 determines no.
In one embodiment, the first determination module 9032 comprises:
A first counting submodule, which counts, for each document in the current class, the document pairs, formed with the other documents in the current class, whose similarity exceeds the coherency threshold;
A first calculation submodule, which computes the proportion of the count produced by the first counting submodule to the number of documents in the current class;
A first determination submodule, which determines that the document is not coherent with the current class when the proportion computed by the first calculation submodule exceeds a first predetermined threshold, and determines that the document is coherent with the current class when the proportion does not exceed the first predetermined threshold.
In one embodiment, the second determination module 9034 comprises:
A second counting submodule, which counts the document pairs, formed by the non-coherent document and every document in another class of the current batch, whose similarity exceeds the coherency threshold;
A second calculation submodule, which computes the proportion of the count produced by the second counting submodule to the number of documents in the other class;
A second determination submodule, which determines that the non-coherent document is not coherent with the other class when the proportion computed by the second calculation submodule exceeds a second predetermined threshold, and determines that the non-coherent document is coherent with the other class when the proportion does not exceed the second predetermined threshold.
In one embodiment, this in batches clustering system also comprise:
The second processing unit 906, each document in each class document of current batch of document after it merges described merge cells 904 except described first document carries out coherency to be processed.
In one embodiment, this second processing unit 906 comprises:
The second generation module 9061, it generates a coherency threshold value according to the similarity mean value between all classes of current batch of document according to pre-defined rule;
The 3rd determination module 9062, it determines according to the coherency threshold value that described the second generation module 9061 generates whether described each document has coherency for the class under the document;
The 3rd processing module 9063, it rejects 9062 definite not the having the class of coherent document under the document of described the 3rd determination module;
The 4th determination module 9064, whether its other classes that do not have coherent document and current batch of document of determining that described the 3rd determination module 9062 is determined have coherency;
Manages module 9065 everywhere, when it is defined as being at described the 4th determination module 9064, do not have coherent document and add described class described, it is defined as when no at described the 4th determination module 9064, does not have coherent document separately as a class of described current batch of document with described.
In one embodiment, the 3rd determination module 9062 comprises:
The 3rd statistics submodule, other Documents Similarities of its described each document of statistics and current class inside exceed the right number of document of described coherency threshold value;
The 3rd calculating sub module, its number of calculating described the 3rd statistics submodule statistics accounts for the proportion of described current class internal document number;
The 3rd determines submodule, when its proportion that calculates in described the 3rd calculating sub module exceeds the first predetermined threshold value, determine that described each document does not have coherency for described current class, when the proportion that calculates in described the 3rd calculating sub module does not exceed described the first predetermined threshold value, determine that described each document has coherency for described current class.
In one embodiment, the fourth determination module 9064 comprises:
The fourth statistics submodule, which counts the number of document pairs, formed by the document lacking coherency and all documents in another class of the current batch of documents, whose similarity exceeds the coherency threshold;
The fourth calculation submodule, which calculates the proportion of the number counted by the fourth statistics submodule to the number of documents in the other class;
The fourth determination submodule, which determines that the document lacking coherency lacks coherency with respect to the other class when the proportion calculated by the fourth calculation submodule exceeds a second predetermined threshold, and determines that the document lacking coherency has coherency with respect to the other class when the proportion does not exceed the second predetermined threshold.
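The fourth determination module and the fourth processing module can be sketched together as below. This is a hedged illustration only: `place_document` and `toy_score` are hypothetical names, the document joins the first class that passes the test (the text does not say how to choose among several), and the comparison direction follows the text as written, using a distance-like toy score as an assumption.

```python
from typing import Callable, List

Score = Callable[[float, float], float]


def place_document(doc: float, other_classes: List[List[float]], score: Score,
                   coherency_threshold: float,
                   second_threshold: float) -> List[List[float]]:
    """Add `doc` to the first class it coheres with, else make it its own class."""
    for cls in other_classes:
        if not cls:
            continue
        # Count pairs (doc, member) whose score exceeds the coherency threshold.
        count = sum(1 for member in cls if score(doc, member) > coherency_threshold)
        # As written in the text: not exceeding the second predetermined
        # threshold means the document has coherency for this class.
        if count / len(cls) <= second_threshold:
            cls.append(doc)
            return other_classes
    # No class accepted the document, so it becomes a separate class.
    other_classes.append([doc])
    return other_classes


def toy_score(x: float, y: float) -> float:
    """Hypothetical distance-like score on numbers, for illustration only."""
    return abs(x - y)


classes = [[1.0, 1.2], [8.5, 9.5]]
place_document(9.0, classes, toy_score, 2.0, 0.5)   # joins the second class
place_document(50.0, classes, toy_score, 2.0, 0.5)  # becomes its own class
```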
The batch clustering system of this embodiment batches the documents to be clustered and clusters each batch, performs coherency processing on the clustering result of each batch, merges the coherency results of the batches, and performs coherency processing again after merging, which further improves clustering performance and achieves incremental clustering.
The preferred embodiments of the present invention have been described above with reference to the accompanying drawings. The many features and advantages of these embodiments are apparent from this detailed specification, and the appended claims are therefore intended to cover all such features and advantages of these embodiments that fall within their true spirit and scope. In addition, since numerous modifications and changes will readily occur to those skilled in the art, the embodiments of the present invention are not to be limited to the exact construction and operation illustrated and described; all suitable modifications and equivalents falling within their scope may be covered.
It should be understood that each part of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, a plurality of steps or methods may be implemented in software or firmware that is stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or a combination of the following technologies well known in the art may be used: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGA), field programmable gate arrays (FPGA), and the like.
Any process or method description or block in a flowchart, or otherwise described herein, may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing the steps of a specific logical function or process. The scope of the preferred embodiments of the present invention includes other implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in the reverse order depending on the functions involved, as will be understood by those skilled in the art to which the present invention pertains.
The logic and/or steps represented in a flowchart or otherwise described herein, which may be regarded, for example, as an ordered list of executable instructions for implementing logical functions, may be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from the instruction execution system, apparatus, or device and execute them). For the purposes of this specification, a "computer-readable medium" may be any means that can contain, store, communicate, propagate, or transport a program for use by, or in connection with, the instruction execution system, apparatus, or device. The computer-readable medium may be, for example but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, device, or propagation medium. More specific examples (a non-exhaustive list) of the computer-readable medium include: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM) (electronic device), a read-only memory (ROM) (electronic device), an erasable programmable read-only memory (EPROM or flash memory) (electronic device), an optical fiber (optical device), and a portable compact disc read-only memory (CD-ROM) (optical device). The computer-readable medium may even be paper or another suitable medium on which the program is printed, since the program can be obtained electronically, for example by optically scanning the paper or other medium and then compiling, interpreting, or otherwise processing it in a suitable manner if necessary, and can then be stored in a computer memory.
The above description and drawings illustrate various features of the present invention. It should be understood that a person of ordinary skill in the art can prepare suitable computer code to implement the steps and processes described above and illustrated in the drawings. It should also be understood that the various terminals, computers, servers, networks, and the like described above may be of any type, and that the computer code may be prepared according to the disclosure to implement the present invention using such devices.
Specific embodiments of the present invention are disclosed herein. A person of ordinary skill in the art will readily recognize that the present invention has other applications in other environments. In fact, many other embodiments and implementations exist. The appended claims are in no way intended to limit the scope of the present invention to the specific embodiments described above. In addition, any reference to "means for ..." is a means-plus-function recitation for describing elements and claims, and any element that does not specifically use the recitation "means for ..." is not intended to be understood as a means-plus-function element, even if the claim includes the word "means".
Although the present invention has been shown and described with respect to a certain preferred embodiment or embodiments, it is obvious that equivalent alterations and modifications will occur to those skilled in the art upon reading and understanding this specification and the drawings. In particular, with regard to the various functions performed by the above elements (components, assemblies, devices, compositions, etc.), the terms (including any reference to "means") used to describe such elements are intended, unless otherwise indicated, to correspond to any element that performs the specified function of the described element (that is, a functional equivalent), even if that element is structurally different from the disclosed structure that performs the function in the illustrated exemplary embodiment or embodiments of the present invention. In addition, although a particular feature of the present invention may have been described above with respect to only one or more of several illustrated embodiments, such a feature may be combined with one or more other features of the other embodiments as may be desired and advantageous for any given or particular application.
With regard to the embodiments described above, the following remarks are also disclosed.
Remark 1. A batch clustering method, wherein the method comprises:
Batching the documents to be clustered according to a predetermined strategy;
Clustering each batch of documents to obtain a clustering result for each batch of documents;
Performing coherency processing on the clustering result of each batch of documents to obtain a coherency result for each batch of documents;
Merging each class in the coherency result of each batch of documents other than the first batch with the classes in the coherency result of the previous batch of documents, to obtain the batch clustering result of the documents to be clustered.
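The four steps of Remark 1 can be sketched as a small driver loop. This is only an illustration under stated assumptions: `batch_cluster` is a hypothetical name, the clustering, coherency and merging steps are passed in as caller-supplied functions because this part of the text does not fix them, and the running merged result is treated as "the previous batch" when the next batch arrives. The helper functions `by_initial` and `merge_by_initial` are toy stand-ins, not the patent's algorithms.

```python
from typing import Callable, Iterable, List

Classes = List[List[str]]


def batch_cluster(batches: Iterable[List[str]],
                  cluster: Callable[[List[str]], Classes],
                  cohere: Callable[[Classes], Classes],
                  merge: Callable[[Classes, Classes], Classes]) -> Classes:
    """Cluster each batch, apply coherency processing, then merge batch by batch."""
    result = None
    for batch in batches:
        classes = cohere(cluster(batch))
        # The first batch has nothing to merge with; later batches are
        # merged into the result carried over from the previous batch.
        result = classes if result is None else merge(result, classes)
    return [] if result is None else result


def by_initial(batch: List[str]) -> Classes:
    """Toy clustering: group words by their first letter."""
    groups = {}
    for word in batch:
        groups.setdefault(word[0], []).append(word)
    return list(groups.values())


def merge_by_initial(previous: Classes, current: Classes) -> Classes:
    """Toy merge: join classes that share a first letter, else keep them apart."""
    merged = [list(cls) for cls in previous]
    for cls in current:
        for target in merged:
            if target[0][0] == cls[0][0]:
                target.extend(cls)
                break
        else:
            merged.append(list(cls))
    return merged


result = batch_cluster(
    [["apple", "avocado", "banana"], ["apricot", "cherry"]],
    by_initial,
    lambda classes: classes,  # coherency processing left as a no-op here
    merge_by_initial,
)
```

Because each new batch is only clustered on its own and then merged, earlier batches are never re-clustered, which is what makes the scheme incremental.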
Remark 2. The method according to Remark 1, wherein the documents to be clustered are search results returned by a search engine.
Remark 3. The method according to Remark 2, wherein batching the documents to be clustered according to a predetermined strategy comprises:
Batching the search results returned by the search engine according to a certain percentage and in their returned order.
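One possible reading of batching "according to a certain percentage and order" is sketched below. The function name `split_into_batches` and the choice of equal-sized consecutive batches are assumptions; the text does not say whether every batch uses the same percentage.

```python
import math
from typing import List, Sequence


def split_into_batches(results: Sequence[str], fraction: float) -> List[List[str]]:
    """Split ranked results into consecutive batches of `fraction` of the total."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not results:
        return []
    # Each batch holds the given share of all results, rounded up, and the
    # original ranking order is preserved within and across batches.
    size = max(1, math.ceil(len(results) * fraction))
    return [list(results[i:i + size]) for i in range(0, len(results), size)]


ranked = ["r1", "r2", "r3", "r4", "r5"]
batches = split_into_batches(ranked, 0.5)
```

Keeping the returned order means the top-ranked results form the first batch, so they can be clustered and shown before the remaining batches are processed.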
Remark 4. The method according to Remark 1, wherein performing coherency processing on the clustering result of each batch of documents comprises:
Generating a coherency threshold according to a predetermined rule, based on the mean similarity between all classes of the current batch of documents;
Determining, according to the coherency threshold, whether each document in each class of the current batch of documents has coherency with respect to the class to which it belongs;
Removing each document lacking coherency from the class to which it belongs;
Determining whether the document lacking coherency has coherency with respect to the other classes of the current batch of documents;
If there is a class with which the document lacking coherency can cohere, adding the document lacking coherency to that class; otherwise, making the document lacking coherency a separate class of the current batch of documents.
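The first step of Remark 4, deriving the coherency threshold from the mean similarity between classes, might look like the sketch below. Two points are assumptions and not taken from the text: the similarity between two classes is computed as the average over all cross-class document pairs, and the "predetermined rule" is modelled as a simple multiplication by a `scale` factor. The names `class_similarity`, `coherency_threshold` and `toy_similarity` are hypothetical.

```python
from itertools import combinations
from statistics import mean
from typing import Callable, List

Similarity = Callable[[float, float], float]


def class_similarity(a: List[float], b: List[float], similarity: Similarity) -> float:
    """Assumed class-to-class similarity: mean over all cross-class pairs."""
    return mean(similarity(x, y) for x in a for y in b)


def coherency_threshold(classes: List[List[float]], similarity: Similarity,
                        scale: float = 1.0) -> float:
    """Mean similarity between all pairs of classes, adjusted by an assumed rule."""
    pairs = list(combinations(classes, 2))
    if not pairs:
        raise ValueError("at least two classes are needed")
    # `scale` stands in for the unspecified predetermined rule.
    return scale * mean(class_similarity(a, b, similarity) for a, b in pairs)


def toy_similarity(x: float, y: float) -> float:
    """Hypothetical similarity on numbers, for illustration only."""
    return 1.0 / (1.0 + abs(x - y))


threshold = coherency_threshold([[0.0, 1.0], [10.0]], toy_similarity)
```

Deriving the threshold from the current batch, instead of fixing it in advance, lets it adapt to how well separated the classes of that batch are.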
Remark 5. The method according to Remark 4, wherein determining, according to the coherency threshold, whether each document in each class of the current batch of documents has coherency with respect to the class to which it belongs comprises:
Counting, for each document in the current class, the number of document pairs formed with the other documents in the current class whose similarity exceeds the coherency threshold;
Calculating the proportion of that number to the number of documents in the current class;
If the proportion exceeds a certain threshold, determining that the document lacks coherency with respect to the current class; otherwise, determining that the document has coherency with respect to the current class.
Remark 6. The method according to Remark 4, wherein determining whether the document lacking coherency has coherency with respect to the other classes of the current batch of documents comprises:
Counting the number of document pairs, formed by the document lacking coherency and all documents in another class of the current batch of documents, whose similarity exceeds the coherency threshold;
Calculating the proportion of that number to the number of documents in the other class;
If the proportion exceeds a certain threshold, determining that the document lacking coherency lacks coherency with respect to the other class; otherwise, determining that the document lacking coherency has coherency with respect to the other class.
Remark 7. The method according to Remark 1, wherein the method further comprises:
Performing coherency processing on each document in each class of the current batch of documents after merging, for each batch other than the first batch of documents.
Remark 8. The method according to Remark 7, wherein performing coherency processing on each document in each class of the current batch of documents after merging, for each batch other than the first batch of documents, comprises:
Generating a coherency threshold according to a predetermined rule, based on the mean similarity between all classes of the current batch of documents;
Determining, according to the coherency threshold, whether each document has coherency with respect to the class to which it belongs;
Removing each document lacking coherency from the class to which it belongs;
Determining whether the document lacking coherency has coherency with respect to the other classes of the current batch of documents;
If there is a class with which the document lacking coherency can cohere, adding the document lacking coherency to that class; otherwise, making the document lacking coherency a separate class of the current batch of documents.
Remark 9. The method according to Remark 8, wherein determining, according to the coherency threshold, whether each document has coherency with respect to the class to which it belongs comprises:
Counting, for each document, the number of document pairs formed with the other documents in the current class whose similarity exceeds the coherency threshold;
Calculating the proportion of that number to the number of documents in the current class;
If the proportion exceeds a certain threshold, determining that the document lacks coherency with respect to the current class; otherwise, determining that the document has coherency with respect to the current class.
Remark 10. The method according to Remark 8, wherein determining whether the document lacking coherency has coherency with respect to the other classes of the current batch of documents comprises:
Counting the number of document pairs, formed by the document lacking coherency and all documents in another class of the current batch of documents, whose similarity exceeds the coherency threshold;
Calculating the proportion of that number to the number of documents in the other class;
If the proportion exceeds a certain threshold, determining that the document lacking coherency lacks coherency with respect to the other class; otherwise, determining that the document lacking coherency has coherency with respect to the other class.
Remark 11. A batch clustering system, wherein the system comprises:
A batching unit, which batches the documents to be clustered according to a predetermined strategy;
A clustering unit, which clusters each batch of documents after batching to obtain a clustering result for each batch of documents;
A first processing unit, which performs coherency processing on the clustering result of each batch of documents to obtain a coherency result for each batch of documents;
A merging unit, which merges each class in the coherency result of each batch of documents other than the first batch with the classes in the coherency result of the previous batch of documents, to obtain the batch clustering result of the documents to be clustered.
Remark 12. The system according to Remark 11, wherein the system further comprises:
A search unit, configured to return search results through a search engine, the search results serving as the documents to be clustered by the batching unit.
Remark 13. The system according to Remark 12, wherein the batching unit is specifically configured to batch the search results returned by the search engine according to a certain percentage and in their returned order.
Remark 14. The system according to Remark 11, wherein the first processing unit comprises:
A first generation module, which generates a coherency threshold according to a predetermined rule, based on the mean similarity between all classes of the current batch of documents;
A first determination module, which determines, according to the coherency threshold generated by the first generation module, whether each document in each class of the current batch of documents has coherency with respect to the class to which it belongs;
A first processing module, which removes each document that the first determination module has determined to lack coherency from the class to which it belongs;
A second determination module, which determines whether a document that the first determination module has determined to lack coherency has coherency with respect to the other classes of the current batch of documents;
A second processing module, which adds the document lacking coherency to that class when the determination of the second determination module is yes, and makes the document lacking coherency a separate class of the current batch of documents when the determination is no.
Remark 15. The system according to Remark 14, wherein the first determination module comprises:
A first statistics submodule, which counts, for each document in the current class, the number of document pairs formed with the other documents in the current class whose similarity exceeds the coherency threshold;
A first calculation submodule, which calculates the proportion of the number counted by the first statistics submodule to the number of documents in the current class;
A first determination submodule, which determines that the document lacks coherency with respect to the current class when the proportion calculated by the first calculation submodule exceeds a first predetermined threshold, and determines that the document has coherency with respect to the current class when the proportion does not exceed the first predetermined threshold.
Remark 16. The system according to Remark 14, wherein the second determination module comprises:
A second statistics submodule, which counts the number of document pairs, formed by the document lacking coherency and all documents in another class of the current batch of documents, whose similarity exceeds the coherency threshold;
A second calculation submodule, which calculates the proportion of the number counted by the second statistics submodule to the number of documents in the other class;
A second determination submodule, which determines that the document lacking coherency lacks coherency with respect to the other class when the proportion calculated by the second calculation submodule exceeds a second predetermined threshold, and determines that the document lacking coherency has coherency with respect to the other class when the proportion does not exceed the second predetermined threshold.
Remark 17. The system according to Remark 11, wherein the system further comprises:
A second processing unit, which performs coherency processing on each document in each class of the current batch of documents after merging by the merging unit, for each batch other than the first batch of documents.
Remark 18. The system according to Remark 17, wherein the second processing unit comprises:
A second generation module, which generates a coherency threshold according to a predetermined rule, based on the mean similarity between all classes of the current batch of documents;
A third determination module, which determines, according to the coherency threshold generated by the second generation module, whether each document has coherency with respect to the class to which it belongs;
A third processing module, which removes each document that the third determination module has determined to lack coherency from the class to which it belongs;
A fourth determination module, which determines whether a document that the third determination module has determined to lack coherency has coherency with respect to the other classes of the current batch of documents;
A fourth processing module, which adds the document lacking coherency to that class when the determination of the fourth determination module is yes, and makes the document lacking coherency a separate class of the current batch of documents when the determination is no.
Remark 19. The system according to Remark 18, wherein the third determination module comprises:
A third statistics submodule, which counts, for each document, the number of document pairs formed with the other documents in the current class whose similarity exceeds the coherency threshold;
A third calculation submodule, which calculates the proportion of the number counted by the third statistics submodule to the number of documents in the current class;
A third determination submodule, which determines that the document lacks coherency with respect to the current class when the proportion calculated by the third calculation submodule exceeds a first predetermined threshold, and determines that the document has coherency with respect to the current class when the proportion does not exceed the first predetermined threshold.
Remark 20. The system according to Remark 18, wherein the fourth determination module comprises:
A fourth statistics submodule, which counts the number of document pairs, formed by the document lacking coherency and all documents in another class of the current batch of documents, whose similarity exceeds the coherency threshold;
A fourth calculation submodule, which calculates the proportion of the number counted by the fourth statistics submodule to the number of documents in the other class;
A fourth determination submodule, which determines that the document lacking coherency lacks coherency with respect to the other class when the proportion calculated by the fourth calculation submodule exceeds a second predetermined threshold, and determines that the document lacking coherency has coherency with respect to the other class when the proportion does not exceed the second predetermined threshold.

Claims (10)

1. A batch clustering method, wherein the method comprises:
Batching the documents to be clustered according to a predetermined strategy;
Clustering each batch of documents to obtain a clustering result for each batch of documents;
Performing coherency processing on the clustering result of each batch of documents to obtain a coherency result for each batch of documents;
Merging each class in the coherency result of each batch of documents other than the first batch with the classes in the coherency result of the previous batch of documents, to obtain the batch clustering result of the documents to be clustered.
2. The method according to claim 1, wherein the documents to be clustered are search results returned by a search engine.
3. The method according to claim 2, wherein batching the documents to be clustered according to a predetermined strategy comprises:
Batching the search results returned by the search engine according to a certain percentage and in their returned order.
4. The method according to claim 1, wherein performing coherency processing on the clustering result of each batch of documents comprises:
Generating a coherency threshold according to a predetermined rule, based on the mean similarity between all classes of the current batch of documents;
Determining, according to the coherency threshold, whether each document in each class of the current batch of documents has coherency with respect to the class to which it belongs;
Removing each document lacking coherency from the class to which it belongs;
Determining whether the document lacking coherency has coherency with respect to the other classes of the current batch of documents;
If there is a class with which the document lacking coherency can cohere, adding the document lacking coherency to that class; otherwise, making the document lacking coherency a separate class of the current batch of documents.
5. The method according to claim 1, wherein the method further comprises:
Performing coherency processing on each document in each class of the current batch of documents after merging, for each batch other than the first batch of documents.
6. A batch clustering system, wherein the system comprises:
A batching unit, which batches the documents to be clustered according to a predetermined strategy;
A clustering unit, which clusters each batch of documents after batching to obtain a clustering result for each batch of documents;
A first processing unit, which performs coherency processing on the clustering result of each batch of documents to obtain a coherency result for each batch of documents;
A merging unit, which merges each class in the coherency result of each batch of documents other than the first batch with the classes in the coherency result of the previous batch of documents, to obtain the batch clustering result of the documents to be clustered.
7. The system according to claim 6, wherein the system further comprises:
A search unit, configured to return search results through a search engine, the search results serving as the documents to be clustered by the batching unit.
8. The system according to claim 7, wherein the batching unit is specifically configured to batch the search results returned by the search engine according to a certain percentage and in their returned order.
9. The system according to claim 6, wherein the first processing unit comprises:
A first generation module, which generates a coherency threshold according to a predetermined rule, based on the mean similarity between all classes of the current batch of documents;
A first determination module, which determines, according to the coherency threshold generated by the first generation module, whether each document in each class of the current batch of documents has coherency with respect to the class to which it belongs;
A first processing module, which removes each document that the first determination module has determined to lack coherency from the class to which it belongs;
A second determination module, which determines whether a document that the first determination module has determined to lack coherency has coherency with respect to the other classes of the current batch of documents;
A second processing module, which adds the document lacking coherency to that class when the determination of the second determination module is yes, and makes the document lacking coherency a separate class of the current batch of documents when the determination is no.
10. The system according to claim 6, wherein the system further comprises:
A second processing unit, which performs coherency processing on each document in each class of the current batch of documents after merging by the merging unit, for each batch other than the first batch of documents.
CN201110189562.1A 2011-07-07 2011-07-07 Method and system for batching and clustering Expired - Fee Related CN102867006B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110189562.1A CN102867006B (en) 2011-07-07 2011-07-07 One is clustering method and system in batches

Publications (2)

Publication Number Publication Date
CN102867006A true CN102867006A (en) 2013-01-09
CN102867006B CN102867006B (en) 2016-04-13

Family

ID=47445881

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110189562.1A Expired - Fee Related CN102867006B (en) 2011-07-07 2011-07-07 Method and system for batching and clustering

Country Status (1)

Country Link
CN (1) CN102867006B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030126561A1 (en) * 2001-12-28 2003-07-03 Johannes Woehler Taxonomy generation
CN101853272A (en) * 2010-04-30 2010-10-06 华北电力大学(保定) Search engine technology based on relevance feedback and clustering
CN102053992A (en) * 2009-11-10 2011-05-11 阿里巴巴集团控股有限公司 Clustering method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
DESEN HOU等: "An Efficient Successive Iteration Partial Cluster Algorithm for Large Datasets", 《FUZZY INFORMATION AND ENGINEERING 2010》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105095209A (en) * 2014-04-21 2015-11-25 北京金山网络科技有限公司 Document clustering method, document clustering device and network equipment
CN105095209B (en) * 2014-04-21 2019-05-10 珠海豹好玩科技有限公司 Document clustering method and device, the network equipment
CN106126613A (en) * 2016-06-22 2016-11-16 苏州大学 One composition of digressing from the subject determines method and device
CN107766315A (en) * 2017-10-30 2018-03-06 山东浪潮通软信息科技有限公司 A kind of document combination method and device
CN114547316A (en) * 2022-04-27 2022-05-27 深圳市网联安瑞网络科技有限公司 System, method, device, medium, and terminal for optimizing aggregation-type hierarchical clustering algorithm

Also Published As

Publication number Publication date
CN102867006B (en) 2016-04-13

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20160413

Termination date: 20180707