CN104391879A - Method and device for hierarchical clustering - Google Patents

Info

Publication number
CN104391879A
Authority
CN
China
Prior art keywords
data object
class
distance
cluster
representative
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410602569.5A
Other languages
Chinese (zh)
Other versions
CN104391879B (en)
Inventor
陈志军
代阳
杨松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Xiaomi Technology Co Ltd
Xiaomi Inc
Original Assignee
Xiaomi Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiaomi Inc
Priority to CN201410602569.5A
Publication of CN104391879A
Application granted
Publication of CN104391879B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/231 Hierarchical techniques, i.e. dividing or merging pattern sets so as to obtain a dendrogram

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a method and a device for hierarchical clustering, and belongs to the field of data mining. The method comprises the steps of obtaining a data object set to be clustered, wherein the data object set comprises a plurality of classes, and each class corresponds to at least one data object; clustering the data objects corresponding to the first class to obtain a clustering result, wherein the number of the data objects corresponding to the first class is greater than a first preset threshold value, the clustering result comprises a plurality of clusters, and each cluster comprises at least one data object; screening the data objects corresponding to the first class according to the clustering result to obtain representative data objects of the first class; calculating a between-class distance according to the representative data objects of the first class and data objects corresponding to the second class; performing hierarchical clustering on the data object set based on the between-class distance. According to the method and the device for hierarchical clustering, by clustering of the data objects corresponding to the first class, the calculation intensity is reduced, and the calculation time and resources are saved; furthermore, the clustering result is more reliable, and subsequent data analysis is facilitated.

Description

Method and device for hierarchical clustering
Technical field
The present disclosure relates to the field of data mining, and in particular to a method and device for hierarchical clustering.
Background
In data mining, large amounts of data usually need to be analyzed to obtain valuable results. Clustering is an important family of algorithms for analyzing data in this field: it partitions a set of data items into categories, with the aim of grouping data of high similarity into the same class as far as possible, so as to facilitate subsequent data analysis. Hierarchical clustering is one of the more commonly used clustering algorithms.
In the related art, hierarchical clustering is implemented by calculating the distance between two classes, i.e. the between-class distance, and merging two classes whose between-class distance is less than a certain value into a new class. Because each class may contain more than one data object, calculating the between-class distance requires computing the distance between every data object in one class and every data object in the other class, pair by pair; the mean or the minimum of all these results is then taken as the between-class distance, and subsequent clustering is performed according to it.
In the course of realizing the present disclosure, the inventors found that the related art has at least the following problems:
In the related art, implementing hierarchical clustering by calculating between-class distances requires an excessive amount of computation; when a class contains many data objects, too much time and too many resources are consumed. Furthermore, each class may contain data objects that do not belong to it, i.e. noise. After such data objects are used in the between-class distance calculation and a new class is formed, more noise may be introduced, resulting in a poor clustering result that hinders subsequent data analysis.
Summary of the invention
To overcome the problems in the related art, the present disclosure provides a method and device for hierarchical clustering.
According to a first aspect of the embodiments of the present disclosure, a method for hierarchical clustering is provided, comprising:
Obtaining a data object set to be clustered, the data object set comprising a plurality of classes, each class corresponding to at least one data object;
Clustering the data objects corresponding to a first class to obtain a clustering result, wherein the number of data objects corresponding to the first class exceeds a first preset threshold, the clustering result comprises a plurality of clusters, and each cluster comprises at least one data object;
Screening the data objects corresponding to the first class according to the clustering result to obtain representative data objects of the first class;
Calculating a between-class distance based on the representative data objects of the first class and the data objects corresponding to a second class;
Performing hierarchical clustering on the data object set based on the between-class distance.
With reference to the first aspect, in a first possible implementation of the first aspect, screening the data objects corresponding to the first class according to the clustering result to obtain the representative data objects of the first class comprises:
According to the data objects included in the plurality of clusters, taking, in each cluster, the data object nearest to the central point of the cluster as the representative data object of the cluster;
Taking the representative data objects of the plurality of clusters as the representative data objects of the first class.
With reference to the first aspect, in a second possible implementation of the first aspect, calculating the between-class distance based on the representative data objects of the first class and the data objects corresponding to the second class comprises:
Obtaining the weights of the representative data objects of the first class according to the number of data objects included in each cluster of the first class;
Calculating the between-class distance based on the weights of the representative data objects of the first class, the representative data objects of the first class, and the data objects corresponding to the second class.
With reference to the second possible implementation of the first aspect, in a third possible implementation of the first aspect, calculating the between-class distance based on the weights of the representative data objects of the first class, the representative data objects of the first class, and the data objects corresponding to the second class comprises:
For a first representative data object and a second representative data object, calculating a first distance between the data objects, and weighting the first distance according to the weight of the first representative data object and the weight of the second representative data object, to obtain a weighted distance between the first representative data object and the second representative data object; or,
For a first representative data object and a third data object in the second class, calculating a second distance between the data objects, and weighting the second distance according to the weight of the first representative data object, to obtain a weighted distance between the first representative data object and the third data object.
With reference to the third possible implementation of the first aspect, in a fourth possible implementation of the first aspect, performing hierarchical clustering on the data object set based on the between-class distance comprises:
Merging the plurality of classes in the data object set based on the between-class distance;
Based on the merged classes, continuing to perform the clustering and screening of the first class until the between-class distance calculated from the result of the clustering and screening is greater than a second preset threshold, and outputting the hierarchical clustering result.
According to a second aspect of the embodiments of the present disclosure, a device for hierarchical clustering is provided, comprising:
An acquisition module, configured to obtain a data object set to be clustered, the data object set comprising a plurality of classes, each class corresponding to at least one data object;
A first clustering module, configured to cluster the data objects corresponding to a first class to obtain a clustering result, wherein the number of data objects corresponding to the first class exceeds a first preset threshold, the clustering result comprises a plurality of clusters, and each cluster comprises at least one data object;
A screening module, configured to screen the data objects corresponding to the first class according to the clustering result to obtain representative data objects of the first class;
A calculation module, configured to calculate a between-class distance based on the representative data objects of the first class and the data objects corresponding to a second class;
A second clustering module, configured to perform hierarchical clustering on the data object set based on the between-class distance.
With reference to the first aspect, in a first possible implementation of the first aspect, the screening module is configured to take, according to the data objects included in the plurality of clusters, the data object nearest to the central point of each cluster as the representative data object of that cluster, and to take the representative data objects of the plurality of clusters as the representative data objects of the first class.
With reference to the first aspect, in a second possible implementation of the first aspect, the calculation module is configured to obtain the weights of the representative data objects of the first class according to the number of data objects included in each cluster of the first class, and to calculate the between-class distance based on the weights of the representative data objects of the first class, the representative data objects of the first class, and the data objects corresponding to the second class.
With reference to the second possible implementation of the first aspect, in a third possible implementation of the first aspect, the calculation module is configured to: for a first representative data object and a second representative data object, calculate a first distance between the data objects, and weight the first distance according to the weight of the first representative data object and the weight of the second representative data object, to obtain a weighted distance between the first representative data object and the second representative data object; or, for a first representative data object and a third data object in the second class, calculate a second distance between the data objects, and weight the second distance according to the weight of the first representative data object, to obtain a weighted distance between the first representative data object and the third data object.
With reference to the third possible implementation of the first aspect, in a fourth possible implementation of the first aspect, the second clustering module is configured to merge the plurality of classes in the data object set based on the between-class distance, and, based on the merged classes, to continue performing the clustering and screening of the first class until the between-class distance calculated from the result of the clustering and screening is greater than a second preset threshold, and then to output the hierarchical clustering result.
According to a third aspect of the embodiments of the present disclosure, a device for hierarchical clustering is provided, comprising:
A processor;
A memory for storing instructions executable by the processor;
Wherein the processor is configured to:
Obtain a data object set to be clustered, the data object set comprising a plurality of classes, each class corresponding to at least one data object;
Cluster the data objects corresponding to a first class to obtain a clustering result, wherein the number of data objects corresponding to the first class exceeds a first preset threshold, the clustering result comprises a plurality of clusters, and each cluster comprises at least one data object;
Screen the data objects corresponding to the first class according to the clustering result to obtain representative data objects of the first class;
Calculate a between-class distance based on the representative data objects of the first class and the data objects corresponding to a second class;
Perform hierarchical clustering on the data object set based on the between-class distance.
The technical solutions provided by the embodiments of the present disclosure may have the following beneficial effects:
By clustering the data objects corresponding to the first class in the data object set to be clustered, and screening out representative data objects of the first class according to the clustering result, the present disclosure reduces the amount of computation when data objects are used to calculate between-class distances, saving computing time and resources. In addition, after the data objects corresponding to the first class are screened, data objects that may not originally have belonged to the class are filtered out, i.e. noise in the class is removed, which makes the clustering result more reliable and facilitates subsequent data analysis.
It should be understood that the above general description and the following detailed description are only exemplary and explanatory, and do not limit the present disclosure.
Brief description of the drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and, together with the specification, serve to explain the principles of the invention.
Fig. 1 is a flowchart of a method for hierarchical clustering according to an exemplary embodiment.
Fig. 2 is a flowchart of a method for hierarchical clustering according to an exemplary embodiment.
Fig. 3 is a block diagram of a device for hierarchical clustering according to an exemplary embodiment.
Fig. 4 is a block diagram of a device for hierarchical clustering according to an exemplary embodiment.
Detailed description
Exemplary embodiments are described in detail here, and examples of them are illustrated in the accompanying drawings. When the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the invention. Rather, they are merely examples of devices and methods consistent with some aspects of the invention as detailed in the appended claims.
The method for hierarchical clustering can be applied in many scenarios, for example to cluster customer groups with different purchasing power in market analysis, or to cluster organisms of different populations in biology. The embodiments of the present disclosure take a face recognition scenario as the example for describing the method for hierarchical clustering that they provide.
Fig. 1 is a flowchart of a method for hierarchical clustering according to an exemplary embodiment. As shown in Fig. 1, the method is used in a server and comprises the following steps.
In step S101, a data object set to be clustered is obtained; the data object set comprises a plurality of classes, and each class corresponds to at least one data object.
Specifically, in different application scenarios the type of data object included in the data object set differs, and accordingly so does the type of class. For example, in a face recognition scenario, the data objects in the set may be face data; accordingly, each class in the set may represent one user, and the data objects corresponding to each class are the face data objects of the user that the class represents. Each face data object may be a multi-dimensional vector; the embodiments of the present disclosure do not specifically limit this.
In step S102, the data objects corresponding to the first class are clustered to obtain a clustering result; the number of data objects corresponding to the first class exceeds the first preset threshold, the clustering result comprises a plurality of clusters, and each cluster comprises at least one data object.
In the embodiments of the present disclosure, for a data object set comprising a plurality of classes, some classes correspond to a large number of data objects, which may make the amount of computation excessive when the data objects of those classes are used for clustering; the number of data objects corresponding to such classes therefore needs to be reduced. Specifically, the data objects corresponding to the first class may be clustered so that the number of data objects corresponding to the first class no longer exceeds the first preset threshold. In a specific implementation, a clustering algorithm may be used to cluster the data objects corresponding to the first class according to their similarity, i.e. according to the distance between data objects, to obtain a clustering result comprising a plurality of clusters. Each cluster represents one category obtained by the clustering and comprises at least one data object, and the distance between data objects within each cluster is less than a certain threshold. This threshold may be set in advance by developers; the embodiments of the present disclosure do not specifically limit this.
In step S103, the data objects corresponding to the first class are screened according to the clustering result to obtain the representative data objects of the first class.
After the data objects corresponding to the first class have been clustered and a clustering result comprising a plurality of clusters has been obtained, the data objects in each cluster should also be screened: one data object is selected from each cluster as a representative data object of the first class. This reduces the number of data objects in the first class that are subsequently used to calculate between-class distances, and because each cluster represents one category obtained by the clustering, the selected representative data objects can be used for the calculations in the subsequent clustering process.
In step S104, the between-class distance is calculated based on the representative data objects of the first class and the data objects corresponding to the second class.
In the embodiments of the present disclosure, the second class refers to every class in the data object set to be clustered other than the first class. Since the number of data objects corresponding to the first class exceeds the first preset threshold, the number of data objects corresponding to the second class does not exceed the first preset threshold.
After representative data objects have been selected from the data objects corresponding to the first class, between-class distances can be calculated pairwise for the classes in the data object set to be clustered, based on the representative data objects of the first class and the data objects corresponding to the second class. Specifically, to calculate the between-class distance between two classes, the distance between each data object of one class and every data object of the other class is calculated one by one, yielding a plurality of distance results. These results are either averaged to obtain the mean distance or screened to obtain the minimum distance, and the resulting mean or minimum distance is taken as the between-class distance between the two classes.
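The pairwise calculation described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function names are hypothetical, and Euclidean distance is an assumption, since the text does not fix the distance measure at this point.

```python
import math
from itertools import product


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def between_class_distance(class_a, class_b, mode="mean", dist=euclidean):
    """Compare every object of class_a with every object of class_b,
    then reduce the results to their mean or their minimum."""
    distances = [dist(p, q) for p, q in product(class_a, class_b)]
    if mode == "mean":
        return sum(distances) / len(distances)
    if mode == "min":
        return min(distances)
    raise ValueError(f"unknown mode: {mode}")


# For a first class, class_a would hold only its representative objects.
class_a = [[0.0, 0.0], [0.0, 2.0]]
class_b = [[3.0, 0.0], [3.0, 4.0]]
print(between_class_distance(class_a, class_b, mode="min"))
print(between_class_distance(class_a, class_b, mode="mean"))
```

The cost is proportional to the product of the two class sizes, which is why replacing a large class by a handful of representatives reduces the work.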
In step S105, hierarchical clustering is performed on the data object set based on the between-class distance.
In the embodiments of the present disclosure, the aim of clustering is to gather data objects of high similarity into one class. Therefore, after the pairwise between-class distances of all classes in the data object set have been calculated, a new clustering result can be obtained based on the magnitudes of the between-class distances.
It should be noted that after the new clustering result is obtained, the hierarchical clustering process needs to be performed again in a loop starting from step S102, so that data objects of high similarity in the data object set are gathered into one class as far as possible and a hierarchical clustering result that is as accurate as possible is obtained. Data analysis can then be carried out on this hierarchical clustering result.
Optionally, screening the data objects corresponding to the first class according to the clustering result to obtain the representative data objects of the first class comprises:
According to the data objects included in the plurality of clusters, taking, in each cluster, the data object nearest to the central point of the cluster as the representative data object of the cluster;
Taking the representative data objects of the plurality of clusters as the representative data objects of the first class.
Optionally, calculating the between-class distance based on the representative data objects of the first class and the data objects corresponding to the second class comprises:
Obtaining the weights of the representative data objects of the first class according to the number of data objects included in each cluster of the first class;
Calculating the between-class distance based on the weights of the representative data objects of the first class, the representative data objects of the first class, and the data objects corresponding to the second class.
Optionally, calculating the between-class distance based on the weights of the representative data objects of the first class, the representative data objects of the first class, and the data objects corresponding to the second class comprises:
For a first representative data object and a second representative data object, calculating a first distance between the data objects, and weighting the first distance according to the weight of the first representative data object and the weight of the second representative data object, to obtain a weighted distance between the first representative data object and the second representative data object; or,
For a first representative data object and a third data object in the second class, calculating a second distance between the data objects, and weighting the second distance according to the weight of the first representative data object, to obtain a weighted distance between the first representative data object and the third data object.
Optionally, performing hierarchical clustering on the data object set based on the between-class distance comprises:
Merging the plurality of classes in the data object set based on the between-class distance;
Based on the merged classes, continuing to perform the clustering and screening of the first class until the between-class distance calculated from the result of the clustering and screening is greater than the second preset threshold, and outputting the hierarchical clustering result.
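The merge-and-repeat loop in the last option can be sketched as below. Several choices here are assumptions made only for illustration: the names `hierarchical_cluster` and `reduce_fn` are hypothetical, the distance is the Euclidean mean over pairs, and only the single closest pair of classes is merged per round. `reduce_fn` stands in for the cluster-and-screen step applied to large classes; by default it returns the class unchanged.

```python
import math
from itertools import combinations, product


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_pairwise(a, b):
    """Mean of all pairwise distances between two groups of objects."""
    d = [euclidean(p, q) for p, q in product(a, b)]
    return sum(d) / len(d)


def hierarchical_cluster(classes, second_threshold, reduce_fn=lambda objs: objs):
    """Merge the closest pair of classes until the smallest between-class
    distance exceeds second_threshold (the stopping rule from the text)."""
    classes = [list(c) for c in classes]
    while len(classes) > 1:
        # Hypothetical hook: shrink large classes to their representatives.
        reduced = [reduce_fn(c) for c in classes]
        (i, j), best = min(
            (((i, j), mean_pairwise(reduced[i], reduced[j]))
             for i, j in combinations(range(len(classes)), 2)),
            key=lambda item: item[1],
        )
        if best > second_threshold:
            break
        classes[i] = classes[i] + classes[j]  # i < j, so deleting j is safe
        del classes[j]
    return classes


points = [[[0.0]], [[1.0]], [[10.0]], [[11.0]]]  # each object starts as its own class
print(hierarchical_cluster(points, second_threshold=2.0))
```

Distances are computed on the reduced classes while the merge keeps all original objects, so a merged class can be clustered and screened again in the next round.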
In the method provided by the embodiments of the present disclosure, the data objects corresponding to the first class in the data object set to be clustered are clustered, and representative data objects of the first class are screened out according to the clustering result. This reduces the amount of computation when data objects are used to calculate between-class distances, saving computing time and resources. In addition, after the data objects corresponding to the first class are screened, data objects that may not originally have belonged to the class are filtered out, i.e. noise in the class is removed, which makes the clustering result more reliable and facilitates subsequent data analysis.
Fig. 2 is a flowchart of a method for hierarchical clustering according to an exemplary embodiment. As shown in Fig. 2, the method is used in a server and comprises the following steps:
In step S201, a data object set to be clustered is obtained; the data object set comprises a plurality of classes, and each class corresponds to at least one data object.
In the embodiments of the present disclosure, the data object set to be clustered may be collected in advance and stored in the server; the embodiments of the present disclosure do not specifically limit this. In a face recognition scenario, the data object set to be clustered comprises a plurality of face data objects. These may be obtained by a technician using an intelligent capture device to collect the facial information of multiple different users, yielding face data objects corresponding to different expressions, which are then stored in the server.
It should be noted that before clustering starts, the data object set to be clustered comprises a plurality of independent data objects. At this point each independent data object is treated as a class of its own, and the subsequent hierarchical clustering is then performed.
In step S202, the data objects corresponding to the first class are clustered to obtain a clustering result; the number of data objects corresponding to the first class exceeds the first preset threshold, the clustering result comprises a plurality of clusters, and each cluster comprises at least one data object.
In the embodiments of the present disclosure, the description uses only the k-means algorithm as the example for clustering the data objects corresponding to the first class. Specifically, before executing the k-means algorithm, the server also needs to obtain the first preset threshold and, according to the number of data objects corresponding to each class, determine whether the data object set to be clustered contains a first class, so that the data objects corresponding to the first class can be clustered. The first preset threshold may be set in advance by developers and obtained automatically by the server while the algorithm runs, or it may be determined from input by a user or technician while the algorithm runs; the embodiments of the present disclosure do not specifically limit this.
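The check for first classes can be sketched as a simple size comparison. The function name and the dictionary layout (class id mapped to a list of data objects) are assumptions made for illustration.

```python
def split_by_size(classes, first_threshold):
    """Return (first_ids, second_ids): a class is a first class when its
    object count exceeds first_threshold, otherwise a second class."""
    first = [cid for cid, objs in classes.items() if len(objs) > first_threshold]
    second = [cid for cid, objs in classes.items() if len(objs) <= first_threshold]
    return first, second


classes = {
    "user_a": [[0.1, 0.2]] * 5,  # 5 objects: exceeds the threshold below
    "user_b": [[0.9, 0.8]] * 2,  # 2 objects: does not
}
first_ids, second_ids = split_by_size(classes, first_threshold=3)
print(first_ids, second_ids)
```

Only the classes returned in `first_ids` would go through the k-means and screening steps; the others take part in the distance calculation with all of their objects.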
Suppose the first preset threshold is k, so that the number of data objects corresponding to the first class is greater than k. Accordingly, the process of clustering the data objects corresponding to the first class with the k-means algorithm comprises: arbitrarily selecting k data objects from all data objects of the first class; comparing each remaining data object in the first class with the k data objects one by one, calculating the distance between it and each of the k data objects; assigning each remaining data object to the same cluster as the one of the k data objects to which its distance is smallest, thereby dividing the data objects corresponding to the first class into k clusters; for each cluster, averaging all data objects in the cluster and taking the result as the central value of the cluster; taking the central values of the k clusters as the new comparison objects, calculating the distances between all data objects corresponding to the first class and the k central values, and repeating the steps of dividing the first class into k clusters and calculating the cluster central values until the number of iterations exceeds a preset number of iterations. The k clusters obtained at that point are the clustering result for the data objects corresponding to the first class.
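The k-means procedure described above can be sketched in plain Python as follows. Euclidean distance, the fixed random seed, the `max_iters` default, and keeping the previous center when a cluster ends up empty are assumptions added to make the sketch runnable; the text only specifies random initial objects, nearest-center assignment, mean centers, and a preset iteration limit.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def kmeans(objects, k, max_iters=20, seed=0):
    """Split objects into k clusters; returns (clusters, centers)."""
    rng = random.Random(seed)
    # Arbitrarily pick k objects as the initial comparison objects.
    centers = [list(v) for v in rng.sample(objects, k)]
    clusters = []
    for _ in range(max_iters):
        clusters = [[] for _ in range(k)]
        for obj in objects:
            nearest = min(range(k), key=lambda i: euclidean(obj, centers[i]))
            clusters[nearest].append(obj)
        # New central value per cluster; an empty cluster keeps its old center.
        centers = [mean_vector(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return clusters, centers


first_class = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
               [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
clusters, centers = kmeans(first_class, k=2)
print([len(c) for c in clusters])
```

The returned centers are the means of the returned clusters, so they can be used directly when picking the object nearest to each cluster's central point.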
In step S203, according to the data objects included in the plurality of clusters, the data object nearest to the central point of each cluster is taken as the representative data object of that cluster, and the representative data objects of the plurality of clusters are taken as the representative data objects of the first class.
After k clusters have been obtained in the first class according to step S202, the data object nearest to the central point of each cluster is taken as the representative data object of that cluster, and the representative data objects of the plurality of clusters are taken as the representative data objects of the first class. This reduces the number of data objects in the first class that are subsequently used to calculate between-class distances, and hence the amount of computation in the later calculations. Moreover, the representative data objects selected from the clusters are highly representative and can stand for all data objects corresponding to the first class, while data objects far from the cluster central points, i.e. data objects that may not belong to the first class, are excluded, which reduces noise.
For example, in a face recognition scenario, the data objects corresponding to each class are the facial information of a given user across all of that user's different expressions, and each expression corresponds to multiple pieces of facial information, such as multiple pieces when smiling and multiple pieces when serious. After the data objects corresponding to the first class are clustered and screened, the representative data objects obtained are the representative facial information for each expression of the user; for instance, one such piece of facial information can represent all of the user's facial information when smiling.
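Selecting the object nearest to each cluster's central point can be sketched as below. The function names are hypothetical, Euclidean distance is assumed, and the central point is taken to be the mean of the cluster's objects, in line with the central value described for the k-means step.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def pick_representatives(clusters):
    """For each non-empty cluster, keep the member closest to the cluster mean."""
    representatives = []
    for cluster in clusters:
        if not cluster:
            continue  # an empty cluster has no representative
        center = mean_vector(cluster)
        representatives.append(min(cluster, key=lambda obj: euclidean(obj, center)))
    return representatives


clusters = [[[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]], [[10.0, 10.0]]]
print(pick_representatives(clusters))
```

Because the representative is an actual member of the cluster rather than the mean itself, it remains a real data object, such as one real face vector for a given expression.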
In step S204, based on the data object corresponding to the representative data object of the first kind and Equations of The Second Kind, compute classes spacing.
Alternatively, based on the data object corresponding to the representative data object of the first kind and Equations of The Second Kind, compute classes spacing comprises: according in the first kind bunch comprise the number of data object, obtain the weight of the representative data object of the first kind; Based on the data object corresponding to the weight of the representative data object of the first kind, the representative data object of the first kind and Equations of The Second Kind, compute classes spacing.Particularly, this according in the first kind bunch comprise the number of data object, obtain the process of the weight of the representative data object of the first kind, it can be the total number according to data object in set of data objects to be clustered, calculate the ratio that each bunch of data object number comprised accounts for total number, using the weight of this ratio as the representative data object of this bunch.
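As a small worked illustration of the weight calculation, the following Python sketch divides each cluster's size by the total number of data objects in the set to be clustered. The function name `representative_weights` and its parameters are assumptions made for this example.

```python
def representative_weights(cluster_sizes, total_objects):
    # Weight of each cluster's representative: cluster size divided by the
    # total number of data objects in the set to be clustered.
    if total_objects <= 0:
        raise ValueError("total_objects must be positive")
    return [size / total_objects for size in cluster_sizes]
```

For instance, if a first class has two clusters containing 3 and 1 data objects and the whole set contains 8 data objects, the two representatives receive weights 3/8 and 1/8.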
In addition, optionally, when the inter-class distance is calculated based on the weights of the representative data objects of the first class, the representative data objects of the first class, and the data objects corresponding to the second class, the calculation differs according to the classes of the data objects involved, and includes the following cases (1) and (2):
(1) For a first representative data object and a second representative data object, a first distance between the data objects is calculated, and the first distance is weighted according to the weight of the first representative data object and the weight of the second representative data object, to obtain the weighted distance between the first representative data object and the second representative data object.
Case (1) is the case where both classes participating in the calculation are first classes; the first representative data object and the second representative data object are both representative data objects of first classes.
Specifically, calculating the first distance between data objects means that, for two first classes, the first distance is calculated between all representative data objects of one first class and all representative data objects of the other first class. In a face recognition scenario, a face data object may be a multidimensional vector, so calculating the distance between face data objects may consist of calculating the cosine similarity between the face data objects and using this cosine similarity as the calculated first distance.
After the first distance is obtained, it may also be weighted, for example by multiplying the first distance by the weight of the first representative data object and then by the weight of the second representative data object, to obtain the weighted distance between the first representative data object and the second representative data object.
(2) For a first representative data object and a third data object in the second class, a second distance between the data objects is calculated, and the second distance is weighted according to the weight of the first representative data object, to obtain the weighted distance between the first representative data object and the third data object.
Case (2) is the case where the two classes participating in the calculation are one first class and one second class; the first representative data object is a representative data object of the first class, and the third data object is a data object corresponding to the second class.
Specifically, the second distance between the data objects is calculated in the same way as the first distance in case (1), which is not repeated here. After the second distance is obtained, it may also be weighted. Because the third data object is an independent data item and has no weight, the second distance may be multiplied by the weight of the first representative data object to obtain the weighted distance between the first representative data object and the third data object.
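Cases (1) and (2) can be sketched together as below. The disclosure uses cosine similarity as the distance; this sketch instead uses the cosine distance, 1 minus the cosine similarity, which is an assumption made here so that a smaller value means closer objects. The function names and the default weight of 1.0 for an unweighted object are likewise assumptions, and the vectors are assumed to be non-zero.

```python
import math


def cosine_distance(a, b):
    # 1 - cosine similarity of two non-zero vectors (assumed distance measure).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def weighted_distance(rep_a, weight_a, obj_b, weight_b=1.0):
    # Case (1): both arguments are representatives, so both weights apply.
    # Case (2): obj_b is a plain data object of the second class and has no
    # weight, which this sketch models by leaving weight_b at 1.0.
    return cosine_distance(rep_a, obj_b) * weight_a * weight_b
```

Calling `weighted_distance(r1, w1, r2, w2)` corresponds to case (1), and `weighted_distance(r1, w1, x3)` corresponds to case (2).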
Of course, in practical applications, the first distance between classes may also be calculated without considering the weights, and the first distance may be used directly as the inter-class distance. It should be noted that the above method for calculating the weights of the representative data objects and the above method for calculating the inter-class distance based on these weights are merely examples; in practical applications, other calculation methods may also be used, and the embodiments of the present disclosure do not specifically limit this.
In step S205, the multiple classes in the data object set are merged based on the inter-class distance.
Further, merging the multiple classes in the data object set based on the inter-class distance comprises: according to the magnitude of the inter-class distance, merging the classes whose inter-class distance is less than or equal to a second preset threshold.
It should be noted that a merged class includes all data objects corresponding to the classes that were merged. In a face recognition scenario, class merging combines the classes in the data object set that belong to the same user, so that the data objects included in the merged class are the face data objects of that user, until all face data objects of the same user are assigned to one class. Of course, in practical applications, clustering may also be performed according to whether face data objects belong to the same expression; correspondingly, when classes are merged, the classes in the data object set that belong to the same expression are merged, and the data objects included in the merged class are the face data objects corresponding to that expression. The embodiments of the present disclosure do not specifically limit the application scenario of this clustering algorithm.
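One possible way to carry out the merging step is sketched below. It assumes the inter-class distances have already been computed and are supplied as a dictionary keyed by index pairs, and it merges transitively with a small union-find structure; both the data layout and the transitive merging are assumptions of this sketch, not requirements of the disclosure.

```python
def merge_classes(classes, distances, threshold):
    # classes: list of lists of data objects.
    # distances: dict mapping (i, j) index pairs to the inter-class distance.
    parent = list(range(len(classes)))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Join every pair of classes whose distance does not exceed the threshold.
    for (i, j), d in distances.items():
        if d <= threshold:
            parent[find(j)] = find(i)

    # A merged class contains all data objects of the classes that were merged.
    merged = {}
    for index, members in enumerate(classes):
        merged.setdefault(find(index), []).extend(members)
    return list(merged.values())
```

In this sketch, two classes end up together whenever they are linked by a chain of pairs within the threshold, which is one reasonable reading of merging all classes whose distance is at most the second preset threshold.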
In step S206, based on the merged classes, the clustering and screening of the first class continue to be performed until the inter-class distance calculated from the result of the clustering and screening is greater than the second preset threshold, and the hierarchical clustering result is output.
In the embodiments of the present disclosure, after class merging, some of the resulting classes may still be highly similar to one another. For example, in a face recognition scenario, after class merging there may still be classes that belong to the same user but have not been merged. Therefore, for the data object set obtained after class merging, the clustering and screening of the first class need to be performed again, that is, steps S202 to S205 are executed in a loop until the inter-class distance calculated from the result of the clustering and screening is greater than the second preset threshold. At that point the hierarchical clustering is complete and the hierarchical clustering result can be output.
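The overall loop can be outlined as in the following sketch, which takes the representative selection and the inter-class distance as caller-supplied functions. Merging only the closest pair of classes per round, and all names used here (`hierarchical_clustering`, `select_representatives`, `class_distance`, `merge_threshold`), are simplifying assumptions of this example; the disclosure itself merges the classes whose distance is within the second preset threshold.

```python
def hierarchical_clustering(classes, select_representatives, class_distance, merge_threshold):
    # classes: list of lists of data objects.
    # select_representatives: function mapping a class to its representative objects.
    # class_distance: function mapping two representative lists to a distance.
    classes = [list(c) for c in classes]
    while len(classes) > 1:
        # Re-run the screening on the current (possibly merged) classes.
        reps = [select_representatives(c) for c in classes]
        # Find the closest pair of classes under the supplied distance.
        d, i, j = min(
            (class_distance(reps[a], reps[b]), a, b)
            for a in range(len(classes))
            for b in range(a + 1, len(classes))
        )
        # Stop once every remaining inter-class distance exceeds the threshold.
        if d > merge_threshold:
            break
        merged = classes[i] + classes[j]
        classes = [c for idx, c in enumerate(classes) if idx not in (i, j)] + [merged]
    return classes
```

A caller could, for example, pass a function that runs k-means and keeps the object nearest each central point as `select_representatives`, and a weighted or unweighted distance as `class_distance`.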
In the method provided by the embodiments of the present disclosure, the data objects corresponding to the first class in the data object set to be clustered are clustered, and the representative data objects of the first class are screened out according to the cluster result, so that the amount of calculation is reduced and computing time and resources are saved when the inter-class distance is calculated from the data objects. Further, in screening out the representative data objects of the first class according to the cluster result, the central point of each cluster is calculated and the data object nearest to the central point is obtained from each cluster as its representative data object, so that, after screening, data objects that did not originally belong to the first class are screened out of the data objects corresponding to the first class. In other words, the noise of the class is eliminated, which makes the cluster result more reliable and benefits subsequent data analysis.
Fig. 3 is a block diagram of a device for hierarchical clustering according to an exemplary embodiment. With reference to Fig. 3, the device includes an acquisition module 301, a first clustering module 302, a screening module 303, a computing module 304, and a second clustering module 305.
The acquisition module 301 is configured to obtain a data object set to be clustered, the data object set comprising multiple classes, each class corresponding to at least one data object;
The first clustering module 302 is configured to cluster the data objects corresponding to a first class to obtain a cluster result, wherein the number of data objects corresponding to the first class is greater than a first preset threshold, the cluster result comprises multiple clusters, and each cluster comprises at least one data object;
The screening module 303 is configured to screen the data objects corresponding to the first class according to the cluster result to obtain the representative data objects of the first class;
The computing module 304 is configured to calculate the inter-class distance based on the representative data objects of the first class and the data objects corresponding to a second class;
The second clustering module 305 is configured to perform hierarchical clustering on the data object set based on the inter-class distance.
Specifically, the screening module 303 is configured to, according to the data objects included in the multiple clusters, take the data object nearest to the central point of each cluster as the representative data object of that cluster, and take the representative data objects of the multiple clusters as the representative data objects of the first class.
Specifically, the computing module 304 is configured to obtain the weights of the representative data objects of the first class according to the number of data objects included in each cluster of the first class, and to calculate the inter-class distance based on the weights of the representative data objects of the first class, the representative data objects of the first class, and the data objects corresponding to the second class.
Specifically, the computing module 304 is configured to, for a first representative data object and a second representative data object, calculate a first distance between the data objects and weight the first distance according to the weight of the first representative data object and the weight of the second representative data object, to obtain the weighted distance between the first representative data object and the second representative data object; or, for a first representative data object and a third data object in the second class, calculate a second distance between the data objects and weight the second distance according to the weight of the first representative data object, to obtain the weighted distance between the first representative data object and the third data object.
Specifically, the second clustering module 305 is configured to merge the multiple classes in the data object set based on the inter-class distance and, based on the merged classes, continue to perform the clustering and screening of the first class until the inter-class distance calculated from the result of the clustering and screening is greater than the second preset threshold, and to output the hierarchical clustering result.
In the device provided by the embodiments of the present disclosure, the data objects corresponding to the first class in the data object set to be clustered are clustered, and the representative data objects of the first class are screened out according to the cluster result, so that the amount of calculation is reduced and computing time and resources are saved when the inter-class distance is calculated from the data objects. Moreover, after the data objects corresponding to the first class are screened, data objects that did not originally belong to the class may be screened out, that is, the noise of the class is eliminated, which makes the cluster result more reliable and benefits subsequent data analysis.
With regard to the device in the above embodiment, the specific manner in which each module performs its operations has been described in detail in the embodiments of the method and will not be elaborated here.
Fig. 4 is a block diagram of a device 400 for hierarchical clustering according to an exemplary embodiment. For example, the device 400 may be provided as a server. With reference to Fig. 4, the device 400 includes a processing component 1922, which further includes one or more processors, and memory resources represented by a memory 1932 for storing instructions executable by the processing component 1922, such as application programs. The application programs stored in the memory 1932 may include one or more modules, each corresponding to a set of instructions. In addition, the processing component 1922 is configured to execute the instructions to perform the above method:
Obtaining a data object set to be clustered, the data object set comprising multiple classes, each class corresponding to at least one data object;
Clustering the data objects corresponding to a first class to obtain a cluster result, wherein the number of data objects corresponding to the first class is greater than a first preset threshold, the cluster result comprises multiple clusters, and each cluster comprises at least one data object;
Screening the data objects corresponding to the first class according to the cluster result to obtain the representative data objects of the first class;
Calculating the inter-class distance based on the representative data objects of the first class and the data objects corresponding to a second class;
Performing hierarchical clustering on the data object set based on the inter-class distance.
Taking the above as a first possible implementation, in a second possible implementation provided on the basis of the first possible implementation, the memory of the device further contains instructions for performing the following operations:
According to the data objects included in the multiple clusters, taking the data object nearest to the central point of each cluster as the representative data object of that cluster;
Taking the representative data objects of the multiple clusters as the representative data objects of the first class.
In a third possible implementation provided on the basis of the first possible implementation, the memory of the device further contains instructions for performing the following operations:
Obtaining the weights of the representative data objects of the first class according to the number of data objects included in each cluster of the first class;
Calculating the inter-class distance based on the weights of the representative data objects of the first class, the representative data objects of the first class, and the data objects corresponding to the second class.
In a fourth possible implementation provided on the basis of the third possible implementation, the memory of the device further contains instructions for performing the following operations:
For a first representative data object and a second representative data object, calculating a first distance between the data objects, and weighting the first distance according to the weight of the first representative data object and the weight of the second representative data object, to obtain the weighted distance between the first representative data object and the second representative data object; or,
For a first representative data object and a third data object in the second class, calculating a second distance between the data objects, and weighting the second distance according to the weight of the first representative data object, to obtain the weighted distance between the first representative data object and the third data object.
In a fifth possible implementation provided on the basis of the first possible implementation, the memory of the device further contains instructions for performing the following operations:
Merging the multiple classes in the data object set based on the inter-class distance;
Based on the merged classes, continuing to perform the clustering and screening of the first class until the inter-class distance calculated from the result of the clustering and screening is greater than the second preset threshold, and outputting the hierarchical clustering result.
The device 400 may also include a power component 1926 configured to perform power management of the device 400, a wired or wireless network interface 1950 configured to connect the device 400 to a network, and an input/output (I/O) interface 1958. The device 400 may operate based on an operating system stored in the memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™ or the like.
In the device provided by the embodiments of the present disclosure, the data objects corresponding to the first class in the data object set to be clustered are clustered, and the representative data objects of the first class are screened out according to the cluster result, so that the amount of calculation is reduced and computing time and resources are saved when the inter-class distance is calculated from the data objects. Moreover, after the data objects corresponding to the first class are screened, data objects that did not originally belong to the class may be screened out, that is, the noise of the class is eliminated, which makes the cluster result more reliable and benefits subsequent data analysis.
Those skilled in the art, after considering the specification and practicing the invention disclosed herein, will readily conceive of other embodiments of the invention. This application is intended to cover any variations, uses, or adaptations of the invention that follow its general principles and include common knowledge or customary technical means in the art that are not disclosed in this disclosure. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the invention being indicated by the following claims.
It should be understood that the invention is not limited to the precise structure described above and illustrated in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of the invention is limited only by the appended claims.

Claims (11)

1. A method for hierarchical clustering, characterized in that the method comprises:
obtaining a data object set to be clustered, the data object set comprising multiple classes, each class corresponding to at least one data object;
clustering the data objects corresponding to a first class to obtain a cluster result, wherein the number of data objects corresponding to the first class is greater than a first preset threshold, the cluster result comprises multiple clusters, and each cluster comprises at least one data object;
screening the data objects corresponding to the first class according to the cluster result to obtain representative data objects of the first class;
calculating an inter-class distance based on the representative data objects of the first class and data objects corresponding to a second class;
performing hierarchical clustering on the data object set based on the inter-class distance.
2. The method according to claim 1, characterized in that screening the data objects corresponding to the first class according to the cluster result to obtain the representative data objects of the first class comprises:
according to the data objects included in the multiple clusters, taking the data object nearest to the central point of each cluster as the representative data object of that cluster;
taking the representative data objects of the multiple clusters as the representative data objects of the first class.
3. The method according to claim 1, characterized in that calculating the inter-class distance based on the representative data objects of the first class and the data objects corresponding to the second class comprises:
obtaining weights of the representative data objects of the first class according to the number of data objects included in each cluster of the first class;
calculating the inter-class distance based on the weights of the representative data objects of the first class, the representative data objects of the first class, and the data objects corresponding to the second class.
4. The method according to claim 3, characterized in that calculating the inter-class distance based on the weights of the representative data objects of the first class, the representative data objects of the first class, and the data objects corresponding to the second class comprises:
for a first representative data object and a second representative data object, calculating a first distance between the data objects, and weighting the first distance according to the weight of the first representative data object and the weight of the second representative data object, to obtain a weighted distance between the first representative data object and the second representative data object; or,
for a first representative data object and a third data object in the second class, calculating a second distance between the data objects, and weighting the second distance according to the weight of the first representative data object, to obtain a weighted distance between the first representative data object and the third data object.
5. The method according to claim 1, characterized in that performing hierarchical clustering on the data object set based on the inter-class distance comprises:
merging the multiple classes in the data object set based on the inter-class distance;
based on the merged classes, continuing to perform the clustering and screening of the first class until the inter-class distance calculated from the result of the clustering and screening is greater than a second preset threshold, and outputting a hierarchical clustering result.
6. A device for hierarchical clustering, characterized in that the device comprises:
an acquisition module, configured to obtain a data object set to be clustered, the data object set comprising multiple classes, each class corresponding to at least one data object;
a first clustering module, configured to cluster the data objects corresponding to a first class to obtain a cluster result, wherein the number of data objects corresponding to the first class is greater than a first preset threshold, the cluster result comprises multiple clusters, and each cluster comprises at least one data object;
a screening module, configured to screen the data objects corresponding to the first class according to the cluster result to obtain representative data objects of the first class;
a computing module, configured to calculate an inter-class distance based on the representative data objects of the first class and data objects corresponding to a second class;
a second clustering module, configured to perform hierarchical clustering on the data object set based on the inter-class distance.
7. The device according to claim 6, characterized in that the screening module is configured to, according to the data objects included in the multiple clusters, take the data object nearest to the central point of each cluster as the representative data object of that cluster, and take the representative data objects of the multiple clusters as the representative data objects of the first class.
8. The device according to claim 6, characterized in that the computing module is configured to obtain weights of the representative data objects of the first class according to the number of data objects included in each cluster of the first class, and to calculate the inter-class distance based on the weights of the representative data objects of the first class, the representative data objects of the first class, and the data objects corresponding to the second class.
9. The device according to claim 8, characterized in that the computing module is configured to, for a first representative data object and a second representative data object, calculate a first distance between the data objects and weight the first distance according to the weight of the first representative data object and the weight of the second representative data object, to obtain a weighted distance between the first representative data object and the second representative data object; or, for a first representative data object and a third data object in the second class, calculate a second distance between the data objects and weight the second distance according to the weight of the first representative data object, to obtain a weighted distance between the first representative data object and the third data object.
10. The device according to claim 6, characterized in that the second clustering module is configured to merge the multiple classes in the data object set based on the inter-class distance and, based on the merged classes, continue to perform the clustering and screening of the first class until the inter-class distance calculated from the result of the clustering and screening is greater than a second preset threshold, and to output a hierarchical clustering result.
11. A device for hierarchical clustering, characterized by comprising:
a processor;
a memory for storing instructions executable by the processor;
wherein the processor is configured to:
obtain a data object set to be clustered, the data object set comprising multiple classes, each class corresponding to at least one data object;
cluster the data objects corresponding to a first class to obtain a cluster result, wherein the number of data objects corresponding to the first class is greater than a first preset threshold, the cluster result comprises multiple clusters, and each cluster comprises at least one data object;
screen the data objects corresponding to the first class according to the cluster result to obtain representative data objects of the first class;
calculate an inter-class distance based on the representative data objects of the first class and data objects corresponding to a second class;
perform hierarchical clustering on the data object set based on the inter-class distance.
CN201410602569.5A 2014-10-31 2014-10-31 The method and device of hierarchical clustering Active CN104391879B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410602569.5A CN104391879B (en) 2014-10-31 2014-10-31 The method and device of hierarchical clustering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410602569.5A CN104391879B (en) 2014-10-31 2014-10-31 The method and device of hierarchical clustering

Publications (2)

Publication Number Publication Date
CN104391879A true CN104391879A (en) 2015-03-04
CN104391879B CN104391879B (en) 2017-10-10

Family

ID=52609783

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410602569.5A Active CN104391879B (en) 2014-10-31 2014-10-31 The method and device of hierarchical clustering

Country Status (1)

Country Link
CN (1) CN104391879B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101079072A (en) * 2007-06-22 2007-11-28 中国科学院研究生院 Text clustering element study method and device
US8019711B1 (en) * 2003-11-10 2011-09-13 James Ralph Heidenreich System and method to provide a customized problem solving environment for the development of user thinking about an arbitrary problem
CN102360377A (en) * 2011-10-12 2012-02-22 中国测绘科学研究院 Spatial clustering mining PSE (Problem Solving Environments) system and construction method thereof
CN103473255A (en) * 2013-06-06 2013-12-25 中国科学院深圳先进技术研究院 Data clustering method and system, and data processing equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
威滕等: "《数据挖掘 实用机器学习工具与技术》", 31 May 2014 *

Also Published As

Publication number Publication date
CN104391879B (en) 2017-10-10

Similar Documents

Publication Publication Date Title
CN104391879A (en) Method and device for hierarchical clustering
WO2017206936A1 (en) Machine learning based network model construction method and apparatus
CN109948641A (en) Anomalous group recognition method and device
CN105607952B (en) Method and device for scheduling virtualized resources
KR20160019897A (en) Fast grouping of time series
CN109685092B (en) Clustering method, equipment, storage medium and device based on big data
WO2022142859A1 (en) Data processing method and apparatus, computer readable medium, and electronic device
CN110895506B (en) Method and system for constructing test data
CN110458096A (en) Large-scale commodity recognition method based on deep learning
CN105335368A (en) Product clustering method and apparatus
CN112905340A (en) System resource allocation method, device and equipment
CN110109899A (en) Internet of Things data completion method, apparatus and system
CN113688490A (en) Network co-construction sharing processing method, device, equipment and storage medium
CN110796159A (en) Power data classification method and system based on k-means algorithm
CN111343416B (en) Distributed image analysis method, system and storage medium
CN112148942A (en) Business index data classification method and device based on data clustering
CN102722732A (en) Image set matching method based on data second order static modeling
CN114064834A (en) Target location determination method and device, storage medium and electronic equipment
CN102141988A (en) Method, system and device for clustering data in data mining system
CN104765820B (en) Non-intrusive service dependency discovery method
CN112749202A (en) Information operation strategy determination method, device, equipment and storage medium
CN114330720A (en) Knowledge graph construction method and device for cloud computing and storage medium
CN113946717A (en) Sub-map index feature obtaining method, device, equipment and storage medium
US10642864B2 (en) Information processing device and clustering method
US20210232582A1 (en) Optimizing breakeven points for enhancing system performance

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant