CN104391879B - The method and device of hierarchical clustering - Google Patents

Method and device for hierarchical clustering


Publication number
CN104391879B
Authority
CN
China
Prior art keywords
data object
cluster
class
distance
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410602569.5A
Other languages
Chinese (zh)
Other versions
CN104391879A (en)
Inventor
陈志军
代阳
杨松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiaomi Inc
Original Assignee
Xiaomi Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiaomi Inc filed Critical Xiaomi Inc
Priority to CN201410602569.5A priority Critical patent/CN104391879B/en
Publication of CN104391879A publication Critical patent/CN104391879A/en
Application granted granted Critical
Publication of CN104391879B publication Critical patent/CN104391879B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06K GRAPHICAL DATA READING; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
    • G06K9/00 Methods or arrangements for recognising patterns
    • G06K9/62 Methods or arrangements for pattern recognition using electronic means
    • G06K9/6217 Design or setup of recognition systems and techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06K9/6218 Clustering techniques
    • G06K9/6219 Hierarchical techniques, i.e. dividing or merging pattern sets so as to obtain a dendrogram

Abstract

The disclosure relates to a method and device for hierarchical clustering, and belongs to the field of data mining. The method includes: obtaining a data object set to be clustered, the data object set including multiple classes, each class corresponding to at least one data object; clustering the data objects corresponding to a first class to obtain a clustering result, where the number of data objects corresponding to the first class is greater than a first preset threshold, the clustering result includes multiple clusters, and each cluster includes at least one data object; screening the data objects corresponding to the first class according to the clustering result to obtain representative data objects of the first class; calculating inter-class distances based on the representative data objects of the first class and the data objects corresponding to a second class; and performing hierarchical clustering on the data object set based on the inter-class distances. By clustering the data objects corresponding to the first class, the disclosure reduces the amount of computation, saves computing time and resources, and makes the clustering result more reliable, which benefits subsequent data analysis.

Description

Method and device for hierarchical clustering
Technical field
The present disclosure relates to the field of data mining, and in particular to a method and device for hierarchical clustering.
Background
In the field of data mining, large amounts of data usually need to be analyzed to obtain valuable results. Clustering algorithms are an important family of algorithms for analyzing data in this field: they partition a set of data items according to the categories of the data, with the aim of grouping highly similar data into one class wherever possible so as to facilitate subsequent data analysis. Hierarchical clustering is one of the more commonly used clustering algorithms.
In the related art, hierarchical clustering is implemented by calculating the distance between two classes, i.e. the inter-class distance, and merging two classes whose inter-class distance is below a certain value into a new class. Because each class may contain more than one data object, calculating the inter-class distance requires computing the distance between every data object in one class and every data object in the other class, pair by pair, and then taking the average or the minimum of all the results as the inter-class distance, on which the subsequent clustering is based.
In the course of realizing the present disclosure, the inventors found that the related art has at least the following problems:
In the related art, implementing hierarchical clustering by calculating inter-class distances involves an excessive amount of computation: when a class contains many data objects, too much time and too many resources are consumed. Moreover, each class may contain data objects that do not belong to it, i.e. noise. Calculating inter-class distances with such data objects and forming new classes from the results may introduce still more noise, which degrades the clustering result and hinders subsequent data analysis.
Summary of the invention
To overcome the problems in the related art, the present disclosure provides a method and device for hierarchical clustering.
According to a first aspect of the embodiments of the present disclosure, a method for hierarchical clustering is provided, including:
obtaining a data object set to be clustered, the data object set including multiple classes, each class corresponding to at least one data object;
clustering the data objects corresponding to a first class to obtain a clustering result, where the number of data objects corresponding to the first class is greater than a first preset threshold, the clustering result includes multiple clusters, and each cluster includes at least one data object;
screening the data objects corresponding to the first class according to the clustering result to obtain representative data objects of the first class;
calculating inter-class distances based on the representative data objects of the first class and the data objects corresponding to a second class;
performing hierarchical clustering on the data object set based on the inter-class distances.
With reference to the first aspect, in a first possible implementation of the first aspect, screening the data objects corresponding to the first class according to the clustering result to obtain the representative data objects of the first class includes:
according to the data objects included in the multiple clusters, taking, in each cluster, the data object closest to the central point of the cluster as the representative data object of that cluster;
taking the representative data objects of the multiple clusters as the representative data objects of the first class.
With reference to the first aspect, in a second possible implementation of the first aspect, calculating inter-class distances based on the representative data objects of the first class and the data objects corresponding to the second class includes:
obtaining the weights of the representative data objects of the first class according to the number of data objects included in each cluster of the first class;
calculating inter-class distances based on the representative data objects of the first class, the weights of the representative data objects of the first class, and the data objects corresponding to the second class.
With reference to the second possible implementation of the first aspect, in a third possible implementation of the first aspect, calculating inter-class distances based on the representative data objects of the first class, the weights of the representative data objects of the first class, and the data objects corresponding to the second class includes:
for a first representative data object and a second representative data object, calculating a first distance between the data objects, and weighting the first distance according to the weight of the first representative data object and the weight of the second representative data object to obtain a weighted distance between the first representative data object and the second representative data object; or,
for a first representative data object and a third data object in the second class, calculating a second distance between the data objects, and weighting the second distance according to the weight of the first representative data object to obtain a weighted distance between the first representative data object and the third data object.
With reference to the third possible implementation of the first aspect, in a fourth possible implementation of the first aspect, performing hierarchical clustering on the data object set based on the inter-class distances includes:
merging multiple classes in the data object set based on the inter-class distances;
based on the merged classes, continuing to perform the clustering and screening of first classes until the inter-class distance calculated from the clustering and screening results is greater than a second preset threshold, and outputting the hierarchical clustering result.
According to a second aspect of the embodiments of the present disclosure, a device for hierarchical clustering is provided, including:
an acquisition module, configured to obtain a data object set to be clustered, the data object set including multiple classes, each class corresponding to at least one data object;
a first clustering module, configured to cluster the data objects corresponding to a first class to obtain a clustering result, where the number of data objects corresponding to the first class is greater than a first preset threshold, the clustering result includes multiple clusters, and each cluster includes at least one data object;
a screening module, configured to screen the data objects corresponding to the first class according to the clustering result to obtain representative data objects of the first class;
a computing module, configured to calculate inter-class distances based on the representative data objects of the first class and the data objects corresponding to a second class;
a second clustering module, configured to perform hierarchical clustering on the data object set based on the inter-class distances.
With reference to the second aspect, in a first possible implementation of the second aspect, the screening module is configured to: according to the data objects included in the multiple clusters, take, in each cluster, the data object closest to the central point of the cluster as the representative data object of that cluster; and take the representative data objects of the multiple clusters as the representative data objects of the first class.
With reference to the second aspect, in a second possible implementation of the second aspect, the computing module is configured to: obtain the weights of the representative data objects of the first class according to the number of data objects included in each cluster of the first class; and calculate inter-class distances based on the representative data objects of the first class, the weights of the representative data objects of the first class, and the data objects corresponding to the second class.
With reference to the second possible implementation of the second aspect, in a third possible implementation of the second aspect, the computing module is configured to: for a first representative data object and a second representative data object, calculate a first distance between the data objects, and weight the first distance according to the weight of the first representative data object and the weight of the second representative data object to obtain a weighted distance between the first representative data object and the second representative data object; or, for a first representative data object and a third data object in the second class, calculate a second distance between the data objects, and weight the second distance according to the weight of the first representative data object to obtain a weighted distance between the first representative data object and the third data object.
With reference to the third possible implementation of the second aspect, in a fourth possible implementation of the second aspect, the second clustering module is configured to: merge multiple classes in the data object set based on the inter-class distances; and, based on the merged classes, continue to perform the clustering and screening of first classes until the inter-class distance calculated from the clustering and screening results is greater than a second preset threshold, and output the hierarchical clustering result.
According to a third aspect of the embodiments of the present disclosure, a device for hierarchical clustering is provided, including:
a processor;
a memory for storing instructions executable by the processor;
wherein the processor is configured to:
obtain a data object set to be clustered, the data object set including multiple classes, each class corresponding to at least one data object;
cluster the data objects corresponding to a first class to obtain a clustering result, where the number of data objects corresponding to the first class is greater than a first preset threshold, the clustering result includes multiple clusters, and each cluster includes at least one data object;
screen the data objects corresponding to the first class according to the clustering result to obtain representative data objects of the first class;
calculate inter-class distances based on the representative data objects of the first class and the data objects corresponding to a second class;
perform hierarchical clustering on the data object set based on the inter-class distances.
The technical solutions provided by the embodiments of the present disclosure may have the following beneficial effects:
In the data object set to be clustered, the disclosure clusters the data objects corresponding to the first class and, according to the clustering result, selects representative data objects of the first class. This reduces the amount of computation when data objects are used to calculate inter-class distances and saves computing time and resources. In addition, because the data objects corresponding to the first class have been screened, data objects that did not originally belong to the class may be screened out, i.e. the noise of the class is removed, so the clustering result is more reliable, which benefits subsequent data analysis.
It should be understood that the foregoing general description and the following detailed description are merely exemplary and explanatory, and do not limit the present disclosure.
Brief description of the drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present invention and, together with the specification, serve to explain the principles of the invention.
Fig. 1 is a flowchart of a hierarchical clustering method according to an exemplary embodiment.
Fig. 2 is a flowchart of a hierarchical clustering method according to an exemplary embodiment.
Fig. 3 is a block diagram of a hierarchical clustering device according to an exemplary embodiment.
Fig. 4 is a block diagram of a device for hierarchical clustering according to an exemplary embodiment.
Detailed description
Exemplary embodiments are described in detail here, and examples of them are shown in the accompanying drawings. Where the following description refers to the drawings, the same numbers in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present invention. Rather, they are merely examples of devices and methods consistent with some aspects of the invention as detailed in the appended claims.
The hierarchical clustering method can be applied in a variety of scenarios. For example, it can be used in market analysis to cluster customer groups with different purchasing power, or in biology to cluster organisms of different populations. The embodiments of the present disclosure illustrate the provided hierarchical clustering method using a face recognition scenario as an example.
Fig. 1 is a flowchart of a hierarchical clustering method according to an exemplary embodiment. As shown in Fig. 1, the hierarchical clustering method is used in a server and includes the following steps.
In step S101, a data object set to be clustered is obtained. The data object set includes multiple classes, and each class corresponds to at least one data object.
Specifically, the type of data object included in the data object set differs between application scenarios, and the type of class included in the data object set differs accordingly. For example, in a face recognition scenario, the data objects in the data object set may be face data. Correspondingly, each class in the data object set may represent one user, and the data objects corresponding to each class are the face data objects of the user that the class represents. Each face data object may be a multi-dimensional vector; the embodiments of the present disclosure place no specific limitation on this.
In step S102, the data objects corresponding to a first class are clustered to obtain a clustering result. The number of data objects corresponding to the first class is greater than a first preset threshold, the clustering result includes multiple clusters, and each cluster includes at least one data object.
In the embodiments of the present disclosure, for a data object set that includes multiple classes, some classes correspond to a large number of data objects, which may make the amount of computation excessive when clustering is performed with the data objects of those classes. The number of data objects corresponding to such classes therefore needs to be reduced. Specifically, the data objects corresponding to the first class can be clustered so that the number of data objects corresponding to the first class no longer exceeds the first preset threshold. In a specific implementation, a clustering algorithm can be used to cluster the data objects corresponding to the first class according to their similarity, i.e. according to the distances between data objects, to obtain a clustering result that includes multiple clusters. Each cluster represents one class after clustering and includes at least one data object, and the distance between data objects within each cluster is below a certain threshold. This threshold may be set in advance by technical developers; the embodiments of the present disclosure place no specific limitation on this.
In step S103, the data objects corresponding to the first class are screened according to the clustering result to obtain representative data objects of the first class.
After the data objects corresponding to the first class are clustered and a clustering result containing multiple clusters is obtained, the data objects in each cluster should also be screened: one data object is selected from each cluster as a representative data object of the first class. This reduces the number of data objects in the first class that are subsequently used to calculate inter-class distances, and, because each cluster represents a class obtained by clustering, the selected representative data objects can be used for the calculations of the subsequent clustering process.
In step S104, inter-class distances are calculated based on the representative data objects of the first class and the data objects corresponding to a second class.
In the embodiments of the present disclosure, the second class refers to all classes in the data object set to be clustered other than first classes. Because the number of data objects corresponding to a first class is greater than the first preset threshold, the number of data objects corresponding to a second class is no greater than the first preset threshold.
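The split between first and second classes described above can be sketched as follows. This is a minimal illustration, assuming each class is held as a list of data objects; the function name `split_classes` and the parameter `first_threshold` are hypothetical and do not appear in the disclosure.

```python
from typing import List, Sequence, Tuple


def split_classes(
    classes: Sequence[list], first_threshold: int
) -> Tuple[List[list], List[list]]:
    """Partition classes by size against the first preset threshold.

    Classes with more than `first_threshold` data objects are "first
    classes"; all remaining classes are "second classes".
    """
    first = [c for c in classes if len(c) > first_threshold]
    second = [c for c in classes if len(c) <= first_threshold]
    return first, second
```

Only the classes returned in `first` would then go through the clustering and screening of steps S102 and S103.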
In the data object set to be clustered, after the representative data objects have been selected from the data objects corresponding to the first class, the inter-class distances between the classes in the data object set can be calculated pair by pair, based on the representative data objects of the first class and the data objects corresponding to the second class. Specifically, to calculate the inter-class distance between two classes, the distance between each data object corresponding to one class and every data object corresponding to the other class is calculated one by one, yielding multiple distance results. These results are either averaged to obtain the average distance or screened to obtain the minimum distance, and the resulting average or minimum distance is taken as the inter-class distance between the two classes.
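The pairwise calculation just described can be sketched in a few lines. This is an illustrative sketch only: the disclosure does not fix a distance function at this step, so Euclidean distance is assumed here, and the names `inter_class_distance` and `mode` are hypothetical.

```python
import math
from itertools import product
from typing import Callable, Sequence

Point = Sequence[float]


def inter_class_distance(
    class_a: Sequence[Point],
    class_b: Sequence[Point],
    mode: str = "average",
    dist: Callable[[Point, Point], float] = math.dist,  # assumed metric
) -> float:
    """Inter-class distance as the average or minimum of all pairwise distances."""
    distances = [dist(p, q) for p, q in product(class_a, class_b)]
    if mode == "average":
        return sum(distances) / len(distances)
    if mode == "minimum":
        return min(distances)
    raise ValueError(f"unknown mode: {mode!r}")
```

The cost is proportional to the product of the two class sizes, which is why replacing a large class by a handful of representative data objects reduces the amount of computation.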
In step S105, hierarchical clustering is performed on the data object set based on the inter-class distances.
In the embodiments of the present disclosure, the aim of clustering is to gather data objects of high similarity into one class. Therefore, after all inter-class distances have been obtained by pairwise calculation over all classes in the data object set, a new clustering result can be obtained based on the magnitudes of the inter-class distances.
It should be noted that, after the new clustering result is obtained, the hierarchical clustering process must be repeated starting again from step S102, so that data objects of high similarity in the data object set are gathered into one class as far as possible. This yields a hierarchical clustering result that is as accurate as possible, on which data analysis can then be based.
Optionally, screening the data objects corresponding to the first class according to the clustering result to obtain the representative data objects of the first class includes:
according to the data objects included in the multiple clusters, taking, in each cluster, the data object closest to the central point of the cluster as the representative data object of that cluster;
taking the representative data objects of the multiple clusters as the representative data objects of the first class.
Optionally, calculating inter-class distances based on the representative data objects of the first class and the data objects corresponding to the second class includes:
obtaining the weights of the representative data objects of the first class according to the number of data objects included in each cluster of the first class;
calculating inter-class distances based on the representative data objects of the first class, the weights of the representative data objects of the first class, and the data objects corresponding to the second class.
Optionally, calculating inter-class distances based on the representative data objects of the first class, the weights of the representative data objects of the first class, and the data objects corresponding to the second class includes:
for a first representative data object and a second representative data object, calculating a first distance between the data objects, and weighting the first distance according to the weight of the first representative data object and the weight of the second representative data object to obtain a weighted distance between the first representative data object and the second representative data object; or,
for a first representative data object and a third data object in the second class, calculating a second distance between the data objects, and weighting the second distance according to the weight of the first representative data object to obtain a weighted distance between the first representative data object and the third data object.
Optionally, performing hierarchical clustering on the data object set based on the inter-class distances includes:
merging multiple classes in the data object set based on the inter-class distances;
based on the merged classes, continuing to perform the clustering and screening of first classes until the inter-class distance calculated from the clustering and screening results is greater than a second preset threshold, and outputting the hierarchical clustering result.
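The merge loop in the last option can be sketched as follows. This is a simplified illustration, not the disclosed method in full: it merges the closest pair of classes until the smallest inter-class distance exceeds the second preset threshold, but it omits the re-clustering and screening of first classes between merges. The average Euclidean distance and the names `merge_until_threshold` and `second_threshold` are assumptions.

```python
import math
from itertools import combinations, product
from typing import Callable, List, Sequence

Point = Sequence[float]


def average_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Average of all pairwise Euclidean distances between two classes."""
    return sum(math.dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))


def merge_until_threshold(
    classes: Sequence[Sequence[Point]],
    second_threshold: float,
    class_distance: Callable = average_distance,
) -> List[List[Point]]:
    """Repeatedly merge the two closest classes; stop once the smallest
    inter-class distance is greater than `second_threshold`."""
    merged = [list(c) for c in classes]
    while len(merged) > 1:
        # Find the pair (i, j), i < j, with the smallest inter-class distance.
        i, j = min(
            combinations(range(len(merged)), 2),
            key=lambda ij: class_distance(merged[ij[0]], merged[ij[1]]),
        )
        if class_distance(merged[i], merged[j]) > second_threshold:
            break
        merged[i].extend(merged.pop(j))  # j > i, so index i stays valid
    return merged
```

In a fuller version, each pass of the loop would first replace any class that has grown beyond the first preset threshold by its weighted representative data objects before distances are recomputed.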
In the method provided by the embodiments of the present disclosure, the data objects corresponding to the first class in the data object set to be clustered are clustered, and representative data objects of the first class are selected according to the clustering result. This reduces the amount of computation when data objects are used to calculate inter-class distances and saves computing time and resources. In addition, because the data objects corresponding to the first class have been screened, data objects that did not originally belong to the class may be screened out, i.e. the noise of the class is removed, so the clustering result is more reliable, which benefits subsequent data analysis.
Fig. 2 is a flowchart of a hierarchical clustering method according to an exemplary embodiment. As shown in Fig. 2, the hierarchical clustering method is used in a server and includes the following steps:
In step S201, a data object set to be clustered is obtained. The data object set includes multiple classes, and each class corresponds to at least one data object.
In the embodiments of the present disclosure, the data object set to be clustered may be collected in advance and stored in the server; the embodiments of the present disclosure place no specific limitation on this. In a face recognition scenario, the data object set to be clustered includes multiple face data objects. These may be obtained by technicians using intelligent acquisition equipment to capture the facial information of multiple different users, yielding multiple face data objects corresponding to different expressions, which are then stored in the server.
It should be noted that, before clustering starts, the data object set to be clustered consists of multiple independent data objects. At this point, each independent data object needs to be treated as a class of its own so that the subsequent hierarchical clustering can be carried out.
In step S202, the data objects corresponding to a first class are clustered to obtain a clustering result. The number of data objects corresponding to the first class is greater than a first preset threshold, the clustering result includes multiple clusters, and each cluster includes at least one data object.
In the embodiments of the present disclosure, clustering the data objects corresponding to the first class with the k-means algorithm is used merely as an example. Specifically, before executing the k-means algorithm, the server also needs to obtain the first preset threshold and, according to the number of data objects corresponding to each class, determine whether the data object set to be clustered includes a first class, so that the data objects corresponding to the first class can be clustered. The first preset threshold may be set in advance by technical developers and obtained automatically by the server while the algorithm runs, or it may be determined from input by a user or technician while the algorithm runs; the embodiments of the present disclosure place no specific limitation on this.
Assuming the first preset threshold is k, the number of data objects corresponding to the first class is greater than k. Accordingly, the process of clustering the data objects corresponding to the first class with the k-means algorithm includes: arbitrarily selecting k data objects from all data objects of the first class; comparing each remaining data object in the first class with the k data objects one by one, calculating the distance between it and each of the k data objects; assigning each remaining data object to the same cluster as the one of the k data objects to which its distance is smallest, thereby dividing the data objects corresponding to the first class into k clusters; for each cluster, averaging all data objects included in the cluster and taking the result as the central value of that cluster; taking the central values of the k clusters as the new objects of comparison, calculating the distance between every data object corresponding to the first class and the k central values, and repeating the steps of dividing the first class into k clusters and calculating the central values of the clusters until the number of iterations exceeds a preset number of iterations. The k clusters obtained at that point are the clustering result for the data objects corresponding to the first class.
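The k-means procedure above can be sketched as follows. This is a minimal, unoptimized illustration under stated assumptions: data objects are tuples of floats, Euclidean distance is used, and the iteration limit `max_iterations` and the `seed` parameter are hypothetical choices made only so the example is reproducible.

```python
import math
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def kmeans(
    points: Sequence[Point], k: int, max_iterations: int = 20, seed: int = 0
) -> List[Tuple[Point, List[Point]]]:
    """Cluster `points` into at most k clusters; return (center, members) pairs."""
    rng = random.Random(seed)
    centers = rng.sample(list(points), k)  # arbitrarily pick k data objects
    clusters: List[List[Point]] = []
    for _ in range(max_iterations):
        # Assign every data object to the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Recompute each center as the mean of the cluster's members.
        for i, members in enumerate(clusters):
            if members:
                centers[i] = tuple(sum(c) / len(members) for c in zip(*members))
    # Drop empty clusters so every returned cluster has at least one object.
    return [(centers[i], clusters[i]) for i in range(k) if clusters[i]]
```

A fixed iteration count mirrors the stopping rule in the text; a production implementation would more commonly also stop early once the assignments no longer change.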
In step S203, according to the data objects included in the multiple clusters, the data object in each cluster that is closest to the central point of the cluster is taken as the representative data object of that cluster, and the representative data objects of the multiple clusters are taken as the representative data objects of the first class.
After the k clusters of the first class are obtained in step S202, the data object in each cluster that is closest to the central point of the cluster is taken as the representative data object of that cluster, and the representative data objects of the multiple clusters are taken as the representative data objects of the first class. This reduces the number of data objects in the first class that are subsequently used to calculate inter-class distances, and therefore the amount of later computation. The representative data objects selected from the clusters are also highly representative and can stand for all data objects corresponding to the first class. At the same time, data objects far from the central point of their cluster, which may not belong to the first class, are screened out, which reduces noise.
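Selecting the representative data objects can be sketched as follows. This is an illustrative sketch, assuming each cluster is a non-empty list of tuples of floats, that the central point is the mean of the cluster's members, and that Euclidean distance is used; the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def centroid(members: Sequence[Point]) -> Point:
    """Mean of all data objects in a cluster, taken coordinate by coordinate."""
    return tuple(sum(c) / len(members) for c in zip(*members))


def select_representatives(clusters: Sequence[Sequence[Point]]) -> List[Point]:
    """For each cluster, keep the member closest to the cluster's central point."""
    representatives = []
    for members in clusters:
        center = centroid(members)
        representatives.append(min(members, key=lambda p: math.dist(p, center)))
    return representatives
```

Because the representative is an actual member of the cluster and not the computed mean, it remains a real data object (for example, a real face vector) in later distance calculations.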
For example, in a face recognition scenario, the data objects corresponding to each class are the facial information of one user across all of that user's different expressions, and each expression may correspond to multiple pieces of facial information: several for smiling, several for a serious expression, and so on. After the data objects corresponding to the first class are clustered and screened, the resulting representative data objects are the representative pieces of facial information for each of the user's expressions. One such piece of facial information can, for instance, stand for all of the user's facial information when smiling.
In step S204, inter-class distances are calculated based on the representative data objects of the first class and the data objects corresponding to a second class.
Optionally, calculating inter-class distances based on the representative data objects of the first class and the data objects corresponding to the second class includes: obtaining the weights of the representative data objects of the first class according to the number of data objects included in each cluster of the first class; and calculating inter-class distances based on the representative data objects of the first class, the weights of the representative data objects of the first class, and the data objects corresponding to the second class. Specifically, the weights may be obtained as follows: according to the total number of data objects in the data object set to be clustered, the ratio of the number of data objects in each cluster to that total is calculated, and this ratio is taken as the weight of the cluster's representative data object.
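The weight calculation can be written out directly. In this small sketch the function name `representative_weights` and its parameters are hypothetical; the ratio itself follows the text above.

```python
from typing import List, Sequence


def representative_weights(
    cluster_sizes: Sequence[int], total_objects: int
) -> List[float]:
    """Weight of each cluster's representative: cluster size / total object count."""
    if total_objects <= 0:
        raise ValueError("total_objects must be positive")
    return [size / total_objects for size in cluster_sizes]
```

For instance, with 8 data objects in the whole set, clusters of 3 and 1 objects give their representatives weights of 0.375 and 0.125.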
Further optionally, calculating the inter-class distance based on the representative data objects of the first class, the weights of those representative data objects, and the data objects corresponding to the second class differs according to the classes to which the data objects involved in the calculation belong, and includes the following cases (1) and (2):
(1) For a first representative data object and a second representative data object, calculate a first distance between the data objects, and weight the first distance according to the weight of the first representative data object and the weight of the second representative data object, to obtain the weighted distance between the first representative data object and the second representative data object.
Case (1) applies when both classes participating in the calculation are first classes; the first representative data object and the second representative data object are both representative data objects of first classes.
Specifically, calculating the first distance between data objects means, for two first classes, calculating the first distances between the data objects of one first class and the data objects of the other first class. In the face recognition scenario, a face data object may be a multi-dimensional vector, so the distance between face data objects may be calculated as the cosine similarity between them, and the cosine similarity is used as the calculated first distance.
After the first distance is obtained, it may also be weighted: for example, the first distance is multiplied by the weight of the first representative data object and then by the weight of the second representative data object, giving the weighted distance between the first representative data object and the second representative data object.
(2) For a first representative data object and a third data object in the second class, calculate a second distance between the data objects, and weight the second distance according to the weight of the first representative data object, to obtain the weighted distance between the first representative data object and the third data object.
Case (2) applies when the two classes participating in the calculation are one first class and one second class; the first representative data object is a representative data object of the first class, and the third data object is a data object corresponding to the second class.
Specifically, the second distance between the data objects is calculated in the same way as the first distance in case (1), and the details are not repeated here. After the second distance is obtained, it may also be weighted: because the third data object is an individual data object and has no weight, the second distance is multiplied only by the weight of the first representative data object, giving the weighted distance between the first representative data object and the third data object.
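Cases (1) and (2) can be sketched with a single function. Two points are assumptions rather than part of the text: the sketch uses cosine distance (1 minus cosine similarity) so that smaller values mean closer objects, whereas the text takes the cosine similarity itself as the distance; and an unweighted data object is modeled by a default weight of 1.0. The names `cosine_distance` and `weighted_distance` are hypothetical.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two non-zero vectors (an assumed distance form)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def weighted_distance(a, b, weight_a=1.0, weight_b=1.0):
    """Distance between two data objects scaled by their weights.

    Case (1): both objects are representatives, so both weights are passed.
    Case (2): the second object is a plain data object of the second class
    and carries no weight, so weight_b keeps its default of 1.0.
    """
    return cosine_distance(a, b) * weight_a * weight_b


# Case (1): two representatives with weights 0.5 and 0.4.
case_1 = weighted_distance([1.0, 0.0], [0.0, 1.0], weight_a=0.5, weight_b=0.4)

# Case (2): one representative (weight 0.5) and one unweighted data object.
case_2 = weighted_distance([1.0, 0.0], [0.0, 1.0], weight_a=0.5)
```

How the pairwise weighted distances are combined into one inter-class distance (for example minimum, mean or sum) is not stated in this passage, so the sketch stops at the pairwise value.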
Of course, in practical applications, it is also possible to calculate only the first distance between classes, without considering weights, and to use the first distance as the inter-class distance. It should be noted that the above method of calculating the weights of the representative data objects, and the method of calculating the inter-class distance based on those weights, are given by way of example only; in practical applications other calculation methods may also be used, and the embodiments of the present disclosure do not specifically limit this.
In step S205, based between class distance, multiple classes in set of data objects are merged.
Further, between class distance should be based on, the process that multiple classes in set of data objects are merged was included:Base In the size of between class distance, the class that between class distance is less than or equal to the second predetermined threshold value is merged.
It should be noted that the class after merging is by including all data objects corresponding to the class that is merged.Know in face In other scene, class merging is carried out, i.e., is merged the class for belonging to same user in set of data objects, wrapped in the class after merging The data object included is the human face data object of the user, until all face data objects of same user all divide at one In class.Certainly, clustered in actual applications or according to whether human face data object belongs to same expression, accordingly Ground, when carrying out class merging, the class that same expression is belonged in set of data objects is merged, the number that the class after merging includes It is the human face data object corresponding to same expression according to object, the concrete application scene of the embodiment of the present disclosure clustering algorithm is not made It is specific to limit.
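One way to picture the merging in step S205 is the sketch below. It assumes that merging is transitive (if A is close to B and B is close to C, all three end up together), which the text does not state, and it tracks the groups with a small union-find structure. The function `merge_classes` and its `distance` callback are hypothetical; the toy distance used in the example (difference of class means on 1-D values) merely stands in for the inter-class distance.

```python
def merge_classes(classes, distance, threshold):
    """Merge every pair of classes whose distance is <= threshold.

    classes:  list of lists of data objects
    distance: callable taking two classes and returning a number
    """
    parent = list(range(len(classes)))

    def find(i):
        # Follow parent links to the group root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(classes)):
        for j in range(i + 1, len(classes)):
            if distance(classes[i], classes[j]) <= threshold:
                parent[find(j)] = find(i)

    merged = {}
    for i, members in enumerate(classes):
        # A merged class holds all data objects of the classes it absorbed.
        merged.setdefault(find(i), []).extend(members)
    return list(merged.values())


def mean_gap(a, b):
    """Toy inter-class distance: absolute difference of the class means."""
    return abs(sum(a) / len(a) - sum(b) / len(b))


merged = merge_classes([[1.0, 1.2], [1.1], [9.0]], mean_gap, threshold=0.5)
```

Here the first two classes are merged because their toy distance is below the threshold, while the third class stays on its own.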
In step S206, based on the merged classes, the clustering and screening of the first class continue to be performed until the inter-class distance calculated from the result of the clustering and screening is greater than the second preset threshold, and the hierarchical clustering result is output.
In the embodiments of the present disclosure, after class merging, some of the resulting classes may still be highly similar. For example, in the face recognition scenario, after class merging the resulting classes may still include classes that belong to the same user but were not merged. Therefore, clustering and screening of the first class must continue on the set of data objects obtained after class merging, that is, steps S202 to S205 are performed in a loop until the inter-class distance calculated from the result of the clustering and screening is greater than the second preset threshold. At that point hierarchical clustering is complete, and the hierarchical clustering result is output.
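The loop over steps S202 to S205 can be sketched as follows. This is a simplified illustration, not the disclosed procedure itself: each round merges only the single closest pair of classes, and the clustering and screening of a class are abstracted into a caller-supplied `summarize` function. The names `hierarchical_cluster`, `summarize`, `class_distance` and `merge_threshold` are all hypothetical.

```python
def hierarchical_cluster(classes, summarize, class_distance, merge_threshold):
    """Repeat summarize -> distance -> merge until all distances exceed the threshold.

    summarize:      stands in for clustering and screening a class (S202 to S203)
    class_distance: inter-class distance computed on the summaries (S204)
    """
    classes = [list(c) for c in classes]  # do not mutate the caller's lists
    while len(classes) > 1:
        summaries = [summarize(c) for c in classes]
        best = None
        for i in range(len(classes)):
            for j in range(i + 1, len(classes)):
                d = class_distance(summaries[i], summaries[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > merge_threshold:
            break  # every remaining inter-class distance exceeds the threshold
        classes[i].extend(classes[j])  # S205: the merged class keeps all objects
        del classes[j]
    return classes


# Toy run on 1-D values: a class is summarized by its mean.
result = hierarchical_cluster(
    [[1.0], [1.4], [2.0], [10.0]],
    summarize=lambda c: sum(c) / len(c),
    class_distance=lambda a, b: abs(a - b),
    merge_threshold=0.7,
)
```

In the toy run the classes `[1.0]` and `[1.4]` are merged in the first round; in the second round the smallest remaining distance is about 0.8, which exceeds the threshold, so the loop stops and the three remaining classes are returned.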
In the method provided by the embodiments of the present disclosure, within the set of data objects to be clustered, the data objects corresponding to the first class are clustered, and the representative data objects of the first class are screened out according to the clustering result. This reduces the amount of computation when data objects are used to calculate the inter-class distance, and saves computing time and resources. Further, when the representative data objects of the first class are screened out according to the clustering result, the center point of each cluster is calculated and the data object closest to the center point is taken from each cluster as a representative data object. As a result, data objects that were assigned to the first class but do not actually belong to it are screened out, that is, the noise of that class is eliminated, which makes the clustering result more reliable and benefits subsequent data analysis.
Fig. 3 is a block diagram of a device for hierarchical clustering according to an exemplary embodiment. Referring to Fig. 3, the device includes an acquisition module 301, a first clustering module 302, a screening module 303, a calculation module 304, and a second clustering module 305.
The acquisition module 301 is configured to obtain a set of data objects to be clustered, the set of data objects including multiple classes, each class corresponding to at least one data object;
The first clustering module 302 is configured to cluster the data objects corresponding to the first class to obtain a clustering result, where the number of data objects corresponding to the first class is greater than a first preset threshold, the clustering result includes multiple clusters, and each cluster includes at least one data object;
The screening module 303 is configured to screen the data objects corresponding to the first class according to the clustering result, to obtain the representative data objects of the first class;
The calculation module 304 is configured to calculate the inter-class distance based on the representative data objects of the first class and the data objects corresponding to the second class;
The second clustering module 305 is configured to perform hierarchical clustering on the set of data objects based on the inter-class distance.
Specifically, the screening module 303 is configured to, according to the data objects contained in the multiple clusters, take in each cluster the data object closest to the center point of the cluster as the representative data object of that cluster, and to take the representative data objects of the multiple clusters as the representative data objects of the first class.
Specifically, the calculation module 304 is configured to obtain the weights of the representative data objects of the first class according to the number of data objects contained in each cluster of the first class, and to calculate the inter-class distance based on the representative data objects of the first class, the weights of those representative data objects, and the data objects corresponding to the second class.
Specifically, the calculation module 304 is configured to, for a first representative data object and a second representative data object, calculate a first distance between the data objects, and weight the first distance according to the weight of the first representative data object and the weight of the second representative data object, to obtain the weighted distance between the first representative data object and the second representative data object; or, for a first representative data object and a third data object in the second class, calculate a second distance between the data objects, and weight the second distance according to the weight of the first representative data object, to obtain the weighted distance between the first representative data object and the third data object.
Specifically, the second clustering module 305 is configured to merge multiple classes in the set of data objects based on the inter-class distance; and, based on the merged classes, to continue performing the clustering and screening of the first class until the inter-class distance calculated from the result of the clustering and screening is greater than the second preset threshold, and then output the hierarchical clustering result.
In the device provided by the embodiments of the present disclosure, within the set of data objects to be clustered, the data objects corresponding to the first class are clustered, and the representative data objects of the first class are screened out according to the clustering result. This reduces the amount of computation when data objects are used to calculate the inter-class distance, and saves computing time and resources. Moreover, because screening may remove data objects that were assigned to the first class but do not actually belong to it, the noise of that class is eliminated, which makes the clustering result more reliable and benefits subsequent data analysis.
With regard to the device in the above embodiment, the specific manner in which each module performs its operations has been described in detail in the embodiments of the method, and is not elaborated here.
Fig. 4 is a block diagram of a device 400 for hierarchical clustering according to an exemplary embodiment. For example, the device 400 may be provided as a server. Referring to Fig. 4, the device 400 includes a processing component 1922, which further includes one or more processors, and memory resources represented by a memory 1932 for storing instructions executable by the processing component 1922, such as application programs. The application programs stored in the memory 1932 may include one or more modules, each corresponding to a set of instructions. In addition, the processing component 1922 is configured to execute the instructions so as to perform the above method:
obtaining a set of data objects to be clustered, the set of data objects including multiple classes, each class corresponding to at least one data object;
clustering the data objects corresponding to the first class to obtain a clustering result, where the number of data objects corresponding to the first class is greater than a first preset threshold, the clustering result includes multiple clusters, and each cluster includes at least one data object;
screening the data objects corresponding to the first class according to the clustering result, to obtain the representative data objects of the first class;
calculating the inter-class distance based on the representative data objects of the first class and the data objects corresponding to the second class;
performing hierarchical clustering on the set of data objects based on the inter-class distance.
Assuming that the above is a first possible implementation, in a second possible implementation provided on the basis of the first possible implementation, the memory of the device further contains instructions for performing the following operations:
according to the data objects contained in the multiple clusters, taking, in each cluster, the data object closest to the center point of the cluster as the representative data object of that cluster;
taking the representative data objects of the multiple clusters as the representative data objects of the first class.
In a third possible implementation provided on the basis of the first possible implementation, the memory of the terminal further contains instructions for performing the following operations:
obtaining the weights of the representative data objects of the first class according to the number of data objects contained in each cluster of the first class;
calculating the inter-class distance based on the representative data objects of the first class, the weights of those representative data objects, and the data objects corresponding to the second class.
In a fourth possible implementation provided on the basis of the third possible implementation, the memory of the terminal further contains instructions for performing the following operations:
for a first representative data object and a second representative data object, calculating a first distance between the data objects, and weighting the first distance according to the weight of the first representative data object and the weight of the second representative data object, to obtain the weighted distance between the first representative data object and the second representative data object; or,
for a first representative data object and a third data object in the second class, calculating a second distance between the data objects, and weighting the second distance according to the weight of the first representative data object, to obtain the weighted distance between the first representative data object and the third data object.
In a fifth possible implementation provided on the basis of the first possible implementation, the memory of the terminal further contains instructions for performing the following operations:
merging multiple classes in the set of data objects based on the inter-class distance;
based on the merged classes, continuing to perform the clustering and screening of the first class until the inter-class distance calculated from the result of the clustering and screening is greater than the second preset threshold, and outputting the hierarchical clustering result.
The device 400 may also include a power component 1926 configured to perform power management of the device 400, a wired or wireless network interface 1950 configured to connect the device 400 to a network, and an input/output (I/O) interface 1958. The device 400 may operate based on an operating system stored in the memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
In the device provided by the embodiments of the present disclosure, within the set of data objects to be clustered, the data objects corresponding to the first class are clustered, and the representative data objects of the first class are screened out according to the clustering result. This reduces the amount of computation when data objects are used to calculate the inter-class distance, and saves computing time and resources. Moreover, because screening may remove data objects that were assigned to the first class but do not actually belong to it, the noise of that class is eliminated, which makes the clustering result more reliable and benefits subsequent data analysis.
Other embodiments of the invention will readily occur to those skilled in the art after considering the specification and practicing the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention that follow its general principles and include common knowledge or customary technical means in the art that are not disclosed in the present disclosure. The specification and embodiments are to be regarded as exemplary only, and the true scope and spirit of the invention are indicated by the following claims.
It should be understood that the invention is not limited to the precise structure described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the invention is limited only by the appended claims.

Claims (11)

1. A method of hierarchical clustering, characterized in that the method comprises:
obtaining a set of data objects to be clustered, the set of data objects including multiple classes, each class corresponding to at least one data object;
clustering the data objects corresponding to a first class to obtain a clustering result, wherein the number of data objects corresponding to the first class is greater than a first preset threshold, the clustering result includes multiple clusters, and each cluster includes at least one data object;
screening the data objects corresponding to the first class according to the clustering result, to obtain representative data objects of the first class;
calculating an inter-class distance based on the representative data objects of the first class and the data objects corresponding to a second class;
performing hierarchical clustering on the set of data objects based on the inter-class distance.
2. The method according to claim 1, characterized in that screening the data objects corresponding to the first class according to the clustering result, to obtain the representative data objects of the first class, comprises:
according to the data objects contained in the multiple clusters, taking, in each cluster, the data object closest to the center point of the cluster as the representative data object of that cluster;
taking the representative data objects of the multiple clusters as the representative data objects of the first class.
3. The method according to claim 1, characterized in that calculating the inter-class distance based on the representative data objects of the first class and the data objects corresponding to the second class comprises:
obtaining the weights of the representative data objects of the first class according to the number of data objects contained in each cluster of the first class;
calculating the inter-class distance based on the representative data objects of the first class, the weights of the representative data objects of the first class, and the data objects corresponding to the second class.
4. The method according to claim 3, characterized in that calculating the inter-class distance based on the representative data objects of the first class, the weights of the representative data objects of the first class, and the data objects corresponding to the second class comprises:
for a first representative data object and a second representative data object, calculating a first distance between the data objects, and weighting the first distance according to the weight of the first representative data object and the weight of the second representative data object, to obtain the weighted distance between the first representative data object and the second representative data object; or,
for a first representative data object and a third data object in the second class, calculating a second distance between the data objects, and weighting the second distance according to the weight of the first representative data object, to obtain the weighted distance between the first representative data object and the third data object.
5. The method according to claim 1, characterized in that performing hierarchical clustering on the set of data objects based on the inter-class distance comprises:
merging multiple classes in the set of data objects based on the inter-class distance;
based on the merged classes, continuing to perform the clustering and screening of the first class until the inter-class distance calculated from the result of the clustering and screening is greater than a second preset threshold, and outputting a hierarchical clustering result.
6. A device for hierarchical clustering, characterized in that the device comprises:
an acquisition module, configured to obtain a set of data objects to be clustered, the set of data objects including multiple classes, each class corresponding to at least one data object;
a first clustering module, configured to cluster the data objects corresponding to a first class to obtain a clustering result, wherein the number of data objects corresponding to the first class is greater than a first preset threshold, the clustering result includes multiple clusters, and each cluster includes at least one data object;
a screening module, configured to screen the data objects corresponding to the first class according to the clustering result, to obtain representative data objects of the first class;
a calculation module, configured to calculate an inter-class distance based on the representative data objects of the first class and the data objects corresponding to a second class;
a second clustering module, configured to perform hierarchical clustering on the set of data objects based on the inter-class distance.
7. The device according to claim 6, characterized in that the screening module is configured to, according to the data objects contained in the multiple clusters, take in each cluster the data object closest to the center point of the cluster as the representative data object of that cluster; and to take the representative data objects of the multiple clusters as the representative data objects of the first class.
8. The device according to claim 6, characterized in that the calculation module is configured to obtain the weights of the representative data objects of the first class according to the number of data objects contained in each cluster of the first class; and to calculate the inter-class distance based on the representative data objects of the first class, the weights of the representative data objects of the first class, and the data objects corresponding to the second class.
9. The device according to claim 8, characterized in that the calculation module is configured to, for a first representative data object and a second representative data object, calculate a first distance between the data objects, and weight the first distance according to the weight of the first representative data object and the weight of the second representative data object, to obtain the weighted distance between the first representative data object and the second representative data object; or, for a first representative data object and a third data object in the second class, calculate a second distance between the data objects, and weight the second distance according to the weight of the first representative data object, to obtain the weighted distance between the first representative data object and the third data object.
10. The device according to claim 6, characterized in that the second clustering module is configured to merge multiple classes in the set of data objects based on the inter-class distance; and, based on the merged classes, to continue performing the clustering and screening of the first class until the inter-class distance calculated from the result of the clustering and screening is greater than a second preset threshold, and to output a hierarchical clustering result.
11. A device for hierarchical clustering, characterized by comprising:
a processor;
a memory for storing instructions executable by the processor;
wherein the processor is configured to:
obtain a set of data objects to be clustered, the set of data objects including multiple classes, each class corresponding to at least one data object;
cluster the data objects corresponding to a first class to obtain a clustering result, wherein the number of data objects corresponding to the first class is greater than a first preset threshold, the clustering result includes multiple clusters, and each cluster includes at least one data object;
screen the data objects corresponding to the first class according to the clustering result, to obtain representative data objects of the first class;
calculate an inter-class distance based on the representative data objects of the first class and the data objects corresponding to a second class;
perform hierarchical clustering on the set of data objects based on the inter-class distance.
CN201410602569.5A 2014-10-31 2014-10-31 The method and device of hierarchical clustering Active CN104391879B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410602569.5A CN104391879B (en) 2014-10-31 2014-10-31 The method and device of hierarchical clustering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410602569.5A CN104391879B (en) 2014-10-31 2014-10-31 The method and device of hierarchical clustering

Publications (2)

Publication Number Publication Date
CN104391879A CN104391879A (en) 2015-03-04
CN104391879B true CN104391879B (en) 2017-10-10

Family

ID=52609783

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410602569.5A Active CN104391879B (en) 2014-10-31 2014-10-31 The method and device of hierarchical clustering

Country Status (1)

Country Link
CN (1) CN104391879B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106776600A (en) * 2015-11-19 2017-05-31 北京国双科技有限公司 The method and device of text cluster
CN105654039B (en) * 2015-12-24 2019-09-17 小米科技有限责任公司 The method and apparatus of image procossing
CN108205590B (en) * 2017-12-29 2022-01-28 北京奇元科技有限公司 Method and device for establishing network level topological graph of interest points
CN108062576B (en) * 2018-01-05 2019-05-03 百度在线网络技术(北京)有限公司 Method and apparatus for output data
CN109086697A (en) * 2018-07-20 2018-12-25 腾讯科技(深圳)有限公司 A kind of human face data processing method, device and storage medium
CN109145844A (en) * 2018-08-29 2019-01-04 北京旷视科技有限公司 Archive management method, device and electronic equipment for city safety monitoring

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101079072A (en) * 2007-06-22 2007-11-28 中国科学院研究生院 Text clustering element study method and device
US8019711B1 (en) * 2003-11-10 2011-09-13 James Ralph Heidenreich System and method to provide a customized problem solving environment for the development of user thinking about an arbitrary problem
CN102360377A (en) * 2011-10-12 2012-02-22 中国测绘科学研究院 Spatial clustering mining PSE (Problem Solving Environments) system and construction method thereof
CN103473255A (en) * 2013-06-06 2013-12-25 中国科学院深圳先进技术研究院 Data clustering method and system, and data processing equipment

Also Published As

Publication number Publication date
CN104391879A (en) 2015-03-04

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant