CN104765776B - Clustering method and device for data samples - Google Patents

Clustering method and device for data samples

Info

Publication number
CN104765776B
Authority
CN
China
Prior art keywords
centroid
data sample
target data
cluster
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201510119224.9A
Other languages
Chinese (zh)
Other versions
CN104765776A (en)
Inventor
徐斌
袁宏辉
陈伟祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Gaohang Intellectual Property Operation Co ltd
Nanjing Dekun Information Technology Co ltd
Nanjing Zhishuyun Information Technology Co ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CN201510119224.9A
Publication of CN104765776A
Application granted
Publication of CN104765776B

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a clustering method and device for data samples, belonging to the field of computer technology. The method includes: obtaining a target data sample and the centroid corresponding to each cluster category; determining, according to the target data sample and the centroids of the cluster categories other than the first cluster category to which the target data sample belongs, a distance lower bound between the centroid of each other cluster category and the target data sample; selecting, from the centroids of the other cluster categories, the centroids whose distance lower bound is less than the sample-centroid distance between the target data sample and the centroid of the first cluster category; and determining, among the selected centroids and the centroid of the first cluster category, the centroid with the smallest distance to the target data sample, and assigning the target data sample to the cluster category of that centroid. The invention can save processing resources of the server.

Description

Clustering method and device for data samples
Technical field
The present invention relates to the field of computer technology, and in particular to a clustering method and device for data samples.
Background
With the development of computer technology, computers are used ever more widely and their functions are ever more comprehensive. People can use a computer (such as a server) to carry out various kinds of data processing, such as data clustering and data statistics. Each item of data to be processed can be called a data sample.
When clustering the data samples in a data sample set, a server can, according to a preset number of cluster categories, randomly select that number of data samples from the data samples to be clustered as the centroids of the cluster categories. For each data sample in the data sample set, the server calculates the distance between the data sample and each centroid. The distance represents how close the data sample is to the centroid, and there are many ways to calculate it, such as the Euclidean distance algorithm. The server determines the centroid with the smallest distance to the data sample, assigns the data sample to the category of that centroid, and then calculates the average of all data samples in that category as the new centroid of the category. The server repeats this processing: it calculates the distance between each data sample and the updated centroids, clusters the data samples again, and recalculates the average of all data samples in each category as the updated centroid, until the data samples in each category no longer change.
In the course of implementing the present invention, the inventors found that the prior art has at least the following problem:
When clustering a data sample, the server needs to calculate the distance between that data sample and every centroid. The amount of computation is large and occupies a substantial share of the server's processing resources.
Summary of the invention
To solve the problem of the prior art, embodiments of the present invention provide a clustering method and device for data samples. The technical solution is as follows:
In a first aspect, a clustering method for data samples is provided, the method including:
obtaining a target data sample and the centroid corresponding to each cluster category;
determining, according to the target data sample and the centroids of the cluster categories other than the first cluster category to which the target data sample belongs, a distance lower bound between the centroid of each other cluster category and the target data sample;
selecting, from the centroids of the other cluster categories, the centroids whose distance lower bound is less than the sample-centroid distance between the target data sample and the centroid of the first cluster category;
determining, among the selected centroids and the centroid of the first cluster category, the centroid with the smallest distance to the target data sample, and assigning the target data sample to the cluster category of that centroid.
With reference to the first aspect, in a first possible implementation of the first aspect, the determining, according to the target data sample and the centroids of the cluster categories other than the first cluster category to which the target data sample belongs, of a distance lower bound between the centroid of each other cluster category and the target data sample includes:
determining the sample-centroid distance between the target data sample and the centroid of the first cluster category to which the target data sample belongs, and the inter-centroid distance between the centroid of the first cluster category and the centroid of each other cluster category;
determining the difference between the sample-centroid distance and each inter-centroid distance, and using the difference corresponding to the centroid of each other cluster category as the distance lower bound between that centroid and the target data sample.
With reference to the first aspect, in a second possible implementation of the first aspect, the determining, according to the target data sample and the centroids of the cluster categories other than the first cluster category to which the target data sample belongs, of a distance lower bound between the centroid of each other cluster category and the target data sample includes:
determining, according to the target data sample and the centroids of the cluster categories other than the first cluster category to which the target data sample belongs, the projection onto a unit vector of the difference vector between the target data sample and the centroid of each other cluster category, and using the length of the projection corresponding to the centroid of each other cluster category as the distance lower bound between that centroid and the target data sample.
With reference to the first aspect, in a third possible implementation of the first aspect, the determining, according to the target data sample and the centroids of the cluster categories other than the first cluster category to which the target data sample belongs, of a distance lower bound between the centroid of each other cluster category and the target data sample includes:
determining, according to the target data sample and the centroids of the cluster categories other than the first cluster category to which the target data sample belongs, the mean and variance of the target data sample and of the centroid of each other cluster category;
determining, according to those means and variances and the mean and standard deviation inequality, the distance lower bound between the centroid of each other cluster category and the target data sample.
With reference to the first aspect, in a fourth possible implementation of the first aspect, the determining, among the selected centroids and the centroid of the first cluster category, of the centroid with the smallest distance to the target data sample, and the assigning of the target data sample to the cluster category of that centroid, includes:
determining a first number of processing units according to the number of selected centroids;
obtaining the first number of processing units from a preset processing unit pool;
determining, by means of the obtained processing units, among the selected centroids and the centroid of the first cluster category, the centroid with the smallest distance to the target data sample, and assigning the target data sample to the cluster category of the determined centroid.
In a second aspect, a clustering device for data samples is provided, the device including:
an obtaining module, configured to obtain a target data sample and the centroid corresponding to each cluster category;
a determining module, configured to determine, according to the target data sample and the centroids of the cluster categories other than the first cluster category to which the target data sample belongs, a distance lower bound between the centroid of each other cluster category and the target data sample;
a selecting module, configured to select, from the centroids of the other cluster categories, the centroids whose distance lower bound is less than the sample-centroid distance between the target data sample and the centroid of the first cluster category;
a clustering module, configured to determine, among the selected centroids and the centroid of the first cluster category, the centroid with the smallest distance to the target data sample, and to assign the target data sample to the cluster category of that centroid.
With reference to the second aspect, in a first possible implementation of the second aspect, the determining module is configured to:
determine the sample-centroid distance between the target data sample and the centroid of the first cluster category to which the target data sample belongs, and the inter-centroid distance between the centroid of the first cluster category and the centroid of each other cluster category;
determine the difference between the sample-centroid distance and each inter-centroid distance, and use the difference corresponding to the centroid of each other cluster category as the distance lower bound between that centroid and the target data sample.
With reference to the second aspect, in a second possible implementation of the second aspect, the determining module is configured to:
determine, according to the target data sample and the centroids of the cluster categories other than the first cluster category to which the target data sample belongs, the projection onto a unit vector of the difference vector between the target data sample and the centroid of each other cluster category, and use the length of the projection corresponding to the centroid of each other cluster category as the distance lower bound between that centroid and the target data sample.
With reference to the second aspect, in a third possible implementation of the second aspect, the determining module is configured to:
determine, according to the target data sample and the centroids of the cluster categories other than the first cluster category to which the target data sample belongs, the mean and variance of the target data sample and of the centroid of each other cluster category;
determine, according to those means and variances and the mean and standard deviation inequality, the distance lower bound between the centroid of each other cluster category and the target data sample.
With reference to the second aspect, in a fourth possible implementation of the second aspect, the clustering module is configured to:
determine a first number of processing units according to the number of selected centroids;
obtain the first number of processing units from a preset processing unit pool;
determine, by means of the obtained processing units, among the selected centroids and the centroid of the first cluster category, the centroid with the smallest distance to the target data sample, and assign the target data sample to the cluster category of the determined centroid.
The technical solution provided by the embodiments of the present invention has the following beneficial effects:
In the embodiments of the present invention, the target data sample and the centroid corresponding to each cluster category are obtained. According to the target data sample and the centroids of the cluster categories other than the first cluster category to which the target data sample belongs, the distance lower bound between the centroid of each other cluster category and the target data sample is determined. From the centroids of the other cluster categories, the centroids whose distance lower bound is less than the sample-centroid distance between the target data sample and the centroid of the first cluster category are selected. Among the selected centroids and the centroid of the first cluster category, the centroid with the smallest distance to the target data sample is determined, and the target data sample is assigned to the cluster category of that centroid. In this way, only the distance between the centroid of the first cluster category and the target data sample and the distances between the selected centroids and the target data sample need to be calculated, rather than the distances between the target data sample and all centroids. Because determining a distance lower bound requires far less computation than calculating the distance between a centroid and the target data sample, the processing resources of the server are saved.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a clustering method for data samples provided by an embodiment of the present invention;
Fig. 2 is a structural diagram of a clustering device for data samples provided by an embodiment of the present invention;
Fig. 3 is a structural diagram of a server provided by an embodiment of the present invention.
Detailed description
To make the objectives, technical solutions and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the drawings.
Embodiment one
An embodiment of the present invention provides a clustering method for data samples. As shown in Fig. 1, the processing flow of the method may include the following steps:
Step 101: obtain a target data sample and the centroid corresponding to each cluster category.
Step 102: according to the target data sample and the centroids of the cluster categories other than the first cluster category to which the target data sample belongs, determine a distance lower bound between the centroid of each other cluster category and the target data sample.
Step 103: from the centroids of the other cluster categories, select the centroids whose distance lower bound is less than the sample-centroid distance between the target data sample and the centroid of the first cluster category.
Step 104: among the selected centroids and the centroid of the first cluster category, determine the centroid with the smallest distance to the target data sample, and assign the target data sample to the cluster category of that centroid.
In this embodiment of the present invention, the target data sample and the centroid corresponding to each cluster category are obtained. According to the target data sample and the centroids of the cluster categories other than the first cluster category to which the target data sample belongs, the distance lower bound between the centroid of each other cluster category and the target data sample is determined. From the centroids of the other cluster categories, the centroids whose distance lower bound is less than the sample-centroid distance between the target data sample and the centroid of the first cluster category are selected. Among the selected centroids and the centroid of the first cluster category, the centroid with the smallest distance to the target data sample is determined, and the target data sample is assigned to the cluster category of that centroid. In this way, only the distance between the centroid of the first cluster category and the target data sample and the distances between the selected centroids and the target data sample need to be calculated, rather than the distances between the target data sample and all centroids. Because determining a distance lower bound requires far less computation than calculating the distance between a centroid and the target data sample, the processing resources of the server are saved.
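Steps 101 to 104 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `dist` and `assign_with_pruning` are hypothetical, the distance lower bound is taken to be a difference-based bound, and the rooted Euclidean distance is assumed so that the triangle inequality guarantees the bound never exceeds the true distance.

```python
import math

def dist(x, c):
    """Rooted Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((xi - ci) ** 2 for xi, ci in zip(x, c)))

def assign_with_pruning(sample, centroids, first):
    """Run steps 101-104 for one sample; `first` indexes its current cluster.

    Returns the index of the nearest centroid and the number of
    sample-to-centroid distances that were actually computed.
    """
    r = dist(sample, centroids[first])        # sample-centroid distance
    best, best_d, computed = first, r, 1
    for j, c in enumerate(centroids):
        if j == first:
            continue
        # Step 102: lower bound from the triangle inequality (assumption).
        lower = abs(r - dist(centroids[first], c))
        if lower >= r:                        # step 103: cannot beat `first`
            continue
        d = dist(sample, c)                   # step 104: exact distance
        computed += 1
        if d < best_d:
            best, best_d = j, d
    return best, computed

centroids = [(2, 2, 4), (2, 2, 5), (3, 5, 6)]
result = assign_with_pruning((1, 2, 3), centroids, first=0)
```

For this input the third centroid is excluded by its lower bound, so only two of the three sample-to-centroid distances are computed. The inter-centroid distances used inside the loop would normally be computed once per round and shared across samples rather than recomputed per sample.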
Embodiment two
An embodiment of the present invention provides a clustering method for data samples. The method is executed by a server, which can be a background server with a clustering function.
When the server clusters the data samples in a data sample set, the processing can be divided into the following steps. Step 1: according to the preset number of cluster categories, randomly select that number of data samples from the data samples to be clustered as the centroids of the cluster categories. Step 2: for each data sample, calculate the data distance between the data sample and each centroid (the data distance represents how close the data sample is to the centroid), determine the centroid with the smallest data distance to the data sample, and assign the data sample to the category of that centroid. Step 3: calculate the average of all data samples in each category as the centroid of that category. Step 4: repeat steps 2 and 3, that is, calculate the data distance between each data sample and the updated centroids, cluster the data samples again, and recalculate the average of all data samples in each category as the updated centroid, until the data samples in each category no longer change.
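The four baseline steps can be sketched as below, before any pruning is applied. This is an illustrative sketch only: the function names, the `max_rounds` cap and the fixed `seed` are assumptions added here so that the example terminates and is repeatable.

```python
import random

def sq_dist(x, c):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))

def mean(points):
    """Coordinate-wise average of a non-empty list of vectors."""
    m = len(points[0])
    return tuple(sum(p[i] for p in points) / len(points) for i in range(m))

def kmeans(samples, k, max_rounds=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(samples, k)              # step 1: random centroids
    labels = None
    for _ in range(max_rounds):
        # Step 2: every sample is compared with every centroid.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(s, centroids[j]))
            for s in samples
        ]
        if new_labels == labels:                    # step 4: nothing moved
            break
        labels = new_labels
        for j in range(k):                          # step 3: update centroids
            members = [s for s, lab in zip(samples, labels) if lab == j]
            if members:                             # keep old centroid if empty
                centroids[j] = mean(members)
    return labels, centroids

samples = [(0, 0), (0, 1), (10, 10), (10, 11)]
labels, centroids = kmeans(samples, k=2)
```

Step 2 here costs one distance computation per sample per centroid in every round, which is the cost the improved step 2 aims to reduce.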
This solution improves step 2 of the above processing. The processing flow shown in Fig. 1 is described in detail below with reference to specific implementations:
Step 101: obtain a target data sample and the centroid corresponding to each cluster category.
In implementation, the server can obtain the data sample that needs to be clustered (the target data sample) and the centroid corresponding to each cluster category. In the first round of clustering, the server can, according to the preset number of cluster categories, randomly select that number of data samples from the data samples to be clustered as the centroids of the cluster categories. In subsequent rounds, the centroid of each cluster category can be the average of all data samples in that category after the previous round of clustering.
Step 102: according to the target data sample and the centroids of the cluster categories other than the first cluster category to which the target data sample belongs, determine a distance lower bound between the centroid of each other cluster category and the target data sample.
In implementation, a data sample can have multiple attributes. For example, when the data samples are users, the attributes can be monthly spending, time spent online, age, gender and so on. A data sample can be represented by an m-dimensional vector; for example, the target data sample can be represented by the vector a1, with $a_1 = \{a_{11}, a_{12}, \ldots, a_{1m}\}$, and the data sample set can be denoted {a}. When {a} is clustered, the number of cluster categories can be preset, for example k categories. The centroids of these k cluster categories can be denoted c1, c2, c3, ..., ck, each of which is an m-dimensional vector. Taking the case where a1 belongs to the cluster category of c1 (the first cluster category) as an example, after the server obtains the target data sample and the centroid of each cluster category, it can determine, according to a1 and c2, c3, ..., ck, the distance lower bound between a1 and each of c2, c3, ..., ck. The distance lower bound can be denoted h′: the distance lower bound between a1 and c2 is h′2, the distance lower bound between a1 and c3 is h′3, and so on.
Optionally, the distance lower bound for the centroid of each other cluster category can be determined from the sample-centroid distance and the inter-centroid distances. Correspondingly, step 102 can be processed as follows: determine the sample-centroid distance between the target data sample and the centroid of the first cluster category to which the target data sample belongs, and the inter-centroid distance between the centroid of the first cluster category and the centroid of each other cluster category; determine the difference between the sample-centroid distance and each inter-centroid distance, and use the difference corresponding to the centroid of each other cluster category as the distance lower bound between that centroid and the target data sample.
In implementation, the server can calculate the sample-centroid distance between a1 and c1, denoted r, and can also calculate the inter-centroid distances between c1 and each of c2, c3, ..., ck, denoted d2, ..., dk in turn. The server can then calculate the difference between r and d2 and use it as the distance lower bound h′2 between a1 and c2, calculate the difference between r and d3 and use it as the distance lower bound h′3 between a1 and c3, and so on, until it calculates the difference between r and dk and uses it as the distance lower bound h′k between a1 and ck. There are many ways to calculate the sample-centroid distance and the inter-centroid distance, such as the Euclidean distance algorithm. In the Euclidean distance algorithm, the distance between any two data samples (including centroids) can be expressed as $d(x, c) = \sqrt{\sum_{i=1}^{m} (x_i - c_i)^2}$, where x and c are the data samples to be calculated, both m-dimensional vectors, $x_i$ is a coordinate of the vector x, and $c_i$ is a coordinate of the vector c. In this embodiment of the present invention, the Euclidean distance is calculated without taking the square root, that is, $d(x, c) = \sum_{i=1}^{m} (x_i - c_i)^2$, to reduce the processing load on the server. In addition, after calculating the inter-centroid distances, the server can store them. The server can process multiple data samples at the same time, and when it processes other data samples in the same round it can reuse the stored inter-centroid distances instead of calculating them again.
For example, with a1 = (1, 2, 3), c1 = (2, 2, 4), c2 = (2, 2, 5) and c3 = (3, 5, 6), the Euclidean distance formula gives r = 2, d2 = 1 and d3 = 14, so h′2 = 1 and h′3 = 12.
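The difference-based bound and the reuse of stored inter-centroid distances can be sketched as follows. The names `centroid_distance_table` and `triangle_lower_bounds` are hypothetical. One further assumption is worth flagging: the sketch uses the rooted Euclidean distance, because the triangle inequality that makes $|r - d_j|$ a lower bound holds for the rooted form, so its numbers differ from the squared values used in the worked example.

```python
import math

def dist(x, c):
    """Rooted Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((xi - ci) ** 2 for xi, ci in zip(x, c)))

def centroid_distance_table(centroids):
    """Compute every inter-centroid distance once per clustering round."""
    k = len(centroids)
    return [[dist(centroids[i], centroids[j]) for j in range(k)]
            for i in range(k)]

def triangle_lower_bounds(sample, centroids, first, table):
    """Return r and the bound |r - d_j| for every centroid j except `first`."""
    r = dist(sample, centroids[first])
    bounds = {j: abs(r - table[first][j])
              for j in range(len(centroids)) if j != first}
    return r, bounds

centroids = [(2, 2, 4), (2, 2, 5), (3, 5, 6)]
table = centroid_distance_table(centroids)      # shared by all samples
r, bounds = triangle_lower_bounds((1, 2, 3), centroids, 0, table)
```

The table costs k × k distance computations per round, but each sample then needs only one exact distance (to its own centroid) and a scalar subtraction per remaining centroid to obtain its bounds.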
Optionally, the distance lower bound can be determined from the projection onto a unit vector of the difference vector between the target data sample and the centroid of each other cluster category. Correspondingly, step 102 can be processed as follows: according to the target data sample and the centroids of the cluster categories other than the first cluster category to which the target data sample belongs, determine the projection onto a unit vector of the difference vector between the target data sample and the centroid of each other cluster category, and use the length of the projection corresponding to the centroid of each other cluster category as the distance lower bound between that centroid and the target data sample.
In implementation, a1 and c2, c3, ..., ck are all m-dimensional vectors. The server can determine the difference vectors between a1 and each of c2, c3, ..., ck, and then determine the projection of each difference vector onto a unit vector, which can be a vector in any direction. The lengths of these projections can be denoted l2, l3, ..., lk in turn. The server can calculate the length l2 of the projection onto the unit vector of the difference vector between a1 and c2 and use l2 as the distance lower bound h′2 between a1 and c2, calculate the length l3 of the projection of the difference vector between a1 and c3 and use l3 as the distance lower bound h′3 between a1 and c3, and so on, until it calculates the length lk of the projection of the difference vector between a1 and ck and uses lk as the distance lower bound h′k between a1 and ck.
For example, with a1 = (1, 2, 3), c2 = (2, 2, 5), c3 = (3, 5, 6) and the unit vector u = (0, 0, 1), the projection formula gives h′2 = l2 = 2 and h′3 = l3 = 3.
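The projection bound reduces to a dot product, as the following sketch shows. The function names are hypothetical, and the sketch assumes the supplied vector `unit` really has length 1.

```python
def dot(x, y):
    """Dot product of two equal-length vectors."""
    return sum(xi * yi for xi, yi in zip(x, y))

def projection_lower_bound(sample, centroid, unit):
    """Length of the projection of (sample - centroid) onto `unit`."""
    diff = [s - c for s, c in zip(sample, centroid)]
    return abs(dot(diff, unit))

a1 = (1, 2, 3)
u = (0, 0, 1)
h2 = projection_lower_bound(a1, (2, 2, 5), u)
h3 = projection_lower_bound(a1, (3, 5, 6), u)
```

Because the dot product is linear, the projections of the sample and of each centroid onto `unit` can be computed once and cached, after which each bound is a single scalar subtraction. The projection length bounds the rooted Euclidean distance from below; when distances are kept in squared form, the squared projection length is the comparable quantity.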
Optionally, the distance lower bound can be calculated according to the mean and standard deviation inequality. Correspondingly, step 102 can be processed as follows: according to the target data sample and the centroids of the cluster categories other than the first cluster category to which the target data sample belongs, determine the mean and variance of the target data sample and of the centroid of each other cluster category; according to those means and variances and the mean and standard deviation inequality, determine the distance lower bound between the centroid of each other cluster category and the target data sample.
In implementation, the server can calculate the mean and standard deviation of a1 and of each of c2, c3, ..., ck, where the mean is the average of the coordinates of the m-dimensional vector and the standard deviation is the standard deviation of those coordinates about that mean. The mean of a1 is $\mu_{a_1} = \frac{1}{m}\sum_{i=1}^{m} a_{1i}$ and its standard deviation is $\sigma_{a_1} = \sqrt{\frac{1}{m}\sum_{i=1}^{m} (a_{1i} - \mu_{a_1})^2}$; the mean of c2 is $\mu_{c_2} = \frac{1}{m}\sum_{i=1}^{m} c_{2i}$ and its standard deviation is $\sigma_{c_2} = \sqrt{\frac{1}{m}\sum_{i=1}^{m} (c_{2i} - \mu_{c_2})^2}$; and so on, up to ck, whose mean is $\mu_{c_k} = \frac{1}{m}\sum_{i=1}^{m} c_{ki}$ and whose standard deviation is $\sigma_{c_k} = \sqrt{\frac{1}{m}\sum_{i=1}^{m} (c_{ki} - \mu_{c_k})^2}$. According to the mean and standard deviation inequality, the distance lower bound between a1 and c2 can be calculated as $h'_2 = m[(\mu_{a_1} - \mu_{c_2})^2 + (\sigma_{a_1} - \sigma_{c_2})^2]$, and the distance lower bound between a1 and ck as $h'_k = m[(\mu_{a_1} - \mu_{c_k})^2 + (\sigma_{a_1} - \sigma_{c_k})^2]$, where m is the dimension of the vectors.
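The inequality is not derived in the text. One way to see why it bounds the squared Euclidean distance from below is the following short derivation, which assumes population (divide-by-m) statistics and writes cov(a, c) for the covariance of the coordinates of a and c:

```latex
\begin{aligned}
\frac{1}{m}\sum_{i=1}^{m}(a_i - c_i)^2
  &= (\mu_a - \mu_c)^2
     + \frac{1}{m}\sum_{i=1}^{m}\bigl[(a_i - \mu_a) - (c_i - \mu_c)\bigr]^2 \\
  &= (\mu_a - \mu_c)^2 + \sigma_a^2 + \sigma_c^2 - 2\,\mathrm{cov}(a, c) \\
  &\ge (\mu_a - \mu_c)^2 + \sigma_a^2 + \sigma_c^2 - 2\,\sigma_a\sigma_c
   = (\mu_a - \mu_c)^2 + (\sigma_a - \sigma_c)^2 .
\end{aligned}
```

The inequality step uses the Cauchy-Schwarz bound cov(a, c) ≤ σ_a σ_c. Multiplying both sides by m gives the bound on the squared distance in the form used for h′.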
For example, with a1 = (1, 2, 3), c2 = (2, 2, 5) and c3 = (3, 5, 6), the distance lower bound between a1 and c2 is $h'_2 = 3 \times [(2 - 3)^2 + (0.81 - 1.41)^2] = 4.08$, and the distance lower bound between a1 and c3 is $h'_3 = 3 \times [(2 - 4.67)^2 + (0.81 - 1.24)^2] = 21.9$.
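The same calculation can be sketched in Python. The function names are hypothetical, and the population standard deviation (dividing by m) is assumed because it matches the values 0.81, 1.41 and 1.24 in the example.

```python
import math

def mean_std(v):
    """Mean and population standard deviation of a vector's coordinates."""
    m = len(v)
    mu = sum(v) / m
    sigma = math.sqrt(sum((x - mu) ** 2 for x in v) / m)
    return mu, sigma

def mean_std_lower_bound(sample, centroid):
    """m * [(mu_a - mu_c)^2 + (sigma_a - sigma_c)^2]."""
    mu_a, sd_a = mean_std(sample)
    mu_c, sd_c = mean_std(centroid)
    return len(sample) * ((mu_a - mu_c) ** 2 + (sd_a - sd_c) ** 2)

a1 = (1, 2, 3)
h2 = mean_std_lower_bound(a1, (2, 2, 5))
h3 = mean_std_lower_bound(a1, (3, 5, 6))
```

Without intermediate rounding, the two bounds come out at roughly 4.07 and 21.89, against squared distances of 5 and 22. The mean and standard deviation of each vector can be computed once and reused, so each bound then costs only a few scalar operations.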
Step 103: from the centroids of the other cluster categories, select the centroids whose distance lower bound is less than the sample-centroid distance between the target data sample and the centroid of the first cluster category.
In implementation, after the server has calculated the distance lower bounds between a1 and c2, c3, ..., ck (that is, h′2, h′3, ..., h′k) and the sample-centroid distance between a1 and c1 (that is, r), it can compare each of h′2, h′3, ..., h′k with r, determine the distance lower bounds that are less than r, and select the centroids corresponding to those distance lower bounds. Centroids whose distance lower bound is greater than or equal to r are excluded by the server and receive no further processing. For example, with r = 2, h′2 = 1 and h′3 = 12, h′2 < r and h′3 > r, so the centroid c2 corresponding to h′2 is selected and the centroid c3 corresponding to h′3 is excluded.
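The selection rule is a simple filter over the lower bounds. In the sketch below the function name and the numeric values are purely illustrative; the entry for `c4` is included only to show that a bound equal to r is excluded.

```python
def select_candidates(r, lower_bounds):
    """Keep the centroids whose lower bound is strictly less than r."""
    return [name for name, h in lower_bounds.items() if h < r]

# Hypothetical bounds for three other centroids and r = 2.
selected = select_candidates(2.0, {"c2": 1.0, "c3": 7.5, "c4": 2.0})
```

A centroid whose lower bound is at least r cannot be strictly closer to the sample than the centroid of the first cluster category, so dropping it does not change the assignment.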
Step 104: among the selected centroids and the centroid of the first cluster category, determine the centroid with the smallest distance to the target data sample, and assign the target data sample to the cluster category of that centroid.
In implementation, after the server selects the centroids, it can calculate the distance between a1 and each selected centroid and the distance between a1 and c1, compare the calculated distances, determine the centroid with the smallest distance to a1, and assign a1 to the cluster category of that centroid. In this way, the selection step filters out some of the centroids, and the server only calculates the distance between the centroid of the first cluster category and the target data sample and the distances between the selected centroids and the target data sample, rather than the distances between the target data sample and all centroids, which saves the processing resources of the server.
For example, with a1 = (1, 2, 3), c1 = (2, 2, 4) and the selected centroid c2 = (2, 2, 5), the server can calculate the distance between a1 and c1, which is 2, and the distance between a1 and c2, which is 5, and can therefore assign a1 to the cluster category of c1.
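The final comparison can be sketched as follows, using the unrooted Euclidean distance of this embodiment. The names `sq_dist` and `assign` and the dictionary keyed by centroid label are hypothetical.

```python
def sq_dist(x, c):
    """Euclidean distance without the square root."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))

def assign(sample, first, selected, centroids):
    """Pick the nearest centroid among `first` and the selected ones."""
    candidates = [first] + list(selected)
    return min(candidates, key=lambda name: sq_dist(sample, centroids[name]))

centroids = {"c1": (2, 2, 4), "c2": (2, 2, 5), "c3": (3, 5, 6)}
winner = assign((1, 2, 3), "c1", ["c2"], centroids)
```

Listing `first` ahead of the selected centroids means that a tie leaves the sample in its current cluster category, since `min` returns the earliest minimum.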
Optionally, processing units may be obtained according to the number of selected barycenters in order to cluster the target data sample. Correspondingly, step 104 may be processed as follows: determine a first number of processing units according to the number of selected barycenters; obtain the first number of processing units from a preset processing unit pool; and, through the obtained processing units, determine, among the selected barycenters and the barycenter corresponding to the first cluster classification, the barycenter with the minimum distance to the target data sample, and assign the target data sample to the cluster classification corresponding to the determined barycenter.
In implementation, a processing unit pool may be preset in the server. The pool may contain multiple processing units, each of which can calculate the distance between a1 and one barycenter. After selecting the barycenters, the server may determine the number of selected barycenters, determine the first number of processing units according to that number, and then obtain the first number of processing units from the pool. If the distance between a1 and c1 (i.e., the sample centroid distance) is already known, the server may use the number of selected barycenters as the first number. If that distance is not known, the server may use the number of selected barycenters plus one as the first number. After obtaining the processing units, the server may use them to determine, among the selected barycenters and c1, the barycenter with the minimum distance to a1, and assign a1 to the cluster classification corresponding to the determined barycenter.
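One way to picture the processing unit pool is with a thread pool in which each worker calculates one distance. The sketch below is an assumption-laden illustration, not the patented implementation: a `ThreadPoolExecutor` from the Python standard library stands in for the pool, and the names `first_number` and `assign_with_pool` are hypothetical.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def first_number(num_selected, own_distance_known):
    """Number of processing units to request from the pool."""
    return num_selected if own_distance_known else num_selected + 1


def assign_with_pool(sample, own, selected, own_distance=None):
    """Return the label of the barycenter nearest to sample.

    own is a (label, point) pair for the sample's current cluster; selected is
    a list of (label, point) pairs that passed the lower-limit comparison.
    """
    known = own_distance is not None
    todo = list(selected) if known else [own] + list(selected)
    # The current barycenter is entered first, so a tie keeps the sample in place.
    results = {own[0]: own_distance} if known else {}
    n = first_number(len(selected), known)
    if n > 0:
        # One worker per distance to calculate stands in for one processing unit.
        with ThreadPoolExecutor(max_workers=n) as pool:
            dists = pool.map(lambda item: math.dist(sample, item[1]), todo)
            results.update(zip((label for label, _ in todo), dists))
    return min(results, key=results.get)


print(assign_with_pool((1, 2, 3), ("c1", (2, 2, 4)), [("c2", (2, 2, 5))]))  # c1
```

When the sample centroid distance is passed in as `own_distance`, one fewer worker is requested, which mirrors the two cases described in the text.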
In the embodiment of the present invention, a target data sample and the barycenter corresponding to each cluster classification are obtained. According to the target data sample and the barycenters corresponding to the cluster classifications other than the first cluster classification to which the target data sample belongs, the distance lower limit between the barycenter of each other cluster classification and the target data sample is determined. Among the barycenters corresponding to the other cluster classifications, those whose distance lower limit is less than the sample centroid distance between the target data sample and the barycenter corresponding to the first cluster classification are selected. Among the selected barycenters and the barycenter corresponding to the first cluster classification, the barycenter with the minimum distance to the target data sample is determined, and the target data sample is assigned to the cluster classification corresponding to that barycenter. In this way, only the distance between the target data sample and the barycenter corresponding to the first cluster classification and the distances between the target data sample and the selected barycenters need to be calculated, rather than the distances to all barycenters. Because determining a distance lower limit takes far less computation than calculating the distance between a barycenter and the target data sample, processing resources of the server are saved.
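The whole procedure for one sample can be sketched end to end as follows. The sketch makes several assumptions that should be read as such: it uses Euclidean distance, it picks the triangle-inequality lower limit (the distance between two barycenters minus r) as one of the possible lower limits, and the function name `reassign` and its return value are hypothetical.

```python
import math


def reassign(sample, barycenters, current):
    """Return (index of nearest barycenter, number of sample distances computed).

    barycenters is a list of points; current is the index of the cluster the
    sample currently belongs to.
    """
    r = math.dist(sample, barycenters[current])  # sample centroid distance
    best, best_dist, computed = current, r, 1
    for j, c in enumerate(barycenters):
        if j == current:
            continue
        # Lower limit from the triangle inequality: d(a, c_j) >= d(c_1, c_j) - r.
        # In a full clustering pass the barycenter-to-barycenter distances could
        # be calculated once per iteration and shared by every sample.
        lower_limit = math.dist(barycenters[current], c) - r
        if lower_limit >= r:
            continue  # excluded: cannot be closer than the current barycenter
        d = math.dist(sample, c)
        computed += 1
        if d < best_dist:
            best, best_dist = j, d
    return best, computed


centers = [(2, 2, 4), (2, 2, 5), (20, 2, 4)]
print(reassign((1, 2, 3), centers, current=0))  # (0, 2)
```

With three barycenters only two sample distances are calculated, because the distant barycenter (20, 2, 4) is excluded by its lower limit alone.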
Embodiment three
Based on the same technical concept, an embodiment of the present invention further provides a clustering apparatus for data samples. As shown in Fig. 2, the apparatus includes:
An acquisition module 210, configured to obtain a target data sample and the barycenter corresponding to each cluster classification;
A determining module 220, configured to determine, according to the target data sample and the barycenters corresponding to the cluster classifications other than the first cluster classification to which the target data sample belongs, the distance lower limit between the barycenter of each other cluster classification and the target data sample;
A selection module 230, configured to select, among the barycenters corresponding to the other cluster classifications, the barycenters whose distance lower limit is less than the sample centroid distance between the target data sample and the barycenter corresponding to the first cluster classification;
A cluster module 240, configured to determine, among the selected barycenters and the barycenter corresponding to the first cluster classification, the barycenter with the minimum distance to the target data sample, and to assign the target data sample to the cluster classification corresponding to that barycenter.
Optionally, the determining module 220 is configured to:
Determine the sample centroid distance between the target data sample and the barycenter corresponding to the first cluster classification to which the target data sample belongs, and the inter-barycenter distance between the barycenter corresponding to the first cluster classification and the barycenter corresponding to each other cluster classification;
Determine the difference between the sample centroid distance and each inter-barycenter distance, and use the difference corresponding to the barycenter of each other cluster classification as the distance lower limit between that barycenter and the target data sample.
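The difference-based lower limit follows from the triangle inequality: for a sample a, its current barycenter c1 and another barycenter cj, d(a, cj) ≥ d(c1, cj) − d(a, c1). The sketch below assumes Euclidean distance and takes the difference in that order, which is an interpretation, since the text only says "difference"; the function name `triangle_lower_limits` is hypothetical.

```python
import math


def triangle_lower_limits(sample, barycenters, current):
    """Map each other barycenter index j to d(c_current, c_j) - r."""
    r = math.dist(sample, barycenters[current])  # sample centroid distance
    return {
        j: math.dist(barycenters[current], c) - r
        for j, c in enumerate(barycenters)
        if j != current
    }


centers = [(2, 2, 4), (2, 2, 5), (20, 2, 4)]
sample = (1, 2, 3)
limits = triangle_lower_limits(sample, centers, current=0)

# Each lower limit never exceeds the true distance it stands in for.
for j, h in limits.items():
    assert h <= math.dist(sample, centers[j])
```

A lower limit computed this way can be negative when two barycenters are close together; such a barycenter is simply never excluded.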
Optionally, the determining module 220 is configured to:
Determine, according to the target data sample and the barycenters corresponding to the cluster classifications other than the first cluster classification to which the target data sample belongs, the projection onto a unit vector of the difference vector between the target data sample and the barycenter corresponding to each other cluster classification, and use the length of the projection corresponding to the barycenter of each other cluster classification as the distance lower limit between that barycenter and the target data sample.
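The projection-based lower limit rests on the fact that the projection of a vector onto a unit vector is never longer than the vector itself (Cauchy-Schwarz inequality). The sketch below assumes Euclidean distance; the function name `projection_lower_limit` and the sample values are hypothetical.

```python
import math


def projection_lower_limit(sample, barycenter, unit):
    """Length of the projection of (sample - barycenter) onto the unit vector."""
    return abs(sum((a - c) * u for a, c, u in zip(sample, barycenter, unit)))


a1 = (1.0, 2.0, 3.0)
c2 = (4.0, 6.0, 3.0)
u = (1.0, 0.0, 0.0)  # any vector of length 1 can be used

print(projection_lower_limit(a1, c2, u))  # 3.0
print(math.dist(a1, c2))                  # 5.0
```

Because the dot product is linear, the projections of the sample and of every barycenter onto the unit vector can each be calculated once, after which every lower limit is a single subtraction of two scalars.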
Optionally, the determining module 220 is configured to:
Determine, according to the target data sample and the barycenters corresponding to the cluster classifications other than the first cluster classification to which the target data sample belongs, the mean and variance of the target data sample and of the barycenter corresponding to each other cluster classification;
Determine, according to the means and variances of the target data sample and of the barycenter corresponding to each other cluster classification, together with the mean-square-deviation inequality, the distance lower limit between the barycenter of each other cluster classification and the target data sample.
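This passage does not spell out the inequality, so the sketch below substitutes a standard bound of the same kind and should be read as an assumption rather than as the patent's formula. For two n-dimensional vectors x and y with component means μ and population standard deviations σ, the squared Euclidean distance satisfies ‖x − y‖² ≥ n[(μx − μy)² + (σx − σy)²]. The function names are hypothetical.

```python
import math


def mean_and_std(v):
    """Mean and population standard deviation of the components of v."""
    mu = sum(v) / len(v)
    var = sum((x - mu) ** 2 for x in v) / len(v)
    return mu, math.sqrt(var)


def mean_variance_lower_limit(sample, barycenter):
    """Lower limit sqrt(n * ((mu_a - mu_c)^2 + (sigma_a - sigma_c)^2))."""
    n = len(sample)
    mu_a, sd_a = mean_and_std(sample)
    mu_c, sd_c = mean_and_std(barycenter)
    return math.sqrt(n * ((mu_a - mu_c) ** 2 + (sd_a - sd_c) ** 2))


a1 = (1, 2, 3)
c2 = (2, 2, 5)
bound = mean_variance_lower_limit(a1, c2)

# The lower limit never exceeds the true Euclidean distance.
assert bound <= math.dist(a1, c2)
```

The mean and standard deviation of each vector can be calculated once and reused, so each lower limit costs a handful of scalar operations regardless of the dimension.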
Optionally, the cluster module 240 is configured to:
Determine a first number of processing units according to the number of selected barycenters;
Obtain the first number of processing units from a preset processing unit pool;
Through the obtained processing units, determine, among the selected barycenters and the barycenter corresponding to the first cluster classification, the barycenter with the minimum distance to the target data sample, and assign the target data sample to the cluster classification corresponding to the determined barycenter.
In the embodiment of the present invention, a target data sample and the barycenter corresponding to each cluster classification are obtained. According to the target data sample and the barycenters corresponding to the cluster classifications other than the first cluster classification to which the target data sample belongs, the distance lower limit between the barycenter of each other cluster classification and the target data sample is determined. Among the barycenters corresponding to the other cluster classifications, those whose distance lower limit is less than the sample centroid distance between the target data sample and the barycenter corresponding to the first cluster classification are selected. Among the selected barycenters and the barycenter corresponding to the first cluster classification, the barycenter with the minimum distance to the target data sample is determined, and the target data sample is assigned to the cluster classification corresponding to that barycenter. In this way, only the distance between the target data sample and the barycenter corresponding to the first cluster classification and the distances between the target data sample and the selected barycenters need to be calculated, rather than the distances to all barycenters. Because determining a distance lower limit takes far less computation than calculating the distance between a barycenter and the target data sample, processing resources of the server are saved.
It should be noted that when the clustering apparatus for data samples provided in the above embodiment clusters data samples, the division into the above functional modules is used only as an example. In practical applications, the above functions may be allocated to different functional modules as required; that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the clustering apparatus for data samples provided in the above embodiment and the embodiments of the clustering method for data samples belong to the same concept. For the specific implementation process, refer to the method embodiments; details are not repeated here.
Embodiment four
Based on the same technical concept as the above clustering method for data samples, an embodiment of the present application further provides a server; for a schematic structural diagram of the server, refer to Fig. 3. The server includes a processor 310, a transceiver 320 and a memory 330, with the transceiver 320 and the memory 330 each connected to the processor 310, wherein:
The memory 330 is configured to obtain, through the transceiver 320, a target data sample and the barycenter corresponding to each cluster classification;
The processor 310 is configured to determine, according to the target data sample and the barycenters corresponding to the cluster classifications other than the first cluster classification to which the target data sample belongs, the distance lower limit between the barycenter of each other cluster classification and the target data sample;
The processor 310 is further configured to select, among the barycenters corresponding to the other cluster classifications, the barycenters whose distance lower limit is less than the sample centroid distance between the target data sample and the barycenter corresponding to the first cluster classification;
The processor 310 is further configured to determine, among the selected barycenters and the barycenter corresponding to the first cluster classification, the barycenter with the minimum distance to the target data sample, and to assign the target data sample to the cluster classification corresponding to that barycenter.
Optionally, the processor 310 is configured to:
Determine the sample centroid distance between the target data sample and the barycenter corresponding to the first cluster classification to which the target data sample belongs, and the inter-barycenter distance between the barycenter corresponding to the first cluster classification and the barycenter corresponding to each other cluster classification;
Determine the difference between the sample centroid distance and each inter-barycenter distance, and use the difference corresponding to the barycenter of each other cluster classification as the distance lower limit between that barycenter and the target data sample.
Optionally, the processor 310 is configured to:
Determine, according to the target data sample and the barycenters corresponding to the cluster classifications other than the first cluster classification to which the target data sample belongs, the projection onto a unit vector of the difference vector between the target data sample and the barycenter corresponding to each other cluster classification, and use the length of the projection corresponding to the barycenter of each other cluster classification as the distance lower limit between that barycenter and the target data sample.
Optionally, the processor 310 is configured to:
Determine, according to the target data sample and the barycenters corresponding to the cluster classifications other than the first cluster classification to which the target data sample belongs, the mean and variance of the target data sample and of the barycenter corresponding to each other cluster classification;
Determine, according to the means and variances of the target data sample and of the barycenter corresponding to each other cluster classification, together with the mean-square-deviation inequality, the distance lower limit between the barycenter of each other cluster classification and the target data sample.
Optionally, the processor 310 is configured to:
Determine a first number of processing units according to the number of selected barycenters;
Obtain the first number of processing units from a preset processing unit pool;
Through the obtained processing units, determine, among the selected barycenters and the barycenter corresponding to the first cluster classification, the barycenter with the minimum distance to the target data sample, and assign the target data sample to the cluster classification corresponding to the determined barycenter.
In the embodiment of the present invention, a target data sample and the barycenter corresponding to each cluster classification are obtained. According to the target data sample and the barycenters corresponding to the cluster classifications other than the first cluster classification to which the target data sample belongs, the distance lower limit between the barycenter of each other cluster classification and the target data sample is determined. Among the barycenters corresponding to the other cluster classifications, those whose distance lower limit is less than the sample centroid distance between the target data sample and the barycenter corresponding to the first cluster classification are selected. Among the selected barycenters and the barycenter corresponding to the first cluster classification, the barycenter with the minimum distance to the target data sample is determined, and the target data sample is assigned to the cluster classification corresponding to that barycenter. In this way, only the distance between the target data sample and the barycenter corresponding to the first cluster classification and the distances between the target data sample and the selected barycenters need to be calculated, rather than the distances to all barycenters. Because determining a distance lower limit takes far less computation than calculating the distance between a barycenter and the target data sample, processing resources of the server are saved.
A person of ordinary skill in the art will understand that all or some of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like.
The foregoing is merely preferred embodiments of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A clustering method for data samples, characterized in that the method comprises:
Obtaining a target data sample and the barycenter corresponding to each cluster classification;
Determining, according to the target data sample and the barycenters corresponding to the cluster classifications other than the first cluster classification to which the target data sample belongs, the distance lower limit between the barycenter of each other cluster classification and the target data sample;
Selecting, among the barycenters corresponding to the other cluster classifications, the barycenters whose distance lower limit is less than the sample centroid distance between the target data sample and the barycenter corresponding to the first cluster classification;
Determining, among the selected barycenters and the barycenter corresponding to the first cluster classification, the barycenter with the minimum distance to the target data sample, and assigning the target data sample to the cluster classification corresponding to the barycenter with the minimum distance to the target data sample.
2. The method according to claim 1, characterized in that the determining, according to the target data sample and the barycenters corresponding to the cluster classifications other than the first cluster classification to which the target data sample belongs, of the distance lower limit between the barycenter of each other cluster classification and the target data sample comprises:
Determining the sample centroid distance between the target data sample and the barycenter corresponding to the first cluster classification to which the target data sample belongs, and the inter-barycenter distance between the barycenter corresponding to the first cluster classification and the barycenter corresponding to each other cluster classification;
Determining the difference between the sample centroid distance and each inter-barycenter distance, and using the difference corresponding to the barycenter of each other cluster classification as the distance lower limit between that barycenter and the target data sample.
3. The method according to claim 1, characterized in that the determining, according to the target data sample and the barycenters corresponding to the cluster classifications other than the first cluster classification to which the target data sample belongs, of the distance lower limit between the barycenter of each other cluster classification and the target data sample comprises:
Determining, according to the target data sample and the barycenters corresponding to the cluster classifications other than the first cluster classification to which the target data sample belongs, the projection onto a unit vector of the difference vector between the target data sample and the barycenter corresponding to each other cluster classification, and using the length of the projection corresponding to the barycenter of each other cluster classification as the distance lower limit between that barycenter and the target data sample, the unit vector being a vector in an arbitrary direction.
4. The method according to claim 1, characterized in that the determining, according to the target data sample and the barycenters corresponding to the cluster classifications other than the first cluster classification to which the target data sample belongs, of the distance lower limit between the target data sample and the barycenters of the other cluster classifications comprises:
Determining, according to the target data sample and the barycenters corresponding to the cluster classifications other than the first cluster classification to which the target data sample belongs, the mean and variance of the target data sample and of the barycenter corresponding to each other cluster classification;
Determining, according to the means and variances of the target data sample and of the barycenter corresponding to each other cluster classification, together with the mean-square-deviation inequality, the distance lower limit between the barycenter of each other cluster classification and the target data sample.
5. The method according to claim 1, characterized in that the determining, among the selected barycenters and the barycenter corresponding to the first cluster classification, of the barycenter with the minimum distance to the target data sample, and the assigning of the target data sample to the cluster classification corresponding to the barycenter with the minimum distance to the target data sample, comprises:
Determining a first number of processing units according to the number of selected barycenters;
Obtaining the first number of processing units from a preset processing unit pool;
Through the obtained processing units, determining, among the selected barycenters and the barycenter corresponding to the first cluster classification, the barycenter with the minimum distance to the target data sample, and assigning the target data sample to the cluster classification corresponding to the determined barycenter.
6. A clustering apparatus for data samples, characterized in that the apparatus comprises:
An acquisition module, configured to obtain a target data sample and the barycenter corresponding to each cluster classification;
A determining module, configured to determine, according to the target data sample and the barycenters corresponding to the cluster classifications other than the first cluster classification to which the target data sample belongs, the distance lower limit between the barycenter of each other cluster classification and the target data sample;
A selection module, configured to select, among the barycenters corresponding to the other cluster classifications, the barycenters whose distance lower limit is less than the sample centroid distance between the target data sample and the barycenter corresponding to the first cluster classification;
A cluster module, configured to determine, among the selected barycenters and the barycenter corresponding to the first cluster classification, the barycenter with the minimum distance to the target data sample, and to assign the target data sample to the cluster classification corresponding to the barycenter with the minimum distance to the target data sample.
7. The apparatus according to claim 6, characterized in that the determining module is configured to:
Determine the sample centroid distance between the target data sample and the barycenter corresponding to the first cluster classification to which the target data sample belongs, and the inter-barycenter distance between the barycenter corresponding to the first cluster classification and the barycenter corresponding to each other cluster classification;
Determine the difference between the sample centroid distance and each inter-barycenter distance, and use the difference corresponding to the barycenter of each other cluster classification as the distance lower limit between that barycenter and the target data sample.
8. The apparatus according to claim 6, characterized in that the determining module is configured to:
Determine, according to the target data sample and the barycenters corresponding to the cluster classifications other than the first cluster classification to which the target data sample belongs, the projection onto a unit vector of the difference vector between the target data sample and the barycenter corresponding to each other cluster classification, and use the length of the projection corresponding to the barycenter of each other cluster classification as the distance lower limit between that barycenter and the target data sample, the unit vector being a vector in an arbitrary direction.
9. The apparatus according to claim 6, characterized in that the determining module is configured to:
Determine, according to the target data sample and the barycenters corresponding to the cluster classifications other than the first cluster classification to which the target data sample belongs, the mean and variance of the target data sample and of the barycenter corresponding to each other cluster classification;
Determine, according to the means and variances of the target data sample and of the barycenter corresponding to each other cluster classification, together with the mean-square-deviation inequality, the distance lower limit between the barycenter of each other cluster classification and the target data sample.
10. The apparatus according to claim 6, characterized in that the cluster module is configured to:
Determine a first number of processing units according to the number of selected barycenters;
Obtain the first number of processing units from a preset processing unit pool;
Through the obtained processing units, determine, among the selected barycenters and the barycenter corresponding to the first cluster classification, the barycenter with the minimum distance to the target data sample, and assign the target data sample to the cluster classification corresponding to the determined barycenter.
CN201510119224.9A 2015-03-18 2015-03-18 The clustering method and device of a kind of data sample Expired - Fee Related CN104765776B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510119224.9A CN104765776B (en) 2015-03-18 2015-03-18 The clustering method and device of a kind of data sample

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510119224.9A CN104765776B (en) 2015-03-18 2015-03-18 The clustering method and device of a kind of data sample

Publications (2)

Publication Number Publication Date
CN104765776A CN104765776A (en) 2015-07-08
CN104765776B true CN104765776B (en) 2018-06-05

Family

ID=53647607

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510119224.9A Expired - Fee Related CN104765776B (en) 2015-03-18 2015-03-18 The clustering method and device of a kind of data sample

Country Status (1)

Country Link
CN (1) CN104765776B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106909932A (en) * 2015-12-23 2017-06-30 北京奇虎科技有限公司 A kind of method and device of website cluster
CN105868261A (en) * 2015-12-31 2016-08-17 乐视网信息技术(北京)股份有限公司 Method and device for obtaining and ranking associated information
CN110909817B (en) * 2019-11-29 2022-11-11 深圳市商汤科技有限公司 Distributed clustering method and system, processor, electronic device and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6298174B1 (en) * 1996-08-12 2001-10-02 Battelle Memorial Institute Three-dimensional display of document set
CN101149759A (en) * 2007-11-09 2008-03-26 山西大学 K-means initial clustering center selection method based on neighborhood model
CN101477552A (en) * 2009-02-03 2009-07-08 辽宁般若网络科技有限公司 Website user rank division method
CN103164487A (en) * 2011-12-19 2013-06-19 中国科学院声学研究所 Clustering algorithm based on density and geometrical information

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
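Research on the K-Means clustering algorithm; Zhou Aiwu et al.; Computer Technology and Development; 20110228; Vol. 21 (No. 2); pp. 62-65 *
An improved K-means clustering algorithm; Wang Yong et al.; Industrial Control Computer; 20121231; Vol. 23 (No. 8); pp. 91-93 *
A k-means algorithm with optimized initial cluster centers; Yuan Fang et al.; Computer Engineering; 20070228; Vol. 33 (No. 3); pp. 65-66 *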

Also Published As

Publication number Publication date
CN104765776A (en) 2015-07-08

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
EXSB Decision made by sipo to initiate substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20191210

Address after: 210000 room er201, east side, office, building 2, Park 1, Renshan Road, Jiangpu street, Pukou District, Nanjing City, Jiangsu Province

Co-patentee after: Nanjing zhishuyun Information Technology Co.,Ltd.

Patentee after: Nanjing Dekun Information Technology Co.,Ltd.

Address before: 510000 unit 2414-2416, building, No. five, No. 371, Tianhe District, Guangdong, China

Patentee before: GUANGDONG GAOHANG INTELLECTUAL PROPERTY OPERATION Co.,Ltd.

Effective date of registration: 20191210

Address after: 510000 unit 2414-2416, building, No. five, No. 371, Tianhe District, Guangdong, China

Patentee after: GUANGDONG GAOHANG INTELLECTUAL PROPERTY OPERATION Co.,Ltd.

Address before: 518129 Bantian HUAWEI headquarters office building, Longgang District, Guangdong, Shenzhen

Patentee before: HUAWEI TECHNOLOGIES Co.,Ltd.

TR01 Transfer of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20180605

CF01 Termination of patent right due to non-payment of annual fee