CN104765776B

CN104765776B - Clustering method and device for data samples

Info

Publication number: CN104765776B
Application number: CN201510119224.9A
Authority: CN
Inventors: 徐斌; 袁宏辉; 陈伟祥
Original assignee: Huawei Technologies Co Ltd
Current assignee: Guangdong Gaohang Intellectual Property Operation Co ltd; Nanjing Dekun Information Technology Co ltd; Nanjing Zhishuyun Information Technology Co ltd
Priority date: 2015-03-18
Filing date: 2015-03-18
Publication date: 2018-06-05
Anticipated expiration: 2035-03-18
Also published as: CN104765776A

Abstract

The invention discloses a clustering method and device for data samples, belonging to the field of computer technology. The method includes: obtaining a target data sample and the centroid corresponding to each cluster category; determining, according to the target data sample and the centroids corresponding to the cluster categories other than the first cluster category to which the target data sample belongs, a distance lower bound between the centroid of each other cluster category and the target data sample; choosing, among the centroids corresponding to the other cluster categories, the centroids whose distance lower bound is less than the sample-centroid distance between the target data sample and the centroid corresponding to the first cluster category; and determining, among the chosen centroids and the centroid corresponding to the first cluster category, the centroid with the minimum distance to the target data sample, and assigning the target data sample to the cluster category corresponding to that centroid. The invention can save processing resources of the server.

Description

Clustering method and device for data samples

Technical field

The present invention relates to the field of computer technology, and in particular to a clustering method and device for data samples.

Background

With the development of computer technology, computers are applied more and more widely and their functions are more and more comprehensive. People can use computers (such as servers) to carry out various kinds of data processing, such as data clustering and data statistics. Each piece of data to be processed can be called a data sample.

When clustering the data samples in a data sample set, a server can, according to the preset number of cluster categories, randomly select that number of data samples from the data samples to be clustered as the centroids of the cluster categories. For each data sample in the set, the server calculates the distance between the data sample and each centroid; the distance represents how close the data sample is to the centroid, and there are many ways to calculate it, such as the Euclidean distance algorithm. The server can determine the centroid with the minimum distance to the data sample, assign the data sample to the category to which that centroid belongs, and then calculate the average of all data samples in the category as the centroid of the category. The server can repeat this processing: it calculates the distance between each data sample and the updated centroids, clusters the data samples again, and recalculates the average of all data samples in each category as the updated centroid, until the data samples in each category no longer change.

In the course of implementing the present invention, the inventors found that the prior art has at least the following problem:

When clustering a given data sample, the server needs to calculate the distance between the data sample and every centroid. The amount of calculation is large and occupies a large amount of the server's processing resources.

Summary of the invention

In order to solve the problem of the prior art, embodiments of the present invention provide a clustering method and device for data samples. The technical solution is as follows:

In a first aspect, a clustering method for data samples is provided, the method including:

obtaining a target data sample and the centroid corresponding to each cluster category;

determining, according to the target data sample and the centroids corresponding to the cluster categories other than the first cluster category to which the target data sample belongs, a distance lower bound between the centroid of each other cluster category and the target data sample;

choosing, among the centroids corresponding to the other cluster categories, the centroids whose distance lower bound is less than the sample-centroid distance between the target data sample and the centroid corresponding to the first cluster category;

determining, among the chosen centroids and the centroid corresponding to the first cluster category, the centroid with the minimum distance to the target data sample, and assigning the target data sample to the cluster category corresponding to the centroid with the minimum distance to the target data sample.

With reference to the first aspect, in a first possible implementation of the first aspect, the determining, according to the target data sample and the centroids corresponding to the cluster categories other than the first cluster category to which the target data sample belongs, of the distance lower bound between the centroid of each other cluster category and the target data sample includes:

determining the sample-centroid distance between the target data sample and the centroid corresponding to the first cluster category to which the target data sample belongs, and the inter-centroid distance between the centroid corresponding to the first cluster category and the centroid corresponding to each other cluster category;

determining the difference between the sample-centroid distance and each inter-centroid distance, and using the difference corresponding to the centroid of each other cluster category as the distance lower bound between that centroid and the target data sample.

With reference to the first aspect, in a second possible implementation of the first aspect, the determining, according to the target data sample and the centroids corresponding to the cluster categories other than the first cluster category to which the target data sample belongs, of the distance lower bound between the centroid of each other cluster category and the target data sample includes:

determining, according to the target data sample and the centroids corresponding to the cluster categories other than the first cluster category to which the target data sample belongs, the projection onto a unit vector of the difference vector between the target data sample and the centroid corresponding to each other cluster category, and using the length of the projection corresponding to the centroid of each other cluster category as the distance lower bound between that centroid and the target data sample.

With reference to the first aspect, in a third possible implementation of the first aspect, the determining, according to the target data sample and the centroids corresponding to the cluster categories other than the first cluster category to which the target data sample belongs, of the distance lower bound between the target data sample and the centroids of the other cluster categories includes:

determining, according to the target data sample and the centroids corresponding to the cluster categories other than the first cluster category to which the target data sample belongs, the mean and variance of the target data sample and of the centroid corresponding to each other cluster category;

determining, according to the mean and variance of the target data sample and of the centroid corresponding to each other cluster category, and according to the mean-variance inequality, the distance lower bound between the centroid of each other cluster category and the target data sample.

With reference to the first aspect, in a fourth possible implementation of the first aspect, the determining, among the chosen centroids and the centroid corresponding to the first cluster category, of the centroid with the minimum distance to the target data sample, and the assigning of the target data sample to the cluster category corresponding to that centroid, includes:

determining a first number of processing units according to the number of chosen centroids;

obtaining the first number of processing units from a preset processing unit pool;

determining, by the obtained processing units, among the chosen centroids and the centroid corresponding to the first cluster category, the centroid with the minimum distance to the target data sample, and assigning the target data sample to the cluster category corresponding to the determined centroid.

In a second aspect, a clustering device for data samples is provided, the device including:

an acquisition module, configured to obtain a target data sample and the centroid corresponding to each cluster category;

a determining module, configured to determine, according to the target data sample and the centroids corresponding to the cluster categories other than the first cluster category to which the target data sample belongs, a distance lower bound between the centroid of each other cluster category and the target data sample;

a choosing module, configured to choose, among the centroids corresponding to the other cluster categories, the centroids whose distance lower bound is less than the sample-centroid distance between the target data sample and the centroid corresponding to the first cluster category;

a cluster module, configured to determine, among the chosen centroids and the centroid corresponding to the first cluster category, the centroid with the minimum distance to the target data sample, and to assign the target data sample to the cluster category corresponding to the centroid with the minimum distance to the target data sample.

With reference to the second aspect, in a first possible implementation of the second aspect, the determining module is configured to:

With reference to the second aspect, in a second possible implementation of the second aspect, the determining module is configured to:

With reference to the second aspect, in a third possible implementation of the second aspect, the determining module is configured to:

With reference to the second aspect, in a fourth possible implementation of the second aspect, the cluster module is configured to:

The beneficial effects of the technical solution provided by the embodiments of the present invention are as follows:

In the embodiments of the present invention, a target data sample and the centroid corresponding to each cluster category are obtained. According to the target data sample and the centroids corresponding to the cluster categories other than the first cluster category to which the target data sample belongs, a distance lower bound between the centroid of each other cluster category and the target data sample is determined. Among the centroids corresponding to the other cluster categories, the centroids whose distance lower bound is less than the sample-centroid distance between the target data sample and the centroid corresponding to the first cluster category are chosen. Among the chosen centroids and the centroid corresponding to the first cluster category, the centroid with the minimum distance to the target data sample is determined, and the target data sample is assigned to the cluster category corresponding to that centroid. In this way, only the distance between the centroid of the first cluster category and the target data sample and the distances between the chosen centroids and the target data sample need to be calculated, rather than the distances between the target data sample and all centroids. Because determining a distance lower bound requires far less calculation than calculating the distance between a centroid and the target data sample, the processing resources of the server are saved.

Brief description of the drawings

To describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flow chart of a clustering method for data samples provided by an embodiment of the present invention;

Fig. 2 is a structural diagram of a clustering device for data samples provided by an embodiment of the present invention;

Fig. 3 is a structural diagram of a server provided by an embodiment of the present invention.

Detailed description

To make the objectives, technical solutions and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the drawings.

Embodiment one

An embodiment of the present invention provides a clustering method for data samples. As shown in Fig. 1, the processing flow of the method can include the following steps:

Step 101: obtain a target data sample and the centroid corresponding to each cluster category.

Step 102: according to the target data sample and the centroids corresponding to the cluster categories other than the first cluster category to which the target data sample belongs, determine a distance lower bound between the centroid of each other cluster category and the target data sample.

Step 103: among the centroids corresponding to the other cluster categories, choose the centroids whose distance lower bound is less than the sample-centroid distance between the target data sample and the centroid corresponding to the first cluster category.

Step 104: among the chosen centroids and the centroid corresponding to the first cluster category, determine the centroid with the minimum distance to the target data sample, and assign the target data sample to the cluster category corresponding to the centroid with the minimum distance to the target data sample.

Embodiment two

An embodiment of the present invention provides a clustering method for data samples, which is executed by a server. The server can be a background server with a clustering function.

When the server clusters the data samples in a data sample set, the processing can be divided into the following steps. Step 1: according to the preset number of cluster categories, randomly select that number of data samples from the data samples to be clustered as the centroids of the cluster categories. Step 2: for each data sample, calculate the data distance between the data sample and each centroid (the data distance represents how close the data sample is to the centroid), determine the centroid with the minimum data distance to the data sample, and assign the data sample to the category to which that centroid belongs. Step 3: calculate the average of all data samples in each category as the centroid of that category. Step 4: repeat steps 2 and 3, that is, calculate the data distance between each data sample and the updated centroids, cluster the data samples again, and recalculate the average of all data samples in each category as the updated centroid, until the data samples in each category no longer change.
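The four baseline steps above can be sketched as follows. This is a minimal illustration rather than the server's actual implementation; the function names, the fixed random seed and the `max_rounds` safeguard are assumptions introduced for the example, and the un-rooted (squared) Euclidean distance is used as the data distance.

```python
import random

def squared_distance(x, c):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))

def mean_vector(samples):
    """Coordinate-wise average of a non-empty list of vectors."""
    m = len(samples[0])
    return tuple(sum(s[i] for s in samples) / len(samples) for i in range(m))

def baseline_kmeans(samples, k, seed=0, max_rounds=100):
    """Baseline clustering: every sample is compared with all k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(samples, k)      # step 1: random initial centroids
    labels = None
    for _ in range(max_rounds):             # step 4: repeat steps 2 and 3
        # step 2: assign each sample to its nearest centroid
        new_labels = [
            min(range(k), key=lambda j: squared_distance(s, centroids[j]))
            for s in samples
        ]
        if new_labels == labels:            # no sample changed category: stop
            break
        labels = new_labels
        # step 3: recompute each centroid as the average of its members
        for j in range(k):
            members = [s for s, lab in zip(samples, labels) if lab == j]
            if members:                     # an empty category keeps its old centroid
                centroids[j] = mean_vector(members)
    return labels, centroids
```

Step 2 in this sketch costs k distance calculations per sample per round, which is the cost the method described below aims to reduce.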

This solution improves step 2 of the above processing. The processing flow shown in Fig. 1 is described in detail below with reference to specific implementations:

In implementation, the server can obtain the data sample that needs to be clustered (i.e., the target data sample) and the centroid corresponding to each cluster category. In the first round of clustering, the server can, according to the preset number of cluster categories, randomly select that number of data samples from the data samples to be clustered as the centroids of the cluster categories. In subsequent rounds, the centroid corresponding to each cluster category can be the average of all data samples in that category after the previous round of clustering.

In implementation, a data sample can have multiple attributes. For example, when a data sample is a user, its attributes can be monthly spending, time spent online, age, gender and so on. A data sample can be represented by an m-dimensional vector; for instance, the target data sample can be represented by the vector a₁, a₁ = {a₁₁, a₁₂, ..., a_1m}, and the data sample set can be denoted {a}. When {a} is clustered, the number of cluster categories can be preset, for example to k. The centroids corresponding to these k cluster categories can be denoted c₁, c₂, c₃, ..., c_k, each of which is an m-dimensional vector. Taking the case where a₁ belongs to the cluster category corresponding to c₁ (i.e., the first cluster category) as an example, after obtaining the target data sample and the centroid corresponding to each cluster category, the server can determine, from a₁ and c₂, c₃, ..., c_k, the distance lower bound between a₁ and each of c₂, c₃, ..., c_k. A distance lower bound can be denoted h′: the distance lower bound between a₁ and c₂ is h′₂, the distance lower bound between a₁ and c₃ is h′₃, and so on.

Optionally, the distance lower bound corresponding to the centroid of each other cluster category can be determined from the sample-centroid distance and the inter-centroid distances. Accordingly, step 102 can be processed as follows: determine the sample-centroid distance between the target data sample and the centroid corresponding to the first cluster category to which the target data sample belongs, and the inter-centroid distance between the centroid corresponding to the first cluster category and the centroid corresponding to each other cluster category; determine the difference between the sample-centroid distance and each inter-centroid distance, and use the difference corresponding to the centroid of each other cluster category as the distance lower bound between that centroid and the target data sample.

In implementation, the server can calculate the sample-centroid distance between a₁ and c₁, which can be denoted r, and can also calculate the inter-centroid distances between c₁ and each of c₂, c₃, ..., c_k, which can be denoted in turn d₂, ..., d_k. The server can then calculate the difference between r and d₂ and use it as the distance lower bound h′₂ between a₁ and c₂, calculate the difference between r and d₃ and use it as the distance lower bound h′₃ between a₁ and c₃, and so on, until it calculates the difference between r and d_k and uses it as the distance lower bound h′_k between a₁ and c_k. There are many ways to calculate the sample-centroid distance and the inter-centroid distances, such as the Euclidean distance algorithm. In the Euclidean distance algorithm, the distance between any two data samples (including centroids) can be expressed as

$$d(x, c) = \sqrt{\sum_{i=1}^{m} (x_i - c_i)^2}$$

where x and c are the data samples to be calculated, both m-dimensional vectors, x_i is a coordinate of the vector x, and c_i is a coordinate of the vector c. In the embodiments of the present invention, the Euclidean distance is calculated without taking the square root, i.e.,

$$d(x, c) = \sum_{i=1}^{m} (x_i - c_i)^2$$

to reduce the processing load of the server. In addition, after calculating the inter-centroid distances, the server can store them. The server can process multiple data samples at the same time, and when it processes other data samples in the same round it can look up the stored inter-centroid distances instead of calculating them again.

For example, let a₁ = (1,2,3), c₁ = (2,2,4), c₂ = (2,2,5) and c₃ = (3,5,6). By the (un-rooted) Euclidean distance formula, r = 2, d₂ = 1 and d₃ = 14, so h′₂ = |r − d₂| = 1 and h′₃ = |r − d₃| = 12.
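The difference-based lower bound can be sketched as follows. This is an illustrative reading of the text, not a confirmed implementation: the function names are assumptions, the difference is taken as an absolute value, and the un-rooted distance is used as the text describes. For true (square-rooted) Euclidean distances the triangle inequality gives |r − d| as a lower bound; the sketch simply applies the same difference to the un-rooted values.

```python
def squared_distance(x, c):
    """Un-rooted Euclidean distance, as used in the embodiment."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))

def difference_lower_bounds(a, centroids, first):
    """Return (r, bounds), where bounds[j] is h'_j for every centroid j != first.

    a         -- target data sample
    centroids -- list of centroid vectors
    first     -- index of the centroid of the sample's current (first) category
    """
    r = squared_distance(a, centroids[first])
    # Inter-centroid distances d_j. They do not depend on the sample, so they
    # can be computed once per round and shared by all samples.
    d = {
        j: squared_distance(centroids[first], c)
        for j, c in enumerate(centroids)
        if j != first
    }
    return r, {j: abs(r - dj) for j, dj in d.items()}

a1 = (1, 2, 3)
centroids = [(2, 2, 4), (2, 2, 5), (3, 5, 6)]
r, bounds = difference_lower_bounds(a1, centroids, first=0)
```

Each bound costs one subtraction once the inter-centroid distances are cached, compared with m multiplications for a full distance calculation.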

Optionally, the distance lower bound can be determined from the projection onto a unit vector of the difference vector between the target data sample and the centroid of each other cluster category. Accordingly, step 102 can be processed as follows: according to the target data sample and the centroids corresponding to the cluster categories other than the first cluster category to which the target data sample belongs, determine the projection onto a unit vector of the difference vector between the target data sample and the centroid corresponding to each other cluster category, and use the length of the projection corresponding to the centroid of each other cluster category as the distance lower bound between that centroid and the target data sample.

In implementation, a₁ and c₂, c₃, ..., c_k are all m-dimensional vectors. The server can determine the difference vectors between a₁ and each of c₂, c₃, ..., c_k, and then determine the projection of each difference vector onto a unit vector; the unit vector can point in any direction. The lengths of these projections can be denoted in turn l₂, l₃, ..., l_k. The server can calculate the length l₂ of the projection onto the unit vector of the difference vector between a₁ and c₂ and use l₂ as the distance lower bound h′₂ between a₁ and c₂, calculate the length l₃ of the projection of the difference vector between a₁ and c₃ and use l₃ as the distance lower bound h′₃ between a₁ and c₃, and so on, until it calculates the length l_k of the projection of the difference vector between a₁ and c_k and uses l_k as the distance lower bound h′_k between a₁ and c_k.

For example, let a₁ = (1,2,3), c₂ = (2,2,5) and c₃ = (3,5,6), and let the unit vector be u = (0,0,1). By the projection formula, h′₂ = l₂ = 2 and h′₃ = l₃ = 3.
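The projection-based lower bound can be sketched as follows, using the values of the example above. The function name is an assumption made for illustration. By the Cauchy-Schwarz inequality, the length of the projection of a vector onto a unit vector never exceeds the true Euclidean length of that vector.

```python
def projection_lower_bound(a, c, u):
    """Length of the projection of the difference vector (c - a) onto unit vector u."""
    return abs(sum((ci - ai) * ui for ai, ci, ui in zip(a, c, u)))

a1 = (1, 2, 3)
c2 = (2, 2, 5)
c3 = (3, 5, 6)
u = (0, 0, 1)            # unit vector along the third coordinate axis

l2 = projection_lower_bound(a1, c2, u)
l3 = projection_lower_bound(a1, c3, u)
```

With an axis-aligned unit vector such as u, the projection reduces to the difference of a single coordinate, so the bound needs one subtraction per centroid.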

Optionally, the distance lower bound can be calculated according to the mean-variance inequality. Accordingly, step 102 can be processed as follows: according to the target data sample and the centroids corresponding to the cluster categories other than the first cluster category to which the target data sample belongs, determine the mean and variance of the target data sample and of the centroid corresponding to each other cluster category; according to these means and variances and the mean-variance inequality, determine the distance lower bound between the centroid of each other cluster category and the target data sample.

In implementation, the server can calculate the mean and standard deviation of a₁ and of each of c₂, c₃, ..., c_k, where the mean is the average of the coordinates of the m-dimensional vector and the standard deviation is calculated from those coordinates and the mean. The mean of a₁ is

$$\mu_{a1} = \frac{1}{m}\sum_{i=1}^{m} a_{1i}$$

and its standard deviation is

$$\sigma_{a1} = \sqrt{\frac{1}{m}\sum_{i=1}^{m} (a_{1i} - \mu_{a1})^2}$$

The mean μ_c2 and standard deviation σ_c2 of c₂ are calculated in the same way from the coordinates of c₂, and so on up to the mean μ_ck and standard deviation σ_ck of c_k. According to the mean-standard-deviation inequality, the distance lower bound between a₁ and c₂ can be calculated as h′₂ = m[(μ_a1 − μ_c2)² + (σ_a1 − σ_c2)²], and the distance lower bound between a₁ and c_k as h′_k = m[(μ_a1 − μ_ck)² + (σ_a1 − σ_ck)²], where m is the dimension of the vectors.

For example, let a₁ = (1,2,3), c₂ = (2,2,5) and c₃ = (3,5,6). Then the distance lower bound between a₁ and c₂ is h′₂ = 3 × [(2 − 3)² + (0.81 − 1.41)²] = 4.08, and the distance lower bound between a₁ and c₃ is h′₃ = 3 × [(2 − 4.67)² + (0.81 − 1.24)²] = 21.9.
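The mean and standard deviation bound can be sketched as follows. The function names are assumptions, and the population standard deviation (dividing by m) is assumed because it reproduces the figures 0.81, 1.41 and 1.24 in the example above. Since the sketch does not round the intermediate standard deviations, its results differ slightly from the rounded values in the example.

```python
import math

def mean_and_std(v):
    """Mean and population standard deviation of the coordinates of vector v."""
    m = len(v)
    mu = sum(v) / m
    sigma = math.sqrt(sum((x - mu) ** 2 for x in v) / m)
    return mu, sigma

def mean_std_lower_bound(a, c):
    """h' = m * [(mu_a - mu_c)^2 + (sigma_a - sigma_c)^2] for m-dimensional a and c."""
    m = len(a)
    mu_a, sigma_a = mean_and_std(a)
    mu_c, sigma_c = mean_and_std(c)
    return m * ((mu_a - mu_c) ** 2 + (sigma_a - sigma_c) ** 2)

a1 = (1, 2, 3)
h2 = mean_std_lower_bound(a1, (2, 2, 5))
h3 = mean_std_lower_bound(a1, (3, 5, 6))
```

The mean and standard deviation of each vector can be computed once and reused, so each bound then costs only a few arithmetic operations regardless of m.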

In implementation, after the server has calculated the distance lower bounds between a₁ and c₂, c₃, ..., c_k (i.e., h′₂, h′₃, ..., h′_k) and the sample-centroid distance between a₁ and c₁ (i.e., r), it can compare each of h′₂, h′₃, ..., h′_k with r, determine the distance lower bounds that are less than r, and choose the centroids corresponding to those lower bounds. Centroids whose distance lower bound is greater than or equal to r are excluded by the server and receive no further processing. For example, with r = 2 and h′₂ = 1, h′₂ < r, so the centroid c₂ corresponding to h′₂ is chosen; the lower bound h′₃ calculated in the first example above is greater than r, so the corresponding centroid c₃ is excluded.

In implementation, after choosing centroids, the server can calculate the distance between a₁ and each chosen centroid and the distance between a₁ and c₁, compare the calculated distances, determine the centroid with the minimum distance to a₁, and assign a₁ to the cluster category corresponding to that centroid. In this way, the selection step filters out some of the centroids, and the server only calculates the distance between the centroid of the first cluster category and the target data sample and the distances between the chosen centroids and the target data sample. It does not need to calculate the distance between the target data sample and all centroids, which saves processing resources of the server.

For example, let a₁ = (1,2,3) and c₁ = (2,2,4), and let the chosen centroid be c₂ = (2,2,5). The server can calculate the distance between a₁ and c₁, which is 2, and the distance between a₁ and c₂, which is 5, and can then assign a₁ to the cluster category corresponding to c₁.
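Steps 103 and 104 together can be sketched as follows. This is a minimal illustration under stated assumptions: `assign_with_pruning` and its `lower_bound` callable are hypothetical names, and the difference-based bound on un-rooted distances is plugged in only because it matches the running example; any of the optional lower bounds could be passed instead.

```python
def squared_distance(x, c):
    """Un-rooted Euclidean distance, as used in the embodiment."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))

def difference_bound(a, centroids, first, j):
    """Difference between the sample-centroid distance r and the inter-centroid distance d_j."""
    r = squared_distance(a, centroids[first])
    d_j = squared_distance(centroids[first], centroids[j])
    return abs(r - d_j)

def assign_with_pruning(a, centroids, first, lower_bound):
    """Return the index of the category that sample a is assigned to.

    lower_bound(a, centroids, first, j) returns the distance lower bound h'_j
    between a and centroid j (a hypothetical callable for this sketch).
    """
    r = squared_distance(a, centroids[first])
    best, best_dist = first, r
    for j, c in enumerate(centroids):
        if j == first:
            continue
        # Step 103: keep only centroids whose lower bound is less than r.
        if lower_bound(a, centroids, first, j) >= r:
            continue
        # Step 104: full distance calculation only for the chosen centroids.
        dist = squared_distance(a, c)
        if dist < best_dist:
            best, best_dist = j, dist
    return best

centroids = [(2, 2, 4), (2, 2, 5), (3, 5, 6)]
category = assign_with_pruning((1, 2, 3), centroids, 0, difference_bound)
```

In this example only one of the two other centroids survives the selection step, so one full distance calculation is skipped.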

Optionally, processing units can be obtained according to the number of chosen centroids and used to cluster the target data sample. Accordingly, step 104 can be processed as follows: determine a first number of processing units according to the number of chosen centroids; obtain the first number of processing units from a preset processing unit pool; by means of the obtained processing units, determine, among the chosen centroids and the centroid corresponding to the first cluster category, the centroid with the minimum distance to the target data sample, and assign the target data sample to the cluster category corresponding to the determined centroid.

In implementation, a processing unit pool can be preset in the server. The pool can contain multiple processing units, each of which can calculate the distance between a₁ and one centroid. After choosing centroids, the server can determine the number of chosen centroids, determine the first number of processing units from that number, and then obtain the first number of processing units from the processing unit pool. If the distance between a₁ and c₁ (i.e., the sample-centroid distance) is already known, the server can use the number of chosen centroids as the first number and obtain that many processing units from the pool. If the distance between a₁ and c₁ is not known, the server uses the number of chosen centroids plus one as the first number and obtains that many processing units from the pool. After obtaining the processing units, the server can use them to determine, among the chosen centroids and c₁, the centroid with the minimum distance to a₁, and assign a₁ to the cluster category corresponding to the determined centroid.
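The processing unit pool can be sketched as follows. Python's `concurrent.futures.ThreadPoolExecutor` is used here purely as a stand-in for the pool of processing units, which the text does not specify further; the function name and parameters are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def squared_distance(x, c):
    """Un-rooted Euclidean distance, as used in the embodiment."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))

def nearest_with_pool(a, centroids, first, chosen, r=None):
    """Return the index of the centroid nearest to a among `chosen` and `first`.

    chosen -- indices of the centroids kept by the selection step
    r      -- sample-centroid distance to centroids[first] if already known, else None
    """
    if r is None:
        # Distance to the first centroid is unknown: one extra unit is needed.
        to_compute = [first] + list(chosen)
        distances = {}
    else:
        to_compute = list(chosen)
        distances = {first: r}
    if to_compute:
        # The "first number" of processing units: one worker per distance.
        with ThreadPoolExecutor(max_workers=len(to_compute)) as pool:
            results = pool.map(lambda j: squared_distance(a, centroids[j]), to_compute)
            distances.update(zip(to_compute, results))
    return min(distances, key=distances.get)

centroids = [(2, 2, 4), (2, 2, 5), (3, 5, 6)]
nearest = nearest_with_pool((1, 2, 3), centroids, first=0, chosen=[1], r=2)
```

Because the pool size follows the number of chosen centroids, a sample whose centroids are mostly excluded by the selection step occupies correspondingly few processing units.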

Embodiment three

Based on the same technical concept, an embodiment of the present invention also provides a clustering device for data samples. As shown in Fig. 2, the device includes:

an acquisition module 210, configured to obtain a target data sample and the centroid corresponding to each cluster category;

a determining module 220, configured to determine, according to the target data sample and the centroids corresponding to the cluster categories other than the first cluster category to which the target data sample belongs, a distance lower bound between the centroid of each other cluster category and the target data sample;

a choosing module 230, configured to choose, among the centroids corresponding to the other cluster categories, the centroids whose distance lower bound is less than the sample-centroid distance between the target data sample and the centroid corresponding to the first cluster category;

a cluster module 240, configured to determine, among the chosen centroids and the centroid corresponding to the first cluster category, the centroid with the minimum distance to the target data sample, and to assign the target data sample to the cluster category corresponding to the centroid with the minimum distance to the target data sample.

Optionally, the determining module 220 is configured to:

Optionally, the cluster module 240 is configured to:

It should be noted that when the clustering device for data samples provided by the above embodiment clusters data samples, the division into the above functional modules is only an example. In practical applications, the above functions can be assigned to different functional modules as needed; that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above. In addition, the clustering device for data samples provided by the above embodiment and the embodiments of the clustering method for data samples belong to the same concept; for the specific implementation process, refer to the method embodiments, which is not repeated here.

Embodiment four

Based on the same technical concept as the above clustering method for data samples, an embodiment of the present application also provides a server, whose structure is shown in Fig. 3. The server includes a processor 310, a transceiver 320 and a memory 330; the transceiver 320 and the memory 330 are each connected to the processor 310, wherein:

the memory 330 is configured to obtain, through the transceiver 320, a target data sample and the centroid corresponding to each cluster category;

the processor 310 is configured to determine, according to the target data sample and the centroids corresponding to the cluster categories other than the first cluster category to which the target data sample belongs, a distance lower bound between the centroid of each other cluster category and the target data sample;

the processor 310 is further configured to choose, among the centroids corresponding to the other cluster categories, the centroids whose distance lower bound is less than the sample-centroid distance between the target data sample and the centroid corresponding to the first cluster category;

the processor 310 is further configured to determine, among the chosen centroids and the centroid corresponding to the first cluster category, the centroid with the minimum distance to the target data sample, and to assign the target data sample to the cluster category corresponding to the centroid with the minimum distance to the target data sample.

Optionally, the processor 310 is configured to:

Those of ordinary skill in the art will understand that all or part of the steps of the above embodiments can be completed by hardware, or by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, which can be a read-only memory, a magnetic disk, an optical disc or the like.

The above are only preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims

1. A clustering method of data samples, characterized in that the method comprises:

determining, according to the target data sample and the barycenters corresponding to the other cluster classifications besides the first cluster classification to which the target data sample belongs, the distance lower limit between the barycenter of each other cluster classification and the target data sample;

selecting, among the barycenters corresponding to the other cluster classifications, each barycenter whose corresponding distance lower limit is less than the sample centroid distance between the target data sample and the barycenter corresponding to the first cluster classification;

determining, among the selected barycenters and the barycenter corresponding to the first cluster classification, the barycenter with the minimum distance to the target data sample, and assigning the target data sample to the cluster classification corresponding to the barycenter with the minimum distance to the target data sample.

2. The method according to claim 1, characterized in that the determining, according to the target data sample and the barycenters corresponding to the other cluster classifications besides the first cluster classification to which the target data sample belongs, the distance lower limit between the barycenter of each other cluster classification and the target data sample comprises:

determining the sample centroid distance between the target data sample and the barycenter corresponding to the first cluster classification to which the target data sample belongs, and the inter-barycenter distance between the barycenter corresponding to the first cluster classification and the barycenter corresponding to each other cluster classification;

determining the difference between the sample centroid distance and each inter-barycenter distance, and taking the difference corresponding to the barycenter of each other cluster classification as the distance lower limit between the barycenter of that other cluster classification and the target data sample.
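The lower limit in claim 2 follows from the triangle inequality: for a sample x, its current barycenter c1 and another barycenter cj, the distance d(x, cj) is at least |d(c1, cj) - d(x, c1)|. The following Python sketch illustrates this under stated assumptions: Euclidean distance, and taking the absolute value of the difference so that the bound is never negative (the claim only says "difference"). The function name `triangle_lower_limits` is hypothetical.

```python
import math


def triangle_lower_limits(sample, barycenters, current):
    """Lower limits on distance(sample, barycenters[j]) via the triangle inequality."""
    # Sample centroid distance to the barycenter of the current (first) cluster.
    d_sample = math.dist(sample, barycenters[current])
    limits = []
    for j, barycenter in enumerate(barycenters):
        if j == current:
            # The distance to the current barycenter is known exactly.
            limits.append(d_sample)
        else:
            # Inter-barycenter distance; in practice it could be computed once
            # per iteration and shared by every sample in the same cluster.
            d_between = math.dist(barycenters[current], barycenter)
            limits.append(abs(d_between - d_sample))
    return limits
```

Because the inter-barycenter distances do not depend on the sample, the cost per sample reduces to one exact distance plus one subtraction per other barycenter.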

3. The method according to claim 1, characterized in that the determining, according to the target data sample and the barycenters corresponding to the other cluster classifications besides the first cluster classification to which the target data sample belongs, the distance lower limit between the barycenter of each other cluster classification and the target data sample comprises:

determining, according to the target data sample and the barycenters corresponding to the other cluster classifications besides the first cluster classification to which the target data sample belongs, the projection onto a unit vector of the difference vector between the target data sample and the barycenter corresponding to each other cluster classification, and taking the length of the projection corresponding to the barycenter of each other cluster classification as the distance lower limit between the barycenter of that other cluster classification and the target data sample, wherein the unit vector is a vector in any direction.
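The projection bound in claim 3 holds because the length of the projection of a vector onto any unit vector never exceeds the length of the vector itself (Cauchy-Schwarz). A minimal Python sketch, assuming Euclidean distance; the function name `projection_lower_limit` is hypothetical, and normalizing the supplied direction inside the function is an added convenience rather than something the claim describes.

```python
import math


def projection_lower_limit(sample, barycenter, direction):
    """Length of the projection of (sample - barycenter) onto `direction`.

    `direction` may be any non-zero vector; it is normalized to unit length here.
    The result never exceeds the Euclidean distance between sample and barycenter.
    """
    norm = math.sqrt(sum(v * v for v in direction))
    if norm == 0.0:
        raise ValueError("direction must be a non-zero vector")
    # Dot product of the difference vector with the direction.
    dot = sum((s - c) * v for s, c, v in zip(sample, barycenter, direction))
    return abs(dot) / norm
```

The bound costs a single dot product, and its tightness depends on how well the chosen direction aligns with the difference vector.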

4. The method according to claim 1, characterized in that the determining, according to the target data sample and the barycenters corresponding to the other cluster classifications besides the first cluster classification to which the target data sample belongs, the distance lower limit between the target data sample and the barycenter of each other cluster classification comprises:

determining, according to the target data sample and the barycenters corresponding to the other cluster classifications besides the first cluster classification to which the target data sample belongs, the mean and variance of the target data sample and of the barycenter corresponding to each other cluster classification;

determining, according to the mean and variance of the target data sample and of the barycenter corresponding to each other cluster classification, and a variance inequality, the distance lower limit between the barycenter of each other cluster classification and the target data sample.
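The claim does not spell out the inequality it relies on. One well-known bound of this kind, used here purely as an assumption, is that for two n-dimensional vectors x and y with component means mx, my and population standard deviations sx, sy, the squared Euclidean distance satisfies ||x - y||^2 >= n * ((mx - my)^2 + (sx - sy)^2). The Python sketch below computes that bound; the function names `mean_std` and `mean_variance_lower_limit` are hypothetical.

```python
import math


def mean_std(vector):
    """Mean and population standard deviation of a vector's components."""
    n = len(vector)
    mean = sum(vector) / n
    variance = sum((v - mean) ** 2 for v in vector) / n
    return mean, math.sqrt(variance)


def mean_variance_lower_limit(sample, barycenter):
    """Lower limit on the Euclidean distance from component means and deviations.

    Assumed inequality (not quoted from the patent):
    ||x - y||^2 >= n * ((mean_x - mean_y)^2 + (std_x - std_y)^2)
    """
    n = len(sample)
    mean_s, std_s = mean_std(sample)
    mean_c, std_c = mean_std(barycenter)
    return math.sqrt(n * ((mean_s - mean_c) ** 2 + (std_s - std_c) ** 2))
```

Since the mean and standard deviation of each sample and each barycenter can be computed once and reused, evaluating the bound for a sample-barycenter pair costs a constant number of operations regardless of the dimension.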

5. The method according to claim 1, characterized in that the determining, among the selected barycenters and the barycenter corresponding to the first cluster classification, the barycenter with the minimum distance to the target data sample, and assigning the target data sample to the cluster classification corresponding to the barycenter with the minimum distance to the target data sample comprises:

determining, by an acquired processing unit, among the selected barycenters and the barycenter corresponding to the first cluster classification, the barycenter with the minimum distance to the target data sample, and assigning the target data sample to the cluster classification corresponding to the determined barycenter.

6. A clustering apparatus of data samples, characterized in that the apparatus comprises:

a determining module, configured to determine, according to the target data sample and the barycenters corresponding to the other cluster classifications besides the first cluster classification to which the target data sample belongs, the distance lower limit between the barycenter of each other cluster classification and the target data sample;

a choosing module, configured to select, among the barycenters corresponding to the other cluster classifications, each barycenter whose corresponding distance lower limit is less than the sample centroid distance between the target data sample and the barycenter corresponding to the first cluster classification;

a cluster module, configured to determine, among the selected barycenters and the barycenter corresponding to the first cluster classification, the barycenter with the minimum distance to the target data sample, and to assign the target data sample to the cluster classification corresponding to the barycenter with the minimum distance to the target data sample.

7. The apparatus according to claim 6, characterized in that the determining module is configured to:

8. The apparatus according to claim 6, characterized in that the determining module is configured to:

9. The apparatus according to claim 6, characterized in that the determining module is configured to:

10. The apparatus according to claim 6, characterized in that the cluster module is configured to: