Method and device for clustering data samples
Technical field
The present invention relates to the field of computer technology, and in particular to a method and device for clustering data samples.
Background
With the development of computer technology, computers are applied more and more widely and their functions are increasingly comprehensive. People can use a computer (such as a server) to perform various kinds of data processing, such as data clustering and data statistics, and each item of data to be processed may be referred to as a data sample.
When clustering the data samples in a data sample set, a server may randomly select, according to a preset number of cluster categories, that number of data samples from the data samples to be clustered as the centroids of the cluster categories. For each data sample in the data sample set, the server calculates the distance between the data sample and each centroid. This distance represents how close the data sample is to the centroid, and it can be calculated in many ways, for example with the Euclidean distance algorithm. The server determines the centroid with the smallest distance to the data sample, assigns the data sample to the category to which that centroid belongs, and then calculates the average of all data samples in the category as the new centroid of the category. The server repeats this calculation: it calculates the distance between each data sample and the updated centroids, clusters the data samples again, and then recalculates the average of all data samples in each category as the updated centroid, until the data samples in each category no longer change.
In the course of implementing the present invention, the inventor found that the prior art has at least the following problem:
When clustering a given data sample, the server needs to calculate the distance between the data sample and every centroid. The amount of calculation is large and occupies a large amount of the server's processing resources.
Summary of the invention
To solve the problem of the prior art, embodiments of the present invention provide a method and device for clustering data samples. The technical solution is as follows:
In a first aspect, a method for clustering data samples is provided, the method including:
obtaining a target data sample and the centroid corresponding to each cluster category;
determining, according to the target data sample and the centroids corresponding to the cluster categories other than a first cluster category to which the target data sample belongs, a distance lower bound between the centroid of each other cluster category and the target data sample;
selecting, from the centroids corresponding to the other cluster categories, the centroids whose distance lower bound is less than the sample-centroid distance between the target data sample and the centroid corresponding to the first cluster category;
determining, among the selected centroids and the centroid corresponding to the first cluster category, the centroid with the smallest distance to the target data sample, and assigning the target data sample to the cluster category corresponding to that centroid.
With reference to the first aspect, in a first possible implementation of the first aspect, the determining, according to the target data sample and the centroids corresponding to the cluster categories other than the first cluster category to which the target data sample belongs, of the distance lower bound between the centroid of each other cluster category and the target data sample includes:
determining the sample-centroid distance between the target data sample and the centroid corresponding to the first cluster category to which the target data sample belongs, and the inter-centroid distance between the centroid corresponding to the first cluster category and the centroid corresponding to each other cluster category;
determining the difference between the sample-centroid distance and each inter-centroid distance, and using the difference corresponding to the centroid of each other cluster category as the distance lower bound between that centroid and the target data sample.
With reference to the first aspect, in a second possible implementation of the first aspect, the determining, according to the target data sample and the centroids corresponding to the cluster categories other than the first cluster category to which the target data sample belongs, of the distance lower bound between the centroid of each other cluster category and the target data sample includes:
determining, according to the target data sample and the centroids corresponding to the cluster categories other than the first cluster category to which the target data sample belongs, the projection onto a unit vector of the difference vector between the target data sample and the centroid corresponding to each other cluster category, and using the length of the projection corresponding to the centroid of each other cluster category as the distance lower bound between that centroid and the target data sample.
With reference to the first aspect, in a third possible implementation of the first aspect, the determining, according to the target data sample and the centroids corresponding to the cluster categories other than the first cluster category to which the target data sample belongs, of the distance lower bound between the target data sample and the centroids of the other cluster categories includes:
determining, according to the target data sample and the centroids corresponding to the cluster categories other than the first cluster category to which the target data sample belongs, the mean and variance of the target data sample and of the centroid corresponding to each other cluster category;
determining, according to those means and variances and the mean-standard-deviation inequality, the distance lower bound between the centroid of each other cluster category and the target data sample.
With reference to the first aspect, in a fourth possible implementation of the first aspect, the determining, among the selected centroids and the centroid corresponding to the first cluster category, of the centroid with the smallest distance to the target data sample, and the assigning of the target data sample to the cluster category corresponding to that centroid, includes:
determining a first number of processing units according to the number of selected centroids;
obtaining the first number of processing units from a preset processing unit pool;
determining, by means of the obtained processing units, the centroid with the smallest distance to the target data sample among the selected centroids and the centroid corresponding to the first cluster category, and assigning the target data sample to the cluster category corresponding to the determined centroid.
In a second aspect, a device for clustering data samples is provided, the device including:
an acquisition module, configured to obtain a target data sample and the centroid corresponding to each cluster category;
a determining module, configured to determine, according to the target data sample and the centroids corresponding to the cluster categories other than a first cluster category to which the target data sample belongs, a distance lower bound between the centroid of each other cluster category and the target data sample;
a selection module, configured to select, from the centroids corresponding to the other cluster categories, the centroids whose distance lower bound is less than the sample-centroid distance between the target data sample and the centroid corresponding to the first cluster category;
a clustering module, configured to determine, among the selected centroids and the centroid corresponding to the first cluster category, the centroid with the smallest distance to the target data sample, and to assign the target data sample to the cluster category corresponding to that centroid.
With reference to the second aspect, in a first possible implementation of the second aspect, the determining module is configured to:
determine the sample-centroid distance between the target data sample and the centroid corresponding to the first cluster category to which the target data sample belongs, and the inter-centroid distance between the centroid corresponding to the first cluster category and the centroid corresponding to each other cluster category;
determine the difference between the sample-centroid distance and each inter-centroid distance, and use the difference corresponding to the centroid of each other cluster category as the distance lower bound between that centroid and the target data sample.
With reference to the second aspect, in a second possible implementation of the second aspect, the determining module is configured to:
determine, according to the target data sample and the centroids corresponding to the cluster categories other than the first cluster category to which the target data sample belongs, the projection onto a unit vector of the difference vector between the target data sample and the centroid corresponding to each other cluster category, and use the length of the projection corresponding to the centroid of each other cluster category as the distance lower bound between that centroid and the target data sample.
With reference to the second aspect, in a third possible implementation of the second aspect, the determining module is configured to:
determine, according to the target data sample and the centroids corresponding to the cluster categories other than the first cluster category to which the target data sample belongs, the mean and variance of the target data sample and of the centroid corresponding to each other cluster category;
determine, according to those means and variances and the mean-standard-deviation inequality, the distance lower bound between the centroid of each other cluster category and the target data sample.
With reference to the second aspect, in a fourth possible implementation of the second aspect, the clustering module is configured to:
determine a first number of processing units according to the number of selected centroids;
obtain the first number of processing units from a preset processing unit pool;
determine, by means of the obtained processing units, the centroid with the smallest distance to the target data sample among the selected centroids and the centroid corresponding to the first cluster category, and assign the target data sample to the cluster category corresponding to the determined centroid.
The technical solution provided by the embodiments of the present invention has the following beneficial effects:
In the embodiments of the present invention, a target data sample and the centroid corresponding to each cluster category are obtained. According to the target data sample and the centroids corresponding to the cluster categories other than the first cluster category to which the target data sample belongs, a distance lower bound between the centroid of each other cluster category and the target data sample is determined. From the centroids corresponding to the other cluster categories, the centroids whose distance lower bound is less than the sample-centroid distance between the target data sample and the centroid corresponding to the first cluster category are selected. Among the selected centroids and the centroid corresponding to the first cluster category, the centroid with the smallest distance to the target data sample is determined, and the target data sample is assigned to the cluster category corresponding to that centroid. In this way, only the distance between the target data sample and the centroid corresponding to the first cluster category and the distances between the target data sample and the selected centroids need to be calculated, rather than the distances between the target data sample and all centroids. Because determining a distance lower bound requires far less calculation than calculating the distance between a centroid and the target data sample, the processing resources of the server are saved.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings needed for describing the embodiments are briefly introduced below. Evidently, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a method for clustering data samples according to an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of a device for clustering data samples according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a server according to an embodiment of the present invention.
Detailed description
To make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
Embodiment one
An embodiment of the present invention provides a method for clustering data samples. As shown in Fig. 1, the processing flow of the method may include the following steps:
Step 101: obtain a target data sample and the centroid corresponding to each cluster category.
Step 102: according to the target data sample and the centroids corresponding to the cluster categories other than the first cluster category to which the target data sample belongs, determine the distance lower bound between the centroid of each other cluster category and the target data sample.
Step 103: from the centroids corresponding to the other cluster categories, select the centroids whose distance lower bound is less than the sample-centroid distance between the target data sample and the centroid corresponding to the first cluster category.
Step 104: among the selected centroids and the centroid corresponding to the first cluster category, determine the centroid with the smallest distance to the target data sample, and assign the target data sample to the cluster category corresponding to that centroid.
In the embodiments of the present invention, a target data sample and the centroid corresponding to each cluster category are obtained. According to the target data sample and the centroids corresponding to the cluster categories other than the first cluster category to which the target data sample belongs, a distance lower bound between the centroid of each other cluster category and the target data sample is determined. From the centroids corresponding to the other cluster categories, the centroids whose distance lower bound is less than the sample-centroid distance between the target data sample and the centroid corresponding to the first cluster category are selected. Among the selected centroids and the centroid corresponding to the first cluster category, the centroid with the smallest distance to the target data sample is determined, and the target data sample is assigned to the cluster category corresponding to that centroid. In this way, only the distance between the target data sample and the centroid corresponding to the first cluster category and the distances between the target data sample and the selected centroids need to be calculated, rather than the distances between the target data sample and all centroids. Because determining a distance lower bound requires far less calculation than calculating the distance between a centroid and the target data sample, the processing resources of the server are saved.
Embodiment two
An embodiment of the present invention provides a method for clustering data samples, and the method is executed by a server. The server may be a background server with a clustering function.
When the server clusters the data samples in a data sample set, the processing can be divided into the following steps. Step 1: according to a preset number of cluster categories, randomly select that number of data samples from the data samples to be clustered as the centroids of the cluster categories. Step 2: for each data sample, calculate the data distance between the data sample and each centroid (the data distance represents how close the data sample is to the centroid), determine the centroid with the smallest data distance to the data sample, and assign the data sample to the category to which that centroid belongs. Step 3: calculate the average of all data samples in each category as the centroid of that category. Step 4: repeat steps 2 and 3, that is, calculate the data distance between each data sample and the updated centroids, cluster the data samples again, and then recalculate the average of all data samples in each category as the updated centroid, until the data samples in each category no longer change.
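The four baseline steps above can be sketched as follows. This is a minimal illustration, not the implementation used by the embodiment: the function names, the `max_rounds` safety limit, the fixed random seed, and the choice to keep the old centroid when a category becomes empty are all assumptions made for the example. The squared Euclidean distance is used as the data distance.

```python
import random


def squared_distance(x, c):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def mean_vector(points):
    """Coordinate-wise average of a non-empty list of vectors."""
    m = len(points[0])
    return tuple(sum(p[i] for p in points) / len(points) for i in range(m))


def kmeans(samples, k, max_rounds=100, seed=0):
    """Baseline clustering: every sample is compared with every centroid."""
    rng = random.Random(seed)
    # Step 1: pick k samples at random as the initial centroids.
    centroids = list(rng.sample(samples, k))
    labels = None
    for _ in range(max_rounds):
        # Step 2: assign each sample to the category of its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(s, centroids[j]))
            for s in samples
        ]
        # Step 4: stop once the assignment no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: recompute each centroid as the average of its members.
        for j in range(k):
            members = [s for s, lab in zip(samples, labels) if lab == j]
            if members:  # assumption: an empty category keeps its old centroid
                centroids[j] = mean_vector(members)
    return labels, centroids


labels, centroids = kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], k=2)
```

Step 2 in this sketch performs k distance calculations per sample in every round, which is the cost the solution described below sets out to reduce.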
This solution improves step 2 of the above processing. The processing flow shown in Fig. 1 is described in detail below with reference to specific implementations, and may be as follows:
Step 101: obtain a target data sample and the centroid corresponding to each cluster category.
In implementation, the server may obtain the data sample that needs to be clustered (the target data sample) and the centroid corresponding to each cluster category. In the first round of clustering, the server may randomly select, according to the preset number of cluster categories, that number of data samples from the data samples to be clustered as the centroids of the cluster categories. In subsequent rounds of clustering, the centroid corresponding to each cluster category may be the average of all data samples in that category after the previous round.
Step 102: according to the target data sample and the centroids corresponding to the cluster categories other than the first cluster category to which the target data sample belongs, determine the distance lower bound between the centroid of each other cluster category and the target data sample.
In implementation, a data sample may have multiple attributes. For example, when a data sample is a user, its attributes may be monthly spending, time spent online, age, gender, and so on. A data sample can be represented by an m-dimensional vector. For example, the target data sample can be represented by the vector a1 = {a11, a12, ..., a1m}, and the data sample set can be denoted {a}. When {a} is clustered, the number of cluster categories can be preset, for example to k. The centroids corresponding to these k cluster categories can be denoted c1, c2, c3, ..., ck, each of which is an m-dimensional vector. Taking the case where a1 belongs to the cluster category corresponding to c1 (the first cluster category) as an example, after obtaining the target data sample and the centroid corresponding to each cluster category, the server can determine, according to a1 and c2, c3, ..., ck, the distance lower bound between a1 and each of c2, c3, ..., ck. A distance lower bound is denoted h': the distance lower bound between a1 and c2 is h'2, that between a1 and c3 is h'3, and so on.
Optionally, the distance lower bound corresponding to the centroid of each other cluster category can be determined from the sample-centroid distance and the inter-centroid distances. Correspondingly, step 102 may be processed as follows: determine the sample-centroid distance between the target data sample and the centroid corresponding to the first cluster category to which the target data sample belongs, and the inter-centroid distance between the centroid corresponding to the first cluster category and the centroid corresponding to each other cluster category; determine the difference between the sample-centroid distance and each inter-centroid distance, and use the difference corresponding to the centroid of each other cluster category as the distance lower bound between that centroid and the target data sample.
In implementation, the server may calculate the sample-centroid distance between a1 and c1, denoted r, and may also calculate the inter-centroid distances between c1 and each of c2, c3, ..., ck, denoted d2, ..., dk in turn. The server then calculates the difference between r and d2 and uses it as the distance lower bound h'2 between a1 and c2, calculates the difference between r and d3 and uses it as the distance lower bound h'3 between a1 and c3, and so on, until it calculates the difference between r and dk and uses it as the distance lower bound h'k between a1 and ck. The sample-centroid distance and the inter-centroid distances can be calculated in many ways, for example with the Euclidean distance algorithm. In the Euclidean distance algorithm, the distance between any two data samples (including centroids) can be expressed as $d(x, c) = \sqrt{\sum_{i=1}^{m} (x_i - c_i)^2}$, where x and c are the data samples in question, both m-dimensional vectors, $x_i$ is a coordinate of vector x, and $c_i$ is a coordinate of vector c. In the embodiments of the present invention, the Euclidean distance is calculated without taking the square root, that is, $d(x, c) = \sum_{i=1}^{m} (x_i - c_i)^2$, to reduce the processing load of the server. In addition, after calculating the inter-centroid distances, the server may store them. The server may process multiple data samples at the same time, and when processing other data samples in the same round it may read the stored inter-centroid distances instead of recalculating them.
For example, a1 = (1, 2, 3), c1 = (2, 2, 4), c2 = (2, 2, 5) and c3 = (3, 5, 6). According to the Euclidean distance formula without the square root, r = 2, d2 = 1 and d3 = 14, so h'2 = 1 and h'3 = 12.
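The arithmetic of this first option can be sketched as follows. The function names are hypothetical, and taking the absolute value of the difference is an assumption, since the text says only "difference". The sketch follows the embodiment in using the squared Euclidean distance. Note that the triangle inequality, which is the usual justification for a bound of this form, is stated for the non-squared Euclidean distance, so the sketch should be read as an illustration of the calculation rather than as a proof that the bound holds for squared distances.

```python
def squared_distance(x, c):
    """Euclidean distance without the square root, as in the embodiment."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def difference_lower_bounds(sample, centroids, current):
    """Return r, the inter-centroid distances, and the bounds |r - d_j|.

    `centroids` is a list of vectors and `current` is the index of the
    centroid of the first cluster category (both names are assumptions).
    """
    r = squared_distance(sample, centroids[current])
    # The inter-centroid distances do not depend on the sample, so a server
    # could compute them once per round and reuse them for every sample.
    inter = {
        j: squared_distance(centroids[current], centroids[j])
        for j in range(len(centroids))
        if j != current
    }
    bounds = {j: abs(r - d) for j, d in inter.items()}
    return r, inter, bounds


a1 = (1, 2, 3)
centroids = [(2, 2, 4), (2, 2, 5), (3, 5, 6)]  # c1, c2, c3
r, inter, bounds = difference_lower_bounds(a1, centroids, current=0)
```

Each bound costs one subtraction once `r` and the cached inter-centroid distances are available, compared with m subtractions, multiplications and additions for a full distance.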
Optionally, the distance lower bound can be determined from the projection onto a unit vector of the difference vector between the target data sample and the centroid of each other cluster category. Correspondingly, step 102 may be processed as follows: according to the target data sample and the centroids corresponding to the cluster categories other than the first cluster category to which the target data sample belongs, determine the projection onto a unit vector of the difference vector between the target data sample and the centroid corresponding to each other cluster category, and use the length of the projection corresponding to the centroid of each other cluster category as the distance lower bound between that centroid and the target data sample.
In implementation, a1 and c2, c3, ..., ck are all m-dimensional vectors. The server may determine, according to a1 and c2, c3, ..., ck, the difference vector between a1 and each of c2, c3, ..., ck, and then determine the projection of each difference vector onto a unit vector, where the unit vector may point in any direction. The lengths of these projections can be denoted l2, l3, ..., lk in turn. The server may calculate the length l2 of the projection onto the unit vector of the difference vector between a1 and c2 and use l2 as the distance lower bound h'2 between a1 and c2, calculate the length l3 of the projection of the difference vector between a1 and c3 and use l3 as the distance lower bound h'3 between a1 and c3, and so on, until it calculates the length lk of the projection of the difference vector between a1 and ck and uses lk as the distance lower bound h'k between a1 and ck.
For example, a1 = (1, 2, 3), c2 = (2, 2, 5), c3 = (3, 5, 6), and the unit vector is u = (0, 0, 1). According to the projection formula, h'2 = l2 = 2 and h'3 = l3 = 3.
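The projection option can be sketched as below. The function name is hypothetical, and the code assumes that `unit` really has length 1. The length of the projection of a vector onto a unit vector is the absolute value of their dot product, and it never exceeds the (non-squared) Euclidean length of the vector itself, which is what makes it usable as a lower bound.

```python
def projection_lower_bound(sample, centroid, unit):
    """Length of the projection of (sample - centroid) onto a unit vector.

    Assumes `unit` has Euclidean length 1; the result is |diff . unit|.
    """
    diff = [s - c for s, c in zip(sample, centroid)]
    return abs(sum(d * u for d, u in zip(diff, unit)))


a1 = (1, 2, 3)
u = (0, 0, 1)
l2 = projection_lower_bound(a1, (2, 2, 5), u)  # difference vector (-1, 0, -2)
l3 = projection_lower_bound(a1, (3, 5, 6), u)  # difference vector (-2, -3, -3)
```

With an axis-aligned unit vector such as `u`, the dot product reduces to reading a single coordinate of the difference vector, which is where the saving over a full distance calculation comes from.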
Optionally, the distance lower bound can be calculated according to the mean-standard-deviation inequality. Correspondingly, step 102 may be processed as follows: according to the target data sample and the centroids corresponding to the cluster categories other than the first cluster category to which the target data sample belongs, determine the mean and variance of the target data sample and of the centroid corresponding to each other cluster category; according to those means and variances and the mean-standard-deviation inequality, determine the distance lower bound between the centroid of each other cluster category and the target data sample.
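In implementation, the server may calculate the mean and standard deviation of a1 and of each of c2, c3, ..., ck, where the mean is the average of the coordinates of the m-dimensional vector and the standard deviation is the standard deviation of those coordinates calculated from the mean. The mean of a1 is $\mu_{a1} = \frac{1}{m}\sum_{i=1}^{m} a_{1i}$ and its standard deviation is $\sigma_{a1} = \sqrt{\frac{1}{m}\sum_{i=1}^{m} (a_{1i} - \mu_{a1})^2}$; the mean of c2 is $\mu_{c2} = \frac{1}{m}\sum_{i=1}^{m} c_{2i}$ and its standard deviation is $\sigma_{c2} = \sqrt{\frac{1}{m}\sum_{i=1}^{m} (c_{2i} - \mu_{c2})^2}$; and so on, the mean of ck is $\mu_{ck} = \frac{1}{m}\sum_{i=1}^{m} c_{ki}$ and its standard deviation is $\sigma_{ck} = \sqrt{\frac{1}{m}\sum_{i=1}^{m} (c_{ki} - \mu_{ck})^2}$. According to the mean-standard-deviation inequality, the distance lower bound between a1 and c2 can be calculated as $h'_2 = m[(\mu_{a1} - \mu_{c2})^2 + (\sigma_{a1} - \sigma_{c2})^2]$, and the distance lower bound between a1 and ck as $h'_k = m[(\mu_{a1} - \mu_{ck})^2 + (\sigma_{a1} - \sigma_{ck})^2]$, where m is the dimension of the vectors.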
For example, a1 = (1, 2, 3), c2 = (2, 2, 5) and c3 = (3, 5, 6). Then the distance lower bound between a1 and c2 is h'2 = 3*[(2-3)^2 + (0.81-1.41)^2] = 4.08, and the distance lower bound between a1 and c3 is h'3 = 3*[(2-4.67)^2 + (0.81-1.24)^2] = 21.9.
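The mean and standard deviation option can be sketched as follows. The function names are hypothetical. The sketch uses the population standard deviation (division by m), which is the form that reproduces the 0.81 and 1.41 in the example above. Because the code does not round intermediate values, it yields roughly 4.07 and 21.89 rather than the 4.08 and 21.9 obtained from the rounded figures.

```python
import math


def mean_std(v):
    """Mean and population standard deviation of the coordinates of v."""
    m = len(v)
    mu = sum(v) / m
    sigma = math.sqrt(sum((x - mu) ** 2 for x in v) / m)
    return mu, sigma


def mean_std_lower_bound(sample, centroid):
    """Bound m * [(mu_a - mu_c)^2 + (sigma_a - sigma_c)^2] on the squared distance."""
    mu_a, sigma_a = mean_std(sample)
    mu_c, sigma_c = mean_std(centroid)
    return len(sample) * ((mu_a - mu_c) ** 2 + (sigma_a - sigma_c) ** 2)


a1 = (1, 2, 3)
h2 = mean_std_lower_bound(a1, (2, 2, 5))  # about 4.07; squared distance is 5
h3 = mean_std_lower_bound(a1, (3, 5, 6))  # about 21.89; squared distance is 22
```

The mean and standard deviation of each vector depend only on that vector, so they can be calculated once per sample and once per centroid and then combined in constant time for every sample-centroid pair.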
Step 103: from the centroids corresponding to the other cluster categories, select the centroids whose distance lower bound is less than the sample-centroid distance between the target data sample and the centroid corresponding to the first cluster category.
In implementation, after calculating the distance lower bounds between a1 and c2, c3, ..., ck (h'2, h'3, ..., h'k) and the sample-centroid distance between a1 and c1 (r), the server may compare each of h'2, h'3, ..., h'k with r, determine the distance lower bounds that are less than r, and select the centroids corresponding to those distance lower bounds. Centroids whose distance lower bound is greater than or equal to r are excluded and receive no further processing. For example, if r = 2, h'2 = 1 and h'3 = 12, then h'2 < r and h'3 > r, so the centroid c2 corresponding to h'2 is selected and the centroid c3 corresponding to h'3 is excluded.
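The selection in step 103 amounts to a single filter over the lower bounds. The sketch below uses a hypothetical function name and hypothetical values (they are not taken from the examples in this document) and includes a bound equal to r to show that such a centroid is excluded as well.

```python
def select_candidates(bounds, r):
    """Keep only the centroids whose distance lower bound is strictly below r.

    `bounds` maps a centroid label to its lower bound; a centroid whose bound
    is greater than or equal to r cannot be strictly closer to the sample
    than the current centroid, so it is dropped.
    """
    return [label for label, h in bounds.items() if h < r]


# Hypothetical values for illustration only.
r = 4.0
bounds = {"c2": 1.5, "c3": 9.0, "c4": 4.0}
selected = select_candidates(bounds, r)  # c3 and c4 are excluded
```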
Step 104: among the selected centroids and the centroid corresponding to the first cluster category, determine the centroid with the smallest distance to the target data sample, and assign the target data sample to the cluster category corresponding to that centroid.
In implementation, after selecting centroids, the server may calculate the distance between a1 and each selected centroid and the distance between a1 and c1, compare the calculated distances, determine the centroid with the smallest distance to a1, and assign a1 to the cluster category corresponding to that centroid. In this way, the selection step filters out some of the centroids, and the server calculates only the distance between the target data sample and the centroid corresponding to the first cluster category and the distances between the target data sample and the selected centroids, rather than the distances between the target data sample and all centroids, which saves the processing resources of the server.
For example, a1 = (1, 2, 3), c1 = (2, 2, 4), and the selected centroid is c2 = (2, 2, 5). The server calculates the distance between a1 and c1, which is 2, and the distance between a1 and c2, which is 5, and therefore assigns a1 to the cluster category corresponding to c1.
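The final comparison in step 104 can be sketched as below, using the squared Euclidean distance from the embodiment. The function name, the dictionary layout keyed by category number, and the rule that a tie keeps the sample in its current category are assumptions made for the illustration.

```python
def squared_distance(x, c):
    """Euclidean distance without the square root."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def assign(sample, centroids, current, selected):
    """Pick the nearest centroid among the current one and the selected ones.

    `centroids` maps a category number to its centroid vector, `current` is
    the category the sample belongs to, and `selected` lists the categories
    that survived the lower-bound filter.
    """
    best = current
    best_distance = squared_distance(sample, centroids[current])
    for j in selected:
        d = squared_distance(sample, centroids[j])
        if d < best_distance:  # assumption: ties keep the current category
            best, best_distance = j, d
    return best


centroids = {1: (2, 2, 4), 2: (2, 2, 5), 3: (3, 5, 6)}
category = assign((1, 2, 3), centroids, current=1, selected=[2])
```

Only two full distances are calculated here (to c1 and c2); the distance to c3 is never evaluated because c3 was excluded by its lower bound.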
Optionally, processing units can be obtained according to the number of selected centroids and used to cluster the target data sample. Correspondingly, step 104 may be processed as follows: determine a first number of processing units according to the number of selected centroids; obtain the first number of processing units from a preset processing unit pool; by means of the obtained processing units, determine the centroid with the smallest distance to the target data sample among the selected centroids and the centroid corresponding to the first cluster category, and assign the target data sample to the cluster category corresponding to the determined centroid.
In implementation, a processing unit pool may be preset in the server. The pool may contain multiple processing units, each of which can calculate the distance between a1 and one centroid. After selecting centroids, the server may determine the number of selected centroids, determine the first number of processing units according to that number, and then obtain the first number of processing units from the pool. If the distance between a1 and c1 (the sample-centroid distance) is already known, the server may use the number of selected centroids as the first number and obtain that many processing units from the pool. If the distance between a1 and c1 is not known, the server uses the number of selected centroids plus one as the first number and obtains that many processing units from the pool. After obtaining the processing units, the server may use them to determine, among the selected centroids and c1, the centroid with the smallest distance to a1, and assign a1 to the cluster category corresponding to the determined centroid.
In the embodiment of the present invention, a target data sample and the barycenter corresponding to each cluster classification are obtained. According to the target data sample and the barycenters corresponding to the cluster classifications other than the first cluster classification to which the target data sample belongs, a distance lower limit between the barycenter of each other cluster classification and the target data sample is determined. Among the barycenters corresponding to the other cluster classifications, the barycenters whose distance lower limit is less than the sample centroid distance between the target data sample and the barycenter corresponding to the first cluster classification are chosen. Among the chosen barycenters and the barycenter corresponding to the first cluster classification, the barycenter with the minimum distance to the target data sample is determined, and the target data sample is assigned to the cluster classification corresponding to that barycenter. In this way, only the distance between the barycenter corresponding to the first cluster classification and the target data sample, and the distances between the chosen barycenters and the target data sample, need to be calculated; the distances between the target data sample and all barycenters need not be calculated. Because the calculation amount for determining a distance lower limit is much smaller than that for calculating the distance between a barycenter and the target data sample, processing resources of the server are saved.
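The overall flow summarized above can be illustrated with a short sketch. It assumes Euclidean distance and takes the lower-limit computation as a caller-supplied function, since several ways of determining the distance lower limit are described; the name `reassign` and the callback signature are hypothetical and used only for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def reassign(sample, first, barycenters, lower_limit):
    """Return (index of the nearest barycenter, number of full distances computed).

    sample      -- the target data sample
    first       -- index of the first cluster classification (the current one)
    barycenters -- one barycenter per cluster classification
    lower_limit -- hypothetical callable (sample, first, j, d_first) -> float;
                   it must never exceed the true distance to barycenter j
    """
    # Sample centroid distance: target sample to its current barycenter.
    d_first = euclidean(sample, barycenters[first])
    best, best_d, computed = first, d_first, 1
    for j, c in enumerate(barycenters):
        if j == first:
            continue
        # A barycenter whose lower limit is not below d_first cannot be
        # closer than the current barycenter, so it is not chosen.
        if lower_limit(sample, first, j, d_first) >= d_first:
            continue
        d = euclidean(sample, c)
        computed += 1
        if d < best_d:
            best, best_d = j, d
    return best, computed
```

The second return value makes the saving visible: it counts only the full distance calculations that were actually carried out.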
Embodiment three
Based on the same technical concept, an embodiment of the present invention further provides a clustering apparatus for data samples. As shown in Fig. 2, the apparatus includes:
Acquisition module 210, configured to obtain a target data sample and the barycenter corresponding to each cluster classification;
Determining module 220, configured to determine, according to the target data sample and the barycenters corresponding to the cluster classifications other than the first cluster classification to which the target data sample belongs, a distance lower limit between the barycenter of each other cluster classification and the target data sample;
Choosing module 230, configured to choose, among the barycenters corresponding to the other cluster classifications, the barycenters whose corresponding distance lower limit is less than the sample centroid distance between the target data sample and the barycenter corresponding to the first cluster classification;
Cluster module 240, configured to determine, among the chosen barycenters and the barycenter corresponding to the first cluster classification, the barycenter with the minimum distance to the target data sample, and to assign the target data sample to the cluster classification corresponding to the barycenter with the minimum distance to the target data sample.
Optionally, the determining module 220 is configured to:
Determine the sample centroid distance between the target data sample and the barycenter corresponding to the first cluster classification to which the target data sample belongs, and the inter-barycenter distance between the barycenter corresponding to the first cluster classification and the barycenter corresponding to each other cluster classification;
Determine the difference between the sample centroid distance and each inter-barycenter distance, and take the difference corresponding to the barycenter of each other cluster classification as the distance lower limit between the barycenter of that other cluster classification and the target data sample.
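The difference-based lower limit above can be illustrated as follows. The sketch assumes Euclidean distance and takes the absolute value of the difference, which is an assumption on top of the text; it is justified by the triangle inequality, d(a, cj) >= |d(c1, cj) - d(a, c1)|. The function name `triangle_lower_limits` is hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triangle_lower_limits(sample, first, barycenters):
    """Return (sample centroid distance, {j: distance lower limit}).

    For every barycenter j other than the first one, the lower limit is
    |d(c_first, c_j) - d(sample, c_first)|.
    """
    d_first = euclidean(sample, barycenters[first])
    limits = {}
    for j, c in enumerate(barycenters):
        if j != first:
            # Inter-barycenter distance; it does not depend on the sample.
            inter = euclidean(barycenters[first], c)
            limits[j] = abs(inter - d_first)
    return d_first, limits
```

Because the inter-barycenter distances do not depend on the sample, they could be computed once and reused for every sample until the barycenters are updated.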
Optionally, the determining module 220 is configured to:
Determine, according to the target data sample and the barycenters corresponding to the cluster classifications other than the first cluster classification to which the target data sample belongs, the projection onto a unit vector of the difference vector between the target data sample and the barycenter corresponding to each other cluster classification, and take the length of the projection corresponding to the barycenter of each other cluster classification as the distance lower limit between the barycenter of that other cluster classification and the target data sample.
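A minimal sketch of the projection-based lower limit follows. The choice of unit vector is an assumption made here for illustration, and the helper names are hypothetical. The bound holds because the projection of a vector onto any unit vector is never longer than the vector itself (Cauchy-Schwarz inequality).

```python
import math


def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return tuple(x / norm for x in v)


def project(v, unit):
    """Signed length of the projection of v onto the unit vector."""
    return sum(x * u for x, u in zip(v, unit))


def projection_lower_limit(sample, barycenter, unit):
    """Length of the projection of (sample - barycenter) onto unit.

    Projection is linear, so this equals |project(sample) - project(barycenter)|;
    each vector therefore only needs to be projected once.
    """
    return abs(project(sample, unit) - project(barycenter, unit))
```

Since each projection is a single scalar per vector, comparing two projections costs one subtraction instead of a full distance calculation over all dimensions.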
Optionally, the determining module 220 is configured to:
Determine, according to the target data sample and the barycenters corresponding to the cluster classifications other than the first cluster classification to which the target data sample belongs, the mean and variance of the target data sample and of the barycenter corresponding to each other cluster classification;
Determine, according to the mean and variance of the target data sample and of the barycenter corresponding to each other cluster classification, and according to the mean square deviation inequality, the distance lower limit between the barycenter of each other cluster classification and the target data sample.
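The exact form of the mean square deviation inequality is not restated in this passage. As an assumption, the sketch below uses one well-known bound of this kind for Euclidean distance between n-dimensional vectors x and c: d(x, c)^2 >= n * ((m_x - m_c)^2 + (s_x - s_c)^2), where m denotes the mean of a vector's components and s their population standard deviation. The helper names are hypothetical.

```python
import math


def mean_and_std(v):
    """Mean and population standard deviation of a vector's components."""
    n = len(v)
    m = sum(v) / n
    return m, math.sqrt(sum((x - m) ** 2 for x in v) / n)


def mean_variance_lower_limit(sample_stats, barycenter_stats, n):
    """Lower limit on the Euclidean distance between two n-dimensional vectors.

    Assumed bound: sqrt(n * ((m_s - m_c)**2 + (s_s - s_c)**2)).
    Each *_stats argument is the (mean, std) pair returned by mean_and_std.
    """
    (m_s, s_s), (m_c, s_c) = sample_stats, barycenter_stats
    return math.sqrt(n * ((m_s - m_c) ** 2 + (s_s - s_c) ** 2))
```

The statistics of each vector can be computed once, after which every lower limit costs only a few scalar operations regardless of the dimension.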
Optionally, the cluster module 240 is configured to:
Determine the first number of processing units according to the number of chosen barycenters;
Obtain the first number of processing units from a preset processing unit pool;
Determine, by means of the obtained processing units, the barycenter with the minimum distance to the target data sample among the chosen barycenters and the barycenter corresponding to the first cluster classification, and assign the target data sample to the cluster classification corresponding to the determined barycenter.
In the embodiment of the present invention, a target data sample and the barycenter corresponding to each cluster classification are obtained. According to the target data sample and the barycenters corresponding to the cluster classifications other than the first cluster classification to which the target data sample belongs, a distance lower limit between the barycenter of each other cluster classification and the target data sample is determined. Among the barycenters corresponding to the other cluster classifications, the barycenters whose distance lower limit is less than the sample centroid distance between the target data sample and the barycenter corresponding to the first cluster classification are chosen. Among the chosen barycenters and the barycenter corresponding to the first cluster classification, the barycenter with the minimum distance to the target data sample is determined, and the target data sample is assigned to the cluster classification corresponding to that barycenter. In this way, only the distance between the barycenter corresponding to the first cluster classification and the target data sample, and the distances between the chosen barycenters and the target data sample, need to be calculated; the distances between the target data sample and all barycenters need not be calculated. Because the calculation amount for determining a distance lower limit is much smaller than that for calculating the distance between a barycenter and the target data sample, processing resources of the server are saved.
It should be noted that when the clustering apparatus for data samples provided by the above embodiment clusters data samples, the division into the above function modules is used only as an example. In practical applications, the above functions may be allocated to different function modules as needed; that is, the internal structure of the device may be divided into different function modules to complete all or part of the functions described above. In addition, the clustering apparatus for data samples provided by the above embodiment and the embodiment of the clustering method for data samples belong to the same concept; for the specific implementation process, refer to the method embodiment, which is not repeated here.
Embodiment four
Based on the same technical concept as the above clustering method for data samples, an embodiment of the present application further provides a server; for a structural diagram of the server, refer to Fig. 3. The server includes a processor 310, a transceiver 320 and a memory 330, the transceiver 320 and the memory 330 each being connected to the processor 310, where:
Memory 330, configured to obtain, through the transceiver 320, a target data sample and the barycenter corresponding to each cluster classification;
Processor 310, configured to determine, according to the target data sample and the barycenters corresponding to the cluster classifications other than the first cluster classification to which the target data sample belongs, a distance lower limit between the barycenter of each other cluster classification and the target data sample;
Processor 310, further configured to choose, among the barycenters corresponding to the other cluster classifications, the barycenters whose corresponding distance lower limit is less than the sample centroid distance between the target data sample and the barycenter corresponding to the first cluster classification;
Processor 310, further configured to determine, among the chosen barycenters and the barycenter corresponding to the first cluster classification, the barycenter with the minimum distance to the target data sample, and to assign the target data sample to the cluster classification corresponding to the barycenter with the minimum distance to the target data sample.
Optionally, the processor 310 is configured to:
Determine the sample centroid distance between the target data sample and the barycenter corresponding to the first cluster classification to which the target data sample belongs, and the inter-barycenter distance between the barycenter corresponding to the first cluster classification and the barycenter corresponding to each other cluster classification;
Determine the difference between the sample centroid distance and each inter-barycenter distance, and take the difference corresponding to the barycenter of each other cluster classification as the distance lower limit between the barycenter of that other cluster classification and the target data sample.
Optionally, the processor 310 is configured to:
Determine, according to the target data sample and the barycenters corresponding to the cluster classifications other than the first cluster classification to which the target data sample belongs, the projection onto a unit vector of the difference vector between the target data sample and the barycenter corresponding to each other cluster classification, and take the length of the projection corresponding to the barycenter of each other cluster classification as the distance lower limit between the barycenter of that other cluster classification and the target data sample.
Optionally, the processor 310 is configured to:
Determine, according to the target data sample and the barycenters corresponding to the cluster classifications other than the first cluster classification to which the target data sample belongs, the mean and variance of the target data sample and of the barycenter corresponding to each other cluster classification;
Determine, according to the mean and variance of the target data sample and of the barycenter corresponding to each other cluster classification, and according to the mean square deviation inequality, the distance lower limit between the barycenter of each other cluster classification and the target data sample.
Optionally, the processor 310 is configured to:
Determine the first number of processing units according to the number of chosen barycenters;
Obtain the first number of processing units from a preset processing unit pool;
Determine, by means of the obtained processing units, the barycenter with the minimum distance to the target data sample among the chosen barycenters and the barycenter corresponding to the first cluster classification, and assign the target data sample to the cluster classification corresponding to the determined barycenter.
In the embodiment of the present invention, a target data sample and the barycenter corresponding to each cluster classification are obtained. According to the target data sample and the barycenters corresponding to the cluster classifications other than the first cluster classification to which the target data sample belongs, a distance lower limit between the barycenter of each other cluster classification and the target data sample is determined. Among the barycenters corresponding to the other cluster classifications, the barycenters whose distance lower limit is less than the sample centroid distance between the target data sample and the barycenter corresponding to the first cluster classification are chosen. Among the chosen barycenters and the barycenter corresponding to the first cluster classification, the barycenter with the minimum distance to the target data sample is determined, and the target data sample is assigned to the cluster classification corresponding to that barycenter. In this way, only the distance between the barycenter corresponding to the first cluster classification and the target data sample, and the distances between the chosen barycenters and the target data sample, need to be calculated; the distances between the target data sample and all barycenters need not be calculated. Because the calculation amount for determining a distance lower limit is much smaller than that for calculating the distance between a barycenter and the target data sample, processing resources of the server are saved.
One of ordinary skill in the art will appreciate that all or part of the steps of the above embodiments may be completed by hardware, or by a program instructing relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
The foregoing are merely preferred embodiments of the present invention and are not intended to limit the invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.