CN105095912A - Data clustering method and device

Info

Publication number: CN105095912A
Application number: CN201510477834.6A
Authority: CN
Inventors: 杨诗; 向园; 洪春晓; 吕俊
Original assignee: Beijing Qihoo Technology Co Ltd; Qizhi Software Beijing Co Ltd
Current assignee: Beijing Qihoo Technology Co Ltd
Priority date: 2015-08-06
Filing date: 2015-08-06
Publication date: 2015-11-25
Anticipated expiration: 2035-08-06
Also published as: CN109508754A; CN105095912B

Abstract

The invention discloses a data clustering method and device, relating to the technical field of data processing. It mainly aims to solve the problem that, when there are many cluster center points, the distance between every pair of cluster center points must be computed in each iteration, which makes the computation heavy and time-consuming. The technical scheme comprises: obtaining a predicted value of a first distance according to the difference of a first cluster center point before and after its last update; obtaining a predicted value of a third distance according to a second distance, the difference of the first cluster center point before and after its last update, and the difference of a second cluster center point before and after its last update; comparing the predicted value of the first distance with the predicted value of the third distance according to the triangle inequality rule; and discarding the second cluster center point if the predicted value of the third distance is greater than or equal to twice the predicted value of the first distance. The method is mainly applied to classifying data with a clustering algorithm.

Description

Data clustering method and device

Technical field

The present invention relates to the technical field of data processing, and in particular to a data clustering method and device.

Background art

With the rapid development of the Internet, increasing attention is paid to methods that classify network elements quickly and accurately. At present, clustering algorithms are commonly used to classify network elements. To cluster the samples in a sample set S{S1, S2, S3, ..., Sn}, a first scheme is adopted as follows: in the K-th iteration, for any sample Si, compute its distance to each cluster center point in the cluster center set M{M1, M2, ..., Mj, ..., Mk}, and assign Si to the class set of the nearest cluster center point; update the cluster center points in M using the mean method; compute the difference between the class sets produced by the current iteration and those produced by the previous iteration, and repeat until this difference meets a preset error condition.

When computing the cluster set of each cluster center point, this method must compute the distance between every sample in the sample set S and every cluster center point in the cluster center set M, i.e. n*k point-to-point distance computations, so the computation is heavy and time-consuming.
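The first scheme can be illustrated with a minimal sketch. It assumes Euclidean distance and 2-D points; the helper names `dist`, `assign_naive` and `update_centers` and the sample data are hypothetical and chosen only for illustration. The counter shows that one assignment pass costs n*k distance computations.

```python
import math

def dist(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign_naive(samples, centers):
    """Assign each sample to its nearest center.

    Returns (labels, count), where count is the number of
    sample-to-center distances computed (always n * k here).
    """
    labels, count = [], 0
    for s in samples:
        best_j, best_d = 0, float("inf")
        for j, m in enumerate(centers):
            d = dist(s, m)
            count += 1
            if d < best_d:
                best_j, best_d = j, d
        labels.append(best_j)
    return labels, count

def update_centers(samples, labels, centers):
    """Mean method: replace each center by the mean of its cluster.

    A center with an empty cluster is left unchanged (an assumption;
    the source does not say how empty clusters are handled).
    """
    new_centers = []
    for j, m in enumerate(centers):
        members = [s for s, lab in zip(samples, labels) if lab == j]
        if members:
            new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
        else:
            new_centers.append(tuple(m))
    return new_centers

samples = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers = [(0.0, 0.0), (10.0, 10.0)]
labels, count = assign_naive(samples, centers)
new_centers = update_centers(samples, labels, centers)
```

With n = 4 samples and k = 2 centers the pass performs 8 distance computations, which is the cost the later schemes try to reduce.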

To address the heavy computation and long running time of the first scheme, the prior art also provides a second scheme, which improves on the step of assigning Si to the class set of its nearest cluster center point. The improved scheme is as follows: compute and store the distance between every pair of cluster center points in the cluster center set M{M1, M2, ..., Mj, ..., Mk}; then, by the triangle inequality principle, compare Luj with 2Liu, where Luj is the distance between cluster center point Mu and cluster center point Mj, Mu is the cluster center point currently nearest to Si, Mj is the cluster center point to be traversed in the current traversal, and Liu is the distance between Si and Mu. If Luj is greater than or equal to 2Liu, ignore Mj and continue to the next cluster center point, or, if the traversal is complete, assign Si to the class set of Mu. If Luj is less than 2Liu, compute the distance Lij between sample point Si and cluster center point Mj; when Lij is less than Liu, set Liu = Lij and Mu = Mj, then continue to the next cluster center point, or, if the traversal is complete, assign Si to the class set of Mu.
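The second scheme can be sketched as follows, assuming Euclidean distance. The names `assign_pruned`, `cc`, `l_iu` and `l_ij` are hypothetical; `cc[u][j]` stands for the stored distance between centers Mu and Mj, and `l_iu` for the distance between the sample and its current nearest center.

```python
import math

def dist(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign_pruned(sample, centers, cc):
    """Find the nearest center using the triangle-inequality test.

    cc[u][j] is the precomputed distance between centers u and j.
    Returns (index of nearest center, number of sample-to-center
    distances actually computed).
    """
    u = 0
    l_iu = dist(sample, centers[0])
    computed = 1
    for j in range(1, len(centers)):
        # If d(Mu, Mj) >= 2 * d(Si, Mu), then d(Si, Mj) >= d(Si, Mu),
        # so Mj cannot be strictly closer and is skipped.
        if cc[u][j] >= 2 * l_iu:
            continue
        l_ij = dist(sample, centers[j])
        computed += 1
        if l_ij < l_iu:
            u, l_iu = j, l_ij
    return u, computed

centers = [(0.0, 0.0), (10.0, 0.0), (0.5, 0.0)]
cc = [[dist(a, b) for b in centers] for a in centers]
nearest, computed = assign_pruned((0.4, 0.0), centers, cc)
```

In this example the far center at (10, 0) is skipped without computing its distance to the sample, but the table `cc` still has to be rebuilt after every center update, which is the cost the invention targets.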

In implementing the second scheme, the inventors found the following problem. When judging whether a given cluster center point is the cluster center point of a sample, once the cluster center point Mu nearest to sample Si in the cluster center set M has been determined, the cluster center points in M that cannot be the cluster center point of Si are discarded on the basis of the triangle inequality principle, and the distances between the discarded cluster center points and Si need not be computed. This reduces the computation and shortens the running time to some extent. However, when there are many cluster center points and finer clustering is required, the pairwise distances between cluster center points must be computed in every iteration, so the computation remains heavy and time-consuming.

Summary of the invention

In view of this, the present invention provides a data clustering method and device, whose main purpose is to solve the problem that, when there are many cluster center points, the pairwise distances between cluster center points must be computed in every iteration, making the computation heavy and time-consuming.

According to one aspect of the present invention, a data clustering method is provided, the method comprising:

obtaining a predicted value of a first distance according to the difference of a first cluster center point before and after its last update, wherein the first distance is the distance between a sample point to be clustered and the first cluster center point, and the first cluster center point is the cluster center point nearest to the sample point in the clustering distance traversal;

obtaining a predicted value of a third distance according to a second distance, the difference of the first cluster center point before and after its last update, and the difference of a second cluster center point before and after its last update, wherein the second distance is the distance between the first cluster center point and the second cluster center point in the previous clustering distance traversal, and the second cluster center point is the cluster center point to be traversed in the current clustering distance traversal;

comparing the predicted value of the first distance with the predicted value of the third distance according to the triangle inequality rule;

if the predicted value of the third distance is greater than or equal to twice the predicted value of the first distance, discarding the second cluster center point, so that during the clustering distance traversal neither the distance between the sample point and the second cluster center point nor the distances between the second cluster center point and the other cluster center points to be traversed are computed.

According to another aspect of the present invention, a data clustering device is provided, the device comprising:

a first acquiring unit, configured to obtain a predicted value of a first distance according to the difference of a first cluster center point before and after its last update, wherein the first distance is the distance between a sample point to be clustered and the first cluster center point, and the first cluster center point is the cluster center point nearest to the sample point in the clustering distance traversal;

a second acquiring unit, configured to obtain a predicted value of a third distance according to a second distance, the difference of the first cluster center point before and after its last update, and the difference of a second cluster center point before and after its last update, wherein the second distance is the distance between the first cluster center point and the second cluster center point in the previous clustering distance traversal, and the second cluster center point is the cluster center point to be traversed in the current clustering distance traversal;

a comparing unit, configured to compare, according to the triangle inequality rule, the predicted value of the first distance obtained by the first acquiring unit with the predicted value of the third distance obtained by the second acquiring unit;

a discarding unit, configured to discard the second cluster center point when the comparing unit finds that the predicted value of the third distance is greater than or equal to twice the predicted value of the first distance, so that during the clustering distance traversal neither the distance between the sample point and the second cluster center point nor the distances between the second cluster center point and the other cluster center points to be traversed are computed.

Through the above technical scheme, in the data clustering method and device provided by the present invention, during the current clustering distance traversal and on the basis of the most recently updated cluster center set, the predicted value of the first distance is obtained according to the difference of the first cluster center point before and after its last update, the first distance being the distance between the sample point to be clustered and the cluster center point nearest to that sample point; the predicted value of the third distance is obtained according to the second distance, the difference of the first cluster center point before and after its last update, and the difference of the second cluster center point before and after its last update, the second distance being the distance between the first cluster center point and the second cluster center point in the previous clustering distance traversal, and the second cluster center point being the cluster center point to be traversed in the current clustering distance traversal; the predicted value of the third distance is compared with the predicted value of the first distance; and if the predicted value of the third distance is greater than or equal to twice the predicted value of the first distance, the second cluster center point is discarded. In the present invention, based on the triangle inequality rule, any second cluster center point in the cluster center set whose predicted third distance is greater than or equal to twice the corresponding predicted first distance is filtered out. There is no need to compute the distance between that second cluster center point and the sample point, nor the distances between that second cluster center point and the other cluster center points to be traversed. This reduces the time and computation spent on those distances and improves the computational efficiency of data clustering.

The above is only an overview of the technical scheme of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the content of the specification, and so that the above and other objects, features and advantages of the invention become more apparent, specific embodiments of the invention are set out below.

Brief description of the drawings

Various other advantages and benefits will become clear to those of ordinary skill in the art from reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be considered limiting of the invention. Throughout the drawings, the same reference symbols denote the same parts. In the drawings:

Fig. 1 shows a flowchart of a data clustering method provided by an embodiment of the present invention;

Fig. 2 shows a schematic diagram, provided by an embodiment of the present invention, of the case where the predicted value of the third distance is greater than or equal to twice the predicted value of the first distance;

Fig. 3 shows a flowchart of a method, provided by an embodiment of the present invention, for performing data clustering processing on the second cluster center point according to the first cluster center point;

Fig. 4 shows a flowchart of a method, provided by an embodiment of the present invention, for determining the cluster center point corresponding to a sample point;

Fig. 5 shows a block diagram of a data clustering device provided by an embodiment of the present invention;

Fig. 6 shows a block diagram of another data clustering device provided by an embodiment of the present invention;

Fig. 7 shows a block diagram of another data clustering device provided by an embodiment of the present invention;

Fig. 8 shows a block diagram of another data clustering device provided by an embodiment of the present invention;

Fig. 9 shows a block diagram of another data clustering device provided by an embodiment of the present invention;

Fig. 10 shows a block diagram of another data clustering device provided by an embodiment of the present invention;

Fig. 11 shows a block diagram of another data clustering device provided by an embodiment of the present invention;

Fig. 12 shows a block diagram of another data clustering device provided by an embodiment of the present invention.

Detailed description of the embodiments

Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and its scope can be fully conveyed to those skilled in the art.

An embodiment of the present invention provides a data clustering method. As shown in Fig. 1, the method comprises:

101. Obtain the predicted value of the first distance according to the difference of the first cluster center point before and after its last update.

Before this data clustering is performed, suppose there is a sample set S{S1, S2, ..., Sn} and an initial cluster center set M{M1, M2, ..., Mj, ..., Mk}. Compute the pairwise distances between the cluster center points in the initial cluster center set M: d11, d12, ..., d(k-1)k. For any sample point Si in the sample set S, where i is greater than or equal to 1 and less than or equal to n, traverse each cluster center point in M in turn, determine the cluster center point Mu in M nearest to Si, assign Si to the set corresponding to Mu, and store the first distance Liu between Si and Mu. Proceeding in the same way yields the cluster set of each cluster center point; for example, the cluster sets corresponding to M1, M2, ..., Mj, ..., Mk are N1, N2, ..., Nj, ..., Nk respectively. Compute the means M1', M2', ..., Mj', ..., Mk' of the sample points in N1, N2, ..., Nj, ..., Nk, and use them to update M1, M2, ..., Mj, ..., Mk; the updated cluster center set M is {M1', M2', ..., Mj', ..., Mk'}.

To improve the accuracy of the data clustering, iterative computation is required. The current pass of the clustering algorithm is computed on the basis of the updated cluster center set M{M1', M2', ..., Mj', ..., Mk'}. Here the first distance Liu is the distance between the sample point Si to be clustered and the first cluster center point Mu' after its last update, and the first cluster center point Mu' is the cluster center point nearest to the sample point in the clustering distance traversal.

The predicted value of the first distance for sample point Si is set to Liu = Liu + Tu, where Tu is the difference of the first cluster center point Mu' before and after its last update, i.e. the difference between Mu' and Mu. In this embodiment, the purpose of setting the predicted value to Liu = Liu + Tu is to ensure that it is the maximum possible distance between Si and the first cluster center point Mu' after its last update. The current clustering distance traversal is then carried out on the basis of the reset value Liu = Liu + Tu.
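The following minimal sketch illustrates why Liu + Tu can serve as an upper bound. It assumes that the "difference" Tu is the Euclidean distance between Mu and Mu'; the function name `predicted_first_distance` and the coordinates are hypothetical.

```python
import math

def dist(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def predicted_first_distance(l_iu, m_u_old, m_u_new):
    """Upper bound on d(Si, Mu') built from the stored d(Si, Mu).

    Assumes Tu = d(Mu, Mu'); by the triangle inequality,
    d(Si, Mu') <= d(Si, Mu) + d(Mu, Mu').
    """
    t_u = dist(m_u_old, m_u_new)
    return l_iu + t_u

s_i = (1.0, 1.0)
m_u = (0.0, 0.0)        # center before the last update
m_u_new = (0.0, 1.0)    # center after the last update
l_iu = dist(s_i, m_u)   # stored first distance from the previous pass
pred = predicted_first_distance(l_iu, m_u, m_u_new)
actual = dist(s_i, m_u_new)
```

Here the predicted value is sqrt(2) + 1 while the true distance to the moved center is 1, so the prediction is conservative, as intended.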

In embodiments of the present invention, the distance between sample point Si and the first cluster center point Mu' after its last update, and the pairwise distances d11, d12, ..., d(k-1)k between cluster center points in the initial cluster center set, may be computed using, but not limited to, the following: Euclidean distance, Manhattan distance, Chebyshev distance, power distance, cosine similarity, Pearson similarity, adjusted cosine similarity, Jaccard similarity, Hamming distance, weighted Euclidean distance, correlation distance, Mahalanobis distance, and so on. The embodiments of the present invention do not limit the specific method used to compute distances.

102. Obtain the predicted value of the third distance according to the second distance, the difference of the first cluster center point before and after its last update, and the difference of the second cluster center point before and after its last update.

Here, the second distance duj is the distance between cluster center points Mu and Mj in the initial cluster center set, computed in step 101; Mj is the second cluster center point Mj' before its update, and Mj' is the cluster center point to be traversed in the current clustering distance traversal. Tu is the difference of the first cluster center point Mu' before and after its last update, i.e. the difference between Mu' and Mu; Tj is the difference of the second cluster center point Mj' before and after its last update, i.e. the difference between Mj' and Mj. Subtracting Tu and Tj from the second distance duj gives the predicted value of the third distance, (duj-Tu-Tj).

Note that computing the predicted value (duj-Tu-Tj) of the third distance only requires the difference of the first cluster center point Mu' before and after its last update and the difference of the second cluster center point Mj' before and after its last update. The pairwise distances between the cluster center points in the updated cluster center set M{M1', M2', ..., Mj', ..., Mk'} need not be computed, which reduces the computation required for data clustering and improves computational efficiency.
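As a hedged illustration, the predicted value of the third distance can be computed from stored quantities alone. The sketch assumes Tu and Tj are the Euclidean distances moved by the two centers; the name `predicted_third_distance` and the coordinates are hypothetical.

```python
import math

def dist(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def predicted_third_distance(d_uj, t_u, t_j):
    """Lower bound on d(Mu', Mj') from the old distance and the shifts.

    By the triangle inequality,
    d(Mu, Mj) <= d(Mu, Mu') + d(Mu', Mj') + d(Mj', Mj),
    so d(Mu', Mj') >= d_uj - t_u - t_j.
    """
    return d_uj - t_u - t_j

m_u, m_j = (0.0, 0.0), (10.0, 0.0)          # centers before the update
m_u_new, m_j_new = (1.0, 0.0), (9.0, 1.0)   # centers after the update
d_uj = dist(m_u, m_j)                        # stored second distance
t_u = dist(m_u, m_u_new)
t_j = dist(m_j, m_j_new)
pred = predicted_third_distance(d_uj, t_u, t_j)
actual = dist(m_u_new, m_j_new)
```

Only k shift values have to be computed per iteration for this bound, instead of recomputing all pairwise center distances.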

103. Compare the predicted value of the first distance with the predicted value of the third distance according to the triangle inequality rule.

Based on the triangle inequality rule, the predicted value Liu of the first distance obtained in step 101 is compared with the predicted value of the third distance obtained in step 102.

104. If the predicted value of the third distance is greater than or equal to twice the predicted value of the first distance, discard the second cluster center point.

When the predicted value of the third distance is greater than or equal to twice the predicted value of the first distance, i.e. (duj-Tu-Tj) is greater than or equal to 2*Liu, the distance Lij' between sample point Si and the second cluster center point Mj' is greater than or equal to the predicted value Liu of the first distance between Si and the first cluster center point, so the second cluster center point Mj' is discarded. Consequently, in the current clustering distance traversal there is no need to compute the distance between Si and the second cluster center point Mj', nor the distances between the second cluster center point and the other cluster center points to be traversed. Fig. 2 shows a schematic diagram of the case where the predicted value of the third distance is greater than or equal to twice the predicted value of the first distance.
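The reasoning behind the discard rule can be written out as a short derivation. It assumes that the distance d satisfies the triangle inequality and that Tu and Tj are the distances d(Mu, Mu') and d(Mj, Mj'); the hat notation for the predicted first distance is introduced here only for readability and is not part of the source.

```latex
\begin{aligned}
d(M_u', M_j') &\ge d(M_u, M_j) - d(M_u, M_u') - d(M_j, M_j') = d_{uj} - T_u - T_j,\\
d(S_i, M_u') &\le d(S_i, M_u) + d(M_u, M_u') = L_{iu} + T_u =: \hat{L}_{iu},\\
d(S_i, M_j') &\ge d(M_u', M_j') - d(S_i, M_u')
  \ge (d_{uj} - T_u - T_j) - \hat{L}_{iu}
  \ge 2\hat{L}_{iu} - \hat{L}_{iu} = \hat{L}_{iu} \ge d(S_i, M_u').
\end{aligned}
```

The third line uses the condition of step 104, so whenever that condition holds the candidate center cannot be strictly closer to the sample than the current nearest center.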

In the data clustering method provided by the embodiment of the present invention, during the current clustering distance traversal and on the basis of the most recently updated cluster center set, the predicted value of the first distance is obtained according to the difference of the first cluster center point before and after its last update, the first distance being the distance between the sample point to be clustered and the cluster center point nearest to that sample point; the predicted value of the third distance is obtained according to the second distance, the difference of the first cluster center point before and after its last update, and the difference of the second cluster center point before and after its last update, the second distance being the distance between the first cluster center point and the second cluster center point in the previous clustering distance traversal, and the second cluster center point being the cluster center point to be traversed in the current clustering distance traversal; the predicted value of the third distance is compared with the predicted value of the first distance; and if the predicted value of the third distance is greater than or equal to twice the predicted value of the first distance, the second cluster center point is discarded. In the embodiment of the present invention, based on the triangle inequality rule, any second cluster center point in the cluster center set whose predicted third distance is greater than or equal to twice the corresponding predicted first distance is filtered out. There is no need to compute the distance between that second cluster center point and the sample point, nor the distances between that second cluster center point and the other cluster center points to be traversed. This reduces the time and computation spent on those distances and improves the computational efficiency of data clustering.

Further, when step 103 compares the predicted value Liu of the first distance with the predicted value (duj-Tu-Tj) of the third distance according to the triangle inequality rule, if the predicted value (duj-Tu-Tj) of the third distance is less than twice the predicted value Liu of the first distance, the distance Lij' between sample point Si and the second cluster center point Mj' may be less than the predicted value Liu of the first distance between Si and the first cluster center point Mu'. In that case, data clustering processing is performed on the second cluster center point Mj' according to the first cluster center point Mu' after its last update, to determine whether the cluster center point corresponding to Si is the first cluster center point Mu' after its last update or the second cluster center point Mj'. Fig. 3 shows a flowchart of this processing provided by the embodiment of the present invention, which comprises:

301. Compute the distance between the first cluster center point after its last update and the sample point, to obtain the actual value of the first distance.

Compute the distance Liu' between the first cluster center point Mu' after its last update and sample point Si; Liu' is the actual value of the first distance in the current clustering distance traversal. For the algorithm used to compute the actual value Liu' of the first distance between Mu' and Si, refer to the description in step 101 above; it is not repeated here.

302. Compare the actual value of the first distance with the predicted value of the third distance according to the triangle inequality rule.

Based on the triangle inequality rule, the actual value Liu' of the first distance is compared with the predicted value (duj-Tu-Tj) of the third distance. If the predicted value (duj-Tu-Tj) of the third distance is greater than or equal to twice the actual value Liu' of the first distance, perform step 303; if it is less than twice the actual value Liu' of the first distance, perform step 304.

303. Discard the second cluster center point.

When the predicted value (duj-Tu-Tj) of the third distance is greater than or equal to twice the actual value Liu' of the first distance, the actual distance from sample point Si to the second cluster center point Mj' in the current clustering distance traversal is greater than or equal to the actual distance from Si to the first cluster center point Mu'; that is, the cluster center point corresponding to Si cannot be the second cluster center point Mj'. Mj' is therefore discarded, and neither the distance between Si and Mj' nor the distances between Mj' and the other cluster center points to be traversed are computed.

304. Compute the fourth distance, and determine whether the fourth distance is less than the actual value of the first distance.

When the predicted value (duj-Tu-Tj) of the third distance is less than twice the actual value Liu' of the first distance, the actual distance from sample point Si to the second cluster center point Mj' in the current clustering distance traversal may be less than the actual distance from Si to the first cluster center point Mu'; that is, the cluster center point corresponding to Si may be the second cluster center point Mj'.

To determine whether the cluster center point corresponding to Si is the first cluster center point Mu' or the second cluster center point Mj', the fourth distance Lij' must be computed, where Lij' is the distance between sample point Si and the second cluster center point Mj'. If the fourth distance Lij' is less than the actual value Liu' of the first distance, perform step 305; if Lij' is greater than or equal to Liu', perform step 306.

305. Determine the second cluster center point to be the cluster center point nearest to the sample point in the current distance traversal.

When the fourth distance Lij' is less than the actual value Liu' of the first distance, the second cluster center point Mj' is determined to be the cluster center point nearest to sample point Si in the current distance traversal. In one implementation of the embodiment, when Lij' is less than Liu' and the current clustering distance traversal is complete, Mj' is assigned to the first cluster center point Mu' and Lij' is assigned to the actual value Liu' of the first distance, i.e. Liu' = Lij', Mu' = Mj'. In another implementation, when Lij' is less than Liu' and the current clustering distance traversal is not complete, the same assignments are made (Liu' = Lij', Mu' = Mj'), and the traversal continues with the next cluster center point in the current cluster center set, using the reassigned Mu' and Liu', until the current cluster center set has been fully traversed.

306. Determine the first cluster center point after its last update to be the cluster center point nearest to the sample point in the current distance traversal.

When the fourth distance Lij' is greater than or equal to the actual value Liu' of the first distance, the first cluster center point Mu' after its last update is determined to be the cluster center point nearest to sample point Si in the current distance traversal. In one implementation of the embodiment, when Lij' is greater than or equal to Liu' and the current clustering distance traversal is complete, Mu' is determined to be the cluster center point nearest to Si in the current distance traversal. When Lij' is greater than or equal to Liu' and the current clustering distance traversal is not complete, the traversal continues with the next cluster center point in the current cluster center set, using Mu' after its last update and the actual value Liu' of the first distance.
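Steps 301 to 306 for a single candidate center can be sketched as below. This is an illustrative reading of Fig. 3 that assumes Euclidean distance; the function name `process_candidate` and its return convention are hypothetical, and `pred_third` stands for the already computed value (duj-Tu-Tj).

```python
import math

def dist(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def process_candidate(sample, m_u_new, m_j_new, pred_third):
    """Decide between the current nearest center Mu' and a candidate Mj'.

    Returns (nearest center, its distance to the sample, discarded),
    where discarded is True when Mj' was rejected without computing
    the sample-to-Mj' distance.
    """
    l_iu_actual = dist(sample, m_u_new)        # step 301
    if pred_third >= 2 * l_iu_actual:          # step 302
        return m_u_new, l_iu_actual, True      # step 303: discard Mj'
    l_ij = dist(sample, m_j_new)               # step 304: fourth distance
    if l_ij < l_iu_actual:
        return m_j_new, l_ij, False            # step 305: Mj' is nearer
    return m_u_new, l_iu_actual, False         # step 306: keep Mu'

# Candidate that turns out to be nearer than the current center.
switched = process_candidate((4.0, 0.0), (1.0, 0.0), (5.0, 0.0), 3.5)
# Candidate that is rejected by the test in step 302.
rejected = process_candidate((1.5, 0.0), (1.0, 0.0), (9.0, 0.0), 7.0)
```

In the first call the candidate becomes the new nearest center; in the second call the candidate is discarded after only one distance computation.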

The embodiment of the present invention is in specific implementation process, Fig. 1 and Fig. 3 is combined, determines the cluster centre point that sample point Si is corresponding, as shown in Figure 4, Fig. 4 shows the process flow diagram embodiments providing and determine the corresponding cluster centre point methods of sample point Si, and the method comprises:

401 (step 101), obtain the predicted value of the first distance according to the difference of the first cluster center point before and after its last update.

402 (step 102), obtain the predicted value of the third distance according to the second distance, the difference of the first cluster center point before and after its last update, and the difference of the second cluster center point before and after its last update.

403 (step 103), compare the predicted value of the first distance with the predicted value of the third distance according to the triangle inequality rule.

If the predicted value of the third distance is greater than or equal to twice the predicted value of the first distance, perform step 404; if the predicted value of the third distance is less than twice the predicted value of the first distance, perform step 405.

404 (step 104), discard the second cluster center point.

405, perform data clustering processing on the second cluster center point according to the first cluster center point after the last update.

For the implementation of performing data clustering processing on the second cluster center point according to the first cluster center point after the last update, refer to the detailed description of Fig. 3; it is not repeated here.
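The branch between step 404 and step 405 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the function name `partition_candidates` and the dictionary layout (candidate index mapped to the predicted value of its third distance) are hypothetical, and the predicted values are taken as already computed by steps 401 and 402.

```python
def partition_candidates(first_pred, third_preds):
    """Split candidate centers into discarded (step 404) and kept (step 405).

    first_pred  -- predicted value of the first distance (step 401)
    third_preds -- hypothetical mapping {candidate index j: predicted third distance}
    """
    discarded, kept = [], []
    for j, third_pred in third_preds.items():
        if third_pred >= 2 * first_pred:
            discarded.append(j)   # step 404: no further distances computed for j
        else:
            kept.append(j)        # step 405: processed as described for Fig. 3
    return discarded, kept


if __name__ == "__main__":
    print(partition_candidates(1.5, {1: 9.0, 2: 2.0, 3: 3.0}))
```

Note that the comparison is inclusive: a candidate whose predicted third distance equals exactly twice the predicted first distance is discarded.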

Further, before step 304 is performed to calculate the fourth distance, the fifth distance duj' is calculated. The fifth distance is the distance between the second cluster center point Mj' and the first cluster center point Mu' after the last update. According to the triangle inequality rule, the actual value Liu' of the first distance is compared with the fifth distance duj'. When the fifth distance duj' is greater than or equal to twice the actual value Liu' of the first distance, the second cluster center point is discarded, and neither the distance between sample point Si and the second cluster center point Mj' nor the distance between the second cluster center point Mj' and the other cluster center points to be traversed is calculated. When the fifth distance duj' is less than twice the actual value Liu' of the first distance, step 304 is performed.

It should be noted that, in actual operation, steps 301 to 303 can discard most of the cluster center points in cluster center set M, namely those whose distance to sample point Si is greater than or equal to the actual value Liu' of the first distance; the cluster center points remaining in M are those whose distance to sample point Si may be less than Liu'. For example, suppose cluster center set M contains 1000 cluster center points. Steps 301 to 303 can discard the 800 cluster center points whose distance to sample point Si is greater than or equal to the actual value Liu' of the first distance, leaving 200 cluster center points in M. For each of the remaining 200 cluster center points, the fifth distance between the second cluster center point Mj' and the first cluster center point Mu' after the last update is calculated, and the second cluster center point Mj' is discarded when the fifth distance duj' is greater than or equal to twice the actual value Liu' of the first distance. If 150 second cluster center points Mj' are discarded in this way, 50 cluster center points remain in M; the distances between sample point Si and these 50 cluster center points are then calculated to determine the cluster center point nearest to sample point Si. It should also be noted that, in actual operation, calculating the fifth distance duj' between two cluster center points in M requires less computation and less time than calculating the fourth distance Lij' between sample point Si and the second cluster center point Mj'. Based on the triangle inequality rule, the embodiment of the present invention discards the cluster center points in M whose fifth distance duj' is greater than or equal to twice the actual value Liu' of the first distance, which further reduces, to a certain extent, the computation spent on distances between sample point Si and the second cluster center points Mj'.
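The fifth-distance filter can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: it assumes Euclidean distance, and the function name `nearest_center` and its parameters are hypothetical. The filter is safe because of the triangle inequality: if duj' >= 2 * Liu', then Lij' >= duj' - Liu' >= Liu', so Mj' cannot be closer to Si than Mu'.

```python
import math


def nearest_center(sample, centers, u):
    """Find the center nearest to `sample`, starting from candidate index `u`.

    Returns (index of nearest center, its distance, number of
    sample-to-center distances actually computed).
    """
    best = u
    best_dist = math.dist(sample, centers[u])    # actual value Liu' of the first distance
    computed = 1
    for j, center in enumerate(centers):
        if j == u:
            continue
        d_uj = math.dist(centers[best], center)  # fifth distance duj'
        if d_uj >= 2 * best_dist:
            continue                             # discard Mj' without computing Lij'
        l_ij = math.dist(sample, center)         # fourth distance Lij'
        computed += 1
        if l_ij < best_dist:
            # Mj' becomes the current nearest center; Lij' becomes the new Liu'.
            best, best_dist = j, l_ij
    return best, best_dist, computed


if __name__ == "__main__":
    centers = [(1.0, 0.0), (10.0, 0.0), (0.5, 0.0), (-9.0, 0.0)]
    print(nearest_center((0.0, 0.0), centers, 0))
```

In this example two of the four centers are skipped by the filter, so only two sample-to-center distances are computed while the result matches an exhaustive search.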

Further, as a refinement and extension of the above embodiment, step 102 may obtain the predicted value (duj-Tu-Tj) of the third distance in, but not limited to, the following manner: obtain the value Mu of the first cluster center point before its last update and the value Mu' after the update, and calculate the first difference Tu, where Tu is the difference between Mu' and Mu; obtain the value Mj of the second cluster center point before its last update and the value Mj' after the update, and calculate the second difference Tj, where Tj is the difference between Mj' and Mj; subtract the first difference Tu and the second difference Tj from the second distance duj to obtain the predicted value (duj-Tu-Tj) of the third distance.
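The calculation of (duj-Tu-Tj) can be sketched as follows. This is a minimal illustration under the assumption that the "difference" of a cluster center point before and after its update is measured as the Euclidean distance it moved; the function name `predicted_third_distance` and its parameter names are hypothetical. Under that assumption the result is a lower bound on the distance between the updated centers Mu' and Mj', by the triangle inequality.

```python
import math


def predicted_third_distance(d_uj, mu_old, mu_new, mj_old, mj_new):
    """Return duj - Tu - Tj, the predicted value of the third distance.

    d_uj            -- second distance: distance between Mu and Mj in the previous traversal
    mu_old, mu_new  -- first cluster center point before / after its last update (Mu, Mu')
    mj_old, mj_new  -- second cluster center point before / after its last update (Mj, Mj')
    """
    t_u = math.dist(mu_old, mu_new)   # first difference Tu (assumed Euclidean shift)
    t_j = math.dist(mj_old, mj_new)   # second difference Tj (assumed Euclidean shift)
    return d_uj - t_u - t_j


if __name__ == "__main__":
    mu_old, mu_new = (0.0, 0.0), (3.0, 4.0)    # Tu = 5
    mj_old, mj_new = (10.0, 0.0), (10.0, 1.0)  # Tj = 1
    d_uj = math.dist(mu_old, mj_old)           # 10
    print(predicted_third_distance(d_uj, mu_old, mu_new, mj_old, mj_new))
```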

Further, after step 104 is performed, it is judged whether the current clustering distance traversal has been completed. If it has not, the traversal continues with the next cluster center point in the current cluster center set; if it has, the first cluster center point Mu' after the last update is determined to be the cluster center point nearest to the sample point in the current distance traversal.

After the cluster center point corresponding to sample point Si is determined, the cluster sets N1', N2' ... Nj' ... Nk' corresponding respectively to M1', M2' ... Mj' ... Mk' in cluster center set M are obtained in the same way. The differences O1, O2 ... Oj ... Ok between the cluster sets N1, N2 ... Nj ... Nk described in step 101 and the cluster sets N1', N2' ... Nj' ... Nk' determined by the current clustering distance traversal are calculated, and it is judged whether the differences O1, O2 ... Oj ... Ok satisfy a preset error threshold. If they do, the cluster sets N1', N2' ... Nj' ... Nk' determined by the current clustering distance traversal are taken as the final data clustering result; if they do not, data clustering is repeated according to the method of the embodiment of the present invention described above until the final data clustering result is determined. In the embodiment of the present invention, the preset error threshold is set according to actual requirements. Where fine-grained data clustering is required, a smaller preset error threshold is set, for example 1 or 0. The embodiment of the present invention does not limit the specific setting of the preset error threshold.
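The stopping test can be sketched as follows. The text does not define how the differences O1 ... Ok are measured, so this sketch makes two labeled assumptions: each Oj is the number of sample points that belong to exactly one of Nj and Nj', and the threshold is satisfied when every Oj is less than or equal to it. The function name `clustering_converged` is hypothetical.

```python
def clustering_converged(prev_sets, curr_sets, threshold=0):
    """Compare cluster sets N1..Nk with N1'..Nk' against a preset error threshold.

    Assumption: Oj is the size of the symmetric difference between Nj and Nj'.
    Returns (True if every Oj <= threshold, list of Oj values).
    """
    diffs = [len(prev ^ curr) for prev, curr in zip(prev_sets, curr_sets)]
    return all(o <= threshold for o in diffs), diffs


if __name__ == "__main__":
    previous = [{1, 2}, {3}]
    current = [{1}, {2, 3}]       # sample point 2 moved to the second cluster
    print(clustering_converged(previous, current, threshold=0))
    print(clustering_converged(previous, current, threshold=1))
```

With a threshold of 0, iteration stops only when no sample point changes cluster, which corresponds to the fine-grained setting mentioned above.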

Further, as an implementation of the above method, the embodiment of the present invention also provides a data clustering device. As shown in Fig. 5, the device comprises:

First acquiring unit 41, configured to obtain the predicted value of the first distance according to the difference of the first cluster center point before and after its last update; wherein the first distance is the distance between the sample point to be clustered and the first cluster center point, and the first cluster center point is the cluster center point nearest to the sample point in the clustering distance traversal;

Second acquiring unit 42, configured to obtain the predicted value of the third distance according to the second distance, the difference of the first cluster center point before and after its last update, and the difference of the second cluster center point before and after its last update; wherein the second distance is the distance between the first cluster center point and the second cluster center point in the previous clustering distance traversal, and the second cluster center point is a cluster center point to be traversed in the current clustering distance traversal;

Comparing unit 43, configured to compare, according to the triangle inequality rule, the predicted value of the first distance obtained by the first acquiring unit 41 with the predicted value of the third distance obtained by the second acquiring unit 42;

Discarding unit 44, configured to discard the second cluster center point when the comparing unit 43 finds that the predicted value of the third distance is greater than or equal to twice the predicted value of the first distance, so that, during the clustering distance traversal, neither the distance between the sample point and the second cluster center point nor the distance between the second cluster center point and the other cluster center points to be traversed is calculated.

Further, as shown in Fig. 6, the device also comprises:

Processing unit 45, configured to perform data clustering processing on the second cluster center point according to the first cluster center point after the last update when the comparing unit 43 finds that the predicted value of the third distance is less than twice the predicted value of the first distance.

Further, as shown in Fig. 7, the processing unit 45 comprises:

First computing module 451, configured to calculate the distance between the first cluster center point after the last update and the sample point to obtain the actual value of the first distance;

First comparison module 452, configured to compare, according to the triangle inequality rule, the actual value of the first distance calculated by the first computing module 451 with the predicted value of the third distance;

Discard module 453, configured to discard the second cluster center point when the first comparison module 452 finds that the predicted value of the third distance is greater than or equal to twice the actual value of the first distance, so that, during the clustering distance traversal, neither the distance between the sample point and the second cluster center point nor the distance between the second cluster center point and the other cluster center points to be traversed is calculated;

Second computing module 454, configured to calculate the fourth distance when the first comparison module 452 finds that the predicted value of the third distance is less than twice the actual value of the first distance; wherein the fourth distance is the distance between the sample point and the second cluster center point;

First determination module 455, configured to determine whether the fourth distance calculated by the second computing module 454 is less than the actual value of the first distance;

Second determination module 456, configured to determine the second cluster center point to be the cluster center point nearest to the sample point in the current distance traversal when the first determination module 455 determines that the fourth distance is less than the actual value of the first distance;

Third determination module 457, configured to determine the first cluster center point after the last update to be the cluster center point nearest to the sample point in the current distance traversal when the first determination module 455 determines that the fourth distance is greater than or equal to the actual value of the first distance.

Further, as shown in Fig. 8, the second determination module 456 comprises:

Assignment submodule 4561, configured to, when the fourth distance is less than the actual value of the first distance and the current clustering distance traversal has been completed, assign the second cluster center point to the first cluster center point after the last update and assign the fourth distance to the actual value of the first distance;

Processing submodule 4562, configured to, when the fourth distance is less than the actual value of the first distance and the current clustering distance traversal has not been completed, assign the second cluster center point to the first cluster center point after the last update, assign the fourth distance to the actual value of the first distance, and continue the traversal with the next cluster center point in the current cluster center set based on the assigned first cluster center point and the assigned actual value of the first distance.

Further, as shown in Fig. 9, the third determination module 457 comprises:

Determination submodule 4571, configured to, when the fourth distance is greater than or equal to the actual value of the first distance and the current clustering distance traversal has been completed, determine the first cluster center point after the last update to be the cluster center point nearest to the sample point in the current distance traversal;

Traversal submodule 4572, configured to, when the fourth distance is greater than or equal to the actual value of the first distance and the current clustering distance traversal has not been completed, continue the traversal with the next cluster center point in the current cluster center set based on the first cluster center point after the last update and the actual value of the first distance.

Further, as shown in Fig. 10, the processing unit 45 also comprises:

Third computing module 458, configured to calculate the fifth distance before the second computing module 454 calculates the fourth distance, the fifth distance being the distance between the second cluster center point and the first cluster center point after the last update;

Second comparison module 459, configured to compare, according to the triangle inequality rule, the actual value of the first distance calculated by the first computing module 451 with the fifth distance calculated by the third computing module 458;

Discard module 453, further configured to discard the second cluster center point when the second comparison module 459 finds that the fifth distance is greater than or equal to twice the actual value of the first distance, so that, during the clustering traversal, neither the distance between the sample point and the second cluster center point nor the distance between the second cluster center point and the other cluster center points to be traversed is calculated;

Second computing module 454, further configured to calculate the fourth distance when the second comparison module 459 finds that the fifth distance is less than twice the actual value of the first distance.

Further, as shown in Fig. 11, the second acquiring unit 42 comprises:

First processing module 421, configured to obtain the value of the first cluster center point before its last update and the value after the update, and to calculate the first difference of the first cluster center point before and after the update;

Second processing module 422, configured to obtain the value of the second cluster center point before its last update and the value after the update, and to calculate the second difference of the second cluster center point before and after the update;

Acquisition module 423, configured to subtract the first difference calculated by the first processing module 421 and the second difference calculated by the second processing module 422 from the second distance to obtain the predicted value of the third distance.

Further, as shown in Fig. 12, the device also comprises:

Judging unit 46, configured to judge, after the discarding unit 44 discards the second cluster center point, whether the current clustering distance traversal has been completed;

Traversal unit 47, configured to continue the traversal with the next cluster center point in the current cluster center set when the judging unit 46 judges that the traversal has not been completed;

Determining unit 48, configured to determine the first cluster center point after the last update to be the cluster center point nearest to the sample point in the current distance traversal when the judging unit 46 judges that the traversal has been completed.

In the data clustering device provided by the embodiment of the present invention, during the current clustering distance traversal and based on the cluster center set as last updated, the predicted value of the first distance is obtained according to the difference of the first cluster center point before and after its last update, the first distance being the distance between the sample point to be clustered and the cluster center point nearest to that sample point. The predicted value of the third distance is obtained according to the second distance, the difference of the first cluster center point before and after its last update, and the difference of the second cluster center point before and after its last update, where the second distance is the distance between the first cluster center point and the second cluster center point in the previous clustering distance traversal, and the second cluster center point is a cluster center point to be traversed in the current clustering distance traversal. The predicted value of the third distance is compared with the predicted value of the first distance, and if the predicted value of the third distance is greater than or equal to twice the predicted value of the first distance, the second cluster center point is discarded. In the embodiment of the present invention, based on the triangle inequality rule, the second cluster center points in the cluster center set whose predicted third distance is greater than or equal to twice the predicted value of the first distance are filtered out. The distance between such a second cluster center point and the sample point need not be calculated, nor the distance between that second cluster center point and the other cluster center points to be traversed. This reduces the time and computation spent on these distances and improves the computational efficiency of data clustering.

The embodiment of the present invention discloses A1, a data clustering method, comprising:

A2. The method according to A1, wherein the method further comprises:

If the predicted value of the third distance is less than twice the predicted value of the first distance, performing data clustering processing on the second cluster center point according to the first cluster center point after the last update.

A3. The method according to A2, wherein performing data clustering processing on the second cluster center point according to the first cluster center point after the last update comprises:

Calculating the distance between the first cluster center point after the last update and the sample point to obtain the actual value of the first distance;

Comparing the actual value of the first distance with the predicted value of the third distance according to the triangle inequality rule;

If the predicted value of the third distance is greater than or equal to twice the actual value of the first distance, discarding the second cluster center point, so that, during the clustering distance traversal, neither the distance between the sample point and the second cluster center point nor the distance between the second cluster center point and the other cluster center points to be traversed is calculated;

If the predicted value of the third distance is less than twice the actual value of the first distance, calculating the fourth distance and determining whether the fourth distance is less than the actual value of the first distance; wherein the fourth distance is the distance between the sample point and the second cluster center point;

If the fourth distance is less than the actual value of the first distance, determining the second cluster center point to be the cluster center point nearest to the sample point in the current distance traversal;

If the fourth distance is greater than or equal to the actual value of the first distance, determining the first cluster center point after the last update to be the cluster center point nearest to the sample point in the current distance traversal.

A4. The method according to A3, wherein determining the second cluster center point to be the cluster center point nearest to the sample point in the current distance traversal comprises:

If the fourth distance is less than the actual value of the first distance and the current clustering distance traversal has been completed, assigning the second cluster center point to the first cluster center point after the last update, and assigning the fourth distance to the actual value of the first distance;

If the fourth distance is less than the actual value of the first distance and the current clustering distance traversal has not been completed, assigning the second cluster center point to the first cluster center point after the last update, assigning the fourth distance to the actual value of the first distance, and continuing the traversal with the next cluster center point in the current cluster center set based on the assigned first cluster center point and the assigned actual value of the first distance.

A5. The method according to A3, wherein determining the first cluster center point after the last update to be the cluster center point nearest to the sample point in the current distance traversal comprises:

If the fourth distance is greater than or equal to the actual value of the first distance and the current clustering distance traversal has been completed, determining the first cluster center point after the last update to be the cluster center point nearest to the sample point in the current distance traversal;

If the fourth distance is greater than or equal to the actual value of the first distance and the current clustering distance traversal has not been completed, continuing the traversal with the next cluster center point in the current cluster center set based on the first cluster center point after the last update and the actual value of the first distance.

A6. The method according to A4 or A5, wherein, before the fourth distance is calculated, the method further comprises:

Calculating the fifth distance, the fifth distance being the distance between the second cluster center point and the first cluster center point after the last update;

Comparing the actual value of the first distance with the fifth distance according to the triangle inequality rule;

If the fifth distance is greater than or equal to twice the actual value of the first distance, discarding the second cluster center point, so that, during the clustering traversal, neither the distance between the sample point and the second cluster center point nor the distance between the second cluster center point and the other cluster center points to be traversed is calculated;

Calculating the fourth distance comprises:

If the fifth distance is less than twice the actual value of the first distance, performing the calculation of the fourth distance.

A7. The method according to any one of A1-A5, wherein obtaining the predicted value of the third distance according to the second distance, the difference of the first cluster center point before and after its last update, and the difference of the second cluster center point before and after its last update comprises:

Obtaining the value of the first cluster center point before its last update and the value after the update, and calculating the first difference of the first cluster center point before and after the update;

Obtaining the value of the second cluster center point before its last update and the value after the update, and calculating the second difference of the second cluster center point before and after the update;

Subtracting the first difference and the second difference from the second distance to obtain the predicted value of the third distance.

A8. The method according to A7, wherein, after the second cluster center point is discarded, the method further comprises:

Judging whether the current clustering distance traversal has been completed;

If the traversal has not been completed, continuing the traversal with the next cluster center point in the current cluster center set;

If the traversal has been completed, determining the first cluster center point after the last update to be the cluster center point nearest to the sample point in the current distance traversal.

B9. A data clustering device, comprising:

B10. The device according to B9, wherein the device further comprises:

A processing unit, configured to perform data clustering processing on the second cluster center point according to the first cluster center point after the last update when the comparing unit finds that the predicted value of the third distance is less than twice the predicted value of the first distance.

B11. The device according to B10, wherein the processing unit comprises:

A first computing module, configured to calculate the distance between the first cluster center point after the last update and the sample point to obtain the actual value of the first distance;

A first comparison module, configured to compare, according to the triangle inequality rule, the actual value of the first distance calculated by the first computing module with the predicted value of the third distance;

A discard module, configured to discard the second cluster center point when the first comparison module finds that the predicted value of the third distance is greater than or equal to twice the actual value of the first distance, so that, during the clustering distance traversal, neither the distance between the sample point and the second cluster center point nor the distance between the second cluster center point and the other cluster center points to be traversed is calculated;

A second computing module, configured to calculate the fourth distance when the first comparison module finds that the predicted value of the third distance is less than twice the actual value of the first distance; wherein the fourth distance is the distance between the sample point and the second cluster center point;

A first determination module, configured to determine whether the fourth distance calculated by the second computing module is less than the actual value of the first distance;

A second determination module, configured to determine the second cluster center point to be the cluster center point nearest to the sample point in the current distance traversal when the first determination module determines that the fourth distance is less than the actual value of the first distance;

A third determination module, configured to determine the first cluster center point after the last update to be the cluster center point nearest to the sample point in the current distance traversal when the first determination module determines that the fourth distance is greater than or equal to the actual value of the first distance.

B12. The device according to B11, wherein the second determination module comprises:

An assignment submodule, configured to, when the fourth distance is less than the actual value of the first distance and the current clustering distance traversal has been completed, assign the second cluster center point to the first cluster center point after the last update and assign the fourth distance to the actual value of the first distance;

A processing submodule, configured to, when the fourth distance is less than the actual value of the first distance and the current clustering distance traversal has not been completed, assign the second cluster center point to the first cluster center point after the last update, assign the fourth distance to the actual value of the first distance, and continue the traversal with the next cluster center point in the current cluster center set based on the assigned first cluster center point and the assigned actual value of the first distance.

B13. The device according to B11, wherein the third determination module comprises:

A determination submodule, configured to, when the fourth distance is greater than or equal to the actual value of the first distance and the current clustering distance traversal has been completed, determine the first cluster center point after the last update to be the cluster center point nearest to the sample point in the current distance traversal;

A traversal submodule, configured to, when the fourth distance is greater than or equal to the actual value of the first distance and the current clustering distance traversal has not been completed, continue the traversal with the next cluster center point in the current cluster center set based on the first cluster center point after the last update and the actual value of the first distance.

B14. The device according to B12 or B13, wherein the processing unit further comprises:

A third computing module, configured to calculate the fifth distance before the second computing module calculates the fourth distance, the fifth distance being the distance between the second cluster center point and the first cluster center point after the last update;

A second comparison module, configured to compare, according to the triangle inequality rule, the actual value of the first distance calculated by the first computing module with the fifth distance calculated by the third computing module;

The discard module is further configured to discard the second cluster center point when the second comparison module finds that the fifth distance is greater than or equal to twice the actual value of the first distance, so that, during the clustering traversal, neither the distance between the sample point and the second cluster center point nor the distance between the second cluster center point and the other cluster center points to be traversed is calculated;

The second computing module is further configured to calculate the fourth distance when the second comparison module finds that the fifth distance is less than twice the actual value of the first distance.

B15. The device according to any one of B9-B13, wherein the second acquiring unit comprises:

A first processing module, configured to obtain the value of the first cluster center point before its last update and the value after the update, and to calculate the first difference of the first cluster center point before and after the update;

A second processing module, configured to obtain the value of the second cluster center point before its last update and the value after the update, and to calculate the second difference of the second cluster center point before and after the update;

An acquisition module, configured to subtract the first difference calculated by the first processing module and the second difference calculated by the second processing module from the second distance to obtain the predicted value of the third distance.

B16. The device according to B15, further comprising:

A judging unit, configured to judge, after the discarding unit discards the second cluster centre point, whether the current cluster distance traversal is complete;

A traversal unit, configured to continue traversing the next cluster centre point in the current cluster centre set when the judging unit judges that the traversal is not complete;

A determining unit, configured to determine, when the judging unit judges that the traversal is complete, the first cluster centre point after the last update as the cluster centre point nearest to the sample point in the current distance traversal.
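
The cooperation of the judging, traversal, and determining units can be sketched as a single loop over the cluster centre set. This is a simplified illustration, not the claimed device: it assumes Euclidean distance and uses actual centre-to-centre distances rather than predicted ones. The function name `nearest_centre` and the data are hypothetical.

```python
import math

def nearest_centre(sample, centres):
    """Return (index of the nearest centre, number of discarded centres)."""
    best = 0
    best_distance = math.dist(sample, centres[0])
    discarded = 0
    for j in range(1, len(centres)):
        # Distance between the current nearest centre and the candidate.
        # In practice this value would be shared across all sample points
        # (or predicted), not recomputed for each sample as done here.
        centre_gap = math.dist(centres[best], centres[j])
        if centre_gap >= 2.0 * best_distance:
            discarded += 1   # candidate cannot be closer; discard it
            continue         # traversal not complete: go to the next centre
        candidate_distance = math.dist(sample, centres[j])
        if candidate_distance < best_distance:
            best, best_distance = j, candidate_distance
    # Traversal complete: the surviving centre is the nearest one.
    return best, discarded

# Hypothetical data.
sample = (0.0, 0.0)
centres = [(1.0, 0.0), (4.0, 0.0), (0.5, 0.5), (-6.0, 0.0)]
index, discarded = nearest_centre(sample, centres)
```

In this example two of the four centres are discarded without computing their distance to the sample point, and the result matches an exhaustive search.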

In the above embodiments, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, refer to the related descriptions of the other embodiments.

It will be understood that the related features of the above method and device may be referred to one another. In addition, "first", "second", and the like in the above embodiments are used to distinguish the embodiments and do not indicate the relative merits of the embodiments.

Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, devices, and units described above may be found in the corresponding processes of the foregoing method embodiments, and are not repeated here.

The algorithms and displays provided herein are not inherently related to any particular computer, virtual system, or other apparatus. Various general-purpose systems may also be used with the teachings herein. From the description above, the structure required to construct such systems is apparent. Moreover, the present invention is not directed to any particular programming language. It should be understood that the content of the invention described herein may be implemented in a variety of programming languages, and that the above description of a specific language is given to disclose the best mode of the invention.

Numerous specific details are set forth in the specification provided herein. It will be understood, however, that embodiments of the invention may be practised without these specific details. In some instances, well-known methods, structures, and techniques are not shown in detail, so as not to obscure the understanding of this specification.

Similarly, it should be understood that, in order to streamline the disclosure and to aid understanding of one or more of the various inventive aspects, features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments. However, this method of disclosure should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single foregoing disclosed embodiment. The claims following the detailed description are therefore expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of the invention.

Those skilled in the art will appreciate that the modules in the device of an embodiment may be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units, or components of the embodiments may be combined into one module, unit, or component, and may furthermore be divided into a plurality of sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings), and all processes or units of any method or device so disclosed, may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, equivalent, or similar purpose.

Furthermore, those skilled in the art will understand that, although some embodiments described herein include some features that are included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.

The various component embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components named in the title of the invention (for example, a device for determining the level of links within a website) according to embodiments of the invention. The invention may also be implemented as device or apparatus programs (for example, computer programs and computer program products) for performing part or all of the method described herein. Such programs implementing the invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.

It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art may design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any order. These words may be interpreted as names.

Claims

1. A data clustering method, characterized by comprising:

2. The method according to claim 1, characterized in that the method further comprises:

3. The method according to claim 2, characterized in that performing data clustering processing on the second cluster centre point according to the first cluster centre point after the last update comprises:

4. The method according to claim 3, characterized in that determining the second cluster centre point as the cluster centre point nearest to the sample point in the current distance traversal comprises:

5. The method according to claim 3, characterized in that determining the first cluster centre point after the last update as the cluster centre point nearest to the sample point in the current distance traversal comprises:

6. The method according to claim 4 or 5, characterized in that, before the fourth distance is calculated, the method further comprises:

The calculating of the fourth distance comprises:

7. A data clustering device, characterized by comprising:

8. The device according to claim 7, characterized in that the device further comprises:

9. The device according to claim 8, characterized in that the processing unit comprises:

10. The device according to claim 9, characterized in that the second determining module comprises: