CN109117895A

CN109117895A - Data clustering method, device and storage medium

Info

Publication number: CN109117895A
Application number: CN201811030449.7A
Authority: CN
Inventors: 赛影辉; 张国兴; 李中兵
Original assignee: SAIC Chery Automobile Co Ltd
Current assignee: Chery Automobile Co Ltd
Priority date: 2018-09-05
Filing date: 2018-09-05
Publication date: 2019-01-01

Abstract

The invention discloses a data clustering method, device and storage medium, belonging to the technical field of data mining. The method includes: performing uniform sampling on a raw sample data set to obtain a uniform sample data set; updating the position of each sample in the uniform sample data set to obtain a position-updated data set; and performing data clustering on the position-updated data set by Molecule cluster technology. By uniformly sampling the raw sample data set, the invention reduces the number of samples, which reduces the terminal resources consumed and speeds up clustering. The position of each sample in the resulting uniform sample data set is then updated, and data clustering is performed on the position-updated data set by Molecule cluster technology, which improves the accuracy of clustering the samples.

Description

Data clustering method, device and storage medium

Technical field

The present invention relates to the technical field of data mining, and in particular to a data clustering method, device and storage medium.

Background

In a big data environment, many application scenarios require shape clustering algorithms. For example, in geographic information processing, clustering algorithms are used to extract terrain information such as mountain ranges and rivers; in image processing, they are used to identify people or objects in images; in medicine, they are used to cluster protein structures and identify different types of proteins; and so on. A clustering algorithm is an algorithm that, based on the similarity between the data samples in a data set, assigns similar data samples to the same cluster, thereby dividing the samples of the raw data set into multiple clusters.

Currently, clustering algorithms usually require some prior knowledge, and when performing shape clustering they sometimes divide the data set into various convex or hyperspherical clusters.

However, because these algorithms are biased toward particular data set shapes, data of many shapes cannot be clustered, and the algorithms all have high time complexity. This makes the algorithms complex and reduces clustering efficiency and accuracy.

Summary of the invention

The embodiments of the invention provide a data clustering method, device and storage medium, to solve the problem of low clustering efficiency and low accuracy in the related art. The technical solution is as follows:

In a first aspect, a data clustering method is provided, the method comprising:

performing uniform sampling on a raw sample data set to obtain a uniform sample data set;

updating the position of each sample in the uniform sample data set to obtain a position-updated data set;

performing data clustering on the position-updated data set by Molecule cluster technology.

Optionally, performing uniform sampling on the raw sample data set to obtain the uniform sample data set comprises:

performing Gaussian distribution fitting on the raw sample data set to obtain a standard sample data set;

determining the coordinates of the center point of the standard sample data set and the coordinates of each sample;

determining the uniform sample data set based on the coordinates of the center point of the standard sample data set and the coordinates of each sample.

Optionally, determining the uniform sample data set based on the coordinates of the center point of the standard sample data set and the coordinates of each sample comprises:

determining the distance between the center point and each sample based on the coordinates of the center point and the coordinates of each sample;

adding the sample nearest to the center point to the uniform sample data set, and removing the sample added to the uniform sample data set from the standard sample data set;

determining a distance matrix based on the distances between the samples remaining in the standard sample data set after the removal and each sample in the uniform sample data set;

determining a distance column vector based on the maximum distance value in each row of the distance matrix;

adding the sample in the standard sample data set that corresponds to the minimum distance value in the distance column vector to the uniform sample data set, and removing the sample added to the uniform sample data set from the standard sample data set;

when the number of samples in the uniform sample data set has not reached a sample size threshold, returning to the operation of determining the distance matrix based on the distances between the samples remaining in the standard sample data set after the removal and each sample in the uniform sample data set, until the number of samples in the uniform sample data set reaches the sample size threshold.

Optionally, updating the position of each sample in the uniform sample data set to obtain the position-updated data set comprises:

determining the k neighbor points of each sample in the uniform sample data set by a k-nearest-neighbor algorithm;

determining the local standard parameter of each sample in the uniform sample data set based on the coordinates of the samples in the uniform sample data set and the coordinates of the k neighbor points of each sample;

determining the sample weights between each sample and the other samples in the uniform sample data set based on the local standard parameter of each sample;

determining the updated coordinates of each sample based on the sample weights and the current coordinates of each sample;

when the change between the updated coordinates of each sample and the previously determined coordinates is greater than a coordinate change threshold, updating the coordinates of each sample and returning to the operation of determining the k neighbor points of each sample in the uniform sample data set by the k-nearest-neighbor algorithm, until the change between the updated coordinates of each sample and the previously determined coordinates is less than or equal to the coordinate change threshold;

when the change between the updated coordinates of each sample and the previously determined coordinates is less than or equal to the coordinate change threshold, determining the data set composed of the samples with updated coordinates as the position-updated data set.

Optionally, determining the local standard parameter of each sample in the uniform sample data set based on the coordinates of the samples in the uniform sample data set and the coordinates of the k neighbor points of each sample comprises:

determining the local standard parameter of each sample in the uniform sample data set by the following first formula, based on the coordinates of the samples in the uniform sample data set and the coordinates of the k neighbor points of each sample;

where t(i) is the local standard parameter of any sample i in the uniform sample data set, y_t is the coordinate of any neighbor point among the k neighbor points of the sample i, y_i is the coordinate of the sample i, and kNN(y_i) is the coordinate set of the k neighbor points of the sample i.

Optionally, determining the sample weights between each sample and the other samples in the uniform sample data set based on the local standard parameter of each sample comprises:

determining the sample weights between each sample and the other samples in the uniform sample data set by the following second formula, based on the local standard parameter of each sample;

where W_ij is the sample weight between any sample i in the uniform sample data set and any other sample j besides the sample i, S_i is the coordinate of the sample i, S_j is the coordinate of the sample j, t(i) is the local standard parameter of the sample i, and t(j) is the local standard parameter of the sample j.

Optionally, determining the updated coordinates of each sample based on the sample weights and the current coordinates of each sample comprises:

determining the updated coordinates of each sample by the following third formula, based on the sample weights and the current coordinates of each sample;

where Coordinate_new is the updated coordinate of any sample j in the uniform sample data set, W_i is the weight between the sample j and any other sample i besides the sample j in the uniform sample data set, and Coordinate_i is the coordinate of the sample i.
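The first, second and third formulas are not reproduced in this text, so the following Python sketch of the position update loop substitutes assumed forms for all three: t(i) is taken as the mean distance from sample i to its k neighbor points, W_ij as a Gaussian-style weight exp(-||S_i - S_j||^2 / (t(i) t(j))), and the updated coordinate as the weight-normalized average of the other samples' coordinates. The function name `update_positions`, the `max_iter` safeguard and these formula forms are illustrative assumptions, not the formulas of the invention.

```python
import numpy as np

def update_positions(s, k, threshold, max_iter=100):
    """Iteratively update sample coordinates until the largest change <= threshold."""
    s = np.asarray(s, dtype=float).copy()
    for _ in range(max_iter):  # max_iter is a safeguard added for this sketch
        # Pairwise Euclidean distances between all samples.
        dist = np.linalg.norm(s[:, None, :] - s[None, :, :], axis=2)
        # Distances to the k nearest neighbours (column 0 is the sample itself).
        knn_dist = np.sort(dist, axis=1)[:, 1:k + 1]
        # ASSUMED first formula: mean distance to the k neighbour points.
        t = np.maximum(knn_dist.mean(axis=1), 1e-12)
        # ASSUMED second formula: locally scaled Gaussian-style weight.
        w = np.exp(-dist ** 2 / (t[:, None] * t[None, :]))
        np.fill_diagonal(w, 0.0)  # weights link a sample to the *other* samples
        # ASSUMED third formula: weight-normalised average of the other samples.
        new_s = (w @ s) / np.maximum(w.sum(axis=1, keepdims=True), 1e-300)
        change = np.abs(new_s - s).max()
        s = new_s
        if change <= threshold:  # coordinate change threshold reached
            break
    return s

# Two small groups of points; each group should contract without mixing.
s0 = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
               [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
s_new = update_positions(s0, k=2, threshold=1e-3)
```

Under these assumptions the neighbors are recomputed on every pass, which mirrors the return to the k-nearest-neighbor step described above.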

Optionally, performing clustering on the position-updated data set by Molecule cluster technology comprises:

determining each sample in the position-updated data set as one target cluster, to obtain multiple target clusters;

determining the similarity between each target cluster and the other target clusters among the multiple target clusters;

merging the two target clusters with the highest similarity into one merged cluster;

when the position-updated data set includes samples having a mutual k-nearest-neighbor relationship, determining the merged cluster as one target cluster among the multiple target clusters, and returning to the operation of determining the similarity between each target cluster and the other target clusters among the multiple target clusters, until the position-updated data set no longer includes samples having a mutual k-nearest-neighbor relationship.

Optionally, determining the similarity between each target cluster and the other target clusters among the multiple target clusters comprises:

determining the number of samples in each target cluster and the number of samples having a mutual k-nearest-neighbor relationship between each target cluster and the other target clusters;

determining the similarity between each target cluster and the other target clusters among the multiple target clusters by the following fourth formula, based on the number of samples in each target cluster and the number of samples having a mutual k-nearest-neighbor relationship between each target cluster and the other target clusters;

where P_k(c_x, c_y) is the similarity between any target cluster x and any target cluster y, P_xy is the number of samples having a mutual k-nearest-neighbor relationship between cluster x and cluster y, P_yx is the number of samples having a mutual k-nearest-neighbor relationship between cluster y and cluster x, c_x is the number of samples in cluster x, and c_y is the number of samples in cluster y.
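The merging procedure above can be sketched in Python as follows. The fourth formula is not reproduced in this text, so the similarity used here, (P_xy + P_yx) / (c_x + c_y), is an assumed stand-in built only from the four quantities the text defines. The stopping rule is likewise an interpretation: merging stops once no two different clusters contain samples that are mutual k-nearest neighbors. The names `mutual_knn`, `similarity` and `merge_clusters` are hypothetical.

```python
import numpy as np

def mutual_knn(points, k):
    """Boolean matrix: True where i is among j's k neighbours and vice versa."""
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbour
    nn = np.argsort(dist, axis=1)[:, :k]
    is_nn = np.zeros(dist.shape, dtype=bool)
    is_nn[np.arange(len(points))[:, None], nn] = True
    return is_nn & is_nn.T

def similarity(cx, cy, mutual):
    """ASSUMED stand-in for the fourth formula."""
    sub = mutual[np.ix_(cx, cy)]
    p_xy = int(sub.any(axis=1).sum())  # samples of x with a mutual neighbour in y
    p_yx = int(sub.any(axis=0).sum())  # samples of y with a mutual neighbour in x
    return (p_xy + p_yx) / (len(cx) + len(cy))

def merge_clusters(points, k):
    mutual = mutual_knn(points, k)
    clusters = [[i] for i in range(len(points))]  # every sample starts as a cluster
    while True:
        best, pair = 0.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sim = similarity(clusters[a], clusters[b], mutual)
                if sim > best:
                    best, pair = sim, (a, b)
        if pair is None:  # no mutual k-neighbour links between clusters remain
            return clusters
        a, b = pair
        clusters[a] = clusters[a] + clusters[b]  # merge the most similar pair
        del clusters[b]

pts = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
clusters = merge_clusters(pts, k=2)
```

The pairwise search is quadratic in the number of clusters per merge, which is acceptable for the few hundred samples that remain after uniform sampling.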

Optionally, the method further comprises:

determining the type of each sample remaining in the standard sample data set after the removal to be the type of the sample in the position-updated data set that is nearest to that remaining sample.
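A minimal Python sketch of this assignment step follows. It assumes Euclidean distance and that the cluster types of the position-updated samples are available as an integer label array; the function name `assign_remaining` and the argument names are hypothetical.

```python
import numpy as np

def assign_remaining(remaining_xy, updated_xy, updated_labels):
    """Give each remaining sample the type of its nearest position-updated sample."""
    # Rows: remaining samples; columns: position-updated samples.
    dist = np.linalg.norm(remaining_xy[:, None, :] - updated_xy[None, :, :], axis=2)
    return updated_labels[np.argmin(dist, axis=1)]

updated_xy = np.array([[0.0, 0.0], [5.0, 5.0]])
updated_labels = np.array([0, 1])
remaining_xy = np.array([[0.2, 0.1], [4.8, 5.3]])
labels = assign_remaining(remaining_xy, updated_xy, updated_labels)
```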

In a second aspect, a data clustering device is provided, the device comprising:

a sampling module, configured to perform uniform sampling on a raw sample data set to obtain a uniform sample data set;

an update module, configured to update the position of each sample in the uniform sample data set to obtain a position-updated data set;

a cluster module, configured to perform data clustering on the position-updated data set by Molecule cluster technology.

Optionally, the sampling module includes:

a fitting submodule, configured to perform Gaussian distribution fitting on the raw sample data set to obtain a standard sample data set;

a first determining submodule, configured to determine the coordinates of the center point of the standard sample data set and the coordinates of each sample;

a second determining submodule, configured to determine the uniform sample data set based on the coordinates of the center point of the standard sample data set and the coordinates of each sample.

Optionally, the second determining submodule is configured to:

Optionally, the update module includes:

a third determining submodule, configured to determine the k neighbor points of each sample in the uniform sample data set by a k-nearest-neighbor algorithm;

a fourth determining submodule, configured to determine the local standard parameter of each sample in the uniform sample data set based on the coordinates of the samples in the uniform sample data set and the coordinates of the k neighbor points of each sample;

a fifth determining submodule, configured to determine the sample weights between each sample and the other samples in the uniform sample data set based on the local standard parameter of each sample;

a sixth determining submodule, configured to determine the updated coordinates of each sample based on the sample weights and the current coordinates of each sample;

a first triggering submodule, configured to, when the change between the updated coordinates of each sample and the previously determined coordinates is greater than a coordinate change threshold, update the coordinates of each sample and trigger the third determining submodule to determine the k neighbor points of each sample in the uniform sample data set by the k-nearest-neighbor algorithm, until the change between the updated coordinates of each sample and the previously determined coordinates is less than or equal to the coordinate change threshold;

an updating submodule, configured to, when the change between the updated coordinates of each sample and the previously determined coordinates is less than or equal to the coordinate change threshold, determine the data set composed of the samples with updated coordinates as the position-updated data set.

Optionally, the fourth determining submodule is configured to:

where t(i) is the local standard parameter of any sample i in the uniform sample data set, y_t is the coordinate of any neighbor point among the k neighbor points of the sample i, y_i is the coordinate of the sample i, and kNN(y_i) is the coordinate set of the k neighbor points of the sample i.

Optionally, the fifth determining submodule is configured to:

Optionally, the sixth determining submodule is configured to:

Optionally, the cluster module includes:

a seventh determining submodule, configured to determine each sample in the position-updated data set as one target cluster, to obtain multiple target clusters;

an eighth determining submodule, configured to determine the similarity between each target cluster and the other target clusters among the multiple target clusters;

a merging submodule, configured to merge the two target clusters with the highest similarity into one merged cluster;

a second triggering submodule, configured to, when the position-updated data set includes samples having a mutual k-nearest-neighbor relationship, determine the merged cluster as one target cluster among the multiple target clusters, and trigger the seventh determining submodule to determine the similarity between each target cluster and the other target clusters among the multiple target clusters, until the position-updated data set no longer includes samples having a mutual k-nearest-neighbor relationship.

Optionally, the seventh determining submodule is configured to:

Optionally, the device further includes:

a determining module, configured to determine the type of each sample remaining in the standard sample data set after the removal to be the type of the sample in the position-updated data set that is nearest to that remaining sample.

The beneficial effects of the technical solution provided by the embodiments of the invention include at least the following:

In the embodiments of the invention, uniform sampling of the raw sample data set reduces the number of samples, which reduces the terminal resources consumed and speeds up clustering. The position of each sample in the resulting uniform sample data set is then updated, and data clustering is performed on the position-updated data set by Molecule cluster technology, which improves the accuracy of clustering the samples.

Brief description of the drawings

To describe the technical solutions in the embodiments of the invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the invention, and those of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a flowchart of a data clustering method provided by an embodiment of the invention;

Fig. 2 is a flowchart of another data clustering method provided by an embodiment of the invention;

Fig. 3 is a schematic diagram of a raw sample data set provided by an embodiment of the invention;

Fig. 4 is a schematic diagram of a position-updated data set provided by an embodiment of the invention;

Fig. 5 is a schematic structural diagram of a data clustering device provided by an embodiment of the invention;

Fig. 6 is a schematic structural diagram of a sampling module provided by an embodiment of the invention;

Fig. 7 is a schematic structural diagram of an update module provided by an embodiment of the invention;

Fig. 8 is a schematic structural diagram of a cluster module provided by an embodiment of the invention;

Fig. 9 is a schematic structural diagram of another data clustering device provided by an embodiment of the invention;

Fig. 10 is a schematic structural diagram of a terminal provided by an embodiment of the invention.

Detailed description of the embodiments

To make the objectives, technical solutions and advantages of the invention clearer, the embodiments of the invention are described in further detail below with reference to the drawings.

Before the embodiments of the invention are explained in detail, the application scenario involved in the embodiments is first explained.

In a big data environment, many application scenarios require shape clustering algorithms. For example, in geographic information processing, clustering algorithms are used to extract terrain information such as mountain ranges and rivers; in image processing, they are used to identify people or objects in images; in medicine, they are used to cluster protein structures and identify different types of proteins; and so on. However, because these algorithms are biased toward particular data set shapes, data of many shapes cannot be clustered, and the algorithms all have high time complexity. This makes the algorithms complex and reduces clustering efficiency and accuracy.

Based on this scenario, the embodiments of the invention provide a data clustering method that improves clustering accuracy and simplifies the clustering algorithm.

Having introduced the application scenario and system architecture of the embodiments of the invention, the data clustering method provided by the embodiments is described in detail below with reference to the drawings.

Fig. 1 is a flowchart of a data clustering method provided by an embodiment of the invention. Referring to Fig. 1, the method is applied in a terminal and includes the following steps.

Step 101: perform uniform sampling on a raw sample data set to obtain a uniform sample data set.

Step 102: update the position of each sample in the uniform sample data set to obtain a position-updated data set.

Step 103: perform data clustering on the position-updated data set by Molecule cluster technology.

Optionally, performing uniform sampling on the raw sample data set to obtain the uniform sample data set comprises:

performing Gaussian distribution fitting on the raw sample data set to obtain a standard sample data set;

determining the coordinates of the center point of the standard sample data set and the coordinates of each sample;

determining the uniform sample data set based on the coordinates of the center point of the standard sample data set and the coordinates of each sample.

Optionally, determining the uniform sample data set based on the coordinates of the center point of the standard sample data set and the coordinates of each sample comprises:

determining the distance between the center point and each sample based on the coordinates of the center point and the coordinates of each sample;

adding the sample nearest to the center point to the uniform sample data set, and removing the sample added to the uniform sample data set from the standard sample data set;

determining a distance matrix based on the distances between the samples remaining in the standard sample data set after the removal and each sample in the uniform sample data set;

determining a distance column vector based on the maximum distance value in each row of the distance matrix;

adding the sample in the standard sample data set that corresponds to the minimum distance value in the distance column vector to the uniform sample data set, and removing the sample added to the uniform sample data set from the standard sample data set;

when the number of samples in the uniform sample data set has not reached a sample size threshold, returning to the operation of determining the distance matrix based on the distances between the samples remaining in the standard sample data set after the removal and each sample in the uniform sample data set, until the number of samples in the uniform sample data set reaches the sample size threshold.

Optionally, updating the position of each sample in the uniform sample data set to obtain the position-updated data set comprises:

determining the k neighbor points of each sample in the uniform sample data set by a k-nearest-neighbor algorithm;

determining the local standard parameter of each sample in the uniform sample data set based on the coordinates of the samples in the uniform sample data set and the coordinates of the k neighbor points of each sample;

determining the sample weights between each sample and the other samples in the uniform sample data set based on the local standard parameter of each sample;

determining the updated coordinates of each sample based on the sample weights and the current coordinates of each sample;

when the change between the updated coordinates of each sample and the previously determined coordinates is greater than a coordinate change threshold, updating the coordinates of each sample and returning to the operation of determining the k neighbor points of each sample in the uniform sample data set by the k-nearest-neighbor algorithm, until the change between the updated coordinates of each sample and the previously determined coordinates is less than or equal to the coordinate change threshold;

when the change between the updated coordinates of each sample and the previously determined coordinates is less than or equal to the coordinate change threshold, determining the data set composed of the samples with updated coordinates as the position-updated data set.

Optionally, determining the local standard parameter of each sample in the uniform sample data set based on the coordinates of the samples in the uniform sample data set and the coordinates of the k neighbor points of each sample comprises:

determining the local standard parameter of each sample in the uniform sample data set by the following first formula, based on the coordinates of the samples in the uniform sample data set and the coordinates of the k neighbor points of each sample;

where t(i) is the local standard parameter of any sample i in the uniform sample data set, y_t is the coordinate of any neighbor point among the k neighbor points of the sample i, y_i is the coordinate of the sample i, and kNN(y_i) is the coordinate set of the k neighbor points of the sample i.

Optionally, determining the sample weights between each sample and the other samples in the uniform sample data set based on the local standard parameter of each sample comprises:

determining the sample weights between each sample and the other samples in the uniform sample data set by the following second formula, based on the local standard parameter of each sample;

where W_ij is the sample weight between any sample i in the uniform sample data set and any other sample j besides the sample i, S_i is the coordinate of the sample i, S_j is the coordinate of the sample j, t(i) is the local standard parameter of the sample i, and t(j) is the local standard parameter of the sample j.

Optionally, determining the updated coordinates of each sample based on the sample weights and the current coordinates of each sample comprises:

determining the updated coordinates of each sample by the following third formula, based on the sample weights and the current coordinates of each sample;

where Coordinate_new is the updated coordinate of any sample j in the uniform sample data set, W_i is the weight between the sample j and any other sample i besides the sample j in the uniform sample data set, and Coordinate_i is the coordinate of the sample i.

Optionally, performing clustering on the position-updated data set by Molecule cluster technology comprises:

determining each sample in the position-updated data set as one target cluster, to obtain multiple target clusters;

determining the similarity between each target cluster and the other target clusters among the multiple target clusters;

merging the two target clusters with the highest similarity into one merged cluster;

when the position-updated data set includes samples having a mutual k-nearest-neighbor relationship, determining the merged cluster as one target cluster among the multiple target clusters, and returning to the operation of determining the similarity between each target cluster and the other target clusters among the multiple target clusters, until the position-updated data set no longer includes samples having a mutual k-nearest-neighbor relationship.

Optionally, determining the similarity between each target cluster and the other target clusters among the multiple target clusters comprises:

determining the number of samples in each target cluster and the number of samples having a mutual k-nearest-neighbor relationship between each target cluster and the other target clusters;

determining the similarity between each target cluster and the other target clusters among the multiple target clusters by the following fourth formula, based on the number of samples in each target cluster and the number of samples having a mutual k-nearest-neighbor relationship between each target cluster and the other target clusters;

where P_k(c_x, c_y) is the similarity between any target cluster x and any target cluster y, P_xy is the number of samples having a mutual k-nearest-neighbor relationship between cluster x and cluster y, P_yx is the number of samples having a mutual k-nearest-neighbor relationship between cluster y and cluster x, c_x is the number of samples in cluster x, and c_y is the number of samples in cluster y.

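Optionally, the method further comprises:

determining the type of each sample remaining in the standard sample data set after the removal to be the type of the sample in the position-updated data set that is nearest to that remaining sample.

All of the optional technical solutions above may be combined in any manner to form optional embodiments of the invention, which are not described one by one here.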

Fig. 2 is a flowchart of a data clustering method provided by an embodiment of the invention. Referring to Fig. 2, the method includes the following steps.

Step 201: the terminal performs uniform sampling on a raw sample data set to obtain a uniform sample data set.

Because the data in a raw sample data set are usually rather disordered, operating directly on the raw sample data set may make the computation complex and waste terminal resources. To reduce the complexity of the computation and the waste of terminal resources, the terminal can first simplify the raw sample data set, that is, perform uniform sampling on it to obtain a uniform sample data set. The operation in which the terminal performs uniform sampling on the raw sample data set to obtain the uniform sample data set may be: performing Gaussian distribution fitting on the raw sample data set to obtain a standard sample data set; determining the coordinates of the center point of the standard sample data set and the coordinates of each sample; and determining the uniform sample data set based on the coordinates of the center point of the standard sample data set and the coordinates of each sample.

It should be noted that the raw sample data set is a previously given set of multiple data samples, each of which can be represented by a multi-dimensional numeric vector. For example, the raw sample data set may be the two-dimensional data set shown in Fig. 3. In Fig. 3, each point represents a two-dimensional numeric vector, that is, one sample; the abscissa of each point is the value of the sample's first dimension, and the ordinate is the value of its second dimension.

Because the raw sample data set is a multi-dimensional sample set, the operation in which the terminal performs Gaussian distribution fitting on the raw sample data set to obtain the standard sample data set may be: standardizing each dimension of the raw sample data set to mean 0 and variance 1. For ease of description, the embodiments of the invention denote the raw sample data set as D0 and the standard sample data set obtained after standardization as D.
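The standardization described here can be sketched in Python as follows. The function name `standardize` and the handling of constant dimensions (left unscaled to avoid division by zero) are assumptions of this sketch, not details given in the text.

```python
import numpy as np

def standardize(d0):
    """Scale every dimension (column) of D0 to mean 0 and variance 1."""
    d0 = np.asarray(d0, dtype=float)
    mean = d0.mean(axis=0)
    std = d0.std(axis=0)
    std[std == 0] = 1.0  # assumption: a constant dimension is only centred
    return (d0 - mean) / std

d0 = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
d = standardize(d0)  # D in the notation of the text
```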

It should also be noted that the center point of the standard sample data set is the point in the standard sample data set D whose coordinate in every dimension lies at the center of that dimension's span.
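Read literally, this definition places each coordinate of the center point midway between the smallest and largest values of that dimension. It can be written as follows, where the symbols c, x_d and n are notation introduced for this sketch rather than taken from the original text:

```latex
c_d = \frac{\min_{x \in D} x_d + \max_{x \in D} x_d}{2}, \qquad d = 1, \dots, n
```

Here c_d is the d-th coordinate of the center point, x_d is the d-th coordinate of a sample x in D, and n is the number of dimensions.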

In addition, the operation in which the terminal determines the uniform sample data set based on the coordinates of the center point of the standard sample data set and the coordinates of each sample may be: determining the distance between the center point and each sample based on the coordinates of the center point and the coordinates of each sample; adding the sample nearest to the center point to the uniform sample data set, and removing the sample added to the uniform sample data set from the standard sample data set; determining a distance matrix based on the distances between the samples remaining in the standard sample data set after the removal and each sample in the uniform sample data set; determining a distance column vector based on the maximum distance value in each row of the distance matrix; adding the sample in the standard sample data set that corresponds to the minimum distance value in the distance column vector to the uniform sample data set, and removing the sample added to the uniform sample data set from the standard sample data set; and, when the number of samples in the uniform sample data set has not reached the sample size threshold, returning to the operation of determining the distance matrix based on the distances between the samples remaining in the standard sample data set after the removal and each sample in the uniform sample data set, until the number of samples in the uniform sample data set reaches the sample size threshold.

For ease of description, the uniform sample data set is denoted as S and the distance matrix as Md, where Md is an N(D) * N(S) matrix, N(D) is the number of samples in the standard sample data set D, and N(S) is the number of samples in the uniform sample data set S. The entry in row i and column j of the distance matrix is therefore the distance from sample i in the standard sample data set D to sample j in the uniform sample data set S. Since the distance matrix Md has N(S) columns, the distance column vector that the terminal determines from the maximum distance value in each row of Md has length N(S).

For example, terminal can determine that the central point is each with this according to the coordinate of coordinate and each sample based on central point Uniform Sample data set S is added in the sample m nearest apart from the central point by the distance between a sample, and by sample m from mark It is rejected in quasi- sample data set D.Later, terminal can be based on remaining sample in the master sample data set D after Rejection of samples m With the distance between each sample in Uniform Sample data set S, the distance matrix Md of N (D) * N (S) is determined.It adjusts the distance Every a line in matrix Md retains its maximum value, so that obtaining length is the column vector of N (S), then finds out the minimum in column vector Sample in master sample data set D corresponding to the position is added by Uniform Sample data set S for value corresponding position, And the sample being added into Uniform Sample data set S is rejected from master sample data set D.If Uniform Sample data set S In quantity be not up to sample size threshold value, then return based on remaining sample in the master sample data set after Rejection of samples with The distance between each sample, determines the operation of distance matrix in the Uniform Sample data set, until in Uniform Sample data set Until the number of sample reaches the sample size threshold value.

It should be noted that the sample size threshold value can be arranged in advance, for example, the sample size threshold value can for 200, 150,100 etc..
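The sampling procedure described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name `uniform_sample`, the parameter names, and the use of Euclidean distance are assumptions made for illustration only.

```python
import math

def uniform_sample(points, center, size_threshold):
    """Select samples following the procedure described above.

    points: list of coordinate tuples (the master sample data set D)
    center: coordinate tuple of the central point
    size_threshold: sample size threshold (hypothetical parameter name)
    Returns (selected, remaining) as lists of indices into points.
    """
    remaining = list(range(len(points)))
    # Seed S with the sample nearest to the central point, then reject it from D.
    first = min(remaining, key=lambda i: math.dist(points[i], center))
    selected = [first]
    remaining.remove(first)

    while len(selected) < size_threshold and remaining:
        # Distance matrix Md: one row per remaining sample of D,
        # one column per sample already in S (Euclidean distance is assumed).
        md = [[math.dist(points[i], points[j]) for j in selected]
              for i in remaining]
        # Distance column vector: the maximum of each row, as the text states.
        col = [max(row) for row in md]
        # Move the sample at the position of the minimum entry from D to S.
        pos = min(range(len(col)), key=col.__getitem__)
        selected.append(remaining.pop(pos))
    return selected, remaining
```

The sketch keeps the max-per-row, then min-over-rows order exactly as the description gives it, and it rebuilds the whole distance matrix on every pass for clarity rather than speed.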

Step 202: the terminal updates the position of each sample in the Uniform Sample data set to obtain the data set after location updating.

Since the number of samples in the Uniform Sample data set S is significantly smaller than the number of samples in the raw sample data set D0, the distribution of the samples in the Uniform Sample data set S has also changed noticeably, that is, the density of the samples has become smaller. Therefore, in order to make the samples more distinct and easier to classify later, the terminal needs to update the position of each sample in the Uniform Sample data set to obtain the data set after location updating.

For ease of description, the data set after location updating is denoted S'. The operation in which the terminal updates the position of each sample in the Uniform Sample data set to obtain the data set S' after location updating may include the following Steps A to F.

Step A: the terminal determines the k neighbor points of each sample in the Uniform Sample data set by the k nearest neighbor algorithm.

It should be noted that the value of k can be set in advance; for example, k can be 3, 4, 5, and so on. The operation in which the terminal determines the k neighbor points of each sample in the Uniform Sample data set by the k nearest neighbor algorithm can refer to the related art, and is not described in detail in the embodiments of the present invention.
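Since the text defers the neighbor search to the related art, the following is only a brute-force Python sketch of what Step A computes. The function name `k_nearest_neighbors` and the use of Euclidean distance are assumptions for illustration.

```python
import math

def k_nearest_neighbors(points, k):
    """Return, for each sample index, the indices of its k nearest other samples."""
    neighbors = []
    for i, p in enumerate(points):
        # Every sample except i itself is a candidate neighbor.
        others = [j for j in range(len(points)) if j != i]
        # Sort candidates by distance to sample i (ties keep index order).
        others.sort(key=lambda j: math.dist(p, points[j]))
        neighbors.append(others[:k])
    return neighbors
```

Brute force costs on the order of n² distance evaluations, which is acceptable here because the Uniform Sample data set is capped by the sample size threshold.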

Step B: the terminal determines the local standard parameter of each sample in the Uniform Sample data set S based on the coordinates of the samples in the Uniform Sample data set S and the coordinates of the k neighbor points of each sample.

The terminal can determine the local standard parameter of each sample in the Uniform Sample data set S by the following first formula, based on the coordinates of the samples in the Uniform Sample data set S and the coordinates of the k neighbor points of each sample.

It should be noted that, in the above first formula (1), t(i) is the local standard parameter of any sample i in the Uniform Sample data set S, y_t is the coordinate of any neighbor point among the k neighbor points of sample i, y_i is the coordinate of sample i, and kNN(y_i) is the set of coordinates of the k neighbor points of sample i.

Step C: the terminal determines the sample weights between each sample and the other samples in the Uniform Sample data set S based on the local standard parameter of each sample.

The terminal can determine the sample weights between each sample and the other samples in the Uniform Sample data set S by the following second formula, based on the local standard parameter of each sample.

It should be noted that, in the above second formula (2), W_ij is the sample weight between any sample i in the Uniform Sample data set S and any other sample j other than sample i, S_i is the coordinate of sample i, S_j is the coordinate of sample j, t(i) is the local standard parameter of sample i, and t(j) is the local standard parameter of sample j.

Step D: the terminal determines the updated coordinate of each sample based on the sample weights and the current coordinate of each sample.

The terminal can determine the updated coordinate of each sample by the following third formula, based on the sample weights and the current coordinate of each sample.

It should be noted that, in the above third formula (3), Coordinate_new is the updated coordinate of any sample j in the Uniform Sample data set S, W_i is the weight between sample j and any other sample i other than sample j in the Uniform Sample data set S, and Coordinate_i is the coordinate of sample i.

Step E: when the terminal determines that the change between the updated coordinate of each sample and the previously determined coordinate is greater than the coordinate change threshold, the terminal updates the coordinate of each sample and returns to Step A, until the change between the updated coordinate of each sample and the previously determined coordinate is less than or equal to the coordinate change threshold.

It should be noted that the coordinate change threshold includes an abscissa threshold and an ordinate threshold, and can be set in advance; for example, the abscissa threshold can be 1, 2, 3, and so on, and the ordinate threshold can be 1, 2, 3, and so on.

Step F: when the change between the updated coordinate of each sample and the previously determined coordinate is less than or equal to the coordinate change threshold, the terminal determines the data set composed of the samples with updated coordinates as the data set S' after location updating.

When the change between the updated coordinate of each sample and the previously determined coordinate is less than or equal to the coordinate change threshold, the positions of the samples have stabilized, and the terminal can then determine the data set composed of the samples with updated coordinates as the data set S' after location updating; see Fig. 4.
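The stopping rule of Steps E and F can be sketched as the loop below. Because the first to third formulas are not reproduced here, the update itself is passed in as a hypothetical callable `update_step` that stands in for Steps A to D. The names `iterate_until_stable`, `x_threshold`, `y_threshold` and the `max_iter` safety cap are likewise assumptions of this sketch, not part of the described method.

```python
def iterate_until_stable(coords, update_step, x_threshold, y_threshold, max_iter=1000):
    """Repeat the update until no sample moves more than the per-axis thresholds.

    coords: list of (x, y) tuples (the samples of S)
    update_step: hypothetical callable standing in for Steps A to D; it takes
        the current coordinates and returns the updated coordinates
    max_iter: safety cap added for this sketch only
    """
    current = list(coords)
    for _ in range(max_iter):
        updated = update_step(current)
        # Step E: compare each updated coordinate with the previous one,
        # separately on the abscissa and the ordinate.
        moved = any(
            abs(nx - ox) > x_threshold or abs(ny - oy) > y_threshold
            for (ox, oy), (nx, ny) in zip(current, updated)
        )
        current = updated
        if not moved:
            # Step F: positions are stable, so this is the data set S'.
            break
    return current
```

For instance, with an update that halves every coordinate and thresholds of 1 on both axes, the sample (8.0, 4.0) stops at (1.0, 0.5), the first update whose change on each axis is no greater than 1.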

It is worth noting that the operations of Steps A to F above make the position of each sample in the Uniform Sample data set S stabilize, that is, the position of each sample changes very little or no longer changes, thereby yielding a sample data set in which samples belonging to the same data class are compactly distributed and samples belonging to different data classes are far apart.

Step 203: the terminal performs data clustering on the data set after location updating by Molecule cluster technology.

The operation in which the terminal performs clustering on the data set S' after location updating by Molecule cluster technology can be: determining each sample in the data set after location updating as one target cluster, obtaining multiple target clusters; determining the similarity between each target cluster and the other target clusters among the multiple target clusters; merging the two target clusters with the highest similarity into one agglomerative cluster; and, when the data set after location updating includes samples in a mutual k proximity relation, determining the agglomerative cluster as one target cluster among the multiple target clusters and returning to the operation of determining the similarity between each target cluster and the other target clusters, until the data set after location updating no longer includes samples in a mutual k proximity relation.

It should be noted that, for ease of classification, the terminal can determine each sample as an independent small class, that is, determine each sample as one target cluster, and merge the two target clusters with the highest similarity into one agglomerative cluster. The terminal determines the distance between each target cluster and the other target clusters in order to determine the similarity between them; therefore, the two clusters nearest to each other are the two clusters with the highest similarity.

When the data set S' after location updating includes samples in a mutual k proximity relation, the similarity between the target clusters containing those samples is usually high. Therefore, when the data set S' after location updating includes samples in a mutual k proximity relation, the terminal can continue to determine the agglomerative cluster as one target cluster among the multiple target clusters and return to the operation of determining the similarity between each target cluster and the other target clusters, until the data set after location updating no longer includes samples in a mutual k proximity relation.
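The text does not spell out the mutual k proximity relation in this passage. The sketch below assumes the usual reading, namely that two samples are in such a relation when each appears among the other's k nearest neighbors. The function name `mutual_knn_pairs` and the Euclidean distance are assumptions for illustration.

```python
import math

def mutual_knn_pairs(points, k):
    """Return index pairs (i, j), i < j, that are in each other's k nearest neighbors."""
    def knn(i):
        # k nearest other samples of sample i, by Euclidean distance (assumed).
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: math.dist(points[i], points[j]))
        return set(others[:k])

    neighbor_sets = [knn(i) for i in range(len(points))]
    # Keep a pair only when the neighbor relation holds in both directions.
    return {(i, j)
            for i in range(len(points))
            for j in neighbor_sets[i]
            if i < j and i in neighbor_sets[j]}
```

Under this reading, the stopping test of the merging loop amounts to checking whether any such pair still joins two different target clusters.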

Merging the two target clusters with the highest similarity into one agglomerative cluster can mean that the terminal marks the two target clusters with the highest similarity with the same identifier; the identifier can be text, a number, a color, and so on.
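Marking two clusters with the same identifier can be sketched as a simple relabeling. The helper name `merge_clusters` and the representation of identifiers as a list with one entry per sample are assumptions for illustration.

```python
def merge_clusters(labels, x, y):
    """Merge target cluster y into x by giving every sample of y the identifier of x.

    labels: list holding the cluster identifier of each sample
    Returns a new list; the input list is left unchanged.
    """
    return [x if label == y else label for label in labels]
```

Any hashable identifier works, so numeric identifiers and text identifiers such as "red" or "blue" are handled alike.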

In addition, the operation in which the terminal determines the similarity between each target cluster and the other target clusters among the multiple target clusters can be: determining the number of samples in each target cluster and the number of samples in a mutual k proximity relation between each target cluster and the other target clusters; and, based on the number of samples in each target cluster and the number of samples in a mutual k proximity relation between each target cluster and the other target clusters, determining the similarity between each target cluster and the other target clusters by the following fourth formula.

It should be noted that, in the above fourth formula (4), P_k(c_x, c_y) is the similarity between any target cluster x and any target cluster y, P_xy is the number of samples in a mutual k proximity relation between cluster x and cluster y, P_yx is the number of samples in a mutual k proximity relation between cluster y and cluster x, c_x is the number of samples in cluster x, and c_y is the number of samples in cluster y.
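The quantities that the fourth formula takes as input can be gathered as in the sketch below; the formula itself is not reproduced here, so the sketch stops at the counts. It assumes one possible reading of the two symbols: P_xy counts the samples of cluster x that have a mutual k proximity partner in cluster y, and P_yx counts the samples of cluster y that have such a partner in cluster x. The function name `similarity_inputs` and its parameters are hypothetical.

```python
from collections import Counter

def similarity_inputs(labels, mutual_pairs, x, y):
    """Collect (P_xy, P_yx, c_x, c_y) for two distinct target clusters x and y.

    labels: list holding the cluster identifier of each sample
    mutual_pairs: iterable of (i, j) sample index pairs in a mutual k proximity relation
    """
    in_x, in_y = set(), set()
    for i, j in mutual_pairs:
        # Check the pair in both orientations, since pairs are unordered.
        for a, b in ((i, j), (j, i)):
            if labels[a] == x and labels[b] == y:
                in_x.add(a)  # sample of x with a partner in y
                in_y.add(b)  # sample of y with a partner in x
    sizes = Counter(labels)
    return len(in_x), len(in_y), sizes[x], sizes[y]
```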

In addition, since not all samples in the master sample data set D are added to the Uniform Sample data set S, the terminal not only needs to cluster the samples in the Uniform Sample data set S, but also needs to cluster the samples in the master sample data set D that were not added to the Uniform Sample data set S. For ease of description, the data set composed of the samples remaining in the master sample data set D is denoted D'. The operation in which the terminal clusters the samples in data set D' can be: determining the type of each remaining sample in the master sample data set after the rejection as the type of the sample, in the data set after location updating, that is nearest to that remaining sample.
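Assigning each remaining sample the type of its nearest sample in the data set after location updating can be sketched as follows. The function name `assign_remaining`, the parameter names, and the Euclidean distance are assumptions for illustration.

```python
import math

def assign_remaining(remaining_points, updated_points, updated_labels):
    """Give each remaining sample of D' the type of its nearest sample in S'.

    remaining_points: coordinates of the samples left in D (the set D')
    updated_points: coordinates of the samples in S' after location updating
    updated_labels: cluster type of each sample in S', in the same order
    """
    assigned = []
    for p in remaining_points:
        # Index of the sample in S' nearest to p (Euclidean distance assumed).
        nearest = min(range(len(updated_points)),
                      key=lambda j: math.dist(p, updated_points[j]))
        assigned.append(updated_labels[nearest])
    return assigned
```

This is a 1-nearest-neighbor assignment, so every sample of D' receives exactly one type and no new clusters are created.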

It is worth noting that, by way of Molecule cluster, the terminal finds for each sample in the Uniform Sample data set the cluster it belongs to; meanwhile, for the other samples in the master sample data set that were not added to the Uniform Sample data set, the terminal also finds the identifier marking their sample type, finally achieving effective clustering of the data set.

In the embodiments of the present invention, the terminal reduces the number of samples by performing uniform sampling on the raw sample data set, thereby reducing the operating resources consumed on the terminal and improving the clustering speed. The terminal then updates the position of each sample in the resulting Uniform Sample data set and performs data clustering on the data set after location updating by Molecule cluster technology, improving the accuracy of clustering the samples.

Having explained the data clustering method provided by the embodiments of the present invention, the data clustering device provided by the embodiments of the present invention is introduced next.

Fig. 5 is a block diagram of a data clustering device provided by an embodiment of the present disclosure. Referring to Fig. 5, the data clustering device can be implemented by software, hardware, or a combination of both. The device includes: a sampling module 501, an update module 502 and a cluster module 503.

The sampling module 501 is used for performing uniform sampling on the raw sample data set to obtain the Uniform Sample data set;

The update module 502 is used for updating the position of each sample in the Uniform Sample data set to obtain the data set after location updating;

The cluster module 503 is used for performing data clustering on the data set after location updating by Molecule cluster technology.

Optionally, referring to Fig. 6, the sampling module 501 includes:

A fitting submodule 5011, used for performing Gaussian distribution fitting on the raw sample data set to obtain the master sample data set;

A first determining submodule 5012, used for determining the coordinate of the central point of the master sample data set and the coordinate of each sample;

A second determining submodule 5013, used for determining the Uniform Sample data set based on the coordinate of the central point of the master sample data set and the coordinate of each sample.

Optionally, the second determining submodule 5013 is used for:

Optionally, referring to Fig. 7, the update module 502 includes:

A third determining submodule 5021, used for determining the k neighbor points of each sample in the Uniform Sample data set by the k nearest neighbor algorithm;

A fourth determining submodule 5022, used for determining the local standard parameter of each sample in the Uniform Sample data set based on the coordinates of the samples in the Uniform Sample data set and the coordinates of the k neighbor points of each sample;

A fifth determining submodule 5023, used for determining the sample weights between each sample and the other samples in the Uniform Sample data set based on the local standard parameter of each sample;

A sixth determining submodule 5024, used for determining the updated coordinate of each sample based on the sample weights and the current coordinate of each sample;

A first triggering submodule 5025, used for, when it is determined that the change between the updated coordinate of each sample and the previously determined coordinate is greater than the coordinate change threshold, updating the coordinate of each sample and triggering the third determining submodule 5021 to determine the k neighbor points of each sample in the Uniform Sample data set by the k nearest neighbor algorithm, until the change between the updated coordinate of each sample and the previously determined coordinate is less than or equal to the coordinate change threshold;

An updating submodule 5026, used for, when the change between the updated coordinate of each sample and the previously determined coordinate is less than or equal to the coordinate change threshold, determining the data set composed of the samples with updated coordinates as the data set after location updating.

Optionally, the fourth determining submodule 5022 is used for:

Optionally, the fifth determining submodule 5023 is used for:

Optionally, the sixth determining submodule 5024 is used for:

Optionally, referring to Fig. 8, the cluster module 503 includes:

A seventh determining submodule 5031, used for determining each sample in the data set after location updating as one target cluster, obtaining multiple target clusters;

An eighth determining submodule 5032, used for determining the similarity between each target cluster and the other target clusters among the multiple target clusters;

A merging submodule 5033, used for merging the two target clusters with the highest similarity into one agglomerative cluster;

A second triggering submodule 5034, used for, when the data set after location updating includes samples in a mutual k proximity relation, determining the agglomerative cluster as one target cluster among the multiple target clusters, and triggering the seventh determining submodule to determine the similarity between each target cluster and the other target clusters among the multiple target clusters, until the data set after location updating no longer includes samples in a mutual k proximity relation.

Optionally, the seventh determining submodule 5031 is used for:

Optionally, referring to Fig. 9, the device further includes:

A determining module 504, used for determining the type of each remaining sample in the master sample data set after the rejection as the type of the sample, in the data set after location updating, that is nearest to that remaining sample.

In conclusion, in the embodiments of the present invention, the terminal reduces the number of samples by performing uniform sampling on the raw sample data set, thereby reducing the operating resources consumed on the terminal and improving the clustering speed. The terminal then updates the position of each sample in the resulting Uniform Sample data set and performs data clustering on the data set after location updating by Molecule cluster technology, improving the accuracy of clustering the samples.

It should be understood that, when the data clustering device provided by the above embodiment performs data clustering, the division into the functional modules above is only an example. In practical applications, the above functions can be assigned to different functional modules as needed, that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above. In addition, the data clustering device provided by the above embodiment and the data clustering method embodiments belong to the same concept; the specific implementation process is detailed in the method embodiments and is not repeated here.

Fig. 10 shows a structural block diagram of a terminal 1000 provided by an illustrative embodiment of the present invention. The terminal 1000 can be: a smartphone, a tablet computer, an MP3 player (Moving Picture Experts Group Audio Layer III), an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop computer or a desktop computer. The terminal 1000 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, or desktop terminal.

In general, the terminal 1000 includes: a processor 1001 and a memory 1002.

The processor 1001 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 1001 can be implemented in at least one hardware form among DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), and PLA (Programmable Logic Array). The processor 1001 may also include a main processor and a coprocessor: the main processor, also called the CPU (Central Processing Unit), is the processor for processing data in the awake state; the coprocessor is a low-power processor for processing data in the standby state. In some embodiments, the processor 1001 can be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be shown on the display screen. In some embodiments, the processor 1001 can also include an AI (Artificial Intelligence) processor, which handles computing operations related to machine learning.

The memory 1002 may include one or more computer-readable storage media, which can be non-transitory. The memory 1002 may also include high-speed random access memory and non-volatile memory, such as one or more disk storage devices or flash memory devices. In some embodiments, the non-transitory computer-readable storage medium in the memory 1002 is used for storing at least one instruction, and the at least one instruction is executed by the processor 1001 to implement the data clustering method provided in the method embodiments of the present application.

In some embodiments, the terminal 1000 optionally further includes: a peripheral device interface 1003 and at least one peripheral device. The processor 1001, the memory 1002 and the peripheral device interface 1003 can be connected by a bus or signal lines. Each peripheral device can be connected to the peripheral device interface 1003 by a bus, a signal line or a circuit board. Specifically, the peripheral devices include at least one of: a radio circuit 1004, a touch display screen 1005, a camera 1006, an audio circuit 1007, a positioning component 1008 and a power supply 1009.

The peripheral device interface 1003 can be used to connect at least one I/O (Input/Output) related peripheral device to the processor 1001 and the memory 1002. In some embodiments, the processor 1001, the memory 1002 and the peripheral device interface 1003 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 1001, the memory 1002 and the peripheral device interface 1003 can be implemented on a separate chip or circuit board, which is not limited in this embodiment.

The radio circuit 1004 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio circuit 1004 communicates with communication networks and other communication devices through electromagnetic signals. The radio circuit 1004 converts electric signals into electromagnetic signals for transmission, or converts received electromagnetic signals into electric signals. Optionally, the radio circuit 1004 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so on. The radio circuit 1004 can communicate with other terminals through at least one wireless communication protocol. The wireless communication protocol includes but is not limited to: metropolitan area networks, mobile communication networks of each generation (2G, 3G, 4G and 5G), wireless local area networks and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio circuit 1004 can also include circuits related to NFC (Near Field Communication), which is not limited in the present application.

The display screen 1005 is used for displaying a UI (User Interface). The UI may include graphics, text, icons, video and any combination thereof. When the display screen 1005 is a touch display screen, the display screen 1005 also has the ability to acquire touch signals on or above its surface. The touch signal can be input to the processor 1001 as a control signal for processing. In this case, the display screen 1005 can also be used to provide virtual buttons and/or a virtual keyboard, also called soft buttons and/or a soft keyboard. In some embodiments, there can be one display screen 1005, arranged on the front panel of the terminal 1000; in other embodiments, there can be at least two display screens 1005, arranged on different surfaces of the terminal 1000 or in a folding design; in still other embodiments, the display screen 1005 can be a flexible display screen, arranged on a curved surface or a folding surface of the terminal 1000. The display screen 1005 can even be set to a non-rectangular irregular shape, that is, a special-shaped screen. The display screen 1005 can be made of materials such as LCD (Liquid Crystal Display) and OLED (Organic Light-Emitting Diode).

The camera assembly 1006 is used for acquiring images or video. Optionally, the camera assembly 1006 includes a front camera and a rear camera. In general, the front camera is arranged on the front panel of the terminal, and the rear camera is arranged on the back of the terminal. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera can be fused to realize a background blurring function, and the main camera and the wide-angle camera can be fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fused shooting functions. In some embodiments, the camera assembly 1006 can also include a flash. The flash can be a single color temperature flash or a dual color temperature flash. A dual color temperature flash is a combination of a warm light flash and a cold light flash, and can be used for light compensation under different color temperatures.

The audio circuit 1007 may include a microphone and a loudspeaker. The microphone is used for acquiring sound waves from the user and the environment, converting the sound waves into electric signals, and inputting them to the processor 1001 for processing or to the radio circuit 1004 to realize voice communication. For stereo acquisition or noise reduction, there can be multiple microphones, arranged at different parts of the terminal 1000. The microphone can also be an array microphone or an omnidirectional microphone. The loudspeaker is used for converting electric signals from the processor 1001 or the radio circuit 1004 into sound waves. The loudspeaker can be a traditional film loudspeaker or a piezoelectric ceramic loudspeaker. When the loudspeaker is a piezoelectric ceramic loudspeaker, it can convert electric signals not only into sound waves audible to humans but also into sound waves inaudible to humans, for purposes such as ranging. In some embodiments, the audio circuit 1007 can also include an earphone jack.

The positioning component 1008 is used for determining the current geographic position of the terminal 1000, to realize navigation or LBS (Location Based Service). The positioning component 1008 can be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.

The power supply 1009 is used for supplying power to the various components in the terminal 1000. The power supply 1009 can be alternating current, direct current, a disposable battery or a rechargeable battery. When the power supply 1009 includes a rechargeable battery, the rechargeable battery can support wired charging or wireless charging. The rechargeable battery can also support fast charging technology.

In some embodiments, the terminal 1000 further includes one or more sensors 1010. The one or more sensors 1010 include but are not limited to: an acceleration sensor 1011, a gyroscope sensor 1012, a pressure sensor 1013, a fingerprint sensor 1014, an optical sensor 1015 and a proximity sensor 1016.

The acceleration sensor 1011 can detect the magnitude of acceleration on the three coordinate axes of the coordinate system established with the terminal 1000. For example, the acceleration sensor 1011 can be used for detecting the components of gravitational acceleration on the three coordinate axes. The processor 1001 can, according to the gravitational acceleration signal acquired by the acceleration sensor 1011, control the touch display screen 1005 to display the user interface in landscape view or portrait view. The acceleration sensor 1011 can also be used for acquiring game data or the user's motion data.

The gyroscope sensor 1012 can detect the body orientation and rotation angle of the terminal 1000, and can cooperate with the acceleration sensor 1011 to acquire the user's 3D actions on the terminal 1000. Based on the data acquired by the gyroscope sensor 1012, the processor 1001 can realize the following functions: motion sensing (for example, changing the UI according to the user's tilt operation), image stabilization during shooting, game control and inertial navigation.

The pressure sensor 1013 can be arranged on the side frame of the terminal 1000 and/or on the lower layer of the touch display screen 1005. When the pressure sensor 1013 is arranged on the side frame of the terminal 1000, it can detect the user's grip signal on the terminal 1000, and the processor 1001 performs left-hand or right-hand recognition or a shortcut operation according to the grip signal acquired by the pressure sensor 1013. When the pressure sensor 1013 is arranged on the lower layer of the touch display screen 1005, the processor 1001 controls the operable controls on the UI according to the user's pressure operation on the touch display screen 1005. The operable controls include at least one of a button control, a scroll bar control, an icon control and a menu control.

The fingerprint sensor 1014 is used for acquiring the user's fingerprint; the processor 1001 identifies the user's identity according to the fingerprint acquired by the fingerprint sensor 1014, or the fingerprint sensor 1014 identifies the user's identity according to the acquired fingerprint. When the user's identity is identified as a trusted identity, the processor 1001 authorizes the user to perform relevant sensitive operations, which include unlocking the screen, viewing encrypted information, downloading software, making payments, changing settings, and so on. The fingerprint sensor 1014 can be arranged on the front, back or side of the terminal 1000. When a physical button or a manufacturer logo is provided on the terminal 1000, the fingerprint sensor 1014 can be integrated with the physical button or the manufacturer logo.

The optical sensor 1015 is used for acquiring ambient light intensity. In one embodiment, the processor 1001 can control the display brightness of the touch display screen 1005 according to the ambient light intensity acquired by the optical sensor 1015. Specifically, when the ambient light intensity is high, the display brightness of the touch display screen 1005 is turned up; when the ambient light intensity is low, the display brightness of the touch display screen 1005 is turned down. In another embodiment, the processor 1001 can also dynamically adjust the shooting parameters of the camera assembly 1006 according to the ambient light intensity acquired by the optical sensor 1015.

The proximity sensor 1016, also called a distance sensor, is generally arranged on the front panel of the terminal 1000. The proximity sensor 1016 is used for acquiring the distance between the user and the front of the terminal 1000. In one embodiment, when the proximity sensor 1016 detects that the distance between the user and the front of the terminal 1000 gradually decreases, the processor 1001 controls the touch display screen 1005 to switch from the screen-on state to the screen-off state; when the proximity sensor 1016 detects that the distance between the user and the front of the terminal 1000 gradually increases, the processor 1001 controls the touch display screen 1005 to switch from the screen-off state to the screen-on state.

It that is to say, the embodiment of the present invention provides not only a kind of terminal, including processor and can hold for storage processor The memory of row instruction, wherein processor is configured as executing the method in Fig. 1 and embodiment shown in Fig. 2, moreover, this hair Bright embodiment additionally provides a kind of computer readable storage medium, is stored with computer program in the storage medium, the computer The data clustering method in Fig. 1 and embodiment shown in Fig. 2 may be implemented when program is executed by processor.

Those skilled in the art will understand that the structure shown in Figure 10 does not constitute a limitation on the terminal 1000, which may include more or fewer components than illustrated, combine certain components, or adopt a different arrangement of components.

Those of ordinary skill in the art will understand that all or part of the steps of the above embodiments may be completed by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.

The foregoing describes only preferred embodiments of the present invention and is not intended to limit the invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims

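1. A data clustering method, characterized in that the method comprises:

performing uniform sampling on an original sample data set to obtain a uniform sample data set;

updating the position of each sample in the uniform sample data set to obtain a position-updated data set;

performing data clustering on the position-updated data set by an agglomerative clustering technique.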

2. The method of claim 1, characterized in that performing uniform sampling on the original sample data set to obtain the uniform sample data set comprises:

determining the uniform sample data set based on the coordinates of the center point of the original sample data set and the coordinates of each sample.

3. The method of claim 2, characterized in that determining the uniform sample data set based on the coordinates of the center point of the original sample data set and the coordinates of each sample comprises:

determining the distance between the center point and each sample based on the coordinates of the center point and the coordinates of each sample;

adding the sample nearest to the center point to the uniform sample data set, and removing the sample added to the uniform sample data set from the original sample data set;

adding the sample in the original sample data set that corresponds to the minimum distance value in the column vector to the uniform sample data set, and removing the sample added to the uniform sample data set from the original sample data set;

when the number of samples in the uniform sample data set has not reached a sample quantity threshold, returning to the operation of determining a distance matrix based on the distances between the samples remaining in the original sample data set after removal and each sample in the uniform sample data set, until the number of samples in the uniform sample data set reaches the sample quantity threshold.
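The sampling loop of claim 3 can be sketched roughly as follows. This is a minimal illustration, not the patented procedure: the claim text as extracted does not state how the center point is computed or how the column vector is derived from the distance matrix, so the sketch assumes the center is the coordinate mean and exposes the row reduction as a hypothetical `row_reduce` parameter (defaulting to the row-wise minimum). The names `uniform_sample` and `n_samples` are likewise illustrative.

```python
import numpy as np

def uniform_sample(points, n_samples, row_reduce=np.min):
    """Return (selected, remaining) index lists for the original sample set."""
    points = np.asarray(points, dtype=float)
    center = points.mean(axis=0)  # ASSUMPTION: center point = coordinate mean
    remaining = list(range(len(points)))

    # Seed the uniform set with the sample nearest to the center point.
    first = int(np.argmin(np.linalg.norm(points - center, axis=1)))
    selected = [first]
    remaining.remove(first)

    while len(selected) < n_samples and remaining:
        # Distance matrix: rows = remaining samples, columns = selected samples.
        diff = points[remaining][:, None, :] - points[selected][None, :, :]
        dist = np.linalg.norm(diff, axis=2)
        # ASSUMPTION: the column vector is a row-wise reduction of the matrix.
        column = row_reduce(dist, axis=1)
        # Per the claim, the sample with the minimum value in the column is moved.
        pick = remaining[int(np.argmin(column))]
        selected.append(pick)
        remaining.remove(pick)
    return selected, remaining
```

A different reduction (for example `np.max` or `np.sum`) can be passed as `row_reduce` if it matches the description's distance-matrix step more closely.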

4. The method of claim 1, characterized in that updating the position of each sample in the uniform sample data set to obtain the position-updated data set comprises:

determining the local standard parameter of each sample in the uniform sample data set based on the coordinates of the samples in the uniform sample data set and the coordinates of the k neighbor points of each sample;

determining the sample weights between each sample and the other samples in the uniform sample data set based on the local standard parameter of each sample;

determining the updated coordinates of each sample based on the sample weights and the current coordinates of each sample;

when the change between the updated coordinates of each sample and the previously determined coordinates is greater than a coordinate change threshold, updating the coordinates of each sample and returning to the operation of determining the k neighbor points of each sample in the uniform sample data set by the k-nearest-neighbor algorithm, until the change between the updated coordinates of each sample and the previously determined coordinates is less than or equal to the coordinate change threshold;
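The stopping rule of claim 4 amounts to a fixed-point iteration. The sketch below shows only that control flow under stated assumptions: `step` is a hypothetical callable that performs one full position update, the change is measured as the largest per-sample displacement, and `max_iter` is a safety cap that does not appear in the claim.

```python
import numpy as np

def iterate_until_stable(points, step, change_threshold, max_iter=100):
    """Apply `step` repeatedly until coordinates move no more than the threshold."""
    current = np.asarray(points, dtype=float)
    for _ in range(max_iter):  # ASSUMPTION: iteration cap added for safety
        updated = step(current)
        # ASSUMPTION: change = largest Euclidean displacement of any sample.
        change = np.max(np.linalg.norm(updated - current, axis=1))
        current = updated
        if change <= change_threshold:
            break
    return current
```

For example, with `step = lambda p: p * 0.5`, a single point at 8.0, and a threshold of 1.0, the loop stops once the point has moved from 2.0 to 1.0.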

5. The method of claim 4, characterized in that determining the local standard parameter of each sample in the uniform sample data set based on the coordinates of the samples in the uniform sample data set and the coordinates of the k neighbor points of each sample comprises:

determining the local standard parameter of each sample in the uniform sample data set by the following first formula, based on the coordinates of the samples in the uniform sample data set and the coordinates of the k neighbor points of each sample;

wherein t(i) is the local standard parameter of any sample i in the uniform sample data set, y_t is the coordinate of any neighbor point among the k neighbor points of the sample i, y_i is the coordinate of the sample i, and kNN(y_i) is the coordinate set of the k neighbor points of the sample i.
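The first formula itself is not present in this text. One form consistent with the symbols defined in claim 5 is the mean distance from a sample to its k neighbor points, t(i) = (1/k) Σ ‖y_t − y_i‖ over y_t in kNN(y_i); this is an assumption made for illustration, and the function name `local_standard_parameter` is hypothetical.

```python
import numpy as np

def local_standard_parameter(points, k):
    """Return (t, knn_idx): assumed local parameter and neighbor indices per sample."""
    points = np.asarray(points, dtype=float)
    diff = points[:, None, :] - points[None, :, :]
    dist = np.linalg.norm(diff, axis=2)
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbor
    knn_idx = np.argsort(dist, axis=1)[:, :k]
    knn_dist = np.take_along_axis(dist, knn_idx, axis=1)
    # ASSUMPTION: t(i) is the mean distance to the k neighbor points.
    return knn_dist.mean(axis=1), knn_idx
```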

6. The method of claim 4, characterized in that determining the sample weights between each sample and the other samples in the uniform sample data set based on the local standard parameter of each sample comprises:

determining the sample weights between each sample and the other samples in the uniform sample data set by the following second formula, based on the local standard parameter of each sample;

wherein W_ij is the sample weight between any sample i in the uniform sample data set and any other sample j other than the sample i, S_i is the coordinate of the sample i, S_j is the coordinate of the sample j, t(i) is the local standard parameter of the sample i, and t(j) is the local standard parameter of the sample j.
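The second formula is also missing here. A commonly used weight that depends on exactly the quantities named in claim 6 is the locally scaled Gaussian affinity W_ij = exp(−‖S_i − S_j‖² / (t(i)·t(j))). The sketch below uses that form purely as an assumption; it also assumes every t(i) is strictly positive, which fails if a sample coincides with all of its neighbors.

```python
import numpy as np

def sample_weights(points, t):
    """Pairwise weights under the ASSUMED locally scaled Gaussian form."""
    points = np.asarray(points, dtype=float)
    t = np.asarray(t, dtype=float)
    diff = points[:, None, :] - points[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=2)
    weights = np.exp(-sq_dist / (t[:, None] * t[None, :]))
    np.fill_diagonal(weights, 0.0)  # weights are defined between distinct samples
    return weights
```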

7. The method of claim 4, characterized in that determining the updated coordinates of each sample based on the sample weights and the current coordinates of each sample comprises:

determining the updated coordinates of each sample by the following third formula, based on the sample weights and the current coordinates of each sample;

wherein Coordinate_new is the updated coordinate of any sample j in the uniform sample data set, W_i is the weight between the sample j and any other sample i other than the sample j in the uniform sample data set, and Coordinate_i is the coordinate of the sample i.
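With the third formula absent from this text, a natural reading of the symbols in claim 7 is a weighted average of the other samples' coordinates, Coordinate_new = Σ W_i·Coordinate_i / Σ W_i. The sketch below implements that reading as an assumption; it expects a weight matrix with a zero diagonal and at least one nonzero weight per row.

```python
import numpy as np

def update_coordinates(points, weights):
    """Move each sample to the weighted mean of the other samples (ASSUMED form)."""
    points = np.asarray(points, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Row j of `weights` holds the weights between sample j and every other sample i.
    return weights @ points / weights.sum(axis=1, keepdims=True)
```

Repeating this update pulls samples toward locally dense regions, which is consistent with the stated goal of updating positions before clustering.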

8. The method of claim 1, characterized in that performing clustering on the position-updated data set by the agglomerative clustering technique comprises:

when the position-updated data set includes samples having a mutual k-nearest-neighbor relationship, determining the cluster obtained by agglomeration as a target cluster among the multiple target clusters, and returning to the operation of determining the similarity between each target cluster and the other target clusters among the multiple target clusters, until the position-updated data set no longer includes samples having a mutual k-nearest-neighbor relationship.
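The merge loop of claim 8 might look roughly like the sketch below. Several steps of the claim are missing from this text, so the sketch rests on two assumptions: the most similar pair of target clusters is merged in each round, and the similarity between two clusters is zero exactly when no mutual k-nearest-neighbor relationship links them. The `similarity(labels, x, y)` callable is hypothetical.

```python
def agglomerate(labels, similarity):
    """Merge clusters until no pair has a positive similarity."""
    labels = list(labels)
    while True:
        ids = sorted(set(labels))
        best, pair = 0.0, None
        for pos, x in enumerate(ids):
            for y in ids[pos + 1:]:
                score = similarity(labels, x, y)
                if score > best:
                    best, pair = score, (x, y)
        if pair is None:  # no clusters are linked any more
            return labels
        x, y = pair
        # ASSUMPTION: the most similar pair is merged; cluster y takes label x.
        labels = [x if lab == y else lab for lab in labels]
```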

9. The method of claim 8, characterized in that determining the similarity between each target cluster and the other target clusters among the multiple target clusters comprises:

determining the number of samples in each target cluster, and the number of samples having a mutual k-nearest-neighbor relationship between each target cluster and the other target clusters;

determining the similarity between each target cluster and the other target clusters among the multiple target clusters by the following fourth formula, based on the number of samples in each target cluster and the number of samples having a mutual k-nearest-neighbor relationship between each target cluster and the other target clusters;

wherein P_k(c_x, c_y) is the similarity between any target cluster x and any target cluster y, P_xy is the number of samples having a mutual k-nearest-neighbor relationship between the cluster x and the cluster y, P_yx is the number of samples having a mutual k-nearest-neighbor relationship between the cluster y and the cluster x, c_x is the number of samples in the cluster x, and c_y is the number of samples in the cluster y.
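The four quantities named in claim 9 can be computed as sketched below. The counting follows the claim, reading P_xy as the number of samples in cluster x that have a mutual k-nearest neighbor in cluster y (and P_yx conversely). The final combination (P_xy + P_yx) / (c_x + c_y) is only an assumed stand-in for the fourth formula, which is not present in this text; the function names are hypothetical.

```python
import numpy as np

def mutual_knn(points, k):
    """Boolean matrix: True where two samples are in each other's k-NN lists."""
    points = np.asarray(points, dtype=float)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    idx = np.argsort(dist, axis=1)[:, :k]
    is_nn = np.zeros(dist.shape, dtype=bool)
    np.put_along_axis(is_nn, idx, True, axis=1)
    return is_nn & is_nn.T

def cluster_similarity(mutual, labels, x, y):
    labels = np.asarray(labels)
    in_x, in_y = labels == x, labels == y
    block = mutual[np.ix_(in_x, in_y)]
    p_xy = int(block.any(axis=1).sum())  # samples of x with a mutual neighbor in y
    p_yx = int(block.any(axis=0).sum())  # samples of y with a mutual neighbor in x
    c_x, c_y = int(in_x.sum()), int(in_y.sum())
    # ASSUMPTION: placeholder combination; the patent's fourth formula may differ.
    return (p_xy + p_yx) / (c_x + c_y)
```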

10. The method of claim 1, 8 or 9, characterized in that the method further comprises:

determining the type of each remaining sample in the original sample data set after removal to be the type of the sample in the position-updated data set that is nearest to that remaining sample.
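The label propagation in claim 10 is a nearest-neighbor assignment, sketched below under the assumption that Euclidean distance is used. The names `label_remaining`, `remaining_points`, `clustered_points`, and `clustered_labels` are illustrative.

```python
import numpy as np

def label_remaining(remaining_points, clustered_points, clustered_labels):
    """Give each unsampled point the cluster type of its nearest clustered sample."""
    remaining_points = np.asarray(remaining_points, dtype=float)
    clustered_points = np.asarray(clustered_points, dtype=float)
    diff = remaining_points[:, None, :] - clustered_points[None, :, :]
    dist = np.linalg.norm(diff, axis=2)  # ASSUMPTION: Euclidean distance
    nearest = dist.argmin(axis=1)
    return np.asarray(clustered_labels)[nearest]
```

This step lets the full original data set receive cluster types even though only the uniform sample was clustered.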