CN110276401A

CN110276401A - Sample clustering method, apparatus, equipment and storage medium

Info

Publication number: CN110276401A
Application number: CN201910551643.8A
Authority: CN
Inventors: 熊凯
Original assignee: Guangzhou Shiyuan Electronics Technology Co Ltd
Current assignee: Guangzhou Shiyuan Electronics Technology Co Ltd
Priority date: 2019-06-24
Filing date: 2019-06-24
Publication date: 2019-09-24
Also published as: WO2020258772A1

Abstract

Embodiments of the invention disclose a sample clustering method, apparatus, device, and storage medium, relating to the field of data processing. The method comprises: determining, for each sample in a sample set, a corresponding first sample distance, the first sample distance being the distance between the sample and its S-th nearest neighbor sample; obtaining, from all first sample distances, the first sample distances within a set distance range; calculating a distance mean based on the first sample distances within the set distance range; determining all connected samples of each sample based on the K-nearest-neighbor sample set corresponding to each sample, where K > S and a sample and its connected sample are mutual neighbor samples and have a connection relationship; and clustering the samples in the sample set according to the connected samples, the distance mean, and the value S, where the distance mean is the scan radius and the value S is the minimum number of samples contained in a cluster. The method can solve the technical problem in the prior art that the DBSCAN algorithm cannot reasonably cluster a sample set of non-uniform density.

Description

Sample clustering method, apparatus, equipment and storage medium

Technical field

Embodiments of the present invention relate to the technical field of data processing, and in particular to a sample clustering method, apparatus, device, and storage medium.

Background

Cluster analysis refers to the process of grouping a set of physical or abstract objects into multiple classes composed of similar objects. Cluster analysis is now widely used in many fields, and with its wide application a variety of clustering algorithms have emerged, for example the K-MEANS, K-MEDOIDS, BIRCH, CURE, DBSCAN, and OPTICS algorithms. Among these, DBSCAN is a representative density-based clustering algorithm. It requires two parameters to be entered manually: the scan radius, denoted eps, and the minimum number of points, denoted minPts. Using these two parameters, it finds the maximal sets of density-connected objects in the sample set. In the course of implementing the invention, the inventor found the following defect in the prior art: when a sample set of non-uniform density is clustered with the DBSCAN algorithm, a small scan radius causes samples in sparse regions to be readily treated as noise points and rejected, whereas a large scan radius readily causes samples that are far apart to be grouped into one class. In either case, the accuracy of the sample clustering cannot be guaranteed.

In summary, how to reasonably cluster a sample set of non-uniform density under the DBSCAN algorithm has become a problem that urgently needs to be solved.

Summary of the invention

The present invention provides a sample clustering method, apparatus, device, and storage medium to solve the technical problem in the prior art that the DBSCAN algorithm cannot reasonably cluster a sample set of non-uniform density.

In a first aspect, an embodiment of the invention provides a sample clustering method, comprising:

determining, for each sample in a sample set, a corresponding first sample distance, the first sample distance being the distance between the sample and the S-th nearest neighbor sample of the sample;

obtaining, from all of the first sample distances, the first sample distances within a set distance range;

calculating a distance mean based on the first sample distances within the set distance range;

determining all connected samples of each sample based on the K-nearest-neighbor sample set corresponding to each sample, wherein K > S, and the sample and a connected sample of the sample are mutual neighbor samples and have a connection relationship;

clustering the samples in the sample set according to the connected samples, the distance mean, and the value S, wherein the distance mean is the scan radius and the value S is the minimum number of samples contained in a cluster.

Further, clustering the samples in the sample set according to the connected samples, the distance mean, and the value S comprises:

filtering all of the connected samples based on the distance mean, so as to filter out connected samples whose second sample distance is greater than the distance mean, the second sample distance being the distance between a sample and a connected sample of that sample;

clustering the samples in the sample set based on the value S and the connected samples remaining after filtering.
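By way of non-limiting illustration, the filtering of connected samples by the distance mean might be sketched as follows. The function name `filter_connections`, the representation of connected samples as one set of indices per sample, and the use of Euclidean distance are assumptions made for this sketch only and are not specified by the text.

```python
import math

def filter_connections(samples, connections, eps):
    """Drop connected samples whose distance to the sample exceeds eps.

    samples:     list of coordinate tuples (assumed representation)
    connections: connections[i] is the set of indices connected to sample i
    eps:         the distance mean, used here as the scan radius
    """
    return [
        {j for j in conn if math.dist(samples[i], samples[j]) <= eps}
        for i, conn in enumerate(connections)
    ]

# Hypothetical data: three samples on a line, all initially connected.
samples = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]
connections = [{1, 2}, {0, 2}, {0, 1}]
filtered = filter_connections(samples, connections, eps=2.0)
print(filtered)
```

In this hypothetical data the third sample lies farther than the scan radius from both of the others, so it loses all of its connections.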

Further, clustering the samples in the sample set based on the value S and the connected samples remaining after filtering comprises:

counting, in turn, the total number of connected samples of each sample;

taking each sample whose total number of connected samples is greater than the value S as a core sample;

selecting any one of the resulting core samples as the current sample;

visiting all connected samples of the current sample;

taking each visited connected sample as a vertex and visiting all connected samples corresponding to that vertex;

repeating the operation of taking each visited connected sample as a vertex and visiting all connected samples corresponding to that vertex, until no new connected sample can be visited;

updating the current sample to any core sample that has not yet been visited, and returning to the operation of visiting all connected samples of the current sample, until all core samples have been visited;

grouping the current sample and the connected samples visited starting from the current sample into one cluster.

Further, determining all connected samples of each sample based on the K-nearest-neighbor sample set corresponding to each sample comprises:

obtaining the K-nearest-neighbor sample set corresponding to each sample;

constructing an adjacency matrix from all of the K-nearest-neighbor sample sets, each element of the adjacency matrix representing the neighbor relationship between the two corresponding samples;

examining the non-zero elements of the adjacency matrix to determine all connected samples of each sample.

Further, examining the non-zero elements of the adjacency matrix to determine all connected samples of each sample comprises:

obtaining, in the adjacency matrix, element groups at symmetric positions, each element group comprising a first element at row i, column j and a second element at row j, column i;

if at least one of the first element and the second element is a zero element, setting both the first element and the second element to zero;

updating the adjacency matrix after all element groups of the adjacency matrix have been traversed;

identifying the non-zero elements of the updated adjacency matrix, and determining that the two samples corresponding to each non-zero element are mutual neighbor samples and have a connection relationship;

obtaining all connected samples of each sample based on the mutual neighbor samples.

Further, obtaining, from all of the first sample distances, the first sample distances within the set distance range comprises:

constructing a frequency distribution histogram based on all of the first sample distances;

examining the frequency of each bin of the frequency distribution histogram to determine the set distance range;

obtaining the first sample distances within the set distance range.

Further, examining the frequency of each bin of the frequency distribution histogram to determine the set distance range comprises:

obtaining the bin with the highest frequency in the frequency distribution histogram;

calculating the frequency drop between adjacent rear bins, a rear bin being a bin located behind the highest-frequency bin in the frequency distribution histogram;

identifying the pair of adjacent rear bins with the largest frequency drop, and selecting the rearward bin of that pair;

taking the first sample distance corresponding to the highest-frequency bin and the first sample distance corresponding to the selected rearward bin as the distance thresholds of the set distance range.

Further, calculating the distance mean based on the first sample distances within the set distance range comprises:

obtaining the number of samples whose first sample distance is within the set distance range;

summing the first sample distances within the set distance range to obtain a total sample distance;

taking the quotient of the total sample distance and the number of samples as the distance mean.

Further, before determining the first sample distance corresponding to each sample in the sample set, the method further comprises:

constructing a K-nearest-neighbor graph for each sample in the sample set, the weight of each edge in the K-nearest-neighbor graph being the distance between the corresponding samples.

In a second aspect, an embodiment of the invention further provides a sample clustering apparatus, comprising:

a distance statistics module, configured to determine, for each sample in a sample set, a corresponding first sample distance, the first sample distance being the distance between the sample and the S-th nearest neighbor sample of the sample;

a distance obtaining module, configured to obtain, from all of the first sample distances, the first sample distances within a set distance range;

a mean calculation module, configured to calculate a distance mean based on the first sample distances within the set distance range;

a connection determining module, configured to determine all connected samples of each sample based on the K-nearest-neighbor sample set corresponding to each sample, wherein K > S, and the sample and a connected sample of the sample are mutual neighbor samples and have a connection relationship;

a sample clustering module, configured to cluster the samples in the sample set according to the connected samples, the distance mean, and the value S, wherein the distance mean is the scan radius and the value S is the minimum number of samples contained in a cluster.

In a third aspect, an embodiment of the invention further provides a sample clustering device, comprising:

one or more processors;

a memory, configured to store one or more programs;

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the sample clustering method according to the first aspect.

In a fourth aspect, an embodiment of the invention further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the sample clustering method according to the first aspect.

With the above sample clustering method, apparatus, device, and storage medium, the first sample distance between each sample in the sample set and its S-th nearest neighbor sample is determined, and a distance mean is obtained from the first sample distances. In addition, the connected samples of each sample are determined from the K-nearest-neighbor sample set (K > S) of each sample, a connected sample and the sample being mutual neighbor samples with a connection relationship. The samples having connection relationships are then clustered based on the minimum number of samples contained in a cluster (the value S) and the scan radius (the distance mean). This technical means solves the prior-art problem that the DBSCAN algorithm cannot reasonably cluster a sample set of non-uniform density. A reasonable scan radius is determined from the first sample distances, and mutual neighbor samples are then clustered based on that scan radius, which ensures that the clustering is reasonable. When the samples in the sample set are distributed with non-uniform density, relying on mutual neighbor samples avoids grouping sparsely distributed samples and densely distributed samples into one cluster. Moreover, because the distance mean is determined from a frequency distribution histogram, no user input is required, which reduces the workload of manual parameter tuning.

Brief description of the drawings

Fig. 1 is a schematic diagram of a sample set distribution provided by Embodiment One of the present invention;

Fig. 2 is a flow chart of a sample clustering method provided by Embodiment One of the present invention;

Fig. 3 is a schematic diagram of another sample set distribution provided by Embodiment One of the present invention;

Fig. 4 is a flow chart of a sample clustering method provided by Embodiment Two of the present invention;

Fig. 5 is a K-nearest-neighbor graph provided by Embodiment Two of the present invention;

Fig. 6 is a frequency distribution histogram provided by Embodiment Two of the present invention;

Fig. 7 is a schematic diagram of an adjacency matrix provided by Embodiment Two of the present invention;

Fig. 8 is a schematic diagram of another adjacency matrix provided by Embodiment Two of the present invention;

Fig. 9 is a structural schematic diagram of a sample clustering apparatus provided by Embodiment Three of the present invention;

Fig. 10 is a structural schematic diagram of a sample clustering device provided by Embodiment Four of the present invention.

Detailed description

The present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the invention rather than the entire structure.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a representative density-based clustering algorithm. In general, DBSCAN requires two parameters to be entered manually: eps and minPts. For an object (that is, a sample) in the sample set, if the value of eps is E, the region within scan radius E of the object is called the E-neighborhood of the object. If the number of samples in the E-neighborhood of the object is greater than or equal to minPts, the object is called a core object. For samples P and Q, if Q lies in the E-neighborhood of P and P is a core object, then Q is directly density-reachable from P. For samples P_1, P_2, ..., P_n, if each sample P_i is directly density-reachable from sample P_(i-1), then P_n is density-reachable from P_1. If sample P is density-reachable from a sample O and sample Q is also density-reachable from O, then P and Q are density-connected. The goal of DBSCAN is to find the maximal sets of density-connected objects.
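For illustration only, the E-neighborhood and core-object definitions above can be sketched in a few lines. The function names, the Euclidean metric, and the convention that a sample is not counted in its own neighborhood are assumptions of this sketch; DBSCAN implementations differ on the last point.

```python
import math

def eps_neighborhood(points, index, eps):
    """Indices of all other points within distance eps of points[index]."""
    p = points[index]
    return [j for j, q in enumerate(points)
            if j != index and math.dist(p, q) <= eps]

def is_core_object(points, index, eps, min_pts):
    """Core object: the E-neighborhood holds at least min_pts samples.

    Assumption: the point itself is not counted.
    """
    return len(eps_neighborhood(points, index, eps)) >= min_pts

# Hypothetical data: three nearby points and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
dense_is_core = is_core_object(points, 0, eps=0.2, min_pts=2)
isolated_is_core = is_core_object(points, 3, eps=0.2, min_pts=2)
print(dense_is_core, isolated_is_core)
```

With a single fixed eps, the isolated point can never become a core object, which is the difficulty that the following paragraphs describe for sample sets of non-uniform density.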

Consider a sample set of non-uniform density such as the one shown in Fig. 1, which is a schematic diagram of a sample set distribution provided by Embodiment One of the present invention. With reference to Fig. 1, the samples in the left half are relatively dense and the samples in the right half are relatively sparse. If eps is set to a small value, for example the radius of circle 11 in Fig. 1, a sample in the right half finds no other samples within its scan and is therefore treated as a noise point and filtered out. As a result, most of the samples in the right half are filtered out, which harms clustering accuracy. If eps is set to a large value, for example the radius of circle 12 in Fig. 1, the scan may group samples from the left half and the right half into one class, or may group a large number of left-half samples into one class, which likewise harms clustering accuracy.

In summary, embodiments of the present invention provide a sample clustering method to solve the problem that a sample set of non-uniform density cannot be reasonably clustered.

Embodiment one

Fig. 2 is a flow chart of a sample clustering method provided by Embodiment One of the present invention. The sample clustering method provided in this embodiment may be executed by a sample clustering device, which may be implemented in software and/or hardware and may consist of one physical entity or of two or more physical entities. For example, the sample clustering device may be a smart device with data computation and analysis capability, such as a computer, mobile phone, tablet, or interactive smart panel.

Specifically, with reference to Fig. 2, the sample clustering method includes:

Step 110: determine the first sample distance corresponding to each sample in the sample set, the first sample distance being the distance between the sample and the S-th nearest neighbor sample of the sample.

Illustratively, the sample set contains multiple samples, all of the same data type. The data type of the samples may be set according to the actual situation and is not limited by this embodiment. The embodiment describes the sample clustering method by taking the clustering of the samples in the sample set as an example. Optionally, the way the sample set is obtained is not limited by this embodiment: it may be data collected by the sample clustering device itself, data entered by a user, or data obtained by processing specific data. In general, each sample in the sample set represents one data feature. For example, each sample in the sample set represents position data of the same user in a different time period of each day within a set period.

Typically, taking Fig. 1 as an example, the samples in the sample set are scattered at different positions in a feature space. In general, the position of a sample is related to the feature the sample represents: the more similar the features, the closer the samples. Optionally, a feature placement rule is preset and sample positions are determined according to that rule. The specific content of the feature placement rule may be set according to the actual situation. For example, for position data, longitude and latitude are laid out in the feature space, and the position of each sample in the feature space is then determined from the longitude and latitude contained in that sample's position data.

Further, after obtaining the sample set, the sample clustering device can calculate the distance between any two samples in the sample set. The way the distance is calculated is not limited by this embodiment; for example, the distance between samples may be determined using Euclidean distance, Minkowski distance, Manhattan distance, or the like. In general, the smaller the distance between two samples, the more similar the two samples are.
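As a brief illustration of the distance measures named above, the Minkowski distance of order p covers both of the other two: p = 1 gives the Manhattan distance and p = 2 gives the Euclidean distance. The function name `minkowski_distance` and the tuple representation of samples are choices made for this sketch.

```python
def minkowski_distance(a, b, p=2):
    """Minkowski distance of order p between two equal-length vectors.

    p=1 yields the Manhattan distance, p=2 the Euclidean distance.
    """
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

a, b = (0.0, 0.0), (3.0, 4.0)
euclidean = minkowski_distance(a, b, p=2)
manhattan = minkowski_distance(a, b, p=1)
print(euclidean, manhattan)
```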

Illustratively, the S-th nearest neighbor sample of a sample is the sample that is S-th closest to it. For any sample, the sample clustering device can calculate the distance between that sample and every other sample and, from these distances, determine the sample that is S-th closest, which is denoted the S-th nearest neighbor sample. S is a positive integer whose specific value may be set according to the actual situation. Further, the value S indicates the minimum number of samples contained in a cluster; that is, clustering the samples yields multiple clusters, each of which contains at least S samples. In general, a sample has exactly one S-th nearest neighbor sample. In some cases there may be several, in which case any one of them may be selected.

Optionally, before the S-th nearest neighbor samples are determined, a K-nearest-neighbor graph is drawn for each sample, where K is a positive integer greater than S whose specific value may generally be set according to the actual situation. In the K-nearest-neighbor graph of a given sample, that sample is the vertex, and the graph contains the K neighbor samples closest to it. The sample is connected to each of its K neighbor samples by a line, and the weight of each line is the distance between the two samples it joins. For example, when K equals 8, the 8 samples closest to the sample are obtained and each is connected to it by a line; the samples at the two ends of a line can then be regarded as having a connection relationship, and the weight of the line is the distance between the two samples. Once the K-nearest-neighbor graph has been determined, the S-th nearest neighbor sample of the sample and the corresponding distance can be obtained from the graph.

Further, the distance between a sample and its S-th nearest neighbor sample is denoted the first sample distance.
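A minimal sketch of the first sample distance is given below, assuming Euclidean distance and a brute-force search over all other samples. The function name `first_sample_distances` is hypothetical, and the sketch assumes the sample set contains more than S samples.

```python
import math

def first_sample_distances(samples, s):
    """For each sample, the distance to its S-th nearest neighbor sample."""
    result = []
    for i, p in enumerate(samples):
        # Distances from sample i to every other sample, ascending.
        dists = sorted(math.dist(p, q)
                       for j, q in enumerate(samples) if j != i)
        result.append(dists[s - 1])  # S is 1-based
    return result

# Hypothetical one-dimensional layout embedded in the plane.
samples = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (7.0, 0.0)]
d = first_sample_distances(samples, s=2)
print(d)
```

The last sample sits in the sparsest part of this layout and accordingly has the largest first sample distance.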

Step 120: from all of the first sample distances, obtain the first sample distances within a set distance range.

Specifically, the first sample distances obtained within the set distance range serve as the reference data for calculating the scan radius. For a sample set of non-uniform sample density, the first sample distance is usually large in regions where samples are sparse and usually small in regions where samples are dense. To ensure the accuracy of the scan radius, a set distance range is therefore used, and only the first sample distances within that range are used to obtain the scan radius. In general, the first sample distances within the set distance range are the representative data among all first sample distances.

Illustratively, the number of samples corresponding to each first sample distance is counted. For instance, a sample count of 50 indicates that 50 samples have the same first sample distance. The set distance range is then determined from the sample counts. For example, a frequency distribution histogram is constructed from the sample counts, where the number of bins may be determined in light of the total number of samples in the sample set. The horizontal axis of the histogram represents the first sample distance, and the vertical axis represents the number of samples having that first sample distance. The highest-frequency bin of the histogram, that is, the bin with the largest sample count, is obtained. Then, among the rear bins of the highest-frequency bin, the difference in sample count between every two adjacent bins is calculated, and the two adjacent bins with the largest difference are selected. Here, a rear bin is a bin located behind the highest-frequency bin along the horizontal axis. The rearward bin of the two selected bins is then chosen, and the two first sample distances corresponding to that rearward bin and to the highest-frequency bin are taken as the two distance thresholds of the set distance range. Alternatively, the rear bin located a set number of positions behind the highest-frequency bin is obtained, and the two first sample distances corresponding to that rear bin and to the highest-frequency bin are taken as the two distance thresholds of the set distance range. As another example, after the number of samples corresponding to each first sample distance has been counted, the set distance range is determined manually from the sample counts.

This embodiment is described using the case in which the two first sample distances corresponding to the rear bin with the largest difference and to the highest-frequency bin serve as the two distance thresholds of the set distance range. The first sample distance corresponding to the highest-frequency bin is relatively small, so it serves as the lower distance threshold of the set distance range, which ensures that a sufficient number of samples with similar features are scanned during clustering. The first sample distance corresponding to the rear bin with the largest difference is relatively large, so it serves as the upper distance threshold. In general, the rear bin with the largest difference is the bin at which the sample count falls most sharply; that is, starting from the first sample distance corresponding to that bin, the sample count becomes smaller and smaller and, correspondingly, the feature difference between a sample and its S-th nearest neighbor sample becomes larger and larger. Taking the first sample distance corresponding to that bin as the upper threshold therefore allows samples with large differences to be ignored when the scan radius is calculated, which in turn ensures clustering accuracy.
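One possible reading of the histogram procedure is sketched below. Several details are assumptions of this sketch rather than statements of the text: equal-width bins, the lower edge of a bin taken as "the first sample distance corresponding to" that bin, the frequency drop computed only between pairs of bins that both lie behind the highest-frequency bin, and the fallback used when fewer than two such bins exist. The function name `set_distance_range` is hypothetical.

```python
def set_distance_range(distances, num_bins):
    """Return (low, high) thresholds taken from a frequency histogram."""
    lo, hi = min(distances), max(distances)
    if hi == lo:  # all distances identical: the range is a single value
        return lo, hi
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for d in distances:
        idx = min(int((d - lo) / width), num_bins - 1)  # clamp the maximum
        counts[idx] += 1

    peak = counts.index(max(counts))  # highest-frequency bin
    # Largest frequency drop between adjacent bins behind the peak.
    best_rear, best_drop = None, None
    for i in range(peak + 1, num_bins - 1):
        drop = counts[i] - counts[i + 1]
        if best_drop is None or drop > best_drop:
            best_rear, best_drop = i + 1, drop
    if best_rear is None:  # assumed fallback: fewer than two rear bins
        return lo + peak * width, hi
    # Assumption: a bin's lower edge stands for its first sample distance.
    return lo + peak * width, lo + best_rear * width

# Hypothetical first sample distances; bin counts are [2, 6, 5, 1, 1].
distances = [0.0, 0.5] + [1.2] * 6 + [2.3] * 5 + [3.4, 5.0]
low, high = set_distance_range(distances, num_bins=5)
print(low, high)
```

In this hypothetical data the count falls most sharply between the third and fourth bins, so the lower edge of the fourth bin becomes the upper threshold and the sparse tail is excluded from the later mean.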

Step 130: calculate a distance mean based on the first sample distances within the set distance range.

Specifically, all samples within the set distance range and their corresponding first sample distances are collected. All first sample distances within the set distance range are then summed, and the result is denoted the total sample distance. The total number of samples within the set distance range is also counted, the total sample distance is divided by that number, and the resulting quotient is denoted the distance mean. This distance mean serves as the scan radius.
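The calculation just described amounts to an arithmetic mean over the retained distances. In the sketch below, the function name `distance_mean` and the treatment of both thresholds as inclusive are assumptions.

```python
def distance_mean(distances, low, high):
    """Mean of the first sample distances lying within [low, high]."""
    in_range = [d for d in distances if low <= d <= high]
    if not in_range:
        raise ValueError("no first sample distance lies in the set range")
    total = sum(in_range)          # total sample distance
    return total / len(in_range)   # quotient used as the scan radius

# Hypothetical distances; 0.5 and 9.0 fall outside the range.
eps = distance_mean([0.5, 1.0, 2.0, 3.0, 9.0], low=1.0, high=3.0)
print(eps)
```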

Step 140: determine all connected samples of each sample based on the K-nearest-neighbor sample set corresponding to each sample, where K > S, and a sample and its connected sample are mutual neighbor samples and have a connection relationship.

Specifically, after the distance between each sample and every other sample has been calculated, the K samples closest to the sample are obtained to form the K-nearest-neighbor sample set of that sample. Each sample in the K-nearest-neighbor sample set may be denoted a neighbor sample; that is, a neighbor relationship exists between the sample and each neighbor sample. If the K-nearest-neighbor graph of the sample has been constructed in advance, this step can directly take the samples in the K-nearest-neighbor graph as the K-nearest-neighbor sample set.

Further, a connected sample is a sample that has a connection relationship with the current sample. In general, two samples with a connection relationship can be denoted mutual neighbor samples, which means that the K-nearest-neighbor sample set of each of the two samples contains the other sample. Specifically, the K-nearest-neighbor sample set of the current sample is obtained, and then the K-nearest-neighbor sample set of each neighbor sample in that set is obtained. It is determined whether the current sample is contained in the K-nearest-neighbor sample set of each neighbor sample. If the current sample is contained in the K-nearest-neighbor sample set of a given neighbor sample, that neighbor sample and the current sample are determined to be mutual neighbor samples, the connection relationship between them is saved, and the neighbor sample is denoted a connected sample of the current sample. The connection relationship may be saved by drawing a line between the two samples in the sample set. All connected samples of every sample can be determined by these steps. Correspondingly, for samples in a K-nearest-neighbor sample set that are not connected, no connection relationship need be saved.
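The mutual-neighbor test described above can be sketched as follows, assuming Euclidean distance and a brute-force K-nearest-neighbor search. The function names `knn_sets` and `connected_samples` are hypothetical.

```python
import math

def knn_sets(samples, k):
    """K-nearest-neighbor sample set (as indices) for every sample."""
    sets = []
    for i, p in enumerate(samples):
        order = sorted((j for j in range(len(samples)) if j != i),
                       key=lambda j: math.dist(p, samples[j]))
        sets.append(set(order[:k]))
    return sets

def connected_samples(samples, k):
    """Keep neighbor j of sample i only if i is also a neighbor of j."""
    knn = knn_sets(samples, k)
    return [{j for j in knn[i] if i in knn[j]}
            for i in range(len(samples))]

# Hypothetical data: three close samples and one distant sample.
samples = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
links = connected_samples(samples, k=2)
print(links)
```

The distant sample still has a K-nearest-neighbor sample set of its own, but none of those neighbors lists it in return, so it ends up with no connected samples. This mirrors the relationship between sample A and sample B discussed for Fig. 3.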

In addition, connected samples may be determined by means of an adjacency matrix. Specifically, an adjacency matrix is constructed from the K-nearest-neighbor sample sets, where each element of the adjacency matrix indicates whether the two corresponding samples have a neighbor relationship: if they do, the element is non-zero; if not, the element is zero. It is then determined whether both elements of each pair at symmetric positions in the adjacency matrix are non-zero. If so, the K-nearest-neighbor sample set of each of the two samples contains the other sample, that is, the two samples are mutual neighbor samples and have a connection relationship. After all symmetric element pairs of the adjacency matrix have been traversed, the connected samples of each sample can be determined.
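The adjacency-matrix variant might look like the following sketch, which uses plain nested lists and the values 0 and 1 as matrix elements. The function name `mutual_adjacency` and the input format (one set of neighbor indices per sample) are assumptions.

```python
def mutual_adjacency(neighbor_sets):
    """Build the K-nearest-neighbor adjacency matrix, then zero every
    symmetric pair in which at least one of the two elements is zero."""
    n = len(neighbor_sets)
    adj = [[1 if j in neighbor_sets[i] else 0 for j in range(n)]
           for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if adj[i][j] == 0 or adj[j][i] == 0:
                adj[i][j] = adj[j][i] = 0
    return adj

# Hypothetical neighbor sets: sample 3 lists samples 1 and 2 as
# neighbors, but neither of them lists sample 3.
neighbor_sets = [{1, 2}, {0, 2}, {0, 1}, {1, 2}]
adj = mutual_adjacency(neighbor_sets)
connections = [[j for j, v in enumerate(row) if v] for row in adj]
print(connections)
```

After the update the matrix is symmetric, so the non-zero entries of row i directly list the connected samples of sample i.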

For example, Fig. 3 is a schematic diagram of another sample set provided by Embodiment One of the present invention. With reference to Fig. 3, K is 5. For sample A, the K-nearest-neighbor sample set consists of the 5 samples connected to sample A by solid lines, where the lines to two of those samples partially overlap. For sample B, the K-nearest-neighbor sample set consists of 4 samples connected to sample B by solid lines, together with sample A. Although the K-nearest-neighbor sample set of sample B contains sample A, the K-nearest-neighbor sample set of sample A does not contain sample B. Sample B and sample A are therefore not mutual neighbor samples, and no connection relationship between sample A and sample B is saved. After all samples have been traversed in this way, the connected samples of each sample are obtained.

It should be noted that this embodiment does not limit the order in which step 140 and steps 110 to 130 are executed. In practice, step 140 may be executed first and steps 110 to 130 afterwards, or step 140 may be executed concurrently with steps 110 to 130.

Step 150: cluster the samples in the sample set according to the connected samples, the distance mean, and the value S, where the distance mean is the scan radius and the value S is the minimum number of samples contained in a cluster.

Here, the minimum number of samples contained in a cluster is minPts. Illustratively, DBSCAN clustering is performed on the samples in the sample set with the distance mean as the scan radius and the value S as minPts. Specifically, a sample is selected as the current sample, and the region around the current sample is scanned with the distance mean as the scan radius. During the scan, only the connected samples of the current sample are obtained. If the number of connected samples is greater than S, the current sample and the connected samples obtained by the scan are grouped into one cluster.

Alternatively, the total number of connected samples of each sample is determined, and a sample whose total is greater than S is taken as a core sample. After all samples in the sample set have been traversed, all core samples have been found. Any one core sample is then selected as the current sample, and all connected samples of the current sample are obtained. Each connected sample obtained is then taken as a vertex and all of its connected samples are obtained, and those connected samples are in turn taken as vertices and their connected samples obtained. This operation is repeated until no new connected sample can be reached, at which point all connected samples obtained and the current sample are grouped into one cluster. Next, a core sample that has not yet been processed is obtained, and connected samples are obtained according to the same steps to form a new cluster. Clustering ends once every core sample has been processed. It will be understood that if a connected sample obtained during clustering is itself a core sample, that core sample is not processed again later in the procedure.
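The core-sample traversal described in the preceding paragraph can be sketched as a breadth-first search over the connection relationships. The sketch assumes the connected samples are given as one set of indices per sample, uses the label -1 for samples that end up in no cluster, and follows the text in expanding from every visited connected sample whether or not it is itself a core sample. The function name `cluster_by_connections` is hypothetical.

```python
from collections import deque

def cluster_by_connections(connections, s):
    """Assign a cluster label to each sample; -1 means unclustered."""
    labels = [-1] * len(connections)
    # Core samples: more than S connected samples.
    core = [i for i, conn in enumerate(connections) if len(conn) > s]
    next_label = 0
    for start in core:
        if labels[start] != -1:  # already reached from an earlier core
            continue
        labels[start] = next_label
        queue = deque([start])
        while queue:  # expand until no new connected sample is found
            current = queue.popleft()
            for neighbor in connections[current]:
                if labels[neighbor] == -1:
                    labels[neighbor] = next_label
                    queue.append(neighbor)
        next_label += 1
    return labels

# Hypothetical connections: two triangles and one isolated sample.
connections = [{1, 2}, {0, 2}, {0, 1},
               {4, 5}, {3, 5}, {3, 4},
               set()]
labels = cluster_by_connections(connections, s=1)
print(labels)
```

Because the connections are assumed to be filtered and mutual, the traversal never crosses from one triangle to the other, and the isolated sample is left without a cluster.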

For the sample set of Fig. 3, sample A and sample B are therefore not clustered into the same cluster during clustering. Sample A is clustered with other samples that have similar features and lie in the denser region, and sample B is clustered with other samples that have similar features and lie in the sparser region, which ensures that the clustering is reasonable.

In summary, the first sample distance between each sample in the sample set and its S-nearest-neighbor sample is counted, and the distance mean is obtained from the first sample distances. Meanwhile, the connection samples of each sample are determined from the K-nearest-neighbor sample set (K > S) of each sample, where a sample and its connection sample are mutual neighbor samples and have a connection relationship. The samples having connection relationships are then clustered based on the minimum number of samples contained in a cluster (the S value) and the sweep radius (the distance mean). This technical means solves the technical problem that the DBSCAN algorithm in the prior art cannot reasonably cluster a sample set of uneven density. A reasonable sweep radius is determined from the first sample distances, and mutual neighbor samples are then clustered based on that sweep radius, which ensures that the clustering is reasonable. When the sample distribution density in the sample set is uneven, relying on mutual neighbor samples avoids clustering sparsely distributed samples and densely distributed samples into one cluster, and thus ensures clustering accuracy.

Embodiment two

Fig. 4 is a flowchart of a sample clustering method provided by Embodiment 2 of the present invention. This embodiment is a concrete implementation based on the above embodiment. With reference to Fig. 4, the sample clustering method provided by this embodiment includes:

Step 201, constructing a K-nearest-neighbor graph of each sample in the sample set, where the weight of each edge in the K-nearest-neighbor graph is the distance between the corresponding samples.

Specifically, after the distance between each sample and every other sample has been calculated, the K-nearest-neighbor graph of a given sample is drawn by taking that sample as the vertex and obtaining, according to the distances between samples, the K samples nearest to it together with the corresponding distances. The lines between the vertex and the K samples are then drawn, and the distance between the vertex and each corresponding sample is used as the weight of that line. For example, Fig. 5 is a K-nearest-neighbor graph provided by Embodiment 2 of the present invention. With reference to Fig. 5, which is the K-nearest-neighbor graph of sample C in Fig. 3, K is 6. The K samples nearest to sample C and their distances are obtained according to the distances between sample C and the other samples, the lines between sample C and the K samples are drawn, and each distance is used as the weight of the corresponding line. It should be noted that the line weights shown in Fig. 5 serve only to describe the K-nearest-neighbor graph and do not limit the distance or the way the distance is calculated. In practical applications, the weights need not be displayed in the K-nearest-neighbor graph. In general, the line relationships between the samples in the sample set, that is, the neighbor relationships, can be obtained from the K-nearest-neighbor graphs.
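As a minimal sketch of step 201, the K-nearest-neighbor graph can be held as a mapping from each sample to its K weighted edges. The sketch assumes two-dimensional samples and Euclidean distance, which the embodiment does not prescribe; the function name knn_graph and the dictionary layout are illustrative only.

```python
import math

def knn_graph(samples, k):
    """Return {i: [(j, distance), ...]} listing the k nearest other samples of sample i."""
    graph = {}
    for i, p in enumerate(samples):
        # Distance from sample i (the vertex) to every other sample.
        edges = [(j, math.dist(p, q)) for j, q in enumerate(samples) if j != i]
        # Keep the k closest samples; each pair is one edge and its weight.
        edges.sort(key=lambda e: e[1])
        graph[i] = edges[:k]
    return graph

# Hypothetical sample set with four two-dimensional samples.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
graph = knn_graph(points, k=2)
print(graph[0])  # edges of sample 0, nearest first
```

The brute-force distance calculation is quadratic in the number of samples, so a spatial index may be preferable for large sample sets.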

It should be noted that the benefit of constructing the K-nearest-neighbor graph is that it makes the subsequent determination of the first sample distances, the connection samples, and so on more convenient, which simplifies subsequent calculation.

Step 202, counting the first sample distance corresponding to each sample in the sample set, where the first sample distance is the distance between a sample and the S-nearest-neighbor sample of that sample.

Since S < K, the S-nearest-neighbor sample of each sample can be obtained from the K-nearest-neighbor graph of that sample, and the first sample distance is determined from the weight of the line between the sample and its S-nearest-neighbor sample.
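The lookup in step 202 can be sketched as follows. The sketch assumes that the S-nearest-neighbor sample is the S-th closest neighbor, so that each sample yields exactly one first sample distance, and that the edges of each K-nearest-neighbor graph are already sorted by ascending distance. The function name and data layout are hypothetical.

```python
def first_sample_distances(graph, s):
    """graph maps each sample to [(neighbor, distance), ...] sorted by ascending distance."""
    # Assumption: the S-nearest-neighbor sample is the s-th closest neighbor (1-based),
    # so every sample contributes exactly one first sample distance.
    return {i: edges[s - 1][1] for i, edges in graph.items()}

# Hypothetical K-nearest-neighbor graphs with K = 3.
graph = {
    0: [(1, 1.0), (2, 2.0), (3, 7.0)],
    1: [(0, 1.0), (2, 2.5), (3, 6.5)],
}
first = first_sample_distances(graph, s=2)
```

Because S < K, the index s - 1 always falls inside the K stored edges.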

Step 203, constructing a frequency distribution histogram based on all first sample distances.

Specifically, the number of occurrences of each first sample distance is counted and used as the frequency of that first sample distance, and the frequency distribution histogram is then constructed from the frequencies. In practical applications, some first sample distances have very close values, for example two first sample distances of 0.585 and 0.593. If each first sample distance were given its own frequency, the complexity of the statistics and of the frequency distribution histogram would increase. Therefore, in the embodiment, the first sample distances are grouped in advance, the number of first sample distances falling in each group is counted, and that number is used as the frequency of the group. For example, if the group for distances 0.55-0.65 contains 1100 first sample distances, the frequency of that group is 1100. It should be noted that the embodiment does not limit the grouping rule. Further, when the frequency distribution histogram is established, the abscissa represents the distance (the first sample distance in this example) and the ordinate represents the number of samples (the frequency). Each distance group can be denoted as a bin. In general, the number of bins is related to the total number of samples in the sample set. For example, when the total number of samples in the sample set is 5000 or fewer, the number of bins can be set to 10; once the total number of samples exceeds 5000, the number of bins is increased by 1 for every additional 500 samples. Optionally, since the frequency in some groups is very small and has little reference value in subsequent calculation, the frequency of such distance groups can be ignored depending on the actual situation.
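The grouping described above can be sketched as follows. Equal-width groups are an assumption, since the embodiment does not limit the grouping rule, and the bin-count rule treats "every additional 500 samples" as whole blocks of 500. Both function names are illustrative.

```python
def bin_count(total_samples):
    """Example rule: 10 bins up to 5000 samples, then one more bin per additional 500."""
    if total_samples <= 5000:
        return 10
    return 10 + (total_samples - 5000) // 500

def frequency_histogram(distances, n_bins):
    """Return (edges, freqs) for n_bins equal-width distance groups."""
    lo, hi = min(distances), max(distances)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 when all distances are equal
    edges = [lo + b * width for b in range(n_bins + 1)]
    freqs = [0] * n_bins
    for d in distances:
        # The largest distance is placed in the last bin instead of overflowing.
        b = min(int((d - lo) / width), n_bins - 1)
        freqs[b] += 1
    return edges, freqs

# Hypothetical first sample distances.
edges, freqs = frequency_histogram([0.0, 1.0, 1.5, 2.0, 3.5, 4.0], n_bins=4)
```

Here edges[b] and edges[b + 1] bound the distance range of bin b, and freqs[b] is its frequency.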

For example, Fig. 6 is a frequency distribution histogram provided by Embodiment 2 of the present invention. With reference to Fig. 6, the abscissa of the frequency distribution histogram is the distance, the ordinate is the number of samples (the frequency), and the number of bins is 10. Fig. 6 shows the distance range and frequency corresponding to each bin, and the bins are arranged in order from 1 to 10.

The benefit of using a frequency distribution histogram is that it clearly shows the frequency distribution within each distance range and makes the differences in frequency between distance ranges easy to see.

Step 204, counting the frequency of each bin in the frequency distribution histogram to determine the set distance range.

In the embodiment, the set distance range is determined based on the frequencies of the bins. This step specifically includes step 2041 to step 2044:

Step 2041, obtaining the maximum-frequency bin in the frequency distribution histogram.

Specifically, the frequency corresponding to each distance range is counted, and the bin with the maximum frequency is obtained and denoted as the maximum-frequency bin. For example, with reference to Fig. 6, the maximum-frequency bin is determined from the ordinate to be the 5th bin.

Step 2042, calculating the frequency drop between adjacent rear bins, where a rear bin is a bin located behind the maximum-frequency bin in the frequency distribution histogram.

Illustratively, according to the order in which the bins are arranged, any bin located behind a given bin is denoted as a rear bin of that bin. In the embodiment, all rear bins of the maximum-frequency bin are obtained. Taking Fig. 6 as an example, the rear bins of the maximum-frequency bin are the 6th bin to the 10th bin.

Further, adjacent bins are two bins that are next to each other in order. For example, in Fig. 6, the 1st bin and the 2nd bin are adjacent bins, the 2nd bin and the 3rd bin are adjacent bins, and so on. In the embodiment, the adjacent bins among the rear bins are obtained, and the frequency difference between the two bins of each adjacent pair is calculated. The frequency difference is a positive number and is denoted as the frequency drop. The frequency difference can be calculated by subtracting the frequencies of the two adjacent bins. If the result is positive, it is used directly as the frequency drop; if the result is negative, its absolute value is used as the frequency drop.

Optionally, in the embodiment, when the frequency drops are counted, the frequency drop between the maximum-frequency bin and its adjacent rear bin can also be calculated, and this frequency drop is used together with the frequency drops between the adjacent rear bins as the calculation result of this step.

Step 2043, identifying the adjacent rear bins with the largest frequency drop, and selecting the bin located at the rear of that pair.

In general, the larger the frequency drop, the more the frequency of the rear bin of the adjacent pair differs from that of the front bin, which indicates that the number of samples corresponding to the rear bin is significantly lower. In the embodiment, after the frequency drops have been counted, the adjacent rear bins with the largest frequency drop are selected. In this pair, the number of samples corresponding to the rear bin is significantly lower, and the bins located behind that rear bin contain very few samples, are less representative, and contribute little to accuracy when the distance mean is calculated, so they can be ignored in the calculation. Accordingly, the embodiment selects the rear bin of the adjacent rear bins with the largest frequency drop as the calculation result of this step. For example, with reference to Fig. 6, the frequency drops between the 6th bin and the 7th bin, between the 7th bin and the 8th bin, between the 8th bin and the 9th bin, and between the 9th bin and the 10th bin are calculated separately. After the frequency drops have been calculated, the frequency drop between the 6th bin and the 7th bin is determined to be the largest, so the 7th bin, which is located at the rear, is selected.

Step 2044, taking the first sample distance corresponding to the maximum-frequency bin and the first sample distance corresponding to the selected rear bin as the distance thresholds of the set distance range.

Specifically, since each bin corresponds to a distance range, the set distance range can be determined by selecting the minimum value of each distance range as a distance threshold. For example, the minimum value of the distance range corresponding to the maximum-frequency bin is used as the lower distance threshold of the set distance range, and the minimum value of the distance range corresponding to the selected rear bin is used as the upper distance threshold. Alternatively, the maximum value of each distance range can be selected as a distance threshold. For example, the maximum value of the distance range corresponding to the maximum-frequency bin is used as the lower distance threshold of the set distance range, and the maximum value of the distance range corresponding to the selected rear bin is used as the upper distance threshold. Depending on the actual situation, the minimum value of the distance range corresponding to the maximum-frequency bin can also be used as the lower distance threshold, with the maximum value of the distance range corresponding to the selected rear bin as the upper distance threshold. For example, with reference to Fig. 6, the minimum value of the distance range of the 5th bin is used as the lower distance threshold of the set distance range, and the maximum value of the distance range of the 7th bin is used as the upper distance threshold. It is also possible to determine the middle distance value of each distance range and use the middle distance value as the distance threshold.

In general, the benefit of taking the first sample distance corresponding to the maximum-frequency bin and the first sample distance corresponding to the selected rear bin as the distance thresholds of the set distance range is that the distance mean calculated subsequently allows a large number of samples to be clustered while samples that are far away are filtered out, which ensures that the distance mean is reasonable.

It can be understood that step 2041 to step 2044 are only one optional way of determining the set distance range. In practical applications, the set distance range can also be determined from the frequency distribution histogram in other ways.
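Steps 2041 to 2044 can be sketched as follows. The sketch uses the threshold variant from the Fig. 6 example (minimum of the maximum-frequency bin, maximum of the selected rear bin). The bin edges and frequencies are hypothetical values, not the data of Fig. 6, and the function name and the include_peak_pair flag are illustrative.

```python
def set_distance_range(edges, freqs, include_peak_pair=False):
    """edges[b], edges[b + 1] bound bin b; freqs[b] is its frequency (0-based bins)."""
    peak = freqs.index(max(freqs))  # step 2041: maximum-frequency bin
    # Step 2042: frequency drop of each adjacent pair (b, b + 1) behind the peak,
    # optionally including the pair formed by the peak and its adjacent rear bin.
    start = peak if include_peak_pair else peak + 1
    drops = {b + 1: abs(freqs[b] - freqs[b + 1]) for b in range(start, len(freqs) - 1)}
    if not drops:  # no adjacent pair behind the peak; keep only the peak bin
        return edges[peak], edges[peak + 1]
    rear = max(drops, key=drops.get)  # step 2043: rear bin of the largest drop
    # Step 2044: lower threshold from the peak bin, upper threshold from the rear bin.
    return edges[peak], edges[rear + 1]

# Hypothetical 10-bin histogram whose largest drop is between the 6th and 7th bins.
edges = [float(i) for i in range(11)]
freqs = [50, 120, 300, 700, 1100, 900, 300, 200, 100, 50]
print(set_distance_range(edges, freqs))  # (4.0, 7.0)
```

With these values the 5th bin has the maximum frequency and the 7th bin is selected, mirroring the structure of the example in the text.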

Step 205, obtaining the first sample distances within the set distance range.

Step 206, obtaining the sample quantity of the first sample distances within the set distance range.

Specifically, since each sample corresponds to one first sample distance, the total number of first sample distances within the set distance range can be counted and used as the sample quantity. The sample quantity can also be determined from the frequency distribution histogram, in which case the frequencies of the bins from the maximum-frequency bin to the selected rear bin are added to obtain the sample quantity. For example, with reference to Fig. 6, adding the frequencies of the 5th bin, the 6th bin, and the 7th bin gives the sample quantity.

Step 207, adding the first sample distances within the set distance range to obtain the sample total distance.

Specifically, the first sample distances within the set distance range are added, and the result is denoted as the sample total distance. Alternatively, the sample total distance is obtained from the frequency distribution histogram, in which case the frequency of each relevant bin is multiplied by the middle distance value of its distance range, and the products are then added to obtain the sample total distance. For example, with reference to Fig. 6, the distance range of the 5th bin is obtained and its middle distance value is selected. The middle distance value of the 5th bin is multiplied by its frequency to obtain a first result. A second result for the 6th bin and a third result for the 7th bin are obtained in the same way, and the three results are added to obtain the sample total distance.

Step 208, taking the quotient of the sample total distance and the sample quantity as the distance mean.

Specifically, dividing the sample total distance by the sample quantity gives the average of the first sample distances within the set distance range. In the embodiment, this average distance is denoted as the distance mean, and the distance mean is set as the sweep radius. Compared with manually specifying the sweep radius in the existing DBSCAN algorithm, this embodiment adapts the sweep radius to the actual conditions of the sample set and determines it using statistical principles, which ensures that the sweep radius is reasonable.
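Steps 205 to 208 can be sketched in both variants described above, directly from the first sample distances and from the histogram using the middle distance value of each bin. Treating both thresholds as inclusive is an assumption, and the function names are illustrative.

```python
def distance_mean(first_distances, lower, upper):
    """Mean of the first sample distances that fall inside [lower, upper]."""
    in_range = [d for d in first_distances if lower <= d <= upper]  # step 205
    sample_quantity = len(in_range)                                  # step 206
    if sample_quantity == 0:
        raise ValueError("no first sample distance inside the set distance range")
    total_distance = sum(in_range)                                   # step 207
    return total_distance / sample_quantity                          # step 208

def distance_mean_from_histogram(edges, freqs, first_bin, last_bin):
    """Approximate the mean from bins first_bin..last_bin (0-based, inclusive)."""
    bins = range(first_bin, last_bin + 1)
    sample_quantity = sum(freqs[b] for b in bins)
    # Frequency of each bin times the middle distance value of its range.
    total_distance = sum(freqs[b] * (edges[b] + edges[b + 1]) / 2 for b in bins)
    return total_distance / sample_quantity

eps = distance_mean([0.5, 1.0, 2.0, 3.0, 9.0], lower=1.0, upper=3.0)
approx = distance_mean_from_histogram([0.0, 1.0, 2.0, 3.0], [2, 4, 2], 0, 2)
```

The histogram variant is cheaper but only approximate, because every distance in a bin is replaced by the middle value of that bin.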

Step 209, obtaining the K-nearest-neighbor sample set corresponding to each sample.

Specifically, based on the K-nearest-neighbor graph of each sample, the K neighbor samples of each sample are obtained and form its K-nearest-neighbor sample set. That is, all samples in a K-nearest-neighbor graph other than the vertex form the K-nearest-neighbor sample set of the vertex.

Step 210, constructing an adjacency matrix according to all K-nearest-neighbor sample sets, where each element in the adjacency matrix represents the neighbor relationship between the two corresponding samples.

Here, an adjacency matrix stores all samples of the sample set in a one-dimensional array and stores the data on the relationships between the samples in a two-dimensional array. Adjacency matrices can be divided into directed-graph adjacency matrices and undirected-graph adjacency matrices. The embodiment takes an undirected-graph adjacency matrix as an example. Specifically, the samples are arranged in order, and after the arrangement each sample corresponds to a number. The embodiment does not limit the arrangement rule. Further, the arranged samples are used as the horizontal and vertical axes of the matrix, and the element at the intersection of the horizontal and vertical axes indicates whether the two corresponding samples have a neighbor relationship. Specifically, if the K-nearest-neighbor sample set of a sample contains another sample, the sample and the other sample have a neighbor relationship, and the intersection element with the sample as the ordinate and the other sample as the abscissa is recorded as a nonzero element. The specific value of the nonzero element can be set according to the actual situation. The embodiment takes a nonzero element of 1 as an example. In practical applications, the nonzero element can also be the corresponding distance value or another value. Correspondingly, if the K-nearest-neighbor sample set of a sample does not contain another sample, the sample and the other sample do not have a neighbor relationship, and the intersection element with the sample as the ordinate and the other sample as the abscissa is recorded as a zero element, that is, 0. For example, Fig. 7 is a schematic diagram of an adjacency matrix provided by Embodiment 2 of the present invention. With reference to Fig. 7, there are currently 8 samples, which are numbered 1 to 8 in order. The 8 samples are then arranged along the horizontal and vertical axes, that is, the 8 sample numbers are used as the abscissa and the ordinate, and a two-dimensional matrix is constructed according to the neighbor relationships between the samples. The element in row i, column j of the two-dimensional matrix indicates whether the j-th sample is a neighbor sample of the i-th sample. For example, if the element in row 2, column 3 is 1, the K-nearest-neighbor sample set of the sample numbered 2 contains the sample numbered 3. As another example, if the element in row 7, column 1 is 0, the K-nearest-neighbor sample set of the sample numbered 7 does not contain the sample numbered 1. In general, once the K-nearest-neighbor sample sets have been determined, the elements of the adjacency matrix can be assigned according to the K-nearest-neighbor sample sets. The element in row i, column i indicates the connection relationship between a sample and itself, and in the embodiment this element is recorded as 1. It can be understood that the numbers 1 to 8 along the horizontal and vertical directions in Fig. 7 are sample numbers and are not counted as rows or columns.
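A minimal sketch of step 210 follows, with 1 as the nonzero element and 0-based sample indices instead of the 1-based numbering of Fig. 7. The sample sets shown are hypothetical, and the function name is illustrative.

```python
def adjacency_matrix(knn_sets):
    """knn_sets[i] is the set of sample indices in the K-nearest-neighbor set of sample i."""
    n = len(knn_sets)
    matrix = [[0] * n for _ in range(n)]
    for i, neighbors in enumerate(knn_sets):
        matrix[i][i] = 1      # relationship of a sample with itself is recorded as 1
        for j in neighbors:
            matrix[i][j] = 1  # row i, column j: sample j is a neighbor sample of sample i
    return matrix

# Hypothetical K-nearest-neighbor sample sets (K = 2) for four samples.
knn_sets = [{1, 2}, {0, 2}, {0, 1}, {1, 2}]
matrix = adjacency_matrix(knn_sets)
for row in matrix:
    print(row)
```

The matrix is generally not symmetric at this stage: sample 3 lists samples 1 and 2 as neighbors, but neither lists sample 3.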

Optionally, in practical applications the sample set contains many samples. In that case, one adjacency matrix can be constructed for each sample, with the sample and the samples in its K-nearest-neighbor sample set as the horizontal and vertical axes of the adjacency matrix, and the neighbor relationships between the sample and its K neighbor samples can be determined from that adjacency matrix. Alternatively, a single adjacency matrix is constructed for the whole sample set, from which the neighbor relationships between all samples in the sample set can be determined.

Step 211, counting the nonzero elements in the adjacency matrix to determine all connection samples of each sample.

Specifically, if the K-nearest-neighbor sample set of one sample contains another sample but the K-nearest-neighbor sample set of the other sample does not contain the first sample, the two samples are determined not to be mutual neighbor samples and can be ignored during clustering. It is therefore necessary to find all mutual neighbor samples and to determine the connection samples of each sample based on them. In the embodiment, mutual neighbor samples are determined from the nonzero elements in the adjacency matrix. For example, two samples are numbered 5 and 6 and are denoted as sample 5 and sample 6. In the adjacency matrix, the element in row 5, column 6 indicates whether the K-nearest-neighbor sample set of sample 5 contains sample 6, and the element in row 6, column 5 indicates whether the K-nearest-neighbor sample set of sample 6 contains sample 5. If both elements are nonzero, sample 5 and sample 6 contain each other, that is, they are mutual neighbor samples, and the connection relationship between sample 5 and sample 6 is saved: sample 5 is recorded as a connection sample of sample 6, and likewise sample 6 is recorded as a connection sample of sample 5. Counting each nonzero element in this way gives the connection samples of every sample.

Optionally, in the embodiment, when the nonzero elements are counted, this step specifically includes step 2111 to step 2115:

Step 2111, obtaining, in the adjacency matrix, the element groups located at symmetric positions, where an element group includes a first element in row i, column j and a second element in row j, column i.

Specifically, two elements at symmetric positions in the adjacency matrix are denoted as an element group. Symmetric positions are two positions whose horizontal and vertical coordinates are interchanged. For example, row i, column j and row j, column i are symmetric positions, and the elements at these two positions are denoted as one element group. Further, the element in row i, column j is denoted as the first element and indicates whether the K-nearest-neighbor sample set of the i-th sample contains the j-th sample. The element in row j, column i is denoted as the second element and indicates whether the K-nearest-neighbor sample set of the j-th sample contains the i-th sample. The neighbor relationship between the two corresponding samples can be obtained from the element group at the symmetric positions.

Step 2112, if the first element and the second element include at least one zero element, setting both the first element and the second element to zero elements.

Further, it is determined whether the first element and the second element include at least one zero element. If so, both the first element and the second element are set to zero elements; otherwise, the first element and the second element are kept unchanged, and step 2113 is executed. If the first element and the second element include at least one zero element, the K-nearest-neighbor sample set of at least one of the two corresponding samples does not contain the other sample, that is, the two samples are not mutual neighbor samples. The first element and the second element are then both changed to zero elements, which cancels the neighbor relationship between the two samples. For example, with reference to Fig. 7, the first element in row 1, column 8 is 1 and the second element in row 8, column 1 is 0. Since the first element and the second element belong to an element group at symmetric positions, the first element in row 1, column 8 is changed to 0, which cancels the neighbor relationship between sample 1 and sample 8.

Step 2113, updating the adjacency matrix after all element groups of the adjacency matrix have been traversed.

Specifically, after all element groups at symmetric positions have been traversed, the adjacency matrix is updated. Taking the adjacency matrix shown in Fig. 7 as an example, after all element groups in Fig. 7 have been traversed, the adjacency matrix is updated to obtain the adjacency matrix shown in Fig. 8. Fig. 8 is a schematic diagram of another adjacency matrix provided by Embodiment 2 of the present invention.

Step 2114, counting the nonzero elements in the updated adjacency matrix, and determining that the two samples corresponding to each nonzero element are mutual neighbor samples and have a connection relationship.

Specifically, in the updated adjacency matrix, the two samples corresponding to any nonzero element are mutual neighbor samples. Therefore, all mutual neighbor samples can be determined from the nonzero elements in the updated adjacency matrix. Further, since the updated adjacency matrix is symmetric, it is sufficient to obtain only one nonzero element of each pair of symmetric positions and to determine from that element that the two samples are mutual neighbor samples. After the mutual neighbor samples have been determined, only the lines between mutual neighbor samples are retained in the sample set.

Step 2115, obtaining all connection samples of each sample based on the mutual neighbor samples.

Specifically, all pairs of mutual neighbor samples that contain a given sample are obtained, and the other sample in each of these pairs is taken as a connection sample of the given sample. Together they form all connection samples of that sample.
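Steps 2111 to 2115 can be sketched as follows: each element group at symmetric positions is cleared unless both elements are nonzero, and the connection samples are then read off the remaining nonzero elements. The input matrix is a hypothetical four-sample example, not the matrix of Fig. 7, and the function names are illustrative.

```python
def symmetrize(matrix):
    """Keep a neighbor relationship only when both symmetric elements are nonzero."""
    n = len(matrix)
    result = [row[:] for row in matrix]  # leave the input matrix untouched
    for i in range(n):
        for j in range(i + 1, n):        # one pass per element group (i, j) / (j, i)
            if result[i][j] == 0 or result[j][i] == 0:
                result[i][j] = result[j][i] = 0
    return result

def connection_samples(symmetric):
    """Map each sample to the set of its mutual neighbor samples (itself excluded)."""
    n = len(symmetric)
    return {i: {j for j in range(n) if j != i and symmetric[i][j] != 0}
            for i in range(n)}

# Hypothetical adjacency matrix: sample 3 lists samples 1 and 2, but not vice versa.
matrix = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
]
updated = symmetrize(matrix)
connections = connection_samples(updated)
```

In this example sample 3 ends up with no connection sample, because none of its neighbors regards it as a neighbor in return.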

Step 212, filtering all connection samples based on the distance mean to filter out the connection samples whose second sample distance is greater than the distance mean, where the second sample distance is the distance between a sample and a connection sample of that sample.

Specifically, in the embodiment, the distance between a sample and its corresponding connection sample is denoted as the second sample distance, that is, the distance between mutual neighbor samples is the second sample distance. Further, if the second sample distance is greater than the distance mean, the two corresponding samples, although mutual neighbor samples, differ considerably in their features, and clustering them together would affect the accuracy of the clustering result. Therefore, the embodiment removes the connection relationship between the two samples, that is, the two samples are determined not to be mutual neighbor samples. The line between the two samples can then be deleted from the sample set, and the corresponding elements in the adjacency matrix are set to zero elements. It can be understood that the connection samples of each sample can be filtered in this way, so that only connection samples whose second sample distance does not exceed the distance mean are retained. This step can also be understood as scanning the connection relationships of each sample in the sample set based on the sweep radius to obtain accurate connection relationships.
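The filtering in step 212 can be sketched as follows. The distance lookup is passed in as a function so that any distance calculation can be used; the pairwise distances shown are hypothetical, and the function name is illustrative.

```python
def filter_connections(connections, distance, eps):
    """Drop every connection sample whose second sample distance is greater than eps."""
    return {i: {j for j in linked if distance(i, j) <= eps}
            for i, linked in connections.items()}

# Hypothetical pairwise distances between three mutually connected samples.
pair_distance = {(0, 1): 1.0, (0, 2): 3.0, (1, 2): 1.5}

def distance(i, j):
    return pair_distance[(min(i, j), max(i, j))]

connections = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
filtered = filter_connections(connections, distance, eps=2.0)
```

Because the distance is symmetric, a connection removed for one sample is also removed for the other, so the filtered relationships stay mutual.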

Step 213, clustering the samples in the sample set based on the S value and the connection samples obtained after filtering.

Specifically, this step includes step 2131 to step 2139:

Step 2131, counting the total number of connection samples of each sample in turn.

Specifically, the total number of lines of each sample is determined according to the lines between the samples in the sample set, which gives the total number of connection samples. In general, the total number of connection samples of each sample is the total number of connection samples retained after filtering based on the distance mean. It can be understood that, in practical applications, it is also possible to record only the connection relationships between mutual neighbor samples without drawing them in the sample set. In that case, the connection samples of each sample can be determined from the recorded connection relationships, which gives the total number of connection samples.

Step 2132, taking the samples whose total number of connection samples is greater than the S value as core samples.

Specifically, the total number of connection samples of each sample is compared with the S value, and if the total number of connection samples is greater than the S value, the corresponding sample is denoted as a core sample. After each sample has been traversed in this way, all core samples in the sample set are obtained. A core sample can be understood as a sample that can serve as a start point in the clustering process. In general, each sample and its connection samples are clustered into one cluster because their features are highly similar. If the total number of connection samples of a sample is not greater than the S value, the cluster obtained by subsequent clustering may contain fewer samples than the minimum number of samples contained in a cluster. Therefore, during clustering, such a sample is not selected as a start point, that is, it is not selected as a core sample.

Step 2133, selecting any core sample among all obtained core samples as the current sample.

Specifically, any one core sample is selected as the start point of this clustering pass and is denoted as the current sample. It should be noted that, in the embodiment, the current sample is determined randomly. In practical applications, a selection rule for the current sample can also be set, and the current sample is selected by that rule. In general, after the current sample has been determined, the core sample is marked as visited.

Step 2134, accessing all connection samples of the current sample.

Specifically, all connection samples that have a connection relationship with the current sample are obtained according to the currently retained connection relationships, and all obtained connection samples are marked as visited.

Step 2135, taking each connection sample obtained by the access as a vertex, and accessing all connection samples corresponding to each vertex.

Further, each currently obtained connection sample is taken as a vertex, and all connection samples that have a connection relationship with each vertex are then obtained according to the currently retained connection relationships. Each vertex can also be regarded as a sub-start point within one clustering pass.

Step 2136, determining whether a new connection sample has been obtained by the access. If a new connection sample has been obtained, return to step 2135; if no new connection sample can be reached, execute step 2137.

Specifically, when all connection samples that have a connection relationship with each vertex are obtained according to the currently retained connection relationships, it is determined whether a new connection sample has been obtained, that is, a connection sample that has not been marked as visited. If a new connection sample has been obtained, there are still further samples with similar features, and the procedure returns to step 2135: the newly obtained connection sample is taken as a vertex and all connection samples of that vertex are accessed, until no new connection sample can be obtained. If no new connection sample can be obtained, all samples with similar features have been found starting from the core sample. This clustering pass can then be regarded as finished, and step 2137 is executed.

It should be noted that, in this clustering pass, if a core sample is reached as a connection sample, that core sample is marked as visited.

Step 2137, determining whether any core sample remains unvisited. If so, execute step 2138; otherwise, execute step 2139.

Specifically, it is determined whether there is still a core sample that has not been marked as visited. If an unvisited core sample exists, that core sample is updated to be the current sample and a new clustering pass is started, that is, step 2138 is executed. If all core samples have been visited, every possible clustering start point has been visited and no further start point can be found, so step 2139 is executed.

Step 2138: Update any unaccessed core sample to be the current sample, and return to step 2134.

Specifically, if the number of unaccessed core samples is greater than 1, the current sample is selected at random. If the number of unaccessed core samples is 1, that core sample is taken as the current sample. Then, the process returns to step 2134, that is, a new clustering pass is started.

Step 2139: Cluster the current sample and the connection samples obtained by access from the current sample into one cluster.

Specifically, one clustering pass can be regarded as one access pass; accordingly, all connection samples obtained in each clustering pass, together with the current sample of that pass, are clustered into one cluster. For the sample set, the number of clusters obtained equals the number of clustering passes performed. Optionally, samples that are never accessed are marked as noise points.
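The access procedure of steps 2134 to 2139 can be sketched in Python as a breadth-first traversal. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the function name `cluster_by_connections`, the dictionary layout of `connections` (every sample id mapped to the set of its connection samples after filtering), and the in-order rather than random choice of the next core sample are all hypothetical.

```python
from collections import deque


def cluster_by_connections(connections, s_value):
    """Group samples into clusters by following retained connection relationships.

    connections: dict mapping every sample id to the set of its connection
                 samples (assumed layout; every neighbor must also be a key).
    s_value:     the S value; a sample with more than S connection samples
                 is treated as a core sample.
    """
    core = [i for i in connections if len(connections[i]) > s_value]
    accessed = set()
    clusters = []
    for start in core:                 # assumption: iterate in order instead of at random
        if start in accessed:          # a core sample reached earlier is already marked
            continue
        accessed.add(start)
        members = [start]
        queue = deque([start])
        while queue:                   # expand until no new connection sample is obtained
            vertex = queue.popleft()
            for neighbor in connections[vertex]:
                if neighbor not in accessed:
                    accessed.add(neighbor)
                    members.append(neighbor)
                    queue.append(neighbor)
        clusters.append(sorted(members))   # one clustering pass yields one cluster
    noise = sorted(set(connections) - accessed)   # never-accessed samples
    return clusters, noise
```

In this sketch each pass of the outer loop corresponds to one clustering pass, so the number of clusters equals the number of passes that start from an unaccessed core sample.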

For example, for the sample set of Fig. 3, the K (K=5) nearest-neighbor sample set of sample B includes sample A, but the K-nearest-neighbor sample set of sample A does not include sample B. In this case, the connection relationship between sample A and sample B can be removed, that is, the dashed line between sample A and sample B is deleted. In the subsequent clustering process, whether the connection samples of sample A or the connection samples of sample B are accessed, sample A and sample B will not be clustered into one cluster, which ensures that the clustering is reasonable.

In the above, a K-nearest-neighbor graph of the samples is constructed, and the first sample distance is obtained based on the K-nearest-neighbor graph, the first sample distance being the distance between a sample and its S-nearest-neighbor sample (S < K). A frequency distribution histogram is then constructed based on the first sample distances, and the distance mean is determined according to the histogram. Meanwhile, an adjacency matrix is constructed based on the K-nearest-neighbor graph and symmetrized to determine the samples having a connection relationship. The samples having a connection relationship are then filtered by the distance mean, and the sample set is clustered based on the filtered connection relationships and the S value. These technical means solve the technical problem that the DBSCAN algorithm in the prior art cannot reasonably cluster a sample set of uneven density. By symmetrizing the adjacency matrix and filtering the connected samples by the distance mean, samples of different densities are prevented from being merged into one cluster when the distribution density of the samples is uneven, which would otherwise impair clustering accuracy. In addition, the distance mean is determined from the frequency distribution histogram without user input, which reduces the workload of manual parameter tuning; determining the distance mean statistically ensures that it is reasonable and thereby ensures clustering accuracy.

Embodiment three

Fig. 9 is a schematic structural diagram of a sample clustering apparatus provided by Embodiment Three of the present invention. Referring to Fig. 9, the sample clustering apparatus includes: a distance statistics module 301, a distance acquisition module 302, a mean calculation module 303, a connection determination module 304 and a sample clustering module 305.

The distance statistics module 301 is configured to count the first sample distance corresponding to each sample in a sample set, the first sample distance being the distance between the sample and the S-nearest-neighbor sample of the sample; the distance acquisition module 302 is configured to obtain, from all the first sample distances, the first sample distances within a set distance range; the mean calculation module 303 is configured to calculate a distance mean based on the first sample distances within the set distance range; the connection determination module 304 is configured to determine all connection samples of each sample based on the K-nearest-neighbor sample set corresponding to each sample, where K > S, and a sample and its connection sample are mutual neighbor samples and have a connection relationship; the sample clustering module 305 is configured to cluster the samples in the sample set according to the connection samples, the distance mean and the S value, the distance mean being the sweep radius and the S value being the minimum number of samples contained in a cluster.

On the basis of the above embodiments, the sample clustering module 305 includes: a sample filtering submodule, configured to filter all connection samples based on the distance mean so as to filter out connection samples whose second sample distance is greater than the distance mean, the second sample distance being the distance between a sample and a connection sample of that sample; and a clustering submodule, configured to cluster the samples in the sample set based on the S value and the connection samples obtained after filtering.
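The filtering performed by the sample filtering submodule might look like the following Python sketch. It assumes Euclidean distance and a dictionary of connection sets; the name `filter_connections` and both data layouts are illustrative choices, not taken from the embodiment.

```python
import math


def filter_connections(samples, connections, distance_mean):
    """Drop connection samples whose second sample distance exceeds the distance mean.

    samples:       list of coordinate tuples, indexed by sample id (assumed layout).
    connections:   dict mapping a sample id to the set of its connection samples.
    distance_mean: the distance mean used as the sweep radius.
    """
    kept = {}
    for i, neighbors in connections.items():
        # Second sample distance: distance between a sample and its connection sample.
        # Euclidean distance is an assumption of this sketch.
        kept[i] = {j for j in neighbors
                   if math.dist(samples[i], samples[j]) <= distance_mean}
    return kept
```

Because the distance is symmetric, a connection removed for one sample is also removed for the other, so the filtered relationships remain mutual.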

On the basis of the above embodiments, the clustering submodule includes: a total quantity statistics unit, configured to count, in turn, the total number of connection samples of each sample; a core sample determination unit, configured to take samples whose total number of connection samples is greater than the S value as core samples; a current sample selection unit, configured to select any core sample, from all the core samples obtained, as the current sample; a first access unit, configured to access all connection samples of the current sample; a second access unit, configured to take each connection sample obtained by the access as a vertex and to access all connection samples corresponding to the vertex; a third access unit, configured to repeat the operation of taking each connection sample obtained by the access as a vertex and accessing all connection samples corresponding to the vertex, until no new connection sample can be accessed; a sample update unit, configured to update any unaccessed core sample to be the current sample and to return to the operation of accessing all connection samples of the current sample, until all core samples have been accessed; and a cluster forming unit, configured to cluster the current sample and the connection samples obtained by access from the current sample into one cluster.

On the basis of the above embodiments, the connection determination module 304 includes: a set acquisition submodule, configured to obtain the K-nearest-neighbor sample set corresponding to each sample; an adjacency matrix construction submodule, configured to construct an adjacency matrix according to all the K-nearest-neighbor sample sets, each element in the adjacency matrix representing the neighbor relationship between the two corresponding samples; and a non-zero element statistics submodule, configured to count the non-zero elements in the adjacency matrix so as to determine all connection samples of each sample.

On the basis of the above embodiments, the non-zero element statistics submodule includes: an element group acquisition unit, configured to obtain, in the adjacency matrix, element groups at symmetric positions, each element group including a first element in row i, column j and a second element in row j, column i; a zero setting unit, configured to set both the first element and the second element to zero if the first element and the second element include at least one zero element; a matrix update unit, configured to update the adjacency matrix after all element groups of the adjacency matrix have been traversed; a neighbor relationship determination unit, configured to count the non-zero elements in the updated adjacency matrix and to determine the two samples corresponding to each non-zero element as mutual neighbor samples having a connection relationship; and a connection sample determination unit, configured to obtain all connection samples of each sample based on the mutual neighbor samples.
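The symmetrization of the adjacency matrix described above can be illustrated with a short Python sketch. The 0/1 encoding of the matrix elements and the function names `mutual_adjacency` and `connection_samples` are assumptions made for illustration; the embodiment only states that each element represents a neighbor relationship.

```python
def mutual_adjacency(knn_sets):
    """Build an adjacency matrix from K-nearest-neighbor sets and symmetrize it.

    knn_sets: list where knn_sets[i] is the set of sample ids in the
              K-nearest-neighbor sample set of sample i (assumed layout).
    """
    n = len(knn_sets)
    # Element (i, j) is 1 if sample j is in the K-nearest-neighbor set of sample i.
    adj = [[1 if j in knn_sets[i] else 0 for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Element group at symmetric positions: (i, j) and (j, i).
            if adj[i][j] == 0 or adj[j][i] == 0:
                adj[i][j] = adj[j][i] = 0
    return adj


def connection_samples(adj):
    """Read the connection samples of each sample off the non-zero elements."""
    return {i: {j for j, value in enumerate(row) if value != 0}
            for i, row in enumerate(adj)}
```

After symmetrization a non-zero element remains only where each of the two samples appears in the K-nearest-neighbor set of the other, which matches the mutual-neighbor condition stated for connection samples.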

On the basis of the above embodiments, the distance acquisition module 302 includes: a histogram construction submodule, configured to construct a frequency distribution histogram based on all the first sample distances; a frequency statistics submodule, configured to count the frequency of each bin in the frequency distribution histogram so as to determine the set distance range; and a first distance acquisition submodule, configured to obtain the first sample distances within the set distance range.

On the basis of the above embodiments, the frequency statistics submodule includes: a maximum bin acquisition unit, configured to obtain the maximum-frequency bin in the frequency distribution histogram; a drop calculation unit, configured to calculate the frequency drop between adjacent rear bins, a rear bin being a bin located after the maximum-frequency bin in the frequency distribution histogram; a bin confirmation unit, configured to identify the pair of adjacent rear bins with the largest frequency drop and to select, from that pair, the bin located further back; and a threshold determination unit, configured to take the first sample distance corresponding to the maximum-frequency bin and the first sample distance corresponding to the selected bin as the distance thresholds of the set distance range.
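One possible reading of this range selection is sketched below in Python. Several details are assumptions because the text does not fix them: equal-width bins, the number of bins `num_bins`, using the left edge of a bin as "the first sample distance corresponding to the bin", and the fallback when fewer than two rear bins exist. The function name `set_distance_range` is hypothetical.

```python
def set_distance_range(distances, num_bins):
    """Derive the set distance range from a frequency histogram of first sample distances."""
    lo, hi = min(distances), max(distances)
    width = (hi - lo) / num_bins          # assumes the distances are not all equal
    counts = [0] * num_bins
    for d in distances:
        # The largest distance is placed in the last bin.
        counts[min(int((d - lo) / width), num_bins - 1)] += 1
    peak = counts.index(max(counts))      # maximum-frequency bin
    lower = lo + peak * width             # assumption: left edge of the peak bin
    # Pairs of adjacent rear bins (b, b + 1), both located after the peak bin.
    pairs = range(peak + 1, num_bins - 1)
    if len(pairs) == 0:
        return lower, hi                  # fallback (assumption): fewer than two rear bins
    front = max(pairs, key=lambda b: counts[b] - counts[b + 1])
    upper = lo + (front + 1) * width      # assumption: left edge of the bin further back
    return lower, upper
```

Choosing the left edge of the selected bin keeps the low-frequency tail out of the range; using the right edge instead would be an equally plausible reading of the text.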

On the basis of the above embodiments, the mean calculation module 303 includes: a sample quantity acquisition submodule, configured to obtain the number of samples whose first sample distance is within the set distance range; a total distance submodule, configured to add up the first sample distances within the set distance range to obtain a sample total distance; and a quotient calculation submodule, configured to take the quotient of the sample total distance and the sample quantity as the distance mean.
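As a small worked illustration of this quotient, the following Python sketch computes the distance mean from the first sample distances that fall within the set distance range. Treating both thresholds as inclusive is an assumption, and the name `distance_mean` is hypothetical.

```python
def distance_mean(distances, lower, upper):
    """Mean of the first sample distances within [lower, upper] (inclusive by assumption)."""
    in_range = [d for d in distances if lower <= d <= upper]
    sample_quantity = len(in_range)       # number of samples within the range
    sample_total_distance = sum(in_range)
    return sample_total_distance / sample_quantity
```

For the distances 1.0, 2.0, 3.0 and 10.0 with a range of 0.0 to 4.0, the value 10.0 is excluded, the sample total distance is 6.0, the sample quantity is 3, and the distance mean is 2.0.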

On the basis of the above embodiments, the apparatus further includes: a K-nearest-neighbor graph construction module, configured to construct, before the first sample distance corresponding to each sample in the sample set is counted, a K-nearest-neighbor graph of the samples in the sample set, the weight of each edge in the K-nearest-neighbor graph being the distance between the corresponding samples.
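A brute-force Python sketch of such a graph, and of reading the first sample distance from it, is given below. It assumes Euclidean distance and interprets the S-nearest-neighbor sample as the S-th closest neighbor; the names `build_knn_graph` and `first_sample_distances` and the list-of-pairs layout are illustrative only.

```python
import math


def build_knn_graph(samples, k):
    """Map each sample id to its K nearest neighbors as (neighbor id, distance) pairs.

    The pairs are sorted by ascending distance; each distance is the weight
    of the edge between the two samples (Euclidean, by assumption).
    """
    graph = {}
    for i, p in enumerate(samples):
        ranked = sorted((math.dist(p, q), j)
                        for j, q in enumerate(samples) if j != i)
        graph[i] = [(j, d) for d, j in ranked[:k]]
    return graph


def first_sample_distances(graph, s_value):
    """Distance from each sample to its S-th nearest neighbor (requires S <= K)."""
    return [graph[i][s_value - 1][1] for i in sorted(graph)]
```

The pairwise search costs O(n^2) distance evaluations, which is acceptable for an illustration; a spatial index would be the usual choice for large sample sets.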

The sample clustering apparatus provided by this embodiment of the present invention is included in sample clustering equipment, can be used to execute the sample clustering method provided by any of the above embodiments, and has the corresponding functions and beneficial effects.

Embodiment four

Fig. 10 is a schematic structural diagram of sample clustering equipment provided by Embodiment Four of the present invention. As shown in Fig. 10, the sample clustering equipment includes a processor 40, a memory 41, an input device 42 and an output device 43. The number of processors 40 in the sample clustering equipment may be one or more; one processor 40 is taken as an example in Fig. 10. The processor 40, the memory 41, the input device 42 and the output device 43 in the sample clustering equipment may be connected by a bus or in other ways; connection by a bus is taken as an example in Fig. 10.

As a computer-readable storage medium, the memory 41 can be used to store software programs, computer-executable programs and modules, such as the program instructions/modules corresponding to the sample clustering method in the embodiments of the present invention (for example, the distance statistics module 301, the distance acquisition module 302, the mean calculation module 303, the connection determination module 304 and the sample clustering module 305 in the sample clustering apparatus). The processor 40 executes the various functional applications and data processing of the sample clustering equipment by running the software programs, instructions and modules stored in the memory 41, thereby implementing the above sample clustering method.

The memory 41 may mainly include a program storage area and a data storage area. The program storage area can store an operating system and the application programs required by at least one function; the data storage area can store data created through use of the sample clustering equipment, and the like. In addition, the memory 41 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device or another non-volatile solid-state storage device. In some examples, the memory 41 may further include memory located remotely from the processor 40, and such remote memory can be connected to the sample clustering equipment through a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network and combinations thereof.

The input device 42 can be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the sample clustering equipment. The output device 43 may include a display device such as a display screen.

The above sample clustering equipment includes the sample clustering apparatus, can be used to execute any of the sample clustering methods, and has the corresponding functions and beneficial effects.

Embodiment five

An embodiment of the present invention further provides a storage medium containing computer-executable instructions which, when executed by a computer processor, are used to execute a sample clustering method, the method comprising:

Of course, in the storage medium containing computer-executable instructions provided by this embodiment of the present invention, the computer-executable instructions are not limited to the method operations described above, and can also perform the relevant operations in the sample clustering method provided by any embodiment of the present invention.

From the above description of the embodiments, those skilled in the art can clearly understand that the present invention can be implemented by software plus the necessary general-purpose hardware, and of course can also be implemented by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as a computer floppy disk, read-only memory (ROM), random access memory (RAM), flash memory (FLASH), hard disk or optical disc, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) to execute the methods described in the embodiments of the present invention.

It is worth noting that, in the above embodiment of the sample clustering apparatus, the units and modules included are divided only according to functional logic, but the division is not limited to the above as long as the corresponding functions can be realized; in addition, the specific names of the functional units are only for ease of distinguishing them from one another and are not intended to limit the protection scope of the present invention.

Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the present invention is not limited to the specific embodiments described herein, and that various obvious changes, readjustments and substitutions can be made by those skilled in the art without departing from the protection scope of the present invention. Therefore, although the present invention has been described in further detail through the above embodiments, it is not limited to them and may include further equivalent embodiments without departing from the inventive concept; the scope of the present invention is determined by the scope of the appended claims.

Claims

1. A sample clustering method, characterized by comprising:

counting the first sample distance corresponding to each sample in a sample set, the first sample distance being the distance between the sample and the S-nearest-neighbor sample of the sample;

determining all connection samples of each sample based on the K-nearest-neighbor sample set corresponding to each sample, where K > S, and the sample and the connection sample of the sample are mutual neighbor samples and have a connection relationship;

clustering the samples in the sample set according to the connection samples, the distance mean and the S value, the distance mean being the sweep radius and the S value being the minimum number of samples contained in a cluster.

2. The sample clustering method according to claim 1, characterized in that clustering the samples in the sample set according to the connection samples, the distance mean and the S value comprises:

filtering all connection samples based on the distance mean so as to filter out connection samples whose second sample distance is greater than the distance mean, the second sample distance being the distance between a sample and a connection sample of the sample;

3. The sample clustering method according to claim 2, characterized in that clustering the samples in the sample set based on the S value and the connection samples obtained after filtering comprises:

counting, in turn, the total number of connection samples of each sample;

selecting any core sample, from all the core samples obtained, as the current sample;

accessing all connection samples of the current sample;

taking each connection sample obtained by the access as a vertex, and accessing all connection samples corresponding to the vertex;

repeating the operation of taking each connection sample obtained by the access as a vertex and accessing all connection samples corresponding to the vertex, until no new connection sample can be accessed;

updating any unaccessed core sample to be the current sample, and returning to the operation of accessing all connection samples of the current sample, until all core samples have been accessed;

4. The sample clustering method according to claim 1, characterized in that determining all connection samples of each sample based on the K-nearest-neighbor sample set corresponding to each sample comprises:

obtaining the K-nearest-neighbor sample set corresponding to each sample;

constructing an adjacency matrix according to all the K-nearest-neighbor sample sets, each element in the adjacency matrix representing the neighbor relationship between the two corresponding samples;

5. The sample clustering method according to claim 4, characterized in that counting the non-zero elements in the adjacency matrix to determine all connection samples of each sample comprises:

obtaining, in the adjacency matrix, element groups at symmetric positions, each element group including a first element in row i, column j and a second element in row j, column i;

if the first element and the second element include at least one zero element, setting both the first element and the second element to zero;

counting the non-zero elements in the updated adjacency matrix, and determining the two samples corresponding to each non-zero element as mutual neighbor samples having a connection relationship;

6. The sample clustering method according to claim 1, characterized in that obtaining, from all the first sample distances, the first sample distances within a set distance range comprises:

obtaining the first sample distances within the set distance range.

7. The sample clustering method according to claim 6, characterized in that counting the frequency of each bin in the frequency distribution histogram to determine the set distance range comprises:

obtaining the maximum-frequency bin in the frequency distribution histogram;

calculating the frequency drop between adjacent rear bins, a rear bin being a bin located after the maximum-frequency bin in the frequency distribution histogram;

identifying the pair of adjacent rear bins with the largest frequency drop, and selecting, from that pair, the bin located further back;

taking the first sample distance corresponding to the maximum-frequency bin and the first sample distance corresponding to the selected bin as the distance thresholds of the set distance range.

8. The sample clustering method according to claim 1, characterized in that calculating the distance mean based on the first sample distances within the set distance range comprises:

obtaining the number of samples whose first sample distance is within the set distance range;

9. The sample clustering method according to claim 1, characterized in that, before counting the first sample distance corresponding to each sample in the sample set, the method further comprises:

constructing a K-nearest-neighbor graph of the samples in the sample set, the weight of each edge in the K-nearest-neighbor graph being the distance between the corresponding samples.

10. A sample clustering apparatus, characterized by comprising:

a distance statistics module, configured to count the first sample distance corresponding to each sample in a sample set, the first sample distance being the distance between the sample and the S-nearest-neighbor sample of the sample;

a distance acquisition module, configured to obtain, from all the first sample distances, the first sample distances within a set distance range;

a connection determination module, configured to determine all connection samples of each sample based on the K-nearest-neighbor sample set corresponding to each sample, where K > S, and the sample and the connection sample of the sample are mutual neighbor samples and have a connection relationship;

a sample clustering module, configured to cluster the samples in the sample set according to the connection samples, the distance mean and the S value, the distance mean being the sweep radius and the S value being the minimum number of samples contained in a cluster.

11. Sample clustering equipment, characterized in that the sample clustering equipment comprises:

one or more processors;

a memory, configured to store one or more programs;

when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the sample clustering method according to any one of claims 1-9.

12. A computer-readable storage medium having a computer program stored thereon, characterized in that the program, when executed by a processor, implements the sample clustering method according to any one of claims 1-9.