WO2019136929A1 - Data clustering method, apparatus and storage medium based on K-neighborhood similarity - Google Patents


Info

Publication number
WO2019136929A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
data point
neighborhood
neighbor
data points
Application number
PCT/CN2018/091697
Other languages
English (en)
French (fr)
Inventor
黄近秋
徐德明
万长林
Original Assignee
惠州学院
Application filed by 惠州学院
Priority to US16/396,682 (US11210348B2)
Publication of WO2019136929A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/906 Clustering; Classification
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation with a fixed number of clusters, e.g. K-means clustering

Definitions

  • the present invention relates to the field of data processing technology, and in particular to a data clustering method, apparatus, and computer readable storage medium based on K-neighborhood similarity.
  • Data clustering is one such automatic classification method and apparatus.
  • the difficulty of parameter tuning reduces the usability of data clustering.
  • the density-based DBSCAN method requires two parameters, the radius epsilon and the minimum point count minpts, to define a density threshold. These parameters depend heavily on the specific data, because data density tends to vary widely and distances between data are measured at different scales.
  • the embodiment of the invention provides a data clustering method, apparatus, and computer readable storage medium based on K-neighborhood similarity, which can effectively address three main problems of the prior art: it requires substantial prior knowledge and assumptions, its clustering parameters are difficult to set, and it cannot obtain the hierarchical relationship between clusters.
  • An embodiment of the present invention provides a data clustering method based on K neighborhood similarity, including the steps of:
  • the data points to be clustered are sorted in ascending order according to the maximum radius of the K-neighborhoods of the data points; wherein the maximum radius of the K-neighborhood of a data point refers to the maximum distance from the data point to any point in its K-neighborhood;
  • for each data point in ascending order, the K-neighborhood statistical characteristic values of the data point are computed and compared with those of the generated neighboring clusters; if the difference is within a first threshold delta, the data point is merged into the neighboring cluster, and if several neighboring clusters qualify, they are merged one by one from near to far until the number of data points in the cluster exceeds a second threshold nspts; if the difference between the K-neighborhood statistical characteristic values of the data point and those of its neighboring cluster exceeds the first threshold delta, a new cluster containing the data point is generated;
  • the data clustering method based on K-neighborhood similarity sorts the data points to be clustered by the maximum radius of their K-neighborhoods, that is, it sorts the data points by density in ascending order.
  • a first pass is then performed over the sorted data points, merging statistically similar data points into the same cluster; a second pass is then performed over the data points in low-density clusters, according to the required cluster scale, to identify all noise points.
  • data clustering is thus realized.
  • the data clustering method based on K-neighborhood similarity provided by the embodiment of the present invention has the following technical effects: first, it is not necessary to preset the number of clusters, nor to know the probability distribution of the data; second, the parameters are easy to set, and the setting of each parameter is independent of the density distribution and distance scale of the data; third, clusters are formed by gradual merging from high density to low density, and the hierarchical relationship between clusters is derived as the clusters are generated.
  • the K-neighborhood of a data point refers to the set of the K data points closest to that data point; wherein K typically ranges from 5 to 9.
  • the K-neighborhood statistical characteristic values of each data point include at least one of the K-neighborhood maximum radius, the K-neighborhood radius mean, and the K-neighborhood radius standard deviation of the data point;
  • the K-neighborhood statistical characteristic values of a neighboring cluster include at least one of the average K-neighborhood maximum radius of all its data points, the average K-neighborhood radius mean of all its data points, and the average K-neighborhood radius standard deviation of all its data points.
  • the K-neighborhood statistical characteristic values of each data point include the K-neighborhood radius mean and the K-neighborhood radius standard deviation of the data point; in step S2, if the K-neighborhood statistical characteristic values of the data point and those of its neighboring cluster S satisfy the following formula 1) and formula 2), the difference between them is considered to be within the first threshold delta:
  • Formula 1): mean(S(M)) − delta·std(S(M)) ≤ p(M) ≤ mean(S(M)) + delta·std(S(M));
  • Formula 2): mean(S(V)) − delta·std(S(V)) ≤ p(V) ≤ mean(S(V)) + delta·std(S(V));
  • where p(M) is the K-neighborhood radius mean of the data point p; p(V) is the K-neighborhood radius standard deviation of the data point p; mean(S(M)) is the mean of the K-neighborhood radius means of all data points in the neighboring cluster S; mean(S(V)) is the mean of the K-neighborhood radius standard deviations of all data points in S; std(S(M)) is the standard deviation of the K-neighborhood radius means of all data points in S; std(S(V)) is the standard deviation of the K-neighborhood radius standard deviations of all data points in S; and delta is a preset coefficient.
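Formulas 1) and 2) amount to a band test: a point p joins cluster S only if both its M and V statistics fall within delta standard deviations of the cluster's corresponding means. The sketch below illustrates this; the function name `statistically_similar` and the use of NumPy are illustrative choices, not part of the patent:

```python
import numpy as np

def statistically_similar(p_M, p_V, cluster_Ms, cluster_Vs, delta=3.0):
    """Check formulas 1) and 2): cluster_Ms / cluster_Vs hold the per-point
    K-neighborhood radius means and standard deviations of cluster S."""
    mean_M, std_M = np.mean(cluster_Ms), np.std(cluster_Ms)
    mean_V, std_V = np.mean(cluster_Vs), np.std(cluster_Vs)
    ok_M = mean_M - delta * std_M <= p_M <= mean_M + delta * std_M  # formula 1)
    ok_V = mean_V - delta * std_V <= p_V <= mean_V + delta * std_V  # formula 2)
    return ok_M and ok_V
```

With delta in the 1 to 10 range given below, larger values make clusters more tolerant of density variation.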
  • the preset range of the first threshold delta is usually 1 to 10.
  • if the K-neighborhood statistical characteristic value of the data point satisfies the following formula 3), it is considered to be greater than or equal to nsr times the K-neighborhood statistical characteristic value of any other data point in the K-neighborhood of the data point, and the data point is marked as a noise point:
  • Formula 3): p(M) ≥ nsr·q(M), for q ∈ N_k(p);
  • where nsr is a preset coefficient whose range is usually 3 to 5.
  • clusters in which the total number of data points is greater than or equal to the second threshold nspts are defined as significant clusters, and clusters in which the total number of data points is less than the second threshold nspts are defined as non-significant clusters;
  • otherwise, the data point is merged into the significant cluster to which its nearest data point belongs.
  • the second threshold nspts can be adjusted according to the required cluster size.
  • the embodiment of the invention further provides a data clustering device based on K neighborhood similarity, comprising:
  • a data point sorting module configured to sort the data points in ascending order according to the maximum radius of the K-neighborhoods of the data points; wherein the maximum radius of the K-neighborhood of a data point refers to the maximum distance from the data point to any point in its K-neighborhood;
  • a looping module configured to perform a first pass over the data points after ascending sorting, calculating the K-neighborhood statistical characteristic values of each data point and comparing them with the K-neighborhood statistical characteristic values of the generated neighboring clusters of the data point; if the difference between the K-neighborhood statistical characteristic values of the data point and those of a neighboring cluster is within the first threshold delta, the data point is merged into that neighboring cluster, and if several neighboring clusters qualify, they are merged one by one from near to far until the number of data points in the cluster exceeds the second threshold nspts; if the difference between the K-neighborhood statistical characteristic values of the data point and those of its neighboring cluster exceeds the first threshold delta, a new cluster containing the data point is generated;
  • a two-pass loop module configured to perform, after the first pass is completed, a second pass over the data points in the clusters whose total number of data points is less than the second threshold nspts; if the K-neighborhood statistical characteristic value of a data point is greater than or equal to nsr times the K-neighborhood statistical characteristic value of any other data point in its K-neighborhood, the data point is marked as noise data, otherwise the data point is merged into the cluster to which its nearest data point belongs.
  • the data clustering apparatus based on K-neighborhood similarity sorts the data points to be clustered by the maximum radius of their K-neighborhoods, that is, it sorts the data points by density in ascending order.
  • a first pass is then performed over the sorted data points, merging statistically similar data points into the same cluster; a second pass is then performed over the data points in low-density clusters, according to the required cluster scale, to identify all noise points.
  • data clustering is thus realized.
  • the data clustering method based on K-neighborhood similarity provided by the embodiment of the present invention has the following technical effects: first, it is not necessary to preset the number of clusters, nor to know the probability distribution of the data; second, the parameters are easy to set, and the setting of each parameter is independent of the density distribution and distance scale of the data; third, clusters are formed by gradual merging from high density to low density, and the hierarchical relationship between clusters is derived as the clusters are generated.
  • the K-neighborhood of a data point refers to the set of the K data points closest to that data point; wherein K typically ranges from 5 to 9.
  • the K-neighborhood statistical characteristic values of each data point include at least one of the K-neighborhood maximum radius, the K-neighborhood radius mean, and the K-neighborhood radius standard deviation of the data point;
  • the K-neighborhood statistical characteristic values of a neighboring cluster include at least one of the average K-neighborhood maximum radius of all its data points, the average K-neighborhood radius mean of all its data points, and the average K-neighborhood radius standard deviation of all its data points.
  • the K-neighborhood statistical characteristic values of each data point include the K-neighborhood radius mean and the K-neighborhood radius standard deviation of the data point; in the one-pass loop module, if the K-neighborhood statistical characteristic values of the data point and those of its neighboring cluster S satisfy the following formula 1) and formula 2), the difference between them is considered to be within the first threshold delta:
  • Formula 1): mean(S(M)) − delta·std(S(M)) ≤ p(M) ≤ mean(S(M)) + delta·std(S(M));
  • Formula 2): mean(S(V)) − delta·std(S(V)) ≤ p(V) ≤ mean(S(V)) + delta·std(S(V));
  • where p(M) is the K-neighborhood radius mean of the data point p; p(V) is the K-neighborhood radius standard deviation of the data point p; mean(S(M)) is the mean of the K-neighborhood radius means of all data points in the neighboring cluster S; mean(S(V)) is the mean of the K-neighborhood radius standard deviations of all data points in S; std(S(M)) is the standard deviation of the K-neighborhood radius means of all data points in S; std(S(V)) is the standard deviation of the K-neighborhood radius standard deviations of all data points in S; and delta is a preset coefficient whose range is usually 1 to 10.
  • if the K-neighborhood statistical characteristic value of the data point satisfies the following formula 3), it is considered to be greater than or equal to nsr times the K-neighborhood statistical characteristic value of any other data point in the K-neighborhood of the data point:
  • Formula 3): p(M) ≥ nsr·q(M), for q ∈ N_k(p);
  • where nsr is a preset coefficient whose default range is usually 3 to 5.
  • clusters in which the total number of data points is greater than or equal to the second threshold nspts are defined as significant clusters, and clusters in which the total number of data points is less than the second threshold nspts are defined as non-significant clusters;
  • otherwise, the data point is merged into the significant cluster to which its nearest data point belongs.
  • An embodiment of the present invention further provides a data clustering apparatus based on K-neighborhood similarity, comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor; the processor implements the data clustering method based on K-neighborhood similarity described in any of the above embodiments when executing the computer program.
  • An embodiment of the present invention further provides a computer readable storage medium comprising a stored computer program, wherein, when the computer program runs, it controls the device on which the computer readable storage medium is located to perform the data clustering method based on K-neighborhood similarity described in any of the above embodiments.
  • FIG. 1 is a schematic flowchart of a data clustering method based on K neighborhood similarity according to an embodiment of the present invention.
  • FIGS. 2(a) to 2(d) illustrate the process of data clustering using the data clustering method based on K-neighborhood similarity provided by the embodiment of the present invention.
  • FIGS. 3(a) to 3(h) show the results of clustering given data sets of different shapes using the data clustering method based on K-neighborhood similarity provided by the embodiment of the present invention.
  • FIG. 4 is a schematic structural diagram of a data clustering apparatus based on K neighborhood similarity according to an embodiment of the present invention.
  • FIG. 5 is a schematic structural diagram of a data clustering apparatus based on K neighborhood similarity according to another embodiment of the present invention.
  • the embodiment of the present invention provides a data clustering method based on K-neighborhood similarity and a corresponding apparatus, wherein the cluster definition of the embodiment is based on the statistical characteristic values of data points.
  • Data clustering (clustering) is often defined as an aggregation of similar data.
  • the concept of similarity may be related to the statistical properties of the cluster and the density of the data. Referring to FIGS. 2(a) to 2(d): as shown in FIG. 2(a), data points are distributed in three regions of the data space. As shown in FIG. 2(b), with an additional density axis the data set can be drawn as a data density surface with approximately three peaks at the centers of the three clusters.
  • the data set and the data density surface can be divided into coherent parts, i.e. data blocks, represented by the density contours shown in FIG. 2(c). If data points are collected along the direction of decreasing density, data clusters gradually appear. As shown in FIG. 2(c), when the size of a connected cluster is smaller than the expected value, it is reasonable to merge connected clusters, while significant clusters do not merge with each other. As shown in FIG. 2(d), the final clustering result is generated.
  • K-neighborhood: the K-neighborhood of a data point refers to the set of the K data points closest to that data point.
  • let the set N_k(p) denote the K-neighborhood of the data point p, containing K data points; then any data point q in the set satisfies dist(p, q) ≤ dist(p, o), where dist(p, q) denotes the distance between the data points p and q, and o is any data point not in N_k(p).
  • K-neighborhood maximum radius: the maximum radius of the K-neighborhood of a data point is the maximum distance from the data point to any point in its K-neighborhood.
  • K-neighborhood radius mean: the K-neighborhood radius mean of a data point is the mean of the distances between pairs of data points in its K-neighborhood:
  • M_k(p) = mean(dist(a, b)), where a, b ∈ N_k(p);
  • mean(dist(a, b)) represents the average distance between any two data points a and b in N_k(p).
  • K-neighborhood radius standard deviation: the K-neighborhood radius standard deviation of a data point refers to the standard deviation of the distances between pairs of data points in its K-neighborhood:
  • V_k(p) = std(dist(a, b)), where a, b ∈ N_k(p);
  • std(dist(a, b)) represents the standard deviation of the distances between any two data points a and b in N_k(p).
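The three K-neighborhood statistics defined above (maximum radius R, radius mean M, radius standard deviation V) can be computed together. This is a minimal NumPy sketch; the function name `k_neighborhood_stats` and the brute-force distance matrix are illustrative choices, not part of the patent:

```python
import numpy as np

def k_neighborhood_stats(points, k=7):
    """For each point p: R = max distance from p to its k nearest neighbors,
    M = mean pairwise distance inside N_k(p), V = std of those distances."""
    points = np.asarray(points, dtype=float)
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    nbr = np.argsort(d, axis=1)[:, 1:k + 1]   # column 0 is the point itself
    R = np.take_along_axis(d, nbr, axis=1).max(axis=1)
    n = len(points)
    M, V = np.empty(n), np.empty(n)
    for i in range(n):
        # distances between all distinct pairs a != b inside N_k(p)
        tri = d[np.ix_(nbr[i], nbr[i])][np.triu_indices(k, 1)]
        M[i], V[i] = tri.mean(), tri.std()
    return R, M, V, nbr
```

With K in the recommended range of 5 to 9, R orders points from dense regions to sparse ones, which is the sort key used by the method.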
  • statistical characteristic values: the statistical characteristic values of a data point refer to the K-neighborhood statistical characteristic values attached to the data point, including (but not limited to) the K-neighborhood maximum radius, the K-neighborhood radius mean, and the K-neighborhood radius standard deviation.
  • p(R, M, V) can be used to represent the K-neighborhood statistical characteristic values of the data point p, where p(R) represents the K-neighborhood maximum radius of p, p(M) represents the K-neighborhood radius mean of p, and p(V) represents the K-neighborhood radius standard deviation of p.
  • Definition 6 (statistical similarity): if, for each statistical characteristic value, the difference between the mean of that value over a data point set and the corresponding statistical characteristic value of a single data point is less than delta units (usually taking the standard deviation of that characteristic value over the data point set as the unit), the data point is said to have statistical similarity with the data point set.
  • S(R, M, V) is used to represent the statistical characteristic values of a data point set S. If p(R, M, V) of a data point p satisfies the following two conditions:
  • Condition 1): mean(S(M)) − delta·std(S(M)) ≤ p(M) ≤ mean(S(M)) + delta·std(S(M));
  • Condition 2): mean(S(V)) − delta·std(S(V)) ≤ p(V) ≤ mean(S(V)) + delta·std(S(V));
  • then the data point p is said to have statistical similarity with the data point set S.
  • delta is a threshold constant.
  • the K-neighborhood radius mean M of the data point p is less than or equal to nsr times the minimum M of cluster C1, where nsr is an adjustable parameter, or the number of data points in C1 is less than the significance parameter nspts;
  • the K-neighborhood radius mean M of the data point p is less than or equal to nsr times the minimum M of cluster C2, where nsr is an adjustable parameter, or the number of data points in C2 is less than the significance parameter nspts.
  • a data cluster is a significant cluster if and only if the number of data points it contains is greater than or equal to the significance parameter nspts.
  • Definition 9 (noise point): when the K-neighborhood radius mean M of a data point is greater than or equal to nsr times the K-neighborhood radius mean M of any data point in its K-neighborhood that belongs to a significant cluster, the data point is a noise point.
  • in that case, the data point p is a noise point.
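Definition 9 can be sketched as a small predicate. The translated text is ambiguous about whether "any data point belonging to a significant cluster" means some or every such neighbor; the sketch below assumes "every", and the function name is illustrative:

```python
def is_noise_point(p_M, significant_neighbor_Ms, nsr=4.0):
    """p is a noise point when its K-neighborhood radius mean p(M) is at
    least nsr times q(M) for each neighbor q in N_k(p) that belongs to a
    significant cluster (assumption: 'any' in the text means 'every')."""
    if not significant_neighbor_Ms:
        return False  # no significant neighbors: cannot decide noise here
    return all(p_M >= nsr * q_M for q_M in significant_neighbor_Ms)
```

A point far sparser than all of its significantly-clustered neighbors (by the factor nsr, typically 3 to 5) is discarded as noise; otherwise the second pass merges it into the nearest significant cluster.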
  • a data clustering method based on K neighborhood similarity includes steps S101 to S103:
  • the data points to be clustered are sorted in ascending order according to the maximum radius of the K neighborhood of the data points;
  • the maximum radius of the K-neighborhood of a data point refers to the maximum distance from the data point to any point in its K-neighborhood; the K-neighborhood of a data point refers to the set of the K data points nearest to it.
  • the value of K is usually in the range of 5 to 9.
  • for each data point in ascending order, the K-neighborhood statistical characteristic values of the data point are computed and compared with those of the generated neighboring clusters; if the difference is within a first threshold delta, the data point is merged into the neighboring cluster, and if several neighboring clusters qualify, they are merged one by one from near to far until the number of data points in the cluster exceeds a second threshold nspts; if the difference between the K-neighborhood statistical characteristic values of the data point and those of its neighboring cluster exceeds the first threshold delta, a new cluster containing the data point is generated.
  • the K-neighborhood statistical characteristic values of each data point include at least one of the K-neighborhood maximum radius, the K-neighborhood radius mean, and the K-neighborhood radius standard deviation of the data point; correspondingly, the K-neighborhood statistical characteristic values of a neighboring cluster include at least one of the average K-neighborhood maximum radius of all its data points, the average K-neighborhood radius mean of all its data points, and the average K-neighborhood radius standard deviation of all its data points.
  • the K-neighborhood statistical characteristic values of each data point include the K-neighborhood radius mean and the K-neighborhood radius standard deviation of the data point. If the K-neighborhood statistical characteristic values of the data point and those of its neighboring cluster S satisfy the following formula 1) and formula 2), the difference between them is considered to be within the first threshold delta:
  • Formula 1): mean(S(M)) − delta·std(S(M)) ≤ p(M) ≤ mean(S(M)) + delta·std(S(M));
  • Formula 2): mean(S(V)) − delta·std(S(V)) ≤ p(V) ≤ mean(S(V)) + delta·std(S(V));
  • where p(M) is the K-neighborhood radius mean of the data point p; p(V) is the K-neighborhood radius standard deviation of the data point p; mean(S(M)) is the mean of the K-neighborhood radius means of all data points in the neighboring cluster S; mean(S(V)) is the mean of the K-neighborhood radius standard deviations of all data points in S; std(S(M)) is the standard deviation of the K-neighborhood radius means of all data points in S; std(S(V)) is the standard deviation of the K-neighborhood radius standard deviations of all data points in S; and delta is a preset coefficient.
  • the preset range of the first threshold delta is usually 1 to 10.
  • if the K-neighborhood statistical characteristic value of the data point satisfies the following formula 3), the data point is marked as a noise point:
  • Formula 3): p(M) ≥ nsr·q(M), for q ∈ N_k(p);
  • where nsr is a preset coefficient whose range is usually 3 to 5.
  • clusters in which the total number of data points is greater than or equal to the second threshold nspts are defined as significant clusters, and clusters in which the total number of data points is less than the second threshold nspts are defined as non-significant clusters, wherein the second threshold nspts can be adjusted according to the required cluster size.
  • if the K-neighborhood statistical characteristic value of the data point is less than nsr times the K-neighborhood statistical characteristic value of any other data point in its K-neighborhood, the data point is merged into the significant cluster to which its nearest data point belongs.
  • the data clustering method based on K-neighborhood similarity sorts the data points to be clustered by the maximum radius of their K-neighborhoods, that is, by density, in ascending order; it then performs a first pass over the sorted data points, merging statistically similar data points into the same cluster; it then performs a second pass over the data points in low-density clusters, according to the scale required for the clusters, to find all noise points, and merges the non-noise points into the nearest high-density clusters, thereby completing the data clustering.
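The two-pass procedure summarized above can be sketched end to end. The outline follows the text (sort by K-neighborhood maximum radius, first-pass merging under formulas 1) and 2), second-pass noise marking and re-merging), but several details are assumptions: the function name, the floor placed on a cluster's spread so that one-member clusters can still absorb similar points, and the brute-force distance computations.

```python
import numpy as np

def knn_similarity_cluster(points, k=7, delta=3.0, nsr=4.0, nspts=10):
    """Sketch of the two-pass K-neighborhood-similarity clustering."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    nbr = np.argsort(d, axis=1)[:, 1:k + 1]             # K-neighborhood per point
    R = np.take_along_axis(d, nbr, axis=1).max(axis=1)  # K-neighborhood max radius
    M, V = np.empty(n), np.empty(n)
    for i in range(n):
        tri = d[np.ix_(nbr[i], nbr[i])][np.triu_indices(k, 1)]
        M[i], V[i] = tri.mean(), tri.std()              # radius mean / std

    label = np.full(n, -1)                              # -1 = unassigned / noise
    clusters = []                                       # cluster id -> members

    def similar(p, members):
        ms, vs = M[members], V[members]
        # Floor the spread so a one-member cluster can still grow:
        # an implementation choice, not specified in the patent text.
        sM = max(ms.std(), 0.25 * abs(ms.mean()))
        sV = max(vs.std(), 0.25 * abs(vs.mean()))
        return (abs(M[p] - ms.mean()) <= delta * sM and
                abs(V[p] - vs.mean()) <= delta * sV)    # formulas 1) and 2)

    # Pass 1: visit points by ascending K-neighborhood max radius (densest first).
    for p in np.argsort(R):
        near = {int(label[q]) for q in nbr[p] if label[q] >= 0}
        cand = sorted((c for c in near if similar(p, clusters[c])),
                      key=lambda c: min(d[p, q] for q in clusters[c]))
        if not cand:
            label[p] = len(clusters)                    # start a new cluster
            clusters.append([p])
            continue
        home = cand[0]
        clusters[home].append(p)
        label[p] = home
        for c in cand[1:]:                              # merge near-to-far
            if len(clusters[home]) > nspts:             # until big enough
                break
            for q in clusters[c]:
                label[q] = home
            clusters[home].extend(clusters[c])
            clusters[c] = []

    # Pass 2: points in non-significant clusters become noise or are re-merged.
    for members in [m for m in clusters if 0 < len(m) < nspts]:
        for p in members:
            sig = [q for q in nbr[p] if len(clusters[label[q]]) >= nspts]
            if sig and all(M[p] >= nsr * M[q] for q in sig):
                label[p] = -1                           # noise point (Definition 9)
            elif sig:
                label[p] = label[min(sig, key=lambda q: d[p, q])]
    return label
```

The parameter defaults here simply sit inside the ranges recommended by the text (K 5 to 9, delta 1 to 10, nsr 3 to 5); nspts is chosen per the required cluster scale.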
  • the data clustering method based on the K neighborhood similarity provided by the embodiment of the present invention has the following technical effects: First, it is not necessary to preset the number of clusters, and it is not necessary to know the probability distribution of the data; Second, the parameters are easy to set.
  • the value of K is generally 5 to 9, the value of delta is generally 1 to 10, and the value of nsr is generally 3 to 5; the value of the significance parameter nspts can be adjusted according to the required cluster size, and the setting of each parameter is independent of the density distribution and distance scale of the data.
  • thirdly, clusters are formed by gradual merging from high density to low density, and the hierarchical relationship between clusters is given as the clusters are generated.
  • 3(a) to 3(h) show the results of data clustering of a given data set of different shapes by using the K neighborhood similarity data clustering method provided by the embodiment of the present invention.
  • the clustering method provided by this embodiment can overcome the link effect and correctly identify the two clusters linked by the point chain.
  • 3(b) to 3(h) show that clusters of a plurality of different shapes are correctly identified using the clustering method of the present embodiment.
  • In FIG. 3(b), two overlapping clusters with varying density are correctly found using the clustering method of the present embodiment: a dense cluster and a surrounding sparse cluster.
  • the noise points are correctly identified.
  • the number of clusters discovered by the clustering method of the present embodiment is exactly equal to the desired number of clusters. It can be seen that the data clustering method based on K neighborhood similarity provided by the embodiment has the ability to process data sets of arbitrary shapes, density changes, and noise points.
  • FIG. 4 is a schematic structural diagram of a data clustering apparatus based on K-neighborhood similarity according to an embodiment of the present invention, including:
  • the data point sorting module 401 is configured to sort the data points to be clustered in ascending order according to the maximum radius of the K-neighborhoods of the data points; wherein the maximum radius of the K-neighborhood of a data point refers to the maximum distance from the data point to any point in its K-neighborhood; the K-neighborhood of a data point refers to the set of the K data points closest to it; wherein K ranges from 5 to 9.
  • the looping module 402 is configured to perform a first pass over the data points after ascending sorting, calculating the K-neighborhood statistical characteristic values of each data point and comparing them with the K-neighborhood statistical characteristic values of the generated neighboring clusters of the data point; if the difference between the K-neighborhood statistical characteristic values of the data point and those of a neighboring cluster is within the first threshold delta, the data point is merged into that neighboring cluster, and if several neighboring clusters qualify, they are merged one by one from near to far until the number of data points in the cluster exceeds the second threshold nspts; if the difference between the K-neighborhood statistical characteristic values of the data point and those of its neighboring cluster exceeds the first threshold delta, a new cluster containing the data point is generated;
  • the two-pass loop module 403 is configured to perform, after the first pass is completed, a second pass over the data points in the clusters whose total number of data points is less than the second threshold nspts; if the K-neighborhood statistical characteristic value of a data point is greater than or equal to nsr times the K-neighborhood statistical characteristic value of any other data point in its K-neighborhood, the data point is marked as noise data, otherwise the data point is merged into the cluster to which its nearest data point belongs.
  • the K-neighborhood statistical property values of any data point include at least one of: the K-neighborhood maximum radius, the K-neighborhood radius mean, and the K-neighborhood radius standard deviation of that data point;
  • correspondingly, the K-neighborhood statistical property values of a neighboring cluster include at least one of: the average of the K-neighborhood maximum radii of all data points in the cluster, the average of the K-neighborhood radius means of all data points, and the average of the K-neighborhood radius standard deviations of all data points.
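As a concrete illustration of these per-point statistics, the following stand-alone sketch computes the three values for one point. The function name `knn_stats` and the use of plain Euclidean distance are illustrative assumptions, not part of the disclosure, and the sketch assumes k ≥ 2 so the neighborhood contains at least one pair of points:

```python
import math
from itertools import combinations

def knn_stats(points, i, k):
    """K-neighborhood statistics of point i per the definitions above:
    R - maximum radius: distance to the farthest of the k nearest neighbours,
    M - mean of the pairwise distances within the K-neighborhood,
    V - standard deviation of those pairwise distances (population form)."""
    p = points[i]
    # the K-neighborhood: the k points closest to p (p itself excluded)
    nbrs = sorted((points[j] for j in range(len(points)) if j != i),
                  key=lambda q: math.dist(p, q))[:k]
    R = math.dist(p, nbrs[-1])
    pair = [math.dist(a, b) for a, b in combinations(nbrs, 2)]
    M = sum(pair) / len(pair)
    V = math.sqrt(sum((d - M) ** 2 for d in pair) / len(pair))
    return R, M, V
```

For example, with `points = [(0, 0), (1, 0), (0, 1), (10, 10)]` and `k = 2`, the K-neighborhood of the first point is its two unit-distance neighbours, so R is 1 and M is the single pairwise distance between them.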
  • the K-neighborhood statistical property values of each data point include the K-neighborhood radius mean and the K-neighborhood radius standard deviation of the data point; if the K-neighborhood statistical property values of the data point and those of its neighboring cluster S satisfy formulas 1) and 2) below, the difference between the two is considered to be within the first threshold delta:
  • Formula 1) mean(S(M)) - delta × std(S(M)) ≤ p(M) ≤ mean(S(M)) + delta × std(S(M));
  • Formula 2) mean(S(V)) - delta × std(S(V)) ≤ p(V) ≤ mean(S(V)) + delta × std(S(V));
  • where p(M) is the K-neighborhood radius mean of data point p, p(V) is the K-neighborhood radius standard deviation of data point p, mean(S(M)) is the average of the K-neighborhood radius means of all data points in the neighboring cluster S, mean(S(V)) is the average of the K-neighborhood radius standard deviations of all data points in S, std(S(M)) is the standard deviation of the K-neighborhood radius means of all data points in S, and std(S(V)) is the standard deviation of the K-neighborhood radius standard deviations of all data points in S;
  • delta is a preset coefficient; the preset range of delta is usually 1 to 10.
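Formulas 1) and 2) amount to a two-sided standard-score test on the point's statistics against the cluster's. A small stand-alone sketch (the names and the delta default are illustrative assumptions, not from the disclosure):

```python
import math

def within_delta(x, values, delta):
    """True when x lies within delta standard deviations of mean(values)."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return mean - delta * std <= x <= mean + delta * std

def statistically_similar(p_M, p_V, cluster_Ms, cluster_Vs, delta=2.0):
    """Formulas 1) and 2): both the K-neighborhood radius mean p(M) and the
    radius standard deviation p(V) of the point must fall within delta
    standard deviations of the cluster's corresponding averages."""
    return (within_delta(p_M, cluster_Ms, delta)        # formula 1)
            and within_delta(p_V, cluster_Vs, delta))   # formula 2)
```

Note that for a cluster whose members have identical statistics the standard deviation collapses to zero and the test degenerates to exact equality; a practical implementation would typically special-case very small clusters, as the disclosure's cluster-merging definition also suggests.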
  • if the K-neighborhood statistical property value of a data point satisfies formula 3) below, it is considered greater than or equal to nsr times the K-neighborhood statistical property value of that other data point in its K-neighborhood:
  • Formula 3) p(M) ≥ q(M) × nsr, q ∈ N_k(p);
  • where p(M) is the K-neighborhood radius mean of data point p, q(M) is the K-neighborhood radius mean of data point q, N_k(p) denotes the K-neighborhood of data point p, and the preset range of nsr is usually 3 to 5.
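The noise test of formula 3) compares a point's K-neighborhood radius mean against those of its neighbours; a one-function sketch (again with illustrative names and an assumed nsr default):

```python
def is_noise(p_M, neighbour_Ms, nsr=3.0):
    """Formula 3): the point is marked as noise when its K-neighborhood
    radius mean p(M) is at least nsr times q(M) for some point q in its
    K-neighborhood."""
    return any(p_M >= q_M * nsr for q_M in neighbour_Ms)
```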
  • clusters whose total number of data points is greater than or equal to the second threshold nspts are defined as significant clusters, and clusters whose total number of data points is less than the second threshold nspts are defined as non-significant clusters;
  • if the K-neighborhood statistical property value of a data point is less than nsr times the K-neighborhood statistical property value of every other data point in its K-neighborhood, the data point is merged into the significant cluster to which its nearest data point belongs.
  • the data clustering apparatus based on K-neighborhood similarity sorts the data points to be clustered by their K-neighborhood maximum radius, that is, by density; it then performs a first pass over the data points in ascending order, merging statistically similar data points into the same cluster; it then performs a second pass, according to the required cluster scale, over the data points in low-density clusters, identifying all noise points and merging the non-noise points into the nearest high-density cluster, thereby achieving data clustering.
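The whole two-pass procedure summarized above can be compressed into a short sketch. This is a simplified reading, not the patented implementation: it tracks only the R and M statistics, joins at most one neighbouring cluster per point, and (following the disclosure's merging rule for still-small clusters) falls back to an `M[i] <= nsr * min(ms)` test when a cluster is too small for a meaningful standard deviation. All names and defaults here are assumptions:

```python
import math
from itertools import combinations

def _dist(a, b):
    return math.dist(a, b)

def _knn(points, i, k):
    """Indices of the k nearest neighbours of point i (excluding i)."""
    return sorted((j for j in range(len(points)) if j != i),
                  key=lambda j: _dist(points[i], points[j]))[:k]

def _stats(points, i, k):
    """(R, M): K-neighborhood maximum radius and pairwise-distance mean
    (the V statistic of the disclosure is omitted here for brevity)."""
    nbrs = _knn(points, i, k)
    R = _dist(points[i], points[nbrs[-1]])
    pair = [_dist(points[a], points[b]) for a, b in combinations(nbrs, 2)]
    return R, sum(pair) / len(pair)

def cluster(points, k=5, delta=2.0, nsr=3.0, nspts=4):
    """Simplified two-pass sketch of the described procedure. Returns one
    label per point; -1 marks noise."""
    n = len(points)
    R, M = zip(*(_stats(points, i, k) for i in range(n)))
    labels = [-2] * n            # -2: not yet assigned
    members = {}                 # cluster label -> member indices
    nxt = 0
    # First pass: walk points by ascending R (densest first); join the
    # cluster of the nearest labelled neighbour that is statistically
    # similar (formula 1), or any still-small cluster.
    for i in sorted(range(n), key=lambda i: R[i]):
        for j in _knn(points, i, k):
            c = labels[j]
            if c < 0:
                continue
            ms = [M[m] for m in members[c]]
            mean = sum(ms) / len(ms)
            std = math.sqrt(sum((v - mean) ** 2 for v in ms) / len(ms))
            if (mean - delta * std <= M[i] <= mean + delta * std
                    or (len(ms) < nspts and M[i] <= nsr * min(ms))):
                labels[i] = c
                members[c].append(i)
                break
        else:                    # no similar neighbouring cluster: new one
            labels[i] = nxt
            members[nxt] = [i]
            nxt += 1
    # Second pass: dissolve non-significant clusters (size < nspts).
    for c, pts in list(members.items()):
        if len(pts) >= nspts:
            continue
        for i in pts:
            nbrs = _knn(points, i, k)
            if any(M[i] >= M[q] * nsr for q in nbrs):   # formula 3: noise
                labels[i] = -1
            else:                # nearest point in a significant cluster
                sig = [q for q in nbrs
                       if labels[q] >= 0 and len(members[labels[q]]) >= nspts]
                labels[i] = labels[sig[0]] if sig else -1
    return labels
```

On two well-separated, evenly spaced groups of points this returns one label per group and marks nothing as noise, matching the behaviour described above.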
  • the data clustering method based on K-neighborhood similarity provided by the embodiments of the present invention has the following technical effects: first, there is no need to preset the number of clusters, nor to know the probability distribution of the data; second, the parameters are easy to set: the value of K is generally 5-9, the value of delta is generally 1-10, the value of nsr is generally 3-5, and the value of the significance parameter nspts can be adjusted according to the required cluster scale, with the setting of each parameter independent of the density distribution and distance scale of the data; third, clusters are formed by gradually merging from high density to low density, so the hierarchical relationship between clusters is obtained while the clusters are generated.
  • the embodiments of the present invention further provide a computer-readable storage medium comprising a stored computer program, wherein, when the computer program runs, the device on which the computer-readable storage medium resides is controlled to perform the data clustering method based on K-neighborhood similarity described in the above embodiments.
  • FIG. 5 is a schematic diagram of a data clustering apparatus based on K-neighborhood similarity according to an embodiment of the present invention.
  • the data clustering apparatus based on K-neighborhood similarity of this embodiment includes a processor 501, a memory 502, and a computer program stored in the memory and operable on the processor, such as the data clustering program based on K-neighborhood similarity described above.
  • when the processor executes the computer program, the steps in the foregoing embodiments of the data clustering method based on K-neighborhood similarity are implemented, such as the K-neighborhood-similarity-based data clustering steps shown in FIG. 1.
  • alternatively, when the processor executes the computer program, the functions of the modules/units in the various apparatus embodiments described above are implemented.
  • exemplarily, the computer program can be partitioned into one or more modules/units that are stored in the memory and executed by the processor to carry out the present invention.
  • the one or more modules/units may be a series of computer program instruction segments capable of performing particular functions, the instruction segments being used to describe the execution process of the computer program in the data clustering apparatus based on K-neighborhood similarity.
  • the data clustering apparatus based on K-neighborhood similarity may be a computing device such as a desktop computer, a notebook, a palmtop computer, or a cloud server.
  • the data clustering apparatus based on K-neighborhood similarity may include, but is not limited to, a processor and a memory.
  • those skilled in the art can understand that the schematic diagram is merely an example of a data clustering apparatus based on K-neighborhood similarity and does not constitute a limitation of such an apparatus, which may include more or fewer components than shown, a combination of certain components, or different components; for example, the data clustering apparatus based on K-neighborhood similarity may also include input and output devices, network access devices, buses, and the like.
  • the so-called processor can be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, etc.
  • the general-purpose processor may be a microprocessor, or the processor may be any conventional processor; the processor is the control center of the data clustering apparatus based on K-neighborhood similarity, connecting all parts of the entire apparatus using various interfaces and lines.
  • the memory can be used to store the computer program and/or modules; the processor implements the various functions of the data clustering apparatus based on K-neighborhood similarity by running or executing the computer program and/or modules stored in the memory and by invoking the data stored in the memory.
  • the memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application required for at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data created according to the use of the mobile phone (such as audio data, a phone book, etc.).
  • in addition, the memory may include high-speed random access memory, and may also include non-volatile memory such as a hard disk, a memory, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device.
  • if the modules/units integrated in the data clustering apparatus based on K-neighborhood similarity are implemented in the form of software functional units and sold or used as stand-alone products, they can be stored in a computer-readable storage medium.
  • based on this understanding, the present invention may implement all or part of the processes in the foregoing method embodiments by means of a computer program instructing related hardware; the computer program may be stored in a computer-readable storage medium, and when executed by a processor, the program can implement the steps of each of the method embodiments described above.
  • the computer program comprises computer program code, which may be in source code form, object code form, an executable file, or some intermediate form.
  • the computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), electrical carrier signals, telecommunication signals, software distribution media, and the like. It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in a jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, computer-readable media do not include electrical carrier signals and telecommunication signals.
  • the device embodiments described above are merely illustrative; the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they can be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • in addition, in the drawings of the apparatus embodiments provided by the present invention, the connection relationship between modules indicates that there is a communication connection between them, which can specifically be implemented as one or more communication buses or signal lines. Those of ordinary skill in the art can understand and implement this without creative effort.

Abstract

The invention discloses a data clustering method based on K-neighborhood similarity. The data points to be clustered are sorted by their K-neighborhood maximum radius, i.e., by density; a first pass is performed over the data points in ascending order, merging statistically similar data points into the same cluster; according to the required cluster scale, a second pass is then performed over the data points in low-density clusters, identifying all noise points and merging the non-noise points into the nearest high-density cluster, thereby achieving data clustering. Data clustering with the method provided by the embodiments of the invention has the following technical effects: there is no need to preset the number of clusters or to know the probability distribution of the data; the parameters are easy to set, and the setting of each parameter is independent of the density distribution and distance scale of the data; clusters are formed by gradually merging from high density to low density, so the hierarchical relationship between clusters is obtained while the clusters are generated.

Description

基于K邻域相似性的数据聚类方法、装置和存储介质
技术领域
本发明涉及数据处理技术领域,尤其涉及一种基于K邻域相似性的数据聚类方法、装置和计算机可读存储介质。
背景技术
随着数据处理技术的发展,越来越多的数据(比如互联网、物联网产生的大数据)需要进行快速、准确的分类,数据聚类就是这样一种自动进行分类的方法及装置。
目前大多数数据聚类朝着执行效率和有效性两个方向改进。对于执行效率可以通过有效的采样技术使得数据聚类能够很好的处理大规模的数据,而另一方面,聚类的有效性是个更为严重的问题。
本发明人在实施本发明的过程中发现,现有技术中存在以下技术问题:
第一,许多数据聚类方法需要先验知识或者概率分布的假设,而这些在实际应用中往往是无法提前得知的。比如,K均值聚类方法和EM最大期望值方法都需要预先知道数据中聚类的个数。
第二,参数调节困难降低了数据聚类的可用性,比如基于密度的DBSCAN方法,需要预先设半径epsilon和最少数据个数minpts两个参数来定义密度阈值,而这些参数严重依赖于具体的数据,因为数据的密度往往差异很大,而且数据之间距离也会有不同的尺度。
第三,大部分传统的数据聚类只能得到每个数据点所属的聚类,而不能给出聚类之间的层次关系,而现有的层次聚类方法,比如SLINK,有两个主要的不足之处,时间和空间代价相对较高,且具有明显的“链效应”。
发明内容
本发明实施例提供一种基于K邻域相似性的数据聚类方法、装置和计算机可读存储介质,能有效解决现有技术存在的需要较多假定的前提知识、聚类参数难以设置以及不能得到聚类之间的层次关系的三个主要问题。
本发明一实施例提供一种基于K邻域相似性的数据聚类方法,包括步骤:
S1、对待聚类数据点按照数据点的K邻域最大半径进行升序排序;其中,数据点的K邻域最大半径是指所述数据点的K邻域中离该数据点最大的距离;
S2、对升序排序后的数据点进行第一遍循环,计算所述数据点的K邻域统计特性值,并将计算得到的数据点的K邻域统计特性值和已生成的该数据点所属的邻近聚类的K邻域统计特性值进行比较;若所述数据点的K邻域统计特性值与所述数据点的邻近聚类的K邻域统计特性值的差值在第一阈值delta范围内,则将所述数据点并入所述数据点的邻近聚类中,如果可并入多个邻近聚类,则按照距离从近到远逐个并入直到聚类中数据点个数超过第二阈值nspts;若所述数据点的K邻域统计特性值与所述数据点的邻近聚类的K邻域统计特性值的差值超过第一阈值delta,则生成一个包含所述数据点的新聚类;
S3、所述第一遍循环完成后,对所有聚类中数据点总数小于第二阈值nspts的聚类中的数据点进行第二遍循环,若存在数据点的K邻域统计特性值大于或等于该数据点的K邻域内任一其他数据点的K邻域统计特性值的nsr倍,则将该数据点标记为噪声数据,否则将该数据点并入其距离最近的数据点所属的聚类,nsr>1,nspts>2。
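Step S1 above (sorting the points to be clustered by their K-neighborhood maximum radius, so the densest points come first) can be sketched as follows; the helper names are illustrative assumptions, not part of the disclosure:

```python
import math

def k_max_radius(points, i, k):
    """K-neighborhood maximum radius of point i: the distance to the k-th
    nearest of the other points (R in this disclosure's definitions)."""
    ds = sorted(math.dist(points[i], points[j])
                for j in range(len(points)) if j != i)
    return ds[k - 1]

def density_order(points, k):
    """Step S1: point indices sorted by ascending K-neighborhood maximum
    radius, so the densest points are visited first in the first pass."""
    return sorted(range(len(points)),
                  key=lambda i: k_max_radius(points, i, k))
```

With a tight group of points and two outliers, the tight group is ordered first and the farthest outlier last, which is exactly the visiting order the first pass relies on.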
与现有技术相比,本发明实施例提供的基于K邻域相似性的数据聚类方法基于待聚类数据点的K邻域最大半径即按密度对待聚类数据点进行排序,然后对升序排序后的数据点进行第一遍循环,以符合统计相似性的数据点并入到同个聚类中;然后再根据聚类所需规模对聚类密度较小的数据点进行第二遍循环,从而找出所有噪声点以及将非噪声点合并到最近的大密度聚类中,从而实现数据聚类。可见,利用本发明实施例提供的基于K邻域相似性的数据聚类方法进行数据聚类带有如下技术效果:第一,无需预先设定聚类的个数,无需知道数据的概率分布;第二,参数容易设置,且各个参数的设置都与数据的密度分布和距离尺度无关;第三,聚类的形成是由高密度到低密度逐渐合并而成的,在产生聚类的同时给出了聚类之间的层次关系。
作为上述方案的改进,所述数据点的K邻域是指距离所述数据点距离最近的K个数据点的集合;其中,K取值范围为5~9。
作为上述方案的改进,每一所述数据点的K邻域统计特性值包括该数据点的K邻域最大半径、K邻域半径均值以及K邻域半径标准差中的至少一种;相应的,所述邻近聚类的K邻域统计特性值包括所述邻近聚类的所有数据点的K邻域最大半径的平均值、所有数据点的K邻域半径均值的平均值以及所有数据点的K邻域半径标准差的平均值中的至少一种。
作为上述方案的改进,每一所述数据点的K邻域统计特性值包括所述数据点的K邻域半径均值以及K邻域半径标准差;在所述步骤S2中,若所述数据点的K邻域统计特性值与所述数据点的邻近聚类的K邻域统计特性值满足下列公式1)和公式2),则认为所述数据点的K邻域统计特性值与所述数据点的邻近聚类S的K邻域统计特性值的差值在第一阈值delta范围内:
公式1)mean(S(M))-delta×std(S(M))≤p(M)≤mean(S(M))+delta×std(S(M));
公式2)mean(S(V))-delta×std(S(V))≤p(V)≤mean(S(V))+delta×std(S(V));
其中,p(M)为数据点p的K邻域半径均值,p(V)为数据点p的K邻域半径标准差,mean(S(M))为邻近聚类S的所有数据点的K邻域半径均值的平均值,mean(S(V))为邻近聚类S的所有数据点的K邻域半径标准差的平均值,std(S(M))为邻近聚类S的所有数据点的K邻域半径均值的标准差,std(S(V))为邻近聚类S的所有数据点的K邻域半径标准差的标准差,delta为预设系数。
作为上述方案的改进,第一阈值delta的预设范围通常为1~10。
作为上述方案的改进,在所述步骤S3中,若存在数据点的K邻域统计特性值满足下列公式3),则认为存在数据点的K邻域统计特性值大于或等于该数据点的K邻域内任一其他数据点的K邻域统计特性值的nsr倍:
公式3)p(M)≥q(M)×nsr,q∈N_k(p);
其中,p(M)为数据点p的K邻域半径均值,q(M)为数据点q的K邻域半径均值,N_k(p)表示数据点p的K邻域,nsr的预设范围通常为3~5。
作为上述方案的改进,在所述步骤S3中,将所有聚类中数据点总数大于或等于第二阈值nspts的聚类定义为显著聚类,并将所有聚类中数据点总数小于第二阈值nspts的聚类定义为非显著聚类;
若存在数据点的K邻域统计特性值小于该数据点的K邻域内任一其他数据点的K邻域统计特性值的nsr倍,则将该数据点并入其距离最近的数据点所属的显著聚类中。
作为上述方案的改进,所述第二阈值nspts可根据所需聚类规模调整。
本发明实施例还提供了一种基于K邻域相似性的数据聚类装置,包括:
数据点排序模块,用于对待聚类数据点按照数据点的K邻域最大半径进行升序排序;其中,数据点的K邻域最大半径是指所述数据点的K邻域中离该数据点最大的距离;
一遍循环模块,用于对升序排序后的所述数据点进行第一遍循环,计算所述数据点的K邻域统计特性值,并将计算得到的所述数据点的K邻域统计特性值和已生成的该数据点所对应的邻近聚类的K邻域统计特性值进行比较;若所述数据点的K邻域统计特性值与该数据点的邻近聚类的K邻域统计特性值的差值在第一阈值delta范围内,则将所述数据点并入所述数据点的邻近聚类中,如果可并入多个邻近聚类,则按照距离从近到远逐个并入直到聚类中数据点个数超过第二阈值nspts;若所述数据点的K邻域统计特性值与所述数据点的邻近聚类的K邻域统计特性值的差值超过第一阈值delta,则生成一个包含所述数据点的新聚类;
二遍循环模块,用于在所述第一遍循环完成后,对所有聚类中数据点总数小于第二阈值nspts的聚类中的数据点进行第二遍循环,若存在数据点的K邻域统计特性值大于或等于该数据点的K邻域内任一其他数据点的K邻域统计特性值的nsr倍,则将该数据点标记为噪声数据,否则将该数据点并入其距离最近的数据点所属的聚类。
与现有技术相比,本发明实施例提供的基于K邻域相似性的数据聚类装置基于待聚类数据点的K邻域最大半径即按密度对待聚类数据点进行排序,然后对升序排序后的数据点进行第一遍循环,以符合统计相似性的数据点并入到同个聚类中;然后再根据聚类所需规模对聚类密度较小的数据点进行第二遍循环,从而找出所有噪声点以及将非噪声点合并到最近的大密度聚类中,进而实现数据聚类。可见,利用本发明实施例提供的基于K邻域相似性的数据聚类方法进行数据聚类带有如下技术效果:第一,无需预先设定聚类的个数,无需知道数据的概率分布;第二,参数容易设置,且各个参数的设置都与数据的密度分布和距离尺度无关;第三,聚类的形成是由高密度到低密度逐渐合并而成的,在产生聚类的同时给出了聚类之间的层次关系。
作为上述方案的改进,所述数据点的K邻域是指距离所述数据点距离最近的K个数据点的集合;其中,K取值范围通常为5~9。
作为上述方案的改进,每一所述数据点的K邻域统计特性值包括该数据点的K邻域最大半径、K邻域半径均值以及K邻域半径标准差中的至少一种;相应的,所述邻近聚类的K邻域统计特性值包括所述邻近聚类的所有数据点的K邻域最大半径的平均值、所有数据点的K邻域半径均值的平均值以及所有数据点的K 邻域半径标准差的平均值中的至少一种。
作为上述方案的改进,每一所述数据点的K邻域统计特性值包括所述数据点的K邻域半径均值以及K邻域半径标准差;在所述一遍循环模块中,若所述数据点的K邻域统计特性值与所述数据点的邻近聚类的K邻域统计特性值满足下列公式1)和公式2),则认为所述数据点的K邻域统计特性值与所述数据点的邻近聚类S的K邻域统计特性值的差值在第一阈值delta范围内:
公式1)mean(S(M))-delta×std(S(M))≤p(M)≤mean(S(M))+delta×std(S(M));
公式2)mean(S(V))-delta×std(S(V))≤p(V)≤mean(S(V))+delta×std(S(V));
其中,p(M)为数据点p的K邻域半径均值,p(V)为数据点p的K邻域半径标准差,mean(S(M))为邻近聚类S的所有数据点的K邻域半径均值的平均值,mean(S(V))为邻近聚类S的所有数据点的K邻域半径标准差的平均值,std(S(M))为邻近聚类S的所有数据点的K邻域半径均值的标准差,std(S(V))为邻近聚类S的所有数据点的K邻域半径标准差的标准差,delta为预设系数;delta的预设范围通常为1~10。
作为上述方案的改进,在所述二遍循环模块中,若存在数据点的K邻域统计特性值满足下列公式3),则认为存在数据点的K邻域统计特性值大于或等于该数据点的K邻域内任一其他数据点的K邻域统计特性值的nsr倍:
公式3)p(M)≥q(M)×nsr,q∈N_k(p);
其中,p(M)为数据点p的K邻域半径均值,q(M)为数据点q的K邻域半径均值,N_k(p)表示数据点p的K邻域,nsr的预设范围通常为3~5。
作为上述方案的改进,在所述二遍循环模块中,将所有聚类中数据点总数大于或等于第二阈值nspts的聚类定义为显著聚类,并将所有聚类中数据点总数小于第二阈值nspts的聚类定义为非显著聚类;
若存在数据点的K邻域统计特性值小于该数据点的K邻域内任一其他数据点的K邻域统计特性值的nsr倍,则将该数据点并入其距离最近的数据点所属的显著聚类中。
本发明实施例还提供了一种基于K邻域相似性的数据聚类装置,包括处理器、存储器以及存储在所述存储器中且被配置为由所述处理器执行的计算机程序,所述处理器执行所述计算机程序时实现如上任意实施例所述的基于K邻域相似性的数据聚类方法。
本发明实施例还提供了一种计算机可读存储介质,所述计算机可读存储介质包括存储的计算机程序,其中,在所述计算机程序运行时控制所述计算机可读存储介质所在设备执行如上任意实施例所述的基于K邻域相似性的数据聚类方法。
附图说明
图1是本发明一实施例提供的一种基于K邻域相似性的数据聚类方法的流程示意图。
图2(a)~2(d)显示了利用本发明实施例提供的基于K邻域相似性的数据聚类方法进行数据聚类的效果流程图。
图3(a)~3(h)显示了利用本发明实施例提供的基于K邻域相似性的数据聚类方法给定的不同形状的数据集进行数据聚类的结果。
图4是本发明一实施例提供的一种基于K邻域相似性的数据聚类装置的结构示意图。
图5是本发明另一实施例提供的一种基于K邻域相似性的数据聚类装置的结构示意图。
具体实施方式
下面将结合本发明实施例中的附图,对本发明实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本发明一部分实施例,而不是全部的实施例。基于本发明中的实施例,本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都属于本发明保护的范围。
本发明实施例提供一种基于K邻域相似性的数据聚类方法及对应的装置,其中,本发明实施例的聚类定义是基于统计数据点的。数据集群(聚类)通常被定义为类似数据的聚合。在本发明实施例中,相似度的概念可以与聚类的统计特性和数据的密度相关。参考图2(a)~图2(d),如图2(a)所示,数据点分布在数据空间的三个区域中。如图2(b)所示,可以将数据库绘制为数据密度表面,在三个簇的中心上具有大约三个峰,具有附加的密度轴。数据库以及数据密度面可以分为相干部分,即数据块,由图2(c)中所示的密度轮廓表示。如果收集沿密度向下方向的数据点,数据聚类将逐渐出现。如图2(c)所示,当所述连接聚类的尺寸小于期望值时,将所述连接聚类合并是合理的,且显著的聚类不会相互合并。如图2(d)所示,生成最终聚类结果。
本发明将通过多个实施例来对本发明数据聚类方法及装置进行说明。在展开说明前,首先对本发明实施例中出现的各个术语进行相关定义:
定义1(K邻域):一个数据点的K邻域是指离该数据点最近的K个数据点的集合。
例如:用集合N_k(p)表示数据点p的K邻域,集合N_k(p)包括K个数据点,则该集合中任意数据点q满足:dist(p,q)≤dist(p,o),其中,dist(p,q)表示数据点p和数据点q之间的距离,且数据点o不在集合N_k(p)内。
定义2(K邻域最大半径):一个数据点的K邻域最大半径是指该数据点的K邻域中离该点最大的距离。
例如:用R_k(p)表示数据点p的K邻域最大半径,则R_k(p)满足:R_k(p)=max(dist(p,q)),q∈N_k(p)。
定义3(K邻域半径均值):一个数据点的K邻域半径均值是指该数据点的 K邻域中两两数据点距离的均值。
例如:用M_k(p)表示数据点p的K邻域半径均值,则M_k(p)=mean(dist(a,b)),其中a,b∈N_k(p)。mean(dist(a,b))表示N_k(p)内任意两个数据点a和数据点b之间的距离的平均值。
定义4(K邻域半径标准差):一个数据点的K邻域半径标准差是指该数据点的K邻域中两两数据点距离的标准差。
例如:用V_k(p)表示数据点p的K邻域半径标准差,则V_k(p)=std(dist(a,b)),其中a,b∈N_k(p)。std(dist(a,b))表示N_k(p)内任意两个数据点a和数据点b之间的距离的标准差。
定义5(统计数据点):一个数据点的统计数据点是指给数据点附加该数据点的K邻域统计特性值,包括(但不限于)K邻域最大半径、K邻域半径均值、K邻域半径标准差,形成的数据。
例如,可用p(R,M,V)表示数据点p的K邻域统计特性值。其中,p(R)表示数据点p的K邻域最大半径值,p(M)表示数据点p的K邻域半径均值,p(V)表示数据点p的K邻域半径标准差值。
定义6(统计相似性):如果一个数据点集合的每个统计特征值的均值和一个数据点的对应的统计特征值的差小于delta个单位(通常以该数据点集统计特征值的标准差为单位),则称该数据点集合具有统计相似性。
例如:用S(R,M,V)表示数据点集合的统计特征值,若存在数据点p的p(R,M,V)满足以下两个条件:
条件1)mean(S(M))-delta×std(S(M))≤p(M)≤mean(S(M))+delta×std(S(M));
条件2)mean(S(V))-delta×std(S(V))≤p(V)≤mean(S(V))+delta×std(S(V));
则称该数据点集合S具有统计相似性。其中,delta为阈值常数。
定义7(聚类合并):两个数据点集合C1,C2为统计相似性集合,如果存在一个数据点p满足以下条件:
1)数据点p和数据点集合C1具有统计相似性;
2)数据点p和数据点集合C2具有统计相似性;
3)数据点p的K邻域半径均值M小于或等于C1中最小M的nsr倍,nsr为可调参数,或者C1中数据点个数小于显著性参数nspts;
4)数据点p的K邻域半径均值M小于或等于C2中最小M的nsr倍,nsr为可调参数,或者C2中数据点个数小于显著性参数nspts。
则可以将两个数据点集合C1,C2合并为一个联合统计相似性集合。
定义8(显著聚类):一个数据聚类是显著聚类,当且仅当该数据聚类包含的数据点个数大于或等于一个显著性参数nspts。
定义9(噪声点):当一个数据点的K邻域半径均值M大于或等于它K邻域中任意属于某个显著聚类的数据点的K邻域半径均值M值的nsr倍,则该数据点为噪声点。
例如:当存在数据点p,p(M)≥q(M)×nsr,q∈N_k(p),而且数据点q被包括在某个显著聚类中,则数据点p为噪声点。
参见图1,是本发明一实施例提供的一种基于K邻域相似性的数据聚类方法,包括步骤S101~S103:
S101、对待聚类数据点按照数据点的K邻域最大半径进行升序排序;
其中,每一数据点的K邻域最大半径是指所述数据点的K邻域中离该数据点最大的距离;所述数据点的K邻域是指距离所述数据点距离最近的K个数据点的集合。在本实施例中,K取值范围通常为5~9。
S102、对升序排序后的数据点进行第一遍循环,计算所述数据点的K邻域统计特性值,并将计算得到的数据点的K邻域统计特性值和已生成的该数据点所属的邻近聚类的K邻域统计特性值进行比较;若所述数据点的K邻域统计特性值与所述数据点的邻近聚类的K邻域统计特性值的差值在第一阈值delta范围内,则将所述数据点并入所述数据点的邻近聚类中,如果可并入多个邻近聚类,则按照距离从近到远逐个并入直到聚类中数据点个数超过第二阈值nspts;若所述数据点的K邻域统计特性值与所述数据点的邻近聚类的K邻域统计特性值的差值超过第一阈值delta,则生成一个包含所述数据点的新聚类。
S103、所述第一遍循环完成后,对所有聚类中数据点总数小于第二阈值nspts的聚类中的数据点进行第二遍循环,若存在数据点的K邻域统计特性值大于或等于该数据点的K邻域内任一其他数据点的K邻域统计特性值的nsr倍,则将该数据点标记为噪声数据,否则将该数据点并入其距离最近的数据点所属的聚类。
其中,在本实施例中,nsr>1,nspts>2。每一所述数据点的K邻域统计特性值包括该数据点的K邻域最大半径、K邻域半径均值以及K邻域半径标准差中的至少一种;相应的,所述邻近聚类的K邻域统计特性值包括所述邻近聚类的所有数据点的K邻域最大半径的平均值、所有数据点的K邻域半径均值的平均值以及所有数据点的K邻域半径标准差的平均值中的至少一种。
优选的,在所述步骤S102中,每一所述数据点的K邻域统计特性值包括所述数据点的K邻域半径均值以及K邻域半径标准差。若所述数据点的K邻域统计特性值与所述数据点的邻近聚类的K邻域统计特性值满足下列公式1)和公式2),则认为所述数据点的K邻域统计特性值与所述数据点的邻近聚类S的K邻域统计特性值的差值在第一阈值delta范围内:
公式1)mean(S(M))-delta×std(S(M))≤p(M)≤mean(S(M))+delta×std(S(M));
公式2)mean(S(V))-delta×std(S(V))≤p(V)≤mean(S(V))+delta×std(S(V));
其中,p(M)为数据点p的K邻域半径均值,p(V)为数据点p的K邻域半径标准差,mean(S(M))为邻近聚类S的所有数据点的K邻域半径均值的平均值,mean(S(V))为邻近聚类S的所有数据点的K邻域半径标准差的平均值,std(S(M))为邻近聚类S的所有数据点的K邻域半径均值的标准差,std(S(V))为邻近聚类S的所有数据点的K邻域半径标准差的标准差,delta为预设系数。在本实施例中,第一阈值delta的预设范围通常为1~10。
优选的,在所述步骤S103中,若存在数据点的K邻域统计特性值满足下列公式3),则认为存在数据点的K邻域统计特性值大于或等于该数据点的K邻域内任一其他数据点的K邻域统计特性值的nsr倍:
公式3)p(M)≥q(M)×nsr,q∈N_k(p);
其中,p(M)为数据点p的K邻域半径均值,q(M)为数据点q的K邻域半径均值,N_k(p)表示数据点p的K邻域,nsr的预设范围通常为3~5。
进一步的,本实施例将所有聚类中数据点总数大于或等于第二阈值nspts的聚类定义为显著聚类,并将所有聚类中数据点总数小于第二阈值nspts的聚类定义为非显著聚类,其中,所述第二阈值nspts可根据所需聚类规模调整。在所述步骤S103中,若存在数据点的K邻域统计特性值小于该数据点的K邻域内任一其他数据点的K邻域统计特性值的nsr倍,则将该数据点并入其距离最近的数据点所属的显著聚类中。
可见,本发明实施例提供的基于K邻域相似性的数据聚类方法基于待聚类数据点的K邻域最大半径即按密度对待聚类数据点进行排序,然后对升序排序后的数据点进行第一遍循环,以符合统计相似性的数据点并入到同个聚类中;然后再根据聚类所需规模对聚类密度较小的数据点进行第二遍循环,从而找出所有噪声点以及将非噪声点合并到最近的大密度聚类中,从而实现数据聚类。可见,利用本发明实施例提供的基于K邻域相似性的数据聚类方法进行数据聚类带有如下技术效果:第一,无需预先设定聚类的个数,无需知道数据的概率分布;第二,参数容易设置,K的取值一般为5-9,delta的取值一般为1-10,nsr的取值一般为3-5,显著性参数nspts的取值可根据需要聚类规模自行调节,且各个参数的设置都与数据的密度分布和距离尺度无关;第三,聚类的形成是由高密度到低密度逐渐合并而成的,在产生聚类的同时给出了聚类之间的层次关系。
图3(a)~图3(h)显示了利用本发明实施例提供的基于K邻域相似性的数据聚类方法对给定的不同形状的数据集进行数据聚类的结果。如图3(a)所示,采用本实施例提供的聚类方法能够克服链接效应,并正确识别出由点链链接的两个群集。图3(b)~3(h)显示使用本实施例的聚类方法正确识别出多个不同形状的聚类。如图3(b)所示,采用本实施例的聚类方法正确发现了密度变化的两个重叠聚类:密集集群和周围稀疏集群。如图3(d)所示,噪声点被正确识别。在这些给定的数据集上,通过本实施例的聚类方法发现的群集的数目精确地等于所期望的群集数量。可见,利用本实施例提供的基于K邻域相似性的数据聚类方法具有处理任意形状的,密度变化的以及带有噪声点的数据集的能力。
参见图4,是本发明一实施例提供的基于K邻域相似性的数据聚类装置的结构示意图,包括:
数据点排序模块401,用于对待聚类数据点按照数据点的K邻域最大半径进行升序排序;其中,每一数据点的K邻域最大半径是指所述数据点的K邻域中离该数据点最大的距离;所述数据点的K邻域是指距离所述数据点距离最近的K个数据点的集合;其中,K取值范围为5~9。
一遍循环模块402,用于对升序排序后的所述数据点进行第一遍循环,计算所述数据点的K邻域统计特性值,并将计算得到的所述数据点的K邻域统计特性值和已生成的该数据点所对应的邻近聚类的K邻域统计特性值进行比较;若所述数据点的K邻域统计特性值与该数据点的邻近聚类的K邻域统计特性值的差值在第一阈值delta范围内,则将所述数据点并入所述数据点的邻近聚类中,如果可并入多个邻近聚类,则按照距离从近到远逐个并入直到聚类中数据点个数超过第二阈值nspts;若所述数据点的K邻域统计特性值与所述数据点的邻近聚类的K邻域统计特性值的差值超过第一阈值delta,则生成一个包含所述数据点的新聚类;
二遍循环模块403,用于在所述第一遍循环完成后,对所有聚类中数据点总数小于第二阈值nspts的聚类中的数据点进行第二遍循环,若存在数据点的K邻域统计特性值大于或等于该数据点的K邻域内任一其他数据点的K邻域统计特性值的nsr倍,则将该数据点标记为噪声数据,否则将该数据点并入其距离最近的数据点所属的聚类。
在本实施例中,任一所述数据点的K邻域统计特性值包括该数据点的K邻域最大半径、K邻域半径均值以及K邻域半径标准差中的至少一种;相应的,所述邻近聚类的K邻域统计特性值包括所述邻近聚类的所有数据点的K邻域最大半径的平均值、所有数据点的K邻域半径均值的平均值以及所有数据点的K邻域半径标准差的平均值中的至少一种。
优选的,在所述一遍循环模块402中,每一所述数据点的K邻域统计特性值包括所述数据点的K邻域半径均值以及K邻域半径标准差;若所述数据点的K邻域统计特性值与所述数据点的邻近聚类的K邻域统计特性值满足下列公式1)和公式2),则认为所述数据点的K邻域统计特性值与所述数据点的邻近聚类S的K邻域统计特性值的差值在第一阈值delta范围内:
公式1)mean(S(M))-delta×std(S(M))≤p(M)≤mean(S(M))+delta×std(S(M));
公式2)mean(S(V))-delta×std(S(V))≤p(V)≤mean(S(V))+delta×std(S(V));
其中,p(M)为数据点p的K邻域半径均值,p(V)为数据点p的K邻域半径标准差,mean(S(M))为邻近聚类S的所有数据点的K邻域半径均值的平均值,mean(S(V))为邻近聚类S的所有数据点的K邻域半径标准差的平均值,std(S(M))为邻近聚类S的所有数据点的K邻域半径均值的标准差,std(S(V))为邻近聚类S的所有数据点的K邻域半径标准差的标准差,delta为预设系数;delta的预设范围通常为1~10。
优选的,在所述二遍循环模块403中,若存在数据点的K邻域统计特性值满足下列公式3),则认为存在数据点的K邻域统计特性值大于或等于该数据点的K邻域内任一其他数据点的K邻域统计特性值的nsr倍:
公式3)p(M)≥q(M)×nsr,q∈N_k(p);
其中,p(M)为数据点p的K邻域半径均值,q(M)为数据点q的K邻域半径均值,N_k(p)表示数据点p的K邻域,nsr的预设范围通常为3~5。
进一步的,在所述二遍循环模块403中,将所有聚类中数据点总数大于或等于第二阈值nspts的聚类定义为显著聚类,并将所有聚类中数据点总数小于第二阈值nspts的聚类定义为非显著聚类;
若存在数据点的K邻域统计特性值小于该数据点的K邻域内任一其他数据点的K邻域统计特性值的nsr倍,则将该数据点并入其距离最近的数据点所属的显著聚类中。
可见,本发明实施例提供的基于K邻域相似性的数据聚类装置基于待聚类数据点的K邻域最大半径即按密度对待聚类数据点进行排序,然后对升序排序后的数据点进行第一遍循环,以符合统计相似性的数据点并入到同个聚类中;然后再根据聚类所需规模对聚类密度较小的数据点进行第二遍循环,从而找出所有噪声点以及将非噪声点合并到最近的大密度聚类中,从而实现数据聚类。可见,利用本发明实施例提供的基于K邻域相似性的数据聚类方法进行数据聚类带有如下技术效果:第一,无需预先设定聚类的个数,无需知道数据的概率分布;第二,参数容易设置,K的取值一般为5-9,delta的取值一般为1-10,nsr的取值一般为3-5,显著性参数nspts的取值可根据需要聚类规模自行调节,且各个参数的设置都与数据的密度分布和距离尺度无关;第三,聚类的形成是由高密度到低密度逐渐合并而成的,在产生聚类的同时给出了聚类之间的层次关系。
本发明实施例还提供了一种计算机可读存储介质,所述计算机可读存储介质包括存储的计算机程序,其中,在所述计算机程序运行时控制所述计算机可读存储介质所在设备执行如上任一实施例所述的基于K邻域相似性的数据聚类方法。
参见图5,是本发明一实施例提供的基于K邻域相似性的数据聚类装置的示意图。该实施例的基于K邻域相似性的数据聚类装置包括:处理器501、存储器502以及存储在所述存储器中并可在所述处理器上运行的计算机程序,例如上述的基于K邻域相似性的数据聚类程序。所述处理器执行所述计算机程序时实现上述各个基于K邻域相似性的数据聚类方法实施例中的步骤,例如图1所示的基于K邻域相似性的数据聚类步骤。或者,所述处理器执行所述计算机程序时实现上述各装置实施例中各模块/单元的功能。
示例性的,所述计算机程序可以被分割成一个或多个模块/单元,所述一个或者多个模块/单元被存储在所述存储器中,并由所述处理器执行,以完成本发明。所述一个或多个模块/单元可以是能够完成特定功能的一系列计算机程序指令段,该指令段用于描述所述计算机程序在所述基于K邻域相似性的数据聚类装置中的执行过程。
所述基于K邻域相似性的数据聚类装置可以是桌上型计算机、笔记本、掌上电脑及云端服务器等计算设备。所述基于K邻域相似性的数据聚类装置可包括,但不仅限于,处理器、存储器。本邻域技术人员可以理解,所述示意图仅仅是基于K邻域相似性的数据聚类装置的示例,并不构成对基于K邻域相似性的数据聚类装置的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件,例如所述基于K邻域相似性的数据聚类装置还可以包括输入输出设备、网络接入设备、总线等。
所称处理器可以是中央处理单元(Central Processing Unit,CPU),还可以是其他通用处理器、数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现成可编程门阵列(Field-Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等,所述处理器是所述基于K邻域相似性的数据聚类装置的控制中心,利用各种接口和线路连接整个基于K邻域相似性的数据聚类装置的各个部分。
所述存储器可用于存储所述计算机程序和/或模块,所述处理器通过运行或执行存储在所述存储器内的计算机程序和/或模块,以及调用存储在存储器内的数据,实现所述基于K邻域相似性的数据聚类装置的各种功能。所述存储器可主要包括存储程序区和存储数据区,其中,存储程序区可存储操作系统、至少一个功能所需的应用程序(比如声音播放功能、图像播放功能等)等;存储数据区可存储根据手机的使用所创建的数据(比如音频数据、电话本等)等。此外,存储器可以包括高速随机存取存储器,还可以包括非易失性存储器,例如硬盘、内存、插接式硬盘,智能存储卡(Smart Media Card,SMC),安全数字(Secure Digital,SD)卡,闪存卡(Flash Card)、至少一个磁盘存储器件、闪存器件、或其他易失性固态存储器件。
其中,所述基于K邻域相似性的数据聚类装置集成的模块/单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本发明实现上述实施例方法中的全部或部分流程,也可以通过计算机程序来指令相关的硬件来完成,所述的计算机程序可存储于一计算机可读存储介质中,该计算机程序在被处理器执行时,可实现上述各个方法实施例的步骤。其中,所述计算机程序包括计算机程序代码,所述计算机程序代码可以为源代码形式、对象代码形式、可执行文件或某些中间形式等。所述计算机可读介质可以包括:能够携带所述计算机程序代码的任何实体或装置、记录介质、U盘、移动硬盘、磁碟、光盘、计算机存储器、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、电载波信号、电信信号以及软件分发介质等。需要说明的是,所述计算机可读介质包含的内容可以根据司法管辖区内立法和专利实践的要求进行适当的增减,例如在某些司法管辖区,根据立法和专利实践,计算机可读介质不包括电载波信号和电信信号。
需说明的是,以上所描述的装置实施例仅仅是示意性的,其中所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。另外,本发明提供的装置实施例附图中,模块之间的连接关系表示它们之间具有通信连接,具体可以实现为一条或多条通信总线或信号线。本领域普通技术人员在不付出创造性劳动的情况下,即可以理解并实施。
以上所述是本发明的优选实施方式,应当指出,对于本技术领域的普通技术人员来说,在不脱离本发明原理的前提下,还可以做出若干改进和润饰,这些改进和润饰也视为本发明的保护范围。

Claims (11)

  1. 一种基于K邻域相似性的数据聚类方法,其特征在于,包括步骤:
    S1、对待聚类数据点按照数据点的K邻域最大半径进行升序排序;其中,数据点的K邻域最大半径是指所述数据点的K邻域中离该数据点最大的距离;
    S2、对升序排序后的数据点进行第一遍循环,计算所述数据点的K邻域统计特性值,并将计算得到的数据点的K邻域统计特性值和已生成的该数据点所属的邻近聚类的K邻域统计特性值进行比较;若所述数据点的K邻域统计特性值与所述数据点的邻近聚类的K邻域统计特性值的差值在第一阈值delta范围内,则将所述数据点并入所述数据点的邻近聚类中,如果可并入多个邻近聚类,则按照距离从近到远逐个并入直到聚类中数据点个数超过第二阈值nspts;若所述数据点的K邻域统计特性值与所述数据点的邻近聚类的K邻域统计特性值的差值超过第一阈值delta,则生成一个包含所述数据点的新聚类;
    S3、所述第一遍循环完成后,对所有聚类中数据点总数小于第二阈值nspts的聚类中的数据点进行第二遍循环,若存在数据点的K邻域统计特性值大于或等于该数据点的K邻域内任一其他数据点的K邻域统计特性值的nsr倍,则将该数据点标记为噪声数据,否则将该数据点并入其距离最近的数据点所属的聚类,nsr>1,nspts>2。
  2. 如权利要求1所述的基于K邻域相似性的数据聚类方法,其特征在于,所述数据点的K邻域是指距离所述数据点距离最近的K个数据点的集合;其中,K取值范围通常为5~9。
  3. 如权利要求1或2所述的基于K邻域相似性的数据聚类方法,其特征在于,每一所述数据点的K邻域统计特性值包括该数据点的K邻域最大半径、K邻域半径均值以及K邻域半径标准差中的至少一种;相应的,所述邻近聚类的K邻域统计特性值包括所述邻近聚类的所有数据点的K邻域最大半径的平均值、所有数据点的K邻域半径均值的平均值以及所有数据点的K邻域半径标准差的平均值中的至少一种。
  4. 如权利要求3所述的基于K邻域相似性的数据聚类方法,其特征在于,每一所述数据点的K邻域统计特性值包括所述数据点的K邻域半径均值以及K邻域半径标准差;在所述步骤S2中,若所述数据点的K邻域统计特性值与所述数据点的邻近聚类的K邻域统计特性值满足下列公式1)和公式2),则认为所述数据点的K邻域统计特性值与所述数据点的邻近聚类S的K邻域统计特性值的差值在第一阈值delta范围内:
    公式1)mean(S(M))-delta×std(S(M))≤p(M)≤mean(S(M))+delta×std(S(M));
    公式2)mean(S(V))-delta×std(S(V))≤p(V)≤mean(S(V))+delta×std(S(V));
    其中,p(M)为数据点p的K邻域半径均值,p(V)为数据点p的K邻域半径标准差,mean(S(M))为邻近聚类S的所有数据点的K邻域半径均值的平均值,mean(S(V))为邻近聚类S的所有数据点的K邻域半径标准差的平均值,std(S(M))为邻近聚类S的所有数据点的K邻域半径均值的标准差,std(S(V))为邻近聚类S的所有数据点的K邻域半径标准差的标准差,delta为预设系数。
  5. 如权利要求4所述的基于K邻域相似性的数据聚类方法,其特征在于,第一阈值delta的预设范围通常为1~10。
  6. 如权利要求2或4所述的基于K邻域相似性的数据聚类方法,其特征在于,在所述步骤S3中,若存在数据点的K邻域统计特性值满足下列公式3),则认为存在数据点的K邻域统计特性值大于或等于该数据点的K邻域内任一其他数据点的K邻域统计特性值的nsr倍:
    公式3)p(M)≥q(M)×nsr,q∈N_k(p);
    其中,p(M)为数据点p的K邻域半径均值,q(M)为数据点q的K邻域半径均值,N_k(p)表示数据点p的K邻域,nsr的预设范围通常为3~5。
  7. 如权利要求6所述的基于K邻域相似性的数据聚类方法,其特征在于,在所述步骤S3中,将所有聚类中数据点总数大于或等于第二阈值nspts的聚类定义为显著聚类,并将所有聚类中数据点总数小于第二阈值nspts的聚类定义为非显著聚类;
    若存在数据点的K邻域统计特性值小于该数据点的K邻域内任一其他数据点的K邻域统计特性值的nsr倍,则将该数据点并入其距离最近的数据点所属的显著聚类中。
  8. 如权利要求7所述的基于K邻域相似性的数据聚类方法,其特征在于,所述第二阈值nspts可根据所需聚类规模调整。
  9. 一种基于K邻域相似性的数据聚类装置,其特征在于,包括:
    数据点排序模块,对待聚类数据点按照数据点的K邻域最大半径进行升序排序;其中,数据点的K邻域最大半径是指所述数据点的K邻域中离该数据点最大的距离;
    一遍循环模块,用于对升序排序后的所述数据点进行第一遍循环,计算所述数据点的K邻域统计特性值,并将计算得到的所述数据点的K邻域统计特性值和已生成的该数据点所对应的邻近聚类的K邻域统计特性值进行比较;若所述数据点的K邻域统计特性值与该数据点的邻近聚类的K邻域统计特性值的差值在第一阈值delta范围内,则将所述数据点并入所述数据点的邻近聚类中,如果可并入多个邻近聚类,则按照距离从近到远逐个并入直到聚类中数据点个数超过第二阈值nspts;若所述数据点的K邻域统计特性值与所述数据点的邻近聚类的K邻域统计特性值的差值超过第一阈值delta,则生成一个包含所述数据点的新聚类;
    二遍循环模块,用于在所述第一遍循环完成后,对所有聚类中数据点总数小于第二阈值nspts的聚类中的数据点进行第二遍循环,若存在数据点的K邻域统计特性值大于或等于该数据点的K邻域内任一其他数据点的K邻域统计特性值的nsr倍,则将该数据点标记为噪声数据,否则将该数据点并入其距离最近的数据点所属的聚类。
  10. 一种基于K邻域相似性的数据聚类装置,其特征在于,包括处理器、存储器以及存储在所述存储器中且被配置为由所述处理器执行的计算机程序,所述处理器执行所述计算机程序时实现如权利要求1至8中任意一项所述的基于K邻域相似性的数据聚类方法。
  11. 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质包括存储的计算机程序,其中,在所述计算机程序运行时控制所述计算机可读存储介质所在设备执行如权利要求1至8中任意一项所述的基于K邻域相似性的数据聚类方法。
PCT/CN2018/091697 2018-01-13 2018-06-15 基于k邻域相似性的数据聚类方法、装置和存储介质 WO2019136929A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/396,682 US11210348B2 (en) 2018-01-13 2019-04-27 Data clustering method and apparatus based on k-nearest neighbor and computer readable storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810037867.2 2018-01-13
CN201810037867.2A CN108256570A (zh) 2018-01-13 2018-01-13 基于k邻域相似性的数据聚类方法、装置和存储介质

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/396,682 Continuation-In-Part US11210348B2 (en) 2018-01-13 2019-04-27 Data clustering method and apparatus based on k-nearest neighbor and computer readable storage medium

Publications (1)

Publication Number Publication Date
WO2019136929A1 true WO2019136929A1 (zh) 2019-07-18

Family

ID=62741157

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/091697 WO2019136929A1 (zh) 2018-01-13 2018-06-15 基于k邻域相似性的数据聚类方法、装置和存储介质

Country Status (3)

Country Link
US (1) US11210348B2 (zh)
CN (1) CN108256570A (zh)
WO (1) WO2019136929A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112215287A (zh) * 2020-10-13 2021-01-12 中国光大银行股份有限公司 基于距离的多节聚类方法和装置、存储介质及电子装置
CN113553499A (zh) * 2021-06-22 2021-10-26 杭州摸象大数据科技有限公司 一种基于营销裂变的作弊探测方法、系统和电子设备

Families Citing this family (8)

Publication number Priority date Publication date Assignee Title
CN111488941B (zh) * 2020-04-15 2022-05-13 烽火通信科技股份有限公司 一种基于改进Kmeans算法的视频用户分组方法和装置
CN111951942B (zh) * 2020-08-25 2022-10-11 河北省科学院应用数学研究所 门诊预检分诊方法、装置、终端及存储介质
CN112527649A (zh) * 2020-12-15 2021-03-19 建信金融科技有限责任公司 一种测试用例的生成方法和装置
CN112581407A (zh) * 2020-12-29 2021-03-30 北京邮电大学 一种距离像的噪声抑制方法、装置、电子设备及存储介质
CN113128598B (zh) * 2021-04-22 2024-04-09 深信服科技股份有限公司 一种传感数据检测方法、装置、设备及可读存储介质
KR20230059239A (ko) 2021-10-26 2023-05-03 삼성전자주식회사 스토리지 장치
CN115952426B (zh) * 2023-03-10 2023-06-06 中南大学 基于随机采样的分布式噪音数据聚类方法及用户分类方法
CN117912712A (zh) * 2024-03-20 2024-04-19 徕兄健康科技(威海)有限责任公司 基于大数据的甲状腺疾病数据智能管理方法及系统

Citations (3)

Publication number Priority date Publication date Assignee Title
US20170024186A1 (en) * 2015-03-02 2017-01-26 Slyce Holdings Inc. System and method for clustering data
CN107392222A (zh) * 2017-06-07 2017-11-24 深圳市深网视界科技有限公司 一种人脸聚类方法、装置以及存储介质
CN107562948A (zh) * 2017-09-26 2018-01-09 莫毓昌 一种基于距离的无参数多维数据聚类方法

Family Cites Families (5)

Publication number Priority date Publication date Assignee Title
US6092072A (en) * 1998-04-07 2000-07-18 Lucent Technologies, Inc. Programmed medium for clustering large databases
US7412429B1 (en) * 2007-11-15 2008-08-12 International Business Machines Corporation Method for data classification by kernel density shape interpolation of clusters
US8363961B1 (en) * 2008-10-14 2013-01-29 Adobe Systems Incorporated Clustering techniques for large, high-dimensionality data sets
US8880525B2 (en) * 2012-04-02 2014-11-04 Xerox Corporation Full and semi-batch clustering
US10162878B2 (en) * 2015-05-21 2018-12-25 Tibco Software Inc. System and method for agglomerative clustering

Patent Citations (3)

Publication number Priority date Publication date Assignee Title
US20170024186A1 (en) * 2015-03-02 2017-01-26 Slyce Holdings Inc. System and method for clustering data
CN107392222A (zh) * 2017-06-07 2017-11-24 深圳市深网视界科技有限公司 一种人脸聚类方法、装置以及存储介质
CN107562948A (zh) * 2017-09-26 2018-01-09 莫毓昌 一种基于距离的无参数多维数据聚类方法

Non-Patent Citations (1)

Title
NI, WEIWEI ET AL.: "K-LDCHD-A Local Density Based K-Neighborhood Clustering Algorithm for High Dimensional Space", JOURNAL OF COMPUTER RESEARCH AND DEVELOPMENT, vol. 42, no. 5, 16 May 2005 (2005-05-16), pages 784-791, XP055626211, ISSN: 1000-1239 *

Cited By (3)

Publication number Priority date Publication date Assignee Title
CN112215287A (zh) * 2020-10-13 2021-01-12 中国光大银行股份有限公司 基于距离的多节聚类方法和装置、存储介质及电子装置
CN112215287B (zh) * 2020-10-13 2024-04-12 中国光大银行股份有限公司 基于距离的多节聚类方法和装置、存储介质及电子装置
CN113553499A (zh) * 2021-06-22 2021-10-26 杭州摸象大数据科技有限公司 一种基于营销裂变的作弊探测方法、系统和电子设备

Also Published As

Publication number Publication date
CN108256570A (zh) 2018-07-06
US11210348B2 (en) 2021-12-28
US20190251121A1 (en) 2019-08-15

Similar Documents

Publication Publication Date Title
WO2019136929A1 (zh) 基于k邻域相似性的数据聚类方法、装置和存储介质
CN106383877B (zh) 一种社交媒体在线短文本聚类和话题检测方法
CN110458078B (zh) 一种人脸图像数据聚类方法、系统及设备
CN103745482B (zh) 一种基于蝙蝠算法优化模糊熵的双阈值图像分割方法
CN111737743A (zh) 一种深度学习差分隐私保护方法
CN116403094B (zh) 一种嵌入式图像识别方法及系统
WO2020042579A1 (zh) 分组归纳方法、装置、电子装置及存储介质
CN110619231B (zh) 一种基于MapReduce的差分可辨性k原型聚类方法
WO2017124930A1 (zh) 一种特征数据处理方法及设备
CN110909874A (zh) 一种神经网络模型的卷积运算优化方法和装置
CN110876072B (zh) 一种批量注册用户识别方法、存储介质、电子设备及系统
CN114417095A (zh) 一种数据集划分方法及装置
CN102496146B (zh) 一种基于视觉共生的图像分割方法
CN107945186A (zh) 分割图像的方法、装置、计算机可读存储介质和终端设备
CN110276070B (zh) 一种语料处理方法、装置及存储介质
CN112199722A (zh) 一种基于K-means的差分隐私保护聚类方法
CN108021935B (zh) 一种基于大数据技术的维度约简方法及装置
Das et al. A robust environmental selection strategy in decomposition based many-objective optimization
JP2000306104A (ja) 画像領域分割方法及び装置
CN108717444A (zh) 一种基于分布式结构的大数据聚类方法和装置
CN114723481A (zh) 数据处理方法、装置、电子设备和存储介质
CN110874567B (zh) 颜值判定方法、装置、电子设备及存储介质
CN113128574A (zh) 场景缩减方法、装置及终端设备
Chen et al. Feature Analysis and Optimization of Underwater Target Radiated Noise Based on t-SNE
CN108206024B (zh) 一种基于变分高斯回归过程的语音数据处理方法

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18899365

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18899365

Country of ref document: EP

Kind code of ref document: A1