WO2015001416A1 - Multi-dimensional data clustering - Google Patents

Multi-dimensional data clustering

Info

Publication number
WO2015001416A1
Authority
WO
WIPO (PCT)
Prior art keywords
dimension
memberships
data points
cluster
initial
Prior art date
Application number
PCT/IB2014/001262
Other languages
English (en)
Inventor
Diptesh DAS
Aniruddha Sinha
Kingshuk CHAKRAVARTY
Amit Konar
Original Assignee
Tata Consultancy Services Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tata Consultancy Services Limited
Publication of WO2015001416A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/11 Region-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133 Distances to prototypes
    • G06F18/24137 Distances to cluster centroids
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10024 Color image

Definitions

  • the present subject matter relates, in general, to data processing and, in particular, to a system and a method for clustering multi-dimensional data.
  • Data clustering is a method of grouping data points or objects of a given data that are substantially similar in characteristics into clusters. Generally, each cluster is represented by a geometric centroid of the data points lying in the cluster. Clustering techniques can be applied to data that are quantitative (numerical), qualitative (categorical), or a combination of both. Clustering techniques are mostly unsupervised methods that can be used to organize data into clusters based on similarities among the individual data items. The potential of clustering techniques to reveal the underlying structures in data can be exploited in a wide variety of applications including classification, image processing, data mining, pattern recognition, modelling and identification.
  • Figure 1b illustrates comparison of an exemplary image segmented by the present clustering system and a conventional clustering system.
  • Figures 2a and 2b illustrate a method for determining significant dimensions for clustering multi-dimensional data, according to an embodiment of the present subject matter.
  • Such conventional techniques partition data comprising a set of data points or objects into two or more clusters based on an iterative two-step process. In the first step, each data point is allocated to the nearest cluster center, and in the second step, cluster centers are determined by identifying a centroid for each of the two or more clusters. The centroid is identified for each partition of the data points allocated to each cluster.
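By way of illustration only, the conventional two-step iteration described above may be sketched as follows. The function name `kmeans`, the fixed iteration count, the seeding of centers from randomly chosen data points, and the sample data are all assumptions made for this sketch and are not taken from the present description.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Conventional two-step clustering on a list of equal-length tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: seed centers from the data
    labels = [0] * len(points)
    for _ in range(iterations):
        # Step 1: allocate each data point to the nearest cluster center.
        for idx, p in enumerate(points):
            labels[idx] = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])),
            )
        # Step 2: recompute each center as the centroid of its partition.
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # keep the old center if a cluster is empty
                centers[i] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, labels

# Hypothetical sample data: two well-separated groups of three points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centers, labels = kmeans(points, k=2)
```

Because each point receives exactly one label in step 1, a point lying near the boundary between two clusters is forced into one of them, which is the limitation the present subject matter addresses.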
  • Such conventional clustering techniques fail to determine right clusters for data points that reside marginally at boundaries of the two or more clusters.
  • the clustering may be understood as partitioning a set of data points of the data into a plurality of clusters, such that the data points that belong to the same cluster are as similar as possible and the data points that belong to different clusters are as dissimilar as possible.
  • the system as described herein is a clustering system.
  • a database for storing multi-dimensional data is maintained according to one implementation.
  • the multi-dimensional data may be representative of multimedia data, financial transactions and the like.
  • the multi-dimensional data is represented by a plurality of data points in a multi-dimensional space, say n-dimensional space.
  • each of the plurality of data points may include a plurality of dimensions or components.
  • the multi-dimensional data may be an image and pixels of the image may be the plurality of data points.
  • the components of the pixels i.e., RGB (red, green and blue) or HSV (hue, saturation and value) can be the dimensions.
  • the database can be an external repository associated with the clustering system, or an internal repository within the clustering system.
  • the data stored in the database may be retrieved whenever clustering is to be performed. Further, the data contained within such database may be updated, whenever required. For example, new data may be added into the database, existing data can be modified, or non-useful data may be deleted from the database.
  • Although a database is maintained to store the multi-dimensional data, it will be appreciated that the multi-dimensional data may also be received by the clustering system in real time to identify significant dimensions and then perform clustering of the multi-dimensional data.
  • a membership is assigned to each dimension of each of the plurality of data points to a plurality of clusters.
  • the membership assigned to each dimension initially may be interchangeably referred to as initial membership.
  • the plurality of clusters may be pre-defined.
  • a membership assigned to a dimension of a data point may be understood as strength of association between the dimension of the data point and a particular cluster.
  • the membership assigned to a dimension of a data point may be understood as degree of belonging of the dimension to a particular cluster.
  • Each dimension may belong to several clusters simultaneously, with different degrees of membership.
  • the dimensions can be assigned a membership between 0 and 1, indicating their partial memberships.
  • the memberships can be initialized in a random fashion using a random number between 0 and 1.
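A minimal sketch of the random initialization described above is given below. The nested-list layout `mu[i][k][j]` (cluster i, data point k, dimension j), the function name, and the seed parameter are assumptions of this sketch. The description does not state that the initial values are normalized across clusters, so no normalization is applied here.

```python
import random

def init_memberships(num_clusters, num_points, num_dims, seed=0):
    """Return mu[i][k][j]: a random initial membership in [0, 1) for
    dimension j of data point k with respect to cluster i."""
    rng = random.Random(seed)
    return [
        [[rng.random() for _ in range(num_dims)] for _ in range(num_points)]
        for _ in range(num_clusters)
    ]

# Hypothetical sizes: 3 clusters, 5 data points, 3 dimensions.
mu = init_memberships(num_clusters=3, num_points=5, num_dims=3)
```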
  • the memberships assigned to the dimensions of the plurality of data points are then aggregated.
  • the memberships may be induced by a fuzziness control parameter (m).
  • the fuzziness control parameter (m) determines the level of cluster fuzziness. A large value of fuzziness control parameter (m) results in smaller memberships and hence, fuzzier clusters.
  • the value of fuzziness control parameter (m) may be 2.
  • a cluster center of each of the plurality of clusters is computed based on the aggregated memberships. A cluster center of a cluster is average of all data points in the cluster. The computation of the cluster center has been explained later in detail (using equation 4), in the forthcoming description.
  • the fuzziness control parameter (m) is updated to stabilize the cluster centers.
  • the stability of the cluster centers is obtained by using the value of the fuzziness control parameter (m) which has the minimum effect in the change of cluster centers due to the change in memberships or membership values.
  • partial derivative of each dimension of the cluster centers may be taken with respect to membership degree of each dimension of each of the plurality of data points and then it may be set to zero.
  • the fuzziness control parameter (m) may be updated using an update factor weight (a). For instance, the value of the update factor weight (a) may be 0.95.
  • the fuzziness control parameter (m) may be updated based on weighted sum of initial or previous fuzziness control parameter (m) and modified fuzziness control parameter (m).
  • the computation of the initial or previous fuzziness control parameter (m) and the modified fuzziness control parameter (m) is explained later in detail, in the forthcoming description.
  • the initial memberships, the aggregated memberships, the cluster centers, and the fuzziness control parameter (m) are updated in a plurality of iterations.
  • the plurality of iterations terminates when the sum of absolute differences between the modified memberships and the initial memberships or previously updated memberships is less than a pre-defined limit (ε).
  • the value of the pre-defined limit (ε) may be 0.01.
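The termination test described above may be sketched as follows, assuming memberships are held in nested lists indexed as `memberships[i][k][j]` (cluster, data point, dimension). The helper names and the sample values are hypothetical.

```python
def flatten(memberships):
    """Yield every value of a nested memberships[i][k][j] list."""
    for per_cluster in memberships:
        for per_point in per_cluster:
            for value in per_point:
                yield value

def has_converged(previous, modified, limit=0.01):
    """True when the summed absolute change in memberships is below limit."""
    total_change = sum(
        abs(new - old) for old, new in zip(flatten(previous), flatten(modified))
    )
    return total_change < limit

# Hypothetical memberships: 2 clusters, 1 data point, 2 dimensions.
previous = [[[0.2, 0.8]], [[0.8, 0.2]]]
small_step = [[[0.201, 0.799]], [[0.799, 0.201]]]
large_step = [[[0.3, 0.7]], [[0.7, 0.3]]]
```

With these values, `has_converged(previous, small_step)` holds because the total change is about 0.004, whereas `has_converged(previous, large_step)` does not.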
  • the plurality of iterations is predefined, and the initial memberships, the aggregated memberships, the cluster centers, and the fuzziness control parameter (m) are updated till the predefined number of iterations is exhausted.
  • a hard assignment of the plurality of data points to the cluster centers of the plurality of clusters is performed.
  • a point cluster index is identified for each of the plurality of data points.
  • the point cluster index for a data point may be understood as the index which has a maximum number of 1s in a binary rank matrix for each dimension of the data point.
  • the binary rank matrix is indicative of membership representation of each dimension of a data point.
  • the membership may be represented in terms of binary notation, i.e., either as 1s or as 0s.
  • the hard assignment is done based on a membership rank matrix for each of the plurality of data points.
  • a membership rank matrix is indicative of average membership of dimensions of a data point for which binary rank matrix entry is equal to 1.
  • the hard assignment to assign each of the dimensions of plurality of data points to cluster centers of the plurality of clusters may also be performed based on identifying a dimension cluster index for each dimension of each of the plurality of data points.
  • the dimension cluster index for a dimension of a data point may be understood as the index for which binary rank matrix for the dimension of the data point is 1.
  • a measurement metric is determined for each dimension of each of the plurality of data points.
  • the measurement metric may be understood as a goodness measure for each dimension of a data point.
  • the measurement metric can be used for performing dimensionality reduction. Dimensionality reduction can be performed by tracking the dimensions which follow data points well as compared to other dimensions. If the value of goodness measure is high then a dimension follows the data points well indicating the significance of the dimension in the process of clustering. For instance, if value of goodness measure is high then the significance of the dimension for the process of clustering is also very high. Thus, dimensionality reduction can be performed by selecting a set of dimensions that have higher values of goodness measure. The set of dimensions that have higher goodness measure can be used for clustering the n-dimensional data.
  • the measurement metric for each dimension of each of the plurality of data points may be determined based on comparison of the point cluster index for each of the plurality of data points and the dimension cluster index for each dimension of each of the plurality of data points.
  • the measurement metric may be interchangeably referred to as a goodness measurement metric.
  • the measurement metric of a dimension of a data point is equal to 1 if the point cluster index is same as the dimension cluster index. Otherwise, the measurement metric is equal to 0.
  • each dimension of the data points contributes independently to determining the membership of the data points to the clusters. Since the membership degree, i.e., the degree of belongingness of each dimension to each cluster, is taken into consideration, the distance between each dimension of a data point and the cluster center of that dimension is significantly minimized, and as a result the accuracy of clustering of the input data improves significantly. Further, the cluster assignment of a data point, made by considering the highest aggregated membership for that cluster, can also be ascertained from a set of dimensions having higher membership in that cluster.
  • Figure 1a illustrates a network environment 100 implementing a clustering system 102, in accordance with an embodiment of the present subject matter.
  • the network environment 100 can be a public network environment, including thousands of personal computers, laptops, various servers, such as blade servers, and other computing devices.
  • the network environment 100 can be a private network environment with a limited number of computing devices, such as personal computers, servers, and laptops.
  • the clustering system 102 may be implemented in a variety of computing systems, such as a laptop computer, a desktop computer, a notebook, a workstation, a mainframe computer, a server, a network server, and the like. Further, it will be understood that the clustering system 102 is connected to a plurality of user devices 104-1, 104-2, 104-3..., and 104-N, collectively referred to as user devices 104 and individually referred to as a user device 104. As shown in Figure 1a, the user devices 104 are communicatively coupled to the clustering system 102 over a network 106 through one or more communication links, enabling one or more end users to access and operate the clustering system 102.
  • the user device 104 may include, but is not limited to, a desktop computer, a portable computer, a handheld computing device, and a workstation.
  • the network 106 may be a wireless network, a wired network, or a combination thereof.
  • the network 106 may also be an individual network or a collection of many such individual networks, interconnected with each other and functioning as a single large network, e.g., the Internet or an intranet.
  • the network 106 may be implemented as one of the different types of networks, such as intranet, local area network (LAN), wide area network (WAN), the internet, and such.
  • the network 106 may either be a dedicated network or a shared network, which represents an association of the different types of networks that use a variety of protocols, for example, Hypertext Transfer Protocol (HTTP), Transmission Control Protocol/Internet Protocol (TCP/IP), etc., to communicate with each other.
  • the network 106 may include a variety of network devices, including routers, bridges, servers, computing devices, storage devices, and the like.
  • the network environment 100 further comprises a database 108 communicatively coupled to the clustering system 102.
  • the database 108 may store multi-dimensional data.
  • the data may be representative of multimedia data, financial transactions and the like. According to an implementation, the data is represented as a plurality of data points in a multi-dimensional space, say n-dimensional space.
  • Although the database 108 is shown external to the clustering system 102, it will be appreciated by a person skilled in the art that the database 108 can also be implemented internal to the clustering system 102, where the multi-dimensional data may be stored within a memory component of the clustering system 102.
  • the clustering system 102 includes processor(s) 110, interface(s) 112, and memory 114 coupled to the processor(s) 110.
  • the processor(s) 110 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions.
  • the processor(s) 110 may be configured to fetch and execute computer-readable instructions stored in the memory 114.
  • the memory 114 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM), and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes.
  • the interface(s) 112 may include a variety of software and hardware interfaces, for example, interfaces for peripheral device(s), such as a product board, a mouse, an external memory, and a printer. Additionally, the interface(s) 112 may enable the clustering system 102 to communicate with other devices, such as web servers and external repositories. The interface(s) 112 may also facilitate multiple communications within a wide variety of networks and protocol types, including wired networks, for example, LAN, cable, etc., and wireless networks, such as WLAN, cellular, or satellite. For the purpose, the interface(s) 112 may include one or more ports.
  • the clustering system 102 also includes module(s) 116 and data 118.
  • the module(s) 116 include, for example, an assignment module 120, a modification module 122, an identification module 124, and a determination module 126, and other module(s) 128.
  • the other module(s) 128 may include programs or coded instructions that supplement applications or functions performed by the clustering system 102.
  • the data 118 may be membership data 130, index data 132, and other data 134.
  • the other data 134 may serve as a repository for storing data that is processed, received, or generated as a result of the execution of one or more modules in the module(s) 116.
  • the assignment module 120 of the clustering system 102 may retrieve the multi-dimensional data from the database 108.
  • the multi-dimensional data may be an n-dimensional data.
  • the multi-dimensional data may be represented by a plurality of data points in a multi-dimensional space, say n-dimensional space. Further, each of the plurality of data points may include a plurality of dimensions or components.
  • the n-dimensional data is mathematically represented by the expression provided below:
  • ( X ) represents the n-dimensional data of size N.
  • the multi-dimensional data may be an image and pixels of the image may be the plurality of data points.
  • the components of the pixels i.e., RGB (red, green and blue) or HSV (hue, saturation and value) may be the dimensions.
  • the clustering system 102 may partition a two-dimensional image or a three-dimensional (3D) image into two or more clusters.
  • an image of 481 by 321 pixel dimension is taken and is transformed from RGB plane into HSV plane.
  • Each data point includes three components, i.e., Hue (H), Saturation (S) and Value (V). These components are clustered into three clusters based on the HSV value of background, subject skin and dress color of the subject in the image. Therefore, in this case, the total number of data points (N) is 154401 (481 × 321), the total number of dimensions (n) is 3, and the total number of clusters (c) is 3.
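The preparation of such an image as a set of three-dimensional data points may be sketched as follows, using the `colorsys` module of the Python standard library. The image representation (rows of `(r, g, b)` tuples with channel values from 0 to 255), the function name, and the uniformly red test image are assumptions of this sketch; `colorsys.rgb_to_hsv` expects and returns values in the range 0 to 1.

```python
import colorsys

def rgb_image_to_hsv_points(image):
    """Flatten an image (rows of (r, g, b) tuples, 0..255) into a list of
    (h, s, v) data points, one per pixel, each component in [0, 1]."""
    points = []
    for row in image:
        for r, g, b in row:
            points.append(colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0))
    return points

# Hypothetical stand-in for a 481 by 321 pixel image: every pixel pure red.
width, height = 481, 321
image = [[(255, 0, 0)] * width for _ in range(height)]
points = rgb_image_to_hsv_points(image)
```

Each pixel becomes one data point with n = 3 dimensions, so the number of data points N equals the product of the image width and height.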
  • the assignment module 120 may assign a membership to each dimension of each of the plurality of data points to a plurality of clusters.
  • the membership assigned to each dimension may be interchangeably referred to as initial membership.
  • the number of clusters may be pre-defined depending upon the application.
  • a membership assigned to a dimension of a data point may be understood as strength of association between the dimension of the data point and a particular cluster.
  • the membership assigned to a dimension of a data point may be understood as degree of belonging of the dimension to a particular cluster.
  • Each dimension may belong to several clusters simultaneously, with different degrees of membership.
  • the assignment module 120 may initialize membership to each dimension using a random number ranging between 0 and 1, indicating their partial memberships.
  • the memberships can be assigned in a random fashion using a random number ranging between 0 and 1.
  • the membership assigned to the dimensions is mathematically represented by the expression provided below: $\mu_i(x_k^j)$, where $1 \le j \le n$, $1 \le k \le N$, $1 \le i \le c$
  • $x_k^j$ denotes the j-th dimension of the k-th data point and $\mu_i(x_k^j)$ denotes the membership of $x_k^j$ in the i-th cluster.
  • (N) is the size of the n-dimensional data and (c) is the number of clusters for the n-dimensional data.
  • $\mu_i(x_k^j)$ represents the membership of $x_k^j$ in the i-th cluster; and m represents the fuzziness control parameter.
  • Based on the aggregated memberships, the modification module 122 computes a cluster center of each of the plurality of clusters.
  • a cluster center of a cluster may be understood as average of all data points in the cluster.
  • the modification module 122 computes a cluster center using equation (4) provided below:
  • $x_k^j$ represents the j-th dimension of the k-th data point
  • $\mu_i(\vec{x}_k)$ represents the aggregated membership of $\vec{x}_k$ in the i-th cluster
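Equation (4) itself is not reproduced in this text. As a rough illustration only, the sketch below treats the cluster center as an average of the data points weighted by their aggregated memberships, which is one reading of the statement that a cluster center is the average of all data points in the cluster. The function name, the layouts `points[k][j]` and `aggregated[i][k]`, and the assumption that any exponent involving the fuzziness control parameter (m) is already folded into the aggregated values are all hypothetical.

```python
def cluster_centers(points, aggregated):
    """Return centers[i][j] as the membership-weighted average of points.

    points[k][j]     : j-th dimension of the k-th data point
    aggregated[i][k] : aggregated membership of data point k in cluster i
                       (assumed positive for at least one point per cluster)
    """
    num_dims = len(points[0])
    centers = []
    for weights in aggregated:
        total = sum(weights)
        centers.append([
            sum(w * p[j] for w, p in zip(weights, points)) / total
            for j in range(num_dims)
        ])
    return centers

# Hypothetical data: 2 data points with 2 dimensions, 2 clusters.
points = [[0.0, 0.0], [2.0, 4.0]]
aggregated = [[1.0, 1.0], [1.0, 3.0]]
centers = cluster_centers(points, aggregated)
```

With equal weights the first center is the plain mean of the two points, while the second center is pulled toward the point with the larger aggregated membership.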
  • the modification module 122 may initially compute the cluster center of each of the plurality of clusters using equation (5) provided below and then compute new cluster center of each of the plurality of clusters using equation (5) provided above:
  • the modification module 122 may then calculate square of distance between each dimension of the cluster center of each of the plurality of clusters and each dimension of each of the plurality of data points.
  • the square of distance calculated between the each dimension of cluster center of each of the plurality of clusters and each dimension of each of the plurality of data points is mathematically represented by the expression provided below:
  • $(x_k^j - v_i^j)^2$ denotes the square of distance between each dimension of the cluster center of each of the plurality of clusters and each dimension of each of the plurality of data points
  • $x_k^j$ denotes the j-th dimension of the k-th data point
  • $v_i^j$ denotes the j-th dimension of the i-th cluster center.
  • the modification module 122 determines a modified membership for each dimension of each of the plurality of data points.
  • the modified membership is determined based on modifying the initial membership assigned to each dimension based on the cluster center of each of the plurality of clusters.
  • the modification module 122 determines the modified membership based on equation 7 (provided below).
  • the modification module 122 modifies the memberships as provided below:
  • $(x_k^j - v_i^j)^2$ represents the square of distance between each dimension of the cluster center of each of the plurality of clusters and each dimension of each of the plurality of data points
  • m represents the fuzziness control parameter.
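Equation (7) is not reproduced in this text. The sketch below therefore substitutes the standard fuzzy c-means membership update, applied to each dimension independently using the per-dimension squared distances; this substitution, the function name, the layouts `points[k][j]`, `centers[i][j]` and `mu[i][k][j]`, and the handling of zero distances are assumptions and should not be read as the claimed formula.

```python
def update_memberships(points, centers, m=2.0):
    """Return mu[i][k][j] using the standard fuzzy c-means update on the
    squared per-dimension distances (x_k^j - v_i^j)^2. Requires m > 1."""
    c, n = len(centers), len(points[0])
    mu = [[[0.0] * n for _ in points] for _ in range(c)]
    for k, p in enumerate(points):
        for j in range(n):
            d2 = [(p[j] - centers[i][j]) ** 2 for i in range(c)]
            zeros = [i for i in range(c) if d2[i] == 0.0]
            if zeros:
                # The coordinate coincides with one or more centers:
                # split the full membership among those centers.
                for i in zeros:
                    mu[i][k][j] = 1.0 / len(zeros)
                continue
            for i in range(c):
                mu[i][k][j] = 1.0 / sum(
                    (d2[i] / d2[l]) ** (1.0 / (m - 1.0)) for l in range(c)
                )
    return mu

# Hypothetical data: one 1-dimensional point between two centers.
mu = update_memberships(points=[[1.0]], centers=[[0.0], [3.0]], m=2.0)
mu_on_center = update_memberships(points=[[0.0]], centers=[[0.0], [3.0]], m=2.0)
```

In this form the memberships of a given dimension sum to 1 across clusters, and a dimension that lies closer to a center receives a larger membership in that cluster.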
  • the modification module 122 computes the cluster center of each of the plurality of clusters and modifies the memberships using equation (8) provided below:
  • $\mu_i(x_k^j)$ represents the membership of $x_k^j$ in the i-th cluster
  • $(x_k^j - v_i^j)^2$ represents the square of distance between each dimension of the cluster center of each of the plurality of clusters and each dimension of each of the plurality of data points
  • m represents the fuzziness control parameter.
  • The partial derivative of equation (8) is taken with respect to the memberships, the cluster centers, and the Lagrange multiplier to obtain equation (5) and equation (7).
  • the modification module 122 aggregates the memberships or membership values and adapts the fuzziness control parameter (m) towards its convergence, i.e., the fuzziness control parameter (m) is placed in the less sensitive region of the cluster centers.
  • the modification module 122 updates the fuzziness control parameter (m) to stabilize the cluster centers.
  • the stability of the cluster centers is obtained by using the value of the fuzziness control parameter (m) which has the minimum effect in the change of cluster centers due to the change in membership values.
  • the modification module 122 updates the fuzziness control parameter (m) based on weighted sum of initial or previous fuzziness control parameter (m) and modified fuzziness control parameter (m).
  • the modification module 122 takes partial derivative of each dimension of the cluster centers with respect to membership or membership degree of each dimension of each of the plurality of data points and then it may be set to zero.
  • the modification module 122 selects the value of the fuzziness control parameter (m) such that it gives minimum absolute value of the partial derivative accumulated over all the dimensions of all the clusters for all data points.
  • the modification module 122 may update the fuzziness control parameter (m) using an update factor weight (a). For instance, the value of the update factor weight (a) may be 0.95.
  • the modification module 122 updates the fuzziness control parameter (m) using equation (9) and (10) provided below:
  • $m$ represents the initial or previous fuzziness control parameter (m);
  • $m_{modified}$ represents the modified fuzziness control parameter (m), where $m_{modified}$ is calculated using equation (9);
  • a is the weight factor; and
  • $m_{new}$ represents the updated fuzziness control parameter (m).
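Equations (9) and (10) are not reproduced in this text. The weighted-sum update described above may nevertheless be sketched as follows, taking the modified value as a given input. Which of the two terms carries the weight a, and the function name, are assumptions of this sketch; here a weights the previous value and (1 - a) weights the modified value, so that a value of a close to 1 changes m slowly.

```python
def update_fuzziness(m_previous, m_modified, a=0.95):
    """Weighted sum of the previous and modified fuzziness parameters.

    Assumption: a weights the previous value and (1 - a) the modified one.
    """
    return a * m_previous + (1.0 - a) * m_modified

# Hypothetical values: previous m of 2, modified m of 3.
m_new = update_fuzziness(m_previous=2.0, m_modified=3.0)
```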
  • the modification module 122 may update the initial memberships, the aggregated memberships, the cluster centers, and the fuzziness control parameter (m) in a plurality of iterations until the sum of absolute differences between the modified memberships and the initial memberships or previously updated memberships is less than a pre-defined limit (ε).
  • the value of the pre-defined limit (ε) may be 0.01.
  • the plurality of iterations is predefined, and in said implementation, the initial memberships, the aggregated memberships, the cluster centers, and the fuzziness control parameter (m) are updated till the predefined number of iterations is exhausted.
  • the clustering system 102 calculates a binary rank matrix for each dimension of each of the plurality of data points.
  • the binary rank matrix is indicative of membership representation of each dimension of a data point.
  • the membership may be represented in terms of binary notation, i.e., either as 1s or as 0s.
  • the matrix dimension of a binary rank matrix is equal to ratio of total number of clusters to total number of dimensions of a data point.
  • the identification module 124 may assign a value of 1 to that cluster which corresponds to maximum value of membership and all other clusters are assigned a value of 0. Further, the identification module 124 computes a membership rank matrix for each of the plurality of data points.
  • the membership rank matrix may be indicative of average membership of dimensions of a data point for which binary rank matrix entry is equal to 1.
  • Based on the binary rank matrix and the membership rank matrix, the identification module 124 performs a hard assignment of each of the dimensions of each of the plurality of data points to the cluster centers of the plurality of clusters. To perform the hard assignment, the identification module 124 identifies a point cluster index for each of the plurality of data points and a dimension cluster index for each dimension of each of the plurality of data points to assign each data point to the cluster centers of the clusters.
  • the point cluster index for a data point may be understood as the index which has a maximum number of 1s in a binary rank matrix for each dimension of the data point.
  • the point cluster index for the data point is mathematically represented by the expression provided below:
  • $C_k^{data}$ denotes the point cluster index of the k-th data point, which has the maximum number of 1s in the binary rank matrix, and $M_{ij}(\vec{x}_k)$ denotes the binary rank matrix.
  • the hard assignment is done based on the membership rank matrix for each of the plurality of data points.
  • the point cluster index for this case is mathematically represented by the expression provided below:
  • $U_{Ai}(\vec{x}_k)$ denotes the membership rank matrix
  • the identification module 124 also performs hard assignment to assign each of the plurality of data points to cluster centers of the plurality of clusters based on identifying a dimension cluster index for each dimension of each of the plurality of data points.
  • the dimension cluster index for a dimension of a data point may be understood as the index for which binary rank matrix for the dimension of the data point is 1.
  • the dimension cluster index for a dimension of a data point is mathematically represented by the expression provided below:
  • $C_k^j$ denotes the dimension cluster index of the j-th dimension of the k-th data point and $M_{ij}(\vec{x}_k)$ denotes the binary rank matrix.
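The binary rank matrix and the two indices described above may be sketched as follows for a single data point. The function names and the layout `mu[i][k][j]` (cluster, data point, dimension) are assumptions, and ties are broken here by the lowest cluster index as a simplification; the membership rank matrix is not used in this sketch.

```python
def binary_rank_matrix(mu, k):
    """Return M[i][j] for data point k: 1 where cluster i has the largest
    membership for dimension j, and 0 for all other clusters."""
    c, n = len(mu), len(mu[0][k])
    matrix = [[0] * n for _ in range(c)]
    for j in range(n):
        best = max(range(c), key=lambda i: mu[i][k][j])
        matrix[best][j] = 1
    return matrix

def dimension_cluster_index(matrix, j):
    """Cluster index whose binary rank matrix entry for dimension j is 1."""
    return next(i for i, row in enumerate(matrix) if row[j] == 1)

def point_cluster_index(matrix):
    """Cluster index whose row holds the maximum number of 1s."""
    return max(range(len(matrix)), key=lambda i: sum(matrix[i]))

# Hypothetical memberships: 2 clusters, 1 data point, 3 dimensions.
mu = [[[0.9, 0.2, 0.7]], [[0.1, 0.8, 0.3]]]
matrix = binary_rank_matrix(mu, k=0)
point_index = point_cluster_index(matrix)
dim_indices = [dimension_cluster_index(matrix, j) for j in range(3)]
```

In this example two of the three dimensions favor the first cluster, so the data point as a whole is assigned to it even though its second dimension is assigned to the other cluster.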
  • the point cluster index and the dimension cluster index identified by the identification module 124 may be stored as the index data 132 within the clustering system 102.
  • the determination module 126 determines a measurement metric for each dimension of each of the plurality of data points.
  • the measurement metric may be interchangeably referred to as a goodness measurement metric.
  • the measurement metric may be understood as a goodness measure ($G_k^j$) for each dimension of a data point. If the value of the goodness measure ($G_k^j$) is high, then a dimension follows the data points well, indicating the significance of the dimension in the process of clustering. The value of the measurement metric for each dimension accumulated over all the data points provides the measure of significance of the dimension. For instance, if the value of the measurement metric is high, then the measure of significance of the dimension may also be high. In one implementation, a set of dimensions that have a higher goodness measure ($G_k^j$) may be selected to be used for clustering the n-dimensional data.
  • the determination module 126 may determine the measurement metric for each dimension of each of the plurality of data points based on comparison of the point cluster index and the dimension cluster index.
  • the measurement metric of a dimension of a data point is equal to 1 if the point cluster index is same as the dimension cluster index. If the point cluster index and the dimension cluster index are not equal, then the measurement metric is equal to 0.
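The comparison described above, and its accumulation over all data points, may be sketched as follows. The function names, the input layout (one point cluster index per data point and one list of dimension cluster indices per data point), and the rule of keeping a fixed number of top-scoring dimensions are assumptions of this sketch.

```python
def goodness(point_index, dimension_indices):
    """Return 1 for each dimension whose cluster index equals the point
    cluster index of the data point, and 0 otherwise."""
    return [1 if d == point_index else 0 for d in dimension_indices]

def dimension_significance(point_indices, dimension_indices_per_point):
    """Accumulate the goodness measure of each dimension over all points."""
    totals = [0] * len(dimension_indices_per_point[0])
    for p_idx, dims in zip(point_indices, dimension_indices_per_point):
        for j, g in enumerate(goodness(p_idx, dims)):
            totals[j] += g
    return totals

def select_dimensions(totals, keep):
    """Indices of the `keep` dimensions with the highest accumulated score."""
    ranked = sorted(range(len(totals)), key=lambda j: totals[j], reverse=True)
    return sorted(ranked[:keep])

# Hypothetical indices for 3 data points with 3 dimensions each.
point_indices = [0, 1, 0]
dimension_indices_per_point = [[0, 1, 0], [1, 1, 0], [0, 0, 1]]
totals = dimension_significance(point_indices, dimension_indices_per_point)
selected = select_dimensions(totals, keep=2)
```

A dimension that agrees with the point-level assignment for most data points accumulates a high score and is retained, which is how the measure supports dimensionality reduction.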
  • each dimension of the data points contributes independently to determining the membership of the data points to the clusters. Since the membership degree, i.e., the degree of belongingness of each dimension to each cluster, is taken into consideration, the distance between each dimension of a data point and the cluster center of that dimension is significantly minimized, and as a result the accuracy of clustering of the input data improves significantly. Further, the cluster assignment of a data point, made by considering the highest aggregated membership for that cluster, can also be ascertained from a set of dimensions having higher membership in that cluster.
  • FIG. 1b illustrates a comparison of an exemplary image segmented by the present clustering system and a conventional clustering system.
  • image 140 is an original image that is to be segmented.
  • image 142 is the segmented image that is obtained as a result of image segmentation performed by the conventional clustering system.
  • image 144 is the segmented image that is obtained as a result of the image segmentation performed by the present clustering system 102, i.e., clustering system described in accordance with the present subject matter.
  • the performance of the segmentation process is evaluated in terms of the number of data points that originally belong to the subjects but are misclassified as the background.
  • the present clustering system 102 outperforms the conventional clustering system by minimizing the misclassification error.
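  • the misclassification count used for this comparison might be computed as in the sketch below. It assumes a ground-truth label mask is available and that label 0 marks the background; both the mask and the label convention are assumptions for illustration, as is the function name `subject_as_background_errors`.

```python
import numpy as np

def subject_as_background_errors(truth, predicted, background=0):
    """Count data points that belong to a subject in `truth` but are
    labelled as background in `predicted`.

    The label value used for background (0) is an assumption.
    """
    truth = np.asarray(truth)
    predicted = np.asarray(predicted)
    return int(np.sum((truth != background) & (predicted == background)))

truth = np.array([[0, 1, 1],
                  [0, 1, 0]])
predicted = np.array([[0, 1, 0],
                      [0, 0, 0]])
errors = subject_as_background_errors(truth, predicted)
print(errors)  # 2
```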
  • Figures 2a and 2b illustrate a method 200 for determining significant dimensions for clustering multi-dimensional data, according to an embodiment of the present subject matter.
  • the method 200 is implemented in a computing device, such as the clustering system 102.
  • the method 200 may be described in the general context of computer executable instructions.
  • computer executable instructions can include routines, programs, objects, components, data structures, procedures, modules, functions, etc., that perform particular functions or implement particular abstract data types.
  • the method 200 may also be practiced in a distributed computing environment where functions are performed by remote processing devices that are linked through a communications network.
  • the order in which the method 200 is described is not intended to be construed as a limitation, and any number of the described method blocks can be combined in any order to implement the method 200, or an alternative method.
  • the method 200 can be implemented in any suitable hardware, software, firmware or combination thereof.
  • the method 200 includes obtaining multidimensional data, where the multi-dimensional data includes a plurality of data points.
  • the multi-dimensional data may be an n-dimensional data and the multi-dimensional data may be represented by a plurality of data points.
  • each of the plurality of data-points may include a plurality of dimensions.
  • the multi-dimensional data may be multimedia data, say an image.
  • the data points of the image can be the pixels and components of the pixels, i.e., RGB (red, green and blue) or HSV (hue, saturation and value) may be the dimensions.
  • the assignment module 120 of the clustering system 102 may obtain the multi-dimensional data from the database 108.
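  • for the image example, the mapping from pixels to data points can be illustrated as below. This is a sketch that assumes the image is held as a height × width × channels array; the helper name `image_to_points` is hypothetical.

```python
import numpy as np

def image_to_points(image):
    """Flatten an (H, W, n) image into an (H*W, n) array of data points.

    Each row is one pixel (a data point); each column is one component
    such as R, G, B or H, S, V (a dimension).
    """
    image = np.asarray(image, dtype=float)
    h, w, n = image.shape
    return image.reshape(h * w, n)

image = np.arange(24).reshape(2, 4, 3)  # toy 2x4 image with 3 components
points = image_to_points(image)
print(points.shape)  # (8, 3)
```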
  • the method 200 includes assigning initial memberships to each dimension of each of the plurality of data points for a plurality of clusters.
  • the plurality of clusters may be pre-defined.
  • a membership assigned to a dimension of a data point may be understood as strength of association between the dimension of the data point and a particular cluster.
  • the memberships can be assigned in a random fashion using a random number between 0 and 1.
  • the assignment module 120 of the clustering system 102 assigns a membership to each dimension of each of the plurality of data points to a plurality of clusters.
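  • a random initial assignment of this kind could look like the sketch below. The text only states that random numbers between 0 and 1 are used; normalising the memberships of each dimension so that they sum to 1 across the clusters is an assumption borrowed from conventional fuzzy c-means, and the function name and array layout are likewise illustrative.

```python
import numpy as np

def init_memberships(num_points, num_dims, num_clusters, seed=0):
    """Return memberships u[i, k, j] for data point i, dimension k, cluster j."""
    rng = np.random.default_rng(seed)
    # Random values in [0, 1), one per (data point, dimension, cluster).
    u = rng.random((num_points, num_dims, num_clusters))
    # Assumption: normalise over clusters so each dimension's memberships sum to 1.
    u /= u.sum(axis=2, keepdims=True)
    return u

u0 = init_memberships(num_points=5, num_dims=3, num_clusters=2)
print(u0.shape)  # (5, 3, 2)
```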
  • the method 200 includes aggregating the initial memberships assigned to the dimensions of the plurality of data points.
  • the aggregation of the memberships may be induced by a fuzziness control parameter (m).
  • the fuzziness control parameter (m) determines the level of fuzziness in a cluster.
  • the value of fuzziness control parameter (m) may be 2.
  • the assignment module 120 aggregates the memberships based on the equation (3) described in the previous section.
  • the method 200 includes computing a cluster center of each of the plurality of clusters based on the aggregated memberships.
  • a cluster center of a cluster may be understood as average of all data points in the cluster.
  • the modification module 122 computes a cluster center based on the equation (4) described in the previous section.
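  • equations (3) and (4) are not reproduced in this part of the description, so the following sketch assumes the conventional fuzzy c-means form applied separately to each dimension: memberships raised to the power m act as weights, and each dimension of a cluster center is the weighted mean of that dimension over all data points. The function name and array shapes are assumptions.

```python
import numpy as np

def cluster_centers(x, u, m=2.0):
    """Per-dimension weighted means (assumed fuzzy c-means form).

    x: (N, n) data points; u: (N, n, c) memberships.
    Returns centers of shape (n, c): centers[k, j] is dimension k of cluster j.
    """
    x = np.asarray(x, dtype=float)
    w = np.asarray(u, dtype=float) ** m            # fuzzified memberships
    # Assumes every cluster has non-zero total weight in every dimension.
    return (w * x[:, :, None]).sum(axis=0) / w.sum(axis=0)

x = np.array([[0.0, 0.0], [2.0, 4.0], [10.0, 10.0]])
u = np.zeros((3, 2, 2))
u[0, :, 0] = 1.0   # point 0 fully in cluster 0
u[1, :, 0] = 1.0   # point 1 fully in cluster 0
u[2, :, 1] = 1.0   # point 2 fully in cluster 1
centers = cluster_centers(x, u)
print(centers)
```

  • with crisp memberships, as in this toy input, the result reduces to the plain average of the data points in each cluster.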
  • the method 200 includes calculating square of distance between each dimension of the cluster center and each dimension of each of the plurality of data points.
  • the modification module 122 calculates a square of distance between each dimension of the cluster center of each of the plurality of clusters and each dimension of each of the plurality of data points.
  • the method 200 includes modifying the initial memberships assigned to the dimensions of each of the plurality of data points. For instance, if the square of the distance between each dimension of the cluster center and each dimension of each of the plurality of data points is greater than 0, then the membership assigned to each dimension is modified. In another instance, if the square of the distance between each dimension of the cluster center and each dimension of each of the plurality of data points is equal to 0, then the membership is set to 1 for the corresponding cluster and set to 0 for the rest of the clusters.
  • the modification module 122 modifies the membership assigned to each dimension of each of the plurality of data points based on the equation (6) described in the previous section.
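  • the two cases above can be sketched as follows. Equation (6) is not reproduced in this part of the description, so the non-zero case uses the conventional fuzzy c-means update applied per dimension, which is an assumption; the zero-distance case follows the text. Splitting the membership evenly when a value coincides with more than one center is also an assumption.

```python
import numpy as np

def update_memberships(x, centers, m=2.0):
    """x: (N, n) data points; centers: (n, c). Returns u of shape (N, n, c)."""
    x = np.asarray(x, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Squared distance between each dimension of each point and of each center.
    d2 = (x[:, :, None] - centers[None, :, :]) ** 2
    zero = d2 == 0
    has_zero = zero.any(axis=2, keepdims=True)
    # Assumed FCM form: u_j proportional to d2_j ** (-1 / (m - 1)).
    safe = np.where(zero, 1.0, d2)                 # placeholder avoids division by zero
    inv = safe ** (-1.0 / (m - 1.0))
    u = inv / inv.sum(axis=2, keepdims=True)
    # Zero distance: membership 1 for the matching cluster, 0 for the rest.
    exact = zero / np.maximum(zero.sum(axis=2, keepdims=True), 1)
    return np.where(has_zero, exact, u)

x = np.array([[0.0], [1.0], [4.0]])
centers = np.array([[0.0, 4.0]])
u_new = update_memberships(x, centers)
print(u_new[:, 0, :])
```

  • here the first and last data points coincide with a center and receive crisp memberships, while the middle point, at squared distances 1 and 9, receives memberships 0.9 and 0.1 for m = 2.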
  • the method 200 includes updating a fuzziness control parameter (m).
  • the fuzziness control parameter (m) may be updated based on weighted sum of initial or previous fuzziness control parameter (m) and modified fuzziness control parameter (m).
  • the modified fuzziness control parameter (m) is computed by taking the partial derivative of each dimension of the cluster centers with respect to the membership degree of each dimension of each of the plurality of data points; the derivative may then be set to zero.
  • the modification module 122 selects the value of the fuzziness control parameter (m) such that it gives minimum absolute value of the partial derivative accumulated over all the dimensions of all the clusters for all data points.
  • the values of the cluster centers, and hence the membership values, are updated until the sum of the absolute differences between the modified memberships and the initial memberships is less than a pre-defined limit (ε) or a predefined number of iterations has been exhausted.
  • the pre-defined limit (ε) may be 0.01.
  • the modification module 122 updates the fuzziness control parameter (m) to stabilize the cluster centers.
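  • the weighted-sum update of the fuzziness control parameter can be illustrated as below. The description does not give the weights, so the single parameter `weight` (with the two weights summing to 1) is purely an assumption, as is the function name; how the modified value itself is obtained from the partial derivatives is not sketched here.

```python
def update_fuzziness(m_prev, m_modified, weight=0.5):
    """Blend the previous and the modified fuzziness control parameter.

    `weight` is a hypothetical blending factor in [0, 1]; the source only
    states that a weighted sum of the two values is used.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * m_prev + (1.0 - weight) * m_modified

m_next = update_fuzziness(2.0, 1.6, weight=0.75)
print(m_next)
```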
  • the method 200 includes identifying a point cluster index for each data point and a dimension cluster index for dimensions of each data point.
  • the point cluster index for a data point may be understood as the index which has a maximum number of 1s in a binary rank matrix for each dimension of the data point.
  • the dimension cluster index for a dimension of a data point may be understood as the index for which binary rank matrix for the dimension of the data point is 1.
  • the identification module 124 identifies a point cluster index for each data point and a dimension cluster index for dimensions of each data point.
  • the method 200 includes assigning each of the plurality of data points to cluster centers of the plurality of clusters based on the point cluster index.
  • the identification module 124 performs a hard assignment of the plurality of data points to the cluster centers of the plurality of clusters using the point cluster index.
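  • one way to read the two indices is sketched below. It assumes the binary rank matrix holds a 1 at the cluster with the highest membership for each dimension of a data point and 0 elsewhere; that construction, the tie-breaking by lowest index, and the function name are assumptions made for this sketch.

```python
import numpy as np

def cluster_indices(u):
    """u: (N, n, c) memberships. Returns (point_idx, dim_idx, rank)."""
    u = np.asarray(u, dtype=float)
    # Dimension cluster index: cluster whose rank-matrix entry is 1 for that dimension.
    dim_idx = u.argmax(axis=2)                      # (N, n)
    # Assumed binary rank matrix: one-hot encoding of the best cluster per dimension.
    rank = np.eye(u.shape[2], dtype=int)[dim_idx]   # (N, n, c)
    # Point cluster index: cluster collecting the most 1s over the dimensions.
    votes = rank.sum(axis=1)                        # (N, c)
    point_idx = votes.argmax(axis=1)                # ties resolve to the lowest index
    return point_idx, dim_idx, rank

u = np.array([[[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]],
              [[0.3, 0.7], [0.4, 0.6], [0.8, 0.2]]])
point_idx, dim_idx, rank = cluster_indices(u)
print(point_idx)  # [0 1]
```

  • the returned `point_idx` is the hard assignment of each data point to a cluster.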
  • the method 200 includes determining a measurement metric for each dimension of each of the plurality of data points.
  • the measurement metric may be determined based on comparison of the point cluster index for each of the plurality of data points and the dimension cluster index for each dimension of each of the plurality of data points.
  • the measurement metric may be understood as a goodness measure (G_k) for each dimension of a data point. If the value of the goodness measure (G_k) is high, the dimension follows the data points well, indicating that the dimension and the corresponding data point index belong to the same cluster. In one implementation, a set of dimensions that have a higher goodness measure (G_k) may be selected to be used for clustering the n-dimensional data.
  • the determination module 126 determines the measurement metric for each dimension of each of the plurality of data points.
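  • selecting the dimensions with the higher goodness measure might be done as in the sketch below. The description does not say how many dimensions are kept, so the parameter `keep` is hypothetical, as is the function name; the input is assumed to be the goodness measure already accumulated over all data points.

```python
import numpy as np

def select_dimensions(g_total, keep):
    """Return the indices of the `keep` dimensions with the highest accumulated
    goodness measure, in ascending index order."""
    g_total = np.asarray(g_total)
    # Stable sort on the negated scores: highest first, ties keep original order.
    order = np.argsort(-g_total, kind="stable")
    return np.sort(order[:keep])

selected = select_dimensions([3, 9, 5, 9], keep=2)
print(selected)  # [1 3]
```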
  • the method blocks 206, 208, 210, 212, and 214 described above are repeated in a plurality of iterations.
  • the plurality of iterations terminates when the sum of the absolute differences between the modified memberships and the initial memberships or previously updated memberships is less than a pre-defined limit (ε).
  • alternatively, the number of iterations is predefined, and the method blocks 206, 208, 210, 212, and 214 are repeated till the predefined number of iterations is exhausted.
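  • the two termination conditions can be sketched as a generic loop. Here `step` stands in for one pass through method blocks 206 to 214 and is supplied by the caller; the function name, the `max_iter` default, and the toy `step` used in the example are assumptions, while the limit 0.01 is the value mentioned in the description.

```python
import numpy as np

def iterate_until_stable(u0, step, eps=0.01, max_iter=100):
    """Repeat `step` until the summed absolute change in the memberships
    drops below `eps` or `max_iter` iterations have been used.

    Returns the final memberships and the number of iterations performed.
    """
    u = np.asarray(u0, dtype=float)
    iterations = 0
    for _ in range(max_iter):
        u_new = step(u)
        change = np.abs(u_new - u).sum()   # sum of absolute differences
        u = u_new
        iterations += 1
        if change < eps:
            break
    return u, iterations

# Toy stand-in for one iteration: move halfway towards a fixed target.
target = np.array([1.0, 1.0])
u_final, n_iter = iterate_until_stable(np.zeros(2), lambda u: 0.5 * u + 0.5 * target)
print(n_iter)  # 8
```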

Abstract

The present invention relates to a method for clustering multi-dimensional data, the method comprising obtaining multi-dimensional data comprising a plurality of data points, each data point having multiple dimensions. Initial memberships are assigned to each dimension for a plurality of clusters, and either the initial memberships or modified memberships assigned to the dimensions of each data point are aggregated, induced by a fuzziness control parameter. Based on the aggregation, a cluster center of each cluster is computed, and the square of the distance between each dimension of the cluster center and each dimension is calculated. Based on the calculation, either the initial memberships or the modified memberships assigned to the plurality of dimensions of each data point are modified, and the fuzziness control parameter is updated. A goodness measurement metric indicative of the significance of each dimension is determined for each dimension based on a comparison of a point cluster index and a dimension cluster index.
PCT/IB2014/001262 2013-07-05 2014-07-03 Regroupement de données multidimensionnelles WO2015001416A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
IN2282MU2013 2013-07-05
IN2282/MUM/2013 2013-07-05

Publications (1)

Publication Number Publication Date
WO2015001416A1 true WO2015001416A1 (fr) 2015-01-08

Family

ID=51399676

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2014/001262 WO2015001416A1 (fr) 2013-07-05 2014-07-03 Regroupement de données multidimensionnelles

Country Status (1)

Country Link
WO (1) WO2015001416A1 (fr)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017062530A1 (fr) * 2015-10-05 2017-04-13 Bayer Healthcare Llc Génération de recommandations de produit orthétique
US11134863B2 (en) 2015-10-05 2021-10-05 Scholl's Wellness Company Llc Generating orthotic product recommendations
WO2017149139A1 (fr) 2016-03-03 2017-09-08 Curevac Ag Analyse d'arn par hydrolyse totale
US11920174B2 (en) 2016-03-03 2024-03-05 CureVac SE RNA analysis by total hydrolysis and quantification of released nucleosides
US11315177B2 (en) * 2019-06-03 2022-04-26 Intuit Inc. Bias prediction and categorization in financial tools
CN110610200A (zh) * 2019-08-27 2019-12-24 浙江大搜车软件技术有限公司 车商分类方法、装置、计算机设备及存储介质
CN113298115A (zh) * 2021-04-19 2021-08-24 百果园技术(新加坡)有限公司 基于聚类的用户分组方法、装置、设备和存储介质
CN113919449A (zh) * 2021-12-15 2022-01-11 国网江西省电力有限公司供电服务管理中心 基于精准模糊聚类算法的居民电力数据聚类方法及装置
CN113919449B (zh) * 2021-12-15 2022-03-15 国网江西省电力有限公司供电服务管理中心 基于精准模糊聚类算法的居民电力数据聚类方法及装置
CN114863151A (zh) * 2022-03-20 2022-08-05 西北工业大学 一种基于模糊理论的图像降维聚类方法
CN114863151B (zh) * 2022-03-20 2024-02-27 西北工业大学 一种基于模糊理论的图像降维聚类方法

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14755705

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14755705

Country of ref document: EP

Kind code of ref document: A1