EP1032918A1 - Data decomposition/reduction method for visualizing data clusters/sub-clusters

Data decomposition/reduction method for visualizing data clusters/sub-clusters

Info

Publication number
EP1032918A1
EP1032918A1 EP99946966A EP99946966A EP1032918A1 EP 1032918 A1 EP1032918 A1 EP 1032918A1 EP 99946966 A EP99946966 A EP 99946966A EP 99946966 A EP99946966 A EP 99946966A EP 1032918 A1 EP1032918 A1 EP 1032918A1
Authority
EP
European Patent Office
Prior art keywords
data
level
clusters
projection
visualization
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP99946966A
Other languages
German (de)
English (en)
Inventor
Joseph Y. Wang
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Catholic University of America
Original Assignee
Catholic University of America
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Catholic University of America
Publication of EP1032918A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/0002 Inspection of images, e.g. flaw detection
    • G06T7/0012 Biomedical image inspection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/231 Hierarchical techniques, i.e. dividing or merging pattern sets so as to obtain a dendrogram
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/40 Software arrangements specially adapted for pattern recognition, e.g. user interfaces or toolboxes therefor

Definitions

  • the present invention relates generically to the field of data analysis and data presentation and, more particularly, to the analysis of data sets having higher-dimensionality data points in order to optimally present the data in a lower-dimensional context, i.e., in a hierarchy of two- or three-dimensional visual contexts, to reveal data structures within the data set.
  • the visualization of data sets having a large number of data points with multiple variables or attributes associated with each data point represents a complex problem.
  • a priori to easily identify groups or subgroups of data points that have relational attributes such that structures and sub-structures existing within the data set can be visualized.
  • Various techniques have been developed for processing the data sets to reveal internal structures as an aid to understanding the data.
  • a large data set will oftentimes have data points that are multi-variant, that is, a single data point can have a multitude of attributes, including attributes that are completely independent of one another or have some degree of inter-attribute relationship or dependency.
  • a single projection of a higher-order data set onto a visualization space may not be able to present all of the structures and sub-structures within the data set of interest in such a way that they can be visually distinguished or discriminated.
  • one presentation schema involves hierarchical visualization, by which the data set is first viewed from a highest-level, whole-data-set viewpoint. Thereafter, features within the highest-level projection are identified in accordance with an algorithm or other identification criteria, and those next-highest-level features are further processed to reveal their respective internal structure in one or more further projections.
  • This hierarchical process can be repeated for successive levels to present successively finer and more detailed views of the data set.
  • in a hierarchical visualization scheme, an image tree is provided, with the successively lower images of the tree revealing more detail.
  • the data set is subjected by Bishop and Tipping to a form of linear latent variable modelling to find a representation of the multidimensional data set in terms of two latent, or "hidden," variables that are determined indirectly from the data set.
  • the modelling is similar to principal component analysis, but defines a probability density in the data space.
  • a single top-level latent variable model is generated with the posterior mean of each data point plotted in the latent space. Any cluster centers identified in this initial plot are used as the basis for initiating the next-lower-level analysis, leading to a mixture of the latent variable models.
  • the parameters, including the optimal projections, are determined by maximum likelihood; this criterion need not always lead to the most interesting or interpretable visualization plots.
  • Disclosure of Invention
  • the present invention provides a data decomposition/reduction method for visualizing large sets of multi-variant data, including the processing of the multi-variant data down to two- or three-dimensional space in order to optimally reveal otherwise hidden structures within the data set, including the principal data cluster or clusters at a first or top level of processing and additional sub-clusters within the principal data clusters in successive lower-level visualizations.
  • the identification of the morphology of clusters and subclusters, of inter-cluster separation, and of relative positioning within a large data set allows investigation of the underlying drive that created the data set morphology and the intra-data-set features.
  • the data set, constituted by a multitude of data points each having a plurality of attributes, is initially processed as a whole using multiple finite normal mixture models and hierarchical visualization spaces to develop the multi-level data visualization and interpretation.
  • the top-level model and its projection explain the entire data set revealing the presence of clusters and cluster relationships, while lower-level models and projections display internal structure within individual clusters, such as the presence of subclusters, which might not be apparent in the higher-level models and projections.
  • each level is relatively simple while the complete hierarchy maintains overall flexibility while still conveying considerable structural information.
  • the arrangement combines (a) minimax entropy modeling by which the models are determined and various parameters estimated and (b) principal component analysis to optimize structure decomposition and dimensionality reduction.
  • the present invention advantageously performs a probabilistic principal component analysis to project the softly partitioned data space down to a desired two-dimensional visualization space, leading to an optimal dimensionality reduction that allows the best extraction and visualization of local clusters.
  • the minimax entropy principle is used to select the model structure and estimate its parameter values, where the soft partitioning of the data set results in a standard finite normal mixture model with minimum conditional bias and variance.
  • the present invention treats structure decomposition and dimensionality reduction as two separate but complementary operations, where the criterion used to optimize dimensionality reduction is the separation of clusters rather than the maximum likelihood approach of Bishop and Tipping.
  • the resulting projections, in turn, enhance the performance of structure decomposition at the next lower level.
  • a model selection procedure is applied to determine the number of subclusters inside each cluster at each level using information-theoretic criteria based upon the minimum of alternate calculations of the Akaike Information Criterion (AIC) and the minimum description length (MDL) criterion. This determination allows the process of the present invention to automatically determine whether a further split of a subspace should be implemented or whether further processing should be terminated.
  • AIC Akaike Information Criterion
  • MDL minimum description length
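The model selection step can be illustrated with a short sketch. It assumes the commonly used forms AIC = -2 log L + 2 K_a and MDL = -log L + 0.5 K_a log N, with K_a counted for a full-covariance Gaussian mixture; the function names and the log-likelihood values are hypothetical and chosen only for illustration, not taken from the patent.

```python
import math

def n_free_params(K, d):
    # Assumption: K means (d each), K full covariances (d(d+1)/2 each),
    # and K - 1 free mixing proportions.
    return K * (d + d * (d + 1) // 2) + (K - 1)

def aic(log_lik, K, d):
    # Akaike Information Criterion (textbook form).
    return -2.0 * log_lik + 2.0 * n_free_params(K, d)

def mdl(log_lik, K, d, n):
    # Minimum description length (textbook form), n = number of data points.
    return -log_lik + 0.5 * n_free_params(K, d) * math.log(n)

def select_k(log_liks, d, n):
    # Return the K that minimizes each criterion over the candidate models.
    ks = sorted(log_liks)
    k_aic = min(ks, key=lambda K: aic(log_liks[K], K, d))
    k_mdl = min(ks, key=lambda K: mdl(log_liks[K], K, d, n))
    return k_aic, k_mdl

# Hypothetical maximized log-likelihoods for K = 1..4 (made-up numbers).
toy_log_liks = {1: -1400.0, 2: -1250.0, 3: -1190.0, 4: -1187.0}
k_aic, k_mdl = select_k(toy_log_liks, d=2, n=300)
print(k_aic, k_mdl)
```

If the selected K for a cluster is 1, no further split of that subspace would be made; otherwise the cluster would be decomposed into K submodels at the next level.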
  • a probabilistic adaptive principal component extraction (PAPEX) algorithm is also applied to estimate the desired number of principal axes. When the dimensionality of the raw data is high, this PAPEX approach is computationally very efficient.
  • the present invention defines a probability distribution in data space which naturally induces a corresponding distribution in projection space through a Radon transform. This defined probability distribution permits an independent procedure in determining values for the intrinsic model parameters without concurrent estimation of the projection mapping matrix.
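For the Gaussian case, the induced distribution can be sketched numerically. The example assumes a linear projection x = W^T t with orthonormal columns in W, so that a Gaussian with mean mu_t and covariance C_t in t-space maps to a Gaussian with mean W^T mu_t and covariance W^T C_t W in x-space. All variable names and the random test data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4

# Hypothetical t-space Gaussian: mean mu_t and a positive definite C_t.
A = rng.normal(size=(d, d))
C_t = A @ A.T + np.eye(d)
mu_t = rng.normal(size=d)

# Hypothetical projection matrix W (d x 2) with orthonormal columns.
W, _ = np.linalg.qr(rng.normal(size=(d, 2)))

# Parameters of the induced distribution in the 2-D projection space.
mu_x = W.T @ mu_t
C_x = W.T @ C_t @ W

# Check on samples: projecting the data transforms the sample
# covariance in exactly the same way.
t = rng.multivariate_normal(mu_t, C_t, size=500)
x = t @ W
S_t = np.cov(t, rowvar=False)
S_x = np.cov(x, rowvar=False)
print(np.allclose(S_x, W.T @ S_t @ W))
```

Because the x-space parameters follow directly from the t-space parameters and W, the mixture parameters can be estimated without estimating W at the same time.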
  • the underlying "drives" that give rise to the data points often form clusters of points, because more than one variable may be a function of the same underlying drive.
  • the data set (designated herein as the t-space) is projected onto a single x-space (i.e., a two-dimensional space), in which a descriptor W is determined from the sample covariance matrix C_t by fitting a single Gaussian model to the data set over t-space.
  • a value f(t) is then determined for K_0, in which the values of π_k, z_ik, μ_tk, and C_tk are further refined by maximizing the likelihood over t-space.
  • G_k(t) is determined by repeating the above process steps to thus construct multiple x-subspaces at the third level; the hierarchy is completed under the information-theoretic criteria using the AIC and the MDL, and all x-space subspaces are plotted for visual evaluation.
  • the present invention advantageously provides a data decomposition/reduction method for visualizing data clusters/sub-clusters within a large data space that is optimally effective and computationally efficient.
  • FIG. 1 is a schematic block diagram of a system for processing a raw multi-variant data set in accordance with the present invention;
  • FIG. 2 is a flow diagram of the process flow of the present invention;
  • FIG. 2A is an alternative visualization of the process flow of the present invention;
  • FIG. 3 is an example of the projection of a data set onto a 2-dimensional visualization space after determination of the principal axis;
  • FIG. 4A is a 2-dimensional visualization space of one of the clusters of FIG. 3;
  • FIG. 4B is a 2-dimensional visualization space of another of the clusters of FIG. 3;
  • FIG. 5 is an example of the projection of a data set onto a 2-dimensional visualization space after determination of the principal axis;
  • FIG. 6A is a 2-dimensional visualization space of one of the clusters of FIG. 5;
  • FIG. 6B is a 2-dimensional visualization space of a second of the clusters of FIG. 5;
  • FIG. 6C is a 2-dimensional visualization space of a third of the clusters of FIG. 5.
  • A processing system for implementing the dimensionality reduction using probabilistic principal component analysis and the structure decomposition using adaptive expectation-maximization methods for visualizing data in accordance with the present invention is shown in FIG. 1 and designated generally therein by the reference character 10.
  • the system 10 includes a working memory 12 that accepts the raw multi-variant data set, indicated at 14, and which bi-directionally interfaces with a processor 16.
  • the processor 16 processes the raw t-space data set 14 as explained in more detail below and presents that data to a graphical user interface (GUI) 18, which presents a two- or three-dimensional visual presentation to the user as also explained below.
  • GUI graphical user interface
  • a plotter or printer 20 can be provided to generate a printed record of the display output of the graphical user interface (GUI) .
  • the processor 16 may take the form of a software- or firmware-programmed CPU, ALU, ASIC, or microprocessor, or a combination thereof.
  • the data set is subject to a global principal component analysis to thereafter effect a top-most projection.
  • This step is initiated by determining the value of a variable W for the top-most projection in the hierarchy of projections.
  • W is directly found by evaluating the covariance matrix C_t.
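A minimal sketch of this top-level step follows, assuming that W consists of the eigenvectors of C_t belonging to its two largest eigenvalues and that the projection is taken about the sample mean. The function name `top_level_projection` and the synthetic data are hypothetical.

```python
import numpy as np

def top_level_projection(t, n_axes=2):
    """Project t (n_samples x d) onto the leading principal axes of C_t."""
    mu_t = t.mean(axis=0)
    centered = t - mu_t
    # Sample covariance matrix C_t of the whole data set.
    C_t = centered.T @ centered / t.shape[0]
    # eigh handles symmetric matrices and returns ascending eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(C_t)
    order = np.argsort(eigvals)[::-1][:n_axes]
    W = eigvecs[:, order]      # d x n_axes projection matrix
    x = centered @ W           # n_samples x n_axes coordinates in x-space
    return W, x, mu_t

# Synthetic 5-dimensional data with two dominant directions (illustrative).
rng = np.random.default_rng(0)
t = rng.normal(size=(300, 5)) * np.array([3.0, 2.0, 0.5, 0.5, 0.5])
W, x, mu_t = top_level_projection(t)
print(W.shape, x.shape)
```

The rows of `x` are the points that would be plotted in the top-level visualization space.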
  • APEX adaptive principal components extraction
  • the two-step expectation-maximization (EM) algorithm can be applied to allow a standard finite normal mixture model (SFNM), i.e., where
  • the standard finite normal mixture (SFNM) modeling solution addresses the estimation of the regional parameters (π_k, μ_tk) and the detection of the structural parameter K_0 in the relationship
  • the EM algorithm is implemented as a two-step process, i.e., the E-step and the M-step as follows:
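As a minimal sketch of the two steps, assuming the textbook EM updates for a finite normal mixture rather than the patent's own equations, the E-step computes the posterior probabilities z_ik and the M-step re-estimates the mixing proportions, means, and covariances. The names `gaussian_pdf` and `em_sfnm` are hypothetical, and the covariances start as unit matrices.

```python
import numpy as np

def gaussian_pdf(x, mu, C):
    # Multivariate normal density evaluated at each row of x.
    d = x.shape[1]
    diff = x - mu
    sol = np.linalg.solve(C, diff.T).T
    expo = -0.5 * np.sum(diff * sol, axis=1)
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(C))
    return np.exp(expo) / norm

def em_sfnm(x, means, n_iter=50):
    n, d = x.shape
    K = means.shape[0]
    mu = means.astype(float).copy()
    C = np.stack([np.eye(d) for _ in range(K)])   # unit covariance start
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: posterior probability z[i, k] that point i belongs to k.
        dens = np.stack(
            [pi[k] * gaussian_pdf(x, mu[k], C[k]) for k in range(K)], axis=1
        )
        z = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update mixing proportions, means, and covariances.
        nk = z.sum(axis=0)
        pi = nk / n
        mu = (z.T @ x) / nk[:, None]
        for k in range(K):
            diff = x - mu[k]
            # Small ridge keeps the covariance invertible.
            C[k] = (z[:, k, None] * diff).T @ diff / nk[k] + 1e-6 * np.eye(d)
    return pi, mu, C, z

# Two synthetic 2-D clusters with rough initial cluster centers.
rng = np.random.default_rng(0)
x = np.vstack([
    rng.normal(loc=[-3.0, 0.0], scale=1.0, size=(150, 2)),
    rng.normal(loc=[3.0, 0.0], scale=1.0, size=(150, 2)),
])
pi, mu, C, z = em_sfnm(x, np.array([[-1.0, 0.0], [1.0, 0.0]]))
print(pi, mu)
```

The same routine can be run over t-space by passing the higher-dimensional data and initial means in place of their x-space counterparts.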
  • K a 7K 0 - 1 (i.e., the values of Akaike's Information Criterion (AIC) and the Minimum Description Length (MDL) for K, with selection of a model in which K corresponds to the minimum of the
  • EQ. 9 are then used as the initial means of the respective submodels. Since the mixing proportions π are projection-invariant, a 2 x 2 unit matrix is assigned to the remaining parameters of the covariance matrix C_tk.
  • the expectation-maximization (EM) algorithm can again be applied to allow a standard finite normal mixture (SFNM) with K_0 submodels to be fitted to the data over t-space.
  • SFNM standard finite normal mixture
  • the corresponding EM algorithm can be derived by replacing all x in the E-step and the M-step equations, above, by t.
  • C_tk can be directly evaluated to obtain W_k as described above.
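The direct evaluation can be sketched as a posterior-weighted principal component analysis. The sketch assumes that C_tk is the covariance of the data weighted by the posterior probabilities z_ik of cluster k and that W_k holds its two leading eigenvectors. The function name `local_projection` and the posterior values below are hypothetical.

```python
import numpy as np

def local_projection(t, z_k, n_axes=2):
    """Posterior-weighted PCA for one cluster k.

    t   : (n_samples, d) data in t-space
    z_k : (n_samples,) posterior probabilities of membership in cluster k
    """
    nk = z_k.sum()
    # Weighted mean and weighted covariance of cluster k.
    mu_tk = (z_k[:, None] * t).sum(axis=0) / nk
    diff = t - mu_tk
    C_tk = (z_k[:, None] * diff).T @ diff / nk
    eigvals, eigvecs = np.linalg.eigh(C_tk)
    W_k = eigvecs[:, np.argsort(eigvals)[::-1][:n_axes]]
    x_k = diff @ W_k    # coordinates in the local visualization subspace
    return W_k, x_k, mu_tk, C_tk

# Two synthetic 3-D clusters and made-up posteriors for the first one.
rng = np.random.default_rng(2)
t = np.vstack([
    rng.normal(loc=-4.0, size=(100, 3)),
    rng.normal(loc=4.0, size=(100, 3)),
])
z_1 = np.concatenate([np.full(100, 0.99), np.full(100, 0.01)])
W_1, x_1, mu_t1, C_t1 = local_projection(t, z_1)
print(W_1.shape, mu_t1)
```

For high-dimensional raw data the full eigendecomposition may be costly, which is where an iterative extraction of only the leading axes becomes attractive.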
  • an algorithm termed the probabilistic adaptive principal component extraction (PAPEX) is applied as follows.
  • $w_{1k}(i+1) = w_{1k}(i) + \eta\,[\,y_{1k}(i)\,t_{ik} - \dots$
  • $a_k(i+1) = a_k(i) - \eta\,[\,y_{1k}(i)\,y_{2k}(i) + y_{2k}^{2}(i)\,a_k(i)\,]$
  • W_k the eigenvector associated with the second largest eigenvalue of the covariance matrix C_k.
  • the determination of the parameters of the models at the third level can again be viewed as a two-step estimation problem, in which a further split of the models at the second level is determined within each of the subspaces over x-space, and then the parameters of the selected models are fine-tuned over t-space.
  • the learning of ⁇ k (x) can again be performed using the expectation-maximization (EM) algorithm and the model selection procedures described above.
  • the third-level EM algorithm has the same form as the EM algorithm at the second level, except that in the E-step, the posterior probability that a data point x_i belongs to submodel j is given by
  • EQ. 19 are then used to initialize the means of the respective submodels, and the expectation-maximization (EM) algorithm can be applied to allow a standard finite normal mixture (SFNM) distribution with K_0 submodels to be fitted to the data over t-space.
  • the formulation can be derived by simply replacing all x in the second-level M-step by t. With the resulting z_i(k,j) in t-space, the PAPEX algorithm can be applied to estimate W_(k,j), in which the effective input values are expressed by
  • $\pi_k\, z_{i(k,j)}\,(t_i - \mu_{t(k,j)})$  EQ. 20
  • the next-level visualization subspace is generated by plotting each data point t_i at the corresponding
  • A first exemplary two-level implementation of the present invention is shown in FIGS. 3, 4A, and 4B, in which the entire data set is presented in the top-level projection and two local clusters within that top-level projection are each individually presented in FIGS. 4A and 4B.
  • the entire data set is subject to principal component analysis as described above to obtain the principal axis or axes (axis A_x being representative) for the top-level display. Additionally, the axis (unnumbered) for each of the apparent clusters is displayed. Thereafter, the apparent centers of the two clusters are identified and the data are subjected to the aforementioned processing to further reveal the local cluster of FIG. 4A and the local cluster of FIG. 4B.
  • A second exemplary two-level implementation of the present invention is shown in FIGS. 5, 6A, 6B, and 6C, in which the entire data set is presented in the top-level projection and three local clusters within that top-level projection are each individually presented in FIGS. 6A, 6B, and 6C.
  • the entire data set is subject to principal component analysis as described above to obtain the principal axis (A_x) and the axis (unnumbered) for each of the apparent clusters as displayed.
  • the t-space raw data set arises from a mixture of three Gaussians consisting of 300 data points as presented in FIG. 5.
  • two cloud-like clusters are well separated, while a third cluster appears spaced in between the two well-separated cloud-like clusters.
  • the second-level visual space is generated with a mixture of two local principal component axis subspaces, where the line A_x indicates the global principal axis.
  • the plot on the "right" of FIG. 5 shows evidence of further split.
  • a hierarchical model is adopted, which illustrates that there are indeed three clusters in total within the data set, as shown in FIGS. 6A, 6B, and 6C.
  • An alternate visualization of the process flow of the present invention is shown in FIG.
  • the present invention has use in all applications requiring the analysis of data, particularly multi-dimensional data, for the purpose of optimally visualizing various underlying structures and distributions present within the universe of data. Applications include the detection of data clusters and sub-clusters and their relative relationships in areas such as medical, industrial, and geophysical imaging and digital library processing.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
  • Radiology & Medical Imaging (AREA)
  • Health & Medical Sciences (AREA)
  • Quality & Reliability (AREA)
  • Probability & Statistics with Applications (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)
  • Complex Calculations (AREA)

Abstract

In this method, high-dimensional data are visualized hierarchically so that the complete data set can be viewed from the top down, in terms of clusters and, at lower levels, sub-clusters. The data set is subjected to standard finite normal mixture models and probabilistic principal component projections, whose parameters are estimated using expectation-maximization and principal component analysis under the Akaike information criterion (AIC) and the minimum description length (MDL) criterion. The high-dimensional raw data are processed by principal component analysis to reveal the dominant distribution of the data at a first level. The information thus processed is then reprocessed to reveal the various sub-clusters within the first-level clusters. The clusters and sub-clusters at the different hierarchical levels are projected visually to reveal their underlying structure. The system is useful in any application in which high-dimensional multidimensional data must be reduced to a two- or three-dimensional projection space to allow visual exploration of the underlying structure of the data set.
EP99946966A 1998-09-17 1999-09-17 Data decomposition/reduction method for visualizing data clusters/sub-clusters Withdrawn EP1032918A1 (fr)

Applications Claiming Priority (5)

Application Number Priority Date Filing Date Title
US10062298P 1998-09-17 1998-09-17
US100622P 1998-09-17
US39842199A 1999-09-17 1999-09-17
PCT/US1999/021363 WO2000016250A1 (fr) 1998-09-17 1999-09-17 Data decomposition/reduction method for visualizing data clusters/sub-clusters
US398421 1999-09-17

Publications (1)

Publication Number Publication Date
EP1032918A1 true EP1032918A1 (fr) 2000-09-06

Family

ID=26797375

Family Applications (1)

Application Number Title Priority Date Filing Date
EP99946966A Withdrawn EP1032918A1 (fr) 1998-09-17 1999-09-17 Data decomposition/reduction method for visualizing data clusters/sub-clusters

Country Status (5)

Country Link
EP (1) EP1032918A1 (fr)
JP (1) JP2002525719A (fr)
AU (1) AU5926299A (fr)
CA (1) CA2310333A1 (fr)
WO (1) WO2000016250A1 (fr)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2453608C (fr) 2003-12-17 2007-11-06 Ibm Canada Limited - Ibm Canada Limitee Evaluation of storage requirements for a multi-dimensional data clustering configuration
JP4670010B2 (ja) * 2005-10-17 2011-04-13 株式会社国際電気通信基礎技術研究所 Mobile object distribution estimation device, mobile object distribution estimation method, and mobile object distribution estimation program
US8239379B2 (en) * 2007-07-13 2012-08-07 Xerox Corporation Semi-supervised visual clustering
US20090232388A1 (en) * 2008-03-12 2009-09-17 Harris Corporation Registration of 3d point cloud data by creation of filtered density images
JP5332647B2 (ja) * 2009-01-23 2013-11-06 日本電気株式会社 Model selection device, selection method for a model selection device, and program
US9424337B2 (en) 2013-07-09 2016-08-23 Sas Institute Inc. Number of clusters estimation
US9202178B2 (en) 2014-03-11 2015-12-01 Sas Institute Inc. Computerized cluster analysis framework for decorrelated cluster identification in datasets
CN105447001B (zh) * 2014-08-04 2018-12-14 华为技术有限公司 High-dimensional data dimensionality reduction method and device
JP6586764B2 (ja) * 2015-04-17 2019-10-09 株式会社Ihi Data analysis device and data analysis method
US9996543B2 (en) 2016-01-06 2018-06-12 International Business Machines Corporation Compression and optimization of a specified schema that performs analytics on data within data systems
US11164106B2 (en) * 2018-03-19 2021-11-02 International Business Machines Corporation Computer-implemented method and computer system for supervised machine learning
US11847132B2 (en) 2019-09-03 2023-12-19 International Business Machines Corporation Visualization and exploration of probabilistic models

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See references of WO0016250A1 *

Also Published As

Publication number Publication date
AU5926299A (en) 2000-04-03
CA2310333A1 (fr) 2000-03-23
JP2002525719A (ja) 2002-08-13
WO2000016250A1 (fr) 2000-03-23

Similar Documents

Publication Publication Date Title
Stanford et al. Finding curvilinear features in spatial point patterns: principal curve clustering with noise
Tirandaz et al. A two-phase algorithm based on kurtosis curvelet energy and unsupervised spectral regression for segmentation of SAR images
Clausi K-means Iterative Fisher (KIF) unsupervised clustering algorithm applied to image texture segmentation
Attene et al. Hierarchical mesh segmentation based on fitting primitives
Keuchel et al. Binary partitioning, perceptual grouping, and restoration with semidefinite programming
WO2000016250A1 (fr) Data decomposition/reduction method for visualizing data clusters/sub-clusters
Krasnoshchekov et al. Order-k α-hulls and α-shapes
Demir et al. Coupled segmentation and similarity detection for architectural models
Allassonniere et al. A stochastic algorithm for probabilistic independent component analysis
Tsuchie et al. High-quality vertex clustering for surface mesh segmentation using Student-t mixture model
Bergamasco et al. A graph-based technique for semi-supervised segmentation of 3D surfaces
Lavoué et al. Markov Random Fields for Improving 3D Mesh Analysis and Segmentation.
Ali et al. Review on fuzzy clustering algorithms
AlZu′ bi et al. 3D medical volume segmentation using hybrid multiresolution statistical approaches
Blanchet et al. Triplet Markov fields for the classification of complex structure data
Vilalta et al. An efficient approach to external cluster assessment with an application to martian topography
Huang et al. Texture classification by multi-model feature integration using Bayesian networks
Kouritzin et al. A graph theoretic approach to simulation and classification
Gehre et al. Feature Curve Co‐Completion in Noisy Data
Li et al. High resolution radar data fusion based on clustering algorithm
Marras et al. 3D geometric split–merge segmentation of brain MRI datasets
Guizilini et al. Iterative continuous convolution for 3d template matching and global localization
Huang et al. Image segmentation using an efficient rotationally invariant 3D region-based hidden Markov model
Roy et al. A finite mixture model based on pair-copula construction of multivariate distributions and its application to color image segmentation
Li Unsupervised texture segmentation using multiresolution Markov random fields

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 20000515

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE

17Q First examination report despatched

Effective date: 20030321

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20040702