WO2002103954A2 - Data mining platform for bioinformatics and other knowledge discovery

Data mining platform for bioinformatics and other knowledge discovery

Info

Publication number
WO2002103954A2
Authority
WO
WIPO (PCT)
Prior art keywords
data
features
feature
gene
genes
Prior art date
Application number
PCT/US2002/019202
Other languages
English (en)
Other versions
WO2002103954A3 (fr)
Inventor
Isabelle Guyon
Edward Reiss
René DOURSAT
David Lewis
Jason Weston
Original Assignee
Biowulf Technologies, Llc
Priority date
Filing date
Publication date
Application filed by Biowulf Technologies, Llc
Priority to AU2002304006A (AU2002304006A1)
Priority to US10/481,068 (US7444308B2)
Publication of WO2002103954A2
Publication of WO2002103954A3
Priority to US11/928,641 (US7542947B2)
Priority to US13/079,198 (US8126825B2)

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/20 Heterogeneous data integration
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/10 Gene or protein expression profiling; Expression-ratio estimation or normalisation
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis
    • G16B40/30 Unsupervised data analysis

Definitions

  • FIG. 9 is an exemplary screen shot of an interface for the Gene Search Assistant application for bioinformatics for use in searching published information.
  • the following sequence illustrates application of recursive feature elimination (RFE) to a SVM using the weight magnitude as the ranking criterion.
  • SRS6: SRS is a program developed at the European Bioinformatics Institute for the indexing and cross-referencing of databases of textual information. It provides unified access to molecular biology databases, integration of analysis tools and advanced parsing tools for disseminating and reformatting information stored in ASCII text.
  • Ranked lists of features can also be visualized as a matrix of colored coefficients.
  • the columns of the matrix represent all of the values a given feature takes across all patterns.
  • the columns are ordered according to the feature ranking.
  • the rows of the matrix may be ordered, for example, to group the examples of a same class together.
  • a matrix can be transposed.
  • One can also represent ranked lists of feature subsets, particularly equivalent features, in this way. Nested subsets of features with cardinality increments of one can be visualized by printing the feature identifiers in the order that they are added to increase the cardinality of the feature subsets.
  • the identifiers, or their background can then be optionally colored according to the score of the subset containing all the features from the beginning of the list to that feature.
  • feature f a singleton
  • color 1 illustrated as low density dots
  • feature f5 is indicated by a box filled with color 8 (illustrated as grid lines) to indicate the highest score.
  • FIG. 14 illustrates the gene tree (observation graph) corresponding to the screen information in FIG. 11.
  • This tree was generated from DNA microarray data of colon cancer and normal patients.
  • Several runs using the RFE-SVM algorithm were used to generate alternative nested subsets of genes.
  • the nodes are labeled with GANs.
  • the quality of every subset of genes can be assessed, for example, by the success rate of a classifier trained with these genes.
  • the shading (color) of the last node of a given path indicates the quality of the subset. In the present example, a scale of 64 shades, or colors, was used to map the leave-one-out success rate.
  • a binary tree of depth 4 is constructed. This means that for every gene selection, only two alternatives are presented, and that up to four genes can be selected. Wider trees (with more children at every node) permit selection from a wider variety of genes. Deeper trees provide for selection of a larger number of genes.
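
The definitions above mention recursive feature elimination (RFE) applied to an SVM with the weight magnitude as the ranking criterion, and a 64-shade scale for mapping success rates. The following is a minimal Python sketch of that idea, not the implementation described in the application. It assumes a linear SVM trained by plain batch subgradient descent, removal of one feature per iteration, and a simple linear shade mapping. The names train_linear_svm, rfe_ranking, shade_index, X_demo and y_demo, and the hyperparameter values, are hypothetical and chosen only for illustration.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=300):
    """Fit a linear SVM (hinge loss + L2 penalty) by batch subgradient descent.

    X is a list of feature rows, y a list of labels in {-1, +1}.
    Returns the weight vector w and the bias b.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the L2 penalty
        grad_b = 0.0
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * score < 1.0:  # this pattern violates the margin
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def rfe_ranking(X, y):
    """Rank feature indices best-first by recursive feature elimination."""
    remaining = list(range(len(X[0])))
    eliminated = []
    while remaining:
        # Retrain on the surviving features only.
        X_sub = [[row[j] for j in remaining] for row in X]
        w, _ = train_linear_svm(X_sub, y)
        # Ranking criterion: the feature with the smallest weight magnitude goes.
        worst = min(range(len(remaining)), key=lambda k: abs(w[k]))
        eliminated.append(remaining.pop(worst))
    return eliminated[::-1]  # last eliminated = most useful


def shade_index(success_rate, n_shades=64):
    """Map a success rate in [0, 1] to one of n_shades shade indices (assumed linear)."""
    return min(n_shades - 1, int(success_rate * n_shades))


# Toy data (hypothetical): feature 0 separates the classes strongly,
# feature 1 weakly, and feature 2 carries no class information.
X_demo = [
    [2.0, 0.5, 0.1], [1.8, 0.4, -0.1], [2.2, 0.6, 0.1], [1.9, 0.5, -0.1],
    [-2.0, -0.5, 0.1], [-1.8, -0.4, -0.1], [-2.2, -0.6, 0.1], [-1.9, -0.5, -0.1],
]
y_demo = [1, 1, 1, 1, -1, -1, -1, -1]

print(rfe_ranking(X_demo, y_demo))
```

Because the criterion is the raw weight magnitude, the ranking depends on feature scaling, so in practice features would normally be normalized before running the elimination loop.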

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Genetics & Genomics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a data mining platform comprising a plurality of system modules, each formed from a plurality of components. Each module has an input data component, a data analysis engine for processing the input data, an output data component for outputting the results of the data analysis, and a web server for accessing and monitoring the other modules within the unit and for providing communication with other units. Each module processes a different type of data; for example, a first module processes microarray (gene expression) data while a second module processes biomedical literature on the Internet for information supporting relationships between genes and diseases and gene functionality. In the preferred embodiment of the invention, the data analysis engine is a kernel-based learning machine and, in particular, one or more support vector machines (SVMs). The data analysis engine includes a pre-processing function for feature selection, which reduces the amount of data to be processed by selecting the optimum number of attributes, or "features", relevant to the information to be discovered.
PCT/US2002/019202 1998-05-01 2002-06-17 Data mining platform for bioinformatics and other knowledge discovery WO2002103954A2 (fr)

Priority Applications (4)

Application Number Priority Date Filing Date Title
AU2002304006A AU2002304006A1 (en) 2001-06-15 2002-06-17 Data mining platform for bioinformatics and other knowledge discovery
US10/481,068 US7444308B2 (en) 2001-06-15 2002-06-17 Data mining platform for bioinformatics and other knowledge discovery
US11/928,641 US7542947B2 (en) 1998-05-01 2007-10-30 Data mining platform for bioinformatics and other knowledge discovery
US13/079,198 US8126825B2 (en) 1998-05-01 2011-04-04 Method for visualizing feature ranking of a subset of features for classifying data using a learning machine

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
US29875701P 2001-06-15 2001-06-15
US29886701P 2001-06-15 2001-06-15
US29884201P 2001-06-15 2001-06-15
US60/298,867 2001-06-15
US60/298,842 2001-06-15
US60/298,757 2001-06-15

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2002/016012 Continuation-In-Part WO2002095534A2 (fr) 1998-05-01 2002-05-20 Methods for feature selection in a learning machine

Related Child Applications (3)

Application Number Title Priority Date Filing Date
US10481068 A-371-Of-International 2002-06-17
US11/928,606 Continuation US7921068B2 (en) 1998-05-01 2007-10-30 Data mining platform for knowledge discovery from heterogeneous data types and/or heterogeneous data sources
US11/928,641 Continuation US7542947B2 (en) 1998-05-01 2007-10-30 Data mining platform for bioinformatics and other knowledge discovery

Publications (2)

Publication Number Publication Date
WO2002103954A2 (fr) 2002-12-27
WO2002103954A3 (fr) 2003-04-03

Family

ID=27404588

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2002/019202 WO2002103954A2 (fr) 1998-05-01 2002-06-17 Data mining platform for bioinformatics and other knowledge discovery

Country Status (2)

Country Link
AU (1) AU2002304006A1 (fr)
WO (1) WO2002103954A2 (fr)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7243100B2 (en) * 2003-07-30 2007-07-10 International Business Machines Corporation Methods and apparatus for mining attribute associations
WO2008078293A1 (fr) * 2006-12-22 2008-07-03 International Business Machines Corporation Computer-implemented method, computer program and system for analyzing data records
WO2010072382A1 (fr) * 2008-12-22 2010-07-01 Roche Diagnostics Gmbh System and method for analyzing genomic data
US10515715B1 (en) 2019-06-25 2019-12-24 Colgate-Palmolive Company Systems and methods for evaluating compositions
CN112116952A (zh) * 2020-08-06 2020-12-22 温州大学 Gene selection method using a grey wolf optimization algorithm based on diffusion and chaotic local search
US11521751B2 (en) * 2020-11-13 2022-12-06 Zhejiang Lab Patient data visualization method and system for assisting decision making in chronic diseases

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6266668B1 (en) * 1998-08-04 2001-07-24 Dryken Technologies, Inc. System and method for dynamic data-mining and on-line communication of customized information
US20020052882A1 (en) * 2000-07-07 2002-05-02 Seth Taylor Method and apparatus for visualizing complex data sets
US20020083067A1 (en) * 2000-09-28 2002-06-27 Pablo Tamayo Enterprise web mining system and method
US20020095260A1 (en) * 2000-11-28 2002-07-18 Surromed, Inc. Methods for efficiently mining broad data sets for biological markers
US20020111742A1 (en) * 2000-09-19 2002-08-15 The Regents Of The University Of California Methods for classifying high-dimensional biological data
US20020120405A1 (en) * 2000-09-27 2002-08-29 Aled Edwards Protein data analysis
US20020119462A1 (en) * 2000-07-31 2002-08-29 Mendrick Donna L. Molecular toxicology modeling
US20020133504A1 (en) * 2000-10-27 2002-09-19 Harry Vlahos Integrating heterogeneous data and tools
US6470333B1 (en) * 1998-07-24 2002-10-22 Jarg Corporation Knowledge extraction system and method
US20020165845A1 (en) * 2001-05-02 2002-11-07 Gogolak Victor V. Method and system for web-based analysis of drug adverse effects

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6470333B1 (en) * 1998-07-24 2002-10-22 Jarg Corporation Knowledge extraction system and method
US6266668B1 (en) * 1998-08-04 2001-07-24 Dryken Technologies, Inc. System and method for dynamic data-mining and on-line communication of customized information
US20020049704A1 (en) * 1998-08-04 2002-04-25 Vanderveldt Ingrid V. Method and system for dynamic data-mining and on-line communication of customized information
US20020052882A1 (en) * 2000-07-07 2002-05-02 Seth Taylor Method and apparatus for visualizing complex data sets
US20020119462A1 (en) * 2000-07-31 2002-08-29 Mendrick Donna L. Molecular toxicology modeling
US20020111742A1 (en) * 2000-09-19 2002-08-15 The Regents Of The University Of California Methods for classifying high-dimensional biological data
US20020120405A1 (en) * 2000-09-27 2002-08-29 Aled Edwards Protein data analysis
US20020083067A1 (en) * 2000-09-28 2002-06-27 Pablo Tamayo Enterprise web mining system and method
US20020133504A1 (en) * 2000-10-27 2002-09-19 Harry Vlahos Integrating heterogeneous data and tools
US20020095260A1 (en) * 2000-11-28 2002-07-18 Surromed, Inc. Methods for efficiently mining broad data sets for biological markers
US20020165845A1 (en) * 2001-05-02 2002-11-07 Gogolak Victor V. Method and system for web-based analysis of drug adverse effects

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
KEMP ET AL.: 'Using the functional data model to integrate distributed biological data sources' PROCEEDINGS OF THE 8TH INTERNATIONAL CONFERENCE ON SCIENTIFIC AND STATISTICAL DATABASE SYSTEMS June 1996, pages 176 - 185, XP002958893 *
MOORE S.K.: 'Harmonizing data, setting standards' GENOMICS INFORMATION SETS, IEEE SPECTRUM vol. 38, no. 1, January 2001, pages 111 - 112, XP002958891 *
PAVLIDIS ET AL.: 'Gene functional classification from heterogeneous data' PROCEEDINGS OF THE 5TH INTERNATIONAL CONFERENCE ON COMPUTATIONAL BIOLOGY April 2001, pages 249 - 255, XP000988076 *
SYED ET AL.: 'A study of support vectors on model independent example selection' PROCEEDINGS OF THE 5TH ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING July 1999, pages 272 - 276, XP002958894 *
WALKER R.L.: 'Parallel clustering system using the methodologies of evolutionary computations' PROCEEDINGS OF THE 2001 CONGRESS ON EVOLUTIONARY COMPUTATION 2001, pages 831 - 838, XP002958892 *
YANG ET AL.: 'Data-driven theory refinement algorithms for bioinformatics' INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS July 1999, pages 4064 - 4068, XP010372571 *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7243100B2 (en) * 2003-07-30 2007-07-10 International Business Machines Corporation Methods and apparatus for mining attribute associations
WO2008078293A1 (fr) * 2006-12-22 2008-07-03 International Business Machines Corporation Computer-implemented method, computer program and system for analyzing data records
US7953677B2 (en) 2006-12-22 2011-05-31 International Business Machines Corporation Computer-implemented method, computer program and system for analyzing data records by generalizations on redundant attributes
WO2010072382A1 (fr) * 2008-12-22 2010-07-01 Roche Diagnostics Gmbh System and method for analyzing genomic data
US10839942B1 (en) 2019-06-25 2020-11-17 Colgate-Palmolive Company Systems and methods for preparing a product
US10839941B1 (en) 2019-06-25 2020-11-17 Colgate-Palmolive Company Systems and methods for evaluating compositions
US10515715B1 (en) 2019-06-25 2019-12-24 Colgate-Palmolive Company Systems and methods for evaluating compositions
US10861588B1 (en) 2019-06-25 2020-12-08 Colgate-Palmolive Company Systems and methods for preparing compositions
US11315663B2 (en) 2019-06-25 2022-04-26 Colgate-Palmolive Company Systems and methods for producing personal care products
US11342049B2 (en) 2019-06-25 2022-05-24 Colgate-Palmolive Company Systems and methods for preparing a product
US11728012B2 (en) 2019-06-25 2023-08-15 Colgate-Palmolive Company Systems and methods for preparing a product
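CN112116952A (zh) * 2020-08-06 2020-12-22 温州大学 Gene selection method using a grey wolf optimization algorithm based on diffusion and chaotic local search
CN112116952B (zh) * 2020-08-06 2024-02-09 温州大学 Gene selection method using a grey wolf optimization algorithm based on diffusion and chaotic local search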
US11521751B2 (en) * 2020-11-13 2022-12-06 Zhejiang Lab Patient data visualization method and system for assisting decision making in chronic diseases

Also Published As

Publication number Publication date
AU2002304006A1 (en) 2003-01-02
WO2002103954A3 (fr) 2003-04-03

Similar Documents

Publication Publication Date Title
US8126825B2 (en) Method for visualizing feature ranking of a subset of features for classifying data using a learning machine
US7444308B2 (en) Data mining platform for bioinformatics and other knowledge discovery
JP7305656B2 (ja) Systems and methods for modeling probability distributions
Guyon et al. An introduction to variable and feature selection
Malley et al. Statistical learning for biomedical data
Srinivasu et al. Using recurrent neural networks for predicting type-2 diabetes from genomic and tabular data
Yip et al. A survey of classification techniques for microarray data analysis
Kumar et al. A case study on machine learning and classification
WO2002103954A2 (fr) Data mining platform for bioinformatics and other knowledge discovery
Sánchez-Maroño et al. Classification of microarray data
Altınçay Decision trees using model ensemble-based nodes
WO2022212337A1 (fr) Graph database techniques for machine learning
Shaer et al. Learning to increase the power of conditional randomization tests
AU2020101987A4 (en) DIMA-Dataset Discovery: DATASET DISCOVERY IN DATA INVESTIGATIVE USING MACHINE LEARNING AND AI-BASED PROGRAMMING
Ni et al. HEAL: Brain-inspired Hyperdimensional Efficient Active Learning
Tan et al. Machine learning and its application to bioinformatics: an overview
Nilsson Nonlinear dimensionality reduction of gene expression data
Mumbuçoğlu Classification of microarray gene expression cancer data by using artificial intelligence methods
Sevilla-Villanueva A methodology for pre-post intervention studies: An application for a nutritional case study
Sun Improving classification performance of microarray analysis by feature selection and feature extraction methods
Young II Disease endotypes of type 1 diabetes: Exploration through machine learning and topological data analysis
Joshi et al. Smart Health Prediction System Using Data Mining
Sasirekha et al. Identification and Classification of Leukemia Using Machine Learning Approaches
Troisi et al. Data analysis in metabolomics: from information to knowledge
Thanigainathan USING ENSEMBLE CLUSTERING TO IDENTIFY PHENOTYPES OF DIABETES PATIENTS FOR EVALUATING DISEASE PROGRESSION

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ OM PH PL PT RO RU SD SE SG SI SK SL TJ TM TN TR TT TZ UA UG US UZ VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 69(1) EPC

122 Ep: pct application non-entry in european phase
ENP Entry into the national phase

Ref document number: 2006064415

Country of ref document: US

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 10481068

Country of ref document: US

WWP Wipo information: published in national office

Ref document number: 10481068

Country of ref document: US

NENP Non-entry into the national phase

Ref country code: JP

WWW Wipo information: withdrawn in national office

Country of ref document: JP