WO2002005133A1 - Method and apparatus for searching a database containing biological information

Publication number
WO2002005133A1
Authority
WO
WIPO (PCT)
Prior art keywords
domains
domain
protein
database
user
Prior art date
Application number
PCT/SG2000/000100
Other languages
English (en)
Inventor
Allison Lim
Jiren Wang
Limsoon Wong
Original Assignee
Kent Ridge Digital Labs
Priority date
Filing date
Publication date
Application filed by Kent Ridge Digital Labs
Priority to PCT/SG2000/000100
Publication of WO2002005133A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G16B15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment

Definitions

  • The present invention relates to a method and apparatus adapted to facilitate searching of a database containing biological information.
  • In particular, but not exclusively, the present invention provides a search system and method that allow flexible design of queries for sequences with certain biological function units (such as motifs and domains) and identification of protein sequences that have these units.
  • The design of queries, in one form, is based on a combination of existing models and/or user-defined queries.
  • The present invention has application, in one form, to the fields of bioinformatics, computer science, information science, pharmaceutical science and biotechnology.
  • BACKGROUND ART
  • Biological scientists express a need to identify proteins by their functional units, subject to certain compositional constraints.
  • The present invention seeks, as an object, to alleviate at least one problem associated with the prior art.
  • The present invention, in one form, stems from the recognition that the problems a, b, and c noted above exist and that a solution to them should be devised.
  • The present invention provides an interface/apparatus and/or method for devising a query for use in interrogating a biological database to identify a target protein, in which query a user is able to:
  • The apparatus and/or method further serves to: (d) execute the query by searching for the target protein, identifying those protein sequences from (c) whose composition of domains from (a) is detected using the means from (b).
  • The present invention preferably includes a number of features, such as:
  • The user can define complex compositions of domain models in one query.
  • These types of queries can involve regular expressions, hidden Markov models, or position weight matrix profiles, and are particularly suitable for identifying complex multi-domain proteins.
  • The query can also include other models or compositions, as would be understood by those skilled in the art.
  • The user can take advantage of the defined motif/domain libraries to search for a defined domain by a partial regular expression match or by a keyword that describes the motif.
  • The user can (a) use pre-defined domain models, (b) automatically create them from a user-supplied English specification, and (c) automatically create them (both directly and indirectly) from user-supplied seed sequences.
  • The user can specify a composition of models (e.g. two filin domains) and a combination of search methods (e.g. an occurrence of a regular expression within a zinc finger domain).
  • Figure 1 schematically illustrates one embodiment of the present invention.
  • Figure 2 illustrates an example output of the embodiment of Figure 1.
  • The present invention contains several major components, described as follows:
  • a. An extensible collection of motifs, profiles, regular expression patterns, hidden Markov models, etc., with their associated search methods. These motifs, profiles, regular expressions, hidden Markov models, etc. are collectively referred to in this disclosure as "domain models". These domain models can be imported verbatim from established external databases.
  • b. An extensible collection of databases of protein sequences. These databases may be existing libraries or be compiled individually.
  • c. An interface allowing a user to select an individual domain model. The selection is achieved either by (a) entering an English description, and then selecting from matching entries in PROTEIN DESIGNER'S collection of domain models; or by (b) direct browsing of entries in PROTEIN DESIGNER'S collection; or by (c) direct entry using a regular expression; or by (d) direct derivation of a hidden Markov model from a user-supplied list of seed protein sequences; or by (e) direct derivation of a hidden Markov model from protein sequences in public databases matching a user-supplied list of seed protein sequences; or by (f) direct derivation of a hidden Markov model from protein sequences in public databases matching a user-supplied English description.
  • d. An interface allowing the user to compose the individual models to form a description of the domain composition of the proteins the user wishes to identify. This interface can be either graphical or text-based. The user uses it to specify (a) the relative ordering of the domains in the target protein, (b) distance and/or containment constraints between these domains in the target protein, and (c) if necessary, scoring thresholds for these domains.
  • e. An interface allowing the user to select databases from PROTEIN DESIGNER'S collection of databases.
  • f. An engine for applying the specified domain composition to the selected protein databases and for displaying the matching proteins.
  • The present invention is referred to as the PROTEIN DESIGNER and provides a user a convenient way
  • A relational database or a FASTA-formatted flat file is again preferably used in the implementation.
  • Top-level options are included, such as:
  • Option 1 is "Type in regular expression". Under this option, the user is asked to provide a regular expression to specify the constituent model.
  • The symbols allowed in this regular expression are the 20 amino acid letters and the dot symbol (.), which represents don't-care. These symbols can be grouped using square brackets ([ and ]) so that [ABC] means A or B or C. These symbols can be written adjacent to each other so that ABC means A followed by B followed by C.
  • Each symbol can be annotated by a repetition constraint: {d} meaning repeat d times; {,d} meaning repeat at most d times; {d,} meaning repeat at least d times; {j,k} meaning repeat between j and k times.
  • C.{2,4}C.{3}[LIVMFYWC].{8}H.{3,5}H specifies the usual zinc finger domain.
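The regular-expression notation described above happens to coincide with the syntax of Python's standard `re` module, so the zinc finger pattern can be tried directly. The following is a minimal sketch under that assumption, not the patent's implementation; the helper name `find_domain` and the toy sequence are hypothetical and exist only for illustration.

```python
import re

# Zinc finger pattern quoted in the text, written in Python `re` syntax.
ZINC_FINGER = re.compile(r"C.{2,4}C.{3}[LIVMFYWC].{8}H.{3,5}H")


def find_domain(pattern, sequence):
    """Return the (start, end) half-open span of every non-overlapping match."""
    return [match.span() for match in pattern.finditer(sequence)]


# Hypothetical toy sequence built to contain exactly one match:
# C + 2 residues + C + 3 residues + L + 8 residues + H + 3 residues + H
toy_sequence = "MK" + "CAACAAALAAAAAAAAHAAAH" + "GG"

print(find_domain(ZINC_FINGER, toy_sequence))  # [(2, 23)]
```

Spans are reported as half-open intervals, which makes the distance between two hits a simple subtraction of an end from a start.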
  • Option 2 is "Type in English word". Under this option, the user is asked to provide a list of keywords describing a constituent domain. These keywords are then used to search the "DB of domain models" (3.1) to find predefined domains whose description in the database contains these words. Those with more matching keywords are listed first. The user can then browse this list and select a desired constituent domain model, provided that model appears on the list.
  • A more sophisticated embodiment can also use approximate matching of keywords based on stemming and a thesaurus.
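The keyword lookup of Option 2 can be sketched as a simple ranking by number of matching keywords. This is an illustrative sketch only: the function name `rank_by_keywords`, the dictionary layout, and the three toy entries are assumptions, and exact whole-word matching stands in for whatever matching the actual database uses.

```python
def rank_by_keywords(models, keywords):
    """Rank domain models by how many of the keywords appear in their description.

    models   -- dict mapping a model name to its English description
    keywords -- iterable of keywords supplied by the user
    Models with no matching keyword are left out; ties are broken by name.
    """
    wanted = {keyword.lower() for keyword in keywords}
    scored = []
    for name, description in models.items():
        words = set(description.lower().split())
        hits = len(wanted & words)
        if hits:
            scored.append((hits, name))
    # More matching keywords first, then alphabetical order for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored]


# Hypothetical stand-in for the "DB of domain models" (3.1)
toy_models = {
    "ZF_C2H2": "zinc finger C2H2 type domain",
    "ZF_RING": "zinc finger RING type domain",
    "TPR": "tetratricopeptide repeat",
}

print(rank_by_keywords(toy_models, ["zinc", "C2H2"]))  # ['ZF_C2H2', 'ZF_RING']
```

Approximate matching by stemming or a thesaurus, as mentioned for the more sophisticated embodiment, would replace the exact set intersection used here.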
  • Option 3 is "Construct HMM". Under this option, there are several suboptions that let the user select the means for constructing a desired constituent domain model. The preferred, but not the only, means are the following:
  • The user is also given an option to save the constructed hidden Markov model into the database (3.1).
  • The options noted above may also be used partly or wholly in combination.
  • M1, ..., Mn are the constituent domain models selected from the "Interface for selecting domain models" (3.3), and S1, ..., Sn are the corresponding methods/scoring thresholds associated with these selected constituent domain models.
  • SPEC1 before {>d} SPEC2: a protein in which SPEC1 occurs before SPEC2, separated by a distance of at least d residues, is acceptable.
  • Round brackets can be used to disambiguate where necessary.
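The "before" constraint with a minimum distance can be illustrated on lists of domain hits. The sketch below is a hedged reading of the textual specification, not the patent's engine: hits are assumed to be half-open (start, end) spans, the distance is assumed to be the number of residues between the end of the first domain and the start of the second, and the function name `satisfies_before` is hypothetical.

```python
def satisfies_before(spans_a, spans_b, min_gap=0):
    """Check the constraint "A before {>d} B" on one protein sequence.

    spans_a, spans_b -- lists of half-open (start, end) hits for domains A and B
    min_gap          -- d, the minimum number of residues between the two domains
    Returns True if some hit of A ends at least `min_gap` residues before
    some hit of B starts.
    """
    return any(
        start_b - end_a >= min_gap
        for _, end_a in spans_a
        for start_b, _ in spans_b
    )


# Hypothetical hits: domain A at residues 0-9, domain B at residues 15-24
hits_a = [(0, 10)]
hits_b = [(15, 25)]

print(satisfies_before(hits_a, hits_b, min_gap=5))  # True: gap is exactly 5
print(satisfies_before(hits_a, hits_b, min_gap=6))  # False: gap is too small
print(satisfies_before(hits_b, hits_a))             # False: wrong order
```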
  • A graphical method: a graphical icon is provided for each Mi/Si selected from the "Interface for selecting domain models" (3.3).
  • A canvas is provided for the user to click and drop these icons.
  • A line between two icons denotes "before", in a left-to-right manner.
  • A line can be annotated by a distance constraint {>d}, which means that the constituent domain represented by the icon at its left is separated from the constituent domain represented by the icon at its right by at least d residues.
  • A circle shaded in a light colour (say red) can be used to group icons. Such a circle means that all the constituent domains represented by the enclosed icons must appear in the desired protein.
  • The circle can also be annotated by a distance constraint {>d}, which means that these constituent domains are expected to overlap by at least d residues.
  • A circle shaded in a different light colour (say blue) can be used to group icons. Such a circle means that at least one of the constituent domains represented by the enclosed icons must appear in the desired protein.
  • An example is shown in the shaded box adjacent to 3.4 in Figure 1.
  • The distance constraint {>d} above can also be generalized, for example, to {j-k}, meaning at least j and at most k.
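The two kinds of grouping circle and the generalized distance range map naturally onto "all of", "any of" and a bounded-gap check. The sketch below is one possible reading under stated assumptions: hits are stored per domain as lists of half-open (start, end) spans, and the names `all_present`, `any_present` and `gap_in_range` are hypothetical helpers, not part of the described system.

```python
def all_present(hits, names):
    """First kind of circle: every enclosed domain must have at least one hit."""
    return all(hits.get(name) for name in names)


def any_present(hits, names):
    """Second kind of circle: at least one enclosed domain must have a hit."""
    return any(hits.get(name) for name in names)


def gap_in_range(span_a, span_b, j, k):
    """Generalized distance constraint {j-k}: the gap between the end of
    span_a and the start of span_b is at least j and at most k residues."""
    gap = span_b[0] - span_a[1]
    return j <= gap <= k


# Hypothetical hits for one protein: one zinc finger hit, no TPR hit
toy_hits = {"ZF": [(2, 23)], "TPR": []}

print(all_present(toy_hits, ["ZF", "TPR"]))     # False
print(any_present(toy_hits, ["ZF", "TPR"]))     # True
print(gap_in_range((0, 10), (15, 25), 3, 8))    # True: gap of 5 lies in [3, 8]
```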
  • The names of the sublists from (3.2) are provided to the user for selection.
  • The user can specify some keywords, and all protein sequences whose English descriptions in (3.2) match these keywords are selected.
  • This interface looks up from the protein sequence database (3.2) the name, English or other-language description, and sequence of each hit from (3.6). It then displays this information, together with a graphical or textual layout of the constituent domains of the corresponding hit.
  • This interface initially displays a summary of the hits.
  • The summary contains just the name of the protein sequence in each hit, together with a graphical or textual layout of its constituent domains.
  • The layout can be selected, defined or composed by the user.
  • An example presentation of the results of a search for three tetratricopeptide repeat (TPR) domains is given in Figure 2.
  • The corresponding (textual) domain composition specification is "TPR before {0} TPR before {0} TPR". Figure 2 shows three protein sequences (P14922, P30260, P38042) from the Swissprot database that satisfy the domain composition criterion.
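A query such as the three-TPR example asks whether one domain model occurs several times in succession. One way to check this, sketched below, is to count the longest chain of non-overlapping hits in left-to-right order. This is an illustrative sketch rather than the patent's method: the distance annotation is read here as a minimum gap of zero residues, which is an assumption, and the function names and hit coordinates are hypothetical.

```python
def count_ordered_chain(spans, min_gap=0):
    """Count the largest number of hits that can be chained left to right.

    spans   -- list of half-open (start, end) hits of one domain model
    min_gap -- minimum number of residues between consecutive chained hits
    Uses the classic greedy choice: always take the hit that ends first.
    """
    count = 0
    last_end = None
    for start, end in sorted(spans, key=lambda span: span[1]):
        if last_end is None or start - last_end >= min_gap:
            count += 1
            last_end = end
    return count


def has_repeats(spans, n, min_gap=0):
    """True if the protein contains at least n chained hits of the domain."""
    return count_ordered_chain(spans, min_gap) >= n


# Hypothetical hit lists for two proteins (coordinates invented for illustration)
tandem_hits = [(0, 34), (34, 68), (68, 102)]       # three back-to-back hits
overlapping_hits = [(0, 34), (20, 54), (60, 94)]   # only two can be chained

print(has_repeats(tandem_hits, 3))        # True
print(has_repeats(overlapping_hits, 3))   # False
```

Taking the earliest-ending hit first is the standard interval-scheduling strategy, so the count it returns is the maximum number of hits that can be chained under the gap constraint.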

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Analytical Chemistry (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a method and apparatus adapted to facilitate searching of a database containing biological information. In particular, but not exclusively, the invention relates to a search system and method that allow flexible design of queries for sequences with certain biological function units (such as motifs and domains) and identification of protein sequences that have these units. The design of queries, in one embodiment, is based on a combination of existing models and/or user-defined queries, in order to provide a flexible selection of criteria when defining one or more queries for interrogating a database. The invention has application, in one embodiment, to the fields of bioinformatics, computer science, information science, pharmaceutical science and biotechnology.

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/SG2000/000100 WO2002005133A1 (fr) 2000-07-07 2000-07-07 Method and apparatus for searching a database containing biological information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/SG2000/000100 WO2002005133A1 (fr) 2000-07-07 2000-07-07 Method and apparatus for searching a database containing biological information

Publications (1)

Publication Number Publication Date
WO2002005133A1 (fr) 2002-01-17

Family

ID=20428837

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/SG2000/000100 WO2002005133A1 (fr) 2000-07-07 2000-07-07 Method and apparatus for searching a database containing biological information

Country Status (1)

Country Link
WO (1) WO2002005133A1 (fr)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2000023474A1 (fr) * 1998-10-21 2000-04-27 The University Of Queensland Protein engineering
WO2000026818A1 (fr) * 1998-10-30 2000-05-11 International Business Machines Corporation Methods and apparatus for detecting sequence homologies

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ALTSCHUL S F ET AL: "Iterated profile searches with PSI-BLAST-a tool for discovery in protein databases", TIBS TRENDS IN BIOCHEMICAL SCIENCES,EN,ELSEVIER PUBLICATION, CAMBRIDGE, vol. 23, no. 11, 1 November 1998 (1998-11-01), pages 444 - 447, XP004143492, ISSN: 0968-0004 *
CURWEN V A ET AL: "GHOST: a gene homology online search tool", TRENDS IN GENETICS,ELSEVIER, AMSTERDAM,NL, vol. 16, no. 7, July 2000 (2000-07-01), pages 321 - 323, XP004207252, ISSN: 0168-9525 *
JASON WANG; BRUCE SHAPIRO; DENNIS SHASHA: "Pattern Discovery in Biomolecular Data", 1999, OXFORD UNIVERSITY PRESS, NEW YORK OXFORD, ISBN: 0-19-511940-1, XP002168720 *

Similar Documents

Publication Publication Date Title
NL1028923C2 (nl) Method, apparatus and software for extracting chemical data.
Walker et al. SEALS: a system for easy analysis of lots of sequences.
Krogh An introduction to hidden Markov models for biological sequences
Gouy et al. ACNUC–a portable retrieval system for nucleic acid sequence databases: logical and physical designs and usage
US6249784B1 (en) System and method for searching and processing databases comprising named annotated text strings
JP5309570B2 (ja) Information retrieval device, information retrieval method, control program
US6904427B1 (en) Search system and method based on search condition combinations
US6633817B1 (en) Sequence database search with sequence search trees
US20050278292A1 (en) Spelling variation dictionary generation system
JP2006139783A (ja) Method and system for identifying a set of query-related keywords contained in one or more documents obtained from a query, wherein the query need not contain the keywords
Brejová et al. Finding patterns in biological sequences
JP2004287725A (ja) Search processing method and program
US20210158903A1 (en) Information processing system and search method
Allali et al. The at-most $k$-deep factor tree
WO2002005133A1 (fr) Method and apparatus for searching a database containing biological information
JPH113343A (ja) Information retrieval device
Schbath et al. R'MES: a tool to find motifs with a significantly unexpected frequency in biological sequences
JP2000148789A (ja) Method and device for analyzing cited documents of patent information and the like
JP5347307B2 (ja) Information retrieval device, information retrieval method, control program
KR100551954B1 (ko) System and method for searching protein interaction networks using gene ontology
JP2001014326A (ja) Device and method for retrieving similar documents by structure specification
US7277798B2 (en) Methods for extracting similar expression patterns and related biopolymers
Bockhorst et al. Discovering patterns in biological sequences by optimal segmentation
JP4247026B2 (ja) Keyword frequency calculation method and program for executing it
Mulder et al. Interpro and interproscan

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): SG US

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
122 Ep: pct application non-entry in european phase