EP1221126A2 - Graphical user interface for display and analysis of biological sequence data - Google Patents

Graphical user interface for display and analysis of biological sequence data

Info

Publication number
EP1221126A2
EP1221126A2 EP00963469A EP00963469A EP1221126A2 EP 1221126 A2 EP1221126 A2 EP 1221126A2 EP 00963469 A EP00963469 A EP 00963469A EP 00963469 A EP00963469 A EP 00963469A EP 1221126 A2 EP1221126 A2 EP 1221126A2
Authority
EP
European Patent Office
Prior art keywords
sequence
database
sequences
modules
family
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP00963469A
Other languages
German (de)
English (en)
Inventor
Stephen Chamberlin
Steven A. Benner
Lukas Knecht
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Luminex Corp
Original Assignee
Eragen Biosciences Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US 09/397,335 (US6941317B1)
Application filed by Eragen Biosciences Inc
Publication of EP1221126A2
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B45/00 ICT specially adapted for bioinformatics-related data visualisation, e.g. displaying of maps or networks
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G16B10/00 ICT specially adapted for evolutionary bioinformatics, e.g. phylogenetic tree construction or analysis

Definitions

  • the present invention relates to a computer research tool for searching and displaying biological data. More specifically, the invention relates to a computer research tool utilizing a novel graphical user interface (GUI) for performing computerized research of biological data from various databases and for providing enhanced graphical representation of biological data, progressive querying, and cross-navigation of relational data.
  • DNA contains the blueprints for these structures.
  • DNA is composed of very long polymers of chemical sub-units known as nucleotides. Each nucleotide includes one of four nitrogenous bases: adenine (A), thymine (T), cytosine (C) and guanine (G).
  • DNA serves as a template for ribonucleic acid (RNA).
  • RNA is also composed of nucleotides. Each RNA nucleotide includes one of four nitrogenous bases. These bases of RNA differ from that of DNA only in the substitution of thymine (T) with uracil (U).
  • Three nucleotides of DNA encode three nucleotides of RNA, which in turn encode one amino acid of a protein.
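The DNA-to-RNA-to-protein relationship described above can be illustrated with a minimal sketch. It assumes the DNA string is the coding strand (so the RNA has the same bases with T replaced by U) and uses a deliberately partial codon table; the function names `transcribe` and `translate` are hypothetical and chosen only for this example.

```python
# Toy illustration: three DNA nucleotides -> three RNA nucleotides -> one amino acid.
# The codon table is partial (a few entries of the standard genetic code).
CODON_TABLE = {
    "AUG": "Met", "UUU": "Phe", "UUC": "Phe",
    "GGU": "Gly", "GGC": "Gly", "GCU": "Ala",
    "UAA": "Stop", "UAG": "Stop", "UGA": "Stop",
}


def transcribe(dna: str) -> str:
    """Return the RNA matching a DNA coding strand (thymine replaced by uracil)."""
    return dna.upper().replace("T", "U")


def translate(rna: str) -> list:
    """Read the RNA one codon (three nucleotides) at a time until a stop codon."""
    protein = []
    usable_length = len(rna) - len(rna) % 3  # ignore a trailing partial codon
    for i in range(0, usable_length, 3):
        residue = CODON_TABLE.get(rna[i:i + 3], "???")  # "???" marks codons not in the toy table
        if residue == "Stop":
            break
        protein.append(residue)
    return protein


rna = transcribe("ATGTTTGGCTAA")
print(rna, translate(rna))
```

Each group of three input nucleotides yields at most one amino acid, which is the 3:3:1 correspondence stated in the text.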
  • Proteins are macromolecules (i.e., polymers of covalently bonded amino acids) that show great diversity in physical properties, thereby fulfilling a broad range of biological functions.
  • a protein's structure and function depend upon its amino acid sequence, which is determined by the nucleotide sequence of the RNA which produced it, which is determined by the nucleotide sequence of the DNA that produced the RNA.
  • the great diversity observed in the sequence of amino acids is the direct result of the many possible permutations of DNA and RNA.
  • the primary structure is the sequence of amino acids covalently bonded together.
  • the secondary structure is the result of amino acid sequence of the polypeptide.
  • the bonding causes the chain to develop specific shapes (alpha helix, beta sheet).
  • the tertiary structure is the 3-dimensional folding of the alpha helix or the pleated sheet.
  • the quaternary structure is the spatial relationship between the different polypeptides in the protein.
  • Sequence comparison is a very powerful tool in molecular biology, genetics and protein chemistry. Frequently, it is unknown for which proteins a new DNA sequence codes or if it codes for any protein at all. If a new coding sequence is compared with all known sequences, there is a high probability of finding a similar sequence. Usually one tries to determine what level of similarity is shared between the proteins in terms of structural and functional characteristics. This determination is made by comparing the amino acid sequences of the proteins. It has been observed that the primary structures of a given protein from related species closely resemble one another.
  • databases of known biological data need to be accessed.
  • Biological information such as DNA and protein sequence data is stored in databases, including general biological databanks such as EMBL/GenBank (nucleotide sequences), SWISS-PROT (protein sequences), and PDB/Protein Data Bank (protein structures).
  • GenBank is an annotated collection of all publicly available DNA sequences.
  • The GenBank database comprises the DNA DataBank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL), and GenBank at NCBI.
  • SWISS-PROT is an annotated protein sequence database maintained by the Department of Medical Biochemistry of the University of Geneva.
  • The PDB/Protein Data Bank, maintained by Brookhaven National Laboratory, contains all publicly available solved protein structures. These databases contain large amounts of raw sequence data which can be cumbersome to use.
  • a derived database generally contains added descriptive materials on top of the primary data or provides novel structuring of the data based on certain defined relationships.
  • Derived/structured databases typically structure the protein sequence data into usable sets of data (tables), grouping the protein sequences by family or by homology domains.
  • a protein family is a group of sequences that can be aligned from end to end and are <55% different globally.
  • a homologous domain is a subsequence of a protein that is distinguished by a well-defined set of properties or characteristics and may also occur in at least two different subfamilies.
  • ProDom is a protein domain database consisting of an automatic compilation of homologous domains.
  • the database was designed as a tool to help analyze domain arrangements of proteins and protein families.
  • Current versions of the ProDom database are built using a procedure based on recursive PSI-BLAST searches.
  • ProDom contains 57,976 domain families, sorted by decreasing number of protein sequences in the families.
  • ProDom is generated from the SWISS-PROT database by automated sequence comparison.
  • DOMO is a database of homologous protein domain families. DOMO was obtained from successive sequence analysis steps including similarity search, domain delineation, multiple sequence alignment, and motif construction.
  • DOMO has analyzed 83,054 non-redundant protein sequences from SWISS-PROT and PIR-International Sequence DataBase, yielding a database of 99,058 domain clusters grouped into 8,877 multiple sequence alignments.
  • Another derived protein sequence database is the Block Database. Blocks are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins. The blocks for the Block Database are made automatically by looking for the most highly conserved regions in groups of proteins documented in the Prosite Database. The blocks are then calibrated against the SWISS-PROT database to obtain a measure of the chance distribution of matches.
  • biological databases may be searched by either an unstructured (keyword) or structured (field based) search.
  • An unstructured search of the database is performed by searching for a keyword or the ID of records. For example, a keyword search of "ecoli" retrieves a list of protein sequences that are identified by the keyword "ecoli".
  • a structured search is a more deliberate search, allowing, for example, the searching of the database for protein sequences which contain a particular sequence of interest.
  • The ENTREZ search engine utilizes keyword searching. If a search results in too many hits, ENTREZ allows the addition of new search terms to progressively narrow the number of hits. A researcher may then select all or a subset of the entries that match the search for display to generate a summary page that reports on each of the selected entries.
  • the search results may be displayed in a variety of formats or standardized reports.
  • the genomes division of ENTREZ has a graphic interface based on alignments among multiple maps. The display image shows a series of genetic and physical maps published from a variety of sources, roughly aligned, with diagonal lines connecting common features.
  • SRS Sequence Retrieval System
  • the SRS cross-references sequence information from approximately 40 other sequence databases, including ones that hold protein and nucleotide sequence information, 3D structure, disease and phenotype information, and functional information.
  • the SRS search allows structured queries on one or more databases with common fields (e.g., ID, AccNumber, Description). SRS displays the results as a series of hypertext links. The search can be broadened to other databases by bringing in cross-references.
  • GUIs Graphical User Interfaces
  • GUIs include a desktop metaphor upon which one or more icons, application windows, or other graphical objects are displayed.
  • a data processing system user interacts with a GUI display utilizing a graphical pointer, which the user controls with a graphical pointing device, such as a mouse, trackball, or joystick.
  • the user can select icons or other graphical objects within the GUI display by positioning the graphical pointer over the graphical object and depressing a button associated with the graphical pointing device.
  • the user can typically relocate icons, application windows, and other graphical objects on the desktop utilizing the well known drag-and-drop techniques.
  • the user can control the underlying hardware devices and software objects represented by the graphical objects in a graphical and intuitive manner.
  • User interfaces used with multi-tasking processors also allow the user to work on many tasks at once, each task being confined to its own display window.
  • the interface allows the presentation of multiple windows in potentially overlapping relationships on a display screen. The user can thus retain a window on the screen while temporarily superimposing a further window entirely or partially overlapping the retained window. This enables the user to divert attention from a first window to one or more secondary windows for assistance and/or references, so that overall user interaction may be improved.
  • Multiple sequence alignments (MSAs), secondary structure predictions, two-dimensional graphical representations of sequences, and phylogenetic trees.
  • A multiple sequence alignment displays the alignment of homologous residues among a set of sequences in columns.
  • sequences are displayed as schematic boxes wherein each box is spatially oriented.
  • Phylogenetic trees are genealogical trees which are built up with information gained from the comparison of the amino acid sequences in a protein.
  • the phylogenetic tree (rooted or unrooted) is a graphical representation of the evolutionary distance between individual protein sequences in a family of proteins.
  • the branches of the phylogenetic tree are evolutionary distances from the PAM matrix, an evolutionary model that assumes that estimation of mutation rates for closely related proteins can be extrapolated to distant relationships.
  • a good example of a graphical user interface can be found in the ProDom interface.
  • the output from a ProDom query for proteins sharing a homologous domain with a particular sequence may be displayed as 2D graphic representations, summarized alignments and trees, alignment in MSF format, and 3D structures.
  • the 2D graphical view presents domain arrangements for proteins sharing homology by showing each protein on a single line, starting with its name, hypertext-linked to SWISS-PROT, followed by a 2D view of schematic boxes, each box hypertext-linked to corresponding ProDom entries.
  • the limitation of most of these systems is that the graphical displays are both static and unrelated.
  • a static graphical display is one in which the user cannot refine or modify the search criteria from within the graphical display.
  • Unrelated graphical displays are those in which, when a user modifies one graphical display for a particular search, the remaining graphical displays for that search are not correspondingly modified (i.e., no propagation).
  • the invention is a computer research tool for searching and displaying biological data.
  • the invention provides a computer research tool for performing computerized research of biological data from various databases and for providing a novel graphical user interface that significantly enhances biological data representation, progressive querying and cross-navigation of windows and databases.
  • the invention can be implemented in numerous ways, including as a system, a device, a method, or a computer readable medium. Several embodiments of the invention are discussed below.
  • an embodiment of the invention includes a database containing tables of data, a display device and a processor unit.
  • the display device has a plurality of display areas (windows).
  • the processor unit operates to access the database to retrieve the data from the corresponding associated tables and then display the retrieved data in the display areas.
  • the processor unit also detects when a selection associated with one of the display areas is made and thereafter automatically modifies the data being displayed in the other display areas in accordance with the selection. Selection is made in a graphically distinct manner. Changes in certain selections, including scale and limits, propagate throughout.
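The propagation behaviour described above (a selection made in one display area automatically updating the others) can be sketched with a shared selection model that notifies every registered view. This is only an illustrative sketch, not the patent's implementation: the class names `SelectionModel` and `View`, the method names, and the sequence identifiers are all hypothetical.

```python
class SelectionModel:
    """Shared selection state; registered views are notified when it changes."""

    def __init__(self):
        self._views = []
        self.selected = set()

    def register(self, view):
        self._views.append(view)

    def select(self, sequence_ids, source=None):
        """Store the new selection and push it to every view except the originator."""
        self.selected = set(sequence_ids)
        for view in self._views:
            if view is not source:
                view.on_selection(self.selected)


class View:
    """Stand-in for one display area (summary, alignment, or tree window)."""

    def __init__(self, name, model):
        self.name = name
        self.model = model
        self.highlighted = set()
        model.register(self)

    def on_selection(self, selected):
        # A real window would redraw here; the sketch only records the highlight.
        self.highlighted = set(selected)

    def user_picks(self, sequence_ids):
        """Simulate the user selecting sequences inside this window."""
        self.highlighted = set(sequence_ids)
        self.model.select(sequence_ids, source=self)


model = SelectionModel()
summary = View("family summary", model)
alignment = View("multiple sequence alignment", model)
tree = View("evolutionary tree", model)

alignment.user_picks({"SEQ1", "SEQ2"})
print(sorted(summary.highlighted), sorted(tree.highlighted))
```

Keeping the selection in one shared object, rather than in each window, is what lets a change made anywhere reach every other display area.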
  • an embodiment of the invention includes a number of display areas ("windows") for searching and displaying biological data which are interlinked for ease of navigation.
  • a variety of formats for searching and displaying biological data is provided. Searches can be performed by keyword, sequence listing or family identifier (module ID). Sequence search results are graphically displayed showing the relationship between the probe sequence chosen for the query and each of the families that are related with their associated modules. Keyword search results are graphically displayed showing all sequences having the requested keyword along with all modules for the currently selected catalog. The module of interest may then be selected which results in a summary window for the family associated with the module.
  • the family summary window display provides a two dimensional spatial orientation of the biological data, including visual locations of modules (represented as schematic boxes) in each of the sequences for a selected family distinguishably displayed and positionally aligned as well as the location of all other modules in those sequences.
  • Another results display provides the user with the associated multiple sequence alignment of biological data and secondary structure predictions (Vparse, Score, PredSI, PredSec).
  • a further results display provides the user with the associated phylogenetic tree (rooted or unrooted).
  • a further display provides the actual protein sequence information for any selected member of a family.
  • an embodiment of the invention includes the operations of: interlinking display areas via the database such that selections may be propagated throughout.
  • the method further includes multiple catalog views, browsing through families, cross-navigation and propagation through all catalogs and sequences, propagation through families, assigning protein function, and scaling of silent/expressed mutation ratios.
  • an embodiment of the invention includes: computer readable code devices for interlinking display areas via the database such that selections may be propagated throughout, multiple catalog views, browsing through families, cross-navigation and propagation through all catalogs and sequences, propagation through families, assigning protein function, and scaling of silent/expressed mutation ratios (kA/kS).
  • One significant advantage of the invention is that it allows a user to directly use the data returned by one or more queries as the basis for making additional queries.
  • Through this kind of interactive and progressive query-making activity, access to all of the information on a given topic is possible.
  • new data connections and relationships may be discovered.
  • the user is able to more efficiently and effectively review related biological information than conventionally possible.
  • FIG. 1 is an overview of the preferred embodiment of the hardware architecture for computerized searching of biological data.
  • FIG. 2 depicts a classic three tier client server model utilized in a preferred embodiment of the present invention (1-tier/Database Server, 2-tier/Application Server, 3-tier/Client).
  • FIG. 3 depicts the preferred client/server communication model with all three tiers.
  • FIG. 4A depicts the entity relationship diagram of one embodiment of the database.
  • FIG. 4B depicts the entity relationship diagram of another embodiment of the database.
  • FIG. 4C depicts the entity relationship diagram for DNA.
  • FIG. 5 depicts the block diagram for user related database.
  • FIG. 6 is a navigational flowchart of a preferred embodiment of the invention illustrating all major windows and options available.
  • FIG. 7 depicts a sample screen display for the catalog selection window.
  • FIG. 8 depicts a sample screen display for the search by name window.
  • FIG. 9 depicts a sample screen display for the Module Family Summary (MFS) window.
  • FIG. 10 depicts a sample screen display for the search by sequence window.
  • FIG. 11 depicts a sample screen display for the sequence search results (SSR) window.
  • FIG. 12 depicts a sample screen display for the search by keyword window.
  • FIG. 13 depicts a sample screen display for the keyword search results (KSR) window.
  • FIG. 14 depicts a sample screen display for the MSA window.
  • FIG. 15 depicts a sample screen display for the evolutionary tree window.
  • FIG. 16 depicts an example of consecutive screen displays for interactive and progressive query-making activity from search by sequence.
  • FIG. 17 depicts an example of consecutive screen displays for interactive and progressive query-making activity from search by keyword.
  • FIG. 18A depicts the MFS window for one catalog.
  • FIG. 18B depicts the MFS window with a second catalog selected and consecutive screen displays for interactive and progressive query-making activity from this window.
  • FIG. 19 depicts the screen display for the evolutionary tree window as linked to the MFS window and MSA window with highlighted selections propagated throughout.
  • FIG. 20 depicts the screen displays showing interactive and progressive query-making activity across multiple MFS windows.
  • FIG. 1 is an overview of the preferred embodiment of the hardware architecture for computerized searching of biological data.
  • the architecture preferably comprises at least two networked computer processors (client component and server component(s)) and a database(s) for storing biological data.
  • the computer processors can be processors that are typically found in personal desktop computers (e.g., IBM, Dell, Macintosh), portable computers, mainframes, minicomputers, or other computing devices.
  • a classic three tier client server model is utilized as shown in Figure 2 (1-tier/Database Server, 2-tier/Application Server, 3-tier/Client).
  • RDBMS: relational database management system.
  • The RDB machine provides the interface to the database.
  • In a preferred database-centric client/server architecture, the client application generally requests data and data-related services from the application server, which makes requests to the database server.
  • The server(s) (e.g., either as part of the application server machine or a separate RDB/relational database machine) responds to the client's requests and provides secured access to shared data.
  • the client components are preferably complete, stand-alone personal computers offering a full range of power and features to run applications.
  • the client component preferably operates under any operating system and includes communication means, input means, storage means, and display means.
  • the user enters input commands into the computer processor through input means which could comprise a keyboard, mouse, or both.
  • the input means could comprise any device used to transfer information or commands.
  • the display comprises a computer monitor, television, LCD, LED, or any other means to convey information to the user.
  • the user interface is a graphical user interface (GUI) written in the Java programming language (Sun Microsystems) and running in a Java compatible browser or Java Virtual Machine (JVM).
  • the GUI provides flexible navigational tools to explore patterns in the evolutionary relationships between genomic sequences.
  • the clients and the Application server communicate via Java's RMI (Remote Method Invocation).
  • the server component(s) can be a personal computer, a minicomputer, or a mainframe and offers data management, information sharing between clients, network administration and security.
  • the Database Server (RDBMS - Relational Database Management System) and the Application Server may be the same machine or different hosts if desired.
  • the Application Server is preferably a Java application (JDK Ver. 1.1 or JRE) running on a supported UNIX platform (e.g., Linux, Irix, Solaris).
  • the Database Server is preferably SQL-capable (e.g., MySQL, Oracle).
  • the Application Server and Database Server communicate via the protocol implied by the JDBC (Java Database Connectivity) driver of the RDBMS.
  • the Application Server preferably completely isolates the client from any notion of relational databases; the client's view is one of (Java) objects, not relations.
  • the present invention also envisions other computing arrangements for the client and server(s), including processing on a single machine such as a mainframe, a collection of machines, or other suitable means.
  • the client and server machines work together to accomplish the processing of the present invention.
  • the preferable protocol between the client and server is RMI (Remote Method Invocation for Java-to-Java communications across Virtual Machines).
  • RMI is a standard defined by the Java Core.
  • the database is preferably connected to the database server component and can be any device which will hold data.
  • the database can consist of any type of magnetic or optical storing device for a computer (e.g., CDROM, internal hard drive, tape drive).
  • the database can be located remote to the server component (with access via modem or leased line) or locally to the server component.
  • the database is preferably a relational database created/derived from existing biological data sets and/or databases (e.g., SWISS-PROT, GenBank) that is organized and accessed according to relationships between data items.
  • the database is SQL compatible with standard JDBC supported mechanisms and datatypes.
  • the relational database would preferably consist of a plurality of tables (entities). The rows of a table represent records (collections of information about separate items) and the columns represent fields (particular attributes of a record). In its simplest conception, the relational database is a collection of data entries that "relate" to each other through at least one common field.
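The idea of tables that "relate" through a common field can be shown with a small in-memory SQLite sketch. The table names (`family`, `sequence`), column names, and sample rows below are assumptions made up for illustration; they are not the schema shown in the patent's entity relationship diagrams.

```python
import sqlite3

# Two hypothetical tables that share the common field family_id.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE family (
    family_id INTEGER PRIMARY KEY,
    name      TEXT
);
CREATE TABLE sequence (
    seq_id    TEXT PRIMARY KEY,
    family_id INTEGER REFERENCES family(family_id),
    residues  TEXT
);
INSERT INTO family VALUES (1, 'example family A');
INSERT INTO family VALUES (2, 'example family B');
INSERT INTO sequence VALUES ('SEQ1', 1, 'MKT');
INSERT INTO sequence VALUES ('SEQ2', 1, 'MKS');
INSERT INTO sequence VALUES ('SEQ3', 2, 'MLV');
""")


def sequences_in_family(name):
    """Follow the common field from a family record to its sequence records."""
    rows = conn.execute(
        "SELECT s.seq_id FROM sequence AS s "
        "JOIN family AS f ON s.family_id = f.family_id "
        "WHERE f.name = ? ORDER BY s.seq_id",
        (name,),
    )
    return [row[0] for row in rows]


print(sequences_in_family("example family A"))
```

Each row is a record and each column a field; the join on `family_id` is the "common field" relationship the paragraph describes.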
  • portions of the database may be organized by identifying families of homologous protein sequences within the database, constructing for each family a multiple sequence alignment, an evolutionary tree, and ancestral sequences at nodes in the tree, constructing a corresponding multiple alignment for the DNA sequences that encode the proteins in the protein family, assigning silent and expressed mutations in the DNA sequences to each branch of the DNA evolutionary tree, predicting a secondary structure for the family, and aligning this predicted secondary structure with the ancestral sequence at the root of the tree.
  • the predicted structural models and their corresponding models of ancestral sequences may be used to organize the protein sequence database to provide rapid search and retrieval of sequence databases.
  • the predicted models are set within the evolutionary history of the protein family.
  • the evolutionary history is defined by a multiple alignment of the sequences of members of the protein family, an evolutionary tree connecting these members, and ancestral sequences reconstructed in probabilistic form throughout the tree.
  • a multiple alignment, an evolutionary tree, and ancestral sequences at nodes in the tree can be constructed by methods well known in the art for a set of homologous proteins. These three elements of the description are interlocking, as is well known in the art.
  • Trees are compared based on their scores using either maximum parsimony or maximum likelihood criteria, and selected based on considerations of score and correspondence to known facts.
  • a corresponding multiple alignment is constructed by methods well known in the art for the DNA sequences that encode the proteins in the protein family. The multiple alignment is constructed in parallel with the protein alignment. In regions of gaps or ambiguities, the amino acid sequence alignment can be adjusted to give the alignment with the most parsimonious DNA tree.
  • the presently preferred method of constructing ancestral DNA sequences for a given tree is the maximum parsimony method.
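As a concrete illustration of maximum parsimony, the sketch below applies the standard Fitch small-parsimony procedure to a single alignment column on a fixed binary tree. The text only says that maximum parsimony is preferred, so the choice of Fitch's procedure, the nested-tuple tree encoding, and the function name `fitch` are assumptions made for this example.

```python
def fitch(node):
    """Return (candidate_states, mutation_count) for one alignment column.

    A leaf is a one-character string (the observed base); an internal node
    is a (left, right) tuple. The candidate set at an internal node holds the
    equally parsimonious ancestral states; more than one entry means the
    reconstruction at that node is ambiguous.
    """
    if isinstance(node, str):
        return {node}, 0
    left_states, left_cost = fitch(node[0])
    right_states, right_cost = fitch(node[1])
    common = left_states & right_states
    if common:
        # Children agree on at least one state: no extra mutation is needed.
        return common, left_cost + right_cost
    # Children disagree: keep both possibilities and count one mutation.
    return left_states | right_states, left_cost + right_cost + 1


# Five observed bases at one column, arranged on a hypothetical tree.
tree = (("A", "A"), ("G", ("A", "G")))
states, cost = fitch(tree)
print(sorted(states), cost)
```

Summing the per-column counts over an alignment gives a parsimony score for the tree, so two candidate trees can be compared by running the same procedure on each.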
  • the DNA and protein trees and multiple alignments must be congruent, meaning that when amino acids are aligned in the protein alignment, the corresponding codons are aligned in the DNA alignment.
  • the connectivity of the two evolutionary trees must show the same evolutionary relationships. In regions where the connectivity of the amino acid tree is not uniquely defined by the amino acid sequences, the tree that gives the most parsimonious DNA tree is used to decide between two trees or reconstructions of equal value.
  • the ancestral amino acids reconstructed at nodes in the tree must correspond to the reconstructed codons at those nodes. When the ancestral sequences are ambiguous, and where the DNA sequences cannot resolve the ambiguity, the reconstructed DNA sequences must be ambiguous in parallel. Approximate reconstructions are valuable even when exact reconstructions are not possible from available data, and the tree is preferably constrained to correspond to evolutionary relationships between proteins inferred from biological data (e.g., cladistics).
  • mutations in the DNA sequences are then assigned to each branch of the DNA evolutionary tree. These may be fractional mutations to reflect ambiguities in the sequences at the nodes of the tree. When ambiguities are encountered, alternatives are weighted equally. Mutations along each branch are then assigned as being "silent," meaning that they do not have an impact on the encoded protein sequence, and "expressed," meaning that they do have an impact on the encoded protein sequence. Fractional assignments are made in the case of ambiguities in the reconstructed sequences at nodes in a tree. Thereafter, intermediates in the evolutionary tree are prepared in the laboratory using protein engineering and biotechnology methods well known in the art [Jermann, T.M.,
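The silent/expressed assignment with equally weighted alternatives can be sketched as follows. This is a simplified illustration under stated assumptions: the codon table is partial, each changed codon is counted as a single mutation, and the function names `classify` and `branch_counts` are hypothetical.

```python
# Partial standard genetic code (DNA codons), enough for the example below.
CODON_TABLE = {
    "GGT": "Gly", "GGC": "Gly", "GGA": "Gly", "GGG": "Gly",
    "GAT": "Asp", "GAC": "Asp", "GCC": "Ala",
}


def classify(parent_codon, child_codon):
    """Return 'silent', 'expressed', or None when the codon did not change."""
    if parent_codon == child_codon:
        return None
    same_residue = CODON_TABLE[parent_codon] == CODON_TABLE[child_codon]
    return "silent" if same_residue else "expressed"


def branch_counts(parent_options, child_codon):
    """Count mutations on one branch when the ancestral codon is ambiguous.

    parent_options lists the alternative codons reconstructed at the parent
    node; each alternative receives equal weight, so counts may be fractional.
    """
    counts = {"silent": 0.0, "expressed": 0.0}
    weight = 1.0 / len(parent_options)
    for parent in parent_options:
        kind = classify(parent, child_codon)
        if kind is not None:
            counts[kind] += weight
    return counts


# Ambiguous ancestor (GGT or GAC) leading to an observed GGC codon.
print(branch_counts(["GGT", "GAC"], "GGC"))
```

Here GGT to GGC leaves glycine unchanged (silent) while GAC to GGC replaces aspartate with glycine (expressed), so each category receives half a mutation on this branch.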
  • A PAM unit is the number of point accepted mutations per 100 amino acids. The quality of the secondary structure prediction determined by the methods disclosed in U.S. Patent No. 5,958,784 becomes worse if the family does not contain at least some protein sequence pairs 40 PAM units or more divergent; families used in this invention therefore preferably contain at least some protein sequence pairs more than 40 PAM units divergent, but contain no protein pairs more than 150 PAM units divergent. Most preferably, a majority of protein pairs are 40 or more PAM units divergent and no protein pair is more than 120 PAM units divergent.
  • sequences in a protein family are, however, generally determined by the availability of sequences in the database.
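The PAM divergence preferences stated above (some pairs beyond 40 PAM and none beyond 150; most preferably a majority at 40 or more and none beyond 120) can be expressed as a simple check. The function name `family_divergence_level` and its return labels are hypothetical, and the input is assumed to be a list of pairwise PAM distances that has already been computed elsewhere.

```python
def family_divergence_level(pairwise_pam):
    """Classify a family from its pairwise PAM distances (one value per sequence pair)."""
    distances = list(pairwise_pam)
    if not distances:
        return "outside preferred range"
    widest = max(distances)
    divergent_pairs = sum(1 for d in distances if d >= 40)
    # Most preferred: a majority of pairs at 40 PAM or more, none above 120 PAM.
    if divergent_pairs > len(distances) / 2 and widest <= 120:
        return "most preferred"
    # Preferred: at least some pairs above 40 PAM, none above 150 PAM.
    if any(d > 40 for d in distances) and widest <= 150:
        return "preferred"
    return "outside preferred range"


print(family_divergence_level([45, 60, 30]))
print(family_divergence_level([10, 20, 130]))
print(family_divergence_level([10, 20, 30]))
```

In practice, as the text notes, the membership of a family is limited by which sequences are available, so such a check would describe a family rather than guarantee that a preferred one exists.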
  • Rapidly searchable database: the above-noted steps provide one method to organize the protein sequence database in a rapidly searchable form.
  • the ancestral sequences and the predicted secondary structures associated with the families defined by these steps are surrogates for the sequences and structures of the individual proteins that are members of the family.
  • the reconstructed ancestral sequence represents in a single sequence all of the sequences of the descendent proteins.
  • the predicted secondary structure associated with the ancestral sequence represents in a single structural model all of the core secondary structural elements of the descendent proteins.
  • the ancestral sequences can replace the descendent sequences, and the corresponding core secondary structural models can replace the secondary structures of the descendent proteins.
  • the first surrogate database is the database that collects from each of the families of proteins in the databases a single ancestral sequence, at the point in the tree that most accurately approximates the root of the tree. If the root cannot be determined, the ancestral sequence chosen for the surrogate sequence database is near the center of mass of the tree.
  • the second surrogate database is a database of the corresponding secondary structural elements. The surrogate databases are much smaller than the complete databases that contain the actual sequences or actual structures for each protein in the family, as each ancestral sequence represents many descendent proteins.
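A minimal sketch of how the two surrogate databases might be assembled is given below. It assumes each family is a dictionary that already records its reconstructed tree nodes, an optional root node, and a precomputed node near the center of mass; these field names are hypothetical and the center-of-mass computation itself is not shown.

```python
def build_surrogate_databases(families):
    """Return (sequence_db, structure_db), each keyed by family id.

    Each family dict is assumed (hypothetically) to hold:
      "id":     family identifier
      "root":   node id of the root, or None if it cannot be determined
      "center": node id near the center of mass of the tree
      "nodes":  {node id: {"sequence": ..., "structure": ...}}
    """
    sequence_db, structure_db = {}, {}
    for family in families:
        # Prefer the root; fall back to the center of mass when no root is known.
        node_id = family["root"] if family["root"] is not None else family["center"]
        node = family["nodes"][node_id]
        sequence_db[family["id"]] = node["sequence"]
        structure_db[family["id"]] = node["structure"]
    return sequence_db, structure_db
```

Each family contributes exactly one entry to each surrogate database, which is why both stay much smaller than the full sequence and structure databases.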
  • Probe sequence or structure
  • The probe sequence is then matched with the members of this family to determine where it fits within the evolutionary tree defined by the family.
  • The multiple alignment, evolutionary tree, predicted secondary structure and reconstructed ancestral sequences may be different once the new probe sequence is incorporated into the family. If so, the different multiple alignment, evolutionary tree, and predicted secondary structure are recorded, and the modified reconstructed ancestral sequence and structure are incorporated into their respective surrogate databases for future use.
  • Alignment of ancestral sequences with ancestral sequences has an advantage in detecting longer distance homology, as the ancestral sequences contain information about what amino acid residues are conserved within the nuclear family, and therefore are more likely to be conserved between diverging nuclear families.
  • Each separate table represents a different schema of interconnections between some or all of the protein sequences from the underlying biological data set/database.
  • The relational database for storing biological data includes a plurality of interrelated tables wherein each table comprises an attribute having a common domain with an attribute of at least one other table in the database.
  • The invention provides for viewing patterns in the evolutionary relationships between genomic sequences on the basis of the data stored in the relational database. Three versions of this schema can be viewed using the entity relationship diagrams of Figures 4A - 4C, which illustrate the fields for each type of data and the interconnections between data types.
  • The following is a detailed description of a collection of tables (entities) in one embodiment of the present invention with the attributes and keys of each relation.
  • The ID field of all relations except those of the SeqAnnType table is assigned automatically when inserting the relation.
  • Figure 4C shows the entity relationship diagram for DNA.
  • 1. AASequence Table
  • The AASequence table contains all amino acid sequences available in the database. Every sequence belongs to exactly one sequence database.
  • The Catalog table contains all catalogs available in this database.
  • The FamAnnotation table contains all annotations of all families. An annotation always belongs to exactly one family.
  • The Family table contains all families of all catalogs. A family always belongs to exactly one catalog and contains an arbitrary number of modules.
  • The Module table contains all modules of all catalogs. A module always belongs to exactly one sequence and exactly one family. The multiple sequence alignment is implicitly stored as a Gaps structure from which the MSA can be directly constructed.
  • 6. Profile Table
  • The Profile table contains all profiles of all families. A profile always belongs to exactly one family. For each family, there may be several profiles at different PAM.
  • The SeqAnnotation table contains all annotations of all sequences in this database. Each annotation belongs to exactly one sequence.
  • The SeqAnnType table contains all types of all sequence annotations in this database. Its main purpose is to provide standard descriptions for each type to be displayed in the GUI. This entity has more the character of a lookup table than of a real database entity. The Id attribute of each relation is assigned manually. The semantics of some IDs must be known to, e.g., the GUI, as different annotations will be displayed differently. Types are collected into groups (type id ranges) whose members normally do not overlap (e.g. secondary structure or binding sites). The groups are predefined in mastercatalog.ds.SeqAnnGroup.
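Grouping annotation types by id range can be illustrated with a small lookup. The ranges and group names below are invented for the example; the actual groups predefined in mastercatalog.ds.SeqAnnGroup are not given in the text.

```python
# Hypothetical id ranges (inclusive); the real values are not disclosed here.
ANNOTATION_GROUPS = [
    (100, 199, "secondary structure"),
    (200, 299, "binding site"),
]


def group_for_type(type_id):
    """Return the name of the group whose id range contains type_id, or None."""
    for low, high, name in ANNOTATION_GROUPS:
        if low <= type_id <= high:
            return name
    return None
```

Because the ids are assigned manually, a GUI could rely on such ranges to decide how each annotation type is drawn.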
  • The SequenceDB table contains all sequence databases available in this database.
  • The SequenceKey table contains all indexed keys of a sequence like ID, accession numbers, EC numbers and so on. Storing these keys separately allows additional key types to be added without modifying the database structure or any code. Furthermore, there may be multiple occurrences of the same key type for any sequence.
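The relationships among several of the tables described above can be sketched with an in-memory SQLite database. Only the table names and the "belongs to exactly one" relationships come from the description; the column names other than Id, the sample residues, and the sample accession values are assumptions made for illustration.

```python
import sqlite3

# Column names other than Id are illustrative assumptions.
SCHEMA = """
CREATE TABLE SequenceDB (Id INTEGER PRIMARY KEY, Name TEXT NOT NULL);
CREATE TABLE AASequence (Id INTEGER PRIMARY KEY,
    SeqDBId INTEGER NOT NULL REFERENCES SequenceDB(Id),
    Residues TEXT NOT NULL);
CREATE TABLE Catalog (Id INTEGER PRIMARY KEY, Name TEXT NOT NULL, Version TEXT);
CREATE TABLE Family (Id INTEGER PRIMARY KEY,
    CatalogId INTEGER NOT NULL REFERENCES Catalog(Id));
CREATE TABLE Module (Id INTEGER PRIMARY KEY,
    SequenceId INTEGER NOT NULL REFERENCES AASequence(Id),
    FamilyId INTEGER NOT NULL REFERENCES Family(Id),
    FirstAA INTEGER NOT NULL, LastAA INTEGER NOT NULL);
CREATE TABLE SequenceKey (Id INTEGER PRIMARY KEY,
    SequenceId INTEGER NOT NULL REFERENCES AASequence(Id),
    KeyType TEXT NOT NULL, KeyValue TEXT NOT NULL);
"""


def demo():
    """Populate a tiny database and return (family member ids, accession keys)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    cur = conn.cursor()
    # Id values are assigned automatically on insert (INTEGER PRIMARY KEY).
    cur.execute("INSERT INTO SequenceDB (Name) VALUES ('demo-db')")
    db_id = cur.lastrowid
    seq_ids = []
    for residues in ("MKTAYIAK", "MKSAYLAK"):
        cur.execute("INSERT INTO AASequence (SeqDBId, Residues) VALUES (?, ?)",
                    (db_id, residues))
        seq_ids.append(cur.lastrowid)
    cur.execute("INSERT INTO Catalog (Name, Version) VALUES ('demo-catalog', '1')")
    catalog_id = cur.lastrowid
    cur.execute("INSERT INTO Family (CatalogId) VALUES (?)", (catalog_id,))
    family_id = cur.lastrowid
    for seq_id in seq_ids:
        cur.execute("INSERT INTO Module (SequenceId, FamilyId, FirstAA, LastAA) "
                    "VALUES (?, ?, 1, 8)", (seq_id, family_id))
    # The same key type may occur more than once for one sequence.
    cur.executemany(
        "INSERT INTO SequenceKey (SequenceId, KeyType, KeyValue) VALUES (?, ?, ?)",
        [(seq_ids[0], "accession", "X00001"), (seq_ids[0], "accession", "X00002")])
    members = [row[0] for row in cur.execute(
        "SELECT SequenceId FROM Module WHERE FamilyId = ? ORDER BY SequenceId",
        (family_id,))]
    keys = [row[0] for row in cur.execute(
        "SELECT KeyValue FROM SequenceKey WHERE SequenceId = ? "
        "AND KeyType = 'accession' ORDER BY KeyValue", (seq_ids[0],))]
    conn.close()
    return members, keys
```

Keeping keys in their own table means a new key type is just a new KeyType value, with no change to the table definitions.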
  • Short arrays and graphs are preferably stored as BLOBs (Binary Large Objects) to prevent uncontrolled growth of the number of entities in the design. Only large arrays of variable size are stored in their own relation by properly normalizing the database.
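As a hedged illustration of the BLOB approach, a short integer array can be packed into a single binary value and restored on read. The little-endian 32-bit layout used here is an assumption; the text does not specify the binary format.

```python
import struct


def pack_ints(values):
    """Serialize a short list of ints as little-endian 32-bit values (assumed layout)."""
    return struct.pack(f"<{len(values)}i", *values)


def unpack_ints(blob):
    """Inverse of pack_ints: recover the list of ints from the BLOB bytes."""
    count = len(blob) // 4
    return list(struct.unpack(f"<{count}i", blob))
```

A value produced by pack_ints can be stored in one BLOB column, so a short array adds no extra entity to the design.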
  • The mapping between transient Java objects and persistent database relations is based on a unique "Id" (an INTEGER) for each relation. References to other objects in the database are therefore always INTEGER foreign keys.
  • For each table there is a subclass whose instances are Java objects corresponding to relations, and a subclass representing the corresponding entity and providing the necessary SQL code.
  • The graphical user interface of the present invention allows the user to browse the sequence database, perform searches, and examine evolutionary relationships.
  • The GUI is a browser that can be used to follow evolutionary relationships through the genomic sequence; the browsing provides interactive trees, multiple sequence alignments, and families; the database of families can be searched using a novel method that represents each family as a probabilistic sequence; rates of evolution are displayed on evolutionary trees and provide evidence of changes in function.
  • Each display screen of the invention generally comprises a window title bar, a menu bar (with commands such as File, Edit, and the like), a tool bar (with options such as Close, Paste, Clear, and the like), and an information display region.
  • The information display region may, for example, display a query window or a results window.
  • Figure 6 is a navigational flowchart of a preferred embodiment of the invention illustrating all major windows and options available. Rectangles represent the windows and circles represent options available within their respective windows. Arrows indicate direction [and each layer of the program is color coded].
  • The initial window is a LOGIN window (not shown) wherein a user may enter a valid user name and password to access the system.
  • Upon a successful login, the CATALOG SELECTION WINDOW 100 appears in the user's display as shown in a pictorial diagram in Figure 7 according to an embodiment of the invention.
  • The CATALOG SELECTION WINDOW 100 displays the available catalogs for selection by name, version, description, number of families and number of modules.
  • Each catalog is constructed to provide a view of relationships between (some or all of) the protein sequences in the database.
  • Different catalogs emphasize different features of the protein sequence database. For example, one catalog might emphasize repeat units within proteins, another catalog focuses on alignments which comprise the whole length of genes (e.g., a gene product catalog), and another focuses on local patterns of divergence between protein sequences (e.g., a modularized catalog).
  • Catalogs are composed of families of modules, each module defining a region of a protein sequence. Thus, the families relate regions of different protein sequences in biologically meaningful ways.
  • The Catalog table (described previously) contains the following: Id, Name, Description, Source, Version, DBVersion, ProfilePAM, NrModules, NRFamilies, MinFamId, MaxFamId, SearchKeys, SeqAnnTypes, RefSeqDBs.
  • Modularized catalog: A description of relationships between protein sequences based on local patterns of divergence (according to a model of evolution). There are many ways in which such a catalog might be constructed. Published examples include ProDom and DOMO.
  • Entry catalog: A description defining the 'classical' method of sequence database construction, in which there are no explicit relationships between different sequence entries. It is simply a dictionary of all available protein sequences in the database. Catalogs can be subsets of all data in the database. Typically this is most (scientifically) useful for the modularized catalogs where focusing on a subset of all the gene products (e.g. just mammalian sequences) is biologically meaningful (e.g., mammalian modularized catalog, bacterial modularized catalog). Selecting a catalog by double clicking on one of the available catalogs displays a query window for that catalog. Alternatively, you can select a catalog entry with the mouse and click on the Open button on the toolbar.
  • This search window 110 allows you to obtain a specific family number of interest as shown in Figure 8.
  • The result of this search is the Module Family Summary (MFS) window 140 of Figure 9, showing a graphical view of the associated proteins and their modules.
  • MFS Module Family Summary
  • Figure 10 - Search by Sequence: This window 120 allows you to search a protein against a catalog for homologous (evolutionarily related) protein families, as shown in Figure 10.
  • Homology is an important concept in extracting information from sequence databases because conclusions can be drawn about the chemical behavior and biological function when two proteins are homologous.
  • One way to determine whether two proteins are homologous is to compare their amino acid sequences. Procedures are well established in the art for comparing two protein sequences, scoring similarities, and using this score to assess the likelihood that the similarities arose by reason of common ancestry rather than by random chance [Gonnet, G.H.,
  • SSR Sequence Search Results
  • This window shows the relationship between the probe sequence chosen for the query and each of the families that are related.
  • The score corresponds to a 'log odds score': the logarithm of the ratio between the probability that the sequences are related according to the model of evolution and the probability that the similarity arose by chance.
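A log odds score of this kind can be written out as a short function. The scaling factor of 10 times the base-10 logarithm is an assumption chosen for the example; the text does not state which base or scale the system uses.

```python
import math


def log_odds_score(p_related, p_chance, scale=10.0):
    """Log of the ratio between the probability of the match under the
    evolutionary model and the probability of the match arising by chance.

    The 10 * log10 scaling is an illustrative assumption.
    """
    if p_related <= 0 or p_chance <= 0:
        raise ValueError("probabilities must be positive")
    return scale * math.log10(p_related / p_chance)
```

A positive score means the evolutionary model explains the similarity better than chance does; a score of zero means the two explanations are equally likely.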
  • Double clicking on any module shown in the SSR window will display the module family summary (MFS) window 140 for that family.
  • The alignment of the probe sequence with the summary can then be displayed.
  • Figure 12 - Search by Keyword: This window 130 provides searches of the protein sequence database by keyword, according to annotations of proteins in the original sequence database, as shown in Figure 12. Keywords provided include selection by organism, by classification, by gene name and by gene product description.
  • KSR keyword search results
  • Figure 13 includes a list of database sequence IDs which have database sequence annotations matching the keywords in the description.
  • The display shows a graphical view of the individual protein sequences which fit the keyword search criteria.
  • The graphical view is shown as a linear arrangement of schematic boxes (spatially oriented) representing the existing modules found in the selected catalog along the identified amino acid range (AA).
  • AA amino acid range
  • Each of the schematic boxes is preferably identified with its corresponding Module ID number and is differentiated by color or another form of distinctive representation.
  • Sequences may match a particular keyword while no modularization information is available (and no schematic boxes are shown) because these sequences were not part of the set included in the currently selected catalog.
  • This window is the gateway to the navigational power of the present invention, providing access to the other displays.
  • The window shows all of the sequences in the currently selected catalog that are members of a particular family, where the family contains all of the sequences which have a particular module of interest. All relationships between modules have been precalculated in the database.
  • The module is a subsequence that is a member of a family, and the graphical length of its box is proportionate to its length. Unidentified regions do not have schematic boxes.
  • The module of interest is preferably visibly distinguished from the other modules and its ID is identified in the title bar of the window.
  • The sequences in the family may also contain other modules. The sequences are ordered to cluster modules which are closest together in evolutionary distance.
  • The window is preferably tabular with a separate numbered row for each sequence.
  • The columns preferably include the sequence ID, a description, the amino acid range and a graphical view of the sequence shown as a linear arrangement of schematic boxes (spatially oriented) representing the existing modules found in the selected catalog along the identified amino acid range (AA).
  • Each of the schematic boxes is preferably identified with its corresponding Module ID number and is differentiated by color or another form of distinctive representation.
  • The schematic boxes for the currently selected module are vertically aligned and visually distinguished, such as by color (red).
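The proportional box layout can be sketched in plain text. In this illustration each module is a (module id, first residue, last residue) tuple, "=" stands for an ordinary module box, "#" for the currently selected module, and "-" for unidentified regions; the function name, the tuple layout and the character choices are all assumptions.

```python
def render_modules(seq_length, modules, width=60, selected=None):
    """Draw one sequence as a row of `width` characters.

    modules: list of (module_id, first_aa, last_aa), 1-based inclusive positions.
    Box lengths are proportional to the residue range each module covers.
    """
    row = ["-"] * width  # unidentified regions get no box
    for module_id, first_aa, last_aa in modules:
        start = (first_aa - 1) * width // seq_length
        end = max(start + 1, last_aa * width // seq_length)
        fill = "#" if module_id == selected else "="
        for i in range(start, min(end, width)):
            row[i] = fill
    return "".join(row)
```

Rendering every sequence of a family with the same width gives a rough text analogue of one MFS row per sequence.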
  • The window's "tool tips" feature may expand any truncated descriptions or provide additional information in a floating window when the pointer is placed over a particular table entry. This window provides a good indication of any long distance homology between various modules. Proteins that share a common module frequently possess other homologous modules at analogous positions. Such relationships can be confirmed by examining multiple sequence alignments and trees.
  • The toolbar of the MFS window allows you to perform many different tasks from this point (e.g., print, export to disk, display multiple sequence alignment (MSA), and display phylogenetic tree). Each of these tasks will now be discussed in detail.
  • The family summary can be printed directly to a printer available from a local computer.
  • The description of the modularization of this family can be exported to file.
  • MSA multiple sequence alignment
  • Selecting the MSA button on the window's toolbar shows the multiple sequence alignment (the way in which the modules are related at the amino acid level) in the MSA window 150 of Figure 14.
  • This window provides detailed evolutionary information at the protein sequence level, allowing the user to follow the pattern of conservation and variation in the amino acid composition of the module.
  • The MSA is preferably colored according to the hydrophobic or hydrophilic nature of the amino acids (e.g., RED indicates hydrophobic propensity and BLUE indicates hydrophilic propensity).
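The coloring rule can be sketched as a lookup from residue to color, with black used for residues in neither class (the preferred features also list amphiphilic residues in black). Which amino acids fall into each class is not stated in the text, so the two sets below are purely illustrative assumptions.

```python
# Illustrative class membership only; the actual assignment is not given in the text.
HYDROPHOBIC = frozenset("CFILMVW")
HYDROPHILIC = frozenset("DEGKNPQRS")


def residue_color(residue):
    """Return 'red' for hydrophobic, 'blue' for hydrophilic, 'black' otherwise."""
    aa = residue.upper()
    if aa in HYDROPHOBIC:
        return "red"
    if aa in HYDROPHILIC:
        return "blue"
    return "black"  # amphiphilic residues, gaps and unknown symbols


def color_alignment_row(row):
    """Pair each character of one aligned sequence with its display color."""
    return [(aa, residue_color(aa)) for aa in row]
```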
  • The numbering system used in the MSA window preferably corresponds to the number system implemented by both the MFS window and the Tree window. Highlighted on each sequence are annotated regions of individual protein sequences. Moving the cursor (pointer) over a chosen highlighted region displays the annotation in a floating window. These can be hand-crafted comments, such as feature table entries from SwissProt or automatically generated from patterns such as those in PROSITE (or others such as PRINTS). Different annotations can be selected from the option bar at the bottom of the MSA window.
  • Analyzing correlations in the patterns of substitution in the sequences for each module family allows predictions to be made about the nature of underlying structural or functional constraints.
  • The annotations provided are VParse, showing the location of putative structure breaking residues or motifs; Score, showing the degree of conservation at each position in the alignment. This value is dependent on both the evolutionary distances between the sequences and the mutability of the individual amino acids, and is a sensitive indicator of significance of conserved sites; PredSI, indicating the predicted solvent accessibility of the residue at that position; and PredSec, indicating the predicted secondary structure. If the PAM width of the family is poor or the number of sequences is small, then there may be insufficient information for a secondary structure prediction. If the given module aligns significantly with any entry in the PDB, indicating a confident homology, a string of secondary structural elements corresponding to that alignment can be seen at the bottom of the MSA window below the PredSI and PredSec strings.
  • The multiple sequence alignment window shows the amino-acid by amino-acid relationship between proteins which are in the same family.
  • Some preferred features include a) Coloring: Hydrophobic residues in red, hydrophilic residues in blue, amphiphilic residues in black, b) Parses: Regions of sequence which are likely to represent secondary structure breaking positions, are indicated, c) PredSI: Predicted surface/interior residues are indicated, d) PredSec: Predicted secondary structure is shown, e) Experimental Sec:
  • The family summary can be printed directly to any printer available from your local computer, d) Annotations:
  • The sequences in the MSA can be highlighted for particular annotations (usually specific sequence motifs or special database annotations).
  • One such collection of annotations is the 'Prosite' database. Regions of sequence corresponding to Prosite annotations are colored in orange. Moving the cursor over that region of the text displays the details of the annotation in a floating window.
  • The complete set of annotations visible in the current window can be obtained by looking at the 'Annotation Types' menu. A subset of all the annotations in the current collection can be obtained by customizing with the tickboxes in the annotation menu. Selection of sequences/modules will now be described. You can select some or all of the sequences of individual modules.
  • Sequences are selected individually in the MSA display with single mouse clicks. To combine your selection with previous selections, keep the <CTRL> key depressed while selecting. In addition to the toolbar tasks, other features are available from the window. For example, double clicking on any sequence in the ID column shows the protein sequence window for that family (including catalog membership, description, and annotations).
  • Selecting the Tree button on the window's toolbar shows the evolutionary tree in the Tree window 160 of Figure 15. This indicates the pattern of divergence/similarity between individual modules, assuming that the distance between modules can be computed from the similarity in the protein sequences.
  • Trees show an estimate of the evolutionary history of a protein module, constructed using the PAM distances between individual members of that family. Trees may be displayed in either rooted or unrooted form; there is no significant distinction between these representations, the location of the root being chosen to balance the tree. The lengths of the branches of the tree are displayed in PAM units. This provides an estimate of divergence in composition of the various sequences. Selecting the "kA/kS" key at the bottom of the window will display the ratio of the rate of expressed changes at the DNA level to the rate of silent changes, i.e., the rate of mutation leading to changes at the protein level calibrated against the rate of mutation leading to changes only at the DNA level.
  • The rates are preferably normalized so that when reading expressed:silent ratios, a value of around 1.0 indicates no selection, both synonymous (silent changes) and non-synonymous
  • The threshold level of kA/kS is user adjustable on the slide scale. Separate coloring schemes are preferably used to indicate branches above or below the threshold. For example, if kA/kS is less than the threshold (default 1.0) the branch is colored blue and if kA/kS is greater than the threshold, the branch is colored red. If no DNA sequence information is available, then the branch is colored black. In practice, proteins are normally under the influence of purifying selection so the ratios fall well below this value. Therefore, where the ratio approaches or exceeds 1.0, the confidence that one is looking at an episode of rapid sequence evolution (presumably to new function) increases.
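The branch coloring rule reduces to a small function. The text does not say how a ratio exactly equal to the threshold is colored; treating it as red here is an assumption, as is the use of None to mean that no DNA sequence information is available.

```python
def branch_color(ka_ks, threshold=1.0):
    """Color a tree branch from its kA/kS ratio.

    ka_ks is None when no DNA sequence information is available (assumed encoding).
    A ratio exactly equal to the threshold is colored red here; that tie-break
    is an assumption, not something the description specifies.
    """
    if ka_ks is None:
        return "black"
    return "blue" if ka_ks < threshold else "red"
```

Moving the slider corresponds to calling the function again with a different threshold for every branch of the tree.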
  • The value of the expressed:silent ratio will appear lower as random mutation has increased the number of silent changes.
  • A suitable threshold value can be determined by examining the tree as a whole, which will contain branches that have maintained purifying selection for longer periods, and comparing these values with those that suggest more rapid changes at expressed sites.
  • The graphic interface also offers a scaling facility to zoom in and zoom out using the slide scale at the bottom of the window. Zooming may be necessary in order to see the PAM distance labels or leaf identifiers. This information can also be seen in a floating window by positioning the cursor over the leaves or branches. Individual branches can be displayed on separate trees by selecting the appropriate branches and the "Zoom" key. The "fit" button sizes the tree to the full window size.
  • The Tree window shows the evolutionary relationship between individual modules of a family, using distance calculated from a comparison of their amino acid sequences.
  • Some preferred features include: a) Coloring: Blue edge - KaKs (see below) below threshold, Red edge - KaKs above threshold, Black edge - KaKs not computed (no
  • Rooted/Unrooted: The tree can be displayed in rooted, or in unrooted form, depending on user preference. There is no difference in information content between these two descriptions; the root of the tree is chosen for balance, not as a result of other phylogenetic evidence.
  • Some preferred functions include: a) Export tree description: The tree can be exported to file in a variety of formats, b) Print: The tree can be printed directly to any printer available from your local computer, c) Annotations: The sequences in the MSA can be highlighted for particular annotations (usually specific sequence motifs or special database annotations). One such collection of annotations is the 'Prosite' database.
  • Regions of sequence corresponding to Prosite annotations are colored in orange. Moving the cursor over that region of the text displays the details of the annotation in a floating window. The complete set of annotations visible in the current window can be obtained by looking at the 'Annotation Types' menu. A subset of all the annotations in the current collection can be obtained by customizing with the tickboxes in the annotation menu; d) Selection sequences/modules: You can select branches or leaves of the tree with single clicks of the mouse. Selecting a branch will result in all of the branches and leaves downstream of it, relative to the root, being selected. (The root is marked with a circle).
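Selecting everything downstream of a clicked branch amounts to a subtree traversal. The sketch below assumes the rooted tree is stored as a mapping from each node to its list of children; that representation and the function name are illustrative choices.

```python
def select_downstream(children, node):
    """Return `node` plus every branch and leaf below it, in depth-first order.

    children: dict mapping a node id to the list of its child node ids;
    leaves may be absent from the dict or map to an empty list.
    """
    selected, stack = [], [node]
    while stack:
        current = stack.pop()
        selected.append(current)
        # Reverse so that children are visited in their listed order.
        stack.extend(reversed(children.get(current, [])))
    return selected
```

Clicking an internal node therefore selects its whole clade, while clicking a leaf selects only that leaf.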
  • Fit tree/Rescale tree: The fit button can be used to rescale the tree to fit into the entire window. Pressing SHIFT and the left mouse button can be used to zoom in toward the selected portion of the window, and SHIFT with the right mouse button will zoom out from the selected portion of the window; f) Ka/Ks: You can display either KaKs or PAM distances on the tree.
  • The MFS window, tree window and MSA window are linked so that selections in one window highlight sequences in the others. You can select some or all of the modules possessed by individual sequences. Modules are selected individually in the MFS window with single mouse clicks. All of the modules possessed by a protein can be selected at once by selecting the module id (#) on the left hand column. To combine your selection with previous selections, keep the <CTRL> key depressed while selecting.
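The linking of windows through a shared selection can be sketched with a simple listener pattern. The class and method names below are hypothetical and are not taken from the Java source in Appendix A; the sketch only shows the behavior described: a shared selection, an optional combine mode (as with the CTRL key), and newly opened windows picking up the current selection.

```python
class IndexSelection:
    """Shared selection of sequence indices, broadcast to registered listeners."""

    def __init__(self):
        self._selected = set()
        self._listeners = []

    def add_listener(self, listener):
        # A newly opened window immediately receives the current selection.
        self._listeners.append(listener)
        listener(frozenset(self._selected))

    def select(self, indices, combine=False):
        """Replace the selection, or extend it when combine is True (CTRL held)."""
        new = set(indices)
        self._selected = (self._selected | new) if combine else new
        for listener in self._listeners:
            listener(frozenset(self._selected))


class Window:
    """Stand-in for an MFS, MSA or Tree window that highlights selected rows."""

    def __init__(self, name, selection):
        self.name = name
        self.highlighted = frozenset()
        selection.add_listener(self.on_selection_changed)

    def on_selection_changed(self, selected):
        self.highlighted = selected
```

In use, one IndexSelection object is shared by all open windows, so a click handled by any one of them updates the highlighting in the rest.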
  • Notable features of the present invention include a) multi-catalog views where a user can simultaneously view more than one catalog, b) tree to tree interactivity where active selections from a current window get propagated throughout (selected and deleted), c) connectivity of selections where a section of a tree as selected will highlight associated information in other windows which is continuously applied as windows are opened, d) the MSA window's Prosite annotations, tool tips to show annotations, and customization of annotation display/view to select subsets.
  • Appendix A contains source code for certain of the functions of the present invention written in Java, including IndexSelection.java for managing and propagating selections, FamilyFrame.java for showing the data about a family as a table, FamilyTableModel.java for representation of family information with sequences and their modules, IndexSelectionListener.java as an interface for a class which listens to index selection changes, and FamilySequenceRenderer.java for rendering family sequences wherein single clicking on a module selects it and double clicking triggers opening of the corresponding family frame.
  • FIG. 16 depicts an example of consecutive screen displays for interactive and progressive query-making activity from search by sequence.
  • First, a sequence is typed or pasted into the query box 120. Minimum scope, maximum matches and PAM are adjusted by the user as desired.
  • The query is then run, resulting in the Sequence Search Results (SSR) window 125. From this window, the user can progressively query the desired module(s) by double-clicking on the module(s), which results in the MFS window 140 with the module of interest designated in RED and positionally aligned with all other modules.
  • FIG. 17 depicts an example of consecutive screen displays for interactive and progressive query-making activity from search by keyword.
  • The search term "isocitrate" is entered (other search terms may be included with the necessary boolean logic) in the query window 130.
  • The Keyword Search Results (KSR) window 135 is displayed, listing all sequences which contain the queried keyword with the graphical display of the sequences and their modules. From this window, the user may select a module of interest to be displayed in the MFS window 140.
  • The MFS window 140 shows the two-dimensional spatial orientation of the biological data, including visual locations of modules (represented as schematic boxes) in each of the sequences for a selected family distinguishably displayed and positionally aligned as well as the location of all other modules in those sequences.
  • Example 3
  • FIG. 18A depicts the MFS window 140 for one catalog. From the menu bar, additional catalog views may be selected. In this example, both the genomes and OPgenomes catalogs are chosen to be viewed as shown in FIG. 18B. This view depicts the MFS window with two catalogs. From this window the user can begin progressive query-making activity by selecting modules of interest and viewing them in the MFS windows.
  • Example 4
  • FIG. 19 depicts the screen display for the evolutionary tree window 160 as linked to the MFS window 140 and MSA window 150 with highlighted selections propagated throughout. Highlighting a section of the tree will automatically propagate the highlighting to the other windows (MSA and MFS) for the selected sequences.
  • Example 5
  • FIG. 20 depicts the screen displays showing interactive and progressive query-making activity across multiple MFS windows 140.
  • The user can select the 377_1 for display, the 978_1 for display, and the 371_1 for display (simultaneously).
  • The windows can be closed or moved about the screen as desired.
  • The user can continue to select displays from the resulting windows, e.g., the user can select the 1075_1 module from the 978_1 MFS window for display, and so on.
  • Progressive querying can be continued through level upon level.
  • The MSA 150 of Figure 14, evolutionary tree 160 of Figure 15, or database entry (not shown) can be displayed as depicted in the flowchart of Figure 6.

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Apparatus Associated With Microorganisms And Enzymes (AREA)
  • Measuring And Recording Apparatus For Diagnosis (AREA)

Abstract

The present invention relates to a computerized search tool for searching and displaying biological data. More particularly, the invention relates to a computerized search tool that, on the one hand, enables computerized retrieval of data from various databases and, on the other hand, provides a new graphical interface that notably improves the representation of biological data, progressive querying, and navigation between windows and databases. The invention can be implemented in various ways, notably as a system, an apparatus, a method, or a computer-readable medium.
EP00963469A 1999-09-14 2000-09-14 Interface graphique pour affichage et analyse de donnees de sequences biologiques Withdrawn EP1221126A2 (fr)

Applications Claiming Priority (5)

Application Number Priority Date Filing Date Title
US15414999P 1999-09-14 1999-09-14
US09/397,335 US6941317B1 (en) 1999-09-14 1999-09-14 Graphical user interface for display and analysis of biological sequence data
US154149P 1999-09-14
US397335 1999-09-14
PCT/US2000/025247 WO2001020535A2 (fr) 1999-09-14 2000-09-14 Interface graphique pour affichage et analyse de donnees de sequences biologiques

Publications (1)

Publication Number Publication Date
EP1221126A2 true EP1221126A2 (fr) 2002-07-10

Family

ID=26851192

Family Applications (1)

Application Number Title Priority Date Filing Date
EP00963469A Withdrawn EP1221126A2 (fr) 1999-09-14 2000-09-14 Interface graphique pour affichage et analyse de donnees de sequences biologiques

Country Status (6)

Country Link
EP (1) EP1221126A2 (fr)
JP (1) JP2003509776A (fr)
CN (1) CN1390332A (fr)
AU (1) AU781841B2 (fr)
CA (1) CA2384883A1 (fr)
WO (1) WO2001020535A2 (fr)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1271343A1 (en) * 2001-06-22 2003-01-02 Kelman Gesellschaft für Geninformation mbH Method and device for automatically processing and visualizing sets of information
EP1271344A1 (en) * 2001-06-22 2003-01-02 Kelman Gesellschaft für Geninformation mbH Method and device for automatically processing and visualizing data
EP1510938B1 (en) * 2003-08-29 2014-06-18 Sap Ag Method for providing a visualization graph on a computer and computer for providing a visualization graph
JP4638726B2 (en) * 2004-12-22 2011-02-23 株式会社アルファジェン Sample set manufacturing method, gene alignment program, and sample set
CN1932040B (en) * 2006-09-21 2010-06-09 武汉大学 Automated rapid detection system for members of target gene families across the whole genome
WO2009148616A2 (en) * 2008-06-06 2009-12-10 Dna 2.0 Inc. Systems and methods for determining properties that affect an expression property value of polynucleotides in an expression system
US8832098B2 (en) 2008-07-29 2014-09-09 Yahoo! Inc. Research tool access based on research session detection
KR101165536B1 (en) * 2010-10-21 2012-07-16 삼성에스디에스 주식회사 Method for providing genetic information, genetic information server therefor, and computer-readable recording medium storing a genetic information browser program
US20140089328A1 (en) * 2012-09-27 2014-03-27 International Business Machines Corporation Association of data to a biological sequence
US10296164B2 (en) * 2015-12-08 2019-05-21 Fisher-Rosemount Systems, Inc. Methods, apparatus and systems for multi-module process control management
CN113674798B (en) * 2020-05-15 2024-04-26 复旦大学 Analysis system for proteomics data
WO2023220204A1 (en) 2022-05-10 2023-11-16 Google Llc Incremental streaming for live summaries

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5966712A (en) * 1996-12-12 1999-10-12 Incyte Pharmaceuticals, Inc. Database and system for storing, comparing and displaying genomic information

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
None *
See also references of WO0120535A3 *

Also Published As

Publication number Publication date
CN1390332A (zh) 2003-01-08
AU7488100A (en) 2001-04-17
AU781841B2 (en) 2005-06-16
WO2001020535A9 (fr) 2002-09-26
JP2003509776A (ja) 2003-03-11
WO2001020535A2 (fr) 2001-03-22
CA2384883A1 (fr) 2001-03-22
WO2001020535A3 (fr) 2002-01-17

Similar Documents

Publication Publication Date Title
US6941317B1 (en) Graphical user interface for display and analysis of biological sequence data
US20030171876A1 (en) System and method for managing gene expression data
US20030176929A1 (en) User interface for a bioinformatics system
US20060020398A1 (en) Integration of gene expression data and non-gene data
Frishman et al. Comprehensive, comprehensible, distributed and intelligent databases: current status.
JP2001125929A (en) Graphical viewer for biomolecular sequence data
WO2002073504A1 (en) System and method for retrieving and using gene expression data from multiple sources
AU781841B2 (en) Graphical user interface for display and analysis of biological sequence data
Cannataro et al. Proteus, a grid based problem solving environment for bioinformatics: Architecture and experiments
Bailey et al. GAIA: framework annotation of genomic sequence
Skrzypek et al. Using the Candida genome database
US20040170949A1 (en) Method for organizing and depicting biological elements
EP1366359A1 (en) System and method for managing gene expression data
US20150149512A1 (en) Integrated Desktop Software for Management of Virus Data
Li et al. Bioinformatics adventures in database research
Valencia Search and retrieve
Dahlquist Using GenMAPP and MAPPFinder to View Microarray Data on Biological Pathways and Identify Global Trends in the Data
US9418204B2 (en) Bioinformatics system architecture with data and process integration
WO2000016220A1 (en) Device and method for building a database for retrieving expression data and managing laboratory information
Ray et al. The PACRAT system: an extensible WWW-based system for correlated sequence retrieval, storage and analysis
Sayers et al. Macromolecular structure databases
Wishart Lecture 2 Biological Databases
NEW Entrez User's Guide
Westfall User-Centered Interface Design
Hightower Guide to Selected Bioinformatics Internet Resources: Science and Technology Resources on the Internet

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 20020314

AK Designated contracting states

Kind code of ref document: A2

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE

AX Request for extension of the european patent

Free format text: AL PAYMENT 20020314;LT PAYMENT 20020314;LV PAYMENT 20020314;MK PAYMENT 20020314;RO PAYMENT 20020314;SI PAYMENT 20020314

17Q First examination report despatched

Effective date: 20020909

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20070405