WO2009039425A1 - Directional expression-based scientific information knowledge management - Google Patents

Publication number
WO2009039425A1
WO2009039425A1 PCT/US2008/077097 US2008077097W WO2009039425A1 WO 2009039425 A1 WO2009039425 A1 WO 2009039425A1 US 2008077097 W US2008077097 W US 2008077097W WO 2009039425 A1 WO2009039425 A1 WO 2009039425A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
feature set
correlation
features
directional
Application number
PCT/US2008/077097
Other languages
English (en)
Inventor
Qiaojuan Jane Su
Ilya Kupershmidt
Original Assignee
Nextbio
Application filed by Nextbio
Publication of WO2009039425A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/10 Gene or protein expression profiling; Expression-ratio estimation or normalisation
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/20 Heterogeneous data integration
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics

Definitions

  • the present invention relates generally to methods, systems and apparatus for storing and retrieving biological, chemical and medical information.
  • Research in these fields has increasingly shifted from the laboratory bench to computer-based methods.
  • Public sources such as NCBI (National Center for Biotechnology Information), for example, provide databases with genetic and molecular data. Between these and private sources, an enormous amount of data is available to the researcher from various assay platforms, organisms, data types, etc.
  • researchers need fast and efficient tools to quickly assimilate new information and integrate it with pre-existing information across different platforms, organisms, etc.
  • researchers also need tools to quickly navigate through and analyze diverse types of information.
  • the present invention relates to methods, systems and apparatus for capturing, integrating, organizing, navigating and querying large-scale data from high-throughput biological and chemical assay platforms. It provides a highly efficient meta-analysis infrastructure for performing research queries across a large number of studies and experiments from different biological and chemical assays, data types and organisms, as well as systems to build and add to such an infrastructure.
  • aspects of the invention relate to integrating, organizing, navigating and querying "directional" data, such as gene expression profiles.
  • a feature set is a "derived" data set from the "raw data” taken from one or more experiments on one or more samples.
  • a directional feature set is a feature set that contains information about the direction of change in a feature relative to a control.
  • Bi-directional feature sets, for example, contain information about features that are up-regulated and features that are down-regulated relative to a control.
  • One example of a bi-directional feature set is a gene expression profile that contains information about up- and down-regulated genes in a particular disease state relative to the normal state, or in a treated sample relative to a non-treated control.
  • One aspect of the invention relates to methods of integrating expression-based data into a knowledge base having at least one pre-existing bi-directional feature set.
  • Each bidirectional feature set includes a list of features (e.g., genes, proteins) and, for at least some of the listed features, up or down expression information relative to a control.
  • a bi-directional feature set includes information about the feature expression signature of a disease.
  • a bi-directional feature set includes information about the feature expression response to a stimulus or compound.
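  • As a rough illustration of the bi-directional feature set described above, the following minimal sketch stores each feature with a signed fold change, where a positive value means up-regulated and a negative value means down-regulated relative to a control. The class name, the sign convention and the example genes and values are assumptions made for illustration and are not taken from the specification.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class BiDirectionalFeatureSet:
    """Hypothetical container: feature name -> signed fold change vs. control."""
    name: str
    fold_changes: Dict[str, float] = field(default_factory=dict)

    def up(self) -> Set[str]:
        # Features whose response increased relative to the control.
        return {f for f, fc in self.fold_changes.items() if fc > 0}

    def down(self) -> Set[str]:
        # Features whose response decreased relative to the control.
        return {f for f, fc in self.fold_changes.items() if fc < 0}


# Illustrative values only; not data from any real experiment.
disease = BiDirectionalFeatureSet(
    "disease_vs_normal",
    {"TP53": -2.5, "MYC": 4.0, "EGFR": 1.8, "BRCA1": -1.6},
)
print(sorted(disease.up()), sorted(disease.down()))
```

  • Splitting the set into an "up" part and a "down" part in this way is what allows each part to be correlated separately with other content.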
  • The methods involve receiving a bi-directional input feature set containing a list of features and, for at least some of the listed features, up or down regulation expression information relative to a control, and automatically correlating the input feature set with a plurality of, or all, other pre-existing bi-directional feature sets.
  • Automatically correlating the input feature set with a bi-directional pre-existing feature set involves determining multiple individual correlation scores and, from the multiple individual correlation scores, determining an overall correlation score and a correlation direction that indicates if the input feature set is positively or negatively correlated to the pre-existing feature set, and the magnitude or extent of that correlation.
  • Determining these multiple individual correlation scores may involve correlating the up-regulation expression information of the input feature set with the non-directional feature set and correlating the down-regulation expression information of the input feature set with the non-directional feature set.
  • the methods may involve storing at least the overall correlation scores and correlation directions for use in replying to user queries involving a feature set.
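  • The idea of deriving an overall score and a direction from several individual scores can be sketched as follows. This is only an illustration under stated assumptions: the individual score here is a simple Jaccard overlap standing in for whatever statistic an embodiment uses, and the combination rule (concordant minus discordant scores) is a hypothetical choice, not the rule defined by the specification.

```python
from typing import Dict, Set, Tuple


def overlap_score(a: Set[str], b: Set[str]) -> float:
    """Stand-in individual correlation score: Jaccard overlap in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def correlate_bidirectional(
    a_up: Set[str], a_down: Set[str], b_up: Set[str], b_down: Set[str]
) -> Tuple[Dict[str, float], float, str]:
    # Four individual scores between the up/down parts of sets A and B.
    scores = {
        "A+B+": overlap_score(a_up, b_up),
        "A-B-": overlap_score(a_down, b_down),
        "A+B-": overlap_score(a_up, b_down),
        "A-B+": overlap_score(a_down, b_up),
    }
    # Assumed rule: same-direction agreement counts for, opposite-direction against.
    net = (scores["A+B+"] + scores["A-B-"]) - (scores["A+B-"] + scores["A-B+"])
    direction = "positive" if net > 0 else "negative" if net < 0 else "none"
    return scores, abs(net), direction


scores, overall, direction = correlate_bidirectional(
    a_up={"MYC", "EGFR"}, a_down={"TP53", "BRCA1"},
    b_up={"TP53", "BRCA1"}, b_down={"MYC"},
)
print(scores, overall, direction)
```

  • In this toy case the features that go up in A go down in B and vice versa, so the sketch reports a negative correlation direction.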
  • Another aspect of the invention relates to computer implemented methods of conducting queries in a knowledge base of chemical and/or biological information that includes a plurality of feature sets, each feature set containing a list of chemical or biological features and associated statistical information.
  • The methods involve receiving a query identifying at least one feature set having up and down feature regulation expression information, wherein the query is received from a user input to a computer system; using precomputed correlation scores between the at least one identified feature set and other content in the knowledge base to determine feature set rankings in reply to said query; and presenting the user with a ranked list of feature sets as determined by the precomputed correlation scores and, for at least some of the feature sets in the ranked list, an indication of whether the correlation of that feature set with the identified feature set is positive or negative.
  • the methods further involve presenting the user with information about the correlation between each of the up and down regulation expression information of the identified feature set and each of the up and down feature regulation expression information of at least one listed feature set.
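  • A query of this kind can be pictured as a lookup over stored scores. The sketch below assumes a hypothetical in-memory table keyed by feature set pairs; the table layout, the names and the score values are invented for illustration, and a real knowledge base would use a database.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical precomputed store: (query set, other set) -> (score, direction).
PRECOMPUTED: Dict[Tuple[str, str], Tuple[float, str]] = {
    ("disease_X", "compound_A"): (8.2, "negative"),
    ("disease_X", "compound_B"): (3.1, "positive"),
    ("disease_X", "compound_C"): (5.7, "negative"),
    ("disease_Y", "compound_A"): (1.0, "positive"),
}


def query(feature_set: str, direction: Optional[str] = None) -> List[Tuple[str, float, str]]:
    """Return feature sets correlated with `feature_set`, best score first."""
    hits = [
        (other, score, corr_dir)
        for (fs, other), (score, corr_dir) in PRECOMPUTED.items()
        if fs == feature_set and (direction is None or corr_dir == direction)
    ]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


ranked = query("disease_X")
negatives = query("disease_X", direction="negative")
for name, score, corr_dir in ranked:
    print(name, score, corr_dir)
```

  • Because the scores are computed ahead of time, answering the query only requires filtering and sorting, and the direction can be shown next to each ranked result.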
  • Figure 1A is a schematic representation of various elements in a knowledge base that may be used in accordance with various embodiments of the present invention.
  • Figure 1B is a representation of a directional feature set including up- and down-expressed features.
  • Figure 1C is a representation of the up-expressed features from the feature set shown in
  • Figure 2 is a flow diagram presenting steps in correlating two bi-directional feature sets.
  • Figure 3 is a flow diagram presenting steps in correlating a bi-directional feature set A with a non-directional feature set B.
  • Figure 4A shows a representation of the four individual correlation scores, A+B+, A+B-, A-B- and A-B+, computed to determine an overall correlation score between two bi-directional feature sets, A and B.
  • Figure 4B shows a representation of the two individual correlation scores, A+B and A-B, computed to determine an overall correlation score between a bi-directional feature set A and a non-directional feature set B.
  • Figure 5 is a flow diagram presenting steps in correlating a bi-directional feature set A with a feature group G.
  • Figure 6 is a flow diagram presenting steps in correlating a feature set to another feature set.
  • Figure 7 is a representation of feature lists of two feature sets (F1 and F2) that are to be correlated to one another using the process described in Figure 6. Stop or checkpoints used in a correlation process are shown on the diagram.
  • Figure 8 is a set diagram showing an example of a feature set versus feature group relationship.
  • Figure 9A is a flow diagram presenting key steps in correlating a feature set to a feature group.
  • Figure 9B is an example of a feature table of a feature set that may be correlated to a feature group. Stop or checkpoints used in a correlation process are shown on the diagram.
  • Figure 10 is a set diagram showing an example of a feature set versus feature set relationship.
  • Figure 11A is a flow diagram presenting key steps in correlating a feature set to another feature set.
  • Figure 11B is a representation of feature lists of two feature sets (F1 and F2) that are to be correlated to one another. Stop or checkpoints used in a correlation process are shown on the diagram.
  • Figure 11C is a representation of feature lists of two feature sets (F1 and F2) that are to be correlated to one another. Lines indicate features that are mapped to or aligned with one another.
  • Figure 12 is a process flow diagram presenting key steps employed to generate a knowledge base in accordance with one embodiment of the present invention.
  • Figure 13A is a schematic representation of raw data and data sets (feature sets) generated from raw data for use in a knowledge base.
  • Figure 13B is a flow diagram presenting key steps employed in curating raw data in accordance with one embodiment of the present invention.
  • Figure 13C is a flow diagram presenting key steps employed in a data quality control operation of a curating process in accordance with one embodiment of the present invention.
  • Figure 13D is a flow diagram presenting key steps employed in a statistical analysis operation of a curating process in accordance with one embodiment of the present invention.
  • Figure 14 is a flow diagram presenting key steps employed in generating tissue-specific feature sets from multi-tissue experiments or studies in accordance with one embodiment of the present invention.
  • Figure 15 is a flow diagram presenting key steps employed in importing data into a knowledge base in accordance with one embodiment of the present invention.
  • Figure 16A is a process flow diagram depicting some operations in processing a query employing a single feature set as the query input in accordance with certain embodiments.
  • Figures 16B-16D are screen shots depicting query results window for a feature set versus feature sets query.
  • Figure 17 is a diagrammatic representation of a computer system that can be used with the methods and apparatus described herein.
  • Detailed Description of the Preferred Embodiments
  • the present invention relates to methods, systems and apparatus for capturing, integrating, organizing, navigating and querying large-scale data from high-throughput biological and chemical assay platforms. It provides a highly efficient meta-analysis infrastructure for performing research queries across a large number of studies and experiments from different biological and chemical assays, data types and organisms, as well as systems to build and add to such an infrastructure. While most of the description below is presented in terms of systems, methods and apparatuses that integrate and allow exploration of data from biological experiments and studies, the invention is by no means so limited. For example, the invention covers chemical and clinical data. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It will be apparent, however, that the present invention may be practiced without limitation to some of the specific details presented herein.
  • Raw data - This is the data from one or more experiments that provides information about one or more samples. Typically, raw data is not yet processed to a point suitable for use in the databases and systems of this invention. Subsequent manipulation reduces it to the form of one or more "feature sets" suitable for use in such databases and systems. The process of converting the raw data to feature sets is sometimes referred to as curation. Most of the examples presented herein concern biological experiments in which a stimulus acts on a biological sample such as a tissue or cell culture. Often the biological experiment will have associated clinical parameters such as tumor stage, patient history, etc. The invention is not, however, limited to biological samples and may involve, for example, experiments on non-biological samples such as chemical compounds, various types of synthetic and natural materials, etc. and their effects on various types of assays (e.g., cancer cell line progression).
  • the sample may be exposed to one or more stimuli or treatments to produce test data. Control data may also be produced.
  • the stimulus is chosen as appropriate for the particular study undertaken. Examples of stimuli that may be employed are exposure to particular materials or compositions, radiation (including all manner of electromagnetic and particle radiation), forces (including mechanical (e.g., gravitational), electrical, magnetic, and nuclear), fields, thermal energy, and the like.
  • materials that may be used as stimuli include organic and inorganic chemical compounds, biological materials such as nucleic acids, carbohydrates, proteins and peptides, lipids, various infectious agents, mixtures of the foregoing, and the like.
  • stimuli include non-ambient temperature, non-ambient pressure, acoustic energy, electromagnetic radiation of all frequencies, the lack of a particular material (e.g., the lack of oxygen as in ischemia), temporal factors, etc.
  • a particularly important class of stimuli in the context of this invention is exposure to therapeutic agents (including agents suspected of being therapeutic but not yet proven to have this property).
  • the therapeutic agent is a chemical compound such as a drug or drug candidate or a compound present in the environment.
  • the biological impact of chemical compounds is manifest as a change in a feature such as a level of gene expression or a phenotypic characteristic.
  • the raw data will include "features" for which relevant information is produced from the experiment.
  • the features are genes or genetic information from a particular tissue or cell sample exposed to a particular stimulus.
  • a typical biological experiment determines expression or other information about a gene or other feature associated with a particular cell type or tissue type.
  • Other types of genetic features for which experimental information may be collected in raw data include SNP patterns (e.g., haplotype blocks), portions of genes (e.g., exons/introns or regulatory motifs), regions of a genome or chromosome spanning more than one gene, etc.
  • Other types of biological features include phenotypic features such as the morphology of cells and cellular organelles such as nuclei, Golgi, etc.
  • Types of chemical features include compounds, metabolites, etc.
  • The raw data may be generated from any of various types of experiments using various types of platforms (e.g., any of a number of microarray systems including gene microarrays, SNP microarrays and protein microarrays, cell counting systems, High-Throughput Screening ("HTS") platforms, etc.).
  • an oligonucleotide microarray is also used in experiments to determine expression of multiple genes in a particular cell type of a particular organism.
  • mass spectrometry is used to determine abundance of proteins in samples.
  • Feature set - This refers to a data set derived from the "raw data" taken from one or more experiments on one or more samples.
  • the feature set includes one or more features (typically a plurality of features) and associated information about the impact of the experiment(s) on those features.
  • the features of a feature set may be ranked (at least temporarily) based on their relative levels of response to the stimulus or treatment in the experiment(s) or based on their magnitude and direction of change between different phenotypes, as well as their ability to differentiate different phenotypic states (e.g., late tumor stage versus early tumor stage).
  • the feature set may include information about only a subset of the features or responses contained in the raw data. As indicated, a process such as curation converts raw data to feature sets.
  • the feature set pertains to raw data associated with a particular question or issue (e.g., does a particular chemical compound interact with proteins in a particular pathway).
  • the feature set may be limited to a single cell type of a single organism. From the perspective of a "Directory,” a feature set belongs to a "Study.” In other words, a single study may include one or more feature sets.
  • the feature set is either a "bioset” or a "chemset.”
  • a bioset typically contains data providing information about the biological impact of a particular stimulus or treatment.
  • the features of a bioset are typically units of genetic or phenotypic information as presented above.
  • a chemset typically contains data about a panel of chemical compounds and how they interact with a sample, such as a biological sample.
  • the features of a chemset are typically individual chemical compounds or concentrations of particular chemical compounds.
  • the associated information about these features may be EC50 values, IC50 values, or the like.
  • a feature set typically includes, in addition to the identities of one or more features, statistical information about each feature and possibly common names or other information about each feature.
  • a feature set may include still other pieces of information for each feature such as associated description of key features, user-based annotations, etc.
  • the statistical information may include p-values of data for features (from the data curation stage), "fold change” data, and the like.
  • a fold change indicates the number of times (fold) that expression is increased or decreased in the test or control experiment (e.g., a particular gene's expression increased "4-fold” in response to a treatment).
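  • The fold change arithmetic can be shown with a small sketch. The function name and the convention of reporting a magnitude of at least 1 together with a direction label are assumptions made for illustration.

```python
from typing import Tuple


def fold_change(test: float, control: float) -> Tuple[float, str]:
    """Return (fold, direction) for a test measurement relative to a control."""
    if test <= 0 or control <= 0:
        raise ValueError("expression values must be positive")
    if test > control:
        return test / control, "up"      # e.g. 400 vs 100 -> 4-fold up
    if test < control:
        return control / test, "down"    # e.g. 50 vs 200 -> 4-fold down
    return 1.0, "unchanged"


print(fold_change(400.0, 100.0))
print(fold_change(50.0, 200.0))
```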
  • a feature set may also contain features that represent a "normal state", rather than an indication of change.
  • a feature set may contain a set of genes that have "normal and uniform" expression levels across a majority of human tissues. In this case, the feature set would not necessarily indicate change, but rather a lack thereof.
  • a rank is ascribed to each feature, at least temporarily. This may be simply a measure of relative response within the group of features in the feature set. As an example, the rank may be a measure of the relative difference in expression (up or down regulation) between the features of a control and a test experiment. In certain embodiments, the rank is independent of the absolute value of the feature response. Thus, for example, one feature set may have a feature ranked number two that has a 1.5 fold increase in response, while a different feature set has the same feature ranked number ten that has a 5 fold increase in response to a different stimulus.
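  • The point that a rank is relative to the other features in the same feature set, and not tied to the absolute response, can be illustrated as follows. Ranking by absolute fold change is an assumption for this sketch, and the feature names and values are invented.

```python
from typing import Dict


def rank_features(fold_changes: Dict[str, float]) -> Dict[str, int]:
    """Rank features within one feature set by magnitude of response (1 = strongest)."""
    ordered = sorted(fold_changes, key=lambda f: abs(fold_changes[f]), reverse=True)
    return {feature: position for position, feature in enumerate(ordered, start=1)}


# GENE_X has a 1.5-fold change in the first set and a 5-fold change in the second.
set_1 = {"GENE_A": 2.0, "GENE_X": 1.5, "GENE_B": -1.2}
set_2 = {"GENE_C": 9.0, "GENE_D": -7.5, "GENE_X": 5.0}

print(rank_features(set_1)["GENE_X"], rank_features(set_2)["GENE_X"])
```

  • Here the same feature ranks higher in the set where its fold change is smaller, because the other features in that set respond even less.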
  • Directional feature set - A directional feature set is a feature set that contains information about the direction of change in a feature relative to a control.
  • Bi-directional feature sets, for example, contain information about which features are up-regulated and which features are down-regulated relative to a control.
  • One example of a bi-directional feature set is a gene expression profile that contains information about up- and down-regulated genes in a particular disease state relative to the normal state, or in a treated sample relative to a non-treated sample.
  • the terms "up-regulated” and “down-regulated” and similar terms are not limited to gene or protein expression, but include any differential impact or response of a feature.
  • Non-directional feature sets contain features without indication of a direction of change of that feature. This includes gene expression, as well as different biological measurements in which some type of biological response is measured.
  • a non-directional feature set may contain genes that are changed in response to a stimulus, without an indication of the direction (up or down) of that change.
  • the non-directional feature set may contain only up-regulated features, only down-regulated features, or both up and down-regulated features, but without indication of the direction of the change, so that all features are considered based on the magnitude of change only.
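  • One way to picture a non-directional view of directional data is to discard the sign and keep only the magnitude. The sketch below assumes signed fold changes as input, which is an illustrative convention and not something the specification prescribes.

```python
from typing import Dict, List, Tuple


def to_non_directional(signed: Dict[str, float]) -> List[Tuple[str, float]]:
    """Drop the direction of change and order features by magnitude only."""
    magnitudes = [(feature, abs(value)) for feature, value in signed.items()]
    return sorted(magnitudes, key=lambda item: item[1], reverse=True)


print(to_non_directional({"MYC": 4.0, "TP53": -2.5, "EGFR": 1.8}))
```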
  • Feature group - This refers to a group of features (e.g., genes) related to one another.
  • the members of a feature group may all belong to the same protein pathway in a particular cell or they may share a common function or a common structural feature.
  • a feature group may also group compounds based on their mechanism of action or their structural/binding features.
  • Index set - The index set is a set in the knowledge base that contains feature identifiers and mapping identifiers and is used to map all features of imported feature sets to feature sets and feature groups already in the knowledge base.
  • the index set may contain several million feature identifiers pointing to several hundred thousand mapping identifiers.
  • Each mapping identifier (in some instances, also referred to as an address) represents a unique feature, e.g., a unique gene in the mouse genome.
  • the index set may contain diverse types of feature identifiers (e.g., genes, genetic regions, etc.), each having a pointer to a unique identifier or address.
  • the index set may be added to or changed as new knowledge is acquired.
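  • The relationship between feature identifiers and mapping identifiers can be sketched as a many-to-one lookup table. The identifiers below, including the "MAP:" prefix and the probe name, are hypothetical placeholders invented for this example.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical index set: many feature identifiers point to fewer mapping identifiers.
INDEX_SET: Dict[str, str] = {
    "TP53": "MAP:0001",
    "p53": "MAP:0001",
    "PLATFORM1_PROBE_7": "MAP:0001",
    "MYC": "MAP:0002",
}


def map_features(feature_ids: Iterable[str]) -> Tuple[Dict[str, str], List[str]]:
    """Split incoming feature identifiers into mapped and unmapped ones."""
    mapped: Dict[str, str] = {}
    unmapped: List[str] = []
    for fid in feature_ids:
        mapping_id = INDEX_SET.get(fid)
        if mapping_id is None:
            unmapped.append(fid)  # candidate for adding to the index set later
        else:
            mapped[fid] = mapping_id
    return mapped, unmapped


mapped, unmapped = map_features(["p53", "PLATFORM1_PROBE_7", "MYC", "UNKNOWN_1"])
print(mapped, unmapped)
```

  • Two feature sets that name the same gene differently end up with the same mapping identifier, which is what makes them comparable.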
  • Knowledge base - This refers to a collection of data used to analyze and respond to queries. In certain embodiments, it includes one or more feature sets, feature groups, and metadata for organizing the feature sets in a particular hierarchy or directory (e.g., a hierarchy of studies and projects).
  • a knowledge base may include information correlating feature sets to one another and to feature groups, a list of globally unique terms or identifiers for genes or other features, such as lists of features measured on different platforms (e.g., Affymetrix human HG_U133A chip), total number of features in different organisms, their corresponding transcripts, protein products and their relationships.
  • a knowledge base typically also contains a taxonomy that contains a list of all tags (keywords) for different tissues, disease states, compound types, phenotypes, cells, as well as their relationships.
  • taxonomy defines relationships between cancer and liver cancer, and also contains keywords associated with each of these groups (e.g., a keyword "neoplasm” has the same meaning as "cancer”).
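  • A taxonomy of this kind can be sketched with two small tables, one for synonyms and one for parent relationships. The table contents mirror the cancer example in the text; the function names and the matching rule (a tag matches a query term if it is the same concept or a descendant of it) are assumptions for illustration.

```python
from typing import Dict, List

SYNONYMS: Dict[str, str] = {"neoplasm": "cancer"}   # keyword -> canonical tag
PARENT: Dict[str, str] = {"liver cancer": "cancer"}  # child tag -> parent tag


def canonical(term: str) -> str:
    term = term.lower()
    return SYNONYMS.get(term, term)


def ancestors(term: str) -> List[str]:
    """Walk up the parent links from a tag to the top of the taxonomy."""
    chain: List[str] = []
    current = canonical(term)
    while current in PARENT:
        current = PARENT[current]
        chain.append(current)
    return chain


def matches(tag: str, query_term: str) -> bool:
    """True if `tag` is the queried concept or one of its descendants."""
    wanted = canonical(query_term)
    return canonical(tag) == wanted or wanted in ancestors(tag)


print(matches("liver cancer", "neoplasm"), matches("cancer", "liver cancer"))
```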
  • At least some of the data in the knowledge base is organized in a database.
  • Curation - Curation is the process of converting raw data to one or more feature sets (or feature groups). In some cases, it greatly reduces the amount of data contained in the raw data from an experiment. It removes the data for features that do not have significance.
  • the process of curation identifies such features and removes them from the raw data.
  • the curation process also identifies relevant clinical questions in the raw data that are used to define feature sets. Curation also provides the feature set in an appropriate standardized format for use in the knowledge base.
  • Data import - Data import is the process of bringing feature sets and feature groups into a knowledge base or other repository in the system, and is an important operation in building a knowledge base.
  • a user interface may facilitate data input by allowing the user to specify the experiment, its association with a particular study and/or project, and an experimental platform (e.g., an Affymetrix gene chip), and to identify key concepts with which to tag the data.
  • data import also includes automated operations of tagging data, as well as mapping the imported data to data already in the system. Subsequent "preprocessing" (after the import) correlates the imported data (e.g., imported feature sets and/or feature groups) to other feature sets and feature groups.
  • Preprocessing - Preprocessing involves manipulating the feature sets to identify and store statistical relationships between pairs of feature sets in a knowledge base. Preprocessing may also involve identifying and storing statistical relationships between feature sets and feature groups in the knowledge base. In certain embodiments, preprocessing involves correlating a newly imported feature set against other feature sets and against feature groups in the knowledge base. Typically, the statistical relationships are pre-computed and stored for all pairs of different feature sets and all combinations of feature sets and feature groups, although the invention is not limited to this level of complete correlation. In one embodiment, the statistical correlations are made by using rank-based enrichment statistics. For example, a rank-based iterative algorithm that employs an exact test is used in certain embodiments, although other types of relationships may be employed, such as the magnitude of overlap between feature sets.
  • a new feature set input into the knowledge base is correlated with every other (or at least many) feature sets already in the knowledge base.
  • the correlation compares the new feature set and the feature set under consideration on a feature-by-feature basis by comparing the rank or other information about matching genes.
  • a rank-based iterative algorithm is used in one embodiment to correlate the feature sets.
  • the result of correlating two feature sets is a "score.” Scores are stored in the knowledge base and used in responding to queries.
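  • A rank-based scan that applies an exact test might look like the sketch below. It is a loose illustration, not the algorithm of the specification: the hypergeometric tail test, the choice of checkpoints (every position where a match occurs) and the use of the smallest p-value as the score are all assumptions, and no correction for testing several checkpoints is applied.

```python
from math import comb
from typing import Dict, List, Set, Tuple


def hypergeom_tail(k: int, n_drawn: int, n_success: int, n_total: int) -> float:
    """P(X >= k) when n_drawn items are drawn from n_total, n_success of which are hits."""
    upper = min(n_drawn, n_success)
    favourable = sum(
        comb(n_success, i) * comb(n_total - n_success, n_drawn - i)
        for i in range(k, upper + 1)
    )
    return favourable / comb(n_total, n_drawn)


def rank_based_score(ranked_a: List[str], set_b: Set[str], n_total: int) -> float:
    """Scan ranked list A from the top and keep the smallest exact-test p-value."""
    best_p, hits = 1.0, 0
    for depth, feature in enumerate(ranked_a, start=1):
        if feature in set_b:
            hits += 1
            best_p = min(best_p, hypergeom_tail(hits, depth, len(set_b), n_total))
    return best_p


# Preprocessing sketch: score a new feature set against existing ones and store the results.
knowledge_base: Dict[str, Set[str]] = {"fs_old_1": {"g1", "g2"}, "fs_old_2": {"g7", "g8"}}
ranked_new = ["g1", "g2", "g3", "g4"]
stored_scores: Dict[Tuple[str, str], float] = {
    ("fs_new", name): rank_based_score(ranked_new, feats, n_total=10)
    for name, feats in knowledge_base.items()
}
print(stored_scores)
```

  • A smaller value indicates a stronger enrichment of one set near the top of the other; storing these values ahead of time is what lets later queries be answered without recomputation.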
  • Study/Project/Library - This is a hierarchy of data containers (like a directory) that may be employed in certain embodiments.
  • a study may include one or more feature sets obtained in a focused set of experiments (e.g., experiments related to a particular cardiovascular target).
  • a Project includes one or more Studies (e.g., the entire cardiovascular effort within a company).
  • the library is a collection of all projects in a knowledge base. The end user has flexibility in defining the boundaries between the various levels of the hierarchy.
  • Tag - A tag associates descriptive information about a feature set with the feature set. This allows for the feature set to be identified as a result when a query specifies or implicates a particular tag. Often clinical parameters are used as tags.
  • Tags include tumor stage, patient age, sample phenotypic characteristics and tissue types.
  • Tags may also be referred to as concepts.
  • Mapping - Mapping takes a feature (e.g., a gene) in a feature set and maps it to a globally unique mapping identifier in the knowledge base. For example, two sets of experimental data used to create two different feature sets may use different names for the same gene.
  • the knowledge base includes an encompassing list of globally unique mapping identifiers in an index set. Mapping uses the knowledge base's globally unique mapping identifier for the feature to establish a connection between the different feature names or IDs.
  • a feature may be mapped to a plurality of globally unique mapping identifiers.
  • a gene may also be mapped to a globally unique mapping identifier for a particular genetic region.
  • Mapping allows diverse types of information (i.e., different features, from different platforms, data types and organisms) to be associated with each other. There are many ways to map and some of these will be elaborated on below.
  • Another type of mapping involves indirect mapping of a gene in the feature set to the gene in the index set. For example, the gene in an experiment may overlap in coordinates with a regulatory sequence in the knowledge base.
  • That regulatory sequence in turn regulates a particular gene. Therefore, by indirect mapping, the experimental sequence is indirectly mapped to that gene in the knowledge base. Yet another form of indirect mapping involves determining the proximity of a gene in the index set to an experimental gene under consideration in the feature set. For example, the experimental feature coordinates may be within 100 base pairs of a knowledge base gene and thereby be mapped to that gene.
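  • Proximity-based indirect mapping can be sketched with simple coordinate arithmetic. The region type, the "MAP:" identifiers and the rule of picking the nearest gene within the threshold are assumptions for illustration; only the 100 base pair figure comes from the example in the text.

```python
from typing import Dict, NamedTuple, Optional


class Region(NamedTuple):
    chrom: str
    start: int
    end: int


def gap(a: Region, b: Region) -> Optional[int]:
    """Distance in base pairs between two regions; 0 if they overlap; None across chromosomes."""
    if a.chrom != b.chrom:
        return None
    return max(0, max(a.start, b.start) - min(a.end, b.end))


def map_by_proximity(feature: Region, genes: Dict[str, Region], max_gap: int = 100) -> Optional[str]:
    """Return the mapping identifier of the nearest gene within max_gap, if any."""
    best = None
    for mapping_id, region in genes.items():
        distance = gap(feature, region)
        if distance is not None and distance <= max_gap:
            if best is None or distance < best[1]:
                best = (mapping_id, distance)
    return best[0] if best else None


genes = {"MAP:0001": Region("chr1", 1000, 2000), "MAP:0002": Region("chr1", 5000, 6000)}
print(map_by_proximity(Region("chr1", 2050, 2100), genes))  # 50 bp from MAP:0001
print(map_by_proximity(Region("chr1", 3000, 3100), genes))  # too far from both genes
```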
  • Correlation - As an example, a new feature set input into the knowledge base is correlated with every other feature set (or at least many of them) already in the knowledge base. The correlation compares the new feature set and the feature set under consideration on a feature-by-feature basis, comparing the rank or other information about matching genes. A rank-based running algorithm is used in one embodiment to correlate the feature sets. The result of correlating two feature sets is a "score." Scores are stored in the knowledge base and used in responding to queries about genes, clinical parameters, drug treatments, etc.
  • Correlation is also employed to correlate new feature sets against all feature groups in the knowledge base. For example, a feature group representing "growth" genes may be correlated to a feature set representing a drug response, which in turn allows correlation between the drug effect and growth genes to be made.
  • aspects of the invention relate to methods of assimilating directional expression-based experimental data so that it may be efficiently navigated and analyzed.
  • aspects of the invention relate to correlating directional feature sets with other directional feature sets, non-directional feature sets and feature groups.
  • a directional feature set is a data set that includes one or more features and associated information about experimental impact on or response to those features, including the direction of the impact or response.
  • One framework for the methods described herein is described below and in U.S. Patent Application No. 11/651,539, published as U.S. Patent Publication 20070162411, in which data is imported into a knowledge base of diverse types of biological, chemical and/or medical information.
  • Correlation scoring is performed after the data has been imported and involves correlating the imported data with pre-existing data in the knowledge base. All new data imported into the system is pre-processed - correlations are typically pre-computed across the entire information space.
  • the methods described are not limited to this framework, but may be used to correlate any experimental data that includes directional information with other biological, chemical and medical information. Examples of directional feature sets include gene expression signatures of diseases and compounds. The methods described herein provide correlation scores indicating the degree of correlation between feature sets as well as a correlation direction.
  • a researcher searching for compounds that are negatively correlated to a particular disease may query numerous compounds against a disease of interest and be presented with a list of those that are negatively correlated with the disease.
  • To perform a similar analysis without directional feature set correlation scoring, one would have to query both the up-regulated gene expression of the disease and the down-regulated gene expression of the disease against the set of compound feature sets independently, obtaining a list of compounds that correlate to the up-regulated features and a separate list of compounds that correlate to the down-regulated features.
  • the user would then have to manually assimilate the information to find compounds that demonstrate signatures of up-regulation of genes that are down-regulated in cancer cells, and down-regulation of genes that are up-regulated in cancer cells.
  • the processes described herein provide a researcher with information that would otherwise be lost when using correlation scoring that does not distinguish between up- and down-regulated features.
  • The feature sets contain a ranked list of features. Ranking involves ordering features within each feature set based on their relative levels of response to the stimulus or treatment in the experiment(s), their magnitude of change between different phenotypes, etc. Ranking is typically based on one or more associated statistics in an imported feature set; for example, features may be ranked in order of decreasing fold-change or increasing p-value. This ranking system ensures that features from feature sets that use different statistics can still be compared based on their relative order or rankings across feature sets.
  • Ranking in a bi-directional feature set is typically based on the magnitude of change relative to control, i.e., up-regulated and down-regulated features are ranked together based on the magnitude of the change from control.
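  • As a small sketch of the bi-directional ranking just described, the function below ranks features by the magnitude of a signed fold change. The function name and the use of a plain dictionary are assumptions made for illustration; ties keep their input order.

```python
from typing import Dict


def rank_by_magnitude(fold_changes: Dict[str, float]) -> Dict[str, int]:
    """Rank features 1..n by absolute fold change, ignoring direction."""
    ordered = sorted(fold_changes, key=lambda f: abs(fold_changes[f]), reverse=True)
    return {feature: rank for rank, feature in enumerate(ordered, start=1)}
```

  • With this rule a strongly down-regulated feature outranks a weakly up-regulated one, which is what allows up- and down-regulated features to share one ranked list.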
  • mapping identifiers are associated with each feature in a feature set. Mapping is the process through which diverse features (e.g., from different platforms, data types and organisms) are associated with each other. For example, a gene may be associated with a SNP, a protein, or a sequence region of interest. In certain embodiments, e.g., if correlating feature sets having data from a common platform, mapping may not be necessary. Examples of methods of mapping are described further below.
  • The mapping methods described below and in U.S. Patent Publication 20070162411 enable data to be connected across assay types, organisms and platforms. Both ranking and mapping are determined prior to the correlation process. Once mapping is determined, correlation algorithms may be applied automatically and systematically to pre-compute correlation scores (e.g., p-values and/or derivative rank scores) between a given set of data and any other biological, clinical, or chemical entities within the knowledge base.
  • Figure 1A shows a representation of various elements in a knowledge base of scientific information. Examples of generation of or addition to some of these elements (e.g., feature sets and a feature set scoring table) are discussed further below and in U.S. Patent Publication No. 20070162411 and U.S. Provisional Patent Applications 61/033,673 and 61/089,834, incorporated by reference herein.
  • the knowledge base may also include other elements as discussed in the U.S. Patent Publication No. 20070162411, such as an index set, which is used to map features during a data import process.
  • element 104 indicates all the feature sets in the knowledge base. After data importation, the feature sets typically contain at least a feature set name and a feature table.
  • the feature table contains a list of features, each of which is usually identified by an imported identifier and/or a feature identifier. Each feature has a normalized rank in the feature set, as well as a mapping identifier. Mapping identifiers and ranks are determined during the import process as described further below, and then may be used to generate correlation scores between any two feature sets and between feature sets and feature groups.
  • the feature table also typically contains statistics associated with each feature, e.g., p-values and/or fold-changes. One or more of these statistics can be used to calculate the rank of each feature. In certain embodiments, the ranks may be normalized. In the case of a bi-directional feature set, a direction of correlation for some or all features is also indicated.
  • a feature set may also contain an associated study name and/or a list of tags. Feature sets may be generated from data taken from public or internal sources.
  • Element 106 indicates all the feature groups in the knowledge base.
  • Feature groups contain a feature group name, and a list of features (e.g., genes) related to one another.
  • a feature group typically represents a well-defined set of features generally from public resources - e.g., a canonical signaling pathway, a protein family, etc.
  • the feature groups do not typically have associated statistics or ranks.
  • the feature sets may also contain an associated study name and/or a list of tags.
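  • The feature set and feature group elements described above could be modelled roughly as below. This is a hedged sketch: the class and field names are hypothetical, and deriving direction from the sign of the fold change is one of the "suitable manners" of indicating direction, not the only one.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FeatureRow:
    """One row of a feature table (hypothetical layout)."""
    imported_id: str
    mapping_id: str
    rank: int
    fold_change: Optional[float] = None
    p_value: Optional[float] = None

    @property
    def direction(self) -> Optional[str]:
        """'up' or 'down' from the sign of the fold change, None if undirected."""
        if self.fold_change is None or self.fold_change == 0:
            return None
        return "up" if self.fold_change > 0 else "down"


@dataclass
class FeatureSet:
    """A named, ranked feature table with optional study name and tags."""
    name: str
    rows: List[FeatureRow]
    study: Optional[str] = None
    tags: List[str] = field(default_factory=list)


@dataclass
class FeatureGroup:
    """A named list of related features; no ranks or statistics."""
    name: str
    mapping_ids: List[str]
```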
  • Element 108 indicates a scoring table, which contains a measure of correlation between each feature set and each of the other feature sets and between each feature set and each feature group.
  • FS1-FS2 is a measure of correlation between feature set 1 and feature set 2
  • FS1-FG1 is a measure of correlation between feature set 1 and feature group 1, etc.
  • the measures are p-values or rank scores derived from p-values.
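  • One possible in-memory shape for such a scoring table is sketched below. The `ScoringTable` class is hypothetical, and treating a score as symmetric in its two entities (FS1-FS2 equals FS2-FS1) is an assumption of this sketch rather than something the text states.

```python
from typing import Dict, FrozenSet, Optional


class ScoringTable:
    """Stores one correlation measure per unordered pair of entity names."""

    def __init__(self) -> None:
        self._scores: Dict[FrozenSet[str], float] = {}

    def put(self, a: str, b: str, score: float) -> None:
        self._scores[frozenset((a, b))] = score

    def get(self, a: str, b: str) -> Optional[float]:
        return self._scores.get(frozenset((a, b)))
```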
  • Element 110 is a taxonomy or ontology that contains tags or scientific terms for different tissues, disease states, compound types, phenotypes, cells, and other standard biological, chemical or medical concepts as well as their relationships.
  • the tags are typically organized into a hierarchical structure as schematically shown in the figure. An example of such a structure is Diseases/Classes of Diseases/Specific Diseases in each Class.
  • the knowledge base may also contain a list of all feature sets and feature groups associated with each tag.
  • the tags, and the categories and sub-categories of the hierarchical structure in which they are arranged, may be referred to as concepts.
  • the knowledge base also contains a concept scoring table or tables, such as shown in element 112, which contains scores indicating the relevance of each concept or correlation of each concept with the other information in the database, such as features, feature sets and feature groups.
  • scores indicating the relevance of each concept in the taxonomy to each feature are shown at 114
  • scores indicating the relevance of each concept in the taxonomy to each feature set are shown at 116
  • scores indicating the relevance of each concept in the taxonomy to each feature group are shown at 118.
  • the organizational structure of the concept scoring is an example; other structures may also be used to store or present the scoring.
  • F1-C1 is a measure of relevance of Concept 1 to Feature 1
  • FS1-C1 is a measure of relevance of Concept 1 to feature set 1
  • FG1-C1 is a measure of relevance of Concept 1 to feature group 1, etc.
  • the concept scoring table includes information about the relevance or correlation of at least some concepts with each of all or a plurality of other concepts.
  • Concept scoring is described in particular detail in U.S. Provisional Patent Applications 61/033,673 and 61/089,834, referenced above.
  • the feature set to feature set and feature set to feature group correlation scoring described is used in concept scoring and to present a user with a list of the most relevant concepts in response to a query.
  • the feature set to feature set and feature set to feature group correlation scoring may also be used to present a user with lists of ranked features, feature sets or feature groups in response to a query.
  • An example of a feature table of a directional feature set is shown in Figure 1B.
  • the features are genes, with their common names indicated in column 115.
  • Imported feature identifiers are indicated in column 119 and mapping identifiers are indicated in column 117.
  • the associated statistics, in this case fold change, are indicated in column 111.
  • Direction of fold change is indicated by the positive or negative sign next to each fold change value.
  • Direction of change may be indicated in any suitable manner.
  • Ranks are indicated in column 121, in this case corresponding to the magnitude of fold change shown in column 111.
  • the feature rankings in Figure 1B are based on fold change, without regard to the direction of change.
  • preprocessing uses feature rankings to correlate feature sets.
  • the features in a feature set are ranked based on the p-value, fold change, or any other meaningful measurement or statistic contained in the feature table.
  • the rank is based on the absolute value of the feature statistics in a given feature set.
  • one feature set may have a feature ranked number two that has a 1.5 fold increase in response, while a different feature set has the same feature ranked number ten that has a 5 fold increase in response to a different stimulus.
  • The ranking described above is typically performed during data import.
  • FIG. 1 is a process flow sheet that shows an overview of correlation scoring between two bi-directional feature sets, A and B.
  • A has up-expressed features $A_{+}$ and down-expressed features $A_{-}$; B has up-expressed features $B_{+}$ and down-expressed features $B_{-}$.
  • the process begins at an operation 201 in which feature set A and feature set B are received.
  • feature set A is an input feature set, for example, provided by a user and feature set B is a pre-existing feature set in the knowledge base.
  • individual correlation scores are determined indicating the correlation between: $A_{+}$ and $B_{+}$ (the up-expressed features of A and the up-expressed features of B); $A_{-}$ and $B_{-}$ (the down-expressed features of A and the down-expressed features of B); $A_{+}$ and $B_{-}$ (the up-expressed features of A and the down-expressed features of B); and $A_{-}$ and $B_{+}$ (the down-expressed features of A and the up-expressed features of B).
  • correlation scoring is performed using feature ranks and rank-based iterative scoring algorithms. At this point there are four correlation scores: $A_{+}B_{+}$, $A_{+}B_{-}$, $A_{-}B_{-}$, and $A_{-}B_{+}$.
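  • The four pairwise comparisons can be organized as in the sketch below. The scoring function is passed in as a parameter; the `overlap` function used here is only a stand-in (an assumption for illustration) and is not the rank-based iterative algorithm of the text.

```python
from typing import Callable, Dict, Set


def directional_scores(a_up: Set[str], a_down: Set[str],
                       b_up: Set[str], b_down: Set[str],
                       score: Callable[[Set[str], Set[str]], float]) -> Dict[str, float]:
    """Return the four individual scores keyed by the halves being compared."""
    return {
        "A+B+": score(a_up, b_up),      # up in A vs. up in B
        "A-B-": score(a_down, b_down),  # down in A vs. down in B
        "A+B-": score(a_up, b_down),    # up in A vs. down in B
        "A-B+": score(a_down, b_up),    # down in A vs. up in B
    }


def overlap(x: Set[str], y: Set[str]) -> float:
    """Placeholder score: number of shared features (not the patented scoring)."""
    return float(len(x & y))
```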
  • Figure 4A shows a representation of the four scores.
  • $A_{+}B_{+}$ and $A_{-}B_{-}$ are measures of positive correlation between A and B: a large $A_{+}B_{+}$ indicates a high correlation between features up-expressed in A and up-expressed in B; a large $A_{-}B_{-}$ indicates a high correlation between features down-expressed in A and those down-expressed in B.
  • $A_{+}B_{-}$ and $A_{-}B_{+}$ are measures of negative correlation between A and B: a large $A_{+}B_{-}$ indicates a high correlation between features up-expressed in A and those down-expressed in B, and a large $A_{-}B_{+}$ indicates a high correlation between features down-expressed in A and features up-expressed in B.
  • an overall correlation score and a correlation direction are determined in an operation 205. Determining a correlation direction and an overall score may involve comparing the individual scores indicating a positive correlation ($A_{+}B_{+}$ and $A_{-}B_{-}$) with those that indicate a negative correlation ($A_{+}B_{-}$ and $A_{-}B_{+}$). In certain embodiments, determining an overall correlation score and correlation direction involves adding the scores together, with the scores indicating a negative correlation given a negative sign in the correlation score expression. The overall correlation score is the absolute value of the expression and the correlation direction is determined by its sign (positive or negative): $\lvert A_{+}B_{+} + A_{-}B_{-} - A_{+}B_{-} - A_{-}B_{+} \rvert$.
  • Alternatively, $\lvert A_{+}B_{+} + A_{+}B_{-} + A_{-}B_{-} + A_{-}B_{+} \rvert$ may be used, with correlation direction calculated as described above.
  • In other embodiments, the maximum of $\lvert A_{+}B_{+} + A_{-}B_{-} \rvert$ and $\lvert A_{+}B_{-} + A_{-}B_{+} \rvert$ is used.
  • the overall correlation score and correlation direction are stored in an operation 207 for use in responding to user queries.
  • the individual correlation scores are typically also stored, e.g., to be presented to a user who wishes to see detail on a particular feature set returned in result to a query.
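  • A minimal sketch of the signed-sum embodiment of operation 205 follows. The dictionary keys and the "none" label for a zero sum are assumptions of this example; the text only specifies that negative-correlation scores enter with a negative sign, that the overall score is the absolute value, and that the sign gives the direction.

```python
from typing import Dict, Tuple


def overall_score(scores: Dict[str, float]) -> Tuple[float, str]:
    """Combine four individual scores into (overall score, correlation direction)."""
    signed = (scores["A+B+"] + scores["A-B-"]    # evidence of positive correlation
              - scores["A+B-"] - scores["A-B+"])  # evidence of negative correlation
    if signed > 0:
        direction = "positive"
    elif signed < 0:
        direction = "negative"
    else:
        direction = "none"
    return abs(signed), direction
```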
  • Directional feature sets are also correlated with non-directional feature sets and feature groups.
  • Non-directional feature sets contain features that do not have a particular direction associated with the change.
  • An example of a non-directional feature set is one that contains a set of genes that have normal and uniform expression levels across a majority of human tissues.
  • Figure 3 is a process flow sheet showing operations in correlating a bidirectional feature set A with a non-directional feature set B. First, in an operation 301, bidirectional feature set A containing up and down-expressed features and associated statistics, and feature set B are received. Feature set B is a non-directional feature set.
  • FIG. 3 shows a representation of the two individual scores, $A_{+}B$ and $A_{-}B$, that are determined.
  • an overall or final score is determined in an operation 305. Typically this operation involves selecting the score indicating a higher correlation.
  • the individual scores and the overall score are then stored for use in responding to user queries in an operation 307.
  • the flow sheet in Figure 3 also applies to correlating bi-directional feature sets with uni-directional feature sets, i.e., feature sets that have only up-expressed or only down-expressed features.
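  • The Figure 3 flow can be sketched as two individual scores followed by a selection. This sketch assumes the score is oriented so that a larger value means a higher correlation (as with a rank score); with raw p-values the selection would take the minimum instead. The function names are hypothetical.

```python
from typing import Callable, Dict, Set


def score_vs_nondirectional(a_up: Set[str], a_down: Set[str], b: Set[str],
                            score: Callable[[Set[str], Set[str]], float]) -> Dict[str, float]:
    """Score each direction of A against non-directional B and keep the stronger one."""
    up_score = score(a_up, b)
    down_score = score(a_down, b)
    return {"A+B": up_score, "A-B": down_score, "overall": max(up_score, down_score)}
```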
  • Bi-directional feature sets are also correlated with feature groups.
  • feature groups have collections of features having structural and/or functional characteristics in common.
  • Correlation of a bi-directional feature set with a feature group is typically performed by determining individual correlation scores, one indicating the correlation between the up-regulated features of the bi-directional feature set with all features in the feature group and one indicating the correlation between the down-regulated features of the bi-directional feature set and the features in the feature group. The highest correlation score is then presented as a default score.
  • A more detailed view, e.g., in a matrix, can however provide information about correlations between a feature group and both directions of a feature set. An overview of the process is shown in Figure 5.
  • Individual correlation scores indicating the correlation of the up-regulated features $A_{+}$ with the feature group G and the down-regulated features $A_{-}$ with the feature group G are determined in an operation 503. This may be performed by a rank-based algorithm as described below.
  • the individual correlation scores and the overall correlation scores are then stored in an operation 507 for use in responding to user queries.
  • Non-directional feature sets are also correlated with feature groups.
  • a user wishes to query a cancer signature feature set against multiple drug compound feature sets to identify possible compounds effective against the disease.
  • a feature set to be queried shows up- and down-regulated gene expression in human breast cancer cells, and feature sets to be queried against show up- and down-regulated gene expression in response to Compound A; up- and down-regulated gene expression in response to Compound B; up- and down- regulated gene expression in response to Compound C, etc.
  • a ranked list of compound feature sets can be returned in response to the query, ranked according to correlation score.
  • the correlation direction (positive or negative) of each ranked feature set with the queried feature set is also shown.
  • a negative correlation is desired to find a compound that has the opposite effect of the cancer.
  • the user can expand a compound feature set having a negative correlation to see the details of the individual correlations, e.g., the correlation between the genes up-regulated in cancer cells and genes down-regulated in response to the compound, etc.
  • the user would then have to manually assimilate the information to find compounds whose signatures both up-regulate genes that are down-regulated in cancer cells and down-regulate genes that are up-regulated in cancer cells.
  • the processes described herein capture information that would be lost using correlation scoring that does not distinguish between up and down-regulated features.
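  • The compound query in this example amounts to filtering stored results by correlation direction and sorting by overall score. The sketch below assumes results are already available as (score, direction) pairs per compound feature set; the function name and the data layout are hypothetical.

```python
from typing import Dict, List, Tuple


def rank_negative_hits(results: Dict[str, Tuple[float, str]]) -> List[Tuple[str, float]]:
    """Return negatively correlated feature sets, strongest overall score first."""
    hits = [(name, score) for name, (score, direction) in results.items()
            if direction == "negative"]
    return sorted(hits, key=lambda item: item[1], reverse=True)
```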
  • rank-based enrichment algorithms which take into account feature rankings.
  • Fisher's exact test may be used to measure the significance of association of two variables. (See Fisher, R.A. (1922). "On the interpretation of χ² from contingency tables, and the calculation of P". Journal of the Royal Statistical Society 85(1):87-94, which is hereby incorporated by reference for all purposes).
  • Fisher's exact test is used to measure the significance of the overlap of features in a given feature set B with features in a given feature group C. It should be noted that for directional feature sets, B may be either $B_{+}$ or $B_{-}$.
  • FIG. 8 is a feature set versus feature group set diagram.
  • P represents all features in the experimental platform (e.g., all genes that a microarray test measures expression of or all features in the raw data);
  • B represents the features in the feature set;
  • C represents the features in the feature group.
  • the table below the set diagram shows the sets indicated on the diagram. In applying Fisher's exact test in any situation, it is necessary to define four parameters or elements of the contingency table that will give meaningful results.
  • $B \cap C$ is the intersect of feature set B and feature group C, and is shown as the striped subset in the diagram. This represents features in B that are mapped to features in C.
  • $P \cap C - B \cap C$ represents the features in P that are mapped to C, but are not in B, and is indicated on the diagram;
  • $B - B \cap C$ represents the features in B that are not mapped to features in C and is indicated on the diagram;
  • $P - B - P \cap C + B \cap C$ represents the features in P that are neither in B nor mapped to features in C. This subset is also indicated on the diagram.
  • Applying Fisher's exact test, a p-value is obtained.
  • the implementation of Fisher's exact test is based on Agresti A, (1992), A Survey of Exact Inference for Contingency Tables, Statistical Science, 7, 131-153, hereby incorporated by reference.
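  • The four contingency-table elements and the resulting p-value can be sketched with the standard library alone. Two assumptions are made here and are not taken from the text: B is treated as a subset of P, and the test is computed one-sided (probability of an overlap at least as large as observed, from the hypergeometric distribution). The function names are hypothetical.

```python
from math import comb
from typing import Set, Tuple


def contingency(p: Set[str], b: Set[str], c: Set[str]) -> Tuple[int, int, int, int]:
    """Return (|B∩C|, |P∩C|-|B∩C|, |B|-|B∩C|, |P|-|B|-|P∩C|+|B∩C|), assuming B ⊆ P."""
    both = len(b & c)
    in_c = len(p & c)
    return both, in_c - both, len(b) - both, len(p) - len(b) - in_c + both


def fisher_enrichment_p(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact p-value for observing an overlap of at least a."""
    total, in_c, in_b = a + b + c + d, a + b, a + c
    favourable = sum(comb(in_c, k) * comb(total - in_c, in_b - k)
                     for k in range(a, min(in_b, in_c) + 1))
    return favourable / comb(total, in_b)
```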
  • Figure 9A is a process flow diagram showing key operations in generating a correlation score indicating the correlation between a feature set B and feature group C. This 'running' algorithm can be described as iterations of the Fisher's exact test at dynamic checkpoints.
  • the process begins with receiving feature set B and feature group C (902).
  • the features in the feature set have been ordered by rank, as discussed above.
  • An example of a feature set feature table with rankings is shown in Figure 9B.
  • the rank is shown in column 952.
  • the file also contains mapping identifiers in column 956.
  • the feature table also contains an imported ID column (954) displaying the feature identifiers as received during data import; a symbol column (958) displaying symbols associated with the features; a p-value column (960) displaying p-values as measured in the experiment; and a fold-change column (962) displaying fold changes as measured in the experiment.
  • the rankings shown in column 952 are based on fold-change; however they may also be based on p-value, other appropriate statistics or a combination thereof.
  • the feature group C also contains a list of feature identifiers and mapping identifiers; however, it typically does not contain rankings or other statistics. Common mapping identifiers allow determination of the members of the $B \cap C$ subset in an operation 904 shown in Figure 9A.
  • the highlighted rows in Figure 9B indicate the features that are members of the $B \cap C$ subset.
  • feature X is determined (906).
  • Feature X is the next feature in $B \cap C$ in rank order.
  • Feature X is the feature ranked 11, which is indicated at Stop 1 in column 964. This is the first checkpoint.
  • a sub-feature set $B_x$ is determined (908). (Decision diamond 920 indicates an optimization step that is discussed further below).
  • Sub-feature set $B_x$ is the set of all features having a rank equal to or higher than X.
  • sub-feature set $B_x$ contains the features ranked 1-11. Fisher's exact test is then performed for sub-feature set $B_x$ and feature group C in an operation 910, using the parameters described above (i.e., $B_x \cap C$, $P \cap C - B_x \cap C$, etc.). The resulting p-value, $p_x$, is then compared to a global p-value, and if it is less than the global p-value, it is saved as the (new) global p-value. For the first iteration, where there is no pre-existing global p-value, $p_x$ is saved as the global p-value with which to be compared in the successive iteration.
  • $B_x \cap C$ has one member, with every successive iteration adding a member.
  • Decision 914 determines if there are any remaining features in $B \cap C$. If there are, the process returns to operation 906, in which feature X is identified. For example, for the second process iteration of the feature set shown in Figure 9B, feature X is the feature ranked 13, and sub-feature set $B_x$ contains features ranked 1-13. Essentially, the process looks at all possible p-values for all sub-feature sets $B_x$ and selects the lowest p-value. It should be noted that performing Fisher's exact test only at the "stop" points indicated returns the same result as if it were performed at each ranked feature.
  • a multiple-hypothesis testing correction is applied to the global p-value to obtain a final p-value for feature set B and feature group C (916).
  • the p-value is multiplied by the size of the feature set. This correction accounts for the fact that larger feature sets return lower p-values, as there are more opportunities for lower p-values to be received with larger feature sets. Multiple-hypothesis testing corrections are known in the art.
  • This final p-value is then stored, e.g., in a Scoring Table.
  • a 'rank score' is stored in the Scoring Table in addition to or instead of the final p-value.
  • the rank score is a derivative of the final p-value and is the negative logarithm of the p-value.
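  • The 'running' algorithm of Figure 9A can be sketched as below, without the optimization at decision 920. Several points are assumptions of this sketch: a one-sided hypergeometric form of Fisher's exact test, every ranked feature belonging to the platform, capping the corrected p-value at 1, and a base-10 logarithm for the rank score (the text does not give the base). The function names are hypothetical.

```python
from math import comb, log10
from typing import List, Set, Tuple


def hypergeom_tail(k: int, n_total: int, n_marked: int, n_draws: int) -> float:
    """P(overlap >= k) when n_draws items are drawn from n_total, n_marked of them marked."""
    favourable = sum(comb(n_marked, x) * comb(n_total - n_marked, n_draws - x)
                     for x in range(k, min(n_marked, n_draws) + 1))
    return favourable / comb(n_total, n_draws)


def running_fisher(ranked: List[str], group: Set[str],
                   platform: Set[str]) -> Tuple[float, float]:
    """Return (final p-value, rank score) for a ranked feature set vs. a feature group."""
    n_total = len(platform)
    n_group = len(platform & group)
    best, hits = 1.0, 0
    for position, feature in enumerate(ranked, start=1):
        if feature not in group:
            continue  # only features in the intersect are stops
        hits += 1     # the sub-feature set is the top `position` features
        best = min(best, hypergeom_tail(hits, n_total, n_group, position))
    final_p = min(1.0, best * len(ranked))  # multiply by feature set size
    return final_p, -log10(final_p)
```

  • Because a non-stop position adds a feature without adding a hit, its p-value cannot be lower than that of the preceding stop, which is why only stops need to be evaluated.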
  • $F1 \cap F2$ is the intersect of feature set F1 and feature set F2, and is indicated in the diagram. This represents features in F1 that are mapped to features in F2;
  • $F1 \cap P2 - F1 \cap F2$ represents the features in the intersect of P1 and P2 that are in F1, but are not in F2;
  • $F2 \cap P1 - F1 \cap F2$ represents the features in the intersect of P1 and P2 that are in F2, but are not in F1;
  • $P1 \cap P2 - F1 \cap P2 - F2 \cap P1 + F1 \cap F2$ represents the features in the intersect of P1 and P2 that are neither in F1 nor in F2.
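  • The four elements for the feature set to feature set case can be computed directly from the sets, as in this sketch. It assumes F1 is contained in platform P1 and F2 in platform P2, and that all sets hold comparable mapping identifiers; the function name is hypothetical.

```python
from typing import Set, Tuple


def contingency_two_sets(f1: Set[str], f2: Set[str],
                         p1: Set[str], p2: Set[str]) -> Tuple[int, int, int, int]:
    """Return the four contingency-table counts over the shared platform P1∩P2."""
    both = len(f1 & f2)
    f1_on_p2 = len(f1 & p2)
    f2_on_p1 = len(f2 & p1)
    only_f1 = f1_on_p2 - both
    only_f2 = f2_on_p1 - both
    neither = len(p1 & p2) - f1_on_p2 - f2_on_p1 + both
    return both, only_f1, only_f2, neither
```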
  • Figure 11A is a process flow diagram showing key operations in generating a correlation score indicating the correlation between feature sets F1 and F2.
  • feature sets F1 and F2 are received, each with a ranked list of mapped features, e.g., as shown for one feature set in Figure 9B.
  • the intersect $F1 \cap F2$ is determined using the mapping identifiers.
  • From $F1 \cap F2$, ranked lists of features F1(i) and F2(j) are generated.
  • the variables "i" and "j" are used to designate stops or checkpoints of F1 and F2, respectively.
  • stops are used to define sub-feature sets for which to generate p-values, the lowest of which is the final p-value for the feature set to feature set comparison.
  • Fisher's exact test would be performed for all combinations of sub-F1(i) sets and all possible sub-F2(j) sets. However, this is not necessary, as reflected in the algorithm below.
  • Figure 11B shows an example of ranked lists of F1 and F2 (for ease of description, only the rankings are shown; however as in Figure 9B, the tables may also include feature identifiers, mapping addresses, statistics, etc.).
  • the highlighted ranks in each feature set indicate the members of $F1 \cap F2$.
  • F1(1) is the F1 feature ranked 1st
  • F1(2) is the F1 feature ranked 5th
  • F2(1) is the F2 feature ranked 2nd
  • F2(2) is the F2 feature ranked 7th, etc.
  • F1(i) and F2(j) are then 'aligned,' i.e., each feature F1(i) is connected to or associated with its corresponding feature F2(j) (1106).
  • This is graphically depicted in Figure 11C, in which F1(1) is aligned with F2(3); F1(2) is aligned with F2(1); F1(3) is aligned with F2(4), etc.
  • Align(F1(i)) is used in the flow sheet and in the following description to refer to the feature in F2(j) that F1(i) is aligned to; for example, Align(F1(3)) refers to F2(4). Similarly, Align(F2(3)) refers to F1(1).
  • a counter i is set to zero (1108). Operation 1108 also indicates that a sorted vector used later in the algorithm to determine sub-F2 sets is empty at this point.
  • Counter i is compared to imax, where imax is the number of features in F1(i) (1112). If it is less than or equal to imax, the process proceeds to an operation 1114, in which a sub-feature set sub-F1(i) is defined. (Operation 1130 is an optimization step that is discussed further below).
  • Sub-F1(i) contains F1(i) and all higher ranked features in F1.
  • sub-F1(1) contains only F1(1) as it is the highest ranked feature.
  • the rank of F1(2) is 5, so sub-F1(2) would contain the features in F1 that are ranked 1-5.
  • the rank of Align(F1(i)) is then inserted into the vector (1116). For F1(1), the vector would be [13]; for F1(2), the vector would be [2,13], etc.
  • the process then defines a sub-feature set sub-F2(j) in an operation 1118.
  • the rank of F2(3) is 13, so sub-F2(j) contains the features in F2 ranked 1-13.
  • Fisher's exact test is then performed for sub-F1(i) and sub-F2(j) using the parameters described above with respect to Figure 10 to generate a p-value $p_{i,j}$ (1120).
  • the p-value $p_{i,j}$ is then compared to the global p-value and saved as the global p-value if it is lower (1122). Determining if the current sub-F1(i) should be compared to other sub-F2 sets involves checking if the sorted vector contains any rank values that are higher (i.e., lower in rank) than the rank of the current F2(j) (1124).
  • j is set to the stop corresponding to the next rank value in the vector and a new sub-F2(j) containing F2(j) and all higher-ranked features in F2 is defined (1126).
  • the rank of F2(1) is 2, so the vector contains [2,13].
  • a new sub-F2(j) is created using the F2 stop corresponding to rank 13 as the new j; in this case sub-F2(3) is created, containing the F2 features ranked 1-13.
  • the process then returns to operation 1120, in which Fisher's exact test is performed for F1(i) and the new F2(j).
  • In operation 1124, if there are no rank values greater than the rank of the current F2(j), the process returns to operation 1110 to calculate p-values for the next F1 stop.
  • a multiple hypothesis testing correction is applied (1128). This correction is based on the total number of possible hypothesis tests, i.e., all possible combinations of F1 and F2 sub-feature sets.
  • One optimization is shown in Figure 11A at operation 1130, in which stop i may be skipped if the next stop is contiguous and links to a higher rank.
  • the rank of F1(i+1) is compared to the rank of F1(i) plus 1. If these are equal, F1(i) and F1(i+1) are contiguous. If they are contiguous and rank Align(F1(i+1)) < rank Align(F1(i)), then the stop may be skipped.
  • a second optimization may be performed on the inner loop, wherein the calculation at a "j" may be skipped if the next j-value is contiguous, i.e., if j+1 is an element of the vector. This is essentially the same optimization as described above for the feature set to feature group correlation.
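  • The double loop of Figure 11A, without either optimization, can be sketched as follows. The sketch assumes both feature sets were measured on one shared platform of known size, uses a one-sided hypergeometric form of Fisher's exact test, and reads the multiple-hypothesis correction as multiplication by the number of F1 and F2 sub-feature set combinations; all three are assumptions, and the function names are hypothetical.

```python
from bisect import insort
from math import comb
from typing import Dict, List


def hypergeom_tail(k: int, n_total: int, n_marked: int, n_draws: int) -> float:
    """P(overlap >= k) when n_draws items are drawn from n_total, n_marked of them marked."""
    favourable = sum(comb(n_marked, x) * comb(n_total - n_marked, n_draws - x)
                     for x in range(k, min(n_marked, n_draws) + 1))
    return favourable / comb(n_total, n_draws)


def running_fisher_two_sets(ranks1: Dict[str, int], ranks2: Dict[str, int],
                            platform_size: int) -> float:
    """Lowest p-value over the visited (sub-F1(i), sub-F2(j)) pairs, then corrected."""
    stops = sorted(set(ranks1) & set(ranks2), key=lambda f: ranks1[f])
    vector: List[int] = []  # sorted F2 ranks of the aligned features seen so far
    best = 1.0
    for feature in stops:
        size1 = ranks1[feature]    # sub-F1(i): F1 features ranked 1..size1
        aligned = ranks2[feature]  # rank of Align(F1(i)) in F2
        insort(vector, aligned)
        for position, size2 in enumerate(vector):
            if size2 < aligned:
                continue           # start at the aligned stop, then move down in rank
            shared = position + 1  # intersect members inside both sub-feature sets
            best = min(best, hypergeom_tail(shared, platform_size, size1, size2))
    return min(1.0, best * len(ranks1) * len(ranks2))
```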
  • two feature sets are correlated according to the iterative rank-based algorithm described with reference to the process flow sheet shown in Figure 6.
  • feature sets F1 and F2 are received in an operation 601, each with a ranked list of mapped features.
  • F1 and/or F2 is just the up-expressed or down-expressed features of the bi-directional feature set.
  • Figure 1C shows an example of a feature table with a ranked list of up-expressed features from the feature table shown in Figure 1B.
  • Direction is indicated in column 113.
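  • Extracting the up-expressed or down-expressed half of a bi-directional feature set, while keeping the magnitude-based order, might look like the sketch below. It assumes direction is carried by the sign of the fold change; the function name is hypothetical.

```python
from typing import Dict, List, Tuple


def split_by_direction(fold_changes: Dict[str, float]) -> Tuple[List[str], List[str]]:
    """Return (up, down) feature lists, each ordered by decreasing magnitude of change."""
    ranked = sorted(fold_changes.items(), key=lambda item: abs(item[1]), reverse=True)
    up = [feature for feature, change in ranked if change > 0]
    down = [feature for feature, change in ranked if change < 0]
    return up, down
```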
  • This subset is referred to in Figure 6 as $F1_{\mathrm{top}\%}$.
  • If the platform is a gene chip with 14,000 genes, x may be 2% of 14,000, or 280 features. The percent typically ranges from 1-10%. Because the percentage is based on platform size, for small feature sets, $F1_{\mathrm{top}\%}$ may be the same size as F1.
  • $F1_{\mathrm{top}\%} \cap F2$, i.e., the set of features in $F1_{\mathrm{top}\%}$ that are mapped to features in F2, is determined.
  • a schematic in Figure 7 shows a simple example in which $F1_{\mathrm{top}\%}$ contains the ten top ranked features in F1 (indicated in bold).
  • Feature X is then determined (607). Feature X is the next feature in $F1_{\mathrm{top}\%} \cap F2$ in rank order of F2. Thus, for the first iteration of the process as applied to the feature sets shown in Figure 7, Feature X is the feature ranked 2, which is indicated at Stop 1. This is the first checkpoint.
  • a sub-feature set $F2_X$ is determined (609). Sub-feature set $F2_X$ is the set of all features having a rank equal to or higher than X.
  • sub-feature set $F2_X$ contains the features ranked 1 and 2 of F2.
  • Fisher's exact test is then performed for sub-feature sets $F1_{\mathrm{top}\%}$ and $F2_X$ in an operation 611, using the parameters described in Figure 10 and associated text of the Appendix.
  • the resulting p-value, $p_x$, is then compared to a global p-value, and if it is less than the global p-value, it is saved as the (new) global p-value.
  • $p_x$ is saved as the global p-value with which to be compared in the successive iteration.
  • Decision 615 determines if there are any remaining features in $F1_{\mathrm{top}\%} \cap F2$.
  • the process returns to operation 607, in which feature X is identified.
  • feature X is the feature ranked 13
  • sub-feature set $F2_X$ contains features ranked 1-13.
  • the process looks at all possible p-values for all sub-feature sets F2 X and selects the lowest p-value. It should be noted that performing Fisher's exact test only at the "stop" points indicated returns the same result as if it were performed at each ranked feature. This is because the p-values from Fisher's exact test performed at all non-stop points will be higher than the global p-value.
  • a multiple-hypothesis testing correction may be applied to the global p-value to obtain a final p-value for feature set F1 top% and feature group F2 (617).
  • Multiple-hypothesis testing corrections are known in the art and account for the fact that larger feature sets return lower p-values, as there are more opportunities for lower p-values to be received with larger feature sets. Operations 603-617 are then performed again, this time for F2 top% and iterating over all of F1. See 619. The result is a p-value or correlation score for F2 top% and F1.
  • the two p-values or correlation scores obtained in 617 and 619 are then averaged, by taking the geometric mean of the p-values or the arithmetic mean of the scores, to obtain a final correlation score for F1 and F2 (621).
  • This final p-value is then stored, e.g., in a Scoring Table.
  • a 'rank score' is stored in the Scoring Table in addition to or instead of the final p-value.
  • the rank score is a derivative of the final p-value and is the negative logarithm of the p-value.
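  • The averaging of the two directional p-values and the derivation of the rank score can be illustrated with a short sketch. The function names are hypothetical, and the base of the logarithm is not stated in the text; base 10 is assumed here purely for illustration.

```python
import math


def final_p_value(p_f1_vs_f2, p_f2_vs_f1):
    """Geometric mean of the two directional p-values (operation 621).

    Computed in log space so that very small p-values do not underflow.
    """
    return math.exp((math.log(p_f1_vs_f2) + math.log(p_f2_vs_f1)) / 2)


def rank_score(p_value):
    """Negative logarithm of the final p-value (base 10 is an assumption)."""
    return -math.log10(p_value)
```

  • Because the rank score is a negative logarithm, a smaller final p-value yields a larger rank score, so sorting by descending rank score lists the strongest correlations first.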
  • FIG. 12 shows an overview of the process of producing a knowledge base; Figures 13-15 describe aspects of the process in greater detail.
  • the knowledge base contains feature sets and feature groups from a number of sources, including data from external sources, such as public databases, including the National Center for Biotechnology Information (NCBI).
  • Figure 12 shows an overview of the process of producing a knowledge base. The process begins with receiving raw data from a particular experiment or study (1202).
  • the raw data may be obtained from a public database, private sources, an individual experiment run in a lab, etc.
  • the raw data typically contains information for control and test samples.
  • the raw data includes expression profiles for normal (control) and tumor (test) lung cells.
  • the raw data from the study or experiment may contain additional information, e.g., the gene expression profiles may also be associated with a particular disease state, or with patients having different clinical parameters (age, gender, smoker/non-smoker, etc.).
  • a feature set is a set of features identified as being significant in a given experimental setting and associated statistical information.
  • the features of one feature set from the lung cancer gene expression study would be the set of genes that are differentially expressed between tumor and normal cells.
  • Associated statistical information might indicate the fold change or a p- value associated with each feature, representing the change of the feature between the experimental and control conditions.
  • Feature sets are generated from a particular study or experiment and are imported into the knowledge base (1206). As described below, importing the data typically involves tagging the feature set with appropriate biomedical or chemical terms, as well as automatically mapping each feature in a feature set, i.e., establishing connections between each imported feature and other appropriate features in the knowledge base as appropriate.
  • the next major operation in producing a knowledge base is correlation scoring of each imported feature set with all other feature sets and feature groups in the knowledge base (1208).
  • the correlation e.g., in the form of a p-value, of a feature set with all other feature sets and all feature groups is known and stored.
  • the user is able, by submitting queries and navigating, to efficiently explore and connect biological information contained in the knowledge base.
  • the process illustrated in Figure 1 may be performed anytime a user wishes to add experimental data to the knowledge base.
  • A. Curating is the process of generating feature sets from raw data.
  • Figure 13A presents a graphical representation of raw data and the resulting feature sets.
  • Raw data includes the data for control and test samples; in the example depicted in Figure 13A, data 1352 includes measurements (e.g., intensity measurements from a microarray) for features A-F.
  • the data typically includes replicate data; here the control sample replicates are indicated as data 1352' and 1352".
  • Test sample data is shown in the figure as Sample (test 1) data (1354 and replicates 1354' and 1354"), test 2 data, ... test N (1356 and replicates 1356' and 1356"), each with identified features and associated statistics.
  • each test sample represents a different concentration of a potential therapeutic compound.
  • each test sample represents a tissue sample taken from a patient with a different clinical indication (e.g., lung tissue samples from non-smokers, from smokers of various levels, from drivers of diesel vehicles, from patients before treatment administration and after, etc.).
  • the samples from which raw data is generated typically contain many different types of information, especially when it comes to clinical samples.
  • raw data from an experiment measuring differential gene expression between tumors of different patients - e.g., in tumor stage 1 and tumor stage 2 cells - may also contain information on other attributes of those patients in this example, beyond whether they are tumor stage 1 or tumor stage 2, e.g., whether they are smokers, their age, their prior treatment, year of diagnosis, etc.
  • the curation process generates one or more feature sets, which are shown in the example depicted in the figure as feature set 1 (1358) to feature set M (1360).
  • feature sets contain statistics derived from measurements in the raw data. In the figure, these are labeled stat 1 and stat 2, e.g., a p-value and a fold change.
  • P-values generally refer to the probability of obtaining a result at least as extreme as that obtained and are one type of data that may be in the raw data.
  • Fold change typically refers to the magnitude of change (2-fold, 3-fold, etc.) of some measurement in control and test samples.
  • Each feature set relates to a different biological, clinical, or chemical question (e.g., up-regulation in response to compound treatment; up-regulation in a particular tissue, etc.).
  • each feature set may have a different collection of features as only features identified in curation as statistically relevant to a given question are included in a particular feature set.
  • feature set 1 in Figure 13A contains features A - E and feature set M contains features D, E and F.
  • each feature set may contain different associated statistical measures as appropriate for the set.
  • the depiction of raw data in Figure 13A is merely an example of how raw data may be presented.
  • Figure 13B is a process flowsheet that depicts an overview of a curation process.
  • the process begins with data quality control (1302).
  • Data quality control is an operation that includes normalizing the data, removing outlying data and identifying all valid clinical questions (i.e., identifying all possible feature sets).
  • Figure 13C presents a process flowsheet showing operations in one embodiment of a data quality control process.
  • the process begins with normalizing the data (1308). Normalization strategies for various types of data are well-known in the art. Any appropriate normalization strategy may be used.
  • Outliers are then identified and removed (1310). This is typically performed on a per sample basis (i.e., outlying samples are removed). Standardized processes for identifying outliers are also well-known.
  • genes for proteins in one pathway comprise features for a first Feature set and genes for proteins in a different pathway comprise features for a different Feature set.
  • the clinical questions defining the feature sets pertain to the impact of a particular stimulus or treatment on features measured, ultimately identifying genes in two distinct cellular pathways.
  • clinical questions are questions that the experiment was designed to answer or measure. Such a question may be designated as a valid clinical question, and the related feature set would contain the features for which there is a statistically significant difference between control and test samples. (The features in each feature set are typically determined in the subsequent statistical analysis operation described below.)
  • Valid clinical questions may also be questions that the experiment was not necessarily designed for, but that the raw data gathered supports.
  • an experiment may be designed to compare tumor stage 1 and tumor stage 2 samples, with the data published with associated clinical annotations that show Her2-positive patients and Her2-negative patients.
  • One feature set may be up-regulation of genes in tumor stage 2 versus tumor stage 1 samples (i.e., the feature set contains genes that are up-regulated in tumor stage 2 samples) and another feature set may be up-regulation of genes in Her2-positive versus Her2-negative patients (i.e., the feature set contains genes that are up-regulated in Her2-positive patients).
  • a clinical question may be deemed "valid" if there is enough statistically significant data to support the clinical comparison.
  • the raw data contains features that can be associated with a large number of different clinical parameters or attributes of the patient, e.g., smoker/non-smoker, drugs taken, age, tumor stage, etc.
  • Identifying valid clinical questions involves determining whether any differences in features between two groups of data (e.g., smoker versus non-smoker) are statistically significant. Identifying valid clinical questions may be performed by any appropriate methodology, including brute force methods and more sophisticated methods. For example, a multi-ANOVA type analysis may be performed on the entire raw data set containing different clinical parameters to find which parameters have a statistically significant effect on differential gene expression (or other change in measured features).
  • clustering may be applied to data to, e.g., compare samples of clusters of data groups to see if statistically significant comparisons of groups that may be used to generate feature sets are present.
  • Figure 13D shows an overview of one statistical analysis process.
  • the process starts with signal filtering (1314), an operation in which features whose corresponding signals are below a threshold intensity (or other measurement) are filtered out. For example, fluorescent signals from a microarray are analyzed on a gene-by-gene basis with signals below a threshold filtered out. In this manner, a reduced set of genes is generated.
  • One or more statistical tests are then performed on a feature-by-feature basis to determine for what features the differential measurement between control and test is significant enough to include the feature in the feature set.
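  • Signal filtering followed by feature-by-feature selection might look like the sketch below. It is illustrative only: the function name, the thresholds and the data layout are hypothetical, intensities are assumed to be positive, and a simple fold-change cutoff stands in for the statistical test, which the text does not specify.

```python
from statistics import mean


def select_features(control, test, min_signal=100.0, min_fold=2.0):
    """Return {feature: signed fold change} for features passing both steps.

    control, test -- dicts mapping feature -> list of replicate intensities
    """
    selected = {}
    for feature, control_values in control.items():
        control_mean = mean(control_values)
        test_mean = mean(test[feature])
        # Signal filtering (1314): drop features with weak signal in both groups.
        if max(control_mean, test_mean) < min_signal:
            continue
        ratio = test_mean / control_mean
        # Signed fold change: positive for up-regulation, negative for down.
        fold = ratio if ratio >= 1 else -1 / ratio
        # Stand-in for the statistical test: keep only large changes.
        if abs(fold) >= min_fold:
            selected[feature] = fold
    return selected
```

  • The reduced set returned here corresponds to the features that would populate one feature set; a real curation step would also record a p-value or similar statistic for each feature.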
  • the feature sets are generated (1306).
  • the feature sets typically contain a name and a feature table, the feature table being a list of feature identifiers (e.g., names of genes) and the associated statistics.
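  • As an illustration of this layout, a feature set can be modeled as a name plus a table of feature identifiers and statistics. The class and field names below are hypothetical, and the two statistics shown (p-value and fold change) are only the examples mentioned in the text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FeatureRow:
    feature_id: str      # e.g., a gene name
    p_value: float       # example associated statistic
    fold_change: float   # example associated statistic


@dataclass
class FeatureSet:
    name: str
    table: List[FeatureRow] = field(default_factory=list)

    def feature_ids(self):
        """Feature identifiers in table order."""
        return [row.feature_id for row in self.table]


# Hypothetical example values, for illustration only.
lung_up = FeatureSet(
    name="Lung tumor vs. normal, up-regulated",
    table=[FeatureRow("GENE_A", 0.001, 4.2), FeatureRow("GENE_B", 0.02, 2.1)],
)
```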
  • Generating a feature set involves putting the feature set into an appropriate standardized format for importation into the knowledge base.
  • i. Tissue-Specific Feature Sets
  • tissue-specific feature sets may be generated.
  • Tissue-specific feature sets are feature sets generated from multi-tissue experiments and contain features that show specificity for a particular tissue or tissues. For example, for an experiment measuring gene expression across twelve tissues, one generated feature set may be liver-specific up-regulated genes. The process is typically used with studies having a number of tissues across which a median expression is statistically relevant. In certain embodiments, tissue-specific feature sets are generated for studies across at least twelve tissues.
  • Figure 14 is a flow sheet showing a process by which tissue-specific feature sets are generated. The process shown in Figure 14 is performed for each possible feature in the data set (e.g., each feature remaining in the data set after signal filtering). The process begins with identifying the median expression (or other measurement) of the feature across all tissues (1402). The median expression across all samples in all tissues is used as synthetic control or normal expression of the feature in a tissue. The amount or degree of up/down regulation in each tissue relative to the median is then determined (1404). It is then determined if the feature is tissue-specific or not (1406).
  • a feature is determined to be tissue-specific if it is up or down regulated beyond a threshold in no more than n tissues.
  • up-regulation and down-regulation are considered separately; however, in certain embodiments, these may be considered together (e.g., a feature is determined to be tissue-specific if it is up- or down-regulated in no more than n tissues). If the feature is tissue-specific, the tissues in which the feature is specifically up/down regulated are identified (1408). The feature is then added to the identified tissue-specific feature sets (1410).
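  • The per-feature process of Figure 14 can be sketched as follows. This is a simplified illustration: the function name and default thresholds are hypothetical, one positive expression value per tissue is assumed (rather than all samples in all tissues), and up- and down-regulation are handled separately.

```python
from collections import defaultdict
from statistics import median


def tissue_specific_sets(expression, fold_threshold=2.0, max_tissues=3):
    """Build tissue-specific up and down feature sets.

    expression -- dict mapping feature -> {tissue: expression level}
    Returns (up_sets, down_sets), each mapping tissue -> list of features.
    """
    up_sets = defaultdict(list)
    down_sets = defaultdict(list)
    for feature, by_tissue in expression.items():
        # 1402: the median across tissues serves as a synthetic control.
        baseline = median(by_tissue.values())
        # 1404: tissues regulated beyond the threshold relative to the median.
        up = [t for t, v in by_tissue.items() if v >= baseline * fold_threshold]
        down = [t for t, v in by_tissue.items() if v <= baseline / fold_threshold]
        # 1406-1410: tissue-specific if regulated in no more than n tissues.
        for tissues, target in ((up, up_sets), (down, down_sets)):
            if 0 < len(tissues) <= max_tissues:
                for tissue in tissues:
                    target[tissue].append(feature)
    return dict(up_sets), dict(down_sets)
```

  • A feature that is elevated in every tissue relative to nothing in particular never qualifies, because the median moves with it; only deviations confined to a few tissues are captured.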
  • tissue-specific feature sets are typically generated for multi-tissue studies in addition to the feature sets generated by comparing expression between control and test as described above.
  • the gene is included in a separate feature set that includes genes that are up-regulated in liver.
  • ii. Directional Feature Sets
  • In the case of directional feature sets, the direction of change relative to a control (up/down; positive/negative, etc.) is also typically indicated. In many cases, the direction is indicated by a positive or negative sign, e.g., associated with a fold change or other amount, or in a separate column of the table. The direction may also be indicated by any other appropriate means.
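  • When direction is carried by the sign of the fold change, splitting a feature table into its up and down parts is straightforward. The sketch below assumes that representation; the function name and the tuple layout are hypothetical.

```python
def split_by_direction(feature_table):
    """Split (feature_id, signed_fold_change) rows into up and down lists."""
    up = [(f, fc) for f, fc in feature_table if fc > 0]
    down = [(f, fc) for f, fc in feature_table if fc < 0]
    return up, down
```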
  • Feature Groups
  • Feature groups contain any set of features of interest, typically without associated statistics.
  • Examples of feature groups include any set of features that the researcher is interested in, a set of features that defines a biological pathway, or a set of features that defines a protein family. Curation of feature groups may be performed by any appropriate method. Features involved in particular pathways, or sharing common functions or common structural features may be received from public or private databases, or generated by the researcher or user. After curation, the feature groups typically contain a name, other descriptive information and a list of member features.
  • Figure 15 is a process flowsheet that shows an overview of the Data Importation process.
  • the process begins in an operation 1502, in which the user defines all relevant files (all feature set and/or feature group files) as well as technology, e.g., the microarray or other platform used to generate the data and any associated information through a user interface.
  • platform technology does not apply to feature groups, as a feature group typically contains a group of features related biologically and not experimentally.
  • Associated information may include text files that contain descriptions or lists of key concepts of the feature set or feature group.
  • a location for the feature set in a directory system is also typically specified. For example, the user may specify a Project directory and Study subdirectory.
  • mapping is the process through which diverse features (e.g., from different platforms, data types and organisms) are associated with each other. For example, a gene may be associated with a SNP, a protein, or a sequence region of interest. During data importation, every feature is automatically mapped. In certain embodiments, mapping involves mapping each feature to one or more reference features or addresses in a globally unique mapping identifier set in the knowledge base (e.g., an index set). Mapping facilitates correlation between all feature sets and feature groups, allowing independent sets of data/information from diverse sources, assay types and platforms, to be correlated.
  • mapping involves the use of an index set that contains addresses or identifiers, each representing a unique feature (e.g., an index set may contain addresses or mapping identifiers representing a single gene of a human or non-human genome). Also in certain embodiments, mapping involves matching imported identifiers (e.g., generic name, GenBank number, etc.) to feature identifiers in the index set. These feature identifiers are various synonyms, genomic coordinates, etc., each of which points to one or more unique mapping identifiers. The mapping process may involve looking up feature identifier(s) that match an imported identifier, and then locating the mapping identifier(s) that the feature identifiers point to. In some cases, the best of a plurality of mapping identifiers is chosen for the mapping.
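  • The two-step lookup (imported identifier to feature identifier, then feature identifier to mapping identifier) can be sketched with a dictionary standing in for the index set. All identifiers below are made up for illustration, and the normalization of identifiers to upper case is an assumption.

```python
def map_feature(imported_id, identifier_index):
    """Return the mapping identifiers that an imported identifier points to.

    identifier_index -- dict mapping feature identifier (synonym, accession
                        number, ...) -> set of unique mapping identifiers
    """
    key = imported_id.strip().upper()
    return sorted(identifier_index.get(key, ()))


# Hypothetical index set: several identifiers point to the same reference.
index_set = {
    "GENE_A": {"MAP_1"},
    "ACC_0001": {"MAP_1"},
    "PROT_A": {"MAP_1"},
    "SNP_42": {"MAP_1", "MAP_2"},
}
```

  • An identifier such as `SNP_42` that points to more than one mapping identifier corresponds to the case in which the best of a plurality of mapping identifiers would have to be chosen; the selection criterion is not specified in the text and is left out of this sketch.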
  • the mapping process may range from relatively simple (e.g., making a connection between a gene and its protein product) to the more complex (e.g., mapping a connection between a sequence element and a nearby gene on a given region of a chromosome).
  • a feature may have a one-to-one mapping, i.e., each feature is mapped to a single reference or mapping identifier.
  • features are mapped to a plurality of references or mapping identifiers.
  • Three categories of mapping that may be employed are: feature-centric mapping, sequence-centric mapping and mapping based on indirect associations.
  • Feature-centric mapping relies on established relationships between various features and their identifiers and is typically employed when there is a standard nomenclature for the feature and identifiers. For example, several different accession numbers can all map to a single gene. A protein product of a gene maps to that gene because that relationship is already established. Two different compound IDs that represent the same substance map to a common drug reference. Different accession numbers of gene A, names for gene A, protein product of gene A, etc. are all mapped to a unique reference for that gene. In case of different organisms, orthologue information may be used to map all data between all available organisms. Sequence-centric mapping creates associations between various features based on their genomic coordinates.
  • Sequence-centric mapping may be useful in situations where established relationships between various identifiers and/or features are unknown or do not exist. Associative mapping does not require a feature to have a one-to-one mapping, i.e., to point to a single reference feature or ID; features may be associated with several features simultaneously. For example, if a sequence region that is being imported falls within a given haplotype block, then associative mapping can be done between that sequence region and all genes within that haplotype block. Another example is a region that is located within a known binding site of a gene. Although the feature of interest does not map directly to that gene, the region is potentially related to that gene through the binding site that regulates it, and so can be mapped to it. Further details of various mapping processes are described in U.S. Patent Publication No. 20070162411, titled "System and Method for Scientific Information Knowledge Management," incorporated by reference herein.
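  • Mapping by genomic coordinates amounts to an interval-overlap test. The following sketch associates an imported sequence region with every gene whose coordinates overlap it; the function name, the coordinate convention (closed intervals) and the example coordinates are all hypothetical.

```python
def overlapping_genes(region, genes):
    """Return the names of genes whose coordinates overlap the region.

    region -- (chromosome, start, end), closed interval
    genes  -- dict mapping gene name -> (chromosome, start, end)
    """
    chrom, start, end = region
    return sorted(
        name
        for name, (g_chrom, g_start, g_end) in genes.items()
        if g_chrom == chrom and g_start <= end and start <= g_end
    )


# Made-up coordinates for illustration.
gene_coordinates = {
    "GENE_A": ("chr1", 100, 200),
    "GENE_B": ("chr1", 180, 300),
    "GENE_C": ("chr2", 100, 200),
}
```

  • A region may map to several genes at once, which is the one-to-many behavior that associative mapping permits.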
  • Ranking involves ordering features within each feature set based on their relative levels of response to the stimulus or treatment in the experiment(s), or based on their magnitude and direction of change between different phenotypes, as well as their ability to differentiate different phenotypic states (e.g., late tumor stage versus early tumor stage). Ranking is typically based on one or more of the associated statistics in an imported feature set; for example, features may be ranked in order of decreasing fold-change or increasing p-value. In certain embodiments, a user specifies what statistic is to be used to rank features.
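  • A minimal sketch of ranking by a user-selected statistic follows. The function name and row layout are hypothetical, and ranking signed fold changes by absolute magnitude is an assumption made for this example.

```python
def rank_features(rows, statistic="fold_change"):
    """Return {feature: rank}, with rank 1 for the top-ranked feature.

    rows -- list of dicts with keys "feature", "fold_change" and "p_value"
    """
    if statistic == "fold_change":
        key = lambda row: -abs(row["fold_change"])  # decreasing magnitude
    elif statistic == "p_value":
        key = lambda row: row["p_value"]            # increasing p-value
    else:
        raise ValueError(f"unsupported statistic: {statistic}")
    ordered = sorted(rows, key=key)
    return {row["feature"]: rank for rank, row in enumerate(ordered, start=1)}
```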
  • Data tagging is performed in an operation 1508 (operations 1504-1508 may be performed concurrently or in any order).
  • Tags are standard terms that describe key concepts from biology, chemistry or medicine associated with a given study, feature set or feature group. Tagging allows users to transfer these associations and knowledge to the system along with the data. For example, if a study investigated beta blockers within a muscle tissue then the two tags may be "beta blockers" and "muscle." In addition, if a researcher knows that a given study is relevant to cardiovascular research, he/she can add a tag "cardiovascular disorders". Tagging may be performed automatically or manually. Automatic tagging automatically extracts key concepts for imported data.
  • the system parses all text and documents associated with a given study and automatically captures and scores key concepts (e.g., based on frequency and specificity criteria) that match a database of tags - "standard" biomedical, chemical or other keywords.
  • a user can specify additional files to be imported with the data, for example text descriptions of the experiments or studies.
  • Automatic tagging parses these documents for terms that match tags in the database.
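  • Automatic tagging by matching document text against a tag database can be sketched as below. The function name is hypothetical, and only a frequency criterion is used for scoring; the specificity criterion mentioned in the text is omitted from this illustration.

```python
import re
from collections import Counter


def auto_tag(text, tag_database):
    """Return (tag, count) pairs for tags found in the text, most frequent first."""
    lowered = text.lower()
    counts = Counter()
    for tag in tag_database:
        # Match the tag as a whole word or phrase, ignoring case.
        pattern = r"\b" + re.escape(tag.lower()) + r"\b"
        hits = len(re.findall(pattern, lowered))
        if hits:
            counts[tag] = hits
    return counts.most_common()
```

  • Tags returned with the highest counts would be the first candidates to attach to the imported study, feature set or feature group.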
  • a user may "manually" or semi-automatically add tags to feature sets and feature groups. The user selects from tags in the database to associate with the feature sets and feature groups.
  • the user may enter keywords to search the database. The search extracts the relevant tags and the user may add them to the imported data.
  • the system will retrieve all pre-computed pathway associations from the Global Correlation Matrix (GCM) for each feature set and then determine the most highly correlated pathway between the feature sets (based on the pre-computed individual signature-pathway scores).
  • a query involves (i) designating specific content that is to be compared and/or analyzed against (ii) other content in a "field of search" to generate (iii) a query result in which content from the field of search is selected and/or ranked based upon the comparison.
  • general types of queries include feature set queries, feature group queries, feature-specific queries and concept queries.
  • Figure 16A depicts a sequence of operations that may be employed when a user identifies a single feature set for the query (as distinguished from the case where the user presents multiple feature sets for a query).
  • the user identifies one feature set as an input for running the query. He or she may do this by browsing through a list of feature sets organized by Study and Project or some other ontology such as a hierarchy of taxonomy keywords (Concepts or Tags). Alternatively, the user may manually enter the identity of a feature set he or she is familiar with. Regardless of how the query feature set is entered, the system receives the identity of that feature set as a query input as depicted at block 1601 in Figure 16A.
  • this command is a "Run Query" command as identified at block 1603.
  • the query may be limited to a particular field of search within the features, feature sets and feature groups of the knowledge base.
  • the search may include the entire knowledge base and this may be the default case.
  • the user may define a field of search or the system may define it automatically for particular types of feature sets.
  • the system compares the query feature set against all other feature sets for the field of search.
  • scoring tables may be generated from correlations of each feature set against all other feature sets in the knowledge base and each feature set against all feature groups in the knowledge base.
  • the correlation scores provide a convenient way to rank all other feature sets in the field of search against the feature set used in the query.
  • a comparison of the query feature set against all other feature sets in the field of search is used to produce a ranked list or lists of the other feature sets.
  • the comparison of the query feature set against all other feature sets in the field of search involves using pre-computed feature set - concept scores as described in U.S. Provisional Patent Applications Nos. 61/033,673 and 61/089,834, referenced above, with comparisons between feature sets used to generate the pre-computed concept scores.
  • the ranked list can be used to display the other feature sets from the field of search in descending order, with the most highly correlated (or otherwise most relevant) other feature set listed first, at the top of the list. As indicated in block 1607 of Figure 16A, the resulting ranked list may be presented as a result of the query via a user interface.
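  • Producing the ranked list from pre-computed scores can be sketched as a lookup and sort. The function name and the layout of the scoring table (a dictionary keyed by pairs of feature set names) are hypothetical simplifications of the Scoring Table.

```python
def run_query(query_set, scoring_table, field_of_search=None):
    """Return (other_feature_set, score) pairs, highest score first.

    scoring_table   -- dict mapping (set_a, set_b) -> pre-computed score
    field_of_search -- optional collection of feature set names to search;
                       None means the entire knowledge base (the default)
    """
    results = []
    for (set_a, set_b), score in scoring_table.items():
        if query_set not in (set_a, set_b):
            continue
        other = set_b if set_a == query_set else set_a
        if field_of_search is not None and other not in field_of_search:
            continue
        results.append((other, score))
    return sorted(results, key=lambda item: item[1], reverse=True)
```

  • Because the scores are computed at import time, the query itself involves no statistical computation, only retrieval and ordering.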
  • the other feature sets identified at operation 1605 are simply presented as a list of individual feature sets at operation 1607.
  • the other feature sets may not be directly shown in the query results screen. Rather, for example, the Studies containing the query result feature sets are listed, with the feature sets in a particular Study viewed by selecting and expanding the Study.
  • taxonomy groups are listed as surrogates for the feature sets in the ranked list. Such taxonomy groups may be based on tags such as "cancer" or "stage 2 lymphoma," etc.
  • when the feature sets are presented, as indicated in block 1607, an indication of the correlation direction (positive or negative) with the query feature set is presented for at least some of the resulting feature sets.
  • a correlation direction for Studies or tags listed is presented based on the correlation direction(s) of the feature sets grouped under the Study or tag.
  • Figure 16B shows a sample results screen 1651 for a feature set versus feature set query.
  • the user took a "breast cancer basal-like CHGN vs. normal-like tumors" genes bioset and queried it against all other feature sets in a knowledge base.
  • Studies containing the ranked feature sets (biosets) resulting from the query are presented as rows 1653.
  • One of the Studies has been expanded to show individual feature sets (biosets) as rows 1655.
  • a “rank score” 1657 graphically depicting the relative rank of the feature set.
  • Other columns present common genes, p-values, and species of origin.
  • a correlation direction is presented at 1659 for all feature sets (whether presented as such or as Studies containing them).
  • the process may be complete.
  • a user may conduct further queries using the feature set provided as the query input. For example, as indicated at decision operation 1609, the system may allow a user to select feature sets to view details on the correlation between the input feature set and the selected feature set or sets.
  • the system next presents detailed correlation information on each selected feature set and the query feature set. See block 1611.
  • This information may include the overlapping features between the selected and queried feature set, the individual correlation scores and overlapping features associated with each individual correlation score.
  • this information is presented graphically, e.g., as in Figure 4A, or in the case of bi-directional vs. non-directional feature sets, as in Figure 4B.
  • the process described in Figure 16A may be modified to present information on queries involving feature sets.
  • Figure 16C shows a sample results screen 1681 in which the "Breast cancer non-treated - Relapse CHGN vs.
  • Bioset 1 (Breast cancer Basal-like CHGN vs. normal-like tumors), for example, contains 2773 up-regulated genes, 119 of which overlap with up-regulated genes from bioset 2 (Breast cancer non-treated - Relapse CHGN vs. no relapse).
  • the overlap genes with ranks in each bioset are presented at 1695.
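  • The per-direction overlaps shown on such a results screen (for instance, up-regulated genes of one bioset that are also up-regulated in the other) can be counted as in the sketch below. The function name and input layout are hypothetical, and only overlap counts are produced; the individual correlation scores for each direction pair are not derived here.

```python
def quadrant_overlaps(bioset_1, bioset_2):
    """Count overlapping features for each pair of directions.

    Each bioset is a dict with keys "up" and "down" holding sets of features.
    Keys of the result read "<direction in bioset 1>/<direction in bioset 2>".
    """
    return {
        f"{d1}/{d2}": len(bioset_1[d1] & bioset_2[d2])
        for d1 in ("up", "down")
        for d2 in ("up", "down")
    }
```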
  • the individual correlation scores (up/up, up/down, down/down, down/up) are presented to the user upon mouse over of the corresponding quadrant cells. 4.
  • certain embodiments of the invention employ processes acting under control of instructions and/or data stored in or transferred through one or more computer systems. Certain embodiments also relate to an apparatus for performing these operations.
  • This apparatus may be specially designed and/or constructed for the required purposes, or it may be a general-purpose computer selectively configured by one or more computer programs and/or data structures stored in or otherwise made available to the computer.
  • the processes presented herein are not inherently related to any particular computer or other apparatus.
  • various general-purpose machines may be used with programs written in accordance with the teachings herein, or it may be more convenient to construct a more specialized apparatus to perform the required method steps. A particular structure for a variety of these machines is shown and described below.
  • certain embodiments relate to computer readable media or computer program products that include program instructions and/or data (including data structures) for performing various computer-implemented operations associated with at least the following tasks: (1) obtaining raw data from instrumentation, databases (private or public (e.g., NCBI)), and other sources, (2) curating raw data to provide feature sets, (3) importing feature sets and other data to a repository such as a database or knowledge base, (4) mapping features from imported data to pre-defined feature references in an index, (5) generating a pre-defined feature index, (6) generating correlations or other scoring between feature sets and other feature sets and between feature sets and feature groups, (7) creating feature groups, (8) receiving queries from users (including, optionally, query input content and/or query field of search limitations), (9) running queries using features, feature groups, feature sets, Studies, taxonomy groups, and the like, and (10) presenting query results to a user (optionally in a manner allowing the user to navigate through related content and perform related queries).
  • the invention also pertains to computational apparatus executing instructions to perform any or all of these tasks. It also pertains to computational apparatus including computer readable media encoded with instructions for performing such tasks. Further the invention pertains to useful data structures stored on computer readable media. Such data structures include, for example, feature sets, feature groups, taxonomy hierarchies, feature indexes, Score Tables, and any of the other logical data groupings presented herein. Certain embodiments also provide functionality (e.g., code and processes) for storing any of the results (e.g., query results) or data structures generated as described herein. Such results or data structures are typically stored, at least temporarily, on a computer readable medium such as those presented in the following discussion. The results or data structures may also be output in any of various manners such as displaying, printing, and the like.
  • tangible computer-readable media suitable for use with the computer program products and computational apparatus of this invention include, but are not limited to, magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROM disks; magneto-optical media; semiconductor memory devices (e.g., flash memory), and hardware devices that are specially configured to store and perform program instructions, such as read-only memory devices (ROM) and random access memory (RAM).
  • The data and program instructions provided herein may also be embodied on a carrier wave or other transport medium (including electronic or optically conductive pathways).
  • Program instructions include low-level code, such as that produced by a compiler, as well as higher-level code that may be executed by the computer using an interpreter. Further, the program instructions may be machine code, source code and/or any other code that directly or indirectly controls operation of a computing machine. The code may specify input, output, calculations, conditionals, branches, iterative loops, etc.
  • FIG. 17 illustrates, in simple block format, a typical computer system that, when appropriately configured or designed, can serve as a computational apparatus according to certain embodiments.
  • The computer system 1700 includes any number of processors 1702 (also referred to as central processing units, or CPUs) that are coupled to storage devices including primary storage 1706 (typically a random access memory, or RAM) and primary storage 1704 (typically a read only memory, or ROM).
  • Processors 1702 may be of various types including microcontrollers and microprocessors such as programmable devices (e.g., CPLDs and FPGAs) and non-programmable devices such as gate array ASICs or general-purpose microprocessors.
  • Primary storage 1704 acts to transfer data and instructions uni-directionally to the CPU, and primary storage 1706 is typically used to transfer data and instructions in a bi-directional manner. Both of these primary storage devices may include any suitable computer-readable media such as those described above.
  • A mass storage device 1708 is also coupled bi-directionally to primary storage 1706 and provides additional data storage capacity and may include any of the computer-readable media described above. Mass storage device 1708 may be used to store programs, data and the like and is typically a secondary storage medium such as a hard disk. Frequently, such programs, data and the like are temporarily copied to primary memory 1706 for execution on CPU 1702.
  • Mass storage device 1708 may, in appropriate cases, be incorporated in standard fashion as part of primary storage 1704.
  • A specific mass storage device such as a CD-ROM 1714 may also pass data uni-directionally to the CPU or primary storage.
  • CPU 1702 is also coupled to an interface 1710 that connects to one or more input/output devices such as video monitors, track balls, mice, keyboards, microphones, touch-sensitive displays, transducer card readers, magnetic or paper tape readers, tablets, styluses, voice or handwriting recognition peripherals, USB ports, or other well-known input devices such as, of course, other computers.
  • CPU 1702 optionally may be coupled to an external device such as a database or a computer or telecommunications network using an external connection as shown generally at 1712. With such a connection, it is contemplated that the CPU might receive information from the network, or might output information to the network in the course of performing the method steps described herein.
  • A system such as computer system 1700 is used as a data import, data correlation, and querying system capable of performing some or all of the tasks described herein.
  • System 1700 may also serve as various other tools associated with knowledge bases and querying such as a data capture tool.
  • Information and programs, including data files, can be provided via a network connection 1712 for downloading by a researcher. Alternatively, such information, programs and files can be provided to the researcher on a storage device.
  • The computer system 1700 is directly coupled to a data acquisition system such as a microarray or high-throughput screening system that captures data from samples. Data from such systems are provided via interface 1712 for analysis by system 1700. Alternatively, the data processed by system 1700 are provided from a data storage source such as a database or other repository of relevant data.
  • A memory device such as primary storage 1706 or mass storage 1708 buffers or stores, at least temporarily, relevant data. The memory may also store various routines and/or programs for importing, analyzing and presenting the data, including importing feature sets, correlating feature sets with one another and with feature groups, generating and running queries, etc.
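
The routines listed above (importing feature sets, mapping features to a pre-defined index, correlating feature sets, and running queries) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the names `FeatureSet`, `map_to_index`, `directional_overlap`, and `run_query` are hypothetical, the upper-cased synonym lookup stands in for whatever mapping the index actually uses, and the "same minus opposite" score is a simple stand-in, not the correlation scoring of the described system.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class FeatureSet:
    """A curated feature set: reference ID -> direction (+1 up, -1 down)."""
    name: str
    directions: Dict[str, int] = field(default_factory=dict)


def map_to_index(raw: Dict[str, int], index: Dict[str, str]) -> Dict[str, int]:
    """Map imported feature names to pre-defined reference IDs.

    `index` maps upper-cased names/synonyms to reference IDs (an assumed
    layout). Names that are not in the index are dropped.
    """
    mapped: Dict[str, int] = {}
    for raw_name, direction in raw.items():
        ref = index.get(raw_name.strip().upper())
        if ref is not None:
            mapped[ref] = direction
    return mapped


def directional_overlap(a: FeatureSet, b: FeatureSet) -> Dict[str, int]:
    """Count shared features, split by whether their directions agree."""
    shared = a.directions.keys() & b.directions.keys()
    same = sum(1 for f in shared if a.directions[f] == b.directions[f])
    return {"shared": len(shared), "same": same, "opposite": len(shared) - same}


def run_query(query: FeatureSet, repository: List[FeatureSet]) -> List[Tuple[str, int]]:
    """Rank stored feature sets by |same - opposite| (illustrative score only).

    A positive score means mostly concordant directions, a negative score
    mostly opposite directions.
    """
    scored = []
    for fs in repository:
        counts = directional_overlap(query, fs)
        scored.append((fs.name, counts["same"] - counts["opposite"]))
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)
```

In this sketch a query feature set is first normalized with `map_to_index` and then passed to `run_query`, which ranks strongly concordant and strongly anti-correlated feature sets ahead of unrelated ones. The sign of the score preserves the direction of the relationship.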

Landscapes

  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Theoretical Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Genetics & Genomics (AREA)
  • Databases & Information Systems (AREA)
  • Bioethics (AREA)
  • Epidemiology (AREA)
  • Software Systems (AREA)
  • Public Health (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Apparatus Associated With Microorganisms And Enzymes (AREA)

Abstract

The present invention relates to methods, systems and apparatus for capturing, integrating, organizing, navigating and querying large-scale data from high-throughput biological and chemical assay platforms. The invention provides a highly efficient meta-analysis infrastructure for performing research queries across a large number of studies and experiments from different biological and chemical assays, data types and organisms, as well as systems for building and adding to such an infrastructure. In particular, aspects of the invention relate to integrating, organizing, navigating and querying "directional" data, such as gene expression profiles.
PCT/US2008/077097 2007-09-21 2008-09-19 Directional expression-based scientific information knowledge management WO2009039425A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US97428907P 2007-09-21 2007-09-21
US60/974,289 2007-09-21

Publications (1)

Publication Number Publication Date
WO2009039425A1 (fr) 2009-03-26

Family

ID=40468400

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2008/077097 WO2009039425A1 (fr) 2007-09-21 2008-09-19 Directional expression-based scientific information knowledge management

Country Status (1)

Country Link
WO (1) WO2009039425A1 (fr)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7155453B2 (en) * 2002-05-22 2006-12-26 Agilent Technologies, Inc. Biotechnology information naming system
US7225183B2 (en) * 2002-01-28 2007-05-29 Ipxl, Inc. Ontology-based information management system and method
US7243112B2 (en) * 2001-06-14 2007-07-10 Rigel Pharmaceuticals, Inc. Multidimensional biodata integration and relationship inference

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9141913B2 (en) 2005-12-16 2015-09-22 Nextbio Categorization and filtering of scientific data
US9183349B2 (en) 2005-12-16 2015-11-10 Nextbio Sequence-centric scientific information management
US9633166B2 (en) 2005-12-16 2017-04-25 Nextbio Sequence-centric scientific information management
US10127353B2 (en) 2005-12-16 2018-11-13 Nextbio Method and systems for querying sequence-centric scientific information
US10275711B2 (en) 2005-12-16 2019-04-30 Nextbio System and method for scientific information knowledge management
US20130166320A1 (en) * 2011-09-15 2013-06-27 Nextbio Patient-centric information management

Similar Documents

Publication Publication Date Title
US8364665B2 (en) Directional expression-based scientific information knowledge management
US8275737B2 (en) System and method for scientific information knowledge management
AU2022268283B2 (en) Phenotype/disease specific gene ranking using curated, gene library and network based data structures
US10127353B2 (en) Method and systems for querying sequence-centric scientific information
US9141913B2 (en) Categorization and filtering of scientific data
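CN108198621B (zh) A neural network-based method for comprehensive diagnosis and treatment decision-making using database data
JP4594622B2 (ja) Drug discovery method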
US7428554B1 (en) System and method for determining matching patterns within gene expression data
US20050055193A1 (en) Computer systems and methods for analyzing experiment design
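CN108206056B (zh) An artificial intelligence-assisted diagnosis and treatment decision terminal for nasopharyngeal carcinoma
CN108335756B (zh) Nasopharyngeal carcinoma database and comprehensive diagnosis and treatment decision method based on the database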
US20130166320A1 (en) Patient-centric information management
Netanely et al. PROMO: an interactive tool for analyzing clinically-labeled multi-omic cancer datasets
CN108320797B (zh) A nasopharyngeal carcinoma database and comprehensive diagnosis and treatment decision method based on the database
WO2009039425A1 (fr) Directional expression-based scientific information knowledge management
WO2002071059A1 (fr) System and method for managing gene expression data
Sarfraz et al. MiCA: An extended tool for microarray gene expression analysis
Selvanayaki et al. Finding microarray genes using GO ontology
Zhu Detecting gene similarities using large-scale content-based search systems
Dalziel et al. XMAS: an experiential approach for visualization, analysis, and exploration of time series microarray data
Antal et al. Towards an integrated usage of expression data and domain literature in gene clustering: representations and methods
MELITA et al. A Genetic Algorithm-Support Vector Machine Approach to DNA Microarrays Supervised Learning

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 08831452

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: 2008831452

Country of ref document: EP