WO2013173384A1 - Method and system for sorting biological samples - Google Patents

Method and system for sorting biological samples

Info

Publication number
WO2013173384A1
Authority
WO
WIPO (PCT)
Prior art keywords
subset
features
samples
interest
characterizing features
Prior art date
Application number
PCT/US2013/041015
Other languages
French (fr)
Inventor
Dale Wong
Original Assignee
Dale Wong
Priority date
Filing date
Publication date
Application filed by Dale Wong
Publication of WO2013173384A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis

Definitions

  • the present disclosure is related to the field of sorting biological samples. More specifically, aspects of the present disclosure are related to a system and method for creating and using a differentiating model to sort biological samples.
  • GWAS: genome-wide association studies
  • Genetic samples obtained from the participants are typically scanned on automated laboratory machines. The machine may survey each participant's genome sequence for strategically selected markers of genetic variation, for example single nucleotide polymorphisms (SNPs).
  • SNPs: single nucleotide polymorphisms
  • a SNP is a substitution of a single nucleotide at a specific position in the genome.
  • Other examples of genetic variation include "insertions/deletions", "inversions", "transpositions", and "copy number variations".
  • a method of sorting biological samples comprises obtaining one or more biological samples, each of the samples comprising a set of one or more characterizing features and a set of one or more features of interest.
  • One or more of the biological samples have known values for the set of features of interest and one or more of the biological samples have unknown values for the set of features of interest.
  • the method also comprises deriving a subset of the characterizing features. The subset predicts the value of the one or more features of interest. The subset may optionally be characterized by the possible inclusion of interactions between the characterizing features in the subset.
  • the method also comprises sorting the biological samples that have unknown values for the set of features of interest based on the predicted value of the subset.
  • a system to sort biological samples comprises a processor configured to receive one or more biological samples in a digitized representation. Each of the samples comprises a set of one or more characterizing features and a set of one or more features of interest. One or more of the biological samples have known values for the set of features of interest and one or more of the biological samples have unknown values for the set of features of interest.
  • the processor is further configured to derive a subset of the characterizing features. The subset predicts the value of the one or more features of interest. The subset may optionally be characterized by the possible inclusion of interactions between the characterizing features in the subset.
  • the processor is further configured to sort the biological samples that have unknown values for the set of features of interest based on the predicted value of the subset.
  • Additional aspects of the present disclosure describe a system to sort biological samples, which comprises a processor configured to receive one or more biological samples in a digitized representation.
  • Each of the samples comprises a set of one or more characterizing features and a set of one or more features of interest.
  • One or more of the biological samples have known values for the set of features of interest.
  • the processor is further configured to derive a subset of the characterizing features.
  • the subset predicts the value of the one or more features of interest.
  • the subset may optionally be characterized by the possible inclusion of interactions between the characterizing features in the subset.
  • Further aspects of the present disclosure include a system to sort biological samples, which includes a processor configured to receive one or more biological samples in a digitized representation.
  • Each of the samples comprises a set of one or more characterizing features and a set of one or more features of interest.
  • One or more of the biological samples have unknown values for the set of features of interest.
  • the processor is further configured to sort the biological samples that have unknown values for the set of features of interest. The sorting is based on a predicted value from a subset of the characterizing features and, where appropriate, possible interactions between the characterizing features in the subset.
  • FIG. 1 illustrates a method/process to sort biological samples in accordance with one aspect of the present disclosure.
  • FIG. 2 illustrates an example application of the method of FIG. 1.
  • FIG. 3 illustrates a labeling process in accordance with another aspect of the present disclosure.
  • FIG. 4 illustrates a Differentiating Model in a form of a Boolean sum of products in accordance with another aspect of the present disclosure.
  • FIG. 5 illustrates a differentiating model in accordance with another aspect of the present disclosure.
  • FIG. 6 illustrates a method of creating a differentiating model with a Boolean optimization method in accordance with another aspect of the present disclosure.
  • Biological samples may be obtained from individual organisms or aggregate populations.
  • biological samples may include genetic samples, protein samples, metabolite samples, and laboratory specimens.
  • genetic samples may include, for example, DNA sequences, partial DNA sequences, oligonucleotides, single nucleotide polymorphisms, copy number variations, alleles, haplotypes, genes, transcription factors, RNA expressions, and/or epigenetic factors. Testing of genetic samples may be performed on a variety of specimens including blood, urine, saliva, hair, sperm, ova, zygotes, and biopsied tissue.
  • the collected samples may be sorted for various applications such as in sorting plants or animals for selective breeding, sorting sperm, ova, and/or zygotes for selective breeding, sorting patients for clinical trials or for the most appropriate treatment, sorting individuals for resistance to or risk for a disease or condition, sorting individuals for a propensity for a desired trait or for a propensity for the absence of an undesired trait, and sorting tests for correlation with a trait.
  • a process of sorting samples may normally involve steps of receiving one or more samples and differentiating each sample according to a specified criterion. Some samples that have unknown value of the specified criterion may be sorted based on other known characteristics of the samples.
  • the steps of the sorting process may act directly on the samples, or may act on one-to-one proxies for the samples, e.g., digital data representing physical specimens.
  • the transformation of the proxies corresponds directly to the desired transformation of the samples.
  • the sorting process may involve creation of a Differentiating Model (hereinafter referred to as "DM").
  • the input to a DM is a sample (e.g., a physical specimen or a digitized representation of a physical specimen).
  • the output of a DM is the sample and a label.
  • the label may be in turn used to differentiate the given sample.
  • the sample may be a plant that has not yet blossomed, and the label may be the expected blossom color from a set of possible colors {Red, White, Blue}.
  • the label may be the expected height of the plant at maturity {high, medium, dwarf}.
  • Creating a DM for an unknown characteristic based on the known characteristics of the samples is extremely difficult when the unknown characteristic involves, at least in part, interactions between multiple characteristics. Interactions may either increase or decrease the significance of a characteristic, as compared to the significance of the single characteristic alone.
  • a first characteristic is the presence of a specific gene mutation linked with a risk of lung cancer
  • a second characteristic is exposure to a specific carcinogen. Samples with only one of the two characteristics may not show any increase in cancer. However, the significance of both characteristics occurring together is increased when compared to the significance of either the gene or the exposure alone.
  • An accurate DM should take into account such interactions between characteristics, when such interactions are present in the samples.
  • a regression model may include a term for each possible pair of characteristics, each possible triple of characteristics, and so on up to each possible combination of (N - 1) characteristics.
  • The large number of possible combinations, however, makes this approach impractical, as is the case with genetic samples.
  • the samples comprise genetic samples with a large number of characteristics; existing methods that have been applied are unsatisfactory, and the resulting predictive models generally account for only a small fraction of the heritability of a disease.
  • aspects of present disclosure provide an improved method and system for creating and using a DM to sort biological samples.
  • FIG. 1 illustrates a method/process to sort biological samples in accordance with the present disclosure.
  • a sorting system 110 in accordance with one aspect of the present disclosure receives samples 101 with a number of characterizing features (e.g., genetic profiles and/or demographic characteristics).
  • the samples 101 are sorted samples that have a known value of the sort characteristic, which is the feature of interest for a particular application.
  • the sorted samples 101 may include two groups, one group having samples that have the sort characteristic and one group having samples that do not have the characteristic.
  • the sorting system 110 then generates a DM 120 based on the sorted samples.
  • the DM 120 includes a subset of the characterizing features that may include interactions between the features in the subset and predicts the value of the feature of interest.
  • the sorting system 110 may sort unsorted samples 102 that have unknown values for the feature of interest.
  • the unsorted samples 102 may be divided into two groups, one group 103 having samples that are expected to have the sort characteristic and one group 104 having samples that are not expected to have the sort characteristic.
  • FIG. 2 illustrates an example application of the method of FIG. 1 to sort children into those who are expected to be tall and those who are not expected to be tall. Upon input to the sort, it is not known yet which children are which.
  • the sorted samples 210 may include information 216 of a group 212 of adults who are tall and information of a group 214 of adults who are not tall.
  • Unsorted samples 220 include information 226 of children 221 to be sorted.
  • the information may include, for example, each individual's DNA sequences and demographics.
  • a sorting system 230 in accordance with one aspect of the present disclosure may receive the sorted samples 210 and generate a DM based on the sorted samples 210.
  • the sorting system 230 may in turn sort the unsorted samples 220 according to the DM and the known characteristics of the children to be sorted (i.e., the information 226) to divide the children into two groups, one group 222 of children who will be tall and one group of children 224 who will not be tall.
  • a sorting system may be implemented using a suitably configured processor.
  • the processor may be a microprocessor that is part of a general purpose computer.
  • the processor may be a microprocessor embedded in a special purpose apparatus that may perform not only the sorting of the samples, but also other functions, such as gathering physical samples and initial characterization of the genetic profile for each sample.
  • the processor may be one or more Field Programmable Gate Arrays (FPGAs) or Application Specific Integrated Circuits (ASICs) configured to implement the complete or partial method of the present disclosure.
  • the processor may be multiple processors, of the same or different types, configured to execute in parallel the method of the present disclosure.
  • the sorting system may receive the digitized representation of the samples, e.g., in the form of one or more computer files, databases, and/or data packets; the representation may be transferred via a local bus, a local or extended network, or the Internet.
  • the format of the representation and the transfer medium may be mixed and matched in any number of combinations.
  • the sorting of these representations by the system applies with a one-to-one correspondence to the biological samples.
  • the first step of the method according to the embodiments of the present disclosure is to obtain samples that contain one or more characterizing features.
  • the samples may include a partial or complete genetic profile of one or more individuals.
  • a genetic profile may include a set of Single Nucleotide Polymorphisms (SNPs) present in the individual's DNA, a set of oligonucleotides which are the complement of short sections of the individual's DNA, or the RNA expressed by an individual's DNA.
  • the samples may contain characterizing features, such as DNA base pairs, RNA expressions, copy number variations, transcription factors, amino acids and/or genes.
  • a genetic profile may include epigenetic characteristics.
  • the samples may include characteristics in addition to genetic profiles.
  • characteristics may include, without limitation, results from clinical tests (e.g., disease states, symptoms of disease or health, reaction to medical treatments and/or medical substances, performance measurements, imaging, or blood chemistry), observed traits (e.g., height, weight, and/or family history), observed behaviors (e.g., dementia and/or depression), and/or demographic characteristics (e.g., age, sex, ethnicity, nationality, occupation, and/or geographic locations).
  • One or more characteristics in the samples may be the features of interest in a particular application. Some samples may have known values for the features of interest and some samples may have unknown values for the features of interest. In an example where the feature of interest is whether a person has a disease, some samples have a known value for this feature from the results of clinical tests, the individuals either having the disease or not having the disease. Some samples have an unknown value when the individuals do not know whether they have the disease. When the value of the specified criterion (i.e., features of interest) is unknown for the samples, it can be estimated based on other known characteristics of the samples. For example, whether a person has a disease can be estimated based on age, genetic profile, or other characteristics in the sample that are associated with the disease according to analysis of the samples with known values for the features of interest.
  • the specified criterion, i.e., the features of interest
  • samples may be gathered independently by any of a variety of means (e.g., collecting blood, saliva and buccal cell samples).
  • the process of collecting samples may be distributed across locations and time, and may be carried out by one or more agents.
  • the samples may be gathered at one or more locations, at one or more times, by one or more agents.
  • the samples may be received for sorting at one or more locations, at one or more times, by one or more agents.
  • the sorting may be performed at one or more locations, at one or more times, by one or more agents.
  • the agents in the various steps of collecting and sorting samples are not necessarily the same.
  • the method according to one embodiment of the present disclosure may include selecting a characteristic and grouping samples based on the selected characteristic.
  • the collected samples may be labeled as follows. Each sample in the collection is given an empty label.
  • the samples are then separated into two or more groups according to a selected characteristic of the samples.
  • a group may be empty.
  • the selected characteristic may be "red blossoms".
  • the samples are separated into two groups, one group where all the samples have the characteristic of red blossoms, and a second group where all the samples do not have the characteristic of red blossoms.
  • the presence or absence of the selected characteristic is appended to each sample's label accordingly.
  • another characteristic is selected, and each non-empty group of samples that does not satisfy a stopping criterion is separated according to the newly selected characteristic.
  • the second selected characteristic may be "more than twelve blossoms", and each of the groups of samples will be further separated into two groups, one group where all the samples have the characteristic of more than twelve blossoms, and a second group where all the samples do not have the characteristic of more than twelve blossoms. The presence or absence of this newly selected characteristic is appended to each sample's label accordingly. The process of selecting a characteristic, separating the samples into sub-groups, and appending each sample's label continues until a stopping criterion is reached.
  • stopping criteria may include, singly or in combination: a) all groups have no more than one sample, b) all characteristics have been selected, c) a certain number of characteristics have been selected, d) all groups have no more than a certain number of samples, e) available resources to store the groups and their labels are exhausted, and/or f) all groups have a set of samples which satisfy a certain target criterion.
  • the method may use a target criterion that the samples within a group either all have the specified sort characteristic, or all do not have the specified sort characteristic. For example, suppose the specified sort characteristic by which to sort the samples is "profit greater than ten dollars".
  • the process would stop when each of the groups consists of samples that all have profit greater than ten dollars, or all have profit that is not greater than ten dollars.
  • the stopping criteria may comprise criteria a, b, e, and f.
  • FIG. 3 shows a labeling process in accordance with the embodiments of the present disclosure.
  • sorted samples 301 may be separated into four groups, one group 310 with label [A] representing those samples with characterizing feature A, one 320 with label [B] representing those samples with characterizing feature B, one 330 with label [C] representing those samples with characterizing feature C and one 340 with label [D] representing those samples with
  • samples in group 320 may be separated into two sub-groups, one group 322 with label [B, E] representing those samples having characterizing features B and E and one 324 with label [B,!E] representing those samples having characterizing feature B but not feature E.
  • Samples in group 330 may be separated into one group 332 with label [C, E] representing those samples with characterizing features C and E and one 334 with label [C, !E] representing those samples with characterizing feature C but not feature E.
  • Samples in group 340 may be separated into one group 342 with label [D, E] representing those samples with characterizing features D and E and one 344 with label [D, !E] representing those samples with characterizing feature D but not feature E.
  • Samples in the groups 310, 322, and 334 have satisfied the stopping criteria, and thus, there is no further labeling process for these groups. The process for selecting a characteristic for the labeling process is described later in detail.
  • the process of grouping the samples may be performed by, for example, recursive partitioning, labeling, and construction of a classification or regression tree.
  • DM: Differentiating Model
  • the threshold may be that 80% of the samples must be positive.
  • the threshold may be determined by finding the value that divides the samples into two groups where the variance within the two groups is minimized.
  • the labels of the set of positive groups may be formed into a Boolean equation where the feature of interest may be the output term of the equation and the characterizing features may be the input terms.
  • the Boolean equation includes a Boolean conjunction (AND) of each term in the label (i.e., characterizing features) for each label and a Boolean disjunction (OR) of all these labels' Boolean conjunctions.
  • the DM comprises a Boolean equation in a "sum of products" form.
  • a Boolean equation could similarly be defined for the set of negative sub-groups. Note that the above process may extend to sort characteristics that have more than two types, for example "high", "medium", and "low".
  • FIG. 4 shows an exemplary differentiating model in a form of a Boolean sum of products where the feature of interest equals Feature A OR (Feature B AND Feature E) OR (Feature C AND NOT Feature E). It is noted that if there are multiple features of interest, the process of forming the labels of the set of positive groups into a Boolean equation, in one embodiment, can be repeated for each feature, and there would be a separate Boolean equation for each feature. It is noted that there are other methods that can deal with multiple outputs simultaneously.
  • the DM may also include additional information so that it can handle a sort characteristic that may take on continuous values.
  • the DM may include information of the average value for the specified sort characteristic for all samples that match a label. The value is stored along with the Boolean conjunction for that label, and is part of the DM.
  • the DM may include the confidence value for a label. The confidence value may be based on, for example, the difference between the means of two populations (or mathematically equivalently, a chi-square distribution), where one population is the set of samples that match a label's Boolean conjunction, the other population is the complementary set of samples that do not match a label's Boolean conjunction, and the mean is the mean of the specified sort characteristic for the population.
  • the confidence value for a label may be also stored along with the Boolean conjunction for that label, and is part of the DM. It should be noted that the exact method used to calculate a confidence value depends on the type and number of the characteristics and samples, and adheres to standard statistical theory and practice.
  • the label's average value and the label's confidence level may be used to estimate a value for the specified sort characteristic, for unsorted samples that match the label's Boolean conjunction and for which the actual value of the specified sort characteristic is unknown.
  • the label's average value may be continuous, categorical or dichotomous.
  • the DM may be considered parameterized, where a minimum confidence level is specified, and only those labels in the DM with a confidence level greater than or equal to the specified level are active in the Boolean disjunction of the DM.
  • FIG. 5 shows an example of a Differentiating Model in an alternative presentation format.
  • each box shows one condition 510, and the set of all such conditions comprises the Differentiating Model.
  • Each condition 510 is comprised of one or more features 512, all of which must be true for the condition to be matched. Matching any one of the conditions would sort a sample as positive for the specified sort characteristic.
  • Conditions are listed in order of decreasing confidence level. Non-shaded boxes are those that are above a specified confidence threshold and are active in the sorting process. Shaded boxes are those that are below a specified confidence threshold and are inactive in the sorting process.
  • the selection may generally involve recursive partitioning, minimizing a constraint term during regression or ranking based on the statistical relationship between the characterizing feature and the features of interest.
  • the characteristics are first selected randomly or stochastically. After the complete labeling process is repeated multiple times, the DM which performs best is chosen to sort the samples.
  • the DM is comprised of all the randomly created DMs, and the average of the estimated values from the multiple DMs is used to sort the samples.
  • a subset of characteristics is pre-selected using any of a number of well known "feature selection" methods including lasso regression, elastic net regression, Relief-F, principal component analysis, multi-dimensional reduction, and others.
  • the characteristics are sorted according to the confidence value for each characteristic.
  • a specified number of subsets of the samples are created. For each subset, a confidence value for each characteristic is calculated. For each characteristic, an average confidence value across all subsets is calculated. The characteristics are then sorted in decreasing order of their average confidence values. The characteristics are considered in this sorted order by the labeling process.
  • characteristics are selected by the following method. Each previously unselected characteristic in the sorted list is considered in turn. A metric is calculated for this characteristic relative to the set of active groups (i.e., those groups that are not sub-divided). The characteristic with the best metric is selected to divide the active groups. Many alternative metrics are possible, including without limitation, Gini Impurity, Entropy Gain, and P-Value. In a preferred embodiment, the metric used is the P-Value based on, for example, the difference between the means of two populations, where one population is the set of samples that have the characteristic, the other population is the complementary set of samples that do not have the characteristic, and the mean is the mean of the specified sort characteristic for the population. The exact method used to calculate a P-Value depends on the type and number of the
  • a selection method involving a specified window size and minimum threshold.
  • the labeling process will consider a number of characteristics equal to the specified window size. If at least one characteristic is found whose metric exceeds the specified minimum threshold, the characteristic with the best metric will be selected. Otherwise, the labeling process will continue with the next number of characteristics equal to the specified window size. If none of the characteristics exceed the specified minimum threshold, then the characteristic with the best metric will be selected.
  • the Boolean equation of the differentiating model may be optimized using a Boolean optimization method.
  • a Boolean optimization method may find a logically equivalent condition with a minimal number of unique terms/characteristics. For example, the following Boolean conditions (1) and (2) are logically equivalent, but condition (2) uses fewer terms.
  • the entire set of samples with all the characteristics may be cast as a single Boolean condition, and then a Boolean optimization method could be run to derive a logically equivalent condition with the minimum number of unique terms, if time and resources permit.
  • the above labeling process ensures that a consistent set of conditions will be formed because samples which are inconsistent with their active group's Boolean expression are excluded from the system of equations.
  • FIG. 6 illustrates a method of creating a DM with a Boolean optimization method in accordance with one embodiment of the present disclosure.
  • a DM 610 is first created from a labeling process 620 based on the sorted samples 601.
  • the DM 610 may be used to define a subset of the samples 630 that is logically consistent.
  • the subset of the samples 630 may be optimized by a Boolean optimization method 640, and the resulting optimized condition becomes a new DM 650.
  • aspects of the present disclosure provide for a straightforward and logical way of sorting biological samples.
  • aspects of the present disclosure may be applied to the problem of identifying particular combinations of genetic features (e.g., base pairs, alleles, single nucleotide polymorphisms (SNPs), haplotypes, and the like) as markers for specific diseases or conditions.
  • Aspects of the present disclosure can be used to generate a Differentiating Model that identifies such combinations from samples known to have or not have a given condition or disease.
  • aspects of the present disclosure may be used to screen individuals for a given disease or condition using such a differentiating model. It is noted that aspects of the present disclosure are not limited to such examples.
  • aspects of the present disclosure may be further extended to include other applications such as drug target discovery, companion diagnostics, biomarker discovery, transcription factor discovery, epigenetic analysis, or pathway analysis.
  • aspects of the present disclosure may be applied to model inference for any high dimensional application with interacting features.
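
The labeling process, the sum-of-products Differentiating Model, and the sorting step described above can be illustrated with a minimal Python sketch. It is not the disclosed implementation: it assumes Boolean characterizing features, splits on features in a fixed order rather than by a selection metric, and uses only stopping criteria (a), (b), and (f) together with the 80% positivity threshold mentioned above. All names (`label_samples`, `build_dm`, `predict`, `sort_samples`, the features `B` and `E`, the target `tall`) and the toy data are hypothetical.

```python
# Toy sketch only: Boolean features, fixed split order, hypothetical names.

def is_pure(group, target):
    """Target criterion (f): every sample in the group agrees on the target."""
    return len({bool(s[target]) for s in group}) <= 1


def label_samples(samples, features, target):
    """Split groups on one feature at a time, extending each group's label.

    A label is a tuple of (feature, present) terms. A group stops being
    split when it holds at most one sample (a) or is pure (f); all
    remaining groups stop when the feature list is exhausted (b).
    """
    active = [((), list(samples))]  # every sample starts with an empty label
    finished = []
    for feature in features:
        still_active = []
        for label, group in active:
            if len(group) <= 1 or is_pure(group, target):
                finished.append((label, group))
                continue
            for present in (True, False):
                sub = [s for s in group if bool(s[feature]) == present]
                if sub:  # empty groups are dropped
                    still_active.append((label + ((feature, present),), sub))
        active = still_active
    return finished + active


def build_dm(labeled, target, threshold=0.8):
    """Keep the labels of groups whose positive fraction meets the threshold."""
    return [
        label
        for label, group in labeled
        if group and sum(bool(s[target]) for s in group) / len(group) >= threshold
    ]


def predict(dm, sample):
    """OR over labels of the AND over each label's terms (sum of products)."""
    return any(
        all(bool(sample[f]) == present for f, present in label) for label in dm
    )


def sort_samples(dm, unsorted):
    """Divide unsorted samples into expected-positive and expected-negative."""
    positive = [s for s in unsorted if predict(dm, s)]
    negative = [s for s in unsorted if not predict(dm, s)]
    return positive, negative


# Sorted samples: the feature of interest ("tall") is known.
sorted_samples = [
    {"B": True, "E": True, "tall": True},
    {"B": True, "E": True, "tall": True},
    {"B": True, "E": False, "tall": False},
    {"B": False, "E": True, "tall": False},
    {"B": False, "E": False, "tall": False},
]
dm = build_dm(label_samples(sorted_samples, ["B", "E"], "tall"), "tall")

# Unsorted samples: the feature of interest is unknown.
unsorted = [{"B": True, "E": True}, {"B": False, "E": True}]
expected_tall, expected_not_tall = sort_samples(dm, unsorted)
```

In this toy data neither feature alone determines the outcome; the resulting model contains the single conjunction (B AND E), which is the kind of interaction between characterizing features that the description emphasizes.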
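
The parameterized Differentiating Model of FIG. 5, in which each condition carries an average value and a confidence level and only conditions at or above a specified confidence level are active, might be represented as in the following sketch. The `Condition` class, the field names, and the numbers are illustrative assumptions. Returning the average value of the highest-confidence matching condition is one possible design choice; the description does not prescribe it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Condition:
    """One label of the DM: a conjunction plus its stored statistics."""
    terms: tuple          # ((feature, present), ...) joined by AND
    mean_value: float     # average sort-characteristic value for matching samples
    confidence: float     # confidence value for this label (hypothetical scale 0..1)


def active_conditions(dm, min_confidence):
    """Conditions at or above the threshold, in decreasing confidence order."""
    ranked = sorted(dm, key=lambda c: c.confidence, reverse=True)
    return [c for c in ranked if c.confidence >= min_confidence]


def estimate(dm, sample, min_confidence):
    """Estimate the sort characteristic from the first active condition matched.

    Returns None when no active condition matches the sample. Features
    missing from the sample are treated as absent (an assumption).
    """
    for cond in active_conditions(dm, min_confidence):
        if all(bool(sample.get(f, False)) == p for f, p in cond.terms):
            return cond.mean_value
    return None


# Hypothetical model with two conditions.
dm = [
    Condition(terms=(("B", True), ("E", True)), mean_value=175.0, confidence=0.6),
    Condition(terms=(("A", True),), mean_value=180.0, confidence=0.95),
]
sample = {"B": True, "E": True}
loose = estimate(dm, sample, min_confidence=0.5)   # second-ranked condition is active
strict = estimate(dm, sample, min_confidence=0.9)  # only the A condition is active
```

Raising `min_confidence` deactivates lower-confidence conditions, which corresponds to the shaded boxes in FIG. 5.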
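
The window-based selection of characteristics described above can be sketched as follows. The sketch assumes that a larger metric is better and that, when no window contains a characteristic exceeding the minimum threshold, the best characteristic seen across all windows is chosen. The function name, the `metric` callable, and the scores are hypothetical.

```python
def select_characteristic(candidates, metric, window_size, min_threshold):
    """Scan the sorted candidate list one window at a time.

    Returns the best-scoring candidate of the first window that contains a
    score above min_threshold. If no window qualifies, falls back to the
    best-scoring candidate seen overall. Returns None for an empty list.
    """
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    best_overall = None  # (score, candidate)
    for start in range(0, len(candidates), window_size):
        window = candidates[start:start + window_size]
        score, best = max(((metric(c), c) for c in window), key=lambda t: t[0])
        if best_overall is None or score > best_overall[0]:
            best_overall = (score, best)
        if score > min_threshold:
            return best
    return best_overall[1] if best_overall else None


# Hypothetical metric values for four characteristics, already in sorted order.
scores = {"snp1": 0.10, "snp2": 0.20, "snp3": 0.90, "snp4": 0.95}
names = list(scores)

picked_small_window = select_characteristic(names, scores.get, 2, 0.5)
picked_large_window = select_characteristic(names, scores.get, 3, 0.5)
picked_fallback = select_characteristic(names, scores.get, 2, 0.99)
```

With a window of three, `snp3` is selected even though `snp4` scores higher, because the search stops at the first window that clears the threshold; this trades optimality for fewer metric evaluations.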

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Epidemiology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Bioethics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

A system to sort biological samples may create a differentiating model based on biological samples that have known values of the features of interest. The differentiating model involves a combination of one or more characterizing features in the sample. Based on the differentiating model, the system may sort biological samples that have unknown values of the features of interest. It is emphasized that this abstract is provided to comply with the rules requiring an abstract that will allow a searcher or other reader to quickly ascertain the subject matter of the technical disclosure. It is submitted with the understanding that it will not be used to interpret or limit the scope or meaning of the claims.

Description

METHOD AND SYSTEM FOR SORTING BIOLOGICAL SAMPLES
PRIORITY CLAIMS
This application claims the benefit of priority of commonly-assigned, co-pending U.S. Provisional application serial number 61/647,959, to Dale Wong, entitled "METHOD AND SYSTEM FOR SORTING BIOLOGICAL SAMPLES", filed May 16, 2012, the entire disclosure of which is herein incorporated by reference.
FIELD OF THE DISCLOSURE
The present disclosure is related to the field of sorting biological samples. More specifically, aspects of the present disclosure are related to a system and method for creating and using a differentiating model to sort biological samples.
BACKGROUND
Sorting of biological samples has many applications in scientific and medical research. As one example among many, genome-wide association studies (GWAS) examine many genetic variants in different individuals to see whether any variant is associated with a trait, such as a disease. These studies normally compare the DNA of two groups of participants, for example, one group of people affected by a disease and another group of people without the disease. Genetic samples obtained from the participants are typically scanned on automated laboratory machines. The machine may survey each participant's genome sequence for strategically selected markers of genetic variation, for example single nucleotide polymorphisms (SNPs). A SNP is a substitution of a single nucleotide at a specific position in the genome. Other examples of genetic variation include "insertions/deletions", "inversions", "transpositions", and "copy number variations". If certain genetic variations are found to be significantly more frequent in people with the disease than in people without the disease, the variations are said to be associated with the disease. Once genetic associations are identified, researchers may use the information to develop strategies to detect, treat, and prevent the disease. A number of techniques and methods have been developed to analyze samples for genetic variations that contribute to the onset of a disease.
It is within this context that aspects of the present disclosure arise.
SUMMARY
According to aspects of the present disclosure, a method of sorting biological samples comprises obtaining one or more biological samples, each of the samples comprising a set of one or more characterizing features and a set of one or more features of interest. One or more of the biological samples have known values for the set of features of interest and one or more of the biological samples have unknown values for the set of features of interest. The method also comprises deriving a subset of the characterizing features. The subset predicts the value of the one or more features of interest. The subset may optionally be characterized by the possible inclusion of interactions between the characterizing features in the subset. The method also comprises sorting the biological samples that have unknown values for the set of features of interest based on the predicted value of the subset.
According to other aspects of the present disclosure, a system to sort biological samples comprises a processor configured to receive one or more biological samples in a digitized representation. Each of the samples comprises a set of one or more characterizing features and a set of one or more features of interest. One or more of the biological samples have known values for the set of features of interest and one or more of the biological samples have unknown values for the set of features of interest. The processor is further configured to derive a subset of the characterizing features. The subset predicts the value of the one or more features of interest. The subset may optionally be characterized by the possible inclusion of interactions between the characterizing features in the subset. The processor is further configured to sort the biological samples that have unknown values for the set of features of interest based on the predicted value of the subset.
Additional aspects of the present disclosure describe a system to sort biological samples, which comprises a processor configured to receive one or more biological samples in a digitized representation. Each of the samples comprises a set of one or more characterizing features and a set of one or more features of interest. One or more of the biological samples have known values for the set of features of interest. The processor is further configured to derive a subset of the characterizing features. The subset predicts the value of the one or more features of interest. The subset may optionally be characterized by the possible inclusion of interactions between the characterizing features in the subset.
Further aspects of the present disclosure include a system to sort biological samples, which includes a processor configured to receive one or more biological samples in a digitized representation. Each of the samples comprises a set of one or more characterizing features and a set of one or more features of interest. One or more of the biological samples have unknown values for the set of features of interest. The processor is further configured to sort the biological samples that have unknown values for the set of features of interest. The sorting is based on a predicted value from a subset of the characterizing features and, where appropriate, possible interactions between the characterizing features in the subset.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 illustrates a method/process to sort biological samples in accordance with one aspect of the present disclosure.
FIG. 2 illustrates an example application of the method of FIG. 1.
FIG. 3 illustrates a labeling process in accordance with another aspect of the present disclosure.
FIG. 4 illustrates a Differentiating Model in a form of a Boolean sum of products in accordance with another aspect of the present disclosure.
FIG. 5 illustrates a differentiating model in accordance with another aspect of the present disclosure.
FIG. 6 illustrates a method of creating a differentiating model with a Boolean optimization method in accordance with another aspect of the present disclosure.
DETAILED DESCRIPTION
Biological samples may be obtained from individual organisms or aggregate populations. By way of example but not by way of limitation, biological samples may include genetic samples, protein samples, metabolite samples, and laboratory specimens. Particularly, genetic samples may include, for example, DNA sequences, partial DNA sequences, oligonucleotides, single nucleotide polymorphisms, copy number variations, alleles, haplotypes, genes, transcription factors, RNA expressions, and/or epigenetic factors. Testing of genetic samples may be performed on a variety of specimens including blood, urine, saliva, hair, sperm, ova, zygotes, and biopsied tissue. The collected samples may be sorted for various applications such as in sorting plants or animals for selective breeding, sorting sperm, ova, and/or zygotes for selective breeding, sorting patients for clinical trials or for the most appropriate treatment, sorting individuals for resistance to or risk for a disease or condition, sorting individuals for a propensity for a desired trait or for a propensity for the absence of an undesired trait, and sorting tests for correlation with a trait. A process of sorting samples may normally involve steps of receiving one or more samples and differentiating each sample according to a specified criterion. Some samples that have unknown value of the specified criterion may be sorted based on other known characteristics of the samples. In addition, the steps of the sorting process may act directly on the samples, or may act on one-to-one proxies for the samples, e.g., digital data representing physical specimens. The transformation of the proxies corresponds directly to the desired transformation of the samples.
The sorting process may involve creation of a Differentiating Model (hereinafter referred to as "DM"). The input to a DM is a sample (e.g., a physical specimen or a digitized representation of a physical specimen). The output of a DM is the sample and a label. The label may be in turn used to differentiate the given sample. For example, in a scenario of plant selective breeding, the sample may be a plant that has not yet blossomed, and the label may be the expected blossom color from a set of possible colors {Red, White, Blue} . As another example, the label may be the expected height of the plant at maturity {high, medium, dwarf} .
Creating a DM for an unknown characteristic based on the known characteristics of the samples is extremely difficult when the unknown characteristic involves, at least in part, interactions between multiple characteristics. Interactions may either increase or decrease the significance of a characteristic, as compared to the significance of the single characteristic alone. For example, a first characteristic is the presence of a specific gene mutation linked with a risk of lung cancer, and a second characteristic is exposure to a specific carcinogen. Samples with only one of the two characteristics may not show any increase in cancer. However, the significance of both characteristics occurring together is increased when compared to the significance of either the gene or the exposure alone. An accurate DM should take into account such interactions between characteristics, when such interactions are present in the samples.
Some existing methods attempt to account for each possible interaction. For example, a regression model may include a term for each possible pair of characteristics, each possible triple of characteristics, and so on up to each possible combination of (N - 1) characteristics. The large number of possible combinations, however, makes this approach impractical, as is the case with genetic samples. In order to implement a more practical method, it has been proposed to limit the number of possible interactions considered. For example, some methods only handle single characteristics and/or pairs of characteristics. Some other methods may only consider a random sampling of interactions. As an extreme example, some ignore interactions completely. In the example application of sorting individuals for risk for a disease or condition, the samples comprise genetic samples with a large number of characteristics. Existing methods that have been applied are unsatisfactory, and the resulting predictive models generally account for only a small fraction of the heritability of a disease. Aspects of the present disclosure provide an improved method and system for creating and using a DM to sort biological samples.
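To give a sense of scale, the number of interaction terms such a regression model would need can be counted directly. The following is only an illustrative sketch: the function name and the example feature counts (a panel of 500,000 markers, and N = 20) are assumptions chosen for illustration and do not come from the disclosure.

```python
from math import comb

def interaction_term_count(n_features: int, max_order: int) -> int:
    """Count interaction terms of order 2..max_order among n_features characteristics."""
    return sum(comb(n_features, k) for k in range(2, max_order + 1))

# Pairs only, for a hypothetical panel of 500,000 markers.
pairs = interaction_term_count(500_000, 2)

# Every order from pairs up to (N - 1), for a small hypothetical N = 20.
all_orders = interaction_term_count(20, 19)

print(pairs)       # roughly 1.25e11 pairwise terms
print(all_orders)  # equals 2**20 - 20 - 2
```

Even with only 20 characteristics the full set of interactions exceeds a million terms, which suggests why methods that enumerate every interaction do not scale to genetic samples.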
FIG. 1 illustrates a method/process to sort biological samples in accordance with the present disclosure. A sorting system 110 in accordance with one aspect of the present disclosure receives samples 101 with a number of characterizing features (e.g., genetic profiles and/or demographic characteristics). The samples 101 are sorted samples that have known values of the sort characteristic, which is the feature of interest for a particular application. In other words, the sorted samples 101 may include two groups, one group having samples that have the sort characteristic and one group having samples that do not have the sort characteristic. The sorting system 110 then generates a DM 120 based on the sorted samples. The DM 120 includes a subset of the characterizing features, which may include interactions between the features in the subset, and predicts the value of the feature of interest. Based on the DM 120 and its predicted value of the feature of interest, the sorting system 110 may sort unsorted samples 102 that have unknown values for the feature of interest. As a result, the unsorted samples 102 may be divided into two groups, one group 103 having samples that are expected to have the sort characteristic and one group 104 having samples that are not expected to have the sort characteristic.
FIG. 2 illustrates an example application of the method of FIG. 1 to sort children into those who are expected to be tall and those who are not expected to be tall. Upon input to the sort, it is not known yet which children are which. The sorted samples 210 may include information 216 of a group 212 of adults who are tall and information of a group 214 of adults who are not tall.
Unsorted samples 220 include information 226 of children 221 to be sorted. The information may include, for example, individual's DNA sequences and demographics. A sorting system 230 in accordance with one aspect of the present disclosure may receive the sorted samples 210 and generate a DM based on the sorted samples 210. The sorting system 230 may in turn sort the unsorted samples 220 according to the DM and the known characteristics of the children to be sorted (i.e., the information 226) to divide the children into two groups, one group 222 of children who will be tall and one group of children 224 who will not be tall.
MACHINE
A sorting system according to the present disclosure may be implemented using a suitably configured processor. In one example, the processor may be a microprocessor that is part of a general purpose computer. Alternatively, the processor may be a microprocessor embedded in a special purpose apparatus that may perform not only the sorting of the samples, but also other functions, such as gathering physical samples and initial characterization of the genetic profile for each sample. In another example, the processor may be one or more Field Programmable Gate Arrays (FPGAs) or Application Specific Integrated Circuits (ASICs) configured to implement the complete or partial method of the present disclosure. In yet another alternative, the processor may be multiple processors, of the same or different types, configured to execute in parallel the method of the present disclosure. There are many possible implementations of such a sorting system, and the examples above are cited only for illustrative purposes, without limitation. The sorting system may receive the digitized representation of the samples, e.g., in the form of one or more computer files, databases, and/or data packets, which may be transferred via local bus, local or extended network, or the Internet. The format of the representation and the transfer medium may be mixed and matched in any number of combinations. The sorting of these representations by the system applies with a one-to-one correspondence to the biological samples. The following sections describe in detail each process/step to perform the method of sorting biological samples in accordance with the present disclosure.
SAMPLES
The first step of the method according to the embodiments of the present disclosure is to obtain samples that contain one or more characterizing features. In one embodiment, the samples may include a partial or complete genetic profile of one or more individuals. By way of example, but not by way of limitation, a genetic profile may include a set of Single Nucleotide Polymorphisms (SNP's) present in the individual's DNA, a set of oligonucleotides which are the complement of short sections of the individual's DNA, or the RNA expressed by an individual's DNA. As such, the samples may contain characterizing features, such as DNA base pairs, RNA expressions, copy number variations, transcription factors, amino acids and/or genes. Alternatively, a genetic profile may include epigenetic characteristics.
In another embodiment of the present disclosure, the samples may include characteristics in addition to genetic profiles. Examples of such characteristics may include, without limitation, results from clinical tests (e.g., disease states, symptoms of disease or health, reaction to medical treatments and/or medical substances, performance measurements, imaging, or blood chemistry), observed traits (e.g., height, weight, and/or family history), observed behaviors (e.g., dementia and/or depression), and/or demographic characteristics (e.g., age, sex, ethnicity, nationality, occupation, and/or geographic locations).
One or more characteristics in the samples may be the features of interest in a particular application. Some samples may have known values for the features of interest and some samples may have unknown values for the features of interest. In an example where the feature of interest is whether a person has a disease, some samples have a known value for this feature from the results of clinical tests, i.e., the individual either has the disease or does not have the disease. Other samples have an unknown value because the individuals do not know whether they have the disease. When the value of the specified criterion (i.e., the features of interest) is unknown for the samples, it can be estimated based on other known characteristics of the samples. For example, whether a person has a disease can be estimated from the person's age, genetic profile, or other characteristics in the sample that are associated with the disease according to analysis of the samples with known values for the features of interest.
It should be noted that samples may be gathered independently by any of a variety of means (e.g., collecting blood, saliva and buccal cell samples). The process of collecting samples may be distributed across locations and time, and may be carried out by one or more agents. The samples may be gathered at one or more locations, at one or more times, by one or more agents. The samples may be received for sorting at one or more locations, at one or more times, by one or more agents. The sorting may be performed at one or more locations, at one or more times, by one or more agents. The agents in the various steps of collecting and sorting samples are not necessarily the same. Once the samples are gathered, each sample has a corresponding representation within one or more digital formats, e.g., computer files. In one example, each sample is scanned by a machine and stored in a digital format in a remote or local database.
After obtaining samples, the method according to one embodiment of the present disclosure may include selecting a characteristic and grouping samples based on the selected characteristic.
LABELING
In one embodiment of the present disclosure, the collected samples may be labeled as follows. Each sample in the collection is given an empty label. The samples are then separated into two or more groups according to a selected characteristic of the samples. A group may be empty. For example, the selected characteristic may be "red blossoms", and the samples are separated into two groups, one group where all the samples have the characteristic of red blossoms, and a second group where all the samples do not have the characteristic of red blossoms. The presence or absence of the selected characteristic is appended to each sample's label accordingly. Next, another characteristic is selected, and each non-empty group of samples that does not satisfy a stopping criterion is separated according to the newly selected characteristic. For example, the second selected characteristic may be "more than twelve blossoms", and each of the groups of samples will be further separated into two groups, one group where all the samples have the characteristic of more than twelve blossoms, and a second group where all the samples do not have the characteristic of more than twelve blossoms. The presence or absence of this newly selected characteristic is appended to each sample's label accordingly. The process of selecting a characteristic, separating the samples into sub-groups, and appending to each sample's label continues until a stopping criterion is reached.
By way of example but not by way of limitation, stopping criteria may include, singly or in combination: a) all groups have no more than one sample, b) all characteristics have been selected, c) a certain number of characteristics have been selected, d) all groups have no more than a certain number of samples, e) available resources to store the groups and their labels is exhausted, and/or f) all groups have a set of samples which satisfy a certain target criterion. For stopping criterion f, the method may use a target criterion that the samples within a group either all have the specified sort characteristic, or all do not have the specified sort characteristic. For example, suppose the specified sort characteristic by which to sort the samples is "profit greater than ten dollars". In such a case, the process would stop when each of the groups consists of samples that all have profit greater than ten dollars, or all have profit that is not greater than ten dollars. In one embodiment of the present disclosure, the stopping criteria may comprise criteria a, b, e, and f.
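The labeling process and a few of the stopping criteria above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: it assumes dichotomous characteristics stored as booleans, applies only criteria a, b, and f, and takes the characteristics in a fixed, caller-supplied order. All names (`label_samples`, `is_pure`, the plant data) are hypothetical.

```python
from typing import Dict, List, Tuple

Sample = Dict[str, bool]               # characteristic name -> present/absent
Label = Tuple[Tuple[str, bool], ...]   # ordered (characteristic, value) pairs

def is_pure(group: List[Sample], target: str) -> bool:
    """Criterion f: every sample in the group agrees on the sort characteristic."""
    return len({s[target] for s in group}) <= 1

def label_samples(samples: List[Sample], characteristics: List[str],
                  target: str) -> Dict[Label, List[Sample]]:
    """Split groups on each characteristic in turn until a stopping criterion holds."""
    active: Dict[Label, List[Sample]] = {(): list(samples)}  # empty label to start
    for name in characteristics:        # criterion b: stop once all are used
        next_active: Dict[Label, List[Sample]] = {}
        for label, group in active.items():
            # Criteria a and f: leave single-sample or pure groups untouched.
            if len(group) <= 1 or is_pure(group, target):
                next_active[label] = group
                continue
            for value in (True, False):
                sub = [s for s in group if s[name] == value]
                if sub:                 # empty groups are simply dropped
                    next_active[label + ((name, value),)] = sub
        active = next_active
    return active

plants = [
    {"red": True,  "many": True,  "profitable": True},
    {"red": True,  "many": False, "profitable": False},
    {"red": False, "many": True,  "profitable": False},
    {"red": False, "many": False, "profitable": False},
]
groups = label_samples(plants, ["red", "many"], "profitable")
for label, members in groups.items():
    print(label, len(members))
```

In this toy data the non-red plants already agree on the sort characteristic after the first split, so that group keeps the one-term label, while the red plants are split again on the second characteristic.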
FIG. 3 shows a labeling process in accordance with the embodiments of the present disclosure. Based on the first selected characteristic, which may take on one of the values A, B, C, or D, sorted samples 301 may be separated into four groups, one group 310 with label [A] representing those samples with characterizing feature A, one 320 with label [B] representing those samples with characterizing feature B, one 330 with label [C] representing those samples with characterizing feature C and one 340 with label [D] representing those samples with characterizing feature D. Based on a second characteristic, which may take on one of the values E or !E, samples in group 320 may be separated into two sub-groups, one group 322 with label [B, E] representing those samples having characterizing features B and E and one 324 with label [B, !E] representing those samples having characterizing feature B but not feature E. Samples in group 330 may be separated into one group 332 with label [C, E] representing those samples with characterizing features C and E and one 334 with label [C, !E] representing those samples with characterizing feature C but not feature E. Samples in group 340 may be separated into one group 342 with label [D, E] representing those samples with characterizing features D and E and one 344 with label [D, !E] representing those samples with characterizing feature D but not feature E. Samples in the groups 310, 322, and 334 have satisfied the stopping criteria, and thus, there is no further labeling process for these groups. The process for selecting a characteristic for the labeling process is described later in detail.
It is noted that the process of grouping the samples may be performed by, for example, recursive partitioning, labeling, and construction of a classification or regression tree.
DIFFERENTIATING MODEL
After the labeling process, a Differentiating Model (DM) may be created based on the labels on the sorted samples. First, all active groups (i.e., those that have not been sub-divided) are classified as "positive" or "negative". In one embodiment, to determine if a group is positive or negative, the average value for the specified sort characteristic (i.e., feature of interest) is calculated for all samples in a sub-group. If that value is above a specified threshold, then that group is considered positive, and otherwise it is considered negative. Note that the appellations of "positive" and "negative" are relative to the specified sort characteristic.
By way of example, and not by way of limitation, with a dichotomous characteristic, the threshold may be that 80% of the samples must be positive. As another example, for a continuous characteristic, the threshold may be determined by finding the value that divides the samples into two groups where the variance within the two groups is minimized.
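Both kinds of threshold can be illustrated with a short sketch. The function names and sample values below are assumptions for illustration. For the continuous case, the sketch scores each candidate cut by the size-weighted sum of within-group variances, which is one plausible reading of "the variance within the two groups is minimized"; the disclosure does not fix the exact objective.

```python
from statistics import mean, pvariance
from typing import List

def classify_group(values: List[float], threshold: float) -> str:
    """Call a group positive when its mean sort-characteristic value is above the threshold."""
    return "positive" if mean(values) > threshold else "negative"

def variance_minimizing_threshold(values: List[float]) -> float:
    """Return the cut point minimizing the size-weighted within-group variance."""
    ordered = sorted(values)
    best_cut, best_score = ordered[0], float("inf")
    for i in range(1, len(ordered)):
        low, high = ordered[:i], ordered[i:]
        score = len(low) * pvariance(low) + len(high) * pvariance(high)
        if score < best_score:
            # Place the cut midway between the two neighbouring values.
            best_cut = (ordered[i - 1] + ordered[i]) / 2
            best_score = score
    return best_cut

# Dichotomous characteristic coded 1/0, with the 80% threshold from the text.
mostly_positive = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
print(classify_group(mostly_positive, 0.8))

# Continuous characteristic: hypothetical adult heights in centimetres.
heights = [150.0, 152.0, 155.0, 180.0, 183.0, 185.0]
cut = variance_minimizing_threshold(heights)
print(cut, classify_group([180.0, 183.0], cut))
```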
Second, in one embodiment, the labels of the set of positive groups may be formed into a Boolean equation where the feature of interest may be the output term of the equation and the characterizing features may be the input terms. Specifically, the Boolean equation includes a Boolean conjunction (AND) of each term in the label (i.e., characterizing features) for each label and a Boolean disjunction (OR) of all these labels' Boolean conjunctions. As such, the DM comprises a Boolean equation in a "sum of products" form. For example, a DM's Boolean equation may be "profitable" = ("red flowers" AND "more than twelve blossoms") OR ("blue flowers" AND "is a rose"). In another embodiment, a Boolean equation could similarly be defined for the set of negative sub-groups. Note that the above process may extend to sort characteristics that have more than two types, for example "high", "medium", and "low". FIG. 4 shows an exemplary differentiating model in a form of a Boolean sum of products where the feature of interest equals Feature A OR (Feature B AND Feature E) OR (Feature C and NOT- Feature E). It is noted that if there are multiple features of interest, the process of forming the labels of the set of positive groups into a Boolean equation, in one embodiment, can be repeated for each feature, and there would be a separate Boolean equation for each feature. It is noted that there are other methods that can deal with multiple outputs simultaneously.
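The sum-of-products form lends itself to a simple data representation: a list of conjunctions, each a list of (feature, required value) pairs. The sketch below encodes the FIG. 4 example in that way. The representation, the function names, and the choice to treat a feature missing from a sample as absent are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Conjunction = List[Tuple[str, bool]]   # (feature name, required value)

# FIG. 4: feature of interest = A OR (B AND E) OR (C AND NOT-E)
FIG4_MODEL: List[Conjunction] = [
    [("A", True)],
    [("B", True), ("E", True)],
    [("C", True), ("E", False)],
]

def matches(sample: Dict[str, bool], conjunction: Conjunction) -> bool:
    """A conjunction matches when every term has its required value (AND)."""
    # Assumption: a feature missing from the sample is treated as absent.
    return all(sample.get(name, False) == required for name, required in conjunction)

def predict(sample: Dict[str, bool], model: List[Conjunction]) -> bool:
    """The model is positive when any one conjunction matches (OR)."""
    return any(matches(sample, conjunction) for conjunction in model)

print(predict({"B": True, "E": True}, FIG4_MODEL))   # matches (B AND E)
print(predict({"C": True, "E": True}, FIG4_MODEL))   # matches no conjunction
```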
In a preferred embodiment, the DM may also include additional information so that it can handle a sort characteristic that may take on continuous values. The DM may include information of the average value for the specified sort characteristic for all samples that match a label. The value is stored along with the Boolean conjunction for that label, and is part of the DM. In addition, the DM may include the confidence value for a label. The confidence value may be based on, for example, the difference between the means of two populations (or mathematically equivalently, a chi-square distribution), where one population is the set of samples that match a label's Boolean conjunction, the other population is the complementary set of samples that do not match a label's Boolean conjunction, and the mean is the mean of the specified sort characteristic for the population. The confidence value for a label may be also stored along with the Boolean conjunction for that label, and is part of the DM. It should be noted that the exact method used to calculate a confidence value depends on the type and number of the characteristics and samples, and adheres to standard statistical theory and practice. The label's average value and the label's confidence level may be used to estimate a value for the specified sort characteristic, for unsorted samples that match the label's Boolean conjunction and for which the actual value of the specified sort characteristic is unknown. The label's average value may be continuous, categorical or dichotomous.
In a preferred embodiment, the DM may be considered parameterized, where a minimum confidence level is specified, and only those labels in the DM with a confidence level greater than or equal to the specified level are active in the Boolean disjunction of the DM.
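A parameterized DM of this kind might be sketched as a list of conditions, each carrying its conjunction, its stored average value, and its confidence. Everything below is hypothetical: the `Condition` class, the numeric values, and in particular the tie-breaking rule (use the most confident matching condition), which the disclosure does not specify.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

@dataclass
class Condition:
    conjunction: List[Tuple[str, bool]]  # (feature name, required value)
    average_value: float                 # mean of the sort characteristic for matching samples
    confidence: float                    # confidence value stored with the label

def estimate(sample: Dict[str, bool], model: List[Condition],
             min_confidence: float) -> Optional[float]:
    """Estimate the sort characteristic from the active conditions only."""
    active = [c for c in model if c.confidence >= min_confidence]
    # Assumption: when several active conditions match, prefer the most confident.
    for cond in sorted(active, key=lambda c: c.confidence, reverse=True):
        if all(sample.get(name, False) == value for name, value in cond.conjunction):
            return cond.average_value
    return None  # no active condition matches

model = [
    Condition([("A", True)], average_value=0.9, confidence=0.99),
    Condition([("B", True), ("E", True)], average_value=0.7, confidence=0.90),
]
sample = {"B": True, "E": True}
print(estimate(sample, model, min_confidence=0.95))  # second condition inactive
print(estimate(sample, model, min_confidence=0.80))  # second condition active
```

Raising the minimum confidence level deactivates the weaker condition, so the same sample goes from an estimated value to no match.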
FIG. 5 shows an example of a Differentiating Model in an alternative presentation format. In FIG. 5, each box shows one condition 510, and the set of all such conditions constitutes the Differentiating Model. Each condition 510 comprises one or more features 512, all of which must be true for the condition to be matched. Matching any one of the conditions would sort a sample as positive for the specified sort characteristic. Conditions are listed in order of decreasing confidence level. Non-shaded boxes are those that are above a specified confidence threshold and are active in the sorting process. Shaded boxes are those that are below a specified confidence threshold and are inactive in the sorting process.
SELECTION
There are many methods to select the characteristics for the above labeling process. The selection may generally involve recursive partitioning, minimizing a constraint term during regression or ranking based on the statistical relationship between the characterizing feature and the features of interest. In one embodiment, the characteristics are first selected randomly or stochastically. After the complete labeling process is repeated multiple times, the DM which performs best is chosen to sort the samples. In some embodiments, the DM is comprised of all the randomly created DMs, and the average of the estimated values from the multiple DMs is used to sort the samples. In other embodiments, a subset of characteristics is pre-selected using any of a number of well known "feature selection" methods including lasso regression, elastic net regression, Relief-F, principal component analysis, multi-dimensional reduction, and others.
In some other embodiments, the characteristics are sorted according to the confidence value for each characteristic. In one example, a specified number of subsets of the samples are created. For each subset, a confidence value for each characteristic is calculated. For each characteristic, an average confidence value across all subsets is calculated. The characteristics are then sorted in decreasing order of their average confidence values. The characteristics are considered in this sorted order by the labeling process.
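The ranking procedure above can be sketched as follows. Two points are assumptions rather than part of the disclosure: the subsets are formed by simple interleaving, and the "confidence value" is replaced by a stand-in (the absolute difference in the mean sort characteristic between samples with and without the characteristic). A real implementation would substitute an appropriate statistical measure.

```python
from statistics import mean
from typing import Dict, List

Sample = Dict[str, float]   # characteristics coded 1/0, plus a numeric sort characteristic

def confidence(samples: List[Sample], name: str, target: str) -> float:
    """Stand-in confidence: |mean(target | has name) - mean(target | lacks name)|."""
    with_c = [s[target] for s in samples if s[name]]
    without_c = [s[target] for s in samples if not s[name]]
    if not with_c or not without_c:
        return 0.0          # cannot compare when one population is empty
    return abs(mean(with_c) - mean(without_c))

def rank_characteristics(samples: List[Sample], names: List[str],
                         target: str, n_subsets: int) -> List[str]:
    """Sort characteristics by decreasing average confidence across the subsets."""
    subsets = [samples[i::n_subsets] for i in range(n_subsets)]  # interleaved split
    average = {
        name: mean(confidence(subset, name, target) for subset in subsets)
        for name in names
    }
    return sorted(names, key=lambda name: average[name], reverse=True)

# Hypothetical data: "x" determines height, "y" is unrelated to it.
data = [
    {"x": (i // 2) % 2, "y": (i // 4) % 2,
     "height": 180.0 if (i // 2) % 2 else 160.0}
    for i in range(8)
]
ranking = rank_characteristics(data, ["y", "x"], "height", n_subsets=2)
print(ranking)
```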
In some embodiments, characteristics are selected by the following method. Each previously unselected characteristic in the sorted list is considered in turn. A metric is calculated for this characteristic relative to the set of active groups (i.e., those groups that are not sub-divided). The characteristic with the best metric is selected to divide the active groups. Many alternative metrics are possible, including without limitation, Gini Impurity, Entropy Gain, and P-Value. In a preferred embodiment, the metric used is the P-Value based on, for example, the difference between the means of two populations, where one population is the set of samples that have the characteristic, the other population is the complementary set of samples that do not have the characteristic, and the mean is the mean of the specified sort characteristic for the population. The exact method used to calculate a P-Value depends on the type and number of the characteristics and samples, and adheres to standard statistical theory and practice. In a preferred embodiment, there is a selection method involving a specified window size and minimum threshold. The labeling process will consider a number of characteristics equal to the specified window size. If at least one characteristic is found whose metric exceeds the specified minimum threshold, the characteristic with the best metric will be selected. Otherwise, the labeling process will continue with the next number of characteristics equal to the specified window size. If no characteristic in any window exceeds the specified minimum threshold, then the characteristic with the best metric overall will be selected.
OPTIMIZATION
The Boolean equation of the differentiating model may be optimized using a Boolean optimization method. A Boolean optimization method may find a logically equivalent condition with a minimal number of unique terms/characteristics. For example, the following Boolean conditions (1) and (2) are logically equivalent, but the condition (2) uses fewer terms.
"Positive" = (a AND b) OR (a AND not-b) (1)
"Positive" = a (2)
In one embodiment, the entire set of samples with all the characteristics may be cast as a single Boolean condition, and then a Boolean optimization method could be run to derive a logically equivalent condition with the minimum number of unique terms, if time and resources permit. In addition, it is important to make sure that the resulting "system of equations" is logically consistent. The above labeling process ensures that a consistent set of conditions will be formed because samples which are inconsistent with their active group's Boolean expression are excluded from the system of equations. FIG. 6 illustrates a method of creating a DM with a Boolean optimization method in accordance with one embodiment of the present disclosure. Specifically, a DM 610 is first created from a labeling process 620 based on the sorted samples 601. The DM 610 may be used to define a subset of the samples 630 that is logically consistent. The subset of the samples 630 may be optimized by a Boolean optimization method 640, and the resulting optimized condition becomes a new DM 650.
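The equivalence of conditions (1) and (2) above can be confirmed by exhaustively comparing truth tables, as in the sketch below. Note that this only verifies equivalence; it does not itself perform the minimization, for which an established technique such as the Quine-McCluskey algorithm could be one option. The function names are hypothetical.

```python
from itertools import product
from typing import Callable

def condition_1(a: bool, b: bool) -> bool:
    """Condition (1): (a AND b) OR (a AND not-b)."""
    return (a and b) or (a and not b)

def condition_2(a: bool, b: bool) -> bool:
    """Condition (2): a. The argument b is accepted but unused."""
    return a

def logically_equivalent(f: Callable[..., bool], g: Callable[..., bool],
                         n_inputs: int) -> bool:
    """Compare two Boolean conditions on every possible input assignment."""
    return all(
        f(*bits) == g(*bits)
        for bits in product([False, True], repeat=n_inputs)
    )

equivalent = logically_equivalent(condition_1, condition_2, 2)
print(equivalent)
```

Exhaustive comparison grows as 2^n in the number of characteristics, so it is practical only for small conditions; it is shown here purely to make the notion of logical equivalence concrete.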
Aspects of the present disclosure provide for a straightforward and logical way of sorting biological samples. By way of example, and not by way of limitation, aspects of the present disclosure may be applied to the problem of identifying particular combinations of genetic features (e.g., base pairs, alleles, single nucleotide polymorphisms (SNPs), haplotypes, and the like) as markers for specific diseases or conditions. Aspects of the present disclosure can be used to generate a Differentiating Model that identifies such combinations from samples known to have or not have a given condition or disease. Furthermore, aspects of the present disclosure may be used to screen individuals for a given disease or condition using such a differentiating model. It is noted that aspects of the present disclosure are not limited to such examples. The scope of aspects of the present disclosure may be further extended to include other applications such as drug target discovery, companion diagnostics, biomarker discovery, transcription factor discovery, epigenetic analysis, or pathway analysis. In more general terms, aspects of the present disclosure may be applied to model inference for any high dimensional application with interacting features.
The foregoing discussion discloses and describes merely exemplary methods and embodiments of the present invention. As will be understood by those familiar with the art, the invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. Accordingly, the disclosure of the present invention is intended to be illustrative, but not limiting, of the scope of the invention, which is set forth in the following claims. Additionally, in the claims that follow, the indefinite article "a", or "an" when used in claims containing an open-ended transitional phrase, such as "comprising," refers to a quantity of one or more of the item following the article, except where expressly stated otherwise. Furthermore, the later use of the word "said" or "the" to refer back to the same claim term does not change this meaning, but simply re-invokes that non-singular meaning. The appended claims are not to be interpreted as including means-plus-function limitations or step-plus-function limitations, unless such a limitation is explicitly recited in a given claim using the phrase "means for" or "step for."

Claims

What is claimed is:
1. A method of sorting biological samples, comprising:
obtaining one or more biological samples, each sample comprising a set of one or more characterizing features and a set of one or more features of interest, wherein one or more of the samples have known values for the set of features of interest, and one or more of the biological samples have unknown values for the set of features of interest;
deriving a subset of the characterizing features, wherein the subset predicts a predicted value of the one or more features of interest; and
sorting the biological samples that have unknown values for the set of features of interest based on the predicted value of the subset.
2. The method of claim 1, further comprising providing a predicted value for each of the samples that have unknown values for the set of features of interest.
3. The method of claim 1, wherein the characterizing features include DNA base pairs, RNA expressions, copy number variations, transcription factors, amino acids, and/or genes.
4. The method of claim 1, wherein the characterizing features include results from clinical tests, reaction to one or more medical treatments or medical substances, one or more demographic characteristics including age, sex, ethnicity, nationality, occupation, or geographic location, or any combination thereof.
5. The method of claim 1, wherein the features of interest include disease states, one or more symptoms of disease or health, reaction to a medical treatment or medical substance, or any combination thereof.
6. The method of claim 1, wherein the predicted value is continuous, categorical, or dichotomous.
7. The method of claim 1, further comprising storing the subset in a reusable format.
8. The method of claim 1, wherein the subset of the characterizing features may be characterized by interactions between the characterizing features in the subset.
9. The method of claim 1, wherein deriving a subset includes deriving a subset of interactions between the characterizing features in the subset.
10. The method of claim 9, wherein deriving a subset comprises:
(a) establishing a statistical relationship between each said characterizing feature and the one or more features of interest; and
(b) selecting a subset of the characterizing features and a subset of interactions between the characterizing features in the subset, wherein the selection of features and interactions between the characterizing features in the subset minimizes errors on the predicted value and a size of the said subset.
11. The method of claim 10, further comprising optimizing the subset to simplify the interactions between the characterizing features in the subset.
12. The method of claim 10, wherein the statistical relationship is based on a difference between means of two populations, a confidence value of two groups of samples, regression coefficients, or multiple random sampling.
13. The method of claim 10, wherein selecting the subset involves ranking the characterizing features based on the statistical relationship, recursive partitioning, or minimizing a constraint term during regression.
14. The method of claim 11, wherein optimizing the subset involves Boolean logic minimization.
15. The method of claim 1, wherein deriving a subset comprises Boolean logic minimization, wherein
one of the features of interest is an output term of a Boolean equation,
at least one of the characterizing features is an input term of the Boolean equation,
the Boolean equation is formed from the conjunction of the input terms,
the Boolean equation is formed for each of the biological samples that have the known values for the feature of interest, and
the Boolean equations are combined to form a consistent system of Boolean equations.
16. The method of claim 1, wherein the steps of obtaining samples, deriving a subset, and sorting the samples are separate in time.
17. The method of claim 1, wherein obtaining the samples, deriving the subset, and sorting the samples are performed by multiple agents.
18. A system to sort biological samples, comprising:
one or more processors configured to receive a digitized representation of one or more biological samples, wherein each of the biological samples comprises a set of one or more characterizing features and a set of one or more features of interest, and wherein one or more of the biological samples have known values for the set of features of interest, and one or more of the biological samples have unknown values of the set of features of interest;
wherein the one or more processors are further configured to derive a subset of the characterizing features, wherein the subset predicts a value of the features of interest; and
wherein the one or more processors are further configured to sort the biological samples that have unknown values for the set of features of interest based on the predicted value of the subset.
19. The system of claim 18, wherein the subset of the characterizing features may be characterized by interactions between the characterizing features in the subset.
20. The system of claim 18, wherein deriving the subset of the characterizing features includes deriving a subset of interactions between the characterizing features in the subset.
21. The system of claim 18, wherein the system is embedded in a special purpose device that is configured to obtain the digitized representation from physical biological specimens.
22. The system of claim 18, wherein one or more processors are configured to receive the digitized representation via the Internet.
23. The system of claim 18, wherein the one or more processors are configured to receive the digitized representation from a database.
24. The system of claim 18, wherein the one or more processors are configured to receive the digitized representation as a computer file.
25. The system of claim 18, wherein the one or more processors are part of a standalone computer.
26. The system of claim 18, wherein the one or more processors are a cluster of standalone computers.
27. The system of claim 18, wherein the one or more processors include one or more Field Programmable Gate Arrays or one or more Application Specific Integrated Circuits or a combination thereof.
28. The system of claim 18, wherein the subset is stored in a reusable digitized format such as a computer file or a file system of a database management system.
29. The system of claim 18, wherein the one or more processors include two or more processors at separate locations, wherein the two or more processors are configured such that two or more of receiving the digitized representation of one or more biological samples, deriving the subset of characterizing features, or sorting the biological samples occur at separate locations.
30. A system to sort biological samples, comprising:
one or more processors configured to receive one or more biological samples in a digitized representation, wherein each of the biological samples comprises a set of one or more characterizing features, and a set of one or more features of interest, and wherein one or more of the biological samples have known values for the set of features of interest; and
wherein the one or more processors are further configured to derive a subset of the characterizing features, wherein the subset predicts a value of the one or more features of interest.
31. The system of claim 30, wherein the subset of the characterizing features may be characterized by interactions between the characterizing features in the subset.
32. The system of claim 30, wherein deriving the subset of the characterizing features includes deriving a subset of interactions between the characterizing features in the subset.
33. The system of claim 30, wherein the subset is stored in a reusable digitized format such as a computer file or a file system of a database management system.
34. A system to sort biological samples, comprising:
a processor configured to receive one or more biological samples in a digitized representation, wherein each of the biological samples comprises a set of one or more characterizing features and a set of one or more features of interest, and wherein one or more of the biological samples have unknown values for the set of features of interest; and
wherein the processor is configured to sort the biological samples that have unknown values for the set of features of interest, wherein the sorting is based on a predicted value from a subset of the characterizing features.
35. The system of claim 34, wherein the sorting is based on a predicted value from a subset of the characterizing features and interactions between the characterizing features in the subset.
36. The system of claim 34, wherein the processor is configured to derive the subset of the characterizing features.
37. The system of claim 34, wherein deriving the subset of the characterizing features includes deriving a subset of interactions between the characterizing features in the subset.
38. The system of claim 34, wherein the subset is stored in a reusable digitized format such as a computer file or a file system of a database management system.
PCT/US2013/041015 2012-05-16 2013-05-14 Method and system for sorting biological samples WO2013173384A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201261647959P 2012-05-16 2012-05-16
US61/647,959 2012-05-16

Publications (1)

Publication Number Publication Date
WO2013173384A1 (en) 2013-11-21

Family

ID=49584225

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2013/041015 WO2013173384A1 (en) 2012-05-16 2013-05-14 Method and system for sorting biological samples

Country Status (1)

Country Link
WO (1) WO2013173384A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106649080A (en) * 2015-11-04 2017-05-10 神讯电脑(昆山)有限公司 Automatic synchronization system and method for testing document

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008037479A1 (en) * 2006-09-28 2008-04-03 Private Universität Für Gesundheitswissenschaften Medizinische Informatik Und Technik - Umit Feature selection on proteomic data for identifying biomarker candidates
EP1963849A2 (en) * 2005-11-14 2008-09-03 Bayer Healthcare, LLC Methods for prediction and prognosis of cancer, and monitoring cancer therapy
WO2009099379A1 (en) * 2008-02-08 2009-08-13 Phadia Ab Method, computer program product and system for enabling clinical decision support

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 13790969; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 13790969; Country of ref document: EP; Kind code of ref document: A1)