CN110800063B - Detection of tumor-associated variants using cell-free DNA fragment size - Google Patents

Detection of tumor-associated variants using cell-free DNA fragment size

Info

Publication number
CN110800063B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201880041466.2A
Other languages
Chinese (zh)
Other versions
CN110800063A (en)
Inventor
姜婷婷
赵晨
庄涵宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Illumina Inc
Original Assignee
Illumina Inc

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • C CHEMISTRY; METALLURGY
    • C40 COMBINATORIAL TECHNOLOGY
    • C40B COMBINATORIAL CHEMISTRY; LIBRARIES, e.g. CHEMICAL LIBRARIES
    • C40B40/00 Libraries per se, e.g. arrays, mixtures
    • C40B40/04 Libraries containing only organic compounds
    • C40B40/06 Libraries containing nucleotides or polynucleotides, or derivatives thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/12 Computing arrangements based on biological models using genetic models
    • G06N3/126 Evolutionary algorithms, e.g. genetic algorithms or genetic programming
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B35/00 ICT specially adapted for in silico combinatorial libraries of nucleic acids, proteins or peptides
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/60 In silico combinatorial chemistry

Abstract

Methods and systems for determining variants of interest by analyzing the size and sequence of cfDNA fragments obtained from a test sample are provided. The methods and systems provided herein enable methods that synergistically combine size and sequence information, thereby improving the specificity and sensitivity of the assay compared to conventional methods.

Description

Detection of tumor-associated variants using cell-free DNA fragment size
Citation of related application
The present application claims the benefit under 35 U.S.C. § 119(e) of U.S. Provisional Patent Application No. 62/488,549, entitled "Detection of tumor-related variants using cell-free DNA fragment size," filed on 21, 2017, which is incorporated herein by reference in its entirety for all purposes.
Background
The advent of technologies that can sequence an entire genome in a relatively short period of time, together with the discovery of circulating cell-free DNA (cfDNA), provides an opportunity to analyze genetic material without the risks associated with invasive sampling methods.
A small amount of circulating tumor DNA (ctDNA) derived from a tumor, in particular a malignancy (or cancer), can be found in cfDNA of the blood of a cancer patient. ctDNA can be identified by sequencing cfDNA to detect sequence variants known to be specifically associated with various types of tumors. The detection method for determining ctDNA in blood for diagnosis or stratification of disease is also called liquid biopsy. Limitations of existing methods for diagnosing cancer in liquid biopsies include insufficient sensitivity due to limited ctDNA levels, and technical sequencing bias due to the inherent nature of genomic information. These limitations have resulted in a continuing need for methods of improving specificity, sensitivity, and applicability to reliably analyze variants associated with cancer in a variety of clinical settings.
Summary of The Invention
On average, ctDNA fragments are shorter than cfDNA fragments derived from cells not affected by cancer. Some embodiments disclosed herein use this difference between unaffected cfDNA and ctDNA to detect the presence of (or to call) a cancer-related variant. The present application provides methods and systems for liquid biopsies that combine cfDNA size information with sequence information, enabling high analytical sensitivity and specificity in calling tumor-related variants and detecting cancer. Because the various methods and systems provided herein implement algorithms and processes that synergistically combine size and sequence information, these embodiments achieve improved analytical sensitivity and specificity compared to conventional methods that use only sequence information or only size information, and overcome certain limitations of liquid biopsies for diagnosing cancer.
One aspect of the application relates to a method for determining the presence or copy number of a tumor-associated gene sequence variant in a test sample by analyzing the size and sequence of cfDNA fragments obtained from the test sample. In some embodiments, the test sample may be peripheral blood, saliva, urine, and other biological fluids, as described below.
In some embodiments, the methods are implemented on a computer system comprising one or more processors and system memory to determine the presence or copy number of a tumor-associated gene sequence variant in a test sample comprising cell-free nucleic acid fragments derived from tumor cells.
Some embodiments provide methods for detecting a simple nucleotide variant associated with a tumor in a test sample comprising a cell-free nucleic acid fragment. The method comprises the following steps: (a) Enriching cfDNA fragments having sequences corresponding to one or more selected genomic regions in which simple nucleotide variants associated with a tumor are located; (b) Preparing a library with cfDNA fragments extracted from a sample, wherein the library retains fragment lengths of cfDNA fragments; (c) Sequencing cfDNA fragments to obtain the sequence and size of cfDNA fragments; and (d) using the sequence and size of the cfDNA fragments to generate a determination that a tumor variant is present in the cfDNA fragments.
In some embodiments, the method comprises: (a) retrieving, by one or more processors, sequence reads and fragment sizes of cfDNA fragments obtained from a test sample; (b) assigning, by the one or more processors, the cfDNA fragments to a plurality of bins representing different fragment sizes; and (c) determining, by the one or more processors and using the sequence reads, the allele frequency of the variant of interest in a set of priority bins selected from the plurality of bins, wherein the set of priority bins is selected to: (i) limit the probability that the number of variants of interest in the set of priority bins is below a detection limit, and (ii) increase the probability that the number of variants of interest in the set of priority bins is higher than in all bins of the plurality of bins.
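Steps (b) and (c) can be illustrated with a minimal sketch. It assumes fixed-width size bins of 5 bp and represents each fragment as a hypothetical `(size, supports_variant)` pair; the function names and data layout are illustrative assumptions, not part of the application.

```python
BIN_WIDTH = 5  # assumed bin width in bp (illustrative choice)


def bin_index(fragment_size, bin_width=BIN_WIDTH):
    """Map a fragment size in bp to the index of its size bin."""
    return fragment_size // bin_width


def allele_frequency(fragments, priority_bins, bin_width=BIN_WIDTH):
    """Fraction of fragments in the priority bins that support the variant.

    fragments: iterable of (fragment_size_bp, supports_variant) tuples.
    priority_bins: set of bin indices to include.
    Returns None when no fragment falls into the priority bins.
    """
    total = 0
    variant = 0
    for size, supports_variant in fragments:
        if bin_index(size, bin_width) in priority_bins:
            total += 1
            variant += int(supports_variant)
    return variant / total if total else None


# Hypothetical fragments covering one locus.
fragments = [(142, True), (147, False), (144, False),
             (166, False), (168, False), (171, False)]
short_bins = {28, 29}          # sizes 140-149 bp
all_bins = {28, 29, 33, 34}    # every bin that occurs above
af_short = allele_frequency(fragments, short_bins)
af_all = allele_frequency(fragments, all_bins)
```

In this toy input the variant-supporting fragment is short, so restricting the calculation to the short bins raises the observed allele frequency relative to using every bin.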
In some embodiments, the test sample is a plasma sample. In some embodiments, the variant of interest is known or suspected to be associated with cancer. In some embodiments, the variant of interest is known or suspected to be associated with a genetic disease.
In some embodiments, the method further comprises comparing the allele frequencies of the variants of interest in the set of priority bins to a standard, and determining the variants of interest in the test sample based on the comparison. In some embodiments, the detection limit of the method is about 0.05% to 0.2%.
In some embodiments, the set of priority bins is selected by: providing a plurality of candidate sets, each candidate set comprising non-uniform bins from a plurality of bins; for each candidate set, calculating a first probability that the allele frequency of the variant of interest in the bin of the candidate set is below the detection limit in a modeled sample, wherein the modeled sample comprises cfDNA derived from cells with the variant of interest and cfDNA derived from cells without the variant of interest; for each candidate set, calculating a second probability that the allele frequency of the variant of interest in the bin of the candidate set in the modeled sample is higher than the allele frequency of the variant of interest in the plurality of bins in the modeled sample; and selecting a candidate set as a priority set based on the first probability and the second probability. Each candidate set comprising non-uniform bins from multiple bins means that each candidate set has a different bin than the bins of the other candidate sets.
In some embodiments, the priority set is the candidate set with the largest second probability among those candidate sets whose first probability does not exceed a criterion.
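The selection rule can be sketched as a constrained maximum. This sketch assumes the two probabilities have already been computed for each candidate set; the tuple layout, function name, and numbers are hypothetical.

```python
def select_priority_set(candidates, criterion):
    """Pick the candidate set with the largest second probability among
    those whose first probability does not exceed the criterion.

    candidates: list of (bins, first_probability, second_probability).
    Returns the chosen bins, or None if no candidate meets the criterion.
    """
    eligible = [c for c in candidates if c[1] <= criterion]
    if not eligible:
        return None
    return max(eligible, key=lambda c: c[2])[0]


# Hypothetical probabilities for three nested candidate sets.
candidates = [
    (("b1",), 0.30, 0.95),
    (("b1", "b2"), 0.04, 0.80),
    (("b1", "b2", "b3"), 0.01, 0.60),
]
chosen = select_priority_set(candidates, criterion=0.05)
```

The smallest set has the best second probability here but fails the criterion on the first probability, so the two-bin set is chosen.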
In some embodiments, the plurality of candidate sets is obtained by a greedy method (greedy process). In some embodiments, the greedy method comprises: obtaining sequence reads and fragment sizes of cfDNA fragments obtained from one or more unaffected training samples known to be unaffected by the disorder of interest and one or more affected training samples known to be affected by the disorder of interest; assigning the cfDNA fragments obtained from the one or more unaffected training samples to a plurality of bins based on size; assigning the cfDNA fragments obtained from the one or more affected training samples to the plurality of bins based on size; ranking each bin of the plurality of bins based on the ratio of the frequency of fragments from the one or more affected training samples to the frequency of fragments from the one or more unaffected training samples; selecting the highest-ranked bin as a candidate set; adding the next-highest-ranked bin to the most recent candidate set to provide a next candidate set; and repeating the last step until all bins of the plurality of bins have been added, each repetition providing one candidate set.
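The greedy construction of candidate sets described above can be sketched as follows. The inputs are plain lists of fragment sizes from affected and unaffected training samples. The 5 bp bin width, the function name, and the handling of bins with no unaffected fragments (ranked first) are assumptions made for illustration.

```python
def greedy_candidate_sets(affected_sizes, unaffected_sizes, bin_width=5):
    """Return nested candidate sets of bin indices, best-ranked bin first."""

    def bin_frequencies(sizes):
        # Fraction of fragments that fall into each size bin.
        counts = {}
        for size in sizes:
            b = size // bin_width
            counts[b] = counts.get(b, 0) + 1
        n = len(sizes)
        return {b: c / n for b, c in counts.items()}

    alpha = bin_frequencies(affected_sizes)    # affected samples
    beta = bin_frequencies(unaffected_sizes)   # unaffected samples
    bins = set(alpha) | set(beta)

    def ratio(b):
        a = alpha.get(b, 0.0)
        u = beta.get(b, 0.0)
        if u == 0.0:
            # Assumption: a bin seen only in affected samples ranks highest.
            return float("inf") if a > 0 else 0.0
        return a / u

    # Rank by affected/unaffected ratio, ties broken by bin index.
    ranked = sorted(bins, key=lambda b: (-ratio(b), b))
    # The k-th candidate set holds the k highest-ranked bins.
    return [ranked[:k] for k in range(1, len(ranked) + 1)]


candidate_sets = greedy_candidate_sets(
    affected_sizes=[140, 141, 142, 145, 165],
    unaffected_sizes=[141, 166, 167, 168, 170],
)
```

Each candidate set extends the previous one by a single bin, so a plurality of N bins yields N candidate sets.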
In some embodiments, the disorder of interest comprises one or more cancers. In some embodiments, the disorder of interest comprises cancer associated with a variant of interest. In some embodiments, the affected training sample comprises cancerous tissue and the unaffected training sample comprises non-cancerous tissue.
In some embodiments, the allele frequency of the variant of interest in the bins of a candidate set in the modeled sample is estimated as:
where $AF(L_{b_1,b_2,\ldots,b_k})$ is the allele frequency in bins $L_{b_1}, L_{b_2}, \ldots, L_{b_k}$; $N_{mut}(L_{b_1,b_2,\ldots,b_k})$ is the number of variant-of-interest sequences in bins $L_{b_1}, L_{b_2}, \ldots, L_{b_k}$; $DP$ is the sequencing depth; $f_{tumor}$ is the fraction of cfDNA derived from cells with the variant of interest; $\alpha(L_{b_i})$ is the density of fragments in bin $L_{b_i}$ in the fragment length distribution of affected samples known to be affected by the one or more conditions of interest; and $\beta(L_{b_i})$ is the density of fragments in bin $L_{b_i}$ in the fragment length distribution of unaffected samples known to be unaffected by the condition of interest.
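The estimating equation itself is not reproduced in this text. One form consistent with the symbol definitions, given here as an assumption rather than as the application's own formula, divides the variant count by the expected number of fragments, tumor-derived and non-tumor-derived, that fall into the selected bins:

```latex
AF(L_{b_1,b_2,\ldots,b_k}) =
  \frac{N_{mut}(L_{b_1,b_2,\ldots,b_k})}
       {DP \cdot \left[ f_{tumor} \sum_{i=1}^{k} \alpha(L_{b_i})
                      + (1 - f_{tumor}) \sum_{i=1}^{k} \beta(L_{b_i}) \right]}
```

When every bin is included, both sums equal 1 and the denominator reduces to $DP$, so the expression becomes the ordinary allele frequency over all fragments.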
In some embodiments, the cell having the variant of interest is a cancer cell, and the modeling sample comprises a plasma sample comprising cfDNA from a cancer cell and cfDNA from a non-cancer cell.
In certain embodiments, the count of variants of interest in bins $L_{b_1}, L_{b_2}, \ldots, L_{b_k}$ is modeled as a binomial distribution:
where $AF_{tumor}$ is the allele frequency of the variant of interest in the tissue with the variant of interest.
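The binomial expression is likewise missing from this text. A parameterization consistent with the surrounding definitions, offered as an assumption, treats each tumor-derived fragment in the selected bins as an independent trial that carries the variant with probability $AF_{tumor}$:

```latex
N_{mut}(L_{b_1,b_2,\ldots,b_k}) \sim
  \mathrm{Binomial}\!\left( DP \cdot f_{tumor} \sum_{i=1}^{k} \alpha(L_{b_i}),\; AF_{tumor} \right)
```

Under this assumption the expected count over all bins is $DP \cdot f_{tumor} \cdot AF_{tumor}$, so the expected allele frequency in the whole sample is $f_{tumor} \cdot AF_{tumor}$, which agrees with the stated relation between the plasma and tumor allele frequencies.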
In certain embodiments, $AF_{tumor}$ is calculated as:
$AF_{tumor} = AF_{plasma} / f_{tumor}$
where $AF_{plasma}$ is the allele frequency of the variant of interest in the modeled sample.
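The first and second probabilities for a candidate set could be estimated by simulating the modeled sample. The sketch below is a Monte Carlo illustration under stated assumptions. Per-bin variant counts are drawn as Binomial(tumor fragments in bin, AF_tumor). Fragment counts per bin are the rounded expectations `DP * f_tumor * alpha` and `DP * (1 - f_tumor) * beta`. The function name, parameters, and input values are hypothetical and do not come from the application.

```python
import random


def estimate_probabilities(candidate, alpha, beta, f_tumor, af_tumor,
                           depth, lod, trials=500, seed=0):
    """Monte Carlo estimate of (first probability, second probability).

    candidate: iterable of bin labels forming the candidate set.
    alpha, beta: dicts mapping bin label -> fragment density in affected
        and unaffected samples (each assumed to sum to 1 over all bins).
    lod: detection limit expressed as an allele frequency.
    """
    rng = random.Random(seed)
    bins = sorted(set(alpha) | set(beta))
    candidate = [b for b in candidate if b in bins]
    # Expected fragment counts per bin at the locus of interest.
    n_tumor = {b: round(depth * f_tumor * alpha.get(b, 0.0)) for b in bins}
    n_normal = {b: round(depth * (1 - f_tumor) * beta.get(b, 0.0))
                for b in bins}
    cov_set = sum(n_tumor[b] + n_normal[b] for b in candidate)
    cov_all = sum(n_tumor[b] + n_normal[b] for b in bins)

    below_lod = 0
    above_all = 0
    for _ in range(trials):
        # Variant-supporting fragments per bin: Binomial(n_tumor, af_tumor).
        mut = {b: sum(rng.random() < af_tumor for _ in range(n_tumor[b]))
               for b in bins}
        af_set = sum(mut[b] for b in candidate) / cov_set if cov_set else 0.0
        af_all = sum(mut.values()) / cov_all if cov_all else 0.0
        below_lod += af_set < lod   # event counted by the first probability
        above_all += af_set > af_all  # event counted by the second probability
    return below_lod / trials, above_all / trials


# Hypothetical densities: bin 28 is enriched in tumor-derived fragments.
alpha = {28: 0.6, 33: 0.4}
beta = {28: 0.2, 33: 0.8}
p1, p2 = estimate_probabilities([28], alpha, beta, f_tumor=0.1,
                                af_tumor=0.5, depth=2000, lod=0.001)
p1_all, p2_all = estimate_probabilities([28, 33], alpha, beta, f_tumor=0.1,
                                        af_tumor=0.5, depth=2000, lod=0.001)
```

With these inputs the tumor-enriched bin almost always shows a higher allele frequency than the full set of bins, while the full set can never exceed itself, so its second probability is zero.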
In some embodiments, the method further comprises removing one or more bins not containing variant sequences of interest from the priority set after selecting the candidate set as the priority set.
In some embodiments, the variant of interest comprises a Simple Nucleotide Variant (SNV). In some embodiments, the SNV is a single nucleotide variant, a phased sequence variant (phased sequential variant), or a small indel (small indel).
In some embodiments, the sequence reads are double-ended reads, and the size of the cfDNA fragments is obtained from a read pair.
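As a minimal sketch of how a read pair yields a fragment size, assume both reads are aligned to the same reference sequence and that alignment coordinates are 1-based and inclusive. The function name and coordinate convention are illustrative assumptions, not taken from the application.

```python
def fragment_size(read1_start, read1_end, read2_start, read2_end):
    """Fragment length in bp spanned by the two reads of a pair.

    All coordinates are 1-based, inclusive positions on the same reference.
    The fragment runs from the leftmost aligned base of either read to the
    rightmost aligned base of either read.
    """
    leftmost = min(read1_start, read2_start)
    rightmost = max(read1_end, read2_end)
    return rightmost - leftmost + 1


# Hypothetical pair: a forward read at 100-149 and its mate at 217-266.
size = fragment_size(100, 149, 217, 266)
```

The size is derived from the outer coordinates only, so it does not depend on read length or on whether the two reads overlap.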
In some embodiments, cfDNA fragments obtained from the sample have been enriched.
In some embodiments, the method further comprises, prior to (a), extracting cfDNA fragments from the test sample.
In some embodiments, the cfDNA fragments comprise circulating tumor DNA (ctDNA) fragments.
Another aspect of the application provides a method of analyzing cell-free DNA (cfDNA) to determine a variant of interest, the method comprising: (a) obtaining sequence reads and fragment sizes of cfDNA fragments obtained from a test sample; (b) assigning the cfDNA fragments to a plurality of bins representing different fragment sizes based on their sizes; and (c) determining, using the sequence reads, the allele frequency of the variant of interest in a set of priority bins selected from the plurality of bins, wherein the set of priority bins is selected by: (i) providing a plurality of candidate sets, each candidate set comprising non-uniform bins from the plurality of bins; (ii) for each candidate set, calculating a second probability that the allele frequency of the variant of interest in the bins of the candidate set is higher than the allele frequency of the variant of interest in the plurality of bins in a modeled sample, wherein the modeled sample comprises a tissue having the variant of interest and a tissue having a wild-type sequence of the variant of interest; and (iii) selecting the candidate set having the maximum value of the second probability.
In some embodiments, the method further comprises, prior to (iii) and for each candidate set, calculating a first probability that the allele frequency of the variant of interest in the bin of the candidate set in the modeled sample does not exceed the detection limit, wherein (iii) comprises selecting a candidate set having a maximum value of the second probability among the candidate sets for which the value of the first probability does not exceed the criterion.
Embodiments of the present application also provide a computer program product comprising a non-transitory computer readable medium having program instructions provided thereon for performing the operations described herein and other computing operations.
Some embodiments provide a system for assessing the copy number of a nucleic acid sequence of interest in a test sample. The system comprises: a sequencer for receiving nucleic acids from the test sample and providing nucleic acid sequence information from the sample; a processor; and one or more computer-readable storage media having instructions stored thereon for execution on a processor to evaluate the copy number in the test sample using the methods described herein.
Although the examples herein refer to humans, and the language is primarily directed to humans, the concepts described herein are applicable to genomes from any plant or animal. These and other objects and features of the present application will become more fully apparent from the following description and appended claims, or may be learned by the practice of the application as set forth hereinafter.
Incorporation by reference
All patents, patent applications, and other publications, including all sequences disclosed in these references, cited herein are expressly incorporated by reference to the same extent as if each individual publication, patent, or patent application was specifically and individually indicated to be incorporated by reference. All cited documents are incorporated by reference in their entirety in relevant part for the purposes indicated in the context of the cited documents herein. However, citation of any document is not to be construed as an admission that it is prior art with respect to the present application.
Brief Description of Drawings
FIG. 1A schematically illustrates how double-ended sequencing can be used to determine fragment size and sequence.
Fig. 1B shows density plots of empirical fragment-size data for reads supporting tumor variants (dark grey) and reads supporting non-tumor variants or the reference sequence (light grey).
FIG. 1C shows allele frequencies for shorter cfDNA fragments (shorter than or equal to 150bp, dark grey bars) and longer cfDNA fragments (longer than 150bp, light grey bars).
FIG. 2 shows a flow chart illustrating a process of preparing a sample and analyzing cfDNA fragments extracted from the sample, using both fragment size and sequence information to determine variants of interest.
FIG. 3A shows a flow chart illustrating a process of determining variants of interest using sequence information and size information of cfDNA fragments.
Fig. 3B shows a flow chart illustrating a greedy method for obtaining multiple candidate bin sets.
Fig. 3C illustrates how data for normal cfDNA and tumor-derived DNA can be combined to model samples, such as plasma samples that include normal and tumor-associated cfDNA.
Fig. 3D shows a flow chart illustrating a process for selecting a set of priority bins from a plurality of candidate sets.
Fig. 3E shows the frequency length distribution of a normal sample and the frequency length distribution of a tumor sample, and how probability data is obtained from the distributions.
Fig. 3F shows probability data for multiple candidate sets.
FIG. 4 illustrates a typical computer system, according to certain embodiments.
FIG. 5 is a block diagram of a distributed system for processing test samples and making a diagnosis.
Fig. 6 schematically illustrates how different operations in processing a test sample can be grouped for handling by different elements of the system.
Figures 7A-7D show allele frequencies for variants of interest using different sets of fragment size bins, one for each of the four cases.
Fig. 8 shows the fragment length distributions of cfDNA from tumor cells and from normal cells.
Fig. 9 shows a histogram of the distribution of cfDNA fragments into bins each spanning 5 nucleotides in fragment length.
Figure 10 shows fold change data for groups with different levels of original allele frequency for 32 true positive mutations.
Detailed Description
Definitions
Unless otherwise indicated, practice of the methods and systems disclosed herein involves conventional techniques and equipment commonly used in the fields of molecular biology, microbiology, protein purification, protein engineering, protein and DNA sequencing, and recombinant DNA, which are within the skill of the art. Such techniques and equipment are known to those skilled in the art and are described in many textbooks and references (see, e.g., Sambrook et al., "Molecular Cloning: A Laboratory Manual," Third Edition (Cold Spring Harbor), [2001]; Ausubel et al., "Current Protocols in Molecular Biology" [1987]).
Numerical ranges are inclusive of the numbers defining the range. Each maximum numerical limitation given throughout this specification is intended to include each lower numerical limitation as if such lower numerical limitations were expressly written herein. Every minimum numerical limitation given throughout this specification will include every higher numerical limitation, as if such higher numerical limitations were expressly written herein. Every numerical range given throughout this specification will include every narrower numerical range that falls within such broader numerical range, as if such narrower numerical ranges were all expressly written herein.
The headings provided herein are not intended to limit the application.
Unless defined otherwise herein, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art. Various scientific dictionaries containing the terms included herein are well known and available to those of skill in the art. Although any methods and materials similar or equivalent to those described herein can be used in the practice or testing of the embodiments disclosed herein, some methods and materials are described.
Terms defined directly below will be more fully described by reference to the specification in its entirety. It is to be understood that this application is not limited to the particular methodology, protocols, and reagents described, as these may vary depending upon the context in which the methodology, protocols, and reagents are used by those skilled in the art. As used herein, the singular terms "a," "an," and "the" include plural referents unless the context clearly dictates otherwise.
Unless otherwise indicated, nucleic acids are written in a 5 'to 3' direction from left to right, and amino acid sequences are written in an amino to carboxyl direction from left to right.
As used herein, a sequence of interest refers to a nucleic acid sequence in the genome of an organism (e.g., a human). In some embodiments, the sequence of interest is a gene, a SNP, an exon, a regulatory sequence of a gene, or the like. In some embodiments, the sequence of interest is a chromosomal or sub-chromosomal region.
A variant of interest is a specific variant of a gene sequence that is to be measured, characterized, quantified, or detected. In some embodiments, the variant of interest is a variant known or suspected to be associated with a disorder (e.g., cancer, tumor, or genetic disease).
A gene is a locus (or region) of DNA that is made up of nucleotides and is the molecular unit of heredity.
A gene may acquire mutations in its sequence, resulting in different variants, i.e., alleles, in the population. The proteins encoded by these alleles differ slightly, resulting in different phenotypic traits.
Allele frequency or gene frequency is the frequency of one allele of a gene (or gene variant) relative to the other alleles of that gene, and can be expressed as a fraction or percentage. Allele frequencies are typically associated with a particular locus because genes are typically located at one or more loci. However, allele frequencies as used herein may also be associated with size-based bins of DNA fragments. In this sense, DNA fragments containing an allele (such as cfDNA fragments) are assigned to different size-based bins, and the allele frequency in a size-based bin is the frequency of that allele relative to the frequency of the other alleles in the bin. In some embodiments, the frequency of an allele or variant is the proportion of reads that support the variant call among all reads in multiple bins (e.g., a set of priority bins).
The term "parameter" herein refers to a numerical value that characterizes a property of a system, such as a physical feature whose value or other characteristic affects a relevant condition, for example a sample or DNA fragments having a simple nucleotide variant or copy number variant. In some cases, the term parameter refers to a variable that affects the output of a mathematical relationship or model, which may be an independent variable (i.e., an input to the model) or an intermediate variable based on one or more independent variables. Depending on the scope of a model, the output of one model may become an input to another model and thereby become a parameter of that other model.
The term "fragment size parameter" refers to a parameter related to the size or length of a fragment or collection of fragments (e.g., nucleic acid fragments, such as cfDNA fragments obtained from body fluids). When a genome produces nucleic acid fragments of a size or size range at a higher concentration than fragments from another genome or from another portion of the same genome, that fragment size or size range may be characteristic of an abnormal genome or portion thereof. Various embodiments disclosed herein provide methods of combining size information with sequence information to determine simple nucleotide variants. In addition, the abundance of sequences can also be combined with size information to determine structural or copy number variations. The various embodiments combine fragment size information and sequence information in a manner that is more effective than simply accumulating the two kinds of information or choosing one over the other, thereby providing improved performance over conventional detection methods for detecting cancer variants with low variant frequencies.
A "simple nucleotide variant" or "SNV" is a genetic variant that differs from a reference sequence by one or more nucleotides in a relatively short gene sequence. SNVs include single nucleotide variants, phased sequence variants, and small insertions and deletions (indels). SNVs differ from structural variants and copy number variants: structural variants include chromosomal structural rearrangements such as large indels, duplications, and inversions, and copy number variants include abnormal copy numbers of normally diploid regions of the genome. Certain SNVs known or suspected to be associated with a tumor (also referred to as tumor SNVs) are the analytical targets in various embodiments.
The term "fragment that may comprise a variant" is used herein to refer to a fragment that is identified as a cfDNA fragment suspected of having a sequence mutation corresponding to a cancer variant. In various embodiments, a cfDNA fragment is identified as a fragment likely to contain a variant if it is determined that the sequence reads provided by the cfDNA fragment contain sequences of known cancer variants and the genomic coordinates of the sequence reads match the cancer variants. Since sequencing and other processing sometimes introduce errors, it is not certain that the sequence of fragments exhibiting cancer mutations corresponds truly to fragments derived from cancer cells. It is possible that the actual reads from the sequences of the fragments comprising the cancer variants are due to sequencing errors and not true somatic mutations.
The term "copy number variation" or "CNV" herein refers to a copy number variation of a nucleic acid sequence present in a test sample as compared to the copy number of the nucleic acid sequence present in a reference sample. In certain embodiments, the nucleic acid sequence is 1kb or greater. In some cases, the nucleic acid sequence is the entire chromosome or a significant portion thereof. "copy number variant" refers to a nucleic acid sequence that is found to have a copy number difference by comparing the nucleic acid sequence of interest in a test sample to the expected level of the nucleic acid sequence of interest. For example, the level of the nucleic acid sequence of interest in the test sample is compared to the level of the nucleic acid sequence of interest in a qualified sample. Copy number variants/changes include deletions (including microdeletions), insertions (including microinsertions), duplications, multiplications, and translocations. CNVs encompass chromosomal aneuploidy and partial aneuploidy.
The term "aneuploidy" herein refers to an imbalance of genetic material caused by loss or increase of whole or partial chromosomes.
The term "plurality" refers to more than one element. For example, the term as used herein may refer to a number of nucleic acid molecules or sequence tags sufficient to identify significant differences in SNV or CNV between test samples and qualified samples using the methods disclosed herein. In some embodiments, at least about 3 × 10^6 sequence tags of about 20 to 40 bp are obtained for each test sample. In some embodiments, each test sample provides data for at least about 5 × 10^6, 8 × 10^6, 10 × 10^6, 15 × 10^6, 20 × 10^6, 30 × 10^6, 40 × 10^6, or 50 × 10^6 sequence tags, each sequence tag comprising about 20 to 40 bp.
The term "double-ended read" refers to a read from double-ended sequencing, one read from each end of a nucleic acid fragment. Double-ended sequencing may involve fragmentation of a polynucleotide strand into short sequences called inserts. Fragmentation is optional or unnecessary for shorter polynucleotides (e.g., cell-free DNA molecules).
The terms "polynucleotide," "nucleic acid," and "nucleic acid molecule" are used interchangeably to refer to a covalently linked sequence of nucleotides (i.e., ribonucleotides of RNA and deoxyribonucleotides of DNA) in which the 3 '-position of the pentose of one nucleotide is linked to the 5' -position of the pentose of the next nucleotide by a phosphodiester group. Nucleotides include any form of nucleic acid sequence including, but not limited to, RNA and DNA molecules, such as cfDNA molecules. The term "polynucleotide" includes, but is not limited to, single-stranded and double-stranded polynucleotides.
The term "test sample" herein refers to a sample, typically derived from a biological fluid, cell, tissue, organ or organism, comprising a nucleic acid or mixture of nucleic acids comprising at least one nucleic acid sequence to be subjected to SNV or CNV screening. In certain embodiments, the sample comprises at least one nucleic acid sequence suspected of having undergone variation in copy number. Such samples include, but are not limited to, sputum/oral fluid, amniotic fluid, blood components or fine needle biopsy samples (e.g., surgical biopsies, fine needle biopsy, etc.), urine, peritoneal fluid, pleural fluid, etc. Although the sample is typically taken from a human individual (e.g., a patient), the assay may be used for SNV or CNV in any mammalian sample, including but not limited to dogs, cats, horses, goats, sheep, cattle, pigs, and the like. The sample may be obtained directly from biological sources or may be used after pretreatment to alter sample characteristics. For example, such pretreatment may include preparing plasma from blood, diluting viscous fluids, and the like. The pretreatment method may also include, but is not limited to: filtration, precipitation, dilution, distillation, mixing, centrifugation, freezing, lyophilization, concentration, amplification, nucleic acid fragmentation, inactivation of interfering components, addition of reagents, cleavage, and the like. If such pretreatment methods are employed on the sample, such pretreatment methods should generally leave the nucleic acid of interest in the test sample, sometimes at a concentration proportional to the concentration in the untreated test sample (i.e., the sample that has not been subjected to any such pretreatment methods). Such "treated" or "processed" samples are still considered biological "test" samples relative to the methods described herein.
The term "training set" refers herein to a set of training samples, which may include affected and/or unaffected samples, and is used to build a model for analyzing the test samples. In some embodiments, the training set comprises unaffected samples. In these embodiments, a sample training set that is unaffected for the SNV or CNV of interest is used to establish a threshold for detecting the SNV or CNV. Unaffected samples in the training set may be used as qualified samples to identify normalized sequences (e.g., normalized chromosomes), and the chromosome amounts of the unaffected samples are used to set a threshold for each sequence of interest (e.g., chromosome). In some embodiments, the training set comprises the affected samples. The affected samples in the training set can be used to verify that the affected test samples can be easily distinguished from the unaffected samples.
The training set is also a statistical sample in the population of interest that should not be confused with a biological sample. Statistical samples typically contain a plurality of individuals, and the data of these individuals is used to determine one or more quantitative values of interest that can be generalized to the population. The statistical sample is a subset of individuals in the population of interest. The individual may be a human, an animal, a tissue, a cell, other biological samples (i.e., a statistical sample may include multiple biological samples), and other individual entities that provide data points for statistical analysis.
Typically, the training set is used in combination with the validation set. The term "validation set" is used to refer to a set of individuals in a statistical sample whose data is used to validate or evaluate quantitative values of interest determined using a training set. In some embodiments, for example, the training set provides data for computing a mask for the reference sequence, while the validation set provides data for evaluating the validity or effectiveness of the mask.
The term "sequence of interest" or "nucleic acid sequence of interest" herein refers to a nucleic acid sequence that is associated with a difference in sequence representation between healthy and diseased individuals. The sequence of interest may be a sequence on a chromosome that is misrepresented (i.e., over- or under-represented) in a disease or genetic condition. The sequence of interest may be a portion of a chromosome, i.e., a chromosome segment, or a whole chromosome. For example, the sequence of interest may be a chromosome that is over-represented in an aneuploidy condition, or a gene encoding a tumor suppressor that is under-represented in a cancer. Sequences of interest include sequences that are over- or under-represented in the total population, or in a subpopulation, of an individual's cells. A "qualified sequence of interest" is a sequence of interest in a qualified sample. A "test sequence of interest" is a sequence of interest in a test sample.
The term "coverage" refers to the abundance of sequence tags mapped to defined sequences. Coverage may be quantitatively expressed by sequence tag density (or sequence tag count), sequence tag density ratio, normalized coverage, adjusted coverage value, and the like.
The term "next generation sequencing" (NGS) herein refers to sequencing methods that allow massively parallel sequencing of clonally amplified molecules and of single nucleic acid molecules. Non-limiting examples of NGS include sequencing by synthesis using reversible dye terminators, and sequencing by ligation.
The terms "threshold" and "qualified threshold" herein refer to any value used as a cutoff to characterize a sample (e.g., a test sample comprising nucleic acid from an organism suspected of having a medical condition). The threshold may be compared to a parameter value to determine whether the sample that gave rise to the parameter value indicates that the organism has the medical condition. In certain embodiments, the qualified threshold is calculated using a qualifying dataset and serves as a diagnostic limit for SNV or CNV. If the result obtained from the methods disclosed herein exceeds the threshold, the individual may be diagnosed as having the SNV or CNV. Appropriate thresholds for the methods described herein can be determined by analyzing the normalized values (e.g., chromosome amounts, NCVs, or NSVs) calculated for a training set of samples. The threshold may be determined using the qualified (i.e., unaffected) samples in a training set that includes both qualified (i.e., unaffected) samples and affected samples. Samples in the training set known to have chromosomal aneuploidies (i.e., affected samples) can be used to confirm that the selected threshold distinguishes affected from unaffected samples in a test set (see examples herein). The choice of threshold depends on the level of confidence with which the user wishes to classify. In some embodiments, the training set for determining the appropriate threshold comprises at least 10, at least 20, at least 30, at least 40, at least 50, at least 60, at least 70, at least 80, at least 90, at least 100, at least 200, at least 300, at least 400, at least 500, at least 600, at least 700, at least 800, at least 900, at least 1000, at least 2000, at least 3000, at least 4000, or more qualified samples. It may be advantageous to use a larger set of qualified samples to increase the diagnostic utility of the threshold.
The term "read" refers to a sequence obtained from a portion of a nucleic acid sample. Typically, although not necessarily, a read represents a short sequence of contiguous base pairs in the sample. A read can be represented symbolically by the base pair sequence (as A, T, C, or G) of the sample portion. Reads may be stored in a storage device and processed as appropriate to determine whether they match a reference sequence or meet other criteria. Reads may be obtained directly from a sequencing device or indirectly from stored sequence information concerning the sample. In some cases, a read is a DNA sequence of sufficient length (e.g., at least about 25 bp) to identify a larger sequence or region, e.g., one that can be aligned and specifically assigned to a chromosome, genomic region, or gene.
The term "genomic read" is used to refer to a read of any segment of the entire genome of an individual.
The term "sequence tag" is used interchangeably herein with the term "mapped sequence tag" to refer to a sequence read that has been specifically assigned (i.e., mapped) to a larger sequence (e.g., a reference genome) by alignment. Mapped sequence tags are uniquely mapped to the reference genome, i.e., they are assigned to a single location in the reference genome. Unless otherwise indicated, tags mapped to the same sequence on the reference sequence are counted once. Tags may be provided as data structures or other assemblages of data. In certain embodiments, a tag comprises a read sequence and information associated with the read, such as the position of the sequence in the genome, e.g., its position on a chromosome. In certain embodiments, the position is specified for the positive strand orientation. A tag may be defined to allow a limited number of mismatches in alignment with the reference genome. In some embodiments, tags that can map to multiple locations on the reference genome (i.e., tags that are not uniquely mapped) may be excluded from the analysis.
As used herein, the term "aligned," "alignment," or "aligning" refers to the process of comparing a read or tag to a reference sequence to determine whether the reference sequence contains the read sequence. If the reference sequence contains the read, the read may be mapped to the reference sequence, or in some embodiments, to a specific location in the reference sequence. In some cases, the alignment simply indicates whether the read is a member of a particular reference sequence (i.e., whether the read is present or absent in the reference sequence). For example, an alignment of a read with the human chromosome 13 reference sequence will indicate whether the read is present in the reference sequence of chromosome 13. A tool that provides this information may be referred to as a set membership tester. In some cases, the alignment also indicates the position on the reference sequence to which the read or tag maps. For example, if the reference sequence is the whole human genome sequence, the alignment may indicate that a read is present on chromosome 13, and may further indicate that the read is on a specific strand and/or site of chromosome 13.
Aligned reads or tags are one or more sequences that are identified as a match, in terms of the order of their nucleotides, to a known sequence in the reference genome. Although alignment can be performed manually, it is typically performed by a computer algorithm, because it would be impossible to align the reads manually within a reasonable period of time for implementing the methods disclosed herein. One example of an algorithm for aligning sequences is the Efficient Local Alignment of Nucleotide Data (ELAND) computer program distributed as part of the Illumina genomics analysis pipeline. Alternatively, a Bloom filter or similar set membership tester may be used to align reads with the reference genome. See U.S. patent application Ser. No. 61/552,374, filed October 27, 2011, which is incorporated herein by reference in its entirety. The match of a sequence read in an alignment may be a 100% sequence match, or less than 100% (a non-perfect match).
As used herein, the term "mapping" refers to specifically assigning sequence reads to larger sequences, such as a reference genome, by alignment.
As used herein, the term "reference genome" or "reference sequence" refers to any particular known genomic sequence, whether partial or complete, of any organism or virus, that can be used as a reference for sequences identified from an individual. For example, reference genomes for human individuals as well as many other organisms can be found at the National Center for Biotechnology Information (NCBI). A "genome" refers to the complete genetic information of an organism or virus, expressed in nucleic acid sequences.
In various embodiments, the reference sequence is substantially larger than the reads with which it is aligned. For example, the reference sequence may be at least about 100 times larger, or at least about 1000 times larger, or at least about 10,000 times larger, or at least about 10^5 times larger, or at least about 10^6 times larger, or at least about 10^7 times larger.
In one example, the reference sequence is that of a full-length human genome. Such sequences may be referred to as genomic reference sequences. In another example, the reference sequence is limited to a specific human chromosome, such as chromosome 13. In some embodiments, the reference Y chromosome is the Y chromosome sequence from human genome version hg19. Such sequences may be referred to as chromosomal reference sequences. Other examples of reference sequences include genomes of other species, as well as chromosomes, sub-chromosomal regions (e.g., strands), and the like, of any species.
In various embodiments, the reference sequence is a consensus sequence or other combination derived from a plurality of individuals. However, in some applications, the reference sequence may be taken from a particular individual.
The term "clinically relevant sequence" herein refers to a nucleic acid sequence known or suspected to be associated with or involved in a genetic or disease condition. Determining the absence or presence of a clinically relevant sequence may be used to determine or confirm a diagnosis of a medical condition, or to provide a prognosis of disease progression.
The term "derived" when used in the context of a nucleic acid or a mixture of nucleic acids refers herein to the manner in which the nucleic acid is obtained from its source. For example, in one embodiment, a mixture of nucleic acids derived from two different genomes refers to nucleic acids such as cfDNA that are naturally released by a cell through a naturally occurring process (e.g., necrosis or apoptosis). In another embodiment, a mixture of nucleic acids derived from two different genomes means that the nucleic acids are extracted from two different types of cells of an individual.
The term "based on" when used herein in the context of obtaining a particular quantitative value refers to using another quantity as an input to calculate the particular quantitative value as an output.
The term "biological fluid" herein refers to a liquid taken from a biological source and includes, for example, blood, serum, plasma, sputum, lavage, cerebrospinal fluid, urine, semen, sweat, tears, saliva, and the like. As used herein, the terms "blood," "plasma," and "serum" expressly encompass fractions or processed portions thereof. Similarly, when a sample is taken from a biopsy, swab, smear, or the like, "sample" expressly encompasses a treated fraction or portion obtained from a biopsy, swab, smear, or the like.
As used herein, the term "corresponding to" sometimes refers to a nucleic acid sequence (e.g., a gene or chromosome) that is present in the genomes of different individuals and does not necessarily have the same sequence in all genomes, but is used to provide identity to a sequence of interest (e.g., a gene or chromosome) rather than genetic information.
As used herein, the term "chromosome" refers to the heredity-bearing gene carrier of a living cell, which is derived from chromatin strands comprising DNA and protein components (particularly histones). The conventional internationally recognized numbering system for individual human genome chromosomes is employed herein.
As used herein, the term "polynucleotide length" refers to the absolute number of nucleotides in a sequence or in a region of a reference genome. The term "chromosome length" refers to the known length of a chromosome given in base pairs, e.g., as provided in the NCBI36/hg18 assembly of the human chromosomes found on the world wide web |genome|ucsc|edu/cgi-bin/hgtrackshgsid=167155613 & chromsinfopage=3.
The term "individual" as used herein refers to human individuals as well as non-human individuals, such as mammals, invertebrates, vertebrates, fungi, yeasts, bacteria and viruses. Although the examples herein refer to humans, and the language is primarily directed to humans, the concepts disclosed herein are applicable to genomes from any plant or animal, and may be used in veterinary, animal science, research laboratory, and the like.
The term "condition" is used in a broad sense to refer to a "medical condition," including all diseases and disorders, and may also include injuries and normal health situations, such as pregnancy, that may affect a person's health, benefit from medical assistance, or have implications for medical treatment.
As used herein, the term "sensitivity" refers to the probability that a test result is positive when a condition of interest is present. It can be calculated as the number of true positives divided by the sum of true positives and false negatives.
As used herein, the term "specificity" refers to the probability that a test result is negative in the absence of a disorder of interest. It can be calculated as the number of true negatives divided by the sum of true negatives and false positives.
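The two definitions above reduce to simple ratios of confusion-matrix counts. The following minimal sketch illustrates them; the function names and the counts are hypothetical and are used only for illustration.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Probability of a positive result when the condition is present:
    true positives / (true positives + false negatives)."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Probability of a negative result when the condition is absent:
    true negatives / (true negatives + false positives)."""
    return true_neg / (true_neg + false_pos)


# Hypothetical counts: 100 affected samples and 1000 unaffected samples.
sens = sensitivity(true_pos=90, false_neg=10)
spec = specificity(true_neg=950, false_pos=50)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

Reducing false positives raises specificity, while recovering additional true positives raises sensitivity.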
The term "enriching" herein refers to the process of amplifying a polymorphic target nucleic acid contained in a portion of a parent sample and combining the amplified product with the remainder of the parent sample from which that portion was removed. For example, the remainder of the parent sample may be the original parent sample.
As used herein, the term "primer" refers to an isolated oligonucleotide that is capable of acting as a point of initiation of synthesis when placed under conditions that induce synthesis of an extension product (e.g., conditions include nucleotides, an inducer such as a DNA polymerase, and appropriate temperature and pH). For maximum amplification efficiency, the primer is preferably single stranded, but may also be double stranded. If double-stranded, the primer is first treated to separate its strand before use in preparing the extension product. Preferably, the primer is an oligodeoxyribonucleotide. The primer must be long enough to prime the synthesis of the extension product in the presence of the inducer. The exact length of the primer will depend on many factors, including temperature, primer source, use of the method, and parameters used for primer design.
Introduction to the invention
Cancer genome sequencing studies have collectively identified a variety of genetic mutations that can lead to the growth and development of human tumors. As a result of these studies, scientists have found that most cancers carry somatic DNA mutations. Unlike hereditary or germline mutations, which are transmitted from parents to children, somatic mutations arise in the DNA of individual cells during a person's lifetime and are not transmitted from parents to children. Thus, sequence variants due to cancer-associated somatic DNA mutations provide biomarkers for detecting cancer and measuring cancer progression.
Tumor tissue itself includes a large amount of DNA material that can be analyzed to detect cancer variants or sequence variants known or suspected to be associated with various cancers. This can be done by biopsy of the tumor tissue. However, due to the continual change in the location and form of cancer, it is often difficult to obtain biopsy samples at various locations in succession to obtain cancer tissue and DNA of cancer origin. Scientists have found that dying tumor cells release small fragments of their DNA into blood and other body fluids. These fragments are called cell-free circulating tumor DNA (ctDNA), and coexist with cell-free DNA (cfDNA) of non-cancer cells. Methods for screening ctDNA associated with somatic mutations are being developed to detect and track progression of a patient's tumor. These methods are also known as liquid biopsies.
Various current liquid biopsy methods use high-throughput sequencing to analyze cfDNA collected from a patient. However, the ability to detect tumor-specific variants is limited by several factors. Liquid biopsy methods using high-throughput sequencing are limited by sequencing error rates and sequencing depth. In some cancer patients, tumor burden may be very low for some tumor variants. For example, in some samples ctDNA may be less than 0.1% or 0.01% of cfDNA. Thus, the fraction of cfDNA derived from a tumor may be below the error rate of the sequencing procedure. Calling tumor-specific variants in patients with low tumor burden may suffer from a high false positive rate, because there is a small but nonzero probability that a sequence in a read matching a tumor variant is actually due to a sequencing error rather than a true mutation. It is desirable to increase true positives to improve sensitivity, while decreasing false positives to improve selectivity.
Recent studies have found that ctDNA fragments are generally shorter than cfDNA fragments from non-tumor cells. It has been observed that ctDNA fragments are on average about 20 bp shorter than background cfDNA (e.g., about 145 bp versus 165 bp). The size distributions of ctDNA and cfDNA are broad and overlapping. However, these observations alone do not provide a means to improve the performance of liquid biopsy analysis. The methods described below may characterize the insert size distribution of reads that support a variant. These methods exploit differences in fragment size by applying specific processes and algorithms that synergistically and effectively combine fragment size and fragment abundance information, thereby improving the performance of variant assays using high-throughput sequencing. Some embodiments provide improved sensitivity and/or selectivity compared to using sequence information or size information alone.
In various applications, ctDNA analysis requires detection of mutated fragments at very low frequencies, on the order of 0.1-0.01% or even lower for screening applications. In view of errors in library preparation, clustering, and sequencing, it is challenging to distinguish true positives from false positives (specificity). By establishing a detection method that uses cfDNA fragment size, we can increase the confidence that a cancer variant call is correct. For example, if a possible somatic mutation is found in a short fragment, it is more likely to come from a true tumor fragment than if the fragment is long. This type of weighting can be used to improve the specificity of the assay. Another way to exploit the fragment size difference is to use only the shorter fragments, or to weight them more heavily. This effectively enriches ctDNA information relative to non-tumor cfDNA.
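As one hedged illustration of size-based weighting, the sketch below scores each variant-supporting fragment by a log-likelihood ratio of its length under a tumor size model versus a background size model. The Gaussian shape, the shared standard deviation of 20 bp, and all function names are assumptions made for illustration only; the text supplies just the approximate means (about 145 bp and 165 bp) and notes that the real distributions are broad and overlapping.

```python
import math

# Approximate means from the text; the standard deviation is a hypothetical value.
TUMOR_MEAN_BP = 145.0
BACKGROUND_MEAN_BP = 165.0
ASSUMED_SD_BP = 20.0


def normal_logpdf(x: float, mean: float, sd: float) -> float:
    """Log density of a normal distribution (an assumed size model)."""
    return -0.5 * math.log(2.0 * math.pi * sd * sd) - (x - mean) ** 2 / (2.0 * sd * sd)


def size_log_likelihood_ratio(fragment_len: float) -> float:
    """Positive when the length looks more tumor-like, negative when more background-like."""
    return (normal_logpdf(fragment_len, TUMOR_MEAN_BP, ASSUMED_SD_BP)
            - normal_logpdf(fragment_len, BACKGROUND_MEAN_BP, ASSUMED_SD_BP))


def size_weighted_support(fragment_lengths) -> float:
    """Sum the per-fragment size evidence over all fragments supporting a candidate variant."""
    return sum(size_log_likelihood_ratio(n) for n in fragment_lengths)


# Two candidate variants, each supported by three reads.
short_support = size_weighted_support([135, 140, 150])  # mostly short fragments
long_support = size_weighted_support([170, 175, 180])   # mostly long fragments
print(short_support > 0, long_support < 0)
```

With equal standard deviations the score is linear in fragment length and changes sign midway between the two means, so a candidate supported by short fragments gains evidence and one supported by long fragments loses it. A real implementation would more plausibly use empirical fragment length distributions and combine this score with the sequence evidence.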
Other potential benefits of using size information include: 1) reducing sequencing requirements, 2) estimating total tumor burden, and 3) distinguishing germline variants from true somatic variants, where germline variants have normal fragment lengths and somatic variants have shorter fragments.
ctDNA measurements can also be used to address the problem of analyzing tumor heterogeneity in metastatic cancer patients. Metastatic cancer patients carry multiple tumors that sometimes have different driver mutations. Since these driver mutations are often drug targets, it is highly desirable to identify and characterize the driver mutations in a patient. Moreover, it would be valuable to know which driver mutations come from the same tumor and which come from different tumors. Furthermore, there is evidence that when a drug-targeted driver mutation is present in the dominant clone within a tumor, the therapeutic response is better than when the driver mutation comes from a minor clone. Conventional methods are ineffective for determining these heterogeneity metrics. Taking multiple biopsies from a patient to determine these metrics is impractical or unsafe. Furthermore, a biopsy samples only a small portion of one tumor. ctDNA in blood is the superposition of the ctDNA from all of the patient's tumors.
In the examples below, preliminary data using targeted sequencing show that the relative allele frequencies in ctDNA data may represent different clones or tumors. Some embodiments provided herein can determine tumor heterogeneity, mutations from the same clone, and which clone is likely to be the primary clone.
In certain embodiments, whole genome sequencing facilitates determination of tumor heterogeneity. The vast majority of solid tumors have somatic copy number changes larger than 10 Mb. These can be detected using whole genome sequencing, providing an orthogonal measure of the ctDNA fraction. Region-specific differences in copy number level can be used to determine tumor heterogeneity. In addition, such genome-wide measures of copy number variation and tumor heterogeneity can be compared with the deeper targeted sequencing described above. Comparing the focal somatic changes measured using the targeted method with the copy number changes may increase the ability to distinguish between multiple clones.
The present application provides an analytical method in liquid biopsies for obtaining fragment size information from e.g. double-ended reads and using this information in an analytical procedure. The higher analytical sensitivity provides the ability to apply liquid biopsy methods with higher selectivity. And by adjusting the decision criteria, sensitivity can also be improved compared to conventional methods using only sequence information.
Fragment size of cfDNA
As described above, the fragment size, sequence, and abundance of cfDNA can be used to evaluate tumor variants. The fragment size of cfDNA fragments can be obtained by double-ended sequencing, electrophoresis (e.g., microchip-based capillary electrophoresis), and other methods known in the art. Fig. 1A illustrates how double-ended sequencing can be used to determine fragment size, fragment sequence, and sequence coverage.
The top half of Fig. 1A shows a schematic of ctDNA fragments and non-cancerous cfDNA fragments that provide templates for the double-ended sequencing process. Typically, long nucleic acid sequences are fragmented into shorter sequences to be read in double-ended sequencing. Such fragments are also referred to as inserts. In some embodiments, cell-free DNA does not require fragmentation, because it already exists mostly as fragments shorter than 300 base pairs. As shown in the upper part of Fig. 1A, ctDNA fragments are shorter than background cfDNA. Some studies have observed a difference of about 20 bp, e.g., ctDNA of about 145 bp versus non-cancerous cfDNA of about 165 bp. Fig. 1B shows a density plot of empirical data from reads carrying cancer or tumor variants (dark grey) and reads carrying non-cancer variants or the reference sequence (light grey). Here, the cancer variants show enrichment of smaller fragment sizes.
In the application of the embodiments disclosed herein, the exact and absolute sizes of the two DNA sources are not as important as the relative differences between the two. In one hypothesis, the size of the DNA fragment is related to different cell types of cancer cells relative to normal cells. Non-cancerous cfDNA in plasma may originate from blood cells, whereas cancerous cfDNA in plasma may originate from epithelial cells. The nucleosome structure of blood cells may be different from that of epithelial cells. Such structural differences may result in DNA being cut into different sizes. In another hypothesis, the difference in fragment size may be the result of interactions between cancer cells and nucleosomes.
Nucleosomes are the basic unit of DNA packaging in eukaryotes, comprising a segment of DNA wound around a histone octamer consisting of two copies each of the core histones H2A, H2B, H3, and H4. The nucleosome core particle consists of approximately 147 base pairs of DNA wrapped in 1.67 left-handed superhelical turns around the histone octamer. Core particles are connected by stretches of linker DNA of up to about 80 bp. Technically, a nucleosome is defined as the core particle plus one of these linker regions; however, the term is sometimes used to refer to the nucleosome core as well. Apoptosis or other cellular mechanisms in cancer cells and non-cancer cells may differentially disrupt nucleosome structure. Those skilled in the art understand that the underlying mechanism of this size difference does not affect the utility of the present application.
In double-ended sequencing on certain platforms, such as Illumina's sequencing-by-synthesis platform described further below, adaptor sequences, index sequences, and/or primer sequences are ligated to both ends of the fragments. The fragment is first read in one direction, providing read 1 from one end of the fragment. A second read is then performed starting from the other end of the fragment, providing read 2. The correspondence between read 1 and read 2 can be identified by their coordinates in the flow cell. Read 1 and read 2 are then mapped to the reference sequence as a pair of tags that lie close to each other, as shown in the lower half of Fig. 1A. In some embodiments, if the reads are long enough, the two reads may overlap in the middle portion of the insert. After the pair is aligned to the reference sequence, the relative distance between the two reads, and hence the length of the fragment from which they were derived, can be determined from the positions of the two reads. Because double-ended reads provide twice as many base pairs as single-ended reads of the same length, they help improve alignment quality, especially for sequences having many repeats or non-unique sequences. After the double-ended reads are aligned to the reference sequence, the number of reads aligned to a bin can be determined. The number and lengths of the inserts (e.g., cfDNA fragments) in a bin may also be determined. In some embodiments, if an insert spans two bins, half of the insert may be attributed to each bin. In various embodiments, the sequence information and alignment position of the insert are used to determine whether the insert includes a variant of interest, such as a cancer-related variant of the sequence of interest 110 in the reference genome.
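The inference of fragment length from a read pair and the attribution of an insert to bins can be sketched as follows. This is a minimal illustration under stated assumptions: coordinates are 0-based and half-open, both reads are already aligned to the same chromosome, the bin size is larger than any fragment (so an insert touches at most two bins), and the function names are hypothetical.

```python
def fragment_span(read1_start, read1_end, read2_start, read2_end):
    """Infer the fragment (insert) interval from the aligned positions of a read pair.

    The fragment runs from the leftmost aligned base to the rightmost aligned base,
    whether or not the two reads overlap in the middle of the insert.
    """
    return min(read1_start, read2_start), max(read1_end, read2_end)


def fragment_length(start, end):
    """Length in base pairs of a half-open interval [start, end)."""
    return end - start


def assign_to_bins(start, end, bin_size):
    """Attribute one insert to fixed-width bins.

    An insert inside a single bin counts fully toward that bin; an insert that
    spans two bins contributes one half to each, as described in the text.
    Assumes bin_size exceeds the fragment length.
    """
    first_bin = start // bin_size
    last_bin = (end - 1) // bin_size
    if first_bin == last_bin:
        return {first_bin: 1.0}
    return {first_bin: 0.5, last_bin: 0.5}


# Read 1 aligns to [950, 1025) and read 2 to [1040, 1115) on the same chromosome.
start, end = fragment_span(950, 1025, 1040, 1115)
print(fragment_length(start, end))               # length of the inferred insert
print(assign_to_bins(start, end, bin_size=1000)) # split across bins 0 and 1
```

In practice the aligner usually reports the inferred insert size directly for properly paired reads, so the first function mainly serves to make the geometry explicit.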
For example, in certain embodiments, if a cfDNA fragment is read that contains the sequence of a tumor variant and matches the genomic coordinates of the cancer variant, the cfDNA is determined to be a fragment that may contain the variant. In a downstream process, cfDNA fragments are analyzed using the sequence and size of fragments that may contain variants to determine the presence or abundance of cancer variants in a sample.
In some studies it has been observed that cfDNA with cancer related variants (ctDNA) tends to have a shorter fragment size (or shorter fragments) than normal cfDNA. As shown in FIG. 1B, ctDNA has a Fragment Length Distribution (FLD) with its peak shifted to the lower end. Also, fig. 1C shows that shorter cfDNA fragments (shorter than or equal to 150bp, dark grey bars) tend to have higher allele frequencies than longer cfDNA fragments (longer than 150bp, light grey bars). Each pair of bar graphs in the figure shows data for variants associated with cancer, with the vertical axis representing variant allele frequencies.
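The comparison shown in Fig. 1C can be sketched as a variant allele frequency computed separately for short and long fragments. The 150 bp cutoff comes from the text; the input representation (a list of fragment length and variant-support pairs at one genomic site), the function name, and the toy data are assumptions made for illustration.

```python
def allele_frequency_by_size(fragments, cutoff_bp=150):
    """Compute variant allele frequency separately for short and long fragments.

    `fragments` is an iterable of (length_bp, supports_variant) pairs covering one
    site. Fragments with length <= cutoff_bp are "short", the rest are "long".
    Returns None for a size class with no covering fragments.
    """
    counts = {"short": [0, 0], "long": [0, 0]}  # [variant fragments, all fragments]
    for length_bp, supports_variant in fragments:
        size_class = "short" if length_bp <= cutoff_bp else "long"
        counts[size_class][1] += 1
        if supports_variant:
            counts[size_class][0] += 1
    return {
        size_class: (variant / total if total else None)
        for size_class, (variant, total) in counts.items()
    }


# Toy data for one site: (fragment length, whether the fragment carries the variant).
site_fragments = [
    (138, True), (144, True), (149, False), (150, False),
    (158, False), (166, True), (171, False), (183, False),
]
print(allele_frequency_by_size(site_fragments))
```

A higher allele frequency among short fragments than among long ones is the pattern the text describes for cancer-associated variants.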
Determining variants of interest using cfDNA fragment size
Various embodiments provide methods for determining a variant of interest (e.g., a tumor variant or a cancer-specific somatic variant) from cfDNA sequencing data. These variants can be divided into three major classes: simple nucleotide variants (SNVs), structural variants (SVs), and copy number variants (CNVs). SNVs include single nucleotide variants, phased sequence variants, and small insertions and deletions (indels). Structural variants include structural rearrangements of the chromosome, including large indels, duplications, inversions, and translocations. CNVs include abnormal copy numbers of normally diploid regions of the genome. Among these three variant classes, SNV and CNV calling can be improved by incorporating ctDNA fragment size information.
Fig. 2 shows a flow chart illustrating a process 200 for preparing a sample and analyzing cfDNA fragments extracted from the sample, using fragment size and sequence information to determine variants of interest (e.g., tumor variants) in the sample. In some embodiments, the tumor variant is a simple nucleotide variant (SNV). In some embodiments, the SNV is a single nucleotide variant (such as a SNP), a phased sequence variant, or a small indel. In some embodiments, the tumor is malignant (cancerous) or potentially malignant (pre-cancerous). The process begins by obtaining a sample from an individual that includes cell-free DNA. Samples may be obtained from peripheral blood, saliva, and other body fluids, as described further below in the sample processing section. The process involves extracting cell-free DNA fragments from the sample. See block 202. In some embodiments, a relatively large amount of cfDNA may be required, as the ctDNA concentration of some samples may be relatively low.
To increase the likelihood of detecting a tumor variant of interest, some embodiments involve enriching for sequence regions known to have tumor variants. In some embodiments, enrichment involves whole genome amplification of cfDNA fragments. In some embodiments, enrichment involves targeted amplification of cfDNA fragments. See operation 204. Enrichment can be performed either before or after sequencing library preparation. Indeed, unless otherwise indicated, all operations described or illustrated herein may be performed in an order other than that shown. As shown in process 200 of fig. 2, cfDNA fragments having sequences corresponding to one or more selected genomic regions where tumor-associated variants are located are enriched. Operation 204 provides targeted amplification of sequences in regions that may have variants known or suspected to be associated with a tumor, particularly a malignant or pre-cancerous tumor. Amplifying fragments in these target regions increases the likelihood of detecting cancer-related variants. In some applications, the targeted region may include a chromosome, a sub-chromosomal region, or a single gene region. In other applications, simple nucleotide variants can be targeted over a relatively narrow range of sequence, such as 500 bp, 1000 bp, 2000 bp, 3000 bp, 4000 bp, 5000 bp, 10000 bp, or 20000 bp. In some embodiments, whole genome sequencing may be performed. Such embodiments are particularly useful for detecting CNVs spanning long sequences. In various embodiments, because of the low concentration of ctDNA, deep sequencing is performed, e.g., to a depth of at least about 10,000x. Such deep sequencing can be facilitated by targeted or whole genome amplification.
Various embodiments may be applied with or without sample amplification, provided that no experimental procedure destroys the difference between the fragment size distributions of tumor and healthy tissue.
Process 200 involves preparing a sequencing library from the cfDNA fragments extracted from the sample. See block 206. In many applications that use sequencing libraries on high throughput sequencing platforms, DNA molecules are fragmented and end repaired. In applications involving cfDNA, however, the DNA already exists as fragments ranging from tens to hundreds of base pairs. In embodiments herein that use fragment size information, library preparation should substantially preserve fragment size. Thus, harsh conditions that could break down the fragments present in the body fluid should be avoided during sample preparation. Of course, library preparation may involve certain primers and adaptors that extend the fragment length. However, as long as the preparation affects size consistently across different fragments, the size information of the fragments can be recovered, as in the paired-end sequencing technique described above.
In some embodiments, preparing the library involves applying adaptors to both ends of the extracted cfDNA fragments. In some embodiments, the adaptors include physically unique molecular identifiers that can be used to identify individual fragments in the sample. In some embodiments, the physically unique molecular identifier is less than about 12 nucleotides. Methods and systems for applying unique molecular identifiers are provided in U.S. patent application Ser. No. 15/130,668, which is incorporated herein by reference in its entirety.
The process 200 also involves sequencing the cfDNA fragments to obtain reads containing sequence information of the cfDNA fragments. See block 208. In various embodiments, paired-end sequencing is used to sequence the fragments from both ends. This approach is useful when reads are shorter than the fragments, as may be the case on various high throughput sequencing platforms. In alternative embodiments, single-end sequencing with reads long enough to cover the entire length of a DNA fragment may be used.
The sequence reads obtained from sequencing are aligned to a reference genome or a portion thereof to provide sequence tags that include the sequences and their alignment positions (e.g., genomic coordinates). See block 210. The alignment information of the sequence tags can be used to determine the relative positions of the two reads of a paired-end read pair. The process 200 also involves determining the sizes of the cfDNA fragments present in the sample using information in the sequence tags. See block 212. In some embodiments, the sequence tag is long enough to cover the entire cfDNA fragment. In these embodiments, fragment size can be obtained by simply counting the number of bases read from the fragment during sequencing. In other embodiments, the relative alignment positions of the two reads of a pair can be used to determine the size of the fragment from which the reads were obtained. The alignment positions of the reads can be combined with the sequences of the reads to determine whether the fragment from which the reads were derived is likely to include a cancer variant derived from a cancer-related mutation. If a read contains the sequence of a cancer variant, and optionally maps to the genomic coordinates of the cancer variant, the fragment from which the read originates is referred to as a fragment that may contain the variant. The fragment only may contain a sequence derived from a cancer mutation, because there is a small but nonzero chance that a read matches the sequence and position of the cancer variant as a result of errors in the sequencing procedure.
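The size determination from a read pair can be illustrated with a short sketch. The function `fragment_span` and its arguments are hypothetical; it assumes each read is described by its leftmost 1-based reference coordinate and its aligned length, with no gaps or clipping, which is a simplification of real alignments.

```python
def fragment_span(read1_start, read1_len, read2_start, read2_len):
    """Infer the reference span and size of a fragment from a read pair.

    Assumptions (illustrative only): starts are leftmost 1-based reference
    coordinates of each aligned read, and alignments are ungapped.
    Returns (leftmost position, rightmost position, fragment size).
    """
    left = min(read1_start, read2_start)
    right = max(read1_start + read1_len, read2_start + read2_len) - 1
    return left, right, right - left + 1


# Two 75-base reads from opposite ends of a fragment.
print(fragment_span(1000, 75, 1091, 75))   # (1000, 1165, 166)
# Two 100-base reads that overlap in the middle of a short insert.
print(fragment_span(1000, 100, 1040, 100))  # (1000, 1139, 140)
```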
The process 200 then makes a call, using the sequences and sizes of the cfDNA fragments, as to whether a tumor variant is present in the cfDNA fragments. See block 214. In various embodiments of the processes of fig. 3A-3F, sequence and size information of cfDNA fragments can be used to call a tumor variant or a disorder associated with the variant.
Turning to fig. 3A, a flow chart shows a process 300 for determining a variant of interest using sequence information and size information of cfDNA fragments. The method may be implemented on a computer system that includes one or more processors and a system memory, as described further below. The variant of interest may be an allele associated with a disorder of interest. In some embodiments, the variant of interest is suspected of being associated with a cancer or tumor. For example, the variant of interest may be a BRCA mutation known to be associated with breast cancer. In some embodiments, the variant of interest is known or suspected to be associated with a genetic disease. In some embodiments, the variants of interest include simple nucleotide variants (SNVs). In some embodiments, the SNV is a single nucleotide variant, a phased sequence variant, or an indel.
The process 300 begins with obtaining sequence reads of cfDNA fragments derived from a test sample. The process also obtains the sizes of the cfDNA fragments derived from the test sample. See block 302. The size of a cfDNA fragment is also referred to as fragment size, fragment length, or molecular size. In some embodiments, the size information and sequence information of the cfDNA are obtained through a process such as process 200 depicted in fig. 2. In some embodiments, the sequence reads are paired-end reads, and the read pairs are used to determine cfDNA fragment sizes as described above. In some embodiments, when the variant of interest is associated with a tumor or cancer, the cfDNA fragments comprise circulating tumor DNA (ctDNA) fragments. In some embodiments, the test sample is a plasma sample. In some embodiments, the test sample is a plasma sample of a pregnant woman, and the cfDNA includes cfDNA derived from the pregnant woman and cfDNA derived from a fetus carried by the pregnant woman.
The process 300 also includes assigning the cfDNA fragments to a plurality of bins representing different fragment sizes. See block 304. In some embodiments, each bin of the plurality of bins has the same bin size. In other words, each bin covers a fixed range of fragment sizes. In some embodiments, the bins cover non-overlapping contiguous size ranges. For example, the first bin contains fragments of 1 to 5 nucleotides, the second bin contains fragments of 6 to 10 nucleotides, the third bin contains fragments of 11 to 15 nucleotides, the fourth bin contains fragments of 16 to 20 nucleotides, and so on. In various embodiments, different bin sizes may be used, such as 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, and 100. In some embodiments, the plurality of bins collectively cover a total range of 1-1000 nucleotides, 1-500 nucleotides, or 1-380 nucleotides. Different bin sizes and total ranges may be used in different embodiments under different conditions. For example, the analysis shown in fig. 9 assigns cfDNA fragments to bins each spanning 5 nucleotides, with the bins collectively covering a size range of 1-380 nucleotides. As shown in fig. 9, when cfDNA fragments are assigned to bins, the frequencies of cfDNA fragments in the different bins form a histogram corresponding to the fragment length distribution, similar to distributions 330, 332, and 334 in fig. 3C.
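The binning just described can be sketched in a few lines. The helper names (`size_bin`, `bin_range`, `size_histogram`) and the zero-based bin index are assumptions made for illustration; the defaults mirror the 5-nucleotide bins over 1-380 nucleotides mentioned for fig. 9.

```python
from collections import Counter


def size_bin(fragment_size, bin_width=5, max_size=380):
    """Map a fragment size to a zero-based bin index, or None if out of range.

    With bin_width=5, bin 0 covers sizes 1-5, bin 1 covers 6-10, and so on.
    """
    if not 1 <= fragment_size <= max_size:
        return None
    return (fragment_size - 1) // bin_width


def bin_range(index, bin_width=5):
    """Return the inclusive (smallest, largest) fragment size covered by a bin."""
    return index * bin_width + 1, (index + 1) * bin_width


def size_histogram(fragment_sizes, bin_width=5, max_size=380):
    """Count fragments per bin; sizes outside the total range are skipped."""
    counts = Counter()
    for size in fragment_sizes:
        index = size_bin(size, bin_width, max_size)
        if index is not None:
            counts[index] += 1
    return counts


print(bin_range(2))                                 # (11, 15)
print(size_histogram([148, 150, 151, 166, 400]))
```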
The process 300 also includes determining, using the sequence reads obtained in operation 302, a quantity of the variant of interest in a priority set of bins selected from the plurality of bins. The priority set of bins is selected to: (1) limit the probability that the quantity of the variant of interest in the priority set of bins is below a limit of detection (LOD), and (2) increase the probability that the quantity of the variant of interest in the priority set of bins is higher than that in all bins of the plurality of bins. For example, to detect a cancer-related variant, the priority set of bins is selected to increase the signal associated with the cancer while ensuring that the signal in the priority set of bins exceeds the limit of detection. In some embodiments, the quantity of the variant of interest is the allele frequency of the variant of interest. In some embodiments, the quantity is a count of the variant of interest in the priority set of bins. In some embodiments, the quantity may be normalized relative to a reference or baseline.
In some embodiments, the priority set of bins is obtained by providing a plurality of candidate sets and selecting one set from the plurality of candidate sets as the priority set. In some embodiments, bins that do not contain a variant of interest (e.g., a tumor-associated variant) are excluded from the priority set. In some embodiments, the process of selecting or identifying the priority set is performed separately from testing a sample using the selected bins. In other words, the process of identifying bins may be performed once, and the selected bins may be used many times to test samples. In some embodiments, the plurality of candidate sets may be provided by a process 310 as shown in fig. 3B and illustrated in fig. 3C. In some embodiments, the priority set of bins is selected from the plurality of candidate sets using a process such as process 350 shown in fig. 3D and illustrated in fig. 3E-3F.
In some embodiments, the process 300 further comprises comparing the number of variants of interest to a decision criterion to determine the presence or abundance of the variants of interest in the test sample. In some embodiments, the number of variants of interest in the set of priority bins is the allele frequency, and the criterion is 0.05%. Other numbers and decision criteria may be used in other embodiments for various conditions. For example, the number may be normalized to a reference amount (e.g., allele frequencies of a normalized sequence), and an appropriate criterion may be empirically determined. In some embodiments, the physical condition associated with the variant of interest may be determined based on the number of variants of interest.
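A minimal sketch of this comparison follows, assuming the quantity is an allele frequency and the criterion is 0.05% (0.0005). The function names, the `(size, has_variant)` fragment representation, the zero-based 5-nucleotide bin indexing, and the use of a strict "greater than" comparison are all illustrative assumptions rather than details stated in the text.

```python
def priority_allele_frequency(fragments, priority_bins, bin_width=5):
    """Allele frequency of the variant among fragments in the priority bins.

    fragments is an iterable of (size, has_variant) pairs, where has_variant
    marks fragments whose reads carry the variant of interest. Bins are
    indexed from zero, so bin 0 covers sizes 1..bin_width (an assumption).
    """
    total = 0
    variant = 0
    for size, has_variant in fragments:
        if (size - 1) // bin_width in priority_bins:
            total += 1
            variant += 1 if has_variant else 0
    return variant / total if total else 0.0


def call_variant(fragments, priority_bins, criterion=0.0005):
    """Call the variant present when the allele frequency exceeds the criterion."""
    return priority_allele_frequency(fragments, priority_bins) > criterion


# Toy data: 1 variant fragment among 1000 fragments of 146-150 nt, plus
# 1000 longer fragments that fall outside the priority bin.
toy = [(148, True)] + [(149, False)] * 999 + [(170, False)] * 1000
print(priority_allele_frequency(toy, {29}))  # 0.001
print(call_variant(toy, {29}))               # True
```

Restricting the count to the priority bins doubles the observed allele frequency in this toy example (1/1000 instead of 1/2000), which is the effect the bin selection is meant to exploit.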
Fig. 3B is a flow chart illustrating a process 310 for obtaining multiple candidate sets of bins. Process 310 begins with obtaining sequence reads and sizes of cfDNA fragments from one or more unaffected training samples known to be unaffected by the disorder of interest. See block 312. In some embodiments, the disorder of interest is known to be associated with the variant of interest. For example, the disorder of interest may be breast cancer and the variant of interest may be a mutation in the BRCA1 or BRCA2 gene. In some embodiments, the disorder of interest is a general disorder comprising a class of disorders associated with the variant of interest. For example, the variant of interest may be a BRCA mutation, while the disorder of interest may be cancer in general, including breast cancer, lung cancer, gastric cancer, and/or other forms of cancer. In the former example, the priority set of bins may be more specifically adapted to detect the type of cancer associated with the variant of interest. In the latter example, the candidate sets of bins obtained by process 310, and the priority set of bins selected from them, may generalize more broadly to various types of cancer.
In some embodiments, the condition of interest includes one or more cancers. In some embodiments, the disorder of interest includes cancer associated with a variant of interest. In some embodiments, the affected training sample comprises cancer cells, while the unaffected training sample comprises healthy cells.
The process 310 also involves obtaining sequence reads and sizes of cfDNA fragments from one or more affected training samples known to be affected by the one or more conditions of interest. See block 314.
The process 310 also includes distributing cfDNA fragments derived from one or more unaffected training samples into a plurality of bins based on their size. See block 316. cfDNA fragments from one or more unaffected training samples are assigned to a plurality of bins based on their size, resulting in a histogram corresponding to fragment length distribution 330 of fig. 3C.
Fig. 3C illustrates how data for normal cfDNA and tumor-derived DNA can be combined to model samples, such as plasma samples containing normal and tumor-associated cfDNA. Fig. 3C shows a fragment length distribution 330 of unaffected samples, a fragment length distribution 332 of tumor derived fragments, and a fragment length distribution 334 of modeled samples including normal and tumor cfDNA fragments, obtained by combining fragments from distribution 330 and distribution 332. For example, a plasma sample of a patient affected by a tumor (thus comprising normal cfDNA and ctDNA of tumor origin) may have a fragment length distribution similar to distribution 334.
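One way to picture the combination of distributions 330 and 332 into a modeled distribution such as 334 is as a weighted mixture of per-bin fragment densities. The sketch below assumes a simple linear mixture weighted by a tumor fraction; the function name `model_sample`, the list-of-densities representation, and the toy numbers are hypothetical and are not taken from the figures.

```python
def model_sample(tumor_density, normal_density, f_tumor):
    """Mix per-bin fragment densities of tumor and normal cfDNA.

    tumor_density and normal_density are per-bin fractions that each sum to 1.
    f_tumor is the assumed fraction of cfDNA that is tumor derived. A linear
    mixture is an illustrative assumption.
    """
    if len(tumor_density) != len(normal_density):
        raise ValueError("both distributions must use the same bins")
    return [f_tumor * t + (1.0 - f_tumor) * n
            for t, n in zip(tumor_density, normal_density)]


# Toy three-bin distributions: tumor fragments skew toward the shortest bin.
mixed = model_sample([0.5, 0.3, 0.2], [0.2, 0.3, 0.5], f_tumor=0.1)
print(mixed)
```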
The process 310 also involves assigning cfDNA fragments from the one or more affected training samples to the plurality of bins based on size. See block 318. When the condition of interest is a tumor, assigning cfDNA fragments from the affected training samples to the plurality of bins results in a histogram corresponding to fragment length distribution 332 in fig. 3C.
The process 310 also includes ranking each bin based on the ratio of the quantity of fragments from the one or more affected training samples assigned to the bin to the quantity of fragments from the one or more unaffected training samples assigned to the bin. See block 320. See also 342 of fig. 3C. In some embodiments, the quantity of fragments is the frequency of the fragments. In some embodiments, the quantities may be normalized relative to a baseline or reference level.
Each bin of the plurality of bins may contain fragments from both the unaffected training samples and the affected training samples. For example, bin 336 in fig. 3C covers fragments about 100 nucleotides in size and contains fragments from normal (unaffected) samples and fragments from cancer (affected) samples. In the illustration of fig. 3C, bin 336 includes three fragments from a normal sample and three fragments from a tumor sample. Thus, bin 336 has a tumor-to-normal fragment ratio of 1. Bin 340 also includes three fragments from normal samples and three fragments from tumor samples, so it also has a tumor-to-normal ratio of 1. Bin 338 includes 13 fragments from normal samples and 9 fragments from tumor samples, giving a tumor-to-normal ratio of 9/13. Thus, in operation 320, bins 336 and 340 are ranked higher than bin 338.
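The ranking of operation 320 can be sketched using the fragment counts given for bins 336, 338, and 340. The function `rank_bins` and the dictionary layout are hypothetical; skipping bins with no normal fragments is also an assumption made to avoid division by zero.

```python
def rank_bins(tumor_counts, normal_counts):
    """Rank bins by the ratio of affected (tumor) to unaffected (normal) fragments.

    Both arguments map a bin label to a fragment count. Bins with no normal
    fragments are skipped here (an illustrative choice). Ties keep their
    input order because Python's sort is stable.
    """
    ratios = {}
    for label, tumor in tumor_counts.items():
        normal = normal_counts.get(label, 0)
        if normal > 0:
            ratios[label] = tumor / normal
    ranked = sorted(ratios, key=lambda label: ratios[label], reverse=True)
    return ranked, ratios


# Counts from the fig. 3C illustration.
tumor = {336: 3, 338: 9, 340: 3}
normal = {336: 3, 338: 13, 340: 3}
ranked, ratios = rank_bins(tumor, normal)
print(ranked)        # [336, 340, 338]
print(ratios[338])   # 0.6923076923076923
```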
In fig. 3C, fragments including the wild-type variant are represented as open circles, and fragments including the tumor variant are represented as filled circles. In some embodiments, when bins are selected for the set of priority bins, bins that do not contain tumor variants, such as bin 336, are excluded in downstream processes.
Process 310 also includes selecting the highest-ranked bin as a candidate set. See block 322. In the example shown in fig. 3C, in some embodiments, either bin 336 or bin 340 may be selected as the candidate set. In other embodiments, when two or more bins, such as bins 336 and 340, are tied in ranking, other factors may be considered. For example, the number of fragments comprising a tumor variant may be considered to break the tie. Thus, in the example of fig. 3C, bin 340 is selected before bin 336 because bin 340 includes two fragments that comprise the tumor variant. Other factors that may be considered to break a tie include, but are not limited to: the total number of fragments in the bin, the number of fragments derived from the cancer sample, experimental considerations, and biological considerations.
Process 310 adds the bin with the next highest rating to the last candidate set to provide the next candidate set. See block 324. Operation 324 in fig. 3B corresponds to adding bin 336 to the last candidate set comprising bin 340 to provide the next candidate set. The next candidate set includes bin 340 and bin 336.
Process 310 determines whether there are more bins to consider. See block 326. If more bins are to be considered, the process repeats the last step by adding the bin with the next highest ranking to the last candidate set to provide the next candidate set. See the yes branch of decision 326, looping back to block 324. If no more bins remain to be considered, process 310 provides the candidate sets obtained. See block 328. See also operation 344 in fig. 3C. From the candidate sets, one set will be selected as the priority set of bins. In some embodiments, the priority set is selected using a process 350 shown in fig. 3D, which is also described with reference to fig. 3E.
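The loop of blocks 322 to 328 produces nested candidate sets, each extending the previous one by the next-ranked bin. The sketch below is one possible reading: `build_candidate_sets` is a hypothetical name, ties are broken by the number of variant-containing fragments (one of the factors mentioned above), and the variant count given for bin 338 is an invented placeholder because the text does not state it.

```python
def build_candidate_sets(ratios, variant_counts):
    """Build nested candidate sets from ranked bins.

    ratios maps a bin label to its affected/unaffected fragment ratio.
    variant_counts maps a bin label to the number of fragments carrying the
    variant of interest; it is used only to break ties (an assumption).
    """
    ranked = sorted(
        ratios,
        key=lambda label: (ratios[label], variant_counts.get(label, 0)),
        reverse=True,
    )
    # The k-th candidate set holds the k highest-ranked bins.
    return [ranked[:k] for k in range(1, len(ranked) + 1)]


ratios = {336: 1.0, 338: 9 / 13, 340: 1.0}
variant_counts = {336: 0, 338: 1, 340: 2}  # value for 338 is a placeholder
for candidate in build_candidate_sets(ratios, variant_counts):
    print(candidate)
# [340]
# [340, 336]
# [340, 336, 338]
```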
The flow chart shown in fig. 3D illustrates a process 350 for selecting the priority set of bins from a plurality of candidate sets. Process 350 begins by providing a plurality of candidate sets. In some embodiments, the candidate sets may be obtained through a process such as process 310 of fig. 3B. Each candidate set includes different bins from the plurality of bins. See block 352.
The process 350 also involves calculating, for each candidate set, a first probability (P1) that the allele frequency of the variant of interest of a modeled sample in the bins of the candidate set is below the limit of detection. See block 354. In some embodiments, the limit of detection is about 0.05% to 0.2%. In some embodiments, the limit of detection is about 0.2% or 0.05%. The modeled sample can be obtained by combining a normal sample associated with fragment length distribution 330 and a tumor sample associated with fragment length distribution 332. The modeled sample includes cfDNA fragments from cells not affected by the disorder of interest and cfDNA fragments from cells affected by the disorder of interest.
The process 350 also involves calculating, for each candidate set, a second probability (P2) that the allele frequency of the variant of interest of the modeled sample in the bins of the candidate set is higher than the allele frequency of the variant of interest in all bins of the plurality of bins. See block 356. The allele frequency of the variant of interest of the modeled sample in all bins of the plurality of bins is also referred to as the plasma allele frequency ($AF_{plasma}$).
The process 350 also involves selecting, as the priority set, the bins of the candidate set having the largest second probability among the candidate sets whose first probability does not exceed a threshold. See block 358. In some embodiments, bins that do not contain fragments from the affected (or tumor) sample are excluded from the priority set.
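The selection rule of block 358 can be sketched as a filter followed by a maximum. The function name `select_priority_set` and the tuple layout are hypothetical, the P1 and P2 values in the example are invented for illustration, and the default threshold of 0.002 follows the value mentioned for some embodiments.

```python
def select_priority_set(candidates, p1_threshold=0.002):
    """Pick the candidate set with the largest P2 among those with P1 <= threshold.

    candidates is a list of (bins, p1, p2) tuples. Returns the bins of the
    selected candidate set, or None if no candidate satisfies the P1 limit.
    """
    eligible = [c for c in candidates if c[1] <= p1_threshold]
    if not eligible:
        return None
    best = max(eligible, key=lambda c: c[2])
    return best[0]


# Invented probabilities for three nested candidate sets.
candidates = [
    ([340], 0.01, 0.99),               # strongest enrichment, but P1 too high
    ([340, 336], 0.001, 0.90),
    ([340, 336, 338], 0.0001, 0.60),
]
print(select_priority_set(candidates))  # [340, 336]
```

The example shows the trade-off: the smallest set has the highest P2 but too few fragments to stay reliably above the limit of detection, so the next set is chosen.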
Fig. 3E shows the fragment length distribution 360 of the normal sample and the fragment length distribution 362 of the tumor sample, and how the first probability (P1) and the second probability (P2) are obtained for the modeled sample. For a given bin (bin L), the frequency α(L) for the tumor sample and the frequency β(L) for the normal sample can be obtained. The frequencies α(L) and β(L) can be used to calculate the allele frequency in the bins of the modeled sample, as follows:

$$AF(L_{b_1,b_2 \ldots b_k}) = \frac{N_{mut}(L_{b_1,b_2 \ldots b_k})}{DP \cdot \left[ f_{tumor} \sum_{i=1}^{k} \alpha(L_{b_i}) + (1 - f_{tumor}) \sum_{i=1}^{k} \beta(L_{b_i}) \right]}$$

where
$AF(L_{b_1,b_2 \ldots b_k})$ is the allele frequency of the variant of interest in bins $L_{b_1}, L_{b_2} \ldots L_{b_k}$,
$N_{mut}(L_{b_1,b_2 \ldots b_k})$ is the count of the variant of interest in bins $L_{b_1}, L_{b_2} \ldots L_{b_k}$,
$DP$ is the sequencing depth,
$f_{tumor}$ is the fraction of cfDNA derived from cells having the variant of interest,
$\alpha(L_{b_i})$ is the density of fragments in bin $L_{b_i}$ in the fragment length distribution of the affected samples known to be affected by the one or more conditions of interest, and
$\beta(L_{b_i})$ is the density of fragments in bin $L_{b_i}$ in the fragment length distribution of the unaffected samples known to be unaffected by the condition of interest.
In certain embodiments, the count of the variant of interest in bins $L_{b_1}, L_{b_2} \ldots L_{b_k}$ may be modeled as a binomial distribution:

$$N_{mut}(L_{b_1,b_2 \ldots b_k}) \sim \mathrm{Binomial}\left( DP \cdot f_{tumor} \sum_{i=1}^{k} \alpha(L_{b_i}),\ AF_{tumor} \right)$$

where $AF_{tumor}$ is the allele frequency of the variant of interest in the tissue having the variant of interest.

In certain embodiments, $AF_{tumor}$ is calculated as:

$$AF_{tumor} = AF_{plasma} / f_{tumor}$$

where $AF_{plasma}$ is the allele frequency of the variant of interest of the modeled sample in all bins of the plurality of bins.
P1 and P2 can be obtained using the allele frequencies of the variants of interest (and their probability distributions) for the modeled samples in the bins of the candidate set, and the allele frequencies of the variants of interest (and their probability distributions) in all bins of the plurality of bins. After obtaining P1 and P2, the candidate set may be selected using data of two probabilities of the plurality of candidate sets, such that the selected candidate set has a maximum value in the second probability (P2) among candidate sets whose value of the first probability (P1) does not exceed a threshold value. See block 358. In some embodiments, the threshold is about 0.002.
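To make the two probabilities concrete, the following sketch evaluates them under a binomial model. Several things here are assumptions rather than statements of the described method: the number of binomial trials is taken to be the expected number of tumor-derived fragments in the candidate bins (depth times tumor fraction times the summed tumor density), the plasma allele frequency is treated as a fixed value when computing P2, and all names and example values are hypothetical.

```python
import math


def binom_pmf(k, n, p):
    """Binomial probability mass, computed in log space for numerical stability."""
    if p <= 0.0:
        return 1.0 if k == 0 else 0.0
    if p >= 1.0:
        return 1.0 if k == n else 0.0
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log1p(-p))
    return math.exp(log_pmf)


def candidate_probabilities(alpha_sum, beta_sum, depth, f_tumor, af_plasma, lod):
    """Return (P1, P2) for one candidate set of bins.

    alpha_sum / beta_sum: summed tumor / normal fragment densities over the bins.
    P1: probability that the allele frequency in the bins is below lod.
    P2: probability that it is above af_plasma (treated as fixed; an assumption).
    """
    af_tumor = af_plasma / f_tumor
    if not 0.0 <= af_tumor <= 1.0:
        raise ValueError("af_plasma / f_tumor must lie in [0, 1]")
    n_tumor = round(depth * f_tumor * alpha_sum)  # tumor-derived fragments in bins
    coverage = depth * (f_tumor * alpha_sum + (1.0 - f_tumor) * beta_sum)
    if coverage <= 0.0:
        raise ValueError("candidate set contains no fragments")
    p1 = 0.0
    p2 = 0.0
    for k in range(n_tumor + 1):
        mass = binom_pmf(k, n_tumor, af_tumor)
        af = k / coverage
        if af < lod:
            p1 += mass
        if af > af_plasma:
            p2 += mass
    return p1, p2


# Bins enriched for tumor fragments versus bins depleted of them.
print(candidate_probabilities(0.5, 0.2, 10000, 0.1, 0.005, 0.0005))
print(candidate_probabilities(0.2, 0.5, 10000, 0.1, 0.005, 0.0005))
```

In this toy setting the tumor-enriched bins give a P2 close to 1 and the tumor-depleted bins give a P2 close to 0, which is the contrast the selection step relies on.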
Fig. 3F plots the two probabilities for multiple candidate sets. Data point 370 has the largest P2 value among all data points (and therefore also the largest among the data points whose P1 value is below the threshold). The P1 value of data point 370 is below a threshold (e.g., on a logarithmic scale of 200). Accordingly, the candidate set corresponding to data point 370 is selected as the priority set. The priority set of bins is labeled bin set 372 in the inset of fig. 3F. In some embodiments, bins that do not contain fragments with the variant of interest (e.g., a tumor variant) are excluded from the priority set of bins.
In some embodiments, the second probability (P2) is considered when selecting the priority set of bins, while the first probability (P1) is optionally considered. In some embodiments, a method for analyzing cell-free DNA comprises: (a) obtaining, by a computer system, sequence reads and sizes of cfDNA fragments derived from a test sample; (b) assigning, by one or more processors, the cfDNA fragments to a plurality of bins representing different fragment sizes; and (c) determining, by the one or more processors and using the sequence reads, a quantity of the variant of interest in a priority set of bins selected from the plurality of bins, wherein the priority set of bins is selected by: (i) providing a plurality of candidate sets, each candidate set comprising different bins from the plurality of bins; (ii) for each candidate set, calculating a second probability that the allele frequency of the variant of interest in the bins of the candidate set in a modeled sample is higher than the allele frequency of the variant of interest in all bins of the plurality of bins in the modeled sample; and (iii) selecting, from the plurality of candidate sets, the bins of the candidate set having the largest second probability as the priority set. That each candidate set comprises different bins means that each candidate set has bins that differ from those of the other candidate sets.
In some embodiments, the method further comprises, prior to (iii) and for each candidate set, calculating a first probability that the allele frequency of the variant of interest in the bins of the candidate set in the modeled sample is below the limit of detection, wherein (iii) comprises selecting, as the priority set, the bins of the candidate set having the largest second probability among the candidate sets whose first probability does not exceed a threshold.
Fragment length may also improve the performance of CNV calling. Methods and systems for determining CNV using fragment size, fragment sequence, and sequence coverage are provided in U.S. patent application Ser. No. 15/382,508, which is incorporated herein by reference in its entirety. Briefly, CNV calling is typically performed by comparing the coverage of genomic regions (bins) to a baseline. The baseline may be a set of controls, or a control region in the sample where no copy number change is expected. If bin coverage is compared to a set of controls, fragment size can be used as an independent feature to support CNV calls.
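The coverage comparison mentioned here can be illustrated with a generic log-ratio calculation. This is not the method of the referenced application: the use of the per-bin median of control samples as the baseline, the log2 transform, and the name `coverage_log2_ratios` are all assumptions chosen for a simple sketch.

```python
import math
from statistics import median


def coverage_log2_ratios(test_coverage, control_coverages):
    """Compare per-bin coverage of a test sample to a control baseline.

    test_coverage: list of coverage values, one per genomic bin.
    control_coverages: list of such lists, one per control sample.
    The baseline for each bin is the median across controls (an assumption).
    A ratio near 0 suggests no copy number change for that bin.
    """
    ratios = []
    for index, coverage in enumerate(test_coverage):
        baseline = median(sample[index] for sample in control_coverages)
        ratios.append(math.log2(coverage / baseline))
    return ratios


controls = [[100, 100, 100], [90, 110, 100], [110, 90, 100]]
print(coverage_log2_ratios([100, 150, 50], controls))
```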
Sample and sample processing
Samples
The samples used for calling variants or determining CNV comprise "cell-free" nucleic acids (e.g., cfDNA). Cell-free nucleic acids, including cell-free DNA, can be obtained from biological samples including, but not limited to, plasma, serum, and urine (see, e.g., Fan et al., Proc Natl Acad Sci 105:16266-16271 [2008]; Koide et al., Prenatal Diagnosis 25:604-607 [2005]; Chen et al., Nature Med. 2:1033-1035 [1996]; Lo et al., Lancet 350:485-487 [1997]; Botezatu et al., Clin Chem. 46:1078-1084 [2000]; and Su et al., J Mol. Diagn. 6:101-107 [2004]). To separate cell-free DNA from cells in a sample, various methods may be used, including but not limited to: fractionation, centrifugation (e.g., density gradient centrifugation), DNA-specific precipitation, or high-throughput cell sorting and/or other separation methods. Commercial kits for manual and automated isolation of cfDNA are available (Roche Diagnostics, Indianapolis, IN; Qiagen, Valencia, CA; Macherey-Nagel, Duren, DE). Biological samples comprising cfDNA have been used to determine the presence or absence of chromosomal abnormalities, such as trisomy 21, by sequencing assays that can detect chromosomal aneuploidies and/or various polymorphisms.
In various embodiments, cfDNA present in a sample may be enriched specifically or non-specifically prior to use (e.g., prior to preparation of a sequencing library). Non-specific enrichment of sample DNA refers to whole genome amplification of the genomic DNA fragments of the sample, which can be used to increase the level of sample DNA prior to preparing a cfDNA sequencing library. The non-specific enrichment may be a selective enrichment of one of two genomes present in a sample comprising more than one genome. For example, the non-specific enrichment may be selective for the cancer genome in a plasma sample, which may be achieved by known methods to increase the relative proportion of cancer DNA to normal DNA in the sample. Alternatively, the non-specific enrichment may be a non-selective amplification of both genomes present in the sample. For example, the non-specific amplification may be amplification of both cancer and normal DNA in a sample comprising a mixture of DNA from cancer and normal genomes. Methods for whole genome amplification are known in the art. Degenerate oligonucleotide-primed PCR (DOP), primer extension PCR (PEP), and multiple displacement amplification (MDA) are examples of whole genome amplification methods. In some embodiments, a sample comprising a mixture of cfDNA from different genomes is not enriched for cfDNA of the genomes present in the mixture. In other embodiments, a sample comprising a mixture of cfDNA from different genomes is non-specifically enriched for any one of the genomes present in the sample.
As described above, samples comprising nucleic acids to which the methods described herein are applied generally include, for example, biological samples ("test samples") as described above. In some embodiments, the one or more nucleic acids to be screened for one or more SNVs or CNVs are purified or isolated by any of a number of well-known methods.
Thus, in certain embodiments, the sample comprises or consists of a purified or isolated polynucleotide, or it may comprise a sample such as a tissue sample, a biological fluid sample, a cell sample, or the like. Suitable biological fluid samples include, but are not limited to: blood, plasma, serum, sweat, tears, sputum, urine, sputum, ear exudates, lymph, saliva, cerebrospinal fluid, residues, bone marrow suspensions, vaginal exudates, transcervical lavage fluid, cerebral fluid, ascites fluid, milk, secretions of the respiratory, intestinal and genitourinary tracts, amniotic fluid, milk and leukocyte samples. In some embodiments, the sample is one that is readily obtained by non-invasive procedures, such as blood, plasma, serum, sweat, tears, sputum, urine, sputum, ear exudates, saliva, or stool. In certain embodiments, the sample is a peripheral blood sample, or a plasma and/or serum fraction of a peripheral blood sample. In other embodiments, the biological sample is a swab or smear, a biopsy specimen, or a cell culture. In another embodiment, the sample is a mixture of two or more biological samples, e.g., the biological sample may comprise two or more of a biological fluid sample, a tissue sample, and a cell culture sample. As used herein, the terms "blood," "plasma," and "serum" expressly encompass fractions or processed portions thereof. Similarly, where a sample is obtained from a biopsy, swab, smear, or the like, "sample" expressly encompasses a processed fraction or portion derived from the biopsy, swab, smear, or the like.
In certain embodiments, samples may be obtained from sources including, but not limited to: samples from different individuals, samples from different developmental stages of the same or different individuals, samples from different diseased individuals (e.g., individuals with cancer or suspected of having a genetic disease), samples from normal individuals, samples obtained at different stages of a disease in an individual, samples obtained from individuals subjected to different treatments for a disease, samples from individuals exposed to different environmental factors, samples from individuals predisposed to a disease, samples from individuals exposed to an infectious agent (e.g., HIV), and the like.
In an exemplary but non-limiting embodiment, the sample is a maternal sample obtained from a pregnant female, such as a pregnant woman. In this case, the sample may be analyzed using the methods described herein to provide prenatal diagnosis of potential chromosomal abnormalities in the fetus. The maternal sample may be a tissue sample, a biological fluid sample, or a cell sample. As non-limiting examples, biological fluids include: blood, plasma, serum, sweat, tears, sputum, urine, ear exudates, lymph, saliva, cerebrospinal fluid, residues, bone marrow suspensions, vaginal exudates, transcervical lavage fluid, cerebral fluid, ascites fluid, milk, secretions of the respiratory, intestinal and genitourinary tracts, and white blood cell samples.
In another exemplary but non-limiting embodiment, the maternal sample is a mixture of two or more biological samples, e.g., the biological sample may comprise two or more of a biological fluid sample, a tissue sample, and a cell culture sample. In some embodiments, the sample is one that is readily obtainable by non-invasive procedures, such as blood, plasma, serum, sweat, tears, sputum, urine, milk, ear exudates, saliva, and stool. In some embodiments, the biological sample is a peripheral blood sample, and/or plasma and serum fractions thereof. In other embodiments, the biological sample is a swab or smear, a biopsy sample, or a sample of a cell culture. As mentioned above, the terms "blood", "plasma" and "serum" expressly encompass fractions or processed portions thereof. Similarly, where a sample is obtained from a biopsy, swab, smear, or the like, "sample" expressly encompasses a processed fraction or portion derived from the biopsy, swab, smear, or the like.
In certain embodiments, the sample may also be obtained from tissue, cells or other sources containing polynucleotides cultured in vitro. The cultured samples may be taken from sources including, but not limited to: cultures (e.g., tissues or cells) maintained under different media and conditions (e.g., pH, pressure, or temperature), cultures (e.g., tissues or cells) maintained for different lengths of time, cultures (e.g., tissues or cells) treated with different factors or agents (e.g., drug candidates or modulators), or cultures of different types of tissues and/or cells.
Methods for isolating nucleic acids from biological sources are well known and will vary depending on the nature of the source. The skilled artisan can readily isolate nucleic acids from a source as needed for the methods described herein. In some cases, it may be advantageous to fragment the nucleic acid molecules in a nucleic acid sample. Fragmentation can be random, or it can be specific, for example, by digestion with restriction endonucleases. Methods of random fragmentation are well known in the art and include, for example, limited DNase digestion, alkali treatment, and physical shearing. In one embodiment, the sample nucleic acid is obtained in the form of cfDNA, which is not subjected to fragmentation.
Sequencing library preparation
In one embodiment, the methods described herein can utilize next-generation sequencing (NGS) technologies, which allow multiple samples to be sequenced in a single sequencing run, either individually as genomic molecules (i.e., singleplex sequencing) or as pooled samples comprising indexed genomic molecules (e.g., multiplex sequencing). These methods can generate up to several hundred million reads of DNA sequences. In various embodiments, the sequences of genomic nucleic acids and/or indexed genomic nucleic acids may be determined using, for example, the next-generation sequencing (NGS) technologies described herein. In various embodiments, the large amount of sequence data obtained using NGS may be analyzed using one or more processors as described herein.
In various embodiments, the use of such sequencing techniques does not involve the preparation of a sequencing library.
However, in certain embodiments, the sequencing methods contemplated herein involve the preparation of a sequencing library. In one exemplary method, the preparation of a sequencing library involves producing a random collection of adaptor-modified DNA fragments (e.g., polynucleotides) that are ready to be sequenced. Sequencing libraries of polynucleotides may be prepared from DNA or RNA, including equivalents or analogs of either DNA or cDNA, for example, DNA or cDNA that is complementary or copy DNA produced from an RNA template by the action of reverse transcriptase. The polynucleotides may originate in double-stranded form (e.g., dsDNA, such as genomic DNA fragments, cDNA, PCR amplification products, etc.), or, in certain embodiments, the polynucleotides may originate in single-stranded form (e.g., ssDNA, RNA, etc.) and have been converted to dsDNA form. For example, in certain embodiments, single-stranded mRNA molecules may be copied into double-stranded cDNAs suitable for use in preparing a sequencing library. The precise sequence of the primary polynucleotide molecules is generally not critical to the library preparation method and may be known or unknown. In one embodiment, the polynucleotide molecules are DNA molecules. More particularly, in certain embodiments, the polynucleotide molecules represent the entire genetic complement of an organism, or substantially the entire genetic complement of an organism, and are genomic DNA molecules (e.g., cellular DNA, cell-free DNA (cfDNA), etc.), which typically include both intron sequences and exon (coding) sequences, as well as non-coding regulatory sequences such as promoter and enhancer sequences. In certain embodiments, the primary polynucleotide molecules comprise human genomic DNA molecules, such as cfDNA molecules present in the peripheral blood of a pregnant individual.
By using polynucleotides comprising a specific range of fragment sizes, the preparation of sequencing libraries for certain NGS sequencing platforms can be simplified. The preparation of such libraries typically involves fragmentation of large polynucleotides (e.g., cellular genomic DNA) to obtain polynucleotides of a desired size range.
Fragmentation can be accomplished by any of a variety of methods known to those skilled in the art. For example, fragmentation may be accomplished by mechanical means including, but not limited to, nebulization, sonication, and hydrodynamic shearing. However, mechanical fragmentation typically cleaves the DNA backbone at C-O, P-O, and C-C bonds, resulting in a heterogeneous mix of blunt ends and 3'- and 5'-overhanging ends with broken C-O, P-O, and/or C-C bonds (see, e.g., Alnemri and Litwack, J Biol Chem 265:17323-17333 [1990]; Richards and Boyer, J Mol Biol 11:327-240 [1965]). These ends may need to be repaired, because they may lack the 5'-phosphate required for subsequent enzymatic reactions, e.g., ligation of the sequencing adaptors.
In contrast, cfDNA typically exists in fragments of less than about 300 base pairs, and thus, generation of a sequencing library using cfDNA samples typically does not require fragmentation.
In general, polynucleotides, whether forcibly fragmented (e.g., fragmented in vitro) or naturally existing as fragments, are converted to blunt-ended DNA having 5'-phosphates and 3'-hydroxyl groups. Standard protocols, e.g., protocols for sequencing using the Illumina platform described elsewhere herein, instruct the user to end-repair the sample DNA, to purify the end-repaired product prior to dA-tailing, and to purify the dA-tailed product prior to the adaptor ligation step of the library preparation.
Various embodiments of the sequencing library preparation methods described herein avoid the need for one or more of the steps that standard protocols typically mandate in order to obtain a modified DNA product that can be sequenced by NGS. The abbreviated method (ABB method), the 1-step method, and the 2-step method are examples of methods for preparing sequencing libraries, which can be found in patent application No. 13/555,037, filed on July 20, 2012, which is incorporated herein by reference in its entirety.
Sequencing method
As described above, the prepared samples (e.g., sequencing libraries) are sequenced as part of the process of identifying SNV or CNV. Any of a variety of sequencing techniques may be used.
Several sequencing technologies are commercially available, such as the sequencing-by-hybridization platform from Affymetrix Inc. (Sunnyvale, Calif.), the sequencing-by-synthesis platforms from 454 Life Sciences (Bradford, CT), Illumina/Solexa (Hayward, Calif.), and Helicos Biosciences (Cambridge, Mass.), and the sequencing-by-ligation platform from Applied Biosystems (Foster City, Calif.), as described below. In addition to single molecule sequencing performed using the sequencing by synthesis of Helicos Biosciences, other single molecule sequencing technologies include, but are not limited to: the SMRT™ technology of Pacific Biosciences, the Ion Torrent™ technology, and nanopore sequencing developed by, for example, Oxford Nanopore Technologies.
While the automated Sanger method is considered a "first generation" technology, Sanger sequencing (including automated Sanger sequencing) may also be used in the methods described herein. Other suitable sequencing methods include, but are not limited to, nucleic acid imaging technologies such as atomic force microscopy (AFM) or transmission electron microscopy (TEM). Exemplary sequencing technologies are described in more detail below.
In one exemplary but non-limiting embodiment, the methods described herein include obtaining sequence information for nucleic acids in a test sample, e.g., cfDNA in a maternal sample, cfDNA or cellular DNA in an individual being screened for cancer, etc., using Illumina's sequencing by synthesis and reversible terminator-based sequencing chemistry (e.g., as described in Bentley et al., Nature 6:53-59 [2009]). The template DNA may be genomic DNA, e.g., cellular DNA or cfDNA. In some embodiments, genomic DNA from isolated cells is used as the template and is fragmented into lengths of several hundred base pairs. In other embodiments, cfDNA is used as the template, and fragmentation is not required because cfDNA exists as short fragments. For example, fetal cfDNA circulates in the bloodstream as fragments of about 170 base pairs (bp) in length (Fan et al., Clin Chem 56:1279-1286 [2010]), and no fragmentation of the DNA is required prior to sequencing. Circulating tumor DNA is also present as short fragments, with a size distribution peaking at about 150-170 bp. Illumina's sequencing technology relies on attaching fragmented genomic DNA to a planar, optically transparent surface on which oligonucleotide anchors are bound. The template DNA is end-repaired to generate 5'-phosphorylated blunt ends, and the polymerase activity of the Klenow fragment is used to add a single A base to the 3' end of the blunt phosphorylated DNA fragments. This addition prepares the DNA fragments for ligation to oligonucleotide adaptors, which have an overhang of a single T base at their 3' end to increase ligation efficiency. The adaptor oligonucleotides are complementary to the flow cell anchor oligonucleotides (not to be confused with the anchor/anchored reads in the analysis of repeat expansion). Under limiting-dilution conditions, adaptor-modified single-stranded template DNA is added to the flow cell and immobilized by hybridization to the anchor oligonucleotides.
The attached DNA fragments are extended and bridge amplified to create an ultra-high density sequencing flow cell with hundreds of millions of clusters, each containing about 1,000 copies of the same template. In one embodiment, the randomly fragmented genomic DNA is amplified using PCR before it is subjected to cluster amplification. Alternatively, an amplification-free (e.g., PCR-free) genomic library preparation is used, and the randomly fragmented genomic DNA is enriched using cluster amplification alone (Kozarewa et al., Nature Methods 6:291-295 [2009]). The templates are sequenced using a robust four-color DNA sequencing-by-synthesis technology that employs reversible terminators with removable fluorescent dyes. High-sensitivity fluorescence detection is achieved using laser excitation and total internal reflection optics. Short sequence reads of about tens to a few hundred base pairs are aligned against a reference genome, and unique mappings of the short sequence reads to the reference genome are identified using specially developed data analysis pipeline software. After the first read is completed, the templates can be regenerated in situ to enable a second read from the opposite end of the fragments. Thus, either single-ended or double-ended sequencing of the DNA fragments can be used.
Various embodiments of the application may use sequencing by synthesis that allows for double-ended sequencing. In some embodiments, Illumina's sequencing-by-synthesis platform involves clustering fragments. Clustering is a process in which each fragment molecule is isothermally amplified. In some embodiments, as in the example described here, the fragment has two different adaptors attached to its two ends, which allow the fragment to hybridize with two different oligonucleotides on the surface of a flow cell lane. The fragment further includes, or is connected to, two index sequences at its two ends, which provide labels to identify different samples in multiplex sequencing. In some sequencing platforms, the fragment to be sequenced is also referred to as an insert.
In some embodiments, the flow cell used for clustering in the Illumina platform is a glass slide with channels. Each channel is a glass channel coated with a "lawn" of two types of oligonucleotides. Hybridization is enabled by the first of the two types of oligonucleotides on the surface. This oligonucleotide is complementary to a first adaptor on one end of the fragment. A polymerase creates a complementary strand of the hybridized fragment. The double-stranded molecule is denatured, and the original template strand is washed away. The remaining strand, in parallel with many other remaining strands, is clonally amplified through bridge amplification.
In bridge amplification, a strand folds over, and a second adaptor region on the second end of the strand hybridizes with the second type of oligonucleotide on the flow cell surface. A polymerase generates a complementary strand, forming a double-stranded bridge molecule. This double-stranded molecule is denatured, resulting in two single-stranded molecules tethered to the flow cell through two different oligonucleotides. The process is then repeated over and over, and it occurs simultaneously for millions of clusters, resulting in clonal amplification of all the fragments. After bridge amplification, the reverse strands are cleaved and washed off, leaving only the forward strands. The 3' ends are blocked to prevent unwanted priming.
After clustering, sequencing starts with the extension of a first sequencing primer to generate the first read. With each cycle, fluorescently tagged nucleotides compete for addition to the growing chain. Only one is incorporated, based on the sequence of the template. After the addition of each nucleotide, the cluster is excited by a light source, and a characteristic fluorescent signal is emitted. The number of cycles determines the length of the read. The emission wavelength and the signal intensity determine the base call. For a given cluster, all identical strands are read simultaneously. Thousands of clusters are sequenced in a massively parallel manner. At the completion of the first read, the read product is washed away.
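The per-cycle base determination described above can be sketched as follows. This is a minimal illustration, not the instrument's actual base-calling algorithm: the channel order `CHANNELS`, the function name `call_bases`, and the `min_purity` threshold are all assumptions introduced for the example.

```python
# Hypothetical channel order; real instruments define their own dye-to-base mapping.
CHANNELS = ("A", "C", "G", "T")


def call_bases(intensities, min_purity=0.6):
    """Call one base per sequencing cycle for a single cluster.

    intensities: list of 4-tuples of fluorescence intensity, one tuple per cycle,
                 ordered as in CHANNELS.
    min_purity:  assumed threshold on (brightest channel / total signal);
                 cycles below it are reported as 'N' (no confident call).
    """
    read = []
    for cycle in intensities:
        total = sum(cycle)
        brightest = max(range(len(CHANNELS)), key=lambda i: cycle[i])
        if total <= 0 or cycle[brightest] / total < min_purity:
            read.append("N")
        else:
            read.append(CHANNELS[brightest])
    return "".join(read)


# Four cycles of made-up intensities for one cluster.
cluster_signal = [
    (0.90, 0.05, 0.03, 0.02),  # clearly A
    (0.10, 0.10, 0.70, 0.10),  # clearly G
    (0.30, 0.30, 0.20, 0.20),  # ambiguous, no channel dominates
    (0.00, 0.02, 0.03, 0.95),  # clearly T
]
read = call_bases(cluster_signal)
```

The read length equals the number of cycles, as stated in the text; the purity filter is only one plausible way to express "signal intensity determines the base call".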
In the next step of protocols involving two index primers, an index 1 primer is introduced and hybridized to the index 1 region on the template. Index regions provide identification of fragments, which is useful for de-multiplexing samples in a multiplex sequencing process. The index 1 read is generated in a manner similar to the first read. After completion of the index 1 read, the read product is washed away and the 3' end of the strand is deprotected. The template strand then folds over and binds to a second oligonucleotide on the flow cell. The index 2 sequence is read in the same manner as index 1. The index 2 read product is then washed away at the completion of this step.
After reading the two indices, read 2 is initiated by using a polymerase to extend the second flow cell oligonucleotide, forming a double-stranded bridge. This double-stranded DNA is denatured, and the 3' end is blocked. The original forward strand is cleaved off and washed away, leaving the reverse strand. Read 2 begins with the introduction of a read 2 sequencing primer. As with read 1, the sequencing steps are repeated until the desired length is reached. The read 2 product is washed away. This entire process generates millions of reads, representing all the fragments. Sequences from pooled sample libraries are separated according to the unique indices introduced during sample preparation. For each sample, reads with similar stretches of base calls are locally clustered. Forward and reverse reads are paired to create contiguous sequences. These contiguous sequences are aligned to the reference genome to identify variants.
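Separating pooled reads by index can be illustrated with a short sketch. It assumes exact index matching and a hypothetical sample sheet (`sample_sheet`) mapping index sequences to sample names; production demultiplexers typically also tolerate index mismatches and handle dual indices, which this example omits.

```python
from collections import defaultdict


def demultiplex(reads, sample_by_index):
    """Group reads from a pooled run by the sample their index identifies.

    reads:           iterable of (index_sequence, read_sequence) pairs.
    sample_by_index: dict mapping an index sequence to a sample name
                     (a hypothetical sample sheet).
    Reads whose index is not in the sample sheet are collected under
    'undetermined'.
    """
    by_sample = defaultdict(list)
    for index_seq, read_seq in reads:
        sample = sample_by_index.get(index_seq, "undetermined")
        by_sample[sample].append(read_seq)
    return dict(by_sample)


# Made-up indices and reads for illustration only.
sample_sheet = {"ACGTAC": "sample_1", "TTGCAA": "sample_2"}
pooled = [
    ("ACGTAC", "GATTACA"),
    ("TTGCAA", "CCGGTTA"),
    ("ACGTAC", "TTAGGCA"),
    ("GGGGGG", "AAAAAAA"),  # index not in the sample sheet
]
demuxed = demultiplex(pooled, sample_sheet)
```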
The sequencing-by-synthesis example described above involves double-ended reads, which are used in many embodiments of the disclosed methods. Double-ended sequencing involves two reads, one from each end of a fragment. When a pair of reads is mapped to a reference sequence, the base-pair distance between the two reads can be determined, and this distance can then be used to determine the length of the fragment from which the reads were obtained. In some instances, a fragment straddling two bins will have one of its pair of end reads aligned to one bin and the other aligned to an adjacent bin. This becomes rarer as the bins get longer or the reads get shorter. Various methods may be used to account for the bin membership of these fragments. For example, they can be omitted when determining the fragment size frequency of a bin; they can be counted for both of the adjacent bins; they can be assigned to the bin that contains the larger number of the fragment's base pairs; or they can be assigned to both bins with weights related to the portion of base pairs in each bin.
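The fragment-length calculation and the four bin-membership options listed above can be sketched as follows. The coordinate convention (0-based, half-open intervals), the function names, and the policy labels `"omit"`, `"both"`, `"majority"`, and `"weighted"` are assumptions made for this illustration and do not come from the source.

```python
def fragment_length(start, end):
    """Length of a fragment spanning [start, end) on the reference.

    start: leftmost aligned base of the forward read (0-based).
    end:   one past the rightmost aligned base of the reverse read.
    """
    return end - start


def bin_weights(start, end, bin_size):
    """Fraction of the fragment's base pairs that fall in each overlapped bin."""
    length = end - start
    weights = {}
    for b in range(start // bin_size, (end - 1) // bin_size + 1):
        lo = max(start, b * bin_size)
        hi = min(end, (b + 1) * bin_size)
        weights[b] = (hi - lo) / length
    return weights


def assign_bins(start, end, bin_size, policy="weighted"):
    """Return {bin_index: weight} for a fragment under one of four policies."""
    weights = bin_weights(start, end, bin_size)
    if len(weights) == 1 or policy == "weighted":
        return weights                      # fractional weight per bin
    if policy == "omit":
        return {}                           # drop fragments that straddle bins
    if policy == "both":
        return {b: 1.0 for b in weights}    # count once in every overlapped bin
    if policy == "majority":
        best = max(weights, key=weights.get)
        return {best: 1.0}                  # bin holding most of the fragment
    raise ValueError(f"unknown policy: {policy}")


# A 167 bp fragment straddling the boundary between bins 0 and 1 (100 kb bins).
start, end, bin_size = 99_900, 100_067, 100_000
weighted = assign_bins(start, end, bin_size, "weighted")
majority = assign_bins(start, end, bin_size, "majority")
```

Here 100 of the 167 bp fall in bin 0 and 67 bp in bin 1, so the "majority" policy assigns the fragment to bin 0, while the "weighted" policy splits it roughly 0.60 to 0.40.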
Double-ended reads may use inserts of different lengths (i.e., different sizes of the fragments to be sequenced). As the default meaning in the present application, double-ended reads refer to reads obtained from inserts of various lengths. In some instances, to distinguish short-insert double-ended reads from long-insert double-ended reads, the latter are also referred to as mate-pair reads. In some embodiments involving mate-pair reads, two biotin-linked adaptors are first attached to the two ends of a relatively long insert (e.g., several kb). The biotin-linked adaptors then link the two ends of the insert to form a circularized molecule. The circularized molecule can then be further fragmented to obtain a sub-fragment that contains the biotin-linked adaptors. The sub-fragment, which includes the two ends of the original fragment in opposite sequence order, can then be sequenced by the same procedure as for the short-insert double-ended sequencing described above. More details of paired-end sequencing using the Illumina platform are shown in an online publication at the following URL, which is incorporated herein by reference in its entirety: res. Additional information about double-ended sequencing can be found in U.S. Patent No. 7601499 and U.S. Patent Publication No. 2012/0053063, which are incorporated herein by reference for their material on double-ended sequencing methods and apparatus.
After the DNA fragments are sequenced, sequence reads of a predetermined length, e.g., 100 bp, are mapped or aligned to a known reference genome. The mapped or aligned reads and their corresponding locations on the reference sequence are also referred to as tags. In one embodiment, the reference genome sequence is the NCBI36/hg18 sequence, which is available on the World Wide Web at genome.ucsc.edu/cgi-bin/hgGateway?org=Human&db=hg18&hgsid=166260105. Alternatively, the reference genome sequence is GRCh37/hg19, which is available on the World Wide Web at genome.ucsc.edu/cgi-bin/hgGateway. Other sources of public sequence information include GenBank, dbEST, dbSTS, EMBL (the European Molecular Biology Laboratory), and DDBJ (the DNA Databank of Japan). Many computer algorithms are available for aligning sequences, including but not limited to: BLAST (Altschul et al., 1990), BLITZ (MPsrch) (Sturrock & Collins, 1993), FASTA (Pearson & Lipman, 1988), BOWTIE (Langmead et al., Genome Biology 10:R25.1-R25.10 [2009]), or ELAND (Illumina, Inc., San Diego, CA, USA). In one embodiment, one end of the clonally expanded copies of the plasma cfDNA molecules is sequenced and processed by bioinformatic alignment analysis for the Illumina Genome Analyzer, which uses the Efficient Large-Scale Alignment of Nucleotide Databases (ELAND) software.
In one exemplary but non-limiting embodiment, the methods described herein include obtaining sequence information for nucleic acids in a test sample, e.g., cfDNA in a maternal sample, cfDNA or cellular DNA in an individual being screened for cancer, etc., using the true single molecule sequencing (tSMS) technology of Helicos (e.g., as described in Harris T.D. et al., Science 320:106-109 [2008]). In the tSMS technique, a DNA sample is cleaved into strands of approximately 100 to 200 nucleotides, and a polyA sequence is added to the 3' end of each DNA strand. Each strand is labeled by the addition of a fluorescently labeled adenosine nucleotide. The DNA strands are then hybridized to a flow cell, which contains millions of oligo-T capture sites immobilized on the flow cell surface. In certain embodiments, the density of templates may be about 100 million templates/cm². The flow cell is then loaded into an instrument (e.g., a HeliScope™ sequencer), and a laser illuminates the surface of the flow cell, revealing the position of each template. A CCD camera can map the position of the templates on the flow cell surface. The template fluorescent label is then cleaved and washed away. The sequencing reaction begins with the introduction of a DNA polymerase and a fluorescently labeled nucleotide. The oligo-T nucleic acid serves as a primer. The polymerase incorporates the labeled nucleotides into the primer in a template-directed manner. The polymerase and unincorporated nucleotides are then removed. The templates that have directed incorporation of the fluorescently labeled nucleotide are identified by imaging the flow cell surface. After imaging, a cleavage step removes the fluorescent label, and the process is repeated with other fluorescently labeled nucleotides until the desired read length is achieved. Sequence information is collected with each nucleotide addition step.
Whole genome sequencing by single molecule sequencing technologies excludes, or typically avoids, PCR-based amplification in the preparation of sequencing libraries, and the method allows direct measurement of the sample rather than measurement of copies of the sample.
In another exemplary but non-limiting embodiment, the methods described herein include obtaining sequence information for nucleic acids in a test sample, e.g., cfDNA in a maternal test sample, cfDNA or cellular DNA in an individual being screened for cancer, etc., using 454 sequencing (Roche) (e.g., as described in Margulies, M. et al., Nature 437:376-380 [2005]). 454 sequencing typically involves two steps. In the first step, DNA is sheared into fragments of approximately 300-800 base pairs, and the fragments are blunt-ended. Oligonucleotide adaptors are then ligated to the ends of the fragments. The adaptors serve as primers for amplification and sequencing of the fragments. The fragments can be attached to DNA capture beads, e.g., streptavidin-coated beads, using, for example, adaptor B, which contains a 5'-biotin tag. The fragments attached to the beads are PCR amplified within droplets of an oil-water emulsion. The result is multiple copies of clonally amplified DNA fragments on each bead. In the second step, the beads are captured in wells (e.g., picoliter-sized wells). Pyrosequencing is performed on each DNA fragment in parallel. The addition of one or more nucleotides generates a light signal that is recorded by a CCD camera in the sequencing instrument. The signal strength is proportional to the number of nucleotides incorporated. Pyrosequencing makes use of pyrophosphate (PPi), which is released upon nucleotide addition. PPi is converted to ATP by ATP sulfurylase in the presence of adenosine 5' phosphosulfate. Luciferase uses ATP to convert luciferin to oxyluciferin, and the light generated by this reaction is measured and analyzed.
In another exemplary but non-limiting embodiment, the methods described herein include obtaining sequence information for nucleic acids in a test sample, e.g., cfDNA in a maternal test sample, cfDNA or cellular DNA in an individual being screened for cancer, etc., using SOLiD™ technology (Applied Biosystems). In SOLiD™ sequencing by ligation, genomic DNA is sheared into fragments, and adaptors are attached to the 5' and 3' ends of the fragments to generate a fragment library. Alternatively, internal adaptors can be introduced by ligating adaptors to the 5' and 3' ends of the fragments, circularizing the fragments, digesting the circularized fragments to generate an internal adaptor, and attaching adaptors to the 5' and 3' ends of the resulting fragments to generate a mate-paired library. Next, clonal bead populations are prepared in microreactors containing beads, primers, templates, and PCR components. Following PCR, the templates are denatured, and the beads are enriched to separate the beads with extended templates. Templates on the selected beads are subjected to a 3' modification that permits bonding to a glass slide. The sequence can be determined by sequential hybridization and ligation of partially random oligonucleotides with a central determined base (or pair of bases) that is identified by a specific fluorophore. After a color is recorded, the ligated oligonucleotide is cleaved and removed, and the process is then repeated.
In another exemplary but non-limiting embodiment, the methods described herein include obtaining sequence information for nucleic acids in a test sample, e.g., cfDNA in a maternal test sample, cfDNA or cellular DNA in an individual being screened for cancer, etc., using the single molecule, real-time (SMRT™) sequencing technology of Pacific Biosciences. In SMRT sequencing, the continuous incorporation of dye-labeled nucleotides is imaged during DNA synthesis. Single DNA polymerase molecules are attached to the bottom surface of individual zero-mode waveguide detectors (ZMW detectors), which obtain sequence information while phospholinked nucleotides are being incorporated into the growing primer strand. A ZMW detector includes a confinement structure that enables incorporation of a single nucleotide by the DNA polymerase to be observed against the background of fluorescent nucleotides that rapidly diffuse in and out of the ZMW (e.g., in microseconds). It typically takes several milliseconds to incorporate a nucleotide into the growing strand. During this time, the fluorescent label is excited and produces a fluorescent signal, and the fluorescent tag is then cleaved off. Measuring the corresponding fluorescence of the dye indicates which base was incorporated. The process is repeated to provide a sequence.
In another exemplary but non-limiting embodiment, the methods described herein include obtaining sequence information for nucleic acids in a test sample, e.g., cfDNA in a maternal test sample, cfDNA or cellular DNA in an individual being screened for cancer, etc., using nanopore sequencing (e.g., as described in Soni GV and Meller A., Clin Chem 53:1996-2001 [2007]). Nanopore sequencing DNA analysis techniques have been developed by a number of companies, including, for example, Oxford Nanopore Technologies (Oxford, United Kingdom), Sequenom, NABsys, and the like. Nanopore sequencing is a single molecule sequencing technology in which a single DNA molecule is sequenced directly as it passes through a nanopore. A nanopore is a small hole, typically on the order of 1 nanometer in diameter. Immersing a nanopore in a conducting fluid and applying a potential (voltage) across it produces a slight electrical current due to the conduction of ions through the nanopore. The amount of current that flows is sensitive to the size and shape of the nanopore. As a DNA molecule passes through the nanopore, each nucleotide on the DNA molecule obstructs the nanopore to a different degree, changing the magnitude of the current through the nanopore to a different degree. Thus, this change in the current as the DNA molecule passes through the nanopore provides a read of the DNA sequence.
In another exemplary but non-limiting embodiment, the methods described herein include obtaining sequence information for nucleic acids in a test sample, e.g., cfDNA in a maternal test sample, cfDNA or cellular DNA in an individual being screened for cancer, etc., using a chemical-sensitive field effect transistor (chemFET) array (e.g., as described in U.S. Patent Application Publication No. 2009/0026082). In one example of this technique, DNA molecules can be placed into reaction chambers, and the template molecules can be hybridized to a sequencing primer bound to a polymerase. Incorporation of one or more triphosphates into a new nucleic acid strand at the 3' end of the sequencing primer can be detected by a chemFET as a change in current. An array can have multiple chemFET sensors. In another example, single nucleic acids can be attached to beads, the nucleic acids can be amplified on the beads, and the individual beads can be transferred to individual reaction chambers on a chemFET array, with each chamber having a chemFET sensor, and the nucleic acids can be sequenced.
In another embodiment, the method includes obtaining sequence information for nucleic acids in a test sample, e.g., cfDNA in a maternal test sample, using transmission electron microscopy (TEM). This method, termed Individual Molecule Placement Rapid Nano Transfer (IMPRNT), involves single-atom-resolution transmission electron microscope imaging of high-molecular-weight (150 kb or greater) DNA that is selectively labeled with heavy atom markers, and arranging these molecules on ultra-thin films in ultra-dense (3 nm strand-to-strand) parallel arrays with consistent base-to-base spacing. The electron microscope is used to image the molecules on the films to determine the positions of the heavy atom markers and to extract base sequence information from the DNA. The method is further described in PCT Patent Publication No. WO 2009/046445. The method allows sequencing of a complete human genome in less than ten minutes.
In another embodiment, the DNA sequencing technology is Ion Torrent single molecule sequencing, which pairs semiconductor technology with a simple sequencing chemistry to directly translate chemically encoded information (A, C, G, T) into digital information (0, 1) on a semiconductor chip. In nature, when a nucleotide is incorporated into a strand of DNA by a polymerase, a hydrogen ion is released as a by-product. Ion Torrent uses a high-density array of micro-machined wells to perform this biochemical process in a massively parallel way. Each well holds a different DNA molecule. Beneath the wells is an ion-sensitive layer, and beneath that is an ion sensor. When a nucleotide (e.g., C) is added to a DNA template and is then incorporated into a strand of DNA, a hydrogen ion is released. The charge from that ion changes the pH of the solution, which can be detected by Ion Torrent's ion sensor. The sequencer, essentially the world's smallest solid-state pH meter, calls the base, going directly from chemical information to digital information. The Ion Personal Genome Machine (PGM™) sequencer then sequentially floods the chip with one nucleotide after another. If the next nucleotide that floods the chip is not a match, no voltage change is recorded and no base is called. If there are two identical bases on the DNA strand, the voltage is doubled, and the chip records two identical bases called. Direct detection allows nucleotide incorporation to be recorded in seconds.
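The relationship between voltage and base calls described above (no change for a non-matching nucleotide, a doubled signal for two identical bases) can be sketched as a simple flow-space decoder. This is an idealized illustration: the flow order, the normalization constant `unit_signal`, and rounding to the nearest integer are assumptions, and real signal processing is considerably more involved.

```python
def call_flows(flow_order, signals, unit_signal=1.0):
    """Translate per-flow signals into a base sequence.

    flow_order:  string of nucleotides in the order they were flowed over the chip.
    signals:     measured signal for each flow, assumed proportional to the
                 number of identical bases incorporated during that flow.
    unit_signal: assumed signal produced by a single incorporation.
    """
    bases = []
    for nucleotide, signal in zip(flow_order, signals):
        count = max(int(round(signal / unit_signal)), 0)
        bases.append(nucleotide * count)  # zero incorporations adds nothing
    return "".join(bases)


# Made-up signals for one well: T x1, A x0, C x2, G x0, T x0, A x1, C x0, G x1.
flow_order = "TACGTACG"
signals = [1.02, 0.03, 1.97, 0.00, 0.10, 0.95, 0.00, 1.10]
sequence = call_flows(flow_order, signals)
```

In this example the third flow (C) yields roughly twice the unit signal, so two C bases are called, while flows with near-zero signal contribute no bases.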
In another embodiment, the method includes obtaining sequence information of nucleic acids in a test sample, e.g., cfDNA in a maternal test sample, using hybridization sequencing. Sequencing by hybridization includes contacting a plurality of polynucleotide sequences with a plurality of polynucleotide probes, wherein each of the plurality of polynucleotide probes may optionally be attached to a substrate. The substrate may be a planar surface comprising an array of known nucleotide sequences. Hybridization patterns to the array can be used to determine the polynucleotide sequences present in the sample. In other embodiments, each probe is attached to a bead, such as a magnetic bead or the like. Hybridization to the beads can be determined and used to identify a plurality of polynucleotide sequences in the sample.
In some embodiments of the methods described herein, the mapped sequence tags comprise sequence reads of about 20bp, about 25bp, about 30bp, about 35bp, about 40bp, about 45bp, about 50bp, about 55bp, about 60bp, about 65bp, about 70bp, about 75bp, about 80bp, about 85bp, about 90bp, about 95bp, about 100bp, about 110bp, about 120bp, about 130bp, about 140bp, about 150bp, about 200bp, about 250bp, about 300bp, about 350bp, about 400bp, about 450bp, or about 500bp. It is expected that advances in technology will enable single-end reads of greater than 500bp, allowing reads of greater than about 1000bp when paired-end reads are generated. In one embodiment, the mapped sequence tags comprise 36bp sequence reads. Mapping of a sequence tag is achieved by comparing the sequence of the tag with a reference sequence to determine the chromosomal origin of the sequenced nucleic acid (e.g., cfDNA) molecule; specific genetic sequence information is not required. A small degree of mismatch (0-2 mismatches per sequence tag) may be allowed to account for minor polymorphisms that may exist between the reference genome and the genomes in the mixed sample.
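As a rough illustration of mapping with a small mismatch allowance, the following Python sketch scans a toy reference and accepts a read only if exactly one position matches with at most two mismatches. The function names, the toy reference, and the uniqueness rule are assumptions made for this sketch; practical aligners use indexed data structures rather than an exhaustive scan.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def map_tag(read, reference, max_mismatches=2):
    """Return (chromosome, position) if the read maps uniquely, else None.

    reference: dict of chromosome name -> sequence (toy data here).
    """
    hits = []
    for chrom, seq in reference.items():
        for pos in range(len(seq) - len(read) + 1):
            window = seq[pos:pos + len(read)]
            if hamming(read, window) <= max_mismatches:
                hits.append((chrom, pos))
    # Keep only unambiguous placements (an assumption of this sketch).
    return hits[0] if len(hits) == 1 else None


# Toy reference, not real genomic sequence.
toy_reference = {
    "chr1": "A" * 10 + "ACGTTGCATC" + "A" * 10,
    "chr2": "G" * 30,
}
# One mismatch relative to chr1 at offset 10 (T instead of A).
print(map_tag("ACGTTGCTTC", toy_reference))  # ('chr1', 10)
```

A read that matches many positions (for example, a run of G against the toy `chr2`) returns `None` here because its origin is ambiguous.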
Multiple sequence tags are typically obtained for each sample. In some embodiments, mapping the reads to the reference genome yields, per sample, at least about 3 × 10^6 sequence tags, at least about 5 × 10^6 sequence tags, at least about 8 × 10^6 sequence tags, at least about 10 × 10^6 sequence tags, at least about 15 × 10^6 sequence tags, at least about 20 × 10^6 sequence tags, at least about 30 × 10^6 sequence tags, at least about 40 × 10^6 sequence tags, or at least about 50 × 10^6 sequence tags comprising 20-40bp (e.g., 36bp) reads. In one embodiment, all sequence reads are mapped to all regions of the reference genome. In one embodiment, tags that have been mapped to all regions (e.g., all chromosomes) of the reference genome are analyzed, and the SNV or CNV in the cfDNA sample is determined.
The accuracy required to correctly determine whether an SNV or CNV is present in a sample depends on the variation in the number of sequence tags mapped to the reference genome among samples within a sequencing run (inter-chromosomal variability) and the variation in the number of sequence tags mapped to the reference genome in different sequencing runs (inter-sequencing variability). For example, the variation may be particularly pronounced for tags mapped to GC-rich or GC-poor reference sequences. Other variation may result from using different protocols for extracting and purifying nucleic acids, for preparing sequencing libraries, and from using different sequencing platforms. The present method uses sequence doses (chromosome doses or segment doses) based on knowledge of normalizing sequences (normalizing chromosome sequences or normalizing segment sequences) to intrinsically account for the accumulated inter-chromosomal (intra-run), inter-sequencing (inter-run), and platform-dependent variability. Chromosome doses are based on knowledge of a normalizing chromosome sequence, which may consist of a single chromosome or of two or more chromosomes selected from chromosomes 1-22, X, and Y. Alternatively, the normalizing chromosome sequence may consist of a single chromosome segment, or of two or more segments of one chromosome or of two or more chromosomes. Segment doses are based on knowledge of a normalizing segment sequence, which may consist of a single segment of any one chromosome, or of two or more segments of any two or more of chromosomes 1-22, X, and Y.
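A sequence dose of this kind can be sketched as a simple ratio of tag counts. The passage above states only that doses are based on normalizing sequences; the ratio form, the function name `sequence_dose`, and the counts below are assumptions used for illustration.

```python
def sequence_dose(tag_counts, of_interest, normalizing):
    """Ratio of tags on the sequence(s) of interest to tags on the
    normalizing sequence(s). The ratio form is an assumption of this sketch.

    tag_counts:  dict of chromosome (or segment) name -> mapped tag count
    of_interest: names making up the sequence of interest
    normalizing: names making up the normalizing sequence
    """
    numerator = sum(tag_counts[name] for name in of_interest)
    denominator = sum(tag_counts[name] for name in normalizing)
    if denominator == 0:
        raise ValueError("normalizing sequence has no mapped tags")
    return numerator / denominator


# Hypothetical tag counts from one sample.
counts = {"chr1": 800_000, "chr2": 820_000, "chr21": 120_000}
dose = sequence_dose(counts, of_interest=["chr21"], normalizing=["chr1", "chr2"])
print(round(dose, 4))  # 0.0741
```

Because numerator and denominator come from the same sample and run, run-level effects that scale all counts together cancel in the ratio, which is the motivation given in the paragraph above.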
Apparatus and system for assaying variants of interest
Various computer-executed algorithms and programs are typically used to perform analysis of sequencing data and to derive diagnostics therefrom. Accordingly, certain embodiments employ processes involving data stored in or transmitted through one or more computer systems or other processing systems. Embodiments disclosed herein also relate to apparatus for performing these operations. The apparatus may be specially constructed for the required purposes, or it may be a general-purpose computer (or a group of computers) selectively activated or reconfigured by a computer program and/or data structure stored in the computers. In some embodiments, a set of processors cooperatively (e.g., via network or cloud computing) and/or in parallel perform some or all of the enumerated analysis operations. The processor or set of processors used to perform the methods described herein may be of various types, including microcontrollers and microprocessors, such as programmable devices (e.g., CPLD and FPGA), and non-programmable devices, such as gate array ASICs or general purpose microprocessors.
Additionally, certain embodiments relate to tangible and/or non-transitory computer-readable media or computer program products that include program instructions and/or data (including data structures) for performing various computer-implemented operations. Examples of computer-readable media include, but are not limited to: semiconductor memory devices; magnetic media such as disk drives and magnetic tape; optical media such as CDs; magneto-optical media; and hardware devices specially configured to store and execute program instructions, such as read-only memory (ROM) devices and random access memory (RAM). The computer-readable media may be controlled directly by an end user, or the media may be controlled indirectly by the end user. Examples of directly controlled media include media located at a user facility and/or media that are not shared with other entities. Examples of indirectly controlled media include media that the user can access indirectly through an external network and/or through a service that provides shared resources (e.g., the "cloud"). Examples of program instructions include both machine code, such as that produced by a compiler, and files containing higher-level code that may be executed by the computer using an interpreter.
In various embodiments, the data or information employed in the disclosed methods and apparatus is provided in electronic format. Such data or information may include reads and tags derived from nucleic acid samples, counts or densities of such tags aligned with specific regions of reference sequences (e.g., aligned with chromosomes or chromosome segments), reference sequences (including reference sequences that provide only or predominantly polymorphisms), chromosome and segment doses, determinations such as SNV or aneuploidy determinations, normalized chromosome and segment values, pairs of chromosomes or segments and their corresponding normalizing chromosomes or segments, counseling recommendations, diagnoses, and the like. As used herein, data or other information provided in electronic format is available for storage on a machine and for transmission between machines. Conventionally, data in electronic format is provided digitally and may be stored as bits and/or bytes in various data structures, lists, databases, and the like. The data may be embodied electronically, optically, etc.
One embodiment provides a computer program product for generating an output indicating whether a cancer-related SNV or aneuploidy is present in a test sample. The computer product may contain instructions for performing any one or more of the methods described above for determining a chromosomal abnormality. As explained, the computer product may include a non-transitory and/or tangible computer-readable medium having computer-executable or compilable logic (e.g., instructions) recorded thereon to enable a processor to determine whether an SNV decision or a CNV decision should be made. In one example, the computer product includes a computer-readable medium having computer-executable or compilable logic (e.g., instructions) recorded thereon to enable a processor to diagnose an SNV or CNV.
Sequence information from the sample under consideration may be mapped to chromosome reference sequences to identify a number of sequence tags for each of any one or more chromosomes of interest, and to identify a number of sequence tags for a normalizing segment sequence for each of the one or more chromosomes of interest. In various embodiments, the reference sequences are stored in a database, such as a relational database or an object database.
It should be appreciated that it is impractical, if not impossible in most cases, for an unassisted person to perform the computational operations of the methods disclosed herein. For example, without the aid of a computing device, many years of effort may be required to map one 30bp read from a sample to any one of the human chromosomes. Of course, the problem becomes more complex because reliable SNV and CNV decisions typically require mapping thousands (e.g., at least about 10,000) or even millions of reads onto one or more chromosomes.
The methods disclosed herein can be performed using a system for assessing copy number of a gene sequence of interest in a test sample. The system comprises: (a) A sequencer for receiving nucleic acids from a test sample and providing nucleic acid sequence information from the sample; (b) a processor; and (c) one or more computer-readable storage media having instructions stored thereon for execution on the processor to implement a method for identifying any SNV or CNV.
In some embodiments, the methods are directed by a computer-readable medium having stored thereon computer-readable instructions for carrying out a method for identifying any SNV or CNV. Accordingly, one embodiment provides a computer program product comprising one or more computer-readable non-transitory storage media having computer-executable instructions stored thereon that, when executed by one or more processors of a computer system, cause the computer system to implement a method for assessing the copy number of a sequence of interest in a test sample comprising normal and tumor cell-free nucleic acids. The method comprises: (a) retrieving, by the one or more processors, sequence reads and fragment sizes of cfDNA fragments obtained from the test sample; (b) assigning, by the one or more processors, the cfDNA fragments to a plurality of bins representing different fragment sizes; and (c) determining, using the sequence reads and by the one or more processors, an allele frequency of a variant of interest in a set of priority bins selected from the plurality of bins, wherein the set of priority bins is selected to (i) limit the probability that the number of variants of interest in the set of priority bins is below a detection limit, and (ii) increase the probability that the number of variants of interest in the set of priority bins is higher than in all bins of the plurality of bins.
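Steps (b) and (c) can be sketched in Python as follows. This is a minimal illustration, not the claimed implementation: the fixed 10 bp bin width, the input format (one `(fragment_size, supports_variant)` pair per fragment covering the variant position), the function names, and the example data are all assumptions, and the set of priority bins is taken as given rather than selected by criteria (i) and (ii).

```python
from collections import Counter


def bin_index(fragment_size, bin_width=10):
    """Assign a fragment to a size bin; a fixed bin width is assumed here."""
    return fragment_size // bin_width


def allele_frequency(fragments, bins=None, bin_width=10):
    """Variant allele frequency among fragments falling in the given bins.

    fragments: iterable of (fragment_size_bp, supports_variant) pairs for
               fragments covering the variant position (hypothetical input)
    bins:      set of bin indices to include, or None for all bins
    """
    total = Counter()
    variant = Counter()
    for size, supports_variant in fragments:
        b = bin_index(size, bin_width)
        total[b] += 1
        if supports_variant:
            variant[b] += 1
    selected = total.keys() if bins is None else bins
    n_total = sum(total[b] for b in selected)
    n_variant = sum(variant[b] for b in selected)
    return n_variant / n_total if n_total else None


# Hypothetical fragments covering one variant position.
fragments = [(142, True), (148, True), (145, False), (167, False),
             (166, False), (171, False), (168, True), (183, False)]
print(allele_frequency(fragments))             # 0.375 across all bins
print(allele_frequency(fragments, bins={14}))  # 2/3 in the 140-149 bp bin
```

In this toy data the allele frequency in the chosen bin is higher than across all bins, but fewer fragments contribute to it, which is the trade-off that criteria (i) and (ii) describe.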
In some embodiments, the instructions may further include automatically recording information related to the method, such as chromosome dose and the presence or absence of an SNV or CNV, in a patient medical record for the human individual providing the maternal test sample. The patient medical record may be maintained by, for example, a laboratory, physician's office, hospital, health maintenance organization, insurance company, or personal medical record website. Additionally, based on the results of the processor-implemented analysis, the method may further include prescribing, initiating, and/or altering treatment of the human individual from whom the maternal test sample was obtained. This may involve performing one or more other tests or analyses on other samples collected from the individual.
The disclosed methods may also be performed using a computer processing system adapted or configured to implement a method for identifying any SNV or CNV. One embodiment provides a computer processing system adapted or configured to implement the methods described herein. In one embodiment, the apparatus comprises a sequencing device adapted or configured for sequencing at least a portion of the nucleic acid molecules in the sample to obtain the types of sequence information described elsewhere herein. The apparatus may further comprise components for processing the sample. Such components are described elsewhere herein.
The sequence or other data can be input to a computer or stored in a computer-readable medium, either directly or indirectly. In one embodiment, the computer system is directly connected to a sequencing device that reads and/or analyzes nucleic acid sequences from the sample. Sequences or other information from such tools are provided through an interface in the computer system. Alternatively, the sequences processed by the system are provided from a sequence storage source (e.g., a database or other repository). Once available to the processing apparatus, the nucleic acid sequences are buffered or stored, at least temporarily, in a memory device or mass storage device. In addition, the memory device may store tag counts for various chromosomes or genomes, and the like. The memory may also store various routines and/or programs for analyzing and presenting the sequence or mapped data. Such programs/routines may include programs for performing statistical analysis, and the like.
In one example, a user provides a sample into a sequencing apparatus. The data is collected and/or analyzed by a sequencing device connected to a computer. Software on the computer allows data collection and/or analysis. The data may be stored, displayed (via a monitor or other similar device), and/or transmitted to other locations. The computer may be connected to the internet for transmitting data to a handheld device used by a remote user (e.g., physician, scientist or analyst). It should be appreciated that the data may be stored and/or analyzed prior to transmission. In some embodiments, the raw data is collected and sent to a remote user or device that will analyze and/or store the data. The transmission may be via the internet, satellite or other connection. Alternatively, the data may be stored on a computer readable medium and the medium may be shipped to an end user (e.g., by mail). The remote users may be in the same or different geographic locations including, but not limited to, buildings, cities, states, countries, or continents.
In some embodiments, the method further comprises collecting data about a plurality of polynucleotide sequences (e.g., reads, tags, and/or reference chromosome sequences) and sending the data to a computer or other computing system. For example, the computer may be connected to laboratory equipment, such as a sample collection device, a nucleotide amplification device, a nucleotide sequencing device, or a hybridization device. The computer may then collect the applicable data gathered by the laboratory device. The data may be stored on the computer at any step, for example, while being collected in real time, or before, during, or after transmission. The data may be stored on a computer-readable medium that can be removed from the computer. The collected or stored data may be transmitted from the computer to a remote location, for example, via a local area network or a wide area network such as the internet. As described below, various operations may be performed on the transmitted data at the remote location.
The types of electronically formatted data that may be stored, transmitted, analyzed, and/or manipulated in the systems, devices, and methods disclosed herein are as follows:
Reads obtained by sequencing nucleic acids in a test sample
Tags obtained by aligning reads with a reference genome or other reference sequences
The reference genome or sequence
Sequence tag density: the count or number of tags for each of two or more regions (typically chromosomes or chromosome segments) of a reference genome or other reference sequence
Identities of the normalizing chromosomes or chromosome segments for particular chromosomes or chromosome segments of interest
Doses of chromosomes or chromosome segments (or other regions) obtained from the chromosomes or segments of interest and the corresponding normalizing chromosomes or segments
Thresholds for deciding whether a chromosome dose is affected, unaffected, or indeterminate
The actual decisions made on chromosome doses
Diagnoses (clinical conditions associated with the decisions)
Recommendations for further tests derived from the decisions and/or diagnoses
Treatment and/or monitoring plans derived from the decisions and/or diagnoses
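The threshold item in the list above can be illustrated with a short Python sketch that sorts a dose into one of the three outcomes. The two-threshold scheme, the function name, and the numeric values are assumptions for illustration; the list states only that thresholds separate affected, unaffected, and indeterminate results.

```python
def decide_from_dose(dose, unaffected_max, affected_min):
    """Classify a dose as 'affected', 'unaffected', or 'indeterminate'.

    unaffected_max: doses at or below this value are treated as unaffected
    affected_min:   doses at or above this value are treated as affected
    Both thresholds are hypothetical parameters of this sketch.
    """
    if unaffected_max > affected_min:
        raise ValueError("unaffected_max must not exceed affected_min")
    if dose >= affected_min:
        return "affected"
    if dose <= unaffected_max:
        return "unaffected"
    # Doses between the two thresholds are left undecided.
    return "indeterminate"


# Hypothetical thresholds on a normalized dose value.
for value in (1.0, 3.0, 4.2):
    print(value, decide_from_dose(value, unaffected_max=2.5, affected_min=4.0))
```

Keeping a gap between the two thresholds gives an explicit indeterminate zone, so borderline samples can be flagged for further testing rather than forced into one class.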
These various types of data may be obtained, stored, transmitted, analyzed, and/or manipulated at one or more locations using different devices. The processing options span a wide range. At one end of the range, all or much of this information is stored and used at the location where the test sample is processed, such as a doctor's office or other clinical setting. At the other extreme, the sample is obtained at one location, processed and optionally sequenced at a different location, the reads are aligned and decisions are made at one or more further locations, and diagnoses, recommendations, and/or plans are prepared at yet another location (which may be the location where the sample was obtained).
In various embodiments, reads are generated with a sequencing device and then transmitted to a remote site where they are processed to generate a decision. For example, at this remote location, the reads are aligned with reference sequences to generate tags, which are counted and assigned to chromosomes or segments of interest. Also at the remote location, the counts are converted to doses using the associated normalizing chromosomes or segments. Still further, at the remote location, the doses are used to generate a decision.
Processing operations that may be employed at different locations include:
Sample collection
Sample processing prior to sequencing
Sequencing
Analyzing the sequence data and deriving SNV or CNV decisions
Diagnosis
Reporting the diagnosis and/or decision to the patient or healthcare provider
Planning further treatment, testing, and/or monitoring
Executing the plan
Counseling
Any one or more of these operations may be automated, as described elsewhere herein. Typically, the sequencing and the analysis of sequence data to derive SNV or CNV decisions will be performed computationally. Other operations may be performed manually or automatically.
Examples of locations where sample collection may be performed include the office of a health practitioner, a clinic, a patient's home (where a sample collection tool or kit is provided), and a mobile health care vehicle. Examples of locations where sample processing may be performed prior to sequencing include the office of a health practitioner, a clinic, a patient's home (where a sample processing device or kit is provided), a mobile health care vehicle, and the facilities of an SNV or CNV analysis provider. Examples of locations where sequencing may be performed include the office of a health practitioner, a clinic, a patient's home (where a sample sequencing device and/or kit is provided), a mobile health care vehicle, and the facilities of an SNV or CNV analysis provider. A dedicated network connection may be provided at the location where sequencing is performed for transmitting sequence data (typically reads) in electronic format. Such a connection may be wired or wireless and may be configured to send the data to a site where the data can be processed and/or aggregated before being transmitted to the processing site. The data aggregator may be maintained by a health organization (e.g., a Health Maintenance Organization (HMO)).
The analysis and/or derivation operations may be performed at any of the foregoing locations, or at another remote site dedicated to computation and/or to the service of analyzing nucleic acid sequence data. Such locations include, for example, clusters such as general-purpose server farms, the facilities of an SNV or CNV analysis service business, and the like. In some embodiments, the computing device used to perform the analysis is leased or rented. The computing resources may be part of an internet-accessible collection of processors, such as the processing resources commonly known as the cloud. In some cases, the computations are performed in parallel or massively in parallel by a set of processors that may or may not be affiliated with one another. Distributed processing such as cluster computing, grid computing, and the like may be used to accomplish this. In such embodiments, the clusters or grids of computing resources collectively form a super virtual computer composed of multiple processors or computers that cooperate to perform the analysis and/or derivation described herein. As described herein, these techniques, as well as more conventional supercomputers, may be used to process sequence data. Each is a form of parallel computing that relies on processors or computers. In the case of grid computing, these processors (typically whole computers) are connected through a network (private, public, or the internet) by a conventional network protocol (e.g., Ethernet). In contrast, a supercomputer has many processors connected by a local high-speed computer bus.
In certain embodiments, the diagnosis is made at the same location as the analysis operation. In other embodiments, it is made at a different location. In some instances, the diagnosis is reported at the location where the sample was taken, although this is not required. Examples of locations where a diagnosis may be generated or reported and/or where a plan may be developed include the offices of health practitioners, clinics, internet sites accessible by computer, and handheld devices such as cell phones, tablet computers, and smart phones that have a wired or wireless connection to a network. Examples of locations where counseling is performed include the offices of health practitioners, clinics, internet sites accessible by computer, handheld devices, and the like.
In some embodiments, the sample collection, sample processing, and sequencing operations are performed at a first location, while the analysis and derivation operations are performed at a second location. However, in some cases, the sample is collected at one location (e.g., a health practitioner's office or clinic), while the sample processing and sequencing are performed at a different location, which is optionally the same location where the analysis and derivation are performed.
In various embodiments, the sequence of operations listed above may be triggered by a user or entity initiating sample collection, sample processing, and/or sequencing. After one or more of these operations has begun, the other operations may follow naturally. For example, the sequencing operation may cause reads to be automatically collected and sent to a processing device, which then performs the sequence analysis and the derivation of SNV or CNV decisions, typically automatically and possibly without further user intervention. In some embodiments, the results of this processing operation are then automatically delivered, possibly reformatted as a diagnosis, to a system component or entity that handles reporting of the information to a health professional and/or patient. As explained, such information may also be automatically processed to generate a treatment, testing, and/or monitoring plan, possibly together with counseling information. Thus, initiating an early-stage operation can trigger an end-to-end sequence in which a health professional, patient, or other concerned party is provided with a diagnosis, a plan, counseling, and/or other information useful for acting on a physical condition. This can be accomplished even though parts of the overall system are physically separated and may be located remotely from, for example, the sample collection and sequencing equipment.
FIG. 4 illustrates, in simple block form, a typical computer system that, when properly configured or designed, can be used as a computing device in accordance with certain embodiments. The computer system 2000 includes any number of processors 2002 (also referred to as central processing units, or CPUs) connected to storage devices that include primary storage 2006 (typically random access memory, or RAM) and primary storage 2004 (typically read-only memory, or ROM). The CPU 2002 may be of various types, including microcontrollers and microprocessors such as programmable devices (e.g., CPLDs and FPGAs) and non-programmable devices such as gate array ASICs or general-purpose microprocessors. In the depicted embodiment, primary storage 2004 transfers data and instructions unidirectionally to the CPU, while primary storage 2006 is typically used to transfer data and instructions bidirectionally. Both of these primary storage devices may include any suitable computer-readable media, such as those described above. The mass storage device 2008 is also bidirectionally coupled to primary storage 2006, provides additional data storage capacity, and may include any of the computer-readable media described above. The mass storage device 2008 may be used to store programs, data, and the like, and is typically a secondary storage medium such as a hard disk. Typically, such programs, data, and the like are temporarily copied to primary storage 2006 for execution on the CPU 2002. It will be appreciated that the information retained within the mass storage device 2008 may, in appropriate cases, be incorporated in standard fashion as part of primary storage 2004. Certain mass storage devices, such as CD-ROM 2014, may also transfer data unidirectionally to the CPU or primary storage.
CPU 2002 is also connected to an interface 2010, which interface 2010 is connected to one or more input/output devices such as a nucleic acid sequencer (2020), video monitor, trackball, mouse, keyboard, microphone, touch sensitive display, transducer card reader, tape or paper strip reader, tablet, stylus, voice or handwriting recognition peripheral, USB port, or other well-known input devices such as other computers. Finally, CPU 2002 optionally uses an external connection (as shown generally at 2012) to connect to an external device such as a database, computer, or telecommunications network. With this connection, it is contemplated that the CPU might receive information from the network, or might output information to the network in the course of performing the method steps described herein. In some embodiments, the nucleic acid sequencer (2020) may be communicatively connected to the CPU 2002 via a network connection 2012 instead of or in addition to via the interface 2010.
In one embodiment, a system, such as computer system 2000, is used as a data import, data correlation, and query system capable of performing some or all of the tasks described herein. Information and programs, including data files, may be provided via the network connection 2012 for access or download by researchers. Alternatively, such information, programs, and files may be provided to the researcher on a storage device.
In particular embodiments, computer system 2000 is directly connected to a data acquisition system, such as a microarray, a high-throughput screening system, or a nucleic acid sequencer (2020) that captures data from samples. Data from such systems is provided through interface 2010 for analysis by system 2000. Alternatively, the data processed by system 2000 is provided from a data storage source such as a database or other repository of relevant data. Once in the system 2000, a memory device such as primary storage 2006 or mass storage 2008 buffers or stores the relevant data, at least temporarily. The memory may also store various routines and/or programs for importing, analyzing, and presenting the data, including sequence reads, UMIs, and code for determining sequence reads, collapsing sequence reads, correcting errors in reads, and the like.
In certain embodiments, a computer as used herein may comprise a user terminal, which may be any type of computer (e.g., desktop computer, laptop computer, tablet computer, etc.), media computing platform (e.g., cable, satellite set-top box, digital video recorder, etc.), handheld computing device (e.g., PDA, email client, etc.), cell phone, or any other type of computing or communication platform.
In certain embodiments, a computer as used herein may also include a server system in communication with the user terminal, which may include a server device or a decentralized server device, and may include a mainframe computer, mini-computer, supercomputer, personal computer, or a combination thereof. Multiple server systems may also be used without departing from the scope of the invention. The user terminal and the server system may communicate with each other through a network. The network may include, for example, wired networks such as LAN (local area network), WAN (wide area network), MAN (metropolitan area network), ISDN (integrated services digital network), etc., as well as wireless networks such as wireless LAN, CDMA, bluetooth, satellite communication networks, etc., without limiting the scope of the invention.
FIG. 5 illustrates one embodiment of a distributed system for generating a decision or diagnosis from a test sample. Sample collection location 01 is used to obtain a test sample from a patient, such as a pregnant woman or a putative cancer patient. The sample is then provided to processing and sequencing location 03, where the test sample can be processed and sequenced as described above. Location 03 includes equipment for processing the sample and equipment for sequencing the processed sample. As described elsewhere herein, the result of the sequencing is a collection of reads, typically provided in electronic format and delivered to a network such as the internet, which is indicated by reference numeral 05 in FIG. 5.
The sequence data is provided to a remote location 07 where analysis and decision generation is performed. Such a location may include one or more powerful computing devices such as a computer or processor. After the computing resource at location 07 has completed the analysis and generated a decision from the received sequence information, the decision will be passed back to the network 05. In some embodiments, not only is a determination made at location 07, but a related diagnosis is also generated. The determination and/or diagnosis is then transmitted over a network and returned to the sample collection location 01, as shown in fig. 5. As explained, this is just one of many variations on how to separate the various operations related to generating a decision or diagnosis between the various locations. One common variation involves providing sample collection, processing, and sequencing at a single location. Another variation involves providing processing and sequencing at the same location as the analysis and decision generation.
FIG. 6 details options for performing the various operations at different locations. In the most granular arrangement shown in FIG. 6, each of the following operations is performed at a separate location: sample collection, sample processing, sequencing, read alignment, determination, diagnosis, and reporting and/or planning.
In one embodiment, in which some of these operations are aggregated, sample processing and sequencing are performed at one location, and read alignment, determination, and diagnosis are performed at a different location. See the portion of FIG. 6 identified by reference character A. In another embodiment, identified by character B in FIG. 6, sample collection, sample processing, and sequencing are all performed at the same location. In this embodiment, read alignment and determination are performed at a second location. Finally, diagnosis and reporting and/or planning are performed at a third location. In the embodiment identified by character C in FIG. 6, sample collection is performed at a first location; sample processing, sequencing, read alignment, determination, and diagnosis are all performed together at a second location; and reporting and/or planning is performed at a third location. Finally, in the embodiment identified by character D in FIG. 6, sample collection is performed at a first location; sample processing, sequencing, read alignment, and determination are all performed at a second location; and diagnosis and reporting and/or planning are performed at a third location.
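The four groupings described for FIG. 6 can be written down as a small data structure, which makes it easy to see where data must pass from one location to another. The encoding, the names `FIG6_OPTIONS`, `location_of`, and `handoffs`, and the zero-based location indices are hypothetical. Option A lists only the operations the text assigns to a location.

```python
# Each option is a list of locations; each location lists its operations.
FIG6_OPTIONS = {
    "A": [
        ["sample processing", "sequencing"],
        ["read alignment", "determination", "diagnosis"],
    ],
    "B": [
        ["sample collection", "sample processing", "sequencing"],
        ["read alignment", "determination"],
        ["diagnosis", "reporting and/or planning"],
    ],
    "C": [
        ["sample collection"],
        ["sample processing", "sequencing", "read alignment",
         "determination", "diagnosis"],
        ["reporting and/or planning"],
    ],
    "D": [
        ["sample collection"],
        ["sample processing", "sequencing", "read alignment", "determination"],
        ["diagnosis", "reporting and/or planning"],
    ],
}

PIPELINE = ["sample collection", "sample processing", "sequencing",
            "read alignment", "determination", "diagnosis",
            "reporting and/or planning"]


def location_of(option, operation):
    """Zero-based index of the location performing an operation, or None."""
    for index, operations in enumerate(FIG6_OPTIONS[option]):
        if operation in operations:
            return index
    return None


def handoffs(option, pipeline=PIPELINE):
    """Count transfers between consecutive operations at different locations."""
    locations = [location_of(option, op) for op in pipeline]
    known = [loc for loc in locations if loc is not None]
    return sum(1 for a, b in zip(known, known[1:]) if a != b)


for name in FIG6_OPTIONS:
    print(name, handoffs(name))
```

Under this encoding, options B, C, and D each involve two transfers between locations; they differ in which operations are grouped on either side of each transfer.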
One embodiment provides a system for analyzing simple nucleotide variants associated with a tumor in cell-free DNA (cfDNA), the system comprising a sequencer for receiving a nucleic acid sample and providing nucleic acid sequence information from the nucleic acid sample; a processor; and a machine-readable storage medium comprising code for execution on the processor, the code comprising: (a) code for retrieving sequence reads and fragment sizes of cfDNA fragments obtained from a test sample; (b) code for assigning the cfDNA fragments into a plurality of bins representing different fragment sizes; and (c) code for determining, using the sequence reads, allele frequencies of variants of interest in a set of priority bins selected from the plurality of bins, wherein the set of priority bins is selected to: (i) limit the probability that the number of variants of interest in the set of priority bins is below the detection limit, and (ii) increase the probability that the number of variants of interest in the set of priority bins is higher than in all bins of the plurality of bins.
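Operations (a) to (c) of the system described above can be illustrated with a minimal sketch. The bin width, the function names `bin_index` and `allele_frequency_in_bins`, and the toy fragment list are assumptions made for illustration only; they are not taken from the disclosure.

```python
BIN_WIDTH = 5  # hypothetical bin width in nucleotides


def bin_index(fragment_size, bin_width=BIN_WIDTH):
    """Map a fragment size to the index of its size bin."""
    return fragment_size // bin_width


def allele_frequency_in_bins(fragments, selected_bins, bin_width=BIN_WIDTH):
    """Fraction of variant-carrying fragments among fragments in the selected bins.

    fragments: iterable of (fragment_size, carries_variant) pairs for fragments
    that cover the locus of the variant of interest.
    Returns None when no fragment falls into the selected bins.
    """
    total = 0
    mutant = 0
    for size, carries_variant in fragments:
        if bin_index(size, bin_width) in selected_bins:
            total += 1
            mutant += int(carries_variant)
    return mutant / total if total else None


# Toy data: (fragment size, does the read pair carry the variant?)
fragments = [(142, True), (148, False), (166, False),
             (171, False), (144, True), (168, False)]

priority_bins = {28, 29}                       # hypothetical set: sizes 140-149
all_bins = {bin_index(size) for size, _ in fragments}

af_priority = allele_frequency_in_bins(fragments, priority_bins)
af_all = allele_frequency_in_bins(fragments, all_bins)
```

In this toy data the variant-carrying fragments are concentrated in the shorter bins, so the allele frequency computed in the hypothetical priority set is higher than the one computed across all bins.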
In some embodiments of any of the systems provided herein, the sequencer is configured to perform Next Generation Sequencing (NGS). In some embodiments, the sequencer is configured for large-scale parallel sequencing using sequencing by synthesis with reversible dye terminators. In other embodiments, the sequencer is configured to sequence by ligation. In other embodiments, the sequencer is configured to perform single molecule sequencing.
Examples
Example 1
This example uses simulation data to illustrate the advantages provided by the method of analyzing cell-free DNA fragments using a set of priority bins to determine variants of interest. This example shows that some embodiments may provide improved signal levels for detecting variants of interest (e.g., tumor-associated variants).
Simulation data were generated for four different cases, each with a different tumor fraction, allele frequency in tumor cells, and allele frequency in the plasma sample. The plasma samples included cfDNA fragments derived from tumor cells and from healthy cells. The cases also have different sequencing depths.
Case A simulates a sample with a tumor fraction (f_tumor) of 0.01, an allele frequency in tumor cells (AF_tumor) of 0.5, and an allele frequency in plasma (AF_plasma) of 0.005, sequenced to a depth (DP) of 5000X. Case A is provided to mimic a clinical condition in which the cancer is at an early stage and the tumor fraction is very low.
Case B has a tumor fraction of 0.2, a tumor cell allele frequency of 0.5, and a plasma allele frequency of 0.1. The sequencing depth for case B is 1000X. Case B is provided to mimic a clinical condition at an advanced stage, when tumor burden is high and tracking tumor changes may be useful for monitoring tumor progression.
Case C has a tumor fraction of 0.2, a tumor cell allele frequency of 0.02, and a plasma allele frequency of 0.004. The sequencing depth is 5000X. Case C is provided to mimic therapy-resistance mutations in metastasis, where the tumor fraction is high but the allele frequency in tumor cells is low.
Case D has a tumor fraction of 0.1, a tumor cell allele frequency of 0.05, and a plasma allele frequency of 0.005. The sequencing depth is 5000X. Case D is designed to mimic subclonal mutations in a primary carcinoma, with a moderate tumor fraction and a relatively low tumor cell allele frequency.
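The parameters of the four cases are consistent with the relation AF_tumor = AF_plasma / f_tumor recited in the claims, where f_tumor is the fraction of cfDNA derived from tumor cells. The short check below restates the case parameters from the text; the dictionary layout and function names are illustrative assumptions.

```python
from math import isclose

# Case parameters as stated in Example 1.
cases = {
    "A": {"f_tumor": 0.01, "af_tumor": 0.5, "af_plasma": 0.005, "depth": 5000},
    "B": {"f_tumor": 0.2, "af_tumor": 0.5, "af_plasma": 0.1, "depth": 1000},
    "C": {"f_tumor": 0.2, "af_tumor": 0.02, "af_plasma": 0.004, "depth": 5000},
    "D": {"f_tumor": 0.1, "af_tumor": 0.05, "af_plasma": 0.005, "depth": 5000},
}


def expected_plasma_af(f_tumor, af_tumor):
    """Plasma allele frequency implied by AF_tumor = AF_plasma / f_tumor."""
    return f_tumor * af_tumor


def expected_mutant_fragments(depth, af_plasma):
    """Expected number of variant-carrying fragments at the locus, all bins."""
    return depth * af_plasma


for name, c in cases.items():
    implied = expected_plasma_af(c["f_tumor"], c["af_tumor"])
    mutant = expected_mutant_fragments(c["depth"], c["af_plasma"])
    print(f"case {name}: implied AF_plasma={implied:.4f}, "
          f"expected mutant fragments={mutant:.0f}")
```

With only a few tens of expected variant-carrying fragments in cases A, C, and D, sampling noise is substantial, which is why the choice of bins can matter for detection.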
Figs. 7A to 7D show allele frequencies of the variant of interest obtained using different sets of fragment-size bins, one figure for each of cases A to D. Fig. 7A shows the data for case A. Allele frequencies for the four bin sets are shown as box plots M1 to M4. Box M1 shows allele frequency data obtained using a set of priority bins. The priority set is selected to increase the probability that the variant of interest has a higher allele frequency in the priority set than in all bins of the plurality of bins (i.e., higher than AF_plasma as described above), and to limit the probability that the variant of interest has an allele frequency below the detection limit in the set of priority bins. Bins that do not contain any fragments with the variant of interest are excluded. That is, the bins used to obtain the allele frequency are prioritized and contain the variant.
The data in box M2 are obtained in a similar manner as for M1, except that the set of priority bins includes all bins in the selected candidate set, i.e., both bins containing fragments with the variant of interest and bins not containing any such fragments. That is, the bins used to obtain the allele frequency are prioritized.
Box M3 shows allele frequency data for bins containing fragments shorter than 150 base pairs.
Box M4 shows allele frequency data for all bins, without priority. This allele frequency is also referred to as the plasma allele frequency or the original allele frequency.
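The four bin sets M1 to M4 described above can be expressed as simple set operations. This is a sketch under stated assumptions: bins are identified by integer indices, bin i covers sizes from i * bin_width up to (but not including) (i + 1) * bin_width, and the function name `build_bin_sets` and the example indices are hypothetical.

```python
def build_bin_sets(all_bins, priority_bins, variant_bins,
                   bin_width=5, short_cutoff=150):
    """Return the bin sets used for box plots M1 to M4.

    all_bins:      indices of every size bin
    priority_bins: indices of the selected priority set
    variant_bins:  indices of bins holding at least one variant-carrying fragment
    """
    m4 = set(all_bins)                       # all bins, no prioritization
    m2 = set(priority_bins) & m4             # prioritized bins
    m1 = m2 & set(variant_bins)              # prioritized and variant-containing
    # Bins whose fragments are all shorter than the cutoff (150 bp in the text).
    m3 = {b for b in m4 if (b + 1) * bin_width <= short_cutoff}
    return {"M1": m1, "M2": m2, "M3": m3, "M4": m4}


bin_sets = build_bin_sets(
    all_bins=range(20, 40),                  # sizes 100-199 in 5-nt bins
    priority_bins={26, 27, 28, 29, 30},      # hypothetical priority set
    variant_bins={27, 29, 33},               # hypothetical variant-containing bins
)
```

M1 is always a subset of M2, which in turn is a subset of M4; M3 depends only on fragment size and not on any prioritization.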
As shown in Fig. 7A, the allele frequencies for M1 (prioritized, variant-containing bins) are higher than for M3 and M4. Likewise, the allele frequencies for M2 (prioritized bins) are also higher than for M3 and M4. The differences are statistically significant. Fig. 7A illustrates that in case A, where the cancer is at an early stage and the tumor fraction is very low, using priority bins containing the variant of interest helps to increase the signal level for detecting the variant of interest.
The data patterns observed in Fig. 7A also appear in Figs. 7B to 7D. Thus, using priority bins obtained by the methods described herein can increase the signal level for detecting tumor variants under various clinical conditions: when tumor burden is high (case B), when the tumor allele frequency is low in metastatic tumors with therapy-resistance mutations (case C), and when tumors carry subclonal mutations with a relatively low tumor allele frequency and a moderate tumor fraction (case D).
Using the two sets of priority bins may not only increase the sensitivity of detecting a variant of interest, but may also increase or maintain the selectivity of detecting the variant of interest. Selectivity values were determined for the four different types of bin sets used to analyze the cfDNA data, according to selectivity = true negative/(true negative + false negative). Data were determined for three different tumor fractions (0.01, 0.1, and 0.2). The pattern across the four different bin sets is consistent across tumor fractions. In particular, the analysis performed using all bins had a high selectivity of 99.7%. When only bins containing fragments shorter than 150 bp were used, the selectivity dropped to 94.6%. The selectivity of the analyses using the priority bins and using the priority bins containing the mutation remained at 99.7%. For the higher tumor fractions (0.1 and 0.2), the same selectivity pattern holds across the four different bin sets. Thus, the data show that the selectivity of variant detection can be maintained when priority bins are used.
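The claims recite that the priority set is chosen from candidate sets built greedily by ranking bins, using a first probability (allele frequency below the detection limit) and a second probability (allele frequency higher than in all bins). The sketch below illustrates that selection under loudly stated assumptions that are not confirmed by the text: the variant count in a bin set is taken as Binomial(n, AF_tumor) with n equal to the rounded number of tumor-derived fragments in the set, the allele frequency is the count divided by the total coverage of the set, and the all-bin allele frequency is approximated by its expected value f_tumor * AF_tumor. All function names and numeric inputs are hypothetical.

```python
from math import ceil, exp, floor, lgamma, log


def greedy_candidate_sets(alpha, beta):
    """Nested candidate sets, adding bins in descending alpha/beta ratio.

    alpha[b], beta[b]: fragment density of bin b in affected / unaffected
    training samples (beta[b] is assumed to be non-zero).
    """
    ranked = sorted(alpha, key=lambda b: alpha[b] / beta[b], reverse=True)
    return [frozenset(ranked[:k]) for k in range(1, len(ranked) + 1)]


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), assuming 0 < p < 1."""
    if k < 0:
        return 0.0
    total = 0.0
    for i in range(min(k, n) + 1):
        log_pmf = (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                   + i * log(p) + (n - i) * log(1 - p))
        total += exp(log_pmf)
    return min(total, 1.0)


def set_probabilities(bins, alpha, beta, depth, f_tumor, af_tumor, lod):
    """Return (first probability, second probability) for one candidate set."""
    a = sum(alpha[b] for b in bins)
    c = sum(beta[b] for b in bins)
    n_tumor = round(depth * f_tumor * a)            # assumed binomial n
    coverage = depth * (f_tumor * a + (1 - f_tumor) * c)
    af_plasma = f_tumor * af_tumor                  # expected all-bin AF
    # AF = N_mut / coverage, so AF < lod  <=>  N_mut <= ceil(lod * coverage) - 1
    p_below_lod = binom_cdf(ceil(lod * coverage) - 1, n_tumor, af_tumor)
    # AF > af_plasma  <=>  N_mut > floor(af_plasma * coverage)
    p_enriched = 1.0 - binom_cdf(floor(af_plasma * coverage), n_tumor, af_tumor)
    return p_below_lod, p_enriched


def select_priority_set(candidates, criterion, **model):
    """Largest second probability among sets whose first probability <= criterion."""
    scored = [(s, *set_probabilities(s, **model)) for s in candidates]
    allowed = [(s, p2) for s, p1, p2 in scored if p1 <= criterion]
    return max(allowed, key=lambda item: item[1])[0] if allowed else None


# Hypothetical training densities for three coarse size bins.
alpha = {"100-149": 0.45, "150-199": 0.45, "200-249": 0.10}
beta = {"100-149": 0.15, "150-199": 0.65, "200-249": 0.20}

candidates = greedy_candidate_sets(alpha, beta)
priority = select_priority_set(candidates, criterion=0.05, alpha=alpha, beta=beta,
                               depth=5000, f_tumor=0.1, af_tumor=0.5, lod=0.0005)
```

With these made-up densities the bin most enriched in the affected samples is selected on its own, because adding further bins dilutes the expected allele frequency more than it reduces sampling noise.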
Example 2
This example provides empirical data obtained from an actual biological sample to illustrate that the method using the priority bin disclosed above can increase the signal level for detecting variants of interest.
Fig. 8 shows fragment length distributions of cfDNA from tumor cells and normal cells. The dark grey bars represent the distribution of tumor cell-derived cfDNA. The medium grey line shows the distribution of normal cell-derived cfDNA for genes including FGFR3 and lemm 1. The light grey line shows the distribution of normal cell-derived cfDNA for other genes. The distribution of tumor-derived cfDNA fragments is also filled with grey shading. It is apparent from the three distributions in Fig. 8 that the tumor-derived cfDNA distribution has a main peak shifted toward shorter fragment lengths.
Fig. 9 shows the frequencies of tumor-derived and normal-derived cfDNA fragments assigned to bins with a bin size of 5 nt. Tumor-derived cfDNA frequencies are shown as dark grey bars, and normal cfDNA frequencies are shown as light grey bars. Both distributions are bimodal. The tumor-derived distribution has a major peak at about 150-175 bp and a minor peak at about 315 bp. The normal-cell distribution has a major peak at 170 bp and a minor peak at about 320 bp. The data in Fig. 9 also indicate that tumor-derived cfDNA fragments tend to be shorter than normal cfDNA fragments.
Fold changes in the number of tumor alleles were obtained for three types of bin sets, each relative to the number of tumor alleles obtained using all bins. The analysis covered 32 true positive mutations, including simple nucleotide variants (SNVs). The true positive mutations are known from an empirical study. Using priority bins containing mutated fragments, the fold change was greater than 1 for 31 of the 32 mutations. Using all bins of the priority set (both bins containing mutated fragments and bins not containing mutated fragments), the fold change was greater than 1 for 28 of the 32 mutations. Using bins containing fragments shorter than 150 bp, 30 of the 32 true mutations were detected with a fold change greater than 1. The data show that the signal without mutation is below the limit of detection. Thus, the data show that using priority bins can increase the signal level for detecting the 32 true positive mutations in a biological sample.
Fig. 10 shows fold change data for the 32 true positive mutations, grouped by level of original allele frequency. The horizontal axis of Fig. 10 represents the original allele frequency of the mutations in the biological samples. The vertical axis shows the fold change. The dark grey bars show the fold change values obtained using the set of priority bins containing mutations. The light grey bars show the fold change values obtained using the set of priority bins. The medium grey bars show the data obtained using bins containing fragments shorter than 150 bp. The data in Fig. 10 show that the fold change values obtained with the priority-bin method were greater than 1, except for the mutation with an original allele frequency of 7.89% (indicated by the arrow in the figure).
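The fold change discussed above can be sketched as a ratio between the signal in a chosen bin set and the signal in all bins. The sketch assumes the signal is the allele frequency (variant-carrying fragments divided by total fragments at the locus); the function names and counts are hypothetical and do not reproduce the study data.

```python
def allele_frequency(mutant_count, total_count):
    """Variant-carrying fragments divided by all fragments at the locus."""
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    return mutant_count / total_count


def fold_change(mut_in_set, total_in_set, mut_all, total_all):
    """Allele frequency in a bin set relative to the allele frequency in all bins."""
    af_all = allele_frequency(mut_all, total_all)
    if af_all == 0:
        raise ValueError("fold change is undefined when the all-bin frequency is zero")
    return allele_frequency(mut_in_set, total_in_set) / af_all


# Hypothetical counts: 12 of 1500 fragments in the bin set, 20 of 5000 overall.
example_fold_change = fold_change(12, 1500, 20, 5000)
signal_increased = example_fold_change > 1
```

A fold change greater than 1 means the bin set carries a higher variant signal than the unfiltered data, which is the criterion applied to the 32 mutations above.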
Furthermore, fold change values appear to be greater when the mutation has a lower allele frequency, such as when the allele frequency is below 1 (as shown by the left dashed box).
Using a set of priority bins may facilitate the detection of mutations with allele frequencies below the detection limit of 0.05%. Allele frequencies were determined for 5 mutations, all of which were below the detection limit when the data were analyzed using all bins (see second column from the left). The allele frequencies of the mutations MDA_10134A: KRAS, MDA10070A: KRAS and MSK080: KRAS were raised above the detection limit using priority bins containing the mutation of interest (see third column from the left). Similar results were obtained with the method using all bins in the priority bin set (see fourth column from the left). In contrast, the method using bins containing fragments shorter than 150 bp did not rescue any of the 5 mutations that fell below the detection limit (see fifth column from the left). Thus, the data indicate that analyzing cfDNA fragments using priority bins can help detect tumor variants with allele frequencies below the detection limit, effectively rescuing mutation calls that would otherwise be missed.
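The notion of a mutation being rescued can be made concrete with a small check against the 0.05% detection limit stated above. The mutation labels and allele frequencies below are invented for illustration and are not the values from the study.

```python
LIMIT_OF_DETECTION = 0.0005  # 0.05%, as stated in the text


def rescued_mutations(frequencies, lod=LIMIT_OF_DETECTION):
    """Names of mutations below the limit in all bins but at or above it in a bin set.

    frequencies: mapping of name -> (AF using all bins, AF using the bin set)
    """
    return [name for name, (af_all, af_set) in frequencies.items()
            if af_all < lod <= af_set]


# Hypothetical allele frequencies: (all bins, priority bins).
example = {
    "mutation_1": (0.0003, 0.0009),   # rescued by the priority bins
    "mutation_2": (0.0004, 0.0004),   # still below the limit
    "mutation_3": (0.0020, 0.0035),   # detectable either way, so not "rescued"
}

rescued = rescued_mutations(example)
```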

Claims (65)

1. A method of analyzing cell-free DNA (cfDNA) to determine variants of interest, the method implemented on a computer system comprising one or more processors and system memory, the method comprising:
(a) Retrieving, by the one or more processors, sequence reads and fragment sizes of cfDNA fragments obtained from the test sample;
(b) Distributing, by one or more processors, the cfDNA fragments into a plurality of bins representing different fragment sizes; and
(c) Using the sequence reads and by one or more processors, determining allele frequencies of variants of interest in a set of priority bins selected from a plurality of bins, wherein the set of priority bins is selected as: (i) Limiting the probability that the number of variants of interest in the set of priority bins is below the detection limit; (ii) Increasing the probability that the number of variants of interest in the set of priority bins is higher than all bins of the plurality of bins.
2. The method of claim 1, wherein the test sample is a plasma sample.
3. The method of claim 1, wherein the set of priority bins is selected by a method comprising:
providing a plurality of candidate sets, each candidate set comprising non-uniform bins from a plurality of bins;
for each candidate set, calculating a first probability that the allele frequency of the variant of interest in the bin of the candidate set is below the detection limit in a modeled sample, wherein the modeled sample comprises cfDNA derived from cells with the variant of interest and cfDNA derived from cells without the variant of interest;
for each candidate set, calculating a second probability that the allele frequency of the variant of interest in the bin of the candidate set in the modeled sample is higher than the allele frequency of the variant of interest in the plurality of bins in the modeled sample; and
a candidate set is selected as a priority set based on the first probability and the second probability.
4. A method as claimed in claim 3, wherein the priority set has a maximum value of the second probability in the candidate set where the value of the first probability does not exceed the criterion.
5. The method of claim 3, wherein the plurality of candidate sets are obtained by a greedy method.
6. The method of claim 5, wherein the greedy method comprises:
obtaining sequence reads and fragment sizes of cfDNA fragments obtained from one or more unaffected training samples known to be unaffected by the disorder of interest and one or more affected training samples known to be affected by the disorder of interest;
distributing cfDNA fragments obtained from one or more unaffected training samples into a plurality of bins based on their size;
distributing cfDNA fragments obtained from one or more affected training samples into a plurality of bins based on their size;
ranking each bin of the plurality of bins based on a ratio of the frequency of fragments from the one or more affected training samples to the frequency of fragments from the one or more unaffected training samples;
selecting the bin with the highest rank as a candidate set;
adding the bin with the next highest rank to the most recent candidate set to provide a next candidate set; and
repeating the preceding step until all bins of the plurality of bins have been added, each repetition providing a further candidate set.
7. The method of claim 3, further comprising, after selecting the candidate set as the priority set, removing one or more bins from the priority set that do not contain variant sequences of interest.
8. The method of any one of claims 1-7, wherein the limit of detection is 0.05% -0.2%.
9. The method of claim 1, wherein the variant of interest comprises a Simple Nucleotide Variant (SNV).
10. The method of claim 9, wherein the SNV is a single nucleotide variant, a phased sequence variant, or a small insertion deletion.
11. The method of claim 1, wherein the sequence reads are double-ended reads and the cfDNA fragment size is derived from a read pair.
12. The method of claim 1, wherein cfDNA fragments obtained from the sample have been enriched.
13. The method of claim 1, further comprising extracting the cfDNA fragments from the test sample prior to (a).
14. The method of claim 1, wherein the cfDNA fragment comprises a circulating tumor DNA (ctDNA) fragment.
15. A method of analyzing cell-free DNA (cfDNA) to determine a variant of interest, the method comprising:
(a) Obtaining sequence reads and fragment sizes of cfDNA fragments obtained from the test sample;
(b) Assigning cfDNA fragments to a plurality of bins representing different fragment sizes based on their sizes; and
(c) Using the sequence reads, determining allele frequencies of variants of interest in a set of priority bins selected from a plurality of bins, wherein the set of priority bins is selected by a method comprising:
(i) Providing a plurality of candidate sets, each candidate set comprising non-uniform bins from a plurality of bins;
(ii) For each candidate set, calculating a second probability that the allele frequency of the variant of interest in the bin of the candidate set in the modeled sample is higher than the allele frequency of the variant of interest in the plurality of bins in the modeled sample, wherein the modeled sample comprises tissue having the variant of interest and tissue having a wild-type sequence of the variant of interest; and
(iii) A candidate set is selected having a maximum value of the second probability.
16. The method of claim 15, further comprising, prior to (iii) and for each candidate set, calculating a first probability that the allele frequencies of the variants of interest in the bin of the candidate set in the modeled sample do not exceed a detection limit, wherein (iii) comprises selecting a candidate set having a maximum of the second probabilities among the candidate sets for which the value of the first probability does not exceed the criterion.
17. A system for analyzing cell-free DNA (cfDNA), the system comprising a system memory and one or more processors configured to:
(a) Retrieving sequence reads and fragment sizes of cfDNA fragments obtained from a test sample;
(b) Distributing cfDNA fragments into a plurality of bins representing different fragment sizes; and
(c) Using the sequence reads, determining allele frequencies of variants of interest in a set of priority bins selected from a plurality of bins, wherein the set of priority bins is selected as: (i) Limiting the probability that the number of variants of interest in the set of priority bins is below a detection limit, and (ii) increasing the probability that the number of variants of interest in the set of priority bins is above all bins of the plurality of bins.
18. The system of claim 17, wherein the variant of interest is known or suspected to be associated with cancer.
19. The system of claim 17, wherein the variant of interest is known or suspected to be associated with a genetic disease.
20. The system of claim 17, wherein the one or more processors are further configured to: comparing the allele frequencies of the variants of interest in the set of priority bins to a standard, and determining the variants of interest in the test sample based on the comparison.
21. The system of claim 17, wherein the one or more processors are further configured to select the set of priority bins by:
providing a plurality of candidate sets, each candidate set comprising non-uniform bins from a plurality of bins;
for each candidate set, calculating a first probability that the allele frequency of the variant of interest in the bin of the candidate set is below the detection limit in a modeled sample, wherein the modeled sample comprises cfDNA derived from cells with the variant of interest and cfDNA derived from cells without the variant of interest;
for each candidate set, calculating a second probability that the allele frequency of the variant of interest in the bin of the candidate set in the modeled sample is higher than the allele frequency of the variant of interest in the plurality of bins in the modeled sample; and
a candidate set is selected as a priority set based on the first probability and the second probability.
22. The system of claim 21, wherein the priority set has a maximum value of the second probability among the candidate sets for which the value of the first probability does not exceed the criterion.
23. The system of claim 21, wherein the one or more processors are further configured to: obtain the plurality of candidate sets by a greedy method.
24. The system of claim 23, wherein the greedy method comprises:
obtaining sequence reads and fragment sizes of cfDNA fragments obtained from one or more unaffected training samples known to be unaffected by the disorder of interest and one or more affected training samples known to be affected by the disorder of interest;
distributing cfDNA fragments obtained from one or more unaffected training samples into a plurality of bins based on their size;
distributing cfDNA fragments obtained from one or more affected training samples into a plurality of bins based on their size;
ranking each bin of the plurality of bins based on a ratio of the frequency of fragments from the one or more affected training samples to the frequency of fragments from the one or more unaffected training samples;
selecting the bin with the highest rank as a candidate set;
adding the bin with the next highest rank to the most recent candidate set to provide a next candidate set; and
repeating the preceding step until all bins of the plurality of bins have been added, each repetition providing a further candidate set.
25. The system of claim 24, wherein the condition of interest comprises one or more cancers.
26. The system of claim 25, wherein the disorder of interest comprises cancer associated with the variant of interest.
27. The system of claim 25, wherein the affected training sample comprises cancerous tissue and the unaffected training sample comprises non-cancerous tissue.
28. The system of claim 21, wherein in the bin of the candidate set in the modeled sample, the allele frequencies of the variant of interest are estimated as:
wherein
AF(L_b1,b2...bk) is the allele frequency of the variant of interest in bins L_b1, L_b2, ..., L_bk,
N_mut(L_b1,b2...bk) is the count of the variant of interest in bins L_b1, L_b2, ..., L_bk,
DP is the sequencing depth,
f_tumor is the fraction of cfDNA from cells with the variant of interest,
α(L_bi) is the density of fragments in bin L_bi in the fragment length distribution of one or more affected samples known to be affected by the condition of interest, and
β(L_bi) is the density of fragments in bin L_bi in the fragment length distribution of one or more unaffected samples known to be unaffected by the condition of interest.
29. The system of claim 28, wherein the cell having the variant of interest is a cancer cell and the modeling sample comprises a plasma sample comprising cfDNA from a cancer cell and cfDNA from a non-cancer cell.
30. The system of claim 28, wherein the count of the variant of interest in bins L_b1, L_b2, ..., L_bk is modeled as a binomial distribution:
wherein AF_tumor is the allele frequency of the variant of interest in the tissue with the variant of interest.
31. The system of claim 30, wherein AF_tumor is calculated as:
AF_tumor = AF_plasma / f_tumor
wherein AF_plasma is the allele frequency of the variant of interest in the modeled sample.
32. The system of claim 21, wherein the one or more processors are further configured to: after selecting the candidate set as the priority set, one or more bins not containing variant sequences of interest are removed from the priority set.
33. The system of claim 17, wherein the detection limit is 0.05% -0.2%.
34. The system of claim 17, wherein the variant of interest comprises a Simple Nucleotide Variant (SNV).
35. The system of claim 34, wherein the SNV is a single nucleotide variant, a phased sequence variant, or a small insertion deletion.
36. The system of claim 17, wherein the sequence reads are double-ended reads and the cfDNA fragment size is derived from a read pair.
37. The system of claim 17, wherein cfDNA fragments obtained from the sample have been enriched.
38. The system of claim 17, wherein the cfDNA fragment comprises a circulating tumor DNA (ctDNA) fragment.
39. The system of claim 17, further comprising: a sequencer for receiving a nucleic acid sample and providing nucleic acid sequence information from the nucleic acid sample.
40. A system for analyzing cell-free DNA (cfDNA), the system comprising a system memory and one or more processors configured to:
(a) Obtaining sequence reads and fragment sizes of cfDNA fragments obtained from the test sample;
(b) Assigning cfDNA fragments to a plurality of bins representing different fragment sizes based on their sizes; and
(c) Using the sequence reads, determining allele frequencies of variants of interest in a set of priority bins selected from a plurality of bins, wherein the set of priority bins is selected by a method comprising:
(i) Providing a plurality of candidate sets, each candidate set comprising non-uniform bins from a plurality of bins;
(ii) For each candidate set, calculating a second probability that the allele frequency of the variant of interest in the bin of the candidate set in the modeled sample is higher than the allele frequency of the variant of interest in the plurality of bins in the modeled sample, wherein the modeled sample comprises tissue having the variant of interest and tissue having a wild-type sequence of the variant of interest; and
(iii) A candidate set is selected having a maximum value of the second probability.
41. The system of claim 40, wherein the one or more processors are further configured to: before (iii) and for each candidate set, calculating a first probability that the allele frequency of the variant of interest in the bin of the candidate set in the modeled sample does not exceed the detection limit, wherein (iii) comprises selecting a candidate set having a maximum value of the second probability among the candidate sets for which the value of the first probability does not exceed the criterion.
42. A non-transitory machine-readable medium storing program code which, when executed by one or more processors of a computer system, causes the computer system to implement a method for determining variants of interest in cell-free DNA, the program code comprising:
(a) Code for retrieving sequence reads and fragment sizes of cfDNA fragments obtained from a test sample;
(b) Code for assigning the cfDNA fragments into bins representing different fragment sizes; and
(c) Code for determining allele frequencies of a variant of interest in a set of priority bins selected from a plurality of bins using the sequence reads, wherein the set of priority bins is selected as: (i) Limiting the probability that the number of variants of interest in the set of priority bins is below a detection limit, and (ii) increasing the probability that the number of variants of interest in the set of priority bins is above all bins of the plurality of bins.
43. The non-transitory machine-readable medium of claim 42, wherein the variant of interest is known or suspected to be associated with cancer.
44. The non-transitory machine-readable medium of claim 42, wherein the variant of interest is known or suspected to be associated with a genetic disease.
45. The non-transitory machine-readable medium of claim 42, wherein the program code further comprises code for: comparing the allele frequencies of the variants of interest in the set of priority bins to a standard, and determining the variants of interest in the test sample based on the comparison.
46. The non-transitory machine-readable medium of claim 42, wherein the program code further comprises code for:
providing a plurality of candidate sets, each candidate set comprising non-uniform bins from a plurality of bins;
for each candidate set, calculating a first probability that the allele frequency of the variant of interest in the bin of the candidate set is below the detection limit in a modeled sample, wherein the modeled sample comprises cfDNA derived from cells with the variant of interest and cfDNA derived from cells without the variant of interest;
for each candidate set, calculating a second probability that the allele frequency of the variant of interest in the bin of the candidate set in the modeled sample is higher than the allele frequency of the variant of interest in the plurality of bins in the modeled sample; and
a candidate set is selected as a priority set based on the first probability and the second probability.
47. The non-transitory machine readable medium of claim 46, wherein the priority set has a maximum value of a second probability among the candidate sets of which the value of the first probability does not exceed the criterion.
48. The non-transitory machine readable medium of claim 46, wherein the program code further comprises code for: obtaining the plurality of candidate sets by a greedy method.
49. The non-transitory machine-readable medium of claim 48, wherein the greedy method comprises:
obtaining sequence reads and fragment sizes of cfDNA fragments obtained from one or more unaffected training samples known to be unaffected by the disorder of interest and one or more affected training samples known to be affected by the disorder of interest;
distributing cfDNA fragments obtained from one or more unaffected training samples into a plurality of bins based on their size;
distributing cfDNA fragments obtained from one or more affected training samples into a plurality of bins based on their size;
ranking each bin of the plurality of bins based on a ratio of the frequency of fragments from the one or more affected training samples to the frequency of fragments from the one or more unaffected training samples;
selecting the bin with the highest rank as a candidate set;
adding the bin with the next highest rank to the most recent candidate set to provide a next candidate set; and
repeating the preceding step until all bins of the plurality of bins have been added, each repetition providing a further candidate set.
50. The non-transitory machine-readable medium of claim 49, wherein the condition of interest comprises one or more cancers.
51. The non-transitory machine-readable medium of claim 50, wherein the disorder of interest comprises a cancer associated with the variant of interest.
52. The non-transitory machine readable medium of claim 50, wherein the affected training sample comprises cancerous tissue and the unaffected training sample comprises non-cancerous tissue.
53. The non-transitory machine readable medium of claim 46, wherein in the bin of the candidate set in the modeled sample, the allele frequencies of the variant of interest are estimated as:
wherein
AF(L_b1,b2...bk) is the allele frequency of the variant of interest in bins L_b1, L_b2, ..., L_bk,
N_mut(L_b1,b2...bk) is the count of the variant of interest in bins L_b1, L_b2, ..., L_bk,
DP is the sequencing depth,
f_tumor is the fraction of cfDNA from cells with the variant of interest,
α(L_bi) is the density of fragments in bin L_bi in the fragment length distribution of one or more affected samples known to be affected by the condition of interest, and
β(L_bi) is the density of fragments in bin L_bi in the fragment length distribution of one or more unaffected samples known to be unaffected by the condition of interest.
54. The non-transitory machine-readable medium of claim 53, wherein the cells having the variant of interest are cancer cells and the modeled sample comprises a plasma sample comprising cfDNA from cancer cells and cfDNA from non-cancer cells.
55. The non-transitory machine-readable medium of claim 53, wherein the count of the variant of interest in bins L_b1, L_b2, ..., L_bk is modeled as a binomial distribution:
wherein AF_tumor is the allele frequency of the variant of interest in a tissue having the variant of interest.
56. The non-transitory machine-readable medium of claim 55, wherein AF_tumor is calculated as:
AF_tumor = AF_plasma / f_tumor
wherein AF_plasma is the allele frequency of the variant of interest in the modeled sample.
57. The non-transitory machine-readable medium of claim 46, wherein the program code further comprises code for removing, from the prioritized set, one or more bins that do not contain a sequence of the variant of interest.
58. The non-transitory machine-readable medium of claim 42, wherein the detection limit is 0.05% to 0.2%.
59. The non-transitory machine-readable medium of claim 42, wherein the variant of interest comprises a simple nucleotide variant (SNV).
60. The non-transitory machine-readable medium of claim 59, wherein the SNV is a single nucleotide variant, a phased sequence variant, or a small insertion or deletion (indel).
61. The non-transitory machine-readable medium of claim 42, wherein the sequence reads are paired-end reads and the cfDNA fragment sizes are derived from read pairs.
62. The non-transitory machine-readable medium of claim 42, wherein the cfDNA fragments obtained from the sample have been enriched.
63. The non-transitory machine-readable medium of claim 42, wherein the cfDNA fragments comprise circulating tumor DNA (ctDNA) fragments.
64. A non-transitory machine-readable medium storing program code which, when executed by one or more processors of a computer system, causes the computer system to implement a method for determining variants of interest in cell-free DNA, the program code comprising:
(a) code for obtaining sequence reads and fragment sizes of cfDNA fragments obtained from a test sample;
(b) code for assigning the cfDNA fragments, based on their sizes, into a plurality of bins representing different fragment sizes; and
(c) code for determining, using the sequence reads, the allele frequency of the variant of interest in a set of preferred bins selected from the plurality of bins, wherein the set of preferred bins is selected by a method comprising:
(i) providing a plurality of candidate sets, each candidate set comprising non-uniform bins from the plurality of bins;
(ii) for each candidate set, calculating a second probability that the allele frequency of the variant of interest in the bins of the candidate set in a modeled sample is higher than the allele frequency of the variant of interest in the plurality of bins in the modeled sample, wherein the modeled sample comprises tissue having the variant of interest and tissue having a wild-type sequence of the variant of interest; and
(iii) selecting the candidate set having the maximum value of the second probability.
65. The non-transitory machine-readable medium of claim 64, wherein the program code further comprises code for: before (iii) and for each candidate set, calculating a first probability that the allele frequency of the variant of interest in the bins of the candidate set in the modeled sample does not exceed the detection limit, wherein (iii) comprises selecting the candidate set having the maximum value of the second probability among the candidate sets for which the value of the first probability does not exceed a criterion.
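The bin selection recited in claims 64 and 65 can be illustrated with a minimal Python sketch. It is not the claimed implementation. The equations of claims 53 and 55 are not reproduced in this text, so the sketch assumes a form consistent with the listed symbols: the depth in a set of bins is DP × Σ [f_tumor·α(L_bi) + (1 − f_tumor)·β(L_bi)], and the variant count in those bins is binomial over the DP × f_tumor × Σ α(L_bi) tumor-derived fragments with success probability AF_tumor. Variant counts inside and outside a candidate set are further assumed to be independent. All function names, parameter names, and numeric values are hypothetical.

```python
from itertools import combinations
from math import comb


def binom_pmf(n, p):
    """Return [P(X = 0), ..., P(X = n)] for X ~ Binomial(n, p)."""
    return [comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(n + 1)]


def score_candidate(bins, alpha, beta, dp, f_tumor, af_tumor, lod):
    """Return (first_probability, second_probability) for one candidate set.

    alpha[i], beta[i]: fragment densities of bin i in affected and
    unaffected samples. The depth and binomial forms are assumptions,
    not text taken from the claims.
    """
    all_bins = range(len(alpha))
    others = [i for i in all_bins if i not in bins]
    # Tumor-derived fragments inside and outside the candidate set.
    n_in = round(dp * f_tumor * sum(alpha[i] for i in bins))
    n_out = round(dp * f_tumor * sum(alpha[i] for i in others))

    def depth(idx):
        return dp * sum(f_tumor * alpha[i] + (1 - f_tumor) * beta[i] for i in idx)

    depth_in, depth_all = depth(bins), depth(all_bins)
    pmf_in, pmf_out = binom_pmf(n_in, af_tumor), binom_pmf(n_out, af_tumor)
    first = second = 0.0
    for x, px in enumerate(pmf_in):
        if x / depth_in <= lod:  # AF in the set does not exceed the detection limit
            first += px
        for y, py in enumerate(pmf_out):
            if x / depth_in > (x + y) / depth_all:  # AF in the set beats AF over all bins
                second += px * py
    return first, second


def select_bins(alpha, beta, dp, f_tumor, af_tumor, lod, criterion):
    """Pick the proper subset of bins with the largest second probability
    among subsets whose first probability does not exceed the criterion."""
    best, best_second = None, -1.0
    for k in range(1, len(alpha)):
        for bins in combinations(range(len(alpha)), k):
            first, second = score_candidate(
                bins, alpha, beta, dp, f_tumor, af_tumor, lod
            )
            if first <= criterion and second > best_second:
                best, best_second = bins, second
    return best, best_second


# Hypothetical three-bin example: tumor fragments are enriched in bin 0.
alpha = [0.6, 0.3, 0.1]
beta = [0.2, 0.3, 0.5]
best, prob = select_bins(alpha, beta, dp=1000, f_tumor=0.1,
                         af_tumor=0.5, lod=0.001, criterion=0.05)
print(best, prob)
```

Exhaustive enumeration of subsets is practical only for a small number of bins; with many bins, a restricted family of candidate sets would be needed, which this sketch does not address.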

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US201762488549P 2017-04-21 2017-04-21
US62/488,549 2017-04-21
PCT/US2018/028654 WO2018195483A1 (en) 2017-04-21 2018-04-20 Using cell-free dna fragment size to detect tumor-associated variant

Publications (2)

Publication Number Publication Date
CN110800063A (en) 2020-02-14
CN110800063B (en) 2023-12-08

Family

ID=62196689

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201880041466.2A Active CN110800063B (en) 2017-04-21 2018-04-20 Detection of tumor-associated variants using cell-free DNA fragment size

Country Status (6)

Country Link
US (2) US11342047B2 (en)
EP (1) EP3612964A1 (en)
CN (1) CN110800063B (en)
AU (2) AU2018254595B2 (en)
CA (1) CA3060414A1 (en)
WO (1) WO2018195483A1 (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10741269B2 (en) 2013-10-21 2020-08-11 Verinata Health, Inc. Method for improving the sensitivity of detection in determining copy number variations
CA2970501C (en) 2014-12-12 2020-09-15 Verinata Health, Inc. Using cell-free dna fragment size to determine copy number variations
US10095831B2 (en) 2016-02-03 2018-10-09 Verinata Health, Inc. Using cell-free DNA fragment size to determine copy number variations
GB201818159D0 (en) * 2018-11-07 2018-12-19 Cancer Research Tech Ltd Enhanced detection of target dna by fragment size analysis
WO2020132499A2 (en) * 2018-12-21 2020-06-25 Grail, Inc. Systems and methods for using fragment lengths as a predictor of cancer
JP7332695B2 (en) * 2018-12-21 2023-08-23 エフ. ホフマン-ラ ロシュ アーゲー Identification of global sequence features in whole-genome sequence data from circulating nucleic acids
CN109652513B (en) * 2019-02-25 2022-08-23 元码基因科技(北京)股份有限公司 Method and kit for accurately detecting individual mutation of liquid biopsy based on second-generation sequencing technology
US20210102199A1 (en) * 2019-10-08 2021-04-08 Illumina, Inc. Fragment size characterization of cell-free dna mutations from clonal hematopoiesis
BR112022008752A2 (en) * 2019-11-06 2022-08-16 Univ Leland Stanford Junior METHODS AND SYSTEMS FOR THE ANALYSIS OF NUCLEIC ACID MOLECULES
CN112086129B (en) * 2020-09-23 2021-04-06 深圳吉因加医学检验实验室 Method and system for predicting cfDNA of tumor tissue
CN112410422B (en) * 2020-10-30 2022-06-03 深圳思勤医疗科技有限公司 Method for predicting tumor risk value based on fragmentation pattern
US11248265B1 (en) 2020-11-19 2022-02-15 Clear Labs, Inc Systems and processes for distinguishing pathogenic and non-pathogenic sequences from specimens
WO2022226389A1 (en) * 2021-04-23 2022-10-27 The Translational Genomics Research Institute Analysis of fragment ends in dna
EP4130293A1 (en) * 2021-08-04 2023-02-08 OncoDNA SA Method of mutation detection in a liquid biopsy
KR102544002B1 (en) * 2022-03-10 2023-06-16 주식회사 아이엠비디엑스 Method for Differentiating Somatic Mutation and Germline Mutation

Citations (11)

Publication number Priority date Publication date Assignee Title
WO2013035114A1 (en) * 2011-09-08 2013-03-14 Decode Genetics Ehf Tp53 genetic variants predictive of cancer
CN103003447A (en) * 2011-07-26 2013-03-27 维里纳塔健康公司 Method for determining the presence or absence of different aneuploidies in a sample
CN103397103A (en) * 2013-08-26 2013-11-20 中国人民解放军第三军医大学第三附属医院 Method and kit for detecting SOCS family gene labeled single nucleotide polymorphism sites
CN103642920A (en) * 2013-12-09 2014-03-19 中国人民解放军第三军医大学第三附属医院 Method and kit for detecting RAGE (receptor for advanced glycation end) family gene tag single nucleotide polymorphic site
CN104781421A (en) * 2012-09-04 2015-07-15 夸登特健康公司 Systems and methods to detect rare mutations and copy number variation
CN104884633A (en) * 2012-10-05 2015-09-02 勒芬天主教大学研发中心 High-throughput genotyping by sequencing low amounts of genetic material
CN105518151A (en) * 2013-03-15 2016-04-20 莱兰斯坦福初级大学评议会 Identification and use of circulating nucleic acid tumor markers
WO2016094853A1 (en) * 2014-12-12 2016-06-16 Verinata Health, Inc. Using cell-free dna fragment size to determine copy number variations
CN105830077A (en) * 2013-10-21 2016-08-03 维里纳塔健康公司 Method for improving the sensitivity of detection in determining copy number variations
CN106460070A (en) * 2014-04-21 2017-02-22 纳特拉公司 Detecting mutations and ploidy in chromosomal segments
CN106544407A (en) * 2015-09-18 2017-03-29 广州华大基因医学检验所有限公司 The method for determining donor source cfDNA ratios in receptor cfDNA samples

Family Cites Families (28)

Publication number Priority date Publication date Assignee Title
WO2007145612A1 (en) 2005-06-06 2007-12-21 454 Life Sciences Corporation Paired end sequencing
US8262900B2 (en) 2006-12-14 2012-09-11 Life Technologies Corporation Methods and apparatus for measuring analytes using large scale FET arrays
CN101889074A (en) 2007-10-04 2010-11-17 哈尔西恩莫尔丘勒公司 Sequencing nucleic acid polymers with electron microscopy
WO2009051842A2 (en) 2007-10-18 2009-04-23 The Johns Hopkins University Detection of cancer by measuring genomic copy number and strand length in cell-free dna
HUE031849T2 (en) 2008-09-20 2017-08-28 Univ Leland Stanford Junior Noninvasive diagnosis of fetal aneuploidy by sequencing
BR112012010708A2 (en) 2009-11-06 2016-03-29 Univ Hong Kong Chinese method for performing prenatal diagnosis, and, computer program product
WO2012006291A2 (en) 2010-07-06 2012-01-12 Life Technologies Corporation Systems and methods to detect copy number variation
US9029103B2 (en) 2010-08-27 2015-05-12 Illumina Cambridge Limited Methods for sequencing polynucleotides
US8725422B2 (en) 2010-10-13 2014-05-13 Complete Genomics, Inc. Methods for estimating genome-wide copy number variations
US9411937B2 (en) 2011-04-15 2016-08-09 Verinata Health, Inc. Detecting and classifying copy number variation
WO2014014498A1 (en) 2012-07-20 2014-01-23 Verinata Health, Inc. Detecting and classifying copy number variation in a fetal genome
JP5659319B2 (en) 2011-06-29 2015-01-28 ビージーアイ ヘルス サービス カンパニー リミテッド Non-invasive detection of genetic abnormalities in the fetus
CA2850781C (en) 2011-10-06 2020-09-01 Sequenom, Inc. Methods and processes for non-invasive assessment of genetic variations
DK2772549T3 (en) 2011-12-31 2019-08-19 Bgi Genomics Co Ltd METHOD OF DETECTING GENETIC VARIATION
EP4148739A1 (en) 2012-01-20 2023-03-15 Sequenom, Inc. Diagnostic processes that factor experimental conditions
US9892230B2 (en) 2012-03-08 2018-02-13 The Chinese University Of Hong Kong Size-based analysis of fetal or tumor DNA fraction in plasma
US10504613B2 (en) 2012-12-20 2019-12-10 Sequenom, Inc. Methods and processes for non-invasive assessment of genetic variations
US9920361B2 (en) 2012-05-21 2018-03-20 Sequenom, Inc. Methods and compositions for analyzing nucleic acid
US11261494B2 (en) 2012-06-21 2022-03-01 The Chinese University Of Hong Kong Method of measuring a fractional concentration of tumor DNA
WO2014149134A2 (en) 2013-03-15 2014-09-25 Guardant Health Inc. Systems and methods to detect rare mutations and copy number variation
WO2014052855A1 (en) 2012-09-27 2014-04-03 Population Diagnostics, Inc. Methods and compositions for screening and treating developmental disorders
CN105722994B (en) 2013-06-17 2020-12-18 维里纳塔健康公司 Method for determining copy number variation in chromosomes
US10415083B2 (en) 2013-10-28 2019-09-17 The Translational Genomics Research Institute Long insert-based whole genome sequencing
AU2015266665C1 (en) 2014-05-30 2021-12-23 Verinata Health, Inc. Detecting fetal sub-chromosomal aneuploidies and copy number variations
DK3656875T3 (en) * 2014-07-18 2021-12-13 Illumina Inc Non-invasive prenatal diagnosis
US10364467B2 (en) 2015-01-13 2019-07-30 The Chinese University Of Hong Kong Using size and number aberrations in plasma DNA for detecting cancer
US20170058332A1 (en) * 2015-09-02 2017-03-02 Guardant Health, Inc. Identification of somatic mutations versus germline variants for cell-free dna variant calling applications
US10095831B2 (en) 2016-02-03 2018-10-09 Verinata Health, Inc. Using cell-free DNA fragment size to determine copy number variations

Non-Patent Citations (6)

Title
Aryan Arbabi et al. Cell-free DNA fragment-size distribution analysis for non-invasive prenatal CNV prediction. Bioinformatics. 2016, vol. 32, no. 11, 1662-1669. *
Diana H. Liang et al. Cell-free DNA as a molecular tool for monitoring disease progression and response to therapy in breast cancer patients. Breast Cancer Research and Treatment. 2015, vol. 155, 139-149. *
Hunter R. Underhill et al. Fragment Length of Circulating Tumor DNA. PLOS Genetics, vol. 12, no. 7, 1-24; main text p. 5 para. 5, p. 6 para. 1, Figs. 3 and 5. *
Hunter R. Underhill et al. Fragment Length of Circulating Tumor DNA. PLOS Genetics. 2016, vol. 12, no. 7, 1-24. *
Peiyong Jiang et al. Lengthening and shortening of plasma DNA in hepatocellular carcinoma patients. Proceedings of the National Academy of Sciences. 2015, vol. 112, no. 11, E1317-E1325. *
王雯邈. Association study of genetic variants in the 3'UTR of differentially expressed breast cancer genes and the prognosis of triple-negative breast cancer. China Master's Theses Full-text Database (Medicine and Health Sciences). 2015, no. 2015(01), E072-1204. *

Also Published As

Publication number Publication date
AU2023219911A1 (en) 2023-09-14
US20220246234A1 (en) 2022-08-04
EP3612964A1 (en) 2020-02-26
AU2018254595B2 (en) 2023-06-15
WO2018195483A1 (en) 2018-10-25
CN110800063A (en) 2020-02-14
US20180307796A1 (en) 2018-10-25
US11342047B2 (en) 2022-05-24
AU2018254595A1 (en) 2019-11-14
CA3060414A1 (en) 2018-10-25

Similar Documents

Publication Publication Date Title
CN110800063B (en) Detection of tumor-associated variants using cell-free DNA fragment size
US20200335178A1 (en) Detecting repeat expansions with short read sequencing data
US11492656B2 (en) Haplotype resolved genome sequencing
TWI661049B (en) Using cell-free dna fragment size to determine copy number variations
CN107750277B (en) Determination of copy number variation using cell-free DNA fragment size
CN110770838B (en) Methods and systems for determining somatically mutated clonality
JP6161607B2 (en) How to determine the presence or absence of different aneuploidies in a sample
AU2019250200A1 (en) Error Suppression In Sequenced DNA Fragments Using Redundant Reads With Unique Molecular Indices (UMIs)
JP7009516B2 (en) Methods for Accurate Computational Degradation of DNA Mixtures from Contributors of Unknown Genotypes
JP2020529648A (en) Methods and systems for degradation and quantification of DNA mixtures from multiple contributors of known or unknown genotypes
CN115989544A (en) Method and system for visualizing short reads in repetitive regions of a genome
CN113096726B (en) Determination of copy number variation using cell-free DNA fragment size
CN116157869A (en) Systems and methods for detecting genetic alterations

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40023955; Country of ref document: HK)
GR01 Patent grant