WO2020247411A1 - Limit of detection based quality control metric - Google Patents
- Publication number
- WO2020247411A1 (PCT/US2020/035787)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- sequence
- fetal fraction
- coverage
- samples
- sample
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/10—Ploidy or copy number detection
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/10—Sequence alignment; Homology search
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/20—Sequence assembly
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/20—Supervised data analysis
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/30—Unsupervised data analysis
Definitions
- CNV copy number variation
- Conventional procedures for genetic screening and biological dosimetry have utilized invasive procedures, e.g., amniocentesis, cordocentesis, or chorionic villus sampling (CVS), to obtain cells for the analysis of karyotypes. Recognizing the need for more rapid testing methods that do not require cell culture, fluorescence in situ hybridization (FISH), quantitative fluorescence PCR (QF-PCR) and array-Comparative Genomic Hybridization (array-CGH) have been developed as molecular-cytogenetic methods for the analysis of copy number variations.
- FISH fluorescence in situ hybridization
- QF-PCR quantitative fluorescence PCR
- array-CGH array-Comparative Genomic Hybridization
- One aspect of the disclosure relates to methods for processing test samples each including cell-free nucleic acid fragments originating from a mother and a fetus.
- the methods are implemented using a computer system including one or more processors and memory.
- the method includes: (a) determining a value of fetal fraction of the test sample, wherein the fetal fraction of the test sample indicates the relative amount of fetal origin cell-free nucleic acid fragments in the test sample; (b) receiving, by the computer system, sequence reads obtained by sequencing the cell-free nucleic acid fragments in the test sample; (c) aligning, by the computer system, the sequence reads of the cell-free nucleic acid fragments to a reference genome comprising a sequence of interest, thereby providing sequence tags; (d) determining, by the computer system, a coverage of the sequence tags for at least a portion of the reference genome; (e) determining that the test sample is within an exclusion region based on the coverage of sequence tags determined in (
- the method further includes, prior to (f), determining that the test sample is negative for the CNV of the sequence of interest.
- the method further includes: repeating (a)-(d) using the re-sequenced sequence reads; determining that the test sample is outside the exclusion region; and calling the test sample as either having the CNV of the sequence of interest or not having the CNV of the sequence of interest.
- the detection criterion is a desired level of confidence that for an observed fetal fraction the ground truth fetal fraction is larger than a specified LOD. In some implementations, the detection criterion is X% confident that for the observed fetal fraction, the ground truth fetal fraction is larger than LOD Y%. In some implementations, X is about 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 99%, or 99.5%. In some implementations, Y is about 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, or 99% confidence of detection. In some implementations, X is 50% and Y is 95%.
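The detection criterion above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the true fetal fraction given an observed value follows a normal distribution centered on the observation; the function names, the `error_sd` parameter, and the numeric values are hypothetical and are not taken from the disclosure.

```python
from statistics import NormalDist


def confidence_above_lod(observed_ff, lod, error_sd):
    """Probability that the true fetal fraction exceeds the LOD.

    Assumption (not from the source): true FF ~ Normal(observed_ff, error_sd).
    """
    true_ff = NormalDist(mu=observed_ff, sigma=error_sd)
    return 1.0 - true_ff.cdf(lod)


def passes_detection_criterion(observed_ff, lod, error_sd, required_confidence):
    """True when we are at least `required_confidence` (X) sure that FF > LOD."""
    return confidence_above_lod(observed_ff, lod, error_sd) >= required_confidence


# Hypothetical numbers: a 4% LOD, 1% estimation error, X = 50%.
if __name__ == "__main__":
    for ff in (0.03, 0.04, 0.06):
        print(ff, passes_detection_criterion(ff, lod=0.04, error_sd=0.01,
                                             required_confidence=0.5))
```

Under this assumed error model, X = 50% reduces to requiring that the observed fetal fraction be at least the LOD; larger values of X push the cutoff above the LOD.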
- the exclusion region is under the fetal fraction
- the value of fetal fraction of the test sample is determined based on sizes of the cell-free nucleic acid fragments. In some implementations, the value of fetal fraction of the test sample is determined by: obtaining a frequency distribution of the sizes of the cell-free nucleic acid fragments; and applying the frequency distribution to a model relating fetal fraction to frequency of fragment size to obtain the fetal fraction value.
- the value of fetal fraction of the test sample is determined based on coverage information for bins of the reference genome.
- the value of fetal fraction is calculated by: applying coverage values of a plurality of bins of the reference genome to a model relating fetal fraction to coverage of bin to obtain the fetal fraction value.
- the plurality of bins of the reference genome have higher fractions of fetal cell-free nucleic acid fragments than other bins.
- the value of fetal fraction of the test sample is determined based on coverage information for the bins of a sex chromosome.
- Figure 2 shows a two-step coverage threshold for excluding samples.
- Figure 3 shows fetal fraction distributions for three populations or samples thereof.
- Figure 4B shows another LOD QC process for CNV detection.
- Figure 7 shows that estimated fetal fractions include errors causing the estimated fetal fractions to deviate from the true fetal fractions.
- Figure 11 shows observed fetal fraction of 2% and its simulated true fetal fraction distributions given different errors or coverage.
- Figure 16B shows an example process 800 for determining fetal fraction from coverage information according to some implementations of the disclosure.
- Figure 19 shows the chromosome Y coverage (left plot) and FF fraction estimator (right plot) for the synthetically generated samples as a function of dilution fraction.
- Figure 27 shows the samples that are rescued by the LOD QC method.
- parameter represents a physical feature whose value or other characteristic has an impact on a relevant condition such as copy number variation.
- parameter is used with reference to a variable that affects the output of a mathematical relation or model, which variable may be an independent variable (i.e., an input to the model) or an intermediate variable based on one or more independent variables.
- an output of one model may become an input of another model, thereby becoming a parameter to the other model.
- fragment size parameter refers to a parameter that relates to the size or length of a fragment or a collection of fragments such as nucleic acid fragments, e.g., cfDNA fragments obtained from a bodily fluid.
- weighting refers to modifying a quantity such as a parameter or variable using one or more values or functions, which are considered the “weight.”
- the parameter or variable is multiplied by the weight.
- the parameter or variable is modified exponentially.
- the function may be a linear or non-linear function. Examples of applicable non-linear functions include, but are not limited to Heaviside step functions, box-car functions, stair-case functions, or sigmoidal functions. Weighting an original parameter or variable may systematically increase or decrease the value of the weighted variable. In various embodiments, weighting may result in positive, non-negative, or negative values.
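As a rough sketch of the weighting functions named above, the following defines Heaviside, box-car, and sigmoidal weights and applies one of them multiplicatively. The example data (fragment counts keyed by fragment size) and the 80 to 150 bp window are hypothetical choices made only to show the mechanics.

```python
import math


def heaviside(x, threshold=0.0):
    """Step function: weight 0 below the threshold, 1 at or above it."""
    return 1.0 if x >= threshold else 0.0


def boxcar(x, low, high):
    """Box-car function: weight 1 inside [low, high], 0 elsewhere."""
    return 1.0 if low <= x <= high else 0.0


def sigmoid(x, midpoint=0.0, steepness=1.0):
    """Smooth transition from 0 to 1 around the midpoint."""
    return 1.0 / (1.0 + math.exp(-steepness * (x - midpoint)))


def apply_weights(counts_by_size, weight_fn):
    """Multiply each count by the weight assigned to its fragment size."""
    return {size: count * weight_fn(size) for size, count in counts_by_size.items()}


if __name__ == "__main__":
    counts = {100: 10, 200: 5}  # hypothetical fragment-size histogram
    print(apply_weights(counts, lambda s: boxcar(s, 80, 150)))
```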
- chromosomal aneuploidy and “complete chromosomal aneuploidy” herein refer to an imbalance of genetic material caused by a loss or gain of a whole chromosome, and includes germline aneuploidy and mosaic aneuploidy.
- test sample refers to a sample, typically derived from a biological fluid, cell, tissue, organ, or organism, comprising a nucleic acid or a mixture of nucleic acids comprising at least one nucleic acid sequence that is to be screened for copy number variation.
- the sample comprises at least one nucleic acid sequence whose copy number is suspected of having undergone variation.
- Methods of pretreatment may also involve, but are not limited to, filtration, precipitation, dilution, distillation, mixing, centrifugation, freezing, lyophilization, concentration, amplification, nucleic acid fragmentation, inactivation of interfering components, the addition of reagents, lysing, etc. If such methods of pretreatment are employed with respect to the sample, such pretreatment methods are typically such that the nucleic acid(s) of interest remain in the test sample, sometimes at a concentration proportional to that in an untreated test sample (namely, a sample that is not subjected to any such pretreatment method(s)). Such “treated” or “processed” samples are still considered to be biological “test” samples with respect to the methods described herein.
- qualified sample or “unaffected sample” herein refers to a sample comprising a mixture of nucleic acids that are present in a known copy number to which the nucleic acids in a test sample are to be compared, and it is a sample that is normal, i.e., not aneuploid, for the nucleic acid sequence of interest.
- qualified samples are used as unaffected training samples of a training set to derive sequence masks or sequence profiles.
- qualified samples are used for identifying one or more normalizing chromosomes or segments for a chromosome under consideration. For example, qualified samples may be used for identifying a normalizing chromosome for chromosome 21.
- training set refers to a set of training samples that can comprise affected and/or unaffected samples and are used to develop a model for analyzing test samples.
- the training set includes unaffected samples.
- thresholds for determining CNV are established using training sets of samples that are unaffected for the copy number variation of interest.
- the unaffected samples in a training set may be used as the qualified samples to identify normalizing sequences, e.g., normalizing chromosomes, and the chromosome doses of unaffected samples are used to set the thresholds for each of the sequences, e.g., chromosomes, of interest.
- the training set includes affected samples.
- the affected samples in a training set can be used to verify that affected test samples can be easily differentiated from unaffected samples.
- a training set is used in conjunction with a validation set.
- the term “validation set” is used to refer to a set of individuals in a statistical sample, data of which individuals are used to validate or evaluate the quantitative values of interest determined using a training set.
- a training set provides data for calculating a mask for a reference sequence, while a validation set provides data to evaluate the validity or effectiveness of the mask.
- evaluation of copy number is used herein in reference to the statistical evaluation of the status of a genetic sequence related to the copy number of the sequence.
- the evaluation comprises the determination of the presence or absence of a genetic sequence.
- the evaluation comprises the determination of the partial or complete aneuploidy of a genetic sequence.
- the evaluation comprises discrimination between two or more samples based on the copy number of a genetic sequence.
- the evaluation comprises statistical analyses, e.g., normalization and comparison, based on the copy number of the genetic sequence.
- qualified sequence refers to a sequence against which the amount of a sequence or nucleic acid of interest is compared.
- a qualified sequence is one present in a biological sample preferably at a known representation, i.e., the amount of a qualified sequence is known.
- a qualified sequence is the sequence present in a “qualified sample.”
- A “qualified sequence of interest” is a qualified sequence for which the amount is known in a qualified sample, and is a sequence that is associated with a difference of a sequence of interest between a control subject and an individual with a medical condition.
- a normalizing sequence refers to a sequence that is used to normalize the number of sequence tags mapped to a sequence of interest associated with the normalizing sequence.
- a normalizing sequence comprises a robust chromosome.
- A “robust chromosome” is one that is unlikely to be aneuploid.
- a robust chromosome is any chromosome other than the X chromosome, Y chromosome, chromosome 13, chromosome 18, and chromosome 21.
- the normalizing sequence displays a variability in the number of sequence tags that are mapped to it among samples and sequencing runs that approximates the variability of the sequence of interest for which it is used as a normalizing parameter.
- the normalizing sequence can differentiate an affected sample from one or more unaffected samples.
- the normalizing sequence best or effectively differentiates, when compared to other potential normalizing sequences such as other chromosomes, an affected sample from one or more unaffected samples.
- the variability of the normalizing sequence is calculated as the variability in the chromosome dose for the sequence of interest across samples and sequencing runs.
- normalizing sequences are identified in a set of unaffected samples.
- a “normalizing chromosome,” “normalizing denominator chromosome,” or “normalizing chromosome sequence” is an example of a “normalizing sequence.”
- A “normalizing chromosome sequence” can be composed of a single chromosome or of a group of chromosomes.
- a normalizing sequence comprises two or more robust chromosomes.
- the robust chromosomes are all autosomal chromosomes other than chromosomes X, Y, 13, 18, and 21.
- the term “differentiability” herein refers to a characteristic of a normalizing chromosome that enables one to distinguish one or more unaffected, i.e., normal, samples from one or more affected, i.e., aneuploid, samples.
- a normalizing chromosome displaying the greatest “differentiability” is a chromosome or group of chromosomes that provides the greatest statistical difference between the distribution of chromosome doses for a chromosome of interest in a set of qualified samples and the chromosome dose for the same chromosome of interest in the corresponding chromosome in the one or more affected samples.
- variability refers to another characteristic of a normalizing chromosome that enables one to distinguish one or more unaffected, i.e., normal, samples from one or more affected, i.e., aneuploid, samples.
- the variability of a normalizing chromosome which is measured in a set of qualified samples, refers to the variability in the number of sequence tags that are mapped to it that approximates the variability in the number of sequence tags that are mapped to a chromosome of interest for which it serves as a normalizing parameter.
- sequence tag density ratio refers to the ratio of the number of sequence tags that are mapped to a chromosome of the reference genome, e.g., chromosome 21, to the length of the reference genome chromosome.
- the term “coverage” refers to the abundance of sequence tags mapped to a defined sequence. Coverage can be quantitatively indicated by sequence tag density (or count of sequence tags), sequence tag density ratio, normalized coverage amount, adjusted coverage values, etc.
- the term “coverage quantity” refers to a modification of raw coverage and often represents the relative quantity of sequence tags (sometimes called counts) in a region of a genome such as a bin. A coverage quantity may be obtained by normalizing, adjusting and/or correcting the raw coverage or count for a region of the genome. For example, a normalized coverage quantity for a region may be obtained by dividing the sequence tag count mapped to the region by the total number of sequence tags mapped to the entire genome.
- Normalized coverage quantity allows comparison of coverage of a bin across different samples, which may have different depths of sequencing. It differs from sequence dose in that the latter is typically obtained by dividing by the tag count mapped to a subset of the entire genome. The subset is one or more normalizing segments or chromosomes. Coverage quantities, whether or not normalized, may be corrected for global profile variation from region to region on the genome, G-C fraction variations, outliers in robust chromosomes, etc.
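The distinction drawn above between sequence tag density ratio, normalized coverage quantity, and sequence dose can be sketched as follows. This is a minimal illustration; the function names, bin labels, chromosome lengths, and counts are all hypothetical.

```python
def sequence_tag_density_ratio(tag_count, chromosome_length):
    """Tags mapped to a chromosome divided by that chromosome's length."""
    return tag_count / chromosome_length


def normalized_coverage(bin_counts):
    """Divide each bin's tag count by the total tag count over the whole genome."""
    total = sum(bin_counts.values())
    if total == 0:
        raise ValueError("no sequence tags were mapped")
    return {bin_id: count / total for bin_id, count in bin_counts.items()}


def sequence_dose(chrom_counts, chrom_of_interest, normalizing_chroms):
    """Divide by the tag count of a normalizing subset, not the whole genome."""
    denominator = sum(chrom_counts[c] for c in normalizing_chroms)
    return chrom_counts[chrom_of_interest] / denominator


if __name__ == "__main__":
    counts = {"chr21": 10, "chr9": 40, "chr1": 50}  # hypothetical tag counts
    print(normalized_coverage(counts)["chr21"])      # share of all tags
    print(sequence_dose(counts, "chr21", ["chr9"]))  # relative to chr9 only
```

Because normalized coverage divides by the genome-wide total, it is comparable across samples sequenced to different depths, whereas the dose depends on which normalizing chromosomes or segments are chosen.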
- the term “parameter” herein refers to a numerical value that characterizes a property of a system. Frequently, a parameter numerically characterizes a quantitative data set and/or a numerical relationship between quantitative data sets. For example, a ratio (or function of a ratio) between the number of sequence tags mapped to a chromosome and the length of the chromosome to which the tags are mapped, is a parameter.
- threshold value and “qualified threshold value” herein refer to any number that is used as a cutoff to characterize a sample such as a test sample containing a nucleic acid from an organism suspected of having a medical condition.
- the threshold may be compared to a parameter value to determine whether a sample giving rise to such parameter value suggests that the organism has the medical condition.
- a qualified threshold value is calculated using a qualifying data set and serves as a limit of diagnosis of a copy number variation, e.g., an aneuploidy, in an organism. If a threshold is exceeded by results obtained from methods disclosed herein, a subject can be diagnosed with a copy number variation, e.g., trisomy 21.
- Appropriate threshold values for the methods described herein can be identified by analyzing normalized values (e.g. chromosome doses, NCVs or NSVs) calculated for a training set of samples. Threshold values can be identified using qualified (i.e., unaffected) samples in a training set which comprises both qualified (i.e., unaffected) samples and affected samples. The samples in the training set known to have chromosomal aneuploidies (i.e., the affected samples) can be used to confirm that the chosen thresholds are useful in differentiating affected from unaffected samples in a test set (see the Examples herein). The choice of a threshold is dependent on the level of confidence that the user wishes to have to make the classification.
- bin refers to a segment of a sequence or a segment of a genome.
- bins are contiguous with one another within the genome or chromosome.
- Each bin may define a sequence of nucleotides in a reference genome. Sizes of the bin may be 1 kb, 100 kb, 1 Mb, etc., depending on the analysis required by particular applications and sequence tag density.
- bins may have other characteristics such as sample coverage and sequence structure characteristics such as G-C fraction.
- the term “masking threshold” is used herein to refer to a quantity against which a value based on the number of sequence tags in a sequence bin is compared, wherein a bin having a value exceeding the masking threshold is masked.
- the masking threshold can be a percentile rank, an absolute count, a mapping quality score, or other suitable values.
- a masking threshold may be defined as the percentile rank of a coefficient of variation across multiple unaffected samples.
- a masking threshold may be defined as a mapping quality score, e.g., a MapQ score, which relates to the reliability of aligning sequence reads to a reference genome.
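One way to picture a masking threshold defined as a percentile rank of the coefficient of variation is the sketch below. It assumes a nearest-rank percentile and a population standard deviation, neither of which is specified in the text; the bin names and coverage values are hypothetical.

```python
import math
import statistics


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    mean = statistics.fmean(values)
    return statistics.pstdev(values) / mean if mean else float("inf")


def mask_bins(coverage_by_bin, percentile=95.0):
    """Return the bins whose CV across unaffected samples exceeds the cutoff.

    coverage_by_bin maps a bin id to its coverage in each unaffected sample.
    The cutoff is the nearest-rank `percentile` of the per-bin CVs (assumed).
    """
    cvs = {b: coefficient_of_variation(v) for b, v in coverage_by_bin.items()}
    ranked = sorted(cvs.values())
    rank = max(1, math.ceil(percentile / 100.0 * len(ranked)))
    cutoff = ranked[rank - 1]
    return {b for b, cv in cvs.items() if cv > cutoff}


if __name__ == "__main__":
    coverage = {
        "bin1": [10, 10, 10],
        "bin2": [10, 11, 9],
        "bin3": [1, 20, 5],   # highly variable bin
        "bin4": [10, 10, 11],
    }
    print(mask_bins(coverage, percentile=75.0))
```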
- a masking threshold value is different from a copy number variation (CNV) threshold value, the latter being a cutoff to characterize a sample containing a nucleic acid from an organism suspected of having a medical condition related to CNV.
- a CNV threshold value is defined relative to a normalized chromosome value (NCV) or a normalized segment value (NSV) described elsewhere herein.
- normalized value refers to a numerical value that relates the number of sequence tags identified for the sequence (e.g. chromosome or chromosome segment) of interest to the number of sequence tags identified for a normalizing sequence (e.g. normalizing chromosome or normalizing chromosome segment).
- a “normalized value” can be a chromosome dose as described elsewhere herein, or it can be an NCV, or it can be an NSV as described elsewhere herein.
- the term “read” refers to a sequence obtained from a portion of a nucleic acid sample. Typically, though not necessarily, a read represents a short sequence of contiguous base pairs in the sample. The read may be represented symbolically by the base pair sequence (in A, T, C, or G) of the sample portion. It may be stored in a memory device and processed as appropriate to determine whether it matches a reference sequence or meets other criteria. A read may be obtained directly from a sequencing apparatus or indirectly from stored sequence information concerning the sample.
- a read is a DNA sequence of sufficient length (e.g., at least about 25 bp) that can be used to identify a larger sequence or region, e.g., that can be aligned and specifically assigned to a chromosome or genomic region or gene.
- sequence tag is herein used interchangeably with the term “mapped sequence tag” to refer to a sequence read that has been specifically assigned, i.e., mapped, to a larger sequence, e.g., a reference genome, by alignment.
- Mapped sequence tags are uniquely mapped to a reference genome, i.e., they are assigned to a single location in the reference genome. Unless otherwise specified, tags that map to the same sequence on a reference sequence are counted once. Tags may be provided as data structures or other assemblages of data.
- a tag contains a read sequence and associated information for that read such as the location of the sequence in the genome, e.g., the position on a chromosome.
- the location is specified for a positive strand orientation.
- a tag may be defined to allow a limited amount of mismatch in aligning to a reference genome.
- tags that can be mapped to more than one location on a reference genome, i.e., tags that do not map uniquely, may not be included in the analysis.
- non-redundant sequence tag refers to sequence tags that do not map to the same site, which is counted for the purpose of determining normalized chromosome values (NCVs) in some embodiments. Sometimes multiple sequence reads are aligned to the same locations on a reference genome, yielding redundant or duplicated sequence tags. In some embodiments, duplicate sequence tags that map to the same position are omitted or counted as one “non-redundant sequence tag” for the purpose of determining NCVs. In some embodiments, non-redundant sequence tags aligned to non-excluded sites are counted to yield “non-excluded-site counts” (NES counts) for determining NCVs.
- NES counts “non-excluded-site counts”
- a site refers to a unique position (i.e. chromosome ID, chromosome position and orientation) on a reference genome.
- a site may provide a position for a residue, a sequence tag, or a segment on a sequence.
- excluded sites are sites found in regions of a reference genome that have been excluded for the purpose of counting sequence tags. In some embodiments, excluded sites are found in regions of chromosomes that contain repetitive sequences, e.g., centromeres and telomeres, and regions of chromosomes that are common to more than one chromosome, e.g., regions present on the Y-chromosome that are also present on the X chromosome.
- Non-excluded sites are sites that are not excluded in a reference genome for the purpose of counting sequence tags.
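The counting of non-redundant sequence tags at non-excluded sites can be sketched as below. A site is modeled as a (chromosome, position, strand) tuple following the definition above; the half-open interval representation of excluded regions and all example coordinates are assumptions made for illustration.

```python
from collections import Counter, namedtuple

# A site: chromosome ID, chromosome position, and orientation.
Site = namedtuple("Site", ["chrom", "pos", "strand"])


def nes_counts(tag_sites, excluded_regions):
    """Count non-redundant tags per chromosome, skipping excluded sites.

    tag_sites: iterable of Site, one per mapped tag (duplicates allowed).
    excluded_regions: chromosome -> list of (start, end) half-open intervals.
    """
    def is_excluded(site):
        regions = excluded_regions.get(site.chrom, [])
        return any(start <= site.pos < end for start, end in regions)

    unique_sites = set(tag_sites)  # duplicate tags at one site count once
    counts = Counter(s.chrom for s in unique_sites if not is_excluded(s))
    return dict(counts)


if __name__ == "__main__":
    tags = [
        Site("chr21", 100, "+"),
        Site("chr21", 100, "+"),  # redundant duplicate
        Site("chr21", 100, "-"),  # same position, other strand: distinct site
        Site("chr21", 500, "+"),  # falls in an excluded region below
        Site("chrY", 50, "+"),
    ]
    print(nes_counts(tags, {"chr21": [(400, 600)]}))
```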
- NCV Normalized chromosome value
- NCV relates coverage of a test sample to coverages of a set of training/qualified samples.
- NCV is based on chromosome dose.
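- NCV relates to the difference between the chromosome dose of a chromosome of interest in a test sample and the mean of the corresponding chromosome dose in a set of qualified samples, and can be calculated as:
- $$\mathrm{NCV}_{ij} = \frac{x_{ij} - \hat{\mu}_j}{\hat{\sigma}_j}$$
- where $\hat{\mu}_j$ and $\hat{\sigma}_j$ are the estimated mean and standard deviation, respectively, for the j-th chromosome dose in a set of qualified samples, and $x_{ij}$ is the observed j-th chromosome ratio (dose) for test sample i.
- NCV can be calculated “on the fly” by relating the chromosome dose of a chromosome of interest in a test sample to the median of the corresponding chromosome dose in multiplexed samples sequenced on the same flow cells as:
- $$\mathrm{NCV}_{ij} = \frac{x_{ij} - M_j}{\hat{\sigma}_j}$$
- where $M_j$ is the estimated median for the j-th chromosome dose in a set of multiplexed samples sequenced on the same flow cell, $\hat{\sigma}_j$ is the standard deviation for the j-th chromosome dose in one or more sets of multiplexed samples sequenced on one or more flow cells, and $x_{ij}$ is the observed j-th chromosome dose for test sample i. Test sample i is one of the multiplexed samples sequenced on the same flow cell from which $M_j$ is determined.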
- test sample A which is sequenced as one of 64 multiplexed samples on one flow cell
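A minimal sketch of the two NCV calculations described above: one centered on the mean chromosome dose of qualified samples, the other computed “on the fly” around the median dose of samples multiplexed on the same flow cell. The use of the sample standard deviation, the function names, and the dose values are assumptions for illustration only.

```python
import statistics


def ncv(test_dose, qualified_doses):
    """(dose - mean) / standard deviation, both estimated from qualified samples."""
    mu = statistics.fmean(qualified_doses)
    sigma = statistics.stdev(qualified_doses)  # sample SD (assumed estimator)
    return (test_dose - mu) / sigma


def ncv_on_the_fly(test_dose, flow_cell_doses, sigma):
    """(dose - median of same-flow-cell doses) / a supplied standard deviation."""
    return (test_dose - statistics.median(flow_cell_doses)) / sigma


if __name__ == "__main__":
    qualified = [0.9, 1.0, 1.1]  # hypothetical chromosome doses
    print(ncv(1.3, qualified))
    print(ncv_on_the_fly(1.2, qualified, sigma=0.1))
```

Centering on the flow-cell median makes the second form less sensitive to a few affected samples in the same run, which is presumably why a median rather than a mean is used there.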
- the terms “aligned,” “alignment,” or “aligning” refer to the process of comparing a read or tag to a reference sequence and thereby determining whether the reference sequence contains the read sequence. If the reference sequence contains the read, the read may be mapped to the reference sequence or, in certain embodiments, to a particular location in the reference sequence. In some cases, alignment simply tells whether or not a read is a member of a particular reference sequence (i.e., whether the read is present or absent in the reference sequence). For example, the alignment of a read to the reference sequence for human chromosome 13 will tell whether the read is present in the reference sequence for chromosome 13. A tool that provides this information may be called a set membership tester.
- an alignment additionally indicates a location in the reference sequence where the read or tag maps to. For example, if the reference sequence is the whole human genome sequence, an alignment may indicate that a read is present on chromosome 13, and may further indicate that the read is on a particular strand and/or site of chromosome 13.
- Aligned reads or tags are one or more sequences that are identified as a match in terms of the order of their nucleic acid molecules to a known sequence from a reference genome. Alignment can be done manually, although it is typically implemented by a computer algorithm, as it would be impossible to align reads in a reasonable time period for implementing the methods disclosed herein.
- One example of an algorithm for aligning sequences is the Efficient Local Alignment of Nucleotide Data (ELAND) computer program distributed as part of the Illumina Genomics Analysis pipeline.
- ELAND Efficient Local Alignment of Nucleotide Data
- a Bloom filter or similar set membership tester may be employed to align reads to reference genomes. See US Patent Application No. 61/552,374 filed October 27, 2011 which is incorporated herein by reference in its entirety.
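To make the idea of a set membership tester concrete, here is a toy Bloom filter that indexes every read-length substring of a reference and then answers whether a read may be present. It is a generic illustration and does not reproduce the method of the cited application; the class, the SHA-256 based hashing scheme, and the parameter defaults are all assumptions.

```python
import hashlib


class BloomFilter:
    """Probabilistic set: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions by salting one hash with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


def index_reference(reference, read_length):
    """Insert every substring of length read_length into a Bloom filter."""
    bloom = BloomFilter()
    for i in range(len(reference) - read_length + 1):
        bloom.add(reference[i:i + read_length])
    return bloom


if __name__ == "__main__":
    bloom = index_reference("ACGTACGTTAGC", read_length=4)
    print("ACGT" in bloom)  # a read taken from the reference is always found
```

A membership test of this kind reports only presence or absence, matching the set membership sense of alignment described above; locating the read within the reference would need an additional index.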
- the matching of a sequence read in aligning can be a 100% sequence match or less than 100% (non-perfect match).
- mapping refers to specifically assigning a sequence read to a larger sequence, e.g., a reference genome, by alignment.
- the term “reference genome” or “reference sequence” refers to any particular known genome sequence, whether partial or complete, of any organism or virus which may be used to reference identified sequences from a subject.
- a reference genome used for human subjects as well as many other organisms is found at the National Center for Biotechnology Information at ncbi.nlm.nih.gov.
- A “genome” refers to the complete genetic information of an organism or virus, expressed in nucleic acid sequences.
- the reference sequence is significantly larger than the reads that are aligned to it.
- it may be at least about 100 times larger, or at least about 1000 times larger, or at least about 10,000 times larger, or at least about 10^5 times larger, or at least about 10^6 times larger, or at least about 10^7 times larger.
- the reference sequence is that of a full length human genome. Such sequences may be referred to as genomic reference sequences. In another example, the reference sequence is limited to a specific human chromosome such as chromosome 13. In some embodiments, a reference Y chromosome is the Y chromosome sequence from human genome version hg19. Such sequences may be referred to as chromosome reference sequences. Other examples of reference sequences include genomes of other species, as well as chromosomes, sub-chromosomal regions (such as strands), etc., of any species.
- the reference sequence is a consensus sequence or other combination derived from multiple individuals. However, in certain applications, the reference sequence may be taken from a particular individual.
- clinically-relevant sequence refers to a nucleic acid sequence that is known or is suspected to be associated or implicated with a genetic or disease condition. Determining the absence or presence of a clinically-relevant sequence can be useful in determining a diagnosis or confirming a diagnosis of a medical condition, or providing a prognosis for the development of a disease.
- derived, when used in the context of a nucleic acid or a mixture of nucleic acids, herein refers to the means whereby the nucleic acid(s) are obtained from the source from which they originate.
- a mixture of nucleic acids that is derived from two different genomes means that the nucleic acids, e.g., cfDNA, were naturally released by cells through naturally occurring processes such as necrosis or apoptosis.
- a mixture of nucleic acids that is derived from two different genomes means that the nucleic acids were extracted from two different types of cells from a subject.
- patient sample refers to a biological sample obtained from a patient, i.e., a recipient of medical attention, care or treatment.
- the patient sample can be any of the samples described herein.
- the patient sample is obtained by non-invasive procedures, e.g., peripheral blood sample or a stool sample.
- the methods described herein need not be limited to humans.
- the patient sample may be a sample from a non-human mammal (e.g., a feline, a porcine, an equine, a bovine, and the like).
- mixture sample refers to a sample containing a mixture of nucleic acids, which are derived from different genomes.
- maternal sample refers to a biological sample obtained from a pregnant subject, e.g., a woman.
- biological fluid refers to a liquid taken from a biological source and includes, for example, blood, serum, plasma, sputum, lavage fluid, cerebrospinal fluid, urine, semen, sweat, tears, saliva, and the like.
- As used herein, the terms “blood,” “plasma” and “serum” expressly encompass fractions or processed portions thereof.
- sample is taken from a biopsy, swab, smear, etc.
- the “sample” expressly encompasses a processed fraction or portion derived from the biopsy, swab, smear, etc.
- the term “corresponding to” sometimes refers to a nucleic acid sequence, e.g., a gene or a chromosome, that is present in the genome of different subjects, and which does not necessarily have the same sequence in all genomes, but serves to provide the identity rather than the genetic information of a sequence of interest, e.g., a gene or chromosome.
- fetal fraction refers to the fraction of fetal nucleic acids present in a sample comprising fetal and maternal nucleic acid. Fetal fraction is often used to characterize the cfDNA in a mother’s blood.
- chromosome refers to the heredity-bearing gene carrier of a living cell, which is derived from chromatin strands comprising DNA and protein components (especially histones).
- the conventional internationally recognized individual human genome chromosome numbering system is employed herein.
- subject refers to a human subject as well as a non-human subject such as a mammal, an invertebrate, a vertebrate, a fungus, a yeast, a bacterium, and a virus.
- although the examples herein concern humans and the language is primarily directed to human concerns, the concepts disclosed herein are applicable to genomes from any plant or animal, and are useful in the fields of veterinary medicine, animal sciences, research laboratories and such.
- condition herein refers to “medical condition” as a broad term that includes all diseases and disorders, but can include injuries and normal health situations, such as pregnancy, that might affect a person’s health, benefit from medical assistance, or have implications for medical treatments.
- partial when used in reference to a chromosomal aneuploidy herein refers to a gain or loss of a portion, i.e., segment, of a chromosome.
- mosaic herein refers to the presence of two populations of cells with different karyotypes in one individual who has developed from a single fertilized egg. Mosaicism may result from a mutation during development which is propagated to only a subset of the adult cells.
- the term “sensitivity” as used herein refers to the probability that a test result will be positive when the condition of interest is present. It may be calculated as the number of true positives divided by the sum of true positives and false negatives.
- the term “specificity” as used herein refers to the probability that a test result will be negative when the condition of interest is absent. It may be calculated as the number of true negatives divided by the sum of true negatives and false positives.
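The two definitions above can be written out as a short sketch. The function names and the example counts are illustrative only and do not come from the disclosure.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Probability of a positive result when the condition is present."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Probability of a negative result when the condition is absent."""
    return true_neg / (true_neg + false_pos)


# Made-up counts: 95 of 100 affected samples detected,
# 990 of 1000 unaffected samples correctly called negative.
print(sensitivity(95, 5))
print(specificity(990, 10))
```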
- the term “enrich” herein refers to the process of amplifying polymorphic target nucleic acids contained in a portion of a maternal sample, and combining the amplified product with the remainder of the maternal sample from which the portion was removed.
- the remainder of the maternal sample can be the original maternal sample.
- the term “original maternal sample” herein refers to a non-enriched biological sample obtained from a pregnant subject, e.g., a woman, who serves as the source from which a portion is removed to amplify polymorphic target nucleic acids.
- The “original sample” can be any sample obtained from a pregnant subject, and the processed fractions thereof, e.g., a purified cfDNA sample extracted from a maternal plasma sample.
- the term “primer,” as used herein refers to an isolated oligonucleotide that is capable of acting as a point of initiation of synthesis when placed under conditions inductive to synthesis of an extension product (e.g., the conditions include nucleotides, an inducing agent such as DNA polymerase, and a suitable temperature and pH).
- the primer is preferably single stranded for maximum efficiency in amplification, but may alternatively be double stranded. If double stranded, the primer is first treated to separate its strands before being used to prepare extension products.
- the primer is an oligodeoxyribonucleotide.
- the primer must be sufficiently long to prime the synthesis of extension products in the presence of the inducing agent. The exact lengths of the primers will depend on many factors, including temperature, source of primer, use of the method, and the parameters used for primer design.
- CNVs in the human genome significantly influence human diversity and predisposition to diseases (Redon et al., Nature 444:444-454 [2006], Shaikh et al. Genome Res 19:1682-1690 [2009]).
- diseases include, but are not limited to cancer, infectious and autoimmune diseases, diseases of the nervous system, metabolic and/or cardiovascular diseases, and the like.
- CNVs have been known to contribute to genetic disease through different mechanisms, resulting in either imbalance of gene dosage or gene disruption in most cases. In addition to their direct correlation with genetic disorders, CNVs are known to mediate phenotypic changes that can be deleterious.
- CNVs arise from genomic rearrangements, primarily owing to deletion, duplication, insertion, and unbalanced translocation events.
- NIPT Non-invasive prenatal testing
- Current methodologies involve sequencing maternal samples using short reads (25bp-36bp), aligning to the genome, computing and normalizing sub-chromosomal coverage, and finally evaluating over-representation of target chromosomes (13 / 18 / 21 / X / Y) compared to the expected normalized coverage associated with a normal diploid genome.
- traditional NIPT assay and analysis rely on the counts or coverage to evaluate the likelihood of fetal aneuploidy.
- Since maternal plasma samples represent a mixture of maternal and fetal cfDNA, the success of any given NIPT method depends on its sensitivity to detect copy number changes in low fetal fraction samples. For counting based methods, sensitivity is determined by (a) sequencing depth and (b) the ability of data normalization to reduce technical variance.
- This disclosure provides analytical methodology for NIPT and other applications by deriving fragment size information from, e.g., paired-end reads, and using this information in an analysis pipeline. Improved analytical sensitivity provides the ability to apply NIPT methods at reduced coverage (e.g., reduced sequencing depth) which enables the use of the technology for lower-cost testing of average risk pregnancies.
- Data illustrates various examples of situations where determined values of fetal fraction are inaccurate possibly for a variety of reasons.
- the data shows certain trends in fetal fraction error such as increasing fetal fraction error with increasing call error (which may result from reduced coverage) and increasing fetal fraction error with decreasing magnitude of the true fetal fraction value. Due to these problems with fetal fraction as a mode for determining which samples to exclude, another technique for determining which samples to exclude should be used.
- Fetal fraction affects the determination of fetal cell-free DNA abundance and copy number variations in NIPT.
- as fetal fraction decreases in a sample, it becomes more difficult to accurately determine the relative coverage of fetal cell-free DNA because the signal decreases and the relative noise increases.
- Coverage or sequencing depth also has a similar effect on CNV detection. Because the two factors jointly affect CNV detection, the minimal fetal fraction needed to ensure detection changes as a function of coverage, having a hyperbolic shape as shown in Figure 1.
- This disclosure provides methods and systems of quality control (QC) in CNV detection processes such as those shown in Figures 4A, 4B, 15, and 16A.
- the QC can be performed either before or after making a CNV call to identify samples that have fetal fraction levels too low to yield reliable results.
- the identified samples can be rerun to obtain new reads. If sequencing depth increases in the rerun, coverage increases, which improves the signal or reduces the noise of the sample.
- the disclosed implementations when applied to different sample populations can take into consideration variations between samples and populations. And they can more effectively identify samples with low fetal fraction and/or low read coverage.
- FIG. 4A shows workflow 200 that uses a limit of detection (LOD) QC method for determining CNV according to some implementations.
- LOD limit of detection
- the quality check occurs at box 212 to check whether the test sample is within an exclusion region that is defined by at least a fetal fraction LOD curve.
- the exclusion region is defined by the LOD curve and a coverage threshold or a fetal fraction threshold.
- This step and its downstream steps can be applied to other CNV detection processes described hereinafter.
- This illustrative workflow involves sequencing a test sample including maternal and fetal cell-free nucleic acid fragments to obtain sequence reads. See box 202.
- Various techniques may be used to obtain the sequence reads, including but not limited to the sequencing techniques described hereinafter.
- the fetal fraction value is determined based on coverage information for the bins of the reference genome, where the reference genome is divided into segments or bins.
- the value of fetal fraction is calculated by applying coverage values of a plurality of bins of the reference genome to a model relating fetal fraction to coverage of bin to obtain the fetal fraction value.
- bins that have over-represented fetal cell-free nucleic acid fragments aligned to them are selected from training samples. The test reads aligned to the bins are used to determine the fetal fraction.
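As one hedged illustration of applying bin coverage values to a model relating fetal fraction to bin coverage, the sketch below uses a simple linear model. The linear form, the weights, the intercept, and the function name are all assumptions made for illustration; the disclosure does not specify the model here, and in practice the weights would be learned from training samples.

```python
def estimate_fetal_fraction(bin_coverages, weights, intercept=0.0):
    """Apply a hypothetical linear model: FF = intercept + sum(w_b * coverage_b).

    bin_coverages: coverage values for the selected bins of the test sample.
    weights: one hypothetical trained coefficient per selected bin.
    """
    if len(bin_coverages) != len(weights):
        raise ValueError("need exactly one weight per bin")
    return intercept + sum(w * c for w, c in zip(weights, bin_coverages))


# Made-up normalized coverages and coefficients for two bins.
ff = estimate_fetal_fraction([1.0, 2.0], [0.01, 0.02], intercept=0.01)
print(f"estimated fetal fraction: {ff:.3f}")
```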
- Process 200 proceeds to determine a coverage of the sequence tags for a sequence of interest. See 210. Coverages are determined per sample. In various implementations, coverages are determined over all chromosomes (entire reference genome), over a subset of chromosomes, or at the sub-chromosomal level. Various techniques may be used to determine the coverage, including but not limited to those shown hereinafter. In some implementations, the coverage is determined by: dividing the reference genome into a plurality of bins, determining a number of sequence tags aligning to each bin, and determining the coverage of the sequence tags using the numbers of sequence tags in bins in the sequence of interest. In some implementations, the method further includes adjusting the number of sequence tags aligning to the bins by accounting for bin-to-bin variations due to factors other than copy number variations. More detailed descriptions of methods for determining coverage are provided hereinafter.
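The binning procedure described above (divide the reference into bins, count the sequence tags aligning to each bin, then combine the counts for the sequence of interest) can be sketched as follows. The input format, a list of (chromosome, position) pairs, and the function names are assumptions; adjustment for bin-to-bin variation is omitted.

```python
from collections import Counter


def bin_counts(tag_positions, bin_size):
    """Count sequence tags per (chromosome, bin index) for fixed-size bins."""
    counts = Counter()
    for chrom, pos in tag_positions:
        counts[(chrom, pos // bin_size)] += 1
    return counts


def coverage_of_interest(counts, chrom_of_interest):
    """Sum the per-bin tag counts over all bins of one chromosome."""
    return sum(n for (chrom, _), n in counts.items() if chrom == chrom_of_interest)


# Made-up tags: three on chr21, one on chr1.
tags = [("chr21", 150), ("chr21", 100_250), ("chr21", 100_900), ("chr1", 42)]
counts = bin_counts(tags, bin_size=100_000)
print(coverage_of_interest(counts, "chr21"))
```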
- the next step in process 200 is a QC step. It involves determining whether the test sample is within an exclusion region based on the coverage of sequence tags determined in 208 and the fetal fraction determined in 204.
- the exclusion region is defined by at least a fetal fraction LOD curve.
- the fetal fraction LOD curve varies with coverage values and indicates minimum values of fetal fractions needed to achieve a detection criterion given different coverages. See Figure 14 for examples of the LOD curve.
- An LOD is the minimal level of signals (analytes, fetal fraction, scores indicating conditions, etc.) that can be detected with a defined confidence.
- an LOD is the minimal level of fetal fraction (or other analytes) required to detect an aneuploidy/CNV with a defined confidence.
- the detection criterion is X% confidence that, for the observed fetal fraction, the ground truth fetal fraction is larger than the Y% LOD (LOD Y).
- the ground truth fetal fraction is the actual fetal fraction underlying the inferred fetal fraction.
- the specified LOD is determined as a smallest observed fetal fraction in which Y percent of the affected samples can be detected.
- the detection criterion for an observed fetal fraction and observed coverage is obtained using a distribution of true fetal fractions (or simulated fetal fraction) of the observed fetal fraction at the observed coverage.
- Figure 5 shows empirical and hypothetical data illustrating the statistical concepts underlying LOD. The CNV detection is based on a log likelihood ratio indicating the likelihood that a sample harbors CNV. The top panel shows LLR on the Y axis and fetal fraction on the X axis. At each fetal fraction level, multiple samples are measured and the mean and standard deviation can be obtained.
- the top panel shows that a cutoff value (labeled 502A) is applied to call the CNV. As the fetal fraction value increases, the LLR scores also increase, moving further away from the cutoff, allowing more samples to be detected.
- 2.3% is the lowest fetal fraction at which an aneuploidy/CNV can be detected with 95% confidence (one-tailed).
- the observed LLR for affected samples at fetal fraction 2.3% is shown as 504A in the top panel and 504B in the bottom panel.
- the underlying population distribution of the observed data is shown in the bottom panel as 506.
- the cutoff LLR is shown as 502A in the top panel, and 502B in the bottom panel.
- the bottom panel illustrates that at fetal fraction 2.3%, 5% of the underlying population are below the LLR cutoff, and 95% of the population are above the LLR threshold 502B. As such, 95% of the samples in the population are detected as having CNV.
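The relationship between the LLR cutoff, the LLR distribution at a given fetal fraction, and the LOD can be sketched as below. The sketch assumes the LLR scores of affected samples at each fetal fraction level are normally distributed; that assumption, the function names, and the example means and standard deviations are illustrative and are not taken from Figure 5.

```python
from statistics import NormalDist


def detection_rate(mean_llr, sd_llr, cutoff):
    """Fraction of affected samples whose LLR exceeds the cutoff (normal model)."""
    return 1.0 - NormalDist(mean_llr, sd_llr).cdf(cutoff)


def limit_of_detection(levels, cutoff, target=0.95):
    """Smallest fetal fraction whose detection rate reaches the target.

    levels: iterable of (fetal_fraction, mean_llr, sd_llr) tuples.
    Returns None if no level reaches the target.
    """
    for ff, mean_llr, sd_llr in sorted(levels):
        if detection_rate(mean_llr, sd_llr, cutoff) >= target:
            return ff
    return None


# Made-up LLR summaries at three fetal fraction levels.
levels = [(0.010, 0.5, 1.0), (0.023, 3.0, 1.0), (0.040, 6.0, 1.0)]
print(limit_of_detection(levels, cutoff=1.0))
```

With these made-up inputs the middle level is the first one at which at least 95% of the modeled population lies above the cutoff.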
- if the test sample is not within the exclusion region defined by the fetal fraction LOD curve, the process proceeds to evaluate coverage of the sequence of interest to determine CNV. See the “NO” branch from box 212 to box 214. And that ends the process in this path. If it is determined that the test sample is within the LOD exclusion region and that the sample should be re-sequenced (the “YES” branch of box 212 and the “YES” branch of box 216), the sample is re-sequenced and the process repeats from 202 through 212. If it is determined at box 216 that the sample does not need to be re-sequenced, such as when it was already re-sequenced, the process ends.
- the QC methods disclosed can help to increase both sensitivity and specificity. If one wishes to only improve sensitivity, one may first determine that the test sample is negative or positive. Only samples that are negatively called go through the QC process as shown in Figure 4B.
- Figure 4B shows a LOD QC process for CNV detection that is similar to the process in Figure 4A, except that a call is first performed before determining whether the sample falls in the exclusion region. See block 211. If the call is negative, the process then determines whether the sample is within the exclusion region defined by an LOD curve. See block 212. If not, the process ends. See the “NO” branch of box 216. If the test sample is within the exclusion region (“YES” branch of box 212) and it is determined that re-sequencing is needed, then the sample is re-sequenced and the process repeats. If it is determined that the sample does not need to be re-sequenced, such as in the case that it has already been re-sequenced once, then the process ends. See the “NO” branch of box 216.
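The branching in the two workflows can be summarized in a small decision function. This is a sketch of the control flow as described in the text only; the function name, the argument names, and the returned labels are hypothetical, and the handling of positive calls in the Figure 4B variant (QC is skipped and the call stands) is one reading of the description.

```python
def qc_next_step(in_exclusion_region, already_resequenced,
                 call_first=False, call_is_negative=True):
    """Return the next action: 'determine_cnv', 'resequence', or 'end'.

    call_first=False follows the Figure 4A flow (QC before the CNV call).
    call_first=True follows the Figure 4B flow (call first, QC negatives only).
    """
    if call_first and not call_is_negative:
        return "end"  # positive call: QC is not applied in this variant
    if not in_exclusion_region:
        # Figure 4A goes on to the CNV call; Figure 4B already made it.
        return "end" if call_first else "determine_cnv"
    return "end" if already_resequenced else "resequence"


print(qc_next_step(in_exclusion_region=True, already_resequenced=False))
```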
- Figure 8 shows fetal fraction error due to coverage and due to fetal fraction values.
- the left panel shows that the standard deviation of fetal fraction estimates (a measure related to error) decreases as coverage increases.
- the right panel shows that the standard deviation of fetal fraction estimates decreases as the true fetal fraction increases.
- Figure 9 simulates the true fetal fractions for eight levels of observed fetal fractions. It shows observed fetal fractions at 1% intervals between 0% and 8%. It also shows the ground truth distributions for those observed fetal fractions.
- the vertical dashed line 9002 shows an observed fetal fraction of 0%.
- Solid line 9004 is the distribution of the true fetal fraction for the observed fetal fraction at 0%.
- Dashed line 9006 indicates the observed fetal fraction of 8%.
- Solid line 9008 shows the distribution of true fetal fraction for the observed fetal fraction of 8%.
- the middle panel shows the distributions for fetal fractions with 1.5% error.
- the right panel shows the distributions for fetal fractions with 2% error.
- the difference between the observed fetal fraction and true fetal fraction is affected by coverage, such that larger error leads to larger difference between the observed fetal fraction and true fetal fraction.
- Figure 11 shows observed fetal fraction of 2% and its simulated true fetal fraction distributions given different errors or coverage.
- the observed fetal fraction is indicated by line 1102.
- the true fetal fraction with the smallest error 1% (or highest coverage) has a distribution 1112.
- Distribution 1114 is the true fetal fraction distribution with 1.5% error or intermediate coverage.
- Distribution 1116 is the true fetal fraction distribution with the highest 2% error or the lowest coverage.
- the 5th percentile or 95% confidence for distributions 1112, 1114, 1116 are marked by lines 1122, 1124, and 1126, respectively. They show that as error increases or coverage decreases, the true fetal fraction deviates further from the observed fetal fraction, and the 95% confidence level also increases.
- the 20th percentile (1128) and 25th percentile (1130) of distribution 1116 correspond to the 80% confidence level and the 75% confidence level, respectively. Similarly, other confidence levels such as 50% confidence may be determined.
- LOD95 is first empirically determined for a particular coverage (expressed as a fetal fraction). This LOD95 value is also the value for the desired, e.g., 20th percentile, true fetal fraction. With this value, one can determine the observed fetal fraction from the true vs observed functions (lines in the left panel of Figure 12).
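As a rough illustration of reading confidence bounds off a true fetal fraction distribution, the sketch below assumes the true fetal fraction is normally distributed around the observed value with a standard deviation equal to the fetal fraction error. That normal model is an assumption made here for illustration; it is not the simulation behind Figure 11, and the function name and example values are hypothetical.

```python
from statistics import NormalDist


def true_ff_percentile(observed_ff, ff_error, percentile):
    """Percentile of the modeled true fetal fraction (all values in percent).

    The p-th percentile is the value that the true fetal fraction exceeds
    with (100 - p)% confidence under the assumed normal model.
    """
    dist = NormalDist(mu=observed_ff, sigma=ff_error)
    return dist.inv_cdf(percentile / 100.0)


# Same observed fetal fraction, three made-up error levels.
for err in (1.0, 1.5, 2.0):
    p5 = true_ff_percentile(6.0, err, 5)
    p20 = true_ff_percentile(6.0, err, 20)
    print(f"error {err}%: 5th pct = {p5:.2f}%, 20th pct = {p20:.2f}%")
```

Under this model, a larger error pushes the lower percentiles further from the observed value, which mirrors the qualitative trend described for distributions 1112, 1114, and 1116.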
- the LOD95 for coverage 1 million is empirically determined to be 6.50% FF, which is also the 20th percentile of the true fetal fraction at 1202.
- the LOD95 for coverage 2 million is empirically determined to be 4.59% FF, which is also the 20th percentile of the true fetal fraction at 1206.
- Figure 13 tabulates the LOD, coverage, and observed fetal fraction.
- the table shows the effective read count in the first column from the left, LOD95 in the third column, and observed fetal fraction required to achieve 80% confidence that the true fetal fraction is above LOD95 in the fourth column.
- the observed fetal fraction required to have a 75% confidence is shown in the right column, which may be obtained from the data shown in the right panel of Figure 12.
- the second column shows FF errors for the different effective read counts, illustrating that as coverage increases, error decreases.
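The step of converting an LOD95 value into the observed fetal fraction required for a given confidence can be sketched under a simplifying assumption: the true fetal fraction is modeled as normal around the observed value with standard deviation equal to the FF error. Under that model, requiring the (100 - confidence)th percentile of the true fetal fraction to equal LOD95 gives a closed form. The normal model, the function name, and the 1.0% error value are assumptions; only the 6.50% LOD95 figure comes from the text, and the result is not expected to reproduce Figure 13.

```python
from statistics import NormalDist


def required_observed_ff(lod95, ff_error, confidence=0.80):
    """Observed FF (percent) such that P(true FF > lod95) equals confidence.

    Solves observed + z * ff_error = lod95, where z is the standard normal
    quantile at (1 - confidence); z is negative for confidence above 0.5.
    """
    z = NormalDist().inv_cdf(1.0 - confidence)
    return lod95 - z * ff_error


# LOD95 of 6.50% FF from the text; the 1.0% FF error is a made-up value.
for conf in (0.80, 0.75):
    needed = required_observed_ff(6.50, 1.0, conf)
    print(f"{conf:.0%} confidence: observed FF >= {needed:.2f}%")
```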
- Figure 14 includes two fetal fraction LOD curves that may be obtained from data similar to those shown in Figure 13.
- the LOD curve is used in conjunction with a coverage threshold, such as the 1 M coverage threshold shown in Figure 14.
- the exclusion region is under the fetal fraction LOD curve.
- the exclusion region is the area in which each point has a lower value in fetal fraction and/or coverage than the corresponding points on the fetal fraction LOD curve.
- the exclusion region is defined by the fetal fraction LOD curve and a coverage threshold. The exclusion region is under both the fetal fraction LOD curve and the coverage threshold.
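A membership test for the exclusion region can be sketched by representing the LOD curve as a table of (coverage, minimum fetal fraction) points and interpolating between them. The curve values below are made up, linear interpolation and flat extrapolation beyond the tabulated range are assumptions, and the optional coverage threshold follows one reading of the text (the region lies under both the curve and the threshold).

```python
from bisect import bisect_left

# Illustrative (coverage in reads, minimum fetal fraction) points; made-up values.
LOD_CURVE = [(1_000_000, 0.08), (2_000_000, 0.06), (4_000_000, 0.045)]


def lod_at(coverage, curve=LOD_CURVE):
    """Linearly interpolate the minimum fetal fraction at a given coverage."""
    xs = [c for c, _ in curve]
    if coverage <= xs[0]:
        return curve[0][1]   # flat extrapolation below the table (assumption)
    if coverage >= xs[-1]:
        return curve[-1][1]  # flat extrapolation above the table (assumption)
    i = bisect_left(xs, coverage)
    (x0, y0), (x1, y1) = curve[i - 1], curve[i]
    return y0 + (y1 - y0) * (coverage - x0) / (x1 - x0)


def in_exclusion_region(ff, coverage, curve=LOD_CURVE, coverage_threshold=None):
    """True if the sample lies under the LOD curve (and under the threshold, if given)."""
    below_curve = ff < lod_at(coverage, curve)
    if coverage_threshold is None:
        return below_curve
    return below_curve and coverage < coverage_threshold


print(in_exclusion_region(0.05, 1_500_000))
print(in_exclusion_region(0.09, 1_500_000))
```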
- the method of generating an LOD curve may be summarized as follows. An LOD curve was obtained using simulated “samples” where each “sample” includes the following.
- assessing a nucleic acid sample for CNV involves characterizing the status of a chromosomal or segment aneuploidy by one of three types of calls: “normal” or “unaffected,” “affected,” and “no-call.” Thresholds for calling normal and affected are typically set. A parameter related to aneuploidy or other copy number variation is measured in a sample and the measured value is compared to the thresholds. For duplication type aneuploidies, a call of affected is made if a chromosome or segment dose (or other measured value sequence content) is above a defined threshold set for affected samples.
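The three-way call for a duplication type aneuploidy can be sketched with two thresholds. The threshold names, the example values, and the rule that values between the two thresholds yield a no-call are assumptions made for illustration.

```python
def call_sample(value, normal_max, affected_min):
    """Classify a measured value (e.g., a chromosome dose score) for a duplication.

    value > affected_min  -> 'affected'
    value < normal_max    -> 'normal'
    otherwise             -> 'no-call' (the zone between the two thresholds)
    """
    if normal_max > affected_min:
        raise ValueError("normal_max must not exceed affected_min")
    if value > affected_min:
        return "affected"
    if value < normal_max:
        return "normal"
    return "no-call"


# Made-up thresholds on a standardized score.
for score in (1.2, 3.1, 5.7):
    print(score, call_sample(score, normal_max=2.5, affected_min=4.0))
```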
- the parameters that may be used to determine CNV include, but are not limited to, coverage, fragment size biased/weighted coverage, fraction or ratio of fragments in a defined size range, and methylation level of fragments.
- coverage is obtained from counts of reads aligned to a region of a reference genome and optionally normalized to produce sequence tag counts.
- sequence tag counts can be weighted by fragment size.
- a fragment size parameter is biased toward fragment sizes characteristic of one of the genomes.
- a fragment size parameter is a parameter that relates to the size of a fragment.
- a parameter is biased toward a fragment size when: 1) the parameter is favorably weighted for the fragment size, e.g., a count weighted more heavily for the size than for other sizes; or 2) the parameter is obtained from a value that is favorably weighted for the fragment size, e.g., a ratio obtained from a count weighted more heavily for the size.
- a size is characteristic of a genome when the genome has an enriched or higher concentration of nucleic acid of the size relative to another genome or another portion of the same genome.
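Two fragment size parameters of the kind described above can be sketched as follows: a count that weights fragments of a characteristic size more heavily, and the fraction of fragments within a defined size range. The 150 bp cutoff, the weight of 2.0, and the function names are hypothetical values chosen only to make the example concrete.

```python
def size_weighted_count(fragment_sizes, short_cutoff=150, short_weight=2.0):
    """Count fragments, weighting those shorter than the cutoff more heavily."""
    return sum(short_weight if size < short_cutoff else 1.0
               for size in fragment_sizes)


def fraction_in_range(fragment_sizes, low, high):
    """Fraction of fragments whose size lies in the closed range [low, high]."""
    if not fragment_sizes:
        return 0.0
    in_range = sum(1 for size in fragment_sizes if low <= size <= high)
    return in_range / len(fragment_sizes)


sizes = [120, 140, 170, 180]  # made-up fragment lengths in bp
print(size_weighted_count(sizes))
print(fraction_in_range(sizes, 100, 150))
```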
- the processed sequence coverage quantity is a sequence tag density ratio, which is the number of sequence tags standardized by sequence length.
- the processed sequence coverage quantity or the other parameter is a normalized sequence tag or another normalized parameter, which is the number of sequence tags or the other parameter of a sequence of interest divided by that of all or a substantial portion of the genome.
- the processed sequence coverage quantity or the other parameter such as a fragment size parameter is adjusted according to a global profile of the sequence of interest.
- the processed sequence coverage quantity or the other parameter is adjusted according to the within-sample correlation between the GC content and the sequence coverage for the sample being tested.
- the processed sequence coverage quantity or the other parameter results from combinations of these processes, which are further described elsewhere herein.
- a chromosome dose is calculated as the ratio of the processed sequence coverage or the other parameter for each of the chromosomes of interest and that for the normalizing chromosome sequence(s).
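The chromosome dose ratio can be sketched as below. Summing the coverages of a group of normalizing chromosomes to form the denominator is an assumption about how a group is combined, and the coverage values and function name are illustrative.

```python
def chromosome_dose(coverage, chrom_of_interest, normalizing_chroms):
    """Ratio of coverage for the chromosome of interest to that of the
    normalizing chromosome(s); a group is combined by summation (assumption)."""
    denominator = sum(coverage[c] for c in normalizing_chroms)
    if denominator == 0:
        raise ValueError("normalizing chromosomes have zero coverage")
    return coverage[chrom_of_interest] / denominator


# Made-up processed coverage values for one sample.
coverage = {"chr21": 150, "chr9": 300, "chr11": 200}
print(chromosome_dose(coverage, "chr21", ["chr9"]))
print(chromosome_dose(coverage, "chr21", ["chr9", "chr11"]))
```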
- the complete chromosomal aneuploidies are selected from complete chromosomal trisomies, complete chromosomal monosomies and complete chromosomal polysomies.
- the complete chromosomal aneuploidies are selected from complete aneuploidies of any one of chromosome 1-22, X, and Y.
- the said different complete fetal chromosomal aneuploidies are selected from trisomy 2, trisomy 8, trisomy 9, trisomy 20, trisomy 21, trisomy 13, trisomy 16, trisomy 18, trisomy 22, 47,XXX, 47,XYY, and monosomy X.
- steps (a)-(d) are repeated for test samples from different maternal subjects, and the method comprises determining the presence or absence of any two or more different complete fetal chromosomal aneuploidies in each of the test samples.
- $\hat{\mu}_j$ and $\hat{\sigma}_j$ are the estimated mean and standard deviation, respectively, for the j-th chromosome dose in a set of qualified samples, and $x_{ij}$ is the observed j-th chromosome dose for test sample i.
- NCV can be calculated “on the fly” by relating the chromosome dose of a chromosome of interest in a test sample to the median of the corresponding chromosome dose in multiplexed samples sequenced on the same flow cells as:
- test sample i is one of the multiplexed samples sequenced on the same flow cell from which $M_j$ is determined.
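The displayed equations for the normalized chromosome value (NCV) do not survive in this text. A standardized-score form consistent with the surrounding definitions is given below as an assumption, not as a quotation of the original equations. Here $x_{ij}$ is the observed j-th chromosome dose for test sample i, $\hat{\mu}_j$ and $\hat{\sigma}_j$ are the estimated mean and standard deviation of that dose in the qualified samples, and $M_j$ is the median dose across the multiplexed samples on the same flow cell. Using $\hat{\sigma}_j$ as the scale in the on-the-fly form is likewise an assumption.

```latex
% NCV relative to the qualified-sample mean (assumed form)
\mathrm{NCV}_{ij} = \frac{x_{ij} - \hat{\mu}_j}{\hat{\sigma}_j}

% "On the fly" NCV relative to the flow-cell median M_j (assumed form)
\mathrm{NCV}_{ij} = \frac{x_{ij} - M_j}{\hat{\sigma}_j}
```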
- a method for determining the presence or absence of different partial fetal chromosomal aneuploidies in a maternal test sample comprising fetal and maternal nucleic acids.
- the method involves procedures analogous to the method for detecting complete aneuploidy as outlined above. However, instead of analyzing a complete chromosome, a segment of a chromosome is analyzed. See US Patent Application Publication No. 2013/0029852, which is incorporated by reference.
- in operations 130 and 135, qualified sequence tag coverages (or values of another parameter) and test sequence tag coverages (or values of another parameter) are determined.
- the present disclosure provides processes to determine coverage quantities that provide improved sensitivity and selectivity relative to conventional methods. Operations 130 and 135 are marked by asterisks and emphasized by boxes of heavy lines to indicate these operations contribute to improvement over prior art.
- the sequence tag coverage quantities are normalized, adjusted, trimmed, and otherwise processed to improve the sensitivity and selectivity of the analysis. These processes are further described elsewhere herein.
- the method makes use of normalizing sequences of qualified training samples in determination of CNV of test samples.
- the qualified training samples are unaffected and have normal copy number.
- Normalizing sequences provide a mechanism to normalize measurements for intra-run and inter-run variabilities. Normalizing sequences are identified using sequence information from a set of qualified samples obtained from subjects known to comprise cells having a normal copy number for any one sequence of interest, e.g., a chromosome or segment thereof. Determination of normalizing sequences is outlined in steps 110, 120, 130, 145 and 146 of the embodiment of the method depicted in Figure 1. In some embodiments, the normalizing sequences are used to calculate sequence dose for test sequences. See step 150.
- normalizing sequences are also used to calculate a threshold against which the sequence dose of the test sequences is compared. See step 150.
- the sequence information obtained from the normalizing sequence and the test sequence is used for determining statistically meaningful identification of chromosomal aneuploidies in test samples (step 160).
- Figure 15 provides a flow diagram 100 of an embodiment for determining a CNV of a sequence of interest, e.g., a chromosome or segment thereof, in a biological sample.
- a biological sample is obtained from a subject and comprises a mixture of nucleic acids contributed by different genomes.
- the different genomes can be contributed to the sample by two individuals, e.g., the different genomes are contributed by the fetus and the mother carrying the fetus.
- one or more normalizing chromosomes or one or more normalizing chromosome segments are selected for each possible chromosome of interest.
- the normalizing chromosomes or segments are identified asynchronously from the normal testing of patient samples, which may take place in a clinical setting. In other words, the normalizing chromosomes or segments are identified prior to testing patient samples.
- the associations between normalizing chromosomes or segments and chromosomes or segments of interest are stored for use during testing. As explained below, such association is typically maintained over periods of time that span testing of many samples. The following discussion concerns embodiments for selecting normalizing chromosomes or chromosome segments for individual chromosomes or segments of interest.
- a set of qualified samples is obtained to identify qualified normalizing sequences and to provide variance values for use in determining statistically meaningful identification of CNV in test samples.
- a plurality of biological qualified samples are obtained from a plurality of subjects known to comprise cells having a normal copy number for any one sequence of interest.
- the qualified samples are obtained from mothers pregnant with a fetus that has been confirmed using cytogenetic means to have a normal copy number of chromosomes.
- the biological qualified samples may be a biological fluid, e.g., plasma, or any suitable sample as described below.
- a qualified sample contains a mixture of nucleic acid molecules, e.g., cfDNA molecules.
- the qualified sample is a maternal plasma sample that contains a mixture of fetal and maternal cfDNA molecules.
- Sequence information for normalizing chromosomes and/or segments thereof is obtained by sequencing at least a portion of the nucleic acids, e.g., fetal and maternal nucleic acids, using any known sequencing method.
- any one of the Next Generation Sequencing (NGS) methods described elsewhere herein is used to sequence the fetal and maternal nucleic acids as single or clonally amplified molecules.
- the qualified samples are processed as disclosed below prior to and during sequencing. They may be processed using apparatus, systems, and kits as disclosed herein.
- in step 120, at least a portion of each of all the qualified nucleic acids contained in the qualified samples is sequenced to generate millions of sequence reads, e.g., 36bp reads, which are aligned to a reference genome, e.g., hg18.
- the sequence reads comprise about 20bp, about 25bp, about 30bp, about 35bp, about 40bp, about 45bp, about 50bp, about 55bp, about 60bp, about 65bp, about 70bp, about 75bp, about 80bp, about 85bp, about 90bp, about 95bp, about 100bp, about 110bp, about 120bp, about 130bp, about 140bp, about 150bp, about 200bp, about 250bp, about 300bp, about 350bp, about 400bp, about 450bp, or about 500bp.
- the mapped sequence reads comprise 36bp. In another embodiment, the mapped sequence reads comprise 25bp.
- Sequence reads are aligned to a reference genome, and the reads that are uniquely mapped to the reference genome are known as sequence tags. Sequence tags falling on masked segments of a masked reference sequence are not counted for analysis of CNV.
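The definition of sequence tags (uniquely mapped reads, excluding those falling on masked segments) can be sketched as a filter. The `Alignment` record, its `n_hits` field, and the half-open interval format for masked segments are hypothetical constructs for this example and do not correspond to any particular aligner's output.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    chrom: str
    pos: int
    n_hits: int  # number of reference locations the read maps to


def sequence_tags(alignments, masked):
    """Keep uniquely mapped reads that do not fall on masked segments.

    masked: dict mapping chromosome -> list of (start, end) half-open intervals.
    """
    tags = []
    for aln in alignments:
        if aln.n_hits != 1:
            continue  # not uniquely mapped, so not a sequence tag
        if any(start <= aln.pos < end for start, end in masked.get(aln.chrom, [])):
            continue  # falls on a masked segment, so not counted
        tags.append(aln)
    return tags


reads = [
    Alignment("chr21", 1_000, 1),   # kept
    Alignment("chr21", 5_500, 1),   # dropped: inside masked segment
    Alignment("chr21", 9_000, 3),   # dropped: multi-mapped
]
kept = sequence_tags(reads, masked={"chr21": [(5_000, 6_000)]})
print(len(kept))
```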
- At least about 3 x 10^6 qualified sequence tags, at least about 5 x 10^6 qualified sequence tags, at least about 8 x 10^6 qualified sequence tags, at least about 10 x 10^6 qualified sequence tags, at least about 15 x 10^6 qualified sequence tags, at least about 20 x 10^6 qualified sequence tags, at least about 30 x 10^6 qualified sequence tags, at least about 40 x 10^6 qualified sequence tags, or at least about 50 x 10^6 qualified sequence tags comprising between 20 and 40bp reads are obtained from reads that map uniquely to a reference genome.
- in step 130, all the tags obtained from sequencing the nucleic acids in the qualified samples are counted to obtain a qualified sequence tag coverage.
- in step 135, all tags obtained from a test sample are counted to obtain a test sequence tag coverage.
- the present disclosure provides processes to determine coverage quantities that provide improved sensitivity and selectivity relative to conventional methods. Operations 130 and 135 are marked by asterisks and emphasized by boxes of heavy lines to indicate these operations contribute to improvement over prior art.
- the sequence tag coverage quantities are normalized, adjusted, trimmed, and otherwise processed to improve the sensitivity and selectivity of the analysis. These processes are further described elsewhere herein.
- the sequence tag coverage for a sequence of interest, e.g., a clinically-relevant sequence, is determined, as are the sequence tag coverages for additional sequences from which normalizing sequences are identified subsequently.
- the sequence of interest is a chromosome that is associated with a complete chromosomal aneuploidy, e.g., chromosome 21, and the qualified normalizing sequence is a complete chromosome that is not associated with a chromosomal aneuploidy and whose variation in sequence tag coverage approximates that of the sequence (i.e., chromosome) of interest, e.g., chromosome 21.
- the selected normalizing chromosome(s) may be the one or group that best approximates the variation in sequence tag coverage of the sequence of interest.
- Any one or more of chromosomes 1-22, X, and Y can be a sequence of interest, and one or more chromosomes can be identified as the normalizing sequence for each of the any one chromosomes 1-22, X and Y in the qualified samples.
- the normalizing chromosome can be an individual chromosome or it can be a group of chromosomes as described elsewhere herein.
- the sequence of interest is a segment of a chromosome associated with a partial aneuploidy, e.g., a chromosomal deletion or insertion, or unbalanced chromosomal translocation
- the normalizing sequence is a chromosomal segment (or group of segments) that is not associated with the partial aneuploidy and whose variation in sequence tag coverage approximates that of the chromosome segment associated with the partial aneuploidy.
- the selected normalizing chromosome segment(s) may be the one or more that best approximates the variation in sequence tag coverage of the sequence of interest. Any one or more segments of any one or more chromosomes 1-22, X, and Y can be a sequence of interest.
- the sequence of interest is a segment of a chromosome associated with a partial aneuploidy and the normalizing sequence is a whole chromosome or chromosomes.
- the sequence of interest is a whole chromosome associated with an aneuploidy and the normalizing sequence is a chromosomal segment or segments that are not associated with the aneuploidy.
- the qualified normalizing sequence may be chosen to have a variation in sequence tag coverage or a fragment size parameter that best or effectively approximates that of the sequence of interest as determined in the qualified samples.
- a qualified normalizing sequence is a sequence that produces the smallest variability across the qualified samples when used to normalize the sequence of interest, i.e., the variability of the normalizing sequence is closest to that of the sequence of interest determined in qualified samples.
- the qualified normalizing sequence is the sequence selected to produce the least variation in sequence dose (for the sequence of interest) across the qualified samples.
- the process selects a sequence that when used as a normalizing chromosome is expected to produce the smallest variability in run-to-run chromosome dose for the sequence of interest.
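The selection rule described above (choose the normalizing sequence that gives the smallest variability in dose across qualified samples) can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function names, the use of the sample standard deviation, and the tag counts are all assumptions made for the example.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """CV = standard deviation / mean (sample standard deviation assumed)."""
    return stdev(values) / mean(values)


def pick_normalizing_chromosome(coverage, interest, candidates):
    """Return (candidate, cv) for the candidate with the smallest dose CV.

    coverage: list of dicts, one per qualified sample, mapping a
    chromosome name to its sequence tag count.
    """
    best = None
    for cand in candidates:
        # Dose = tags for the sequence of interest / tags for the candidate.
        doses = [sample[interest] / sample[cand] for sample in coverage]
        cv = coefficient_of_variation(doses)
        if best is None or cv < best[1]:
            best = (cand, cv)
    return best


# Hypothetical tag counts for three qualified samples (illustrative only).
qualified = [
    {"chr21": 100, "chr14": 200, "chr9": 310},
    {"chr21": 110, "chr14": 220, "chr9": 300},
    {"chr21": 90, "chr14": 180, "chr9": 330},
]
best_chrom, best_cv = pick_normalizing_chromosome(
    qualified, "chr21", ["chr14", "chr9"]
)
```

In this toy data chr14 tracks chr21 exactly, so its dose is constant across samples and it is selected. A group of chromosomes could be scored the same way by summing their counts in the denominator.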
- Substantial alterations in these procedures will affect the number of tags that are mapped to all sequences, which in turn will determine which one or group of sequences will have a variability across samples in the same and/or in different sequencing runs, on the same day or on different days that most closely approximates that of the sequence(s) of interest, which would require that the set of normalizing sequences be re-determined.
- Substantial alterations in procedures include changes in the laboratory protocol used for preparing the sequencing library, which includes changes related to preparing samples for multiplex sequencing instead of singleplex sequencing, and changes in sequencing platforms, which include changes in the chemistry used for sequencing.
- the normalizing sequence chosen to normalize a particular sequence of interest is a sequence that best distinguishes one or more qualified samples from one or more affected samples, which implies that the normalizing sequence is a sequence that has the greatest differentiability, i.e., the differentiability of the normalizing sequence is such that it provides optimal differentiation to a sequence of interest in an affected test sample to easily distinguish the affected test sample from other unaffected samples.
- the normalizing sequence is a sequence that has a combination of the smallest variability and the greatest differentiability.
- the level of differentiability can be determined as a statistical difference between the sequence doses, e.g., chromosome doses or segment doses, in a population of qualified samples and the chromosome dose(s) in one or more test samples as described below and shown in the Examples.
- differentiability can be represented numerically as a t-test value, which represents the statistical difference between the chromosome doses in a population of qualified samples and the chromosome dose(s) in one or more test samples.
- differentiability can be based on segment doses instead of chromosome doses.
- differentiability can be represented numerically as a Normalized Chromosome Value (NCV), which is a z-score for chromosome doses as long as the distribution for the NCV is normal.
- segment doses can be represented numerically as a Normalized Segment Value (NSV), which is a z-score for chromosome segment doses as long as the distribution for the NSV is normal.
- the mean and standard deviation of chromosome or segment doses in a set of qualified samples can be used.
- the mean and standard deviation of chromosome or segment doses in a training set comprising qualified samples and affected samples can be used.
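Since the text describes the NCV as a z-score of the chromosome dose relative to the mean and standard deviation of doses in qualified samples, the calculation can be sketched as below. The function name and the dose values are hypothetical, and the choice of the sample (rather than population) standard deviation is an assumption.

```python
from statistics import mean, stdev


def normalized_chromosome_value(test_dose, qualified_doses):
    """z-score of a test chromosome dose against qualified-sample doses."""
    mu = mean(qualified_doses)
    sigma = stdev(qualified_doses)  # sample standard deviation (assumed)
    return (test_dose - mu) / sigma


# Hypothetical chromosome 21 doses from qualified samples.
qualified_doses = [0.49, 0.50, 0.51]
ncv = normalized_chromosome_value(0.54, qualified_doses)
```

The same function applies unchanged to segment doses, in which case the result corresponds to an NSV.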
- the normalizing sequence is a sequence that has the smallest variability and the greatest differentiability or an optimal combination of small variability and large differentiability.
- the method identifies sequences that inherently have similar characteristics and that are prone to similar variations among samples and sequencing runs, and which are useful for determining sequence doses in test samples.
- a qualified sequence dose i.e., a chromosome dose or a segment dose
- a sequence of interest is determined as the ratio of the sequence tag coverage for the sequence of interest and the qualified sequence tag coverage for additional sequences from which normalizing sequences are identified subsequently in step 145.
- the identified normalizing sequences are used subsequently to determine sequence doses in test samples.
- the sequence dose in the qualified samples is a chromosome dose that is calculated as the ratio of the number of sequence tags or fragment size parameter for a chromosome of interest and the number of sequence tags for a normalizing chromosome sequence in a qualified sample.
- the normalizing chromosome sequence can be a single chromosome, a group of chromosomes, a segment of one chromosome, or a group of segments from different chromosomes.
- the sequence dose in the qualified samples is a segment dose as opposed to a chromosome dose, which segment dose is calculated as the ratio of the number of sequence tags for a segment of interest, that is not a whole chromosome, and the number of sequence tags for a normalizing segment sequence in a qualified sample.
- the normalizing segment sequence can be any of the normalizing chromosome or segment sequences discussed above.
- a normalizing sequence is identified for a sequence of interest.
- the normalizing sequence is the sequence based on the calculated sequence doses, e.g., that result in the smallest variability in sequence dose for the sequence of interest across all qualified training samples.
- the method identifies sequences that inherently have similar characteristics and are prone to similar variations among samples and sequencing runs, and which are useful for determining sequence doses in test samples.
- Normalizing sequences for one or more sequences of interest can be identified in a set of qualified samples, and the sequences that are identified in the qualified samples are used subsequently to calculate sequence doses for one or more sequences of interest in each of the test samples (step 150) to determine the presence or absence of aneuploidy in each of the test samples.
- the normalizing sequence identified for chromosomes or segments of interest may differ when different sequencing platforms are used and/or when differences exist in the purification of the nucleic acid that is to be sequenced and/or preparation of the sequencing library.
- the use of normalizing sequences according to the methods described herein provides specific and sensitive measure of a variation in copy number of a chromosome or segment thereof irrespective of sample preparation and/or sequencing platform that is used.
- more than one normalizing sequence is identified, i.e., different normalizing sequences can be determined for one sequence of interest, and multiple sequence doses can be determined for one sequence of interest.
- CV = standard deviation/mean
- two, three, four, five, six, seven, eight or more normalizing sequences can be identified for use in determining a sequence dose for a sequence of interest in a test sample.
- when a single chromosome is chosen as the normalizing chromosome sequence for a chromosome of interest, the normalizing chromosome sequence will be a chromosome that results in chromosome doses for the chromosome of interest that have the smallest variability across all samples tested, e.g., qualified samples.
- the best normalizing chromosome may not have the least variation, but may have a distribution of qualified doses that best distinguishes a test sample or samples from the qualified samples, i.e., the best normalizing chromosome may not have the lowest variation, but may have the greatest differentiability.
- a sequence dose is determined for a sequence of interest in a test sample comprising a mixture of nucleic acids derived from genomes that differ in one or more sequences of interest.
- step 125 at least a portion of the test nucleic acids in the test sample is sequenced as described for the qualified samples to generate millions of sequence reads, e.g., 36bp reads.
- 2x36 bp paired end reads are used for paired end sequencing.
- the reads generated from sequencing the nucleic acids in the test sample are uniquely mapped or aligned to a reference genome to produce tags.
- At least about 3 x 10^6 qualified sequence tags, at least about 5 x 10^6 qualified sequence tags, at least about 8 x 10^6 qualified sequence tags, at least about 10 x 10^6 qualified sequence tags, at least about 15 x 10^6 qualified sequence tags, at least about 20 x 10^6 qualified sequence tags, at least about 30 x 10^6 qualified sequence tags, at least about 40 x 10^6 qualified sequence tags, or at least about 50 x 10^6 qualified sequence tags comprising between 20 and 40bp reads are obtained from reads that map uniquely to a reference genome.
- the reads produced by sequencing apparatus are provided in an electronic format. Alignment is accomplished using computational apparatus as discussed below.
- the alignment procedure permits limited mismatch between reads and the reference genome. In some cases, 1, 2, or 3 base pairs in a read are permitted to mismatch corresponding base pairs in a reference genome, and yet a mapping is still made.
- step 135 all or most of the tags obtained from sequencing the nucleic acids in the test samples are counted to determine a test sequence tag coverage using a computational apparatus as described below.
- each read is aligned to a particular region of the reference genome (a chromosome or segment in most cases), and the read is converted to a tag by appending site information to the read.
- the computational apparatus may keep a running count of the number of tags/reads mapping to each region of the reference genome (chromosome or segment in most cases). The counts are stored for each chromosome or segment of interest and each corresponding normalizing chromosome or segment.
- the reference genome has one or more excluded regions that are part of a true biological genome but are not included in the reference genome. Reads potentially aligning to these excluded regions are not counted. Examples of excluded regions include regions of long repeated sequences, regions of similarity between X and Y chromosomes, etc. Using a masked reference sequence obtained by masking techniques described above, only tags on unmasked segments of the reference sequence are taken into account for analysis of CNV.
- the method determines whether to count a tag more than once when multiple reads align to the same site on a reference genome or sequence. There may be occasions when two tags have the same sequence and therefore align to an identical site on a reference sequence.
- the method employed to count tags may under certain circumstances exclude from the count identical tags deriving from the same sequenced sample. If a disproportionate number of tags are identical in a given sample, it suggests that there is a strong bias or other defect in the procedure. Therefore, in accordance with certain embodiments, the counting method does not count tags from a given sample that are identical to tags from the sample that were previously counted.
- a defined percentage of the tags that are counted must be unique. If more tags than this threshold are not unique, they are disregarded. For example, if the defined percentage requires that at least 50% are unique, identical tags are not counted until the percentage of unique tags exceeds 50% for the sample.
- the threshold percentage of unique tags is at least about 60%. In other embodiments, the threshold percentage of unique tags is at least about 75%, or at least about 90%, or at least about 95%, or at least about 98%, or at least about 99%.
- a threshold may be set at 90% for chromosome 21.
- if 30M tags are aligned to chromosome 21, then at least 27M of them must be unique. If 3M counted tags are not unique and the 30 million and first tag is not unique, it is not counted.
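One way to read the unique-tag rule above is sketched below: every first occurrence of a tag is counted, and a repeated tag is counted only while the unique fraction of counted tags would stay at or above the threshold. This is an interpretation for illustration only; the function name and the representation of a tag as an alignment site are assumptions.

```python
def count_tags(tag_sites, min_unique_fraction=0.9):
    """Count tags while enforcing a minimum fraction of unique tags.

    tag_sites: iterable of hashable alignment sites, one per tag.
    Returns (counted, unique).
    """
    seen = set()
    counted = 0
    unique = 0
    for site in tag_sites:
        if site not in seen:
            # First tag seen at this site: always counted.
            seen.add(site)
            unique += 1
            counted += 1
        elif unique / (counted + 1) >= min_unique_fraction:
            # Repeated tag: counted only if the unique fraction holds.
            counted += 1
    return counted, unique


# Nine distinct sites followed by two repeats of the last site.
counted, unique = count_tags([1, 2, 3, 4, 5, 6, 7, 8, 9, 9, 9], 0.9)
```

With a 90% threshold, the first repeat is accepted (9 of 10 counted tags are unique) and the second is skipped, which mirrors the 27M/30M example in miniature.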
- the choice of the particular threshold or other criterion used to determine when not to count further identical tags can be selected using appropriate statistical analysis. One factor influencing this threshold or other criterion is the relative amount of sequenced sample to the size of the genome to which tags can be aligned. Other factors include the size of the reads and similar considerations.
- the number of test sequence tags mapped to a sequence of interest is normalized to the known length of a sequence of interest to which they are mapped to provide a test sequence tag density ratio.
- normalization to the known length of a sequence of interest is not required, and may be included as a step to reduce the number of digits in a number to simplify it for human interpretation.
- the sequence tag coverage for a sequence of interest, e.g., a clinically-relevant sequence, in the test samples is determined, as are the sequence tag coverages for additional sequences that correspond to at least one normalizing sequence identified in the qualified samples.
- a test sequence dose is determined for a sequence of interest in the test sample.
- the test sequence dose is computationally determined using the sequence tag coverages of the sequence of interest and the corresponding normalizing sequence as described herein.
- the computational apparatus responsible for this undertaking will electronically access the association between the sequence of interest and its associated normalizing sequence, which may be stored in a database, table, graph, or be included as code in program instructions.
- the at least one normalizing sequence can be a single sequence or a group of sequences.
- the sequence dose for a sequence of interest in a test sample is a ratio of the sequence tag coverage determined for the sequence of interest in the test sample and the sequence tag coverage of at least one normalizing sequence determined in the test sample, wherein the normalizing sequence in the test sample corresponds to the normalizing sequence identified in the qualified samples for the particular sequence of interest.
- if the normalizing sequence identified for chromosome 21 in the qualified samples is determined to be a chromosome, e.g., chromosome 14, then the test sequence dose for chromosome 21 (sequence of interest) is determined as the ratio of the sequence tag coverage for chromosome 21 and the sequence tag coverage for chromosome 14, each determined in the test sample. Similarly, chromosome doses for chromosomes 13, 18, X, Y, and other chromosomes associated with chromosomal aneuploidies are determined.
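The test sequence dose is a simple ratio of coverages measured in the same test sample, which the following sketch makes concrete. The tag counts are hypothetical, and summing the coverages of a group of normalizing chromosomes is an assumption about how a group is combined.

```python
def sequence_dose(coverage, interest, normalizers):
    """Ratio of tag coverage for the sequence of interest to the
    (summed) tag coverage of its normalizing sequence(s)."""
    return coverage[interest] / sum(coverage[c] for c in normalizers)


# Hypothetical tag counts from one test sample.
test_coverage = {"chr21": 120, "chr14": 240, "chr9": 360}

# Single normalizing chromosome (chromosome 14, as in the example above).
dose_single = sequence_dose(test_coverage, "chr21", ["chr14"])
# Group of normalizing chromosomes.
dose_group = sequence_dose(test_coverage, "chr21", ["chr14", "chr9"])
```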
- a normalizing sequence for a chromosome of interest can be one or a group of chromosomes, or one or a group of chromosome segments.
- Chromosome segments can range from kilobases (kb) to megabases (Mb) in size (e.g., about 1 kb to 10 kb, or about 10 kb to 100 kb, or about 100 kb to 1 Mb).
- step 160 the copy number variation of the sequence of interest is determined in the test sample by comparing the test sequence dose for the sequence of interest to at least one threshold value established from the qualified sequence doses. This operation may be performed by the same computational apparatus employed to measure sequence tag coverages and/or calculate segment doses.
- step 160 the calculated dose for a test sequence of interest is compared to threshold values that are chosen according to a user-defined “threshold of reliability” to classify the sample as “normal,” “affected,” or “no call.”
- The “no call” samples are samples for which a definitive diagnosis cannot be made with reliability.
- Each type of affected sample e.g., trisomy 21, partial trisomy 21, monosomy X
- the determination of CNV comprises calculating an NCV or NSV that relates the chromosome or segment dose to the mean of the corresponding chromosome or segment dose in a set of qualified samples as described above. Then CNV can be determined by comparing the NCV/NSV to a predetermined copy number evaluation threshold value.
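The three-way classification into normal, affected, and no call can be sketched as a comparison of the NCV against two thresholds that bracket a no-call zone. The threshold values used here are hypothetical placeholders, not the values of Table 1, and the sketch only covers the one-sided case of a copy number gain.

```python
def classify_ncv(ncv, lower_threshold, upper_threshold):
    """Classify a sample from its NCV (gain-only, illustrative).

    Values at or above upper_threshold are called affected, values at
    or below lower_threshold are called normal, and anything between
    the two thresholds is a no call.
    """
    if ncv >= upper_threshold:
        return "affected"
    if ncv <= lower_threshold:
        return "normal"
    return "no call"


# Hypothetical thresholds chosen only for this example.
LOWER, UPPER = 2.5, 4.0
calls = [classify_ncv(v, LOWER, UPPER) for v in (0.4, 3.1, 5.2)]
```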
- Thresholds are set largely depending on the variability in chromosome doses for a particular chromosome of interest as determined in a set of unaffected samples.
- the variability is dependent on a number of factors, including the fraction of fetal cfDNA present in a sample.
- the variability (CV) is determined by the mean or median and standard deviation for chromosome doses across a population of unaffected samples.
- the threshold(s) for classifying aneuploidy use NCVs, according to:
- an expected fetal fraction associated with the given NCV value can be calculated from the CV based on the mean and standard deviation of the chromosome ratio for the chromosome of interest across a population of unaffected samples.
- a decision boundary can be chosen above which samples are determined to be positive (affected) based on the normal distribution quantiles.
- a threshold is set for optimal trade-off between the detection of true positives and rate of false negative results. Namely, the threshold is chosen to maximize the sum of true positives and true negatives, or minimize the sum of the false positives and false negatives.
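The criterion of maximizing the sum of true positives and true negatives can be illustrated with a small search over candidate thresholds. The NCV values and candidate thresholds below are invented for the example, and the function name is hypothetical.

```python
def best_threshold(unaffected, affected, candidates):
    """Return the candidate threshold maximizing TP + TN.

    A sample is called positive when its value is >= the threshold.
    """
    def correct_calls(threshold):
        true_negatives = sum(v < threshold for v in unaffected)
        true_positives = sum(v >= threshold for v in affected)
        return true_positives + true_negatives

    return max(candidates, key=correct_calls)


# Hypothetical NCVs for samples of known status.
unaffected_ncv = [-1.0, 0.2, 1.1, 2.8]
affected_ncv = [3.5, 6.0, 9.2]
chosen = best_threshold(unaffected_ncv, affected_ncv, [2.0, 3.0, 4.0])
```

Because the total number of samples is fixed, maximizing TP + TN is the same as minimizing FP + FN, as the text notes.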
- Certain embodiments provide a method for providing prenatal diagnosis of a fetal chromosomal aneuploidy in a biological sample comprising fetal and maternal nucleic acid molecules.
- the diagnosis is made based on obtaining sequence information from at least a portion of the mixture of the fetal and maternal nucleic acid molecules derived from a biological test sample, e.g., a maternal plasma sample, computing from the sequencing data a normalizing chromosome dose for one or more chromosomes of interest, and/or a normalizing segment dose for one or more segments of interest, and determining a statistically significant difference between the chromosome dose for the chromosome of interest and/or the segment dose for the segment of interest, respectively, in the test sample and a threshold value established in a plurality of qualified (normal) samples, and providing the prenatal diagnosis based on the statistical difference.
- a diagnosis of normal or affected is made. A “no call” is provided in the event that the diagnosis for normal or
- the suspected and no call thresholds are shown in Table 1. As can be seen, the thresholds of NCV vary across different chromosomes. In some embodiments, the thresholds vary according to the FF for the sample as explained above. Threshold techniques applied here contribute to improved sensitivity and selectivity in some embodiments. TABLE 1. Suspected and Affected NCV Thresholds Bracketing No-Call Ranges
- Figure 16A shows a flow chart of a three-pass process for evaluating copy number. It includes three overlapping passes of workflow 700, which includes pass 1 (or 713A) analysis of coverage of reads associated with fragments of all sizes, pass 2 (or 713B) analysis of coverage of reads associated with shorter fragments, and pass 3 (or 713C) analysis of relative frequency of shorter reads relative to all reads.
- Process 700 is similar to process 600 in its overall organization. Operations indicated by blocks 702, 704, 706, 710, and 712 may be performed in the same or a similar manner to operations indicated by blocks 602, 604, 606, 610, and 612. After read counts are obtained, coverage is determined using reads from fragments of all sizes in pass 713A. Coverage is determined using reads from short fragments in pass 713B. Frequency of reads from short fragments relative to all reads is determined in pass 713C. The relative frequency is also referred to as a size ratio or a size fraction elsewhere herein. It is an example of a fragment size characteristic. In some implementations, short fragments are fragments shorter than about 150 base pairs. In various implementations, short fragments can be in the size ranges of about 50-150, 80-150, or 110-150 base pairs. In some implementations, the third pass, or pass 713C, is optional.
- the data of the three passes 713A, 713B, and 713C all undergo normalization operations 714, 716, 718, 719, and 722 to remove variance unrelated to copy number of the sequence of interest. These normalization operations are boxed in block 723. Operation 714 involves normalizing the analyzed quantity of the sequence of interest by dividing the analyzed quantity by the total value of the quantity of the reference sequence. This normalization step uses values obtained from a test sample. Similarly, operations 718 and 722 normalize the analyzed quantity using values obtained from the test sample. Operations 716 and 719 use values obtained from a training set of unaffected samples.
- Operation 719 removes further variance using a principal component analysis (PCA) method.
- the variance removed by the PCA methods is due to factors unrelated to copy number of the sequence of interest.
- the analyzed quantity in each bin provides an independent variable for the PCA, and the samples of the unaffected training set supply values for these independent variables.
- the samples of the training set all have the same copy number of the sequence of interest, e.g., two copies of a somatic chromosome, one copy of the X chromosome (when male samples are used as unaffected samples), or two copies of the X chromosome (when female samples are used as unaffected samples).
- the PCA of the training set yields principal components that are unrelated to copy number of the sequence of interest.
- the principal components can then be used to remove variance in a test sample unrelated to the copy number of the sequence of interest.
- the variance of one or more of the principal components is removed from the test sample’s data using the coefficients estimated from unaffected samples’ data in a region outside of the sequence of interest.
- the region represents all robust chromosomes. For instance, a PCA is performed on normalized bin coverage data of training normal samples, thereby providing principal components corresponding to dimensions in which most variance in the data can be captured. Variance so captured is unrelated to copy number variation in the sequence of interest. After the principal components have been obtained from the training normal samples, they are applied to test data. A linear regression model with the test sample as the response variable and the principal components as independent variables is generated across bins from a region outside of the sequence of interest.
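The PCA-based correction can be sketched with NumPy as follows: principal components are estimated from unaffected training samples, the test sample is regressed on them using only bins outside the sequence of interest, and the fitted component contribution is subtracted from every bin. This is a minimal sketch under assumed names and synthetic data; keeping the intercept in place (rather than subtracting it) is a design choice of this example, not something the text specifies.

```python
import numpy as np


def fit_principal_components(training, n_components):
    """training: (n_samples, n_bins) normalized bin coverages of
    unaffected samples. Returns (n_components, n_bins) directions."""
    centered = training - training.mean(axis=0)
    # Rows of vt are the principal directions in bin space.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:n_components]


def remove_components(test, components, outside):
    """Regress the test coverage on the components over the bins in
    `outside`, then subtract the fitted component part from all bins."""
    design = np.column_stack([np.ones(test.size), components.T])
    coef, *_ = np.linalg.lstsq(design[outside], test[outside], rcond=None)
    return test - components.T @ coef[1:]


# Synthetic data: a wave-shaped systematic bias shared by all samples.
n_bins = 50
wave = np.sin(np.linspace(0.0, 3.0 * np.pi, n_bins))
training = np.array([1.0 + a * wave for a in (0.2, 0.5, 0.8, 1.1, 1.4)])

outside = np.arange(n_bins) < 40      # bins outside the sequence of interest
test = 1.0 + 0.8 * wave
test[~outside] += 0.1                 # simulated copy number gain

components = fit_principal_components(training, 1)
cleaned = remove_components(test, components, outside)
```

After the correction the bias is gone from all bins, while the simulated elevation in the sequence of interest is preserved because those bins were excluded from the regression.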
- the coverage values of all bins have been “normalized” to remove sources of variation other than aneuploidy or other copy number variations.
- the bins of the sequence of interest are enriched or altered relative to other bins for purposes of copy number variation detection. See block 724, which is not an operation but represents the resulting coverage values.
- the normalization operations in large block 723 may increase the signal and/or reduce the noise of the quantity under analysis.
- blocks 728 and 732 are not operations but represent the coverage and relative frequency values after the processing in large block 723. It should be understood that the operations in large block 723 may be modified, rearranged, or removed.
- PCA operation 719 is not performed.
- the correcting for GC operation 718 is not performed.
- the order of the operations is changed; e.g., PCA operation 719 is performed prior to the correcting for GC operation 718.
- operation 726 calculates a t-statistic as follows:
  $$t = \frac{x_1 - x_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$
- $x_1$ is the bin coverage of the sequence of interest
- $x_2$ being the bin coverage of the reference region/sequence
- $s_1$ being the standard deviation of the coverages of the sequence of interest
- $s_2$ being the standard deviation of the coverages of the reference region
- $n_1$ being the number of bins of the sequence of interest
- $n_2$ being the number of the bins of the reference region.
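The quantities defined above (two coverages, two standard deviations, and two bin counts) match the inputs of the usual unequal-variance two-sample t-statistic. Assuming that form, and taking each coverage to be the mean over the bins of its region, the calculation can be sketched as follows; the bin coverage values are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def coverage_t_statistic(interest_bins, reference_bins):
    """Two-sample t-statistic comparing bin coverages of the sequence
    of interest with those of the reference region."""
    x1, x2 = mean(interest_bins), mean(reference_bins)
    s1, s2 = stdev(interest_bins), stdev(reference_bins)
    n1, n2 = len(interest_bins), len(reference_bins)
    return (x1 - x2) / sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)


# Hypothetical normalized bin coverages.
interest = [1.1, 1.2, 1.0, 1.1]
reference = [1.0, 0.9, 1.1, 1.0, 1.0]
t_value = coverage_t_statistic(interest, reference)
```

A positive value indicates higher coverage in the sequence of interest than in the reference region; swapping the two arguments flips the sign.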
- the reference region includes all robust chromosomes (e.g., chromosomes other than those most likely to harbor an aneuploidy). In some implementations, the reference region includes at least one chromosome outside of the sequence of interest. In some implementations, the reference region includes robust chromosomes not including the sequence of interest. In other implementations, the reference region includes a set of chromosomes (e.g., a subset of chromosomes selected from the robust chromosomes) that have been determined to provide the best signal detection ability for a set of training samples.
- the signal detection ability is based on the ability of the reference region to discriminate bins harboring copy number variations from bins that do not harbor copy number variations.
- the reference region is identified in a manner similar to that employed to determine a “normalizing sequence” or a “normalizing chromosome” as described in the section titled “Identification of Normalizing Sequences.”
- one or more fetal fraction estimates may be combined with any of the t statistics in block 726, 730 and 734 to obtain a likelihood estimate for a ploidy case. See block 736.
- the one or more fetal fractions of block 740 are obtained by any of process 800 in Figure 16B, process 900 in Figure 16C, or process 1000 of Figure 16D.
- the processes may be implemented in parallel using a workflow such as workflow 1100 in Figure 2J.
- Figure 16B shows an example process 800 for determining fetal fraction from coverage information according to some implementations of the disclosure.
- Process 800 starts by obtaining coverage information (e.g., sequence dose values) of training samples from a training set. See block 802.
- Each sample of the training set is obtained from a pregnant woman known to be carrying a male fetus. Namely, the sample contains cfDNA of the male fetus.
- operation 802 may obtain sequence coverage normalized in ways different from sequence dose as described herein, or it may obtain other coverage values.
- Process 800 then involves calculating fetal fractions of the training samples.
- fetal fraction may be calculated from the sequence dose values:
- $R_{X_j}$ is the sequence dose for a male sample
- median($R_{X_i}$) being the median of the sequence doses for female samples.
- mean or other central tendency measures may be used.
- the FF may be obtained by other methods, such as the relative frequency of X and Y chromosomes. See block 804.
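The equation itself is not reproduced in this text. As a hedged illustration, the sketch below assumes a commonly used form in which the chromosome X dose of a sample carrying a male fetus is reduced, relative to the female median, by half the fetal fraction. The formula, the function name, and the dose values are assumptions, not a restatement of the disclosed equation.

```python
from statistics import median


def fetal_fraction_from_x_dose(male_dose, female_doses):
    """Assumed form: FF = 2 * (median_female - male_dose) / median_female.

    A male fetus contributes one X copy instead of two, so the X dose
    drops by about FF / 2 relative to female-fetus samples.
    """
    reference = median(female_doses)
    return 2.0 * (reference - male_dose) / reference


# Hypothetical chromosome X sequence doses.
female_x_doses = [0.99, 1.00, 1.01]
ff_estimate = fetal_fraction_from_x_dose(0.95, female_x_doses)
```

As the text notes, the mean or another central tendency measure could replace the median.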
- Process 800 further involves dividing the reference sequence into multiple bins of subsequences.
- the reference sequence is a complete genome.
- the bins are 100 kb bins.
- the genome is divided into about 25,000 bins.
- the process then obtains coverages of the bins. See block 806.
- the coverages used in block 806 are obtained after undergoing normalizing operations shown in block 1123 of Figure 2J. In other implementations, coverages from different size ranges may be used.
- Each bin is associated with coverages of the samples in the training set. Therefore, for each bin a correlation may be obtained between the coverage of the samples and the fetal fractions of the samples.
- Process 800 involves obtaining correlations between fetal fraction and coverage for all the bins. See block 808. Then the process selects the bins having correlation values above a threshold. See block 810. In some implementations, bins having the 6000 highest correlation values are selected. The purpose is to identify bins that demonstrate high correlation between coverage and fetal fraction in the training samples. Then the bins may be used to predict fetal fraction in the test sample.
- although the training samples are male samples, the correlation between fetal fraction and coverage may be generalized to male and female test samples.
- process 800 uses the selected bins having high correlation values to obtain a linear model relating fetal fraction to coverage. See block 812. Each selected bin provides an independent variable for the linear model. Therefore, the obtained linear model also includes a parameter or weight for each bin. The weights of the bins are adjusted to fit the model to the data.
- process 800 involves applying coverage data of the test sample to the model to determine the fetal fraction for the test sample. See block 814. The applied coverage data of the test sample are for the bins that have high correlations between fetal fraction and coverage.
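Blocks 808 to 814 (correlate per-bin coverage with fetal fraction, keep the most correlated bins, fit a linear model, apply it to a test sample) can be sketched with NumPy as below. The function names and the synthetic data are assumptions, ordinary least squares stands in for whatever fitting procedure is actually used, and only a handful of bins are simulated rather than the roughly 25,000 genome-wide bins mentioned above.

```python
import numpy as np


def select_bins(coverage, fetal_fraction, n_select):
    """Indices of the n_select bins whose coverage has the highest
    Pearson correlation with fetal fraction across training samples."""
    cov_c = coverage - coverage.mean(axis=0)
    ff_c = fetal_fraction - fetal_fraction.mean()
    corr = (ff_c @ cov_c) / np.sqrt((cov_c ** 2).sum(axis=0) * (ff_c ** 2).sum())
    return np.argsort(corr)[::-1][:n_select]


def fit_linear_model(coverage, fetal_fraction, bins):
    """Least-squares weights: an intercept plus one weight per bin."""
    design = np.column_stack([np.ones(len(fetal_fraction)), coverage[:, bins]])
    weights, *_ = np.linalg.lstsq(design, fetal_fraction, rcond=None)
    return weights


def predict_fetal_fraction(weights, sample_coverage, bins):
    return weights[0] + sample_coverage[bins] @ weights[1:]


# Synthetic training set: 30 samples, 8 bins, two informative bins.
rng = np.random.default_rng(0)
train_ff = rng.uniform(0.04, 0.20, size=30)
train_cov = 1.0 + rng.normal(0.0, 0.01, size=(30, 8))
train_cov[:, 2] = 1.0 + 0.5 * train_ff
train_cov[:, 5] = 1.0 + 0.2 * train_ff

bins = select_bins(train_cov, train_ff, n_select=2)
weights = fit_linear_model(train_cov, train_ff, bins)

# Synthetic test sample with a true fetal fraction of 0.10.
test_cov = np.ones(8)
test_cov[2] = 1.0 + 0.5 * 0.10
test_cov[5] = 1.0 + 0.2 * 0.10
estimate = predict_fetal_fraction(weights, test_cov, bins)
```

With thousands of selected bins and far fewer training samples, a plain least-squares fit would be underdetermined, so a practical implementation would likely need some form of regularization; the text does not say which.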
- Figure 2J shows workflow 1100 for processing sequence reads, information from which can be used to obtain fetal fraction estimates.
- the workflow 1100 shares similar processing steps as workflow 600 in Figure 2D.
- Blocks 1102, 1104, 1106, 1110, 1112, 1123, 1114, 1116, 1118, and 1122 respectively correspond to blocks 602, 604, 606, 610, 612, 623, 614, 616, 618, and 622.
- one or more normalizing operations in block 1123 are optional.
- Pass 1 provides coverage information, which may be used in block 806 of process 800 shown in Figure 16B.
- Process 800 then can yield a fetal fraction estimate 1150 in Figure 2J.
- the putative sex of the fetus is obtained by using the coverage of the Y chromosome.
- Two or more fetal fractions (e.g., 1150 and 1152) may be combined in various ways to provide a composite estimate of fetal fraction (e.g., 1155). For instance, an average or a weighted average approach may be used in some implementations, wherein weighting can be based on the statistical confidence of the fetal fraction estimate.
- a composite estimate of fetal fraction for a putatively female fetus is obtained by using information selected from the group consisting of: a fetal fraction obtained from coverage information of bins, a fetal fraction obtained from fragment size information, and any combinations thereof.
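A weighted average of several fetal fraction estimates can be sketched as follows. Using the inverse of each estimate's variance as its weight is one common way to express statistical confidence and is an assumption here; the estimates and variances are hypothetical.

```python
def composite_fetal_fraction(estimates, variances):
    """Inverse-variance weighted average of fetal fraction estimates."""
    weights = [1.0 / v for v in variances]
    weighted_sum = sum(w * e for w, e in zip(weights, estimates))
    return weighted_sum / sum(weights)


# Hypothetical estimates, e.g., one from bin coverage and one from
# fragment size, with the coverage-based estimate being more precise.
combined = composite_fetal_fraction([0.10, 0.14], [0.0001, 0.0003])
```

The more precise estimate dominates, so the composite lands closer to 0.10 than to 0.14. Equal variances reduce this to a plain average.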
- Figure 16C shows a process for determining fetal fraction from size distribution information according to some implementations.
- Process 900 starts by obtaining coverage information (e.g., sequence dose values) of male training samples from a training set. See block 902.
- Process 900 then involves calculating fetal fractions of the training samples using methods described above with reference to block 804. See block 904.
- Process 900 proceeds to divide a size range into a plurality of bins to provide fragment-size-based bins and determine frequencies of reads for the fragment-size-based bins. See block 906.
- the frequencies of fragment-size-based bins are obtained without normalizing for factors shown in block 1123. See path 1124 of Figure 2J.
- the frequencies of fragment-size-based bins are obtained after optionally undergoing normalizing operations shown in block 1123 of Figure 2J.
- the size range is divided into 40 bins.
- the bin at the low end includes fragments of size smaller than about 55 base pairs.
- the bin at the low end includes fragments of size in the range of about 50-55 base pairs, which excludes information for reads shorter than 50 bp. In some implementations, the bin at the high end includes fragments of size larger than about 245 base pairs. In some implementations, the bin at the high end includes fragments of size in the range of about 245-250 base pairs, which excludes information for reads longer than 250 bp.
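The binning described above can be sketched as follows, assuming the variant in which 40 equal-width bins span about 50 to 250 bp and fragments outside that range are excluded. The function name and defaults are illustrative, not taken from the disclosure.

```python
import numpy as np


def size_bin_frequencies(fragment_sizes, low=50, high=250, n_bins=40):
    """Return per-bin read frequencies and the bin edges.

    Fragments shorter than `low` or longer than `high` are ignored, which
    matches the variant that excludes reads outside 50-250 bp.
    """
    edges = np.linspace(low, high, n_bins + 1)  # 5 bp wide bins by default
    counts, _ = np.histogram(fragment_sizes, bins=edges)
    total = counts.sum()
    if total == 0:
        raise ValueError("no fragments fall inside the size range")
    return counts / total, edges
```

The resulting 40 frequencies are the independent variables that a size-based model would take as input.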
- Process 900 proceeds by obtaining a linear model relating fetal fraction to frequencies of reads for the fragment-size-based bins, using data of the training samples. See block 908.
- the obtained linear model includes independent variables for the frequencies of reads of the size-based bins.
- the model also includes a parameter or weight for each size-based bin. The weights of the bins are adjusted to fit the model to the data.
- process 900 involves applying read frequency data of the test sample to the model to determine the fetal fraction for the test sample. See block 910.
- an 8-mer frequency may be used to calculate fetal fraction.
- Figure 16D shows an example process 1000 for determining fetal fraction from 8-mer frequency information according to some implementations of the disclosure.
- Process 1000 starts by obtaining coverage information (e.g., sequence dose values) of male training samples from a training set. See block 1002.
- Process 1000 then involves calculating fetal fractions of the training samples using any of the methods described for block 804. See block 1004.
- Process 1000 further involves obtaining the frequencies of 8-mers (e.g., all possible permutations of 4 nucleotides at 8 positions) from the reads of each training sample. See block 1006. In some implementations, up to 65,536 or close to that many 8-mers and their frequencies are obtained. In some implementations, the frequencies of 8-mers are obtained without normalizing for factors shown in block 1123. In some implementations, 8-mer frequencies are obtained after optionally undergoing normalizing operations.
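Counting 8-mer frequencies can be sketched as below. The source does not say whether 8-mers are taken at every read position or only at read starts; this sketch assumes a sliding window over each read and skips windows containing ambiguous bases, and the function name is hypothetical.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(reads, k=8):
    """Return the relative frequency of every possible k-mer over ACGT.

    Assumption: every position of every read contributes one k-mer
    (sliding window); windows containing bases other than A, C, G, T
    are skipped.
    """
    alphabet = set("ACGT")
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= alphabet:
                counts[kmer] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no valid k-mers found in the reads")
    # 4**k keys in total: 65,536 when k == 8
    all_kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return {kmer: counts[kmer] / total for kmer in all_kmers}
```

Each of the 65,536 frequencies can then serve as a candidate independent variable when the model is fitted to the training samples.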
- process 1000 involves applying 8-mer frequency data of the test sample to the model to determine the fetal fraction for the test sample. See block 1014.
- coverages (or other parameters) having correlation with different portions of a genome are weighted for the different portions in calculating fetal fraction.
- examples of such methods include the SeqFF method described in U.S. Patent Application Publication No. US 2015/0005176, and Kim et al. (2015), Prenatal Diagnosis, 35, 1-6, which are incorporated by reference in their entireties for the purpose of calculating fetal fraction.
- bins having higher fractions of fetal cell-free nucleic acid fragments are weighted more heavily for determining fetal fractions.
- process 700 involves obtaining a final ploidy likelihood in operation 736 using the t-statistic based on the coverage of all fragments provided by operation 726, the fetal fraction estimate provided by operation 726, and the t-statistic based on the coverage of the short fragments provided by operation 730.
- These implementations combine the results from pass 1 and pass 2 using multivariate normal models.
- the ploidy likelihood is an aneuploidy likelihood, which is a likelihood of a model having an aneuploid assumption (e.g., trisomy or monosomy) minus the likelihood of a model having a euploid assumption, wherein the model uses the t-statistic based on the coverage of all fragments, the fetal fraction estimate, and the t-statistic based on the coverage of the short fragments as inputs and provides a likelihood as an output.
- the ploidy likelihood is expressed as a likelihood ratio.
- likelihood ratio is modeled as:
- the model combines coverage generated from short fragments with coverage generated by all fragments, which helps improve separation between coverage scores of affected and unaffected samples.
- the model also makes use of fetal fraction, thereby further improving the ability to discriminate between affected and unaffected samples.
- the likelihood ratio is calculated using t-statistic based on coverage of all fragments (726), t-statistic based on coverage of short fragments (730), and a fetal fraction estimate provided by processes 800 (or block 726), 900, or 1000 as described above. In some implementations, this likelihood ratio is used to analyze chromosomes 13, 18, and 21.
- p1 represents the likelihood that data come from a multivariate normal distribution representing a 3-copy or 1-copy model
- p0 represents the likelihood that data come from a multivariate normal distribution representing a 2-copy model
- T_short-freq is a T score calculated from the relative frequency of short fragments
- q(ff_total) is the density distribution of fetal fraction (estimated from training data), considering the error associated with fetal fraction estimation.
- the likelihood ratio is calculated using t-statistic based on relative frequency of short fragments (734) and a fetal fraction estimate provided by processes 800 (or block 726), 900, or 1000 as described above. In some implementations, this likelihood ratio is used to analyze chromosome X.
- the likelihood ratio is calculated using the t-statistic based on coverage of all fragments (726), the t-statistic based on coverage of short fragments (730), and the relative frequency of short fragments (734). Moreover, fetal fraction obtained as described above may be combined with t-statistics to calculate the likelihood ratio. By combining information from any of the three passes 713A, 713B, and 713C, the discriminative ability of the ploidy evaluation can be improved.
- the euploid model and aneuploidy model take t-statistics as input. However, they can also take raw or otherwise transformed coverage or abundance values as input and provide likelihoods as outputs. The t-statistic or otherwise transformed input can help improve the predictive ability of the models, but the transformation is not necessary in all implementations.
- the modeled likelihood ratio represents the likelihood of the modeled data having been obtained from a trisomy or monosomy sample relative to the likelihood of the modeled data having been obtained from a diploid sample. Such likelihood ratio may be used to determine trisomy or monosomy of the autosomes in some implementations.
- the likelihood ratio for monosomy X and the likelihood ratio for trisomy X are evaluated.
- a chromosome coverage measurement e.g., CNV or coverage z score
- the four values are evaluated using a decision tree to determine copy number of the sex chromosome.
- the decision tree allows determination of a ploidy case of XX, XY, X, XXY, XXX, or XYY.
- the likelihood ratio is transformed into a log likelihood ratio
- a criterion or threshold for calling an aneuploidy or a copy number variation can be empirically set to obtain a particular sensitivity and selectivity. For instance, a log likelihood ratio of 1.5 may be set for calling a trisomy 13 or a trisomy 18 based on a model’s sensitivity and selectivity when applied to a training set. Moreover, for instance, a call criterion value of 3 may be set for a trisomy of chromosome 21 in some applications.
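Applying such call criteria can be sketched as follows. The thresholds are the example values given above (1.5 for trisomy 13 and 18, 3 for trisomy 21); the natural logarithm, the `>=` comparison, and all names are assumptions made for illustration.

```python
import math

# Example call criteria from the text; a real assay would set these
# empirically from training data.
CALL_THRESHOLDS = {"chr13": 1.5, "chr18": 1.5, "chr21": 3.0}


def call_trisomy(likelihood_ratio, chromosome, thresholds=CALL_THRESHOLDS):
    """Return (log likelihood ratio, call) for one chromosome.

    Assumptions: the log is the natural log, and a sample is called
    affected when the log likelihood ratio meets or exceeds the threshold.
    """
    if likelihood_ratio <= 0:
        raise ValueError("likelihood ratio must be positive")
    llr = math.log(likelihood_ratio)
    return llr, llr >= thresholds[chromosome]
```

The same log likelihood ratio can therefore lead to a call for one chromosome and not another, because each chromosome carries its own empirically set criterion.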
- Samples that are used for determining a CNV can include samples taken from any cell, tissue, or organ in which copy number variations for one or more sequences of interest are to be determined.
- the samples contain nucleic acids that are present in cells and/or nucleic acids that are “cell-free” (e.g., cfDNA).
- cell-free nucleic acids, e.g., cell-free DNA (cfDNA).
- Cell-free nucleic acids, including cell-free DNA, can be obtained by various methods known in the art from biological samples including but not limited to plasma, serum, and urine (see, e.g., Fan et al., Proc Natl Acad Sci 105:16266-16271 [2008]; Koide et al., Prenatal Diagnosis 25:604-607 [2005]; Chen et al., Nature Med. 2:1033-1035 [1996]; Lo et al., Lancet 350:485-487 [1997]; Botezatu et al., Clin Chem.
- Biological samples comprising cfDNA have been used in assays to determine the presence or absence of chromosomal abnormalities, e.g., trisomy 21, by sequencing assays that can detect chromosomal aneuploidies and/or various polymorphisms.
- non-specific enrichment can be the non-selective amplification of both genomes present in the sample.
- non-specific amplification can be of fetal and maternal DNA in a sample comprising a mixture of DNA from the fetal and maternal genomes.
- Methods for whole genome amplification are known in the art. Degenerate oligonucleotide-primed PCR (DOP), primer extension PCR technique (PEP) and multiple displacement amplification (MDA) are examples of whole genome amplification methods.
- the sample comprising the mixture of cfDNA from different genomes is un-enriched for cfDNA of the genomes present in the mixture.
- the sample comprising the mixture of cfDNA from different genomes is non-specifically enriched for any one of the genomes present in the sample.
- the sample comprising the nucleic acid(s) to which the methods described herein are applied typically comprises a biological sample (“test sample”), e.g., as described above.
- the nucleic acid(s) to be screened for one or more CNVs is purified or isolated by any of a number of well-known methods.
- the sample comprises or consists of a purified or isolated polynucleotide, or it can comprise samples such as a tissue sample, a biological fluid sample, a cell sample, and the like.
- suitable biological fluid samples include, but are not limited to, blood, plasma, serum, sweat, tears, sputum, urine, ear flow, lymph, saliva, cerebrospinal fluid, lavages, bone marrow suspension, vaginal flow, trans-cervical lavage, brain fluid, ascites, milk, secretions of the respiratory, intestinal and genitourinary tracts, amniotic fluid, and leukapheresis samples.
- the terms “blood,” “plasma” and “serum” expressly encompass fractions or processed portions thereof.
- the “sample” expressly encompasses a processed fraction or portion derived from the biopsy, swab, smear, etc.
- samples can be obtained from sources including, but not limited to, samples from different individuals, samples from different developmental stages of the same or different individuals, samples from different diseased individuals (e.g., individuals with cancer or suspected of having a genetic disorder), normal individuals, samples obtained at different stages of a disease in an individual, samples obtained from an individual subjected to different treatments for a disease, samples from individuals subjected to different environmental factors, samples from individuals with predisposition to a pathology, samples from individuals with exposure to an infectious disease agent (e.g., HIV), and the like.
- the sample is a maternal sample that is obtained from a pregnant female, for example a pregnant woman.
- the sample can be analyzed using the methods described herein to provide a prenatal diagnosis of potential chromosomal abnormalities in the fetus.
- the maternal sample can be a tissue sample, a biological fluid sample, or a cell sample.
- a biological fluid includes, as non-limiting examples, blood, plasma, serum, sweat, tears, sputum, urine, ear flow, lymph, saliva, cerebrospinal fluid, lavages, bone marrow suspension, vaginal flow, transcervical lavage, brain fluid, ascites, milk, secretions of the respiratory, intestinal and genitourinary tracts, and leukapheresis samples.
- the maternal sample is a mixture of two or more biological samples, e.g., the biological sample can comprise two or more of a biological fluid sample, a tissue sample, and a cell culture sample.
- the sample is a sample that is easily obtainable by non-invasive procedures, e.g., blood, plasma, serum, sweat, tears, sputum, urine, milk, ear flow, saliva and feces.
- the biological sample is a peripheral blood sample, and/or the plasma and serum fractions thereof.
- the biological sample is a swab or smear, a biopsy specimen, or a sample of a cell culture.
- samples can also be obtained from in vitro cultured tissues, cells, or other polynucleotide-containing sources.
- the cultured samples can be taken from sources including, but not limited to, cultures (e.g., tissue or cells) maintained in different media and conditions (e.g., pH, pressure, or temperature), cultures (e.g., tissue or cells) maintained for different periods of length, cultures (e.g., tissue or cells) treated with different factors or reagents (e.g., a drug candidate, or a modulator), or cultures of different types of tissue and/or cells.
- nucleic acids are well known and will differ depending upon the nature of the source.
- One of skill in the art can readily isolate nucleic acid(s) from a source as needed for the method described herein.
- sample nucleic acids are obtained as cfDNA, which is not subjected to fragmentation.
- the methods described herein can utilize next generation sequencing (NGS) technologies that allow multiple samples to be sequenced individually as genomic molecules (i.e., singleplex sequencing) or as pooled samples comprising indexed genomic molecules (e.g., multiplex sequencing) on a single sequencing run.
- these methods can generate up to several hundred million reads of DNA sequences.
- the sequences of genomic nucleic acids, and/or of indexed genomic nucleic acids can be determined using, for example, the Next Generation Sequencing Technologies (NGS) described herein.
- analysis of the massive amount of sequence data obtained using NGS can be performed using one or more processors as described herein.
- sequencing methods contemplated herein involve the preparation of sequencing libraries.
- sequencing library preparation involves the production of a random collection of adapter-modified DNA fragments (e.g., polynucleotides) that are ready to be sequenced.
- Sequencing libraries of polynucleotides can be prepared from DNA or RNA, including equivalents or analogs of either DNA or cDNA, for example, DNA or cDNA that is complementary or copy DNA produced from an RNA template by the action of reverse transcriptase.
- the polynucleotides may originate in double-stranded form (e.g., dsDNA such as genomic DNA fragments, cDNA, PCR amplification products, and the like) or, in certain embodiments, the polynucleotides may have originated in single-stranded form (e.g., ssDNA, RNA, etc.) and have been converted to dsDNA form.
- single stranded mRNA molecules may be copied into double-stranded cDNAs suitable for use in preparing a sequencing library.
- the precise sequence of the primary polynucleotide molecules is generally not material to the method of library preparation, and may be known or unknown.
- the polynucleotide molecules are DNA molecules. More particularly, in certain embodiments, the polynucleotide molecules represent the entire genetic complement of an organism or substantially the entire genetic complement of an organism, and are genomic DNA molecules (e.g., cellular DNA, cell free DNA (cfDNA), etc.), that typically include both intron sequence and exon sequence (coding sequence), as well as non-coding regulatory sequences such as promoter and enhancer sequences.
- the primary polynucleotide molecules comprise human genomic DNA molecules, e.g., cfDNA molecules present in peripheral blood of a pregnant subject.
- Preparation of sequencing libraries for some NGS sequencing platforms is facilitated by the use of polynucleotides comprising a specific range of fragment sizes.
- Preparation of such libraries typically involves the fragmentation of large polynucleotides (e.g., cellular genomic DNA) to obtain polynucleotides in the desired size range.
- Fragmentation can be achieved by any of a number of methods known to those of skill in the art. For example, fragmentation can be achieved by mechanical means including, but not limited to, nebulization, sonication and hydroshear.
- cfDNA typically exists as fragments of less than about 300 base pairs and consequently, fragmentation is not typically necessary for generating a sequencing library using cfDNA samples.
- whether polynucleotides are forcibly fragmented (e.g., fragmented in vitro) or naturally exist as fragments, they are converted to blunt-ended DNA having 5’-phosphates and 3’-hydroxyl.
- Standard protocols, e.g., protocols for sequencing using, for example, the Illumina platform as described elsewhere herein, instruct users to end-repair sample DNA, to purify the end-repaired products prior to dA-tailing, and to purify the dA-tailing products prior to the adaptor-ligating steps of the library preparation.
- An abbreviated method (ABB method), a 1-step method, and a 2-step method are examples of methods for preparation of a sequencing library, which can be found in patent application 13/555,037 filed on July 20, 2012, which is incorporated by reference in its entirety.
- verification of the integrity of the samples and sample tracking can be accomplished by sequencing mixtures of sample genomic nucleic acids, e.g., cfDNA, and accompanying marker nucleic acids that have been introduced into the samples, e.g., prior to processing.
- Marker nucleic acids can be combined with the test sample (e.g., biological source sample) and subjected to processes that include, for example, one or more of the steps of fractionating the biological source sample, e.g., obtaining an essentially cell-free plasma fraction from a whole blood sample, purifying nucleic acids from a fractionated, e.g., plasma, or unfractionated biological source sample, e.g., a tissue sample, and sequencing.
- sequencing comprises preparing a sequencing library.
- the sequence or combination of sequences of the marker molecules that are combined with a source sample is chosen to be unique to the source sample.
- the unique marker molecules in a sample all have the same sequence.
- the unique marker molecules in a sample are a plurality of sequences, e.g., a combination of two, three, four, five, six, seven, eight, nine, ten, fifteen, twenty, or more different sequences.
- the integrity of a sample can be verified using a plurality of marker nucleic acid molecules having identical sequences.
- identity of a sample can be verified using a plurality of marker nucleic acid molecules that have at least two, at least three, at least four, at least five, at least six, at least seven, at least eight, at least nine, at least ten, at least 11, at least 12, at least 13, at least 14, at least 15, at least 16, at least 17, at least 18, at least 19, at least 20, at least 25, at least 30, at least 35, at least 40, at least 50, or more different sequences.
- Verification of the integrity of the plurality of biological samples requires that each of the two or more samples be marked with marker nucleic acids that have sequences that are unique to each of the plurality of test samples being marked.
- a first sample can be marked with a marker nucleic acid having sequence A
- a second sample can be marked with a marker nucleic acid having sequence B.
- a first sample can be marked with marker nucleic acid molecules all having sequence A
- a second sample can be marked with a mixture of sequences B and C, wherein sequences A, B and C are marker molecules having different sequences.
- the marker nucleic acid(s) can be added to the sample at any stage of sample preparation that occurs prior to library preparation (if libraries are to be prepared) and sequencing.
- marker molecules can be combined with an unprocessed source sample.
- the marker nucleic acid can be provided in a collection tube that is used to collect a blood sample.
- the marker nucleic acids can be added to the blood sample following the blood draw.
- the marker nucleic acid is added to the vessel that is used to collect a biological fluid sample, e.g., the marker nucleic acid(s) are added to a blood collection tube that is used to collect a blood sample.
- the integrity of a human cell-free DNA sample obtained from a subject affected by a pathogen (e.g., a bacterium) can be verified using marker molecules having sequences that are absent from both the human genome and the genome of the affecting bacterium.
- Sequences of genomes of numerous pathogens, e.g., bacteria, viruses, yeasts, fungi, protozoa, etc., are publicly available on the World Wide Web at ncbi.nlm.nih.gov/genomes.
- marker molecules are nucleic acids that have sequences that are absent from any known genome. The sequences of marker molecules can be randomly generated algorithmically.
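Random generation of per-sample marker sequences can be sketched as below. This is a toy illustration under stated assumptions: absence from a genome is approximated by a substring check against a reference string supplied by the caller, whereas a real workflow would search genome databases with an aligner. The function names, the default length, and the retry limit are hypothetical.

```python
import random


def random_marker(length, rng):
    """Draw one random DNA sequence of the given length."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def assign_markers(sample_ids, reference, length=30, seed=0, max_tries=1000):
    """Assign each sample a marker that is unique and absent from `reference`.

    `reference` is a plain string standing in for the genomes to be
    excluded; the substring test is only a stand-in for a real search.
    """
    rng = random.Random(seed)
    markers = {}
    used = set()
    for sample_id in sample_ids:
        for _ in range(max_tries):
            candidate = random_marker(length, rng)
            if candidate not in used and candidate not in reference:
                markers[sample_id] = candidate
                used.add(candidate)
                break
        else:
            raise RuntimeError(f"no marker found for sample {sample_id}")
    return markers
```

Because every sample receives a distinct sequence, reads carrying a marker assigned to a different sample would indicate a sample mix-up or cross-contamination.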
- the marker molecules can be naturally-occurring deoxyribonucleic acids (DNA), ribonucleic acids or artificial nucleic acid analogs (nucleic acid mimics) including peptide nucleic acids (PNA), morpholino nucleic acid, locked nucleic acids, glycol nucleic acids, and threose nucleic acids, which are distinguished from naturally-occurring DNA or RNA by changes to the backbone of the molecule or DNA mimics that do not have a phosphodiester backbone.
- the deoxyribonucleic acids can be from naturally-occurring genomes or can be generated in a laboratory through the use of enzymes or by solid phase chemical synthesis.
- DNA mimics that are not found in nature.
- Derivatives of DNA that are available in which the phosphodiester linkage has been replaced but in which the deoxyribose is retained include, but are not limited to, DNA mimics having backbones formed by thioformacetal or a carboxamide linkage, which have been shown to be good structural DNA mimics.
- Other DNA mimics include morpholino derivatives and the peptide nucleic acids (PNA), which contain an N-(2-aminoethyl)glycine-based pseudopeptide backbone (Ann Rev Biophys Biomol Struct 24:167-183 [1995]).
- This modification reduces the action of endo- and exonucleases, including 5’ to 3’ and 3’ to 5’ DNA Pol I exonuclease, nucleases S1 and P1, RNases, serum nucleases and snake venom phosphodiesterase.
- the length of the marker molecules can be distinct or indistinct from that of the sample nucleic acids, i.e., the length of the marker molecules can be similar to that of the sample genomic molecules, or it can be greater or smaller than that of the sample genomic molecules.
- the length of the marker molecules is measured by the number of nucleotide or nucleotide analog bases that constitute the marker molecule.
- Marker molecules having lengths that differ from those of the sample genomic molecules can be distinguished from source nucleic acids using separation methods known in the art. For example, differences in the length of the marker and sample nucleic acid molecules can be determined by electrophoretic separation, e.g., capillary electrophoresis.
- Size differentiation can be advantageous for quantifying and assessing the quality of the marker and sample nucleic acids.
- the marker nucleic acids are shorter than the genomic nucleic acids, and of sufficient length to exclude them from being mapped to the genome of the sample. For example, a 30-base human sequence is needed to map it uniquely to a human genome. Accordingly, in certain embodiments, marker molecules used in sequencing bioassays of human samples should be at least 30 bp in length.
- sequencing using the Illumina GAII sequence analyzer includes an in vitro clonal amplification by bridge PCR (also known as cluster amplification) of polynucleotides that have a minimum length of 110 bp, to which adaptors are ligated to provide a nucleic acid of at least 200 bp and less than 600 bp that can be clonally amplified and sequenced.
- the length of the adaptor-ligated marker molecule is between about 200 bp and about 600 bp, between about 250 bp and 550 bp, between about 300 bp and 500 bp, or between about 350 bp and 450 bp. In other embodiments, the length of the adaptor-ligated marker molecule is about 200 bp.
- the length of the marker molecule when sequencing fetal cfDNA that is present in a maternal sample, can be chosen to be similar to that of fetal cfDNA molecules.
- the length of the marker molecule used in an assay that comprises massively parallel sequencing of cfDNA in a maternal sample to determine the presence or absence of a fetal chromosomal aneuploidy can be about 150 bp, about 160 bp, about 170 bp, about 180 bp, about 190 bp or about 200 bp; preferably, the marker molecule is about 170 bp.
- the yield of sequences per unit mass is dependent on the number of 3’ end hydroxyl groups, and thus having relatively short templates for sequencing is more efficient than having long templates. If starting with nucleic acids longer than 1000 nt, it is generally advisable to shear the nucleic acids to an average length of 100 to 200 nt so that more sequence information can be generated from the same mass of nucleic acids. Thus, the length of the marker molecule can range from tens of bases to thousands of bases.
- the length of marker molecules used for single molecule sequencing can be up to about 25 bp, up to about 50 bp, up to about 75 bp, up to about 100 bp, up to about 200 bp, up to about 300 bp, up to about 400 bp, up to about 500 bp, up to about 600 bp, up to about 700 bp, up to about 800 bp, up to about 900 bp, up to about 1000 bp, or more in length.
- the length chosen for a marker molecule is also determined by the length of the genomic nucleic acid that is being sequenced.
- cfDNA circulates in the human bloodstream as genomic fragments of cellular genomic DNA. Fetal cfDNA molecules found in the plasma of pregnant women are generally shorter than maternal cfDNA molecules (Chan et al., Clin Chem 50:88-92 [2004]). Size fractionation of circulating fetal DNA has confirmed that the average length of circulating fetal DNA fragments is <300 bp, while maternal DNA has been estimated to be between about 0.5 and 1 Kb (Li et al., Clin Chem 50:1002-1011 [2004]).
- marker molecules that are chosen can be up to about the length of the cfDNA.
- the length of marker molecules used in maternal cfDNA samples to be sequenced as single nucleic acid molecules or as clonally amplified nucleic acids can be between about 100 bp and 600 bp.
- the sample genomic nucleic acids are fragments of larger molecules.
- a sample genomic nucleic acid that is sequenced is fragmented cellular DNA.
- the length of the marker molecules can be up to the length of the DNA fragments.
- the length of the marker molecules is at least the minimum length required for mapping the sequence read uniquely to the appropriate reference genome.
- the length of the marker molecule is the minimum length that is required to exclude the marker molecule from being mapped to the sample reference genome.
- marker molecules can be used to verify samples that are not assayed by nucleic acid sequencing, and that can be verified by common bio-techniques other than sequencing, e.g., real-time PCR.
- Sample controls, e.g., in-process positive controls for sequencing and/or analysis.
- marker sequences introduced into the samples can function as positive controls to verify the accuracy and efficacy of sequencing and subsequent processing and analysis.
- compositions and methods for providing an in-process positive control (IPC) for sequencing DNA in a sample are provided.
- positive controls are provided for sequencing cfDNA in a sample comprising a mixture of genomes.
- An IPC can be used to relate baseline shifts in sequence information obtained from different sets of samples, e.g., samples that are sequenced at different times on different sequencing runs.
- an IPC can relate the sequence information obtained for a maternal test sample to the sequence information obtained from a set of qualified samples that were sequenced at a different time.
- an IPC can relate the sequence information obtained from a subject for particular segment(s) to the sequence obtained from a set of qualified samples (of similar sequences) that were sequenced at a different time.
- an IPC can relate the sequence information obtained from a subject for particular cancer-related loci to the sequence information obtained from a set of qualified samples (e.g., from a known amplification/deletion, and the like).
- IPCs can be used as markers to track sample(s) through the sequencing process.
- IPCs can also provide a qualitative positive sequence dose value, e.g., NCV, for one or more aneuploidies of chromosomes of interest, e.g., trisomy 21, trisomy 13, trisomy 18 to provide proper interpretation, and to ensure the dependability and accuracy of the data.
- IPCs can be created to comprise nucleic acids from male and female genomes to provide doses for chromosomes X and Y in a maternal sample to determine whether the fetus is male.
- the type and the number of in-process controls depends on the type or nature of the test needed.
- the in-process control can comprise DNA obtained from a sample known to comprise the same chromosomal aneuploidy that is being tested.
- the IPC includes DNA from a sample known to comprise an aneuploidy of a chromosome of interest.
- the IPC for a test to determine the presence or absence of a fetal trisomy, e.g., trisomy 21, in a maternal sample comprises DNA obtained from an individual with trisomy 21.
- the IPC comprises a mixture of DNA obtained from two or more individuals with different aneuploidies.
- the IPC comprises a combination of DNA samples obtained from pregnant women each carrying a fetus with one of the trisomies being tested.
- IPCs can be created to provide positive controls for tests to determine the presence or absence of partial aneuploidies.
- An IPC that serves as the control for detecting a single aneuploidy can be created using a mixture of cellular genomic DNA obtained from two subjects, one being the contributor of the aneuploid genome.
- an IPC that is created as a control for a test to determine a fetal trisomy, e.g., trisomy 21, can be created by combining genomic DNA from a male or female subject carrying the trisomic chromosome with genomic DNA from a female subject known not to carry the trisomic chromosome.
- Genomic DNA can be extracted from cells of both subjects, and sheared to provide fragments of between about 100-400 bp, between about 150-350 bp, or between about 200-300 bp to simulate the circulating cfDNA fragments in maternal samples.
- the proportion of fragmented DNA from the subject carrying the aneuploidy, e.g., trisomy 21, is chosen to simulate the proportion of circulating fetal cfDNA found in maternal samples to provide an IPC comprising a mixture of fragmented DNA comprising about 5%, about 10%, about 15%, about 20%, about 25%, or about 30% of DNA from the subject carrying the aneuploidy.
- the IPC can comprise DNA from different subjects each carrying a different aneuploidy.
- the IPC can comprise about 80% of the unaffected female DNA, and the remaining 20% can be DNA from three different subjects each carrying a trisomic chromosome 21, a trisomic chromosome 13, and a trisomic chromosome 18.
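The mixing arithmetic described above can be sketched as follows. This is a minimal illustration, not the patent's procedure: the equal split of the 20% among the three trisomic donors and the function names are assumptions, since the text does not say how the remaining share is divided. The relative dose formula follows from one extra chromosome copy being present in a fraction f of the mixed genomes.

```python
def ipc_composition(unaffected_fraction, aneuploid_donors):
    """Return the share of each DNA source in the IPC mixture.

    Assumption: the non-unaffected share is split equally among donors.
    """
    if not 0.0 <= unaffected_fraction <= 1.0:
        raise ValueError("unaffected_fraction must be between 0 and 1")
    if not aneuploid_donors:
        raise ValueError("at least one aneuploid donor is required")
    share = (1.0 - unaffected_fraction) / len(aneuploid_donors)
    composition = {"unaffected": unaffected_fraction}
    for donor in aneuploid_donors:
        composition[donor] = share
    return composition


def expected_relative_dose(trisomic_fraction):
    """Relative representation of a chromosome that is trisomic in a
    fraction f of the mixed genomes: (2 * (1 - f) + 3 * f) / 2 = 1 + f / 2.
    """
    return 1.0 + trisomic_fraction / 2.0


# 80% unaffected female DNA, 20% shared by three trisomic donors.
mix = ipc_composition(0.80, ["T21", "T13", "T18"])
```

Under the equal-split assumption each trisomic donor contributes about 6.7% of the mixture, so each affected chromosome would be over-represented by only about 3.3% relative to a euploid sample.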
- the mixture of fragmented DNA is prepared for sequencing. Processing of the mixture of fragmented DNA can comprise preparing a sequencing library, which can be sequenced using any massively parallel methods in singleplex or multiplex fashion. Stock solutions of the genomic IPC can be stored and used in multiple diagnostic tests.
- the IPC can be created using cfDNA obtained from a mother known to carry a fetus with a known chromosomal aneuploidy.
- cfDNA can be obtained from a pregnant woman carrying a fetus with trisomy 21.
- the cfDNA is extracted from the maternal sample, and cloned into a bacterial vector and grown in bacteria to provide an ongoing source of the IPC.
- the DNA can be extracted from the bacterial vector using restriction enzymes.
- the cloned cfDNA can be amplified by, e.g., PCR.
- the IPC DNA can be processed for sequencing in the same runs as the cfDNA from the test samples that are to be analyzed for the presence or absence of chromosomal aneuploidies.
- IPCs can be created to reflect other partial aneuploidies including for example, various segment amplification and/or deletions.
- various cancers are known to be associated with particular amplifications (e.g., breast cancer associated with 20q13).
- IPCs can be created that incorporate those known amplifications.
- sequencing technologies are available commercially, such as the sequencing-by-hybridization platform from Affymetrix Inc. (Sunnyvale, CA) and the sequencing-by-synthesis platforms from 454 Life Sciences (Branford, CT), Illumina/Solexa (Hayward, CA) and Helicos Biosciences (Cambridge, MA), and the sequencing-by-ligation platform from Applied Biosystems (Foster City, CA), as described below.
- other single molecule sequencing technologies include, but are not limited to, the SMRT™ technology of Pacific Biosciences, the ION TORRENT™ technology, and nanopore sequencing developed, for example, by Oxford Nanopore Technologies.
- Sanger sequencing including the automated Sanger sequencing, can also be employed in the methods described herein. Additional suitable sequencing methods include, but are not limited to nucleic acid imaging technologies, e.g., atomic force microscopy (AFM) or transmission electron microscopy (TEM). Illustrative sequencing technologies are described in greater detail below.
- the sequencing by synthesis platform by Illumina involves clustering fragments. Clustering is a process in which each fragment molecule is isothermally amplified.
- the fragment has two different adaptors attached to the two ends of the fragment, the adaptors allowing the fragment to hybridize with the two different oligos on the surface of a flow cell lane.
- the fragment further includes or is connected to two index sequences at two ends of the fragment, which index sequences provide labels to identify different samples in multiplex sequencing.
- a fragment to be sequenced is also referred to as an insert.
- a flow cell for clustering in the Illumina platform is a glass slide with lanes. Each lane is a glass channel coated with a lawn of two types of oligos. Hybridization is enabled by the first of the two types of oligos on the surface. This oligo is complementary to a first adapter on one end of the fragment. A polymerase creates a complementary strand of the hybridized fragment. The double-stranded molecule is denatured, and the original template strand is washed away. The remaining strand, in parallel with many other remaining strands, is clonally amplified through bridge amplification.
- sequencing starts with extending a first sequencing primer to generate the first read.
- fluorescently tagged nucleotides compete for addition to the growing chain. Only one is incorporated based on the sequence of the template.
- the cluster is excited by a light source, and a characteristic fluorescent signal is emitted. The number of cycles determines the length of the read. The emission wavelength and the signal intensity determine the base call. For a given cluster all identical strands are read simultaneously. Hundreds of millions of clusters are sequenced in a massively parallel manner. At the completion of the first read, the read product is washed away.
- index 1 primer is introduced and hybridized to an index 1 region on the template. Index regions provide identification of fragments, which is useful for de-multiplexing samples in a multiplex sequencing process.
- the index 1 read is generated similar to the first read. After completion of the index 1 read, the read product is washed away and the 3’ end of the strand is de-protected. The template strand then folds over and binds to a second oligo on the flow cell. An index 2 sequence is read in the same manner as index 1. Then an index 2 read product is washed off at the completion of the step.
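The role of the two index reads in de-multiplexing can be sketched as below. This is a simplified illustration under stated assumptions: the index sequences, sample names, and the `demultiplex` function are hypothetical, and only exact index matches are accepted, whereas production demultiplexers usually tolerate some mismatches.

```python
from collections import defaultdict


def demultiplex(reads, sample_sheet):
    """Group insert sequences by sample using the (index 1, index 2) pair.

    reads: iterable of (index1, index2, insert_sequence) tuples.
    sample_sheet: dict mapping (index1, index2) to a sample name.
    Reads whose index pair is not in the sheet go to "undetermined".
    """
    by_sample = defaultdict(list)
    for index1, index2, insert_sequence in reads:
        sample = sample_sheet.get((index1, index2), "undetermined")
        by_sample[sample].append(insert_sequence)
    return dict(by_sample)


# Hypothetical index sequences and reads, for illustration only.
sample_sheet = {
    ("ATCACG", "TTAGGC"): "sample_A",
    ("CGATGT", "TGACCA"): "sample_B",
}
reads = [
    ("ATCACG", "TTAGGC", "GATTACA"),
    ("CGATGT", "TGACCA", "CCGGTTA"),
    ("ATCACG", "TTAGGC", "TTGACCA"),
    ("GGGGGG", "TTAGGC", "ACGTACG"),
]
demuxed = demultiplex(reads, sample_sheet)
```

Using both indexes as the lookup key means a read is assigned to a sample only when the pair matches, which is the property that lets several samples share one flow cell lane.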
- the sequencing by synthesis example described above involves paired end reads, which are used in many of the embodiments of the disclosed methods.
- Paired end sequencing involves two reads from the two ends of a fragment. When a pair of reads is mapped to a reference sequence, the base-pair distance between the two reads can be determined, and that distance can then be used to determine the length of the fragment from which the reads were obtained. In some instances, a fragment straddling two bins has one of its paired-end reads aligned to one bin and the other aligned to an adjacent bin. This becomes rarer as the bins get longer or the reads get shorter. Various methods may be used to account for the bin membership of these fragments.
- they can be omitted in determining fragment size frequency of a bin; they can be counted for both of the adjacent bins; they can be assigned to the bin that encompasses the larger number of base pairs of the two bins; or they can be assigned to both bins with a weight related to portion of base pairs in each bin.
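The four options for fragments that straddle a bin boundary can be sketched as one function. This is an illustrative sketch, not the patent's implementation: the function name, the strategy labels, and the half-open coordinate convention are assumptions. The fragment is described by the reference span recovered from its mapped read pair.

```python
def bin_weights(start, end, bin_size, strategy="fractional"):
    """Return {bin_index: weight} for a fragment spanning [start, end).

    strategy (labels are hypothetical):
      "omit"       - drop fragments that straddle bins
      "both"       - count the fragment fully in every bin it touches
      "majority"   - assign it to the bin holding most of its base pairs
      "fractional" - weight each bin by its share of the fragment's base pairs
    """
    if end <= start:
        raise ValueError("end must be greater than start")
    first = start // bin_size
    last = (end - 1) // bin_size
    if first == last:
        # The fragment lies within a single bin; no ambiguity.
        return {first: 1.0}

    # Number of fragment base pairs falling in each touched bin.
    overlaps = {}
    for b in range(first, last + 1):
        lo = max(start, b * bin_size)
        hi = min(end, (b + 1) * bin_size)
        overlaps[b] = hi - lo

    if strategy == "omit":
        return {}
    if strategy == "both":
        return {b: 1.0 for b in overlaps}
    if strategy == "majority":
        # Ties go to the lower-numbered bin.
        best = max(overlaps, key=overlaps.get)
        return {best: 1.0}
    if strategy == "fractional":
        length = end - start
        return {b: n / length for b, n in overlaps.items()}
    raise ValueError(f"unknown strategy: {strategy}")


# A 170 bp fragment with 50 bp in bin 0 and 120 bp in bin 1 (1 kb bins).
fractional = bin_weights(950, 1120, 1000, "fractional")
majority = bin_weights(950, 1120, 1000, "majority")
```

The fractional strategy keeps the total weight of every fragment at 1, so summed bin counts still equal the number of fragments, which the "both" strategy does not guarantee.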
- a sub-fragment encompassing the biotin junction adaptors can then be obtained by further fragmenting the circularized molecule.
- the sub-fragment including the two ends of the original fragment in opposite sequence order can then be sequenced by the same procedure as for short-insert paired end sequencing described above.
- Further details of mate pair sequencing using an Illumina platform are shown in an online publication at the following URL, which is incorporated by reference in its entirety: res.illumina.com/documents/products/technotes/technote_nextera_matepair_data_processing. Additional information about paired end sequencing can be found in US Patent No. 7601499 and US Patent Publication No. 2012/0053063, which are incorporated by reference with regard to materials on paired end sequencing methods and apparatuses.
- sequence reads of predetermined length are mapped or aligned to a known reference genome.
- the mapped or aligned reads and their corresponding locations on the reference sequence are also referred to as tags.
- the reference genome sequence is GRCh37/hg19, which is available on the world wide web at genome dot ucsc dot edu/cgi-bin/hgGateway.
- Other sources of public sequence information include GenBank, dbEST, dbSTS, EMBL (the European Molecular Biology Laboratory), and the DDBJ (the DNA Databank of Japan).
- BLAST (Altschul et al., 1990)
- BLITZ (MPsrch)
- FASTA (Pearson & Lipman)
- BOWTIE (Langmead et al.)
- ELAND (Illumina)
- one end of the clonally expanded copies of the plasma cfDNA molecules is sequenced and processed by bioinformatics alignment analysis for the Illumina Genome Analyzer, which uses the Efficient Large-Scale Alignment of Nucleotide Databases (ELAND) software.
- a processor or group of processors for performing the methods described herein may be of various types including microcontrollers and microprocessors such as programmable devices (e.g., CPLDs and FPGAs) and non-programmable devices such as gate array ASICs or general purpose microprocessors.
- certain embodiments relate to tangible and/or non-transitory computer readable media or computer program products that include program instructions and/or data (including data structures) for performing various computer-implemented operations.
- Examples of computer-readable media include, but are not limited to, semiconductor memory devices, magnetic media such as disk drives, magnetic tape, optical media such as CDs, magneto-optical media, and hardware devices that are specially configured to store and perform program instructions, such as read-only memory devices (ROM) and random access memory (RAM).
- the computer readable media may be directly controlled by an end user or the media may be indirectly controlled by the end user. Examples of directly controlled media include the media located at a user facility and/or media that are not shared with other entities.
- Examples of indirectly controlled media include media that are indirectly accessible to the user via an external network and/or via a service providing shared resources such as the "cloud."
- Examples of program instructions include both machine code, such as produced by a compiler, and files containing higher level code that may be executed by the computer using an interpreter.
- the data or information employed in the disclosed methods and apparatus is provided in an electronic format.
- Such data or information may include reads and tags derived from a nucleic acid sample, counts or densities of such tags that align with particular regions of a reference sequence (e.g., that align to a chromosome or chromosome segment), reference sequences (including reference sequences providing solely or primarily polymorphisms), chromosome and segment doses, calls such as aneuploidy calls, normalized chromosome and segment values, pairs of chromosomes or segments and corresponding normalizing chromosomes or segments, counseling recommendations, diagnoses, and the like.
- data or other information provided in electronic format is available for storage on a machine and transmission between machines.
- data in electronic format is provided digitally and may be stored as bits and/or bytes in various data structures, lists, databases, etc.
- the data may be embodied electronically, optically, etc.
- One embodiment provides a computer program product for generating an output indicating the presence or absence of an aneuploidy, e.g., a fetal aneuploidy or cancer, in a test sample.
- the computer product may contain instructions for performing any one or more of the above-described methods for determining a chromosomal anomaly.
- the computer product may include a non-transitory and/or tangible computer readable medium having a computer executable or compilable logic (e.g., instructions) recorded thereon for enabling a processor to determine chromosome doses and, in some cases, whether a fetal aneuploidy is present or absent.
- sequence information from the sample under consideration may be mapped to chromosome reference sequences to identify a number of sequence tags for each of any one or more chromosomes of interest and to identify a number of sequence tags for a normalizing segment sequence for each of said any one or more chromosomes of interest.
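The tag-counting step above can be illustrated with a short sketch. It assumes a chromosome dose is the ratio of the tag count for the chromosome of interest to the summed tag count of its normalizing sequence; this passage does not define the dose, so treat that definition, the function name, and the tag counts as assumptions.

```python
def chromosome_dose(tag_counts, chromosome, normalizing):
    """Ratio of tags on the chromosome of interest to tags on its
    normalizing chromosome(s) or segment(s).

    tag_counts: dict mapping a sequence name to its number of mapped tags.
    normalizing: list of sequence names used as the denominator.
    """
    denominator = sum(tag_counts[name] for name in normalizing)
    if denominator == 0:
        raise ValueError("normalizing sequences have no mapped tags")
    return tag_counts[chromosome] / denominator


# Hypothetical tag counts, for illustration only.
counts = {"chr21": 1500, "chr9": 50000, "chr14": 25000}
dose_21 = chromosome_dose(counts, "chr21", ["chr9", "chr14"])
```

Dividing by a normalizing sequence from the same sample is meant to cancel run-to-run differences in total sequencing depth before doses are compared across samples.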
- the reference sequences are stored in a database such as a relational or object database, for example.
- mapping a single 30 bp read from a sample to any one of the human chromosomes might require years of effort without the assistance of a computational apparatus.
- reliable aneuploidy calls generally require mapping thousands (e.g., at least about 10,000) or even millions of reads to one or more chromosomes.
- the methods disclosed herein can be performed using a system for evaluation of copy number of a genetic sequence of interest in a test sample.
- the system comprising: (a) a sequencer for receiving nucleic acids from the test sample providing nucleic acid sequence information from the sample; (b) a processor; and (c) one or more computer-readable storage media having stored thereon instructions for execution on said processor to carry out a method for identifying any CNV, e.g., chromosomal or partial aneuploidies.
- the methods are instructed by a computer-readable medium having stored thereon computer-readable instructions for carrying out a method for identifying any CNV, e.g., chromosomal or partial aneuploidies.
- a computer program product comprising one or more computer-readable non-transitory storage media having stored thereon computer-executable instructions that, when executed by one or more processors of a computer system, cause the computer system to implement a method for evaluation of copy number of a sequence of interest in a test sample comprising fetal and maternal cell-free nucleic acids.
- the method includes: (a) receiving sequence reads obtained by sequencing the cell-free nucleic acid fragments in the test sample; (b) aligning the sequence reads of the cell-free nucleic acid fragments to a reference genome comprising the sequence of interest, thereby providing test sequence tags, wherein the reference genome is divided into a plurality of bins; (c) determining sizes of the cell-free nucleic acid fragments existing in the test sample; (d) weighting the test sequence tags based on the sizes of cell-free nucleic acid fragments from which the tags are obtained; (e) calculating coverages for the bins based on the weighted tags of (d); and (f) identifying a copy number variation in the sequence of interest from the calculated coverages.
- weighting the test sequence tags involves biasing the coverages toward test sequence tags obtained from cell-free nucleic acid fragments of a size or a size range characteristic of one genome in the test sample. In some implementations, weighting the test sequence tags involves assigning a value of 1 to tags obtained from cell-free nucleic acid fragments of the size or the size range, and assigning a value of 0 to other tags. In some implementations, the method further involves determining, in bins of the reference genome, including the sequence of interest, values of a fragment size parameter including a quantity of the cell-free nucleic acid fragments in the test sample having fragment sizes shorter or longer than a threshold value. Here, identifying the copy number variation in the sequence of interest involves using the values of the fragment size parameter as well as the coverages calculated in (e). In some implementations, the system is configured to evaluate copy number in the test sample using the various methods and processes discussed above.
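Steps (d) and (e), with the 1/0 weighting described above, can be sketched as follows. This is a minimal sketch under stated assumptions: the function name, the tag representation as (position, fragment size) pairs, the 1 kb bins, and the 80-150 bp size range are all hypothetical choices for illustration.

```python
from collections import Counter


def weighted_bin_coverage(tags, bin_size, size_range):
    """Sum size-based tag weights per bin.

    tags: iterable of (position, fragment_size) for aligned tags.
    size_range: inclusive (low, high) fragment sizes that receive weight 1;
                all other tags receive weight 0.
    Returns {bin_index: weighted coverage}.
    """
    low, high = size_range
    coverage = Counter()
    for position, fragment_size in tags:
        weight = 1 if low <= fragment_size <= high else 0
        coverage[position // bin_size] += weight
    return dict(coverage)


# Hypothetical tags: (reference position, fragment size in bp).
tags = [(100, 140), (250, 170), (1200, 120), (1900, 150), (2100, 200)]
coverage = weighted_bin_coverage(tags, bin_size=1000, size_range=(80, 150))
```

Bins that contain only out-of-range fragments are kept with a coverage of 0 rather than dropped, so downstream steps see every bin that received any tag.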
- the instructions may further include automatically recording information pertinent to the method such as chromosome doses and the presence or absence of a fetal chromosomal aneuploidy in a patient medical record for a human subject providing the maternal test sample.
- the patient medical record may be maintained by, for example, a laboratory, physician’s office, a hospital, a health maintenance organization, an insurance company, or a personal medical record website.
- the method may further involve prescribing, initiating, and/or altering treatment of a human subject from whom the maternal test sample was taken. This may involve performing one or more additional tests or analyses on additional samples taken from the subject.
- Disclosed methods can also be performed using a computer processing system which is adapted or configured to perform a method for identifying any CNV, e.g., chromosomal or partial aneuploidies.
- the apparatus comprises a sequencing device adapted or configured for sequencing at least a portion of the nucleic acid molecules in a sample to obtain the type of sequence information described elsewhere herein.
- the apparatus may also include components for processing the sample. Such components are described elsewhere herein.
- Sequence or other data can be input into a computer or stored on a computer readable medium either directly or indirectly.
- a computer system is directly coupled to a sequencing device that reads and/or analyzes sequences of nucleic acids from samples. Sequences or other information from such tools are provided via an interface in the computer system. Alternatively, the sequences processed by the system are provided from a sequence storage source such as a database or other repository.
- a memory device or mass storage device buffers or stores, at least temporarily, sequences of the nucleic acids.
- the memory device may store tag counts for various chromosomes or genomes, etc.
- the memory may also store various routines and/or programs for analyzing and presenting the sequence or mapped data. Such programs/routines may include programs for performing statistical analyses, etc.
- a user provides a sample into a sequencing apparatus.
- Data is collected and/or analyzed by the sequencing apparatus which is connected to a computer.
- Software on the computer allows for data collection and/or analysis.
- Data can be stored, displayed (via a monitor or other similar device), and/or sent to another location.
- the computer may be connected to the internet which is used to transmit data to a handheld device utilized by a remote user (e.g., a physician, scientist or analyst). It is understood that the data can be stored and/or analyzed prior to transmittal.
- raw data is collected and sent to a remote user or apparatus that will analyze and/or store the data. Transmittal can occur via the internet, but can also occur via satellite or other connection.
- data can be stored on a computer-readable medium and the medium can be shipped to an end user (e.g., via mail).
- the remote user can be in the same or a different geographical location including, but not limited to a building, city, state, country or continent.
- the methods also include collecting data regarding a plurality of polynucleotide sequences (e.g., reads, tags and/or reference chromosome sequences) and sending the data to a computer or other computational system.
- the computer can be connected to laboratory equipment, e.g., a sample collection apparatus, a nucleotide amplification apparatus, a nucleotide sequencing apparatus, or a hybridization apparatus.
- the computer can then collect applicable data gathered by the laboratory device.
- the data can be stored on a computer at any step, e.g., while collected in real time, prior to the sending, during or in conjunction with the sending, or following the sending.
- the data can be stored on a computer-readable medium that can be extracted from the computer.
- the data collected or stored can be transmitted from the computer to a remote location, e.g., via a local network or a wide area network such as the internet. At the remote location various operations can be performed on the transmitted data as described below.
- Tags obtained by aligning reads to a reference genome or other reference sequence or sequences
- These various types of data may be obtained, stored, transmitted, analyzed, and/or manipulated at one or more locations using distinct apparatus.
- the processing options span a wide spectrum. At one end of the spectrum, all or much of this information is stored and used at the location where the test sample is processed, e.g., a doctor’s office or other clinical setting. At the other extreme, the sample is obtained at one location, it is processed and optionally sequenced at a different location, reads are aligned and calls are made at one or more different locations, and diagnoses, recommendations, and/or plans are prepared at still another location (which may be a location where the sample was obtained).
- any one or more of these operations may be automated as described elsewhere herein.
- the sequencing and the analyzing of sequence data and deriving aneuploidy calls will be performed computationally.
- the other operations may be performed manually or automatically.
- Examples of locations where sample collection may be performed include health practitioners’ offices, clinics, patients’ homes (where a sample collection tool or kit is provided), and mobile health care vehicles.
- Examples of locations where sample processing prior to sequencing may be performed include health practitioners’ offices, clinics, patients’ homes (where a sample processing apparatus or kit is provided), mobile health care vehicles, and facilities of aneuploidy analysis providers.
- Examples of locations where sequencing may be performed include health practitioners’ offices, clinics, patients’ homes (where a sample sequencing apparatus and/or kit is provided), mobile health care vehicles, and facilities of aneuploidy analysis providers.
- the location where the sequencing takes place may be provided with a dedicated network connection for transmitting sequence data (typically reads) in an electronic format.
- a cluster or grid of computational resources collectively forms a super virtual computer composed of multiple processors or computers acting together to perform the analysis and/or derivation described herein.
- These technologies as well as more conventional supercomputers may be employed to process sequence data as described herein.
- Each is a form of parallel computing that relies on processors or computers.
- these processors (often whole computers) are connected by a network (private, public, or the Internet) using a conventional network protocol such as Ethernet.
- a supercomputer has many processors connected by a local high-speed computer bus.
- the diagnosis (e.g., the fetus has Down syndrome or the patient has a particular type of cancer) is generated at the same location as the analyzing operation. In other embodiments, it is performed at a different location. In some examples, reporting the diagnosis is performed at the location where the sample was taken, although this need not be the case. Examples of locations where the diagnosis can be generated or reported and/or where developing a plan is performed include health practitioners’ offices, clinics, internet sites accessible by computers, and handheld devices such as cell phones, tablets, smart phones, etc. having a wired or wireless connection to a network. Examples of locations where counseling is performed include health practitioners’ offices, clinics, internet sites accessible by computers, handheld devices, etc.
- FIG. 17 shows one implementation of a dispersed system for producing a call or diagnosis from a test sample.
- a sample collection location 01 is used for obtaining a test sample from a patient such as a pregnant female or a putative cancer patient.
- the sample is then provided to a processing and sequencing location 03 where the test sample may be processed and sequenced as described above.
- Location 03 includes apparatus for processing the sample as well as apparatus for sequencing the processed sample.
- the result of the sequencing is a collection of reads which are typically provided in an electronic format and provided to a network such as the Internet, which is indicated by reference number 05 in Figure 17.
- the sequence data is provided to a remote location 07 where analysis and call generation are performed.
- This location may include one or more powerful computational devices such as computers or processors.
- the call is relayed back to the network 05.
- an associated diagnosis is also generated.
- the call and/or diagnosis are then transmitted across the network and back to the sample collection location 01 as illustrated in Figure 17.
- this is simply one of many variations on how the various operations associated with generating a call or diagnosis may be divided among various locations.
- One common variant involves providing sample collection and processing and sequencing in a single location.
- Another variation involves providing processing and sequencing at the same location as analysis and call generation.
- Figure 18 elaborates on the options for performing various operations at distinct locations. In the most granular sense depicted in Figure 18, each of the following operations is performed at a separate location: sample collection, sample processing, sequencing, read alignment, calling, diagnosis, and reporting and/or plan development.
- One embodiment provides a system for use in determining the presence or absence of aneuploidies in a test sample comprising fetal and maternal nucleic acids, the system including a sequencer for receiving a nucleic acid sample and providing fetal and maternal nucleic acid sequence information from the sample; one or more processors configured to: (a) determine a value of fetal fraction of the test sample, wherein the fetal fraction of the test sample indicates the relative amount of fetal origin cell-free nucleic acid fragments in the test sample; (b) receive, by the computer system, sequence reads obtained by sequencing the cell-free nucleic acid fragments in the test sample; (c) align, by the computer system, the sequence reads of the cell-free nucleic acid fragments to a reference genome comprising a sequence of interest, thereby providing sequence tags; (d) determine, by the computer system, a coverage of the sequence tags for at least a portion of the reference genome; and (e) determine that the test sample is within an exclusion region based on
- the sequencer is configured to perform next generation sequencing (NGS). In some embodiments, the sequencer is configured to perform massively parallel sequencing using sequencing-by-synthesis with reversible dye terminators. In other embodiments, the sequencer is configured to perform sequencing-by-ligation. In yet other embodiments, the sequencer is configured to perform single molecule sequencing. In some embodiments of any of the systems provided herein, the one or more processors are programmed to perform various methods described above.
- Another aspect of the disclosure relates to a computer program product comprising a non-transitory machine readable medium storing program code that, when executed by one or more processors of a computer system, causes the computer system to: (a) determine a value of fetal fraction of the test sample, wherein the fetal fraction of the test sample indicates the relative amount of fetal origin cell-free nucleic acid fragments in the test sample; (b) receive, by the computer system, sequence reads obtained by sequencing the cell-free nucleic acid fragments in the test sample; (c) align, by the computer system, the sequence reads of the cell-free nucleic acid fragments to a reference genome comprising a sequence of interest, thereby providing sequence tags; (d) determine, by the computer system, a coverage of the sequence tags for at least a portion of the reference genome; and (e) determine that the test sample is within an exclusion region based on the coverage of sequence tags determined in (d) and the fetal fraction determined in (a), wherein the ex
- the computer program product comprises a non-transitory machine readable medium storing program code to be executed by the one or more processors to perform the various methods described above.
- Table 1 shows the Within-Lab Precision portion of the precision study design which was used for the LOD study.
- the Within-Lab Precision portion consisted of 2 Hamilton instruments, 2 NextSeq instruments, and three reagent lots for a total of 12 runs conducted on 6 different days.
- NTC No-Template Control
- Example 2 shows empirical data of various performance metrics of the conventional method labeled as NES and the fetal fraction LOD curve method described above.
- Figure 23 shows NES coverage as a function of observed fetal fraction.
- the two-step coverage threshold technique is used to exclude samples. Various samples falling in the exclusion regions can be seen in the figure. Most of the excluded samples have a fetal fraction between 0 and 20%, and their coverages are limited by the two levels of thresholds.
- Figure 24 shows data exclusion using an exclusion area that is defined by an LOD curve and a read threshold.
- the left panel has the read threshold at 2 million reads.
- the right panel shows the data using a coverage threshold of 1 million reads.
- Many excluded samples have a relatively high read coverage and low observed fetal fraction.
- many excluded samples have a relatively high fetal fraction and low coverage.
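An exclusion area bounded by an LOD curve and a read threshold can be sketched as below. This is a hedged illustration only: the shape of the LOD curve is not given in this passage, so the inverse-square-root form, the `scale` constant, and both function names are assumptions chosen to reflect the general idea that lower fetal fractions need more reads to be detectable.

```python
import math


def lod_fetal_fraction(reads, scale=20.0):
    """Hypothetical LOD curve: the minimum fetal fraction assumed to be
    detectable at a given read count. Not the curve from the figures."""
    return scale / math.sqrt(reads)


def is_excluded(fetal_fraction, reads, read_threshold=1_000_000,
                lod_curve=lod_fetal_fraction):
    """A sample is excluded if it has too few reads, or if its fetal
    fraction falls below the LOD curve at its read count."""
    if reads < read_threshold:
        return True
    return fetal_fraction < lod_curve(reads)


# With the assumed curve, 4 million reads give an LOD of 1% fetal fraction.
passes = is_excluded(0.03, 4_000_000)          # above the curve: kept
low_ff = is_excluded(0.005, 4_000_000)         # high coverage, low fetal fraction
low_reads = is_excluded(0.10, 500_000)         # high fetal fraction, low coverage
```

The two excluded cases mirror the two groups described above: samples with high coverage but low fetal fraction, and samples with high fetal fraction but low coverage.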
- Figure 25 shows the pass rates and failure rates for the first run and the second run for the prior method and the LOD QC method described above. The data show that both methods have similar pass rates and failure rates. The final failure rate is 0.42% for the prior method and 0.38% for the LOD QC method.
- Figure 26 shows, in area 2604, data that are excluded by the two-step thresholding and rescued by the fetal fraction LOD curve method.
- the samples that are excluded by the LOD QC method and rescued by the conventional method are shown in area 2602.
- the left panel of Figure 26 shows the data for a coverage threshold of 2 million reads combined with an LOD curve.
- the right panel shows the rescued data for a lower coverage threshold of 1 million reads.
- the LOD QC method excluded 61 additional samples, but when the coverage threshold is lowered to 1 million reads, it excludes 151 fewer samples.
- Figure 27 shows the samples that are rescued by the LOD QC method.
- The left panel shows in area 2702 the samples that are excluded by the conventional methods and rescued by the LOD QC method. After rerun, 80% of those samples fall within the inclusion area 2704, while 10% of the samples remain in the same area 2702 and 9.7% fall below the LOD curve and into the exclusion region.
- The right panel shows samples that are rescued by the 1 million coverage threshold applied with the LOD curve. Samples excluded by the conventional method that are rescued by the 1 million reads threshold are shown in area 2708. After rerunning the samples, 80% moved into the inclusion area 2710, 13.7% remained in the same area 2708, and 6.3% fell below the LOD curve into the exclusion area.
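The rerun breakdown described for Figure 27 can be tabulated by assigning each rerun sample to one of three regions and converting the counts to percentages. This is a sketch under assumptions: the region names and function names are hypothetical, and it assumes the inclusion area holds samples that pass both the conventional thresholds and the LOD QC check.

```python
from collections import Counter


def rerun_region(passes_conventional: bool, passes_lod_qc: bool) -> str:
    """Classify one rerun sample by the QC checks it passes."""
    if not passes_lod_qc:
        return "lod_exclusion_area"   # fell into the LOD QC exclusion region
    if passes_conventional:
        return "inclusion_area"       # passes both schemes
    return "rescued_area"             # still passes LOD QC only


def rerun_percentages(outcomes) -> dict:
    """Return the percentage of rerun samples in each region.

    outcomes is an iterable of (passes_conventional, passes_lod_qc) pairs.
    """
    counts = Counter(rerun_region(c, q) for c, q in outcomes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {region: 100.0 * n / total for region, n in counts.items()}
```

Keeping the classification separate from the tabulation lets the same region labels be reused for either panel, whichever read threshold produced the `passes_lod_qc` flag.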
- Figure 30 shows that the 75% confidence LOD curve rescued sample 3002, which is a trisomy 21 (T21) false positive. It also rescued sample 3004, which is a trisomy 18 (T18) false negative. This results in a 100% reduction in T21 false positives and a 50% reduction in T18 false negatives, illustrating that the LOD QC method can increase both sensitivity and specificity.
- Figure 31 shows that for simulated T21 samples, those that pass the LOD
Priority Applications (7)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202080005447.1A CN112823391B (zh) | 2019-06-03 | 2020-06-02 | Limit of detection based quality control metric |
| AU2020286376A AU2020286376A1 (en) | 2019-06-03 | 2020-06-02 | Limit of detection based quality control metric |
| EP20747200.2A EP3977459A1 (en) | 2019-06-03 | 2020-06-02 | Limit of detection based quality control metric |
| KR1020217009511A KR20220013349A (ko) | 2019-06-03 | 2020-06-02 | Limit of detection based quality control metric |
| JP2021517942A JP7506060B2 (ja) | 2019-06-03 | 2020-06-02 | Limit of detection based quality control metric |
| CA3115513A CA3115513A1 (en) | 2019-06-03 | 2020-06-02 | Limit of detection based quality control metric |
| US17/281,565 US12260935B2 (en) | 2019-06-03 | 2020-06-02 | Limit of detection based quality control metric |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US201962856651P | 2019-06-03 | 2019-06-03 | |
| US62/856,651 | 2019-06-03 | | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2020247411A1 (en) | 2020-12-10 |
Family
ID=71842782
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/US2020/035787 Ceased WO2020247411A1 (en) | 2019-06-03 | 2020-06-02 | Limit of detection based quality control metric |
Country Status (8)
| Country | Link |
|---|---|
| US (1) | US12260935B2 |
| EP (1) | EP3977459A1 |
| JP (1) | JP7506060B2 |
| KR (1) | KR20220013349A |
| CN (1) | CN112823391B |
| AU (1) | AU2020286376A1 |
| CA (1) | CA3115513A1 |
| WO (1) | WO2020247411A1 |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2022119812A1 (en) * | 2020-12-02 | 2022-06-09 | Illumina Software, Inc. | System and method for detection of genetic alterations |
Families Citing this family (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113643755B (zh) * | 2021-08-11 | 2023-10-13 | 上海小海龟科技有限公司 | Method, apparatus, computer device, and medium for correcting the positive rate of an NIPT kit |
| CN117095745B (zh) * | 2022-12-28 | 2025-12-12 | 安诺优达基因科技(北京)有限公司 | Method, apparatus, and application for detecting fetal aneuploidy and copy number variation in cell-free DNA from maternal plasma |
| SK289398B6 (sk) * | 2023-05-25 | 2026-01-14 | Medirex Group Academy N.O. | Method for detecting samples with an insufficient amount of fetal and circulating tumor DNA fragments for non-invasive genetic testing |
Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20080139801A1 (en) | 2006-10-10 | 2008-06-12 | Umansky Samuil R | Compositions, methods and kits for isolating nucleic acids from body fluids using anion exchange media |
| US7601499B2 (en) | 2005-06-06 | 2009-10-13 | 454 Life Sciences Corporation | Paired end sequencing |
| US20120053063A1 (en) | 2010-08-27 | 2012-03-01 | Illumina Cambridge Limited | Methods for sequencing polynucleotides |
| US20150005176A1 (en) | 2013-06-21 | 2015-01-01 | Sequenom, Inc. | Methods and processes for non-invasive assessment of genetic variations |
| US20170220735A1 (en) * | 2016-02-03 | 2017-08-03 | Verinata Health, Inc. | Using cell-free dna fragment size to determine copy number variations |
| US20170316150A1 (en) * | 2014-10-10 | 2017-11-02 | Sequenom, Inc. | Methods and processes for non-invasive assessment of genetic variations |
Family Cites Families (12)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN204440396U (zh) | 2012-04-12 | 2015-07-01 | 维里纳塔健康公司 | Kit for determining fetal fraction |
| WO2014151117A1 (en) * | 2013-03-15 | 2014-09-25 | The Board Of Trustees Of The Leland Stanford Junior University | Identification and use of circulating nucleic acid tumor markers |
| KR102665592B1 (ko) | 2013-05-24 | 2024-05-21 | 시쿼넘, 인코포레이티드 | Methods and processes for non-invasive assessment of genetic variations |
| CN105830077B (zh) * | 2013-10-21 | 2019-07-09 | 维里纳塔健康公司 | Method for improving the sensitivity of detection in determining copy number variations |
| CA2945962C (en) | 2014-04-21 | 2023-08-29 | Natera, Inc. | Detecting mutations and ploidy in chromosomal segments |
| AU2015367290A1 (en) | 2014-12-16 | 2017-05-11 | Garvan Institute Of Medical Research | Sequencing controls |
| HUE058263T2 (hu) | 2015-02-10 | 2022-07-28 | Univ Hong Kong Chinese | Detection of mutations for cancer screening and fetal analysis |
| WO2017093561A1 (en) | 2015-12-04 | 2017-06-08 | Genesupport Sa | Method for non-invasive prenatal testing |
| CA3016077A1 (en) | 2016-03-22 | 2017-09-28 | Counsyl, Inc. | Combinatorial dna screening |
| WO2018064486A1 (en) | 2016-09-29 | 2018-04-05 | Counsyl, Inc. | Noninvasive prenatal screening using dynamic iterative depth optimization |
| WO2019025004A1 (en) | 2017-08-04 | 2019-02-07 | Trisomytest, S.R.O. | METHOD FOR NON-INVASIVE PRENATAL DETECTION OF FETUS SEX CHROMOSOMAL ABNORMALITY AND FETUS SEX DETERMINATION FOR SINGLE PREGNANCY AND GEEMELLAR PREGNANCY |
| GB2615975B (en) * | 2019-02-14 | 2023-11-29 | Mirvie Inc | Methods and systems for determining a pregnancy-related state of a subject |
- 2020
  - 2020-06-02 JP JP2021517942A patent/JP7506060B2/ja active Active
  - 2020-06-02 KR KR1020217009511A patent/KR20220013349A/ko active Pending
  - 2020-06-02 EP EP20747200.2A patent/EP3977459A1/en not_active Withdrawn
  - 2020-06-02 AU AU2020286376A patent/AU2020286376A1/en not_active Abandoned
  - 2020-06-02 CN CN202080005447.1A patent/CN112823391B/zh active Active
  - 2020-06-02 US US17/281,565 patent/US12260935B2/en active Active
  - 2020-06-02 CA CA3115513A patent/CA3115513A1/en active Pending
  - 2020-06-02 WO PCT/US2020/035787 patent/WO2020247411A1/en not_active Ceased
Also Published As
| Publication number | Publication date |
|---|---|
| JP2022534634A (ja) | 2022-08-03 |
| KR20220013349A (ko) | 2022-02-04 |
| CA3115513A1 (en) | 2020-12-10 |
| AU2020286376A1 (en) | 2021-04-22 |
| CN112823391B (zh) | 2024-07-05 |
| US20210366569A1 (en) | 2021-11-25 |
| EP3977459A1 (en) | 2022-04-06 |
| CN112823391A (zh) | 2021-05-18 |
| US12260935B2 (en) | 2025-03-25 |
| JP7506060B2 (ja) | 2024-06-25 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US12087401B2 (en) | | Using cell-free DNA fragment size to detect tumor-associated variant |
| US12217827B2 (en) | | Detecting fetal sub-chromosomal aneuploidies |
| AU2018375008B2 (en) | | Methods and systems for determining somatic mutation clonality |
| US20200335178A1 (en) | | Detecting repeat expansions with short read sequencing data |
| EP3061021B1 (en) | | Method for improving the sensitivity of detection in determining copy number variations |
| JP2019153332A (ja) | | Method for determining copy number variations in sex chromosomes |
| US12260935B2 (en) | | Limit of detection based quality control metric |
| US20220170010A1 (en) | | System and method for detection of genetic alterations |
| HK40047016B (zh) | | Limit of detection based quality control metric |
| HK40047016A (en) | | Limit of detection based quality control metric |
| HK40033718A (en) | | Detecting, optionally fetal, sub-chromosomal aneuploidies and copy number variations |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 20747200; Country of ref document: EP; Kind code of ref document: A1 |
| | ENP | Entry into the national phase | Ref document number: 3115513; Country of ref document: CA |
| | ENP | Entry into the national phase | Ref document number: 2021517942; Country of ref document: JP; Kind code of ref document: A |
| | ENP | Entry into the national phase | Ref document number: 2020286376; Country of ref document: AU; Date of ref document: 20200602; Kind code of ref document: A |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | ENP | Entry into the national phase | Ref document number: 2020747200; Country of ref document: EP; Effective date: 20220103 |
| | WWW | Wipo information: withdrawn in national office | Ref document number: 2020747200; Country of ref document: EP |
| | WWG | Wipo information: grant in national office | Ref document number: 17281565; Country of ref document: US |