CN103374518B - Detection and classification of copy number variation - Google Patents

Detection and classification of copy number variation

Info

Publication number
CN103374518B
Authority
CN
China
Prior art keywords
chromosome
sequence
interest
segment
dose
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201210441134.8A
Other languages
Chinese (zh)
Other versions
CN103374518A (en)
Inventor
Richard P. Rava
Anupama Srinivasan
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Verinata Health Inc
Original Assignee
Verinata Health Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US13/445,778 (US9447453B2)
Priority claimed from US13/482,964 (US20120270739A1)
Priority claimed from US13/555,037 (US9260745B2)
Application filed by Verinata Health Inc
Priority to CN201810154581.2A (CN108485940B)
Priority to CN201710644858.5A (CN107435070A)
Publication of CN103374518A
Application granted
Publication of CN103374518B
Status: Active
Anticipated expiration

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6806: Preparing nucleic acids for analysis, e.g. for polymerase chain reaction [PCR] assay
    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12N: MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
    • C12N15/00: Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
    • C12N15/09: Recombinant DNA-technology
    • C12N15/10: Processes for the isolation, preparation or purification of DNA or RNA
    • C12N15/1096: Processes for the isolation, preparation or purification of DNA or RNA; cDNA synthesis; Subtracted cDNA library construction, e.g. RT, RT-PCR
    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869: Methods for sequencing
    • C: CHEMISTRY; METALLURGY
    • C40: COMBINATORIAL TECHNOLOGY
    • C40B: COMBINATORIAL CHEMISTRY; LIBRARIES, e.g. CHEMICAL LIBRARIES
    • C40B50/00: Methods of creating libraries, e.g. combinatorial synthesis
    • C40B50/06: Biochemical methods, e.g. using enzymes or whole viable microorganisms

Landscapes

  • Chemical & Material Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Organic Chemistry (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Wood Science & Technology (AREA)
  • Genetics & Genomics (AREA)
  • Zoology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Biochemistry (AREA)
  • Microbiology (AREA)
  • Biotechnology (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Analytical Chemistry (AREA)
  • Biophysics (AREA)
  • Chemical Kinetics & Catalysis (AREA)
  • Immunology (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Plant Pathology (AREA)
  • General Chemical & Material Sciences (AREA)
  • Medicinal Chemistry (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The maternal DNA background in a maternal sample places a practical limit on the sensitivity of any assay that attempts to distinguish the fetal genome from the maternal genome in the sample. Fetal fraction is therefore an important parameter to consider in diagnostic and routine tests that rely on quantitative and/or qualitative differences between the fetal and maternal genomes. The invention provides a method for determining the fetal fraction in a maternal sample. The method obtains the fetal fraction as a function of a normalized chromosome value or a normalized chromosome segment value. The method of the invention for determining fetal fraction can be combined with other methods, for example with methods that obtain the fetal fraction as a function of allelic imbalance information at polymorphic sites, to classify copy number variations of fetal chromosomes or chromosome segments in a maternal sample. The invention also provides apparatus and kits for carrying out the methods.

Description

Detection and classification of copy number variation
Background
One of the critical endeavors in human medical research is the discovery of genetic abnormalities that produce adverse health consequences. In many cases, specific genes and/or critical diagnostic markers have been identified in portions of the genome that are present at abnormal copy numbers. For example, in prenatal diagnosis, extra or missing copies of whole chromosomes are frequently occurring genetic lesions. In cancer, deletion or multiplication of copies of whole chromosomes or chromosome segments, and higher-level amplification of specific regions of the genome, are common occurrences.
Most information about copy number variation has been provided by cytogenetic resolution, which has permitted the recognition of structural abnormalities. Conventional procedures for genetic screening and biological dosimetry have relied on invasive procedures (e.g., amniocentesis) to obtain cells for karyotype analysis. Recognizing the need for more rapid testing methods that do not require cell culture, fluorescence in situ hybridization (FISH), quantitative fluorescence PCR (QF-PCR), and array comparative genomic hybridization (array-CGH) have been developed as molecular cytogenetic methods for analyzing copy number variations.
The advent of technologies that allow entire genomes to be sequenced in a relatively short time, and the discovery of circulating cell-free DNA (cfDNA), have provided the opportunity to compare genetic material originating from one chromosome with that of another without the risks associated with invasive sampling methods. However, the limitations of existing methods, which include insufficient sensitivity stemming from the limited levels of cfDNA and sequencing bias stemming from the inherent nature of genomic information, underlie the continuing need for non-invasive methods that provide any or all of specificity, sensitivity, and applicability to reliably determine copy number changes in a variety of clinical settings.
The embodiments disclosed herein fulfill some of the above needs and, in particular, offer the advantage of providing a reliable method that is applicable at least to the practice of non-invasive prenatal diagnostics and to the diagnosis and monitoring of metastatic progression in cancer patients.
Summary
The maternal DNA background in a maternal sample places a practical limit on the sensitivity of any assay that attempts to distinguish the fetal genome from the maternal genome in the sample. Fetal fraction is therefore an important parameter to consider in diagnostic and routine tests that rely on quantitative and/or qualitative differences between the fetal and maternal genomes. The present invention provides a method for determining the fetal fraction in a maternal sample. The method obtains the fetal fraction as a function of a normalized chromosome value or a normalized chromosome segment value. The method of the invention for determining fetal fraction may be combined with other methods, for example with methods in which the fetal fraction is obtained as a function of allelic imbalance information at polymorphic sites, to classify copy number variations of a fetal chromosome or chromosome segment in a maternal sample. The invention also provides apparatus and kits for carrying out the methods.
Methods are provided for determining copy number variation (CNV) of a sequence of interest in a test sample comprising a mixture of nucleic acids that are known or suspected to differ in the amount of one or more sequences of interest. The methods include a statistical approach that accounts for accrued variability stemming from process-related, interchromosomal, and inter-sequencing variability. The methods are applicable to determining the CNV of any fetal aneuploidy, as well as CNVs known or suspected to be associated with a variety of medical conditions. CNVs that can be determined according to the present methods include trisomies and monosomies of any one or more of chromosomes 1-22, X, and Y, other chromosomal polysomies, and deletions and/or duplications of segments of any one or more of these chromosomes, all of which can be detected by sequencing the nucleic acids of the test sample only once. Any aneuploidy can be determined from the sequencing information obtained from a single sequencing run on the nucleic acids of the test sample.
In one embodiment, a method is provided for determining the presence or absence of any four or more different complete fetal chromosomal aneuploidies in a maternal test sample comprising fetal and maternal nucleic acids. The method comprises the following steps: (a) obtaining sequence information for the fetal and maternal nucleic acids in the maternal test sample; (b) using the sequence information to identify a number of sequence tags for each of any four or more chromosomes of interest selected from chromosomes 1-22, X, and Y, and to identify a number of sequence tags for a normalizing chromosome sequence for each of said any four or more chromosomes of interest; (c) using the number of sequence tags identified for each of said any four or more chromosomes of interest and the number of sequence tags identified for each of said normalizing chromosome sequences to calculate a single chromosome dose for each of said any four or more chromosomes of interest; and (d) comparing each said single chromosome dose for each of said any four or more chromosomes of interest to a threshold value for each of said any four or more chromosomes of interest, thereby determining the presence or absence of any four or more different complete fetal chromosomal aneuploidies in the maternal test sample. Step (a) may comprise sequencing at least a portion of the nucleic acid molecules of the test sample to obtain said sequence information for the fetal and maternal nucleic acid molecules of the test sample. In some embodiments, step (c) comprises calculating a single chromosome dose for each of said chromosomes of interest as the ratio of the number of sequence tags identified for each of said chromosomes of interest to the number of sequence tags identified for said normalizing chromosome sequence for each of said chromosomes of interest.
In some other embodiments, step (c) comprises: (i) calculating a sequence tag density ratio for each of said chromosomes of interest by relating the number of sequence tags identified in step (b) for each of said chromosomes of interest to the length of each of said chromosomes of interest; (ii) calculating a sequence tag density ratio for each of said normalizing chromosome sequences by relating the number of sequence tags identified in step (b) for each of said normalizing chromosome sequences to the length of each of said normalizing chromosome sequences; and (iii) using the sequence tag density ratios calculated in steps (i) and (ii) to calculate a single chromosome dose for each of said chromosomes of interest, wherein said chromosome dose is calculated as the ratio of the sequence tag density ratio for each of said chromosomes of interest to the sequence tag density ratio for said normalizing chromosome sequence for each of said chromosomes of interest.
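The two ways of calculating a chromosome dose in step (c) can be illustrated with a minimal Python sketch. The function names, the tag counts, and the sequence lengths are hypothetical and serve only as an illustration; the sketch also assumes that "relating" a tag count to a sequence length means dividing the count by the length in base pairs.

```python
def chromosome_dose_from_counts(tags_interest: int, tags_normalizing: int) -> float:
    """Dose as the ratio of tag counts: chromosome of interest / normalizing sequence."""
    if tags_normalizing <= 0:
        raise ValueError("the normalizing sequence must have at least one tag")
    return tags_interest / tags_normalizing


def tag_density(tags: int, length_bp: int) -> float:
    """Sequence tag density, assumed here to be tags per base pair."""
    if length_bp <= 0:
        raise ValueError("sequence length must be positive")
    return tags / length_bp


def chromosome_dose_from_densities(
    tags_interest: int,
    length_interest: int,
    tags_normalizing: int,
    length_normalizing: int,
) -> float:
    """Dose as the ratio of tag densities, following steps (i) to (iii)."""
    density_interest = tag_density(tags_interest, length_interest)           # step (i)
    density_normalizing = tag_density(tags_normalizing, length_normalizing)  # step (ii)
    if density_normalizing == 0:
        raise ValueError("the normalizing sequence must have at least one tag")
    return density_interest / density_normalizing                            # step (iii)


# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
print(chromosome_dose_from_counts(20_000, 80_000))
print(chromosome_dose_from_densities(20_000, 50_000_000, 80_000, 100_000_000))
```

The count-based dose depends on the relative sizes of the two sequences, whereas the density-based dose corrects for sequence length before the ratio is taken.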
In another embodiment, a method is provided for determining the presence or absence of any four or more different complete fetal chromosomal aneuploidies in a maternal test sample comprising fetal and maternal nucleic acids. The method comprises the following steps: (a) obtaining sequence information for the fetal and maternal nucleic acids in the maternal test sample; (b) using the sequence information to identify a number of sequence tags for each of any four or more chromosomes of interest selected from chromosomes 1-22, X, and Y, and to identify a number of sequence tags for a normalizing chromosome sequence for each of said any four or more chromosomes of interest; (c) using the number of sequence tags identified for each of said any four or more chromosomes of interest and the number of sequence tags identified for each of said normalizing chromosome sequences to calculate a single chromosome dose for each of said any four or more chromosomes of interest; and (d) comparing each said single chromosome dose for each of said any four or more chromosomes of interest to a threshold value for each of said any four or more chromosomes of interest, thereby determining the presence or absence of any four or more different complete fetal chromosomal aneuploidies in the maternal test sample, wherein said any four or more chromosomes of interest selected from chromosomes 1-22, X, and Y comprise at least twenty chromosomes selected from chromosomes 1-22, X, and Y, and wherein the presence or absence of at least twenty different complete fetal chromosomal aneuploidies is determined. Step (a) may comprise sequencing at least a portion of the nucleic acid molecules of the test sample to obtain said sequence information for the fetal and maternal nucleic acid molecules of the test sample.
In some embodiments, step (c) comprises calculating a single chromosome dose for each of said chromosomes of interest as the ratio of the number of sequence tags identified for each of said chromosomes of interest to the number of sequence tags identified for said normalizing chromosome sequence for each of said chromosomes of interest. In some other embodiments, step (c) comprises: (i) calculating a sequence tag density ratio for each of said chromosomes of interest by relating the number of sequence tags identified in step (b) for each of said chromosomes of interest to the length of each of said chromosomes of interest; (ii) calculating a sequence tag density ratio for each of said normalizing chromosome sequences by relating the number of sequence tags identified in step (b) for each of said normalizing chromosome sequences to the length of each of said normalizing chromosome sequences; and (iii) using the sequence tag density ratios calculated in steps (i) and (ii) to calculate a single chromosome dose for each of said chromosomes of interest, wherein said chromosome dose is calculated as the ratio of the sequence tag density ratio for each of said chromosomes of interest to the sequence tag density ratio for said normalizing chromosome sequence for each of said chromosomes of interest.
In another embodiment, a method is provided for determining the presence or absence of any four or more different complete fetal chromosomal aneuploidies in a maternal test sample comprising fetal and maternal nucleic acids. The method comprises the following steps: (a) obtaining sequence information for the fetal and maternal nucleic acids in the maternal test sample; (b) using the sequence information to identify a number of sequence tags for each of any four or more chromosomes of interest selected from chromosomes 1-22, X, and Y, and to identify a number of sequence tags for a normalizing chromosome sequence for each of said any four or more chromosomes of interest; (c) using the number of sequence tags identified for each of said any four or more chromosomes of interest and the number of sequence tags identified for each of said normalizing chromosome sequences to calculate a single chromosome dose for each of said any four or more chromosomes of interest; and (d) comparing each said single chromosome dose for each of said any four or more chromosomes of interest to a threshold value for each of said any four or more chromosomes of interest, thereby determining the presence or absence of any four or more different complete fetal chromosomal aneuploidies in the sample, wherein said any four or more chromosomes of interest selected from chromosomes 1-22, X, and Y are all of chromosomes 1-22, X, and Y, and wherein the presence or absence of complete fetal chromosomal aneuploidies of all of chromosomes 1-22, X, and Y is determined. Step (a) may comprise sequencing at least a portion of the nucleic acid molecules of the test sample to obtain said sequence information for the fetal and maternal nucleic acid molecules of the test sample.
In some embodiments, step (c) comprises calculating a single chromosome dose for each of said chromosomes of interest as the ratio of the number of sequence tags identified for each of said chromosomes of interest to the number of sequence tags identified for said normalizing chromosome sequence for each of said chromosomes of interest. In some other embodiments, step (c) comprises: (i) calculating a sequence tag density ratio for each of said chromosomes of interest by relating the number of sequence tags identified in step (b) for each of said chromosomes of interest to the length of each of said chromosomes of interest; (ii) calculating a sequence tag density ratio for each of said normalizing chromosome sequences by relating the number of sequence tags identified in step (b) for each of said normalizing chromosome sequences to the length of each of said normalizing chromosome sequences; and (iii) using the sequence tag density ratios calculated in steps (i) and (ii) to calculate a single chromosome dose for each of said chromosomes of interest, wherein said chromosome dose is calculated as the ratio of the sequence tag density ratio for each of said chromosomes of interest to the sequence tag density ratio for said normalizing chromosome sequence for each of said chromosomes of interest.
In any of the above embodiments, the normalizing chromosome sequence may be a single chromosome selected from chromosomes 1-22, X, and Y. Alternatively, the normalizing chromosome sequence may be a group of chromosomes selected from chromosomes 1-22, X, and Y.
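When the normalizing sequence is a set of chromosomes rather than a single chromosome, one plausible reading is that the tags mapped to the members of the set are pooled before the ratio is taken. The Python sketch below illustrates that reading only; pooling by summation, the exclusion of the chromosome of interest from its own normalizing set, the function name, and the tag counts are all assumptions made for illustration.

```python
from typing import Iterable, Mapping


def dose_with_normalizing_set(
    tag_counts: Mapping[str, int],
    interest: str,
    normalizers: Iterable[str],
) -> float:
    """Chromosome dose against a set of normalizing chromosomes.

    Assumption: tags of the normalizing chromosomes are pooled by summation.
    """
    normalizer_list = list(normalizers)
    if interest in normalizer_list:
        # Assumed design choice: a chromosome does not normalize itself.
        raise ValueError("chromosome of interest must not be in its normalizing set")
    pooled_tags = sum(tag_counts[name] for name in normalizer_list)
    if pooled_tags <= 0:
        raise ValueError("the normalizing set must contain at least one tag")
    return tag_counts[interest] / pooled_tags


# Hypothetical tag counts per chromosome.
counts = {"chr21": 1_000, "chr9": 3_000, "chr11": 2_000}
print(dose_with_normalizing_set(counts, "chr21", ["chr9"]))           # single chromosome
print(dose_with_normalizing_set(counts, "chr21", ["chr9", "chr11"]))  # set of chromosomes
```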
In another embodiment, a method is provided for determining the presence or absence of any one or more different complete fetal chromosomal aneuploidies in a maternal test sample comprising fetal and maternal nucleic acids. The method comprises the following steps: (a) obtaining sequence information for the fetal and maternal nucleic acids in the sample; (b) using the sequence information to identify a number of sequence tags for each of any one or more chromosomes of interest selected from chromosomes 1-22, X, and Y, and to identify a number of sequence tags for a normalizing segment sequence for each of said any one or more chromosomes of interest; (c) using the number of sequence tags identified for each of said any one or more chromosomes of interest and the number of sequence tags identified for each of said normalizing segment sequences to calculate a single chromosome dose for each of said any one or more chromosomes of interest; and (d) comparing each said single chromosome dose for each of said any one or more chromosomes of interest to a threshold value for each of said any one or more chromosomes of interest, thereby determining the presence or absence of any one or more different complete fetal chromosomal aneuploidies in said sample. Step (a) may comprise sequencing at least a portion of the nucleic acid molecules of the test sample to obtain said sequence information for the fetal and maternal nucleic acid molecules of the test sample.
In some embodiments, step (c) comprises calculating a single chromosome dose for each of said chromosomes of interest as the ratio of the number of sequence tags identified for each of said chromosomes of interest to the number of sequence tags identified for said normalizing segment sequence for each of said chromosomes of interest. In some other embodiments, step (c) comprises: (i) calculating a sequence tag density ratio for each of said chromosomes of interest by relating the number of sequence tags identified in step (b) for each of said chromosomes of interest to the length of each of said chromosomes of interest; (ii) calculating a sequence tag density ratio for each of said normalizing segment sequences by relating the number of sequence tags identified in step (b) for each of said normalizing segment sequences to the length of each of said normalizing segment sequences; and (iii) using the sequence tag density ratios calculated in steps (i) and (ii) to calculate a single chromosome dose for each of said chromosomes of interest, wherein said chromosome dose is calculated as the ratio of the sequence tag density ratio for each of said chromosomes of interest to the sequence tag density ratio for said normalizing segment sequence for each of said chromosomes of interest.
In another embodiment, a method is provided for determining the presence or absence of any one or more different complete fetal chromosomal aneuploidies in a maternal test sample comprising fetal and maternal nucleic acids. The method comprises the following steps: (a) obtaining sequence information for the fetal and maternal nucleic acids in the sample; (b) using the sequence information to identify a number of sequence tags for each of any one or more chromosomes of interest selected from chromosomes 1-22, X, and Y, and to identify a number of sequence tags for a normalizing segment sequence for each of said any one or more chromosomes of interest; (c) using the number of sequence tags identified for each of said any one or more chromosomes of interest and the number of sequence tags identified for each of said normalizing segment sequences to calculate a single chromosome dose for each of said any one or more chromosomes of interest; and (d) comparing each said single chromosome dose for each of said any one or more chromosomes of interest to a threshold value for each of said any one or more chromosomes of interest, thereby determining the presence or absence of one or more different complete fetal chromosomal aneuploidies in said sample, wherein said any one or more chromosomes of interest selected from chromosomes 1-22, X, and Y comprise at least twenty chromosomes selected from chromosomes 1-22, X, and Y, and wherein the presence or absence of at least twenty different complete fetal chromosomal aneuploidies is determined. Step (a) may comprise sequencing at least a portion of the nucleic acid molecules of the test sample to obtain said sequence information for the fetal and maternal nucleic acid molecules of the test sample.
In some embodiments, step (c) comprises calculating a single chromosome dose for each of said chromosomes of interest as the ratio of the number of sequence tags identified for each of said chromosomes of interest to the number of sequence tags identified for said normalizing segment sequence for each of said chromosomes of interest. In some other embodiments, step (c) comprises: (i) calculating a sequence tag density ratio for each of said chromosomes of interest by relating the number of sequence tags identified in step (b) for each of said chromosomes of interest to the length of each of said chromosomes of interest; (ii) calculating a sequence tag density ratio for each of said normalizing segment sequences by relating the number of sequence tags identified in step (b) for each of said normalizing segment sequences to the length of each of said normalizing segment sequences; and (iii) using the sequence tag density ratios calculated in steps (i) and (ii) to calculate a single chromosome dose for each of said chromosomes of interest, wherein said chromosome dose is calculated as the ratio of the sequence tag density ratio for each of said chromosomes of interest to the sequence tag density ratio for said normalizing segment sequence for each of said chromosomes of interest.
In another embodiment, a method is provided for determining the presence or absence of any one or more different complete fetal chromosomal aneuploidies in a maternal test sample comprising fetal and maternal nucleic acids. The method comprises the following steps: (a) obtaining sequence information for the fetal and maternal nucleic acids in the sample; (b) using the sequence information to identify a number of sequence tags for each of any one or more chromosomes of interest selected from chromosomes 1-22, X, and Y, and to identify a number of sequence tags for a normalizing segment sequence for each of said any one or more chromosomes of interest; (c) using the number of sequence tags identified for each of said any one or more chromosomes of interest and the number of sequence tags identified for each of said normalizing segment sequences to calculate a single chromosome dose for each of said any one or more chromosomes of interest; and (d) comparing each said single chromosome dose for each of said any one or more chromosomes of interest to a threshold value for each of said any one or more chromosomes of interest, thereby determining the presence or absence of one or more different complete fetal chromosomal aneuploidies in said sample, wherein said any one or more chromosomes of interest selected from chromosomes 1-22, X, and Y are all of chromosomes 1-22, X, and Y, and wherein the presence or absence of complete fetal chromosomal aneuploidies of all of chromosomes 1-22, X, and Y is determined. Step (a) may comprise sequencing at least a portion of the nucleic acid molecules of the test sample to obtain said sequence information for the fetal and maternal nucleic acid molecules of the test sample.
In some embodiments, step (c) comprises calculating a single chromosome dose for each of said chromosomes of interest as the ratio of the number of sequence tags identified for each of said chromosomes of interest to the number of sequence tags identified for said normalizing segment sequence for each of said chromosomes of interest. In some other embodiments, step (c) comprises: (i) calculating a sequence tag density ratio for each of said chromosomes of interest by relating the number of sequence tags identified in step (b) for each of said chromosomes of interest to the length of each of said chromosomes of interest; (ii) calculating a sequence tag density ratio for each of said normalizing segment sequences by relating the number of sequence tags identified in step (b) for each of said normalizing segment sequences to the length of each of said normalizing segment sequences; and (iii) using the sequence tag density ratios calculated in steps (i) and (ii) to calculate a single chromosome dose for each of said chromosomes of interest, wherein said chromosome dose is calculated as the ratio of the sequence tag density ratio for each of said chromosomes of interest to the sequence tag density ratio for said normalizing segment sequence for each of said chromosomes of interest.
In any of the above embodiments, the different complete chromosomal aneuploidies are selected from complete chromosomal trisomies, complete chromosomal monosomies, and complete chromosomal polysomies. The different complete chromosomal aneuploidies are selected from complete aneuploidies of any one of chromosomes 1-22, X, and Y. For example, the different complete fetal chromosomal aneuploidies are selected from trisomy 2, trisomy 8, trisomy 9, trisomy 20, trisomy 21, trisomy 13, trisomy 16, trisomy 18, trisomy 22, 47,XXX, 47,XYY, and monosomy X.
In any of the above embodiments, steps (a)-(d) are repeated for test samples from different maternal subjects, and the method comprises determining the presence or absence of any four or more different complete fetal chromosomal aneuploidies in each of the test samples.
In any of the above embodiments, the method may further comprise calculating a normalized chromosome value (NCV), wherein said NCV relates said chromosome dose to the mean of the corresponding chromosome dose in a set of qualifying samples as:
$$NCV_{ij} = \frac{x_{ij} - \hat{\mu}_j}{\hat{\sigma}_j}$$
where $\hat{\mu}_j$ and $\hat{\sigma}_j$ are the estimated mean and standard deviation, respectively, of the j-th chromosome dose in a set of qualifying samples, and $x_{ij}$ is the observed j-th chromosome dose for test sample i.
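The NCV is the test sample's chromosome dose expressed as a number of standard deviations from the mean dose of the qualifying samples. The Python sketch below is a minimal illustration; the function name and the dose values are hypothetical, and the use of the sample standard deviation (n - 1 denominator) as the estimate is an assumption, since the text only calls for an estimated standard deviation.

```python
from statistics import mean, stdev
from typing import Sequence


def normalized_chromosome_value(test_dose: float, qualifying_doses: Sequence[float]) -> float:
    """NCV = (x_ij - mean_j) / sd_j for one chromosome j and one test sample i."""
    if len(qualifying_doses) < 2:
        raise ValueError("at least two qualifying samples are needed to estimate a spread")
    mu_hat = mean(qualifying_doses)      # estimated mean of the j-th chromosome dose
    sigma_hat = stdev(qualifying_doses)  # assumed estimator: sample standard deviation
    if sigma_hat == 0:
        raise ValueError("qualifying doses have zero variance")
    return (test_dose - mu_hat) / sigma_hat


# Hypothetical doses for one chromosome in three qualifying samples and one test sample.
qualifying = [0.248, 0.250, 0.252]
print(normalized_chromosome_value(0.256, qualifying))  # about 3 standard deviations above the mean
```

Because the NCV is unitless, the same kind of threshold can be applied to chromosomes whose raw doses differ widely in magnitude.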
In another embodiment, a method is provided for determining the presence or absence of different partial fetal chromosomal aneuploidies in a maternal test sample comprising fetal and maternal nucleic acids. The method comprises the following steps: (a) obtaining sequence information for the fetal and maternal nucleic acids in the sample; (b) using the sequence information to identify a number of sequence tags for each of any one or more segments of any one or more chromosomes of interest selected from chromosomes 1-22, X, and Y, and to identify a number of sequence tags for a normalizing segment sequence for each of said any one or more segments of any one or more chromosomes of interest; (c) using the number of sequence tags identified for each of said any one or more segments of any one or more chromosomes of interest and the number of sequence tags identified for each of said normalizing segment sequences to calculate a single segment dose for each of said any one or more segments of any one or more chromosomes of interest; and (d) comparing each said single segment dose for each of said any one or more segments of any one or more chromosomes of interest to a threshold value for each of said any one or more segments of any one or more chromosomes of interest, thereby determining the presence or absence of one or more different partial fetal chromosomal aneuploidies in said sample. Step (a) may comprise sequencing at least a portion of the nucleic acid molecules of the test sample to obtain said sequence information for the fetal and maternal nucleic acid molecules of the test sample.
In some embodiments, step (c) comprises calculating a single segment dose for each of any one or more segments of any one or more chromosomes of interest as a ratio of the number of the sequence tags identified for each of the any one or more segments of any one or more chromosomes of interest to the number of the sequence tags identified for each of the normalized segment sequences of any one or more segments of any one or more chromosomes of interest. In some other embodiments, step (c) comprises: (i) Calculating a sequence tag density ratio for each said segment of interest by correlating the number of such sequence tags identified in step (b) for each said segment of interest with the length of each said segment of interest; (ii) Calculating a sequence tag density ratio for each said normalized segment sequence by correlating the number of such sequence tags identified in step (b) for each said normalized segment sequence with the length of each said normalized segment sequence; and (iii) calculating a single chromosome dose for each said segment of interest using the sequence tag density ratios calculated in steps (i) and (ii), wherein said segment dose is calculated as the ratio of the sequence tag density ratio for each said segment of interest to the sequence tag density ratio of said normalized segment sequence for each said segment of interest. The method may further comprise calculating a Normalized Segment Value (NSV), wherein said NSV correlates said segment dose to an average of corresponding segment doses in a set of qualifying samples as:
$$NSV_{ij} = \frac{x_{ij} - \hat{\mu}_j}{\hat{\sigma}_j}$$
where $\hat{\mu}_j$ and $\hat{\sigma}_j$ are, respectively, the estimated mean and standard deviation for the jth segment dose in a set of qualifying samples, and $x_{ij}$ is the jth segment dose observed for test sample i.
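The density-based variant of step (c) can be sketched as follows: each tag count is first divided by the length of the sequence it maps to, and the segment dose is the ratio of the two densities. The function names and the counts and lengths below are hypothetical values chosen only for illustration.

```python
def tag_density(tag_count, length_bp):
    """Sequence tag density ratio: tags per base pair of the sequence."""
    if length_bp <= 0:
        raise ValueError("sequence length must be positive")
    return tag_count / length_bp


def segment_dose(seg_tags, seg_len_bp, norm_tags, norm_len_bp):
    """Segment dose as density(segment of interest) / density(normalizing segment)."""
    norm_density = tag_density(norm_tags, norm_len_bp)
    if norm_density == 0:
        raise ValueError("normalizing segment has no tags")
    return tag_density(seg_tags, seg_len_bp) / norm_density


# Hypothetical example: a 3 Mb segment of interest and a 10 Mb normalizing segment
# with the same tag density, and a second case with 10% fewer tags on the segment.
balanced = segment_dose(12_000, 3_000_000, 40_000, 10_000_000)
reduced = segment_dose(10_800, 3_000_000, 40_000, 10_000_000)
```

Dividing by length keeps doses comparable when the segment of interest and the normalizing segment differ in size.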
In embodiments of the described method in which chromosome doses or segment doses are determined using a normalizing segment sequence, the normalizing segment sequence can be a single segment of any one or more of chromosomes 1-22, X, and Y. Alternatively, the normalizing segment sequence may be a group of segments of any one or more of chromosomes 1-22, X, and Y.
Steps (a)-(d) of the method for determining the presence or absence of a partial fetal chromosomal aneuploidy may be repeated for a plurality of test samples from different maternal subjects, in which case the method comprises determining the presence or absence of different partial fetal chromosomal aneuploidies in each of said samples. The partial fetal chromosomal aneuploidies that can be determined according to the method include partial aneuploidies of any segment of any chromosome. These partial aneuploidies may be selected from the group consisting of partial duplication, partial multiplication, partial insertion, and partial deletion. Examples of partial aneuploidies that can be determined according to this method include partial monosomy of chromosome 1, partial monosomy of chromosome 4, partial monosomy of chromosome 5, partial monosomy of chromosome 7, partial monosomy of chromosome 11, partial monosomy of chromosome 15, partial monosomy of chromosome 17, partial monosomy of chromosome 18, and partial monosomy of chromosome 22.
In any of the above embodiments, the test sample may be a maternal sample selected from the group consisting of blood, plasma, serum, urine and saliva samples. In any of these embodiments, the test sample may be a plasma sample. These nucleic acid molecules of the maternal sample are fetal and maternal cell-free DNA molecules. Next Generation Sequencing (NGS) can be used to sequence these nucleic acids. In some embodiments, sequencing is massively parallel sequencing using sequencing by synthesis with reversible dye terminators. In other embodiments, the sequencing is ligation sequencing. In still other embodiments, the sequencing is single molecule sequencing. Optionally, an amplification step is performed prior to sequencing.
In another embodiment, a method is provided for determining the presence or absence of any twenty or more different complete fetal chromosomal aneuploidies in a maternal plasma test sample comprising a mixture of fetal and maternal cell-free DNA molecules. The method comprises the following steps: (a) Sequencing at least a portion of the cell-free DNA molecules to obtain sequence information for the fetal and maternal cell-free DNA molecules in the sample; (b) Using the sequence information to identify a number of sequence tags for each of any twenty or more chromosomes of interest selected from chromosomes 1-22, X, and Y, and to identify a number of sequence tags for a normalizing chromosome for each of the twenty or more chromosomes of interest; (c) Calculating a single chromosome dose for each of the twenty or more chromosomes of interest using the number of sequence tags identified for that chromosome and the number of sequence tags identified for its normalizing chromosome; and (d) comparing each said single chromosome dose with a threshold value for the corresponding chromosome of interest, and thereby determining the presence or absence of any twenty or more different complete fetal chromosomal aneuploidies in said sample.
In another embodiment, the invention provides a method for identifying Copy Number Variations (CNVs) of a sequence of interest (e.g., a clinically relevant sequence) in a test sample, the method comprising the steps of: (a) Obtaining a test sample and a plurality of qualified samples, said test sample comprising a test nucleic acid molecule and said plurality of qualified samples comprising a qualified nucleic acid molecule; (b) Obtaining sequence information of the fetal and maternal nucleic acids in the sample; (c) Calculating a qualified sequence dose for the qualified sequence of interest in each of the plurality of qualified samples based on the sequencing of the qualified nucleic acid molecules, wherein the calculating a qualified sequence dose comprises determining parameters for the qualified sequence of interest and at least one qualified normalized sequence; (d) Identifying at least one qualified normalized sequence based on the qualified sequence doses, wherein the at least one qualified normalized sequence has a minimum variability and/or a maximum resolvability in the plurality of qualified samples; (e) Calculating a test sequence dose for the test sequence of interest based on the sequencing of the nucleic acid molecule in the test sample, wherein the calculating a test sequence dose comprises determining parameters for the test sequence of interest and at least one normalized test sequence corresponding to the at least one qualified normalized sequence; (f) Comparing the test sequence dose to at least one threshold; and (g) assessing the copy number variation of the sequence of interest in the test sample based on the results of step (f). 
In one embodiment, the parameter for the qualified sequence of interest and the at least one qualified normalized sequence relates the number of sequence tags mapped to the qualified sequence of interest to the number of tags mapped to the qualified normalized sequence, and the parameter for the test sequence of interest and the at least one normalized test sequence relates the number of sequence tags mapped to the test sequence of interest to the number of tags mapped to the normalized test sequence. In some embodiments, step (b) comprises sequencing at least a portion of the qualified and test nucleic acid molecules, wherein the sequencing provides a plurality of mapped sequence tags for the test and qualified sequences of interest and for the at least one test and at least one qualified normalized sequence; that is, sequencing at least a portion of the nucleic acid molecules of the test sample to obtain sequence information for the fetal and maternal nucleic acid molecules of the test sample. In some embodiments, a next generation sequencing method is used to perform this sequencing step. In some embodiments, the sequencing method can be a massively parallel sequencing method that uses sequencing by synthesis with reversible dye terminators. In other embodiments, the sequencing method is ligation sequencing. In some embodiments, sequencing comprises an amplification. In other embodiments, the sequencing is single molecule sequencing. The CNV of the sequence of interest is an aneuploidy, which may be a chromosomal or a partial aneuploidy. In some embodiments, the chromosomal aneuploidy is selected from trisomy 2, trisomy 8, trisomy 9, trisomy 20, trisomy 16, trisomy 21, trisomy 13, trisomy 18, trisomy 22, Klinefelter's syndrome, 47,XXX, 47,XYY, and monosomy X. In other embodiments, the partial aneuploidy is a partial chromosomal deletion or a partial chromosomal insertion.
In some embodiments, the CNV identified by the method is a chromosomal or partial aneuploidy associated with cancer. In some embodiments, these tested and qualified samples are biological fluid samples, such as: a plasma sample obtained from a pregnant subject (e.g., a pregnant human subject). In other embodiments, the tested and qualified biological fluid sample (e.g., a plasma sample) is obtained from a subject known or suspected to have cancer.
Certain methods for determining the presence or absence of a fetal chromosomal aneuploidy in a maternal test sample may comprise the following operations: (a) Providing sequence reads from fetal and maternal nucleic acids in the maternal test sample, wherein the sequence reads are provided in electronic format; (b) Aligning, using a computing device, the sequence reads to one or more chromosomal reference sequences, and thereby providing sequence tags corresponding to the sequence reads; (c) Computationally identifying a number of the sequence tags from each of one or more chromosomes or chromosome segments of interest, and computationally identifying a number of the sequence tags from at least one normalizing chromosome sequence or normalizing chromosome segment sequence for each of the chromosomes or chromosome segments of interest; (d) Computationally calculating a single chromosome or segment dose for each of the one or more chromosomes or chromosome segments of interest using the number of sequence tags identified for that chromosome or chromosome segment and the number of sequence tags identified for its normalizing chromosome sequence or normalizing chromosome segment sequence; and (e) comparing, using the computing device, each of the single chromosome or segment doses for the one or more chromosomes or chromosome segments of interest to a corresponding threshold value for that chromosome or chromosome segment, and thereby determining the presence or absence of at least one fetal aneuploidy in the test sample. In certain implementations, the number of sequence tags identified for each of the chromosome(s) or chromosome segment(s) of interest is at least about 10,000 or at least about 100,000.
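Operations (d) and (e) can be sketched as follows, starting from tag counts that have already been produced by alignment. The two-sided (low, high) threshold pair is an assumption made for illustration, since the text only refers to "a respective threshold"; the function names, tag counts, normalizer assignments, and threshold values are likewise hypothetical.

```python
def chromosome_dose(tags_of_interest, tags_normalizing):
    """Dose = tag count for the sequence of interest / tag count for its normalizer."""
    if tags_normalizing <= 0:
        raise ValueError("normalizing tag count must be positive")
    return tags_of_interest / tags_normalizing


def compare_to_thresholds(doses, thresholds):
    """Label each dose against a (low, high) threshold pair (assumed two-sided)."""
    calls = {}
    for name, dose in doses.items():
        low, high = thresholds[name]
        if dose > high:
            calls[name] = "gain"
        elif dose < low:
            calls[name] = "loss"
        else:
            calls[name] = "unaffected"
    return calls


# Hypothetical tag counts, normalizer assignments, and thresholds.
tag_counts = {"chr21": 15_500, "chr18": 30_000, "chr9": 100_000, "chr8": 100_000}
normalizer_for = {"chr21": "chr9", "chr18": "chr8"}
thresholds = {"chr21": (0.145, 0.152), "chr18": (0.29, 0.31)}

doses = {c: chromosome_dose(tag_counts[c], tag_counts[n]) for c, n in normalizer_for.items()}
calls = compare_to_thresholds(doses, thresholds)
```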
The disclosed embodiments also provide a computer program product comprising a non-transitory computer readable medium on which program instructions for performing the operations and other computing operations described herein are provided.
In certain embodiments, the chromosomal reference sequence has a plurality of excluded regions that are naturally present in the chromosome but that do not contribute to the number of sequence tags for any chromosome or chromosome segment. In certain embodiments, a method additionally comprises: (i) Determining whether a read under consideration aligns to a site on a chromosomal reference sequence where another read from the test sample previously aligned; and (ii) determining whether to include the read under consideration in the number of sequence tags for a chromosome or chromosome segment of interest. The chromosomal reference sequence may be stored on a computer-readable medium.
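A minimal sketch of tag counting with excluded regions and a same-site check is given below. The text does not state what decision is made for a read that aligns where an earlier read already aligned; dropping such reads by default is an assumption here, as are the function name `count_tags`, the half-open interval convention, and the example coordinates.

```python
def count_tags(alignments, excluded_regions, keep_duplicates=False):
    """Count tags per chromosome.

    alignments: iterable of (chromosome, position) pairs, one per aligned read.
    excluded_regions: {chromosome: [(start, end), ...]} with half-open intervals.
    keep_duplicates: if False, a read at an already-seen site is not counted (assumption).
    """
    seen_sites = set()
    counts = {}
    for chrom, pos in alignments:
        # Reads inside an excluded region never contribute to the tag count.
        if any(start <= pos < end for start, end in excluded_regions.get(chrom, [])):
            continue
        site = (chrom, pos)
        if site in seen_sites and not keep_duplicates:
            continue
        seen_sites.add(site)
        counts[chrom] = counts.get(chrom, 0) + 1
    return counts


# Hypothetical alignments: one duplicate site and one read in an excluded region.
alignments = [("chr1", 100), ("chr1", 100), ("chr1", 250), ("chr1", 5000), ("chr2", 42)]
excluded = {"chr1": [(200, 300)]}

unique_counts = count_tags(alignments, excluded)
all_counts = count_tags(alignments, excluded, keep_duplicates=True)
```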
In certain embodiments, a method further comprises sequencing at least a portion of the nucleic acid molecules of the maternal test sample so as to obtain the sequence information for the fetal and maternal nucleic acid molecules of the test sample. Sequencing may include massively parallel sequencing of maternal and fetal nucleic acids from the maternal test sample to generate sequence reads.
In certain embodiments, a method further comprises automatically recording, using a processor, the presence or absence of a fetal chromosomal aneuploidy as determined in (d) in a patient medical record for the human subject providing the maternal test sample. Recording may include recording the chromosome doses and/or a diagnosis based on the chromosome doses in a computer readable medium. In some cases, patient medical records are maintained by a laboratory, doctor's office, hospital, health maintenance organization, insurance company, or personal medical record website. A method may further comprise prescribing, initiating, and/or altering treatment of the human subject from whom the maternal test sample was obtained. Additionally or alternatively, the method may include ordering and/or performing one or more additional tests.
Certain methods disclosed herein identify a normalizing chromosome sequence or a normalizing chromosome segment sequence for a chromosome or chromosome segment of interest. Some of the methods include the operations of: (a) Providing a plurality of qualifying samples for a chromosome or chromosome segment of interest; (b) Repeatedly calculating chromosome doses for the chromosome or chromosome segment of interest using a plurality of potential normalizing chromosome sequences or normalizing chromosome segment sequences, wherein such repeated calculations are performed with a computing device; and (c) selecting the normalizing chromosome sequence or normalizing chromosome segment sequence, either individually or in a combination, that gives minimal variability and/or maximum resolvability in the doses calculated for the chromosome or chromosome segment of interest.
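The repeated calculation in operations (b) and (c) can be sketched as an exhaustive search over candidate normalizers, alone and in small combinations. Using the coefficient of variation of the doses across qualifying samples as the variability measure is an assumption, as are the function names, the limit on combination size, and the tag counts; the resolvability criterion is not modeled here.

```python
from itertools import combinations
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return stdev(values) / mean(values)


def select_normalizer(counts_by_sample, chrom_of_interest, candidates, max_group_size=2):
    """Return (group, cv) for the candidate group giving the least variable doses.

    counts_by_sample: list of {chromosome: tag_count} dicts, one per qualifying sample.
    A group's tag count is the sum of its members' tag counts.
    """
    best = None
    for size in range(1, max_group_size + 1):
        for group in combinations(candidates, size):
            doses = [
                sample[chrom_of_interest] / sum(sample[c] for c in group)
                for sample in counts_by_sample
            ]
            cv = coefficient_of_variation(doses)
            if best is None or cv < best[1]:
                best = (group, cv)
    return best


# Hypothetical tag counts for three qualifying samples.
qualified_counts = [
    {"chr21": 100, "chr9": 1000, "chr1": 2000},
    {"chr21": 110, "chr9": 1100, "chr1": 2000},
    {"chr21": 90, "chr9": 900, "chr1": 2100},
]
best_group, best_cv = select_normalizer(qualified_counts, "chr21", ["chr1", "chr9"])
```

In this toy data chr9 tracks chr21 exactly from sample to sample, so it is selected on its own rather than in combination with chr1.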
The selected sequence of the normalizing chromosome or sequence of the normalizing chromosome segment may be part of the sequence of the normalizing chromosome or the combination of sequences of the normalizing chromosome segment, or may be provided separately, rather than in combination with other sequences of the normalizing chromosome or sequences of the normalizing chromosome segment.
The disclosed embodiments provide a method of classifying copy number variations in a fetal genome. The method comprises the following operations: (a) Receiving sequence reads from fetal and maternal nucleic acids in a maternal test sample, wherein the sequence reads are provided in electronic format; (b) Aligning, using a computing device, the sequence reads to one or more chromosomal reference sequences, and thereby providing sequence tags corresponding to the sequence reads; (c) Computationally identifying, using the computing device, a number of the sequence tags from one or more chromosomes of interest and determining that a first chromosome of interest in the fetus carries a copy number variation; (d) Calculating a first fetal fraction value by a first method that does not use information from the tags of the first chromosome of interest; (e) Calculating a second fetal fraction value by a second method that uses information from the tags of the first chromosome of interest; and (f) comparing the first fetal fraction value to the second fetal fraction value and using the comparison to classify the copy number variation of the first chromosome. In certain embodiments, the method further comprises sequencing cell-free DNA from the maternal test sample to provide the sequence reads. In certain embodiments, the method further comprises obtaining the maternal test sample from a pregnant organism. In certain embodiments, operation (b) comprises aligning at least about one million reads using a computing device. In certain embodiments, operation (f) may comprise determining whether the two fetal fraction values are approximately equal.
In certain embodiments, operation (f) may further comprise determining that the two fetal fraction values are approximately equal, and thereby determining that a ploidy assumption implied in the second method is true. In certain embodiments, the ploidy hypothesis implied by the second method is that the first chromosome of interest has a complete chromosomal aneuploidy. In certain of these embodiments, the complete chromosomal aneuploidy of the first chromosome of interest is a monosomy or a trisomy.
In certain embodiments, operation (f) may comprise determining that the two fetal fraction values are not approximately equal, and the method further comprises analyzing tag information of the first chromosome of interest to determine whether (i) the first chromosome of interest carries a partial aneuploidy, or (ii) the fetus is a chimera.
In certain embodiments, the method may further comprise binning the sequence of the first chromosome of interest into a plurality of portions; determining whether any of the portions contains significantly more or significantly less nucleic acid than one or more other portions; and determining that the first chromosome of interest carries a partial aneuploidy if any of the portions contains significantly more or significantly less nucleic acid than one or more other portions. In one embodiment, the method may further comprise determining that the portion of the first chromosome of interest that contains significantly more or significantly less nucleic acid than one or more other portions carries the partial aneuploidy.
In one embodiment, operation (f) may further comprise binning the sequence of the first chromosome of interest into a plurality of portions; determining whether any of the portions contains significantly more or significantly less nucleic acid than one or more other portions; and determining that the fetus is a chimera if none of the portions contains significantly more or significantly less nucleic acid than one or more other portions.
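The portion-by-portion analysis described above, which applies once the two fetal fraction values have been found not to be approximately equal, can be sketched as follows. The text does not define "significantly more or significantly less"; the relative deviation from the median portion count used here is a stand-in assumption, and the function name, tolerance, and counts are hypothetical.

```python
from statistics import median


def classify_by_portions(portion_counts, rel_tolerance=0.10):
    """Return ("partial aneuploidy", outlier_indices) or ("chimera", []).

    portion_counts: tag counts for consecutive portions of the chromosome of interest.
    A portion is treated as an outlier if its count deviates from the median
    portion count by more than rel_tolerance (assumed significance criterion).
    """
    reference = median(portion_counts)
    if reference <= 0:
        raise ValueError("median portion count must be positive")
    outliers = [
        i for i, count in enumerate(portion_counts)
        if abs(count - reference) / reference > rel_tolerance
    ]
    if outliers:
        return "partial aneuploidy", outliers
    return "chimera", []


# Hypothetical portion counts: portions 3 and 4 are elevated in the first example.
elevated = classify_by_portions([1000, 1010, 990, 1150, 1160, 1005])
uniform = classify_by_portions([1000, 1010, 990, 1005])
```

The indices returned for the outlying portions also give a coarse location of the partial aneuploidy on the chromosome.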
Operation (e) may comprise: (a) Calculating the number of sequence tags from the first chromosome of interest and the at least one normalizing chromosome sequence to determine chromosome dosage; and (b) calculating a fetal fraction value from the chromosome dose using a second method. In certain embodiments, this operation further comprises calculating a Normalized Chromosome Value (NCV), wherein the second method uses the normalized chromosome value, and wherein the NCV correlates the chromosome dose to a mean of the respective chromosome doses in a set of qualifying samples as:
$$NCV_{iA} = \frac{R_{iA} - \bar{R}_{iU}}{\sigma_{iU}}$$
where $\bar{R}_{iU}$ and $\sigma_{iU}$ are, respectively, the estimated mean and standard deviation for the ith chromosome dose in the set of qualified samples, and $R_{iA}$ is the chromosome dose calculated for the chromosome of interest. In another embodiment, operation (d) further comprises the first method using information from one or more polymorphisms that exhibit an allelic imbalance in the fetal and maternal nucleic acids of the maternal test sample to calculate the first fetal fraction value.
In various embodiments, if the first fetal fraction value is not approximately equal to the second fetal fraction value, the method further comprises (i) determining whether the copy number variation is caused by a partial aneuploidy or a chimera; and (ii) determining the locus of a partial aneuploidy on the first chromosome of interest if the copy number variation is caused by a partial aneuploidy. In certain embodiments, determining the locus of a partial aneuploidy on the first chromosome of interest comprises dividing the sequence tags of the first chromosome of interest into bins or blocks of nucleic acids in the first chromosome of interest, and counting the number of mapped tags in each bin.
Operation (e) may further comprise calculating the fetal fraction value by evaluating the expression:
$$ff = 2 \times \left| NCV_{iA} \, CV_{iU} \right|$$
where ff is the second fetal fraction value, $NCV_{iA}$ is the normalized chromosome value for the ith chromosome in an affected sample, and $CV_{iU}$ is the coefficient of variation of the doses determined for the chromosome of interest in the qualifying samples.
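The expression follows if the dose of a chromosome carrying a complete trisomy or monosomy is assumed to shift from the unaffected mean by a factor of (1 ± ff/2): the NCV is then (ff/2) divided by the coefficient of variation, which rearranges to the expression above. The sketch below evaluates that expression and performs the comparison of operation (f). The relative tolerance used to decide "approximately equal" is an assumption, as are the function names and the numeric inputs.

```python
import math


def fetal_fraction_from_ncv(ncv, cv):
    """Second fetal fraction value: ff = 2 * |NCV * CV|."""
    return 2 * abs(ncv * cv)


def classify_cnv(ff_first, ncv, cv, rel_tol=0.2):
    """Compare the two fetal fraction values and return a coarse classification.

    rel_tol is an assumed tolerance for "approximately equal".
    """
    ff_second = fetal_fraction_from_ncv(ncv, cv)
    if math.isclose(ff_first, ff_second, rel_tol=rel_tol):
        return ff_second, "complete chromosomal aneuploidy"
    return ff_second, "partial aneuploidy or chimera"


# Hypothetical inputs: a first fetal fraction value of 10% from an independent method.
ff_match, label_match = classify_cnv(0.10, ncv=8.0, cv=0.006)
ff_low, label_low = classify_cnv(0.10, ncv=3.0, cv=0.006)
```

When the values disagree, the portion-level analysis of the chromosome of interest is what separates a partial aneuploidy from a chimera.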
In any of the above embodiments, the first chromosome of interest is selected from the group consisting of chromosomes 1 through 22, X, and Y. In any of the above embodiments, operation (f) may classify the copy number variation into a class selected from the group consisting of: complete chromosome insertions, complete chromosome deletions, partial chromosome duplications, and partial chromosome deletions, and chimeras.
The disclosed embodiments also provide a computer program product comprising a non-transitory computer readable medium on which are provided program instructions for classifying copy number variations in a fetal genome. The computer program product may include: (a) Code for receiving sequence reads from fetal and maternal nucleic acids in a maternal test sample, wherein the sequence reads are provided in electronic format; (b) Code for aligning, using a computing device, the sequence reads to one or more chromosomal reference sequences and thereby providing sequence tags corresponding to the sequence reads; (c) Code for computationally identifying, using the computing device, a number of the sequence tags from one or more chromosomes of interest and determining that a first chromosome of interest in the fetus carries a copy number variation; (d) Code for calculating a first fetal fraction value by a first method that does not use information from the tags of the first chromosome of interest; (e) Code for calculating a second fetal fraction value by a second method that uses information from the tags of the first chromosome of interest; and (f) code for comparing the first fetal fraction value to the second fetal fraction value and using the comparison to classify the copy number variation of the first chromosome. In certain embodiments, the computer program product comprises code for the various operations and methods of any one of the above embodiments of the disclosed methods.
The disclosed embodiments also provide a system for classifying copy number variations in a fetal genome. The system comprises: (a) An interface for receiving at least about 10,000 sequence reads from fetal and maternal nucleic acids in a maternal test sample, wherein the sequence reads are provided in electronic format; (b) A memory for at least temporarily storing a plurality of said sequence reads; (c) A processor designed or configured with program instructions for: (i) Aligning the sequence reads to one or more chromosomal reference sequences, and thereby providing a plurality of sequence tags corresponding to the sequence reads; (ii) Identifying a number of the sequence tags from one or more chromosomes of interest and determining that a first chromosome of interest in the fetus carries a copy number variation; (iii) Calculating a first fetal fraction value by a first method that does not use information from the tags of the first chromosome of interest; (iv) Calculating a second fetal fraction value by a second method that uses information from the tags of the first chromosome of interest; and (v) comparing the first fetal fraction value to the second fetal fraction value and using the comparison to classify the copy number variation of the first chromosome. According to various embodiments, the first chromosome of interest is selected from the group consisting of chromosomes 1 through 22, X, and Y. In certain embodiments, the program instructions for (c)(v) comprise program instructions for classifying the copy number variation into a category selected from the group consisting of: complete chromosome insertions, complete chromosome deletions, partial chromosome duplications, and partial chromosome deletions, and chimeras. According to various embodiments, the system can include program instructions to sequence cell-free DNA from the maternal test sample to provide the sequence reads.
According to certain embodiments, the program instructions for operation (c)(i) comprise program instructions for aligning at least about one million reads using a computing device.
In certain embodiments, the system further comprises a sequencer configured to sequence fetal and maternal nucleic acids in a maternal test sample and to provide sequence reads in electronic format. In various embodiments, the sequencer is located in a separate facility from the processor, and the sequencer and the processor are connected via a network.
In various embodiments, the system further comprises means for obtaining a maternal test sample from a pregnant mother. According to some embodiments, the means for obtaining a maternal test sample and the processor are located in separate facilities. In various embodiments, the system further comprises means for extracting cell-free DNA from the maternal test sample. In certain embodiments, the means for extracting cell-free DNA is located in the same facility as the sequencer, and the means for obtaining maternal test samples is located in a remote facility.
According to some embodiments, the program instructions for comparing the first fetal fraction value to the second fetal fraction value further comprise program instructions for determining whether the two fetal fraction values are approximately equal.
In certain embodiments, the system further comprises program instructions for determining that the ploidy assumption implied in the second method is true when the two fetal fraction values are approximately equal. In certain embodiments, the ploidy hypothesis implied in the second method is that the first chromosome of interest has a complete chromosomal aneuploidy. In certain embodiments, the complete chromosomal aneuploidy of the first chromosome of interest is a monosomy or a trisomy.
In certain embodiments, the system further comprises program instructions for analyzing tag information of the first chromosome of interest to determine whether (i) the first chromosome of interest carries a partial aneuploidy, or (ii) the fetus is a chimera, wherein the program instructions for analyzing are configured to be performed when the program instructions for comparing the first fetal fraction value to the second fetal fraction value indicate that the two fetal fraction values are not approximately equal. In certain embodiments, the program instructions for analyzing tag information for the first chromosome of interest comprise: program instructions for binning the sequence of the first chromosome of interest into a plurality of portions; program instructions for determining whether any of the portions contain significantly more or significantly less nucleic acid than one or more other portions; and program instructions for determining that the first chromosome of interest carries a partial aneuploidy if any of the portions contains significantly more or significantly less nucleic acid than one or more other portions. In certain embodiments, the system further comprises program instructions for determining that a portion of the first chromosome of interest comprising significantly more or significantly less nucleic acid than one or more other portions carries the partial aneuploidy.
In certain embodiments, the program instructions for analyzing tag information for the first chromosome of interest comprise: program instructions for binning the sequence of the first chromosome of interest into a plurality of portions; program instructions for determining whether any of the portions contain significantly more or significantly less nucleic acid than one or more other portions; and program instructions for determining that the fetus is a chimera if none of the portions contain significantly more or significantly less nucleic acid than one or more other portions.
According to various embodiments, the system may include program instructions for a second method for calculating a fetal fraction value, the program instructions comprising: (a) Program instructions for calculating the number of sequence tags from the first chromosome of interest and the at least one normalizing chromosome sequence to determine chromosome dosage; and (b) program instructions for calculating a fetal fraction value from the chromosome dose using a second method.
In certain embodiments, the system further comprises program instructions for calculating a Normalized Chromosome Value (NCV), wherein program instructions for the second method comprise program instructions for using the normalized chromosome value, and wherein program instructions for the NCV correlate the chromosome dose to a mean of corresponding chromosome doses in a set of qualified samples as:
$$NCV_{iA} = \frac{R_{iA} - \bar{R}_{iU}}{\sigma_{iU}}$$
where $\bar{R}_{iU}$ and $\sigma_{iU}$ are, respectively, the estimated mean and standard deviation for the ith chromosome dose in the set of qualified samples, and $R_{iA}$ is the chromosome dose calculated for the chromosome of interest. In various embodiments, the program instructions for the first method comprise program instructions for calculating the first fetal fraction value using information from one or more polymorphisms that exhibit an allelic imbalance in the fetal and maternal nucleic acids of the maternal test sample.
According to various embodiments, the program instructions of the second method for calculating a fetal fraction value include program instructions for evaluating the expression:
$$ff = 2 \times \left| NCV_{iA} \, CV_{iU} \right|$$
where ff is the second fetal fraction value, $NCV_{iA}$ is the normalized chromosome value for the ith chromosome in an affected sample, and $CV_{iU}$ is the coefficient of variation of the doses determined for the chromosome of interest in the qualifying samples.
According to various embodiments, the system further comprises: (i) Program instructions for determining whether the copy number variation is caused by a partial aneuploidy or a chimera; and (ii) program instructions for determining a locus of a partial aneuploidy on the first chromosome of interest if the copy number variation is caused by a partial aneuploidy, wherein the program instructions in (i) and (ii) are configured to be executed when the program instructions for comparing the first fetal fraction value to the second fetal fraction value determine that the first fetal fraction value and the second fetal fraction value are not approximately equal.
In certain embodiments, the program instructions for determining the locus of a partial aneuploidy on the first chromosome of interest comprise program instructions for dividing the sequence tags of the first chromosome of interest into bins or blocks of nucleic acids in the first chromosome of interest, and program instructions for counting the number of mapped tags in each bin.
In certain embodiments, methods are provided for identifying the presence of and/or increased risk of cancer in a mammal (e.g., a human), wherein the methods comprise: (a) Providing sequence reads of nucleic acids in a test sample from the mammal, wherein the test sample may comprise genomic nucleic acids from cancerous or precancerous cells and genomic nucleic acids from constitutive (germline) cells, wherein the sequence reads are provided in electronic format; (b) Aligning, using a computing device, the sequence reads to one or more chromosomal reference sequences and thereby providing sequence tags corresponding to the sequence reads; (c) Computationally identifying a number of sequence tags of fetal and maternal nucleic acids from one or more chromosomes of interest known to be amplified or deleted in association with cancer or chromosome segments of interest known to be amplified or deleted in association with cancer, wherein the chromosome or chromosome segment is selected from chromosomes 1 through 22, X and Y and segments thereof, and computationally identifying a number of sequence tags for at least one normalizing chromosome sequence or normalizing chromosome segment sequence for each of the chromosome(s) or chromosome segment(s) of interest, wherein the number of sequence tags identified for each of the chromosome(s) or chromosome segment(s) of interest is at least about 2,000, or at least about 5,000, or at least about 10,000; (d) Computationally calculating a single chromosome or segment dose for each of the one or more chromosomes of interest or chromosome segments of interest using the number of the sequence tags identified for each of the one or more chromosomes of interest or chromosome segments of interest and the number of the sequence tags identified for each of the normalizing chromosome sequences or normalizing chromosome segment sequences; and (e) comparing, using the computing device, each of the single chromosome or segment doses for each of one or more chromosomes or chromosome segments of interest to a respective threshold value for each of the one or more chromosomes or chromosome segments of interest, and thereby determining the presence or absence of an aneuploidy in the sample, wherein the presence of an aneuploidy and/or an increased number of sequence tags identified for each of the chromosome(s) or chromosome segment(s) of interest indicates the presence and/or increased risk of cancer. In certain embodiments, an increased risk is relative to the same subject at a different time (e.g., earlier), to a reference population (e.g., optionally adjusted for gender and/or race and/or age, etc.), to a similar subject without a risk factor, and the like. In certain embodiments, the chromosome or chromosome segment of interest comprises a whole chromosome whose amplification and/or deletion is known to be associated with cancer (e.g., as described herein). In certain embodiments, the chromosome or chromosome segment of interest comprises a chromosome segment whose amplification or deletion is known to be associated with one or more cancers. In certain embodiments, the chromosome segment comprises substantially a whole chromosome arm (e.g., as described herein). In certain embodiments, the chromosome segment comprises a whole chromosome aneuploidy. In certain embodiments, a whole chromosome aneuploidy includes a loss, while in certain other embodiments, a whole chromosome aneuploidy includes a gain (e.g., a gain or loss as shown in table 1). In certain embodiments, the chromosome segment of interest is substantially an arm-level segment comprising the short or long arm of any one or more of chromosomes 1 to 22, X and Y. In certain embodiments, an aneuploidy comprises an amplification of a substantially arm-level segment of a chromosome or a deletion of a substantially arm-level segment of a chromosome.
In certain embodiments, the chromosome segment of interest substantially comprises one or more arms selected from the group consisting of: 1q, 3q, 4p, 4q, 5p, 5q, 6p, 6q, 7p, 7q, 8p, 8q, 9p, 9q, 10p, 10q, 12p, 12q, 13q, 14q, 16p, 17q, 18p, 18q, 19p, 19q, 20p, 20q, 21q and/or 22q. In certain embodiments, the aneuploidy comprises amplification of one or more arms selected from the group consisting of: 1q, 3q, 4p, 4q, 5p, 5q, 6p, 6q, 7p, 7q, 8p, 8q, 9p, 9q, 10p, 10q, 12p, 12q, 13q, 14q, 16p, 17q, 18p, 18q, 19p, 19q, 20p, 20q, 21q, 22q. In certain embodiments, the aneuploidy comprises a deletion of one or more arms selected from the group consisting of: 1p, 3p, 4q, 5q, 6q, 8p, 8q, 9p, 9q, 10p, 10q, 11p, 11q, 13q, 14q, 15q, 16q, 17p, 17q, 18p, 18q, 19p, 19q, 22q. In certain embodiments, the chromosomal segment of interest is a segment comprising a region and/or gene set forth in table 3 and/or table 5 and/or table 4 and/or table 6. In certain embodiments, the aneuploidy comprises amplification of a region and/or gene set forth in table 3 and/or table 5. In certain embodiments, the aneuploidy comprises a deletion of a region and/or gene set forth in table 4 and/or table 6. In certain embodiments, the chromosomal segment of interest is a segment known to contain one or more oncogenes and/or one or more tumor suppressor genes. In certain embodiments, the aneuploidy comprises amplification of one or more regions selected from the group consisting of: 20q13, 19q12, 1q21-1q23, 8p11-p12, and ErbB2. In certain embodiments, the aneuploidy comprises amplification of one or more regions comprising a gene selected from the group consisting of: MYC, ERBB2 (EFGR), CCND1 (cyclin D1), FGFR1, FGFR2, HRAS, KRAS, MYB, MDM2, CCNE, KRAS, MET, ERBB1, CDK4, MYCB, ERBB2, AKT2, MDM2, and CDK4, among others.
In certain embodiments, the cancer is a cancer selected from the group consisting of: leukemia, ALL, brain cancer, breast cancer, colorectal cancer, dedifferentiated liposarcoma, esophageal adenocarcinoma, esophageal squamous cell carcinoma, GIST, glioma, HCC, hepatocellular cancer, lung cancer, lung NSC, lung SC, medulloblastoma, melanoma, MPD, myeloproliferative disorders, cervical cancer, ovarian cancer, prostate cancer, and renal cancer. In certain embodiments, the biological sample comprises a sample selected from the group consisting of: whole blood, blood clots, saliva/oral fluid, urine, tissue biopsies, pleural fluid, pericardial fluid, cerebrospinal fluid, and peritoneal fluid. In certain embodiments, the chromosomal reference sequence has a plurality of excluded regions that are naturally present in the chromosome but do not contribute to the number of sequence tags for any chromosome or chromosome segment. In certain embodiments, the method further comprises determining whether a read under consideration aligns to a site on a chromosomal reference sequence to which another read was previously aligned; and determining whether to include the read under consideration in the number of sequence tags for a chromosome of interest or a chromosome segment of interest, wherein both determinations are performed using the computing device. In various embodiments, the method further comprises at least temporarily storing sequence information for said nucleic acids in said sample in a computer readable medium (e.g., a non-transitory medium). In certain embodiments, step (d) comprises computationally calculating a segment dose for a selected one of the segments of interest as a ratio of the number of sequence tags identified for the selected segment of interest to the number of sequence tags identified for the corresponding at least one normalizing chromosome sequence or normalizing chromosome segment sequence of the selected segment of interest.
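The excluded-region and previously-aligned-site checks described above might be sketched as below. The text states only that a determination is made whether to include a read; the specific policy used here (counting at most `max_per_site` tags per alignment site) and all names in the example are assumptions made for illustration.

```python
def count_tags(alignments, excluded_regions, max_per_site=1):
    """Count sequence tags, skipping excluded regions and repeated sites.

    alignments:       iterable of (chromosome, position) pairs, one per read.
    excluded_regions: dict mapping chromosome -> list of half-open
                      (start, end) intervals that contribute no tags.
    max_per_site:     hypothetical cap on tags counted per alignment site.
    """
    seen = {}
    total = 0
    for chrom, pos in alignments:
        regions = excluded_regions.get(chrom, ())
        if any(start <= pos < end for start, end in regions):
            continue  # read falls in an excluded region
        site = (chrom, pos)
        seen[site] = seen.get(site, 0) + 1
        if seen[site] <= max_per_site:
            total += 1  # include only the first max_per_site reads per site
    return total


reads = [("chr1", 5), ("chr1", 5), ("chr1", 50), ("chr1", 150), ("chr2", 5)]
masked = {"chr1": [(100, 200)]}
n_tags = count_tags(reads, masked)
```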
In certain embodiments, the one or more chromosome segments of interest comprise at least 5 or at least 10 or at least 15 or at least 20 or at least 50 or at least 100 different segments of interest. In certain embodiments, at least 5 or at least 10 or at least 15 or at least 20 or at least 50 or at least 100 different aneuploidies are detected. In certain embodiments, the at least one normalizing chromosome sequence comprises one or more chromosomes selected from the group consisting of chromosomes 1 through 22, X, and Y. In certain embodiments, for each segment, the at least one normalizing chromosome sequence comprises the chromosome in which the segment is located. In certain embodiments, for each segment, the at least one normalizing chromosome sequence comprises a chromosome segment corresponding to the chromosome segment being normalized. In certain embodiments, at least one normalizing chromosome sequence or normalizing chromosome segment sequence is a chromosome or segment selected for an associated chromosome or segment of interest by: (i) Identifying a plurality of qualifying samples for the segment of interest; (ii) Repeatedly calculating chromosome doses for the selected chromosome using a plurality of potential normalizing chromosome sequences or normalizing chromosome segment sequences; and (iii) selecting the normalizing chromosome segment sequence, either individually or in a combination, that gives the smallest variability and/or the greatest differentiability in the calculated chromosome doses. In certain embodiments, the method further comprises calculating a Normalized Segment Value (NSV), wherein said NSV relates said segment dose to the mean of the corresponding segment doses in a set of qualifying samples, as described herein. In certain embodiments, the normalizing segment sequence is a single segment of any one or more of chromosomes 1 through 22, X, and Y.
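Steps (i) to (iii) above can be illustrated with a short sketch. It covers only single candidate sequences (not combinations) and uses the coefficient of variation of the dose across qualifying samples as the variability measure; both choices, and the names `select_normalizing_sequence` and `qualified_counts`, are assumptions made for this example.

```python
from statistics import mean, stdev


def select_normalizing_sequence(qualified_counts, interest, candidates):
    """Pick the candidate giving the least variable dose in qualifying samples.

    qualified_counts: list of dicts (one per qualifying sample, at least two)
                      mapping a sequence name to its tag count.
    interest:         name of the chromosome or segment of interest.
    candidates:       names of potential normalizing sequences.
    Returns (best_candidate, coefficient_of_variation).
    """
    best_name, best_cv = None, float("inf")
    for cand in candidates:
        # Step (ii): recompute the dose with each candidate as denominator.
        doses = [s[interest] / s[cand] for s in qualified_counts]
        cv = stdev(doses) / mean(doses)
        # Step (iii): keep the candidate with the smallest variability.
        if cv < best_cv:
            best_name, best_cv = cand, cv
    return best_name, best_cv


samples = [
    {"chr21": 100, "chr9": 1000, "chr1": 2000},
    {"chr21": 110, "chr9": 1100, "chr1": 2000},
    {"chr21": 90, "chr9": 900, "chr1": 2100},
]
best, best_cv = select_normalizing_sequence(samples, "chr21", ["chr1", "chr9"])
```

In this toy data the chr9 counts track the chr21 counts exactly, so chr9 yields a constant dose and is selected.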
In certain embodiments, the normalizing segment sequence is a set of segments of any one or more of chromosomes 1 through 22, X, and Y. In certain embodiments, the normalizing segment sequence comprises substantially one arm of any one or more of chromosomes 1 through 22, X, and Y. In certain embodiments, the method further comprises sequencing at least a portion of the nucleic acid molecules of the test sample to obtain the sequence information. In certain embodiments, sequencing comprises sequencing cell-free DNA from a test sample to provide sequence information. In certain embodiments, sequencing comprises sequencing cellular DNA from a test sample to provide sequence information. In certain embodiments, sequencing comprises massively parallel sequencing. In certain embodiments, the method(s) further comprise automatically recording the presence or absence of an aneuploidy as determined in (d) in a patient medical record of the human subject providing the test sample, wherein the recording is performed using a processor. In certain embodiments, recording comprises recording chromosome doses and/or a diagnosis based on the chromosome doses in a computer readable medium. In various embodiments, the patient medical record is maintained by a laboratory, a doctor's office, a hospital, a health care organization, an insurance company, or a personal medical record website. In certain embodiments, the determination of the presence or absence and/or number of aneuploidies constitutes a factor in a differential diagnosis for cancer. In certain embodiments, detection of an aneuploidy indicates a positive result, and the method further comprises prescribing, initiating, and/or altering treatment of the human subject from whom the test sample was taken.
In certain embodiments, prescribing, initiating, and/or altering treatment of the human subject from whom the test sample was taken comprises prescribing and/or performing further diagnostics to determine the presence and/or severity of cancer. In certain embodiments, the further diagnostics comprise screening a sample from the subject for cancer biomarkers, and/or imaging the subject for cancer. In certain embodiments, when the method indicates the presence of neoplastic cells in the mammal, the method further comprises treating the mammal, or causing the mammal to be treated, to remove the neoplastic cells and/or to inhibit the growth or proliferation of the neoplastic cells. In certain embodiments, treating the mammal comprises surgically removing the neoplastic (e.g., tumor) cells. In certain embodiments, treating the mammal comprises administering radiation therapy to the mammal, or causing radiation therapy to be administered, to kill the neoplastic cells. In certain embodiments, treating the mammal comprises administering, or causing to be administered, to the mammal an anti-cancer drug (e.g., matuzumab, Erbitux, Vectibix, nimotuzumab, panitumumab, fluorouracil, capecitabine, 5-trifluoromethyl-2'-deoxyuridine, methotrexate, raltitrexed, pemetrexed, cytarabine, 6-mercaptopurine, azathioprine, 6-thioguanine, pentostatin, fludarabine, cladribine, floxuridine, cyclophosphamide, Neosar, ifosfamide, thiotepa, 1,3-bis(2-chloroethyl)-1-nitrosourea, 1-(2-chloroethyl)-3-cyclohexyl-1-nitrosourea, hexamethylmelamine, busulfan, procarbazine, dacarbazine, chlorambucil, melphalan, cisplatin, carboplatin, oxaliplatin, bendamustine, carmustine, mechlorethamine, fotemustine, lomustine, mannosulfan, nedaplatin, nimustine, prednimustine, ranimustine, satraplatin, semustine, streptozocin, temozolomide, treosulfan, triaziquone, triethylenemelamine, triplatin tetranitrate, trofosfamide, uramustine, doxorubicin, daunorubicin, mitoxantrone, etoposide, topotecan, teniposide, irinotecan, Camptosar, camptothecin, belotecan, rubitecan, vincristine, vinblastine, vinorelbine, vindesine, paclitaxel, docetaxel, Abraxane, ixabepilone, larotaxel, ortataxel, tesetaxel, vinflunine, imatinib mesylate, sunitinib malate, sorafenib tosylate, nilotinib hydrochloride monohydrate, Tasigna, semaxanib, vandetanib, vatalanib, retinoic acid, retinoic acid derivatives, and the like).
In another embodiment, a computer program product for determining the presence of and/or increased risk of cancer in a mammal is provided. The computer program product typically comprises: (a) Code for providing sequence reads of nucleic acids in a test sample from said mammal, wherein said test sample may comprise genomic nucleic acids from cancer cells or precancerous cells and genomic nucleic acids from constitutional (germline) cells, wherein the sequence reads are provided in electronic format; (b) Code for using a computing device to align the sequence reads to one or more chromosomal reference sequences and thereby provide sequence tags corresponding to the sequence reads; (c) Code for computationally identifying a number of sequence tags from fetal and maternal nucleic acids for one or more chromosomes of interest known to be amplified or deleted in association with cancer, or chromosome segments of interest known to be amplified or deleted in association with cancer, wherein the chromosome or chromosome segment is selected from chromosomes 1 through 22, X and Y and segments thereof, and computationally identifying a number of sequence tags for at least one normalized chromosome sequence or normalized chromosome segment sequence for each of the chromosomes or chromosome segments of interest, wherein the number of sequence tags identified for each of the chromosomes or chromosome segments of interest is at least about 10,000; (d) Code for computationally calculating a single chromosome or segment dose for each of the one or more chromosomes of interest or chromosome segments of interest using the number of sequence tags identified for each of the one or more chromosomes of interest or chromosome segments of interest and the number of sequence tags identified for each of the normalizing chromosome sequences or chromosome segment sequences; and (e) code for using the computing device to compare each of the single chromosome doses for each of one or more chromosomes or chromosome segments of interest to a respective threshold for each of the one or more chromosomes or chromosome segments of interest, and thereby determine the presence or absence of an aneuploidy in the sample, wherein the presence of an aneuploidy and/or an increased number of sequence tags identified for each of the chromosome(s) or chromosome segment(s) of interest indicates the presence of cancer and/or an increased risk of cancer. In various embodiments, the code provides instructions for performing the diagnostic methods as described above (and below).
Methods of treating cancer subjects are also provided. In certain embodiments, the methods comprise performing a method for identifying the presence of and/or increased risk of cancer in a mammal as described herein using a sample from a subject, or receiving the results of such a method performed on the sample; and, when the method indicates the presence of neoplastic cells in said subject, alone or in combination with one or more other indicators from a differential diagnosis for cancer, treating the subject, or causing the subject to be treated, to remove the neoplastic cells and/or to inhibit the growth or proliferation of the neoplastic cells. In certain embodiments, treating the subject comprises surgically removing the cells. In certain embodiments, treating the subject comprises administering radiation therapy to the subject, or causing radiation therapy to be administered, to kill the neoplastic cells. In certain embodiments, treating the subject comprises administering, or causing to be administered, to the subject an anti-cancer agent (e.g., matuzumab, Erbitux, Vectibix, nimotuzumab, panitumumab, fluorouracil, capecitabine, 5-trifluoromethyl-2'-deoxyuridine, methotrexate, raltitrexed, pemetrexed, cytarabine, 6-mercaptopurine, azathioprine, 6-thioguanine, pentostatin, fludarabine, cladribine, floxuridine, cyclophosphamide, Neosar, ifosfamide, thiotepa, 1,3-bis(2-chloroethyl)-1-nitrosourea, 1-(2-chloroethyl)-3-cyclohexyl-1-nitrosourea, hexamethylmelamine, busulfan, procarbazine, dacarbazine, chlorambucil, melphalan, cisplatin, carboplatin, oxaliplatin, bendamustine, carmustine, mechlorethamine, fotemustine, lomustine, mannosulfan, nedaplatin, nimustine, prednimustine, ranimustine, satraplatin, semustine, streptozocin, temozolomide, treosulfan, triaziquone, triethylenemelamine, triplatin tetranitrate, trofosfamide, uramustine, doxorubicin, daunorubicin, mitoxantrone, etoposide, topotecan, teniposide, irinotecan, Camptosar, camptothecin, belotecan, rubitecan, vincristine, vinblastine, vinorelbine, vindesine, paclitaxel, docetaxel, Abraxane, ixabepilone, larotaxel, ortataxel, tesetaxel, vinflunine, imatinib mesylate, sunitinib malate, sorafenib tosylate, nilotinib hydrochloride monohydrate, Tasigna, semaxanib, vandetanib, vatalanib, retinoic acid, retinoic acid derivatives, and the like).
Also provided are methods of monitoring treatment of a cancer subject. In various embodiments, the methods comprise performing a method for identifying the presence and/or increased risk of cancer in a mammal as described herein or receiving the results of such a method performed on a sample from a subject prior to or during treatment; and re-performing the method on a second sample from the subject at a later time during or after treatment or receiving the results of such method performed on the second sample; wherein a decrease in the number or severity of aneuploidies (e.g., a decrease in the frequency of aneuploidies and/or a decrease or absence of certain aneuploidies) in a second measurement (e.g., compared to a first measurement) indicates a positive course of treatment and the same or increased number or severity of aneuploidies in a second measurement (e.g., compared to a first measurement) indicates a negative course of treatment, and when the indication is negative, adjusting the treatment regimen to a more aggressive treatment regimen and/or a palliative treatment regimen.
Also provided are methods of determining the fraction of fetal nucleic acid in a maternal sample comprising a mixture of fetal and maternal nucleic acid. In one embodiment, the method for determining fetal fraction in a maternal sample comprises: (a) Receiving sequence reads from fetal and maternal nucleic acids in the maternal test sample; (b) Aligning the sequence reads to one or more chromosomal reference sequences, and thereby providing a plurality of sequence tags corresponding to the sequence reads; (c) Identifying a number of those sequence tags from one or more chromosomes of interest or chromosome segments of interest selected from chromosomes 1 to 22, X and Y and segments thereof, and identifying a number of those sequence tags from at least one normalized chromosome sequence or normalized chromosome segment sequence for each of the chromosome(s) of interest or chromosome segment(s) of interest to determine a chromosome dose or chromosome segment dose, wherein the one or more chromosomes of interest or chromosome segments of interest have copy number variations; and (d) determining the fetal fraction using the chromosome dose or chromosome segment dose corresponding to the copy number variation identified in step (c). In some embodiments, the copy number variation is determined by comparing the dose of each chromosome or chromosome segment of interest to a respective threshold for each of the one or more chromosomes or chromosome segments of interest. The copy number variation may be selected from the group consisting of: complete chromosome duplication, complete chromosome deletion, partial duplication, partial multiplication, partial insertion, and partial deletion.
In certain embodiments, the chromosome or segment dose in step (c) is calculated as a ratio of the number of sequence tags identified for the selected chromosome or segment of interest to the number of sequence tags identified for the corresponding at least one normalized chromosome sequence or normalized chromosome segment sequence of the selected chromosome or segment of interest. In some embodiments, the chromosome or segment dose in step (c) is calculated as a ratio of the sequence tag density ratio of the selected chromosome or segment of interest to the sequence tag density ratio of at least one corresponding normalized chromosome sequence or normalized chromosome segment sequence of each of the selected chromosomes or segments of interest.
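The two dose definitions above can be sketched as follows. The function names are hypothetical, and the sketch assumes that "sequence tag density ratio" means the tag count of a sequence divided by that sequence's length; the text here does not spell out that definition.

```python
def dose_from_counts(tags_interest, tags_normalizing):
    """Dose as the ratio of tag counts (sequence of interest / normalizing)."""
    if tags_normalizing <= 0:
        raise ValueError("normalizing tag count must be positive")
    return tags_interest / tags_normalizing


def dose_from_densities(tags_interest, len_interest,
                        tags_normalizing, len_normalizing):
    """Dose as the ratio of length-normalized tag densities (assumed meaning)."""
    if min(len_interest, len_normalizing, tags_normalizing) <= 0:
        raise ValueError("lengths and normalizing tag count must be positive")
    density_interest = tags_interest / len_interest
    density_normalizing = tags_normalizing / len_normalizing
    return density_interest / density_normalizing


count_dose = dose_from_counts(5000, 20000)
density_dose = dose_from_densities(5000, 100, 20000, 200)
```

The density form differs from the count form only by a constant factor (the ratio of the two sequence lengths), so either can be compared against thresholds derived the same way.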
In certain embodiments, the method further comprises calculating a Normalized Chromosome Value (NCV), wherein calculating the NCV correlates the chromosome dose to an average of the corresponding chromosome doses in a set of qualifying samples as:
$$NCV_{iA} = \frac{R_{iA} - \bar{R}_{iU}}{\sigma_{iU}}$$
where $\bar{R}_{iU}$ and $\sigma_{iU}$ are, respectively, the estimated mean and standard deviation of the dose for the ith chromosome in the set of qualified samples, and $R_{iA}$ is the chromosome dose calculated for the ith chromosome in the test sample, wherein the ith chromosome is the chromosome of interest. The fetal fraction is then determined according to the expression:
$$ff = 2 \times \left| NCV_{iA} \, CV_{iU} \right|$$
where $ff$ is the fetal fraction value, $NCV_{iA}$ is the normalized chromosome value for the ith chromosome in an affected sample, and $CV_{iU}$ is the coefficient of variation of the dose determined for the ith chromosome in the qualified samples, wherein the ith chromosome is the chromosome of interest.
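A minimal numerical sketch of the NCV and fetal fraction expressions follows. The function names are hypothetical, and the use of the sample standard deviation (n − 1 denominator) as the "estimated" standard deviation is an assumption; the fetal fraction itself does not depend on that choice, because the standard deviation cancels in the product of NCV and CV.

```python
from statistics import mean, stdev


def normalized_chromosome_value(test_dose, qualified_doses):
    """NCV = (test dose - mean of qualified doses) / their standard deviation."""
    mu = mean(qualified_doses)
    sigma = stdev(qualified_doses)  # assumption: sample standard deviation
    return (test_dose - mu) / sigma


def fetal_fraction(test_dose, qualified_doses):
    """ff = 2 * |NCV * CV|, with CV = sigma / mean over the qualified samples."""
    mu = mean(qualified_doses)
    sigma = stdev(qualified_doses)
    cv = sigma / mu
    return 2 * abs(normalized_chromosome_value(test_dose, qualified_doses) * cv)


qualified = [0.98, 1.00, 1.02]   # toy doses from qualified samples
ncv = normalized_chromosome_value(1.05, qualified)
ff = fetal_fraction(1.05, qualified)
```

With these toy values the test dose is 5% above the qualified mean, which corresponds to an NCV of about 2.5 and a fetal fraction of about 0.10.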
In certain embodiments, the fetal fraction is determined using a Normalized Segment Value (NSV), wherein the NSV correlates the chromosome segment dose to an average of the corresponding chromosome segment doses in a set of qualifying samples as:
$$NSV_{iA} = \frac{R_{iA} - \bar{R}_{iU}}{\sigma_{iU}}$$
where $\bar{R}_{iU}$ and $\sigma_{iU}$ are, respectively, the estimated mean and standard deviation of the dose for the ith chromosome segment in the set of qualified samples, and $R_{iA}$ is the chromosome segment dose calculated for the ith chromosome segment in the test sample, wherein the ith chromosome segment is the chromosome segment of interest. The fetal fraction is then determined according to the following expression:
$$ff = 2 \times \left| NSV_{iA} \, CV_{iU} \right|$$
where $ff$ is the fetal fraction value, $NSV_{iA}$ is the normalized chromosome segment value for the ith chromosome segment in an affected sample, and $CV_{iU}$ is the coefficient of variation of the dose determined for the ith chromosome segment in the qualified samples, wherein the ith chromosome segment is the chromosome segment of interest.
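The factor of 2 in the fetal fraction expression can be motivated by a short derivation. It rests on two assumptions that are not stated in the text: that the dose scales linearly with copy number, and that the fetus carries a gain or loss of exactly one copy of the chromosome or segment, so that a fetal fraction $ff$ shifts the dose by a relative amount $ff/2$. The coefficient of variation is taken in its usual sense, $CV_{iU} = \sigma_{iU} / \bar{R}_{iU}$.

$$R_{iA} = \bar{R}_{iU}\left(1 \pm \frac{ff}{2}\right)
\quad\Longrightarrow\quad
NSV_{iA} = \frac{R_{iA} - \bar{R}_{iU}}{\sigma_{iU}}
= \pm\frac{ff}{2}\cdot\frac{\bar{R}_{iU}}{\sigma_{iU}}
= \pm\frac{ff}{2\,CV_{iU}}$$

$$\Longrightarrow\quad ff = 2 \times \left| NSV_{iA} \, CV_{iU} \right|$$

The plus sign corresponds to a gain and the minus sign to a loss; the absolute value makes the expression apply to both. The same steps apply with NCV in place of NSV for a whole chromosome.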
In certain embodiments, the chromosome of interest is any one of chromosomes 1-22 or the X chromosome when the fetus is male, and the chromosome segment of interest is a segment of any one of chromosomes 1-22 or of the X chromosome when the fetus is male.
In certain embodiments, the at least one normalized chromosome sequence or normalized chromosome segment sequence of embodiments of the methods for determining fetal fraction is a chromosome or segment selected for an associated chromosome or segment of interest by: (i) Identifying a plurality of qualifying samples for the chromosome or segment of interest; (ii) Repeatedly calculating chromosome doses or chromosome segment doses for the selected chromosome or segment using a plurality of potential normalized chromosome sequences or normalized chromosome segment sequences; and (iii) selecting the normalized chromosome sequence or normalized chromosome segment sequence, either individually or in a combination, that gives the smallest variability or the greatest differentiability in the calculated chromosome doses or chromosome segment doses. The normalizing chromosome sequence may be a single chromosome of any one or more of chromosomes 1 to 22, X and Y. Alternatively, the normalizing chromosome sequence may be a set of chromosomes selected from chromosomes 1 to 22, X and Y. Likewise, the normalizing segment sequence may be a single segment of any one or more of chromosomes 1 to 22, X and Y. Alternatively, the normalizing segment sequence may be a set of segments of any one or more of chromosomes 1 through 22, X, and Y.
In certain embodiments, the method of determining fetal fraction may further comprise comparing the fetal fraction obtained as described to a fetal fraction that may be determined using information from one or more polymorphisms that exhibit allelic imbalance in the fetal and maternal nucleic acids of the maternal test sample. Methods for determining allelic imbalance are described elsewhere in this application and include determining fetal fraction using polymorphic differences between the fetal and maternal genomes, including but not limited to differences detected in SNP or STR sequences.
In certain embodiments, the method further comprises storing the sequence reads at least temporarily.
An additional method of classifying copy number variations in a fetal genome is provided. The additional method comprises: (a) Obtaining sequence reads from fetal and maternal nucleic acids in a maternal test sample; (b) Aligning the sequence reads to one or more chromosomal reference sequences, and thereby providing a plurality of sequence tags corresponding to the sequence reads; (c) Identifying the number of the sequence tags from one or more chromosomes of interest and determining that a first chromosome of interest in the fetus carries a copy number variation; (d) Calculating a first fetal fraction value by a first method that does not use information from the tags of the first chromosome of interest; (e) Calculating a second fetal fraction value by a second method that uses information from the tags of the first chromosome of interest; and (f) comparing the first fetal fraction value to the second fetal fraction value and using the comparison to classify the copy number variation of the first chromosome of interest.
In certain embodiments, the first method of calculating a fetal fraction value as described in step (d) of the additional method comprises: calculating the first fetal fraction value using information from one or more polymorphisms that exhibit allelic imbalance in fetal and maternal nucleic acid of the maternal test sample; a second method of calculating a fetal fraction value as described in step (e) of the additional method comprises: (a) Calculating the number of sequence tags from the first chromosome of interest and the at least one normalizing chromosome sequence to determine chromosome dosage; and (b) calculating the fetal fraction value from the chromosome dose using the second method.
In certain embodiments, the information used by the first method comprises sequence tags obtained by sequencing predetermined polymorphic sequences, each of the polymorphic sequences comprising the one or more polymorphic sites. In certain embodiments, the information used in the first method is obtained by a non-sequencing method, such as by qPCR, digital PCR, mass spectrometry, or capillary gel electrophoresis.
In certain embodiments, the first method comprises calculating the first fetal fraction value using tags from chromosomes or chromosome segments that do not have copy number variations. For example, when the first chromosome of interest is chromosome 21, the fetal fraction determined using the sequence tags from chromosome 21 can be compared to the fetal fraction determined from the sequence tags from chromosome X in a male fetus. Any chromosome or chromosome segment that is known not to occur in an aneuploid state, or that is determined not to be aneuploid (e.g., by calculating its NCV or NSV) by any of the methods described herein, can be used to determine the first fetal fraction.
In certain embodiments, the chromosome or segment dose determined by the second method in step (e) is calculated as a ratio of the number of sequence tags identified for the selected chromosome or segment of interest to the number of sequence tags identified for the corresponding at least one normalized chromosome sequence or normalized chromosome segment sequence of the selected chromosome or segment of interest. In certain embodiments, the chromosome or segment dose determined in step (e) is calculated as a ratio of the sequence tag density ratio of the selected chromosome or segment of interest to the sequence tag density ratio of at least one corresponding normalized chromosome sequence or normalized chromosome segment sequence of each of the selected chromosomes or segments of interest.
Certain embodiments of the additional method further comprise calculating a Normalized Chromosome Value (NCV), wherein the second method uses the normalized chromosome value, and wherein calculating the NCV correlates the chromosome dose to a mean of corresponding chromosome doses in a set of qualified samples as:
$$NCV_{iA} = \frac{R_{iA} - \bar{R}_{iU}}{\sigma_{iU}}$$
where $\bar{R}_{iU}$ and $\sigma_{iU}$ are, respectively, the estimated mean and standard deviation of the dose for the ith chromosome in the set of qualified samples, and $R_{iA}$ is the chromosome dose calculated for the ith chromosome in the test sample, wherein the ith chromosome is the chromosome of interest.
In certain embodiments, the second method of calculating the fetal fraction value comprises evaluating the expression:
$$ff = 2 \times \left| NCV_{iA} \, CV_{iU} \right|$$
where $ff$ is the fetal fraction value, $NCV_{iA}$ is the normalized chromosome value for the ith chromosome in an affected sample or test sample, and $CV_{iU}$ is the coefficient of variation of the dose determined for the ith chromosome in the qualified samples, wherein the ith chromosome is the chromosome of interest.
In certain embodiments, the first method of calculating a fetal fraction comprises (a) calculating the number of sequence tags from chromosomes other than the first chromosome of interest and at least one normalizing chromosome sequence to determine chromosome dosages for chromosomes other than the first chromosome of interest; and (b) calculating the first fetal fraction value from the chromosome dose by the first method; the second method comprises the following steps: (a) Calculating the number of sequence tags from the first chromosome of interest and the at least one normalizing chromosome sequence to determine a chromosome dose; and (b) calculating the second fetal fraction value from the chromosome dose by the second method.
Preferably, chromosome or segment doses are calculated as a ratio of the number of sequence tags identified for the selected chromosome or segment of interest to the number of sequence tags identified for the corresponding at least one normalized chromosome sequence or normalized chromosome segment sequence of the selected chromosome or segment of interest; alternatively, the chromosome dose or segment dose is calculated as a ratio of the sequence tag density ratio of the selected chromosome or segment of interest to the sequence tag density ratio of at least one corresponding normalized chromosome sequence or normalized chromosome segment sequence of each of the selected chromosomes or segments of interest.
Preferably, the additional method for classifying copy number variations further comprises calculating a corresponding Normalized Chromosome Value (NCV), and the first and second methods use the corresponding NCV. The NCV relates the determined chromosome dose to the mean of the corresponding chromosome doses in a set of qualified samples as:
$$NCV_{iA} = \frac{R_{iA} - \bar{R}_{iU}}{\sigma_{iU}}$$
where $\bar{R}_{iU}$ and $\sigma_{iU}$ are, respectively, the estimated mean and standard deviation of the dose for the i-th chromosome in the set of qualified samples, and $R_{iA}$ is the calculated dose of the i-th chromosome in the test sample. The first and second methods may use the NCV to calculate the fetal fraction, evaluated by the following expression:
$$ff = 2 \times \left| NCV_{iA} \, CV_{iU} \right|$$
where $ff$ is the fetal fraction value, $NCV_{iA}$ is the normalized chromosome value for the i-th chromosome in the test sample, and $CV_{iU}$ is the coefficient of variation of the dose for the i-th chromosome in the qualified samples. In the above expression, for the first method, the i-th chromosome is not the first chromosome of interest; for the second method, the i-th chromosome is the first chromosome of interest.
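The NCV and fetal fraction expressions above can be worked through numerically. The sketch below is illustrative only: the function names and the qualified-sample doses are hypothetical, and the use of the sample (n - 1) standard deviation is an assumption, since the estimator is not specified here.

```python
from statistics import mean, stdev


def normalized_chromosome_value(test_dose, qualified_doses):
    """NCV: distance of the test dose from the qualified-sample mean,
    expressed in qualified-sample standard deviations."""
    return (test_dose - mean(qualified_doses)) / stdev(qualified_doses)


def fetal_fraction_from_ncv(test_dose, qualified_doses):
    """ff = 2 * |NCV * CV|, where CV is the coefficient of variation
    (standard deviation / mean) of the dose in the qualified samples."""
    cv = stdev(qualified_doses) / mean(qualified_doses)
    return 2 * abs(normalized_chromosome_value(test_dose, qualified_doses) * cv)


# Hypothetical doses for one chromosome, for illustration only.
qualified = [0.098, 0.100, 0.102]   # mean 0.100, sample standard deviation 0.002
test_dose = 0.105

ncv_value = normalized_chromosome_value(test_dose, qualified)  # about 2.5
ff_value = fetal_fraction_from_ncv(test_dose, qualified)       # about 0.10
```

Because the standard deviation cancels in the product of NCV and CV, the expression reduces to twice the relative deviation of the test dose from the qualified mean, here 2 × (0.105 − 0.100) / 0.100.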
The first chromosome of interest is selected from the group consisting of chromosomes 1 through 22, X and Y. The chromosome other than the first chromosome of interest may be any one of chromosomes 1 to 22, or the X chromosome when the fetus is a male.
In certain embodiments, step (f) comprises determining whether the two fetal fraction values are approximately equal. In certain embodiments, step (f) further comprises determining that a ploidy assumption implicit in the second method is true when the two fetal fraction values are approximately equal. The ploidy assumption implicit in the second method may be that the first chromosome of interest has a complete chromosomal aneuploidy. For example, the complete chromosomal aneuploidy of the first chromosome of interest is a monosomy or a trisomy.
In certain embodiments, additional methods for classifying copy number variations further comprise a step (g): when the two fetal fraction values are not approximately equal, analyzing the tag information of the first chromosome of interest to determine whether (i) the first chromosome of interest carries a partial aneuploidy or (ii) the fetus is a chimera.
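The comparison in step (f) and the hand-off to step (g) can be sketched as a simple decision. This is only an illustration: the function name and the 10% relative tolerance used to stand in for "approximately equal" are assumptions, as no numerical tolerance is given here.

```python
import math


def compare_fetal_fractions(ff_first, ff_second, rel_tol=0.10):
    """Return the outcome of comparing the two fetal fraction values.

    ff_first  : value from a chromosome other than the chromosome of interest
    ff_second : value from the chromosome of interest
    rel_tol   : hypothetical tolerance standing in for "approximately equal"
    """
    if math.isclose(ff_first, ff_second, rel_tol=rel_tol):
        # The ploidy assumption implicit in the second method is accepted.
        return "complete chromosomal aneuploidy"
    # Otherwise continue with step (g) on the tag information.
    return "analyze tag information"


outcome_equal = compare_fetal_fractions(0.100, 0.105)
outcome_unequal = compare_fetal_fractions(0.100, 0.040)
```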
In certain embodiments, the first method comprises calculating the first fetal fraction value using information from one or more polymorphisms in fetal and maternal nucleic acids exhibiting allelic imbalance in the maternal test sample, said polymorphisms being present in a chromosome other than the first chromosome of interest; and the second method comprises calculating the second fetal fraction value using information from one or more polymorphisms in fetal and maternal nucleic acids exhibiting allelic imbalance in the maternal test sample, said polymorphisms being present in the first chromosome of interest. The comparing of step (f) may include: determining that the first chromosome of interest is disomic when the ratio of the second fetal fraction value to the first fetal fraction value is approximately 1; determining that the first chromosome of interest is trisomic when the ratio is approximately 1.5; and determining that the first chromosome of interest is monosomic when the ratio is approximately 0.5. Additional methods for classifying copy number variation may further comprise a step (g) of analyzing the tag information of the first chromosome of interest when the ratio of the second fetal fraction value to the first fetal fraction value is not approximately 1, 1.5, or 0.5, to determine whether (i) the first chromosome of interest carries a partial aneuploidy or (ii) the fetus is a chimera.
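The ratio test described above, with ratios of approximately 1, 1.5, and 0.5, can be sketched as follows. The function name, the returned labels, and the tolerance of 0.1 around each expected ratio are assumptions made for illustration; only the three target ratios come from the text.

```python
def classify_by_fetal_fraction_ratio(ff_first, ff_second, tolerance=0.1):
    """Classify the copy number of the chromosome of interest from the ratio
    of the second fetal fraction value to the first.

    tolerance is a hypothetical stand-in for "approximately".
    """
    if ff_first <= 0:
        raise ValueError("first fetal fraction value must be positive")
    ratio = ff_second / ff_first
    expected = {"two copies": 1.0, "three copies": 1.5, "one copy": 0.5}
    for label, target in expected.items():
        if abs(ratio - target) <= tolerance:
            return label
    # No expected ratio matched: proceed to step (g).
    return "analyze tag information"


call_a = classify_by_fetal_fraction_ratio(0.10, 0.10)
call_b = classify_by_fetal_fraction_ratio(0.10, 0.15)
call_c = classify_by_fetal_fraction_ratio(0.10, 0.05)
call_d = classify_by_fetal_fraction_ratio(0.10, 0.125)
```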
In certain embodiments, the information used by the first and second methods that utilize polymorphisms comprises sequence tags obtained by sequencing predetermined polymorphic sequences, each of which includes the one or more polymorphic sites. Alternatively, the information used by the first and second methods that utilize polymorphisms is obtained not by sequencing but by a non-sequencing method such as qPCR, digital PCR, mass spectrometry, or capillary gel electrophoresis.
In certain embodiments, step (g) of analyzing the tag information of the first chromosome of interest comprises: (a) binning the sequence of the first chromosome of interest into a plurality of portions; (b) determining whether any of the portions contains significantly more or significantly less nucleic acid than one or more other portions; and (c) determining that the first chromosome of interest carries a partial aneuploidy if any of said portions contains significantly more or significantly less nucleic acid than one or more other portions, or determining that the fetus is a chimera if none of the portions contains significantly more or significantly less nucleic acid than one or more other portions. Thus, the additional method may further comprise determining that a portion of the first chromosome of interest containing significantly more or significantly less nucleic acid than one or more other portions carries the partial aneuploidy.
Step (f) of the method for classifying copy number variations comprises classifying the copy number variation into a class selected from the group consisting of: complete chromosomal duplication or multiplication, complete chromosomal deletion, partial chromosomal duplication, partial chromosomal deletion, and chimera.
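The decision in step (g) can be sketched from per-portion tag counts. This illustration makes several assumptions that are not stated here: "significantly more or less" is approximated by a relative deviation from the median portion count, the 5% threshold is hypothetical, and the counts are taken to be already normalized for portion-to-portion bias.

```python
from statistics import median


def analyze_portions(portion_counts, rel_threshold=0.05):
    """Return a classification and the indices of deviating portions.

    A portion is flagged when its count deviates from the median count by
    more than rel_threshold (a hypothetical significance criterion).
    """
    center = median(portion_counts)
    if center <= 0:
        raise ValueError("median portion count must be positive")
    deviating = [
        index
        for index, count in enumerate(portion_counts)
        if abs(count - center) / center > rel_threshold
    ]
    if deviating:
        return "partial aneuploidy", deviating
    return "chimera", []


# Hypothetical counts: portions 3 and 4 carry excess tags.
uneven = analyze_portions([1000, 1005, 998, 1100, 1102, 1001])
# Hypothetical counts: no portion stands out.
even = analyze_portions([1000, 1002, 999, 1001])
```

The flagged indices identify the portions that would be reported as carrying the partial aneuploidy.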
In embodiments where the step (f) of comparing the first fetal fraction value to the second fetal fraction value determines that the first fetal fraction value and the second fetal fraction value are not approximately equal, the method further comprises:
(i) determining whether the copy number variation is caused by a partial aneuploidy or a chimera; and
(ii) when the copy number variation is caused by a partial aneuploidy, determining the locus of the partial aneuploidy on the first chromosome of interest.
In certain embodiments, determining the locus of a partial aneuploidy on the first chromosome of interest comprises dividing the sequence tags of the first chromosome of interest into bins or blocks of nucleic acids in the first chromosome of interest, and counting the number of mapped tags in each bin.
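Dividing mapped tags into bins and counting the tags in each bin can be sketched as follows. The function names, the fixed bin size, the 0-based start coordinates, and the example positions are assumptions for illustration; no bin size is specified here.

```python
from collections import Counter


def count_tags_per_bin(tag_positions, bin_size):
    """Count mapped tags per fixed-width bin.

    tag_positions : 0-based start coordinates of tags mapped to the
                    chromosome of interest (hypothetical input format)
    bin_size      : bin width in base pairs (hypothetical parameter)
    """
    if bin_size <= 0:
        raise ValueError("bin_size must be positive")
    counts = Counter(position // bin_size for position in tag_positions)
    return dict(sorted(counts.items()))


def bin_interval(bin_index, bin_size):
    """Half-open coordinate interval [start, end) covered by a bin."""
    return bin_index * bin_size, (bin_index + 1) * bin_size


positions = [10, 999_999, 1_000_000, 2_500_000, 2_500_001]
per_bin = count_tags_per_bin(positions, bin_size=1_000_000)
locus = bin_interval(2, 1_000_000)
```

A bin whose count departs from the others can then be translated back to a coordinate interval with `bin_interval` to report the locus.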
In certain embodiments, the step of aligning in (b) comprises aligning at least about 1 million reads.
Any of the methods described herein can further include sequencing fetal and maternal nucleic acids (e.g., cell-free DNA) in the maternal test sample to obtain sequence reads. Sequencing maternal and fetal nucleic acids from a maternal test sample to generate sequence reads comprises massively parallel sequencing. In certain embodiments, massively parallel sequencing is sequencing-by-synthesis. Sequencing by synthesis can be achieved using reversible dye terminators. In other embodiments, massively parallel sequencing is sequencing by ligation. In still other embodiments, massively parallel sequencing is single molecule sequencing.
Maternal samples that can be used to determine fetal fraction according to the methods described herein include blood, plasma, serum, or urine samples. In certain embodiments, the maternal sample is a plasma sample. In other embodiments, the maternal sample is a whole blood sample.
A number of different devices are also provided, including devices for performing medical analysis on a sample (e.g., a maternal sample), which are used to perform the steps of the above-described methods, e.g., individually for determining copy number variation, for determining fetal fraction, or for classifying copy number variation.
Kits are also provided that include reagents that can be used to determine copy number variation, either alone or in combination with methods for determining the fraction contributed by one of two genomes to a mixture of nucleic acids derived from the two genomes (e.g., fetal fraction in a maternal sample). These kits may be used in conjunction with the devices described herein.
Although the examples herein refer to humans and the wording is primarily directed to human problems, the concepts described herein are also applicable to genomes from any plant or animal.
Brief Description of the Drawings
FIG. 1 is a flow chart of a method 100 for determining the presence or absence of copy number variation in a test sample comprising a mixture of nucleic acids.
FIG. 2 depicts process flows for preparing a sequencing library according to the unabridged Illumina protocol, the abbreviated protocol (ABB), the two-step method, and the one-step method as described herein. "P" represents a purification step; and "X" indicates that the purification step and/or DNA repair is not included.
Figure 3 depicts a process flow of an embodiment of a method for preparing a sequencing library on a solid surface.
Figure 4 shows a flow diagram of one embodiment 400 of a method for verifying the integrity of a sample subjected to a multi-step singleplex sequencing bioassay.
Figure 5 shows a flow diagram of one embodiment 500 of a method for verifying the integrity of a plurality of samples subjected to a multi-step multiplex sequencing bioassay.
Fig. 6 is a flow chart of a method 600 for determining the presence or absence of aneuploidy and fetal fraction simultaneously in a maternal test sample comprising a mixture of fetal and maternal nucleic acids.
Fig. 7 is a flow diagram of a method 700 for determining fetal fraction in a maternal test sample comprising a mixture of fetal and maternal nucleic acids using massively parallel sequencing or size separation of polymorphic nucleic acid sequences.
Fig. 8 is a flow chart of a method 800 for determining the presence or absence of fetal aneuploidy and fetal fraction simultaneously in a maternal plasma test sample enriched for polymorphic nucleic acids.
Fig. 9 is a flow diagram of a method 900 for determining the presence or absence of a fetal aneuploidy and a fetal fraction simultaneously in a maternal purified cfDNA test sample enriched for polymorphic nucleic acids.
Fig. 10 is a flow diagram of a method 1000 for determining the presence or absence of fetal aneuploidy and fetal fraction simultaneously in a sequencing library constructed from fetal and maternal nucleic acids derived from a maternal test sample and enriched for polymorphic nucleic acids.
FIG. 11 is a flow chart summarizing an alternative embodiment of a method for determining fetal fraction by massively parallel sequencing as shown in FIG. 7.
Fig. 12 is a bar graph showing the identification of fetal and maternal polymorphic sequences (SNPs) used to determine fetal fraction in a test sample. The total number of sequence reads (Y axis) mapped to the SNP sequences identified by rs number (X axis), and the relative content of fetal nucleic acids (×), are shown.
Fig. 13 is a block diagram depicting the classification of fetal and maternal zygosity states for a given genomic location.
Fig. 14 shows a comparison of results generated using the mixture model and known and estimated fetal fractions.
Figure 15 shows error estimates by sequenced base position over 30 lanes of Illumina GA2 data aligned to the human genome HG18 using Eland with default parameters.
FIG. 16 shows that using machine error rate as a known parameter can reduce the upward bias by one point.
Fig. 17 shows that, for simulated data, using the machine error rate as a known parameter to enhance the case 1 and case 2 error models reduces the upward bias to less than one point for fetal fractions below 0.2.
Fig. 18 is a flow chart depicting a method of classifying CNVs by comparing fetal fraction values calculated with two different techniques.
FIG. 19 is a block diagram of a discrete system for processing test samples and ultimately making a diagnosis.
Fig. 20 schematically shows how many different operations can be handled by different elements of the system in groups when processing test samples.
Fig. 21A and 21B show electropherograms of cfDNA sequencing libraries prepared according to the abbreviated protocol described in example 2a (fig. 21A) and the protocol described in example 2b (fig. 21B).
FIGS. 22A to 22C provide graphs showing the mean (n = 16) of the percentage of the total number of sequence tags mapped to each individual chromosome (% ChrN; FIG. 22A), and the percentage of sequence tags as a function of chromosome size (FIG. 22B), when the sequencing library was prepared according to the abbreviated protocol (ABB) and when the sequencing library was prepared according to the repair-free two-step method (INSOL; □). FIG. 22C shows the percentage ratio of the tags mapped when the library was prepared using the two-step method to the tags obtained when the library was prepared using the abbreviated (ABB) method, as a function of the GC content of the chromosome.
FIGS. 23A and 23B show histograms providing the mean and standard deviation of the percentage of tags mapped to chromosome X (FIG. 23A; % ChrX) and chromosome Y (FIG. 23B; % ChrY), obtained from sequencing 10 samples of cfDNA purified from the plasma of 10 pregnant women. FIG. 23A shows that more tags mapped to the X chromosome when the repair-free two-step method was used than when the abbreviated method (ABB) was used. FIG. 23B shows that the percentage of tags mapped to the Y chromosome using the repair-free two-step method does not differ from that obtained using the abbreviated method (ABB).
Figure 24 shows the ratio of the number of non-excluded sites (NE sites) on the reference genome (hg18) to the total number of tags mapped to the non-excluded sites for each of 5 samples from which cfDNA was prepared and used to construct sequencing libraries according to the abbreviated protocol (ABB; solid bars), the repair-free protocol in solution (two-step; open bars), and the repair-free protocol on a solid surface (one-step; grey bars) described in example 2.
FIGS. 25A to 25C provide graphs showing the mean (n = 5) of the percentage of the total number of sequence tags mapped to each individual chromosome (% ChrN; FIG. 25A), and the percentage of sequence tags as a function of chromosome size (FIG. 25B), when the sequencing library was prepared according to the abbreviated protocol (ABB), according to the repair-free two-step method (□), and on a solid surface according to the repair-free one-step method (Δ). Regression coefficients are given for mapped tags obtained from sequencing libraries prepared according to the abbreviated protocol (ABB) and the solid surface repair-free protocol (two-step; □). FIG. 25C shows the percentage ratio of mapped sequence tags per chromosome obtained from a sequencing library prepared according to the repair-free two-step method to the tags per chromosome obtained from a sequencing library prepared according to the abbreviated method (ABB), as a function of the percentage GC content of each chromosome (◇), and the corresponding percentage ratio for a sequencing library prepared according to the repair-free one-step method relative to the abbreviated method (ABB), as a function of the percentage GC content of each chromosome (□).
Fig. 26A and 26B show a comparison of the mean and standard deviation of the percentages of tags mapped to chromosome X (fig. 26A) and chromosome Y (fig. 26B), obtained from sequencing 5 samples of cfDNA purified from the plasma of 5 pregnant women according to the ABB method, the two-step method, and the one-step method. Fig. 26A shows that more tags mapped to the X chromosome when the repair-free methods (two-step and one-step) were used than when the abbreviated method (ABB) was used. Fig. 26B shows that the percentage of tags mapped to the Y chromosome using the repair-free two-step and one-step methods does not differ from that obtained using the abbreviated method (ABB).
Fig. 27A and 27B show that the amount of purified cfDNA used to prepare the sequencing library was correlated to the amount of library product obtained for 61 clinical samples prepared in solution using the ABB method (fig. 27A) and 35 study samples prepared using the repair-free Solid Surface (SS) one-step method (fig. 27B).
Fig. 28 shows the correlation of the amount of cfDNA used to make the library with the amount of library products obtained using the two-step (□), ABB (diamond) and one-step (Δ) methods.
Figure 29 shows the percentage of indexed sequence reads obtained when indexed libraries were prepared using the one-step (open bars) and two-step (solid bars) methods and sequenced as 6-plex (i.e., 6 indexed samples per flow cell lane).
FIGS. 30A and 30B are graphs showing the mean (n = 42) of the percentage of the total number of sequence tags mapped to each individual chromosome (% ChrN; FIG. 30A) when indexed sequencing libraries were prepared on a solid surface according to the one-step method and sequenced as 6-plex, and the resulting percentage of sequence tags as a function of chromosome size (FIG. 30B).
Fig. 31 shows the percentage of sequence tags mapped to the Y chromosome (ChrY) relative to the percentage of tags mapped to the X chromosome (ChrX).
Fig. 32A and 32B show the distribution of chromosome dose for chromosome 21 determined from sequencing cfDNA extracted from a set of 48 blood samples obtained from human subjects each pregnant with a male or female fetus. Chromosome 21 doses for qualified samples, i.e., samples normal for chromosome 21 (O), and for trisomy 21 test samples (Δ) are shown for chromosomes 1-12 and X (fig. 32A), and for chromosomes 1-22 and X (fig. 32B).
Fig. 33A and 33B show the distribution of chromosome dose for chromosome 18 determined from sequencing cfDNA extracted from a set of 48 blood samples obtained from human subjects each pregnant with a male or female fetus. Chromosome 18 doses for qualified samples, i.e., samples normal for chromosome 18 (O), and for trisomy 18 test samples (Δ) are shown for chromosomes 1-12 and X (fig. 33A), and for chromosomes 1-22 and X (fig. 33B).
Fig. 34A and 34B show the distribution of chromosome dose for chromosome 13 determined from sequencing cfDNA extracted from a set of 48 blood samples obtained from human subjects each pregnant with a male or female fetus. Chromosome 13 doses for qualified samples, i.e., samples normal for chromosome 13 (O), and for trisomy 13 test samples (Δ) are shown for chromosomes 1-12 and X (fig. 34A), and for chromosomes 1-22 and X (fig. 34B).
Fig. 35A and 35B show the distribution of chromosome dose for chromosome X determined from sequencing cfDNA extracted from a set of 48 test blood samples obtained from human subjects each pregnant with a male or female fetus. Chromosome X doses for male (46, XY; (O)), female (46, XX; (Δ)), monosomy X (45, X; (+)), and complex karyotype (Cplx (X)) samples are shown for chromosomes 1-12 and X (fig. 35A), and for chromosomes 1-22 and X (fig. 35B).
Fig. 36A and 36B show the distribution of chromosome dose for chromosome Y determined from sequencing cfDNA extracted from a set of 48 test blood samples obtained from human subjects each pregnant with a male or female fetus. Chromosome Y doses for male (46, XY; (Δ)), female (46, XX; (O)), monosomy X (45, X; (+)), and complex karyotype (Cplx (X)) samples are shown for chromosomes 1-12 (fig. 36A), and for chromosomes 1-22 (fig. 36B).
Figure 37 shows the coefficient of variation (CV) for chromosomes 21 (■), 18 (●), and 13 (▲) determined from the doses shown in figures 32A and 32B, 33A and 33B, and 34A and 34B, respectively.
Fig. 38 shows the Coefficient of Variation (CV) for chromosomes X (■) and Y (●) determined from the doses shown in fig. 35A and 35B and 36A and 36B, respectively.
Fig. 39 shows the cumulative distribution of GC fraction by human chromosome. The vertical axis represents the frequency of chromosomes having a GC content lower than the value shown on the horizontal axis.
Figure 40 shows the sequence dose (Y-axis) for a segment of chromosome 11 (81000082-103000103 bp) determined from sequencing cfDNA extracted from a set of 7 qualified samples (O) and 1 test sample (°) obtained from pregnant human subjects. A sample from a subject pregnant with a fetus with a partial aneuploidy of chromosome 11 (°) was identified.
Fig. 41A-41E show the distribution of normalized chromosome dose for chromosome 21 (41A), chromosome 18 (41B), chromosome 13 (41C), chromosome X (41D), and chromosome Y (41E) relative to the standard deviation of the mean (Y-axis) of the corresponding chromosomes in unaffected samples.
Fig. 42 shows the normalized chromosome values for chromosomes 21 (O), 18 (Δ), and 13 (□) determined in samples from training set 1 using the normalizing chromosomes as described in example 12.
Fig. 43 shows the normalized chromosome values for chromosomes 21 (O), 18 (Δ), and 13 (□) determined in samples from test set 1 using the normalizing chromosomes as described in example 12.
Fig. 44 shows the normalized chromosome values for chromosomes 21 (O) and 18 (Δ) determined in samples from test set 1 using the normalization method of Chiu et al. (normalization of the number of sequence tags identified for the chromosome of interest to the number of sequence tags obtained for the remaining chromosomes in the sample; see example 13 elsewhere in this application).
Fig. 45 shows the normalized chromosome values for chromosomes 21 (O), 18 (Δ), and 13 (□) determined in samples from training set 1 using the systematically determined normalizing chromosomes (as described in example 13).
Fig. 46A and 46B show the normalized chromosome values for chromosomes X (X-axis) and Y (Y-axis). Arrows point to the 5 (fig. 46A) and 3 (fig. 46B) monosomy X samples identified in the training set and test set, respectively, as described in example 13.
Fig. 47 shows the normalized chromosome values for chromosomes 21 (O), 18 (Δ), and 13 (□) determined in samples from test set 1 using the systematically determined normalizing chromosomes (as described in example 13).
Figure 48 shows the normalized chromosome values for chromosome 9 (O) determined in samples from test set 1 using the systematically determined normalizing chromosomes (as described in example 13).
Figure 49 shows the normalized chromosome values for chromosomes 1-22 determined in samples from test set 1 using the systematically determined normalizing chromosomes (as described in example 13).
FIG. 50 shows a flow chart of design (A) and random sampling scheme (B) of the study described in example 16.
Fig. 51A to 51F show flow charts of the analysis of chromosomes 21, 18, and 13 (fig. 51A to 51C, respectively) and of the gender analysis for female, male, and monosomy X (fig. 51D to 51F, respectively). Ovals contain results obtained from sequencing information from the laboratory, rectangles contain karyotype results, and rounded rectangles show the comparative results used to determine test performance (sensitivity and specificity). The dashed lines in fig. 51A and 51B represent the relationship between chimeric samples of T21 (n = 3) and T18 (n = 1), which were examined by analysis of chromosomes 21 and 18, respectively, but were correctly determined as described in example 16.
Figure 52 shows normalized chromosome values (NCV) versus karyotype classification for chromosomes 21 (●), 18 (■), and 13 (▲) in the test samples of the study described in example 16. Circled samples represent unclassified samples with a trisomy karyotype.
FIG. 53 shows normalized chromosome X values (NCV) versus karyotype classification for the gender classification of the test samples of the study described in example 16. Samples with female karyotypes (O), samples with male karyotypes (●), samples with 45,X (□), and samples with other karyotypes (i.e., XXX, XXY, and XYY) (■) are shown.
Figure 54 shows a plot of normalized chromosome Y values versus normalized chromosome X values for the test samples of the clinical study described in example 16. Euploid male and female samples (○), XXX samples (●), 45,X samples (X), XYY samples (■), and XXY samples (▲) are shown. The dashed lines show the thresholds used to classify the samples as described in example 16.
Fig. 55 schematically shows one embodiment of the CNV determination method described herein.
FIG. 56 shows a graph of the percent "ff" determined using the dose of chromosome 21 ($ff_{21}$) as a function of the percent "ff" determined using the dose of chromosome X ($ff_{X}$) in synthetic maternal samples (1) from example 17 containing DNA from a child with trisomy 21.
FIG. 57 shows a graph of the percent "ff" determined using the dose of chromosome 7 ($ff_{7}$) as a function of the percent "ff" determined using the dose of chromosome X ($ff_{X}$) in synthetic maternal samples (2) from example 17 containing DNA from a euploid mother and her child carrying a partial deletion of chromosome 7.
FIG. 58 shows a graph of the percent "ff" determined using the dose of chromosome 15 ($ff_{15}$) as a function of the percent "ff" determined using the dose of chromosome X ($ff_{X}$) in synthetic maternal samples (3) from example 17 containing DNA from a euploid mother and her child, who is 25% chimeric with a partial duplication of chromosome 15.
FIG. 59 shows a graph of the percent "ff" determined using the dose of chromosome 22 ($ff_{22}$), and the NCV obtained therefrom, in artificial samples (4) from example 17 comprising 0% child DNA (i), 10% DNA from an unaffected twin known not to have a partial chromosomal aneuploidy of chromosome 22 (ii), and 10% DNA from an affected twin known to have a partial chromosomal aneuploidy of chromosome 22 (iii).
Fig. 60 shows a graph of CNffx versus CNff21 relationships determined in samples including fetal T21 trisomy from example 18.
Fig. 61 shows a graph of CNffx versus CNff18 relationships determined in samples including fetal T18 trisomy from example 18.
Fig. 62 shows a graph of CNffx versus CNff13 relationships determined in samples including fetal T13 trisomies from example 18.
Figure 63 shows a plot of the NCV values for chromosomes 1 through 22 and X in the test samples from example 19.
Fig. 64 shows the fetal fractions obtained for the sample with a female fetus with T21 in example 18.
Fig. 65 shows an embodiment of a medical analysis device for determining fetal fraction as a function of copy number variation present in a fetal genome.
Fig. 66 shows one embodiment of a medical analysis apparatus for determining fetal fraction to classify copy number variations in a fetal genome.
Figure 67 shows a kit that includes test control reagents and reagents for tracking and verifying the integrity of maternal cfDNA samples subjected to massively parallel sequencing.
FIG. 68 shows a kit comprising a blood collection device, DNA extraction reagents, and control reagents for testing a maternal DNA sample.
FIG. 69A, FIG. 69B, FIG. 69C show NCV plots of the internal positive control [ □ ] and maternal sample [ diamond ] examined for copy number variation of chromosomes 13, 18, and 21.
Detailed Description
The disclosed embodiments relate to methods, devices, and systems for determining Copy Number Variation (CNV) of sequences of interest in a test sample comprising a mixture of nucleic acids that are known or suspected to differ in the amount of one or more sequences of interest. Sequences of interest include, for example, sequences of genomic segments ranging from kilobases (kb) to megabases (Mb) to entire chromosomes, which are known or suspected to be associated with genetic or disease conditions. Examples of sequences of interest include chromosomes associated with well-known aneuploidies (e.g., trisomy 21) and segments of chromosomes that are multiplied in diseases such as cancer, such as partial trisomy 8 in acute myeloid leukemia. CNV that can be determined according to the method include monosomies and trisomies of any one or more of autosomes 1-22 and sex chromosomes X and Y (e.g., 45,X, 47,XXX, 47,XXY, and 47,XYY); other chromosomal polysomies, i.e., tetrasomy and pentasomy (including but not limited to XXXX, XXXXX, XXXXY, and XYYYY); and deletions and/or duplications of segments of any one or more of the chromosomes.
The methods are statistical methods that are implemented on one or more processors and that account for the accumulated variability arising from process-related, inter-chromosomal (intra-run), and inter-sequencing (inter-run) variability. These methods are applicable to determining CNV of any fetal aneuploidy, as well as CNV known or suspected to be associated with a variety of medical conditions.
The practice of the present invention involves, unless otherwise indicated, conventional techniques and equipment commonly used in the fields of molecular biology, microbiology, protein purification, protein engineering, protein and DNA sequencing, and recombinant DNA, which are within the skill of the art. Such techniques and equipment are known to those of ordinary skill in the art and are described in numerous documents and reference works (see, e.g., Sambrook et al., "Molecular Cloning: A Laboratory Manual," Third Edition (Cold Spring Harbor) [2001]; and Ausubel et al., "Current Protocols in Molecular Biology" [1987]).
Numerical ranges include the numbers defining the range. It is intended that every maximum numerical limitation given throughout this specification includes every lower numerical limitation, as if such lower numerical limitations were expressly written herein. Every minimum numerical limitation given throughout this specification will include every higher numerical limitation, as if such higher numerical limitations were expressly written herein. Every numerical range given throughout this specification will include every narrower numerical range that falls within such broader numerical range, as if such narrower numerical ranges were all expressly written herein.
The headings provided herein are not intended to be limiting of the disclosure.
Unless defined otherwise herein, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Various scientific dictionaries containing terms contained herein are well known and available to those skilled in the art. Although any methods and materials similar or equivalent to those described herein find use in the practice or testing of the embodiments disclosed herein, only some preferred methods and materials are described.
The terms defined immediately below are more fully described by reference to the specification as a whole. It is to be understood that this disclosure is not limited to the particular methodology, protocols, and reagents described, as these may vary depending on the context in which they are used by those skilled in the art.
Definitions
As used herein, the singular terms "a", "an", and "the" include plural references unless the context clearly dictates otherwise. Unless otherwise indicated, nucleic acids are written left to right in the 5 'to 3' direction and amino acid sequences are written left to right in the amino to carboxy direction, respectively.
The term "assessing," when used herein in the context of analyzing a nucleic acid sample for CNV, refers to characterizing the status of a chromosomal or segment aneuploidy by one of three types of calls: "normal" or "unaffected," "affected," and "no call." Thresholds for calling normal and affected are typically set. A parameter related to aneuploidy in the sample is measured, and the measured value is compared to the thresholds. For duplication-type aneuploidies, an affected call is made if the chromosome or segment dose (or other measure of sequence content) exceeds a defined threshold set for affected samples, and a normal call is made if the chromosome or segment dose is below the threshold set for normal samples. In contrast, for deletion-type aneuploidies, an affected call is made if the chromosome or segment dose is below a defined threshold for affected samples, and a normal call is made if the chromosome or segment dose exceeds a threshold set for normal samples. For example, in the presence of a trisomy, a "normal" call is determined by the value of a parameter, such as the test chromosome dose, being below a user-defined reliability threshold, and an "affected" call is determined by the parameter exceeding a user-defined reliability threshold. A "no call" result is determined by the parameter lying between the thresholds for making a "normal" or an "affected" call. The term "no call" is used interchangeably with "unclassified."
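The three-way assessment defined above can be sketched as a threshold comparison. The function name, argument names, and the example threshold values are hypothetical; only the direction of each comparison for duplication-type and deletion-type aneuploidies follows the definition.

```python
def assess(value, normal_threshold, affected_threshold, cnv_type="duplication"):
    """Return "normal", "affected", or "unclassified" for a measured parameter
    (e.g., a chromosome dose or a value derived from it).

    For duplications: affected above affected_threshold, normal below
    normal_threshold. For deletions the directions are reversed.
    """
    if cnv_type == "duplication":
        if normal_threshold > affected_threshold:
            raise ValueError("expected normal_threshold <= affected_threshold")
        if value > affected_threshold:
            return "affected"
        if value < normal_threshold:
            return "normal"
    elif cnv_type == "deletion":
        if normal_threshold < affected_threshold:
            raise ValueError("expected normal_threshold >= affected_threshold")
        if value < affected_threshold:
            return "affected"
        if value > normal_threshold:
            return "normal"
    else:
        raise ValueError("cnv_type must be 'duplication' or 'deletion'")
    # The value lies between the two thresholds.
    return "unclassified"


# Hypothetical thresholds, for illustration only.
dup_calls = [assess(v, 2.5, 4.0) for v in (1.0, 3.0, 5.0)]
del_calls = [assess(v, -2.5, -4.0, "deletion") for v in (0.0, -3.0, -5.0)]
```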
The term "copy number variation" as used herein refers to a change in the copy number of a nucleic acid sequence present in a test sample as compared to the copy number of a nucleic acid sequence present in a qualified sample. In certain embodiments, the nucleic acid sequence is 1kb or greater. In some cases, the nucleic acid sequence is a whole chromosome or a significant portion thereof. "copy number variant" refers to a nucleic acid sequence whose copy number difference is found by comparing the sequence of interest in a test sample to the expected amount of the sequence of interest. For example, the amount of the sequence of interest in the test sample is compared to the amount of the sequence of interest present in a qualified sample. Copy number variants/variations include deletions (including microdeletions), insertions (including microinsertions), duplications, inversions, translocations, and complex multi-position variations. CNV encompasses chromosomal aneuploidies and partial aneuploidies.
The term "aneuploidy" herein refers to an imbalance of genetic material caused by loss or gain of an entire chromosome, or a portion of a chromosome.
The terms "chromosomal aneuploidy" and "intact chromosomal aneuploidy" refer herein to an imbalance of genetic material resulting from the loss or gain of an entire chromosome, and include germline aneuploidy and mosaic aneuploidy.
The terms "partial aneuploidy" and "partial chromosomal aneuploidy" herein refer to an imbalance of genetic material resulting from the loss or gain of a portion of a chromosome (e.g., partial monosomy and partial trisomy), and encompass imbalances resulting from translocations, deletions, and insertions.
The term "aneuploid sample" as used herein refers to a sample indicating that the chromosomal content of a subject is not euploid, i.e., the sample indicates that the subject has an abnormal copy number of a chromosome or of a portion of a chromosome.
The term "aneuploid chromosome" as used herein refers to a chromosome that is known or determined to be present in a sample in an abnormal copy number.
The term "plurality" herein refers to more than one. For example, the term is used herein to refer to a number of nucleic acid molecules or sequence tags that is sufficient to identify significant differences in copy number variation (e.g., chromosome dose) between a test sample and a qualified sample using the methods disclosed herein. In some embodiments, at least about 3 × 10^6 sequence tags, at least about 5 × 10^6 sequence tags, at least about 8 × 10^6 sequence tags, at least about 10 × 10^6 sequence tags, at least about 15 × 10^6 sequence tags, at least about 20 × 10^6 sequence tags, at least about 30 × 10^6 sequence tags, at least about 40 × 10^6 sequence tags, or at least about 50 × 10^6 sequence tags, each comprising a read of between about 20 and 40 bp, are obtained for each test sample.
The terms "polynucleotide", "nucleic acid" and "nucleic acid molecule" are used interchangeably and refer to a covalently linked nucleotide sequence (i.e., ribonucleotides of RNA and deoxyribonucleotides of DNA) in which the 3 'position of the pentose of one nucleotide is linked by a phosphodiester group to the 5' position of the pentose of the next nucleotide, including sequences of any form of nucleic acid, including but not limited to RNA and DNA molecules, such as cfDNA molecules. The term "polynucleotide" includes, but is not limited to, single-stranded and double-stranded polynucleotides.
The term "portion" is used herein to refer to the amount of sequence information of fetal and maternal nucleic acid molecules in a biological sample, which amounts to less than the sequence information of a human genome.
The term "test sample" as used herein refers to a sample comprising a nucleic acid or mixture of nucleic acids comprising at least one nucleic acid sequence to be screened for copy number variation, typically derived from a biological fluid, cell, tissue, organ, or organism. In certain embodiments, the sample comprises at least one nucleic acid sequence suspected of having a variation in its copy number. Such samples include, but are not limited to, sputum/saliva, amniotic fluid, blood clot or fine needle biopsy samples (e.g., surgical biopsy, fine needle biopsy, etc.), urine, peritoneal fluid, pleural fluid, and the like. Although samples are often taken from human subjects (e.g., patients), the assays can be used to detect copy number variations (CNVs) in samples from any mammal including, but not limited to, dogs, cats, horses, goats, sheep, cattle, pigs, and the like. The sample may be used directly as obtained from the biological source or after pretreatment to modify the character of the sample. For example, the pretreatment may include preparing plasma from blood, diluting viscous fluids, and the like. Methods of pretreatment may also include, but are not limited to, filtration, precipitation, dilution, distillation, mixing, centrifugation, freezing, lyophilization, concentration, amplification, nucleic acid fragmentation, inactivation of interfering components, addition of reagents, lysis, and the like. If such pretreatment methods are applied to a sample, they are typically such that the one or more nucleic acids of interest are retained in the test sample, preferably at a concentration proportional to that in an untreated test sample (i.e., a sample that has not been subjected to any such pretreatment method). For the methods described herein, such "treated" or "processed" samples are still considered biological "test" samples.
The term "qualified sample" as used herein refers to a sample that includes a mixture of nucleic acids present at known copy numbers, against which the nucleic acids in a test sample are compared, and which is a normal sample, i.e., not an aneuploid sample, for the sequence of interest. In certain embodiments, qualified samples are used to identify one or more normalizing chromosomes or segments for the chromosome under consideration. For example, qualified samples may be used to identify a normalizing chromosome for chromosome 21. In this case, a qualified sample is one that is not a trisomy 21 sample. Qualified samples may also be used to determine the thresholds for calling affected samples.
The term "training set" as used herein refers to a set of samples that may include both affected and unaffected samples and that are used to develop a model for analyzing a test sample. Unaffected samples in the training set can be used as qualified samples to identify normalizing sequences, such as normalizing chromosomes, and the chromosome dose of unaffected samples is used to set a threshold for each of these sequences of interest (e.g., chromosomes). The affected samples in a training set can be used to verify that the affected test samples can be readily distinguished from the unaffected samples.
The term "qualified nucleic acid" is used interchangeably with "qualified sequence," which is a sequence against which a test sequence or test nucleic acid is compared. A qualified sequence is a sequence that is present in a biological sample, preferably at a known representation (i.e., the amount of the qualified sequence is known). In general, a qualified sequence is a sequence that is present in a "qualified sample". A "qualified sequence of interest" is a qualified sequence for which the amount in a qualified sample is known, and it is a sequence associated with a difference in sequence representation in an individual with a medical condition.
The term "sequence of interest" as used herein refers to a nucleic acid sequence that is associated with a difference in sequence representation in healthy versus diseased individuals. A sequence of interest may be a sequence on a chromosome that is misrepresented, i.e., over- or under-represented, in a disease or genetic condition. A sequence of interest may be a portion of a chromosome (i.e., a chromosome segment) or a chromosome. For example, a sequence of interest may be a chromosome that is over-represented in an aneuploidy condition, or a gene encoding a tumor suppressor that is under-represented in a cancer. Sequences of interest include sequences that are over- or under-represented in the total population, or in a subpopulation, of cells of a subject. A "qualified sequence of interest" is a sequence of interest in a qualified sample. A "test sequence of interest" is a sequence of interest in a test sample.
The term "normalizing sequence" refers herein to a sequence that is used to normalize the number of sequence tags mapped to a sequence of interest associated with the normalizing sequence. In certain embodiments, the normalizing sequence displays a variability in the number of sequence tags mapped to it, among samples and sequencing runs, that approximates the variability of the sequence of interest for which it is used as a normalizing parameter, and it allows an affected sample to be distinguished from one or more unaffected samples. In certain implementations, the normalizing sequence distinguishes an affected sample from one or more unaffected samples best, or effectively, when compared with other potential normalizing sequences, such as other chromosomes. A "normalizing chromosome" or "normalizing chromosome sequence" is an example of a "normalizing sequence"; it may be composed of a single chromosome or of a group of chromosomes. A "normalizing segment" is another example of a "normalizing sequence". A "normalizing segment sequence" may consist of a single segment of a chromosome, or it may consist of two or more segments of the same or of different chromosomes. In certain embodiments, the normalizing sequence is used to normalize for variability such as process-related variability, inter-chromosomal (intra-run) variability, and inter-sequencing (inter-run) variability.
The term "resolvability" as used herein refers to a characteristic of a normalizing chromosome that enables it to distinguish one or more unaffected (i.e., normal) samples from one or more affected (i.e., aneuploid) samples.
The term "sequence dose" herein refers to a parameter that relates the number of sequence tags identified for a sequence of interest to the number of sequence tags identified for a normalizing sequence. In some cases, the sequence dose is the ratio of the number of sequence tags identified for the sequence of interest to the number of sequence tags identified for the normalizing sequence. In some cases, sequence dose refers to a parameter that relates the sequence tag density of a sequence of interest to the tag density of a normalizing sequence. A "test sequence dose" is a parameter that relates the sequence tag density of a sequence of interest (e.g., chromosome 21) to the sequence tag density of a normalizing sequence (e.g., chromosome 9) determined in a test sample. Similarly, a "qualified sequence dose" is a parameter that relates the sequence tag density of a sequence of interest to the tag density of a normalizing sequence determined in a qualified sample.
The term "sequence tag density" herein refers to the number of sequence reads that map to a reference genome sequence; e.g., the sequence tag density for chromosome 21 is the number of sequence reads generated by the sequencing method that map back to chromosome 21 of the reference genome. The term "sequence tag density ratio" as used herein refers to the ratio of the number of sequence tags mapped to a chromosome of the reference genome (e.g., chromosome 21) to the length of that chromosome of the reference genome.
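Where the sequence dose is taken as a simple ratio of tag counts, it can be computed as in the following minimal sketch. The function name `sequence_dose`, the dictionary layout, and the tag counts are hypothetical; the normalizing sequence is allowed to be a single chromosome or a group of chromosomes, in which case the tag counts of the group are summed.

```python
def sequence_dose(tag_counts, interest, normalizing):
    """Ratio of tags mapped to the sequence of interest to tags mapped to
    the normalizing sequence(s). All names here are illustrative assumptions.

    tag_counts:  dict mapping sequence name -> number of mapped sequence tags
    interest:    name of the sequence of interest, e.g. "chr21"
    normalizing: iterable of names forming the normalizing sequence
    """
    denominator = sum(tag_counts[name] for name in normalizing)
    if denominator == 0:
        raise ValueError("normalizing sequence has no mapped tags")
    return tag_counts[interest] / denominator


# Made-up tag counts for illustration only.
counts = {"chr21": 150, "chr9": 600, "chr1": 400}
dose_single = sequence_dose(counts, "chr21", ["chr9"])          # 150 / 600
dose_group = sequence_dose(counts, "chr21", ["chr9", "chr1"])   # 150 / 1000
```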
The term "Next Generation Sequencing (NGS)" herein refers to a sequencing method that allows massively parallel sequencing of clonally amplified molecules and individual nucleic acid molecules. Non-limiting examples of NGS include sequencing-by-synthesis using reversible dye terminators, and sequencing-by-ligation.
The term "parameter" herein refers to a numerical value that characterizes a physical property. Frequently, a parameter numerically characterizes a quantitative data set and/or a numerical relationship between quantitative data sets. For example, the ratio (or a function of the ratio) between the number of sequence tags mapped to a chromosome and the length of the chromosome to which the tags are mapped is a parameter.
The terms "threshold" and "eligibility threshold" herein refer to any number that is used as a cutoff to characterize a sample, such as a test sample containing nucleic acid from an organism suspected of having a medical condition. The threshold may be compared to a parameter value to determine whether the sample giving rise to that parameter value indicates that the organism has the medical condition. In certain embodiments, the eligibility threshold is calculated using a qualified data set and serves as a limit for diagnosing a copy number variation, such as an aneuploidy, in the organism. If a result obtained from the methods disclosed herein exceeds the threshold, the subject may be diagnosed with a copy number variation, e.g., trisomy 21. Appropriate thresholds for the methods described herein can be identified by analyzing the normalized values (e.g., chromosome dose, NCV, or NSV) calculated for a training set of samples. Thresholds may be identified using the qualified (i.e., unaffected) samples in a training set that includes both qualified (i.e., unaffected) samples and affected samples. The samples in the training set known to have a chromosomal aneuploidy (i.e., the affected samples) can be used to confirm that the chosen thresholds are useful in distinguishing affected samples from unaffected samples in a test set (see the Examples herein). The choice of threshold depends on the level of confidence with which the user wishes to make the classification. In some embodiments, the training set used to identify appropriate thresholds comprises at least 10, at least 20, at least 30, at least 40, at least 50, at least 60, at least 70, at least 80, at least 90, at least 100, at least 200, at least 300, at least 400, at least 500, at least 600, at least 700, at least 800, at least 900, at least 1000, at least 2000, at least 3000, at least 4000, or more qualified samples.
It may be advantageous to use a larger set of qualified samples to improve the diagnostic utility of the threshold.
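One way such a threshold could be derived from the qualified samples of a training set is sketched below. The rule used here, the mean of the qualified doses plus `k` sample standard deviations, is an assumption chosen for illustration and is not stated in this passage; the function name `dose_threshold` and the default `k=3.0` are likewise hypothetical.

```python
from statistics import mean, stdev


def dose_threshold(qualified_doses, k=3.0):
    """Upper cutoff for a duplication-type call, computed from the doses of
    qualified (unaffected) training samples.

    Assumption for illustration: cutoff = mean + k * sample standard deviation.
    A larger k reflects a stricter confidence requirement for an "affected" call.
    """
    if len(qualified_doses) < 2:
        raise ValueError("at least two qualified samples are required")
    return mean(qualified_doses) + k * stdev(qualified_doses)
```

A larger set of qualified samples gives more stable estimates of the mean and standard deviation, which is consistent with the preference for larger training sets stated above.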
The term "normalized value" herein refers to a value that relates the number of sequence tags identified for a sequence of interest (e.g., a chromosome or chromosome segment) to the number of sequence tags identified for a normalizing sequence (e.g., a normalizing chromosome or a normalizing chromosome segment). For example, the "normalized value" may be the chromosome dose as described elsewhere in the application, or it may be the NCV (normalized chromosome value) as described elsewhere in the application, or it may be the NSV (normalized segment value) as described elsewhere in the application.
The term "read" refers to a sequence read from a portion of a nucleic acid sample. Typically, though not necessarily, a read represents a short sequence of contiguous base pairs in the sample. The read may be represented symbolically by the base pair sequence (in A, T, C, and G) of the sample portion. The read may be stored in a memory device and processed as appropriate to determine whether it matches a reference sequence or meets other criteria. A read may be obtained directly from a sequencing apparatus or indirectly from stored sequence information concerning the sample. In some cases, the term "read" refers to a DNA sequence of sufficient length (e.g., at least about 30 bp) that it can be used to identify a larger sequence or region, e.g., that can be aligned and specifically assigned to a chromosome, genomic region, or gene.
The term "sequence tag" is used interchangeably herein with the term "mapped sequence tag" and refers to a sequence read that has been specifically assigned (i.e., mapped) to a larger sequence (e.g., a reference genome) by alignment. Mapped sequence tags are uniquely mapped to the reference genome, i.e., they are assigned to a single location in the reference genome. Tags may be provided as data structures or other assemblages of data. In certain embodiments, a tag comprises the read sequence and information associated with the read, such as the location of the sequence in the genome, e.g., the position on a chromosome. In certain embodiments, the position is specified for the plus strand orientation. Tags can be defined to allow a limited number of mismatches when aligned to a reference genome. Tags that can be mapped to more than one location in the reference genome (i.e., tags that do not map uniquely) may be excluded from the analysis.
As used herein, the term "aligned" refers to the process of comparing a read or tag to a reference sequence and thereby determining whether the reference sequence contains the read sequence. If the reference sequence contains the read, the read can be mapped to the reference sequence or, in some embodiments, to a specific location in the reference sequence. In some cases, the alignment simply tells whether or not a read is a member of a particular reference sequence (i.e., whether the read is present or absent in the reference sequence). For example, aligning a read to the reference sequence of human chromosome 13 will tell whether the read is present in the reference sequence of chromosome 13. A tool that provides this information may be called a set membership tester. In some cases, the alignment additionally indicates the location in the reference sequence to which the read or tag maps. For example, if the reference sequence is the whole human genome sequence, the alignment can indicate that a read is present on chromosome 13, and can further indicate that the read is on a particular strand and/or site of chromosome 13.
Aligned reads or tags are one or more sequences that are identified as a match, in terms of the order of their nucleotides, to a known sequence from a reference genome. Alignment can be done manually, although it is typically implemented by a computer algorithm, because it would be impossible to align reads manually within a reasonable time period for carrying out the methods disclosed herein. One example of an algorithm for aligning sequences is the Efficient Local Alignment of Nucleotide Data (ELAND) computer program distributed as part of the Illumina Genomics Analysis pipeline. Alternatively, a Bloom filter or similar set membership tester may be used to align reads to a reference genome. See U.S. patent application No. 61/552,374, filed on October 27, 2011, which is incorporated by reference herein in its entirety. The match of a sequence read in an alignment can be a 100% sequence match or less than 100% (a non-perfect match).
As used herein, the term "reference genome" or "reference sequence" refers to any specific known genomic sequence (whether partial or complete) of any organism or virus that can be used as a reference for sequences identified from a subject. For example, reference genomes for human subjects, as well as for many other organisms, can be found at the National Center for Biotechnology Information at www.ncbi.nlm.nih.gov. A "genome" refers to the complete genetic information of an organism or virus, expressed as nucleic acid sequences.
In various embodiments, the reference sequence is significantly larger than the reads that are aligned to it. For example, it may be at least about 100 times larger, or at least about 1000 times larger, or at least about 10,000 times larger, or at least about 10^5 times larger, or at least about 10^6 times larger, or at least about 10^7 times larger.
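As a hedged illustration of one normalized value, the sketch below computes an NCV as a z-score: the test chromosome dose minus the mean dose of the qualified samples, divided by their standard deviation. This passage does not give the formula, so the z-score form, the function name `normalized_chromosome_value`, and the example doses are assumptions.

```python
from statistics import mean, stdev


def normalized_chromosome_value(test_dose, qualified_doses):
    """Assumed NCV form: (test dose - mean of qualified doses) / sample SD.

    qualified_doses: chromosome doses of the same chromosome of interest
    measured in qualified (unaffected) samples.
    """
    if len(qualified_doses) < 2:
        raise ValueError("at least two qualified samples are required")
    spread = stdev(qualified_doses)
    if spread == 0:
        raise ValueError("qualified doses have zero variance")
    return (test_dose - mean(qualified_doses)) / spread
```

Expressing the dose in units of standard deviations makes a single pair of cutoffs usable across chromosomes whose raw doses differ in scale.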
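The notion of a tag as a data structure, and the exclusion of reads that do not map uniquely, can be sketched as follows. The record layout (`Alignment`), the function name `unique_tags`, and the mismatch limit of 2 are hypothetical choices made for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    """One candidate placement of a read on the reference (illustrative layout)."""
    read_id: str
    chrom: str
    pos: int          # position on the plus strand
    mismatches: int


def unique_tags(alignments, max_mismatches=2):
    """Keep only reads with exactly one acceptable placement on the reference."""
    acceptable = [a for a in alignments if a.mismatches <= max_mismatches]
    placements = Counter(a.read_id for a in acceptable)
    return [a for a in acceptable if placements[a.read_id] == 1]


# Made-up alignments: r1 maps once, r2 maps to two locations, r3 has too many mismatches.
hits = [
    Alignment("r1", "chr21", 1000, 0),
    Alignment("r2", "chr13", 500, 1),
    Alignment("r2", "chr18", 900, 1),
    Alignment("r3", "chr21", 2000, 5),
]
tags = unique_tags(hits)
```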
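A Bloom filter used as a set membership tester can be sketched as below. This is a generic textbook construction, not the method of the referenced application: the class `BloomFilter`, the helper `index_reference`, the bit-array size, the number of hash functions, and the use of SHA-256 are all illustrative assumptions. The filter answers whether a fixed-length read is present in a reference sequence; it never reports a stored sequence as absent, but it may occasionally report an absent one as present.

```python
import hashlib


class BloomFilter:
    """Probabilistic set: no false negatives, a small rate of false positives."""

    def __init__(self, n_bits=1 << 16, n_hashes=4):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8 + 1)

    def _positions(self, item):
        # Derive n_hashes bit positions by salting a single hash function.
        for i in range(self.n_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


def index_reference(reference, k):
    """Store every length-k substring of the reference in a Bloom filter."""
    bloom = BloomFilter()
    for i in range(len(reference) - k + 1):
        bloom.add(reference[i:i + k])
    return bloom


# Toy reference for illustration only; real references are far longer.
reference = "ACGTACGTTAGC"
membership = index_reference(reference, k=6)
read_is_present = "ACGTAC" in membership
```

A tester of this kind reports only presence or absence; it does not give the location of the read in the reference.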
In one example, the reference sequence is a sequence of a full-length human genome. These sequences may be referred to as genomic reference sequences. In another example, the reference sequence is limited to a specific human chromosome, such as chromosome 13. These sequences may be referred to as chromosomal reference sequences. Other examples of reference sequences include genomes of other species, as well as chromosomes, sub-chromosomal regions (e.g., strands) of any species, and the like.
In various embodiments, the reference sequence is a consensus sequence or other combination derived from multiple individuals. However, in some applications, the reference sequence may be taken from a particular individual.
The term "artificial target sequence genome" as used herein refers to a group of known sequences encompassing alleles of known polymorphic sites. For example, a "SNP reference genome" is an artificial target sequence genome comprising a sequence group encompassing alleles of known SNPs.
The term "clinically relevant sequence" as used herein refers to a nucleic acid sequence that is known to be, or suspected of being, associated with or implicated in a genetic or disease condition. Determining the presence or absence of clinically relevant sequences can be useful in determining or confirming the diagnosis of a medical condition, or providing a prediction of the progression of a disease.
The term "derived", when used in the context of a nucleic acid or a mixture of nucleic acids, refers herein to the means by which the nucleic acid or nucleic acids are obtained from the source from which they originate. For example, in one embodiment, a mixture of nucleic acids derived from two different genomes means that the nucleic acids (e.g., cfDNA) were naturally released by cells through naturally occurring processes such as necrosis or apoptosis. In another embodiment, a mixture of nucleic acids derived from two different genomes means that the nucleic acids were extracted from two different types of cells from a subject.
The term "patient sample" as used herein refers to a biological sample obtained from a patient (i.e., the recipient of medical aid, care or treatment). The patient sample may be any sample described herein. In certain embodiments, the patient sample is obtained by a non-invasive procedure, such as a peripheral blood sample or a stool sample. The methods described herein are not necessarily limited to humans. Thus, different veterinary applications are contemplated, in which case the patient sample may be a sample from a non-human mammal (e.g., cat, pig, horse, cow, etc.).
The term "mixed sample" herein refers to a sample containing a mixture of nucleic acids derived from different genomes.
The term "maternal sample" as used herein refers to a biological sample obtained from a pregnant subject (e.g., a female).
The term "biological fluid" herein refers to a liquid taken from a biological source and includes, for example, blood, serum, plasma, sputum, lavage fluid, cerebrospinal fluid, urine, semen, sweat, tears, saliva, and the like. As used herein, the terms "blood", "plasma", and "serum" expressly encompass fractions or processed portions thereof. Likewise, where a sample is taken from a biopsy, swab, smear, etc., the "sample" expressly encompasses a processed fraction or portion derived from the biopsy, swab, smear, etc.
The terms "maternal nucleic acid" and "fetal nucleic acid" herein refer to nucleic acid of a pregnant female subject and nucleic acid of a fetus carried by the pregnant female, respectively.
As used herein, the term "corresponding to" sometimes refers to a nucleic acid sequence, e.g., a gene or a chromosome, that is present in the genomes of different subjects and that does not necessarily have the same sequence in all genomes, but that serves to provide the identity, rather than the genetic information, of a sequence of interest, e.g., a gene or chromosome.
As used herein, the term "substantially cell-free" encompasses preparations of a desired sample from which the cellular components normally associated with it have been removed. For example, a plasma sample is rendered substantially cell-free by removing the blood cells, such as red blood cells, that are normally associated with it. In certain embodiments, substantially cell-free samples are processed to remove cells that would otherwise contribute to the genetic material that is to be tested for a CNV.
As used herein, the term "fetal fraction" refers to the fraction of fetal nucleic acid present in a sample comprising fetal and maternal nucleic acid. Fetal fraction is often used to characterize cfDNA in maternal blood.
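As a small numerical illustration of this definition, the fraction can be computed from the amounts of fetal and maternal nucleic acid in the sample. The function name `fetal_fraction` and the amounts are hypothetical, and the sketch says nothing about how those amounts would be measured.

```python
def fetal_fraction(fetal_amount, maternal_amount):
    """Fraction of the nucleic acid in a mixed sample that is fetal.

    Both amounts must be in the same units (e.g., genome equivalents);
    the names and units are illustrative assumptions.
    """
    total = fetal_amount + maternal_amount
    if fetal_amount < 0 or maternal_amount < 0 or total == 0:
        raise ValueError("amounts must be non-negative and not both zero")
    return fetal_amount / total


# Made-up amounts: 10 units fetal and 90 units maternal give a fraction of 0.1.
example_fraction = fetal_fraction(10, 90)
```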
As used herein, the term "chromosome" refers to the heredity-bearing gene carrier of a living cell, which is derived from chromatin and which comprises DNA and protein components (especially histones). The conventional, internationally recognized numbering system for individual human genome chromosomes is employed herein.
As used herein, the term "polynucleotide length" refers to the absolute number of nucleic acid molecules (nucleotides) in a sequence or in a region of a reference genome. The term "chromosomal length" refers to the known chromosomal length in base pairs, as provided, for example, in the NCBI36/hg18 collection of human chromosomes as found in world wide web genome, ucsc.
The term "subject" herein refers to human subjects as well as non-human subjects, such as mammals, invertebrates, vertebrates, fungi, yeasts, bacteria and viruses. Although the examples herein relate to humans and the language is primarily directed to human problems, the concepts disclosed herein are applicable to genomes from any plant or animal, and in the fields of veterinary medicine, zootechnics, research laboratories, and the like.
The term "condition" is used herein to refer to a "medical condition" as a broad term that includes all diseases and disorders, and that may also include injuries and normal health situations, such as pregnancy, that might affect a person's health, benefit from medical assistance, or have implications for medical treatment.
The term "intact" as used herein in reference to a chromosomal aneuploidy refers to the gain or loss of an entire chromosome.
The term "partial" when used in reference to a chromosomal aneuploidy refers herein to the gain or loss of a portion (i.e., a segment) of a chromosome.
The term "mosaic" is used herein to refer to the presence of two populations of cells with different karyotypes in one individual who has developed from a single fertilized egg. Mosaicism may result from a mutation during development that is propagated to only a subset of the adult cells.
The term "non-mosaic" as used herein refers to an organism, such as a human fetus, that is composed of cells of a single karyotype.
The term "using a chromosome" when used in reference to determining a chromosome dose refers herein to using the sequence information obtained for the chromosome, i.e., the number of sequence tags obtained for the chromosome.
The term "sensitivity" as used herein is equal to the number of true positives divided by the sum of true positives and false negatives.
The term "specificity" as used herein equals the number of true negatives divided by the sum of true negatives and false positives.
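The two definitions above translate directly into code. In this minimal sketch the function names and the counts in the example are hypothetical.

```python
def sensitivity(true_positives, false_negatives):
    """True positives divided by the sum of true positives and false negatives."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives, false_positives):
    """True negatives divided by the sum of true negatives and false positives."""
    return true_negatives / (true_negatives + false_positives)


# Made-up counts: 90 of 100 affected samples and 95 of 100 unaffected samples
# are classified correctly.
example_sensitivity = sensitivity(90, 10)
example_specificity = specificity(95, 5)
```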
The term "hypodiploid" as used herein refers to a chromosome number that is one or more lower than the normal haploid number of chromosomes characteristic of the species.
A "polymorphic site" is a locus at which nucleotide sequence divergence occurs. The locus may be as small as one base pair. Exemplary markers have at least two alleles, each occurring at a frequency of greater than 1%, and more typically greater than 10% or 20%, of a selected population. A polymorphic site may be the site of a single nucleotide polymorphism (SNP), a small-scale multi-base deletion or insertion, a multi-nucleotide polymorphism (MNP), or a short tandem repeat (STR). The terms "polymorphic locus" and "polymorphic site" are used interchangeably herein.
"polymorphic sequence" as used herein refers to a nucleic acid sequence, e.g., a DNA sequence, that includes one or more polymorphic sites, e.g., a SNP or a tandem SNP. Polymorphic sequences according to the present technology can be used to specifically distinguish maternal from non-maternal alleles in a maternal sample that includes a mixture of fetal and maternal nucleic acids.
As used herein, a "single nucleotide polymorphism" (SNP) occurs at a polymorphic site occupied by a single nucleotide, which is the site of variation between allelic sequences. The site is usually preceded and followed by sequences that are highly conserved among alleles (e.g., sequences that vary in less than 1/100 or 1/1000 members of the population). A SNP usually arises from the substitution of one nucleotide for another at the polymorphic site. A transition is the replacement of one purine by another purine or of one pyrimidine by another pyrimidine. A transversion is the replacement of a purine by a pyrimidine or of a pyrimidine by a purine. SNPs can also arise from the deletion of a nucleotide or the insertion of a nucleotide relative to a reference allele. Single nucleotide polymorphisms (SNPs) are positions at which two alternative bases occur at appreciable frequency (> 1%) in the human population, and they are the most common type of human genetic variation.
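The distinction between transitions and transversions can be made concrete with a short sketch. The function name `substitution_type` is hypothetical; the classification of A and G as purines and of C and T as pyrimidines is standard.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(ref_base, alt_base):
    """Classify a single-nucleotide substitution as a transition or a transversion."""
    ref_base, alt_base = ref_base.upper(), alt_base.upper()
    valid = PURINES | PYRIMIDINES
    if ref_base not in valid or alt_base not in valid:
        raise ValueError("bases must be one of A, C, G, T")
    if ref_base == alt_base:
        raise ValueError("reference and alternative bases are identical")
    # Same chemical class (purine<->purine or pyrimidine<->pyrimidine) is a transition.
    same_class = (ref_base in PURINES) == (alt_base in PURINES)
    return "transition" if same_class else "transversion"
```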
The term "tandem SNP" herein refers to two or more SNPs present within one polymorphic target nucleic acid sequence.
As used herein, the term "short tandem repeat" or "STR" refers to a class of polymorphisms that occurs when a pattern of two or more nucleotides is repeated and the repeated sequences are directly adjacent to each other. The pattern can range in length from 2 to 10 base pairs (bp) (e.g., (CATG)n in a genomic region) and is typically located in a non-coding intron region. By examining several STR loci and counting how many repeats of a specific STR sequence there are at a given locus, it is possible to create a genetic profile that is unique to an individual.
As used herein, the term "miniSTR" refers to a tandem repeat of four or more base pairs that spans less than about 300 base pairs, less than about 250 base pairs, less than about 200 base pairs, less than about 150 base pairs, less than about 100 base pairs, less than about 50 base pairs, or less than about 25 base pairs. "miniSTRs" are STRs that can be amplified from cfDNA templates.
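Counting the repeats of a given STR motif at a locus can be sketched as follows. The function name `max_tandem_repeats` and the example sequence are hypothetical; a regular-expression lookahead is used so that runs starting at every position are considered.

```python
import re


def max_tandem_repeats(sequence, motif):
    """Largest number of directly adjacent copies of motif found in sequence."""
    if not motif:
        raise ValueError("motif must not be empty")
    # The lookahead captures the longest run of the motif starting at each position.
    pattern = f"(?=((?:{re.escape(motif)})+))"
    runs = re.findall(pattern, sequence)
    return max((len(run) // len(motif) for run in runs), default=0)


# Made-up locus containing three adjacent copies of CATG.
repeat_count = max_tandem_repeats("TTCATGCATGCATGAA", "CATG")
```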
The terms "polymorphic target nucleic acid", "polymorphic sequence", "polymorphic target nucleic acid sequence", and "polymorphic nucleic acid" are used interchangeably herein to refer to a nucleic acid sequence (e.g., a DNA sequence) that includes one or more polymorphic sites.
The term "plurality of polymorphic target nucleic acids" herein refers to a plurality of nucleic acid sequences each comprising at least one polymorphic site (e.g., one SNP), such that 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 40, or more different polymorphic sites are amplified from the polymorphic target nucleic acid to identify and/or quantify fetal alleles present in a maternal sample comprising fetal and maternal nucleic acids.
The term "enrichment" herein refers to a process of amplifying a polymorphic target nucleic acid contained in a portion of a maternal sample and combining the amplified product with the rest of the maternal sample from which the portion was removed. For example, the remainder of the maternal sample may be the original maternal sample.
The term "original maternal sample" herein refers to a non-enriched biological sample obtained from a pregnant subject (e.g., a woman) that serves as the source from which a portion is removed to amplify polymorphic target nucleic acids. The "original sample" may be any sample obtained from a pregnant subject, or a processed fraction thereof, such as a purified cfDNA sample extracted from a maternal plasma sample.
As used herein, the term "primer" refers to an isolated oligonucleotide that is capable of acting as a point of initiation of synthesis when placed under conditions to prime synthesis of a primer extension product that is complementary to a nucleic acid strand (i.e., in the presence of nucleotides and an initiator such as a DNA polymerase and at a suitable temperature and pH). For the most efficient amplification, the primers are preferably single stranded, but alternatively may be double stranded. If double stranded, the primer is first treated to separate its strands before being used to prepare the extension product. The primer is preferably an oligodeoxyribonucleotide. The primer must be long enough to prime the synthesis of extension products in the presence of the initiator. The exact length of the primer will depend on many factors, including temperature, source of primer, use of the method, and parameters used for primer design.
The phrase "cause to be administered" refers to the actions taken by a medical professional (e.g., a physician), or by a person controlling or directing the medical care of a subject, that control and/or permit the administration of the agent(s)/compound(s) at issue to the subject. Causing to be administered can involve diagnosing and/or determining an appropriate therapeutic or prophylactic regimen, and/or prescribing particular agent(s)/compound(s) for the subject. Such prescribing can include, for example, drafting a prescription form, annotating a medical record, and the like. Similarly, "cause to be performed", e.g., for a diagnostic procedure, refers to the actions taken by a medical professional (e.g., a physician), or by a person controlling or directing the medical care of a subject, that control and/or permit the performance of one or more diagnostic protocols on the subject.
Introduction
Disclosed are methods, devices, systems, and kits for determining copy number variations (CNVs) of different sequences of interest in a test sample that comprises a mixture of nucleic acids derived from two different genomes and that is known or suspected to differ in the amount of one or more sequences of interest. Also provided are methods, devices, systems, and kits for determining the fraction contributed by each of the two genomes to the nucleic acid mixture. Copy number variations determined by the methods and apparatus disclosed herein include gains or losses of entire chromosomes, alterations involving very large chromosomal segments that are microscopically visible, and submicroscopic copy number variations of DNA segments ranging in size from kilobases (kb) to megabases (Mb). In various embodiments, the methods comprise a machine-implemented statistical approach that accounts for the accrued variability stemming from process-related, interchromosomal, and inter-sequence variability. The method is applicable to determining CNVs of any fetal aneuploidy, as well as CNVs known or suspected to be associated with a variety of medical conditions. CNVs that can be determined according to the present methods include trisomies and monosomies of any one or more of chromosomes 1-22, X, and Y, other chromosomal polysomies, and deletions and/or duplications of segments of any one or more of the chromosomes. Any of these aneuploidies can be determined from sequencing information obtained by sequencing the nucleic acids of a test sample only once.
CNVs in the human genome significantly affect human diversity and susceptibility to disease (Redon et al., Nature 23.
The methods and apparatus described herein can employ next generation sequencing (NGS) technology, which is massively parallel sequencing. In certain embodiments, clonally amplified DNA templates or single DNA molecules are sequenced in a massively parallel fashion within a flow cell (e.g., as described in Voelkerding et al., Clin Chem 55:641-658 [2009]; Metzker M, Nature Rev 11:31-46 [2010]). In addition to high-throughput sequence information, NGS provides quantitative information, in that each sequence read is a countable "sequence tag" representing an individual clonal DNA template or a single DNA molecule. The sequencing technologies of NGS include pyrosequencing, sequencing by synthesis with reversible dye terminators, sequencing by oligonucleotide probe ligation, and ion semiconductor sequencing. DNA from individual samples can be sequenced individually (i.e., singleplex sequencing), or DNA from multiple samples can be pooled and sequenced as indexed genomic molecules in a single sequencing run (i.e., multiplex sequencing), to generate up to several billion reads of DNA sequences. Examples of sequencing technologies that can be used to obtain sequence information according to the present methods are described below.
In some embodiments, the methods and apparatus disclosed herein may perform some or all of the following operations in order: obtaining a nucleic acid test sample from a patient (typically by a non-invasive procedure); processing the test sample in preparation for sequencing; sequencing nucleic acids from the test sample to produce numerous reads (e.g., at least 10,000); aligning the reads to portions of a reference sequence/genome and determining the amount of DNA (e.g., the number of reads) that maps to defined portions of the reference sequence (e.g., defined chromosomes or chromosome segments); calculating a dose for one or more of the defined portions by normalizing the amount of DNA mapping to each defined portion with the amount of DNA mapping to one or more normalizing chromosomes or chromosome segments selected for that defined portion; determining whether the dose indicates that the defined portion is "affected" (e.g., aneuploid or mosaic); reporting the determination and optionally converting it to a diagnosis; and using the diagnosis or determination to develop a plan of treatment, monitoring, or further testing for the patient.
Determining normalizing sequences in qualified samples: normalizing chromosome sequences and normalizing segment sequences
Normalizing sequences are identified using sequence information from a set of qualified samples obtained from subjects known to comprise cells having a normal copy number for any one sequence of interest (e.g., a chromosome or segment thereof). The determination of normalizing sequences is outlined in steps 110, 120, 130, 140, and 145 of the embodiment of the method depicted in Fig. 1. The sequence information obtained from the qualified samples is then used to make statistically meaningful identifications of chromosomal aneuploidies in test samples (Fig. 1, step 165, and the Examples).
Fig. 1 provides a flow diagram 100 of one embodiment of a method for determining a CNV of a sequence of interest, e.g., a chromosome or segment thereof, in a biological sample. In some embodiments, the biological sample is obtained from a subject and comprises a mixture of nucleic acids contributed by different genomes. The different genomes may be contributed to the sample by two individuals, e.g., a fetus and the mother carrying the fetus. Alternatively, the genomes may be contributed to the sample by aneuploid cancer cells and normal euploid cells from the same subject (e.g., a plasma sample from a cancer patient).
Apart from analyzing a patient's test sample, one or more normalizing chromosomes or one or more normalizing chromosome segments are selected for each possible chromosome of interest. The normalizing chromosomes or segments are identified asynchronously from the routine testing of patient samples, which may take place in a clinical setting. In other words, the normalizing chromosomes or segments are identified before patient samples are tested. The associations between normalizing chromosomes or segments and chromosomes or segments of interest are stored for use during testing. As explained below, such an association is typically maintained over a period of time that spans the testing of many samples. The following discussion concerns embodiments for selecting normalizing chromosomes or chromosome segments for individual chromosomes or segments of interest.
A set of qualified samples is obtained to identify qualified normalizing sequences and to provide variance values for use in determining statistically meaningful identification of CNVs in test samples. In step 110, a plurality of biological qualified samples are obtained from a plurality of subjects known to comprise cells having a normal copy number for any one sequence of interest. In one embodiment, the qualified samples are obtained from mothers pregnant with a fetus that has been confirmed by cytogenetic means to have a normal copy number of chromosomes. The biological qualified samples may be a biological fluid, e.g., plasma, or any suitable sample as described below. In some embodiments, a qualified sample contains a mixture of nucleic acid molecules (e.g., cfDNA molecules). In some embodiments, the qualified sample is a maternal plasma sample that contains a mixture of fetal and maternal cfDNA molecules. Sequence information for normalizing chromosomes and/or segments thereof is obtained by sequencing at least a portion of the nucleic acids (e.g., fetal and maternal nucleic acids) using any known sequencing method. Preferably, any one of the next generation sequencing (NGS) methods described elsewhere in this application is used to sequence the fetal and maternal nucleic acids as single or clonally amplified molecules. In various embodiments, the qualified samples are processed as disclosed below before and during sequencing. They may be processed using the devices, systems, and kits disclosed herein.
At step 120, at least a portion of each of the qualified nucleic acids contained in the qualified samples is sequenced to generate millions of sequence reads, e.g., 36bp reads, which are aligned to a reference genome, e.g., hg18. In some embodiments, the sequence reads comprise about 20bp, about 25bp, about 30bp, about 35bp, about 40bp, about 45bp, about 50bp, about 55bp, about 60bp, about 65bp, about 70bp, about 75bp, about 80bp, about 85bp, about 90bp, about 95bp, about 100bp, about 110bp, about 120bp, about 130bp, about 140bp, about 150bp, about 200bp, about 250bp, about 300bp, about 350bp, about 400bp, about 450bp, or about 500bp. It is expected that technological advances will enable single-end reads of greater than 500bp, which in turn will enable reads of greater than about 1000bp when paired-end reads are generated. In one embodiment, the mapped sequence reads comprise 36bp. In another embodiment, the mapped sequence reads comprise 25bp. Sequence reads are aligned to a reference genome, and the reads that map uniquely to the reference genome are known as sequence tags. In one embodiment, at least about 3 x 10^6 qualified sequence tags, at least about 5 x 10^6 qualified sequence tags, at least about 8 x 10^6 qualified sequence tags, at least about 10 x 10^6 qualified sequence tags, at least about 15 x 10^6 qualified sequence tags, at least about 20 x 10^6 qualified sequence tags, at least about 30 x 10^6 qualified sequence tags, at least about 40 x 10^6 qualified sequence tags, or at least about 50 x 10^6 qualified sequence tags comprising reads of between 20 and 40bp are obtained from reads that map uniquely to a reference genome.
At step 130, all the tags obtained from sequencing the nucleic acids in the qualified samples are counted to determine a qualified sequence tag density. In one embodiment, the sequence tag density is determined as the number of qualified sequence tags mapped to the sequence of interest on the reference genome. In another embodiment, the qualified sequence tag density is determined as the number of qualified sequence tags mapped to the sequence of interest, normalized to the length of the qualified sequence of interest to which they map. Sequence tag densities determined as a ratio of the tag count to the length of the sequence of interest are referred to herein as tag density ratios. Normalization to the length of the sequence of interest is not required; it may be included as a step that reduces the number of digits in a number and so simplifies it for human interpretation. As all qualified sequence tags are mapped and counted in each of the qualified samples, the sequence tag density for a sequence of interest (e.g., a clinically relevant sequence) in the qualified samples is determined, as are the sequence tag densities for the additional sequences from which normalizing sequences are subsequently identified.
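The counting described for step 130 can be illustrated with a minimal sketch. It assumes that each uniquely mapped tag has already been reduced to the name of the chromosome it maps to; the function names, the chromosome labels, and the toy lengths are hypothetical and are not taken from the disclosure.

```python
from collections import Counter

def count_tags(tag_chromosomes):
    """Count uniquely mapped sequence tags per chromosome (or segment)."""
    return Counter(tag_chromosomes)

def tag_density(counts, lengths=None):
    """Return the tag density for each sequence.

    With lengths=None the density is the raw tag count; otherwise the count
    is divided by the sequence length (the optional length normalization).
    """
    if lengths is None:
        return dict(counts)
    return {name: n / lengths[name] for name, n in counts.items()}

# Toy input: one entry per uniquely mapped tag (hypothetical data).
tags = ["chr21", "chr9", "chr9", "chr21", "chr9", "chr14"]
toy_lengths = {"chr21": 100, "chr9": 300, "chr14": 200}  # made-up lengths

counts = count_tags(tags)
density = tag_density(counts, toy_lengths)
```

Because the length normalization is optional, the sketch returns raw counts when no lengths are supplied; either form can serve as the input to a dose ratio.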
In certain embodiments, the sequence of interest is a chromosome associated with a complete chromosomal aneuploidy, e.g., chromosome 21, and the qualified normalizing sequence is a complete chromosome that is not associated with a chromosomal aneuploidy and whose variation in sequence tag density approximates that of the sequence (i.e., chromosome) of interest, e.g., chromosome 21. The selected normalizing chromosome may be the one chromosome, or the group of chromosomes, whose variation in sequence tag density best approximates that of the sequence of interest. Any one or more of chromosomes 1-22, X, and Y can be a sequence of interest, and one or more chromosomes can be identified in the qualified samples as the normalizing sequence for each of chromosomes 1-22, X, and Y. The normalizing chromosome may be an individual chromosome, or it may be a group of chromosomes, as described elsewhere in this application.
In another embodiment, the sequence of interest is a chromosome segment associated with a partial aneuploidy (e.g., a chromosomal deletion or insertion, or an unbalanced chromosomal translocation), and the normalizing sequence is a chromosome segment (or group of segments) that is not associated with the partial aneuploidy and whose variation in sequence tag density approximates that of the chromosome segment associated with the partial aneuploidy. The selected normalizing chromosome segment(s) may be the one or more segments whose variation in sequence tag density best approximates that of the sequence of interest. Any one or more segments of any one or more of chromosomes 1-22, X, and Y can be a sequence of interest.
In other embodiments, the sequence of interest is a chromosome segment associated with a partial aneuploidy and the normalizing sequence is a whole chromosome or a plurality of whole chromosomes. In still other embodiments, the sequence of interest is a whole chromosome associated with an aneuploidy and the normalizing sequence is a chromosome segment, or a plurality of chromosome segments, not associated with the aneuploidy.
Whether a single sequence or a group of sequences is identified in the qualified samples as the normalizing sequence for any one or more sequences of interest, the qualified normalizing sequence may be chosen to have a variation in sequence tag density that best or effectively approximates that of the sequence of interest as determined in the qualified samples. For example, a qualified normalizing sequence is a sequence that produces the smallest variability across the qualified samples when used to normalize the sequence of interest, i.e., the variability of the normalizing sequence is closest to that of the sequence of interest determined in the qualified samples. Stated another way, the qualified normalizing sequence is the sequence selected to produce the least variation in sequence dose (for the sequence of interest) across the qualified samples. Thus, the process selects a sequence that, when used as a normalizing chromosome, is expected to produce the smallest run-to-run variability in chromosome dose for the sequence of interest.
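The selection criterion in the preceding paragraph can be sketched as follows. This is a minimal illustration, not the disclosed implementation: it assumes each qualified sample is a mapping from chromosome name to tag count, and it uses the coefficient of variation (sample standard deviation divided by mean) as the variability measure. The function names and the toy counts are hypothetical.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return stdev(values) / mean(values)

def select_normalizing_chromosome(samples, interest):
    """Pick the single chromosome that minimizes dose variability.

    samples  -- list of dicts mapping chromosome name to tag count,
                one dict per qualified sample
    interest -- name of the chromosome of interest
    Returns (best_candidate, cv_by_candidate).
    """
    candidates = [c for c in samples[0] if c != interest]
    cv_by_candidate = {}
    for cand in candidates:
        # Dose of the chromosome of interest in every qualified sample.
        doses = [s[interest] / s[cand] for s in samples]
        cv_by_candidate[cand] = coefficient_of_variation(doses)
    best = min(cv_by_candidate, key=cv_by_candidate.get)
    return best, cv_by_candidate

# Toy qualified samples (made-up counts): chr9 tracks chr21 exactly,
# chr14 does not.
qualified = [
    {"chr21": 100, "chr9": 200, "chr14": 400},
    {"chr21": 110, "chr9": 220, "chr14": 400},
    {"chr21": 90, "chr9": 180, "chr14": 410},
]
best, cvs = select_normalizing_chromosome(qualified, "chr21")
```

In the toy data the tag count of chr9 rises and falls with that of chr21 from sample to sample, so the chr21/chr9 dose is constant and chr9 is selected.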
The normalizing sequence identified in the qualified samples for any one or more sequences of interest remains the normalizing sequence of choice for determining the presence or absence of aneuploidy in test samples over days, weeks, months, and possibly years, provided that the procedures used to generate sequencing libraries and to sequence the samples remain essentially unaltered over time. As described above, a normalizing sequence is chosen (possibly among other reasons) because the variability in the number of sequence tags mapped to it among samples (e.g., different samples) and sequencing runs (e.g., runs performed on the same day and/or on different days) best approximates the variability of the sequence of interest for which it serves as a normalizing parameter. Substantial alterations in these procedures will affect the number of tags mapped to all sequences, which in turn will determine which sequence or group of sequences has a variability across samples, within the same and/or different sequencing runs and on the same or different days, that most closely approximates that of the sequence of interest; this requires the set of normalizing sequences to be re-determined. Substantial alterations in procedures include changes in the laboratory protocol used to prepare the sequencing library, including changes related to preparing samples for multiplex rather than singleplex sequencing, and changes in the sequencing platform, including changes in the sequencing chemistry.
In some embodiments, the normalizing sequence is the sequence that best distinguishes one or more qualified samples from one or more affected samples, i.e., the sequence with the greatest resolvability. Such a normalizing sequence provides optimal differentiation of the sequence of interest in an affected test sample, so that the affected test sample is readily distinguished from unaffected samples. In other embodiments, the normalizing sequence is the sequence that combines the smallest variability with the greatest resolvability.
The level of resolvability can be determined as the statistical difference between the sequence doses (e.g., chromosome doses or segment doses) in a population of qualified samples and the chromosome dose(s) in one or more test samples, as described below and shown in the Examples. For example, resolvability can be expressed numerically as a t-test value, which represents the statistical difference between the chromosome doses in a population of qualified samples and the chromosome dose(s) in one or more test samples. Alternatively, resolvability can be expressed numerically as a Normalized Chromosome Value (NCV), which is a z-score for the chromosome dose, as long as the distribution of NCVs is normal. Similarly, resolvability can be expressed numerically as a t-test value representing the statistical difference between the segment doses in a population of qualified samples and the segment dose(s) in one or more test samples. Where a chromosome segment is the sequence of interest, the resolvability of segment doses can be expressed numerically as a Normalized Segment Value (NSV), which is a z-score for the chromosome segment dose, as long as the distribution of NSVs is normal. In determining a z-score, the mean and standard deviation of the chromosome or segment doses in a set of qualified samples can be used. Alternatively, the mean and standard deviation of the chromosome or segment doses in a training set comprising qualified samples and affected samples can be used. In other embodiments, the normalizing sequence is the sequence with the smallest variability and the greatest resolvability, or the best combination of small variability and large resolvability.
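As a worked illustration of the z-score described above, the following sketch computes an NCV for one test dose from the mean and standard deviation of the doses in a set of qualified samples. The function name and the dose values are hypothetical, and the use of the sample (n-1) standard deviation is an assumption; the text does not specify which estimator is used.

```python
from statistics import mean, stdev

def normalized_chromosome_value(test_dose, qualified_doses):
    """z-score of a test chromosome dose against qualified-sample doses.

    Assumption: the sample (n-1) standard deviation is used.
    """
    mu = mean(qualified_doses)
    sigma = stdev(qualified_doses)
    return (test_dose - mu) / sigma

# Made-up chromosome doses from three qualified samples and one test sample.
qualified_doses = [0.48, 0.50, 0.52]  # mean 0.50, sample SD 0.02
ncv = normalized_chromosome_value(0.56, qualified_doses)
```

The same arithmetic applies to a Normalized Segment Value when segment doses are supplied in place of chromosome doses.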
The method identifies sequences that inherently have similar characteristics and are prone to similar variations among samples and sequencing runs, which makes them useful for determining sequence doses in test samples.
Determining sequence dose (i.e., chromosome dose or segment dose) in a qualified sample
At step 140, based on the calculated qualified tag densities, a qualified sequence dose (i.e., chromosome dose or segment dose) for a sequence of interest is determined as the ratio of the sequence tag density for the sequence of interest to the qualified sequence tag density for additional sequences, from which normalizing sequences are subsequently identified at step 145. The identified normalizing sequences are then used to determine sequence doses in test samples.
In one embodiment, the sequence dose in the qualified samples is a chromosome dose, calculated as the ratio of the number of sequence tags for a chromosome of interest to the number of sequence tags for a normalizing chromosome sequence in a qualified sample. The normalizing chromosome sequence may be a single chromosome, a group of chromosomes, a segment of one chromosome, or a group of segments from different chromosomes. Accordingly, the chromosome dose for a chromosome of interest in a qualified sample is determined as: (i) the ratio of the number of tags for the chromosome of interest to the number of tags for a normalizing chromosome sequence consisting of a single chromosome; (ii) the ratio of the number of tags for the chromosome of interest to the number of tags for a normalizing chromosome sequence comprising two or more chromosomes; (iii) the ratio of the number of tags for the chromosome of interest to the number of tags for a normalizing segment sequence comprising a single segment of a chromosome; (iv) the ratio of the number of tags for the chromosome of interest to the number of tags for a normalizing segment sequence comprising two or more segments from one chromosome; or (v) the ratio of the number of tags for the chromosome of interest to the number of tags for a normalizing segment sequence comprising two or more segments of two or more chromosomes.
Examples of determining a chromosome dose for a chromosome of interest according to (i)-(v) are as follows: (i) the chromosome dose for a chromosome of interest, e.g., chromosome 21, is determined as the ratio of the sequence tag density for chromosome 21 to the sequence tag density for each of the remaining chromosomes (i.e., chromosomes 1-20, chromosome 22, chromosome X, and chromosome Y); (ii) the chromosome dose for a chromosome of interest, e.g., chromosome 21, is determined as the ratio of the sequence tag density for chromosome 21 to the sequence tag density for all possible combinations of two or more of the remaining chromosomes; (iii) the chromosome dose for a chromosome of interest, e.g., chromosome 21, is determined as the ratio of the sequence tag density for chromosome 21 to the sequence tag density for a segment of another chromosome (e.g., chromosome 9); (iv) the chromosome dose for a chromosome of interest, e.g., chromosome 21, is determined as the ratio of the sequence tag density for chromosome 21 to the sequence tag density for two segments of one other chromosome (e.g., two segments of chromosome 9); and (v) the chromosome dose for a chromosome of interest, e.g., chromosome 21, is determined as the ratio of the sequence tag density for chromosome 21 to the sequence tag density for two segments of two different chromosomes (e.g., a segment of chromosome 9 and a segment of chromosome 14).
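The five kinds of normalizing sequence listed in (i)-(v) can all be handled by one ratio, as the following minimal sketch suggests. It assumes that the tag count for a group of chromosomes or segments is the sum of the counts of its members, which is one reading of "the number of tags for a normalizing sequence comprising two or more" members. The region names (e.g., "chr9:seg1"), the function name, and the counts are hypothetical.

```python
def chromosome_dose(tag_counts, interest, normalizer):
    """Ratio of tags for the sequence of interest to tags for the normalizer.

    tag_counts -- dict mapping a region name (chromosome or segment) to its
                  tag count
    normalizer -- list of one or more region names; counts are summed
                  (an assumption about how groups are combined)
    """
    denominator = sum(tag_counts[region] for region in normalizer)
    if denominator == 0:
        raise ValueError("normalizing sequence has no mapped tags")
    return tag_counts[interest] / denominator

# Made-up tag counts for whole chromosomes and for named segments.
counts = {
    "chr21": 100, "chr9": 400, "chr14": 100,
    "chr9:seg1": 50, "chr9:seg2": 150, "chr14:seg1": 50,
}

dose_i = chromosome_dose(counts, "chr21", ["chr9"])                     # (i)
dose_ii = chromosome_dose(counts, "chr21", ["chr9", "chr14"])           # (ii)
dose_iii = chromosome_dose(counts, "chr21", ["chr9:seg1"])              # (iii)
dose_iv = chromosome_dose(counts, "chr21", ["chr9:seg1", "chr9:seg2"])  # (iv)
dose_v = chromosome_dose(counts, "chr21", ["chr9:seg1", "chr14:seg1"])  # (v)
```

Treating every normalizer as a list of regions keeps the single-chromosome, multi-chromosome, and segment-based cases on the same code path.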
In another embodiment, the sequence dose in the qualified samples is a segment dose, calculated as the ratio of the number of sequence tags for a segment of interest that is not a whole chromosome to the number of sequence tags for a normalizing segment sequence in a qualified sample. The normalizing segment sequence may be, for example, a whole chromosome, a group of whole chromosomes, a segment of one chromosome, or a group of segments from different chromosomes. For example, the segment dose for a segment of interest in a qualified sample is determined as: (i) the ratio of the number of tags for the segment of interest to the number of tags for a normalizing segment sequence consisting of a single segment of a chromosome; (ii) the ratio of the number of tags for the segment of interest to the number of tags for a normalizing segment sequence consisting of two or more segments of one chromosome; or (iii) the ratio of the number of tags for the segment of interest to the number of tags for a normalizing segment sequence consisting of two or more segments of two or more chromosomes.
Chromosome doses for one or more chromosomes of interest are determined in all of the qualified samples, and a normalizing chromosome sequence is identified in step 145. Similarly, segment doses for one or more segments of interest are determined in all of the qualified samples, and a normalizing segment sequence is identified in step 145.
Identification of normalizing sequences from qualified sequence doses
Based on the calculated sequence doses, a normalizing sequence for the sequence of interest is identified in step 145 as, for example, the sequence that yields the smallest variability in sequence dose for the sequence of interest across all qualified samples. The method identifies sequences that inherently have similar characteristics and are prone to similar variations among samples and sequencing runs, which makes them useful for determining sequence doses in test samples.
Normalizing sequences for one or more sequences of interest can be identified in a set of qualified samples, and the sequences identified in the qualified samples are subsequently used to calculate sequence doses for the one or more sequences of interest in each test sample (step 150) in order to determine the presence or absence of aneuploidy in each test sample. The normalizing sequence identified for a chromosome or segment of interest may differ when different sequencing platforms are used, and/or when there are differences in the purification of the nucleic acid to be sequenced and/or in the preparation of the sequencing library. The use of normalizing sequences according to the methods described herein provides a specific and sensitive measure of copy number variation of a chromosome or segment thereof, irrespective of the sequencing platform and/or sample preparation used.
In some embodiments, more than one normalizing sequence is identified, i.e., different normalizing sequences can be determined for one sequence of interest, and multiple sequence doses can be determined for one sequence of interest. For example, the variation (e.g., coefficient of variation, CV) in chromosome dose for chromosome 21 of interest is smallest when the sequence tag density of chromosome 14 is used. However, two, three, four, five, six, seven, eight, or more normalizing sequences can be identified for use in determining a sequence dose for a sequence of interest in a test sample. As an example, a second dose for chromosome 21 in any one test sample can be determined using chromosome 7, chromosome 9, chromosome 11, or chromosome 12 as the normalizing chromosome sequence, because the CVs obtained with these chromosomes are all close to the CV obtained with chromosome 14 (see Example 8, Table 10). Preferably, when a single chromosome is chosen as the normalizing chromosome sequence for a chromosome of interest, the normalizing chromosome is the one that yields chromosome doses for the chromosome of interest with the smallest variability across all samples tested, e.g., qualified samples.
Normalizing chromosomal sequences as normalizing sequences for chromosomes
In other embodiments, the normalizing chromosome sequence may be a single sequence, or it may be a group of sequences. For example, in some embodiments, the normalizing sequence is a group of sequences, e.g., a group of chromosomes, that is identified as the normalizing sequence for any one or more of chromosomes 1-22, X, and Y. The group of chromosomes that composes the normalizing sequence (i.e., the normalizing chromosome sequence) for a chromosome of interest may be a group of two, three, four, five, six, seven, eight, nine, ten, eleven, twelve, thirteen, fourteen, fifteen, sixteen, seventeen, eighteen, nineteen, twenty, twenty-one, or twenty-two chromosomes, including or excluding one or both of chromosomes X and Y. The group of chromosomes identified as the normalizing chromosome sequence is the group that yields chromosome doses for the chromosome of interest with the smallest variability across all samples tested (i.e., qualified samples). Preferably, individual chromosomes and groups of chromosomes are tested together for their ability to best mimic the behavior of the sequence of interest for which they are chosen as normalizing chromosome sequences.
In one embodiment, the normalizing sequence for chromosome 21 is selected from chromosome 9, chromosome 1, chromosome 2, chromosome 3, chromosome 4, chromosome 5, chromosome 6, chromosome 7, chromosome 8, chromosome 10, chromosome 11, chromosome 12, chromosome 13, chromosome 14, chromosome 15, chromosome 16, and chromosome 17. In another embodiment, the normalizing sequence for chromosome 21 is selected from chromosome 9, chromosome 1, chromosome 2, chromosome 11, chromosome 12, and chromosome 14. Alternatively, the normalizing sequence for chromosome 21 is a set of chromosomes selected from chromosome 9, chromosome 1, chromosome 2, chromosome 3, chromosome 4, chromosome 5, chromosome 6, chromosome 7, chromosome 8, chromosome 10, chromosome 11, chromosome 12, chromosome 13, chromosome 14, chromosome 15, chromosome 16, and chromosome 17. In another embodiment, the set of chromosomes is a set selected from chromosome 9, chromosome 1, chromosome 2, chromosome 11, chromosome 12, and chromosome 14.
In some embodiments, the method is further improved by using a normalizing sequence that is determined by systematic calculation of all chromosome doses, using each chromosome individually and in all possible combinations with all remaining chromosomes (see Example 13). For example, a systematically determined normalizing chromosome can be identified for each chromosome of interest by systematically calculating all possible chromosome doses using any one of chromosomes 1-22, X, and Y, and combinations of two or more of chromosomes 1-22, X, and Y, to determine which single chromosome or group of chromosomes is the normalizing chromosome that yields the smallest variability in chromosome dose for the chromosome of interest across a set of qualified samples (see Example 13). Accordingly, in one embodiment, the systematically calculated normalizing sequence for chromosome 21 is a group of chromosomes consisting of chromosome 4, chromosome 14, chromosome 16, chromosome 20, and chromosome 22. A single chromosome or a group of chromosomes can be determined in this way for every chromosome in the genome.
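The systematic calculation described above can be sketched as an exhaustive search over every single chromosome and every combination of chromosomes. This is an illustrative sketch under stated assumptions rather than the procedure of Example 13: the tag count of a group is taken as the sum of its members' counts, variability is measured as the coefficient of variation, and the function name and toy counts are hypothetical.

```python
from itertools import combinations
from statistics import mean, stdev

def systematic_normalizer(samples, interest, max_size=None):
    """Search all candidate groups for the lowest dose variability.

    samples  -- list of dicts mapping chromosome name to tag count,
                one dict per qualified sample
    interest -- chromosome of interest
    max_size -- largest group size to try (default: all candidates)
    Returns (best_group, best_cv), where best_group is a tuple of names.
    """
    candidates = sorted(c for c in samples[0] if c != interest)
    if max_size is None:
        max_size = len(candidates)
    best_group, best_cv = None, float("inf")
    for size in range(1, max_size + 1):
        for group in combinations(candidates, size):
            # Assumption: the tag count of a group is the sum of its members.
            doses = [s[interest] / sum(s[c] for c in group) for s in samples]
            cv = stdev(doses) / mean(doses)
            if cv < best_cv:
                best_group, best_cv = group, cv
    return best_group, best_cv

# Toy qualified samples (made-up counts): chr4 + chr14 together track chr21.
qualified = [
    {"chr21": 100, "chr4": 150, "chr14": 50, "chr16": 300},
    {"chr21": 110, "chr4": 140, "chr14": 80, "chr16": 300},
    {"chr21": 90, "chr4": 130, "chr14": 50, "chr16": 310},
]
group, cv = systematic_normalizer(qualified, "chr21")
```

With 23 candidate chromosomes the search covers 2^23 - 1 (about 8.4 million) groups, which is why a `max_size` cap is offered in the sketch.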
In one embodiment, the normalizing sequence for chromosome 18 is selected from chromosome 8, chromosome 2, chromosome 3, chromosome 4, chromosome 5, chromosome 6, chromosome 7, chromosome 9, chromosome 10, chromosome 11, chromosome 12, chromosome 13, and chromosome 14. Preferably, the normalizing sequence for chromosome 18 is selected from chromosome 8, chromosome 2, chromosome 3, chromosome 5, chromosome 6, chromosome 12, and chromosome 14. In one embodiment, the normalizing sequence for chromosome 18 is a set of chromosomes selected from chromosome 8, chromosome 2, chromosome 3, chromosome 4, chromosome 5, chromosome 6, chromosome 7, chromosome 9, chromosome 10, chromosome 11, chromosome 12, chromosome 13, and chromosome 14. Preferably, the set of chromosomes is a set selected from chromosome 8, chromosome 2, chromosome 3, chromosome 5, chromosome 6, chromosome 12, and chromosome 14.
In another embodiment, the normalizing sequence for chromosome 18 is determined by systematic calculation of all possible chromosome doses, using each possible normalizing chromosome individually and in all possible combinations of normalizing chromosomes (as explained elsewhere in this application). Accordingly, in one embodiment, the normalizing sequence for chromosome 18 is a normalizing chromosome comprising the group of chromosomes consisting of chromosome 2, chromosome 3, chromosome 5, and chromosome 7.
In one embodiment, the normalizing sequence for chromosome X is selected from chromosome 1, chromosome 2, chromosome 3, chromosome 4, chromosome 5, chromosome 6, chromosome 7, chromosome 8, chromosome 9, chromosome 10, chromosome 11, chromosome 12, chromosome 13, chromosome 14, chromosome 15, and chromosome 16. Preferably, the normalizing sequence for chromosome X is selected from chromosome 2, chromosome 3, chromosome 4, chromosome 5, chromosome 6, and chromosome 8. In one embodiment, the normalizing sequence for chromosome X is a set of chromosomes selected from chromosome 1, chromosome 2, chromosome 3, chromosome 4, chromosome 5, chromosome 6, chromosome 7, chromosome 8, chromosome 9, chromosome 10, chromosome 11, chromosome 12, chromosome 13, chromosome 14, chromosome 15, and chromosome 16. Preferably, the set of chromosomes is a set selected from chromosome 2, chromosome 3, chromosome 4, chromosome 5, chromosome 6, and chromosome 8.
In another embodiment, the normalizing sequence for chromosome X is determined by systematically calculating all possible chromosome doses (as explained elsewhere in this application) using each possible normalizing chromosome individually and all possible combinations of normalizing chromosomes. Thus, in one embodiment, the normalizing sequence for chromosome X is a normalizing chromosome comprising the set of chromosomes consisting of chromosome 4 and chromosome 8.
In one embodiment, the normalizing sequence for chromosome 13 is one chromosome selected from chromosome 2, chromosome 3, chromosome 4, chromosome 5, chromosome 6, chromosome 7, chromosome 8, chromosome 9, chromosome 10, chromosome 11, chromosome 12, chromosome 14, chromosome 18, and chromosome 21. Preferably, the normalizing sequence for chromosome 13 is one chromosome selected from chromosome 2, chromosome 3, chromosome 4, chromosome 5, chromosome 6, and chromosome 8. In another embodiment, the normalizing sequence for chromosome 13 is a set of chromosomes selected from chromosome 2, chromosome 3, chromosome 4, chromosome 5, chromosome 6, chromosome 7, chromosome 8, chromosome 9, chromosome 10, chromosome 11, chromosome 12, chromosome 14, chromosome 18, and chromosome 21. Preferably, the set of chromosomes is a set selected from chromosome 2, chromosome 3, chromosome 4, chromosome 5, chromosome 6, and chromosome 8.
In another embodiment, the normalizing sequence for chromosome 13 is determined by systematically calculating all possible chromosome doses (as explained elsewhere in this application) using each possible normalizing chromosome individually and all possible combinations of normalizing chromosomes. Thus, in one embodiment, the normalizing sequence for chromosome 13 is a normalizing chromosome comprising the set of chromosomes consisting of chromosome 4 and chromosome 5. In another embodiment, the normalizing sequence for chromosome 13 is a normalizing chromosome consisting of the set of chromosomes consisting of chromosome 4 and chromosome 5.
The variation in chromosome dose for chromosome Y is greater than 30, independent of which normalization chromosome is used in determining chromosome Y dose. Thus, a set of two or more chromosomes selected from chromosomes 1-22 and chromosome X can be used as the normalizing sequence for chromosome Y. In one embodiment, the at least one normalizing chromosome is a set of chromosomes consisting of chromosomes 1-22, and chromosome X. In another embodiment, the set of chromosomes consists of chromosome 2, chromosome 3, chromosome 4, chromosome 5, and chromosome 6.
In another embodiment, the normalizing sequence for chromosome Y is determined by systematically calculating all possible chromosome doses (as explained elsewhere in this application) using each possible normalizing chromosome individually and all possible combinations of normalizing chromosomes. Thus, in one embodiment, the normalizing sequence for chromosome Y is a normalizing chromosome comprising the set of chromosomes consisting of chromosome 4 and chromosome 6. In another embodiment, the normalizing sequence for chromosome Y is a normalizing chromosome consisting of the set of chromosomes consisting of chromosome 4 and chromosome 6.
The normalizing sequence used to calculate the dose for different chromosomes or different segments of interest may be the same, or it may be a different normalizing sequence for each chromosome or segment. For example, the normalizing sequence(s) (e.g., normalizing chromosome(s)) for chromosome A of interest may be the same as, or different from, the normalizing sequence(s) (e.g., normalizing chromosome(s)) for chromosome B of interest.
The normalizing sequence for a complete chromosome may be a complete chromosome or a set of complete chromosomes, or it may be a segment of a chromosome, or a set of segments of one or more chromosomes.
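The systematic search described above can be illustrated with a minimal sketch. It assumes that the dose is the tag count of the chromosome of interest divided by the summed tag counts of the candidate normalizing chromosomes, and that "variability" is measured as the coefficient of variation across qualified samples. The function names, the `max_size` cap on combination size, and the data layout (one dictionary of tag counts per sample) are illustrative assumptions, not part of the described method.

```python
from itertools import combinations
from statistics import mean, stdev


def chromosome_dose(counts, target, normalizers):
    """Tag count of the target divided by the summed tag counts of the normalizers."""
    return counts[target] / sum(counts[c] for c in normalizers)


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return stdev(values) / mean(values)


def select_normalizing_set(samples, target, candidates, max_size=2):
    """Score every single candidate and every combination up to max_size.

    samples: list of dicts mapping chromosome name -> tag count (qualified samples).
    Returns the combination with the lowest dose variability and that variability.
    """
    best_set, best_cv = None, float("inf")
    for size in range(1, max_size + 1):
        for combo in combinations(candidates, size):
            doses = [chromosome_dose(s, target, combo) for s in samples]
            cv = coefficient_of_variation(doses)
            if cv < best_cv:
                best_set, best_cv = combo, cv
    return best_set, best_cv
```

The number of combinations grows exponentially with the number of candidates, so a practical search would likely cap the combination size, as `max_size` does here.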
Normalizing segment sequences as normalizing sequences for chromosomes
In another embodiment, the normalizing sequence of the chromosome may be a normalizing segment sequence. The normalizing segment sequence may be a single segment, or it may be a set of segments of one chromosome, or they may be multiple segments from two or more different chromosomes. The normalized segment sequences can be determined by systematic calculation of all combinations of segment sequences in the genome. For example, the sequence of the normalizing segment of chromosome 21 may be a single segment that is larger or smaller than the size of chromosome 21 of about 47Mbp (million base pairs), e.g., the normalizing segment may be a segment of chromosome 9 that is about 140Mbp. Alternatively, the normalizing sequence of chromosome 21 may be, for example, a combination of segment sequences from two different chromosomes (e.g., from chromosome 1 and from chromosome 12).
In one embodiment, the normalizing sequence for chromosome 21 is a normalizing segment sequence of a segment or a set of two or more segments of chromosomes 1-20, 22, X, and Y. In another embodiment, the normalizing sequence for chromosome 18 is a segment or set of segments of chromosomes 1-17, 19-22, X, and Y. In another embodiment, the normalizing sequence for chromosome 13 is a segment or set of segments of chromosomes 1-12, 14-22, X, and Y. In another embodiment, the normalizing sequence for chromosome X is a segment or set of segments of chromosomes 1-22 and Y. In another embodiment, the normalizing sequence for chromosome Y is a segment or set of segments of chromosomes 1-22 and X. Normalizing sequences of single segments or sets of segments can be determined for all chromosomes in a genome. The two or more segments of a normalizing segment sequence may be segments from one chromosome, or they may be segments of two or more different chromosomes. As described for the normalizing chromosome sequence, one normalizing segment sequence can be the same for two or more different chromosomes.
Normalizing segment sequences as normalizing sequences for chromosome segments
When the sequence of interest is a segment of a chromosome, the presence or absence of a CNV of the sequence of interest can be determined. Variations in the copy number of chromosome segments allow the presence or absence of a partial chromosomal aneuploidy to be determined. Examples of partial chromosomal aneuploidies associated with different fetal abnormalities and conditions are described below. The segment of the chromosome may be of any length; for example, it may range from kilobases to hundreds of megabases. The human genome, which comprises just over 3 billion DNA bases, can be divided into tens, thousands, hundreds of thousands, or millions of segments of varying sizes, the copy number of which can be determined according to the methods of the present invention. The normalizing sequence for a chromosome segment is a normalizing segment sequence, which can be a single segment from any of chromosomes 1-22, X, and Y, or a set of segments from any one or more of chromosomes 1-22, X, and Y.
The normalizing sequence for a segment of interest is a sequence whose variability across chromosomes and across samples most closely approximates the variability of the segment of interest. Where the normalizing sequence is a set of segments of any one or more of chromosomes 1-22, X, and Y, the normalizing sequence can be determined as described for determining the normalizing sequence for a chromosome of interest. The normalizing segment sequence, whether one segment or a set of segments, can be identified by calculating the segment dose in each sample of a set of qualified samples (i.e., samples known to be diploid for the segment of interest) using each single segment and all possible combinations of two or more segments as the normalizing sequence for the segment of interest. The normalizing sequence is then determined to be the one that provides the segment dose with the lowest variability for the segment of interest across all qualified samples, as explained above for the normalizing chromosome sequence.
For example, for a segment of interest that is 1 Mb (megabase) in size, the remaining 3 million segments of the approximately 3 Gb human genome (minus the 1 Mb segment of interest) can be used individually or in combination with each other to calculate the segment dose for the segment of interest in a set of qualified samples, in order to determine which segment or set of segments will be used as the normalizing segment sequence for the qualified and test samples. The segment of interest can vary from about 1000 bases to tens of millions of bases. The normalizing segment sequence may be composed of one or more segments of the same size as the sequence of interest. In other embodiments, the normalizing segment sequence may be composed of segments that differ in size from the sequence of interest and/or from each other. For example, the normalizing sequence for a sequence of 100,000 bases in length may be 20,000 bases long and may comprise a combination of sequences of different lengths, e.g., 7,000 + 8,000 + 5,000 bases. As explained elsewhere in this application for the normalizing chromosome sequences, the normalizing segment sequences can be determined by systematically calculating all possible chromosome and/or segment doses using each possible normalizing chromosome segment individually and all possible combinations of normalizing segments. Single segments or sets of segments can be determined for all segments and/or chromosomes in a genome.
The normalizing sequence used to calculate the dose for different chromosome segments of interest may be the same, or it may be a different normalizing sequence for each chromosome segment of interest. For example, the normalizing sequence (e.g., normalizing segment(s)) for chromosome segment A of interest may be the same as, or different from, the normalizing sequence (e.g., normalizing segment(s)) for chromosome segment B of interest.
Normalizing chromosomal sequences as normalizing sequences for chromosomal segments
In another embodiment, copy number variation of a chromosome segment can be determined using a normalizing chromosome, which can be a single chromosome or a set of chromosomes as described above. The normalizing chromosome sequence may be the normalizing chromosome or set of chromosomes identified for the chromosome of interest in a set of qualified samples by systematically determining which chromosome or set of chromosomes provides the lowest variability in chromosome dose across the set of qualified samples. For example, to determine the presence or absence of a partial deletion of chromosome 7, the normalizing chromosome or set of chromosomes used to analyze the partial deletion is first identified in a set of qualified samples as the normalizing sequence that minimizes the variability in chromosome dose for the entire chromosome 7. As described elsewhere herein for the normalizing chromosome sequence of a chromosome of interest, the normalizing chromosome sequence for a chromosome segment can be determined by systematically calculating all possible chromosome doses using each possible normalizing chromosome individually and all possible combinations of normalizing chromosomes. A single chromosome or a set of chromosomes can be determined for all chromosome segments in the genome. Examples illustrating the use of normalizing chromosomes to determine the presence of partial chromosomal deletions and partial chromosomal duplications are provided as Examples 17 and 18.
In certain embodiments, the CNV of a chromosome segment is determined by first subdividing the chromosome of interest into segments, or bins, of variable length. The bin length may be at least about 1 kbp, at least about 10 kbp, at least about 100 kbp, at least about 1 Mbp, at least about 10 Mbp, or at least about 100 Mbp. The smaller the bin length, the greater the resolution with which the CNV of a segment can be localized within the chromosome of interest.
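As a rough illustration of subdividing a chromosome into bins, the sketch below assigns each mapped tag to a fixed-length bin by integer division of its start position. The function name and the representation of tags as 0-based integer positions are assumptions made for this example only.

```python
from collections import Counter


def bin_tag_counts(positions, bin_length):
    """Count tags per fixed-length bin.

    positions: iterable of 0-based tag start positions on one chromosome.
    bin_length: bin size in base pairs (e.g., 1_000 for 1 kbp bins).
    Returns a Counter mapping bin index -> number of tags in that bin.
    """
    if bin_length <= 0:
        raise ValueError("bin_length must be positive")
    return Counter(pos // bin_length for pos in positions)
```

Halving `bin_length` doubles the number of bins and therefore the positional resolution, at the cost of fewer tags (and noisier counts) per bin.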
Determining the presence or absence of a CNV for a chromosome segment of interest may be accomplished by comparing the dose for each of the bins of the chromosome of interest in the test sample to the mean of the corresponding bin doses determined for each bin of equal length in a set of qualified samples. The dose for each bin may be normalized as a Normalized Bin Value (NBV) that relates the bin dose in the test sample to the mean of the corresponding bin dose in a set of qualified samples, as described above for the normalized chromosome values. The NBV is calculated as:
$$NBV_{ij} = \frac{x_{ij} - \hat{\mu}_j}{\hat{\sigma}_j}$$
where $\hat{\mu}_j$ and $\hat{\sigma}_j$ are, respectively, the estimated mean and standard deviation for the jth bin dose in a set of qualified samples, and $x_{ij}$ is the jth bin dose observed for test sample i.
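A minimal sketch of this normalization follows, assuming the NBV takes the form of a z-score: the observed bin dose minus the qualified-sample mean, divided by the qualified-sample standard deviation. The use of the sample (n-1) standard deviation and the list-of-lists data layout are assumptions for illustration.

```python
from statistics import mean, stdev


def normalized_bin_values(test_doses, qualified_doses):
    """Compute one normalized value per bin for a single test sample.

    test_doses: list of bin doses for the test sample, one entry per bin j.
    qualified_doses: list of per-sample lists with the same bin order.
    """
    values = []
    for j, x_ij in enumerate(test_doses):
        column = [sample[j] for sample in qualified_doses]
        mu_j = mean(column)      # estimated mean of the jth bin dose
        sigma_j = stdev(column)  # estimated (sample) standard deviation
        values.append((x_ij - mu_j) / sigma_j)
    return values
```

A bin whose dose equals the qualified-sample mean yields a value of 0; values far from 0 flag bins whose dose departs from what is seen in qualified samples.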
Determining aneuploidy in a test sample
A sequence dose is determined for a sequence of interest in a test sample based on the one or more normalizing sequences identified in the qualified samples, the test sample comprising a mixture of nucleic acids derived from genomes that differ in one or more sequences of interest.
At step 115, a test sample is obtained from a subject suspected or known to carry a clinically relevant CNV of the sequence of interest. The test sample may be a biological fluid (e.g. plasma) or any suitable sample as described below. As illustrated, the sample may be obtained using a non-invasive procedure such as simple blood draw. In some embodiments, the test sample contains a mixture of nucleic acid molecules (e.g., cfDNA molecules). In some embodiments, the test sample is a maternal plasma sample containing a mixture of fetal and maternal cfDNA molecules.
At step 125, at least a portion of the test nucleic acids in the test sample are sequenced, as described for the qualified samples, to generate millions of sequence reads (e.g., 36 bp reads). As in step 120, the reads generated from sequencing the nucleic acids in the test sample are uniquely mapped or aligned to a reference genome to produce tags. As described for step 120, at least about 3×10^6 qualified sequence tags, at least about 5×10^6 qualified sequence tags, at least about 8×10^6 qualified sequence tags, at least about 10×10^6 qualified sequence tags, at least about 15×10^6 qualified sequence tags, at least about 20×10^6 qualified sequence tags, at least about 30×10^6 qualified sequence tags, at least about 40×10^6 qualified sequence tags, or at least about 50×10^6 qualified sequence tags are obtained from reads that map uniquely to a reference genome, the qualified sequence tags comprising reads of between 20 and 40 bp. In certain embodiments, the reads generated by the sequencing device are provided in an electronic format. Alignment is accomplished using a computing device as discussed below. Individual reads are compared against the reference genome, which is often vast (millions of base pairs), to identify sites where the reads uniquely correspond to the reference genome. In certain embodiments, the alignment procedure permits limited mismatch between reads and the reference genome. In some cases, 1, 2, or 3 base pairs in a read are permitted to mismatch the corresponding base pairs in the reference genome, and a mapping is still made.
At step 135, all or most of the tags obtained from sequencing the nucleic acids in the test sample are counted to determine a test sequence tag density, using a computing device as described below. In certain embodiments, each read is aligned to a particular region of the reference genome (a chromosome or segment in most cases), and the read is converted to a tag by appending site information to the read. As this process unfolds, the computing device may keep a running count of the number of tags/reads mapping to each region of the reference genome (chromosome or segment in most cases). The counts are stored for each chromosome or segment of interest and each corresponding normalizing chromosome or segment.
In certain embodiments, the reference genome has one or more excluded regions that are part of the true biological genome but are not included in the reference genome. Reads that could potentially align to these excluded regions are not counted. Examples of excluded regions include regions of long repeated sequences, regions of similarity between the X and Y chromosomes, and the like.
In certain embodiments, the method determines whether to count a tag more than once when multiple reads align to the same site on a reference genome or sequence. There may be occasions when two tags have the same sequence and therefore align to an identical site on the reference sequence. The method employed to count tags may, in some cases, exclude from the count identical tags derived from the same sequenced sample. If a disproportionate number of tags in a given sample are identical, this suggests a strong bias or other defect in the procedure. Therefore, in accordance with certain embodiments, the counting method does not count tags from a given sample that are identical to tags from that sample that were previously counted.
Various criteria may be set for choosing when to disregard an identical tag from a single sample. In certain embodiments, a defined percentage of the counted tags must be unique. If more tags than this threshold are not unique, they are disregarded. For example, if the defined percentage requires that at least 50% be unique, identical tags are not counted until the percentage of unique tags for the sample exceeds 50%. In other embodiments, the threshold percentage of unique tags is at least about 60%. In other embodiments, the threshold percentage of unique tags is at least about 75%, or at least about 90%, or at least about 95%, or at least about 98%, or at least about 99%. For chromosome 21, the threshold may be set at 90%. If 30M tags are aligned to chromosome 21, then at least 27M of them must be unique. If 3M counted tags are not unique and the 30,000,001st tag is not unique, then it is not counted.
A specific threshold or other criterion for determining when not to count further identical tags may be selected using appropriate statistical analysis. One factor influencing this threshold or other criterion is the amount of sample sequenced relative to the size of the genome to which tags can be aligned. Other factors include the size of the reads and similar considerations.
In one embodiment, the number of test sequence tags mapped to a sequence of interest is normalized to the known length of the sequence of interest to which they are mapped, to provide a test sequence tag density ratio. As described for the qualified samples, normalization to the known length of a sequence of interest is not required; it may be included as a step to reduce the number of digits in a number and so simplify it for human interpretation. As all the mapped test sequence tags are counted in the test sample, the sequence tag density for a sequence of interest (e.g., a clinically relevant sequence) in the test sample is determined, as are the sequence tag densities for additional sequences that correspond to the at least one normalizing sequence identified in the qualified samples.
At step 150, based on the identity of the at least one normalizing sequence in the qualified samples, a test sequence dose is determined for a sequence of interest in the test sample. In various embodiments, the test sequence dose is computationally determined by manipulating the sequence tag densities of the sequence of interest and the corresponding normalizing sequence as described herein. The computing device responsible for this task electronically accesses the association between the sequence of interest and its associated normalizing sequence, which may be stored in a database, table, or graph, or be included as code in program instructions.
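The running count with excluded regions could be sketched as follows. The tag representation (a chromosome name plus a 0-based position), the half-open interval convention for excluded regions, and the function name are assumptions made for this example; the coordinates used in any real exclusion list would come from the reference genome annotation.

```python
from collections import Counter


def count_tags(tags, excluded_regions):
    """Keep a running count of tags per chromosome, skipping excluded regions.

    tags: iterable of (chromosome, position) pairs.
    excluded_regions: dict mapping chromosome -> list of (start, end) intervals,
        half-open, i.e. start <= position < end is excluded.
    """
    counts = Counter()
    for chrom, pos in tags:
        intervals = excluded_regions.get(chrom, ())
        if any(start <= pos < end for start, end in intervals):
            continue  # tag falls in an excluded region and is not counted
        counts[chrom] += 1
    return counts
```

For genome-scale exclusion lists, a sorted interval structure with binary search would be a more efficient alternative to the linear scan shown here.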
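One possible reading of the uniqueness criterion is sketched below: tags are processed in order, a first occurrence is always counted, and a repeated tag is counted only if the fraction of unique tags among counted tags would remain at or above the threshold. This interpretation, the function name, and the representation of a tag as any hashable value (for example, a mapped position) are assumptions; the text does not fix a specific procedure.

```python
def count_with_uniqueness_threshold(tags, min_unique_fraction=0.9):
    """Count tags, skipping duplicates that would push uniqueness below the threshold.

    Returns (counted, unique): the total number of tags counted and how many
    of those were first occurrences.
    """
    seen = set()
    counted = 0
    unique = 0
    for tag in tags:
        if tag not in seen:
            seen.add(tag)
            counted += 1
            unique += 1
        elif unique / (counted + 1) >= min_unique_fraction:
            counted += 1  # duplicate is still within the allowed share
        # otherwise the duplicate is disregarded
    return counted, unique
```

With a threshold of 0.9 and 27M unique tags among 30M counted, the next duplicate would give 27,000,000 / 30,000,001, which is below 0.9, so it would be skipped.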
As explained elsewhere in this application, the at least one normalizing sequence may be a single sequence or a set of sequences. The sequence dose for a sequence of interest in a test sample is the ratio of the sequence tag density determined for the sequence of interest in the test sample to the sequence tag density of the at least one normalizing sequence determined in the test sample, wherein the normalizing sequence in the test sample corresponds to the normalizing sequence identified in the qualified samples for the particular sequence of interest. For example, if the normalizing sequence identified for chromosome 21 in the qualified samples is determined to be a chromosome (e.g., chromosome 14), then the test sequence dose for chromosome 21 (the sequence of interest) is determined as the ratio of the sequence tag density for chromosome 21 to the sequence tag density for chromosome 14, each determined in the test sample. Similarly, chromosome doses are determined for chromosomes 13, 18, X, and Y, as well as for other chromosomes associated with various chromosomal aneuploidies. The normalizing sequence for a chromosome of interest may be one chromosome or a set of chromosomes, or one chromosome segment or a set of chromosome segments. As described above, a sequence of interest can be a portion of a chromosome, such as a chromosome segment. Thus, the dose for a chromosome segment can be determined as the ratio of the sequence tag density determined for the segment in the test sample to the sequence tag density for the normalizing chromosome segment in the test sample, where the normalizing segment in the test sample corresponds to the normalizing segment (a single segment or a set of segments) identified in the qualified samples for the particular segment of interest. Chromosome segments may range in size from kilobases (kb) to megabases (Mb) (e.g., about 1 kb to 10 kb, or about 10 kb to 100 kb, or about 100 kb to 1 Mb).
At step 155, threshold values are derived from the standard deviation values established for the qualified sequence doses determined in a plurality of qualified samples and from the sequence doses determined for samples known to be aneuploid for the sequence of interest. Note that this operation is typically performed asynchronously with the analysis of the patient test sample. It may be performed, for example, concurrently with the selection of normalizing sequences from the qualified samples. Accurate classification depends on the differences between the probability distributions for the different classes (i.e., types of aneuploidy). In some examples, thresholds are chosen from the empirical distribution for each type of aneuploidy (e.g., trisomy 21). The possible thresholds established for classifying trisomy 13, trisomy 18, trisomy 21, and monosomy X aneuploidies are described in the Examples, which illustrate the use of the method for determining chromosomal aneuploidies by sequencing cfDNA extracted from a maternal sample comprising a mixture of fetal and maternal nucleic acids. The threshold determined for distinguishing samples affected by one chromosomal aneuploidy may be the same as or different from the threshold for a different aneuploidy. As shown in the Examples, the threshold for each chromosome of interest is determined from the variability in the dose of the chromosome of interest across samples and sequencing runs. The less variable the chromosome dose for any chromosome of interest, the narrower the spread of the dose for that chromosome across all unaffected samples, which are used to set the thresholds for determining different aneuploidies.
Returning to the process flow associated with classifying a patient test sample, at step 160, the copy number variation of the sequence of interest is determined in the test sample by comparing the test sequence dose for the sequence of interest to at least one threshold established from the qualified sample doses. This operation may be performed by the same computing device used to measure the sequence tag densities and/or calculate the segment doses.
At step 165, the calculated dose for the test sequence of interest is compared to the threshold values, which are chosen according to a user-defined threshold of reliability, to classify the sample as "normal", "affected", or "no call". The "no call" samples are samples for which a definitive diagnosis cannot be made with reliability. Each type of affected sample (e.g., trisomy 21, partial trisomy 21, monosomy X) has its own thresholds, one for calling normal (unaffected) samples and another for calling affected samples (although in some cases the two thresholds coincide). As described elsewhere herein, in some cases a no call can be converted to a call (affected or normal) if the fetal fraction of nucleic acid in the test sample is sufficiently high. The classification of the test sequence may be reported by the computing device employed in the other operations of this process flow. In some cases, the classification is reported in an electronic format and may be displayed, emailed, texted, etc. to the relevant person.
Certain embodiments provide a method for providing a prenatal diagnosis of fetal aneuploidy in a biological sample comprising fetal and maternal nucleic acid molecules. The diagnosis is made based on the following steps: obtaining sequence information from sequencing at least a portion of the mixture of fetal and maternal nucleic acid molecules derived from a biological test sample (e.g., a maternal plasma sample); computing from the sequencing data a normalized chromosome dose for one or more chromosomes of interest and/or a normalized segment dose for one or more segments of interest; determining a statistically significant difference between the chromosome dose for the chromosome of interest and/or the segment dose for the segment of interest, respectively, in the test sample and a threshold value established in a plurality of qualified (normal) samples; and providing the prenatal diagnosis based on the statistical difference. A diagnosis of normal or affected is made as described in step 165 of the method. A "no call" is provided in the event that the diagnosis of normal or affected cannot be made with confidence.
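The dose calculation can be sketched as a ratio of tag densities. In this sketch the tag density of a set of sequences is taken to be the total tag count divided by the total length of the set, which is an assumption; the function names, the dictionary layout, and the numbers in the usage lines are hypothetical.

```python
def tag_density(counts, lengths, sequences):
    """Total tags mapped to the given sequences divided by their total length."""
    total_tags = sum(counts[s] for s in sequences)
    total_length = sum(lengths[s] for s in sequences)
    return total_tags / total_length


def sequence_dose(counts, lengths, interest, normalizers):
    """Tag density of the sequence of interest over that of its normalizing sequence(s)."""
    return tag_density(counts, lengths, [interest]) / tag_density(counts, lengths, normalizers)


# Hypothetical counts and lengths, chosen only to keep the arithmetic simple.
counts = {"chr21": 165, "chr14": 300}
lengths = {"chr21": 1000, "chr14": 2000}
dose_chr21 = sequence_dose(counts, lengths, "chr21", ["chr14"])
```

Because the same normalizing sequence is used for every sample, the length terms cancel when doses are compared between samples, which is consistent with length normalization being optional.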
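The three-way classification can be sketched with a pair of thresholds. The sketch assumes an aneuploidy that raises the dose (such as a trisomy), so values at or above the upper threshold are called affected; for a loss such as monosomy the comparisons would be reversed. The function name and the threshold values in the usage lines are hypothetical and are not taken from the Examples.

```python
def classify_sample(dose, normal_max, affected_min):
    """Classify a dose as "normal", "affected", or "no call".

    normal_max: doses at or below this value are called normal.
    affected_min: doses at or above this value are called affected.
    Doses strictly between the two thresholds are a no call.
    """
    if normal_max > affected_min:
        raise ValueError("normal_max must not exceed affected_min")
    if dose >= affected_min:
        return "affected"
    if dose <= normal_max:
        return "normal"
    return "no call"


# Hypothetical thresholds on a normalized dose scale.
calls = [classify_sample(d, normal_max=2.5, affected_min=4.0) for d in (1.0, 3.0, 5.0)]
```

When the two thresholds coincide, the no-call zone disappears and every sample receives a call.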
Samples and sample processing
Samples
Samples for determining CNVs, e.g., chromosomal aneuploidies, partial aneuploidies, etc., can include samples taken from any cell, tissue, or organ for which copy number variation of one or more sequences of interest will be determined. It is desirable that these samples comprise nucleic acid present in the cells and/or "cell-free" nucleic acid (e.g., cfDNA).
In certain embodiments, it is advantageous to obtain cell-free nucleic acids, such as cell-free DNA (cfDNA). Cell-free nucleic acids, including cell-free DNA, can be obtained from biological samples, including, but not limited to, plasma, serum, and urine, by various methods known in the art (see, e.g., Fan et al., Proc Natl Acad Sci 105). To separate cell-free DNA from cells in a sample, different methods can be used, including but not limited to fractionation, centrifugation (e.g., density gradient centrifugation), DNA-specific precipitation, high-throughput cell sorting, and/or other separation methods. Commercial kits for manual and automated isolation of cfDNA are available (Roche Diagnostics, Indianapolis, IN; Qiagen, Valencia, CA; Macherey-Nagel, Duren, DE). Biological samples comprising cfDNA have been used in assays to determine the presence or absence of chromosomal abnormalities, such as trisomy 21, by sequencing assays that can detect chromosomal aneuploidies and/or different polymorphisms.
In various embodiments, cfDNA present in a sample can be specifically or non-specifically enriched prior to use (e.g., prior to preparing a sequencing library). Non-specific enrichment of sample DNA refers to whole genome amplification of the genomic DNA fragments of a sample, which can be used to increase the amount of sample DNA prior to preparing a cfDNA sequencing library. Non-specific enrichment may be selective enrichment of one of the two genomes present in a sample comprising more than one genome. For example, non-specific enrichment can be selective for the fetal genome in a maternal sample, which can be achieved by known methods to increase the ratio of fetal DNA to maternal DNA in the sample. Alternatively, the non-specific enrichment may be a non-selective amplification of both genomes present in the sample. For example, the non-specific amplification may be amplification of fetal and maternal DNA in a sample comprising a mixture of DNA from the fetal and maternal genomes. Methods for whole genome amplification are known in the art. Degenerate oligonucleotide-primed PCR (DOP), primer extension PCR (PEP), and multiple displacement amplification (MDA) are examples of whole genome amplification methods. In certain embodiments, a sample comprising a mixture of cfDNA from different genomes is not enriched for cfDNA of the genomes present in the mixture. In other embodiments, a sample comprising a mixture of cfDNA from different genomes is non-specifically enriched for any one of the genomes present in the sample.
Samples comprising nucleic acids to which the methods described herein are applied typically include biological samples ("test samples"), such as those described above. In certain embodiments, the nucleic acid to be screened for one or more CNVs is purified or isolated by any of a number of well-known methods.
Thus, in certain embodiments, a sample comprises or consists of a purified or isolated polynucleotide, or may comprise a sample such as a tissue sample, a biological fluid sample, a cell sample, and the like. Suitable biological fluid samples include, but are not limited to, blood, plasma, serum, sweat, tears, sputum, urine, ear discharge, lymph, saliva, cerebrospinal fluid, lavage fluids, bone marrow suspensions, vaginal fluids, transcervical lavage, brain fluid, ascites, milk, respiratory, intestinal and genitourinary secretions, amniotic fluid, and leukapheresis samples. In certain embodiments, the sample is one that is readily obtainable by a non-invasive procedure, such as blood, plasma, serum, sweat, tears, sputum, urine, ear discharge, saliva, or stool. In certain embodiments, the sample is a peripheral blood sample or a plasma and/or serum fraction of a peripheral blood sample. In other embodiments, such a biological sample is a swab or smear, a biopsy specimen, or a cell culture. In another embodiment, such a sample is a mixture of two or more biological samples; for example, a biological sample may include two or more of a biological fluid sample, a tissue sample, and a cell culture sample. As used herein, the terms "blood", "plasma" and "serum" expressly encompass fractions or processed portions thereof. Similarly, when a sample is taken from a biopsy, swab, smear, or the like, the "sample" expressly encompasses a processed fraction or portion derived from such biopsy, swab, smear, or the like.
In certain embodiments, the sample may be obtained from a variety of sources, including but not limited to: samples from different individuals, samples from different stages of development of the same or different individuals, samples from different diseased individuals (e.g., individuals with cancer or suspected of having a genetic disorder), normal individuals, samples obtained at different stages of disease in an individual, samples from individuals undergoing different treatments for disease, samples from individuals undergoing different environmental factors, samples from individuals susceptible to a disease condition, samples from individuals exposed to an infectious disease factor (e.g., HIV), and the like.
In an illustrative but non-limiting embodiment, such a sample is a maternal sample obtained from a pregnant female (e.g., a pregnant woman). In such cases, the sample can be analyzed using the methods described herein to provide a prenatal diagnosis of potential chromosomal abnormalities in the fetus. Such a maternal sample may be a tissue sample, a biological fluid sample, or a cell sample. Biological fluids include (as non-limiting examples): blood, plasma, serum, sweat, tears, sputum, urine, ear discharge, lymph, saliva, cerebrospinal fluid, lavage fluid, bone marrow suspension, vaginal discharge, transcervical lavage fluid, brain fluid, ascites, breast milk, secretions of the respiratory, intestinal, and genitourinary tracts, and leukapheresis samples.
In another illustrative but non-limiting embodiment, the maternal sample is a mixture of two or more biological samples, which may include, for example, two or more of a biological fluid sample, a tissue sample, and a cell culture sample. In some embodiments, such a sample is one that is readily obtainable by a non-invasive procedure, e.g., blood, plasma, serum, sweat, tears, sputum, urine, milk, ear discharge, saliva, and feces. In some embodiments, such a biological sample is a peripheral blood sample, and/or a plasma or serum fraction thereof. In other embodiments, such a biological sample is a swab or smear, a biopsy specimen, or a cell culture sample. As disclosed above, the terms "blood", "plasma" and "serum" expressly encompass fractions or processed portions thereof. Similarly, when a sample is taken from a biopsy, swab, smear, etc., the "sample" expressly encompasses a processed fraction or portion derived from the biopsy, swab, smear, etc.
In certain embodiments, the sample may also be a tissue, cell, or other polynucleotide-containing source obtained from in vitro culture. These cultured samples can be taken from a variety of sources, including but not limited to: a culture (e.g., tissue or cells) maintained under different media and conditions (e.g., pH, pressure, or temperature), a culture (e.g., tissue or cells) maintained for periods of different lengths, a culture (e.g., tissue or cells) treated with different factors or agents (e.g., drug candidates, or modulators), or a culture of different types of tissue and/or cells.
Methods for isolating nucleic acids from biological sources are well known and will vary depending on the nature of the source. One of ordinary skill in the art can readily isolate one or more nucleic acids from a source as needed for the methods described herein. In some cases, it may be advantageous to fragment nucleic acid molecules in a nucleic acid sample. Fragmentation may be random or it may be specific, as is the case for example with restriction enzyme digestion. Methods for random fragmentation are well known in the art and include, for example, limited DNase digestion, alkaline treatment and physical shearing. In one embodiment, the sample nucleic acid is obtained as cfDNA, which is not subjected to fragmentation.
In other illustrative embodiments, the sample nucleic acid is obtained as genomic DNA that is fragmented into fragments of about 300 or more, about 400 or more, or about 500 or more base pairs, and the NGS method can be readily applied thereto.
Sequencing library preparation
In one embodiment, the methods described herein can utilize next generation sequencing (NGS) technologies that allow multiple samples to be sequenced individually as genomic molecules (i.e., singleplex sequencing) or as pooled samples comprising indexed genomic molecules in a single sequencing run (e.g., multiplex sequencing). These methods can generate up to several hundred million reads of DNA sequence. In various embodiments, the sequence of the genomic nucleic acid and/or the indexed genomic nucleic acid can be determined using, for example, the NGS technologies described herein. In various embodiments, the large amount of sequence data obtained using NGS can be analyzed using one or more processors as described herein.
In various embodiments, the use of these sequencing techniques does not involve the preparation of sequencing libraries.
However, in certain embodiments, the sequencing methods contemplated herein involve the preparation of sequencing libraries. In one exemplary method, preparation of a sequencing library comprises generating a random collection of adaptor-modified DNA fragments (e.g., polynucleotides) that are ready for sequencing. A sequencing library of polynucleotides may be prepared from DNA or RNA, including equivalents or analogs of either DNA or cDNA, e.g., DNA or cDNA that is complementary or copy DNA produced from an RNA template by the action of a reverse transcriptase. The polynucleotide may originate in a double-stranded form (e.g., dsDNA such as genomic DNA fragments, cDNA, PCR amplification products, etc.), or in certain embodiments, the polynucleotide may originate in a single-stranded form (e.g., ssDNA, RNA, etc.) and have been converted to a dsDNA form. For example, in certain embodiments, single-stranded mRNA molecules can be copied into double-stranded cDNA suitable for use in preparing sequencing libraries. The precise sequence of the primary polynucleotide molecule is generally not important to the method of library preparation and may be known or unknown. In one embodiment, the polynucleotide molecule is a DNA molecule. More specifically, in certain embodiments, the polynucleotide molecule represents the entire genetic complement of an organism or substantially the entire genetic complement of an organism, and is a genomic DNA molecule (e.g., cellular DNA, cell-free DNA (cfDNA), etc.) that typically includes intron sequences and exon sequences (coding sequences) as well as non-coding regulatory sequences (e.g., promoter and enhancer sequences). In certain embodiments, the primary polynucleotide molecule comprises a human genomic DNA molecule, such as a cfDNA molecule present in the peripheral blood of a pregnant subject.
The preparation of sequencing libraries for certain NGS sequencing platforms is facilitated by the use of polynucleotides that comprise a specific range of fragment sizes. Preparation of these libraries typically involves fragmenting large polynucleotides (e.g., cell genomic DNA) to obtain polynucleotides within a desired size range.
Fragmentation can be achieved by any of a variety of methods known to those of ordinary skill in the art. For example, fragmentation can be achieved by mechanical means including, but not limited to, nebulization, sonication, and hydrodynamic shearing. However, mechanical fragmentation typically cleaves the DNA backbone at C-O, P-O and C-C bonds, producing a heterogeneous mixture of blunt ends and 3'- and 5'-overhanging ends with broken C-O, P-O and C-C bonds (see, e.g., Alnemri and Litwack, J. Biol. Chem. 265:17323-17333 [1990]; Richards and Boyer, J. Mol. Biol. 11:327-240 [1965]). These ends may need to be repaired because they may lack the 5'-phosphate necessary for the subsequent enzymatic reactions needed to prepare DNA for sequencing (e.g., ligation of sequencing adaptors).
In contrast, cfDNA typically exists in fragments of less than about 300 base pairs, and thus fragmentation is not typically required for generating sequencing libraries using cfDNA samples.
Typically, a polynucleotide, whether it has been forcibly fragmented (e.g., fragmented in vitro) or naturally exists in fragment form, is converted into blunt-ended DNA having a 5'-phosphate and a 3'-hydroxyl group. Standard protocols, such as those used for sequencing on the Illumina platform described elsewhere herein, direct the user to end-repair the sample DNA, to purify the end-repaired product prior to dA tailing, and to purify the dA-tailed product prior to the adaptor-ligation step of library preparation.
Various embodiments of the sequencing library preparation methods described herein omit one or more of the steps that standard protocols typically require to obtain a modified DNA product that can be sequenced by NGS. The abbreviated method (ABB method), the one-step method, and the two-step method are described below. Consecutive dA tailing and adaptor ligation is referred to herein as the two-step method. Consecutive dA tailing, adaptor ligation and amplification is referred to herein as the one-step method. In various embodiments, the ABB method as well as the two-step method can be performed in solution or on a solid surface. In certain embodiments, the one-step method is performed on a solid surface.
FIG. 2 compares a standard method (e.g., Illumina) with the abbreviated method (ABB; Example 2), the two-step method, and the one-step method (Examples 3-6) for preparing DNA molecules for sequencing by NGS according to embodiments of the present invention.
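The relationship between the four workflows can be made concrete with a small sketch. The step lists below are assembled from the descriptions in this section (the standard protocol purifies after end repair and again after dA tailing, whereas the other methods run their enzymatic steps consecutively). The names `WORKFLOWS` and `omitted_steps` are illustrative assumptions and do not come from the specification; the final clean-up after adaptor ligation is deliberately not modeled.

```python
from collections import Counter

# Step lists as described in this section. "purification" refers only to the
# intermediate clean-ups that the standard protocol places between enzymatic
# steps; purification after adaptor ligation is not modeled here.
WORKFLOWS = {
    "standard": ["end repair", "purification", "dA tailing",
                 "purification", "adaptor ligation"],
    "ABB": ["end repair", "dA tailing", "adaptor ligation"],
    "two-step": ["dA tailing", "adaptor ligation"],
    "one-step": ["dA tailing", "adaptor ligation", "amplification"],
}


def omitted_steps(method):
    """Return the standard-workflow steps that `method` leaves out, with counts."""
    # Counter subtraction keeps only positive counts, so steps that a method
    # adds (e.g. amplification in the one-step method) are ignored.
    missing = Counter(WORKFLOWS["standard"]) - Counter(WORKFLOWS[method])
    return dict(missing)


for name in ("ABB", "two-step", "one-step"):
    print(name, omitted_steps(name))
```

Read this way, the ABB method drops only the two intermediate purifications, while the two-step and one-step methods additionally drop end repair.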
Abbreviated preparation - ABB
In one embodiment, an abbreviated method (ABB method) for preparing a sequencing library is provided, which comprises the consecutive steps of end repair, dA tailing, and adaptor ligation (ABB). In embodiments that do not require a dA tailing step for preparing sequencing libraries (see, e.g., protocols for sequencing using the Roche 454 and SOLiD™ 3 platforms), the consecutive steps of end repair and adaptor ligation exclude the step of purifying the end-repaired product prior to adaptor ligation.
The sequencing library preparation method comprising the consecutive steps of end repair, dA tailing and adaptor ligation is referred to herein as the abbreviated method (ABB) and has been shown to produce sequencing libraries of unexpectedly improved quality while shortening sample analysis (see, e.g., Example 2). According to some embodiments, the ABB method may be performed in solution, as exemplified herein. The ABB method can also be performed on a solid surface by first end-repairing and dA-tailing the DNA in solution and then binding the DNA to the solid surface, as described elsewhere herein for the one-step and two-step preparations on solid surfaces. All three enzymatic steps, including the step of ligating adaptors to the dA-tailed DNA, are performed in the absence of polyethylene glycol. Published protocols for ligating adaptors to DNA direct the user to perform the ligation in the presence of polyethylene glycol. Applicants determined that the ligation of adaptors to dA-tailed DNA can be performed without polyethylene glycol.
In another embodiment, preparing a sequencing library does not require end-repair of cfDNA prior to the dA tailing step. Applicants have determined that cfDNA, which does not need to be fragmented, also does not have to be end-repaired, and preparing a cfDNA sequencing library according to embodiments of the invention excludes the end-repair step and the purification steps, thereby combining enzymatic reactions and further simplifying the preparation of DNA to be sequenced. cfDNA exists as a mixture of blunt ends and 3'- and 5'-overhanging ends, which are generated in vivo by the action of nucleases that cleave cellular genomic DNA into cfDNA fragments terminating in a 5'-phosphate and a 3'-hydroxyl group. Eliminating the end-repair step selects for cfDNA molecules that naturally exist as blunt-ended molecules and for cfDNA molecules that naturally have 5'-overhangs, which are filled in by the polymerase activity of the enzyme, such as Klenow Exo- polymerase, used to attach one or more deoxynucleotides to the 3'-OH (dA tailing) as described below. Eliminating the end-repair step of cfDNA selects against cfDNA molecules with 3'-overhangs (3'-OH). Surprisingly, exclusion of these 3'-OH cfDNA molecules from the sequencing library did not affect the representation of genomic sequences in the library, indicating that the end-repair step can be excluded from the preparation of a cfDNA sequencing library (see Examples). In addition to cfDNA, other types of unrepaired polynucleotides that can be used to prepare sequencing libraries include DNA molecules produced by reverse transcription of RNA molecules (e.g., mRNA, siRNA, sRNA) and unrepaired DNA molecules that are DNA amplicons synthesized from phosphorylated primers. When unphosphorylated primers are used, DNA reverse transcribed from RNA and/or DNA amplified from a DNA template (i.e., a DNA amplicon) can also be phosphorylated after synthesis by a polynucleotide kinase.
In another embodiment, unrepaired DNA is used to prepare a sequencing library according to the two-step method, in which end repair of the DNA is excluded and the unrepaired DNA undergoes the two consecutive steps of dA tailing and adaptor ligation (see FIG. 2). The two-step method can be performed in solution or on a solid surface. When performed in solution, the two-step method uses DNA obtained from a biological sample, excludes the step of end-repairing the DNA, and adds a single deoxynucleotide (e.g., deoxyadenosine (A)) to the 3'-ends of the polynucleotides in the unrepaired DNA sample, for example, by the activity of certain types of DNA polymerase, such as Taq polymerase or Klenow Exo- polymerase. In the subsequent consecutive step, the dA-tailed products, which are compatible with the 'T' overhang present on the 3' end of each duplex region of commercially available adaptors, are ligated to the adaptors. dA tailing prevents self-ligation of blunt-ended polynucleotides and thereby favors the formation of adaptor-ligated sequences. Thus, in some embodiments, unrepaired cfDNA is subjected to the consecutive steps of dA tailing and adaptor ligation, wherein the dA-tailed DNA is prepared from unrepaired DNA and is not purified after the dA-tailing reaction. Double-stranded adaptors can be ligated to both ends of the dA-tailed DNA. A set of adaptors with the same sequence or a set of two different adaptors can be used. In various embodiments, one or more different sets of the same or different adaptors may also be used. The adaptors may include an index sequence to enable multiplex sequencing of the library DNA. Ligation of the adaptors to the dA-tailed DNA is optionally performed in the absence of polyethylene glycol.
Two-step preparation in solution
In various embodiments, when the two-step method is performed in solution, the products of the adaptor ligation reaction can be purified to remove unligated adaptors and adaptors that may have ligated to one another. Purification can also select a size range of templates for cluster generation, which can optionally be preceded by amplification, e.g., PCR amplification. The ligation products can be purified by any of a variety of methods including, but not limited to, gel electrophoresis, solid phase reversible immobilization (SPRI), and the like. In some embodiments, the purified adaptor-ligated DNA is amplified, e.g., by PCR, prior to sequencing. Some sequencing platforms require a further amplification of the library DNA. For example, the Illumina platform requires that cluster amplification of library DNA be performed as an integral part of sequencing according to the Illumina technology. In other embodiments, the purified adaptor-ligated DNA is denatured and the single-stranded DNA molecules are attached to the flow cell of the sequencer. Thus, in certain embodiments, a method for preparing a sequencing library in solution from unrepaired DNA for NGS sequencing comprises obtaining DNA molecules from a sample; and subjecting the unrepaired DNA molecules obtained from the sample to the consecutive steps of dA tailing and adaptor ligation.
As indicated above, in various embodiments, these methods of library preparation are incorporated into methods of determining copy number variation (CNV), such as aneuploidy. Thus, in one illustrative embodiment, there is provided a method for determining the presence or absence of one or more fetal chromosomal aneuploidies, the method comprising: (a) obtaining a maternal sample comprising a mixture of fetal and maternal cell-free DNA; (b) isolating the mixture of fetal and maternal cfDNA from the sample; (c) preparing a sequencing library from the mixture of fetal and maternal cfDNA, wherein preparing the library comprises the consecutive steps of dA tailing and adaptor ligation of cfDNA, wherein preparing the library does not comprise end-repairing the cfDNA, and wherein the preparing is performed in solution; (d) performing massively parallel sequencing of at least a portion of the sequencing library to obtain sequence information for the fetal and maternal cfDNA in the sample; (e) storing, at least temporarily, the sequence information in a computer readable medium; (f) computationally identifying, using the stored sequence information, the number of sequence tags for each of the one or more chromosomes of interest and the number of sequence tags for the normalizing sequence for each of the one or more chromosomes of interest; (g) computationally calculating a chromosome dose for each of the chromosome(s) of interest using the number of sequence tags for each of the chromosome(s) of interest and the number of sequence tags for the normalizing sequence for each of the chromosome(s) of interest; and (h) comparing the chromosome dose for each of the chromosome(s) of interest to a respective threshold for each of the chromosome(s) of interest, and thereby determining the presence or absence of a fetal chromosomal aneuploidy in the sample, wherein steps (e)-(h) are performed using one or more processors. This method is illustrated in Examples 3 and 4.
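The computational steps (f) through (h) can be illustrated with a minimal sketch. It assumes that the chromosome dose is the simple ratio of the tag count for the chromosome of interest to the tag count for its normalizing sequence, and that a dose above the threshold flags over-representation. The function names, the toy tag counts, and the threshold value of 0.10 are hypothetical and chosen only for illustration; they are not values from the specification.

```python
from collections import Counter


def count_tags(tag_chromosomes):
    """Step (f): count mapped sequence tags per chromosome."""
    return Counter(tag_chromosomes)


def chromosome_dose(counts, chrom_of_interest, normalizing_chroms):
    """Step (g): tags on the chromosome of interest divided by tags on the
    normalizing sequence (assumed here to be one or more whole chromosomes)."""
    denominator = sum(counts[c] for c in normalizing_chroms)
    if denominator == 0:
        raise ValueError("normalizing sequence has no mapped tags")
    return counts[chrom_of_interest] / denominator


def exceeds_threshold(dose, threshold):
    """Step (h): compare the dose with a chromosome-specific threshold."""
    return dose > threshold


# Toy data: each entry is the chromosome to which one sequence tag mapped.
tags = ["chr21"] * 12 + ["chr9"] * 100 + ["chr1"] * 50
counts = count_tags(tags)
dose_21 = chromosome_dose(counts, "chr21", ["chr9"])
print(dose_21, exceeds_threshold(dose_21, threshold=0.10))  # hypothetical threshold
```

In practice each chromosome of interest would have its own normalizing sequence and its own threshold, typically derived from a set of qualified (unaffected) samples; that training step is outside the scope of this sketch.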
Two-step and one-step solid phase preparation
In certain embodiments, the sequencing library is prepared on a solid surface according to the two-step method described above for preparing the library in solution. Preparing a sequencing library on a solid surface according to a two-step method comprises obtaining DNA molecules, e.g. cfDNA, from a sample and performing the sequential steps of dA tailing and adaptor ligation, wherein adaptor ligation is performed on the solid surface. Either repaired or unrepaired DNA may be used. In certain embodiments, the aptamer-ligated product is isolated from a solid surface, purified, and amplified prior to sequencing. In other embodiments, the aptamer-ligated product is isolated from a solid surface, purified, and not amplified prior to sequencing. In still other embodiments, the aptamer-ligated product is amplified, isolated from a solid surface, and purified. In certain embodiments, the purified product is amplified. In other embodiments, the purified product is not amplified. The sequencing protocol may include amplification, such as clustered amplification. In various embodiments, the isolated aptamer ligated product is purified prior to amplification and/or sequencing.
In certain embodiments, the sequencing library is prepared according to a one-step process on a solid surface. In various embodiments, preparing a sequencing library on a solid surface according to a one-step method comprises obtaining DNA molecules, e.g., cfDNA, from a sample and performing successive steps of dA tailing, aptamer ligation and amplification, wherein the aptamer ligation is performed on the solid surface. The aptamer-linked product need not be isolated prior to purification.
Figure 3 depicts a two-step and one-step method for preparing sequencing libraries on a solid surface. Sequencing libraries can be prepared on solid surfaces using DNA, with or without repair. In certain embodiments, unrepaired DNA is used. Examples of unrepaired DNA that can be used to prepare sequencing libraries on solid surfaces include, but are not limited to, cfDNA, DNA that has been reverse transcribed from RNA using phosphorylated primers, DNA that has been amplified from a DNA template using phosphorylated primers (i.e., phosphorylated DNA amplicons). Examples of repaired DNA that can be used to prepare sequencing libraries on solid surfaces include, but are not limited to, cfDNA and pieces of genomic DNA that have been blunt-ended and phosphorylated (i.e., repaired phosphorylated DNA produced by reverse transcription of RNA, e.g., mRNA, sRNA, siRNA, etc.). In certain illustrative embodiments, unrepaired cfDNA obtained from a maternal sample is used to prepare a sequencing library.
Preparing a sequencing library on a solid surface comprises coating the solid surface with a first part of a two-part binder, modifying a first aptamer by attaching a second part of the two-part binder to the aptamer, and immobilizing the aptamer on the solid surface by a binding interaction of the first and second parts of the two-part binder. For example, preparing a sequencing library on a solid surface may comprise attaching a polypeptide, polynucleotide or small molecule to one end of a library adaptor, the polypeptide, polynucleotide or small molecule being capable of forming a binding complex with a polypeptide, polynucleotide or small molecule immobilized on the solid surface. Solid surfaces that can be used to immobilize polypeptides, polynucleotides or small molecules include, but are not limited to, plastic, paper, thin films, filter paper, chips, needles or slides, silica or polymer beads (e.g., polypropylene, polystyrene, polycarbonate), 2D or 3D molecular scaffolds, or any support for solid phase synthesis of polypeptides or polynucleotides.
The bonds between polypeptide-polypeptide, polypeptide-polynucleotide, polypeptide-small molecule, and polynucleotide-polynucleotide conjugates can be covalent or non-covalent. Preferably, the binding complex is bound by non-covalent bonds. For example, binders that can be used to prepare sequencing libraries on solid surfaces include, but are not limited to, streptavidin-biotin binders, antibody-antigen binders, and ligand-receptor binders. Examples of polypeptide-polynucleotide conjugates that can be used to prepare sequencing libraries on solid surfaces include, but are not limited to, DNA-binding protein-DNA conjugates. Examples of polynucleotide-polynucleotide conjugates that can be used to prepare sequencing libraries on solid surfaces include, but are not limited to, oligodT-oligoA and oligodT-oligodA. Examples of polypeptide-small molecule and polynucleotide-small molecule conjugates include streptavidin-biotin.
According to the embodiment of the solid surface method (one and two steps) as shown in fig. 3, the solid surface of a vessel (e.g. a polypropylene PCR tube or 96-well plate) used for preparing a sequencing library is coated with a polypeptide such as streptavidin. The ends of the first set of aptamers are modified by attaching small molecules such as biotin molecules and biotinylated aptamers are bound to streptavidin on solid surfaces (1). Subsequently, the unrepaired or repaired DNA is attached to streptavidin-conjugated biotinylated aptamers, thereby immobilizing them on the solid surface (2). The second set of aptamers is attached to immobilized DNA (3).
Two-step preparation on solid phase
In one embodiment, the two-step method is performed using unrepaired DNA, e.g., cfDNA, to prepare a sequencing library on a solid surface. The unrepaired DNA is dA-tailed by attaching a single nucleotide base, e.g., dA, to the 3' ends of the strands of the unrepaired DNA, e.g., cfDNA. Optionally, multiple nucleotide bases may be attached to the unrepaired DNA. The mixture comprising the dA-tailed DNA is added to the adaptors immobilized on the solid surface, to which the DNA is ligated. The steps of dA tailing and adaptor ligation are consecutive, i.e., the dA-tailed product is not purified (as shown for the two-step method in FIG. 2). As described above, the adaptors may have overhangs that are complementary to the overhangs on the unrepaired DNA molecules. Subsequently, a second set of adaptors is added to the DNA-biotinylated adaptor complex to provide an adaptor-ligated DNA library. Optionally, repaired DNA is used to prepare the library. The repaired DNA may be genomic DNA that has been fragmented and subjected to in vitro enzymatic repair of its 3' and 5' ends. In one embodiment, DNA such as maternal cfDNA is end-repaired, dA-tailed, and ligated to adaptors immobilized on a solid surface in consecutive steps of end repair, dA tailing, and adaptor ligation, as described for the abbreviated method performed in solution.
In certain embodiments utilizing the two-step method, the adaptor-ligated DNA is detached from the solid surface by chemical or physical means (e.g., heat, UV light, etc.) (4 a in fig. 2), purified (5 in fig. 2), and optionally amplified in solution before the sequencing process is initiated. In other embodiments, the adaptor-ligated DNA is not amplified. In the absence of amplification, the adaptors ligated to the DNA can be constructed to include sequences that hybridize to oligonucleotides present on the flow cell of a sequencer (Kozarewa et al., Nat Methods 6:291-295 [2009]). The library of adaptor-ligated DNA is subjected to massively parallel sequencing as described for adaptor-ligated DNA generated in solution (6 in fig. 2). In certain embodiments, sequencing is massively parallel sequencing using sequencing by synthesis with reversible dye terminators. In other embodiments, the sequencing is massively parallel sequencing using sequencing by ligation. The sequencing process may include solid phase amplification, e.g., cluster amplification, as described elsewhere herein.
Thus, in various embodiments, a method for preparing a sequencing library on a solid surface from unrepaired DNA for NGS can comprise obtaining DNA molecules from a sample; and subjecting the unrepaired DNA molecules to the consecutive steps of dA tailing and adaptor ligation, wherein the adaptor ligation is performed on a solid phase. In certain embodiments, the adaptors may comprise index sequences to allow multiplexed sequencing of multiple samples within a single reaction vessel (e.g., one channel of a flow cell). As described above, the DNA molecule may be a cfDNA molecule, a DNA molecule reverse transcribed from RNA, an amplicon of a DNA molecule, and the like.
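How index sequences allow pooled samples to be separated again after sequencing can be sketched as follows. The sketch assumes, purely for illustration, that each read begins with a fixed-length index followed by the insert sequence and that indexes must match exactly; on some platforms the index is read in a separate step, and real pipelines usually tolerate mismatches. The function name `demultiplex`, the six-base index length, and the example indexes are hypothetical.

```python
def demultiplex(reads, index_to_sample, index_len=6):
    """Assign each read to a sample by the index at its 5' end.

    Returns (reads_by_sample, unassigned). The index is trimmed from
    assigned reads; reads with an unknown index are returned unchanged.
    """
    reads_by_sample = {sample: [] for sample in index_to_sample.values()}
    unassigned = []
    for read in reads:
        index, insert = read[:index_len], read[index_len:]
        sample = index_to_sample.get(index)
        if sample is None:
            unassigned.append(read)
        else:
            reads_by_sample[sample].append(insert)
    return reads_by_sample, unassigned


# Two samples pooled in one flow cell channel (toy indexes and reads).
indexes = {"ATCACG": "sample_A", "CGATGT": "sample_B"}
pooled = ["ATCACGTTGACCA", "CGATGTGGCATTA", "ATCACGAACCGGT", "TTTTTTACGTACG"]
by_sample, unknown = demultiplex(pooled, indexes)
print(by_sample)
print(unknown)
```

After demultiplexing, the reads of each sample can be mapped and counted separately, so that a chromosome dose is computed per sample rather than for the pool.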
As indicated above, in various embodiments, these library preparation methods are incorporated into methods for determining copy number variation (CNV), such as aneuploidy. Thus, in certain embodiments, the methods for preparing a sequencing library on a solid surface from unrepaired cfDNA are incorporated into a method for analyzing a maternal sample to determine the presence or absence of a fetal chromosomal aneuploidy. Accordingly, in one embodiment, there is provided a method for determining the presence or absence of one or more fetal chromosomal aneuploidies, the method comprising: (a) obtaining a maternal sample comprising a mixture of fetal and maternal cell-free DNA; (b) isolating the mixture of fetal and maternal cfDNA from the sample; (c) preparing a sequencing library from the mixture of fetal and maternal cfDNA, wherein preparing the library comprises the consecutive steps of dA tailing and adaptor ligation of cfDNA, wherein preparing the library does not comprise end-repairing the cfDNA, and wherein the preparing is performed on a solid surface; (d) performing massively parallel sequencing of at least a portion of the sequencing library to obtain sequence information for the fetal and maternal cfDNA in the sample; (e) storing, at least temporarily, the sequence information in a computer readable medium; (f) computationally identifying, using the stored sequence information, a number of sequence tags for each of the one or more chromosomes of interest and a number of sequence tags for the normalizing sequence for each of the one or more chromosomes of interest; (g) computationally calculating a chromosome dose for each of the chromosome(s) of interest using the number of sequence tags for each of the chromosome(s) of interest and the number of sequence tags for the normalizing sequence for each of the chromosome(s) of interest; and (h) comparing the chromosome dose for each of the chromosome(s) of interest to a respective threshold for each of the chromosome(s) of interest, and thereby determining the presence or absence of a fetal chromosomal aneuploidy in the sample, wherein steps (e)-(h) are performed using one or more processors. The sample may be a biological fluid sample such as plasma, serum, urine, or saliva. In certain embodiments, the sample is a maternal blood sample, or a plasma or serum fraction thereof. This method is illustrated in Example 4.
One-step preparation on solid phase
In another embodiment, unrepaired DNA is dA-tailed, and the dA-tailed product is not purified prior to amplification, such that the steps of dA tailing, adaptor ligation, and amplification are performed consecutively. Consecutive dA tailing, adaptor ligation and amplification, followed by purification prior to sequencing, is referred to herein as the one-step method. The one-step method may be performed on a solid surface (see, e.g., FIG. 3). The steps of attaching the first set of adaptors to a solid surface (1), ligating the unrepaired, dA-tailed DNA to the surface-bound adaptors (2), and ligating the second set of adaptors to the surface-bound DNA (3) can be performed as described above for the two-step method. However, in the one-step method, the surface-bound adaptor-ligated DNA can be amplified while attached to the solid surface (4 b in fig. 2). Subsequently, the resulting library of adaptor-ligated DNA generated on the solid surface is detached and purified (5 in fig. 2), and then subjected to massively parallel sequencing as described for adaptor-ligated DNA generated in solution. In certain embodiments, sequencing is massively parallel sequencing using sequencing by synthesis with reversible dye terminators. In other embodiments, the sequencing is massively parallel sequencing using sequencing by ligation.
Thus, in certain embodiments, a method is provided for preparing a sequencing library for NGS sequencing by performing steps comprising: obtaining DNA molecules from a sample; and subjecting the DNA molecules to the consecutive steps of dA tailing, adaptor ligation and amplification, wherein the adaptor ligation is performed on a solid surface. As described for the two-step method, in various embodiments, the adaptors can include index sequences to allow multiplexed sequencing of multiple samples within a single reaction vessel (e.g., one channel of a flow cell).
In certain embodiments, the DNA may be repaired. The DNA molecule may be a cfDNA molecule, a DNA molecule reverse transcribed from RNA, or an amplicon of a DNA molecule. Adaptor ligation is performed as described above. Excess unligated adaptors can be washed from the immobilized adaptor-ligated DNA; the reagents required for amplification are then added to the immobilized adaptor-ligated DNA, which is subjected to multiple rounds of amplification, e.g., PCR amplification, as is known in the art. In other embodiments, the adaptor-ligated DNA is not amplified. In the absence of amplification, the adaptor-ligated DNA can be removed from the solid surface by chemical or physical means (e.g., heat, ultraviolet light, etc.). In the absence of amplification, the adaptors ligated to the DNA can include sequences that hybridize to oligonucleotides present on the flow cell of a sequencer (Kozarewa et al., Nat Methods 6:291-295 [2009]).
In various embodiments, the sample can be a biological fluid sample (e.g., blood, plasma, serum, urine, cerebral spinal fluid, amniotic fluid, saliva, etc.). In certain embodiments, the method for preparing a sequencing library from unrepaired cfDNA on a solid surface is included as a step in a method for analyzing a maternal sample to determine the presence or absence of a fetal chromosomal aneuploidy.
Accordingly, in one embodiment, there is provided a method for determining the presence or absence of one or more fetal chromosomal aneuploidies, the method comprising: (a) obtaining a maternal sample comprising a mixture of fetal and maternal cell-free DNA; (b) isolating the mixture of fetal and maternal cfDNA from the sample; (c) preparing a sequencing library from the mixture of fetal and maternal cfDNA, wherein preparing the library comprises the consecutive steps of dA tailing, adaptor ligation, and amplification of cfDNA, and wherein the preparing is performed on a solid surface; (d) performing massively parallel sequencing of at least a portion of the sequencing library to obtain sequence information for the fetal and maternal cfDNA in the sample; (e) storing, at least temporarily, the sequence information in a computer readable medium; (f) computationally identifying, using the stored sequence information, a number of sequence tags for each of the one or more chromosomes of interest and a number of sequence tags for the normalizing sequence for each of the one or more chromosomes of interest; (g) computationally calculating a chromosome dose for each of the chromosome(s) of interest using the number of sequence tags for each of the chromosome(s) of interest and the number of sequence tags for the normalizing sequence for each of the chromosome(s) of interest; and (h) comparing the chromosome dose for each of the chromosome(s) of interest to a respective threshold for each of the chromosome(s) of interest, and thereby determining the presence or absence of a fetal chromosomal aneuploidy in the sample, wherein steps (e)-(h) are performed using one or more processors. In certain embodiments, the DNA is end-repaired. In other embodiments, preparing the library does not include end-repairing the cfDNA. This method is illustrated in Examples 5 and 6.
The processes described above for preparing sequencing libraries are applicable to sample analysis methods, including but not limited to methods for determining Copy Number Variation (CNV), and methods for determining the presence or absence of any polymorphism of a sequence of interest in a sample comprising a single genome and in a sample comprising a mixture of at least two genomes that are known or suspected to differ in one or more sequences of interest.
Amplification of the adaptor-ligated product prepared on a solid phase or in solution may be required to introduce, into the adaptor-ligated template molecules, the oligonucleotide sequences required for hybridization to the flow cell or other surface present in some NGS platforms. The contents of an amplification reaction are known to those of ordinary skill in the art and include appropriate substrates (e.g., dNTPs), enzymes (e.g., DNA polymerase), and buffer components required for the amplification reaction. Optionally, amplification of the adaptor-ligated polynucleotides can be omitted. In general, the amplification reaction requires at least two amplification primers, e.g., primer oligonucleotides, which may be the same or different, and which may include an "adaptor-specific portion" capable of annealing during the annealing step to a primer-binding sequence in the polynucleotide molecule to be amplified (or the complement thereof if the template is viewed as a single strand).
Once formed, a library of templates prepared according to the methods described above can be used for solid phase nucleic acid amplification, which may be required for certain NGS platforms. As used herein, the term "solid phase amplification" refers to any nucleic acid amplification reaction that is performed on or in association with a solid support such that all or a portion of the amplification product is immobilized on the solid support as it is formed. In particular embodiments, the term encompasses solid phase polymerase chain reaction (solid phase PCR) and solid phase isothermal amplification, which are reactions analogous to standard solution phase amplification except that one or both of the forward and reverse amplification primers are immobilized on the solid support. Solid phase PCR also covers systems such as emulsions, in which one primer is anchored to a bead and the other primer is in free solution, and colony formation in a solid phase gel matrix, in which one primer is anchored to the surface and one primer is in free solution.
In various embodiments, after amplification, the sequencing library can be analyzed by microfluidic capillary electrophoresis to ensure that the library is free of adaptor dimers or single stranded DNA. Libraries of template polynucleotide molecules are particularly useful in solid phase sequencing methods. In addition to providing templates for solid phase sequencing and solid phase PCR, library templates also provide templates for whole genome amplification.
Marker nucleic acids for tracking and verifying sample integrity
In various embodiments, the integrity of the sample and the tracking of the sample can be verified by sequencing a mixture of sample genomic nucleic acid (e.g., cfDNA) and, for example, accompanying marker nucleic acid that has been introduced into the sample prior to processing.
The marker nucleic acid can be combined with the test sample (e.g., a biological source sample) and subjected to processes that include, for example, one or more of the following steps: fractionating the biological source sample, e.g., obtaining a substantially cell-free plasma fraction from a whole blood sample; purifying nucleic acids from a fractionated biological source sample (e.g., plasma) or from an unfractionated biological source sample (e.g., a tissue sample); and sequencing. In certain embodiments, sequencing comprises preparing a sequencing library. The sequence or combination of sequences of the marker molecules combined with the source sample is selected to be unique to the source sample. In certain embodiments, the unique marker molecules in the sample all have the same sequence. In other embodiments, the unique marker molecules in the sample are a combination of a plurality of sequences, e.g., two, three, four, five, six, seven, eight, nine, ten, fifteen, twenty or more different sequences.
In one embodiment, the integrity of the sample can be verified using a plurality of marker nucleic acid molecules having the same sequence. Alternatively, the identity of the sample may be verified using a plurality of marker nucleic acid molecules having at least two, at least three, at least four, at least five, at least six, at least seven, at least eight, at least nine, at least ten, at least 11, at least 12, at least 13, at least 14, at least 15, at least 16, at least 17, at least 18, at least 19, at least 20, at least 25, at least 30, at least 35, at least 40, at least 50, or more different sequences. Verifying the integrity of a plurality of biological samples (i.e., two or more biological samples) requires that each of the two or more samples be marked with marker nucleic acids having a sequence that is unique to each of the plurality of test samples being marked. For example, a first sample may be marked with a marker nucleic acid having sequence A, and a second sample may be marked with a marker nucleic acid having sequence B. Alternatively, a first sample may be marked with a plurality of marker nucleic acid molecules all having sequence A, and a second sample may be marked with a mixture of sequences B and C, where sequences A, B and C are marker molecules having different sequences.
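The marking scheme just described can be illustrated with a minimal sketch, offered as one possible check and not as the method of this disclosure. It assumes that reads are compared to marker sequences by exact string match and that a marker counts as present once it is seen in a minimum number of reads; the function name, the toy marker sequences, and the read-count cutoff are all hypothetical.

```python
from collections import Counter
from typing import Iterable, Set


def verify_sample(reads: Iterable[str],
                  expected: Set[str],
                  all_markers: Set[str],
                  min_reads: int = 2) -> bool:
    """Return True if every expected marker is seen and no other known marker is.

    Exact-match counting and the min_reads cutoff are simplifying assumptions.
    """
    counts = Counter(read for read in reads if read in all_markers)
    missing = {m for m in expected if counts[m] < min_reads}
    foreign = {m for m in all_markers - expected if counts[m] >= min_reads}
    return not missing and not foreign


# Toy marker sequences A, B and C (hypothetical, far shorter than real markers).
SEQ_A, SEQ_B, SEQ_C = "ACGTTGCAAC", "TTGGCCAATG", "GATCGATCCA"
markers = {SEQ_A, SEQ_B, SEQ_C}

reads_sample_1 = [SEQ_A, "GGGGCCCCAA", SEQ_A, SEQ_A]          # marked with A only
reads_sample_2 = [SEQ_B, SEQ_C, SEQ_B, "TTTTAAAACC", SEQ_C]   # marked with B and C

ok_1 = verify_sample(reads_sample_1, {SEQ_A}, markers)
ok_2 = verify_sample(reads_sample_2, {SEQ_B, SEQ_C}, markers)
swapped = verify_sample(reads_sample_2, {SEQ_A}, markers)
```

A sample swap shows up here as a missing expected marker together with an unexpected one, which is why both conditions are tested.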
Marker nucleic acids can be added to the sample at any stage of sample preparation that occurs prior to library preparation (if a library is to be prepared) and sequencing. In one embodiment, the marker molecule may be combined with an unprocessed source sample. For example, the marker nucleic acid may be provided in a collection tube used to collect a blood sample. Alternatively, the marker nucleic acid may be added to the blood sample after the blood draw. In one embodiment, the marker nucleic acid is added to a container used to collect a biological fluid sample, for example, the marker nucleic acid is added to a blood collection tube used to collect a blood sample. In another embodiment, the marker nucleic acid is added to a fraction of the biological fluid sample. For example, the marker nucleic acid is added to the plasma and/or serum fraction of a blood sample (e.g., a maternal plasma sample). In yet another embodiment, the marker molecule is added to a purified sample (e.g., a nucleic acid sample that has been purified from a biological sample). For example, marker nucleic acids are added to samples of purified maternal and fetal cfDNA. Similarly, the marker nucleic acid may be added to a biopsy specimen prior to processing the specimen. In certain embodiments, the marker nucleic acid may be combined with a carrier that delivers the marker molecule into the cells of the biological sample. Cell delivery carriers include pH sensitive liposomes and cationic liposomes.
In various embodiments, the marker molecules have antigenomic sequences, i.e., sequences that are not present in the genome of the biological source sample. In an exemplary embodiment, the marker molecule used to verify the integrity of a human biological source sample has a sequence that is not present in the human genome. In an alternative embodiment, the marker molecule has a sequence that is not present in the source sample and in any one or more other known genomes. For example, marker molecules used to verify the integrity of a human biological source sample have sequences that are not present in the human genome and in the mouse genome. This alternative allows verification of the integrity of a test sample comprising two or more genomes. For example, the integrity of a human cell-free DNA sample obtained from a subject affected by a pathogen (e.g., a bacterium) can be verified using marker molecules having sequences that are present in neither the human genome nor the genome of the affecting bacterium. Sequences of genomes of many pathogens (e.g., bacteria, viruses, yeasts, fungi, protozoa, etc.) are publicly available on the World Wide Web through NCBI. In another embodiment, the marker molecule is a nucleic acid having a sequence that is not present in any known genome. The sequence of the marker molecules can be randomly generated by an algorithm.
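As a hedged illustration of generating a marker sequence by algorithm, the following sketch draws random candidate sequences and rejects any candidate that shares a subsequence of length k with a reference on either strand. The rejection criterion, the value of k, the toy reference string, and the function names are assumptions made for illustration; a real implementation would screen against complete genome assemblies with an aligner or a k-mer index.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string over the alphabet A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def kmers(seq: str, k: int) -> set:
    """All substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def make_marker(length: int, reference_kmers: set, k: int,
                rng: random.Random, max_tries: int = 1000) -> str:
    """Draw random sequences until one shares no k-mer with the reference set."""
    for _ in range(max_tries):
        candidate = "".join(rng.choice("ACGT") for _ in range(length))
        if kmers(candidate, k).isdisjoint(reference_kmers):
            return candidate
    raise RuntimeError("no marker found; try a larger k or more tries")


# Toy stand-in for a reference genome (hypothetical sequence).
reference = "ACGTACGTTAGCCGATAGCTAGGCTAACGGATTCCGATCGATTTAGC"
K = 8
# Index both strands of the reference so that either orientation is excluded.
ref_index = kmers(reference, K) | kmers(reverse_complement(reference), K)

marker = make_marker(40, ref_index, K, random.Random(0))
```

Indexing both strands of the reference is sufficient to exclude a candidate whose reverse complement would match, so the candidate itself is only scanned in one orientation.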
In various embodiments, the marker molecule can be a naturally occurring deoxyribonucleic acid (DNA), a ribonucleic acid, or an artificial nucleic acid analog (nucleic acid mimic), including peptide nucleic acids (PNA), morpholino nucleic acids, locked nucleic acids, glycol nucleic acids, and threose nucleic acids, which differ from naturally occurring DNA or RNA in that the backbone of the molecule is altered, or DNA mimics that do not have a phosphodiester backbone. Deoxyribonucleic acids can be derived from naturally occurring genomes or can be produced in the laboratory by using enzymes or by solid phase chemical synthesis. Chemical methods can also be used to produce DNA mimics not found in nature. Available DNA derivatives in which the phosphodiester linkage is replaced but the deoxyribose is retained include, but are not limited to, DNA mimics having a backbone formed by a thioformacetal or a carboxamide linkage, which have been shown to be good structural DNA mimics. Other DNA mimics include morpholino derivatives and peptide nucleic acids (PNA), which contain an N-(2-aminoethyl)glycine-based pseudopeptide backbone (Ann Rev Biophys Biomol Struct 24). PNA is an extremely good structural mimic of DNA (or of ribonucleic acid [RNA]); PNA oligomers are able to form very stable duplex structures with Watson-Crick complementary DNA and RNA (or PNA) oligomers, and they can also bind to targets in duplex DNA by helix invasion (Mol Biotechnol 26). Another good structural mimic/analog of DNA that can be used as a marker molecule is phosphorothioate DNA, in which one of the non-bridging oxygens is replaced by a sulfur. This modification reduces the action of endonucleases and exonucleases, including 5' to 3' and 3' to 5' DNA POL 1 exonuclease, nucleases S1 and P1, ribonucleases, serum nucleases and snake venom phosphodiesterase.
The length of the marker molecules may be distinct or indistinct from that of the sample nucleic acids, i.e., the length of the marker molecules may be similar to that of the sample genomic molecules, or it may be greater or smaller than that of the sample genomic molecules. The length of the marker molecules is measured by the number of nucleotides or nucleotide analog bases that make up the marker molecule. Marker molecules having a length that differs from that of the sample genomic molecules can be distinguished from the source nucleic acids using separation methods known in the art. For example, the difference in length between the marker and the sample nucleic acid molecules can be determined by electrophoretic separation, e.g., capillary electrophoresis. Size differentiation may be advantageous for quantifying and assessing the quality of the marker and sample nucleic acids. Preferably, the marker nucleic acids are shorter than the genomic nucleic acids, and long enough to exclude them from being mapped to the genome of the sample. For example, a human sequence of about 30 bases is needed to map uniquely to the human genome. Thus, in certain embodiments, the marker molecules used in sequencing bioassays of human samples should be at least 30 bp long.
The choice of marker molecule length is determined primarily by the sequencing technique used to verify the integrity of the source sample. The length of the sample genomic nucleic acids being sequenced can also be considered. For example, certain sequencing techniques employ clonal amplification of polynucleotides, which may require that the genomic polynucleotides to be clonally amplified have a minimum length. For example, sequencing using the Illumina GAII sequence analyzer involves in vitro clonal amplification by bridge PCR (also known as cluster amplification) of polynucleotides having a minimum length of 110 bp, to which adaptors are ligated to provide a nucleic acid of at least 200 bp and less than 600 bp that can be clonally amplified and sequenced. In certain embodiments, the adaptor-ligated marker molecule has a length between about 200 bp and about 600 bp, between about 250 bp and 550 bp, between about 300 bp and 500 bp, or between about 350 bp and 450 bp. In other embodiments, the adaptor-ligated marker molecule is about 200 bp in length. For example, when sequencing fetal cfDNA present in a maternal sample, the length of the marker molecule can be chosen to be similar to that of the fetal cfDNA molecules. Thus, in one embodiment, the marker molecules used in an assay that comprises massively parallel sequencing of cfDNA in a maternal sample to determine the presence or absence of a fetal chromosomal aneuploidy may be about 150 bp, about 160 bp, about 170 bp, about 180 bp, about 190 bp, or about 200 bp in length; preferably, the marker molecule is about 170 bp. Other sequencing methods, such as SOLiD sequencing, Polony sequencing, and 454 sequencing, use emulsion PCR to clonally amplify DNA molecules for sequencing, and each technique dictates the minimum and maximum length of the molecules to be amplified. The length of marker molecules to be sequenced as clonally amplified nucleic acids can be up to about 600 bp.
In certain embodiments, the marker molecules to be sequenced may be greater than 600bp in length.
Single molecule sequencing techniques that do not employ molecular clonal amplification and are capable of sequencing nucleic acids over an extremely wide range of template lengths do not in most cases require any particular length of the molecule to be sequenced. However, the yield of sequence per unit mass depends on the number of 3' terminal hydroxyl groups, so having a relatively short template for sequencing is more efficient than having a long template. If starting from nucleic acids longer than 1000nt, it is generally advantageous to cleave these nucleic acids to an average length of 100 to 200nt, so that more sequence information can be generated from nucleic acids of the same mass. Thus, the length of the marker molecule may range from tens of bases to thousands of bases. Marker molecules for single molecule sequencing may be up to about 25bp, up to about 50bp, up to about 75bp, up to about 100bp, up to about 200bp, up to about 300bp, up to about 400bp, up to about 500bp, up to about 600bp, up to about 700bp, up to about 800bp, up to about 900bp, up to about 1000bp or more in length.
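The relationship between template length and the number of sequenceable 3' ends per unit mass can be made concrete with a short calculation. The sketch below assumes one 3' terminal hydroxyl group per single-stranded molecule and an approximate average mass of 330 g/mol per nucleotide; both figures, and the function name, are assumptions introduced for illustration.

```python
AVOGADRO = 6.02214076e23       # molecules per mole
GRAMS_PER_MOL_PER_NT = 330.0   # assumed average mass of one nucleotide in single-stranded DNA


def strands_per_nanogram(length_nt: int) -> float:
    """Approximate number of single strands (one 3' end each) in 1 ng of DNA."""
    grams = 1e-9
    return grams * AVOGADRO / (length_nt * GRAMS_PER_MOL_PER_NT)


long_templates = strands_per_nanogram(1000)   # unsheared, 1000 nt
short_templates = strands_per_nanogram(150)   # sheared to an average of 150 nt

# For a fixed mass, the number of 3' ends scales inversely with length,
# so the gain is simply 1000 / 150 (between six- and seven-fold).
gain = short_templates / long_templates
```

Under these assumptions the gain does not depend on the per-nucleotide mass chosen, only on the ratio of the two lengths.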
The length chosen for the marker molecule is also determined by the length of the genomic nucleic acid being sequenced. For example, cfDNA circulates in the human bloodstream as genomic fragments of cellular genomic DNA. Fetal cfDNA molecules found in maternal plasma are generally shorter than maternal cfDNA molecules (Chan et al., Clin Chem 50). Size fractionation of circulating fetal DNA has confirmed that the average length of circulating fetal DNA fragments is <300 bp, whereas maternal DNA is estimated to be between about 0.5 kb and 1 kb (Li et al., Clin Chem 50:1002-1011 [2004]). These findings are consistent with those of Fan et al., who determined using NGS that fetal cfDNA is rarely >340 bp (Fan et al., Clin Chem 56:1279-1286 [2010]). DNA isolated from urine using standard silica-based methods consists of two fractions: high molecular weight DNA, which originates from shed cells, and a low molecular weight (150-250 base pair) fraction of transrenal DNA (Tr-DNA) (Botezatu et al., Clin Chem 46). Applying a recently developed technique for isolating cell-free nucleic acids from body fluids to the isolation of transrenal nucleic acids has shown that the DNA and RNA fragments present in urine are much shorter than 150 base pairs (U.S. patent application publication No. 20080139801). In embodiments where cfDNA is the genomic nucleic acid being sequenced, the marker molecules chosen can be up to about the length of the cfDNA. For example, the length of the marker molecules used in a maternal cfDNA sample to be sequenced, either as single nucleic acid molecules or as clonally amplified nucleic acids, can be between about 100 bp and 600 bp. In other embodiments, the sample genomic nucleic acids are fragments of larger molecules. For example, the sample genomic nucleic acid that is sequenced is fragmented cellular DNA.
In embodiments where the fragmented cellular DNA is sequenced, the length of the marker molecule may be up to the length of the DNA fragment. In certain embodiments, the length of the marker molecule is at least the minimum length required to uniquely map the sequence reads to the appropriate reference genome. In other embodiments, the length of the marker molecule is the minimum length required to exclude the marker molecule from being mapped to the sample reference genome.
Furthermore, the marker molecules can be used to verify samples that are not assayed by nucleic acid sequencing and that can instead be verified by common biotechniques other than sequencing, e.g., real-time PCR.
Sample controls (e.g., in-process positive controls for sequencing and/or analysis)
In various embodiments, marker sequences introduced into the sample, such as those described above, can serve as positive controls to verify the accuracy and efficacy of sequencing and subsequent processing and analysis.
Thus, compositions and methods for providing an in-process positive control (IPC) for sequencing DNA in a sample are provided. In certain embodiments, a positive control for sequencing cfDNA in a sample comprising a mixture of genomes is provided. An IPC can be used to relate baseline shifts in sequence information obtained from different sets of samples (e.g., samples sequenced at different times on different sequencing runs). Thus, for example, an IPC can relate the sequence information obtained for a maternal test sample to the sequence information obtained from a set of qualified samples that were sequenced at a different time.
Similarly, in the case of segment analysis, an IPC can relate the sequence information obtained from a subject for one or more particular segments to the sequence information obtained from a set of qualified samples (of similar sequences) that were sequenced at a different time. In certain embodiments, an IPC can relate the sequence information obtained from a subject for a particular cancer-associated locus to the sequence information obtained from a set of qualified samples (e.g., from known amplifications/deletions, etc.).
In addition, IPCs can be used as markers to track samples during sequencing. IPCs can also provide qualitative positive sequence dose values (e.g., NCVs) for one or more aneuploidies (e.g., trisomy 21, trisomy 13, trisomy 18) of the chromosome of interest to provide more proper interpretation and ensure reliability and accuracy of the data. In certain embodiments, IPCs comprising nucleic acids from male and female genomes can be established to provide doses of chromosomes X and Y in a maternal sample to determine whether a fetus is male.
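A minimal sketch of how an IPC might be scored is given below. It assumes that a normalized chromosome value (NCV) is the chromosome dose expressed as the number of sample standard deviations from the mean dose of a set of qualified samples; that formula, the function name, and all dose values are assumptions introduced for illustration.

```python
from statistics import mean, stdev
from typing import Sequence


def normalized_chromosome_value(dose: float, qualified_doses: Sequence[float]) -> float:
    """Dose expressed in standard deviations from the qualified-sample mean.

    The z-score form is an assumption of this sketch; at least two
    qualified doses are needed for the sample standard deviation.
    """
    return (dose - mean(qualified_doses)) / stdev(qualified_doses)


# Hypothetical chromosome 21 doses from unaffected (qualified) samples.
qualified = [0.0100, 0.0102, 0.0098, 0.0101, 0.0099]

# Hypothetical dose measured for a trisomy 21 IPC sequenced in the same run.
ipc_ncv = normalized_chromosome_value(0.0110, qualified)

# An IPC that fails to score as positive would flag the run for review.
ipc_is_positive = ipc_ncv > 3.0
```

The cutoff of 3.0 used here is an arbitrary placeholder; the appropriate decision value would be established from qualified and affected samples.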
The type and number of in-process controls depend on the type or nature of the test needed. For example, for a test that requires sequencing DNA from a sample comprising a mixture of genomes to determine whether a chromosomal aneuploidy is present, the in-process control can comprise DNA obtained from a sample known to comprise the same chromosomal aneuploidy that is being tested. In certain embodiments, the IPC comprises DNA from a sample known to comprise an aneuploidy of a chromosome of interest. For example, the IPC for a test to determine the presence or absence of a fetal trisomy (e.g., trisomy 21) in a maternal sample comprises DNA obtained from an individual with trisomy 21. In certain embodiments, the IPC comprises a mixture of DNA obtained from two or more individuals with different aneuploidies. For example, for a test to determine the presence or absence of trisomy 13, trisomy 18, trisomy 21, and monosomy X, the IPC comprises a combination of DNA samples obtained from pregnant women each carrying a fetus with one of the aneuploidies being tested. In addition to complete chromosomal aneuploidies, IPCs can be established to provide positive controls for tests to determine the presence or absence of partial aneuploidies.
An IPC that serves as the control for detecting a single aneuploidy can be established using a mixture of cellular genomic DNA obtained from two subjects, one of whom is the contributor of the aneuploid genome. For example, an IPC established as a control for a test to determine a fetal trisomy (e.g., trisomy 21) can be created by combining genomic DNA from a male or female subject carrying the trisomic chromosome with genomic DNA from a female subject known not to carry the trisomic chromosome. Genomic DNA can be extracted from cells of both subjects and sheared to provide fragments of between about 100 bp and 400 bp, between about 150 bp and 350 bp, or between about 200 bp and 300 bp to simulate the circulating cfDNA fragments in maternal samples. The proportion of fragmented DNA from the subject carrying the aneuploidy (e.g., trisomy 21) is chosen to simulate the proportion of circulating fetal cfDNA found in maternal samples, so as to provide an IPC comprising a mixture of fragmented DNA in which about 5%, about 10%, about 15%, about 20%, about 25%, or about 30% of the DNA is from the subject carrying the aneuploidy. The IPC can comprise DNA from different subjects each carrying a different aneuploidy. For example, the IPC can comprise about 80% unaffected female DNA, and the remaining 20% can be DNA from three different subjects each carrying a trisomy of chromosome 21, chromosome 13, or chromosome 18. The mixture of fragmented DNA is then prepared for sequencing. Processing the mixture of fragmented DNA can comprise preparing a sequencing library, which can be sequenced in singleplex or multiplex fashion using any massively parallel method. Stock solutions of the genomic IPC can be stored and used in multiple diagnostic tests.
Alternatively, the IPC can be established using cfDNA obtained from a mother known to carry a fetus with a known chromosomal aneuploidy. For example, cfDNA can be obtained from a pregnant woman carrying a fetus with trisomy 21. The cfDNA is extracted from the maternal sample, cloned into a bacterial vector, and grown in bacteria to provide an ongoing source of the IPC. The DNA can be excised from the bacterial vector using restriction enzymes. Alternatively, the cloned cfDNA can be amplified by, e.g., PCR. The IPC DNA can be processed for sequencing in the same run as the cfDNA from the test samples that are to be analyzed for the presence or absence of chromosomal aneuploidies.
While the establishment of IPCs is described above with respect to trisomies, it will be appreciated that IPCs can be established to reflect other partial aneuploidies including, for example, various segment amplifications and/or deletions. Thus, for example, where various cancers are known to be associated with particular amplifications (e.g., breast cancer associated with 20q13), IPCs can be established that incorporate those known amplifications.
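The composition of such a mixed IPC can be worked out with a small calculation, sketched below under the assumption that the aneuploid fraction is divided equally among the aneuploid donors; the equal split, the function name, the donor labels, and the 100 ng total are illustrative choices and are not specified in this disclosure.

```python
from typing import Dict, Sequence


def ipc_mixture(total_ng: float,
                unaffected_fraction: float,
                aneuploid_donors: Sequence[str]) -> Dict[str, float]:
    """Mass of fragmented DNA (ng) to take from each donor for one IPC stock.

    Assumes the aneuploid portion is split equally among the listed donors.
    """
    if not 0.0 < unaffected_fraction < 1.0:
        raise ValueError("unaffected_fraction must be between 0 and 1")
    if not aneuploid_donors:
        raise ValueError("at least one aneuploid donor is required")
    share = total_ng * (1.0 - unaffected_fraction) / len(aneuploid_donors)
    mixture = {"unaffected female": total_ng * unaffected_fraction}
    mixture.update({donor: share for donor in aneuploid_donors})
    return mixture


# About 80% unaffected female DNA; the remaining 20% split across three trisomy donors.
mix = ipc_mixture(100.0, 0.80, ["trisomy 21", "trisomy 13", "trisomy 18"])
```

With these inputs each trisomy donor contributes roughly 6.7 ng per 100 ng of stock, i.e., each aneuploid genome is present at about 6.7% of the mixture.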
Sequencing method
As noted above, the prepared samples (e.g., sequencing libraries) are sequenced as part of a procedure to identify copy number variations. Any of a variety of sequencing techniques may be utilized.
Some sequencing technologies are commercially available, such as the sequencing-by-hybridization platform from Affymetrix Inc. (Sunnyvale, CA), the sequencing-by-synthesis platforms from 454 Life Sciences (Bradford, CT), Illumina/Solexa (Hayward, CA) and Helicos Biosciences (Cambridge, MA), and the sequencing-by-ligation platform from Applied Biosystems (Foster City, CA), as described below. In addition to single molecule sequencing performed using the sequencing-by-synthesis of Helicos Biosciences, other single molecule sequencing technologies include, but are not limited to, the SMRT™ technology of Pacific Biosciences, the ION TORRENT™ technology, and nanopore sequencing developed, for example, by Oxford Nanopore Technologies.
While the automated Sanger method is considered a 'first generation' technology, Sanger sequencing, including automated Sanger sequencing, can also be used in the methods described herein. Additional suitable sequencing methods include, but are not limited to, nucleic acid imaging technologies, e.g., atomic force microscopy (AFM) or transmission electron microscopy (TEM). Exemplary sequencing technologies are described in more detail below.
In one illustrative but non-limiting embodiment, the methods described herein comprise obtaining sequence information for the nucleic acids in a test sample, e.g., cfDNA in a maternal sample, cfDNA or cellular DNA of a subject being screened for cancer, etc., using the Helicos True Single Molecule Sequencing (tSMS) technology (e.g., as described in Harris T.D. et al., Science 320:106-109 [2008]). In the tSMS technique, a DNA sample is cleaved into strands of approximately 100 to 200 nucleotides, and a polyA sequence is added to the 3' end of each DNA strand. Each strand is labeled by the addition of a fluorescently labeled adenosine nucleotide. The DNA strands are then hybridized to a flow cell, which contains millions of oligo-T capture sites immobilized on the flow cell surface. In certain embodiments, the template density can be about 100 million templates/cm². The flow cell is then loaded into an instrument, e.g., a HeliScope™ sequencer, and a laser illuminates the surface of the flow cell, revealing the position of each template. A CCD camera can map the position of the templates on the flow cell surface. The template fluorescent label is then cleaved and washed away. The sequencing reaction begins by introducing a DNA polymerase and a fluorescently labeled nucleotide. The oligo-T nucleic acid serves as a primer. The polymerase incorporates the labeled nucleotides into the primer in a template-directed manner. The polymerase and unincorporated nucleotides are removed. The templates that have directed incorporation of the fluorescently labeled nucleotide are identified by imaging the flow cell surface. After imaging, a cleavage step removes the fluorescent label, and the process is repeated with other fluorescently labeled nucleotides until the desired read length is achieved. Sequence information is collected with each nucleotide addition step.
Whole genome sequencing by single molecule sequencing techniques can eliminate or typically avoid PCR-based amplification when preparing sequencing libraries, and these methods allow for direct measurement of a sample, rather than measuring a copy of that sample.
In another illustrative but non-limiting embodiment, the methods described herein comprise obtaining sequence information for the nucleic acids in a test sample, e.g., cfDNA in a maternal test sample, cfDNA or cellular DNA of a subject being screened for cancer, etc., using 454 sequencing (Roche) (e.g., as described in Margulies, M. et al., Nature 437). 454 sequencing typically involves two steps. In the first step, DNA is sheared into fragments of approximately 300 to 800 base pairs, and the fragments are blunt-ended. Oligonucleotide adaptors are then ligated to the ends of the fragments. The adaptors serve as primers for amplification and sequencing of the fragments. The fragments can be attached to DNA capture beads, e.g., streptavidin-coated beads, using, e.g., Adaptor B, which contains a 5'-biotin tag. The fragments attached to the beads are PCR amplified within droplets of an oil-water emulsion. The result is multiple copies of clonally amplified DNA fragments on each bead. In the second step, the beads are captured in wells (e.g., picoliter-sized wells). Pyrosequencing is performed on each DNA fragment in parallel. Addition of one or more nucleotides generates a light signal that is recorded by a CCD camera in the sequencing instrument. The signal strength is proportional to the number of nucleotides incorporated. Pyrosequencing makes use of pyrophosphate (PPi), which is released upon nucleotide addition. PPi is converted to ATP by ATP sulfurylase in the presence of adenosine 5' phosphosulfate. Luciferase uses ATP to convert luciferin to oxyluciferin, and this reaction generates light that is measured and analyzed.
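Because the light signal is proportional to the number of nucleotides incorporated in each flow, a series of flow intensities can in principle be converted into a base sequence. The following sketch is a simplified illustration of that idea and not the base-calling method of any instrument; it assumes intensities normalized so that a single incorporation gives a signal of about 1.0, a cyclic nucleotide flow order of T, A, C, G, and rounding to the nearest whole number of bases. The function name and example intensities are hypothetical.

```python
from typing import Sequence


def decode_flowgram(intensities: Sequence[float], flow_order: str = "TACG") -> str:
    """Convert normalized per-flow light intensities into a base sequence.

    Each flow adds one nucleotide species; the rounded signal is taken as the
    number of identical bases incorporated (the homopolymer length).
    """
    bases = []
    for i, signal in enumerate(intensities):
        nucleotide = flow_order[i % len(flow_order)]
        run_length = max(0, round(signal))
        bases.append(nucleotide * run_length)
    return "".join(bases)


# Hypothetical normalized intensities for six consecutive flows (T, A, C, G, T, A).
read = decode_flowgram([1.02, 0.05, 2.10, 0.90, 0.00, 3.05])
```

Simple rounding becomes unreliable for long homopolymers, where the relative difference between successive run lengths is small; this sketch makes no attempt to model that.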
In another illustrative, but nonlimiting, embodiment, the invention is described hereinThe method comprises using SOLID TM Techniques (Applied Biosystems) are used to obtain sequence information of nucleic acids in test samples, such as cfDNA in maternal test samples, cfDNA or cellular DNA of subjects screened for cancer, and the like. In SOLID TM In the ligation sequencing method, genomic DNA is cut into fragments, and aptamers are attached to the 5 'and 3' ends of the fragments to generate a fragment library. Alternatively, the internal aptamer may be introduced as follows: ligating adaptors to the 5 'and 3' ends of the fragments, looping the fragments, digesting the looped fragments to produce inner adaptors, and attaching the adaptors to the 5 'and 3' ends of the resulting fragments to produce a paired library. Next, clonal bead populations are prepared in a microreactor containing beads, primers, templates, and PCR components. Following PCR, the template is denatured and the beads enriched to isolate beads with amplified template. The template on the selected beads is 3' modified to allow binding to the slide. The sequence can be determined by sequential hybridization and ligation of portions of the random oligonucleotide to a centrally determined base (or base pair) identified by a particular fluorophore. After recording the color, the ligated oligonucleotides are cleaved and removed, and the process is repeated.
In another illustrative but non-limiting embodiment, the methods described herein comprise the use of Single Molecule Real Time (SMRT) by pacific biosciences, inc TM ) Sequencing techniques to obtain sequence information of nucleic acids in a test sample, such as cfDNA in a maternal test sample, cfDNA or cellular DNA of a subject screened for cancer, and the like. In SMRT sequencing, the continuous binding of dye-labeled nucleotides is imaged during DNA synthesis. The single DNA polymerase molecule is attached to the bottom surface of a single zero mode wavelength detector (ZMW detector) where sequence information is obtained, while the phosphate-linked nucleotides are binding to the growing primer strand. ZMW detectors comprise a closed structure that allows for the observation of single nucleotide binding by DNA polymerase against a background of fluorescent nucleotides that diffuse rapidly outside the ZMW range (e.g., microseconds). The binding of nucleotides into a growing strand typically takes several milliseconds. During this time, the fluorescent label is excited anda fluorescent signal is generated and the fluorescent label is cleaved. Measuring the fluorescence of the corresponding dye indicates which base is bound. This process was repeated to obtain a sequence.
In another illustrative but non-limiting embodiment, the methods described herein include using nanopore sequencing methods (e.g., sorivv and mellea a., clinical chemistry (Clin Chem) 53 1996-2001[2007 ]) to obtain sequence information for nucleic acids in a test sample, such as cfDNA in a maternal test sample, cfDNA or cellular DNA of a subject screened for cancer, and the like. Nanopore sequencing DNA analysis techniques have been developed by a number of companies, including, for example, oxford Nanopore Technologies (Oxford, england), sequenom, nabesys (NABsys), and the like. Nanopore sequencing is a single molecule sequencing technique in which a single molecule of DNA is sequenced directly as it passes through a nanopore. Nanopores are small pores, typically about 1 nanometer in diameter. The nanopore is immersed in a conducting fluid and an electrical potential (voltage) is applied across it, creating a small current due to ionic conduction through the nanopore. The amount of current flowing is sensitive to the size and shape of the nanopore. As a DNA molecule passes through a nanopore, individual nucleotides on the DNA molecule cause varying degrees of obstruction to the nanopore, thereby causing varying degrees of change in the magnitude of the current passing through the nanopore. Thus, this change in current that occurs as the DNA molecule passes through the nanopore provides a readout of the DNA sequence.
In another illustrative but non-limiting embodiment, the methods described herein comprise obtaining sequence information for the nucleic acids in a test sample, e.g., cfDNA in a maternal test sample, or cfDNA or cellular DNA in a subject being screened for cancer, and the like, using a chemical-sensitive field effect transistor (chemFET) array (e.g., as described in U.S. Patent Application Publication No. 2009/0026082). In one example of this technique, DNA molecules can be placed into reaction chambers, and the template molecules can be hybridized to a sequencing primer bound to a polymerase. Incorporation of one or more triphosphates into a new nucleic acid strand at the 3' end of the sequencing primer can be discerned as a change in current by a chemFET. An array can have multiple chemFET sensors. In another example, single nucleic acids can be attached to beads, the nucleic acids can be amplified on the beads, and the individual beads can be transferred to individual reaction chambers on a chemFET array, with each chamber having a chemFET sensor, and the nucleic acids can be sequenced.
In another embodiment, the methods of the invention comprise obtaining sequence information for the nucleic acids in a test sample, e.g., cfDNA in a maternal test sample, using Halcyon Molecular's technology, which uses transmission electron microscopy (TEM). The method, termed Individual Molecule Placement Rapid Nano Transfer (IMPRNT), comprises using single atom resolution transmission electron microscope imaging of high molecular weight (150 kb or greater) DNA selectively labeled with heavy atom markers, and arranging these molecules on ultra-thin films in ultra-dense (3 nm strand-to-strand) parallel arrays with consistent base-to-base spacing. The electron microscope is used to image the molecules on the films to determine the position of the heavy atom markers and to extract base sequence information from the DNA. The method is further described in PCT patent publication WO 2009/046445. The method allows a complete human genome to be sequenced in less than ten minutes.
In another embodiment, the DNA sequencing technology is Ion Torrent single molecule sequencing, which pairs semiconductor technology with a simple sequencing chemistry to translate chemically encoded information (A, C, G, T) directly into digital information (0, 1) on a semiconductor chip. In nature, when a nucleotide is incorporated into a strand of DNA by a polymerase, a hydrogen ion is released as a byproduct. Ion Torrent uses a high-density array of micro-machined wells to perform this biochemical process in a massively parallel way. Each well holds a different DNA molecule. Beneath the wells is an ion-sensitive layer, and beneath that an ion sensor. When a nucleotide, for example a C, is added to a DNA template and is then incorporated into a strand of DNA, a hydrogen ion is released. The charge from that ion changes the pH of the solution, which can be detected by Ion Torrent's ion sensor. The sequencer, essentially the world's smallest solid-state pH meter, calls the base, going directly from chemical information to digital information. The Ion Personal Genome Machine (PGM™) sequencer then sequentially floods the chip with one nucleotide after another. If the next nucleotide that floods the chip is not a match, no voltage change is recorded and no base is called. If there are two identical bases on the DNA strand, the voltage is doubled, and the chip records two identical bases called. Direct detection allows nucleotide incorporation to be recorded in seconds.
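The flow-by-flow logic described above (no signal means no base called, a doubled signal means two identical bases) can be illustrated with a minimal Python sketch. The flow order `"TACG"`, the normalization of the signal so that 1.0 corresponds to one incorporated nucleotide, and the function name `call_bases` are all assumptions made for illustration; they are not taken from the instrument's actual software.

```python
def call_bases(flow_order, signals):
    """Convert per-flow signals into a base sequence.

    flow_order: nucleotides flooded over the chip in a repeating cycle
                (hypothetical order, e.g. "TACG").
    signals:    one normalized signal per flow; 1.0 is assumed to mean a
                single incorporation, 2.0 two identical bases, and so on.
    """
    bases = []
    for i, signal in enumerate(signals):
        nucleotide = flow_order[i % len(flow_order)]
        # Round to the nearest whole number of incorporations; a signal
        # near zero means the flooded nucleotide did not match.
        n_incorporated = max(0, int(signal + 0.5))
        bases.append(nucleotide * n_incorporated)
    return "".join(bases)


if __name__ == "__main__":
    example = [1.02, 0.05, 1.9, 0.0, 0.1, 0.97, 0.0, 1.1]
    print(call_bases("TACG", example))
```

In this sketch the third flow (C, signal 1.9) is read as two consecutive C bases, mirroring the doubled-voltage case in the text.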
In another embodiment, the methods of the invention comprise obtaining sequence information of nucleic acids in a test sample, such as cfDNA in a maternal test sample, using a hybridization sequencing method. Sequencing by hybridization includes contacting a plurality of polynucleotide sequences with a plurality of polynucleotide probes, wherein each of the plurality of polynucleotide probes can optionally be tethered to a substrate. The substrate may be a flat surface comprising an array of known nucleotide sequences. The pattern of hybridization to the array can be used to determine the polynucleotide sequence present in the sample. In other embodiments, each probe is tethered to a bead, such as a magnetic bead or the like. Hybridization to the beads can be determined and used to identify a plurality of polynucleotide sequences within a sample.
In another embodiment, the methods of the invention comprise obtaining sequence information for the nucleic acids in a test sample, e.g., cfDNA in a maternal test sample, by massively parallel sequencing of millions of DNA fragments using Illumina's sequencing-by-synthesis and reversible terminator-based sequencing chemistry (e.g., as described in Bentley et al., Nature 6:53-59 [2009]). The template DNA can be genomic DNA, e.g., cfDNA. In certain embodiments, genomic DNA from isolated cells is used as the template and is fragmented into lengths of several hundred base pairs. In other embodiments, cfDNA is used as the template, and fragmentation is not required because cfDNA exists as short fragments. For example, fetal cfDNA circulates in the bloodstream as fragments of approximately 170 base pairs (bp) in length (Fan et al., Clin Chem 56:1279-1286 [2010]), and no fragmentation of the DNA is required prior to sequencing. Illumina's sequencing technology relies on the attachment of fragmented genomic DNA to a planar, optically transparent surface on which oligonucleotide anchors are bound. The template DNA is end-repaired to generate 5'-phosphorylated blunt ends, and the polymerase activity of Klenow fragment is used to add a single A base to the 3' end of the blunt phosphorylated DNA fragments. This addition prepares the DNA fragments for ligation to oligonucleotide adapters, which have an overhang of a single T base at their 3' end to increase ligation efficiency. The adapter oligonucleotides are complementary to the flow cell anchors. Under limiting-dilution conditions, adapter-modified, single-stranded template DNA is added to the flow cell and immobilized by hybridization to the anchors. Attached DNA fragments are extended and bridge amplified to create an ultra-high density sequencing flow cell with hundreds of millions of clusters, each containing about 1,000 copies of the same template.
In one embodiment, the randomly fragmented genomic DNA (e.g., cfDNA) is amplified using PCR before it is subjected to cluster amplification. Alternatively, an amplification-free genomic library preparation is used, and the randomly fragmented genomic DNA, e.g., cfDNA, is enriched using cluster amplification alone (Kozarewa et al., Nature Methods 6:291-295 [2009]). The templates are sequenced using a robust four-color DNA sequencing-by-synthesis technology that employs reversible terminators with removable fluorescent dyes. High-sensitivity fluorescence detection is achieved using laser excitation and total internal reflection optics. Short sequence reads of about 20 bp to 40 bp (e.g., 36 bp) are aligned against a repeat-masked reference genome, and reads that map uniquely to the reference genome are identified using specially developed data analysis pipeline software. Non-repeat-masked reference genomes can also be used. Whether a repeat-masked or a non-repeat-masked reference genome is used, only reads that map uniquely to the reference genome are counted. After the first read is complete, the templates can be regenerated in situ to enable a second read from the opposite end of the fragments. Thus, either single-end or paired-end sequencing of the DNA fragments can be used. The DNA fragments present in the sample are partially sequenced, and sequence tags comprising reads of a predetermined length (e.g., 36 bp) that map to a known reference genome are counted. In one embodiment, the reference genome sequence is the NCBI36/hg18 sequence, which is available on the World Wide Web at genome.ucsc.edu/cgi-bin/hgGateway?org=Human&db=hg18&hgsid=166260105. Alternatively, the reference genome sequence is GRCh37/hg19, which is available on the World Wide Web at genome.ucsc.edu. Other sources of public sequence information include GenBank, dbEST, dbSTS, EMBL (the European Molecular Biology Laboratory), and DDBJ (the DNA Databank of Japan).
A variety of computer algorithms are available for aligning sequences, including, but not limited to, BLAST (Altschul et al., 1990), BLITZ (MPsrch) (Sturrock & Collins, 1993), FASTA (Pearson & Lipman, 1988), BOWTIE (Langmead et al., Genome Biology 10:R25.1-R25.10 [2009]), and ELAND (Illumina, Inc., San Diego, CA, USA). In one embodiment, one end of the clonally expanded copies of the plasma cfDNA molecules is sequenced and processed by bioinformatic alignment analysis for the Illumina Genome Analyzer, which uses the Efficient Large-Scale Alignment of Nucleotide Databases (ELAND) software.
In certain embodiments of the methods described herein, the mapped sequence tags comprise sequence reads of about 20bp, about 25bp, about 30bp, about 35bp, about 40bp, about 45bp, about 50bp, about 55bp, about 60bp, about 65bp, about 70bp, about 75bp, about 80bp, about 85bp, about 90bp, about 95bp, about 100bp, about 110bp, about 120bp, about 130bp, about 140bp, about 150bp, about 200bp, about 250bp, about 300bp, about 350bp, about 400bp, about 450bp, or about 500 bp. It is expected that technological advances will enable single-ended reads greater than 500bp, and when paired-end reads are generated, reads greater than about 1000 bp. In one embodiment, the mapped sequence tags comprise 36bp sequence reads. Mapping of sequence tags is obtained by comparing tag sequences to reference sequences to determine the chromosomal origin of sequenced nucleic acid (e.g., cfDNA) molecules, and does not require specific genetic sequence information. Minor mismatches (0 to 2 mismatches per sequence tag) may account for minor polymorphisms that may exist between the reference genome and the genomes in the mixed sample.
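The mapping rule described above, in which a tag is assigned to a chromosome while tolerating 0 to 2 mismatches and only uniquely mapping tags are kept, can be sketched in Python as follows. This is a deliberately naive scan over a toy reference, not how ELAND or BOWTIE work internally; the function names, the dictionary layout of the reference, and the toy sequences are assumptions for illustration only.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def map_tag(tag, reference, max_mismatches=2):
    """Return (chromosome, position) if the tag maps to exactly one place.

    reference: dict of chromosome name -> sequence (toy stand-in for a
               reference genome). Tags with no hit or with several hits
               return None, so only uniquely mapped tags get counted.
    """
    hits = []
    for chrom, seq in reference.items():
        for pos in range(len(seq) - len(tag) + 1):
            window = seq[pos:pos + len(tag)]
            if hamming(tag, window) <= max_mismatches:
                hits.append((chrom, pos))
    return hits[0] if len(hits) == 1 else None


if __name__ == "__main__":
    toy_reference = {
        "chr1": "ACGTTGCAAGCTTAGGCTAC",
        "chr2": "TTTTTTTTTTGGGGGGGGGG",
    }
    # One mismatch relative to chr1 (last base), still placed on chr1.
    print(map_tag("GCAAGCTTAC", toy_reference))
```

Note that the tag is assigned by chromosomal origin only; the mismatching base itself is not interpreted, which matches the statement that no specific genetic sequence information is required.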
Multiple sequence tags are typically obtained for each sample. In certain embodiments, at least about 3 × 10⁶ sequence tags, at least about 5 × 10⁶ sequence tags, at least about 8 × 10⁶ sequence tags, at least about 10 × 10⁶ sequence tags, at least about 15 × 10⁶ sequence tags, at least about 20 × 10⁶ sequence tags, at least about 30 × 10⁶ sequence tags, at least about 40 × 10⁶ sequence tags, or at least about 50 × 10⁶ sequence tags comprising reads of between 20 bp and 40 bp (e.g., 36 bp) are obtained per sample by mapping the reads to the reference genome. In one embodiment, all sequence reads are mapped to all regions of the reference genome. In one embodiment, the tags that have been mapped to all regions of the reference genome (e.g., all chromosomes) are counted, and the CNV (i.e., the over- or under-representation) of a sequence of interest (e.g., a chromosome or a portion thereof) in the mixed DNA sample is determined. The method does not require differentiating between the two genomes.
The accuracy required for correctly determining whether a CNV (e.g., aneuploidy) is present or absent in a sample depends on the variation in the number of sequence tags that map to the reference genome among samples within a sequencing run (inter-chromosomal variability), and on the variation in the number of sequence tags that map to the reference genome in different sequencing runs (inter-sequencing variability). For example, the variation can be particularly pronounced for tags that map to GC-rich or GC-poor reference sequences. Other variation can result from the use of different protocols for extracting and purifying the nucleic acids, from the preparation of the sequencing libraries, and from the use of different sequencing platforms. The present method uses sequence doses (chromosome doses or segment doses), based on knowledge of normalizing sequences (normalizing chromosome sequences or normalizing segment sequences), to intrinsically account for the accrued variability stemming from inter-chromosomal (intra-run), inter-sequencing (inter-run), and platform-dependent variability. Chromosome doses are based on knowledge of a normalizing chromosome sequence, which can consist of a single chromosome, or of two or more chromosomes selected from chromosomes 1 through 22, X, and Y. Alternatively, a normalizing chromosome sequence can consist of a single chromosome segment, or of two or more segments of one chromosome or of two or more chromosomes. Segment doses are based on knowledge of a normalizing segment sequence, which can consist of a single segment of any one chromosome, or of two or more segments of any two or more of chromosomes 1 through 22, X, and Y.
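As a rough illustration of why a dose computed against a normalizing sequence absorbs run-to-run variation, the following Python sketch assumes that a chromosome dose is the ratio of the number of tags mapped to the chromosome of interest to the number of tags mapped to the normalizing chromosome sequence. That ratio definition, the function name `chromosome_dose`, and the tag counts are assumptions made for this example; the choice of normalizing chromosomes is not addressed here.

```python
def chromosome_dose(tag_counts, chrom_of_interest, normalizing_chroms):
    """Ratio of tags on the chromosome of interest to tags on the
    normalizing chromosome sequence (one or more chromosomes)."""
    denominator = sum(tag_counts[c] for c in normalizing_chroms)
    if denominator == 0:
        raise ValueError("no tags mapped to the normalizing sequence")
    return tag_counts[chrom_of_interest] / denominator


if __name__ == "__main__":
    # Hypothetical tag counts for the same sample in a shallow and a
    # deeper sequencing run (every count scaled by the same factor).
    run_a = {"chr21": 150, "chr9": 500, "chr1": 1000}
    run_b = {chrom: 3 * n for chrom, n in run_a.items()}

    print(chromosome_dose(run_a, "chr21", ["chr9"]))
    print(chromosome_dose(run_b, "chr21", ["chr9"]))
    print(chromosome_dose(run_a, "chr21", ["chr9", "chr1"]))
```

Because both numerator and denominator scale with sequencing depth, the two runs give the same dose even though their raw tag counts differ threefold.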
Singleplex sequencing
Figure 4 illustrates a flowchart of an embodiment of the method in which marker nucleic acids are combined with the source sample nucleic acids of a single sample to assay for a genetic abnormality while determining the integrity of the biological source sample. In step 410, a biological source sample comprising genomic nucleic acids is obtained. In step 420, marker nucleic acids are combined with the biological source sample to provide a marked sample. A sequencing library of a mixture of clonally amplified source sample genomic nucleic acids and marker nucleic acids is prepared in step 430, and the library is sequenced in a massively parallel manner in step 440 to provide sequencing information pertaining to the source genomic nucleic acids and marker nucleic acids of the sample. Massively parallel sequencing methods provide sequencing information as sequence reads, which are mapped to one or more reference genomes to generate sequence tags that can be analyzed. In step 450, all sequencing information is analyzed, and in step 460, the integrity of the source sample is verified based on the sequencing information pertaining to the marker molecules. The integrity of the source sample is verified by determining the concordance between the sequencing information obtained for the marker molecules at step 450 and the known sequence of the marker molecules that were added to the original source sample at step 420. The same process can be applied to multiple samples that are sequenced separately, with each sample comprising molecules having sequences unique to that sample, i.e., one sample is marked with a unique marker molecule and is sequenced separately from the other samples in the flow cell or slide of the sequencer. If the integrity of the sample is verified, the sequencing information pertaining to the genomic nucleic acids of the sample can be analyzed to provide information, e.g., about the condition of the subject from which the source sample was obtained.
For example, if the integrity of the sample is verified, the sequencing information pertaining to the genomic nucleic acids is analyzed to determine the presence or absence of a chromosomal abnormality. If the integrity of the sample is not verified, the sequencing information is disregarded.
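The decision logic of steps 450 and 460 can be sketched in Python as below. The exact-match criterion between the observed marker reads and the known marker sequence, the requirement that at least one marker read be recovered, and all function and variable names are assumptions made for illustration; the text does not specify a particular concordance threshold.

```python
def verify_sample_integrity(marker_reads, expected_marker):
    """True only if marker reads were recovered and every one of them
    matches the marker sequence known to have been added to the sample.
    (Exact matching is an assumed, simplified concordance criterion.)"""
    return bool(marker_reads) and all(r == expected_marker for r in marker_reads)


def analyze_if_verified(marker_reads, expected_marker, genomic_tags, analyze):
    """Run the downstream analysis only for a verified sample.

    Returns the analysis result, or None when integrity is not verified
    and the sequencing information is therefore disregarded.
    """
    if not verify_sample_integrity(marker_reads, expected_marker):
        return None
    return analyze(genomic_tags)


if __name__ == "__main__":
    tags = ["chr1", "chr21", "chr21", "chr9"]

    def count_chr21(mapped_tags):
        return mapped_tags.count("chr21")

    print(analyze_if_verified(["AAGTC", "AAGTC"], "AAGTC", tags, count_chr21))
    print(analyze_if_verified(["AAGTC", "CCGTA"], "AAGTC", tags, count_chr21))
```

The `analyze` callable is a placeholder for whatever downstream assessment is applied to the genomic tags of a verified sample.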
The method outlined in Figure 4 is also applicable to bioassays that comprise singleplex sequencing of single molecules, e.g., tSMS by Helicos, SMRT by Pacific Biosciences, BASE by Oxford Nanopore, and other technologies, such as that proposed by IBM, which do not require the preparation of libraries.
Multiplex sequencing
The large number of sequence reads that can be obtained per sequencing run permits the analysis of pooled samples, i.e., multiplexing, which maximizes sequencing capacity and reduces workflow. For example, the massively parallel sequencing of eight libraries performed using the eight-lane flow cell of the Illumina Genome Analyzer can be multiplexed to sequence two or more samples in each lane, such that 16, 24, 32, or more samples are sequenced in a single run. Parallelizing the sequencing of multiple samples, i.e., multiplex sequencing, requires the incorporation of sample-specific index sequences, also known as barcodes, during the preparation of the sequencing libraries. A sequencing index is a distinct base sequence of about 5, about 10, about 15, about 20, about 25, or more bases that is added at the 3' end of the genomic and marker nucleic acids. Multiplexing systems enable the sequencing of hundreds of biological samples within a single sequencing run. Indexed sequencing libraries for sequencing clonally amplified sequences can be prepared by incorporating the index sequence into one of the PCR primers used for cluster amplification. Alternatively, the index sequence can be incorporated into the adapter, which is ligated to the cfDNA prior to PCR amplification. Indexed libraries for single molecule sequencing can be created by incorporating the index sequence at the 3' end of the marker and genomic molecules, or 5' to the sequence needed for hybridization to the flow cell anchors (e.g., the poly-A tail added for single molecule sequencing using tSMS). Sequencing the uniquely marked, indexed nucleic acids provides index sequence information that identifies the samples in the pooled sample libraries, and the sequence information of the marker molecules correlates the sequencing information of the genomic nucleic acids to the sample source.
In embodiments in which the multiple samples are sequenced individually (i.e., singleplex sequencing), the marker and genomic nucleic acid molecules of each sample need only be modified to contain the adapter sequences required by the sequencing platform, and the index sequences are omitted.
Figure 5 provides a flowchart of an embodiment 500 of the method for verifying the integrity of samples that are subjected to a multistep multiplex sequencing bioassay, i.e., one in which the nucleic acids from individual samples are combined and sequenced as a complex mixture. In step 510, a plurality of biological source samples, each comprising genomic nucleic acids, is obtained. In step 520, unique marker nucleic acids are combined with each of the biological source samples to provide a plurality of uniquely marked samples. In step 530, a sequencing library of sample genomic nucleic acids and marker nucleic acids is prepared for each of the uniquely marked samples. Library preparation for samples destined to undergo multiplex sequencing comprises incorporating a distinct index tag into the sample and marker nucleic acids of each uniquely marked sample, so that the source nucleic acid sequences of each sample can be correlated with the corresponding marker nucleic acid sequences and identified in the complex mixture. In embodiments of the method comprising marker molecules that can be enzymatically modified (e.g., DNA), the index molecules can be incorporated at the 3' end of the sample and marker molecules by ligating sequenceable adapter sequences that comprise the index sequences. In embodiments of the method comprising marker molecules that cannot be enzymatically modified (e.g., DNA analogs that do not have a phosphate backbone), the index sequences are incorporated at the 3' end of the analog marker molecules during synthesis. The sequencing libraries of two or more samples are pooled and loaded onto the flow cell of the sequencer, where they are sequenced in a massively parallel manner in step 540. In step 550, all sequencing information is analyzed, and in step 560, the integrity of the source samples is verified based on the sequencing information pertaining to the marker molecules.
The integrity of each of the plurality of source samples is verified by first grouping the sequence tags associated with identical index sequences, so that the genomic and marker sequences belonging to each of the libraries made from the genomic molecules of the plurality of samples are associated with a distinguishing sequence. The grouped marker and genomic sequences are then analyzed to verify that the sequence obtained for the marker molecules corresponds to the known unique sequence added to the corresponding source sample. If the integrity of the sample is verified, the sequencing information pertaining to the genomic nucleic acids of the sample can be analyzed to provide genetic information about the subject from which the source sample was obtained. For example, if the integrity of the sample is verified, the sequencing information pertaining to the genomic nucleic acids is analyzed to determine the presence or absence of a chromosomal abnormality. A lack of correspondence between the sequencing information of the marker molecules and the known sequence indicates a sample mix-up, and the accompanying sequencing information pertaining to the genomic cfDNA molecules is disregarded.
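The grouping-then-verification procedure described above can be sketched in Python as follows. Reads are modeled as `(index, kind, sequence)` tuples where `kind` is `"marker"` or `"genomic"`; this representation, the exact-match criterion for marker concordance, and all names are assumptions made for illustration rather than a description of any sequencer's output format.

```python
from collections import defaultdict


def demultiplex(reads):
    """Group pooled reads by index sequence.

    reads: iterable of (index, kind, sequence) tuples, where kind is
           either "marker" or "genomic" (hypothetical representation).
    """
    groups = defaultdict(lambda: {"marker": [], "genomic": []})
    for index, kind, sequence in reads:
        groups[index][kind].append(sequence)
    return groups


def verified_genomic_reads(reads, expected_markers):
    """Return genomic reads only for samples whose marker reads all match
    the unique marker known to have been added to that sample. Samples
    with a mismatch (a possible mix-up) or no marker reads are dropped."""
    verified = {}
    for index, group in demultiplex(reads).items():
        expected = expected_markers.get(index)
        markers = group["marker"]
        if expected is not None and markers and all(m == expected for m in markers):
            verified[index] = group["genomic"]
    return verified


if __name__ == "__main__":
    pooled = [
        ("IDX1", "marker", "AAAT"),
        ("IDX1", "genomic", "ACGT"),
        ("IDX2", "marker", "AAAT"),   # wrong marker for IDX2: mix-up
        ("IDX2", "genomic", "GGCC"),
        ("IDX1", "genomic", "TTGA"),
    ]
    print(verified_genomic_reads(pooled, {"IDX1": "AAAT", "IDX2": "CCCG"}))
```

In the example, the genomic reads carrying index IDX2 are disregarded because the accompanying marker does not match the marker assigned to that sample.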
Determination of CNV for prenatal diagnosis
Cell-free fetal DNA and RNA circulating in maternal blood can be used for the early non-invasive prenatal diagnosis (NIPD) of an increasing number of genetic conditions, both for pregnancy management and to aid reproductive decision-making. The presence of cell-free DNA circulating in the bloodstream has been known for over 50 years. More recently, small amounts of circulating fetal DNA were discovered in the maternal bloodstream during pregnancy (Lo et al., Lancet 350). Thought to originate from dying placental cells, cell-free fetal DNA (cfDNA) has been shown to consist of short fragments typically fewer than 200 bp in length (Chan et al., Clin Chem 50). In addition to cfDNA, fragments of cell-free fetal RNA (cfRNA) can also be discerned in the maternal bloodstream, originating from genes that are transcribed in the fetus or placenta. The extraction and subsequent analysis of these fetal genetic elements from a maternal blood sample offers novel opportunities for NIPD.
The present method is a polymorphism-independent method for use in NIPD that does not require fetal cfDNA to be distinguished from maternal cfDNA in order to determine a fetal aneuploidy. In some embodiments, the aneuploidy is a complete chromosomal trisomy or monosomy, or a partial trisomy or monosomy. Partial aneuploidies are caused by the gain or loss of part of a chromosome, and encompass chromosomal imbalances resulting from unbalanced translocations, unbalanced inversions, deletions, and insertions. By far the most common known aneuploidy compatible with life is trisomy 21, i.e., Down syndrome (DS), which is caused by the presence of an extra copy of part or all of chromosome 21. Rarely, DS can be caused by an inherited or sporadic defect whereby an extra copy of all or part of chromosome 21 becomes attached to another chromosome (usually chromosome 14) to form a single aberrant chromosome. DS is associated with intellectual impairment, severe learning difficulties, and excess mortality caused by long-term health problems such as heart disease. Other aneuploidies of known clinical significance include Edwards syndrome (trisomy 18) and Patau syndrome (trisomy 13), which are frequently fatal within the first few months of life. Abnormalities associated with the number of sex chromosomes are also known and include monosomy X, e.g., Turner syndrome (XO), and triple X syndrome (XXX) in female births, and Klinefelter syndrome (XXY) and XYY syndrome in male births, all of which are associated with various phenotypes including sterility and a reduction in intellectual skills. Monosomy X [45,X] is a common cause of early pregnancy loss, accounting for about 7% of spontaneous abortions. Based on a liveborn frequency of 45,X (also called Turner syndrome) of 1-2/10,000, it is estimated that fewer than 1% of 45,X conceptuses survive to term.
Approximately 30% of patients with Turner syndrome are mosaics carrying both a 45,X cell line and either a 46,XX cell line or a cell line containing a rearranged X chromosome (Hook and Warburton, 1983). The phenotype in a liveborn individual is relatively mild considering the high embryonic lethality, and it has been hypothesized that possibly all liveborn females with Turner syndrome carry a cell line containing two sex chromosomes. Monosomy X can occur in females as 45,X or as 45,X/46,XX, and in males as 45,X/46,XY. Autosomal monosomies in humans are generally considered to be incompatible with life; however, a considerable number of cytogenetic reports describe full monosomy of one chromosome 21 in liveborn children (Vosranova et al., Molecular Cytogen. 1:13 [2008]; Joosten et al., Prenatal Diagn. 17). The methods described herein can be used for the prenatal diagnosis of these and other chromosomal abnormalities.
According to some embodiments, the methods disclosed herein can determine the presence or absence of a chromosomal trisomy of any one of chromosomes 1 through 22, X, and Y. Examples of chromosomal trisomies that can be detected according to the methods of the invention include, but are not limited to, trisomy 21 (T21; Down Syndrome), trisomy 18 (T18; Edwards Syndrome), trisomy 16 (T16), trisomy 20 (T20), trisomy 22 (T22; Cat Eye Syndrome), trisomy 15 (T15; Prader-Willi Syndrome), trisomy 13 (T13; Patau Syndrome), trisomy 8 (T8; Warkany Syndrome), trisomy 9, and the XXY (Klinefelter Syndrome), XYY, or XXX trisomies. Complete trisomies of other autosomes are lethal when present in a non-mosaic state, but can be compatible with life when present in a mosaic state. It will be appreciated that various complete trisomies, whether present in a mosaic or non-mosaic state, as well as partial trisomies, can be determined in fetal cfDNA according to the teachings provided herein.
Non-limiting examples of partial trisomies that can be determined using the methods of the invention include partial trisomy 1q32-44, trisomy 9p, trisomy 4 mosaicism, trisomy 17p, partial trisomy 4q26-qter, partial 2p trisomy, partial trisomy 1q, and/or partial trisomy 6p/monosomy 6q.
The methods disclosed herein can also be used to determine chromosomal monosomy X, chromosomal monosomy 21, and partial monosomies such as monosomy 13, monosomy 15, monosomy 16, monosomy 21, and monosomy 22, which are known to be involved in miscarriage. Partial monosomies of chromosomes typically involved in complete aneuploidy can also be determined using the methods described herein. Non-limiting examples of deletion syndromes that can be determined according to the methods of the invention include syndromes caused by partial deletions of chromosomes. Examples of partial deletions that can be determined according to the methods described herein include, but are not limited to, partial deletions of chromosomes 1, 4, 5, 7, 11, 18, 15, 13, 17, 22, and 10, which are described below.
1q21.1 deletion syndrome, or 1q21.1 (recurrent) microdeletion, is a rare aberration of chromosome 1. Alongside the deletion syndrome, there is also a 1q21.1 duplication syndrome. Whereas in the deletion syndrome a part of the DNA is missing at a particular locus, in the duplication syndrome two or three copies of a similar part of the DNA are present at the same locus. The literature refers to both the deletion and the duplication as 1q21.1 copy number variations (CNVs). The 1q21.1 deletion can be associated with TAR syndrome (thrombocytopenia with absent radius).
Wolf-Hirschhorn syndrome (WHS) (OMIM #194190) is a contiguous gene deletion syndrome associated with a hemizygous deletion of chromosome 4p16.3. Wolf-Hirschhorn syndrome is a congenital malformation syndrome characterized by pre- and postnatal growth deficiency, developmental disability of variable degree, characteristic craniofacial features ("Greek warrior helmet" appearance of the nose, high forehead, prominent glabella, hypertelorism, high-arched eyebrows, protruding eyes, epicanthal folds, short philtrum, distinct mouth with downturned corners, and micrognathia), and a seizure disorder.
A partial deletion of chromosome 5, also known as 5p- or 5p minus and named Cri du Chat syndrome (OMIM #123450), is caused by a deletion of the short arm (p arm) of chromosome 5 (5p15.3-p15.2).
Williams-Beuren Syndrome, also known as chromosome 7q11.23 deletion syndrome (OMIM 194050), is a contiguous gene deletion syndrome that results in a multisystem disorder and is caused by a hemizygous deletion of 1.5 to 1.8 Mb on chromosome 7q11.23, a region containing approximately 28 genes.
Jacobsen Syndrome, also known as 11q deletion disorder, is a rare congenital disorder resulting from a deletion of the terminal region of chromosome 11 that includes band 11q24.1. It can cause intellectual disability, a distinctive facial appearance, and a variety of physical problems, including heart defects and a bleeding disorder.
Partial monosomy of chromosome 18, known as monosomy 18p, is a rare chromosomal disorder in which all or part of the short arm (p) of chromosome 18 is deleted (monosomic). The disorder is typically characterized by short stature, variable degrees of mental retardation, speech delays, malformations of the skull and facial (craniofacial) region, and/or additional physical abnormalities. The associated craniofacial defects can vary greatly in range and severity from case to case.
Conditions caused by changes in the structure or copy number of chromosome 15 include Angelman syndrome and Prader-Willi syndrome, which involve a loss of gene activity in the same part of chromosome 15, the 15q11-q13 region. It will be appreciated that several translocations and microdeletions can be asymptomatic in the carrier parent, yet can cause a major genetic disease in the offspring. For example, a healthy mother who carries the 15q11-q13 microdeletion can give birth to a child with Angelman syndrome, a severe neurodegenerative disorder. Thus, the methods, apparatus, and systems described herein can be used to identify such a partial deletion and other deletions in the fetus.
Partial monosomy 13q is a rare chromosomal disorder that results when a piece of the long arm (q) of chromosome 13 is missing (monosomic). Infants born with partial monosomy 13q can exhibit low birth weight, malformations of the head and face (craniofacial region), skeletal abnormalities (especially of the hands and feet), and other physical abnormalities. Mental retardation is characteristic of this condition. The mortality rate during infancy is high among individuals born with this disorder. Almost all cases of partial monosomy 13q occur randomly, for no apparent reason (sporadic).
Smith-Magenis syndrome (SMS; OMIM #182290) is caused by a deletion, or loss of genetic material, on one copy of chromosome 17. This well-known syndrome is associated with developmental delay, mental retardation, congenital anomalies such as heart and kidney defects, and neurobehavioral abnormalities such as severe sleep disturbances and self-injurious behavior. In the majority of cases (90%), Smith-Magenis syndrome is caused by a 3.7-Mb interstitial deletion in chromosome 17p11.2.
22q11.2 deletion syndrome, also known as DiGeorge syndrome, is a syndrome caused by the deletion of a small piece of chromosome 22. The deletion (22q11.2) occurs near the middle of the chromosome, on the long arm of one of the pair of chromosomes. The features of this syndrome vary widely, even among members of the same family, and affect many parts of the body. Characteristic signs and symptoms can include birth defects such as congenital heart disease, defects in the palate that are most commonly related to neuromuscular problems with closure (velopharyngeal insufficiency), learning disabilities, mild differences in facial features, and recurrent infections. Microdeletions in chromosomal region 22q11.2 are associated with a 20- to 30-fold increased risk of schizophrenia.
Deletions on the short arm of chromosome 10 are associated with a DiGeorge syndrome-like phenotype. Partial monosomy of chromosome 10p is rare, but has been observed in a portion of patients showing features of DiGeorge syndrome.
In one embodiment, the methods, devices, and systems described herein are used to determine partial monosomies including, but not limited to, partial monosomy of chromosomes 1, 4, 5, 7, 11, 18, 15, 13, 17, 22, and 10. For example, the method can also be used to determine partial monosomy 1q21.11, partial monosomy 4p16.3, partial monosomy 5p15.3-p15.2, partial monosomy 7q11.23, partial monosomy 11q24.1, partial monosomy 18p, partial monosomy of chromosome 15 (15q11-q13), partial monosomy 13q, partial monosomy 17p11.2, partial monosomy of chromosome 22 (22q11.2), and partial monosomy 10p.
Other partial monosomies that can be determined according to the methods described herein include: unbalanced translocation t(8;11)(p23.2;p15.5); 11q23 microdeletion; 17p11.2 deletion; 22q13.3 deletion; Xp22.3 microdeletion; 10p14 deletion; 20p microdeletion; [del(22)(q11.2q11.23)]; 7q11.23 and 7q36 deletions; 1p36 deletion; 2p microdeletion; neurofibromatosis type 1 (17q11.2 microdeletion); Yq deletion; 4p16.3 microdeletion; 1p36.2 microdeletion; 11q14 deletion; 19q13.2 microdeletion; Rubinstein-Taybi syndrome (16p13.3 microdeletion); 7p21 microdeletion; Miller-Dieker syndrome (17p13.3); and 2q37 microdeletion. A partial deletion may be a small deletion of part of a chromosome, or it may be a microdeletion of a chromosome in which the deletion of a single gene can occur.
Several duplication syndromes caused by the duplication of part of a chromosome arm have been identified (see OMIM [Online Mendelian Inheritance in Man], viewed online at NCBI). In one embodiment, the present method can be used to determine the presence or absence of duplication and/or amplification of segments of any one of chromosomes 1-22, X, and Y. Non-limiting examples of duplication syndromes that can be determined according to the present method include duplications of part of chromosomes 8, 15, 12, and 17, which are described below.
The 8p23.1 duplication syndrome is a rare genetic disorder caused by duplication of a region of human chromosome 8. This duplication syndrome has an estimated prevalence of 1 in 64,000 births and is the reciprocal of the 8p23.1 deletion syndrome. The 8p23.1 duplication is associated with a variable phenotype including one or more of speech delay, developmental delay, mild dysmorphism with prominent forehead and arched eyebrows, and congenital heart disease (CHD).
Chromosome 15q duplication syndrome (Dup15q) is a clinically identifiable syndrome that results from duplication of chromosome 15q11-13.1. Infants with Dup15q usually have hypotonia (poor muscle tone) and growth retardation; they may be born with a cleft lip and/or palate or malformations of the heart, kidneys, or other organs; and they show some degree of cognitive delay/disability (mental retardation), speech and language delays, and sensory processing disorders.
Pallister-Killian syndrome is the result of extra chromosome 12 material. There is usually a mixture of cells (mosaicism), some with extra chromosome 12 material and some that are normal (46 chromosomes without the extra chromosome 12 material). Infants with this syndrome have many problems, including severe mental retardation, poor muscle tone, "coarse" facial features, and a prominent forehead. They tend to have a very thin upper lip with a thicker lower lip and a short nose. Other health problems include epilepsy, poor feeding, stiff joints, cataracts in adulthood, hearing loss, and heart defects. Persons with Pallister-Killian syndrome have a shortened life span.
Individuals with the genetic condition designated dup(17)(p11.2p11.2), or dup 17p, carry extra genetic information (known as a duplication) on the short arm of chromosome 17. Duplication of chromosome 17p11.2 underlies Potocki-Lupski syndrome (PTLS), a newly recognized genetic condition with only a few dozen cases reported in the medical literature. Patients who have this duplication often have low muscle tone, poor feeding, and failure to thrive during infancy, and also present with delayed development of motor and verbal milestones. Many individuals with PTLS have difficulty with articulation and language processing. In addition, patients may have behavioral characteristics similar to those seen in persons with autism or autism spectrum disorders. Individuals with PTLS may have heart defects and sleep apnea. A duplication of a larger region in chromosome 17p12 that includes the gene PMP22 is known to cause Charcot-Marie-Tooth disease.
CNV has been associated with stillbirth. However, owing to the inherent limitations of conventional cytogenetics, the contribution of CNV to stillbirth is thought to be underrepresented (Harris et al., Prenatal Diagn 31:932-944 [2011]). As shown in the examples and described elsewhere herein, the present method is capable of determining the presence of partial aneuploidies, such as deletions and amplifications of chromosome segments, and can be used to identify and determine the presence or absence of CNV associated with stillbirth.
Determining complete fetal chromosomal aneuploidy
In one embodiment, a method is provided for determining the presence or absence of any one or more different complete fetal chromosomal aneuploidies in a maternal test sample comprising fetal and maternal nucleic acid molecules. Preferably, the method determines the presence or absence of any four or more different complete fetal chromosomal aneuploidies. The method comprises: (a) obtaining sequence information for the fetal and maternal nucleic acids in the maternal test sample; and (b) using the sequence information to identify a number of sequence tags for each of any one or more chromosomes of interest selected from chromosomes 1-22, X, and Y, and to identify a number of sequence tags for a normalizing chromosome sequence for each of said any one or more chromosomes of interest. The normalizing chromosome sequence may be a single chromosome, or it may be a group of chromosomes selected from chromosomes 1-22, X, and Y. The method further comprises: (c) using the number of sequence tags identified for each of said any one or more chromosomes of interest and the number of sequence tags identified for each normalizing chromosome sequence to calculate a single chromosome dose for each of said any one or more chromosomes of interest; and (d) comparing each single chromosome dose for each of said any one or more chromosomes of interest to a threshold value for each of said any one or more chromosomes of interest, thereby determining the presence or absence of any one or more different complete fetal chromosomal aneuploidies in the maternal test sample.
In some embodiments, step (c) comprises calculating a single chromosome dose for each of said chromosomes of interest as the ratio of the number of sequence tags identified for each of said chromosomes of interest to the number of sequence tags identified for said normalizing chromosome sequence for each of said chromosomes of interest.
In other embodiments, step (c) comprises calculating a single chromosome dose for each of said chromosomes of interest as the ratio of the number of sequence tags identified for each of said chromosomes of interest to the number of sequence tags identified for the normalizing chromosome for each of said chromosomes of interest. In still other embodiments, step (c) comprises calculating a sequence tag density ratio for a chromosome of interest by relating the number of sequence tags obtained for the chromosome of interest to the length of the chromosome of interest, and relating the number of tags for the corresponding normalizing chromosome sequence to the length of the normalizing chromosome sequence, and calculating a chromosome dose for the chromosome of interest as the ratio of the sequence tag density of the chromosome of interest to the sequence tag density of the normalizing chromosome sequence. This calculation is repeated for each of the chromosomes of interest. Steps (a)-(d) may be repeated for test samples from different maternal subjects.
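The two ways of computing a chromosome dose in step (c) can be sketched as follows. This is a minimal illustration only: the function names, the dictionary layout, and the toy counts and lengths are assumptions made for the example and do not come from the specification.

```python
from typing import Dict, Iterable


def chromosome_dose(tag_counts: Dict[str, int], chrom: str,
                    normalizing: Iterable[str]) -> float:
    """Dose as a ratio of tag counts: chromosome of interest over the
    normalizing chromosome sequence (one chromosome or a group)."""
    norm_tags = sum(tag_counts[c] for c in normalizing)
    if norm_tags == 0:
        raise ValueError("normalizing sequence has no mapped tags")
    return tag_counts[chrom] / norm_tags


def chromosome_dose_by_density(tag_counts: Dict[str, int],
                               lengths: Dict[str, int], chrom: str,
                               normalizing: Iterable[str]) -> float:
    """Dose as a ratio of tag densities (tags per unit of sequence length)."""
    normalizing = list(normalizing)
    norm_tags = sum(tag_counts[c] for c in normalizing)
    norm_length = sum(lengths[c] for c in normalizing)
    if norm_tags == 0:
        raise ValueError("normalizing sequence has no mapped tags")
    density_interest = tag_counts[chrom] / lengths[chrom]
    density_norm = norm_tags / norm_length
    return density_interest / density_norm


# Toy numbers (hypothetical, not real sequencing data).
counts = {"chr21": 100, "chr9": 400, "chr1": 600}
lengths = {"chr21": 50, "chr9": 100, "chr1": 250}
print(chromosome_dose(counts, "chr21", ["chr9"]))
print(chromosome_dose(counts, "chr21", ["chr9", "chr1"]))
print(chromosome_dose_by_density(counts, lengths, "chr21", ["chr9"]))
```

Passing a single chromosome or a list of chromosomes as `normalizing` covers both the single normalizing chromosome and the group of chromosomes mentioned in the text.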
An example of this embodiment, in which four or more complete fetal chromosomal aneuploidies are determined in a maternal test sample comprising a mixture of fetal and maternal cell-free DNA molecules, comprises: (a) sequencing at least a portion of the cell-free DNA molecules to obtain sequence information for the fetal and maternal cell-free DNA molecules in the test sample; (b) using the sequence information to identify a number of sequence tags for each of any twenty or more chromosomes of interest selected from chromosomes 1-22, X, and Y, and to identify a number of sequence tags for a normalizing chromosome for each of said twenty or more chromosomes of interest; (c) using the number of sequence tags identified for each of said twenty or more chromosomes of interest and the number of sequence tags identified for each normalizing chromosome to calculate a single chromosome dose for each of said twenty or more chromosomes of interest; and (d) comparing each single chromosome dose for each of said twenty or more chromosomes of interest to a threshold value for each of said twenty or more chromosomes of interest, and thereby determining the presence or absence of any twenty or more different complete fetal chromosomal aneuploidies in the test sample.
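Steps (b) through (d) for several chromosomes at once might be sketched as below. The per-chromosome threshold is modeled here as a hypothetical (lower, upper) pair, which is an assumption of this sketch; the text above only states that each dose is compared to a threshold value for that chromosome. The function name, the data layout, and the toy numbers are likewise illustrative.

```python
from typing import Dict, Tuple


def call_complete_aneuploidies(
    tag_counts: Dict[str, int],
    normalizing_chrom: Dict[str, str],
    thresholds: Dict[str, Tuple[float, float]],
) -> Dict[str, str]:
    """Compute a dose per chromosome of interest and compare it with that
    chromosome's own (lower, upper) threshold pair (hypothetical model)."""
    calls = {}
    for chrom, (low, high) in thresholds.items():
        # Step (c): dose relative to this chromosome's normalizing chromosome.
        dose = tag_counts[chrom] / tag_counts[normalizing_chrom[chrom]]
        # Step (d): per-chromosome threshold comparison.
        if dose > high:
            calls[chrom] = "gain"
        elif dose < low:
            calls[chrom] = "loss"
        else:
            calls[chrom] = "unaffected"
    return calls


# Toy inputs (hypothetical counts and thresholds).
counts = {"chr21": 130, "chr18": 200, "chr9": 1000, "chr8": 1000}
normalizers = {"chr21": "chr9", "chr18": "chr8"}
limits = {"chr21": (0.09, 0.11), "chr18": (0.19, 0.21)}
print(call_complete_aneuploidies(counts, normalizers, limits))
```

Keeping a separate normalizing chromosome and threshold per chromosome of interest mirrors the wording of steps (b) and (d), where both are defined "for each" chromosome of interest.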
In another embodiment, the method for determining the presence or absence of any one or more different complete fetal chromosomal aneuploidies in a maternal test sample, as described above, uses a normalizing segment sequence for determining the dose of a chromosome of interest. In this case, the method comprises: (a) obtaining sequence information for the fetal and maternal nucleic acids in the sample; and (b) using the sequence information to identify a number of sequence tags for each of any one or more chromosomes of interest selected from chromosomes 1-22, X, and Y, and to identify a number of sequence tags for a normalizing segment sequence for each of said any one or more chromosomes of interest. The normalizing segment sequence may be a single segment of a chromosome, or it may be a group of segments from one or more different chromosomes. The method further comprises: (c) using the number of sequence tags identified for each of said any one or more chromosomes of interest and the number of sequence tags identified for the normalizing segment sequence to calculate a single chromosome dose for each of said any one or more chromosomes of interest; and (d) comparing each single chromosome dose for each of said any one or more chromosomes of interest to a threshold value for each of said one or more chromosomes of interest, and thereby determining the presence or absence of one or more different complete fetal chromosomal aneuploidies in the sample.
In some embodiments, step (c) comprises calculating a single chromosome dose for each of said chromosomes of interest as the ratio of the number of sequence tags identified for each of said chromosomes of interest to the number of sequence tags identified for the normalizing segment sequence for each of said chromosomes of interest.
In other embodiments, step (c) comprises calculating a sequence tag density ratio for a chromosome of interest by relating the number of sequence tags obtained for the chromosome of interest to the length of the chromosome of interest, and relating the number of tags for the corresponding normalizing segment sequence to the length of the normalizing segment sequence, and calculating a chromosome dose for the chromosome of interest as the ratio of the sequence tag density of the chromosome of interest to the sequence tag density of the normalizing segment sequence. This calculation is repeated for each of the chromosomes of interest. Steps (a)-(d) may be repeated for test samples from different maternal subjects.
The determination of a normalized chromosome value (NCV), which relates the chromosome dose in a test sample to the mean of the corresponding chromosome dose in a set of qualifying samples, provides a means for comparing chromosome doses across different sample sets. The NCV is calculated as:

$$NCV_{ij} = \frac{x_{ij} - \hat{\mu}_j}{\hat{\sigma}_j}$$

where $\hat{\mu}_j$ and $\hat{\sigma}_j$ are the estimated mean and standard deviation, respectively, for the j-th chromosome dose in a set of qualifying samples, and $x_{ij}$ is the observed j-th chromosome dose for test sample i.
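As a minimal sketch, the NCV can be computed as a z-score of the test dose against the doses observed in the qualifying samples. The function name is hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption; the text says only that the mean and standard deviation are estimated from the qualifying set.

```python
from statistics import mean, stdev
from typing import Sequence


def normalized_chromosome_value(test_dose: float,
                                qualifying_doses: Sequence[float]) -> float:
    """NCV = (x_ij - mean_j) / sd_j for one chromosome j and test sample i."""
    mu = mean(qualifying_doses)
    # Assumption: sample standard deviation; needs at least two doses.
    sigma = stdev(qualifying_doses)
    if sigma == 0:
        raise ValueError("qualifying doses have zero variance")
    return (test_dose - mu) / sigma


# Toy doses (hypothetical values chosen for easy arithmetic).
print(normalized_chromosome_value(5.0, [1.0, 2.0, 3.0]))
```

Because the NCV is expressed in standard-deviation units, values for different chromosomes and different sample sets share a common scale.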
In some embodiments, the presence or absence of at least one complete fetal chromosomal aneuploidy is determined. In other embodiments, the presence or absence of at least two, at least three, at least four, at least five, at least six, at least seven, at least eight, at least nine, at least ten, at least eleven, at least twelve, at least thirteen, at least fourteen, at least fifteen, at least sixteen, at least seventeen, at least eighteen, at least nineteen, at least twenty, at least twenty-one, at least twenty-two, at least twenty-three, or twenty-four complete fetal chromosomal aneuploidies is determined in a sample, wherein twenty-two of the complete fetal chromosomal aneuploidies correspond to complete chromosomal aneuploidies of any one or more of the autosomes, and the twenty-third and twenty-fourth correspond to complete fetal chromosomal aneuploidies of chromosomes X and Y. Because aneuploidies of the sex chromosomes can include tetrasomies, pentasomies, and other polysomies, the number of different complete chromosomal aneuploidies that can be determined according to the present method may be at least 24, at least 25, at least 26, at least 27, at least 28, at least 29, or at least 30. Thus, the number of different complete chromosomal aneuploidies determined is related to the number of chromosomes of interest selected for analysis.
In one embodiment, determining the presence or absence of any one or more different complete fetal chromosomal aneuploidies in a maternal test sample as described above uses a normalizing segment sequence for one chromosome of interest selected from chromosomes 1-22, X, and Y. In other embodiments, two or more chromosomes of interest are selected from any two or more of chromosomes 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, or Y. In one embodiment, the any one or more chromosomes of interest selected from chromosomes 1-22, X, and Y comprise at least twenty chromosomes selected from chromosomes 1-22, X, and Y, and the presence or absence of at least twenty different complete fetal chromosomal aneuploidies is determined. In other embodiments, the any one or more chromosomes of interest selected from chromosomes 1-22, X, and Y are all of chromosomes 1-22, X, and Y, and the presence or absence of complete fetal chromosomal aneuploidies of all of chromosomes 1-22, X, and Y is determined. The different complete fetal chromosomal aneuploidies that can be determined are complete chromosomal trisomies, complete chromosomal monosomies, and complete chromosomal polysomies. Examples of complete fetal chromosomal aneuploidies include, but are not limited to: trisomies of any one or more of the autosomes, such as trisomy 2, trisomy 8, trisomy 9, trisomy 20, trisomy 21, trisomy 13, trisomy 16, trisomy 18, and trisomy 22; trisomies of the sex chromosomes, such as 47,XXY, 47,XXX, and 47,XYY; tetrasomies of the sex chromosomes, such as 48,XXYY, 48,XXXY, 48,XXXX, and 48,XYYY; pentasomies of the sex chromosomes, such as 49,XXXYY, 49,XXXXY, 49,XXXXX, and 49,XYYYY; and monosomy X. Other complete fetal chromosomal aneuploidies that can be determined according to the present method are described below.
Determining partial fetal chromosomal aneuploidy
In another embodiment, a method is provided for determining the presence or absence of any one or more different partial fetal chromosomal aneuploidies in a maternal test sample comprising fetal and maternal nucleic acid molecules. The method comprises: (a) obtaining sequence information for the fetal and maternal nucleic acids in the sample; and (b) using the sequence information to identify a number of sequence tags for each of any one or more segments of any one or more chromosomes of interest selected from chromosomes 1-22, X, and Y, and to identify a number of sequence tags for a normalizing segment sequence for each of said any one or more segments of any one or more chromosomes of interest. The normalizing segment sequence may be a single segment of a chromosome, or it may be a group of segments from one or more different chromosomes. The method further comprises: (c) using the number of sequence tags identified for each of said any one or more segments of any one or more chromosomes of interest and the number of sequence tags identified for each normalizing segment sequence to calculate a single segment dose for each of said any one or more segments of any one or more chromosomes of interest; and (d) comparing each single segment dose for each of said any one or more segments of any one or more chromosomes of interest to a threshold value for each of said any one or more chromosome segments of any one or more chromosomes of interest, and thereby determining the presence or absence of one or more different partial fetal chromosomal aneuploidies in the sample.
In some embodiments, step (c) comprises calculating a single segment dose for each of any one or more segments of any one or more chromosomes of interest as the ratio of the number of sequence tags identified for each of said segments to the number of sequence tags identified for the normalizing segment sequence for each of said segments.
In other embodiments, step (c) comprises calculating a sequence tag density ratio for a segment of interest by relating the number of sequence tags obtained for the segment of interest to the length of the segment of interest, and relating the number of tags for the corresponding normalizing segment sequence to the length of the normalizing segment sequence, and calculating a segment dose for the segment of interest as the ratio of the sequence tag density of the segment of interest to the sequence tag density of the normalizing segment sequence. This calculation is repeated for each of the segments of interest. Steps (a)-(d) may be repeated for test samples from different maternal subjects.
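A density-based segment dose can be sketched by counting the mapped tags that fall inside each segment and dividing by segment length. The tag and segment representations, the half-open coordinate convention, and all names below are assumptions made for illustration.

```python
from typing import Iterable, List, Tuple

Tag = Tuple[str, int]           # (chromosome, mapped position)
Segment = Tuple[str, int, int]  # (chromosome, start, end), end exclusive


def count_tags(tags: Iterable[Tag], segment: Segment) -> int:
    """Number of tags whose position lies within the segment."""
    chrom, start, end = segment
    return sum(1 for c, pos in tags if c == chrom and start <= pos < end)


def segment_dose(tags: List[Tag], segment_of_interest: Segment,
                 normalizing_segments: List[Segment]) -> float:
    """Ratio of tag density in the segment of interest to tag density in
    the normalizing segment sequence (one segment or a group)."""
    _, start, end = segment_of_interest
    density_interest = count_tags(tags, segment_of_interest) / (end - start)
    norm_tags = sum(count_tags(tags, s) for s in normalizing_segments)
    norm_length = sum(e - s for _, s, e in normalizing_segments)
    if norm_tags == 0:
        raise ValueError("normalizing segments contain no tags")
    return density_interest / (norm_tags / norm_length)


# Toy tags (hypothetical positions).
tags = [("chr22", 5), ("chr22", 15), ("chr22", 25),
        ("chr1", 10), ("chr1", 20), ("chr1", 30), ("chr1", 40)]
print(segment_dose(tags, ("chr22", 0, 20), [("chr1", 0, 50)]))
```

Accepting a list of normalizing segments allows the normalizing segment sequence to be a single segment or a group of segments from different chromosomes, as the text describes.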
The determination of a normalized segment value (NSV), which relates the segment dose in a test sample to the mean of the corresponding segment dose in a set of qualifying samples, provides a means for comparing segment doses across different sample sets. The NSV is calculated as:

$$NSV_{ij} = \frac{x_{ij} - \hat{\mu}_j}{\hat{\sigma}_j}$$

where $\hat{\mu}_j$ and $\hat{\sigma}_j$ are the estimated mean and standard deviation, respectively, for the j-th segment dose in a set of qualifying samples, and $x_{ij}$ is the observed j-th segment dose for test sample i.
In some embodiments, the presence or absence of one partial fetal chromosomal aneuploidy is determined. In other embodiments, the presence or absence of two, three, four, five, six, seven, eight, nine, ten, fifteen, twenty, twenty-five, or more partial fetal chromosomal aneuploidies is determined in a sample. In one embodiment, one segment of interest is selected from any one of chromosomes 1-22, X, and Y. In another embodiment, two or more segments of interest are selected from any two or more of chromosomes 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, or Y. In one embodiment, the any one or more segments of interest selected from chromosomes 1-22, X, and Y comprise at least one, five, ten, 15, 20, 25, or more segments selected from chromosomes 1-22, X, and Y, and the presence or absence of at least one, five, ten, 15, 20, or 25 different partial fetal chromosomal aneuploidies is determined. The different partial fetal chromosomal aneuploidies that can be determined include partial duplications, partial multiplications, partial insertions, and partial deletions. Examples of partial fetal chromosomal aneuploidies include partial monosomies and partial trisomies of the autosomes. Partial monosomies of the autosomes include partial monosomy of chromosome 1, partial monosomy of chromosome 4, partial monosomy of chromosome 5, partial monosomy of chromosome 7, partial monosomy of chromosome 11, partial monosomy of chromosome 15, partial monosomy of chromosome 17, partial monosomy of chromosome 18, and partial monosomy of chromosome 22. Other partial fetal chromosomal aneuploidies that can be determined according to the present method are described below.
In any of the above embodiments, the test sample is a maternal sample selected from blood, plasma, serum, urine, and saliva samples. In some embodiments, the maternal test sample is a plasma sample. The nucleic acid molecules of the maternal sample are a mixture of fetal and maternal cell-free DNA molecules. Sequencing of the nucleic acids can be performed using next generation sequencing (NGS) as described elsewhere in this application. In some embodiments, the sequencing is massively parallel sequencing using sequencing-by-synthesis with reversible dye terminators. In other embodiments, the sequencing is sequencing-by-ligation. In still other embodiments, the sequencing is single molecule sequencing. Optionally, an amplification step is performed prior to sequencing.
Determination of CNV of clinical conditions
In addition to the early detection of birth defects, the methods described herein can be used to detect any abnormality in the representation of genetic sequences within the genome. A number of abnormalities in the representation of genetic sequences within the genome have been associated with various pathologies. Such pathologies include, but are not limited to, cancer, infectious and autoimmune diseases, diseases of the nervous system, and metabolic and/or cardiovascular diseases.
Accordingly, various embodiments contemplate the use of the methods described herein in the diagnosis, monitoring, and/or treatment of such pathologies. For example, the methods can be used to determine the presence or absence of a disease, to monitor the progression of a disease and/or the efficacy of a treatment regimen, to determine the presence or absence of nucleic acids of a pathogen (e.g., a virus), to determine chromosomal abnormalities associated with graft versus host disease (GVHD), and to determine the contribution of individuals in forensic analysis.
CNV of cancer
It has been demonstrated that plasma and serum DNA from cancer patients contains measurable amounts of tumor DNA, which can be recovered and used as a surrogate source of tumor DNA, and that tumors are characterized by aneuploidy, or inappropriate numbers of gene sequences or even intact chromosomes. Determining the difference in the amount of a given sequence (i.e., the sequence of interest) in a sample from an individual can therefore be used for prognosis and diagnosis of a medical condition. In some embodiments, the methods can be used to determine the presence or absence of a chromosomal aneuploidy in a patient suspected or known to have cancer.
In certain embodiments, the aneuploidy is characteristic of the genome of the subject and results in a generally increased predisposition to cancer. In certain embodiments, the aneuploidy is characteristic of particular cells (e.g., tumor cells, proto-tumor neoplastic cells, etc.) that are neoplastic or that have an increased predisposition to neoplasia. Particular aneuploidies are associated with particular cancers or predispositions to particular cancers, as described below.
Accordingly, various embodiments of the methods described herein provide for the determination of copy number variation of a sequence of interest (e.g., a clinically relevant sequence) in a test sample from a subject, wherein certain variations in copy number provide an indication of the presence of cancer and/or susceptibility to cancer. In certain embodiments, the sample comprises a mixture of nucleic acids derived from two or more cells. In one embodiment, the nucleic acid mixture is derived from normal cells and cancer cells, the cancer cells being derived from a subject suffering from a medical condition (e.g., cancer).
The development of cancer is often accompanied by an alteration in the number of whole chromosomes, i.e., complete chromosomal aneuploidy, and/or an alteration in the number of chromosome segments, i.e., partial aneuploidy, caused by a process known as chromosomal instability (CIN) (Thoma et al., Swiss Med Weekly 2011). It is believed that many solid tumors, such as breast cancer, progress from initiation to metastasis through the accumulation of several genetic aberrations (Sato et al., Cancer Res. 50; Jongsma et al., J Clin Pathol: Mol Path 55:305-309 [2002]). Such genetic aberrations, as they accumulate, may confer proliferative advantages, genetic instability and the attendant ability to evolve drug resistance rapidly, and enhanced angiogenesis, proteolysis, and metastasis. The genetic aberrations may affect either recessive "tumor suppressor genes" or dominantly acting oncogenes. Deletions and recombination leading to loss of heterozygosity (LOH) are believed to play a major role in tumor progression by uncovering mutated tumor suppressor alleles.
cfDNA has been found in the circulation of patients diagnosed with malignancies including, but not limited to, lung cancer (Pascal et al., Clinical Medicine 52). Identification of the genomic instability associated with cancers, which can be determined from the circulating cfDNA of cancer patients, is a potential diagnostic and prognostic tool. In one embodiment, the methods described herein are used to determine the CNV of one or more sequences of interest in a sample, e.g., a sample comprising a mixture of nucleic acids derived from a subject suspected of having or known to have cancer, such as a carcinoma, sarcoma, lymphoma, leukemia, germ cell tumor, or blastoma. In one embodiment, the sample is a plasma sample derived (processed) from peripheral blood, which may comprise a mixture of cfDNA derived from normal cells and cancer cells. In another embodiment, the biological sample for which the presence of CNV is to be determined is derived from cells of other biological tissues which, if cancer is present, comprise a mixture of cancerous and non-cancerous cells. Such other biological tissues include, but are not limited to, biological fluids such as serum, sweat, tears, sputum, urine, ear flow, lymph, saliva, cerebrospinal fluid, lavages, bone marrow suspension, vaginal flow, transcervical lavage, brain fluid, ascites, milk, secretions of the respiratory, intestinal, and genitourinary tracts, and leukapheresis samples, as well as tissue biopsies, swabs, and smears. In other embodiments, the biological sample is a stool (fecal) sample.
The methods described herein are not limited to analysis of cfDNA. It will be appreciated that similar analysis can be performed on a sample of cellular DNA.
In various embodiments, the sequence of interest comprises a nucleic acid sequence known or suspected to play a role in cancer development and/or progression. Examples of sequences of interest include nucleic acid sequences, such as complete chromosomes and/or chromosome segments, that are amplified or deleted in cancer cells as described below.
Total CNV number and cancer risk.
Common cancer SNPs, and by analogy common cancer CNVs, may each confer only a minor increase in disease risk. Collectively, however, they may cause a substantially elevated risk of cancer. In this regard, it is noted that germline gains and losses of large DNA segments have been reported as factors predisposing individuals to neuroblastoma, prostate and colorectal cancer, breast cancer, and BRCA1-associated ovarian cancer (see, e.g., Krepischi et al., Breast Cancer Res. 14). It is also noted that CNVs frequently found in the healthy population (common CNVs) are believed to play a role in cancer etiology (see, e.g., Shlien and Malkin (2009) Genome Medicine, 1(6):62). In one study testing the hypothesis that common CNVs are associated with malignancy (Shlien et al., Proc Natl Acad Sci USA 2008, 105:11264-11269), a map was created of every known CNV whose locus coincides with that of a bona fide cancer-related gene (as catalogued by Higgins et al., Nucleic Acids Res 2007, 35:D721-726). These were termed "cancer CNVs". In an initial analysis (Shlien et al., Proc Natl Acad Sci USA 2008, 105:11264-11269), 770 healthy genomes were evaluated using the Affymetrix 500K array set, which has an average inter-probe distance of 5.8 kb. Because CNVs are generally thought to be depleted in gene regions (Redon et al., Nature 2006, 444:444-454), it was surprising to find 49 cancer genes that were directly encompassed or overlapped by a CNV in more than one person in a large reference population. In the top ten genes, cancer CNVs could be found in four or more people.
It is therefore believed that CNV frequency can be used as a measure of cancer risk (see, e.g., U.S. Patent Publication No. 2010/0261183 A1). The CNV frequency can be determined simply from the constitutional genome of the organism, or it can represent a fraction derived from one or more tumors (neoplastic cells) if such are present.
In certain embodiments, the number of CNVs in a test sample (e.g., a sample comprising constitutional (germline) nucleic acids) or in a mixture of nucleic acids (e.g., germline nucleic acids and nucleic acids derived from neoplastic cells) is determined using the methods described herein for copy number variation. Identification of an increased number of CNVs in the test sample, e.g., in comparison with a reference value, is indicative of a risk of or predisposition for cancer in the subject. It will be appreciated that the reference value may vary with a given population. It will also be appreciated that the absolute value of the increase in CNV frequency will vary depending on the resolution of the method used to determine CNV frequency and on other parameters. Typically, an increase in CNV frequency of at least about 1.2-fold over the reference value is determined to be indicative of cancer risk (see, e.g., U.S. Patent Publication No. 2010/0261183 A1); for example, an increase in CNV frequency of at least or about 1.5-fold or greater, such as 2- to 4-fold over the reference value, is an indicator of an increased risk of cancer (e.g., as compared with a normal healthy reference population).
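The fold-increase comparison can be illustrated with a short sketch. The default cutoff of 1.2 is taken from the text above; the function names and the example counts are hypothetical, and in practice the reference value would depend on the population and on the resolution of the CNV detection method.

```python
def cnv_fold_increase(cnv_count: int, reference_count: float) -> float:
    """Fold change of the sample's CNV count over the reference value."""
    if reference_count <= 0:
        raise ValueError("reference CNV count must be positive")
    return cnv_count / reference_count


def indicates_cancer_risk(cnv_count: int, reference_count: float,
                          min_fold: float = 1.2) -> bool:
    """True when the CNV frequency is at least min_fold times the reference."""
    return cnv_fold_increase(cnv_count, reference_count) >= min_fold


# Toy counts (hypothetical).
print(indicates_cancer_risk(30, 20))  # 1.5-fold over the reference
print(indicates_cancer_risk(22, 20))  # 1.1-fold over the reference
```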
It is also believed that a determination of structural variation in the genome of a mammal, as compared with a reference value, is indicative of cancer risk. In this context, in one embodiment, the term "structural variation" may be defined as the CNV frequency in a mammal multiplied by the average CNV size (in bp) in that mammal. Thus, a high structural variation score will result from an increased CNV frequency and/or from the occurrence of large genomic nucleic acid deletions or duplications. Accordingly, in certain embodiments, the number of CNVs in a test sample (e.g., a sample comprising constitutional (germline) nucleic acids) is determined using the methods described herein to determine the size and number of copy number variations. In certain embodiments, a total structural variation score within genomic DNA of greater than about 1 megabase, or greater than about 1.1 megabases, or greater than about 1.2 megabases, or greater than about 1.3 megabases, or greater than about 1.4 megabases, or greater than about 1.5 megabases, or greater than about 1.8 megabases, or greater than about 2 megabases of DNA is indicative of cancer risk.
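The structural variation score defined above (CNV frequency multiplied by mean CNV size) can be sketched as follows. The function names, the input format (a list of CNV sizes in base pairs), and the example sizes are assumptions; the 1-megabase default is the lowest cutoff listed in the text.

```python
from typing import Sequence


def structural_variation_score(cnv_sizes_bp: Sequence[int]) -> float:
    """CNV frequency (number of CNVs) times the mean CNV size in bp."""
    if not cnv_sizes_bp:
        return 0.0
    frequency = len(cnv_sizes_bp)
    mean_size = sum(cnv_sizes_bp) / frequency
    return frequency * mean_size


def exceeds_score_cutoff(cnv_sizes_bp: Sequence[int],
                         cutoff_bp: float = 1_000_000) -> bool:
    """True when the score is greater than the chosen cutoff."""
    return structural_variation_score(cnv_sizes_bp) > cutoff_bp


# Toy CNV sizes in bp (hypothetical).
sizes = [200_000, 500_000, 600_000]
print(structural_variation_score(sizes))
print(exceeds_score_cutoff(sizes))
```

Under this definition the score equals the total number of base pairs covered by the CNV calls, which is why either more CNVs or larger CNVs raise it.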
These methods are believed to provide a measure of the risk of any cancer, including, but not limited to, acute and chronic leukemias, lymphomas, many solid tumors of mesenchymal or epithelial tissue, brain, breast, liver, stomach, colon, B-cell lymphoma, lung, bronchial, colorectal, prostate, breast, pancreatic, stomach, ovarian, bladder, brain or central nervous system, peripheral nervous system, esophageal, cervical, melanoma, uterine or endometrial, oral or pharyngeal, liver, kidney, biliary, small intestine or intestinal, salivary gland, thyroid, adrenal, osteosarcoma, chondrosarcoma, liposarcoma, testicular, and malignant fibrous histiocytoma, among others.
Complete chromosomal aneuploidy.
As noted above, there is a high frequency of aneuploidy in cancer. In certain studies examining the prevalence of somatic copy number alterations (SCNAs) in cancer, it has been found that whole-arm or whole-chromosome SCNAs (aneuploidy) affect a quarter of the genome of a typical cancer cell (see, e.g., Beroukhim et al., Nature 463: 899-905 [2010]). Whole-chromosome alterations are recurrently observed in several cancer types. For example, gain of chromosome 8 is seen in 10% to 20% of cases of acute myeloid leukemia (AML), as well as in certain solid tumors, including Ewing's sarcoma and fibroids (see, e.g., Barnard et al., Leukemia 10).
Table 1: Illustrative recurrent gains and losses of specific chromosomes in human cancers (see, e.g., Gordon et al. (2012), Nature Rev. Genetics, 13).
In various embodiments, the methods described herein can be used to detect and/or quantify whole chromosome aneuploidies associated with cancer in general and/or with a particular cancer. Thus, for example, in certain embodiments, it is contemplated to detect and/or quantify whole chromosome aneuploidies characterized by gains or losses as shown in table 1.
Arm level chromosomal segment copy number variation.
Several studies have reported patterns of arm-level copy number variation across large numbers of cancer specimens (Lin et al., Cancer Res 68: 664-673 (2008); George et al., PLoS ONE 2: e255 (2007); Demichelis et al., Genes Chromosomes Cancer 48: 366-380 (2009); Beroukhim et al., Nature 463(7283): 899-905 [2010]). It has also been observed that the frequency of arm-level copy number variation decreases with the length of the chromosome arm. Adjusted for this trend, most chromosome arms exhibit strong evidence of preferential gain or loss, but rarely both, across multiple cancer lineages (see, e.g., Beroukhim et al., Nature 463(7283): 899-905 [2010]).
Thus, in one embodiment, the methods described herein are used to determine arm-level CNVs (CNVs comprising one chromosomal arm or substantially one chromosomal arm) in a sample. The CNVs can be determined in a test sample comprising constitutive (germline) nucleic acids, and the arm-level CNVs can be identified in those constitutive nucleic acids. In certain embodiments, arm-level CNVs (if present) are identified in a sample comprising a mixture of nucleic acids (e.g., nucleic acids derived from normal cells and nucleic acids derived from neoplastic cells). In certain embodiments, the sample is derived from a subject suspected of or known to have cancer (e.g., carcinoma, sarcoma, lymphoma, leukemia, germ cell tumor, blastoma, and the like). In one embodiment, the sample is a plasma sample derived (processed) from peripheral blood, which may comprise a mixture of cfDNA derived from normal and cancer cells. In another embodiment, the biological sample used to determine the presence of a CNV is derived from cells that, if cancer is present, comprise a mixture of cancerous and non-cancerous cells from other biological tissues, including, but not limited to, biological fluids such as serum, sweat, tears, sputum, urine, ear exudates, lymph, saliva, cerebrospinal fluid, lavages, bone marrow suspensions, vaginal fluids, transcervical lavage, brain fluids, ascites, milk, and secretions of the respiratory, intestinal, and genitourinary tracts, as well as leukapheresis samples, or from tissue biopsies, swabs, or smears. In other embodiments, the biological sample is a stool (fecal) sample.
In various embodiments, CNVs identified as indicative of the presence of cancer or an increased risk of cancer include, but are not limited to, the arm-level CNVs listed in Table 2. As illustrated in Table 2, certain CNVs comprising a substantial arm-level gain indicate the presence of cancer or an increased risk of certain cancers. Thus, for example, a gain in 1q indicates the presence or increased risk of acute lymphoblastic leukemia (ALL), breast cancer, GIST, HCC, lung NSC, medulloblastoma, melanoma, MPD, ovarian cancer, and/or prostate cancer. A gain in 3q indicates the presence or increased risk of esophageal squamous cell carcinoma, lung SC, and/or MPD. A gain in 7q indicates the presence or increased risk of colorectal cancer, glioma, HCC, lung NSC, medulloblastoma, melanoma, prostate cancer, and/or renal cancer. A gain in 7p indicates the presence or increased risk of breast cancer, colorectal cancer, esophageal adenocarcinoma, glioma, HCC, lung NSC, medulloblastoma, melanoma, and/or renal cancer. A gain in 20q indicates the presence or increased risk of breast cancer, colorectal cancer, dedifferentiated liposarcoma, esophageal adenocarcinoma, esophageal squamous carcinoma, glioma, HCC, lung NSC, melanoma, ovarian cancer, and/or renal cancer, and so forth.
Similarly, as illustrated in Table 2, certain CNVs comprising a substantial arm-level loss indicate the presence and/or increased risk of certain cancers. Thus, for example, a loss in 1p indicates the presence or increased risk of gastrointestinal stromal tumor. A loss in 4q indicates the presence or increased risk of colorectal cancer, esophageal adenocarcinoma, lung SC, melanoma, ovarian cancer, and/or renal cancer. A loss in 17p indicates the presence or increased risk of breast cancer, colorectal cancer, esophageal adenocarcinoma, HCC, lung NSC, lung SC, and/or ovarian cancer, and so forth.
Table 2: Significant arm-level chromosomal segment copy number variations in 16 cancer subtypes (breast cancer, colorectal cancer, dedifferentiated liposarcoma, esophageal adenocarcinoma, esophageal squamous carcinoma, GIST (gastrointestinal stromal tumor), glioma, HCC (hepatocellular carcinoma), lung NSC, lung SC, medulloblastoma, melanoma, MPD (myeloproliferative disorder), ovarian cancer, prostate cancer, acute lymphoblastic leukemia (ALL), and renal cancer) (see, e.g., Beroukhim et al., Nature (2010) 463(7283): 899-905).
These examples of associations between arm-level copy number variations and cancers are intended to be illustrative and not limiting. Other arm-level copy number variations and their cancer associations are known to those skilled in the art.
Smaller (e.g., focal) copy number variations.
As noted above, in certain embodiments, the methods described herein can be used to determine the presence or absence of a chromosomal amplification. In some embodiments, the chromosomal amplification is the gain of one or more entire chromosomes. In other embodiments, the chromosomal amplification is the gain of one or more segments of a chromosome. In still other embodiments, the chromosomal amplification is the gain of two or more segments of two or more chromosomes. In various embodiments, the chromosomal amplification can involve the gain of one or more oncogenes.
Dominantly acting genes associated with human solid tumors typically exert their effects through overexpression or altered expression. Gene amplification is a common mechanism leading to upregulation of gene expression. Evidence from cytogenetic studies indicates that significant amplification occurs in more than 50% of human breast carcinomas. Most notably, amplification of the proto-oncogene human epidermal growth factor receptor 2 (HER2), located on chromosome 17 (17q21-q22), results in overexpression of the HER2 receptor on the cell surface, leading to excessive and dysregulated signaling in breast cancer and other malignancies (Park et al., Clinical Breast Cancer 8). A variety of oncogenes have been found to be amplified in other human malignancies. Examples of cellular oncogene amplification in human tumors include amplification of: c-myc in the promyelocytic leukemia cell line HL60 and in small cell lung cancer cell lines; N-myc in primary neuroblastomas (stages III and IV), neuroblastoma cell lines, a retinoblastoma cell line and primary tumors, and small cell lung cancer cell lines and tumors; L-myc in small cell lung cancer cell lines and tumors; c-myb in acute myelogenous leukemia and colon cancer cell lines; c-erbb in epidermoid carcinoma cells and primary gliomas; c-K-ras-2 in primary carcinomas of the lung, colon, bladder, and rectum; and N-ras in a breast cancer cell line (Varmus H., Ann Rev Genetics 18: 553-612 (1984) [cited in Watson et al., Molecular Biology of the Gene (4th ed.; Benjamin/Cummings Publishing Co. 1987)]).
Duplication of oncogenes is a common cause of many types of cancer, as is the case with P70-S6 kinase 1 amplification and breast cancer. In such cases, the genetic duplication occurs in a somatic cell and affects only the genome of the cancer cells themselves, not the entire organism, much less any subsequent offspring. Other examples of oncogenes that are amplified in human cancers include MYC, ERBB2 (EFGR), CCND1 (cyclin D1), FGFR1, and FGFR2 in breast cancer; MYC and ERBB2 in cervical cancer; HRAS, KRAS, and MYB in colorectal cancer; MYC, CCND1, and MDM2 in esophageal cancer; CCNE, KRAS, and MET in gastric cancer; ERBB1 and CDK4 in glioblastoma; CCND1, ERBB1, and MYC in head and neck cancer; CCND1 in hepatocellular carcinoma; MYCB in neuroblastoma; MYC, ERBB2, and AKT2 in ovarian cancer; MDM2 and CDK4 in sarcoma; and MYC in small cell lung cancer. In one embodiment, the methods of the invention can be used to determine the presence or absence of amplification of an oncogene associated with a cancer. In certain embodiments, the amplified oncogene is associated with breast cancer, cervical cancer, colorectal cancer, esophageal cancer, gastric cancer, glioblastoma, head and neck cancer, hepatocellular carcinoma, neuroblastoma, ovarian cancer, sarcoma, or small cell lung cancer.
In one embodiment, the method may be used to determine the presence or absence of a chromosomal deletion. In some embodiments, such a chromosomal deletion is a loss of one or more entire chromosomes. In other embodiments, such a chromosomal deletion is a loss of one or more segments of a chromosome. In still other embodiments, such a chromosome deletion is the loss of two or more segments of two or more chromosomes. Such chromosomal deletions may involve the loss of one or more tumor suppressor genes.
Chromosomal deletions involving tumor suppressor genes are thought to play an important role in the development and progression of solid tumors. The retinoblastoma tumor suppressor gene (Rb-1), located in chromosome 13q14, is the most extensively characterized tumor suppressor gene. The Rb-1 gene product, a 105 kDa nuclear phosphoprotein, appears to play an important role in cell cycle regulation (Howe et al., Proc Natl Acad Sci USA 87). Altered or lost expression of the Rb protein is caused by inactivation of both gene alleles through either a point mutation or a chromosomal deletion. Rb-1 gene alterations have been found not only in retinoblastoma but also in other malignancies such as osteosarcoma, small cell lung cancer (Rygaard et al., Cancer Res 50: 5312-5317 [1990]), and breast cancer. Restriction fragment length polymorphism (RFLP) studies have shown that such tumor types frequently lose heterozygosity at 13q, suggesting that one of the Rb-1 gene alleles has been lost through a gross chromosomal deletion (Bowcock et al., Am J Hum Genet 46). Chromosome 1 abnormalities, including duplications, deletions, and unbalanced translocations involving chromosome 6 and other partner chromosomes, indicate that regions of chromosome 1, in particular 1q21-1q32 and 1p11-13, may harbor oncogenes or tumor suppressor genes that are relevant to both the chronic and advanced phases of myeloproliferative neoplasms (Caramazza et al., Eur J Hematol 84). Myeloproliferative neoplasms are also associated with deletions of chromosome 5. Complete loss or interstitial deletion of chromosome 5 is the most common karyotypic abnormality in myelodysplastic syndromes (MDS).
Patients with isolated del(5q)/5q- MDS have a more favorable prognosis than those with additional karyotypic defects, who tend to develop myeloproliferative neoplasms (MPNs) and acute myeloid leukemia. The frequency of unbalanced chromosome 5 deletions has led to the idea that 5q harbors one or more tumor suppressor genes that play fundamental roles in the growth control of hematopoietic stem/progenitor cells (HSCs/HPCs). Cytogenetic mapping of the commonly deleted regions (CDRs) centered on 5q31 and 5q32 has identified candidate tumor suppressor genes, including the ribosomal subunit RPS14, the transcription factor Egr1/Krox20, and the cytoskeletal remodeling protein alpha-catenin (Eisenmann, Oncogene 28). Cytogenetic and allelotyping studies of fresh tumors and tumor cell lines have shown that allelic loss from several distinct regions on chromosome 3p (including 3p25, 3p21-22, 3p21.3, 3p12-13, and 3p14) is the earliest and most frequent genomic abnormality involved in a wide spectrum of major epithelial cancers of the lung, breast, kidney, head and neck, ovary, cervix, colon, pancreas, esophagus, bladder, and other organs. Several tumor suppressor genes have been mapped to the chromosome 3p region, and interstitial deletions or promoter hypermethylation are thought to precede the loss of 3p or of the entire chromosome 3 in the development of carcinomas (Angeloni D., Briefings Functional Genomics 6).
Newborns and children with Down syndrome (DS) often present with congenital transient leukemia and have an increased risk of acute myeloid leukemia and acute lymphoblastic leukemia. Chromosome 21, which harbors approximately 300 genes, may be involved in numerous structural aberrations, such as translocations, deletions, and amplifications, in leukemias, lymphomas, and solid tumors. Moreover, genes located on chromosome 21 have been identified that play an important role in tumorigenesis. Numerical as well as structural aberrations of chromosome 21 are associated with leukemias, and specific genes located in 21q, including RUNX1, TMPRSS2, and TFF, play a role in tumorigenesis (Fonatsch C, Genes Chromosomes Cancer 49).
In view of the foregoing, in various embodiments, the methods described herein can be used to determine segment CNVs that are known to comprise one or more oncogenes or tumor suppressor genes and/or that are known to be associated with a cancer or an increased risk of cancer. In certain embodiments, the CNVs can be determined in a test sample comprising constitutive (germline) nucleic acids, and the segments can be identified in those constitutive nucleic acids. In certain embodiments, segment CNVs (if present) are identified in a sample comprising a mixture of nucleic acids (e.g., nucleic acids derived from normal cells and nucleic acids derived from neoplastic cells). In certain embodiments, the sample is derived from a subject suspected of or known to have cancer (e.g., carcinoma, sarcoma, lymphoma, leukemia, germ cell tumor, blastoma, and the like). In one embodiment, the sample is a plasma sample derived (processed) from peripheral blood, which may comprise a mixture of cfDNA derived from normal and cancer cells. In another embodiment, the biological sample used to determine the presence of a CNV is derived from cells that, if cancer is present, comprise a mixture of cancerous and non-cancerous cells from other biological tissues, including, but not limited to, biological fluids such as serum, sweat, tears, sputum, urine, ear exudates, lymph, saliva, cerebrospinal fluid, lavages, bone marrow suspensions, vaginal fluids, transcervical lavage, brain fluids, ascites, milk, and secretions of the respiratory, intestinal, and genitourinary tracts, as well as leukapheresis samples, or from tissue biopsies, swabs, or smears. In other embodiments, the biological sample is a stool (fecal) sample.
CNVs for determining the presence of cancer and/or increased risk of cancer may include amplifications or deletions.
In various embodiments, CNVs identified as indicative of the presence of cancer or increased risk of cancer comprise one or more of the amplifications shown in table 3.
Table 3: Illustrative, but non-limiting, chromosomal segments characterized by amplifications that are associated with cancers. The cancer types listed are those identified in Beroukhim et al., Nature 18.
In certain embodiments, in combination with or separately from the amplification described above (herein), CNVs identified as indicative of the presence of cancer or an increased risk of cancer comprise one or more of the deletions set forth in table 4.
Table 4: Illustrative, but non-limiting, chromosomal segments characterized by deletions that are associated with cancers. The cancer types listed are those identified in Beroukhim et al., Nature 18.
The aneuploidies identified that characterize different cancers (e.g., aneuploidies identified in tables 3 and 4) can include genes known to be involved in cancer etiology (e.g., tumor suppressors, oncogenes, etc.). These aneuploidies can also be probed to identify related, but previously unknown genes.
For example, Beroukhim et al., supra, used GRAIL (Gene Relationships Among Implicated Loci), an algorithm that searches for functional relationships among genomic regions, to evaluate potential cancer-causing genes in the copy number alterations. GRAIL scores each gene in a collection of genomic regions for its 'relatedness' to genes in the other regions, based on textual similarity between the published abstracts of all papers citing the genes, on the notion that some target genes function in common pathways. These methods allow the identification and characterization of genes in the regions at issue that were not previously associated with a particular cancer. Table 5 illustrates target genes known to be within the identified amplified segments, together with predicted genes, and Table 6 illustrates target genes known to be within the identified deleted segments, together with predicted genes.
Table 5: Illustrative, but non-limiting, chromosomal segments and genes known or predicted to be present in regions characterized by amplification in different cancers (see, e.g., Beroukhim et al., supra).
Table 6: Illustrative, but non-limiting, chromosomal segments and genes known or predicted to be present in regions characterized by deletion in different cancers (see, e.g., Beroukhim et al., supra).
In various embodiments, it is contemplated that CNVs comprising amplified regions or segments of genes identified in table 5 are identified using the methods identified herein, and/or CNVs comprising deleted regions or segments of genes identified in table 6 are identified using the methods identified herein.
In one embodiment, the methods described herein provide a means to assess the correlation between gene amplification and the extent of tumor evolution. The association between amplification and/or deletion and cancer stage or grade can be important for prognosis, as such information can constitute a definition of genetic tumor grade, which would better predict the future course of more advanced tumors with worst prognosis. In addition, information about early amplification and/or deletion events can be useful in correlating these events as predictors of subsequent disease progression.
Gene amplifications and deletions identified by the present methods can be correlated with other known parameters such as tumor grade, histology, Brd/Urd labeling index, hormonal status, lymph node involvement, tumor size, survival duration, and other tumor properties available from epidemiological and biostatistical studies. For example, tumor DNA to be tested by the present method may include atypical hyperplasia, ductal carcinoma in situ, stage I-III cancer, and metastatic lymph nodes, in order to permit the identification of associations between amplifications and deletions and stage. The associations made may enable effective therapeutic intervention. For example, consistently amplified regions may contain an overexpressed gene, the product of which may be amenable to therapeutic attack (e.g., the growth factor receptor tyrosine kinase p185HER2).
In various embodiments, the methods described herein can be used to identify amplification and/or deletion events associated with drug resistance by comparing the copy number variation of those nucleic acid sequences in the primary cancer with that in cells that have metastasized to other sites. If gene amplification and/or deletion is a manifestation of karyotypic instability that allows rapid development of drug resistance, more amplification and/or deletion would be expected in primary tumors from chemotherapy-resistant patients than in tumors from chemotherapy-sensitive patients. For example, if amplification of particular genes is responsible for the development of drug resistance, the regions surrounding those genes would be expected to be consistently amplified in tumor cells from chemotherapy-resistant patients, but not in the primary tumors. The discovery of associations between gene amplifications and/or deletions and the development of drug resistance may allow the identification of patients who will or will not benefit from adjuvant therapy.
In a manner similar to that described for determining the presence or absence of a complete and/or partial fetal chromosomal aneuploidy in a maternal sample, the methods, devices, and systems described herein can be used to determine the presence or absence of a complete and/or partial chromosomal aneuploidy in any patient sample (including patient samples that are not maternal samples) that contains nucleic acids (e.g., DNA or cfDNA). Such patient sample may be any biological sample type as described elsewhere in the application. Preferably, such a sample is obtained by a non-invasive procedure. Such a sample may be, for example, a blood sample, or serum and plasma fractions thereof. Alternatively, such a sample may be a urine sample or a stool sample. In still other embodiments, the sample is a tissue biopsy sample. In all cases, such samples include nucleic acids, such as cfDNA or genomic DNA, which is purified and sequenced using any of the NGS sequencing methods described above.
Both complete and partial chromosomal aneuploidies associated with the development and progression of cancer can be determined according to the present methods.
In various embodiments, when the methods described herein are used to determine the presence and/or increased risk of cancer, the data can be normalized with respect to the one or more chromosomes for which the CNV is determined. In certain embodiments, the data can be normalized with respect to the one or more chromosomal arms for which the CNV is determined. In certain embodiments, the data can be normalized with respect to the one or more specific segments for which the CNV is determined.
In addition to the role of CNVs in cancer, CNVs are also associated with an increasing number of common complex diseases, including Human Immunodeficiency Virus (HIV), autoimmune diseases, and a range of neuropsychiatric disorders.
CNV in infectious and autoimmune diseases
To date, a number of studies have reported associations between CNV in genes involved in inflammation and the immune response and HIV, asthma, Crohn's disease, and other autoimmune disorders (Fanciulli et al., Clin Genet 77). For example, CNV in CCL3L1 has been implicated in HIV/AIDS susceptibility (CCL3L1, 17q11.2 deletion), rheumatoid arthritis (CCL3L1, 17q11.2 deletion), and Kawasaki disease (CCL3L1, 17q11.2 duplication); CNV in HBD-2 has been reported to predispose to colonic Crohn's disease (HBD-2, 8p23.1 deletion) and psoriasis (HBD-2, 8p23.1 deletion); and CNV in FCGR3B has been shown to predispose to glomerulonephritis in systemic lupus erythematosus (FCGR3B, 1q23 deletion, 1q23 duplication) and to anti-neutrophil cytoplasmic antibody (ANCA) associated vasculitis (FCGR3B, 1q23 deletion), and to increase the risk of rheumatoid arthritis. At least two inflammatory or autoimmune diseases have been shown to be associated with CNV at different gene loci. For example, Crohn's disease is associated not only with low copy number of HBD-2, but also with a common deletion polymorphism upstream of the IRGM gene, which encodes a member of the p47 immunity-related GTPase family. In addition to the association with FCGR3B copy number, SLE susceptibility has been reported to be significantly increased in subjects with a lower copy number of complement component C4.
Associations between genomic deletions at the GSTM1 (GSTM1, 1q23 deletion) and GSTT1 (GSTT1, 22q11.2 deletion) loci and an increased risk of atopic asthma have been reported in a number of independent studies. In some embodiments, the methods described herein can be used to determine the presence or absence of a CNV associated with inflammation and/or an autoimmune disease. For example, these methods can be used to determine the presence of a CNV in a patient suspected of having HIV, asthma, or Crohn's disease. Examples of CNVs associated with such diseases include, but are not limited to, deletions at 17q11.2, 8p23.1, 1q23, and 22q11.2, and duplications at 17q11.2 and 1q23. In some embodiments, the methods of the invention may be used to determine the presence of CNV in genes including, but not limited to, CCL3L1, HBD-2, FCGR3B, GSTM1, GSTT1, C4, and IRGM.
CNV in diseases of the nervous system
Associations between de novo and inherited CNVs and several common neurological and psychiatric diseases have been reported in certain cases of autism, schizophrenia, and epilepsy, and in neurodegenerative diseases such as Parkinson's disease, amyotrophic lateral sclerosis (ALS), and autosomal dominant Alzheimer's disease (Fanciulli et al., Clin Genet 77). Cytogenetic abnormalities with duplications at 15q11-q13 have been observed in patients with autism and autism spectrum disorders (ASDs). According to the Autism Genome Project Consortium, 154 CNVs, including several recurrent CNVs, are located either on chromosome 15q11-q13 or at new genomic locations, including chromosome 2p16, 1q21, and 17p12 in a region associated with Smith-Magenis syndrome that overlaps with ASD. Recurrent microdeletions or microduplications on chromosome 16p11.2 have highlighted the observation that de novo CNVs are detected at loci for genes known to regulate synaptic differentiation and glutamatergic neurotransmitter release, such as SHANK3 (22q13.3 deletion), neurexin 1 (NRXN1, 2p16.3 deletion), and neuroligin (NLGN4, Xp22.33 deletion). Schizophrenia has also been associated with multiple de novo CNVs. Microdeletions and microduplications associated with schizophrenia contain an overrepresentation of genes belonging to neurodevelopmental and glutamatergic pathways, suggesting that multiple CNVs affecting these genes may contribute directly to the pathogenesis of schizophrenia, e.g., ERBB4, 2q34 deletion; SLC1A3, 5p13.3 deletion; RAPGEF4, 2q31.1 deletion; CIT, 12.24 deletion; and multiple genes with de novo CNVs. CNVs have also been associated with other neurological disorders, including epilepsy (CHRNA7, 15q13.3 deletion), Parkinson's disease (SNCA, 4q22 duplication), and ALS (SMN1, 5q12.2.-q13.3 deletion; and SMN2 deletion).
In some embodiments, the methods described herein can be used to determine the presence or absence of a CNV associated with a neurological disease. For example, these methods may be used to determine the presence of a CNV in a patient suspected of having autism, schizophrenia, epilepsy, a neurodegenerative disease (such as Parkinson's disease), amyotrophic lateral sclerosis (ALS), or autosomal dominant Alzheimer's disease. The methods can be used to determine CNV of genes associated with neurological diseases, including but not limited to any of the autism spectrum disorders (ASDs), schizophrenia, and epilepsy, as well as CNV of genes associated with neurodegenerative disorders such as Parkinson's disease. Examples of CNVs associated with such diseases include, but are not limited to, duplications at 15q11-q13, 2p16, 1q21, 17p12, 16p11.2, and 4q22, and deletions at 22q13.3, 2p16.3, Xp22.33, 2q34, 5p13.3, 2q31.1, 12.24, 15q13.3, and 5q12.2. In some embodiments, the methods can be used to determine the presence of CNV in genes including, but not limited to, SHANK3, NLGN4, NRXN1, ERBB4, SLC1A3, RAPGEF4, CIT, CHRNA7, SNCA, SMN1, and SMN2.
CNV and metabolic or cardiovascular diseases
Associations between metabolic and cardiovascular traits (e.g., familial hypercholesterolemia (FH), atherosclerosis, and coronary artery disease) and CNVs have been reported in a number of studies (Fanciulli et al., Clin Genet 77). For example, germline rearrangements (mainly deletions) have been observed at the LDLR gene (LDLR, 19p13.2 deletion/duplication) in certain FH patients who carry no other LDLR mutations. Another example is the LPA gene, which encodes apolipoprotein(a) (apo(a)), the plasma concentration of which is associated with the risk of coronary artery disease, myocardial infarction (MI), and stroke. Plasma concentrations of the apo(a)-containing lipoprotein Lp(a) vary by more than 1000-fold between individuals, and 90% of this variation is genetically determined at the LPA locus, with plasma concentration and Lp(a) isoform size being proportional to the highly variable number of 'kringle 4' repeats (range 5 to 50). These data indicate that CNV in at least two genes can be associated with cardiovascular risk. The methods described herein may be particularly useful in large studies searching for CNVs associated with cardiovascular disorders. In some embodiments, the methods of the invention can be used to determine the presence or absence of a CNV associated with a metabolic or cardiovascular disease. For example, the methods of the invention can be used to determine the presence of a CNV in a patient suspected of having familial hypercholesterolemia. The methods described herein can be used to determine CNV of genes associated with metabolic or cardiovascular disease (e.g., hypercholesterolemia). Examples of CNVs associated with such diseases include, but are not limited to, the 19p13.2 deletion/duplication of the LDLR gene and multiplications in the LPA gene.
Determining complete chromosomal aneuploidy in patient samples
In one embodiment, methods are provided for determining the presence or absence of any one or more different, complete chromosomal aneuploidies in a patient test sample comprising nucleic acid molecules. In some embodiments, the method determines the presence or absence of any one or more different, complete chromosomal aneuploidies. The method comprises the following steps: (a) obtaining sequence information for the patient nucleic acids in the patient test sample; and (b) using the sequence information to identify a number of sequence tags for each of any one or more chromosomes of interest selected from chromosomes 1-22, X, and Y, and to identify a number of sequence tags for a normalizing chromosome sequence for each of the any one or more chromosomes of interest. The normalizing chromosome sequence may be a single chromosome, or it may be a group of chromosomes selected from chromosomes 1-22, X, and Y. The method further comprises (c) calculating a single chromosome dose for each of the any one or more chromosomes of interest, using the number of sequence tags identified for each of the chromosomes of interest and the number of sequence tags identified for each of the normalizing chromosome sequences; and (d) comparing each single chromosome dose for each of the any one or more chromosomes of interest to a threshold value for each of the any one or more chromosomes of interest, thereby determining the presence or absence of any one or more different, complete patient chromosomal aneuploidies in the patient test sample.
In some embodiments, step (c) comprises calculating for each of said chromosomes of interest a single chromosome dose as a ratio of the number of sequence tags identified for each of said chromosomes of interest to the number of sequence tags identified for said normalized chromosome sequences for each of said chromosomes of interest.
In other embodiments, step (c) comprises calculating a single chromosome dose for each of said chromosomes of interest as a ratio of the number of sequence tags identified for each of said chromosomes of interest to the number of sequence tags identified for the normalizing chromosome of each of said chromosomes of interest. In still other embodiments, step (c) comprises: calculating a sequence tag density for a chromosome of interest by relating the number of sequence tags obtained for the chromosome of interest to the length of the chromosome of interest; calculating a sequence tag density for the corresponding normalizing chromosome sequence by relating the number of tags obtained for that sequence to its length; and calculating the chromosome dose for the chromosome of interest as the ratio of the sequence tag density of the chromosome of interest to the sequence tag density of the normalizing chromosome sequence. This calculation is repeated for each chromosome of interest. Steps (a)-(d) may be repeated for test samples from different patients.
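The two dose calculations described above can be illustrated with a minimal Python sketch. The function names, tag counts, and sequence lengths below are hypothetical and chosen only for illustration; the sketch assumes that sequence tags have already been mapped and counted.

```python
def chromosome_dose(tags_interest, tags_normalizing):
    """Dose as the ratio of tag counts: chromosome of interest / normalizing sequence."""
    if tags_normalizing <= 0:
        raise ValueError("the normalizing sequence must have at least one tag")
    return tags_interest / tags_normalizing


def density_dose(tags_interest, len_interest, tags_normalizing, len_normalizing):
    """Dose as the ratio of tag densities (tags per base of sequence)."""
    if min(len_interest, len_normalizing, tags_normalizing) <= 0:
        raise ValueError("lengths and normalizing tag count must be positive")
    density_interest = tags_interest / len_interest
    density_normalizing = tags_normalizing / len_normalizing
    return density_interest / density_normalizing


# Hypothetical tag counts and lengths for one chromosome of interest and
# its normalizing chromosome sequence (illustrative values, not real data).
dose_by_count = chromosome_dose(30_000, 600_000)
dose_by_density = density_dose(30_000, 40_000_000, 600_000, 400_000_000)
print(dose_by_count, dose_by_density)
```

The density form differs from the count form only by the constant factor `len_normalizing / len_interest`, so either form can be compared against a threshold derived in the same way.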
In one example of this embodiment, intact chromosomal aneuploidies are determined in a cancer patient test sample comprising cell-free DNA molecules by a method that includes: (a) sequencing at least a portion of the cell-free DNA molecules to obtain sequence information for the patient cell-free DNA molecules in the test sample; (b) using the sequence information to identify a number of sequence tags for each of any twenty or more chromosomes of interest selected from chromosomes 1-22, X, and Y, and to identify a number of sequence tags for a normalizing chromosome for each of the twenty or more chromosomes of interest; (c) calculating a single chromosome dose for each of the twenty or more chromosomes of interest using the number of sequence tags identified for each of the twenty or more chromosomes of interest and the number of sequence tags identified for each of the normalizing chromosomes; and (d) comparing each single chromosome dose for each of the twenty or more chromosomes of interest to a threshold value for each of the twenty or more chromosomes of interest, thereby determining the presence or absence of any twenty or more different, intact chromosomal aneuploidies in the patient test sample.
In another embodiment, the method for determining the presence or absence of any one or more distinct, intact chromosomal aneuploidies in a patient test sample as described above uses a sequence of normalized segments to determine the dose of chromosomes of interest. In this example, the method includes: (a) Obtaining sequence information for nucleic acids in the sample; and (b) using the sequence information to identify a number of sequence tags for each of any one or more chromosomes of interest selected from chromosomes 1-22, X, and Y, and a number of sequence tags for a normalizing segment sequence for each of the any one or more chromosomes of interest. The normalizing segment sequence may be a single segment of one chromosome, or it may be a set of segments from one or more different chromosomes. The method further calculates in step (c) a single chromosome dose for each of the any one or more chromosomes of interest using the number of sequence tags identified for each of the any one or more chromosomes of interest and the number of sequence tags identified for the normalizing segment sequence; and (d) comparing each said single chromosome dose for each of said any one or more chromosomes of interest to a threshold value for each of said one or more chromosomes of interest, and thereby determining the presence or absence of one or more different, intact chromosomal aneuploidies in the patient sample.
In some embodiments, step (c) comprises calculating for each of said chromosomes of interest a single chromosome dose as a ratio of the number of sequence tags identified for each of said chromosomes of interest to the number of sequence tags identified for said normalized segment sequence of each of said chromosomes of interest.
In other embodiments, step (c) comprises: calculating a sequence tag density for a chromosome of interest by relating the number of sequence tags obtained for the chromosome of interest to the length of the chromosome of interest; calculating a sequence tag density for the corresponding normalizing segment sequence by relating the number of tags obtained for that sequence to its length; and calculating the chromosome dose for the chromosome of interest as the ratio of the sequence tag density of the chromosome of interest to the sequence tag density of the normalizing segment sequence. This calculation is repeated for each chromosome of interest. Steps (a)-(d) may be repeated for test samples from different patients.
Determining a Normalized Chromosome Value (NCV) provides a means for comparing chromosome doses across different sample sets; the NCV relates the chromosome dose in a test sample to the mean of the corresponding chromosome doses in a set of qualified samples. The NCV is calculated as:

$$\mathrm{NCV}_{ij} = \frac{x_{ij} - \hat{\mu}_j}{\hat{\sigma}_j}$$

where $\hat{\mu}_j$ and $\hat{\sigma}_j$ are the estimated mean and standard deviation, respectively, of the j-th chromosome dose in the set of qualified samples, and $x_{ij}$ is the observed j-th chromosome dose for test sample i.
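As a minimal sketch of the NCV calculation, the following Python function standardizes a test-sample chromosome dose against the doses observed in a set of qualified samples. It assumes the z-score form implied by the definitions above, (x_ij minus the estimated mean) divided by the estimated standard deviation, and it assumes the sample standard deviation is the intended estimator; the function name and the dose values are hypothetical.

```python
from statistics import mean, stdev


def normalized_chromosome_value(test_dose, qualified_doses):
    """Standardize a test dose against the qualified-sample doses for one chromosome.

    Assumption: the standard deviation is the sample (n - 1) estimate.
    """
    if len(qualified_doses) < 2:
        raise ValueError("at least two qualified samples are needed")
    mu_hat = mean(qualified_doses)
    sigma_hat = stdev(qualified_doses)
    if sigma_hat == 0:
        raise ValueError("qualified doses have zero variance")
    return (test_dose - mu_hat) / sigma_hat


# Hypothetical chromosome doses for one chromosome in four qualified samples,
# and the dose observed for the same chromosome in a test sample.
qualified = [0.049, 0.050, 0.051, 0.050]
ncv = normalized_chromosome_value(0.053, qualified)
print(round(ncv, 2))
```

Because the NCV is expressed in units of standard deviations of the qualified set, values from different chromosomes and different sample sets can be compared on a common scale.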
In some embodiments, the presence or absence of one intact chromosomal aneuploidy is determined. In other embodiments, the presence or absence of two, three, four, five, six, seven, eight, nine, ten, eleven, twelve, thirteen, fourteen, fifteen, sixteen, seventeen, eighteen, nineteen, twenty, twenty-one, twenty-two, twenty-three, or twenty-four intact chromosomal aneuploidies is determined in a sample, wherein twenty-two of the intact chromosomal aneuploidies correspond to intact chromosomal aneuploidies of any one or more of the autosomes, and the twenty-third and twenty-fourth correspond to intact chromosomal aneuploidies of chromosomes X and Y. Since aneuploidies can include trisomies, tetrasomies, pentasomies, and other polysomies, and the number of intact chromosomal aneuploidies varies between diseases and between stages of the same disease, the number of intact chromosomal aneuploidies determined according to the present methods is at least 24, at least 25, at least 26, at least 27, at least 28, at least 29, at least 30, at least 40, at least 50, at least 60, at least 70, at least 80, at least 90, at least 100, or more. Systematic karyotyping of tumors has revealed that the number of chromosomes in cancer cells is highly variable, ranging from hypodiploid (considerably fewer than 46 chromosomes) to tetraploid and hypertetraploid (up to 200 chromosomes) (Storchova and Kuffer, J Cell Sci 121. In some embodiments, the method comprises determining the presence or absence of up to 200 or more chromosomal aneuploidies in a sample from a patient suspected or known to have cancer (e.g., colon cancer). These chromosomal aneuploidies include the loss of one or more intact chromosomes (hypodiploidy) and the gain of intact chromosomes, including trisomies, tetrasomies, pentasomies, and other polysomies.
As explained elsewhere in this application, the acquisition and/or loss of chromosome segments may also be determined. The method is suitable for determining the presence or absence of different aneuploidies in a sample from a patient suspected or known to have a cancer as specified elsewhere in the application.
In some embodiments, any of chromosomes 1-22, X, and Y can be the chromosome of interest in determining the presence or absence of any one or more different, intact chromosomal aneuploidies in a patient test sample as described above. In other embodiments, the two or more chromosomes of interest are any two or more selected from chromosomes 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,17, 18, 19, 20, 21, 22, X, or Y. In one embodiment, any one or more chromosomes of interest selected from the group consisting of chromosomes 1-22, X, and Y comprises at least twenty chromosomes selected from the group consisting of chromosomes 1-22, X, and Y, and wherein the presence or absence of at least twenty different, intact chromosomal aneuploidies is determined. In other embodiments, any one or more of the chromosomes of interest selected from the group consisting of chromosomes 1-22, X, and Y is all of chromosomes 1-22, X, and Y, and wherein the presence or absence of an intact chromosomal aneuploidy of all of chromosomes 1-22, X, and Y is determined. Intact, different chromosomal aneuploidies that can be determined include an intact chromosomal monosomy of any one or more of chromosomes 1-22, X and Y; a complete chromosomal trisomy of any one or more of chromosomes 1-22, X, and Y; a complete chromosomal tetrasomy of any one or more of chromosomes 1-22, X and Y; a complete chromosomal pentasomal of any one or more of chromosomes 1-22, X and Y; and other complete chromosomal polysomy of any one or more of chromosomes 1-22, X and Y.
Determining partial chromosomal aneuploidy in patient samples
In another embodiment, methods are provided for determining the presence or absence of any one or more different, partial chromosomal aneuploidies in a patient test sample comprising nucleic acid molecules. The method comprises the following steps: (a) obtaining sequence information for patient nucleic acids in the sample; and (b) using the sequence information to identify a number of sequence tags for each of any one or more segments of any one or more chromosomes of interest selected from chromosomes 1-22, X, and Y, and a number of sequence tags for a normalizing segment sequence for each of said any one or more segments of any one or more chromosomes of interest. The normalizing segment sequence may be a single segment of one chromosome, or it may be a set of segments from one or more different chromosomes. The method further comprises: (c) using the number of sequence tags identified for each of said any one or more segments of any one or more chromosomes of interest and the number of sequence tags identified for each of said normalizing segment sequences to calculate a single segment dose for each of said any one or more segments of any one or more chromosomes of interest; and (d) comparing each said single segment dose for each of said any one or more segments of any one or more chromosomes of interest with a threshold value for each of said any one or more segments, thereby determining the presence or absence of one or more different, partial chromosomal aneuploidies in said sample.
In some embodiments, step (c) comprises: calculating a single segment dose for any one or more segment of each any one or more chromosome of interest as a ratio of the number of sequence tags identified for any one or more segment of each any one or more chromosome of interest to the number of sequence tags identified for the normalized segment sequence for any one or more segment of each said any one or more chromosome of interest.
In other embodiments, step (c) comprises: calculating a sequence tag density for a segment of interest by relating the number of sequence tags obtained for the segment of interest to the length of the segment of interest; calculating a sequence tag density for the corresponding normalizing segment sequence by relating the number of tags obtained for that sequence to its length; and calculating the segment dose for the segment of interest as the ratio of the sequence tag density of the segment of interest to the sequence tag density of the normalizing segment sequence. This calculation is repeated for each segment of interest. Steps (a)-(d) may be repeated for test samples from different patients.
Determining the Normalized Segment Value (NSV) provides a means for comparing segment doses across different sample sets; the NSV relates the segment dose in a test sample to the mean of the corresponding segment doses in a set of qualified samples. The NSV is calculated as:

$$\mathrm{NSV}_{ij} = \frac{x_{ij} - \hat{\mu}_j}{\hat{\sigma}_j}$$

where $\hat{\mu}_j$ and $\hat{\sigma}_j$ are the estimated mean and standard deviation, respectively, of the j-th segment dose in the set of qualified samples, and $x_{ij}$ is the observed j-th segment dose for test sample i.
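The segment-level workflow can be sketched end to end in Python: segment doses are computed against a normalizing segment sequence and then standardized against qualified samples. This is an illustration under stated assumptions, not the method as implemented by the applicant: the segment names and counts are hypothetical, a set of normalizing segments is handled by summing its tag counts (an assumption), and the NSV is taken to be the z-score form implied by the definitions of the mean and standard deviation above.

```python
from statistics import mean, stdev


def segment_doses(tag_counts, normalizers):
    """Compute one dose per segment of interest.

    tag_counts:  {segment name: number of sequence tags}
    normalizers: {segment of interest: [names of its normalizing segments]}
    Assumption: tags of a set of normalizing segments are summed.
    """
    doses = {}
    for segment, normalizing_segments in normalizers.items():
        normalizing_total = sum(tag_counts[name] for name in normalizing_segments)
        doses[segment] = tag_counts[segment] / normalizing_total
    return doses


def normalized_segment_values(test_doses, qualified_doses):
    """Standardize each test dose against the qualified-sample doses for that segment."""
    return {
        segment: (dose - mean(qualified_doses[segment])) / stdev(qualified_doses[segment])
        for segment, dose in test_doses.items()
    }


# Hypothetical tag counts for one segment of interest and two normalizing segments.
counts = {"segA": 1_040, "norm1": 4_000, "norm2": 6_000}
doses = segment_doses(counts, {"segA": ["norm1", "norm2"]})

# Hypothetical doses for the same segment in three qualified samples.
nsv = normalized_segment_values(doses, {"segA": [0.099, 0.100, 0.101]})
print(doses, nsv)
```

Keeping the mapping from each segment of interest to its own normalizing segments explicit reflects the statement that a normalizing segment sequence is identified for each segment of interest.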
In some embodiments, the presence or absence of one partial chromosomal aneuploidy is determined. In other embodiments, the presence or absence of two, three, four, five, six, seven, eight, nine, ten, fifteen, twenty, twenty-five, or more partial chromosomal aneuploidies is determined in a sample. In one embodiment, a single segment of interest is selected from any one of chromosomes 1-22, X, and Y. In other embodiments, two or more segments of interest are selected from any two or more of chromosomes 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, or Y. In one embodiment, the one or more segments of interest selected from chromosomes 1-22, X, and Y include at least one, five, ten, 15, 20, 25, 50, 75, 100, or more segments, and the presence or absence of at least one, five, ten, 15, 20, 25, 50, 75, 100, or more different, partial chromosomal aneuploidies is determined. Different, partial chromosomal aneuploidies that can be determined include partial duplications, partial multiplications, partial insertions, and partial deletions.
The sample used to determine the presence or absence of a chromosomal aneuploidy (partial or complete) in a patient can be any biological sample described elsewhere in this application. The type of sample or samples used to determine aneuploidy in a patient will depend on the type of disease the patient is known or suspected to have. For example, a stool sample can be selected as a source of DNA to determine the presence or absence of aneuploidies associated with colorectal cancer. The method is also applicable to the tissue samples described herein. Preferably, the sample is a biological sample obtained by non-invasive means, such as a plasma sample. Sequencing of nucleic acids in patient samples can be performed using Next Generation Sequencing (NGS), as described elsewhere in this application. In some embodiments, the sequencing is massively parallel sequencing using sequencing-by-synthesis with reversible dye terminators. In other embodiments, the sequencing is sequencing-by-ligation. In still other embodiments, the sequencing is single molecule sequencing. Optionally, an amplification step is performed prior to sequencing.
In some embodiments, the presence or absence of aneuploidy is determined in a patient suspected of having a cancer as described elsewhere in this application, e.g., cancer of the lung, breast, kidney, head and neck, ovary, cervix, colon, pancreas, esophagus, bladder, or other organs, as well as hematologic cancers. Hematologic cancers include cancers of the bone marrow, blood, and lymphatic system, which includes the lymph nodes, lymphatic vessels, tonsils, thymus, spleen, and gut-associated lymphoid tissue. Leukemia and myeloma, which originate in the bone marrow, and lymphomas, which originate in the lymphatic system, are the most common types of hematologic cancer.
The determination of the presence or absence of one or more chromosomal aneuploidies in a patient sample can be used, without limitation, for: determining a patient's susceptibility to a particular cancer, determining the presence or absence of a cancer of interest as part of routine screening of patients whether or not they are known to be susceptible to the cancer, providing a prognosis for the disease, assessing the need for adjuvant therapy, and determining the progression or regression of the disease.
Genetic counseling
Fetal chromosomal abnormalities are a leading cause of miscarriage, congenital anomalies, and perinatal death (Wellesley et al., Europ. J. Human Genet., 20. Since the introduction of amniocentesis, followed by chorionic villus sampling (CVS), pregnant women have had access to information about the chromosomal status of the fetus (ACOG Practice Bulletin No. 77: Obstet Gynecol 109. When sufficient tissue is obtained, the fetal cells or chorionic villi collected in these procedures are karyotyped cytogenetically, which in most cases gives very high diagnostic sensitivity and specificity (about 99%) (Hahnemann and Vejerslev, Prenat Diagn., 17. However, these procedures also pose risks to the fetus and the pregnant woman (Odibo et al., Obstet Gynecol 112, 813-819 [2008]; Odibo et al., Obstet Gynecol 111.
To mitigate these risks, a series of prenatal screening algorithms has been developed to stratify women by their likelihood of carrying a fetus with one of the most common fetal trisomies: trisomy 21 (T21, Down syndrome), trisomy 18 (T18, Edwards syndrome), and, to a lesser extent, trisomy 13 (T13, Patau syndrome). Screening typically involves measuring multiple biochemical analytes in maternal serum at different time points, measuring fetal nuchal translucency (NT) by ultrasonography, and combining these with other maternal factors (e.g., age) to generate a risk score. As screening has been developed and refined over many years, and depending on when it is administered (first or second trimester only, sequential, or fully integrated) and how it is administered (serum only, or serum in combination with NT), a menu of options has emerged with varying detection rates (65% to 90%) and high screen-positive rates (5%) (ACOG Practice Bulletin No. 77: Obstet Gynecol 109 217-227 [2007]).
For patients, the information or "risk score" that results from this multi-step procedure can be confusing and can cause anxiety, especially in the absence of comprehensive counseling. Ultimately, the risk of miscarriage from an invasive procedure is weighed when the woman makes her decision. Better non-invasive means of obtaining more definitive information about the chromosomal status of the fetus would aid decision-making in this setting. The methods described herein are believed to provide such improved non-invasive means.
In various embodiments, genetic counseling is contemplated as part of the use of the assays described herein, particularly in a clinical setting. Conversely, the aneuploidy detection methods described herein may be one option offered in the context of prenatal care and the associated genetic counseling.
Thus, in various embodiments, the methods described herein can be offered as a primary screen (e.g., for women with a pre-established risk in the pregnancy) or as a secondary screen for women who test positive on "routine" screening. In certain embodiments, it is contemplated that the non-invasive prenatal testing (NIPT) methods described herein additionally include a genetic counseling component, and/or optionally or explicitly incorporate genetic counseling and pregnancy "management" into the NIPT methods described herein.
For example, in certain embodiments, the pregnant woman has one or more pre-established risk factors. Such risk factors include, but are not limited to, one or more of the following:
1) Maternal age greater than 35 years, although it is noted that about 80% of children with Down syndrome are born to women younger than 35 years of age.
2) A previous fetus/child with an autosomal trisomy. Depending on the type of trisomy, whether the previous pregnancy ended in spontaneous miscarriage, the age of the mother at the first occurrence, and the age of the mother at the subsequent prenatal diagnosis, the recurrence risk is considered to be about 1.6 times to about 8.2 times the maternal-age risk.
3) A previous fetus/child with a sex chromosome abnormality. Not all sex chromosome abnormalities are of maternal origin, and not all carry a risk of recurrence. When they do recur, the recurrence risk is about 1.6 times to about 2.5 times the maternal-age risk.
4) Parental carrier of a chromosomal translocation.
5) Parental carrier of a chromosomal inversion.
6) Parental aneuploidy or mosaicism.
7) Use of certain assisted reproductive techniques.
In such cases, the mother, e.g., in consultation with a physician, genetic counselor, or the like, may be offered the methods described herein for non-invasively determining the presence or absence of a fetal aneuploidy (e.g., trisomy 21, trisomy 18, trisomy 13, monosomy X, etc.), subject to the considerations described below. In this regard, it should be noted that the methods described herein are considered effective even during the first trimester of pregnancy. Thus, in certain embodiments, it is contemplated that the NIPT methods described herein are used as early as 8 weeks of gestation, and in various embodiments at about 10 weeks or later.
In certain embodiments, those women who are positive on "routine" screening may be offered the methods described herein as a secondary screen. For example, in certain embodiments, the pregnancy may present a structural abnormality, such as a fetal cystic hygroma or increased nuchal translucency, e.g., as detected by ultrasonography. Typically, ultrasound detection of structural defects is performed between 18 and 22 weeks and may be coupled with fetal echocardiography, particularly when irregularities are observed. It is contemplated that when an abnormality is observed (e.g., a positive "routine" screen), the mother, e.g., in consultation with a physician, genetic counselor, or the like, may be offered the methods described herein for non-invasively determining the presence or absence of a fetal aneuploidy (e.g., trisomy 21, trisomy 18, trisomy 13, monosomy X, etc.).
Thus, in various embodiments, genetic counseling is contemplated in which the NIPT analysis described herein is offered as an integral part of prenatal care, pregnancy management, and/or the development/design of a birth plan. By offering NIPT as a secondary screen to women who are positive on routine screening (or who otherwise have a pre-established risk), it is expected that the number of unnecessary amniocentesis and CVS procedures can be reduced. However, because informed consent is an important component of NIPT, the need for genetic counseling increases.
Because positive NIPT results (using the methods described herein) are closer in nature to the results of amniocentesis or CVS, women should be given the opportunity, during genetic counseling before the test, to decide whether they want this level of information. Pre-test genetic counseling for NIPT should also include a discussion of, and recommendation for, confirmation of abnormal test results by CVS, amniocentesis, cordocentesis, or the like (depending on gestational age), so that the expected timing of results can be given appropriate consideration in post-test planning, in accordance with the statements of the National Society of Genetic Counselors (NSGC, USA) on the subject (see, e.g., Devers et al., Noninvasive Prenatal Testing/Noninvasive Prenatal Diagnosis: the Position of the National Society of Genetic Counselors (by NSGC Public Policy Committee), NSGC Position Statements 2012; Benn et al., Prenat Diagn, 31.
NIPT using the methods described herein may be more similar to CVS than to amniocentesis, in that a detected aneuploidy typically reflects the chromosomal constitution of the fetus but in some cases may reflect confined placental aneuploidy or confined placental mosaicism (CPM). In current CVS results, CPM is present in about 1% to 2% of cases, and some women undergo amniocentesis at a later gestational age after CVS to distinguish apparently isolated placental aneuploidy from fetal aneuploidy. As NIPT becomes more widely practiced, it is therefore expected that CPM will give rise to a certain number of positive NIPT results that are not subsequently confirmed by invasive procedures (particularly amniocentesis). Again, in various embodiments, it is contemplated that this information is presented to the patient in the context of genetic counseling (e.g., by a physician, genetic counselor, or the like).
It will be appreciated that, in various embodiments, the genetic counseling component can be used to recommend diagnostic modalities, to inform the patient about the degree of risk and the timing of the various diagnostic modalities, and to provide input regarding the value of the information provided by such confirmatory methods, particularly in the context of the timing options in the pregnancy. In various embodiments, genetic counseling may also establish a protocol for monitoring the pregnancy (e.g., subsequent ultrasonography, additional physician visits, etc.) and a series of decision points as appropriate. In addition, genetic counseling may suggest and assist in developing a birth plan, which may include, for example, information about the birth location (e.g., home, hospital, specialized facility, etc.), the personnel present at the birth location, third-party care available for the baby, and so forth.
While the above discussion has focused on the methods described herein as an integral part of prenatal diagnosis (and perhaps as a secondary tool), as clinical experience accumulates, and if studies comparing them with routine screening are favorable, the NIPT methods described herein may replace existing screening protocols and be used as a primary tool.
It is also contemplated that the methods described herein will find use in multiple-gestation pregnancies.
Typically, it is contemplated that genetic counseling (e.g., as described above) may be provided by a physician (e.g., a chief physician, obstetrician, etc.) and/or by a genetic counselor or other qualified medical professional. In some embodiments, the consultation is provided face-to-face, however, it is recognized that in some instances, the consultation may be provided through remote access (e.g., through text, cell phone application, tablet computer application, the internet, etc.).
It should also be recognized that in certain embodiments, the genetic counseling or a component thereof may be delivered by a computer system. For example, a "smart advice" system may be provided that supplies genetic counseling information (e.g., as described above) in response to test results, instructions from a medical care provider, and/or queries (e.g., from a patient). In certain embodiments, the information is tailored to clinical information provided by a physician, a healthcare system, and/or a patient. In certain embodiments, the information can be provided in an iterative manner. Thus, for example, the patient may pose "what if" queries, and the system may return information such as diagnostic options, risk factors, scheduling, and the implications of different results.
In some embodiments, the information can be provided in a transitory manner (e.g., presented on a computer screen). In certain embodiments, the information can be provided in a non-transitory manner. Thus, for example, information can be printed out (e.g., as a menu of options and/or recommendations, optionally with associated scheduling, etc.) and/or stored on a computer-readable medium (e.g., magnetic media such as a local hard disk, server, etc.; optical media; flash memory, etc.).
It will be appreciated that such systems are typically configured to provide sufficient security in order to maintain patient privacy, for example according to current standards in the industry.
The above discussion of genetic counseling is intended to be illustrative and not limiting. Genetic counseling is a well-established branch of medical science, and the combination of counseling components with respect to the analysis described herein is within the skill of the practitioner. Furthermore, it should be recognized that as the field develops, genetic counseling and related information and the nature of recommendations are likely to change.
Determining fetal fraction
Fetal fraction determination methods are disclosed in U.S. patent application publication 2010-0010085 (117.201), U.S. patent application publication 2011-0201507 (120.201), U.S. patent application No. 13/365,240 (filed 2/2012), and U.S. patent application No. 13/445,778 (filed 4/12/2012). A sufficient discussion of techniques for determining the fetal fraction can be found in these documents.
The methods described herein enable the determination of the fetal fraction in a sample comprising a mixture of fetal and maternal nucleic acids or, more generally, a mixture of nucleic acids derived from two different genomes. For the purposes of this discussion, maternal and fetal nucleic acids will be described, but it will be understood that any two genomes may be substituted. In some embodiments, fetal fraction is determined concurrently with determining the presence or absence of copy number variation (e.g., aneuploidy). As described more fully below, a single set of tags for a test sample may be used to determine both fetal fraction and copy number variation.
Methods of quantifying fetal fraction rely on differences between the fetal and maternal genomes. In certain embodiments described herein, determining the fetal fraction of sample DNA relies on multiple DNA sequence reads at sequence sites known to harbor one or more polymorphisms. In some embodiments, the polymorphic sites or target nucleic acid sequences are discovered while aligning the sequence tags with each other and/or with a reference sequence. In certain embodiments, the fetal fraction of sample DNA is determined from copy number information for a particular chromosome or chromosome segment for which there is a copy number difference between mother and fetus. In such embodiments, the fetal fraction is determined by considering the relative amounts of maternal and fetal sample DNA for chromosomes or segments that are determined, or inherently known, to have copy number variation. To this end, the methods and apparatus may calculate a Normalized Chromosome Value (NCV), as described below, or a similar metric.
Certain methods are limited by fetal gender, for example methods that quantify fetal fraction from the presence of sequences specific to the Y chromosome or from the chromosome dose of the X chromosome in a male fetus. In certain embodiments, fetal DNA is quantified by targeting fetal sequences that have no maternal counterpart, such as Y chromosome sequences or the RhD1 gene when it is absent from the maternal genome (Fan et al., Proc Natl Acad Sci 105. Other methods are independent of fetal gender and rely on polymorphic differences between the fetal and maternal genomes.
Allelic imbalance in polymorphisms can be detected and quantified by different techniques. In some embodiments, digital PCR is used to determine allelic imbalance in polymorphisms, such as SNPs on mRNA. Alternatively, capillary gel electrophoresis is used to detect differences in the size of the polymorphic region, for example in the case of STR.
In some embodiments, epigenetic differences, such as differential methylation in promoter regions, can be detected and used alone or in combination with digital PCR to determine differences between the fetal and maternal genomes and to quantify fetal fraction (Tong et al., Clin Chem 56). Variations of epigenetic methods, such as methylation-based DNA discrimination, are also included (Ehrich et al., AJOG 204:205.e1-205.e11 [2011]). In some embodiments, fetal fractions are estimated by sequencing one or more preselected sets of polymorphic sequences, as described elsewhere in this application.
In addition to methods of sequencing sets of preselected polymorphic sequences as described elsewhere in this application, methods for quantifying fetal DNA in maternal plasma include, but are not limited to, real-time qPCR, mass spectrometry, digital PCR (including microfluidic digital PCR), capillary gel electrophoresis.
The discussion in this section begins with fetal fraction as determined from one or more polymorphisms, or from other information, on chromosomes or chromosome segments that do not have (or are determined not to have) a copy number variation. The fetal fraction determined by such techniques is referred to herein as the non-CNV fetal fraction or "NCNFF". A later part of this section describes various techniques for calculating fetal fraction from chromosomes or chromosome segments determined to possess copy number variations. The fetal fraction determined by such techniques is referred to herein as the CNV fetal fraction or "CNFF".
In some embodiments, the fetal fraction is assessed by determining the relative contribution of polymorphic alleles derived from the fetal genome and the contribution of the corresponding polymorphic alleles derived from the maternal genome. In some embodiments, the fetal fraction is assessed by determining the contribution of polymorphic alleles derived from the fetal genome relative to the total contribution of the corresponding polymorphic alleles derived from both the fetal and maternal genomes.
Polymorphisms can be indicative, informative, or both. An indicative polymorphism indicates the presence of fetal cell-free DNA ("cfDNA") in the maternal sample. Informative polymorphisms (e.g., informative SNPs) yield information about the fetus, e.g., the presence or absence of a disease, a genetic abnormality, or any other biological information, such as the stage of pregnancy or gender. In the present context, informative polymorphisms are those that identify differences between the sequences of the mother and the fetus and are used in the methods disclosed herein. In other words, an informative polymorphism is a polymorphism in a nucleic acid sample that possesses different sequences (i.e., different alleles), and these sequences are present in different amounts. In some methods herein, fetal fractions, particularly the NCNFF, are determined using the different amounts of these sequences/alleles.
Polymorphic sites include, but are not limited to, single nucleotide polymorphisms (SNPs), tandem SNPs, small-scale multi-base deletions or insertions (indels, also called deletion insertion polymorphisms or DIPs), multi-nucleotide polymorphisms (MNPs), short tandem repeats (STRs), restriction fragment length polymorphisms (RFLPs), or any polymorphism possessing any other allelic sequence variation in a chromosome. In some embodiments, each target nucleic acid comprises two tandem SNPs. Tandem SNPs are analyzed as a single unit (e.g., as short haplotypes) and are provided herein as sets of two SNPs.
In some embodiments, the fetal fraction is determined by statistical and approximation techniques that use polymorphic sites to assess the relative contributions of the fetal and maternal genomes. Fetal fraction can also be determined by electrophoretic methods, in which certain types of polymorphic sites are separated electrophoretically and used to identify the relative contribution of a polymorphic allele from the fetal genome and the relative contribution of the corresponding polymorphic allele from the maternal genome.
In one embodiment shown in the process flow diagram of fig. 6, the fetal fraction is determined by the method 600, the method 600 including first obtaining a test sample comprising a mixture of fetal and maternal nucleic acids in operation 610, enriching the nucleic acid mixture for polymorphic target nucleic acids in operation 620, sequencing the enriched nucleic acid mixture in operation 630, and simultaneously determining the fetal fraction and aneuploidy in the sample in operation 640.
Figure 7 shows a process flow diagram for some embodiments. In one embodiment, the fetal fraction is determined by: (i) obtaining a maternal plasma sample in operation 710, (ii) purifying cfDNA in the sample in operation 720, (iii) amplifying the polymorphic nucleic acids in operation 730, (iv) sequencing the mixture using massively parallel sequencing methods in operation 740, and (v) calculating the fetal fraction in operation 760. In another embodiment, the fetal fraction is determined by: (i) obtaining a maternal plasma sample in operation 710, (ii) purifying cfDNA in the sample in operation 720, (iii) amplifying the polymorphic nucleic acids in operation 730, (iv) separating the nucleic acids by size using electrophoresis in operation 750, and (v) calculating the fetal fraction in operation 770.
In one embodiment shown in the process flow diagram of fig. 8, the fetal fraction is determined by: (i) obtaining a sample comprising a mixture of fetal and maternal nucleic acids in operation 810, (ii) amplifying the sample in operation 820, (iii) enriching the sample by combining the amplified sample with an unamplified sample of the initial mixture in operation 830, (iv) purifying the sample in operation 840, (v) sequencing the sample in operation 850 to determine the fetal fraction, which may be done using different methods, and (vi) simultaneously determining the fetal fraction and the presence or absence of aneuploidy in operation 860.
In another embodiment shown in the process flow diagram of fig. 9, the fetal fraction is determined by: (i) obtaining a sample comprising a mixture of fetal and maternal nucleic acids in operation 910, (ii) purifying the sample in operation 920, (iii) amplifying a portion of the sample in operation 930, (iv) enriching the sample in operation 940 by combining the amplified sample with the purified but unamplified portion of the initial mixture, (v) sequencing the sample in operation 950 to determine the fetal fraction, which may be done using different methods, and (vi) simultaneously determining the fetal fraction and the presence or absence of aneuploidy in operation 960.
In another embodiment shown in the process flow diagram of fig. 10, the fetal fraction is determined by: (i) obtaining a sample comprising a mixture of fetal and maternal nucleic acids in operation 1010, (ii) purifying the sample in operation 1020, (iii) amplifying a first portion of the sample in operation 1040, (iv) preparing a sequencing library from the amplified portion of the sample in operation 1050, (v) preparing a sequencing library from a second, purified but unamplified portion of the sample in operation 1030, (vi) enriching the mixture by combining the two sequencing libraries in operation 1060, (vii) sequencing the mixture in operation 1070, and (viii) simultaneously determining the fetal fraction and the presence or absence of aneuploidy in operation 1080, which may be done using different methods.
In another embodiment, the fetal fraction is determined by: (i) obtaining a sample comprising a mixture of fetal and maternal nucleic acids, (ii) purifying the sample, (iii) amplifying the sample using labeled primers, and (iv) analyzing the sample using electrophoresis to determine the fetal fraction, which may be done using different methods.
In another embodiment, the fetal fraction is determined by: (i) obtaining a sample comprising a mixture of fetal and maternal nucleic acids, (ii) purifying the sample, (iii) optionally enriching the sample by amplifying a portion of the sample, and (iv) sequencing the sample to determine the fetal fraction, which may be done using different methods.
Purification of the initially obtained sample, the amplified sample, the amplified and enriched sample, or any other nucleic acid sample associated with the methods disclosed herein (e.g., in operations 720, 840, 920, and 1020) can be accomplished by any conventional technique. To isolate cfDNA from cells, fractionation, centrifugation (e.g., density gradient centrifugation), DNA-specific precipitation, high-throughput cell sorting, and/or other separation methods can be used. Optionally, the resulting sample may be fragmented prior to purification or amplification. If the sample comprises cfDNA, fragmentation may not be required because cfDNA is naturally fragmented, with fragment sizes often being about 150 bp to 200 bp.
In some of the above procedures, selective amplification and enrichment are used to increase the relative amount of nucleic acid from the regions in which the polymorphisms are located. Similar results can be obtained by deep sequencing of selected regions of the genome, particularly the regions in which the polymorphisms are located.
Amplification
After a sample has been obtained and purified, a portion of the purified mixture of fetal and maternal nucleic acids (e.g., cfDNA) is used to amplify a plurality of polymorphic target nucleic acids, each comprising a polymorphic site. In certain implementations, amplification of the target nucleic acids in the mixture of fetal and maternal nucleic acids is achieved by any method that uses PCR (polymerase chain reaction) or a variation of it, including but not limited to asymmetric PCR, helicase-dependent amplification, hot start PCR, qPCR, solid phase PCR, and touchdown PCR. In some embodiments, only a portion of the sample is amplified to assist in determining fetal fraction. In some embodiments, amplification is not performed. The disclosed amplification methods and other amplification techniques can be used in operations 730, 820, 930, and 1040.
Amplification of SNPs
A large number of nucleic acid primers are available for amplifying DNA fragments containing SNPs, and their sequences can be obtained, for example, from databases known to those of ordinary skill in the art. Additional primers can also be designed, for example using methods similar to those disclosed in Vieux, E.F., Kwok, P-Y., and Miller, R.D., BioTechniques (June 2002), Vol. 32, Supplement: "SNPs: Discovery of Marker Disease", pp. 28-32.
Sequence-specific primers are selected to amplify the target nucleic acid. In one embodiment, a target nucleic acid comprising one polymorphic site is amplified as an amplicon. In another embodiment, a target nucleic acid comprising two or more polymorphic sites (e.g., two tandem SNPs) is amplified as an amplicon. The amplified target nucleic acid, an amplicon of at least about 100 bp, comprises a single SNP or tandem SNPs. Primers used to amplify a target sequence comprising tandem SNPs are designed to encompass both SNP sites.
Amplification of STRs
Several nucleic acid primers are available for amplifying DNA fragments comprising STRs, and their sequences can be obtained from databases known to one skilled in the art.
In some embodiments, a portion of the fetal and maternal nucleic acid mixture is used as a template for amplification of target nucleic acids having at least one STR. A comprehensive list of references, facts, and sequence information about STRs, published PCR primers, common multiplex systems, and related population data is compiled in STRBase, which is accessible via the internet at cstl. Sequence information from GenBank (ncbi.nlm.nih.gov/genbank) for commonly used STR loci is also accessible via STRBase.
STR multiplex systems allow for the simultaneous amplification of multiple non-overlapping loci in a single reaction, thereby substantially increasing throughput. Because STRs are highly polymorphic, most individuals are heterozygous. STRs can be used in electrophoretic analysis as described further below.
Amplification can also be performed using miniSTRs to produce amplicons of reduced size, thereby distinguishing STR alleles that are shorter in length. The method of the disclosed embodiments encompasses determining the fraction of fetal nucleic acid in a maternal sample that has been enriched for target nucleic acids, each comprising one miniSTR, the method comprising quantifying at least one fetal and one maternal allele at one polymorphic miniSTR, which can be amplified to produce amplicons of a length of about the size of circulating fetal DNA fragments. Any pair of miniSTR primers or a combination of two or more pairs of miniSTR primers can be used to amplify at least one miniSTR.
Enrichment
The enriched sample may comprise: a plasma separated portion of the blood sample; a sample of purified cfDNA extracted from plasma; a sequencing library sample prepared from a purified mixture of fetal and maternal nucleic acids; and so on.
In certain embodiments, the sample comprising the mixture of DNA molecules is non-specifically enriched for the whole genome prior to sequencing the whole genome, i.e., whole genome amplification is performed prior to sequencing. By non-specifically enriching a nucleic acid mixture is meant whole genome amplification of genomic DNA fragments of a DNA sample that can be used to increase the level of sample DNA prior to identification of polymorphisms by sequencing. Non-specific enrichment may be selective enrichment of one of the two genomes (fetal and maternal) present in the sample.
In other embodiments, the cfDNA in the sample is specifically enriched. Specific enrichment refers to the enrichment of a genomic sample for a particular sequence (e.g., a polymorphic target sequence) by a method that includes specifically amplifying the target nucleic acid sequence, which includes the polymorphic site.
In other embodiments, the mixture of nucleic acids present in the sample is enriched for polymorphic target nucleic acids that each comprise a polymorphic site. Such enrichment may be used in operation 620. Enriching a mixture of fetal and maternal nucleic acids includes amplifying a target sequence from a portion of the nucleic acids contained in an initial maternal sample and combining a portion or the entire amplification product with the remainder of the initial maternal sample, e.g., in operations 830 and 940.
In yet another embodiment, the enriched sample is a sequencing library sample prepared from a purified mixture of fetal and maternal nucleic acids. The amount of amplification product used to enrich the initial sample is selected to obtain sufficient sequence information for determining fetal fraction. At least about 3%, at least about 5%, at least about 7%, at least about 10%, at least about 15%, at least about 20%, at least about 25%, at least about 30%, or more of the total number of sequence tags obtained from sequencing are mapped to determine the fetal fraction.
In one embodiment, shown in fig. 10, enriching comprises amplifying, in operation 1040, the target nucleic acids contained in a portion of an initial sample of a purified mixture of fetal and maternal nucleic acids (e.g., cfDNA that has been purified from a maternal plasma sample). In operation 1050, a sequencing library of interest is prepared from the amplified product. Similarly, in operation 1030, a primary sequencing library is prepared using a portion of the purified but unamplified cfDNA. In operation 1060, a portion of the library of interest is combined with the primary library generated from the unamplified mixture of nucleic acids, and the mixture of fetal and maternal nucleic acids contained in the two libraries is sequenced in operation 1070. The enriched library may comprise at least about 5%, at least about 10%, at least about 15%, at least about 20%, or at least about 25% of the library of interest. In operation 1080, data from the sequencing run is analyzed, and the fetal fraction and the presence or absence of aneuploidy are determined simultaneously, as described for operation 640 of the embodiment depicted in fig. 6.
Sequencing technology
The enriched mixture of fetal and maternal nucleic acids is sequenced. The sequence information necessary to determine the fetal fraction can be obtained using any known DNA sequencing method, many of which are described elsewhere in this application. Such sequencing methods include Next Generation Sequencing (NGS), Sanger sequencing, Helicos True Single Molecule Sequencing (tSMS™), 454 sequencing (Roche), SOLiD technology (Applied Biosystems), Single Molecule Real-Time (SMRT™) sequencing technology (Pacific Biosciences), nanopore sequencing, chemical-sensitive field effect transistor (chemFET) arrays, the Halcyon Molecular method using transmission electron microscopy (TEM), Ion Torrent single molecule sequencing, sequencing by hybridization, and the like. In certain embodiments, massively parallel sequencing is employed. In one embodiment, Illumina's sequencing-by-synthesis and reversible terminator-based sequencing chemistry is used. In certain embodiments, partial sequencing is used.
The sequenced DNA is mapped to a reference genome. The reference genome may be an artificial genome or a human reference sequence genome. Such reference genomes include: an artificial target sequence genome comprising the polymorphic target nucleic acid sequences; an artificial SNP reference genome; an artificial STR reference genome; an artificial tandem STR reference genome; the human reference sequence genome NCBI36/hg18, available on the internet at genome.ucsc.edu/cgi-bin/hgGateway?org=Human&db=hg18&hgsid=166260105; and the human reference sequence genome NCBI36/hg18 combined with an artificial target sequence genome, e.g., a SNP genome, comprising the polymorphic sequences of interest. Some mismatches are allowed in the mapping process.
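The tolerance for mismatches during mapping can be illustrated with a short Python sketch. It is not the alignment method of any particular tool: it assumes a brute-force Hamming-distance scan against a small artificial SNP reference, and the reference names, the sequences, and the `max_mismatches` parameter are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def map_read(read, references, max_mismatches=2):
    """Return (reference name, offset, mismatches) of the best hit, or None."""
    best = None
    for name, seq in references.items():
        for start in range(len(seq) - len(read) + 1):
            d = hamming(read, seq[start:start + len(read)])
            # Keep the hit with the fewest mismatches within the allowed limit.
            if d <= max_mismatches and (best is None or d < best[2]):
                best = (name, start, d)
    return best

# Hypothetical artificial SNP reference: two alleles differing at offset 11.
references = {
    "snp1_alleleC": "TTGACCGATAGCTAGGCTTAACG",
    "snp1_alleleT": "TTGACCGATAGTTAGGCTTAACG",
}
read = "CAATAGCTAGGC"  # covers the SNP and carries one sequencing error
hit = map_read(read, references)
print(hit)
```

In this toy case the read is assigned to the C allele with one mismatch, because the T allele would require two. A production aligner would use an index rather than an exhaustive scan.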
In one embodiment, the sequencing information obtained in operation 630 is analyzed, and the fetal fraction and the presence or absence of aneuploidy are determined simultaneously.
As explained above, multiple sequence tags are obtained for each sample. In certain embodiments, mapping reads to a reference genome yields, for each sample, at least about 3 x 10^6 sequence tags, at least about 5 x 10^6 sequence tags, at least about 8 x 10^6 sequence tags, at least about 10 x 10^6 sequence tags, at least about 15 x 10^6 sequence tags, at least about 20 x 10^6 sequence tags, at least about 30 x 10^6 sequence tags, at least about 40 x 10^6 sequence tags, or at least about 50 x 10^6 sequence tags comprising reads of between 20 bp and 40 bp. In one embodiment, all of the sequence reads are mapped to all regions of the reference genome. In one embodiment, tags comprising reads that have been mapped to all regions of the human reference sequence genome (e.g., all chromosomes) are counted, and fetal aneuploidy, i.e., the over-representation or under-representation of a sequence of interest (e.g., a chromosome or portion thereof) in the mixed DNA sample, is determined; tags comprising reads mapped to the artificial target sequence genome are counted to determine the fetal fraction. The method does not require distinguishing between the maternal and fetal genomes.
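The two tallies described above, one over the human reference chromosomes and one over the artificial target sequence genome, can be sketched in Python. This is only an illustration under assumed conventions: each mapped tag is represented by the name of the reference it mapped to, and artificial target references are marked with a hypothetical `target:` prefix.

```python
from collections import Counter

def split_tag_counts(mapped_refs, artificial_prefix="target:"):
    """Tally mapped tags per chromosome and per artificial target sequence."""
    chrom_counts, target_counts = Counter(), Counter()
    for ref in mapped_refs:
        if ref.startswith(artificial_prefix):
            # Tag mapped to the artificial target genome: used for fetal fraction.
            target_counts[ref[len(artificial_prefix):]] += 1
        else:
            # Tag mapped to the human reference: used for aneuploidy assessment.
            chrom_counts[ref] += 1
    return chrom_counts, target_counts

# Hypothetical mapping results for six tags.
mapped = ["chr1", "chr21", "target:snp1/C", "chr21", "target:snp1/T", "target:snp1/C"]
chrom_counts, target_counts = split_tag_counts(mapped)
print(chrom_counts["chr21"], target_counts["snp1/C"])
```

Keeping the two counters separate mirrors the text: the same sequencing run feeds both the aneuploidy determination and the fetal fraction estimate.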
In one embodiment, data from the sequencing run is analyzed and the fetal fraction, and the presence or absence of aneuploidy, is determined simultaneously.
Sequencing libraries
In some embodiments, a portion or all of the amplified polymorphic sequences are used to prepare a sequencing library for sequencing in a parallel fashion. In one embodiment, libraries are prepared for sequencing-by-synthesis using Illumina's reversible terminator-based sequencing chemistry. Libraries can be prepared from purified cfDNA and include at least about 10%, at least about 15%, at least about 20%, at least about 25%, at least about 30%, at least about 35%, at least about 40%, at least about 45%, or at least about 50% amplification products.
Sequencing the library generated by either of the methods depicted in figure 11 provides sequence tags derived from amplified target nucleic acids and tags derived from the original unamplified maternal sample. The fetal fraction is calculated from the number of tags mapped to the artificial reference genome.
Calculating fetal fraction
As explained, after sequencing the DNA of interest, computational methods can be used to map or align sequences to specific genes, chromosomes, alleles, or other structures. A variety of computer algorithms exist for aligning sequences, including, but not limited to, BLAST (Altschul et al., 1990), BLITZ (MPsrch) (Sturrock & Collins, 1993), FASTA (Pearson & Lipman, 1988), BOWTIE (Langmead et al., Genome Biology 10:R25.1-R25.10 [2009]), and ELAND (Illumina, San Diego, Calif., USA). In some embodiments, the sequences of the data bins are found in nucleic acid databases known to those of skill in the art, including GenBank, dbEST, dbSTS, EMBL (European Molecular Biology Laboratory), and DDBJ (DNA Databank of Japan). The identified sequences can be searched against a sequence database using BLAST or similar tools, and the search hits can be used to sort the identified sequences into the appropriate data bins. Alternatively, reads can be aligned to the reference genome using a Bloom filter or a similar set membership tester. See U.S. patent application No. 61/552,374, filed on October 27, 2011, which is incorporated by reference herein in its entirety.
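The set-membership idea behind a Bloom filter can be sketched in Python as follows. This is a generic illustration, not the method of the cited application: it assumes that reference k-mers are inserted into the filter and that a read is tested by looking up its k-mers. The class, the toy reference, and the parameter values are hypothetical. A Bloom filter answers only "possibly present" or "definitely absent"; it never produces false negatives, but it can produce false positives, and it does not report a position.

```python
import hashlib

class BloomFilter:
    """Fixed-size bit array probed by several hash functions."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

def kmers(seq, k):
    """Yield every length-k substring of seq."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

reference = "ACGTTGCATGCCGATAGCTAGGCTTAACGT"  # hypothetical toy reference
k = 12
bf = BloomFilter()
for km in kmers(reference, k):
    bf.add(km)

print("GCATGCCGATAG" in bf)  # True: this 12-mer occurs in the reference
```

The filter trades a small false-positive rate for constant-time lookups and a compact memory footprint, which is why it suits screening very large numbers of short reads.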
As mentioned, determining the fetal fraction according to some embodiments (particularly the NCNFF technique) is based on the total number of tags mapped to a first allele and the total number of tags mapped to a second allele at an informative polymorphic site (e.g., a SNP) contained in the reference genome. Informative polymorphic sites are identified by the difference in the allelic sequences and the amount of each of the possible alleles. Fetal cfDNA is often present at a concentration of <10% of the maternal cfDNA. Thus, a minor contribution of an allele to the mixture of fetal and maternal nucleic acids, relative to the major contribution of the maternal allele, can be assigned to the fetus. Alleles derived from the maternal genome are referred to herein as major alleles, and alleles derived from the fetal genome are referred to herein as minor alleles. Alleles represented by similar levels of mapped sequence tags represent maternal alleles. The results of an exemplary multiplex amplification of target nucleic acids comprising SNPs derived from a maternal plasma sample are shown in fig. 12.
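The distinction drawn above between alleles at similar tag levels (both maternal) and a small contribution attributable to the fetus can be sketched in Python. This is a simplified illustration only: the two thresholds are assumptions chosen for the example and do not come from the text, and the site names and tag counts are hypothetical.

```python
def classify_site(count_a, count_b, min_minor_tags=10, max_minor_ratio=0.25):
    """Label a biallelic site from the tag counts of its two alleles.

    Both thresholds are illustrative assumptions, not values from the text.
    """
    major, minor = max(count_a, count_b), min(count_a, count_b)
    if major == 0 or minor < min_minor_tags:
        return "uninformative"          # no second allele reliably observed
    if minor / major > max_minor_ratio:
        return "maternal heterozygous"  # similar tag levels: both alleles maternal
    return "informative"                # small contribution assigned to the fetus

# Hypothetical allele tag counts at three sites.
sites = {"snp1": (980, 45), "snp2": (510, 495), "snp3": (1020, 0)}
labels = {name: classify_site(a, b) for name, (a, b) in sites.items()}
print(labels)
```

Only sites labeled informative under this scheme would contribute to an allele-ratio estimate of fetal fraction.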
As used herein, the terms "chromosomal aneuploidy" and "complete chromosomal aneuploidy" refer to an imbalance of genetic material caused by the loss or gain of an entire chromosome, and include germline aneuploidy and mosaic aneuploidy. The terms "partial aneuploidy" and "partial chromosomal aneuploidy" refer to an imbalance of genetic material caused by the loss or gain of a portion of a chromosome (e.g., partial monosomy and partial trisomy), and encompass imbalances resulting from translocations, deletions, and insertions.
Estimating fetal fraction using allele ratios
For each of the two alleles at a predetermined polymorphic site, the relative abundance of fetal cfDNA in the maternal sample can be determined as a parameter of the total number of unique sequence tags mapped to the target nucleic acid sequence on the reference genome. In one embodiment, the fraction of fetal nucleic acid in the mixture of fetal and maternal nucleic acids is calculated for each informative allele (allele x) as follows:
and the fetal fraction for the sample is calculated as the average of the fetal fractions for all informative alleles. Optionally, for each informative allele (allele x), the fraction of fetal nucleic acid in the mixture of fetal and maternal nucleic acids is calculated as follows:
to compensate for the presence of two fetal alleles, one of which is masked by the maternal background.
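An allele-ratio estimate of this kind can be sketched in Python. This is a minimal illustration rather than the disclosed implementation. It assumes that the per-allele fetal fraction is the number of tags mapped to the minor (fetal) allele divided by the number mapped to the major (maternal) allele, expressed as a percentage, and that the optional variant doubles the minor count to account for the fetal allele masked by the maternal background. The function names and the tag counts are hypothetical.

```python
from statistics import mean

def allele_fetal_fraction(minor_tags, major_tags, compensate=False):
    """Percent fetal fraction for one informative allele (assumed ratio form)."""
    if major_tags <= 0:
        raise ValueError("major (maternal) tag count must be positive")
    # Optional factor of 2 for the fetal allele hidden in the maternal background.
    factor = 2 if compensate else 1
    return 100.0 * factor * minor_tags / major_tags

def sample_fetal_fraction(counts, compensate=False):
    """Average the per-allele estimates over all informative alleles."""
    return mean(allele_fetal_fraction(mi, ma, compensate) for mi, ma in counts)

# Hypothetical (minor, major) tag counts at three informative SNPs.
counts = [(50, 1000), (40, 1000), (60, 1000)]
print(sample_fetal_fraction(counts))        # 5.0
print(sample_fetal_fraction(counts, True))  # 10.0
```

Averaging across informative alleles reduces the effect of sampling noise at any single site.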
Determination of fetal fraction by sequencing a predetermined polymorphic sequence
More details on the determination of fetal fraction by sequencing a predetermined polymorphic sequence are provided below.
Referring to fig. 7, operations 710, 720, 730, 740, and 760 show a process flow for determining the fraction of fetal nucleic acid in a maternal biological sample by massively parallel sequencing of PCR-amplified polymorphic target nucleic acids. In step 710, a maternal sample comprising a mixture of fetal and maternal nucleic acids is obtained from a subject. The sample is a maternal sample obtained from a pregnant female (e.g., a pregnant woman). Other maternal samples may be from mammals such as cows, horses, dogs, or cats. If the subject is a human, the sample may be taken during the first or second trimester of pregnancy. Any maternal biological sample can be used as a source of fetal and maternal nucleic acids, whether contained in cells or cell-free. In certain embodiments, it is advantageous to obtain a maternal sample comprising cell-free nucleic acids (cfDNA). Preferably, the maternal biological sample is a biological fluid sample. Preferably, the maternal sample is selected from the group consisting of blood, plasma, serum, urine, and saliva. In certain embodiments, the maternal sample is a plasma sample.
In step 720, the mixture of fetal and maternal nucleic acids is further processed from a sample fraction, such as plasma, to obtain a sample comprising a purified mixture of fetal and maternal nucleic acids (e.g., cfDNA). Methods for processing a maternal sample are described elsewhere herein.
In step 730, a portion of the purified mixture of fetal and maternal cfDNA is used to amplify a plurality of polymorphic target nucleic acids, each of which comprises a polymorphic site. In certain embodiments, each of these target nucleic acids comprises a SNP. In other embodiments, each of these target nucleic acids comprises a pair of tandem SNPs. In still other embodiments, each target nucleic acid comprises an STR. Polymorphic sites included in a target nucleic acid include, but are not limited to, single nucleotide polymorphisms (SNPs), tandem SNPs, small-scale multi-base deletions or insertions (indels, also known as deletion insertion polymorphisms or DIPs), multi-nucleotide polymorphisms (MNPs), short tandem repeats (STRs), restriction fragment length polymorphisms (RFLPs), or polymorphisms including any other sequence variation in the chromosome. In certain embodiments, the polymorphic sites encompassed by the methods are located on an autosome, thereby enabling the determination of fetal fraction independent of fetal gender. Polymorphisms associated with chromosomes other than chromosomes 13, 18, 21, and Y can also be used in the methods described herein.
Polymorphisms can be indicative, informative, or both. An indicative polymorphism indicates the presence of fetal cell-free DNA in the maternal sample. For example, the more there is of a particular genetic sequence (e.g., a SNP), the more a method will translate its presence into a particular color intensity, color density, or some other property that can be detected and measured and that indicates the presence, absence, and/or amount of a particular DNA fragment and/or a particular polymorphism (e.g., a SNP of the fetus). In connection with the present invention, these methods are performed not with every possible SNP in a genome, but with preselected polymorphisms (i.e., informative polymorphisms) that are likely to identify sequence differences between mother and fetus. Informative polymorphic sites are identified by the difference in the sequence of the alleles and the amount of each of the possible alleles. Any polymorphic site encompassed by a read generated by the sequencing methods described herein can be used to determine fetal fraction.
A portion of the mixture of fetal and maternal nucleic acids (e.g., cfDNA) in the sample is used as a template for amplification of target nucleic acids comprising at least one SNP. In certain embodiments, each target nucleic acid comprises a single (i.e., one) SNP. Target nucleic acid sequences comprising SNPs may be obtained from publicly accessible databases including, but not limited to, the Human SNP Database at web address wi.mit.edu, the NCBI dbSNP home page at web address ncbi.nlm.nih.gov, Applied Biosystems by Life Technologies™, and the Celera human SNP database at world wide web address Celera. In one embodiment, the SNPs chosen to enrich the fetal and maternal cfDNA are selected from the group of 92 individual identification SNPs (IISNPs) described by Pakstis et al. (Pakstis et al., Hum Genet 127), which have been shown to have very small variation in frequency across populations (Fst < 0.06) and to be highly informative worldwide, with an average heterozygosity of > 0.4. SNPs encompassed by the methods of the invention include both linked and unlinked SNPs. Other useful SNPs that may be applied or adapted for use in the methods described herein are disclosed in U.S. patent application nos. 20080070792, 20090280492, 20080113358, 20080026390, 20080050739, 20080220422, and 20080138809, which are incorporated herein by reference in their entirety.
Each target nucleic acid comprises at least one polymorphic site, e.g., a single SNP, that differs from the polymorphic site present on another target nucleic acid, thereby generating a set of polymorphic sites, e.g., SNPs, that contains a sufficient number of informative polymorphic sites: at least 1, at least 2, at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, at least 10, at least 11, at least 12, at least 13, at least 14, at least 15, at least 16, at least 17, at least 18, at least 19, at least 20, at least 25, at least 30, at least 40, or more. For example, a set of SNPs may be configured to include at least one informative SNP. In one embodiment, the SNP targeted for amplification is selected from rs560681, rs1109037, rs9866013, rs13182883, rs13218440, rs7041158, rs740598, rs10773760, rs4530059, rs7205345, rs8078417, rs576261, rs2567608, rs430046, rs9951171, rs338882, rs10776839, rs25 zxft 3725, rs1277284, rs258684, rs1347696, rs508485, rs 6258, 62626262xxft 3758, zxft 6258, and rs 6258 zxft 58. In one embodiment, the set of SNPs includes at least 3, at least 5, at least 10, at least 13, at least 15, at least 20, at least 25, at least 30, or more SNPs. In one embodiment, the set of SNPs includes rs560681, rs1109037, rs9866013, rs13182883, rs13218440, rs7041158, rs740598, rs10773760, rs4530059, rs7205345, rs8078417, rs576261, and rs2567608. Polymorphic nucleic acids comprising SNPs can be amplified using the exemplary primer pairs provided in Example 24 and disclosed as SEQ ID NOs: 63-118.
In other embodiments, each target nucleic acid comprises two or more SNPs, i.e., each target nucleic acid comprises tandem SNPs. Preferably, each target nucleic acid comprises two tandem SNPs. Tandem SNPs are analyzed as a single unit (e.g., as a short haplotype), and are provided herein as sets of two SNPs. To identify suitable tandem SNP sequences, the International HapMap Consortium database (The International HapMap Project, Nature 426:789-796 [2003]) may be searched. The database is available on the world wide web at hapmap. In one embodiment, the tandem SNPs targeted for amplification are selected from the following sets of tandem SNP pairs: rs7277033-rs2110153; rs2822654-rs1882882; rs368657-rs376635; rs2822731-rs2822732; rs1475881-rs7275487; rs1735976-rs2827016; rs447340-rs2824097; rs418989-rs13047336; rs987980-rs987981; rs4143392-rs4143391; rs1691324-rs13050434; rs11909758-rs9980111; rs2826842-rs232414; rs1980969-rs1980970; rs9978999-rs9979175; rs1034346-rs12481852; rs7509629-rs2828358; rs4817013-rs7277036; rs9981121-rs2829696; rs455921-rs2898102; rs2898102-rs458848; rs961301-rs2830208; rs2174536-rs458076; rs11088023-rs11088024; rs1011734-rs1011733; rs2831244-rs9789838; rs8132769-rs2831440; rs8134080-rs2831524; rs4817219-rs4817220; rs2250911-rs2250997; rs2831899-rs2831900; rs2831902-rs2831903; rs11088086-rs2251447; rs2832040-rs11088088; rs2832141-rs2246777; rs2832959-rs9980934; rs2833734-rs2833735; rs933121-rs933122; rs2834140-rs12626953; rs2834485-rs3453; rs9974986-rs2834703; rs2776266-rs2835001; rs1984014-rs1984015; rs7281674-rs2835316; rs13047304-rs13047322; rs2835545-rs4816551; rs2835735-rs2835736; rs13047608-rs2835826; rs2836550-rs2212596; rs2836660-rs2836661; rs465612-rs8131220; rs9980072-rs8130031; rs418359-rs2836926; rs7278447-rs7278858; rs385787-rs367001; rs367001-rs386095; rs2837296-rs2837297; and rs2837381-rs4816672.
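Treating a pair of tandem SNPs as a single unit amounts to counting the two-base haplotype carried by each read. A minimal Python sketch follows. It assumes that every read is already aligned to the start of the amplicon and that the offsets of the two SNPs within the amplicon are known; the reads and offsets shown are hypothetical.

```python
from collections import Counter

def haplotype_counts(reads, pos1, pos2):
    """Count two-base haplotypes at offsets pos1 and pos2 of amplicon-aligned reads."""
    # Reads too short to cover both SNP offsets are skipped.
    return Counter(r[pos1] + r[pos2] for r in reads if len(r) > max(pos1, pos2))

# Hypothetical reads; the tandem SNPs sit at offsets 1 and 4.
reads = ["ACGTAC", "ACGTAC", "ACGTAC", "ATGTGC"]
haps = haplotype_counts(reads, 1, 4)
print(haps)
```

Here the haplotype CA is seen three times and TG once; in a maternal sample, a low-count haplotype of this kind is the candidate fetal contribution.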
In one embodiment, a portion of the mixture of fetal and maternal nucleic acids (e.g., cfDNA) in the sample is used as a template for amplifying target nucleic acids that each comprise at least one STR. In certain embodiments, each target nucleic acid comprises a single (i.e., one) STR. STR loci are found on almost every chromosome in the genome and can be amplified using a variety of polymerase chain reaction (PCR) primers. Tetranucleotide repeats are preferred among forensic scientists because of their fidelity in PCR amplification, although certain trinucleotide and pentanucleotide repeats are also used. Detailed tabulations of references, facts, and sequence information about STRs, published PCR primers, common multiplex systems, and related population data are compiled in STRBase, which is accessible via the world wide web at ibm4.carb.nist.gov:8800/dna/home.htm. Sequence information about commonly used STR loci from GenBank (http://www2.ncbi.nlm.nih.gov/cgi-bin/genbank) can also be obtained through STRBase. Commercial kits for analyzing STR loci typically provide all the reaction components and controls needed for amplification. STR multiplex systems allow the simultaneous amplification of multiple non-overlapping loci in a single reaction, which substantially increases throughput. Even overlapping loci can be multiplexed using multicolor fluorescence detection. The polymorphism of tandemly repeated DNA sequences, which are widespread throughout the human genome, makes these sequences important genetic markers for gene mapping studies, linkage analysis, and human identity testing. Because STRs are highly polymorphic, most individuals will be heterozygous, i.e., most people possess two alleles (versions), one inherited from each parent, each with a different number of repeats. PCR products comprising STRs can be separated and detected using manual, semi-automated, or automated methods.
Semi-automated systems are gel-based and combine electrophoresis, detection, and analysis into one unit. On a semi-automated system, gel assembly and sample loading remain a manual process; however, once the samples are loaded onto the gel, electrophoresis, detection, and analysis proceed automatically. Data collection occurs "in real time" as the fluorescently labeled fragments migrate past the detector at a fixed point, and the data can be viewed as they are collected. As the name implies, capillary electrophoresis is carried out in a microcapillary tube rather than between glass plates. Once the samples, gel polymer, and buffer are loaded onto the instrument, the capillary is filled with gel polymer and the sample is loaded automatically. A fetal STR sequence that is not maternally inherited will differ in the number of repeats from the maternal sequence. Amplification of these STR sequences may produce one or two major amplification products corresponding to the maternal alleles (and the maternally inherited fetal allele) and one minor product corresponding to the non-maternally inherited fetal allele. This technique was first reported in 2000 (Pertl et al., Human Genetics 106) and was subsequently developed using real-time PCR to simultaneously identify multiple different STR regions (Liu et al., Acta Obstet Gynecol Scand 86). PCR amplicons of various sizes have been used to discern the respective size distributions of circulating fetal and maternal DNA species, and it has been shown that the fetal DNA molecules in the plasma of pregnant women are generally shorter than the maternal DNA molecules (Chan et al., Clin Chem 50). Size fractionation of circulating fetal DNA has demonstrated that the average length of circulating fetal DNA fragments is <300 bp, while maternal DNA has been estimated to be between about 0.5 kb and 1 kb (Li et al., Clin Chem 50).
The present invention provides a method for determining the fraction of fetal nucleic acid in a maternal sample, the method comprising determining the copy number of at least one fetal and one maternal allele at a polymorphic miniSTR site, where the miniSTR can be amplified to produce amplicons of a length approximately the size of circulating fetal DNA fragments (e.g., less than about 250 base pairs). In one embodiment, the fetal fraction may be determined by a method comprising sequencing at least a portion of amplified polymorphic target nucleic acids, each target nucleic acid comprising a miniSTR. Fetal and maternal alleles at an informative STR locus are distinguished by their different lengths, i.e., numbers of repeats, and the fetal fraction can be calculated as the percent ratio of the amount of fetal to maternal alleles at that site. The method may use one informative miniSTR or a combination of any number of informative miniSTRs to determine the fraction of fetal nucleic acid. In one embodiment, the method includes determining the copy number of at least one fetal and at least one maternal allele of at least one polymorphic miniSTR that is amplified to produce amplicons of less than about 300bp, less than about 250bp, less than about 200bp, less than about 150bp, less than about 100bp, or less than about 50bp. In another embodiment, the amplicons produced by amplification of the miniSTR are less than about 300bp. In another embodiment, the amplicons produced by amplification of the miniSTR are less than about 250bp. In another embodiment, the amplicons produced by amplification of the miniSTR are less than about 200bp.
Amplification of informative alleles includes the use of miniSTR primers, which allow amplification of reduced-size amplicons to detect STR alleles of less than about 500bp, less than about 450bp, less than about 400bp, less than about 350bp, less than about 300bp, less than about 250bp, less than about 200bp, less than about 150bp, less than about 100bp, or less than about 50bp. The reduced-size amplicons generated using miniSTR primers are known as miniSTRs, and they are identified by the marker names corresponding to the loci to which they have been mapped. In one embodiment, the miniSTR primers include those that have allowed the maximum reduction in amplicon size for all 13 CODIS STR loci, in addition to the D2S1338, Penta D, and Penta E loci found in commercially available STR kits (Butler et al., J Forensic Sci 48:1054-1064 [2003]); miniSTR loci that are not linked to the CODIS markers, as described by Coble and Butler (Coble and Butler, J Forensic Sci 50:43-53 [2005]); and other miniSTRs that have been characterized at NIST. Information about the miniSTRs characterized at NIST is available on the world wide web at cstl. Any pair of miniSTR primers, or a combination of two or more pairs of miniSTR primers, can be used to amplify at least one miniSTR.
Amplification of the target nucleic acids in a mixture of fetal and maternal nucleic acids (e.g., cfDNA) is achieved by any method that uses PCR or a variation of it, as described elsewhere in this application. Amplification of these target sequences is achieved in a multiplex PCR reaction using primer pairs that are each capable of amplifying one target nucleic acid sequence that includes a polymorphic site (e.g., a SNP). The multiplex PCR reaction includes combining at least 2, at least 3, at least 5, at least 10, at least 15, at least 20, at least 25, at least 30, at least 40, or more primer sets in the same reaction to quantify amplified target nucleic acids that include at least 2, at least 3, at least 5, at least 10, at least 15, at least 20, at least 25, at least 30, at least 40, or more polymorphic sites in the same sequencing reaction. Any subset of the primer sets can be configured to amplify at least one informative polymorphic sequence.
The primers are designed to hybridize to a sequence close to the SNP site on the cfDNA to ensure that the SNP site is included within the length of the read generated by the sequencer. As provided in the examples, at least one of the two primers in the primer set used to identify any one polymorphic site hybridizes sufficiently close to the polymorphic site to encompass it within a 36bp read generated by massively parallel sequencing on an Illumina Genome Analyzer GII, and to generate an amplicon of sufficient length for bridge amplification during cluster formation. Thus, the primers were designed to produce amplicons of at least 110bp that, when combined with the universal adaptors for cluster amplification (Illumina Inc., San Diego, CA), produce DNA molecules of at least 200bp. The SNPs given in Table 33 were used to amplify 13 target sequences simultaneously in one multiplex assay. The panel provided in Table 33 is an exemplary SNP panel. Fewer or more SNPs may be employed to enrich the fetal and maternal DNA for polymorphic target nucleic acids. Additional SNPs that may be used include the SNPs given in Table 34. The SNP alleles are shown in bold and underlined. Other SNPs that may be used for determining the fetal fraction according to the method of the invention include rs315791, rs3780962, rs1410059, rs279844, rs38882, rs9951171, rs214955, rs6444724, rs2503107, rs1019029, rs1413212, rs1031825, rs891700, rs1005533, rs2831700, rs354439, rs1979255, rs1454361, rs8037429 and rs1490413. These SNPs have been analyzed by TaqMan PCR for determining fetal fraction and are disclosed in U.S. patent application publication 2010-0010085.
The forward or reverse primer in each primer set hybridizes to a DNA sequence sufficiently close to the polymorphic site to be included in sequence reads generated by the massively parallel sequencing of amplified preselected polymorphic nucleic acids. The length of the sequence reads is related to the particular sequencing technique. Massively parallel sequencing methods provide sequence reads that vary in size from tens to hundreds of base pairs. At least one primer in each primer set is designed to recognize a polymorphic site present within a sequence read of 20bp, about 25bp, about 30bp, about 35bp, about 40bp, about 45bp, about 50bp, about 55bp, about 60bp, about 65bp, about 70bp, about 75bp, about 80bp, about 85bp, about 90bp, about 95bp, about 100bp, about 110bp, about 120bp, about 130bp, about 140bp, about 150bp, about 200bp, about 250bp, about 300bp, about 350bp, about 400bp, about 450bp, or about 500 bp. In certain embodiments, at least one primer in each of the primer sets is designed to recognize a polymorphic site present within a sequence read of about 25bp, about 40bp, about 50bp, or about 100 bp.
Circulating cell-free DNA is approximately <300bp. Thus, the primer set is designed to hybridize to and amplify polymorphic sequences up to about 300bp in length on average, where fetal DNA is about 170bp in length on average. In certain embodiments, the primer set hybridizes to DNA to produce amplicons of up to about 300bp. In other embodiments, the primer set hybridizes to the DNA sequence to produce amplicons of at least about 100bp, at least about 150bp, at least about 200 bp. The primer set may hybridize to DNA sequences present on the same chromosome or to DNA sequences present on a different chromosome. For example, one or more primer sets may hybridize to sequences present on the same chromosome. Alternatively, two or more primer sets hybridize to sequences present on different chromosomes. In one embodiment, the primer pair amplifies a polymorphic sequence present on one or more of chromosomes 1 through 22. In certain embodiments, the primer set does not hybridize to a DNA sequence present on chromosome 13, 18, 21, X, or Y.
In step 740 (FIG. 7), a portion or all of the amplified polymorphic sequences is used to prepare a sequencing library for sequencing in a parallel fashion. In one embodiment, the library is prepared for sequencing-by-synthesis using Illumina's reversible terminator-based sequencing chemistry.
In step 740, the sequence information needed to determine the fetal fraction is obtained using any known DNA sequencing method. Preferably, the methods described herein employ next generation sequencing (NGS) technology to provide countable sequence tags, as described elsewhere in this application. The sequencing may be massively parallel sequencing by synthesis, preferably using reversible dye terminators. Alternatively, it may be massively parallel sequencing by ligation, or single molecule sequencing.
The amplified target polymorphic nucleic acids are partially sequenced, and sequence tags comprising reads of a predetermined length (e.g., 36 bp) that map to a known reference genome are counted. Only sequence reads that align uniquely to the reference genome are counted as sequence tags. In one embodiment, the reference genome is an artificial target sequence genome comprising the sequences of the polymorphic target nucleic acids (e.g., SNPs). In one embodiment, the reference genome is an artificial SNP reference genome. In another embodiment, the reference genome is an artificial STR reference genome. In yet another embodiment, the reference genome is an artificial tandem STR reference genome. The artificial reference genome can be compiled from the target polymorphic nucleic acid sequences. The artificial reference genome can include polymorphic target sequences that each comprise one or more different types of polymorphic sequences. For example, an artificial reference genome may include polymorphic sequences comprising SNP alleles and/or STRs. In one embodiment, the reference genome is the human reference genome NCBI36/hg18 sequence, which is available on the world wide web at genome.ucsc.edu/cgi-bin/hgGateway?org=Human&db=hg18&hgsid=166260105. Other public sources of sequence information include GenBank, dbEST, dbSTS, EMBL (the European Molecular Biology Laboratory), and DDBJ (the DNA Databank of Japan). In another embodiment, the reference genome comprises the human reference genome NCBI36/hg18 sequence and an artificial target sequence genome comprising the target polymorphic sequences, e.g., a SNP genome. Mapping of the sequence tags is achieved by comparing the sequence of each mapped tag with the sequence of the reference genome to determine the chromosomal origin of the sequenced nucleic acid (e.g., cfDNA) molecule; no specific genetic sequence information is required.
A variety of computer algorithms can be used to align sequences, including, without limitation, BLAST (Altschul et al., 1990), BLITZ (MPsrch) (Sturrock & Collins, 1993), FASTA (Pearson & Lipman, 1988), BOWTIE (Langmead et al., Genome Biology 10:R25.1-R25.10 [2009]), or ELAND (Illumina, Inc., San Diego, CA, USA). In one embodiment, one end of the clonally amplified copies of the plasma cfDNA molecules is sequenced and processed by bioinformatic alignment analysis for the Illumina Genome Analyzer, which uses the Efficient Large-Scale Alignment of Nucleotide Databases (ELAND) software. In embodiments that use NGS sequencing to determine both the presence or absence of an aneuploidy and the fetal fraction, analysis of the sequencing information to determine aneuploidy may allow a small degree of mismatch (0 to 2 mismatches per sequence tag) to account for minor polymorphisms that may exist between the reference genome and the genomes in the mixed sample. Analysis of the sequencing information to determine fetal fraction may allow a small degree of mismatch, depending on the polymorphic sequence. For example, a small degree of mismatch may be allowed if the polymorphic sequence is an STR. Where the polymorphic sequence is a SNP, all sequences that exactly match either of the two alleles at the SNP site are counted first and filtered from the remaining reads, for which a small degree of mismatch may be allowed. Quantification of the number of sequence reads aligned to each chromosome for determining chromosomal aneuploidy can be performed as described herein, or using alternative analyses that normalize the median number of sequence tags for the chromosome of interest to the median number of tags for each of the other autosomes (Fan et al., Proc Natl Acad Sci 105).
A "z-score" is generated to represent the difference between the percent genomic representation of the chromosome of interest and the mean percent representation of the same chromosome in a euploid control group, divided by the standard deviation (Chiu et al., Clin Chem 56). In another embodiment, the sequencing information may be analyzed as described in U.S. provisional patent application No. 32047-768.101, entitled "normalized biological test," filed on January 19, 2010, which is incorporated herein by reference in its entirety.
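The z-score described above can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure of the cited reference: the function names, the dictionary layout of the tag counts, and the use of the sample standard deviation are assumptions made for the example.

```python
from statistics import mean, stdev


def percent_representation(tag_counts, chrom):
    """Percent of all counted sequence tags that map to `chrom`.

    `tag_counts` is a hypothetical dict {chromosome name: tag count}.
    """
    total = sum(tag_counts.values())
    return 100.0 * tag_counts[chrom] / total


def z_score(test_counts, control_counts_list, chrom):
    """(test % - mean control %) / standard deviation of control %.

    Assumption: the sample standard deviation of the euploid
    controls is used; at least two controls are required.
    """
    controls = [percent_representation(c, chrom) for c in control_counts_list]
    test = percent_representation(test_counts, chrom)
    return (test - mean(controls)) / stdev(controls)


# Hypothetical counts: chr21 tags versus tags on all other chromosomes.
controls = [
    {"chr21": 10, "other": 990},
    {"chr21": 12, "other": 988},
    {"chr21": 14, "other": 986},
]
sample = {"chr21": 16, "other": 984}
z = z_score(sample, controls, "chr21")
```

A larger positive value indicates that the chromosome of interest is over-represented in the test sample relative to the euploid controls.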
The method of the invention for determining the fetal fraction by sequencing nucleic acids may be used in combination with other methods.
In step 760, a fetal fraction is determined based on the total number of tags mapped to the first allele and the total number of tags mapped to the second allele at the informative polymorphic sites (e.g., SNPs) contained in the reference genome. For example, the reference genome is a genome covering a target sequence comprising SNPs rs560681, rs1109037, rs9866013, rs13182883, rs13218440, rs7041158, rs740598, rs10773760, rs4530059, rs7205345, rs8078417, rs576261, rs2567608, rs430046, rs9951171, rs338882, rs10776839, rs25 zxft 3725, rs 62 1277284, rs258684, rs 4287 zxft 4252 zxft 5252, rs 58 zxft 3758, rs9788670, rs 626258 zxft 6258, and rs 6258 zxft. In one embodiment, the artificial reference genome comprises the polymorphic target sequences of SEQ ID NOS: 7 to 62 (see example 24).
In another embodiment, the artificial genome is an artificial target sequence genome encompassing a polymorphic sequence comprising tandem SNPs. In another embodiment, the artificial target genome encompasses a polymorphic sequence comprising an STR. The composition of the artificial target sequence genome will vary depending on the polymorphic sequence used to determine fetal fraction. Thus, the artificial target sequence genome is not limited to the SNP, tandem SNP, or STR sequences exemplified herein.
Informative polymorphic sites (e.g., SNPs) are identified by the difference in the allelic sequences and by the amount of each of the possible alleles. Fetal cfDNA is present at a concentration that is less than 10% of that of the maternal cfDNA. Thus, the alleles that can be assigned to the fetus make a minor contribution to the mixture of fetal and maternal nucleic acids, relative to the major contribution of the maternal alleles. Alleles derived from the maternal genome are referred to herein as major alleles, and alleles derived from the fetal genome are referred to herein as minor alleles. Alleles represented by similar levels of mapped sequence tags are maternal alleles. The results of an exemplary multiplex amplification of target nucleic acids comprising SNPs and derived from maternal plasma samples are shown in Fig. 12. Informative SNPs are identified by the single nucleotide change at a polymorphic site, and fetal alleles are identified by their relatively minor contribution to the mixture of fetal and maternal nucleic acids in the sample, as compared with the major contribution of the maternal nucleic acids. Thus, the relative abundance of fetal cfDNA in the maternal sample can be determined as a parameter of the total number of unique sequence tags mapped to the target nucleic acid sequence on the reference genome for each of the two alleles at a predetermined polymorphic site. In one embodiment, the fraction of fetal nucleic acid in the mixture of fetal and maternal nucleic acids is calculated for each informative allele (allele$_x$) as described elsewhere in this application.
Fetal fraction estimation using STR sequences and capillary electrophoresis
Individuals have different STR lengths because of the different numbers of repeats. Because STRs are highly polymorphic, most individuals will be heterozygous, i.e., most people possess two alleles (versions), one inherited from each parent, each with a different number of repeats. A fetal STR sequence that is not maternally inherited will differ in the number of repeats from the maternal sequence. Amplification of these STR sequences may produce one or two major amplification products corresponding to the maternal alleles (and the maternally inherited fetal allele) and one minor product corresponding to the non-maternally inherited fetal allele. Upon sequencing, the collected samples can be correlated with the corresponding alleles and counted to determine the relative fractions using equation 3.
PCR is performed on the purified samples using fluorescently labeled primers. PCR products comprising STRs can be separated and detected using manual, semi-automated, or automated electrophoresis. Semi-automated systems are gel-based and combine electrophoresis, detection, and analysis into one unit. On a semi-automated system, gel assembly and sample loading remain manual procedures; however, once the samples are loaded onto the gel, electrophoresis, detection, and analysis are automated. As the name implies, capillary electrophoresis is carried out in a microcapillary tube rather than between glass plates. Once the samples, gel polymer, and buffer are loaded onto the instrument, the capillary is filled with gel polymer and the sample is loaded automatically. Data collection occurs "in real time" as the fluorescently labeled fragments migrate past the detector at a fixed point, and the data can be viewed as they are collected. The fragments separated by capillary electrophoresis can be detected by a program that measures the wavelength of the fluorescent label. The calculation of the fetal fraction is based on averaging all informative markers. Informative markers are identified by the presence of peaks on the electropherogram that fall within preset bin parameters for the STRs being analyzed.
The fraction of the minor allele for any given informative marker is calculated by dividing the peak height of the minor component by the sum of the peak heights of the major component(s), and is expressed as a percentage for each informative locus as follows:

$$\text{fetal fraction (\%)} = \frac{\text{peak height of the minor allele}}{\sum \text{peak heights of the major allele(s)}} \times 100$$

The fetal fraction for a sample containing two or more informative STRs is calculated as the average of the fetal fractions calculated for the two or more informative markers.
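The peak-height calculation and the averaging over informative markers can be illustrated with a short Python sketch. The function names and the example peak heights are hypothetical and chosen only to show the arithmetic described in the text.

```python
def str_minor_fraction(minor_peak, major_peaks):
    """Percent minor-allele fraction at one informative STR locus.

    minor_peak:  peak height of the minor (non-maternally inherited) allele
    major_peaks: peak heights of the one or two major allele peaks
    """
    return 100.0 * minor_peak / sum(major_peaks)


def fetal_fraction_from_strs(loci):
    """Average the per-locus fractions over all informative loci.

    `loci` is a list of (minor_peak, major_peaks) tuples.
    """
    fractions = [str_minor_fraction(minor, majors) for minor, majors in loci]
    return sum(fractions) / len(fractions)


# Hypothetical electropherogram peak heights (arbitrary units).
informative_loci = [
    (100, [1000, 1000]),  # mother heterozygous: two major peaks
    (150, [1500]),        # mother homozygous: one major peak
]
sample_fetal_fraction = fetal_fraction_from_strs(informative_loci)
```

Loci without a distinct minor peak are not informative and would be left out of the input list before averaging.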
Estimating fetal fraction using a mixture model
In the embodiments disclosed herein, there are up to four different data types (zygosity cases) that make up the minor allele frequency data for the polymorphisms under consideration.
As shown in Fig. 13, cases 1 and 2 are the cases in which the mother is homozygous at a given polymorphism. In case 1, both the mother and the baby are homozygous. This case is typically not of particular interest, because the data collected show only one type of allele at the polymorphic site being analyzed. In case 2, the mother is homozygous and the baby is heterozygous, and the fetal fraction f is nominally given by two times the ratio of the minor allele count to the coverage. Coverage is defined as the total number of reads or tags (fetal and maternal) that map to the particular site of the polymorphism. Writing A for the minor allele count and D for the coverage, the equation for approximating the fetal fraction in case 2 is as follows:

$$f = \frac{2A}{D} \quad \text{(Equation 4)}$$

In case 3, where the mother is heterozygous and the baby is homozygous, the fetal fraction is nominally one minus two times the ratio of the minor allele count to the coverage. With A the minor allele count and D the coverage, the equation for approximating the fetal fraction in case 3 is as follows:

$$f = 1 - \frac{2A}{D} \quad \text{(Equation 5)}$$
Finally, in case 4, where both the mother and the fetus are heterozygous, the minor allele fraction should always be 0.5 (barring error). The fetal fraction cannot be deduced from polymorphisms that fall into case 4.
Table 7 summarizes an example of estimating the fetal fraction using equations 4 and 5 when the number of major allele reads is 300 and the number of minor allele reads is 200. The coverage is then 500.
Table 7: Example of fetal fraction estimation by zygosity case
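The arithmetic behind this example can be reproduced with a short Python sketch. It applies the two relations stated in the text (case 2: two times the ratio of minor allele count to coverage; case 3: one minus that quantity) to the read counts given above. The function names are illustrative only.

```python
def fetal_fraction_case2(minor_count, coverage):
    """Mother homozygous, fetus heterozygous: f = 2 * A / D."""
    return 2.0 * minor_count / coverage


def fetal_fraction_case3(minor_count, coverage):
    """Mother heterozygous, fetus homozygous: f = 1 - 2 * A / D."""
    return 1.0 - 2.0 * minor_count / coverage


major_reads = 300
minor_reads = 200
coverage = major_reads + minor_reads  # 500

f_case2 = fetal_fraction_case2(minor_reads, coverage)
f_case3 = fetal_fraction_case3(minor_reads, coverage)

print(f"case 2 estimate: {f_case2:.2f}")  # prints "case 2 estimate: 0.80"
print(f"case 3 estimate: {f_case3:.2f}")  # prints "case 3 estimate: 0.20"
```

The same read counts give very different estimates depending on the assumed case, which is why the polymorphisms have to be assigned to a zygosity case before a fetal fraction is read off.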
In certain embodiments, a mixture model may be employed to classify a set of polymorphisms into two or more of the zygosity cases presented and, at the same time, to estimate the fetal DNA fraction from the mean allele frequency for each of these cases. In general, a mixture model assumes that a particular data set is made up of a mixture of different types of data, each of which has its own expected distribution (e.g., a normal distribution). The procedure attempts to find the mean, and possibly other characteristics, of each type of data.
In certain embodiments employing a mixture model, one or more factorial moments given by equation 1 are calculated for the positions at which polymorphisms are being considered. For example, a factorial moment $F_i$ (or a set of factorial moments) is calculated using multiple SNP positions considered in the DNA sequence. As shown in equation 10 below, each of the different factorial moments $F_i$ is a sum, over all the different polymorphic positions under consideration, of the ratio of the minor allele frequency $a_i$ to the coverage $d_i$ at a given position. As shown in equation 11 below, these factorial moments are also related to the parameters $\alpha_i$ and $p_i$ associated with each of the four zygosity cases described above. In particular, they relate to the probability $p_i$ for each case and to the relative amount $\alpha_i$ of each of the four cases in the set of polymorphisms under consideration. As explained, the probability $p_i$ is a function of the fraction of fetal DNA in the cell-free DNA in the maternal blood. As explained more fully below, by calculating a sufficient number of these factorial moments, the method provides a sufficient number of expressions to solve for all the unknowns. The unknowns in this case are the relative amount of each of the four cases in the population of polymorphisms under consideration and the probability (and therefore the fetal DNA fraction) associated with each of these four cases. Similar results can be obtained using other versions of the mixture model. Some versions use only polymorphisms that fall into case 1 and case 2, with the polymorphisms of case 3 and case 4 filtered out by a thresholding technique.
Thus, the factorial moments can be used as part of a mixture model to identify the probabilities for any combination of the four zygosity cases. And, as mentioned, these probabilities, at least for cases 2 and 3, are directly related to the fraction of fetal DNA in the total cell-free DNA in the maternal blood.
It should also be mentioned that the sequencing error, given by e, can be used to reduce the complexity of the system of factorial moment equations that must be solved. In this regard, it should be recognized that a sequencing error can in fact have any one of four outcomes (corresponding to each of the four possible bases at any given polymorphic position).
Let the major allele count at genomic position j be B, the first order statistic of the counts (numbers of reads) at position j. The major allele, b, is the corresponding arg max. Subscripts are used when more than one SNP is considered. The major allele count is given as follows:

$$B \equiv B_j = \{b_i\} = \text{largest read count among the bases observed at position } j$$

Let the minor allele count at position j be A, the second order statistic of the counts at position j (i.e., the second-highest allele count):

$$A \equiv A_j = \{a_i\} = \text{second-largest read count among the bases observed at position } j$$

Coverage is defined as the total number of reads (fetal and maternal) that map to the specific site of the polymorphism. Let the coverage at position j be D:

$$D \equiv D_j = \{d_i\} = A_j + B_j \quad \text{(Equation 8)}$$
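As an illustration of these definitions, the following Python sketch derives the major allele, the major allele count B, the minor allele count A, and the coverage D from per-base read counts at one position. The function name and the dictionary layout of the counts are assumptions made for the example.

```python
def allele_counts(base_counts):
    """Return (major_allele, B, A, D) for one polymorphic position.

    base_counts: hypothetical dict mapping each base to its read
    count at the position, e.g. {"A": 300, "G": 200, "C": 1, "T": 0}.
    """
    # Rank bases by read count, highest first.
    ranked = sorted(base_counts.items(), key=lambda kv: kv[1], reverse=True)
    major_allele, major_count = ranked[0]                  # first order statistic
    minor_count = ranked[1][1] if len(ranked) > 1 else 0   # second order statistic
    # Coverage as defined in Equation 8: D = A + B. Reads carrying a
    # third or fourth base (e.g. sequencing errors) are not included.
    coverage = minor_count + major_count
    return major_allele, major_count, minor_count, coverage


major_allele, B, A, D = allele_counts({"A": 300, "G": 200, "C": 1, "T": 0})
```

With these hypothetical counts the major allele is "A", B is 300, A is 200, and D is 500, matching the read counts used in the Table 7 example.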
In this embodiment, the minor allele frequency A is modeled as the sum of four terms, as shown in equation 9. The four zygosity cases suggest the following binomial mixture model for the distribution of the minor allele counts $a_i$ at the points $(a_i, d_i)$, where $d_i$ is the coverage:

$$A = \{a_i\} \sim \alpha_1\,\mathrm{Bin}(p_1, d_i) + \alpha_2\,\mathrm{Bin}(p_2, d_i) + \alpha_3\,\mathrm{Bin}(p_3, d_i) + \alpha_4\,\mathrm{Bin}(p_4, d_i)$$

where

$$1 = \alpha_1 + \alpha_2 + \alpha_3 + \alpha_4, \qquad m = 4 \quad \text{(Equation 9)}$$

Each term corresponds to one of the four zygosity cases. Each term is the product of a polymorphism fraction $\alpha$ and a binomial distribution of the minor allele frequency. The $\alpha_i$ represent the fractions of the polymorphisms that fall into each of the four cases. Each binomial distribution has an associated probability, p, and coverage, d. The minor allele probability for case 2, for example, is given by f/2, where f is the fetal fraction. Different models for relating the $p_i$ to the fetal fraction and the sequencing error rate are described below. The parameters $\alpha_i$ relate to population-specific parameters, and the ability to let these values "float" with respect to, for example, the ethnicity and progeny of the parents may give additional robustness to these methods.
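To make the structure of the mixture concrete, the following Python sketch draws minor allele counts from a four-component binomial mixture. It is a simulation aid under stated assumptions, not part of the described method: sequencing error is ignored, the per-case probabilities are taken as 0 (case 1), f/2 (case 2), (1 - f)/2 (case 3, which follows from the case 3 relation f = 1 - 2A/D), and 0.5 (case 4), and all names are illustrative.

```python
import random


def case_probabilities(f):
    """Minor allele probability p for zygosity cases 1 to 4.

    Assumption: no sequencing error (e = 0).
    """
    return [0.0, f / 2.0, 0.5 - f / 2.0, 0.5]


def sample_minor_counts(f, alphas, coverages, seed=0):
    """Draw (a_i, d_i, case) triples from the binomial mixture.

    alphas:    fractions of polymorphisms in cases 1..4 (sum to 1)
    coverages: list of coverage values d_i, one per polymorphism
    Note: draws are raw binomial counts; they are not re-ranked so
    that a_i is always the smaller of the two allele counts.
    """
    rng = random.Random(seed)
    probs = case_probabilities(f)
    draws = []
    for d in coverages:
        case = rng.choices(range(4), weights=alphas)[0]
        p = probs[case]
        # Binomial(d, p) drawn as a sum of d Bernoulli trials.
        a = sum(rng.random() < p for _ in range(d))
        draws.append((a, d, case + 1))
    return draws


simulated = sample_minor_counts(
    f=0.10, alphas=[0.4, 0.3, 0.2, 0.1], coverages=[200] * 20, seed=1
)
```

Simulated points of this kind can be used to check an estimation routine against a known fetal fraction before applying it to sequencing data.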
The disclosed embodiments utilize factorial moments for the allele frequency data under consideration. It is well known that the distribution mean is a first order moment. It is the expected value of the frequency of the minor allele. The variance is a second order moment. It is calculated from the expectation of the square of the allele frequency.
For the different zygosity cases, the fetal fraction can be solved for in equation 9 above. In certain embodiments, the fetal fraction is solved for by the method of factorial moments, in which the mixture parameters can be expressed in terms of moments that are easily estimated from the observed data.
Allele frequency data across all polymorphisms can be used to calculate the ith factorial moment $F_i$ (the first factorial moment $F_1$, the second factorial moment $F_2$, etc.), as shown in equation 10. (SNPs are used for purposes of example only; other types of polymorphisms may be used, as discussed elsewhere in this application.) Given n SNP positions, the factorial moments are defined by equation 10.
As this definition shows, each factorial moment is a summation of terms over i (the individual polymorphisms in the dataset), where there are n such polymorphisms in the dataset. The summed terms are functions of the minor allele counts $a_i$ and the coverage values $d_i$.
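Because equation 10 is not reproduced in this text, the following Python sketch uses the standard factorial-moment ratio for binomial data as an assumption: the average, over the polymorphisms, of the falling factorial of the minor allele count divided by the falling factorial of the coverage. For a count drawn from a binomial distribution with probability p, the expectation of this ratio at order g is $p^g$, which is what makes it convenient for mixture estimation. The function names are illustrative.

```python
def falling_factorial(x, g):
    """x * (x - 1) * ... * (x - g + 1); equals 1 when g == 0."""
    result = 1
    for k in range(g):
        result *= x - k
    return result


def factorial_moment(points, g):
    """Order-g factorial moment over (minor_count, coverage) pairs.

    Assumed form: mean of a(a-1)...(a-g+1) / d(d-1)...(d-g+1).
    Positions with coverage below g are skipped so that the
    denominator is never zero.
    """
    usable = [(a, d) for a, d in points if d >= g]
    if not usable:
        raise ValueError("no positions with sufficient coverage")
    total = sum(falling_factorial(a, g) / falling_factorial(d, g)
                for a, d in usable)
    return total / len(usable)


# Hypothetical (minor allele count, coverage) pairs.
points = [(2, 4), (1, 4)]
F1 = factorial_moment(points, 1)
F2 = factorial_moment(points, 2)
```

The first moment computed this way is simply the mean minor allele fraction; higher orders supply the additional equations needed to separate the mixture components.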
Usefully, the factorial moments are related to the $\{\alpha_i, p_i\}$, as illustrated in equation 11.
From the probabilities $p_i$, the fetal fraction f can be determined. For example, $p_2 = f/2$ in case 2 and, from the case 3 relation above, $p_3 = (1 - f)/2$ in case 3. Thus, suitable logic can solve a system of equations that relates the unknown $\alpha$ and p variables to the factorial moment expressions for the minor allele fractions across the multiple polymorphisms under consideration. Of course, other techniques for solving the mixture model exist within the scope of the disclosed embodiments.
When n > 2 × (number of parameters to be estimated), a solution can be identified by solving for the $\{\alpha_i, p_i\}$ in the system of equations derived from the above relation, equation 8. Obviously, the problem becomes much more difficult mathematically for higher g, because more $\{\alpha_i, p_i\}$ need to be estimated.
It is typically not possible to distinguish the data of case 1 from case 2 (or case 3 from case 4) accurately by a simple threshold at the low end of the minor allele fraction. By discriminating against a threshold T on a ratio of A to D, the data of cases 1 and 2 can be easily separated from the data of cases 3 and 4, where A is the minor allele count, D is the coverage, and T is the threshold. It has been found that the use of T = 0.5 performs satisfactorily.
Note that the mixture model approach using equations 10 and 11 utilizes the data for all polymorphisms but does not separately account for sequencing errors. Suitable methods that separate the data of cases 1 and 2 from the data of cases 3 and 4 can account for sequencing errors.
In further examples, the data set provided to the mixture model contains only data for case 1 and case 2 polymorphisms. These are polymorphisms for which the mother is homozygous. The polymorphisms of cases 3 and 4 can be eliminated using a thresholding technique. For example, polymorphisms with a minor allele frequency greater than a particular threshold are excluded before the mixture model is employed. Using the appropriately filtered data and the factorial moments, simplified as in equations 13 and 14 below, one can calculate the fetal fraction f as shown in equation 15. Note that equation 13 is a restatement of equation 9 for this implementation of the mixture model. Note also that in this particular example, the sequencing error associated with the machine reads is unknown. As a result, the system of equations must separately be solved for the error e.
FIG. 14 compares the results obtained with this mixture model against the known fetal fraction, with the known fetal fraction on the x-axis and the estimated fetal fraction on the y-axis. If the mixture model predicted the fetal fraction perfectly, the plotted results would follow the dashed line. Nevertheless, the estimated fractions are remarkably good, especially considering that most of the data is excluded before the mixture model is applied.
To elaborate, several other methods are available for estimating the parameters of the model from equation 7. In some cases, a tractable solution can be found by setting the derivatives of the chi-squared statistic to zero. In cases where no easy solution can be found by direct differentiation, a Taylor series expansion of the binomial probability distribution function (PDF) or another approximating polynomial can be effective. Minimum chi-squared estimation is known to be effective. The method-of-moments solution from equation 9 can be used as a starting point for an iterative method. The following chi-squared estimator may be used:
where P_i is the number of points with count i. An iterative method from Le Cam ["On the Asymptotic Theory of Estimation and Testing Hypotheses," Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, Berkeley, CA: University of California Press, 1956, pp. 129-156] uses Newton-Raphson iterations on the likelihood function.
According to another application, a method of resolving mixture models is discussed that involves expectation maximization operating on mixtures of approximating beta distributions.
Model 1: cases 1 and 2, sequencing errors unknown
Consider a reduced model that accounts only for heterozygosity cases 1 and 2. In this case, the mixture distribution can be written as:
A = {a_i} ~ α_1 Bin(e, d_i) + α_2 Bin(f/2, d_i)
where
1 = α_1 + α_2
m = 4   (Equation 13)
and the system of equations:
F_1 = α_1 e + (1 − α_1)(f/2)
F_2 = α_1 e^2 + (1 − α_1)(f/2)^2
F_3 = α_1 e^3 + (1 − α_1)(f/2)^3   (Equation 14)
is solved for e (the sequencing error rate), α_1 (the proportion of case 1 points), and f (the fetal fraction), where the F_i are as defined in equation 10 above. The closed-form solution for the fetal fraction is chosen as the real solution of the following equation:
The solution chosen is the one that lies between 0 and 1.
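The three moment relations of equation 14 can be solved in closed form by a standard method-of-moments argument for a two-component mixture. The sketch below is a derivation under stated assumptions, not a transcription of equation 15: writing x = e and y = f/2, the moments satisfy F_2 = b F_1 − c and F_3 = b F_2 − c F_1 with b = x + y and c = x y, so x and y are the two roots of t^2 − b t + c = 0. Treating the smaller root as e and the larger as f/2 is an additional assumption that fails when f/2 is below the error rate (for example, f = 1% with e = 1%). The function name is hypothetical.

```python
import math
from typing import Tuple


def solve_model_1(f1: float, f2: float, f3: float) -> Tuple[float, float, float]:
    """Return (e, f, alpha_1) from the first three factorial moments.

    ASSUMPTION: the smaller root is the error rate e, the larger is f/2.
    """
    denom = f2 - f1 * f1
    if denom <= 0.0:
        raise ValueError("moments are not consistent with two distinct components")
    b = (f3 - f1 * f2) / denom       # e + f/2
    c = (f1 * f3 - f2 * f2) / denom  # e * (f/2)
    disc = b * b - 4.0 * c
    if disc <= 0.0:
        raise ValueError("no pair of distinct real roots")
    root = math.sqrt(disc)
    p_low = (b - root) / 2.0
    p_high = (b + root) / 2.0
    alpha_1 = (p_high - f1) / (p_high - p_low)  # weight of the low component
    return p_low, 2.0 * p_high, alpha_1
```

Feeding in exact moments computed from α_1 = 0.7, e = 0.01, and f = 0.2 recovers those three values; with sampled data the moments are noisy and the recovered roots inherit that noise.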
To gauge the performance of the estimators, a simulated data set of Hardy-Weinberg equilibrium points (a_i, d_i) was constructed with fetal fractions designed to be {1%, 3%, 5%, 10%, 15%, 20%, and 25%} and a constant sequencing error rate of 1%. The 1% error rate is the currently accepted rate for the sequencing machines and protocols used and is consistent with the Illumina Genome Analyzer II data shown in FIG. 15. Equation 15 was applied to the data and, apart from an upward bias of four points, was found to agree generally with the "known" fetal fraction. Interestingly, the sequencing error rate e is estimated to be well above 1%.
Model 2: cases 1 and 2, sequencing errors known
In the next example of a mixture model, data for polymorphisms belonging to cases 3 and 4 are removed again using thresholding or another filtering technique. However, in this case, the sequencing error is known. This simplifies the resulting expression of fetal fraction, f, as shown in equation 16. Fig. 16 shows that this version of the mixture model provides improved results compared to the approach taken by equation 15. In the subsequent equation, the sequencing machine error rate is given as e.
A similar approach is shown in equations 17 and 18. This approach recognizes that only some sequencing errors add to the minor allele count; specifically, only one in every four sequencing errors should increase the minor allele count. FIG. 17 shows a very good fit between the actual and estimated fetal fractions when using this technique.
Because the sequencing error rate of the machine used is largely known, the bias and complexity of the calculations can be reduced by eliminating e as the variable to be solved for. Thus, we obtain a system of equations for the fetal fraction f:
F_1 = α_1 e + (1 − α_1)(f/2)
F_2 = α_1 e^2 + (1 − α_1)(f/2)^2   (Equation 16)
which, on eliminating α_1, has the solution:
f = 2(F_2 − e F_1)/(F_1 − e)
FIG. 16 shows that using the machine error rate as a known parameter reduces the upward bias by a point.
Model 3: cases 1 and 2, sequencing errors known, improved error model
To reduce the bias in the model, we extended the error model of the above equations to account for the fact that, in heterozygosity case 1, not every sequencing error event adds to the minor allele count A = a_i. Furthermore, we allow for the fact that sequencing error events may contribute to the heterozygosity case 2 counts. Hence, we determine the fetal fraction f by solving the following system of factorial moment relations:
F_1 = α_1 e/4 + (1 − α_1)(e + f/2)
The solution to this system is then:
FIG. 17 shows, for the simulated data, that using the machine error rate as a known parameter together with the enhanced error model for cases 1 and 2 greatly reduces the upward bias, to less than a point for fetal fractions below 0.2.
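When the error rate is known, two moments suffice. The following sketch solves both known-error models with one helper: given a two-component mixture whose lower component probability p_low is known, eliminating α_1 from the first two moment relations gives p_high = (F_2 − p_low F_1)/(F_1 − p_low). For Model 2, p_low = e and p_high = f/2. For Model 3, p_low = e/4 and p_high = e + f/2; the second moment relation for Model 3 is not shown in this text, so the sketch assumes it follows the same pattern as equation 16, F_2 = α_1 (e/4)^2 + (1 − α_1)(e + f/2)^2. All function names are hypothetical.

```python
def upper_component(f1: float, f2: float, p_low: float) -> float:
    """Solve F1 = a*p_low + (1-a)*p_high and F2 = a*p_low**2 + (1-a)*p_high**2 for p_high."""
    if f1 == p_low:
        raise ValueError("F1 equals the known component; p_high is undetermined")
    return (f2 - p_low * f1) / (f1 - p_low)


def fetal_fraction_model_2(f1: float, f2: float, e: float) -> float:
    """Model 2: components Bin(e, d) and Bin(f/2, d), error rate e known."""
    return 2.0 * upper_component(f1, f2, e)


def fetal_fraction_model_3(f1: float, f2: float, e: float) -> float:
    """Model 3: components Bin(e/4, d) and Bin(e + f/2, d), error rate e known.

    ASSUMPTION: the second moment relation mirrors equation 16.
    """
    return 2.0 * (upper_component(f1, f2, e / 4.0) - e)
```

Both functions return f = 0.2 when given exact moments generated with α_1 = 0.7, e = 0.01, and f = 0.2 under the corresponding model.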
Classifying affected samples using fetal fraction
In certain embodiments, the fetal fraction estimate is used to further characterize an affected sample. In some cases, the fetal fraction estimate allows the affected sample to be classified as mosaic, a complete aneuploidy, or a partial aneuploidy. One computer-implemented method for obtaining this information is depicted in the flowchart of FIG. 18. This and related methods can provide simultaneous estimation of fetal fraction, determination of CNV, and classification of the CNV. In other words, the same sequence tags can be used to perform any of these three functions.
To use this method, two modes of assessing fetal fraction are employed. One mode yields NCNFF values and the other mode yields CNFF values. As explained, CNFF values are obtained using techniques that rely on chromosomes or chromosome segments that are determined to possess copy number variations. There is no need to rely on polymorphisms to calculate fetal fraction. An example of a non-polymorphic technique for calculating fetal fraction is described in example 17, which assumes the presence of a full chromosome duplication or deletion and employs the following expression:
ff_(i) = 2*|NCV_jA CV_jU|   (Equation 28)
where j identifies the aneuploid chromosome and CV is the coefficient of variation obtained from the qualified samples used to determine the mean and standard deviation in the expression for NCV.
NCNFF values are obtained using techniques that rely on chromosomes or chromosome segments that do not have copy number variations. In other words, the NCN fetal fraction is determined by a technique that reliably determines the fetal fraction, assuming normal ploidy of the portion of the genome used to calculate the fetal fraction. The CN fetal fraction is determined by a form of technique that assumes that the sample under consideration has aneuploidy. The CNV of the affected chromosome or chromosome segment is used to calculate the CN fetal fraction. The following presents techniques for its computation.
By comparing the estimate of the NCN fetal fraction with the estimate of the CN fetal fraction, the method can determine the type of aneuploidy that may be present in the sample. Basically, if the NCN fetal fraction and CN fetal fraction values match, then the ploidy assumption built into the technique for estimating the CN fetal fraction can be considered true. For example, if the technique for calculating the CN fetal fraction assumes that the sample has a complete chromosomal aneuploidy exhibiting a single additional copy of a chromosome or a single deletion of a chromosome, and the NCN fetal fraction value matches the CN fetal fraction value, then the method can conclude that the sample exhibits a complete chromosomal aneuploidy. The basis for this conclusion is described in more detail below.
The NCN fetal fraction can be determined by different techniques. In some embodiments, the NCN fetal fraction is estimated using selected polymorphisms in a reference genome. Examples of these techniques are described above. In other embodiments, the NCN fetal fraction is determined using the relative amount of a chromosome that is known not to be aneuploid or has been determined not to be aneuploid. For example, a chromosome known not to be aneuploid in a sample may be chromosome X in a male fetus. Thus, in other embodiments, the relative amount of the X or Y chromosome (e.g., the chromosome dose for such a chromosome) in a sample comprising DNA from a woman pregnant with a male fetus is used to determine the NCN fetal fraction. The genome of a male fetus should not include a second copy of the X chromosome. Knowing this, the relative amount of X chromosome DNA can be used to provide an NCN value of the fetal fraction. In a sample comprising female fetal DNA, a chromosome known not to be aneuploid may be a chromosome for which aneuploidy is known to be incompatible with life. Alternatively, for samples containing DNA from a male or a female fetus, sequence tags can be used to determine chromosome doses (and NCVs or NSVs) to confirm normal ploidy of the chromosomes that are then used to determine the NCN fetal fraction.
Turning to the flowchart of FIG. 18, the NCN fetal fraction estimate 1802 and the CN fetal fraction estimate 1804 are compared. If they match, as indicated at block 1806, the process concludes and determines that the assumption built into the technique for estimating the CN fetal fraction is true. In various embodiments, the assumption is that a trisomy or monosomy is present in one of the chromosomes of the fetus.
If, on the other hand, the comparison indicates that the values of the two fetal fractions do not match (condition 1808) and that in fact the estimate of the CN fetal fraction is less than the NCN fetal fraction, then the second phase of the method will be performed as indicated at block 1810.
In this second stage, the method determines whether the sample comprises a partial aneuploidy or is mosaic. In addition, if the sample includes a partial aneuploidy, the method determines where on the aneuploid chromosome the aneuploidy resides. In certain embodiments, this is accomplished by first dividing the affected chromosome into multiple bins (blocks). In one example, each bin is about 1 million base pairs in length. Of course, other bin lengths can be used, such as about 1 kilobase, about 10 kilobases, about 100 kilobases, and the like. The bins do not overlap and span most or all of the length of the chromosome. The bins are compared with one another, and the comparison provides insight into the condition. In one approach, the mapped tags are counted for each bin and optionally converted to a bin dose. If any of the bins is aneuploid, these counts or bin doses will indicate it. As part of the analysis of the individual bins, it may be appropriate to normalize the information from each bin to account for inter-bin variations such as G-C content. The resulting normalized bin value may be referred to as an NBV; an NBV is an example of a normalized segment value in which the bin is normalized to the tags mapped to normalizing segments of similar GC content (as in example 19 below). In some embodiments, a fetal fraction is calculated for each bin and the individual fetal fraction values are compared. This sequential analysis of the bins is depicted in block 1812 of FIG. 18. If any bin is identified as harboring an aneuploidy (by considering tag density, fetal fraction, or other information), the method determines that the sample contains a partial aneuploidy and additionally localizes the aneuploidy to the bin or bins in which the tag count deviates sufficiently from the expected value. See block 1814.
However, if the method does not identify any chromosomal region exhibiting aneuploidy when analyzing the individual bins of the chromosome under consideration, then the method determines that the sample is mosaic. See block 1816.
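The decision flow of FIG. 18 can be sketched as follows. This is an illustrative outline under stated assumptions, not the claimed implementation: the text does not specify how close the two estimates must be to "match" or how far a bin must deviate, so the relative tolerance `match_tol` and the cutoff `bin_cutoff` are hypothetical parameters, and the bin inputs are assumed to be normalized, z-score-like bin values. The case in which the CN estimate exceeds the NCN estimate is not covered by the flow as described, so the sketch reports it as unresolved.

```python
from typing import List, Sequence, Tuple


def classify_affected_sample(ncn_ff: float,
                             cn_ff: float,
                             bin_values: Sequence[float],
                             match_tol: float = 0.1,
                             bin_cutoff: float = 3.0) -> Tuple[str, List[int]]:
    """Classify an affected sample; returns (label, indices of deviating bins).

    ASSUMPTIONS: match_tol (relative) and bin_cutoff are illustrative values.
    """
    # Blocks 1802-1806: matching estimates support the complete-aneuploidy assumption.
    if abs(cn_ff - ncn_ff) <= match_tol * ncn_ff:
        return "complete aneuploidy", []
    # Blocks 1808-1816: CN estimate below NCN estimate triggers the bin analysis.
    if cn_ff < ncn_ff:
        deviating = [i for i, v in enumerate(bin_values) if abs(v) > bin_cutoff]
        if deviating:
            return "partial aneuploidy", deviating
        return "mosaic", []
    return "unresolved", []
```

For example, NCN and CN estimates of 10% and 4% with two strongly deviating bins yield a partial aneuploidy localized to those bins, whereas the same estimates with no deviating bin yield a mosaic classification.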
Determining the presence or absence of a complete or partial aneuploidy by calculating and comparing fetal fractions using polymorphisms, e.g., SNPs, on the chromosome of interest and on a chromosome known not to be aneuploid in the affected sample (e.g., chromosome X in a male fetus)
As explained, the fetal fraction (FF) determined using informative polymorphic sequences, such as informative SNPs, can be used to distinguish complete chromosomal aneuploidies from partial aneuploidies.
The presence or absence of an aneuploidy, whether partial or complete, can be determined by comparing the value of the fetal fraction determined using polymorphic target sequences present on the chromosome of interest with the value of the fetal fraction determined using polymorphic target sequences present on a different chromosome in the sample. In a sample where the fetus is male, the FF for the chromosome of interest can be determined and compared with the FF determined for chromosome X in the same sample. For example, given a maternal sample from a mother pregnant with a male fetus with trisomy 21, polymorphic sequences, e.g., sequences comprising at least one informative SNP, are selected for their presence on chromosome 21 and on chromosome X; the polymorphic target sequences are amplified and sequenced, and the fetal fractions are determined as described elsewhere in the application.
Given that the fetal fraction is proportional to the amount of fetal chromosomes in the sample, the fetal fraction determined using polymorphic sequences present on a trisomic chromosome in a maternal sample will be 1.5 times the fetal fraction determined using polymorphic sequences on a chromosome (e.g., chromosome X) known not to be aneuploid in a male fetus in the same maternal sample. For example, in a normal sample, when a set of polymorphisms on chromosome 21 is used to determine the fetal fraction (FF_21) and a set of polymorphisms on chromosome X is used to determine the fetal fraction (FF_X), then FF_21 = FF_X, given that chromosome X is unaffected in the male fetus. However, if the fetus is trisomic for chromosome 21, the fetal fraction determined for the trisomic chromosome 21 (FF_21) will be one and a half times the fetal fraction determined for chromosome X in the same sample (FF_21 = 1.5*FF_X). Thus, if FF_21 < FF_X, the analysis logic may conclude that there is a deletion of a portion of chromosome 21 and/or mosaicism. If FF_21 > FF_X, the analysis logic may conclude that there is a gain of a portion of chromosome 21, such as a duplication or multiplication of a portion of chromosome 21, or a complete duplication of chromosome 21 that is not accounted for in the technique for calculating the fetal fraction from chromosome 21. The two outcomes can be distinguished because a partial duplication will give FF_21 < 1.5*FF_X. Alternatively, the presence of a partial duplication, a deletion, or mosaicism can be determined by, for example, increasing the number of polymorphic sequences on chromosome 21 to obtain multiple FF values along the length of the chromosome, such that the local presence of a doubled or multiplied FF value indicates a gain of a portion of the chromosome.
Alternatively, as would be the case for mosaic samples, the FF determined from polymorphic sequences is constant along the entire length of the chromosome, indicating an overall increase in the amount of the complete chromosome, but the increase is less than the 1.5*FF_X described above. When an entire chromosome is lost, e.g., monosomy of chromosome X, then FF_monosomy = 1/2 FF_X. Fetal fraction values obtained from informative polymorphic sequences can be used in combination with sequence doses and their normalized dose values, e.g., NCV, NSV, to confirm the presence of a complete aneuploidy.
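The ratio comparison described above can be sketched as a small helper. The expected ratios (1 for a normal chromosome, 1.5 for a complete trisomy, 0.5 for a complete monosomy) come from the text; the tolerance `tol` around each expected ratio and the function name are assumptions made for illustration, since the text does not state how measurement noise is handled.

```python
def interpret_ff_ratio(ff_chr: float, ff_x: float, tol: float = 0.1) -> str:
    """Interpret FF on the chromosome of interest relative to FF on chromosome X (male fetus).

    ASSUMPTION: tol is an illustrative absolute tolerance on the ratio.
    """
    if ff_x <= 0.0:
        raise ValueError("reference fetal fraction must be positive")
    ratio = ff_chr / ff_x
    if abs(ratio - 1.0) <= tol:
        return "no aneuploidy indicated"
    if abs(ratio - 1.5) <= tol:
        return "complete trisomy"
    if abs(ratio - 0.5) <= tol:
        return "complete monosomy"
    if ratio > 1.0:
        return "partial gain or mosaicism"
    return "partial loss or mosaicism"
```

An intermediate ratio such as 1.2 is consistent with either a partial duplication or mosaicism; separating those two requires the positional analysis along the chromosome described above.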
Calculation of fetal fraction from chromosomal dose of aneuploid sequences
The NCV for the chromosome of interest is calculated according to the following equation:
NCV_ij = (x_ij − μ̂_j)/σ̂_j   (Equation 19)
where μ̂_j and σ̂_j are the estimated mean and standard deviation, respectively, of the jth chromosome dose in the set of qualified samples, and x_ij is the observed jth chromosome dose for test sample i.
Overall, the chromosome dose for a trisomic chromosome will increase in proportion to the fetal fraction (ff). Thus, the chromosome dose in a sample containing a trisomic chromosome will increase proportionally to the fetal fraction:
R_jA = (1 + ff/2) R_jU   (Equation 20)
The chromosome dose for a monosomic chromosome will decrease in proportion to the fetal fraction (ff). Thus, the chromosome dose in a sample containing a monosomic chromosome will be reduced proportionally to the fetal fraction:
R_jA = (1 − ff/2) R_jU   (Equation 21)
In equations 20 and 21, R_jA is the chromosome dose (x_ij) for chromosome j in the affected sample (e.g., the maternal test sample i); ff is the fetal fraction; and R_jU is the chromosome dose expected in an unaffected (qualified) sample U. The factor "2" is included on the following assumption: the sign in equation 20 is "plus", i.e., there is one extra copy of the chromosome of interest; the sign in equation 21 is "minus", i.e., one complete copy of the chromosome of interest is missing. If a different assumption is made (e.g., a duplication of only part of the chromosome of interest), then the factor "2" is no longer meaningful.
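Substituting the chromosome dose R_jA into equation 19 gives:
NCV_jA = (R_jA − R̄_jU)/σ_jU
where R̄_jU is an equivalent representation of μ̂_j, and σ_jU is an equivalent representation of σ̂_j. Solving for ff in the trisomy case:
NCV_jA = ((1 + ff/2) R̄_jU − R̄_jU)/σ_jU
or
NCV_jA = (ff/2) R̄_jU/σ_jU
or
ff = 2 NCV_jA σ_jU/R̄_jU
or
ff = 2 NCV_jA CV_jU, where CV_jU = σ_jU/R̄_jU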
Thus, the fetal fraction ff_(i) can be determined for any chromosome assumed to be trisomic as:
ff_(i) = 2*NCV_jA CV_jU   (Equation 26)
The fetal fraction ff_(i) can be determined for any chromosome assumed to be monosomic as:
ff_(i) = −2*NCV_jA CV_jU   (Equation 27)
Equation 27 assumes that one complete copy of the chromosome is missing. The corresponding NCV_jA of the chromosome is necessarily negative. Thus, although equation 27 contains a negative sign, the calculated fetal fraction is still a positive value.
Since the fetal fraction cannot be negative, ff_(i) for any chromosome can be calculated by the following equation:
ff_(i) = 2*|NCV_jA CV_jU|   (Equation 28)
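A short worked example of equation 28 follows. It assumes that the NCV is the usual standardized dose, (observed dose − mean of qualified doses)/standard deviation of qualified doses, and that the sample (n − 1) standard deviation is used; the choice of estimator and the function names are assumptions. Note that the standard deviation cancels in the product NCV × CV, so the result equals 2|x − mean|/mean.

```python
from statistics import mean, stdev
from typing import Sequence


def ncv(dose: float, qualified_doses: Sequence[float]) -> float:
    """Normalized chromosome value of a test dose against qualified-sample doses."""
    return (dose - mean(qualified_doses)) / stdev(qualified_doses)


def fetal_fraction_from_ncv(dose: float, qualified_doses: Sequence[float]) -> float:
    """ff = 2 * |NCV * CV|, assuming a whole-chromosome gain or loss (equation 28)."""
    cv = stdev(qualified_doses) / mean(qualified_doses)  # coefficient of variation
    return 2.0 * abs(ncv(dose, qualified_doses) * cv)
```

With qualified doses of 0.98, 1.00, and 1.02, a test dose of 1.05 (trisomy) and a test dose of 0.95 (monosomy) both correspond to a fetal fraction of about 10%.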
Resolving no-calls using fetal fraction
The ability to determine a significant difference in the representation of one or more sequences present in a mixture of two genomes is predicated on the relative sequence contribution of the first genome relative to the contribution of the second genome. For example, non-invasive prenatal diagnosis using cfDNA in maternal samples is challenging because only a small fraction of the DNA in the sample is derived from the fetus. For prenatal diagnostic analysis, the background of maternal DNA forms a practical limit to sensitivity, and therefore the fraction of fetal DNA present in a maternal sample is an important parameter. The sensitivity of fetal aneuploidy detection by counting DNA molecules depends on the fetal DNA fraction and the number of molecules counted.
Typically, about 1% of the maternal test samples analyzed for fetal aneuploidy by massively parallel sequencing are "no-call" samples, for which insufficient sequencing information, e.g., the number of fetal sequence tags, precludes a confident determination of the presence or absence of one or more fetal aneuploidies in the maternal sample. A "no-call" determination may result from a fetal cfDNA content that is too low, relative to the maternal contribution to the sample, to provide the sequencing information needed to distinguish an aneuploid sample from the sequencing information determined in qualified samples. To determine whether a "no-call" sample is an aneuploid sample, the fetal fraction is determined empirically and/or derived, for example, from the NCV value, and used to confirm or negate the presence of a chromosomal aneuploidy. As described elsewhere herein, ff can be used to characterize the type of aneuploidy present in the test sample. For example, for a "no-call" zone set between NCV thresholds of 2.5 and 4, a test sample with an NCV close to the threshold of 4 and showing a low fetal fraction (e.g., less than 3%) may be an affected sample. Conversely, a test sample with an NCV close to the threshold of 2.5 and showing a high fetal fraction (e.g., greater than 40%) may be an unaffected sample. Resolving a "no-call" sample may thus rely on a determination of the fetal fraction. Preferably, the fetal fraction is determined by two or more different methods, or by the same method using NCVs determined from two or more different chromosomes of the sample. Similarly, the fetal fraction can be used to assess whether a sample with an NCV slightly greater than 4 or slightly less than 2.5 is likely to be a false positive or a false negative call, respectively.
Apparatus and system for determining CNV
Analysis of sequencing data and the diagnoses derived from it are typically performed using various computer-implemented algorithms and programs. Thus, certain embodiments employ processes that involve the storage or transfer of data in or through one or more computer systems or other processing systems. Embodiments of the invention also relate to apparatus for performing these operations. This apparatus may be specially constructed for the required purposes, or it may be a general-purpose computer (or a group of computers) selectively activated or reconfigured by a computer program and/or data structure stored in the computer. In some embodiments, a set of processors performs some or all of the recited analysis operations in a coordinated fashion and/or simultaneously (e.g., via network or cloud computing). A processor or set of processors for performing the methods described herein may be of different types, including microcontrollers and microprocessors, such as programmable devices (e.g., CPLDs and FPGAs) and non-programmable devices, such as gate array ASICs or general-purpose microprocessors.
In addition, certain embodiments pertain to tangible and/or non-transitory computer-readable media or computer program products that include program instructions and/or data (including data structures) for performing various computer-implemented operations. Examples of computer-readable media include, but are not limited to, semiconductor memory devices; magnetic media such as disk drives and magnetic tape; optical media such as CDs; magneto-optical media; and hardware devices that are specially configured to store and execute program instructions, such as read-only memory devices (ROM) and random access memory (RAM). The computer-readable medium may be controlled directly by the end user, or the medium may be controlled indirectly by the end user. Examples of directly controlled media include media located at a user facility and/or media that are not shared with other entities. Examples of indirectly controlled media include media that are indirectly accessible to the user over an external network and/or through a service that provides shared resources (e.g., a "cloud"). Examples of program instructions include both machine code, such as that produced by a compiler, and files containing higher-level code that may be executed by the computer using an interpreter.
In various embodiments, the data or information employed in the disclosed methods and apparatus is provided in an electronic format. Such data or information may include reads and tags derived from a nucleic acid sample, counts or densities of such tags that align with particular regions of a reference sequence (e.g., that align to a chromosome or chromosome segment), reference sequences (including reference sequences consisting only or predominantly of polymorphisms), chromosome and segment doses, calls (e.g., aneuploidy calls), normalized chromosome and segment values, pairs of chromosomes or segments and corresponding normalizing chromosomes or segments, counseling recommendations, diagnoses, and the like. As used herein, data or other information provided in an electronic format can be stored on a machine and transmitted between machines. Conventionally, data in electronic format is provided digitally and may be stored as bits and/or bytes in various data structures, lists, databases, etc. The data may be embodied electronically, optically, and the like.
In one embodiment, the invention provides a computer program product for generating an output indicating the presence or absence of an aneuploidy (e.g., a fetal aneuploidy) or cancer in a test sample. The computer product may contain instructions for performing any one or more of the above-described methods for determining a chromosomal abnormality. As explained, the computer product may include a non-transitory and/or tangible computer-readable medium having computer-executable or compilable logic (e.g., instructions) recorded thereon for enabling a processor to determine chromosome doses and, in some cases, the presence or absence of a fetal aneuploidy. In one example, the computer product comprises a computer-readable medium having computer-executable or compilable logic (e.g., instructions) recorded thereon for enabling a processor to diagnose a fetal aneuploidy, the computer product comprising: a receiving procedure for receiving sequencing data from at least a portion of the nucleic acid molecules of a maternal biological sample, wherein the sequencing data comprises calculated chromosome and/or segment doses; computer-assisted logic for analyzing the fetal aneuploidy from the received data; and an output procedure for generating an output indicating the presence, absence, or type of the fetal aneuploidy.
Sequencing information from the sample under consideration can be mapped to chromosome reference sequences to identify a number of sequence tags for each of any one or more chromosomes of interest and to identify a number of sequence tags for a normalizing segment sequence for each of the one or more chromosomes of interest. In various embodiments, the reference sequences are stored in a database, such as a relational or object database.
It should be understood that it is in most cases impractical, or even impossible, for an unaided human being to perform the computational operations of the methods disclosed herein. For example, mapping a single 30 bp read from a sample to any one of the human chromosomes might require years of effort without the assistance of a computing device. Of course, the problem is compounded because reliable aneuploidy calls generally require mapping thousands (e.g., at least about 10,000) or even millions of reads to one or more chromosomes.
The methods disclosed herein can be performed using a computer-readable medium having stored thereon computer-readable instructions for carrying out a method for identifying any CNV, e.g., chromosomal or partial aneuploidies. Accordingly, in one embodiment, the invention provides a computer-readable medium having stored thereon computer-readable instructions for carrying out a method for identifying complete and partial chromosomal aneuploidies, e.g., fetal aneuploidies. These instructions may include, for example, instructions for: (a) obtaining and/or at least temporarily storing, in a computer-readable medium, sequence information for fetal and maternal nucleic acids in a sample; (b) using the stored sequence information to computationally identify, from the mixture of fetal and maternal nucleic acids, a number of sequence tags for each of any one or more chromosomes of interest selected from chromosomes 1-22, X, and Y, and a number of sequence tags for at least one normalizing chromosome sequence for each of the one or more chromosomes of interest; and (c) computationally calculating a single chromosome dose for each of the chromosomes of interest, using the number of sequence tags identified for each of the one or more chromosomes of interest and the number of sequence tags identified for each normalizing chromosome sequence. The instructions may be executed using one or more suitably designed or configured processors. The instructions may additionally include comparing each chromosome dose to an associated threshold value, thereby determining the presence or absence of any four or more partial or complete different fetal chromosomal aneuploidies in the sample. As explained above, there are many variations on this process. All such variations can be implemented using the processing and storage features described herein.
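Steps (b) and (c) can be sketched as follows, assuming that a chromosome dose is the ratio of the number of sequence tags mapped to the chromosome of interest to the number mapped to its normalizing chromosome sequence. The mapping step itself is outside the sketch; the function name, the dictionary layout, and the pairing of chromosome 21 with chromosome 9 in the usage example are hypothetical and serve only to illustrate the calculation.

```python
from typing import Dict, Mapping


def chromosome_doses(tag_counts: Mapping[str, int],
                     normalizing: Mapping[str, str]) -> Dict[str, float]:
    """Compute one dose per chromosome of interest.

    tag_counts:  chromosome name -> number of mapped sequence tags
    normalizing: chromosome of interest -> its normalizing chromosome (ASSUMED pairing)
    """
    doses: Dict[str, float] = {}
    for chrom, norm in normalizing.items():
        norm_tags = tag_counts[norm]
        if norm_tags <= 0:
            raise ValueError(f"no tags mapped to normalizing chromosome {norm}")
        doses[chrom] = tag_counts[chrom] / norm_tags
    return doses
```

For instance, 1,500 tags on chromosome 21 and 30,000 tags on a normalizing chromosome give a dose of 0.05, which would then be compared with the threshold associated with chromosome 21.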
In some embodiments, the instructions may further comprise automatically recording information about the method, such as chromosome dosage and presence or absence of fetal chromosomal aneuploidy, in a patient medical record for a human subject providing a maternal test sample. The patient medical record may be maintained by, for example, a laboratory, physician's office, hospital, health maintenance organization, insurance company, or personal medical record website. Moreover, based on the results of the analysis performed by the processor, the method may further involve prescribing, initiating, and/or altering treatment of the human subject from which the maternal test sample was taken. This may involve performing one or more additional tests or analyses on additional samples taken from the subject.
The disclosed methods can also be performed using a computer processing system adapted or configured to perform a method for identifying any CNV, e.g., chromosomal or partial aneuploidies. Accordingly, in one embodiment, the invention provides a computer processing system adapted or configured to perform a method as described herein. In one embodiment, the apparatus comprises a sequencing device adapted or configured to sequence at least a portion of the nucleic acid molecules in a sample to obtain the type of sequence information described elsewhere herein. The apparatus may further comprise devices for processing the sample. These devices are described elsewhere herein.
Sequence or other data can be input into a computer or stored on a computer-readable medium either directly or indirectly. In one embodiment, the computer system is directly coupled to a sequencing device that reads and/or analyzes sequences of nucleic acids from samples. Sequences or other information from such tools are provided to the computer system via an interface. Alternatively, the sequences processed by the system are provided from a sequence storage source, such as a database or other repository. Once the sequences are available to the processing apparatus, a memory device or mass storage device buffers or stores, at least temporarily, the nucleic acid sequences. In addition, the memory device may store tag counts for the different chromosomes or genomes. The memory may also store various routines and/or programs for analyzing the sequences or the mapped data. These programs/routines may include programs for performing statistical analyses, and the like.
In one example, a user provides a sample to a sequencing device. Data is collected and/or analyzed by a sequencing device connected to a computer. Software on the computer allows for data collection and/or analysis. The data may be stored, displayed (via a monitor or other similar device), and/or transmitted to another location. The computer may be connected to the internet for transmitting data to a handheld device used by a remote user, such as a physician, scientist or analyst. It is to be understood that the data may be stored and/or analyzed prior to transmission. In some embodiments, raw data is collected and sent to a remote user or device that will analyze and/or store the data. The transmission may be via the internet, but may also be via satellite or other connection. Alternatively, the data may be stored on a computer readable medium, and the medium may be sent to the end user (e.g., by mail). The remote users may be in the same or different geographic locations including, but not limited to, buildings, cities, states, countries, or continents.
In some embodiments, the methods further comprise collecting data (e.g., reads, tags, and/or reference chromosome sequences) about the plurality of polynucleotide sequences and transmitting the data to a computer or other computing system. For example, the computer may be connected to a laboratory apparatus, such as a sample collection device, a nucleotide amplification device, a nucleotide sequencing device, or a hybridization device. The computer may then collect the relevant data gathered by the laboratory apparatus. This data may be stored on the computer at any step, such as in real time as it is collected, before transmission, during or simultaneously with transmission, or after transmission. The data may be stored on a computer readable medium that is removable from the computer. The collected or stored data may be transmitted from the computer to a remote location, for example, over a local or wide area network, such as the internet. At the remote location, the transmitted data may be manipulated in various ways as described below.
The types of electronically formatted data that may be stored, transferred, analyzed, and/or manipulated in the systems, devices, and methods disclosed herein are as follows:
Reads obtained by sequencing nucleic acids in a test sample
Tags obtained by aligning reads to a reference genome or other reference sequence
The reference genome or sequence
Sequence tag density: the count or number of tags for each of two or more regions (typically chromosomes or chromosome segments) of a reference genome or other reference sequence
Identity of the normalizing chromosome or chromosome segment for a particular chromosome or chromosome segment of interest
Doses for chromosomes or chromosome segments (or other regions) obtained from chromosomes or segments of interest and the corresponding normalizing chromosomes or segments
Thresholds for determining whether a chromosome dose is affected, unaffected, or yields no determination
The actual determinations made from chromosome doses
Diagnoses (clinical conditions associated with these determinations)
Recommendations for further tests derived from these determinations and/or diagnoses
Treatment and/or monitoring plans derived from these determinations and/or diagnoses
These different data types may be obtained, stored, transmitted, analyzed, and/or manipulated at one or more locations using different devices. The processing options span a wide range. At one end of the range, all or most of this information is stored and used at the location where the test sample is processed, such as a physician's office or other clinical setting. At the other extreme, a sample is obtained at one location, processed and optionally sequenced at a different location, reads are aligned and determinations are made at one or more different locations, and a diagnosis, recommendation and/or plan is prepared at yet another location (which may be the location where the sample was obtained).
In various embodiments, the reads are generated by the sequencing device and then transmitted to a remote site where they are processed to generate aneuploidy determinations. At the remote location, for example, the reads are aligned with reference sequences to generate tags, which are counted and assigned to the chromosomes or segments of interest. Also at the remote location, these counts are converted to doses using the relevant normalizing chromosomes or segments. Still further, at the remote location, the doses are used to generate an aneuploidy determination.
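The counting step described above can be sketched as follows. This is a minimal illustration only: the `Tag` record, its field names, and the example chromosome labels are assumptions made for this sketch and are not taken from the disclosure.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Optional


class Tag(NamedTuple):
    """Hypothetical record for one aligned read (a sequence tag)."""
    read_id: str
    chromosome: Optional[str]  # None when the read could not be assigned
    position: int


def count_tags(tags: Iterable[Tag]) -> Counter:
    """Count the tags assigned to each chromosome, skipping unassigned reads."""
    counts: Counter = Counter()
    for tag in tags:
        if tag.chromosome is not None:
            counts[tag.chromosome] += 1
    return counts


# Illustrative input: four reads, one of which did not align.
tags = [
    Tag("r1", "chr21", 100),
    Tag("r2", "chr1", 5),
    Tag("r3", None, 0),
    Tag("r4", "chr21", 250),
]
counts = count_tags(tags)
```

The resulting per-chromosome counts are the input from which doses would subsequently be calculated.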
The processing operations that may be employed at different locations are as follows:
Sample collection
Sample processing before sequencing
Sequencing
Analyzing the sequence data and deriving aneuploidy determinations
Diagnosis
Reporting diagnosis and/or determination to a patient or care provider
Planning for further treatment, testing and/or monitoring
Execute the plan
Consultation
Any one or more of these operations may be automated as described elsewhere herein. Typically, sequencing and analysis of sequence data and derivation of aneuploidy determinations will be performed on a computer. Other operations may be performed manually or automatically.
Examples of locations where sample collection may be performed include healthcare worker offices, clinics, patient homes (where sample collection tools or kits are provided), and mobile care vehicles. Examples of locations where pre-sequencing sample processing may be performed include healthcare worker offices, clinics, patient homes (where sample processing devices or kits are provided), mobile care vehicles, and facilities of aneuploidy analysis vendors. Examples of locations where sequencing may be performed include healthcare worker offices, clinics, patient homes (where sample sequencing devices and/or kits are provided), mobile care vehicles, and facilities of aneuploidy analysis vendors. The location where sequencing is performed may be provided with a dedicated network connection for transmitting sequence data (typically reads) in electronic format. The connection may be wired or wireless and may be configured to send the data to a site where the data can be processed and/or aggregated before transmission to a processing site. Data aggregators may be maintained by healthcare organizations, such as Health Maintenance Organizations (HMOs).
The analysis and/or derivation operations may be performed at any of the above locations or, alternatively, at another remote site dedicated to computation and/or nucleic acid sequence data analysis services. Such locations include, for example, clusters such as general-purpose server farms, facilities of aneuploidy analysis service providers, and the like. In some embodiments, the computing device used to perform the analysis is leased or rented. The computing resources may be part of a collection of processors accessible over the internet, such as the processing resources colloquially referred to as the cloud. In some cases, the computations are performed by groups of parallel or massively parallel processors that may or may not be associated with one another. The processing may be implemented using distributed processing, such as cluster computing, grid computing, and the like. In these embodiments, the cluster or grid of computing resources collectively forms one super virtual computer comprised of multiple processors or computers that act together to perform the analysis and/or derivation described herein. These techniques, as well as more conventional supercomputers, may be used to process sequence data as described herein. Each is a form of parallel computing that relies on processors or computers. In the case of grid computing, the processors (often complete computers) are connected by conventional network protocols (e.g., ethernet) over a network (private, public, or the internet). In contrast, a supercomputer has many processors connected by a local high-speed computer bus.
In certain embodiments, the diagnosis (e.g., the fetus has Down syndrome or the patient has a particular type of cancer) is made at the same location as the analysis operation. In other embodiments, it is made at a different location. In some examples, reporting a diagnosis is performed at the location where the sample was taken, but this need not be the case. Examples of locations where a diagnosis and/or plan can be generated or reported include healthcare worker offices, clinics, computer accessible internet sites, and handheld devices such as cell phones, tablets, smart phones, etc. with wired or wireless connections to a network. Examples of locations at which consultation is provided include healthcare worker offices, clinics, computer accessible internet sites, handheld devices, and the like.
In some embodiments, sample collection, sample processing, and sequencing operations are performed at a first location, and the analysis and derivation operations are performed at a second location. However, in some cases, sample collection is performed at one location (e.g., a healthcare worker's office or clinic) and sample processing and sequencing are performed at a different location, which may optionally be the same location where analysis and derivation are performed.
In various embodiments, the sequence of operations listed above may be triggered by the user or the institution initiating sample collection, sample processing, and/or sequencing. After one or more of these operations have begun, other operations may naturally follow. For example, a sequencing operation may cause reads to be automatically collected and sent to a processing device, which then performs the sequence analysis and derivation of aneuploidy determinations, often automatically and possibly without further user intervention. In some implementations, the results of the processing operation are then automatically delivered (possibly after reformatting as a diagnosis) to a system component or institution that processes the information and reports to a health professional and/or patient. As explained, this information, possibly along with advisory information, may also be automatically processed to generate treatment, testing, and/or monitoring plans. Thus, initiating an early-stage operation may trigger an end-to-end sequence in which a health professional, patient, or other relevant party is provided with a diagnosis, plan, consultation, and/or other information useful for acting on the physical condition. This can be achieved even if the parts of the overall system are physically separated and may be remote from the location of, for example, the sample and sequencing devices.
FIG. 19 illustrates one implementation of a distributed system for generating a determination or diagnosis from a test sample. The sample collection location 01 is used to obtain test samples from a patient, such as a pregnant female or a putative cancer patient. The sample is then provided to a processing and sequencing location 03 where the test sample can be processed and sequenced as described above. Location 03 includes means for processing the sample and means for sequencing the processed sample. The sequencing result, as described elsewhere herein, is a collection of reads, typically provided in electronic format and provided to a network, such as the internet, indicated in fig. 19 by reference number 05.
The sequence data is provided to a remote location 07 where analysis and determination take place. The location may include one or more powerful computing devices, such as computers or processors. After the computing resources at location 07 have completed their analysis and generated a determination from the received sequence information, the determination is relayed back to the network 05. In some embodiments, not only is a determination made at location 07, but a related diagnosis is also made. The determination and/or diagnosis is then transmitted over the network back to the sample collection location 01, as illustrated in fig. 19. As explained, this is but one of many variations on how the different operations associated with generating a determination or diagnosis may be distributed among different locations. One common variation involves providing sample collection, processing, and sequencing at a single location. Another variation involves providing processing and sequencing at the same location as analysis and determination.
Fig. 20 elaborates on the options for performing the different operations at different locations. In the most granular arrangement depicted in fig. 20, each of the following operations is performed at a separate location: sample collection, sample processing, sequencing, read alignment, determination, diagnosis, and reporting and/or planning.
In one embodiment that aggregates some of these operations, sample processing and sequencing are performed at one location, and read alignment, determination, and diagnosis are performed at a separate location. See the portion of fig. 20 identified by reference letter A. In another implementation, identified by the letter B in fig. 20, sample collection, sample processing, and sequencing are all performed at the same location. In this implementation, read alignment and determination are performed at a second location. Finally, diagnosis and reporting and/or plan development are performed at a third location. In the implementation identified by the letter C in fig. 20, sample collection is performed at a first location; sample processing, sequencing, read alignment, determination, and diagnosis are all performed together at a second location; and reporting and/or plan development are performed at a third location. Finally, in the implementation identified by the letter D in fig. 20, sample collection is performed at a first location; sample processing, sequencing, read alignment, and determination are all performed at a second location; and diagnosis and reporting and/or plan development are performed at a third location.
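The four arrangements described for fig. 20 can be written down as a simple mapping from operation to location. The sketch below is illustrative only: the operation names and the integer location labels are assumptions made here, and arrangement A lists only the operations whose location is stated in the text.

```python
# Operations in processing order (names are illustrative labels).
OPERATIONS = [
    "sample_collection",
    "sample_processing",
    "sequencing",
    "read_alignment",
    "determination",
    "diagnosis",
    "reporting_planning",
]

# Location labels are arbitrary integers; equal labels mean "same location".
CONFIGS = {
    "A": {"sample_processing": 1, "sequencing": 1,
          "read_alignment": 2, "determination": 2, "diagnosis": 2},
    "B": {"sample_collection": 1, "sample_processing": 1, "sequencing": 1,
          "read_alignment": 2, "determination": 2,
          "diagnosis": 3, "reporting_planning": 3},
    "C": {"sample_collection": 1,
          "sample_processing": 2, "sequencing": 2, "read_alignment": 2,
          "determination": 2, "diagnosis": 2,
          "reporting_planning": 3},
    "D": {"sample_collection": 1,
          "sample_processing": 2, "sequencing": 2, "read_alignment": 2,
          "determination": 2,
          "diagnosis": 3, "reporting_planning": 3},
}


def operations_by_location(config):
    """Group the operations of one arrangement by the location that performs them."""
    grouped = {}
    for operation in OPERATIONS:
        if operation in config:
            grouped.setdefault(config[operation], []).append(operation)
    return grouped


grouped_b = operations_by_location(CONFIGS["B"])
```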
In one embodiment, the present invention provides a system for determining the presence or absence of any one or more different complete fetal chromosomal aneuploidies in a maternal test sample comprising fetal and maternal nucleic acids, the system comprising: a sequencer for receiving a nucleic acid sample and providing fetal and maternal nucleic acid sequence information from the sample; a processor; and a machine-readable storage medium comprising instructions for execution on the processor, the instructions comprising:
(a) Code for obtaining sequence information of the fetal and maternal nucleic acids in the sample;
(b) Code for identifying, by a computer, from the fetal and maternal nucleic acids, a number of sequence tags for each of any one or more chromosomes of interest selected from chromosomes 1-22, X and Y, and a number of sequence tags for at least one normalized chromosome sequence or normalized chromosome segment sequence for each of the any one or more chromosomes of interest;
(c) Code for calculating a single chromosome dose for each of the any one or more chromosomes of interest using the sequence tag numbers identified for each of the any one or more chromosomes of interest and the sequence tag numbers identified for each normalized chromosome sequence or normalized chromosome segment sequence; and
(d) Code for comparing each single chromosome dose for each of the any one or more chromosomes of interest to a respective threshold value for each of the any one or more chromosomes of interest, and thereby determining the presence or absence of any one or more different complete fetal chromosomal aneuploidies in the sample.
In some embodiments, the code for calculating the single chromosome dose for each of any one or more chromosomes of interest comprises code for calculating the chromosome dose for the selected one chromosome of interest as a ratio of the number of sequence tags for the selected chromosome of interest to the number of sequence tags identified for the corresponding at least one normalized chromosome sequence or normalized chromosome segment sequence of the selected chromosome of interest.
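The dose calculation and threshold comparison described above can be sketched as follows. The function names, the tag counts, and the threshold value are hypothetical and chosen only for illustration; the sketch also assumes that a dose above the threshold is the condition being flagged, which the text here does not specify.

```python
def chromosome_dose(tag_counts, chromosome, normalizing_chromosomes):
    """Dose = tags for the chromosome of interest / tags for its normalizing sequence(s)."""
    denominator = sum(tag_counts[c] for c in normalizing_chromosomes)
    if denominator == 0:
        raise ValueError("normalizing chromosomes have no tags")
    return tag_counts[chromosome] / denominator


def exceeds_threshold(dose, threshold):
    """Assumed comparison direction: flag doses strictly above the threshold."""
    return dose > threshold


# Made-up tag counts for one sample.
tag_counts = {"chr21": 1500, "chr9": 10000, "chr11": 10000}
dose_21 = chromosome_dose(tag_counts, "chr21", ["chr9", "chr11"])
flagged = exceeds_threshold(dose_21, threshold=0.07)  # hypothetical threshold
```

Using a group of normalizing chromosomes simply sums their tag counts in the denominator; a single normalizing chromosome is the special case of a one-element list.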
In some embodiments, the system further comprises code for repeating the calculation of a chromosome dose for each of any remaining chromosomes of the any one or more chromosomes of interest.
In some embodiments, the one or more chromosomes of interest selected from chromosomes 1-22, X, and Y comprise at least twenty chromosomes selected from chromosomes 1-22, X, and Y, and the instructions comprise instructions for determining the presence or absence of at least twenty different complete fetal chromosomal aneuploidies.
In some embodiments, the at least one normalizing chromosome sequence is a set of chromosomes selected from chromosomes 1-22, X, and Y. In other embodiments, the at least one normalizing chromosome sequence is a single chromosome selected from chromosomes 1-22, X, and Y.
In another embodiment, the invention provides a system for determining the presence or absence of any one or more different partial fetal chromosomal aneuploidies in a maternal test sample comprising fetal and maternal nucleic acids, the system comprising: a sequencer for receiving a nucleic acid sample and providing fetal and maternal nucleic acid sequence information from the sample; a processor; and a machine-readable storage medium comprising instructions for execution on the processor, the instructions comprising:
(a) Code for obtaining sequence information of the fetal and maternal nucleic acids in the sample;
(b) Code for identifying, by a computer, from the fetal and maternal nucleic acids, a number of sequence tags for each of any one or more segments of any one or more chromosomes of interest selected from the group consisting of chromosomes 1-22, X, and Y, and a number of sequence tags for at least one normalized segment sequence for each of the any one or more segments of any one or more chromosomes of interest;
(c) Code for calculating a single chromosome segment dose for each of any one or more segments of any one or more chromosomes of interest using the sequence tag numbers identified for each of the any one or more segments of any one or more chromosomes of interest and the sequence tag numbers identified for the normalized segment sequences; and
(d) Code for comparing each of the single chromosome segment doses for each of the any one or more segments of any one or more chromosomes of interest to a respective threshold for each of the any one or more chromosome segments of any one or more chromosomes of interest, and thereby determining the presence or absence of one or more different partial fetal chromosomal aneuploidies in the sample.
In some embodiments, the code for calculating the single chromosome segment dose comprises code for calculating the chromosome segment dose for the selected one chromosome segment as a ratio of the number of sequence tags identified for the selected chromosome segment to the number of sequence tags identified for the corresponding normalized segment sequence of the selected chromosome segment.
In some embodiments, the system further comprises code for repeatedly calculating chromosome segment doses for each of any remaining chromosome segments of any one or more chromosomes of interest.
In some embodiments, the system further comprises (i) code for repeating (a) - (d) for test samples from different maternal subjects, and (ii) code for determining the presence or absence of any one or more different partial fetal chromosomal aneuploidies in each of the samples.
In other embodiments of any of the systems provided herein, the code further comprises code for automatically recording the presence or absence of a fetal chromosomal aneuploidy in a patient medical record for a human subject that provides the maternal test sample as determined in (d), wherein the recording is performed using a processor.
In some embodiments of any of the systems provided herein, the sequencer is configured to perform Next Generation Sequencing (NGS). In some embodiments, the sequencer is configured to perform massively parallel sequencing using sequencing by synthesis, utilizing reversible dye terminators. In other embodiments, the sequencer is configured to perform ligation sequencing. In yet other embodiments, the sequencer is configured to perform single molecule sequencing.
Device for determining fetal fraction
Analysis of sequence tags derived from a sequenced sample (e.g., a maternal sample) can be performed using an apparatus for medical analysis of samples to provide information about the fraction that one of two genomes contributes to a nucleic acid mixture. For example, various devices are provided for analyzing sequence tags obtained from sequenced maternal samples to determine the fraction of fetal nucleic acid in a mixture of fetal and maternal nucleic acids present in the maternal sample. The provided medical apparatus comprises a series of means for performing the steps of the method for determining the fetal fraction as described elsewhere in the application.
Fig. 65 shows an embodiment of a medical analysis device for determining fetal fraction in a maternal test sample comprising a mixture of fetal and maternal nucleic acids. The apparatus comprises:
a means (a) for receiving a plurality of sequence reads of said fetal and maternal nucleic acids from said maternal test sample;
a means (b) for aligning the plurality of sequence reads to one or more chromosomal reference sequences and thereby providing a plurality of sequence tags corresponding to the sequence reads;
A means (c) for identifying a number of those sequence tags from one or more chromosomes or chromosome segments of interest selected from chromosomes 1-22, X and Y and segments thereof, and for identifying, for each of said one or more chromosomes or chromosome segments of interest, a number of those sequence tags from at least one normalized chromosome sequence or normalized chromosome segment sequence to determine a chromosome dose or chromosome segment dose, wherein said chromosomes or chromosome segments of interest have copy number variations; and
a means (d) for determining said fetal fraction using said dose of said chromosome of interest or said dose of said chromosome segment of interest.
Preferably, the signal output of the device (a) is connected to the device (b), the signal output of the device (b) is connected to the device (c), and the signal output of the device (c) is connected to the device (d).
In certain embodiments, the copy number variation is determined by comparing the chromosome dose for each of the one or more chromosomes or chromosome segments of interest to a respective threshold for each of the one or more chromosomes or chromosome segments of interest.
Copy number variations that a fetus may carry include complete chromosomal duplications, complete chromosomal deletions, partial duplications, partial multiplications, partial insertions, and partial deletions.
In certain embodiments, the chromosome or segment dose determined by means (c) is calculated as a ratio of the number of sequence tags identified for the selected chromosome or segment of interest to the number of sequence tags identified for the corresponding at least one normalized chromosome sequence or normalized chromosome segment sequence of the selected chromosome or segment of interest. In certain embodiments, the chromosome dose or segment dose determined by device (c) is calculated as a ratio of the sequence tag density ratio of the selected chromosome or segment of interest to the sequence tag density ratio of at least one corresponding normalized chromosome sequence or normalized chromosome segment sequence of each of the selected chromosomes or segments of interest.
In certain embodiments, the apparatus further comprises means (e) for calculating a Normalized Chromosome Value (NCV) or a Normalized Segment Value (NSV), wherein calculating the NCV relates the chromosome dose to the mean of the corresponding chromosome doses in a set of qualified samples as:
$$NCV_{iA} = \frac{R_{iA} - \bar{R}_{iU}}{\sigma_{iU}}$$
wherein $\bar{R}_{iU}$ and $\sigma_{iU}$ are, respectively, the estimated mean and standard deviation for the i-th chromosome dose in the set of qualified samples, and $R_{iA}$ is the chromosome dose calculated for the i-th chromosome in the test sample, wherein the i-th chromosome is the chromosome of interest; and wherein calculating the NSV relates the chromosome segment dose to the mean of the corresponding chromosome segment doses in a set of qualified samples as:
$$NSV_{iA} = \frac{R_{iA} - \bar{R}_{iU}}{\sigma_{iU}}$$
wherein $\bar{R}_{iU}$ and $\sigma_{iU}$ are, respectively, the estimated mean and standard deviation for the i-th chromosome segment dose in the set of qualified samples, and $R_{iA}$ is the chromosome segment dose calculated for the i-th chromosome segment in the test sample, wherein the i-th chromosome segment is the chromosome segment of interest. Preferably, the signal output of the device (c) is connected to the device (e).
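A normalized value of this kind can be sketched as a standard score: the test-sample dose minus the mean dose of the qualified samples, divided by their standard deviation. The sketch assumes that form and uses the sample standard deviation as the estimator, which is an assumption; the function name and the dose values are made up for illustration.

```python
from statistics import mean, stdev


def normalized_chromosome_value(test_dose, qualified_doses):
    """Standard score of the test dose relative to doses from qualified samples."""
    mu = mean(qualified_doses)
    sigma = stdev(qualified_doses)  # sample standard deviation (assumed estimator)
    if sigma == 0:
        raise ValueError("qualified doses show no variation")
    return (test_dose - mu) / sigma


# Made-up doses: three qualified samples and one test sample.
qualified = [0.98, 1.00, 1.02]
ncv = normalized_chromosome_value(1.06, qualified)  # approximately 3.0
```

The same function applies unchanged to segment doses, in which case the result corresponds to an NSV.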
In certain embodiments, means (d) of the apparatus then determines the fetal fraction according to the expression:
$$ff = 2 \times \left| NCV_{iA} \, CV_{iU} \right|$$
wherein ff is the fetal fraction value, $NCV_{iA}$ is the normalized chromosome value for the i-th chromosome in an affected sample (e.g., a maternal sample to be tested), and $CV_{iU}$ is the coefficient of variation of the doses determined for the chromosome of interest in the qualified samples; or determining the fetal fraction according to the expression:
$$ff = 2 \times \left| NSV_{iA} \, CV_{iU} \right|$$
wherein ff is the fetal fraction value, $NSV_{iA}$ is the normalized chromosome segment value for the i-th chromosome segment in an affected sample (e.g., a maternal sample to be tested), and $CV_{iU}$ is the coefficient of variation of the doses determined for the i-th chromosome in the qualified samples, wherein the i-th chromosome is the chromosome of interest. Preferably, the signal output of the device (e) is connected to the device (d).
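The fetal fraction expression can be evaluated directly once the normalized value and the coefficient of variation are known. The sketch below is illustrative; the function names and numeric inputs are assumptions, and the coefficient of variation is taken to be the standard deviation divided by the mean of the qualified-sample doses.

```python
from statistics import mean, stdev


def coefficient_of_variation(doses):
    """CV of the qualified-sample doses: standard deviation divided by mean."""
    return stdev(doses) / mean(doses)


def fetal_fraction(normalized_value, cv):
    """ff = 2 x |NCV x CV| (or |NSV x CV| for a segment)."""
    return 2 * abs(normalized_value * cv)


# Made-up inputs: an NCV of 3.0 and a CV of 0.02 give ff of about 0.12.
ff = fetal_fraction(3.0, 0.02)
```

If the normalized value is a standard score and the CV is the standard deviation over the mean, their product reduces to the relative deviation of the test dose from the qualified-sample mean, so the expression amounts to twice that relative deviation.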
In certain embodiments, the chromosome of interest is an autosome or an X chromosome of a male fetus, and the chromosome segment of interest is selected from the autosome or the X chromosome of a male fetus.
In certain embodiments, the at least one normalizing chromosome sequence or normalizing chromosome segment sequence is a chromosome or segment selected for an associated chromosome or segment of interest by: (i) identifying a plurality of qualified samples for the chromosome or segment of interest; (ii) repeatedly calculating chromosome doses or chromosome segment doses for the selected chromosome or chromosome segment using a plurality of potential normalizing chromosome sequences or normalizing chromosome segment sequences; and (iii) selecting the normalizing chromosome sequence or normalizing chromosome segment sequence that, alone or in combination, gives the smallest variability or the greatest differentiability in the calculated chromosome doses or chromosome segment doses. In certain embodiments, the normalizing chromosome sequence is a single chromosome of any one or more of chromosomes 1 to 22, X, and Y; alternatively, the normalizing chromosome sequence is a group of chromosomes selected from chromosomes 1 through 22, X, and Y. In certain embodiments, the normalizing segment sequence is a single segment of any one or more of chromosomes 1 through 22, X, and Y; alternatively, the normalizing segment sequence is a group of segments of any one or more of chromosomes 1 through 22, X, and Y.
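Steps (i) to (iii) can be sketched as a search over candidate normalizing sets. The sketch below implements only the "least variability" criterion and measures variability by the coefficient of variation of the doses across qualified samples, which is an assumption; the function name and the tag counts are hypothetical.

```python
from statistics import mean, stdev


def select_normalizing_set(qualified_counts, chromosome, candidate_sets):
    """Return the candidate set whose doses vary least across qualified samples.

    qualified_counts: list of dicts mapping chromosome name to tag count,
                      one dict per qualified sample.
    candidate_sets:   iterable of tuples of chromosome names (single
                      chromosomes or combinations).
    """
    best_set, best_cv = None, float("inf")
    for candidate in candidate_sets:
        doses = [
            sample[chromosome] / sum(sample[c] for c in candidate)
            for sample in qualified_counts
        ]
        cv = stdev(doses) / mean(doses)
        if cv < best_cv:
            best_set, best_cv = candidate, cv
    return best_set, best_cv


# Made-up tag counts for three qualified samples.
qualified_counts = [
    {"chr21": 100, "chr9": 1000, "chr14": 900},
    {"chr21": 110, "chr9": 1100, "chr14": 1200},
    {"chr21": 90, "chr9": 900, "chr14": 1000},
]
candidates = [("chr9",), ("chr14",), ("chr9", "chr14")]
best_set, best_cv = select_normalizing_set(qualified_counts, "chr21", candidates)
```

In this toy data, chromosome 9 tracks chromosome 21 exactly, so it is selected. A differentiability criterion would additionally require doses from affected samples, which this sketch does not model.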
In certain embodiments, the apparatus for determining a fetal fraction further comprises a means for comparing said fetal fraction determined using the chromosome dose or chromosome segment dose to a fetal fraction determined using information from one or more polymorphisms that exhibit allelic imbalance in the fetal and maternal nucleic acids of the maternal test sample, the polymorphisms being located on a chromosome other than said chromosome of interest.
In certain embodiments, the apparatus further comprises a sequencing device (10), the sequencing device (10) configured to sequence fetal and maternal nucleic acids in a maternal test sample and obtain sequence reads. Preferably, the signal output of the sequencing device (10) is connected to the device (a).
In certain embodiments, the sequencing device (10) is configured for performing sequencing-by-synthesis. Sequencing by synthesis can be performed using reversible dye terminators. In other embodiments, the sequencing device (10) is configured for performing ligation sequencing. In still other embodiments, the sequencing device (10) is configured for performing single molecule sequencing.
In certain embodiments, the sequencing device (10) is located in a separate location from the devices (a) - (d), and the signal output of the sequencing device (10) is connected to the device (a) via a network.
In certain embodiments, the apparatus comprising the sequencing device as described further comprises means (11) for obtaining a maternal test sample from a pregnant mother. The means (11) for obtaining a maternal test sample and the means (a) - (d) and (10) may be located in separate locations. In addition to comprising means (a) - (d) and (10), the apparatus may further comprise means (12), the means (12) for extracting cell-free DNA from the maternal test sample. In certain embodiments, the means for extracting cell-free DNA (12) is located in the same location as the sequencing means (10), and the means for obtaining a maternal test sample (11) is located in a remote location.
In certain embodiments, the apparatus for determining fetal fraction further comprises a storage device for at least temporarily storing the sequence reads accepted by device (a). Preferably, the signal output of the device (a) is connected to a storage device, the signal output of which is connected to the device (b).
Additional apparatus for determining fetal fraction: classifying copy number variations
An additional medical analysis device is also provided for classifying copy number variations in a fetal genome in a maternal sample containing fetal and maternal nucleic acids (e.g., cell-free DNA). The additional apparatus includes means for determining a fetal fraction and means for comparing the fetal fraction values determined by the different methods. The additional device uses the two calculated fetal fractions to classify copy number variations in the fetal genome. The maternal sample that can be used for analysis by the device may be selected from a blood, plasma, serum or urine sample. In certain embodiments, the maternal sample is a plasma sample. Fig. 66 shows an embodiment of such a medical analysis apparatus.
In one embodiment, there is provided a medical analysis device for classifying copy number variations in a fetal genome, the device comprising:
means (1) for receiving sequence reads from fetal and maternal nucleic acids in a test sample;
means (2) for aligning the sequence reads with one or more chromosomal reference sequences and thereby providing a plurality of sequence tags corresponding to the sequence reads;
means (3) for identifying the number of sequence tags from one or more chromosomes of interest and determining that a first chromosome of interest in the fetus carries a copy number variation;
means (4) for calculating a first fetal fraction value by a first method that does not use information from the tags of the first chromosome of interest;
means (5) for calculating a second fetal fraction value by a second method using information from the tags of the first chromosome; and
means (6) for comparing the first fetal fraction value to the second fetal fraction value and using the comparison to classify the copy number variation of the first chromosome.
Preferably, the signal output of the device (1) is connected to the device (2), the signal output of the device (2) is connected to the device (3), the signal outputs of the devices (2) and (3) are connected to the device (4), the signal outputs of the devices (2) and (3) are connected to the device (5), and the signal outputs of the devices (4) and (5) are connected to the device (6). The first chromosome of interest may be selected from any of chromosomes 1 through 22, X, and Y.
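The comparison performed by means (6) can be sketched as follows. This passage does not state how the outcome of the comparison maps to a particular class of copy number variation, so the sketch only reports whether the two values agree; the function name, the outcome labels, and the tolerance are hypothetical.

```python
def compare_fetal_fractions(ff_first, ff_second, tolerance=0.02):
    """Compare two fetal fraction estimates.

    ff_first:  value from the method that does not use tags from the
               first chromosome of interest.
    ff_second: value from the method that uses tags from that chromosome.
    tolerance: hypothetical absolute difference treated as agreement.
    """
    difference = ff_second - ff_first
    if abs(difference) <= tolerance:
        return "agree"
    return "second_lower" if difference < 0 else "second_higher"


outcome = compare_fetal_fractions(0.10, 0.11)
```

A downstream classifier would take this outcome, together with the determination made by means (3), as its input.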
In certain embodiments, the additional apparatus further comprises a storage device for at least temporarily storing the sequence reads accepted by the device (1). Preferably, the signal output of the device (1) is connected to a storage device, the signal output of which is connected to the device (2).
In certain embodiments, the means (4) of the first method for calculating a first fetal fraction comprises a component for calculating the first fetal fraction value using information from one or more polymorphisms exhibiting allelic imbalance in fetal and maternal nucleic acid of the maternal test sample, said polymorphisms being present in chromosomes other than said first chromosome of interest; and the means (5) of the second method for calculating a second fetal fraction value comprises:
(a) A component (5-1) for calculating the number of sequence tags from the first chromosome of interest and at least one normalizing chromosome sequence to determine chromosome dosage; and
(b) A component (5-2) for calculating the fetal fraction value from the chromosome dose using the second method. In certain embodiments, the signal outputs of devices (2) and (3) are connected to component (5-1), the signal output of component (5-1) is connected to component (5-2), and the signal output of component (5-2) is connected to device (6).
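The tag counting performed by component (5-1) can be sketched as follows. This is a minimal illustration, assuming the chromosome dose is the ratio of the tag count for the chromosome of interest to the tag count for the normalizing chromosome sequence; the function name, the tag representation, and the numbers are hypothetical.

```python
from collections import Counter


def chromosome_dose(tag_counts, chrom_of_interest, normalizing_chroms):
    """Ratio of tags on the chromosome of interest to tags on the
    normalizing chromosome sequence (assumed definition of dose)."""
    denominator = sum(tag_counts[c] for c in normalizing_chroms)
    if denominator == 0:
        raise ValueError("no tags mapped to the normalizing chromosome sequence")
    return tag_counts[chrom_of_interest] / denominator


# Hypothetical aligned tags, each reduced to the chromosome it maps to.
aligned_tags = ["chr21"] * 150 + ["chr9"] * 1000 + ["chr1"] * 2000
counts = Counter(aligned_tags)

# Hypothetical choice of chr9 as the normalizing chromosome for chr21.
dose_21 = chromosome_dose(counts, "chr21", ["chr9"])
```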
In certain embodiments, the information used by device (4) of the first method comprises sequence tags obtained by sequencing predetermined polymorphic sequences, each of which comprises the one or more polymorphic sites. Alternatively, the information used by the device (4) of the first method may be obtained by a method other than sequencing, for example qPCR, digital PCR, mass spectrometry, or capillary gel electrophoresis.
In certain embodiments, the apparatus (4) for the first method comprises a component for calculating the first fetal fraction value using tags from chromosomes or chromosome segments that do not have copy number variations. For example, when the first chromosome of interest is chromosome 21, the fetal fraction determined using the sequence tags from chromosome 21 can be compared to the fetal fraction determined from the sequence tags from chromosome X in a male fetus. Any chromosome or chromosome segment known not to be present in an aneuploidy state, or determined to be not aneuploid (e.g. determined by calculating its NCV or NSV) in a test sample by any of the methods described herein, may be used to determine fetal fraction by the device (4).
In certain embodiments, the apparatus (5) of the second method for calculating the fetal fraction value further comprises a component (5-3) for calculating a Normalized Chromosome Value (NCV), wherein the component (5-3) for calculating the NCV correlates the chromosome dose with the mean of the corresponding chromosome doses in a set of qualifying samples as:
$$NCV_{iA} = \frac{R_{iA} - \bar{R}_{iU}}{\sigma_{iU}}$$
wherein $\bar{R}_{iU}$ and $\sigma_{iU}$ are, respectively, the estimated mean and standard deviation of the ith chromosome dose in the set of qualifying samples, and $R_{iA}$ is the chromosome dose calculated for the ith chromosome in the test sample, wherein the ith chromosome is the chromosome of interest.
Preferably, the signal output of the component (5-1) is connected to the component (5-3) and the signal output of the component (5-3) is connected to the component (5-2).
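The calculation performed by component (5-3) can be sketched as follows: the dose of the test sample is centered on the mean dose of the qualifying samples and scaled by their standard deviation. The function name and the dose values are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption.

```python
from statistics import mean, stdev


def normalized_chromosome_value(test_dose, qualified_doses):
    """NCV = (test dose - mean of qualifying doses) / std dev of qualifying doses."""
    mu = mean(qualified_doses)
    sigma = stdev(qualified_doses)  # sample standard deviation (assumption)
    return (test_dose - mu) / sigma


# Hypothetical chromosome doses from a set of qualifying (unaffected) samples.
qualified = [0.148, 0.150, 0.152, 0.149, 0.151]

# Hypothetical dose for the same chromosome in the test sample.
ncv = normalized_chromosome_value(0.158, qualified)
```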
In certain embodiments, the component (5-2) for calculating the fetal fraction value from the chromosome dose by the second method uses the normalized chromosome value. A component (5-2) of the apparatus (5) of the second method for calculating the fetal fraction value evaluates the fetal fraction according to the expression:
$$ff = 2 \times \left| NCV_{iA} \, CV_{iU} \right|$$
wherein ff is the second fetal fraction value, $NCV_{iA}$ is the normalized chromosome value for the ith chromosome in an affected sample (e.g., a maternal sample to be tested), and $CV_{iU}$ is the coefficient of variation of the dose determined for the ith chromosome in the qualifying samples, wherein the ith chromosome is the chromosome of interest.
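A worked sketch of this expression follows. It assumes the usual definition of the coefficient of variation (standard deviation divided by mean), under which the product of NCV and CV reduces to the relative deviation of the test dose from the qualifying mean. All names and values are hypothetical.

```python
from statistics import mean, stdev


def fetal_fraction_from_ncv(ncv, cv):
    """ff = 2 * |NCV * CV|, as in the expression for component (5-2)."""
    return 2.0 * abs(ncv * cv)


# Hypothetical doses for the chromosome of interest in qualifying samples.
qualified = [0.148, 0.150, 0.152, 0.149, 0.151]
mu, sigma = mean(qualified), stdev(qualified)
cv = sigma / mu  # coefficient of variation (assumed definition)

test_dose = 0.158  # hypothetical dose in the test sample
ncv = (test_dose - mu) / sigma
ff = fetal_fraction_from_ncv(ncv, cv)

# Because NCV * CV = (test_dose - mu) / mu, the standard deviation cancels.
ff_direct = 2.0 * abs(test_dose - mu) / mu
```

With these illustrative numbers the estimate is roughly 0.107, i.e. a fetal fraction of about 10.7%.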
In certain embodiments, the apparatus (4) of the first method of calculating a first fetal fraction comprises: (a) A component (4-1) for calculating the number of sequence tags from chromosomes other than said first chromosome of interest and at least one normalizing chromosome sequence to determine chromosome doses for chromosomes other than said first chromosome of interest; and (b) a component (4-2) for calculating the first fetal fraction value from the chromosome dose by the first method; and the apparatus (5) of the second method of calculating a second fetal fraction comprises: (a) A component (5-1) for calculating the number of sequence tags from the first chromosome of interest and at least one normalizing chromosome sequence to determine a chromosome dose; and (b) a component (5-2) for calculating the second fetal fraction value from the chromosome dose by the second method.
Preferably, the device (4) of the first method further comprises a component (4-3), the device (5) of the second method further comprises a component (5-3), the components (4-3) and (5-3) respectively calculate Normalized Chromosome Values (NCV), the components (4-3) and (5-3) respectively associate the chromosome dose determined by the components (4-1) and (5-1) with the mean value of the corresponding chromosome dose in the set of qualifying samples as:
$$NCV_{iA} = \frac{R_{iA} - \bar{R}_{iU}}{\sigma_{iU}}$$
wherein $\bar{R}_{iU}$ and $\sigma_{iU}$ are, respectively, the estimated mean and standard deviation of the dose for the ith chromosome in the set of qualifying samples, and $R_{iA}$ is the calculated dose of the ith chromosome in the test sample,
wherein, for the apparatus (4) of the first method, the i-th chromosome is the chromosome other than the first chromosome of interest; for the apparatus (5) of the second method, the i-th chromosome is the first chromosome of interest.
Preferably, the signal output of the component (4-1) is connected to the component (4-3) and the signal output of the component (4-3) is connected to the component (4-2), wherein the component (4-2) calculates a first fetal fraction value from the respective chromosome dose by said first method using the respective normalized chromosome value; the signal output of the component (5-1) is connected to the component (5-3) and the signal output of the component (5-3) is connected to the component (5-2), wherein the component (5-2) calculates a second fetal fraction value from the respective chromosome dose by said second method using the respective normalized chromosome value.
In certain embodiments, the component (4-2) of the apparatus (4) of the first method and the component (5-2) of the apparatus (5) of the second method evaluate the fetal fraction according to the following expression:
$$ff = 2 \times \left| NCV_{iA} \, CV_{iU} \right|$$
wherein ff is the fetal fraction value, $NCV_{iA}$ is the normalized chromosome value for the ith chromosome in an affected sample (e.g., a maternal sample to be tested), and $CV_{iU}$ is the coefficient of variation of the dose for the ith chromosome in the qualifying samples;
wherein, for the apparatus (4) used in this first method, the ith chromosome is the chromosome other than the first chromosome of interest; for the apparatus (5) used for this second method, the ith chromosome is the first chromosome of interest. Preferably, when the fetus is a male, the chromosome other than the first chromosome of interest is an X chromosome.
In certain embodiments, the means (6) for comparing the first fetal fraction value to the second fetal fraction value determines whether the two fetal fraction values are approximately equal. In certain embodiments, device (6) further comprises a component that determines that a ploidy hypothesis implied in the second method is true when the two fetal fraction values are approximately equal. The ploidy hypothesis implied by the second method may be that the first chromosome of interest has a complete chromosomal aneuploidy, e.g., the complete chromosomal aneuploidy of the first chromosome of interest is a monosomy or a trisomy.
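The comparison made by device (6) can be sketched as a tolerance check. The text does not quantify "approximately equal", so the relative tolerance of 20% used here is purely an illustrative assumption, as are the function name and the example values.

```python
import math


def ploidy_hypothesis_supported(ff_first, ff_second, rel_tol=0.2):
    """Return True when the two fetal fraction estimates agree within a
    relative tolerance (hypothetical threshold), which is taken to support
    the ploidy hypothesis implied by the second method."""
    return math.isclose(ff_first, ff_second, rel_tol=rel_tol)


# Hypothetical estimates: the first from an unaffected chromosome, the
# second from the chromosome of interest under a complete-trisomy assumption.
concordant = ploidy_hypothesis_supported(0.100, 0.107)
discordant = ploidy_hypothesis_supported(0.100, 0.040)
```

A concordant result is consistent with a complete chromosomal aneuploidy; a discordant result would hand the sample on for further analysis of the tag information.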
In certain embodiments, said additional apparatus further comprises a means (7) for analyzing tag information of said first chromosome of interest to determine whether (i) the first chromosome of interest carries a partial aneuploidy, or (ii) the fetus is a chimera, wherein the means (7) for analyzing tag information of the first chromosome of interest is configured to perform when said means (6) for comparing the first fetal fraction value to the second fetal fraction value indicates that the two fetal fraction values are not approximately equal. Preferably, the signal outputs of the devices (2), (3) and (6) are connected to the device (7).
In certain embodiments, in said additional apparatus, the means (4) of the first method comprises a component for calculating the first fetal fraction value using information from one or more polymorphisms exhibiting allelic imbalance in fetal and maternal nucleic acid of the maternal test sample, said polymorphisms being present in a chromosome other than said first chromosome of interest; the apparatus (5) of the second method comprises a component for calculating the second fetal fraction value using information from one or more polymorphisms exhibiting allelic imbalance in fetal and maternal nucleic acid of the maternal test sample, said polymorphisms being present in said first chromosome of interest. The information used by the device (4) of the first method may include sequence tags obtained by sequencing predetermined polymorphic sequences, each of which includes the one or more polymorphic sites. Alternatively, the information used by the apparatus (4) of the first method may be obtained by a method other than sequencing, for example qPCR, digital PCR, mass spectrometry, or capillary gel electrophoresis.
In certain embodiments, the means (6) for comparing comprises: a component for determining that the first chromosome of interest is diploid when the ratio of the second fetal fraction value to the first fetal fraction value is approximately 1; a component for determining that the first chromosome of interest is triploid when the ratio of the second fetal fraction value to the first fetal fraction value is approximately 1.5; and a component for determining that the first chromosome of interest is haploid when the ratio of the second fetal fraction value to the first fetal fraction value is approximately 0.5.
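The ratio test described above can be sketched as follows. The expected ratios of 1, 1.5 and 0.5 come from the text; the tolerance of 0.1 around each ratio, the function name, and the "unresolved" label are assumptions introduced for illustration.

```python
def classify_by_ratio(ff_first, ff_second, tol=0.1):
    """Classify the first chromosome of interest from the ratio of the
    second fetal fraction value to the first (tolerance is hypothetical)."""
    if ff_first <= 0:
        raise ValueError("first fetal fraction value must be positive")
    ratio = ff_second / ff_first
    for expected, label in ((1.0, "diploid"), (1.5, "triploid"), (0.5, "haploid")):
        if abs(ratio - expected) <= tol:
            return label
    # None of the expected ratios matched: candidate for further analysis
    # (partial aneuploidy or chimera).
    return "unresolved"


calls = [
    classify_by_ratio(0.10, 0.10),
    classify_by_ratio(0.10, 0.15),
    classify_by_ratio(0.10, 0.05),
    classify_by_ratio(0.10, 0.125),
]
```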
More preferably, the additional apparatus for classifying copy number variations further comprises a means (7') for analyzing the tag information of said first chromosome of interest to determine whether (i) the first chromosome of interest carries a partial aneuploidy or (ii) the fetus is a chimera, wherein the means (7') for analyzing the tag information of the first chromosome of interest is configured to operate when said means (6) for comparing the first fetal fraction value to the second fetal fraction value indicates that the ratio of the second fetal fraction value to the first fetal fraction value is not approximately 1, 1.5 or 0.5. Preferably, the signal outputs of the devices (2), (3) and (6) are connected to the device (7').
In certain embodiments, the means (7) or (7') for analyzing tag information for the first chromosome of interest comprises: (a) A component (7-1) for binning the sequence of the first chromosome of interest into a plurality of portions; (b) A component (7-2) for determining whether any of said portions contains significantly more or significantly less nucleic acid than one or more other portions; and (c) a component (7-3) for determining that the first chromosome of interest carries a partial aneuploidy if any of said parts contains significantly more or significantly less nucleic acid than one or more other parts, or that the fetus is a chimera if none of said parts contains significantly more or significantly less nucleic acid than one or more other parts. Preferably, the signal outputs of the devices (2), (3) and (6) are connected to the component (7-1), and the signal output of the component (7-1) is connected to the component (7-2), and the signal output of the component (7-2) is connected to the component (7-3). In certain embodiments, component (7-3) further determines that a portion of the first chromosome of interest comprising significantly more or significantly less nucleic acid than one or more other portions carries a partial aneuploidy.
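The decision made by components (7-2) and (7-3) can be sketched on per-portion tag counts as follows. The text does not specify how "significantly more or significantly less" is tested, so the robust z-score (median and median absolute deviation) and the cutoff of 4 are assumptions; all names and counts are hypothetical.

```python
from statistics import median


def deviating_bins(bin_counts, z_cutoff=4.0):
    """Indices of portions whose tag count departs strongly from the rest,
    using a robust z-score (an assumed stand-in for 'significantly')."""
    med = median(bin_counts)
    mad = median([abs(c - med) for c in bin_counts])
    scale = 1.4826 * mad if mad > 0 else 1.0  # fallback scale is arbitrary
    return [i for i, c in enumerate(bin_counts) if abs(c - med) / scale > z_cutoff]


def classify_discordant_chromosome(bin_counts, z_cutoff=4.0):
    """Partial aneuploidy if any portion deviates, otherwise chimera."""
    outliers = deviating_bins(bin_counts, z_cutoff)
    if outliers:
        return "partial aneuploidy", outliers
    return "chimera", []


# Hypothetical tag counts per portion of the first chromosome of interest.
with_segment_gain = [100, 102, 98, 101, 99, 150, 152, 100, 97, 103]
uniform = [100, 102, 98, 101, 99, 100, 97, 103]

call_a = classify_discordant_chromosome(with_segment_gain)
call_b = classify_discordant_chromosome(uniform)
```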
In certain embodiments, the first chromosome of interest is selected from the group consisting of chromosomes 1-22, X, and Y.
In certain embodiments, the apparatus (6) comprises means for classifying the copy number variation into a category selected from the group consisting of: whole chromosome insertions or duplications, whole chromosome deletions, partial chromosome duplications, partial chromosome deletions, and chimeras.
In certain embodiments, the additional medical analysis device further comprises:
(i) Means (8) for determining whether the copy number variation is caused by a partial aneuploidy or a chimera; and
(ii) Means (9) for determining the locus of a partial aneuploidy on the first chromosome of interest if the copy number variation is caused by a partial aneuploidy.
The means (8) and (9) are configured to operate when the means (6) for comparing the first fetal fraction value with the second fetal fraction value determines that the first fetal fraction value and the second fetal fraction value are not approximately equal. Preferably, the signal output of the device (6) is connected to the device (8) and the signal output of the device (8) is connected to the device (9). In certain embodiments, the means (9) for determining the locus of a partial aneuploidy on the first chromosome of interest comprises a component for assigning the sequence tags of the first chromosome of interest to bins or blocks of nucleic acids in the first chromosome of interest; and a component for counting the mapped tags in each bin.
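The binning and counting performed by the components of device (9) can be sketched as follows. Fixed-width bins, 0-based tag positions, and half-open intervals are assumptions made for illustration; the function names and the toy coordinates are hypothetical.

```python
def count_tags_per_bin(tag_positions, chrom_length, bin_size):
    """Count mapped tags in fixed-width bins along one chromosome.
    Positions are assumed 0-based and smaller than chrom_length."""
    n_bins = -(-chrom_length // bin_size)  # ceiling division
    counts = [0] * n_bins
    for pos in tag_positions:
        counts[pos // bin_size] += 1
    return counts


def bins_to_intervals(bin_indices, chrom_length, bin_size):
    """Merge consecutive bin indices into half-open (start, end) intervals,
    giving the locus of each run of flagged bins."""
    intervals = []
    for idx in sorted(bin_indices):
        start = idx * bin_size
        end = min(start + bin_size, chrom_length)
        if intervals and intervals[-1][1] == start:
            intervals[-1] = (intervals[-1][0], end)
        else:
            intervals.append((start, end))
    return intervals


# Toy example: a 100-base "chromosome" split into 10-base bins.
counts = count_tags_per_bin([5, 15, 25, 26, 27, 95], chrom_length=100, bin_size=10)
loci = bins_to_intervals([5, 6, 9], chrom_length=100, bin_size=10)
```

Which bins are flagged as deviating is decided elsewhere; here the flagged indices are simply supplied by hand.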
In certain embodiments, the additional apparatus further comprises a sequencing device (10) configured to sequence fetal and maternal nucleic acids in a maternal test sample (e.g., a blood, plasma, serum, or urine sample) and obtain the sequence reads. Preferably, the fetal and maternal nucleic acids are cell-free DNA (cfDNA). Preferably, the signal output of the sequencing device (10) is connected to the device (1).
In certain embodiments, the sequencing device (10) is configured to perform sequencing-by-synthesis. Sequencing by synthesis can be performed using reversible dye terminators. Alternatively, the sequencing device (10) is configured to perform sequencing by ligation. Alternatively, the sequencing device (10) is configured to perform single molecule sequencing. In certain embodiments, the sequencing device (10) and the devices (1) - (6) of the additional apparatus for classifying are located in separate locations. Preferably, the signal output of the sequencing device (10) is connected to the device (1) via a network.
In certain embodiments, the additional apparatus for classifying further comprises means (11) for obtaining the maternal test sample from the pregnant mother. The device (11) and the devices (1) - (6) may be located in separate locations. In addition, the additional apparatus may further comprise means (12) for extracting cell-free DNA from the maternal test sample. The means (12) for extracting cell-free DNA may be located in the same location as the sequencing means (10), and wherein the means (11) for obtaining the maternal test sample is located in a remote location.
In certain embodiments, device (2) aligns at least about 1 million reads.
Reagent kit
In various embodiments, kits are provided for carrying out the methods described herein. In certain embodiments, these kits include one or more positive internal controls for complete aneuploidy and/or partial aneuploidy. Typically, but not necessarily, these controls include internal positive controls (IPCs) that include nucleic acid sequences of the type to be screened. For example, a control for a test to determine the presence or absence of a fetal trisomy (e.g., trisomy 21) in a maternal sample can include DNA characterized by trisomy 21 (e.g., DNA obtained from an individual with trisomy 21). In some embodiments, the control comprises a mixture of DNA obtained from two or more individuals with different aneuploidies. For example, for tests to determine the presence or absence of trisomy 13, trisomy 18, trisomy 21, and monosomy X, the control may comprise a combination of DNA samples obtained from pregnant women each carrying a fetus with one of the aneuploidies tested. In addition to complete chromosomal aneuploidies, IPCs can be generated to provide positive controls for tests that determine the presence or absence of partial aneuploidies.
In certain embodiments, the positive control(s) comprise one or more nucleic acids comprising trisomy 21 (T21) and/or trisomy 18 (T18) and/or trisomy 13 (T13). In certain embodiments, the nucleic acids comprising each of the trisomies present are provided in separate containers. In certain embodiments, nucleic acids comprising two or more trisomies are provided in a single container. Thus, for example, in certain embodiments, a container may comprise T21 and T18, T21 and T13, or T18 and T13. In certain embodiments, the container may contain T18, T21, and T13. In these various embodiments, the trisomies may be provided in equal amounts/concentrations. In other embodiments, the trisomies may be provided in a particular predetermined ratio. In various embodiments, the control may be provided as a "stock" solution of known concentration.
In certain embodiments, the control for detecting aneuploidy comprises a mixture of cellular genomic DNA obtained from two subjects, one being a contributor to the aneuploidy genome. For example, as explained above, an Internal Positive Control (IPC) generated as a control for a test to determine a fetal trisomy, e.g., trisomy 21, may comprise a combination of genomic DNA from a male or female subject carrying the trisomy chromosome and genomic DNA from a female subject known not to carry the trisomy chromosome. In certain embodiments, the genomic DNA is sheared to provide fragments between about 100-400bp, between about 150-350bp, or between about 200-300bp to mimic circulating cfDNA fragments in a maternal sample.
In certain embodiments, the proportion of fragmented DNA in the control from subjects carrying an aneuploidy (e.g., trisomy 21) is selected to mimic the proportion of circulating fetal cfDNA found in maternal samples, so as to provide an IPC comprising a mixture of fragmented DNA comprising about 5%, about 10%, about 15%, about 20%, about 25%, or about 30% DNA from subjects carrying the aneuploidy. In certain embodiments, the control comprises DNA from different subjects each carrying a different aneuploidy. For example, an IPC may comprise about 80% unaffected female DNA, and the remaining 20% may be DNA from three different subjects, one carrying trisomy 21, one carrying trisomy 13, and one carrying trisomy 18.
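The arithmetic behind such a mixture can be sketched as follows. The function name, the total DNA mass, and the equal split of the affected portion among the contributing subjects are assumptions for illustration only; the text does not state how the affected portion is divided.

```python
def ipc_mixture(total_ng, affected_fraction, n_affected_donors=1):
    """Split a total mass of fragmented DNA into an unaffected portion and
    equal portions per affected donor (equal split is an assumption)."""
    if not 0 < affected_fraction < 1:
        raise ValueError("affected_fraction must be between 0 and 1")
    affected_total = total_ng * affected_fraction
    return {
        "unaffected_ng": total_ng - affected_total,
        "per_affected_donor_ng": affected_total / n_affected_donors,
    }


# Hypothetical 100 ng control: 80% unaffected female DNA, with the remaining
# 20% shared among three donors (trisomy 21, trisomy 13, trisomy 18).
mix = ipc_mixture(100.0, 0.20, n_affected_donors=3)
```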
In certain embodiments, the control(s) comprise cfDNA obtained from a maternal host known to be pregnant with a fetus having a known chromosomal aneuploidy. For example, these controls may include cfDNA obtained from pregnant women carrying fetuses with trisomy 21 and/or trisomy 18 and/or trisomy 13. The cfDNA can be extracted from maternal samples and cloned into bacterial vectors and grown in bacteria to provide a continuous source of IPC. Alternatively, the cloned cfDNA may be amplified by, for example, PCR.
While the controls present in the kit are described above with respect to trisomies, they need not be so limited. It will be appreciated that the positive controls present in the kit may be generated to reflect other partial aneuploidies, including, for example, various segment amplifications and/or deletions. Thus, for example, where different cancers are known to be associated with a particular amplification or deletion of a substantially intact chromosome arm, the positive control(s) can include the short or long arm of any one or more of chromosomes 1-22, X, and Y. In certain embodiments, the control comprises amplification of one or more arms selected from the group consisting of: 1q, 3q, 4p, 4q, 5p, 5q, 6p, 6q, 7p, 7q, 8p, 8q, 9p, 9q, 10p, 10q, 12p, 12q, 13q, 14q, 16p, 17q, 18p, 18q, 19p, 19q, 20p, 20q, 21q and/or 22q (see, e.g., Table 2).
In certain embodiments, these controls include aneuploidies of any region known to be associated with a particular amplification or deletion (e.g., breast cancer associated with amplification at 20q13). Illustrative regions include, but are not limited to, 17q23 (associated with breast cancer), 19q12 (associated with ovarian cancer), 1q21-1q23 (associated with sarcoma and various solid tumors), 8p11-p12 (associated with breast cancer), the ErbB2 amplicon, and the like. In certain embodiments, these controls comprise amplification or deletion of a chromosomal region as set forth in any of Tables 3-6. In certain embodiments, these controls comprise an amplification or deletion of a chromosomal region comprising a gene as set forth in any of Tables 3-6. In certain embodiments, the controls comprise nucleic acid sequences comprising amplification of nucleic acids comprising one or more oncogenes. In certain embodiments, the controls comprise nucleic acid sequences comprising amplification of nucleic acid comprising one or more genes selected from the group consisting of: MYC, ERBB2 (EFGR), CCND1 (cyclin D1), FGFR1, FGFR2, HRAS, KRAS, MYB, MDM2, CCNE, KRAS, MET, ERBB1, CDK4, MYCB, ERBB2, AKT2, MDM2, and CDK4.
The above references are intended to be illustrative and not limiting. Using the teachings provided herein, one of ordinary skill in the art will be able to identify many other controls suitable for incorporation into a kit.
In various embodiments, in addition to or as an alternative to these controls, the kits include one or more nucleic acids and/or nucleic acid mimetics that provide a marker sequence suitable for tracking and determining the integrity of the sample. In certain embodiments, these markers comprise an antigenomic sequence. In certain embodiments, the length of these marker sequences is in the range of about 30bp up to about 600bp or about 100bp to about 400bp. In certain embodiments, the marker sequence(s) is at least 30bp (or nt) in length. In certain embodiments, the marker is ligated to an adaptor, and the adaptor-ligated marker molecule is between about 200bp (or nt) and about 600bp (or nt), between about 250bp (or nt) and 550bp (or nt), between about 300bp (or nt) and 500bp (or nt), or between about 350bp (or nt) and 450bp (or nt) in length. In certain embodiments, the adaptor-ligated marker molecule is about 200bp (or nt) in length. In certain embodiments, the marker molecule may be about 150bp (or nt), about 160bp (or nt), about 170bp (or nt), about 180bp (or nt), about 190bp (or nt), or about 200bp (or nt) in length. In certain embodiments, the length of the marker ranges up to about 600bp (or nt).
In certain embodiments, the kit provides at least two, or at least three, or at least four, or at least five, or at least six, or at least seven, or at least eight, or at least nine, or at least ten, or at least 11, or at least 12, or at least 13, or at least 14, or at least 15, or at least 16, or at least 17, or at least 18, or at least 19, or at least 20, or at least 25, or at least 30, or at least 35, or at least 40, or at least 50 different sequences. The different nucleic acids and/or nucleic acid mimetics providing the marker sequence(s) may be stored in separate containers/bottles. Alternatively, different marker molecules may be stored in the same container/bottle.
In various embodiments, the markers comprise one or more DNAs, or the markers comprise one or more DNA mimetics. Suitable mimetics include, but are not limited to, morpholino derivatives, peptide nucleic acids (PNA), and phosphorothioate DNA. In various embodiments, these markers are incorporated into these controls. In certain embodiments, these markers are incorporated into and/or otherwise attached to adaptors.
In certain embodiments, the kit further comprises one or more sequencing adaptors. These adaptors include, but are not limited to, indexed sequencing adaptors. In certain embodiments, the adaptors comprise a single-stranded arm comprising an index sequence and one or more PCR priming sites.
In certain embodiments, the kit further comprises a sample collection device for collecting the biological sample. In certain embodiments, the sample collection device comprises a means for collecting blood and, optionally, a container for holding blood. In certain embodiments, the kit comprises a container for holding blood, and the container comprises an anticoagulant and/or a cell fixative and/or one or more antigenomic marker sequences.
In certain embodiments, the kit further comprises DNA extraction reagents (e.g., separation matrix and/or elution solution). The kit may also include reagents for sequencing library preparation. These reagents include, but are not limited to, solutions for end-repairing DNA and/or solutions for dA-tailing DNA and/or solutions for adaptor-ligating DNA.
In certain embodiments, the kit further comprises a composition comprising one or more primer sets for amplifying at least one preselected polymorphic nucleic acid in a maternal sample, wherein each preselected polymorphic nucleic acid comprises at least one polymorphic site, and wherein a forward or reverse primer in each primer set hybridizes to a DNA sequence sufficiently proximal to the polymorphic site to be included in sequence reads generated by said massively parallel sequencing of the amplified preselected polymorphic nucleic acids. Sequencing the amplified preselected polymorphic sequence may be used to determine the fetal fraction in a maternal sample, as described elsewhere in this application. The preselected polymorphic nucleic acids may comprise SNPs or STRs. In certain embodiments, at least one primer in each of the primer sets is designed to recognize a polymorphic site present within a sequence read of about 25bp, about 40bp, about 50bp, or about 100 bp. In certain embodiments, the primer set hybridizes to the DNA sequence to produce an amplicon of at least about 100bp, at least about 150bp, or at least about 200 bp. The primer set may hybridize to DNA sequences present on the same chromosome, or the primer set may hybridize to DNA sequences present on a different chromosome. In certain embodiments, the primer set does not hybridize to a DNA sequence present on chromosome 13, 18, 21, X, or Y.
Embodiments of kits provided for practicing these methods and for use in combination with various devices as described herein are illustrated in Figs. 67 and 68. In one embodiment, the kit provides for determining fetal fraction. As shown in Fig. 67, the kit comprises a kit body (1); a holding slot arranged in the kit body for receiving bottles; a bottle (2) containing an internal positive control; a bottle (3) containing a marker nucleic acid suitable for tracking and determining the integrity of a sample; and a bottle (4) containing a buffer solution.
The kit can include a plurality of additional bottles, wherein each of the plurality of bottles includes a different internal positive control or a different marker nucleic acid.
In certain embodiments, the vial (2) includes two or more internal positive controls. The internal positive control comprises a trisomy selected from the group consisting of: trisomy 21, trisomy 18, trisomy 13, trisomy 16, trisomy 9, trisomy 8, trisomy 22, XXX, XXY, and XYY. In certain embodiments, the internal positive control comprises a trisomy selected from the group consisting of: trisomy 21 (T21), trisomy 18 (T18), and trisomy 13 (T13). In other embodiments, the internal positive controls loaded into vial (2) include trisomy 21 (T21), trisomy 18 (T18), and trisomy 13 (T13). Alternatively, the positive control included in the kit may include an amplification or deletion of a portion of one or more of chromosomes 1 through 22, X, and Y. In certain embodiments, the positive control comprises an amplification or deletion of a short arm or a long arm of any one or more of chromosomes 1 to 22, X, and Y. In certain embodiments, the vial (2) comprises an amplification or deletion of one or more arms selected from the group consisting of: 1q, 3q, 4p, 4q, 5p, 5q, 6p, 6q, 7p, 7q, 8p, 8q, 9p, 9q, 10p, 10q, 12p, 12q, 13q, 14q, 16p, 17q, 18p, 18q, 19p, 19q, 20p, 20q, 21q and 22q. In other embodiments, the vial (2) comprises an amplification of a region selected from the group consisting of: 20q13, 19q12, 1q21-1q23, 8p11-p12 and ErbB2. Alternatively, the positive control loaded into the vial (2) comprises amplification of a region or a gene set forth in Table 3, Table 4, Table 5 and Table 6. In certain embodiments, the positive control loaded into the vial (2) comprises amplification of a region or a gene selected from the group consisting of: MYC, ERBB2 (EFGR), CCND1 (cyclin D1), FGFR1, FGFR2, HRAS, KRAS, MYB, MDM2, CCNE, KRAS, MET, ERBB1, CDK4, MYCB, ERBB2, AKT2, MDM2, and CDK4.
Marker nucleic acids, also known as marker molecules (MMs), included in various embodiments of the kits are antigenomic marker sequences. The length of these marker sequences may range from about 30bp to about 600bp. In other embodiments, the length of these marker sequences ranges from about 100bp to about 400bp. In certain embodiments, the kit comprises at least 2, or at least 3, or at least 4, or at least 5, or at least 6, or at least 7, or at least 8, or at least 9, or at least 10, or at least 11, or at least 12, or at least 13, or at least 14, or at least 15, or at least 16, or at least 17, or at least 18, or at least 19, or at least 20, or at least 25, or at least 30, or at least 35, or at least 40, or at least 50 vials for different marker sequences.
In certain embodiments, the marker included in the kit comprises one or more DNAs. In other embodiments, the marker comprises one or more mimetics selected from the group consisting of: morpholino derivatives, peptide Nucleic Acids (PNA) and phosphorothioate DNA.
In certain embodiments, a marker is incorporated into the control. In other embodiments, the marker is incorporated into the adaptor. In certain embodiments, vial (3) of the kit may be further loaded with one or more sequencing adaptors. The adaptors include indexed sequencing adaptors. The adaptors may further comprise a single-stranded arm comprising an index sequence and one or more PCR priming sites.
Fig. 68 shows a schematic of a kit that may further include a sample collection device for collecting a biological sample. The sample collection device comprises a device (5) for collecting blood and a container (6) for holding blood. In certain embodiments, the device for collecting blood and the container for holding blood comprise an anticoagulant and a cell fixative.
In certain embodiments, the kit may further comprise a bottle (7), the bottle (7) being loaded with DNA extraction reagents. The DNA extraction reagent(s) may comprise a separation matrix and/or an elution solution.
In certain embodiments, the kit further comprises a bottle (8), the bottle (8) being loaded with reagents for preparing a sequencing library. These reagents for preparing sequencing libraries can include solutions for end repair of DNA, dA tailing of DNA, and adaptor ligation of DNA.
In other embodiments, the kit further comprises a bottle (9), the bottle (9) comprising a composition of primers for amplifying a predetermined target nucleic acid.
In certain embodiments, the kit further comprises instructional material teaching the use of the reagents to determine the fetal fraction in a biological sample. These instructional materials teach the use of these materials to detect trisomy or monosomy. In certain embodiments, these instructional materials teach the use of these materials to detect cancer or a predisposition for cancer.
In addition, these kits may optionally include labeling and/or instructional materials that provide instructions (e.g., protocols) for using the reagents and/or devices provided in the kit. For example, the instructional materials can teach the use of the reagents to prepare samples and/or determine copy number variations in biological samples. In certain embodiments, the instructional materials teach the use of these materials for detecting trisomy. In certain embodiments, these instructional materials teach the use of these materials to detect cancer or a predisposition for cancer.
Although the instructional materials in the various kits typically comprise written or printed materials, they are not limited thereto. Any medium capable of storing these instructions and communicating them to an end user is contemplated herein. Such media include, but are not limited to, electronic storage media (e.g., magnetic disks, tapes, cartridges, chips), optical media (e.g., CD ROMs), and the like. The media may include an address to an internet site that provides the instructional material.
Various methods, devices, systems and uses are described in further detail in the following examples, which are in no way intended to limit the scope of the invention as claimed. The accompanying drawings are intended to be part of this specification and description of the invention. The following examples are provided to illustrate, but not to limit, the claimed invention.
Experimental
Example 1
Sample processing and cfDNA extraction
Peripheral blood samples were collected from pregnant women in the first or second trimester of pregnancy who were considered at risk for fetal aneuploidy. Consent was obtained from each participant prior to the blood draw. Blood was collected prior to amniocentesis or chorionic villus sampling. Karyotype analysis was performed using the chorionic villus or amniocentesis samples to determine the fetal karyotype.
Peripheral blood drawn from each subject was collected in ACD tubes. One tube of blood sample (about 6 to 9 ml/tube) was transferred to a 15 ml low-speed centrifuge tube. Blood was centrifuged at 2640 rpm for 10 minutes at 4 ℃ using a Beckman Allegra 6R centrifuge and a model GA 3.8 rotor.
For cell-free plasma extraction, the upper plasma layer was transferred to a 15 ml high-speed centrifuge tube and centrifuged at 16,000 x g for 10 minutes at 4 ℃ using a Beckman Coulter Avanti J-E centrifuge and JA-14 rotor. The two centrifugation steps were performed within 72 hours of blood collection. Cell-free plasma containing cfDNA was stored at -80 ℃ and thawed only once before amplification of plasma cfDNA or purification of cfDNA.
Purified cell-free DNA (cfDNA) was extracted from cell-free plasma using a QIAamp DNA Blood Mini kit (Qiagen), essentially according to the manufacturer's instructions. One ml of buffer AL and 100 μl of protease solution were added to 1 ml of plasma. The mixture was incubated at 56 ℃ for 15 minutes. One ml of 100% ethanol was added to the plasma digest. The resulting mixture was transferred to a QIAamp mini column assembled with the VacValve and VacConnector provided in the QIAvac 24 Plus column assembly (Qiagen). Vacuum was applied to the sample, and the cfDNA retained on the column filter was washed under vacuum with 750 μl of buffer AW1, followed by a second wash with 750 μl of buffer AW 24. The column was centrifuged at 14,000 RPM for 5 minutes to remove any residual buffer from the filter. The cfDNA was eluted with buffer AE by centrifugation at 14,000 RPM, and the concentration was determined using the Qubit™ Quantification Platform (Invitrogen).
Example 2
Preparation and sequencing of initial and enriched sequencing libraries
a. Preparation of sequencing libraries: shortened (abbreviated) protocol (ABB)
All sequencing libraries, i.e. the initial and enriched libraries, were prepared from about 2 ng of purified cfDNA extracted from maternal plasma. Library preparation was performed using reagents of the NEBNext™ DNA Sample Prep DNA Reagent Set 1 (Part No. E6000L; New England Biolabs, Ipswich, Mass.) as follows. Because cell-free plasma DNA is fragmented in nature, the plasma DNA samples were not further fragmented by nebulization or sonication. According to the NEBNext End Repair Module, the cfDNA was converted to phosphorylated blunt ends by combining it with 5 μl of 10X phosphorylation buffer, 2 μl of deoxynucleotide solution mix (10 mM each dNTP), and 1 μl of 1 […] provided in the NEBNext™ DNA Sample Prep DNA Reagent Set 1. The enzyme was then heat inactivated by incubating the reaction mixture at 75 ℃ for 5 minutes. The mixture was cooled to 4 ℃, and dA tailing of the blunt-ended DNA was accomplished using 10 μl of the dA-tailing master mix containing the Klenow fragment (3' to 5' exo minus) (NEBNext™ DNA Sample Prep DNA Reagent Set 1) and incubating at 37 ℃ for 15 minutes. Subsequently, the Klenow fragment was heat inactivated by incubating the reaction mixture at 75 ℃ for 5 minutes. Following inactivation of the Klenow fragment, 1 μl of a 1:5 dilution of the Illumina Genomic Adaptor Oligo Mix (Part No. 1000521; Illumina, Hayward, Calif.) was used to ligate the Illumina adaptors (non-indexed Y-adaptors) to the dA-tailed DNA using 4 μl of the T4 DNA ligase provided in the NEBNext™ DNA Sample Prep DNA Reagent Set 1, by incubating the reaction mixture at 25 ℃ for 15 minutes. The mixture was cooled to 4 ℃, and the adaptor-ligated cfDNA was purified from unligated adaptors, adaptor dimers, and other reagents using the magnetic beads provided in the Agencourt AMPure XP PCR purification system (Part No. A63881; Beckman Coulter Genomics, Danvers, Mass.). Eighteen cycles of PCR were performed to selectively enrich the adaptor-ligated cfDNA (25 μl) using Phusion High-Fidelity Master Mix (25 μl; Finnzymes, Woburn, Mass.) and Illumina PCR primers (0.5 μM each) complementary to the adaptors (Part Nos. 1000537 and 1000538). The adaptor-ligated DNA was subjected to PCR (98 ℃ for 30 seconds; 18 cycles of 98 ℃ for 10 seconds, 65 ℃ for 30 seconds, and 72 ℃ for 30 seconds; final extension at 72 ℃ for 5 minutes; hold at 4 ℃) using the Illumina Genomic PCR Primers (Part Nos. 1000537 and 1000538) and the Phusion HF PCR Master Mix provided in the NEBNext™ DNA Sample Prep DNA Reagent Set 1, according to the manufacturer's instructions. The amplified product was purified using the Agencourt AMPure XP PCR purification system (Agencourt Bioscience Corporation, Beverly, Mass.) according to the manufacturer's instructions, available at www.beckmangenomics.com/products/AMPureXPProtocol_000387v001.pdf. The purified amplified product was eluted in 40 μl of Qiagen EB Buffer, and the concentration and size distribution of the amplified library were analyzed using the Agilent DNA 1000 Kit for the 2100 Bioanalyzer (Agilent Technologies Inc., Santa Clara, Calif.).
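The cycling conditions quoted above can be written out as a small program description. The following is a minimal sketch only: the variable and function names are illustrative, and the calculation adds up programmed hold times without accounting for instrument ramping or the final 4 ℃ hold.

```python
# Cycling program as read from the text: (temperature in deg C, hold in seconds).
INITIAL_DENATURATION = (98, 30)
CYCLE_STEPS = [(98, 10), (65, 30), (72, 30)]   # denature, anneal, extend
N_CYCLES = 18
FINAL_EXTENSION = (72, 5 * 60)

def programmed_seconds(initial, cycle_steps, n_cycles, final):
    """Sum of programmed hold times; ramping and the 4 C hold are ignored."""
    per_cycle = sum(seconds for _, seconds in cycle_steps)
    return initial[1] + n_cycles * per_cycle + final[1]

total_seconds = programmed_seconds(
    INITIAL_DENATURATION, CYCLE_STEPS, N_CYCLES, FINAL_EXTENSION
)
print(total_seconds / 60)   # 26.5 (minutes of programmed hold time)
```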
b. Preparation of sequencing libraries: full-length protocol
The full-length protocol described herein is essentially the standard protocol provided by Illumina and differs from the Illumina protocol only in the purification of the amplified library. The Illumina protocol specifies that the amplified library be purified using gel electrophoresis, whereas the protocol described herein uses magnetic beads for the same purification step. An initial sequencing library was prepared from approximately 2 ng of purified cfDNA extracted from maternal plasma using the NEBNext™ DNA Sample Prep DNA Reagent Set 1 (Part No. E6000L; New England Biolabs, Ipswich, Mass.), essentially according to the manufacturer's instructions. All steps, except for the final purification of the adaptor-ligated product (which was performed using magnetic beads and reagents instead of a purification column), were performed according to the protocol accompanying the NEBNext™ reagents for sample preparation of a genomic DNA library to be sequenced using the Illumina GAII. The NEBNext™ protocol essentially follows the protocol provided by Illumina, which is available at grcf.jhml.edu/hts/protocols/11257047_ChIP_sample_Prep.pdf.
According to the NEBNext End Repair Module, 40 μl of cfDNA was combined with 5 μl of 10X phosphorylation buffer, 2 μl of deoxynucleotide solution mix (10 mM each dNTP), and 1 μl of 1 […] provided in the NEBNext™ DNA Sample Prep DNA Reagent Set 1. The samples were cooled to 4 ℃ and purified using a QIAquick PCR purification column (Qiagen, Valencia, Calif.) as follows. The 50 μl reaction was transferred to a 1.5 ml microcentrifuge tube, and 250 μl of Qiagen buffer PB was added. The resulting 300 μl was transferred to a QIAquick column and centrifuged in a microcentrifuge at 13,000 RPM for 1 minute. The column was washed with 750 μl of Qiagen buffer PE and recentrifuged. Residual ethanol was removed by an additional centrifugation at 13,000 RPM for 5 minutes. The DNA was eluted in 39 μl of Qiagen buffer EB by centrifugation. dA tailing of 34 μl of the blunt-ended DNA was accomplished using 16 μl of the dA-tailing master mix containing the Klenow fragment (3' to 5' exo minus) (NEBNext™ DNA Sample Prep DNA Reagent Set 1) and incubating at 37 ℃ for 30 minutes according to the manufacturer's dA-Tailing Module. The samples were cooled to 4 ℃ and purified using a column provided in the MinElute PCR purification kit (Qiagen, Valencia, Calif.) as follows. The 50 μl reaction was transferred to a 1.5 ml microcentrifuge tube, and 250 μl of Qiagen buffer PB was added. The 300 μl was transferred to a MinElute column, which was centrifuged in a microcentrifuge at 13,000 RPM for 1 minute. The column was washed with 750 μl of Qiagen buffer PE and recentrifuged. Residual ethanol was removed by an additional centrifugation at 13,000 RPM for 5 minutes. The DNA was eluted in 15 μl of Qiagen buffer EB by centrifugation. According to the Quick Ligation Module, ten microliters of the eluted DNA were incubated with 1 μl of a 1:5 dilution of the Illumina Genomic Adaptor Oligo Mix (Part No. 1000521), 15 μl of 2X quick ligation buffer, and 4 μl of Quick T4 DNA ligase at 25 ℃ for 15 minutes.
The sample was cooled to 4 ℃ and purified using a MinElute column as follows. One hundred fifty microliters of Qiagen buffer PE was added to the 30 μl reaction, and the entire volume was transferred to a MinElute column, which was centrifuged in a microcentrifuge at 13,000 RPM for 1 minute. The column was washed with 750 μl of Qiagen buffer PE and recentrifuged. Residual ethanol was removed by an additional centrifugation at 13,000 RPM for 5 minutes. The DNA was eluted in 28 μl of Qiagen buffer EB by centrifugation. Twenty-three microliters of the adaptor-ligated DNA eluate were subjected to 18 cycles of PCR (98 ℃ for 30 seconds; 18 cycles of 98 ℃ for 10 seconds, 65 ℃ for 30 seconds, and 72 ℃ for 30 seconds; final extension at 72 ℃ for 5 minutes; hold at 4 ℃) using the Illumina Genomic PCR Primers (Part Nos. 1000537 and 1000538) and the Phusion HF PCR Master Mix provided in the NEBNext™ DNA Sample Prep DNA Reagent Set 1, according to the manufacturer's instructions. The amplified product was purified using the Agencourt AMPure XP PCR purification system (Agencourt Bioscience Corporation, Beverly, Mass.) according to the manufacturer's instructions, available at www.beckmangenomics.com/products/AMPureXPProtocol_000387v001.pdf. The Agencourt AMPure XP PCR purification system removes unincorporated dNTPs, primers, primer dimers, salts, and other contaminants, and recovers amplicons greater than 100 bp. The amplified product was eluted from the Agencourt beads in 40 μl of Qiagen EB buffer, and the size distribution of the library was analyzed using the Agilent DNA 1000 Kit for the 2100 Bioanalyzer (Agilent Technologies Inc., Santa Clara, Calif.).
c. Analysis of sequencing libraries prepared according to the shortened (a) and full-length (b) protocols
The electropherograms produced by the Bioanalyzer are shown in Figs. 21A and 21B. Fig. 21A shows the electropherogram of library DNA prepared from cfDNA purified from plasma sample M24228 using the full-length protocol described in (b), while Fig. 21B shows the electropherogram of library DNA prepared from cfDNA purified from plasma sample M24228 using the shortened protocol described in (a). In both figures, peaks 1 and 4 represent the 15 bp lower marker and the 1,500 bp upper marker, respectively; the numbers above the peaks indicate the migration times of the library fragments; and the horizontal lines indicate the threshold set for integration. The electropherogram in Fig. 21A shows a minor peak of fragments of 187 bp and a major peak of fragments of 263 bp, while the electropherogram in Fig. 21B shows only one peak, at 265 bp. Integration of the peak areas gave a calculated DNA concentration of 0.40 ng/μl for the 187 bp peak in Fig. 21A, 7.34 ng/μl for the 263 bp peak in Fig. 21A, and 14.72 ng/μl for the 265 bp peak in Fig. 21B. The Illumina adaptors ligated to the cfDNA are known to be 92 bp, which, when subtracted from 265 bp, indicates that the peak size of the cfDNA is 173 bp. The minor peak at 187 bp may represent fragments of two primers ligated end-to-end. When the shortened protocol is used, these linear two-primer fragments are eliminated from the final library product. The shortened protocol also eliminates other fragments smaller than 187 bp. In this example, the concentration of purified adaptor-ligated cfDNA obtained with the shortened protocol is twice the concentration of adaptor-ligated cfDNA produced using the full-length protocol. The concentration of adaptor-ligated cfDNA fragments obtained with the shortened protocol was consistently greater than that obtained using the full-length protocol (data not shown).
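The size and concentration arithmetic in this paragraph can be checked with a short calculation. This is a minimal sketch; the variable names are illustrative, and only the numerical values are taken from the text.

```python
# Values quoted in the text for the electropherograms of Figs. 21A and 21B.
ADAPTOR_BP = 92      # length contributed by the ligated Illumina adaptors
PEAK_21B_BP = 265    # single library peak in Fig. 21B
CONC_263_BP = 7.34   # ng/ul at the 263 bp peak (Fig. 21A)
CONC_265_BP = 14.72  # ng/ul at the 265 bp peak (Fig. 21B)

# cfDNA insert size: library peak minus adaptor length.
insert_bp = PEAK_21B_BP - ADAPTOR_BP

# Fold difference in adaptor-ligated cfDNA concentration between the two libraries.
fold = CONC_265_BP / CONC_263_BP

print(insert_bp)        # 173
print(round(fold, 2))   # 2.01
```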
Thus, one advantage of preparing sequencing libraries using the shortened protocol is that the resulting library consistently comprises only one major peak in the 262-267 bp range, whereas the quality of libraries prepared using the full-length protocol varies, as reflected by the number and mobility of peaks other than the one representing cfDNA. Non-cfDNA products occupy space on the flow cell and diminish the quality of cluster amplification and subsequent imaging of the sequencing reactions, which underlie the overall assignment of aneuploidy status. The shortened protocol was shown not to affect sequencing of the library.
Another advantage of preparing sequencing libraries using the shortened protocol is that the three enzymatic steps of blunt-ending, dA tailing, and adaptor ligation take less than one hour to complete, which supports the validation and implementation of a rapid aneuploidy diagnostic service.
Another advantage is that the three enzymatic steps of blunt-ending, dA tailing, and adaptor ligation are performed in the same reaction tube, thus avoiding multiple sample transfers that could lead to loss of material and, more importantly, to sample mix-up and sample contamination.
Example 3
Preparation of sequencing libraries from unrepaired cfDNA: adaptor ligation in solution
To determine whether the shortened protocol could be shortened further to expedite sample analysis, sequencing libraries were made from unrepaired cfDNA and sequenced using the Illumina Genome Analyzer II as described previously.
cfDNA was prepared from peripheral blood samples as described herein. Blunt-ending and phosphorylation of the 5' ends, as required by the published protocols for the Illumina platform, were not performed, so as to provide an unrepaired cfDNA sample.
It was determined that omission of DNA repair or DNA repair and phosphorylation did not affect the quality or yield of the sequencing library (data not shown).
In solution 2-step method for unrepaired DNA
In the first set of experiments, dA tailing and adaptor ligation were performed on unrepaired cfDNA simultaneously by combining Klenow Exo- and T4 DNA ligase in the same reaction mixture as follows: thirty microliters of cfDNA at concentrations between 20 and 150 pg/μl were dA-tailed (5 μl of 10X NEB buffer #2, 2 μl of 10 mM dNTP, 1 μl of 10 mM ATP, and 1 μl of 5000 U/ml Klenow Exo-) and ligated to Illumina Y-adaptors (1 μl of a 1:15 dilution of a 3 μM stock) using 1 μl of 400,000 U/ml T4 DNA ligase in a 50 μl reaction volume. The non-indexed Y-adaptors were obtained from Illumina. The combined reactions were incubated at 25 ℃ for 30 minutes. The enzymes were heat inactivated at 75 ℃ for 5 minutes, and the reaction products were stored at 10 ℃.
The adaptor-ligated product was purified using SPRI beads (Agencourt AMPure XP PCR purification system, Beckman Coulter Genomics) and subjected to 18 cycles of PCR. The PCR-amplified library was purified using SPRI beads and sequenced using the Illumina Genome Analyzer IIx or HiSeq according to the manufacturer's instructions to obtain single-end reads of 36 bp. A large number of 36 bp reads were obtained, covering approximately 10% of the genome. Upon completion of sequencing of a sample, the Illumina "Sequencer Control Software/Real-time Analysis" transferred the base call files in binary format to a network-attached storage device for data analysis. Sequence data were analyzed using software designed to run on a Linux server, which converted the binary-format base calls into human-readable text files using the Illumina "BCLConverter" and then called the open-source "Bowtie" program to align the sequences to the reference human genome derived from the hg18 genome (NCBI36/hg18) provided by the National Center for Biotechnology Information, available on the world wide web at http://genome.ucsc.edu/cgi-bin/hgGateway?org=Human&db=hg18&hgsid=166260105.
The software then read, from the Bowtie output (bowtieout. […]), the sequence data generated by the above procedure that aligned uniquely to the genome. Sequence alignments with up to 2 base mismatches were allowed and were included in the alignment counts only if they aligned uniquely to the genome. Sequence alignments with identical start and end coordinates (duplicates) were excluded. About 5 million to 25 million 36 bp tags with 2 or fewer mismatches were mapped uniquely to the human genome. All mapped tags were counted and included in the calculation of chromosome doses in both test and qualified samples. Regions extending from base 0 to base 2 × 10⁶, from base 10 × 10⁶ to base 13 × 10⁶, and from base 23 × 10⁶ to the end of chromosome Y were specifically excluded from the analysis, because tags derived from either male or female fetuses map to these regions of the Y chromosome.
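The tag-counting rules described in this paragraph can be sketched as follows. This is an illustrative sketch and not the software used in the example: the record layout (`chrom`, `start`, `end`, `mismatches`, `n_hits`), the function names, and the value used for the end of chromosome Y are all assumptions, and duplicates are handled here by keeping the first alignment at each pair of coordinates, which is one possible reading of the exclusion rule.

```python
from collections import Counter

# Hypothetical placeholder for the end of chromosome Y (hg18), in bases.
CHR_Y_END = 58_000_000
# chrY intervals excluded from analysis, as half-open [start, end) ranges.
EXCLUDED_Y = [(0, 2_000_000), (10_000_000, 13_000_000), (23_000_000, CHR_Y_END)]

def in_excluded_region(chrom, start):
    """True if a tag starts inside one of the excluded chrY intervals."""
    if chrom != "chrY":
        return False
    return any(lo <= start < hi for lo, hi in EXCLUDED_Y)

def count_tags(alignments, max_mismatches=2):
    """Count uniquely aligned, de-duplicated tags per chromosome.

    Each alignment is a dict with keys chrom, start, end, mismatches, n_hits
    (an assumed layout, not the Bowtie output format).
    """
    seen = set()
    counts = Counter()
    for aln in alignments:
        if aln["n_hits"] != 1:                    # keep unique alignments only
            continue
        if aln["mismatches"] > max_mismatches:    # allow up to 2 mismatches
            continue
        key = (aln["chrom"], aln["start"], aln["end"])
        if key in seen:                           # identical coordinates: duplicate
            continue
        seen.add(key)
        if in_excluded_region(aln["chrom"], aln["start"]):
            continue
        counts[aln["chrom"]] += 1
    return counts

def percent_per_chromosome(counts):
    """Percentage of all counted tags that map to each chromosome."""
    total = sum(counts.values())
    return {c: 100.0 * n / total for c, n in counts.items()} if total else {}
```

The per-chromosome percentages returned by `percent_per_chromosome` correspond to the "% chromosome N" values plotted in the figures that follow.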
Fig. 22A shows the average (N = 16) of the percentage of the total number of sequence tags mapped to each human chromosome (% chromosome N) when the sequencing libraries were prepared according to the shortened protocol (ABB) and when the sequencing libraries were prepared according to the repair-free 2-step method (INSOL; □). These data show that preparing sequencing libraries using the repair-free 2-step method resulted in a greater percentage of tags mapped to chromosomes with lower GC content, and a smaller percentage of tags mapped to chromosomes with higher GC content, when compared with the percentage of tags mapped to the corresponding chromosomes using the shortened method. Fig. 22B is a plot of the percentage of sequence tags as a function of chromosome size and shows that the repair-free method reduces sequencing bias. The regression coefficients for mapped tags obtained from sequencing libraries prepared according to the shortened protocol (ABB; Δ) and the repair-free protocol in solution (2-step; □) were R² = 0.9332 and R² = 0.9806, respectively.
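A value of R² for the relationship between percent mapped tags and chromosome size can be obtained from a straight-line fit. The sketch below assumes that the quoted R² values are ordinary coefficients of determination from simple linear regression, which the text does not state explicitly; the function name is illustrative.

```python
def r_squared(x, y):
    """Coefficient of determination for a least-squares straight-line fit of y on x.

    For simple linear regression this equals the squared Pearson correlation.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return (sxy * sxy) / (sxx * syy)
```

In use, `x` would hold the chromosome sizes and `y` the percentage of tags mapped to each chromosome for one library preparation method; a value closer to 1 indicates that tag counts track chromosome size more closely.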
Table 8. Chromosome size and percent GC content

Chromosome   Size (Mbp)   GC (%)     Chromosome   Size (Mbp)   GC (%)
Chr1         247          41.37      Chr13        114          38.24
Chr2         243          39.44      Chr14        106          40.85
Chr3         199          38.74      Chr15        100          41.80
Chr4         191          38.60      Chr16        89           44.64
Chr5         181          39.35      Chr17        79           45.01
Chr6         171          39.94      Chr18        76           39.66
Chr7         159          39.78      Chr19        63           48.21
Chr8         146          40.30      Chr20        62           42.05
Chr9         140          40.17      Chr21        47           40.68
Chr10        135          40.43      Chr22        50           47.64
Chr11        134          41.37      ChrX         155          39.26
Chr12        132          40.59      ChrY         58           37.74
The shortened method and the repair-free 2-step method were also compared by expressing the ratio of the percentage of tags mapped to each chromosome when the repair-free method was used to the percentage of tags mapped to each chromosome when the shortened method was used, as a function of the percent GC content of each chromosome. The percent GC content relative to chromosome size was calculated based on published information on chromosome sequences and GC content binning (Constantini et al., Genome Res 16:536-541 [2006]) and is provided in Table 8. The results are provided in Fig. 22C, which shows that the ratio is significantly decreased for chromosomes with high GC content and increased for chromosomes with low GC content. These data clearly show the normalizing effect that the repair-free method has in overcoming GC bias.
These data show that the repair-free method corrects, to some extent, the GC bias that is known to be associated with sequencing of amplified DNA.
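The comparison plotted in Fig. 22C amounts to a per-chromosome ratio ordered by GC content. The following minimal sketch uses the GC values of Table 8 for a few chromosomes; the function name and the dictionary-based inputs are assumptions made for illustration, and no measured tag percentages are reproduced here.

```python
# Percent GC content for a few chromosomes, taken from Table 8.
GC_PERCENT = {
    "chr4": 38.60, "chr13": 38.24, "chr17": 45.01, "chr19": 48.21, "chr22": 47.64,
}

def method_ratio_by_gc(pct_repair_free, pct_shortened, gc_percent=GC_PERCENT):
    """Return (GC %, chromosome, ratio) tuples sorted by increasing GC content.

    ratio = percent of tags with the repair-free method
            / percent of tags with the shortened method
    """
    rows = []
    for chrom, gc in gc_percent.items():
        ratio = pct_repair_free[chrom] / pct_shortened[chrom]
        rows.append((gc, chrom, ratio))
    return sorted(rows)
```

A ratio above 1 for low-GC chromosomes and below 1 for high-GC chromosomes would correspond to the trend described for Fig. 22C.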
To determine whether the repair-free method affected the proportion of fetal versus maternal cfDNA that was sequenced, the percentage of the number of tags mapped to chromosomes X and Y was determined. Figs. 23A and 23B show bar graphs providing the mean and standard deviation of the percentage of tags mapped to chromosome X (Fig. 23A; % chromosome X) and chromosome Y (Fig. 23B; % chromosome Y), obtained from sequencing 10 cfDNA samples purified from the plasma of 10 pregnant women. Fig. 23A shows that a greater number of tags mapped to the X chromosome when the repair-free method was used, relative to the number obtained using the shortened method. Fig. 23B shows that the percentage of tags mapped to the Y chromosome when the repair-free method was used was not different from that obtained when the shortened method was used.
These data show that the no repair method does not introduce any bias towards or against sequencing fetal versus maternal DNA, i.e. the proportion of fetal sequences sequenced is not changed when the no repair method is used.
Taken together, these data show that repair-free methods do not adversely affect the quality of the sequencing library, nor the information obtained from sequencing the library. Eliminating the DNA repair steps required for the published protocols would reduce reagent costs and speed sequencing library preparation.
In solution 2-step method for indexed unrepaired DNA
In a second set of experiments, dA tailing was performed on unrepaired cfDNA, followed by heat inactivation of Klenow Exo- and adaptor ligation. When ligation was performed using non-indexed Illumina adaptors, which carry a single-stranded arm of 21 bases, omitting the heat inactivation of Klenow did not affect the yield or quality of the sequencing library.
To determine whether the repair-free method is applicable to multiplexed sequencing, in-house indexed Y-adaptors comprising a 6-base index sequence were used to generate libraries with or without Klenow heat inactivation. Unlike the non-indexed adaptors, the indexed adaptors comprise a single-stranded arm of 43 bases, which includes the index sequence and the PCR priming site.
Twelve different indexed adaptors identical to the Illumina TruSeq adaptors were made starting from oligonucleotides obtained from Integrated DNA Technologies (Coralville, Iowa). The oligonucleotide sequences were obtained from the published Illumina TruSeq indexed adaptor sequences. The oligonucleotides were dissolved in annealing buffer (10 mM Tris, 1 mM EDTA, 50 mM NaCl, pH 7.5) to a final concentration of 300 μM. Equimolar mixes, typically 10 μl (300 μM each), of the oligonucleotides forming the two arms of any given indexed adaptor were combined and allowed to anneal (95 ℃ for 6 minutes, followed by slow, controlled cooling from 95 ℃ to 10 ℃). The final 150 μM adaptor was diluted to 7.5 μM in 10 mM Tris, 1 mM EDTA (pH 8) and stored at -20 ℃ until use.
The data show that, when indexed adaptors are used, library preparation by the 2-step method is not feasible if active Klenow Exo- is present in the same reaction with both the ligase and the indexed adaptors. However, the 2-step method is entirely feasible if Klenow Exo- is first heat inactivated at 75 ℃ for 5 minutes, followed by addition of the ligase plus indexed adaptors. It is possible that, when the indexed adaptors and active Klenow Exo- are present together, the strand displacement activity of Klenow Exo- results in digestion of the longer single-stranded DNA arm of the indexed adaptor, thereby eliminating the PCR primer site. Electropherograms of sequencing libraries prepared from the same cfDNA and enzymes, with or without a heat inactivation step after the Klenow Exo- reaction, showed that a library with the expected profile (main peak at 290 bp) could be made by including heat inactivation of Klenow Exo- before adding the ligase and indexed adaptors in the 2-step method (data not shown). Thus, so that the repair-free method could be applied to multiplexed sequencing, all experiments using the indexed Y-adaptors were modified to include heat inactivation of Klenow Exo-.
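The concentrations quoted for adaptor annealing follow from simple dilution arithmetic. The sketch below assumes that the two oligonucleotides are mixed in equal volumes and anneal completely, which the text implies but does not state; the function names and the 100 μl example volume are illustrative.

```python
def mix_concentration(conc_each_um, n_components):
    """Concentration of each component after mixing equal volumes of n stocks."""
    return conc_each_um / n_components

def dilution_volumes(stock_um, target_um, final_volume_ul):
    """Volumes of stock and diluent (in ul) needed to reach the target concentration."""
    stock_ul = final_volume_ul * target_um / stock_um
    return stock_ul, final_volume_ul - stock_ul

# Two 300 uM oligonucleotides mixed 1:1 give 150 uM of annealed adaptor.
duplex_um = mix_concentration(300.0, 2)

# Diluting 150 uM to 7.5 uM is a 20-fold dilution.
fold = duplex_um / 7.5

# Example: preparing 100 ul of 7.5 uM working stock (volume chosen for illustration).
stock_ul, diluent_ul = dilution_volumes(duplex_um, 7.5, 100.0)

print(duplex_um, fold, stock_ul, diluent_ul)   # 150.0 20.0 5.0 95.0
```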
Example 4
Preparation of sequencing libraries from unrepaired cfDNA: adaptor ligation on a solid surface (SS)
1-step solid surface method for unrepaired DNA
To determine whether the repair-free library process can be further simplified, the repair-free sequencing library preparation method described in example 3 was configured to be performed on a solid surface. The prepared library was sequenced as described in example 3.
cfDNA was prepared from peripheral blood samples as described in Example 1. Polypropylene tubes were coated with streptavidin (SA) and washed, and a first set of biotinylated indexed adaptors was bound to the streptavidin-coated tubes as follows. Tubes of 8-well PCR tube strips (USA Scientific, Ocala, Fla.) were coated with 50 μl of PBS containing 0.5 nanomoles of streptavidin (Thermo Scientific, Rockford, Ill.) by incubating the SA overnight at 4 ℃. The tubes were washed four times with 200 μl of 1X TE each. Biotinylated index 1 adaptor (7.5, 3.75, 1.8, or 0.9 picomoles, each in 50 μl of TE) was added in duplicate to the SA-coated tubes and incubated for 25 minutes at room temperature. Unbound adaptor was removed, and the tubes were washed four times with 200 μl of TE. The biotinylated index 1 adaptor was made as described in Example 3, using a biotinylated universal adaptor oligonucleotide purchased from IDT.
1-step SS method Using cfDNA from non-pregnant subjects
In a second strip of PCR tubes, either a control sample (NTC: no-template control) or 30 μl of purified cfDNA at about 120 pg/μl (i.e. about 32 femtomoles) obtained from non-pregnant women was incubated with 5 units of Klenow Exo- in NEB buffer #2 containing 20 nanomoles of dNTPs and 10 nanomoles of ATP in a 50 μl reaction volume for 15 minutes at 37 ℃. Subsequently, the Klenow enzyme was inactivated by incubating the reaction mixture at 75 ℃ for 5 minutes. The Klenow-DNA mixture was transferred to the corresponding tube containing the SA-bound biotinylated adaptor, and the cfDNA was ligated to the immobilized adaptor by incubating the mixture with 400 units of T4 DNA ligase in 10 μl of 1X T4 DNA ligase buffer for 15 minutes at 25 ℃. Subsequently, 7.5 picomoles of non-biotinylated index 1 adaptor was ligated to the solid-phase-bound cfDNA by incubation with 200 units of T4 DNA ligase in 10 μl of buffer for 15 minutes at 25 ℃. The reaction mixture was removed, and the tube was washed 5 times with 200 μl of TE buffer. The adaptor-ligated cfDNA was amplified by PCR using 50 μl of Phusion PCR mix (New England Biolabs) containing P5 and P7 primers (IDT; 1 μM each), with cycling as follows: 98 ℃ for 30 seconds; 18 cycles of 98 ℃ for 10 seconds, 50 ℃ for 10 seconds, 60 ℃ for 10 seconds, and 72 ℃ for 10 seconds; 72 ℃ for 5 minutes; hold at 10 ℃. The resulting library products were subjected to SPRI cleanup (Beckman Coulter Genomics), and library quality was assessed from the profiles obtained by analysis on a High Sensitivity Bioanalyzer chip (Agilent Technologies, Santa Clara, Calif.). These profiles show that solid-phase sequencing library preparation from unrepaired cfDNA provides high-yield, high-quality sequencing libraries (data not shown).
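The stated amount of about 32 femtomoles of cfDNA can be reproduced from the volume and concentration given above. The following is a sketch under two assumptions that the text does not state at this point: an average fragment length of 173 bp (the cfDNA peak size reported in Example 2) and an average mass of 650 g/mol per base pair of double-stranded DNA. The function name is illustrative.

```python
AVG_BP_MASS = 650.0   # g/mol per base pair of dsDNA (common approximation, assumed)
FRAGMENT_BP = 173     # cfDNA peak size reported in Example 2 (assumed to apply here)

def dsdna_femtomoles(volume_ul, conc_pg_per_ul, length_bp=FRAGMENT_BP):
    """Convert a dsDNA mass (volume x concentration) into femtomoles of fragments."""
    mass_g = volume_ul * conc_pg_per_ul * 1e-12
    moles = mass_g / (length_bp * AVG_BP_MASS)
    return moles * 1e15

cfdna_fmol = dsdna_femtomoles(30, 120)

# Molar excess of the 7.5 pmol (7500 fmol) of free index 1 adaptor over cfDNA.
adaptor_excess = 7500.0 / cfdna_fmol

print(round(cfdna_fmol))   # 32
```

Under these assumptions the free adaptor is present in a molar excess of more than two hundred-fold over the cfDNA fragments.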
1-step SS method Using cfDNA from pregnant Subjects
The Solid Surface (SS) method was tested using cfDNA samples obtained from pregnant women.
cfDNA was prepared from 8 peripheral blood samples obtained from pregnant women as described in example 1, and a sequencing library was prepared from the purified cfDNA as described above. The library was sequenced and sequence information was analyzed.
Figure 24 shows the ratio of the number of non-excluded sites (NE sites) on the reference genome (hg18) to the total number of tags mapped to those non-excluded sites for each of 5 samples from which cfDNA was prepared and used to construct sequencing libraries according to the shortened protocol (ABB; filled bars) described in Example 2, the repair-free protocol in solution (2-step; open bars) described in Example 3, and the repair-free protocol on a solid surface (1-step; grey bars) described in this example.
The data shown in Figure 24 indicate that PCR-amplified sequences are represented comparably in libraries prepared according to the three protocols, indicating that the solid surface method does not skew the variety of sequences represented in the library.
Fig. 25A shows that the number of sequence tags uniquely mapped to each chromosome obtained when sequencing a library prepared according to the no repair solid surface method is comparable to the number obtained when the no repair 2-step method in solution described above is used. The data show that both repair-free methods reduce GC bias of the sequencing data.
Fig. 25B shows the relationship between the number of mapped tags and the size of the chromosome to which the tags mapped. The regression coefficients for mapped tags obtained from sequencing libraries prepared according to the shortened protocol (ABB), the repair-free protocol in solution (2-step), and the repair-free protocol on a solid surface (1-step) were R² = 0.9332, R² = 0.9802, and R² = 0.9807, respectively.
Fig. 25C shows, as a function of the percent GC content of each chromosome, the ratio of the percent mapped sequence tags per chromosome obtained from sequencing libraries prepared according to the repair-free 2-step protocol to that obtained from sequencing libraries prepared according to the shortened protocol (ABB) (#), and the ratio of the percent mapped sequence tags per chromosome obtained from sequencing libraries prepared according to the repair-free 1-step protocol to that obtained from sequencing libraries prepared according to the shortened protocol (ABB) (□). Taken together, the data in Figs. 25B and 25C show that the 1-step and 2-step methods have a similar GC-normalizing effect, because both omit the DNA repair step of the library process.
To determine whether the repair-free method affected the proportion of fetal versus maternal cfDNA that was sequenced, the percentage of the number of tags mapped to chromosomes X and Y was determined. Figs. 26A and 26B show a comparison of the mean and standard deviation of the percentage of tags mapped to chromosome X (Fig. 26A) and chromosome Y (Fig. 26B), obtained from sequencing 5 cfDNA samples purified from the plasma of 5 pregnant women using the ABB, 2-step, and 1-step methods. Fig. 26A shows that a greater number of tags mapped to the X chromosome when the repair-free methods (2-step and 1-step) were used, relative to the number obtained using the shortened method (filled bars). Fig. 26B shows that the percentage of tags mapped to the Y chromosome when the repair-free 2-step and 1-step methods were used was not different from that obtained when the shortened method was used.
These data show that the no repair solid surface 1 step method does not introduce any bias towards or against sequencing fetal versus maternal DNA, i.e. the proportion of fetal sequences sequenced is not changed when the no repair solid surface method is used.
In summary, the data show that generating a sequencing library on a solid surface is an easy and feasible option for preparing samples for sequencing.
Example 5
High-throughput compatibility of repair-free solid surface 1-step library preparation method
To determine whether repair-free 1-step library preparation for sequencing by NGS techniques can be applied to high-throughput sample processing, 96 cfDNA libraries were prepared from 96 peripheral blood samples in a 96-well PCR plate coated with SA-bound indexed adaptors. The prepared libraries were sequenced as described in example 5.
Coating of the first PCR plate with SA and ligation of biotinylated indexed adaptors was performed as described in example 4. Each column of wells of a 96-well plate was coated with biotinylated adaptors containing a unique index. Using a second 96-well PCR plate, 37 different cfDNAs, each in 30 μl, were dA-tailed at 37 ℃ for 15 min in the presence of 10 μl of Klenow master mix each, followed by Klenow inactivation at 75 ℃ for 5 min. Several cfDNAs were used in multiple wells, for a total of 94 wells containing cfDNA; 2 wells were used as no template controls. The dA-tailed cfDNA mix was transferred to the first PCR plate and ligated to the bound biotinylated adaptors in the presence of 10 μl of fast ligase master mix 1 at 25 ℃ using a PCT-225 quadruplex gradient cycling heater (Bio-Rad, Hercules, CA). Then 10 μl of ligation master mix 2, tailored to each indexed adaptor, was added, and ligation proceeded for 15 min at 5 ℃. Unbound DNA was removed and the bound DNA-biotinylated adaptor complexes were washed five times with TE buffer. To each well 50 μl of PCR master mix was added, and the adaptor-ligated DNA was amplified and SPRI cleaned as described in example 4. The libraries were diluted and analyzed using a HiSens BA chip.
A correlation between the amount of purified cfDNA used to prepare the sequencing library and the resulting amount of library product was obtained for 61 clinical samples prepared using the ABB method (fig. 27A) and 35 study samples prepared using the repair-free SS 1 step method (fig. 27B). These data show that the correlation is significantly greater for the libraries prepared using the repair-free SS 1 step method (R² = 0.5826; fig. 27A) than for those prepared using the shortened method described in example 2 (R² = 0.1534; fig. 27B). Note that the cfDNA samples in this comparison are not identical, as clinical samples are not available for development. However, these results show that the no repair SS 1 step method consistently has a greater correlation of cfDNA input to library output than the ABB method. Subsequently, the correlations of the 3 methods, i.e., the ABB, no repair 2-step, and no repair SS 1-step methods, were compared using serial dilutions of the same purified cfDNA for all three methods. As shown in FIG. 28, the best correlation was obtained when the library was prepared according to the SS 1 step method (R² = 0.9457; Δ), followed by the 2-step method (R² = 0.7666; □) and the ABB method, which had a significantly lower correlation (R² = 0.0386; o). These data show that, compared with the method that includes end modification of the cfDNA [DNA repair and phosphorylation], the repair-free methods provide consistent and predictable yields, whether performed in solution or on a solid surface, and whether with or without purification of the repaired DNA and dA-tailed product.
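The R² values quoted for the input-versus-yield comparisons can be reproduced from paired measurements with an ordinary least-squares fit. The following is a minimal sketch, assuming a simple straight-line fit; the function name `r_squared` and the dilution values are hypothetical and are not taken from the patent data.

```python
from statistics import mean


def r_squared(x, y):
    """Coefficient of determination for a least-squares line y = a + b*x.

    For a single predictor this equals the squared Pearson correlation.
    """
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return (sxy * sxy) / (sxx * syy)


# Hypothetical serial dilution: cfDNA input versus library yield
# (illustrative numbers only, arbitrary units).
cfdna_input = [0.5, 1.0, 2.0, 4.0]
library_yield = [5.1, 9.8, 20.5, 39.7]
print(round(r_squared(cfdna_input, library_yield), 4))
```

A value near 1 indicates that library output tracks cfDNA input closely, which is the property the text attributes to the repair-free methods.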
The time taken to prepare a library according to the solid surface method described in this example is several times less than when preparing a sequencing library according to the shortened method. For example, 10 to 14 samples can be prepared manually using the ABB method in about 4 hours, whereas 96 or 192 libraries can be prepared manually in 4 and 5 hours, respectively, when the SS 1 step method is used. Also, the SS 1 step method can be easily automated to prepare libraries in multiples of 96 for multiplex sequencing using NGS technology. Therefore, the SS method would be suitable for commercial automated high throughput sample analysis.
Analysis of DNA libraries shows that solid phase sequencing library preparation of unrepaired cfDNA provides high yield and high quality sequencing libraries that can be configured for use in automated processes to further accelerate sample analysis requiring massively parallel sequencing using NGS technology. The solid surface method is suitable for repaired DNA.
Example 6
Multiplex sequencing of libraries prepared according to the 1-step SS method
In multiplex format, library samples prepared by the SS 1 step method on 96-well plates (example 20) were sequenced with six different indexed samples per Illumina HiSeq sequencer flow cell lane. The prepared libraries were sequenced as described in example 2. The data shown in fig. 29 compare the indexing efficiency, as assessed by multiplex sequencing, between the 2-step (filled bars) and SS 1 step (open bars) methods. These data show that the preparation of libraries on solid surfaces does not compromise the indexing efficiency. FIG. 30A shows the percentage of the total number of sequence tags mapped to each human chromosome (% chromosome N) when a sequencing library was prepared according to the 1-step solid surface method, and FIG. 30B (R² = 0.9807) shows the percentage of sequence tags as a function of chromosome size. Fig. 30A and 30B show that the GC bias for the SS 1 step method is the same as for the 2 step method, since both processes use DNA repair-free sample preparation enzymology.
Figure 31 shows the percentage of sequence tags mapped to the Y chromosome relative to tags mapped to the X chromosome, obtained from sequencing 42 libraries prepared with the SS 1 step method with indexed adaptors and sequenced in multiplex using Illumina sequencing by synthesis with reversible terminator technology. The data clearly distinguish samples obtained from pregnant women carrying male fetuses from those carrying female fetuses.
Example 7
Sample processing and DNA extraction
A peripheral blood sample is collected from a pregnant woman in the first trimester or second trimester of pregnancy and who is considered at risk for fetal aneuploidy. Consent was obtained from each participant prior to blood draw. Blood is collected prior to amniocentesis or chorionic villus sampling. Karyotyping is performed using chorionic villus or amniocentesis samples to determine fetal karyotype.
Peripheral blood drawn from each subject was collected in ACD tubes. One tube of blood sample (about 6 to 9 ml/tube) was transferred to a 15 ml low speed centrifuge tube. The blood was centrifuged at 2640 rpm for 10 minutes at 4 ℃ using a Beckman Allegra 6R centrifuge and a GA 3.8 type rotor.
For cell-free plasma extraction, the upper plasma layer was transferred to a 15 ml high-speed centrifuge tube and centrifuged at 16,000 x g for 10 minutes at 4 ℃ using a Beckman Coulter Avanti J-E centrifuge and a JA-14 rotor. The two centrifugation steps were performed within 72 hours after blood collection. Cell-free plasma was stored at -80 ℃ and thawed only once prior to DNA extraction.
Cell-free DNA was extracted from cell-free plasma using the QIAamp DNA Blood Mini kit (Qiagen) according to the manufacturer's instructions. Five ml of buffer AL and 500 μl of Qiagen protease were added to 4.5 ml to 5 ml of cell-free plasma. The volume was adjusted to 10 ml with phosphate buffered saline (PBS), and the mixture was incubated at 56 ℃ for 12 minutes. Precipitated cfDNA was isolated from the solution by centrifugation at 8,000 RPM in a Beckman microcentrifuge using multiple columns. The columns were washed with AW1 and AW2 buffers, and cfDNA was eluted with 55 μl of nuclease-free water. About 3.5 to 7 ng of cfDNA was extracted from the plasma samples.
All sequencing libraries were prepared from approximately 2 ng of purified cfDNA extracted from maternal plasma. Library preparation was performed as follows using the reagents of the NEBNext™ DNA Sample Prep DNA Reagent Set 1 (Part No. E6000L; New England Biolabs, Ipswich, MA). Because cell-free plasma DNA is fragmented in nature, the plasma DNA samples were not further fragmented by nebulization or sonication. The overhangs of approximately 2 ng of purified cfDNA fragments contained in 40 μl were converted into phosphorylated blunt ends according to the End Repair Module by incubating the cfDNA in a 1.5 ml microcentrifuge tube with the 10X phosphorylation buffer provided in the NEBNext™ DNA Sample Prep DNA Reagent Set 1, 2 μl of deoxynucleotide solution mix (10 mM per dNTP), 1 μl of a 1:5 dilution of DNA polymerase I, 1 μl of T4 DNA polymerase and 1 μl of T4 polynucleotide kinase for 15 minutes at 20 ℃. The enzymes were then heat inactivated by incubating the reaction mixture at 75 ℃ for 5 minutes. The mixture was cooled to 4 ℃, and dA-tailing of the blunt-ended DNA was accomplished using 10 μl of the dA-tailing master mix containing the Klenow fragment (3' to 5' exo-) (NEBNext™ DNA Sample Prep DNA Reagent Set 1), with incubation for 15 minutes at 37 ℃. Subsequently, the Klenow fragment was heat inactivated by incubating the reaction mixture at 75 ℃ for 5 minutes. After inactivation of the Klenow fragment, 1 μl of a 1:5 dilution of Illumina Genomic Adaptor Oligo Mix (Part No. 1000521; Illumina Inc., Hayward, CA) was used to ligate the Illumina adaptors (Non-Index Y-Adaptors) to the dA-tailed DNA, using the T4 DNA ligase provided in the NEBNext™ DNA Sample Prep DNA Reagent Set 1 and incubating the reaction mixture for 15 minutes at 25 ℃.
The mixture was cooled to 4 ℃, and the adaptor-ligated cfDNA was purified from unligated adaptors, adaptor dimers, and other reagents using the magnetic beads provided in the Agencourt AMPure XP PCR purification system (Part No. A63881; Beckman Coulter Genomics, Danvers, MA). Eighteen cycles of PCR were performed to selectively enrich adaptor-ligated cfDNA using Phusion High-Fidelity Master Mix (Finnzymes, Woburn, MA) and Illumina PCR primers complementary to the adaptors (Part No. 1000537 and 1000537). The adaptor-ligated DNA was subjected to PCR (98 ℃ for 30 seconds; 18 cycles of 98 ℃ for 10 seconds, 65 ℃ for 30 seconds and 72 ℃ for 30 seconds; final extension at 72 ℃ for 5 minutes, and hold at 4 ℃) using Illumina genomic PCR primers (Part Nos. 100537 and 1000538) and the Phusion HF PCR Master Mix supplied in the NEBNext™ DNA Sample Prep DNA Reagent Set 1, according to the manufacturer's instructions. The amplified product was purified using the Agencourt AMPure XP PCR purification system (Agencourt Bioscience Corporation, Beverly, MA) according to the manufacturer's instructions (available at www.beckmangenomics.com/products/AMPureXP protocol-000387v001. Pdf). The purified amplified product was eluted in 40 μl of Qiagen EB buffer, and the concentration and size distribution of the amplified library were analyzed using the Agilent DNA 1000 Kit for the 2100 Bioanalyzer (Agilent Technologies Inc., Santa Clara, CA).
The amplified DNA was sequenced using the Illumina Genome Analyzer II to obtain single-end reads of 36 bp. Only about 30 bp of random sequence information is needed to identify a sequence as belonging to a specific human chromosome. Longer sequences can uniquely identify more specific targets. In the present case, a large number of 36 bp reads were obtained, covering approximately 10% of the genome. Upon completion of sequencing of the sample, the Illumina "Sequence Control Software" transferred the image and base call files to a Unix server running the Illumina "Genome Analyzer Pipeline" software version 1.51. The Illumina "Gerald" program was run to align sequences to the reference human genome derived from the hg18 genome provided by the National Center for Biotechnology Information (NCBI36/hg18, available at http://genome.ucsc.edu/cgi-bin/hgGateway?org=Human&db=hg18&hgsid=166260105). The sequence data generated by the above procedure that uniquely aligned to the genome were read from the Gerald output (export.txt files) by a program (c2c.pl) running on a computer running the Linux operating system. Sequence alignments with base mismatches were allowed and were included in the alignment counts only if they aligned uniquely to the genome. Sequence alignments with identical start and end coordinates (duplicates) were excluded.
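The counting rules in this paragraph (unique alignment only, a bounded number of mismatches, duplicates with identical start and end coordinates excluded) can be sketched as a small filter. This is an illustration only: the `Alignment` record and its fields are hypothetical and do not reproduce the Gerald export.txt format or the c2c.pl program; the limit of 2 mismatches is taken from the paragraph that follows; and keeping the first read of each duplicate group is an assumption, since the text does not say whether one copy is retained.

```python
from collections import Counter
from typing import NamedTuple


class Alignment(NamedTuple):
    """Hypothetical alignment record (not the real export.txt layout)."""
    chrom: str
    start: int
    end: int
    mismatches: int
    n_hits: int  # number of genomic locations the read aligned to


def count_tags(alignments, max_mismatches=2):
    """Count uniquely aligned, non-duplicate tags per chromosome."""
    seen = set()
    counts = Counter()
    for a in alignments:
        # Keep only reads that align to exactly one location with
        # an acceptable number of mismatches.
        if a.n_hits != 1 or a.mismatches > max_mismatches:
            continue
        # Drop reads whose start and end coordinates were already seen
        # (assumption: the first read of each duplicate group is kept).
        key = (a.chrom, a.start, a.end)
        if key in seen:
            continue
        seen.add(key)
        counts[a.chrom] += 1
    return counts
```

The resulting per-chromosome counts are the sequence tag densities used in the dose calculations of the later examples.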
Between about 5 and 15 million 36 bp tags with 2 or fewer mismatches were mapped uniquely to the human genome. All mapped tags were counted and included in the calculation of chromosome doses for both the test samples and the qualified samples. The regions of chromosome Y extending from base 0 to base 2 x 10⁶, from base 10 x 10⁶ to base 13 x 10⁶, and from base 23 x 10⁶ to the end were specifically excluded from the analysis, because tags derived from both male and female fetuses map to these regions of the Y chromosome.
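The exclusion of the three chromosome Y regions can be expressed as an interval check applied before counting. A minimal sketch follows; the names `Y_EXCLUDED` and `in_excluded_y_region` are hypothetical, and treating the interval boundaries as inclusive is an assumption.

```python
# Excluded chromosome Y intervals as (start, end); None means "to the end".
Y_EXCLUDED = [
    (0, 2_000_000),
    (10_000_000, 13_000_000),
    (23_000_000, None),
]


def in_excluded_y_region(position):
    """Return True if a chromosome Y position falls in an excluded interval."""
    for start, end in Y_EXCLUDED:
        if position >= start and (end is None or position <= end):
            return True
    return False


def count_y_tags(positions):
    """Count chromosome Y tags, skipping those in excluded intervals."""
    return sum(1 for p in positions if not in_excluded_y_region(p))
```

Only the tags that survive this mask contribute to the chromosome Y tag density.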
It was noted that some variation in the total number of sequence tags mapped to individual chromosomes occurred across samples sequenced in the same run (inter-chromosomal variation), but that substantially greater variation occurred among different sequencing runs (inter-sequencing run variation).
Example 8
Dosage and variation for chromosomes 13, 18, 21, X, and Y
To examine the extent of inter-chromosomal and inter-sequencing variability in the number of sequence tags mapped for all chromosomes, plasma cfDNA obtained from the peripheral blood of 48 volunteer pregnant subjects was extracted and sequenced as described in example 7, and analyzed as follows.
The total number of sequence tags mapped to each chromosome (sequence tag density) was determined. Alternatively, the number of mapped sequence tags can be normalized to the length of the chromosome to produce a sequence tag density ratio. Normalization to chromosome length is not a required step, and can be done solely to reduce the number of digits in a number to simplify it for human interpretation. The length of the chromosome that can be used to normalize these sequence tag counts may be the length provided at world web site gene.
The sequence tag density obtained for each chromosome was related to the sequence tag density of each of the remaining chromosomes to derive a qualified chromosome dose, which was calculated as the ratio of the sequence tag density for the chromosome of interest (e.g., chromosome 21) to the sequence tag density of each of the remaining chromosomes (i.e., chromosomes 1-20, 22 and X). Table 9 provides an example of the calculated qualified chromosome doses for chromosomes 13, 18, 21, X, and Y of interest, determined in one of the qualified samples. Chromosome doses were determined for all chromosomes in all samples, and the average doses for chromosomes 13, 18, 21, X, and Y of interest in the qualified samples are provided in tables 10 and 11 and are depicted in fig. 32-36. Figures 32 to 36 also depict the chromosome doses for the test samples. The chromosome doses for each chromosome of interest in the qualified samples provide a measure of the variation in the total number of mapped sequence tags for each chromosome of interest relative to that of each of the remaining chromosomes. Thus, qualified chromosome doses can identify the chromosome or group of chromosomes, i.e., the normalizing chromosome, whose variation among samples is closest to the variation of the chromosome of interest, and which would serve as the ideal sequence for normalizing values for further statistical evaluation. Fig. 37 and 38 depict the calculated average chromosome doses determined in a population of qualified samples for chromosomes 13, 18, and 21, and for chromosomes X and Y.
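The dose calculation and the search for a low-variability normalizing chromosome can be sketched as follows. This is an illustrative outline under stated assumptions, not the implementation used to produce the tables: each sample is assumed to be a dictionary of per-chromosome tag counts, the function names are hypothetical, and variability is measured by the coefficient of variation of the dose across qualified samples.

```python
from statistics import mean, stdev


def chromosome_dose(counts, chrom_of_interest, normalizer):
    """Ratio of tag counts: chromosome of interest / normalizing chromosome."""
    return counts[chrom_of_interest] / counts[normalizer]


def dose_cv(samples, chrom_of_interest, normalizer):
    """Coefficient of variation of the dose across qualified samples."""
    doses = [chromosome_dose(s, chrom_of_interest, normalizer) for s in samples]
    return stdev(doses) / mean(doses)


def lowest_cv_normalizer(samples, chrom_of_interest):
    """Candidate normalizing chromosome giving the least variable dose."""
    candidates = [c for c in samples[0] if c != chrom_of_interest]
    return min(candidates, key=lambda c: dose_cv(samples, chrom_of_interest, c))


# Hypothetical tag counts for three qualified samples.
qualified = [
    {"chr21": 100, "chr14": 200, "chr1": 1000},
    {"chr21": 110, "chr14": 220, "chr1": 1000},
    {"chr21": 90, "chr14": 180, "chr1": 1050},
]
print(lowest_cv_normalizer(qualified, "chr21"))
```

In this toy data the chr21/chr14 ratio is constant across samples, so chr14 is selected; with real data the ranking would come from the observed coefficients of variation.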
In some cases, the best normalizing chromosome may not have the least variability, but may have a distribution of qualified doses that best distinguishes one or more test samples from the qualified samples, i.e., the best normalizing chromosome may not have the lowest variability but may have the greatest resolvability. Thus, the resolvability takes into account both the variation in chromosome dose and the distribution of the doses in the qualified samples.
Tables 10 and 11 provide the coefficient of variation as a measure of variability, and t-test values as a measure of the resolvability for chromosomes 18, 21, X and Y, where the smaller the t-test value, the greater the resolvability. The resolvability for chromosome 13 was determined as the ratio of the difference between the mean chromosome dose in the qualified samples and the dose for chromosome 13 in the only T13 test sample, to the standard deviation of the qualified doses.
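The ratio described for chromosome 13 can be written down directly, and the same ratio can be used to compare candidate normalizing chromosomes. The sketch below makes two assumptions that are not stated in the text: the absolute value of the difference is taken, and the sample standard deviation is used. The function names and the dose values are hypothetical. The t-test based measure used for the other chromosomes is not reproduced here.

```python
from statistics import mean, stdev


def resolvability(qualified_doses, test_dose):
    """|mean qualified dose - test dose| / SD of the qualified doses."""
    return abs(mean(qualified_doses) - test_dose) / stdev(qualified_doses)


def most_resolving_normalizer(qualified_by_normalizer, test_by_normalizer):
    """Pick the normalizer whose doses best separate the test sample."""
    return max(
        qualified_by_normalizer,
        key=lambda n: resolvability(qualified_by_normalizer[n],
                                    test_by_normalizer[n]),
    )


# Hypothetical doses computed with two candidate normalizing chromosomes.
qualified = {"chrA": [0.50, 0.51, 0.49], "chrB": [0.40, 0.44, 0.36]}
affected = {"chrA": 0.55, "chrB": 0.46}
print(most_resolving_normalizer(qualified, affected))
```

A larger ratio means the affected sample lies more standard deviations away from the qualified mean, which is the sense in which one normalizer resolves better than another.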
When aneuploidy is identified in the test sample as explained below, the qualified chromosome dose also serves as the basis for determining the threshold.
TABLE 9 qualified chromosome doses for chromosomes 13, 18, 21, X, and Y (n = 1; sample #11342, 46XY)
TABLE 10 qualified chromosome doses, variation and resolvability for chromosomes 21, 18 and 13
TABLE 11 qualified chromosome doses, variation and resolvability for chromosomes 13, X and Y
Diagnostic examples of T21, T13, T18 and one Turner syndrome case, obtained using the normalizing chromosomes, chromosome doses and resolvability for each chromosome of interest, are described in example 9.
Example 9
Diagnosis of fetal aneuploidy using normalized chromosomes
To apply the use of chromosome doses to assessing aneuploidy in biological test samples, maternal blood test samples were obtained from pregnant volunteers, and cfDNA was prepared, sequenced and analyzed as described in examples 1 and 2.
Trisomy 21
Table 12 provides the calculated dose for chromosome 21 in an exemplary test sample (#11403). The calculated threshold for a positive diagnosis of T21 was set at >2 standard deviations from the mean of the qualified (normal) samples. The diagnosis of T21 was given based on the chromosome dose in the test sample being greater than the set threshold. Chromosomes 14 and 15 were used as normalizing chromosomes in separate calculations to show that either the chromosome with the lowest variability (e.g., chromosome 14) or the chromosome with the greatest resolvability (e.g., chromosome 15) can be used to identify the aneuploidy. Thirteen T21 samples were identified using the calculated chromosome doses, and these aneuploidy samples were confirmed by karyotype to be T21.
TABLE 12 chromosome dose for T21 aneuploidy (sample #11403, 47XY + 21)
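The trisomy call described for chromosome 21 amounts to comparing the test dose with a threshold derived from the qualified samples. A minimal sketch, assuming the threshold is the qualified mean plus 2 sample standard deviations; the function names and the dose values are hypothetical.

```python
from statistics import mean, stdev


def trisomy_threshold(qualified_doses, n_sd=2.0):
    """Upper threshold: mean of qualified doses plus n_sd standard deviations."""
    return mean(qualified_doses) + n_sd * stdev(qualified_doses)


def exceeds_trisomy_threshold(qualified_doses, test_dose, n_sd=2.0):
    """True if the test dose is greater than the upper threshold."""
    return test_dose > trisomy_threshold(qualified_doses, n_sd)


# Hypothetical chromosome 21 doses in qualified (normal) samples.
qualified_chr21 = [0.50, 0.51, 0.49, 0.50, 0.50]
print(exceeds_trisomy_threshold(qualified_chr21, 0.55))
```

The same comparison applies to chromosomes 18 and 13 in the following subsections, with the normalizing chromosome or group chosen for each chromosome of interest.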
Trisomy 18
Table 13 provides the calculated dose for chromosome 18 in one test sample (#11390). The calculated threshold for a positive diagnosis of T18 was set at >2 standard deviations from the mean of the qualified (normal) samples. The diagnosis of T18 was given based on the chromosome dose in the test sample being greater than the set threshold. Chromosome 8 was used as the normalizing chromosome. In this example, chromosome 8 has the lowest variability and the greatest resolvability. Eighteen T18 samples were identified using chromosome doses and were confirmed by karyotype to be T18.
These data show that a single normalizing chromosome can have both the lowest variability and the greatest resolvability.
TABLE 13 chromosome dose for T18 aneuploidy (sample #11390, 47XY + 18)
Trisomy 13
Table 14 provides the calculated dose for chromosome 13 in one test sample (#51236). The calculated threshold for a positive diagnosis of T13 was set at >2 standard deviations from the mean of the qualified samples. The diagnosis of T13 was given based on the chromosome dose in the test sample being greater than the set threshold. The chromosome dose for chromosome 13 was calculated using either chromosome 5 or the group of chromosomes 3, 4, 5 and 6 as the normalizing chromosome. One T13 sample was identified.
TABLE 14 chromosome dose for T13 aneuploidy (sample #51236, 47XY + 13)
The sequence tag density for chromosomes 3 through 6 is the average tag count for chromosomes 3 through 6.
These data show that the combination of chromosomes 3, 4, 5 and 6 provides lower variability than chromosome 5 alone and the greatest resolvability of any of the other chromosomes.
Thus, a set of chromosomes can be used as a normalization chromosome to determine chromosome dose and identify aneuploidies.
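When a group of chromosomes serves as the normalizer, the denominator of the dose is the average tag count over the group, as stated for chromosomes 3 through 6. A minimal sketch with hypothetical function names and tag counts:

```python
from statistics import mean


def group_tag_density(counts, group):
    """Average tag count over the chromosomes in the normalizing group."""
    return mean(counts[c] for c in group)


def dose_with_group(counts, chrom_of_interest, group):
    """Dose of the chromosome of interest relative to a chromosome group."""
    return counts[chrom_of_interest] / group_tag_density(counts, group)


# Hypothetical tag counts for one sample.
sample = {"chr13": 300, "chr3": 600, "chr4": 580, "chr5": 560, "chr6": 660}
print(dose_with_group(sample, "chr13", ["chr3", "chr4", "chr5", "chr6"]))
```

Averaging over several chromosomes tends to damp chromosome-specific fluctuations in the denominator, which is consistent with the lower variability reported for the group.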
Turner syndrome (monosomy X)
Table 15 provides the calculated doses for chromosomes X and Y in one test sample (#51238). The calculated threshold for a positive diagnosis of Turner syndrome (monosomy X) was set, for the X chromosome, at < -2 standard deviations from the mean of the qualified (normal) samples and, for the absence of the Y chromosome, at < -2 standard deviations from the mean of the qualified (normal) samples.
TABLE 15 chromosome dose for Turner (XO) aneuploidy (sample #51238, 45X)
Samples with an X chromosome dose less than the set threshold were identified as having less than one X chromosome. The same sample was determined to have a Y chromosome dose less than the set threshold, indicating that the sample does not have a Y chromosome. Thus, the combination of the doses for X and Y was used to identify Turner syndrome (monosomy X) samples.
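The combined X and Y rule can be sketched as two lower-threshold checks that must both hold. The function names and dose values are hypothetical. Which qualified population defines each threshold is an assumption here: X doses are compared with unaffected female samples and Y doses with unaffected male samples, following the comparison made in example 11.

```python
from statistics import mean, stdev


def below_threshold(qualified_doses, test_dose, n_sd=2.0):
    """True if the test dose is below mean - n_sd * SD of qualified doses."""
    return test_dose < mean(qualified_doses) - n_sd * stdev(qualified_doses)


def is_monosomy_x(x_dose, y_dose, qualified_female_x, qualified_male_y):
    """Flag monosomy X when both the X and the Y dose are below threshold."""
    return (below_threshold(qualified_female_x, x_dose)
            and below_threshold(qualified_male_y, y_dose))


# Hypothetical doses, for illustration only.
female_x = [1.00, 1.02, 0.98, 1.01, 0.99]
male_y = [0.10, 0.11, 0.09, 0.10, 0.10]
print(is_monosomy_x(0.90, 0.001, female_x, male_y))
```

Requiring both conditions separates monosomy X from an unaffected female (low Y, normal X) and from an unaffected male (lower X, normal Y).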
Thus, the provided method enables determination of the CNV of a chromosome. In particular, the method enables the determination of over-representative and under-representative chromosomal aneuploidies by massively parallel sequencing maternal plasma cfDNA and identifying normalization chromosomes for statistical analysis of the sequencing data. The sensitivity and reliability of the method allows for the accurate determination of aneuploidy in the first and second trimester.
Example 10
Determination of partial aneuploidy
The use of sequence doses was applied to assessing partial aneuploidy in a biological test sample of cfDNA that was prepared from plasma and sequenced as described in example 7. The sample was confirmed by karyotyping to be derived from a subject with a partial deletion of chromosome 11.
Analysis of the sequencing data for the partial aneuploidy (partial deletion of chromosome 11, i.e., q21-q23) was performed as described for the chromosomal aneuploidies in the previous examples. The mapping of sequence tags to chromosome 11 in the test sample revealed a noticeable loss of tag counts between base pairs 81000082-103000103 in the long arm of the chromosome, relative to the tag counts obtained for the corresponding sequence on chromosome 11 in the qualified samples (data not shown). The sequence tags mapped to the sequence of interest on chromosome 11 (81000082-103000103 bp) in each of the qualified samples, and the sequence tags mapped to all 20 megabase segments in the entire genome of the qualified samples (i.e., qualified sequence tag densities), were used to determine qualified sequence doses as ratios of tag densities in all qualified samples. The mean sequence dose, standard deviation, and coefficient of variation were calculated for all 20 megabase segments in the entire genome, and the 20 megabase sequence with the smallest variability was identified as the normalizing sequence on chromosome 5 (13000014-33000033 bp) (see table 16), which was used to calculate the dose for the sequence of interest in the test sample (see table 17). Table 17 provides the sequence dose for the sequence of interest on chromosome 11 (81000082-103000103 bp) in the test sample, calculated as the ratio of the sequence tags mapped to the sequence of interest to the sequence tags mapped to the identified normalizing sequence. Fig. 40 shows the sequence doses for the sequence of interest in the 7 qualified samples (O) and the sequence dose for the corresponding sequence in the test sample (O). The mean is shown by the solid line, and the calculated threshold for a positive diagnosis of partial aneuploidy, which was set at 5 standard deviations from the mean, is shown by the dashed line.
The diagnosis of partial aneuploidy is given based on the sequence dose in the test sample being less than a set threshold. The test sample was confirmed by karyotyping to have deletions q21-q23 on chromosome 11.
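The segment-level analysis can be sketched in three steps: count tags in 20 megabase segments, count tags in the sequence of interest, and compare the test sequence dose with a lower threshold set 5 standard deviations below the qualified mean. The code below is an illustration with hypothetical names and values. It bins on fixed multiples of 20 Mb from position 0, which is a simplification: the normalizing sequence reported in the text (chromosome 5, 13000014-33000033 bp) does not start on such a boundary.

```python
from collections import Counter
from statistics import mean, stdev

SEGMENT_SIZE = 20_000_000  # 20 megabases


def segment_counts(tags):
    """Count tags per (chromosome, segment index); tags are (chrom, pos)."""
    return Counter((chrom, pos // SEGMENT_SIZE) for chrom, pos in tags)


def region_count(tags, chrom, start, end):
    """Count tags that fall inside one region of interest (inclusive)."""
    return sum(1 for c, p in tags if c == chrom and start <= p <= end)


def below_deletion_threshold(qualified_doses, test_dose, n_sd=5.0):
    """True if the test sequence dose is below mean - n_sd * SD."""
    return test_dose < mean(qualified_doses) - n_sd * stdev(qualified_doses)
```

The sequence dose itself is the ratio of the region count for the sequence of interest to the count for the chosen normalizing segment, computed the same way in the qualified samples and in the test sample.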
Thus, in addition to identifying chromosomal aneuploidies, the methods of the invention can also be used to identify partial aneuploidies.
TABLE 16 qualified normalizing sequence, doses and variation for the sequence Chr11: 81000082-103000103 (qualified samples, n = 7)
TABLE 17 sequence dose for the sequence of interest on chromosome 11 (81000082-103000103) (test sample 11206)
Example 11
Demonstration of aneuploidy detection
The sequence data obtained for the samples described in examples 2 and 3 and shown in figures 32 to 36 were further analyzed to demonstrate the sensitivity of the method in successfully identifying aneuploidies in maternal samples. Normalized chromosome doses for chromosomes 21, 18, 13, X and Y were analyzed as distributions relative to the standard deviation from the mean (Y-axis) and are shown in fig. 41A-41E. The normalizing chromosomes used are shown as denominators (X-axis).
FIG. 41A shows the distribution of chromosome doses relative to the standard deviation from the mean for chromosome 21 dose in unaffected samples (o) and trisomy 21 samples (T21; Δ) when chromosome 14 is used as the normalizing chromosome for chromosome 21. FIG. 41B shows the distribution of chromosome doses relative to the standard deviation from the mean for chromosome 18 dose in unaffected samples (o) and trisomy 18 samples (T18; Δ) when chromosome 8 is used as the normalizing chromosome for chromosome 18. FIG. 41C shows the distribution of chromosome doses relative to the standard deviation from the mean for chromosome 13 dose in unaffected samples (o) and trisomy 13 samples (T13; Δ), using the average sequence tag density of the group of chromosomes 3, 4, 5 and 6 as the normalizing chromosome to determine the chromosome dose for chromosome 13. FIG. 41D shows the distribution of chromosome doses relative to the standard deviation from the mean for chromosome X dose in unaffected female samples, unaffected male samples and monosomy X samples when chromosome 4 is used as the normalizing chromosome for chromosome X. FIG. 41E shows the distribution of chromosome doses relative to the standard deviation from the mean for chromosome Y dose in unaffected male samples, unaffected female samples and monosomy X samples, using the average sequence tag density of a group of chromosomes as the normalizing chromosome to determine the chromosome dose for chromosome Y.
This data indicates that trisomy 21, trisomy 18, trisomy 13 are clearly distinguishable from unaffected (normal) samples. When having chromosome X doses significantly lower than those of unaffected female samples (fig. 41 (D)) and chromosome Y doses significantly lower than those of unaffected male samples (fig. 41 (E)), the monosomic X samples can be easily identified.
Thus, the provided methods are sensitive and specific for determining the presence or absence of a chromosomal aneuploidy in a maternal blood sample.
Example 12
Determination of fetal chromosomal aneuploidy using massively parallel DNA sequencing of cell-free fetal DNA from maternal blood: test set 1 independent of training set 1
The study was conducted by qualified clinical research staff at 13 U.S. clinical sites between April 2009 and October 2010 under a human subject protocol approved by the Institutional Review Board (IRB) of each institution. Written consent was obtained from each subject prior to study participation. The protocol was designed to provide blood samples as well as clinical data to support the development of non-invasive prenatal genetic diagnostic methods. Pregnant women 18 years of age or older were eligible for participation. For patients undergoing clinically indicated chorionic villus sampling (CVS) or amniocentesis, blood was collected prior to the procedure, and the results of the fetal karyotype were also collected. Peripheral blood samples (two tubes, or about 20 mL in total) were drawn from all subjects into acid citrate dextrose (ACD) tubes (Becton Dickinson). All samples were de-identified and assigned an anonymous patient ID number. Blood samples were shipped to the laboratory overnight in temperature-controlled shipping containers provided for the study. The time elapsed between blood draw and sample receipt was recorded as part of sample accessioning.
The site research coordinators entered clinical data relating to the patient's current pregnancy and history into a study Case Report Form (CRF) using the anonymous patient ID number. Cytogenetic analysis of the fetal karyotype was performed at the local laboratories on samples from the invasive prenatal procedures, and the results were also recorded in the study CRF. All data obtained on the CRF were entered into the clinical database of the laboratory. Cell-free plasma was obtained from individual blood tubes using a two-step centrifugation method within 24 to 48 hours of venipuncture. Plasma from a single blood tube was sufficient for sequencing analysis. Cell-free DNA was extracted from cell-free plasma using the QIAamp DNA Blood Mini kit (Qiagen) according to the manufacturer's instructions. Since these cell-free DNA fragments are known to be about 170 base pairs (bp) in length (Fan et al., Clin Chem 56:1279-1286 [2010]), fragmentation of the DNA prior to sequencing was not required.
For the training set samples, cfDNA was sent to Prognosys Biosciences, Inc. (La Jolla, CA) for sequencing library preparation (cfDNA blunt-ended and ligated to universal adaptors) and sequenced using the standard manufacturer's protocol on an Illumina Genome Analyzer IIx instrument (http://www.illumina.com/). Single-end reads of 36 base pairs were obtained. After sequencing was completed, all base call files were collected and analyzed. For the test set samples, sequencing libraries were prepared and sequenced on the Illumina Genome Analyzer IIx instrument. The sequencing libraries were prepared as follows. The full-length protocol described is essentially the standard protocol provided by Illumina, and differs from the Illumina protocol only in the purification of the amplified library. The Illumina protocol specifies that the amplified library be purified using gel electrophoresis, whereas the protocol described herein uses magnetic beads for the same purification step. Approximately 2 ng of purified cfDNA extracted from maternal plasma was used to prepare a primary sequencing library, essentially according to the manufacturer's instructions, using the NEBNext™ DNA Sample Prep DNA Reagent Set 1 (Part No. E6000L; New England Biolabs, Ipswich, MA). Except for the final purification of the adaptor-ligated products, which was performed using Agencourt magnetic beads and reagents instead of a purification column, all steps were performed according to the protocol accompanying the NEBNext™ reagents for sample preparation of genomic DNA libraries that are sequenced using the GAII. The NEBNext™ protocol essentially follows that provided by Illumina, which is available at grcf.jhml.edu/hts/protocols/11257047_ChIP_sample_Prep.
The overhangs of approximately 2 ng of purified cfDNA fragments contained in 40 μl were converted into phosphorylated blunt ends according to the End Repair Module by incubating the cfDNA in a 1.5 ml microcentrifuge tube with the 10X phosphorylation buffer provided in the NEBNext™ DNA Sample Prep DNA Reagent Set 1, 2 μl of deoxynucleotide solution mix (10 mM per dNTP), 1 μl of a 1:5 dilution of DNA polymerase I, 1 μl of T4 DNA polymerase and 1 μl of T4 polynucleotide kinase for 15 minutes at 20 ℃. The sample was cooled to 4 ℃ and purified using a QIAquick column provided in the QIAquick PCR Purification Kit (QIAGEN Inc., Valencia, CA). The reaction was transferred to a 1.5 ml microcentrifuge tube, and 250 μl of Qiagen Buffer PB was added. The resulting 300 μl was transferred to a QIAquick column, which was centrifuged in a microcentrifuge at 13,000 RPM for 1 minute. The column was washed with 750 μl of Qiagen Buffer PE and centrifuged again. Residual ethanol was removed by centrifugation at 13,000 RPM for an additional 5 minutes. The DNA was eluted by centrifugation in 39 μl of Qiagen Buffer EB. dA-tailing of 34 μl of the blunt-ended DNA was accomplished using 16 μl of the dA-tailing master mix containing the Klenow fragment (3' to 5' exo-) (NEBNext™ DNA Sample Prep DNA Reagent Set 1), with incubation for 30 minutes at 37 ℃ according to the manufacturer's dA-Tailing Module. The sample was cooled to 4 ℃ and purified using a column provided in the MinElute PCR Purification Kit (QIAGEN Inc., Valencia, CA). The reaction was transferred to a 1.5 ml microcentrifuge tube, and 250 μl of Qiagen Buffer PB was added. The mixture was transferred to a MinElute column, which was centrifuged in a microcentrifuge at 13,000 RPM for 1 minute. The column was washed with 750 μl of Qiagen Buffer PE and centrifuged again. Residual ethanol was removed by centrifugation at 13,000 RPM for an additional 5 minutes.
The DNA was eluted by centrifugation in 15. Mu.l Qiagen Buffer EB. According toQuick connect module (Quick Ligation Module), ten microliters of DNA eluate were incubated with 1. Mu.l of a dilution of 1:5 illuminer Oligo Mix (article No. 1000521), 15. Mu.l of 2 Xquick Ligation Reaction Buffer, and 4. Mu.l of Quick T4 DNA ligase at 25 ℃ for 15 minutes. The sample was cooled to 4 ℃ and a MinElute column was used as follows. One hundred fifty microliters of Qiagen Buffer PE was added to 30. Mu.l of the reaction and the entire volume was transferred to a MinElute column, which was centrifuged in a microfuge for 1 minute at 13,000RPM. The column was washed with 750. Mu.l Qiagen Buffer PE and recentrifuged. Residual ethanol was removed by recentrifugation at 13,000RPM for 5 minutes. The DNA was eluted by centrifugation in 28. Mu.l Qiagen Buffer EB. The Illumina genome PCR primers (article Nos. 100537 and 1000538) and the primer set NEBNext TM Phusion HF PCR Master Mix supplied in DNA Sample Prep DNA Reagent Set 1 (according to the manufacturer's instructions), twenty-three microliters of aptamer-ligated DNA eluate were subjected to 18 PCR cycles (98 ℃ for 30 seconds; 98 ℃ for 18 cycles for 10 seconds, 65 ℃ for 30 seconds, and 72 ℃ for 30 seconds; final extension at 72 ℃ for 5 minutes, and held at 4 ℃). The amplified product was purified using an Agencourt AMPure XP PCR purification system (Agencourt Bioscience Corporation, beverly, mass.) according to the manufacturer's instructions (available at www.beckmangenomics.com/products/AMPureXP protocol-000387v001. Pdf). The Agencourt AMPure XP PCR purification system removed unbound dntps, primers, primer dimers, salts, and other contaminants, and recovered amplicons greater than 100 bp. The purified amplified product was eluted from the Agencourt beads in 40. 
Mu.l Qiagen EB buffer and the library was analyzed for size distribution using the Agilent DNA 1000Kit from 2100Bioanalyzer (Agilent technologies Inc., santa Clara, calif.). For the training and test sample sets, single-sided reads of 36 base pairs were sequenced.
Data analysis and sample classification
Sequence reads 36 bases in length were aligned to the human genome assembly hg18 obtained from the UCSC database (http://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/). Alignments were performed using the Bowtie short read aligner (version 0.12.5), allowing up to two base mismatches during alignment (Langmead et al., Genome Biol 10:R25 [2009]). Only reads that mapped unambiguously to a single genomic location were included. The genomic sites to which the reads mapped were counted and included in the calculation of chromosome doses (see below). Regions on the Y chromosome where sequence tags from male and female fetuses map without any discrimination were excluded from the analysis (specifically, from base 0 to base 2 x 10^6; base 10 x 10^6 to base 13 x 10^6; and base 23 x 10^6 to the end of the Y chromosome).
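The filtering and counting described above can be sketched as follows. This is a minimal illustration rather than the pipeline actually used: the input format (pre-parsed tuples), the function names, the `chrY` label, and the choice to count distinct mapped positions as "sites" are all assumptions made for the example. Only the three excluded Y chromosome intervals come from the text.

```python
# Excluded chrY intervals taken from the text, as half-open [start, end)
# base ranges; None stands for "to the end of the chromosome".
Y_MASK = [(0, 2_000_000), (10_000_000, 13_000_000), (23_000_000, None)]


def in_y_mask(pos):
    """Return True if a chrY coordinate falls inside an excluded interval."""
    return any(start <= pos and (end is None or pos < end) for start, end in Y_MASK)


def count_unique_sites(alignments):
    """Count distinct mapped sites per chromosome.

    `alignments` is a hypothetical, pre-parsed iterable of
    (chrom, pos, n_locations) tuples, where n_locations is the number of
    genomic locations the read aligned to.
    """
    sites = {}
    for chrom, pos, n_locations in alignments:
        if n_locations != 1:
            continue  # keep only reads that map to a single location
        if chrom == "chrY" and in_y_mask(pos):
            continue  # drop chrY regions that do not discriminate fetal sex
        sites.setdefault(chrom, set()).add(pos)
    return {chrom: len(positions) for chrom, positions in sites.items()}
```

The per-chromosome counts returned here are the quantities that enter the chromosome dose ratios.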
Intra-run and inter-run sequencing variation in the chromosomal distribution of sequence reads can obscure the effect of fetal aneuploidy on the distribution of mapped sequence sites. To correct for this variation, a chromosome dose was calculated as the count of mapped sites for a given chromosome of interest normalized to the count observed for a predetermined normalizing chromosome sequence. As explained previously, a normalizing chromosome sequence can consist of a single chromosome or of a group of chromosomes. The normalizing chromosome sequence was first identified in a subset of the unaffected (i.e., qualified) training set samples having diploid karyotypes for chromosomes of interest 21, 18, 13 and X, by considering each autosome as a potential denominator in a ratio of counts with the chromosome of interest. The denominator chromosome (i.e., the normalizing chromosome sequence) was selected to minimize the variation in chromosome dose between sequencing runs. Each chromosome of interest was found to have a distinct normalizing chromosome sequence (denominator) (Table 10). No single chromosome could be identified as a normalizing chromosome sequence for chromosome 13, because no single chromosome reduced the variation in chromosome 13 dose across samples sufficiently, i.e., the spread of NCV values for chromosome 13 was not reduced enough to allow correct identification of a T13 aneuploidy. Chromosomes 2 through 6 were chosen at random and tested as a group for their ability to mimic the behavior of chromosome 13. The group of chromosomes 2 through 6 was found to substantially reduce the variation in chromosome 13 dose in the training set samples, and was therefore chosen as the normalizing chromosome sequence for chromosome 13.
As described previously, the variation in the chromosome dose for chromosome Y is greater than 30 regardless of which single chromosome is used as the normalizing chromosome sequence in determining the chromosome Y dose. The group of chromosomes 2 through 6 was found to substantially reduce the variation in chromosome Y dose in the training set samples, and was therefore chosen as the normalizing chromosome sequence for chromosome Y.
The chromosome doses for each chromosome of interest in the qualified samples provide a measure of the variation in the total number of mapped sequence tags for the chromosome of interest relative to that of each of the remaining chromosomes. Qualified chromosome doses can therefore identify the chromosome or group of chromosomes, i.e., the normalizing chromosome sequence, whose variation across samples most closely matches that of the chromosome of interest, and which would serve as the ideal sequence for normalizing values for further statistical evaluation.
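The selection criterion described above, choosing the denominator that minimizes the variation in dose across qualified samples, can be illustrated with a short sketch. The data layout (one dictionary of tag counts per sample), the function names, and the use of the coefficient of variation with a sample standard deviation as the measure of variation are assumptions for illustration.

```python
from statistics import mean, stdev


def dose_cv(samples, chrom, denominators):
    """Coefficient of variation of chrom/denominators doses across samples."""
    doses = [s[chrom] / sum(s[d] for d in denominators) for s in samples]
    return stdev(doses) / mean(doses)


def rank_denominators(samples, chrom, candidates):
    """Sort candidate denominators (tuples of chromosomes) by dose CV."""
    return sorted(candidates, key=lambda denom: dose_cv(samples, chrom, denom))
```

A call such as `rank_denominators(qualified, "13", [("4",), ("5",), ("2", "3", "4", "5", "6")])` would put the candidate with the tightest dose distribution first, which is how a group such as chromosomes 2 through 6 can outrank every single chromosome.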
The chromosome doses for all samples in the training set (i.e., qualified and affected) also serve as the basis for determining the thresholds used to identify aneuploidies in the test samples, as explained below.
TABLE 18. Normalizing chromosome sequences for determining chromosome doses
For each chromosome of interest in each sample of the test set, a normalized value was determined and used to determine the presence or absence of aneuploidy. The normalized value was calculated as a chromosome dose, which can be further processed to give a normalized chromosome value (NCV).
Chromosome dosage
For the test set, a chromosome dose was calculated for each chromosome of interest (21, 18, 13, X and Y) in every sample. As provided in Table 18 above, each dose is the ratio of the number of tags in the test sample that map to the chromosome of interest to the number of tags in the same sample that map to its normalizing chromosome sequence: chromosome 21 is normalized to chromosome 9; chromosome 18 to chromosome 8; chromosome 13 to chromosomes 2 through 6; chromosome X to chromosome 6; and chromosome Y to chromosomes 2 through 6.
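The five ratios listed above translate directly into code. In this sketch the numerator and denominator assignments are those stated in the text; the dictionary layout, the chromosome labels, and the function name are illustrative assumptions.

```python
# Normalizing chromosome sequences as stated in the text for each
# chromosome of interest (keys) and its denominator chromosomes (values).
NORMALIZERS = {
    "21": ("9",),
    "18": ("8",),
    "13": ("2", "3", "4", "5", "6"),
    "X": ("6",),
    "Y": ("2", "3", "4", "5", "6"),
}


def chromosome_doses(tag_counts):
    """Return the dose of each chromosome of interest for one sample.

    `tag_counts` maps chromosome labels to the number of mapped tags.
    """
    return {
        chrom: tag_counts[chrom] / sum(tag_counts[d] for d in denominators)
        for chrom, denominators in NORMALIZERS.items()
    }
```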
Normalized chromosome value
Using the chromosome dose for each chromosome of interest in each test sample, together with the mean and standard deviation of the corresponding chromosome dose determined in the qualified samples of the training set, a normalized chromosome value (NCV) was calculated using the equation:
$$NCV_{ij} = \frac{x_{ij} - \hat{\mu}_j}{\hat{\sigma}_j}$$
where $\hat{\mu}_j$ and $\hat{\sigma}_j$ are the estimated training set mean and standard deviation, respectively, for the j-th chromosome dose, and $x_{ij}$ is the observed j-th chromosome dose for test sample i. When chromosome doses are normally distributed, the NCV is equivalent to a statistical z-score for the doses. No significant departure from linearity was observed in quantile-quantile plots of the NCVs from unaffected samples. In addition, standard tests of normality for the NCVs failed to reject the null hypothesis of normality.
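As a small illustration of this z-score style calculation, the sketch below estimates the reference mean and standard deviation from qualified training doses and applies them to a test dose. The function names are hypothetical, and the use of the sample (n-1) standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, stdev


def fit_reference(qualified_doses):
    """Estimate the training set mean and standard deviation of a dose."""
    return mean(qualified_doses), stdev(qualified_doses)


def ncv(dose, mu, sigma):
    """Normalized chromosome value: (observed dose - mean) / SD."""
    return (dose - mu) / sigma
```

For example, with training doses of 0.313, 0.314 and 0.315, a test dose of 0.318 lies four standard deviations above the training mean.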
For the test set, an NCV was calculated for each chromosome of interest (21, 18, 13, X and Y) in every sample. To ensure a safe and effective classification scheme, conservative boundaries were chosen for aneuploidy classification. For classifying the aneuploidy status of autosomes, an NCV > 4.0 was required to classify a chromosome as affected (i.e., aneuploid for that chromosome), and an NCV < 2.5 to classify a chromosome as unaffected. Samples with autosomes having an NCV between 2.5 and 4.0 were classified as "no-call".
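The autosome rule can be written as a three-way decision. This sketch assumes, as the stated no-call interval of 2.5 to 4.0 implies, that the affected boundary is 4.0; the function name and the treatment of values exactly on a boundary as no-call are illustrative choices.

```python
def classify_autosome(ncv, affected_above=4.0, unaffected_below=2.5):
    """Classify one autosome from its NCV using conservative boundaries."""
    if ncv > affected_above:
        return "affected"
    if ncv < unaffected_below:
        return "unaffected"
    return "no-call"  # 2.5 <= NCV <= 4.0
```

Note that a strongly negative NCV is classified as unaffected under this rule, because the boundaries only test for an excess of the chromosome of interest.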
In the test set, sex chromosomes were classified by sequential application of the NCVs for both X and Y as follows:
If the NCV for Y is greater than -2.0 standard deviations from the mean of male samples, the sample is classified as male (XY).
If the NCV for Y is less than -2.0 standard deviations from the mean of male samples and the NCV for X is greater than -2.0 standard deviations from the mean of female samples, the sample is classified as female (XX).
If the NCV for Y is less than -2.0 standard deviations from the mean of male samples and the NCV for X is less than -3.0 standard deviations from the mean of female samples, the sample is classified as monosomy X, i.e., Turner syndrome.
If the NCVs do not meet any of the above criteria, the sample is classified as a "no-call" for sex.
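The sequential sex chromosome rules can be sketched as below. This is one reading of the criteria, not a definitive implementation: it assumes the male-referenced criterion is evaluated on the chromosome Y NCV and the female-referenced criteria on the chromosome X NCV, in line with the statement that NCVs for both X and Y are applied. The function and argument names are hypothetical.

```python
def classify_sex(ncv_y_vs_males, ncv_x_vs_females):
    """Classify fetal sex from sex chromosome NCVs.

    ncv_y_vs_males:   chrY NCV expressed relative to male training samples.
    ncv_x_vs_females: chrX NCV expressed relative to female training samples
                      (assumed interpretation of the female-referenced rule).
    """
    if ncv_y_vs_males > -2.0:
        return "XY"
    if ncv_y_vs_males < -2.0:
        if ncv_x_vs_females > -2.0:
            return "XX"
        if ncv_x_vs_females < -3.0:
            return "monosomy X"
    return "no-call"
```

Under this reading, a sample with no Y signal and an X NCV between -3.0 and -2.0 falls through to a no-call.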
Results
Study demographics
A total of 1,014 patients were enrolled between April 2009 and July 2010. Patient demographics, invasive procedure type, and karyotype results are summarized in Table 19. The mean age of the study population was 35.6 years (range 17 to 47 years), and gestational age ranged from 6 weeks 1 day to 38 weeks 1 day (mean 15 weeks 4 days). The overall incidence of abnormal fetal karyotypes was 6.8%, with a T21 incidence of 2.5%. Of 946 subjects with singleton pregnancies and a karyotype, 906 (96%) presented with at least one clinically recognized risk factor for fetal aneuploidy at the time of the prenatal procedure. Even excluding subjects whose sole indication was advanced maternal age, the data demonstrate a very high false positive rate for current screening modalities. Ultrasound findings of increased nuchal translucency, cystic hygroma, or other structural congenital abnormalities were the most predictive of an abnormal karyotype in this age group.
TABLE 19. Patient demographics
* Includes results for fetuses from multiple gestations, as assessed and reported by clinicians
Abbreviations: AMA = advanced maternal age, NT = nuchal translucency
The ethnic diversity of the study population is also shown in Table 19. Overall, 63% of the patients in this study were Caucasian, 17% were Hispanic, 6% were Asian, 5% were multi-ethnic, and 4% were African American. Ethnic diversity varied significantly from site to site. For example, one site enrolled 60% Hispanic and 26% Caucasian subjects, while three clinical sites located in the same state enrolled no Hispanic subjects. As expected, no discernible differences were observed in our results across ethnicities.
Training data set 1
The training set study selected 71 samples from the initial sequential accumulation of 435 samples collected between April 2009 and December 2009. All subjects with affected fetuses (abnormal karyotypes) in this first series of subjects were included for sequencing, along with a random selection of unaffected subjects with adequate samples and data. The clinical characteristics of the training set patients were consistent with the overall study demographics shown in Table 19. The gestational age of the samples in the training set ranged from 10 weeks 0 days to 23 weeks 1 day. Thirty-eight patients underwent CVS, 32 underwent amniocentesis, and for 1 patient the type of invasive procedure was not specified (unaffected karyotype 46,XY). Of the patients, 70% were Caucasian, 8.5% were Hispanic, 8.5% were Asian, and 8.5% were multi-ethnic. Six sequenced samples were removed from this set for training purposes: 4 samples were from subjects with twin gestations (discussed in detail below), 1 sample had T18 and was contaminated during preparation, and 1 sample had a fetal karyotype of 69,XXX, leaving 65 samples as the training set.
The number of unique sequence sites (i.e., tags identified with unique sites in the genome) varied from 2.2M in the early phases of the training set study to 13.7M in the later phases, owing to improvements in sequencing technology over time. To monitor for any potential shifts in chromosome dose over this 6-fold range in unique sites, different unaffected samples were run at the beginning and end of the study. For the first 15 unaffected samples run, the average number of unique sites was 3.8M and the average chromosome doses for chromosome 21 and chromosome 18 were 0.314 and 0.528, respectively. For the next 15 unaffected samples run, the average number of unique sites was 10.7M and the average chromosome doses for chromosome 21 and chromosome 18 were 0.316 and 0.529, respectively. There was no statistical difference in the chromosome doses for chromosome 21 and chromosome 18 over the course of the training set study.
The training set NCVs for chromosomes 21, 18, and 13 are shown in FIG. 42. The results shown in FIG. 42 are consistent with an assumption of normality, under which approximately 99% of diploid NCVs fall within ±2.5 standard deviations of the mean. Of the 65 samples in this set, the 8 samples with clinical karyotypes indicating T21 had NCVs ranging from 6 to 20. The four samples with clinical karyotypes indicating fetal T18 had NCVs ranging from 3.3 to 12, and the two samples with clinical karyotypes indicating fetal trisomy 13 (T13) had NCVs of 2.6 and 4. The spread of NCVs in affected samples is due to their dependence on the percentage of fetal cfDNA in the individual samples.
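The "approximately 99%" figure follows from the standard normal distribution and can be checked numerically. The helper below is an illustrative sketch (its name is hypothetical); it uses the identity P(|Z| <= k) = erf(k / sqrt(2)).

```python
import math


def two_sided_coverage(k):
    """Probability that a standard normal value lies within +/- k SD."""
    return math.erf(k / math.sqrt(2.0))


coverage = two_sided_coverage(2.5)  # about 0.988
```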
As with the autosomes, the means and standard deviations for the sex chromosomes were established in the training set. The sex chromosome thresholds allowed 100% discrimination between male and female fetuses in the training set.
Test data set 1
After establishing the chromosome dose means and standard deviations from the training set, a test set of 48 samples was selected from a total of 575 samples collected between January 2010 and June 2010. One sample from a twin gestation was removed from the final analysis, leaving 47 samples in the test set. Personnel preparing samples for sequencing and operating the equipment were blinded to the clinical karyotype information. The gestational age range was similar to that seen in the training set (Table 19). Of the invasive procedures, 58% were CVS, higher than in the overall procedural demographics but similar to the training set. Of the subjects, 50% were Caucasian, 27% were Hispanic, 10.4% were Asian, and 6.3% were African American.
In the test set, the number of unique sequence tags varied from approximately 13M to 26M. For unaffected samples, the chromosome doses for chromosome 21 and chromosome 18 were 0.313 and 0.527, respectively. The test set NCVs for chromosomes 21, 18 and 13 are shown in FIG. 43, and the classifications are given in Table 20.
TABLE 20. Test set classification data
* MX is monosomy of the X chromosome with no evidence of a Y chromosome
In the test set, 13/13 subjects with karyotypes indicating fetal T21 were correctly identified, with NCVs ranging from 5 to 14. Eight/eight subjects with karyotypes indicating fetal T18 were correctly identified, with NCVs ranging from 8.5 to 22. The single sample in this test set with a karyotype classified as T13 was classified as a no-call, with an NCV of approximately 3.
For the test data set, all male samples were correctly identified, including a sample with the complex karyotype 46,XY + marker chromosome (unidentifiable by cytogenetics) (Table 11). Of the three samples in the test set with a karyotype of 45,X, two were correctly identified as monosomy X and one was classified as a no-call (Table 20).
Twins
Four of the samples initially selected for the training set and one in the test set were from twin gestations. The thresholds used here could be confounded by the different amount of cfDNA expected in the setting of a twin gestation. In the training set, the karyotype from one of the twin samples was monochorionic 47,XY,+21. A second twin sample was fraternal, and amniocentesis was carried out on each fetus individually. In this twin gestation, one fetus had a karyotype of 47,XY,+21 while the other had a normal karyotype of 46,XX. In both cases, the cell-free classification based on the methods discussed above classified the sample as T21. The other two twin gestations in the training set were correctly classified as unaffected for T21 (all twins showed a diploid karyotype for chromosome 21). For the twin gestation in the test set, a karyotype (46,XX) was established only for Twin B, and the algorithm correctly classified the sample as unaffected for T21.
Conclusion
The data show that massively parallel sequencing can be used to determine multiple abnormal fetal karyotypes from the blood of pregnant women. They demonstrate that 100% correct classification of samples with trisomy 21 and trisomy 18 can be achieved using independent test set data. Even in the case of fetuses with abnormal karyotypes, none of the samples were misclassified by the algorithm of the method. Importantly, the algorithm also performed well in determining the presence or absence of T21 in the two sets of twin pregnancies. Furthermore, this study examined a large number of sequential samples from multiple centers, which not only represent the range of abnormal karyotypes one is likely to see in a commercial clinical setting, but also demonstrate the importance of accurately classifying pregnancies unaffected by the common trisomies, given the unacceptably high false positive rates of current prenatal screening. These data provide valuable insight into the considerable potential of the method going forward. Analysis of subsets of the unique genomic sites showed an increase in variance consistent with Poisson counting statistics.
These data build on the findings of Fan and Quake, who demonstrated that the sensitivity of noninvasive determination of fetal aneuploidy from maternal plasma using massively parallel sequencing is limited only by counting statistics (Fan and Quake, PLoS One 5:e10439 [2010]). Because sequencing information is collected across the entire genome, this method is capable of determining any aneuploidy or other copy number variation, including insertions and deletions. The karyotype from one of the samples had a small deletion in chromosome 11 between q21 and q23, and when the sequencing data were analyzed in 500 kb bins, a decrease of approximately 10% in the relative number of tags was observed in a 25 Mb region starting at q21. In addition, three of the samples in the training set had complex sex karyotypes due to mosaicism in the cytogenetic analysis. These karyotypes were: i) 47,XXX[9]/45,X[6], ii) 45,X[3]/46,XY[17], and iii) 47,XXX[13]/45,X[7]. Sample ii, which showed some XY-containing cells, was correctly classified as XY. Samples i (from a CVS procedure) and iii (from amniocentesis), both of which showed a mixture of XXX and X cells by cytogenetic analysis (consistent with mosaic Turner syndrome), were classified as no-call and monosomy X, respectively.
In testing the algorithm, another interesting data point was observed: one sample from the test set had an NCV between -5 and -6 for chromosome 21 (FIG. 43). Although this sample was diploid for chromosome 21 by cytogenetics, the karyotype showed mosaicism with partial trisomy for chromosome 9: 47,XX,+9[2]/46,XX[6]. Because chromosome 9 is used in the denominator to determine the chromosome dose for chromosome 21 (Table 18), the excess chromosome 9 material lowers the overall NCV. The results provided in Example 13 below demonstrate the ability to determine fetal trisomy 9 in this sample using normalizing chromosomes.
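The direction of this effect can be illustrated with a simple model. The sketch assumes that a fraction of the cfDNA is fetal and that a fraction of the fetal cells carry an extra copy of the denominator chromosome, so that denominator tag counts scale by 1 + (fetal fraction x mosaic fraction) / 2 while the numerator is unchanged. The function name, the model itself, and the example numbers are hypothetical and are not taken from the study data.

```python
def dose_after_denominator_gain(baseline_dose, fetal_fraction, mosaic_fraction):
    """Dose of a disomic chromosome when its denominator chromosome is
    trisomic in a fraction of fetal cells (illustrative model only)."""
    denominator_scale = 1.0 + 0.5 * fetal_fraction * mosaic_fraction
    return baseline_dose / denominator_scale


# Hypothetical inputs: baseline dose 0.314, 10% fetal cfDNA, fully trisomic.
shifted = dose_after_denominator_gain(0.314, 0.10, 1.0)
```

Because the shifted dose is lower than the baseline, subtracting the training mean gives a negative numerator and therefore a negative NCV for chromosome 21.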
The conclusion of Fan et al. regarding the sensitivity of these methods holds only if the algorithms used can account for any random or systematic bias introduced by the sequencing method. If the sequencing data are not properly normalized, the resulting analysis will be inferior to counting statistics. Chiu et al. noted in their recent paper that their measurements of chromosomes 18 and 13 using the massively parallel sequencing method were imprecise, and concluded that more research was needed to apply the method to the determination of T18 and T13 (Chiu et al., BMJ 342:c7401 [2011]). The method used in the Chiu et al. paper simply uses the number of sequence tags on the chromosome of interest (chromosome 21 in their case), normalized by the total number of tags in the sequencing run. The difficulty with this approach is that the distribution of tags across chromosomes can vary from sequencing run to sequencing run, which increases the overall variation of the aneuploidy determination metric. To compare the results of the Chiu algorithm with the chromosome doses used in this example, the test data for chromosomes 21 and 18 were reanalyzed using the method recommended by Chiu et al., as shown in FIG. 44. Overall, a compression of the NCV range was observed for each of chromosomes 21 and 18, along with a reduced determination rate: using an NCV threshold of 4.0 for aneuploidy classification, 10/13 T21 samples and 5/8 T18 samples from our test set were correctly identified.
Ehrich et al. also focused only on T21 and used the same algorithm as Chiu et al. (Ehrich et al., Am J Obstet Gynecol 204). In addition, after observing a shift in their test set z-score metric relative to the external reference data (i.e., the training set), they retrained on the test set to establish the classification boundaries. Although this approach is feasible in principle, in practice it would be challenging to decide how many samples are needed for training and how often retraining is required to ensure that the classification data are correct. One way to mitigate this problem is to include controls in every sequencing run that measure the baseline and calibrate for quantitative behavior.
Data obtained using the present method show that massively parallel sequencing is able to determine a variety of fetal chromosomal abnormalities from the plasma of pregnant women when the algorithm used to normalize the chromosome count data is optimized. The present method for quantification not only minimizes random and systematic variations between sequencing rounds, but also allows classification of aneuploidy throughout the entire genome, most notably T21 and T18. Larger sample collections are required to test the algorithm for T13 determination. To this end, a prospective, blind, multi-site clinical study is being conducted to further demonstrate the diagnostic accuracy of the present method.
Example 13
Determining the presence or absence of at least five different chromosomal aneuploidies in all chromosomes of a single test sample
To demonstrate the ability of the present method to determine the presence or absence of any chromosomal aneuploidy in each of a set of maternal test samples (test set 1; Example 12), systematically determined normalizing chromosome sequences were identified in the unaffected samples of the training set (training set 1; Example 12) and used to calculate chromosome doses for all chromosomes in each test sample. The presence or absence of any one or more different complete fetal chromosomal aneuploidies in each test set and training set sample was determined from the sequencing information obtained in a single sequencing run performed on each individual sample.
Using the chromosome densities, i.e., the number of sequence tags identified for each chromosome in each of the training set samples as described in Example 12, a systematically determined normalizing chromosome sequence, consisting of a single chromosome or a group of chromosomes, was determined for each of chromosomes 1-22, X, and Y by systematically calculating chromosome doses for each chromosome using every possible combination of the remaining chromosomes as the denominator. For example, with chromosome 21 as the chromosome of interest, chromosome doses were calculated as the ratio of (i) the number of sequence tags obtained for chromosome 21 to (ii) the number of sequence tags obtained for each of the remaining chromosomes, and to the sum of the numbers of tags obtained for every possible combination of the remaining chromosomes (excluding chromosome 21), i.e.: 1, 2, 3, 4, 5, etc. up to 20, 22, X and Y; 1+2, 1+3, 1+4, 1+5, etc. up to 1+20, 1+22, 1+X, and 1+Y; 1+2+3, 1+2+4, 1+2+5, etc. up to 1+2+20, 1+2+22, 1+2+X, and 1+2+Y; 1+3+4, 1+3+5, 1+3+6, etc. up to 1+3+20, 1+3+22, 1+3+X, and 1+3+Y; 1+2+3+4, 1+2+3+5, 1+2+3+6, etc. up to 1+2+3+20, 1+2+3+22, 1+2+3+X, and 1+2+3+Y; and so on, such that all possible combinations of chromosomes 1-20, 22, X and Y were used as normalizing chromosome sequences (denominators) to determine all possible chromosome doses for each chromosome of interest in each of the qualified (unaffected) samples in the training set. Chromosome doses were determined in this manner for chromosome 21 in all training set samples, and the systematically determined normalizing chromosome sequence for chromosome 21 was identified as the single chromosome or group of chromosomes yielding the chromosome 21 dose with the lowest variability across all training samples.
The same analysis was repeated to determine the single chromosome or combination of chromosomes serving as the systematically determined normalizing chromosome sequence for each of the remaining chromosomes, including chromosomes 13, 18, X and Y; that is, all possible combinations of chromosomes were used to determine the normalizing sequence (single chromosome or group of chromosomes) for all other chromosomes of interest 1-12, 14-17, 19-20, 22, X and Y in all training samples. Thus, every chromosome was treated as a chromosome of interest, and a systematically determined normalizing sequence was determined for each chromosome in each of the unaffected samples in the training set. Table 21 provides the single chromosomes or groups of chromosomes identified as the systematically determined normalizing sequences for each chromosome of interest 1-22, X, and Y. As Table 21 shows, for some chromosomes of interest the systematically determined normalizing chromosome sequence was a single chromosome (e.g., when chromosome 4 is the chromosome of interest), while for others it was a group of chromosomes (e.g., when chromosome 21 is the chromosome of interest).
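The exhaustive search over denominators can be sketched with `itertools.combinations`. This is an illustrative implementation under stated assumptions: variability is measured as the coefficient of variation with a sample standard deviation, each sample is a dictionary of tag counts keyed by chromosome label, and the optional `max_size` cap is a hypothetical convenience, since searching every subset of 23 remaining chromosomes means evaluating roughly 8.4 million combinations per chromosome of interest.

```python
from itertools import combinations
from statistics import mean, stdev


def dose_cv(samples, chrom, denominators):
    """Coefficient of variation of the dose across training samples."""
    doses = [s[chrom] / sum(s[d] for d in denominators) for s in samples]
    return stdev(doses) / mean(doses)


def systematic_normalizer(samples, chrom, max_size=None):
    """Return (best_combination, cv) over all denominator combinations."""
    others = [c for c in samples[0] if c != chrom]
    limit = len(others) if max_size is None else max_size
    best = None
    for size in range(1, limit + 1):
        for combo in combinations(others, size):
            cv = dose_cv(samples, chrom, combo)
            if best is None or cv < best[1]:
                best = (combo, cv)  # keep the lowest-variability denominator
    return best
```

Running `systematic_normalizer` once per chromosome of interest over the unaffected training samples would yield one single chromosome or group of chromosomes per chromosome, which is the kind of result a table of normalizing sequences records.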
TABLE 21. Systematically determined normalizing chromosome sequences for all chromosomes
The mean, standard deviation (SD), and coefficient of variation (CV) of the chromosome doses obtained with the systematically determined normalizing chromosome sequence for each chromosome are given in Table 22.
TABLE 22. Mean, standard deviation (SD), and coefficient of variation (CV) of chromosome doses obtained with the systematically determined normalizing chromosome sequences
Chromosome of interest Mean SD CV
1 0.36637 0.00266 0.72%
2 0.31580 0.00068 0.22%
3 0.21983 0.00055 0.18%
4 0.98191 0.02509 2.56%
5 0.30109 0.00076 0.25%
6 0.21621 0.00059 0.27%
7 0.21214 0.00044 0.21%
8 0.25562 0.00068 0.27%
9 0.12726 0.00034 0.27%
10 0.24471 0.00098 0.40%
11 0.26907 0.00098 0.36%
12 0.12358 0.00029 0.23%
13 a 0.26023 0.00122 0.47%
14 0.09286 0.00028 0.30%
15 0.21568 0.00147 0.68%
16 0.25181 0.00134 0.53%
17 0.46000 0.00248 0.54%
18 a 0.10100 0.00038 0.38%
19 1.43709 0.02899 2.02%
20 0.19967 0.00123 0.62%
21 a 0.07851 0.00053 0.67%
22 0.69613 0.01391 2.00%
X b 0.46865 0.00279 0.68%
Y b 0.00028 0.00004 14.97%
a Excluding trisomies
b Female fetuses
The variation in chromosome dose across all training samples, as reflected in the CV values, confirms that the systematically determined normalizing chromosome sequences provide a large signal-to-noise ratio and dynamic range, allowing aneuploidies to be determined with high sensitivity and high specificity, as shown below.
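The CV values are the ratio of each SD to its mean, expressed as a percentage. The sketch below recomputes the CV for three rows of Table 22 (chromosomes 4, 19 and 22); the function name is hypothetical, and because the published means and SDs are rounded, recomputed values agree with the tabulated CVs only to within rounding.

```python
def coefficient_of_variation(mean_dose, sd):
    """CV in percent: 100 * SD / mean."""
    return 100.0 * sd / mean_dose


# (mean, SD) pairs copied from Table 22 for chromosomes 4, 19 and 22.
TABLE_22_ROWS = {
    "4": (0.98191, 0.02509),
    "19": (1.43709, 0.02899),
    "22": (0.69613, 0.01391),
}

recomputed = {
    chrom: coefficient_of_variation(m, sd) for chrom, (m, sd) in TABLE_22_ROWS.items()
}
```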
To demonstrate the sensitivity and specificity of the method, chromosome doses for all chromosomes of interest 1-22, X and Y were determined in each sample of the training set and in each sample of the test set described in Example 11, using the corresponding systematically determined normalizing chromosome sequences provided in Table 21 above.
Using the systematically determined normalizing chromosome sequence for each chromosome of interest, the presence or absence of any fetal aneuploidy was determined in each training set sample and in each test sample, i.e., it was determined whether each sample contained a complete fetal chromosomal aneuploidy of any of chromosomes 1-22, X, and Y. Sequence information, i.e., the number of sequence tags, was obtained for all chromosomes in each training set sample and each test sample, and a single chromosome dose was calculated as described above for each chromosome in each training and test sample using the number of sequence tags obtained for the corresponding systematically determined normalizing chromosome sequence (Table 21). The number of sequence tags obtained for the systematically determined normalizing chromosome sequence in each training sample was used to determine the chromosome dose for each chromosome in that training sample, and the number of sequence tags obtained for the systematically determined normalizing chromosome sequence in each test sample was used to determine the chromosome dose for each chromosome in that test sample. To ensure safe and effective classification of aneuploidies, the same conservative boundaries were chosen as described in Example 12.
Training set results
A plot of the chromosome doses for chromosomes 21, 18 and 13 in the training set samples, determined using the systematically determined normalizing chromosome sequences, is given in FIG. 45. When using the systematically determined normalizing chromosome sequence (i.e., the group of chromosomes 4+14+16+20+22), the 8 samples in which the clinical karyotype indicated T21 had NCVs between 5.4 and 21.5. When using the systematically determined normalizing chromosome sequence (i.e., the group of chromosomes 2+3+5+7), the 4 samples in which the clinical karyotype indicated T18 had NCVs between 3.3 and 15.3. The T21 samples of the training set are shown as the last 8 samples of the chromosome 21 data (O); the T18 samples of the training set are shown as the last 4 samples of the chromosome 18 data (Δ); and the T13 samples of the training set are shown as the last 2 samples of the chromosome 13 data (□).
These data show that different complete fetal chromosomal aneuploidies can be determined and correctly classified with high confidence using the normalizing chromosome sequences. Since all samples with affected karyotypes had NCVs greater than 3, there is only about a 0.1% probability that these samples belong to the unaffected distribution.
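The "about 0.1%" figure corresponds to the upper tail of a standard normal distribution beyond 3 standard deviations, assuming unaffected NCVs are normally distributed. A short numerical check follows; the function name is hypothetical.

```python
import math


def upper_tail(z):
    """P(Z > z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


p_unaffected_above_3 = upper_tail(3.0)  # about 0.00135, i.e. roughly 0.1%
```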
Similar to autosomes, when the systematically determined normalized chromosome sequence (i.e., set of chromosomes 4+8) is used for chromosome X, and when the systematically determined normalized chromosome sequence (i.e., set of chromosomes 4+6) is used for chromosome Y, all female and male fetuses within the training set are correctly identified. In addition, all 5 monosomy X samples were identified. Figure 46A shows a plot of the NCV determined for the X chromosome (X-axis) and the NCV determined for the Y chromosome (Y-axis) for each sample within the training set. All samples that were monosomy X by karyotype had NCV values less than -4.83. The monosomy X samples with a karyotype consistent with 45,X (complete or mosaic) had, as expected, a Y NCV value close to zero. Female samples were clustered around NCV = 0 for both X and Y.
Test set results
A plot of chromosome dose for chromosomes 21, 18 and 13 in the test samples using the relevant systematically determined normalized chromosome sequences is given in fig. 47. When using the systematically determined normalized chromosome sequence (i.e., the set of chromosomes 4+14+16+20+22), all 13 samples in which clinical karyotype indicates T21 were correctly identified, with NCV between 7.2 and 16.3. When using the systematically determined normalized chromosome sequence (i.e., the set of chromosomes 2+3+5+7), all 8 samples in which clinical karyotype indicates T18 were identified, with NCV between 12.7 and 30.7. The T21 samples of the test set are shown as the last 13 samples of chromosome 21 data (O); the T18 samples of the test set are shown as the last 8 samples of chromosome 18 data (Δ); and the T13 sample of the test set is shown as the last sample of chromosome 13 data (□).
These data indicate that systematically determined normalized chromosome sequences can be used with high confidence to determine different complete fetal chromosomal aneuploidies and classify them correctly. Similar to the training set, all samples with affected karyotypes had NCVs greater than 7, indicating a very small probability that these samples are part of the unaffected distribution (FIG. 47).
Similar to autosomes, when the systematically determined normalized chromosome sequence (i.e., set of chromosomes 4+8) is used for chromosome X, and when the systematically determined normalized chromosome sequence (i.e., set of chromosomes 4+6) is used for chromosome Y, all female and male fetuses within the test set are correctly identified. In addition, all 3 monosomy X samples were identified. Figure 46B shows a plot of the NCV determined for the X chromosome (X-axis) and the NCV determined for the Y chromosome (Y-axis) for each sample within the test set.
As explained above, the method allows the determination of the presence or absence of a complete or partial chromosomal aneuploidy of each of chromosomes 1-22, X and Y in each sample. In addition to determining the complete chromosomal aneuploidies T13, T18, T21 and monosomy X, the method also determined the presence of trisomy 9 in one of the test samples. When the systematically determined normalized chromosome sequence (i.e., the set of chromosomes 3+4+8+10+17+19+20+22) is used for chromosome 9 of interest, a sample with an NCV of 14.4 is identified (FIG. 48). This sample corresponds to the test sample in example 12 that was suspected of being aneuploid for chromosome 9 based on an abnormally low dose for chromosome 21 (where chromosome 9 was used as the normalizing chromosome sequence in example 12).
These data indicate that 100% of samples with clinical karyotypes indicative of T21, T13, T18, T9 and monosomy X were correctly identified. Figure 49 shows a plot of NCV for each of chromosomes 1-22 in each of 47 test samples. The median of NCV was normalized to zero. These data show that the method of the invention (including the use of systematically determined normalized chromosome sequences) determines the presence of all 5 types of chromosomal aneuploidy present in this test set with 100% sensitivity and 100% specificity, and clearly indicate that the method can identify any chromosomal aneuploidy for any of chromosomes 1-22, X and Y in any sample.
Example 14
Determining the presence or absence of a partial fetal chromosomal aneuploidy: determining cat eye syndrome
DiGeorge syndrome (22q11.2 deletion syndrome), a condition caused by a defect in chromosome 22, leads to poor development of several body systems. Medical problems commonly associated with DiGeorge syndrome include cardiac defects, poor immune system function, cleft palate, parathyroid gland problems, and behavioral disorders. The number and severity of the problems associated with DiGeorge syndrome vary greatly. Almost every person with DiGeorge syndrome requires treatment from specialists in multiple fields.
To determine the presence or absence of a partial deletion of fetal chromosome 22, a blood sample was obtained from the mother by venipuncture, and cfDNA was prepared as described in the examples above. The purified cfDNA was ligated to adaptors and subjected to cluster amplification using an Illumina cBot cluster station. Massively parallel sequencing was performed using reversible dye terminators to generate millions of 36 bp reads. These sequence reads were aligned to the human hg19 reference genome, and reads that mapped uniquely to the reference genome were counted as tags.
A set of qualified samples, all known to be diploid for chromosome 22 (i.e., chromosome 22 or any portion thereof is known to exist only in the diploid state), is first sequenced and analyzed to obtain a number of sequence tags for each of 1000 segments of 3 megabases (Mb), excluding region 22q11.2. Given that the human genome comprises about 3 billion bases (3 Gb), the 1000 segments of 3 Mb each constitute approximately the remainder of the genome. Each of these 1000 segments can serve, individually or as part of a set of segment sequences, to determine the normalized segment sequence for the segment of interest, i.e., the 3 Mb region of 22q11.2. The number of sequence tags mapped onto each single one of the 1000 segments was used individually to calculate the segment dose for the 3 Mb region of 22q11.2. Furthermore, all possible combinations of two or more segments are used to determine the segment dose for the segment of interest in all qualified samples. The single 3 Mb segment, or the combination of two or more 3 Mb segments, that resulted in the segment dose with the lowest variability across the samples was selected as the normalized segment sequence.
The number of sequence tags that map onto the segment of interest in each qualified sample is used to determine the segment dose in each qualified sample. The mean and standard deviation of the segment doses in all qualified samples were calculated and used to determine thresholds against which the segment doses determined in the test samples can be compared. Preferably, Normalized Segment Values (NSV) are calculated for all segments of interest in all qualified samples, and these values are used to set the threshold.
The number of tags mapped to the normalized segment sequence in the corresponding test sample is then used to determine the dose of the segment of interest in the test sample. A Normalized Segment Value (NSV) is calculated for the segment in the test sample as previously described, and the NSV of the segment of interest in the test sample is compared to the threshold determined using the qualified samples to determine the presence or absence of a 22q11.2 deletion in the test sample.
A test NSV < -3 indicates a loss in the segment of interest, i.e., the presence of a partial deletion of chromosome 22 (22q11.2) in the test sample.
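The segment dose and NSV calculation of this example can be sketched as follows. This is a minimal illustration, assuming that the NSV is computed in the same way as the NCV, i.e., as the test dose minus the mean of the qualified doses, divided by their standard deviation. The function names and the tag counts in the usage lines are hypothetical.

```python
from statistics import mean, stdev

def segment_dose(tags_interest: int, tags_normalizing: int) -> float:
    """Ratio of tags on the segment of interest to tags on the
    normalizing segment sequence (one segment or a summed set)."""
    return tags_interest / tags_normalizing

def normalized_segment_value(test_dose: float, qualified_doses: list) -> float:
    """Assumed z-score form: (test dose - qualified mean) / qualified SD."""
    return (test_dose - mean(qualified_doses)) / stdev(qualified_doses)

def has_partial_deletion(nsv: float, threshold: float = -3.0) -> bool:
    """A test NSV below the threshold indicates a loss in the segment."""
    return nsv < threshold

if __name__ == "__main__":
    # Hypothetical doses for 22q11.2 in five qualified (diploid) samples
    qualified = [0.100, 0.102, 0.098, 0.101, 0.099]
    # Hypothetical tag counts for one test sample
    test = segment_dose(tags_interest=940, tags_normalizing=10000)
    nsv = normalized_segment_value(test, qualified)
    print(round(nsv, 2), has_partial_deletion(nsv))
```

In practice the qualified set would contain many more samples than the five used here, so that the mean and standard deviation are estimated reliably.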
Example 15
Fecal DNA testing to obtain predictive outcomes for stage II colon cancer patients
Approximately 30% of all stage II colon cancer patients will relapse and die of their disease. Stage II colon cancer patients who have had a relapse of disease show significantly more losses on chromosomes 4, 5, 15q, 17q and 18q. In particular, loss of 4q22.1 to 4q35.2 in stage II colon cancer patients has been shown to be associated with worse outcomes. Determination of the presence or absence of these genomic alterations can aid in the selection of patients for adjuvant therapy (Brosens et al., Analytical Cellular Pathology/Cellular Oncology 33:95-104 [2010]).
To determine the presence or absence of one or more chromosome deletions in the region from 4q22.1 to 4q35.2 in patients with stage II colon cancer, stool and/or plasma samples were obtained from the patient or patients. Fecal DNA is prepared according to the method described by Chen et al, J Natl Cancer Inst 97; and plasma DNA was prepared according to the method described in the examples above. DNA was sequenced according to the NGS method described herein, and sequence information of the patient sample(s) was used to calculate segment doses for one or more segments spanning the region from 4q22.1 to 4q35.2. The segment dose is determined using a normalized segment sequence previously determined within a set of qualified stool and/or plasma samples, respectively. Segment doses in test samples (patient samples) were calculated, and the presence or absence of one or more partial chromosome deletions in the region from 4q22.1 to 4q35.2 was determined by comparing the NSV for each segment of interest to a threshold set from the NSVs within the qualified sample set.
Example 16
Genome-wide fetal aneuploidy detection by sequencing of maternal plasma DNA: diagnostic accuracy in a prospective, blinded, multicenter study
The method for determining the presence or absence of aneuploidy in maternal test samples was used in a prospective study, and its diagnostic accuracy is shown below. The prospective study further demonstrates the efficacy of the methods of the invention for detecting fetal aneuploidy for multiple chromosomes across the genome. The blinded study mimics an actual population of pregnant women, in which the fetal karyotype is unknown, and all samples with any abnormal karyotype were selected for sequencing. The classifications made according to the method of the invention were compared to fetal karyotypes obtained from invasive procedures to determine the diagnostic performance of the method for a variety of chromosomal aneuploidies.
Overview of the present example
In a prospective, blinded study, blood samples were collected from 2,882 women undergoing a prenatal diagnostic procedure at 60 U.S. sites (ClinicalTrials.gov NCT01122524).
An independent biostatistician selected all singleton pregnancies with any abnormal karyotype and an equal number of randomly selected pregnancies with euploid karyotypes. Chromosome classifications were made for each sample according to the method of the present invention and compared to the fetal karyotype.
Within the analysis cohort of 532 samples, 89/89 cases of trisomy 21 (sensitivity 100%, 95% CI 95.9-100), 35/36 cases of trisomy 18 (sensitivity 97.2%, 95% CI 85.5-99.9), 11/14 cases of trisomy 13 (sensitivity 78.6%, 95% CI 49.2-99.9), 232/233 females (sensitivity 99.6%, 95% CI 97.6 to >99.9), 184/184 males (sensitivity 100%, 95% CI 98.0-100) and 15/16 cases of monosomy X (sensitivity 93.8%, 95% CI 69.8-99.8) were classified. In unaffected subjects, there were no false positives for autosomal aneuploidy (100% specificity, 95% CI >98.5 to 100). Furthermore, fetuses with mosaicism for trisomy 21 (3/3), trisomy 18 (1/1) and monosomy X (2/7), three translocation trisomies, two other autosomal trisomies (20 and 16) and other sex chromosome aneuploidies (XXX, XXY and XYY) were correctly classified.
These results further demonstrate the efficacy of the present method for detecting fetal aneuploidy for multiple chromosomes across the genome using maternal plasma DNA. The high sensitivity and specificity for detection of trisomies 21, 18 and 13 and of monosomy X indicate that the present method can be incorporated into existing aneuploidy screening algorithms to reduce unnecessary invasive procedures.
Materials and methods
The MELISSA (MatErnal BLood IS Source to Accurately diagnose fetal aneuploidy) study was performed as a prospective, multicenter observational study with blinded nested case:control analyses. Pregnant women aged 18 years and older who were undergoing an invasive prenatal procedure to determine fetal karyotype were recruited (ClinicalTrials.gov NCT01122524). Eligibility criteria included pregnant women between 8 weeks 0 days and 22 weeks 0 days of gestation who met at least one of the following additional criteria: age ≥ 38 years; a positive screening test result (serum analytes and/or nuchal translucency (NT) measurement); the presence of ultrasound markers associated with an increased risk of fetal aneuploidy; or a previous pregnancy with an aneuploid fetus. Written informed consent was obtained from all women who agreed to participate.
Enrollment took place at 60 geographically dispersed medical centers in 25 states according to protocols approved by the institutional review board (IRB) of each institution. Two clinical research organizations (CROs) (Quintiles, Durham, NC; and Emphusion, San Francisco, CA) were retained to keep the study blinded and to provide clinical data management, data monitoring, biostatistics and data analysis services.
Prior to any invasive procedure, peripheral venous blood samples (17 mL) were collected in two Acid Citrate Dextrose (ACD) tubes (Becton Dickinson), de-identified and labeled with a unique study number. Site research personnel entered the study number, date, and time of the blood draw into a secure electronic case report form (eCRF). Whole blood samples were shipped overnight in temperature-controlled containers from the sites to the laboratory (Verinata Health, Inc., CA). Upon receipt and sample inspection, cell-free plasma was prepared according to the methods described previously (see example 13) and stored frozen in 2 to 4 aliquots at -80 ℃ until sequencing. The date and time of sample receipt by the laboratory were recorded. A sample was determined to be eligible for analysis if it was received overnight, was cool to the touch and contained at least 7 mL of blood. Samples that were eligible upon receipt were reported to the CRO weekly and used for selection of the random sampling list (see below and fig. 50). Clinical data from the woman's current pregnancy and the fetal karyotype were entered into the eCRF by site research personnel and verified by the CRO.
The sample size was determined based on the precision of the estimates for a targeted range of performance characteristics (sensitivity and specificity) of the index test. Specifically, the number of affected (T21, T18, T13, male, female or monosomy X) cases and unaffected (non-T21, non-T18, non-T13, non-male, non-female or non-monosomy X) controls was determined so as to estimate sensitivity and specificity, respectively, to within a predefined small margin of error based on the normal approximation: $N = \left(1.96\,\sqrt{p(1-p)}\,/\,\text{margin of error}\right)^2$, where $p$ is the estimate of sensitivity or specificity. Assuming a true sensitivity of 95% or greater, sample sizes between 73 and 114 cases ensure that the precision of the sensitivity estimate will be such that the lower bound of the 95% Confidence Interval (CI) will be 90% or greater (margin of error ≤ 5%). For smaller sample sizes, the projected 95% CI for sensitivity has a larger margin of error (from 6% to 13.5%). To estimate specificity with greater precision, a larger number of unaffected controls (about a 4:1 ratio to cases) was planned during the sampling phase. This ensured that the precision of the specificity estimate was at least 3%. Thus, as sensitivity and/or specificity increase, the precision of the confidence intervals will also increase.
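The normal-approximation sample size formula above can be evaluated directly. The following is a minimal sketch; the function name is illustrative, and rounding up to the next whole subject is our assumption rather than something stated in the study plan.

```python
from math import ceil, sqrt

def required_sample_size(p: float, margin: float, z: float = 1.96) -> int:
    """N = (z * sqrt(p * (1 - p)) / margin)^2, rounded up.

    p      : anticipated sensitivity or specificity
    margin : desired margin of error (half-width of the 95% CI)
    z      : normal quantile (1.96 for a two-sided 95% interval)
    """
    return ceil((z * sqrt(p * (1.0 - p)) / margin) ** 2)

if __name__ == "__main__":
    # Anticipated sensitivity of 95% with a 5% margin of error
    print(required_sample_size(p=0.95, margin=0.05))
```

With p = 0.95 and a 5% margin the formula gives about 73 cases, which matches the lower end of the range quoted above.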
Based on the sample size determination, the CRO designed a random sampling plan to generate lists of samples selected for sequencing (a minimum of 110 cases affected by T21, T18 or T13 and 400 unaffected for trisomy, allowing up to half of these cases to have karyotypes other than 46,XX or 46,XY). Subjects with a singleton pregnancy and an eligible blood sample were eligible for selection. Subjects with ineligible samples, no recorded karyotype, or multiple pregnancies were excluded (fig. 50). Lists were generated periodically throughout the study and sent to the Verinata Health laboratory.
Each eligible blood sample was analyzed for six independent categories. These categories were the aneuploidy status of chromosomes 21, 18 and 13, and the gender status of male, female and monosomy X. While still blinded, one of three classifications (affected, unaffected, or unclassified) was prospectively generated for each of the six independent categories for each plasma DNA sample. Using this scheme, the same sample may be classified as affected in one analysis (e.g., aneuploid for chromosome 21) and unaffected in another analysis (e.g., euploid for chromosome 18).
Routine metaphase cytogenetic analysis of cells obtained by Chorionic Villus Sampling (CVS) or amniocentesis was used as the reference standard in this study. Fetal karyotyping was performed in the diagnostic laboratories routinely used by the participating sites. If a patient underwent both CVS and amniocentesis after enrollment, the karyotype from the amniocentesis was used for the study analysis. If a metaphase karyotype was not available, fluorescence in situ hybridization (FISH) results targeting chromosomes 21, 18, 13, X, and Y were allowed (Table 24). All abnormal karyotype reports (i.e., other than 46,XX and 46,XY) were reviewed by a board-certified cytogeneticist and classified as affected or unaffected with respect to chromosomes 21, 18 and 13 and gender status XX, XY and monosomy X.
Pre-defined protocol conventions specified that the following abnormal karyotypes would be assigned a 'censored' karyotype status by the cytogeneticist: triploidy, tetraploidy, complex karyotypes other than trisomy (e.g., mosaicism) involving chromosomes 21, 18, or 13, mosaicism with mixed sex chromosomes, sex chromosome aneuploidy, or karyotypes that could not be fully interpreted from the source document (e.g., a marker chromosome of unknown origin). Since the cytogenetic diagnosis was not known to the sequencing laboratory, all cytogenetically censored samples were analyzed independently and assigned a classification determined using sequencing information according to the method of the invention (sequencing classification), but were not included in the statistical analysis. The censored status applied only to the relevant one or more of the six analyses (e.g., a mosaic T18 would be censored from the chromosome 18 analysis, but considered 'unaffected' for the other analyses, such as chromosomes 21, 13, X and Y) (table 25). Other abnormalities and rare complex karyotypes that were not fully foreseen when the protocol was designed were not censored from the analysis (table 26).
Access to the data contained in the eCRF and clinical databases was restricted to authorized users (study sites, the CROs, and contracted clinical personnel). No Verinata Health employee had access until the time of unblinding.
After receiving the random sample list from the CRO, total cell-free DNA (a mixture of maternal and fetal DNA) was extracted from the thawed selected plasma samples as described in example 13. Sequencing libraries were prepared using the Illumina TruSeq kit v2.5. Sequencing was performed on an Illumina HiSeq 2000 instrument at the Verinata Health laboratory (6-plex, i.e., 6 samples/lane). Single-end reads of 36 base pairs were obtained. Reads were mapped across the entire genome, and the sequence tags on each chromosome of interest were counted and used to classify samples for the individual categories as described above.
The clinical protocol required evidence of the presence of fetal DNA in order to report classification results. A classification of male or aneuploid was considered sufficient evidence of fetal DNA. In addition, each sample was also tested for the presence of fetal DNA using two allele-specific methods. In the first method, the AmpFlSTR MiniFiler kit (Life Technologies, San Diego, CA) was used to examine the presence of a fetal component in the cell-free DNA. Electrophoresis of Short Tandem Repeat (STR) amplicons was performed on an ABI 3130 Genetic Analyzer according to the manufacturer's protocol. All nine STR loci in this kit were analyzed by comparing the reported intensity of each peak as a percentage of the sum of the intensities of all peaks, and the presence of minor peaks was used to provide evidence of fetal DNA. Where no minor STR peaks could be identified, an aliquot of the sample was examined using a panel of 15 Single Nucleotide Polymorphisms (SNPs) with average heterozygosity ≥ 0.4, selected from the panel of Kidd et al. (Kidd et al., Forensic Sci Int 164(1):20-32 [2006]). Allele-specific methods that can be used to detect and/or quantify fetal DNA in maternal samples are described in U.S. patent publications 20120010085, 20110224087 and 20110201507, which are incorporated herein by reference.
The Normalized Chromosome Value (NCV) was determined by calculating all possible denominator permutations of all autosomes and sex chromosomes as described in example 13. However, since sequencing in this study was performed on a different instrument from that used in our earlier work and with multiple samples per lane, new normalized chromosome denominators had to be determined. The normalized chromosome denominators in the current study were determined by sequencing a training set of 110 independent unaffected samples (i.e., qualified samples that were not MELISSA-eligible samples) prior to analyzing the study samples. The new normalized chromosome denominators were determined by calculating all possible denominator permutations of all autosomes and sex chromosomes, such that the variation within the unaffected training set was minimized for all chromosomes across the entire genome (table 23).
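The search for a normalizing denominator can be sketched as follows. This is an illustration only, assuming that "variation" is measured as the coefficient of variation (standard deviation divided by mean) of the chromosome dose across the unaffected training samples, and limiting the size of the candidate sets for tractability. The function names, the `max_size` limit and the tag counts are hypothetical and are not taken from the study.

```python
from itertools import combinations
from statistics import mean, stdev

def chromosome_dose(counts: dict, target: str, denominators: tuple) -> float:
    """Tags on the chromosome of interest divided by the summed tags
    on the candidate denominator chromosomes."""
    return counts[target] / sum(counts[c] for c in denominators)

def best_denominator(samples: list, target: str, candidates: list,
                     max_size: int = 3):
    """Return (denominator set, CV) with the lowest coefficient of
    variation of the dose across the unaffected training samples.
    `candidates` must not contain `target`."""
    best = None
    for size in range(1, max_size + 1):
        for combo in combinations(candidates, size):
            doses = [chromosome_dose(s, target, combo) for s in samples]
            cv = stdev(doses) / mean(doses)
            if best is None or cv < best[1]:
                best = (combo, cv)
    return best

if __name__ == "__main__":
    # Hypothetical tag counts for three unaffected training samples
    training = [
        {"21": 100, "4": 1000, "1": 2000},
        {"21": 110, "4": 1100, "1": 2000},
        {"21": 90, "4": 900, "1": 2100},
    ]
    print(best_denominator(training, "21", ["1", "4"], max_size=2))
```

An exhaustive search over all subsets of 23 other chromosomes grows exponentially with set size, which is why this sketch caps the set size; the study itself reports only the resulting denominators (table 23).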
The NCV rules applied to provide the autosomal classification for each test sample were those described in example 12, i.e., for classification of an autosomal aneuploidy, an NCV > 4.0 was required to classify a chromosome as affected (i.e., aneuploid for that chromosome), and an NCV < 2.5 classified a chromosome as unaffected. Samples with autosomes having an NCV between 2.5 and 4.0 were designated "unclassified".
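The autosomal rule above amounts to a three-way threshold test, sketched below. The function name is illustrative; treating NCVs exactly equal to 2.5 or 4.0 as unclassified is our reading of "between 2.5 and 4.0".

```python
def classify_autosome(ncv: float) -> str:
    """Apply the autosomal NCV thresholds described in the text."""
    if ncv > 4.0:
        return "affected"
    if ncv < 2.5:
        return "unaffected"
    # 2.5 <= NCV <= 4.0: the no-call zone
    return "unclassified"

if __name__ == "__main__":
    for value in (14.4, 3.6, 0.3):
        print(value, classify_autosome(value))
```

The two thresholds leave a deliberate gap, so that borderline samples are reported as unclassified rather than forced into either category.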
Sex chromosome classification in this test was performed by applying NCVs for X and Y in order, as follows:
1. If NCV X < -4.0 and NCV Y < 2.5, then the sample is classified as monosomy X.
2. If NCV X > -2.5 and NCV X <2.5 and NCV Y <2.5, then the sample is classified as female (XX).
3. If NCV X >4.0 and NCV Y <2.5, then the sample is classified as XXX.
4. If NCV X > -2.5 and NCV X <2.5 and NCV Y >33, then the sample is classified as XXY.
5. If NCV X < -4.0 and NCV Y >4.0, then the sample is classified as male (XY).
6. If condition 5 is met, but NCV Y is about 2 times the value expected for the measured NCV X, then the sample is classified as XYY.
7. If the NCVs of chromosomes X and Y do not meet any of the above criteria, then the sample is designated unclassified with respect to gender.
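The ordered rules above can be sketched as a single function. Rules 1 to 5 and rule 7 are implemented as written. Rule 6 is not fully specified in the text: how the expected NCV Y is derived from NCV X, and how close to "about 2 times" the ratio must be, are not given, so the `expected_ncv_y` argument and the `xyy_tolerance` default below are hypothetical placeholders supplied by the caller.

```python
from typing import Optional

def classify_sex(ncv_x: float, ncv_y: float,
                 expected_ncv_y: Optional[float] = None,
                 xyy_tolerance: float = 0.5) -> str:
    """Apply the sex chromosome rules in order (rules 1 to 7)."""
    if ncv_x < -4.0 and ncv_y < 2.5:           # rule 1
        return "monosomy X"
    if -2.5 < ncv_x < 2.5 and ncv_y < 2.5:     # rule 2
        return "XX"
    if ncv_x > 4.0 and ncv_y < 2.5:            # rule 3
        return "XXX"
    if -2.5 < ncv_x < 2.5 and ncv_y > 33:      # rule 4
        return "XXY"
    if ncv_x < -4.0 and ncv_y > 4.0:           # rule 5
        # Rule 6 (assumption): XYY if NCV Y is about twice the value
        # expected for this NCV X; the expectation is caller-supplied.
        if expected_ncv_y is not None and expected_ncv_y > 0:
            if abs(ncv_y / expected_ncv_y - 2.0) <= xyy_tolerance:
                return "XYY"
        return "XY"
    return "unclassified"                      # rule 7

if __name__ == "__main__":
    print(classify_sex(-5.0, 0.1))
    print(classify_sex(-8.0, 120.0, expected_ncv_y=60.0))
```

Because rules 1 to 5 are mutually exclusive, the order matters only for rule 6, which refines a sample that already satisfies rule 5.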
Because the laboratory was blinded to clinical information, the sequencing results were not adjusted for any of the following demographic variables: maternal body mass index, smoking status, presence of diabetes, type of conception (spontaneous or assisted), previous pregnancies, previous aneuploidy, or gestational age. Neither maternal nor paternal samples were used for classification, and classification according to the present method does not depend on measurements of specific loci or alleles.
The sequencing results were returned to an independent contracted biostatistician prior to unblinding and analysis. Personnel at the study sites, the CROs (including the biostatistician who generated the random sampling lists), and the contracted cytogeneticist were blinded to the sequencing results.
TABLE 23 systematically determined normalized chromosome sequences for all chromosomes
Statistical methods were documented in a detailed statistical analysis plan for the study. Point estimates of sensitivity and specificity and exact 95% confidence intervals were calculated using the Clopper-Pearson method for each of the six analysis categories. For all statistical estimation procedures performed, samples with no fetal DNA detected, 'censored' complex karyotypes (per the protocol-defined conventions), or samples 'unclassified' by the sequencing test were removed.
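For illustration, an exact (Clopper-Pearson) interval can be computed with the standard library alone by solving the binomial tail equations numerically. The sketch below is not the study's statistical code; the function names and the bisection approach are our own choices.

```python
from math import comb

def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1.0 - p)**(n - i) for i in range(k + 1))

def _solve_increasing(f, target: float) -> float:
    """Bisection for f(p) = target, with f increasing on [0, 1]."""
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def clopper_pearson(k: int, n: int, alpha: float = 0.05):
    """Exact two-sided (1 - alpha) confidence interval for k successes in n."""
    # Lower bound: P(X >= k | p) = alpha / 2
    lower = 0.0 if k == 0 else _solve_increasing(
        lambda p: 1.0 - binom_cdf(k - 1, n, p), alpha / 2.0)
    # Upper bound: P(X <= k | p) = alpha / 2
    upper = 1.0 if k == n else _solve_increasing(
        lambda p: 1.0 - binom_cdf(k, n, p), 1.0 - alpha / 2.0)
    return lower, upper

if __name__ == "__main__":
    for k, n in ((89, 89), (35, 36), (15, 16)):
        lo, hi = clopper_pearson(k, n)
        print(f"{k}/{n}: {100 * lo:.1f} to {100 * hi:.1f}")
```

Applied to 89/89 and 35/36, this reproduces the lower bounds of 95.9% and 85.5% reported for trisomy 21 and trisomy 18.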
Results
Between June 2010 and August 2011, 2,882 pregnant women were enrolled in the study. Characteristics of the eligible subjects and the selected cohort are provided in table 24. Subjects who were enrolled and provided blood, but who were later found during data monitoring to exceed the inclusion criteria, with an actual gestational age at enrollment beyond 22 weeks 0 days, were allowed to remain in the study (n = 22). Three of these samples were in the selected set. Fig. 50 shows the flow of samples between enrollment and analysis. There were 2,625 samples eligible for selection.
TABLE 24 patient demographics
* GA at the time of the invasive procedure.
** Higher penetrance of ultrasound abnormalities in fetuses with abnormal karyotypes.
Abbreviations: BMI, body mass index; IUGR, intrauterine growth restriction
According to the random sampling plan, all eligible subjects with abnormal karyotypes, as well as a set of subjects carrying euploid fetuses, were selected for analysis (fig. 50B), so that the total sequenced study population had a ratio of unaffected to affected subjects of approximately 4:1 for trisomy 21. From this process, 534 subjects were selected. Two samples were subsequently removed from the analysis because of sample tracking problems, in which the full chain of custody between the sample tube and data acquisition failed quality monitoring (fig. 50). This left 532 subjects for analysis, contributed by 53 of the 60 study sites. The demographics of the selected cohort are similar to those of the total cohort.
Testing performance
Fig. 51A to 51C show flow charts of the aneuploidy analyses of chromosomes 21, 18, and 13, and fig. 51D to 51F show flow charts of the gender analyses. Table 27 shows the sensitivity, specificity and confidence intervals for each of the six analyses, and figures 52, 53 and 54 show graphical distributions of the samples according to NCV after sequencing. In all 6 analysis categories, 16 samples (3.0%) were removed because no fetal DNA was detected. After unblinding, these samples had no distinguishing clinical features. The number of censored karyotypes for each category depended on the condition being analyzed (fully detailed in fig. 52).
The sensitivity and specificity of the method for detecting T21 in the analysis population (n = 493) were 100% (95% CI = 95.9-100.0) and 100% (95% CI = 99.1-100.0), respectively (table 27 and fig. 51A). This included the correct classification of one complex T21 karyotype, 47,XX,inv(7)(p22q32),+21, and two translocation T21 cases arising from Robertsonian translocations, one of which was also mosaic for monosomy X (45,X,+21,der(14;21)(q10;q10)[4]/46,XY,+21,der(14;21)(q10;q10)[17] and 46,XY,+21,der(21;21)(q10;q10)).
The sensitivity and specificity for detecting T18 in the analysis population (n = 496) were 97.2% (95% CI = 85.5-99.9) and 100% (95% CI = 99.2-100.0) (table 27 and fig. 51B). Although censored from the primary analysis (per protocol), four samples with mosaic karyotypes for T21 and T18 were correctly classified by the method of the invention as 'affected' for aneuploidy (table 25). Because they were correctly detected, they are indicated on the left side of fig. 51A and 51B. All remaining censored samples were correctly classified as unaffected for trisomy of chromosomes 21, 18 and 13 (table 25). The sensitivity and specificity for detecting T13 in the analysis population were 78.6% (95% CI = 49.2-99.9) and 100% (95% CI = 99.2-100.0) (fig. 51C). One detected T13 case arose from a Robertsonian translocation (46,XY,+13,der(13;13)(q10;q10)). There were seven unclassified samples in the chromosome 21 analysis (1.4%), five in the chromosome 18 analysis (1.0%), and two in the chromosome 13 analysis (0.4%) (FIGS. 51A-51C). Across all categories, three samples overlapped, having both a censored karyotype (69,XXX) and no fetal DNA detected. One sample that was unclassified in the chromosome 21 analysis was correctly identified as T13 in the chromosome 13 analysis, and one sample that was unclassified in the chromosome 18 analysis was correctly identified as T21 in the chromosome 21 analysis.
TABLE 25 Censored karyotypes
* Subjects excluded from all categories analyzed due to marker chromosomes in one cell line.
** The karyotype 48,XXY,+18 was unclassified in the chromosome 18 analysis, and sex chromosome aneuploidy was not detected.
TABLE 26 Uncensored abnormal and complex karyotypes
* After unblinding, an increased Normalized Chromosome Value (NCV) of 3.6 was noted from the sequence tags on chromosome 6.
The sex chromosome analysis population (female, male or monosomy X) used to determine the performance of the method numbered 433. Our refined algorithm for classifying gender status, which allows accurate determination of sex chromosome aneuploidies, resulted in a higher number of unclassified results. The sensitivity and specificity for detecting diploid female status (XX) were 99.6% (95% CI = 97.6 to >99.9) and 99.5% (95% CI = 97.2 to >99.9), respectively; the sensitivity and specificity for detecting males (XY) were both 100% (95% CI = 98.0-100.0); and the sensitivity and specificity for detecting monosomy X (45,X) were 93.8% (95% CI = 69.8-99.8) and 99.8% (95% CI = 98.7 to >99.9). Although censored from the analysis (per protocol), the sequencing results for the mosaic monosomy X karyotypes were classified as follows (table 25): 2/7 were classified as monosomy X, 3/7 with a Y chromosome component were classified as XY, and 2/7 with an XX chromosome component were classified as female. Two samples classified as monosomy X according to the method of the invention had karyotypes 47,XXX and 46,XX. For karyotypes 47,XXX, 47,XXY, and 47,XYY, eight of ten sex chromosome aneuploidies were correctly classified (Table 25). If sex chromosome classification had been limited to monosomy X, XY and XX, most of the unclassified samples would have been correctly classified as male, but the XXY and XYY sex chromosome aneuploidies would not have been identified.
In addition to accurately classifying trisomy of chromosomes 21, 18 and 13 and gender, the sequencing results also correctly classified aneuploidy for chromosomes 16 and 20 in two samples (47,XX,+16 and 47,XX,+20) (table 26). Interestingly, one sample with a clinically complex alteration of the long arm of chromosome 6 (6q) involving two duplications (one of which was 37.5 megabases in size) showed an increased NCV from the sequence tags on chromosome 6 (NCV = 3.6). In another sample, aneuploidy of chromosome 2 was detected according to the methods of the invention, but was not observed in the fetal karyotype at amniocentesis (46,XX). Other complex karyotypic variants shown in tables 25 and 26 include samples from fetuses with chromosomal inversions, deletions, translocations, triploidy, and other abnormalities that were not detected here, but that could potentially be classified at higher sequencing density and/or with further algorithmic optimization using the methods of the present invention. In these cases, the method of the invention correctly classified the samples as unaffected for trisomy 21, 18 or 13, and as male or female.
In this study, the 38/532 analyzed samples were from women who underwent assisted reproduction. Wherein the 17/38 sample has a chromosomal abnormality; no false positives or false negatives were detected in this subpopulation.
TABLE 27 sensitivity and specificity of the method
Discussion
This prospective study of determining whole-chromosome fetal aneuploidy from maternal plasma was designed to simulate real-world sample collection, handling, and analysis. Whole blood samples were obtained at the enrollment sites, were not processed immediately, and were shipped overnight to the sequencing laboratory. In contrast to a previous prospective study involving only chromosome 21 (Palomaki et al., Genetics in Medicine 2011 1), in this study all qualified samples with any abnormal karyotype were sequenced and analyzed. The sequencing laboratory did not know in advance which fetal chromosomes might be affected, nor the ratio of aneuploid to euploid samples. The study was designed to recruit a high-risk study population of pregnant women to ensure a statistically significant prevalence of aneuploidy, and Tables 25 and 26 indicate the complexity of the karyotypes analyzed. The results demonstrate that: i) fetal aneuploidy (including that caused by translocation trisomy, mosaicism, and complex variation) can be detected with high sensitivity and specificity; and ii) aneuploidy in one chromosome does not affect the ability of the method of the invention to correctly identify the euploid status of other chromosomes. The algorithms used in previous studies do not appear to be effective in determining the other aneuploidies that will inevitably exist in the general clinical population (Ehrich et al., Am J Obstet Gynecol 2011 3; 204(3): 205e1-11; Chiu et al., BMJ 2011 342).
Regarding mosaicism, analysis of the sequencing information in this study correctly classified samples with mosaic karyotypes for chromosomes 21 and 18 in 4/4 of the affected samples. These results demonstrate the sensitivity of the assay for detecting specific characteristics of cell-free DNA in complex mixtures. In one case, the sequencing data for chromosome 2 indicated a complete or partial chromosomal aneuploidy, while the amniocentesis karyotype result for chromosome 2 was diploid. In two other cases, one sample had a 47,XXX karyotype and the other a 46,XX karyotype, and the method of the invention classified both samples as monosomy X. It is possible that these are cases of mosaicism, or that the pregnant woman herself is mosaic. (It is important to remember that sequencing is performed on total DNA, which is a combination of maternal and fetal DNA.) Although cytogenetic analysis of amniocytes or villi obtained by invasive procedures is currently the reference standard for the classification of aneuploidy, karyotyping a limited number of cells cannot rule out low-level mosaicism. The current clinical study design does not include long-term infant follow-up or access to placental tissue at delivery, so we cannot determine whether these are true or false positive results. We speculate that the specificity of the sequencing process, combined with the algorithm optimized according to the method of the invention for detection across the entire genome, may ultimately provide a more sensitive identification of fetal DNA abnormalities than standard karyotyping, particularly in cases of mosaicism.
The International Society for Prenatal Diagnosis has published a rapid response statement commenting on the commercial availability of massively parallel sequencing (MPS) for prenatal detection of Down syndrome (Benn et al., Prenatal Diagnosis (Prenat Diagn) 2012 doi.). The statement notes that, before routine MPS-based population screening for fetal Down syndrome is introduced, evidence of test performance is required in certain subpopulations, such as women who conceive by in vitro fertilization. The results reported here indicate that the present method is accurate in this group of pregnant women, many of whom are at higher risk of aneuploidy.
While these results demonstrate the excellent performance of the present method, using optimized algorithms for aneuploidy detection across the entire genome, in singleton pregnancies of women at higher risk of aneuploidy, more experience is required to establish confidence in the diagnostic performance of the method when prevalence is low and in multiple pregnancies, especially in low-risk populations. In the early stages of clinical implementation, chromosomes 21, 18, and 13 should be classified using sequencing information according to the present method after a positive first- or second-trimester screening result. This will reduce the unnecessary invasive procedures that result from false positive screening results, with a concomitant reduction in procedure-related adverse events. Invasive procedures may then be limited to confirming positive sequencing results. However, there are clinical situations (e.g., advanced maternal age and infertility) in which pregnant women want to avoid invasive procedures; they may request this test as an alternative to primary screening and/or invasive procedures. All patients should receive adequate pre-test counseling to ensure that they understand the limitations of the test and the implications of the results. As experience accumulates with more samples, it is likely that this test will replace current screening protocols and become a primary screen, and ultimately a non-invasive diagnostic test for fetal aneuploidy.
Example 17
Determination of fetal fraction from NCV to discern the presence of complete or partial fetal chromosomal aneuploidy in an assay sample
Assuming that the chromosome dose of the affected fetal chromosome in a maternal sample increases in proportion to the fetal fraction, one would expect that, for a complete chromosome of interest, the ff value derived from the NCV value can be used to determine the presence or absence of a complete fetal chromosomal aneuploidy. To demonstrate that the ff determined from NCV can be used to distinguish a complete chromosomal aneuploidy from a partial chromosomal aneuploidy or from the contribution of a mosaic sample, genomic DNA from mothers and their children was used to create artificial samples that mimic the mixture of fetal and maternal cfDNA found in the maternal circulation. The NCV-based value of the fetal fraction is a form of the putative fetal fraction described above.
DNA from the mothers and their children was purchased from the Coriell Institute for Medical Research (Camden, New Jersey). DNA identifiers and sample karyotypes are provided in Table 27.
TABLE 27 EXAMPLE 17
Samples containing complete or partial chromosomal aneuploidies were analyzed as follows.
In all cases, genomic DNA from the mother and genomic DNA from the child were sheared by sonication to a peak size of 200 bp. Artificial samples containing maternal DNA plus 0%, 5%, or 10% w/w child DNA were processed to prepare sequencing libraries, which were sequenced by massively parallel sequencing-by-synthesis as described in Example 12. Each artificial DNA sample was sequenced four times on a sequencer using a separate flow cell to provide 4 sets of sequence information for each sample containing 0%, 5%, and 10% child DNA. The 36 bp reads were aligned to the human reference genome hg19, and the uniquely mapped tags were counted. For each of the 4 flow cell lanes used for each sample, approximately 125×10^6 sequence tags were obtained.
Normalizing chromosomes (single chromosomes or groups of chromosomes) were identified in a set of qualified samples comprising 20 male and 20 female gDNA libraries, as described elsewhere herein. The normalizing chromosome for chromosome 21 was identified as chromosome 4 + chromosome 16 + chromosome 22; the normalizing chromosome for chromosome 7 as chromosome 4 + chromosome 6 + chromosome 8 + chromosome 12 + chromosome 19 + chromosome 20; the normalizing chromosome for chromosome 15 as chromosome 9 + chromosome 12 + chromosome 14 + chromosome 19 + chromosome 20; the normalizing chromosome for chromosome 22 as chromosome 19; and the normalizing chromosome for chromosome X as chromosome 4 + chromosome 6 + chromosome 7 + chromosome 8. The sequence tags for the chromosomes of interest and the corresponding normalizing chromosomes (single chromosomes or groups of chromosomes) obtained from sequencing the samples were counted and used to calculate chromosome doses and NCVs.
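The dose and NCV calculation described above can be sketched as follows. This is a minimal illustration, not the implementation used in the study: the tag counts and qualified-sample doses are hypothetical, the function names are invented for this sketch, and the NCV is assumed to take the usual z-score form (test dose minus the mean dose of the qualified samples, divided by their standard deviation). Only the chromosome 21 normalizer (chromosome 4 + chromosome 16 + chromosome 22) is taken from the text.

```python
from statistics import mean, stdev

def chromosome_dose(tag_counts, chrom, normalizing):
    """Tags on the chromosome of interest divided by the summed tags
    on its normalizing chromosome (a single chromosome or a group)."""
    denominator = sum(tag_counts[c] for c in normalizing)
    return tag_counts[chrom] / denominator

def ncv(test_dose, qualified_doses):
    """Assumed z-score form: (dose - mean) / standard deviation,
    with mean and standard deviation taken from qualified samples."""
    return (test_dose - mean(qualified_doses)) / stdev(qualified_doses)

# Hypothetical tag counts for one test sample (not data from the study).
tags = {"chr21": 1_000, "chr4": 5_000, "chr16": 3_000, "chr22": 2_000}
dose_21 = chromosome_dose(tags, "chr21", ["chr4", "chr16", "chr22"])

# Hypothetical chromosome 21 doses observed in qualified (unaffected) samples.
qualified = [0.09, 0.10, 0.11]
print(dose_21, ncv(0.12, qualified))
```

Summing the tags of a normalizing group before dividing, rather than averaging per-chromosome ratios, follows the "chromosome 4 + chromosome 16 + chromosome 22" notation used in the text.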
In this example, ff is determined using the NCV for chromosome 21 in sample mixture (1), where $NCV_{21A}$ is the NCV value determined for chromosome 21 in test sample (1), which contains trisomic chromosome 21, and $CV_{21U}$ is the coefficient of variation of the chromosome 21 dose determined in qualified samples (containing diploid chromosome 21); and where $NCV_{XA}$ is the NCV value determined for chromosome X in test sample (1), which contains trisomic chromosome 21, and $CV_{XU}$ is the coefficient of variation of the chromosome X dose determined in qualified samples (containing unaffected female fetal chromosomes).
FIG. 56 shows a plot of the percent "ff" determined using the dose of chromosome 21 ($ff_{21}$) as a function of the percent "ff" determined using the dose of chromosome X ($ff_{X}$) in synthetic maternal sample (1), which contains DNA from a child with trisomy 21.
The data show that the chromosome dose and the NCV derived therefrom increase in proportion to ff, and that there is a 1:1 relationship between the percent ff determined using the dose of a trisomic chromosome (i.e., chromosome 21) and the percent ff determined using the dose of a chromosome known to be present as a single copy (i.e., chromosome X).
FIG. 57 shows a plot of the percent "ff" determined using the dose of chromosome 7 ($ff_{7}$) as a function of the percent "ff" determined using the dose of chromosome X ($ff_{X}$) in synthetic maternal sample (2), which contains DNA from a euploid mother and her child, who carries a partial deletion in chromosome 7.
As with sample (1), the data for sample (2) show that the chromosome dose and the NCV derived therefrom increase in proportion to ff. However, where the aneuploidy is a partial chromosomal aneuploidy, the percent ff determined using the dose of the partially aneuploid chromosome ($ff_{7}$) does not correspond to the percent ff determined using the dose of chromosome X ($ff_{X}$). Thus, a deviation from the 1:1 relationship shown for the complete trisomy sample indicates the presence of a partial aneuploidy.
FIG. 58 shows a plot of the percent "ff" determined using the dose of chromosome 15 ($ff_{15}$) as a function of the percent "ff" determined using the dose of chromosome X ($ff_{X}$) in synthetic maternal sample (3), which contains DNA from a euploid mother and her child, who is 25% mosaic for a partial duplication of chromosome 15.
As with samples (1) and (2), the chromosome dose and the NCV derived therefrom increase in proportion to ff. As in sample (2), sample (3) contains a partial chromosomal aneuploidy, and the percent ff determined using the dose of the partially aneuploid chromosome ($ff_{15}$) does not correspond to the percent ff determined using the dose of chromosome X ($ff_{X}$). The lack of correspondence between the two ff values indicates the presence of a partial aneuploidy rather than a complete chromosomal aneuploidy.
FIG. 59 shows a plot of the percent "ff" determined using the dose of chromosome 22 ($ff_{22}$) and the NCVs derived therefrom in artificial sample (4), which contained (i) 0% child DNA; (ii) 10% DNA from an unaffected twin son known not to have a partial chromosomal aneuploidy of chromosome 22; and (iii) 10% DNA from the affected twin son known to have a partial chromosomal aneuploidy of chromosome 22. The data show that, for the sample containing DNA from the unaffected twin, the "ff" determined from the four NCVs calculated from the dose of chromosome 22 is close to zero, indicating that there is no aneuploidy of chromosome 22 in the unaffected twin; when calculated from the dose of chromosome X, the "ff" confirms that the "ff" of the unaffected twin sample is about 10%. The data also show that, for the sample containing DNA from the affected twin, the "ff" determined from the four NCVs calculated from the dose of chromosome 22 ($ff_{22}$) is about 3%, indicating the presence of an aneuploidy in chromosome 22; when calculated from the dose of chromosome X ($ff_{X}$), the "ff" confirms that the "ff" of the affected twin sample is about 10%. The lack of correspondence between $ff_{22}$ and $ff_{X}$ indicates that the aneuploidy of chromosome 22 in the affected twin is a partial chromosomal aneuploidy.
Thus, the data show that, in maternal samples containing cfDNA of male fetuses, chromosome doses and the NCV values derived therefrom can be used to distinguish a complete trisomy from a partial aneuploidy and/or from a complete or partial aneuploidy present in a mosaic sample. A partial aneuploidy may be a gain or a loss of a portion of a chromosome. Optionally, partial aneuploidy and/or mosaicism can be resolved by using the chromosome dose and the estimated fetal fraction as described in Example 12.
The fetal fraction method described above may also be used to determine the likelihood that one or more fetuses in a multiple pregnancy is aneuploid. For example, in one case of dizygotic twins, the fetal fraction determined from the $NCV_{X}$ value was 8.3%, whereas the fraction determined from $NCV_{21}$ was 5.0%. This indicates that only one of the pair of male fetuses had a T21 aneuploidy, and the result was confirmed by karyotype. In another example, with monozygotic twins, the fetal fraction determined from chromosome X was 7.3%, while the fetal fraction determined from chromosome 18 was 8.9%. In this example, both twins were determined by karyotype to be T18 males.
Example 18
Determining fetal fraction from NCV to identify the presence of complete fetal chromosomal aneuploidy in clinical samples
To demonstrate that the ff determined from NCV (CNff) can be used to distinguish a complete chromosomal aneuploidy from a partial chromosomal aneuploidy in clinical samples, cfDNA obtained from maternal blood was used to quantify chromosomes of interest 21, 13, and 18 in clinical samples. The presence of trisomy was verified by karyotype.
cfDNA was obtained from the following samples: 46 maternal samples, each from a pregnant woman carrying a male fetus with trisomy 21 (T21); 13 maternal samples, each from a pregnant woman carrying a fetus with trisomy 18 (T18); and 3 maternal samples, each from a pregnant woman carrying a male fetus with trisomy 13 (T13). These clinical samples were taken from the clinical study described in Example 16. cfDNA was isolated and sequencing libraries were prepared as described in Example 16, but using the new Illumina v3 chemistry.
Sequencing libraries made from cfDNA of qualified samples known to be unaffected for chromosomes 21, 18, and 13 were also sequenced using the new Illumina v3 chemistry. Sequence reads obtained for the qualified samples were mapped to the human reference genome hg19, and reads that mapped uniquely to any chromosomal sequence of hg19 (non-repeat-masked) were counted and used to systematically determine which chromosome or group of chromosomes would serve as the normalizing chromosome for each of the chromosomes of interest 21, 18, and 13 in the test samples.
Table 28 below shows the normalizing chromosomes (denominator chromosomes) identified for determining chromosome doses (ratios) for chromosomes 1-22, X, and Y in each test sample.
TABLE 28 Example 18: systematically identified normalizing chromosomes for use in the T21, T18, and T13 test samples
Once the normalizing chromosomes had been identified in the qualified samples, the test samples were sequenced, and the sequence tags mapped to each of chromosomes 21, 18, and 13 and to the corresponding normalizing chromosomes in the test samples were counted and used to calculate chromosome doses (ratios). The NCV values were then calculated as previously described.
For each test sample, the fetal fraction for chromosome X and for the chromosome of interest was determined according to the following equation, described elsewhere in this specification:
$$ff = 2 \times \left| NCV_{iA} \times CV_{iU} \right| \qquad \text{(Equation 28)}$$
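A brief sketch of Equation 28 may help show why the factor of 2 appears. Assuming the NCV is a z-score of the dose, $(x - \mu)/\sigma$, and the coefficient of variation is $\sigma/\mu$ (both assumptions of this sketch), their product is the relative change in dose, $(x - \mu)/\mu$. For a complete trisomy, or for chromosome X in a male fetus, the fetus gains or loses one of two copies, so the relative change in dose is ff/2. The function name and all numbers below are hypothetical.

```python
def fetal_fraction_from_ncv(ncv, cv):
    """Equation 28: ff = 2 * |NCV * CV|.
    The absolute value lets the same formula serve chromosome X in a
    male fetus, where the NCV is negative."""
    return 2.0 * abs(ncv * cv)

# Round trip on hypothetical numbers: simulate a complete trisomy at a
# known fetal fraction and recover that fraction from NCV and CV.
mu, sigma = 0.100, 0.0005        # qualified-sample mean and SD of the dose
true_ff = 0.10                   # 10% fetal fraction
dose = mu * (1 + true_ff / 2)    # one extra fetal copy out of two
ncv = (dose - mu) / sigma        # assumed z-score form of the NCV
cv = sigma / mu                  # coefficient of variation
recovered_ff = fetal_fraction_from_ncv(ncv, cv)
print(round(recovered_ff, 4))
```

The same call with a negative NCV (a single-copy chromosome X) returns the same magnitude, which is what allows the chromosome X estimate to be compared directly with that of a trisomic autosome.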
Figure 60 shows a plot of CNffx versus CNff21 determined in samples containing fetal trisomy 21 (T21). As expected for a complete chromosomal aneuploidy, CNffx matches the CNff21 determined using the NCV for chromosome 21.
Similarly, in the T18 test samples, CNffx matched the CNff18 determined using the NCV for chromosome 18 (Fig. 61), and in the T13 test samples, CNffx matched the CNff13 determined using the NCV for chromosome 13 (Fig. 62).
Fig. 60 also shows the fetal fraction obtained for samples from female fetuses affected by T21. As expected, CNff21 in these "female" samples could not be verified by comparison with chromosome X. To verify the CNff21 of a female sample, the CNff can be determined for a chromosome that is known not to occur as a viable fetal aneuploidy (e.g., chromosome 1). Alternatively, the CNff21 of a "female" sample can be verified by comparing it to the NCNff, e.g., as determined by counting tags of polymorphic sequences as described elsewhere herein.
Thus, the number of sequence tags and the resulting NCV values that identify copy number variations of complete chromosomes can be used to determine the corresponding fetal fraction in aneuploid (affected) samples. The correspondence between the CNff of the chromosome of interest and the CNff of a chromosome known not to be aneuploid can be used to confirm the presence of a complete chromosomal trisomy.
Example 19
Determining fetal fraction from NCV to identify the presence of partial fetal chromosomal aneuploidy in a clinical sample
To demonstrate that the ff determined from NCV (CNff) can be used to identify the presence of a partial chromosomal aneuploidy in a clinical sample and to locate it, cfDNA from a clinical sample that had been identified as having an aneuploidy of chromosome 17 was sequenced and analyzed as described in Example 18.
Using the sequence tags mapped to chromosome 17 in the test sample and the normalizing chromosomes identified in the set of qualified samples (chromosome 16 + chromosome 20 + chromosome 22; Table 28 above), the NCV value for each chromosome in the test sample was calculated.
FIG. 63 shows a graph of NCV values for chromosomes 1-22 and X in the test sample. As shown in the figure, the NCV value for chromosome 17 was determined to have NCV >4, which is the threshold selected for identifying aneuploid chromosomes. The figure also shows the NCV value for chromosome X, which as expected has a negative NCV.
The CNff for chromosome 17 and chromosome X is calculated according to the following equation:
$$ff_{(i)} = 2 \times NCV_{jA} \times CV_{jU} \qquad \text{(Equation 25)}$$
and CNff17=3.9% and CNffX =13.5% were determined.
The difference between the two CNff values indicates the presence of a partial aneuploidy or possible mosaicism.
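The comparison of the two fetal fraction estimates can be sketched as a simple concordance check. The source gives no numeric cutoff for "correspondence", so the relative tolerance `rel_tol` below is purely an illustrative assumption, as is the function name. The inputs are the values reported in this example (CNff17 = 3.9%, CNffX = 13.5%) and the two twin cases reported in Example 17.

```python
def ff_concordance(ff_interest, ff_reference, rel_tol=0.25):
    """Return (ratio, concordant) for two fetal fraction estimates.
    rel_tol is a hypothetical tolerance; the source states no cutoff."""
    ratio = ff_interest / ff_reference
    return ratio, abs(ratio - 1.0) <= rel_tol

# Chromosome 17 versus chromosome X in this example.
ratio_17, ok_17 = ff_concordance(0.039, 0.135)

# Twin cases from Example 17: chromosome 18 versus X, chromosome 21 versus X.
ratio_t18, ok_t18 = ff_concordance(0.089, 0.073)
ratio_t21, ok_t21 = ff_concordance(0.050, 0.083)

print(round(ratio_17, 2), ok_17)    # well below 1: partial aneuploidy or mosaicism suspected
print(round(ratio_t18, 2), ok_t18)  # near 1 under the assumed tolerance
print(round(ratio_t21, 2), ok_t21)  # below 1: consistent with only one affected twin
```

A ratio well below 1 does not by itself separate partial aneuploidy from mosaicism or from a single affected twin; that is why the example goes on to examine individual bins along the chromosome.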
To distinguish a partial aneuploidy from possible mosaicism, the number of tags was counted for each contiguous 100 Kbp block (bin) of chromosome 17, and a Normalized Bin Value (NBV) was calculated for each bin. The number of tags in an individual bin was normalized by taking the ratio of the tags in that bin to the sum of the tags in the 20 bins of the same size whose GC content is closest to that of the bin being analyzed. Thus, in this case, the normalization is related to GC content. Optionally, the bin normalization may also take into account the variability of the bin dose, determined in qualified samples as described for chromosome doses (ratios). In this example, the GCC Z score is equal to the NBV value as determined by:
Wherein M is j And MAD j Corresponding to the estimated median and median adjusted bias for the jth chromosome dose in the qualified sample set, and x ij Is the jth chromosome dose observed for test sample i.
The Normalized Bin Values (NBV) for each 100 Kbp bin along the length of chromosome 17 are shown on the Y-axis of Fig. 64 as GCC Z scores, indicating GC normalization. The plot in Fig. 64 clearly shows a copy number gain corresponding to approximately the last 200,000 bp of chromosome 17. This finding is consistent with the karyotype provided for the sample, which shows a duplication at the qter of chromosome 17.
Thus, CNff can be used to identify and locate partial aneuploidies in chromosomes.
Example 20
Verification of sample integrity in multiplexed assays of maternal cfDNA
Marker molecules having sequences known not to be contained in any known genome were synthesized and used to verify the integrity of maternal whole blood-derived and plasma-derived source samples that were processed to extract and sequence the mixture of fetal and maternal cfDNA present in the maternal sample.
Data from current and previous experiments have shown that the average length of cfDNA is about 170 bp. Using a BLAST search against all genome entries, 170 bp antigenomic sequences that are not present in any known genome were identified. Six marker molecules (MM1-MM6) were synthesized based on the identified antigenomic sequences (SEQ ID NOS: 1-6; Table 29) and used to verify the integrity of the samples as follows.
Table 29
Marker molecules
Peripheral blood was collected from a pregnant woman into 4 blood collection tubes (Cell-Free DNA™ BCT; Streck, Inc., Omaha, NE) and shipped overnight to the laboratory for analysis. Two whole blood-derived samples were spiked with marker molecules as follows. One blood-derived sample was spiked with 720 pg of marker molecule 1 (MM1) and a second blood-derived sample was spiked with 720 pg of marker molecule 2 (MM2). All 4 tubes were centrifuged at 1600 g for 10 minutes at 4 °C. The plasma supernatant was removed from each of the four tubes, placed into a 5 mL high-speed centrifuge tube, and centrifuged at 16000 g for 10 minutes at 4 °C. The plasma fraction of the whole blood that had been spiked with marker molecules was aliquoted into separate tubes and stored at -80 °C. The plasma fraction from the two remaining (unspiked) blood tubes was then divided into 1.1 mL aliquots. Plasma-derived samples were prepared as follows. One hundred picograms of MM1 was added to one plasma aliquot, 100 pg of MM2 was added to plasma aliquot 2, and so on, to obtain 6 spiked plasma-derived samples, each containing a different marker molecule (MM1-MM6), which were stored at -80 °C.
One tube of each spiked plasma-derived sample and one tube of each spiked blood-derived sample were thawed, and DNA was extracted using the Qiagen Blood Mini Kit according to the method described in Example 1. Thirty microliters of each sample DNA was used to prepare libraries using the TruSeq™ DNA Sample Preparation Kit including indexes 1-6 (Illumina, San Diego, CA). Sequencing libraries were prepared such that samples comprising MM1 were indexed using index 1, samples comprising MM2 were indexed using index 2, and so on. Sequencing libraries were quantified using an Agilent Bioanalyzer DNA 1000 kit (Agilent Technologies, Santa Clara, CA) and diluted to 4 nM with Qiagen Buffer EB. The indexed and spiked samples were pooled and further diluted to 2 nM, and then sequenced in four lanes of an Illumina HiSeq flow cell according to Table 30 using the Illumina TruSeq SBS Kit v3.
Table 30
Layout of multiple sequencing flow cells
The sequence reads were aligned to the human reference genome hg19 and to a synthetic reference genome comprising the antigenomic marker molecule sequences. Sequence reads that mapped uniquely (i.e., only once) to either the hg19 reference genome or the synthetic reference genome of marker molecule sequences were counted (Table 31).
Table 31
Correspondence of MM sequence to cfDNA sequence of source sample
* I = index
* L = lane
The data show that, for each sample, the MM sequence found among the reads corresponds only to the marker molecule that had been added to the source sample from which the cfDNA was obtained. For example, the data for sample 1 show that reads mapped to MM1 were found only with the cfDNA obtained from the source sample to which MM1 had been added (plasma sample 1). In addition, the absence of a different marker sequence (e.g., MM2) among the reads obtained from the sequenced cfDNA of source sample 1 indicates that source sample 1 was not cross-contaminated by another sample (e.g., source sample 2).
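The integrity check described above amounts to confirming that each sample's reads contain its own marker molecule and no other. A minimal sketch follows; the read counts, the `min_reads` threshold, and the function name are hypothetical, and only the marker names (MM1, MM2) come from the text.

```python
def verify_sample_integrity(expected_marker, marker_read_counts, min_reads=1):
    """Check which marker molecules are detected in a sample's reads.
    min_reads is a hypothetical detection threshold."""
    detected = {m for m, n in marker_read_counts.items() if n >= min_reads}
    return {
        "expected_present": expected_marker in detected,
        "unexpected": sorted(detected - {expected_marker}),
    }

# Hypothetical unique-read counts per marker for two indexed samples.
clean = verify_sample_integrity("MM1", {"MM1": 5000, "MM2": 0})
mixed = verify_sample_integrity("MM1", {"MM1": 4800, "MM2": 12})
print(clean)  # expected marker found, nothing unexpected
print(mixed)  # MM2 reads present: possible cross-contamination or sample swap
```

A sample in which the expected marker is absent and a different one is present would suggest a mix-up rather than contamination; both cases are visible in the returned dictionary.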
Example 21
Internal positive control
An in-process positive control (IPC) was developed for massively parallel sequencing of maternal cfDNA, providing qualitatively positive chromosome dose and NCV values for trisomy 13, trisomy 18, and trisomy 21.
Fragmented genomic DNA from three male patients with known trisomies of Chr13, Chr18, and Chr21, respectively, was added to a background of fragmented female DNA. The fragmented genomic DNA was size selected by PAGE to contain fragments ranging in length from about 150 bp to about 250 bp, mimicking the size of fetal cfDNA. The size-selected DNA of the T13, T18, and T21 controls was purified and end-repaired, and the concentration was measured using a Nanodrop (Wilmington, DE). The prepared DNA was verified on a Bioanalyzer High Sensitivity DNA chip (Agilent, Santa Clara, CA). The trisomy 13, trisomy 18, and trisomy 21 DNAs were obtained from the Coriell Institute for Medical Research (Camden, NJ). Female genomic DNA was obtained from The BioChain Institute (Hayward, CA). A small amount of trisomy DNA was added to the predominantly female DNA background to mimic a "male fetal" DNA fraction in a female "maternal" DNA background. The composition of this DNA mixture was optimized such that, when used in a sequencing assay to determine copy number variation, the mixture is always qualitatively positive for trisomy 13, trisomy 18, and trisomy 21, with NCV values for chromosomes 13, 18, and 21 greater than 4.
Maternal cfDNA was extracted from plasma samples obtained from pregnant women, and sequencing libraries of the maternal sample cfDNA and of the T13, T18, and T21 control DNAs were prepared for multiplexed sequencing on the Illumina platform. Four positive controls and 56 samples were sequenced in each flow cell of the sequencer. As described elsewhere in this application, 36 bp reads were obtained, tags for the chromosomes were identified, and NCV values were calculated.
Fig. 69A, 69B, and 69C show the NCV values of the maternal test samples (o) and the in-process positive controls (□). Samples with NCV values above 4 were determined to have a copy number variation for the chromosome of interest, i.e., chromosome 13 (A), 18 (B), or 21 (C), respectively. The figures show that the NCVs of the positive controls are consistent with those of the maternal test samples identified as having a copy number variation, i.e., an extra copy of chromosome 13, 18, or 21.
Internal positive controls can be designed to mimic both whole and partial chromosomal variations, and these can be used in prenatal diagnostic tests and related tests such as determining fetal fraction by massively parallel sequencing as described throughout this specification.
Example 22
Determination of fetal fraction using massively parallel sequencing: sample processing and cfDNA extraction
Peripheral blood samples were collected from pregnant women in the first or second trimester of pregnancy who were considered at risk for fetal aneuploidy. Consent was obtained from each participant prior to the blood draw. Blood was collected prior to amniocentesis or chorionic villus sampling. Karyotyping was performed using the chorionic villus or amniocentesis samples to determine the fetal karyotype.
Peripheral blood drawn from each subject was collected in ACD tubes. One tube of blood sample (about 6 to 9 ml/tube) was transferred to a 15 ml low-speed centrifuge tube. The blood was centrifuged at 2640 rpm for 10 minutes at 4 °C using a Beckman Allegra 6R centrifuge and a GA 3.8 rotor.
For cell-free plasma extraction, the upper plasma layer was transferred to a 15 ml high-speed centrifuge tube and centrifuged at 16000 × g for 10 minutes at 4 °C using a Beckman Coulter Avanti J-E centrifuge and a JA-14 rotor. Both centrifugation steps were performed within 72 hours of blood collection. Cell-free plasma containing cfDNA was stored at -80 °C and thawed only once before amplification of plasma cfDNA or purification of cfDNA.
Purified cell-free DNA (cfDNA) was extracted from cell-free plasma using the QIAamp Blood DNA Mini Kit (Qiagen), essentially according to the manufacturer's instructions. One ml of buffer AL and 100 µl of protease solution were added to 1 ml of plasma. The mixture was incubated at 56 °C for 15 minutes. One ml of 100% ethanol was added to the plasma digest. The resulting mixture was transferred to QIAamp mini columns assembled with the VacValves and VacConnectors provided in the QIAvac 24 Plus column assembly (Qiagen). Vacuum was applied to the samples, and the cfDNA retained on the column filters was washed under vacuum with 750 µl of buffer AW1, followed by a second wash with 750 µl of buffer AW24. The column was centrifuged at 14,000 RPM for 5 minutes to remove any residual buffer from the filter. The cfDNA was eluted with buffer AE by centrifugation at 14,000 RPM, and the concentration was determined using the Qubit™ quantitation platform (Invitrogen).
Example 23
Determination of fetal fraction using massively parallel sequencing: preparation of sequencing libraries, sequencing, and analysis of sequencing data
a. Preparation of sequencing libraries
All sequencing libraries, i.e., target, primary, and enriched libraries, were prepared from about 2 ng of purified cfDNA extracted from maternal plasma. Library preparation was performed using reagents of the NEBNext™ DNA Sample Prep DNA Reagent Set 1 (Part No. E6000L; New England Biolabs, Ipswich, MA) as follows. Because cell-free plasma DNA is fragmented in nature, the plasma DNA samples were not further fragmented by nebulization or sonication. According to the End Repair Module, the cfDNA was mixed with the following reagents provided in the NEBNext™ DNA Sample Prep DNA Reagent Set 1: 5 µl 10X phosphorylation buffer, 2 µl deoxynucleotide solution mix (10 mM each dNTP), 1 µl 1. The enzyme was then heat inactivated by incubating the reaction mixture for 5 minutes at 75 °C. The mixture was cooled to 4 °C, and dA tailing of the blunt-ended DNA was accomplished using 10 µl of the dA-tailing master mix containing the Klenow fragment (3' to 5' exo-) (NEBNext™ DNA Sample Prep DNA Reagent Set 1) and incubating for 15 minutes at 37 °C. Subsequently, the Klenow fragment was heat inactivated by incubating the reaction mixture at 75 °C for 5 minutes. After inactivation of the Klenow fragment, 1 µl of a 1:5 dilution of Illumina Genomic Adaptor Oligo Mix (Part No. 1000521; Illumina, Hayward, CA) was used to ligate the Illumina adaptors (non-index Y adaptors) to the dA-tailed DNA using 4 µl of the T4 DNA ligase provided in the NEBNext™ DNA Sample Prep DNA Reagent Set 1, by incubating the reaction mixture for 15 minutes at 25 °C. The mixture was cooled to 4 °C, and the adaptor-ligated cfDNA was purified from unligated adaptors, adaptor dimers, and other reagents using the magnetic beads provided in the Agencourt AMPure XP PCR purification system (Part No. A63881; Beckman Coulter Genomics, Danvers, MA). Eighteen cycles of PCR were performed to selectively enrich adaptor-ligated cfDNA using a high-fidelity master mix (Finnzymes, Woburn, MA) and Illumina PCR primers complementary to the adaptors (Part Nos. 1000537 and 1000537). The adaptor-ligated DNA was subjected to PCR (98 °C for 30 seconds; 18 cycles of 98 °C for 10 seconds, 65 °C for 30 seconds, and 72 °C for 30 seconds; final extension at 72 °C for 5 minutes; and hold at 4 °C) using Illumina Genomic PCR Primers (Part Nos. 100537 and 1000538) and the Phusion HF PCR Master Mix provided in the NEBNext™ DNA Sample Prep DNA Reagent Set 1. The amplified product was purified using the Agencourt AMPure XP PCR purification system (Agencourt Bioscience Corporation, Beverly, MA) according to the manufacturer's instructions available at www.beckmangenomics.com/products/AMPureXP protocol-000387v001. Pdf. The purified amplification product was eluted in 40 µl of Qiagen EB buffer, and the concentration and size distribution of the amplified libraries were analyzed using the Agilent DNA 1000 kit for the 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA).
b. Sequencing
Library DNA was sequenced using the Genome Analyzer II (Illumina Inc., San Diego, CA, USA) according to standard manufacturer protocols. Copies of the protocol for whole-genome sequencing using Illumina/Solexa technology can be found on page 29 of the BioTechniques® Protocol Guide 2007, published December 2006, and on the World Wide Web at biotechniques.com/default.asp?page=protocol&subsection=article_display&id=112378.
The DNA library was diluted to 1 nM and denatured. Library DNA (5 pM) was subjected to cluster amplification according to the procedure described in Illumina's Cluster Station User Guide and Cluster Station Operations Guide, available on the World Wide Web. The amplified DNA was sequenced using Illumina's Genome Analyzer II to obtain single-end reads of 36 bp. Only about 30 bp of random sequence information is needed to identify a sequence as belonging to a specific human chromosome. Longer sequences can uniquely identify more particular targets. In the present case, a large number of 36 bp reads were obtained, covering approximately 10% of the genome.
c. Analyzing the sequencing data to determine fetal fraction
Once sequencing of the sample was completed, the Illumina "Sequencer Control Software" transferred the image and base call files to a Unix server running the Illumina "Genome Analyzer Pipeline" software version 1.51. The 36 bp reads were aligned to an artificial reference genome (e.g., a SNP genome) using the BOWTIE program. The artificial reference genome was identified as the grouping of polymorphic DNA sequences that encompass the alleles comprised in the polymorphic target sequences. For example, the artificial reference genome is a SNP genome comprising SEQ ID NOs: 7-62. Only reads that mapped uniquely to the artificial genome were used for the analysis of fetal fraction. Reads that matched the SNP genome perfectly were counted as tags and filtered. Of the remaining reads, only reads having one or two mismatches were counted as tags and included in the analysis. The tags mapped to each of the polymorphic alleles were counted, and the fetal fraction was determined as the ratio of the number of tags mapped to the minor allele (i.e., the fetal allele) to the number of tags mapped to the major allele (i.e., the maternal allele).
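The tag-counting rule described above can be sketched in Python. This is an illustration only, not the Genome Analyzer Pipeline or BOWTIE itself. The function name `count_allele_tags`, the layout of each alignment record, and the reading that perfectly matched reads are kept as tags alongside reads with one or two mismatches are all assumptions made for the sketch.

```python
from collections import Counter


def count_allele_tags(alignments, max_mismatches=2):
    """Count sequence tags per allele of a SNP reference genome.

    `alignments` is a hypothetical iterable of tuples
    (allele_id, number_of_mapping_locations, number_of_mismatches),
    one tuple per aligned 36 bp read.
    """
    counts = Counter()
    for allele_id, n_locations, n_mismatches in alignments:
        # Only reads that map uniquely to the artificial genome are used.
        if n_locations != 1:
            continue
        # Assumption: reads with zero, one, or two mismatches count as tags.
        if n_mismatches > max_mismatches:
            continue
        counts[allele_id] += 1
    return counts
```

The resulting per-allele counts are the inputs from which a minor-to-major allele ratio can then be formed for each informative SNP.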
Example 24
Selection of autosomal SNPs to determine fetal fraction
A set of 28 autosomal SNPs was selected from a list of 92 SNPs (Pakstis et al., Hum Genet 127 [...]) and from Applied Biosystems by Life Technologies™ (Carlsbad, Calif.) at the World Wide Web address appliedbiosystems.com. The primers were designed to hybridize to a sequence close to the SNP site on the cfDNA, both to ensure that the SNP site was included in the 36 bp read generated by massively parallel sequencing on the Illumina Analyzer GII and to generate amplicons of sufficient length to undergo bridge amplification during cluster formation. Thus, primers were designed to generate amplicons of at least 110 bp that, when combined with the universal adaptors (Illumina Inc., San Diego, CA) used for cluster amplification, resulted in DNA molecules of at least 200 bp. Primer sequences were identified, and primer sets (i.e., forward and reverse primers) were synthesized by Integrated DNA Technologies (San Diego, CA) and stored as a 1 µM solution to be used for amplifying polymorphic target sequences as described in Examples 25 to 27. Table 33 provides the RefSNP (rs) accession ID numbers, the primers used for amplifying the target cfDNA sequences, and the sequences of the amplicons comprising the possible SNP alleles that would be generated using these primers. The SNPs given in Table 33 were used for the simultaneous amplification of 13 target sequences in a multiplexed assay. The panel provided in Table 33 is an exemplary SNP panel. Fewer or more SNPs can be employed to enrich the fetal and maternal DNA for polymorphic target nucleic acids. Additional SNPs that can be used include the SNPs given in Table 34. The SNP alleles are shown in bold and are underlined.
Other additional SNPs that may be used for determining fetal fraction according to the methods of the invention include rs315791, rs3780962, rs1410059, rs279844, rs38882, rs9951171, rs214955, rs6444724, rs2503107, rs1019029, rs1413212, rs1031825, rs891700, rs1005533, rs2831700, rs354439, rs1979255, rs1454361, rs8037429 and rs1490413, which have been analyzed for determining fetal fraction by TaqMan PCR and are disclosed in U.S. provisional application nos. 296,358/8361 and/83360.
Table 33
SNP panel for determining fetal fraction
Table 34
Additional SNPs for determining fetal fraction
Example 25
Determination of fetal fraction by massively parallel sequencing of a target library
To determine the cfDNA fraction of a fetus in a maternal sample, target polymorphic nucleic acid sequences, each comprising a SNP, are amplified and used to prepare a target library for sequencing in a massively parallel mode.
cfDNA was extracted as described above. A target sequencing library was prepared as follows. An aliquot of the purified cfDNA was amplified in a 50 µl reaction volume containing 7.5 µl of a 1 µM primer mix (Table 1), 10 µl of NEB 5X Master Mix, and 27 µl of water. Thermal cycling was performed with the GeneAmp 9700 (Applied Biosystems) using the following cycling conditions: incubation at 95 °C for 1 minute, followed by 20 to 30 cycles of 95 °C for 20 seconds, 68 °C for 1 minute, and 68 °C for 30 seconds, followed by a final incubation at 68 °C for 5 minutes. A final hold at 4 °C was applied until the samples were removed for combining with the unamplified portion of the purified cfDNA sample. The amplified product was purified using the Agencourt AMPure XP PCR purification system (Part No. A63881; Beckman Coulter Genomics, Danvers, Mass.). A final hold at 4 °C was applied until the product was removed for preparing the target library. The amplified product was analyzed with a 2100 Bioanalyzer (Agilent Technologies, Sunnyvale, CA), and the concentration of the amplified product was determined. A sequencing library of amplified target nucleic acids was prepared as described in Example 23 and sequenced in a massively parallel fashion using sequencing-by-synthesis with reversible dye terminators according to the Illumina protocol (BioTechniques® Protocol Guide 2007, published December 2006, p. 29, and on the World Wide Web at biotechniques.com/default.asp?page=protocol&subsection=article_display&id=112378). Tags that mapped to a reference genome consisting of 26 SNP-containing sequences (13 pairs, each pair representing two alleles), i.e., SEQ ID NOs: 7-32, were analyzed and counted as described.
Table 35 provides the tag counts obtained from sequencing the target library and the fetal fraction calculated from the sequencing data.
Table 35
Determination of fetal fraction by massively parallel sequencing of a polymorphic nucleic acid library
The results indicate that each polymorphic nucleic acid sequence comprising at least one SNP can be amplified from cfDNA derived from a maternal plasma sample to construct a library that can be sequenced in massively parallel mode to determine the fraction of fetal nucleic acids in the maternal sample.
Example 26
Fetal fraction is determined after enrichment of fetal and maternal nucleic acids in cfDNA sequencing library samples.
To enrich for fetal and maternal cfDNA contained in a primary sequencing library constructed using purified fetal and maternal cfDNA, a portion of the purified cfDNA sample is used to amplify a polymorphic target nucleic acid sequence, and a sequencing library of the amplified polymorphic target nucleic acid is prepared that enriches for fetal and maternal nucleic acid sequences contained in the primary library.
The method corresponds to the workflow diagrammed in FIG. 10. A target sequencing library was prepared from a portion of the purified cfDNA as described in Example 23. A primary sequencing library was prepared from the remaining portion of the purified cfDNA as described in Example 23. The primary library was enriched for the amplified polymorphic nucleic acids comprised in the target library by diluting the primary and the target sequencing libraries to 10 nM and combining the target library with the primary library at a ratio of 1:9 to provide an enriched sequencing library. The enriched library was sequenced, and the sequencing data were analyzed as described in Example 23.
Table 36 provides the number of sequence tags that mapped to the SNP genome for the informative SNPs identified by sequencing an enriched library derived from plasma samples of pregnant women carrying a fetus with T21, T13, T18, or monosomy X, respectively. The fetal fraction was calculated as follows:
$$\%\ \text{fetal fraction}_{\text{allele}_x} = \frac{\sum \text{fetal sequence tags for allele}_x}{\sum \text{maternal sequence tags for allele}_x} \times 100$$
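As a hedged illustration of the formula above, the following Python sketch computes the percent fetal fraction for one informative SNP from its tag counts. The function name and the choice to accept lists of counts (so that the sums in the formula are explicit) are assumptions; the numbers in the usage note are invented and are not taken from Table 36.

```python
def fetal_fraction_percent(fetal_tags, maternal_tags):
    """Percent fetal fraction for one informative SNP.

    `fetal_tags` and `maternal_tags` are lists of tag counts for the
    fetal (minor) and maternal (major) allele; a single sequencing run
    would give one-element lists.
    """
    fetal_total = sum(fetal_tags)
    maternal_total = sum(maternal_tags)
    if maternal_total <= 0:
        raise ValueError("maternal allele tag count must be positive")
    # (sum of fetal tags / sum of maternal tags) x 100
    return 100.0 * fetal_total / maternal_total
```

For example, with 30 tags on the fetal allele and 1000 tags on the maternal allele, the function evaluates 100 × 30 / 1000.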
Table 36 also provides the number of sequence tags mapped to the human reference genome. Tags mapped to the human reference genome were used to determine the presence or absence of aneuploidy, using the same plasma sample that was used to determine the corresponding fetal fraction. Methods for determining aneuploidy using sequence tag counts are described in U.S. Provisional Applications 61/407,017 and 61/455,849, which are incorporated herein by reference in their entirety.
TABLE 36 determination of fetal fraction by massively parallel sequencing of an enriched library of polymorphic nucleic acids
Example 27
Fetal fraction was determined by massively parallel sequencing:
enrichment of fetal and maternal nucleic acids for polymorphic nucleic acids in purified cfDNA samples.
To enrich fetal and maternal cfDNA contained in a purified sample of cfDNA extracted from a maternal plasma sample, a portion of the purified cfDNA was used to amplify polymorphic target nucleic acid sequences, each comprising one SNP selected from the SNP panel given in table 33.
The method corresponds to the workflow diagrammed in FIG. 9. Cell-free plasma was obtained from a maternal blood sample, and cfDNA was purified from the plasma sample as described in Example 22. The final concentration was determined to be 92.8 pg/µl. An aliquot of the purified cfDNA was amplified in a 50 µl reaction volume containing 7.5 µl of a 1 µM primer mix (Table 1), 10 µl of NEB 5X Master Mix, and 27 µl of water. Thermal cycling was performed with the GeneAmp 9700 (Applied Biosystems) using the following cycling conditions: incubation at 95 °C for 1 minute, followed by 30 cycles of 95 °C for 20 seconds, 68 °C for 1 minute, and 68 °C for 30 seconds, followed by a final incubation at 68 °C for 5 minutes. A final hold at 4 °C was applied until the samples were removed for combining with the unamplified portion of the purified cfDNA sample. The amplified product was purified using the Agencourt AMPure XP PCR purification system (Part No. A63881; Beckman Coulter Genomics, Danvers, Mass.), and the concentration was quantified using the Nanodrop 2000 (Thermo Scientific, Wilmington, Del.). The purified amplification product was diluted 1 [...]. The enriched fetal and maternal cfDNA present in the purified cfDNA sample was used to prepare a sequencing library, which was sequenced as described in Example 22.
Table 37 provides the tag counts obtained for each of chromosomes 21, 18, 13, X, and Y, i.e., the sequence tag densities, and the tag counts obtained for informative polymorphic sequences contained in the SNP reference genome, i.e., the SNP tag densities. The data indicate that sequencing information can be obtained by sequencing a single library constructed from purified maternal cfDNA samples that have been enriched for sequences containing SNPs to determine the presence or absence of aneuploidy and fetal fraction simultaneously. The number of tags mapped to chromosomes was used to determine the presence or absence of aneuploidy as described in U.S. provisional applications 61/407,017 and 61/455,849. In the example given, the data indicate that the fraction of fetal DNA in plasma sample AFR105 can be quantified from five informative SNP sequencing results and determined to be 3.84%. Sequence tag densities are provided for chromosomes 21, 13, 18, X and Y.
This example shows that the enrichment protocol provides the tag counts required to determine aneuploidy and fetal fraction from a single sequencing process.
Table 37
Fetal fraction was determined by massively parallel sequencing:
enrichment of fetal and maternal nucleic acids for polymorphic nucleic acids in purified cfDNA samples
Example 28
Fetal fraction determination by capillary electrophoresis of polymorphic sequences comprising STR
To determine the fetal fraction in maternal samples comprising fetal and maternal cfDNA, peripheral blood samples were collected from volunteer pregnant subjects carrying either male or female fetuses. The peripheral blood samples were obtained and processed to provide purified cfDNA as described in Example 22.
Ten microliters of each cfDNA sample were analyzed using the MiniFiler™ PCR Amplification Kit (Applied Biosystems, Foster City, CA) according to the manufacturer's instructions. Briefly, cfDNA contained in 10 µl was amplified in a 25 µl reaction volume containing 5 µl of fluorescently labeled primers (MiniFiler™ Primer Set) and the MiniFiler™ Master Mix, which includes AmpliTaq DNA polymerase and associated buffer, salt (1.5 mM MgCl2), and 200 µM deoxynucleotide triphosphates (dNTPs: dATP, dCTP, dGTP, and dTTP). The fluorescently labeled primers are forward primers labeled with 6FAM™, VIC™, NED™, and PET™ dyes. Thermal cycling was performed with the GeneAmp 9700 (Applied Biosystems) using the following cycling conditions: incubation at 95 °C for 10 minutes, followed by 30 cycles of 94 °C for 20 seconds, 59 °C for 2 minutes, and 72 °C for 1 minute, followed by a final incubation at 60 °C for 45 minutes. A final hold at 4 °C was applied until the samples were removed for analysis. The amplified product was prepared by diluting 1 µl of amplified product in 8.7 µl of Hi-Di™ formamide (Applied Biosystems) and 0.3 µl of GeneScan-500 LIZ internal size standard (Applied Biosystems), and analyzed with an ABI PRISM 3130xl Genetic Analyzer (Applied Biosystems) using Data Collection HID_G5_POP4 (Applied Biosystems) and a 36 cm capillary array. All genotyping was performed with GeneMapper_ID v3.2 software (Applied Biosystems) using the allelic ladders, bins, and panels provided by the manufacturer.
All genotyping measurements were performed on the Applied Biosystems 3130xl Genetic Analyzer, using a ±0.5-nt "window" around the size obtained for each allele to allow for detection and correct assignment of alleles. Any sample allele whose size fell outside the ±0.5-nt window was determined to be OL, i.e., "Off Ladder". OL alleles are alleles of a size that is not represented in the MiniFiler™ allelic ladder, or alleles that do not correspond to the allelic ladder but whose size is just outside a window because of measurement error. The minimum peak height threshold of >50 RFU was set on the basis of validation experiments performed to avoid typing when stochastic effects are likely to interfere with accurate interpretation of mixtures. The calculation of fetal fraction is based on averaging all informative markers. Informative markers are identified by the presence of peaks on the electropherogram that fall within the parameters of the preset bins for the STRs being analyzed.
The average peak height of the primary and secondary alleles at each STR locus determined from triplicate injections was used to calculate the fetal fraction. The rules applicable to this calculation are:
1. off-ladder (OL) allele data are not included in the calculation; and
2. only peak heights of >50 RFU (relative fluorescence units) are included in the calculation; and
3. if only one bin is present, the marker is deemed non-informative; and
4. if a second bin is called, but the peaks of the first and second bins are within 50% to 70% of each other in peak height (RFU), the minority fraction is not measured and the marker is deemed non-informative.
The fraction of the minor allele for any given informative marker is calculated by dividing the peak height of the minor component by the sum of the peak heights of the major component(s), expressed as a percentage. It is first calculated for each informative locus as
$$\text{fetal fraction} = \frac{\sum \text{peak height of minor allele}}{\sum \text{peak height of major allele(s)}} \times 100,$$
and the fetal fraction of a sample comprising two or more informative STRs is then calculated as the average of the fetal fractions calculated for the two or more informative markers.
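The four rules and the per-locus formula above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the GeneMapper workflow: the function names are hypothetical, each peak is represented as an (allele label, mean peak height in RFU) pair with the label "OL" marking off-ladder calls, and rule 4 is read as "any peak at or above 50% of the tallest peak is treated as a major (maternal) peak", which is only one possible reading of the 50% to 70% wording.

```python
from statistics import mean


def locus_fetal_fraction(peaks, min_rfu=50.0, major_cutoff=0.5):
    """Percent fetal fraction for one STR locus, or None if non-informative.

    `peaks` is a list of (allele_label, mean_peak_height_rfu) pairs.
    `major_cutoff` is a hypothetical parameter encoding the assumed
    reading of rule 4.
    """
    # Rules 1 and 2: drop off-ladder alleles and peaks not above 50 RFU.
    usable = [h for label, h in peaks if label != "OL" and h > min_rfu]
    # Rule 3: a single remaining bin makes the marker non-informative.
    if len(usable) < 2:
        return None
    tallest = max(usable)
    major = [h for h in usable if h >= major_cutoff * tallest]
    minor = [h for h in usable if h < major_cutoff * tallest]
    # Rule 4 (assumed reading): no clearly minor peak, so not informative.
    if not minor:
        return None
    return 100.0 * sum(minor) / sum(major)


def sample_fetal_fraction(loci):
    """Average the per-locus fetal fractions over all informative loci."""
    values = [locus_fetal_fraction(p) for p in loci.values()]
    informative = [v for v in values if v is not None]
    return mean(informative) if informative else None
```

Passing a dictionary that maps locus names to peak lists returns the sample-level average, or `None` when no locus is informative.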
Table 38 provides data obtained from analysis of cfDNA of subjects carrying male fetuses.
Table 38
Fetal fraction determined in cfDNA of pregnant subjects by analysis of STR
The results indicate that cfDNA can be used to determine the presence or absence of fetal DNA, as indicated by detection of minor components on one or more STR alleles, to determine percent fetal fraction, and to determine fetal gender as indicated by the presence or absence of Amelogenin alleles.

Claims (55)

1. A medical analysis device for determining the fetal fraction in a maternal test sample comprising a mixture of fetal and maternal nucleic acids, the device comprising:
(a) A means for receiving a plurality of sequence reads of said fetal and maternal nucleic acids from said maternal test sample;
(b) A means for aligning the plurality of sequence reads to one or more chromosomal reference sequences and thereby providing a plurality of sequence tags corresponding to the sequence reads;
(c) A means for identifying a number of those sequence tags from one or more chromosomes or chromosome segments of interest selected from chromosomes 1-22, X and Y and segments thereof, and for identifying, for each of the one or more chromosomes or chromosome segments of interest, a number of those sequence tags from at least one normalizing chromosome sequence or normalizing chromosome segment sequence to determine a chromosome dose or chromosome segment dose,
Wherein the chromosome or chromosome segment of interest has a copy number variation, wherein the copy number variation is determined by comparing the chromosome dose of each chromosome or chromosome segment of interest to a respective threshold value for each chromosome or chromosome segment of the one or more chromosomes or chromosome segments of interest;
(d) A means for determining the fetal fraction using the dose of the chromosome of interest or the dose of the chromosome segment of interest; and
(e) A means for calculating a normalized chromosome value or a normalized segment value, wherein calculating the normalized chromosome value relates the chromosome dose to the mean of the corresponding chromosome doses in a set of qualifying samples as:
$$NCV_{iA} = \frac{R_{iA} - \bar{R}_{iU}}{\sigma_{iU}}$$
wherein $NCV_{iA}$ is the normalized chromosome value for the i-th chromosome in the test sample, $\bar{R}_{iU}$ and $\sigma_{iU}$ are, respectively, the estimated mean and standard deviation of the i-th chromosome dose in the set of qualifying samples, and $R_{iA}$ is the chromosome dose calculated for the i-th chromosome in the test sample, wherein the i-th chromosome is the chromosome of interest;
wherein calculating the normalized segment value relates the chromosome segment dose to the mean of the corresponding chromosome segment doses in a set of qualifying samples as:
$$NSV_{iA} = \frac{R_{iA} - \bar{R}_{iU}}{\sigma_{iU}}$$
wherein $NSV_{iA}$ is the normalized chromosome segment value for the i-th chromosome segment in the test sample, $\bar{R}_{iU}$ and $\sigma_{iU}$ are, respectively, the estimated mean and standard deviation of the i-th chromosome segment dose in the set of qualifying samples, and $R_{iA}$ is the chromosome segment dose calculated for the i-th chromosome segment in the test sample, wherein the i-th chromosome segment is the chromosome segment of interest;
wherein the means (d) determines the fetal fraction according to the expression:
$$ff = 2 \times \left| NCV_{iA}\, CV_{iU} \right|$$
wherein $ff$ is the fetal fraction value, $NCV_{iA}$ is the normalized chromosome value for the i-th chromosome in the test sample, and $CV_{iU}$ is the coefficient of variation of the dose for the i-th chromosome determined in the qualifying samples, wherein the i-th chromosome is the chromosome of interest; or
wherein the means (d) determines the fetal fraction according to the expression:
$$ff = 2 \times \left| NSV_{iA}\, CV_{iU} \right|$$
wherein $ff$ is the fetal fraction value, $NSV_{iA}$ is the normalized chromosome segment value for the i-th chromosome segment in the test sample, and $CV_{iU}$ is the coefficient of variation of the dose for the i-th chromosome segment determined in the qualifying samples, wherein the i-th chromosome segment is the chromosome segment of interest.
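The calculation recited in claim 1 can be illustrated with a short Python sketch. It takes the normalized chromosome value to be the z-score of the test-sample dose against the doses of the qualifying samples, which is how the variable definitions in the claim read, and takes the coefficient of variation to be the standard deviation divided by the mean. The function names and the use of the sample (n-1) standard deviation are assumptions.

```python
from statistics import mean, stdev


def normalized_chromosome_value(test_dose, qualifying_doses):
    """NCV: (test dose - mean of qualifying doses) / their standard deviation."""
    mu = mean(qualifying_doses)
    sigma = stdev(qualifying_doses)  # assumption: sample standard deviation
    return (test_dose - mu) / sigma


def fetal_fraction_from_dose(test_dose, qualifying_doses):
    """ff = 2 x |NCV x CV|, returned as a fraction between 0 and 1."""
    mu = mean(qualifying_doses)
    sigma = stdev(qualifying_doses)
    cv = sigma / mu  # coefficient of variation of the qualifying doses
    ncv = normalized_chromosome_value(test_dose, qualifying_doses)
    return 2.0 * abs(ncv * cv)
```

Because the standard deviation cancels in the product of NCV and CV, the result equals twice the relative deviation of the test dose from the qualifying-sample mean. The same functions apply unchanged to segment doses and the normalized segment value.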
2. The apparatus of claim 1, wherein the chromosome or segment dose determined by means (c) is calculated as a ratio of the number of sequence tags identified for the chromosome or segment of interest to the number of sequence tags identified for at least one corresponding normalized chromosome sequence or normalized chromosome segment sequence of the chromosome or segment of interest; or wherein said chromosome dose or segment dose determined by means (c) is calculated as the ratio of the sequence tag density ratio of said chromosome or segment of interest to the sequence tag density ratio of the sequence of the normalizing chromosome sequence or normalizing chromosome segment sequence.
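The first alternative in claim 2, a dose computed as a ratio of tag counts, might look like the following in Python. The function name, the dictionary keyed by chromosome name, and the choice to sum the counts when several normalizing chromosomes are used are assumptions; the tag-density alternative in the claim is not shown.

```python
def chromosome_dose(tag_counts, chromosome_of_interest, normalizing_chromosomes):
    """Dose = tags on the chromosome of interest / tags on the normalizing sequence(s).

    `tag_counts` maps a chromosome (or segment) name to its sequence tag count.
    """
    denominator = sum(tag_counts[name] for name in normalizing_chromosomes)
    if denominator == 0:
        raise ValueError("normalizing sequences have no mapped tags")
    return tag_counts[chromosome_of_interest] / denominator
```

A call such as `chromosome_dose(counts, "chr21", ["chr9", "chr11"])` would give the dose for chromosome 21 against a two-chromosome normalizing group; the chromosome names here are placeholders, not the normalizing chromosomes taught by the patent.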
3. The apparatus of claim 1, wherein the chromosome of interest is an autosome or a male fetal X chromosome and the chromosome segment of interest is selected from an autosome or a male fetal X chromosome.
4. The apparatus of claim 1, wherein at least one sequence of normalized chromosome sequences or normalized chromosome segment sequences is a chromosome or segment selected for an associated chromosome or segment of interest by: (i) Identifying a plurality of qualifying samples for the chromosome or segment of interest; (ii) Repeatedly calculating chromosome doses or chromosome segment doses for the selected chromosome or segment using a plurality of potential normalized chromosome sequences or normalized chromosome segment sequences; and (iii) selecting the sequence of normalized chromosomes or the sequence of normalized chromosome segments, individually or in a combination, to give minimal variability and/or maximal resolvability in the calculated chromosome dose or chromosome segment dose.
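The selection procedure in claim 4 can be sketched as an exhaustive search in Python. Only the "minimal variability" criterion is implemented, measured here as the coefficient of variation of the dose across the qualifying samples; the "maximal resolvability" criterion is not shown. The function names, the cap on group size, and the data layout (one tag-count dictionary per qualifying sample) are assumptions.

```python
from itertools import combinations
from statistics import mean, stdev


def dose_cv(samples, target, normalizers):
    """Coefficient of variation of the target dose across qualifying samples."""
    doses = [s[target] / sum(s[name] for name in normalizers) for s in samples]
    return stdev(doses) / mean(doses)


def select_normalizing_sequences(samples, target, candidates, max_group_size=2):
    """Return (best_group, best_cv) over single candidates and small groups."""
    best_group, best_cv = None, None
    for size in range(1, max_group_size + 1):
        for group in combinations(candidates, size):
            cv = dose_cv(samples, target, group)
            # Keep the first group that achieves the lowest variability.
            if best_cv is None or cv < best_cv:
                best_group, best_cv = group, cv
    return best_group, best_cv
```

The search evaluates each candidate sequence on its own and in combination, as the claim describes, and keeps whichever gives the least variable dose for the chromosome or segment of interest.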
5. The apparatus of claim 1, wherein the normalizing chromosome sequence is a single chromosome or a set of chromosomes selected from any one or more of chromosomes 1-22, X, and Y.
6. The device of claim 1, wherein the normalizing segment sequence is a single segment or a set of segments from any one or more of chromosomes 1-22, X, and Y.
7. The apparatus of claim 1, wherein said copy number variation is selected from the group consisting of: complete chromosomal duplication, complete chromosomal deletion, partial duplication, partial multiplication, partial insertion, and partial deletion.
8. The apparatus of claim 1, further comprising a means for comparing said fetal fraction determined using chromosome dosage or chromosome segment dosage to a fetal fraction determined using information from one or more polymorphisms in fetal and maternal nucleic acid from a maternal test sample that exhibit allelic imbalance in chromosomes other than said chromosome of interest.
9. The apparatus of claim 1, wherein the aligning in device (b) comprises aligning at least one million reads.
10. The apparatus of claim 1, further comprising a sequencer configured to sequence fetal and maternal nucleic acids in said maternal test sample to obtain the sequence reads.
11. The apparatus of claim 10, wherein the sequencing comprises sequencing cell-free DNA from the maternal test sample to provide the sequence reads.
12. The apparatus of claim 10, wherein said sequencing comprises massively parallel sequencing the maternal and fetal nucleic acids from the maternal test sample to generate the sequence reads.
13. The apparatus of claim 12, wherein the massively parallel sequencing is sequencing-by-synthesis.
14. The apparatus of claim 13, wherein the sequencing-by-synthesis uses a reversible dye terminator.
15. The apparatus of claim 12, wherein the massively parallel sequencing is ligation sequencing.
16. The apparatus of claim 12, wherein the massively parallel sequencing is single molecule sequencing.
17. The apparatus of claim 1, further comprising means for obtaining said maternal test sample from a pregnant organism.
18. The apparatus of claim 1, wherein the maternal sample is a blood, plasma, serum, or urine sample.
19. A medical analysis device for classifying copy number variations in a fetal genome, the device comprising:
(1) A means for receiving a plurality of sequence reads from fetal and maternal nucleic acids in a maternal test sample;
(2) A means for aligning the sequence reads to one or more chromosomal reference sequences and thereby providing a plurality of sequence tags corresponding to the sequence reads;
(3) Means for identifying a number of those sequence tags from one or more chromosomes of interest and determining that a first chromosome of interest in the fetus carries a copy number variation;
(4) Means for calculating a first fetal fraction value by a first method that does not use information from the tags of the first chromosome of interest;
(5) A means for calculating a second fetal fraction value by a second method that uses information from the tags of the first chromosome of interest, wherein said means comprises a component that calculates a normalized chromosome value relating the chromosome dose for the first chromosome of interest to the mean of the corresponding chromosome doses in a set of qualifying samples as:
$$NCV_{iA} = \frac{R_{iA} - \bar{R}_{iU}}{\sigma_{iU}}$$
wherein $NCV_{iA}$ is the normalized chromosome value for the i-th chromosome in the test sample, $\bar{R}_{iU}$ and $\sigma_{iU}$ are, respectively, the estimated mean and standard deviation of the i-th chromosome dose in the set of qualifying samples, and $R_{iA}$ is the chromosome dose calculated for the i-th chromosome in the test sample, wherein the i-th chromosome is the first chromosome of interest, and
(6) Means for comparing the first fetal fraction value to the second fetal fraction value and using the comparison to classify the copy number variation of the first chromosome of interest;
wherein the component of the first method that calculates the first fetal fraction value in the means (4) evaluates the expression:
$$ff = 2 \times \left| NCV_{iA}\, CV_{iU} \right|$$
wherein $ff$ is the fetal fraction value, $NCV_{iA}$ is the normalized chromosome value for the i-th chromosome in the test sample, and $CV_{iU}$ is the coefficient of variation of the dose for the i-th chromosome determined in the qualifying samples, wherein the i-th chromosome is the chromosome of interest; or
wherein the component of the first method that calculates the first fetal fraction value in the means (4) evaluates the expression:
$$ff = 2 \times \left| NSV_{iA}\, CV_{iU} \right|$$
wherein $ff$ is the fetal fraction value, $NSV_{iA}$ is the normalized chromosome segment value for the i-th chromosome segment in the test sample, and $CV_{iU}$ is the coefficient of variation of the dose for the i-th chromosome segment determined in the qualifying samples, wherein the i-th chromosome segment is the chromosome segment of interest;
wherein the component of the second method that calculates the second fetal fraction value in the means (5) evaluates the expression:
$$ff = 2 \times \left| NCV_{iA}\, CV_{iU} \right|$$
wherein $ff$ is the second fetal fraction value, $NCV_{iA}$ is the normalized chromosome value for the i-th chromosome in the test sample, and $CV_{iU}$ is the coefficient of variation of the dose for the i-th chromosome determined in the qualifying samples, wherein the i-th chromosome is the first chromosome of interest.
20. The apparatus of claim 19, wherein the means (4) of the first method comprises a component for calculating the first fetal fraction value using information from one or more polymorphisms exhibiting allelic imbalance in fetal and maternal nucleic acid of the maternal test sample, said polymorphisms being present in chromosomes other than said first chromosome of interest; and
wherein the apparatus (5) of the second method comprises:
(a) A component for calculating the number of sequence tags from the first chromosome of interest and at least one normalizing chromosome sequence to determine a chromosome dose; and
(b) A component for calculating the second fetal fraction value from the chromosome dose by the second method.
21. The apparatus according to claim 20, wherein the information used by the device (4) of the first method comprises sequence tags obtained by sequencing predetermined polymorphic sequences, each of which comprises the one or more polymorphic sites.
22. The apparatus of claim 21, wherein the information used by the means (4) of the first method is obtained by a non-sequencing method.
23. The apparatus of claim 22, wherein the method is qPCR, digital PCR, mass spectrometry, or capillary gel electrophoresis.
24. The apparatus of claim 19, wherein the means (4) of the first method comprises:
(a) A component for calculating the number of sequence tags from chromosomes other than said first chromosome of interest and at least one normalizing chromosome sequence to determine a chromosome dose; and
(b) A component for calculating the first fetal fraction value from the chromosome dose by the first method; and
wherein the apparatus (5) of the second method comprises:
(a) A component for calculating the number of sequence tags from the first chromosome of interest and at least one normalizing chromosome sequence to determine a chromosome dose; and
(b) A component for calculating the second fetal fraction value from the chromosome dose by the second method.
25. The apparatus of claim 24, wherein the means (4) of the first method and the means (5) of the second method each further comprise a component for calculating a normalized chromosome value and a component using the normalized chromosome value, wherein calculating the normalized chromosome value relates the calculated chromosome dose to the mean of the corresponding chromosome doses in a set of qualifying samples as:
$$NCV_{iA} = \frac{R_{iA} - \bar{R}_{iU}}{\sigma_{iU}}$$
wherein $NCV_{iA}$ is the normalized chromosome value for the i-th chromosome in the test sample, $\bar{R}_{iU}$ and $\sigma_{iU}$ are, respectively, the estimated mean and standard deviation of the i-th chromosome dose in the set of qualifying samples, and $R_{iA}$ is the chromosome dose calculated for the i-th chromosome in the test sample,
wherein
For the apparatus (4) of the first method, the ith chromosome is the chromosome other than the first chromosome of interest;
for the apparatus (5) of the second method, the i-th chromosome is the first chromosome of interest.
26. The apparatus of claim 25, wherein the components of the means (4) of the first method and the means (5) of the second method for calculating the fetal fraction value evaluate the expression:
$$ff = 2 \times \left|\mathrm{NCV}_{iA} \times \mathrm{CV}_{iU}\right|$$
wherein $ff$ is the fetal fraction value, $\mathrm{NCV}_{iA}$ is the normalized chromosome value for the i-th chromosome in the test sample, and $\mathrm{CV}_{iU}$ is the coefficient of variation of the dose for the i-th chromosome in the qualified samples;
wherein
for the means (4) of the first method, the i-th chromosome is the chromosome other than the first chromosome of interest;
for the means (5) of the second method, the i-th chromosome is the first chromosome of interest.
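The expression of claim 26 can be sketched in Python as follows. The sketch assumes that the coefficient of variation is the standard deviation of the qualified doses divided by their mean, in which case the product of the normalized chromosome value and the coefficient of variation is the relative deviation of the test dose from the qualified mean. The function names and the numbers are hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence


def coefficient_of_variation(qualified_doses: Sequence[float]) -> float:
    """Assumed definition: sample standard deviation divided by the mean."""
    return stdev(qualified_doses) / mean(qualified_doses)


def fetal_fraction_from_ncv(ncv: float, cv: float) -> float:
    """Evaluate ff = 2 * |NCV * CV|."""
    return 2.0 * abs(ncv * cv)


# Hypothetical values, for illustration only.
cv = coefficient_of_variation([0.098, 0.100, 0.102])
ff = fetal_fraction_from_ncv(3.0, cv)
```

The absolute value means that a gain and a loss of the same relative size give the same fetal fraction estimate.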
27. The apparatus of claim 26, wherein the chromosome other than the first chromosome of interest is the X chromosome when the fetus is male.
28. The apparatus of claim 20 or 24, wherein the means (6) for comparing the first fetal fraction value with the second fetal fraction value comprises a component for determining whether the two fetal fraction values are approximately equal.
29. The apparatus of claim 28, wherein the means (6) further comprises a component for determining that a ploidy assumption implied in the second method is true when the two fetal fraction values are approximately equal.
30. The apparatus of claim 29, wherein the ploidy assumption implied in the second method is: the first chromosome of interest has a complete chromosomal aneuploidy.
31. The apparatus of claim 30, wherein the complete chromosomal aneuploidy of the first chromosome of interest is a monosomy or a trisomy.
32. The apparatus of claim 31, further comprising a means for analyzing tag information of the first chromosome of interest to determine whether (i) the first chromosome of interest carries a partial aneuploidy, or (ii) the fetus is a chimera, wherein the means for analyzing tag information of the first chromosome of interest is configured to be performed when the means for comparing the first fetal fraction value to the second fetal fraction value indicates that the two fetal fraction values are not approximately equal.
33. The apparatus of claim 19, wherein the means (4) of said first method comprises a component for calculating the first fetal fraction value using information from one or more polymorphisms exhibiting allelic imbalance in fetal and maternal nucleic acid of the maternal test sample, said polymorphisms being present in chromosomes other than said first chromosome of interest; and
the means (5) of said second method comprises a component for calculating the second fetal fraction value using information from one or more polymorphisms exhibiting allelic imbalance in fetal and maternal nucleic acid of the maternal test sample, said polymorphisms being present in said first chromosome of interest.
34. The apparatus of claim 33, wherein the means (6) for comparing comprises:
a component for determining that said first chromosome of interest is diploid when the ratio of said second fetal fraction value to said first fetal fraction value is approximately 1;
a component for determining that said first chromosome of interest is triploid when the ratio of said second fetal fraction value to said first fetal fraction value is approximately 1.5; and
a component for determining that said first chromosome of interest is haploid when the ratio of said second fetal fraction value to said first fetal fraction value is approximately 0.5.
35. The apparatus of claim 34, further comprising a means for analyzing tag information of said first chromosome of interest to determine whether (i) the first chromosome of interest carries a partial aneuploidy or (ii) the fetus is a chimera, wherein the means for analyzing tag information of the first chromosome of interest is configured to operate when said means (6) for comparing the first fetal fraction value to the second fetal fraction value indicates that the ratio of the second fetal fraction value to the first fetal fraction value is not approximately 1, 1.5, or 0.5.
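The comparison recited in claims 34 and 35 can be sketched as a lookup on the ratio of the two fetal fraction values. The claims say only "approximately", so the tolerance below is a hypothetical parameter chosen for illustration, as are the function name and the return labels.

```python
from typing import Optional


def classify_by_ratio(ff_first: float,
                      ff_second: float,
                      tolerance: float = 0.1) -> Optional[str]:
    """Map the ratio ff_second / ff_first to a ploidy label.

    Returns None when the ratio is not close to 1, 1.5, or 0.5, which is
    the case in which further analysis of tag information is triggered.
    `tolerance` is an assumed stand-in for "approximately".
    """
    if ff_first <= 0:
        raise ValueError("first fetal fraction value must be positive")
    ratio = ff_second / ff_first
    for target, label in ((1.0, "diploid"), (1.5, "triploid"), (0.5, "haploid")):
        if abs(ratio - target) <= tolerance:
            return label
    return None


# Hypothetical fetal fraction values, for illustration only.
call = classify_by_ratio(0.10, 0.15)
```

A tolerance below 0.25 keeps the three target windows from overlapping.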
36. The apparatus of claim 32 or 35, wherein the means for analyzing tag information of the first chromosome of interest comprises:
(a) A component for binning the sequence of the first chromosome of interest into a plurality of portions;
(b) A component for determining whether any of the portions contain significantly more or significantly less nucleic acid than one or more other portions; and
(c) A component for determining that the first chromosome of interest carries a partial aneuploidy if any of said portions contain significantly more or significantly less nucleic acid than one or more other portions; or determining that the fetus is a chimera if none of the portions contain significantly more or significantly less nucleic acid as compared to one or more other portions.
37. The apparatus of claim 36, wherein the component (c) further determines that a portion of the first chromosome of interest comprising significantly more or significantly less nucleic acid than one or more other portions carries the partial aneuploidy.
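The decision logic of claims 36 and 37 can be sketched as an outlier test over per-portion tag counts. The claims do not specify how "significantly more or significantly less" is assessed, so the leave-one-out z-score and its threshold below are assumptions made for illustration, as are all names and counts.

```python
from statistics import mean, stdev
from typing import List, Sequence, Tuple


def outlier_portions(portion_counts: Sequence[int],
                     z_threshold: float = 3.0) -> List[int]:
    """Indices of portions whose count deviates strongly from the others.

    Assumption: each portion is compared with the mean and sample standard
    deviation of all other portions (a leave-one-out z-score).
    """
    if len(portion_counts) < 3:
        raise ValueError("need at least three portions")
    flagged = []
    for i, count in enumerate(portion_counts):
        others = [c for j, c in enumerate(portion_counts) if j != i]
        sigma = stdev(others)
        if sigma == 0:
            # All other portions are identical: any difference is flagged.
            if count != others[0]:
                flagged.append(i)
        elif abs(count - mean(others)) / sigma > z_threshold:
            flagged.append(i)
    return flagged


def interpret(portion_counts: Sequence[int]) -> Tuple[str, List[int]]:
    """Partial aneuploidy if any portion is an outlier, otherwise chimera."""
    flagged = outlier_portions(portion_counts)
    if flagged:
        return "partial aneuploidy", flagged
    return "chimera", []


# Hypothetical per-portion tag counts, for illustration only.
result = interpret([1000, 1010, 990, 1005, 1300, 995])
```

The flagged indices correspond to the portions that, under claim 37, would be reported as carrying the partial aneuploidy.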
38. The apparatus of claim 19, wherein the first chromosome of interest is selected from the group consisting of chromosomes 1-22, X, and Y.
39. The apparatus of claim 19, wherein the means (6) comprises a component for classifying the copy number variation into a category selected from the group consisting of: complete chromosome insertions, complete chromosome deletions, partial chromosome duplications, partial chromosome deletions, and chimeras.
40. The apparatus of claim 19, further comprising:
(i) A means for determining whether the copy number variation is caused by a partial aneuploidy or a chimera; and
(ii) A means for determining the locus of a partial aneuploidy on the first chromosome of interest if the copy number variation is caused by the partial aneuploidy,
wherein the means in (i) and (ii) are configured to operate when the means for comparing the first fetal fraction value to the second fetal fraction value determines that the first fetal fraction value and the second fetal fraction value are not approximately equal.
41. The apparatus of claim 40, wherein the means for determining the locus of the partial aneuploidy on the first chromosome of interest comprises a component for categorizing the sequence tags of the first chromosome of interest into bins or blocks of nucleic acid in the first chromosome of interest; and a component for counting the number of mapped tags in each bin.
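The binning and counting of claim 41 can be sketched as follows. The sketch assumes that each mapped tag is represented by a 0-based start coordinate on the chromosome of interest and that bins have a fixed width. Both assumptions, and all names, are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def count_tags_per_bin(tag_positions: Iterable[int],
                       bin_size: int) -> Dict[int, int]:
    """Count mapped tags in fixed-width bins along one chromosome.

    Assumption: positions are 0-based, and bin k covers the half-open
    interval [k * bin_size, (k + 1) * bin_size).
    """
    if bin_size <= 0:
        raise ValueError("bin_size must be positive")
    return dict(Counter(position // bin_size for position in tag_positions))


# Hypothetical tag start positions, for illustration only.
bin_counts = count_tags_per_bin([10, 950, 1_000, 1_999, 2_500], bin_size=1_000)
```

Bins with no mapped tags are absent from the returned dictionary, so a caller that needs a dense vector would have to fill in zeros.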
42. The apparatus of claim 1 or 19, wherein the maternal test sample is a blood, plasma, serum, or urine sample.
43. The apparatus of claim 1 or 19, wherein the fetal and maternal nucleic acids are cell-free DNA (cfDNA).
44. The apparatus of claim 1 or 19, further comprising a sequencer configured to sequence the fetal and maternal nucleic acids in a maternal test sample and obtain the sequence reads.
45. The apparatus of claim 44, wherein the sequencer is configured for sequencing-by-synthesis.
46. The apparatus of claim 45, wherein the sequencer is configured to perform sequencing-by-synthesis using reversible dye terminators.
47. The apparatus of claim 44, wherein the sequencer is configured to perform ligation sequencing.
48. The apparatus of claim 44, wherein the sequencer is configured to perform single molecule sequencing.
49. The apparatus of claim 44, wherein the sequencer and means (a)-(d) of the apparatus of claim 1, or means (1)-(6) of the apparatus of claim 19, are located in separate locations and connected by a network.
50. The apparatus of claim 44, further comprising a means for obtaining the maternal test sample from the pregnant mother.
51. The apparatus of claim 50, wherein the means for obtaining the maternal test sample and means (a) - (d) of the apparatus of claim 1 or means (1) - (6) of the apparatus of claim 19 are located in separate locations.
52. The apparatus of claim 50, further comprising a means for extracting cell-free DNA from the maternal test sample.
53. The apparatus of claim 52, wherein the means for extracting cell-free DNA is located in the same location as the sequencer, and wherein the means for obtaining the maternal test sample is located in a remote location.
54. The apparatus of claim 44, wherein the fetal and maternal nucleic acids in the maternal test sample are cell-free DNA.
55. The apparatus of claim 1 or 19, wherein the means (2) for aligning aligns at least 1 million reads.
CN201210441134.8A 2012-04-12 2012-11-07 Detection and classification of copy number variation Active CN103374518B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201810154581.2A CN108485940B (en) 2012-04-12 2012-11-07 Detection and classification of copy number variation
CN201710644858.5A CN107435070A (en) 2012-04-12 2012-11-07 Detection and classification of copy number variation

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
US13/445,778 2012-04-12
US13/445,778 US9447453B2 (en) 2011-04-12 2012-04-12 Resolving genome fractions using polymorphism counts
US13/482,964 2012-05-29
US13/482,964 US20120270739A1 (en) 2010-01-19 2012-05-29 Method for sample analysis of aneuploidies in maternal samples
US13/555,037 2012-07-20
US13/555,037 US9260745B2 (en) 2010-01-19 2012-07-20 Detecting and classifying copy number variation

Related Child Applications (2)

Application Number Title Priority Date Filing Date
CN201710644858.5A Division CN107435070A (en) 2012-04-12 2012-11-07 Detection and classification of copy number variation
CN201810154581.2A Division CN108485940B (en) 2012-04-12 2012-11-07 Detection and classification of copy number variation

Publications (2)

Publication Number Publication Date
CN103374518A CN103374518A (en) 2013-10-30
CN103374518B true CN103374518B (en) 2018-03-27

Family

ID=49460351

Family Applications (4)

Application Number Title Priority Date Filing Date
CN201210441134.8A Active CN103374518B (en) 2012-04-12 2012-11-07 Detection and classification of copy number variation
CN201220583608.8U Expired - Lifetime CN204440396U (en) 2012-04-12 2012-11-07 Kit for determining fetal fraction
CN201810154581.2A Active CN108485940B (en) 2012-04-12 2012-11-07 Detection and classification of copy number variation
CN201710644858.5A Pending CN107435070A (en) 2012-04-12 2012-11-07 Detection and classification of copy number variation

Country Status (1)

Country Link
CN (4) CN103374518B (en)

Families Citing this family (36)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
PL3117012T3 (en) * 2014-03-14 2019-08-30 Caredx, Inc. Methods of monitoring immunosuppressive therapies in a transplant recipient
CN106795558B (en) * 2014-05-30 2020-07-10 维里纳塔健康公司 Detection of fetal sub-chromosomal aneuploidy and copy number variation
CN104152553B (en) * 2014-07-21 2016-11-23 上海交通大学 Kit for assisting in diagnosing whether a fetus under test has Down syndrome
CN107750277B (en) * 2014-12-12 2021-11-09 维里纳塔健康股份有限公司 Determination of copy number variation using cell-free DNA fragment size
US10395759B2 (en) * 2015-05-18 2019-08-27 Regeneron Pharmaceuticals, Inc. Methods and systems for copy number variant detection
WO2016201507A1 (en) * 2015-06-15 2016-12-22 Murdoch Childrens Research Institute Method of measuring chimerism
WO2017007903A1 (en) * 2015-07-07 2017-01-12 Farsight Genome Systems, Inc. Methods and systems for sequencing-based variant detection
EP3347466B9 (en) 2015-09-08 2024-06-26 Cold Spring Harbor Laboratory Genetic copy number determination using high throughput multiplex sequencing of smashed nucleotides
WO2017044885A1 (en) * 2015-09-09 2017-03-16 uBiome, Inc. Method and system for microbiome-derived diagnostics and therapeutics for conditions associated with cerebro-craniofacial health
WO2017050244A1 (en) * 2015-09-22 2017-03-30 The Chinese University Of Hong Kong Accurate quantification of fetal dna fraction by shallow-depth sequencing of maternal plasma dna
WO2017106768A1 (en) * 2015-12-17 2017-06-22 Guardant Health, Inc. Methods to determine tumor gene copy number by analysis of cell-free dna
US10095831B2 (en) 2016-02-03 2018-10-09 Verinata Health, Inc. Using cell-free DNA fragment size to determine copy number variations
BR112019000296A2 (en) * 2016-07-06 2019-04-16 Guardant Health, Inc. methods for cell free nucleic acid fragmentome profiling
RU2674700C2 (en) * 2016-12-30 2018-12-12 Общество с ограниченной ответственностью "Научно-производственная фирма ДНК-Технология" (ООО "НПФ ДНК-Технология") Method of determining the source of aneuploid cells from the blood of a pregnant woman
WO2018236911A1 (en) * 2017-06-20 2018-12-27 Illumina, Inc. Methods and systems for decomposition and quantification of dna mixtures from multiple contributors of known or unknown genotypes
CN110770839A (en) 2017-06-20 2020-02-07 伊鲁米那股份有限公司 Method for the accurate computational decomposition of DNA mixtures from contributors of unknown genotype
CN108427864B (en) * 2018-02-14 2019-01-29 南京世和基因生物技术有限公司 Method, device and computer-readable medium for detecting copy number variation
CN110656159B (en) * 2018-06-28 2024-01-09 深圳华大生命科学研究院 Copy number variation detection method
WO2020023509A1 (en) * 2018-07-24 2020-01-30 Affymetrix, Inc. Array based method and kit for determining copy number and genotype in pseudogenes
CN110880356A (en) * 2018-09-05 2020-03-13 南京格致基因生物科技有限公司 Method and apparatus for screening, diagnosing or risk stratification for ovarian cancer
CN109628579B (en) * 2019-01-13 2022-11-15 清华大学 Detection method for determining whether chromosome number in biological sample is abnormal
US20210366569A1 (en) * 2019-06-03 2021-11-25 Illumina, Inc. Limit of detection based quality control metric
CN110373477B (en) * 2019-07-23 2021-05-07 华中农业大学 Molecular marker cloned from CNV fragment and related to porcine ear shape character
CN110317877A (en) * 2019-08-02 2019-10-11 苏州宏元生物科技有限公司 Application of a group of chromosomal instability variations in the preparation of a reagent or kit for diagnosing bladder transitional cell carcinoma and assessing prognosis
CN110452985A (en) * 2019-08-02 2019-11-15 苏州宏元生物科技有限公司 Application of a group of chromosomal instability variations in the preparation of a reagent or kit for diagnosing liver cancer and assessing prognosis
CN112342627A (en) * 2019-08-09 2021-02-09 深圳市真迈生物科技有限公司 Preparation method and sequencing method of nucleic acid library
CN111105844B (en) * 2019-11-22 2023-06-06 广州金域医学检验集团股份有限公司 Somatic cell mutation classification method, apparatus, device, and readable storage medium
CN111394474B (en) * 2020-03-24 2022-08-16 西北农林科技大学 Method for detecting copy number variation of GAL3ST1 gene of cattle and application thereof
CN111476497B (en) * 2020-04-15 2023-06-16 浙江天泓波控电子科技有限公司 Distribution feed network method for miniaturized platform
CN111948394B (en) * 2020-08-10 2023-07-28 山西医科大学 Application of TSTA3 and LAMP2 as targets in esophageal squamous carcinoma cell metastasis detection and drug screening
CN112322722B (en) * 2020-11-13 2021-11-12 上海宝藤生物医药科技股份有限公司 Primer probe composition and kit for detecting 16p11.2 microdeletion and application thereof
CN112614548B (en) * 2020-12-25 2021-08-03 北京吉因加医学检验实验室有限公司 Method for calculating sample database building input amount and database building method thereof
CN113462768B (en) * 2021-07-29 2023-05-30 中国医学科学院整形外科医院 Primer and kit for detecting copy number of ECR region of small ear deformity patient by ddPCR
CN113684277B (en) * 2021-09-06 2022-05-17 南方医科大学南方医院 Method for predicting ovarian cancer homologous recombination defect based on biomarker of genome copy number variation and application
CN114093417B (en) * 2021-11-23 2022-10-04 深圳吉因加信息科技有限公司 Method and device for identifying chromosomal arm heterozygosity loss
CN114507904B (en) * 2022-04-19 2022-07-12 北京迅识科技有限公司 Method for preparing second-generation sequencing library

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011090559A1 (en) * 2010-01-19 2011-07-28 Verinata Health, Inc. Sequencing methods and compositions for prenatal diagnoses

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6828098B2 (en) * 2000-05-20 2004-12-07 The Regents Of The University Of Michigan Method of producing a DNA library using positional amplification based on the use of adaptors and nick translation
EP1294883A2 (en) * 2000-06-30 2003-03-26 Incyte Genomics, Inc. Human extracellular matrix (ecm)-related tumor marker
JP4480715B2 (en) * 2003-01-29 2010-06-16 454 コーポレーション Double-end sequencing
ES2620012T3 (en) * 2008-09-20 2017-06-27 The Board Of Trustees Of The Leland Stanford Junior University Non-invasive diagnosis of fetal aneuploidy by sequencing
JP5882234B2 (en) * 2010-02-25 2016-03-09 アドバンスト リキッド ロジック インコーポレイテッドAdvanced Liquid Logic, Inc. Method for preparing nucleic acid library
CN102409043B (en) * 2010-09-21 2013-12-04 深圳华大基因科技服务有限公司 Method for constructing high-flux and low-cost Fosmid library, label and label joint used in method
CN102127818A (en) * 2010-12-15 2011-07-20 张康 Method for creating fetus DNA library by utilizing peripheral blood of pregnant woman

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Noninvasive prenatal diagnosis of fetal chromosomal aneuploidy by massively parallel genomic sequencing of DNA in maternal plasma; Chiu et al; PNAS; 2008-12-31; pp. 20458-20463 *

Also Published As

Publication number Publication date
CN108485940B (en) 2022-01-28
CN204440396U (en) 2015-07-01
CN103374518A (en) 2013-10-30
CN107435070A (en) 2017-12-05
CN108485940A (en) 2018-09-04

Similar Documents

Publication Publication Date Title
CN103374518B (en) Detection and classification of copy number variation
US11697846B2 (en) Detecting and classifying copy number variation
US11875899B2 (en) Analyzing copy number variation in the detection of cancer
US20200219588A1 (en) Detecting and classifying copy number variation
US9411937B2 (en) Detecting and classifying copy number variation
EP2877594B1 (en) Detecting and classifying copy number variation in a fetal genome
KR102184868B1 (en) Using cell-free dna fragment size to determine copy number variations
CN107750277B (en) Determination of copy number variation using cell-free DNA fragment size
US9323888B2 (en) Detecting and classifying copy number variation
CN103003447B (en) Method for determining the presence or absence of different aneuploidies in a sample
AU2019200163B2 (en) Detecting and classifying copy number variation
AU2019200162B2 (en) Detecting and classifying copy number variation
US20240203601A1 (en) Analyzing copy number variation in the detection of cancer

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 1187363; Country of ref document: HK)

C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
REG Reference to a national code (Ref country code: HK; Ref legal event code: WD; Ref document number: 1187363; Country of ref document: HK)