US20190066842A1 - A novel algorithm for smn1 and smn2 copy number analysis using coverage depth data from next generation sequencing - Google Patents

Info

Publication number
US20190066842A1
Authority
US
United States
Prior art keywords
copy number
smn1
smn2
samples
ngs
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US16/083,452
Other languages
English (en)
Inventor
Jinglan Zhang
Lee-Jun C. Wong
Yanming Feng
Xiaoyan Ge
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Baylor College of Medicine
Baylor Miraca Genetics Laboratories LLC
Original Assignee
Baylor College of Medicine
Baylor Miraca Genetics Laboratories LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Baylor College of Medicine, Baylor Miraca Genetics Laboratories LLC
Priority to US16/083,452
Assigned to BAYLOR MIRACA GENETICS LABORATORIES, LLC; assignment of assignors interest (see document for details). Assignors: FENG, YANMING
Publication of US20190066842A1
Assigned to BAYLOR COLLEGE OF MEDICINE; assignment of assignors interest (see document for details). Assignors: Wong, Lee-Jun C.

Classifications

    • CCHEMISTRY; METALLURGY
    • C12BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12QMEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869Methods for sequencing
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • CCHEMISTRY; METALLURGY
    • C12BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12QMEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6883Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
    • G06F19/22
    • G06F19/28
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00ICT programming tools or database systems specially adapted for bioinformatics
    • CCHEMISTRY; METALLURGY
    • C12BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12QMEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2600/00Oligonucleotides characterized by their use
    • C12Q2600/156Polymorphic or mutational markers

Definitions

  • Embodiments of the disclosure concern at least the fields of genetics, cell biology, molecular biology, diagnostics, and medicine.
  • Spinal muscular atrophy (SMA; MIM #253300) is a neuromuscular disorder caused by the loss of motor neurons in the spinal cord and the brainstem, leading to generalized muscle weakness and muscular atrophy that impair activities such as crawling, walking, sitting up, and controlling head movement (Emery, et al., 1976).
  • SMA has variable expressivity with a broad range of onset and severity. In severe cases, death occurs within the first two years of life, mostly due to respiratory failure (Dubowitz, 1995).
  • SMA is the second most common autosomal recessive disorder after cystic fibrosis (CF), with an incidence of about 1 in 10,000 live births and a carrier frequency of about 1/40 to 1/100 in different ethnic groups, with lower carrier frequencies in African Americans and Hispanics (Swoboda, et al., 2005; Hendrickson, et al., 2009; Prior, et al., 2008; MacDonald, et al., 2014). SMA is caused by mutations in the survival motor neuron 1 (SMN1) gene including deletions, gene conversions or intragenic mutations in both of the SMN1 alleles, while SMN2 copy number may modify the disease severity (Feldkotter, et al., 2002).
  • SMN1 and SMN2 are highly homologous, and only differ by five base pairs, none of which change the amino acid sequences.
  • a single C to T change in SMN2 exon 7 (c.840C>T) affects an exonic splicing enhancer (ESE) or creates an exon silencer element (ESS) that results in the majority of transcripts lacking exon 7 (Cartegni et al., 2002; Kashima and Manley, 2003), which results in a reduction of full-length transcripts from SMN2 (Lorson, et al., 1999).
  • SMA has unique features that can be recognized clinically that often prompt follow-up molecular diagnosis.
  • RFLP is commonly used as a diagnostic test for SMA patients, although it cannot detect carrier status.
  • the first carrier test for SMA was developed in 1997 using a competitive PCR strategy for the quantitative analysis of SMN1 copy numbers which set the foundation for carrier screening for SMA (McAndrew, et al., 1997). With the advancement of technology in the last two decades, high-throughput methods were developed using MLPA or quantitative PCR which enabled expanded population SMA carrier screening most of which involve SMN1 copy numbers.
  • An NGS-based carrier screening panel has also been developed, which offers better clinical outcomes with an increased detection rate and lower total healthcare cost compared to conventional genotyping or other targeted approaches (Hallam, et al., 2014).
  • the comprehensiveness of NGS testing makes receiving a negative result much more reassuring in terms of residual risk of sequence variants detected.
  • We and others have shown that NGS can discover CNVs at both gene and exonic levels for clinical tests (Feng, et al., 2015; Retterer, et al., 2015).
  • NGS-based CNV detection in general is still challenging for small deletions/duplications at the single-exon or sub-exon level due to technical noise introduced by uneven coverage in regions with different GC content, non-linear amplification by PCR, or inter-run variation caused by other assay artifacts known as batch effects.
  • Another drawback of CNV analysis by NGS is the lack of a locus-specific computational program for genes with homologous sequences, which require accurate alignment of gene-specific reads and subsequent copy number analysis. Therefore, such genes, including SMN1/2, are normally not included in NGS secondary analysis for variant calling, or variant calls in these genes often fail the mapping quality filter.
  • the present disclosure satisfies a long-felt need in the art to employ NGS for highly homologous sequences, at least to determine their gene copy number, and also satisfies a long-felt need in the art for reliable testing for carrier status for SMA.
  • Embodiments of the disclosure concern methods and compositions for analysis of one or more samples from an individual.
  • the disclosure concerns determination of whether or not an individual has an allele that includes at least one specific gene sequence and/or polymorphism and/or mutation and/or copy number.
  • DNA from a sample from an individual is analyzed to determine if the individual has certain copy number(s) of one or more genes that would classify the individual as a carrier for a disease.
  • a pair of genes in question is one in which the genes are nearly identical (for example, greater than 95, 96, 97, 98, 99, 99.1, 99.2, 99.3, 99.4, 99.5, 99.6, 99.7, 99.8, or 99.9% identity) or otherwise has significant sequence similarity to another gene, such as the pair being a gene and a pseudogene or paralogue gene, for example (such as SMN1/SMN2, CYP21A2/CYP21A1P, or HBA1/HBA2).
  • the pair of genes that are in need of determination of copy number may have a difference of only 1, 2, 3, 4, 5, or more nucleotides.
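  • As an illustration of the "nearly identical" criterion above, the following minimal sketch computes percent identity and lists the differing positions for two pre-aligned, equal-length sequences. The function names and the toy sequences are hypothetical and are not taken from the disclosure; real paralogue comparisons would start from a proper alignment that handles gaps.

```python
def percent_identity(seq_a, seq_b):
    """Percent of matching positions between two pre-aligned sequences.

    Assumes the sequences are already aligned and gap-free (an
    illustrative simplification, not part of the disclosure).
    """
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


def differing_positions(seq_a, seq_b):
    """0-based positions at which the two aligned sequences differ."""
    return [i for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


# Toy sequences that differ at a single position.
gene_one = "ACGTACGTAC"
gene_two = "ACGTACGTAT"
identity = percent_identity(gene_one, gene_two)
variants = differing_positions(gene_one, gene_two)
```

  • Positions returned by `differing_positions` are the candidate signature variants that later steps would use to tell reads from the two genes apart.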
  • the methods allow one to utilize sequencing data from NGS to determine copy number of one or more genes.
  • Embodiments of the disclosure utilize counts of single instances of a particular sequenced region (every single sequenced DNA fragment may be referred to as one “read”) that corresponds to all or part of exons for a certain gene.
  • the counts therefore, are a representative and corresponding value of the copy number of a region of a gene and, thereby, of the gene itself.
  • the reads that comprise sequence that does not encompass one or more signature variants are utilized for determination of total copy number of both of a first and second gene but are not utilized for determination of copy number ratio between a first and second gene.
  • the reads that comprise sequence that does encompass one or more signature variants are utilized for determination of a ratio of copy number between a first and second gene but are not utilized for determination of total copy number of both of a first and second gene. That is, in specific embodiments of the methods there is no distinguishing between the two genes when determining the total copy number value for the ultimate computation.
  • the disclosure encompasses methods for determining whether or not an individual is a carrier for a genotype associated with SMA, including in at least some cases determining the severity of the affliction with SMA. At least some methods described herein analyze copy number of both SMN1 and SMN2. Certain methods allow for the use of next generation sequencing (NGS) using analysis of SMN1 and SMN2 even though they are highly similar in sequence identity. The methods exploit the minimal differences between the two genes. Methods described herein for genetic analysis may be used as a sole test for an individual or may be employed as one of multiple tests for an individual.
  • Some methods of the disclosure determine whether or not an individual is a carrier for SMA.
  • the DNA of an individual is analyzed for copy number of SMN1 and SMN2.
  • the ratio and/or total copy number of one or more genes, including SMN1 and SMN2, are encompassed as part of analyses herein.
  • the analysis of an individual's DNA using methods of the disclosure can allow for determination whether or not an individual is a carrier for spinal muscular atrophy (SMA), for example.
  • methods and compositions for distinguishing SMN1 and/or SMN2 copy number(s) utilize as part of the method the determination of a variance between SMN1 and SMN2 at a particular exon or intron, such as exons 7 and 8 or introns 6 and 7.
  • compositions for carrier screen tests are encompassed in the disclosure.
  • the carrier screen tests may be utilized with other types of tests, including other carrier screen tests, or the composition may solely be utilized for determination of carrier status for a particular genetic mutation and related disease.
  • a method of determining gene copy number for an individual comprising the step of identifying copy number of two nearly identical genes using sequencing data from next generation sequencing to distinguish at least one variance between the two genes.
  • the identifying step comprises the determination of a mathematical relationship between a) the copy number ratio of the two genes, and b) the total copy number for both of the two genes in sum.
  • the mathematical relationship is further defined as computing copy number for each gene by applying the copy number ratio to the total copy number.
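  • Applying the copy number ratio to the total copy number can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the use of `math.inf` to encode "no copies of the second gene", and rounding to the nearest integer are choices made here, not details given in the disclosure.

```python
import math


def split_copy_number(ratio, total):
    """Split a combined copy number into (first_gene, second_gene) copies.

    ratio: first-gene to second-gene copy number ratio; math.inf is used
           here (an assumption) when no second-gene reads are seen.
    total: combined copy number of both genes, possibly a noisy float.
    """
    if ratio < 0 or total < 0:
        raise ValueError("ratio and total must be non-negative")
    total_copies = round(total)
    if math.isinf(ratio):
        # Every copy belongs to the first gene.
        return total_copies, 0
    # first / second = ratio and first + second = total
    first = round(total * ratio / (1.0 + ratio))
    return first, total_copies - first


balanced = split_copy_number(1.0, 4)         # two copies of each gene
one_and_two = split_copy_number(0.5, 3)      # ratio 1:2, three copies in total
no_second = split_copy_number(math.inf, 2)   # second gene absent
```

  • Rounding absorbs small measurement noise in the ratio and total; a production pipeline would presumably also flag values that fall far from any integer solution.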
  • the two genes are SMN1 and SMN2.
  • the gene copy number identifies carrier status for an individual, and the gene copy number may be 0, 1, 2, 3, 4, 5, 6, 7, or more.
  • a method of assaying nucleic acid from a sample from an individual for a recessive allele for a genetic mutation associated with spinal muscular atrophy comprising the step of generating a mathematical relationship between the total copy number of SMN1 and SMN2 and the copy number ratio of SMN1 to SMN2, wherein the total copy number and copy number ratio are determined using next generation sequencing data.
  • the method may further comprise the step of determining that an individual is in need of assaying for the allele.
  • the individual has a family history of SMA.
  • the individual may be pregnant.
  • the individual may be in need of family planning.
  • a method comprising: receiving sequenced sample data; determining a copy number ratio between two nearly identical genes of the received sample data; determining a total copy number of the two nearly identical genes of the received sample data; and determining a final copy number for the two nearly identical genes for the received sample.
  • the method further comprises determining a patient outcome hypothesis based, at least in part, on the determined final copy number for the received sample corresponding to the patient.
  • the step of determining the patient outcome hypothesis comprises determining that a patient is a carrier when the final copy number is not equal to two.
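  • The decision rule in the preceding step can be written down literally. The sketch below applies "carrier when the final copy number is not equal to two" to a batch of samples; the function names and sample identifiers are hypothetical, and which gene's copy number is passed in (for example SMN1) is an assumption. How values of zero or of three and above should be interpreted clinically is outside the scope of this sketch.

```python
def is_carrier_hypothesis(final_copy_number, expected_copies=2):
    """Literal form of the rule in the text: flag when copies != 2."""
    return final_copy_number != expected_copies


def screen_samples(final_copy_numbers):
    """Map each sample identifier to its carrier-hypothesis flag."""
    return {
        sample_id: is_carrier_hypothesis(copies)
        for sample_id, copies in final_copy_numbers.items()
    }


# Hypothetical batch: sample identifier -> final copy number.
flags = screen_samples({"P1": 2, "P2": 1, "P3": 0})
```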
  • the received sequenced sample data may be received from next generation sequencing (NGS) and the sample data may be aligned to hg19, for example.
  • the received sequenced sample data comprise a plurality of samples corresponding to a plurality of patients, and wherein a copy number ratio, a total copy number, and a final copy number are determined for each of the plurality of samples.
  • the two nearly identical genes may comprise the SMN1 and SMN2 genes.
  • the step of determining the copy number ratio may comprise obtaining a read depth (rd) of PSVs for the received sample data; calculating a copy number ratio for the received sample data for predetermined exons selected based on exons with expected differences; and building a table of calculations for the calculated copy number ratios for a plurality of samples.
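  • The ratio step can be sketched as a table of per-PSV read-depth ratios, one row per sample. The data layout, function names, and the convention of returning infinity when the second gene has no supporting reads are assumptions made for illustration; the read depths are invented.

```python
import math


def psv_ratio(depth_first, depth_second):
    """Ratio of first-gene to second-gene read depth at one PSV."""
    if depth_second == 0:
        # No second-gene reads: infinite ratio, or undefined if both are 0.
        return math.inf if depth_first > 0 else math.nan
    return depth_first / depth_second


def build_ratio_table(samples):
    """samples: {sample_id: {psv_name: (depth_first, depth_second)}}.

    Returns {sample_id: {psv_name: ratio}}.
    """
    return {
        sample_id: {
            psv: psv_ratio(first, second)
            for psv, (first, second) in psvs.items()
        }
        for sample_id, psvs in samples.items()
    }


# Invented read depths at the exon 7 and exon 8 PSVs.
table = build_ratio_table({
    "S1": {"exon7": (100, 100), "exon8": (90, 45)},
    "S2": {"exon7": (80, 0)},
})
```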
  • the step of determining the total copy number may comprise determining a total coverage of selected exons of the two nearly identical genes for each of a plurality of received samples; determining a median or mean of each of the selected exons from samples having a ratio of the two nearly identical genes equal to approximately one; normalizing the total coverage for the selected exons for each sample of the plurality of samples relative to all samples of the plurality of samples; and determining the total copy number for each of the selected exons for each of the plurality of samples based, at least in part, on the normalized total coverage.
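  • One possible reading of the total copy number step is sketched below. Several details are assumptions rather than statements from the disclosure: each sample's exon coverage is first divided by a per-sample `library_depth` (a hypothetical measure of overall sequencing depth), samples whose ratio is within a tolerance of one form the baseline, and the baseline median is taken to represent four combined copies (two of each gene).

```python
from statistics import median


def total_copy_numbers(coverage, library_depth, ratios,
                       baseline_copies=4.0, tolerance=0.15):
    """Estimate combined copy number per exon for each sample.

    coverage:      {sample: {exon: combined read depth over both genes}}
    library_depth: {sample: overall depth used for scaling (assumed input)}
    ratios:        {sample: first-gene to second-gene copy number ratio}
    """
    # Scale away differences in overall sequencing depth between samples.
    normalised = {
        sample: {exon: depth / library_depth[sample]
                 for exon, depth in exons.items()}
        for sample, exons in coverage.items()
    }
    # Baseline samples: ratio approximately one (inf and nan never qualify).
    baseline_samples = [s for s in coverage
                        if abs(ratios[s] - 1.0) <= tolerance]
    if not baseline_samples:
        raise ValueError("no samples with a ratio of approximately one")
    exon_names = list(next(iter(coverage.values())))
    baseline = {exon: median(normalised[s][exon] for s in baseline_samples)
                for exon in exon_names}
    return {
        sample: {exon: baseline_copies * normalised[sample][exon] / baseline[exon]
                 for exon in exon_names}
        for sample in coverage
    }


# Invented data: A and B are baseline samples, C and D are not.
estimates = total_copy_numbers(
    coverage={"A": {"exon7": 400}, "B": {"exon7": 400},
              "C": {"exon7": 300}, "D": {"exon7": 600}},
    library_depth={"A": 100, "B": 100, "C": 100, "D": 200},
    ratios={"A": 1.0, "B": 1.0, "C": 2.0, "D": 0.5},
)
```

  • Using the median rather than the mean keeps the baseline stable when a few ratio-one samples carry an unusual combined copy number, such as one or three copies of each gene.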
  • an apparatus comprising a processor and a memory, wherein the processor is coupled to the memory, and wherein the processor is configured to perform the steps recited in any of methods encompassed by the disclosure.
  • a computer program product comprising: a non-transitory computer readable medium comprising code to perform steps comprising the steps recited in any of the methods encompassed by the disclosure.
  • FIGS. 1A-F show an example of NGS data processing for SMN1 and SMN2 copy number analysis.
  • FIG. 2 demonstrates a SMN1:SMN2 copy number ratio distribution in 2,488 pan-ethnic group individuals.
  • FIG. 3 shows the SMN1 and SMN2 copy number(s) distribution in 2,488 pan-ethnic group individuals.
  • FIG. 4 shows a sample with two copies of SMN1 and zero copy of SMN2 in which all reads that mapped to E7 and E8 of SMN2 were those without SMN1 PSVs (SEQ ID NOS 1-10).
  • FIGS. 5A-5D show a representative batch of capture NGS data for SMN1 copy number detection.
  • FIG. 6 illustrates general embodiments of at least some steps of the methods that include alignment of pair-end reads (reads anchored by single gene-specific variants) to SMN1 or SMN2 locus.
  • FIG. 7 is a schematic block diagram illustrating one embodiment of a system for multi-attribute clustering.
  • FIG. 8 is a schematic block diagram illustrating one embodiment of a database system for multi-attribute clustering.
  • FIG. 9 is a schematic block diagram illustrating one embodiment of a computer system that may be used in accordance with certain embodiments of the system for multi-attribute clustering.
  • FIGS. 10A-10C show the SMN1 and SMN2 NGS sequence alignment surrounding the functional PSV at c.840.
  • FIG. 10A: The SMN gene PSV1 (c.840C/T), PSV2 (c.888+100A/G) and SMN1 SNP g.27134T>G are located within a 148 bp region spanning exon 7 and intron 7 of the SMN1 or SMN2 gene.
  • FIG. 10B: The alignment of pair-end sequence reads (2 × 100) in a normal and an SMN1/SMN2 gene hybrid sample.
  • the red or purple box represents the pair-end read R1 or R2, respectively.
  • FIG. 10C: Sequence pileups of read pairs at the correct SMN1 locus (top) (SEQ ID NO: 11) and incorrect SMN2 locus (bottom) (SEQ ID NO: 12) (SEQ ID NO: 1).
  • FIG. 11 shows a novel computational algorithm, PGCNARS (paralogous gene copy number analysis by ratio and sum), for SMN1 copy number analysis using NGS coverage depth data for SMA carrier screening.
  • PGCNARS involves three major steps for the SMN1 copy number analysis. First, for each sample in the same capture pool, the copy number ratio of SMN1 to SMN2 is calculated using the read depth of the PSVs in exon 7 (c.840C/T) or exon 8 (c.*233T/A) of SMN1 and SMN2 (step a1-3). Second, the SMN1 and SMN2 total copy number is determined from their exonic coverage data after normalization to the read depth of the median identified in the sample group (step b1-7). Last, the SMN1 copy number in each sample is calculated from the SMN1 to SMN2 copy number ratio and their total copy number (step c).
  • FIGS. 12A-12B show that a paralogous sequence variant (PSV) can be informative for NGS read alignment for highly homologous genes.
  • FIG. 13 shows that SMN1 and SMN2 alignment and copy number analysis were confounded by gene hybrids and an SNP.
  • a group of eight samples with three copies of SMN1, one copy of SMN2 and an SMN1 SNP (g.27134T>G) were aligned using pair-end (PE) and single-end (SE) mapping algorithms.
  • the SMN1 and SMN2 copy number analyses were performed using the coverage data generated by the PE or SE alignment algorithm.
  • the PE method underestimated SMN1 to SMN2 copy number ratio (left panel) and SMN1 copy number (middle panel) and the SMN2 copy number was overestimated (right panel).
  • FIGS. 14A-14C show the distribution of SMN1 to SMN2 copy number ratios and SMN1 and SMN2 copy numbers in 6,738 samples.
  • FIG. 14A: There are four major groups of samples with different SMN1 to SMN2 copy number ratios at approximately 1, 2, 3, and ∞ (zero copies of SMN2).
  • FIG. 14B The relative distributions of samples with different SMN1 copy numbers in 6,738 samples.
  • FIG. 14C The relative distributions of samples with different SMN2 copy numbers in 6,738 samples.
  • FIG. 15 is a pedigree of a representative SMA family analyzed by NGS. The pedigree and the NGS pileup showed two children affected by SMA with zero copies of SMN1. Both parents were carriers with one copy of SMN1 (SEQ ID NOS:31-34).
  • FIG. 16 shows gene-specific PCR used to amplify the SMN1 gene to confirm sequence variants identified by capture NGS. Two fragments (5′ and 3′ fragment) were amplified using a gene-specific primer designed based on the exon 7 PSV and non-specific primers upstream (exon 2 primer) and downstream (exon 8 primer) of the PSV. Controls used in this study included DNA with two copies of SMN1, zero copies of SMN1 (SMA) and zero copies of SMN2.
  • FIG. 17 shows RFLP analysis that specifically detected the g.27134T>G SNP in the SMN1 locus.
  • PCR was performed to amplify the SMN1 fragment containing the 2+0 carrier SNP (g.27134T>G).
  • Primers were designed to specifically amplify SMN1, but not SMN2, by utilizing the c.840C PSV at exon 7, as well as an additional mismatch base pair before the PSV.
  • HpyCH4III cut SMN1 PCR product only when SNP g.27134T>G was present.
  • Controls were included (from left to right): DNA with a heterozygous SNP g.27134T>G in SMN1 producing digested PCR products of 173 bp, 235 bp and 408 bp in size, DNA without the g.27134T>G SNP, DNA with a homozygous g.27134T>G SNP, DNA with zero copies of SMN1, and a no-template control (NTC).
  • FIG. 18 is a haplotype with misaligned g.27134T>G SNP.
  • the g.27134T>G SNP was misaligned to the SMN2 locus by NGS, but SMN1 specific RFLP analysis was able to correctly identify it in the SMN1 locus.
  • “a” or “an” may mean one or more.
  • when used in conjunction with the word “comprising”, the words “a” or “an” may mean one or more than one.
  • “another” may mean at least a second or more.
  • aspects of the invention may “consist essentially of” or “consist of” one or more sequences of the invention, for example.
  • Some embodiments of the invention may consist of or consist essentially of one or more elements, method steps, and/or methods of the invention. It is contemplated that any method or composition described herein can be implemented with respect to any other method or composition described herein.
  • Embodiments of the disclosure allow determination of gene copy number using NGS data for genes that have highly homologous regions. Methods of the disclosure may be employed following next generation sequencing or third generation sequencing for determining copy number of two highly homologous genes or of determining copy number for a gene and a pseudogene. The determination of copy number in such situations may be informative for a medical purpose, such as determining whether or not an individual is a carrier, affected, or at risk for particular genetic disease(s).
  • Embodiments of the disclosure concern clinical molecular testing including carrier screening using NGS for testing for a particular carrier status for a disease in an individual.
  • the present disclosure concerns methods for analyzing copy number of SMN1 and SMN2 (as examples) for screening for whether or not a particular individual is a carrier for SMA, for example.
  • the methods employ next generation sequencing including gene-specific reads by utilizing fragments having unique nucleotide(s) for SMN1 and/or SMN2.
  • the methods of the disclosure avoid the use of primers or probes that target particular single nucleotide polymorphisms (SNPs).
  • Embodiments of the disclosure are useful for determining copy number using NGS methods including for those genes with homologous sequences necessitating accurate alignment of gene specific reads and subsequent copy number analysis.
  • Methods of the disclosure allow for enhanced variant calling using NGS in gene(s) that are difficult to analyze with NGS, particularly when the analysis requires or would benefit from reliable copy number analysis.
  • determination of a copy number ratio between a first gene and a second gene that are highly identical to each other in sequence utilizes one or more informative variants (such as polymorphisms or mutations) that allow accurate alignment of multiple reads over a particular exon present in both genes, and this alignment facilitates accurate quantitation of the reads.
  • Methods of the disclosure utilize read depths of gene specific reads to calculate copy number ratio of a first gene to a second gene. In at least some cases, non-discriminating reads are utilized to calculate total copy number using all exons.
  • embodiments of this disclosure allow for Next Generation Sequencing or Third Generation Sequencing coverage data to call SMN1/SMN2 copy numbers.
  • the highly homologous gene SMN2 makes short NGS reads difficult to align to the gene-specific locus of SMN1 or SMN2.
  • NGS is semi-quantitative in that copy number analysis from NGS data is affected by many variables in library preparation, PCR cycle numbers, and sequencing artifacts.
  • the inventors deployed a method that decoupled the pair-end reads and performed alignment based on single-end reads to increase mapping specificity (reads anchored to the gene-specific locus by gene-specific variants) to the SMN1 or SMN2 locus.
  • SMN1 and SMN2 gene copy numbers were determined.
  • a first step in the methods includes alignment of reads according to one or more nucleotides that differentiate between a first gene and a second gene.
  • a copy number ratio is determined from how many reads are aligned for the first gene versus how many reads are aligned for the second gene.
  • a total copy number as a sum of both genes is determined. The value of the total copy number and the value of the copy number ratio allow interpretation of the exact copy number of the first and second genes.
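  • for example, when the copy number ratio of the first gene to the second gene is 1:2 and the total copy number is 3, the actual copy number of the first gene is 1 and the actual copy number of the second gene is 2.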
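  • The interpretation step can be written as a pair of equations. The symbols are introduced here for illustration and do not appear in the disclosure: N_1 and N_2 are the copy numbers of the first and second gene, r is the first-to-second copy number ratio, and T is the total copy number. The worked instance assumes r = 1/2 and T = 3, which is the only combination consistent with copy numbers of 1 and 2.

```latex
\[
  N_1 + N_2 = T, \qquad \frac{N_1}{N_2} = r
  \quad\Longrightarrow\quad
  N_1 = \frac{r\,T}{1 + r}, \qquad N_2 = \frac{T}{1 + r}.
\]
\[
  r = \tfrac{1}{2},\; T = 3:
  \qquad
  N_1 = \frac{\tfrac{1}{2}\cdot 3}{1 + \tfrac{1}{2}} = 1,
  \qquad
  N_2 = \frac{3}{1 + \tfrac{1}{2}} = 2.
\]
```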
  • a signature variance between two genes for use in the methods is known (e.g., SMN1/SMN2), but in some cases a signature variance is selected after sequencing a large number of samples in order to determine gene specific loci that are not affected by polymorphisms, gene conversions, or other genetic events. These gene specific loci will be used to accurately align NGS reads harboring at least one of these gene specific nucleotides.
  • those differences may be employed in the method if they are within a certain number of bases (less than the length of NGS reads).
  • the methods provide carrier screen tests for individual(s) that are in need of determining whether or not they are a carrier for a genetic-based disease, including one in which the carrier would be autosomal recessive for a mutated gene in question.
  • the individual may be male or female. In specific embodiments, the individual intends to procreate.
  • the methods may be implemented as part of family planning for one or more individuals. The methods may or may not be employed as part of routine medical practices.
  • the individual may be a pregnant female, such as one with an option of terminating a pregnancy dependent on the outcome of the carrier screen test.
  • this method can also be used as a diagnostic test for individuals (fetus, infant, child or adult) who may be affected by such recessive diseases.
  • Fetal tissues used for analysis may include CVS, amniocytes, or product of conception.
  • the method may be employed as part of a single carrier testing assay that is for testing multiple genes or it may be a single gene testing assay or it may be used as part of multiple assays for multiple genes.
  • An individual may utilize the methods described herein as a sole user, or the methods may be performed by another party. In certain cases, an individual that utilizes the methods does so because of a desire for general personal genetic knowledge, because of family planning concerns, because of a concern for risk of producing offspring with SMA, or because of a known risk for producing offspring with SMA, for example because of family history or a positive result of another type of genetic test.
  • the methods may be used as a primary and sole means of determining whether or not an individual is a carrier for SMA or may be used as a secondary means, such as obtaining a second opinion.
  • the disclosed methods may be utilized as a first tier test for determining whether or not an individual is a carrier for a genetic defect, which may be defined as carrier status.
  • further testing to confirm whether or not an individual is a carrier may be employed, regardless of whether or not the individual tested as being a carrier or not being a carrier.
  • Although the disclosed methods are employed to determine the copy number of SMN1 and SMN2 for carrier status for SMA, in some cases the carrier status for other genetic diseases may be queried. For example, one may determine the carrier status of congenital adrenal hyperplasia (CAH; CYP21A2/CYP21A1P), hemoglobin disorders (HBA1/HBA2), and any other genetic diseases that may be caused by gene copy number variations due to the presence of regions homologous to the disease genes.
  • a sample is obtained from an individual in need of determining carrier/affected status for an allele.
  • the sample from the individual may be of any kind so long as DNA is able to be extracted therefrom.
  • the sample may be obtained using any method.
  • the sample comprises blood, saliva, hair, semen, urine, feces, cheek scrapings, biopsy, amniotic fluid, chorionic villus, and so on.
  • Spinal muscular atrophy (SMA) is one of the most common autosomal recessive diseases, with an incidence of approximately 1 in 10,000 live births.
  • the carrier frequency of this disease is approximately 1:40 to 1:70 in different ethnic groups, and population-based carrier screening is recommended by professional societies such as the ACMG.
  • SMA is caused by the complete loss of the survival motor neuron 1 (SMN1) protein while the number SMN2 copy gene may serve as a modifier for disease severity in affected patients.
  • SMA is caused by the complete loss of the survival motor neuron 1 (SMN1) protein while the number SMN2 copy gene may serve as a modifier for disease severity in affected patients.
  • the underlying mechanism for SMN1 gene copy number change is attributed to its deletion or gene conversion.
  • SMN1 and SMN2 are highly homologous with only five different nucleotides within the gene.
  • NGS next generation sequencing
  • Gene specific reads were counted by surveying fragments with at least one of the SMN1/2 unique nucleotides in order to calculate SMN1:SMN2 copy number ratios.
  • the total SMN1 and SMN2 copy numbers were independently determined by counting all of the exon 7 and neighboring exons' reads. The individual SMN1 and SMN2 gene copy numbers were then determined from the total SMN1+SMN2 copy number together with the SMN1:SMN2 copy number ratio.
  • the inventors analyzed over 3,000 clinical samples and compared the copy number obtained from NGS with that from qPCR and/or MLPA studies. Individuals carrying one, two, three, four or above copies of SMN1 and SMN2 were all correctly identified by the NGS method. Potential limitations of this method due to gene hybrid or rare SNPs can be addressed by a refined local alignment algorithm and recounting gene specific reads. This method is useful to more efficiently perform large-scale carrier detection of SMA.
  • the present example shows population carrier screening for spinal muscular atrophy by next generation sequencing.
  • genomic DNA samples were fragmented with the use of sonication, ligated to Illumina multiplexing paired-end adapters, amplified by means of a polymerase-chain-reaction assay with the use of primers with sequencing barcodes (indexes), and hybridized to biotin-labeled, solution-based capture reagent that was custom designed (Roche NimbleGen). Hybridization was performed at 47° C. for 64 to 72 hours, and paired-end sequencing (100 cycles each) was performed on the Illumina HiSeq.
  • An example of an NGS data processing and copy number analysis procedure is illustrated in FIGS. 1A-1F.
  • samples from the same capture pool were grouped together.
  • the raw sequence data can be aligned to hg19 reference by NextGENe software (available from SoftGenetics, State College, Pa.).
  • three steps may be performed in CNV analysis.
  • a first step is to extract a read depth of the four PSV (paralogous sequence variant) loci of interest, in E7 and E8 of SMN1 and SMN2, and to calculate the copy number ratio of, e.g., SMN1 to SMN2, for each sample in the same capture pool.
  • a second step is to generate the total (e.g., SMN1 and SMN2) copy number of each exon from the normalized average coverage depth of each exon according to CNV analysis algorithm (such as the one that is or is based on the one described in Feng, et al., 2015; Retterer, et al., 2015, or one modified from those algorithms), such that only the read depth of samples with SMN1:SMN2 ratios between 0.8-1.2 from the first step are selected to generate the median coverage depth of each exon.
  • the total coverage depth of each exon is then normalized against the corresponding medians of the group.
  • the total copy numbers of SMN1+SMN2 of each exon were obtained by multiplying the normalized values with 4.
  • the copy numbers are generated for individual SMN1 and SMN2 genes from SMN1:SMN2 copy number ratio from the first step and the total SMN1+SMN2 copy number from the second step.
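  • The third step can be sketched in a few lines. This is a minimal illustration, not the inventors' implementation: it assumes the ratio r = SMN1/SMN2 and the total T = SMN1+SMN2 are already available from the first two steps, so that SMN1 = T·r/(1+r) and SMN2 = T − SMN1. The function name and the rounding to the nearest integer are assumptions made for the example.

```python
def split_total_copy_number(ratio, total):
    """Split a combined SMN1+SMN2 copy number into per-gene copy numbers.

    ratio: SMN1:SMN2 copy number ratio from PSV read depths (first step).
    total: SMN1+SMN2 total copy number from normalized coverage (second step).
    Both argument names are hypothetical; a sample with no SMN2 reads
    (infinite ratio) is not handled by this sketch.
    """
    if ratio < 0 or total < 0:
        raise ValueError("ratio and total must be non-negative")
    smn1 = total * ratio / (1.0 + ratio)  # share of the total attributed to SMN1
    smn2 = total - smn1                   # remainder attributed to SMN2
    # Assumption: copy numbers are reported as the nearest whole number.
    return round(smn1), round(smn2)


if __name__ == "__main__":
    # A ratio of 0.5 with a total of 3 copies resolves to 1 SMN1 and 2 SMN2.
    print(split_total_copy_number(0.5, 3.0))
```

  • Working from the ratio and the sum, rather than from gene-specific coverage alone, mirrors the rationale given above: only the PSV-containing reads are reliably gene specific, while the combined coverage is insensitive to random misalignment between the two loci.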
  • FIG. 1A is a block diagram illustrating a system for processing data to determine a diagnosis for a patient, such as to determine whether the patient is a carrier of a trait, according to one embodiment of the disclosure.
  • a system 100 may correspond to a software program embodied as various modules on a non-tangible computer readable medium.
  • the system 100 may correspond to circuitry, including logic and memory, configured to perform the functions described.
  • the system 100 may correspond to a combination of hardware and software, such as when a general purpose processor is executing code to perform steps that accomplish the described functions.
  • the system 100 may receive one or more input files 102 that include sequenced sample data.
  • the sequenced sample data may be received from DNA sequencing, such as Next-Generation Sequencing (NGS) or Third Generation Sequencing, and may be aligned in reference to the hg19 or hg38 human genome, as examples.
  • the input files 102 may be processed by one or more modules, such as a copy number ratio determination module 106 and a total copy number determination module 108 .
  • a copy number ratio and a total copy number may be determined by the modules 106 and 108 , respectively, and their outputs provided to a final copy number determination module 110 .
  • a final copy number may be determined and provided to diagnosis module 112 , which generates a diagnosis based, at least in part, on the final copy number received from module 110 .
  • the diagnosis may also be based on other data, such as information about a patient that provided a sample and/or statistical data regarding other patients in a cohort.
  • the diagnosis may be output to a user, such as shown in display 114 indicating whether a patient is determined to be a carrier of, or affected by, a trait.
  • the output may be provided, such as shown in a window on a computer system, but the output may also be provided verbally, through e-mail, text message, a web interface, a printed report, or any other type of communication.
  • a method for processing sequenced data to determine a patient diagnosis is described in FIG. 1B .
  • a method 120 begins at block 122 with receiving aligned and sequenced sample data, such as NGS data for a batch of samples, in which the NGS data is aligned to the human reference genome hg19, for example. Then, at block 124 , a copy number ratio between two nearly identical genes is determined (in specific embodiments, the term nearly identical may refer to two genes that are greater than 95, 96, 97, 98, 99, 99.1, 99.2, 99.3, 99.4, 99.5, 99.6, 99.7, 99.8, or 99.9% in identity). At block 126 , a total copy number of the two nearly identical genes is determined.
  • Block 124 may be processed prior to block 126 from the data received at block 122 .
  • a final copy number for the two nearly identical genes may be determined based, at least in part, on the determined copy number ratio of block 124 and the determined total copy number of block 126 .
  • the final copy number of block 128 may be used, in part or in whole, to diagnose a patient.
  • a patient outcome hypothesis may be determined based, at least in part, on the determined final copy number.
  • the patient outcome hypothesis may be a determination as to whether a patient is a carrier of a genetic trait or other characteristic. That patient outcome hypothesis may be confirmed by other tests, such as to eliminate or reduce the likelihood of false positives or false negatives.
  • FIG. 1C is a block diagram illustrating a process for diagnosing whether a patient is a carrier of a trait related to the SMN1 and SMN2 genes.
  • a data flow 140 may begin with receiving a batch of N samples of NGS read data aligned to hg19. That data may be processed in first data processing 144 and second data processing 146 . The first data processing 144 may be used to determine a copy number ratio for SMN1:SMN2 genes.
  • block 144 C includes building a table of the SMN1:SMN2 ratios for the batch of N samples received at block 142 .
  • Processing 146 may include, at processing block 146 A, averaging a coverage of each exon for each sample.
  • a total E1 coverage may be computed as SMN1 E1+SMN2 E1.
  • a total E8 coverage may be computed as SMN1 E8+SMN2 E8.
  • an exon coverage table may be built for a batch of samples, and at block 146 D samples selected that have a SMN1:SMN2 ratio equal to approximately one. A median or mean of each exon from the samples selected at block 146 D is computed at block 146 E.
  • the exon coverage table of block 146 C may then be normalized at block 146 F, and a total copy number of SMN1+SMN2 computed at block 146 G from the normalized coverage of block 146 F.
  • the total copy number of block 146 G and the ratio table from block 144 C may be combined to determine a final SMN1 and/or SMN2 copy number.
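  • The second data processing path (blocks 146 A through 146 G) can be sketched as follows. This is an illustrative sketch under stated assumptions: the input coverage values are the combined SMN1+SMN2 average depth per exon, already corrected for per-sample sequencing depth; the 0.8 to 1.2 window for "approximately one" and the multiplier of 4 are taken from the three-step description earlier in this section; the function and argument names are hypothetical.

```python
from statistics import median


def total_copy_numbers(coverage, ratios, low=0.8, high=1.2, baseline=4):
    """Estimate the SMN1+SMN2 total copy number per exon for a batch.

    coverage: {sample: {exon: combined SMN1+SMN2 average depth}}
    ratios:   {sample: SMN1:SMN2 copy number ratio}
    Every sample is assumed to report the same set of exons.
    """
    # Block 146D: keep samples whose SMN1:SMN2 ratio is approximately one.
    reference = [s for s in coverage if low <= ratios[s] <= high]
    if not reference:
        raise ValueError("no reference samples with a ratio near 1")
    exons = list(next(iter(coverage.values())))
    # Block 146E: median coverage of each exon across the reference samples.
    medians = {e: median(coverage[s][e] for s in reference) for e in exons}
    # Blocks 146F-146G: normalize every sample, then scale by the baseline.
    return {
        s: {e: baseline * coverage[s][e] / medians[e] for e in exons}
        for s in coverage
    }


if __name__ == "__main__":
    coverage = {
        "A": {"E7": 400.0, "E8": 410.0},
        "B": {"E7": 420.0, "E8": 400.0},
        "C": {"E7": 300.0, "E8": 305.0},
        "D": {"E7": 380.0, "E8": 390.0},
    }
    ratios = {"A": 1.0, "B": 0.95, "C": 0.5, "D": 1.1}
    print(total_copy_numbers(coverage, ratios)["C"])
```

  • Restricting the reference set to samples with a ratio near one is what justifies the baseline of 4: such samples are expected to carry mostly two copies of each gene.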
  • Sample data for the various processing blocks is shown throughout FIG. 1C .
  • a method 150 for determining a copy number ratio begins at block 152 with receiving a first sample and then reading a depth (rd) of PSVs for the received sample at block 154 .
  • a copy number ratio is calculated for the received sample for a predetermined set of exons, which may include some or all exons, wherein the predetermined exons may be selected based on having expected differences.
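  • The ratio calculation of method 150 can be illustrated with a short sketch. It assumes the read depths at the PSV positions have already been extracted from the alignment for each exon of interest; the function names, the dictionary layout, and the handling of a zero SMN2 depth are assumptions made for the example.

```python
import math


def psv_copy_number_ratio(smn1_depth, smn2_depth):
    """Return the SMN1:SMN2 ratio from read depths at one PSV position."""
    if smn1_depth < 0 or smn2_depth < 0:
        raise ValueError("read depths must be non-negative")
    if smn1_depth == 0 and smn2_depth == 0:
        raise ValueError("no PSV-containing reads at this position")
    if smn2_depth == 0:
        # Assumption: report an infinite ratio when no SMN2 reads are seen.
        return math.inf
    return smn1_depth / smn2_depth


def ratios_by_exon(psv_depths):
    """psv_depths: {exon: (SMN1 read depth, SMN2 read depth)}."""
    return {
        exon: psv_copy_number_ratio(d1, d2)
        for exon, (d1, d2) in psv_depths.items()
    }


if __name__ == "__main__":
    # Half as many SMN1 reads as SMN2 reads at both PSVs gives a ratio of 0.5.
    print(ratios_by_exon({"E7": (150, 300), "E8": (160, 320)}))
```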
  • a method 170 may begin at block 172 with determining a total coverage of selected exons of two nearly identical genes for each of a plurality of received samples. Then, at block 174 , a median may be determined for each of the selected exons from samples having a ratio of the two nearly identical genes equal to approximately one. Next, at block 176 , the total coverage of block 172 may be normalized against the medians of block 174 for all samples of the plurality of samples. Then, at block 178 , a total copy number may be determined for each of the selected exons for each of the plurality of samples based, at least in part, on the normalized total coverage of block 176 .
  • a method 180 begins at block 182 with determining the final copy number. If the final copy number is one, the method proceeds to block 184 with the determination that the patient is a carrier of a trait. If not equal to one, the method 180 proceeds to block 185 to determine if the copy number is greater than one. If the copy number is greater than one, the method 180 proceeds to block 186 to determine that the sample indicates the patient is not a carrier of a trait. If the copy number is not greater than one, then the method 180 proceeds to block 188 to determine that the copy number is zero and the sample indicates the patient is affected for the trait.
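  • The decision logic of method 180 maps directly onto a small function. The sketch below assumes the final copy number has already been rounded to a whole number; the function name and the returned labels are illustrative only.

```python
def classify_by_final_copy_number(final_copy_number):
    """Follow blocks 182-188: one copy is a carrier, more than one is not
    a carrier, and zero copies indicates an affected individual."""
    if final_copy_number < 0:
        raise ValueError("copy number cannot be negative")
    if final_copy_number == 1:   # block 184
        return "carrier"
    if final_copy_number > 1:    # blocks 185-186
        return "not a carrier"
    return "affected"            # block 188: copy number is zero


if __name__ == "__main__":
    for n in (0, 1, 2, 3):
        print(n, classify_by_final_copy_number(n))
```

  • As the description notes, such an outcome is a hypothesis that may be confirmed by other tests.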
  • The schematic flow chart diagrams of FIGS. 1A-1F are each generally set forth as a logical flow chart diagram. As such, the depicted order and labeled steps are indicative of aspects of the disclosed method. Other steps and methods may be conceived that are equivalent in function, logic, or effect to one or more steps, or portions thereof, of the illustrated method. Additionally, the format and symbols employed are provided to explain the logical steps of the method and are understood not to limit the scope of the method. Although various arrow types and line types may be employed in the flow chart diagram, they are understood not to limit the scope of the corresponding method. Indeed, some arrows or other connectors may be used to indicate only the logical flow of the method. For instance, an arrow may indicate a waiting or monitoring period of unspecified duration between enumerated steps of the depicted method. Additionally, the order in which a particular method occurs may or may not strictly adhere to the order of the corresponding steps shown.
  • Computer-readable media includes physical computer storage media.
  • a storage medium may be any available medium that can be accessed by a computer.
  • such computer-readable media can comprise random access memory (RAM), read-only memory (ROM), electrically-erasable programmable read-only memory (EEPROM), compact disc read-only memory (CD-ROM) or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer.
  • Disk and disc include compact discs (CD), laser discs, optical discs, digital versatile discs (DVD), floppy disks and Blu-ray discs. Generally, disks reproduce data magnetically, and discs reproduce data optically. Combinations of the above should also be included within the scope of computer-readable media.
  • instructions and/or data may be provided as signals on transmission media included in a communication apparatus.
  • a communication apparatus may include a transceiver having signals indicative of instructions and data. The instructions and data are configured to cause one or more processors to implement the functions outlined in the claims.
  • Because SMN1 and SMN2 differ in only five bases, the majority of SMN1- or SMN2-derived sequences are identical and cannot be distinguished by the aligner (a Burrows-Wheeler transform alignment method). As a result, these reads were mapped randomly to either the SMN1 or the SMN2 locus with low mapping confidence. For any 100-bp read containing at least one SMN1 or SMN2 PSV, the aligner was able to map the read to the reference correctly (FIG. 4). On the other hand, when no PSV was present in a given read, it would be misaligned. As illustrated in FIG.
  • FIG. 2 demonstrates the copy number ratio distribution from the read depth of PSV on exon 7. Apparently, there are three major populations with the SMN1:SMN2 copy number ratio at 1, 2 or 3.
  • Samples were grouped from the same capture pool to generate the total copy number of SMN1+SMN2 using previously published coverage-based copy number analysis methods with modifications (Retterer, et al., 2015; Feng, et al., 2015). Briefly, the coverage of each exon of a test sample was compared to the value of the same exon in the reference file, which is the median coverage of a group of samples. There are several modifications. First, because SMN1 and SMN2 are highly homologous, the NGS reads belonging to SMN1 or SMN2 may be misaligned in a random manner, so the coverage of the same exon from SMN1 and SMN2 was combined to generate the total SMN1+SMN2 copy number.
  • the copy number calculation is more accurate if it is normalized by each midpool library, compared to that by each flowcell with two or more midpool libraries.
  • Table 2 shows the sensitivity and specificity of SMN1 copy number detection when it is normalized for each flowcell with multiple midpool libraries.
  • n = 2,290; 95% confidence interval of 97.9-98.9%; Table 2.
  • FIGS. 5A-5D show a representative diagram for all the samples from a single midpool library. SMN1 copy numbers are clearly shown in FIG. 5D , and there is clear separation of 1, 2, 3, and 4 copies of SMN1.
  • NGS has made tremendous progress in clinical molecular testing including population carrier screening. While it generates reliable SNV results for a large number of genes in a high-throughput mode and can be used for CNV analysis, it is very challenging to generate reliable and reproducible CNV results for genes with highly homologous sequences, such as SMN1 and SMN2.
  • the inventors established and clinically validated a method by which the exact copy numbers of SMN1 and SMN2 can be reliably obtained. First, although most NGS reads belonging to SMN1 or SMN2 may be mapped to either SMN1 or SMN2 randomly, NGS reads containing a PSV nucleotide can be accurately mapped to the correct locus with proper alignment settings.
  • the read depth at the PSV position may represent the real coverage of the exon where the PSV is located, in specific embodiments. Subsequently, the read depth of such gene specific reads was used to calculate SMN1 to SMN2 copy number ratio. Because the majority of the NGS reads lack informative PSVs for accurate mapping, the coverage based on incorrectly aligned reads cannot be used for gene specific copy number analysis but can be useful for SMN1 and SMN2 total copy number analysis. Therefore, the inventors combined the non-discriminating reads from SMN1 or SMN2 together to obtain data to calculate their copy number.
  • FIG. 7 illustrates one embodiment of a system 700 for multi-attribute clustering.
  • the system 700 may include a server 702 and a data storage device 704 .
  • the system 700 may include a network 708 and a user interface device 710 .
  • the system 700 may include a storage controller 706 or storage server configured to manage data communications between the data storage device 704 and the server 702 or other components in communication with the network 708 .
  • the storage controller 706 may be coupled to the network 708 .
  • the system 700 may store databases comprising records, perform searches of those records, and calculate statistics regarding the records.
  • the databases may store sequenced sample data and/or results of patient diagnoses.
  • the user interface device 710 is referred to broadly and is intended to encompass a suitable processor-based device such as a desktop computer, a laptop computer, a Personal Digital Assistant (PDA), a mobile communication device or organizer device having access to the network 708 .
  • the user interface device 710 may access the Internet to access a web application or web service hosted by the server 702 and provide a user interface for enabling the service consumer (user) to enter or receive information, such as their diagnosis.
  • the network 708 may facilitate communications of data between the server 702 and the user interface device 710 .
  • the network 708 may include any type of communications network including, but not limited to, a direct PC-to-PC connection, a local area network (LAN), a wide area network (WAN), a modem-to-modem connection, the Internet, a combination of the above, or any other communications network now known or later developed within the networking arts which permits two or more computers to communicate, one with another.
  • the data storage device 704 may include a hard disk, including hard disks arranged in a Redundant Array of Independent Disks (RAID) array, a tape storage drive comprising a magnetic tape data storage device, an optical storage device, or the like.
  • the data storage device 704 may store health-related data, such as sequenced gene data, insurance claims data, consumer data, or the like.
  • the data may be arranged in a database and accessible through Structured Query Language (SQL) queries, or other database query languages or operations.
  • FIG. 8 illustrates one embodiment of a database management system 800 configured to store and manage data for multi-attribute clustering.
  • the system 800 may include a server 702 .
  • the server 702 may be coupled to a data-bus 802 .
  • the system 800 may also include a first data storage device 804 , a second data storage device 806 , and/or a third data storage device 808 .
  • the system 800 may include additional data storage devices (not shown).
  • each data storage device 804 - 808 may host separate and/or redundant databases of healthcare information.
  • the storage devices 804 - 808 may be arranged in a RAID configuration for storing redundant copies of the database or databases through either synchronous or asynchronous redundancy updates.
  • the server 702 may submit a query to selected data storage devices 804 - 808 to collect a consolidated set of data elements associated with an individual or a group of individuals or organizations.
  • the server 702 may store the consolidated data set in a consolidated data storage device 810 .
  • the server 702 may refer back to the consolidated data storage device 810 to obtain a set of data attributes associated with a specified sample.
  • the server 702 may query each of the data storage devices 804 - 808 independently or in a distributed query to obtain the set of data elements associated with a specified individual.
  • multiple databases may be stored on a single consolidated data storage device 810 .
  • the server 702 may communicate with the data storage devices 804 - 810 over the data-bus 802 .
  • the data-bus 802 may comprise a SAN, a LAN, or the like.
  • the communication infrastructure may include Ethernet, Fibre-Channel Arbitrated Loop (FC-AL), Small Computer System Interface (SCSI), and/or other similar data communication schemes associated with data storage and communication.
  • the server 702 may communicate indirectly with the data storage devices 804 - 810 , the server first communicating with a storage server or storage controller 706 .
  • the server 702 may host a software application configured for processing sequenced sample data, such as described in FIGS. 1A-1E .
  • the software application may further include modules or functions for interfacing with the data storage devices 804 - 810 , interfacing with a network 708 , interfacing with a user, and the like.
  • the server 702 may host an engine, application plug-in, or application programming interface (API).
  • the server 702 may host a web service or web accessible software application.
  • FIG. 9 illustrates a computer system 900 adapted according to certain embodiments of the server 702 and/or the user interface device 710 .
  • the central processing unit (CPU) 902 is coupled to the system bus 904 .
  • the CPU 902 may be a general purpose CPU or microprocessor. The present embodiments are not restricted by the architecture of the CPU 902 , so long as the CPU 902 supports the modules and operations as described herein.
  • the CPU 902 may execute the various logical instructions according to the present embodiments. For example, the CPU 902 may execute machine-level instructions according to the exemplary operations described above with reference to FIGS. 1A-1E .
  • the computer system 900 also may include Random Access Memory (RAM) 908 , which may be SRAM, DRAM, SDRAM, or the like.
  • the computer system 900 may utilize RAM 908 to store the various data structures used by a software application.
  • the computer system 900 may also include Read Only Memory (ROM) 906 which may be PROM, EPROM, EEPROM, or the like.
  • the ROM may store configuration information for booting the computer system 900 .
  • the RAM 908 and the ROM 906 may hold user and system 800 data.
  • the computer system 900 may also include an input/output (I/O) adapter 910 , a communications adapter 914 , a user interface adapter 916 , and a display adapter 922 .
  • the I/O adapter 910 and/or the user interface adapter 916 may, in certain embodiments, enable a user to interact with the computer system 900 in order to input information for authenticating a user, identifying an individual, or receiving health profile information.
  • the display adapter 922 may display a graphical user interface associated with a software or web-based application for processing sequenced sample data.
  • the I/O adapter 910 may connect one or more storage devices 912 , such as one or more of a hard drive, a Compact Disk (CD) drive, a floppy disk drive, a tape drive, to the computer system 900 .
  • the communications adapter 914 may be adapted to couple the computer system 900 to the network 708 , which may be one or more of a LAN and/or WAN, and/or the Internet.
  • the user interface adapter 916 couples user input devices, such as a keyboard 920 and a pointing device 918 , to the computer system 900 .
  • the display adapter 922 may be driven by the CPU 902 to control the display on the display device 924 .
  • the present embodiments are not limited to the architecture of system 900 .
  • the computer system 900 is provided as an example of one type of computing device that may be adapted to perform the functions of server 702 and/or the user interface device 710 .
  • any suitable processor-based device may be utilized, including without limitation personal data assistants (PDAs), computer game consoles, and multi-processor servers.
  • the present embodiments may be implemented on application specific integrated circuits (ASIC) or very large scale integrated (VLSI) circuits.
  • persons of ordinary skill in the art may utilize any number of suitable structures capable of executing logical operations according to the described embodiments.
  • Spinal muscular atrophy (SMA; MIM #253300) is a neuromuscular disorder caused by loss of motor neurons in the spinal cord and brainstem, leading to generalized muscle weakness and atrophy that impairs activities such as crawling, walking, sitting up, and controlling head movement (Emery, et al., 1976).
  • SMA has variable expressivity with a broad range of onset and severity. In severe cases, death occurs within the first two years of life mostly due to respiratory failure (Dubowitz, 1995).
  • SMN1 and SMN2 are highly homologous, differing in five base pairs, none of which changes the amino acid sequence.
  • a single C to T change in SMN2 exon 7 affects an exonic splicing enhancer (ESE), which results in a reduction of full-length transcripts from SMN2 (Lorson, et al., 1999).
  • This nucleotide is considered the only functional paralogous sequence variant (PSV; FIG. 10A) (Lindsay, et al., 2006) and is what differentiates SMN1 from SMN2.
  • SMA has features that can be recognized clinically but molecular testing is typically required to confirm the diagnosis.
  • PCR coupled with restriction fragment length polymorphism (RFLP) analysis is a commonly used diagnostic test for SMA (van der Steege, et al., 1995), but this method does not detect carrier status.
  • the first carrier test for SMA developed in 1997 used a competitive PCR strategy for quantification of SMN1 copy number (McAndrew, et al., 1997). Since then, the development of higher throughput methods, such as MLPA or qPCR, has enabled SMA carrier screening on a population basis (Cusco, et al., 2002; Arkblad, et al., 2006). These methodologies determine SMN1 copy number by interrogating the c.840C/T functional PSV that distinguishes the two SMN genes.
  • Massively parallel sequencing (MPS) or next-generation sequencing (NGS) technologies have rapidly transformed medicine as a cost effective approach to detecting pathogenic variants in patients with genetic diseases on a genomic scale (Yang, et al., 2014).
  • NGS-based carrier screening panels offer increased detection rates relative to conventional genotyping in a high-throughput mode for a large number of genes (Hallam, et al., 2014; Abuli, et al., 2016).
  • NGS is now used on a clinical basis for the detection of copy number variants (CNVs) (Retterer, et al., 2015; Feng, et al., 2015).
  • NGS based CNV detection is challenging for deletions and duplications at the single exon or sub-exon level due to technical noise introduced by uneven coverage in regions with variable GC content, non-linear amplification by PCR, and/or inter-run variations caused by assay artifacts known as batch effects.
  • Another major drawback of CNV analysis by short-read NGS is the lack of locus-specific computational programs for genes with highly homologous sequences that have poor mappability to the genome.
  • SMN1 and SMN2 are normally excluded from NGS variant calling and copy number analyses (Mandelker, et al., 2016).
  • SMN1 and SMN2 often undergo gene conversion events leading to gene hybrids that harbor PSVs from both genes (Cusco, et al., 2001). This complicates CNV analysis by NGS and underscores the need for nuanced data analysis to avoid errors caused by misalignment and gene conversion.
  • SMN1 copy number analysis using a Bayesian hierarchical model applied to the 1,000 genome database was recently reported (Larson, et al., 2015). This analysis characterized individuals as “likely”, “possibly”, or “unlikely” SMA carriers.
  • an NGS based clinical method for copy number analysis of SMN1 and/or other genes with highly homologous sequences has not been reported in the literature to our knowledge.
  • Sequence variants including single nucleotide variants or other small deletions, insertions or indels in SMN1 are medically relevant but not routinely detected by existing SMA carrier testing approaches.
  • SMA pathogenic variants are point mutations (MacDonald, et al., 2014). These pathogenic single nucleotide variants are not detected by carrier testing methods that only interrogate the c.840 PSV.
  • PGCNARS paralogous gene copy number analysis by ratio and sum
  • Copy number analysis for SMN1 was performed using the MRC-Holland SALSA MLPA Kit P060-B2 (MRC-Holland, the Netherlands) or custom-designed MLPA reagents according to the manufacturer's recommendations.
  • the MLPA reagent contains sequence specific probes targeted to exons 7 and 8 of both SMN1 and SMN2 (Schouten, et al., 2002).
  • the MLPA data were analyzed using Coffalyser software (MRC-Holland, the Netherlands).
  • SMN1 copy number was assessed by Taqman quantitative PCR assay as part of a panel using the BioMark 96.96 Dynamic Array (Fluidigm, South San Francisco, Calif.). Exon 7 from both SMN1 and SMN2 genes was amplified by the following primer pair, 5′-ATAGCTATTTTTTTTAACTTCCTTTATTTTCC-3′ (SEQ ID NO:35) and 5′-TGAGCACCTTCCTTCTTTTTGA-3′ (SEQ ID NO:36).
  • a probe that specifically targets the SMN1 PSV was used to detect SMN1, while SMN2 was blocked by a probe that targets the SMN2 PSV (VIC-TTTTGTCT A AAACCC [SEQ ID NO:38]).
  • Quantitative PCR was performed on the BioMark HD system (Fluidigm, South San Francisco, Calif.) as previously described with minor modifications (Forreryd, et al., 2014). Copy number was calculated using the ΔΔCt method by normalizing to the genomic reference of the case and to the batch reference within the chip (Liu, et al., 2004).
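  • The Ct-based normalization described above is commonly computed as a double difference of threshold cycles. The sketch below shows that general calculation and is not taken from the source: it assumes 100% amplification efficiency (a doubling per cycle) and a batch reference (calibrator) known to carry two copies of the target. All function and parameter names are hypothetical.

```python
def copy_number_from_ct(ct_target, ct_reference,
                        ct_target_cal, ct_reference_cal,
                        calibrator_copies=2):
    """Estimate target copy number with a double-delta Ct calculation.

    ct_target / ct_reference: Ct of the target assay and of the genomic
        reference assay in the tested case.
    ct_target_cal / ct_reference_cal: the same two values for the batch
        reference (calibrator) run on the same chip.
    Assumptions: perfect doubling per cycle; calibrator has two copies.
    """
    delta_sample = ct_target - ct_reference        # normalize within the case
    delta_cal = ct_target_cal - ct_reference_cal   # normalize the calibrator
    delta_delta = delta_sample - delta_cal         # compare case to calibrator
    return calibrator_copies * 2.0 ** (-delta_delta)


if __name__ == "__main__":
    # The target amplifies one cycle later than in the calibrator: one copy.
    print(copy_number_from_ct(26.0, 24.0, 25.0, 24.0))
```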
  • genomic DNA was fragmented by sonication, ligated to Illumina multiplexing paired-end adapters, amplified by polymerase-chain-reaction with indexed (barcoded) primers for sequencing, and hybridized to biotin-labeled, custom-designed (Roche NimbleGen, Madison, Wis.) capture probes in a solution-based reaction.
  • Hybridization was performed at 47° C. for at least 16 hours, followed by paired-end sequencing (100 bp) on the Illumina HiSeq 2500 platform with average coverage of >300× in the targeted regions.
  • Raw image data conversion and demultiplexing were performed following Illumina's primary data analysis pipeline using CASAVA v2.0 (Illumina, San Diego, Calif.). Low-quality reads (Phred score <Q25) were removed prior to demultiplexing. Batched samples from the same capture pool were grouped and processed together. Sequences were aligned to the hg19 reference genome by NextGENe software using the recommended standard settings for SNV and indel discovery (SoftGenetics, State College, Pa.). In every sample, the average coverage depth of each targeted exon of non-homologous genes was extracted and normalized according to our previously published methods (Feng, et al., 2015). Similar to Derivative Log Ratio Spread (DLRS) used in the quality assurance of aCGH data analysis, DRS (Derivative Ratio Spread) was used to quantify the coverage depth variation of each sample from the NGS data, which is defined below.
  • δ stands for the difference of normalized coverage ratio between two adjacent exons; μ is the mean of all δ; N is the total number of data points, which is the number of total exons minus 1.
  • DRS>0.1 is considered as not passing quality control and thus not included for the copy number analysis.
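  • The DRS equation itself did not survive in this text, so the following is only a sketch of one plausible reading of the definitions above: the population standard deviation of the differences in normalized coverage ratio between adjacent exons. Whether the original definition applies an additional scaling factor cannot be recovered here, and the function names are hypothetical.

```python
import math


def derivative_ratio_spread(normalized_ratios):
    """Assumed DRS: standard deviation of adjacent-exon ratio differences.

    normalized_ratios: normalized coverage ratios of consecutive exons.
    This form is an assumption; the source formula is not available.
    """
    if len(normalized_ratios) < 2:
        raise ValueError("at least two exons are required")
    # One difference per pair of adjacent exons, so N = number of exons - 1.
    deltas = [b - a for a, b in zip(normalized_ratios, normalized_ratios[1:])]
    n = len(deltas)
    mean_delta = sum(deltas) / n
    return math.sqrt(sum((d - mean_delta) ** 2 for d in deltas) / n)


def passes_quality_control(normalized_ratios, threshold=0.1):
    """Apply the stated cutoff: DRS > 0.1 fails quality control."""
    return derivative_ratio_spread(normalized_ratios) <= threshold


if __name__ == "__main__":
    print(passes_quality_control([1.0, 1.02, 0.98, 1.01]))
    print(passes_quality_control([1.0, 1.2, 1.0, 1.2]))
```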
  • the script for the detection of is deposited at https://sourceforge.net/projects/PGCNARS
  • Because SMN1 and SMN2 differ at only five bases, most of the SMN1- or SMN2-derived NGS reads (2×100-bp paired-end sequencing used in this work) were indistinguishable. As a result, these reads were ambiguously aligned to either SMN1 or SMN2 with poor mapping quality, making read-depth-based copy number analysis inapplicable. Notably, reads containing at least one SMN1 or SMN2 PSV were mapped to the reference locus with higher mapping specificity.
  • the NGS reads derived from such gene hybrid regions may confound the mapping algorithm and result in incorrect alignment ( FIG. 10B ).
  • SMN1 functional PSV c.840C
  • SMN1 SNP g.27134T>G
  • SMN2 PSV c.888+100G
  • $n_1 = \frac{rd_1}{rd_1 + rd_2} \times \frac{\Sigma c}{\mathrm{median}(\Sigma c)} \times 4$
  • $n_1$ is the calculated copy number of SMN1; $rd_1$ and $rd_2$ are the read depths of the c.840 PSV at SMN1 and SMN2, respectively
  • $\Sigma c$ is the combined exonic (exon 7) coverage of SMN1 and SMN2
  • $\mathrm{median}(\Sigma c)$ is the median of all the calculated $\Sigma c$ values in a group of samples batched together for the analysis.
  • the overall SMN1 and SMN2 copy number calculation algorithm is illustrated in FIG. 11 . Note that the formula can also be used for the exon 8 copy number analysis to compare with the exon 7 copy number results by applying the coverage data of the exon 8 PSV (c.*233T/A).
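
The copy number formula above can be illustrated with a short Python sketch. It assumes that the combined exon 7 coverage has already been normalized per sample and that the batch median corresponds to four total copies of SMN1 plus SMN2, as the factor of 4 in the formula implies. The function name `smn1_copy_number` and its parameter names are hypothetical.

```python
from statistics import median


def smn1_copy_number(rd1, rd2, combined_cov, batch_combined_covs, copies_at_median=4):
    """Estimate SMN1 copy number from PSV read depths and combined coverage.

    rd1, rd2            -- read depth of the c.840 PSV at SMN1 and SMN2
    combined_cov        -- this sample's combined exon 7 coverage of SMN1 and SMN2
    batch_combined_covs -- combined coverage of every sample in the same batch
    copies_at_median    -- total SMN1 + SMN2 copies assumed at the batch median
    """
    if rd1 + rd2 == 0:
        raise ValueError("no reads cover the c.840 PSV")
    batch_median = median(batch_combined_covs)
    if batch_median == 0:
        raise ValueError("batch median coverage is zero")
    smn1_fraction = rd1 / (rd1 + rd2)           # share of PSV reads from SMN1
    total_copies = combined_cov / batch_median * copies_at_median
    return smn1_fraction * total_copies
```

For example, a sample with equal PSV read depths at SMN1 and SMN2 and a combined coverage equal to the batch median yields an estimate of two SMN1 copies. The same function could be applied to exon 8 by passing the read depths at the exon 8 PSV, as the text notes.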
  • the multiethnic SMN1 copy number analysis data for SMA carrier population screening by NGS is summarized in Table 6.
  • African Americans and Hispanics had the lowest carrier frequencies at 1.0% and 0.9%, respectively, while Asians had the highest carrier frequency at 2.4%.
  • Caucasians and individuals of Ashkenazi Jewish ancestry had SMA carrier frequencies at 1.4% and 1.9% respectively.
  • NGS has enabled tremendous progress in clinical molecular testing including population-based expanded carrier screening (Hallam, et al., 2014; Abuli, et al., 2016; Haque, et al., 2016).
  • a recent large cohort study suggested that expanded carrier screening involving NGS increases detection rates for a variety of potentially serious genetic diseases when compared with current recommendations, which focus on testing a small number of diseases in high-risk populations (Haque, et al., 2016).
  • NGS generates reliable SNV results in a high-throughput mode and can be used for CNV analysis
  • calling sequence and copy number variants for genes with highly homologous sequences is technically challenging. For this reason, SMN1 and SMN2 have been put into a “dead zone” of genes that are not amenable to accurate NGS alignment (Mandelker, et al., 2016).
  • SMN1 and SMN2 NGS short reads lack informative PSVs for accurate mapping and simple depth of coverage analyses cannot be used directly for gene-specific copy number analysis.
  • ambiguously aligned reads i.e. reads aligned to SMN1 or SMN2
  • Gene-specific reads containing the c.840C/T PSV can then be used to calculate the SMN1 to SMN2 copy number ratio and in turn permit derivation of gene-specific copy number.
  • We used this approach to analyze 6,738 samples submitted to our lab for carrier testing. Measures of test reproducibility, sensitivity, and specificity indicate that this NGS method is highly accurate and robust for SMN1 copy number analysis.
  • the NGS test reported herein is a sensitive and robust assay of SMN1 copy number and sequence variation that increases SMA carrier detection rates across all populations.
  • this approach can be integrated into existing NGS based carrier screening panels to improve SMA detection rates and reduce the overall cost of population carrier screening.
  • RFLP Restriction fragment length polymorphism
  • Primers were designed to specifically amplify SMN1, but not SMN2, by utilizing the c.840C PSV at exon 7, as well as an additional mismatch nucleotide before the PSV.
  • DNA with zero copies of SMN1 and two copies of SMN2 was included as a negative control to ensure that SMN2 was not amplified nonspecifically.
  • HpyCH4III cuts SMN1 PCR product only when SNP g.27134T>G is present.
  • Genomic regions containing exons 2-7 (5′ long fragment, 13 kb) and exons 7-8 (3′ short fragment, 1 kb) were amplified using long-range PCR reagents (TaKaRa LA Taq DNA Polymerase Hot-Start Version). Primers were designed to preferentially amplify SMN1 by utilizing the c.840C PSV at exon 7. For the short fragment, an additional mismatch base pair before the PSV was also used to ensure SMN1 specificity.
  • RFLP Restriction fragment length polymorphism
  • PCR was performed to amplify the SMN1 fragment containing the silent carrier SNP (g.27134T>G).
  • Primers were designed to specifically amplify SMN1, but not SMN2, by utilizing the c.840C PSV at exon 7, as well as an additional mismatch basepair before the PSV.
  • HpyCH4III will cut SMN1 PCR product only when SNP g.27134T>G is present.
  • Forward primer was 5′-TGTAAAACGACGGCCAGTCTTCCTTTATTTTCCTTACAGGGTTGC (SEQ ID NO:43) and reverse primer was 5′-CAGGAAACAGCTATGACCAAGTCTGCTGGTCTGCCTACTAG (SEQ ID NO:44).
  • 1× PCR buffer, 1 µl of each primer (10 µM), 4 µl of dNTP (2.5 mM), 0.25 µl of Platinum Taq polymerase (Invitrogen), 1.5 µl of MgCl2 (50 mM), and 2 µl of genomic DNA (50 ng/µl) were used.

US16/083,452 2016-03-09 2017-03-09 A novel algorithm for smn1 and smn2 copy number analysis using coverage depth data from next generation sequencing Abandoned US20190066842A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/083,452 US20190066842A1 (en) 2016-03-09 2017-03-09 A novel algorithm for smn1 and smn2 copy number analysis using coverage depth data from next generation sequencing

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US201662305780P 2016-03-09 2016-03-09
PCT/US2017/021603 WO2017156290A1 (en) 2016-03-09 2017-03-09 A novel algorithm for smn1 and smn2 copy number analysis using coverage depth data from next generation sequencing
US16/083,452 US20190066842A1 (en) 2016-03-09 2017-03-09 A novel algorithm for smn1 and smn2 copy number analysis using coverage depth data from next generation sequencing

Publications (1)

Publication Number Publication Date
US20190066842A1 true US20190066842A1 (en) 2019-02-28

Family

ID=59789847

Family Applications (1)

Application Number Title Priority Date Filing Date
US16/083,452 Abandoned US20190066842A1 (en) 2016-03-09 2017-03-09 A novel algorithm for smn1 and smn2 copy number analysis using coverage depth data from next generation sequencing

Country Status (2)

Country Link
US (1) US20190066842A1 (fr)
WO (1) WO2017156290A1 (fr)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
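CN112435710A (en) * 2020-10-16 2021-03-02 赛福解码(北京)基因科技有限公司 Method for detecting single-sample SMN gene copy number in WES data
CN113192555A (en) * 2021-04-21 2021-07-30 杭州博圣医学检验实验室有限公司 Method for detecting SMN gene copy number in next-generation sequencing data by calculating the sequencing depth of differential alleles
CN113308538A (en) * 2021-06-29 2021-08-27 广东博奥医学检验所有限公司 SMA detection method based on Sanger sequencing
CN114420204A (en) * 2022-03-29 2022-04-29 北京贝瑞和康生物技术有限公司 Method, computing device and storage medium for predicting the copy number of a gene to be tested
CN114480620A (en) * 2022-01-18 2022-05-13 无锡中德美联生物技术有限公司 Kit for combined detection of human SMN1 and SMN2 genes and application thereof
TWI781518B (en) * 2020-03-03 2022-10-21 起元生物科技股份有限公司 Kit for diagnosing spinal muscular atrophy and use thereof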
US11519024B2 (en) * 2017-08-04 2022-12-06 Billiontoone, Inc. Homologous genomic regions for characterization associated with biological targets
US11646100B2 (en) 2017-08-04 2023-05-09 Billiontoone, Inc. Target-associated molecules for characterization associated with biological targets

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110268072B (en) * 2016-12-15 2023-11-07 Illumina公司 Method and system for determining paralogous genes
WO2019182956A1 (en) * 2018-03-22 2019-09-26 Myriad Women's Health, Inc. Variant calling using machine learning
WO2020041946A1 (en) * 2018-08-27 2020-03-05 深圳华大生命科学研究院 Method and device for detecting homologous sequences based on high-throughput sequencing
WO2020146519A1 (en) * 2019-01-09 2020-07-16 Coyote Bioscience Usa Inc. Methods and systems for identifying spinal muscular atrophy
WO2023081639A1 (en) * 2021-11-03 2023-05-11 Foundation Medicine, Inc. System and method for identifying copy number alterations

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9494520B2 (en) * 2010-02-12 2016-11-15 Raindance Technologies, Inc. Digital analyte analysis
US9150852B2 (en) * 2011-02-18 2015-10-06 Raindance Technologies, Inc. Compositions and methods for molecular labeling
US11339435B2 (en) * 2013-10-18 2022-05-24 Molecular Loop Biosciences, Inc. Methods for copy number determination
US10851414B2 (en) * 2013-10-18 2020-12-01 Good Start Genetics, Inc. Methods for determining carrier status

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11519024B2 (en) * 2017-08-04 2022-12-06 Billiontoone, Inc. Homologous genomic regions for characterization associated with biological targets
US11646100B2 (en) 2017-08-04 2023-05-09 Billiontoone, Inc. Target-associated molecules for characterization associated with biological targets
TWI781518B (zh) * 2020-03-03 2022-10-21 起元生物科技股份有限公司 用以診斷脊髓性肌肉萎縮症之套組及其用途
CN112435710A (zh) * 2020-10-16 2021-03-02 赛福解码(北京)基因科技有限公司 一种在wes数据中检测单样本smn基因拷贝数的方法
CN113192555A (zh) * 2021-04-21 2021-07-30 杭州博圣医学检验实验室有限公司 一种通过计算差异等位基因测序深度检测二代测序数据smn基因拷贝数的方法
CN113308538A (zh) * 2021-06-29 2021-08-27 广东博奥医学检验所有限公司 基于sanger测序的SMA检测方法
CN114480620A (zh) * 2022-01-18 2022-05-13 无锡中德美联生物技术有限公司 联合检测人smn1和smn2基因的试剂盒及其应用
CN114420204A (zh) * 2022-03-29 2022-04-29 北京贝瑞和康生物技术有限公司 用于预测待测基因的拷贝数的方法、计算设备和存储介质

Also Published As

Publication number Publication date
WO2017156290A9 (fr) 2017-11-09
WO2017156290A1 (fr) 2017-09-14

Similar Documents

Publication Publication Date Title
US20190066842A1 (en) A novel algorithm for smn1 and smn2 copy number analysis using coverage depth data from next generation sequencing
JP7081829B2 (en) Analysis of tumor DNA in cell-free samples
Lefterova et al. Next-generation molecular testing of newborn dried blood spots for cystic fibrosis
WO2018090991A1 (en) Universal haplotype-based noninvasive prenatal testing for single-gene diseases
AU2015271883B2 (en) Determining a nucleic acid sequence imbalance using fractional fetal concentration
AU2013202132C1 (en) Determining a nucleic acid sequence imbalance using multiple markers
LI et al. Determining a nucleic acid sequence imbalance

Legal Events

Date Code Title Description
AS Assignment

Owner name: BAYLOR MIRACA GENETICS LABORATORIES, LLC, TEXAS

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:FENG, YANMING;REEL/FRAME:046827/0446

Effective date: 20171129

STPP Information on status: patent application and granting procedure in general

Free format text: APPLICATION DISPATCHED FROM PREEXAM, NOT YET DOCKETED

AS Assignment

Owner name: BAYLOR COLLEGE OF MEDICINE, TEXAS

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:WONG, LEE-JUN C.;REEL/FRAME:051784/0784

Effective date: 20171017

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER

STPP Information on status: patent application and granting procedure in general

Free format text: FINAL REJECTION MAILED

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION