WO2018064547A1 - Methods for classifying somatic variations - Google Patents

Methods for classifying somatic variations

Info

Publication number
WO2018064547A1
Authority
WO
WIPO (PCT)
Prior art keywords
variants
tumor
somatic
normal
classification model
Prior art date
Application number
PCT/US2017/054445
Other languages
French (fr)
Inventor
Raul Rabadan
Alireza ROSHAN-GHIAS
Chioma J. MADUBATA
Jiguang WANG
Original Assignee
The Trustees Of Columbia University In The City Of New York
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority to US201662402137P priority Critical
Priority to US62/402,137 priority
Application filed by The Trustees Of Columbia University In The City Of New York filed Critical The Trustees Of Columbia University In The City Of New York
Publication of WO2018064547A1 publication Critical patent/WO2018064547A1/en

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis

Abstract

Techniques for classifying variants in DNA sequences obtained from tumor samples are provided. Methods include obtaining a DNA sequence from a tumor sample, identifying variants in the DNA sequence from the tumor sample, filtering the variants from the tumor sample to obtain a tumor only set of variants, and applying a classification model to the tumor only set to identify one or more somatic variants. The classification model can be created by obtaining from about 10 to about 50 matched tumor-normal DNA sequences, identifying variants in the matched tumor-normal DNA sequences, filtering the variants to obtain a tumor-normal set of variants, segregating the tumor-normal set of variants into a training set and a testing set, applying a gradient boosting technique to the training set to obtain a preliminary classification model, and applying the preliminary classification model to the testing set to obtain the classification model using machine-learning.

Description

METHODS FOR CLASSIFYING SOMATIC VARIATIONS
CROSS REFERENCE TO RELATED APPLICATIONS
[001] The present application claims priority to U.S. Provisional Application No. 62/402,137, filed September 30, 2016, the contents of which are hereby incorporated by reference in their entirety.
STATEMENT OF GOVERNMENT INTEREST
[002] This invention was made with government support under Grant Nos. R01 CA185486-01, R01 CA179044-01A1, U54 CA193313, UL1 TR000040, and 5T32 GM07367 awarded by the National Institutes of Health. The Government has certain rights in the invention.
BACKGROUND
[003] The identification of mutations responsible for tumor development in tumor cells forms the foundation for targeted drug development and genome-based personalized cancer treatment. Germline DNA alterations, seen in both normal and tumor DNA, can contribute to carcinogenesis, but many tumors result from somatic alterations unique to the tumor DNA. Somatic DNA variants can be identified by whole-exome sequencing (WES) of matched tumor and normal samples and comparison of the tumor and normal exomes. This approach can identify somatic variants for different types of cancers, with certain somatic variants affecting driver genes that are responsible for tumor development. Strategies for identifying driver genes include analysis for genes enriched in functionally important mutations, genes with clustered mutations, and genes significantly mutated above background.
[004] However, acquiring a matched normal sample for somatic analysis can be difficult. For example, assembling multiple living patients with rare cancers can be difficult and tissue banks can contain many historical frozen or paraffin-embedded tumor samples without normal counterparts. Additionally, when studying advanced metastatic cancers, acquiring normal tissue free of cancer cells can be difficult. Moreover, in certain comparative techniques, acquiring and sequencing both tumor and normal samples from a patient creates higher costs as compared to a single tumor sample.
[005] Without normal DNA for comparison, sequence analysis of a tumor exome can yield multiple types of variants including single nucleotide polymorphisms (SNPs), germline variants, and technical artifacts from PCR, high-throughput sequencing, mapping, and other processes. Only a percentage of variants will be true somatic mutations. While the number of protein-changing somatic mutations varies across cancer types (from an order of ten in certain pediatric tumors to hundreds in melanomas and lung cancers), the number of somatic mutations can be orders of magnitude smaller than the number of germline variants in an exome (tens of thousands). However, filtering non-somatic variants from tumor-only WES analysis can involve removal of common dbSNP mutations, low quality variants, variants frequently observed in germline sequences, or variants outside of known cancer-related genes in databases like the Catalogue Of Somatic Mutations In Cancer (COSMIC).
[006] Therefore, there remains a need for techniques for the analysis of tumor DNA when matched normal DNA has limited availability.
SUMMARY
[007] The presently disclosed subject matter provides techniques for classifying variants in DNA sequences obtained from tumor samples. In certain aspects, the present disclosure provides methods of classifying variants in DNA extracted from a tumor sample of a subject. Example methods can include obtaining a DNA sequence from a tumor sample, identifying variants in the DNA sequence from the tumor sample, filtering the variants from the tumor sample to obtain a tumor only set of variants, and applying a classification model to the tumor only set to identify one or more somatic variants.
[008] For example, and as embodied herein, the classification model can be created by obtaining from about 10 to about 50 matched tumor-normal DNA sequences, identifying variants in the matched tumor-normal DNA sequences, filtering the variants to obtain a tumor-normal set of variants, segregating the tumor-normal set of variants into a training set and a testing set, applying a gradient boosting technique to the training set to obtain a preliminary classification model, and applying the preliminary classification model to the testing set to obtain the classification model using machine-learning.
[009] In certain embodiments, variants in the matched tumor-normal DNA sequences and/or in the DNA sequence from the tumor sample can be identified using a variant calling process. The filtering of the variants can use one or more technical filters. For example, a technical filter can apply a numerical cutoff corresponding to variant quality. In certain embodiments, the numerical cutoff can correspond to at least one of variant depth, mapping quality, strand bias, map quality bias, and tail distance bias. Additionally or alternatively, the filtering of the variants can use one or more biological filters. For example, a biological filter can remove variants present in 1% or more of the population as represented by the 1000 Genomes Project populations. Additionally or alternatively, a biological filter can remove variants present in one or more existing databases and/or variants from intragenic non-coding exon regions and/or splice site regions within the DNA sequence. In certain embodiments, the gradient boosting technique can incorporate one or more tunable parameters selected from shrinkage, interaction depth, and number of decision trees. The method can further include determining a relative influence of one or more biological features of variants in the training set based on the gradient boosting technique.
[0010] In certain embodiments, the method can further include identifying one or more somatic-like germline variants from among the identified somatic variants. For example, the one or more somatic-like germline variants can be identified by comparison to a DNA sequence obtained from a normal sample corresponding to the tumor sample of the subject.
[0011] In certain embodiments, matched tumor-normal DNA sequences can be obtained from tumor and normal tissue samples from a subject. For example, the normal tissue sample can be obtained prior to tumor onset and/or from a region of the subject separate from a region containing a tumor. In certain embodiments, the tumor sample can be obtained from a subject for which there is no matched normal tissue sample.
[0012] As embodied herein, the one or more somatic variants can be present on one or more driver genes. For example and not limitation, the driver gene can be selected from the group consisting of ACVR1, ATRX, ARID1A, BCOR, BRAF, CTNND2, DDIT3, EGFR, FAT2, FGFR3, GPR116, HERC2, IDH1, KIT, LRP1B, LRP2, LZTR1, MYCN, NF1, NOS1, NRAS, PCNX, PDGFRA, PIK3CA, PIK3R1, PKHD1, PPM1D, PTEN, RB1, TEK, TP53, and TSHZ2.
[0013] In certain embodiments, the one or more somatic variants are present on BRCA2. In such embodiments, the method can further include treating the subject with an effective amount of a PARP inhibitor. In certain other embodiments, the one or more somatic variants are present on BRCA1, FANCC, ATM, or RB1. In such embodiments, the one or more somatic variants can be nonsense variants and the method can further include treating the subject with an effective amount of cisplatin.
[0014] The presently disclosed methods can correctly identify more than 85% of somatic variants within the DNA sequence from the tumor sample.
[0015] In certain other aspects, the present disclosure provides systems for classifying variants in DNA extracted from a tumor sample of a subject. An example system can include a processor configured to create a classification model by obtaining from about 10 to about 50 matched tumor-normal DNA sequences. The processor can also be configured to identify variants in the matched tumor-normal DNA sequences, filter the variants to obtain a tumor-normal set of variants, segregate the tumor-normal set of variants into a training set and a testing set, apply a gradient boosting technique to the training set to obtain a preliminary classification model, and apply the preliminary classification model to the testing set to obtain the classification model using machine-learning. The processor can be further configured to obtain a DNA sequence from the tumor sample, identify variants in the DNA sequence from the tumor sample, filter the variants from the tumor sample to obtain a tumor only set of variants, and apply the classification model to the tumor only set to identify one or more somatic variants.
BRIEF DESCRIPTION OF THE DRAWINGS
[0016] FIG. 1 provides a schematic method for predicting somatic variants with TOBI, in accordance with an embodiment of the disclosed subject matter and as further detailed in Example 1.
[0017] FIG. 2 provides the average F-score for increasing numbers of cases in the training set in seven cancer types, as described in Example 1. TOBI.bam indicates samples were analyzed from aligned sequence files (.bam) using TOBI Steps I-III; TOBI.vcf indicates samples were analyzed from variant call files (.vcf) using TOBI Steps II-III.
[0018] FIG. 3 shows the F1 score versus probability cutoff, maximum tree depth, and number of trees (100, 150, 200) in the cross-validation process of Example 1, performed on glioblastoma (GBM) samples.
[0019] FIG. 4 provides the ratio of somatic and non-somatic variants in GBM samples.
[0020] FIG. 5 shows the cutoff value to maximize the F1 score, with the same point shown on the ROC and Precision-Recall curves.
[0021] FIG. 6 shows the relative influence (0 to 100%) of features in the gradient boosting classification model, which was generated from a training set with twenty cases in each of the 7 individual cancers.
[0022] FIG. 7 illustrates the importance of each category of biological features, as assessed when the features were removed from the prediction model.
[0023] FIG. 8 provides the predicted somatic mutations in normal and tumor GBM cases, in which the dashed line is the average number of false positives in tumors.
[0024] FIG. 9 provides a comparison of actual versus predicted cases with somatic, nonsynonymous variants in six adult cancer types and pediatric glioma.
[0025] FIG. 10 shows the relative reasons why certain true somatic mutations were filtered out in GBM; 78% of the filtered somatic mutations were called with low confidence (Low QUAL).
[0026] FIG. 11 illustrates the fate of mutations in GBM driver genes before and after analysis with TOBI. TP is true positive, FP is false positive, and FN is false negative in the classification step.
[0027] FIG. 12 illustrates TOBI prediction in pediatric glioma. The top-left panel provides a comparison of actual versus predicted cases with somatic, nonsynonymous variants. The top-right panel shows the percentages of true positive (TP) or false negative (FN) TOBI somatic predictions in nonsynonymous variants across all genes or only driver genes. The bottom-right panel shows the number of cases with predicted somatic variants when pediatric glioma classification model is applied to 68 tumor-only samples; the genes predicted in at least 3 cases are shown.
[0028] FIG. 13 provides, for each indicated cancer type, percentages of true positive (TP) or false negative (FN) TOBI somatic predictions in nonsynonymous variants across all genes or only driver genes (top panel) and ROC curves comparing somatic variant prediction (synonymous and nonsynonymous) based on TOBI, CADD score, Mutation Assessor, SIFT and MutationTaster.
[0029] FIG. 14 illustrates the number of variants predicted as somatic by TOBI, including variants not reported as somatic in published analysis of five adult cancer types and pediatric glioma (FP variants) and variants associated with autosomal dominant (AD) cancer-predisposition syndromes.
[0030] FIG. 15 provides the distribution of patient cases with FP variants in AD genes.
[0031] FIG. 16 shows FP variants in TP53 domains, in which the height of the line represents allele frequency, with normal frequency in the shaded range and tumor frequency in the black range. Circles indicate patients for which the normal frequency of the variant is greater than or equal to 0.3; diamonds indicate normal frequency less than 0.3.). indicates P71L and P72A occurred in the same LUAD patient. "R273C (2)" indicates two patients with LGG had this variant. Colored "+·' or "Λ'" indicate individual patient allele frequencies.
[0032] FIG. 17 provides a comparison of performance metrics in 9 FFPE tumor cases and 161 frozen cases from the LUAD TCGA cohort.
[0033] FIG. 18 illustrates SLG variants in low-grade glioma associated with earlier age of diagnosis. The left panel shows the distribution of diagnosis age in 492 LGG test set cases with or without nonsynonymous SLG variants in 565 cancer genes. For the violin plots, the width of the shape indicates density. In the overlaid boxplots, the horizontal center line indicates the median (37 years vs. 41 years), the upper and lower box edges correspond to the 25th and 75th percentiles, and the upper and lower whiskers extend from the closest box edge to the highest or lowest value within 1.5x the interquartile range, respectively. The right panel shows cancer genes with recurrent nonsynonymous SLG in LGG.
[0034] FIG. 19 shows the percentage of test set cases with TOBI-somatic nonsense mutations; "Germline" indicates variant allele frequency (VAF) >= 30% in normal; "TP", or true positives, were previously reported as somatic and have VAF < 30% in normal. Total number of test cases: 100 BLCA, 317 SKCM, 165 LUAD, and 199 STAD.
[0035] FIG. 20 provides TOBI-somatic nonsense variants in BRCA2 and FANCM; diamond and dashed line indicate TP variant; solid line and circle are germline; grey arrows go from VAF in normal to tumor.
[0036] FIG. 21 illustrates FA pathways with number of altered cases in bladder cancer shown for each component.
[0037] FIG. 22 shows enrichment for signature 4 in BLCA FA nonsense mutant vs. wildtype samples, p-value calculated with rank sum test. Mu = mutant, WT = wildtype.
[0038] FIG. 23 provides TOBI performance when training and testing set are stratified by patients' self-reported race.
[0039] FIG. 24 illustrates TOBI accuracy and F-score when training and testing set are stratified by institution. Each box corresponds to one cancer type. Y-axis shows F-score, x-axis shows reported race of training set used to generate model (20 randomly selected patients) above institution of test set; number of cases in the test set shown within plot area. An institution required greater than 20 patients for inclusion as a training set, and a minimum of 5 patients for inclusion as a test set. Points represent performance metric for five runs with randomly selected training and testing sets from specified race; error bars represent mean +/- s.e.m.
[0040] FIG. 25 shows inclusion of 61 STAD cases with hypermutation phenotype does not significantly alter TOBI performance.
[0041] FIG. 26 illustrates F-score of variants with VAF 0-100% binned by 5%.
[0042] FIG. 27 illustrates F-score of variants with VAF 0-20% binned by 1%.
[0043] FIG. 28 shows false positive rate (FPR) in seven cancers compared to false positive rate from 1000 Genomes samples.
[0044] FIG. 29 provides age distribution for cases with or without SLG in autosomal dominant cancer-predisposition syndromes (AD) gene sets.
[0045] FIG. 30 shows age distribution for cases with or without SLG in 565 cancer gene sets.
[0046] FIG. 31 illustrates age distribution for cases with or without SLG in Fanconi anemia (FA) pathway gene sets.
[0047] FIG. 32 shows that selection of k = four somatic signatures for BLCA maximizes stability and minimizes error.
[0048] FIG. 33 provides somatic signatures from TCGA BLCA cohort.
[0049] FIG. 34 illustrates somatic SNV per megabase (Mb) for each cancer type.
[0050] FIG. 35 shows scatterplots of median somatic SNV per Mb versus true positive rate of nonsynonymous variants. Each point is a cancer type. Left panel uses true positive rate from all genes, right panel for driver genes only.
DETAILED DESCRIPTION
[0051] The presently disclosed subject matter relates to techniques for classifying variants in DNA sequences obtained from tumor samples. The variants can be classified as somatic or non-somatic to inform the identification of mutations and genes responsible for tumor growth and tailor treatment accordingly. Thus, the present disclosure provides a design and computational framework for somatic analysis. The presently disclosed techniques use machine-learning to generate a classification model for classifying somatic variants in tumor-only samples from a small training set of matched tumor-normal pairs.
[0052] As used herein, the term "about" or "approximately" means within an acceptable error range for the particular value as determined by one of ordinary skill in the art, which will depend in part on how the value is measured or determined, i.e., the limitations of the measurement system. For example, "about" can mean within 3 or more than 3 standard deviations, per the practice in the art.
[0053] As used herein, the phrase "tumor-normal" or "matched tumor-normal" refers to samples or DNA sequences that have been obtained from both a tumor and a normal sample from the same subject. For example, the tumor and normal samples can be obtained at different points of time (e.g., before and after onset of cancer) or from different regions of the body (e.g., tumor-containing and tumor-free regions). In contrast, the phrase "tumor only" refers to samples or DNA sequences that have been obtained from a tumor sample only, and for which there is no corresponding normal tissue sample.
[0054] As used herein, "treatment" or "treating" refers to inhibiting the progression of a disease or disorder, or delaying the onset of a disease or disorder, whether physically, e.g., stabilization of a discernible symptom, physiologically, e.g., stabilization of a physical parameter, or both. As used herein, the terms "treatment," "treating," and the like, refer to obtaining a desired pharmacologic and/or physiologic effect. The effect can be prophylactic in terms of completely or partially preventing a disease or condition, or a symptom thereof and/or can be therapeutic in terms of a partial or complete cure for a disease or disorder and/or adverse effect attributable to the disease or disorder. "Treatment," as used herein, covers any treatment of a disease or disorder in an animal or mammal, such as a human, and includes: decreasing the risk of death due to the disease; preventing the disease or disorder from occurring in a subject which can be predisposed to the disease but has not yet been diagnosed as having it; inhibiting the disease or disorder, i.e., arresting its development (e.g., reducing the rate of disease progression); and relieving the disease, i.e., causing regression of the disease.
[0055] As used herein, the term "subject" includes any human or nonhuman animal. The term "nonhuman animal" includes, but is not limited to, all vertebrates, e.g., mammals and non-mammals, such as nonhuman primates, dogs, cats, sheep, horses, cows, chickens, amphibians, reptiles, etc. In certain embodiments, the subject is a pediatric patient. In certain embodiments, the subject is an adult patient.
[0056] As used herein, an "effective amount" refers to an amount of the compound sufficient to treat, prevent, or manage the disease, e.g., a cancer. An effective amount can refer to the amount of a compound that provides a therapeutic benefit in the treatment or management of the disease, and as such, an "effective amount" depends upon the context in which it is being applied. In the context of administering a composition to treat and/or to reduce the severity of cancer in a subject, an effective amount of a composition described herein is an amount sufficient to treat and/or ameliorate tumor cell growth, as well as decrease the severity and/or reduce the likelihood of tumor cell growth. The decrease can be a 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 95%, 98%, or 99% decrease in severity of tumor cell growth, or likelihood of developing cancer. An effective amount can be administered in one or more administrations. Animal models accepted in the art as models of disease (e.g., cancer) can be used to test particular compounds, routes of administration etc., to determine appropriate amounts of therapeutic treatments of the disclosure.
[0057] In certain aspects, the present disclosure provides methods of classifying variants in a DNA sequence, e.g., as somatic or non-somatic. As embodied herein, a method of classifying variants in a DNA sequence can include creating a classification model. The classification model can be built using a small number of matched tumor-normal DNA sequences, for example, from about 10 to about 50 matched tumor-normal DNA sequences, from about 10 to about 25 matched tumor-normal DNA sequences, or from about 10 to about 20 matched tumor-normal DNA sequences. As embodied herein, increasing the number of matched tumor-normal DNA sequences can improve the accuracy of the method, although acceptable accuracy can be obtained with at least 10 matched tumor-normal DNA sequences. The model can classify variants, e.g., as somatic or germline, from tumor only samples that do not have a matched normal DNA sequence.
[0058] For example and not limitation, the tumor DNA samples in the matched tumor-normal sets can be obtained from multiple tumors from the same type of cancer in order to tailor the classification model to a specific cancer type. As embodied herein, the cancer can be any cancer type that is driven or otherwise modulated by somatic mutations. In certain embodiments, the cancer can be bladder urothelial carcinoma, glioblastoma, low grade glioma, lung adenocarcinoma, melanoma, stomach adenocarcinoma, or pediatric glioma. The DNA samples can be sequenced using any suitable means as known in the art, including whole exome sequencing (WES) or whole genome sequencing (WGS).
[0059] As embodied herein, methods of creating the classification model can include at least three steps. The first step can include identifying variants in the DNA sequence. The variants can be identified using a variant calling process. The variant calling process can include a quality threshold, for example based on mapping quality, below which variants will not be considered as such and will not be identified as variants. In certain embodiments, programs such as Samtools and/or Bcftools can be used during the variant calling process.
[0060] In certain embodiments, the identified variants can be annotated. The annotations can be based on known characteristics of the variants, including the variant quality, predicted functional impact, and existence in known databases (e.g., COSMIC). The annotations can be derived from one or more existing or custom databases. For example, existing databases include, but are not limited to, SnpEff and SnpSift with dbSNP build 144, Cosmic v74, and dbNSFP v2.4 databases. Custom databases can include common mutations found from normal, non-tumor DNA sequences.
[0061] Variant calling can identify numerous, e.g., millions or more, variants in the DNA sequence. Accordingly, it can be desirable to perform a filtering step to exclude false variants (such as artifacts or noise generated during the sequencing or variant calling steps) or variants that are likely to be germline variants. The filtering step can employ one or more filters with technical and/or biological cutoffs. These cutoffs can be based on analysis of historical or exemplary DNA sequences related to the cancer being studied. The technical cutoff can be based on a numerical evaluation of quality, e.g., variant depth, mapping quality, strand bias, map quality bias, or tail distance bias. In contrast, the biological cutoff can remove variants that are commonly found in the baseline population, indicating that such variants are not cancer driving. For example, the biological filter can remove variants present in 1% or more of the population (e.g., as represented by the 1000 Genomes Project populations). Additional biological filters can remove variants present in existing databases, such as the dbSNP database, or variants in particular DNA regions such as intragenic non-coding exon and splice site regions.
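The filtering step described above can be sketched as follows. This is a minimal illustration only: the `Variant` fields, the function names, and the depth and mapping-quality cutoffs are hypothetical and are not taken from the disclosure; only the 1% population-frequency cutoff and the removal of database variants come from the text.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    gene: str
    depth: int              # reads supporting the variant
    mapping_quality: float  # mapping quality of the supporting reads
    population_af: float    # allele frequency in a reference population (0.0 if unseen)
    in_dbsnp: bool          # True if the variant is listed in a germline database


# Hypothetical technical cutoffs, chosen for illustration only.
MIN_DEPTH = 10
MIN_MAPPING_QUALITY = 30.0
# Biological cutoff from the text: drop variants present in 1% or more of the population.
MAX_POPULATION_AF = 0.01


def passes_technical(v: Variant) -> bool:
    """Technical filter: numerical cutoffs on variant quality."""
    return v.depth >= MIN_DEPTH and v.mapping_quality >= MIN_MAPPING_QUALITY


def passes_biological(v: Variant) -> bool:
    """Biological filter: remove common population variants and database variants."""
    return v.population_af < MAX_POPULATION_AF and not v.in_dbsnp


def filter_variants(variants):
    """Keep only variants that pass both the technical and biological filters."""
    return [v for v in variants if passes_technical(v) and passes_biological(v)]
```

In practice the cutoffs would be chosen from exemplary sequences for the cancer being studied, as the paragraph above notes.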
[0062] After variants are identified and filtered from the tumor-normal set, this tumor-normal set of variants can be segregated into a training set and a testing set. For example, a portion of variants corresponding to a first set of subjects can form the training set and another portion of variants corresponding to a second set of subjects can form the testing set. Each of the training set and testing set can have a predetermined size, for example, based on the total number of matched tumor-normal DNA sequences available. In certain embodiments, the subjects having matched tumor-normal DNA sequences can be divided evenly between the training and testing sets.
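One way to segregate variants by subject, so that no subject contributes to both sets, is sketched below. The function names, the random shuffling, and the `(subject_id, variant)` pair layout are assumptions made for illustration; the disclosure does not specify how subjects are assigned.

```python
import random


def split_subjects(subject_ids, train_fraction=0.5, seed=0):
    """Divide subjects between training and testing sets (evenly by default)."""
    ids = sorted(set(subject_ids))
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(ids)
    n_train = int(len(ids) * train_fraction)
    return set(ids[:n_train]), set(ids[n_train:])


def split_variants(variants, train_ids):
    """variants: list of (subject_id, variant) pairs; split by subject membership."""
    train = [pair for pair in variants if pair[0] in train_ids]
    test = [pair for pair in variants if pair[0] not in train_ids]
    return train, test
```

Splitting at the subject level rather than the variant level keeps all variants from one patient on the same side of the split.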
[0063] Once the training set is established, a preliminary classification model can be generated by applying a gradient boosting technique or algorithm to the training set. The gradient boosting technique or algorithm can incorporate one or more biological features of the training set to generate a decision tree for classifying variants. As embodied herein, the gradient-boosting algorithm can have excellent performance on diverse binary classification problems compared to other supervised learning methods (e.g., Caruana & Niculescu-Mizil, 2006). In certain embodiments, the gradient-boosting algorithm generates the classification model using an ensemble of decision trees that iteratively learn from previously misclassified training set observations. For example, there can be three tunable parameters for gradient boosting performance: shrinkage, interaction depth, and the number of trees. These parameters can be optimized to improve classification performance. Gradient boosting returns a probability that a variant is somatic, which can be converted into a binary decision using an optimized probability threshold. In certain embodiments, the optimized probability threshold is not 0.5. For example, the method can include selecting an optimized probability threshold that maximizes classification performance.
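The conversion of predicted probabilities into a binary decision can be sketched as a search over candidate thresholds. This example assumes that classification performance is measured by the F1 score (the metric named in the figure descriptions) and that candidate thresholds lie on a 0.01 grid; the function names and the grid are illustrative, not part of the disclosure.

```python
def f1_score(labels, predictions):
    """F1 = harmonic mean of precision and recall for boolean labels and predictions."""
    tp = sum(1 for y, p in zip(labels, predictions) if y and p)
    fp = sum(1 for y, p in zip(labels, predictions) if not y and p)
    fn = sum(1 for y, p in zip(labels, predictions) if y and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(labels, probabilities, candidates=None):
    """Return (threshold, F1) for the candidate threshold with the highest F1.

    labels: True for known somatic variants, False otherwise.
    probabilities: model-predicted probability that each variant is somatic.
    """
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]  # assumed 0.01 grid
    best_t, best_f1 = None, -1.0
    for t in candidates:
        predictions = [p >= t for p in probabilities]
        score = f1_score(labels, predictions)
        if score > best_f1:  # ties keep the lowest threshold
            best_t, best_f1 = t, score
    return best_t, best_f1
```

Because somatic variants are typically far rarer than non-somatic ones after filtering, the threshold found this way will often differ from 0.5.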
[0064] The preliminary classification model can be validated and improved using machine-learning techniques on the testing set. Thus, the preliminary classification model can be applied to the testing set to obtain a classification model suitable for classifying variants from tumor only samples. The preliminary classification model can be used to obtain preliminary classifications for variants from the testing set, which can be compared to the known classifications obtained by comparing matched tumor and normal samples within the testing set. The preliminary classification model can be tuned using machine-learning to take into account one or more features of the variants. Such features can include, but are not limited to, the number of variants per gene (e.g., the total number of variants per gene normalized by the number of patients in the cohort, calculated separately for training and testing datasets), the number of cases of the variant in an existing database such as COSMIC (e.g., the total number of samples in COSMIC with a specific nucleotide variant), the allele frequency (e.g., the variant allele frequency (VAF) in the tumor sample), the Combined Annotation-Dependent Depletion (CADD) score (e.g., a score of variant deleteriousness integrated from multiple genome annotations (see Kircher et al., 2014)), the total number of mutations in a gene (e.g., based on an existing database, such as COSMIC), the length of the protein in amino acids, the probability of a mutation to be a germline mutation (e.g., the VAF score with VAF~50%, calculated using a binomial distribution based on variant depth and total depth), the mutability (e.g., whether a gene is prone to mutation in a normal, non-tumor cohort), the variants per case for a particular gene and a particular sample (e.g., the number of variants in that sample divided by the number of patients in the cohort, such as the training set or testing set), and/or the variant impact (e.g., the predicted effect impact from SnpEff (see Cingolani et al., 2012)). The relative influence of the features can be determined based on the gradient boosting, e.g., as a function of how many times a feature is selected for splitting within the decision trees of the gradient boosting algorithm.
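Two of the quantities above lend themselves to a short sketch. The first function is one possible reading of the binomial germline feature: a two-sided binomial tail probability of the observed variant depth under a heterozygous model with expected VAF of 50%. The exact definition used in the disclosure is not given, so this formulation is an assumption. The second function normalizes per-feature split counts to percentages; gradient boosting libraries may weight splits differently, so it is likewise only illustrative.

```python
from math import comb


def germline_score(variant_depth, total_depth, expected_vaf=0.5):
    """Two-sided binomial tail probability that the observed variant depth is
    consistent with a heterozygous germline variant (assumed formulation)."""
    n, k = total_depth, variant_depth
    pmf = [
        comb(n, i) * expected_vaf ** i * (1 - expected_vaf) ** (n - i)
        for i in range(n + 1)
    ]
    observed = pmf[k]
    # Sum the probabilities of all outcomes at least as unlikely as the observed one.
    tail = sum(p for p in pmf if p <= observed * (1 + 1e-9))
    return min(1.0, tail)


def relative_influence(split_counts):
    """Normalize per-feature split counts to percentages summing to 100."""
    total = sum(split_counts.values())
    if total == 0:
        return {name: 0.0 for name in split_counts}
    return {name: 100.0 * count / total for name, count in split_counts.items()}
```

A variant with 10 supporting reads out of 20 scores close to 1 under this reading, while a variant with 2 supporting reads out of 20 scores close to 0, which is the direction a germline-likeness feature would need.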
[0065] Once the classification model has been established, it can be used to classify variants in tumor only samples, i.e., samples for which there is no matched normal sample. Thus, once established, the classification model can be used to evaluate samples from new subjects, including subjects for which no historical normal tissue is available and/or who exhibit metastasized cancers. A DNA sequence can be acquired from a tumor sample in the subject. Additionally or alternatively, a DNA sequence can be acquired from a historical tumor sample, including a frozen and/or formalin-fixed and paraffin-embedded (FFPE) sample. Variants in the DNA sequence can be identified and filtered using the variant calling and filtering techniques described above. The variant calling and filtering techniques can be the same as used to create the classification model, or different variant calling algorithms or filters can be applied to target specific variants of interest.
[0066] Identifying and filtering the variants from the tumor sample can be used to obtain a tumor only set of variants, and the classification model can be applied to this tumor only set of variants. For example, the classification model can be used to classify variants as somatic or non-somatic.
[0067] In certain embodiments, the method can further include identifying one or more somatic-like germline variants from the identified somatic variants. Such somatic-like germline variants can be identified by the classification model from the tumor-only DNA. Thus, the results of the classification model can be compared to sequencing data for matched normal samples (e.g., WES or WGS data for normal samples corresponding to tested tumor sample(s)) to determine if classified somatic variants are somatic or somatic-like germline.
[0068] As embodied herein, the classification model can further include additional features to improve the precision of variant classifications. For example, various filters can be applied to select against variants on genes that are known to have low tumorigenicity, such as TTN and MUC16. For example, in certain embodiments, the length of a gene could be a confounder for tumorigenicity, as longer genes are expected to have more variants and genes that are not expressed are not relevant. Alternatively or additionally, expression data can be used to filter for tumorigenicity. Thus, the classification model can prioritize identification and classification of variants on driver genes, that is, genes that are known to drive tumor growth. Such driver genes include, but are not limited to, ACVR1, ATRX, ARID1A, BCOR, BRAF, BRCA2, CTNND2, DDIT3, EGFR, FAT2, FGFR3, GPR136, H3F3A, HERC2, HIST1H3B, IDH1, KIT, LRP1B, LRP2, LZTR1, MYCN, NF1, NOS1, NRAS, PCNX, PDGFRA, PIK3CA, PIK3R1, PKHD1, PPM1D, PTEN, RB1, TEK, TP53, and TSHZ2. The driver genes targeted by a certain classification model can be specific to the cancer(s) used to generate the classification model. Additionally, the classification model can identify variants located on somatic hotspots, e.g., BRAF V600E and IDH1 R123H.
[0069] Once a variant on a driver gene has been identified and classified as somatic using the classification model, treatment can be tailored accordingly. For example, the subject can be treated with an inhibitor that is known to be effective against the specific driver gene. In certain embodiments, the somatic variant-containing gene can be BRCA2, which is known to be sensitive to treatment with PARP inhibitors. Accordingly, a subject having a somatic variant on this gene can be treated with an effective amount of a PARP inhibitor to slow or stop tumor growth. Additionally, in certain embodiments, the somatic variant-containing gene can be BRCA1, FANCC, ATM, or RB1 and the variant can be a nonsense variant. Such nonsense variants can predict a beneficial response to cisplatin, including cisplatin neoadjuvant chemotherapy. Accordingly, a subject having a nonsense somatic variant on one or more of these genes can be treated with an effective amount of cisplatin.
[0070] The present disclosure further includes systems for carrying out the disclosed methods. Such systems can include a processor for obtaining the DNA sequencing data and generating and using the classification model. The processor can be configured to carry out the instructions specified by software stored in a hard drive, a removable storage medium, or any other storage media. The software can include computer code, which can be written in a variety of languages, e.g., R, Matlab, and/or Microsoft Visual C++. Additionally or alternatively, the processor can include hardware logic, such as logic implemented in an application-specific integrated circuit (ASIC).
[0071] The presently disclosed techniques and classification models can identify somatic variants with a high degree of precision. For example, in certain embodiments, the method can correctly identify more than 75%, more than 80%, or more than 85% of somatic variants within a particular DNA sequence.
EXAMPLES
[0072] The present disclosure is further illustrated by the following Examples which should not be construed as further limiting.
EXAMPLE 1: TOBI Framework for Classifying Somatic Variations
[0073] This Example describes the Tumor-Only Boosting Identification (TOBI), a computational framework for identifying somatic variants in mostly unmatched tumor samples. Using tumor-only data from 1,769 patients from seven cancer types (bladder urothelial carcinoma, glioblastoma, low grade glioma, lung adenocarcinoma, melanoma, stomach adenocarcinoma, and pediatric glioma) and a training data set with as few as five, but optimally 20, tumor-normal pairs, this Example demonstrates that TOBI classifies 71.1% of somatic variants on average per cancer, and 86.6% across all cancers. TOBI also classified certain germline variants as "somatic-like", with a significant enrichment for autosomal-dominant (AD) cancer predisposition genes (90 variants in 60 AD genes out of 12,142 variants in 17,150 genes, p < 1.53e-10), including known germline cancer-predisposition variants in RB1, RET, and TP53.
[0074] Framework for predicting somatic, germline and "somatic-like" germline
[0075] The TOBI framework consists of four main steps (see FIG. 1) and accepts a tumor-normal training set and a tumor-only testing set. The four steps of TOBI analysis are: (I) variant calling and annotation, (II) filtering, (III) machine learning, and (IV) identification of somatic-like germline variants.
[0076] Step I accepts tumor WES data and performs variant calling and annotation. Annotations include variant quality, predicted functional impact, and presence in COSMIC and other databases. Due to the lack of normal counterparts, Step I generates millions of variants, the majority of which are germline variants or artifacts.
[0077] WES files (.bam files) were downloaded for 104 randomly selected tumor-normal GBM cases from TCGA (accession number phs000178.v9.p8). Samples were processed with Samtools and Bcftools (Li, 2011) to call variants, excluding variants with mapping quality lower than 10.
[0078] Pediatric glioma WES bam files (accession numbers EGAD00001000807 (Wu et al., 2014), EGAD00001000706 (Taylor et al., 2014)) underwent variant calling and annotation as described for GBM above; fastq files (EGAD00001000792 (Fontebasso et al., 2014), EGAD00001000791 (Schwartzentruber et al., 2012)) were mapped to GRCh37.71 using BWA 0.7.12 prior to variant calling. Published somatic variant calls were used to label true somatic variants for the 74 paired samples; only experimentally validated somatic mutations in Wu et al. were included.
[0079] As shown in FIG. 2, all GBM and pediatric glioma bam files went through the TOBI.bam pathway. For five TCGA cancers (BLCA, LGG, LUAD, SKCM, STAD), Protected Mutation vcf files with somatic and germline variants (accession number phs000178.v9.p8) were downloaded for entry into the TOBI.vcf pathway. All TCGA Data Matrix cases with Broad Institute-generated Protected Mutation vcf files between July 28, 2015 and September 1, 2015, were downloaded and analyzed as well as 226 additional LGG cases downloaded between September 1, 2016 and September 4, 2016. For STAD, 282 cases had available vcf files; 63 cases classified as "hyper-mutated" in TCGA clinical data were excluded from the main analysis. For all TCGA cancers, clinical data was retrieved from cBioPortal and publication MAFs from the TCGA Data Matrix provided true somatic variant calls. For 1000 Genomes Project samples, phase 3 bam files were downloaded from the public FTP site for the first 99 "mapped" samples listed in ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/alignment_indices/20130502.exome.alignment.index, as well as sample NA11994, which was previously reported to have a germline variant in TP53 (R273H).
[0080] Bam files were analyzed with Samtools and Bcftools to call variants, excluding variants with mapping quality lower than 10. Variants were annotated using SnpEff and SnpSift with dbSNP build 144, Cosmic v74, and dbNSFP v2.4 databases. Variants were also annotated with an in-house database of common mutations in 219 normal WES cases ("Meganormal" database). These variant calls (e.g., vcf files) were the input for Step II, allowing users to jump to Step II if they have previously annotated variants from tumor-only samples.
[0081] Step II filters these variants using biological and technical cutoffs, which increase sensitivity in machine-learning. Filter thresholds were selected based on preliminary analysis of GBM samples. Two main filters were applied to the variants: (1) a technical filter and (2) a biological filter. The technical filter retained all variants with either a quality score from Bcftools greater than 60 or variant depth higher than 10 on both strands. These filters retained a high fraction of true somatic mutations in known driver genes (e.g., EGFR, which had good depth but a QUAL score <=60), while removing many low quality variants. Variants having low mapping quality (mq < 40), and that also had strand bias, map quality bias, and tail distance bias with p-values below 0.01, were also removed by the technical filter. The biological filter removed SNPs that were present in 1% or more of 1000 Genomes Project populations, as well as variants that were present in the Meganormal database. SNPs that were in the dbSNP database, but were not in COSMIC, were also removed. Variants in intragenic, non-coding exon, and splice-site regions were also filtered. These filters were applied to GBM and pediatric glioma variants for further analysis.
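By way of non-limiting illustration, the technical and biological filters described above can be sketched in Python as follows. The `Variant` record and its field names are hypothetical and are not part of TOBI. The sketch assumes one reading of the bias rule, namely that a variant is dropped only when low mapping quality coincides with all three bias p-values below 0.01. The region-based filter is omitted.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    # Hypothetical record; field names are illustrative only.
    qual: float            # Bcftools QUAL score
    fwd_alt_depth: int     # variant reads on the forward strand
    rev_alt_depth: int     # variant reads on the reverse strand
    map_qual: float        # mapping quality (mq)
    strand_bias_p: float   # p-value for strand bias
    mapq_bias_p: float     # p-value for map quality bias
    tail_bias_p: float     # p-value for tail distance bias
    max_pop_freq: float    # highest 1000 Genomes population frequency
    in_meganormal: bool    # present in the in-house normal database
    in_dbsnp: bool
    in_cosmic: bool


def passes_technical(v: Variant) -> bool:
    # Retain if QUAL > 60 OR variant depth > 10 on both strands.
    well_supported = v.qual > 60 or (v.fwd_alt_depth > 10 and v.rev_alt_depth > 10)
    # Assumed reading: remove only when low mq coincides with all three biases.
    biased = (v.map_qual < 40 and v.strand_bias_p < 0.01
              and v.mapq_bias_p < 0.01 and v.tail_bias_p < 0.01)
    return well_supported and not biased


def passes_biological(v: Variant) -> bool:
    if v.max_pop_freq >= 0.01:          # common in a 1000 Genomes population
        return False
    if v.in_meganormal:                 # seen in the normal cohort
        return False
    if v.in_dbsnp and not v.in_cosmic:  # known SNP without COSMIC support
        return False
    return True


def keep_variant(v: Variant) -> bool:
    return passes_technical(v) and passes_biological(v)
```

Under these assumptions, a variant with QUAL of 55 but more than 10 variant reads on each strand is retained, consistent with the EGFR example given above.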
[0082] The TCGA variants in the TOBI.vcf pathway did not have reported per strand depth, mapping quality, and technical biases; thus, a modified technical filter was used to remove variants with total depth <10 and QUAL score <=60. The biological filters were the same across all samples.
[0083] Step III generates and applies the classification model using machine-learning. In order to generate a machine-learning model using tumor-normal samples (paired samples), Step III first randomly divided variants from all patients into training and testing sets of predetermined sizes. Each set separately received cohort-specific features; these features and selected annotations from Step I create the feature space for machine-learning, as described in greater detail below. Variants were also labeled as either non-somatic or somatic based on prior analysis of tumor-normal pairs. Next, machine-learning using a gradient boosting algorithm (Friedman, 2002) trains a classification model on the training set variants. Step III ends by applying the final somatic classification model to the testing set.
[0084] The machine-learning step was performed using Caret and gbm packages in R (Kuhn et al., 2013). A gradient-boosting algorithm was used for machine learning given its efficient performance on diverse binary classification problems as compared to other supervised learning methods. This algorithm generated a classification model using an ensemble of decision trees that iteratively learn from the previously misclassified training set observations (Friedman et al., 2001). Gradient boosting can return a probability that a variant is somatic, which TOBI converts into a binary decision using an optimized probability threshold. TOBI did not use the default threshold probability of 0.5 because that would favor the majority class (in this case, non-somatic mutations), resulting in low sensitivity. Instead, TOBI selects a probability threshold that maximizes classification performance; the threshold's potential range is 0.05 to 0.95 in increments of 0.0375.
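As a non-limiting sketch, the threshold search described above can be illustrated in Python. The function names are hypothetical, and the sketch assumes that classification performance is measured by the F1 score and that ties are broken in favor of the lowest threshold. The range 0.05 to 0.95 in steps of 0.0375 yields 25 candidate thresholds.

```python
def candidate_thresholds(lo=0.05, hi=0.95, step=0.0375):
    # 0.05, 0.0875, ..., 0.95 (25 values); rounding hides float noise.
    n_steps = round((hi - lo) / step)
    return [round(lo + i * step, 4) for i in range(n_steps + 1)]


def f1_at(threshold, probs, labels):
    # labels: 1 = somatic, 0 = non-somatic; probs: predicted P(somatic).
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(probs, labels):
    # max() returns the first maximum, i.e. the lowest tied threshold.
    return max(candidate_thresholds(), key=lambda t: f1_at(t, probs, labels))
```

With an imbalanced label set, the selected cutoff can fall well below 0.5, which is the behavior the optimized threshold is intended to permit.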
[0085] For each cancer, TOBI generated an optimum classification model by running a systematic grid search through gradient boosting's three parameters: shrinkage (e.g., constant at 0.1), interaction depth (e.g., about 3-7 splits), and number of trees (e.g., 100, 150, 200, etc.). These parameters were determined by five repeats of 5-fold cross-validation to avoid over-fitting (see FIG. 3). The ratio of somatic and non-somatic variants was determined as a function of the mutation allele frequency (MAF) (see FIG. 4). As expected, there are many more non-somatic variants with MAF around 50%, suggesting heterozygous germline variants; these mutations have a small Somatic Score. Since there is a class imbalance with many more non-somatic variants than true somatic variants (see FIG. 4), the probability cutoff was optimized for class association so that the model would not favor the majority class (non-somatic mutations), which would decrease sensitivity. Given the class imbalance between somatic and non-somatic variants, the cutoff value was optimized such that the F1 score is maximized (see FIG. 5, in which the same point is shown on the ROC and Precision-Recall curves).
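For illustration only, the parameter grid and the repeated cross-validation splits can be enumerated as in the following Python sketch. The specific grid values are the example values given above, the function names are hypothetical, and the fold assignment shown is one simple scheme rather than the procedure used by the Caret package.

```python
import itertools
import random


def parameter_grid(shrinkage=(0.1,), depths=(3, 4, 5, 6, 7), n_trees=(100, 150, 200)):
    # One dict per combination of the three gradient boosting parameters.
    return [
        {"shrinkage": s, "interaction_depth": d, "n_trees": n}
        for s, d, n in itertools.product(shrinkage, depths, n_trees)
    ]


def repeated_kfold(n_items, k=5, repeats=5, seed=0):
    # Yields (repeat, fold, train_indices, held_out_indices).
    rng = random.Random(seed)
    for rep in range(repeats):
        order = list(range(n_items))
        rng.shuffle(order)
        for fold in range(k):
            held_out = order[fold::k]      # every k-th shuffled index
            held = set(held_out)
            train = [i for i in order if i not in held]
            yield rep, fold, train, held_out
```

Each parameter combination would be scored on every held-out fold, and the combination with the best average score retained.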
[0086] Once all models were developed, performance was assessed using the F-score (F1), a combination of precision and recall:
[0087] F1 = 2 × (Precision × Recall)/(Precision + Recall) = 2TP/(2TP + FP + FN),
[0088] where TP, FP, and FN stand for true positive, false positive, and false negative, respectively. Maximizing the F-score results in maximizing TP while minimizing FP and FN. Performance level was assessed by calculating sensitivity, specificity, positive predictive value, negative predictive value, prevalence, accuracy, false positive rate (FPR), false discovery rate (FDR), and area under the curve (AUC). For these calculations, true negatives were those variants that passed all TOBI quality filters, were not published as somatic in source publications, and were not predicted as somatic by TOBI.
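As a non-limiting sketch, the count-based metrics listed above can be computed from the four confusion-matrix counts as shown below. The function name and dictionary keys are hypothetical, non-zero denominators are assumed, and AUC is omitted because it requires the full set of predicted probabilities. Both forms of the F-score are computed to show that they agree.

```python
def classification_metrics(tp, fp, fn, tn):
    # Assumes every denominator below is non-zero.
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)   # positive predictive value
    recall = tp / (tp + fn)      # sensitivity
    return {
        "sensitivity": recall,
        "specificity": tn / (tn + fp),
        "ppv": precision,
        "npv": tn / (tn + fn),
        "prevalence": (tp + fn) / total,
        "accuracy": (tp + tn) / total,
        "fpr": fp / (fp + tn),
        "fdr": fp / (tp + fp),
        # The two equivalent forms of the F-score.
        "f1_from_rates": 2 * precision * recall / (precision + recall),
        "f1_from_counts": 2 * tp / (2 * tp + fp + fn),
    }
```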
[0089] For each cancer, cases were randomly assigned to the training or test set using the sample() function without replacement in R. TOBI then calculated cohort-specific annotations separately for the training and test set. Somatic status of training set variants was annotated using a user-supplied list of somatic variants, defined by affected case, genomic position, and variant nucleotide. Next, TOBI used the Caret and gbm packages in R to perform gradient boosting and generate a classification model. To assess feature importance, relative influence of features was automatically calculated during model generation. Relative influence is a measure of how many times a feature is selected for splitting in all trees in the gradient boosting model, weighted and scaled so that the sum of relative influence of all features equals one hundred. Drivers were defined using the list of driver genes provided by the Intogen group (Rubio-Perez et al., 2015). The rate of somatic single nucleotide variants (SNVs) per Mb for each case was calculated using the number of published somatic SNVs, after converting di-nucleotide mutations into single nucleotide components and removing indels. This number was divided by the total megabases covered in the Agilent SureSelect Human All Exon 50 Mb regions.bed file.
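By way of illustration, the per-case somatic SNV rate can be sketched in Python as follows. The input representations (reference/alternate allele pairs and BED-style intervals) and the function names are assumptions made for this sketch. The sketch further assumes that the intervals do not overlap and that a di-nucleotide mutation contributes one SNV for each position at which the alleles differ.

```python
def count_snv_components(mutations):
    # mutations: iterable of (ref_allele, alt_allele) strings.
    total = 0
    for ref, alt in mutations:
        if "-" in ref or "-" in alt or len(ref) != len(alt):
            continue  # indel: removed from the count
        # SNVs count once; di-nucleotide changes count per differing base.
        total += sum(1 for r, a in zip(ref, alt) if r != a)
    return total


def covered_megabases(bed_intervals):
    # bed_intervals: iterable of (chrom, start, end), half-open as in BED.
    return sum(end - start for _chrom, start, end in bed_intervals) / 1e6


def snv_rate_per_mb(mutations, bed_intervals):
    return count_snv_components(mutations) / covered_megabases(bed_intervals)
```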
[0090] TOBI was developed using glioblastoma multiforme (GBM) cases from TCGA, and assessed on five adult cancer types from TCGA: bladder urothelial carcinoma (BLCA), brain lower grade glioma (LGG), lung adenocarcinoma (LUAD), skin cutaneous melanoma (SKCM), and stomach adenocarcinoma (STAD). TCGA's previously published somatic calls were considered as the "true somatic" calls for labeling training set variants. To assess TOBI's performance on pediatric tumors, pediatric glioma cases (Ped. Glioma) were analyzed, including cases with published tumor-normal analysis (Abate et al., 2015; Palomero et al., 2014) and tumor-only cases (Fontebasso et al., 2015; Wu et al., 2014; Schwartzentruber et al., 2012). Since cancer-sequencing studies have variable numbers of paired tumor-normal samples, the number of training cases required for model generation was assessed (FIG. 2). Increasing the number of training set tumor samples from one to fifty samples improved performance, with F-scores plateauing between 20 and 50 training cases in the six adult cancers. Twenty training cases produced an average F-score within 10% of the F-score at the maximum training set size. Thus, 20 random cases were used as the training set size and all remaining cases as the testing set to reflect a WES scenario where the majority of patient samples are tumor-only.
[0091] Historical tumor-only samples may be formalin-fixed and paraffin-embedded (FFPE), which introduces sequencing artifacts. TOBI's LUAD classification model was applied to FFPE LUAD cases (FIG. 17), and a slightly decreased F-score was observed for FFPE (0.68) vs. frozen samples (0.81). FFPE samples had similar sensitivity and specificity (0.94, 0.97) compared to frozen samples (0.87, 0.96). In FIG. 17, the metric is listed on the top of the box; for each metric, the top panel represents FFPE samples, and the bottom panel represents frozen samples. The y-axis shows case counts and the x-axis represents a 0 to 1 range of metrics. In each box, ordered pairs represent (mean, median) of metrics for that patient cohort; the dashed line represents the mean and the dotted line represents the median.
[0092] Next, the effects of differences in patient ancestry, sequencing institution, or hypermutator status within a cohort on TOBI performance were assessed. Stratifying on patients' reported race, TOBI had decreased mean F-scores when the training and testing sets differed by race in almost all cancers (FIG. 23). In FIG. 23, each box corresponds to one cancer type. The y-axis shows F-score, the x-axis shows reported race of the training set used to generate the model (20 randomly selected patients) above the reported race of the testing set, and the number of cases in the race-stratified testing set are shown within the plot area. Self-reported race categories required greater than 20 patients for inclusion as a training set, and a minimum of 5 patients for inclusion as a testing set. Points represent F-score for five runs with randomly selected training and testing sets from a specified race and the error bars represent mean +/- s.e.m. Differing sequencing institutions between the training and testing set also generated lower mean F-scores in almost all cross-institutional predictions (TCGA GBM with a cohort of 80 additional non-TCGA cases and Ped. Glioma analysis in FIG. 24). No significant effect on TOBI's performance was observed when analyzing a non-hypermutator population or mixed population (61 hypermutator, 219 non-hypermutator) (FIG. 24). Thus, TOBI's performance can improve with features denoting racial or institutional differences, but performance can be robust to hypermutator samples.
TOBI was developed using 104 glioblastoma multiforme (GBM) cases with matched tumor-germline DNA from TCGA (Brennan et al., 2013), and its somatic classification ability was assessed across an expanded GBM cohort (184 total cases) and five additional adult cancer types from TCGA: bladder urothelial carcinoma (BLCA, 120 paired samples) (The Cancer Genome Atlas Research Network, 2014a), brain lower grade glioma (LGG, 512 paired samples) (The Cancer Genome Atlas Research Network, 2015), lung adenocarcinoma (LUAD, 194 cases) (The Cancer Genome Atlas Research Network, 2014b), skin cutaneous melanoma (SKCM, 337 paired samples) (Akbani et al., 2015), and stomach adenocarcinoma (STAD, 280 paired samples) (The Cancer Genome Atlas Research Network, 2014c). To assess TOBI's performance on pediatric cancer cases, 142 pediatric glioma cases (Ped. Glioma) were analyzed, consisting of 74 tumor-normal samples (Taylor et al., 2014; Wu et al., 2014) and 68 tumor-only samples (Wu et al., 2014; Fontebasso et al., 2014; Schwartzentruber et al., 2012).
[0093] Because the number of cases with matched normal DNA can be quite small in many cancer-sequencing studies, the number of training samples required for optimal model generation in each of the seven studied cancer types was assessed. FIG. 2 shows the F-score for each model, in which the number of samples in the training set equals the number in the testing set. In FIG. 2, points represent predictions from five training and testing sets randomly selected from all patients and error bars represent +/- standard error. As the number of training samples increased from one to ten, F-scores increased in all cancer types, with F-scores plateauing between 20 and 50 training cases in the six adult cancers. In pediatric glioma, the F-score at the maximum training set size (29 samples) was within a standard deviation of the average F-score using 20 training cases. In all seven cancers, the average F-score with 20 training samples was within 10% of the F-score at the maximum training set size, and therefore, the TOBI classification models used training sets with only 20 tumor-normal samples of a particular cancer type. All remaining tumor-normal cases in each cancer constituted the testing set.
[0094] TOBI's machine-learning used ten biological features, as noted below:
(1) "Var. per Gene" is the total number of variants per gene normalized by the number of patients in the cohort, calculated separately for training and testing datasets.
(2) "Num. COSMIC Var." is the total number of samples in COSMIC with this specific nucleotide variant ("CNT" in COSMIC v74 vcf) (see
(3) "Allele Frequency" is the variant allele frequency (VAF) in the tumor sample.
(4) "CADD Score" is the Combined Annotation-Dependent Depletion Score, a score of variant deleteriousness integrated from multiple genome annotations. (See Kircher et al., 2014).
(5) "Num. COSMIC Gene" is the total number of COSMIC mutations in a gene.
(6) "Protein Length" is the length of the protein in amino acids.
(7) "VAF score" is the probability of a mutation to be a germline mutation with VAF = 50%. It can be calculated using the binomial distribution:
VAF Score = Binom(dpvar, dptot, 0.5)
where dpvar and dptot are variant depth and total depth, respectively. The justification is that assuming no copy number variation (CNV), the VAF of germline mutations should be either 50% or 100%. This can be seen in FIG. 7, where a local minimum ratio of somatic to non-somatic mutations occurs around VAF = 50%. This feature helps identify mutations with a high probability of being germline in cases without CNV.
(8) "Mutability" indicates if a gene is prone to mutation in a normal, non-tumor cohort. Per gene calculation involved counting the total number of mutations per gene in a cohort of 219 normal samples, and dividing by the amino acid length.
(9) "Var. per Case" for a particular gene and a particular sample represents the number of variants in that sample divided by the number of patients in the cohort; cohort indicates either training or testing set.
(10) "Variant Impact" is the predicted effect impact from SnpEff, and it can be "High", "Moderate", "Low", or "Modifier". (See Cingolani et al., 2012).
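As a non-limiting worked example of feature (7), the VAF score can be sketched in Python. The sketch assumes that Binom(dpvar, dptot, 0.5) denotes the binomial probability mass, i.e., the probability of observing exactly dpvar variant reads out of dptot total reads when the true allele fraction is 0.5; other readings, such as a tail probability, are possible. The function name is hypothetical.

```python
from math import comb


def vaf_score(dp_var, dp_tot, p=0.5):
    # Binomial probability of exactly dp_var variant reads in dp_tot reads.
    if not 0 <= dp_var <= dp_tot:
        raise ValueError("variant depth must lie between 0 and total depth")
    return comb(dp_tot, dp_var) * p ** dp_var * (1 - p) ** (dp_tot - dp_var)
```

Under this reading, a variant supported by 5 of 10 reads scores 252/1024 (about 0.246), whereas a variant supported by 1 of 10 reads scores 10/1024 (about 0.0098), so that variants with VAF near 50% receive the higher score.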
[0095] The importance of each individual feature was assessed using relative influence (Elith et al., 2008), a measure of how many times a feature is selected for splitting in all trees in the gradient boosting model, weighted and scaled so that the sum of relative influence of all features equals one hundred (see FIG. 6). Features were divided into four different categories, and their importance assessed by removing each category from the prediction model and calculating different performance metrics (see FIG. 7). Mutation-specific features were the most important, since without them precision decreases by almost 50%. The feature with the greatest relative influence in all adult cancers was the total variants per gene normalized by the number of patients in the cohort ("Var. per Gene"); in pediatric glioma, the number of cases in COSMIC with a specific variant ("Num. COSMIC Var.") had the highest relative influence; this can reflect both the lower mutation burden in pediatric glioma and the prevalence of hotspot mutations in H3F3A. Allele frequency was in the top three relative influence features in all cancers, which can be due to the clustering of germline variants near allele frequency 0, 0.5, or 1 (see FIG. 4). Removal of these top features from the classification model caused a slight drop in F-score, while removal of other individual features or both COSMIC-derived features minimally affected performance (FIG. 7). As shown in FIG. 7, each box indicates a cancer type. The left of the dashed line indicates performance using the standard TOBI model with all features included; the right of the dashed line indicates F-scores after the specified feature was removed from the model. Points represent F-score for five runs with randomly selected training and testing sets from specified race; the error bars represent mean +/- s.e.m.
[0096] As a negative control during TOBI development, TOBI somatic classification was assessed on normal samples. The GBM trained machine-learning model was applied to 16 matched normal samples from the testing cohort. On average, normal samples had 350 mutations after filtering (tumor samples had 400), out of which 13 mutations were predicted to be somatic (significantly different from the 36 mutations predicted in a tumor, p-value < 0.0001, see FIG. 8). In FIG. 8, the dashed line shows the average number of false positives in a tumor and the number of predicted somatic mutations in normal cases mostly lies below this line. The significantly lower number of predicted somatic variants in normal cases as compared to false positives per tumor suggests that a fraction of TOBI false positives in tumors are tumorigenic germline variants or somatic variants missed in the published somatic analysis.
[0097] TOBI's ability to predict high confidence somatic variants required biologically and technically appropriate model features and filter cutoffs. For example, the MAF and Somatic Score (the probability of a variant being somatic based on the variant allele frequency) were included in the model because many technical errors have allele frequencies near zero, heterozygous SNPs have frequencies of approximately 0.5, and true somatic variant allele frequencies often fall between those bounds (FIG. 4). Additional machine-learning features can be identified in the future that reduce TOBI's over-classification of somatic variants in TTN, MUC16, and other frequently mutated genes with low evidence of tumorigenicity. For example, FIG. 9 provides a comparison of actual versus predicted cases with somatic, nonsynonymous variants in six adult cancer types. In FIG. 9, dot shading corresponds to the fraction of synonymous variants out of all variants remaining after TOBI filtering (Step II) and dot size corresponds to the number of predicted cases over protein length in amino acids. Driver genes are labeled in black whereas other genes in the top five most predicted cases are labeled in grey. For clarity, genes with less than three previously published somatic variants are not shown.
[0098] Technical and biological filters reduce the number of variants prior to machine-learning. Preliminary analysis on GBM data found that filtering removed 22.5% of somatic mutations in driver genes, where 78% of these variants had low confidence variant calls (low QUAL, see FIG. 10). While current filters remove a small percentage of somatic variants in driver genes (FIG. 11), these filters also sharply reduced the total variant count from millions to hundreds per patient before machine-learning. The low number of variants entering the machine-learning step improves model generation and classification.
[0099] Next, TOBI's somatic classifications on the testing sets were compared to published somatic calls (FIG. 9 and FIG. 12, top-left panel). Per gene, the number of cases with nonsynonymous variants predicted as somatic closely matches the previously published somatic analysis, particularly in recurrently mutated genes. LGG's top five predicted genes are all known to drive adult LGG, and TP53 or BRAF is in the top three recurrently predicted genes in six cancers. While TOBI's classification model learns features from all somatic variants in the training set, including synonymous variants and probable passenger mutations, TOBI's classifications are enriched for known driver genes (see FIG. 12, top-right panel and FIG. 10, top panel). In six cancers, TOBI classifies a higher percentage of nonsynonymous variants as compared to all variants.
[00100] The pediatric glioma classification model was also applied to the 68-sample tumor-only screening set (FIG. 12, bottom-right panel). Nine genes were recurrently predicted to have somatic, nonsynonymous variants in this unpaired screening set, including three known drivers in pediatric glioma (TP53, H3F3A, PIK3CA) and three genes recurrently mutated in other cancers (EGFR, BRAF, IDH1). All predicted BRAF and all IDH1 variants occurred at established somatic hotspots (BRAF V600E, IDH1 R123H).
[00101] A fraction of the variants TOBI predicted as somatic in the testing set had not been previously published as validated somatic variants (FP variants in the text). These FP variants could be germline variants that passed quality filters and share many features with true somatic variants. TOBI could also be classifying somatic variants that are present only in tumor and not in normal DNA, but were not called as somatic in the published analysis. To clarify whether these FP variants were germline or somatic, germline allele frequencies for FP variants were analyzed in the six cancers with available germline frequency information (excluding GBM). To be classified as a germline variant, the germline variant allele frequency had to be greater than or equal to 30% (Zhang et al., 2015). Across these 1,327 cases, TOBI predicted 22,048 FP variants; 12,142 of these variants were nonsynonymous and had a germline frequency of at least 30% (FIG. 14).
[00102] TOBI can identify somatic variants from tumor-only samples and can capture germline variants with somatic features. TOBI's false positive (FP) variants can include germline variants that share features with true somatic variants, making them "somatic-like" germline (SLG) variants. SLG variants could be benign or oncogenic. Alternatively, false positive variants might be tumor-specific variants that were not previously published due to variability in somatic variant analysis. TOBI's overall false positive rate (FPR) in the cancer test sets was assessed. Since false positive variants can include SLG variants, the FPR was calculated from applying the Ped. Glioma classification model to a set of 100 non-tumor exomes from individuals without cancer sequenced by the 1000 Genomes Project (1000 Genomes Project Consortium et al., 2012). The FPR in these 1000 Genomes individuals (median FPR 0.25%, range 0.15-1.62%) was significantly lower than the FPR in any of the cancer cohorts (FIG. 28). The higher FPR from tumor cohorts suggests that some false positive calls represent somatic-like germline variants. To identify SLG variants, germline variant allele frequency (VAF) from 1,327 test cases was analyzed in six cancers excluding GBM. VAF is the fraction of exome sequencing reads corresponding to the variant allele at a genomic site within a specific patient sample. To be classified as an SLG variant, a false positive variant needed a germline VAF of at least 30% to decrease the probability that the germline variant represented tumor contamination or artifacts.
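For illustration only, the SLG rule described above can be sketched in Python. The function names and the returned labels are hypothetical; only the 30% germline VAF cutoff is taken from the description. Sites with no germline coverage are reported separately, which is an assumption of this sketch.

```python
def germline_vaf(alt_reads, total_reads):
    # Fraction of germline reads carrying the variant allele.
    return alt_reads / total_reads


def label_false_positive(germline_alt_reads, germline_total_reads, min_vaf=0.30):
    # Applies only to variants predicted somatic but absent from published calls.
    if germline_total_reads == 0:
        return "no germline coverage"
    if germline_vaf(germline_alt_reads, germline_total_reads) >= min_vaf:
        return "somatic-like germline"
    return "not germline by VAF cutoff"
```

A false positive call supported by 45 of 100 germline reads would thus be labeled somatic-like germline, whereas one with 0 of 80 germline reads would remain a candidate unreported somatic variant or artifact.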
[00103] Because certain germline variants highly increase predisposition to cancer (Lu et al., 2015; Zhang et al., 2015) and TOBI enriches for genes that drive cancer, TOBI's FP calls were further analyzed to determine whether they were enriched for genes associated with autosomal dominant cancer-predisposition syndromes (AD genes). A set of 60 AD genes reported by Zhang et al. (Zhang et al., 2015) was used for this analysis. AD genes show statistical enrichment in all FP variants (p < 4.35e-15) and in nonsynonymous FP variants with a germline allele frequency of at least 30% (p < 1.53e-10). TP53 had the highest number of cases with nonsynonymous FP variants (15 cases in four cancer types), and at least seven cases had FP mutations in CDH1, RB1, RET or TSC2 (FIG. 15). In TP53, FP variants were a mix of potential somatic variants and germline variants, with SLG nonsynonymous variants in seven cases (FIG. 16). Five of the TP53 SLG variants exhibit evidence of loss of heterozygosity, with germline VAFs below 45% and tumor VAFs above 70%.
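The loss-of-heterozygosity pattern noted for the TP53 SLG variants can be expressed as a simple check. The sketch below reuses the values observed in the text (germline VAF below 45%, tumor VAF above 70%) as illustrative cutoffs; the source does not state that these values were used as decision thresholds, and the function name is hypothetical.

```python
def shows_loh_pattern(germline_vaf: float,
                      tumor_vaf: float,
                      germline_max: float = 0.45,
                      tumor_min: float = 0.70) -> bool:
    """Flag a germline variant whose allele fraction rises sharply in the tumor.

    The default cutoffs mirror the values reported for the TP53 variants and
    are assumptions for illustration only.
    """
    for value in (germline_vaf, tumor_vaf):
        if not 0.0 <= value <= 1.0:
            raise ValueError("VAF must lie between 0 and 1")
    return germline_vaf < germline_max and tumor_vaf > tumor_min
```

A heterozygous germline variant near 40% VAF that reaches 85% VAF in the tumor would be flagged, consistent with loss of the wild-type allele in tumor cells.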
[00104] Focusing on nonsynonymous FP variants in AD genes, fifteen cases with TP53 mutations and at least seven cases with mutations in CDH1, RB1, RET or TSC2 (FIG. 15) were identified. In three Ped.Glioma cases, TOBI predicted somatic TP53 variants with tumor VAF greater than 65% and germline VAF of 0% (FIG. 16; variants G105V, R175H, and R273C). Three pediatric glioma cases with TP53 FP variants had germline frequencies of 0, but were not published as somatic variants in their original study (Wu et al., 2014). Paired tumor-normal somatic analysis with the SAVI variant caller (Trifonov et al., 2013) found these variants to be somatic, illustrating that TOBI can identify potential somatic variants that may be inconsistently called by somatic variant callers.
[00105] Since certain germline variants in cancer-associated genes correlate with earlier age of diagnosis, the presence of nonsynonymous SLG variants in 565 cancer-associated genes was analyzed. In LGG, patients with cancer-associated SLG variants had significantly earlier age at diagnosis (median 37 years vs. 41 years, p = 0.0013; FIG. 18; FIGS. 29, 30, and 31). Among LGG cases, SLG variants were most frequent in TP53 (n=4), followed by IDH1 (three cases: V71I [COSM96923]; one case: R82K [COSM4169909]) and RET (Y791F [COSM1159820], I852M [COSM4573611], R982H [COSM1264016], T1038A [COSM4650197]). Many genes with SLG variants in LGG have also shown recurrent somatic mutations in prior analysis (see Network, T.C.G.A.R., 2015) (e.g., TP53, IDH1, EGFR, and NF2; FIG. 18).
[00106] Step IV occurs only if normal WES DNA is available for testing set samples, and distinguishes somatic variants from somatic-like germline variants. TOBI's somatic classifications were compared to published somatic calls from tumor-normal analysis of test set cases (Cingolani et al., 2013, Network et al., 2015, The Cancer Genome Atlas Research, 2014, Abate et al., 2015, Palomero et al., 2014, Akbani et al., 2015). Across all variants, TOBI had a sensitivity of 86.6%; for nonsynonymous variants, TOBI had a sensitivity of 87.2%. TOBI also showed high sensitivity for variants with tumor VAF as low as 5% (FIGS. 26 and 27). In addition to sensitivity, specificity and F-score are also shown in FIGS. 26 and 27. As noted above, additional performance metrics were evaluated, including positive predictive value, negative predictive value, prevalence, accuracy, false positive rate (FPR), false discovery rate (FDR), and area under the curve (AUC). Per gene, the number of cases with nonsynonymous variants predicted as somatic closely matches published somatic analysis (FIGS. 9 and 13). TOBI's sensitivity in a cancer type positively correlates with the median number of somatic SNVs per megabase (Mb) across all cases of that cancer (Spearman rho 0.964, p-value < 0.003 for both all-gene and driver-only sensitivity, FIGS. 34 and 35). In these figures, the vertical axis shows the number of somatic SNVs per megabase on a log10 scale. Each point represents a tumor sample, red horizontal lines indicate the median value for each cancer, and cancers are ordered by increasing median number of somatic mutations.
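The performance metrics named in this paragraph all follow from a confusion matrix that compares TOBI's calls against the published tumor-normal calls. The sketch below shows the standard definitions; the `Confusion` record and function names are hypothetical, the F-score is assumed to be the F1 score (the text does not specify the weighting), and the counts in the usage note are made-up numbers rather than results from the study.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Counts of predicted-vs-published somatic calls (hypothetical container)."""
    tp: int  # predicted somatic, published somatic
    fp: int  # predicted somatic, not published somatic
    tn: int  # predicted non-somatic, not published somatic
    fn: int  # predicted non-somatic, published somatic


def _ratio(numerator: float, denominator: float) -> float:
    """Return numerator/denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def metrics(c: Confusion) -> dict:
    """Standard binary classification metrics from a confusion matrix."""
    total = c.tp + c.fp + c.tn + c.fn
    return {
        "sensitivity": _ratio(c.tp, c.tp + c.fn),
        "specificity": _ratio(c.tn, c.tn + c.fp),
        "ppv": _ratio(c.tp, c.tp + c.fp),
        "npv": _ratio(c.tn, c.tn + c.fn),
        "fpr": _ratio(c.fp, c.fp + c.tn),
        "fdr": _ratio(c.fp, c.fp + c.tp),
        "prevalence": _ratio(c.tp + c.fn, total),
        "accuracy": _ratio(c.tp + c.tn, total),
        # F1 score: harmonic mean of sensitivity and PPV (assumed weighting).
        "f_score": _ratio(2 * c.tp, 2 * c.tp + c.fp + c.fn),
    }
```

For example, `metrics(Confusion(tp=87, fp=10, tn=890, fn=13))` gives a sensitivity of 0.87 with these illustrative counts. AUC is omitted because it requires ranked prediction scores rather than a single confusion matrix.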
[00107] Thus, this Example provides a new framework to predict somatic variants in cancer exome studies with few matched normal controls and many tumor-only samples. TOBI successfully classified the majority of true somatic mutations in driver genes across seven tumor types, and classified known pediatric glioma driver genes as recurrently mutated in tumor-only cases.
[00108] Comparison to other techniques
[00109] To compare results from TOBI to other techniques, 6 GBM samples and 6 Pediatric Glioma samples were analyzed with SomVarIUS (Smith et al., 2015) and Virtual Normal Correction (VNC) (Hiltemann et al., 2015). To build the reference database, an hg19 dbSNP bed file was used to generate the required pickle file. To build the reference virtual normal, 433 CG-sequenced normal exomes were downloaded from the 1000 Genomes Project.
[00110] Compared to VNC, TOBI has higher F-scores (0.48 for Ped.Glioma and 0.22 for GBM; VNC F-score less than 0.0002 for both Ped.Glioma and GBM). SomVarIUS did not identify any true somatic mutations in Ped.Glioma. TOBI also predicts orders of magnitude fewer somatic variants per case compared to VNC and SomVarIUS (TOBI: ~5-50; VNC: ~300,000; SomVarIUS: ~100-3,000). TOBI's higher F-scores and biologically appropriate number of somatic variants indicate that TOBI outperforms these methods. Furthermore, TOBI was compared to methods that assess a variant's disease potential, since these methods have been used to assess the effects of somatic variants. Using published somatic variants from tumor-normal analysis as the gold standard, TOBI consistently had the highest AUC (FIGS. 26 and 27).
[00111] Bladder cancer cases with inactivating mutations in Fanconi anemia pathway display somatic signature of BRCA-deficiency
[00112] Truncating germline alterations in cancer predisposition genes have been reported in 4-19% of cases, depending on cancer type. Accordingly, the exome-wide SLG nonsense variants in each cancer type were examined. Bladder carcinoma cases showed significant enrichment of SLG nonsense variants in the Fanconi anemia (FA) pathway based on pathway assessment with g:Profiler (49 genes with SLG variants, 54 genes in the FA pathway, 3 overlapping genes; p-value of 0.029 after multiple testing correction). The FA pathway normally performs DNA repair of interstrand crosslinks, which requires homologous recombination.
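The overlap counts quoted above (49 query genes, 54 pathway genes, 3 shared) are the inputs to a standard over-representation test. The sketch below computes an uncorrected hypergeometric upper-tail probability for such an overlap. It is an illustration only: the background size of 20,000 genes is an assumption that does not come from the source, and g:Profiler applies its own background and multiple testing correction, so the value produced here is not expected to reproduce the reported corrected p-value of 0.029.

```python
from math import comb


def hypergeom_upper_tail(k: int, pathway_size: int, query_size: int,
                         background_size: int) -> float:
    """P(overlap >= k) when query_size genes are drawn from background_size
    genes, of which pathway_size belong to the pathway."""
    if not 0 <= pathway_size <= background_size:
        raise ValueError("pathway_size must be within the background")
    if not 0 <= query_size <= background_size:
        raise ValueError("query_size must be within the background")
    total = comb(background_size, query_size)
    upper = min(pathway_size, query_size)
    favorable = sum(
        comb(pathway_size, i) * comb(background_size - pathway_size, query_size - i)
        for i in range(max(k, 0), upper + 1)
    )
    return favorable / total


# Counts from the text; the background of 20,000 genes is an assumed value.
raw_p = hypergeom_upper_tail(k=3, pathway_size=54, query_size=49,
                             background_size=20_000)
```

The exact integer arithmetic of `math.comb` avoids the rounding problems that arise when factorials of this size are evaluated in floating point.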
[00113] The overall occurrence of germline and somatic nonsense mutations in the FA pathway predicted by TOBI was assessed (FIG. 19). In bladder cancer, TOBI predicted these variants in 11% (11/100) of patients. Less than 2.5% of patients in any other cancer type had predicted nonsense FA variants. True somatic nonsense variants occurred in 6% of BLCA cases, affecting genes BRCA2, FANCM, FANCE, REV3L, and SLX4. Germline nonsense variants were predicted in 5% of BLCA cases, affecting BRCA2, FANCM, and FANCD2. Several of these germline variants showed potential loss of heterozygosity based on increased VAF in tumor DNA compared to germline DNA (FIG. 20: FANCM R1931*, BRCA2 Y3308*). Of note, BRCA2 variant Y3308* has been associated with hereditary colorectal and breast cancer (Naseem et al., 2006). Mouse ES cells with BRCA2 Y3308* mutations showed hypersensitivity to ionizing radiation and crosslinking agents, as well as decreased homologous recombination efficiency (Kuznetsov et al., 2008). Additionally, FANCM R1931* was associated with increased breast cancer risk and deficient DNA repair (Peterlongo et al., 2015). FIG. 21 describes published somatic copy number alterations and predicted nonsynonymous variants within the FA pathway for this BLCA cohort.
[00114] Additionally, TOBI's predictions were used to investigate whether BLCA cases with predicted FA pathway nonsense mutations had significantly different mutational signatures compared to wildtype cases. Using all somatic mutations published for 130 TCGA BLCA cases, including the 100 test cases, trinucleotide mutational spectra were generated and decomposed into four somatic signatures (FIGS. 32 and 33). The non-negative matrix factorization approach developed by Alexandrov et al. was applied to infer the mutational signatures of bladder cancer. Cases with FA nonsense mutations were only enriched in the fourth signature (FIG. 22), a somatic signature similar to the BRCA1/2-deficiency signature from a pan-cancer analysis (signature 3 in the referenced publication). Enrichment of this somatic mutation signature in bladder cancer cases with nonsense FA variants suggests that these FA nonsense variants, whether somatic or germline, affect the bladder cancer somatic mutation landscape.
[00115] In tumor-only analysis, TOBI identified 87% of nonsynonymous somatic variants. Higher true positive rates in driver genes can suggest that TOBI enriches for cancer-causing variants. TOBI's similar performance on frozen and FFPE samples can suggest that TOBI filters certain FFPE artifacts. A TOBI modification trained on FFPE artifacts can remove more FFPE sequencing artifacts. TOBI outperformed other methods designed for somatic variant identification from tumor-only samples. This higher performance can be attributed to two fundamental differences between alternative methods and TOBI. First, alternative techniques use a single information source, but TOBI can integrate biological features from individual variants, patient cohorts, and curated databases. Second, TOBI can use the powerful gradient boosting algorithm to classify variants, allowing TOBI to learn features important to specific tumor types (FIG. 6).
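The trinucleotide mutational spectra used in the signature analysis of paragraph [00114] are conventionally built by assigning each single-nucleotide variant to one of 96 channels defined by the substitution and its flanking bases, with purine reference bases reported on the opposite strand. The sketch below shows that bookkeeping step only; the function names are hypothetical, and the subsequent non-negative matrix factorization is not reproduced here.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Reverse-complement a DNA string made of A, C, G, T."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def channel(context: str, alt: str) -> str:
    """Map an SNV to a pyrimidine-centered channel such as 'A[C>T]G'.

    `context` is the three-base reference sequence whose middle base is mutated.
    """
    context, alt = context.upper(), alt.upper()
    if len(context) != 3 or any(b not in COMPLEMENT for b in context):
        raise ValueError("context must be three bases from A, C, G, T")
    if alt not in COMPLEMENT or alt == context[1]:
        raise ValueError("alt must be a base different from the reference")
    if context[1] in "AG":
        # Purine reference: describe the change on the opposite strand.
        context = reverse_complement(context)
        alt = COMPLEMENT[alt]
    return f"{context[0]}[{context[1]}>{alt}]{context[2]}"


def spectrum(snvs) -> Counter:
    """Count SNVs per channel; `snvs` is an iterable of (context, alt) pairs."""
    return Counter(channel(context, alt) for context, alt in snvs)
```

Stacking one such count vector per case yields the non-negative cases-by-channels matrix that a factorization method decomposes into signatures and per-case exposures.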
[00116] When germline VAF information is available, TOBI can identify "somatic-like" germline variants. These SLG variants include oncogenic germline variants validated by outside groups, such as the TP53 R248Q alteration confirmed as germline by tumor-normal analysis of a pediatric glioma case (Wu et al., 2014). SLG variants in cancer genes were also associated with earlier age of diagnosis in patients with low-grade glioma (FIG. 18), suggesting that TOBI's SLG variants are enriched for cancer-associated variants.
[00117] Analysis of bladder carcinoma cases using TOBI revealed largely unreported germline inactivating mutations in the Fanconi anemia pathway, suggesting a potential genetic predisposition in 5% of patients. Outside analysis of a 14-patient bladder tumor cohort found a germline nonsense variant in BRCA2, but did not assess Fanconi anemia mutations. Germline BRCA2 nonsense mutations in bladder carcinoma can reflect the pan-cancer susceptibility attributed to germline BRCA2 mutations in analysis of other adult cancers. Future assessment of a larger BLCA cohort can reveal associations between germline FA mutations and clinical outcomes, similar to how an expanded cohort of prostate cancer patients revealed significantly more deleterious germline mutations in DNA repair genes in patients with metastatic vs. localized prostate cancer (Pritchard et al., 2016).
[00118] Integrated somatic and germline analysis identified nonsense FA pathway mutations in 11% of BLCA cases, suggesting a role for aberrant interstrand crosslink repair in bladder tumor development. Enrichment for a BRCA-deficiency somatic signature in these patients indicates similarity between FA-mutant bladder cancers and BRCA-mutant breast cancers. Treating BRCA-mutant breast cancers with PARP inhibitors improved patient outcome, so PARP inhibitors can also show increased effectiveness in bladder tumors with BRCA2 or other FA mutations. Furthermore, recent research found that the presence of tumor DNA alterations in FANCC (a member of the FA pathway), ATM, and RB1 predicted beneficial response to cisplatin neoadjuvant chemotherapy (Plimack et al., 2015). FA nonsense mutations can predict beneficial response to cisplatin, particularly given the beneficial response to cisplatin in patients with BRCA1-mutant breast cancers.
[00119] The framework analyzed either tumor-only samples or samples with matched tumor-normal DNA for variants with somatic features. In tumor-only samples, the framework (1) promoted the study of previously collected tumor samples without matched normal DNA, unlocking a vast repository of tumor-only samples without sequencing of matched normal DNA, and (2) prioritized exome alterations in a particular patient by focusing on variants with somatic characteristics. In cases with matched normal DNA, this framework identified germline variants that present somatic-like features and can inform tumor development. Integrated analysis of germline and somatic variants remains uncommon, making TOBI's identification of both somatic-like germline variants and somatic variants a unique strength. Applying the TOBI framework to seven cancer types illustrated that TOBI can recover known oncogenic variants of somatic and germline origin, and suggested a previously unreported role for inactivating mutations in the Fanconi anemia pathway in bladder cancer.
[00120] Furthermore, TOBI can capture most somatic mutations in cases where the training set for generating the classification model contains at least 10, or at least 20, cases with matched germline DNA. While this model learns features from somatic variants, any germline variant that has similar biological features and passed filtering will also be classified as somatic. TOBI's somatic classification for known tumorigenic germline variants such as TP53 R248Q and R283C illustrates this point. A computational framework similar to TOBI could be developed to identify tumorigenic germline variants.
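The supervised step referred to above, in which a gradient boosting model learns features of somatic variants from a small matched training set, can be illustrated with a deliberately simplified sketch. The code below is not the TOBI implementation: it boosts depth-one regression stumps on the logistic loss gradient, exposes only shrinkage and the number of trees as tunable parameters (interaction depth is fixed at 1), and uses two hypothetical features, tumor VAF and population allele frequency, with made-up training values.

```python
import math


def _sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def _fit_stump(rows, residuals):
    """Least-squares regression stump: (feature, threshold, left, right)."""
    mean_all = sum(residuals) / len(residuals)
    best = (None, 0.0, mean_all, mean_all)
    best_err = sum((r - mean_all) ** 2 for r in residuals)
    for j in range(len(rows[0])):
        # Every observed value except the largest is a candidate threshold.
        for t in sorted({row[j] for row in rows})[:-1]:
            left = [r for row, r in zip(rows, residuals) if row[j] <= t]
            right = [r for row, r in zip(rows, residuals) if row[j] > t]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if err < best_err:
                best_err, best = err, (j, t, lm, rm)
    return best


def _stump_predict(stump, row) -> float:
    j, t, left_value, right_value = stump
    if j is None:
        return left_value
    return left_value if row[j] <= t else right_value


def fit_gbm(rows, labels, n_trees: int = 50, shrinkage: float = 0.1):
    """Fit boosted stumps to binary labels (1 = somatic, 0 = germline)."""
    prior = min(max(sum(labels) / len(labels), 1e-6), 1 - 1e-6)
    f0 = math.log(prior / (1 - prior))  # initial log-odds
    scores = [f0] * len(rows)
    stumps = []
    for _ in range(n_trees):
        # Negative gradient of the logistic loss: label minus probability.
        residuals = [y - _sigmoid(s) for y, s in zip(labels, scores)]
        stump = _fit_stump(rows, residuals)
        stumps.append(stump)
        scores = [s + shrinkage * _stump_predict(stump, row)
                  for s, row in zip(scores, rows)]
    return f0, shrinkage, stumps


def predict_proba(model, row) -> float:
    """Predicted probability that a variant is somatic."""
    f0, shrinkage, stumps = model
    return _sigmoid(f0 + shrinkage * sum(_stump_predict(s, row) for s in stumps))


# Made-up training variants: (tumor VAF, population allele frequency).
train_rows = [(0.12, 0.0), (0.25, 0.0), (0.30, 0.0), (0.18, 0.001),
              (0.48, 0.004), (0.51, 0.008), (0.97, 0.006), (0.50, 0.009)]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]
model = fit_gbm(train_rows, train_labels, n_trees=100, shrinkage=0.1)
```

As the paragraph notes, such a model scores variants by their features alone, so a germline variant whose features resemble those of the somatic training variants would also receive a high somatic probability.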
[00121] In summary, the TOBI framework identifies somatic mutations in tumors without normal counterparts using a supervised machine learning approach. Applying TOBI to six adult cancer types from TCGA and one pediatric glioma data set illustrates that TOBI can outperform established tools like MutSig in identifying true somatic variants from tumor-only data. These results on unpaired pediatric glioma samples illustrate standard usage of TOBI on tumor-only data to predict somatic variants enriched for driver genes.
REFERENCES
Garraway, L. A., Verweij, J. & Ballman, K. V. Precision oncology: an overview. Journal of clinical oncology : official journal of the American Society of Clinical Oncology 31, 1803-1805, doi:10.1200/JCO.2013.49.4799 (2013).
Nowell, P. C. The clonal evolution of tumor cell populations. Science 194, 23-28, doi:10.1126/science.959840 (1976).
Cingolani, Cameron W. et al. The Somatic Genomic Landscape of Glioblastoma. Cell 155, 462-477, doi: 10.1016/j.cell.2013.09.034 (2013).
Network, T. C. G. A. R. Comprehensive, Integrative Genomic Analysis of Diffuse Lower-Grade Gliomas. New England Journal of Medicine 372, 2481-2498, doi:10.1056/NEJMoa1402121 (2015).
The Cancer Genome Atlas Research, N. Comprehensive molecular profiling of lung adenocarcinoma. Nature 511, 543-550, doi:10.1038/nature13385 (2014).
The Cancer Genome Atlas Research, N. Comprehensive molecular characterization of urothelial bladder carcinoma. Nature 507, 315-322, doi:10.1038/nature12965 (2014).
Jones, S. et al. Personalized genomic analyses for cancer mutation discovery and interpretation. Sci Transl Med 7, 283ra53-283ra53, doi:10.1126/scitranslmed.aaa7161 (2015).
Fontebasso, A. M. et al. Recurrent somatic mutations in ACVR1 in pediatric midline high-grade astrocytoma. Nature Genetics 46, 462-466, doi:10.1038/ng.2950 (2014).
Kim, J., Kim, S., Nam, H., Kim, S. & Lee, D. SoloDel: a probabilistic model for detecting low-frequent somatic deletions from unmatched sequencing data. Bioinformatics 31, 3105-3113, doi:10.1093/bioinformatics/btv358 (2015).
Wu, G. et al. The genomic landscape of diffuse intrinsic pontine glioma and pediatric non-brainstem high-grade glioma. Nature Genetics 46, 444-450, doi:10.1038/ng.2938 (2014).
Raymond, V. M. et al. Germline Findings in Tumor-Only Sequencing: Points to Consider for Clinicians and Laboratories. JNCI J Natl Cancer Inst 108, djv351, doi:10.1093/jnci/djv351 (2016).
Lawrence, M. S. et al. Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature 499, 214-218, doi:10.1038/nature12213 (2013).
Mack, S. C. et al. Epigenomic alterations define lethal CIMP-positive ependymomas of infancy. Nature 506, 445-450, doi:10.1038/nature13108 (2014).
Vogelstein, B. et al. Cancer genome landscapes. Science 339, 1546-1558, doi:10.1126/science.1235122 (2013).
Sherry, S. T. et al. dbSNP: the NCBI database of genetic variation. Nucleic acids research 29, 308-311 (2001).
Forbes, S. A. et al. COSMIC: exploring the world's knowledge of somatic mutations in human cancer. Nucleic acids research 43, D805-811, doi:10.1093/nar/gku1075 (2015).
Smith, K. S. et al. SomVarIUS: somatic variant identification from unpaired tissue samples. Bioinformatics, btv685, doi:10.1093/bioinformatics/btv685 (2015).
Hiltemann, S., Jenster, G., Trapman, J., Spek, P. v. d. & Stubbs, A. Discriminating somatic and germline mutations in tumour DNA samples without matching normals. Genome Res., gr.183053.114, doi:10.1101/gr.183053.114 (2015).
Abate, F. et al. Distinct Viral and Mutational Spectrum of Endemic Burkitt Lymphoma. PLOS Pathog 11, e1005158, doi:10.1371/journal.ppat.1005158 (2015).
Palomero, T. et al. Recurrent mutations in epigenetic regulators, RHOA and FYN kinase in peripheral T cell lymphomas. Nature Genetics 46, 166-170, doi:10.1038/ng.2873 (2014).
Tzoneva, G. et al. Activating mutations in the NT5C2 nucleotidase gene drive chemotherapy resistance in relapsed ALL. Nat Med 19, 368-371, doi:10.1038/nm.3078 (2013).
Schwartzentruber, J. et al. Driver mutations in histone H3.3 and chromatin remodelling genes in paediatric glioblastoma. Nature 482, 226-231, doi:10.1038/nature10833 (2012).
Kanchi, K. L. et al. Integrated analysis of germline and somatic variants in ovarian cancer. Nat Commun 5, doi:10.1038/ncomms4156 (2014).
Zhang, J. et al. Germline Mutations in Predisposition Genes in Pediatric Cancer. New England Journal of Medicine 0, null, doi:10.1056/NEJMoa1508054 (2015).
Caruana, R. & Niculescu-Mizil, A. 161-168 (ACM) (2006).
Consortium, T. G. P. An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56-65, doi:10.1038/nature11632 (2012).
Friedman, J. H. Stochastic gradient boosting. Computational Statistics & Data Analysis 38, 367-378, doi:10.1016/S0167-9473(01)00065-2 (2002).
Cibulskis, K. et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat Biotech 31, 213-219, doi: 10.1038/nbt.2514 (2013).
Trifonov, V., Pasqualucci, L., Tiacci, E., Falini, B. & Rabadan, R. SAVI: a statistical algorithm for variant frequency identification. BMC Systems Biology 7, 1-11, doi:10.1186/1752-0509-7-S2-S2 (2013).
Akbani, R. et al. Genomic Classification of Cutaneous Melanoma. Cell 161, 1681-1696, doi:10.1016/j.cell.2015.05.044 (2015).
The Cancer Genome Atlas Research, N. Comprehensive molecular characterization of gastric adenocarcinoma. Nature 513, 202-209, doi:10.1038/nature13480 (2014).
Wang, J. et al. Clonal evolution of glioblastoma under therapy. Nature Genetics 48, 768-776, doi:10.1038/ng.3590 (2016).
Elith, J., Leathwick, J. R. & Hastie, T. A working guide to boosted regression trees. J Anim Ecol 77, 802-813, doi:10.1111/j.1365-2656.2008.01390.x (2008).
Rubio-Perez, C. et al. In Silico Prescription of Anticancer Drugs to Cohorts of 28 Tumor Types Reveals Targeting Opportunities. Cancer Cell 27, 382-396, doi:10.1016/j.ccell.2015.02.007 (2015).
Kircher, M. et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nature Genetics 46, 310-315, doi:10.1038/ng.2892 (2014).
Kumar, P., Henikoff, S. & Ng, P. C. Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm. Nature Protocols 4, 1073-1081, doi:10.1038/nprot.2009.86 (2009).
Reva, B., Antipin, Y. & Sander, C. Predicting the functional impact of protein mutations: application to cancer genomics. Nucleic Acids Research 39, e118-e118, doi:10.1093/nar/gkr407 (2011).
Schwarz, J. M., Rodelsperger, C., Schuelke, M. & Seelow, D. MutationTaster evaluates disease-causing potential of sequence alterations. Nature methods 7, 575-576 (2010).
Roberts, N. D. et al. A comparative analysis of algorithms for somatic SNV detection in cancer. Bioinformatics 29, 2223-2230, doi:10.1093/bioinformatics/btt375 (2013).
Lu, C. et al. Patterns and functional implications of rare germline variants across 12 cancer types. Nat Commun 6, 10086, doi:10.1038/ncomms10086 (2015).
Knudson, A. G. Mutation and Cancer: Statistical Study of Retinoblastoma.
PNAS 68, 820-823 (1971).
Reimand, J. et al. g:Profiler - a web server for functional interpretation of gene lists (2016 update). Nucleic Acids Research, gkw199, doi:10.1093/nar/gkw199 (2016).
Schlacher, K., Wu, H. & Jasin, M. A Distinct Replication Fork Protection Pathway Connects Fanconi Anemia Tumor Suppressors to RAD51-BRCA1/2. Cancer Cell 22, 106-116, doi:10.1016/j.ccr.2012.05.015 (2012).
Naseem, H. et al. Inherited association of breast and colorectal cancer: limited role of CHEK2 compared with high-penetrance genes. Clinical Genetics 70, 388-395, doi:10.1111/j.1399-0004.2006.00698.x (2006).
Kuznetsov, S. G., Liu, P. & Sharan, S. K. Mouse embryonic stem cell-based functional assay to evaluate mutations in BRCA2. Nat Med 14, 875-881, doi:10.1038/nm.1719 (2008).
Peterlongo, P. et al. FANCM c.5791C>T nonsense mutation (rs144567652) induces exon skipping, affects DNA repair activity and is a familial breast cancer risk factor. Hum Mol Genet 24, 5345-5355, doi:10.1093/hmg/ddv251 (2015).
Alexandrov, L. B. et al. Signatures of mutational processes in human cancer. Nature 500, 415-421, doi:10.1038/nature12477 (2013).
Nickerson, M. L. et al. Concurrent Alterations in TERT, KDM6A, and the BRCA Pathway in Bladder Cancer. Clin Cancer Res 20, 4935-4948, doi:10.1158/1078-0432.CCR-14-0330 (2014).
Pritchard, C. C. et al. Inherited DNA-Repair Gene Mutations in Men with Metastatic Prostate Cancer. New England Journal of Medicine 0, null, doi:10.1056/NEJMoa1603144 (2016).
Tutt, A. et al. Oral poly(ADP-ribose) polymerase inhibitor olaparib in patients with BRCA1 or BRCA2 mutations and advanced breast cancer: a proof-of-concept trial. The Lancet 376, 235-244, doi:10.1016/S0140-6736(10)60892-6 (2010).
Plimack, E. R. et al. Defects in DNA Repair Genes Predict Response to Neoadjuvant Cisplatin-based Chemotherapy in Muscle-invasive Bladder Cancer. European Urology 68, 959-967, doi:10.1016/j.eururo.2015.07.009 (2015).
Byrski, T. et al. Results of a phase II open-label, non-randomized trial of cisplatin chemotherapy in patients with BRCA1-positive metastatic breast cancer. Breast Cancer Res 14, R110, doi:10.1186/bcr3231 (2012).
Cerami, E. et al. The cBio Cancer Genomics Portal: An Open Platform for Exploring Multidimensional Cancer Genomics Data. Cancer Discovery 2, 401-404, doi:10.1158/2159-8290.CD-12-0095 (2012).
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754-1760, doi:10.1093/bioinformatics/btp324 (2009).
Li, H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27, 2987-2993, doi:10.1093/bioinformatics/btr509 (2011).
Cingolani, P. et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly 6, 80-92, doi:10.4161/fly.19695 (2012).
Liu, X., Jian, X. & Boerwinkle, E. dbNSFP v2.0: a database of human non-synonymous SNVs and their functional predictions and annotations. Human mutation 34, E2393-2402, doi:10.1002/humu.22376 (2013).
Kuhn, M. & Johnson, K. Applied predictive modeling. (Springer, 2013).
Finn, R. D. et al. The Pfam protein families database: towards a more sustainable future. Nucleic Acids Res 44, D279-285, doi:10.1093/nar/gkv1344 (2016).
Ceccaldi, R. et al. Spontaneous abrogation of the G(2) DNA damage checkpoint has clinical benefits but promotes leukemogenesis in Fanconi anemia patients. J Clin Invest 121, 184-194, doi:10.1172/JCI43836 (2011).
61. Gao, J. J. et al. Integrative Analysis of Complex Cancer Genomics and Clinical Profiles Using the cBioPortal. Sci Signal 6 (2013).
62. Lu, C., Xie, M., Wendl, M.C., Wang, J., McLellan, M.D., Leiserson, M.D.M., Huang, K., Wyczalkowski, M.A., Jayasinghe, R., Banerjee, T., et al. Patterns and functional implications of rare germline variants across 12 cancer types. Nat. Commun. 6, 10086 (2015).
63. Taylor, K.R., Mackay, A., Truffaux, N., Butterfield, Y.S., Morozova, O., Philippe, C., Castel, D., Grasso, C.S., Vinci, M., Carvalho, D., et al. Recurrent activating ACVR1 mutations in diffuse intrinsic pontine glioma. Nat. Genet. 46, 457-461 (2014).
* * *
[00122] The contents of all figures and all references, patents and published patent applications and Accession numbers cited throughout this application are expressly incorporated herein by reference.
[00123] In addition to the various embodiments depicted and claimed, the disclosed subject matter is also directed to other embodiments having other combinations of the features disclosed and claimed herein. As such, the particular features presented herein can be combined with each other in other manners within the scope of the disclosed subject matter such that the disclosed subject matter includes any suitable combination of the features disclosed herein. The foregoing description of specific embodiments of the disclosed subject matter has been presented for purposes of illustration and description. It is not intended to be exhaustive or to limit the disclosed subject matter to those embodiments disclosed.
[00124] It will be apparent to those skilled in the art that various modifications and variations can be made in the systems and methods of the disclosed subject matter without departing from the spirit or scope of the disclosed subject matter. Thus, it is intended that the disclosed subject matter include modifications and variations that are within the scope of the appended claims and their equivalents.

Claims

WHAT IS CLAIMED IS:
1. A method of classifying variants in DNA extracted from a tumor sample of a subject using a classification model, comprising:
obtaining a DNA sequence from the tumor sample;
identifying variants in the DNA sequence from the tumor sample;
filtering the variants from the tumor sample to obtain a tumor only set of variants; and
applying the classification model to the tumor only set to identify one or more somatic variants.
2. The method of claim 1, wherein the filtering uses one or more technical filters and/or biological filters.
3. The method of claim 2, wherein the technical filter applies a numerical cutoff corresponding to variant quality.
4. The method of claim 3, wherein the numerical cutoff corresponds to at least one of variant depth, mapping quality, strand bias, map quality bias, and tail distance bias.
5. The method of claim 2, wherein the biological filter removes variants present in 1% or more of the population as represented by the 1000 Genomes Project populations.
6. The method of claim 2, wherein the biological filter removes variants present in one or more existing databases.
7. The method of claim 2, wherein the biological filter removes variants from intragenic non-coding exon regions and/or splice site regions within the DNA sequence.
8. The method of claim 1, further comprising identifying one or more somatic-like germline variants from among the identified somatic variants.
9. The method of claim 8, wherein the one or more somatic-like germline variants are identified by comparison to a DNA sequence obtained from a normal sample corresponding to the tumor sample of the subject.
10. The method of claim 1, wherein the tumor sample is obtained from a subject for which there is no matched normal tissue sample.
11. The method of claim 1, further comprising creating the classification model by:
obtaining from about 10 to about 50 matched tumor-normal DNA sequences;
identifying variants in the matched tumor-normal DNA sequences;
filtering the variants to obtain a tumor-normal set of variants;
segregating the tumor-normal set of variants into a training set and a testing set;
applying a gradient boosting technique to the training set to obtain a preliminary classification model; and
applying the preliminary classification model to the testing set to obtain the classification model using machine-learning.
12. The method of claim 11, wherein the identifying variants in the matched tumor- normal DNA sequences and/or in the DNA sequence from the tumor sample comprises a variant calling process.
13. The method of claim 11, wherein the gradient boosting technique incorporates one or more tunable parameters selected from shrinkage, interaction depth, and number of decision trees.
14. The method of claim 11, further comprising determining a relative influence of one or more biological features of variants in the training set based on the gradient boosting technique.
15. The method of claim 11, wherein the matched tumor-normal DNA sequences are obtained from tumor and normal tissue samples from a subject.
16. The method of claim 15, wherein the normal tissue sample is obtained prior to tumor onset.
17. The method of claim 15, wherein the normal tissue sample is obtained from a region of the subject separate from a region containing a tumor.
18. The method of claim 1, wherein one or more somatic variants are present on one or more driver genes.
19. The method of claim 18, wherein the driver gene is selected from the group consisting of ACVR1, ATRX, ARID1A, BCOR, BRAF, CTNND2, DDIT3, EGFR, FAT2, FGFR3, GPR116, HERC2, IDH1, KIT, LRP1B, LRP2, LZTR1, MYCN, NF1, NOS1, NRAS, PCNX, PDGFRA, PIK3CA, PIK3R1, PKHD1, PPM1D, PTEN, RB1, TEK, TP53, and TSHZ2.
20. The method of claim 1, wherein the one or more somatic variants are present on BRCA2.
21. The method of claim 20, further comprising treating the subject with an effective amount of a PARP inhibitor.
22. The method of claim 1, wherein the one or more somatic variants are present on BRCA1, FANCC, ATM, or RB1.
23. The method of claim 22, wherein the one or more somatic variants are nonsense variants and the method further comprises treating the subject with an effective amount of cisplatin.
24. The method of claim 1, wherein the method correctly identifies more than 85% of somatic variants within the DNA sequence from the tumor sample.
25. A system for classifying variants in DNA extracted from a tumor sample of a subject, comprising a processor configured to:
create a classification model by:
obtaining from about 10 to about 50 matched tumor-normal DNA sequences;
identifying variants in the matched tumor-normal DNA sequences;
filtering the variants to obtain a tumor-normal set of variants;
segregating the tumor-normal set of variants into a training set and a testing set;
applying a gradient boosting technique to the training set to obtain a preliminary classification model; and
applying the preliminary classification model to the testing set to obtain the classification model using machine-learning;
obtain a DNA sequence from the tumor sample;
identify variants in the DNA sequence from the tumor sample;
filter the variants from the tumor sample to obtain a tumor only set of variants; and
apply the classification model to the tumor only set to identify one or more somatic variants.
PCT/US2017/054445 2016-09-30 2017-09-29 Methods for classifying somatic variations WO2018064547A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US201662402137P true 2016-09-30 2016-09-30
US62/402,137 2016-09-30

Publications (1)

Publication Number Publication Date
WO2018064547A1 (en) 2018-04-05

Family

ID=61760207

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2017/054445 WO2018064547A1 (en) 2016-09-30 2017-09-29 Methods for classifying somatic variations

Country Status (1)

Country Link
WO (1) WO2018064547A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019016353A1 (en) * 2017-07-21 2019-01-24 F. Hoffmann-La Roche Ag Classifying somatic mutations from heterogeneous sample
WO2020068506A1 (en) * 2018-09-24 2020-04-02 President And Fellows Of Harvard College Systems and methods for classifying tumors
WO2021216477A1 (en) * 2020-04-21 2021-10-28 Grail, Inc. Generating cancer detection panels according to a performance metric

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
DEO ET AL.: "Prioritizing causal disease genes using unbiased genomic features", GENOME BIOLOGY, vol. 15, no. 12, 3 December 2014 (2014-12-03), XP021207726 *
NIKBAKHT ET AL.: "Spatial and temporal homogeneity of driver mutations in diffuse intrinsic pontine glioma", NATURE COMMUNICATIONS, vol. 7, no. 11185, 6 April 2016 (2016-04-06), pages 1 - 8, XP055515661 *
RAYMOND ET AL.: "Germline Findings in Tumor-Only Sequencing: Points to Consider for Clinicians and Laboratories", JOURNAL OF THE NATIONAL CANCER INSTITUTE, vol. 108, no. 4, 20 November 2015 (2015-11-20), XP055515646 *
VURAL: "Classification of Breast Cancer Patients Using Somatic Mutation Profiles and Machine Learning Approaches", PH.D. THESIS, December 2015 (2015-12-01), Omaha, Nebraska, pages 2, 21 , 31 , 35 - 43 , 62-64, 67 , 71-73, XP055515641, Retrieved from the Internet <URL:https://digitalcommons.unmc.edu/cgi/viewcontent.cgi?referer=https://www.google.com/&httpsredir=1&article=1050&context=etd> [retrieved on 20171202] *

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 17857540; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 17857540; Country of ref document: EP; Kind code of ref document: A1)