WO2022266518A2 - Methods for identifying mutations using machine learning - Google Patents

Methods for identifying mutations using machine learning

Publication number
WO2022266518A2
Authority
WO
WIPO (PCT)
Prior art keywords
ratio
max
decision trees
machine learning
softclip
Application number
PCT/US2022/034115
Other languages
French (fr)
Other versions
WO2022266518A3 (en)
Inventor
Scott Arthur TOMLINS
Daniel Reed RHODES
David Bryan JOHNSON
Original Assignee
Strata Oncology, Inc.
Application filed by Strata Oncology, Inc.
Priority to EP22825947.9A (EP4356319A2)
Priority to AU2022292749A (AU2022292749A1)
Priority to US18/571,652 (US20240290422A1)
Priority to IL309473A (IL309473A)
Priority to CA3224548A (CA3224548A1)
Publication of WO2022266518A2
Publication of WO2022266518A3

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/10 Machine learning using kernel methods, e.g. support vector machines [SVM]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/20 Ensemble learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/01 Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00 Computing arrangements based on specific mathematical models
    • G06N7/01 Probabilistic graphical models, e.g. probabilistic networks
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis

Definitions

  • NGS Next Generation Sequencing
  • Although NGS-based companion diagnostic (CDx) cancer tests have now been FDA approved, their uptake has been limited by: (1) single-site approvals (e.g., FoundationOne CDx™ [F1CDx]) that compete with regional laboratories that wish to better serve their local patient populations, (2) impractical sample input requirements for capture-based NGS IVD tests (e.g., F1CDx requires tumor surface area > 25 mm²), and (3) insufficient content in kitted products (e.g., Oncomine Target Dx, Kir Extended Ras Panel), which took many years to develop but, when launched, no longer met current medical needs given the ever-expanding list of biomarkers recommended or required by professional organizations and payors.
  • kitted products e.g., Oncomine Target Dx, Kir Extended Ras Panel
  • US 2019/0189242 A1 (Published June 20, 2019) provides a variant calling method utilizing machine learning. However, as detailed in paragraph [0166], machine learning was accomplished using an artificially created "spike-in" training set. Thus, the method of US 2019/0189242 A1 has degraded performance when calling "real world" samples, which often comprise low quality samples.
  • Some aspects of the present disclosure are directed to a method for identifying mutations from a patient sample, comprising: evaluating, using a computer having a machine learning classifier, a candidate variant against a plurality of decision trees trained to detect mutations in the candidate variant with a gradient boosting algorithm, wherein each decision tree classifies the candidate variant as at least one of present or not present; and classifying, using the computer, the candidate variant as a real mutation based on classifications of each of the plurality of decision trees, wherein the decision trees receive the following parameters: min_AF_ratio_50; min_alt_count; min_AF_softclip_ratio_99; count; and within_tumor_prior.
  • the decision trees also receive the following parameters: min_AF_ratio_100, min_AF_softclip_ratio_100, min_AF_softclip, min_AF_softclip_ratio_90, and max_alt_count.
  • the decision trees also receive one or more of the following parameters: max_AF_ratio_100, AF, frac_amps_with_evidence, pan_tumor_prior, amplicon_variant_count, AF_frac_pos_max, analysis_variant_count, min_depth, min_AF, pct_noisy_positions, avg_cov_rel_chip, max_AF_ratio_99, AF_softclip, pan_tumor_prior_gene, min_AF_ratio_99, within_tumor_prior_gene, max_AF_ratio_50, min_AF_softclip_ratio_50, max_AF_ratio_90, depth, num_covering_amps, DNA_avg_coverage, max_AF_softclip, frac_max_af, max_depth, max_AF_softclip_rati
  • evaluating the candidate variant further comprises evaluating the candidate variant using a random forest classifier.
  • the plurality of decision trees further comprises at least one thousand decision trees.
  • the plurality of decision trees further comprises a plurality of decision trees for each mutation.
  • the method further comprises training the machine learning classifier using a training data set of sequences that include identified mutations.
  • the training data set of sequences was obtained in part from low quality biological samples and wherein the mutations are identified via expert review.
  • training the machine learning classifier further comprises optimizing parameters of the machine learning classifier until the machine learning classifier produces output describing the known mutations.
  • optimizing parameters of the machine learning classifier further comprises selecting a plurality of feature categories.
  • the machine learning classifier is selected from the group consisting of: a neural network, Bayesian classifier, logistic regression, decision tree, gradient-boosted tree, multilayer perceptron, one-vs-rest, and Naive Bayes.
  • the method further comprises generating, using the computer, an overall confidence score for the candidate variant being a real mutation based on the classifications of all of the plurality of decision trees. In some embodiments, the method further comprises providing a report that describes the candidate variant as including the mutation.
  • one or more of the decision trees receive parameters selected from the group consisting of: sample type; FASTQ quality score; alignment score; read coverage; and an estimated probability of error.
  • the training data set comprises a plurality of known single-nucleotide variants (SNVs) and insertions/deletions (indels), the method comprising: detecting at least one mutation in the nucleic acid; validating the detected mutation as present in the nucleic acid using the classification model; and providing a report that describes the nucleic acid as including the mutation.
  • a decision tree classifies the candidate variant as present or absent (e.g., likely noise or other sequencing artifact). In some embodiments, the decision tree does not classify whether or not the candidate variant is germline.
  • Some aspects of the present disclosure are directed to a system for identifying mutations from a patient sample, comprising: a computer system having a processor, memory and a plurality of lines of instructions; a machine learning classifier executed by the processor of the computer system, the machine learning classifier being configured to: evaluate a candidate variant against the plurality of decision trees trained to detect mutations in the candidate variant, wherein each decision tree classifies the candidate variant as present or absent (e.g., likely noise or other sequencing artifact), based on classifications of each of the plurality of decision trees.
  • the machine learning classifier is further configured to evaluate the candidate variant using a random forest classifier.
  • the plurality of decision trees further comprises at least one thousand decision trees.
  • the plurality of decision trees further comprises a plurality of decision trees for each mutation.
  • the processor is further configured to train the machine learning classifier using a training data set of sequences that include known mutations.
  • the training data set of sequences was obtained in part from low quality biological samples and wherein the mutations are identified via expert review.
  • the processor is further configured to optimize parameters of the machine learning classifier until the machine learning classifier produces output describing the known mutations.
  • the processor is further configured to select a plurality of feature categories.
  • the machine learning classifier is selected from the group consisting of: a neural network, Bayesian classifier, logistic regression, decision tree, gradient-boosted tree, multilayer perceptron, one-vs-rest, and Naive Bayes.
  • the processor is further configured to generate an overall confidence score for the candidate variant being a real mutation (vs. sequencing artifact) based on the classifications of all of the plurality of decision trees.
  • the sample is from plasma, blood, serum, saliva, sputum, stool, a tumor, cell free DNA, circulating tumor cell, or other biological sample.
  • the sample is from a subject having or at risk of having cancer.
  • the cancer is selected from lung, bladder, colon, gastric, head and neck, breast, prostate, non-small cell lung adenocarcinoma, non-small cell lung squamous cell carcinoma, bladder urothelial carcinoma, colorectal, brain or pancreatic cancer.
  • FIG. 1 shows sequence read information for KRAS chr12:25398280 C>T.
  • FIG. 2 shows sequence read information for BRCA1 chr17:41256098 A>G.
  • FIG. 3 shows sequence read information for MSH6 chr2:48010515 G>A.
  • Some aspects of the present disclosure are directed to a method for identifying mutations from a patient sample, comprising: evaluating, using a computer having a machine learning classifier, a candidate variant against a plurality of decision trees trained to detect mutations in the candidate variant with a gradient boosting algorithm, wherein each decision tree classifies the candidate variant as at least one of present or not present; and classifying, using the computer, the candidate variant as a real mutation based on classifications of each of the plurality of decision trees, wherein the decision trees receive the following parameters: min_AF_ratio_50, min_alt_count, min_AF_softclip_ratio_99, count, and within_tumor_prior.
  • the decision trees also receive one or more of the following parameters: min_AF_ratio_100, min_AF_softclip_ratio_100, min_AF_softclip, min_AF_softclip_ratio_90, and max_alt_count. In some embodiments, the decision trees also receive two or more of the following parameters: min_AF_ratio_100, min_AF_softclip_ratio_100, min_AF_softclip, min_AF_softclip_ratio_90, and max_alt_count.
  • the decision trees also receive three or more of the following parameters: min_AF_ratio_100, min_AF_softclip_ratio_100, min_AF_softclip, min_AF_softclip_ratio_90, and max_alt_count. In some embodiments, the decision trees also receive four or more of the following parameters: min_AF_ratio_100, min_AF_softclip_ratio_100, min_AF_softclip, min_AF_softclip_ratio_90, and max_alt_count. In some embodiments, the decision trees also receive the following parameters: min_AF_ratio_100, min_AF_softclip_ratio_100, min_AF_softclip, min_AF_softclip_ratio_90, and max_alt_count.
  • the decision trees also receive min_AF_softclip_ratio_100. In some embodiments, the decision trees also receive min_AF_ratio_100 and min_AF_softclip_ratio_100. In some embodiments, the decision trees also receive min_AF_ratio_100, min_AF_softclip_ratio_100, and min_AF_softclip. In some embodiments, the decision trees also receive min_AF_ratio_100, min_AF_softclip_ratio_100, min_AF_softclip, and min_AF_softclip_ratio_90.
  • the decision trees also receive 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
  • the decision trees also receive one or more of the following parameters: max_AF_ratio_100, AF, frac_amps_with_evidence, pan_tumor_prior, amplicon_variant_count, AF_frac_pos_max, analysis_variant_count, min_depth, min_AF, pct_noisy_positions, avg_cov_rel_chip, max_AF_ratio_99, AF_softclip, pan_tumor_prior_gene, min_AF_ratio_99, within_tumor_prior_gene, max_AF_ratio_50, min_AF_softclip_ratio_50, max_AF_ratio_90, depth, num_covering_amps, DNA_avg_coverage, max_AF_softclip, frac_max_af, max_depth, max_AF_softclip_ratio_99,
  • the machine learning classifier is not limited and may be any suitable machine learning classifier type.
  • Machine learning approaches including any one or more of: supervised learning (e.g., using logistic regression, using back propagation neural networks, using random forests, decision trees, etc.), unsupervised learning (e.g., using an Apriori algorithm, using K-means clustering), semi-supervised learning, a deep learning algorithm (e.g., neural networks, a restricted Boltzmann machine, a deep belief network method, a convolutional neural network method, a recurrent neural network method, stacked auto encoder method, etc.), reinforcement learning (e.g., using a Q-learning algorithm, using temporal difference learning), a regression algorithm (e.g., ordinary least squares, logistic regression, stepwise regression, multivariate adaptive regression splines, locally estimated scatterplot smoothing, etc.), an instance-based method (e.g., k-nearest neighbor, learning vector quantization, self-organizing map, etc.), a regularization
  • the machine learning classifier is selected from the group consisting of: a neural network, Bayesian classifier, logistic regression, decision tree, gradient-boosted tree, multilayer perceptron, one-vs-rest, and Naive Bayes.
  • the machine learning approach is a supervised learning approach.
  • evaluating the candidate variant with machine learning further comprises evaluating the candidate variant using a random forest classifier.
  • the plurality of decision trees further comprises at least one thousand decision trees. In some embodiments, the plurality of decision trees further comprises at least ten thousand decision trees. In some embodiments, the plurality of decision trees further comprises at least fifty thousand decision trees.
  • the plurality of decision trees further comprises a plurality of decision trees for each mutation. In some embodiments, the plurality of decision trees for each mutation comprises at least 5 decision trees per each mutant. In some embodiments, the plurality of decision trees for each mutation comprises at least 50 decision trees per each mutant. In some embodiments, the plurality of decision trees for each mutation comprises at least 100 decision trees per each mutant. In some embodiments, the plurality of decision trees for each mutation comprises at least 500 decision trees per each mutant. In some embodiments, the plurality of decision trees for each mutation comprises at least 1000 decision trees per each mutant. In some embodiments, the plurality of decision trees for each mutation comprises at least 5000 decision trees per each mutant.
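The ensemble idea described above, in which each tree classifies a candidate variant as present or not present and the per-tree classifications are then combined, can be sketched in a few lines. This is only an illustrative toy under stated assumptions: it uses single-split "stumps" combined by simple majority vote rather than trees trained with a gradient boosting algorithm, and every threshold, split direction, and helper name (`make_stump`, `classify`, `TREES`) is hypothetical. Only the five feature names come from the text.

```python
from typing import Callable, Dict, List

Features = Dict[str, float]
Tree = Callable[[Features], bool]  # True = "present", False = "not present"


def make_stump(feature: str, threshold: float) -> Tree:
    """Return a one-split tree that votes 'present' when feature >= threshold."""
    def stump(x: Features) -> bool:
        return x[feature] >= threshold
    return stump


# Hypothetical, hand-picked thresholds for illustration only (not trained values).
TREES: List[Tree] = [
    make_stump("min_AF_ratio_50", 2.0),
    make_stump("min_alt_count", 5),
    make_stump("min_AF_softclip_ratio_99", 1.5),
    make_stump("count", 1),
    make_stump("within_tumor_prior", 0.01),
]


def classify(x: Features, trees: List[Tree] = TREES) -> bool:
    """Call the variant 'present' when more than half of the trees vote 'present'."""
    votes = [tree(x) for tree in trees]
    return sum(votes) > len(votes) / 2


# Made-up feature values for one candidate variant.
candidate = {
    "min_AF_ratio_50": 3.0,
    "min_alt_count": 12,
    "min_AF_softclip_ratio_99": 0.5,
    "count": 2,
    "within_tumor_prior": 0.05,
}
print(classify(candidate))
```

A trained gradient-boosted model would instead sum real-valued tree outputs and map the sum to a probability; the majority vote here is a deliberate simplification to show how per-tree classifications feed one final call.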
  • Some embodiments of the methods disclosed herein further comprise training the machine learning classifier using a training data set of sequences that include identified mutations.
  • the training data set of sequences was obtained in part from low quality biological samples (e.g., at least 1%, at least 2%, at least 3%, at least 5%, at least 7.5%, at least 10% are low quality biological samples).
  • the mutations are identified via expert review.
  • the training data set of sequences was obtained in part from low quality biological samples (e.g., at least 1%, at least 2%, at least 3%, at least 5%, at least 7.5%, at least 10% is low quality) and the mutations are identified via expert review.
  • training the machine learning classifier further comprises optimizing parameters of the machine learning classifier until the machine learning classifier produces output describing the mutations identified in the training set (e.g., the mutations identified by expert review).
  • optimizing parameters of the machine learning classifier further comprises selecting a plurality of feature categories. In some embodiments, at least 2,
  • the method further comprises generating, using the computer, an overall confidence score for the candidate variant being a real mutation based on the classifications of all of the plurality of decision trees.
  • the overall confidence score is on a scale of 0-1, with a 1 being a 100% confidence for a real mutation.
  • a candidate variant is called as a real mutation when the confidence score is greater than 0.3, 0.4, 0.5, 0.6, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 0.98, or 0.99.
  • the candidate variant is detected in a high quality sample and is called as a real mutation when the confidence score is greater than 0.3.
  • the candidate variant is detected in a low quality sample and is called as a real mutation when the confidence score is greater than 0.7.
  • a candidate variant is called as a real mutation when the confidence score is greater than 0.9.
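The scoring and thresholding described above can be illustrated with a short sketch. The 0-1 scale and the 0.3 and 0.7 cutoffs come from the text, reading the second sample type as a low quality sample. How the score is derived from the trees is not specified there, so computing it as the fraction of trees voting "present" is an assumption, as are the function names and the `"high"`/`"low"` labels.

```python
from typing import Sequence

# Cutoffs from the embodiments above: 0.3 for high quality samples,
# 0.7 for low quality samples.
CALL_THRESHOLDS = {"high": 0.3, "low": 0.7}


def confidence_score(votes: Sequence[bool]) -> float:
    """Assumed scoring rule: fraction of trees voting 'present' (0.0 to 1.0)."""
    if not votes:
        raise ValueError("at least one tree vote is required")
    return sum(votes) / len(votes)


def call_variant(score: float, sample_quality: str) -> bool:
    """Call a real mutation when the score is strictly greater than the cutoff."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie on the 0-1 scale")
    return score > CALL_THRESHOLDS[sample_quality]


score = confidence_score([True, True, False, False])
print(score, call_variant(score, "high"), call_variant(score, "low"))
```

Using a stricter cutoff for low quality samples trades sensitivity for specificity where sequencing artifacts are more likely.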
  • one or more of the decision trees receive a read coverage parameter.
  • one or more of the decision trees receive parameters selected from min_AF_ratio_50; min_alt_count; min_AF_softclip_ratio_99; count; min_AF_softclip; min_AF_softclip_ratio_100; within_tumor_prior; min_AF_ratio_100; max_AF_ratio_100; and max_alt_count.
  • the method comprising: detecting at least one SNV or indel in the nucleic acid; validating the detected SNV or indel as present in the nucleic acid using the classification model; and providing a report that describes the nucleic acid as including the SNV or indel.
  • SNVs single-nucleotide variants
  • the training data set comprises a plurality of known single-nucleotide variants (SNVs)
  • the method comprising: detecting at least one SNV in the nucleic acid; validating the detected SNV as present in the nucleic acid using the classification model; and providing a report that describes the nucleic acid as including the SNV.
  • the training data set comprises a plurality of insertions and/or deletions (indels)
  • the method comprising: detecting at least one indel in the nucleic acid; validating the detected indel as present in the nucleic acid using the classification model; and providing a report that describes the nucleic acid as including the indel.
  • Some aspects of the present disclosure are directed to a system for identifying mutations from a patient sample, comprising a computer system having a processor, memory and a plurality of lines of instructions; a machine learning classifier executed by the processor of the computer system, the machine learning classifier being configured to: evaluate a candidate variant against the plurality of decision trees trained to detect mutations in the candidate variant, wherein each decision tree classifies the candidate variant as at least one of present or not present; and classify the candidate variant as present or not present based on classifications of each of the plurality of decision trees.
  • the machine learning classifier is not limited and may be any suitable machine learning methodology described herein.
  • the machine learning classifier is selected from the group consisting of: a neural network, Bayesian classifier, logistic regression, decision tree, gradient-boosted tree, multilayer perceptron, one-vs-rest, and Naive Bayes.
  • the machine learning classifier is further configured to evaluate the candidate variant using a random forest classifier.
  • the number of decision trees is not limited and may be any suitable number described herein.
  • the plurality of decision trees further comprises at least one thousand decision trees.
  • the plurality of decision trees further comprises a plurality of decision trees for each mutation.
  • the processor is further configured to train the machine learning classifier using a training data set of sequences that include known mutations.
  • the training data set of sequences was obtained in part from low quality biological samples and wherein the mutations are identified via expert review.
  • low quality biological samples are samples comprising significantly degraded nucleotide sequences (e.g., DNA).
  • the low quality biological samples are from samples that have undergone formalin fixation and paraffin embedding (FFPE).
  • FFPE formalin fixation and paraffin embedding
  • the low quality biological samples comprise at least 2-fold, 5-fold, 10-fold, or more chemically modified nucleotides or chemical crosslinks than a freshly obtained and untreated biological sample.
  • the low quality biological samples comprise at least 2-fold, 5-fold, 10-fold, or more degraded sequences (DNA sequences) than a freshly obtained and untreated biological sample.
  • the processor is further configured to optimize parameters of the machine learning classifier until the machine learning classifier produces output describing the known mutations.
  • the processor is further configured to select a plurality of feature categories.
  • the feature categories are not limited and may be any feature categories described herein.
  • the feature categories are provided in Table 2, 3, or 4.
  • the feature categories are provided in Table 3.
  • the processor is further configured to generate an overall confidence score for the candidate variant being a real mutation based on the classifications of all of the plurality of decision trees.
  • the biological sample i.e., sample
  • the sample is any suitable sample type.
  • the sample is from plasma, blood, serum, saliva, sputum, stool, a tumor, cell free DNA, circulating tumor cell, or other biological sample.
  • the sample is a blood sample.
  • the biological sample is a tumor specimen.
  • the sample is from a subject having or at risk of having cancer.
  • the type of cancer is not limited and may be any suitable cancer.
  • Exemplary cancers include, but are not limited to, acoustic neuroma; adenocarcinoma; adrenal gland cancer; anal cancer; angiosarcoma (e.g., lymphangiosarcoma, lymphangioendotheliosarcoma, hemangiosarcoma); appendix cancer; benign monoclonal gammopathy; biliary cancer (e.g., cholangiocarcinoma); bladder cancer; breast cancer (e.g., adenocarcinoma of the breast, papillary carcinoma of the breast, mammary cancer, medullary carcinoma of the breast); brain cancer (e.g., meningioma, glioblastomas, glioma (e.g., astrocytoma, oligodendroglioma), medulloblastoma); bronchus cancer; carcinoid tumor; cervical cancer (e.g., cervical adenocarcinoma); choriocar
  • Wilms tumor, renal cell carcinoma); liver cancer (e.g., hepatocellular cancer (HCC), malignant hepatoma); lung cancer (e.g., bronchogenic carcinoma, small cell lung cancer (SCLC), non-small cell lung cancer (NSCLC), adenocarcinoma of the lung); leiomyosarcoma (LMS); mastocytosis (e.g., systemic mastocytosis); muscle cancer; myelodysplastic syndrome (MDS); mesothelioma; myeloproliferative disorder (MPD) (e.g., polycythemia vera (PV), essential thrombocytosis (ET), agnogenic myeloid metaplasia (AMM) a.k.a.
  • HCC hepatocellular cancer
  • lung cancer e.g., bronchogenic carcinoma, small cell lung cancer (SCLC), non-small cell lung cancer (NSCLC), adenocarcinoma of the lung
  • myelofibrosis MF
  • chronic idiopathic myelofibrosis chronic myelocytic leukemia (CML), chronic neutrophilic leukemia (CNL), hypereosinophilic syndrome (HES)
  • neuroblastoma e.g., neurofibromatosis (NF) type 1 or type 2, schwannomatosis
  • neuroendocrine cancer e.g., gastroenteropancreatic neuroendocrine tumor (GEP-NET), carcinoid tumor
  • osteosarcoma e.g., bone cancer
  • ovarian cancer e.g., cystadenocarcinoma, ovarian embryonal carcinoma, ovarian adenocarcinoma
  • papillary adenocarcinoma pancreatic cancer
  • pancreatic cancer e.g., pancreatic adenocarcinoma, intraductal papillary mucinous neoplasm (IPMN), Islet cell tumors
  • the cancer is selected from adrenal, biliary, bladder, brain, breast, cervical, colon and rectum, endometrium, esophagus, head and neck, kidney, liver, lung - NSCLC, lung - Other, lymphoma, melanoma, meninges, NSCLC, non-melanoma skin, ovary, pancreas, prostate, sarcoma, small intestine, stomach, thymus, or thyroid cancer.
  • the cancer is selected from lung, bladder, colon, gastric, head and neck, breast, prostate, non-small cell lung adenocarcinoma, non-small cell lung squamous cell carcinoma, bladder urothelial carcinoma, colorectal, brain or pancreatic cancer.
  • the biological sample comprises less genomic material than prior mutant calling methods permit. In some embodiments, the biological sample comprises less than about 25 nanograms (ng) of genomic material. In some embodiments, the biological sample comprises less than about 20 ng of genomic material. In some embodiments, the biological sample comprises less than about 15 ng of genomic material. In some embodiments, the biological sample comprises less than about 12 ng of genomic material. In some embodiments, the biological sample comprises less than 10 ng of genomic material. In some embodiments, the biological sample comprises less than 7.5 ng of genomic material. In some embodiments, the biological sample comprises less than 5 ng of genomic material. In some embodiments, the biological sample has undergone fixation. The method of fixation is not limited and may be any method of fixation known in the art. In some embodiments, fixation includes formalin fixation, which is known in the art to result in an abundance of C>T mutations thought to be due to deamination.
  • the area of the biological sample is not limited. In some embodiments, the area of the biological sample is less than the area needed for variant calling methods used in the art. In some embodiments, the biological sample is a sample having an area of less than 5 mm². In some embodiments, the biological sample is a sample having an area of less than 10 mm². In some embodiments, the biological sample is a sample having an area of less than 15 mm². In some embodiments, the biological sample is a sample having an area of less than 20 mm². In some embodiments, the biological sample is a sample having an area of less than 25 mm². In some embodiments, the biological sample is a sample having an area of less than 30 mm². In some embodiments, the biological sample is a sample having an area of less than 35 mm². In some embodiments, the biological sample is a sample having an area of less than 40 mm².
  • the tumor content of the biological sample is not limited. In some embodiments, the tumor content of the biological sample is less than the tumor content required in methods of variant calling practiced in the art. In some embodiments, the biological sample is a sample having a tumor content of less than 40%. In some embodiments, the biological sample is a sample having a tumor content of less than 30%. In some embodiments, the biological sample is a sample having a tumor content of less than 20%. In some embodiments, the biological sample is a sample having a tumor content of less than 17%. In some embodiments, the biological sample is a sample having a tumor content of less than 15%. In some embodiments, the biological sample is a sample having a tumor content of less than 12%. In some embodiments, the biological sample is a sample having a tumor content of less than 10%. Methods of determining or calculating tumor content are not limited and may be any suitable method known in the art.
  • Methods of preparing biological specimens and sequencing data sets are not limited and may be any suitable method used in the art.
  • the method comprises one or more steps comprising: 1) receiving a sample (FFPE block or slides); 2) reviewing an H&E stained slide for tumor content; 3) cutting additional slides as necessary; 4) scraping cells from slides into tubes (performing macrodissection if indicated by a pathologist); 5) extracting nucleic acid (fully automated batch process); 6) preparing DNA libraries (largely automated batch process); 7) loading DNA libraries onto a sequencing chip (fully automated); 8) sequencing chips via NGS (fully automated); and 9) data analysis.
  • the biological sample has been stored for at least about 1 year prior to sequencing. In some embodiments, the biological sample has been stored for at least about 2 years prior to sequencing. In some embodiments, the biological sample has been stored for at least about 3 years prior to sequencing. In some embodiments, the biological sample has been stored for at least about 4 years prior to sequencing. In some embodiments, the biological sample has been stored for at least about 5 years prior to sequencing. In some embodiments, the biological sample has been stored for at least about 10 years prior to sequencing.
  • the mutation sequencing data set is low quality. In some embodiments, the mutation sequencing data set comprises less than 1000 reads of the candidate variant of interest. In some embodiments, the mutation sequencing data set comprises less than 600 reads of the candidate variant of interest. In some embodiments, the mutation sequencing data set comprises less than 500 reads of the candidate variant of interest. In some embodiments, the mutation sequencing data set comprises less than 400 reads of the candidate variant of interest. In some embodiments, the mutation sequencing data set comprises less than 300 reads of the candidate variant of interest. In some embodiments, the mutation sequencing data set has been obtained in a batch with a plurality of other sequencing data sets and comprises less than 50% of the average number of reads in the plurality of other sequencing data sets. In some embodiments, no more than 1% of the positions of the sequencing data set have more than 3% of non-reference alignments.
  • the determination of the mutation or mutations in the sample leads to and is followed by particular treatment steps intended to be therapeutic for the identified mutation(s), mutation symptomology, and/or disorder associated with the mutation(s).
  • the treatment steps can be treatment with a suitable agent or agents, including combination treatment regimens.
  • Suitable treatment steps can include, for example, administration of appropriate gene therapy agents, such as TALENs, zinc fingers and/or CRISPR agents and suitable guide sequences, for correcting the mutation(s) in the subject from whom the sample was obtained.
  • Suitable treatment steps can also include chemotherapeutic agents and regimens which are efficacious in treatment of cancers associated with identified mutation(s). The particular treatment will be determined based upon the mutation(s) identified using the methods described herein.
  • any one or more active agents, additives, ingredients, optional agents, types of organism, disorders, subjects, or combinations thereof, can be excluded.
  • claims or description relate to a composition of matter, it is to be understood that methods of making or using the composition of matter according to any of the methods disclosed herein, and methods of using the composition of matter for any of the purposes disclosed herein are aspects of the invention, unless otherwise indicated or unless it would be evident to one of ordinary skill in the art that a contradiction or inconsistency would arise.
  • the invention includes embodiments that relate analogously to any intervening value or range defined by any two values in the series, and that the lowest value may be taken as a minimum and the greatest value may be taken as a maximum.
  • Numerical values include values expressed as percentages. For any embodiment of the invention in which a numerical value is prefaced by “about” or “approximately”, the invention includes an embodiment in which the exact value is recited. For any embodiment of the invention in which a numerical value is not prefaced by “about” or “approximately”, the invention includes an embodiment in which the value is prefaced by “about” or “approximately”.
  • Model parameters were trained using a corpus of paired raw sequencing data and expert review data from comprehensive genomic profiling tests for a consecutive cohort of 3,020 clinical samples spanning 30 tumor types [Adrenal, Biliary, Bladder, Brain, Breast, Cervical, Colon and Rectum, Endometrium, Esophagus, Head and Neck, Kidney, Liver, Lung - NSCLC, Lung - Other, Lymphoma, Melanoma, Meninges, NSCLC, Non-Melanoma Skin, other, Ovary, Pancreas, Prostate, Sarcoma, Small Intestine, Stomach, Thymus, Thyroid, and unknown primary] .
  • SNV single nucleotide variant
  • Companion diagnostic markers (CDx) - mutations with existing clinical indications such as BRAF p.V600E, which may indicate vemurafenib in melanomas
  • the mutation is covered by two independent amplicons, both of which support the mutation:
  • the read variant allele frequency is similar on both strands (range 0-1):
  • the read support is highly strand-biased (variant allele frequencies are different on each strand):
  • Negative example 2 MSH6 chr2:48010515 G>A (FIG. 3)
  • This candidate is typical of sequencing errors that are often present at variant allele frequencies high enough to trigger false positive results in systems that do not consider second-order evidence such as strand bias, multiple amplicon coverage, and prior probabilities.
  • Variant candidates like this with marginal read support can be real - for example, it is common to have biased read coverage such that only one strand has sufficient reads to support analysis; in such cases where read support is not definitive, prior probabilities and multiple amplicon coverage can help resolve such ambiguity. In this case, neither parameter lends support:
  • pan_tumor_prior, pan_tumor_prior_gene, within_tumor_prior, within_tumor_prior_gene ⁇ 0.01
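The second-order evidence discussed in the examples above (per-strand allele frequency agreement and multiple amplicon coverage) can be illustrated with a minimal Python sketch. The `StrandCounts` container, both function names, and the exact definitions (lower strand AF divided by higher strand AF; fraction of amplicons with at least one variant read) are assumptions made for illustration and are not the feature definitions used in the disclosure.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass
class StrandCounts:
    """Reference and alternate read counts on one strand (hypothetical container)."""
    ref: int
    alt: int

    @property
    def af(self) -> float:
        # Variant allele frequency on this strand; 0.0 if the strand has no reads.
        depth = self.ref + self.alt
        return self.alt / depth if depth else 0.0


def strand_af_ratio(forward: StrandCounts, reverse: StrandCounts) -> float:
    """Lower per-strand AF divided by higher per-strand AF, in the range 0-1.

    Values near 1 mean both strands support the variant similarly;
    values near 0 mean the read support is strand-biased.
    """
    low, high = sorted((forward.af, reverse.af))
    return low / high if high > 0 else 0.0


def frac_amps_supporting(alt_reads_by_amplicon: Mapping[str, int], min_alt: int = 1) -> float:
    """Fraction of covering amplicons with at least `min_alt` variant reads."""
    if not alt_reads_by_amplicon:
        return 0.0
    supporting = sum(1 for n in alt_reads_by_amplicon.values() if n >= min_alt)
    return supporting / len(alt_reads_by_amplicon)


# Toy counts, invented for illustration only.
balanced = strand_af_ratio(StrandCounts(ref=90, alt=10), StrandCounts(ref=91, alt=9))
biased = strand_af_ratio(StrandCounts(ref=80, alt=20), StrandCounts(ref=100, alt=0))
two_amp = frac_amps_supporting({"amp_A": 12, "amp_B": 9})
```

A candidate with a high strand ratio and support from every covering amplicon resembles the positive examples; a candidate with a ratio near zero and low priors resembles the negative examples.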

Abstract

Disclosed herein are methods for identifying mutations from a patient sample, by evaluating, using a computer having a machine learning classifier, a candidate variant against a plurality of decision trees trained to detect mutations in the candidate variant with a gradient boosting algorithm.

Description

METHODS FOR IDENTIFYING MUTATIONS USING MACHINE LEARNING
RELATED APPLICATION
[0001] This application claims priority to, and the benefit of, co-pending United States Provisional No. 63/211,891, filed June 17, 2021. The disclosure of said provisional application is hereby incorporated by reference in its entirety.
BACKGROUND OF THE INVENTION
[0002] Next Generation Sequencing (NGS)-based cancer tests (e.g., variant calling) are commonly used in the advanced cancer setting to inform treatment selection of targeted therapy (e.g., due to alterations in ALK, EGFR or BRAF), and for research. While multiple NGS-based companion diagnostic (CDx) cancer tests have now been FDA approved, their uptake has been limited by: (1) single-site approvals (e.g., FoundationOne CDx™ [F1 CDx]) that compete with regional laboratories that wish to better serve their local patient populations, (2) impractical sample input requirements for capture-based NGS IVD tests (e.g., F1 CDx requires tumor surface area >25 mm²) and (3) insufficient content in kitted products (e.g., Oncomine Target Dx, Praxis Extended Ras Panel), which took many years to develop, but when launched, no longer met current medical needs given the ever-expanding list of biomarkers recommended or required by professional organizations and payors.
[0003] Hence, the majority of molecular testing for patients with advanced cancer in the U.S. continues to occur in CLIA-certified laboratories as custom single-gene or NGS laboratory-developed tests (LDTs). In the 2018 B Multigene Tumor Panel CAP proficiency test, of 97, 101 and 101 laboratories reporting testing for EGFR, BRAF and KRAS mutations, respectively, only 23 (24%), 7 (7%) and 2 (2%) used FDA cleared or approved companion diagnostic tests. While LDTs can be developed more quickly to meet changing medical needs, they are at increased risk of producing inaccurate or inconsistent results due to variability in laboratory methods, bioinformatics and data interpretation, and in some cases, inadequate analytical and clinical validation for their intended use. Hence, as NGS tests become more complex, this risk has the potential to increase.
[0004] US 2019/0189242 A1 (Published June 20, 2019), provides a variant calling method utilizing machine learning. However, as detailed in paragraph [0166], machine learning was accomplished using an artificially created "spike-in" training set. Thus, the method of US 2019/0189242 A1 has degraded performance when calling "real world" samples that often comprise low quality samples.
[0005] Thus, there remains a need in the art for standardized and better variant calling methods that can be applied by any laboratory analyzing NGS sequences.
SUMMARY OF THE INVENTION
[0006] Some aspects of the present disclosure are directed to a method for identifying mutations from a patient sample, comprising: evaluating, using a computer having a machine learning classifier, a candidate variant against a plurality of decision trees trained to detect mutations in the candidate variant with a gradient boosting algorithm, wherein each decision tree classifies the candidate variant as at least one of present or not present; and classifying, using the computer, the candidate variant as a real mutation based on classifications of each of the plurality of decision trees, wherein the decision trees receive the following parameters: min_AF_ratio_50; min_alt_count; min_AF_softclip_ratio_99; count; and within_tumor_prior.
[0007] In some embodiments, the decision trees also receive the following parameters: min_AF_ratio_100, min_AF_softclip_ratio_100, min_AF_softclip, min_AF_softclip_ratio_90, and max_alt_count.
[0008] In some embodiments, the decision trees also receive one or more of the following parameters: max_AF_ratio_100, AF, frac_amps_with_evidence, pan_tumor_prior, amplicon_variant_count, AF_frac_pos_max, analysis_variant_count, min_depth, min_AF, pct_noisy_positions, avg_cov_rel_chip, max_AF_ratio_99, AF_softclip, pan_tumor_prior_gene, min_AF_ratio_99, within_tumor_prior_gene, max_AF_ratio_50, min_AF_softclip_ratio_50, max_AF_ratio_90, depth, num_covering_amps, DNA_avg_coverage, max_AF_softclip, frac_max_af, max_depth, max_AF_softclip_ratio_99, min_AF_ratio_90, max_AF, max_AF_softclip_ratio_100, max_AF_softclip_ratio_90, max_softclip_count, best_strand_bias, max_AF_softclip_ratio_50, and min_softclip_count.
[0009] In some embodiments, evaluating the candidate variant further comprises evaluating the candidate variant using a random forest classifier. In some embodiments, the plurality of decision trees further comprises at least one thousand decision trees. In some embodiments, the plurality of decision trees further comprises a plurality of decision trees for each mutation.
[0010] In some embodiments, the method further comprises training the machine learning classifier using a training data set of sequences that include identified mutations. In some embodiments, the training data set of sequences was obtained in part from low quality biological samples and wherein the mutations are identified via expert review.
[0011] In some embodiments, training the machine learning classifier further comprises optimizing parameters of the machine learning classifier until the machine learning classifier produces output describing the known mutations. In some embodiments, optimizing parameters of the machine learning classifier further comprises selecting a plurality of feature categories. In some embodiments, the machine learning classifier is selected from the group consisting of: a neural network, Bayesian classifier, logistic regression, decision tree, gradient-boosted tree, multilayer perceptron, one-vs-rest, and Naive Bayes.
[0012] In some embodiments, the method further comprises generating, using the computer, an overall confidence score for the candidate variant being a real mutation based on the classifications of all of the plurality of decision trees. In some embodiments, the method further comprises providing a report that describes the candidate variant as including the mutation.
[0013] In some embodiments, one or more of the decision trees receive parameters selected from the group consisting of: sample type; FASTQ quality score; alignment score; read coverage; and an estimated probability of error. In some embodiments, the training data set comprises a plurality of known single-nucleotide variants (SNVs) and insertions / deletions (indels), and the method comprises: detecting at least one mutation in the nucleic acid; validating the detected mutation as present in the nucleic acid using the classification model; and providing a report that describes the nucleic acid as including the mutation.
[0014] In some embodiments, a decision tree classifies the candidate variant as present or absent (e.g., likely noise or other sequencing artifact). In some embodiments, the decision tree does not classify whether or not the candidate variant is germline.
[0015] Some aspects of the present disclosure are directed to a system for identifying mutations from a patient sample, comprising: a computer system having a processor, memory and a plurality of lines of instructions; a machine learning classifier executed by the processor of the computer system, the machine learning classifier being configured to: evaluate a candidate variant against the plurality of decision trees trained to detect mutations in the candidate variant, wherein each decision tree classifies the candidate variant as present or absent (e.g., likely noise or other sequencing artifact), based on classifications of each of the plurality of decision trees.
[0016] In some embodiments, the machine learning classifier is further configured to evaluate the candidate variant using a random forest classifier. In some embodiments, the plurality of decision trees further comprises at least one thousand decision trees. In some embodiments, the plurality of decision trees further comprises a plurality of decision trees for each mutation. In some embodiments, the processor is further configured to train the machine learning classifier using a training data set of sequences that include known mutations. In some embodiments, the training data set of sequences was obtained in part from low quality biological samples and wherein the mutations are identified via expert review. In some embodiments, the processor is further configured to optimize parameters of the machine learning classifier until the machine learning classifier produces output describing the known mutations. In some embodiments, the processor is further configured to select a plurality of feature categories. In some embodiments, the machine learning classifier is selected from the group consisting of: a neural network, Bayesian classifier, logistic regression, decision tree, gradient-boosted tree, multilayer perceptron, one-vs-rest, and Naive Bayes. In some embodiments, the processor is further configured to generate an overall confidence score for the candidate variant being a real mutation (vs. sequencing artifact) based on the classifications of all of the plurality of decision trees.
[0017] In some embodiments of the methods and systems described herein, the sample is from plasma, blood, serum, saliva, sputum, stool, a tumor, cell free DNA, circulating tumor cell, or other biological sample. In some embodiments, the sample is from a subject having or at risk of having cancer. In some embodiments, the cancer is selected from lung, bladder, colon, gastric, head and neck, breast, prostate, non-small cell lung adenocarcinoma, non-small cell lung squamous cell carcinoma, bladder urothelial carcinoma, colorectal, brain or pancreatic cancer.
BRIEF DESCRIPTION OF THE DRAWINGS
[0018] The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawings will be provided by the Office upon request and payment of the necessary fee.
[0019] FIG. 1 shows sequence read information for KRAS chr12:25398280 OT.
[0020] FIG. 2 shows sequence read information for BRCA1 chr17:41256098 A>G.
[0021] FIG. 3 shows sequence read information for MSH6 chr2:48010515 G>A.
DETAILED DESCRIPTION OF THE INVENTION
[0022] Some aspects of the present disclosure are directed to a method for identifying mutations from a patient sample, comprising: evaluating, using a computer having a machine learning classifier, a candidate variant against a plurality of decision trees trained to detect mutations in the candidate variant with a gradient boosting algorithm, wherein each decision tree classifies the candidate variant as at least one of present or not present; and classifying, using the computer, the candidate variant as a real mutation based on classifications of each of the plurality of decision trees, wherein the decision trees receive the following parameters: min_AF_ratio_50, min_alt_count, min_AF_softclip_ratio_99, count, and within_tumor_prior.
[0023] In some embodiments, the decision trees also receive one or more of the following parameters: min_AF_ratio_100, min_AF_softclip_ratio_100, min_AF_softclip, min_AF_softclip_ratio_90, and max_alt_count. In some embodiments, the decision trees also receive two or more of the following parameters: min_AF_ratio_100, min_AF_softclip_ratio_100, min_AF_softclip, min_AF_softclip_ratio_90, and max_alt_count. In some embodiments, the decision trees also receive three or more of the following parameters: min_AF_ratio_100, min_AF_softclip_ratio_100, min_AF_softclip, min_AF_softclip_ratio_90, and max_alt_count. In some embodiments, the decision trees also receive four or more of the following parameters: min_AF_ratio_100, min_AF_softclip_ratio_100, min_AF_softclip, min_AF_softclip_ratio_90, and max_alt_count. In some embodiments, the decision trees also receive the following parameters: min_AF_ratio_100, min_AF_softclip_ratio_100, min_AF_softclip, min_AF_softclip_ratio_90, and max_alt_count. In some embodiments, the decision trees also receive min_AF_softclip_ratio_100. In some embodiments, the decision trees also receive min_AF_ratio_100 and min_AF_softclip_ratio_100. In some embodiments, the decision trees also receive min_AF_ratio_100, min_AF_softclip_ratio_100, and min_AF_softclip. In some embodiments, the decision trees also receive min_AF_ratio_100, min_AF_softclip_ratio_100, min_AF_softclip, and min_AF_softclip_ratio_90.
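As a rough illustration of how the required and optional parameters named above might be assembled into a fixed-order input for the decision trees, consider the following Python sketch. The function name `build_feature_vector`, the dictionary-based candidate record, and the use of NaN for absent optional values are all assumptions for illustration; only the parameter names come from the text.

```python
import math

# The five parameters that the decision trees receive in every embodiment.
CORE_FEATURES = (
    "min_AF_ratio_50",
    "min_alt_count",
    "min_AF_softclip_ratio_99",
    "count",
    "within_tumor_prior",
)

# The five additional parameters received in some embodiments.
OPTIONAL_FEATURES = (
    "min_AF_ratio_100",
    "min_AF_softclip_ratio_100",
    "min_AF_softclip",
    "min_AF_softclip_ratio_90",
    "max_alt_count",
)


def build_feature_vector(candidate, optional=OPTIONAL_FEATURES):
    """Return (names, values) in a fixed order for one candidate variant.

    `candidate` is a hypothetical dict mapping parameter name to number.
    Missing core parameters raise an error; missing optional ones become NaN.
    """
    absent = [name for name in CORE_FEATURES if name not in candidate]
    if absent:
        raise KeyError(f"missing core parameters: {absent}")
    names = list(CORE_FEATURES) + list(optional)
    values = [float(candidate.get(name, math.nan)) for name in names]
    return names, values


# Toy candidate record with invented values.
names, values = build_feature_vector({
    "min_AF_ratio_50": 0.8,
    "min_alt_count": 25,
    "min_AF_softclip_ratio_99": 1.0,
    "count": 3,
    "within_tumor_prior": 0.02,
    "max_alt_count": 40,
})
```

Keeping the order fixed matters because tree-based models index features by position, so training and evaluation must agree on it.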
[0024] In some embodiments, the decision trees also receive 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, or all of the following parameters: max_AF_ratio_100, AF, frac_amps_with_evidence, pan_tumor_prior, amplicon_variant_count, AF_frac_pos_max, analysis_variant_count, min_depth, min_AF, pct_noisy_positions, avg_cov_rel_chip, max_AF_ratio_99, AF_softclip, pan_tumor_prior_gene, min_AF_ratio_99, within_tumor_prior_gene, max_AF_ratio_50, min_AF_softclip_ratio_50, max_AF_ratio_90, depth, num_covering_amps, DNA_avg_coverage, max_AF_softclip, frac_max_af, max_depth, max_AF_softclip_ratio_99, min_AF_ratio_90, max_AF, max_AF_softclip_ratio_100, max_AF_softclip_ratio_90, max_softclip_count, best_strand_bias, max_AF_softclip_ratio_50, and min_softclip_count. In some embodiments, the decision trees also receive one or more of the following parameters: max_AF_ratio_100, AF, frac_amps_with_evidence, pan_tumor_prior, amplicon_variant_count, AF_frac_pos_max, analysis_variant_count, min_depth, min_AF, pct_noisy_positions, avg_cov_rel_chip, max_AF_ratio_99, AF_softclip, pan_tumor_prior_gene, min_AF_ratio_99, within_tumor_prior_gene, max_AF_ratio_50, min_AF_softclip_ratio_50, max_AF_ratio_90, depth, num_covering_amps, DNA_avg_coverage, max_AF_softclip, frac_max_af, max_depth, max_AF_softclip_ratio_99, min_AF_ratio_90, max_AF, max_AF_softclip_ratio_100, max_AF_softclip_ratio_90, max_softclip_count, best_strand_bias, max_AF_softclip_ratio_50, and min_softclip_count.
[0025] The machine learning classifier is not limited and may be any suitable machine learning classifier type. Suitable machine learning approaches include any one or more of: supervised learning (e.g., using logistic regression, using back propagation neural networks, using random forests, decision trees, etc.), unsupervised learning (e.g., using an Apriori algorithm, using K-means clustering), semi-supervised learning, a deep learning algorithm (e.g., neural networks, a restricted Boltzmann machine, a deep belief network method, a convolutional neural network method, a recurrent neural network method, stacked auto encoder method, etc.), reinforcement learning (e.g., using a Q-learning algorithm, using temporal difference learning), a regression algorithm (e.g., ordinary least squares, logistic regression, stepwise regression, multivariate adaptive regression splines, locally estimated scatterplot smoothing, etc.), an instance-based method (e.g., k-nearest neighbor, learning vector quantization, self-organizing map, etc.), a regularization method (e.g., ridge regression, least absolute shrinkage and selection operator, elastic net, etc.), a decision tree learning method (e.g., classification and regression tree, iterative dichotomiser 3, C4.5, chi-squared automatic interaction detection, decision stump, random forest, multivariate adaptive regression splines, gradient boosting machines, etc.), a Bayesian method (e.g., naive Bayes, averaged one-dependence estimators, Bayesian belief network, etc.), a kernel method (e.g., a support vector machine, a radial basis function, a linear discriminant analysis, etc.), a clustering method (e.g., k-means clustering, expectation maximization, etc.), an association rule learning algorithm (e.g., an Apriori algorithm, an Eclat algorithm, etc.), an artificial neural network model (e.g., a Perceptron method, a back-propagation method, a Hopfield network method, a self-organizing map method, a learning vector quantization method, etc.), a dimensionality reduction method (e.g., principal component analysis, partial least squares regression, Sammon mapping, multidimensional scaling, projection pursuit, etc.), an ensemble method (e.g., boosting, bootstrapped aggregation, AdaBoost, stacked generalization, gradient boosting machine method, random forest method, etc.), and/or any suitable artificial intelligence approach.
[0026] In some embodiments, the machine learning classifier is selected from the group consisting of: a neural network, Bayesian classifier, logistic regression, decision tree, gradient-boosted tree, multilayer perceptron, one-vs-rest, and Naive Bayes.
[0027] In some embodiments, the machine learning approach is a supervised learning approach. In some embodiments, evaluating the candidate variant with machine learning further comprises evaluating the candidate variant using a random forest classifier.
[0028] In some embodiments, the plurality of decision trees further comprises at least one thousand decision trees. In some embodiments, the plurality of decision trees further comprises at least ten thousand decision trees. In some embodiments, the plurality of decision trees further comprises at least fifty thousand decision trees.
[0029] In some embodiments, the plurality of decision trees further comprises a plurality of decision trees for each mutation. In some embodiments, the plurality of decision trees for each mutation comprises at least 5 decision trees per each mutant. In some embodiments, the plurality of decision trees for each mutation comprises at least 50 decision trees per each mutant. In some embodiments, the plurality of decision trees for each mutation comprises at least 100 decision trees per each mutant. In some embodiments, the plurality of decision trees for each mutation comprises at least 500 decision trees per each mutant. In some embodiments, the plurality of decision trees for each mutation comprises at least 1000 decision trees per each mutant. In some embodiments, the plurality of decision trees for each mutation comprises at least 5000 decision trees per each mutant.
[0030] Some embodiments of the methods disclosed herein further comprise training the machine learning classifier using a training data set of sequences that include identified mutations. In some embodiments, the training data set of sequences was obtained in part from low quality biological samples (e.g., at least 1%, at least 2%, at least 3%, at least 5%, at least 7.5%, at least 10% are low quality biological samples). In some embodiments, the mutations are identified via expert review. In some embodiments, the training data set of sequences was obtained in part from low quality biological samples (e.g., at least 1%, at least 2%, at least 3%, at least 5%, at least 7.5%, at least 10% is low quality) and the mutations are identified via expert review.
[0031] In some embodiments, training the machine learning classifier further comprises optimizing parameters of the machine learning classifier until the machine learning classifier produces output describing the mutations identified in the training set (e.g., the mutations identified by expert review).
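To make the idea of training boosted decision trees against expert-reviewed labels concrete, here is a minimal pure-Python sketch of gradient boosting with single-split trees (decision stumps) under a logistic loss. It is a teaching sketch under stated assumptions, not the training procedure of the disclosure: the function names, the learning rate, the number of trees, and the six toy training rows (two invented features per candidate, label 1 for "real" and 0 for "artifact") are all hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(X, residuals):
    """Find the one-feature threshold split minimising squared error on residuals."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            if not left or not right:
                continue  # a split must put samples on both sides
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, lm, rm)
    return None if best is None else best[1:]


def train_boosted_stumps(X, y, n_trees=50, learning_rate=0.3):
    """Each round fits a stump to the negative gradient of the logistic loss."""
    scores = [0.0] * len(X)
    trees = []
    for _ in range(n_trees):
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        stump = fit_stump(X, residuals)
        if stump is None:
            break
        j, t, lm, rm = stump
        tree = (j, t, learning_rate * lm, learning_rate * rm)
        trees.append(tree)
        scores = [s + (tree[2] if row[j] <= t else tree[3]) for s, row in zip(scores, X)]
    return trees


def predict_score(trees, row):
    """Sum the stump outputs and squash to a 0-1 score."""
    return sigmoid(sum(lv if row[j] <= t else rv for j, t, lv, rv in trees))


# Toy training rows: [variant read count, strand agreement ratio]; labels are invented.
X = [[40, 0.9], [35, 0.8], [50, 0.95], [3, 0.1], [5, 0.05], [4, 0.2]]
y = [1, 1, 1, 0, 0, 0]
model = train_boosted_stumps(X, y)
likely_real = predict_score(model, [45, 0.85])
likely_artifact = predict_score(model, [2, 0.1])
```

In practice a maintained gradient boosting library would be used; the point here is only that each added tree corrects the residual error of the trees before it, which is what "optimizing parameters until the classifier reproduces the expert-reviewed calls" amounts to for this model family.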
[0032] In some embodiments, optimizing parameters of the machine learning classifier further comprises selecting a plurality of feature categories. In some embodiments, at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, or 44 feature categories are selected (e.g., the feature categories provided in Table 2, 3 or 4 below). In some embodiments, the feature categories are selected from min_AF_ratio_50, min_alt_count, min_AF_softclip_ratio_99, count, within_tumor_prior, min_AF_ratio_100, min_AF_softclip_ratio_100, min_AF_softclip, min_AF_softclip_ratio_90, max_alt_count, max_AF_ratio_100, AF, frac_amps_with_evidence, pan_tumor_prior, amplicon_variant_count, AF_frac_pos_max, analysis_variant_count, min_depth, min_AF, pct_noisy_positions, avg_cov_rel_chip, max_AF_ratio_99, AF_softclip, pan_tumor_prior_gene, min_AF_ratio_99, within_tumor_prior_gene, max_AF_ratio_50, min_AF_softclip_ratio_50, max_AF_ratio_90, depth, num_covering_amps, DNA_avg_coverage, max_AF_softclip, frac_max_af, max_depth, max_AF_softclip_ratio_99, min_AF_ratio_90, max_AF, max_AF_softclip_ratio_100, max_AF_softclip_ratio_90, max_softclip_count, best_strand_bias, max_AF_softclip_ratio_50, and min_softclip_count. In some embodiments, the features are the first 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, or 44 provided in Table 3 from the top of the table.
[0033] In some embodiments, the method further comprises generating, using the computer, an overall confidence score for the candidate variant being a real mutation based on the classifications of all of the plurality of decision trees. In some embodiments, the overall confidence score is on a scale of 0-1, with a 1 being a 100% confidence for a real mutation. In some embodiments, a candidate variant is called as a real mutation when the confidence score is greater than 0.3, 0.4, 0.5, 0.6, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 0.98, or 0.99. In some embodiments, wherein the candidate variant is detected in a high quality sample, the candidate variant is called as a real mutation when the confidence score is greater than 0.3. In some embodiments, wherein the candidate variant is detected in a low quality sample, the candidate variant is called as a real mutation when the confidence score is greater than 0.7. In some embodiments, a candidate variant is called as a real mutation when the confidence score is greater than 0.9.
[0034] In some embodiments, one or more of the decision trees receive a read coverage parameter. In some embodiments, one or more of the decision trees receive parameters selected from min_AF_ratio_50; min_alt_count; min_AF_softclip_ratio_99; count; min_AF_softclip; min_AF_softclip_ratio_100; within_tumor_prior; min_AF_ratio_100; max_AF_ratio_100; and max_alt_count.
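The confidence score and calling thresholds of paragraph [0033] can be sketched in a few lines of Python. Two assumptions are made here and should be read as such: the score is computed as the fraction of trees voting "present" (one simple aggregation; a gradient-boosted model would more commonly sum tree outputs and apply a logistic function), and the 0.7 threshold is taken to apply to low quality samples. The function and constant names are hypothetical.

```python
# Thresholds taken from the text: 0.3 for high quality samples,
# 0.7 for low quality samples (the latter reading is an assumption).
CALL_THRESHOLDS = {"high": 0.3, "low": 0.7}


def confidence_from_votes(tree_votes):
    """Fraction of trees that classified the candidate as present (1) vs absent (0)."""
    if not tree_votes:
        raise ValueError("at least one tree vote is required")
    return sum(tree_votes) / len(tree_votes)


def call_variant(score, sample_quality):
    """Call the candidate a real mutation when the score exceeds the threshold."""
    return score > CALL_THRESHOLDS[sample_quality]


# Toy votes from four trees, invented for illustration.
score = confidence_from_votes([1, 1, 0, 0])
called_in_high_quality = call_variant(score, "high")
called_in_low_quality = call_variant(score, "low")
```

With these toy votes the score is 0.5, so the same candidate would be called in a high quality sample but not in a low quality one, which is the practical effect of a quality-dependent threshold.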
[0035] In some embodiments wherein the training data set comprises a plurality of known single-nucleotide variants (SNVs) and indels, the method comprises: detecting at least one SNV or indel in the nucleic acid; validating the detected SNV or indel as present in the nucleic acid using the classification model; and providing a report that describes the nucleic acid as including the SNV or indel.
[0036] In some embodiments wherein the training data set comprises a plurality of known single-nucleotide variants (SNVs), the method comprises: detecting at least one SNV in the nucleic acid; validating the detected SNV as present in the nucleic acid using the classification model; and providing a report that describes the nucleic acid as including the SNV.
[0037] In some embodiments wherein the training data set comprises a plurality of insertions and/or deletions (indels), the method comprises: detecting at least one indel in the nucleic acid; validating the detected indel as present in the nucleic acid using the classification model; and providing a report that describes the nucleic acid as including the indel.
[0038] Some aspects of the present disclosure are directed to a system for identifying mutations from a patient sample, comprising a computer system having a processor, memory and a plurality of lines of instructions; a machine learning classifier executed by the processor of the computer system, the machine learning classifier being configured to: evaluate a candidate variant against the plurality of decision trees trained to detect mutations in the candidate variant, wherein each decision tree classifies the candidate variant as at least one of present or not present; and classify the candidate variant as present or not present based on classifications of each of the plurality of decision trees.
[0039] The machine learning classifier is not limited and may be any suitable machine learning methodology described herein. In some embodiments, the machine learning classifier is selected from the group consisting of: a neural network, Bayesian classifier, logistic regression, decision tree, gradient-boosted tree, multilayer perceptron, one-vs-rest, and Naive Bayes. In some embodiments, the machine learning classifier is further configured to evaluate the candidate variant using a random forest classifier. The number of decision trees is not limited and may be any suitable number described herein. In some embodiments, the plurality of decision trees further comprises at least one thousand decision trees. In some embodiments, the plurality of decision trees further comprises a plurality of decision trees for each mutation.
[0040] In some embodiments, the processor is further configured to train the machine learning classifier using a training data set of sequences that include known mutations. In some embodiments, the training data set of sequences was obtained in part from low quality biological samples and wherein the mutations are identified via expert review.
[0041] As used herein, low quality biological samples are samples comprising significantly degraded nucleotide sequences (e.g., DNA). In some embodiments, the low quality biological samples are from samples that have undergone formalin fixation and paraffin embedding (FFPE). In some embodiments, the low quality biological samples comprise at least 2-fold, 5-fold, 10-fold, or more chemically modified nucleotides or chemical crosslinks than a freshly obtained and untreated biological sample. In some embodiments, the low quality biological samples comprise at least 2-fold, 5-fold, 10-fold, or more degraded sequences (DNA sequences) than a freshly obtained and untreated biological sample.
[0042] In some embodiments, the processor is further configured to optimize parameters of the machine learning classifier until the machine learning classifier produces output describing the known mutations.
[0043] In some embodiments, the processor is further configured to select a plurality of feature categories. The feature categories are not limited and may be any feature categories described herein. In some embodiments, the feature categories are provided in Table 2, 3, or 4. In some embodiments, the feature categories are provided in Table 3.
[0044] In some embodiments, the processor is further configured to generate an overall confidence score for the candidate variant being a real mutation based on the classifications of all of the plurality of decision trees.
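By way of non-limiting illustration, the combination of per-tree classifications into an overall confidence score can be sketched as follows. The sketch assumes two possible conventions that the disclosure does not prescribe: for a gradient-boosted ensemble, each tree contributes a real-valued score that is summed and passed through a logistic function; for a random forest, the score is the fraction of trees voting "present". The function names and the 0.5 threshold are hypothetical.

```python
import math
from typing import Sequence


def boosted_confidence(leaf_scores: Sequence[float], base_score: float = 0.0) -> float:
    """Sum per-tree scores (treated as log-odds) and map the total to a 0-1 confidence."""
    margin = base_score + sum(leaf_scores)
    # Numerically stable logistic function.
    if margin >= 0:
        return 1.0 / (1.0 + math.exp(-margin))
    e = math.exp(margin)
    return e / (1.0 + e)


def vote_confidence(tree_votes: Sequence[bool]) -> float:
    """Fraction of trees that classify the candidate variant as present."""
    if not tree_votes:
        raise ValueError("at least one tree vote is required")
    return sum(tree_votes) / len(tree_votes)


def classify(confidence: float, threshold: float = 0.5) -> str:
    """Illustrative decision rule; the threshold is an assumption."""
    return "present" if confidence >= threshold else "not present"
```

Under either convention, the resulting score lies between 0 and 1 and can be reported alongside the present / not present call.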
[0045] In embodiments of the methods and systems disclosed herein, the biological sample (i.e., sample) is any suitable sample type. In some embodiments, the sample is from plasma, blood, serum, saliva, sputum, stool, a tumor, cell free DNA, circulating tumor cell, or other biological sample. In some embodiments, the sample is a blood sample. In some embodiments, the biological sample is a tumor specimen. In some embodiments, the sample is from a subject having or at risk of having cancer. The type of cancer is not limited and may be any suitable cancer. Exemplary cancers include, but are not limited to, acoustic neuroma; adenocarcinoma; adrenal gland cancer; anal cancer; angiosarcoma (e.g., lymphangiosarcoma, lymphangioendotheliosarcoma, hemangiosarcoma); appendix cancer; benign monoclonal gammopathy; biliary cancer (e.g., cholangiocarcinoma); bladder cancer; breast cancer (e.g., adenocarcinoma of the breast, papillary carcinoma of the breast, mammary cancer, medullary carcinoma of the breast); brain cancer (e.g., meningioma, glioblastomas, glioma (e.g., astrocytoma, oligodendroglioma), medulloblastoma); bronchus cancer; carcinoid tumor; cervical cancer (e.g., cervical adenocarcinoma); choriocarcinoma; chordoma; craniopharyngioma; colorectal cancer (e.g., colon cancer, rectal cancer, colorectal adenocarcinoma); connective tissue cancer; epithelial carcinoma; ependymoma; endotheliosarcoma (e.g., Kaposi’s sarcoma, multiple idiopathic hemorrhagic sarcoma); endometrial cancer (e.g., uterine cancer, uterine sarcoma); esophageal cancer (e.g., adenocarcinoma of the esophagus, Barrett’s adenocarcinoma); Ewing’s sarcoma; eye cancer (e.g., intraocular melanoma, retinoblastoma); familial hypereosinophilia; gall bladder cancer; gastric cancer (e.g., stomach adenocarcinoma); gastrointestinal stromal tumor (GIST); germ cell cancer; head and neck cancer (e.g., head and neck squamous cell carcinoma, oral cancer (e.g., oral squamous cell carcinoma), throat cancer (e.g., laryngeal cancer, pharyngeal cancer, nasopharyngeal cancer, oropharyngeal cancer)); hematopoietic cancers (e.g., leukemia such as acute lymphocytic leukemia (ALL) (e.g., B-cell ALL, T-cell ALL), acute myelocytic leukemia (AML) (e.g., B-cell AML, T-cell AML), chronic myelocytic leukemia (CML) (e.g., B-cell CML, T-cell CML), and chronic lymphocytic leukemia (CLL) (e.g., B-cell CLL, T-cell CLL)); lymphoma such as Hodgkin lymphoma (HL) (e.g., B-cell HL, T-cell HL) and non-Hodgkin lymphoma (NHL) (e.g., B-cell NHL such as diffuse large cell lymphoma (DLCL) (e.g., diffuse large B-cell lymphoma), follicular lymphoma, chronic lymphocytic leukemia/small lymphocytic lymphoma (CLL/SLL), mantle cell lymphoma (MCL), marginal zone B-cell lymphomas (e.g., mucosa-associated lymphoid tissue (MALT) lymphomas, nodal marginal zone B-cell lymphoma, splenic marginal zone B-cell lymphoma), primary mediastinal B-cell lymphoma, Burkitt lymphoma, lymphoplasmacytic lymphoma (i.e., Waldenstrom’s macroglobulinemia), hairy cell leukemia (HCL), immunoblastic large cell lymphoma, precursor B-lymphoblastic lymphoma and primary central nervous system (CNS) lymphoma; and T-cell NHL such as precursor T-lymphoblastic lymphoma/leukemia, peripheral T-cell lymphoma (PTCL) (e.g., cutaneous T-cell lymphoma (CTCL) (e.g., mycosis fungoides, Sezary syndrome), angioimmunoblastic T-cell lymphoma, extranodal natural killer T-cell lymphoma, enteropathy type T-cell lymphoma, subcutaneous panniculitis-like T-cell lymphoma, and anaplastic large cell lymphoma); a mixture of one or more leukemia/lymphoma as described above; and multiple myeloma (MM)), heavy chain disease (e.g., alpha chain disease, gamma chain disease, mu chain disease); hemangioblastoma; hypopharynx cancer; inflammatory myofibroblastic tumors; immunocytic amyloidosis; kidney cancer (e.g., nephroblastoma a.k.a. Wilms’ tumor, renal cell carcinoma); liver cancer (e.g., hepatocellular cancer (HCC), malignant hepatoma); lung cancer (e.g., bronchogenic carcinoma, small cell lung cancer (SCLC), non-small cell lung cancer (NSCLC), adenocarcinoma of the lung); leiomyosarcoma (LMS); mastocytosis (e.g., systemic mastocytosis); muscle cancer; myelodysplastic syndrome (MDS); mesothelioma; myeloproliferative disorder (MPD) (e.g., polycythemia vera (PV), essential thrombocytosis (ET), agnogenic myeloid metaplasia (AMM) a.k.a. myelofibrosis (MF), chronic idiopathic myelofibrosis, chronic myelocytic leukemia (CML), chronic neutrophilic leukemia (CNL), hypereosinophilic syndrome (HES)); neuroblastoma; neurofibroma (e.g., neurofibromatosis (NF) type 1 or type 2, schwannomatosis); neuroendocrine cancer (e.g., gastroenteropancreatic neuroendocrine tumor (GEP-NET), carcinoid tumor); osteosarcoma (e.g., bone cancer); ovarian cancer (e.g., cystadenocarcinoma, ovarian embryonal carcinoma, ovarian adenocarcinoma); papillary adenocarcinoma; pancreatic cancer (e.g., pancreatic adenocarcinoma, intraductal papillary mucinous neoplasm (IPMN), Islet cell tumors); penile cancer (e.g., Paget’s disease of the penis and scrotum); pinealoma; primitive neuroectodermal tumor (PNT); plasma cell neoplasia; paraneoplastic syndromes; intraepithelial neoplasms; prostate cancer (e.g., prostate adenocarcinoma); rectal cancer; rhabdomyosarcoma; salivary gland cancer; skin cancer (e.g., squamous cell carcinoma (SCC), keratoacanthoma (KA), melanoma, basal cell carcinoma (BCC)); small bowel cancer (e.g., appendix cancer); soft tissue sarcoma (e.g., malignant fibrous histiocytoma (MFH), liposarcoma, malignant peripheral nerve sheath tumor (MPNST), chondrosarcoma, fibrosarcoma, myxosarcoma); sebaceous gland carcinoma; small intestine cancer; sweat gland carcinoma; synovioma; testicular cancer (e.g., seminoma, testicular embryonal carcinoma); thyroid cancer (e.g., papillary carcinoma of the thyroid, papillary thyroid carcinoma (PTC), medullary thyroid cancer); urethral cancer; vaginal cancer; and vulvar cancer (e.g., Paget’s disease of the vulva). In some embodiments, the cancer is lung or prostate cancer.
[0046] In some embodiments, the cancer is selected from adrenal, biliary, bladder, brain, breast, cervical, colon and rectum, endometrium, esophagus, head and neck, kidney, liver, lung - NSCLC, lung - Other, lymphoma, melanoma, meninges, NSCLC, non-melanoma skin, ovary, pancreas, prostate, sarcoma, small intestine, stomach, thymus, or thyroid cancer. In some embodiments, the cancer is selected from lung, bladder, colon, gastric, head and neck, breast, prostate, non-small cell lung adenocarcinoma, non-small cell lung squamous cell carcinoma, bladder urothelial carcinoma, colorectal, brain or pancreatic cancer.
[0047] In some embodiments, the biological sample comprises less genomic material than prior mutant calling methods permit. In some embodiments, the biological sample comprises less than about 25 nanograms (ng) of genomic material. In some embodiments, the biological sample comprises less than about 20 ng of genomic material. In some embodiments, the biological sample comprises less than about 15 ng of genomic material. In some embodiments, the biological sample comprises less than about 12 ng of genomic material. In some embodiments, the biological sample comprises less than 10 ng of genomic material. In some embodiments, the biological sample comprises less than 7.5 ng of genomic material. In some embodiments, the biological sample comprises less than 5 ng of genomic material.
[0048] In some embodiments, the biological sample has undergone fixation. The method of fixation is not limited and may be any method of fixation known in the art. In some embodiments, fixation includes formalin fixation, which is known in the art to result in an abundance of C>T mutations thought to be due to deamination.
[0049] The area of the biological sample (mm²) is not limited. In some embodiments, the area of the biological sample is less than the area needed for variant calling methods used in the art. In some embodiments, the biological sample is a sample having an area of less than 5 mm². In some embodiments, the biological sample is a sample having an area of less than 10 mm². In some embodiments, the biological sample is a sample having an area of less than 15 mm². In some embodiments, the biological sample is a sample having an area of less than 20 mm². In some embodiments, the biological sample is a sample having an area of less than 25 mm². In some embodiments, the biological sample is a sample having an area of less than 30 mm². In some embodiments, the biological sample is a sample having an area of less than 35 mm². In some embodiments, the biological sample is a sample having an area of less than 40 mm².
[0050] The tumor content of the biological sample is not limited. In some embodiments, the tumor content of the biological sample is less than the tumor content required in methods of variant calling practiced in the art. In some embodiments, the biological sample is a sample having a tumor content of less than 40%. In some embodiments, the biological sample is a sample having a tumor content of less than 30%. In some embodiments, the biological sample is a sample having a tumor content of less than 20%. In some embodiments, the biological sample is a sample having a tumor content of less than 17%. In some embodiments, the biological sample is a sample having a tumor content of less than 15%. In some embodiments, the biological sample is a sample having a tumor content of less than 12%. In some embodiments, the biological sample is a sample having a tumor content of less than 10%. [0051] Methods of determining or calculating tumor content are not limited and may be any suitable method known in the art.
[0052] Methods of preparing biological specimens and sequencing data sets are not limited and may be any suitable method used in the art. In some embodiments, the method comprises one or more steps comprising 1) receiving a sample (FFPE block or slides); 2) reviewing an H&E stained slide for tumor content; 3) cutting additional slides as necessary; 4) scraping cells from slides into tubes (performing macrodissection if indicated by a pathologist); 5) extracting nucleic acid (fully automated batch process); 6) preparing DNA libraries (largely automated batch process); 7) loading DNA libraries onto a sequencing chip (fully automated); 8) sequencing chips via NGS (fully automated); and 9) data analysis.
[0053] In some embodiments, the biological sample has been stored for at least about 1 year prior to sequencing. In some embodiments, the biological sample has been stored for at least about 2 years prior to sequencing. In some embodiments, the biological sample has been stored for at least about 3 years prior to sequencing. In some embodiments, the biological sample has been stored for at least about 4 years prior to sequencing. In some embodiments, the biological sample has been stored for at least about 5 years prior to sequencing. In some embodiments, the biological sample has been stored for at least about 10 years prior to sequencing.
[0054] In some embodiments, the mutation sequencing data set is low quality. In some embodiments, the mutation sequencing data set comprises less than 1000 reads of the candidate variant of interest. In some embodiments, the mutation sequencing data set comprises less than 600 reads of the candidate variant of interest. In some embodiments, the mutation sequencing data set comprises less than 500 reads of the candidate variant of interest. In some embodiments, the mutation sequencing data set comprises less than 400 reads of the candidate variant of interest. In some embodiments, the mutation sequencing data set comprises less than 300 reads of the candidate variant of interest. In some embodiments, the mutation sequencing data set has been obtained in a batch with a plurality of other sequencing data sets and comprises less than 50% of the average number of reads in the plurality of other sequencing data sets. In some embodiments, no more than 1% of the positions of the sequencing data set have more than 3% of non-reference alignments.
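By way of non-limiting illustration, two of the data-quality measures described above (read count relative to the batch average, and the percentage of positions with more than 3% non-reference alignments) can be sketched as follows. The function names, argument layout, and default cutoffs are hypothetical and merely mirror the percentages stated in the paragraph above; the disclosure does not specify how these quantities are computed.

```python
from typing import Sequence


def is_low_read_sample(sample_reads: int,
                       other_batch_reads: Sequence[int],
                       fraction: float = 0.5) -> bool:
    """True when the sample has fewer than `fraction` of the average reads
    of the other data sets sequenced in the same batch."""
    if not other_batch_reads:
        raise ValueError("batch must contain at least one other data set")
    batch_mean = sum(other_batch_reads) / len(other_batch_reads)
    return sample_reads < fraction * batch_mean


def pct_noisy(non_ref_fractions: Sequence[float],
              noise_cutoff: float = 0.03) -> float:
    """Percentage of positions whose non-reference fraction exceeds the cutoff."""
    if not non_ref_fractions:
        raise ValueError("at least one position is required")
    noisy = sum(1 for f in non_ref_fractions if f > noise_cutoff)
    return 100.0 * noisy / len(non_ref_fractions)
```

For example, a data set in which 1 of 100 positions exceeds the 3% cutoff has 1% noisy positions and therefore still satisfies the "no more than 1%" condition.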
[0055] In some embodiments, the determination of the mutation or mutations in the sample leads to and is followed by particular treatment steps intended to be therapeutic for the identified mutation(s), mutation symptomology, and/or disorder associated with the mutation(s). In such embodiments, the treatment steps can be treatment with a suitable agent or agents, including combination treatment regimens. Suitable treatment steps can include, for example, administration of appropriate gene therapy agents, such as TALENs, zinc fingers and/or CRISPR agents and suitable guide sequences, for correcting the mutation(s) in the subject from whom the sample was obtained. Suitable treatment steps can also include chemotherapeutic agents and regimens which are efficacious in treatment of cancers associated with identified mutation(s). The particular treatment will be determined based upon the mutation(s) identified using the methods described herein.
[0056] The description of embodiments of the disclosure is not intended to be exhaustive or to limit the disclosure to the precise form disclosed. While specific embodiments of, and examples for, the disclosure are described herein for illustrative purposes, various equivalent modifications are possible within the scope of the disclosure, as those skilled in the relevant art will recognize. For example, while method steps or functions are presented in a given order, alternative embodiments may perform functions in a different order, or functions may be performed substantially concurrently. The teachings of the disclosure provided herein can be applied to other procedures or methods as appropriate. The various embodiments described herein can be combined to provide further embodiments. Aspects of the disclosure can be modified, if necessary, to employ the compositions, functions and concepts of the above references and application to provide yet further embodiments of the disclosure. These and other changes can be made to the disclosure in light of the detailed description.
[0057] Specific elements of any of the foregoing embodiments can be combined or substituted for elements in other embodiments. Furthermore, while advantages associated with certain embodiments of the disclosure have been described in the context of these embodiments, other embodiments may also exhibit such advantages, and not all embodiments need necessarily exhibit such advantages to fall within the scope of the disclosure.
[0058] All patents and other publications identified are expressly incorporated herein by reference for the purpose of describing and disclosing, for example, the methodologies described in such publications that might be used in connection with the present invention. These publications are provided solely for their disclosure prior to the filing date of the present application. Nothing in this regard should be construed as an admission that the inventors are not entitled to antedate such disclosure by virtue of prior invention or prior publication, or for any other reason. All statements as to the date or representation as to the contents of these documents are based on the information available to the applicants and do not constitute any admission as to the correctness of the dates or contents of these documents.
[0059] One skilled in the art readily appreciates that the present invention is well adapted to carry out the objects and obtain the ends and advantages mentioned, as well as those inherent therein. The details of the description and the examples herein are representative of certain embodiments, are exemplary, and are not intended as limitations on the scope of the invention. Modifications therein and other uses will occur to those skilled in the art. These modifications are encompassed within the spirit of the invention. It will be readily apparent to a person skilled in the art that varying substitutions and modifications may be made to the invention disclosed herein without departing from the scope and spirit of the invention.
[0060] The articles “a” and “an” as used herein in the specification and in the claims, unless clearly indicated to the contrary, should be understood to include the plural referents. Claims or descriptions that include “or” between one or more members of a group are considered satisfied if one, more than one, or all of the group members are present in, employed in, or otherwise relevant to a given product or process unless indicated to the contrary or otherwise evident from the context. The invention includes embodiments in which exactly one member of the group is present in, employed in, or otherwise relevant to a given product or process. The invention also includes embodiments in which more than one, or all of the group members are present in, employed in, or otherwise relevant to a given product or process. Furthermore, it is to be understood that the invention provides all variations, combinations, and permutations in which one or more limitations, elements, clauses, descriptive terms, etc., from one or more of the listed claims is introduced into another claim dependent on the same base claim (or, as relevant, any other claim) unless otherwise indicated or unless it would be evident to one of ordinary skill in the art that a contradiction or inconsistency would arise. It is contemplated that all embodiments described herein are applicable to all different aspects of the invention where appropriate. It is also contemplated that any of the embodiments or aspects can be freely combined with one or more other such embodiments or aspects whenever appropriate. Where elements are presented as lists, e.g., in Markush group or similar format, it is to be understood that each subgroup of the elements is also disclosed, and any element(s) can be removed from the group. 
It should be understood that, in general, where the invention, or aspects of the invention, is/are referred to as comprising particular elements, features, etc., certain embodiments of the invention or aspects of the invention consist, or consist essentially of, such elements, features, etc. For purposes of simplicity those embodiments have not in every case been specifically set forth in so many words herein. It should also be understood that any embodiment or aspect of the invention can be explicitly excluded from the claims, regardless of whether the specific exclusion is recited in the specification. For example, any one or more active agents, additives, ingredients, optional agents, types of organism, disorders, subjects, or combinations thereof, can be excluded. [0061] Where the claims or description relate to a composition of matter, it is to be understood that methods of making or using the composition of matter according to any of the methods disclosed herein, and methods of using the composition of matter for any of the purposes disclosed herein are aspects of the invention, unless otherwise indicated or unless it would be evident to one of ordinary skill in the art that a contradiction or inconsistency would arise. Where the claims or description relate to a method, e.g., it is to be understood that methods of making compositions useful for performing the method, and products produced according to the method, are aspects of the invention, unless otherwise indicated or unless it would be evident to one of ordinary skill in the art that a contradiction or inconsistency would arise.
[0062] Where ranges are given herein, the invention includes embodiments in which the endpoints are included, embodiments in which both endpoints are excluded, and embodiments in which one endpoint is included and the other is excluded. It should be assumed that both endpoints are included unless indicated otherwise. Furthermore, it is to be understood that unless otherwise indicated or otherwise evident from the context and understanding of one of ordinary skill in the art, values that are expressed as ranges can assume any specific value or subrange within the stated ranges in different embodiments of the invention, to the tenth of the unit of the lower limit of the range, unless the context clearly dictates otherwise. It is also understood that where a series of numerical values is stated herein, the invention includes embodiments that relate analogously to any intervening value or range defined by any two values in the series, and that the lowest value may be taken as a minimum and the greatest value may be taken as a maximum. Numerical values, as used herein, include values expressed as percentages. For any embodiment of the invention in which a numerical value is prefaced by “about” or “approximately”, the invention includes an embodiment in which the exact value is recited. For any embodiment of the invention in which a numerical value is not prefaced by “about” or “approximately”, the invention includes an embodiment in which the value is prefaced by “about” or “approximately”.
[0063] “Approximately” or “about” generally includes numbers that fall within a range of 1% or in some embodiments within a range of 5% of a number or in some embodiments within a range of 10% of a number in either direction (greater than or less than the number) unless otherwise stated or otherwise evident from the context (except where such number would impermissibly exceed 100% of a possible value). It should be understood that, unless clearly indicated to the contrary, in any methods claimed herein that include more than one act, the order of the acts of the method is not necessarily limited to the order in which the acts of the method are recited, but the invention includes embodiments in which the order is so limited. It should also be understood that unless otherwise indicated or evident from the context, any product or composition described herein may be considered “isolated”.
[0064] Examples
[0065] Training Set
[0066] Model parameters were trained using a corpus of paired raw sequencing data and expert review data from comprehensive genomic profiling tests for a consecutive cohort of 3,020 clinical samples spanning 30 tumor types [Adrenal, Biliary, Bladder, Brain, Breast, Cervical, Colon and Rectum, Endometrium, Esophagus, Head and Neck, Kidney, Liver, Lung - NSCLC, Lung - Other, Lymphoma, Melanoma, Meninges, NSCLC, Non-Melanoma Skin, other, Ovary, Pancreas, Prostate, Sarcoma, Small Intestine, Stomach, Thymus, Thyroid, and unknown primary]. Expert review included raw data review and reporting decision arbitration for individual mutations; this review was guided by clinical context, overall sequencing data quality (per sample and per batch), and orthogonal testing results. Notably, the test includes several hundred kilobases of genomic content with redundant coverage generated by two independent library preparations; discordances within these redundant regions were manually reviewed and adjudicated as false positive or false negative based on all supporting evidence. Model training was performed using default parameters as described in the open-source XGBoost library (version 1.2.1). No low-quality data were excluded from training or subsequent performance evaluation; instead, raw quality control metrics were provided as input to the classifier as additional signal for contextualizing the potential significance of individual mutation candidate specific inputs. The feature weights obtained from model training are provided in Table 3 below.
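By way of non-limiting illustration, the general principle of gradient boosting over decision trees can be sketched with the standard library alone. This sketch is not the XGBoost implementation used for the training described above; it fits one-split trees (stumps) to the negative gradient of the log loss, which is a simplified form of the same technique. The two-feature toy data, the learning rate, the number of trees, and all function names are assumptions made solely for illustration and are not taken from the training corpus.

```python
import math
from typing import List, Sequence, Tuple

# (feature index, threshold, value if <= threshold, value if > threshold)
Stump = Tuple[int, float, float, float]
LEARNING_RATE = 0.3  # illustrative value


def sigmoid(x: float) -> float:
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def fit_stump(X: Sequence[Sequence[float]], residuals: Sequence[float]) -> Stump:
    """Pick the single split minimizing squared error against the residuals."""
    best_err, best = math.inf, (0, 0.0, 0.0, 0.0)
    for j in range(len(X[0])):
        for threshold in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= threshold]
            right = [r for row, r in zip(X, residuals) if row[j] > threshold]
            if not left or not right:
                continue
            lv, rv = sum(left) / len(left), sum(right) / len(right)
            err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
            if err < best_err:
                best_err, best = err, (j, threshold, lv, rv)
    return best


def margin(stumps: Sequence[Stump], row: Sequence[float]) -> float:
    """Sum of scaled stump outputs (log-odds of the candidate being real)."""
    return sum(LEARNING_RATE * (lv if row[j] <= t else rv) for j, t, lv, rv in stumps)


def train(X: Sequence[Sequence[float]], y: Sequence[int], n_trees: int = 50) -> List[Stump]:
    stumps: List[Stump] = []
    for _ in range(n_trees):
        # Negative gradient of log loss w.r.t. the margin: label - predicted probability.
        residuals = [yi - sigmoid(margin(stumps, row)) for row, yi in zip(X, y)]
        stumps.append(fit_stump(X, residuals))
    return stumps


def predict_proba(stumps: Sequence[Stump], row: Sequence[float]) -> float:
    return sigmoid(margin(stumps, row))


# Toy rows: [allele-frequency ratio to background, strand-balance score]; values invented.
X_toy = [[73.0, 0.9], [40.0, 0.8], [12.0, 0.95], [0.7, 0.1], [5.0, 0.0], [0.9, 0.3]]
y_toy = [1, 1, 1, 0, 0, 0]  # 1 = confirmed by expert review, 0 = rejected
model = train(X_toy, y_toy)
```

Each added stump corrects the remaining error of the ensemble so far, so the predicted probability for well-supported candidates moves toward 1 and for poorly supported candidates toward 0 as trees accumulate.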
[0067] Prospective Validation
[0068] After model training, performance was validated prospectively in a separate cohort of 3,437 consecutive clinical samples. The results are shown in Table 1. Sensitivity and positive predictive value were evaluated in comparison to the clinically reported results following manual expert review. Performance was evaluated in the following subgroups:
[0069] Mutation type:
[0070] single nucleotide variant (SNV)
[0071] insertion / deletion (INDEL)
[0072] Clinical significance category:
[0073] Companion diagnostic markers (CDx) - mutations with existing clinical indications, such as BRAF p.V600E, which may indicate vemurafenib in melanomas
[0074] Hotspot - specific mutations with known functional, clinical and/or recurrence evidence supporting their inclusion in clinical reporting
[0075] De novo - non-hotspot mutations with clear functional impact via truncated or significantly altered downstream gene transcription
[0076] Variants of Unknown Significance (VUS) - non-hotspot mutations with unknown functional impact and significance
[0077] Table 1
[Table 1 is provided as an image (imgf000020_0001) in the original publication.]
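By way of non-limiting illustration, sensitivity and positive predictive value relative to the clinically reported (expert-reviewed) results can be computed as sketched below. The set-based representation of variant calls and the function names are assumptions; any variant identifiers passed in are placeholders and do not correspond to data in Table 1.

```python
from typing import Set, Tuple


def confusion_counts(model_calls: Set[str], reported: Set[str]) -> Tuple[int, int, int]:
    """Return (true positives, false positives, false negatives)."""
    tp = len(model_calls & reported)   # called by the model and clinically reported
    fp = len(model_calls - reported)   # called by the model but not reported
    fn = len(reported - model_calls)   # reported but missed by the model
    return tp, fp, fn


def sensitivity(tp: int, fn: int) -> float:
    """Fraction of reported mutations that the model also called."""
    if tp + fn == 0:
        raise ValueError("no reported mutations to evaluate")
    return tp / (tp + fn)


def positive_predictive_value(tp: int, fp: int) -> float:
    """Fraction of model calls that were clinically reported."""
    if tp + fp == 0:
        raise ValueError("no model calls to evaluate")
    return tp / (tp + fp)
```

The same functions can be applied separately to each subgroup (e.g., SNV, INDEL, CDx, hotspot, de novo, VUS) by restricting both sets to variants in that subgroup.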
[0078] Example Calculations
[0079] Positive example - KRAS chr12:25398280 C>T (FIG. 1)
[0080] Several factors support the significance of this KRAS p.G13D mutation, which yielded an overall score of 0.9998 (range 0-1) - see details in table below:
[0081] The mutation is covered by two independent amplicons, both of which support the mutation:
[0082] num_covering_amps=2
[0083] frac_amps_with_evidence=1
[0084] The read evidence is well above expected background error for that specific nucleotide change
[0085] min_AF_ratio_100=73.261
[0086] The read variant allele frequency is similar on both strands (range 0-1):
[0087] best_strand_bias=0.919
[0088] Both pan tumor and within tumor prior probabilities are elevated for both this specific mutation and gene:
[0089] within_tumor_prior=0.078
[0090] within_tumor_prior_gene=0.488
[0091] Negative example 1 - BRCA1 chr17:41256098 A>G (FIG. 2)
[0092] Although the raw variant allele frequency (AF) of 14.6% is relatively high, several factors argue against this mutation candidate:
[0093] Allele frequencies for both the forward and reverse strands are less than the expected background:
[0094] max_AF_ratio_100=0.711
[0095] The read support is highly strand-biased (variant allele frequencies are different on each strand):
[0096] best_strand_bias=0.102
[0097] There are many mutation candidates of the same nucleotide change (A>G) present on the same amplicon:
[0098] amplicon_variant_count=13
[0099] Negative example 2: MSH6 chr2:48010515 G>A (FIG. 3)
[0100] This candidate is typical of sequencing errors that are often present at variant allele frequencies high enough to trigger false positive results in systems that do not consider second-order evidence such as strand bias, multiple amplicon coverage, and prior probabilities. Though the local sequencing data quality is high (nearly all bases are colored pink or purple, indicating match to the expected reference sequence), and one strand is well above the expected background error (max_AF_ratio_100=5.262), there is no supporting evidence on the reverse strand reads - this manifests in several different parameters:
[0101] best_strand_bias=0
[0102] min_AF=0
[0103] min_alt_count=0
[0104] Variant candidates like this with marginal read support can be real - for example, it is common to have biased read coverage such that only one strand has sufficient reads to support analysis; in such cases where read support is not definitive, prior probabilities and multiple amplicon coverage can help resolve such ambiguity. In this case, neither parameter lends support:
[0105] num_covering_amps=1
[0106] pan_tumor_prior, pan_tumor_prior_gene, within_tumor_prior, within_tumor_prior_gene < 0.01
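By way of non-limiting illustration, strand-level evidence of the kind discussed in the examples above can be derived from forward and reverse read counts as sketched below. The formulas are assumptions chosen to be consistent with the narrative (a strand-balance score near 1 when allele frequencies agree on both strands and 0 when one strand has no supporting reads); they are not the definitions of the feature key, and the dictionary keys, class name, and background value are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class StrandCounts:
    alt: int    # reads supporting the variant allele on this strand
    depth: int  # total reads covering the position on this strand

    @property
    def af(self) -> float:
        """Variant allele frequency on this strand (0 when uncovered)."""
        return self.alt / self.depth if self.depth else 0.0


def strand_features(fwd: StrandCounts, rev: StrandCounts,
                    background_af: float) -> Dict[str, float]:
    """Summarize two-strand evidence; `background_af` is an assumed expected
    error rate for the specific nucleotide change."""
    if background_af <= 0:
        raise ValueError("background_af must be positive")
    lo, hi = min(fwd.af, rev.af), max(fwd.af, rev.af)
    return {
        "min_AF": lo,
        "max_AF": hi,
        "min_alt_count": min(fwd.alt, rev.alt),
        # Assumed balance score: 1.0 = identical AF on both strands, 0.0 = one-sided.
        "strand_balance": lo / hi if hi > 0 else 0.0,
        "min_AF_ratio": lo / background_af,
        "max_AF_ratio": hi / background_af,
    }
```

With a candidate supported on only one strand, such as `strand_features(StrandCounts(20, 200), StrandCounts(0, 180), 0.02)`, the minimum allele frequency, minimum alternate count, and balance score are all zero even though the better strand lies well above background, which mirrors the pattern described for negative example 2.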
[0107] The results for positive example 1 and negative examples 1-2 are shown in Table 2:
[Table 2 is provided as images (imgf000021_0001, imgf000022_0001, imgf000023_0001) in the original publication.]
Table 3 - Features with weight
[Table 3 is provided as images (imgf000023_0002, imgf000024_0001) in the original publication.]
Table 4 - Feature Key
[Table 4 is provided as images (imgf000024_0002, imgf000025_0001, imgf000026_0001) in the original publication.]

CLAIMS What is claimed is:
1. A method for identifying mutations from a patient sample, comprising: evaluating, using a computer having a machine learning classifier, a candidate variant against a plurality of decision trees trained to detect mutations in the candidate variant with a gradient boosting algorithm, wherein each decision tree classifies the candidate variant as at least one of present or not present; and classifying, using the computer, the candidate variant as present or not present based on classifications of each of the plurality of decision trees, wherein the decision trees receive the following parameters: min_AF_ratio_50; min_alt_count; min_AF_softclip_ratio_99; count; and within_tumor_prior.
2. The method of claim 1, wherein the decision trees also receive the following parameters: min_AF_ratio_100, min_AF_softclip_ratio_100, min_AF_softclip, min_AF_softclip_ratio_90, and max_alt_count.
3. The method of any one of claims 1-2, wherein the decision trees also receive one or more of the following parameters: max_AF_ratio_100, AF, frac_amps_with_evidence, pan_tumor_prior, amplicon_variant_count, AF_frac_pos_max, analysis_variant_count, min_depth, min_AF, pct_noisy_positions, avg_cov_rel_chip, max_AF_ratio_99, AF_softclip, pan_tumor_prior_gene, min_AF_ratio_99, within_tumor_prior_gene, max_AF_ratio_50, min_AF_softclip_ratio_50, max_AF_ratio_90, depth, num_covering_amps, DNA_avg_coverage, max_AF_softclip, frac_max_af, max_depth, max_AF_softclip_ratio_99, min_AF_ratio_90, max_AF, max_AF_softclip_ratio_100, max_AF_softclip_ratio_90, max_softclip_count, best_strand_bias, max_AF_softclip_ratio_50, and min_softclip_count.
4. The method of any one of claims 1-3, wherein evaluating the candidate variant further comprises evaluating the candidate variant using a random forest classifier.
5. The method of claim 4, wherein the plurality of decision trees further comprises at least one thousand decision trees.
6. The method of claim 4, wherein the plurality of decision trees further comprises a plurality of decision trees for each mutation.
7. The method of any one of claims 1-3, further comprising training the machine learning classifier using a training data set of sequences that include identified mutations.
8. The method of claim 7, wherein the training data set of sequences was obtained in part from low quality biological samples and wherein the mutations are identified via expert review.
9. The method of any one of claims 7-8, wherein training the machine learning classifier further comprises optimizing parameters of the machine learning classifier until the machine learning classifier produces output describing the known mutations.
10. The method of claim 9, wherein optimizing parameters of the machine learning classifier further comprises selecting a plurality of feature categories.
11. The method of any one of claims 1-3, wherein the machine learning classifier is selected from the group consisting of: a neural network, Bayesian classifier, logistic regression, decision tree, gradient-boosted tree, multilayer perceptron, one-vs-rest, and Naive Bayes.
12. The method of any one of claims 1-3, further comprising generating, using the computer, an overall confidence score for the candidate variant being a real mutation based on the classifications of all of the plurality of decision trees.
13. The method of any one of claims 1-3, further comprising providing a report that describes the candidate variant as including the mutation or structural alteration.
14. The method of claim 13, wherein describing the structural alteration comprises: comparing the sequence reads to a reference to detect an indicium of the structural alteration; and validating the structural alteration as present in the nucleic acid using the classification model.
15. The method of any one of claims 1-14, wherein one or more of the decision trees receive parameters selected from the group consisting of: sample type; FASTQ quality score; alignment score; read coverage; and an estimated probability of error.
16. The method of any one of claims 7-10, wherein the training data set comprises a plurality of known single-nucleotide variants (SNVs) and insertions/deletions (Indels), the method comprising: detecting at least one SNV or Indel in the nucleic acid; validating the detected SNV or Indel as present in the nucleic acid using the classification model; and providing a report that describes the nucleic acid as including the SNV or Indel.
17. A system for identifying mutations from a patient sample, comprising: a computer system having a processor, memory and a plurality of lines of instructions; a machine learning classifier executed by the processor of the computer system, the machine learning classifier being configured to: evaluate a candidate variant against the plurality of decision trees trained to detect mutations in the candidate variant, wherein each decision tree classifies the candidate variant as at least one of present or not present; and classify the candidate variant as a real mutation based on classifications of each of the plurality of decision trees.
18. The system of claim 17, wherein the machine learning classifier is further configured to evaluate the candidate variant using a random forest classifier.
19. The system of any one of claims 17-18, wherein the plurality of decision trees further comprises at least one thousand decision trees.
20. The system of any one of claims 17-19, wherein the plurality of decision trees further comprises a plurality of decision trees for each mutation.
21. The system of any one of claims 17-20, wherein the processor is further configured to train the machine learning classifier using a training data set of sequences that include known mutations.
22. The system of claim 21, wherein the training data set of sequences was obtained in part from low quality biological samples and wherein the mutations are identified via expert review.
23. The system of any one of claims 17-22, wherein the processor is further configured to optimize parameters of the machine learning classifier until the machine learning classifier produces output describing the known mutations.
24. The system of any one of claims 17-23, wherein the processor is further configured to select a plurality of feature categories.
25. The system of claim 17, wherein the machine learning classifier is selected from the group consisting of: a neural network, Bayesian classifier, logistic regression, decision tree, gradient-boosted tree, multilayer perceptron, one-vs-rest, and Naive Bayes.
26. The system of any one of claims 17-25, wherein the processor is further configured to generate an overall confidence score for the candidate variant being a real mutation based on the classifications of all of the plurality of decision trees.
27. The method of claim 1 or the system of claim 17, wherein the sample is from plasma, blood, serum, saliva, sputum, stool, a tumor, cell free DNA, circulating tumor cell, or other biological sample.
28. The method or system of claim 27, wherein the sample is from a subject having or at risk of having cancer.
29. The method or system of claim 28, wherein the cancer is selected from lung, bladder, colon, gastric, head and neck, breast, prostate, non-small cell lung adenocarcinoma, non-small cell lung squamous cell carcinoma, bladder urothelial carcinoma, colorectal, brain or pancreatic cancer.
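By way of illustration only, the per-tree classification and overall confidence score recited in claims 12, 17 and 26 can be sketched as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation: each "tree" is reduced to a hypothetical single-split stump, the thresholds and the 0.5 cutoff are invented for the example, and only the feature names (AF, depth, max_alt_count, min_AF) come from claims 2-3.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence

Features = Dict[str, float]
# A "tree" here is any callable that votes True ("present") or False ("not present").
Tree = Callable[[Features], bool]


def make_stump(feature: str, threshold: float) -> Tree:
    """Hypothetical one-split tree: votes 'present' when feature >= threshold."""
    def vote(features: Features) -> bool:
        return features[feature] >= threshold
    return vote


@dataclass
class Call:
    confidence: float  # fraction of trees voting "present"
    is_real: bool      # True when confidence reaches the cutoff


def classify_candidate(features: Features, trees: Sequence[Tree],
                       cutoff: float = 0.5) -> Call:
    """Aggregate per-tree votes into a confidence score and a final call.

    The cutoff is an assumption made for this sketch; the claims do not
    specify how the per-tree classifications are combined.
    """
    if not trees:
        raise ValueError("at least one decision tree is required")
    votes = sum(1 for tree in trees if tree(features))
    confidence = votes / len(trees)
    return Call(confidence=confidence, is_real=confidence >= cutoff)


# Hypothetical ensemble and candidate variant (all values are illustrative).
trees = [
    make_stump("AF", 0.05),
    make_stump("depth", 100),
    make_stump("max_alt_count", 5),
    make_stump("min_AF", 0.01),
]
candidate = {"AF": 0.12, "depth": 250, "max_alt_count": 3, "min_AF": 0.02}

call = classify_candidate(candidate, trees)
print(call)  # Call(confidence=0.75, is_real=True)
```

In a random forest as recited in claims 4-5, the stumps would be replaced by full decision trees (for example, one thousand or more) learned from the training data set; the vote-fraction aggregation shown here would remain one plausible way to derive the confidence score.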
PCT/US2022/034115 2021-06-17 2022-06-17 Methods for identifying mutations using machine learning WO2022266518A2 (en)

Priority Applications (5)

Application Number Priority Date Filing Date Title
EP22825947.9A EP4356319A2 (en) 2021-06-17 2022-06-17 Methods for identifying mutations using machine learning
AU2022292749A AU2022292749A1 (en) 2021-06-17 2022-06-17 Methods for identifying mutations using machine learning
US18/571,652 US20240290422A1 (en) 2021-06-17 2022-06-17 Methods for identifying mutations using machine learning
IL309473A IL309473A (en) 2021-06-17 2022-06-17 Methods for identifying mutations using machine learning
CA3224548A CA3224548A1 (en) 2021-06-17 2022-06-17 Methods for identifying mutations using machine learning

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202163211891P 2021-06-17 2021-06-17
US63/211,891 2021-06-17

Publications (2)

Publication Number Publication Date
WO2022266518A2 (en) 2022-12-22
WO2022266518A3 (en) 2023-01-26

Family

ID=84526760

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2022/034115 WO2022266518A2 (en) 2021-06-17 2022-06-17 Methods for identifying mutations using machine learning

Country Status (6)

Country Link
US (1) US20240290422A1 (en)
EP (1) EP4356319A2 (en)
AU (1) AU2022292749A1 (en)
CA (1) CA3224548A1 (en)
IL (1) IL309473A (en)
WO (1) WO2022266518A2 (en)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3728642A4 (en) * 2017-12-18 2021-09-15 Personal Genome Diagnostics Inc. Machine learning system and method for somatic mutation discovery

Also Published As

Publication number Publication date
WO2022266518A3 (en) 2023-01-26
EP4356319A2 (en) 2024-04-24
AU2022292749A1 (en) 2024-01-18
US20240290422A1 (en) 2024-08-29
IL309473A (en) 2024-02-01
CA3224548A1 (en) 2022-12-22

Similar Documents

Publication Publication Date Title
Hayes et al. Gene expression profiling reveals reproducible human lung adenocarcinoma subtypes in multiple independent patient cohorts
JP2024019413A (en) Ultrasensitive detection of circulating tumor DNA through genome-wide integration
Tan et al. Ensemble machine learning on gene expression data for cancer classification
US20210292845A1 (en) Identifying methylation patterns that discriminate or indicate a cancer condition
JP2024119880A (en) Cancer Classification with Synthetic Training Samples
US20200219587A1 (en) Systems and methods for using fragment lengths as a predictor of cancer
US20240249798A1 (en) Systems and methods for enriching for cancer-derived fragments using fragment size
US20220081724A1 (en) Methods of detecting and treating subjects with checkpoint inhibitor-responsive cancer
US20230140123A1 (en) Systems and methods for classifying and treating homologous repair deficiency cancers
CN106778073A (en) A kind of method and system for assessing tumor load change
US20230242975A1 (en) Methods and systems for distinguishing somatic genomic sequences from germline genomic sequences
US20240290422A1 (en) Methods for identifying mutations using machine learning
US20240312561A1 (en) Optimization of sequencing panel assignments
KR102470937B1 (en) Biomarker-searching devices and methods that can predict the effectiveness and overall survival of ICI treatment for cancer patients using network-based machine learning techniques
WO2024050366A1 (en) Systems and methods for classifying and treating homologous repair deficiency cancers
WO2019016353A1 (en) Classifying somatic mutations from heterogeneous sample
WO2024238750A2 (en) Clonal hematopoiesis burden as a biomarker for immune checkpoint inhibitor response
Aljouie Cancer Risk Prediction with Whole Exome Sequencing and Machine Learning
US20220336044A1 (en) Read-Tier Specific Noise Models for Analyzing DNA Data
Nic Fisk Computational and Evolutionary Approaches for Translational Cancer Research
Bang Oral Cancer Genomics Data Mining and Integration for Predictive Therapeutics
WO2024215498A1 (en) Method for detecting patients with systematically under-estimated tumor mutational burden who may benefit from immunotherapy
Bruno et al. Classification and Survival Prediction in Diffuse Large B-Cell Lymphoma by Gene Expression Profiling
SK882023A3 (en) Methods and system for detecting microsatellite instability from sequenced free circulating DNA
Li Integration and Development of Machine Learning Methodologies to Improve the Power of Genome-Wide Association Studies

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22825947

Country of ref document: EP

Kind code of ref document: A2

WWE Wipo information: entry into national phase

Ref document number: 309473

Country of ref document: IL

WWE Wipo information: entry into national phase

Ref document number: P6003285/2023

Country of ref document: AE

Ref document number: 3224548

Country of ref document: CA

WWE Wipo information: entry into national phase

Ref document number: 2022292749

Country of ref document: AU

Ref document number: 806851

Country of ref document: NZ

Ref document number: AU2022292749

Country of ref document: AU

REG Reference to national code

Ref country code: BR

Ref legal event code: B01A

Ref document number: 112023026730

Country of ref document: BR

WWE Wipo information: entry into national phase

Ref document number: 2022825947

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 2022292749

Country of ref document: AU

Date of ref document: 20220617

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: 523451976

Country of ref document: SA

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22825947

Country of ref document: EP

Kind code of ref document: A2

ENP Entry into the national phase

Ref document number: 2022825947

Country of ref document: EP

Effective date: 20240117

REG Reference to national code

Ref country code: BR

Ref legal event code: B01E

Ref document number: 112023026730

Country of ref document: BR

Free format text: Submit the specification and drawings in accordance with the international application as originally filed, since they have not been submitted to date. The requirement must be answered within 60 (sixty) days of its publication and must be made by means of the GRU petition, service code 207.

NENP Non-entry into the national phase

Ref country code: JP

ENPW Started to enter national phase and was withdrawn or failed for other reasons

Ref document number: 112023026730

Country of ref document: BR

Free format text: Application withdrawn from the Brazilian national phase for failure to comply with the requirement published in RPI 2776 of 19/03/2024, pursuant to Art. 28, para. 1 of PORTARIA/INPI/NO 39/2021

WWW Wipo information: withdrawn in national office

Ref document number: 2022825947

Country of ref document: EP