WO2022266518A2 - Methods for identifying mutations using machine learning - Google Patents
- Publication number
- WO2022266518A2 (PCT/US2022/034115)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- ratio
- max
- decision trees
- machine learning
- softclip
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/20—Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/10—Machine learning using kernel methods, e.g. support vector machines [SVM]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/20—Ensemble learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/01—Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N7/00—Computing arrangements based on specific mathematical models
- G06N7/01—Probabilistic graphical models, e.g. probabilistic networks
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/20—Supervised data analysis
Definitions
- NGS Next Generation Sequencing
- Although NGS-based companion diagnostic (CDx) cancer tests have now been FDA approved, their uptake has been limited by: (1) single-site approvals (e.g., FoundationOne CDx™ [F1CDx]) that compete with regional laboratories that wish to better serve their local patient populations, (2) impractical sample input requirements for capture-based NGS IVD tests (e.g., F1CDx requires tumor surface area >25 mm²), and (3) insufficient content in kitted products (e.g., Oncomine Target Dx, Kir Extended Ras Panel), which took many years to develop but, when launched, no longer met current medical needs given the ever-expanding list of biomarkers recommended or required by professional organizations and payors.
- US 2019/0189242 A1 (published June 20, 2019) provides a variant calling method utilizing machine learning. However, as detailed in paragraph [0166], machine learning was accomplished using an artificially created "spike-in" training set. Thus, the method of US 2019/0189242 A1 has degraded performance when calling "real world" samples, which are often of low quality.
- Some aspects of the present disclosure are directed to a method for identifying mutations from a patient sample, comprising: evaluating, using a computer having a machine learning classifier, a candidate variant against a plurality of decision trees trained to detect mutations in the candidate variant with a gradient boosting algorithm, wherein each decision tree classifies the candidate variant as at least one of present or not present; and classifying, using the computer, the candidate variant as a real mutation based on classifications of each of the plurality of decision trees, wherein the decision trees receive the following parameters: min_AF_ratio_50; min_alt_count; min_AF_softclip_ratio_99; count; and within_tumor_prior.
- the decision trees also receive the following parameters: min_AF_ratio_100, min_AF_softclip_ratio_100, min_AF_softclip, min_AF_softclip_ratio_90, and max_alt_count.
- the decision trees also receive one or more of the following parameters: max_AF_ratio_100, AF, frac_amps_with_evidence, pan_tumor_prior, amplicon_variant_count, AF_frac_pos_max, analysis_variant_count, min_depth, min_AF, pct_noisy_positions, avg_cov_rel_chip, max_AF_ratio_99, AF_softclip, pan_tumor_prior_gene, min_AF_ratio_99, within_tumor_prior_gene, max_AF_ratio_50, min_AF_softclip_ratio_50, max_AF_ratio_90, depth, num_covering_amps, DNA_avg_coverage, max_AF_softclip, frac_max_af, max_depth, max_AF_softclip_rati
- evaluating the candidate variant further comprises evaluating the candidate variant using a random forest classifier.
- the plurality of decision trees further comprises at least one thousand decision trees.
- the plurality of decision trees further comprises a plurality of decision trees for each mutation.
- the method further comprises training the machine learning classifier using a training data set of sequences that include identified mutations.
- the training data set of sequences was obtained in part from low quality biological samples and wherein the mutations are identified via expert review.
- training the machine learning classifier further comprises optimizing parameters of the machine learning classifier until the machine learning classifier produces output describing the known mutations.
- optimizing parameters of the machine learning classifier further comprises selecting a plurality of feature categories.
- the machine learning classifier is selected from the group consisting of: a neural network, Bayesian classifier, logistic regression, decision tree, gradient-boosted tree, multilayer perceptron, one-vs-rest, and Naive Bayes.
- the method further comprises generating, using the computer, an overall confidence score for the candidate variant being a real mutation based on the classifications of all of the plurality of decision trees. In some embodiments, the method further comprises providing a report that describes the candidate variant as including the mutation.
- one or more of the decision trees receive parameters selected from the group consisting of: sample type; FASTQ quality score; alignment score; read coverage; and an estimated probability of error.
- the training data set comprises a plurality of known single-nucleotide variants (SNVs) and insertions/deletions (indels), the method comprising: detecting at least one mutation in the nucleic acid; validating the detected mutation as present in the nucleic acid using the classification model; and providing a report that describes the nucleic acid as including the mutation.
- a decision tree classifies the candidate variant as present or absent (e.g., likely noise or other sequencing artifact). In some embodiments, the decision tree does not classify whether or not the candidate variant is germline.
- Some aspects of the present disclosure are directed to a system for identifying mutations from a patient sample, comprising: a computer system having a processor, memory and a plurality of lines of instructions; a machine learning classifier executed by the processor of the computer system, the machine learning classifier being configured to: evaluate a candidate variant against the plurality of decision trees trained to detect mutations in the candidate variant, wherein each decision tree classifies the candidate variant as present or absent (e.g., likely noise or other sequencing artifact), based on classifications of each of the plurality of decision trees.
- the machine learning classifier is further configured to evaluate the candidate variant using a random forest classifier.
- the plurality of decision trees further comprises at least one thousand decision trees.
- the plurality of decision trees further comprises a plurality of decision trees for each mutation.
- the processor is further configured to train the machine learning classifier using a training data set of sequences that include known mutations.
- the training data set of sequences was obtained in part from low quality biological samples and wherein the mutations are identified via expert review.
- the processor is further configured to optimize parameters of the machine learning classifier until the machine learning classifier produces output describing the known mutations.
- the processor is further configured to select a plurality of feature categories.
- the machine learning classifier is selected from the group consisting of: a neural network, Bayesian classifier, logistic regression, decision tree, gradient-boosted tree, multilayer perceptron, one-vs-rest, and Naive Bayes.
- the processor is further configured to generate an overall confidence score for the candidate variant being a real mutation (vs. sequencing artifact) based on the classifications of all of the plurality of decision trees.
- the sample is from plasma, blood, serum, saliva, sputum, stool, a tumor, cell free DNA, circulating tumor cell, or other biological sample.
- the sample is from a subject having or at risk of having cancer.
- the cancer is selected from lung, bladder, colon, gastric, head and neck, breast, prostate, non-small cell lung adenocarcinoma, non-small cell lung squamous cell carcinoma, bladder urothelial carcinoma, colorectal, brain or pancreatic cancer.
- FIG. 1 shows sequence read information for KRAS chr12:25398280 C>T.
- FIG. 2 shows sequence read information for BRCA1 chr17:41256098 A>G.
- FIG. 3 shows sequence read information for MSH6 chr2:48010515 G>A.
- Some aspects of the present disclosure are directed to a method for identifying mutations from a patient sample, comprising: evaluating, using a computer having a machine learning classifier, a candidate variant against a plurality of decision trees trained to detect mutations in the candidate variant with a gradient boosting algorithm, wherein each decision tree classifies the candidate variant as at least one of present or not present; and classifying, using the computer, the candidate variant as a real mutation based on classifications of each of the plurality of decision trees, wherein the decision trees receive the following parameters: min_AF_ratio_50, min_alt_count, min_AF_softclip_ratio_99, count, and within_tumor_prior.
- the decision trees also receive one or more of the following parameters: min_AF_ratio_100, min_AF_softclip_ratio_100, min_AF_softclip, min_AF_softclip_ratio_90, and max_alt_count. In some embodiments, the decision trees also receive two or more of the following parameters: min_AF_ratio_100, min_AF_softclip_ratio_100, min_AF_softclip, min_AF_softclip_ratio_90, and max_alt_count.
- the decision trees also receive three or more of the following parameters: min_AF_ratio_100, min_AF_softclip_ratio_100, min_AF_softclip, min_AF_softclip_ratio_90, and max_alt_count. In some embodiments, the decision trees also receive four or more of the following parameters: min_AF_ratio_100, min_AF_softclip_ratio_100, min_AF_softclip, min_AF_softclip_ratio_90, and max_alt_count. In some embodiments, the decision trees also receive the following parameters: min_AF_ratio_100, min_AF_softclip_ratio_100, min_AF_softclip, min_AF_softclip_ratio_90, and max_alt_count.
- the decision trees also receive min_AF_softclip_ratio_100. In some embodiments, the decision trees also receive min_AF_ratio_100 and min_AF_softclip_ratio_100. In some embodiments, the decision trees also receive min_AF_ratio_100, min_AF_softclip_ratio_100, and min_AF_softclip. In some embodiments, the decision trees also receive min_AF_ratio_100, min_AF_softclip_ratio_100, min_AF_softclip, and min_AF_softclip_ratio_90.
- the decision trees also receive 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
- max_AF_ratio_100 max_AF_ratio_100, AF, frac_amps_with_evidence, pan_tumor_prior, amplicon_variant_count, AF_frac_pos_max, analysis_variant_count, min_depth, min_AF, pct_noisy_positions, avg_cov_rel_chip, max_AF_ratio_99, AF_softclip, pan_tumor_prior_gene, min_AF_ratio_99, within_tumor_prior_gene, max_AF_ratio_50, min_AF_softclip_ratio_50, max_AF_ratio_90, depth, num_covering_amps, DNA_avg_coverage, max_AF_softclip, frac_max_af,
- the decision trees also receive one or more of the following parameters: max_AF_ratio_100, AF, frac_amps_with_evidence, pan_tumor_prior, amplicon_variant_count, AF_frac_pos_max, analysis_variant_count, min_depth, min_AF, pct_noisy_positions, avg_cov_rel_chip, max_AF_ratio_99, AF_softclip, pan_tumor_prior_gene, min_AF_ratio_99, within_tumor_prior_gene, max_AF_ratio_50, min_AF_softclip_ratio_50, max_AF_ratio_90, depth, num_covering_amps, DNA_avg_coverage, max_AF_softclip, frac_max_af, max_depth, max_AF_softclip_ratio_99,
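The parameter names listed above are not defined in this section. As an illustration only, the following minimal Python sketch shows one plausible way a few of them (AF, depth, min_AF, min_alt_count, max_alt_count, min_depth, max_depth, num_covering_amps, frac_amps_with_evidence) could be derived from per-amplicon read counts. The `AmpliconObservation` type, the `summarize_candidate` function, and the interpretation of each feature as a per-amplicon minimum, maximum, or fraction are assumptions, not definitions taken from the disclosure.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AmpliconObservation:
    """Read counts at the candidate position for one amplicon (hypothetical type)."""
    depth: int      # total reads covering the position in this amplicon
    alt_count: int  # reads supporting the alternate allele in this amplicon


def summarize_candidate(amplicons: List[AmpliconObservation]) -> Dict[str, float]:
    """Aggregate per-amplicon counts into a few named features.

    Assumption: 'min_'/'max_' features are taken across covering amplicons.
    """
    covered = [a for a in amplicons if a.depth > 0]
    if not covered:
        raise ValueError("candidate position has no coverage")
    per_amp_af = [a.alt_count / a.depth for a in covered]
    total_depth = sum(a.depth for a in covered)
    total_alt = sum(a.alt_count for a in covered)
    with_evidence = sum(1 for a in covered if a.alt_count > 0)
    return {
        "AF": total_alt / total_depth,
        "depth": total_depth,
        "min_AF": min(per_amp_af),
        "min_depth": min(a.depth for a in covered),
        "max_depth": max(a.depth for a in covered),
        "min_alt_count": min(a.alt_count for a in covered),
        "max_alt_count": max(a.alt_count for a in covered),
        "num_covering_amps": len(covered),
        "frac_amps_with_evidence": with_evidence / len(covered),
    }


# Toy input: three amplicons cover the position, two show the alternate allele.
features = summarize_candidate([
    AmpliconObservation(depth=200, alt_count=20),
    AmpliconObservation(depth=100, alt_count=5),
    AmpliconObservation(depth=100, alt_count=0),
])
```

A feature dictionary of this shape is what the later sketches treat as the input to each decision tree.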
- the machine learning classifier is not limited and may be any suitable machine learning classifier type.
- Machine learning approaches including any one or more of: supervised learning (e.g., using logistic regression, using back propagation neural networks, using random forests, decision trees, etc.), unsupervised learning (e.g., using an Apriori algorithm, using K-means clustering), semi-supervised learning, a deep learning algorithm (e.g., neural networks, a restricted Boltzmann machine, a deep belief network method, a convolutional neural network method, a recurrent neural network method, stacked auto encoder method, etc.), reinforcement learning (e.g., using a Q-learning algorithm, using temporal difference learning), a regression algorithm (e.g., ordinary least squares, logistic regression, stepwise regression, multivariate adaptive regression splines, locally estimated scatterplot smoothing, etc.), an instance-based method (e.g., k-nearest neighbor, learning vector quantization, self-organizing map, etc.), a regularization
- the machine learning classifier is selected from the group consisting of: a neural network, Bayesian classifier, logistic regression, decision tree, gradient-boosted tree, multilayer perceptron, one-vs-rest, and Naive Bayes.
- the machine learning approach is a supervised learning approach.
- evaluating the candidate variant with machine learning further comprises evaluating the candidate variant using a random forest classifier.
- the plurality of decision trees further comprises at least one thousand decision trees. In some embodiments, the plurality of decision trees further comprises at least ten thousand decision trees. In some embodiments, the plurality of decision trees further comprises at least fifty thousand decision trees.
- the plurality of decision trees further comprises a plurality of decision trees for each mutation. In some embodiments, the plurality of decision trees for each mutation comprises at least 5 decision trees per each mutant. In some embodiments, the plurality of decision trees for each mutation comprises at least 50 decision trees per each mutant. In some embodiments, the plurality of decision trees for each mutation comprises at least 100 decision trees per each mutant. In some embodiments, the plurality of decision trees for each mutation comprises at least 500 decision trees per each mutant. In some embodiments, the plurality of decision trees for each mutation comprises at least 1000 decision trees per each mutant. In some embodiments, the plurality of decision trees for each mutation comprises at least 5000 decision trees per each mutant.
- Some embodiments of the methods disclosed herein further comprise training the machine learning classifier using a training data set of sequences that include identified mutations.
- the training data set of sequences was obtained in part from low quality biological samples (e.g., at least 1%, at least 2%, at least 3%, at least 5%, at least 7.5%, at least 10% are low quality biological samples).
- the mutations are identified via expert review.
- the training data set of sequences was obtained in part from low quality biological samples (e.g., at least 1%, at least 2%, at least 3%, at least 5%, at least 7.5%, at least 10% is low quality) and the mutations are identified via expert review.
- the training the machine learning classifier further comprises optimizing parameters of the machine learning classifier until the machine learning classifier produces output describing the mutations identified in the training set (e.g., the mutations identified by expert review).
- optimizing parameters of the machine learning classifier further comprises selecting a plurality of feature categories. In some embodiments, at least 2,
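To make the training step concrete, the following is a minimal pure-Python sketch of gradient boosting with one-split decision trees (stumps) fitted to expert-reviewed labels. It is a toy under stated assumptions, not the disclosed implementation: the function names, the learning rate, the number of trees, and the feature values in the toy data are all hypothetical, and a production system would more likely rely on an established gradient boosting library.

```python
import math
from typing import Dict, List, Tuple

# (feature name, threshold, output if value <= threshold, output if value > threshold)
Stump = Tuple[str, float, float, float]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def raw_score(stumps: List[Stump], base: float, x: Dict[str, float]) -> float:
    """Sum the base score and the output of every stump for one candidate."""
    score = base
    for feat, thr, left, right in stumps:
        score += left if x[feat] <= thr else right
    return score


def fit_stump(rows: List[Dict[str, float]], residuals: List[float]) -> Stump:
    """Pick the single split that best fits the residuals (least squares)."""
    best = None
    for feat in rows[0]:
        values = sorted({r[feat] for r in rows})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [res for r, res in zip(rows, residuals) if r[feat] <= thr]
            right = [res for r, res in zip(rows, residuals) if r[feat] > thr]
            lmean, rmean = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((v - lmean) ** 2 for v in left)
                   + sum((v - rmean) ** 2 for v in right))
            if best is None or sse < best[0]:
                best = (sse, (feat, thr, lmean, rmean))
    if best is None:
        raise ValueError("no feature has more than one distinct value")
    return best[1]


def train(rows: List[Dict[str, float]], labels: List[int],
          n_trees: int = 50, learning_rate: float = 0.3) -> Tuple[float, List[Stump]]:
    """Gradient boosting on log-loss: each stump fits the residual y - p."""
    pos = sum(labels) / len(labels)
    if not 0.0 < pos < 1.0:
        raise ValueError("training labels must contain both classes")
    base = math.log(pos / (1.0 - pos))
    stumps: List[Stump] = []
    for _ in range(n_trees):
        residuals = [y - sigmoid(raw_score(stumps, base, x))
                     for x, y in zip(rows, labels)]
        feat, thr, left, right = fit_stump(rows, residuals)
        stumps.append((feat, thr, learning_rate * left, learning_rate * right))
    return base, stumps


# Toy training set: labels stand in for expert review (1 = real, 0 = artifact).
rows = [
    {"min_alt_count": 1, "min_AF_ratio_50": 0.8},
    {"min_alt_count": 2, "min_AF_ratio_50": 1.1},
    {"min_alt_count": 12, "min_AF_ratio_50": 6.0},
    {"min_alt_count": 30, "min_AF_ratio_50": 9.5},
    {"min_alt_count": 3, "min_AF_ratio_50": 0.9},
    {"min_alt_count": 18, "min_AF_ratio_50": 7.2},
]
labels = [0, 0, 1, 1, 0, 1]
base, stumps = train(rows, labels)
predictions = [sigmoid(raw_score(stumps, base, x)) for x in rows]
```

Each boosting round adds one tree that corrects the errors left by the trees before it, which is one way to read "optimizing parameters until the classifier produces output describing the mutations identified in the training set".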
- the method further comprises generating, using the computer, an overall confidence score for the candidate variant being a real mutation based on the classifications of all of the plurality of decision trees.
- the overall confidence score is on a scale of 0-1, with a 1 being a 100% confidence for a real mutation.
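The section does not say how the per-tree classifications are combined into the 0-1 confidence score. The Python sketch below shows two common conventions as assumptions: a vote fraction, as is typical for a random forest, and a logistic transform of summed tree outputs, as is typical for gradient-boosted trees. Both function names are hypothetical.

```python
import math
from typing import Sequence


def vote_fraction_confidence(tree_votes: Sequence[bool]) -> float:
    """Fraction of trees that classified the candidate as present."""
    if not tree_votes:
        raise ValueError("at least one tree classification is required")
    return sum(1 for vote in tree_votes if vote) / len(tree_votes)


def boosted_confidence(tree_outputs: Sequence[float], base_score: float = 0.0) -> float:
    """Logistic transform of the summed real-valued tree outputs."""
    total = base_score + sum(tree_outputs)
    return 1.0 / (1.0 + math.exp(-total))


# Three of four trees vote "present".
forest_score = vote_fraction_confidence([True, True, True, False])
# Tree outputs that cancel out leave the score at the midpoint of the scale.
boosted_score = boosted_confidence([0.5, -0.5])
```

Either convention yields a value between 0 and 1 that can be compared against a calling threshold.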
- a candidate variant is called as a real mutation when the confidence score is greater than 0.3, 0.4, 0.5, 0.6, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 0.98, or 0.99.
- the candidate variant is detected in a high quality sample
- the candidate variant is called as a real mutation when the confidence score is greater than 0.3.
- the candidate variant is detected in a low quality sample
- the candidate variant is called as a real mutation when the confidence score is greater than 0.7.
- a candidate variant is called as a real mutation when the confidence score is greater than 0.9.
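The thresholds above can be applied as a simple decision rule. The Python sketch below assumes that the 0.3 cutoff is paired with high quality samples and the 0.7 cutoff with low quality samples, which is one reading of the embodiments listed above. The function name and the defaults are hypothetical, and any of the other recited cutoffs (e.g., 0.9) could be passed instead.

```python
def call_variant(confidence: float, low_quality_sample: bool,
                 high_quality_threshold: float = 0.3,
                 low_quality_threshold: float = 0.7) -> bool:
    """Return True when the candidate is called as a real mutation.

    Assumption: a stricter cutoff is used for low quality samples.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be on a 0-1 scale")
    threshold = low_quality_threshold if low_quality_sample else high_quality_threshold
    # The text says "greater than", so a score equal to the cutoff is not called.
    return confidence > threshold


called_in_high_quality = call_variant(0.5, low_quality_sample=False)
called_in_low_quality = call_variant(0.5, low_quality_sample=True)
```

With these defaults the same score of 0.5 is called in a high quality sample but not in a low quality one.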
- one or more of the decision trees receive a read coverage parameter.
- one or more of the decision trees receive parameters selected from min_AF_ratio_50; min_alt_count; min_AF_softclip_ratio_99; count; min_AF_softclip; min_AF_softclip_ratio_100; within_tumor_prior; min_AF_ratio_100; max_AF_ratio_100; and max_alt_count.
- the method comprising: detecting at least one SNV or indel in the nucleic acid; validating the detected SNV or indel as present in the nucleic acid using the classification model; and providing a report that describes the nucleic acid as including the SNV or indel.
- SNVs single-nucleotide variants
- the training data set comprises a plurality of known single-nucleotide variants (SNVs)
- the method comprising: detecting at least one SNV in the nucleic acid; validating the detected SNV as present in the nucleic acid using the classification model; and providing a report that describes the nucleic acid as including the SNV.
- the training data set comprises a plurality of insertions and/or deletions (indels)
- the method comprising: detecting at least one indel in the nucleic acid; validating the detected indel as present in the nucleic acid using the classification model; and providing a report that describes the nucleic acid as including the indel.
- Some aspects of the present disclosure are directed to a system for identifying mutations from a patient sample, comprising a computer system having a processor, memory and a plurality of lines of instructions; a machine learning classifier executed by the processor of the computer system, the machine learning classifier being configured to: evaluate a candidate variant against the plurality of decision trees trained to detect mutations in the candidate variant, wherein each decision tree classifies the candidate variant as at least one of present or not present; and classify the candidate variant as present or not present based on classifications of each of the plurality of decision trees.
- the machine learning classifier is not limited and may be any suitable machine learning methodology described herein.
- the machine learning classifier is selected from the group consisting of: a neural network, Bayesian classifier, logistic regression, decision tree, gradient-boosted tree, multilayer perceptron, one-vs-rest, and Naive Bayes.
- the machine learning classifier is further configured to evaluate the candidate variant using a random forest classifier.
- the number of decision trees is not limited and may be any suitable number described herein.
- the plurality of decision trees further comprises at least one thousand decision trees.
- the plurality of decision trees further comprises a plurality of decision trees for each mutation.
- the processor is further configured to train the machine learning classifier using a training data set of sequences that include known mutations.
- the training data set of sequences was obtained in part from low quality biological samples and wherein the mutations are identified via expert review.
- low quality biological samples are samples comprising significantly degraded nucleotide sequences (e.g., DNA).
- the low quality biological samples are from samples that have undergone formalin fixation and paraffin embedding (FFPE).
- FFPE formalin fixation and paraffin embedding
- the low quality biological samples comprise at least 2-fold, 5-fold, 10-fold, or more chemically modified nucleotides or chemical crosslinks than a freshly obtained and untreated biological sample.
- the low quality biological samples comprise at least 2-fold, 5-fold, 10-fold, or more degraded sequences (DNA sequences) than a freshly obtained and untreated biological sample.
- the processor is further configured to optimize parameters of the machine learning classifier until the machine learning classifier produces output describing the known mutations.
- the processor is further configured to select a plurality of feature categories.
- the feature categories are not limited and may be any feature categories described herein.
- the feature categories are provided in Table 2, 3, or 4.
- the feature categories are provided in Table 3.
- the processor is further configured to generate an overall confidence score for the candidate variant being a real mutation based on the classifications of all of the plurality of decision trees.
- the biological sample i.e., sample
- the sample is any suitable sample type.
- the sample is from plasma, blood, serum, saliva, sputum, stool, a tumor, cell free DNA, circulating tumor cell, or other biological sample.
- the sample is a blood sample.
- the biological sample is a tumor specimen.
- the sample is from a subject having or at risk of having cancer.
- the type of cancer is not limited and may be any suitable cancer.
- Exemplary cancers include, but are not limited to, acoustic neuroma; adenocarcinoma; adrenal gland cancer; anal cancer; angiosarcoma (e.g., lymphangiosarcoma, lymphangioendotheliosarcoma, hemangiosarcoma); appendix cancer; benign monoclonal gammopathy; biliary cancer (e.g., cholangiocarcinoma); bladder cancer; breast cancer (e.g., adenocarcinoma of the breast, papillary carcinoma of the breast, mammary cancer, medullary carcinoma of the breast); brain cancer (e.g., meningioma, glioblastomas, glioma (e.g., astrocytoma, oligodendroglioma), medulloblastoma); bronchus cancer; carcinoid tumor; cervical cancer (e.g., cervical adenocarcinoma); choriocar
- Wilms tumor, renal cell carcinoma); liver cancer (e.g., hepatocellular cancer (HCC), malignant hepatoma); lung cancer (e.g., bronchogenic carcinoma, small cell lung cancer (SCLC), non-small cell lung cancer (NSCLC), adenocarcinoma of the lung); leiomyosarcoma (LMS); mastocytosis (e.g., systemic mastocytosis); muscle cancer; myelodysplastic syndrome (MDS); mesothelioma; myeloproliferative disorder (MPD) (e.g., polycythemia vera (PV), essential thrombocytosis (ET), agnogenic myeloid metaplasia (AMM) a.k.a.
- HCC hepatocellular cancer
- lung cancer e.g., bronchogenic carcinoma, small cell lung cancer (SCLC), non-small cell lung cancer (NSCLC), adenocarcinoma of the lung
- myelofibrosis MF
- chronic idiopathic myelofibrosis chronic myelocytic leukemia (CML), chronic neutrophilic leukemia (CNL), hypereosinophilic syndrome (HES)
- neuroblastoma e.g., neurofibromatosis (NF) type 1 or type 2, schwannomatosis
- neuroendocrine cancer e.g., gastroenteropancreatic neuroendocrine tumor (GEP-NET), carcinoid tumor
- osteosarcoma e.g., bone cancer
- ovarian cancer e.g., cystadenocarcinoma, ovarian embryonal carcinoma, ovarian adenocarcinoma
- papillary adenocarcinoma pancreatic cancer
- pancreatic cancer e.g., pancreatic adenocarcinoma, intraductal papillary mucinous neoplasm (IPMN), islet cell tumors
- the cancer is selected from adrenal, biliary, bladder, brain, breast, cervical, colon and rectum, endometrium, esophagus, head and neck, kidney, liver, lung - NSCLC, lung - Other, lymphoma, melanoma, meninges, NSCLC, non-melanoma skin, ovary, pancreas, prostate, sarcoma, small intestine, stomach, thymus, or thyroid cancer.
- the cancer is selected from lung, bladder, colon, gastric, head and neck, breast, prostate, non-small cell lung adenocarcinoma, non-small cell lung squamous cell carcinoma, bladder urothelial carcinoma, colorectal, brain or pancreatic cancer.
- the biological sample comprises less genomic material than prior mutant calling methods permit. In some embodiments, the biological sample comprises less than about 25 nanograms (ng) of genomic material. In some embodiments, the biological sample comprises less than about 20 ng of genomic material. In some embodiments, the biological sample comprises less than about 15 ng of genomic material. In some embodiments, the biological sample comprises less than about 12 ng of genomic material. In some embodiments, the biological sample comprises less than 10 ng of genomic material. In some embodiments, the biological sample comprises less than 7.5 ng of genomic material. In some embodiments, the biological sample comprises less than 5 ng of genomic material.
- the biological sample has undergone fixation. The method of fixation is not limited and may be any method of fixation known in the art. In some embodiments, fixation includes formalin fixation, which is known in the art to result in an abundance of C>T mutations thought to be due to deamination.
- the area of the biological sample is not limited. In some embodiments, the area of the biological sample is less than the area needed for variant calling methods used in the art. In some embodiments, the biological sample is a sample having an area of less than 5 mm². In some embodiments, the biological sample is a sample having an area of less than 10 mm². In some embodiments, the biological sample is a sample having an area of less than 15 mm². In some embodiments, the biological sample is a sample having an area of less than 20 mm². In some embodiments, the biological sample is a sample having an area of less than 25 mm². In some embodiments, the biological sample is a sample having an area of less than 30 mm². In some embodiments, the biological sample is a sample having an area of less than 35 mm². In some embodiments, the biological sample is a sample having an area of less than 40 mm².
- the tumor content of the biological sample is not limited. In some embodiments, the tumor content of the biological sample is less than the tumor content required in methods of variant calling practiced in the art. In some embodiments, the biological sample is a sample having a tumor content of less than 40%. In some embodiments, the biological sample is a sample having a tumor content of less than 30%. In some embodiments, the biological sample is a sample having a tumor content of less than 20%. In some embodiments, the biological sample is a sample having a tumor content of less than 17%. In some embodiments, the biological sample is a sample having a tumor content of less than 15%. In some embodiments, the biological sample is a sample having a tumor content of less than 12%. In some embodiments, the biological sample is a sample having a tumor content of less than 10%.
- Methods of determining or calculating tumor content are not limited and may be any suitable method known in the art.
- Methods of preparing biological specimens and sequencing data sets are not limited and may be any suitable method used in the art.
- the method comprises one or more steps comprising: 1) receiving a sample (FFPE block or slides); 2) reviewing an H&E stained slide for tumor content; 3) cutting additional slides as necessary; 4) scraping cells from slides into tubes (performing macrodissection if indicated by a pathologist); 5) extracting nucleic acid (fully automated batch process); 6) preparing DNA libraries (largely automated batch process); 7) loading DNA libraries onto a sequencing chip (fully automated); 8) sequencing chips via NGS (fully automated); and 9) data analysis.
- the biological sample has been stored for at least about 1 year prior to sequencing. In some embodiments, the biological sample has been stored for at least about 2 years prior to sequencing. In some embodiments, the biological sample has been stored for at least about 3 years prior to sequencing. In some embodiments, the biological sample has been stored for at least about 4 years prior to sequencing. In some embodiments, the biological sample has been stored for at least about 5 years prior to sequencing. In some embodiments, the biological sample has been stored for at least about 10 years prior to sequencing.
- the mutation sequencing data set is low quality. In some embodiments, the mutation sequencing data set comprises less than 1000 reads of the candidate variant of interest. In some embodiments, the mutation sequencing data set comprises less than 600 reads of the candidate variant of interest. In some embodiments, the mutation sequencing data set comprises less than 500 reads of the candidate variant of interest. In some embodiments, the mutation sequencing data set comprises less than 400 reads of the candidate variant of interest. In some embodiments, the mutation sequencing data set comprises less than 300 reads of the candidate variant of interest. In some embodiments, the mutation sequencing data set has been obtained in a batch with a plurality of other sequencing data sets and comprises less than 50% of the average number of reads in the plurality of other sequencing data sets. In some embodiments, no more than 1% of the positions of the sequencing data set have more than 3% of non-reference alignments.
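The data set quality criteria above can be expressed as a few simple metrics. The Python sketch below is an illustration under assumptions: the function names are hypothetical, `pct_noisy_positions` borrows a feature name from the parameter list without a definition from the source, and the default cutoffs (3% non-reference fraction, 1000 reads, 50% of the batch average) are taken from the embodiments in this paragraph.

```python
from typing import Dict, Sequence


def pct_noisy_positions(nonref_fractions: Sequence[float],
                        noise_cutoff: float = 0.03) -> float:
    """Percentage of positions whose non-reference fraction exceeds the cutoff."""
    if not nonref_fractions:
        raise ValueError("at least one position is required")
    noisy = sum(1 for frac in nonref_fractions if frac > noise_cutoff)
    return 100.0 * noisy / len(nonref_fractions)


def relative_read_count(sample_reads: int, other_batch_reads: Sequence[int]) -> float:
    """Sample read count divided by the mean read count of the other batch members."""
    if not other_batch_reads:
        raise ValueError("at least one other data set is required")
    batch_mean = sum(other_batch_reads) / len(other_batch_reads)
    if batch_mean <= 0:
        raise ValueError("batch mean read count must be positive")
    return sample_reads / batch_mean


def quality_flags(variant_reads: int, sample_reads: int,
                  other_batch_reads: Sequence[int],
                  max_variant_reads: int = 1000,
                  min_batch_fraction: float = 0.5) -> Dict[str, bool]:
    """Flag the low-read conditions recited in the paragraph above."""
    return {
        "few_variant_reads": variant_reads < max_variant_reads,
        "below_batch_average": relative_read_count(sample_reads, other_batch_reads)
        < min_batch_fraction,
    }


noise_pct = pct_noisy_positions([0.0, 0.01, 0.05, 0.10])
flags = quality_flags(variant_reads=350, sample_reads=400_000,
                      other_batch_reads=[1_000_000, 1_000_000])
```

In this toy input, two of four positions exceed the 3% cutoff, and the sample has 40% of the batch average read count, so both low-read flags are set.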
- the determination of the mutation or mutations in the sample leads to and is followed by particular treatment steps intended to be therapeutic for the identified mutation(s), mutation symptomology, and/or disorder associated with the mutation(s).
- the treatment steps can be treatment with a suitable agent or agents, including combination treatment regimens.
- Suitable treatment steps can include, for example, administration of appropriate gene therapy agents, such as TALENs, zinc fingers and/or CRISPR agents and suitable guide sequences, for correcting the mutation(s) in the subject from whom the sample was obtained.
- Suitable treatment steps can also include chemotherapeutic agents and regimens which are efficacious in treatment of cancers associated with identified mutation(s). The particular treatment will be determined based upon the mutation(s) identified using the methods described herein.
- any one or more active agents, additives, ingredients, optional agents, types of organism, disorders, subjects, or combinations thereof, can be excluded.
- claims or description relate to a composition of matter, it is to be understood that methods of making or using the composition of matter according to any of the methods disclosed herein, and methods of using the composition of matter for any of the purposes disclosed herein are aspects of the invention, unless otherwise indicated or unless it would be evident to one of ordinary skill in the art that a contradiction or inconsistency would arise.
- the invention includes embodiments that relate analogously to any intervening value or range defined by any two values in the series, and that the lowest value may be taken as a minimum and the greatest value may be taken as a maximum.
- Numerical values include values expressed as percentages. For any embodiment of the invention in which a numerical value is prefaced by “about” or “approximately”, the invention includes an embodiment in which the exact value is recited. For any embodiment of the invention in which a numerical value is not prefaced by “about” or “approximately”, the invention includes an embodiment in which the value is prefaced by “about” or “approximately”.
- Model parameters were trained using a corpus of paired raw sequencing data and expert review data from comprehensive genomic profiling tests for a consecutive cohort of 3,020 clinical samples spanning 30 tumor types [Adrenal, Biliary, Bladder, Brain, Breast, Cervical, Colon and Rectum, Endometrium, Esophagus, Head and Neck, Kidney, Liver, Lung - NSCLC, Lung - Other, Lymphoma, Melanoma, Meninges, NSCLC, Non-Melanoma Skin, other, Ovary, Pancreas, Prostate, Sarcoma, Small Intestine, Stomach, Thymus, Thyroid, and unknown primary].
- SNV single nucleotide variant
- Companion diagnostic markers (CDx) - mutations with existing clinical indications such as BRAF p.V600E, which may indicate vemurafenib in melanomas
- the mutation is covered by two independent amplicons, both of which support the mutation:
- the read variant allele frequency is similar on both strands (range 0-1):
- the read support is highly strand-biased (variant allele frequencies are different on each strand):
- Negative example 2 MSH6 chr2:48010515 G>A (FIG. 3)
- This candidate is typical of sequencing errors that are often present at variant allele frequencies high enough to trigger false positive results in systems that do not consider second-order evidence such as strand bias, multiple amplicon coverage, and prior probabilities.
- Variant candidates like this with marginal read support can be real - for example, it is common to have biased read coverage such that only one strand has sufficient reads to support analysis; in such cases where read support is not definitive, prior probabilities and multiple amplicon coverage can help resolve such ambiguity. In this case, neither parameter lends support:
- pan_tumor_prior, pan_tumor_prior_gene, within_tumor_prior, within_tumor_prior_gene < 0.01
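The second-order evidence discussed in the examples above (strand agreement, coverage by independent amplicons, and prior probabilities) can be illustrated with a minimal Python sketch. This is not the method of the application: the class names, the min/max ratio used for strand agreement, and every threshold below are assumptions chosen for illustration. Only the four prior names and the 0.01 level come from the text, and the sketch assumes that priors below 0.01 are meant to lend no support.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class StrandCounts:
    """Reference and variant read counts on one strand (hypothetical layout)."""
    ref: int
    alt: int

    @property
    def vaf(self) -> float:
        """Variant allele frequency on this strand; 0.0 when there are no reads."""
        total = self.ref + self.alt
        return self.alt / total if total else 0.0


@dataclass
class Candidate:
    """Evidence collected for one variant candidate (hypothetical layout)."""
    forward: StrandCounts
    reverse: StrandCounts
    amplicon_alt_reads: List[int]  # variant-supporting reads per covering amplicon
    priors: Dict[str, float]       # e.g. pan_tumor_prior, within_tumor_prior


def strand_vaf_ratio(c: Candidate) -> float:
    """Smaller strand VAF divided by the larger one, in the range 0-1.

    1.0 means both strands agree; 0.0 means support comes from one strand only.
    The min/max ratio is an assumed way of measuring strand similarity.
    """
    lo, hi = sorted((c.forward.vaf, c.reverse.vaf))
    return lo / hi if hi > 0 else 0.0


def supporting_amplicons(c: Candidate, min_alt_reads: int = 3) -> int:
    """Number of amplicons with at least min_alt_reads variant reads (assumed cutoff)."""
    return sum(1 for n in c.amplicon_alt_reads if n >= min_alt_reads)


def second_order_evidence(
    c: Candidate,
    min_ratio: float = 0.5,     # assumed threshold for "similar on both strands"
    min_amplicons: int = 2,     # "two independent amplicons" in the positive example
    prior_floor: float = 0.01,  # level quoted for the priors in the negative example
) -> Dict[str, bool]:
    """Summarise which kinds of second-order evidence support the candidate."""
    return {
        "strands_agree": strand_vaf_ratio(c) >= min_ratio,
        "multi_amplicon": supporting_amplicons(c) >= min_amplicons,
        "prior_support": max(c.priors.values(), default=0.0) >= prior_floor,
    }


if __name__ == "__main__":
    # Made-up counts shaped like a strand-biased, single-amplicon, low-prior candidate.
    candidate = Candidate(
        forward=StrandCounts(ref=95, alt=5),
        reverse=StrandCounts(ref=100, alt=0),
        amplicon_alt_reads=[5, 0],
        priors={"pan_tumor_prior": 0.001, "within_tumor_prior": 0.002},
    )
    print(second_order_evidence(candidate))
```

In a tree-based classifier these quantities would more naturally be passed as numeric features than as hard yes/no flags; the boolean summary is used here only to keep the example readable.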
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Medical Informatics (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- General Health & Medical Sciences (AREA)
- Biophysics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Biotechnology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Molecular Biology (AREA)
- Computational Linguistics (AREA)
- Bioethics (AREA)
- Genetics & Genomics (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Analytical Chemistry (AREA)
- Public Health (AREA)
- Epidemiology (AREA)
- Databases & Information Systems (AREA)
- Chemical & Material Sciences (AREA)
- Algebra (AREA)
- Probability & Statistics with Applications (AREA)
- Computational Mathematics (AREA)
- Mathematical Analysis (AREA)
- Mathematical Optimization (AREA)
- Pure & Applied Mathematics (AREA)
- Biomedical Technology (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Priority Applications (5)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP22825947.9A EP4356319A2 (en) | 2021-06-17 | 2022-06-17 | Methods for identifying mutations using machine learning |
AU2022292749A AU2022292749A1 (en) | 2021-06-17 | 2022-06-17 | Methods for identifying mutations using machine learning |
US18/571,652 US20240290422A1 (en) | 2021-06-17 | 2022-06-17 | Methods for identifying mutations using machine learning |
IL309473A IL309473A (en) | 2021-06-17 | 2022-06-17 | Methods for identifying mutations using machine learning |
CA3224548A CA3224548A1 (en) | 2021-06-17 | 2022-06-17 | Methods for identifying mutations using machine learning |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US202163211891P | 2021-06-17 | 2021-06-17 | |
US63/211,891 | 2021-06-17 |
Publications (2)
Publication Number | Publication Date |
---|---|
WO2022266518A2 (en) | 2022-12-22 |
WO2022266518A3 (en) | 2023-01-26 |
Family
ID=84526760
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2022/034115 WO2022266518A2 (en) | 2021-06-17 | 2022-06-17 | Methods for identifying mutations using machine learning |
Country Status (6)
Country | Link |
---|---|
US (1) | US20240290422A1 (en) |
EP (1) | EP4356319A2 (en) |
AU (1) | AU2022292749A1 (en) |
CA (1) | CA3224548A1 (en) |
IL (1) | IL309473A (en) |
WO (1) | WO2022266518A2 (en) |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3728642A4 (en) * | 2017-12-18 | 2021-09-15 | Personal Genome Diagnostics Inc. | Machine learning system and method for somatic mutation discovery |
2022
- 2022-06-17: IL, application IL309473A, status unknown
- 2022-06-17: EP, application EP22825947.9A, not active (withdrawn)
- 2022-06-17: WO, application PCT/US2022/034115, not active (application discontinuation)
- 2022-06-17: AU, application AU2022292749A, active (pending)
- 2022-06-17: CA, application CA3224548A, active (pending)
- 2022-06-17: US, application US18/571,652, not active (abandoned)
Also Published As
Publication number | Publication date |
---|---|
WO2022266518A3 (en) | 2023-01-26 |
EP4356319A2 (en) | 2024-04-24 |
AU2022292749A1 (en) | 2024-01-18 |
US20240290422A1 (en) | 2024-08-29 |
IL309473A (en) | 2024-02-01 |
CA3224548A1 (en) | 2022-12-22 |
Legal Events
Code | Title | Description |
---|---|---|
121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 22825947; Country of ref document: EP; Kind code of ref document: A2 |
WWE | WIPO information: entry into national phase | Ref document number: 309473; Country of ref document: IL |
WWE | WIPO information: entry into national phase | Ref document number: P6003285/2023; Country of ref document: AE. Ref document number: 3224548; Country of ref document: CA |
WWE | WIPO information: entry into national phase | Ref document number: 2022292749; Country of ref document: AU. Ref document number: 806851; Country of ref document: NZ. Ref document number: AU2022292749; Country of ref document: AU |
REG | Reference to national code | Ref country code: BR; Ref legal event code: B01A; Ref document number: 112023026730; Country of ref document: BR |
WWE | WIPO information: entry into national phase | Ref document number: 2022825947; Country of ref document: EP |
ENP | Entry into the national phase | Ref document number: 2022292749; Country of ref document: AU; Date of ref document: 20220617; Kind code of ref document: A |
NENP | Non-entry into the national phase | Ref country code: DE |
WWE | WIPO information: entry into national phase | Ref document number: 523451976; Country of ref document: SA |
121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 22825947; Country of ref document: EP; Kind code of ref document: A2 |
ENP | Entry into the national phase | Ref document number: 2022825947; Country of ref document: EP; Effective date: 20240117 |
REG | Reference to national code | Ref country code: BR; Ref legal event code: B01E; Ref document number: 112023026730; Country of ref document: BR; Free format text: Submit the specification and drawings in accordance with the international application as originally filed, since they have not been submitted to date. The requirement must be answered within 60 (sixty) days of its publication and must be made by means of petition GRU service code 207. |
NENP | Non-entry into the national phase | Ref country code: JP |
ENPW | Started to enter national phase and was withdrawn or failed for other reasons | Ref document number: 112023026730; Country of ref document: BR; Free format text: Application withdrawn from the Brazilian national phase for failure to comply with the requirement published in RPI 2776 of 19/03/2024, pursuant to Art. 28, paragraph 1, of Portaria/INPI/No. 39/2021 |
WWW | WIPO information: withdrawn in national office | Ref document number: 2022825947; Country of ref document: EP |