US20200143905A1 - Methods and compositions for germline variant detection - Google Patents
- Publication number
- US20200143905A1 US16/669,270
- Authority
- US
- United States
- Prior art keywords
- variants
- germline
- variant
- tumor
- database
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/20—Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6869—Methods for sequencing
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6876—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
- C12Q1/6883—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
- C12Q1/6886—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material for cancer
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/10—Sequence alignment; Homology search
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q2535/00—Reactions characterised by the assay type for determining the identity of a nucleotide base or a sequence of oligonucleotides
- C12Q2535/122—Massive parallel sequencing
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q2537/00—Reactions characterised by the reaction format or use of a specific feature
- C12Q2537/10—Reactions characterised by the reaction format or use of a specific feature the purpose or use of
- C12Q2537/165—Mathematical modelling, e.g. logarithm, ratio
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q2600/00—Oligonucleotides characterized by their use
- C12Q2600/156—Polymorphic or mutational markers
Definitions
- a somatic variant can be distinguished from a germline variant based on variant allele frequency of a variant in a sample and its location in a genome.
- NGS: next-generation sequencing
- VAF: variant allele frequency
- the low-quality bases, usually near the 3′ end of reads, and exogenous sequences such as sequencing adapters are trimmed from the DNA sample reads using read processing tools.
- the cleaned reads are mapped using mapping and alignment tools to determine where the variants may come from in a reference genome, and then aligned base-by-base.
- the process of variant calling is used to separate real variants from artifacts stemming from library preparation, sample enrichment, sequencing, and mapping/alignment. There is a continued need for improved methods of variant calling from sequence data.
- Some embodiments include a method for identifying somatic variants in a plurality of variants, comprising: (a) obtaining a plurality of variants comprising somatic variants and germline variants; (b) applying a database filter to the plurality of variants, comprising: determining first germline variants in the plurality of variants, wherein the first germline variants each have an allele count in a first reference set of variants greater than or equal to a threshold allele count; (c) applying a proximity filter to the plurality of variants, comprising: (i) binning variants of the plurality of variants into a plurality of bins, wherein variants located in the same region of a genome are binned into the same bin, (ii) determining database variants in the plurality of variants, wherein a database variant is present in a second reference set of variants, and (iii) determining second germline variants in the plurality of variants, wherein the second germline variants each have an allele
- (b) and (c) are performed consecutively.
- (c) is performed before (b).
- the threshold allele count is 5. In some embodiments, the threshold allele count is 10.
- the first and second reference set of variants are the same reference set.
- the first or second reference set of variants comprises a database of variants for a plurality of individuals. In some embodiments, the first or second reference set of variants comprises at least one database selected from a genome aggregation database (gnomAD), and a 1000 genome database.
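The database filter described above can be illustrated with a minimal sketch. It assumes the reference set has already been reduced to a dictionary of allele counts keyed by (chromosome, position, reference allele, alternate allele); the `Variant` class, the `database_filter` function, and the example counts are hypothetical names and values chosen for illustration, not part of the disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    chrom: str
    pos: int
    ref: str
    alt: str
    vaf: float  # variant allele frequency observed in the tumor sample


def database_filter(variants, reference_allele_counts, threshold_allele_count=10):
    """Split variants into (first germline variants, remaining variants).

    A variant is treated as germline when its allele count in the reference
    set is greater than or equal to the threshold allele count.
    """
    germline, remaining = [], []
    for v in variants:
        key = (v.chrom, v.pos, v.ref, v.alt)
        # Variants absent from the reference set are given a count of 0.
        count = reference_allele_counts.get(key, 0)
        if count >= threshold_allele_count:
            germline.append(v)
        else:
            remaining.append(v)
    return germline, remaining


# Hypothetical input: three called variants and a tiny reference set.
variants = [
    Variant("chr7", 1000, "A", "G", 0.48),
    Variant("chr7", 2000, "C", "T", 0.12),
    Variant("chr7", 3000, "G", "A", 0.51),
]
reference_allele_counts = {
    ("chr7", 1000, "A", "G"): 250,  # common in the reference set
    ("chr7", 3000, "G", "A"): 3,    # present, but below the threshold
}
germline, remaining = database_filter(variants, reference_allele_counts)
```

With a threshold of 10, only the first variant is flagged; the third variant is in the reference set but is too rare to be removed by this filter alone, which is the situation the proximity filter is meant to address.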
- gnomAD: genome aggregation database
- the same region of a genome is within the same chromosome. In some embodiments, the same region of a genome is within the same chromosomal arm. In some embodiments, the same region of a genome is within the same chromosomal cytoband. In some embodiments, the same region of a genome is within a 10 Mb region.
- the applying a proximity filter further comprises identifying a second germline variant having an allele frequency greater than or equal to 0.9.
- the applying a proximity filter further comprises identifying a second germline variant in the plurality of variants, wherein the second germline variant is a database variant present in the second reference set of variants.
- the proximate range is a range having a maximum and a minimum of 0.05 from the allele frequency of a second germline variant.
- the proximate range is a range having a maximum and a minimum of two standard deviations from a binomial distribution of an allele frequency of a second germline variant, and centered on the allele frequency of a second germline variant.
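The two ways of defining the proximate range can be compared with a short sketch. The fixed range uses the ±0.05 width stated above. For the binomial range, the sketch assumes the standard deviation of the observed allele frequency is sqrt(p(1 − p)/n), where n is the read depth at the variant site; the use of read depth as the number of trials, the clamping to [0, 1], and the function names are assumptions made for illustration.

```python
import math


def fixed_range(vaf, half_width=0.05):
    """Range extending half_width below and above the allele frequency."""
    return max(0.0, vaf - half_width), min(1.0, vaf + half_width)


def binomial_range(vaf, depth, n_sd=2.0):
    """Range of n_sd binomial standard deviations centered on the allele frequency.

    Assumes the alternate-read count is Binomial(depth, vaf), so the standard
    deviation of the observed frequency is sqrt(vaf * (1 - vaf) / depth).
    """
    if depth <= 0:
        raise ValueError("depth must be positive")
    sd = math.sqrt(vaf * (1.0 - vaf) / depth)
    return max(0.0, vaf - n_sd * sd), min(1.0, vaf + n_sd * sd)


# A heterozygous-looking variant at VAF 0.5:
print(fixed_range(0.5))          # roughly (0.45, 0.55)
print(binomial_range(0.5, 100))  # roughly (0.40, 0.60) at 100x depth
print(binomial_range(0.5, 400))  # roughly (0.45, 0.55) at 400x depth
```

Under this assumption the binomial range narrows as depth increases, whereas the fixed range stays the same width regardless of coverage.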
- the second germline variants have an allele frequency within a threshold proximity to an allele frequency of at least five database variants in the same bin as the second germline variant. In some embodiments, the second germline variants have an allele frequency within a threshold proximity to an allele frequency of at least ten database variants in the same bin as the second germline variant.
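The proximity filter can be sketched as follows, under several stated assumptions: bins are whole chromosomes (one of the binning options listed above), the proximate range is the fixed ±0.05 range, and only variants absent from the reference set are tested against the database variants in their bin. The `Variant` tuple, the `in_database` flag, and `proximity_filter` are hypothetical names used for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    alt: str
    vaf: float         # variant allele frequency in the sample
    in_database: bool  # True if present in the second reference set


def proximity_filter(variants, half_width=0.05, min_database_neighbors=5):
    """Return uncharacterized variants whose VAF tracks nearby database variants."""
    # (i) Bin variants by region; here one bin per chromosome.
    bins = defaultdict(list)
    for v in variants:
        bins[v.chrom].append(v)

    germline = []
    for members in bins.values():
        # (ii) Collect the allele frequencies of database variants in the bin.
        database_vafs = [v.vaf for v in members if v.in_database]
        for v in members:
            if v.in_database:
                continue  # only uncharacterized variants are tested in this sketch
            # (iii) Count database variants whose VAF lies in the proximate range.
            neighbors = sum(1 for f in database_vafs if abs(f - v.vaf) <= half_width)
            if neighbors >= min_database_neighbors:
                germline.append(v)
    return germline


# Hypothetical data: five database variants on chr7 with VAF between 0.48 and 0.52.
database_variants = [
    Variant("chr7", 1000 * i, "T", 0.48 + 0.01 * i, True) for i in range(5)
]
unknown_near = Variant("chr7", 9000, "G", 0.50, False)   # VAF matches the bin
unknown_far = Variant("chr7", 9500, "C", 0.15, False)    # VAF far from the bin
unknown_sparse = Variant("chr1", 500, "A", 0.50, False)  # no database variants in bin
flagged = proximity_filter(
    database_variants + [unknown_near, unknown_far, unknown_sparse]
)
```

In this example only `unknown_near` is flagged as germline: `unknown_far` has no database variants near its allele frequency, and `unknown_sparse` sits in a bin with no database variants at all.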
- (a) comprises: obtaining sequence data from a biological sample comprising a tumor cell. Some embodiments also include aligning the sequence data with a reference sequence, and identifying variants in the sequence data.
- the biological sample comprising a tumor cell is selected from a serum sample, a stool sample, a blood sample, and a tumor sample. In some embodiments, the tumor sample is fixed.
- Some embodiments include a method of determining a tumor mutation burden of a tumor, comprising: obtaining sequence data from a biological sample comprising a tumor cell; determining a plurality of variants from the sequence data; and determining the number of somatic variants in a plurality of variants according to the method of any one of the foregoing embodiments, wherein the number of somatic variants is the tumor mutation burden of the tumor.
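As a minimal sketch of the counting step, the tumor mutation burden can be computed as the number of variants left after removing everything flagged as germline by either filter. The variant identifiers and the function name `tumor_mutation_burden` are hypothetical, and the germline sets are assumed to have been produced by a database filter and a proximity filter run beforehand.

```python
def tumor_mutation_burden(all_variants, germline_by_database, germline_by_proximity):
    """Return (count, list) of variants not flagged as germline by either filter."""
    germline = set(germline_by_database) | set(germline_by_proximity)
    somatic = [v for v in all_variants if v not in germline]
    return len(somatic), somatic


# Hypothetical variant identifiers for one tumor-only sample.
all_variants = ["chr1:100A>G", "chr1:250C>T", "chr7:900G>A", "chr7:1500T>C"]
germline_by_database = {"chr1:100A>G"}
germline_by_proximity = {"chr7:900G>A"}

tmb, somatic = tumor_mutation_burden(
    all_variants, germline_by_database, germline_by_proximity
)
```

Here two of the four variants are removed as germline, so the remaining two are counted toward the tumor mutation burden.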
- Some embodiments include a method of treating a tumor, comprising: determining a tumor having a tumor mutation burden greater than or equal to 10 somatic variants according to a method of determining a tumor mutation burden of a tumor; and treating the tumor by administering an effective amount of a checkpoint inhibitor.
- the tumor is selected from the group consisting of a colorectal tumor, a lung tumor, an endometrium tumor, a uterine tumor, a gastric tumor, a melanoma, a breast tumor, a pancreatic tumor, a kidney tumor, a bladder tumor, and a brain tumor.
- the checkpoint inhibitor is selected from the group consisting of a CTLA-4 inhibitor, a PD-1 inhibitor, and a PD-L1 inhibitor. In some embodiments, the checkpoint inhibitor is selected from the group consisting of Ipilimumab, Nivolumab, Pembrolizumab, Spartalizumab, Atezolizumab, Avelumab, and Durvalumab.
- Some embodiments include an electronic system for analyzing genetic variation data, comprising: an informatics module running on a processor and adapted to identify a plurality of variants from sequence data from a biological sample comprising a tumor cell, wherein the plurality of variants comprises somatic variants and germline variants; a database filter module adapted to remove first germline variants from the plurality of variants, wherein the first germline variants each have an allele count in a first reference set of variants greater than or equal to a threshold allele count; a proximity filter module adapted to remove second germline variants from the plurality of variants, the proximity filter module comprising: a binning sub-module adapted to return a plurality of bins, each bin containing variants of the plurality of variants located in the same region of a genome, an identification sub-module adapted to return database variants in the plurality of variants, wherein a database variant is present in a second reference set of variants, and a removal sub-module adapted to remove
- informatics module comprises a variant annotation tool.
- the threshold allele count is 5. In some embodiments, the threshold allele count is 10.
- the first and second reference set of variants are the same reference set.
- the first or second reference set of variants comprises a database of variants for a plurality of individuals. In some embodiments, the first or second reference set of variants comprises at least one database selected from a genome aggregation database (gnomAD), and a 1000 genome database.
- the same region of a genome is within the same chromosome. In some embodiments, the same region of a genome is within the same chromosomal arm. In some embodiments, the same region of a genome is within the same chromosomal cytoband. In some embodiments, the same region of a genome is within a 10 Mb region.
- the removal sub-module is adapted to remove a variant having an allele frequency greater than or equal to 0.9 from the plurality of variants.
- the removal sub-module is adapted to remove a database variant present in the second reference set of variants from the plurality of variants.
- the proximate range is a range having a maximum and a minimum of 0.05 from the allele frequency of a second germline variant.
- the proximate range is a range having a maximum and a minimum of two standard deviations from a binomial distribution of an allele frequency of a second germline variant, and centered on the allele frequency of a second germline variant.
- the second germline variants have an allele frequency within a threshold proximity to an allele frequency of at least five database variants in the same bin as the second germline variant. In some embodiments, the second germline variants have an allele frequency within a threshold proximity to an allele frequency of at least ten database variants in the same bin as the second germline variant.
- the biological sample comprising a tumor cell is selected from a serum sample, a stool sample, a blood sample, and a tumor sample. In some embodiments, the tumor sample is fixed.
- Some embodiments include a computer-implemented method for identifying somatic variants in a plurality of variants, comprising: performing the method of any one of the foregoing methods.
- Some embodiments include a computer-implemented method for identifying somatic variants in a plurality of variants, comprising: (a) receiving a plurality of variants from sequence data from a biological sample comprising a tumor cell, the plurality of variants comprising somatic variants and germline variants; (b) applying a database filter to the plurality of variants, comprising: creating an index of documents for the plurality of variants, searching a first reference set of variants with the index to identify first germline variants in the index, wherein the first germline variants each have an allele count in the first reference set of variants greater than or equal to a threshold allele count, and removing the identified first germline variants from the index to create an index of first filtered variants; (c) applying a proximity filter to the index of first filtered variants, comprising: (i) creating a plurality of bins for different regions of a genome, (ii) binning variants of the index of first filtered variants, wherein variants located in the
- the threshold allele count is 5. In some embodiments, the threshold allele count is 10.
- the first and second reference set of variants are the same reference set.
- the first or second reference set of variants comprises a database of variants for a plurality of individuals. In some embodiments, the first or second reference set of variants comprises at least one database selected from a genome aggregation database (gnomAD), and a 1000 genome database.
- the same region of a genome is within the same chromosome. In some embodiments, the same region of a genome is within the same chromosomal arm. In some embodiments, the same region of a genome is within the same chromosomal cytoband. In some embodiments, the same region of a genome is within a 10 Mb region.
- the generating an index of second filtered variants further comprises identifying a second germline variant having an allele frequency greater than or equal to 0.9.
- the generating an index of second filtered variants further comprises identifying a second germline variant in the plurality of variants, wherein the second germline variant is a database variant present in the second reference set of variants.
- the proximate range is a range having a maximum and a minimum of 0.05 from the allele frequency of a second germline variant.
- the proximate range is a range having a maximum and a minimum of two standard deviations from a binomial distribution of an allele frequency of a second germline variant, and centered on the allele frequency of a second germline variant.
- the second germline variants have an allele frequency within a threshold proximity to an allele frequency of at least five database variants in the same bin as the second germline variant. In some embodiments, the second germline variants have an allele frequency within a threshold proximity to an allele frequency of at least ten database variants in the same bin as the second germline variant.
- the biological sample comprising a tumor cell is selected from a serum sample, a stool sample, a blood sample, and a tumor sample. In some embodiments, the tumor sample is fixed.
- FIG. 1 depicts an example embodiment of a workflow that includes obtaining sequence data, such as a VCF file, identifying and annotating variants in the data, identifying and filtering germline variants, and returning a variant table indicating the status of the variants.
- FIG. 2A is a graph showing the variant allele frequency (VAF) for various variants according to chromosomal location of each variant with somatic variants (black-filled circles), and germline variants (gray-filled circles).
- FIG. 2B is a graph showing the VAF for various variants according to chromosomal location of each variant with filter-determined somatic variants (black-filled circles), and filter-determined germline variants (gray-filled circles).
- FIG. 3 is a graph showing the VAF for various variants according to chromosomal location for chromosomes 1-7 for each variant with filter-determined somatic variants (black-filled circles), and filter-determined germline variants (gray-filled circles), and an enlargement for variants located on chromosome 7 in which a particular filter-determined somatic variant has been selected, and a range drawn from the selected variant.
- FIG. 4A is a graph showing the VAF for various variants according to chromosomal location of each variant with filter-determined somatic variants (black-filled circles), and filter-determined germline variants (gray-filled circles), filtered with a database filter only.
- FIG. 4B is a graph showing the VAF for various variants according to chromosomal location of each variant with filter-determined somatic variants (black-filled circles), and filter-determined germline variants (gray-filled circles), filtered with a database filter and a proximity filter.
- FIG. 5 depicts an overview of an example embodiment of a workflow that includes obtaining formalin-fixed paraffin embedded (FFPE) samples, obtaining sequence data, and analyzing the sequence data.
- FFPE: formalin-fixed paraffin embedded
- FIG. 6 depicts an example embodiment of a workflow that includes filtering germline variants from the identified variants using a database filter and a proximity filter, and calculating a tumor mutation burden.
- FIG. 7 is a line graph showing a distribution of remaining germline variant count after filtering with database only (graph peaks at about 3 germline residuals/Mb) and the hybrid strategy (graph peaks at about 0 germline residuals/Mb).
- FIG. 8A is a graph showing a comparison of tumor mutation burden (TMB) between tumor-only and tumor/normal assays.
- FIG. 8B is a graph showing a comparison of tumor mutation burden (TMB) between tumor-only and WES tumor-normal assays.
- a somatic variant can be distinguished from a germline variant based on the variant's allele frequency in a sample and the variant's location in a genome.
- a “variant” can include a polymorphism within a nucleic acid molecule.
- a polymorphism can include an insertion, deletion, variable length tandem repeats, single nucleotide mutation, and a structural variant such as translocation, copy number variation, or a combination thereof.
- a “germline variant” can include a variant present in germ cells and all cells of an individual.
- a “somatic variant” can include a variant present in a tumor cell, and not in other cells of an individual.
- variant calling between somatic variants and germline variants has relied on a comparison between data obtained from a tumor sample, and data obtained from a matched normal sample.
- traditional variant calling requires a matched sample to be available, and for two sets of data to be obtained.
- Embodiments provided herein relate to variant calling from sequence data taken from a single sample from an individual. Using a single sample may reduce the need for a matched sample, and the costs that would have been required for obtaining sequence data for both a tumor sample, and a matched normal sample.
- a filter can include a proximity filter.
- the proximity filter includes binning the plurality of variants into a plurality of bins according to the location of the variants in a genome. Some of the binned variants can be identified as germline variants by the presence of corresponding variants in one or more reference sets of variants.
- An uncharacterized binned variant can be determined to be a germline variant if the uncharacterized binned variant has an allele frequency similar to the allele frequency of one or more identified germline variants in the same bin as the uncharacterized variant.
- Some embodiments also include applying a database filter to identify germline variants.
- the database filter can identify germline variants according to an allele count of corresponding variants in one or more reference sets of variants.
- a database filter and a proximity filter can be applied to the plurality of variants to identify germline variants.
- somatic variants are variants that are not identified as germline variants. The number of somatic variants can indicate the tumor mutation burden of a tumor.
- the germline variants may include variants that an individual is born with (or shared between the tumor and the normal cell) but which are detected as variants in comparison to the reference genome. These variants do not contribute to distinguishing tumor cells from normal cells, and thus can lead to over estimation of the tumor mutation burden if not correctly filtered out.
- Embodiments include determining a tumor mutation burden for a tumor, selecting a treatment for the tumor according to the tumor mutation burden, and administering the treatment to a subject in need thereof.
- Some embodiments of the methods and systems provided herein relate to a method for identifying a somatic variant in a plurality of variants comprising somatic variants and germline variants.
- germline variants can be filtered from the plurality of variants using one or more filters. Examples of such filters include a database filter, and a proximity filter.
- a database filter can be applied to a plurality of variants.
- the database filter can be used to identify a variant as a germline variant, and remove the variant from the plurality of variants.
- the database filter can be related to an allele count of a corresponding variant in a database, for a particular variant of the plurality of variants.
- a reference database can be searched for the corresponding variant in the database.
- a reference database can include a database of variants for a plurality of individuals. Examples of databases useful with embodiments provided herein include a genome aggregation database (gnomAD), including gnomAD exome and gnomAD genome databases, and a 1000 genome database (International Genome Sample Resource). See e.g., Lek, M., et al., (2016) Nature 536:285-292 which is incorporated by reference in its entirety.
- a total allele count can be determined for the corresponding variant in one or more reference databases. An allele count can represent the total number of observations within a database that a variant is observed.
- an allele count of 10 in a database for a corresponding variant denotes that the corresponding variant has been observed in at least 5 samples for homozygous variants, or a maximum of 10 samples for heterozygous variants.
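- The arithmetic behind this example can be sketched as follows. This is a minimal illustration, assuming diploid samples in which a homozygous carrier contributes two observations and a heterozygous carrier contributes one; the function name `sample_bounds` is hypothetical and not part of the disclosure.

```python
import math


def sample_bounds(allele_count: int) -> tuple[int, int]:
    """Return (fewest, most) samples that could carry a variant.

    Assumption: diploid samples. A homozygous carrier adds 2 to the
    allele count, a heterozygous carrier adds 1.
    """
    if allele_count < 0:
        raise ValueError("allele_count must be non-negative")
    fewest = math.ceil(allele_count / 2)  # every carrier is homozygous
    most = allele_count                   # every carrier is heterozygous
    return fewest, most


print(sample_bounds(10))
```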
- an allele count can be the highest allele count observed in more than one database.
- a variant having a corresponding variant with an allele count greater than or equal to a certain threshold allele count can be identified as a germline variant.
- the threshold allele count can be greater than or equal to 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, or 20.
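- The database filter described above can be sketched in a few lines. This is an illustration under stated assumptions, not the implementation of the disclosure: each reference database is modeled as a plain dictionary from a variant key to its allele count, the names `max_allele_count` and `database_filter` are hypothetical, and the default threshold of 10 is simply one of the values listed above.

```python
from typing import Dict, Iterable, List, Sequence, Tuple

# A variant key: (chromosome, position, reference allele, alternate allele).
VariantKey = Tuple[str, int, str, str]


def max_allele_count(variant: VariantKey,
                     databases: Sequence[Dict[VariantKey, int]]) -> int:
    """Highest allele count seen for `variant` in any database (0 if absent)."""
    return max((db.get(variant, 0) for db in databases), default=0)


def database_filter(variants: Iterable[VariantKey],
                    databases: Sequence[Dict[VariantKey, int]],
                    threshold: int = 10
                    ) -> Tuple[List[VariantKey], List[VariantKey]]:
    """Split variants into (germline, remaining) by allele-count threshold."""
    germline: List[VariantKey] = []
    remaining: List[VariantKey] = []
    for variant in variants:
        if max_allele_count(variant, databases) >= threshold:
            germline.append(variant)   # common in the population
        else:
            remaining.append(variant)  # rare or absent: keep for later filters
    return germline, remaining


# Toy stand-ins for two reference databases (made-up counts).
exome = {("chr7", 55_000_000, "A", "G"): 1200}
genome = {("chr7", 55_000_000, "A", "G"): 310, ("chr1", 1_000, "C", "T"): 3}
called = [("chr7", 55_000_000, "A", "G"),
          ("chr1", 1_000, "C", "T"),
          ("chr2", 42, "G", "A")]

germline, remaining = database_filter(called, [exome, genome], threshold=10)
print(germline)
print(remaining)
```

- Taking the maximum across databases means a variant is removed if any one database reports it at or above the threshold, which matches the "highest allele count" option described above.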
- a proximity filter can be applied to a plurality of variants.
- the proximity filter can be used to identify a variant as a germline variant, and remove the variant from the plurality of variants.
- the proximity filter can be related to the allele frequency of a certain variant of the plurality of variants, the location of the variant in a region of a genome, and the proximity of the allele frequency of the variant to the allele frequency of identified germline variants in the same region of a genome.
- variants of the plurality of variants can be sorted or binned into a plurality of bins, such that variants located in the same region of a genome are sorted or binned into the same bin.
- the same region of a genome can be within the same chromosome, within the same arm of a chromosome, or within the same chromosomal cytoband. In some embodiments, the same region of a genome can be within the same contiguous 100 Mb, 50 Mb, 40 Mb, 30 Mb, 20 Mb, 10 Mb, 5 Mb, 1 Mb, or within any range between any two of the foregoing numbers.
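- The binning step can be illustrated as below. This is a sketch under assumptions: variants are tuples whose first two fields are chromosome and position, the helper name `bin_variants` is hypothetical, and fixed windows aligned to multiples of the window size are only one possible reading of "the same contiguous" region. Binning by chromosome arm or cytoband would need coordinate tables that are not shown here.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Optional, Tuple

Variant = Tuple[str, int]  # (chromosome, 1-based position); extra fields allowed


def bin_variants(variants: Iterable[Variant],
                 window_bp: Optional[int] = None) -> Dict[Hashable, List[Variant]]:
    """Group variants by genomic region.

    With window_bp=None, each chromosome is one bin. Otherwise each bin is a
    fixed window of `window_bp` bases on a chromosome (an assumption).
    """
    bins: Dict[Hashable, List[Variant]] = defaultdict(list)
    for variant in variants:
        chrom, pos = variant[0], variant[1]
        key = chrom if window_bp is None else (chrom, pos // window_bp)
        bins[key].append(variant)
    return dict(bins)


calls = [("chr7", 1_000_000), ("chr7", 15_000_000), ("chr1", 500)]
by_chrom = bin_variants(calls)
by_window = bin_variants(calls, window_bp=10_000_000)
print(sorted(by_chrom))
print(sorted(by_window))
```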
- the proximity filter also includes determining which binned variants are readily identifiable as germline variants.
- a binned variant can have a corresponding variant present in one or more reference databases and be identified as a germline variant.
- the proximity filter includes determining that variants having an allele frequency greater than or equal to a threshold frequency in the sample are germline variants. In some such embodiments, variants having an allele frequency greater than or equal to 0.7, 0.8, 0.9, or 1.0 can be identified as germline variants.
- the proximity filter includes determining a proximate range of an allele frequency for a variant that has not been identified as a germline variant.
- a proximate range of an allele frequency for a variant can include a range of allele frequencies above and below the allele frequency of the variant.
- the proximate range is a range having a maximum and a minimum from the allele frequency of the variant of 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, or any number within a range between any two of the foregoing numbers. For example, for a variant having an allele frequency of 0.2 and a proximate range of 0.05, the minimum and maximum of the proximate range would be allele frequencies of 0.15 and 0.25, respectively.
- the proximate range is determined by the value of two (n) standard deviations of a binomial distribution assuming the supporting evidence for the given variant is generated by a binomial process.
- the proximate range (z) can be:
- for example, for a variant having an allele frequency of 0.2, the proximate range would be 0.08, and the minimum and maximum of the proximate range would be allele frequencies of 0.12 and 0.28, respectively.
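- The expression for the proximate range is not reproduced in this text. One form consistent with the description is offered here as an assumption: it uses the standard deviation of a binomial proportion, with $p$ the allele frequency of the variant, $N$ the read depth at the variant position, and $n = 2$. The depth $N = 100$ is assumed only because it reproduces the 0.08 figure with the bounds 0.12 and 0.28.

```latex
z = n\sqrt{\frac{p\,(1-p)}{N}}, \qquad
z = 2\sqrt{\frac{0.2 \times 0.8}{100}} = 2 \times 0.04 = 0.08, \qquad
[\,p - z,\; p + z\,] = [\,0.12,\; 0.28\,]
```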
- the proximate range is the higher of either 0.05, or two (n) standard deviations from a binomial distribution of the allele frequency of the variant, above and below the allele frequency of the variant.
- a variant can be identified as a germline variant if the variant has an allele frequency within proximate range of one or more identified germline variants in the same bin as the variant. In some embodiments, a variant can be identified as a germline variant if the variant has an allele frequency within proximate range of more than 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10 identified germline variants in the same bin as the variant. In some embodiments, a variant can be identified as a germline variant if the variant has an allele frequency within proximate range of more than 5 identified germline variants in the same bin as the variant.
- a variant having an allele frequency of 0.2 with a proximate range of 0.05, thus having a minimum of 0.15 and a maximum of 0.25, and binned in a bin representing chromosome 7, would be identified as a germline variant where more than 5 identified germline variants binned in the bin representing chromosome 7 have allele frequencies within the proximate range of the variant.
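- The proximity filter logic described above can be sketched as follows. This is a minimal illustration rather than the implementation of the disclosure. It assumes binning by chromosome, a proximate range equal to the higher of 0.05 or two binomial standard deviations computed from read depth, a high-frequency cutoff of 0.9, and a requirement of more than 5 neighboring database variants. The `Variant` class, its field names, and both function names are hypothetical.

```python
import math
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass
class Variant:
    chrom: str
    pos: int
    vaf: float                 # variant allele frequency in the sample
    depth: int                 # read depth at the position (assumed available)
    in_database: bool = False  # found in a reference set of variants


def proximate_range(vaf: float, depth: int,
                    n_sd: float = 2.0, floor: float = 0.05) -> float:
    """Higher of `floor` or n_sd binomial standard deviations of the VAF."""
    sd = math.sqrt(vaf * (1.0 - vaf) / depth) if depth > 0 else 0.0
    return max(floor, n_sd * sd)


def proximity_filter(variants: Iterable[Variant],
                     min_neighbors: int = 5,
                     high_vaf: float = 0.9
                     ) -> Tuple[List[Variant], List[Variant]]:
    """Return (germline, somatic) after applying the proximity rule."""
    bins: Dict[str, List[Variant]] = defaultdict(list)
    for v in variants:
        bins[v.chrom].append(v)  # one bin per chromosome (assumption)

    germline: List[Variant] = []
    somatic: List[Variant] = []
    for members in bins.values():
        anchors = [v for v in members if v.in_database]
        for v in members:
            if v.in_database or v.vaf >= high_vaf:
                germline.append(v)  # readily identifiable as germline
                continue
            z = proximate_range(v.vaf, v.depth)
            neighbors = sum(1 for a in anchors if abs(a.vaf - v.vaf) <= z)
            if neighbors > min_neighbors:
                germline.append(v)  # VAF tracks local germline variants
            else:
                somatic.append(v)
    return germline, somatic


# Six database variants on chr7 with VAFs near 0.2, plus two uncharacterized calls.
anchors7 = [Variant("chr7", 1000 + i, f, 100, True)
            for i, f in enumerate([0.16, 0.18, 0.19, 0.21, 0.22, 0.24])]
query7 = Variant("chr7", 5000, 0.20, 100)  # surrounded by germline VAFs
query1 = Variant("chr1", 5000, 0.20, 100)  # no germline neighbors in its bin
germline, somatic = proximity_filter(anchors7 + [query7, query1])
print(len(germline), len(somatic))
```

- In this sketch only database variants serve as anchors; variants at or above the high-frequency cutoff are marked germline but are not used as neighbors, which is a design choice rather than something the text prescribes.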
- the proximity filter identifies somatic variants which are variants not identified as germline variants.
- the number of somatic variants obtained from sequencing data from a tumor is the tumor mutation burden of the tumor.
- the database filter or the proximity filter can be applied to the plurality of variants to identify and remove germline variants from the plurality of variants.
- the database filter and the proximity filter can be applied consecutively. For example, the output of the database filter can be used as the input of the proximity filter. Conversely, the output of the proximity filter can be used as the input of the database filter.
- Some embodiments of the methods and systems provided herein include an electronic system for analyzing genetic variation data.
- a database filter described herein and/or a proximity filter described herein can be applied to the genetic variation data to identify germline variants.
- Some embodiments can include an informatics module running on a processor and adapted to identify a plurality of variants from sequence data from a biological sample comprising a tumor cell, in which the plurality of variants comprises somatic variants and germline variants.
- Some embodiments include a database filter module adapted to remove germline variants from the plurality of variants, wherein the germline variants each have an allele count in a reference set of variants greater than or equal to a threshold allele count.
- the threshold allele count can be greater than or equal to 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, or 20.
- Some embodiments include a proximity filter module adapted to remove germline variants from the plurality of variants.
- the proximity filter module can include a binning sub-module adapted to return a plurality of bins, each bin containing variants of the plurality of variants located in the same region of a genome.
- variants of the plurality of variants can be sorted or binned into a plurality of bins, such that variants located in the same region of a genome are sorted or binned into the same bin.
- the same region of a genome can be within the same chromosome, within the same arm of a chromosome, or within the same chromosomal cytoband.
- the same region of a genome can be within the same contiguous 100 Mb, 50 Mb, 40 Mb, 30 Mb, 20 Mb, 10 Mb, 5 Mb, 1 Mb, or within any range between any two of the foregoing numbers.
- the proximity filter module can include an identification sub-module adapted to return database variants in the plurality of variants, wherein a database variant is present in a reference set of variants.
- the proximity filter module can include a removal sub-module adapted to remove germline variants from the plurality of variants, wherein the germline variants each have an allele frequency within a proximate range of an allele frequency of at least one database variant in the same bin as the germline variant.
- the proximity filter includes determining a proximate range of an allele frequency for a variant that has not been identified as a germline variant.
- the proximate range is a range having a maximum and a minimum from the allele frequency of a variant of 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, or any number within a range between any two of the foregoing numbers.
- the proximate range is a range having a maximum and a minimum of two standard deviations from a binomial distribution of the allele frequency of the variant. In some embodiments, the proximate range is the higher of 0.05, or two (n) standard deviations from a binomial distribution of the allele frequency of the variant, above and below the allele frequency of the variant.
- a variant can be identified as a germline variant if the variant has an allele frequency within proximate range of one or more identified germline variants in the same bin as the variant. In some embodiments, a variant can be identified as a germline variant if the variant has an allele frequency within proximate range of more than 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10 identified germline variants in the same bin as the variant. In some embodiments, the removal sub-module is adapted to remove a variant having an allele frequency greater than or equal to a threshold frequency. In some such embodiments, variants having an allele frequency greater than or equal to 0.7, 0.8, 0.9, or 1.0 can be identified as germline variants. In some embodiments, the removal sub-module is adapted to remove a database variant present in the reference set of variants from the plurality of variants.
- Some embodiments provided herein include computer-implemented methods for identifying somatic variants in a plurality of variants. Some such embodiments can include receiving a plurality of variants from sequence data from a biological sample comprising a tumor cell, the plurality of variants can include somatic variants and germline variants.
- Some embodiments include applying a database filter to the plurality of variants. Some such embodiments include creating an index of documents for the plurality of variants, and searching a reference set of variants with the index to identify germline variants in the index. In some embodiments, the germline variants each have an allele count in the reference set of variants greater than or equal to a threshold allele count. In some embodiments, the threshold allele count can be greater than or equal to 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, or 20. Some embodiments also include removing the identified germline variants from the index to create an index of first filtered variants.
- Some embodiments include applying a proximity filter to the index of first filtered variants. Some such embodiments include creating a plurality of bins for different regions of a genome. Some embodiments include binning variants of the index of first filtered variants, wherein variants located in the same region of a genome are binned into the same bin. In some embodiments, the same region of a genome can be within the same chromosome, within the same arm of a chromosome, or within the same chromosomal cytoband.
- the same region of a genome can be within the same contiguous 100 Mb, 50 Mb, 40 Mb, 30 Mb, 20 Mb, 10 Mb, 5 Mb, 1 Mb, or within any range between any two of the foregoing numbers.
- Some embodiments include searching a reference set of variants with the index of first filtered variants to identify database variants in the index of first filtered variants.
- Some embodiments include generating an index of germline variants from the index of first filtered variants by identifying germline variants.
- the germline variants each have an allele frequency within a proximate range of an allele frequency of at least one database variant in the same bin as the germline variant.
- the proximate range is a range having a maximum and a minimum from the allele frequency of the variant of 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, or any number within a range between any two of the foregoing numbers.
- the proximate range is a range having a maximum and a minimum of two standard deviations from a binomial distribution of the allele frequency of the variant. In some embodiments, the proximate range is the higher of 0.05, or two (n) standard deviations from a binomial distribution of the allele frequency of the variant, above and below the allele frequency of the variant.
- a variant can be identified as a germline variant if the variant has an allele frequency within a proximate range of one or more identified germline variants in the same bin as the variant. In some embodiments, a variant can be identified as a germline variant if the variant has an allele frequency within proximate range of more than 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10 identified germline variants in the same bin as the variant. In some embodiments, the germline variant can be identified as a variant having an allele frequency greater than or equal to a threshold frequency. In some such embodiments, variants having an allele frequency greater than or equal to 0.7, 0.8, 0.9, or 1.0 can be identified as germline variants.
- Some embodiments include removing the identified germline variants from the index of first filtered variants to create an index of somatic variants, thereby identifying somatic variants in the plurality of variants.
- the number of somatic variants obtained from sequencing data from a tumor is the tumor mutation burden of the tumor.
- Some embodiments of the methods and systems include methods of treating a tumor.
- the number of somatic variants present in a tumor can be determined by the methods and systems provided herein.
- sequence data can be obtained from a tumor.
- a plurality of variants can be identified from the sequence data.
- germline variants can be identified and removed from a plurality of variants, thereby identifying somatic variants in the plurality of variants.
- germline variants can be identified and removed from the plurality of variants by applying one or more of a database filter, and/or a proximity filter, thereby identifying somatic variants that are not removed by applying the one or more of the filters.
- the number of somatic variants obtained from sequencing data from a tumor is the tumor mutation burden of the tumor.
- tumor mutation burden is calculated as an average number of somatic variants per genomic region, such as, for example, mutations per 50 kb, 100 kb, 1 Mb, 10 Mb, 100 Mb, and the like.
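- The normalization described above amounts to a single scaling step. The sketch below assumes the somatic variant count and the size of the sequenced region are already known; the function name `tumor_mutation_burden` and the example numbers are hypothetical.

```python
def tumor_mutation_burden(n_somatic: int, region_bp: int,
                          per_bp: int = 1_000_000) -> float:
    """Somatic variants per `per_bp` bases of sequenced region.

    n_somatic: count of variants left after germline filtering.
    region_bp: size of the sequenced region in base pairs.
    per_bp:    reporting unit, e.g. 1_000_000 for mutations per Mb.
    """
    if region_bp <= 0:
        raise ValueError("region_bp must be positive")
    return n_somatic * per_bp / region_bp


# 20 somatic variants over a hypothetical 2 Mb region.
print(tumor_mutation_burden(20, 2_000_000))            # per Mb
print(tumor_mutation_burden(20, 2_000_000, 100_000))   # per 100 kb
```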
- Tumor mutation burden can be sampled by sequencing an entire genome or a portion thereof. For example, a portion of a genome may be sequenced by enriching for one or more genomic regions of interest, such as a tumor gene panel, a full exome, a partial exome, and the like.
- Some embodiments of treating a tumor can include determining a tumor has a tumor mutation burden greater than or equal to a tumor mutation burden threshold, and contacting the tumor with an effective amount of therapeutic agent. Some embodiments include treating a subject having a tumor and can include determining a tumor has a tumor mutation burden greater than or equal to a TMB threshold, and administering to the subject an effective amount of therapeutic agent.
- a tumor mutation burden threshold can be 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000 or any number in a range between any two of the foregoing numbers.
- therapeutic agents include chemotherapeutic agents.
- the therapeutic agent can include a checkpoint inhibitor.
- checkpoint inhibitors include a CTLA-4 inhibitor, a PD-1 inhibitor, and a PD-L1 inhibitor.
- the checkpoint inhibitor can include Ipilimumab, Nivolumab, Pembrolizumab, Spartalizumab, Atezolizumab, Avelumab, and Durvalumab.
- tumors include a colorectal tumor, a lung tumor, an endometrium tumor, a uterine tumor, a gastric tumor, a melanoma, a breast tumor, a pancreatic tumor, a kidney tumor, a bladder tumor, and a brain tumor. More examples of cancers that can be treated with the methods and systems included herein are listed in U.S. 20180218789 which is expressly incorporated by reference herein in its entirety.
- a biological sample can include a tumor cell.
- a biological sample can include a serum sample, a stool sample, a blood sample, and a tumor sample. In some embodiments, the biological sample is fixed.
- a subject can provide a biological sample.
- the biological sample can be any substance that is produced by the subject.
- the biological sample is any tissue taken from the subject or any substance produced by the subject. Examples of biological samples can include blood, plasma, saliva, cerebrospinal fluid (CSF), cheek tissue, urine, feces, skin, hair, and organ tissue.
- the biological sample is a solid tumor or a biopsy of a solid tumor.
- the biological sample is a formalin-fixed, paraffin-embedded (FFPE) tissue sample.
- the biological sample can be any biological sample that comprises nucleic acids. Biological samples may be derived from a subject.
- the subject may be a mammal, a reptile, an amphibian, an avian, or a fish.
- mammals include a human, ape, orangutan, monkey, chimpanzee, cow, pig, horse, rodent, bird, reptile, dog, cat, dolphin, or other animal.
- reptiles include a lizard, snake, alligator, turtle, crocodile, iguana, and tortoise.
- Examples of amphibians include a toad, frog, newt, and salamander.
- avians include chickens, ducks, geese, penguins, ostriches, puffins, and owls.
- fish include catfish, eels, sharks, goldfish, and swordfish.
- the subject is a human.
- Some embodiments include computer-based systems and computer implemented methods for performing the methods described herein.
- the systems can be utilized for determining and reporting the presence or absence of variants in a sample, such as germline variants and/or somatic variants.
- the system can comprise one or more client components.
- the one or more client components can comprise a user interface.
- the system can comprise one or more server components.
- the server components can comprise one or more memory locations.
- the one or more memory locations can be configured to receive a data input.
- the data input can comprise sequencing data.
- the sequencing data can be generated from a nucleic acid sample from a subject.
- the system can further comprise one or more computer processors.
- the one or more computer processors can be operably coupled to the one or more memory locations.
- the one or more computer processors can be programmed to map the sequencing data to a reference sequence.
- the one or more computer processors can be further programmed to determine a presence or absence of a plurality of variants from the sequencing data.
- the one or more computer processors can be further programmed to apply at least one filter to the genetic variants to identify germline variants. Examples of filters include a database filter and a proximity filter.
- the one or more computer processors can be further programmed to remove identified germline variants from an index of the identified variants.
- the one or more computer processors can be further programmed to generate an output for display on a screen.
- the output can comprise one or more reports identifying the germline variants and/or the somatic variants in the plurality of variants.
- Some embodiments of the methods and systems can comprise one or more client components.
- the one or more client components can comprise one or more software components, one or more hardware components, or a combination thereof.
- the one or more client components can access one or more services through one or more server components.
- the one or more services can be accessed by the one or more client components through a network.
- “Services” is used herein to refer to any product, method, function, or use of the system.
- a user can place an order for a genetic test.
- the order can be placed through the one or more client components of the system and the request can be transmitted through a network to the one or more server components of the system.
- the network can be the Internet, an internet and/or extranet, or an intranet and/or extranet that is in communication with the Internet.
- the network in some cases is a telecommunication and/or data network.
- the network can include one or more computer servers, which can enable distributed computing, such as cloud computing.
- the network, in some cases with the aid of the computer system, can implement a peer-to-peer network, which may enable devices coupled to the computer system to behave as a client or a server.
- Some embodiments of the systems can comprise one or more memory locations, such as random-access memory, read-only memory, flash memory, electronic storage unit, such as hard disk; communication interface, such as network adapter, for communicating with one or more other systems, and peripheral devices, such as cache, other memory, data storage and/or electronic display adapters.
- the memory, storage unit, interface and peripheral devices are in communication with the CPU through a communication bus, such as a motherboard.
- the storage unit can be a data storage unit or data repository for storing data.
- the one or more memory locations can store the received sequencing data.
- Some embodiments of the methods and systems can comprise one or more computer processors.
- the one or more computer processors may be operably coupled to the one or more memory locations to e.g., access the stored sequencing data.
- the one or more computer processors can implement machine executable code to carry out the methods described herein. For instance, the one or more computer processors can execute machine readable code to map a sequencing data input to a reference sequence, and/or identify germline variants and/or somatic variants.
- Some embodiments of the methods and systems provided herein can include machine executable or machine readable code.
- the machine executable or machine readable code can be provided in the form of software.
- the code can be executed by the processor.
- the code can be retrieved from the storage unit and stored on the memory for ready access by the processor.
- the electronic storage unit can be precluded, and machine-executable instructions are stored on memory.
- the code can be pre-compiled and configured for use with a machine having a processor adapted to execute the code, can be compiled during runtime, or can be interpreted during runtime.
- the code can be supplied in a programming language that can be selected to enable the code to execute in a pre-compiled, as-compiled or interpreted fashion.
- Some embodiments of the systems and methods provided herein, such as the computer system, can be embodied in programming.
- Various aspects of the technology may be thought of as “products” or “articles of manufacture” typically in the form of machine (or processor) executable code and/or associated data that is carried on or embodied in a type of machine readable medium.
- Machine-executable code can be stored on an electronic storage unit, such as memory or a hard disk.
- “Storage” type media can include any or all of the tangible memory of the computers, processors or the like, or associated modules thereof, such as various semiconductor memories, tape drives, disk drives and the like, which may provide non-transitory storage at any time for the software programming.
- All or portions of the software may at times be communicated through the Internet or various other telecommunication networks. Such communications, for example, may enable loading of the software from one computer or processor into another, for example, from a management server or host computer into the computer platform of an application server.
- another type of media that may bear the software elements includes optical, electrical and electromagnetic waves, such as used across physical interfaces between local devices, through wired and optical landline networks and over various air-links.
- the physical elements that carry such waves, such as wired or wireless links, optical links or the like, also may be considered as media bearing the software.
- terms such as computer or machine “readable medium” refer to any medium that participates in providing instructions to a processor for execution.
- Some embodiments of the methods and systems disclosed herein can include or be in communication with one or more electronic displays.
- the electronic display can be part of the computer system, or coupled to the computer system directly or through the network.
- the computer system can include a user interface (UI) for providing various features and functionalities disclosed herein.
- UIs include, without limitation, graphical user interfaces (GUIs) and web-based user interfaces.
- the UI can provide an interactive tool by which a user can utilize the methods and systems described herein.
- a UI as envisioned herein can be a web-based tool by which a healthcare practitioner can order a genetic test, customize a list of genetic variants to be tested, and receive and view a biomedical report.
- Some embodiments of the methods and systems disclosed herein may comprise biomedical databases, genomic databases, biomedical reports, disease reports, case-control analysis, and rare variant discovery analysis based on data and/or information from one or more databases, one or more assays, one or more data or results, one or more outputs based on or derived from one or more assays, one or more outputs based on or derived from one or more data or results, or a combination thereof.
- FIG. 2A is a graph showing the variant allele frequency (VAF) for different variants according to chromosomal location of each variant with somatic variants (black-filled circles), and germline variants (gray-filled circles). This method required two samples from the individual.
- Sequence data was obtained for a tumor sample only from Example 1. Variants were identified in the sequence data.
- variants called from a variant calling pipeline were annotated using an annotation tool, Nirvana (Illumina, San Diego).
- Nirvana provided clinical-grade annotation of genomic variants, such as single nucleotide variants, multi-nucleotide variants, insertions, deletions, and copy number variants.
- the input to Nirvana was in a variant call format (VCF) and the output was a structured JSON representation of all annotation and sample information.
- the total allele counts were parsed for a given variant in the genome aggregation database (gnomAD) exome, gnomAD genome, and the 1000 genome database, along with the variant allele frequencies and coverage. These total allele counts represented the total number of observations within the database across different sub-populations. For each variant, the maximum allele count observed across all three databases was taken, to account for regions that were not covered in the exome database while taking advantage of its larger sample size compared to the genome database. The filtering strategy marked variants with a maximum allele count greater than or equal to 10 as potential germline variants.
- FIG. 2B is a graph showing the variant allele frequency (VAF) for various variants according to chromosomal location of each variant with filter-determined somatic variants (black-filled circles), and filter-determined germline variants (gray-filled circles). This demonstrated that database filtering alone mis-called variants.
- Sequence data was obtained for a tumor sample only from an individual. Variants were identified in the sequence data.
- the database filter of Example 2 was applied to the variants. A proximity filter was used to further filter out variants that were not found in the database.
- the proximity filter used information of database filtered variants in close positional proximity. For a given variant that was not found in the database and had an allele frequency lower than 0.9, variants on the same chromosome were retrieved within a given range of variant allele frequencies of the unfiltered variant. Variants with an allele frequency greater than 90% were marked as germline without any further processing.
- FIG. 3 (left panel) is a graph showing the variant allele frequency (VAF) for various variants according to chromosomal location for chromosomes 1-7 for each variant with filter-determined somatic variants (black-filled circles), and filter-determined germline variants (gray-filled circles), filtered with a database filter only.
- FIG. 3 (right panel) is an enlargement for variants located on chromosome 7 in which a particular filter-determined somatic variant (black circle) has been selected, and a range drawn from the variant that encompasses several filter-determined germline variants (gray circle).
- a determination that the selected filter-determined somatic variant (black circle) should be called as a germline variant can be made based on the proximity of the selected variant's allele frequency to the allele frequencies of a certain number of already identified germline variants.
- FIG. 4A is a graph showing the variant allele frequency (VAF) for various variants according to chromosomal location of each variant with filter-determined somatic variants (black-filled circles), and filter-determined germline variants (gray-filled circles), filtered with a database filter only.
- FIG. 4B is a graph showing the variant allele frequency (VAF) for various variants according to chromosomal location of each variant with filter-determined somatic variants (black-filled circles), and filter-determined germline variants (gray-filled circles), filtered with both a database filter and a proximity filter.
- FIG. 4B shows that certain putative false positives, shown as somatic variants in FIG. 4A, were identified as germline variants in FIG. 4B.
- For example, variants located on chromosome 7 having allele frequencies of about 0.4 and 0.3, which were initially identified as somatic, were identified as germline variants when the proximity filter was applied (FIG. 4B).
- This example relates to a targeted next-generation sequencing assay for measuring tumor mutation burden (TMB) in formalin-fixed, paraffin-embedded (FFPE) tumor samples.
- FIG. 5 shows an example workflow for the assay. Sequence data was obtained from tumor samples for 523 genes in a panel size of 1.94 Mb with an exon size of 1.33 Mb. Sequencing was performed with unique molecular identifiers (UMIs) on Illumina NextSeq™ 500/550 platforms. Data analysis was performed using a pipeline for detecting variants at 5% variant allele frequency (VAF). For technical noise removal, a variant calling algorithm was used that utilized information from UMIs and sample-specific error profiles to ensure uniform variant calling performance across samples of different FFPE qualities. To accurately remove germline variants from TMB calculations, a hybrid strategy was used that integrated information from large-scale public databases with the measured coverage and variant allele frequency of each variant, and that was substantially similar to the database filter and the proximity filter of the foregoing Examples.
- Sequence data was obtained and aligned with a reference, and variants were identified.
- Germline variants were filtered from the identified variants using a database filter and a proximity filter, and a TMB was calculated in a workflow substantially similar to the pipeline shown in FIG. 6 .
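As a rough illustration of the final step, TMB is commonly expressed as the number of somatic variants per megabase of sequenced region. The sketch below assumes that definition; the function name is hypothetical, and the choice of denominator (the 1.33 Mb exon size rather than the 1.94 Mb panel size) is an assumption, since the text does not state which one the workflow uses or which variant types are counted.

```python
def tumor_mutation_burden(labels, region_size_mb):
    """Return somatic variants per Mb, given per-variant labels.

    labels: iterable of 'somatic' / 'germline' strings, for example the
    output of a germline filtering step.
    region_size_mb: size of the region used for normalization, in Mb.
    """
    if region_size_mb <= 0:
        raise ValueError("region_size_mb must be positive")
    somatic_count = sum(1 for label in labels if label == "somatic")
    return somatic_count / region_size_mb


# Hypothetical example: 10 somatic calls remain after germline filtering.
example_labels = ["somatic"] * 10 + ["germline"] * 90
example_tmb = tumor_mutation_burden(example_labels, region_size_mb=1.33)
```

Because the denominator is small for a targeted panel, each germline variant left in the somatic set shifts the result noticeably, which is why the germline filtering step matters for TMB accuracy.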
- A total of 170 pairs of tumor-normal samples were analyzed to assess the germline filtering and TMB performance (TABLE 1).
- A subset of 108 sample pairs was also analyzed with whole exome sequencing (WES).
- FIG. 7 shows the distribution of the remaining germline variant count after filtering with the database only (graph peaks at about 3 germline residuals/Mb) and with the hybrid strategy (graph peaks at about 0 germline residuals/Mb).
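One plausible way to compute a germline residual rate from tumor-normal pairs is sketched below. It assumes that a residual is a variant still called somatic by the tumor-only filters but also observed in the matched normal sample, and that the count is normalized by the 1.94 Mb panel size. Both points are assumptions, as is the function name; the text does not spell out the calculation.

```python
def germline_residuals_per_mb(tumor_only_somatic, normal_variants, panel_size_mb):
    """Return residual germline variants per Mb.

    tumor_only_somatic: variant keys still called somatic after filtering.
    normal_variants: variant keys observed in the matched normal sample.
    Keys are hashable tuples such as (chrom, pos, ref, alt).
    """
    if panel_size_mb <= 0:
        raise ValueError("panel_size_mb must be positive")
    # A residual is a tumor-only somatic call that the normal sample
    # shows to be germline.
    residuals = set(tumor_only_somatic) & set(normal_variants)
    return len(residuals) / panel_size_mb


# Hypothetical variant keys for one tumor-normal pair.
somatic_calls = {("chr1", 100, "A", "T"), ("chr2", 200, "G", "C"), ("chr7", 300, "C", "T")}
normal_calls = {("chr2", 200, "G", "C"), ("chr3", 5, "A", "G")}
rate = germline_residuals_per_mb(somatic_calls, normal_calls, panel_size_mb=1.94)
```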
- TMB reproducibility was assessed in 8 different samples including 4 cell lines and 4 FFPE samples across 3 operators. Mean and standard deviation (SD) of each sample were calculated. TABLE 2 lists TMB reproducibility assessed in 4 cell lines and 4 FFPE samples across 12 replicates each.
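Sample | Type | Replicates | TMB mean | TMB SD |
---|---|---|---|---|
T47D | Cell line | 12 | 0.9 | 0.7 |
H2228 | Cell line | 12 | 7.5 | 0.8 |
HD799 | Cell line | 12 | 405.0 | 6.8 |
OncoSpan | Cell line | 12 | 389.1 | 8.4 |
1251 | FFPE | 12 | 0.3 | 0.4 |
4116 | FFPE | 11 | 24.9 | 0.7 |
3643 | FFPE | 12 | 7.6 | 1.4 |
4118 | FFPE | 12 | 50.5 | 1.5 |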
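The per-sample mean and SD can be computed from replicate TMB values as in the sketch below. The replicate values shown are hypothetical, since the individual measurements are not reported, and the use of the sample standard deviation (n - 1 denominator) is an assumption because the text does not say which estimator was applied.

```python
import statistics


def summarize_replicates(tmb_values):
    """Return (mean, sample SD) for a list of replicate TMB values."""
    if len(tmb_values) < 2:
        raise ValueError("at least two replicates are needed for an SD")
    mean = statistics.mean(tmb_values)
    sd = statistics.stdev(tmb_values)  # sample SD, n - 1 denominator
    return mean, sd


# Hypothetical replicate values for a single sample.
replicates = [7.2, 7.8, 7.5]
replicate_mean, replicate_sd = summarize_replicates(replicates)
```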
- FIG. 8A shows TMB comparison between tumor-only and tumor/normal assays.
- FIG. 8B shows TMB comparison between tumor-only and WES tumor-normal assays.
Landscapes
- Life Sciences & Earth Sciences (AREA)
- Chemical & Material Sciences (AREA)
- Health & Medical Sciences (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Organic Chemistry (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Analytical Chemistry (AREA)
- Biotechnology (AREA)
- Biophysics (AREA)
- General Health & Medical Sciences (AREA)
- Genetics & Genomics (AREA)
- Zoology (AREA)
- Wood Science & Technology (AREA)
- Molecular Biology (AREA)
- Immunology (AREA)
- Medical Informatics (AREA)
- Theoretical Computer Science (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- General Engineering & Computer Science (AREA)
- Biochemistry (AREA)
- Microbiology (AREA)
- Pathology (AREA)
- Hospice & Palliative Care (AREA)
- Oncology (AREA)
- Software Systems (AREA)
- Public Health (AREA)
- Evolutionary Computation (AREA)
- Epidemiology (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioethics (AREA)
- Artificial Intelligence (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/669,270 US20200143905A1 (en) | 2018-11-01 | 2019-10-30 | Methods and compositions for germline variant detection |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201862754094P | 2018-11-01 | 2018-11-01 | |
US16/669,270 US20200143905A1 (en) | 2018-11-01 | 2019-10-30 | Methods and compositions for germline variant detection |
Publications (1)
Publication Number | Publication Date |
---|---|
US20200143905A1 true US20200143905A1 (en) | 2020-05-07 |
Family
ID=68610356
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/669,270 Pending US20200143905A1 (en) | 2018-11-01 | 2019-10-30 | Methods and compositions for germline variant detection |
Country Status (12)
Country | Link |
---|---|
US (1) | US20200143905A1 (pt) |
EP (1) | EP3874066A1 (pt) |
JP (1) | JP7554121B2 (pt) |
KR (1) | KR20210083208A (pt) |
CN (1) | CN112424380A (pt) |
AU (1) | AU2019369517A1 (pt) |
BR (1) | BR112020026259A2 (pt) |
CA (1) | CA3104004A1 (pt) |
IL (1) | IL279435A (pt) |
MX (1) | MX2020014090A (pt) |
SG (1) | SG11202012487WA (pt) |
WO (1) | WO2020092591A1 (pt) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR102427600B1 (ko) * | 2021-12-14 | 2022-08-01 | 주식회사 테라젠바이오 | 줄기세포의 배양적응성을 판단하기 위한 체세포 변이를 선별하는 방법 |
US20230215513A1 (en) | 2021-12-31 | 2023-07-06 | Sophia Genetics S.A. | Methods and systems for detecting tumor mutational burden |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11261494B2 (en) * | 2012-06-21 | 2022-03-01 | The Chinese University Of Hong Kong | Method of measuring a fractional concentration of tumor DNA |
EP3137601B1 (en) * | 2014-04-29 | 2020-04-08 | Illumina, Inc. | Multiplexed single cell gene expression analysis using template switch and tagmentation |
BR112017007282A2 (pt) | 2014-10-10 | 2018-06-19 | Invitae Corp | métodos, sistemas e processos de montagem de novo de leituras de sequenciamento |
CA2982570C (en) | 2015-04-13 | 2023-08-22 | Invitae Corporation | Methods, systems and processes of identifying genetic variation in highly similar genes |
CN107922973B (zh) | 2015-07-07 | 2019-06-14 | 远见基因组系统公司 | 用于基于测序的变型检测的方法和系统 |
JP6743150B2 (ja) * | 2015-08-28 | 2020-08-19 | イルミナ インコーポレイテッド | 単一細胞の核酸配列分析 |
WO2018067886A2 (en) | 2016-10-05 | 2018-04-12 | Nantomics, Llc | Stress induced mutations as a hallmark of cancer |
CN107491666B (zh) * | 2017-09-01 | 2020-11-10 | 深圳裕策生物科技有限公司 | 异常组织中单样本体细胞突变位点检测方法、装置和存储介质 |
-
2019
- 2019-10-30 KR KR1020207037041A patent/KR20210083208A/ko active Search and Examination
- 2019-10-30 AU AU2019369517A patent/AU2019369517A1/en active Pending
- 2019-10-30 SG SG11202012487WA patent/SG11202012487WA/en unknown
- 2019-10-30 EP EP19805856.2A patent/EP3874066A1/en active Pending
- 2019-10-30 JP JP2020572675A patent/JP7554121B2/ja active Active
- 2019-10-30 CN CN201980042604.3A patent/CN112424380A/zh active Pending
- 2019-10-30 CA CA3104004A patent/CA3104004A1/en active Pending
- 2019-10-30 US US16/669,270 patent/US20200143905A1/en active Pending
- 2019-10-30 BR BR112020026259-5A patent/BR112020026259A2/pt unknown
- 2019-10-30 MX MX2020014090A patent/MX2020014090A/es unknown
- 2019-10-30 WO PCT/US2019/058895 patent/WO2020092591A1/en unknown
-
2020
- 2020-12-14 IL IL279435A patent/IL279435A/en unknown
Also Published As
Publication number | Publication date |
---|---|
MX2020014090A (es) | 2021-03-09 |
CA3104004A1 (en) | 2020-05-07 |
JP7554121B2 (ja) | 2024-09-19 |
KR20210083208A (ko) | 2021-07-06 |
EP3874066A1 (en) | 2021-09-08 |
IL279435A (en) | 2021-01-31 |
JP2022511208A (ja) | 2022-01-31 |
BR112020026259A2 (pt) | 2021-07-27 |
CN112424380A (zh) | 2021-02-26 |
WO2020092591A1 (en) | 2020-05-07 |
SG11202012487WA (en) | 2021-01-28 |
AU2019369517A1 (en) | 2021-01-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109689891B (zh) | 用于无细胞核酸的片段组谱分析的方法 | |
JP7340021B2 (ja) | 予測腫瘍遺伝子変異量に基づいた腫瘍分類 | |
WO2017065959A2 (en) | Methods and compositions that utilize transcriptome sequencing data in machine learning-based classification | |
CN110168648A (zh) | 序列变异识别的验证方法和系统 | |
US11475978B2 (en) | Detection of human leukocyte antigen loss of heterozygosity | |
WO2021258026A1 (en) | Molecular response and progression detection from circulating cell free dna | |
US20210090686A1 (en) | Single cell rna-seq data processing | |
US20210118526A1 (en) | Calculating cell-type rna profiles for diagnosis and treatment | |
US12054712B2 (en) | Fragment size characterization of cell-free DNA mutations from clonal hematopoiesis | |
US20200143905A1 (en) | Methods and compositions for germline variant detection | |
US20220223227A1 (en) | Machine learning techniques for identifying malignant b- and t-cell populations | |
RU2813655C2 (ru) | Способы и композиции для обнаружения соматического варианта | |
US20220301654A1 (en) | Systems and methods for predicting and monitoring treatment response from cell-free nucleic acids | |
CA3219608A1 (en) | Detection of human leukocyte antigen loss of heterozygosity | |
Carroll et al. | A chromosome-scale fishing cat reference genome for the evaluation of potential germline risk variants | |
Carroll et al. | A novel fishing cat reference genome for the evaluation of potential germline risk variants | |
Persson | Comparing Two Algorithms for the Detection of Cross-Contamination in Simulated Tumor Next-Generation Sequencing Data | |
Yang | Statistical Methods for Comapring Next-generation Sequencing Data-Reproducibility, Similarity and Differentiation | |
WO2021086335A1 (en) | In silico genomic variant identification | |
Friedenberg | Understanding the Genetic Basis of Addison's Disease in Standard Poodle Dogs |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
| | AS | Assignment | Owner name: ILLUMINA, INC., CALIFORNIA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:JU, JIN HYUN;REEL/FRAME:052510/0935 Effective date: 20200225 |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: FINAL REJECTION MAILED |