CN110689930B - Method and device for detecting TMB - Google Patents

Method and device for detecting TMB

Info

Publication number
CN110689930B
Authority
CN
China
Prior art keywords
mutation
somatic mutation
sites
candidate
site
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910995474.7A
Other languages
Chinese (zh)
Other versions
CN110689930A (en)
Inventor
董永芳
郭璟
楼峰
曹善柏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Xiangxin Medical Technology Co ltd
Tianjin Xiangxin Biotechnology Co ltd
Beijing Xiangxin Biotechnology Co ltd
Original Assignee
Beijing Xiangxin Medical Technology Co ltd
Tianjin Xiangxin Biotechnology Co ltd
Beijing Xiangxin Biotechnology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Xiangxin Medical Technology Co ltd, Tianjin Xiangxin Biotechnology Co ltd, Beijing Xiangxin Biotechnology Co ltd
Priority to CN201910995474.7A
Publication of CN110689930A
Application granted
Publication of CN110689930B

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/10 Signal processing, e.g. from mass spectrometry [MS] or from PCR
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/50 Mutagenesis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/30 Data warehousing; Computing architectures

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Theoretical Computer Science (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Bioethics (AREA)
  • Analytical Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Databases & Information Systems (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Genetics & Genomics (AREA)
  • Apparatus Associated With Microorganisms And Enzymes (AREA)

Abstract

The invention provides a method and a device for detecting TMB. The method comprises the following steps: removing the germline mutation sites in the sequencing data of the sample to be detected by using the sequencing data of the paired white blood cells to obtain a candidate somatic mutation site set; filtering false positive somatic mutation sites in the candidate somatic mutation site set to obtain a somatic mutation site set to be detected, wherein the false positive somatic mutation sites comprise at least one of the following: mutation sites caused by oxidative damage, mutation sites caused by background noise; and dividing the number of load mutations in the somatic mutation site set to be detected by the total length of the sequencing data within the exon regions to obtain the TMB. By making full use of paired white blood cells, a background-noise mutation-frequency distribution database, and oxidative-damage filtering, false positive somatic mutations are removed, so that the accuracy and stability of the TMB are improved.

Description

Method and device for detecting TMB
Technical Field
The invention relates to the field of gene sequencing data analysis, in particular to a method and a device for detecting TMB.
Background
Tumor mutation burden (TMB) is an indicator reflecting the total number of somatic mutations in tumor cells, usually expressed as the number of tumor somatic mutations per megabase (Mb) of the tumor genomic region. A tumor with a high TMB carries more mutations in its cells, which suggests that more tumor neoantigens may be available for recognition by the immune system, helping immune cells kill tumor cells more effectively.
The method currently in common use for detecting tumor mutation burden is the strategy proposed by the Lawrence team in Nature in 2015, which determines the tumor mutation burden status by counting the somatic mutations across the whole exome (average depth < 200X). However, this method frequently produces false positives and false negatives.
Therefore, it is urgently required to develop a new method for detecting TMB.
Disclosure of Invention
The main objective of the invention is to provide a method and a device for detecting TMB, so as to solve the problem of inaccurate TMB detection in the prior art.
In order to achieve the above object, according to one aspect of the present invention, there is provided a method of detecting TMB, the method including: removing germline mutation sites in sequencing data of a sample to be detected by using sequencing data of paired white blood cells to obtain a candidate somatic mutation site set; filtering false positive somatic mutation sites in the candidate somatic mutation site set to obtain a somatic mutation site set to be detected, wherein the false positive somatic mutation sites comprise at least one of the following: mutation sites caused by oxidative damage, mutation sites caused by background noise; and dividing the number of load mutations in the somatic mutation site set to be detected by the total length of the sequencing data within the exon regions to obtain the TMB.
Further, the false positive somatic mutation sites include oxidative damage-induced mutation sites, and before filtering the false positive somatic mutation sites in the set of candidate somatic mutation sites, the method further comprises determining whether the somatic mutation sites in the set of candidate somatic mutation sites are oxidative damage-induced mutation sites.
Further, determining whether the somatic mutation sites in the candidate set of somatic mutation sites are oxidative damage-induced mutation sites comprises: searching reads supporting candidate somatic mutation sites, and judging whether the reads are positioned in a positive strand or a negative strand; counting the ratio of the number of reads of the positive strand to the number of reads of the negative strand of the candidate somatic mutation site, and judging whether the ratio is greater than a first threshold or smaller than a second threshold, if so, the candidate somatic mutation site is a mutation site caused by oxidative damage; preferably, the first threshold value is equal to or greater than 2, and the second threshold value is equal to or less than 0.5.
Further, the set of false positive somatic mutation sites includes mutation sites caused by background noise, and before filtering the false positive somatic mutation sites in the set of candidate somatic mutation sites, the method further comprises determining whether the somatic mutation sites in the set of candidate somatic mutation sites are mutation sites caused by background noise.
Further, determining whether a somatic mutation site in the candidate set of somatic mutation sites is a mutation site caused by background noise comprises: removing germline mutation sites in sequencing data of healthy people by using sequencing data of white blood cells to obtain a somatic mutation site set of the healthy people; establishing a Weibull distribution model of background noise mutation frequencies of different mutation types of each detection site by using the somatic mutation site set of healthy people; calculating the mutation frequency of each candidate somatic mutation site in the candidate somatic mutation site set of the sample to be detected, and calculating the P value of the mutation frequency of each candidate somatic mutation site in the Weibull distribution model; judging whether the P value is larger than or equal to a third threshold value, and if so, taking the candidate somatic mutation site as a mutation site caused by background noise; preferably, the third threshold value is equal to or greater than 0.05.
Further, before dividing the number of load mutations in the set of somatic mutation sites by the total length of the sequencing data within the exon regions, the method further comprises: counting the number of load mutations in the somatic mutation site set.
Further, counting the number of load mutations in the set of somatic mutation sites includes: counting the total number of all of the following mutation types in the somatic mutation site set: synonymous mutation, non-synonymous mutation, frameshift mutation and non-frameshift mutation; and removing at least one of the following from the total number to obtain the number of load mutations: mutation sites with a population frequency in the 1000 Genomes Project greater than 0.01, and mutation sites annotated in COSMIC.
Further, before dividing the number of load mutations in the set of somatic mutation sites by the total length of the sequencing data within the exon regions, the method further comprises: calculating the total length of the sequencing data within the exon regions.
In order to achieve the above object, according to another aspect of the present invention, there is provided an apparatus for detecting TMB, the apparatus including: a detection module for removing germline mutation sites in sequencing data of a sample to be detected by using the sequencing data of paired leukocytes to obtain a candidate somatic mutation site set; a filtering module for filtering false positive somatic mutation sites in the candidate somatic mutation site set to obtain a somatic mutation site set to be detected, wherein the false positive somatic mutation sites comprise at least one of the following: mutation sites caused by oxidative damage, mutation sites caused by background noise; and a TMB calculation module for dividing the number of load mutations in the somatic mutation site set to be detected by the total length of the sequencing data within the exon regions to obtain the TMB.
Further, the device also comprises an oxidative damage judging module which is used for judging whether the somatic mutation sites in the candidate somatic mutation site set are the mutation sites caused by oxidative damage.
Further, the oxidation damage judgment module comprises: the device comprises a searching module, a first statistic module and a ratio judging module, wherein the searching module is used for searching reads supporting candidate somatic mutation sites and judging whether the reads are positioned in a positive strand or a negative strand; the first statistic module is used for counting the ratio of the number of reads of a positive strand and the number of reads of a negative strand supporting the candidate somatic mutation sites, the ratio judgment module is used for judging whether the ratio is larger than a first threshold or smaller than a second threshold, and if yes, the candidate somatic mutation sites are mutation sites caused by oxidative damage; preferably, the first threshold value is equal to or greater than 2, and the second threshold value is equal to or less than 0.5.
Further, the device also comprises a background noise judging module for judging whether the somatic mutation sites in the candidate somatic mutation site set are mutation sites caused by background noise.
Further, the background noise determination module comprises: a health site set acquisition module, a model establishment module, a P value calculation module and a noise judgment module, wherein the health site set acquisition module is used for removing germline mutation sites in sequencing data of healthy people by using sequencing data of white blood cells to obtain a somatic mutation site set of the healthy people; the model establishment module is used for establishing a Weibull distribution model of background noise mutation frequencies of different mutation types of each detection site by using the somatic mutation site set of healthy people; the P value calculation module is used for calculating the mutation frequency of each candidate somatic mutation site in the candidate somatic mutation site set of the sample to be detected and calculating the P value of the mutation frequency of each candidate somatic mutation site in the Weibull distribution model; the noise judgment module is used for judging whether the P value is larger than or equal to a third threshold value, and if so, the candidate somatic mutation site is a mutation site caused by background noise; preferably, the third threshold value is equal to or greater than 0.05.
Further, the apparatus further comprises: and the load mutation number counting module is used for counting the load mutation number in the somatic mutation site set.
Further, the load mutation number statistic module comprises: a statistic unit and a removal unit, wherein the statistic unit is used for counting the total number of all of the following mutation types in the somatic mutation site set: synonymous mutation, non-synonymous mutation, frameshift mutation and non-frameshift mutation; the removal unit is used for removing at least one of the following from the total number to obtain the number of load mutations: mutation sites with a population frequency in the 1000 Genomes Project greater than 0.01, and mutation sites annotated in COSMIC.
Further, the apparatus further comprises: a length calculation module for calculating the total length of the sequencing data within the exon regions.
According to a third aspect of the present invention, there is provided a storage medium having stored thereon a computer-executable program configured to, when executed, perform any one of the above-described methods of detecting a TMB.
According to a fourth aspect of the present invention, there is provided an electronic device comprising a memory having stored therein a computer program and a processor arranged to execute the computer program to perform any of the above-described methods of detecting a TMB.
By applying the technical scheme of the invention, germline mutations carried by the sample are first removed using the paired white blood cells, which greatly reduces the influence of germline mutations on the TMB value; second, false positive sites caused by DNA oxidative damage introduced during steps such as library construction and DNA fragmentation are removed; and/or the influence on the TMB value of false positive somatic mutations caused by low-frequency background noise is removed using a background-noise frequency distribution database built from healthy people. In other words, false positive somatic mutations are removed by making full use of paired white blood cells, the background-noise mutation-frequency distribution database, and oxidative-damage filtering, improving the accuracy and stability of the TMB value.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the invention and, together with the description, serve to explain the invention and not to limit the invention. In the drawings:
FIG. 1 shows a flow diagram of a method of detecting a TMB in a preferred embodiment according to the present invention;
FIG. 2 shows a detailed flow diagram of a method of detecting a TMB in a preferred embodiment according to the present invention; and
FIG. 3 shows a schematic structural diagram of an apparatus for detecting a TMB in a preferred embodiment according to the present invention.
Detailed Description
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present invention will be described in detail with reference to examples.
Reference sequence (Refseq): the standard reference genomic sequence of a species.
Fusion gene: a new gene formed when all or part of the sequences of two genes are fused to each other. It may result from a chromosomal translocation, an interstitial deletion, or a chromosomal inversion.
Tumor mutation burden (TMB): the total number of somatic gene coding errors, base substitutions, and gene insertion or deletion errors detected per million bases.
Germline mutation: a germ cell mutation, i.e., a mutation derived from germ cells such as sperm or ovum.
Reads: genomic or transcriptome sequence fragments.
Synonymous mutations: substitution mutations that do not alter the amino acid sequence of the peptide chain product.
Non-synonymous mutations: gene mutations that change the amino acid sequence of a polypeptide product or the base sequence of a functional RNA.
Frameshift mutation: a mutation in which one or several base pairs (not 3 or a multiple of 3) are inserted or lost at a site in a DNA fragment, shifting the reading frame of the coding sequence that follows the site.
Non-frameshift mutations: a mutation in which one or several base pairs (3 or a multiple of 3) are inserted or lost at a site in a DNA fragment without shifting the reading frame of the coding sequence that follows the site.
PE sequencing: paired-end sequencing, a sequencing method.
read 1/2: in PE sequencing data, read1 is the nucleotide sequence obtained in the first round of sequencing, and read2 is the nucleotide sequence obtained in the second round.
bwa: alignment software used to locate reads in Refseq, ultimately producing a bam format file.
Adapter sequence: linker sequences flanking the DNA fragment in sequencing.
flag: a value in the bam format file describing information such as the alignment mode and direction of a sequence.
cigar: a compact expression of alignment information, which represents the alignment result relative to the reference sequence using numbers plus letters.
duplicate: a duplicate sequence, i.e., a sequence produced by PCR amplification.
qname: the identifier of the aligned fragment (template).
Oxidative damage of DNA: of the four bases A, T, G and C, the C8 position of G readily binds oxygen, so that the G base becomes 8-oxo-G; the resulting 8-oxo-G then mispairs with base A, leading to the detection of a false positive G-to-T mutation.
COSMIC: the Catalogue Of Somatic Mutations In Cancer, which draws on the scientific literature and on large-scale experimental screening data from the Cancer Genome Project at the Sanger Institute. The database is intended to collect and display information on cancer somatic mutations.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
As mentioned in the background, TMB detection methods in the prior art suffer from inaccurate detection. To improve on this, the inventors analyzed and studied the existing TMB detection methods and found that they cannot completely filter out leukocyte mutations and systematic background errors, and that screening by a mutation-frequency threshold also filters out real mutations whose frequency falls below the threshold, so that the calculated TMB value deviates to some extent. On this basis, the inventors propose the improved scheme of the present application.
Example 1
Embodiments of a method of detecting a TMB are provided.
FIG. 1 is a flow chart of an optional method of detecting TMB according to an embodiment of the present invention. As shown in FIG. 1, the method includes:
step S101, removing germline mutation sites in sequencing data of a sample to be detected by using sequencing data of paired white blood cells to obtain a candidate somatic mutation site set;
step S102, filtering false positive somatic mutation sites in the candidate somatic mutation site set to obtain a somatic mutation site set to be detected, wherein the false positive somatic mutation sites comprise at least one of the following: sites of mutations due to oxidative damage, sites of mutations due to background noise;
and step S103, dividing the number of load mutations in the somatic mutation site set to be detected by the total length of the sequencing data within the exon regions to obtain the TMB.
According to the method for detecting TMB, germline mutations carried by the sample are first removed using the paired white blood cells, which greatly reduces the influence of germline mutations on the TMB value; second, false positive sites caused by DNA oxidative damage introduced during steps such as library construction and DNA fragmentation are removed; and/or the influence on the TMB value of false positive somatic mutations caused by low-frequency background noise is removed using a background-noise frequency distribution database built from healthy people. In other words, false positive somatic mutations are removed by making full use of paired white blood cells, the background-noise mutation-frequency distribution database, and oxidative-damage filtering, improving the accuracy and stability of the TMB value.
Specifically, in step S101, tissue and leukocyte samples of the subject to be tested (e.g., a patient) are used. Based on the off-machine sequencing data obtained by PE sequencing and the human genome reference sequence, bwa software is used to locate the sequencing reads in the genome, producing a bam format file. After duplicates are marked and base quality values are recalibrated, detection is performed using the paired detection mode of variant-calling software (e.g., mutect2), and the somatic mutation result of the subject is obtained. This result serves as the candidate somatic mutation site set, which is further screened in order to count the number of load mutations.
Fig. 2 shows a detailed flow diagram of a method of detecting a TMB in a preferred embodiment according to the present invention. The following is a detailed description:
step S110, removing the germline mutation sites in the sequencing data of the sample to be detected by using the sequencing data of the paired white blood cells to obtain a candidate somatic mutation site set.
The false positive somatic mutation sites include mutation sites resulting from oxidative damage, and prior to filtering the false positive somatic mutation sites in the set of candidate somatic mutation sites, the method further comprises:
step S210, judging whether the somatic mutation sites in the candidate somatic mutation site set are mutation sites caused by oxidative damage. Any method capable of determining whether a site belongs to a mutant site caused by oxidative damage is suitable for the present application.
In an alternative embodiment, as shown in FIG. 2, determining whether a somatic mutation site in the set of candidate somatic mutation sites is a mutation site caused by oxidative damage comprises:
step S211, searching reads supporting candidate somatic mutation sites, and judging whether the reads are positioned in a positive strand or a negative strand;
step S212, counting the ratio of the number of reads of the positive strand and the number of reads of the negative strand of the candidate somatic mutation site;
step S213, judging whether the ratio is larger than a first threshold or smaller than a second threshold, if so, determining the candidate somatic mutation site as a mutation site caused by oxidative damage; preferably, the first threshold value is equal to or greater than 2, and the second threshold value is equal to or less than 0.5.
Specifically, for each mutation site in the candidate somatic mutation site set, the reads supporting the mutation are searched for in the bam file according to the mutation information (chromosome, position, and mutation), and each read is classified as positive-strand or negative-strand according to its flag. This gives the ratio of the number of positive-strand reads to the number of negative-strand reads among the reads supporting the mutation. If the ratio is greater than 2 or less than 0.5, the somatic mutation is judged to be a false positive somatic mutation caused by DNA oxidative damage; if the ratio is 0.5 or more and 2 or less, the somatic mutation is regarded as a true positive somatic mutation. The thresholds 0.5 and 2 are derived from prior literature.
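The strand-ratio check described above can be sketched roughly as follows. This is an illustrative sketch only, not the patented implementation: the function names are hypothetical, the input is assumed to be the list of SAM/BAM flag values of the reads supporting one candidate mutation, and the handling of a site with no negative-strand reads (treated as exceeding the first threshold) is an assumption, since the text does not address a zero denominator.

```python
from typing import Iterable

# SAM/BAM flag bit that is set when a read is aligned to the reverse (negative) strand.
REVERSE_STRAND_BIT = 0x10


def is_negative_strand(flag: int) -> bool:
    """Return True if the alignment flag marks the read as negative-strand."""
    return bool(flag & REVERSE_STRAND_BIT)


def is_oxidative_damage_site(flags: Iterable[int],
                             first_threshold: float = 2.0,
                             second_threshold: float = 0.5) -> bool:
    """Classify one candidate site from the flags of its mutation-supporting reads.

    Returns True when the positive/negative read-count ratio is greater than
    first_threshold or smaller than second_threshold.
    """
    flags = list(flags)
    if not flags:
        raise ValueError("at least one supporting read is required")
    negative = sum(1 for f in flags if is_negative_strand(f))
    positive = len(flags) - negative
    if negative == 0:
        # Assumption: an undefined (infinite) ratio counts as exceeding the first threshold.
        return True
    ratio = positive / negative
    return ratio > first_threshold or ratio < second_threshold
```

For example, flags 99 and 147 are a typical forward/reverse read pair, so a site supported by three reads with flag 99 and one with flag 147 has a ratio of 3 and would be flagged, while a ratio of exactly 0.5 or 2 is kept.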
To further improve the accuracy of the detection, in an alternative embodiment, as shown in fig. 2, the set of false positive somatic mutation sites includes mutation sites caused by background noise, and the method further comprises, before filtering the false positive somatic mutation sites in the set of candidate somatic mutation sites:
step S220, judging whether the somatic mutation sites in the candidate somatic mutation site set are mutation sites caused by background noise. Any method capable of determining whether a mutation site is a mutation caused by background noise is suitable for use in the present application.
In order to more accurately detect and determine whether a mutation site is actually caused by background noise, as shown in fig. 2, in an alternative embodiment, determining whether a somatic mutation site in the candidate set of somatic mutation sites is caused by background noise comprises:
step S221, removing germ line mutation sites in sequencing data of healthy people by utilizing sequencing data of white blood cells to obtain a somatic mutation site set of the healthy people;
step S222, establishing a Weibull distribution model of background noise mutation frequencies of different mutation types of each detection site by using a somatic mutation site set of healthy people;
step S223, calculating the mutation frequency of each candidate somatic mutation site in the candidate somatic mutation site set of the sample to be detected, and calculating the P value of the mutation frequency of each candidate somatic mutation site in a Weibull distribution model;
step S224, judging whether the P value is larger than or equal to a third threshold value, if so, taking the candidate somatic mutation site as a mutation site caused by background noise; preferably, the third threshold value is equal to or greater than 0.05.
In the above preferred embodiment, a background-noise mutation-frequency distribution model, namely a Weibull distribution model, is constructed from healthy population data. With this model, only the variant information in the candidate somatic mutation site set of the subject to be tested needs to be fed into the model to calculate, for each mutation site, the probability of its mutation frequency under the Weibull distribution model for that site. If the probability is small, the site does not fit the model, i.e., it is not background noise and is a true positive somatic mutation; conversely, if the probability reaches or exceeds the threshold, the site fits the model, i.e., it is background noise, and should be removed as a false positive somatic mutation.
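A minimal sketch of the P-value test against a per-site Weibull model is given below. Several things here are assumptions rather than details from the text: the P value is taken to be the upper-tail probability of observing a frequency at least as large as the candidate's under the fitted Weibull distribution, the shape and scale parameters are assumed to have been fitted beforehand from the healthy-population data for that site and mutation type, and all function and parameter names are hypothetical.

```python
import math


def weibull_upper_tail_p(frequency: float, shape: float, scale: float) -> float:
    """P(X >= frequency) for X ~ Weibull(shape, scale), i.e. exp(-(x/scale)**shape)."""
    if shape <= 0 or scale <= 0:
        raise ValueError("shape and scale must be positive")
    if frequency <= 0:
        return 1.0
    return math.exp(-((frequency / scale) ** shape))


def is_background_noise(frequency: float, shape: float, scale: float,
                        third_threshold: float = 0.05) -> bool:
    """Flag the site as background noise when its P value reaches the threshold."""
    return weibull_upper_tail_p(frequency, shape, scale) >= third_threshold
```

Under these assumptions, with shape 1.0 and scale 0.001 a candidate at frequency 0.001 has P = exp(-1), about 0.37, and is treated as noise, whereas a candidate at frequency 0.01 has P = exp(-10) and is kept as a somatic mutation.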
Then, step S310 is executed to filter the false positive somatic mutation sites in the candidate somatic mutation site set to obtain a set of somatic mutation sites to be tested, where the false positive somatic mutation sites include at least one of the following: sites of mutations due to oxidative damage, sites of mutations due to background noise.
After the mutation sites with false positives removed are processed, step S410 is executed to divide the number of load mutations in the set of somatic mutation sites to be detected by all the lengths of the sequencing data in the exon regions, so as to obtain TMB.
In an alternative embodiment, the number of load mutations in the set of somatic mutation sites is divided by the length of sequencing data in the exon region, and the method further comprises: and counting the number of load mutations in the somatic mutation site set.
In an alternative embodiment, counting the number of load mutations in the set of somatic mutation sites comprises: the total number of all mutation types in the somatic mutation site set was counted as follows: synonymous mutation, non-synonymous mutation, frameshift mutation and non-frameshift mutation; removing at least one of the following sites from the total number to obtain the number of load mutations: a mutation site with the thousand-person mutation frequency being more than 0.01 and a mutation site marked as COSMIC.
In the above optional embodiment, when the number of load mutations is calculated, mutation sites with a 1000 Genomes population frequency greater than 0.01, that is, population polymorphic sites, are excluded, and mutation sites recorded in COSMIC are also excluded. Removing the sites annotated in COSMIC reduces the fluctuation of the TMB value caused by the sample type and improves the stability of the TMB value. COSMIC currently includes 22651741 mutation sites.
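The counting rule can be sketched as follows. This is only an illustration: the `Variant` record, its field names, and the example sites are hypothetical, the population frequency is read here as a 1000 Genomes Project frequency, and the sketch applies both exclusions although the text requires only at least one of them.

```python
from dataclasses import dataclass

# Mutation types that enter the total count, as listed in the text.
COUNTED_TYPES = {"synonymous", "nonsynonymous", "frameshift", "nonframeshift"}


@dataclass(frozen=True)
class Variant:
    site: str               # hypothetical site label
    mutation_type: str
    population_freq: float  # population frequency, 0.0 if not reported
    in_cosmic: bool         # True if the site is annotated in COSMIC


def count_load_mutations(variants, max_population_freq=0.01):
    """Count variants of the listed types, minus polymorphic and COSMIC sites."""
    counted = [v for v in variants if v.mutation_type in COUNTED_TYPES]
    kept = [
        v for v in counted
        if v.population_freq <= max_population_freq and not v.in_cosmic
    ]
    return len(kept)


variants = [
    Variant("site_1", "nonsynonymous", 0.0, False),    # kept
    Variant("site_2", "synonymous", 0.20, False),      # population polymorphism
    Variant("site_3", "frameshift", 0.0, True),        # annotated in COSMIC
    Variant("site_4", "nonframeshift", 0.001, False),  # kept
    Variant("site_5", "intronic", 0.0, False),         # type not counted
]
print(count_load_mutations(variants))
```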
In an alternative embodiment, before the number of load mutations in the set of somatic mutation sites is divided by the total length of the exon regions covered by the sequencing data, the method further comprises calculating that total length.
Specifically, taking capture chip sequencing data as an example, the size of the coding region is calculated as follows: first, for each capture region of the chip, the gene, position, transcript number, exon or intron region number, and length in the reference genome are obtained; then the total length of the chip capture regions that fall within exon regions is summed, in units of Mb.
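As a rough illustration of the length calculation and the final division, the sketch below sums the exonic capture intervals and converts the result to Mb. It assumes the intervals are already restricted to exon regions and are given as half-open `(chromosome, start, end)` tuples; merging overlapping intervals so that no base is counted twice is a design choice of this sketch, not something the text specifies, and all names and numbers are hypothetical.

```python
def merged_length_bp(intervals):
    """Total bases covered by half-open (chrom, start, end) intervals, overlaps counted once."""
    total = 0
    current = None  # interval currently being extended
    for chrom, start, end in sorted(intervals):
        if current and chrom == current[0] and start <= current[2]:
            current = (chrom, current[1], max(current[2], end))
        else:
            if current:
                total += current[2] - current[1]
            current = (chrom, start, end)
    if current:
        total += current[2] - current[1]
    return total


def tmb(load_mutation_count, exon_length_bp):
    """Load mutations per Mb of exonic sequence."""
    if exon_length_bp <= 0:
        raise ValueError("exon length must be positive")
    return load_mutation_count / (exon_length_bp / 1_000_000)


# Hypothetical exonic capture intervals.
exon_intervals = [("chr1", 100, 300), ("chr1", 250, 400), ("chr2", 0, 500)]
print(merged_length_bp(exon_intervals))  # length in bp
print(tmb(15, 1_500_000))                # mutations per Mb for a 1.5 Mb panel
```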
Example 2
Taking a clinical sample as an example, a tissue sample of the patient and a corresponding plasma sample were obtained, DNA was extracted, and libraries were constructed. Using an Illumina sequencing platform, the raw sequencing data of the tissue, sampA.R1.fastq.gz and sampA.R2.fastq.gz, and the leukocyte data of the sample, sampB.R1.fastq.gz and sampB.R2.fastq.gz, were obtained. After BWA alignment, duplicate marking, and base quality score recalibration, sampA.final.bam and sampB.final.bam were obtained, respectively. Detection was performed in the paired mode of Mutect2 using sampA.final.bam and sampB.final.bam to obtain sampA.result.vcf. Detection was also performed in the unpaired mode of Mutect2 using sampA.final.bam alone to obtain sampA.resultWithoutsampB.vcf. Subsequent filtering and calculation were carried out on sampA.result.vcf according to the filtering module and the calculation module, and the TMB value finally obtained was 14.7. By contrast, following the prior art, in which paired-mode detection is not used in the detection module, the filtering operation of the filtering module is omitted, and mutations annotated in COSMIC are not filtered in the calculation module, the result finally obtained was 25.8. The TMB result detected by the existing method is therefore inflated, because the mutations counted in its TMB value include germline mutations and common background system noise; compared with the existing method, the method disclosed in the present application has higher TMB detection accuracy.
Therefore, compared with the existing method, the method of the embodiment has the following advantages:
first, the paired white blood cell data are fully utilized to completely remove germline mutations;
second, background noise errors are removed using a background data set from healthy people, and oxidative damage errors are removed by an oxidative damage error filtering module, which improves the accuracy of the TMB value;
and third, the algorithm removes mutation sites annotated in COSMIC, which reduces the fluctuation of the TMB value caused by the sample type and improves the stability of the TMB value.
Example 3
An embodiment of an apparatus for detecting TMB is also provided.
FIG. 3 is a schematic diagram of an alternative apparatus for detecting TMB according to an embodiment of the present invention. As shown in FIG. 3, the apparatus includes a detection module 10, a filtering module 20, and a TMB calculation module 30. The detection module 10 is configured to remove germline mutation sites in the sequencing data of a sample to be detected by using sequencing data of paired white blood cells, to obtain a candidate somatic mutation site set. The filtering module 20 is configured to filter false positive somatic mutation sites in the candidate somatic mutation site set to obtain a set of somatic mutation sites to be detected, where the false positive somatic mutation sites include at least one of the following: sites of mutations due to oxidative damage, sites of mutations due to background noise. The TMB calculation module 30 is configured to divide the number of load mutations in the set of somatic mutation sites to be detected by the total length of the exon regions covered by the sequencing data to obtain the TMB.
In the apparatus for detecting TMB, the detection module 10 and the filtering module 20 make full use of the paired white blood cells, the background noise mutation frequency distribution database, and oxidative damage filtering to remove false positive somatic mutations. First, the detection module 10 uses the paired white blood cells to remove the germline mutations carried by the sample, which greatly reduces the influence of germline mutations on the TMB value. Second, the filtering module 20 removes false positive sites caused by DNA oxidative damage introduced in steps such as library construction and DNA fragmentation, and/or removes the influence on the TMB value of false positive somatic mutations caused by low-frequency background noise, using the background noise frequency distribution database of healthy people. Finally, the TMB calculation module 30 performs the TMB calculation using the number of load mutations obtained after the false positive somatic mutations are removed, which improves the accuracy and stability of the TMB value.
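How the three modules chain together can be sketched in a few lines. This is a deliberately simplified illustration: germline removal is modeled as a plain set difference against variants seen in the paired white blood cells, the false positive check is passed in as a callable, and every site in the filtered set is counted as a load mutation. The site labels, function names, and the 1.0 Mb panel length are hypothetical.

```python
def detection_module(tumor_sites, leukocyte_sites):
    """Keep tumor variants not seen in the paired white blood cells (simplified)."""
    return tumor_sites - leukocyte_sites


def filtering_module(candidate_sites, is_false_positive):
    """Drop sites that the supplied check flags as oxidative damage or noise."""
    return {site for site in candidate_sites if not is_false_positive(site)}


def tmb_calculation_module(load_mutation_count, exon_length_mb):
    """Load mutations per Mb of exonic sequence."""
    return load_mutation_count / exon_length_mb


# Hypothetical inputs.
tumor = {"chr1:100:C>T", "chr2:200:G>A", "chr3:300:G>T", "chr4:400:A>G"}
leukocyte = {"chr1:100:C>T"}   # germline variant shared with white blood cells
flagged = {"chr3:300:G>T"}     # assumed to fail the damage or noise checks

candidates = detection_module(tumor, leukocyte)
to_test = filtering_module(candidates, lambda site: site in flagged)
result = tmb_calculation_module(len(to_test), 1.0)
print(sorted(to_test), result)
```

Passing the false positive check as a callable keeps the filtering step independent of which checks (oxidative damage, background noise, or both) are enabled.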
In an optional embodiment, the apparatus further comprises an oxidative damage determination module for determining whether the somatic mutation sites in the candidate set of somatic mutation sites are mutation sites caused by oxidative damage. Any oxidative damage judging module capable of judging whether a certain locus belongs to a mutant locus caused by oxidative damage is suitable for the application.
In an alternative embodiment, the oxidative damage determination module includes a searching module, a first statistical module, and a ratio judging module. The searching module is used for searching for reads supporting a candidate somatic mutation site and judging whether each read is located on the positive strand or the negative strand; the first statistical module is used for counting the ratio of the number of positive-strand reads to the number of negative-strand reads that support the candidate somatic mutation site; and the ratio judging module is used for judging whether the ratio is greater than a first threshold or smaller than a second threshold, and if so, the candidate somatic mutation site is a mutation site caused by oxidative damage. Preferably, the first threshold is equal to or greater than 2, and the second threshold is equal to or less than 0.5.
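The ratio test performed by these modules can be sketched as below, assuming the positive-strand and negative-strand supporting read counts have already been tallied from the alignment. The default thresholds of 2 and 0.5 are the boundary values given in the text; treating a site with no negative-strand support as an infinite ratio, and a site with no supporting reads as not flagged, are assumptions of this sketch.

```python
def strand_ratio(pos_reads, neg_reads):
    """Ratio of positive-strand to negative-strand supporting reads."""
    if neg_reads == 0:
        # Assumption: support on one strand only counts as an extreme ratio.
        return float("inf") if pos_reads > 0 else None
    return pos_reads / neg_reads


def is_oxidative_damage(pos_reads, neg_reads,
                        first_threshold=2.0, second_threshold=0.5):
    """True when the strand ratio is > first_threshold or < second_threshold."""
    ratio = strand_ratio(pos_reads, neg_reads)
    if ratio is None:
        return False  # no supporting reads, nothing to judge
    return ratio > first_threshold or ratio < second_threshold


# Hypothetical (positive, negative) supporting read counts.
for pos, neg in [(12, 2), (5, 6), (1, 9), (7, 0)]:
    print(pos, neg, is_oxidative_damage(pos, neg))
```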
In an alternative embodiment, the apparatus further comprises a background noise determination module for determining whether the somatic mutation sites in the candidate set of somatic mutation sites are mutation sites caused by background noise.
To determine more accurately whether a mutation site is in fact caused by background noise, in an alternative embodiment the background noise determination module includes a healthy site set acquisition module, a model establishment module, a P value calculation module, and a noise judgment module. The healthy site set acquisition module is used for removing germline mutation sites in sequencing data of healthy people by using sequencing data of white blood cells, to obtain a somatic mutation site set of the healthy people. The model establishment module is used for establishing, from the somatic mutation site set of the healthy people, a Weibull distribution model of the background noise mutation frequencies of different mutation types at each detection site. The P value calculation module is used for calculating the mutation frequency of each candidate somatic mutation site in the candidate somatic mutation site set of the sample to be detected, and calculating the P value of that mutation frequency in the Weibull distribution model. The noise judgment module is used for judging whether the P value is greater than or equal to a third threshold; if so, the candidate somatic mutation site is a mutation site caused by background noise. Preferably, the third threshold is equal to or greater than 0.05.
In the above preferred embodiment, a background noise mutation frequency distribution model, namely a Weibull distribution model, is constructed from healthy population data. With this model, only the variation information in the candidate somatic mutation site set of the sample to be detected needs to be fed into the model to calculate, for each mutation site, the probability of its mutation frequency under the Weibull distribution model for that site. If the probability is small, the mutation frequency does not fit the model, that is, the mutation is not background noise and is a true positive somatic mutation; conversely, if the probability reaches or exceeds the threshold, the mutation frequency fits the model, that is, the mutation belongs to background noise and should be removed as a false positive somatic mutation.
In an optional embodiment, the apparatus further comprises a load mutation counting module for counting the number of load mutations in the somatic mutation site set.
In an alternative embodiment, the load mutation counting module comprises a statistic unit and a removal unit. The statistic unit is used for counting the total number of mutations of all of the following types in the somatic mutation site set: synonymous mutations, non-synonymous mutations, frameshift mutations, and non-frameshift mutations. The removal unit is used for removing at least one of the following kinds of site from the total to obtain the number of load mutations: mutation sites with a 1000 Genomes population frequency greater than 0.01, and mutation sites annotated in COSMIC.
In an optional embodiment, the apparatus further comprises a length calculation module for calculating the total length of the exon regions covered by the sequencing data.
The above-mentioned apparatus may comprise a processor and a memory, and the above-mentioned units may be stored in the memory as program units, and the processor executes the above-mentioned program units stored in the memory to implement the corresponding functions.
The memory may include, among computer readable media, volatile memory such as random access memory (RAM) and/or nonvolatile memory such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
The order in which the embodiments of the present application are described above does not indicate their relative merits.
In the above embodiments of the present application, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments. In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways.
The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and other media capable of storing program code.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (18)

1. A method of detecting a TMB, the method comprising:
removing germline mutation sites in sequencing data of a sample to be detected by using sequencing data of paired white blood cells to obtain a candidate somatic mutation site set;
filtering false positive somatic mutation sites in the candidate somatic mutation site set to obtain a somatic mutation site set to be detected, wherein the false positive somatic mutation sites comprise at least one of the following sites: sites of mutations due to oxidative damage, sites of mutations due to background noise;
dividing the number of load mutations in the somatic mutation site set to be detected by the total length of the exon regions covered by the sequencing data to obtain the TMB;
wherein the false positive somatic mutation sites comprise mutation sites caused by oxidative damage, and before filtering the false positive somatic mutation sites in the candidate set of somatic mutation sites, the method further comprises judging whether the somatic mutation sites in the candidate set of somatic mutation sites are mutation sites caused by oxidative damage;
judging whether the somatic mutation sites in the candidate somatic mutation site set are mutation sites caused by oxidative damage or not comprises the following steps:
searching reads supporting the candidate somatic mutation sites, and judging whether the reads are positioned in a positive strand or a negative strand;
counting the ratio of the number of positive-strand reads to the number of negative-strand reads that support the candidate somatic mutation site, and judging whether the ratio is greater than a first threshold or smaller than a second threshold, wherein if so, the candidate somatic mutation site is a mutation site caused by oxidative damage.
2. The method of claim 1, wherein the first threshold is 2 or more and the second threshold is 0.5 or less.
3. The method of claim 1 or 2, wherein the false positive somatic mutation sites comprise background noise-induced mutation sites, and wherein prior to filtering the false positive somatic mutation sites in the set of candidate somatic mutation sites, the method further comprises determining whether the somatic mutation sites in the set of candidate somatic mutation sites are background noise-induced mutation sites.
4. The method of claim 3, wherein determining whether a somatic mutation site in the set of candidate somatic mutation sites is a background noise-induced mutation site comprises:
removing germline mutation sites in sequencing data of healthy people by using the sequencing data of the white blood cells to obtain a somatic mutation site set of the healthy people;
establishing a Weibull distribution model of background noise mutation frequencies of different mutation types of each detection site by using the somatic mutation site set of the healthy population;
calculating the mutation frequency of each candidate somatic mutation site in the candidate somatic mutation site set of the sample to be detected, and calculating the P value of the mutation frequency of each candidate somatic mutation site in the Weibull distribution model;
and judging whether the P value is larger than or equal to a third threshold value, if so, the candidate somatic mutation site is a mutation site caused by background noise.
5. The method of claim 4, wherein the third threshold is equal to or greater than 0.05.
6. The method of claim 1 or 2, wherein, before the number of load mutations in the set of somatic mutation sites to be detected is divided by the total length of the exon regions covered by the sequencing data, the method further comprises: counting the number of load mutations in the somatic mutation site set to be detected.
7. The method of claim 6, wherein counting the number of load mutations in the set of somatic mutation sites to be detected comprises:
counting the total number of all the following mutation types in the somatic mutation site set to be detected: synonymous mutation, non-synonymous mutation, frameshift mutation and non-frameshift mutation;
removing at least one of the following kinds of site from the total number to obtain the number of load mutations: mutation sites with a 1000 Genomes population frequency greater than 0.01, and mutation sites annotated in COSMIC.
8. The method of claim 1, wherein, before the number of load mutations in the set of somatic mutation sites to be detected is divided by the total length of the exon regions covered by the sequencing data, the method further comprises: calculating the total length of the exon regions covered by the sequencing data.
9. An apparatus for detecting a TMB, the apparatus comprising:
the detection module is used for removing the germline mutation sites in the sequencing data of the sample to be detected by using the paired sequencing data of the white blood cells to obtain a candidate somatic mutation site set;
a filtering module, configured to filter false positive somatic mutation sites in the candidate somatic mutation site set to obtain a set of somatic mutation sites to be detected, where the false positive somatic mutation sites include at least one of: sites of mutations due to oxidative damage, sites of mutations due to background noise;
the TMB calculation module is used for dividing the number of load mutations in the somatic mutation site set to be detected by the total length of the exon regions covered by the sequencing data to obtain the TMB;
the device also comprises an oxidative damage judging module which is used for judging whether the somatic mutation sites in the candidate somatic mutation site set are mutation sites caused by oxidative damage or not;
the oxidation damage judgment module includes:
the searching module is used for searching reads supporting the candidate somatic mutation sites and judging whether the reads are positioned in a positive strand or a negative strand;
a first statistical module for counting a ratio of the number of reads of the positive strand to the number of reads of the negative strand that support the candidate somatic mutation site,
and the ratio judging module is used for judging whether the ratio is greater than a first threshold or smaller than a second threshold, and if so, the candidate somatic mutation site is a mutation site caused by oxidative damage.
10. The apparatus of claim 9, wherein the first threshold is equal to or greater than 2 and the second threshold is equal to or less than 0.5.
11. The apparatus of claim 9 or 10, further comprising a background noise determination module for determining whether a somatic mutation site in the set of candidate somatic mutation sites is a mutation site caused by background noise.
12. The apparatus of claim 11, wherein the background noise determination module comprises:
the healthy site set acquisition module is used for removing germline mutation sites in sequencing data of healthy people by utilizing the sequencing data of the white blood cells to obtain a somatic mutation site set of the healthy people;
the model establishing module is used for establishing a Weibull distribution model of background noise mutation frequencies of different mutation types of each detection site by utilizing the somatic mutation site set of the healthy population;
a P value calculation module, configured to calculate a mutation frequency of each candidate somatic mutation site in the candidate somatic mutation site set of the sample to be tested, and calculate a P value of the mutation frequency of each candidate somatic mutation site in the Weibull distribution model;
and the noise judgment module is used for judging whether the P value is greater than or equal to a third threshold value, and if so, the candidate somatic mutation site is a mutation site caused by background noise.
13. The apparatus of claim 12, wherein the third threshold is equal to or greater than 0.05.
14. The apparatus of claim 9 or 10, further comprising: a load mutation counting module for counting the number of load mutations in the somatic mutation site set to be detected.
15. The apparatus of claim 14, wherein the load mutation counting module comprises:
a statistic unit for counting the total number of all the following mutation types in the set of somatic mutation sites to be detected: synonymous mutation, non-synonymous mutation, frameshift mutation and non-frameshift mutation;
a removing unit, configured to remove at least one of the following kinds of site from the total number to obtain the number of load mutations: mutation sites with a 1000 Genomes population frequency greater than 0.01, and mutation sites annotated in COSMIC.
16. The apparatus of claim 9, further comprising: a length calculation module for calculating the total length of the exon regions covered by the sequencing data.
17. A storage medium having stored thereon a computer-executable program, wherein the program is configured to, when executed, perform a method of detecting a TMB according to any one of claims 1 to 8.
18. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, and wherein the processor is configured to execute the computer program to perform the method of detecting a TMB of any of claims 1 to 8.
CN201910995474.7A 2019-10-18 2019-10-18 Method and device for detecting TMB Active CN110689930B (en)


Publications (2)

Publication Number Publication Date
CN110689930A CN110689930A (en) 2020-01-14
CN110689930B CN110689930B (en) 2021-07-30

Family

ID=69113235

Country Status (1)

Country Link
CN (1) CN110689930B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant