CN116356001A - Dual background noise mutation removal method based on blood circulation tumor DNA - Google Patents


Info

Publication number
CN116356001A
Authority
CN
China
Prior art keywords
mutation
background
filtering
sequencing
noise
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310080082.4A
Other languages
Chinese (zh)
Other versions
CN116356001B (en)
Inventor
叶雷
陈子清
于跃
李俊
邓望龙
许青
李诗濛
任用
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Xiansheng Medical Diagnosis Co ltd
Original Assignee
Jiangsu Xiansheng Medical Diagnosis Co ltd
Application filed by Jiangsu Xiansheng Medical Diagnosis Co ltd
Priority to CN202310080082.4A
Publication of CN116356001A
Application granted
Publication of CN116356001B
Legal status: Active

Classifications

    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869 Methods for sequencing
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/50 Mutagenesis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B5/00 ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks


Abstract

The application belongs to the technical field of biological analysis, and particularly relates to a double background noise mutation removal method based on blood circulation tumor DNA and application thereof.

Description

Dual background noise mutation removal method based on blood circulation tumor DNA
Technical Field
The application belongs to the technical field of bioinformatics, and particularly relates to a double background noise mutation removal method based on blood circulation tumor DNA.
Background
Human circulating cell-free DNA (cfDNA) is fragmented DNA found in the non-cellular fraction of blood. It consists mainly of small DNA fragments released by apoptotic or necrotic cells or by secretion, typically about 150-200 base pairs long. Circulating tumor DNA (ctDNA) is the portion of cfDNA that originates from apoptotic or necrotic tumor cells or from tumor cell secretion. Tumor tissue testing is limited by difficult clinical sampling and by spatial, temporal, structural and functional heterogeneity. By contrast, ctDNA carries molecular genetic features consistent with the primary tumor tissue, and its detection is non-invasive and can be repeated at different stages of treatment, so it is widely used as a tumor marker in the clinical diagnosis of many tumors. ctDNA detection therefore plays an important role in early tumor screening and diagnosis, guidance of targeted therapy, prognosis, and dynamic monitoring of treatment.
Micro/molecular residual disease (MRD) in solid tumors refers to tumor cells or micro/molecular lesions that remain in the patient after treatment and cannot be found by conventional imaging or laboratory methods, but whose cancer-derived molecular abnormalities can be found by liquid biopsy; it represents an occult stage of tumor progression. The residual cancer cells may be few and may not cause any signs or symptoms for a time, but they can lead to later tumor progression, recurrence or metastasis. ctDNA is an important molecular marker for MRD detection. Because MRD detection must pick up very low levels of ctDNA signal in blood, accurately determining whether low-frequency ctDNA mutations are real is one of its key challenges. To raise sensitivity for very low ctDNA signal and avoid false negatives, current methods generally widen the ctDNA detection range, detect more mutation signals, and combine this with ultra-high-depth next-generation sequencing. This approach has three drawbacks. First, testing many variants reduces specificity. Second, the high noise level that comes with ultra-high-depth sequencing is a detection pitfall. Third, owing to aging and to selective pressure from external factors (smoking, chemotherapy, etc.), clonal hematopoiesis mutations (CH mutations) can appear in blood and interfere with accurate ctDNA detection.
In summary, effective methods for reducing background noise and the interference of clonal hematopoietic mutations are lacking, so the accuracy of low-frequency ctDNA mutation detection remains unsatisfactory.
In view of this, the present application is presented.
Disclosure of Invention
To solve the above technical problems, the application establishes, through bioinformatics analysis, a double background noise mutation removal method based on blood circulation tumor DNA. The method effectively reduces false positives in blood ctDNA mutation detection and ensures accurate detection of very low ctDNA signals in MRD and of low-frequency ctDNA mutations in liquid biopsy.
Therefore, a first object of the present application is to provide a method for constructing a blood circulation tumor DNA (ctDNA) background noise mutation joint filtering model and an application thereof.
A second object of the present application is to provide a method for removing double background noise mutation based on blood circulation tumor DNA (ctDNA) and an application thereof.
Specifically, the application proposes the following technical scheme:
The application first provides a method for constructing a blood circulation tumor DNA (ctDNA) background noise mutation joint filtering model, which comprises the following steps:
a. target capturing and sequencing of normal human blood cfDNA samples;
b. sequence quality control and reference genome alignment;
c. sequencing depth and base mutation frequency acquisition;
d. background mutation acquisition and germline mutation filtering;
e. construction of binomial distribution background mutation noise filtering model: counting and acquiring the accumulated sequencing depth of each real background mutation (SNV and/or INDEL) obtained in the step d in a normal cfDNA sample and the number of supporting sequences of the background mutation, so as to construct a binomial distribution background mutation noise filtering model;
f. background Context mutation feature regression model construction: and acquiring the sequencing depth of each background Context mutation feature in a normal cfDNA sample and the supporting sequence number of the background Context mutation feature, counting the background error rate accumulated under different supporting sequence numbers, and constructing a background Context mutation feature regression model.
Further, the term "normal human" as used herein refers to a person who is not a tumor patient;
further, step b specifically comprises: performing quality control on the NGS sequencing reads and aligning the quality-controlled reads to the human reference genome; preferably, duplicate reads are then removed;
further, step c specifically comprises: based on the human reference genome alignment data, obtaining the sequencing depth of each site in each normal human cfDNA sample and the frequency of mutation of the reference base at that site to each of the three non-reference bases;
further, step d specifically comprises: taking mutations with a mutation frequency below 0.2 in the normal human cfDNA samples as real background mutations, and filtering out the remaining sites as human germline mutations.
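The 0.2 cut-off of step d can be illustrated with a minimal Python sketch. It assumes that per-site variant allele frequencies have already been computed for one normal cfDNA sample; the function name and the tuple layout are hypothetical and are not part of the application.

```python
from typing import Iterable, List, Tuple

# (chrom, pos, ref, alt, variant allele frequency); this layout is an assumption
Site = Tuple[str, int, str, str, float]

VAF_CUTOFF = 0.2  # threshold stated in step d


def split_background_and_germline(sites: Iterable[Site]) -> Tuple[List[Site], List[Site]]:
    """Split non-reference observations from one normal cfDNA sample.

    Sites with 0 < VAF < 0.2 are kept as background mutations; sites with
    VAF >= 0.2 are treated as germline mutations and set aside.
    """
    background, germline = [], []
    for site in sites:
        vaf = site[4]
        if vaf <= 0.0:
            continue  # no non-reference support at this site
        (background if vaf < VAF_CUTOFF else germline).append(site)
    return background, germline
```

A site observed at exactly 0.2 falls on the germline side here, because the text specifies "smaller than 0.2" for background mutations.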
Further, in step e, the binomial distribution background mutation noise filtering model is:
$$P(X = n_m) = \binom{N_m}{n_m}\, p_m^{\,n_m}\,(1 - p_m)^{N_m - n_m}$$
where $P(X = n_m)$ is the probability that the number of supporting sequences at the background mutation site is $n_m$, $m$ is the background mutation site, $p_m$ is the cumulative background error rate of the background mutation site in normal human cfDNA samples, $N_m$ is the total sequencing depth of the background mutation site in normal human cfDNA samples, and $n_m$ is the total number of supporting sequences of the background mutation site in normal human cfDNA samples.
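A minimal Python sketch of step e follows. It accumulates depth and support per site over the normal samples and evaluates the binomial probability in log space, so that depths in the tens of thousands do not overflow. Estimating the site error rate as the ratio of cumulative support to cumulative depth is an assumption of this sketch, as are the function names and the record layout.

```python
from math import exp, lgamma, log, log1p
from typing import Dict, Iterable, Tuple

Key = Tuple[str, int, str, str]  # (chrom, pos, ref, alt); illustrative layout


def build_site_error_rates(
    observations: Iterable[Tuple[Key, int, int]]
) -> Dict[Key, Tuple[int, int, float]]:
    """Accumulate depth and support over normal samples for each site.

    Each observation is (site key, depth in one sample, supporting reads in
    that sample). Returns {site: (N_m, n_m, p_m)} with p_m = n_m / N_m,
    which is an assumed estimator of the cumulative background error rate.
    """
    totals: Dict[Key, list] = {}
    for key, depth, support in observations:
        acc = totals.setdefault(key, [0, 0])
        acc[0] += depth
        acc[1] += support
    return {k: (N, n, n / N) for k, (N, n) in totals.items() if N > 0}


def binom_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p), computed in log space."""
    if k < 0 or k > n:
        return 0.0
    if p <= 0.0:
        return 1.0 if k == 0 else 0.0
    if p >= 1.0:
        return 1.0 if k == n else 0.0
    log_pmf = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
               + k * log(p) + (n - k) * log1p(-p))
    return exp(log_pmf)
```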
Further, in step f, the Context mutation feature refers to: any single-base substitution (one of the 12 basic forms A>T, A>G, A>C, C>A, C>T, C>G, G>A, G>C, G>T, T>A, T>C, T>G) combined with the one reference genome base immediately upstream and the one immediately downstream of the mutated position, giving 12 × 4 × 4 = 192 features in total.
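The count of 192 can be checked by enumerating the features directly. The sketch below is illustrative; the label format `A(C>T)C` (upstream base, substitution in parentheses, downstream base) is a notation chosen for this example, and the function names are hypothetical.

```python
from itertools import product

BASES = "ACGT"


def context_features():
    """Enumerate all three-base substitution contexts, e.g. 'A(C>T)C'."""
    features = []
    for up, ref, down in product(BASES, repeat=3):
        for alt in BASES:
            if alt != ref:
                features.append(f"{up}({ref}>{alt}){down}")
    return features


def context_of(up: str, ref: str, alt: str, down: str) -> str:
    """Label one SNV with its upstream and downstream reference bases."""
    return f"{up}({ref}>{alt}){down}"
```

Grouping the labels by reference base gives 48 features for each of A, C, G and T, which matches the four groups of 48 features shown in FIGS. 2-5.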
The background Context mutation feature regression model is as follows:
Figure BDA0004067235490000032
where $P(X \ge k_m)$ is the probability that the number of supporting sequences of the Context mutation feature is greater than or equal to $k_m$, $m$ is the background Context mutation feature, $k_m$ is the number of supporting sequences of the background Context mutation feature, $a$ is a constant term, and $b$ is the regression coefficient.
The application also provides a method for removing double background noise mutation based on blood circulation tumor DNA (ctDNA), which comprises the following steps:
1) Tumor cfDNA targeted capture sequencing;
2) Mutation detection, namely obtaining all SNV and/or INDEL mutation results;
3) Background noise mutation joint filtering model construction: the construction is based on the method of any one of claims 1-4;
4) Background noise mutation filtering with the binomial distribution model;
5) Background noise mutation filtering with the background Context mutation feature regression model;
preferably, the method further comprises:
6) The clonal hematopoietic mutations are filtered.
Further, in step 1), the targeted capture sequencing comprises library construction with single-molecule tags (unique molecular identifiers, UMI) followed by targeted capture sequencing;
preferably, the method further comprises a sequencing quality control step and a deduplication and sequence correction step after sequencing;
more preferably, the sequencing quality control is: removing low-quality reads after high-throughput sequencing and aligning the remaining reads to the human reference genome to obtain an alignment result; the deduplication and sequence correction are: deduplicating the reference genome alignment data and correcting sequence consistency based on the dual-end fixed single-molecule tags (UMI), to obtain an alignment result corrected for background mutation noise;
further, the mutation detection in step 2) is as follows: obtaining SNV and/or INDEL mutation results based on the sequencing results; preferably, SNV and/or INDEL mutation results are obtained based on the sequence alignment after deduplication and sequence correction.
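The application does not specify how UMI-based deduplication and sequence consistency correction are carried out. As a simplified illustration only, the Python sketch below groups reads by alignment start and UMI and takes a per-base strict-majority consensus. The read layout, the `min_family_size` parameter and the masking of unresolved positions with `N` are all assumptions of this sketch, and it does not model dual-end (duplex) pairing.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

Read = Tuple[int, str, str]  # (alignment start, UMI, sequence); assumed layout


def umi_consensus(reads: Iterable[Read], min_family_size: int = 2) -> Dict[Tuple[int, str], str]:
    """Collapse reads sharing (start, UMI) into one consensus sequence."""
    families: Dict[Tuple[int, str], List[str]] = defaultdict(list)
    for start, umi, seq in reads:
        families[(start, umi)].append(seq)

    consensus = {}
    for key, seqs in families.items():
        if len(seqs) < min_family_size:
            continue  # families that are too small cannot correct errors
        length = min(len(s) for s in seqs)
        bases = []
        for i in range(length):
            base, count = Counter(s[i] for s in seqs).most_common(1)[0]
            # Require a strict majority; otherwise mask the position.
            bases.append(base if count * 2 > len(seqs) else "N")
        consensus[key] = "".join(bases)
    return consensus
```

A base seen in only one read of a three-read family is outvoted, which is how random sequencing errors are suppressed before mutation detection.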
Further, the background noise mutation filtering of the binomial distribution model in step 4) is specifically:
a. for each mutation (SNV and/or INDEL) detected in step 2), calculating the probability P(Bias) that the mutation site is a background noise mutation, based on the number of supporting sequences (k) of the mutation site, the sequencing depth (n) of the site, and the background error rate (p) from the binomial distribution background mutation noise filtering model constructed in step 3):
P(Bias) = Bin(k; n, p)
b. filtering background noise mutations based on the calculated probability P(Bias): if P(Bias) is greater than the decision threshold, the site is judged to be a background noise mutation and is filtered out, which constitutes the first filtering of background false-positive sites;
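The first filtering pass can be sketched in Python as below. The sketch reads Bin(k; n, p) as the binomial probability mass at k, which follows the model definition given earlier. The decision threshold `alpha` must be supplied by the caller, since the application gives no value. Letting sites with no trained error rate pass through unchanged is an assumption of this sketch, and all names are hypothetical.

```python
from math import exp, lgamma, log, log1p
from typing import Dict, Iterable, List, Tuple

Key = Tuple[str, int, str, str]  # (chrom, pos, ref, alt); illustrative layout
Call = Tuple[Key, int, int]      # (site key, supporting reads k, depth n)


def p_bias(k: int, n: int, p: float) -> float:
    """Bin(k; n, p): binomial probability of k supporting reads at depth n."""
    if k < 0 or k > n:
        return 0.0
    if p <= 0.0:
        return 1.0 if k == 0 else 0.0
    if p >= 1.0:
        return 1.0 if k == n else 0.0
    return exp(lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
               + k * log(p) + (n - k) * log1p(-p))


def first_pass_filter(calls: Iterable[Call],
                      site_error_rate: Dict[Key, float],
                      alpha: float) -> List[Call]:
    """Drop calls whose P(Bias) exceeds alpha; keep everything else."""
    kept = []
    for key, k, n in calls:
        p = site_error_rate.get(key)
        if p is not None and p_bias(k, n, p) > alpha:
            continue  # consistent with site-specific background noise
        kept.append((key, k, n))
    return kept
```

With a site error rate of 0.001 and a depth of 10000, a call with 10 supporting reads sits at the expected noise level and is removed, whereas a call with 100 supporting reads is far above it and is kept.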
further, the background noise mutation filtering of the background Context mutation feature regression model in step 5) specifically comprises:
a. for each single-base mutation (SNV) detected in step 2), obtaining the Context mutation feature of the mutation site;
b. calculating the probability P that the mutation is a background noise mutation, based on the number of supporting sequences (k) of the mutation at the site and the background Context mutation feature regression model constructed in step 3),
Figure BDA0004067235490000041
where a and b are respectively the constant term and the regression coefficient of the background Context mutation feature regression model in step 3);
c. filtering background noise mutations based on the probability P: if P is greater than the decision threshold, the site is judged to be a background noise mutation and is filtered out, which constitutes the second filtering of background false-positive sites.
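The regression formula itself is given only as an image in this text. The Python sketch below therefore assumes the standard logistic form P = 1 / (1 + exp(-(a + b·k))), with one (a, b) pair per Context mutation feature. That form, the coefficient values in the usage note, the threshold `alpha` and all names are assumptions for illustration, not the application's fitted model.

```python
from math import exp
from typing import Dict, Iterable, List, Tuple

Coeffs = Tuple[float, float]  # (a, b) for one Context mutation feature
Snv = Tuple[str, int]         # (context label such as "A(C>T)C", support k)


def p_background(k: int, a: float, b: float) -> float:
    """Assumed logistic form: P = 1 / (1 + exp(-(a + b * k)))."""
    z = a + b * k
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    ez = exp(z)  # evaluated this way to avoid overflow for very negative z
    return ez / (1.0 + ez)


def second_pass_filter(snvs: Iterable[Snv],
                       model: Dict[str, Coeffs],
                       alpha: float) -> List[Snv]:
    """Drop SNVs whose context-level background probability exceeds alpha."""
    kept = []
    for context, k in snvs:
        coeffs = model.get(context)
        if coeffs is not None and p_background(k, *coeffs) > alpha:
            continue  # consistent with background noise for this context
        kept.append((context, k))
    return kept
```

With a negative regression coefficient, the probability of being background noise falls as the number of supporting sequences rises. For hypothetical coefficients a = 2.0 and b = -1.0, a call with one supporting sequence is filtered at alpha = 0.05, while a call with ten is kept.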
Further, the filtering of clonal hematopoietic mutations in step 6) specifically comprises:
a. constructing a single-molecule-tag library from the matched white blood cells and sequencing it at high depth; after correction of the sequencing reads, performing reference genome alignment, dual-end fixed UMI deduplication and background mutation correction;
b. obtaining the SNVs and INDELs of the paired white blood cells based on the alignment result of the paired white blood cells after the consistency correction and deduplication in step a;
c. constructing a Fisher statistical distribution model based on the distribution characteristics of the mutation frequencies of clonal hematopoietic mutations and germline mutations in cfDNA and in the paired white blood cells, and filtering out the clonal hematopoietic mutations and germline mutations.
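The application does not spell out the decision rule of the Fisher model. One possible reading is sketched below in Python: a mutation that also has support in the paired white blood cells is flagged as clonal hematopoietic or germline unless its allele fraction in cfDNA is significantly higher than in the white blood cells under a one-sided Fisher exact test. The rule, the `min_wbc_alt` and `p_cutoff` parameters and the function names are assumptions of this sketch.

```python
from math import comb


def fisher_greater(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Returns P(X >= a) under the hypergeometric null with fixed margins.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)
    upper = min(row1, col1)
    total = sum(comb(row1, x) * comb(row2, col1 - x) for x in range(a, upper + 1))
    return total / denom


def is_ch_or_germline(cf_alt: int, cf_depth: int,
                      wbc_alt: int, wbc_depth: int,
                      min_wbc_alt: int = 2, p_cutoff: float = 0.05) -> bool:
    """Flag a mutation as white-blood-cell derived (assumed rule)."""
    if wbc_alt < min_wbc_alt:
        return False  # no meaningful support in white blood cells
    p = fisher_greater(cf_alt, cf_depth - cf_alt, wbc_alt, wbc_depth - wbc_alt)
    # Not significantly enriched in plasma: treat as CH or germline.
    return p >= p_cutoff
```

Under this rule, a mutation present at the same allele fraction in cfDNA and in white blood cells is flagged, while a mutation with no white blood cell support, or one strongly enriched in plasma, is retained as a ctDNA candidate.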
The application also provides an electronic device comprising: a processor and a memory; the processor is connected to a memory, wherein the memory is configured to store a computer program, and the processor is configured to invoke the computer program to perform the method according to any of the preceding claims.
The present application also provides a computer storage medium storing a computer program comprising program instructions which, when executed by a processor, perform a method as claimed in any preceding claim.
The beneficial technical effect of this application:
1) Using background noise mutations in a training set of cfDNA samples from normal people (that is, people without tumors), the method constructs a binomial distribution statistical model based on the background error rate of each specific site and a regression model based on the cumulative background error rate of three-base Context mutation features. Combining the two models filters background noise mutations twice, at the level of the specific site and at the level of the background mutation feature. This improves the accuracy and effectiveness of background noise filtering and ensures accurate detection of very low ctDNA signals in MRD and of low-frequency ctDNA mutations.
2) The method performs dual-end fixed UMI library construction and high-depth sequencing on control white blood cells and constructs a statistical distribution model based on the mutation frequencies of clonal hematopoietic mutations in cfDNA and in the control white blood cells. The model identifies clonal hematopoietic mutations and removes their interference with ctDNA mutation detection, which further ensures accurate detection of very low ctDNA signals in MRD and of low-frequency ctDNA mutations.
3) The application develops a double background noise mutation removal method based on blood circulation tumor DNA. It removes ctDNA background noise mutations by combining sequence consistency correction, double filtering of background noise mutations and filtering of clonal hematopoietic mutations, and thereby reduces the false positives in mutation detection caused by high ctDNA sequencing depth and by clonal hematopoietic mutations.
Drawings
FIG. 1, ctDNA background mutation distribution model construction and background noise mutation removal flow;
FIG. 2, logistic regression filtering model plots for the 48 A-type background Context mutation features;
FIG. 3, logistic regression filtering model plots for the 48 C-type background Context mutation features;
FIG. 4, logistic regression filtering model plots for the 48 G-type background Context mutation features;
FIG. 5, logistic regression filtering model plots for the 48 T-type background Context mutation features;
FIG. 6, background error rate of negative cfDNA under different background noise mutation removal methods;
FIG. 7, accuracy of ctDNA mutation detection under different background noise mutation removal methods.
Detailed Description
Embodiments of the present application will be described in detail below with reference to examples, but it will be understood by those skilled in the art that the following examples only illustrate the present application and should not be construed as limiting its scope. Where specific conditions are not given in the examples, conventional conditions or those recommended by the manufacturer were used. Reagents or instruments for which no manufacturer is indicated are conventional, commercially available products.
Definitions of some terms. Unless otherwise defined below, all technical and scientific terms used in the detailed description of the present application are intended to have the same meaning as commonly understood by one of ordinary skill in the art. While the following terms are believed to be well understood by those skilled in the art, the following definitions are set forth to better explain the present application.
The term "about" in this application means a range of accuracy that one skilled in the art can understand that still guarantees the technical effect of the features in question. The term generally means a deviation of + -10%, preferably + -5%, from the indicated value.
As used in this application, the terms "comprising," "including," "having," "containing," or "involving" are inclusive or open-ended and do not exclude additional unrecited elements or method steps. The term "consisting of …" is considered to be a preferred embodiment of the term "comprising". If a certain group is defined below to contain at least a certain number of embodiments, this should also be understood to disclose a group that preferably consists of only these embodiments.
Furthermore, the terms first, second, third, (a), (b), (c), and the like in the description and in the claims, are used for distinguishing between similar elements and not necessarily for describing a sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances and that the embodiments described herein are capable of operation in other sequences than described or illustrated herein.
The present application is described below in conjunction with specific embodiments.
Example 1 construction of blood ctDNA background noise mutation Joint Filter model
As shown in fig. 1B, the ctDNA background noise mutation joint filtering model construction process of the present application includes the following steps:
a. extracting cfDNA from a normal human blood sample and performing cfDNA targeted capture and sequencing;
b. sequencing the captured genes on an NGS platform to obtain high-throughput sequencing reads, performing quality control, aligning the quality-controlled reads to the human reference genome hg19, and removing duplicate reads;
c. based on the human reference genome alignment data, obtaining the sequencing depth of each site in each normal cfDNA sample and the frequency of mutation of the reference base at that site to each of the three non-reference bases;
d. obtaining background mutations and human germline mutations in the normal human cfDNA samples: the mutation frequencies of human germline mutations are generally distributed around 0.5 and 1 (heterozygous and homozygous mutations) and have a certain population frequency in population databases, whereas the mutation frequencies of background mutations are generally below 0.2; therefore, mutations with a frequency below 0.2 in the normal human cfDNA samples are taken as real background mutations, and the remaining sites are filtered out as human germline mutations;
e. construction of the binomial distribution background mutation noise filtering model: for each real background mutation obtained in step d, count the cumulative sequencing depth $N_m$ in the normal cfDNA samples and the number of supporting sequences $n_m$ of the background mutation. The statistics conform to the binomial probability model below, which is used to construct the binomial distribution background mutation noise filtering model. Here $m$ denotes a background mutation site, $p_m$ the cumulative background error rate of that site in the normal human cfDNA samples, $N_m$ the total sequencing depth of the site in the normal human cfDNA samples, $n_m$ the total number of supporting sequences of the site in the normal human cfDNA samples, and $P(X = n_m)$ the probability that the number of supporting sequences at the site is $n_m$:
$$P(X = n_m) = \binom{N_m}{n_m}\, p_m^{\,n_m}\,(1 - p_m)^{N_m - n_m}$$
f. construction of the background Context mutation feature regression model: obtain the Context mutation feature of each background mutation (a three-base feature: one of the 12 basic single-base substitutions A>T, A>G, A>C, C>A, C>T, C>G, G>A, G>C, G>T, T>A, T>C, T>G combined with one reference genome base upstream and one downstream of the position, 192 in total), together with its sequencing depth (N) and number of supporting sequences (n) in the normal human cfDNA samples, and count the cumulative background error rate at different numbers of supporting sequences. As the examples in FIGS. 2-5 show, the number of supporting sequences of a background Context mutation feature and the cumulative background error rate conform to a logistic regression model.
Thus, a logistic regression filtering model is constructed for the 192 background Context mutation features, where $m$ denotes the Context mutation feature of a background mutation, $k_m$ the number of supporting sequences of that feature, $a$ a constant term, $b$ the regression coefficient, and $P(X \ge k_m)$ the probability that the number of supporting sequences of the Context mutation feature is greater than or equal to $k_m$:
Figure BDA0004067235490000072
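Step f can be sketched in Python as follows, under two assumptions that go beyond the text: the cumulative background error rate at k supporting sequences is taken to be the empirical fraction of background observations with at least k supporting sequences, and the constant term a and regression coefficient b are fitted by ordinary least squares on the logit scale. Both choices, and all function names, are illustrative only.

```python
from collections import Counter
from math import log
from typing import Dict, Iterable, Tuple


def cumulative_tail(support_counts: Iterable[int]) -> Dict[int, float]:
    """Empirical P(X >= k) for every observed support number k."""
    counts = Counter(support_counts)
    total = sum(counts.values())
    tail, running = {}, 0
    for k in sorted(counts, reverse=True):
        running += counts[k]
        tail[k] = running / total
    return tail


def fit_logit_line(tail: Dict[int, float]) -> Tuple[float, float]:
    """Least-squares fit of logit(P) = a + b * k, skipping P equal to 0 or 1."""
    pts = [(k, log(p / (1.0 - p))) for k, p in tail.items() if 0.0 < p < 1.0]
    if len(pts) < 2:
        raise ValueError("need at least two points with 0 < P < 1")
    n = len(pts)
    mean_k = sum(k for k, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((k - mean_k) ** 2 for k, _ in pts)
    sxy = sum((k - mean_k) * (y - mean_y) for k, y in pts)
    b = sxy / sxx
    a = mean_y - b * mean_k
    return a, b
```

In practice the fit would be repeated once per Context mutation feature, giving 192 pairs (a, b). Because background observations become rarer as the support number grows, the fitted b is expected to be negative.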
Example 2 Dual background noise mutation removal method
As shown in fig. 1, the background noise mutation removal process of the present application includes the following steps:
1) Library construction with single-molecule tags (UMI, Unique Molecular Identifier) and targeted capture sequencing of tumor cfDNA;
2) Quality control of the resulting high-throughput sequencing reads: after low-quality reads are removed, the reads are aligned to the human reference genome hg19 to obtain a BAM alignment result;
3) For the BAM alignment result, deduplication of the reference genome alignment data and sequence consistency correction based on the dual-end fixed single-molecule tags (UMI), giving a BAM alignment result corrected for background mutation noise;
4) Mutation detection: based on the BAM alignment result after the consistency correction and deduplication in 3), single nucleotide variants (SNV) and small insertion/deletion variants (INDEL) are obtained;
5) Construction of a background noise mutation joint filtering model: as shown in fig. 1B, the ctDNA background noise mutation joint filter model construction process includes the following steps:
a. extracting cfDNA from a normal human blood sample and performing cfDNA targeted capture and sequencing;
b. sequencing the captured genes on an NGS platform to obtain high-throughput sequencing reads, performing quality control, aligning the quality-controlled reads to the human reference genome hg19, and removing duplicate reads;
c. based on the human reference genome alignment data, obtaining the sequencing depth of each site in each normal cfDNA sample and the frequency of mutation of the reference base at that site to each of the three non-reference bases;
d. obtaining background mutations and human germline mutations in the normal human cfDNA samples: the mutation frequencies of human germline mutations are generally distributed around 0.5 and 1 (heterozygous and homozygous mutations) and have a certain population frequency in population databases, whereas the mutation frequencies of background mutations are generally below 0.2; therefore, mutations with a frequency below 0.2 in the normal human cfDNA samples are taken as real background mutations, and the remaining sites are filtered out as human germline mutations;
e. construction of the binomial distribution background mutation noise filtering model: for each real background mutation obtained in step d, count the cumulative sequencing depth $N_m$ in the normal cfDNA samples and the number of supporting sequences $n_m$ of the background mutation. The statistics conform to the binomial probability model below, which is used to construct the binomial distribution background mutation noise filtering model. Here $m$ denotes a background mutation site, $p_m$ the cumulative background error rate of that site in the normal human cfDNA samples, $N_m$ the total sequencing depth of the site in the normal human cfDNA samples, $n_m$ the total number of supporting sequences of the site in the normal human cfDNA samples, and $P(X = n_m)$ the probability that the number of supporting sequences at the site is $n_m$:
$$P(X = n_m) = \binom{N_m}{n_m}\, p_m^{\,n_m}\,(1 - p_m)^{N_m - n_m}$$
f. construction of the background Context mutation feature regression model: obtain the Context mutation feature of each background mutation (a three-base feature: one of the 12 basic single-base substitutions A>T, A>G, A>C, C>A, C>T, C>G, G>A, G>C, G>T, T>A, T>C, T>G combined with one reference genome base upstream and one downstream of the position, 192 in total), together with its sequencing depth (N) and number of supporting sequences (n) in the normal human cfDNA samples, and count the cumulative background error rate at different numbers of supporting sequences. As the example in FIG. 2 shows, the number of supporting sequences of a background Context mutation feature and the cumulative background error rate conform to a logistic regression model.
Thus, a Context mutation feature logistic regression filtering model is constructed for the 192 kinds of background mutation noise, where $m$ denotes the Context mutation feature of a background mutation, $k_m$ the number of supporting sequences of that feature, $a$ a constant term, $b$ the regression coefficient, and $P(X \ge k_m)$ the probability that the number of supporting sequences of the Context mutation feature is greater than or equal to $k_m$:
Figure BDA0004067235490000091
6) Background noise mutation filtering of binomial distribution model, the specific filtering process is as follows:
a. for each mutation detected in 4) above, calculating the probability P(Bias) that the mutation site is a background noise mutation, based on the number of supporting sequences (k) of the mutation site, the sequencing depth (n) of the site, and the binomial distribution background mutation noise filtering model (p) constructed in 5) above:
P(Bias) = Bin(k; n, p)
b. filtering background noise mutation based on the probability P (Bias) calculated in the a, judging the position as the background noise mutation if P (Bias) > alpha (alpha is a judging threshold value), and filtering out the position, so as to perform first filtering of the specific background false positive position in the Panel coverage range;
7) Background noise mutation filtering of Context mutation characteristic regression model is specifically performed as follows:
a. for each single-base mutation detected in 4) above, obtaining the Context mutation feature of the mutation site (a three-base feature, e.g. A(C>T)C);
b. calculating the probability P that the mutation is background noise mutation based on the mutation support sequence number (k) of the site and the regression model constructed based on the Context mutation characteristics in the above 5), wherein a and b are respectively a constant term and a regression coefficient of the Context mutation characteristic regression model in the above 5):
Figure BDA0004067235490000092
c. filter background noise mutations based on the probability P calculated in b: if P > α (α being the decision threshold), the site is judged to be a background noise mutation and is filtered out. This constitutes the second filtering of background false positives, based on the Context mutation feature of the background noise mutation;
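Step 7) can be sketched in Python as follows. The regression formula itself is only available as an image, so the logistic form P = 1 / (1 + exp(-(a + b·k))) used here is an assumption, as are the function names, the coefficients and the example sequence.

```python
import math

def context_feature(ref_seq, pos, alt):
    """Build the three-base feature for a substitution at 0-based index pos."""
    if pos < 1 or pos > len(ref_seq) - 2:
        raise ValueError("need one flanking base on each side")
    up, ref, down = ref_seq[pos - 1], ref_seq[pos], ref_seq[pos + 1]
    if alt == ref:
        raise ValueError("alt must differ from the reference base")
    return f"{up}({ref}>{alt}){down}"

def logistic_background_prob(k, a, b):
    """Assumed logistic form with constant term a and coefficient b."""
    return 1.0 / (1.0 + math.exp(-(a + b * k)))

def is_context_noise(feature, k, coeffs, alpha):
    """Rule from step c: filter the site when P > alpha."""
    a, b = coeffs[feature]
    return logistic_background_prob(k, a, b) > alpha

# Hypothetical coefficients; real values would come from the model in 5).
coeffs = {"A(C>T)C": (2.0, -1.5)}
feature = context_feature("GACCT", 2, "T")
print(feature)                                           # A(C>T)C
print(is_context_noise(feature, 1, coeffs, alpha=0.05))  # True
print(is_context_noise(feature, 6, coeffs, alpha=0.05))  # False
```

With a negative coefficient b, the probability of being background falls as the number of supporting sequences grows, which matches the intent of filtering only weakly supported calls.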
8) Clonal hematopoietic mutation filtering:
a. perform single-molecule tag (UMI) library construction and high-depth sequencing on the matched white blood cells; after the sequencing reads are corrected, perform reference genome alignment, double-end fixed UMI deduplication and background mutation correction;
b. obtain the single-nucleotide variants (SNV) and small insertion/deletion variants (INDEL) of the paired white blood cells from the BAM alignment results of the paired white blood cells after the consistency correction and deduplication in step a;
c. construct a Fisher statistical distribution model based on the distribution characteristics of the mutation frequencies of clonal hematopoietic mutations and germline mutations in cfDNA and in the paired white blood cells, and filter out the clonal hematopoietic mutations and germline mutations;
9) Obtain the true mutations derived from circulating tumor DNA (ctDNA).
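Step 8) c does not spell out the Fisher model. One plausible reading, sketched below in Python as an assumption, is a one-sided Fisher exact test on the 2×2 table of variant-supporting and reference-supporting read counts in cfDNA versus paired white blood cells: a variant is kept as tumor-derived only when its frequency in cfDNA is significantly higher than in the white blood cells. The function names, the threshold and the read counts are hypothetical.

```python
import math

def _log_choose(n, k):
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def fisher_exact_greater(cf_alt, cf_ref, wbc_alt, wbc_ref):
    """One-sided Fisher exact p-value that cfDNA is enriched for alt reads."""
    total = cf_alt + cf_ref + wbc_alt + wbc_ref
    alt_total = cf_alt + wbc_alt
    cf_total = cf_alt + cf_ref
    log_denom = _log_choose(total, cf_total)
    p_value = 0.0
    # Sum hypergeometric probabilities of tables at least as extreme.
    for x in range(cf_alt, min(alt_total, cf_total) + 1):
        log_p = (_log_choose(alt_total, x)
                 + _log_choose(total - alt_total, cf_total - x) - log_denom)
        p_value += math.exp(log_p)
    return min(p_value, 1.0)

def is_tumor_derived(cf_alt, cf_ref, wbc_alt, wbc_ref, alpha=0.01):
    """Keep the variant only if cfDNA is significantly enriched (assumed rule)."""
    return fisher_exact_greater(cf_alt, cf_ref, wbc_alt, wbc_ref) < alpha

# Variant at 0.1% in cfDNA (60000X) and absent in white blood cells (10000X).
print(is_tumor_derived(60, 59940, 0, 10000))   # True
# Variant at 0.1% in cfDNA and 0.12% in white blood cells.
print(is_tumor_derived(60, 59940, 12, 9988))   # False
```

The second call illustrates the case the filter is meant to catch: a variant present at a similar frequency in white blood cells is more likely of clonal hematopoietic or germline origin than tumor-derived.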
Example 3 Evaluation of effect
In this example, 8 cfDNA samples from normal individuals and 15 cfDNA samples from tumor patients underwent double-end fixed UMI library construction and high-depth targeted sequencing at a sequencing depth of 60000X; in addition, paired white blood cells from the 15 tumor patients underwent double-end fixed UMI library construction and high-depth targeted sequencing at a sequencing depth of 10000X. After sequence quality control, the 38 samples (including the 15 paired white blood cell samples) were aligned to the human reference genome using the BWA (v0.7.17) MEM algorithm. All samples then underwent sequence consistency correction and deduplication using the double-end fixed UMI, yielding deduplicated, background-mutation-noise-corrected BAM alignment results that served as input files. After mutation detection on all samples, two arms were compared: in one, the background noise joint filtering model and the clonal hematopoietic mutation filtering of the present application were applied to remove background noise mutations; in the other, only basic mutation filtering was applied, without the joint filtering model or the clonal hematopoietic mutation filtering. The following two tests were performed:
1) Comparison of the background error rate in normal human cfDNA with and without the background noise mutation removal of the present application;
2) Comparison of the accuracy of low-frequency ctDNA mutation detection in all cfDNA samples with and without the background noise mutation removal of the present application.
As shown in FIG. 3, across the 8 normal human cfDNA samples the average background error rate was 10^-4 without any background noise mutation removal (average), 10^-5 when only double-end fixed UMI sequence consistency correction was applied (UMI-NoiseReduced), and as low as 10^-6 when the dual background noise mutation filtering method of the present application was applied (UMI-NoiseReduced+polished). In summary, the dual background noise mutation removal method of the present application effectively removes the interference of background noise mutations in cfDNA detection, and the overall background error rate is 100 times lower than that of the conventional approach without any background noise mutation removal.
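As a quick check of these figures, the fold reduction and the expected number of error-supporting sequences per site can be worked out directly. The second and third expressions treat the average background error rate as a per-site, per-sequence error probability and take the nominal 60000X as the effective depth; both are simplifying assumptions made only for illustration.

```latex
\frac{10^{-4}}{10^{-6}} = 100, \qquad
60000 \times 10^{-4} = 6, \qquad
60000 \times 10^{-6} = 0.06
```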
The known tumor mutations determined for the 15 tumor patients are shown in Table 1. As shown in FIG. 4, the 8 normal human cfDNA samples and the cfDNA of the 15 patients with known tumor mutations were used to evaluate the effect of the background noise mutation removal method of the present application on the accuracy of low-frequency ctDNA mutation detection. Compared with sequence consistency correction alone (UMI-NoiseReduced only), the dual background noise mutation removal method of the present application (UMI-NoiseReduced+polished) effectively removes background noise mutations from the cfDNA of both normal individuals and tumor patients. The number of false-positive background mutations in each normal human cfDNA sample is no more than 1, and any such mutation is a rare mutation unrelated to tumors that does not affect the MRD detection status of a patient; the method thus ensures the specificity of low-frequency ctDNA mutation detection and MRD detection. At the same time, the method effectively removes background noise mutations from tumor patient cfDNA while still detecting true low-frequency ctDNA mutations, ensuring the accuracy of detecting low-frequency ctDNA mutations and the extremely low-content ctDNA signals in MRD.
Table 1. True ctDNA mutations in the cfDNA of the 15 tumor patients (columns: sample ID, gene, amino acid change)
22030111 EGFR p.T790M
22030111 EGFR p.L858R
22030112 EGFR p.C797S
22030113 KRAS p.G12D
22030114 EGFR p.L718M
22030115 EGFR p.C797S
22030115 EGFR p.L858R
22030117 KRAS p.G13D
22030118 KRAS p.K117N
22030119 PIK3CA p.E545K
22030120 U2AF1 p.S34F
22030120 PIK3CA p.E545K
22030122 BRAF p.V600_K601delinsE
22030122 BRAF p.V600E
22030123 EGFR p.E746_A750del
22030123 TP53 p.R248W
22030124 EGFR p.L858R
22030124 EGFR p.T790M
22030125 KRAS p.G12A
22030126 KRAS p.G12D
22030127 EGFR p.C797S
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, and some or all of their technical features may be replaced by equivalents; such modifications and substitutions do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present application.

Claims (11)

1. A method for constructing a joint filtering model for background noise mutations in blood circulating tumor DNA, characterized by comprising the following steps:
a. targeted capture sequencing of normal human blood cfDNA samples;
b. sequence quality control and reference genome alignment;
c. acquisition of sequencing depth and base mutation frequency;
d. background mutation acquisition and germline mutation filtering;
e. construction of the binomial distribution background mutation noise filtering model: for each real background mutation obtained in step d, counting its cumulative sequencing depth in the normal human cfDNA samples and the number of sequences supporting the background mutation, and thereby constructing the binomial distribution background mutation noise filtering model;
f. construction of the background Context mutation feature regression model: obtaining the sequencing depth of each background Context mutation feature in the normal human cfDNA samples and the number of sequences supporting the background Context mutation feature, computing the cumulative background error rate at different numbers of supporting sequences, and constructing the background Context mutation feature regression model.
2. The construction method according to claim 1, wherein
step b specifically comprises: performing quality control on the NGS sequencing reads, and aligning the quality-controlled reads to the human reference genome;
preferably, step c specifically comprises: obtaining, from the human reference genome alignment data, the sequencing depth at each site in each normal human cfDNA sample and the mutation frequency of the base at that site mutating to each of the other three non-reference bases;
more preferably, step d specifically comprises: taking mutations with a mutation frequency below 0.2 in the normal human cfDNA samples as real background mutations, and filtering out the other sites as human germline mutations.
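Steps c to e can be illustrated with a short Python sketch that pools counts across normal samples. It assumes the cumulative background error rate of a site is the pooled number of supporting sequences divided by the pooled depth, and that a sample is excluded at a site only when its own mutation frequency reaches the 0.2 cutoff; the tuple layout, names and counts are hypothetical.

```python
from collections import defaultdict

GERMLINE_VAF_CUTOFF = 0.2  # threshold stated in step d

def pool_background(observations):
    """Pool per-sample counts into per-site background error rates.

    observations: iterable of (site, alt_base, depth, support) tuples,
    one per normal sample (layout assumed for this sketch).
    Returns {(site, alt_base): (support_total, depth_total, error_rate)}.
    """
    depth_sum = defaultdict(int)
    support_sum = defaultdict(int)
    for site, alt, depth, support in observations:
        if depth == 0:
            continue
        if support / depth >= GERMLINE_VAF_CUTOFF:
            continue  # treated as germline, not background
        key = (site, alt)
        depth_sum[key] += depth
        support_sum[key] += support
    return {key: (support_sum[key], depth_sum[key],
                  support_sum[key] / depth_sum[key])
            for key in depth_sum}

observations = [
    ("siteA", "T", 50000, 4),
    ("siteA", "T", 60000, 2),
    ("siteA", "T", 55000, 27000),  # frequency about 0.49: filtered as germline
]
model = pool_background(observations)
print(model[("siteA", "T")][:2])  # (6, 110000)
```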
3. The method of claim 1, wherein in step e,
the binomial distribution background mutation noise filtering model is as follows:
Figure QLYQS_1
where P(X = n_m) is the probability that the number of supporting sequences at the background mutation site is n_m, m is the background mutation site, p_m is the cumulative background error rate of the background mutation site in the normal human cfDNA samples, N_m is the total sequencing depth of the background mutation site in the normal human cfDNA samples, and n_m is the total number of supporting sequences of the background mutation site in the normal human cfDNA samples.
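The formula itself is available only as an image. Assuming the model is the standard binomial probability mass function, the symbols defined above would combine as shown here; this is a reconstruction from the definitions rather than a transcription of the figure, and the estimate of p_m as n_m / N_m is likewise an assumption.

```latex
P(X = n_m) = \binom{N_m}{n_m}\, p_m^{\,n_m}\, (1 - p_m)^{N_m - n_m},
\qquad p_m \approx \frac{n_m}{N_m}
```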
4. The method of claim 1, wherein in step f,
the background Context mutation feature regression model is as follows:
Figure QLYQS_2
where P(X ≥ k_m) is the probability that the number of supporting sequences of the Context mutation feature is greater than or equal to k_m, m is the background Context mutation feature, k_m is the number of supporting sequences of the background Context mutation feature, a is the constant term, and b is the regression coefficient.
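How a and b are fitted is not described beyond counting the cumulative background error rate at different numbers of supporting sequences. One simple possibility, sketched in Python purely as an assumption, is to compute the empirical tail probability P(X ≥ k) for one Context mutation feature across normal-sample observations and fit logit(P) = a + b·k by ordinary least squares. The function names and the example counts are hypothetical, and the logistic form is itself assumed.

```python
import math

def empirical_tail(support_counts, max_k):
    """Empirical P(X >= k) for k = 1..max_k from observed support numbers."""
    total = len(support_counts)
    return [(k, sum(1 for s in support_counts if s >= k) / total)
            for k in range(1, max_k + 1)]

def fit_logit_line(points):
    """Least-squares fit of logit(P) = a + b * k; returns (a, b)."""
    usable = [(k, math.log(p / (1.0 - p))) for k, p in points if 0.0 < p < 1.0]
    if len(usable) < 2:
        raise ValueError("need at least two points with 0 < P < 1")
    n = len(usable)
    mean_k = sum(k for k, _ in usable) / n
    mean_y = sum(y for _, y in usable) / n
    sxx = sum((k - mean_k) ** 2 for k, _ in usable)
    sxy = sum((k - mean_k) * (y - mean_y) for k, y in usable)
    b = sxy / sxx
    a = mean_y - b * mean_k
    return a, b

# Hypothetical support numbers for one feature across normal-sample sites.
support_counts = [0, 0, 0, 0, 1, 1, 1, 2, 2, 3]
points = empirical_tail(support_counts, max_k=3)
a, b = fit_logit_line(points)
print(points)   # [(1, 0.6), (2, 0.3), (3, 0.1)]
print(b < 0)    # True: the tail probability falls as k grows
```

A maximum-likelihood logistic fit would be the more usual choice with real data; the closed-form least-squares fit is used here only to keep the sketch dependency-free.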
5. A method for removing dual background noise mutations based on blood circulating tumor DNA (ctDNA), comprising the following steps:
1) targeted capture sequencing of tumor cfDNA;
2) mutation detection, obtaining all SNV and/or INDEL mutation results;
3) construction of the background noise mutation joint filtering model, using the method of any one of claims 1-4;
4) background noise mutation filtering with the binomial distribution model;
5) background noise mutation filtering with the background Context mutation feature regression model;
preferably, the method further comprises:
6) clonal hematopoietic mutation filtering.
6. The removal method according to claim 5, wherein
in step 1), the targeted capture sequencing comprises library construction with single-molecule tags (UMI) followed by targeted capture sequencing;
preferably, the method further comprises, after sequencing, a sequencing quality control step and a deduplication and sequence correction step;
more preferably, the sequencing quality control is: removing low-quality sequences after high-throughput sequencing and aligning the remaining sequences to the human reference genome to obtain alignment results; the deduplication and sequence correction are: performing deduplication and sequence consistency correction on the reference genome alignment data based on the double-end fixed single-molecule tags (UMI), to obtain alignment results corrected for background mutation noise;
further preferably, the mutation detection in step 2) is: obtaining all SNV and/or INDEL mutation results based on the sequence alignment results after deduplication and sequence correction.
7. The removal method according to any one of claims 5-6, wherein
step 4), background noise mutation filtering with the binomial distribution model, is specifically as follows:
a. for each SNV and/or INDEL mutation detected in step 2), calculating the probability P(Bias) that the mutation site is a background noise mutation, based on the number of supporting sequences (k) at the mutation site, the sequencing depth (n) at the site, and the binomial distribution background mutation noise filtering model constructed in step 3):
P(Bias) = Bin(k, n, p)
b. performing background noise mutation filtering based on the calculated probability P(Bias): if P(Bias) is greater than the decision threshold, the site is judged to be a background noise mutation and is filtered out, thereby performing the first filtering of background false-positive sites.
8. The removal method according to any one of claims 5-7, wherein
step 5), background noise mutation filtering with the background Context mutation feature regression model, is specifically as follows:
a. for each single-base mutation (SNV) detected in step 2), obtaining the Context mutation feature of the mutation site;
b. calculating the probability P that the mutation is a background noise mutation, based on the number of mutation-supporting sequences (k) at the site and the background Context mutation feature regression model constructed in step 3),
Figure QLYQS_3
where a and b are respectively the constant term and the regression coefficient of the background Context mutation feature regression model in step 3);
c. performing background noise mutation filtering based on the probability P: if P is greater than the decision threshold, the site is judged to be a background noise mutation and is filtered out, thereby performing the second filtering of background false-positive sites.
9. The removal method according to any one of claims 5 to 8, wherein
step 6), clonal hematopoietic mutation filtering, is specifically as follows:
a. performing single-molecule tag (UMI) library construction and high-depth sequencing on the matched white blood cells; after the sequencing reads are corrected, performing reference genome alignment, double-end fixed UMI deduplication and background mutation correction;
b. obtaining the SNV and INDEL variants of the paired white blood cells from the alignment results of the paired white blood cells after the consistency correction and deduplication in step a;
c. constructing a Fisher statistical distribution model based on the distribution characteristics of the mutation frequencies of clonal hematopoietic mutations and germline mutations in cfDNA and in the paired white blood cells, and filtering out the clonal hematopoietic mutations and germline mutations.
10. An electronic device, comprising: a processor and a memory; the processor being connected to a memory, wherein the memory is adapted to store a computer program, the processor being adapted to invoke the computer program to perform the method according to any of claims 1-9.
11. A computer storage medium, characterized in that the computer storage medium stores a computer program comprising program instructions which, when executed by a processor, perform the method of any of claims 1-9.
CN202310080082.4A 2023-02-07 2023-02-07 Dual background noise mutation removal method based on blood circulation tumor DNA Active CN116356001B (en)

Publications (2)

Publication Number Publication Date
CN116356001A true CN116356001A (en) 2023-06-30
CN116356001B CN116356001B (en) 2023-12-15
