WO2023115662A1 - Method for detecting variant nucleic acids - Google Patents

Method for detecting variant nucleic acids

Publication number
WO2023115662A1
Authority
WO
WIPO (PCT)
Prior art keywords
mutation
site
frequency
somatic
sample
Application number
PCT/CN2022/070974
Other languages
French (fr)
Chinese (zh)
Inventor
张之宏
祝鹏飞
吴帅来
王晨阳
邱福俊
汉雨生
Original Assignee
广州燃石医学检验所有限公司
Application filed by 广州燃石医学检验所有限公司

Classifications

    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12N MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
    • C12N15/00 Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
    • C12N15/09 Recombinant DNA-technology
    • C12N15/11 DNA or RNA fragments; Modified forms thereof; Non-coding nucleic acids having a biological activity
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869 Methods for sequencing
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6883 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
    • C12Q1/6886 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material for cancer
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/50 Mutagenesis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search

Definitions

  • the present application relates to the field of biomedicine, in particular to a method for detecting variant nucleic acids.
  • ctDNA Circulating Tumor DNA
  • MRD Minimal Residual Disease
  • MRD positive or negative is mainly judged by detecting ctDNA content in peripheral blood after surgery.
  • the application provides a method for detecting the presence and/or quantity of a variant nucleic acid, the method comprising determining the presence and/or amount of the variant nucleic acid based on the somatic mutation sites and the background mutation sites in the region to be tested in the sample to be tested, wherein the background mutation sites are determined by removing the somatic mutation sites from all mutation sites in the sample to be tested.
  • the present application provides an analysis device for detecting the presence and/or quantity of a variant nucleic acid, the device comprising a judgment module for determining the presence and/or quantity of the variant nucleic acid based on the somatic mutation sites and the background mutation sites of the region to be tested in the sample to be tested, wherein the background mutation sites are determined by removing the somatic mutation sites from all mutation sites in the sample to be tested.
  • the present application provides a storage medium, which records a program capable of implementing the method described in the present application.
  • the present application provides a device, the device comprising the storage medium described in the present application.
  • the present application provides the method according to the present application, which is used to detect and/or quantify circulating tumor DNA in a test sample obtained from a subject.
  • the present application provides a method for detecting a variant nucleic acid, for example, a method for detecting the presence and/or quantity of a variant nucleic acid, the method comprising determining the presence and/or amount of the variant nucleic acid based on the somatic mutation sites and the background mutation sites of a region to be tested in a sample to be tested, wherein the background mutation sites are determined by removing the somatic mutation sites from all mutation sites in the sample to be tested.
  • the detection method of the present application can accurately evaluate the proportion of sample ctDNA and the significance level of sample ctDNA.
  • Figures 1A-1B show the observed signal frequency of insertion or deletion of 1 repeat unit for different repeat unit repeat numbers.
  • Figures 2A-2B show the frequency of observable signals for insertions or deletions of 1, 2 or 3 repeat units for different repeat unit repeat numbers.
  • Figures 3A-3B show the observed signal frequency of insertion or deletion of a repeating unit with a length of 1, 2 or 3 bases for different repeating numbers of the repeating unit.
  • Figure 4 shows the frequency of observable signals for random insertions or deletions of 1 or 2 bases.
  • Figures 5A-5B show the results of the evaluation dilution ratio based on the number of different sites, where the abscissa is the number of sites, the ordinate is the evaluation dilution ratio, and the dotted line represents the dilution ratio of the experiment.
  • Figure 6 shows the detection sensitivity results of different detection methods.
  • Figures 7A-7E show the sensitivity and specificity of the method of the present application for diluted samples of different cell lines.
  • Figure 7A shows the detection of diluted samples of the H2009 cell line (human lung adenocarcinoma cells), including analysis based on 88 positive sites and 265 negative sites;
  • Figure 7B shows the detection of diluted samples of the HCC38 cell line (human breast ductal carcinoma cells), including analysis based on 41 positive sites and 312 negative sites;
  • Figure 7C shows the detection of diluted samples of the H1437 cell line (human non-small cell lung cancer cells), including analysis based on 48 positive sites and 305 negative sites;
  • Figure 7D shows the detection of diluted samples of the HCC1395 cell line (human breast cancer cells), including analysis based on 85 positive sites and 268 negative sites;
  • Figure 7E shows the detection of diluted samples of the H2126 cell line (human lung cancer cell line), including analysis based on 91 positive loc
  • on the abscissa, 05pct represents the 5e-03 dilution
  • 01pct represents the 1e-03 dilution
  • 002pct represents the 2e-04 dilution
  • 0004pct represents the 4e-05 dilution
  • 00008pct represents the 8e-06 dilution
  • negative samples represent a dilution of 0.
  • variant nucleic acid generally refers to a nucleic acid fragment after mutation such as insertion, addition, deletion and/or substitution at one or more positions of the nucleic acid sequence.
  • variant nucleic acid may comprise variant nucleic acid derived from tumor tissue, such as ctDNA.
  • variant nucleic acids may comprise variant nucleic acids derived from fetal tissue.
  • a variant nucleic acid may comprise a variant nucleic acid derived from a foreign tissue or organ.
  • the term "somatic mutation" generally refers to an acquired class of mutations that occur in non-embryonic cells.
  • the somatic mutation may include a genetic change occurring in a somatic tissue (e.g., a cell outside the germline).
  • the somatic mutations may include point mutations (for example, the exchange of a single nucleotide for another nucleotide, such as silent mutations, missense mutations and nonsense mutations), insertions and deletions (for example, the addition and/or removal of one or more nucleotides, i.e. indels), amplifications, gene duplications, copy number alterations (CNAs), rearrangements, and splice variants.
  • the somatic mutations may be closely related to the processes of cell growth, programming, senescence and apoptosis.
  • the somatic mutations may be associated with alterations in signaling pathways in tumorigenesis, angiogenesis and/or tumor metastasis.
  • background mutation generally refers to a mutation in a test sample that can be used for background reference.
  • a background mutation can be a heritable mutation in a subject, for example, a background mutation can be a mutation that both normal tissues as well as tumor tissues of the subject can have.
  • the method provided by this application can remove, from all mutations in the sample to be tested, the somatic mutations detected in tumor tissue and the information corresponding to other sites that need to be excluded, so as to exclude the influence of definite mutation sites or regions on background calculations.
  • the term "mutation site" generally refers to the site where there is a nucleotide difference compared with the nucleotide sequence of the control sequence.
  • the control sequence may be a reference sequence used in gene sequencing (for example, it may be a human reference genome).
  • the mutation site may include at least 1 (for example, 1, 2, 3, 4 or more) differences in the nucleotide sequence at the site (for example, the differences may include nucleotide substitutions, duplications, deletions and/or additions).
  • the mutation site may include a nucleotide mutation at at least one nucleotide position.
  • the nucleotide mutation can be a natural mutation or an artificial mutation.
  • the mutation site may comprise a single nucleotide variation (SNV).
  • database generally refers to an organized entity of related data, regardless of the manner in which the data or the organized entity is represented.
  • the organized entity of related data may take the form of a table, map, grid, group, datagram, file, document, list, or any other form.
  • the database may include any data collected and stored in a computer-accessible manner.
  • the term "computing module" generally refers to a functional module for computing.
  • the calculation module can calculate the output value or obtain a conclusion or result according to the input value, for example, the calculation module can be mainly used for calculating the output value.
  • a computing module can be tangible, such as a processor of an electronic computer, a computer or electronic device with a processor, or a computer network, or it can be a program, command line or software package stored on an electronic medium.
  • processing module generally refers to a functional module for data processing.
  • the processing module may process the input value into statistically meaningful data, for example, by classifying the data of the input value.
  • a processing module may be tangible, such as an electronic or magnetic medium for storing data, a processor of an electronic computer, a computer or electronic device with a processor, or a computer network, or it may be a program, command line or software package stored on an electronic medium.
  • the term "judgment module" generally refers to a functional module for obtaining relevant judgment results.
  • the judging module may calculate an output value or obtain a conclusion or a result according to an input value, for example, the judging module may be mainly used to obtain a conclusion or a result.
  • the judging module can be tangible, such as a processor of an electronic computer, a computer with a processor or an electronic device or a computer network, or it can be a program, a command line or a software package stored on an electronic medium.
  • sample obtaining module generally refers to a functional module for obtaining said sample of a subject.
  • the sample obtaining module may include reagents and/or instruments required to obtain the sample (e.g., tissue sample, blood sample, saliva, pleural effusion, peritoneal effusion, cerebrospinal fluid, etc.).
  • for example, lancets, blood collection tubes, and/or blood sample transport boxes may be included.
  • the device of the present application may contain none, or one or more, of the sample obtaining modules, and may optionally have the function of outputting the measured value of the sample described in the present application.
  • the term "receiving module" generally refers to a functional module for obtaining said measured values in said sample.
  • the receiving module may input the samples described in this application (such as tissue samples, blood samples, saliva, pleural effusion, peritoneal effusion, cerebrospinal fluid, etc.).
  • the receiving module may input the measured values of the samples described in the present application (such as tissue samples, blood samples, saliva, pleural effusion, peritoneal effusion, cerebrospinal fluid, etc.).
  • the receiving module can detect the state of the sample.
  • the data receiving module may optionally perform the gene sequencing described in this application (e.g., next-generation gene sequencing) on the sample.
  • the data receiving module may optionally include reagents and/or instruments required for the gene sequencing.
  • the data receiving module can optionally detect sequencing depth, sequencing read length count or sequencing sequence information.
  • the terms "next-generation gene sequencing", "high-throughput sequencing" and "next-generation sequencing" generally refer to the second-generation high-throughput sequencing technology and the higher-throughput sequencing methods developed thereafter.
  • next-generation sequencing platforms include but are not limited to existing sequencing platforms such as Illumina. With the continuous development of sequencing technology, those skilled in the art can understand that other sequencing methods and devices can also be used for this method. For example, second-generation gene sequencing can have the advantages of high sensitivity, high throughput, high sequencing depth, or low cost.
  • examples include Massively Parallel Signature Sequencing (MPSS), Polony sequencing, 454 pyrosequencing, Illumina (Solexa) sequencing, ion semiconductor sequencing, DNA nanoball sequencing, Complete Genomics DNA nanoarrays and combinatorial probe-anchor ligation sequencing, etc.
  • second-generation gene sequencing makes it possible to analyze the transcriptome and genome of a species in detail, so it is also called deep sequencing.
  • the method of the present application can also be applied to first-generation gene sequencing, second-generation gene sequencing, third-generation gene sequencing or single molecule sequencing (SMS).
  • SMS single molecule sequencing
  • sample to be tested generally refers to a sample that needs to be detected and determined whether there is a variant nucleic acid in one or more gene regions of the sample.
  • the sample to be tested or its data can be pre-stored in the memory before testing.
  • human reference genome generally refers to the human genome that can function as a reference in gene sequencing.
  • the information of the human reference genome can refer to UCSC (University of California, Santa Cruz).
  • the human reference genome can have different versions, for example, it can be hg19, GRCh37 or Ensembl 75.
  • sequencing depth generally refers to the number of times a specific region (e.g., a specific gene, a specific interval, or a specific base) is detected.
  • the sequencing depth may refer to a base sequence detected by sequencing. For example, by aligning the sequencing reads to the human reference genome, and optionally removing duplicates, the number of sequencing reads on a specific gene, a specific interval, or a specific base position can be determined and counted as the sequencing depth.
  • sequencing depth can be correlated to sequencing depth. For example, sequencing depth can be affected by the mutation status of a gene.
  • sequencing data generally refers to data of short sequences obtained after sequencing.
  • the sequencing data includes the base sequence of a sequenced short sequence (sequencing read), the number of sequencing reads, and the like.
  • the term "significance test" generally refers to a way of judging whether the difference between a sample and a hypothesized distribution is significant. For example, through the significance test, it can be judged whether the somatic mutation of the sample to be tested represents a significant difference.
  • T-test generally refers to a form of statistical hypothesis testing with a Student's t distribution. For example, the somatic mutation of a certain target gene in the sample to be tested can be confirmed to be significant by T-test.
  • the term "about" generally refers to a range of 0.5%-10% above or below the specified value, such as 0.5%, 1%, 1.5%, 2%, 2.5%, 3%, 3.5%, 4%, 4.5%, 5%, 5.5%, 6%, 6.5%, 7%, 7.5%, 8%, 8.5%, 9%, 9.5%, or 10% above or below the specified value.
  • the application provides a method for detecting the presence and/or quantity of a variant nucleic acid.
  • the method may comprise determining the presence and/or amount of the variant nucleic acid based on the somatic mutation sites and the background mutation sites in the region to be tested in the sample to be tested, wherein the background mutation sites are determined by removing the somatic mutation sites from all mutation sites in the sample to be tested.
  • the region to be tested can be targeted and detected based on a probe or a combination of probes.
  • the region to be tested can be selected based on known mutation regions of tumors in the art.
  • the region to be tested can be selected according to the mutation region obtained after sequencing the tumor tissue.
  • somatic mutation sites can be selected based on sequencing data from a subject's tumor sample.
  • the somatic mutation site can be randomly selected based on the somatic mutations of the subject's tumor sample, or one or more sites with higher priority can be selected according to a ranking by somatic mutation frequency or the like. For example, 1 or more, 2 or more, 3 or more, 4 or more, 5 or more, 10 or more, 15 or more, 20 or more, 25 or more, 30 or more, 40 or more, 50 or more, 60 or more or 100 or more sites may be selected as somatic mutation sites.
  • the variant nucleic acid may be selected from the group consisting of circulating tumor nucleic acid, cell-free fetal nucleic acid (or may be referred to as circulating fetal nucleic acid), and circulating nucleic acid derived from allogeneic organs and/or tissues.
  • the variant nucleic acid can be circulating tumor DNA.
  • the method provided in the present application may also include performing base error correction on all mutation sites in the sample to be tested.
  • the correction of base errors can be a commonly used correction means in the art.
  • the base error correction may comprise correcting the base type at each position of the sequencing reads originating from the same position based on a majority voting rule to determine a consensus sequence.
  • the correction includes adjusting the base quality of the site whose base type cannot be determined to 0.
  • the correction of the present application may include simultaneous correction of the sequencing reads of the sense strand and the antisense strand derived from the same site, that is, the sense strand and the antisense strand derived from the same nucleic acid fragment are corrected together and one corrected consensus sequence is retained.
  • the correction of the present application may include correcting the sequencing reads of the sense strand and the antisense strand derived from the same site separately, that is, after the sense strand and the antisense strand derived from the same nucleic acid fragment are corrected, two corrected consensus sequences are retained.
  • the base error correction may also include correcting the base type at each position of the sense strand and the antisense strand derived from the same site, retaining the respective consensus sequences of the sense strand and the antisense strand.
  • the correction may include adjusting the base quality of the sites of inconsistent bases derived from the same site to 0.
  • the correction may include deleting the base information of the positions of inconsistent bases derived from the same position.
  • the correction may include not using the base information of the positions of inconsistent bases derived from the same position for subsequent data analysis.
  • sequencing reads derived from the same site may comprise sequencing reads that align to the same position in the human reference genome and comprise the same unique molecular identifier (UMI).
  • sequencing reads derived from the same site can comprise sequencing reads that align to substantially the same position in the human reference genome.
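The majority-voting correction described above can be sketched as follows. This is a minimal illustration, not the application's implementation: it assumes reads are already aligned and trimmed to equal length, that a family is defined by alignment position plus UMI, that "majority" means a strict majority, and that a determined base receives a placeholder quality of 30. The function names and the record layout are hypothetical.

```python
from collections import Counter, defaultdict

def group_families(reads):
    """Group (position, umi, sequence) records by alignment position and UMI."""
    families = defaultdict(list)
    for pos, umi, seq in reads:
        families[(pos, umi)].append(seq)
    return families

def call_consensus(seqs, determined_quality=30):
    """Majority-vote consensus over equal-length reads of one family.

    Positions without a strict majority get base 'N' and quality 0,
    mirroring "adjusting the base quality ... to 0".
    determined_quality is a placeholder value (an assumption).
    """
    bases, quals = [], []
    for column in zip(*seqs):
        base, count = Counter(column).most_common(1)[0]
        if count * 2 > len(column):
            bases.append(base)
            quals.append(determined_quality)
        else:
            bases.append("N")
            quals.append(0)
    return "".join(bases), quals

# Toy input: two UMI families aligned to the same position.
reads = [
    (100, "AACG", "ACGT"),
    (100, "AACG", "ACGT"),
    (100, "AACG", "ACTT"),
    (100, "TTGC", "ACGT"),
    (100, "TTGC", "AGGT"),
]
consensus_by_family = {
    key: call_consensus(seqs) for key, seqs in group_families(reads).items()
}
```

In the toy input, the three-read family resolves its single disagreement by vote, while the two-read family cannot resolve its disagreement, so that position is masked with quality 0.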
  • the method may include determining a mutation site in the sample to be tested based on the base error-corrected site.
  • the method of the present application may include selecting a mutation site in the sample to be tested from the base error-corrected sites.
  • the method described in the present application may further comprise obtaining the background mutation sites by removing high-frequency mutation sites from all mutation sites in the sample to be tested.
  • the background mutation sites in the present application may include the mutation sites that remain after known tumor somatic mutation sites and high-frequency mutation sites are removed from the mutation sites in the sample to be tested.
  • the high frequency mutation site may comprise a site with a mutation frequency of about 5e-03 or higher.
  • the high-frequency mutation sites can be adjusted according to factors such as sequencing accuracy and sample quality.
  • the high frequency mutation site may comprise a site with a mutation frequency of about 1e-03 or higher, 5e-03 or higher, 1e-02 or higher, 5e-02 or higher, 1e-01 or higher, or 5e-01 or higher.
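As a rough sketch of the background-site selection described above, the following assumes each mutation site is keyed by a hypothetical (chromosome, position, ref, alt) tuple with its observed mutation frequency, and uses the 5e-03 cutoff named in the text as the default. The function name and data layout are illustrative assumptions.

```python
def background_sites(all_sites, somatic_sites, high_freq_cutoff=5e-3):
    """Return the sites left after removing somatic and high-frequency sites.

    all_sites: dict mapping a site key to its observed mutation frequency.
    somatic_sites: set of site keys known from the subject's tumor sample.
    """
    return {
        site: freq
        for site, freq in all_sites.items()
        if site not in somatic_sites and freq < high_freq_cutoff
    }

# Toy data (made up for illustration only).
all_sites = {
    ("chr1", 100, "A", "T"): 0.02,    # high frequency, removed
    ("chr1", 200, "C", "T"): 0.0004,  # kept as background
    ("chr2", 50, "G", "A"): 0.0011,   # known somatic site, removed
    ("chr3", 75, "T", "C"): 0.48,     # high frequency, removed
}
somatic = {("chr2", 50, "G", "A")}
background = background_sites(all_sites, somatic)
```

The cutoff is left as a parameter because the text notes it can be adjusted according to sequencing accuracy and sample quality.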
  • the method of the present application may include removing sequence information that fails quality control from the sequence information of the sample to be tested.
  • the sequence information that fails quality control may include sequence information judged unqualified by sequencing quality control methods commonly used in the art.
  • the sequence information that fails quality control may include sequence information of low-quality sequencing reads, sequence information of low-quality bases, and the like.
  • the method may further include removing sequence information of low-quality sequencing reads (reads) from sequence information of the sample to be tested.
  • low-quality sequencing reads may contain alignment errors or difficult-to-align sequencing reads.
  • a low-quality sequencing read may be a sequencing read that, when the sequencing read is aligned to a human reference genome location, has a low probability value that the aligned position turns out to be correct.
  • the low quality sequencing reads may comprise sequencing reads having an alignment quality of less than 60.
  • the sequence information of such low-quality sequencing reads may be excluded from the sequence information of the alignment position.
  • the sequencing quality of the sequencing reads can be confirmed by a sequencing instrument and quality control methods commonly used in the art.
  • the low-quality sequencing reads can also include sequencing reads that include 8 or more base mismatches.
  • the method further includes removing sequence information of low-quality bases from the sequence information of the sample to be tested.
  • sequence information of bases with low base quality is removed after correction.
  • the low-quality bases may include bases whose corrected base quality is less than 20.
  • the sequencing accuracy can be 99.99% or higher.
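The quality-control thresholds listed above (alignment quality below 60, 8 or more mismatches, corrected base quality below 20) can be illustrated with a small filter. This is only a sketch: the `Read` record and its field names are hypothetical and do not correspond to any particular alignment library.

```python
from dataclasses import dataclass

@dataclass
class Read:
    mapq: int        # alignment quality of the read
    mismatches: int  # number of base mismatches against the reference
    bases: str       # corrected base calls
    quals: list      # corrected per-base qualities

def passes_read_qc(read, min_mapq=60, max_mismatches=7):
    """A read is low quality if its alignment quality is below 60
    or it carries 8 or more mismatches."""
    return read.mapq >= min_mapq and read.mismatches <= max_mismatches

def usable_bases(read, min_base_quality=20):
    """Keep (offset, base) pairs whose corrected quality is at least 20."""
    return [
        (i, base)
        for i, (base, qual) in enumerate(zip(read.bases, read.quals))
        if qual >= min_base_quality
    ]

# Toy reads (made up for illustration only).
reads = [
    Read(mapq=60, mismatches=1, bases="ACGT", quals=[35, 12, 40, 20]),
    Read(mapq=30, mismatches=0, bases="ACGT", quals=[40, 40, 40, 40]),
    Read(mapq=60, mismatches=8, bases="ACGT", quals=[40, 40, 40, 40]),
]
kept_reads = [r for r in reads if passes_read_qc(r)]
kept_bases = usable_bases(kept_reads[0])
```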
  • the method may further comprise determining a mutation frequency selected from the group consisting of the somatic mutation frequency of the somatic mutation site and the background mutation frequency of the background mutation site, for evaluating the significance level of the site mutation.
  • a mutation frequency derived from a somatic mutation site can be a somatic mutation frequency.
  • a mutation frequency derived from a background mutation site can be a background mutation frequency.
  • the mutation frequency may comprise multimer mutation frequency and/or insertion or deletion (INDEL) mutation frequency; for example, the model used to calculate the mutation frequency may be the multimer mutation frequency of sequencing data.
  • the model used to calculate mutation frequency can be the insertion or deletion (INDEL) mutation frequency of sequencing data.
  • INDEL can represent insertion or deletion.
  • the multimer mutation frequency may include the frequency at which a base at a specific position in a specific contiguous base sequence is mutated into another base.
  • the frequency of single base mutation may include the frequency of mutation of a single base.
  • the multimer mutation frequency may include the frequency at which a base at an intermediate position is mutated in a consecutive sequence of bases.
  • the contiguous base sequence may contain 2 or more bases contiguously.
  • the contiguous base sequence may comprise 2 or more, 3 or more, 5 or more, 7 or more, or 9 or more bases in a contiguous arrangement.
  • the contiguous base sequence may comprise 3 or 5 contiguous bases.
  • the multimer mutation frequency may include the frequency at which the second base in a specific contiguous sequence is mutated to another specific base.
  • the trimer mutation frequency focuses on the frequency of the second base mutating into another base in a specific arrangement environment of the first base and the third base.
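A trimer mutation frequency of the kind described above could be estimated by pooling all background sites that share the same three-base context. The sketch below assumes a hypothetical input of (reference trimer, observed middle base, depth) records and simply divides mutated depth by total depth per context; it is one plausible reading of the text, not the application's model.

```python
from collections import Counter

def trimer_frequencies(observations):
    """Pool observations by trinucleotide context.

    observations: iterable of (ref_trimer, observed_middle_base, depth).
    Returns {(ref_trimer, alt_base): frequency}, where frequency is the
    depth supporting alt_base divided by the total depth of that context.
    """
    alt_depth = Counter()
    total_depth = Counter()
    for trimer, base, depth in observations:
        total_depth[trimer] += depth
        if base != trimer[1]:  # middle base differs from the reference
            alt_depth[(trimer, base)] += depth
    return {key: n / total_depth[key[0]] for key, n in alt_depth.items()}

# Toy counts from two sites sharing the ACG context and one TCA site.
observations = [
    ("ACG", "C", 9990), ("ACG", "T", 10),
    ("ACG", "C", 9995), ("ACG", "A", 5),
    ("TCA", "C", 5000),
]
trimer_freqs = trimer_frequencies(observations)
```

Here the two ACG sites contribute a combined depth of 20000, so the pooled frequency of C mutating to T in the A_G context is 10/20000.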
  • the INDEL mutation frequency may include the following groups: random INDEL mutation frequency, and base repeat region INDEL mutation frequency.
  • the INDEL mutation frequency may comprise a random INDEL mutation frequency.
  • the random INDEL mutation frequency may comprise the frequency of insertion or deletion of one or more bases.
  • the random INDEL mutation frequency may comprise the frequency of insertion or deletion of one or more bases after a specified one or more bases.
  • the random INDEL mutation frequency may comprise the frequency of insertion or deletion of one or more bases after a specific one base; for example, the random INDEL mutation frequency may comprise the frequency with which one or more bases are inserted or deleted after a specific base.
  • the random INDEL mutation frequency may include the frequency of insertion or deletion of a specific base after a specific base.
  • the random INDEL mutation frequency may include the frequency of insertion or deletion of a specific length of bases after a specific base.
  • the specific base combination of the insertion or deletion may not be considered, and only the frequency of insertion or deletion of a specific length of bases after a specific one or more bases may be considered.
  • the INDEL mutation frequency may include the frequency of INDEL mutations in base repeat regions.
  • the INDEL mutation frequency in the base repeating region may include the frequency of insertion or deletion of one or more base repeating units (Unit), and the length of the Unit is 1 or more.
  • the INDEL mutation frequency in the base repeating region may include the frequency of insertion or deletion of one or more base repeating units (Unit), and the length of the Unit is 2 or more.
  • the Unit length is 2 bases or more, 3 bases or more, 4 bases or more, 5 bases or more, 6 bases or more, 7 bases or more, 8 bases or more, 9 bases or more, or 10 bases or more.
  • the INDEL mutation frequency in the base repeat region may include the frequency of insertion or deletion of a specific number of Units in sequences with the same Unit length and the same Unit repetition number.
  • the specific base combination of the Unit may not be considered, and only the frequency of insertion or deletion of one or more Units in a sequence with a specific number of Unit repetitions may be considered.
  • determining the INDEL mutation frequency in the base repeat region may include determining the frequency of insertion or deletion of a specific number of Units in sequences with the same Unit length and the same Unit repetition number, where the Unit may include any sequence.
  • in this case, Units of any combination of bases can be pooled for calculation.
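The pooling of repeat-region INDEL observations by Unit length and repeat number, ignoring the Unit's base composition, might look like the sketch below. The record layout (unit, repeat count, signed change in Unit count, depth) is a hypothetical input format chosen for illustration.

```python
from collections import Counter

def repeat_indel_frequencies(observations):
    """Pool repeat-region INDEL counts by (unit length, repeat count).

    observations: iterable of (unit, repeat_count, unit_change, depth), where
    unit_change is the number of Units gained (+) or lost (-) and 0 means the
    repeat was observed unchanged. The Unit sequence itself is ignored.
    Returns {(unit_length, repeat_count, unit_change): frequency}.
    """
    changed = Counter()
    total = Counter()
    for unit, repeats, change, depth in observations:
        stratum = (len(unit), repeats)
        total[stratum] += depth
        if change != 0:
            changed[stratum + (change,)] += depth
    return {key: n / total[key[:2]] for key, n in changed.items()}

# Toy counts: A and T homopolymers of 10 repeats are pooled together.
observations = [
    ("A", 10, 0, 980), ("A", 10, -1, 15), ("A", 10, 1, 5),
    ("T", 10, 0, 995), ("T", 10, -1, 5),
    ("AC", 6, 0, 1000),
]
indel_freqs = repeat_indel_frequencies(observations)
```

Pooling across Unit sequences increases the depth available per stratum, which is presumably why the text allows Units of any base combination to be combined.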
  • the method described in the present application may further comprise determining the presence of the variant nucleic acid in the sample to be tested and/or the significance level of the mutation at the somatic mutation site. For example, significantly mutated somatic mutation sites can be used to assess the presence of a variant nucleic acid. For example, when assessing the proportion of variant nucleic acids, only the data of somatic mutation sites with significant mutations can be used.
  • the method may comprise measuring said level of significance by determining a cumulative probability of considering said somatic mutation site as a background mutation. For example, it is possible to estimate the cumulative probability that a candidate somatic mutation site is subject to background mutations. For example, the cumulative probability can be used to represent a significance level.
  • the method may comprise determining the cumulative probability based on a Poisson distribution or a binomial distribution.
  • the method may comprise determining the cumulative probability based on a binomial distribution.
  • the method may comprise determining the cumulative probability based on a Poisson distribution.
  • the method may comprise determining the cumulative probability based on the following formula: $P = 1 - \sum_{k=0}^{x-1} \frac{e^{-np}\,(np)^{k}}{k!}$, where P represents the cumulative probability, k is accumulated from 0 to x-1, x represents the coverage depth of the mutated sequence at the somatic mutation site, n represents the total coverage depth of the somatic mutation site, p represents the background mutation frequency of the somatic mutation site, and e represents the base of the natural logarithm.
  • the method may comprise determining the cumulative probability based on the following formula: $P = 1 - \sum_{k=0}^{x-1} \binom{n}{k}\,p^{k}\,(1-p)^{n-k}$, where P represents the cumulative probability, k is accumulated from 0 to x-1, x represents the coverage depth of the mutated sequence at the somatic mutation site, n represents the total coverage depth of the somatic mutation site, and p represents the background mutation frequency of the somatic mutation site.
  • the method may comprise determining the presence of a variant nucleic acid when the cumulative probability is less than a significance threshold.
  • the determination of the significance threshold can be adjusted by those skilled in the art according to the accuracy of the sequencing instrument and the quality of the sample to be tested.
  • the significance threshold is 0.05 or less.
  • the significance threshold is 0.05 or less, 0.01 or less, 0.005 or less, 0.001 or less, 0.0005 or less, or 0.0001 or less.
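  • The cumulative probability described above (summing k from 0 to x-1 under a Poisson or binomial background model and taking the upper tail) can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names are hypothetical and are not taken from the application.

```python
import math


def poisson_tail(x: int, n: int, p: float) -> float:
    """Upper-tail probability P(X >= x) for X ~ Poisson(n * p).

    x: mutant read depth, n: total depth, p: background mutation frequency.
    """
    lam = n * p
    term = math.exp(-lam)  # Poisson term for k = 0
    cdf = 0.0
    for k in range(x):  # accumulate k = 0 .. x-1
        cdf += term
        term *= lam / (k + 1)  # move from term k to term k+1
    return max(0.0, 1.0 - cdf)


def binomial_tail(x: int, n: int, p: float) -> float:
    """Upper-tail probability P(X >= x) for X ~ Binomial(n, p), with 0 <= p < 1."""
    term = (1.0 - p) ** n  # binomial term for k = 0
    cdf = 0.0
    for k in range(x):  # accumulate k = 0 .. x-1
        cdf += term
        term *= (n - k) / (k + 1) * p / (1.0 - p)  # term k -> term k+1
    return max(0.0, 1.0 - cdf)


def is_significant(x: int, n: int, p: float, threshold: float = 0.01) -> bool:
    """Call the site significant when the Poisson tail is below the threshold."""
    return poisson_tail(x, n, p) < threshold
```

  • For example, `is_significant(6, 10000, 1e-4)` compares six mutant reads at depth 10000 against a background frequency of 1e-4; the threshold default of 0.01 is only one of the values listed above.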
  • the method may further comprise determining the presence and/or amount of the variant nucleic acid in the sample to be tested.
  • the method of the present application can be used to accurately determine the proportion of variant nucleic acid such as ctDNA in the sample to be tested.
  • the method of the present application for determining the proportion of variant nucleic acid and/or obtaining the significance level of the proportion can be evaluated based on the data of somatic mutation sites and background mutation sites.
  • the method may comprise determining the presence and/or amount of the variant nucleic acid in the test sample by a likelihood estimation algorithm.
  • the method may comprise a likelihood estimation algorithm based on a Poisson distribution or a binomial distribution to determine the presence and/or amount of the variant nucleic acid in the test sample.
  • the quantity of the variant nucleic acid may comprise the proportion of circulating tumor DNA (ctDNA) in the total DNA of the test sample.
  • the maximum likelihood estimate $\hat{\theta}$ of the ctDNA proportion $\theta$ is determined by a maximum likelihood estimation algorithm.
  • the method may comprise determining a maximum likelihood estimate $\hat{\theta}$ of the ctDNA proportion $\theta$: when the value of $\theta$ is $\hat{\theta}$, the function $l(\theta)$ of the following formula takes its maximum value, where, as known in the art, ln(x) denotes the logarithm of x to the base e: $l(\theta) = \sum_{i} w_i \,\ln l_i(\theta; x_i, n_i, p_i, q_i)$
  • $w_i$ is the weight of the i-th somatic mutation site, and $l_i(\theta; x_i, n_i, p_i, q_i)$ is calculated by the following formula: $l_i(\theta; x_i, n_i, p_i, q_i) = \frac{e^{-n_i f_i}\,(n_i f_i)^{x_i}}{x_i!}$
  • $f_i$ is the expected mutation frequency of the i-th somatic mutation site in the sample to be tested, calculated from $\theta$, $q_i$ and $p_i$; $n_i$ represents the total coverage depth of the i-th somatic mutation site; $x_i$ is the coverage depth of the mutated sequence at the i-th somatic mutation site; $q_i$ represents the mutation frequency of the i-th somatic mutation site in the tumor tissue sample; $p_i$ represents the background mutation frequency of the corresponding mutation in the sample to be tested; and e represents the base of the natural logarithm.
  • the method may comprise determining a maximum likelihood estimate $\hat{\theta}$ of the ctDNA proportion $\theta$: when the value of $\theta$ is $\hat{\theta}$, the function $l(\theta)$ of the following formula takes its maximum value: $l(\theta) = \sum_{i} w_i \,\ln l_i(\theta; x_i, n_i, p_i, q_i)$
  • $w_i$ is the weight of the i-th somatic mutation site, and $l_i(\theta; x_i, n_i, p_i, q_i)$ is calculated by the following formula: $l_i(\theta; x_i, n_i, p_i, q_i) = \binom{n_i}{x_i}\, f_i^{\,x_i}\,(1-f_i)^{\,n_i-x_i}$
  • $f_i$ is the expected mutation frequency of the i-th somatic mutation site in the sample to be tested, calculated from $\theta$, $q_i$ and $p_i$; $n_i$ represents the total coverage depth of the i-th somatic mutation site; $x_i$ is the coverage depth of the mutated sequence at the i-th somatic mutation site; $q_i$ represents the mutation frequency of the i-th somatic mutation site in the tumor tissue sample; and $p_i$ represents the background mutation frequency of the corresponding mutation in the sample to be tested.
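  • A minimal sketch of the weighted maximum likelihood estimation under the Poisson assumption is given below. The expression used here for the expected frequency, f_i = θ·q_i + (1 − θ)·p_i, is an illustrative assumption (a simple two-component mixture) and is not taken from the application; the function names and the golden-section search are likewise hypothetical choices. Each site is a tuple (x, n, p, q, w) and p is assumed to be strictly positive.

```python
import math


def expected_freq(theta: float, q: float, p: float) -> float:
    # ASSUMPTION for illustration only: two-component mixture of tumor
    # frequency q and background frequency p. Not the application's formula.
    return theta * q + (1.0 - theta) * p


def log_likelihood(theta: float, sites) -> float:
    """Weighted Poisson log-likelihood; sites is a list of (x, n, p, q, w)."""
    total = 0.0
    for x, n, p, q, w in sites:
        mu = n * expected_freq(theta, q, p)  # expected mutant reads, must be > 0
        total += w * (x * math.log(mu) - mu - math.lgamma(x + 1))
    return total


def mle_theta(sites, lo: float = 0.0, hi: float = 1.0, tol: float = 1e-9) -> float:
    """Golden-section search for the theta in [lo, hi] maximizing log_likelihood.

    Relies on the log-likelihood being concave in theta, which holds for a
    Poisson model whose mean is linear in theta.
    """
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if log_likelihood(c, sites) < log_likelihood(d, sites):
            a = c  # maximum lies in [c, b]
        else:
            b = d  # maximum lies in [a, d]
    return (a + b) / 2.0
```

  • With a single site of 20 mutant reads at depth 10000, background 1e-4 and tumor frequency 0.5, the estimate under this assumed mixture is the θ for which the expected frequency equals 20/10000.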
  • the method may comprise determining a significance level for the maximum likelihood estimate of the ctDNA proportion ⁇ .
  • the method may comprise determining the significance level by a likelihood ratio testing algorithm.
  • the method may comprise determining the significance level by a chi-square distribution based likelihood ratio testing algorithm.
  • in the method described in this application, the significance level is determined according to the value of the likelihood ratio statistic $\Lambda$ and the probability density function of the chi-square distribution with 1 degree of freedom.
  • the method may comprise determining the value of the likelihood ratio statistic by: $\Lambda = 2\,[\,l(\hat{\theta}) - l(0)\,]$, with $l(\theta) = \sum_{i} w_i \,\ln l_i(\theta; x_i, n_i, p_i, q_i)$
  • $w_i$ is the weight of the i-th somatic mutation site, and the value of $w_i$ is determined according to the somatic mutation frequency or sequencing coverage depth of the i-th somatic mutation site; $l_i(\theta; x_i, n_i, p_i, q_i)$ is calculated by the following formula: $l_i(\theta; x_i, n_i, p_i, q_i) = \frac{e^{-n_i f_i}\,(n_i f_i)^{x_i}}{x_i!}$
  • $f_i$ is the expected mutation frequency of the i-th somatic mutation site in the sample to be tested, calculated from $\theta$, $q_i$ and $p_i$; $n_i$ represents the total coverage depth of the i-th somatic mutation site; $x_i$ is the coverage depth of the mutated sequence at the i-th somatic mutation site; $q_i$ represents the mutation frequency of the i-th somatic mutation site in the tumor tissue sample; $p_i$ represents the background mutation frequency of the corresponding mutation in the sample to be tested; and e represents the base of the natural logarithm.
  • the method includes determining the value of the likelihood ratio statistic by: $\Lambda = 2\,[\,l(\hat{\theta}) - l(0)\,]$, with $l(\theta) = \sum_{i} w_i \,\ln l_i(\theta; x_i, n_i, p_i, q_i)$
  • $w_i$ is the weight of the i-th somatic mutation site, and the value of $w_i$ is determined according to the somatic mutation frequency or sequencing coverage depth of the i-th somatic mutation site; $l_i(\theta; x_i, n_i, p_i, q_i)$ is calculated by the following formula: $l_i(\theta; x_i, n_i, p_i, q_i) = \frac{e^{-n_i f_i}\,(n_i f_i)^{x_i}}{x_i!}$
  • $f_i$ is the expected mutation frequency of the i-th somatic mutation site in the sample to be tested, calculated from $\theta$, $q_i$ and $p_i$; $n_i$ represents the total coverage depth of the i-th somatic mutation site; $x_i$ is the coverage depth of the mutated sequence at the i-th somatic mutation site; $q_i$ represents the mutation frequency of the i-th somatic mutation site in the tumor tissue sample; $p_i$ represents the background mutation frequency of the corresponding mutation in the sample to be tested; and e represents the base of the natural logarithm. For example, the value of each $w_i$ can be the same; for example, the value of each $w_i$ can be 1.
  • those skilled in the art can adjust the specific value of w i from 0 to 1 according to the actual importance of the i-th somatic mutation site, such as the mutation frequency or sequencing coverage depth of the site.
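  • A minimal sketch of the chi-square based likelihood ratio test follows. It assumes the statistic is twice the difference between the maximized log-likelihood and the log-likelihood with the ctDNA proportion fixed at zero (the usual form of a likelihood ratio statistic; the application's exact expression is not reproduced here), and it reports the upper-tail probability of a chi-square distribution with 1 degree of freedom, which equals erfc(sqrt(statistic / 2)). The function name is hypothetical.

```python
import math


def lrt_p_value(loglik_at_mle: float, loglik_at_zero: float) -> float:
    """Significance level from a likelihood ratio test with 1 degree of freedom.

    loglik_at_mle:  log-likelihood at the estimated ctDNA proportion.
    loglik_at_zero: log-likelihood with the ctDNA proportion fixed at 0.
    """
    # The statistic cannot be negative when the first argument is a true maximum;
    # clamp to guard against tiny numerical noise.
    statistic = max(0.0, 2.0 * (loglik_at_mle - loglik_at_zero))
    # For Z ~ N(0, 1), P(Z**2 > s) = erfc(sqrt(s / 2)), i.e. the chi-square(1) tail.
    return math.erfc(math.sqrt(statistic / 2.0))
```

  • The sample would then be reported as containing variant nucleic acid when this value falls below the chosen significance threshold.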
  • the present application provides a method for detecting the presence and/or quantity of a variant nucleic acid; the method may include determining the presence and/or quantity of the variant nucleic acid based on the somatic mutation sites and the background mutation sites in the region to be tested in the sample to be tested.
  • the minimal residual disease detection method (PROPHET) of the present application can determine whether MRD is positive or negative by analyzing tumor somatic mutations in next-generation sequencing data generated by the amplicon method or the hybridization capture method, and belongs to the tumor-informed assay strategy.
  • the detection method of the present application can use known tumor somatic mutation information, which can for example be obtained from tumor tissue, to detect tumor somatic mutations in peripheral blood. Specifically: 1) perform whole-exome sequencing on tumor tissue samples and paired samples; 2) align the sequencing results to the human reference genome using commonly used alignment software such as bwa; 3) detect somatic mutations in the tumor tissue using commonly used somatic mutation analysis software such as mutect2; 4) prioritize the detected somatic mutations and select a certain number of mutations based on priority; 5) design hybridization capture probes based on the selected mutations, for subsequent detection in peripheral blood samples.
  • among the somatic mutations, those in highly repetitive regions, high-GC regions, and regions homologous to other sequences can be filtered out to reduce the difficulty of hybridization capture.
  • the priority of somatic mutations from high to low can be: 1) driver mutations; 2) mutations that cause amino acid sequence changes, including non-synonymous mutations, alternative splicing mutations, in-frame/out-of-frame InDels, etc.; 3) synonymous mutations. Within each of these three types, mutations are arranged in descending order of mutation frequency.
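  • The prioritization described above (category first, then descending mutation frequency within a category) can be sketched as follows. The category labels, the `Mutation` record and the function name are hypothetical; how a mutation is assigned to a category is outside this sketch.

```python
from dataclasses import dataclass

# Lower rank means higher priority; labels are illustrative only.
PRIORITY = {"driver": 0, "protein_altering": 1, "synonymous": 2}


@dataclass
class Mutation:
    name: str         # any identifier for the mutation
    category: str     # one of the PRIORITY keys
    tumor_vaf: float  # mutation frequency in the tumor tissue sample


def select_sites(mutations, max_sites: int):
    """Sort by category priority, then by descending frequency, and keep the top ones."""
    ranked = sorted(mutations, key=lambda m: (PRIORITY[m.category], -m.tumor_vaf))
    return ranked[:max_sites]
```

  • The selected sites would then be used for probe design; filtering of repetitive, high-GC or homologous regions would be applied before this step.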
  • the analysis of this application may specifically include the following steps: 1) data preparation, including correcting base errors based on molecular tags (UMI, Unique Molecular Identifier) and aligning the corrected reads to the human genome; 2) calculating the sample-specific background level based on the read alignment results; 3) calculating the mutation rate of each somatic mutation site to be tested; 4) evaluating, according to the background level, the significance of each somatic mutation site to be tested being a true mutation; 5) based on all the selected somatic mutation sites and the background level, evaluating the ctDNA proportion of the sample and its significance level.
  • base correction can optionally be performed by UMI, or by methods commonly used in the art, such as choosing more accurate library construction and sequencing methods to reduce base errors introduced during library construction and sequencing.
  • the data preparation step can generate sequence alignment files in BAM format after UMI deduplication and base correction.
  • the principle of UMI base correction is to use the sequencing sequences of multiple PCR products from the same molecular source to correct base errors in the process of library construction and sequencing.
  • the specific steps can be: 1) align the sequencing reads to the human reference genome using the commonly used next-generation sequencing alignment software bwa (version 0.7.10); 2) using the alignment information and UMI information, treat all reads with the same genomic position and the same UMI as reads from the same molecular source, group them into one unit, and retain units whose number of reads is greater than a certain threshold; 3) determine the base at each position in the unit based on the majority voting rule, and generate a consensus read representing the unit; 4) align the consensus reads to the genome to generate a BAM file.
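  • Steps 2) and 3) above can be sketched as follows. This is a simplified illustration that works on plain (position, UMI, sequence) tuples rather than on BAM records, assumes all reads in a unit have the same length, uses a hypothetical minimum family size, and writes N where the most common base is tied.

```python
from collections import Counter, defaultdict


def consensus_reads(reads, min_family_size: int = 2):
    """Group reads by (position, UMI) and build one consensus read per group.

    reads: iterable of (position, umi, sequence) tuples.
    Returns a dict mapping (position, umi) to the consensus sequence.
    """
    families = defaultdict(list)
    for position, umi, sequence in reads:
        families[(position, umi)].append(sequence)

    result = {}
    for key, sequences in families.items():
        if len(sequences) < min_family_size:
            continue  # too few reads to correct errors reliably
        bases = []
        for column in zip(*sequences):  # one column per read position
            (top_base, top_count), *rest = Counter(column).most_common(2)
            tied = bool(rest) and rest[0][1] == top_count
            bases.append("N" if tied else top_base)  # undetermined base -> N
        result[key] = "".join(bases)
    return result
```

  • In a real pipeline the consensus reads would be written out and realigned to produce the BAM file mentioned in step 4).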
  • UMI duplex library construction can distinguish molecules from different strands of double-stranded DNA, and this information can be used to correct each other in the subsequent base correction.
  • the reads from the same DNA strand can first be corrected based on the majority voting rule, with undetermined bases set to N with quality 0 and the quality of the other bases set to the highest value, to generate single-strand consensus sequences (SSCSs); the sequences derived from the two strands of the same DNA molecule can then be corrected against each other, with the quality of bases that are inconsistent between the two strands adjusted to 0 while both SSCSs are retained. Since a ctDNA molecule is only about 164 bp long and the sequencing read length can usually reach about 150 bp, the paired reads R1 and R2 largely overlap.
  • the method of the present application can optionally use the overlapping part of the R1 and R2 reads derived from the same DNA strand for recalibration, adjusting the quality of bases that are inconsistent between R1 and R2 to 0.
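  • The overlap-based recalibration of R1 and R2 can be sketched as follows. The sketch assumes the overlapping segments have already been extracted in the same orientation and with equal length, which in practice requires the alignment coordinates; the function name is hypothetical.

```python
def mask_overlap_disagreements(r1_bases, r1_quals, r2_bases, r2_quals):
    """Set base quality to 0 wherever R1 and R2 disagree in their overlap.

    r1_bases, r2_bases: equal-length strings covering the overlapping segment.
    r1_quals, r2_quals: lists of integer base qualities of the same length.
    Returns new quality lists; the inputs are left unchanged.
    """
    if not (len(r1_bases) == len(r2_bases) == len(r1_quals) == len(r2_quals)):
        raise ValueError("overlap segments and quality lists must have equal length")
    new_q1, new_q2 = list(r1_quals), list(r2_quals)
    for i, (base1, base2) in enumerate(zip(r1_bases, r2_bases)):
        if base1 != base2:
            new_q1[i] = 0  # disagreement: neither call is trusted
            new_q2[i] = 0
    return new_q1, new_q2
```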
  • the method provided in this application can optionally distinguish the reads from different strands of the same DNA, so that the loss of this part of the correction information can be avoided during the subsequent base correction.
  • the sample-specific background is based on the alignment information of the sequencing target region BAM file.
  • Various multimer mutation frequencies can optionally be calculated as the sample-specific background frequency. For example, when calculating trimer mutation frequencies, attention is paid to whether the second (middle) base changes while the other two bases stay fixed. For example, if the trimer formed by a base at a certain position in the target region together with its left and right neighbours is AGC, and the alignment result at this position includes 4 ACC, 6 ATC, 10 AAC and 99980 AGC, then the AGC->ACC trimer transition frequency is 4e-05, the AGC->ATC transition frequency is 6e-05, and the AGC->AAC transition frequency is 1e-04.
  • the trimer can also be changed into oligomers of other lengths, and the calculation method can be similar to that of the trimer.
  • the specific steps for sample-specific background calculation are: 1) count the number of trimers corresponding to all sites in the sequencing target region; 2) remove the information of all somatic mutation sites and other sites that need to be excluded, so as to exclude the impact of specific mutation sites or regions on the background calculation; 3) optionally remove all trimer information corresponding to sites with a mutation frequency higher than a specific threshold such as 5e-03, to exclude interference from other potential mutations; 4) pool the trimer information of the remaining sites and calculate the frequency of each mutation by trimer mutation type, which serves as the sample-specific background mutation level.
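  • The four steps above can be sketched as follows. The input layout (one record per site with its reference trimer and per-base read counts for the middle position) and the function name are assumptions made for illustration; read-level quality filtering is assumed to have been done beforehand.

```python
from collections import defaultdict


def trimer_background(sites, excluded_positions, max_site_freq: float = 5e-3):
    """Pool per-site counts into sample-specific trimer background frequencies.

    sites: iterable of (position, ref_trimer, counts), where counts maps the
           observed middle base to its read count at that position.
    excluded_positions: set of positions to skip (somatic sites and others).
    Returns a dict mapping (ref_trimer, alt_trimer) to a frequency.
    """
    alt_counts = defaultdict(int)
    depth = defaultdict(int)
    for position, ref, counts in sites:
        if position in excluded_positions:
            continue  # step 2: drop somatic and other excluded sites
        total = sum(counts.values())
        if total == 0:
            continue
        ref_mid = ref[1]
        alts = {b: c for b, c in counts.items() if b != ref_mid}
        if any(c / total > max_site_freq for c in alts.values()):
            continue  # step 3: drop sites that look like real mutations
        depth[ref] += total  # step 4: pool by reference trimer
        for base, count in alts.items():
            alt_counts[(ref, ref[0] + base + ref[2])] += count
    return {key: count / depth[key[0]] for key, count in alt_counts.items()}
```

  • Applied to the single AGC site of the worked example (4 ACC, 6 ATC, 10 AAC and 99980 AGC), the pooled frequencies reproduce 4e-05, 6e-05 and 1e-04.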
  • in order to exclude the impact of sequence alignment and low-quality bases on the background noise evaluation, reads with alignment quality below 60 or with 8 or more base mismatches can optionally be filtered out when calculating the background, and trimers with lower base quality may also optionally be discarded.
  • the sequencing data information of the sample to be analyzed can be used, and it does not rely on other normal samples or other samples of the same batch as controls, which is beneficial to eliminate background fluctuations caused by inter-sample factors or experimental batch factors.
  • it makes full use of all the information of the sequencing target region, integrates the information belonging to the same trimer at different positions, and effectively solves the problem of inaccurate background assessment due to insufficient data.
  • the method can also use the InDel mutation.
  • for the sample-specific InDel background, InDels are divided into two categories based on their sequence characteristics: 1) random InDels; 2) base repeat region InDels, represented by (Unit)n, where Unit represents a repeating unit, which can be a single base or multiple bases, and n represents the number of repetitions, generally 2 or more.
  • InDel in the base repeat region generally manifests as single or multiple indels of repeating units.
  • the steps for calculating the InDel background are similar to those for SNV, specifically: 1) for all sites in the sequencing target region, count the number of InDel signals and non-InDel signals in the sequencing data based on their reference sequences; 2) remove the information of all somatic mutation sites and other sites that need to be excluded, so as to exclude the impact of specific mutation sites or regions on the background calculation; 3) optionally remove all information corresponding to sites with a mutation frequency above a specific threshold such as 5e-03, to exclude interference from other potential mutations; 4) pool the information of the remaining sites and calculate the frequency of each mutation by InDel type, which serves as the sample-specific InDel background mutation level.
  • this application can count different types of InDel background values according to the type of the base before the InDel position and the length of the InDel insertion or deletion.
  • the previous base can be combined with the inserted or deleted base, and the related frequencies can be counted separately.
  • for example, the background frequencies of TA->T, GA->G, CA->C, T->TA, G->GA and C->CA can be counted separately.
  • when the InDel has 2 or more bases, because the number of combinations is large and the number of target sites of any single type is small, the mean background of InDels of the same length can optionally be calculated without separate statistics.
  • for (Unit)n where Unit is a single base, this application counts the background value based on the type of Unit in the reference sequence, the value of n, and the number of Units inserted or deleted. For (Unit)n where Unit is 2 bases, the specific sequence of Unit can optionally be ignored, and all InDels with Unit length 2 and the same number of repetitions n are merged to calculate the corresponding background. For example, the mutations GATAT->GAT and CTGTG->CTG both have Unit length 2, n equal to 2 and one Unit deleted, and are combined when calculating the background noise. When the Unit length is greater than 2, the processing can be the same as when the Unit length is 2.
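  • The grouping rule for repeat-region InDels can be sketched as a function that maps a mutation to a background category key. The key layout and function name are hypothetical; the sketch assumes the mutation is written with one leading anchor base, as in GATAT->GAT, and handles only pure repeat-unit InDels.

```python
def repeat_indel_key(ref: str, alt: str):
    """Return (unit_length, unit_or_None, n_ref, units_changed) for a repeat InDel.

    The Unit sequence is kept only for single-base Units; for longer Units it is
    dropped so that all Units of the same length and repeat count are pooled.
    units_changed is negative for deletions and positive for insertions.
    """
    if not ref or not alt or ref[0] != alt[0]:
        raise ValueError("ref and alt must share the leading anchor base")
    ref_tail, alt_tail = ref[1:], alt[1:]
    for unit_len in range(1, len(ref_tail) + 1):  # try the shortest Unit first
        if len(ref_tail) % unit_len or len(alt_tail) % unit_len:
            continue
        unit = ref_tail[:unit_len]
        n_ref = len(ref_tail) // unit_len
        n_alt = len(alt_tail) // unit_len
        if unit * n_ref == ref_tail and unit * n_alt == alt_tail:
            unit_label = unit if unit_len == 1 else None
            return (unit_len, unit_label, n_ref, n_alt - n_ref)
    raise ValueError("not a pure repeat-unit InDel")
```

  • Under this keying, GATAT->GAT and CTGTG->CTG fall into the same category, while AGGG->AGG keeps its Unit base G in the key.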
  • the specific SNV mutation frequency is calculated based on the trimer pattern, or the corresponding mutation frequency is calculated based on the InDel type. For example, when the original trimer of a specific somatic cell to be tested is CAG, and the somatic cell mutation is A->G, the mutation frequency of CAG->CGG can be calculated. Likewise, the effects of low-quality alignments or low-quality bases are optionally excluded from the calculations.
  • the binomial distribution is to repeat n times independent Bernoulli experiments. In each experiment, there are only two possible results, and whether the two results occur are opposite to each other and independent of each other, which is in line with the description of the sample background mutation scene.
  • the method of the present application can adopt the assumption of a Poisson distribution (x ~ Poisson(np)) or a binomial distribution (x ~ Binom(n, p)) to calculate the significance of somatic mutations.
  • the present application adopts the assumption of Poisson distribution for calculation, which can have higher evaluation result accuracy.
  • the method of the present application calculates the cumulative probability P value under the background condition according to the observed value of the specific mutation of the somatic cell site and the frequency of the mutation in the sample background.
  • when the P value is less than a specific threshold, the mutation frequency of the site can be considered significantly higher than the sample background, and the site is regarded as a true mutation.
  • for example, if the mutation type of the site to be detected is A->G, the original trimer is CAG, and the observed coverage depth of the site is n, of which x reads are CGG, the P value of mutation detection at the site is: $P = 1 - \sum_{k=0}^{x-1} \frac{e^{-n\,p(\mathrm{CAG}\to\mathrm{CGG})}\,\big(n\,p(\mathrm{CAG}\to\mathrm{CGG})\big)^{k}}{k!}$
  • the method is also applicable to calculating INDEL significance. If an INDEL to be detected is AGGG->AGG, the coverage depth of the site is n1 and the number of observed AGG is x1, it is enough to replace n in the above formula with n1, x with x1, and p(CAG->CGG) with p(AGGG->AGG). By analogy, the significance of all types of INDEL or SNV mutation sites can be calculated with this method.
  • the effective sequencing depth of samples is limited due to the influence of peripheral blood sampling volume and testing cost.
  • when the proportion of ctDNA is as low as 0.02% or less and the average effective depth is 10000X, there are on average only about 2 or fewer mutant signals per site, so the mutation signal of some somatic mutation sites may be hard to detect in peripheral blood data; taking into account the background level of the various mutations as well, it may be difficult to calculate the proportion of ctDNA directly.
  • the method of the present application can use the multi-site joint test method to determine whether ctDNA exists in the sample, and use the likelihood method to estimate the proportion of ctDNA in the sample.
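  • The motivation for the multi-site joint test can be illustrated with a small calculation. It uses the same simplification as the arithmetic above, namely that the mutant allele frequency at a site equals the ctDNA proportion, and additionally assumes Poisson-distributed read counts and independent sites; background mutations are ignored and the function names are hypothetical.

```python
import math


def expected_mutant_reads(ctdna_fraction: float, depth: float) -> float:
    """Expected number of mutant reads at one site."""
    return ctdna_fraction * depth


def prob_no_mutant_read(ctdna_fraction: float, depth: float) -> float:
    """P(X = 0) for X ~ Poisson(ctdna_fraction * depth) at a single site."""
    return math.exp(-expected_mutant_reads(ctdna_fraction, depth))


def prob_any_mutant_read(ctdna_fraction: float, depth: float, n_sites: int) -> float:
    """Probability that at least one of n_sites independent sites shows a mutant read."""
    return 1.0 - prob_no_mutant_read(ctdna_fraction, depth) ** n_sites
```

  • At a ctDNA proportion of 0.02% and depth 10000 the expected count is about 2 reads per site, so an individual site shows no mutant read with probability exp(-2), whereas pooling many sites makes a complete absence of signal very unlikely.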
  • let $\theta$ be the proportion of ctDNA in the sample, let the frequency of the i-th somatic mutation in the tumor tissue sample be $q_i$, and let the background frequency of the corresponding mutation in the detection sample be $p_i$; the expected frequency $f_i$ of the somatic mutation in the detection sample is then determined by $\theta$, $q_i$ and $p_i$.
  • the joint log-likelihood over the sites is $l(\theta) = \sum_{i} w_i \,\ln l_i(\theta)$, where $w_i$ is the weight of the i-th somatic mutation to be tested; the value of the weight can be set according to the type and reliability of the mutation.
  • $n_i$ represents the effective coverage depth of the i-th somatic mutation to be tested, $x_i$ is the target mutation depth of the i-th somatic mutation to be tested, and $l_i(\theta)$ is the posterior probability of the i-th somatic mutation to be tested: $l_i(\theta) = \frac{e^{-n_i f_i}\,(n_i f_i)^{x_i}}{x_i!}$
  • the present application also provides a method for detecting the presence and/or quantity of a variant nucleic acid. The method may include determining a set of somatic mutation sites based on the mutation priority of the somatic mutation sites in the variant sample to be detected; the set of somatic mutation sites can be used to detect the presence and/or quantity of variant nucleic acid, and the mutation priority from high to low can include: driver mutations, non-synonymous mutations other than driver mutations, and synonymous mutations.
  • the sample of the variant to be tested may be derived from a sample obtained from the subject prior to receiving treatment.
  • the treatment may comprise tumor treatment.
  • the somatic mutation site can be determined by comparing the variant sample to be detected with a negative sample.
  • non-synonymous mutations other than the driver mutation may be selected from the group consisting of alternative splicing mutations, insertions or deletions that do not cause a shift in the reading frame of the gene (in-frame INDEL), and insertions or deletions that cause a shift in the reading frame of the gene (out-of-frame INDEL).
  • the method may include sorting the somatic mutation sites according to mutation priority from high to low, wherein in the same mutation priority, the somatic mutation sites may be sorted according to mutation frequency from high to low.
  • the method may include selecting the five or more highest-ranked mutation sites as the set of somatic mutation sites.
  • the method of the present application may include selecting the highest-ranked 1 or more, 2 or more, 3 or more, 4 or more, 5 or more, 6 or more, 7 or more, 8 or more, 9 or more, 10 or more, 15 or more, 20 or more, 25 or more, 30 or more, 40 or more, 50 or more, or 100 or more mutation sites as the set of somatic mutation sites.
  • the method may further include determining a region to be tested in the sample to be tested based on the set of somatic mutation sites.
  • the method may further comprise determining nucleic acids that can bind to the region to be tested based on the set of somatic mutation sites.
  • the method of the present application may include designing probes for detecting the sample to be tested based on the set of somatic mutation sites.
  • the present application also provides an analysis device for detecting the presence and/or quantity of variant nucleic acid. The device is used for determining the presence and/or quantity of the variant nucleic acid based on the somatic mutation sites and background mutation sites in the region to be tested of the sample to be tested, wherein the background mutation sites can be determined by removing the somatic mutation sites from all mutation sites in the sample to be tested.
  • the analytical device of the method for detecting the presence and/or quantity of variant nucleic acids of the present application may comprise a module for performing the method for detecting the presence and/or quantity of variant nucleic acids described in the present application.
  • the present application also provides a method for establishing a database. The database includes a set of somatic mutation sites, and the method may include determining the set of somatic mutation sites based on the mutation priority of the somatic mutation sites in the variant sample to be tested; the set of somatic mutation sites is used to detect the presence and/or quantity of variant nucleic acids, and the mutation priority from high to low includes: driver mutations, non-synonymous mutations other than driver mutations, and synonymous mutations.
  • the method for establishing the database of the present application may include a method of determining a set of somatic mutation sites based on the mutation priority of the somatic mutation sites in the variant sample to be tested.
  • the present application also provides a device for establishing a database. The database includes a set of somatic mutation sites, and the device includes a determination module for determining the set of somatic mutation sites based on the mutation priority of the somatic mutation sites in the variant sample to be detected; the set of somatic mutation sites is used to detect the presence and/or quantity of variant nucleic acid, and the mutation priority from high to low includes: driver mutations, non-synonymous mutations other than driver mutations, and synonymous mutations.
  • the device for establishing a database in this application may include a module for executing the method for establishing a database in this application.
  • the present application also provides a database, which can be established according to the database establishment method of the present application.
  • the present application also provides a storage medium, which records a program capable of running the method described in the present application.
  • the non-transitory computer readable storage medium may include a floppy disk, a flexible disk, a hard disk, solid state storage (SSS) (such as a solid state drive (SSD), a solid state card (SSC) or a solid state module (SSM)), an enterprise-grade flash drive, tape, or any other non-transitory magnetic media, etc.
  • Non-transitory computer readable storage media may also include punched cards, paper tape, optical mark sheets (or any other physical media having a pattern of holes or other optically identifiable markings), compact disc read only memory (CD-ROM), Rewritable Disc (CD-RW), Digital Versatile Disc (DVD), Blu-ray Disc (BD) and/or any other non-transitory optical media.
  • the present application also provides a device, which includes the storage medium described in the present application.
  • the device of the present application may further include a processor coupled to the storage medium, and the processor is configured to execute based on a program stored in the storage medium to implement the method of the present application.
  • the present application also provides the method according to the present application, which can be used to detect and/or quantify circulating tumor DNA in a test sample obtained from a subject.
  • the method can be used to determine the presence and/or content of circulating tumor DNA in the test sample of the subject.
  • any one or more of the methods of the present application may be for non-diagnostic purposes.
  • any one or more of the methods of the present application may be for diagnostic purposes.
  • the present application also provides the method according to the present application, which can be used for the diagnosis, prevention and/or concomitant treatment of the disease or residual disease.
  • the present application also provides the method according to the present application, which can be used for prediction, selection and/or evaluation of disease treatment methods. For example, it can be determined or aided in determining the likelihood that a subject has cancer or has a recurrence of cancer that would benefit from anticancer therapy, including chemotherapy, immunotherapy, radiation therapy, surgery, or a combination thereof.
  • the method can be used in clinical practice by detecting the presence and/or content of circulating tumor DNA in the test sample (for example, it can be speculated whether certain specific tumor treatment methods are suitable for the subject) .
  • the presence and/or content of circulating tumor DNA in the test sample detected by the method can be used in clinical practice in combination with biomarkers known in the art.
  • the observed signal frequency of 2 indels is lower than that of 1 indel, and the observable signal is weak when there are 3 or more indels, as shown in Figure 2A-2B.
  • the proportion of ctDNA is low, such as 2e-4 or below.
  • the observable signal of InDel can be below 1e-5, so this type of mutation can be included in the MRD analysis, and accurate MRD detection can be achieved for samples with a ctDNA ratio above 1e-5.
  • This application selects 1 cell line to be tested and 1 background cell line as research materials, and dilutes them into 5 gradient dilution samples of 5e-03, 1e-03, 2e-04, 4e-05 and 8e-06 to simulate samples with different ctDNA ratios. From the cell line to be tested, 88 mutation sites different from the background cell line were selected to design probes, and were captured for sequencing. Finally, each diluted sample was sequenced three times, giving a total of 15 diluted samples. The input amount for each sample was 30 ng, and the average sequencing depth of the target region was 100,000X.
  • the parallel analysis of the two methods was carried out on the 15 sequencing samples in Example 2.
  • 5-60 sites were arbitrarily selected from 88 sites, and the number of repetitions was 50 times.
  • 5-60 negative sites were randomly selected from the 88 sites, and the number of repetitions was also 50 times, in order to evaluate the specificity of the detection.
  • when a sample P value ≤ 0.01 is selected as the threshold, the detection sensitivity of the method of this application can be better than that of the INVAR method when fewer than 40 sites are used, as shown in Table 2 and Figure 6.
  • the known INVAR method not only uses the sequencing information of 10 bp before and after the mutation site, but also uses the same target combination (panel) to capture and sequence the sequencing information of multiple samples. Therefore, in addition to the mutation site of the sample itself, the INVAR method also includes sequencing information of 10 bp before and after the mutation site of other samples sequenced at the same time, so the total number of optional mutation sites is relatively large. However, the method of this application is more suitable for the panel of a single sample, and can be applied to the detection environment with fewer total optional mutation sites.
  • this application selected a reference standard and a background cell line for dilution, and diluted them into seven gradients of 2.5e-3, 1.25e-3, 6.25e-4, 3.125e-4, 1.6e-4, 8e-5 and 4e-5 to simulate samples with different ctDNA proportions.
  • the standard contains a total of 28 effective mutations, including 8 INDEL mutations and 20 SNV mutations. The average sequencing depth of the diluted products is 60000X.
  • 8 INDELs, 8 SNVs (randomly selected), and 28 mutations were used to analyze the proportion of ctDNA, and the results are shown in Table 3.
  • SNV or INDEL can be accurately estimated at the dilution level of 1.6e-4 and above when analyzed separately, and the significance pvalue ⁇ 0.01 is satisfied, and at 8e-5 or below , INDEL calculation results can be better than SNV; when SNV and INDEL are used for combined analysis, the dilution gradient can be accurately calculated under all gradient dilutions of the experiment, and the significance pvalue ⁇ 0.01 is satisfied.
  • This application selected 5 cell lines to be tested and 1 background cell line as research materials. Each cell line to be tested was diluted with the background cell line to 5e-03, 1e-03, 2e-04, 4e-05, and 8e-06, giving 5 gradient dilution samples that simulate samples with different ctDNA proportions.
  • 40-100 unique germline mutations were selected to serve as somatic mutation sites, and the corresponding hybridization probes were designed for the subsequent experiments.
  • Three replicate experiments were carried out on each diluted sample and background sample, giving data for 90 samples in total.
  • The input amount for each sample was 30 ng, and the average sequencing depth of the target region was 100,000X.
  • Table 4 shows the estimated mixing ratio (proportion of simulated ctDNA) and the significance level of each sample.
  • Figures 7A-7E show the site-level detection results at a p-value threshold of 0.05.
  • The method of the present application can estimate the dilution level more accurately, and the sample significance p-value is low.
  • The estimated dilution level differs considerably from the actual one.
  • From the site-level analysis, as the dilution goes from 5e-03 to 4e-05 the sensitivity drops from 100% to about 15%, but in every case it remains significantly higher than (1 - specificity).
  • The sensitivity drops to about 5%, which is relatively close to (1 - specificity).
  • The detection method of this application can detect ctDNA at levels as low as about 4e-05, below the ctDNA detection limit of 2e-04 specified by the consensus, and provides supporting data for subsequent application in the detection of minimal residual disease.
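The site-subsampling evaluation described in the excerpts above (drawing 5-60 sites from the 88 designed sites and repeating the draw 50 times) can be sketched as follows. This is only a minimal illustration: `subsample_sites`, its parameters, the fixed seed, and the placeholder site names are assumptions, since the application does not state how the random draws were implemented.

```python
import random

def subsample_sites(sites, n_sites, n_repeats=50, seed=0):
    """Draw n_repeats random subsets of n_sites sites, each without replacement."""
    rng = random.Random(seed)  # fixed seed (an assumption) keeps the draws reproducible
    return [rng.sample(sites, n_sites) for _ in range(n_repeats)]

# Hypothetical identifiers standing in for the 88 designed sites.
all_sites = [f"site_{i}" for i in range(88)]

# One evaluation point: 50 random draws of 20 sites each.
draws = subsample_sites(all_sites, n_sites=20)
```

Each subset would then be passed to the detection method, and sensitivity or specificity summarized over the 50 repetitions for each subset size.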

Abstract

A method for detecting variant nucleic acids and the use thereof in the detection of samples. Specifically, the present invention relates to a method for detecting the presence and/or quantity of variant nucleic acids. The method comprises determining the presence and/or quantity of variant nucleic acids on the basis of a somatic mutation region and a background mutation region in a sample to be detected.

Description

Method for detecting variant nucleic acids
Technical field
The present application relates to the field of biomedicine, and in particular to a method for detecting variant nucleic acids.
Background
Detecting the presence and/or proportion of circulating tumor DNA (ctDNA) in peripheral blood is the main approach to detecting minimal residual disease (MRD). MRD refers to the small number of cancer cells that remain in the body after cancer treatment. It is a potential source of tumor recurrence and distant metastasis, and it has good prognostic value in many solid tumors, including lung cancer, colorectal cancer, and esophageal cancer. At present, MRD-positive or MRD-negative status is determined mainly by measuring the ctDNA content of postoperative peripheral blood. The first domestic "Consensus on the Detection and Clinical Application of MRD in Lung Cancer" specifies that, for MRD detection, the detection limit for ctDNA needs to be as low as 0.02%. There is therefore an urgent need in the art for a method that can accurately detect the presence and/or proportion of ctDNA.
Summary of the invention
In one aspect, the present application provides a method for detecting the presence and/or quantity of a variant nucleic acid, the method comprising determining the presence and/or quantity of the variant nucleic acid based on somatic mutation sites and background mutation sites of a region to be tested in a sample to be tested, wherein the background mutation sites are determined by removing the somatic mutation sites from all mutation sites of the sample to be tested.
In one aspect, the present application provides an analysis apparatus for a method of detecting the presence and/or quantity of a variant nucleic acid, the apparatus comprising a judgment module configured to determine the presence and/or quantity of the variant nucleic acid based on somatic mutation sites and background mutation sites of a region to be tested in a sample to be tested, wherein the background mutation sites are determined by removing the somatic mutation sites from all mutation sites of the sample to be tested.
In one aspect, the present application provides a storage medium on which is recorded a program capable of carrying out the method described in the present application.
In one aspect, the present application provides a device comprising the storage medium described in the present application.
In one aspect, the present application provides the method described in the present application for use in detecting and/or quantifying circulating tumor DNA in a test sample obtained from a subject.
The present application provides a method for detecting variant nucleic acids, for example a method for detecting the presence and/or quantity of a variant nucleic acid, the method comprising determining the presence and/or quantity of the variant nucleic acid based on somatic mutation sites and background mutation sites of a region to be tested in a sample to be tested, wherein the background mutation sites are determined by removing the somatic mutation sites from all mutation sites of the sample to be tested. The detection method of the present application can accurately evaluate the ctDNA proportion of a sample and the significance level of the sample's ctDNA.
Those skilled in the art can readily perceive other aspects and advantages of the present application from the detailed description below, which shows and describes only exemplary embodiments of the present application. As those skilled in the art will recognize, the content of the present application enables them to modify the specific embodiments disclosed without departing from the spirit and scope of the invention to which the present application relates. Accordingly, the drawings and the description in the present application are merely exemplary and not restrictive.
Description of the drawings
The specific features of the invention to which the present application relates are set forth in the appended claims. The features and advantages of the invention can be better understood by reference to the exemplary embodiments described in detail below and to the accompanying drawings, which are briefly described as follows:
Figures 1A-1B show the observable signal frequency for insertion or deletion of 1 repeat unit at different numbers of repeat-unit repetitions.
Figures 2A-2B show the observable signal frequency for insertion or deletion of 1, 2, or 3 repeat units at different numbers of repeat-unit repetitions.
Figures 3A-3B show the observable signal frequency for insertion or deletion of 1 repeat unit that is 1, 2, or 3 bases long at different numbers of repeat-unit repetitions.
Figure 4 shows the observable signal frequency for random insertion or deletion of 1 or 2 bases.
Figures 5A-5B show the estimated dilution ratio based on different numbers of sites. The abscissa is the number of sites, the ordinate is the estimated dilution ratio, and the dashed line indicates the experimental dilution ratio.
Figure 6 shows the detection sensitivity of the different detection methods.
Figures 7A-7E show the sensitivity and specificity of the method of the present application on diluted samples of different cell lines. Figure 7A shows the results for diluted samples of the H2009 cell line (human lung adenocarcinoma cells), based on 88 positive sites and 265 negative sites. Figure 7B shows the results for diluted samples of the HCC38 cell line (human breast ductal carcinoma cells), based on 41 positive sites and 312 negative sites. Figure 7C shows the results for diluted samples of the H1437 cell line (human non-small cell lung cancer cells), based on 48 positive sites and 305 negative sites. Figure 7D shows the results for diluted samples of the HCC1395 cell line (human breast cancer cells), based on 85 positive sites and 268 negative sites. Figure 7E shows the results for diluted samples of the H2126 cell line (human lung cancer cell line), based on 91 positive sites and 262 negative sites. On the abscissa, 05pct denotes the 5e-03 dilution, 01pct the 1e-03 dilution, 002pct the 2e-04 dilution, 0004pct the 4e-05 dilution, and 00008pct the 8e-06 dilution; the negative sample corresponds to a dilution of 0.
Detailed description of embodiments
The embodiments of the invention of the present application are described below by way of specific examples. Those familiar with this technology can readily understand other advantages and effects of the invention from the content disclosed in this specification.
Definition of terms
In the present application, the term "variant nucleic acid" generally refers to a nucleic acid fragment that has undergone mutation, such as insertion, addition, deletion, and/or substitution, at one or more positions of the nucleic acid sequence. For example, the variant nucleic acid may comprise variant nucleic acid derived from tumor tissue, such as ctDNA. For example, the variant nucleic acid may comprise variant nucleic acid derived from fetal tissue. For example, the variant nucleic acid may comprise variant nucleic acid derived from an allogeneic tissue or organ.
In the present application, the term "somatic mutation" generally refers to a class of acquired mutations that occur in non-embryonic cells. In the present application, the somatic mutation may include a genetic alteration that occurs in somatic tissue (for example, cells outside the germline). In the present application, the somatic mutation may include point mutations (for example, the exchange of a single nucleotide for another nucleotide, such as silent, missense, and nonsense mutations), insertions and deletions (for example, the addition and/or removal of one or more nucleotides, such as indels), amplifications, gene duplications, copy number alterations (CNAs), rearrangements, and splice variants. The somatic mutation may be closely related to the processes of cell growth, programming, senescence, and apoptosis. For example, the somatic mutation may be associated with altered signaling pathways in tumorigenesis, angiogenesis, and/or tumor metastasis.
In the present application, the term "background mutation" generally refers to a mutation in the sample to be tested that can be used as a background reference. For example, a background mutation may be a heritable mutation in the subject, such as a mutation present in both the normal tissue and the tumor tissue of the subject. For example, to determine background mutations more accurately, the method provided in the present application may remove, from all mutations of the sample to be tested, the somatic mutations detected in tumor tissue as well as the information corresponding to other sites that need to be excluded, so as to eliminate the influence of definite mutation sites or regions on the background calculation.
In the present application, the term "mutation site" generally refers to a site at which a nucleotide differs from the nucleotide sequence of a control sequence. For example, the control sequence may be a reference sequence used in gene sequencing (for example, a human reference genome). In the present application, the mutation site may include a nucleotide sequence difference at at least 1 site (for example, 1, 2, 3, 4, or more sites), where the difference may include nucleotide substitution, duplication, deletion, and/or addition. For example, the mutation site may include a nucleotide mutation at at least 1 nucleotide position. The nucleotide mutation may be a natural mutation or an artificial mutation. The mutation site may include a single nucleotide variation (SNV).
In the present application, the term "database" generally refers to an organized entity of related data, regardless of how the data or the organized entity is represented. For example, the organized entity of related data may take the form of a table, map, grid, group, datagram, file, document, list, or any other form. In the present application, the database may include any data collected and stored in a computer-accessible manner.
In the present application, the term "computing module" generally refers to a functional module used for computation. The computing module may calculate an output value or reach a conclusion or result from input values; for example, the computing module may be used mainly to calculate output values. The computing module may be tangible, such as the processor of an electronic computer, a computer or electronic device with a processor, or a computer network, or it may be a program, command line, or software package stored on an electronic medium.
In the present application, the term "processing module" generally refers to a functional module used for data processing. The processing module may process input values into statistically meaningful data, for example by classifying the input data. The processing module may be tangible, such as an electronic or magnetic medium for storing data, the processor of an electronic computer, a computer or electronic device with a processor, or a computer network, or it may be a program, command line, or software package stored on an electronic medium.
In the present application, the term "judgment module" generally refers to a functional module used to obtain the relevant judgment result. In the present application, the judgment module may calculate an output value or reach a conclusion or result from input values; for example, the judgment module may be used mainly to reach a conclusion or result. The judgment module may be tangible, such as the processor of an electronic computer, a computer or electronic device with a processor, or a computer network, or it may be a program, command line, or software package stored on an electronic medium.
In the present application, the term "sample obtaining module" generally refers to a functional module used to obtain the sample from a subject. For example, the sample obtaining module may include the reagents and/or instruments required to obtain the sample (for example, a tissue sample, blood sample, saliva, pleural effusion, peritoneal effusion, or cerebrospinal fluid), such as a blood collection needle, a blood collection tube, and/or a blood sample transport box. For example, the apparatus of the present application may contain no sample obtaining module, or one or more sample obtaining modules, and may optionally have the function of outputting the measured values of the sample described in the present application.
In the present application, the term "receiving module" generally refers to a functional module used to obtain the measured values in the sample. In the present application, the receiving module may take as input the sample described in the present application (for example, a tissue sample, blood sample, saliva, pleural effusion, peritoneal effusion, or cerebrospinal fluid), or the measured values of that sample. The receiving module may detect the state of the sample. For example, the data receiving module may optionally perform the gene sequencing described in the present application (for example, second-generation gene sequencing) on the sample. For example, the data receiving module may optionally include the reagents and/or instruments required for the gene sequencing. The data receiving module may optionally detect sequencing depth, sequencing read counts, or sequencing sequence information.
In the present application, the terms "second-generation gene sequencing", "high-throughput sequencing", and "next-generation sequencing" generally refer to second-generation high-throughput sequencing technology and the higher-throughput sequencing methods developed afterwards. Next-generation sequencing platforms include, but are not limited to, existing platforms such as Illumina. As sequencing technology continues to develop, those skilled in the art will understand that other sequencing methods and devices can also be used for the present method. For example, second-generation gene sequencing can offer high sensitivity, high throughput, high sequencing depth, or low cost. Classified by development history, influence, sequencing principle, and technology, the main types are: massively parallel signature sequencing (MPSS), polony sequencing, 454 pyrosequencing, Illumina (Solexa) sequencing, ion semiconductor sequencing, DNA nanoball sequencing, and Complete Genomics' DNA nanoarray and combinatorial probe-anchor ligation sequencing. Second-generation gene sequencing makes it possible to analyze the transcriptome and genome of a species comprehensively and in detail, so it is also called deep sequencing. For example, the method of the present application can equally be applied to first-generation gene sequencing, second-generation gene sequencing, third-generation gene sequencing, or single-molecule sequencing (SMS).
In the present application, the term "sample to be tested" generally refers to a sample that needs to be tested to determine whether variant nucleic acid is present in one or more gene regions of the sample. For example, the sample to be tested, or its data, may be stored in a memory before the test is performed.
In the present application, the term "human reference genome" generally refers to a human genome that can serve as a reference in gene sequencing. Information on the human reference genome can be obtained from UCSC (University of California, Santa Cruz). The human reference genome exists in different versions, for example hg19, GRCh37, or Ensembl 75.
In the present application, the term "sequencing depth" generally refers to the number of times a specific region (for example, a specific gene, a specific interval, or a specific base) is detected. Sequencing depth may refer to a base sequence detected by sequencing. For example, by aligning the sequencing reads to the human reference genome, and optionally removing duplicates, the number of sequencing reads at a specific gene, a specific interval, or a specific base position can be determined and counted as the sequencing depth. In some cases, sequencing depth can be correlated to sequencing depth. For example, sequencing depth can be affected by the mutation status of a gene.
In the present application, the term "sequencing data" generally refers to the data of the short sequences obtained by sequencing. For example, the sequencing data include the base sequences of the sequenced short sequences (sequencing reads), the number of sequencing reads, and the like.
In the present application, the term "significance test" generally refers to a way of judging whether the difference between a sample and a hypothesized distribution is significant. For example, a significance test can be used to judge whether the somatic mutations of the sample to be tested represent a significant difference.
In the present application, the term "T-test" generally refers to a statistical hypothesis test based on Student's t-distribution. For example, a T-test can be used to confirm that the somatic mutation of a target gene in the sample to be tested is significant.
In the present application, the term "comprising" generally means including the explicitly specified features without excluding other elements.
In the present application, the term "about" generally refers to variation within a range of 0.5%-10% above or below the specified value, for example within 0.5%, 1%, 1.5%, 2%, 2.5%, 3%, 3.5%, 4%, 4.5%, 5%, 5.5%, 6%, 6.5%, 7%, 7.5%, 8%, 8.5%, 9%, 9.5%, or 10% above or below the specified value.
Detailed description of the invention
In one aspect, the present application provides a method for detecting the presence and/or quantity of a variant nucleic acid. The method may comprise determining the presence and/or quantity of the variant nucleic acid based on somatic mutation sites and background mutation sites of a region to be tested in a sample to be tested, wherein the background mutation sites are determined by removing the somatic mutation sites from all mutation sites of the sample to be tested. For example, the region to be tested may be targeted and detected with a probe or a combination of probes. For example, the region to be tested may be selected based on tumor mutation regions known in the art, or based on the mutation regions obtained by sequencing tumor tissue. For example, the somatic mutation sites may be selected based on sequencing data from the subject's tumor sample. For example, the somatic mutation sites may be selected at random from the somatic mutations of the subject's tumor sample, or one or more higher-priority sites may be selected after ranking by somatic mutation frequency or a similar criterion. For example, 1 or more, 2 or more, 3 or more, 4 or more, 5 or more, 10 or more, 15 or more, 20 or more, 25 or more, 30 or more, 40 or more, 50 or more, 60 or more, or 100 or more sites may be selected as somatic mutation sites.
For example, the variant nucleic acid may be selected from the group consisting of circulating tumor nucleic acid, cell-free fetal nucleic acid (also called circulating fetal nucleic acid), and circulating nucleic acid derived from allogeneic organs and/or tissues. For example, the variant nucleic acid may be circulating tumor DNA.
In one aspect, the method provided in the present application may further comprise performing base error correction on all mutation sites of the sample to be tested. For example, the base errors may be corrected by means commonly used in the art.
For example, the base error correction may comprise correcting, based on a majority voting rule, the base type at each position of the sequencing reads derived from the same site, and thereby determining a consensus sequence. For example, the correction comprises setting the base quality to 0 at positions where the base type cannot be determined. For example, the correction of the present application may comprise correcting the sequencing reads of the sense strand and the antisense strand derived from the same site together, that is, retaining a single corrected consensus sequence after correcting the sense and antisense strands derived from the same nucleic acid fragment. For example, the correction of the present application may comprise correcting the sequencing reads of the sense strand and the antisense strand derived from the same site separately, that is, retaining two corrected consensus sequences, one for each strand, after correcting the sense and antisense strands derived from the same nucleic acid fragment.
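The majority-vote correction described above can be illustrated with a short sketch. It assumes that all reads in a family are already aligned and of equal length, that a strict majority (more than half of the reads) is required to call a base, that undetermined positions are written as "N", and that the consensus quality is the highest quality among the agreeing bases. The function name `consensus` and all of these choices are assumptions made for illustration; the application only states that undetermined positions receive a base quality of 0.

```python
from collections import Counter

def consensus(reads, quals):
    """Majority-vote consensus over equal-length reads from one source molecule.

    reads: list of base strings; quals: list of per-base quality lists.
    Positions without a strict majority get base 'N' and quality 0.
    """
    seq, qual = [], []
    for i in range(len(reads[0])):
        column = [r[i] for r in reads]
        base, count = Counter(column).most_common(1)[0]
        if count * 2 > len(column):  # strict majority (an assumption)
            seq.append(base)
            # Keep the best quality among reads supporting the called base.
            qual.append(max(q[i] for r, q in zip(reads, quals) if r[i] == base))
        else:
            seq.append("N")  # base type cannot be determined
            qual.append(0)   # base quality set to 0, as in the text
    return "".join(seq), qual

# Three reads agree except at position 2, where two of three carry 'G'.
majority_call = consensus(["ACGT", "ACGT", "ACTT"], [[30] * 4] * 3)
# Two reads disagree at position 1, so no majority exists there.
tie_call = consensus(["AC", "AG"], [[30, 30], [20, 25]])
```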
For example, the base error correction may further comprise correcting the base type at each position of the sense strand and the antisense strand derived from the same site, and retaining the respective consensus sequence of the sense strand and of the antisense strand. For example, the correction may comprise setting the base quality to 0 at positions where bases derived from the same site are inconsistent. For example, the correction may comprise deleting the base information at positions where bases derived from the same site are inconsistent. For example, the correction may comprise excluding from subsequent data analysis the base information at positions where bases derived from the same site are inconsistent.
For example, the sequencing reads derived from the same site may comprise sequencing reads that align to the same position in the human reference genome and carry the same unique molecular identifier (UMI). For example, the sequencing reads derived from the same site may comprise sequencing reads that align to substantially the same position in the human reference genome.
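As a rough sketch of how reads "derived from the same site" might be collected before correction, the example below groups reads by alignment position and UMI. The dictionary fields (`chrom`, `start`, `umi`, `seq`) and the function name `group_reads` are hypothetical, and real pipelines typically work on aligned BAM records rather than plain dictionaries.

```python
from collections import defaultdict

def group_reads(reads):
    """Group reads that share the same alignment position and the same UMI."""
    families = defaultdict(list)
    for read in reads:
        key = (read["chrom"], read["start"], read["umi"])
        families[key].append(read["seq"])
    return dict(families)

# Hypothetical reads: the first two share position and UMI, the third does not.
example_reads = [
    {"chrom": "chr1", "start": 100, "umi": "AAGT", "seq": "ACGT"},
    {"chrom": "chr1", "start": 100, "umi": "AAGT", "seq": "ACGA"},
    {"chrom": "chr1", "start": 100, "umi": "CCTG", "seq": "ACGT"},
]
read_families = group_reads(example_reads)
```

Each resulting family would then be corrected as a unit to give one consensus sequence.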
For example, the method may comprise determining the mutation sites of the sample to be tested based on the sites obtained after the base error correction. For example, the method of the present application may comprise selecting the mutation sites of the sample to be tested from among the sites obtained after the base error correction.
For example, the method described in the present application may further comprise obtaining the background mutation sites by removing high-frequency mutation sites from all mutation sites of the sample to be tested. For example, the background mutation sites of the present application may comprise the mutation sites that remain after the known somatic mutation sites of the tumor and the high-frequency mutation sites have been removed from the mutation sites of the sample to be tested.
For example, the high-frequency mutation sites may comprise sites with a mutation frequency of about 5e-03 or higher. For example, the threshold for high-frequency mutation sites may be adjusted according to factors such as sequencing accuracy and sample quality. For example, the high-frequency mutation sites may comprise sites with a mutation frequency of about 1e-03 or higher, 5e-03 or higher, 1e-02 or higher, 5e-02 or higher, 1e-01 or higher, or 5e-01 or higher.
For example, the method of the present application may comprise removing sequence information that fails quality control from the sequence information of the test sample. For example, sequence information that fails quality control may comprise sequence information identified as unqualified by sequencing quality control methods commonly used in the art. For example, sequence information that fails quality control may comprise sequence information of low-quality sequencing reads, sequence information of low-quality bases, and the like.
For example, the method may further comprise removing sequence information of low-quality sequencing reads from the sequence information of the test sample. For example, low-quality sequencing reads may comprise sequencing reads that are misaligned or difficult to align. For example, a low-quality sequencing read may be a sequencing read for which, when it is aligned to a position in the human reference genome, the probability that the aligned position is correct is low. For example, the low-quality sequencing reads may comprise sequencing reads with an alignment quality of less than 60. For example, for sequencing reads that are misaligned or difficult to align, the sequence information of the read may be excluded from the sequence information of the aligned position. For example, the sequencing quality of sequencing reads may be confirmed by the sequencing instrument and by quality control methods commonly used in the art. For example, the low-quality sequencing reads may further comprise sequencing reads containing 8 or more base mismatches.
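The read-level filter described above can be sketched as follows. This is a minimal illustration only: the `Read` record and the names `is_low_quality`, `filter_reads`, `min_mapq`, and `max_mismatches` are hypothetical, and the two thresholds (alignment quality below 60, 8 or more mismatches) are the example values given in the text rather than fixed requirements.

```python
from dataclasses import dataclass


@dataclass
class Read:
    """Hypothetical minimal read record used only for this sketch."""
    name: str
    mapq: int        # alignment (mapping) quality reported by the aligner
    mismatches: int  # number of mismatched bases relative to the reference


def is_low_quality(read, min_mapq=60, max_mismatches=7):
    """Return True if the read falls under either example criterion in the text."""
    return read.mapq < min_mapq or read.mismatches > max_mismatches


def filter_reads(reads, **thresholds):
    """Keep only reads that are not flagged as low quality."""
    return [r for r in reads if not is_low_quality(r, **thresholds)]


reads = [
    Read("r1", mapq=60, mismatches=0),
    Read("r2", mapq=59, mismatches=0),  # removed: alignment quality below 60
    Read("r3", mapq=60, mismatches=8),  # removed: 8 or more mismatches
    Read("r4", mapq=60, mismatches=7),
]
kept = [r.name for r in filter_reads(reads)]
```

Both thresholds are keyword arguments so that they can be tuned to the sequencing platform and sample quality, as the text allows.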
For example, the method may further comprise removing sequence information of low-quality bases from the sequence information of the test sample. For example, sequence information of bases whose corrected base quality is low is removed. For example, the low-quality bases may comprise bases whose corrected base quality is less than 20. For example, for a base with a base quality of 20, the sequencing accuracy may be 99.99% or higher.
For example, the method of the present application may further comprise determining a mutation frequency selected from the group consisting of the somatic mutation frequency of the somatic mutation sites and the background mutation frequency of the background mutation sites, for assessing the significance level of a mutation at a site. For example, the mutation frequency derived from somatic mutation sites may be the somatic mutation frequency. For example, the mutation frequency derived from background mutation sites may be the background mutation frequency.
For example, in the method of the present application, the mutation frequency may comprise a multimer mutation frequency and/or an insertion or deletion (INDEL) mutation frequency. For example, the model used to calculate the mutation frequency may be the multimer mutation frequency of the sequencing data. For example, the model used to calculate the mutation frequency may be the INDEL mutation frequency of the sequencing data. For example, INDEL may denote an insertion or a deletion.
For example, the multimer mutation frequency may comprise the frequency at which the base at a specific position within a specific run of consecutive bases mutates to another base. For example, the single-base mutation frequency may comprise the frequency at which a single base mutates. For example, the multimer mutation frequency may comprise the frequency at which the base at the middle position of a run of consecutive bases mutates.
For example, the run of consecutive bases may comprise 2 or more consecutive bases. For example, the run of consecutive bases may comprise 2 or more, 3 or more, 5 or more, 7 or more, or 9 or more consecutive bases. For example, the run of consecutive bases may comprise 3 or 5 consecutive bases.
For example, the multimer mutation frequency may comprise the frequency at which the base at the second position of a specific run of consecutive bases mutates to another specific base. For example, the trimer mutation frequency concerns the frequency at which the second base mutates to another base in the context of a specific first base and a specific third base.
For example, the INDEL mutation frequency may comprise members of the following group: the random INDEL mutation frequency and the base-repeat-region INDEL mutation frequency.
For example, the INDEL mutation frequency may comprise the random INDEL mutation frequency. For example, the random INDEL mutation frequency may comprise the frequency of insertion or deletion of one or more bases. For example, the random INDEL mutation frequency may comprise the frequency of insertion or deletion of one or more bases after one or more specific bases. For example, the random INDEL mutation frequency may comprise the frequency of insertion or deletion of one or more bases after one specific base; for example, the random INDEL mutation frequency may comprise the frequency of insertion or deletion of one or more bases after two or more specific bases.
For example, when a single base is inserted or deleted, the random INDEL mutation frequency may comprise the frequency at which one specific base is inserted or deleted after one specific base. For example, when 2 or more bases are inserted or deleted, the random INDEL mutation frequency may comprise the frequency at which a run of bases of a specific length is inserted or deleted after one specific base. For example, when 2 or more bases are inserted or deleted, the particular combination of inserted or deleted bases may be disregarded, and only the frequency at which a run of a specific length is inserted or deleted after one or more specific bases may be considered.
For example, the INDEL mutation frequency may comprise the base-repeat-region INDEL mutation frequency. For example, the base-repeat-region INDEL mutation frequency may comprise the frequency of insertion or deletion of one or more base repeat units (Unit), the Unit having a length of 1 or more.
For example, the base-repeat-region INDEL mutation frequency may comprise the frequency of insertion or deletion of one or more base repeat units (Unit), the Unit having a length of 2 or more. For example, the Unit length is 2 bases or more, 3 bases or more, 4 bases or more, 5 bases or more, 6 bases or more, 7 bases or more, 8 bases or more, 9 bases or more, or 10 bases or more.
For example, the base-repeat-region INDEL mutation frequency may comprise the frequency at which a specific number of Units is inserted or deleted in sequences having the same Unit length and the same number of Unit repeats. For example, when the Unit length is 2 or more, the particular base composition of the Unit may be disregarded, and only the frequency at which one or more Units are inserted or deleted in a run with a specific number of repeats may be considered. For example, determining the base-repeat-region INDEL mutation frequency may comprise determining the frequency at which a specific number of Units is inserted or deleted in sequences having the same Unit length and the same number of Unit repeats, where the Unit may be any sequence. For example, Units of any base composition may in that case be pooled for the calculation.
For example, the method of the present application may further comprise determining the presence of variant nucleic acid in the test sample and/or the significance level of a mutation at the somatic mutation site. For example, somatic mutation sites with a significant mutation may be used to assess the presence of variant nucleic acid. For example, the fraction of variant nucleic acid may be assessed using only the data from somatic mutation sites with a significant mutation.
For example, the method may comprise measuring the significance level by determining the cumulative probability obtained when the somatic mutation site is treated as a background mutation. For example, a candidate somatic mutation site may be assumed to carry a background mutation, and the cumulative probability under that assumption may be evaluated. For example, the cumulative probability may be used to express the significance level.
For example, the method may comprise determining the cumulative probability based on a Poisson distribution or a binomial distribution. For example, the method may comprise determining the cumulative probability based on a binomial distribution. For example, the method may comprise determining the cumulative probability based on a Poisson distribution.
For example, the method may comprise determining the cumulative probability based on the following formula:
Figure PCTCN2022070974-appb-000001
where P denotes the cumulative probability, k is summed from 0 to x-1, x denotes the coverage depth of the mutated sequence at the somatic mutation site, n denotes the total coverage depth of the somatic mutation site, p denotes the background mutation frequency of the somatic mutation site, and e denotes the base of the natural logarithm.
For example, the method may comprise determining the cumulative probability based on the following formula:
Figure PCTCN2022070974-appb-000002
where P denotes the cumulative probability, k is summed from 0 to x-1, x denotes the coverage depth of the mutated sequence at the somatic mutation site, n denotes the total coverage depth of the somatic mutation site, and p denotes the background mutation frequency of the somatic mutation site.
For example, the method may comprise determining that variant nucleic acid is present when the cumulative probability is less than a significance threshold. For example, the significance threshold may be adjusted by a person skilled in the art according to the accuracy of the sequencing instrument and the quality of the test sample. For example, the significance threshold is 0.05 or less. For example, the significance threshold is 0.05 or less, 0.01 or less, 0.005 or less, 0.001 or less, 0.0005 or less, or 0.0001 or less.
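A minimal sketch of the site-level significance test follows. The two formula images are not reproduced in the text, so the forms used here are assumptions: P is taken as the upper-tail probability, that is 1 minus the sum over k from 0 to x-1 of the Poisson (mean n·p) or binomial (n trials, probability p) probability of k mutant reads. This is the reading that is consistent with calling a variant when P falls below the threshold. The function names are hypothetical.

```python
import math


def poisson_tail(x, n, p):
    """Assumed form: P(X >= x) for X ~ Poisson(n * p)."""
    lam = n * p
    term = math.exp(-lam)  # Poisson probability of k = 0
    cdf = 0.0
    for k in range(x):     # accumulate k = 0 .. x-1
        cdf += term
        term *= lam / (k + 1)
    return max(0.0, 1.0 - cdf)


def binomial_tail(x, n, p):
    """Assumed form: P(X >= x) for X ~ Binomial(n, p), with 0 < p < 1."""
    cdf = 0.0
    for k in range(x):     # accumulate k = 0 .. x-1
        log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                   + k * math.log(p) + (n - k) * math.log1p(-p))
        cdf += math.exp(log_pmf)
    return max(0.0, 1.0 - cdf)


def is_significant(x, n, p, threshold=0.05, model="poisson"):
    """Call the site mutated when the tail probability is below the threshold."""
    tail = poisson_tail if model == "poisson" else binomial_tail
    return tail(x, n, p) < threshold


# Illustrative numbers: 5 mutant reads at depth 10000 with background 1e-4.
p_pois = poisson_tail(5, 10000, 1e-4)
p_binom = binomial_tail(5, 10000, 1e-4)
called = is_significant(5, 10000, 1e-4)
```

The binomial terms are computed in log space so that large depths do not overflow; for small n·p the two models give nearly identical tail probabilities.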
For example, the method of the present application may further comprise determining the presence and/or amount of variant nucleic acid in the test sample. For example, the method of the present application may be used to accurately determine the fraction of variant nucleic acid, such as ctDNA, in the test sample. For example, the method of the present application for determining the fraction of variant nucleic acid and/or the significance level of that fraction may be based on the data from somatic mutation sites and background mutation sites.
For example, the method may comprise determining the presence and/or amount of variant nucleic acid in the test sample by a likelihood estimation algorithm. For example, the method may comprise determining the presence and/or amount of variant nucleic acid in the test sample by a likelihood estimation algorithm based on a Poisson distribution or a binomial distribution. For example, the amount of variant nucleic acid may comprise the fraction of circulating tumor DNA (ctDNA) in the total DNA of the test sample. For example, the maximum likelihood estimate of the ctDNA fraction π is determined by a maximum likelihood estimation algorithm.
For example, the method may comprise determining the maximum likelihood estimate $\hat{\pi}$ of the ctDNA fraction π, where the function l(π) defined below attains its maximum when π takes the value $\hat{\pi}$; as is well known in the art, ln(x) denotes the natural logarithm of x (base e):
$l(\pi)=\sum_i w_i \ln l_i(\pi; x_i, n_i, p_i, q_i)$
where $w_i$ is the weight of the i-th somatic mutation site, and $l_i(\pi; x_i, n_i, p_i, q_i)$ is calculated by the following formula:
Figure PCTCN2022070974-appb-000005
where $f_i$ is calculated by the following formula:
$f_i=\pi q_i+(1-\pi)p_i$
$n_i$ denotes the total coverage depth of the i-th somatic mutation site, $x_i$ denotes the coverage depth of the mutated sequence at the i-th somatic mutation site, $q_i$ denotes the mutation frequency of the i-th somatic mutation site in the tumor tissue sample, $p_i$ denotes the background mutation frequency of the corresponding mutation in the test sample, and e denotes the base of the natural logarithm.
For example, the method may comprise determining the maximum likelihood estimate $\hat{\pi}$ of the ctDNA fraction π, where the function l(π) defined below attains its maximum when π takes the value $\hat{\pi}$:
$l(\pi)=\sum_i w_i \ln l_i(\pi; x_i, n_i, p_i, q_i)$
where $w_i$ is the weight of the i-th somatic mutation site, and $l_i(\pi; x_i, n_i, p_i, q_i)$ is calculated by the following formula:
Figure PCTCN2022070974-appb-000008
where $f_i$ is calculated by the following formula:
$f_i=\pi q_i+(1-\pi)p_i$
$n_i$ denotes the total coverage depth of the i-th somatic mutation site, $x_i$ denotes the coverage depth of the mutated sequence at the i-th somatic mutation site, $q_i$ denotes the mutation frequency of the i-th somatic mutation site in the tumor tissue sample, and $p_i$ denotes the background mutation frequency of the corresponding mutation in the test sample.
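The estimation of the ctDNA fraction can be sketched as follows. Because the per-site formula images are not reproduced in the text, this sketch assumes that l_i is the Poisson (mean n_i·f_i) or binomial (n_i trials, probability f_i) probability of observing x_i mutant reads, with f_i = π·q_i + (1-π)·p_i as stated. The grid search, the function names, and the example numbers are illustrative choices, not the applicant's implementation.

```python
import math


def log_lik_site(pi, x, n, p, q, model="poisson"):
    """ln l_i under the assumed Poisson or binomial form; requires 0 < p, q < 1."""
    f = pi * q + (1.0 - pi) * p  # expected mutant-read fraction f_i
    if model == "poisson":
        lam = n * f
        return -lam + x * math.log(lam) - math.lgamma(x + 1)
    return (math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
            + x * math.log(f) + (n - x) * math.log1p(-f))


def log_lik(pi, sites, model="poisson"):
    """Weighted log-likelihood l(pi); each site is a tuple (x, n, p, q, w)."""
    return sum(w * log_lik_site(pi, x, n, p, q, model) for x, n, p, q, w in sites)


def estimate_pi(sites, model="poisson", steps=10000):
    """Grid search over [0, 1] for the value of pi that maximizes l(pi)."""
    grid = (i / steps for i in range(steps + 1))
    return max(grid, key=lambda pi: log_lik(pi, sites, model))


# Illustrative sites: (mutant reads, depth, background freq, tumor freq, weight).
sites = [(51, 10000, 1e-4, 0.5, 1.0), (26, 10000, 1e-4, 0.25, 1.0)]
pi_poisson = estimate_pi(sites, "poisson")
pi_binomial = estimate_pi(sites, "binomial")
```

Since f_i is linear in π, l(π) is concave under both assumed models, so a simple grid (or any one-dimensional optimizer) is sufficient; the grid resolution bounds the precision of the estimate.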
For example, the method may comprise determining the significance level of the maximum likelihood estimate of the ctDNA fraction π. For example, the method may comprise determining the significance level by a likelihood ratio test. For example, the method may comprise determining the significance level by a likelihood ratio test based on the chi-square distribution. For example, in the method of the present application, the significance level is determined from the value of the likelihood ratio statistic (Figure PCTCN2022070974-appb-000009) and the probability density function of the chi-square distribution with 1 degree of freedom.
For example, the method may comprise determining the value of the likelihood ratio statistic (Figure PCTCN2022070974-appb-000010) by the following formula:
Figure PCTCN2022070974-appb-000011
where $l(\pi)=\sum_i w_i \ln l_i(\pi; x_i, n_i, p_i, q_i)$
where $w_i$ is the weight of the i-th somatic mutation site, the value of $w_i$ being determined according to the somatic mutation frequency or the sequencing coverage depth of the i-th somatic mutation site, and $l_i(\pi; x_i, n_i, p_i, q_i)$ is calculated by the following formula:
Figure PCTCN2022070974-appb-000012
where $f_i$ is calculated by the following formula:
$f_i=\pi q_i+(1-\pi)p_i$
$n_i$ denotes the total coverage depth of the i-th somatic mutation site, $x_i$ denotes the coverage depth of the mutated sequence at the i-th somatic mutation site, $q_i$ denotes the mutation frequency of the i-th somatic mutation site in the tumor tissue sample, $p_i$ denotes the background mutation frequency of the corresponding mutation in the test sample, and e denotes the base of the natural logarithm.
For example, the method comprises determining the value of the likelihood ratio statistic (Figure PCTCN2022070974-appb-000013) by the following formula:
Figure PCTCN2022070974-appb-000014
where $l(\pi)=\sum_i w_i \ln l_i(\pi; x_i, n_i, p_i, q_i)$
where $w_i$ is the weight of the i-th somatic mutation site, the value of $w_i$ being determined according to the somatic mutation frequency or the sequencing coverage depth of the i-th somatic mutation site, and $l_i(\pi; x_i, n_i, p_i, q_i)$ is calculated by the following formula:
Figure PCTCN2022070974-appb-000015
where $f_i$ is calculated by the following formula:
$f_i=\pi q_i+(1-\pi)p_i$
$n_i$ denotes the total coverage depth of the i-th somatic mutation site, $x_i$ denotes the coverage depth of the mutated sequence at the i-th somatic mutation site, $q_i$ denotes the mutation frequency of the i-th somatic mutation site in the tumor tissue sample, $p_i$ denotes the background mutation frequency of the corresponding mutation in the test sample, and e denotes the base of the natural logarithm. For example, every $w_i$ may take the same value; for example, every $w_i$ may be 1. For example, a person skilled in the art may adjust the specific value of $w_i$ between 0 and 1 according to the actual importance of the i-th somatic mutation site, for example its mutation frequency or sequencing coverage depth.
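A minimal sketch of the likelihood ratio test follows. The formula image for the statistic is not reproduced in the text, so this sketch assumes the standard form 2·[l(π̂) - l(0)], comparing the fitted ctDNA fraction against the null hypothesis π = 0, and a Poisson per-site likelihood. The p-value uses the exact identity for one degree of freedom, P(χ² > s) = erfc(√(s/2)). Function names and example values are hypothetical.

```python
import math


def log_lik(pi, sites):
    """Weighted Poisson log-likelihood l(pi); each site is (x, n, p, q, w), p > 0."""
    total = 0.0
    for x, n, p, q, w in sites:
        lam = n * (pi * q + (1.0 - pi) * p)  # n_i * f_i
        total += w * (-lam + x * math.log(lam) - math.lgamma(x + 1))
    return total


def lrt_p_value(pi_hat, sites):
    """Return (statistic, p-value) for the assumed null hypothesis pi = 0."""
    stat = 2.0 * (log_lik(pi_hat, sites) - log_lik(0.0, sites))
    stat = max(stat, 0.0)  # guard against tiny negative rounding error
    # Survival function of the chi-square distribution with 1 degree of freedom.
    return stat, math.erfc(math.sqrt(stat / 2.0))


sites = [(51, 10000, 1e-4, 0.5, 1.0), (26, 10000, 1e-4, 0.25, 1.0)]
stat, p_value = lrt_p_value(0.01, sites)      # pi_hat taken as 0.01 for illustration
stat_null, p_null = lrt_p_value(0.0, sites)   # degenerate case: pi_hat equals the null
```

A sample would then be reported as ctDNA-positive when `p_value` falls below the chosen significance threshold.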
In one aspect, the present application provides a method for detecting the presence and/or amount of variant nucleic acid, the method comprising determining the presence and/or amount of the variant nucleic acid based on the somatic mutation sites and the background mutation sites of the region to be tested in the test sample, wherein the background mutation sites are determined by removing the somatic mutation sites from all mutation sites in the test sample; optionally, base error correction may be performed on all mutation sites in the test sample; optionally, the background mutation sites of the present application may comprise the mutation sites that remain after known tumor somatic mutation sites and high-frequency mutation sites are removed from the mutation sites of the test sample; optionally, sequence information that fails quality control may be removed from the sequence information of the test sample; optionally, the type of mutation frequency evaluated in the present application may be selected from the single-base mutation frequency, the multimer mutation frequency, and the insertion or deletion (INDEL) mutation frequency; optionally, the cumulative probability obtained when the somatic mutation site is treated as a background mutation may be determined based on a Poisson distribution or a binomial distribution; optionally, a likelihood estimation algorithm based on a Poisson distribution or a binomial distribution may be used to estimate the presence and/or amount of variant nucleic acid in the test sample and to determine the significance level of the estimated fraction of variant nucleic acid.
For example, the minimal residual disease (MRD) detection method of the present application (PROPHET) may determine whether a sample is MRD-positive or MRD-negative by analyzing tumor somatic mutations in next-generation sequencing data generated by an amplicon method or a hybridization capture method, and may be regarded as a tumor-informed assay strategy.
The detection method of the present application may use known tumor somatic mutation information, which may for example be obtained from tumor tissue, to detect tumor somatic mutations in peripheral blood. Specifically: 1) perform whole-exome sequencing on the tumor tissue sample and the paired sample; 2) align the sequencing results to the human reference genome using commonly used alignment software such as bwa; 3) detect somatic mutations in the tumor tissue using commonly used somatic mutation analysis software such as mutect2; 4) rank the somatic mutations by priority and select a certain number of mutations based on that priority; 5) design hybridization capture probes based on the selected mutations for subsequent testing of peripheral blood samples. Before the somatic mutations are ranked, mutations in highly repetitive regions, high-GC regions, and regions homologous to sequences at other positions may be filtered out to reduce the difficulty of hybridization capture. The priority of somatic mutations, from high to low, may be: 1) driver mutations; 2) mutations that change the amino acid sequence, including non-synonymous mutations, alternative-splicing mutations, in-frame/out-of-frame InDels, and the like; 3) synonymous mutations. Within each of these three classes, mutations are ordered by mutation frequency from high to low.
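The ranking in step 4) can be sketched as a two-key sort: first by the three priority classes, then by mutation frequency in descending order. The `Mutation` record, the class labels, and the `max_sites` parameter are hypothetical names introduced only for this illustration; the region filters (repeats, high GC, homology) are assumed to have been applied beforehand.

```python
from dataclasses import dataclass

# Lower value = higher priority, following the three classes named in the text.
TIER = {"driver": 0, "protein_altering": 1, "synonymous": 2}


@dataclass
class Mutation:
    """Hypothetical record for a tumor somatic mutation."""
    name: str
    tier: str   # one of the keys of TIER
    vaf: float  # mutation frequency in the tumor tissue


def prioritize(mutations, max_sites):
    """Sort by class priority, then by descending frequency, and keep the top sites."""
    ranked = sorted(mutations, key=lambda m: (TIER[m.tier], -m.vaf))
    return ranked[:max_sites]


candidates = [
    Mutation("m1", "synonymous", 0.40),
    Mutation("m2", "driver", 0.10),
    Mutation("m3", "protein_altering", 0.30),
    Mutation("m4", "protein_altering", 0.35),
    Mutation("m5", "driver", 0.25),
]
selected = [m.name for m in prioritize(candidates, max_sites=4)]
```

Here a low-frequency driver mutation still outranks a high-frequency synonymous one, which matches the class-first ordering described above.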
The analysis of the present application may specifically comprise the following steps: 1) data preparation, including correcting base errors based on unique molecular identifiers (UMIs) and aligning the corrected reads to the human genome; 2) calculating the sample-specific background level from the read alignment results; 3) calculating the mutation rate at each somatic mutation site to be tested; 4) for each somatic mutation site to be tested, assessing against the background level the significance level of it being a true mutation; 5) based on all selected somatic mutation sites and the background level, estimating the ctDNA fraction of the sample and the significance level of the ctDNA in the sample.
Data preparation
Because the base error rate of library construction and sequencing can be at the 1e-03 level while MRD detection requires detection at the 1e-04 level, base correction may optionally be performed using UMIs, or methods commonly used in the art may be applied, for example choosing a more accurate library construction and sequencing method to reduce base errors. The data preparation step may produce a sequence alignment file in BAM format after UMI deduplication and base correction. The principle of UMI base correction is to use the sequencing reads of multiple PCR products derived from the same molecule to correct base errors introduced during library construction and sequencing. The specific steps may be: 1) align the sequencing reads to the human reference genome using the commonly used next-generation sequencing aligner bwa (version 0.7.10); 2) using the alignment information and the UMI information, treat all reads that align to the same genomic position and carry the same UMI as reads derived from the same molecule, group them into one unit, and retain the units whose number of reads exceeds a certain threshold; 3) determine the base at each position within a unit by majority vote, producing one consensus read that represents the unit; 4) align the consensus reads to the genome to generate a BAM file.
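Steps 2) and 3) above can be sketched as follows. This is a simplified illustration that works on plain tuples rather than BAM records; the function name `consensus_reads`, the tuple layout, the `min_family_size` parameter, and the choice to emit `N` on a tied vote are assumptions made for the sketch.

```python
from collections import Counter, defaultdict


def consensus_reads(reads, min_family_size=2):
    """Group reads by (chrom, pos, umi) and build one majority-vote consensus per group.

    reads: iterable of (chrom, pos, umi, sequence) tuples.
    Groups with fewer than min_family_size reads are dropped.
    """
    families = defaultdict(list)
    for chrom, pos, umi, seq in reads:
        families[(chrom, pos, umi)].append(seq)

    result = {}
    for key, seqs in families.items():
        if len(seqs) < min_family_size:
            continue
        length = min(len(s) for s in seqs)
        bases = []
        for i in range(length):
            (top, top_n), *rest = Counter(s[i] for s in seqs).most_common()
            # A tie between the two most frequent bases cannot be resolved: emit N.
            bases.append("N" if rest and rest[0][1] == top_n else top)
        result[key] = "".join(bases)
    return result


reads = [
    ("chr1", 100, "AAC", "ACGT"),
    ("chr1", 100, "AAC", "ACGT"),
    ("chr1", 100, "AAC", "ACTT"),  # one read with a sequencing error at position 2
    ("chr1", 100, "GGT", "ACGT"),  # singleton family, dropped
]
consensus = consensus_reads(reads)
```

In a real pipeline the consensus reads would then be realigned and written to a BAM file, as step 4) describes.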
To use the sequence information derived from the same molecule for base error correction, the present application may optionally use a duplex library construction method with UMIs at both ends when constructing the hybridization library. Duplex UMI library construction can distinguish molecules derived from the two strands of double-stranded DNA, and this information can be used for mutual correction during subsequent base correction. During base error correction, reads derived from the same DNA strand may first be corrected by majority vote based on the UMI and alignment position; bases that cannot be determined are set to N with a quality of 0, and the quality of the other bases may be set to the maximum value, generating single-strand consensus sequences (SSCSs). The sequences derived from the two strands of the same DNA molecule are then corrected against each other: the quality of bases that disagree between the two strands may be set to 0, while both SSCSs may be retained. Because ctDNA molecules are only about 164 bp long while sequencing reads are typically about 150 bp, the method of the present application may optionally perform a further correction using the overlapping portion of reads R1 and R2 derived from the same DNA strand, setting the quality of bases that disagree between R1 and R2 to 0. The method provided by the present application may optionally distinguish reads derived from the two strands of the same DNA molecule, which avoids losing this correction information during subsequent base correction.
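The duplex correction step can be sketched as a position-wise comparison of the two single-strand consensus sequences. The sketch assumes that both SSCSs have already been expressed in the same orientation and cover the same positions; the function name and the `(sequence, qualities)` tuple layout are hypothetical.

```python
def duplex_correct(sscs_a, sscs_b):
    """Set base quality to 0 where the two strand consensuses disagree.

    Each SSCS is a (sequence, qualities) pair. Both are returned, so neither
    strand is discarded; the inputs are left unmodified.
    """
    seq_a, qual_a = sscs_a
    seq_b, qual_b = sscs_b
    qual_a, qual_b = list(qual_a), list(qual_b)
    for i, (a, b) in enumerate(zip(seq_a, seq_b)):
        if a != b:
            qual_a[i] = 0
            qual_b[i] = 0
    return (seq_a, qual_a), (seq_b, qual_b)


top = ("ACGT", [60, 60, 60, 60])
bottom = ("ACTT", [60, 60, 60, 60])  # disagrees with the top strand at position 2
fixed_top, fixed_bottom = duplex_correct(top, bottom)
```

The same comparison could be applied to the overlapping portion of R1 and R2 from one strand, which the text describes as an optional second correction.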
Sample-specific SNV background
The sample-specific background is obtained from the alignment information in the BAM file for the sequencing target region: the mutation frequencies of various oligomers may optionally be calculated and used as sample-specific background frequencies. For trimers, for example, the calculation considers whether the second (middle) base has changed while the two flanking bases are held fixed. Suppose the trimer formed by a position in the target region and one base on each side is AGC, and the alignments at that position contain 4 ACC, 6 ATC, 10 AAC and 99,980 AGC (100,000 in total). The AGC->ACC trimer conversion frequency is then 4e-05, the AGC->ATC frequency is 6e-05, and the AGC->AAC frequency is 1e-04. Oligomers of other lengths can be used in place of trimers, with an analogous calculation. The specific steps of the sample-specific background calculation are: 1) for all sites in the sequencing target region, count the occurrences of each trimer observed at the site; 2) remove the trimer information for all somatic mutation sites and any other sites that need to be excluded, so that known mutation sites or regions do not affect the background calculation; 3) optionally remove all trimer information for sites whose mutation frequency exceeds a given threshold, such as 5e-03, so that other potential mutations do not interfere with the background calculation; 4) pool the trimer information of the remaining sites and calculate the frequency of each mutation by trimer mutation type, which serves as the sample-specific background mutation level. To exclude the influence of alignment errors and low-quality bases on the background noise estimate, reads with a mapping quality below 60 or with 8 or more mismatched bases may optionally be filtered out, and trimers with low base quality may optionally be discarded. The sample-specific background calculation uses the sequencing data of the sample under analysis itself and does not rely on other normal samples or other samples from the same batch as controls, which helps exclude background fluctuations caused by inter-sample or batch factors. In addition, the calculation uses all information in the sequencing target region and pools observations of the same trimer from different positions, which addresses the inaccurate background estimates that result from insufficient data.
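Steps 1) to 4) can be sketched as follows. This is a minimal illustration under stated assumptions, not the application's code: the input layout (`site_counts`), the function name, and the choice to compare the largest single alternative-trimer frequency at a site against the threshold are all hypothetical, and read-level filtering (mapping quality, mismatches, base quality) is assumed to have been applied before counting.

```python
from collections import defaultdict


def trimer_background(site_counts, excluded_sites=frozenset(), max_site_freq=5e-03):
    """Pool per-site trimer counts into sample-specific background frequencies.

    site_counts maps a site id to (reference_trimer, {observed_trimer: count}),
    where observed trimers differ from the reference only at the middle base.
    Returns {(reference_trimer, alternative_trimer): frequency}.
    """
    depth_by_ref = defaultdict(int)   # reference trimer -> pooled depth
    alt_counts = defaultdict(int)     # (reference, alternative) -> pooled count
    for site, (ref, observed) in site_counts.items():
        if site in excluded_sites:                 # step 2: known somatic sites etc.
            continue
        depth = sum(observed.values())
        if depth == 0:
            continue
        max_alt = max((c for t, c in observed.items() if t != ref), default=0)
        if max_alt / depth > max_site_freq:        # step 3: likely real variants
            continue
        depth_by_ref[ref] += depth                 # step 4: pool remaining sites
        for trimer, count in observed.items():
            if trimer != ref:
                alt_counts[(ref, trimer)] += count
    return {key: count / depth_by_ref[key[0]] for key, count in alt_counts.items()}
```

With the single AGC site from the worked example as the only input, the pooled AGC->ACC frequency is 4 / 100,000, matching the 4e-05 given in the text.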
Sample-specific InDel background
To make full use of the mutation information in a sample, the method may use InDel mutations in addition to SNVs. When calculating the sample-specific InDel background, InDels are divided into two classes based on their sequence features: 1) random InDels; and 2) InDels in base-repeat regions, denoted (Unit)n, where Unit is the repeat unit, which may be a single base or several bases, and n is the number of repeats, generally 2 or more. An InDel in a base-repeat region generally appears as the insertion or deletion of one or more copies of the repeat unit. The steps for calculating the InDel background are similar to those for SNVs: 1) for all sites in the sequencing target region, count, relative to the reference sequence, the number of InDel and non-InDel observations; 2) remove the information for all somatic mutation sites and any other sites that need to be excluded, so that known mutation sites or regions do not affect the background calculation; 3) optionally remove all information for sites whose mutation frequency exceeds a given threshold, such as 5e-03, so that other potential mutations do not interfere with the background calculation; 4) pool the information of the remaining sites and calculate the frequency of each mutation by InDel type, which serves as the sample-specific InDel background mutation level.
For random InDels, the present application may compute separate background values for different InDel types according to the base immediately preceding the InDel position and the length of the insertion or deletion. When a single base is inserted or deleted, the preceding base is combined with the inserted or deleted base and the corresponding frequencies are counted separately. For example, when a single base A is inserted or deleted, the background frequencies of TA->T, GA->G, CA->C, T->TA, G->GA and C->CA are counted separately. When 2 or more bases are inserted or deleted, the number of combinations is large and the number of target sites per type is small, so the types may optionally not be counted separately; instead, a mean background is calculated over insertions or deletions of the same length.
For (Unit)n where Unit is a single base, the present application computes background values separately according to the type of Unit in the reference sequence, the value of n, and the number of units inserted or deleted. For (Unit)n where Unit is 2 bases long, the specific sequence of Unit may optionally be ignored: all InDels with a Unit length of 2 and the same repeat count n are pooled, and the corresponding background is calculated according to the number of units inserted or deleted. For example, the mutations GATAT->GAT and CTGTG->CTG both have a Unit length of 2, n = 2 and one deleted unit, and are pooled when calculating the background noise. When the Unit length is greater than 2, the same treatment as for a Unit length of 2 may be used.
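The grouping rules for random and repeat-region InDels can be expressed as a function that maps an InDel to the background class it is pooled into. The sketch below is hypothetical: it assumes each InDel is written as in the examples in the text, with one shared anchor base followed by the full repeat tract (for example GATAT->GAT), and the tuple layout of the returned key is an illustration rather than anything defined by the application.

```python
def smallest_unit(tract):
    """Return (unit, n) with tract == unit * n for the shortest possible unit."""
    for size in range(1, len(tract) + 1):
        if len(tract) % size == 0 and tract[:size] * (len(tract) // size) == tract:
            return tract[:size], len(tract) // size
    return "", 0


def indel_background_key(ref, alt):
    """Map an anchored InDel (ref -> alt) to its background class."""
    if len(ref) == len(alt) or not ref or not alt or ref[0] != alt[0]:
        raise ValueError("expected an insertion or deletion with a shared anchor base")
    kind = "del" if len(ref) > len(alt) else "ins"
    unit, n = smallest_unit(ref[1:])
    alt_tract = alt[1:]
    if n >= 2 and len(alt_tract) % len(unit) == 0:
        m = len(alt_tract) // len(unit)
        if unit * m == alt_tract:
            # Single-base units keep their base identity; longer units are
            # pooled by unit length only, as described in the text.
            unit_label = unit if len(unit) == 1 else len(unit)
            return ("repeat", unit_label, n, abs(m - n), kind)
    longer, shorter = (ref, alt) if kind == "del" else (alt, ref)
    if not longer.startswith(shorter):
        raise ValueError("complex InDels are outside this sketch")
    changed = longer[len(shorter):]
    if len(changed) == 1:
        # Random single-base InDel: keyed by preceding base and changed base.
        return ("random", shorter[-1], changed, kind)
    # Random multi-base InDel: pooled by length only.
    return ("random", len(changed), kind)
```

Under these assumptions GATAT->GAT and CTGTG->CTG map to the same key, so their counts would be pooled, while TA->T and GA->G remain distinct classes.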
For the repeat count n of Unit, the present application assumes that, for a given Unit base type and a given number of inserted or deleted units, the background error rate is related to n as follows:
Figure PCTCN2022070974-appb-000016
where $p_n(\mathrm{Unit} \mid n_1)$ denotes the background error rate of inserting (or deleting) $n_1$ units when the given Unit is repeated n times. Under this assumption, the detection information from all sites that meet the conditions can be used to estimate
Figure PCTCN2022070974-appb-000017
and the error rate of a site with n repeats is then:
Figure PCTCN2022070974-appb-000018
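The formulas relating the error rate to n are given only as images and are not reproduced in this text. Example 1 of the application reports that the observable signal frequency increases exponentially with n, so one plausible way to model such a relationship is a log-linear fit. The sketch below shows that assumed form only; the model ln(rate) = a + b·n, the function names, and the least-squares fitting procedure are illustrative assumptions and should not be read as the application's formula.

```python
import math


def fit_log_linear(rates_by_n):
    """Least-squares fit of ln(rate) = a + b * n; returns (a, b).

    rates_by_n maps a repeat count n to the pooled observed error rate for
    one Unit type and one number of inserted or deleted units.
    """
    points = [(n, math.log(rate)) for n, rate in rates_by_n.items() if rate > 0]
    if len(points) < 2:
        raise ValueError("need at least two repeat counts with non-zero rates")
    mean_n = sum(n for n, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((n - mean_n) ** 2 for n, _ in points)
    sxy = sum((n - mean_n) * (y - mean_y) for n, y in points)
    b = sxy / sxx
    a = mean_y - b * mean_n
    return a, b


def predicted_rate(a, b, n):
    """Error rate predicted by the fitted model for a site with n repeats."""
    return math.exp(a + b * n)
```

Pooling all qualifying sites before fitting lets repeat counts with few observations borrow strength from better-covered ones, which is the benefit the text attributes to modeling the dependence on n.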
Mutation signal at somatic mutation sites
Based on the alignment information in the BAM file, for each preselected somatic site to be tested, the frequency of the specific SNV is calculated using the trimer model, or the frequency of the corresponding InDel is calculated by InDel type. For example, if the original trimer at a somatic site to be tested is CAG and the somatic mutation is A->G, the frequency of CAG->CGG is calculated. As before, low-quality alignments and low-quality bases are optionally excluded from the calculation.
Significance assessment of somatic mutation sites
A binomial distribution describes n independent repetitions of a Bernoulli trial, in which each trial has only two possible, mutually exclusive outcomes and the trials are independent of one another; this matches the sample background mutation scenario. Moreover, when n is sufficiently large and the event probability p is sufficiently small, the number of observed events approximately follows a Poisson distribution (λ = np). The method of the present application may therefore calculate the significance of a somatic mutation under either a binomial assumption (x ~ Binom(n, p)) or a Poisson assumption (x ~ Poisson(np)). For example, the present application uses the Poisson assumption, which can give relatively accurate results.
The method of the present application calculates the cumulative probability (P value) under the background model from the observed count of the specific mutation at the somatic site and the frequency of that mutation in the sample background. When the P value is below a given threshold, the mutation frequency at the site is considered significantly higher than the sample background and the site is called a true mutation. Suppose the mutation to be tested is A->G with original trimer CAG, the observed coverage depth at the site is n, and CGG is observed x times. The P value for detecting the mutation at the site is then:
Figure PCTCN2022070974-appb-000019
or
Figure PCTCN2022070974-appb-000020
where p(CAG→CGG) is the background frequency.
In addition to SNV significance, the same approach applies to InDel significance. For example, if the InDel to be tested is AGGG->AGG, the coverage depth at the site is n1 and AGG is observed x1 times, then n is replaced by n1, x by x1, and p(CAG→CGG) by p(AGGG→AGG) in the formulas above. By analogy, the site-level significance of any type of InDel or SNV can be calculated in this way.
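The two P value formulas are present only as images. A natural reading of "cumulative probability under the background" is the upper-tail probability of observing at least x mutant reads, and the sketch below computes that tail for both distributional assumptions. This reading, the function names, and the numerical stopping rule are assumptions made for illustration; only the Python standard library is used.

```python
import math


def poisson_tail(x, lam):
    """P(X >= x) for X ~ Poisson(lam), summed directly over the upper tail."""
    if x <= 0:
        return 1.0
    if lam <= 0:
        return 0.0
    total, k = 0.0, x
    while True:
        term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
        total += term
        # Past the mode the terms shrink; stop once they no longer matter.
        if k > lam and term <= total * 1e-17:
            break
        k += 1
    return min(1.0, total)


def binomial_tail(x, n, p):
    """P(X >= x) for X ~ Binom(n, p), computed in log space."""
    if x <= 0:
        return 1.0
    if x > n or p <= 0:
        return 0.0
    if p >= 1:
        return 1.0
    total = 0.0
    for k in range(x, n + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                    + k * math.log(p) + (n - k) * math.log1p(-p))
        term = math.exp(log_term)
        total += term
        if k > n * p and term <= total * 1e-17:
            break
    return min(1.0, total)
```

For a hypothetical site with depth n = 100,000, background frequency p(CAG→CGG) = 4e-05 and x = 12 observed CGG reads, `poisson_tail(12, 100000 * 4e-05)` and `binomial_tail(12, 100000, 4e-05)` give nearly identical values, which illustrates why the Poisson approximation is reasonable at this depth. Summing the upper tail directly, rather than computing one minus the lower tail, keeps very small P values from being rounded to zero.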
Evaluation of sample-level ctDNA significance and ctDNA fraction
In practice, the effective sequencing depth of a sample is limited by the volume of peripheral blood drawn and by assay cost. When the ctDNA fraction is as low as 0.02% or below and the mean effective depth is 10,000X, each site carries on average only about 2 mutant signals or fewer (10,000 × 0.0002 = 2). Mutation signal may therefore be hard to detect at some somatic mutation sites in peripheral blood data, and, given the background levels of the various mutation types, the ctDNA fraction may be difficult to calculate directly. The method of the present application may use a multi-site joint test to determine whether ctDNA is present in the sample and a likelihood method to estimate the ctDNA fraction. Let the ctDNA fraction be $\pi$, the frequency of the i-th somatic mutation to be tested in the tumor tissue sample be $q_i$, and the background frequency of the corresponding mutation in the test sample be $p_i$. The expected frequency $f_i$ of the somatic mutation in the test sample then satisfies:
$$f_i = \pi q_i + (1 - \pi) p_i$$
The parameter $\pi$ is estimated by the likelihood method, with log-likelihood function:
$$l(\pi) = \sum_i w_i \ln l_i(\pi; x_i, n_i, p_i, q_i)$$
where $w_i$ is the weight of the i-th somatic mutation to be tested, which in practice can be set according to the type and reliability of the mutation; $n_i$ is the effective coverage depth of the i-th somatic mutation to be tested; $x_i$ is the depth of the target mutation for the i-th somatic mutation to be tested; and $l_i(\pi)$ is the posterior probability of the i-th somatic mutation to be tested:
Figure PCTCN2022070974-appb-000021
or
Figure PCTCN2022070974-appb-000022
where $f_i = \pi q_i + (1 - \pi) p_i$.
The parameter $\pi$ is estimated by maximum likelihood, giving the maximum likelihood estimate
Figure PCTCN2022070974-appb-000023
The null hypothesis $\pi = 0$ is tested with a likelihood ratio test, for which the likelihood ratio statistic is:
Figure PCTCN2022070974-appb-000024
The P value can be calculated from the probability density function of the
Figure PCTCN2022070974-appb-000025
distribution.
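The estimation and testing procedure can be sketched as follows. Several details are assumptions because the corresponding formulas appear only as images: the per-site term is taken to be a Poisson probability with mean $n_i f_i$ (one of the two distributions the text names), the statistic is taken to be twice the log-likelihood difference between the estimate and $\pi = 0$, and its reference distribution is taken to be chi-square with one degree of freedom. The function names and the ternary search are likewise illustrative.

```python
import math


def log_likelihood(pi, sites):
    """Weighted log-likelihood; sites is a list of (x, n, p, q, w) tuples."""
    total = 0.0
    for x, n, p, q, w in sites:
        f = pi * q + (1.0 - pi) * p          # expected mutant frequency f_i
        mean = max(n * f, 1e-300)            # guard against log(0)
        total += w * (x * math.log(mean) - mean - math.lgamma(x + 1))
    return total


def estimate_ctdna_fraction(sites, iterations=200):
    """Return (pi_hat, likelihood_ratio_statistic, p_value)."""
    lo, hi = 0.0, 1.0
    # The log-likelihood is concave in pi for positive weights, so a ternary
    # search over [0, 1] converges to the maximum.
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if log_likelihood(m1, sites) < log_likelihood(m2, sites):
            lo = m1
        else:
            hi = m2
    pi_hat = (lo + hi) / 2
    null_ll = log_likelihood(0.0, sites)
    if null_ll >= log_likelihood(pi_hat, sites):
        pi_hat = 0.0
    statistic = 2.0 * (log_likelihood(pi_hat, sites) - null_ll)
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(max(statistic, 0.0) / 2.0))
    return pi_hat, statistic, p_value
```

Combining all sites in one likelihood is what allows a sample to be called positive even when no single site is individually significant, which is the situation the text describes for ctDNA fractions near 0.02%.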
In another aspect, the present application further provides a method for detecting the presence and/or amount of a variant nucleic acid. The method may comprise determining a set of somatic mutation sites based on the mutation priority of the somatic mutation sites of a variant sample to be tested. The set of somatic mutation sites may be used to detect the presence and/or amount of the variant nucleic acid, and the mutation priorities, from high to low, may comprise: driver mutations, non-synonymous mutations other than driver mutations, and synonymous mutations.
For example, the variant sample to be tested may be derived from a sample obtained from a subject before the subject receives treatment. For example, the treatment may comprise tumor treatment.
For example, the somatic mutation sites may be determined by comparing the variant sample to be tested with a negative sample.
For example, the non-synonymous mutations other than driver mutations may be selected from the group consisting of: alternative-splicing mutations, insertions or deletions that do not shift the reading frame of the gene (in-frame INDEL), and insertions or deletions that shift the reading frame of the gene (out-of-frame INDEL).
For example, the method may comprise ranking the somatic mutation sites by mutation priority from high to low, wherein somatic mutation sites with the same mutation priority may be ranked by mutation frequency from high to low.
For example, the method may comprise selecting the 5 or more top-ranked mutation sites as the set of somatic mutation sites. For example, the method of the present application may comprise selecting the top-ranked 1 or more, 2 or more, 3 or more, 4 or more, 5 or more, 6 or more, 7 or more, 8 or more, 9 or more, 10 or more, 15 or more, 20 or more, 25 or more, 30 or more, 40 or more, 50 or more, or 100 or more mutation sites as the set of somatic mutation sites.
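The ranking and selection described above amount to a two-key sort. The following sketch assumes a hypothetical record layout (dictionaries with `id`, `category` and `frequency` keys) and hypothetical category labels; how each site is annotated as a driver, non-synonymous or synonymous mutation is outside the sketch.

```python
# Lower number = higher priority, following the order given in the text.
PRIORITY = {"driver": 0, "nonsynonymous": 1, "synonymous": 2}


def select_site_set(sites, top_n=5):
    """Rank sites by priority, then by descending frequency; keep the top_n."""
    ranked = sorted(sites, key=lambda s: (PRIORITY[s["category"]], -s["frequency"]))
    return ranked[:top_n]
```

For instance, a synonymous mutation at a frequency of 0.40 would still rank below a driver mutation at 0.05, because frequency is only used to break ties within one priority level.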
For example, the method may further comprise determining the region to be tested of a sample to be tested based on the set of somatic mutation sites. For example, the method may further comprise determining, based on the set of somatic mutation sites, nucleic acids capable of binding the region to be tested. For example, the method of the present application may comprise designing probes for testing the sample to be tested based on the set of somatic mutation sites.
In another aspect, the present application further provides an analysis device for the method of detecting the presence and/or amount of a variant nucleic acid. The device comprises a determination module or judgment module, which may be used to determine the presence and/or amount of the variant nucleic acid based on the somatic mutation sites and background mutation sites of the region to be tested in the sample to be tested, wherein the background mutation sites may be determined by removing the somatic mutation sites from all mutation sites of the sample to be tested. For example, the analysis device may comprise modules that perform the method of detecting the presence and/or amount of a variant nucleic acid described in the present application.
In another aspect, the present application further provides a method for establishing a database, the database comprising a set of somatic mutation sites. The method may comprise determining the set of somatic mutation sites based on the mutation priority of the somatic mutation sites of a variant sample to be tested, the set of somatic mutation sites being used to detect the presence and/or amount of a variant nucleic acid, and the mutation priorities, from high to low, comprising: driver mutations, non-synonymous mutations other than driver mutations, and synonymous mutations. For example, the method for establishing the database may comprise the method of determining a set of somatic mutation sites based on the mutation priority of the somatic mutation sites of a variant sample to be tested.
In another aspect, the present application further provides a device for establishing a database, the database comprising a set of somatic mutation sites. The device comprises a determination module for determining the set of somatic mutation sites based on the mutation priority of the somatic mutation sites of a variant sample to be tested, the set of somatic mutation sites being used to detect the presence and/or amount of a variant nucleic acid, and the mutation priorities, from high to low, comprising: driver mutations, non-synonymous mutations other than driver mutations, and synonymous mutations. For example, the device for establishing the database may comprise modules that perform the method for establishing a database described in the present application.
In another aspect, the present application further provides a database, which may be established according to the method for establishing a database of the present application.
In another aspect, the present application further provides a storage medium storing a program capable of running the method described in the present application. For example, the non-volatile computer-readable storage medium may include a floppy disk, a flexible disk, a hard disk, solid-state storage (SSS) (for example a solid-state drive (SSD), a solid-state card (SSC) or a solid-state module (SSM)), an enterprise-grade flash drive, magnetic tape, or any other non-transitory magnetic medium. The non-volatile computer-readable storage medium may also include punched cards, paper tape, optical mark sheets (or any other physical medium having a pattern of holes or other optically recognizable marks), compact disc read-only memory (CD-ROM), rewritable compact discs (CD-RW), digital versatile discs (DVD), Blu-ray discs (BD) and/or any other non-transitory optical medium.
In another aspect, the present application further provides a device comprising the storage medium described in the present application. For example, the device may further comprise a processor coupled to the storage medium, the processor being configured to execute the program stored in the storage medium so as to implement the method of the present application.
In another aspect, the present application further provides the method according to the present application for use in detecting and/or quantifying circulating tumor DNA in a test sample obtained from a subject. In the present application, the method may be used to determine the presence and/or content of circulating tumor DNA in the subject's test sample. For example, any one or more of the methods of the present application may be for non-diagnostic purposes. For example, any one or more of the methods of the present application may be for diagnostic purposes.
In another aspect, the present application further provides the method according to the present application for use in the diagnosis, prevention and/or companion treatment of a disease or residual disease.
In another aspect, the present application further provides the method according to the present application for use in the prediction, selection and/or evaluation of disease treatment methods. For example, the method may determine, or assist in determining, the likelihood that a subject has cancer or a recurrence of cancer and could benefit from anticancer therapy, including chemotherapy, immunotherapy, radiotherapy, surgery or a combination thereof.
In the present application, the method may be used in clinical practice by detecting the presence and/or content of circulating tumor DNA in a test sample (for example, to infer whether a particular tumor treatment is suitable for the subject). In some cases, the presence and/or content of circulating tumor DNA detected by the method may be used in clinical practice in combination with biomarkers known in the art.
Without wishing to be bound by any theory, the examples below are intended only to illustrate the methods and uses of the present application and not to limit the scope of the invention of the present application.
Examples
Example 1
A total of 25 real samples were selected, their observable InDel signals were analyzed, and the background mutation frequency was preliminarily assessed. The statistics show that, for (Unit)n-type InDels with the same number of inserted or deleted repeat units, the observable signal frequency increases exponentially with n, as shown in Figures 1A-1B. Here, Unit denotes the base length of the repeat unit and n denotes the number of repeats.
For the same (Unit)n, insertions or deletions of 2 units showed a lower observable signal frequency than insertions or deletions of 1 unit, and the observable signal was weak for insertions or deletions of 3 or more units, as shown in Figures 2A-2B.
Compared with repeat units 1 base long, repeat units 2-3 bases long showed a comparable or higher observable InDel signal frequency, as shown in Figures 3A-3B.
Compared with repeat-unit InDels, random insertions or deletions of 1-2 bases showed very low observable signal frequencies, at the 1e-7 level, as shown in Figure 4.
In MRD detection the ctDNA fraction is low, for example 2e-4 or below. For repeat units with a repeat count n <= 3, or for random InDels, the observable InDel signal can be below 1e-5. Such mutations can therefore be included in MRD analysis, and MRD can be detected accurately in samples with a ctDNA fraction above 1e-5.
Example 2
One test cell line and one background cell line were used as study material and mixed to give diluted samples at five levels, 5e-03, 1e-03, 2e-04, 4e-05 and 8e-06, simulating samples with different ctDNA fractions. From the test cell line, 88 mutation sites that differ from the background cell line were selected for probe design, followed by capture sequencing. Each dilution was sequenced in triplicate, giving 15 diluted samples in total; the library input for each sample was 30 ng and the mean sequencing depth of the target region was 100,000X. From the 88 sites, 5-60 mutation sites were then drawn at random to analyze the ctDNA fraction, with 50 repetitions, so that 150 analyses were performed per dilution level. With a sample P value < 0.01 as the detection threshold, 5 mutation sites sufficed for 100% sample detection at dilutions of 5e-3 or 1e-3; 15 or more mutation sites gave 100% sample detection at 2e-4; and 40 or more mutation sites gave 100% sample detection at 4e-5, as shown in Table 1.
Table 1. Sample detection rates
Figure PCTCN2022070974-appb-000026
When the estimated dilution ratios were evaluated, 5 mutation sites were sufficient to estimate the dilution ratio fairly accurately at dilutions of 5e-3 or 1e-3; 15 mutation sites were sufficient at 2e-4; and 40 or more mutation sites were needed at 4e-5, as shown in Figures 5A-5B.
Example 3
To compare the background construction method used in the method of the present application (PROPHET) with the method that builds the background from the 10 bp regions flanking each mutation site (INVAR), the 15 sequenced samples of Example 2 were analyzed in parallel with both methods. As before, 5-60 sites were drawn at random from the 88 sites, with 50 repetitions. In addition, 5-60 negative sites were drawn at random from outside the 88 sites, also with 50 repetitions, to assess detection specificity. With a sample P value < 0.01 as the threshold, the detection sensitivity of the method of the present application can exceed that of INVAR when fewer than 40 sites are used, as shown in Table 2 and Figure 6.
Table 2. Comparison of detection performance of the different methods
Figure PCTCN2022070974-appb-000027
Figure PCTCN2022070974-appb-000028
The known INVAR method uses both the sequencing information from the 10 bp on either side of each mutation site and the sequencing information from multiple samples captured and sequenced with the same target panel. Besides the sample's own mutation sites, INVAR therefore also has available the sequencing information from the 10 bp flanking the mutation sites of the other samples sequenced at the same time, so its total number of usable mutation sites is relatively large. The method of the present application is better suited to single-sample panels and can be applied where the total number of usable mutation sites is small.
Example 4
To assess how INDELs and SNVs affect the estimation of the ctDNA fraction, a reference standard was diluted with a background cell line to seven levels, 2.5e-3, 1.25e-3, 6.25e-4, 3.125e-4, 1.6e-4, 8e-5 and 4e-5, simulating samples with different ctDNA fractions. Within the sequenced region the standard contains 28 valid mutations, of which 8 are INDELs and 20 are SNVs. The mean sequencing depth of the diluted samples was 60,000X. The ctDNA fraction was analyzed using the 8 INDELs, 8 SNVs (chosen at random), and all 28 mutations, respectively; the results are shown in Table 3.
Table 3. Analysis results for the diluted reference standard samples
Figure PCTCN2022070974-appb-000029
The results show that, with 8 mutation sites, SNVs or INDELs analyzed alone give accurate estimates at dilution levels of 1.6e-4 and above, with a significance P value < 0.01; at 8e-5 or below, the INDEL results can be better than the SNV results. When SNVs and INDELs are analyzed together, the dilution level is estimated accurately at every dilution in this experiment, with a significance P value < 0.01.
实施例5Example 5
本申请选择了5个待检细胞系和1个本底细胞系作为研究材料,将每个待检细胞系与本底细胞系稀释成5e-03、1e-03、2e-04、4e-05、8e-06共5个梯度的稀释样本,模拟不同ctDNA占比的样本。在每个待检细胞系中选择40~100个自身特有的胚系突变,作为体细胞突变位点,并设计相应的杂交探针,用于后续实验。最终对每个稀释样本以及本底样本进行了三次重复实验,共获得90个样本数据,每个样本的建库投入量为30ng,目标区平均测序深度为100000X。后续利用本申请的方法对这些测序数据进行分析,计算其位点检出以及样本检出情 况。表4为样本的掺比(模拟ctDNA占比)评估和显著性水平结果,图7A-7E为位点pvalue<0.05时位点检出情况展示。This application selected 5 cell lines to be tested and 1 background cell line as research materials, and each cell line to be tested and the background cell line were diluted to 5e-03, 1e-03, 2e-04, 4e-05 , 8e-06, a total of 5 gradient dilution samples, simulating samples with different ctDNA proportions. In each cell line to be tested, 40-100 unique germline mutations were selected as somatic mutation sites, and corresponding hybridization probes were designed for subsequent experiments. Finally, three repeated experiments were carried out on each diluted sample and background sample, and a total of 90 sample data were obtained. The input amount for each sample was 30ng, and the average sequencing depth of the target region was 100,000X. Then use the method of this application to analyze these sequencing data, and calculate the detection of the site and the detection of the sample. Table 4 shows the evaluation and significance level results of the mixing ratio (proportion of simulated ctDNA) of the sample, and Figure 7A-7E shows the detection status of the site when the pvalue<0.05.
表4细胞系稀释样本分析结果Table 4 Analysis results of cell line dilution samples
Figure PCTCN2022070974-appb-000030
Figure PCTCN2022070974-appb-000031
The sample-level results show that, at dilution levels from 5e-03 to 4e-05, the method of this application estimated the dilution level fairly accurately and the sample significance p-values were all low, whereas at the 8e-06 level the estimated dilution level differed considerably from the actual value. The site-level results show that, across dilution levels from 5e-03 to 4e-05, sensitivity fell from 100% to about 15% but remained clearly higher than (1 - specificity); at the 8e-06 level, sensitivity fell to about 5%, close to (1 - specificity). This verifies that the detection method of this application can detect ctDNA down to about 4e-05, below the consensus ctDNA detection limit of 2e-04, and provides supporting data for subsequent application to minimal residual disease detection.
The foregoing detailed description is provided by way of explanation and example and is not intended to limit the scope of the appended claims. Many variations of the embodiments presently set forth in this application will be apparent to those of ordinary skill in the art and remain within the scope of the appended claims and their equivalents.

Claims (20)

  1. A method for detecting the presence and/or amount of a variant nucleic acid, the method comprising determining the presence and/or amount of the variant nucleic acid based on somatic mutation sites and background mutation sites in a region to be tested of a sample to be tested, wherein the background mutation sites are determined by removing the somatic mutation sites from all mutation sites of the sample to be tested.
  2. The method of claim 1, wherein the variant nucleic acid is selected from the group consisting of: circulating tumor nucleic acid, cell-free fetal nucleic acid, and circulating nucleic acid derived from an allogeneic organ and/or tissue.
  3. The method of any one of claims 1-2, further comprising performing any one or more of the following base error corrections on the mutation sites of the sample to be tested, and determining the mutation sites in the sample to be tested based on the sites after base error correction:
    1) the base error correction comprises correcting, based on a majority voting rule, the base type at each position of sequencing reads derived from the same site, and determining a consensus sequence;
    2) the base error correction comprises adjusting to 0 the base quality of positions whose base type cannot be determined;
    3) the base error correction comprises correcting the base type at each position of the sense strand and the antisense strand derived from the same site, and retaining the respective consensus sequences of the sense strand and the antisense strand;
    4) the base error correction comprises adjusting to 0 the base quality of positions at which the sense strand and the antisense strand derived from the same site carry inconsistent bases;
    wherein the sequencing reads derived from the same site comprise sequencing reads that align to the same position of the human reference genome and carry the same unique molecular identifier (UMI).
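The correction steps of claim 3 can be sketched roughly as follows. This is a minimal illustration rather than the applicant's implementation: it assumes the reads of one UMI family are already aligned, equal in length and in the same orientation, it interprets the majority voting rule as a strict majority, and `RESOLVED_QUAL`, `family_consensus` and `mask_strand_disagreements` are hypothetical names.

```python
from collections import Counter

# Hypothetical quality value assigned to consensus bases that were resolved.
RESOLVED_QUAL = 40


def family_consensus(reads):
    """Majority-vote consensus for reads sharing an alignment position and UMI.

    Positions without a strict majority become 'N' with base quality 0
    (steps 1 and 2 of the claim, under the strict-majority assumption).
    """
    if not reads:
        raise ValueError("a read family must contain at least one read")
    length = len(reads[0])
    if any(len(read) != length for read in reads):
        raise ValueError("reads in one family must have equal length")
    bases, quals = [], []
    for column in zip(*reads):
        base, count = Counter(column).most_common(1)[0]
        if base != "N" and 2 * count > len(reads):
            bases.append(base)
            quals.append(RESOLVED_QUAL)
        else:
            bases.append("N")
            quals.append(0)
    return "".join(bases), quals


def mask_strand_disagreements(sense, antisense):
    """Keep both strand consensuses, zeroing quality where they disagree.

    Each argument is a (sequence, qualities) pair in the same orientation
    (steps 3 and 4 of the claim).
    """
    (s_seq, s_qual), (a_seq, a_qual) = sense, antisense
    s_qual, a_qual = list(s_qual), list(a_qual)
    for i, (b1, b2) in enumerate(zip(s_seq, a_seq)):
        if b1 != b2:
            s_qual[i] = 0
            a_qual[i] = 0
    return (s_seq, s_qual), (a_seq, a_qual)
```

Zeroing the quality instead of deleting the position keeps read coordinates intact, so a later base-quality filter can discard the doubtful bases.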
  4. The method of any one of claims 1-3, further comprising obtaining the background mutation sites by applying any one or more of the following filters to all mutation sites of the sample to be tested:
    1) removing high-frequency mutation sites;
    2) removing, from the sequence information of the sample to be tested, sequence information that fails quality control;
    3) removing, from the sequence information of the sample to be tested, the sequence information of low-quality sequencing reads;
    4) removing, from the sequence information of the sample to be tested, the sequence information of low-quality bases.
  5. The method of claim 4, wherein the high-frequency mutation sites comprise sites with a mutation frequency of about 5e-03 or higher, the low-quality sequencing reads comprise sequencing reads with a mapping quality below 60 and/or sequencing reads containing 8 or more base mismatches, and/or the low-quality bases comprise bases with a corrected base quality below 20.
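The filters of claims 4 and 5, together with the background-site definition of claim 1, can be expressed as simple predicates. The sketch below is an assumption-laden illustration: the `Read` record, the function names and the dictionary layout are hypothetical, and only the numeric thresholds come from claim 5.

```python
from dataclasses import dataclass

# Thresholds taken from claim 5.
MAX_SITE_FREQ = 5e-03   # sites at or above this frequency are removed
MIN_MAPQ = 60           # reads with mapping quality below this are removed
MAX_MISMATCHES = 8      # reads with this many mismatches or more are removed
MIN_BASEQ = 20          # bases with corrected quality below this are removed


@dataclass
class Read:
    """Hypothetical minimal read record used only for this sketch."""
    mapq: int
    mismatches: int


def keep_read(read: Read) -> bool:
    return read.mapq >= MIN_MAPQ and read.mismatches < MAX_MISMATCHES


def keep_base(corrected_quality: int) -> bool:
    return corrected_quality >= MIN_BASEQ


def background_sites(all_sites: dict, somatic_sites: set) -> dict:
    """Return {site: frequency} after removing the known somatic sites
    (claim 1) and the high-frequency sites (claim 4, item 1)."""
    return {
        site: freq
        for site, freq in all_sites.items()
        if site not in somatic_sites and freq < MAX_SITE_FREQ
    }
```

For example, `background_sites({"s1": 1e-4, "s2": 1e-2, "s3": 2e-4}, {"s3"})` keeps only `s1`: `s2` is dropped as a high-frequency site and `s3` as a known somatic site.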
  6. The method of any one of claims 1-5, further comprising determining a mutation frequency selected from the group consisting of the somatic mutation frequency of the somatic mutation sites and the background mutation frequency of the background mutation sites, for use in assessing the significance level of a site mutation.
  7. The method of claim 6, wherein the mutation frequency comprises a single-base mutation frequency, a multimer mutation frequency, and/or an INDEL mutation frequency.
  8. The method of claim 7, wherein the multimer mutation frequency comprises the frequency at which the base at a specific position within a specific contiguous base sequence mutates to another base, wherein the contiguous base sequence comprises 2 or more contiguous bases, or the contiguous base sequence comprises 3 contiguous bases.
  9. The method of any one of claims 7-8, wherein the multimer mutation frequency comprises the frequency at which the base at position 2 of a specific contiguous sequence mutates to another specific base.
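One way to read the multimer mutation frequency of claims 8 and 9 is as a per-context error rate: for each 3-base reference context, the fraction of observations in which the middle base is read as a particular other base. The sketch below assumes that interpretation; the input layout and the function name are hypothetical.

```python
from collections import defaultdict


def trinucleotide_background(observations):
    """Estimate the frequency of each middle-base change per 3-base context.

    `observations` is an iterable of (context, observed_base) pairs, where
    `context` is the 3-base reference sequence and `observed_base` is the
    base read at its second position. Returns {(context, alt): frequency}.
    """
    totals = defaultdict(int)
    mutated = defaultdict(int)
    for context, observed in observations:
        if len(context) != 3:
            raise ValueError("context must be exactly 3 bases long")
        totals[context] += 1
        if observed != context[1]:
            mutated[(context, observed)] += 1
    return {key: count / totals[key[0]] for key, count in mutated.items()}
```

With 999 observations of `ACG` read as `C` and one read as `T`, the function returns a frequency of 0.001 for the `ACG`, `T` pair.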
  10. The method of any one of claims 7-9, wherein the INDEL mutation frequency comprises a random INDEL mutation frequency and/or a base-repeat-region INDEL mutation frequency, or the random INDEL mutation frequency comprises the frequency of insertion or deletion of one or more bases, or the random INDEL mutation frequency comprises the frequency of insertion or deletion of one or more bases after a specific one or more bases, or the random INDEL mutation frequency comprises the frequency of insertion or deletion of one or more bases after a specific single base.
  11. The method of any one of claims 1-10, further comprising determining the significance level of the presence of the variant nucleic acid in the sample to be tested and/or of the presence of a mutation at the somatic mutation sites.
  12. The method of claim 11, comprising measuring the significance level by determining a cumulative probability from the background mutation frequency of the somatic mutation site.
  13. The method of claim 12, comprising determining the cumulative probability based on the following formula:
    Figure PCTCN2022070974-appb-100001
    wherein P denotes the cumulative probability, k is summed from 0 to x-1, x denotes the coverage depth of the mutated sequence at the somatic mutation site, n denotes the total coverage depth of the somatic mutation site, p denotes the background mutation frequency of the somatic mutation site, and e denotes the base of the natural logarithm;
    and/or, the method comprises determining the cumulative probability based on the following formula:
    Figure PCTCN2022070974-appb-100002
    wherein P denotes the cumulative probability, k is summed from 0 to x-1, x denotes the coverage depth of the mutated sequence at the somatic mutation site, n denotes the total coverage depth of the somatic mutation site, and p denotes the background mutation frequency of the somatic mutation site.
  14. The method of any one of claims 12-13, comprising determining that the variant nucleic acid is present when the cumulative probability is less than a significance threshold, wherein the significance threshold is 0.05 or less.
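The two formulas of claim 13 are reproduced only as images. From the variable definitions (a sum over k from 0 to x-1, with e appearing only in the first formula) and from the "less than a threshold" rule of claim 14, a plausible reading is an upper-tail probability under a Poisson model and under a binomial model. The sketch below assumes that reading; it is not a transcription of the figures, and all function names are hypothetical.

```python
import math


def poisson_tail(x: int, n: int, p: float) -> float:
    """Assumed form: P = 1 - sum_{k=0}^{x-1} e^(-np) (np)^k / k!"""
    lam = n * p
    if x <= 0:
        return 1.0
    if lam <= 0:
        return 0.0
    # Each term is evaluated in log space to avoid overflow at high depth.
    cdf = sum(
        math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
        for k in range(x)
    )
    return min(1.0, max(0.0, 1.0 - cdf))


def binomial_tail(x: int, n: int, p: float) -> float:
    """Assumed form: P = 1 - sum_{k=0}^{x-1} C(n,k) p^k (1-p)^(n-k)"""
    if x <= 0:
        return 1.0
    if p <= 0:
        return 0.0
    if p >= 1:
        return 1.0 if x <= n else 0.0
    log_p, log_q = math.log(p), math.log1p(-p)
    cdf = 0.0
    for k in range(min(x, n + 1)):
        log_comb = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        cdf += math.exp(log_comb + k * log_p + (n - k) * log_q)
    return min(1.0, max(0.0, 1.0 - cdf))


def site_is_significant(x, n, p, alpha=0.05, model="poisson"):
    """Claim 14: call the site when the cumulative probability is below alpha."""
    tail = poisson_tail if model == "poisson" else binomial_tail
    return tail(x, n, p) < alpha
```

At a depth of 100,000 and a background frequency of 1e-6, five mutant reads give a tail probability far below 0.05, whereas one mutant read at a background frequency of 1e-5 gives about 0.63 and is not called.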
  15. The method of any one of claims 1-14, comprising determining the presence and/or amount of the variant nucleic acid in the sample to be tested by a likelihood estimation algorithm based on a Poisson distribution or a binomial distribution, wherein the amount of the variant nucleic acid comprises the proportion of circulating tumor DNA (ctDNA) in the total DNA of the sample to be tested.
  16. The method of claim 15, comprising determining a maximum likelihood estimate of the ctDNA fraction π,
    Figure PCTCN2022070974-appb-100003
    wherein, when π takes the value of said estimate,
    Figure PCTCN2022070974-appb-100004
    the function l(π) defined by the following formula reaches its maximum:
    $l(\pi) = \sum_i w_i \ln l_i(\pi; x_i, n_i, p_i, q_i)$
    wherein $w_i$ is the weight of the i-th somatic mutation site, and $l_i(\pi; x_i, n_i, p_i, q_i)$ is calculated by the following formula:
    Figure PCTCN2022070974-appb-100005
    wherein $f_i$ is calculated by the following formula:
    $f_i = \pi q_i + (1 - \pi) p_i$
    $n_i$ denotes the total coverage depth of the i-th somatic mutation site, $x_i$ denotes the coverage depth of the mutated sequence at the i-th somatic mutation site, $q_i$ denotes the mutation frequency of the i-th somatic mutation site in the tumor tissue sample, $p_i$ denotes the background mutation frequency of the corresponding mutation in the sample to be tested, and e denotes the base of the natural logarithm;
    and/or, the method comprises determining a maximum likelihood estimate of the ctDNA fraction π,
    Figure PCTCN2022070974-appb-100006
    wherein, when π takes the value of said estimate,
    Figure PCTCN2022070974-appb-100007
    the function l(π) defined by the following formula reaches its maximum:
    $l(\pi) = \sum_i w_i \ln l_i(\pi; x_i, n_i, p_i, q_i)$
    wherein $w_i$ is the weight of the i-th somatic mutation site, and $l_i(\pi; x_i, n_i, p_i, q_i)$ is calculated by the following formula:
    Figure PCTCN2022070974-appb-100008
    wherein $f_i$ is calculated by the following formula:
    $f_i = \pi q_i + (1 - \pi) p_i$
    $n_i$ denotes the total coverage depth of the i-th somatic mutation site, $x_i$ denotes the coverage depth of the mutated sequence at the i-th somatic mutation site, $q_i$ denotes the mutation frequency of the i-th somatic mutation site in the tumor tissue sample, and $p_i$ denotes the background mutation frequency of the corresponding mutation in the sample to be tested.
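The estimation step of claim 16 can be sketched as a one-dimensional maximization of the weighted log-likelihood l(π). The per-site likelihood formulas are images in the source, so the sketch assumes the usual Poisson and binomial forms with mean fraction f_i = πq_i + (1-π)p_i, drops terms that do not depend on π, and uses a golden-section search as one possible optimizer. The tuple layout `(x, n, p, q, w)` and the function names are hypothetical.

```python
import math

EPS = 1e-12  # keeps f strictly inside (0, 1) so the logarithms stay finite


def log_likelihood(pi, sites, model="binomial"):
    """Weighted log-likelihood l(pi), up to terms that do not depend on pi.

    `sites` is a list of (x, n, p, q, w) tuples: mutant depth, total depth,
    background frequency, tumor-tissue frequency and site weight.
    """
    total = 0.0
    for x, n, p, q, w in sites:
        f = min(max(pi * q + (1.0 - pi) * p, EPS), 1.0 - EPS)
        if model == "poisson":
            site_ll = x * math.log(n * f) - n * f
        else:
            site_ll = x * math.log(f) + (n - x) * math.log1p(-f)
        total += w * site_ll
    return total


def estimate_fraction(sites, model="binomial", tol=1e-10):
    """Golden-section search for the pi in [0, 1] that maximizes l(pi)."""
    golden = (math.sqrt(5.0) - 1.0) / 2.0
    lo, hi = 0.0, 1.0
    a = hi - golden * (hi - lo)
    b = lo + golden * (hi - lo)
    fa = log_likelihood(a, sites, model)
    fb = log_likelihood(b, sites, model)
    while hi - lo > tol:
        if fa < fb:  # the maximum lies in [a, hi]
            lo, a, fa = a, b, fb
            b = lo + golden * (hi - lo)
            fb = log_likelihood(b, sites, model)
        else:        # the maximum lies in [lo, b]
            hi, b, fb = b, a, fa
            a = hi - golden * (hi - lo)
            fa = log_likelihood(a, sites, model)
    return (lo + hi) / 2.0
```

The search relies on l(π) being unimodal on [0, 1], which holds for these two forms with positive weights because each site term is concave in f_i and f_i is linear in π.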
  17. The method of claim 16, comprising determining the significance level of the maximum likelihood estimate of the ctDNA fraction π by a likelihood ratio test algorithm.
  18. The method of claim 17, comprising determining the value of the likelihood ratio statistic
    Figure PCTCN2022070974-appb-100009
    by the following formula:
    Figure PCTCN2022070974-appb-100010
    wherein $l(\pi) = \sum_i w_i \ln l_i(\pi; x_i, n_i, p_i, q_i)$
    wherein $w_i$ is the weight of the i-th somatic mutation site, the value of $w_i$ being determined according to the somatic mutation frequency or the sequencing coverage depth of the i-th somatic mutation site, and $l_i(\pi; x_i, n_i, p_i, q_i)$ is calculated by the following formula:
    Figure PCTCN2022070974-appb-100011
    wherein $f_i$ is calculated by the following formula:
    $f_i = \pi q_i + (1 - \pi) p_i$
    $n_i$ denotes the total coverage depth of the i-th somatic mutation site, $x_i$ denotes the coverage depth of the mutated sequence at the i-th somatic mutation site, $q_i$ denotes the mutation frequency of the i-th somatic mutation site in the tumor tissue sample, $p_i$ denotes the background mutation frequency of the corresponding mutation in the sample to be tested, and e denotes the base of the natural logarithm;
    and/or, the method comprises determining the value of the likelihood ratio statistic
    Figure PCTCN2022070974-appb-100012
    by the following formula:
    Figure PCTCN2022070974-appb-100013
    wherein $l(\pi) = \sum_i w_i \ln l_i(\pi; x_i, n_i, p_i, q_i)$
    wherein $w_i$ is the weight of the i-th somatic mutation site, the value of $w_i$ being determined according to the somatic mutation frequency or the sequencing coverage depth of the i-th somatic mutation site, and $l_i(\pi; x_i, n_i, p_i, q_i)$ is calculated by the following formula:
    Figure PCTCN2022070974-appb-100014
    wherein $f_i$ is calculated by the following formula:
    $f_i = \pi q_i + (1 - \pi) p_i$
    $n_i$ denotes the total coverage depth of the i-th somatic mutation site, $x_i$ denotes the coverage depth of the mutated sequence at the i-th somatic mutation site, $q_i$ denotes the mutation frequency of the i-th somatic mutation site in the tumor tissue sample, $p_i$ denotes the background mutation frequency of the corresponding mutation in the sample to be tested, and e denotes the base of the natural logarithm.
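The likelihood ratio statistic of claim 18 is shown only as an image. A common construction, assumed here and not confirmed by the source, compares the log-likelihood at the estimated fraction with the log-likelihood at π = 0 and refers twice the difference to a chi-square distribution with one degree of freedom. The sketch uses the binomial form of the site likelihood; the tuple layout and function names are hypothetical.

```python
import math

EPS = 1e-12  # keeps f strictly inside (0, 1)


def weighted_loglik(pi, sites):
    """Weighted binomial log-likelihood l(pi), dropping pi-independent terms.

    `sites` is a list of (x, n, p, q, w) tuples: mutant depth, total depth,
    background frequency, tumor-tissue frequency and site weight.
    """
    total = 0.0
    for x, n, p, q, w in sites:
        f = min(max(pi * q + (1.0 - pi) * p, EPS), 1.0 - EPS)
        total += w * (x * math.log(f) + (n - x) * math.log1p(-f))
    return total


def chi2_sf_1df(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(max(stat, 0.0) / 2.0))


def likelihood_ratio_test(pi_hat, sites):
    """Assumed statistic: 2 * [l(pi_hat) - l(0)], compared against chi-square(1)."""
    stat = max(0.0, 2.0 * (weighted_loglik(pi_hat, sites) - weighted_loglik(0.0, sites)))
    return stat, chi2_sf_1df(stat)
```

Because π = 0 sits on the boundary of the parameter space, the chi-square reference with one degree of freedom is itself an approximation, and weights other than 1 change the scale of the statistic, so the resulting p-value is best read as indicative.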
  19. An analysis device for a method of detecting the presence and/or amount of a variant nucleic acid, the device comprising a determination module configured to determine the presence and/or amount of the variant nucleic acid based on somatic mutation sites and background mutation sites in a region to be tested of a sample to be tested, wherein the background mutation sites are determined by removing the somatic mutation sites from all mutation sites of the sample to be tested.
  20. A storage medium storing a program capable of executing the method of any one of claims 1-18.
PCT/CN2022/070974 2021-12-24 2022-01-10 Method for detecting variant nucleic acids WO2023115662A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111600502.4 2021-12-24
CN202111600502.4A CN114292912A (en) 2021-12-24 2021-12-24 Detection method of variant nucleic acid

Publications (1)

Publication Number Publication Date
WO2023115662A1 true WO2023115662A1 (en) 2023-06-29

Family

ID=80970042

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/070974 WO2023115662A1 (en) 2021-12-24 2022-01-10 Method for detecting variant nucleic acids

Country Status (2)

Country Link
CN (1) CN114292912A (en)
WO (1) WO2023115662A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114875118B (en) * 2022-06-30 2022-10-11 北京百图智检科技服务有限公司 Methods, kits and devices for determining cell lineage
CN116064755B (en) * 2023-01-12 2023-10-20 华中科技大学同济医学院附属同济医院 Device for detecting MRD marker based on linkage gene mutation
CN116356001B (en) * 2023-02-07 2023-12-15 江苏先声医学诊断有限公司 Dual background noise mutation removal method based on blood circulation tumor DNA

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107075730A (en) * 2014-09-12 2017-08-18 利兰·斯坦福青年大学托管委员会 The identification of circle nucleic acid and purposes
CN110383385A (en) * 2016-12-08 2019-10-25 生命科技股份有限公司 The method of mutational load is detected from tumor sample
CN111278993A (en) * 2017-09-15 2020-06-12 加利福尼亚大学董事会 Somatic cell mononucleotide variants from cell-free nucleic acids and applications for minimal residual lesion monitoring
CN112111565A (en) * 2019-06-20 2020-12-22 上海其明信息技术有限公司 Mutation analysis method and device for cell free DNA sequencing data
CN112218957A (en) * 2018-04-16 2021-01-12 格里尔公司 Systems and methods for determining tumor fraction in cell-free nucleic acids
CN112638152A (en) * 2018-09-05 2021-04-09 牛津大学科技创新有限公司 Methods or systems for identifying pathogenic mutations that result in a phenotype of interest in a test sample
CN113228190A (en) * 2018-12-23 2021-08-06 豪夫迈·罗氏有限公司 Tumor classification based on predicted tumor mutation burden

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WAN JONATHAN C. M., HEIDER KATRIN, GALE DAVINA, MURPHY SUZANNE, FISHER EYAL, MOULIERE FLORENT, RUIZ-VALDEPENAS ANDREA, SANTONJA AN: "ctDNA monitoring using patient-specific sequencing and integration of variant reads", SCIENCE TRANSLATIONAL MEDICINE, vol. 12, no. 548, 17 June 2020 (2020-06-17), pages eaaz8084, XP093073693, ISSN: 1946-6234, DOI: 10.1126/scitranslmed.aaz8084 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117144002A (en) * 2023-07-19 2023-12-01 苏州吉因加生物医学工程有限公司 Design method and application of personalized probe set for MRD detection
CN116676373A (en) * 2023-07-28 2023-09-01 臻和(北京)生物科技有限公司 Sample dilution factor quantification method and application thereof
CN116676373B (en) * 2023-07-28 2023-11-21 臻和(北京)生物科技有限公司 Sample dilution factor quantification method and application thereof

Also Published As

Publication number Publication date
CN114292912A (en) 2022-04-08


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22909008

Country of ref document: EP

Kind code of ref document: A1