WO2022262569A1 - Method for distinguishing somatic mutation and germline mutation - Google Patents

Method for distinguishing somatic mutation and germline mutation

Publication number
WO2022262569A1
WO2022262569A1 PCT/CN2022/096125 CN2022096125W WO2022262569A1 WO 2022262569 A1 WO2022262569 A1 WO 2022262569A1 CN 2022096125 W CN2022096125 W CN 2022096125W WO 2022262569 A1 WO2022262569 A1 WO 2022262569A1
Authority
WO
WIPO (PCT)
Prior art keywords
wild
mutant
fragment
length
fragments
Prior art date
Application number
PCT/CN2022/096125
Other languages
French (fr)
Chinese (zh)
Inventor
刘成林
王俊
张周
揣少坤
汉雨生
Original Assignee
广州燃石医学检验所有限公司
Priority date
Filing date
Publication date
Application filed by 广州燃石医学检验所有限公司
Publication of WO2022262569A1

Classifications

    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6883 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
    • C12Q1/6886 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material for cancer
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/50 Mutagenesis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Definitions

  • the present application relates to the field of biological information, in particular to a method for distinguishing somatic mutations and germline mutations.
  • cfDNA is widely present in the plasma of tumor patients and includes a small amount of tumor-specific ctDNA. ctDNA differs from other, normal cfDNA in the way it is sheared during cell senescence and apoptosis. In other words, the fragmentation patterns of ctDNA and of other conventional cfDNA in plasma cell-free DNA are different. Differences in this distribution pattern can therefore serve as markers for recognizing ctDNA.
  • Somatic mutations are non-heritable variations, distinct from germline mutations, that gradually accumulate over the human life cycle. Somatic mutation is an important marker of tumor formation because it is closely related to the molecular signaling pathways of tumorigenesis. Germline mutations are heritable mutations that occur in germ cells, and are of great significance to the study of genetic diseases and genome evolution. The "Tumor Mutation Burden Detection and Clinical Application Chinese Expert Consensus (2020 Edition)" states that, in the standardization requirements for the Tumor Mutation Burden (TMB) algorithm, the core element is the detection and counting of somatic mutations that can affect protein coding.
  • control samples peripheral blood or paracancerous tissues
  • the present application provides a method for distinguishing somatic mutations and germline mutations, a method for identifying ctDNA in cfDNA, and devices and applications corresponding to the methods.
  • the method and/or device described in this application has at least one of the following characteristics: (1) it needs only a single sample, that is, a sample from the subject; (2) it is widely applicable and can be used to identify somatic mutations, and/or to identify ctDNA, in different cancer types; (3) high sensitivity; (4) high accuracy, for example, in addition to mutation databases, population frequency, and mutation abundance, multiple factors can be combined in the method described in this application to improve the reliability of the discrimination results; (5) it is easy to implement, with no limit on the number of mutation sites; (6) it is fast to carry out, for example, the plasma of the subject can be used as the sample; (7) it introduces a new dimension of distinction.
  • the application provides a method for distinguishing between a somatic mutation and a germline mutation, comprising the steps of:
  • the mutation site is obtained by gene sequencing; (2) for each mutation site, obtaining a wild-type supporting fragment and a mutant supporting fragment, wherein the wild-type supporting fragment is a cfDNA fragment comprising a wild-type base sequence and the mutant supporting fragment is a cfDNA fragment comprising a mutant base sequence; the wild-type base sequence is the sequence that is identical to the nucleotide sequence at the position of the mutation site in the reference genome, and the mutant base sequence is a sequence that differs from the nucleotide sequence at that position in the reference genome; the reference genome is the human reference genome used in the gene sequencing; (3) for each mutation site, obtaining the number of wild-type supporting fragments of at least one length and the corresponding number of mutant supporting fragments of the same length, and calculating the ratio WC of the number of wild-type supporting fragments of this length to the total number of wild-
  • the application provides a method for identifying ctDNA in cfDNA, comprising the following steps:
  • the mutation site is obtained by gene sequencing; (2) for each mutation site, obtaining a wild-type supporting fragment and a mutant supporting fragment, wherein the wild-type supporting fragment is a cfDNA fragment comprising a wild-type base sequence and the mutant supporting fragment is a cfDNA fragment comprising a mutant base sequence; the wild-type base sequence is the sequence that is identical to the nucleotide sequence at the position of the mutation site in the reference genome, and the mutant base sequence is a sequence that differs from the nucleotide sequence at that position in the reference genome; the reference genome is the human reference genome used in the gene sequencing; (3) for each mutation site, obtaining the number of wild-type supporting fragments of at least one length and the corresponding number of mutant supporting fragments of the same length, and calculating the ratio WC of the number of wild-type supporting fragments of this length to the total number of wild-
  • the application provides a method for training a machine learning model, comprising the following steps:
  • the mutation site is obtained by gene sequencing; (2) for each mutation site, obtaining a wild-type supporting fragment and a mutant supporting fragment, wherein the wild-type supporting fragment is a cfDNA fragment comprising a wild-type base sequence and the mutant supporting fragment is a cfDNA fragment comprising a mutant base sequence; the wild-type base sequence is the sequence that is identical to the nucleotide sequence at the position of the mutation site in the reference genome, and the mutant base sequence is a sequence that differs from the nucleotide sequence at that position in the reference genome; the reference genome is the human reference genome used in the gene sequencing; (3) for each mutation site, obtaining the number of wild-type supporting fragments of at least one length and the corresponding number of mutant supporting fragments of the same length, and calculating the ratio WC of the number of wild-type supporting fragments of this length to the total number of wild-
  • the application provides a method for establishing a database, which includes the following steps:
  • the mutation site is obtained by gene sequencing; (2) for each mutation site, obtaining a wild-type supporting fragment and a mutant supporting fragment, wherein the wild-type supporting fragment is a cfDNA fragment comprising a wild-type base sequence and the mutant supporting fragment is a cfDNA fragment comprising a mutant base sequence; the wild-type base sequence is the sequence that is identical to the nucleotide sequence at the position of the mutation site in the reference genome, and the mutant base sequence is a sequence that differs from the nucleotide sequence at that position in the reference genome; the reference genome is the human reference genome used in the gene sequencing; (3) for each mutation site, obtaining the number of wild-type supporting fragments of at least one length and the corresponding number of mutant supporting fragments of the same length, and calculating the ratio WC of the number of wild-type supporting fragments of this length to the total number of wild-
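Step (3) is cut off in this extract, but the device description in this document defines WC as the share of wild-type supporting fragments having a given length, MC as the same share among mutant supporting fragments, and then takes their difference. A minimal Python sketch of that calculation follows. The function names are hypothetical, and the sign convention (WC minus MC) is an assumption, since the text only says "the difference between the ratio WC and the ratio MC".

```python
from collections import Counter


def length_ratios(lengths):
    """Map each fragment length to its share of all fragments in the group."""
    counts = Counter(lengths)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {length: n / total for length, n in counts.items()}


def ratio_differences(wild_lengths, mutant_lengths):
    """Per-length difference between WC and MC for one mutation site."""
    wc = length_ratios(wild_lengths)    # ratio WC per length
    mc = length_ratios(mutant_lengths)  # ratio MC per length
    # Assumption: difference = WC - MC; the source does not fix the sign.
    return {
        length: wc.get(length, 0.0) - mc.get(length, 0.0)
        for length in sorted(set(wc) | set(mc))
    }


# Toy fragment lengths (in nucleotides) for a single site.
diffs = ratio_differences([166, 166, 167, 150], [150, 150, 166, 140])
```

Lengths seen in only one group are given a ratio of zero in the other, so the two distributions are always compared over the same set of lengths.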
  • the gene sequencing includes next generation gene sequencing (NGS).
  • the methods use only samples derived from the subject.
  • the sample comprises a blood sample.
  • the method further comprises the step of obtaining a sample from the subject.
  • the mutation site comprises a single nucleotide variation (SNV).
  • the mutation site comprises more than two nucleotide variations.
  • the wild-type supporting fragment and/or the mutant supporting fragment range in length from about 1 nucleotide to about 550 nucleotides.
  • the wild-type supporting fragment and/or the mutant supporting fragment range in length from about 1 nucleotide to about 400 nucleotides.
  • the wild-type supporting fragment and/or the mutant supporting fragment range in length from about 1 nucleotide to about 200 nucleotides.
  • the method includes the following steps: (4') obtaining the distribution of the differences from step (3), selecting the maximum value in the distribution as Dev(Max), and using Dev(Max) as the indicator for the distinction and/or as the training sample.
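As an illustration of step (4'), the sketch below takes per-length WC and MC ratios and returns the largest difference as Dev(Max). It is only a sketch under assumptions: the function name `dev_max` is hypothetical, and the difference is again taken as WC minus MC, which the text does not state explicitly.

```python
def dev_max(wc, mc):
    """Return (length, Dev(Max)) for one mutation site.

    wc and mc map fragment length -> ratio, as defined for WC and MC.
    """
    lengths = sorted(set(wc) | set(mc))
    if not lengths:
        raise ValueError("no fragment lengths supplied")
    # Assumption: difference = WC - MC for each length.
    diffs = {n: wc.get(n, 0.0) - mc.get(n, 0.0) for n in lengths}
    best_length = max(lengths, key=lambda n: diffs[n])
    return best_length, diffs[best_length]


wc = {150: 0.25, 166: 0.5, 167: 0.25}
mc = {140: 0.25, 150: 0.5, 166: 0.125, 167: 0.125}
best_length, value = dev_max(wc, mc)
```

Returning the length alongside the value is a convenience for inspection; only the value itself is described as the indicator.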
  • the method includes the following steps: (4') obtaining the distribution of the difference in step (3), which is referred to as the first distribution.
  • the method includes the following steps: (5) within the length range of the effective fragment interval, sequentially accumulating each difference in the first distribution to obtain an added value, wherein the length of the effective fragment interval covers the length of the nucleic acid sequence wound around the nucleosome.
  • the nucleic acid sequence is capable of wrapping around a nucleosome for more than 2 turns, or capable of wrapping around a nucleosome for less than 1 turn.
  • the length of the effective fragment interval is about 1 to about 167 nucleotides, and/or, about 200 or more nucleotides.
  • the length of the effective fragment interval is about 1 to about 167 nucleotides, and/or, about 250 to about 400 nucleotides.
  • the method includes the following steps: (6) Obtaining the second distribution of the added value in step (5), and calculating the maximum value of the added value in the second distribution.
  • the maximum value of the added value is used as Dev(Max), and the Dev(Max) is used as the indicator of the distinction and/or as the training sample.
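Steps (5) and (6) can be read as a running sum of the per-length differences inside an effective fragment interval, followed by taking the maximum of that running sum. The sketch below follows that reading. The function name and the default interval of 1 to 167 nucleotides (one of the ranges named above) are illustrative choices, and treating lengths with no observed fragments as a zero difference is an assumption.

```python
from itertools import accumulate


def dev_max_cumulative(differences, interval=(1, 167)):
    """Maximum of the running sum of differences inside an effective interval.

    differences maps fragment length -> difference between WC and MC.
    """
    low, high = interval
    # First distribution, restricted to the effective fragment interval.
    in_range = [differences.get(n, 0.0) for n in range(low, high + 1)]
    # Second distribution: the added value after each successive length.
    added_values = list(accumulate(in_range))
    return max(added_values)


differences = {100: 0.1, 120: -0.05, 150: 0.2, 300: 0.5}
result = dev_max_cumulative(differences)
long_result = dev_max_cumulative(differences, interval=(250, 400))
```

The value at length 300 is ignored in the first call because it lies outside the 1 to 167 interval, and it is the only contribution in the second call.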
  • the difference is smoothed, wherein the smoothing includes the following steps:
  • the smoothing window value is an integer from about 2 to about 6.
  • the smoothing window value is 3.
  • the smoothing process includes the following steps: (f) obtaining the first distribution of the average difference values in step (e).
  • the smoothing process includes the following steps: (g) within the length range of the effective fragment interval, sequentially accumulating each average difference in the first distribution to obtain an added value, wherein the length of the effective fragment interval is the length of the nucleic acid sequence wound around the nucleosome.
  • the nucleic acid sequence is capable of wrapping around a nucleosome for more than 2 turns, or capable of wrapping around a nucleosome for less than 1 turn.
  • the length of the effective fragment interval is about 1 to about 167 nucleotides, and/or, about 200 or more nucleotides.
  • the length of the effective fragment interval is about 1 to about 167 nucleotides, and/or, about 250 to about 400 nucleotides.
  • the smoothing process includes the following steps: (h) obtaining a second distribution of the added value in step (g), and calculating the maximum value of the added value in the second distribution.
  • the maximum value of the added value is used as Dev(Max), and the Dev(Max) is used as the indicator of the distinction and/or as the training sample.
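The individual smoothing steps are not fully reproduced in this extract, so the following is only one plausible reading: for every starting length, average the differences over a window of consecutive lengths (window value 3 in the example above). The function name and the choice to average per-length differences, rather than pooling fragment counts inside the window first, are assumptions.

```python
def smooth_differences(differences, min_len, max_len, window=3):
    """Average the differences over sliding windows of consecutive lengths.

    Returns a dict keyed by the starting length of each window.
    """
    if window < 1:
        raise ValueError("window must be a positive integer")
    smoothed = {}
    # Each window covers the lengths start, start + 1, ..., start + window - 1.
    for start in range(min_len, max_len - window + 2):
        values = [differences.get(start + k, 0.0) for k in range(window)]
        smoothed[start] = sum(values) / window
    return smoothed


smoothed = smooth_differences({1: 0.3, 2: 0.0, 3: 0.3, 4: 0.6}, 1, 4, window=3)
```

The smoothed values can then be accumulated over the effective fragment interval in the same way as the unsmoothed differences.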
  • the index also includes one or more parameters selected from the following group: the chromosomal position of the mutation site, the base substitution pattern of the mutation site, the count of nucleic acid fragments of each length in the wild type of the mutation site and/or the count of nucleic acid fragments of each length in the mutant type of the mutation site, the allelic variation of the mutation site, the age of the subject, and the mutation type of the mutation site.
  • the index also includes one or more parameters selected from the following group: the chromosomal position of the SNV site, the base substitution pattern of the SNV site, the count of nucleic acid fragments of each length in the wild type of the SNV site and/or the count of nucleic acid fragments of each length in the mutant type of the SNV site, the allelic variation of the SNV site, the age of the subject, and the mutation type of the SNV site.
  • detecting the mutation site comprises the following steps:
  • (1) obtaining data from the sample; (2) performing variation identification on the data obtained in step (1); (3) performing variation annotation on the variations identified in step (2); and (4) filtering the variations annotated in step (3) to obtain the mutation site; optionally, performing quality control on the mutation site.
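The four detection steps can be pictured as a small pipeline. The sketch below is a toy stand-in written in plain Python, not the tooling used in the application: every function name, the per-position base counts used as input, and the thresholds are hypothetical.

```python
def call_variants(pileup, reference, min_alt_reads=2):
    """Step (2): report non-reference bases supported by enough reads."""
    variants = []
    for pos, base_counts in sorted(pileup.items()):
        ref = reference[pos]
        depth = sum(base_counts.values())
        for base, n in sorted(base_counts.items()):
            if base != ref and n >= min_alt_reads:
                variants.append({"pos": pos, "ref": ref, "alt": base,
                                 "abundance": n / depth})
    return variants


def annotate(variants, population_freq):
    """Step (3): attach a population frequency to each variant (0.0 if unknown)."""
    return [dict(v, pop_freq=population_freq.get((v["pos"], v["alt"]), 0.0))
            for v in variants]


def filter_variants(annotated, min_abundance=0.01):
    """Step (4): keep variants whose abundance passes a hypothetical threshold."""
    return [v for v in annotated if v["abundance"] >= min_abundance]


# Step (1): toy data, base counts per position plus the reference base.
pileup = {10: {"A": 98, "T": 2}, 20: {"C": 100}, 30: {"G": 50, "A": 50}}
reference = {10: "A", 20: "C", 30: "G"}
sites = filter_variants(annotate(call_variants(pileup, reference),
                                 {(30, "A"): 0.12}))
```

Note that the filter here does not discard variants with a high population frequency, since the later steps are meant to separate germline from somatic sites rather than remove them up front.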
  • the present application provides a device for distinguishing somatic mutations and germline mutations, which includes: a calculation module for calculating the difference between the ratio WC and the ratio MC for the same length, wherein, for each mutation site, the ratios are based on the number of wild-type supporting fragments of at least one length and the corresponding number of mutant supporting fragments of the same length; the ratio WC is the ratio of the number of wild-type supporting fragments of one length to the total number of wild-type supporting fragments; the ratio MC is the ratio of the number of corresponding mutant supporting fragments of the same length to the total number of mutant supporting fragments; the wild-type supporting fragment is a cfDNA fragment comprising a wild-type base sequence, and the mutant supporting fragment is a cfDNA fragment comprising a mutant base sequence; the wild-type base sequence is the sequence that is identical to the nucleotide sequence at the position of the mutation site in the reference genome,
  • the present application provides a device for identifying ctDNA in cfDNA, which includes:
  • a calculation module configured to calculate the difference between the ratio WC and the ratio MC for the same length, wherein, for each mutation site, the ratios are based on the number of wild-type supporting fragments of at least one length and the corresponding number of mutant supporting fragments of the same length; the ratio WC is the ratio of the number of wild-type supporting fragments of one length to the total number of wild-type supporting fragments; the ratio MC is the ratio of the number of corresponding mutant supporting fragments of the same length to the total number of mutant supporting fragments; the wild-type supporting fragment is a cfDNA fragment comprising a wild-type base sequence, and the mutant supporting fragment is a cfDNA fragment comprising a mutant base sequence; the wild-type base sequence is the sequence that is identical to the nucleotide sequence at the position of the mutation site in the reference genome, and the mutant base sequence is, compared with the nucleotide sequence at the position of the mutation site in the reference genome,
  • the application provides a training device for a machine learning model, which includes:
  • a calculation module configured to calculate the difference between the ratio WC and the ratio MC for the same length, wherein, for each mutation site, the ratios are based on the number of wild-type supporting fragments of at least one length and the corresponding number of mutant supporting fragments of the same length; the ratio WC is the ratio of the number of wild-type supporting fragments of one length to the total number of wild-type supporting fragments; the ratio MC is the ratio of the number of corresponding mutant supporting fragments of the same length to the total number of mutant supporting fragments; the wild-type supporting fragment is a cfDNA fragment comprising a wild-type base sequence, and the mutant supporting fragment is a cfDNA fragment comprising a mutant base sequence; the wild-type base sequence is the sequence that is identical to the nucleotide sequence at the position of the mutation site in the reference genome, and the mutant base sequence is, compared with the nucleotide sequence at the position of the mutation site in the reference genome,
  • the device uses only samples derived from the subject.
  • the device further includes: an output module, used to display the identification result of the somatic mutation and/or the identification result of ctDNA in the cfDNA generated by the determination module.
  • the device further includes a sample obtaining module, configured to obtain the sample from the subject.
  • the sample comprises a blood sample.
  • the sample obtaining module includes reagents and/or instruments for obtaining the sample.
  • the device further includes a data receiving module, configured to obtain the mutation site in the sample.
  • the mutation site comprises a single nucleotide variation (SNV).
  • the mutation site comprises more than two nucleotide variations.
  • detecting the mutation site in the device comprises the following steps:
  • (1) obtaining data from the sample; (2) performing variation identification on the data obtained in step (1); (3) performing variation annotation on the variations identified in step (2); and (4) filtering the variations annotated in step (3) to obtain the mutation site; optionally, performing quality control on the mutation site.
  • the gene sequencing includes next generation gene sequencing (NGS).
  • the data receiving module includes reagents and/or instruments required for the gene sequencing.
  • the device further includes an input module, configured to obtain the number of wild-type supporting fragments of the at least one length, and/or the corresponding number of mutant supporting fragments of the same length.
  • the input module is capable of distinguishing between the wild-type supporting fragment and the mutant supporting fragment.
  • the input module counts the number of wild-type supporting fragments of different lengths, and counts the number of mutant supporting fragments of different lengths.
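What the input module does for a single SNV site can be sketched as follows. This is a minimal illustration under assumptions: each fragment is represented as a hypothetical pair of (base observed at the site, fragment length), and fragments carrying neither the reference base nor the expected alternative base are simply skipped.

```python
from collections import Counter


def count_supporting_fragments(fragments, ref_base, alt_base):
    """Split fragments covering one SNV site into per-length counts.

    fragments is an iterable of (base_at_site, fragment_length) pairs.
    Returns (wild_counts, mutant_counts), each mapping length -> count.
    """
    wild_counts, mutant_counts = Counter(), Counter()
    for base_at_site, length in fragments:
        if base_at_site == ref_base:
            wild_counts[length] += 1      # wild-type supporting fragment
        elif base_at_site == alt_base:
            mutant_counts[length] += 1    # mutant supporting fragment
        # Any other base is ignored in this sketch.
    return wild_counts, mutant_counts


fragments = [("C", 166), ("C", 166), ("T", 145), ("C", 167), ("T", 145), ("G", 160)]
wild_counts, mutant_counts = count_supporting_fragments(fragments, "C", "T")
```

The two counters are the per-length quantities from which the ratios WC and MC are later derived.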
  • the wild-type supporting fragment and/or the mutant supporting fragment range in length from about 1 nucleotide to about 550 nucleotides.
  • the wild-type supporting fragment and/or the mutant supporting fragment range in length from about 1 nucleotide to about 400 nucleotides.
  • the wild-type supporting fragment and/or the mutant supporting fragment range in length from about 1 nucleotide to about 200 nucleotides.
  • the calculation module obtains the distribution of the differences, selects the maximum value in the distribution as Dev(Max), and uses Dev(Max) as the indicator for the distinction and/or as the training sample.
  • the difference is smoothed in the calculation module, wherein the smoothing process includes the following steps: (a) determining a smoothing window value, wherein the smoothing window value is an integer from about 1 to about 10; (b) determining several smoothing sampling length ranges, each containing a number of length values equal to the smoothing window value, wherein the minimum value of each smoothing sampling length range is its starting length, and the starting length lies within the range of lengths of the wild-type supporting fragment and/or the mutant supporting fragment; (c) obtaining the number of wild-type supporting fragments of at least one smoothing sampling length in any smoothing sampling length range, and obtaining the corresponding number of mutant supporting fragments of the same length,
  • the smoothing window value is an integer from about 2 to about 6.
  • the smoothing window value is 3.
  • the smoothing process includes the following steps: (f) obtaining the first distribution of the average difference values in step (e).
  • the smoothing process includes the following steps: (g) within the length range of the effective fragment interval, sequentially accumulating each average difference in the first distribution to obtain an added value, wherein the length of the effective fragment interval covers the length of the nucleic acid sequence wound around the nucleosome.
  • the nucleic acid sequence is capable of wrapping around a nucleosome for more than 2 turns, or capable of wrapping around a nucleosome for less than 1 turn.
  • the length of the effective fragment interval is about 1 to about 167 nucleotides, and/or, about 200 or more nucleotides.
  • the length of the effective fragment interval is about 1 to about 167 nucleotides, and/or, about 250 to about 400 nucleotides.
  • the smoothing process includes the following steps: (h) obtaining a second distribution of the added value in step (g), and calculating the maximum value of the added value in the second distribution.
  • the maximum value of the added value is used as Dev(Max), and the Dev(Max) is used as the indicator of the distinction and/or as the training sample.
  • the calculation module outputs the Dev(Max).
  • the index and/or training samples also include one or more parameters selected from the following group: the chromosomal position of the mutation site, the base substitution pattern of the mutation site, the count of nucleic acid fragments of each length in the wild type of the mutation site and/or the count of nucleic acid fragments of each length in the mutant type of the mutation site, the allelic variation of the mutation site, the age of the subject, and the mutation type of the mutation site.
  • the index and/or training samples also include one or more parameters selected from the following group: the chromosomal position of the SNV site, the base substitution pattern of the SNV site, the count of nucleic acid fragments of each length in the wild type of the SNV site and/or the count of nucleic acid fragments of each length in the mutant type of the SNV site, the allelic variation of the SNV site, the age of the subject, and the mutation type of the SNV site.
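One way to picture a training sample is as a single row that combines Dev(Max) with the optional parameters listed above. The sketch below only assembles such a row. The function name, field names, and the "C>T" style encoding of the base substitution pattern are hypothetical, and no particular machine learning library or model is implied.

```python
def build_feature_row(dev_max, chromosome, position, ref_base, alt_base,
                      allelic_variation, age, mutation_type, label=None):
    """Collect the per-site parameters into one training or prediction row."""
    row = {
        "dev_max": dev_max,
        "chromosome": chromosome,
        "position": position,
        "substitution": f"{ref_base}>{alt_base}",  # base substitution pattern
        "allelic_variation": allelic_variation,
        "age": age,
        "mutation_type": mutation_type,
    }
    if label is not None:
        row["label"] = label  # e.g. "somatic" or "germline" for training rows
    return row


row = build_feature_row(0.375, "chr17", 7675088, "C", "T",
                        0.04, 63, "missense", label="somatic")
```

Categorical fields such as the chromosome and the substitution pattern would still need to be encoded numerically before most models can use them.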
  • the present application provides an electronic device, including a memory and a processor coupled to the memory, the processor being configured to execute, based on instructions stored in the memory, the methods described in the present application.
  • the present application provides a non-volatile computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method for distinguishing somatic mutations and germline mutations described in the present application, the method for identifying ctDNA in cfDNA described in the present application, or the training method of the machine learning model described in the present application.
  • the present application provides a database system, which includes a memory and a processor coupled to the memory, the processor being configured to execute, based on instructions stored in the memory, the method for distinguishing somatic mutations and germline mutations described in this application, the method for identifying ctDNA in cfDNA described in this application, or the method for establishing a database described in this application.
  • the present application provides an application of the method for distinguishing somatic mutations and germline mutations described in the present application in the management of tumor families.
  • the present application provides an application of the method for distinguishing somatic mutations and germline mutations described in the present application in the detection of tumor mutation burden (TMB).
  • Figure 1 shows an overview of the training set used to train the machine learning model with the method described in this application, and of the validation set used to verify that the method described in this application can distinguish between somatic mutations and germline mutations.
  • Figure 2 shows the machine training results of the machine learning model obtained using the method described in this application.
  • Figure 3 shows how the machine learning model obtained by the method described in this application distinguishes between somatic mutations and germline mutations in validation set 1.
  • Figure 4 shows how the machine learning model obtained by the method described in this application distinguishes between somatic mutations and germline mutations in validation set 2.
  • Figure 5 shows that using the method described in this application can distinguish between somatic mutations and germline mutations for different tumor types.
  • Figure 6 shows the AUC results of the method described in this application for distinguishing somatic mutations from germline mutations.
  • Figure 7 shows the AUC results of the method described in this application for distinguishing somatic mutations from germline mutations.
  • Figure 8 shows the distribution of the lengths of the wild-type supporting fragment and the mutant supporting fragment for a mutation site.
  • Figure 9 shows the distribution of the lengths of the wild-type supporting fragment and the mutant supporting fragment for a mutation site.
  • Figure 10 shows the distribution of the lengths of the wild-type supporting fragment and the mutant supporting fragment for a mutation site.
  • the term "somatic mutation" generally refers to an acquired class of mutations that occur in non-embryonic cells.
  • the somatic mutation may include a genetic change occurring in a somatic tissue (eg, a cell outside the germline).
  • the somatic mutations may include point mutations (for example, the exchange of a single nucleotide for another nucleotide, such as silent mutations, missense mutations, and nonsense mutations), insertions and deletions (for example, addition and/or removal of one or more nucleotides, such as indels), amplifications, gene duplications, copy number alterations (CNAs), rearrangements, and splice variants.
  • the somatic mutations may be closely related to the processes of cell growth, programming, senescence and apoptosis.
  • the somatic mutations may be associated with alterations in signaling pathways in tumorigenesis, angiogenesis and/or tumor metastasis.
  • the term "germline mutation" generally refers to a heritable mutation that occurs in a germ cell (eg, egg or sperm).
  • the germline mutation can be passed on to progeny, eg, can be incorporated into the DNA of every cell (eg, germline and somatic) in the progeny.
  • the germline mutation may be less associated with tumorigenesis.
  • the germline mutation can serve as a "baseline" in TMB analysis.
  • the term "gene sequencing” generally refers to the technique used to determine the order of the nucleotide bases adenine, guanine, cytosine and thymine in a DNA molecule.
  • the gene sequencing may include first-generation gene sequencing, second-generation gene sequencing, third-generation gene sequencing or single-molecule sequencing (SMS).
  • Second-generation or next-generation gene sequencing may refer to techniques that use advanced techniques (optical) to detect base position methods while generating many sequences (see, for example, Metzker, 2009 for a review).
  • next-generation sequencing is a high-throughput sequencing technology (High-throughput sequencing), which can parallelize hundreds of thousands to millions of DNA at a time. Molecules are sequenced, generally with short read lengths.
  • massively parallel signature sequencing (MPSS)
  • polymerase cloning
  • 454 pyrosequencing
  • Illumina (Solexa) sequencing
  • ion semiconductor sequencing
  • DNA nanoball sequencing
  • Complete Genomics' DNA nanoarray and combined probe anchor ligation sequencing etc.
  • the second-generation gene sequencing makes it possible to analyze the transcriptome and genome of a species in detail, so it is also called deep sequencing.
  • the term "mutation site” generally refers to the site where there is a nucleotide difference compared with the nucleotide sequence of the control sequence.
  • the control sequence may be a reference sequence used in gene sequencing (for example, it may be a human reference genome).
  • the mutation site may include at least one (for example, 1, 2, 3, 4 or more) difference in the nucleotide sequence at the site (for example, the difference may include nucleotide substitutions, duplications, deletions and/or additions).
  • the mutation site may include a nucleotide mutation at at least one nucleotide position.
  • the nucleotide mutation can be a natural mutation or an artificial mutation.
  • the mutation site may comprise a single nucleotide variation (SNV).
  • wild-type base sequence generally refers to a sequence that is identical to the nucleotide sequence at the corresponding position of the mutation site in a reference genome (for example, a human reference genome).
  • the wild-type base sequence may be the nucleotide sequence at the corresponding position of the mutation site in the human reference genome.
  • the wild-type base sequence may not contain the mutation site.
  • mutant base sequence generally refers to a sequence that is different from the nucleotide sequence at the corresponding position of the mutation site in a reference genome (for example, it may be a human reference genome). In some cases, for a specific mutation site described in this application, the mutant base sequence may contain the mutation site.
  • wild-type supporting fragment generally refers to a cfDNA fragment comprising the wild-type base sequence described in this application.
  • the wild-type supporting fragments may have different sequence lengths.
  • the wild-type supporting fragment may not contain the mutation site.
  • the wild-type support fragment may not contain the mutation site, but for another mutation site described in the application, the wild-type supporting fragment may or may not contain the other mutation site.
  • the term "the length of the wild-type supporting fragment” refers to the length of the wild-type supporting fragment described in this application, and the unit is the number of "nucleotides”.
  • mutant supporting fragment generally refers to a cfDNA fragment comprising the mutated base sequence described in the present application.
  • the mutant support fragment may contain the mutation site.
  • the mutant support fragment may contain the mutation site, but for another mutation site described in the application, the mutant supporting fragment may or may not contain the other mutation site.
  • length of the mutant supporting fragment refers to the length of the mutant supporting fragment described in this application, and the unit is the number of "nucleotides”.
  • human reference genome generally refers to the human genome that can function as a reference in gene sequencing.
  • the information of the human reference genome can refer to UCSC (http://genome.ucsc.edu/index.html).
  • the human reference genome can have different versions, for example, it can be hg19, GRCh37 or Ensembl 75.
  • the term "at the corresponding position” generally refers to the position of at least one specific base in one sequence, and the position of the specific base in the other sequence.
  • the corresponding position can be the nucleotide position of the mutation site in the wild-type base sequence or the mutant base sequence described in the application, together with the position of that mutation site in the reference genome described in the application.
  • for example, if the mutation site is the 100th nucleotide in the mutant base sequence, the corresponding position can be the 100th nucleotide of the corresponding sequence in the reference genome.
  • cfDNA is generally an abbreviation of cell-free DNA, and may refer to free DNA in plasma.
  • the cfDNA can be an extracellular DNA fragment located in the peripheral circulation.
  • ctDNA generally refers to circulating tumor DNA.
  • ctDNA is a fragment of tumor-derived DNA that is not associated with cells in the blood.
  • the ctDNA can be produced by the entry of genomes in apoptotic or necrotic tumor cells into the blood.
  • the ctDNA may carry specific gene characteristics of primary tumor or metastatic tumor.
  • the ctDNA can be considered as a special kind of the cfDNA.
  • machine learning model generally refers to a system or collection of program instructions and/or data configured to implement an algorithm, process, or mathematical model.
  • the algorithm, process or mathematical model can predict and provide a desired output based on a given input.
  • the parameters of the machine learning model may not be explicitly programmed, and in the traditional sense, the machine learning model may not be explicitly designed to follow specific rules in order to provide the desired output.
  • the use of the machine learning model may mean that the machine learning model and/or the data structure/set of rules being the machine learning model are trained by a machine learning algorithm.
  • database generally refers to an organized entity of related data, regardless of the manner in which the data or the organized entity is represented.
  • the organized entity of related data may take the form of a table, map, grid, group, datagram, file, document, list, or any other form.
  • the database may include any data collected and stored in a computer-accessible manner.
  • single nucleotide variation generally refers to a variation of a single nucleotide at a specific position in a genome, where that nucleotide differs from the nucleotide at the corresponding position in a reference genome (such as the human reference genome described in this application); the difference may be, for example, a substitution, duplication, deletion, or addition of one nucleotide.
  • the term “smoothing” generally refers to a method of data processing that reduces the deviation between one or more of the differences described herein.
  • the smoothing process may include obtaining an average value of a certain number of difference values described in this application.
  • the smoothing process may include selecting different lengths (for example, the smoothed sampling lengths described in this application) at a certain interval length (for example, the smoothing window value described in this application), counting the wild-type support fragments and/or the mutant support fragments of each selected length, and calculating the difference between the ratio of the wild-type count to the total number of wild-type support fragments and the ratio of the mutant count to the total number of mutant support fragments.
  • the smoothing process may include dividing the accumulated value of the difference within a certain length range by the interval length to obtain a ratio. For example, said ratio may be considered as the average difference of said differences over the length range.
  • the term "smoothing window value" generally refers to the interval, in nucleotides, between the selected lengths of the wild-type support fragments and/or mutant support fragments in the smoothing process described in the present application. For example, if the lengths selected in the smoothing process are 1, 4, 7, 10, 13 ... nucleotides in sequence, the smoothing window value is 3.
  • the smoothing window value may be an integer of about 1-30, for example, may be 1, 2, 3, 4, 5, 6, 7, 8, 9 or 10. For example, it can be 1, 2, 3, 4, 5 or 6.
  • the term "smoothed sampling length" generally refers to the length value of the wild-type support fragments selected for counting, and/or the length value of the mutant support fragments selected for counting, in the smoothing process described in the present application.
  • the smoothed sampling length may be the length value of each support fragment within the smoothed sampling length range within the length range of the wild-type support fragment and/or the mutant support fragment described in the present application.
  • each smoothing sampling length range can run from the initial length (for example, starting from a length of 1 nucleotide) to the maximum value of the range (for example, the initial length + (smoothing window value - 1)), and the smoothed sampling lengths are the length values of the supporting fragments within that range.
  • for example, if the smoothing window value is 3 and the initial length is 1 nucleotide, the smoothing sampling length ranges can be 1-3, 4-6, 7-9...;
  • for example, if the smoothing window value is 3 and the initial length is 1 nucleotide, the smoothing sampling length ranges can also be 1-3, 2-4, 3-5....
  • the initial length may also be a length other than 1 (for example, it may start from a length of 2 nucleotides).
  • for example, if the smoothing window value is 3 and the initial length is 2 nucleotides, the smoothing sampling length ranges can be 2-4, 5-7, 8-10...;
  • the smoothing sampling length ranges may also be 2-4, 3-5, 4-6....
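The range examples above can be reproduced with a short Python sketch. The function name `sampling_ranges`, the `sliding` flag and the `max_length` cap are illustrative assumptions and do not appear in the application; the sketch only enumerates ranges of the form initial length to initial length + (smoothing window value - 1).

```python
def sampling_ranges(start, window, max_length, sliding=False):
    """Return (low, high) smoothing sampling length ranges.

    start      -- initial length in nucleotides (e.g. 1 or 2)
    window     -- smoothing window value (e.g. 3)
    max_length -- largest fragment length considered (hypothetical cap)
    sliding    -- False: consecutive ranges (1-3, 4-6, ...);
                  True: ranges shifted by one nucleotide (1-3, 2-4, ...)
    """
    step = 1 if sliding else window
    ranges = []
    low = start
    # Each range spans exactly `window` lengths: low .. low + window - 1.
    while low + window - 1 <= max_length:
        ranges.append((low, low + window - 1))
        low += step
    return ranges


print(sampling_ranges(1, 3, 9))                # consecutive ranges
print(sampling_ranges(2, 3, 6, sliding=True))  # overlapping ranges
```

Whether consecutive or overlapping ranges are preferable is not stated in the text; the flag simply covers both variants listed above.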
  • first distribution generally refers to the distribution of the average difference of each smoothed sampling length range described in the present application.
  • first distribution may be a collection of average differences described in the present application.
  • the term "the length of a nucleic acid sequence that winds around a nucleosome” generally refers to the length required for a nucleic acid sequence to wind around a nucleosome.
  • the nucleic acid sequence may wrap around the nucleosome a certain number of turns (for example, it may wrap within one turn, or may wrap more than two turns).
  • the term "the length of the effective fragment interval" generally refers to the range of the length corresponding to the wild-type supporting fragment and/or the mutant supporting fragment required for calculating the addition value described in the present application.
  • the term "second distribution" generally refers to the distribution of addition values described in the present application. In some cases, the second distribution may be a collection of the added values described in this application.
  • the term "calculation module” generally refers to a functional module for calculating the difference between the number of wild-type support fragments described in this application and the number of mutant support fragments described in this application with the same length.
  • the calculation module can input the number of wild-type supporting fragments described in the present application, and correspondingly the number of mutant-type supporting fragments of the same length.
  • the calculation module can output the difference value described in this application. For example, Dev(Max) described in this application may be output.
  • the smoothing process described in this application can be performed.
  • the term "judgment module" generally refers to a machine learning model that has been trained by machine learning to obtain relevant judgment results (for example, the judgment results may include the recognition results of somatic mutations described in this application, and/or the judgment result of recognizing ctDNA in the cfDNA described in this application).
  • the judging module may input the difference described in this application (for example, the Dev(Max)).
  • the judging module can output the related judging result.
  • the machine learning model can be used for judging.
  • the term "training module” generally refers to a functional module for inputting the difference described in the present application (such as the Dev(Max)) as a training sample into the machine learning model for machine learning training .
  • the "machine learning” may refer to artificial intelligence systems configured to learn from data without being explicitly programmed.
  • the "machine learning model” can be a collection of parameters and functions that can train parameters on a set of training samples. Parameters and functions can be collections of linear algebraic, nonlinear algebraic, and tensor algebraic operations. Parameters and functions can contain statistical functions, tests, and probability models.
  • the training module may input the difference value described in the present application (for example, the Dev(Max)).
  • the training module can output a machine learning model that has been trained by machine learning.
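As a rough illustration of the training module, the following Python sketch fits a one-feature logistic regression on Dev(Max) values. The application does not name a specific learning algorithm, so logistic regression is a stand-in chosen for brevity; the function names, the learning rate, the epoch count and the training values are all made up for illustration.

```python
import math


def train_logistic(values, labels, lr=0.5, epochs=2000):
    """Fit p(somatic) = sigmoid(w * x + b) by batch gradient descent.

    values -- Dev(Max) of each training mutation site
    labels -- 1 for somatic, 0 for germline (assumed labelling)
    """
    w, b = 0.0, 0.0
    n = len(values)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(values, labels):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            grad_w += (p - y) * x
            grad_b += p - y
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def predict_somatic(model, dev_max):
    """Return the modelled probability that a site is somatic."""
    w, b = model
    return 1.0 / (1.0 + math.exp(-(w * dev_max + b)))


# Made-up training samples: low Dev(Max) labelled germline, high labelled somatic.
dev_max_values = [0.01, 0.02, 0.03, 0.04, 0.20, 0.25, 0.30, 0.35]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
model = train_logistic(dev_max_values, labels)
print(predict_somatic(model, 0.30) > predict_somatic(model, 0.02))
```

In practice the additional parameters the application mentions (for example, chromosomal position or base substitution pattern) would be further input features; this sketch uses Dev(Max) alone.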
  • the term "output module” generally refers to a functional module for displaying the recognition result of the somatic mutation generated by the judgment module of the present application and/or the judgment result of recognition of ctDNA in the cfDNA.
  • the output module may include a display, which may display (for example, in the form of graphs and/or text) the recognition result of the somatic mutation generated by the judgment module of the present application and/or the recognition of ctDNA in the cfDNA judgment result.
  • sample obtaining module generally refers to a functional module for obtaining said sample of a subject.
  • the sample obtaining module may include reagents and/or instruments needed to obtain the sample (eg, blood sample).
  • the sample acquisition module can output the samples described herein.
  • the term "data receiving module” generally refers to a functional module for obtaining the mutation site in the sample.
  • the data receiving module may input the sample described in this application (such as a blood sample).
  • the data receiving module can output the mutation site.
  • the data receiving module can detect the mutation site of the sample.
  • the data receiving module may perform the gene sequencing described in this application (eg, next-generation gene sequencing) on the sample.
  • the data receiving module may include reagents and/or instruments required for the gene sequencing.
  • the data receiving module can detect the single nucleotide variation.
  • the term "input module" generally refers to a functional module for obtaining the number of the wild-type support fragments of the at least one length, and/or the corresponding number of the mutant support fragments of the same length.
  • the input module can input the mutation site described in this application.
  • the input module may output (eg, may display) the number of the wild-type supporting fragments of the at least one length, and/or the corresponding number of the mutant supporting fragments of the same length.
  • Said input module may comprise reagents and/or instruments capable of enumerating said wild-type support fragments of a specified length.
  • the input module may comprise reagents and/or instruments capable of counting the mutant supporting fragments of a specified length.
  • the input module can identify the lengths of the wild-type supporting fragments and count them respectively; the input module can identify the lengths of the mutant-type supporting fragments and count them respectively. The input module can determine whether the length of the wild-type supporting fragment is the same as that of the mutant-type fragment.
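The counting performed by the input module can be sketched in Python as follows. The record format, a `(length, is_mutant)` pair per cfDNA fragment covering one mutation site, and the function name `count_by_length` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def count_by_length(fragments):
    """Count supporting fragments per length, separately for each type.

    fragments -- iterable of (length_in_nucleotides, is_mutant) pairs,
                 a hypothetical record format for the cfDNA fragments
                 that cover one mutation site.
    """
    wild_counts, mutant_counts = Counter(), Counter()
    for length, is_mutant in fragments:
        if is_mutant:
            mutant_counts[length] += 1
        else:
            wild_counts[length] += 1
    return wild_counts, mutant_counts


# Made-up fragments for one mutation site.
fragments = [(166, False), (166, False), (167, False), (145, True), (166, True)]
wild, mutant = count_by_length(fragments)
print(wild[166], mutant[166])
```

Because both counters are keyed by length, the numbers of wild-type and mutant supporting fragments of the same length can be looked up side by side.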
  • cancer family management generally refers to providing help for cancer-related matters to patients with family hereditary tumors, their relatives and/or high-risk groups.
  • the cancer family management may include providing genetic counseling for the above-mentioned population, performing detection of tumor-related genes and interpretation of results, risk assessment of cancer, consultation and/or implementation of preventive interventions.
  • the term "tumor mutation burden" (TMB), with reference to the "Chinese expert consensus on tumor mutation burden detection and clinical application (2020 edition)", generally refers to the mutation burden of a tumor per megabase (Mb, one million base pairs) of a specific genomic region, which may be expressed as XX mutations/Mb (mutations per megabase).
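The unit "mutations/Mb" can be illustrated with a minimal Python sketch. The function name and the example numbers are assumptions; which mutations are counted toward TMB is not specified here and is left to the caller.

```python
def tumor_mutation_burden(mutation_count, region_size_bp):
    """Return mutations per megabase for a region of region_size_bp base pairs."""
    if region_size_bp <= 0:
        raise ValueError("region size must be positive")
    return mutation_count / (region_size_bp / 1_000_000)


# Made-up example: 15 counted mutations in a 1.5 Mb region.
print(tumor_mutation_burden(15, 1_500_000))
```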
  • the TMB can be used as a biomarker related to the response to immunotherapy.
  • the TMB can indirectly reflect the ability and degree of neoantigen production by the tumor, and has been proven to predict the response to immunotherapy.
  • TMB is used to identify candidates for "Nivolumab+Ipilimumab” immune combination therapy and "Nivolumab” monotherapy in patients with lung cancer.
  • the level of TMB may be related to many factors, such as high microsatellite instability (MSI-H) and the existence of certain driver genes.
  • the term "about" generally refers to a range of 0.5%-10% above or below the specified value, such as 0.5%, 1%, 1.5%, 2%, 2.5%, 3%, 3.5%, 4%, 4.5%, 5%, 5.5%, 6%, 6.5%, 7%, 7.5%, 8%, 8.5%, 9%, 9.5%, or 10% above or below the specified value.
  • the application provides a method for distinguishing between a somatic mutation and a germline mutation, comprising the steps of:
  • the mutation site is obtained by gene sequencing; (2) for each of the mutation sites, obtain a wild-type supporting fragment and a mutant supporting fragment, wherein the wild-type supporting fragment is a cfDNA fragment comprising a wild-type base sequence and the mutant supporting fragment is a cfDNA fragment comprising a mutant base sequence, wherein the wild-type base sequence is a sequence identical to the nucleotide sequence at the corresponding position of the mutation site in the reference genome, wherein the mutant base sequence is a sequence different from the nucleotide sequence at the corresponding position of the mutation site in the reference genome, and wherein the reference genome is the human reference genome used in the gene sequencing; (3) for each mutation site, obtain the number of the wild-type supporting fragments of at least one length, obtain the corresponding number of the mutant supporting fragments of the same length, and calculate the ratio WC of the number of wild-type supporting fragments of this length to the total number of wild-
  • the application provides a method for identifying ctDNA in cfDNA, comprising the following steps:
  • the mutation site is obtained by gene sequencing; (2) for each of the mutation sites, obtain a wild-type supporting fragment and a mutant supporting fragment, wherein the wild-type supporting fragment is a cfDNA fragment comprising a wild-type base sequence and the mutant supporting fragment is a cfDNA fragment comprising a mutant base sequence, wherein the wild-type base sequence is a sequence identical to the nucleotide sequence at the corresponding position of the mutation site in the reference genome, wherein the mutant base sequence is a sequence different from the nucleotide sequence at the corresponding position of the mutation site in the reference genome, and wherein the reference genome is the human reference genome used in the gene sequencing; (3) for each mutation site, obtain the number of the wild-type supporting fragments of at least one length, obtain the corresponding number of the mutant supporting fragments of the same length, and calculate the ratio WC of the number of wild-type supporting fragments of this length to the total number of wild-
  • the application provides a kind of training method of machine learning model, it comprises the following steps:
  • the mutation site is obtained by gene sequencing; (2) for each of the mutation sites, obtain a wild-type supporting fragment and a mutant supporting fragment, wherein the wild-type supporting fragment is a cfDNA fragment comprising a wild-type base sequence and the mutant supporting fragment is a cfDNA fragment comprising a mutant base sequence, wherein the wild-type base sequence is a sequence identical to the nucleotide sequence at the corresponding position of the mutation site in the reference genome, wherein the mutant base sequence is a sequence different from the nucleotide sequence at the corresponding position of the mutation site in the reference genome, and wherein the reference genome is the human reference genome used in the gene sequencing; (3) for each mutation site, obtain the number of the wild-type supporting fragments of at least one length, obtain the corresponding number of the mutant supporting fragments of the same length, and calculate the ratio WC of the number of wild-type supporting fragments of this length to the total number of wild-
  • the application provides a method for establishing a database, which includes the following steps:
  • the mutation site is obtained by gene sequencing; (2) for each of the mutation sites, obtain a wild-type supporting fragment and a mutant supporting fragment, wherein the wild-type supporting fragment is a cfDNA fragment comprising a wild-type base sequence and the mutant supporting fragment is a cfDNA fragment comprising a mutant base sequence, wherein the wild-type base sequence is a sequence identical to the nucleotide sequence at the corresponding position of the mutation site in the reference genome, wherein the mutant base sequence is a sequence different from the nucleotide sequence at the corresponding position of the mutation site in the reference genome, and wherein the reference genome is the human reference genome used in the gene sequencing; (3) for each mutation site, obtain the number of the wild-type supporting fragments of at least one length, obtain the corresponding number of the mutant supporting fragments of the same length, and calculate the ratio WC of the number of wild-type supporting fragments of this length to the total number of wild-
  • the gene sequencing may include next generation gene sequencing (NGS).
  • the NGS can be selected from the following group: Solexa sequencing technology, 454 sequencing technology, SOLiD sequencing technology, Complete Genomics sequencing method and semiconductor (Ion Torrent) sequencing technology.
  • the gene sequencing can be high-throughput, for example, hundreds of thousands or millions of DNA molecules can be sequenced at one time.
  • the gene sequencing can be short-read; for example, the read length of NGS can be no more than 500 bp.
  • the gene sequencing may include the following steps: (1) library construction, which may include, for example, modifying the ends of the DNA molecules, adding adapters (for example, Y-shaped adapters may be formed), and then performing PCR amplification; (2) sequencing, which may include, for example, DNA replication using oligonucleotides as primers and library fragments as templates, followed by "bridge" amplification and sequencing by synthesis; the index sequencing primer is then added to read the index sequence in the adapter to determine which library the DNA at each site belongs to.
  • the method may only use samples derived from a subject.
  • the method may not require the use of paired samples. Therefore the method described in this application can greatly reduce the requirement of the subject's sample.
  • the sample may include a blood sample.
  • the method may further include the following step: obtaining a sample from a subject.
  • the step of obtaining a blood sample from said subject using a lancet system may be included.
  • the method for obtaining samples may include a vacuum blood collection tube blood collection method.
  • the mutation site may include a single nucleotide variation (SNV).
  • the mutation site may contain more than two nucleotide variations.
  • the mutation site described in the present application may include one of the SNVs, or two or more (for example, it may be 2, 3, 4, 5, 6, 7, 8, 9, 10 or more) SNVs (eg, may include more than two nucleotide variations).
  • the nucleotide sequence at the position of the mutation site is different between the wild-type support fragment and the mutant support fragment.
  • the mutation sites may include nucleotide substitutions, and may also include nucleotide deletions and/or insertions in some cases.
  • the mutation site may include nucleotide substitutions.
  • the division of the wild type supporting fragment and/or the mutant supporting fragment may be specific to a specific mutation site.
  • when the nucleotide sequence at the mutation site is the same as the nucleotide sequence at the corresponding position of the mutation site in the reference genome, it can be considered wild-type with respect to the reference genome for that mutation site.
  • the wild-type support fragment and/or the mutant support fragment can range in length from about 1 nucleotide to about 550 nucleotides (for example, it can be about 1-500, about 1-450, about 1-400, about 1-350, about 1-300, about 1-250, about 1-200, or about 1-100).
  • it can be from about 1 nucleotide to about 400 nucleotides.
  • it can be from about 1 nucleotide to about 200 nucleotides.
  • the method may include the following steps: (4') obtain the distribution of the differences in step (3), select the maximum value in the distribution as Dev(Max), and use the Dev(Max) as the distinguishing index and/or as the training sample.
  • the distribution may be a collection of the differences.
  • the Dev(Max) may be the maximum value of the difference in the set.
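A minimal Python sketch of the unsmoothed calculation follows: for each length, the ratio WC minus the ratio MC is computed, and the maximum of the resulting set is taken as Dev(Max). The function name `ratio_differences`, the dictionary inputs and the counts are illustrative assumptions.

```python
def ratio_differences(wild_counts, mutant_counts):
    """Return {length: WC - MC} for every length seen in either dict.

    wild_counts / mutant_counts -- {length: number of supporting fragments}
    WC = wild count at a length / total wild-type supporting fragments
    MC = mutant count at the same length / total mutant supporting fragments
    """
    w_total = sum(wild_counts.values())
    m_total = sum(mutant_counts.values())
    if w_total == 0 or m_total == 0:
        raise ValueError("need at least one fragment of each type")
    lengths = sorted(set(wild_counts) | set(mutant_counts))
    return {
        n: wild_counts.get(n, 0) / w_total - mutant_counts.get(n, 0) / m_total
        for n in lengths
    }


# Made-up counts for one mutation site.
wild = {150: 10, 166: 70, 180: 20}
mutant = {150: 25, 166: 20, 180: 5}
diffs = ratio_differences(wild, mutant)
dev_max = max(diffs.values())  # Dev(Max) without smoothing
print(dev_max)
```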
  • the difference may be smoothed.
  • after the smoothing process, the difference can more intuitively and accurately reflect the difference between the numbers of wild-type supporting fragments and mutant supporting fragments of the same length. Further, the smoothed difference can more accurately, specifically and/or sensitively distinguish the somatic mutation from the germline mutation, and/or identify ctDNA in the cfDNA.
  • the smoothing process may include the following steps:
  • the smoothing window value can be adjusted according to different subjects, different gene sequencing methods and/or different differentiation purposes, as long as the selected value allows the smoothing process to be carried out.
  • the smoothing window value may be an integer between about 2-6 (for example, the smoothing window value may be 2, 3, 4, 5 or 6).
  • the smoothing window value may be 3.
  • the smoothing process may include the following specific steps:
  • (b) Determine the smoothing sampling length ranges, wherein the minimum value of each smoothing sampling length range is a starting length and the maximum value of each smoothing sampling length range is the starting length + (smoothing window value - 1); wherein the starting length lies within the range of the lengths of the wild-type supporting fragments and/or the mutant supporting fragments (for example, about 1 nucleotide to about 400 nucleotides). In the present application, the starting length may be any length within that range, and the "length" can be measured by the number of nucleotides.
  • the minimum values of the smoothing sampling length ranges may form an arithmetic sequence whose first term is a starting length and whose common difference is the smoothing window value, taking the first, second, third ... up to the Nth term within the range of the lengths of the wild-type supporting fragments and/or the mutant supporting fragments.
  • for example, the minimum values of the smoothing sampling length ranges may be 1, 4, 7, 10 ... 400.
  • the smoothing sampling length ranges can be 1-3, 4-6, 7-9....
  • the smoothing sampling length ranges can be 1-3, 2-4, 3-5..., or 1-3, 3-5, 5-7....
  • the smoothing sampling length ranges can be 2-4, 5-7, 8-10....
  • the smoothing sampling length ranges can be 2-4, 3-5, 4-6....
  • for example, the number of the wild-type supporting fragments with a length of 1 nucleotide is obtained and divided by the total number of wild-type supporting fragments, W total, to obtain the ratio WC1;
  • the number of the mutant supporting fragments with a length of 1 nucleotide is obtained and divided by the total number of mutant supporting fragments, M total, to obtain the ratio MC1, and the difference WC1-MC1 between the two is calculated;
  • for example, the number of the wild-type supporting fragments with a length of 4 nucleotides is obtained and divided by W total to obtain the ratio WC4;
  • the number of the mutant supporting fragments with a length of 4 nucleotides is obtained and divided by M total to obtain the ratio MC4, and the difference WC4-MC4 between the two is calculated; the differences of the ratios corresponding to the different smoothing sampling lengths (for example, 1, 4, 7, 10 ... 400) are thus obtained; for example, within the range of each smoothed sampling length, under each smoothed
  • for example, the accumulated value of (WC1-MC1), (WC2-MC2) and (WC3-MC3) is divided by the smoothing window value 3, and the obtained average difference B1 can be used as the representative value of that smoothing sampling length range.
  • for example, the accumulated value of (WC4-MC4), (WC5-MC5) and (WC6-MC6) is divided by the smoothing window value 3, and the obtained average difference B4 can be used as the representative value of that smoothing sampling length range.
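The averaging step can be sketched in Python as follows, under the assumption that consecutive (non-overlapping) smoothing sampling length ranges are used and that lengths with no fragments contribute a difference of 0. The function name `window_averages` and the difference values are made up for illustration.

```python
def window_averages(diffs, start, window, max_length):
    """Return {range_start: B}, where B is the mean of WC - MC over the range.

    diffs      -- {length: WC - MC}; missing lengths count as 0 (assumption)
    start      -- initial length in nucleotides
    window     -- smoothing window value
    max_length -- largest length considered (hypothetical cap)
    """
    averages = {}
    low = start
    while low + window - 1 <= max_length:
        # Accumulate the differences in the range, then divide by the window.
        total = sum(diffs.get(n, 0.0) for n in range(low, low + window))
        averages[low] = total / window
        low += window
    return averages


# Made-up differences WC_n - MC_n for lengths 1..6.
diffs = {1: 0.03, 2: 0.0, 3: -0.03, 4: 0.06, 5: 0.03, 6: 0.0}
b = window_averages(diffs, start=1, window=3, max_length=6)
print(b)  # keys 1 and 4 correspond to B1 and B4
```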
  • the smoothing process may further include the following steps: (g) within the length range of the effective fragment interval, sequentially accumulating each average difference in the first distribution to obtain an added value, wherein the length of the effective fragment interval covers the length of the nucleic acid sequence wound around the nucleosome.
  • the nucleic acid sequence can wrap around the nucleosome for more than 2 turns, or can wrap around the nucleosome within 1 turn.
  • the length of the effective fragment interval can be about 1 to about 180 nucleotides (for example, it can be about 1 to about 180, about 1 to about 179, about 1 to about 178, about 1 to about 177, about 1 to about 176, about 1 to about 175, about 1 to about 174, about 1 to about 173, about 1 to about 172, about 1 to about 171, about 1 to about 170, about 1 to about 169, about 1 to about 168, about 1 to about 167, about 1 to about 166 or about 1 to about 165), and/or about 200 or more nucleotides (for example, it can be about 200 or more, about 210 or more, about 220 or more, about 230 or more, about 240 or more, about 250 or more, about 260 or more, about 270 or more, about 280 or more, about 290 or more, about 300 or more, about 350 or
  • B1 and B4 in the first distribution may be added up to obtain the added value D1; B1, B4 and B7 in the first distribution may be added up to obtain the added value D2.
  • the smoothing process may include the following steps: (h) obtaining the second distribution of the added value in step (g), and calculating the maximum value of the added value in the second distribution.
  • i may be the length of the effective segment interval.
  • the maximum value in the second distribution may be taken as Dev(Max).
  • the Dev(Max) may be used as the distinguishing index and/or as the training sample.
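Steps (g) and (h) can be sketched in Python as a running sum followed by a maximum. The sketch assumes that a range belongs to the effective fragment interval when it lies entirely inside it, and it follows the example in which the first added value D1 already combines two averages (B1 and B4); the function name and the numbers are illustrative only.

```python
from itertools import accumulate


def dev_max_from_averages(averages, window, interval):
    """Accumulate average differences inside `interval` and return the maximum.

    averages -- {range_start: B}, the first distribution
    window   -- smoothing window value
    interval -- (low, high) effective fragment interval, in nucleotides
    """
    low, high = interval
    # Keep only ranges that lie entirely inside the interval (assumption).
    selected = [
        averages[s] for s in sorted(averages)
        if low <= s and s + window - 1 <= high
    ]
    if len(selected) < 2:
        raise ValueError("need at least two ranges inside the interval")
    # D1 = B1 + B4, D2 = B1 + B4 + B7, ... (the second distribution)
    added_values = list(accumulate(selected))[1:]
    return max(added_values), added_values


# Made-up first distribution with smoothing window value 3.
averages = {1: 0.01, 4: 0.03, 7: -0.02, 10: 0.04}
dev_max, added = dev_max_from_averages(averages, window=3, interval=(1, 9))
print(dev_max, added)
```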
  • the indicator may also include one or more parameters selected from the following group: the chromosomal position of the mutation site, the base substitution pattern of the mutation site, the count values of wild-type nucleic acid fragments of various lengths at the mutation site and/or the count values of mutant nucleic acid fragments of various lengths at the mutation site, the allelic variation of the mutation site, the age of the subject, and the mutation type of the mutation site.
  • the indicator may also include one or more parameters selected from the following group: the chromosomal position of the SNV site, the base substitution pattern of the SNV site, the count values of wild-type nucleic acid fragments of various lengths at the SNV site and/or the count values of mutant nucleic acid fragments of various lengths at the SNV site, the allelic variation of the SNV site, the age of the subject, and the mutation type of the SNV site.
  • the method may further include the step of detecting the mutation site.
  • the step of detecting the mutation site can be a routine step in the art.
  • detecting the mutation site can include the following steps: (1) obtaining data from the sample; (2) performing variant identification on the data obtained in step (1) (for example, variant identification can be based on base quality, mapping quality, number of mismatches, mutation frequency, number of reads supporting the mutation and other factors); (3) annotating the variants identified in step (2) (for example, ANNOVAR 20160201, the 1000 Genomes database, the ExAC database and/or the gnomAD genome database can be used for annotation; for example, database annotation, hotspot annotation, mutation type annotation and/or population frequency annotation can be performed); and (4) filtering the variants annotated in step (3) (for example, by population mutation frequency filtering, hotspot mutation filtering, clonal hematopoiesis mutation filtering and/or maximum depth filtering) to obtain the mutation site.
  • the step may also
  • the present application provides a device for distinguishing between somatic mutations and germline mutations, comprising:
  • a calculation module configured to calculate, for each mutation site, the difference between the ratio WC and the ratio MC at the same length, based on the number of wild-type support fragments of at least one length and the number of corresponding mutant support fragments of the same length; wherein the ratio WC is the ratio of the number of wild-type support fragments of that length to the total number of wild-type support fragments; wherein the ratio MC is the ratio of the number of mutant support fragments of the same length to the total number of mutant support fragments; wherein the wild-type support fragment is a cfDNA fragment comprising a wild-type base sequence and the mutant support fragment is a cfDNA fragment comprising a mutant base sequence; wherein the wild-type base sequence is identical to the nucleotide sequence of the reference genome at the position corresponding to the mutation site, and the mutant base sequence differs from the nucleotide sequence of the reference genome at the position corresponding to the mutation site;
  • a judging module configured to obtain a recognition result for identifying the somatic mutation according to a machine learning model that has been trained by machine learning, wherein the machine learning training includes inputting the difference as a training sample into the machine learning model to Do machine learning training.
  • the present application provides a device for identifying ctDNA in cfDNA, which includes:
  • a calculation module configured to calculate, for each mutation site, the difference between the ratio WC and the ratio MC at the same length, based on the number of wild-type support fragments of at least one length and the number of corresponding mutant support fragments of the same length; wherein the ratio WC is the ratio of the number of wild-type support fragments of that length to the total number of wild-type support fragments; wherein the ratio MC is the ratio of the number of mutant support fragments of the same length to the total number of mutant support fragments; wherein the wild-type support fragment is a cfDNA fragment comprising a wild-type base sequence and the mutant support fragment is a cfDNA fragment comprising a mutant base sequence; wherein the wild-type base sequence is identical to the nucleotide sequence of the reference genome at the position corresponding to the mutation site, and the mutant base sequence differs from the nucleotide sequence of the reference genome at the position corresponding to the mutation site;
  • a judgment module configured to obtain a judgment result of identifying ctDNA in the cfDNA according to a machine learning model that has been trained by machine learning, wherein the machine learning training includes inputting the difference as a training sample into the machine learning model for machine learning training.
  • the application provides a training device for a machine learning model, which includes:
  • a calculation module configured to calculate, for each mutation site, the difference between the ratio WC and the ratio MC at the same length, based on the number of wild-type support fragments of at least one length and the number of corresponding mutant support fragments of the same length; wherein the ratio WC is the ratio of the number of wild-type support fragments of that length to the total number of wild-type support fragments; wherein the ratio MC is the ratio of the number of mutant support fragments of the same length to the total number of mutant support fragments; wherein the wild-type support fragment is a cfDNA fragment comprising a wild-type base sequence and the mutant support fragment is a cfDNA fragment comprising a mutant base sequence; wherein the wild-type base sequence is identical to the nucleotide sequence of the reference genome at the position corresponding to the mutation site, and the mutant base sequence differs from the nucleotide sequence of the reference genome at the position corresponding to the mutation site;
  • a training module configured to input the difference as a training sample into the machine learning model for machine learning training.
  • the device may only use samples derived from the subject.
  • the device may further include: an output module, configured to display the recognition result of the somatic mutation generated by the judgment module and/or the judgment result of recognition of ctDNA in the cfDNA.
  • the output module may display the recognition result of the somatic mutation generated by the judgment module of the present application and/or the judgment result of recognition of ctDNA in the cfDNA.
  • the output module may include an output device (such as a display) and/or an output program (such as a mobile APP), so as to display the recognition result of the somatic mutation generated by the judgment module of the present application and/or the The judgment result of ctDNA recognition in cfDNA.
  • the output module may take as input the identification result of the somatic mutation and/or the identification result of ctDNA in the cfDNA obtained by the judgment module.
  • the device may further include a sample obtaining module, configured to obtain the sample of the subject.
  • the sample may comprise a blood sample.
  • the sample obtaining module may include reagents and/or instruments required for obtaining the sample.
  • the sample acquisition module may include blood collection needles, blood collection tubes and/or blood sample transport boxes.
  • the sample acquisition module can include an anticoagulant.
  • the sample obtaining module can output the samples described in this application.
  • the device may further include a data receiving module, configured to obtain the mutation site in the sample.
  • the data receiving module may input the samples.
  • the data receiving module can output the mutation sites described in this application.
  • the data receiving module may include reagents and/or instruments required for obtaining the mutation site.
  • the data receiving module may include reagents and/or instruments required for the gene sequencing.
  • the data receiving module may perform the gene sequencing described in the present application, for example, the gene sequencing may include next-generation gene sequencing (NGS).
  • the data receiving module may include a next-generation gene sequencer (eg, Roche454 sequencer, Illumina sequencer).
  • the data receiving module can include an automated sample preparation system.
  • the data receiving module may include fluorescently labeled dNTP, end repair enzyme, end repair reaction buffer, DNA ligase, DNA ligation buffer and/or library amplification reaction solution.
  • the mutation site may include a single nucleotide variation (SNV). In the present application, the mutation site may contain more than two nucleotide variations.
  • detecting the mutation site in the device may include the following steps: (1) obtaining data from the sample; (2) performing mutation identification on the data obtained in step (1); (3) Annotate the variation identified in step (2); and, (4) filter the variation annotated in step (3) to obtain a mutation site; optionally, perform quality control on the mutation site.
  • the device may further include an input module, configured to obtain the quantity of the wild-type support fragment of the at least one length, and/or the corresponding quantity of the mutant support fragment of the same length .
  • the input module can input the mutation sites described in this application.
  • the input module may output the number of the wild-type supporting fragments of the at least one length, and/or the corresponding number of the mutant supporting fragments of the same length.
  • the input module may include reagents and/or instruments capable of counting the wild-type support fragments of a specific length.
  • the input module may comprise reagents and/or instruments capable of counting the mutant supporting fragments of a specified length.
  • the input module may include an instrument capable of displaying the quantity of the wild-type support fragment of the at least one length, and/or the quantity of the corresponding mutant support fragment of the same length (such as a display) and/or an output program (such as a mobile APP), so that the number of wild-type and/or mutant support fragments obtained by using the input module can be displayed.
  • the input module can distinguish between the wild type support fragment and the mutant support fragment.
  • the input module can count the number of wild-type supporting fragments of different lengths and count the number of mutant supporting fragments of different lengths.
  • the length of the wild-type supporting fragment and/or the mutant supporting fragment may range from about 1 nucleotide to about 550 nucleotides. For example, it can be from about 1 nucleotide to about 400 nucleotides. For example, it can be from about 1 nucleotide to about 200 nucleotides.
  • the calculation module can input (for example, can be obtained through the input module of the present application) the number of wild-type support fragments described in the present application, and the corresponding number of mutant support fragments of the same length .
  • the calculation module may output the difference value described in this application, for example, the calculation module may output Dev(Max) described in this application.
  • the calculation module may include calculation logic and/or a calculation program for calculating the difference value described in this application.
  • the distribution of the difference can be obtained in the calculation module, the maximum value in the distribution is selected as Dev(Max), and the Dev(Max) is used as the indicator of the distinction and/or as the training sample.
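The unsmoothed calculation described above can be illustrated with a minimal Python sketch. It assumes that fragment lengths are available as plain lists of integers, one entry per support fragment; the function names `length_ratio_differences` and `dev_max` are hypothetical and are not taken from the application.

```python
from collections import Counter


def length_ratio_differences(wild_lengths, mutant_lengths):
    """Return {length: WC - MC} for every fragment length seen in either group.

    WC is the share of wild-type support fragments with that length and
    MC is the share of mutant support fragments with the same length.
    """
    wild_counts = Counter(wild_lengths)
    mutant_counts = Counter(mutant_lengths)
    wild_total = sum(wild_counts.values())
    mutant_total = sum(mutant_counts.values())
    if wild_total == 0 or mutant_total == 0:
        raise ValueError("both fragment groups must be non-empty")
    lengths = sorted(set(wild_counts) | set(mutant_counts))
    return {
        n: wild_counts[n] / wild_total - mutant_counts[n] / mutant_total
        for n in lengths
    }


def dev_max(differences):
    """Pick the largest difference in the distribution (assumed reading of Dev(Max))."""
    return max(differences.values())


# Toy data (illustrative only, not from the application).
wild = [166, 166, 167, 170]
mutant = [150, 150, 166, 167]
diffs = length_ratio_differences(wild, mutant)
best = dev_max(diffs)
```

Because both ratios sum to 1 over all lengths, the differences always sum to zero, which is a convenient sanity check on the counting step.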
  • the difference value may be smoothed in the calculation module, wherein the smoothing process may include the following steps: (a) determining a smoothing window value, wherein the smoothing window value is an integer from about 1 to about 30; (b) determining several smoothing sampling length ranges whose span is equal to the smoothing window value, wherein the minimum value of each smoothing sampling length range is the starting length, and the starting length falls within the range of lengths of the wild-type support fragment and/or the mutant support fragment; (c) within any smoothing sampling length range, obtaining the number of wild-type support fragments of at least one smoothing sampling length and the number of corresponding mutant support fragments of the same length, calculating the ratio WC of the number of wild-type support fragments of this length to the total number of wild-type support fragments, calculating the ratio MC of the number of mutant support fragments of the same length to the total number of mutant support fragments, and calculating the difference between said ratio WC and said ratio MC
  • the smoothing window value may be an integer between about 2-6.
  • the smoothing window value may be 3.
  • the smoothing process may include the following steps: (f) obtaining the first distribution of the average difference in step (e).
  • the smoothing process may include the following steps: (g) within the length range of the effective segment interval, each average difference in the first distribution is sequentially accumulated to obtain an added value, wherein , the length of the effective fragment interval covers the length of the nucleic acid sequence wound around the nucleosome.
  • the nucleic acid sequence may be capable of wrapping around the nucleosome for more than 2 turns, or for less than 1 turn.
  • the length of the effective fragment interval may be about 1 to about 167 nucleotides, and/or, about 200 or more nucleotides. In the present application, the length of the effective fragment interval may be about 1 to about 167 nucleotides, and/or, about 250 to about 400 nucleotides.
  • the smoothing process may include the following steps: (h) obtaining the second distribution of the added value in step (g), and calculating the maximum value of the added value in the second distribution.
  • the maximum value of the added value may be used as Dev(Max), and the Dev(Max) may be used as the distinguishing index and/or as the training sample.
  • the judgment module can obtain relevant judgment results according to the machine learning model that has been trained by machine learning (for example, the judgment results can include the recognition results of somatic mutations described in this application, and/or the results of this application The judgment result of recognizing ctDNA in the cfDNA).
  • the judging module may input the difference described in this application (for example, the Dev(Max)).
  • the judging module can output the related judging result.
  • the judging module may include a machine learning model that has been trained by machine learning.
  • the machine learning model is obtained by using the verification set and the difference described in the present application (for example, using the parameters described in the present application), and using the training method of the machine learning model described in the present application.
  • the index and/or training samples may also include one or more of the following parameters: the chromosome position where the mutation site is located, the base substitution pattern of the mutation site, the mutation The count value of nucleic acid fragments of various lengths in the wild type of the site and/or the count value of nucleic acid fragments of various lengths in the mutant type of the mutation site, the allelic variation of the mutation site, the age of the subject and the mutation type of the mutation site.
  • the index and/or training samples may also include one or more of the following parameters: the chromosome position where the SNV site is located, the base substitution pattern of the SNV site, the SNV The count value of nucleic acid fragments of various lengths in the wild type of the site and/or the count value of nucleic acid fragments of various lengths in the mutant type of the SNV site, the allelic variation of the SNV site, the age of the subject and the mutation type of the SNV site.
  • the device may include the calculating module and the judging module.
  • the apparatus may include the computing module and the training module.
  • the device may include the sample acquisition module, the data receiving module, the input module, the calculation module, the judgment module and the output module.
  • the sample, and the information and/or calculation results derived from the sample, may be passed sequentially through the sample acquisition module, the data receiving module, the input module, the calculation module, the judgment module and the output module, in that order.
  • the present application provides an electronic device, including a memory; and a processor coupled to the memory, the processor configured to execute instructions stored in the memory to implement the methods described in the present application.
  • the present application provides a non-volatile computer-readable storage medium, on which a computer program is stored, and the program is executed by a processor to implement the method for distinguishing somatic mutations and germline mutations described in the present application, the method for identifying ctDNA in cfDNA described in the present application, or the training method of the machine learning model described in the present application.
  • the non-transitory computer readable storage medium may include a floppy disk, a flexible disk, a hard disk, solid state storage (SSS) (such as a solid state drive (SSD), a solid state card (SSC) or a solid state module (SSM)), an enterprise-grade flash drive, tape, or any other non-transitory magnetic media, etc.
  • Non-transitory computer readable storage media may also include punched cards, paper tape, optical mark sheets (or any other physical media having a pattern of holes or other optically identifiable markings), compact disc read-only memory (CD-ROM), compact disc rewritable (CD-RW), digital versatile disc (DVD), Blu-ray disc (BD) and/or any other non-transitory optical media.
  • the present application provides a database system, which includes a memory; and a processor coupled to the memory, the processor configured to execute instructions stored in the memory to implement the method for distinguishing somatic mutations and germline mutations described in the present application, the method for identifying ctDNA in cfDNA described in the present application, or the method for building a database described in the present application.
  • the database system may implement various mechanisms to ensure that the methods described herein performed on the database system produce correct results.
  • the database system may use disks as permanent data storage.
  • the database system can provide database storage and processing services for multiple database clients.
  • the database client may store database data across multiple shared storage devices, and/or may utilize one or more execution platforms with multiple execution nodes.
  • the database system can be organized such that storage and computing resources can be effectively scaled indefinitely.
  • the present application provides an application of the method for distinguishing somatic mutations and germline mutations described in the present application in the management of tumor families.
  • the present application provides an application of the method for distinguishing somatic mutations and germline mutations described in the present application in the detection of tumor mutation burden (TMB).
  • the method can be used to determine whether the subject has a germline mutation.
  • Subjects carrying certain germline mutations may have a higher lifetime risk of developing cancer (eg, colorectal, endometrial, gastric, and/or ovarian cancer) than the general population. Therefore, the method can be used to screen out subjects with higher risk.
  • the subject can receive individualized tumor monitoring, so as to achieve the purpose of early diagnosis and early treatment.
  • the method can be used to detect the TMB and can be used in clinical practice (for example, it can be speculated whether certain specific tumor treatment methods are suitable for the subject).
  • the TMB level detected by the method can be used in clinical practice in combination with other biomarkers such as immune checkpoints and T cell inflammatory markers.
  • Example 1 Obtain the mutation site described in the application
  • Sequence alignment: use the mem module of the bwa 0.7.10 software to map the sequences to the human reference genome GRCh37/hg19, producing a .bam file of the alignment results.
  • c) remove reads with too many mismatches, for example: more than 12, 10, 8 or 6 mismatches;
  • Hotspot mutation (hot) site: if a mutation is in the hotspot mutation list, the mutation is a hotspot mutation; in the subsequent mutation filtering, hotspot mutations are not included in the prediction of the model;
  • the databases used include but are not limited to: the 1000 Genomes database, the ExAC database and the ESP6500 database, etc.
  • Maximum depth filtering: filter out mutations above a specific sequencing depth, for example a sequencing depth greater than 20,000;
  • Duplicate removal: remove the duplicate sequences generated during PCR amplification;
  • Low-quality fragment filtering: filter out fragments whose median base quality is less than Q20;
  • Sequencing-error fragment filtering: filter out fragments that cannot be aligned to the reference genome;
  • Low-coverage mutation removal: remove SNVs with fewer than 50 supporting fragments.
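The fragment-level and site-level filters listed in this example can be sketched in Python as follows. This is an illustration under stated assumptions, not the pipeline used in the application: the `Fragment` and `Candidate` records and their field names are hypothetical, and the default mismatch limit of 8 is just one of the example values (12, 10, 8 or 6) given in the text.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Fragment:
    base_qualities: list  # Phred-scaled base qualities (hypothetical field)
    mismatches: int       # mismatches against the reference genome
    aligned: bool         # whether the fragment could be aligned at all


@dataclass
class Candidate:
    depth: int                 # sequencing depth at the site
    supporting_fragments: int  # fragments supporting the SNV
    is_hotspot: bool           # present in the hotspot mutation list


def keep_fragment(frag, max_mismatches=8, min_median_quality=20):
    """Drop unaligned, high-mismatch, or low-quality (median < Q20) fragments."""
    if not frag.aligned:
        return False
    if frag.mismatches > max_mismatches:
        return False
    return median(frag.base_qualities) >= min_median_quality


def keep_candidate(site, max_depth=20000, min_support=50):
    """Keep sites that pass the hotspot, maximum-depth and low-coverage filters."""
    if site.is_hotspot:
        return False  # hotspot mutations are not passed to the model
    if site.depth > max_depth:
        return False
    return site.supporting_fragments >= min_support
```

Keeping each threshold as a keyword argument makes it easy to try the alternative cut-offs mentioned in the text without changing the filter logic.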
  • Example 2: The method for obtaining the difference described in this application
  • wild-type support fragment is a cfDNA fragment comprising a wild-type base sequence
  • mutant support fragment is a cfDNA fragment comprising a mutant base sequence
  • the wild-type base sequence is the same sequence as the nucleotide sequence at the corresponding position of the mutation site in the reference genome
  • the mutant base sequence is a sequence that differs from the nucleotide sequence of the reference genome at the position corresponding to the mutation site
  • the reference genome is the human reference genome in the gene sequencing.
  • the distribution of the wild-type supporting fragment and the mutant supporting fragment is calculated.
  • WC_i and MC_i in formula (1) respectively represent the number of wild-type support fragments with a length of i nucleotides and the number of mutant support fragments with a length of i nucleotides at a given mutation site.
  • 3 is the smoothing window value
  • j is the length value in the smoothing sampling length range, for example, j can be an integer in an arithmetic sequence such as 1, 4, 7, 10;
  • 400 is the upper limit of the length range of the wild-type support fragment and/or the mutant support fragment.
  • the length of the effective fragment interval is set to be about 1-about 167 nucleotides, and/or, about 250-about 400 nucleotides.
  • the length of the effective fragment interval may be the length of the nucleic acid sequence wound around the nucleosome.
  • the nucleic acid sequence can wrap around the nucleosome for more than 2 turns, or for less than 1 turn (for example, the length of the effective fragment interval is about 1 to about 167 nucleotides, and/or, about 250 to about 400 nucleotides).
  • the values of each B in the first distribution D are sequentially accumulated again to obtain the added value (that is, refer to formula (3)).
  • in this way, the added value is calculated in sequence for each B in the first distribution D.
  • the set of added values constitutes the second distribution A, and the largest added value in the second distribution is recorded as Dev(Max) (ie, refer to formula (4)).
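The smoothing and accumulation steps can be sketched in Python as below. Formulas (1) to (4) are not reproduced in this text, so the sketch rests on several assumptions: each B is taken to be the mean of the per-length differences WC_i/ΣWC − MC_i/ΣMC inside one window of 3 lengths, windows start at j = 1, 4, 7, ..., a window is treated as inside the effective fragment interval when its starting length is, and the added values are running sums of B. The function name and parameters are hypothetical.

```python
from collections import Counter
from itertools import accumulate


def smoothed_dev_max(wild_lengths, mutant_lengths, window=3, step=3,
                     max_length=400, intervals=((1, 167), (250, 400))):
    """Return (Dev(Max), first distribution of B, second distribution of added values)."""
    wild = Counter(wild_lengths)
    mutant = Counter(mutant_lengths)
    wild_total = sum(wild.values())
    mutant_total = sum(mutant.values())
    if wild_total == 0 or mutant_total == 0:
        raise ValueError("both fragment groups must be non-empty")

    averages = []  # first distribution: one average difference B per window
    for start in range(1, max_length + 1, step):
        # Assumption: a window counts as effective if its starting length is.
        if not any(low <= start <= high for low, high in intervals):
            continue
        lengths = range(start, min(start + window, max_length + 1))
        diffs = [wild[i] / wild_total - mutant[i] / mutant_total for i in lengths]
        averages.append(sum(diffs) / len(diffs))
    if not averages:
        raise ValueError("no smoothing window falls inside the effective intervals")

    added = list(accumulate(averages))  # second distribution: running sums of B
    return max(added), averages, added


# Toy data (illustrative only, not from the application).
dev, first, second = smoothed_dev_max([100, 100, 160, 160], [100, 130, 130, 160])
```

Setting `step=1` gives overlapping windows whose starting lengths run 1, 2, 3, 4, ..., which is one possible reading of the alternative starting-length sequence.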
  • Figure 8 shows the frequency distribution of the lengths of the wild-type support fragments and the mutant support fragments of the present application, obtained using the method described in Example 2.1, for the C-T mutation site at position 20525808 of human chromosome 4.
  • Figure 9 shows the frequency distribution of the lengths of the wild-type support fragments and the mutant support fragments of the present application, obtained using the method described in Example 2.1, for the G-T mutation site at position 56189455 of human chromosome 5.
  • Figure 10 shows the frequency distribution of the lengths of the wild-type support fragments and the mutant support fragments of the present application, obtained using the method described in Example 2.1, for the C-A mutation site at position 7577141 of human chromosome 17.
  • wild-type support fragment is a cfDNA fragment comprising a wild-type base sequence
  • mutant support fragment is a cfDNA fragment comprising a mutant base sequence
  • the wild-type base sequence is the same sequence as the nucleotide sequence at the corresponding position of the mutation site in the reference genome
  • the mutant base sequence is a sequence that differs from the nucleotide sequence of the reference genome at the position corresponding to the mutation site
  • the reference genome is the human reference genome in the gene sequencing.
  • the distribution of the wild-type supporting fragment and the mutant supporting fragment is calculated.
  • WC_i and MC_i in formula (1) respectively represent the number of wild-type support fragments with a length of i nucleotides and the number of mutant support fragments with a length of i nucleotides at a given mutation site.
  • 3 is the smoothing window value
  • j is the length value in the smoothing sampling length range, for example, j can be an integer in an arithmetic sequence such as 1, 2, 3, 4;
  • 400 is the upper limit of the length range of the wild-type support fragment and/or the mutant support fragment.
  • the length of the effective fragment interval is set to be about 1-about 167 nucleotides, and/or, about 250-about 400 nucleotides.
  • the length of the effective fragment interval may be the length of the nucleic acid sequence wound around the nucleosome.
  • the nucleic acid sequence can wrap around the nucleosome for more than 2 turns, or for less than 1 turn (for example, the length of the effective fragment interval is about 1 to about 167 nucleotides, and/or, about 250 to about 400 nucleotides).
  • the values of each B in the first distribution D are sequentially accumulated again to obtain the added value (that is, refer to formula (3)).
  • in this way, the added value is calculated in sequence for each B in the first distribution D.
  • the set of added values constitutes the second distribution A, and the largest added value in the second distribution is recorded as Dev(Max) (ie, refer to formula (4)).
  • Example 3: Carrying out the machine learning described in this application
  • These indicators can be divided into 7 types according to the types of different characteristics, and the indicators are all related to the mutation site.
  • Location information: the chromosome location where the SNV is located, for example, position 68771372 on chromosome 16.
  • Base substitution pattern: at a single SNV locus, the pattern by which the wild-type base is replaced by the newly introduced mutant base. For example, for chr3, 178935093C>A, the base substitution pattern is "CA".
  • This feature uses the "one-hot encoding" method, taking into account the theoretical 12 replacement modes, namely: AT, AC, AG, TA, TC, TG, CA, CT, CG, GA, GT, GC.
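As an illustration of the one-hot encoding of the 12 substitution patterns, here is a minimal Python sketch. The ordering of the patterns follows the list above; the function name `one_hot_substitution` is hypothetical.

```python
# The 12 theoretical single-base substitution patterns, in the order listed above.
SUBSTITUTIONS = ["AT", "AC", "AG", "TA", "TC", "TG",
                 "CA", "CT", "CG", "GA", "GT", "GC"]


def one_hot_substitution(ref_base, alt_base):
    """Encode a wild-type base and a mutant base as a 12-element 0/1 vector."""
    pattern = (ref_base + alt_base).upper()
    if pattern not in SUBSTITUTIONS:
        raise ValueError(f"not a single-base substitution: {pattern!r}")
    return [1 if candidate == pattern else 0 for candidate in SUBSTITUTIONS]


# chr3, 178935093C>A has the substitution pattern "CA".
vector = one_hot_substitution("C", "A")
```

Exactly one position of the vector is set, so the model treats the 12 patterns as unordered categories rather than as a numeric scale.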
  • Dev value obtained in Example 2 (which reflects the fragmentation pattern of cfDNA): this feature type may also include the W ratio and the M ratio, which characterize the direction of the mutation shift.
  • the Delta ratio can also be characterized.
  • the calculation methods of the above three parameters are respectively shown in formula (5), formula (6) and formula (7).
  • 167 can also be any integer in 160-174.
  • for the wild type, C_{l>167} and C_{l<167} respectively represent the number of wild-type support fragments with a length greater than 167 nucleotides and the number of wild-type support fragments with a length less than 167 nucleotides; W ratio is the ratio of C_{l>167} to C_{l<167}.
  • for the mutant type, C_{l>167} and C_{l<167} respectively represent the number of mutant support fragments with a length greater than 167 nucleotides and the number of mutant support fragments with a length less than 167 nucleotides; M ratio is the ratio of C_{l>167} to C_{l<167}.
  • Formula (7) represents the difference between W ratio and M ratio .
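The three ratio features can be sketched in Python as follows. Formulas (5) to (7) are not reproduced in this text, so two points are assumptions: fragments of exactly 167 nucleotides are counted in neither group, and the Delta ratio is taken as W ratio minus M ratio (the text only says it is their difference). Function names are hypothetical.

```python
def long_short_ratio(lengths, threshold=167):
    """Ratio of fragments longer than the threshold to fragments shorter than it."""
    long_count = sum(1 for n in lengths if n > threshold)
    short_count = sum(1 for n in lengths if n < threshold)
    if short_count == 0:
        raise ValueError("no fragments shorter than the threshold")
    return long_count / short_count


def delta_ratio(wild_lengths, mutant_lengths, threshold=167):
    """Assumed sign convention: W ratio minus M ratio."""
    w_ratio = long_short_ratio(wild_lengths, threshold)
    m_ratio = long_short_ratio(mutant_lengths, threshold)
    return w_ratio - m_ratio


# Toy data (illustrative only, not from the application).
wild = [150, 160, 170, 180, 190]
mutant = [140, 150, 160, 165, 170]
w = long_short_ratio(wild)
m = long_short_ratio(mutant)
delta = delta_ratio(wild, mutant)
```

The `threshold` argument is exposed because the text allows any integer from 160 to 174 in place of 167.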
  • Fragment count: the number of all unmutated wild-type fragments at a given mutation site, and the number of all support fragments carrying the single-base mutation at this site.
  • Allelic variation: this feature type includes two features, namely sample frequency and population frequency.
  • the sample frequency refers to the allele mutation frequency (Variant Allele Frequency) that occurs in a certain sample
  • the population frequency refers to the frequency of the mutation in the population.
  • Age: the age of the subject who provided the sample carrying the mutation.
  • intron_variant (intron variant)
  • missense_variant (missense mutation)
  • promoter_region_variant (promoter region mutation)
  • z-transform is performed on each feature type, that is, all values are converted into a standard normal distribution with a mean of 0 and a variance of 1.
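A minimal Python sketch of the z-transform applied to one feature column is shown below. It assumes the population standard deviation is used and that a constant column maps to zeros; neither choice is stated in the text, and the function name `z_transform` is hypothetical.

```python
from statistics import fmean, pstdev


def z_transform(values):
    """Rescale a feature column to mean 0 and variance 1."""
    mean = fmean(values)
    spread = pstdev(values)  # population standard deviation (assumption)
    if spread == 0:
        return [0.0 for _ in values]  # constant feature: nothing to scale
    return [(value - mean) / spread for value in values]


# Toy feature column (illustrative only).
scaled = z_transform([1.0, 2.0, 3.0, 4.0])
```

In practice the mean and standard deviation would normally be estimated on the training set and reused unchanged for the validation sets, so that the validation data does not leak into the scaling.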
  • the number of samples is set to None, the minimum number of samples required to split a node is 10, and the final result consists of 40 decision trees.
  • the real data includes a total of 1309 lung cancer blood samples, which are divided into a training set containing 928 samples and two validation sets containing 191 and 190 samples respectively (that is, the training set, validation set 1 and validation set 2 in Figure 1, respectively).
  • RF(+Dev) and RF(-Dev) respectively refer to the results of validating, on the above two validation sets, machine learning models trained with and without the parameter Dev.
  • the method and/or model described in this application not only performs well in lung cancer, but also shows excellent pan-cancer classification ability.

Abstract

A method for distinguishing a somatic mutation and a germline mutation. The method comprises: obtaining at least one mutation site derived from a sample of a subject; obtaining a wild-type support fragment and a mutant-type support fragment, wherein the wild-type support fragment is a cfDNA fragment containing a wild-type base sequence, the mutant-type support fragment is a cfDNA fragment containing a mutant-type base sequence, the wild-type base sequence is the same as a nucleotide sequence at a corresponding position of the mutation site of a human reference genome, and the mutant-type base sequence is different from the nucleotide sequence; obtaining the number of the wild-type support fragments with at least one length, obtaining the number of the corresponding mutant-type support fragments with the same length, and calculating the difference value between the ratio of the number of the wild-type support fragments with the same length to the total number of the corresponding support fragments and the ratio of the number of the mutant-type support fragments with the same length to the total number of the corresponding support fragments; and using the difference value as a distinguishing index. Provided are a method and a device for identifying ctDNA from cfDNA. The method is used for tumor family management and TMB detection.

Description

A method for distinguishing between somatic mutations and germline mutations
Technical field
The present application relates to the field of bioinformatics, and in particular to a method for distinguishing somatic mutations from germline mutations.
Background
In the plasma of tumor patients, cfDNA is widely present and includes a small amount of tumor-specific ctDNA. During cell senescence and apoptosis, this ctDNA is cleaved differently from other, normal cfDNA. In other words, ctDNA and other ordinary cfDNA in plasma cell-free DNA show different fragmentation distribution patterns. This difference in distribution pattern can therefore serve as a marker for identifying ctDNA.
Somatic mutations are non-heritable variations, distinct from germline mutations, that gradually accumulate over the human life span. Because somatic mutations are closely related to the molecular signaling pathways of tumorigenesis, they serve as an important marker of tumor formation. Germline mutations are heritable mutations that occur in germ cells and are of great significance to the study of hereditary diseases and genome evolution. The "Chinese Expert Consensus on Tumor Mutation Burden Detection and Clinical Application (2020 Edition)" states that, in the standardization requirements for tumor mutation burden (TMB) algorithms, the core element is the detection and counting of somatic mutations that can affect protein coding. Because the currently public population databases are dominated by European and American populations, they are not suitable for TMB detection in the Chinese population. The consensus therefore recommends that somatic mutations for TMB be determined either by using a control sample (peripheral blood or paracancerous tissue) to remove the patient's germline variants, or by using a background library built from a large-sample hereditary mutation database of the Chinese population to filter out germline variants. Correctly distinguishing the type and source of mutations in cells thus plays an important role in tumor classification, treatment, and prognosis.
However, current methods for identifying somatic mutations rely mainly on testing paired samples. Parallel sequencing of paired samples can determine the source of a mutation very accurately, but when no paired material was collected at the outset, collecting a paired sample afterwards is often very difficult. In addition, high-throughput sequencing of the paired sample at the same depth as the tumor sample consumes considerable funds and computing resources. This approach also places high demands on the completeness of sample collection and on computing and storage resources, and it significantly increases the cost of mutation detection. Moreover, methods based on mutation frequency filtering and comparison against mutation annotation databases still fall short in accuracy.
Summary of the invention
The present application provides a method for distinguishing somatic mutations from germline mutations, a method for identifying ctDNA in cfDNA, and devices and applications corresponding to these methods. The method and/or device described in this application has at least one of the following features: (1) it requires only a single sample, namely a sample derived from the subject; (2) it is widely applicable and can be used to identify somatic mutations and/or ctDNA in different cancer types; (3) high sensitivity; (4) high accuracy, for example, factors such as mutation databases, population frequency, and mutation abundance can be combined with the method described in this application to improve the reliability of the distinction; (5) it is easy to implement and places no limit on the number of mutation sites; (6) it is quick to perform, for example, the subject's plasma can be used as the sample; (7) it introduces a new dimension for distinction.
In one aspect, the present application provides a method for distinguishing between somatic mutations and germline mutations, comprising the following steps:
(1) obtaining at least one mutation site derived from a sample of a subject, wherein the mutation site is obtained by gene sequencing;
(2) for each mutation site, obtaining wild-type support fragments and mutant-type support fragments, wherein a wild-type support fragment is a cfDNA fragment containing the wild-type base sequence and a mutant-type support fragment is a cfDNA fragment containing the mutant-type base sequence, the wild-type base sequence is identical to the nucleotide sequence of the reference genome at the position corresponding to the mutation site, the mutant-type base sequence differs from that nucleotide sequence, and the reference genome is the human reference genome used in the gene sequencing;
(3) for each mutation site, obtaining the number of wild-type support fragments of at least one length and the number of mutant-type support fragments of the same length; calculating the ratio WC of the number of wild-type support fragments of that length to the total number of wild-type support fragments; calculating the ratio MC of the number of mutant-type support fragments of the same length to the total number of mutant-type support fragments; and calculating the difference between the ratio WC and the ratio MC at the same length;
(4) using the difference as an index for distinguishing whether the mutation site is a somatic mutation or a germline mutation.
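Steps (2) and (3) amount to comparing two normalized fragment-length histograms at one mutation site. The Python sketch below shows one minimal way to compute the per-length difference. The function names, the sign convention (WC minus MC), and the default length range of 1 to 550 nucleotides (taken from one of the embodiments) are illustrative assumptions, not details fixed by the application.

```python
from collections import Counter


def length_fractions(lengths):
    """Return {length: count / total} for one group of support fragments."""
    counts = Counter(lengths)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {length: n / total for length, n in counts.items()}


def per_length_difference(wild_lengths, mutant_lengths, min_len=1, max_len=550):
    """Compute WC - MC for every integer length in [min_len, max_len].

    WC and MC are the fractions of wild-type and mutant-type support
    fragments that have a given length. The sign (WC - MC) is an assumption.
    """
    wc = length_fractions(wild_lengths)
    mc = length_fractions(mutant_lengths)
    return {
        length: wc.get(length, 0.0) - mc.get(length, 0.0)
        for length in range(min_len, max_len + 1)
    }
```

Because WC and MC each sum to 1 over all lengths, the differences sum to zero; the information lies in where along the length axis the two groups diverge.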
In one aspect, the present application provides a method for identifying ctDNA in cfDNA, comprising the following steps:
(1) obtaining at least one mutation site derived from a sample of a subject, wherein the mutation site is obtained by gene sequencing;
(2) for each mutation site, obtaining wild-type support fragments and mutant-type support fragments, wherein a wild-type support fragment is a cfDNA fragment containing the wild-type base sequence and a mutant-type support fragment is a cfDNA fragment containing the mutant-type base sequence, the wild-type base sequence is identical to the nucleotide sequence of the reference genome at the position corresponding to the mutation site, the mutant-type base sequence differs from that nucleotide sequence, and the reference genome is the human reference genome used in the gene sequencing;
(3) for each mutation site, obtaining the number of wild-type support fragments of at least one length and the number of mutant-type support fragments of the same length; calculating the ratio WC of the number of wild-type support fragments of that length to the total number of wild-type support fragments; calculating the ratio MC of the number of mutant-type support fragments of the same length to the total number of mutant-type support fragments; and calculating the difference between the ratio WC and the ratio MC at the same length;
(4) using the difference as an index for identifying whether the mutation site is located on ctDNA.
In one aspect, the present application provides a method for training a machine learning model, comprising the following steps:
(1) obtaining at least one mutation site derived from a sample of a subject, wherein the mutation site is obtained by gene sequencing;
(2) for each mutation site, obtaining wild-type support fragments and mutant-type support fragments, wherein a wild-type support fragment is a cfDNA fragment containing the wild-type base sequence and a mutant-type support fragment is a cfDNA fragment containing the mutant-type base sequence, the wild-type base sequence is identical to the nucleotide sequence of the reference genome at the position corresponding to the mutation site, the mutant-type base sequence differs from that nucleotide sequence, and the reference genome is the human reference genome used in the gene sequencing;
(3) for each mutation site, obtaining the number of wild-type support fragments of at least one length and the number of mutant-type support fragments of the same length; calculating the ratio WC of the number of wild-type support fragments of that length to the total number of wild-type support fragments; calculating the ratio MC of the number of mutant-type support fragments of the same length to the total number of mutant-type support fragments; and calculating the difference between the ratio WC and the ratio MC at the same length;
(4) inputting the difference into the machine learning model as a training index to perform machine learning training.
In one aspect, the present application provides a method for building a database, comprising the following steps:
(1) obtaining at least one mutation site derived from a sample of a subject, wherein the mutation site is obtained by gene sequencing;
(2) for each mutation site, obtaining wild-type support fragments and mutant-type support fragments, wherein a wild-type support fragment is a cfDNA fragment containing the wild-type base sequence and a mutant-type support fragment is a cfDNA fragment containing the mutant-type base sequence, the wild-type base sequence is identical to the nucleotide sequence of the reference genome at the position corresponding to the mutation site, the mutant-type base sequence differs from that nucleotide sequence, and the reference genome is the human reference genome used in the gene sequencing;
(3) for each mutation site, obtaining the number of wild-type support fragments of at least one length and the number of mutant-type support fragments of the same length; calculating the ratio WC of the number of wild-type support fragments of that length to the total number of wild-type support fragments; calculating the ratio MC of the number of mutant-type support fragments of the same length to the total number of mutant-type support fragments; and calculating the difference between the ratio WC and the ratio MC at the same length;
(4) storing the difference in a database for use in distinguishing somatic mutations from germline mutations and/or identifying ctDNA in cfDNA.
In some embodiments, the gene sequencing includes next-generation sequencing (NGS).
In some embodiments, the method uses only a sample derived from the subject.
In some embodiments, the sample includes a blood sample.
In some embodiments, the method further comprises the step of obtaining a sample derived from the subject.
In some embodiments, the mutation site includes a single nucleotide variation (SNV).
In some embodiments, the mutation site comprises two or more nucleotide variations.
In some embodiments, the length of the wild-type support fragments and/or the mutant-type support fragments ranges from about 1 nucleotide to about 550 nucleotides.
In some embodiments, the length of the wild-type support fragments and/or the mutant-type support fragments ranges from about 1 nucleotide to about 400 nucleotides.
In some embodiments, the length of the wild-type support fragments and/or the mutant-type support fragments ranges from about 1 nucleotide to about 200 nucleotides.
In some embodiments, the method comprises the following step: (4') obtaining the distribution of the differences from step (3), selecting the maximum value in the distribution as Dev(Max), and using Dev(Max) as the index for the distinguishing and/or as the training sample.
In some embodiments, the method comprises the following step: (4') obtaining the distribution of the differences from step (3), which is referred to as the first distribution.
In some embodiments, the method comprises the following step: (5) within the length range of an effective fragment interval, accumulating the differences in the first distribution one after another to obtain cumulative values, wherein the length of the effective fragment interval covers the length of a nucleic acid sequence wound around a nucleosome.
In some embodiments, the nucleic acid sequence is capable of winding around a nucleosome for two or more turns, or for one turn or less.
In some embodiments, the length of the effective fragment interval is about 1 to about 167 nucleotides, and/or about 200 or more nucleotides.
In some embodiments, the length of the effective fragment interval is about 1 to about 167 nucleotides, and/or about 250 to about 400 nucleotides.
In some embodiments, the method comprises the following step: (6) obtaining a second distribution of the cumulative values from step (5), and calculating the maximum of the cumulative values in the second distribution. In some embodiments, the maximum of the cumulative values is taken as Dev(Max), and Dev(Max) is used as the index for the distinguishing and/or as the training sample.
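Steps (5) and (6) can be read as a running sum over the per-length differences followed by a maximum. The sketch below assumes the differences are supplied as a dictionary keyed by fragment length, that the effective intervals are 1 to 167 and 250 to 400 nucleotides (one of the embodiments), and that the running sum simply continues from the first interval into the second. The application does not spell out how two disjoint intervals are combined, so that last point in particular is an assumption, as is the function name.

```python
from itertools import accumulate


def dev_max(diff_by_length, intervals=((1, 167), (250, 400))):
    """Return the maximum of the running sum of per-length differences.

    Only lengths inside the effective intervals contribute; lengths that
    are missing from diff_by_length count as a difference of 0.
    """
    effective = [
        diff_by_length.get(length, 0.0)
        for start, end in intervals
        for length in range(start, end + 1)
    ]
    running = list(accumulate(effective))  # the "second distribution"
    return max(running)
```

Taking the maximum of a running sum of two normalized histograms' differences resembles a one-sided Kolmogorov-Smirnov style statistic: it is large when one group of fragments is consistently shifted toward different lengths than the other.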
In some embodiments, the difference is smoothed, wherein the smoothing comprises the following steps:
(a) determining a smoothing window value, wherein the smoothing window value is an integer from about 1 to 10;
(b) determining several smoothing sampling length ranges, each spanning a number of lengths equal to the smoothing window value, wherein the minimum of each smoothing sampling length range is its starting length, and the starting length ranges over the lengths of the wild-type support fragments and/or the mutant-type support fragments;
(c) for any one smoothing sampling length range, obtaining the number of wild-type support fragments of at least one smoothing sampling length and the number of mutant-type support fragments of the same length; calculating the ratio WC of the number of wild-type support fragments of that length to the total number of wild-type support fragments; calculating the ratio MC of the number of mutant-type support fragments of the same length to the total number of mutant-type support fragments; and calculating the difference between the ratio WC and the ratio MC at the same length;
(d) calculating the average difference for that smoothing sampling length range from the differences of the at least one smoothing sampling length;
(e) using the resulting average difference as the representative value of that smoothing sampling length range.
In some embodiments, the smoothing window value is an integer from about 2 to 6.
In some embodiments, the smoothing window value is 3.
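The smoothing in steps (a) to (e) can be sketched as a moving average over the per-length differences. The sketch below assumes sliding (overlapping) windows, one starting at each length, with each window's average stored under its starting length. The application only states that each range has a starting length and spans as many lengths as the window value, so the overlap, the function name, and the dictionary layout are assumptions.

```python
def smooth_differences(diff_by_length, window=3, min_len=1, max_len=550):
    """Average the per-length differences over each window of lengths.

    Each window covers the lengths start .. start + window - 1 and is keyed
    by its starting length. Missing lengths count as a difference of 0.
    """
    smoothed = {}
    for start in range(min_len, max_len - window + 2):
        values = [
            diff_by_length.get(length, 0.0)
            for length in range(start, start + window)
        ]
        smoothed[start] = sum(values) / window
    return smoothed
```

With a window of 3, for example, the representative value for the range starting at 150 nucleotides is the mean of the differences at 150, 151, and 152 nucleotides, which dampens noise from sparsely populated lengths.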
In some embodiments, the smoothing comprises the following step: (f) obtaining a first distribution of the average differences from step (e).
In some embodiments, the smoothing comprises the following step: (g) within the length range of an effective fragment interval, accumulating the average differences in the first distribution one after another to obtain cumulative values, wherein the length of the effective fragment interval is the length of a nucleic acid sequence wound around a nucleosome.
In some embodiments, the nucleic acid sequence is capable of winding around a nucleosome for two or more turns, or for one turn or less.
In some embodiments, the length of the effective fragment interval is about 1 to about 167 nucleotides, and/or about 200 or more nucleotides.
In some embodiments, the length of the effective fragment interval is about 1 to about 167 nucleotides, and/or about 250 to about 400 nucleotides.
In some embodiments, the smoothing comprises the following step: (h) obtaining a second distribution of the cumulative values from step (g), and calculating the maximum of the cumulative values in the second distribution.
In some embodiments, the maximum of the cumulative values is taken as Dev(Max), and Dev(Max) is used as the index for the distinguishing and/or as the training sample.
In some embodiments, the index further includes one or more parameters selected from the group consisting of: the chromosomal position of the mutation site, the base substitution pattern of the mutation site, the counts of nucleic acid fragments of each length for the wild type at the mutation site and/or the counts of nucleic acid fragments of each length for the mutant type at the mutation site, the allelic variation of the mutation site, the age of the subject, and the mutation type of the mutation site.
In some embodiments, the index further includes one or more parameters selected from the group consisting of: the chromosomal position of the SNV site, the base substitution pattern of the SNV site, the counts of nucleic acid fragments of each length for the wild type at the SNV site and/or the counts of nucleic acid fragments of each length for the mutant type at the SNV site, the allelic variation of the SNV site, the age of the subject, and the mutation type of the SNV site.
In some embodiments, detecting the mutation site comprises the following steps:
(1) obtaining data from the sample; (2) performing variant identification on the data obtained in step (1); (3) annotating the variants identified in step (2); and (4) filtering the variants annotated in step (3) to obtain the mutation site; optionally, performing quality control on the mutation site.
In another aspect, the present application provides a device for distinguishing between somatic mutations and germline mutations, comprising:
a calculation module, configured to calculate the difference between the ratio WC and the ratio MC at the same length, wherein, for each mutation site, the calculation is based on the number of wild-type support fragments of at least one length and the number of mutant-type support fragments of the same length; the ratio WC is the ratio of the number of wild-type support fragments of one length to the total number of wild-type support fragments; the ratio MC is the ratio of the number of mutant-type support fragments of the same length to the total number of mutant-type support fragments; a wild-type support fragment is a cfDNA fragment containing the wild-type base sequence and a mutant-type support fragment is a cfDNA fragment containing the mutant-type base sequence; the wild-type base sequence is identical to the nucleotide sequence of the reference genome at the position corresponding to the mutation site, and the mutant-type base sequence differs from that nucleotide sequence; the reference genome is the human reference genome used in the gene sequencing; and the mutation site is derived from a sample of a subject and is obtained by gene sequencing;
a judging module, configured to obtain an identification result for the somatic mutation according to a machine learning model that has undergone machine learning training, wherein the machine learning training comprises inputting the difference into the machine learning model as a training sample.
In another aspect, the present application provides a device for identifying ctDNA in cfDNA, comprising:
a calculation module, configured to calculate the difference between the ratio WC and the ratio MC at the same length, wherein, for each mutation site, the calculation is based on the number of wild-type support fragments of at least one length and the number of mutant-type support fragments of the same length; the ratio WC is the ratio of the number of wild-type support fragments of one length to the total number of wild-type support fragments; the ratio MC is the ratio of the number of mutant-type support fragments of the same length to the total number of mutant-type support fragments; a wild-type support fragment is a cfDNA fragment containing the wild-type base sequence and a mutant-type support fragment is a cfDNA fragment containing the mutant-type base sequence; the wild-type base sequence is identical to the nucleotide sequence of the reference genome at the position corresponding to the mutation site, and the mutant-type base sequence differs from that nucleotide sequence; the reference genome is the human reference genome used in the gene sequencing; and the mutation site is derived from a sample of a subject and is obtained by gene sequencing;
a judging module, configured to obtain a judgment result for identifying ctDNA in the cfDNA according to a machine learning model that has undergone machine learning training, wherein the machine learning training comprises inputting the difference into the machine learning model as a training sample.
In another aspect, the present application provides a device for training a machine learning model, comprising:
a calculation module, configured to calculate the difference between the ratio WC and the ratio MC at the same length, wherein, for each mutation site, the calculation is based on the number of wild-type support fragments of at least one length and the number of mutant-type support fragments of the same length; the ratio WC is the ratio of the number of wild-type support fragments of one length to the total number of wild-type support fragments; the ratio MC is the ratio of the number of mutant-type support fragments of the same length to the total number of mutant-type support fragments; a wild-type support fragment is a cfDNA fragment containing the wild-type base sequence and a mutant-type support fragment is a cfDNA fragment containing the mutant-type base sequence; the wild-type base sequence is identical to the nucleotide sequence of the reference genome at the position corresponding to the mutation site, and the mutant-type base sequence differs from that nucleotide sequence; the reference genome is the human reference genome used in the gene sequencing; and the mutation site is derived from a sample of a subject and is obtained by gene sequencing;
a training module, configured to input the difference into the machine learning model as a training sample to perform machine learning training.
在某些实施方式中,所述装置仅使用源自受试者的样本。In certain embodiments, the device uses only samples derived from the subject.
在某些实施方式中,所述的装置还包括:输出模块,用以显示所述判断模块产生的所述体细胞突变的识别结果和/或在所述cfDNA中识别ctDNA的判断结果。In some embodiments, the device further includes: an output module, used to display the identification result of the somatic mutation and/or the identification result of ctDNA in the cfDNA generated by the determination module.
在某些实施方式中,所述的装置还包括样品获得模块,用于获得受试者的所述样本。In some embodiments, the device further includes a sample obtaining module, configured to obtain the sample from the subject.
在某些实施方式中,所述样本包括血液样本。In certain embodiments, the sample comprises a blood sample.
在某些实施方式中,所述样品获得模块包括获得所述样本的试剂和/或仪器。In certain embodiments, the sample obtaining module includes reagents and/or instruments for obtaining the sample.
在某些实施方式中,所述的装置还包括数据接收模块,用于获得所述样本中所述突变位点。In some embodiments, the device further includes a data receiving module, configured to obtain the mutation site in the sample.
在某些实施方式中,所述突变位点包括单核苷酸变异(SNV)。In certain embodiments, the mutation site comprises a single nucleotide variation (SNV).
在某些实施方式中,所述突变位点包含两个以上的核苷酸变异。In certain embodiments, the mutation site comprises more than two nucleotide variations.
在某些实施方式中,所述装置中检测所述突变位点包括以下的步骤:In some embodiments, detecting the mutation site in the device comprises the following steps:
(1)从所述样本中获得数据;(2)对步骤(1)所得的数据进行变异识别;(3)对步骤(2)识别的变异进行变异注释;以及,(4)对步骤(3)注释的变异进行过滤,获得突变位点;可选地,对所述突变位点进行质量控制。(1) obtaining data from the sample; (2) performing variation identification on the data obtained in step (1); (3) performing variation annotation on the variation identified in step (2); and, (4) performing variation annotation on step (3) ) to filter the annotated variation to obtain the mutation site; optionally, perform quality control on the mutation site.
在某些实施方式中,所述基因测序包括二代基因测序(NGS)。In some embodiments, the gene sequencing includes next generation gene sequencing (NGS).
在某些实施方式中,所述数据接收模块包括所述基因测序所需的试剂和/或仪器。In some embodiments, the data receiving module includes reagents and/or instruments required for the gene sequencing.
在某些实施方式中,所述装置还包括输入模块,用以获得所述至少一个长度的所述野生型支持片段的数量,和/或所述对应的相同长度的所述突变型支持片段的数量。In some embodiments, the device further includes an input module, configured to obtain the number of wild-type supporting fragments of the at least one length and/or the number of mutant supporting fragments of the corresponding same length.
在某些实施方式中,所述输入模块能够区分所述野生型支持片段和所述突变型支持片段。In certain embodiments, the input module is capable of distinguishing between the wild-type supporting fragment and the mutant supporting fragment.
在某些实施方式中,所述输入模块统计不同长度的所述野生型支持片段的数量;以及,统计不同长度的所述突变型支持片段的数量。In some embodiments, the input module counts the number of wild-type supporting fragments of different lengths, and counts the number of mutant supporting fragments of different lengths.
在某些实施方式中,所述野生型支持片段和/或所述突变型支持片段的长度的范围为约1个核苷酸至约550个核苷酸。In certain embodiments, the wild-type supporting fragment and/or the mutant supporting fragment range in length from about 1 nucleotide to about 550 nucleotides.
在某些实施方式中,所述野生型支持片段和/或所述突变型支持片段的长度的范围为约1个核苷酸至约400个核苷酸。In certain embodiments, the wild-type supporting fragment and/or the mutant supporting fragment range in length from about 1 nucleotide to about 400 nucleotides.
在某些实施方式中,所述野生型支持片段和/或所述突变型支持片段的长度的范围为约1个核苷酸至约200个核苷酸。In certain embodiments, the wild-type supporting fragment and/or the mutant supporting fragment range in length from about 1 nucleotide to about 200 nucleotides.
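As a non-limiting illustration, the counting performed by the input module can be sketched in Python as follows. The input format (pairs of fragment length and a flag indicating whether the fragment supports the mutant sequence) and the function name `count_by_length` are assumptions made for this sketch and do not appear in the application; the default upper bound of 550 nucleotides follows the widest length range stated above.

```python
from collections import Counter


def count_by_length(fragments, min_len=1, max_len=550):
    """Count wild-type and mutant supporting fragments per fragment length.

    fragments: iterable of (length, supports_mutant) pairs. This input
    format is hypothetical and chosen only for illustration.
    """
    wild, mutant = Counter(), Counter()
    for length, supports_mutant in fragments:
        if not (min_len <= length <= max_len):
            continue  # ignore fragments outside the chosen length range
        if supports_mutant:
            mutant[length] += 1
        else:
            wild[length] += 1
    return wild, mutant


# Toy data: five fragments covering one mutation site.
fragments = [(166, False), (166, False), (150, True), (166, True), (600, False)]
wild_counts, mutant_counts = count_by_length(fragments)
```

Passing `max_len=400` or `max_len=200` gives the narrower length ranges of the other embodiments.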
在某些实施方式中,在所述计算模块中:获得所述差值的分布,选择所述分布中的最大值作为Dev(Max),将所述Dev(Max)作为所述区分的指标和/或作为所述训练样本。In some embodiments, in the calculation module: the distribution of the differences is obtained, the maximum value in the distribution is selected as Dev(Max), and Dev(Max) is used as the distinguishing indicator and/or as the training sample.
在某些实施方式中,在所述计算模块中平滑化处理所述差值,其中所述平滑化处理包括以下步骤:(a)确定平滑化窗口值,其中所述平滑化窗口值为约1-10中的整数;(b)确定若干个长度值等于平滑化窗口值的平滑化取样长度范围,其中每一个平滑化取样长度范围的最小值为起始长度,其中所述起始长度的范围为所述野生型支持片段和/或所述突变型支持片段的长度的范围;(c)获取任意平滑化取样长度范围中,至少一个平滑化取样长度的所述野生型支持片段的数量,获取对应的相同长度的所述突变型支持片段的数量,In some embodiments, the differences are smoothed in the calculation module, wherein the smoothing comprises the following steps: (a) determining a smoothing window value, wherein the smoothing window value is an integer from about 1 to 10; (b) determining several smoothing sampling length ranges whose span equals the smoothing window value, wherein the minimum of each smoothing sampling length range is its starting length, and the starting length lies within the range of lengths of the wild-type supporting fragments and/or the mutant supporting fragments; (c) for any smoothing sampling length range, obtaining the number of wild-type supporting fragments of at least one smoothing sampling length and the number of mutant supporting fragments of the corresponding same length,
计算该长度的所述野生型支持片段的数量与所述野生型支持片段的总数量的比值WC;calculating the ratio WC of the number of wild-type support fragments of the length to the total number of wild-type support fragments;
计算相同长度的所述突变型支持片段的数量与所述突变型支持片段的总数量的比值MC;calculating the ratio MC of the number of mutant supporting fragments of the same length to the total number of mutant supporting fragments;
计算相同长度下所述比值WC与所述比值MC的差值;(d)根据所述至少一个平滑化取样长度的所述差值计算该平滑化取样长度范围的平均差值;(e)将所得的平均差值作为所述该平滑化取样长度范围的代表值。calculating the difference between said ratio WC and said ratio MC at the same length; (d) calculating an average difference over a range of smoothed sample lengths based on said difference of said at least one smoothed sample length; (e) The obtained average difference is used as a representative value of the smoothed sampling length range.
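As a non-limiting illustration, steps (c) to (e) can be sketched in Python as follows. The function names, the dictionary input format, the sign convention (WC minus MC) and the use of non-overlapping windows are assumptions made for this sketch; the text above only requires a difference between the two ratios and an average over each smoothing sampling length range.

```python
def length_differences(wild, mutant, min_len, max_len):
    """Return {length: WC - MC} for every length in [min_len, max_len].

    wild / mutant: dicts mapping fragment length -> fragment count.
    WC and MC are each count divided by the total for that fragment class.
    """
    total_w = sum(wild.values())
    total_m = sum(mutant.values())
    if total_w == 0 or total_m == 0:
        raise ValueError("both fragment classes need at least one fragment")
    diffs = {}
    for length in range(min_len, max_len + 1):
        wc = wild.get(length, 0) / total_w
        mc = mutant.get(length, 0) / total_m
        diffs[length] = wc - mc
    return diffs


def window_means(diffs, window=3):
    """Average the differences over consecutive, non-overlapping windows.

    Returns a list of (starting_length, average_difference) pairs; a
    trailing partial window is dropped.
    """
    lengths = sorted(diffs)
    means = []
    for i in range(0, len(lengths) - window + 1, window):
        chunk = lengths[i:i + window]
        means.append((chunk[0], sum(diffs[n] for n in chunk) / window))
    return means


# Toy counts for lengths 1 to 3 (not real data).
wild = {1: 1, 2: 1, 3: 2}
mutant = {1: 2, 2: 1, 3: 1}
diffs = length_differences(wild, mutant, 1, 3)
first_distribution = window_means(diffs, window=3)
```

With these toy counts the three differences are -0.25, 0.0 and 0.25, so the single window covering lengths 1 to 3 has a representative value of 0.0.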
在某些实施方式中,所述平滑化窗口值为约2-6中的整数。In some embodiments, the smoothing window value is an integer between about 2-6.
在某些实施方式中,所述平滑化窗口值为3。In some embodiments, the smoothing window value is 3.
在某些实施方式中,所述平滑化处理包括以下步骤:(f)获得步骤(e)所述平均差值的第一分布。In some embodiments, the smoothing process includes the following steps: (f) obtaining the first distribution of the average difference values in step (e).
在某些实施方式中,所述平滑化处理包括以下步骤:(g)在有效片段区间的长度范围内,将所述第一分布中的每个平均差值依次进行累加,获得加成值,其中,所述有效片段区间的长度覆盖缠绕核小体的核酸序列的长度。In some embodiments, the smoothing process includes the following steps: (g) within the length range of the effective segment interval, each average difference in the first distribution is sequentially accumulated to obtain an added value, Wherein, the length of the effective fragment interval covers the length of the nucleic acid sequence wound around the nucleosome.
在某些实施方式中,所述核酸序列能够缠绕核小体2周以上,或者,能够缠绕核小体1周以内。In some embodiments, the nucleic acid sequence is capable of wrapping around a nucleosome for 2 or more turns, or for no more than 1 turn.
在某些实施方式中,所述有效片段区间的长度为约1-约167个核苷酸,和/或,约200以上个核苷酸。In some embodiments, the length of the effective fragment interval is about 1 to about 167 nucleotides, and/or, about 200 or more nucleotides.
在某些实施方式中,所述有效片段区间的长度为约1-约167个核苷酸,和/或,约250-约400个核苷酸。In certain embodiments, the length of the effective fragment interval is about 1 to about 167 nucleotides, and/or, about 250 to about 400 nucleotides.
在某些实施方式中,所述平滑化处理包括以下步骤:(h)获得步骤(g)所述加成值的第二分布,计算所述第二分布中所述加成值的最大值。In some embodiments, the smoothing process includes the following steps: (h) obtaining a second distribution of the added value in step (g), and calculating the maximum value of the added value in the second distribution.
在某些实施方式中,将所述加成值的最大值作为Dev(Max),将所述Dev(Max)作为所述区分的指标和/或作为所述训练样本。In some embodiments, the maximum value of the added value is used as Dev(Max), and the Dev(Max) is used as the indicator of the distinction and/or as the training sample.
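As a non-limiting illustration, steps (g) and (h) and the selection of Dev(Max) can be sketched in Python as follows. This is one possible reading of the text: window averages whose starting length falls inside an effective fragment interval are accumulated as a running sum, the running sums form the second distribution, and its maximum is Dev(Max). The function name, the input format, the default intervals (taken from the 1 to 167 and 250 to 400 nucleotide embodiment) and the choice not to restart the sum between intervals are assumptions.

```python
from itertools import accumulate


def dev_max(first_distribution, intervals=((1, 167), (250, 400))):
    """Compute Dev(Max) from (starting_length, average_difference) pairs.

    Only windows whose starting length lies in one of the effective
    fragment intervals contribute to the running sum.
    """
    kept = [
        mean
        for start, mean in first_distribution
        if any(lo <= start <= hi for lo, hi in intervals)
    ]
    second_distribution = list(accumulate(kept))  # running (added) values
    return max(second_distribution) if second_distribution else None


# Toy first distribution (not real data); the window starting at 200
# lies outside both effective intervals and is skipped.
toy_first = [(1, 0.25), (4, 0.5), (200, 1.0), (250, -0.5)]
toy_dev_max = dev_max(toy_first)
```

Here the running sums are 0.25, 0.75 and 0.25, so the sketch returns 0.75.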
在某些实施方式中,所述计算模块输出所述Dev(Max)。In some embodiments, the calculation module outputs the Dev(Max).
在某些实施方式中,所述指标和/或训练样本还包括选自下组参数中的一种或多种:所述突变位点所在的染色体位置、所述突变位点的碱基替换模式、所述突变位点的野生型中各个长度的核酸片段的计数值和/或所述突变位点的突变型中各个长度的核酸片段的计数值、所述突变位点的等位变异、受试者的年龄和所述突变位点的突变类型。In some embodiments, the indicator and/or training sample further includes one or more parameters selected from the following group: the chromosomal position of the mutation site, the base substitution pattern of the mutation site, the counts of wild-type nucleic acid fragments of each length at the mutation site and/or the counts of mutant nucleic acid fragments of each length at the mutation site, the allelic variation of the mutation site, the age of the subject, and the mutation type of the mutation site.
在某些实施方式中,所述指标和/或训练样本还包括选自下组参数中的一种或多种:所述SNV位点所在的染色体位置、所述SNV位点的碱基替换模式、所述SNV位点的野生型中各个长度的核酸片段的计数值和/或所述SNV位点的突变型中各个长度的核酸片段的计数值、所述SNV位点的等位变异、受试者的年龄和所述SNV位点的突变类型。In some embodiments, the indicator and/or training sample further includes one or more parameters selected from the following group: the chromosomal position of the SNV site, the base substitution pattern of the SNV site, the counts of wild-type nucleic acid fragments of each length at the SNV site and/or the counts of mutant nucleic acid fragments of each length at the SNV site, the allelic variation of the SNV site, the age of the subject, and the mutation type of the SNV site.
另一方面,本申请提供了一种电子设备,包括存储器;和耦接至所述存储器的处理器,所述处理器被配置为基于存储在所述存储器中的指令执行以实现本申请所述的区分体细胞突变和种系突变的方法;本申请所述的在cfDNA中识别ctDNA的方法,或者本申请所述的机器学习模型的训练方法。In another aspect, the present application provides an electronic device, including a memory and a processor coupled to the memory, the processor being configured to execute instructions stored in the memory to implement the method for distinguishing somatic mutations from germline mutations described in the present application, the method for identifying ctDNA in cfDNA described in the present application, or the method for training a machine learning model described in the present application.
另一方面,本申请提供了一种非易失性计算机可读存储介质,其上存储有计算机程序,该程序被处理器执行以实现本申请所述的区分体细胞突变和种系突变的方法;本申请所述的在cfDNA中识别ctDNA的方法,或者本申请所述的机器学习模型的训练方法。In another aspect, the present application provides a non-volatile computer-readable storage medium on which a computer program is stored, the program being executed by a processor to implement the method for distinguishing somatic mutations from germline mutations described in the present application, the method for identifying ctDNA in cfDNA described in the present application, or the method for training a machine learning model described in the present application.
另一方面,本申请提供了一种数据库系统,其包括存储器;和耦接至所述存储器的处理器,所述处理器被配置为基于存储在所述存储器中的指令执行以实现本申请所述的区分体细胞突变和种系突变的方法;本申请所述的在cfDNA中识别ctDNA的方法,或者本申请所述的数据库建立方法。In another aspect, the present application provides a database system, including a memory and a processor coupled to the memory, the processor being configured to execute instructions stored in the memory to implement the method for distinguishing somatic mutations from germline mutations described in the present application, the method for identifying ctDNA in cfDNA described in the present application, or the method for establishing a database described in the present application.
另一方面,本申请提供了一种本申请所述的区分体细胞突变和种系突变的方法在肿瘤家系管理的应用。In another aspect, the present application provides an application of the method for distinguishing somatic mutations and germline mutations described in the present application in the management of tumor families.
另一方面,本申请提供了一种本申请所述的区分体细胞突变和种系突变的方法在肿瘤突变负担(TMB)检测中的应用。In another aspect, the present application provides an application of the method for distinguishing somatic mutations and germline mutations described in the present application in the detection of tumor mutation burden (TMB).
本领域技术人员能够从下文的详细描述中容易地洞察到本申请的其它方面和优势。下文的详细描述中仅显示和描述了本申请的示例性实施方式。如本领域技术人员将认识到的,本申请的内容使得本领域技术人员能够对所公开的具体实施方式进行改动而不脱离本申请所涉及发明的精神和范围。相应地,本申请的附图和说明书中的描述仅仅是示例性的,而非为限制性的。Those skilled in the art can easily perceive other aspects and advantages of the present application from the following detailed description. In the following detailed description, only exemplary embodiments of the present application are shown and described. As those skilled in the art will appreciate, the content of the present application enables those skilled in the art to make changes to the specific embodiments which are disclosed without departing from the spirit and scope of the invention to which this application relates. Correspondingly, the drawings and descriptions in the specification of the present application are only exemplary rather than restrictive.
附图说明Description of Drawings
本申请所涉及的发明的具体特征如所附权利要求书所显示。通过参考下文中详细描述的示例性实施方式和附图能够更好地理解本申请所涉及发明的特点和优势。对附图简要说明书如下:The particular features of the invention to which this application relates are set forth in the appended claims. The features and advantages of the invention to which this application relates can be better understood with reference to the exemplary embodiments described in detail hereinafter and the accompanying drawings. A brief description of the accompanying drawings is as follows:
图1显示的是本申请所述方法进行机器学习模型所用的训练集,以及验证本申请所述的区分体细胞突变和种系突变的方法能够区分体细胞突变和种系突变所需的验证集的情况。Figure 1 shows the training set used to train the machine learning model in the method described in this application, and the validation sets used to verify that the method described in this application can distinguish somatic mutations from germline mutations.
图2显示的是利用本申请所述方法获得的机器学习模型的机器训练结果。Figure 2 shows the machine training results of the machine learning model obtained using the method described in this application.
图3显示的是本申请所述方法获得的机器学习模型在验证集1中区分体细胞突变和种系突变的情况。Figure 3 shows how the machine learning model obtained by the method described in this application distinguishes between somatic mutations and germline mutations in validation set 1.
图4显示的是本申请所述方法获得的机器学习模型在验证集2中区分体细胞突变和种系突变的情况。Figure 4 shows how the machine learning model obtained by the method described in this application distinguishes between somatic mutations and germline mutations in validation set 2.
图5显示的是利用本申请所述的方法可以针对不同的瘤种区分体细胞突变和种系突变。Figure 5 shows that using the method described in this application can distinguish between somatic mutations and germline mutations for different tumor types.
图6显示的是本申请所述方法区分体细胞突变和种系突变的AUC结果。Figure 6 shows the AUC results of the method described in this application for distinguishing somatic mutations from germline mutations.
图7显示的是本申请所述方法区分体细胞突变和种系突变的AUC结果。Figure 7 shows the AUC results of the method described in this application for distinguishing somatic mutations from germline mutations.
图8显示的是针对一个突变位点的所述野生型支持片段和所述突变型支持片段的长度的分布情况。Figure 8 shows the distribution of the lengths of the wild-type supporting fragment and the mutant supporting fragment for a mutation site.
图9显示的是针对一个突变位点的所述野生型支持片段和所述突变型支持片段的长度的分布情况。Figure 9 shows the distribution of the lengths of the wild-type supporting fragment and the mutant supporting fragment for a mutation site.
图10显示的是针对一个突变位点的所述野生型支持片段和所述突变型支持片段的长度的分布情况。Figure 10 shows the distribution of the lengths of the wild-type supporting fragment and the mutant supporting fragment for a mutation site.
具体实施方式Detailed Description
以下由特定的具体实施例说明本申请发明的实施方式,熟悉此技术的人士可由本说明书所公开的内容容易地了解本申请发明的其他优点及效果。The implementation of the invention of the present application will be described in the following specific examples, and those skilled in the art can easily understand other advantages and effects of the invention of the present application from the content disclosed in this specification.
术语定义Definition of Terms
在本申请中,术语“体细胞突变”通常是指发生在非胚胎细胞中的后天获得的一类突变。在本申请中,所述体细胞突变可以包括在体细胞组织(例如,种系外的细胞)中发生的遗传改变。在本申请中,所述体细胞突变可以包括点突变(例如,单个核苷酸与另一个核苷酸的交换(例如,沉默突变、错义突变和无义突变))、插入和缺失(例如,添加和/或移除一个或多个核苷酸(例如,插入缺失))、扩增、基因重复、拷贝数改变(CNA)、重排和剪接变体。所述体细胞突变可以与细胞的生长,编程,衰老和凋亡过程密切相关。例如,所述体细胞突变可以与肿瘤发生中信号通路改变,血管生成和/或肿瘤的转移相关。In this application, the term "somatic mutation" generally refers to a class of acquired mutations that occur in non-embryonic cells. In the present application, the somatic mutation may include a genetic change occurring in somatic tissue (for example, cells outside the germline). In this application, the somatic mutation may include point mutations (for example, the exchange of a single nucleotide for another nucleotide (for example, silent mutations, missense mutations, and nonsense mutations)), insertions and deletions (for example, the addition and/or removal of one or more nucleotides (for example, indels)), amplifications, gene duplications, copy number alterations (CNAs), rearrangements, and splice variants. The somatic mutation may be closely related to the processes of cell growth, programming, senescence and apoptosis. For example, the somatic mutation may be associated with altered signaling pathways in tumorigenesis, angiogenesis and/or tumor metastasis.
在本申请中,术语“种系突变”通常是指发生在生殖细胞(例如卵子或精子)可遗传的突变。所述种系突变可以传给后代,例如可以被纳入后代体内的每个细胞(例如种系细胞和体细胞)的DNA中。所述种系突变可以与肿瘤的发生关联性不大。例如,所述种系突变可以作为TMB分析中的“基线”。In this application, the term "germline mutation" generally refers to a heritable mutation that occurs in a germ cell (eg, egg or sperm). The germline mutation can be passed on to progeny, eg, can be incorporated into the DNA of every cell (eg, germline and somatic) in the progeny. The germline mutation may be less associated with tumorigenesis. For example, the germline mutation can serve as a "baseline" in TMB analysis.
在本申请中,术语“基因测序”通常是指用于确定DNA分子中核苷酸碱基腺嘌呤,鸟嘌呤,胞嘧啶和胸腺嘧啶的顺序的技术。所述基因测序可以包括一代基因测序、二代基因测序、三代基因测序或单分子测序(SMS)。二代或下一代基因测序可以是指在产生许多序列的同时使用先进技术(光学)检测碱基位置方法的技术(例如可以参见Metzker,2009的综述)。术语“二代基因测序”或者“下一代测序”(Next-generation sequencing,NGS),是一种高通量测序技术(High-throughput sequencing),可以一次并行对几十万到几百万条DNA分子进行序列测定,一般读长较短。根据发展历史、影响力、测序原理和技术不同等,主要有以下几种:大规模平行签名测序(Massively Parallel Signature Sequencing,MPSS)、聚合酶克隆(Polony Sequencing)、454焦磷酸测序(454pyro sequencing)、Illumina(Solexa)sequencing、离子半导体测序(Ion semi conductor sequencing)、DNA纳米球测序(DNA nano-ball sequencing)、Complete Genomics的DNA纳米阵列与组合探针锚定连接测序法等。所述二代基因测序可以使对一个物种的转录组和基因组进行细致全貌的分析成为可能,所以又被称为深度测序(deep sequencing)。In this application, the term "gene sequencing" generally refers to techniques used to determine the order of the nucleotide bases adenine, guanine, cytosine and thymine in a DNA molecule. The gene sequencing may include first-generation gene sequencing, second-generation gene sequencing, third-generation gene sequencing or single-molecule sequencing (SMS). Second-generation or next-generation gene sequencing may refer to techniques that use advanced (optical) methods to detect base positions while generating many sequences (see, for example, the review by Metzker, 2009). The term "second-generation gene sequencing" or "next-generation sequencing" (NGS) refers to a high-throughput sequencing technology that can sequence hundreds of thousands to millions of DNA molecules in parallel at a time, generally with short read lengths. 
Depending on development history, influence, sequencing principle and technology, the main types include: massively parallel signature sequencing (MPSS), polony sequencing, 454 pyrosequencing, Illumina (Solexa) sequencing, ion semiconductor sequencing, DNA nanoball sequencing, and the DNA nanoarray and combinatorial probe-anchor ligation sequencing method of Complete Genomics. Second-generation gene sequencing makes it possible to analyze the transcriptome and genome of a species comprehensively and in detail, so it is also called deep sequencing.
在本申请中,术语“突变位点”通常是指与对照序列的核苷酸序列相比存在差异的核苷 酸所在的位点。例如,所述对照序列可以为基因测序中使用的参照序列(例如可以为人类参考基因组)。在本申请中,所述突变位点可以包括至少1个(例如,1个、2个、3个、4个或更多个)位点处的核苷酸序列的不同(例如,所述不同可以包括核苷酸取代、重复、缺失和/或增加)。例如,所述突变位点可以包括至少1个核苷酸位点处发生核苷酸突变。所述核苷酸突变可以为自然突变,也可以为人工突变。所述突变位点可以包括单核苷酸变异(SNV)。In the present application, the term "mutation site" generally refers to the site where there is a nucleotide difference compared with the nucleotide sequence of the control sequence. For example, the control sequence may be a reference sequence used in gene sequencing (for example, it may be a human reference genome). In the present application, the mutation site may include at least one (for example, 1, 2, 3, 4 or more) difference in the nucleotide sequence at the site (for example, the difference Nucleotide substitutions, duplications, deletions and/or additions may be included). For example, the mutation site may include a nucleotide mutation at at least one nucleotide position. The nucleotide mutation can be a natural mutation or an artificial mutation. The mutation site may comprise a single nucleotide variation (SNV).
在本申请中,术语“野生型碱基序列”通常是指与参考基因组(例如可以为人类参考基因组)在所述突变位点的对应位置处的核苷酸序列相比相同的序列。在某些情况下,所述野生型碱基序列可以为人类参考基因组在所述突变位点的对应位置处的核苷酸序列。在某些情况下,针对某一个特定的本申请所述的突变位点,所述野生型碱基序列可以不包含所述的突变位点。In the present application, the term "wild-type base sequence" generally refers to the same sequence compared with the nucleotide sequence at the corresponding position of the mutation site in a reference genome (for example, a human reference genome). In some cases, the wild-type base sequence may be the nucleotide sequence at the corresponding position of the mutation site in the human reference genome. In some cases, for a specific mutation site described in this application, the wild-type base sequence may not contain the mutation site.
在本申请中,术语“突变型碱基序列”通常是指与参考基因组(例如可以为人类参考基因组)在所述突变位点的对应位置处的核苷酸序列相比不同的序列。在某些情况下,针对某一个特定的本申请所述的突变位点,所述突变型碱基序列可以包含所述的突变位点。In the present application, the term "mutant base sequence" generally refers to a sequence that is different from the nucleotide sequence at the corresponding position of the mutation site in a reference genome (for example, it may be a human reference genome). In some cases, for a specific mutation site described in this application, the mutant base sequence may contain the mutation site.
在本申请中,术语“野生型支持片段”通常是指包含本申请所述野生型碱基序列的cfDNA片段。在本申请中,针对某一个特定的本申请所述的突变位点,所述野生型支持片段可以有不同的序列长度。在某些情况下,针对某一个特定的本申请所述的突变位点,所述野生型支持片段可以不包含所述的突变位点。在某些情况下,针对某一个特定的本申请所述的突变位点,所述野生型支持片段可以不包含所述的突变位点,然而针对另一个其他的本申请所述的突变位点,所述野生型支持片段可以包含,也可以不包含所述另一个其他的突变位点。术语“野生型支持片段的长度”,指的是本申请所述野生型支持片段的长度,单位是“核苷酸”的个数。In this application, the term "wild-type supporting fragment" generally refers to a cfDNA fragment comprising the wild-type base sequence described in this application. In this application, for a specific mutation site described in this application, the wild-type supporting fragments may have different sequence lengths. In some cases, for a specific mutation site described in this application, the wild-type supporting fragment may not contain the mutation site. In some cases, for a specific mutation site described in the application, the wild-type support fragment may not contain the mutation site, but for another mutation site described in the application , the wild-type supporting fragment may or may not contain the other mutation site. The term "the length of the wild-type supporting fragment" refers to the length of the wild-type supporting fragment described in this application, and the unit is the number of "nucleotides".
在本申请中,术语“突变型支持片段”通常是指包含本申请所述突变型碱基序列的cfDNA片段。在某些情况下,针对某一个特定的本申请所述的突变位点,所述突变型支持片段可以包含所述的突变位点。在某些情况下,这对一个特定的本申请所述的突变位点,所述突变型支持片段可以包含所述的突变位点,然而针对另一个其他的本申请所述的突变位点,所述突变型支持片段可以包含,也可以不包含所述另一个其他的突变位点。术语“突变型支持片段的长度”,指的是本申请所述突变型支持片段的长度,单位是“核苷酸”的个数。In the present application, the term "mutant supporting fragment" generally refers to a cfDNA fragment comprising the mutated base sequence described in the present application. In some cases, for a specific mutation site described in this application, the mutant support fragment may contain the mutation site. In some cases, for a specific mutation site described in the present application, the mutant support fragment may contain the mutation site, but for another mutation site described in the application, The mutant supporting fragment may or may not contain the other mutation site. The term "length of the mutant supporting fragment" refers to the length of the mutant supporting fragment described in this application, and the unit is the number of "nucleotides".
在本申请中,术语“人类参考基因组”通常是指可以在基因测序中发挥参照功能的人类基因组。所述人类参考基因组的信息可以参考UCSC(http://genome.ucsc.edu/index.html)。所述人类参考基因组可以有不同的版本,例如,可以为hg19、GRCh37或Ensembl 75。In this application, the term "human reference genome" generally refers to a human genome that can serve as a reference in gene sequencing. For information on the human reference genome, see UCSC (http://genome.ucsc.edu/index.html). The human reference genome may have different versions, for example, hg19, GRCh37 or Ensembl 75.
在本申请中,术语“对应位置处”通常是指针对至少一个的特定碱基在一个序列中的位置,另一个序列中所述特定碱基在该序列中的位置。例如,所述对应位置处可以为针对本申请所述野生型碱基序列或所述突变型碱基序列中所述突变位点处的核苷酸位置,本申请所述参考基因组中所述突变位点的位置。例如,在所述突变型碱基序列中所述突变位点为第100位核苷酸,则所述参考基因组中的所述对应位置处可以为在所述参考基因组中对应序列的第100位。In the present application, the term "corresponding position" generally refers to the position in one sequence that corresponds to the position of at least one specific base in another sequence. For example, for the nucleotide position of the mutation site in the wild-type base sequence or the mutant base sequence described in the application, the corresponding position may be the position of the mutation site in the reference genome described in the application. For example, if the mutation site is the 100th nucleotide in the mutant base sequence, the corresponding position in the reference genome may be the 100th position of the corresponding sequence in the reference genome.
在本申请中,术语“cfDNA”通常是指Cell free DNA的缩写,可以指血浆游离DNA。例如,所述cfDNA可以是位于外周循环中的细胞外的DNA片段。In this application, the term "cfDNA" usually refers to the abbreviation of Cell free DNA, and may refer to plasma free DNA. For example, the cfDNA can be an extracellular DNA fragment located in the peripheral circulation.
在本申请中,术语“ctDNA”通常是指循环肿瘤DNA。ctDNA是血液中与细胞无关的肿瘤来源的片段DNA。所述ctDNA可以由凋亡或坏死的肿瘤细胞中的基因组进入血液而产生。所述ctDNA可以携带有原发瘤或转移瘤特定的基因特征。所述ctDNA可以为认为是一种特殊的所述cfDNA。In this application, the term "ctDNA" generally refers to circulating tumor DNA. ctDNA is a fragment of tumor-derived DNA that is not associated with cells in the blood. The ctDNA can be produced by the entry of genomes in apoptotic or necrotic tumor cells into the blood. The ctDNA may carry specific gene characteristics of primary tumor or metastatic tumor. The ctDNA can be considered as a special kind of the cfDNA.
在本申请中,术语“机器学习模型”通常是指被配置为实现算法、过程或数学模型的系统或程序指令和/或数据的集合。在本申请中,所述算法、过程或数学模型可以基于给定的输入来预测和提供期望的输出。在本申请中,所述机器学习模型的参数可以没有被明确地编程,并且在传统意义上,所述机器学习模型可以没有被明确地设计成遵循特定的规则以便为给定的输入提供期望的输出。例如,所述机器学习模型的使用可以意味着机器学习模型和/或作为机器学习模型的数据结构/一组规则是由机器学习算法训练的。In this application, the term "machine learning model" generally refers to a system or collection of program instructions and/or data configured to implement an algorithm, process, or mathematical model. In the present application, the algorithm, process or mathematical model can predict and provide a desired output based on a given input. In the present application, the parameters of the machine learning model may not be explicitly programmed, and in the traditional sense, the machine learning model may not be explicitly designed to follow specific rules in order to provide the desired output. For example, the use of the machine learning model may mean that the machine learning model and/or the data structure/set of rules being the machine learning model are trained by a machine learning algorithm.
在本申请中,术语“数据库”通常是指相关数据的有组织实体,而不管数据或有组织实体的表示方式。例如,所述相关数据的有组织实体可以采取表、映射、网格、分组、数据报、文件、文档、列表的形式或任何其他形式。在本申请中,所述数据库可以包括以计算机可存取的方式来收集并保存的任何数据。In this application, the term "database" generally refers to an organized entity of related data, regardless of the manner in which the data or the organized entity is represented. For example, the organized entity of related data may take the form of a table, map, grid, group, datagram, file, document, list, or any other form. In this application, the database may include any data collected and stored in a computer-accessible manner.
在本申请中,术语“单核苷酸变异(SNV)”通常是指在基因组中特定位置处发生的单核苷酸中的变异,其中所述特定位置与参比基因组(例如本申请所述的人类参考基因组)中对应位置处的核苷酸不同(例如,取代、重复、缺失或添加一个核苷酸)。In this application, the term "single nucleotide variation (SNV)" generally refers to a variation in a single nucleotide occurring at a specific position in a genome, where the nucleotide at that position differs from the nucleotide at the corresponding position in a reference genome (for example, the human reference genome described in this application), for example by substitution, duplication, deletion or addition of one nucleotide.
在本申请中,术语“平滑化处理”通常是指使一个以上的本申请所述的差值之间的偏差减小的数据处理的方法。例如,所述平滑化处理可以包括获得一定数量的本申请所述差值的平均值。例如,所述平滑化处理可以包括根据一定的间隔长度(例如,可以为本申请所述的平滑化窗口值),选择不同长度(例如,可以为本申请所述的平滑化取样长度)所对应的所述野生型支持片段和/或所述突变型支持片段的数量,计算两者数量分别与所述野生型支持片段的总数量的比值和与所述突变型支持片段的总数量的比值的差值。例如,所述平滑化处理可以包括将一定长度范围内,所述差值的累加值再除以间隔长度以获得比值。例如,所述比值可以被认为是该长度范围的所述差值的平均差值。In this application, the term "smoothing" generally refers to a data processing method that reduces the deviation among more than one of the differences described herein. For example, the smoothing may include obtaining the average of a certain number of the differences described in this application. For example, the smoothing may include, according to a certain interval length (for example, the smoothing window value described in this application), selecting the numbers of wild-type supporting fragments and/or mutant supporting fragments corresponding to different lengths (for example, the smoothing sampling lengths described in this application), and calculating the difference between the ratio of the former to the total number of wild-type supporting fragments and the ratio of the latter to the total number of mutant supporting fragments. For example, the smoothing may include dividing the accumulated value of the differences within a certain length range by the interval length to obtain a ratio. For example, the ratio may be regarded as the average difference of the differences over that length range.
在本申请中,术语“平滑化窗口值”通常是指在本申请所述的平滑化处理中,所选择的不同长度的所述野生型支持片段和/或所述突变型支持片段所间隔的核苷酸的长度值。例如,在所述的平滑化处理中,所选择的所述野生型支持片段和/或所述突变型支持片段的长度可以依次为1、4、7、10、13……个核苷酸,则所述平滑化窗口值可以为3。所述平滑化窗口值可以为约1-30中的整数,例如,可以为1、2、3、4、5、6、7、8、9或10。例如,可以为1、2、3、4、5或6。In the present application, the term "smoothing window value" generally refers to the interval between the selected wild-type support fragments and/or mutant support fragments of different lengths in the smoothing process described in the present application. Nucleotide length value. For example, in the smoothing process, the length of the selected wild-type support fragment and/or the mutant support fragment can be 1, 4, 7, 10, 13 ... nucleotides in sequence, Then the smoothing window value may be 3. The smoothing window value may be an integer of about 1-30, for example, may be 1, 2, 3, 4, 5, 6, 7, 8, 9 or 10. For example, it can be 1, 2, 3, 4, 5 or 6.
在本申请中,术语“平滑化取样长度”通常是指在本申请所述的平滑化处理中,所选择以进行计数的所述野生型支持片段的长度值,和/或,所选择以进行计数的所述突变型支持片段的长度值。例如,所述平滑化取样长度,可以为在本申请所述野生型支持片段和/或所述突变型支持片段的长度的范围内,平滑化取样长度范围内的各个支持片段的长度值。例如,在每一个平滑化取样长度范围内,可以自起始长度(例如,可以从长度为1个核苷酸起),至该平滑化取样长度范围的最大值(例如可以为起始长度+(平滑化窗口值-1)),其中各个支持片段的长度值。例如,如果所述平滑化窗口值可以为3,如果所述起始长度为1个核苷酸,则所述平滑化取样长度范围可以为1-3、4-6、7-9……;例如,如果所述平滑化窗口值可以为3,如果所述起始长度为1个核苷酸,则所述平滑化取样长度范围也可以为1-3、2-4、3-5……。在本申请中,所述起始长度也可以为除1以外的其他长度(例如,可以从长度为2个核苷酸起)。例如,如果所述起始长度为2个核苷酸,则所述平滑化取样长度范围可以为2-4、5-7、8-10……;例如,如果所述平滑化窗口值可以为3,如果所述起始长度为2个核苷酸,则所述平滑化取样长度范围也可以为2-4、3-5、4-6……。In the present application, the term "smoothing sampling length" generally refers to a length value of the wild-type supporting fragments selected for counting, and/or a length value of the mutant supporting fragments selected for counting, in the smoothing described in the present application. For example, the smoothing sampling lengths may be the individual fragment length values within a smoothing sampling length range, which itself lies within the range of lengths of the wild-type supporting fragments and/or the mutant supporting fragments described in the present application. For example, within each smoothing sampling length range, they may be the fragment length values from the starting length (for example, starting from a length of 1 nucleotide) to the maximum of that range (for example, the starting length + (smoothing window value - 1)). For example, if the smoothing window value is 3 and the starting length is 1 nucleotide, the smoothing sampling length ranges may be 1-3, 4-6, 7-9, ...; alternatively, with the same window value and starting length, the smoothing sampling length ranges may be 1-3, 2-4, 3-5, .... In the present application, the starting length may also be a length other than 1 (for example, starting from a length of 2 nucleotides). 
For example, if the smoothing window value is 3 and the starting length is 2 nucleotides, the smoothing sampling length ranges may be 2-4, 5-7, 8-10, ...; alternatively, with the same window value and starting length, the smoothing sampling length ranges may be 2-4, 3-5, 4-6, ....
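As a non-limiting illustration, the two ways of laying out smoothing sampling length ranges described above (consecutive ranges such as 1-3, 4-6, 7-9 and sliding ranges such as 1-3, 2-4, 3-5) can be generated in Python as follows. The function name `sampling_ranges` and its parameters are assumptions made for this sketch.

```python
def sampling_ranges(start, max_len, window=3, overlapping=False):
    """List (first_length, last_length) pairs for each sampling range.

    Each range spans `window` lengths, so its last length is
    first_length + (window - 1). Ranges that would extend past
    max_len are omitted.
    """
    step = 1 if overlapping else window
    return [
        (first, first + window - 1)
        for first in range(start, max_len - window + 2, step)
    ]


consecutive_from_1 = sampling_ranges(1, 9)                 # 1-3, 4-6, 7-9
sliding_from_1 = sampling_ranges(1, 5, overlapping=True)   # 1-3, 2-4, 3-5
consecutive_from_2 = sampling_ranges(2, 10)                # 2-4, 5-7, 8-10
```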
在本申请中,术语“第一分布”通常是指本申请所述的各个平滑化取样长度范围的平均差值的分布。在某些情况下,所述第一分布可以为各个本申请所述的平均差值的集合。In the present application, the term "first distribution" generally refers to the distribution of the average difference of each smoothed sampling length range described in the present application. In some cases, the first distribution may be a collection of average differences described in the present application.
在本申请中,术语“缠绕核小体的核酸序列的长度”通常是指一个核酸序列缠绕核小体所需要的长度。例如,所述核酸序列可以以一定的倍数(例如,可以缠绕一倍以内,或者,可以缠绕2倍以上)缠绕核小体。In the present application, the term "the length of a nucleic acid sequence wound around a nucleosome" generally refers to the length a nucleic acid sequence needs in order to wind around a nucleosome. For example, the nucleic acid sequence may wind around the nucleosome a certain number of turns (for example, no more than one turn, or two or more turns).
在本申请中,术语“有效片段区间的长度”通常是指计算本申请所述加成值所需的所述野生型支持片段和/或所述突变型支持片段所对应的长度的范围。In the present application, the term "the length of the effective fragment interval" generally refers to the range of lengths of the wild-type supporting fragments and/or the mutant supporting fragments that is required for calculating the added value described in the present application.
在本申请中,术语“第二分布”通常是指本申请所述的加成值的分布。在某些情况下,所述第二分布可以为各个本申请所述的加成值的集合。In the present application, the term "second distribution" generally refers to the distribution of the added values described in the present application. In some cases, the second distribution may be the set of the added values described in this application.
In the present application, the term "calculation module" generally refers to a functional module for calculating the difference between the number of wild-type supporting fragments and the number of mutant supporting fragments of the same length. The calculation module may take as input the number of wild-type supporting fragments and the number of mutant supporting fragments of the corresponding same length. The calculation module may output the difference described in the present application, for example Dev(Max). The smoothing process described in the present application may be performed in the calculation module.
In the present application, the term "judgment module" generally refers to a functional module for obtaining a judgment result from a machine learning model that has undergone machine learning training (for example, the judgment result may include the result of identifying a somatic mutation and/or the result of identifying ctDNA in cfDNA, as described in the present application). The judgment module may take as input the difference described in the present application (for example Dev(Max)) and may output the judgment result. Within the judgment module, the judgment may be made by means of the machine learning model.
In the present application, the term "training module" generally refers to a functional module for inputting the difference described in the present application (for example Dev(Max)) into the machine learning model as a training sample in order to perform machine learning training. "Machine learning" may refer to an artificial intelligence system configured to learn from data without being explicitly programmed. The "machine learning model" may be a collection of parameters and functions, whose parameters can be trained on a set of training samples. The parameters and functions may be a collection of linear algebraic operations, nonlinear algebraic operations and tensor algebraic operations, and may include statistical functions, tests and probabilistic models. The training module may take as input the difference described in the present application (for example Dev(Max)) and may output a machine learning model that has undergone machine learning training.
In the present application, the term "output module" generally refers to a functional module for displaying the result of identifying the somatic mutation and/or the result of identifying ctDNA in cfDNA produced by the judgment module. For example, the output module may include a display, which may show these results (for example, as charts and/or text).
In the present application, the term "sample obtaining module" generally refers to a functional module for obtaining the sample of a subject. For example, the sample obtaining module may include the reagents and/or instruments needed to obtain the sample (for example, a blood sample), such as blood collection needles, blood collection tubes and/or blood sample transport boxes. The sample obtaining module may output the sample described in the present application.
In the present application, the term "data receiving module" generally refers to a functional module for obtaining the mutation site in the sample. The data receiving module may take as input the sample described in the present application (for example, a blood sample) and may output the mutation site. The data receiving module may detect the mutation sites of the sample. For example, it may perform the gene sequencing described in the present application (for example, next-generation sequencing) on the sample, and may include the reagents and/or instruments required for that sequencing. The data receiving module may detect the single nucleotide variation.
In the present application, the term "input module" generally refers to a functional module for obtaining the number of wild-type supporting fragments of at least one length and/or the number of mutant supporting fragments of the corresponding same length. The input module may take as input the mutation site described in the present application. The input module may output (for example, display) the number of wild-type supporting fragments of the at least one length and/or the number of mutant supporting fragments of the corresponding same length. The input module may include reagents and/or instruments capable of counting wild-type supporting fragments of a specified length, and reagents and/or instruments capable of counting mutant supporting fragments of a specified length. The input module may identify the lengths of the wild-type supporting fragments and count them by length, and may identify the lengths of the mutant supporting fragments and count them by length. The input module may determine whether the length of a wild-type supporting fragment is the same as the length of a mutant supporting fragment.
In the present application, the term "cancer family management" generally refers to helping patients with familial hereditary tumors, their relatives and/or high-risk individuals with tumor-related matters. For example, cancer family management may include providing genetic counseling to these people, testing tumor-related genes and interpreting the results, assessing the risk of developing a tumor, and advising on and/or implementing preventive interventions.
In the present application, the term "tumor mutation burden (TMB)" follows the definition in the "Chinese Expert Consensus on Tumor Mutation Burden Detection and Clinical Application (2020 Edition)": TMB generally refers to the number of somatic non-synonymous mutations per megabase (Mb) within a specific genomic region, usually expressed as the number of mutations per megabase (XX mutations/Mb). TMB can serve as a biomarker associated with response to immunotherapy. TMB can indirectly reflect the ability and extent of a tumor to produce neoantigens and has been shown to predict response to immunotherapy. For example, the NSCLC guidelines, 2019 Version 1, indicate that TMB is used to identify lung cancer patients suitable for "Nivolumab + Ipilimumab" combination immunotherapy and for "Nivolumab" monotherapy. The TMB level may be related to multiple factors, such as microsatellite instability (MSI-H) and the presence of certain driver genes.
In the present application, the term "comprising" generally means including the explicitly specified features without excluding other elements.
In the present application, the term "about" generally refers to variation within a range of 0.5%-10% above or below the specified value, for example within 0.5%, 1%, 1.5%, 2%, 2.5%, 3%, 3.5%, 4%, 4.5%, 5%, 5.5%, 6%, 6.5%, 7%, 7.5%, 8%, 8.5%, 9%, 9.5% or 10% above or below the specified value.
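As a worked illustration of the TMB definition above, the value can be obtained by dividing the count of somatic non-synonymous mutations by the size of the assessed region in megabases. The function name and the example numbers below are hypothetical and are not taken from the application.

```python
def tumor_mutation_burden(somatic_nonsyn_count: int, region_size_bp: int) -> float:
    """Return the number of mutations per megabase (Mb) for the assessed region."""
    if region_size_bp <= 0:
        raise ValueError("region_size_bp must be positive")
    # Convert the region size from base pairs to megabases before dividing.
    return somatic_nonsyn_count / (region_size_bp / 1_000_000)


# Hypothetical example: 12 qualifying mutations observed over a 1.5 Mb region.
example_tmb = tumor_mutation_burden(12, 1_500_000)
```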
Detailed description of the invention
Method
In one aspect, the present application provides a method for distinguishing a somatic mutation from a germline mutation, comprising the following steps:
(1) obtaining at least one mutation site derived from a sample of a subject, wherein the mutation site is obtained by gene sequencing;
(2) for each mutation site, obtaining wild-type supporting fragments and mutant supporting fragments, wherein a wild-type supporting fragment is a cfDNA fragment comprising a wild-type base sequence and a mutant supporting fragment is a cfDNA fragment comprising a mutant base sequence; the wild-type base sequence is a sequence identical to the nucleotide sequence of the reference genome at the position corresponding to the mutation site; the mutant base sequence is a sequence different from the nucleotide sequence of the reference genome at the position corresponding to the mutation site; and the reference genome is the human reference genome used in the gene sequencing;
(3) for each mutation site, obtaining the number of wild-type supporting fragments of at least one length and the number of mutant supporting fragments of the same length; calculating the ratio WC of the number of wild-type supporting fragments of that length to the total number of wild-type supporting fragments; calculating the ratio MC of the number of mutant supporting fragments of the same length to the total number of mutant supporting fragments; and calculating the difference between the ratio WC and the ratio MC at the same length;
(4) using the difference as an indicator for distinguishing whether the mutation site is a somatic mutation or a germline mutation.
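Step (3) can be sketched in a few lines of Python. This is a minimal illustration rather than the applicant's implementation: it assumes the fragment lengths for one mutation site have already been separated into wild-type and mutant groups, and the function name and the example lengths are hypothetical.

```python
from collections import Counter


def length_ratio_differences(wild_lengths, mutant_lengths):
    """Return {length: WC - MC} for every fragment length seen at one mutation site."""
    if not wild_lengths or not mutant_lengths:
        raise ValueError("both fragment groups must be non-empty")
    wild_counts = Counter(wild_lengths)
    mutant_counts = Counter(mutant_lengths)
    w_total = len(wild_lengths)    # total number of wild-type supporting fragments
    m_total = len(mutant_lengths)  # total number of mutant supporting fragments
    diffs = {}
    for length in sorted(set(wild_counts) | set(mutant_counts)):
        wc = wild_counts[length] / w_total    # ratio WC at this length
        mc = mutant_counts[length] / m_total  # ratio MC at this length
        diffs[length] = wc - mc
    return diffs


# Hypothetical fragment lengths (in nucleotides) for a single mutation site.
wild = [166, 166, 167, 150]
mutant = [150, 150, 166, 140]
diffs = length_ratio_differences(wild, mutant)
```

A length observed in only one group contributes a ratio of 0 for the other group, because `Counter` returns 0 for missing keys.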
In one aspect, the present application provides a method for identifying ctDNA in cfDNA, comprising the following steps:
(1) obtaining at least one mutation site derived from a sample of a subject, wherein the mutation site is obtained by gene sequencing;
(2) for each mutation site, obtaining wild-type supporting fragments and mutant supporting fragments, wherein a wild-type supporting fragment is a cfDNA fragment comprising a wild-type base sequence and a mutant supporting fragment is a cfDNA fragment comprising a mutant base sequence; the wild-type base sequence is a sequence identical to the nucleotide sequence of the reference genome at the position corresponding to the mutation site; the mutant base sequence is a sequence different from the nucleotide sequence of the reference genome at the position corresponding to the mutation site; and the reference genome is the human reference genome used in the gene sequencing;
(3) for each mutation site, obtaining the number of wild-type supporting fragments of at least one length and the number of mutant supporting fragments of the same length; calculating the ratio WC of the number of wild-type supporting fragments of that length to the total number of wild-type supporting fragments; calculating the ratio MC of the number of mutant supporting fragments of the same length to the total number of mutant supporting fragments; and calculating the difference between the ratio WC and the ratio MC at the same length;
(4) using the difference as an indicator for identifying whether the mutation site is located on ctDNA.
In one aspect, the present application provides a method for training a machine learning model, comprising the following steps:
(1) obtaining at least one mutation site derived from a sample of a subject, wherein the mutation site is obtained by gene sequencing;
(2) for each mutation site, obtaining wild-type supporting fragments and mutant supporting fragments, wherein a wild-type supporting fragment is a cfDNA fragment comprising a wild-type base sequence and a mutant supporting fragment is a cfDNA fragment comprising a mutant base sequence; the wild-type base sequence is a sequence identical to the nucleotide sequence of the reference genome at the position corresponding to the mutation site; the mutant base sequence is a sequence different from the nucleotide sequence of the reference genome at the position corresponding to the mutation site; and the reference genome is the human reference genome used in the gene sequencing;
(3) for each mutation site, obtaining the number of wild-type supporting fragments of at least one length and the number of mutant supporting fragments of the same length; calculating the ratio WC of the number of wild-type supporting fragments of that length to the total number of wild-type supporting fragments; calculating the ratio MC of the number of mutant supporting fragments of the same length to the total number of mutant supporting fragments; and calculating the difference between the ratio WC and the ratio MC at the same length;
(4) inputting the difference into the machine learning model as a training indicator in order to perform machine learning training.
In one aspect, the present application provides a method for establishing a database, comprising the following steps:
(1) obtaining at least one mutation site derived from a sample of a subject, wherein the mutation site is obtained by gene sequencing;
(2) for each mutation site, obtaining wild-type supporting fragments and mutant supporting fragments, wherein a wild-type supporting fragment is a cfDNA fragment comprising a wild-type base sequence and a mutant supporting fragment is a cfDNA fragment comprising a mutant base sequence; the wild-type base sequence is a sequence identical to the nucleotide sequence of the reference genome at the position corresponding to the mutation site; the mutant base sequence is a sequence different from the nucleotide sequence of the reference genome at the position corresponding to the mutation site; and the reference genome is the human reference genome used in the gene sequencing;
(3) for each mutation site, obtaining the number of wild-type supporting fragments of at least one length and the number of mutant supporting fragments of the same length; calculating the ratio WC of the number of wild-type supporting fragments of that length to the total number of wild-type supporting fragments; calculating the ratio MC of the number of mutant supporting fragments of the same length to the total number of mutant supporting fragments; and calculating the difference between the ratio WC and the ratio MC at the same length;
(4) storing the difference in a database, for use in distinguishing somatic mutations from germline mutations and/or identifying ctDNA in cfDNA.
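Step (4) of the database method can be sketched with Python's built-in `sqlite3` module. The application does not specify a storage engine or schema, so the table name `length_diff`, its columns and the site identifier `site_A` are all hypothetical choices made only for illustration.

```python
import sqlite3


def store_differences(conn, site_id, diffs):
    """Persist {fragment_length: WC - MC} for one mutation site (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS length_diff ("
        "site_id TEXT NOT NULL, "
        "fragment_length INTEGER NOT NULL, "
        "diff REAL NOT NULL, "
        "PRIMARY KEY (site_id, fragment_length))"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO length_diff VALUES (?, ?, ?)",
        [(site_id, length, diff) for length, diff in diffs.items()],
    )
    conn.commit()


# In-memory database and made-up values, used only to exercise the sketch.
conn = sqlite3.connect(":memory:")
store_differences(conn, "site_A", {150: -0.25, 166: 0.25})
rows = conn.execute(
    "SELECT fragment_length, diff FROM length_diff "
    "WHERE site_id = ? ORDER BY fragment_length",
    ("site_A",),
).fetchall()
```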
In the present application, the gene sequencing may include next-generation sequencing (NGS). The NGS may be selected from the group consisting of Solexa sequencing, 454 sequencing, SOLiD sequencing, the Complete Genomics sequencing method and semiconductor (Ion Torrent) sequencing. The gene sequencing may be high-throughput; for example, hundreds of thousands to millions of DNA molecules may be sequenced in a single run. The gene sequencing may be short-read; for example, the NGS read length may be no more than 500 bp.
In the present application, the gene sequencing may include the following steps: (1) library construction, which may include, for example, modifying the ends of the DNA molecules, adding adapters (for example, forming Y-shaped adapters) and then performing PCR amplification; (2) sequencing, which may include, for example, DNA replication using oligonucleotides as primers and library fragments as templates, followed by "bridge" amplification and sequencing by synthesis. The Index sequencing primer is then added to read the Index sequence in the adapter, thereby determining which library the DNA at each position belongs to.
In the present application, the method may use only a sample derived from the subject and may not require a paired sample. The method described in the present application can therefore greatly reduce the sample requirements placed on the subject.
In the present application, the sample may include a blood sample.
In the present application, the method may further include the step of obtaining a sample derived from the subject. For example, it may include the step of obtaining a blood sample from the subject using a blood collection needle system. The method for obtaining the sample may include blood collection with vacuum blood collection tubes.
In the present application, the mutation site may include a single nucleotide variation (SNV). The mutation site may also contain two or more nucleotide variations. For example, the mutation site may include one SNV, or two or more SNVs (for example, 2, 3, 4, 5, 6, 7, 8, 9, 10 or more). For a specific mutation site, the wild-type supporting fragment and the mutant supporting fragment differ in the nucleotide sequence at the position of that mutation site. The mutation site may include a nucleotide substitution and, in some cases, may also include a nucleotide deletion and/or insertion. In the present application, the mutation site may include a nucleotide substitution.
In the present application, fragments may be classified as wild-type supporting fragments and/or mutant supporting fragments with respect to one specific mutation site. For example, if the nucleotide sequence of a fragment at the mutation site is identical to the nucleotide sequence of the reference genome at the corresponding position, the fragment may be regarded as a wild-type supporting fragment for that mutation site; if the nucleotide sequence of a fragment at the mutation site differs from the nucleotide sequence of the reference genome at the corresponding position, the fragment may be regarded as a mutant supporting fragment for that mutation site.
In the present application, the length of the wild-type supporting fragments and/or the mutant supporting fragments may range from about 1 nucleotide to about 550 nucleotides (for example, about 1-500, about 1-450, about 1-400, about 1-350, about 1-300, about 1-250, about 1-200 or about 1-100 nucleotides). For example, it may range from about 1 nucleotide to about 400 nucleotides, or from about 1 nucleotide to about 200 nucleotides.
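The per-site classification described above can be sketched as follows. This is a simplified illustration for a single-nucleotide substitution only: it assumes each fragment is given as a gap-free aligned sequence with a 0-based start coordinate on the reference, which a real pipeline would derive from an aligner. The function name and arguments are hypothetical.

```python
def classify_fragment(fragment_seq, fragment_start, site_pos, ref_base):
    """Classify one aligned fragment with respect to a single mutation site.

    Assumes a gap-free alignment and 0-based reference coordinates.
    Returns "wild-type", "mutant", or None if the fragment does not cover the site.
    """
    offset = site_pos - fragment_start
    if offset < 0 or offset >= len(fragment_seq):
        return None  # the fragment does not span the mutation site
    observed = fragment_seq[offset].upper()
    # Same base as the reference: wild-type supporting; otherwise mutant supporting.
    return "wild-type" if observed == ref_base.upper() else "mutant"
```

For instance, `classify_fragment("ACGT", 100, 102, "G")` compares the third base of the fragment with the reference base `G`. The same fragment can be wild-type supporting for one site and mutant supporting for another, which is why the classification is made site by site.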
In the present application, the method may include the following step: (4') obtaining the distribution of the differences from step (3), selecting the maximum value in the distribution as Dev(Max), and using Dev(Max) as the distinguishing indicator and/or as the training sample.
In the present application, the distribution may be the set of the differences, and Dev(Max) may be the maximum of the differences in that set.
In the present application, the difference may be smoothed. After smoothing, the difference may reflect more intuitively and more accurately the difference between the number of wild-type supporting fragments and the number of mutant supporting fragments of the same length. Further, the smoothed difference can distinguish the somatic mutation from the germline mutation, and/or identify ctDNA in cfDNA, more accurately, more specifically and/or more sensitively.
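Selecting Dev(Max) from the set of differences can be sketched as below. The function name and the example values are hypothetical; the sketch also returns the fragment length at which the maximum occurs, which the application does not require.

```python
def dev_max(diffs):
    """Return (length, value) for the largest WC - MC difference in the set."""
    if not diffs:
        raise ValueError("diffs must not be empty")
    best_length = max(diffs, key=diffs.get)  # length whose difference is largest
    return best_length, diffs[best_length]


# Made-up differences keyed by fragment length (in nucleotides).
example_diffs = {140: -0.25, 150: -0.25, 166: 0.3, 167: 0.25}
best_length, best_value = dev_max(example_diffs)
```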
In the present application, the smoothing process may include the following steps:
(a) determining a smoothing window value, wherein the smoothing window value is an integer from about 1 to 10;
(b) determining the smoothing sampling length ranges, wherein the minimum of each smoothing sampling length range is the starting length and the maximum of each smoothing sampling length range is the starting length + (smoothing window value - 1), that is, the number of lengths covered by each smoothing sampling length range equals the chosen smoothing window value; and wherein the starting length lies within the range of lengths of the wild-type supporting fragments and/or the mutant supporting fragments;
(c) for any one smoothing sampling length range, obtaining the number of wild-type supporting fragments of at least one smoothing sampling length and the number of mutant supporting fragments of the same length; calculating the ratio WC of the number of wild-type supporting fragments of that length to the total number of wild-type supporting fragments; calculating the ratio MC of the number of mutant supporting fragments of the same length to the total number of mutant supporting fragments; and calculating the difference between the ratio WC and the ratio MC at the same length;
(d) summing the differences obtained in step (c) and dividing the sum by the smoothing window value to obtain the average difference of that smoothing sampling length range;
(e) using the resulting average difference as the representative value of that smoothing sampling length range.
In the present application, the smoothing window value may be adjusted according to the condition of the subject, the gene sequencing method and/or the purpose of the distinction, provided that the selected value allows the smoothing process to be carried out. The smoothing window value may be an integer from about 2 to 6 (for example, 2, 3, 4, 5 or 6). For example, the smoothing window value may be 3.
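Steps (a) to (e) can be sketched as a single function that averages the per-length differences over each smoothing sampling length range. This is an illustrative sketch under stated assumptions: the input is a mapping from fragment length to WC - MC, lengths with no observed fragments are treated as a difference of 0, and the parameter names (`window`, `start`, `max_length`, `step`) are hypothetical.

```python
def smooth_differences(diffs, window=3, start=1, max_length=400, step=None):
    """Return {range_minimum: average difference} for each smoothing range.

    diffs maps fragment length -> (WC - MC). step defaults to the window
    value, which yields non-overlapping ranges; step=1 yields overlapping ones.
    """
    if window < 1:
        raise ValueError("window must be a positive integer")
    if step is None:
        step = window
    smoothed = {}
    for lo in range(start, max_length + 1, step):
        hi = lo + window - 1  # range maximum = starting length + (window - 1)
        # Lengths absent from diffs have WC = MC = 0, so they add 0 to the sum.
        total = sum(diffs.get(length, 0.0) for length in range(lo, hi + 1))
        smoothed[lo] = total / window  # average difference of this range
    return smoothed


# Made-up differences for lengths 1-4; lengths 5 and 6 are unobserved.
example = smooth_differences({1: 0.3, 2: 0.0, 3: -0.3, 4: 0.6}, window=3, max_length=6)
```

With a window of 3 and a starting length of 1, the keys of the result are 1 and 4, corresponding to the ranges 1-3 and 4-6.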
In the present application, the smoothing process may include the following specific steps:
(a) determining a smoothing window value, wherein the smoothing window value is an integer from about 1 to 10 (for example, a smoothing window value of 3 is selected);
(b) determining the smoothing sampling length ranges, wherein the minimum of each smoothing sampling length range is the starting length and the maximum of each smoothing sampling length range is the starting length + (smoothing window value - 1), and wherein the starting length lies within the range of lengths of the wild-type supporting fragments and/or the mutant supporting fragments (for example, about 1 nucleotide to about 400 nucleotides). In the present application, the starting length may be any length within the range of lengths of the wild-type supporting fragments and/or the mutant supporting fragments. In the present application, "length" may be measured as a number of nucleotides.
In the present application, across the smoothing sampling length ranges, the minimum of each range may be the first, second, third, ... N-th term of an arithmetic sequence whose first term is the starting length and whose common difference is the smoothing window value, within the range of lengths of the wild-type supporting fragments and/or the mutant supporting fragments. For example, when the smoothing window value is 3 and the starting length is 1, then within the range of about 1 nucleotide to about 400 nucleotides the range minima may be 1, 4, 7, 10, ..., 400 in turn.
For example, when the starting length is 1 and the smoothing window value is 3, and the smoothing sampling length ranges do not overlap one another, the ranges may be 1-3, 4-6, 7-9, .... When the starting length is 1 and the smoothing window value is 3, and the ranges may overlap one another, the ranges may be 1-3, 2-4, 3-5, ..., or 1-3, 3-5, 5-7, .... As another example, when the starting length is 2 and the smoothing window value is 3, and the ranges do not overlap one another, the ranges may be 2-4, 5-7, 8-10, .... When the starting length is 2 and the smoothing window value is 3, and the ranges may overlap one another, the ranges may be 2-4, 3-5, 4-6, ....
(c)获取任意一个平滑化取样长度范围中,至少一个(例如至少1个、至少2个、至少3个或更多个)平滑化取样长度的所述野生型支持片段的数量,获取对应的相同长度的所述突变型支持片段的数量,计算该长度的所述野生型支持片段的数量与所述野生型支持片段的总数量的比值WC;计算相同长度的所述突变型支持片段的数量与所述突变型支持片段的总数量的比值MC;计算相同长度下所述比值WC与所述比值MC的差值。(c) Obtain the number of wild-type support fragments of at least one (for example, at least 1, at least 2, at least 3 or more) smoothed sampling lengths in any smoothed sampling length range, and obtain the corresponding The number of the mutant support fragments of the same length, calculate the ratio WC of the number of the wild type support fragments of this length to the total number of the wild type support fragments; calculate the number of the mutant support fragments of the same length Ratio MC to the total number of mutant supporting fragments; calculate the difference between the ratio WC and the ratio MC at the same length.
例如,获取长度为1个核苷酸的所述野生型支持片段的数量,将该数量除以所述野生型支持片段的总数量W total,得到比值WC1;获取长度为1个核苷酸的所述突变型支持片段的数量,将该数量除以所述突变型支持片段的总数量M total,得到比值MC1,计算两者的差值WC1-MC1;例如,获取长度为4个核苷酸的所述野生型支持片段的数量,将该数量除以所述野生型支持片段的总数量W total,得到比值WC4;获取长度为4个核苷酸的所述突变型支持片段的数量MC4,将该数量除以所述突变型支持片段的总数量M total,计算两者的差值WC4-MC4;从而分别得到不同的所述平滑化取样长度(例如1、4、7、10……400)各自对应的比值的差值;例如,可以获得各个所述平滑化取样长度范围内,各个所述平滑化取样长度下,所述野生型支持片段的数量与所述野生型支持片段的总数量的比值与所述突变型支持片段的数量与所述突变型支持片段的总数量的比值的差值。例如,针对平滑化取样长度范围1-3,可以分别获得(WC1-MC1)、(WC2-MC2)和(WC3-MC3)。 For example, the number of wild-type supporting fragments with a length of 1 nucleotide is obtained, and the number is divided by the total number of wild-type supporting fragments W total to obtain a ratio WC1; The number of the mutant supporting fragments is divided by the total number M total of the mutant supporting fragments to obtain the ratio MC1, and the difference WC1-MC1 between the two is calculated; for example, the length of the acquisition is 4 nucleotides The number of the wild-type supporting fragments is divided by the total number W total of the wild-type supporting fragments to obtain a ratio WC4; the number MC4 of the mutant supporting fragments with a length of 4 nucleotides is obtained, Divide this number by the total number M total of the mutant support fragments, and calculate the difference WC4-MC4 between the two; thus obtain different smoothing sampling lengths (for example, 1, 4, 7, 10...400 ) respectively corresponding to the difference of the ratio; for example, within the range of each smoothed sampling length, under each smoothed sampling length, the number of wild-type support fragments and the total number of wild-type support fragments can be obtained The difference between the ratio of and the ratio of the number of mutant supporting fragments to the total number of mutant supporting fragments. For example, for the smoothing sampling length range 1-3, (WC1-MC1), (WC2-MC2) and (WC3-MC3) can be obtained respectively.
(d) Calculate the average difference for the smoothed sampling length range from the differences at the at least one smoothed sampling length; for example, sum (WC1-MC1), (WC2-MC2), and (WC3-MC3) and divide the sum by the smoothing window value to obtain the average difference. Optionally, only some of the differences within a single smoothed sampling length range may be used, for example (WC1-MC1) and (WC3-MC3), with their mean taken as the average difference;
(e) Take the resulting average difference as the representative value of the smoothed sampling length range. For example, dividing the sum of (WC1-MC1), (WC2-MC2), and (WC3-MC3) by the smoothing window value 3 gives the average difference B1, which can serve as the representative value of that smoothed sampling length range. Likewise, dividing the sum of (WC4-MC4), (WC5-MC5), and (WC6-MC6) by the smoothing window value 3 gives the average difference B4, which can serve as the representative value of that smoothed sampling length range.
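Steps (c) to (e) can be sketched in a few lines of Python. This is a minimal illustration, not the applicant's implementation: the function names, the representation of fragments as plain lists of lengths, and the defaults `window=3` and `max_length=400` are assumptions chosen to match the examples in the text.

```python
from collections import Counter


def length_fractions(lengths):
    """Map each fragment length to its share of all fragments (WC or MC)."""
    counts = Counter(lengths)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {length: n / total for length, n in counts.items()}


def window_average_differences(wild_lengths, mutant_lengths,
                               window=3, max_length=400):
    """Return {start_length: average of (WC_k - MC_k) over one window}.

    Windows start at 1, 1 + window, 1 + 2 * window, ... (assumed from the
    example lengths 1, 4, 7, 10 ... 400 in the text).
    """
    wc = length_fractions(wild_lengths)    # ratio WC per length
    mc = length_fractions(mutant_lengths)  # ratio MC per length
    averages = {}
    for start in range(1, max_length + 1, window):
        # Step (c): per-length differences inside this window.
        diffs = [wc.get(k, 0.0) - mc.get(k, 0.0)
                 for k in range(start, start + window)]
        # Steps (d) and (e): the window mean is the representative value.
        averages[start] = sum(diffs) / window
    return averages
```

Lengths with no observed fragments contribute a ratio of zero, which is one possible convention; the text does not say how empty lengths are handled.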
In the present application, the smoothing process may include the following step: (f) obtain a first distribution of the average differences from step (e). For example, the respective average differences B1, B4, B7, and so on form the first distribution D=[B1, B4, B7 ... B400].
In the present application, the smoothing process may further include the following step: (g) within the length range of an effective fragment interval, successively accumulate the average differences in the first distribution to obtain added values, wherein the length of the effective fragment interval covers the length of a nucleic acid sequence wrapped around a nucleosome.
In the present application, the nucleic acid sequence may be able to wrap around the nucleosome two or more turns, or within one turn. For example, the length of the effective fragment interval may be about 1 to about 180 nucleotides (for example, about 1 to about 180, about 1 to about 179, about 1 to about 178, about 1 to about 177, about 1 to about 176, about 1 to about 175, about 1 to about 174, about 1 to about 173, about 1 to about 172, about 1 to about 171, about 1 to about 170, about 1 to about 169, about 1 to about 168, about 1 to about 167, about 1 to about 166, or about 1 to about 165), and/or about 200 or more nucleotides (for example, about 200 or more, about 210 or more, about 220 or more, about 230 or more, about 240 or more, about 250 or more, about 260 or more, about 270 or more, about 280 or more, about 290 or more, about 300 or more, about 350 or more, or about 400 or more). For example, the length of the effective fragment interval may be about 1 to about 167 nucleotides, and/or about 250 to about 400 nucleotides.
For example, B1 and B4 in the first distribution may be summed to obtain the added value D1; B1, B4, and B7 in the first distribution may be summed to obtain the added value D2.
In the present application, the smoothing process may include the following step: (h) obtain a second distribution of the added values from step (g), and determine the maximum added value in the second distribution. For example, the added values D1, D2, and so on may form the second distribution A=[D1, D2 ... Di], where i may be the length of the effective fragment interval.
In the present application, the maximum value in the second distribution may be taken as Dev(Max). In the present application, Dev(Max) may be used as the indicator for the distinguishing and/or as the training sample.
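Steps (f) to (h) amount to a running sum over the window averages followed by a maximum. The sketch below assumes the first distribution is given as a mapping from each window's starting length to its average difference, and that a single effective fragment interval of 1 to 167 nucleotides is used; the function name `dev_max` and the handling of the interval are illustrative assumptions, since the text does not specify how two disjoint intervals would be combined.

```python
from itertools import accumulate


def dev_max(first_distribution, interval=(1, 167)):
    """Return (Dev(Max), second_distribution) for one mutation site.

    first_distribution: {window_start_length: average difference B}.
    interval: assumed effective fragment interval (inclusive bounds).
    """
    low, high = interval
    # Step (f): order the window averages by starting length and keep
    # those whose start lies inside the effective fragment interval.
    values = [avg for start, avg in sorted(first_distribution.items())
              if low <= start <= high]
    if len(values) < 2:
        raise ValueError("need at least two window averages in the interval")
    # Step (g): accumulate yields B1, B1+B4, B1+B4+B7, ...; the text defines
    # D1 as B1+B4, so the first element (B1 alone) is dropped.
    second_distribution = list(accumulate(values))[1:]
    # Step (h): the maximum added value is Dev(Max).
    return max(second_distribution), second_distribution
```

Intuitively, a large Dev(Max) indicates that the wild-type and mutant length distributions diverge over the accumulated range, which is the signal the method feeds to the classifier.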
In the present application, to further improve the accuracy, sensitivity, and/or specificity of the method described herein, other parameters may be used, in addition to the difference described herein (for example, Dev(Max)), as the indicator for the distinguishing and/or as the training sample. For example, the indicator may further include one or more parameters selected from the following group: the chromosomal position of the mutation site, the base substitution pattern of the mutation site, the counts of nucleic acid fragments of each length for the wild type at the mutation site and/or the counts of nucleic acid fragments of each length for the mutant type at the mutation site, the allelic variation of the mutation site, the age of the subject, and the mutation type of the mutation site.
In the present application, the indicator may further include one or more parameters selected from the following group: the chromosomal position of the SNV site, the base substitution pattern of the SNV site, the counts of nucleic acid fragments of each length for the wild type at the SNV site and/or the counts of nucleic acid fragments of each length for the mutant type at the SNV site, the allelic variation of the SNV site, the age of the subject, and the mutation type of the SNV site.
In the present application, the method may further include a step of detecting the mutation site. This step may be a routine step in the art, with reference to the gene sequencing. For example, detecting the mutation site may include the following steps: (1) obtaining data from the sample; (2) performing variant calling on the data obtained in step (1) (for example, variant calling may take into account factors such as base quality, mapping quality, number of mismatches, mutation frequency, and the number of reads supporting the mutation); (3) annotating the variants called in step (2) (for example, annotation may use ANNOVAR 20160201, the 1000 Genomes database, the ExAC database, and/or the gnomAD genome database; for example, database annotation, hotspot annotation, mutation type annotation, and/or population frequency annotation may be used); and (4) filtering the variants annotated in step (3) (for example, by population frequency of the mutation site, hotspot mutations, clonal hematopoiesis mutations, and/or maximum depth) to obtain the mutation site. For example, the steps may further include performing quality control on the mutation site after step (4) (for example, the quality control may include removing duplicate fragments and/or filtering out low-quality fragments).
Device
In another aspect, the present application provides a device for distinguishing somatic mutations from germline mutations, comprising:
a calculation module, configured to calculate the difference between a ratio WC and a ratio MC at the same length, wherein, for each mutation site, the calculation is based on the number of wild-type supporting fragments of at least one length and the number of mutant supporting fragments of the corresponding same length; the ratio WC is the ratio of the number of wild-type supporting fragments of a given length to the total number of wild-type supporting fragments; the ratio MC is the ratio of the number of mutant supporting fragments of the corresponding same length to the total number of mutant supporting fragments; the wild-type supporting fragments are cfDNA fragments comprising a wild-type base sequence, and the mutant supporting fragments are cfDNA fragments comprising a mutant base sequence; the wild-type base sequence is a sequence identical to the nucleotide sequence of the reference genome at the position corresponding to the mutation site, and the mutant base sequence is a sequence that differs from the nucleotide sequence of the reference genome at that position; the reference genome is the human reference genome used in the gene sequencing; and the mutation site is derived from a sample of a subject and is obtained by gene sequencing; and
a judgment module, configured to obtain a result identifying the somatic mutation from a machine learning model that has undergone machine learning training, wherein the machine learning training includes inputting the difference into the machine learning model as a training sample.
In another aspect, the present application provides a device for identifying ctDNA in cfDNA, comprising:
a calculation module, configured to calculate the difference between a ratio WC and a ratio MC at the same length, wherein, for each mutation site, the calculation is based on the number of wild-type supporting fragments of at least one length and the number of mutant supporting fragments of the corresponding same length; the ratio WC is the ratio of the number of wild-type supporting fragments of a given length to the total number of wild-type supporting fragments; the ratio MC is the ratio of the number of mutant supporting fragments of the corresponding same length to the total number of mutant supporting fragments; the wild-type supporting fragments are cfDNA fragments comprising a wild-type base sequence, and the mutant supporting fragments are cfDNA fragments comprising a mutant base sequence; the wild-type base sequence is a sequence identical to the nucleotide sequence of the reference genome at the position corresponding to the mutation site, and the mutant base sequence is a sequence that differs from the nucleotide sequence of the reference genome at that position; the reference genome is the human reference genome used in the gene sequencing; and the mutation site is derived from a sample of a subject and is obtained by gene sequencing; and
a judgment module, configured to obtain a result identifying ctDNA in the cfDNA from a machine learning model that has undergone machine learning training, wherein the machine learning training includes inputting the difference into the machine learning model as a training sample.
In another aspect, the present application provides a training device for a machine learning model, comprising:
a calculation module, configured to calculate the difference between a ratio WC and a ratio MC at the same length, wherein, for each mutation site, the calculation is based on the number of wild-type supporting fragments of at least one length and the number of mutant supporting fragments of the corresponding same length; the ratio WC is the ratio of the number of wild-type supporting fragments of a given length to the total number of wild-type supporting fragments; the ratio MC is the ratio of the number of mutant supporting fragments of the corresponding same length to the total number of mutant supporting fragments; the wild-type supporting fragments are cfDNA fragments comprising a wild-type base sequence, and the mutant supporting fragments are cfDNA fragments comprising a mutant base sequence; the wild-type base sequence is a sequence identical to the nucleotide sequence of the reference genome at the position corresponding to the mutation site, and the mutant base sequence is a sequence that differs from the nucleotide sequence of the reference genome at that position; the reference genome is the human reference genome used in the gene sequencing; and the mutation site is derived from a sample of a subject and is obtained by gene sequencing; and
a training module, configured to input the difference into the machine learning model as a training sample for machine learning training.
In the present application, the device may use only samples derived from the subject.
In the present application, the device may further include an output module, configured to display the result identifying the somatic mutation and/or the result identifying ctDNA in the cfDNA produced by the judgment module.
In the present application, the output module may display the result identifying the somatic mutation and/or the result identifying ctDNA in the cfDNA produced by the judgment module. For example, the output module may include an output device (for example, a display) and/or an output program (for example, a mobile app) for displaying those results. In the present application, the output module takes as input the result identifying the somatic mutation and/or the result identifying ctDNA in the cfDNA obtained by the judgment module.
In the present application, the device may further include a sample acquisition module, configured to obtain the sample from the subject.
For example, the sample may include a blood sample. In the present application, the sample acquisition module may include the reagents and/or instruments required to obtain the sample. For example, the sample acquisition module may include a blood collection needle, a blood collection tube, and/or a blood sample transport box. For example, the sample acquisition module may include an anticoagulant. In the present application, the sample acquisition module may output the sample described herein.
In the present application, the device may further include a data receiving module, configured to obtain the mutation site in the sample. For example, the data receiving module may take the sample as input and may output the mutation site described herein. In the present application, the data receiving module may include the reagents and/or instruments required to obtain the mutation site, for example the reagents and/or instruments required for the gene sequencing. In the present application, the data receiving module may perform the gene sequencing described herein; for example, the gene sequencing may include next-generation sequencing (NGS).
For example, the data receiving module may include a next-generation sequencer (for example, a Roche 454 sequencer or an Illumina sequencer). For example, the data receiving module may include an automated sample preparation system. For example, the data receiving module may include fluorescently labeled dNTPs, end repair enzyme, end repair reaction buffer, DNA ligase, DNA ligation buffer, and/or library amplification reaction solution.
In the present application, the mutation site may include a single nucleotide variation (SNV). In the present application, the mutation site may contain two or more nucleotide variations.
In the present application, detecting the mutation site in the device may include the following steps: (1) obtaining data from the sample; (2) performing variant calling on the data obtained in step (1); (3) annotating the variants called in step (2); and (4) filtering the variants annotated in step (3) to obtain the mutation site; optionally, quality control is performed on the mutation site.
In the present application, the device may further include an input module, configured to obtain the number of wild-type supporting fragments of the at least one length and/or the number of mutant supporting fragments of the corresponding same length.
For example, the input module may take the mutation site described herein as input. The input module may output the number of wild-type supporting fragments of the at least one length and/or the number of mutant supporting fragments of the corresponding same length. In the present application, the input module may include reagents and/or instruments capable of counting wild-type supporting fragments of a specific length, and may include reagents and/or instruments capable of counting mutant supporting fragments of a specific length. In the present application, the input module may include an instrument (for example, a display) and/or an output program (for example, a mobile app) capable of showing the number of wild-type supporting fragments of the at least one length and/or the number of mutant supporting fragments of the corresponding same length, so that the numbers of wild-type and/or mutant supporting fragments obtained with the input module can be displayed. In the present application, the input module may distinguish the wild-type supporting fragments from the mutant supporting fragments. In the present application, the input module may count the number of wild-type supporting fragments of each length and count the number of mutant supporting fragments of each length.
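The counting performed by the input module can be illustrated as follows. This is a hedged sketch: it assumes each cfDNA fragment covering the mutation site has already been reduced to a pair `(base_at_site, fragment_length)`, which is a hypothetical representation, and it classifies a fragment as mutant whenever its base differs from the reference base, following the definition of the mutant base sequence given in this application.

```python
from collections import Counter


def count_supporting_fragments(fragments, ref_base):
    """Count fragments per length, split into wild-type and mutant support.

    fragments: iterable of (base_at_site, fragment_length) pairs
               (hypothetical pre-processed input).
    ref_base:  reference-genome base at the mutation site.
    Returns (wild_counts, mutant_counts), each a Counter keyed by length.
    """
    wild_counts, mutant_counts = Counter(), Counter()
    for base, length in fragments:
        if base == ref_base:
            wild_counts[length] += 1    # matches the reference: wild type
        else:
            mutant_counts[length] += 1  # differs from the reference: mutant
    return wild_counts, mutant_counts


# Example with made-up fragments at a site whose reference base is "A".
wild, mutant = count_supporting_fragments(
    [("A", 166), ("A", 166), ("T", 145), ("A", 170), ("T", 145)], "A")
```

The totals `sum(wild.values())` and `sum(mutant.values())` correspond to W total and M total used when forming the ratios WC and MC.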
In the present application, the length of the wild-type supporting fragments and/or the mutant supporting fragments may range from about 1 nucleotide to about 550 nucleotides, for example from about 1 nucleotide to about 400 nucleotides, or from about 1 nucleotide to about 200 nucleotides.
In the present application, the calculation module may take as input the number of wild-type supporting fragments described herein and the number of mutant supporting fragments of the corresponding same length (for example, as obtained through the input module described herein). The calculation module may output the difference described herein; for example, it may output the Dev(Max) described herein. The calculation module may include calculation logic and/or a calculation program for calculating the difference described herein.
In the present application, the calculation module may obtain the distribution of the differences, select the maximum value in the distribution as Dev(Max), and use Dev(Max) as the indicator for the distinguishing and/or as the training sample.
In the present application, the differences may be smoothed in the calculation module, wherein the smoothing process may include the following steps: (a) determine a smoothing window value, wherein the smoothing window value is an integer from about 1 to 30; (b) determine several smoothed sampling length ranges whose span equals the smoothing window value, wherein the minimum of each smoothed sampling length range is its starting length, and the starting length falls within the length range of the wild-type supporting fragments and/or the mutant supporting fragments; (c) for any one smoothed sampling length range, obtain the number of wild-type supporting fragments at at least one smoothed sampling length, obtain the number of mutant supporting fragments of the corresponding same length, calculate the ratio WC of the number of wild-type supporting fragments of that length to the total number of wild-type supporting fragments, calculate the ratio MC of the number of mutant supporting fragments of the same length to the total number of mutant supporting fragments, and calculate the difference between the ratio WC and the ratio MC at that length; (d) calculate the average difference for the smoothed sampling length range from the differences at the at least one smoothed sampling length; (e) take the average difference obtained in step (d) as the representative value of the smoothed sampling length range.
In the present application, the smoothing window value may be an integer from about 2 to 6. For example, the smoothing window value may be 3.
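Steps (a) and (b) can be pictured as slicing the fragment length range into consecutive windows. The helper below is a small sketch under the assumption, taken from the example lengths 1, 4, 7, 10 ... 400 elsewhere in this application, that consecutive ranges do not overlap; the function name and argument names are illustrative.

```python
def smoothing_ranges(window, min_length, max_length):
    """Return (start, end) pairs for each smoothed sampling length range.

    window:     smoothing window value (an integer from 1 to 30).
    min_length: shortest fragment length considered.
    max_length: longest starting length allowed.
    """
    if not 1 <= window <= 30:
        raise ValueError("smoothing window value must be an integer in 1-30")
    # Each range spans `window` lengths; its minimum is the starting length.
    return [(start, start + window - 1)
            for start in range(min_length, max_length + 1, window)]
```

With a window value of 3 and lengths starting at 1, the first ranges are 1-3, 4-6, and 7-9, matching the ranges used for B1, B4, and B7.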
In the present application, the smoothing process may include the following step: (f) obtain a first distribution of the average differences from step (e).
In the present application, the smoothing process may include the following step: (g) within the length range of an effective fragment interval, successively accumulate the average differences in the first distribution to obtain added values, wherein the length of the effective fragment interval covers the length of a nucleic acid sequence wrapped around a nucleosome.
In the present application, the nucleic acid sequence may be able to wrap around the nucleosome two or more turns, or within one turn. In the present application, the length of the effective fragment interval may be about 1 to about 167 nucleotides, and/or about 200 or more nucleotides. In the present application, the length of the effective fragment interval may be about 1 to about 167 nucleotides, and/or about 250 to about 400 nucleotides.
In the present application, the smoothing process may include the following step: (h) obtain a second distribution of the added values from step (g), and determine the maximum added value in the second distribution. In the present application, the maximum added value may be taken as Dev(Max), and Dev(Max) may be used as the indicator for the distinguishing and/or as the training sample.
In the present application, the judgment module may obtain the relevant result from a machine learning model that has undergone machine learning training (for example, the result may include the result identifying the somatic mutation described herein and/or the result identifying ctDNA in cfDNA described herein). In the present application, the judgment module may take as input the difference described herein (for example, Dev(Max)) and may output the relevant result. In the present application, the judgment module may include a machine learning model that has undergone machine learning training. The machine learning model is obtained by the machine learning model training method described herein, using a validation set and the difference described herein (and, for example, optionally the parameters described herein).
In the present application, the indicator and/or the training sample may further include one or more of the following parameters: the chromosomal position of the mutation site, the base substitution pattern of the mutation site, the counts of nucleic acid fragments of each length for the wild type at the mutation site and/or the counts of nucleic acid fragments of each length for the mutant type at the mutation site, the allelic variation of the mutation site, the age of the subject, and the mutation type of the mutation site.
In the present application, the indicator and/or the training sample may further include one or more of the following parameters: the chromosomal position of the SNV site, the base substitution pattern of the SNV site, the counts of nucleic acid fragments of each length for the wild type at the SNV site and/or the counts of nucleic acid fragments of each length for the mutant type at the SNV site, the allelic variation of the SNV site, the age of the subject, and the mutation type of the SNV site.
In the present application, the device may include the calculation module and the judgment module. The device may include the calculation module and the training module.
In the present application, the device may include the sample acquisition module, the data receiving module, the input module, the calculation module, the judgment module, and the output module. In the present application, the sample, together with the information and/or calculation results derived from it, may be passed in sequence through the sample acquisition module, the data receiving module, the input module, the calculation module, the judgment module, and the output module.
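The sequential hand-off between modules can be modelled as a chain of functions in which each module consumes the previous module's output. The sketch below is purely illustrative: every module body is a hypothetical stub with placeholder values, and only the ordering reflects the text.

```python
from functools import reduce


def run_modules(initial_input, modules):
    """Pass data through each module in order; each output feeds the next."""
    return reduce(lambda data, module: module(data), modules, initial_input)


# Hypothetical stand-ins for the six modules; values are placeholders only.
def acquire_sample(subject_id):
    return {"subject": subject_id, "sample": "blood"}

def receive_data(state):
    return {**state, "sites": ["chr1:100"]}

def count_fragments(state):
    return {**state, "counts": {"chr1:100": ({166: 2}, {145: 1})}}

def compute_difference(state):
    return {**state, "dev_max": {"chr1:100": 0.0}}

def judge(state):
    return {**state, "calls": {"chr1:100": "undetermined"}}

def output(state):
    return state["calls"]


result = run_modules("subject-001", [acquire_sample, receive_data,
                                     count_fragments, compute_difference,
                                     judge, output])
```

Keeping each module as a function of the accumulated state makes it easy to drop or swap a stage, for example replacing the judgment module with the training module.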
In another aspect, the present application provides an electronic device, comprising a memory and a processor coupled to the memory, the processor being configured to execute instructions stored in the memory to implement the method for distinguishing somatic mutations from germline mutations described herein, the method for identifying ctDNA in cfDNA described herein, or the machine learning model training method described herein.
In another aspect, the present application provides a non-volatile computer-readable storage medium storing a computer program that, when executed by a processor, implements the method for distinguishing somatic mutations from germline mutations described herein, the method for identifying ctDNA in cfDNA described herein, or the machine learning model training method described herein.
For example, the non-volatile computer-readable storage medium may include a floppy disk, a flexible disk, a hard disk, solid-state storage (SSS) (for example, a solid-state drive (SSD), a solid-state card (SSC), or a solid-state module (SSM)), an enterprise flash drive, magnetic tape, or any other non-transitory magnetic medium. The non-volatile computer-readable storage medium may also include punched cards, paper tape, optical mark sheets (or any other physical medium with a pattern of holes or other optically recognizable marks), compact disc read-only memory (CD-ROM), compact disc rewritable (CD-RW), digital versatile disc (DVD), Blu-ray disc (BD), and/or any other non-transitory optical medium.
另一方面,本申请提供了一种数据库系统,其包括存储器;和耦接至所述存储器的处理器,所述处理器被配置为基于存储在所述存储器中的指令执行以实现本申请所述的区分体细胞突变和种系突变的方法;本申请所述的在cfDNA中识别ctDNA的方法,或者本申请所述的数据库建立方法。In another aspect, the present application provides a database system, which includes a memory; and a processor coupled to the memory, the processor configured to execute based on instructions stored in the memory to implement the The method for distinguishing somatic mutations and germline mutations as described above; the method for identifying ctDNA in cfDNA as described in this application, or the method for building a database as described in this application.
例如,所述据库系统可以实现各种机制以便确保在数据库系统上执行的本申请所述的方法产生正确的结果。在本申请中,所述数据库系统可以使用磁盘作为永久性数据存储器。在本申请中,所述数据库系统可以为多个数据库客户端提供数据库存储和处理服务。所述数据库客户端可以跨多个共享存储设备存储数据库数据,和/或可以利用具有多个执行节点的一个或更多个执行平台。所述数据库系统可以被组织成使得存储和计算资源可以被有效地无限扩展。For example, the database system may implement various mechanisms to ensure that the methods described herein performed on the database system produce correct results. In this application, the database system may use disks as permanent data storage. In this application, the database system can provide database storage and processing services for multiple database clients. The database client may store database data across multiple shared storage devices, and/or may utilize one or more execution platforms with multiple execution nodes. The database system can be organized such that storage and computing resources can be effectively scaled indefinitely.
Applications
In another aspect, the present application provides use of the method for distinguishing somatic mutations from germline mutations described in the present application in the management of tumor families.
In another aspect, the present application provides use of the method for distinguishing somatic mutations from germline mutations described in the present application in the detection of tumor mutation burden (TMB).
In the present application, the method can be used to determine whether the subject carries a germline mutation. Subjects carrying certain germline mutations may have a higher lifetime risk of developing tumors (for example, colorectal cancer, endometrial cancer, gastric cancer, and/or ovarian cancer) than the general population. The method can therefore be used to screen for subjects at higher risk. Such subjects can receive individualized tumor surveillance, enabling early diagnosis and early treatment.
In the present application, the method can be applied in clinical practice by measuring the TMB (for example, to infer whether a particular tumor therapy is suitable for the subject). In some cases, the TMB level measured by the method can be used in clinical practice in combination with other biomarkers such as immune checkpoints and T-cell inflammation markers.
Without wishing to be bound by any theory, the following examples are intended only to illustrate the fusion protein, preparation method, use, and so on of the present application, and are not intended to limit the scope of the invention of the present application.
Examples
Example 1 Obtaining the mutation sites described in the present application
1. Data preparation
a) Sequence alignment: use the mem module of bwa 0.7.10 to map the sequences to the human reference genome GRCh37/hg19, producing the alignment result as a .bam file.
2. Variant identification
Use vardict 1.5.1 to perform variant calling on SNVs, with the following parameters:
a) remove bases with base quality < 30;
b) remove reads whose mapping quality is too low, for example reads with mapping quality < 60;
c) remove reads with too many mismatches, for example more than 12, 10, 8, or 6 mismatches;
d) the mutation frequency should not be too low, for example mutation frequency >= 0.002, 0.001, 0.0005, 0.0002, or 0.0001;
e) the number of reads supporting the mutation is >= 3, 2, or 1;
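The thresholds in items a) to e) can be read as a read-level filter followed by a site-level filter. The sketch below only illustrates that logic; it is not how vardict is invoked. The `Read` and `Site` record types, the function names, and the field names are hypothetical, and the defaults are simply the first example value listed for each item.

```python
from dataclasses import dataclass


@dataclass
class Read:
    base_quality: int     # Phred quality of the base at the candidate site
    mapping_quality: int  # mapping quality of the read
    mismatches: int       # mismatches between the read and the reference


@dataclass
class Site:
    alt_reads: int  # reads supporting the mutation
    depth: int      # all reads covering the site


def read_passes(read, min_bq=30, min_mapq=60, max_mismatches=12):
    """Items a) to c): drop low-quality bases, low-MAPQ reads, mismatch-rich reads."""
    return (
        read.base_quality >= min_bq
        and read.mapping_quality >= min_mapq
        and read.mismatches <= max_mismatches
    )


def site_passes(site, min_af=0.002, min_alt_reads=3):
    """Items d) and e): minimum mutation frequency and supporting-read count."""
    if site.depth == 0:
        return False
    return site.alt_reads / site.depth >= min_af and site.alt_reads >= min_alt_reads
```

Keeping the two levels separate makes it easy to swap in the looser example values (for instance `min_af=0.0001` or `min_alt_reads=1`) without touching the read-level checks.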
3. Variant annotation
This includes database annotation, hotspot mutation (hot) site annotation, mutation type annotation, and population frequency annotation.
a) use ANNOVAR 20160201 to annotate the variant sites;
b) annotate hotspot mutation (hot) sites: a mutation that appears in the hotspot mutation list is a hotspot mutation; in the subsequent mutation filtering, hotspot mutations are excluded from model prediction;
c) use SnpEff V4.3 to annotate the mutation type of each variant;
d) annotate population frequency: for a given variant site, take the maximum of the population frequencies reported in the various databases as the population frequency of that mutation site.
The databases used include, but are not limited to, the 1000Genomes database, the ExAC database, and the ESP6500 database.
4. SNV mutation filtering
Filter all annotated mutation sites according to the following conditions:
a) population mutation frequency filtering: retain only mutations whose population mutation frequency is below a specific value, for example less than or equal to 0.005, 0.002, or 0.001;
b) hotspot mutation filtering;
c) clonal hematopoiesis mutation filtering;
d) maximum depth filtering: filter out mutations whose sequencing depth exceeds a specific value, for example a sequencing depth greater than 20000;
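The population-frequency rule of step 3 d) and the four filters of step 4 can be sketched as follows. This is an illustration under stated assumptions, not the applicant's pipeline: the dictionary keys, the `hotspots` and `ch_variants` sets, and the choice to treat a variant missing from every database as frequency 0.0 are all hypothetical.

```python
def population_frequency(freqs_by_db):
    """Maximum frequency over the databases; databases with no record (None) are skipped.

    Returning 0.0 when no database reports the variant is an assumption.
    """
    known = [f for f in freqs_by_db.values() if f is not None]
    return max(known) if known else 0.0


def keep_site(site, hotspots, ch_variants, max_pop_freq=0.005, max_depth=20000):
    """Apply filters a) to d) of step 4 to one annotated SNV."""
    key = (site["chrom"], site["pos"], site["ref"], site["alt"])
    if population_frequency(site["pop_freqs"]) > max_pop_freq:
        return False  # a) too common in the population
    if key in hotspots:
        return False  # b) hotspot mutation, excluded from model prediction
    if key in ch_variants:
        return False  # c) clonal hematopoiesis mutation
    if site["depth"] > max_depth:
        return False  # d) sequencing depth above the cap
    return True
```

For example, a site with frequencies `{"1000Genomes": 0.0004, "ExAC": 0.0011, "ESP6500": None}` is assigned a population frequency of 0.0011 and passes filter a) at the 0.005 threshold.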
5. Quality control of fragments at SNV mutation sites
a) duplicate removal: remove the duplicate sequences generated during PCR amplification;
b) low-quality fragment filtering: filter out fragments whose median base quality is below Q20;
c) sequencing-error fragment filtering: filter out fragments that cannot be aligned to the reference genome;
d) removal of mutations with low coverage depth: remove SNVs with fewer than 50 supporting fragments.
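The four quality-control steps above can be sketched in a single pass over the fragments covering one SNV. The fragment dictionary layout is hypothetical, and identifying PCR duplicates by identical start, end, and strand is an assumption; the text does not say how duplicates are detected.

```python
from statistics import median


def qc_site(fragments, min_median_q=20, min_fragments=50):
    """Return the fragments that survive QC, or None if the SNV is dropped.

    Each fragment is a dict with "start", "end", "strand", "aligned" and a
    non-empty "base_qualities" list (hypothetical layout).
    """
    seen = set()
    kept = []
    for frag in fragments:
        key = (frag["start"], frag["end"], frag["strand"])
        if key in seen:
            continue  # a) later copies with the same coordinates count as duplicates
        seen.add(key)
        if not frag["aligned"]:
            continue  # c) cannot be aligned to the reference genome
        if median(frag["base_qualities"]) < min_median_q:
            continue  # b) median base quality below Q20
        kept.append(frag)
    # d) drop the SNV when fewer than min_fragments fragments remain
    return kept if len(kept) >= min_fragments else None
```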
Example 2 Method for obtaining the difference described in the present application
2.1
Based on the SNV mutation sites obtained in Example 1, the difference described in the present application is calculated by the following steps:
a) Obtaining the wild-type supporting fragments and the mutant supporting fragments: the wild-type supporting fragments are cfDNA fragments containing the wild-type base sequence, and the mutant supporting fragments are cfDNA fragments containing the mutant base sequence. The wild-type base sequence is identical to the nucleotide sequence of the reference genome at the position corresponding to the mutation site, and the mutant base sequence differs from the nucleotide sequence of the reference genome at that position. The reference genome is the human reference genome used in the gene sequencing.
b) Constructing the distribution patterns of the wild-type supporting fragments and the mutant supporting fragments within a specific length range:
Within the length range of 1 to 400 nucleotides, compute the length distributions of the wild-type supporting fragments and the mutant supporting fragments.
c) Quantifying the difference (Dev) between the fragmentation patterns of the two groups within a specific interval, using the following formulas:
[Formula (1): image PCTCN2022096125-appb-000001]
$$D = [B_1, B_4, B_7, \ldots, B_{397}] \qquad (2)$$
In formula (1), $WC_i$ and $MC_i$ denote, for a given mutation site, the number of wild-type supporting fragments of length i nucleotides and the number of mutant supporting fragments of length i nucleotides, respectively.
Here, 3 is the smoothing window value;
j is a length value within the smoothing sampling length range; for example, j may be an integer in an arithmetic sequence such as 1, 4, 7, 10;
and 400 is the length range of the wild-type supporting fragments and/or the mutant supporting fragments.
In other words, taking 3 as the interval length, within the nucleotide length range of 1-400, the accumulated value of the ratios at each length is calculated according to formula (1), and the set of these accumulated values constitutes the first distribution D (formula (2)).
Then, the length of the effective fragment interval is set to about 1 to about 167 nucleotides and/or about 250 to about 400 nucleotides. In the present application, the length of the effective fragment interval may be the length of a nucleic acid sequence wrapped around a nucleosome. For example, the nucleic acid sequence may wrap around the nucleosome for more than 2 turns, or for less than 1 turn (for example, the length of the effective fragment interval is about 1 to about 167 nucleotides and/or about 250 to about 400 nucleotides).
Within the effective fragment interval, the values of the individual B in the first distribution D (that is, the accumulated values of the ratios) are accumulated again in order to obtain the summed values (see formula (3)).
[Formula (3): image PCTCN2022096125-appb-000002]
For example, if the length of the effective fragment interval is 100 (that is, i is 100), the successive summed values of the individual B in the first distribution D are calculated within the nucleotide length range of 1-100.
The set of summed values constitutes the second distribution A, and the largest summed value in the second distribution is denoted Dev(Max) (see formula (4)).
$$Dev = \max(A) \qquad (4)$$
For example, Figure 8 shows the length distribution frequencies of the wild-type supporting fragments and the mutant supporting fragments described in the present application, obtained with the method of Example 2.1 for the C-T mutation site at position 20525808 of human chromosome 4.
For example, Figure 9 shows the length distribution frequencies of the wild-type supporting fragments and the mutant supporting fragments described in the present application, obtained with the method of Example 2.1 for the G-T mutation site at position 56189455 of human chromosome 5.
For example, Figure 10 shows the length distribution frequencies of the wild-type supporting fragments and the mutant supporting fragments described in the present application, obtained with the method of Example 2.1 for the C-A mutation site at position 7577141 of human chromosome 17.
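The Dev calculation of Example 2.1 can be sketched as below, with important caveats. Formulas (1) and (3) are available only as images, so the exact form of each term is an assumption: here $WC_i$ and $MC_i$ are taken as the fraction of wild-type and mutant supporting fragments at length i (following the ratios WC and MC of claim 1), $B_j$ is taken as the sum of $WC_i - MC_i$ over a window of 3 lengths starting at j, and A is taken as the running sum of the B values inside one effective interval. The sign convention, the function names, and the single-interval default are likewise hypothetical.

```python
def length_fractions(lengths, max_len=400):
    """Fraction of fragments at each length 1..max_len (index 0 is unused)."""
    counts = [0] * (max_len + 1)
    for n in lengths:
        if 1 <= n <= max_len:
            counts[n] += 1
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]


def dev(wild_lengths, mutant_lengths, window=3, step=3, max_len=400,
        interval=(1, 167)):
    """Assumed reading of formulas (1) to (4); see the caveats in the lead-in."""
    wc = length_fractions(wild_lengths, max_len)
    mc = length_fractions(mutant_lengths, max_len)

    # First distribution D: B_j for j = 1, 4, 7, ..., 397 (full windows only).
    b = {}
    for j in range(1, max_len - window + 2, step):
        b[j] = sum(wc[i] - mc[i] for i in range(j, j + window))

    # Second distribution A: running sums of B_j inside the effective interval.
    lo, hi = interval
    running = 0.0
    a = []
    for j in sorted(b):
        if lo <= j <= hi:
            running += b[j]
            a.append(running)

    # Formula (4): Dev is the largest value of A.
    return max(a)
```

Under these assumptions, identical wild-type and mutant length distributions give a Dev of 0, and the value grows as the two distributions separate inside the effective interval.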
2.2
Based on the SNV mutation sites obtained in Example 1, the difference described in the present application is calculated by the following steps:
a) Obtaining the wild-type supporting fragments and the mutant supporting fragments: the wild-type supporting fragments are cfDNA fragments containing the wild-type base sequence, and the mutant supporting fragments are cfDNA fragments containing the mutant base sequence. The wild-type base sequence is identical to the nucleotide sequence of the reference genome at the position corresponding to the mutation site, and the mutant base sequence differs from the nucleotide sequence of the reference genome at that position. The reference genome is the human reference genome used in the gene sequencing.
b) Constructing the distribution patterns of the wild-type supporting fragments and the mutant supporting fragments within a specific length range:
Within the length range of 1 to 400 nucleotides, compute the length distributions of the wild-type supporting fragments and the mutant supporting fragments.
c) Quantifying the difference (Dev) between the fragmentation patterns of the two groups within a specific interval, using the following formulas:
[Formula (1): image PCTCN2022096125-appb-000003]
$$D = [B_1, B_2, B_3, \ldots, B_{400}] \qquad (2)$$
In formula (1), $WC_i$ and $MC_i$ denote, for a given mutation site, the number of wild-type supporting fragments of length i nucleotides and the number of mutant supporting fragments of length i nucleotides, respectively.
Here, 3 is the smoothing window value;
j is a length value within the smoothing sampling length range; for example, j may be an integer in an arithmetic sequence such as 1, 2, 3, 4;
and 400 is the length range of the wild-type supporting fragments and/or the mutant supporting fragments.
In other words, taking 3 as the interval length, within the nucleotide length range of 1-400, the accumulated value of the ratios at each length is calculated according to formula (1), and the set of these accumulated values constitutes the first distribution D (formula (2)).
Then, the length of the effective fragment interval is set to about 1 to about 167 nucleotides and/or about 250 to about 400 nucleotides. In the present application, the length of the effective fragment interval may be the length of a nucleic acid sequence wrapped around a nucleosome. For example, the nucleic acid sequence may wrap around the nucleosome for more than 2 turns, or for less than 1 turn (for example, the length of the effective fragment interval is about 1 to about 167 nucleotides and/or about 250 to about 400 nucleotides).
Within the effective fragment interval, the values of the individual B in the first distribution D (that is, the accumulated values of the ratios) are accumulated again in order to obtain the summed values (see formula (3)).
[Formula (3): image PCTCN2022096125-appb-000004]
For example, if the length of the effective fragment interval is 100 (that is, i is 100), the successive summed values of the individual B in the first distribution D are calculated within the nucleotide length range of 1-100.
The set of summed values constitutes the second distribution A, and the largest summed value in the second distribution is denoted Dev(Max) (see formula (4)).
$$Dev = \max(A) \qquad (4)$$
Example 3 Performing the machine learning described in the present application
(1) Input the features listed in Table 1 into the machine learning model described in the present application for machine learning training.
According to the type each feature belongs to, these features can be divided into 7 types, all of which relate to the mutation site.
Table 1
[Table image PCTCN2022096125-appb-000005]
a) Position information: the chromosomal position of the SNV, for example position 68771372 on chromosome 16.
b) Base substitution pattern: at a single SNV site, the pattern of change from the wild-type base to the newly introduced mutant base. For example, for chr3:178935093 C>A, the base substitution pattern is "CA". This feature is one-hot encoded and covers all 12 theoretically possible substitution patterns: AT, AC, AG, TA, TC, TG, CA, CT, CG, GA, GT, GC.
c) The Dev value obtained in Example 2 (which reflects the fragmentation pattern of the cfDNA); the features $W_{ratio}$ and $M_{ratio}$, which characterize the direction of the shift of the mutant fragments, may also be included.
To show the difference between the two groups directly, $Delta_{ratio}$ may also be used. These three parameters are calculated according to formula (5), formula (6), and formula (7), respectively.
$$W_{ratio} = \frac{C_{l>167}}{C_{l<167}} \qquad (5)$$
$$M_{ratio} = \frac{C_{l>167}}{C_{l<167}} \qquad (6)$$
$$Delta_{ratio} = W_{ratio} - M_{ratio} \qquad (7)$$
Here, 167 may also be any integer from 160 to 174.
In formula (5), $C_{l>167}$ and $C_{l<167}$ denote the number of wild-type supporting fragments longer than 167 nucleotides and the number of wild-type supporting fragments shorter than 167 nucleotides, respectively, and $W_{ratio}$ is the ratio of $C_{l>167}$ to $C_{l<167}$.
In formula (6), $C_{l>167}$ and $C_{l<167}$ denote the number of mutant supporting fragments longer than 167 nucleotides and the number of mutant supporting fragments shorter than 167 nucleotides, respectively, and $M_{ratio}$ is the ratio of $C_{l>167}$ to $C_{l<167}$.
Formula (7) gives the difference between $W_{ratio}$ and $M_{ratio}$.
d) Fragment counts: the number of all unmutated wild-type fragments at a given mutation site, and the number of all fragments supporting the single-base mutation at that site.
e) Allele variation: this type comprises two features, sample frequency and population frequency. Sample frequency is the variant allele frequency (Variant Allele Frequency) of the mutation in a given sample; population frequency (Population Frequency) is the frequency of the mutation in the population.
f) Age: the age associated with the sample in which the mutation occurred.
g) Mutation type: derived from the variant annotation results; this type comprises the following categories:
splice_donor_variant (splice donor mutation)
synonymous_variant (synonymous mutation)
stop_gained (stop codon gained)
intron_variant (intron mutation)
stop_lost (stop codon lost)
missense_variant (missense mutation)
splice_region_variant (splice region mutation)
splice_acceptor_variant (splice acceptor mutation)
promoter_region_variant (promoter region mutation)
start_lost (start codon lost)
After encoding, each feature type is z-transformed, that is, all values are standardized to a mean of 0 and a variance of 1.
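Three of the encodings described in items a) to g) are simple enough to sketch directly: the one-hot substitution pattern, the long/short fragment ratio, and the z-transform. The function names are hypothetical, the ratio follows the prose definition (count of fragments longer than the cutoff divided by count of fragments shorter than it), and returning None when no shorter fragment exists and returning zeros for a constant feature are assumptions not taken from the text.

```python
from statistics import mean, pstdev

SUBSTITUTIONS = ["AT", "AC", "AG", "TA", "TC", "TG",
                 "CA", "CT", "CG", "GA", "GT", "GC"]


def one_hot_substitution(ref, alt):
    """One-hot vector over the 12 substitution patterns, e.g. C>A maps to "CA"."""
    pattern = ref + alt
    if pattern not in SUBSTITUTIONS:
        raise ValueError(f"not a single-base substitution: {pattern}")
    return [1 if pattern == s else 0 for s in SUBSTITUTIONS]


def long_short_ratio(lengths, cutoff=167):
    """Count of fragments longer than cutoff over count of fragments shorter than it."""
    longer = sum(1 for n in lengths if n > cutoff)
    shorter = sum(1 for n in lengths if n < cutoff)
    return longer / shorter if shorter else None  # None is an assumed fallback


def z_scores(values):
    """Standardize a feature column to mean 0 and variance 1."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant feature: assumed fallback
    return [(v - mu) / sigma for v in values]
```

Applying `long_short_ratio` to the wild-type fragment lengths and to the mutant fragment lengths gives the two ratios, and their difference gives the delta feature.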
(2) Model training
Model training used the ensemble module of the Python machine learning library sklearn v0.23.2. Parameter settings: the criterion for measuring split purity was set to "entropy"; the maximum depth of the decision trees was set to None, so that it is determined by the minimum number of samples required to split a node; the minimum number of samples required for a node to be split was 10; and the final result was decided by the vote of 40 decision trees.
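The parameter settings above map naturally onto scikit-learn's `RandomForestClassifier`. Choosing that estimator is an assumption: the text names only the ensemble module, although Example 4 reports results for a random forest. The `random_state` argument and the `build_classifier` name are additions for reproducibility and illustration only.

```python
from sklearn.ensemble import RandomForestClassifier


def build_classifier(random_state=0):
    """Random forest configured with the settings listed in the text."""
    return RandomForestClassifier(
        n_estimators=40,       # final result decided by 40 voting trees
        criterion="entropy",   # split-purity criterion
        max_depth=None,        # depth limited only by the split rule below
        min_samples_split=10,  # minimum samples for a node to be split
        random_state=random_state,  # not from the text; fixes the random seed
    )
```

A model built this way is trained with `clf.fit(X, y)`, where `X` holds the standardized feature rows and `y` the somatic or germline labels, and queried with `clf.predict_proba(X_new)`.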
Example 4 Application of the method described in the present application to a specific tumor
The real data comprised 1309 lung cancer blood samples in total. These samples were divided into a training set of 928 samples and two validation sets of 191 and 190 samples (the training set, validation set 1, and validation set 2 in Figure 1, respectively).
First, following the steps of Examples 1-3, the 12173 germline mutations and 5816 somatic mutations remaining in the training set after population frequency filtering were modeled to obtain the trained machine learning model.
Then, the trained machine learning model was validated on each of the two validation sets (see Figure 1).
During training, 20% of all 17989 mutations were set aside as a test set. Within the remaining 80% training set, internal 5-fold cross-validation was used to select the hyperparameters of each candidate model, and each optimal model was finally evaluated on the 20% test set. The training results of the models are shown in Figure 2. In Figure 2, RF(+Dev) and RF(-Dev) denote the validation results on the two validation sets for machine learning models trained with and without the Dev parameter, respectively.
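The 80/20 hold-out with inner 5-fold cross-validation described above can be sketched with index bookkeeping alone. The text does not say whether the split was stratified by label or how it was seeded, so the plain random shuffle, the fixed seed, and the function name here are assumptions.

```python
import random


def holdout_and_folds(n_items, test_fraction=0.2, n_folds=5, seed=0):
    """Split item indices into a held-out test set and n_folds training folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)

    n_test = round(n_items * test_fraction)
    test = indices[:n_test]
    train = indices[n_test:]

    # Deal the training indices round-robin so fold sizes differ by at most one.
    folds = [train[k::n_folds] for k in range(n_folds)]
    return train, test, folds
```

In each cross-validation round one fold serves as the internal validation set and the other four are used for fitting; the held-out test indices are touched only once, after the hyperparameters are fixed.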
The results show that the random forest performed best among all models, with an AUC of 0.9975. Figures 3-4 show the performance of the trained machine learning model described in the present application on validation set 1 and validation set 2, respectively.
On the two validation sets, the trained machine learning model likewise showed superior performance, with AUCs of 0.9973 and 0.9979, respectively, demonstrating the generalization ability of the method described in the present application.
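The AUC values quoted above measure how often a classifier scores a positive example above a negative one. The sketch below computes that quantity directly from its pairwise definition; it illustrates the metric and is not the evaluation code behind the reported numbers. Coding somatic mutations as 1 and germline mutations as 0 is an assumption.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

The double loop is quadratic in the number of mutations, which is acceptable for illustration; a rank-based formulation gives the same value with a single sort.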
Example 5 Application of the method described in the present application to different tumors
To confirm that the trained machine learning model described in the present application can be applied broadly to germline/somatic discrimination across cancer types, a total of 1008 samples from 11 cancer types were used (see Figure 5 for sample details). After filtering by population frequency and the other criteria, 6647 somatic mutations and 13567 germline mutations were finally included in the evaluation (Figure 5).
Overall, the trained machine learning model described in the present application showed good predictive ability on the mixed multi-cancer test set of 1008 samples, with an AUC of 0.9947 (see Figure 6), where cfSvG is the name of the algorithm developed by the applicant.
In addition, the classification ability of the model was tested for each cancer type. The AUC of the model remained above 0.99 in nearly all of the 11 cancer types. Performance dropped slightly in the bladder cancer data, where the AUC nevertheless reached 0.9886 (see Figure 7 for the AUC results).
The method and/or model described in the present application thus performs well not only in lung cancer but also in classification across cancer types.
The embodiments of the present application have been described in detail above. However, the present application is not limited to the specific details of these embodiments. Within the scope of the technical concept of the present application, various simple modifications can be made to its technical solutions, and all such simple modifications fall within the scope of protection of the present application. It should also be noted that the specific technical features described in the above embodiments can be combined in any suitable manner provided there is no contradiction; to avoid unnecessary repetition, the possible combinations are not described separately. Furthermore, the different embodiments of the present application can also be combined arbitrarily, and such combinations should likewise be regarded as content disclosed in the present application as long as they do not depart from its idea.

Claims (20)

  1. A method for distinguishing somatic mutations from germline mutations, characterized in that it comprises the following steps:
    (1) obtaining at least one mutation site derived from a sample of a subject, wherein the mutation site is obtained by gene sequencing;
    (2) for each mutation site, obtaining wild-type supporting fragments and mutant supporting fragments;
    wherein the wild-type supporting fragments are cfDNA fragments containing a wild-type base sequence,
    the mutant supporting fragments are cfDNA fragments containing a mutant base sequence,
    wherein the wild-type base sequence is a sequence identical to the nucleotide sequence of the reference genome at the position corresponding to the mutation site,
    wherein the mutant base sequence is a sequence different from the nucleotide sequence of the reference genome at the position corresponding to the mutation site,
    wherein the reference genome is the human reference genome used in the gene sequencing;
    (3) for each mutation site, obtaining the number of wild-type supporting fragments of at least one length and the number of mutant supporting fragments of the corresponding same length,
    calculating the ratio WC of the number of wild-type supporting fragments of that length to the total number of wild-type supporting fragments;
    calculating the ratio MC of the number of mutant supporting fragments of the same length to the total number of mutant supporting fragments;
    calculating the difference between the ratio WC and the ratio MC at the same length;
    (4) using the difference, or the set of differences, as an indicator for distinguishing whether the mutation site is a somatic mutation or a germline mutation.
  2. A method for identifying ctDNA in cfDNA, comprising the following steps:
    (1) obtaining at least one mutation site derived from a sample of a subject, wherein the mutation site is obtained by gene sequencing;
    (2) for each mutation site, obtaining wild-type supporting fragments and mutant supporting fragments;
    wherein the wild-type supporting fragments are cfDNA fragments containing a wild-type base sequence,
    the mutant supporting fragments are cfDNA fragments containing a mutant base sequence,
    wherein the wild-type base sequence is a sequence identical to the nucleotide sequence of the reference genome at the position corresponding to the mutation site,
    wherein the mutant base sequence is a sequence different from the nucleotide sequence of the reference genome at the position corresponding to the mutation site,
    wherein the reference genome is the human reference genome used in the gene sequencing;
    (3) for each mutation site, obtaining the number of wild-type supporting fragments of at least one length and the number of mutant supporting fragments of the corresponding same length,
    calculating the ratio WC of the number of wild-type supporting fragments of that length to the total number of wild-type supporting fragments;
    calculating the ratio MC of the number of mutant supporting fragments of the same length to the total number of mutant supporting fragments;
    calculating the difference between the ratio WC and the ratio MC at the same length;
    (4) using the difference, or the set of differences, as an indicator for identifying whether the mutation site is ctDNA.
  3. A method for training a machine learning model, comprising the following steps:
    (1) obtaining at least one mutation site derived from a subject sample, wherein the mutation site is obtained by gene sequencing;
    (2) for each mutation site, obtaining wild-type supporting fragments and mutant supporting fragments,
    wherein the wild-type supporting fragments are cfDNA fragments comprising a wild-type base sequence,
    and the mutant supporting fragments are cfDNA fragments comprising a mutant base sequence,
    wherein the wild-type base sequence is a sequence identical to the nucleotide sequence of the reference genome at the position corresponding to the mutation site,
    wherein the mutant base sequence is a sequence different from the nucleotide sequence of the reference genome at the position corresponding to the mutation site,
    and wherein the reference genome is the human reference genome used in the gene sequencing;
    (3) for each mutation site, obtaining the number of wild-type supporting fragments of at least one length and the number of mutant supporting fragments of the corresponding same length,
    calculating the ratio WC of the number of wild-type supporting fragments of that length to the total number of wild-type supporting fragments;
    calculating the ratio MC of the number of mutant supporting fragments of the same length to the total number of mutant supporting fragments;
    calculating the difference between the ratio WC and the ratio MC at the same length;
    (4) inputting the difference or the set of differences into the machine learning model as a training indicator to perform machine learning training.
  4. A method for establishing a database, comprising the following steps:
    (1) obtaining at least one mutation site derived from a subject sample, wherein the mutation site is obtained by gene sequencing;
    (2) for each mutation site, obtaining wild-type supporting fragments and mutant supporting fragments,
    wherein the wild-type supporting fragments are cfDNA fragments comprising a wild-type base sequence,
    and the mutant supporting fragments are cfDNA fragments comprising a mutant base sequence,
    wherein the wild-type base sequence is a sequence identical to the nucleotide sequence of the reference genome at the position corresponding to the mutation site,
    wherein the mutant base sequence is a sequence different from the nucleotide sequence of the reference genome at the position corresponding to the mutation site,
    and wherein the reference genome is the human reference genome used in the gene sequencing;
    (3) for each mutation site, obtaining the number of wild-type supporting fragments of at least one length and the number of mutant supporting fragments of the corresponding same length,
    calculating the ratio WC of the number of wild-type supporting fragments of that length to the total number of wild-type supporting fragments;
    calculating the ratio MC of the number of mutant supporting fragments of the same length to the total number of mutant supporting fragments;
    calculating the difference between the ratio WC and the ratio MC at the same length;
    (4) storing the difference or the set of differences in a database, so as to distinguish somatic mutations from germline mutations and/or identify ctDNA in cfDNA.
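The storage step (4) of the database claim could look roughly like the following sketch, which uses Python's standard `sqlite3` module. The claim does not name a database engine or schema, so the table name `length_diff`, its columns, the helper functions, and the site identifier `site_1` are all hypothetical.

```python
import sqlite3


def store_differences(conn, site, differences):
    """Persist {fragment_length: difference} for one mutation site."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS length_diff ("
        "site TEXT NOT NULL, "
        "fragment_length INTEGER NOT NULL, "
        "difference REAL NOT NULL, "
        "PRIMARY KEY (site, fragment_length))"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO length_diff VALUES (?, ?, ?)",
        [(site, length, diff) for length, diff in differences.items()],
    )
    conn.commit()


def load_differences(conn, site):
    """Read the stored differences for one site back into a dict."""
    rows = conn.execute(
        "SELECT fragment_length, difference FROM length_diff "
        "WHERE site = ? ORDER BY fragment_length",
        (site,),
    )
    return dict(rows.fetchall())


# In-memory database and toy values, for illustration only.
conn = sqlite3.connect(":memory:")
store_differences(conn, "site_1", {150: -0.5, 166: 0.0})
stored = load_differences(conn, "site_1")
```

Keying rows by site and fragment length keeps one difference per length, so re-running the calculation for a site overwrites its earlier values instead of duplicating them.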
  5. The method according to any one of claims 1-4, wherein the length of the wild-type supporting fragments and/or the mutant supporting fragments ranges from about 1 nucleotide to about 550 nucleotides, or from about 1 nucleotide to about 400 nucleotides, or from about 1 nucleotide to about 200 nucleotides.
  6. The method according to any one of claims 1-4, comprising the following step:
    (4') obtaining the distribution of the differences of step (3), selecting the maximum value in the distribution as Dev(Max), and using Dev(Max) as the distinguishing indicator and/or as the training sample.
  7. The method according to any one of claims 1-4, comprising the following step:
    (4') obtaining the distribution of the differences of step (3), referred to as the first distribution.
  8. The method according to claim 7, comprising the following step:
    (5) within the length range of an effective fragment interval, accumulating each difference in the first distribution in sequence to obtain added values, wherein the length of the effective fragment interval covers the length of the nucleic acid sequence wound around a nucleosome.
  9. The method according to claim 8, wherein the length of the effective fragment interval is about 1 to about 167 nucleotides, and/or about 200 nucleotides or more, for example, about 250 to about 400 nucleotides.
  10. The method according to claim 8, comprising the following step:
    (6) obtaining a second distribution of the added values of step (5), and calculating the maximum of the added values in the second distribution.
  11. The method according to claim 10, wherein the maximum of the added values is taken as Dev(Max), and Dev(Max) is used as the distinguishing indicator and/or as the training sample.
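Steps (5) and (6) amount to a running sum of the differences over the effective fragment interval followed by taking its maximum. The sketch below illustrates that reading under stated assumptions: "in sequence" is taken to mean increasing fragment length, the default interval of 1 to 167 nucleotides is one of the ranges recited in claim 9, and the function name `dev_max` and the sample values are hypothetical.

```python
from itertools import accumulate


def dev_max(differences, interval=(1, 167)):
    """Return the largest running sum of differences inside the interval.

    differences: {fragment_length: difference between WC and MC}
    interval: inclusive (low, high) bounds of the effective fragment interval
    Assumption: differences are accumulated in order of increasing length.
    """
    low, high = interval
    ordered = [differences[n] for n in sorted(differences) if low <= n <= high]
    if not ordered:
        raise ValueError("no differences fall inside the effective interval")
    running_sums = list(accumulate(ordered))  # the "second distribution"
    return max(running_sums)


# Toy first distribution; the 300-nt entry lies outside the interval.
first_distribution = {120: 0.25, 150: 0.125, 166: -0.5, 300: 0.125}
result = dev_max(first_distribution)
```

Here the running sums are 0.25, 0.375 and -0.125, so the sketch returns 0.375. Accumulating before taking the maximum rewards a sustained shift across neighbouring lengths over an isolated spike at a single length.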
  12. The method according to claim 1, wherein the difference is smoothed, and the smoothing comprises the following steps:
    (a) determining a smoothing window value, wherein the smoothing window value is an integer from about 1 to 10;
    (b) determining a number of smoothing sampling length ranges whose span equals the smoothing window value, wherein the minimum value of each smoothing sampling length range is its starting length,
    and wherein the starting length falls within the range of lengths of the wild-type supporting fragments and/or the mutant supporting fragments;
    (c) for any one smoothing sampling length range, obtaining the number of wild-type supporting fragments of at least one smoothing sampling length and the number of mutant supporting fragments of the corresponding same length; calculating the ratio WC of the number of wild-type supporting fragments of that length to the total number of wild-type supporting fragments; calculating the ratio MC of the number of mutant supporting fragments of the same length to the total number of mutant supporting fragments;
    calculating the difference between the ratio WC and the ratio MC at the same length;
    (d) calculating the average difference for the smoothing sampling length range from the differences of the at least one smoothing sampling length;
    (e) using the resulting average difference as the representative value of that smoothing sampling length range.
  13. The method according to claim 12, wherein the smoothing window value is an integer from about 2 to 6, for example, the smoothing window value is 3.
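The smoothing of steps (a) to (e) reads like a moving average over consecutive fragment lengths. The following sketch shows one such interpretation; it assumes that sampling ranges start at every integer length between the shortest and longest observed fragment, that each range covers `window` consecutive lengths, and that a length with no fragments has a difference of zero. The function name is hypothetical.

```python
def smooth_differences(differences, window=3):
    """Average the differences over each run of `window` consecutive lengths.

    differences: {fragment_length: difference between WC and MC}
    Returns {starting_length: average difference for that sampling range}.
    Assumption: unobserved lengths count as a difference of 0.0.
    """
    if window < 1:
        raise ValueError("window must be a positive integer")
    shortest, longest = min(differences), max(differences)
    smoothed = {}
    for start in range(shortest, longest - window + 2):
        values = [differences.get(n, 0.0) for n in range(start, start + window)]
        smoothed[start] = sum(values) / window
    return smoothed


# Toy differences for four adjacent lengths, smoothed with a window of 3.
raw = {100: 0.75, 101: 0.0, 102: -0.75, 103: 0.375}
smoothed = smooth_differences(raw, window=3)
```

With a window of 3, the range starting at 100 averages to 0.0 and the range starting at 101 averages to -0.125, which damps single-length noise before any accumulation step.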
  14. The method according to claim 12, wherein the smoothing comprises the following step:
    (f) obtaining a first distribution of the average differences of step (e).
  15. The method according to claim 14, wherein the smoothing comprises the following step:
    (g) within the length range of an effective fragment interval, accumulating each average difference in the first distribution in sequence to obtain added values,
    wherein the length of the effective fragment interval is the length of the nucleic acid sequence wound around a nucleosome.
  16. The method according to claim 15, wherein the smoothing comprises the following step:
    (h) obtaining a second distribution of the added values of step (g), and calculating the maximum of the added values in the second distribution.
  17. The method according to claim 16, wherein the maximum is taken as Dev(Max), and Dev(Max) is used as the distinguishing indicator and/or as the training sample.
  18. A device for distinguishing somatic mutations from germline mutations, comprising:
    a calculation module, configured to calculate the difference between the ratio WC and the ratio MC at the same length;
    wherein, for each mutation site, the calculation is based on the number of wild-type supporting fragments of at least one length and the number of mutant supporting fragments of the corresponding same length;
    wherein the ratio WC is the ratio of the number of wild-type supporting fragments of one length to the total number of wild-type supporting fragments;
    wherein the ratio MC is the ratio of the number of mutant supporting fragments of the corresponding same length to the total number of mutant supporting fragments;
    wherein the wild-type supporting fragments are cfDNA fragments comprising a wild-type base sequence, and the mutant supporting fragments are cfDNA fragments comprising a mutant base sequence,
    wherein the wild-type base sequence is a sequence identical to the nucleotide sequence of the reference genome at the position corresponding to the mutation site,
    wherein the mutant base sequence is a sequence different from the nucleotide sequence of the reference genome at the position corresponding to the mutation site,
    wherein the reference genome is the human reference genome used in the gene sequencing;
    and wherein the mutation site is derived from a subject sample and is obtained by gene sequencing; and
    a judgment module, configured to obtain a result identifying the somatic mutation according to a machine learning model that has undergone machine learning training,
    wherein the machine learning training comprises inputting the difference into the machine learning model as a training sample to perform machine learning training.
  19. A device for identifying ctDNA in cfDNA, comprising:
    a calculation module, configured to calculate the difference between the ratio WC and the ratio MC at the same length;
    wherein, for each mutation site, the calculation is based on the number of wild-type supporting fragments of at least one length and the number of mutant supporting fragments of the corresponding same length;
    wherein the ratio WC is the ratio of the number of wild-type supporting fragments of one length to the total number of wild-type supporting fragments;
    wherein the ratio MC is the ratio of the number of mutant supporting fragments of the corresponding same length to the total number of mutant supporting fragments;
    wherein the wild-type supporting fragments are cfDNA fragments comprising a wild-type base sequence, and the mutant supporting fragments are cfDNA fragments comprising a mutant base sequence,
    wherein the wild-type base sequence is a sequence identical to the nucleotide sequence of the reference genome at the position corresponding to the mutation site,
    wherein the mutant base sequence is a sequence different from the nucleotide sequence of the reference genome at the position corresponding to the mutation site,
    wherein the reference genome is the human reference genome used in the gene sequencing;
    and wherein the mutation site is derived from a subject sample and is obtained by gene sequencing; and
    a judgment module, configured to obtain a result of identifying ctDNA in the cfDNA according to a machine learning model that has undergone machine learning training,
    wherein the machine learning training comprises inputting the difference into the machine learning model as a training sample to perform machine learning training.
  20. A device for training a machine learning model, comprising:
    a calculation module, configured to calculate the difference between the number of wild-type supporting fragments and the number of mutant supporting fragments of the same length;
    wherein the number of wild-type supporting fragments comprises, for each mutation site, the number of wild-type supporting fragments of at least one length, and the number of mutant supporting fragments comprises the number of mutant supporting fragments of the corresponding same length,
    wherein the wild-type supporting fragments are cfDNA fragments comprising a wild-type base sequence, and the mutant supporting fragments are cfDNA fragments comprising a mutant base sequence,
    wherein the wild-type base sequence is a sequence identical to the nucleotide sequence of the reference genome at the position corresponding to the mutation site,
    wherein the mutant base sequence is a sequence different from the nucleotide sequence of the reference genome at the position corresponding to the mutation site,
    wherein the reference genome is the human reference genome used in the gene sequencing;
    and wherein the mutation site is derived from a subject sample and is obtained by gene sequencing; and
    a training module, configured to input the difference into the machine learning model as a training sample to perform machine learning training.
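The claims leave the type of machine learning model open. Purely as an illustration of how a training module might consume a difference-derived feature, the sketch below fits a one-feature threshold classifier (a decision stump). The choice of model, the use of a single Dev(Max)-like value per site, the label encoding (1 = somatic, 0 = germline), and the toy numbers are all assumptions and are not taken from the application.

```python
def train_stump(features, labels):
    """Fit a one-feature threshold classifier by exhaustive search.

    features: one numeric value per mutation site (e.g. a Dev(Max)-like score)
    labels:   1 for somatic, 0 for germline (hypothetical encoding)
    Returns (threshold, sign); sign = 1 predicts 1 when x >= threshold,
    sign = -1 predicts 1 when x <= threshold.
    """
    best = None
    for threshold in sorted(set(features)):
        for sign in (1, -1):
            predictions = [1 if sign * (x - threshold) >= 0 else 0 for x in features]
            accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
            if best is None or accuracy > best[0]:
                best = (accuracy, threshold, sign)
    return best[1], best[2]


def predict(model, x):
    """Apply a fitted stump to one feature value."""
    threshold, sign = model
    return 1 if sign * (x - threshold) >= 0 else 0


# Made-up training values, for illustration only.
train_x = [0.05, 0.1, 0.4, 0.5]
train_y = [0, 0, 1, 1]
model = train_stump(train_x, train_y)
```

On this toy data the search settles on a threshold of 0.4 with the "greater than or equal" direction. In practice the set of per-length differences could be supplied as a feature vector to any standard classifier instead.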
PCT/CN2022/096125 2021-06-18 2022-05-31 Method for distinguishing somatic mutation and germline mutation WO2022262569A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110679099.2 2021-06-18
CN202110679099 2021-06-18

Publications (1)

Publication Number Publication Date
WO2022262569A1 true WO2022262569A1 (en) 2022-12-22

Family

ID=84464021

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/096125 WO2022262569A1 (en) 2021-06-18 2022-05-31 Method for distinguishing somatic mutation and germline mutation

Country Status (2)

Country Link
CN (1) CN115497556A (en)
WO (1) WO2022262569A1 (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160032396A1 (en) * 2013-03-15 2016-02-04 The Board Of Trustees Of The Leland Stanford Junior University Identification and Use of Circulating Nucleic Acid Tumor Markers
US20170058332A1 (en) * 2015-09-02 2017-03-02 Guardant Health, Inc. Identification of somatic mutations versus germline variants for cell-free dna variant calling applications
CN110914450A (en) * 2017-05-16 2020-03-24 夸登特健康公司 Identification of somatic or germline sources of cell-free DNA
CN111278993A (en) * 2017-09-15 2020-06-12 加利福尼亚大学董事会 Somatic cell mononucleotide variants from cell-free nucleic acids and applications for minimal residual lesion monitoring
CN111357054A (en) * 2017-09-20 2020-06-30 夸登特健康公司 Methods and systems for differentiating between somatic and germline variations
CN112752854A (en) * 2018-07-23 2021-05-04 夸登特健康公司 Methods and systems for modulating tumor mutational burden by tumor score and coverage
CN113278706A (en) * 2021-07-23 2021-08-20 广州燃石医学检验所有限公司 Method for distinguishing somatic mutation from germline mutation

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CAI ZHENGHAO, WANG ZHENXIN, LIU CHENGLIN, SHI DONGTAO, LI DAPENG, ZHENG MINHUA, HAN-ZHANG HAN, LIZASO ANALYN, XIANG JIANXING, LV J: "Detection of Microsatellite Instability from Circulating Tumor DNA by Targeted Deep Sequencing", THE JOURNAL OF MOLECULAR DIAGNOSTICS, vol. 22, no. 7, 1 July 2020 (2020-07-01), pages 860 - 870, XP093015742, ISSN: 1525-1578, DOI: 10.1016/j.jmoldx.2020.04.210 *
VAN DER POL YMKE; MOULIERE FLORENT: "Toward the Early Detection of Cancer by Decoding the Epigenetic and Environmental Fingerprints of Cell-Free DNA", CANCER CELL, vol. 36, no. 4, 14 October 2019 (2019-10-14), US , pages 350 - 368, XP085861188, ISSN: 1535-6108, DOI: 10.1016/j.ccell.2019.09.003 *

Also Published As

Publication number Publication date
CN115497556A (en) 2022-12-20

Similar Documents

Publication Publication Date Title
US11581063B2 (en) Analysis of fragmentation patterns of cell-free DNA
Ding et al. Expanding the computational toolbox for mining cancer genomes
Hasan et al. Performance evaluation of indel calling tools using real short-read data
JP6987786B2 (en) Detection and diagnosis of cancer evolution
US9115401B2 (en) Partition defined detection methods
JP6680680B2 (en) Methods and processes for non-invasive assessment of chromosomal alterations
CN109906276A (en) For detecting the recognition methods of somatic mutation feature in early-stage cancer
US11043283B1 (en) Systems and methods for automating RNA expression calls in a cancer prediction pipeline
US11581062B2 (en) Systems and methods for classifying patients with respect to multiple cancer classes
US20210065842A1 (en) Systems and methods for determining tumor fraction
US11869661B2 (en) Systems and methods for determining whether a subject has a cancer condition using transfer learning
CN110168648A (en) The verification method and system of sequence variations identification
WO2023115662A1 (en) Method for detecting variant nucleic acids
CA3204451A1 (en) Systems and methods for joint low-coverage whole genome sequencing and whole exome sequencing inference of copy number variation for clinical diagnostics
US20210358626A1 (en) Systems and methods for cancer condition determination using autoencoders
CN113278706B (en) Method for distinguishing somatic mutation from germline mutation
EP4127232A1 (en) Cancer classification with synthetic spiked-in training samples
WO2018081465A1 (en) Systems and methods for characterizing nucleic acid in a biological sample
WO2022262569A1 (en) Method for distinguishing somatic mutation and germline mutation
JP2022527316A (en) Stratification of virus-related cancer risk
WO2024027591A1 (en) Multi-cancer methylation detection kit and use thereof
Yin et al. LiBis: an ultrasensitive alignment augmentation for low-input bisulfite sequencing
Akbar et al. Unlocking Esophageal Carcinoma’s Secrets: An integrated Omics Approach Unveils DNA Methylation as a pivotal Early Detection Biomarker with Clinical Implications.
Chieruzzi Identification of RAS co-occurrent mutations in colorectal cancer patients: workflow assessment and enhancement
CN117672507A (en) Cancer recurrence risk assessment method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22824049

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE