WO2022050654A1 - Method for increasing ratio of intrinsic fragment used in NGS analysis for detecting low-frequency mutation of cfDNA

Info

Publication number
WO2022050654A1
Authority
WO
WIPO (PCT)
Prior art keywords
fragments
cfdna
ngs
dna
analysis
Prior art date
Application number
PCT/KR2021/011654
Other languages
French (fr)
Korean (ko)
Inventor
허성훈
이동인
방두희
노한성
김황필
문성태
Original Assignee
주식회사 아이엠비디엑스
연세대학교 산학협력단
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 주식회사 아이엠비디엑스, 연세대학교 산학협력단
Publication of WO2022050654A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B35/00 ICT specially adapted for in silico combinatorial libraries of nucleic acids, proteins or peptides
    • G16B35/10 Design of libraries
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis

Definitions

  • the present application relates to a technology for analyzing genetic variation of cfDNA using NGS (Next Generation Sequencing).
  • cfDNA cell-free DNA
  • ctDNA circulating tumor DNA
  • This ctDNA contains genetic mutations related to specific cancers, and monitoring these mutations makes possible the early detection of cancer before lesions occur, analysis of responses to specific cancer treatments, discovery of the mechanisms by which resistance to anticancer drugs arises, and confirmation of the presence of residual cancer.
  • ddPCR droplet digital PCR
  • NGS Targeted next-generation sequencing
  • This NGS method mainly uses cfDNA, but the amount of ctDNA contained in it is very limited: ctDNA accounts for only about 0.1-10% of cfDNA. Furthermore, to obtain statistically significant results from NGS sequencing, a minimum read depth of 10 X is required in consideration of error; expressed in genome equivalents, this requires a minimum of 6 ng of DNA. In NGS analysis, information is lost at each experimental stage, and the fraction of DNA information finally obtained (the conversion rate) is 30%. Therefore, 20 ng of DNA is required to obtain the amount of information corresponding to 6 ng of ctDNA in NGS analysis, but the DNA available for NGS testing in clinical practice is very limited.
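The input requirement quoted above is a one-line calculation. The following is a minimal sketch, assuming the 6 ng target and the 30% conversion rate stated in the text; the function name `required_input_ng` is illustrative and does not come from the application.

```python
def required_input_ng(target_ng: float, conversion_rate: float) -> float:
    """Input DNA needed so that `target_ng` worth of information survives library preparation."""
    if not 0 < conversion_rate <= 1:
        raise ValueError("conversion_rate must be in (0, 1]")
    return target_ng / conversion_rate


# 6 ng of usable information at a 30% conversion rate needs about 20 ng of input DNA.
print(round(required_input_ng(6.0, 0.30), 1))
```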
  • An object of the present application is to provide a method capable of increasing the amount of available ctDNA fragment information that can be used for NGS analysis in an NGS test using a limited amount of cfDNA to detect low-frequency gene mutations.
  • The present application provides a method for improving the ratio of unique fragments used in NGS analysis in the detection of low-frequency mutations in cfDNA (cell-free DNA) using NGS analysis.
  • The method comprises the steps of: (a) providing a specific amount of a cfDNA sample; (b) calculating, among the genome fragments included in the specific amount of cfDNA, the total number of genome fragments with a length of 51 to 330 bp and the number of unique fragments, wherein the number of unique fragments is the value obtained by subtracting, from the total number of genome fragments, the sum of the collision counts of the genome fragments of each length from 51 to 330 bp, and the sum of collision counts is calculated from [Equation 1] disclosed herein; (c) calculating the ratio of the number of unique fragments to the total number of genome fragments, and dividing the specific amount of the cfDNA sample into a plurality of aliquots to increase the ratio of unique fragments; and (d) preparing a library by tagging each of the plurality of aliquots with adapters having a different index, performing NGS analysis, and then integrating the NGS results of each aliquot.
  • the detection of low-frequency mutations means detection of mutations in ctDNA included in cfDNA.
  • it is intended to detect a genetic mutation in the ctDNA of cancer cells that is present at a low frequency of less than about 1% among DNA genome fragments included in cfDNA.
  • In step (d), the sample is divided into two or more aliquots so that the ratio of unique fragments is at least about 93%, or so that the ratio of events with coincidentally identical sequences (expressed as collisions) is at or below a certain level; the DNA concentration per aliquot is thereby lowered to make an NGS library, and the NGS data generated from each aliquot are combined and analyzed in the analysis step.
  • The specific amount of cfDNA is 20 ng, in which case the cfDNA is divided into 4 aliquots such that the ratio of unique fragments is about 93.9%.
  • the present application provides a library preparation method for improving the ratio of unique fragments used in the NGS analysis in the detection of low-frequency mutations in cfDNA using NGS analysis.
  • The method comprises the steps of: (a) providing a specific amount of a cfDNA sample; (b) calculating, among the genome fragments included in the specific amount of cfDNA, the total number of genome fragments with a length of 51 bp to 330 bp and the number of unique fragments, wherein the number of unique fragments is the value obtained by subtracting, from the total number of genome fragments, the sum of the collision counts of the genome fragments of each length from 51 to 330 bp, and the collision count is calculated from Equation 1 disclosed herein; (c) calculating, from step (b), the ratio of the number of unique fragments to the total number of genome fragments, and dividing the specific amount of the cfDNA sample into a plurality of aliquots to increase the ratio of unique fragments; and (d) preparing a library for NGS by tagging each of the plurality of aliquots with an adapter including a different index.
  • it is intended to detect a genetic mutation in the ctDNA of cancer cells that is present at a low frequency of less than about 1% among DNA genome fragments included in cfDNA.
  • In step (d), the sample is divided into two or more aliquots so that the ratio of unique fragments is at least about 93%, or so that the ratio of events with coincidentally identical sequences (expressed as collisions) is at or below a certain level; the DNA concentration per aliquot is thereby lowered to make an NGS library, and the NGS data generated from each aliquot are combined and analyzed in the analysis step.
  • The specific amount of cfDNA is 20 ng, in which case the cfDNA is divided into 4 aliquots such that the ratio of unique fragments is about 93.9%.
  • The method according to the present application improves the read depth by increasing the amount of available ctDNA information in a situation where the amount of cfDNA that can be obtained from blood is limited in actual medical settings, thereby improving the performance of ctDNA testing using next-generation sequencing (NGS).
  • NGS next-generation sequencing
  • In the process by which the genomic DNA of cancer cells is cut, fragments with the same sequence are generated by accidental cuts at the same site even though they are derived from different cells. Because it is difficult to distinguish them from fragments amplified by PCR, data are lost in the NGS analysis process.
  • In ctDNA, the average length is shorter than that of normal DNA, so this event has a higher probability. This makes it particularly difficult to detect genetic mutations found in ctDNA present in small amounts in cfDNA.
  • The method according to the present application divides the sample into two or more aliquots so that the ratio of events with coincidentally identical sequences (expressed as collisions) is below a certain level, lowers the DNA concentration per aliquot to create an NGS library, and combines and analyzes the NGS data generated from each aliquot.
  • Most unique DNA fragments can thus be distinguished, so low-frequency gene mutations, for example in ctDNA present in a very small amount in cfDNA, can be detected better than with general NGS.
  • FIG. 1 schematically shows a method according to an embodiment of the present application.
  • FIG. 2a is a graph showing the number of fragments (fragment count) according to the length of the genomic DNA fragments present in actual cfDNA; it shows a distribution with two peaks, identified at 166 bp and 315 bp, respectively.
  • the distribution of DNA fragments can be regarded as the probability of the appearance of DNA fragments by calculating the proportion in the whole.
  • FIG. 2b shows, for the case where there are 6,600 DNA fragments at a specific locus of the genome (expressed as 6,600 X depth), the number of DNA fragments of each length, calculated from the length probabilities of FIG. 2a, together with the possible collisions calculated for that length. As in FIG. 2a, fragments are most abundant at a length of 166 bp, where the proportion of colliding fragments is very high at 40.3%, and most of the cumulative collision fraction occurs at lengths between 100 and 200 bp. These collisions represent data that are not used in typical NGS analysis and are discarded.
  • FIG. 2c is a graph showing the ratio of unique fragments according to the amount of starting DNA. At 20 ng, 21.4% of the fragments were classified as duplicates, which cannot be used in normal NGS analysis.
  • FIG. 3 shows that, after dividing 20 ng of the starting DNA into 2, 4 and 8 aliquots according to the method of an embodiment of the present application, the FMD (Fragment Mean Depth) increases with the number of aliquots.
  • FMD Fragment Mean Depth
  • FIG. 5 shows the amount of DNA amplified in the pre-PCR step, which is the library preparation step, for each aliquot after dividing 20 ng of the starting DNA into 2, 3 and 8 aliquots according to the method of an embodiment of the present application. As the number of aliquots increases, the amount of DNA finally obtainable increases, but it was found to saturate when the DNA was divided into 4 aliquots.
  • VAF Variant allele frequency
  • In the cfDNA pool of a specific sample, fragments that are derived from different cells but were cut by chance at the same site cannot be distinguished from those amplified by PCR, so data are lost in the NGS analysis process.
  • A library is prepared by dividing a specific amount of cfDNA into a plurality of aliquots in a manner that minimizes the collision count.
  • cfDNA cell-free DNA, cfDNA
  • cfDNA includes genomic fragments of various lengths present in blood; the chromatin portion that is not protected by histone proteins is mainly cleaved, and the length distribution has its mode at 166 bp.
  • cfDNA is mostly DNA released from haematopoietic cells in healthy people, and includes ctDNA derived from cancer cells in cancer patients as described below due to cancer cell death.
  • cfDNA can be extracted from blood; reagents and kits for extracting it are commercially available, and the method is known (Clara Perez-Barrios et al. Transl Lung Cancer Res 5 (2016)).
  • Circulating tumor DNA (ctDNA) is a genomic fragment derived from cancer cells and is included in cfDNA. It accounts for only about 0.1-10% of total cfDNA. Due to the rapid self-replication of cancer cells, ctDNA has fewer histone-protected sites and is consequently shorter than cfDNA derived from healthy cells (et al. Sci Transl Med 10, (2016)). ctDNA contains genetic mutations related to specific cancers, and monitoring these mutations using blood can be usefully applied to the early detection of cancer before lesions occur, analysis of responses to specific cancer treatments, discovery of the mechanisms by which resistance to anticancer drugs arises, and confirmation of the presence of residual cancer. In one embodiment according to the present application, it is intended to detect a genetic mutation in ctDNA derived from cancer cells that is present at a low frequency of less than about 1% among the DNA genome fragments included in cfDNA.
  • NGS Next Generation Sequencing
  • NGS requires library preparation, which includes adding indexes, molecular barcodes and the like to fragments and amplifying them, and a data analysis process, which includes alignment of the resulting raw data and, through mapping to a reference nucleotide sequence, error handling and derivation of the nucleotide sequence.
  • Next-generation sequencing can be performed on a variety of analysis platforms depending on the purpose.
  • Analysis platforms for next-generation sequencing include Illumina NextSeq, Illumina NovaSeq, ThermoFisher Ion Proton, Pacific Biosciences Sequel II, BGI MGI, etc., and the library preparation kits and methods used for each platform can be obtained from the platform manufacturer.
  • Such NGS has the following essential problems due to its characteristics.
  • the error varies depending on the experimental method and the NGS platform.
  • Illumina Inc.'s equipment has an average error rate of 0.1 to 1% per nucleotide.
  • LOD limit of detection
  • AF Allele Frequency
  • traditional NGS experiments cannot distinguish true mutations from errors.
  • the NGS experiment always includes a DNA amplification step using PCR no matter which method is used.
  • the amplification efficiency of each DNA fragment is different due to various factors such as DNA GC content and DNA length, so it is not possible to obtain a result in which all fragments are amplified to a uniform degree. Therefore, this effect is compensated for by eliminating duplicates (including both PCR duplicates and collisions with duplicates amplified from one DNA) in the analysis step.
  • The Picard tool is usually used for this, keeping the read (the sequence read by NGS) that matches the reference genome and excluding the remaining duplicates. If a random nucleotide error occurs during duplication, the sequences appear as different reads even though they were duplicated from one DNA molecule. In traditional NGS, these reads are largely ignored and only the read closest to the reference genome is used for analysis.
  • Because of its short length, ctDNA derived from different cells often happens to have exactly the same sequence. Therefore, the mutated DNA may be ignored.
  • barcode sequence or UMI unique molecular identifier
  • Sequences may appear identical in the NGS result because genomic DNA released from different cells is, by accident, cut at the same site during the cleavage process.
  • Because DNA is amplified by PCR during library preparation in the NGS experiment and finally appears as duplicate sequences, sequences that are only coincidentally identical are also removed as redundant sequences in the general analysis process.
  • ctDNA contains a specific gene mutation in cancer cells, so it is important to detect it.
  • ctDNA is contained in a very small amount in cfDNA, and there is a problem in that information is lost during the redundant sequence removal process and thus cannot be detected.
  • the method according to the present application minimizes the sequence lost due to error processing in the NGS analysis process to detect cancer-related low frequency mutations, for example, less than 1% mutations, present in ctDNA contained in very small amounts in cfDNA.
  • the present application relates to a method for improving the ratio of unique fragments used in NGS analysis in the detection of low-frequency mutations in cfDNA (cell free DNA) using NGS (Next Generation Sequencing) analysis.
  • The method comprises the steps of: (a) providing a specific amount of a cfDNA sample; (b) calculating, among the genome fragments included in the specific amount of cfDNA, the total number of genome fragments with a length of 51 to 330 bp and the number of unique fragments, wherein the number of unique fragments is the value obtained by subtracting, from the total number of genome fragments, the sum of the collision counts of the genome fragments of each length from 51 to 330 bp, and the sum of collision counts is calculated from the following equation,
  • where q(k-1; d) is the probability that there is a number equal to k among n numbers in the range [1, d]; k: a specific number; d: the range of numbers; n: the number of numbers.
  • The method according to the present disclosure is characterized in that a sample is divided into two or more aliquots to create an NGS library. Since the NGS data generated from each aliquot can be combined and analyzed in the analysis step to obtain sequence information from a larger amount of ctDNA than general NGS, it enables accurate detection of gene mutations with low VAF (Variant Allele Frequency) in ctDNA.
  • VAF Variant Allele Frequency
  • Available ctDNA data herein refers to NGS data that can be used to detect mutations in DNA derived from mutated tumor cells in a state in which DNAs derived from normal cells and tumor cells are mixed and cannot be distinguished from each other.
  • a specific amount of DNA in the method according to the present application means an amount of cfDNA extracted from a blood sample, usually obtainable from blood taken from a patient, and generally allocated for NGS analysis.
  • the average amount of DNA in blood is about 4.4 ng/ml (Raymond, CK, Hernandez, J., Karr, R., Hill, K. & Li, M. Collection of cell-free DNA for genomic analysis of solid tumors in a clinical laboratory setting. PLoS One 12, (2017).).
  • When about 5 ml of blood is collected from one patient for NGS analysis, about 20 ng of cfDNA can be obtained, but the specific amount may vary depending on the carcinoma or the progress of the cancer.
  • the method according to the present application includes obtaining the number of unique fragments among the fragments by using the number of genomic fragments contained in a specific amount of cfDNA and a collision count.
  • the number of total genome fragments corresponding to a length of 51 to 330 bp among the genome fragments included in a specific amount of cfDNA and the number of unique fragments are calculated.
  • the cfDNA genome fragment has a distribution with two peaks with a length of 5 bp to 991 bp.
  • The peaks have their modes at 166 bp and 315 bp, respectively, with the first peak at 166 bp being 16 times higher than the second peak at 315 bp.
  • Fragments of less than 50 bp have lost information through excessive cleavage, and fragments exceeding 330 bp are not useful for testing because most of them are not ctDNA.
  • ctDNA information is contained in a fragment of 51 to 330 bp, and in one embodiment according to the present application, a fragment having a length of 51 to 330 bp in cfDNA including genomic fragments of various lengths is used for analysis.
  • The number of genomic fragments contained in cfDNA corresponds to the number of molecules of the fragment; for human DNA, 1 ng usually corresponds to 330 genome equivalents (a genome equivalent is the amount of DNA present in all genes in one cell; this number depends on the genome size of the specific organism and is calculated by converting genome base pairs into µg of DNA).
  • The length of the genomic fragments of cfDNA used is about 51 bp to 330 bp; for example, the number of fragments having a length of 51-330 bp in 20 ng of cfDNA is 6,173, with reference to Tables 1 and 2.
  • Using the collision count, the number of fragments that have identical sequences by chance is calculated for a specific amount of cfDNA containing genomic fragments of various lengths; subtracting this number from the total number of fragments gives the number of unique fragments. A case in which DNA sequences are coincidentally identical in a specific DNA sample containing genomic fragments of various lengths is called a collision, and the number of DNA fragments with identical sequences can be counted by the collision counting method.
  • The collision count is based on the birthday paradox, that is, the probability that identical individuals occur by chance in a group of a certain size.
  • It can be expressed as Equation 1 as follows (Might, Matt. "Collision hash collisions with the birthday paradox". Matt Might's blog. Retrieved 17 July 2015).
  • q(k-1; d): the probability that there is a number equal to k among n numbers in the range [1, d]; k: a specific number; d: the range of numbers; n: the number of numbers.
  • For example, the length distribution probability is 0.02874, which corresponds to 189.687 fragments out of 6,600, and the collision count in this case is calculated to be 76.4513.
  • Using Equation 1, the collision counts for each fragment length from 51 to 330 bp are calculated and sum to 1,319, which corresponds to 21.4% of 6,173 (see Table 2).
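Equation 1 itself did not survive on this page, so the sketch below is an assumption, not the application's formula. It uses the expected-collision form from the birthday-paradox literature cited above, n - d + d((d - 1)/d)^n, and takes d equal to the fragment length, as the "number of cases of different positions" column of Table 1 suggests. With n = 189.687 and d = 166 it reproduces the worked value of about 76.45 (40.3% of 189.687, matching the proportion quoted for FIG. 2b). The function names are illustrative.

```python
def expected_collisions(n: float, d: int) -> float:
    """Expected number of collisions when n items fall uniformly into d slots.

    Assumed form (birthday paradox): n - d + d * ((d - 1) / d) ** n.
    Clamped at zero because the expression dips slightly below zero for n < 1.
    """
    return max(0.0, n - d + d * ((d - 1) / d) ** n)


def unique_fraction(fragments_by_length: dict) -> float:
    """Share of fragments left after subtracting the per-length collision counts.

    `fragments_by_length` maps fragment length (bp) to the expected number of
    fragments of that length; d is assumed to equal the length.
    """
    total = sum(fragments_by_length.values())
    collisions = sum(expected_collisions(n, length)
                     for length, n in fragments_by_length.items())
    return (total - collisions) / total


# Worked example from the text: 189.687 fragments, assumed length 166 bp.
print(round(expected_collisions(189.687, 166), 2))  # 76.45
```

Summing `expected_collisions` over the full 51-330 bp length distribution would be the step that yields the 1,319 collisions quoted above; that distribution is only partly reproduced here, so the total is not recomputed.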
  • By its nature, cfDNA derived from different cells often happens to have exactly the same sequence. Therefore, the mutated DNA may be ignored, resulting in loss of available ctDNA information.
  • the NGS test is performed by calculating the probability of occurrence of the same DNA by chance, and dividing the sample in a method that minimizes this probability.
  • Fragment length | Fragment existence probability | Number of cases of different positions among fragments of the corresponding length | Number of fragments with the corresponding length in 20 ng (6,600 fragments) | 20 ng collision count | Percentage of collision count
    51 | 0.000016 | 51 | 0.11 | 0 | 0.0%
    52 | 0.000021 | 52 | 0.14 | 0 | 0.0%
    53 | 0.000019 | 53 | 0.12 | 0 | 0.0%
    54 | 0.000022 | 54 | 0.15 | 0 | 0.0%
    55 | 0.000023 | 55 | 0.15 | 0 | 0.0%
    56 | 0.000031 | 56 | 0.21 | 0 | 0.0%
    57 | 0.000032 | 57 | 0.21 | 0 | 0.0%
    58 | 0.000038 | 58 | 0.25 | 0 | 0.0%
    59 | 0.000043 | 59 | 0.29 | 0 | 0.0%
    60 | 0.000046 | 60 | 0.30 | 0 | 0.0%
    61 | 0.000053 | 61 | 0.35 | 0 | 0.0%
    62 | 0.000048 | 62 | 0.32 | 0 | 0.0%
    63 | 0.000046 | 63 | 0.31 | 0 | 0.0%
    64 | 0.000043 | 64 | 0.28 | 0 | 0.0%
    65 | 0.000050 | 65 | 0.33 | 0 | 0.0%
    66 | 0.000053 | 66 | 0.35
  • Input DNA | Fragments (genome equivalents) | Fragments in 51-330 bp range | Collision counts | Collision fragment ratio | Unique fragments | Unique fragments ratio
    1 ng | 330 | 309 | 3 | 0.011 | 305 | 0.989
    5 ng | 1,650 | 1,543 | 95 | 0.061 | 1,448 | 0.939
    10 ng | 3,300 | 3,087 | 365 | 0.118 | 2,721 | 0.882
    20 ng | 6,600 | 6,173 | 1,319 | 0.214 | 4,855 | 0.786
    40 ng | 13,200 | 12,346 | 4,338 | 0.351 | 8,008 | 0.649
    100 ng | 33,000 | 30,866 | 17,327 | 0.561 | 13,539 | 0.439
    150 ng | 49,500 | 46,299 | 29,825 | 0.644 | 16,474 | 0.356
    200 ng | 66,000 | 61,732 | 42,929 | 0.695 | 18,803 | 0.305
  • The cfDNA sample is divided into a plurality of aliquots to increase the ratio of unique fragments, and NGS analysis is performed for each aliquot.
  • When the NGS library is prepared by dividing the sample into four, 93.9% of the DNA fragments can be used for analysis, that is, 15.2% more DNA fragment sequences than the 78.6% available without division.
  • A library is prepared by appropriately dividing a specific amount of starting DNA so that the ratio of unique fragments is at least 93%.
  • The amount of cfDNA that can be obtained by taking blood from a patient in clinical practice is 20 ng; when the specific amount of cfDNA is 20 ng, the cfDNA is divided into 4 aliquots so that the ratio of unique fragments is 93.9%.
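The choice of four aliquots for 20 ng can be read straight off Table 2: each aliquot then holds 5 ng, the first tabulated amount whose unique-fragment ratio reaches 93%. The following is a minimal sketch of that lookup, assuming only the tabulated amounts are usable; the 0.93 threshold comes from the text, while the function name and the `max_aliquots` cap are hypothetical.

```python
# Unique-fragment ratio by input amount in ng, copied from Table 2.
UNIQUE_RATIO_BY_NG = {
    1: 0.989, 5: 0.939, 10: 0.882, 20: 0.786,
    40: 0.649, 100: 0.439, 150: 0.356, 200: 0.305,
}


def smallest_aliquot_count(total_ng, min_unique_ratio=0.93, max_aliquots=8):
    """Return the fewest equal aliquots whose per-aliquot amount meets the ratio.

    Only per-aliquot amounts listed in Table 2 are considered; returns None
    when no tabulated split reaches the threshold.
    """
    for k in range(1, max_aliquots + 1):
        ratio = UNIQUE_RATIO_BY_NG.get(total_ng / k)
        if ratio is not None and ratio >= min_unique_ratio:
            return k
    return None


print(smallest_aliquot_count(20))  # 4
```

A fuller version would recompute the ratio from the fragment length distribution for any per-aliquot amount instead of relying on the eight tabulated points.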
  • a library is prepared by tagging an adapter including a different index for each aliquot, and NGS analysis is performed, and then the NGS results of each aliquot are obtained.
  • the index that distinguishes each of the aliquots is referred to as a tube barcode. Selection of adapters including different indices and library preparation including tagging methods may differ depending on the specific platform employed, and those skilled in the art may select an appropriate one with reference to the description of Examples and the like herein.
  • Sequencing is performed by a multiplex sequencing method on an NGS platform such as Illumina.
  • the NGS data of each aliquot is independently mapped to the reference genome and error correction is performed.
  • one bam file is created for each aliquot.
  • These bam files are combined into one bam file.
  • gene mutations are detected from the integrated bam file using a general mutation analysis program (Mutect2, Varscan, Vardict, Strelka2, etc.).
  • the present application relates to a library preparation method for improving the ratio of native fragments used for NGS analysis in detecting low-frequency mutations in cfDNA using NGS analysis.
  • the method comprises the steps of (a) providing a specific amount of a cfDNA sample; (b) calculating the number of total genome fragments corresponding to a length of 51 bp to 330 bp among the genome fragments included in the specific amount of cfDNA, and the number of unique fragments, wherein the number of unique fragments is the total number of It is a value obtained by subtracting the sum of collision counts of each genome fragment corresponding to the 51-330 bp length from the number of genome fragments, and the collision count sum is calculated from Equation 1, (c) the total genome fragments from step (b) calculating the ratio of the number of the unique fragments to the number, and dividing the specific amount of the cfDNA sample into a plurality of aliquots to increase the ratio of the unique fragments; and (d) preparing a library for NGS by tagging adapters each having a different index for each of the plurality of aliquots.
  • For each step included in the method, reference may be made to the description above.
  • The cfDNA used in the experiment was Seraseq™ ctDNA mutation mix v2 and Seraseq™ cfDNA mutation mix v2 WT, purchased from SeraCare (Milford, MA).
  • PCR primers (Illumina, Inc) including i7 and i5 indexes at the 5' end and 3' end, respectively, are complementary to the adapter combined with The PCR primers include the following common sequences and index sequences (indicated by [i7] and [i5]) that can distinguish each aliquot.
  • ARAF: A-Raf proto-oncogene serine/threonine kinase
    ABL1: ABL proto-oncogene 1, non-receptor tyrosine kinase
    AKT1: AKT serine/threonine kinase 1
    AKT2: AKT serine/threonine kinase 2
    APC: APC, WNT signaling pathway regulator
    ARID1A: AT-rich interaction domain 1A
    ATM: ATM serine/threonine kinase
    BRAF: B-Raf proto-oncogene, serine/threonine kinase
    BCR: BCR, RhoGEF and GTPase activating protein
    BRCA1: BRCA1, DNA repair associated
    BRCA2: BRCA2, DNA repair associated
    BTK: Bruton tyrosine kinase
    CEBPA: CCAAT/enhancer binding protein alpha
    CD274: CD274 molecule
    CBL: Cbl proto-oncogene
    FBXW7: F-box and WD repeat domain
  • The prepared library was sequenced with 2x150 bp paired ends using NextSeq 550Dx (Illumina, San Diego, CA, USA) equipment and demultiplexed using the bcl2fastq (v2.19.0.316, Illumina Inc.) program to create a fastq file corresponding to each aliquot.
  • Two fastq files are created as a pair, in the forward and reverse directions.
  • Using fastp (version 0.20.0, Shifu Chen et al.), the adapter sequence that was read at the end of the read along with the insert (DNA fragment) was removed to create a new fastq file.
  • FastQC v0.11.8, Babraham Institute
  • For each sample, as many bam files were created as the number of aliquots into which it was divided.
  • The bam files of each aliquot were merged into a single bam file using the merge function of sambamba (version 0.7.0, Artem Tarasov et al.).
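The merge step can be scripted once the per-aliquot bam files exist. The following is a minimal sketch that only builds the `sambamba merge` command line (output file first, then the inputs, with `-t` for the thread count); the file names and the helper name are hypothetical, and nothing is executed.

```python
import shlex


def sambamba_merge_command(merged_bam, aliquot_bams, threads=4):
    """Build the argument list for merging per-aliquot BAM files with sambamba."""
    if len(aliquot_bams) < 2:
        raise ValueError("need at least two aliquot BAM files to merge")
    return ["sambamba", "merge", "-t", str(threads), str(merged_bam),
            *[str(path) for path in aliquot_bams]]


# Hypothetical file names for a sample split into four aliquots.
inputs = [f"sample.aliquot{i}.bam" for i in range(1, 5)]
print(shlex.join(sambamba_merge_command("sample.merged.bam", inputs)))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where sambamba is installed.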
  • The per-base depth was calculated using the depth function of sambamba (version 0.7.0, Artem Tarasov et al.), and the average over all bases of the 106 genes was taken as the fragment mean depth (FMD).
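The FMD described above is a plain average of per-base depths. The following is a minimal sketch, assuming the depth tool's output has been saved as a tab-separated table with a header row and a coverage column named `COV`; that column name and the function name are assumptions, and positions absent from the table are simply not counted.

```python
import csv
import io


def fragment_mean_depth(depth_table, depth_column="COV"):
    """Average the per-base depth column of a tab-separated depth table."""
    reader = csv.DictReader(depth_table, delimiter="\t")
    total = 0
    count = 0
    for row in reader:
        total += int(row[depth_column])
        count += 1
    if count == 0:
        raise ValueError("depth table contains no positions")
    return total / count


# Three made-up positions, for illustration only.
example = io.StringIO("REF\tPOS\tCOV\nchr7\t100\t4000\nchr7\t101\t4200\nchr7\t102\t3800\n")
print(fragment_mean_depth(example))  # 4000.0
```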
  • FMD Fragment mean depth
  • The method of creating the existing library is similar to the method described above. The only difference is that in the method according to the present application a sample is divided into 4 aliquots and a different index is then used for each, whereas in the existing method one sample is tested in one tube, so the library was prepared using only one index.
  • The amount of DNA per individual aliquot decreases as the number of aliquots increases, and the DNA of each aliquot is amplified in the pre-PCR step; this amplification saturates when the aliquot contains 5 ng of DNA.
  • The total amount of pre-PCR DNA increased as the number of aliquots increased, and saturated when the number of aliquots was increased to 4 (see FIG. 5). This is similar to the saturation of the FMD value at 4 aliquots and can be viewed as a characteristic of dividing the sample into 4 aliquots.
  • The method according to the present application assumes that coincidentally identical DNA from different cells does not occur when the sample is properly divided into aliquots; all duplicates are therefore assumed to be PCR duplicates, as when using molecular barcodes, and a consensus DNA sequence is made by majority rule. In practice, however, coincidentally identical DNA from different cells may still exist, so in the error correction process a duplicate group with many differing bases is excluded from the analysis.
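The majority-rule consensus described above can be sketched as follows. This is an illustration, not the application's implementation: it assumes the reads in a duplicate group are already aligned to the same coordinates and have equal length, and the `max_disagreeing_positions` threshold for excluding a group is a hypothetical parameter.

```python
from collections import Counter


def consensus(reads, max_disagreeing_positions=2):
    """Build a majority-rule consensus for one group of duplicate reads.

    Returns None when more positions disagree than the threshold allows,
    which mirrors excluding groups that may mix DNA from different cells.
    """
    if not reads:
        return None
    length = len(reads[0])
    if any(len(read) != length for read in reads):
        raise ValueError("all reads in a duplicate group must have the same length")
    bases = []
    disagreeing = 0
    for column in zip(*reads):
        base, votes = Counter(column).most_common(1)[0]
        if votes < len(column):  # at least one read differs at this position
            disagreeing += 1
        bases.append(base)
    if disagreeing > max_disagreeing_positions:
        return None
    return "".join(bases)


print(consensus(["ACGT", "ACGT", "ACTT"]))  # ACGT
```

Ties are resolved in favor of the base seen first in the group; a production pipeline would more likely weigh base qualities instead.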

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Biotechnology (AREA)
  • Chemical & Material Sciences (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Analytical Chemistry (AREA)
  • Molecular Biology (AREA)
  • Bioethics (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biochemistry (AREA)
  • Artificial Intelligence (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Genetics & Genomics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The present application discloses a method for increasing the ratio of intrinsic fragments used in next-generation sequencing (NGS) analysis for detecting low-frequency mutations in cell-free DNA (cfDNA). In the method according to the present application, a sample is divided into two or more aliquots so that the ratio of chance events in which sequences are identical is at or below a certain level; an NGS library is constructed at the resulting lower DNA concentration in each aliquot; and the NGS data generated from each aliquot are combined in the analysis step. Most intrinsic DNA fragments can thereby be discriminated, enabling detection of low-frequency gene mutations, for example those of ctDNA present at a very low level in cfDNA, more effectively than with general NGS.

Description

A method for increasing the proportion of unique fragments used in NGS analysis to detect low-frequency mutations in cfDNA
The present application relates to a technology for analyzing genetic variation in cfDNA using next-generation sequencing (NGS).
In healthy people, most of the cell-free DNA (cfDNA) present in the blood is DNA released from haematopoietic cells. In cancer patients, however, cfDNA also contains circulating tumor DNA (ctDNA) released into the blood from cells destroyed by cancer cell death. This ctDNA carries genetic mutations associated with specific cancers, and monitoring these mutations makes it possible to detect cancer early, before lesions develop, to analyze the response to a specific cancer therapy, to discover mechanisms by which resistance to anticancer drugs arises, and to confirm the presence of residual cancer.
One method for detecting such ctDNA is droplet digital PCR (ddPCR), which can detect ctDNA down to 0.001%. However, cancer-causing DNA markers are highly diverse, and ddPCR has the disadvantage of a limited test range.
Another method is targeted next-generation sequencing (targeted NGS) (Corcoran, R. B., & Chabner, B. A. (2018). New England Journal of Medicine, 379(18), 1754-1765). Because this technology can examine the entire exons or specific markers of many tumor-associated genes at once, it has the advantage of yielding a genetic profile of the tumor.
However, these NGS methods mainly use cfDNA, and the amount of ctDNA it contains is very limited: ctDNA accounts for only <0.1% to 10% of cfDNA. Furthermore, to obtain statistically significant results from NGS sequencing, a read depth of at least 10X is required to allow for errors. Detecting a variant level of 0.5% therefore requires a total depth of 2000X, and, given 330 genome equivalents per ng, at least 6 ng of DNA. In NGS analysis, information is lost at every experimental step, so the amount of DNA information finally obtained (the conversion rate) is about 30%. Obtaining the amount of information corresponding to 6 ng of ctDNA by NGS analysis therefore requires 20 ng of DNA, but the DNA available for NGS testing in clinical practice is very limited.
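The input-DNA estimate in the preceding paragraph can be reproduced as a short calculation. The sketch below is illustrative only: the function and parameter names are hypothetical, and the default values (10X minimum depth, 0.5% variant level, 330 genome equivalents per ng, 30% conversion rate) are the ones quoted in the text.

```python
def required_input_ng(min_variant_depth=10, variant_fraction=0.005,
                      genome_equiv_per_ng=330, conversion_rate=0.30):
    """Return (total depth, ideal ng, practical ng) for a target variant level.

    All parameter names are illustrative; the defaults are the figures
    quoted in the text.
    """
    # Depth needed so that the variant itself is covered min_variant_depth times.
    total_depth = min_variant_depth / variant_fraction
    # DNA mass that supplies that many genome copies with no losses.
    ideal_ng = total_depth / genome_equiv_per_ng
    # DNA mass needed once losses during the workflow are accounted for.
    practical_ng = ideal_ng / conversion_rate
    return total_depth, ideal_ng, practical_ng


depth, ideal_ng, practical_ng = required_input_ng()
print(f"depth: {depth:.0f}X, ideal: {ideal_ng:.1f} ng, practical: {practical_ng:.1f} ng")
```

With the default values this gives roughly 2000X, about 6 ng, and about 20 ng, matching the figures in the text.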
In addition, even more DNA is needed in practice because usable data are lost in several ways: (i) when the molecular barcode approach used in recent NGS analyses is applied, adapter dimers increase and the proportion of usable data among the data produced falls; (ii) when the genomic DNA of cancer cells is cleaved, fragments that derive from different cells but happen to be cut at the same positions cannot be distinguished by NGS from PCR-amplified copies, and their data are lost; and (iii) during deduplication, when reads are mapped to the reference genome in order to remove identical sequences among billions of reads, the read with the best sequence quality is kept as the representative and the rest are excluded. This makes it difficult to detect cancer-related genetic mutations present in ctDNA.
Therefore, for NGS testing that uses a limited amount of cfDNA to detect genetic mutations present at low frequency in cfDNA, a method is needed that can increase the amount of ctDNA molecular information available for analysis.
The present application aims to provide a method that can increase the amount of ctDNA fragment information available for NGS analysis in NGS testing that uses a limited amount of cfDNA to detect low-frequency genetic mutations.
In one aspect, the present application provides a method for increasing the ratio of unique fragments used in NGS analysis when detecting low-frequency mutations in cell-free DNA (cfDNA) by NGS analysis.
In one embodiment, the method comprises: (a) providing a specific amount of a cfDNA sample; (b) calculating the total number of genome fragments with a length of 51 to 330 bp among the genome fragments contained in the specific amount of cfDNA, and the number of unique fragments, wherein the number of unique fragments is the total number of genome fragments minus the sum of the collision counts of the genome fragments of each length from 51 to 330 bp, and the sum of collision counts is calculated from [Equation 1] disclosed herein; (c) calculating the ratio of the number of unique fragments to the total number of genome fragments, and dividing the specific amount of the cfDNA sample into a plurality of aliquots so as to increase the ratio of unique fragments; and (d) for each of the plurality of aliquots, preparing a library by tagging adapters containing a different index for each aliquot, performing NGS analysis, and then integrating the NGS results of the aliquots.
In one embodiment, detection of low-frequency mutations means detection of mutations in ctDNA contained in cfDNA.
In one embodiment, the aim is to detect genetic mutations in cancer-cell ctDNA that is present at a low frequency of less than about 1% among the genomic DNA fragments contained in cfDNA.
In one embodiment, in step (d), the sample is divided into two or more aliquots so that the ratio of unique fragments is at least about 93%, or so that the ratio of events in which sequences are identical by chance (referred to as collisions) is at or below a certain level; an NGS library is prepared at the resulting lower DNA concentration per aliquot, and the NGS data generated from each aliquot are combined in the analysis step.
In one embodiment, the specific amount of cfDNA is 20 ng, in which case the cfDNA is divided into 4 aliquots so that the ratio of unique fragments is about 93.9%.
In another aspect, the present application provides a library preparation method for increasing the ratio of unique fragments used in NGS analysis when detecting low-frequency mutations in cfDNA by NGS analysis.
In one embodiment, the method comprises: (a) providing a specific amount of a cfDNA sample;
(b) calculating the total number of genome fragments with a length of 51 to 330 bp among the genome fragments contained in the specific amount of cfDNA, and the number of unique fragments, wherein the number of unique fragments is the total number of genome fragments minus the sum of the collision counts of the genome fragments of each length from 51 to 330 bp, and the sum of collision counts is calculated from Equation 1 disclosed herein; (c) calculating, from step (b), the ratio of the number of unique fragments to the total number of genome fragments, and dividing the specific amount of the cfDNA sample into a plurality of aliquots so as to increase the ratio of unique fragments; and (d) preparing a library for NGS by tagging, for each of the plurality of aliquots, adapters containing a different index for each aliquot.
In one embodiment, the aim is to detect genetic mutations in cancer-cell ctDNA that is present at a low frequency of less than about 1% among the genomic DNA fragments contained in cfDNA.
In one embodiment, in step (d), the sample is divided into two or more aliquots so that the ratio of unique fragments is at least about 93%, or so that the ratio of events in which sequences are identical by chance (referred to as collisions) is at or below a certain level; an NGS library is prepared at the resulting lower DNA concentration per aliquot, and the NGS data generated from each aliquot are combined in the analysis step.
In one embodiment, the specific amount of cfDNA is 20 ng, in which case the cfDNA is divided into 4 aliquots so that the ratio of unique fragments is about 93.9%.
The method according to the present application can improve the performance of ctDNA testing by next-generation sequencing (NGS): where the amount of cfDNA obtainable from blood is limited in actual clinical settings, it increases the amount of usable ctDNA information and thereby improves read depth.
In library preparation following the general protocols used for NGS, the cleavage of cancer-cell genomic DNA produces fragments that derive from different cells but, having been cut at the same positions by chance, have identical sequences. Because these cannot be distinguished from PCR-amplified copies, data are lost during NGS analysis. The shorter the DNA fragments and the higher the DNA concentration, the higher the probability that identical DNA fragments arise by chance. ctDNA is on average shorter than normal DNA, so this event occurs with higher probability. This makes it especially difficult to detect genetic mutations found in ctDNA, which is present in small amounts in cfDNA. In the method according to the present application, however, the sample is divided into two or more aliquots so that the ratio of events in which sequences are identical by chance (referred to as collisions) is at or below a certain level, an NGS library is prepared at the resulting lower DNA concentration per aliquot, and the NGS data generated from each aliquot are combined in the analysis step. Most unique DNA fragments can thereby be distinguished, which makes it possible to detect low-frequency genetic mutations, for example those of ctDNA present in very small amounts in cfDNA, more effectively than with general NGS.
FIG. 1 schematically shows a method according to one embodiment of the present application.
FIG. 2a is a graph of the number of fragments (fragment count) as a function of the length of the genomic DNA fragments actually present in cfDNA. It shows a distribution with two peaks, whose modes are at 166 bp and 315 bp. By computing each length's share of the total, the fragment distribution can be treated as the probability of a DNA fragment of that length appearing.
FIG. 2b is a graph in which, for 6,600 DNA fragments at a specific locus of the genome (expressed as 6,600X depth), the number of DNA fragments of each length is calculated from the fragment-length probabilities of FIG. 2a, and the collisions that can occur at each length are calculated. As in FIG. 2a, fragments are most abundant at a length of 166 bp, where the ratio of the collision fragment count is very high at 40.3%; most of the cumulative collision fraction arises at lengths between 100 and 200 bp. This represents data that are discarded unused in general NGS analysis.
FIG. 2c is a graph of the ratio of unique fragments as a function of the amount of starting DNA. At 20 ng, 21.4% of fragments are classified as duplicates, which cannot be used in general NGS analysis.
FIG. 3 shows the increase in FMD (Fragment Mean Depth) with the number of aliquots after 20 ng of starting DNA is divided into 2, 4, and 8 aliquots by a method according to one embodiment of the present application; FMD saturates at 4 aliquots.
FIG. 4 is a graph comparing FMD values obtained by the method according to the present application with those of the existing analysis (all duplicates removed). When 20 ng of starting DNA was divided into 2, 4, and 8 aliquots, the FMD value was higher than in the existing analysis in every analysis using 2 or more aliquots.
FIG. 5 shows the amount of DNA amplified in the pre-PCR step of library preparation for each aliquot after 20 ng of starting DNA is divided into 2, 3, and 8 aliquots by a method according to one embodiment of the present application. The amount of DNA finally obtained increases with the number of aliquots but saturates at 4 aliquots.
FIGS. 6a and 6b show variant calling results for variants with a VAF (variant allele frequency) below 1% before (a) and after (b) error correction. Before error correction (a), many false positive (FP) variants are detected at VAF 1%; after consensus DNA generation and error correction using aliquot information (b), all FPs are removed and only the true positives (TP) remain.
The present application is based on the following finding. In NGS analysis using cfDNA, the cfDNA pool of a given sample contains fragments that derive from different cells but were cut at the same positions by chance; because these cannot be distinguished from PCR-amplified copies, data are lost during NGS analysis. However, when a library is prepared by dividing a specific amount of cfDNA into a plurality of aliquots in a way that minimizes the collision count of that amount of cfDNA, this data loss can be minimized and, ultimately, low-frequency genetic mutations can be detected.
As used herein, “cfDNA (cell-free DNA)” includes genomic fragments of various lengths present in the blood; because the chromatin regions not protected by histone proteins are preferentially cleaved, the fragment length has its mode at 166 bp. In healthy people, most cfDNA is DNA released from haematopoietic cells; in cancer patients, it also contains ctDNA derived from cancer cells as a result of cancer cell death, as described below. cfDNA can be extracted from blood; reagents and kits for the extraction are commercially available, and the methods are known (Clara Perez-Barrios et al. Transl Lung Cancer Res 5 (2016)).
As used herein, “ctDNA (circulating tumor DNA)” is a genomic fragment derived from cancer cells and contained in cfDNA. It accounts for only <0.1% to 10% of total cfDNA. Because of the rapid self-replication of cancer cells, ctDNA has fewer histone-protected regions and is consequently shorter than cfDNA derived from healthy cells: it is mainly 90 to 150 bp long, about 20-40 bp shorter than ordinary cfDNA (Mouliere, F. et al. Sci Transl Med 10, (2018)). Because this ctDNA carries genetic mutations associated with specific cancers, monitoring these mutations in blood is useful for early detection of cancer before lesions develop, analysis of the response to a specific cancer therapy, discovery of mechanisms by which resistance to anticancer drugs arises, and confirmation of the presence of residual cancer. In one embodiment according to the present application, the aim is to detect genetic mutations in cancer-cell-derived ctDNA that is present at a low frequency of less than about 1% among the genomic DNA fragments contained in cfDNA.
As used herein, “NGS (Next Generation Sequencing)” is a genome sequencing technology that analyzes nucleotide sequences at high speed by processing genome-derived DNA fragments in parallel. It requires library preparation, which includes adding indexes, molecular barcodes, and the like to the fragments and amplifying them, and a data analysis process that includes alignment of the raw data produced, error handling through mapping to a reference sequence, and derivation of the nucleotide sequence. Depending on the purpose, NGS can be performed on various analysis platforms, for example Illumina NextSeq, Illumina NovaSeq, ThermoFisher Ion Proton, Pacific Biosciences Sequel II, and BGI MGI; the library preparation kits and methods used for each platform are available from the platform's manufacturer.
Owing to its characteristics, NGS has the following inherent problems. The first concerns error handling. Errors vary with the experimental method and the NGS platform; Illumina Inc. instruments, for example, have an average error rate of 0.1 to 1% per nucleotide. Because cfDNA testing generally targets a limit of detection (LOD) below an allele frequency (AF) of 1%, traditional NGS experiments cannot distinguish true variants from errors. In addition, whatever method is used, an NGS experiment always includes a DNA amplification step by PCR. Amplification efficiency differs from fragment to fragment because of factors such as GC content and DNA length, so the fragments are not amplified uniformly. This effect is corrected in the analysis step by removing duplicates (copies regarded as amplified from a single DNA molecule, which include both PCR duplicates and collisions). The Picard tool is generally used for this: it keeps the read (a sequence read out by NGS) that is identical to the reference genome and excludes the remaining duplicates. If an error occurs at a random nucleotide in some of the duplicates, reads copied from a single DNA molecule appear to have different sequences. In traditional NGS these reads are largely ignored, and only the read closest to the reference genome is used for analysis.
ctDNA, however, is short, so DNA molecules that derive from different cells often happen to have exactly the same sequence, and traditional NGS methods cannot tell whether such reads are PCR duplicates or derive from different cells. As a result, DNA carrying a variant is sometimes ignored. To overcome this problem, barcode sequence or UMI (unique molecular identifier) technology was later developed for ctDNA testing with molecular barcoding. With this technology, reads deriving from different cells can be distinguished by their barcode sequences, and PCR errors can be distinguished from true variants among those reads. This process is generally called error correction.
In this case, the sequencing error is calculated as a Q value for each base. In Illumina sequencing, error rates are generally higher at the beginning and end of the sequencing run. Because the molecular barcode sequence is read at the beginning of sequencing, it is relatively more vulnerable to errors, and misread barcode sequences seriously undermine the original purpose of distinguishing unique molecules. Many attempts have been made to overcome this, such as adjusting the length of the barcode sequence and carefully designing barcode sequence combinations (Smith et al., Genome Res. 27, 491-499 (2017); Somervuo, P. et al. BMC Bioinformatics 19, 257 (2018)). The conversion rate is also reduced by barcode-containing adapter dimers. ctDNA is usually shorter than cfDNA and longer than adapter dimers. Adapter dimers are removed on the basis of DNA length, but because the barcode sequence makes them longer, they are harder to distinguish from ctDNA, and more ctDNA is lost when adapter dimers are removed. As a result, the conversion rate of ctDNA itself is lower than that of total cfDNA.
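The per-base Q value mentioned above can be related to the error rates quoted earlier, assuming it refers to the standard Phred quality score, Q = -10 log10(P), where P is the probability that the base call is wrong. The helper names below are illustrative.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred quality score Q to the base-call error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert a base-call error probability to a Phred quality score."""
    return -10 * math.log10(p)


# The 0.1% to 1% per-nucleotide error range corresponds to Q30 down to Q20.
for p in (0.001, 0.01):
    print(f"error rate {p:.3%} -> Q{error_prob_to_phred(p):.0f}")
```

Under this convention, a variant allele frequency below 1% sits at or below the error rate of a Q20 base, which is why raw reads alone cannot separate true low-frequency variants from sequencing errors.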
That is, when genomic DNA released from cells is cleaved, fragments that derive from different cells may happen to be cut at the same positions and therefore appear with identical sequences in the NGS results. Meanwhile, because DNA is amplified by PCR during library preparation and the copies ultimately appear as duplicate sequences, fragments whose sequences are identical only by chance are also removed as duplicates in the usual analysis. Detecting ctDNA is particularly important because it carries cancer-cell-specific genetic mutations. However, ctDNA is present in cfDNA in very small amounts, and its information is lost during duplicate removal, so it goes undetected.
Because of these problems, the ctDNA information available for analysis is further limited when blood cfDNA is analyzed by NGS. Moreover, as noted above, the amount of cfDNA in blood and the amount of ctDNA within it are very limited, and so is the volume of blood that can be collected clinically, so there is a limit to increasing the amount of DNA itself. In a healthy person, for example, the amount of DNA in the blood averages about 4.4 ng/ml (Raymond, C. K., Hernandez, J., Karr, R., Hill, K. & Li, M. Collection of cell-free DNA for genomic analysis of solid tumors in a clinical laboratory setting. PLoS One 12, (2017)). Because some of this DNA must be set aside for other tests (for example, ddPCR: 25 ng; real-time PCR: 1 to 100 ng), the amount of DNA available for NGS testing in actual clinical practice is even more limited.
To detect cancer-associated low-frequency variants, for example variants below 1%, in ctDNA contained in very small amounts in cfDNA, the method according to the present application can minimize the sequences lost to error handling during NGS analysis.
Accordingly, in one aspect, the present application relates to a method for increasing the ratio of unique fragments used in NGS analysis when detecting low-frequency mutations in cfDNA (cell free DNA) by NGS (Next Generation Sequencing) analysis.
일 구현예에서 상기 방법은 (a) 특정 양의 cfDNA 시료를 제공하는 단계; (b) 상기 특정 양의 cfDNA에 포함된 유전체 단편 중 51 내지 330bp 길이에 해당하는 총 유전체 단편의 수, 및 고유 단편 (unique fragment)의 수를 계산하는 단계로, 상기 고유 단편의 수는 상기 총 유전체 단편의 수에서 상기 51 내지 330bp 길이에 해당하는 각 유전체 단편의 collision count의 합을 제외한 값이고, 상기 collision count 합은 다음 식으로부터 계산되며, In one embodiment, the method comprises the steps of (a) providing a specific amount of a cfDNA sample; (b) calculating the number of total genome fragments corresponding to a length of 51 to 330 bp among the genome fragments included in the specific amount of cfDNA, and the number of unique fragments, wherein the number of unique fragments is the total number of It is a value obtained by subtracting the sum of collision counts of each genome fragment corresponding to the length of 51 to 330 bp from the number of genome fragments, and the sum of collision counts is calculated from the following equation,
Figure PCTKR2021011654-appb-img-000001
.
In the above equation, q(k-1;d) is the probability that, among n numbers in the range [1,d], there is a number equal to k; k is a specific number; d is the range of the numbers; and n is the count of numbers;
(c) calculating the ratio of the number of unique fragments to the total number of genomic fragments, and dividing the specific amount of the cfDNA sample into a plurality of aliquots so as to increase the ratio of unique fragments; and (d) for each of the plurality of aliquots, preparing a library by tagging with an adapter containing an index that differs between aliquots, performing NGS (Next Generation Sequencing) analysis, and then combining the NGS results of the aliquots.
Referring to FIG. 1, the method according to the present application is characterized in that a sample is divided into two or more aliquots to prepare NGS libraries. The NGS data generated from each aliquot are combined at the analysis stage, so sequence information can be obtained from a larger amount of ctDNA than with conventional NGS, which enables accurate detection of genetic mutations with a low VAF (Variant Allele Frequency) in ctDNA.
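The effect of giving each aliquot its own index can be illustrated with a minimal Python sketch. It is not the pipeline of the present application: the `Fragment` record, the index labels, and the `merge_aliquots` helper are hypothetical names introduced only for illustration, and duplicate removal is reduced to comparing fragment coordinates.

```python
from typing import Iterable, List, NamedTuple


class Fragment(NamedTuple):
    aliquot: str  # hypothetical label for the index of the aliquot library
    chrom: str
    start: int
    end: int


def merge_aliquots(fragments: Iterable[Fragment]) -> List[Fragment]:
    """Drop duplicates within each aliquot, then pool all aliquots.

    Two fragments count as duplicates only if they share the same
    coordinates AND the same aliquot index (assumption of this sketch).
    """
    seen = set()
    kept = []
    for frag in fragments:
        if frag not in seen:  # the tuple comparison includes the aliquot index
            seen.add(frag)
            kept.append(frag)
    return kept


reads = [
    Fragment("idx1", "chr7", 1000, 1166),
    Fragment("idx1", "chr7", 1000, 1166),  # same aliquot: treated as a duplicate
    Fragment("idx2", "chr7", 1000, 1166),  # other aliquot: kept as a distinct molecule
]
merged = merge_aliquots(reads)
print(len(merged))
```

In this toy input, the fragment from the second aliquot survives even though its coordinates match those in the first aliquot, because the index shows that it cannot be a PCR copy made in the other library.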
As used herein, available ctDNA data means the NGS data that can be used to detect mutations in mutated tumor-cell-derived DNA when DNA derived from normal cells and DNA derived from tumor cells are mixed and cannot be distinguished from each other.
In the method according to the present application, the specific amount of DNA refers to cfDNA extracted from a blood sample, as can ordinarily be obtained from blood drawn from a patient, and means the amount of cfDNA generally allocated to NGS analysis. For example, in healthy individuals the DNA concentration in blood averages about 4.4 ng/ml (Raymond, C. K., Hernandez, J., Karr, R., Hill, K. & Li, M. Collection of cell-free DNA for genomic analysis of solid tumors in a clinical laboratory setting. PLoS One 12, (2017)). If about 5 ml of blood is typically drawn from one patient for NGS analysis, about 20 ng of cfDNA can be obtained, although the specific amount may vary depending on the cancer type and the stage of the cancer.
As the next step, the method according to the present application includes obtaining the number of unique fragments by using the number of genomic fragments contained in the specific amount of cfDNA and the collision count.
In the present application, among the genomic fragments contained in a specific amount of cfDNA, the total number of genomic fragments 51 to 330 bp in length and the number of unique fragments are calculated. cfDNA genomic fragments range from 5 bp to 991 bp in length and show a distribution with two peaks. The peaks have their modes at 166 bp and 315 bp, respectively, and the first peak at 166 bp is about 16 times higher in proportion than the second peak at 315 bp. cfDNA fragments that have been cut excessively, to 50 bp or less, have lost their information, and very long fragments of more than 330 bp are mostly not ctDNA, so neither is useful for testing. Most ctDNA information is therefore contained in fragments of 51 to 330 bp, and in one embodiment according to the present application, fragments 51 to 330 bp in length from cfDNA containing genomic fragments of various lengths are used for analysis.
The number of genomic fragments contained in cfDNA corresponds to the number of fragment molecules. For humans, 1 ng of DNA typically contains 330 genome equivalents (a genome equivalent is the amount of DNA in which every gene of a single cell is present; this number depends on the genome size of the organism and is calculated by converting genome base pairs into µg of DNA). In one embodiment according to the present application, the genomic fragments of the cfDNA are about 51 bp to 330 bp in length; for example, referring to Tables 1 and 2, the number of fragments 51 to 330 bp in length in 20 ng of cfDNA is 6,173.
In the method according to the present application, the collision count is used to calculate the number of fragments whose sequences are identical by chance in a specific amount of cfDNA containing genomic fragments of various lengths, and subtracting that number from the total number of fragments gives the number of unique fragments. A case in which DNA sequences are identical by chance in a DNA sample containing genomic fragments of various lengths is called a collision, and the number of DNA fragments with coincidentally identical sequences can be calculated by the collision counting method.
The collision count is based on the birthday paradox, that is, the probability that identical members occur by chance in a group of a given size: the probability that the k-th integer chosen at random from the range [1,d] repeats at least one previous choice equals q(k-1;d), which can be expressed as Equation 1 below (Might, Matt. "Collision hash collisions with the birthday paradox". Matt Might's blog. Retrieved 17 July 2015).
[Equation 1]
Figure PCTKR2021011654-appb-img-000002
In the above equation, q(k-1;d) is the probability that, among n numbers in the range [1,d], there is a number equal to k; k is a specific number; d is the range of the numbers; and n is the count of numbers.
For example, it is known that 20 ng of cfDNA contains 6,600 genome equivalents. In that case, at any given locus there are DNA fragments derived from 6,600 different genomes (expressed as 6,600X depth); these fragments occur at various lengths following the probability distribution calculated in FIG. 2a, and their numbers are as shown in the distribution of FIG. 2b and the values in Table 1. However, assuming that the lengths that actually reflect ctDNA information including low-frequency sequence variants fall in the range of 51 to 330 bp, the total number of corresponding genomic DNA fragments is 6,173, as calculated in Table 2. A different collision count is calculated for fragments of each length. For example, for 166 bp fragments, according to Table 1 the length-distribution probability is 0.02874, which corresponds to 189.687 of the 6,600 fragments, and the collision count is then calculated as follows to be 76.4513.
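The equation itself appears only as an image in this text. Assuming Equation 1 is the standard birthday-problem result for the expected number of repeated selections, which matches the definitions given here, it would read as follows; this is a reconstruction offered under that assumption, not a transcription of the image.

```latex
q(n; d) = 1 - \left(\frac{d-1}{d}\right)^{n},
\qquad
\sum_{k=1}^{n} q(k-1; d) = n - d + d\left(\frac{d-1}{d}\right)^{n}
```

With n = 189.687 and d = 166, the right-hand side evaluates to about 76.45, which agrees with the collision count listed for 166 bp fragments in Table 1.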
Figure PCTKR2021011654-appb-img-000003
In this way, calculating the collision count for fragments of each length from 51 to 330 bp according to Equation 1 and summing them gives 1,319, which corresponds to 21.4% of the 6,173 fragments (see Table 2).
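The 166 bp example can be checked with a short Python sketch. It assumes that the collision count is the closed-form birthday-problem expectation n - d + d((d-1)/d)^n and that d equals the number of distinct positions listed in Table 1 for that length; the function name `expected_collisions` is introduced here only for illustration.

```python
def expected_collisions(n: float, d: int) -> float:
    """Expected number of draws that repeat an earlier draw when n items
    are drawn uniformly from d possible values (birthday-problem form)."""
    return n - d + d * ((d - 1) / d) ** n


n_166 = 189.687  # fragments of length 166 bp among 6,600, as stated in the text
d_166 = 166      # distinct positions for this length (Table 1); assumed to be d

count = expected_collisions(n_166, d_166)
ratio = count / n_166
print(round(count, 2), round(100 * ratio, 1))  # about 76.45 and 40.3
```

Applying the same function to every length from 51 to 330 bp and summing the results is the calculation that the text reports as 1,319.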
As noted above, owing to the nature of cfDNA, it often happens that DNA fragments derived from different cells have exactly the same sequence by chance, and conventional NGS methods cannot distinguish whether such fragments are PCR duplicates or come from different cells. As a result, DNA carrying a mutation may be ignored, and available ctDNA information is lost.
Therefore, in the present application, the probability that identical DNA arises by chance is calculated, and the NGS test is carried out by dividing the sample in a way that minimizes this probability.
For example, as a theoretical calculation, about 20 ng of DNA contains 6,600 genome equivalents as noted above, of which 6,173 fragments have the analytically meaningful length of 51 to 330 bp. According to the collision count, 1,319 of these copies (21.4%) have DNA sequences that are identical to one another by chance and cannot be distinguished from PCR duplicates, so these data go unused in NGS analysis and are discarded. Thus, even DNA information derived from different cells is discarded at the step that removes PCR duplicates during analysis. Consequently, the number of distinct unique DNA fragments in the actual DNA is 4,854 (6,173 - 1,319), and the ratio of unique fragments is 0.786 (4,854/6,173). In other words, only 78.6% of the fragments are used for analysis.
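The arithmetic in this paragraph can be written out as a few lines of Python. The variable names are illustrative; the three input numbers are the ones stated in the text.

```python
total_fragments = 6173   # fragments of 51 to 330 bp in 20 ng of cfDNA (from the text)
collision_sum = 1319     # summed collision count over 51 to 330 bp (from the text)

unique_fragments = total_fragments - collision_sum
unique_ratio = unique_fragments / total_fragments
collision_share = collision_sum / total_fragments

print(unique_fragments, round(unique_ratio, 3), round(100 * collision_share, 1))
```

The printed values correspond to the 4,854 unique fragments, the unique fragment ratio of 0.786, and the 21.4% collision share given above.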
Unique DNA fragments are derived from different cells, and it is important for low-frequency mutation detection that they not be removed by being regarded as coincidentally identical in sequence, so it is advantageous to increase the ratio of unique fragments as much as possible. As noted above, a case in which DNA sequences are identical by chance in a DNA sample is called a collision, and the number of DNA fragments of a given length whose sequences are identical by chance can be calculated by the collision counting method.
In the present application, to calculate the probability of occurrence for each cfDNA fragment length, cfDNA from actual blood was examined to obtain the distribution of fragment lengths (see FIG. 2a). Based on these probabilities, the number of DNA fragments of each length was calculated for 6,600X depth at a given locus (see FIG. 2b and Table 1). Excluding DNA sequences that are too short or too long, the collision count was calculated by length for fragments 51 bp to 330 bp in length, and the results are shown in Table 1 below. According to Table 1, when 20 ng is used, the collision count ratio is high, for example about 40.3% at a length of 166 bp, and these are data that go unused and are discarded in conventional NGS analysis.
Columns: fragment length (bp); fragment probability; number of distinct positions among fragments of that length; number of fragments of that length in 20 ng (6,600 fragments); collision count at 20 ng; collision count ratio
51  0.000016  51  0.11  0  0.0%
52  0.000021  52  0.14  0  0.0%
53  0.000019  53  0.12  0  0.0%
54  0.000022  54  0.15  0  0.0%
55  0.000023  55  0.15  0  0.0%
56  0.000031  56  0.21  0  0.0%
57  0.000032  57  0.21  0  0.0%
58  0.000038  58  0.25  0  0.0%
59  0.000043  59  0.29  0  0.0%
60  0.000046  60  0.30  0  0.0%
61  0.000053  61  0.35  0  0.0%
62  0.000048  62  0.32  0  0.0%
63  0.000046  63  0.31  0  0.0%
64  0.000043  64  0.28  0  0.0%
65  0.000050  65  0.33  0  0.0%
66  0.000053  66  0.35  0  0.0%
67  0.000059  67  0.39  0  0.0%
68  0.000070  68  0.46  0  0.0%
69  0.000081  69  0.54  0  0.0%
70  0.000084  70  0.56  0  0.0%
71  0.000087  71  0.57  0  0.0%
72  0.000079  72  0.52  0  0.0%
73  0.000085  73  0.56  0  0.0%
74  0.000088  74  0.58  0  0.0%
75  0.000085  75  0.56  0  0.0%
76  0.000091  76  0.60  0  0.0%
77  0.000109  77  0.72  0  0.0%
78  0.000124  78  0.82  0  0.0%
79  0.000160  79  1.05  0  0.0%
80  0.000200  80  1.32  0  0.0%
81  0.000195  81  1.29  0  0.0%
82  0.000153  82  1.01  0  0.0%
83  0.000141  83  0.93  0  0.0%
84  0.000120  84  0.79  0  0.0%
85  0.000129  85  0.85  0  0.0%
86  0.000141  86  0.93  0  0.0%
87  0.000157  87  1.04  0.000219  0.0%
88  0.000165  88  1.09  0.0005495  0.1%
89  0.000211  89  1.40  0.0031037  0.2%
90  0.000274  90  1.81  0.0081559  0.5%
91  0.000348  91  2.29  0.0162852  0.7%
92  0.000324  92  2.14  0.0132215  0.6%
93  0.000279  93  1.84  0.0083328  0.5%
94  0.000225  94  1.49  0.0038582  0.3%
95  0.000227  95  1.50  0.0039528  0.3%
96  0.000228  96  1.50  0.0039428  0.3%
97  0.000242  97  1.60  0.0049544  0.3%
98  0.000295  98  1.95  0.0094001  0.5%
99  0.000359  99  2.37  0.0163691  0.7%
100  0.000435  100  2.87  0.026709  0.9%
101  0.000459  101  3.03  0.0303714  1.0%
102  0.000481  102  3.17  0.0336411  1.1%
103  0.000467  103  3.08  0.0310534  1.0%
104  0.000410  104  2.71  0.0221575  0.8%
105  0.000404  105  2.66  0.0210665  0.8%
106  0.000440  106  2.90  0.0259807  0.9%
107  0.000468  107  3.09  0.0301259  1.0%
108  0.000518  108  3.42  0.0381896  1.1%
109  0.000593  109  3.92  0.0520643  1.3%
110  0.000677  110  4.47  0.0698032  1.6%
111  0.000775  111  5.12  0.094019  1.8%
112  0.000760  112  5.02  0.0892082  1.8%
113  0.000662  113  4.37  0.0646737  1.5%
114  0.000625  114  4.12  0.0561159  1.4%
115  0.000605  115  3.99  0.0515675  1.3%
116  0.000680  116  4.49  0.0670939  1.5%
117  0.000743  117  4.90  0.0811063  1.7%
118  0.000789  118  5.21  0.0921151  1.8%
119  0.000922  119  6.08  0.128439  2.1%
120  0.001127  120  7.44  0.1964112  2.6%
121  0.001449  121  9.56  0.3315454  3.5%
122  0.001755  122  11.58  0.4893183  4.2%
123  0.001637  123  10.80  0.4202291  3.9%
124  0.001353  124  8.93  0.2803854  3.1%
125  0.001132  125  7.47  0.1905513  2.6%
126  0.001135  126  7.49  0.1901254  2.5%
127  0.001230  127  8.12  0.2237167  2.8%
128  0.001392  128  9.18  0.288204  3.1%
129  0.001641  129  10.83  0.4035681  3.7%
130  0.001932  130  12.75  0.5605187  4.4%
131  0.002531  131  16.70  0.9644321  5.8%
132  0.003236  132  21.36  1.5690623  7.3%
133  0.003943  133  26.02  2.3070925  8.9%
134  0.004084  134  26.95  2.4549264  9.1%
135  0.003750  135  24.75  2.0601001  8.3%
136  0.003444  136  22.73  1.7272533  7.6%
137  0.003890  137  25.67  2.1841482  8.5%
138  0.004538  138  29.95  2.9390325  9.8%
139  0.005535  139  36.53  4.3038944  11.8%
140  0.006545  140  43.19  5.9139093  13.7%
141  0.007160  141  47.25  6.9824241  14.8%
142  0.007428  142  49.03  7.4449552  15.2%
143  0.007467  143  49.28  7.4732752  15.2%
144  0.007673  144  50.64  7.8223846  15.4%
145  0.007750  145  51.15  7.9242909  15.5%
146  0.008047  146  53.11  8.4609704  15.9%
147  0.008270  147  54.58  8.8578699  16.2%
148  0.008755  148  57.78  9.8117274  17.0%
149  0.009384  149  61.94  11.122706  18.0%
150  0.010203  150  67.34  12.941604  19.2%
151  0.010751  151  70.96  14.193123  20.0%
152  0.011284  152  74.47  15.446494  20.7%
153  0.011379  153  75.10  15.602162  20.8%
154  0.011305  154  74.61  15.327489  20.5%
155  0.011804  155  77.91  16.520818  21.2%
156  0.012833  156  84.70  19.18153  22.6%
157  0.014089  157  92.99  22.654828  24.4%
158  0.015611  158  103.03  27.172231  26.4%
159  0.016979  159  112.06  31.465095  28.1%
160  0.017922  160  118.29  34.501785  29.2%
161  0.018822  161  124.22  37.47267  30.2%
162  0.020235  162  133.55  42.40592  31.8%
163  0.022411  163  147.91  50.510008  34.1%
164  0.025778  164  170.13  64.067668  37.7%
165  0.028364  165  187.20  75.078845  40.1%
166  0.028740  166  189.69  76.45129  40.3%
167  0.027921  167  184.28  72.49149  39.3%
168  0.026564  168  175.32  66.305669  37.8%
169  0.024722  169  163.16  58.334911  35.8%
170  0.023241  170  153.39  52.165388  34.0%
171  0.021241  171  140.19  44.33719  31.6%
172  0.020003  172  132.02  39.673125  30.1%
173  0.019359  173  127.77  37.251886  29.2%
174  0.018912  174  124.82  35.562301  28.5%
175  0.018207  175  120.17  33.063259  27.5%
176  0.017414  176  114.93  30.365704  26.4%
177  0.016323  177  107.73  26.867792  24.9%
178  0.014823  178  97.83  22.409685  22.9%
179  0.013142  179  86.74  17.844961  20.6%
180  0.011824  180  78.04  14.575045  18.7%
181  0.010648  181  70.28  11.904888  16.9%
182  0.009721  182  64.16  9.9654176  15.5%
183  0.008745  183  57.72  8.1005821  14.0%
184  0.008255  184  54.48  7.2148401  13.2%
185  0.007945  185  52.44  6.6693924  12.7%
186  0.007292  186  48.13  5.621881  11.7%
187  0.006927  187  45.72  5.0640192  11.1%
188  0.006413  188  42.33  4.3361794  10.2%
189  0.005969  189  39.39  3.7494572  9.5%
190  0.005404  190  35.67  3.0697123  8.6%
191  0.004867  191  32.12  2.4846634  7.7%
192  0.004451  192  29.37  2.0707941  7.0%
193  0.004009  193  26.46  1.6735759  6.3%
194  0.003750  194  24.75  1.4571476  5.9%
195  0.003461  195  22.84  1.2346114  5.4%
196  0.003300  196  21.78  1.1166498  5.1%
197  0.003167  197  20.90  1.0228044  4.9%
198  0.002935  198  19.37  0.8728059  4.5%
199  0.002821  199  18.62  0.8019103  4.3%
200  0.002671  200  17.63  0.714144  4.1%
201  0.002455  201  16.20  0.5985785  3.7%
202  0.002208  202  14.57  0.4795369  3.3%
203  0.002112  203  13.94  0.435555  3.1%
204  0.001918  204  12.66  0.3553651  2.8%
205  0.001829  205  12.07  0.3207386  2.7%
206  0.001720  206  11.35  0.2810645  2.5%
207  0.001591  207  10.50  0.2377054  2.3%
208  0.001484  208  9.80  0.2045332  2.1%
209  0.001406  209  9.28  0.1817837  2.0%
210  0.001384  210  9.14  0.1750144  1.9%
211  0.001263  211  8.33  0.143394  1.7%
212  0.001152  212  7.60  0.1172997  1.5%
213  0.001026  213  6.77  0.0910415  1.3%
214  0.000966  214  6.38  0.0795505  1.2%
215  0.000943  215  6.22  0.0750397  1.2%
216  0.000872  216  5.76  0.0629987  1.1%
217  0.000801  217  5.29  0.0520222  1.0%
218  0.000783  218  5.17  0.0491292  1.0%
219  0.000674  219  4.45  0.0349461  0.8%
220  0.000701  220  4.63  0.0380348  0.8%
221  0.000653  221  4.31  0.0321531  0.7%
222  0.000637  222  4.21  0.0302749  0.7%
223  0.000596  223  3.93  0.025814  0.7%
224  0.000538  224  3.55  0.020143  0.6%
225  0.000531  225  3.50  0.0194665  0.6%
226  0.000507  226  3.35  0.0173487  0.5%
227  0.000442  227  2.91  0.0122772  0.4%
228  0.000466  228  3.08  0.0140133  0.5%
229  0.000386  229  2.55  0.0086128  0.3%
230  0.000405  230  2.67  0.0096883  0.4%
231  0.000404  231  2.67  0.0096095  0.4%
232  0.000398  232  2.62  0.0091792  0.3%
233  0.000374  233  2.47  0.0077798  0.3%
234  0.000345  234  2.28  0.0062302  0.3%
235  0.000322  235  2.12  0.0050699  0.2%
236  0.000324  236  2.14  0.0051765  0.2%
237  0.000319  237  2.11  0.0049276  0.2%
238  0.000286  238  1.89  0.0035212  0.2%
239  0.000309  239  2.04  0.0044208  0.2%
240  0.000300  240  1.98  0.004055  0.2%
241  0.000295  241  1.95  0.0038254  0.2%
242  0.000283  242  1.87  0.0033458  0.2%
243  0.000296  243  1.95  0.0038375  0.2%
244  0.000260  244  1.72  0.0025351  0.1%
245  0.000282  245  1.86  0.0032673  0.2%
246  0.000286  246  1.89  0.0033993  0.2%
247  0.000289  247  1.91  0.0035159  0.2%
248  0.000258  248  1.70  0.0024146  0.1%
249  0.000263  249  1.74  0.0025738  0.1%
250  0.000276  250  1.82  0.002987  0.2%
251  0.000290  251  1.92  0.0034968  0.2%
252  0.000289  252  1.91  0.0034378  0.2%
253  0.000282  253  1.86  0.0031519  0.2%
254  0.000250  254  1.65  0.0021205  0.1%
255  0.000282  255  1.86  0.0031481  0.2%
256  0.000244  256  1.61  0.0019266  0.1%
257  0.000269  257  1.78  0.0026842  0.2%
258  0.000267  258  1.76  0.0026065  0.1%
259  0.000264  259  1.74  0.002487  0.1%
260  0.000285  260  1.88  0.0031854  0.2%
261  0.000273  261  1.80  0.0027773  0.2%
262  0.000278  262  1.83  0.0029166  0.2%
263  0.000321  263  2.12  0.0044933  0.2%
264  0.000291  264  1.92  0.0033527  0.2%
265  0.000293  265  1.94  0.0034148  0.2%
266  0.000297  266  1.96  0.0035262  0.2%
267  0.000294  267  1.94  0.0034103  0.2%
268  0.000297  268  1.96  0.003509  0.2%
269  0.000323  269  2.13  0.0044915  0.2%
270  0.000321  270  2.12  0.0043824  0.2%
271  0.000345  271  2.28  0.005359  0.2%
272  0.000371  272  2.45  0.0065194  0.3%
273  0.000384  273  2.53  0.0071123  0.3%
274  0.000362  274  2.39  0.0060642  0.3%
275  0.000394  275  2.60  0.0075554  0.3%
276  0.000390  276  2.58  0.0073582  0.3%
277  0.000401  277  2.65  0.0078563  0.3%
278  0.000421  278  2.78  0.0088756  0.3%
279  0.000418  279  2.76  0.0086796  0.3%
280  0.000477  280  3.15  0.0120771  0.4%
281  0.000511  281  3.37  0.0141941  0.4%
282  0.000555  282  3.66  0.0172579  0.5%
283  0.000577  283  3.81  0.0188333  0.5%
284  0.000584  284  3.86  0.0193495  0.5%
285  0.000592  285  3.91  0.0199063  0.5%
286  0.000595  286  3.93  0.0200309  0.5%
287  0.000666  287  4.40  0.0259505  0.6%
288  0.000697  288  4.60  0.0286504  0.6%
289  0.000752  289  4.96  0.033932  0.7%
290  0.000804  290  5.31  0.0392815  0.7%
291  0.000828  291  5.47  0.0417634  0.8%
292  0.000860  292  5.68  0.045296  0.8%
293  0.000863  293  5.70  0.0454681  0.8%
294  0.000964  294  6.36  0.0576814  0.9%
295  0.001023  295  6.75  0.0654103  1.0%
296  0.001064  296  7.02  0.0709781  1.0%
297  0.001133  297  7.48  0.0810601  1.1%
298  0.001179  298  7.78  0.0879563  1.1%
299  0.001250  299  8.25  0.0993978  1.2%
300  0.001277  300  8.43  0.1036694  1.2%
301  0.001325  301  8.74  0.1116373  1.3%
302  0.001356  302  8.95  0.1169193  1.3%
303  0.001372  303  9.05  0.1193929  1.3%
304  0.001451  304  9.57  0.1339202  1.4%
305  0.001494  305  9.86  0.1420285  1.4%
306  0.001595  306  10.53  0.1623699  1.5%
307  0.001606  307  10.60  0.1642491  1.5%
308  0.001627  308  10.74  0.1680851  1.6%
309  0.001667  309  11.00  0.1762472  1.6%
310  0.001662  310  10.97  0.174698  1.6%
311  0.001721  311  11.36  0.1872885  1.6%
312  0.001736  312  11.46  0.1901055  1.7%
313  0.001739  313  11.48  0.1901182  1.7%
314  0.001742  314  11.50  0.1901899  1.7%
315  0.001794  315  11.84  0.2015678  1.7%
316  0.001686  316  11.13  0.1765444  1.6%
317  0.001750  317  11.55  0.1901907  1.6%
318  0.001702  318  11.24  0.1791083  1.6%
319  0.001724  319  11.38  0.1832094  1.6%
320  0.001715  320  11.32  0.1807096  1.6%
321  0.001660  321  10.95  0.1682735  1.5%
322  0.001678  322  11.08  0.1716508  1.5%
323  0.001588  323  10.48  0.1523997  1.5%
324  0.001617  324  10.67  0.1579097  1.5%
325  0.001577  325  10.41  0.1492874  1.4%
326  0.001552  326  10.24  0.1440361  1.4%
327  0.001487  327  9.82  0.131255  1.3%
328  0.001484  328  9.80  0.1302963  1.3%
329  0.001432  329  9.45  0.1204847  1.3%
330  0.001442  330  9.52  0.1219421  1.3%
Meanwhile, the present inventors found that when a smaller amount of cfDNA is used, the collision count at 166 bp falls to about 6.06%; in other words, the proportion of collision counts increases as the amount of input cfDNA used in the NGS experiment increases (see Table 2). The smaller the amount of starting DNA, the larger the proportion of data that can be used for analysis. Accordingly, it was found that when a given amount of DNA is divided into a certain number of portions and each portion is tested with a smaller starting amount, more DNA information can be used for analysis.
DNA input amount  Fragments (genome equivalents)  Fragments in 51~330 bp range  Collision counts  Collision fragment ratio  Unique fragments  Unique fragments ratio
1 ng              330                             309                           3                 0.011                     305               0.989
5 ng              1,650                           1,543                         95                0.061                     1,448             0.939
10 ng             3,300                           3,087                         365               0.118                     2,721             0.882
20 ng             6,600                           6,173                         1,319             0.214                     4,855             0.786
40 ng             13,200                          12,346                        4,338             0.351                     8,008             0.649
100 ng            33,000                          30,866                        17,327            0.561                     13,539            0.439
150 ng            49,500                          46,299                        29,825            0.644                     16,474            0.356
200 ng            66,000                          61,732                        42,929            0.695                     18,803            0.305
Accordingly, in the method of the present application, the cfDNA sample is divided into a plurality of aliquots so that the ratio of unique fragments increases, and NGS analysis is performed on each aliquot. For example, referring to Table 2, when the starting DNA is 20 ng, 1,319 fragments (21.4%) are judged to have identical sequences by chance and cannot be used for analysis; when the starting cfDNA is instead divided into four so that each library starts from about 5 ng, only 380 (= 95 × 4) fragments (6.1%) are expected to be identical by chance. As a result, preparing the NGS libraries from four aliquots makes 93.9% of the fragments available for analysis, so that 15.2 percentage points more DNA fragment sequences can be analyzed than the 78.6% available without division.
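The arithmetic in the preceding paragraph can be reproduced directly from the Table 2 values. The following is a minimal sketch, not part of the original specification; the names `TABLE2`, `unique_ratio`, and `split_summary` are hypothetical, and the ratios are computed from the unrounded table counts, which is why the gain comes out at 15.2 rather than the 15.3 obtained by subtracting the rounded percentages.

```python
# Values copied from Table 2: input amount (ng) -> fragments in the
# 51~330 bp range and the corresponding collision counts.
TABLE2 = {
    5: {"fragments": 1543, "collisions": 95},
    20: {"fragments": 6173, "collisions": 1319},
}


def unique_ratio(fragments: int, collisions: int) -> float:
    """Fraction of fragments that are not lost to chance collisions."""
    return (fragments - collisions) / fragments


def split_summary(total_ng: int, n_aliquots: int) -> dict:
    """Compare one undivided library with n equal aliquots (hypothetical helper)."""
    per_aliquot_ng = total_ng // n_aliquots
    whole = TABLE2[total_ng]
    part = TABLE2[per_aliquot_ng]
    return {
        "collisions_undivided": whole["collisions"],
        "collisions_split": part["collisions"] * n_aliquots,
        "unique_ratio_undivided": unique_ratio(whole["fragments"], whole["collisions"]),
        "unique_ratio_split": unique_ratio(part["fragments"], part["collisions"]),
    }


summary = split_summary(20, 4)
gain_points = 100 * (summary["unique_ratio_split"] - summary["unique_ratio_undivided"])
print(summary["collisions_split"])  # 380
print(round(gain_points, 1))        # 15.2
```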
In the method of the present application, the starting cfDNA sample is divided appropriately so that DNA molecules with coincidentally identical sequences within a single library are minimized; as a result, low-frequency mutations derived from unique fragments can be detected simply by distinguishing the aliquots, without molecular barcodes.
In one embodiment, the library is prepared by appropriately dividing a specific amount of starting DNA so that the ratio of unique fragments is at least 93%.
Considering that the amount of cfDNA obtainable from a single blood draw from one patient in clinical practice is 20 ng, in one embodiment the specific amount of cfDNA is 20 ng, and the cfDNA is divided into four aliquots so that the ratio of unique fragments is 93.9%.
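As an illustration of how an aliquot count satisfying the 93% criterion might be chosen, the sketch below looks up the unique fragments ratio in Table 2 for each candidate split. It is only a sketch under stated assumptions: the function name `min_aliquots` and the candidate list are hypothetical, and only per-aliquot amounts that actually appear in Table 2 are considered (no interpolation, and Equation 1 is not re-derived here).

```python
# Unique fragments ratio by input amount in ng, copied from Table 2.
UNIQUE_RATIO_BY_NG = {
    1: 0.989, 5: 0.939, 10: 0.882, 20: 0.786,
    40: 0.649, 100: 0.439, 150: 0.356, 200: 0.305,
}


def min_aliquots(total_ng, threshold, candidates=(1, 2, 4, 8)):
    """Smallest candidate aliquot count whose per-aliquot input has a
    tabulated unique ratio >= threshold; None if no candidate qualifies."""
    for n in sorted(candidates):
        ratio = UNIQUE_RATIO_BY_NG.get(total_ng / n)
        if ratio is None:
            continue  # per-aliquot amount not tabulated; skipped rather than guessed
        if ratio >= threshold:
            return n
    return None


print(min_aliquots(20, 0.93))  # 4
```

With the Table 2 values, 20 ng divided into four 5 ng aliquots is the first split that reaches the threshold, which matches the embodiment described above.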
The next step of the method comprises, in order to distinguish the divided aliquots, preparing a library by tagging each aliquot with an adapter containing a different index, performing NGS analysis, and then integrating the NGS results of the aliquots. Herein, the index that distinguishes the aliquots is referred to as a tube barcode. The selection of adapters containing different indices and the library preparation, including the tagging method, may vary with the specific platform employed, and a person skilled in the art can select appropriate ones by referring to the Examples and other descriptions herein.
In one embodiment, sequencing is performed by multiplex sequencing on an NGS platform such as Illumina.
In the next step of the method, the NGS data of each aliquot are independently mapped to the reference genome and subjected to error correction. This step yields one bam file per aliquot. These bam files are merged into a single bam file. Genetic variants are then detected from the merged bam file using a general-purpose variant analysis program (Mutect2, Varscan, Vardict, Strelka2, etc.).
In another aspect, the present application relates to a library preparation method for increasing the ratio of unique fragments used in NGS analysis when detecting low-frequency mutations in cfDNA by NGS analysis.
In one embodiment, the method comprises: (a) providing a specific amount of a cfDNA sample; (b) calculating the total number of genomic fragments with a length of 51 bp to 330 bp among the genomic fragments contained in the specific amount of cfDNA, and the number of unique fragments, wherein the number of unique fragments is the total number of genomic fragments minus the sum of the collision counts of the genomic fragments of each length in the 51~330 bp range, and the sum of the collision counts is calculated from Equation 1; (c) calculating, from step (b), the ratio of the number of unique fragments to the total number of genomic fragments, and dividing the specific amount of the cfDNA sample into a plurality of aliquots so as to increase the ratio of unique fragments; and (d) preparing a library for NGS by tagging each of the plurality of aliquots with an adapter containing a different index.
For each step of the method, reference may be made to the description above.
Hereinafter, examples are presented to aid understanding of the present invention. However, the following examples are provided only to make the invention easier to understand, and the invention is not limited to them.
Examples
Example 1. Split-sample NGS analysis using different indices
The findings described herein were verified experimentally as follows.
The cfDNA used in the experiments was Seraseq™ ctDNA Mutation Mix v2 and Seraseq™ cfDNA Mutation Mix v2 WT, purchased from SeraCare (Milford, MA).
The experimental design is shown in the following table.
Sample ID  Input DNA (ng)  Reference material                   Number of aliquots  Dual index list
LAH001     50              Seraseq® ctDNA Mutation Mix v2 AF1%  4                   501-701, 501-702, 501-703, 501-704
LAH002     20              Seraseq® ctDNA Mutation Mix v2 AF1%  4                   502-705, 502-706, 502-707, 502-708
LAH003     10              Seraseq® ctDNA Mutation Mix v2 AF1%  4                   503-709, 503-710, 503-711, 503-712
LAH004     2               Seraseq® ctDNA Mutation Mix v2 AF1%  4                   504-701, 504-702, 504-703, 504-704
LAH005     20              Seraseq® ctDNA Mutation Mix v2 WT    4                   505-705, 505-706, 505-707, 505-708
LAH006     20              Seraseq® ctDNA Mutation Mix v2 AF1%  1                   501-701
LAH007     20              Seraseq® ctDNA Mutation Mix v2 AF1%  2                   502-701, 502-702
LAH008     20              Seraseq® ctDNA Mutation Mix v2 AF1%  4                   503-701, 503-702, 503-703, 503-704
LAH009     20              Seraseq® ctDNA Mutation Mix v2 AF1%  8                   504-701, 504-702, 504-703, 504-704, 504-705, 504-706, 504-707, 504-708
The cfDNA was subjected to quality control (QC) using a TapeStation (Agilent Inc.), and 20 ng of cfDNA, as quantified by the TapeStation, was used. An A (adenosine) was added to both ends of the cfDNA, and adapters (Illumina, Inc.) were attached by ligation. To apply the dual index method, which can prevent even trace levels of sample swapping, PCR primers (Illumina, Inc.) containing the i7 and i5 indices at the 5' end and 3' end, respectively, were annealed complementarily to the adapters. The PCR primers contain the following common sequences together with index sequences (denoted [i7] and [i5]) that distinguish each aliquot.
5'-CAAGCAGAAGACGGCATACGAGAT[i7]GTGACTGGAGTTCAGACGTGTGCTCTTCCGATC-s-T-3'
5'-AATGATACGGCGACCACCGAGATCTACAC[i5]ACACTCTTTCCCTACACGACGCTCTTCCGATC-s-T-3'
Index ID               Index sequence  Index ID               Index sequence
Dual index 501 primer  TATAGCCT        Dual index 701 primer  ATTACTCG
Dual index 502 primer  ATAGAGGC        Dual index 702 primer  TCCGGAGA
Dual index 503 primer  CCTATCCT        Dual index 703 primer  CGCTCATT
Dual index 504 primer  GGCTCTGA        Dual index 704 primer  GAGATTCC
Dual index 505 primer  AGGCGAAG        Dual index 705 primer  ATTCAGAA
Dual index 506 primer  TAATCTTA        Dual index 706 primer  GAATTCGT
Dual index 507 primer  CAGGACGT        Dual index 707 primer  CTGAAGCT
Dual index 508 primer  GTACTGAC        Dual index 708 primer  TAATGCGC
                                       Dual index 709 primer  CGGCTATG
                                       Dual index 710 primer  TCCGCGAA
                                       Dual index 711 primer  TCTCGCGC
                                       Dual index 712 primer  AGCGATAG
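The dual index lists in the experimental design (for example "503-709") can be resolved against the index table above, and the tube barcodes of one sample can be checked for uniqueness. The sketch below is illustrative only: the helper names are hypothetical, and reading the 5xx entries as i5 indices and the 7xx entries as i7 indices is an assumption based on the usual Illumina naming, not something the table states. Only the indices needed for the example are included.

```python
# Subset of the index table above (sequences copied verbatim).
# Assumption: 5xx primers carry the i5 index and 7xx primers the i7 index.
I5_INDEX = {"503": "CCTATCCT", "504": "GGCTCTGA"}
I7_INDEX = {
    "701": "ATTACTCG", "702": "TCCGGAGA", "703": "CGCTCATT", "704": "GAGATTCC",
    "705": "ATTCAGAA", "706": "GAATTCGT", "707": "CTGAAGCT", "708": "TAATGCGC",
    "709": "CGGCTATG",
}


def resolve_dual_index(pair):
    """Turn a label such as '503-709' into an (i5 sequence, i7 sequence) tuple."""
    i5_id, i7_id = pair.split("-")
    return I5_INDEX[i5_id], I7_INDEX[i7_id]


def tube_barcodes_are_distinct(pairs):
    """True when every aliquot of a sample resolves to a different index pair."""
    resolved = [resolve_dual_index(p) for p in pairs]
    return len(set(resolved)) == len(resolved)


# Dual index list of sample LAH009 (8 aliquots) from the experimental design.
lah009 = [f"504-70{i}" for i in range(1, 9)]
print(resolve_dual_index("503-709"))       # ('CCTATCCT', 'CGGCTATG')
print(tube_barcodes_are_distinct(lah009))  # True
```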
From the initially prepared library, only the genes of interest were captured using probes for 106 human genes (Celemics, Inc. custom panel; see Table 5), and the final library was prepared according to a previously described method (Kang, JK et al. Plos one 2020.May).
Gene symbol  Gene full name
ARAF         A-Raf proto-oncogene, serine/threonine kinase
ABL1         ABL proto-oncogene 1, non-receptor tyrosine kinase
AKT1         AKT serine/threonine kinase 1
AKT2         AKT serine/threonine kinase 2
APC          APC, WNT signaling pathway regulator
ARID1A       AT-rich interaction domain 1A
ATM          ATM serine/threonine kinase
BRAF         B-Raf proto-oncogene, serine/threonine kinase
BCR          BCR, RhoGEF and GTPase activating protein
BRCA1        BRCA1, DNA repair associated
BRCA2        BRCA2, DNA repair associated
BTK          Bruton tyrosine kinase
CEBPA        CCAAT/enhancer binding protein alpha
CD274        CD274 molecule
CBL          Cbl proto-oncogene
FBXW7        F-box and WD repeat domain containing 7
GNA11        G protein subunit alpha 11
GNAQ         G protein subunit alpha q
GATA3        GATA binding protein 3
GNAS         GNAS complex locus
HRAS         HRas proto-oncogene, GTPase
JAK2         Janus kinase 2
JAK3         Janus kinase 3
KIT          KIT proto-oncogene receptor tyrosine kinase
KRAS         KRAS proto-oncogene, GTPase
MDM2         MDM2 proto-oncogene
MET          MET proto-oncogene, receptor tyrosine kinase
MPL          MPL proto-oncogene, thrombopoietin receptor
PMS2         PMS1 homolog 2, mismatch repair system component
RB1          RB transcriptional corepressor 1
ROS1         ROS proto-oncogene 1, receptor tyrosine kinase
RAF1         Raf-1 proto-oncogene, serine/threonine kinase
RHEB         Ras homolog enriched in brain
RIT1         Ras like without CAAX 1
SETD2        SET domain containing 2
SMAD4        SMAD family member 4
U2AF1        U2 small nuclear RNA auxiliary factor 1
UGT1A1       UDP glucuronosyltransferase family 1 member A1
ALK          anaplastic lymphoma receptor tyrosine kinase
AR           androgen receptor
CDH1         cadherin 1
CTNNB1       catenin beta 1
CSF1R        colony stimulating factor 1 receptor
CCND1        cyclin D1
CCND2        cyclin D2
CCNE1        cyclin E1
CDK4         cyclin dependent kinase 4
CDK6         cyclin dependent kinase 6
CDKN2A       cyclin dependent kinase inhibitor 2A
DPYD         dihydropyrimidine dehydrogenase
DDR2         discoidin domain receptor tyrosine kinase 2
EGFR         epidermal growth factor receptor
ERBB2        erb-b2 receptor tyrosine kinase 2
ERBB3        erb-b2 receptor tyrosine kinase 3
ESR1         estrogen receptor 1
FGFR1        fibroblast growth factor receptor 1
FGFR2        fibroblast growth factor receptor 2
FGFR3        fibroblast growth factor receptor 3
FLT3         fms related tyrosine kinase 3
IGF1R        insulin like growth factor 1 receptor
IDH1         isocitrate dehydrogenase (NADP(+)) 1, cytosolic
IDH2         isocitrate dehydrogenase (NADP(+)) 2, mitochondrial
KEAP1        kelch like ECH associated protein 1
KDR          kinase insert domain receptor
KDM6A        lysine demethylase 6A
MTOR         mechanistic target of rapamycin
MAPK1        mitogen-activated protein kinase 1
MAPK3        mitogen-activated protein kinase 3
MAP2K1       mitogen-activated protein kinase kinase 1
MAP2K2       mitogen-activated protein kinase kinase 2
MLH1         mutL homolog 1
MSH2         mutS homolog 2
MSH6         mutS homolog 6
NRAS         neuroblastoma RAS viral oncogene homolog
NF1          neurofibromin 1
NF2          neurofibromin 2
NTRK1        neurotrophic receptor tyrosine kinase 1
NTRK2        neurotrophic receptor tyrosine kinase 2
NTRK3        neurotrophic receptor tyrosine kinase 3
NOTCH1       notch 1
NFE2L2       nuclear factor, erythroid 2 like 2
NPM1         nucleophosmin
PTEN         phosphatase and tensin homolog
PIK3CA       phosphatidylinositol-4,5-bisphosphate 3-kinase catalytic subunit alpha
PIK3R1       phosphoinositide-3-kinase regulatory subunit 1
PDGFRA       platelet derived growth factor receptor alpha
PDGFRB       platelet derived growth factor receptor beta
PDCD1LG2     programmed cell death 1 ligand 2
PPP2R1A      protein phosphatase 2 scaffold subunit Aalpha
PTPN11       protein tyrosine phosphatase, non-receptor type 11
RHOA         ras homolog family member A
RET          ret proto-oncogene
RNF43        ring finger protein 43
RUNX1        runt related transcription factor 1
STK11        serine/threonine kinase 11
SMO          smoothened, frizzled class receptor
STAG2        stromal antigen 2
TERT         telomerase reverse transcriptase
TOP2A        topoisomerase (DNA) II alpha
TCF7L2       transcription factor 7 like 2
TSC1         tuberous sclerosis 1
TSC2         tuberous sclerosis 2
TP53         tumor protein p53
MYC          v-myc avian myelocytomatosis viral oncogene homolog
MYCN         v-myc avian myelocytomatosis viral oncogene neuroblastoma derived homolog
VHL          von Hippel-Lindau tumor suppressor
The prepared library was then sequenced as 2x150 bp paired-end reads on a NextSeq 550Dx instrument (Illumina, San Diego, CA, USA) and demultiplexed with bcl2fastq (v2.19.0.316, Illumina Inc.) to generate the fastq files for each aliquot. Two fastq files, a forward and reverse pair, are generated per aliquot. Each file was processed with fastp (version 0.20.0, Shifu Chen et al.) to remove the adapter sequence read at the read ends together with the insert (DNA fragment), producing new fastq files. The adapter-trimmed fastq files (trimmed fastq) were then checked with FastQC (v0.11.8, Babraham Institute); when the per base sequence quality, overrepresented sequences, and adapter content items were 'good', the files were deemed to have passed QC and were taken to the next step.
The trimmed fastq files were mapped to the GRCh38 reference genome using the BWA-MEM algorithm of bwa (version 0.7.17-r1188) to produce bam files. gencore (version 0.14.0, Shifu Chen et al.) was used to correct sequencing errors and remove PCR duplicates in the bam files, producing new bam files (collapsed bam). Because errors have been corrected and PCR duplicates removed, the reads recorded in a collapsed bam file reflect the DNA fragments present in the blood. Up to this point, each sample yields as many bam files as the number of aliquots into which it was divided. For the next step, the aliquot bam files were merged into a single bam file using the merge function of sambamba (version 0.7.0, Artem Tarasov et al.). To determine how many DNA fragments cover the 106 captured genes on average, the per-base depth was calculated with the depth function of sambamba (version 0.7.0, Artem Tarasov et al.), and the values were averaged over all bases of the 106 genes to obtain the fragment mean depth (FMD).
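The per-aliquot processing and the final merge described above can be laid out as a sequence of command lines. The following sketch only builds the argument lists and does not run anything; all file names are hypothetical, the step that converts the bwa output into the sorted bam that gencore reads is deliberately left out, and the exact options should be checked against each tool's own help for the versions listed above.

```python
def aliquot_commands(aliquot, ref_fasta):
    """Argument lists for trimming, mapping and collapsing one aliquot.

    File naming is an assumption made for this sketch.
    """
    r1, r2 = f"{aliquot}_R1.fastq.gz", f"{aliquot}_R2.fastq.gz"
    t1, t2 = f"{aliquot}_R1.trimmed.fastq.gz", f"{aliquot}_R2.trimmed.fastq.gz"
    return [
        # Adapter trimming of the paired fastq files.
        ["fastp", "-i", r1, "-I", r2, "-o", t1, "-O", t2],
        # bwa mem writes SAM to standard output; sorting into
        # <aliquot>.sorted.bam is omitted from this sketch.
        ["bwa", "mem", ref_fasta, t1, t2],
        # Error correction and PCR duplicate collapsing.
        ["gencore", "-i", f"{aliquot}.sorted.bam",
         "-o", f"{aliquot}.collapsed.bam", "-r", ref_fasta],
    ]


def merge_command(sample, aliquots):
    """Argument list that merges the collapsed bam files of one sample."""
    inputs = [f"{a}.collapsed.bam" for a in aliquots]
    return ["sambamba", "merge", f"{sample}.merged.bam"] + inputs


aliquots = [f"LAH008_{i}" for i in range(1, 5)]  # four hypothetical aliquot names
for name in aliquots:
    for argv in aliquot_commands(name, "GRCh38.fa"):
        print(" ".join(argv))
print(" ".join(merge_command("LAH008", aliquots)))
```

Keeping the per-aliquot steps separate until the merge mirrors the point made in the text: duplicates are collapsed within an aliquot, so identical fragments that landed in different aliquots survive into the merged file.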
The results are shown in FIG. 3. When the same 20 ng of DNA was processed undivided (1 aliquot) or divided into 2, 4, or 8 aliquots, the number of unique fragments available for the test differed. The number of unique fragments saturated at 4 aliquots, and dividing further into 8 aliquots gave little additional benefit. Moreover, the more aliquots are used, the greater the experimental complexity and hence the likelihood of human error. Therefore, when the starting DNA amount is 20 ng, the amount of ctDNA information is optimized by dividing the sample into four aliquots of 5 ng each.
It was also confirmed that analyzing the DNA information obtained from the divided aliquots by the method of the present application gives a higher FMD (fragment mean depth) value than the alternative of removing all duplicates (see FIG. 4 and Table 7). That is, more DNA information can be used for testing from the same starting DNA. FMD is the average sequencing depth calculated when the unique fragments are mapped to the NGS test region; the values are shown in Table 6.
Aliquot  Fragment mean depth (FMD)
1        1,572
2        1,487
4        1,967
4        1,908
4        1,835
8        1,963
#Aliquot  FMD    FMD increase relative to 1 aliquot
1         1,572  -
2         1,487  -5.4%
4         1,967  25.1%
8         1,963  24.9%
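The percentages in the table above follow from the FMD values by a simple relative change against the undivided sample. The sketch below recomputes them; `fragment_mean_depth` illustrates the definition of FMD as the mean of per-base depths on made-up numbers, and both function names are hypothetical.

```python
# FMD by number of aliquots, copied from the table above.
FMD_BY_ALIQUOTS = {1: 1572, 2: 1487, 4: 1967, 8: 1963}


def fragment_mean_depth(per_base_depths):
    """FMD as the mean of per-base depths over the target region."""
    return sum(per_base_depths) / len(per_base_depths)


def fmd_change_percent(fmd, baseline):
    """Relative FMD change versus the baseline, in percent, one decimal."""
    return round(100 * (fmd - baseline) / baseline, 1)


baseline = FMD_BY_ALIQUOTS[1]
changes = {n: fmd_change_percent(v, baseline)
           for n, v in FMD_BY_ALIQUOTS.items() if n != 1}
print(changes)  # {2: -5.4, 4: 25.1, 8: 24.9}

# Made-up per-base depths, only to show the averaging step.
print(fragment_mean_depth([1500, 1600, 1700]))  # 1600.0
```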
The conventional library preparation method is similar to the method described above. The only difference is that in the method of the present application one sample is divided into four aliquots, each given a different index, whereas in the conventional method one sample is processed in a single tube, so the library is prepared with only one index.
Referring also to Table 8, for the same amount of input DNA (20 ng), the amount of DNA per aliquot decreases as the number of aliquots increases. The DNA in each aliquot is amplified in the pre-PCR step, and the pre-PCR yield per ng of input saturates when an aliquot contains 5 ng of DNA. Consequently, the total amount of pre-PCR DNA increased with the number of aliquots and saturated at four aliquots (see FIG. 5). This parallels the saturation of the FMD value at four aliquots and can be regarded as a particular advantage of dividing the sample into four aliquots. In addition, a single NGS experiment uses 1,000 to 2,000 ng of amplified DNA, so an undivided sample yields only enough DNA for one analysis, whereas amplification after division into aliquots yields 4,000 to 6,000 ng; the surplus can be used for retesting during later validation or for other experiments.
ID      Input DNA (ng)  #Aliquot  DNA amount (ng)/aliquot  pre-PCR DNA amount (ng)/aliquot  pre-PCR DNA amount/ng  Total pre-PCR DNA amount (ng)
LAH006  20              1         20                       1254                             62.7                   1254
LAH007  20              2         10                       1355                             135.5                  2710
LAH002  20              4         5                        1070                             214                    4280
LAH005  20              4         5                        1572.5                           314.5                  6290
LAH008  20              4         5                        1400                             280                    5600
LAH009  20              8         2.5                      622.5                            249                    4980
Example 2. Improvement of the variant allele frequency (VAF) of variants before and after error correction in analysis using the method of the present application
When molecular barcodes are used in cfDNA analysis, reads identified by their barcode as PCR duplicates of DNA derived from the same cell are compared with one another; where they differ, the erroneous base is corrected by majority vote, and a single consensus DNA sequence representing all of the PCR duplicates is generated. This approach can lower the error rate to 1/10,000 (10^-4).
The method of the present application assumes that, when the sample is divided appropriately into aliquots, DNA from different cells does not coincide by chance; all duplicates are therefore treated as PCR duplicates, and a consensus DNA sequence is generated by majority vote in the same way as with molecular barcodes. In practice, however, coincidentally identical DNA from different cells may still be present, so duplicate groups with many discordant bases are excluded from the analysis during error correction.
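The majority-vote consensus and the exclusion of duplicate groups with many discordant bases can be sketched as follows. This is a simplified illustration, not the gencore implementation: it assumes equal-length, already aligned reads, ignores base qualities, and the 10% cut-off (`max_discordant_fraction`) is a hypothetical value chosen for the example, since the specification does not state a threshold.

```python
from collections import Counter


def consensus(duplicates, max_discordant_fraction=0.1):
    """Majority-vote consensus over a group of duplicate reads.

    Returns None when the group is empty or when more than
    max_discordant_fraction of the positions show disagreement,
    i.e. the group is dropped from the analysis.
    """
    if not duplicates:
        return None
    length = len(duplicates[0])
    if any(len(read) != length for read in duplicates):
        raise ValueError("reads must have equal length")
    if length == 0:
        return ""
    bases = []
    discordant = 0
    for column in zip(*duplicates):
        counts = Counter(column)
        if len(counts) > 1:
            discordant += 1  # at least one read disagrees at this position
        bases.append(counts.most_common(1)[0][0])
    if discordant / length > max_discordant_fraction:
        return None
    return "".join(bases)


# One sequencing error in the third read is outvoted by the other two.
print(consensus(["ACGTACGTAC", "ACGTACGTAC", "ACGAACGTAC"]))  # ACGTACGTAC
# Reads that differ everywhere are treated as distinct molecules and dropped.
print(consensus(["AAAAAAAAAA", "CCCCCCCCCC", "AAAAAAAAAA"]))  # None
```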
Dividing the DNA into aliquots also has, in itself, the effect of running replicate experiments. Taking advantage of this, variants that show a statistically significant discordance between aliquots are excluded from the analysis results. As a result, the error rate could be lowered a further 10-fold, to 1/100,000 (10^-5). As shown in FIGS. 6A and 6B, many false positive (FP) variants are detected at a VAF of 1% before error correction (FIG. 6A). After error correction, in contrast, only the true positives (TP) remained and all FPs disappeared (FIG. 6B).
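The application does not name the statistical test used to compare aliquots. As one possible illustration, the Python sketch below computes a Pearson chi-square statistic for homogeneity of the variant allele fraction across aliquots and compares it with the standard 5% critical value. The test, the significance level, and all names are assumptions of this example. The chi-square approximation is also unreliable when expected counts are small, so an exact test may be preferable in practice.

```python
# Chi-square critical values at alpha = 0.05 for df = 1, 3, 7,
# i.e. for 2, 4 and 8 aliquots.
CRITICAL_05 = {1: 3.841, 3: 7.815, 7: 14.067}


def chi_square_homogeneity(alt_counts, depths):
    """Pearson chi-square statistic for equal alt-allele fraction in every aliquot."""
    if any(d <= 0 for d in depths):
        raise ValueError("every aliquot needs a positive read depth")
    total_alt = sum(alt_counts)
    total_depth = sum(depths)
    if total_alt == 0 or total_alt == total_depth:
        return 0.0  # no variation at all, nothing to compare
    p = total_alt / total_depth
    stat = 0.0
    for alt, depth in zip(alt_counts, depths):
        expected_alt = depth * p
        expected_ref = depth * (1.0 - p)
        stat += (alt - expected_alt) ** 2 / expected_alt
        stat += ((depth - alt) - expected_ref) ** 2 / expected_ref
    return stat


def is_discordant(alt_counts, depths):
    """True if the aliquots disagree significantly at the 5% level."""
    df = len(alt_counts) - 1
    return chi_square_homogeneity(alt_counts, depths) > CRITICAL_05[df]


depths = [1000, 1000, 1000, 1000]
print(is_discordant([10, 10, 10, 10], depths))  # variant seen evenly in all aliquots
print(is_discordant([40, 0, 0, 0], depths))     # variant confined to one aliquot
```

A candidate variant supported evenly in every aliquot is retained, whereas one confined to a single aliquot is flagged as discordant and would be excluded.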
Although exemplary embodiments of the present application have been described in detail above, the scope of rights of the present application is not limited thereto. Various modifications and improvements made by those skilled in the art using the basic concept of the present application, as defined in the following claims, also fall within the scope of rights of the present application.
Unless otherwise defined, all technical terms used in the present invention have the meanings commonly understood by those of ordinary skill in the art to which the present invention pertains. The contents of all publications cited herein as references are incorporated into the present invention.

Claims (8)

  1. A method for increasing the ratio of unique fragments used in NGS (Next Generation Sequencing) analysis for detecting low-frequency variants in cfDNA (cell-free DNA), the method comprising:
    (a) providing a specific amount of a cfDNA sample;
    (b) calculating the total number of genomic fragments having a length of 51 to 330 bp among the genomic fragments contained in the specific amount of cfDNA, and the number of unique fragments, wherein the number of unique fragments is the total number of genomic fragments minus the sum of the collision counts of the genomic fragments having a length of 51 to 330 bp,
    the sum of the collision counts being calculated from the following equation:
    Figure PCTKR2021011654-appb-img-000004
    wherein q(k-1;d) is the probability that, among n numbers in the range [1,d], there is a number equal to k, k is a specific number, d is the range of the numbers, and n is the number of numbers;
    (c) calculating the ratio of the number of unique fragments to the total number of genomic fragments, and dividing the specific amount of the cfDNA sample into a plurality of aliquots so as to increase the ratio of unique fragments; and
    (d) preparing a library by tagging each of the plurality of aliquots with an adapter comprising a different index, performing NGS (Next Generation Sequencing) analysis, and then integrating the NGS results of the aliquots.
  2. The method of claim 1,
    wherein the low-frequency variant has a frequency of less than 1%.
  3. The method of claim 1 or 2,
    wherein in step (d) the ratio of unique fragments is at least 93%.
  4. The method of claim 1 or 2,
    wherein the specific amount of cfDNA is 20 ng, and the cfDNA is divided into 4 aliquots such that the ratio of unique fragments is 93.9%.
  5. A library preparation method for increasing the ratio of unique fragments used in NGS (Next Generation Sequencing) analysis for detecting low-frequency variants in cfDNA (cell-free DNA), the method comprising:
    (a) providing a specific amount of a cfDNA sample;
    (b) calculating the total number of genomic fragments having a length of 51 to 330 bp among the genomic fragments contained in the specific amount of cfDNA, and the number of unique fragments, wherein the number of unique fragments is the total number of genomic fragments minus the sum of the collision counts of the genomic fragments having a length of 51 to 330 bp,
    the sum of the collision counts being calculated from the following equation:
    Figure PCTKR2021011654-appb-img-000005
    wherein q(k-1;d) is the probability that, among n numbers in the range [1,d], there is a number equal to k, k is a specific number, d is the range of the numbers, and n is the number of numbers;
    (c) calculating, from step (b), the ratio of the number of unique fragments to the total number of genomic fragments, and dividing the specific amount of the cfDNA sample into a plurality of aliquots so as to increase the ratio of unique fragments; and
    (d) preparing a library for NGS by tagging each of the plurality of aliquots with an adapter comprising a different index.
  6. The method of claim 5,
    wherein the low frequency is a frequency of less than 1%.
  7. The method of claim 5 or 6,
    wherein in step (d) the ratio of unique fragments is at least 93%.
  8. The method of claim 5 or 6,
    wherein the specific amount of cfDNA is 20 ng, and the cfDNA is divided into 4 aliquots such that the ratio of unique fragments is 93.9%.
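The equation recited in step (b) is available here only as an image reference, so it is not reproduced. As a stand-in, the Python sketch below uses the standard birthday-problem expectation to show why splitting a sample raises the unique-fragment ratio: with fewer fragments per aliquot, fewer of them share the same coordinates by chance. The uniform-occupancy model, the function names, and the numbers of fragments and coordinate combinations are hypothetical. The sketch is not the claimed formula and is not expected to reproduce the 93.9% figure of claims 4 and 8.

```python
import math


def expected_unique_fraction(n_fragments, n_bins):
    """Expected share of fragments with distinct coordinates.

    Assumes n_fragments fall uniformly at random into n_bins possible
    (start, end) combinations; the expected number of occupied bins is
    n_bins * (1 - (1 - 1/n_bins) ** n_fragments).
    """
    if n_fragments <= 0 or n_bins <= 0:
        raise ValueError("n_fragments and n_bins must be positive")
    expected_distinct = -n_bins * math.expm1(n_fragments * math.log1p(-1.0 / n_bins))
    return expected_distinct / n_fragments


def unique_fraction_per_aliquot(n_fragments, n_bins, n_aliquots):
    """Expected unique-fragment ratio within each aliquot after an even split."""
    return expected_unique_fraction(n_fragments / n_aliquots, n_bins)


# Hypothetical numbers: 6,000 overlapping fragments, 10,000 coordinate combinations.
for aliquots in (1, 2, 4, 8):
    ratio = unique_fraction_per_aliquot(6000, 10000, aliquots)
    print(f"{aliquots} aliquot(s): expected unique ratio {ratio:.3f}")
```

Because each aliquot is tagged with its own index, fragments with identical coordinates in different aliquots remain distinguishable, which is why the ratio is evaluated within each aliquot.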
PCT/KR2021/011654 2020-09-01 2021-08-31 Method for increasing ratio of intrinsic fragment used in ngs analysis for detecting low-frequency mutation of cfdna WO2022050654A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR10-2020-0110773 2020-09-01
KR1020200110773A KR102530247B1 (en) 2020-09-01 2020-09-01 Method of enhancing the proportion of the unique DNA fragment used for NGS analysis of cfDNA to detect low frequency variant

Publications (1)

Publication Number Publication Date
WO2022050654A1 (en) 2022-03-10

Family

ID=80491218

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2021/011654 WO2022050654A1 (en) 2020-09-01 2021-08-31 Method for increasing ratio of intrinsic fragment used in ngs analysis for detecting low-frequency mutation of cfdna

Country Status (2)

Country Link
KR (1) KR102530247B1 (en)
WO (1) WO2022050654A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117587131A (en) * 2024-01-09 2024-02-23 阅尔基因技术(苏州)有限公司 Detection method for dynamically monitoring ctDNA and application thereof

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102491485B1 (en) * 2022-03-21 2023-01-27 주식회사 아이엠비디엑스 Analysis Method for Copy Number Variation of Circulating Tumor DNA

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190316185A1 (en) * 2012-09-04 2019-10-17 Guardant Health, Inc. Systems and methods to detect rare mutations and copy number variation
KR20180060764A (en) * 2016-11-29 2018-06-07 연세대학교 산학협력단 Methods for detecting nucleic acid sequence variations and a device for detecting nucleic acid sequence variations using the same
WO2019204208A1 (en) * 2018-04-16 2019-10-24 Memorial Sloan Kettering Cancer Center SYSTEMS AND METHODS FOR DETECTING CANCER VIA cfDNA SCREENING
WO2020104670A1 (en) * 2018-11-23 2020-05-28 Cancer Research Technology Limited Improvements in variant detection

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
REN YONGZHE; ZHANG YANG; WANG DANDAN; LIU FENGYING; FU YING; XIANG SHAOHUA; SU LI; LI JIANCHENG; DAI HENG; HUANG BINGDING: "SinoDuplex: An Improved Duplex Sequencing Approach to Detect Low-frequency Variants in Plasma cfDNA Samples", GENOMICS PROTEOMICS AND BIOINFORMATICS, vol. 18, no. 1, 1 February 2020 (2020-02-01), CN , pages 81 - 90, XP086229745, ISSN: 1672-0229, DOI: 10.1016/j.gpb.2020.02.003 *

Also Published As

Publication number Publication date
KR20220029001A (en) 2022-03-08
KR102530247B1 (en) 2023-05-09

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21864614

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21864614

Country of ref document: EP

Kind code of ref document: A1