WO2019108014A1 - Method for measuring integrity of uid nucleic acid sequence in nucleic acid sequencing analysis - Google Patents

Publication number
WO2019108014A1
WO2019108014A1 PCT/KR2018/015086 KR2018015086W WO2019108014A1 WO 2019108014 A1 WO2019108014 A1 WO 2019108014A1 KR 2018015086 W KR2018015086 W KR 2018015086W WO 2019108014 A1 WO2019108014 A1 WO 2019108014A1
Authority
WO
WIPO (PCT)
Prior art keywords
nucleic acid
acid sequence
region
uid
polynucleotide
Prior art date
Application number
PCT/KR2018/015086
Other languages
French (fr)
Korean (ko)
Inventor
정종석
박동현
박웅양
Original Assignee
사회복지법인 삼성생명공익재단
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 사회복지법인 삼성생명공익재단
Publication of WO2019108014A1

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869: Methods for sequencing
    • C12Q1/6806: Preparing nucleic acids for analysis, e.g. for polymerase chain reaction [PCR] assay
    • C12Q2525/00: Reactions involving modified oligonucleotides, nucleic acids, or nucleotides
    • C12Q2525/10: Modifications characterised by
    • C12Q2525/191: Modifications characterised by incorporating an adaptor
    • C12Q2535/00: Reactions characterised by the assay type for determining the identity of a nucleotide base or a sequence of oligonucleotides
    • C12Q2535/122: Massive parallel sequencing
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids

Definitions

  • The disclosure relates to a polynucleotide for measuring the integrity of a UID nucleic acid sequence, and to a method for measuring the integrity of a UID nucleic acid sequence in nucleic acid sequence analysis using the polynucleotide.
  • A genome is the entire genetic information of an organism.
  • Several technologies have been developed for sequencing the genome of an individual, such as DNA chips, next-generation sequencing (NGS), and next-next-generation sequencing (NNGS).
  • NGS is widely used for research and diagnostic purposes. Although the details differ by instrument, NGS can be broadly divided into three stages: sample collection, library preparation, and nucleic acid sequence analysis. After the nucleic acid sequence analysis, the presence or absence of gene mutations is detected from the resulting sequence analysis data.
  • A number of samples may be mixed and analyzed together with one nucleic acid sequencing kit.
  • Each sample to be mixed must carry a label that distinguishes it from the other samples before mixing.
  • However, the label itself can acquire errors through polymerase errors during the polymerase chain reaction and/or detection errors during nucleic acid sequence analysis, which compromises the sequence analysis results. Therefore, there is a need for a method that can verify whether a label used to distinguish a large number of samples has bound correctly to the sample to be analyzed and labels that sample correctly.
  • One aspect provides a polynucleotide for measuring the integrity of a UID nucleic acid sequence.
  • Another aspect provides a method for measuring the integrity of the UID nucleic acid sequence in nucleic acid sequence analysis.
  • One aspect provides a polynucleotide comprising a first region of two or more consecutive nucleotides comprising a unique identification (UID) nucleic acid sequence, a second region of two or more consecutive nucleotides comprising a nucleic acid sequence that is not homologous to a reference genome, and a third region of two or more consecutive nucleotides comprising a nucleic acid sequence that is homologous to the reference genome.
  • the first region in the polynucleotide may comprise a unique identification (UID) nucleic acid sequence.
  • The UID refers to a nucleic acid fragment that serves to identify a sample during nucleic acid sequencing. That is, the UID is a marker for distinguishing different samples from each other when a plurality of samples are sequenced together. To distinguish the samples, each sample may therefore carry a UID with a different nucleic acid sequence.
  • The polynucleotide for measuring the integrity of the UID may have a UID nucleic acid sequence different from those of the one or more samples subjected to nucleic acid sequence analysis.
  • The polynucleotide may be synthesized or prepared so that a plurality of the polynucleotides share the same UID nucleic acid sequence.
  • The polynucleotides may share a single common UID nucleic acid sequence, such as AGTC, or may share more than one common UID nucleic acid sequence, for example AGTC and TGAC.
  • The term UID nucleic acid sequence may be used interchangeably with unique molecular identifier (UMI), index, or barcode.
  • The UID nucleic acid sequence may consist of the bases A, G, C, or T, but is not limited thereto. The UID nucleic acid sequence may be about 2 bp to about 40 bp, about 2 bp to about 35 bp, about 2 bp to about 30 bp, about 2 bp to about 25 bp, about 3 bp to about 20 bp, about 4 bp to about 20 bp, or about 4 bp to about 16 bp in length, but the length is not limited thereto.
  • The polynucleotide may be used in multiplexing.
  • Multiplexing means mixing two or more samples so that they can be sequenced in one nucleic acid sequencing lane or chip.
  • The integrity of the UID means the number or percentage, within the sequence analysis data, of the unique UID present in the sample.
  • The integrity of the UID may be affected by the library preparation process and/or the nucleic acid sequencing process.
  • The integrity of the UID may be expressed as a relative level.
  • The second region in the polynucleotide may comprise a nucleic acid sequence that is not homologous to the reference genome.
  • The polynucleotide may be synthesized or prepared so that a plurality of the polynucleotides share the same nucleic acid sequence having no homology with the reference genome.
  • The sequence having no homology with the reference genome is designed so that, after nucleic acid sequence analysis, the artificially added synthetic fragments (the polynucleotides for measuring the integrity of the UID) can be removed from the sequence analysis results of the original sample. To prevent the sequencing data of the synthetic fragment from being placed on the reference genome, a sequence can be separated as non-homologous if a stretch of at least 4 consecutive bp differs from the reference genome.
  • The nucleic acid sequence that is not homologous to the reference genome may be about 2 bp to about 250 bp, about 2 bp to about 40 bp, about 2 bp to about 35 bp, about 2 bp to about 30 bp, about 3 bp to about 20 bp, about 4 bp to about 20 bp, or about 4 bp to about 16 bp in length, but the length is not limited thereto.
  • The reference genome data may be obtained from a database known in the art, such as the National Center for Biotechnology Information (NCBI), Gene Expression Omnibus (GEO), Food and Drug Administration (FDA), My Cancer Genome, or TCGA, or may be obtained from a control, i.e., a biological sample of a normal individual.
  • The normal individual may be a healthy individual in whom a specific disease, for example a tumor, has not been found.
  • the reference genome may be a human reference genome, and may be hg18 or hg19.
  • Homology means the degree to which a sequence matches the nucleotide sequence of a given reference genome.
  • The third region in the polynucleotide may comprise a nucleic acid sequence that is homologous to the reference genome.
  • The nucleic acid sequence that is homologous to the reference genome may be homologous to two or more consecutive nucleotides of the nucleic acid sequence of the target region.
  • With next-generation sequencing, the whole genome can be sequenced, or only the exome or a specific region can be sequenced. The latter approach is called target sequence analysis or targeted resequencing.
  • The polynucleotide may be used in target sequence analysis.
  • The target region may be all or part of a gene of interest, and the kind of gene is not limited.
  • The second region may be located at the 5' end of the third region, the 3' end of the third region, or both the 5' end and the 3' end of the third region. The first region may be located at the 5' end of the third region, the 3' end of the third region, the 5' end of the second region, the 3' end of the second region, or both the 5' end and the 3' end of the second region.
  • The polynucleotide may comprise, in the 5' to 3' direction, a first region, a second region, and a third region, or a third region, a second region, and a first region.
  • The polynucleotide may include the UID nucleic acid sequence and/or the nucleic acid sequence having no homology with the reference genome at both ends; for example, in the 5' to 3' direction, it may comprise a first region, a second region, a third region, a second region, and a first region.
  • In this case, the two second regions may comprise the same or different nucleic acid sequences that are not homologous to the reference genome, and the two first regions may comprise the same or different UID nucleic acid sequences.
  • The first region, the second region, and the third region may be immediately adjacent to each other, or may be separated by a certain distance with another nucleic acid sequence between them.
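  • The region layouts described above can be illustrated with a short Python sketch. This is only an illustration under stated assumptions: the class name `SyntheticFragment`, its field names, and the spacer and target sequences are hypothetical placeholders (only the UID `AGTC` appears in the text), and the regions are assumed to be immediately adjacent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyntheticFragment:
    """Hypothetical container for the three regions of the polynucleotide."""
    uid: str      # first region: UID nucleic acid sequence
    spacer: str   # second region: sequence not homologous to the reference genome
    target: str   # third region: sequence homologous to the reference genome

    def sequence(self, both_ends: bool = False) -> str:
        """Return the 5'->3' sequence, assuming directly adjacent regions.

        both_ends=False: first - second - third
        both_ends=True:  first - second - third - second - first
        """
        core = self.uid + self.spacer + self.target
        if both_ends:
            core += self.spacer + self.uid
        return core


# Placeholder sequences for illustration only (not taken from Table 1).
fragment = SyntheticFragment(uid="AGTC", spacer="GTCT", target="ACGTACGTAC")
one_sided = fragment.sequence()
two_sided = fragment.sequence(both_ends=True)
```

  • Keeping the three regions as separate fields makes it easy to generate either layout from the same building blocks; intervening sequences between regions, which the text also allows, are not modeled here.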
  • Fig. 1 is an image showing the structure of a synthetic fragment for measuring the integrity of the UID (the above-described polynucleotide).
  • A plurality of synthetic fragments for measuring the integrity of the UID may include the same UID nucleic acid sequence, a nucleic acid sequence that is not homologous to the reference genome, and a nucleic acid sequence that is homologous to the reference genome.
  • The synthetic fragment for measuring the integrity of the UID may further comprise a primer and/or an adapter.
  • The synthetic fragment for measuring the integrity of the UID may be added during library preparation for one or more samples requiring nucleic acid sequence analysis and sequenced together with them. In this case, the one or more samples and the synthetic fragment may have the same primer and/or adapter nucleic acid sequences while having different UID nucleic acid sequences.
  • Another aspect provides a composition comprising the polynucleotide described above.
  • Another aspect provides a kit comprising the polynucleotide described above.
  • Another aspect provides a method for measuring the integrity of a UID in nucleic acid sequence analysis, the method comprising: preparing a first library for measuring the integrity of the UID, the first library comprising the polynucleotide described above; fragmenting a nucleic acid isolated from a biological sample and ligating a polynucleotide comprising a unique identification (UID) nucleic acid sequence to one or more ends of the fragmented nucleic acid to prepare a second library for nucleic acid sequence analysis; sequencing the first library and the second library to obtain sequence analysis data; extracting reads comprising the second region from the reads of the obtained sequence analysis data; calculating the proportion of reads comprising the first region among the extracted reads comprising the second region; and measuring the integrity of the UID from the calculated proportion.
  • the library means a nucleic acid fragment prepared in a form suitable for nucleic acid sequence analysis.
  • The library may be a fragment in which a primer, an adapter, a unique identifier, or a combination thereof is ligated to a nucleic acid fragment requiring nucleic acid sequence analysis, or an amplification product thereof.
  • The library may comprise a set of nucleic acid fragments prepared in a form suitable for nucleic acid sequence analysis before or after pre-capture polymerase chain reaction (PCR), target enrichment, or post-capture PCR.
  • the library may be, for example, a genomic library, a complementary DNA library, a randomized mutant library, or a combination thereof.
  • The method includes preparing a first library for measuring the integrity of the UID, the first library comprising the polynucleotide.
  • the first library may be prepared by ligating primers and / or adapters to one or more ends of the polynucleotide.
  • The primers and/or adapters may be adapters suitable for nucleic acid sequencing, primers for polymerase chain reaction, primers suitable for nucleic acid sequencing, regions to which primers suitable for nucleic acid sequencing can anneal, or combinations thereof.
  • the primers and / or adapters can be selected according to the nucleic acid sequencing method by a person skilled in the art.
  • The method comprises fragmenting a nucleic acid isolated from a biological sample and ligating a polynucleotide comprising a UID nucleic acid sequence to one or more ends of the fragmented nucleic acid to prepare a second library for nucleic acid sequence analysis.
  • The biological sample may be one obtained from an individual suspected of having a disease, an individual suspected of having a tumor, a normal individual, or a combination thereof, or a compound.
  • The individual may be a human, a cow, a horse, a pig, a sheep, a goat, a dog, a cat, or a rodent.
  • the biological sample may be obtained from blood, plasma, serum, urine, saliva, mucous secretion, sputum, feces, tears or a combination thereof.
  • the nucleic acid may be a genome or a fragment thereof, and may be used interchangeably with a polynucleotide having an arbitrary length.
  • The genome refers to the entirety of the chromosomes, chromatin, or genes.
  • The nucleic acid may be DNA (deoxyribonucleic acid), RNA (ribonucleic acid), or a combination thereof, and may be, for example, cell-free DNA (cfDNA).
  • the method of separating the nucleic acid from the sample can be carried out by a method known to a person skilled in the art.
  • The isolated nucleic acid may be fragmented by methods known to those of ordinary skill in the art, for example by physical, chemical, or enzymatic cleavage of the genome, such as digesting the genome with a restriction enzyme.
  • the method may comprise selecting the size of the fragmented nucleic acid.
  • the step of selecting the size may be performed by electrophoresis, centrifugation, chromatography, or a combination thereof.
  • The isolated nucleic acid fragment may be from about 10 bp to about 2000 bp, from about 15 bp to about 1500 bp, from about 20 bp to about 1000 bp, from about 20 bp to about 500 bp, or from about 20 bp to about 300 bp.
  • the second library may be prepared by ligating a polynucleotide comprising a UID nucleic acid sequence to one or more ends of the fragmented nucleic acid.
  • Each sample may have a UID nucleic acid sequence different from those of the other samples and of the synthetic fragment for measuring the integrity of the UID.
  • the second library may be prepared by ligating primers and / or adapters to one or more ends of the fragmented nucleic acid.
  • The primers and/or adapters may be adapters suitable for nucleic acid sequencing, primers for polymerase chain reaction, primers suitable for nucleic acid sequencing, regions to which primers suitable for nucleic acid sequencing can anneal, or combinations thereof.
  • the primers and / or adapters can be selected according to the nucleic acid sequencing method by a person skilled in the art.
  • The method may comprise performing target enrichment.
  • the target enrichment means increasing the frequency of the gene or other region of interest to be subjected to the nucleic acid sequence analysis.
  • The target enrichment may be carried out in a manner known to those skilled in the art, for example by in-solution capture, hybridization of a sample with a bait, polymerase chain reaction, or a combination thereof.
  • the method may include performing pre capture polymerase chain reaction (PCR) prior to target enrichment, post capture PCR after target enrichment, or a combination thereof.
  • The step of preparing the first library and the step of preparing the second library may be performed before the nucleic acid sequence analysis. Accordingly, either the first library or the second library may be prepared first, or the two libraries may be prepared simultaneously.
  • The method comprises sequencing the first library and the second library to obtain sequence analysis data.
  • the nucleic acid sequence analysis may be next generation sequencing (NGS).
  • The term nucleic acid sequence analysis may be used interchangeably with sequencing.
  • The term NGS may be used interchangeably with massively parallel sequencing or second-generation sequencing.
  • NGS is a technique for simultaneously sequencing a large number of nucleic acid fragments: fragments are prepared in a chip-based and polymerase chain reaction (PCR)-based paired-end format and are sequenced at very high speed based on hybridization of the fragments.
  • The NGS may be performed using, for example, Illumina HiSeq, Illumina HiSeq 2500, Illumina Genome Analyzer, the Solexa platform, SOLiD System (Applied Biosystems), Ion Proton (Life Technologies), Complete Genomics, Helicos Biosciences Heliscope, Pacific Biosciences single molecule real time (SMRT(TM)) technology, or a combination thereof.
  • the nucleic acid sequence analysis may be a nucleic acid sequence analysis for analyzing only the region of interest.
  • the nucleic acid sequence analysis may include, for example, NGS-based targeted sequencing, targeted deep sequencing or panel sequencing.
  • The sequence analysis data means the data obtained by the nucleic acid sequence analysis, and may include the sequence, frequency, and quality index of the individual reads for the object of the analysis.
  • The "read" means the nucleic acid sequence information of a nucleic acid fragment obtained by the nucleic acid sequence analysis, and may be data derived from nucleic acid sequence analysis or a fragment of the nucleic acid sequence.
  • The sequence analysis data may be obtained, for example, from data in BAM (binary version of SAM) format and/or Sequence Alignment/Map (SAM) format.
  • The BAM format and/or the SAM format is typically used to describe data about short reads.
  • The data in the BAM format and/or the SAM format may include data such as the start position of a read, the direction of the read, the mapping quality, a FLAG indicating the alignment status, and a CIGAR (Compact Idiosyncratic Gapped Alignment Report) string.
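  • As a minimal sketch of how these fields appear in practice, the following Python snippet splits one tab-separated SAM alignment line into the mandatory columns mentioned above. The names `SamRecord`, `parse_sam_line`, and `is_reverse_strand` are hypothetical helpers, and the example alignment line is invented for illustration; a real pipeline would normally rely on an established SAM/BAM library rather than hand-written parsing.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    """Subset of the 11 mandatory SAM columns (hypothetical helper type)."""
    qname: str   # read name
    flag: int    # bitwise FLAG describing the alignment
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position (start of the read)
    mapq: int    # mapping quality
    cigar: str   # CIGAR string
    seq: str     # read sequence


def parse_sam_line(line: str) -> SamRecord:
    """Parse one alignment line; header lines starting with '@' are not handled."""
    fields = line.rstrip("\n").split("\t")
    return SamRecord(
        qname=fields[0],
        flag=int(fields[1]),
        rname=fields[2],
        pos=int(fields[3]),
        mapq=int(fields[4]),
        cigar=fields[5],
        seq=fields[9],
    )


def is_reverse_strand(record: SamRecord) -> bool:
    """FLAG bit 0x10 marks a read aligned to the reverse strand."""
    return bool(record.flag & 0x10)


# Invented example line for illustration only.
example_line = "\t".join(
    ["read1", "16", "chr12", "25398284", "60", "12M", "*", "0", "0",
     "AGTCGTCTAAAC", "IIIIIIIIIIII"]
)
record = parse_sam_line(example_line)
```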
  • The method includes extracting reads comprising the second region from the reads of the obtained sequence analysis data. That is, only the reads that include the second region, specifically the nucleic acid sequence of the polynucleotide having no homology with the reference genome, are selected from the reads of the sequence analysis data.
  • The method includes calculating the proportion of reads comprising the first region among the extracted reads comprising the second region.
  • That is, among the extracted reads comprising the second region, the proportion of reads comprising the first region, specifically the UID nucleic acid sequence of the polynucleotide, can be calculated.
  • The method includes measuring the integrity of the UID from the calculated proportion of reads comprising the first region.
  • The extracted reads comprising the second region may all have the same UID nucleic acid sequence.
  • In this case, the integrity of the UID in the nucleic acid sequence analysis may be 100%.
  • Alternatively, some of the extracted reads may have a UID nucleic acid sequence different from that of the synthetic fragment for measuring the integrity of the UID. In this case, the integrity of the UID in the nucleic acid sequence analysis may be from 0% to less than 100%.
  • By using the synthetic fragment for measuring the integrity of the UID, it is possible to determine whether errors have arisen in the UID during nucleic acid sequence analysis and at what rate, and thereby to predict the reliability of the sequence analysis results.
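  • The extraction and ratio steps above can be sketched in Python as follows. This is an illustrative sketch, not the applicant's implementation: the function name `uid_integrity` is hypothetical, the example reads and the 4 bp tag `GTCT` are placeholders, and the UID is assumed to sit immediately 5' of the non-homologous sequence (first region, then second region). Reads in which the tag occurs too close to the read start to hold a full UID are skipped, which is a design choice of this sketch.

```python
from typing import Iterable, Optional


def uid_integrity(reads: Iterable[str], tag: str, expected_uid: str) -> Optional[float]:
    """Fraction of tag-containing reads whose preceding bases equal expected_uid.

    reads        -- read sequences (plain strings)
    tag          -- second region: sequence not homologous to the reference genome
    expected_uid -- first region: UID sequence of the synthetic fragment
    Returns None when no read contains the tag.
    """
    uid_len = len(expected_uid)
    selected = 0  # reads containing the second region
    matched = 0   # of those, reads that also carry the expected UID
    for read in reads:
        pos = read.find(tag)  # first occurrence of the tag, or -1
        if pos < uid_len:
            continue  # tag absent, or not enough bases before it for a UID
        selected += 1
        if read[pos - uid_len:pos] == expected_uid:
            matched += 1
    if selected == 0:
        return None
    return matched / selected


# Placeholder reads: two intact UIDs, one substituted UID, one read without the tag.
example_reads = [
    "AGTCGTCTAAAC",
    "AGTCGTCTAAAC",
    "AGTGGTCTAAAC",
    "TTTTTTTTTTTT",
]
integrity = uid_integrity(example_reads, tag="GTCT", expected_uid="AGTC")
```

  • In this toy input, two of the three tag-containing reads carry AGTC, so the sketch reports an integrity of two thirds; a value of 1.0 would correspond to the 100% case described above.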
  • Another aspect provides a method for measuring the integrity of a UID in nucleic acid sequence analysis, the method comprising: preparing a first library for measuring the integrity of the UID, the first library comprising the polynucleotide described above; sequencing the first library to obtain sequence analysis data; extracting reads comprising the second region from the reads of the obtained sequence analysis data; calculating the proportion of reads comprising the first region among the extracted reads comprising the second region; and measuring the integrity of the UID from the calculated proportion.
  • This method is the same as the method for measuring the integrity of the UID described above, except that it does not include sequencing one or more samples requiring nucleic acid sequence analysis.
  • This method can be performed to test the likelihood of errors arising over the whole process of library preparation and nucleic acid sequence analysis, even when there is no sample requiring nucleic acid sequence analysis.
  • Fig. 1 is an image showing the structure of a synthetic fragment for measuring the integrity of the UID, which comprises a unique identifier (UID) nucleic acid sequence, a nucleic acid sequence having no homology with the reference genome, and a nucleic acid sequence of a target region homologous to the reference genome.
  • Figs. 2A and 2B show the results of performing nucleic acid sequence analysis using synthetic fragments for measuring the integrity of the UID, selecting the reads containing the nucleic acid sequence having no homology with the reference genome, and determining the proportions of the UID of the synthetic fragment and of other UIDs among them.
  • "ss" denotes a specific sequence (ss) having no homology with the reference genome.
  • Fig. 3 is a flowchart of a procedure for performing nucleic acid sequence analysis using synthetic fragments for measuring the integrity of the UID.
  • The integrity of the UID can be measured by performing nucleic acid fragmentation, end repair, 3'-adenosine tailing, and adapter ligation; adding the synthetic fragments for measuring the integrity of the UID; and then performing pre-capture polymerase chain reaction, target enrichment, post-capture polymerase chain reaction, nucleic acid sequence analysis, FASTQ file extraction, and UID nucleic acid sequence selection.
  • Example 1 Measurement of integrity of UID in nucleic acid sequence analysis of a target region
  • For next-generation sequencing, the genes KRAS, IDH1, BRCA1, ALK, and ERBB2, which are known to carry mutations in cancer, were selected as the source of the target-region nucleic acid sequences, and sites within these genes were chosen.
  • A reference sequence of about 100 to about 350 bp was selected based on each chosen site.
  • A nucleic acid fragment comprising a unique identifier (UID) nucleic acid sequence, a nucleic acid sequence having no homology with the reference genome, and a nucleic acid sequence homologous to the reference genome, and further having Illumina P5 and P7 adapter nucleic acid sequences, was prepared (hereinafter referred to as a "synthetic fragment for measuring the integrity of the UID").
  • The selected genes, the reference sequences, the nucleotide sequences of the synthetic fragments for measuring the integrity of the UID, and the nucleic acid sequences extracted from the sequence analysis are shown in Table 1 below.
  • In Table 1, the nucleic acid sequence that is not homologous to the reference genome is the first 4 bp of the extraction sequence and is shown in bold.
  • UID nucleic acid sequences are shown in bold and underlined text in the synthetic fragments for measuring the integrity of the UID.
  • The P5 and P7 sequences are shown in italic text.
  • The sequencing primer binds after P5 and before the GTCT sequence. The P7 sequence binds the sequencing primer and the index sequencing primer.
  • The target region corresponds to the whole extraction sequence other than the 4 bp nucleic acid sequence that is not homologous to the reference genome.
  • A library for nucleic acid sequence analysis of the target region was constructed as follows.
  • 50 ng of genomic DNA (gDNA) from the NA12878 sample of the Coriell Institute was prepared.
  • The prepared gDNA sample was subjected to fragmentation, end repair, 3'-adenosine tailing (3' A-tailing), and adapter ligation using the KAPA Hyper library preparation kit for Illumina (Kapa Biosystems) according to the manufacturer's instructions, and was purified using AMPure beads (Beckman Coulter, Indiana, USA) to prepare fragments for nucleic acid sequence analysis.
  • The synthetic fragment for measuring the integrity of the UID prepared in Example 1.1 was quantified, and 5 amol of the synthetic fragment was added to the fragments for nucleic acid sequence analysis.
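  • To give a sense of scale for the 5 amol spike-in, the amount can be converted to an approximate number of molecules with the Avogadro constant. The snippet below is a simple arithmetic sketch; the function name `attomoles_to_molecules` is hypothetical and the conversion is not part of the described protocol.

```python
AVOGADRO = 6.02214076e23  # molecules per mole (exact SI value)


def attomoles_to_molecules(amount_amol: float) -> float:
    """Convert an amount in attomoles (1 amol = 1e-18 mol) to a molecule count."""
    return amount_amol * 1e-18 * AVOGADRO


# 5 amol of synthetic fragment corresponds to roughly three million molecules.
spike_in_molecules = attomoles_to_molecules(5)
```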
  • A pre-capture polymerase chain reaction (PCR) was performed on the fragments for nucleic acid sequence analysis to which the synthetic fragment for measuring the integrity of the UID had been added.
  • The completed library was purified with AMPure beads and quantified by PicoGreen fluorescence analysis using a dsDNA HS assay kit and a Qubit 2.0 fluorometer. Based on the DNA concentration and average fragment size, the library was normalized to a concentration of 2 nM.
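  • The normalization step relies on converting a mass concentration and an average fragment size into molarity. A minimal sketch of that arithmetic is shown below. It assumes the common approximation of 660 g/mol per base pair of double-stranded DNA, which is not stated in the text; the function names and the example inputs (10 ng/µL, 300 bp, 50 µL) are hypothetical.

```python
def library_molarity_nM(conc_ng_per_ul: float, mean_fragment_bp: float,
                        bp_mass_g_per_mol: float = 660.0) -> float:
    """Convert a dsDNA concentration (ng/uL) to nM for a given mean fragment size."""
    return conc_ng_per_ul * 1e6 / (bp_mass_g_per_mol * mean_fragment_bp)


def dilution_volumes(stock_nM: float, target_nM: float, final_ul: float):
    """Return (stock volume, diluent volume) in uL to reach target_nM in final_ul."""
    if target_nM > stock_nM:
        raise ValueError("stock is more dilute than the target concentration")
    stock_ul = target_nM * final_ul / stock_nM
    return stock_ul, final_ul - stock_ul


# Hypothetical library: 10 ng/uL with a mean fragment size of 300 bp.
stock = library_molarity_nM(10.0, 300.0)
stock_ul, diluent_ul = dilution_volumes(stock, target_nM=2.0, final_ul=50.0)
```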
  • The DNA was denatured using 0.2 N NaOH, and the denatured library was diluted to 20 pM in hybridization buffer (Illumina, San Diego, Calif., USA).
  • The denatured template was cluster-amplified according to the manufacturer's instructions (Illumina).
  • Flow cells were sequenced in 100 bp paired-end mode using a HiSeq 2500 v3 Sequencing-by-Synthesis kit (Illumina) and analyzed using RTA software (v.1.12.4.2 or higher).
  • The nucleotide sequences were extracted in BCL format and converted to FASTQ format files using a BCL converter.
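  • For reference, a FASTQ file stores each read as four lines: a header starting with `@`, the base sequence, a separator line starting with `+`, and a quality string. The sketch below reads such records from a text handle. The function name `read_fastq` and the example records are hypothetical, and the parser assumes unwrapped four-line records without blank lines between them.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (read name, sequence, quality) from four-line FASTQ records."""
    while True:
        header = handle.readline().strip()
        if not header:
            return  # end of file (or a blank line, treated as the end)
        sequence = handle.readline().strip()
        handle.readline()  # separator line starting with '+'
        quality = handle.readline().strip()
        yield header[1:], sequence, quality  # drop the leading '@'


# Two invented records for illustration only.
example = io.StringIO(
    "@read1\nAGTCGTCTAAAC\n+\nIIIIIIIIIIII\n"
    "@read2\nAGTGGTCTAAAC\n+\nIIIIIIIIIIII\n"
)
records = list(read_fastq(example))
```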
  • Using BWA-MEM v0.7.5, a BAM file was generated by aligning all the raw data to the hg19 human reference genome.
  • Using SAMTOOLS v0.1.18, Picard v1.93, and GATK v3.1.1, the SAM/BAM files were sorted, local realignment was performed, and duplicates were marked. Through this process, duplicates, discordant pairs, and off-target reads were removed.
  • Among the reads separated in this way, the reads having an extraction sequence set forth in Table 1, i.e., the reads containing the nucleic acid sequence that is not homologous to the reference genome, were selected. Then, the proportion of reads having the UID nucleic acid sequence, i.e., AGTC, among the selected reads was determined.
  • Figs. 2A and 2B show the results of performing nucleic acid sequence analysis using synthetic fragments for measuring the integrity of the UID, selecting the reads containing the nucleic acid sequence having no homology with the reference genome, and determining the proportions of the UID of the synthetic fragment and of other UIDs among them.
  • As shown in Figs. 2A and 2B, a number of reads having UID nucleic acid sequences other than AGTC, the UID nucleic acid sequence of the synthetic fragment, were found to be included.
  • Accordingly, when the synthetic fragment for measuring the integrity of the UID is used in preparing a library for nucleic acid sequence analysis and performing the analysis, the errors in which the UID is deleted or substituted, and the rates of such errors, can be measured.
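A tally of the kind summarized for Fig. 2 can be sketched in Python as follows. This is an illustration under assumptions, not the analysis code used in the example: the function name `uid_distribution` is hypothetical, the reads and the 4 bp tag `GTCT` are placeholders, and the UID is assumed to occupy the bases immediately 5' of the non-homologous sequence.

```python
from collections import Counter
from typing import Iterable


def uid_distribution(reads: Iterable[str], tag: str, uid_len: int) -> Counter:
    """Count the uid_len bases found immediately 5' of the tag in each read."""
    counts: Counter = Counter()
    for read in reads:
        pos = read.find(tag)  # first occurrence of the tag, or -1
        if pos >= uid_len:    # tag present with room for a full UID before it
            counts[read[pos - uid_len:pos]] += 1
    return counts


# Placeholder reads: two intact UIDs, one substitution, one shifted by a deletion.
selected_reads = [
    "AGTCGTCTAAAC",
    "AGTCGTCTAAAC",
    "AGTGGTCTAAAC",
    "TAGCGTCTAAAC",
]
counts = uid_distribution(selected_reads, tag="GTCT", uid_len=4)
fraction_expected = counts["AGTC"] / sum(counts.values())
```

Listing every observed UID, rather than only the matching fraction, shows which erroneous sequences occur and how often, which is the information needed to separate substitutions from deletions.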

Landscapes

  • Chemical & Material Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Organic Chemistry (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Analytical Chemistry (AREA)
  • Zoology (AREA)
  • Wood Science & Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Genetics & Genomics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biochemistry (AREA)
  • Immunology (AREA)
  • Microbiology (AREA)
  • Molecular Biology (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Medical Informatics (AREA)
  • Chemical Kinetics & Catalysis (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

Provided are a polynucleotide for measuring the integrity of UID and a method for measuring the integrity of UID in nucleic acid sequencing by using the same, the polynucleotide comprising: a first region including a UID nucleic acid sequence; a second region including a nucleic acid sequence having no homology with a reference genome; and a third region including a nucleic acid sequence having homology with the reference genome.

Description

Method for measuring integrity of UID nucleic acid sequence in nucleic acid sequencing analysis
The present disclosure relates to a polynucleotide for measuring the integrity of a UID nucleic acid sequence, and to a method for measuring the integrity of a UID nucleic acid sequence in nucleic acid sequencing using the polynucleotide.
A genome is the entirety of the genetic information of an organism. Various technologies, such as DNA chips, next generation sequencing (NGS), and next next generation sequencing (NNGS), are being developed for sequencing the genome of an individual. NGS is widely used for research and diagnostic purposes. Although the details differ by instrument, NGS can be broadly divided into three steps: sample collection, library preparation, and nucleic acid sequencing. After sequencing, the presence or absence of genetic variation is detected based on the sequencing data produced.
To analyze multiple samples simultaneously, the samples may be mixed and loaded into a single nucleic acid sequencing instrument. In this case, each specimen to be mixed must carry a label that distinguishes it from the other samples before mixing. The label can introduce errors into the sequencing results, owing to errors caused by the polymerase during the polymerase chain reaction and/or detection errors during sequencing, and such errors hinder the detection of variants. Therefore, a method is needed for confirming that a label capable of distinguishing multiple samples binds correctly to the sample to be analyzed and labels that sample accurately.
One aspect provides a polynucleotide for measuring the integrity of a UID nucleic acid sequence.
Another aspect provides a method for measuring the integrity of a UID nucleic acid sequence in nucleic acid sequencing.
One aspect provides a polynucleotide for measuring the integrity of a UID, the polynucleotide comprising: a first region in which two or more consecutive nucleotides comprise a unique identification (UID) nucleic acid sequence; a second region in which two or more consecutive nucleotides comprise a nucleic acid sequence that is non-homologous to a reference genome; and a third region in which two or more consecutive nucleotides comprise a nucleic acid sequence that is homologous to the reference genome.
The first region of the polynucleotide may comprise a unique identification (UID) nucleic acid sequence.
A UID refers to a nucleic acid fragment that serves to distinguish samples during nucleic acid sequencing. That is, the UID is a label for distinguishing different samples from one another when a plurality of samples are sequenced. Accordingly, to distinguish a plurality of samples, the UIDs may have nucleic acid sequences that differ between samples. When nucleic acid sequencing is performed on one or more samples requiring sequencing together with a sample for measuring UID integrity, the polynucleotide for measuring UID integrity may have a UID nucleic acid sequence different from those of the one or more samples requiring sequencing. When polynucleotides for measuring UID integrity are produced, they may be synthesized or prepared so that the plurality of polynucleotides share the same UID nucleic acid sequence. The polynucleotides may share a single identical UID nucleic acid sequence, for example AGTC, or may share one or more identical UID nucleic acid sequences, for example AGTC and TGAC. The term UID nucleic acid sequence may be used interchangeably with unique molecular identifier (UMI), index, or barcode.
The UID nucleic acid sequence may include, but is not limited to, the bases A, G, C, or T. The UID nucleic acid sequence may be about 2 bp (base pairs) to about 40 bp, about 2 bp to about 35 bp, about 2 bp to about 30 bp, about 2 bp to about 25 bp, about 3 bp to about 20 bp, about 4 bp to about 20 bp, or about 4 bp to about 16 bp in length, but its length is not limited thereto.
The polynucleotide may be for use in multiplexing. Multiplexing means mixing two or more samples so that they can be sequenced in a single nucleic acid sequencing lane or chip.
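To make the role of the UID in multiplexing concrete, the following minimal Python sketch groups pooled reads by their UID. It is only an illustration under assumptions that the text does not state: the UID is taken to be the first four bases of each read, and the UID-to-sample table and the reads themselves are hypothetical.

```python
from collections import defaultdict

def demultiplex(reads, uid_to_sample, uid_len=4):
    """Group pooled reads by the sample whose UID matches the first `uid_len` bases."""
    groups = defaultdict(list)
    for seq in reads:
        # Reads whose leading bases match no known UID are kept apart.
        sample = uid_to_sample.get(seq[:uid_len], "unassigned")
        groups[sample].append(seq)
    return dict(groups)

# Hypothetical UID assignments: one UID for the integrity control, one for a sample.
uid_to_sample = {"AGTC": "integrity_control", "TGAC": "sample_1"}
pooled_reads = ["AGTCAAAA", "TGACCCCC", "AGTCGGGG", "NNNNTTTT"]
groups = demultiplex(pooled_reads, uid_to_sample)
```

A read whose UID was altered during library preparation or sequencing would land in the wrong group or in "unassigned", which is the kind of error the integrity measurement is meant to quantify.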
The integrity of the UID means the number or proportion of unique UIDs present for a given sample in the sequencing data. UID integrity may be affected by the library preparation process and/or the nucleic acid sequencing process. UID integrity may be expressed as a relative level.
The second region of the polynucleotide may comprise a nucleic acid sequence that is non-homologous to the reference genome.
So that the sequencing data for the one or more samples requiring sequencing are affected minimally or not at all by the sequencing data for the sample used to measure UID integrity, and so that the latter can be clearly distinguished from the samples requiring sequencing, the second region of the polynucleotide may comprise a nucleic acid sequence having no homology with the reference genome. When polynucleotides for measuring UID integrity are produced, they may be synthesized or prepared so that the plurality of polynucleotides share the same nucleic acid sequence having no homology with the reference genome.
The sequence having no homology with the reference genome serves to remove the artificially spiked synthetic fragments (the polynucleotides for measuring UID integrity) from the sequencing results of the sample that is the original subject of sequencing. To keep the sequencing data of these fragments from being placed on the reference genome, the non-homologous sequence can be separated out if a stretch of at least 4 bp of consecutive nucleotides differs from the reference genome. The nucleic acid sequence having no homology with the reference genome may be about 2 bp (base pairs) to about 250 bp, about 2 bp to about 40 bp, about 2 bp to about 35 bp, about 2 bp to about 30 bp, about 2 bp to about 25 bp, about 3 bp to about 20 bp, about 4 bp to about 20 bp, or about 4 bp to about 16 bp in length, but its length is not limited thereto.
Reference genome data may be obtained from databases already known in the art, such as NCBI (National Center for Biotechnology Information), GEO (Gene Expression Omnibus), FDA (Food and Drug Administration), My Cancer Genome, and TCGA (The Cancer Genome Atlas), or may be obtained from a control, that is, a biological sample from a normal individual. The normal individual may be a healthy person in whom a specific disease, for example a tumor, has not been found. The reference genome may be a human reference genome, and may be hg18 or hg19. Homology means the degree of agreement with the base sequence of a given reference genome.
The third region of the polynucleotide may comprise a nucleic acid sequence having homology with the reference genome. This allows aligned or mapped sequencing data to be obtained in nucleic acid sequencing.
The nucleic acid sequence having homology with the reference genome may be homologous to two or more consecutive nucleotides of the nucleic acid sequence of a target region. To find the causative gene of a disease, next generation sequencing may be used to sequence the whole genome, or to sequence only the exome or a specific region. The latter approach is called targeted sequencing or targeted resequencing. The polynucleotide may be for use in targeted sequencing. The target region may be all or part of a gene of interest, and the type of gene is not limited.
The second region may be located at the 5' end of the third region, the 3' end of the third region, or both the 5' end and the 3' end of the third region, and the first region may be located at the 5' end of the third region, the 3' end of the third region, the 5' end of the second region, the 3' end of the second region, or both the 5' end and the 3' end of the second region. For example, the polynucleotide may comprise, in the 5' to 3' direction, the first region, the second region, and the third region, or the third region, the second region, and the first region. Alternatively, the polynucleotide may comprise a UID nucleic acid sequence and/or a nucleic acid sequence having no homology with the reference genome at both ends, for example comprising, in the 5' to 3' direction, a first region, a second region, a third region, a second' region, and a first' region. The second' region may comprise a nucleic acid sequence that is the same as or different from that of the second region and has no homology with the reference genome, and the first' region may comprise a UID nucleic acid sequence that is the same as or different from that of the first region. The first, second, and third regions may be immediately adjacent to one another, or may be spaced apart by further including any other nucleic acid sequence between them.
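The region arrangements described above can be illustrated with a small Python sketch that assembles a fragment in the 5' to 3' direction. The class name and all three region sequences are hypothetical placeholders chosen for illustration; the two-ended variant simply reuses the same second-region and UID sequences, which is one of the options the text permits (the primed regions may be the same as or different from the originals).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SyntheticFragment:
    uid: str             # first region: UID nucleic acid sequence
    non_homologous: str  # second region: no homology with the reference genome
    homologous: str      # third region: homologous to the reference genome

    def sequence(self, both_ends=False):
        """Return the 5'->3' sequence: region 1, region 2, region 3.

        With both_ends=True, a second' and a first' region are appended,
        here taken to be identical to regions 2 and 1 (an assumption).
        """
        seq = self.uid + self.non_homologous + self.homologous
        if both_ends:
            seq += self.non_homologous + self.uid
        return seq

# Placeholder sequences; none of these come from the application.
fragment = SyntheticFragment(uid="AGTC", non_homologous="TTACGGCATTACG", homologous="GGCTGA")
one_ended = fragment.sequence()
two_ended = fragment.sequence(both_ends=True)
```

Spacer sequences between regions, which the text also allows, are left out to keep the sketch short.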
Fig. 1 is an image showing the structure of the synthetic fragment for measuring UID integrity (the polynucleotide described above). As shown in Fig. 1, a plurality of synthetic fragments for measuring UID integrity may comprise the same UID nucleic acid sequence, a nucleic acid sequence having no homology with the reference genome, and a nucleic acid sequence having homology with the reference genome. The synthetic fragment for measuring UID integrity may further comprise a primer and/or an adapter. The synthetic fragment may be added at the library preparation step for one or more samples requiring nucleic acid sequencing, and sequenced together with them. In this case, the one or more samples requiring sequencing and the sample for measuring UID integrity may have different UID nucleic acid sequences while having the same primer and/or adapter nucleic acid sequences.
Another aspect provides a composition comprising the polynucleotide described above.
Another aspect provides a kit comprising the polynucleotide described above.
Another aspect provides a method for measuring UID integrity in nucleic acid sequencing, the method comprising: preparing a first library for measuring UID integrity, the first library comprising the polynucleotide described above; fragmenting nucleic acid isolated from a biological sample and ligating a polynucleotide comprising a unique identification (UID) nucleic acid sequence to one or more ends of the fragmented nucleic acid to prepare a second library for nucleic acid sequencing; sequencing the first library and the second library to obtain sequencing data; extracting, from the reads of the obtained sequencing data, the reads comprising the second region; calculating, among the extracted reads comprising the second region, the proportion of reads comprising the first region; and measuring UID integrity from the calculated proportion of reads comprising the first region.
The library means nucleic acid fragments prepared in a form suitable for nucleic acid sequencing, and may be nucleic acid fragments requiring sequencing to which a primer, an adapter, a unique identifier, or a combination thereof has been ligated, either before amplification or as an amplification product. For example, the library may be a set of nucleic acid fragments prepared in a form suitable for sequencing before or after pre-capture PCR (pre-capture polymerase chain reaction), target enrichment, or post-capture PCR.
The library may be, for example, a genomic library, a complementary DNA library, a randomized mutant library, or a combination thereof.
The method includes preparing a first library for measuring UID integrity, the first library comprising the polynucleotide. The first library may be prepared by ligating a primer and/or an adapter to one or more ends of the polynucleotide. The primer and/or adapter may comprise an adapter suitable for nucleic acid sequencing, a primer for the polymerase chain reaction, a primer suitable for nucleic acid sequencing, a region to which a primer suitable for nucleic acid sequencing can anneal, or a combination thereof. The primer and/or adapter can be selected by a person skilled in the art according to the sequencing method.
The method includes fragmenting nucleic acid isolated from a biological sample and ligating a polynucleotide comprising a UID nucleic acid sequence to one or more ends of the fragmented nucleic acid to prepare a second library for nucleic acid sequencing.
The biological sample may be obtained from an individual suspected of having a disease, an individual suspected of having a tumor, a normal individual, or a combination thereof, or may be a synthetic material. The individual may be a human, cow, horse, pig, sheep, goat, dog, cat, or rodent. The biological sample may be obtained from blood, plasma, serum, urine, saliva, mucosal secretions, sputum, feces, tears, or a combination thereof.
The nucleic acid may be a genome or a fragment thereof, and the term may be used interchangeably with a polynucleotide of any length. The genome refers to the entirety of the chromosomes, chromatin, or genes. The nucleic acid may be DNA (deoxyribonucleic acid), RNA (ribonucleic acid), or a combination thereof, and may be, for example, cell-free DNA (cfDNA).
Isolation of the nucleic acid from the sample may be performed by a method known to a person skilled in the art.
Fragmentation of the isolated nucleic acid may be performed by a method known to a person skilled in the art, and may involve physically, chemically, or enzymatically cleaving the genome, for example cleaving the genome with a restriction enzyme.
The method may include selecting the fragmented nucleic acid by size. Size selection may be performed by electrophoresis, centrifugation, chromatography, or a combination thereof. The isolated nucleic acid fragments may be about 10 bp to about 2000 bp, about 15 bp to about 1500 bp, about 20 bp to about 1000 bp, about 20 bp to about 500 bp, or about 20 bp to about 300 bp in length.
The second library may be prepared by ligating a polynucleotide comprising a UID nucleic acid sequence to one or more ends of the fragmented nucleic acid. When there are one or more samples requiring nucleic acid sequencing, any one sample may have a UID nucleic acid sequence different from those of the other samples and of the sample for measuring UID integrity.
The second library may be prepared by ligating a primer and/or an adapter to one or more ends of the fragmented nucleic acid. The primer and/or adapter may comprise an adapter suitable for nucleic acid sequencing, a primer for the polymerase chain reaction, a primer suitable for nucleic acid sequencing, a region to which a primer suitable for nucleic acid sequencing can anneal, or a combination thereof. The primer and/or adapter can be selected by a person skilled in the art according to the sequencing method.
The method may include a target enrichment step. Target enrichment means increasing the frequency of the genes or other regions of interest to be sequenced. Target enrichment may be performed by a method known to a person skilled in the art, for example by in-solution capture, in which the sample is hybridized with baits, by the polymerase chain reaction, or by a combination thereof. The method may include performing pre-capture PCR before target enrichment, post-capture PCR after target enrichment, or a combination thereof.
It suffices that the step of preparing the first library and the step of preparing the second library are performed before nucleic acid sequencing. Accordingly, the first library may be prepared first, the second library may be prepared first, or the first library and the second library may be prepared simultaneously.
The method includes sequencing the first library and the second library to obtain sequencing data.
The nucleic acid sequencing may be next generation sequencing (NGS). The term nucleic acid sequencing may be used interchangeably with base sequence analysis, sequence analysis, or sequencing. NGS may be used interchangeably with massive parallel sequencing or second-generation sequencing. NGS is a technique for sequencing a large number of nucleic acid fragments simultaneously, and may involve fragmenting the whole genome in a chip-based and polymerase chain reaction (PCR)-based paired-end format and sequencing the fragments at very high speed based on hybridization. The NGS may be performed by, for example, the 454 platform (Roche), GS FLX Titanium, Illumina MiSeq, Illumina HiSeq, Illumina HiSeq 2500, Illumina Genome Analyzer, the Solexa platform, the SOLiD System (Applied Biosystems), Ion Proton (Life Technologies), Complete Genomics, Helicos Biosciences Heliscope, single molecule real-time (SMRT™) technology from Pacific Biosciences, or a combination thereof.
The nucleic acid sequencing may be a sequencing method for analyzing only a region of interest. The nucleic acid sequencing may include, for example, NGS-based targeted sequencing, targeted deep sequencing, or panel sequencing.
The sequencing data means the data obtained by the nucleic acid sequencing, and may include the base sequence, frequency, and quality indicators of the individual reads for the subject of sequencing. A read means the nucleic acid sequence information of a nucleic acid fragment obtained by sequencing, and may be data output by sequencing or a piece of a nucleic acid sequence. The sequencing data may be obtained, for example, from data in BAM (binary version of SAM) format and/or SAM (Sequence Alignment/Map) format. The BAM and/or SAM formats are typically used to describe data on short reads. Data in BAM and/or SAM format may include text data on the start position of a read, the direction of a read, the mapping quality, the FLAG describing the alignment, the CIGAR (Compact Idiosyncratic Gapped Alignment Report) string, and the like. By generating various alignment pairs, various supporting reads can be secured.
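As an illustration of the SAM fields mentioned above, the following minimal Python sketch splits one alignment line into its mandatory tab-separated columns (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL) and derives the read direction from the FLAG bits. The example record is invented, and in practice a dedicated SAM/BAM library would normally be used instead of hand parsing.

```python
def parse_sam_line(line):
    """Parse the mandatory columns of a single SAM alignment line into a dict."""
    fields = line.rstrip("\n").split("\t")
    flag = int(fields[1])
    return {
        "name": fields[0],                 # QNAME: read name
        "flag": flag,                      # FLAG: bitwise alignment flags
        "reference": fields[2],            # RNAME: reference sequence name
        "position": int(fields[3]),        # POS: 1-based leftmost mapped position
        "mapq": int(fields[4]),            # MAPQ: mapping quality
        "cigar": fields[5],                # CIGAR string
        "sequence": fields[9],             # SEQ: read bases
        "is_unmapped": bool(flag & 0x4),   # bit 0x4: read is unmapped
        "is_reverse": bool(flag & 0x10),   # bit 0x10: read maps to the reverse strand
    }

# Invented example record: a 10-base read mapped to the reverse strand of chr1.
example = "read1\t16\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII"
record = parse_sam_line(example)
```

Reads whose second region has no homology with the reference genome would be expected to show up as unmapped or clipped records, depending on the aligner and its settings.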
The method includes extracting, from the reads of the obtained sequencing data, the reads comprising the second region. Among the reads of the sequencing data, only the reads comprising the second region, specifically the reads comprising the nucleic acid sequence of the polynucleotide that has no homology with the reference genome, can be selected.
The method includes calculating, among the extracted reads comprising the second region, the proportion of reads comprising the first region. Among the extracted reads comprising the second region, the proportion of reads comprising the first region, specifically reads having the UID nucleic acid sequence of the polynucleotide, can be calculated.
The method includes measuring UID integrity from the calculated proportion of reads comprising the first region. If no error in which part or all of the UID is deleted, substituted, removed, or replaced occurs during nucleic acid sequencing, all of the extracted reads comprising the second region will have the same UID nucleic acid sequence. In this case, the UID integrity of that sequencing run may be 100%. However, if such an error does occur, some or all of the extracted reads comprising the second region may have a UID nucleic acid sequence different from that of the polynucleotide, that is, the synthetic fragment for measuring UID integrity. In this case, the UID integrity of that sequencing run may be at least 0% and less than 100%.
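The calculation in the last three steps amounts to a simple proportion: the number of second-region reads that carry the intended UID divided by the number of all second-region reads. A minimal Python sketch follows; the function name and the counts are invented for illustration and are not results from the application.

```python
def uid_integrity(uid_counts, expected_uid):
    """Return UID integrity as a percentage.

    uid_counts maps each UID observed in reads containing the second region
    to its read count; expected_uid is the UID built into the synthetic fragment.
    """
    total = sum(uid_counts.values())
    if total == 0:
        raise ValueError("no reads containing the second region were found")
    return 100.0 * uid_counts.get(expected_uid, 0) / total

# Invented counts: 950 reads with the intended UID, 50 reads with an altered UID.
observed = {"AGTC": 950, "AGAC": 30, "AGT": 20}
integrity = uid_integrity(observed, "AGTC")
```

A run with no UID errors gives 100, and any deleted, substituted, or replaced UID among the second-region reads pulls the value below 100.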
When the synthetic fragment for measuring UID integrity of the present invention is used, the presence of UID-related errors in the nucleic acid sequencing results and the rate at which they occur can be measured, so the accuracy and reliability of the overall process of sequencing library preparation and nucleic acid sequencing can be predicted.
Another aspect provides a method for measuring UID integrity in nucleic acid sequencing, the method comprising: preparing a first library for measuring UID integrity that includes the polynucleotide described above; sequencing the first library to obtain nucleic acid sequencing data; extracting, from the reads of the obtained sequencing data, the reads that contain the second region; calculating the proportion of reads that contain the first region among the extracted reads containing the second region; and measuring the integrity of the UID from the calculated proportion of reads containing the first region.
This method is the same as the method for measuring UID integrity described above, except that nucleic acid sequencing of one or more samples that require sequencing is not performed. It can be carried out even when there is no sample requiring sequencing, in order to test the probability that errors occur across the overall process of library preparation and nucleic acid sequencing.
Provided are a polynucleotide for measuring the integrity of a UID, comprising a first region that includes a UID nucleic acid sequence, a second region that includes a nucleic acid sequence having no homology to a reference genome, and a third region that includes a nucleic acid sequence having homology to the reference genome, and a method for measuring UID integrity in nucleic acid sequencing using the polynucleotide. Accordingly, it can be confirmed whether the labels that distinguish multiple samples during nucleic acid sequencing are correctly attached to the samples under analysis and accurately label those samples.
FIG. 1 shows the structure of the synthetic fragment for measuring UID integrity, which includes a unique identifier (UID) nucleic acid sequence, a nucleic acid sequence having no homology (non-homologous) to the reference genome, and a nucleic acid sequence of a target region having homology to the reference genome.
Panels A and B of FIG. 2 show the results of performing nucleic acid sequencing with the synthetic fragments for measuring UID integrity, selecting the reads that contain the nucleic acid sequence having no homology to the reference genome, and determining the proportion of reads carrying a UID different from the UID of the synthetic fragment. In FIGS. 1 and 2, ss denotes a specific nucleic acid sequence having no homology to the reference genome (specific sequence with non-homologous human reference: ss).
FIG. 3 is a flowchart of nucleic acid sequencing using the synthetic fragment for measuring UID integrity. In the method for measuring UID integrity according to one embodiment, nucleic acid fragmentation, end repair, 3'-adenosine tailing, and adapter ligation are performed, the synthetic fragment for measuring UID integrity is added, and then pre-capture polymerase chain reaction, target enrichment, post-capture polymerase chain reaction, nucleic acid sequencing, fastq file extraction, and UID nucleic acid sequence selection are performed, whereby UID integrity can be measured.
Hereinafter, the present invention is described in more detail through examples. However, these examples are intended to illustrate the present invention, and the scope of the present invention is not limited to them.
Example 1. Measurement of UID integrity in nucleic acid sequencing of target regions
1. Preparation of synthetic fragments for measuring UID integrity
For next generation sequencing (NGS), the genes KRAS, IDH1, BRCA1, ALK, and ERBB2, which are known to carry mutations in cancer, and regions of these genes were selected as the target-region nucleic acid sequences. Reference sequences of about 100 to about 350 bp were selected based on the chosen positions.
Synthetic fragments (hereinafter, "synthetic fragments for measuring UID integrity") were prepared that include a unique identifier (UID) nucleic acid sequence, a nucleic acid sequence having no homology (non-homologous) to the reference genome, and a nucleic acid sequence of a target region having homology to the reference genome, and that carry the Illumina P5 adapter and P7 adapter nucleic acid sequences at their two ends.
The selected genes, the reference sequences, the nucleic acid sequences of the synthetic fragments for measuring UID integrity, and the nucleic acid sequences extracted from the sequencing results are shown in Table 1 below. In Table 1, the nucleic acid sequence having no homology to the reference genome is the first 4 bp of the extracted sequence and is shown in bold. The UID nucleic acid sequence is shown in bold and underlined in the synthetic fragment for measuring UID integrity. The P5 and P7 sequences are shown in italics. The sequencing primer binds to the region from after P5 up to the GTCT sequence, and the sequencing primer and the index sequencing primer bind to the region from after the extracted sequence up to P7. The target region corresponds to the entire extracted sequence other than the 4 bp nucleic acid sequence having no homology to the reference genome.
Table 1
No. | Target region of the reference genome | Synthetic fragment for measuring UID integrity (UID nucleic acid sequence: AGTC) | Extracted sequence
1 | KRAS, Chr12, exon 3, chr12:25380168-25380346 | 5'-AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTGTCTATCCTGAGAAGGGAGAAACACAGTCTGGATTATTACAGTGCACCTTTTACTTCAAAAAAGGTGTTATATACAACTCAACAACAAAAAATTCAATTTAAAAATGGGCAAAGGACTTGAAAAGACATTGTTCCTGCTCCAAAGATCTGAGAGAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCACG AGTC ATCTCGTATGCCGTCTTCTGCTTG-3' (SEQ ID NO: 1) | 5'-GTCTATCCTGAGAAGGGAGAAACACAGTCTGGATTATTACAGTGCACCTTTTACTTCAAAAAAGGTGTTATATACAACTCAACAACAAAAAATTCAATTTAAAAATGGGCAAAGGACTTGAAAAGACATTGTTCCTGCTCCAAAGATCTG-3' (SEQ ID NO: 2)
2 | IDH1, Chr2, exon 4, chr2:209113048-209113359 | 5'-AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTGTCTAATGGCTTCTCTGAAGACCGTGCCACCCAGAATATTTCGTATGGTGCCATTTGGTGATTTCCACATTTGTTTCAACTTGAACTCCTCAACCCTCTTCTCATCAGGAGTGATAGTGGCACATTTGACGCCAACATTATGCTTCTCTGAGAGAGATCGGAAGAGCACACGTCTGAACTCC AGTC ACCACGAGTCATCTCGTATGCCGTCTTCTGCTTG-3' (SEQ ID NO: 3) | 5'-GTCTAATGGCTTCTCTGAAGACCGTGCCACCCAGAATATTTCGTATGGTGCCATTTGGTGATTTCCACATTTGTTTCAACTTGAACTCCTCAACCCTCTTCTCATCAGGAGTGATAGTGGCACATTTGACGCCAACATTATGCTTCTCTG-3' (SEQ ID NO: 4)
3 | BRCA1, Chr17, exon 15, chr17:41222945-41223255 | 5'-AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTGTCTTTCTGGCTTCTCCCTGCTCACACTTTCTTCCATTGCATTATACCCAGCAGTATCAGTAGTATGAGCAGCAGCTGGACTCTGGGCAGATTCTGCAACTTTCAACTTTCAATTGGGGAACTTTCAATGCAGAGGTTGAAGATGGTCTGAGAGAGATCGGAAGAGCACACGTCTGAACTCC AGTC ACCACGAGTCATCTCGTATGCCGTCTTCTGCTTG-3' (SEQ ID NO: 5) | 5'-GTCTTTCTGGCTTCTCCCTGCTCACACTTTCTTCCATTGCATTATACCCAGCAGTATCAGTAGTATGAGCAGCAGCTGGACTCTGGGCAGATTCTGCAACTTTCAACTTTCAATTGGGGAACTTTCAATGCAGAGGTTGAAGATGGTCTG-3' (SEQ ID NO: 6)
4 | ALK, Chr2, exon 20, chr2:29446208-29446394 | 5'-AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTGTCTACTGATGGAGGAGGTCTTGCCAGCAAAGCAGTAGTTGGGGTTGTAGTCGGTCATGATGGTCGAGGTGCGGAGCTTGCTCAGCTTGTACTCAGGGCTCTGCAGCTCCATCTGCATGGCTTGCAGCTCCTGGTGCTTCCGGCGGTCTGAGAGAGATCGGAAGAGCACACGTCTGAACTCC AGTC ACCACGAGTCATCTCGTATGCCGTCTTCTGCTTG-3' (SEQ ID NO: 7) | 5'-GTCTACTGATGGAGGAGGTCTTGCCAGCAAAGCAGTAGTTGGGGTTGTAGTCGGTCATGATGGTCGAGGTGCGGAGCTTGCTCAGCTTGTACTCAGGGCTCTGCAGCTCCATCTGCATGGCTTGCAGCTCCTGGTGCTTCCGGCGGTCTG-3' (SEQ ID NO: 8)
5 | ERBB2, Chr17, exon 6, chr17:37864574-37864787 | 5'-AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTGTCTGCTACGTGCTCATCGCTCACAACCAAGTGAGGCAGGTCCCACTGCAGAGGCTGCGGATTGTGCGAGGCACCCAGCTCTTTGAGGACAACTATGCCCTGGCCGTGCTAGACAATGGAGACCCGCTGAACAATACCACCCCTGTTCTGAGAGAGATCGGAAGAGCACACGTCTGAACTCC AGTC ACCACGAGTCATCTCGTATGCCGTCTTCTGCTTG-3' (SEQ ID NO: 9) | 5'-GTCTGCTACGTGCTCATCGCTCACAACCAAGTGAGGCAGGTCCCACTGCAGAGGCTGCGGATTGTGCGAGGCACCCAGCTCTTTGAGGACAACTATGCCCTGGCCGTGCTAGACAATGGAGACCCGCTGAACAATACCACCCCTGTTCTG-3' (SEQ ID NO: 10)
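The layout of the synthetic fragments listed above can be illustrated with a short Python sketch. It assembles a toy fragment in the order the description gives (P5-side sequence, 4 bp non-homologous tag GTCT, target region, P7-side sequence carrying the UID) and then recovers the tag and the target from an extracted sequence. All flanking and target sequences below are shortened placeholders rather than the SEQ ID NOs, and the names `build_fragment` and `split_extracted` are hypothetical.

```python
TAG = "GTCT"  # 4 bp sequence with no homology to the reference genome
UID = "AGTC"  # UID nucleic acid sequence stated for the synthetic fragments


def build_fragment(p5_side, target, p7_prefix, p7_suffix):
    """Assemble a toy fragment: P5 side + tag + target + P7 side with the UID."""
    return p5_side + TAG + target + p7_prefix + UID + p7_suffix


def split_extracted(extracted):
    """Split an extracted sequence into (tag, target); raise if the tag is absent."""
    if not extracted.startswith(TAG):
        raise ValueError("extracted sequence does not start with the tag")
    return extracted[:len(TAG)], extracted[len(TAG):]


# Placeholder sequences, far shorter than the real adapters and targets.
fragment = build_fragment("AAAACCCC", "ATCCTGAGAAGG", "CACG", "TTTT")
extracted = TAG + "ATCCTGAGAAGG"
tag, target = split_extracted(extracted)
```

In this toy layout the extracted sequence is a substring of the fragment, and removing its first 4 bp leaves only the target region, mirroring the relationship between the two sequence columns.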
2. Library preparation and nucleic acid sequencing for target-region sequencing
A library for nucleic acid sequencing of the target regions was prepared as follows.
50 ng of genomic DNA (gDNA) from the NA12878 sample of the Coriell Institute was prepared. Using the KAPA Hyper Illumina library preparation kit (Kapa Biosystems) according to the manufacturer's protocol, the gDNA sample was subjected to fragmentation, end repair, 3'-adenosine tailing (3' A-tailing), and adapter ligation, and was then purified with AMPure beads (Beckman Coulter, Indiana, USA) to produce fragments for nucleic acid sequencing.
The synthetic fragments for measuring UID integrity prepared in Example 1.1 were quantified, and 5 amol of the synthetic fragments was added to the fragments for nucleic acid sequencing. Pre-capture polymerase chain reaction (PCR) was performed on the sequencing fragments to which the synthetic fragments had been added. The target regions were then enriched (target enrichment) according to the SureSelect bait hybridization protocol, modified by replacing the blocking oligonucleotides for the pre-indexed adapters with IDT xGen blocking oligonucleotides (IDT, Santa Clara, CA, USA). Post-capture PCR was then performed on the target-enriched fragments to complete the library for nucleic acid sequencing.
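For a sense of scale, the 5 amol spike-in can be converted to an approximate number of molecules using the Avogadro constant. The Python sketch below is illustrative arithmetic only; the function name `amol_to_molecules` is hypothetical and the figure is not stated in the source.

```python
AVOGADRO = 6.02214076e23  # molecules per mole (exact SI value)


def amol_to_molecules(amol):
    """Convert attomoles (1 amol = 1e-18 mol) to a number of molecules."""
    return amol * 1e-18 * AVOGADRO


# 5 amol of synthetic fragments corresponds to roughly 3 million molecules.
spike_in_molecules = amol_to_molecules(5)
```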
The completed library was purified with AMPure beads and quantified by PicoGreen fluorescence assay using a dsDNA HS assay kit and a Qubit 2.0 fluorometer. Based on the DNA concentration and the average fragment size, the library was normalized to a concentration of 2 nM. The DNA was denatured with 0.2 N NaOH, and the denatured library was diluted to 20 pM in hybridization buffer (Illumina, San Diego, CA, USA). The denatured template was subjected to cluster amplification according to the manufacturer's (Illumina) instructions. Flow cells were sequenced in 100 bp paired-end mode using the HiSeq 2500 v3 Sequencing-by-Synthesis kit (Illumina) and analyzed with RTA software (v.1.12.4.2 or later). Base calls were extracted in BCL format and converted to fastq files with a bcl converter. All raw data were aligned to the hg19 human reference genome with BWA-mem (v0.7.5) to generate BAM files. SAMTOOLS (v0.1.18), Picard (v1.93), and GATK (v3.1.1) were used to sort the SAM/BAM files, perform local realignment, and mark duplicates. Through this processing, duplicates, discordant pairs, and off-target reads were removed. The UID nucleic acid sequence, the forward read (r1), and the reverse read (r2) were separated.
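The normalization to 2 nM from DNA concentration and average fragment size can be sketched as follows. The source does not give the conversion it used; this example assumes the commonly used approximation of 660 g/mol per base pair of double-stranded DNA, and the concentration and fragment size shown are hypothetical readings, not data from the example. The function names are likewise illustrative.

```python
def library_molarity_nm(conc_ng_per_ul, mean_fragment_bp, g_per_mol_per_bp=660.0):
    """Approximate dsDNA molarity in nM from mass concentration and mean size.

    ng/uL equals 1e-3 g/L; dividing by the molar mass (g/mol) gives mol/L,
    and multiplying by 1e9 converts to nM, hence the overall factor of 1e6.
    """
    return conc_ng_per_ul * 1e6 / (g_per_mol_per_bp * mean_fragment_bp)


def dilution_factor(stock, target):
    """Fold dilution needed to go from stock to target (same units)."""
    if target <= 0 or stock < target:
        raise ValueError("target must be positive and not exceed the stock")
    return stock / target


# Hypothetical fluorometer reading and mean fragment size, for illustration only.
stock_nm = library_molarity_nm(conc_ng_per_ul=10.0, mean_fragment_bp=350)
to_2nm = dilution_factor(stock_nm, 2.0)       # fold dilution to reach 2 nM
to_20pm = dilution_factor(2.0 * 1000, 20.0)   # 2 nM = 2000 pM, diluted to 20 pM
```

Under these assumed inputs the stock is about 43 nM, so it would be diluted about 22-fold to reach 2 nM, and the later step from 2 nM to 20 pM is a 100-fold dilution.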
From the separated reads, reads having the extraction sequence listed in Table 1, that is, reads containing the nucleic acid sequence having no homology to the reference genome, were selected. The proportion of reads having the UID nucleic acid sequence, AGTC, among the selected reads was then determined.
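The selection and tabulation step above can be sketched in Python. This is an assumption-laden illustration rather than the authors' pipeline: reads are modeled as (sequence, UID) pairs, a read is selected when it starts with a prefix of the extraction sequence, and the function name `uid_distribution` and all toy values are hypothetical.

```python
from collections import Counter


def uid_distribution(reads, extraction_prefix):
    """Tabulate UID fractions among reads that start with the extraction prefix.

    `reads` is an iterable of (read_sequence, uid) pairs; this pairing is an
    assumed input format standing in for the separated r1/r2 and UID data.
    """
    counts = Counter(
        uid for seq, uid in reads if seq.startswith(extraction_prefix)
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {uid: n / total for uid, n in counts.items()}


# Toy reads (not real data): four spike-in reads and one genomic read.
toy_reads = [
    ("GTCTATCCTGAG", "AGTC"),
    ("GTCTATCCTGAG", "AGTC"),
    ("GTCTATCCTGAG", "AGTC"),
    ("GTCTATCCTGAG", "CCAT"),
    ("TTGAATCCTGAG", "CCAT"),
]
fractions = uid_distribution(toy_reads, extraction_prefix="GTCTATCC")
expected_fraction = fractions.get("AGTC", 0.0)   # reads keeping the UID
other_uid_fraction = 1.0 - expected_fraction     # reads carrying another UID
```

Tabulating every observed UID, rather than only counting AGTC, also shows which other UIDs appear among the selected reads.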
Panels A and B of FIG. 2 show the results of performing nucleic acid sequencing with the synthetic fragments for measuring UID integrity, selecting the reads that contain the nucleic acid sequence having no homology to the reference genome, and determining the proportion of reads carrying a UID different from the UID of the synthetic fragment. As shown in panels A and B of FIG. 2, many reads can carry a UID nucleic acid sequence other than AGTC, the UID nucleic acid sequence of the synthetic fragment. Thus, when the synthetic fragment for measuring UID integrity of the present invention is used in preparing a sequencing library and performing nucleic acid sequencing, errors in which the UID is deleted, substituted, removed, or replaced, and the rate of such errors, can be measured.

Claims (10)

  1. A polynucleotide for measuring the integrity of a UID, comprising: a first region in which two or more contiguous nucleotides comprise a unique identifier (UID) nucleic acid sequence; a second region in which two or more contiguous nucleotides comprise a nucleic acid sequence having no homology (non-homologous) to a reference genome; and a third region in which two or more contiguous nucleotides comprise a nucleic acid sequence having homology to the reference genome.
  2. The polynucleotide of claim 1, wherein the UID nucleic acid sequence is 2 bp to 40 bp.
  3. The polynucleotide of claim 1, wherein the nucleic acid sequence having no homology to the reference genome is 2 bp to 250 bp.
  4. The polynucleotide of claim 1, wherein the nucleic acid sequence having homology to the reference genome has homology to two or more contiguous nucleotides of a nucleic acid sequence of a target region.
  5. The polynucleotide of claim 1, wherein the second region is located at the 5' end of the third region, at the 3' end of the third region, or at both the 5' end and the 3' end of the third region, and the first region is located at the 5' end of the third region, the 3' end of the third region, the 5' end of the second region, the 3' end of the second region, or both the 5' end and the 3' end of the second region.
  6. The polynucleotide of claim 1, which is for use in next generation sequencing (NGS).
  7. The polynucleotide of claim 1, which is for use in targeted sequencing, targeted deep sequencing, or panel sequencing.
  8. A method for measuring UID integrity in nucleic acid sequencing, the method comprising: preparing a first library for measuring UID integrity that includes the polynucleotide of claim 1;
    preparing a second library for nucleic acid sequencing by fragmenting nucleic acid isolated from a biological sample and ligating a polynucleotide comprising a unique identifier (UID) nucleic acid sequence to one or more ends of the fragmented nucleic acid;
    sequencing the first library and the second library to obtain nucleic acid sequencing data;
    extracting, from the reads of the obtained sequencing data, the reads that contain the second region;
    calculating the proportion of reads that contain the first region among the extracted reads containing the second region; and
    measuring the integrity of the UID from the calculated proportion of reads containing the first region.
  9. The method of claim 8, wherein the nucleic acid sequencing is next generation sequencing (NGS).
  10. The method of claim 8, wherein the nucleic acid sequencing is targeted sequencing, targeted deep sequencing, or panel sequencing.
PCT/KR2018/015086 2017-11-30 2018-11-30 Method for measuring integrity of uid nucleic acid sequence in nucleic acid sequencing analysis WO2019108014A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR1020170162809A KR101967879B1 (en) 2017-11-30 2017-11-30 Method for measuring integrity of unique identifier in sequencing
KR10-2017-0162809 2017-11-30

Publications (1)

Publication Number Publication Date
WO2019108014A1 (en) 2019-06-06

Family

ID=66163983

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2018/015086 WO2019108014A1 (en) 2017-11-30 2018-11-30 Method for measuring integrity of uid nucleic acid sequence in nucleic acid sequencing analysis

Country Status (2)

Country Link
KR (1) KR101967879B1 (en)
WO (1) WO2019108014A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160319345A1 (en) * 2015-04-28 2016-11-03 Illumina, Inc. Error suppression in sequenced dna fragments using redundant reads with unique molecular indices (umis)
KR20160141680A (en) * 2015-06-01 2016-12-09 연세대학교 산학협력단 Method of next generation sequencing using adapter comprising barcode sequence
US20170058340A1 (en) * 2009-04-30 2017-03-02 Prognosys Biosciences, Inc. Nucleic acid constructs and methods of use
JP2017511121A (en) * 2014-02-11 2017-04-20 エフ.ホフマン−ラ ロシュ アーゲーF. Hoffmann−La Roche Aktiengesellschaft Target sequencing and UID filtering

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200165662A1 (en) * 2017-05-12 2020-05-28 Seoul National University R&Db Foundation Method and apparatus for capturing high-purity nucleotides

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
KOU, R. ET AL.: "Benefits and Challenges with Applying Unique Molecular Identifiers in Next Generation Sequencing to Detect Low Frequency Mutations", PLOS ONE, vol. 11, no. 1, 11 January 2016 (2016-01-11), pages e0146638, XP055469818 *

Also Published As

Publication number Publication date
KR101967879B1 (en) 2019-04-10

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18883670

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18883670

Country of ref document: EP

Kind code of ref document: A1