WO2015102226A1 - Method for sequencing whole genome sequences of chloroplast, mitochondria or nuclear ribosomal dna of organism using next generation sequencing method - Google Patents

Method for sequencing whole genome sequences of chloroplast, mitochondria or nuclear ribosomal DNA of organism using next generation sequencing method

Info

Publication number
WO2015102226A1
Authority
WO
WIPO (PCT)
Prior art keywords
chloroplast
sequence
assembly
sequences
genome
Prior art date
Application number
PCT/KR2014/010999
Other languages
French (fr)
Korean (ko)
Inventor
양태진
Original Assignee
서울대학교산학협력단
Priority date
Filing date
Publication date
Application filed by 서울대학교산학협력단
Publication of WO2015102226A1

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/20: Sequence assembly

Definitions

  • The present invention relates to a method for deciphering, alone or simultaneously, the complete genome sequence of chloroplast, mitochondrial or nuclear ribosomal DNA of an organism using a next generation sequencing method, and to a recording medium on which a computer-readable program for performing the method is recorded. More specifically, the method comprises: (a) sequencing the whole genome of the organism by next generation sequencing (NGS); (b) generating an NGS data set, based on the amount of genome coverage of the chloroplast, mitochondrial or nuclear ribosomal DNA, from the reads (sequence fragments) produced by the sequencing of step (a); (c) assembling the reads of the NGS data set generated in step (b) using assembly software; (d) separating, from the contigs produced by the assembly of step (c), the contigs comprising at least one sequence selected from the group consisting of chloroplast, mitochondrial and nuclear ribosomal DNA (nrDNA) sequences; and (e) linking the contigs separated in step (d) using a sequence comparison program and correcting errors that arose during assembly.
  • The method optimizes the step of producing a low coverage genome sequence of the organism by NGS and assembling it de novo. It is based on a method named dnaLCW (de novo assembly using low coverage whole genome sequence) and includes its optimization and a way of bioinformatically correcting errors that may occur.
  • The chloroplast is the main organelle responsible for photosynthesis and is generally maternally inherited.
  • The chloroplast genome is 120 to 217 kb in size; about 130 genes are conserved with little variation, whereas the intergenic spacers (IGS) between genes carry comparatively many variations such as single nucleotide polymorphisms (SNPs), insertion-deletions (InDels), inversions and translocations.
  • The chloroplast genome is circular in all plant cells, and hundreds of copies are present in a single cell.
  • Like chloroplasts, mitochondria are present in tens of copies or more per cell.
  • In plants the mitochondrial genome varies in size and is complex, but in lower plants such as mosses and in mushrooms it is generally smaller than 100 kb and structurally stable, and in animals, fish and insects it is around 16 kb with a very stable structure.
  • Nuclear ribosomal DNA (nrDNA) is present in the nucleus of plant cells. It is concentrated at the terminal regions of one or two chromosomes as tandem repeats of thousands to tens of thousands of copies, forming the nucleolar organizer region (NOR), and is known to homogenize very quickly even when the genomes of both parents recombine. Plant nrDNA is highly conserved among plant sequences because it preserves the genetic rules of ribosome assembly and nucleolus formation.
  • In all seed plants the 45S nrDNA consists of the 18S, 5.8S and 25S/26S/28S gene cluster, forming one 45S cistron unit that contains the relatively variable internal transcribed spacers ITS1 and ITS2 between the genes.
  • The 45S cistron units are separated by IGSs of varying sizes and arranged in tandem arrays.
  • Chloroplast genome and nrDNA sequences are highly conserved as essential genomic components and represent the cytoplasmic and nuclear genomes, respectively, providing important clues about the diversity and evolution of the entire plant genome.
  • GenBank Organelle Genome Resources June 2013
  • only one nearly complete 45S nrDNA sequence May 2013
  • Some chloroplast genome sequences have been achieved by the plant genome sequencing project, but most chloroplast genome sequences have been created by several independent researchers.
  • nrDNA units were assembled on Solanum lycopersicum chromosome 2 (Genbank No. AC215459.2) and 3 (Genbank No. AC246968.1). To date, 45s rDNA units have been reported in more than 20 species, including Arabidopsis chromosomes 2 and 3 (ncbi blastn basis).
  • NGS technology can analyze sequences with significant savings in time and cost, but obtaining meaningful and complete data from large data sets is an important challenge. We have therefore developed a very effective method that produces Illumina paired-end sequences from conventionally prepared whole genomic DNA and obtains the complete chloroplast genome sequence and a complete nrDNA unit simultaneously using less than 1 Gbp of data. In addition, we propose a way to analyze and resolve all types of errors that may occur during assembly, so that the complete sequence is finished without additional PCR or ABI sequencing. The method is applicable from lower plants such as mosses and lichens to onions and lilies, which have very large genomes, and is expected to serve as a tool for analyzing species diversity and evolutionary origins across the plant kingdom. Furthermore, completing the chloroplast and nrDNA sequences of many lines within a species makes it possible to identify differences between lines, suggesting practical applications such as variety identification markers, protection of biological sovereignty, and protection of breeders' rights.
  • In plants, chloroplasts and nrDNA are used as the main target sequences, but animals, fish and insects have no chloroplasts. Instead, their mitochondrial genomes, at 16 kb to 100 kb, are highly stable and are therefore used as the region for studying nucleotide diversity and evolution. Because mitochondria, like chloroplasts, are present in many copies per cell, their genomes can be completed in the same way as chloroplast genomes.
  • Korean Patent Publication No. 2013-0134269 discloses 'Ultra High Density Gene Mapping Technique Using Next Generation Base Sequence SNP Genotyping'.
  • Korean Patent No. 1313087 discloses 'Sequence Recombination Method and Apparatus for NGS', but neither document describes a method for deciphering the complete genome sequence of chloroplast, mitochondrial or nuclear ribosomal DNA of an organism using a next generation sequencing method as in the present invention.
  • The present invention was derived from the above needs. The present inventors completed it by developing a method that sequences a small amount of genomic DNA of a photosynthetic organism by next generation sequencing (NGS), performs assembly efficiently using only a specific amount of the NGS data, and from the assembly results rapidly and accurately completes, simultaneously or independently, the complete sequence of the chloroplast, mitochondrial or nuclear ribosomal DNA of the organism.
  • The present invention provides a method for deciphering, alone or simultaneously, the complete genome sequence of chloroplast, mitochondrial or nuclear ribosomal DNA of an organism, the method comprising:
  • step (a) sequencing the whole genome of the organism by next generation sequencing (NGS);
  • step (b) generating an NGS data set, based on the amount of genome coverage of the chloroplast, mitochondrial or nuclear ribosomal DNA, from the reads (sequence fragments) produced by the sequencing of step (a);
  • step (c) assembling the reads of the NGS data set generated in step (b) using assembly software;
  • step (d) separating, from the contigs produced by the assembly of step (c), the contigs comprising at least one sequence selected from the group consisting of chloroplast, mitochondrial and nuclear ribosomal DNA (nrDNA) sequences; and
  • step (e) linking the contigs separated in step (d) using a sequence comparison program and correcting errors that arose during assembly.
  • the present invention also provides a recording medium having recorded thereon a computer readable program for performing the method.
  • The sequencing method of the present invention can be applied in many ways: studies of high-copy essential genomic regions and major repeat sites, development of DNA barcoding markers for identifying species or genera, identification of origin, seed purity testing, research on the evolutionary mechanisms of photosynthetic organisms, and, based on sequence information, protection of rights to indigenous resources and of breeders' rights to specific varieties. It is therefore considered industrially useful.
  • a represents chloroplasts, nuclear ribosomal DNA (nrDNA), mitochondria, and other condensates.
  • the number of tags is displayed as a bar graph, and the% number represents the ratio of the corresponding length to the total length of each reference sequence in the extracted contigs.
  • b is the result of mapping the chloroplast sequences assembled using the rice Os2 data set.
  • c is the result of mapping the assembly results of ginseng and the chloroplast sequences involved in the assembly
  • d is the result of mapping the 10kb long pair end data of ginseng to prove that the assembly was successful.
  • FIG. 2 is a view showing the difference in efficiency of the assembly results according to the assembly conditions of rice. On the left is the result of assembly using three datasets with different sequence amounts using SOAP de novo program, and on the right is the result of assembling using CLC de novo assembler using the same three data sets. It is shown to be much more efficient at completing chloroplast sequences.
  • FIG. 3 is a diagram showing the difference in efficiency of the assembly results according to the assembly conditions of ginseng. On the left is the result of assembly using three datasets with different sequence amounts using SOAP de novo program, and on the right is the result of assembling using CLC de novo assembler using the same three data sets. It is shown to be much more efficient at completing chloroplast sequences.
  • a is a type of incorrect assembly in the ginseng overlapping region
  • b is a schematic diagram showing the correct assembly of the region in ginseng where 4 copies of an 18 bp tandem duplication are present
  • c is the 18 bp tandem duplication of b above.
  • FIG. 6 is a diagram showing assembly errors that arise where thymine (T) homopolymer sites of various types and their homologous sequences exist in the nucleus, and a method of correcting them.
  • T thymine
  • FIG. 7 is a diagram showing an example of a wrong assembly site caused by intranuclear or mitochondrial DNA reads (sequence fragments) similar in sequence to chloroplasts and a method of correcting the same.
  • Figure 8 shows the chloroplast sequence completed by the method of the present invention and the sequence reads involved in the assembly, showing a 100% agreement with the reported reference sequence.
  • rDNA ribosomal DNA
  • a is the result of comparing the completed 7,928 bp rDNA unit of the rice cultivar Nipponbare with the homologous region of Nipponbare chromosome 9
  • b is a schematic of the structure of one nrDNA unit in rice ( Oryza sativa ), American ginseng ( Panax quinquefolius ) and ginseng ( Panax ginseng )
  • c is a comparison of the nrDNA sequences of American ginseng and ginseng
  • d is the result for rice of PCR with primers in the 45S conserved region to confirm the IGS length of the completed nrDNA
  • e and f are the results of PCR performed to confirm the IGS length and the species variation region of ginseng and American ginseng; the PCR products have the same length as the completed sequence.
  • Figure 10 is a chloroplast genome map of rice and rice seedlings completed by the method of the present invention.
  • FIG. 11 is a chloroplast genome map of ginseng and ginseng related species completed by the method of the present invention.
  • the arrowheads on the inside of the genomic map indicate areas that show diversity information between ginseng varieties hurricane and gusts.
  • FIG. 13 is a diagram showing the evolutionary relationship of ginseng by analyzing the phylogeny of ginseng and ginseng myoma based on the sequence of 45s rDNA.
  • Figure 15 shows an example of species-specific chloroplast genome-based barcoding markers using the method of the present invention and shows the utility of the plant species classification marker used as a herbal medicine through this.
  • FIG. 16 is a result showing a case in which species-specific markers of 'heaven' were developed in a region (the arrowhead area of FIG. 11) showing a difference between the ginseng varieties of typhoon and lotuses.
  • FIG. 17 is a diagram showing a unique marker that can be utilized for species identification and breed identification through the ribosomal DNA sequence.
  • FIG. 18 is a schematic diagram illustrating the flow of the de novo assembly method of the chloroplast and ribosomal DNA according to the present invention.
  • FIG. 18 is a schematic diagram showing that the NGS method can simultaneously analyze a large number of plants (up to 600).
  • De novo assembly using low coverage wgs is a method that simultaneously completes chloroplasts, mitochondria, and rDNA using de novo assembly using a small amount of low coverage wgs.
  • Figure 20 is a genomic map of the moss mitochondrial genome in the same way as the chloroplast completion method of the present invention.
  • Figure 21 is a genomic map of the mushroom completed the mitochondrial genome in the same way as the chloroplast completion method of the present invention.
  • the step (e) of decoding the complete nucleotide sequence of the chloroplast or mitochondria is the step (e) of decoding the complete nucleotide sequence of the chloroplast or mitochondria
  • For nuclear ribosomal DNA, step (e) may include, but is not limited to: placing side by side two of the contigs isolated in step (d) that contain 45S rDNA sequence, inserting an artificial gap between them, filling the physical gap in the intergenic region with a gap closer program to complete the 45S rDNA unit, and mapping the raw reads to the completed 45S rDNA unit to eliminate assembly errors.
  • the NGS data set of step (b) may be an amount capable of covering 50 to 500 times the chloroplast genome, but is not limited thereto.
  • The assembly software of step (c) may be SOAP de novo, CLC de novo, Bowtie, Velvet, BWA or the like, preferably SOAP de novo or CLC de novo, but is not limited thereto.
  • The sequence comparison program of step (e) may be a program such as Blast, Clustal X, Bioedit or Phydit, preferably Blast or Bioedit, and more preferably Blast, but is not limited thereto.
  • The organism may be a microorganism, a lichen, a moss, an alga such as a green, red or brown alga, a fungus including mushrooms, a higher plant including those with giant genomes, an insect, a fish, an animal or the like, but is not limited thereto.
  • The assembly error may be a sequencing error, a false gap, a tandem repeat error, a homopolymer error or a single nucleotide polymorphism (SNP) error, but is not limited thereto.
  • the present invention provides a recording medium that records a computer-readable program for performing a method for analyzing a complete sequence of chloroplast, mitochondrial or nuclear ribosomal DNA of an organism using NGS experimental data.
  • Computer-readable recording medium refers to any recording medium that can be read directly and accessed by a computer.
  • Such recording media include magnetic recording media such as floppy disks, hard disks and magnetic tapes; optical recording media such as CD-ROMs, CD-Rs, CD-RWs, DVD-ROMs, DVD-RAMs and DVD-RWs; electrical recording media such as RAMs and ROMs; and mixtures of these categories (e.g., magnetic/optical recording media such as MO), but are not limited thereto.
  • Example 1 de novo assembly of chloroplast genome with nrDNA using whole genome sequence
  • Whole genome assembly requires genomic data of about 3 to 10-fold coverage for Sanger sequencing, about 13 to 22-fold for 454 pyrosequencing, and about 60 to 100-fold for Illumina sequencing.
  • Assembly is generally performed using whole genome sequence (WGS) data of 100-fold genome coverage or more, and the genomes of more than 30 plants have been sequenced.
  • Most of the contigs, including the chloroplast sequences produced, were found to be in the form of chimeric fusions fused with genomic DNA sequences in the nucleus.
  • In Panax ginseng, the 30 longest contigs likewise included 3 chloroplast contigs, 13 mitochondrial contigs and a 9,422 bp nrDNA sequence (Table 2 and FIG. 1).
  • the CLC assembler could cover the entire chloroplast genome with less than five relatively long contigs, even as the data set grew.
  • the entire chloroplast was covered with a small number of contigs with a small CLC assembler (Table 3).
  • WGS data of 44,425,734,760 bp from the rice reference cultivar Nipponbare and 220,948,250,844 bp from the ginseng cultivar 'Typhoon' were used. These amount to about 100 and 70 times the size of the rice and ginseng genomes, respectively, and chloroplast genome sequence made up 1.69% and 6% of the respective WGS data. On this basis, 10 WGS data sets covering the chloroplast genome 50 to 5,000-fold were generated, as shown in Table 4 below.
  • At 1-fold nuclear genome coverage, chloroplast genome coverage was 50-fold and 1,050-fold, and rDNA coverage was 324-fold and 3,560-fold, for rice and ginseng respectively.
  • Assembling the chloroplast genome from NGS data produces gaps or unspecified nucleotides ('N') through false gaps, errors caused by tandem repeats, errors caused by homopolymers, and single nucleotide polymorphism (SNP) type errors caused by interference from mitochondrial and nuclear genomic DNA.
  • Gaps caused by assembly errors usually contained N in the assembled sequence.
  • As shown in FIG. 4, even when the sequences to the left and right of an incorrectly assembled site overlap each other, the assembly process can create a fake gap containing an N. Artificially merging the overlapping portions into one sequence gave a complete sequence with the N removed, and the raw reads were confirmed to map cleanly to the modified sequence. For the rice cultivar Nipponbare the modified sequence matched the reference sequence, and for ginseng the resolution of the fake gap was reconfirmed by PCR and sequencing.
  • NGS produces significantly more data than Sanger sequencing but increases the chance of assembly errors due to short read lengths (around 100 bp).
  • Tandem duplication regions frequently caused errors in de novo genome assembly because the copy number changed. Repeats longer than the read length, or arranged in tandem or interspersed within the genome, resulted in repeat collapse and rearrangement. When the repeat unit was shorter than the read length, the error could be resolved by adjusting the k-mer value to the repeat length. As shown in FIG. 5, the repeat collapse of an 18 bp tandem duplication was corrected from 2 copies to 4 copies when the k-mer value was set to the maximum of 64.
  • Error sites in homopolymer regions may be caused by sequencing errors, but chloroplast genome assembly errors were predicted to arise in particular from homologous chloroplast sequence fragments inserted into mitochondrial or nuclear DNA, where mutations accumulate at homopolymer sites and the resulting sequences interfere with assembly.
  • DNA fragment sequences derived from the chloroplast genome were inserted and distributed throughout the chromosomes.
  • A run of 17 Ts forms a homopolymer, and its surrounding sequences are distributed on 10 rice chromosomes, especially in the T homopolymer region (FIGS. 6a and 6b).
  • Fake single nucleotide polymorphisms (SNPs)
  • The draft chloroplast genome assembled from the Os5 set showed guanine (G) and thymine (T) at positions 51,940 and 51,944 bp, but most of the raw reads mapped there (186 of 212) had T and A, indicating that the draft bases were incorrect (FIG. 7A).
  • The G and T bases were observed in 24 of the 212 reads, indicating that they derive from mitochondrial sequence. The false SNPs could therefore be corrected to T and A, the major bases supported by 186 of the 212 reads.
  • Genomic DNA was prepared from plant leaves, a paired-end library with 300-500 bp inserts was constructed from at least 1 µg of DNA, and about 1 Gbp of WGS data was generated on a HiSeq2000 or MiSeq (Illumina, USA) platform.
  • The proportion of chloroplast sequence in the data set was determined, and then WGS data corresponding to about 100 to 500-fold chloroplast genome coverage was extracted and assembled using the CLC assembler.
  • In the gap filling process, known chloroplast sequences are compared using BLAST to identify the chloroplast contigs in the assembled contig data set, determine their order, and find the overlapping portions between contigs.
  • The raw data are mapped to the generated chloroplast contigs to locate faulty regions; the mapping is used to find false gaps, tandem duplication errors, homopolymer errors and SNP errors, which are then corrected.
  • the gene region of the 45s transcription unit, the internal transcribed spacer 1 (ITS1), and the ITS2 region have relatively stable structures and have been used as the main targets for plant evolution and differentiation research.
  • One unit of 45S nrDNA in plants is known to be about 6-18 kb long and has been reported to be homogenized within a plant species by concerted evolution, although heterogeneous forms also exist in some cases.
  • The length differences between nrDNA units are mainly due to the length diversity of the tandem subrepeat elements present in the intergenic spacers (IGS) between genes.
  • The tandem subrepeat elements in the IGS generate heterogeneous forms within the genome through unequal crossing over, so the complete nrDNA unit is not included even in genomes whose sequencing is considered complete.
  • In nrDNA sequence assembly, heterogeneous forms occur within the genome, making it difficult to identify a representative sequence.
  • In addition, repeat collapse caused by the tandem array produced Ns, so a single unit was often not completed. However, the following steps complete nearly the entire nrDNA sequence.
  • SNPs could occur due to the presence of heterogeneous forms. These were found mainly in ITS1 and ITS2, and selecting the most frequent nucleotide was advantageous for finding a representative form. Reads carrying the heterogeneous sequences could also be selected to recover the different types simultaneously.
  • ginseng IGS showed repetition of various sizes ranging from 8bp to 641bp.
  • the 641 bp repeat unit is 3.5 copies, which are then arranged in 337 bp and 149 bp units, resulting in incorrect assembly in many cases. Since various repetitions larger or smaller than the read length exist at the same time and there are some sequence differences between the repeating units, the collapse of the repetitions can be solved based on the mapping information of the pair-end reads.
  • A contig generated by de novo assembly contained only the 45S rDNA gene region and part of the IGS sequence.
  • Exploiting the tandemly arrayed nature of 45S rDNA, two such contigs were placed side by side and 50-200 N nucleotides were artificially inserted between them to create a new contig.
  • The Gapcloser program (SOAP de novo package) was then run repeatedly on the new contig file to remove the N nucleotides, and raw data mapping completed the final unit.
  • For rice, the completion of the chloroplast and nrDNA sequences was verified using the reference cultivar Nipponbare, for which nearly complete genome and chloroplast sequences are available. To check for assembly errors in the finished sequences, they were compared with the reference sequences, and further PCR and ABI sequencing reconfirmation experiments were performed.
  • The rice nrDNA repeat unit including the IGS consists of 5,877 bp of 45S (18S-5.8S-26S) and 2,051 bp of IGS, which includes the external transcribed spacer (ETS) and non-transcribed spacer (NTS), and it was confirmed that a 254 bp subrepeat unit, known as an enhancer of the 45S transcription unit, is arranged in three copies (Fig. 9b, d). In the finished rice standard genome information, it was confirmed that about 4.5 copies of 100 units of the unit completed in the present invention were present at the upper end portion of chromosome 9 (GenBank No. OSJNBb0013K10; AP008245.2).
  • Example 8 de novo assembly of the chloroplast genome for various plant species
  • Example 9 de novo assembly of complete nuclear ribosomal DNA units for rice and ginseng species
  • The full-length nrDNA units of the ginseng cultivar and American ginseng were 11,091 bp and 11,169 bp, respectively, and the 45S transcription units were 5,856 bp and 5,853 bp, respectively. The IGS lengths were 5,235 bp and 5,316 bp, respectively, about 3,200 bp longer than in rice, and showed almost no homology with rice (Fig. 9b and 9c).
  • PCR yielded products of the length predicted from the completed sequence, indicating that the assembly of the IGS was accurate (FIGS. 9D and 9E). However, in the ginseng IGS amplification an additional band about 500 bp long was also amplified (FIGS. 9E and 9F).
  • the number of repeats known as the enhancer of the 45s transcriptional unit, located at 5 'ETS in the IGS, is a 641 bp 3.5 copies of the typhoon and 640 bp 3.5 copies of the American ginseng (variable: 2 3 5, 95% concordance).
  • wild rice Oryza nivara
  • the variation was particularly severe in the ITS1 and ITS2 regions, but some variation was observed in the gene region.
  • Ginseng and American ginseng showed severe differences in IGS and SNPs were observed in some genetic and ITS regions.
  • phylogenetic analysis was confirmed with 45S rDNA region sequences of 13 ginsengs and 16 rices (except O. nivara ) obtained by the present invention (FIGS. 13 and 14).
  • the rDNA sequence can be effectively used to develop markers that can be utilized for species classification and breed identification by developing unique markers between species and species within a nucleus (FIG. 17).
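
The correction of false SNPs described in the definitions above, where the draft base is replaced by the base that most mapped reads support (186 of 212 reads in the rice example), can be illustrated with a minimal Python sketch. The function name `correct_base`, the `min_fraction` threshold and the string representation of the read pileup are assumptions made for illustration only; the patent does not specify an implementation.

```python
from collections import Counter


def correct_base(draft_base, pileup, min_fraction=0.5):
    """Return the base supported by the majority of mapped reads.

    draft_base   -- base called in the draft assembly at this position
    pileup       -- iterable of bases observed in reads mapped to this position
    min_fraction -- hypothetical threshold: the top base must exceed this
                    share of the A/C/G/T observations to replace the draft
    """
    # Count only unambiguous bases; ignore N and other symbols.
    counts = Counter(b.upper() for b in pileup if b.upper() in "ACGT")
    if not counts:
        return draft_base  # no usable read evidence, keep the draft call
    base, support = counts.most_common(1)[0]
    if support / sum(counts.values()) > min_fraction:
        return base
    return draft_base  # no clear majority, keep the draft call


if __name__ == "__main__":
    # Draft says G; 186 reads support T and 24 support G (minority reads,
    # attributed in the text to mitochondrial copies).
    print(correct_base("G", "T" * 186 + "G" * 24))
```

In practice the pileup would come from a read mapper's output, and positions without a clear majority would be inspected by hand rather than silently kept.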

Landscapes

  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Biophysics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The present invention relates to: a method for sequencing a small amount of genomic DNA of a photosynthetic organism by next generation sequencing (NGS), efficiently performing assembly using only a particular quantity of the NGS data, and, from the result of the assembly, rapidly and accurately completing, concurrently or independently, the whole sequences of the chloroplast, mitochondrial or nuclear ribosomal DNA of the organism; and a computer-readable recording medium on which a program for performing the method is recorded.

Description

Method for deciphering the complete genome sequence of chloroplast, mitochondrial or nuclear ribosomal DNA of an organism using a next generation sequencing method
The present invention relates to a method for deciphering the complete genome sequence of chloroplast, mitochondrial or nuclear ribosomal DNA of an organism using a next generation sequencing method. More specifically, it relates to a method for deciphering, alone or simultaneously, the complete genome sequence of chloroplast, mitochondrial or nuclear ribosomal DNA of an organism, the method comprising: (a) sequencing the whole genome of the organism by next generation sequencing (NGS); (b) generating an NGS data set, based on the amount of genome coverage of the chloroplast, mitochondrial or nuclear ribosomal DNA, from the reads (sequence fragments) produced by the sequencing of step (a); (c) assembling the reads of the NGS data set generated in step (b) using assembly software; (d) separating, from the contigs produced by the assembly of step (c), the contigs comprising at least one sequence selected from the group consisting of chloroplast, mitochondrial and nuclear ribosomal DNA (nrDNA) sequences; and (e) linking the contigs separated in step (d) using a sequence comparison program and correcting errors that arose during assembly; and to a recording medium on which a computer-readable program for performing the method is recorded. The method optimizes the step of producing a low coverage genome sequence of the organism by NGS and assembling it de novo. It is based on a method named dnaLCW (de novo assembly using low coverage whole genome sequence) and includes its optimization and a way of bioinformatically correcting errors that may occur.
Plant cells carry genomes in the nucleus, the chloroplasts and the mitochondria. The chloroplast is the main organelle responsible for photosynthesis and is generally maternally inherited. The chloroplast genome is 120 to 217 kb in size; its roughly 130 genes are conserved with little variation, whereas the intergenic spacers (IGS) carry comparatively many variations such as single nucleotide polymorphisms (SNPs), insertion-deletions (InDels), inversions and translocations. The chloroplast genome is circular, is present in every plant cell, and occurs in hundreds of copies per cell.
Like chloroplasts, mitochondria are present in dozens of copies or more per cell. In plants the mitochondrial genome varies in size and is structurally complex, but in lower plants such as mosses and in mushrooms it is generally smaller than 100 kb and structurally stable, and in animals, fish and insects it is about 16 kb and very stable.
Nuclear ribosomal DNA (nrDNA) resides in the plant cell nucleus as tandem repeats concentrated at the ends of one or two chromosomes, repeated from thousands to tens of thousands of times to form the nucleolar organizer region (NOR), and is known to homogenize very quickly even when the genomes of both parents recombine. Plant nrDNA is highly conserved among plant sequences because it preserves the genetic rules for ribosome assembly and nucleolus formation. In higher plants the four rRNA components are split between 5S nrDNA and 45S nrDNA at two separate chromosomal locations, whereas in some ancient plants, ginkgo, mosses and algae the 45S and 5S nrDNA coexist in the same tandem unit.
In all seed plants the 45S nrDNA consists of a single 45S cistron unit comprising the 18S, 5.8S and 25S/26S/28S gene cluster together with the relatively variable internal transcribed spacers ITS1 and ITS2 between the genes. The 45S cistron units are separated by IGS of varying size and are arranged in tandem.
Chloroplast genome and nrDNA sequences are essential, highly conserved genomic components that represent the cytoplasmic and nuclear genomes respectively, and so provide important clues to the diversity and evolution of the whole plant genome. To date, about 360 complete plant chloroplast genome sequences (GenBank Organelle Genome Resources, July 2013) and only one nearly complete 45S nrDNA sequence (May 2013) have been reported in GenBank (www.ncbi.nlm.nih.gov/genbank/). Some chloroplast genome sequences were obtained through plant genome sequencing projects, but most were produced through the efforts of independent researchers: they were completed either by sequencing chloroplast genomic DNA fragments inserted into BAC clones or by PCR walking and sequencing against a reference genome sequence. By contrast, even in the many plants whose genomes have been sequenced, the nucleolar organizer region (NOR), where the 45S nrDNA units cluster, remains incomplete. The complete 45S nrDNA unit sequence reported so far is a tandem array of 4.5 complete 7,928 bp 45S rDNA units at the end of rice chromosome 9, obtained by BAC clone sequencing (GenBank No. OSJNBb0013K10; AP008245.2). In addition, complete nrDNA units of about 9 kb have been assembled on chromosomes 2 (GenBank No. AC215459.2) and 3 (GenBank No. AC246968.1) of tomato (Solanum lycopersicum). To date, 45S rDNA units have been reported in about 20 species, including chromosomes 2 and 3 of Arabidopsis (based on NCBI blastn).
Recently there have been some reports of chloroplast genomes completed by next-generation sequencing (NGS) on 454 GS-FLX, SOLiD and Illumina instruments. In most of them, however, chloroplast DNA was first purified and sequenced, de novo sequence assembly was carried out with reference-guided mapping, and the many remaining gaps were then filled by additional PCR and sequencing, so considerable effort and time were still required. More recently, a method was introduced that simultaneously produces nrDNA units and partial organelle genome sequences from GS-FLX whole-genome sequence in two moss species (Liu et al., 2013, Mol Phylogenet Evol, 66:1089-1094). Still, most of the NGS-based approaches introduced so far are difficult to apply at high throughput and demand much time and effort.
NGS technology greatly reduces the time and cost of sequencing, but obtaining meaningful, complete data from the resulting large data sets remains a major challenge. We therefore developed a highly effective method that produces Illumina paired-end sequences from routinely prepared whole genomic DNA and, using less than 1 Gbp of that data, obtains a complete chloroplast genome sequence and a complete nrDNA unit at the same time. We also analyze the types of error that can arise during assembly and present ways to resolve them, so that complete sequences can be finished with almost no additional PCR or ABI sequencing, and we show that more than 50 species can be analyzed in a single lane. The method was confirmed to be applicable from lower plants such as mosses and lichens to onion and lily, whose genomes are very large, and is therefore expected to serve as a powerful tool for analyzing species diversity and exploring evolutionary origins across the entire plant kingdom. In addition, completing the chloroplast and nrDNA sequences of multiple lines within a species makes it possible to distinguish between lines, which suggests practical uses such as cultivar identification markers, protection of biological sovereignty and protection of breeders' rights.
In plants, chloroplast and nrDNA sequences thus serve as the main target sequences. Animals, fish and insects lack chloroplasts, but their mitochondrial genomes, at about 16 to 100 kb, are very stable and are used as the region of sequence diversity for explaining evolution. Because mitochondria, like chloroplasts, are present in many copies per cell, we verified that mitochondrial genomes can be completed in the same way as chloroplast genomes, and we present examples.
Korean Patent Publication No. 2013-0134269 discloses an 'ultra-high-density genetic mapping technique using next-generation-sequencing-based SNP genotyping', and Korean Patent No. 1313087 discloses a 'sequence recombination method and apparatus for NGS'. Neither describes a method, as in the present invention, for deciphering the complete genome sequence of the chloroplast, mitochondrial or nuclear ribosomal DNA of an organism using next-generation sequencing.
The present invention was derived in response to the above need. The inventors sequenced a small amount of genomic DNA from photosynthetic organisms by next-generation sequencing (NGS), assembled efficiently using only a data set of a specific size drawn from the NGS data, and from the assembly results developed a method for completing the full chloroplast, mitochondrial or nuclear ribosomal DNA sequences of an organism quickly and accurately, simultaneously or independently, thereby completing the present invention.
To solve the above problem, the present invention provides a method for deciphering the complete genome sequence of the chloroplast, mitochondrial or nuclear ribosomal DNA of an organism, individually or simultaneously, comprising:
(a) sequencing the whole genome of the organism by next-generation sequencing (NGS);
(b) generating an NGS data set from the reads (sequence fragments) produced in step (a), sized according to the amount of genome coverage of the chloroplast, mitochondrial or nuclear ribosomal DNA;
(c) assembling the reads of the NGS data set from step (b) using assembly software;
(d) isolating, from the contigs produced by the assembly in step (c), the contigs that contain one or more sequences selected from the group consisting of chloroplast, mitochondrial and nuclear ribosomal DNA (nrDNA) sequences; and
(e) joining the contigs isolated in step (d) using a sequence comparison program and correcting errors introduced during assembly.
The present invention also provides a recording medium on which a computer-readable program for performing the method is recorded.
The present invention provides a new high-throughput method that uses a small amount of whole-genome sequence information to de novo assemble the complete chloroplast genome sequence, the mitochondrial sequence and the main 45S nrDNA unit sequence simultaneously, together with a method for removing assembly errors. The sequencing method of the invention can be applied in many ways: studies of high-copy essential genomic regions and major repeat regions, development of DNA barcoding markers for distinguishing species or genera, determination of origin, seed purity testing, studies of the evolutionary mechanisms of photosynthetic organisms, and sequence-based protection of rights to native resources and of breeders' rights to specific cultivars. It is therefore expected to be industrially useful.
FIG. 1 shows the analysis of the 30 longest contigs extracted from the rice (Oryza sativa) and ginseng (Panax ginseng) data sets. Panel a is a bar graph of the number of contigs corresponding to chloroplast, nuclear ribosomal DNA (nrDNA), mitochondrial and other sequences; the percentages give the length of the extracted contigs as a proportion of the full length of each reference sequence. Panel b shows chloroplast sequences mapped after contig assembly of the rice Os2 data set. Panel c shows the ginseng assembly with the chloroplast reads that took part in the assembly mapped onto it, and panel d shows ginseng 10 kb long paired-end data mapped to it, demonstrating that the assembly is sound.
FIG. 2 shows how assembly efficiency for rice varies with the assembly conditions. The left side shows assemblies produced by SOAP de novo from three data sets with different amounts of sequence; the right side shows assemblies of the same three data sets produced by the CLC de novo assembler, indicating that the CLC program is considerably more efficient at completing the whole chloroplast sequence.
FIG. 3 shows how assembly efficiency for ginseng varies with the assembly conditions. The left side shows assemblies produced by SOAP de novo from three data sets with different amounts of sequence; the right side shows assemblies of the same three data sets produced by the CLC de novo assembler, indicating that the CLC program is considerably more efficient at completing the whole chloroplast sequence.
FIG. 4 shows an insertion-deletion assembly error and its correction.
FIG. 5 shows tandem repeat regions that can give rise to assembly errors. Panel a shows the form of misassembly seen at a tandem repeat region in ginseng; panel b is a schematic of the correct assembly of a ginseng region containing four copies of an 18 bp tandem repeat; and panel c shows the misassembly case for the 18 bp tandem repeat of panel b, giving the read depth when it is assembled as two copies and as four copies.
FIG. 6 shows cases in which thymine (T) homopolymer regions of various forms, and sequences homologous to them present in the nucleus, cause assembly errors, and how these errors are corrected.
FIG. 7 shows an example of a misassembled region caused by nuclear or mitochondrial DNA reads (sequence fragments) whose sequence resembles that of the chloroplast, and how it is corrected.
FIG. 8 shows a chloroplast sequence completed by the method of the invention together with the sequence reads that took part in the assembly; the result matches the reported reference sequence 100%.
FIG. 9 shows the distribution of ribosomal DNA (rDNA) in rice and ginseng. Panel a compares the 7,928 bp rDNA unit completed for the rice cultivar Nipponbare with the homologous region of Nipponbare chromosome 9; panel b is a schematic of the structure of one nrDNA unit in rice (Oryza sativa), American ginseng (Panax quinquefolius) and ginseng (Panax ginseng); panel c compares the nrDNA sequences of American ginseng, ginseng and rice with one another; panel d shows primers designed in conserved regions of the 45S unit to verify the IGS length of the completed nrDNA; and panels e and f show PCR results obtained to confirm the IGS length and the regions of interspecific variation in ginseng and American ginseng, demonstrating that the lengths agree with those predicted from the completed sequences.
FIG. 10 is a chloroplast genome map of rice and its related species completed by the method of the invention.
FIG. 11 is a chloroplast genome map of ginseng and its related species completed by the method of the invention. The arrowheads inside the map mark regions that show diversity between the ginseng cultivars Chunpoong and Yunpoong.
FIG. 12 shows a phylogenetic analysis of 17 rice cultivars based on the chloroplast genome.
FIG. 13 shows the evolutionary relationships of ginseng from a phylogenetic analysis of ginseng and its related species based on 45S rDNA sequences.
FIG. 14 shows a phylogenetic analysis of 16 rice species based on 45S rDNA sequences.
FIG. 15 shows examples of species-specific chloroplast-genome-based barcoding markers obtained with the method of the invention, illustrating their usefulness as markers for distinguishing plant species used as herbal medicines.
FIG. 16 shows a case in which a marker specific to 'Chunpoong' was developed from a region that differs between the ginseng cultivars Chunpoong and Yunpoong (the arrowhead regions of FIG. 11).
FIG. 17 shows unique markers from ribosomal DNA sequences that can be used for species discrimination, cultivar identification and the like.
FIG. 18 is a schematic of the workflow of the de novo assembly method for chloroplast and ribosomal DNA of the invention, illustrating that with NGS many kinds of plants (up to 600) can be analyzed at the same time.
FIG. 19 is a flowchart of the specific procedure for deciphering the complete chloroplast and nrDNA sequences from whole genome sequence (WGS) according to the invention. The method, which completes chloroplast, mitochondrial and rDNA sequences simultaneously by de novo assembly of a small amount of WGS (low-coverage WGS), is named dnaLCW (de novo assembly using low coverage WGS).
FIG. 20 is a genome map of a moss mitochondrial genome completed in the same way as the chloroplast completion method of the invention.
FIG. 21 is a genome map of a mushroom mitochondrial genome completed in the same way as the chloroplast completion method of the invention.
To achieve the above object, the present invention provides a method for deciphering the complete genome sequence of the chloroplast, mitochondrial or nuclear ribosomal DNA of an organism, individually or simultaneously, comprising:
(a) sequencing the whole genome of the organism by next-generation sequencing (NGS);
(b) generating an NGS data set from the reads (sequence fragments) produced in step (a), sized according to the amount of genome coverage of the chloroplast, mitochondrial or nuclear ribosomal DNA;
(c) assembling the reads of the NGS data set from step (b) using assembly software;
(d) isolating, from the contigs produced by the assembly in step (c), the contigs that contain one or more sequences selected from the group consisting of chloroplast, mitochondrial and nuclear ribosomal DNA (nrDNA) sequences; and
(e) joining the contigs isolated in step (d) using a sequence comparison program and correcting errors introduced during assembly.
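Step (d) can be pictured as a best-hit classification of contigs. The sketch below is only an illustration under stated assumptions: it assumes the contigs were compared against reference chloroplast, mitochondrial and nrDNA sequences with BLAST and that the hits were saved in the standard 12-column tabular format, which the source does not specify for this step. The function name `classify_contigs` and the `ref_compartment` mapping are hypothetical.

```python
def classify_contigs(blast_tab_lines, ref_compartment):
    """Assign each contig to the compartment of its highest-scoring hit.

    blast_tab_lines: lines in BLAST 12-column tabular format
        (qseqid sseqid pident length mismatch gapopen qstart qend
         sstart send evalue bitscore).
    ref_compartment: hypothetical mapping from reference sequence ID to a
        label such as "chloroplast", "mitochondria" or "nrDNA".
    """
    best = {}  # contig ID -> (compartment, bitscore)
    for line in blast_tab_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        contig_id, ref_id, bitscore = fields[0], fields[1], float(fields[11])
        compartment = ref_compartment.get(ref_id)
        if compartment is None:
            continue  # hit to a reference we are not interested in
        if contig_id not in best or bitscore > best[contig_id][1]:
            best[contig_id] = (compartment, bitscore)
    return {contig_id: label for contig_id, (label, _) in best.items()}
```

Contigs that receive no label are simply left out of the result, which corresponds to the "other" category counted in FIG. 1.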
In a method according to one embodiment of the invention, step (e), when deciphering the complete chloroplast or mitochondrial sequence, may comprise aligning and joining those contigs isolated in step (d) that contain chloroplast sequence into a complete circular sequence, then mapping the raw sequence data to it and removing assembly errors.
Step (e), when deciphering the complete nuclear ribosomal DNA sequence, may comprise artificially placing two copies of the contig isolated in step (d) that contains the 45S rDNA sequence one after the other with an artificial gap between them, filling the physical gap in the intergenic spacer (IGS) region with a gap-closing (Gap closer) program to complete the full 45S rDNA unit, and then mapping the raw sequence data to the completed 45S rDNA unit and removing assembly errors; however, the method is not limited thereto.
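The nrDNA step above starts from a simple construct: the same 45S rDNA contig written twice with an artificial gap in between, which a gap-closing program then fills across the IGS. A minimal sketch of building that construct is given below. The function name and the default gap length of 100 `N` characters are assumptions for illustration; the source does not state a gap length.

```python
def build_gap_scaffold(contig: str, gap_length: int = 100) -> str:
    """Return contig + run of N + contig, ready to hand to a gap-closing tool.

    gap_length is a hypothetical placeholder; the real IGS length is unknown
    until the gap has been closed.
    """
    if gap_length < 1:
        raise ValueError("gap_length must be at least 1")
    contig = contig.upper()
    return contig + "N" * gap_length + contig
```

Because the units are arranged in tandem, the end of one copy and the start of the next flank exactly one IGS, so closing the artificial gap yields one complete unit.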
In a method according to one embodiment of the invention, the NGS data set of step (b) may be an amount of data sufficient to cover the chloroplast genome 50 to 500 times, but is not limited thereto.
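The size of such a data set follows from the usual coverage relation, coverage = (number of bases from the target) / (target genome size). The sketch below works out how many read pairs to draw from a whole-genome run. It assumes that the fraction of reads originating from the chloroplast is known or has been estimated separately; that fraction (`cp_fraction`), the function name and the example numbers in the usage note are hypothetical and are not taken from the source.

```python
import math


def read_pairs_for_coverage(target_size_bp: int, coverage: float,
                            read_length_bp: int, cp_fraction: float = 1.0) -> int:
    """Read pairs needed so the target genome is covered `coverage` times.

    cp_fraction is the (assumed) share of all reads that come from the
    target genome; 1.0 means every read is a target read.
    """
    if target_size_bp <= 0 or read_length_bp <= 0 or coverage <= 0:
        raise ValueError("sizes and coverage must be positive")
    if not 0 < cp_fraction <= 1:
        raise ValueError("cp_fraction must be in (0, 1]")
    target_bases = target_size_bp * coverage
    bases_per_pair_from_target = 2 * read_length_bp * cp_fraction
    return math.ceil(target_bases / bases_per_pair_from_target)
```

For example, `read_pairs_for_coverage(135_000, 100, 100, cp_fraction=0.25)` asks how many 100 bp pairs give 100-fold coverage of a 135 kb target when a quarter of the reads come from it.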
In a method according to one embodiment of the invention, the assembly software of step (c) may be SOAP de novo, CLC de novo, Bowtie, Velvet, BWA or the like, preferably SOAP de novo or CLC de novo, but is not limited thereto.
In a method according to one embodiment of the invention, the sequence comparison program of step (e) may be a program such as Blast, Clustal X, Bioedit or Phydit, preferably Blast or Bioedit, and more preferably Blast, but is not limited thereto.
In a method according to one embodiment of the invention, the organism may be a microorganism; a lower photosynthetic organism such as a moss (for example peat moss or haircap moss) or an alga (green, red or brown algae); a fungus, including mushrooms; a higher plant with a very large genome; an insect; a fish; an animal; or the like, but is not limited thereto.
In a method according to one embodiment of the invention, the assembly error may be a base-sequence error, a false gap, a tandem repeat error, a homopolymer error, a single nucleotide polymorphism (SNP) error or the like, but is not limited thereto.
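Homopolymer errors of the kind listed above occur at long single-base runs, so one practical first step is to list those runs in a draft sequence and inspect the read mapping at each of them. The sketch below only locates candidate sites; it does not correct anything. The function name and the default threshold of 8 bases are assumptions chosen for illustration, not values given in the source.

```python
import re


def find_homopolymers(sequence: str, min_len: int = 8):
    """Return (start, end, base) for every run of one base of length >= min_len.

    Coordinates are 0-based and end-exclusive. min_len is an arbitrary,
    hypothetical threshold.
    """
    if min_len < 2:
        raise ValueError("min_len must be at least 2")
    # ([ACGT]) captures one base; \1{n,} requires at least n more copies of it.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [(m.start(), m.end(), m.group(1))
            for m in pattern.finditer(sequence.upper())]
```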
The present invention also provides a recording medium on which a computer-readable program for performing the method is recorded. Specifically, it provides a recording medium on which is recorded a computer-readable program for performing a method of analyzing the complete sequence of the chloroplast, mitochondrial or nuclear ribosomal DNA of an organism from NGS experimental data.
A computer-readable recording medium is any recording medium that can be read and accessed directly by a computer. Examples include magnetic recording media such as floppy disks, hard disks and magnetic tape; optical recording media such as CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM and DVD-RW; electrical recording media such as RAM and ROM; and hybrids of these categories (for example magneto-optical media such as MO), but are not limited thereto.
The invention is described in detail below by way of examples. The following examples merely illustrate the invention and do not limit its scope.
Example 1. De novo assembly of the chloroplast genome and nrDNA from whole-genome sequence
Whole genome assembly requires roughly 3- to 10-fold genome coverage with Sanger sequencing, 13- to 22-fold with 454 pyrosequencing and 60- to 100-fold with Illumina sequencing. On the Illumina platform, now the most widely used, assembly is typically carried out with whole genome sequence (WGS) at about 100-fold genome coverage or more, and the genomes of more than 30 plant species have been sequenced this way. Even with this much WGS, however, it is difficult to obtain complete chloroplast genome and nrDNA sequences: most of the resulting contigs that contain chloroplast sequence were found to be chimeric contigs fused with nuclear genomic DNA sequence. In contrast, when a small amount of WGS data, about 1-fold genome coverage, was assembled with the CLC de novo assembler, most of the longest contigs turned out to be genomic sequences present at very high copy number in the cell, such as chloroplast, mitochondrial and ribosomal DNA (rDNA) sequences. In rice (Oryza sativa), 5 of the 30 longest assembled contigs covered the entire chloroplast genome with overlaps of about 20 bp, and one contig of 6,889 bp containing the nuclear ribosomal DNA (nrDNA) sequence was also identified. A further 15 contigs covered about 50% of the mitochondrial genome (Table 1 and FIG. 1).
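Contigs that cover a genome with short end overlaps, such as the roughly 20 bp overlaps reported for rice, can in principle be joined end to end. The sketch below illustrates the idea with exact suffix-prefix matching on contigs that are already ordered and oriented. This is a simplification and an assumption: the method described here joins contigs with a sequence comparison program, which also tolerates mismatches. The function names and the 15 bp default minimum are hypothetical.

```python
def suffix_prefix_overlap(left: str, right: str, min_overlap: int = 15) -> int:
    """Length of the longest exact suffix of `left` that is a prefix of `right`."""
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    longest_possible = min(len(left), len(right))
    for size in range(longest_possible, min_overlap - 1, -1):
        if left[-size:] == right[:size]:
            return size
    return 0


def join_contigs(ordered_contigs, min_overlap: int = 15) -> str:
    """Join contigs, given in genome order and orientation, by their end overlaps."""
    merged = ordered_contigs[0]
    for next_contig in ordered_contigs[1:]:
        size = suffix_prefix_overlap(merged, next_contig, min_overlap)
        if size == 0:
            raise ValueError("no sufficient overlap; contigs cannot be joined")
        merged += next_contig[size:]  # append only the non-overlapping part
    return merged
```

A join that fails here is a signal to inspect the contig ends rather than to force a connection.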
Table 1
[Table image: PCTKR2014010999-appb-T000001]
In ginseng (Panax ginseng), likewise, 3 of the 30 longest contigs were chloroplast sequences, 13 were mitochondrial and one was a 9,422 bp nrDNA sequence (Table 2 and FIG. 1).
Table 2
Figure PCTKR2014010999-appb-T000002
These results suggested that, for rice, whose genome is relatively small at 430 Mbp, and for ginseng, whose genome is comparatively large and still poorly studied, complete chloroplast and nrDNA sequences can be obtained from WGS data provided that appropriate assembly conditions are used. The present invention therefore set out to find the optimal assembly conditions.
Example 2. Selection of a de novo genome assembler
Among the genome assemblers in common use, SOAP de novo version 2.04 (http://soap.genomics.org.cn/) and CLC-NGS-CELL version 4.06 beta (www.clcbio.com/products/clc-assembly-cell) were compared for their ability to produce contigs containing the chloroplast genome. Both assemblers produced contigs containing chloroplast sequence, but SOAP de novo 2.04 generated somewhat more, and shorter, contigs than CLC-NGS-CELL 4.06 beta in the 50-fold and 250-fold data sets and gave lower coverage, so the conditions under which it could complete the chloroplast genome were very sensitive. The CLC assembler, in contrast, covered the entire chloroplast genome with fewer than 5 relatively long contigs even as the data set grew. Notably, even for ginseng, which has a relatively large genome among plants (about 3.2 Gbp), the CLC assembler covered the whole chloroplast genome with a small number of contigs (Table 3).
Table 3
Figure PCTKR2014010999-appb-T000003
Further studies were therefore carried out to establish optimal conditions for completing the chloroplast genome and nrDNA sequences with the CLC assembler.
Example 3. Amount of NGS data required for chloroplast genome assembly
To find the appropriate amount of WGS data for assembling chloroplast and nrDNA sequences, 44,425,734,760 bp of WGS data from the rice reference cultivar Nipponbare and 220,948,250,844 bp of WGS data from the ginseng cultivar 'Chunpoong' were used. These amounts correspond to about 100-fold and 70-fold coverage of the rice and ginseng genomes, respectively, and chloroplast sequence made up 1.69% and 6% of the respective WGS data. On this basis, 10 WGS data sets spanning 50-fold to 5,000-fold chloroplast genome coverage were generated, as shown in Table 4.
Table 4
Figure PCTKR2014010999-appb-T000004
At 1-fold genome coverage, chloroplast genome coverage was 50-fold in rice and 1,050-fold in ginseng, and rDNA coverage was 324-fold and 3,560-fold, respectively. After assembling each data set, the number of contigs containing chloroplast sequence (NCBI accession No. GU592207.1) and the assembly errors were examined.
The rice (Oryza sativa) data sets OS3 to OS6 and the ginseng (Panax ginseng) data sets PG3 to PG7 gave a small number of contigs and few sequence errors, as shown in Table 5.
Table 5
Figure PCTKR2014010999-appb-T000005
Data sets with a chloroplast genome depth of 50-fold or less, or 1,000-fold or more, either failed to cover the entire chloroplast genome or showed more chloroplast contigs and more assembly errors (FIGS. 2 and 3).
Examination of the rice and ginseng NGS data sets for the appropriate amount of WGS data showed that data sets at about 100-fold to 500-fold chloroplast coverage gave few gaps and mismatches. This range corresponds to 0.86 to 4.3 Gbp of WGS data for rice and 0.3 to 1.5 Gbp for ginseng. Thus, even for a plant with a large genome such as ginseng, the chloroplast genome can be assembled from a fixed amount of WGS sequence that depends on the proportion of chloroplast sequence in the data.
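The conversion from a target chloroplast depth to an amount of WGS data can be sketched as follows. This is a minimal illustration, not part of the described method: the function name is hypothetical, and the inputs are the genome sizes (430 Mbp and about 3.2 Gbp) and the chloroplast depths at 1-fold genome coverage (50-fold and 1,050-fold) quoted in this description.

```python
def wgs_bases_for_depth(genome_size_bp, cp_depth_at_1x, target_cp_depth):
    """WGS bases needed for the chloroplast genome to reach target_cp_depth.

    cp_depth_at_1x is the chloroplast depth observed when the WGS data
    amount to 1-fold coverage of the nuclear genome.
    """
    return genome_size_bp * target_cp_depth / cp_depth_at_1x


# Values quoted in the text: (genome size in bp, chloroplast depth at 1x).
SPECIES = {
    "rice": (430_000_000, 50),
    "ginseng": (3_200_000_000, 1_050),
}

for name, (genome_size, depth_at_1x) in SPECIES.items():
    low = wgs_bases_for_depth(genome_size, depth_at_1x, 100)
    high = wgs_bases_for_depth(genome_size, depth_at_1x, 500)
    print(f"{name}: {low / 1e9:.2f} to {high / 1e9:.2f} Gbp")
```

With these inputs the loop prints 0.86 to 4.30 Gbp for rice and 0.30 to 1.52 Gbp for ginseng, in line with the ranges stated above.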
Example 4. Correction of assembly errors
When a chloroplast genome is assembled from NGS data, gaps and unspecified nucleotides ('N') appear. The causes include false gaps, errors from tandem repeats, errors from homopolymers, and single nucleotide polymorphism (SNP)-type errors caused by interference from mitochondrial and nuclear genomic DNA. Before these errors can be corrected, the error sites must be located. To do so, a draft chloroplast genome sequence is completed, the raw data are mapped onto it, and the overall pattern of mapped raw reads is inspected in the CLC assembly viewer. Regions containing assembly errors carry many mismapped raw reads (sequence fragments), which allows them to be corrected as described below.
4-1. False gaps
Gaps that are not real but arise from assembly errors usually appeared as an N in the assembled sequence. As FIG. 4 shows, the sequences to the left and right of the N overlap each other, yet the assembly process produced a false gap containing a single N. Manually merging the overlapping portions into one sequence gave a complete sequence with the N removed, and raw reads mapped cleanly onto the corrected sequence. Comparison with the rice cultivar Nipponbare reference confirmed that the corrected sequence matched the reference sequence, and for ginseng the correction of false gaps was reconfirmed by PCR and sequencing.
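The merging of overlapping flanks around a single N can be sketched as follows. This is only an illustration under stated assumptions: the function name and the `min_overlap` threshold are hypothetical, the overlap is required to be exact, and in the procedure described above the merge is made manually and then checked by mapping the raw reads.

```python
def close_false_gap(contig, min_overlap=10):
    """Merge the flanks of a single-N gap when they overlap.

    If the end of the left flank is repeated at the start of the right
    flank, the duplicated part is kept once and the N is dropped.
    The contig is returned unchanged when there is no N, more than one N,
    or no exact overlap of at least min_overlap bases.
    """
    left, sep, right = contig.partition("N")
    if not sep or "N" in right:
        return contig  # outside the scope of this sketch
    longest = min(len(left), len(right))
    for size in range(longest, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return left + right[size:]
    return contig


# Toy example: a 12-base overlap is duplicated on both sides of the N.
upstream, overlap, downstream = "ACGTTGCAA", "GGCTTAGCCATG", "TTCAGGA"
draft = upstream + overlap + "N" + overlap + downstream
print(close_false_gap(draft))
```

Searching from the longest possible overlap downward keeps the most conservative merge; a real correction would still need the read-mapping check described in the text.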
4-2. Tandem repeats
NGS produces far more data than Sanger sequencing, but its short read length (around 100 bp) increases the likelihood of assembly errors. In de novo genome assembly in particular, tandem duplication regions frequently gave rise to assembly errors through changes in copy number. Repeats whose repeat unit is longer than the read length, or repeats that are arranged in tandem or interspersed across the genome, caused repeat collapse and rearrangement. When the repeat unit was shorter than the read length, the error could be resolved by adjusting the k-mer value according to the repeat length. As FIG. 5 shows, the collapse of a tandem repeat with an 18 bp unit was corrected from 2 copies to 4 copies when the k-mer value was set to the maximum of 64. When raw data are mapped to a draft chloroplast sequence in which the tandem repeat was assembled with fewer copies than the original, mismapped reads are visible where the repeat collapsed. Because those mismapped reads also count toward depth, read depth there is markedly higher than in the surrounding region, so copy-number errors in repeats could be predicted and corrected. Since the unit size of tandem repeats in most plant chloroplast genomes is 100 bp or less, almost all such errors could be found and removed in this way.
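The depth signal described above can be screened for automatically. The sketch below flags stretches of a per-base depth profile that stay well above the median; the function name and the `fold` and `min_run` thresholds are hypothetical choices for illustration, and the description itself relies on visual inspection in the CLC assembly viewer.

```python
from statistics import median


def flag_depth_spikes(depths, fold=1.5, min_run=3):
    """Return (start, end) index ranges, end exclusive, where the depth
    stays above fold x median for at least min_run consecutive positions."""
    threshold = fold * median(depths)
    regions, start = [], None
    for i, depth in enumerate(depths):
        if depth > threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_run:
                regions.append((start, i))
            start = None
    # Close a run that extends to the end of the profile.
    if start is not None and len(depths) - start >= min_run:
        regions.append((start, len(depths)))
    return regions


# Toy profile: a collapsed repeat shows roughly doubled depth.
profile = [100] * 20 + [210] * 6 + [100] * 20
print(flag_depth_spikes(profile))
```

Flagged ranges are candidates for a collapsed repeat and would then be checked against the mapped reads.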
4-3. Homopolymers
Homopolymers cause many problems not only in genomic DNA but also in the chloroplast genome. A survey found 95 and 91 regions with homopolymers of 8-mer or longer in the rice and ginseng chloroplast genomes, respectively, most of them adenine (A) or thymine (T) repeats (Table 6).
Table 6
Figure PCTKR2014010999-appb-T000006
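A survey of homopolymer regions of 8-mer or longer, like the one summarized in Table 6, could be scripted along the following lines. This is a sketch only: the function name is hypothetical and the toy sequence is invented for illustration, not taken from either chloroplast genome.

```python
import re


def homopolymer_runs(seq, min_len=8):
    """Return (start, base, length) for every run of a single base that is
    at least min_len long. start is a 0-based index into seq."""
    # ([ACGT]) captures one base; \1{n,} requires n or more further copies.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [(m.start(), m.group(1), len(m.group(0)))
            for m in pattern.finditer(seq.upper())]


# Toy sequence: a 17-T run, an 8-A run, and a 7-A run that is too short.
toy = "GC" + "T" * 17 + "GAC" + "A" * 8 + "CG" + "A" * 7
for start, base, length in homopolymer_runs(toy):
    print(start, base, length)
```

Counting the returned runs per base gives the kind of tally reported for A and T repeats.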
Error sites in these homopolymer regions may be caused by sequencing errors. However, in chloroplast sequence fragments that have been inserted into mitochondrial or nuclear DNA, variation has accumulated especially at homopolymer sites, and these sequences were predicted to interfere with assembly and cause chloroplast genome assembly errors. In rice, DNA fragments derived from the chloroplast genome are inserted and distributed throughout the chromosomes. At position 78,424 bp of the rice chloroplast genome (NCBI accession No. GU592207.1) there is a homopolymer of 17 T's; the flanking sequence occurs at 10 locations in the rice chromosomes, and much variation was observed in the T homopolymer region in particular (FIGS. 6a and 6b). In the initial contig assembled from the Os3 data set this site was assembled as T8, which was judged to be a misassembly caused by similar chloroplast-derived sequences on nuclear chromosomes 5, 6, 7, and 9. The correction was made by generating candidate sequences matching the homopolymer T counts present in the raw data, mapping the raw data to them, and choosing the homopolymer length with the highest depth. Because chloroplast genome sequence is present at high depth in NGS data, the highest-depth candidate was judged to be the most accurate chloroplast genome sequence. In practice, draft chloroplast genome sequences were constructed with 7, 8, 9, 10, 11, 12, 15, and 17 T's at the site and paired-end reads were mapped at 100% similarity; the reference chloroplast genome sequence with a single run of 17 T's had the highest depth, 33.14, which confirmed the chloroplast genome sequence (FIG. 6c). This influence of chloroplast-derived sequences in nuclear DNA on assembly was especially pronounced in rice chloroplast genome assembly. It appeared once the amount of WGS data used reached 5-fold rice genome coverage or more, and the likelihood of errors and the formation of chimeric assemblies increased with coverage.
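The idea of choosing the homopolymer length best supported by the raw data can be illustrated with a simplified stand-in. Instead of mapping paired-end reads at 100% similarity to each candidate draft, as described above, this sketch just counts reads that contain the candidate run with its flanks as an exact substring. The function name, flank sequences, and reads are all hypothetical.

```python
def best_homopolymer_length(reads, left_flank, right_flank, base, candidates):
    """Count, for each candidate length n, the reads that contain
    left_flank + base * n + right_flank exactly, and return the length
    with the most support together with all counts.

    The flanks must not end or start with `base`, otherwise a shorter
    candidate would also match inside a longer run.
    """
    counts = {}
    for n in candidates:
        target = left_flank + base * n + right_flank
        counts[n] = sum(target in read for read in reads)
    best = max(counts, key=counts.get)
    return best, counts


# Toy reads: 5 support a 17-T run (chloroplast-like, high depth),
# 2 support an 8-T run (nuclear-insert-like, low depth).
left, right = "ACGGA", "CATGC"
reads = (["GG" + left + "T" * 17 + right + "AA"] * 5
         + ["GG" + left + "T" * 8 + right + "AA"] * 2)
best, counts = best_homopolymer_length(
    reads, left, right, "T", [7, 8, 9, 10, 11, 12, 15, 17])
print(best, counts[best])
```

Exact substring matching ignores sequencing errors and read pairing, so it is a much cruder measure than the depth obtained from an actual mapping.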
4-4. False single nucleotide polymorphisms (SNPs) caused by interference from homologous mitochondrial and nuclear DNA
When a large amount of WGS data is used for the initial assembly, chloroplast-derived DNA fragments inserted in the mitochondrial and nuclear genomes can be wrongly incorporated into the assembly and cause single nucleotide polymorphism (SNP) errors. Because the CLC assembler assembles by means of a de Bruijn graph (Compeau et al., 2011, Nat Biotechnol, 29:987-991), such errors occurred only very rarely (FIG. 7b). Errors of this type look like SNPs and could be identified by checking the raw reads mapped to the draft assembly. The draft chloroplast genome built from the Os5 set showed guanine (G) and thymine (T) at positions 51,940 bp and 51,944 bp, but most of the raw reads (186 of 212) carried T and A at those positions and showed up as mismatched in the mapping (FIG. 7a). Sequences carrying G and T accounted for only 24 of the 212 reads and were confirmed to be sequences present in the mitochondrial genome. The false SNPs could therefore be corrected to T and A, the major sequence supported by 186 of the 212 reads.
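The correction amounts to a majority vote over the reads covering the two positions. A minimal sketch, assuming each read has already been reduced to the two bases it carries at positions 51,940 and 51,944 (the function name is hypothetical; the text does not characterize the remaining 2 of the 212 reads, so they are left out here):

```python
from collections import Counter


def major_haplotype(observed):
    """Return the most frequent haplotype and its share of the observations."""
    counts = Counter(observed)
    haplotype, n = counts.most_common(1)[0]
    return haplotype, n / len(observed)


# Read counts from the text: 186 reads with T and A, 24 reads with G and T.
observed = ["TA"] * 186 + ["GT"] * 24
haplotype, share = major_haplotype(observed)
print(haplotype, round(share, 3))
```

The draft bases would then be replaced by the major haplotype, here T and A.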
Example 5. Optimization of conditions for sequencing a complete chloroplast genome
Genomic DNA was prepared from plant leaves, a 300 to 500 bp paired-end library was constructed from at least about 1 µg of DNA, and about 1 Gbp of WGS data was generated on a HiSeq 2000 or MiSeq platform (Illumina, USA). Reads with low quality values were removed from the data set. To determine the proportion of chloroplast sequence in the data, a sequence from a closely related species was retrieved from a public database, the reads were mapped to it with the CLC reference assembly tool, and the chloroplast proportion of the data set was calculated. WGS data corresponding to about 100-fold to 500-fold chloroplast genome coverage were then extracted and assembled with the CLC assembler. Setting the k-mer value to 64 for this assembly helps prevent misassembly of tandem repeats. After assembly and a gap-filling step, the contigs were compared with known chloroplast sequences using BLAST to pick out the chloroplast contigs from the assembled contig set and determine their order, and the overlapping ends between contigs were identified and joined to make a single chloroplast contig. The raw data were then mapped to this chloroplast contig to locate erroneous regions, and false gaps, tandem repeat errors, homopolymer errors, and SNP errors were corrected manually using the error correction methods described above.
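The removal of low-quality reads at the start of this workflow can be sketched as a mean-quality filter on FASTQ records. The description does not state the quality criterion or the tool used, so the threshold of 20, the Phred+33 encoding, and both function names are assumptions made for illustration.

```python
def mean_phred(quality_line, offset=33):
    """Mean Phred score of a FASTQ quality string (Phred+33 assumed)."""
    return sum(ord(c) - offset for c in quality_line) / len(quality_line)


def filter_fastq(lines, min_mean_q=20):
    """Yield (header, sequence, quality) for each 4-line FASTQ record whose
    mean Phred score is at least min_mean_q. `lines` is any sequence of
    text lines, for example the result of file.readlines()."""
    records = [line.rstrip("\n") for line in lines]
    for i in range(0, len(records) - 3, 4):
        header, seq, _plus, qual = records[i:i + 4]
        if qual and mean_phred(qual) >= min_mean_q:
            yield header, seq, qual


# Toy input: 'I' encodes Phred 40 and '!' encodes Phred 0.
fastq = ["@r1\n", "ACGT\n", "+\n", "IIII\n",
         "@r2\n", "ACGT\n", "+\n", "!!!!\n"]
for header, seq, qual in filter_fastq(fastq):
    print(header, seq)
```

For paired-end data, both mates of a pair would normally be kept or dropped together, which this sketch does not handle.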
Example 6. Method for assembling a complete nuclear ribosomal DNA unit sequence
The gene regions of the 45S transcription unit and the ITS1 (internal transcribed spacer 1) and ITS2 regions have a relatively stable structure and have been major targets in studies of plant evolution and differentiation. A single plant 45S nrDNA unit is known to be about 6 to 18 kb long. The units are rapidly homogenized within a plant species by concerted evolution, but some heterogeneous forms have also been reported. Differences in nrDNA unit length are due mainly to length variation in the tandem subrepeat elements located in the intergenic spacer (IGS). In addition, the tandem subrepeat elements in the IGS give rise to heterogeneous forms within the genome through unequal crossing over, so in most cases even plants whose whole genome has been sequenced lack a complete nrDNA unit in the assembly.
Together with chloroplast genome assembly, the present invention provides a simple and accurate method for completing the 45S nrDNA repeat unit, an important subject of plant evolution and differentiation studies. The protocol completes the IGS sequence together with the 45S nrDNA transcription unit and yields the most representative sequence of the nrDNA unit; it does not imply that other, heterogeneous types of nrDNA are absent.
Contigs assembled from a random set and identified as nrDNA sequence contained a nearly complete 45S sequence together with all or part of the IGS sequence. Errors in nrDNA sequence assembly included cases where heterogeneous forms in the genome made it difficult to identify a single representative sequence, and cases where repeat collapse caused by tandem arrays produced N's so that a complete unit was not obtained. Nevertheless, a complete nrDNA sequence could be obtained in almost all cases through the following steps.
First, in assembling the conserved sequence within the 45S region, SNPs could arise from the presence of heterogeneous forms. These occur mainly in ITS1 and ITS2, and choosing the nucleotide with the highest depth was advantageous for finding the representative form. Other types could also be found at the same time by selecting the reads carrying the variant sequence.
Second, repeats collapse because of the tandem arrangement of subrepeat elements within the IGS. In the ginseng IGS, repeats ranging from 8 bp to 641 bp were found. The 641 bp repeat unit is present in 3.5 copies and is itself composed of tandemly arranged 337 bp and 149 bp units, which led to misassembly in many cases. Repeats both longer and shorter than the read length occur together, but because there are slight sequence differences between repeat units, the repeat collapse could be resolved using the mapping information of paired-end reads.
Third, when a single contig produced by de novo assembly contains only the 45S rDNA gene region and part of the IGS sequence, the tandem arrangement of 45S rDNA was exploited: two copies of the generated contig were joined in tandem with 50 to 200 artificially inserted nucleotides between them to create one new artificial contig. The GapCloser program (SOAP de novo package) was then run repeatedly on the new contig file to remove the inserted nucleotides, and the final unit was completed by raw data mapping.
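Building the artificial tandem contig from the third step can be sketched as below. The use of 'N' as the inserted nucleotide, the function names, and the FASTA record name are assumptions for illustration; how GapCloser is invoked on the resulting file is not shown because the description gives no options.

```python
def join_tandem_copies(contig, spacer_len=100):
    """Place two copies of an rDNA contig head to tail with a spacer of
    spacer_len placeholder bases ('N', an assumption) between them."""
    if not 50 <= spacer_len <= 200:
        raise ValueError("spacer_len should be within the 50 to 200 range given in the text")
    return contig + "N" * spacer_len + contig


def to_fasta(name, seq, width=60):
    """Format one sequence as a FASTA record with fixed line width."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return f">{name}\n" + "\n".join(lines) + "\n"


# Toy contig standing in for a 45S + partial IGS contig.
toy_contig = "ACGT" * 5
joined = join_tandem_copies(toy_contig, spacer_len=50)
print(to_fasta("nrDNA_tandem_draft", joined), end="")
```

Because the rDNA units are arranged head to tail in the genome, reads spanning the end of one unit and the start of the next can fill the spacer, which is what closes the IGS.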
Example 7. Validation using the chloroplast and nuclear ribosomal DNA of rice cultivar 'Nipponbare'
For validation, the reference rice cultivar Nipponbare, for which nearly complete genome and chloroplast sequences are available, was used as material, and its chloroplast genome and nrDNA were completed by the method described above. To check whether the completed sequences contained assembly errors, they were compared with the reference sequences and additionally reconfirmed by PCR and ABI sequencing.
The 134,591 bp chloroplast sequence and the 7,928 bp rDNA sequence completed by the method of the present invention matched the reference sequences exactly, showing that the method is accurate (FIG. 8). The rice nrDNA repeat unit including the IGS consists of 5,877 bp of 45S (18S-5.8S-26S) and a 2,051 bp IGS comprising the ETS (external transcribed spacer) and NTS (non-transcribed spacer), and a 254 bp subunit known as an enhancer of the 45S transcription unit was confirmed to be tandemly arranged in 3 copies (FIGS. 9b and 9d). In the finished rice reference genome, about 4.5 copies matching the unit completed here with 100% identity were found at the upper terminal region of chromosome 9 (GenBank No. OSJNBb0013K10; AP008245.2).
Example 8. De novo assembly of chloroplast genomes for various plant species
Using the method described above, chloroplast genomes were newly completed for a variety of plant species, including mosses, wild relatives of rice, and species with relatively large genomes such as ginseng, American ginseng (Panax quinquefolius), and onion. The chloroplast genome of ginseng cultivar 'Chunpoong' was 156,248 bp and that of American ginseng was 156,088 bp, with about 0.1% sequence variation between the two species. Error-prone regions such as tandem repeats were rechecked by PCR and ABI sequencing, which confirmed that accurate chloroplast genomes had been produced. In contrast, 127 SNPs and 71 insertion-deletions (indels) were found relative to the ginseng chloroplast genome sequence in GenBank (GenBank No. AY582139.1,66); these may reflect differences in the plant material or sequencing errors. When chloroplast genome sequences were additionally completed for 12 ginseng cultivars by the present method, the variation among all ginseng cultivars amounted to no more than 10 SNPs and 7 indels, so the previously reported chloroplast genome sequence was presumed to contain sequencing errors introduced during PCR walking.
In addition, 0.25 to 4 Gbp of WGS data were obtained for 7 rice cultivars and, from the Arizona Genomics Institute, which had generated WGS data for the Oryza Map Alignment Project, for 9 wild relatives of rice and the related genus Leersia. Some examples in which chloroplast genomes were completed from these data by the present method are listed in Table 7.
Table 7
Figure PCTKR2014010999-appb-T000007
A phylogenetic tree built from the completed chloroplast genome sequences showed the relatedness among rice and its relatives. Within rice, japonica and indica cultivars were clearly separated, and no variation at all was observed among cultivars of the same subspecies group. In addition, Tongil, Dasan, and Milyang 23, cultivars derived from indica-japonica hybrids, had the same chloroplast type as their final maternal parent (FIG. 12).
Example 9. De novo assembly of complete nuclear ribosomal DNA units for rice and ginseng species
The full-length nrDNA units of ginseng cultivar 'Chunpoong' and American ginseng were completed at 11,091 bp and 11,169 bp, respectively. Their 45S transcription units were 5,856 bp and 5,853 bp long and showed high homology with the rice transcription unit sequence, whereas their IGS regions were 5,235 bp and 5,316 bp long, about 3,200 bp longer than that of rice, and showed almost no homology to it (FIGS. 9b and 9c). PCR yielded products of the IGS length predicted from the completed sequences, indicating that the assemblies were accurate (FIGS. 9d and 9e). However, an additional band of about 500 bp was amplified in the ginseng IGS amplification (FIGS. 9e and 9f); this was presumed to result either from a heterogeneous type of nrDNA unit or from amplification of rDNA-derived fragments located outside the nucleolus organizer region (NOR) of the nuclear genome. The subrepeat located in the 5' ETS of the IGS, known as an enhancer of the 45S transcription unit, is tandemly arranged in 3.5 copies of 641 bp in 'Chunpoong' and 3.5 copies of 640 bp in American ginseng (parameters: 2 3 5, 95% match). Of the 17 rice accessions, the wild rice Oryza nivara was the only one for which the sequence could not be completed, presumably because heterogeneous rDNA types coexist in its genome. Comparison of the 45S nrDNA regions of the 17 rice accessions showed especially high variation in the ITS1 and ITS2 regions and some variation in the gene regions as well, whereas among ginseng cultivars the only variation was a single SNP in the 5.8S region. Ginseng and American ginseng differed greatly in the IGS, and SNPs were observed in some gene regions and ITS regions. Phylogenetic analyses could also be carried out using the 45S rDNA region sequences of the 13 ginseng and 16 rice accessions (excluding O. nivara) obtained by the present invention (FIGS. 13 and 14).
Example 10. Validation of marker development for species and cultivar identification after chloroplast sequence completion
Using the method of the present invention, chloroplast genome and rDNA sequences were completed for more than 100 diverse plant species, and for mosses the mitochondrial genome was completed as well. From sequences that differ between species, a variety of PCR markers capable of distinguishing species can be developed efficiently, which makes the method highly effective for developing barcoding markers for species identification, authentication of the origin of herbal medicines, classification, and similar uses (FIG. 15). When chloroplast sequences were completed for different cultivars of the same species, cultivar-specific markers could also be developed: three specific markers were obtained even between two ginseng cultivars, confirming that the method is very useful for developing markers for cultivar identification and protection of breeders' rights (FIG. 16).
rDNA sequences likewise allowed unique nuclear markers distinguishing species and cultivars to be developed, confirming that they can be used efficiently for developing markers for species and cultivar identification (FIG. 17).
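Candidate marker sites are positions where completed sequences differ between species or cultivars. A minimal sketch of locating such positions, assuming the sequences have already been aligned to equal length by some alignment tool (the function name, sample names, and toy sequences are hypothetical):

```python
def variable_sites(aligned):
    """Given a dict of name -> aligned sequence (all the same length),
    return {1-based position: {name: base}} for every alignment column
    in which the sequences do not all agree. Gaps ('-') count as a state."""
    names = list(aligned)
    length = len(aligned[names[0]])
    if any(len(seq) != length for seq in aligned.values()):
        raise ValueError("sequences must be aligned to the same length")
    sites = {}
    for i in range(length):
        column = {name: aligned[name][i] for name in names}
        if len(set(column.values())) > 1:
            sites[i + 1] = column
    return sites


# Toy alignment with one indel (position 3) and one SNP (position 4).
alignment = {
    "cultivar_A": "ACGTAC",
    "cultivar_B": "ACGAAC",
    "cultivar_C": "AC-TAC",
}
for position, column in variable_sites(alignment).items():
    print(position, column)
```

Primers for PCR markers would then be designed around the variable positions; that design step is outside this sketch.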
Example 11. Verification of the completion of the mitochondrial genomes of moss and mushroom by the same method as chloroplast sequence completion
The method of the present invention for completing the mitochondrial genome was applied to moss and mushroom, and it was confirmed that the genomes could be fully completed. Organisms such as animals, fish, and insects lack chloroplasts, but their mitochondria have a very stable structure, are small (16 kb circular DNA), and are present in many copies per cell. It was therefore verified that their mitochondrial genomes can be completed even more easily by applying the method used to complete chloroplast genomes in plants. Mosses and mushrooms have larger mitochondrial genomes. Nevertheless, for moss the chloroplast and mitochondrial genomes could be completed simultaneously by the same method (FIG. 20), and for mushroom, which has mitochondria but no chloroplasts, the completed mitochondrial genome is presented (FIG. 21). For fish, insects, mushrooms, mosses, and the like, the mitochondrial genome can be completed by the same method, and highly useful information can be extracted for barcoding markers such as those used for species discrimination.
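The reasoning that a small, high-copy organelle genome is easy to complete from whole-genome data can be made concrete with a rough depth estimate. The following Python sketch assumes reads are sampled in proportion to the amount of DNA per cell. Only the 16 kb genome size comes from the text above; the nuclear genome size, copy number, and sequencing yield are hypothetical values chosen for illustration.

```python
def organelle_depth(total_bases: float,
                    nuclear_size: float,
                    organelle_size: float,
                    copies_per_cell: float,
                    nuclear_copies: float = 2.0) -> float:
    """Estimate average sequencing depth over an organelle genome.

    Assumes reads are drawn in proportion to DNA mass per cell:
    nuclear DNA   = nuclear_copies * nuclear_size
    organelle DNA = copies_per_cell * organelle_size
    """
    nuclear_dna = nuclear_copies * nuclear_size
    organelle_dna = copies_per_cell * organelle_size
    fraction = organelle_dna / (nuclear_dna + organelle_dna)
    # Bases that fall on the organelle, spread over one copy of its genome.
    return total_bases * fraction / organelle_size


# Hypothetical scenario: 1 Gb of reads, 1 Gb haploid nuclear genome (diploid),
# 16 kb mitochondrial genome present at 1,000 copies per cell.
depth = organelle_depth(total_bases=1e9,
                        nuclear_size=1e9,
                        organelle_size=16_000,
                        copies_per_cell=1_000)

if __name__ == "__main__":
    print(f"estimated mitochondrial depth: {depth:.0f}x")
```

Under these assumed numbers the nuclear genome is covered only about once, while the mitochondrial genome is covered several hundred times, which is why low-coverage whole-genome data can suffice for the organelle.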

Claims (8)

  1. A method for determining the complete genome sequence of the chloroplast, mitochondria, or nuclear ribosomal DNA of an organism, individually or simultaneously, the method comprising:
    (a) sequencing the whole genome of the organism by next generation sequencing (NGS);
    (b) generating an NGS data set, from the reads (sequence fragments) produced by the sequencing of step (a), based on the amount of genome coverage of the chloroplast, mitochondrial, or nuclear ribosomal DNA;
    (c) assembling the reads of the NGS data set generated in step (b) using assembly software;
    (d) isolating, from the contigs produced by the assembly of step (c), contigs comprising one or more sequences selected from the group consisting of chloroplast, mitochondrial, and nuclear ribosomal DNA (nrDNA) sequences; and
    (e) joining the contigs isolated in step (d) using a sequence comparison program and correcting errors that arose during assembly.
  2. The method of claim 1, wherein step (e), for determining the complete sequence of the chloroplast or mitochondria, comprises aligning and joining those contigs isolated in step (d) that comprise chloroplast sequences into a complete circular sequence, then mapping the generated raw data sequences and removing assembly errors.
  3. The method of claim 1, wherein step (e), for determining the complete sequence of the nuclear ribosomal DNA, comprises artificially placing two contigs comprising the 45s rDNA sequence, among the contigs isolated in step (d), side by side with an artificial gap between them; filling the physical gap in the intergenic spacer (IGS) region using a gap closer (Gap closer) program to complete a full 45s rDNA unit; and mapping the raw data sequences of the completed full 45s rRNA unit and removing assembly errors.
  4. The method of claim 1, wherein the organism is a microorganism, a lower photosynthetic organism, a mushroom, a higher plant with a large genome, an insect, a fish, or an animal.
  5. The method of claim 1, wherein the NGS data set is an amount sufficient to cover the chloroplast genome 50 to 500 times.
  6. The method of claim 1, wherein the assembly software is CLC de novo assembly software or SOAP de novo assembly software.
  7. The method of claim 1, wherein the assembly error is a base sequence error, a false gap, a tandem repeat error, a monopolymer error, or a single nucleotide polymorphism (SNP) error.
  8. A recording medium on which a computer-readable program for performing the method of any one of claims 1 to 7 is recorded.
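Two of the claimed steps lend themselves to a short illustration: sizing the NGS data set for a target fold coverage (step (b) and claim 5), and placing two 45s rDNA contigs in series around an artificial gap before gap closing (claim 3). The Python sketch below is illustrative only and is not the patented implementation. The function names, the 150 kb chloroplast size, the 100 bp read length, the 5% chloroplast read fraction, the use of two copies of one contig, and the run of N characters as the gap are all assumptions introduced here.

```python
import math


def reads_for_coverage(genome_size: int,
                       read_length: int,
                       fold: float,
                       organelle_fraction: float = 1.0) -> int:
    """Number of whole-genome reads needed for `fold` coverage of the organelle.

    `organelle_fraction` is the assumed share of reads that derive from the
    organelle; with 1.0 every read is assumed to come from it.
    """
    if not 0 < organelle_fraction <= 1:
        raise ValueError("organelle_fraction must be in (0, 1]")
    organelle_bases = genome_size * fold
    return math.ceil(organelle_bases / (read_length * organelle_fraction))


def join_with_gap(left: str, right: str, gap_len: int = 100) -> str:
    """Concatenate two contigs with an artificial run of N between them.

    The N run stands in for the unresolved IGS region that a separate
    gap-closing program would later fill using the reads.
    """
    return left + "N" * gap_len + right


# Hypothetical values: 150 kb chloroplast genome, 100 bp reads,
# 5% of reads assumed to be of chloroplast origin.
low = reads_for_coverage(150_000, 100, 50, organelle_fraction=0.05)
high = reads_for_coverage(150_000, 100, 500, organelle_fraction=0.05)

if __name__ == "__main__":
    print(f"reads for 50x to 500x: {low:,} to {high:,}")
    contig = "ACGTACGT"  # toy stand-in for a 45s rDNA contig
    print(join_with_gap(contig, contig, gap_len=5))
```

Joining two copies of the same contig is one plausible reading of the claim for a tandemly repeated unit: the gap between the copies then corresponds to the spacer linking one unit to the next.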
PCT/KR2014/010999 2013-12-31 2014-11-17 Method for sequencing whole genome sequences of chloroplast, mitochondria or nuclear ribosomal dna of organism using next generation sequencing method WO2015102226A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR1020130167982A KR101447593B1 (en) 2013-12-31 2013-12-31 Method for determining whole genome sequence of chloroplast, mitochondria or nuclear ribosomal DNA of organism using next generation sequencing
KR10-2013-0167982 2013-12-31

Publications (1)

Publication Number Publication Date
WO2015102226A1 true WO2015102226A1 (en) 2015-07-09

Family

ID=51996655

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2014/010999 WO2015102226A1 (en) 2013-12-31 2014-11-17 Method for sequencing whole genome sequences of chloroplast, mitochondria or nuclear ribosomal dna of organism using next generation sequencing method

Country Status (2)

Country Link
KR (1) KR101447593B1 (en)
WO (1) WO2015102226A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101632881B1 (en) 2015-11-26 2016-06-23 주식회사 지앤시바이오 Sequencing method of genomic DNA end sequence using NGS
KR101665632B1 (en) 2016-06-14 2016-10-14 주식회사 지앤시바이오 Sequencing method of cDNA end sequence using NGS
KR101798229B1 (en) * 2016-12-27 2017-12-12 주식회사 천랩 ribosomal RNA sequence extraction method and microorganism identification method using extracted ribosomal RNA sequence
WO2021066465A1 (en) * 2019-10-01 2021-04-08 (주)컨투어젠 Method and apparatus for extracting nucleic acid from nucleic acid-containing sample while retaining 2-dimensional position information, and method for analyzing genome including position information using same

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101313087B1 (en) * 2011-10-31 2013-09-30 삼성에스디에스 주식회사 Method and Apparatus for rearrangement of sequence in Next Generation Sequencing

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
KUMAR, SHAILESH ET AL.: "Genome annotation of Burkholderia sp. SJ98 with special focus on chemotaxis genes.", PLOS ONE, vol. 8, no. 8, 5 August 2013 (2013-08-05), pages 1 - 2 *
VARSHNEY, RAJEEV K. ET AL.: "Next-generation sequencing technologies and their implications for crop genetics and breeding.", TRENDS IN BIOTECHNOLOGY, vol. 27, no. 9, September 2009 (2009-09-01), pages 522 - 530 *
ZAGORDI, OSVALDO ET AL.: "Error correction of next-generation sequencing data and reliable estimation of HIV quasispecies.", NUCLEIC ACIDS RESEARCH, vol. 38, no. 21, November 2010 (2010-11-01), pages 7400 - 7409 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107784199A (en) * 2017-10-18 2018-03-09 中国科学院昆明植物研究所 A kind of organelle gene group screening technique based on STb gene sequencing result
CN112259169A (en) * 2020-11-18 2021-01-22 东北农业大学 Method for rapidly acquiring chloroplast genome from transcriptome data
CN112259169B (en) * 2020-11-18 2024-01-30 东北农业大学 Method for rapidly obtaining chloroplast genome from transcriptome data
CN112802554A (en) * 2021-01-28 2021-05-14 中国科学院成都生物研究所 Animal mitochondrial genome assembly method based on second-generation data
CN112802554B (en) * 2021-01-28 2023-09-22 中国科学院成都生物研究所 Animal mitochondrial genome assembly method based on second-generation data

Also Published As

Publication number Publication date
KR101447593B1 (en) 2014-10-07

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application. Ref document number: 14876897; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase. Ref country code: DE
122 Ep: PCT application non-entry in European phase. Ref document number: 14876897; Country of ref document: EP; Kind code of ref document: A1