WO2015102226A1

WO2015102226A1 - Method for sequencing whole genome sequences of chloroplast, mitochondria or nuclear ribosomal DNA of organism using next generation sequencing method

Info

Publication number: WO2015102226A1
Application number: PCT/KR2014/010999
Authority: WO
Inventors: 양태진
Original assignee: 서울대학교산학협력단
Priority date: 2013-12-31
Filing date: 2014-11-17
Publication date: 2015-07-09
Also published as: KR101447593B1

Abstract

The present invention relates to: a method for determining base sequences by next generation sequencing (NGS) using a small amount of genomic DNA of a photosynthetic organism, efficiently performing assembly using only a particular quantity of the NGS data, and rapidly and accurately completing, concurrently or independently, the whole sequences of the chloroplast, mitochondria or nuclear ribosomal DNA of the organism from the assembly result; and a computer-readable recording medium which records a program for performing the method.

Description

Methods for deciphering the complete genome sequence of chloroplast, mitochondrial or nuclear ribosomal DNA of an organism using next generation sequencing methods

The present invention relates to a method for deciphering the complete genome sequence of chloroplast, mitochondrial or nuclear ribosomal DNA of an organism using a next generation sequencing method. More specifically, it relates to a method for deciphering the complete genome sequence of chloroplast, mitochondrial or nuclear ribosomal DNA of an organism, alone or simultaneously, comprising: (a) sequencing the whole genome of the organism by next generation sequencing (NGS); (b) generating an NGS data set based on the amount of genomic coverage of the chloroplast, mitochondrial or nuclear ribosomal DNA using the reads (sequences) generated by the sequencing of step (a); (c) assembling the reads of the NGS data set generated in step (b) using assembly software; (d) separating the contigs comprising at least one sequence selected from the group consisting of chloroplast, mitochondrial and nuclear ribosomal DNA (nrDNA) sequences from the contigs produced by the assembly of step (c); and (e) linking the separated contigs of step (d) using a sequence comparison program and correcting assembly errors. The invention also relates to a recording medium having recorded thereon a computer readable program for performing the method. The method is based on de novo assembly using low coverage whole genome sequence (DNALCW): it optimizes the de novo assembly step by producing a low coverage genome sequence of the organism using NGS, and it includes bioinformatic correction of the errors that may occur.

Plant cells have genomes in the nucleus, chloroplasts, and mitochondria. Chloroplasts are the main organelles responsible for photosynthesis and are generally maternally inherited. The chloroplast genome is 120-217 kb in size, with about 130 genes that are conserved with little variation, whereas relatively large numbers of variations such as single nucleotide polymorphisms (SNPs), insertion-deletions (InDels), inversions and translocations exist in the intergenic spacers (IGS) between genes. The chloroplast genome is circular in all plant cells, with hundreds of copies in one cell.

Like chloroplasts, mitochondria are present in tens of copies or more in a single cell. In plants, mitochondrial genomes vary in size and complexity, but in lower organisms such as mosses and mushrooms they are smaller than 100 kb and have a very stable structure.

Nuclear ribosomal DNA (nrDNA) is present in the nucleus of plant cells and is concentrated at two chromosomal ends in tandem repeats of thousands to tens of thousands of copies. It exists in the form of a nucleolar organizer region and is known to homogenize very quickly even when the genomes of both parents recombine. Plant nrDNA is highly conserved during plant evolution because it carries the genetic information for ribosome assembly and nucleolus formation. In higher plants, the four rRNA components are located in two chromosomal regions, the 5S nrDNA and the 45S nrDNA, but in some ancient plants such as Ginkgo biloba, mosses and algae, the 45S nrDNA and 5S nrDNA coexist in the same tandem unit.

In all seed plants, the 45S nrDNA forms one 45S cistron unit consisting of the 18S, 5.8S and 25S/26S/28S gene cluster and the relatively variable internal transcribed spacers ITS1 and ITS2 between the genes. The 45S cistron units are separated by IGSs of varying sizes and arranged in tandem.

The chloroplast genome and nrDNA sequences are very well conserved as essential genomic components and represent the cytoplasmic and nuclear genomes, respectively, providing important clues about the diversity and evolution of the entire plant genome. To date, about 360 complete chloroplast genome sequences (GenBank Organelle Genome Resources, July 2013) and only one nearly complete 45S nrDNA sequence (May 2013) have been reported in GenBank (www.ncbi.nlm.nih.gov/genbank/). Some chloroplast genome sequences were obtained through plant genome sequencing projects, but most were produced by independent researchers, either by sequencing chloroplast genomic DNA fragments inserted into BAC clones or by PCR walking and sequencing based on a reference genome sequence. On the other hand, the nucleolar organizer region (NOR), in which 45S nrDNA units are clustered, remains incomplete even in many plants whose genome sequencing has been completed. The complete 45S nrDNA unit sequence reported so far was obtained by BAC clone sequencing and shows 4.5 units of the complete 7,928 bp 45S rDNA sequence at the terminal region of rice chromosome 9 (GenBank No. OSJNBb0013K10; AP008245.2). In addition, complete nrDNA units of approximately 9 kb were assembled on Solanum lycopersicum chromosomes 2 (GenBank No. AC215459.2) and 3 (GenBank No. AC246968.1). To date, 45S rDNA units have been reported in more than 20 species, including on Arabidopsis chromosomes 2 and 3 (based on NCBI blastn).

Recently, some reports have completed the chloroplast genome by next generation sequencing (NGS) using the 454 GS-FLX, SOLiD and Illumina platforms. However, after de novo sequence assembly, additional PCR and sequencing are still needed to fill large gaps, which requires a lot of effort and time. A method of simultaneously producing nrDNA units and partial organelle genome sequences from whole genome sequence generated on the GS-FLX platform in two species of lichen was also recently introduced (Liu et al., 2013, Mol Phylogenet Evol, 66: 1089-1094). However, most of the NGS-based approaches introduced so far are difficult to apply at high throughput and require a lot of time and effort.

NGS technology can analyze sequences with significant savings in time and cost, but obtaining meaningful and complete data from large data sets is an important challenge. The present inventors therefore developed a very effective method that produces Illumina paired-end sequences from routinely prepared whole genomic DNA and obtains the complete chloroplast genome sequence and a complete nrDNA unit simultaneously using less than 1 Gbp of sequence. The inventors also propose a way to analyze and resolve all types of errors that may occur during assembly, so that the complete sequence can be finished without additional PCR or ABI sequencing. The method is applicable to organisms ranging from lower plants such as mosses and lichens to onions and lilies, which have very large genomes, and is expected to serve as a groundbreaking tool for analyzing species diversity and evolutionary origins throughout the plant kingdom. In addition, by completing the chloroplast and nrDNA sequences of multiple lines within a species, differences between lines can be identified, which suggests practical applications such as cultivar identification markers, protection of biological sovereignty and protection of breeders' rights.

In plants, chloroplasts and nrDNA are used as the main target sequences, but animals, fish and insects have no chloroplasts. Instead, their mitochondrial genomes, which are highly stable at about 16 kb to 100 kb, are used as the regions of nucleotide diversity for evolutionary studies. In addition, mitochondria, like chloroplasts, are present in many copies in a single cell, so their genomes can be completed in the same way as chloroplast genomes.

Meanwhile, Korean Patent Publication No. 2013-0134269 discloses 'Ultra High Density Gene Mapping Technique Using Next Generation Base Sequence SNP Genotyping', and Korean Patent No. 1313087 discloses 'Sequence Recombination Method and Apparatus for NGS', but neither describes a method for deciphering the complete genome sequence of chloroplast, mitochondrial or nuclear ribosomal DNA of an organism using next generation sequencing as in the present invention.

The present invention is derived from the above requirements. The present inventors have accomplished the invention by developing a method that determines base sequences by next generation sequencing (NGS) from a small amount of genomic DNA of a photosynthetic organism, efficiently performs assembly using only a specific amount of the NGS data, and rapidly and accurately completes the complete sequence of chloroplast, mitochondrial or nuclear ribosomal DNA of the organism, simultaneously or independently, from the assembly results.

In order to solve the above problems, the present invention provides a method for deciphering the complete genome sequence of chloroplast, mitochondrial or nuclear ribosomal DNA of an organism, alone or simultaneously, the method comprising:

(a) sequencing the whole genome of the organism by next generation sequencing (NGS);

(b) generating an NGS data set based on the amount of genomic coverage of the chloroplast, mitochondrial or nuclear ribosomal DNA using the reads (sequences) generated by the sequencing of step (a);

(c) assembling the reads of the NGS data set generated in step (b) using assembly software;

(d) separating the contigs comprising at least one sequence selected from the group consisting of chloroplast, mitochondrial and nuclear ribosomal DNA (nrDNA) sequences from the contigs produced by the assembly of step (c); and

(e) linking the separated contigs of step (d) using a sequence comparison program and correcting assembly errors.

The present invention also provides a recording medium having recorded thereon a computer readable program for performing the method.

In the present invention, a novel method and error elimination method for simultaneously de novo assembling a complete chloroplast genome sequence, mitochondrial sequence and 45s nrDNA main unit sequence with a small amount of whole genome sequence information in a highly efficient manner has been developed. The nucleotide sequence decoding method of the present invention is based on studies of high copy essential genomic regions and major repeat sites, development of DNA barcoding markers for identifying species or genera, identification of origin, seed purity, research on evolutionary mechanisms of photosynthetic organisms, and sequence information. As it can be applied variously to the protection of the rights of indigenous resources and the rights of breeders for a specific breed, it is considered to be useful industrially.

FIG. 1 shows the results of extracting and analyzing the top 30 contigs from the data sets of rice (Oryza sativa) and ginseng (Panax ginseng). In a, the numbers of chloroplast, nuclear ribosomal DNA (nrDNA), mitochondrial and other contigs are displayed as a bar graph, and the % value is the ratio of the length covered by the extracted contigs to the total length of each reference sequence. b shows the result of mapping the chloroplast sequences assembled from the rice Os2 data set. c shows the result of mapping the ginseng assembly and the chloroplast sequences involved in the assembly, and d shows the result of mapping 10 kb long paired-end data of ginseng to verify that the assembly was successful.

FIG. 2 shows the difference in assembly efficiency according to the assembly conditions in rice. The left side shows the assembly of three data sets with different sequence amounts using the SOAP de novo program, and the right side shows the assembly of the same three data sets using the CLC de novo assembler, which is shown to be much more efficient at completing chloroplast sequences.

FIG. 3 shows the difference in assembly efficiency according to the assembly conditions in ginseng. The left side shows the assembly of three data sets with different sequence amounts using the SOAP de novo program, and the right side shows the assembly of the same three data sets using the CLC de novo assembler, which is shown to be much more efficient at completing chloroplast sequences.

FIG. 4 shows assembly errors caused by sequence deletion and their correction.

FIG. 5 shows tandem repeat regions that may appear as assembly errors. a is a type of incorrect assembly in a ginseng repeat region, b is a schematic diagram showing the correct assembly of a region of ginseng containing 4 copies of an 18 bp tandem repeat, and c shows the read depth when the 18 bp tandem repeat of b is assembled as two and four copies, respectively.

FIG. 6 shows assembly errors arising where thymine (T) homopolymer sites of various lengths and their homologous sequences exist in the nucleus, and a method of correcting them.

FIG. 7 shows an example of a misassembled site caused by nuclear or mitochondrial DNA reads (sequence fragments) similar in sequence to the chloroplast, and a method of correcting it.

FIG. 8 shows the chloroplast sequence completed by the method of the present invention and the sequence reads involved in the assembly, showing 100% agreement with the reported reference sequence.

FIG. 9 shows the distribution of ribosomal DNA (rDNA) in rice and ginseng. a compares the completed 7,928 bp rDNA unit of the rice cultivar Nipponbare with the homologous region of Nipponbare chromosome 9; b is a model of the structure of one nrDNA unit of rice (Oryza sativa), American ginseng (Panax quinquefolius) and ginseng (Panax ginseng); c compares the nrDNA sequences of American ginseng, ginseng and rice; d shows the result obtained with primers designed in the 45S conserved region to confirm the IGS length of the completed nrDNA; and e and f show PCR results confirming the IGS length and the interspecies variable region of ginseng and American ginseng, demonstrating the same length as the completed sequence.

FIG. 10 is a chloroplast genome map of rice and rice seedlings completed by the method of the present invention.

FIG. 11 is a chloroplast genome map of ginseng and ginseng related species completed by the method of the present invention. The arrowheads on the inside of the genome map indicate regions that show variation between the ginseng varieties 'hurricane' and 'gusts'.

FIG. 12 shows the result of a phylogenetic analysis of 17 rice varieties based on the chloroplast genome.

FIG. 13 shows the evolutionary relationship of ginseng, obtained by a phylogenetic analysis of ginseng and related species based on the 45S rDNA sequence.

FIG. 14 shows the result of a phylogenetic analysis of 16 types of rice based on the 45S rDNA sequence.

FIG. 15 shows an example of species-specific chloroplast genome-based barcoding markers developed using the method of the present invention, illustrating their utility as markers for classifying plant species used as herbal medicines.

FIG. 16 shows a case in which a marker specific to 'heaven' was developed in a region (the arrowhead area of FIG. 11) showing a difference between the ginseng varieties 'typhoon' and 'lotuses'.

FIG. 17 shows a unique marker, derived from the ribosomal DNA sequence, that can be used for species identification and cultivar identification.

FIG. 18 is a schematic diagram illustrating the flow of the de novo assembly method for chloroplast and ribosomal DNA according to the present invention, showing that the NGS method can simultaneously analyze a large number of plants (up to 600).

FIG. 19 is a flowchart of a specific method of deciphering the complete sequence information of chloroplasts and nrDNA from whole genome sequence (WGS) according to the present invention. De novo assembly using low coverage WGS (DNALCW) is a method that simultaneously completes chloroplast, mitochondrial and rDNA sequences by de novo assembly of a small amount of low coverage WGS.

FIG. 20 is a genome map of the moss mitochondrial genome completed in the same way as the chloroplast completion method of the present invention.

FIG. 21 is a genome map of the mushroom mitochondrial genome completed in the same way as the chloroplast completion method of the present invention.

In order to achieve the above object, the present invention provides the method comprising steps (a) to (e) described above.

In the method according to an embodiment of the present invention, step (e), when deciphering the complete nucleotide sequence of the chloroplast or mitochondria, may include aligning and concatenating the contigs containing the chloroplast sequences among the separated contigs of step (d) to form a complete circular sequence, and then mapping the raw data sequences to it and eliminating assembly errors.

Step (e), when deciphering the complete sequence of the nuclear ribosomal DNA, may include, but is not limited to, artificially arranging two contigs containing the 45S rDNA sequence among the separated contigs of step (d) with an artificial gap between them, filling the physical gap in the intergenic spacer (IGS) region between genes using a gap closer program to complete the full 45S rDNA unit, and then mapping the raw data sequences to the completed 45S rDNA unit and eliminating assembly errors.

In the method according to an embodiment of the present invention, the NGS data set of step (b) may be an amount capable of covering 50 to 500 times the chloroplast genome, but is not limited thereto.

In addition, in the method according to an embodiment of the present invention, the assembly software of step (c) may be SOAP de novo, CLC de novo, Bowtie, Velvet or BWA, etc., preferably SOAP de novo or CLC de novo Software, but is not limited thereto.

In the method according to an embodiment of the present invention, the base sequence comparison program of step (e) may be a program such as Blast, Clustal X, Bioedit or Phydit, preferably Blast or Bioedit, and more preferably Blast, but is not limited thereto.

In the method according to an embodiment of the present invention, the organism may be a microorganism, a lichen, a bryophyte such as moss, an alga such as a green, red or brown alga, a fungus including mushrooms, a higher plant including those with giant genomes, an insect, a fish or an animal, but is not limited thereto.

In the method according to an embodiment of the present invention, the assembly error may be a sequence error, a false gap, a tandem repeat error, a homopolymer error or a single nucleotide polymorphism (SNP) error, but is not limited thereto.

The present invention also provides a recording medium having recorded thereon a computer readable program for performing the method. Specifically, the present invention provides a recording medium that records a computer-readable program for performing a method for analyzing a complete sequence of chloroplast, mitochondrial or nuclear ribosomal DNA of an organism using NGS experimental data.

A computer-readable recording medium refers to any recording medium that can be read and accessed directly by a computer. Such recording media include magnetic recording media such as floppy disks, hard disks and magnetic tapes, optical recording media such as CD-ROMs, CD-Rs, CD-RWs, DVD-ROMs, DVD-RAMs and DVD-RWs, electrical recording media such as RAMs and ROMs, and mixtures of these categories (e.g., magnetic/optical recording media such as MO), but are not limited thereto.

Hereinafter, the present invention will be described in detail by way of examples. However, the following examples are merely to illustrate the invention, but the content of the present invention is not limited to the following examples.

Example 1. De novo assembly of the chloroplast genome and nrDNA using whole genome sequence

Whole genome assembly requires genome coverage of about 3 to 10-fold for Sanger sequencing, about 13 to 22-fold for 454 pyrosequencing, and about 60 to 100-fold for Illumina sequencing. With the Illumina platform, which has been the most widely used recently, assembly is performed using whole genome sequence (WGS) of 100-fold or more genome coverage, and the genome sequencing of more than 30 plants has been completed. However, even with such a large amount of WGS, it is difficult to obtain complete chloroplast genome and nrDNA sequences: most of the contigs containing chloroplast sequences were found to be chimeric, fused with nuclear genomic DNA sequences. On the other hand, when a small amount of WGS data, about 1-fold genome coverage, was assembled with the CLC de novo assembler, most of the long contigs were sequences present at very high copy number in the cell, such as chloroplast, mitochondrial and ribosomal DNA (rDNA) sequences. In rice (Oryza sativa), five of the 30 longest assembled contigs covered the entire chloroplast genome with overlaps of about 20 bp, and one contig of 6,889 bp containing the nuclear ribosomal DNA (nrDNA) sequence was also identified, while 15 contigs covered about 50% of the mitochondrial genome (Table 1 and FIG. 1).
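The selection of the 30 longest contigs described above can be sketched as follows. This is a minimal illustration only, not the program of the invention: the function names `read_fasta` and `longest_contigs` and the three-record demo input are hypothetical, and a real assembly would be read from the assembler's FASTA output file.

```python
from typing import Dict, List, Tuple


def read_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA-formatted text into {record name: sequence}."""
    records: Dict[str, List[str]] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header line as the contig name.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {n: "".join(parts) for n, parts in records.items()}


def longest_contigs(contigs: Dict[str, str], top_n: int = 30) -> List[Tuple[str, int]]:
    """Return (name, length) for the top_n longest contigs, longest first."""
    ranked = sorted(contigs.items(), key=lambda kv: len(kv[1]), reverse=True)
    return [(n, len(s)) for n, s in ranked[:top_n]]


# Hypothetical three-contig assembly used only to exercise the functions.
demo = ">c1\nACGT\n>c2 example\nACGTACGTAC\n>c3\nAC\n"
print(longest_contigs(read_fasta(demo), top_n=2))
```

The selected contigs would then be compared with chloroplast, mitochondrial and nrDNA reference sequences to classify them, as in Table 1.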

Table 1

In Panax ginseng, the 30 longest contigs likewise comprised 3 chloroplast contigs, 13 mitochondrial contigs and one 9,422 bp contig containing the nrDNA sequence (Table 2 and FIG. 1).

Table 2

These results suggested that, both in rice, which has a relatively small genome of 430 Mbp, and in ginseng, which has a relatively large genome and has been little studied, complete chloroplast and nrDNA sequences could be obtained from WGS under appropriate assembly conditions. The present study was therefore conducted to find the optimal assembly conditions.

Example 2. Selection of the de novo genome assembler

Among the several genome assemblers currently in use, SOAP de novo version 2.04 (http://soap.genomics.org.cn/) and CLC-NGS-CELL version 4.06 beta (www.clcbio.com/products/clc-assembly-cell) were compared for their ability to assemble contigs covering the chloroplast genome. Both assemblers produced contigs containing chloroplast sequences, but SOAP de novo version 2.04 produced a much larger number of shorter contigs from the 50-fold and 250-fold data sets than CLC-NGS-CELL version 4.06 beta, its coverage was lower, and it was very sensitive to the conditions for completing the chloroplast genome. The CLC assembler, in contrast, covered the entire chloroplast genome with fewer than five relatively long contigs, even as the data set grew. In particular, even in ginseng, whose genome is relatively large among plants (about 3.2 Gbp), the CLC assembler covered the entire chloroplast genome with a small number of contigs (Table 3).

Table 3

Therefore, further studies were conducted to establish optimal conditions for completing the chloroplast genome and nrDNA sequences using the CLC assembler.

Example 3. Amount of NGS data required for chloroplast genome assembly

To find the appropriate amount of data for assembling chloroplast and nrDNA sequences from WGS data, 44,425,734,760 bp of WGS from the rice reference cultivar Nipponbare and 220,948,250,844 bp of WGS from the ginseng cultivar 'Typhoon' were used. These amounts correspond to about 100 and 70 times the sizes of the rice and ginseng genomes, respectively, and the chloroplast genome accounted for 1.69% and 6% of each WGS, respectively. On this basis, 10 WGS data sets ranging from 50-fold to 5,000-fold chloroplast genome coverage were generated, as shown in Table 4 below.

Table 4

In rice and ginseng, 1-fold genome coverage corresponded to chloroplast genome coverage of 50-fold and 1,050-fold, respectively, and to rDNA coverage of 324-fold and 3,560-fold, respectively. After assembly of each data set, the number of contigs containing the chloroplast sequence (NCBI accession No. GU592207.1) and the assembly errors were determined.

In the rice (Oryza sativa) data sets OS3 to OS6 and the Panax ginseng data sets PG3 to PG7, the chloroplast genome was covered by a small number of contigs with few sequence errors, as shown in Table 5.

Table 5

Data sets with chloroplast genome depths of 50-fold or less, or 1,000-fold or more, either failed to cover the entire chloroplast genome or showed more contigs and more assembly errors (FIGS. 2 and 3).

Analysis of the rice and ginseng NGS data sets to determine the appropriate amount of WGS data showed that data sets of about 100 to 500-fold chloroplast coverage gave few gaps and mismatches. This corresponds to 0.86 to 4.3 Gbp of WGS for rice and 0.3 to 1.5 Gbp for ginseng, confirming that even a plant with a large genome, such as ginseng, can be assembled from a modest amount of WGS sequence, depending on the proportion of chloroplast genome it contains.
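The WGS amounts quoted above follow from the figures given in this description: 1-fold genome coverage yields about 50-fold (rice) and 1,050-fold (ginseng) chloroplast coverage, and the genome sizes are 430 Mbp and about 3.2 Gbp. The short calculation below reproduces that arithmetic; the function name `wgs_needed_bp` is a hypothetical helper introduced only for illustration.

```python
def wgs_needed_bp(target_cp_depth: float,
                  cp_depth_per_genome_fold: float,
                  genome_size_bp: float) -> float:
    """WGS amount (bp) needed to reach a target chloroplast depth.

    cp_depth_per_genome_fold is the chloroplast depth observed at 1-fold
    nuclear genome coverage (50 for rice, 1,050 for ginseng in the text).
    """
    genome_folds = target_cp_depth / cp_depth_per_genome_fold
    return genome_folds * genome_size_bp


RICE_GENOME_BP = 430e6      # 430 Mbp, as stated in Example 1
GINSENG_GENOME_BP = 3.2e9   # about 3.2 Gbp, as stated in Example 2

for depth in (100, 500):
    rice = wgs_needed_bp(depth, 50, RICE_GENOME_BP)
    ginseng = wgs_needed_bp(depth, 1050, GINSENG_GENOME_BP)
    print(f"{depth}x chloroplast: rice {rice / 1e9:.2f} Gbp, "
          f"ginseng {ginseng / 1e9:.2f} Gbp")
```

The results, 0.86 and 4.30 Gbp for rice and about 0.30 and 1.52 Gbp for ginseng, agree with the ranges stated above.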

Example 4. Correction of assembly errors

Assembling the chloroplast genome from NGS data produces gaps or unspecified nucleotides ('N'). The errors include false gaps, errors caused by tandem repeats, errors caused by homopolymers, and single nucleotide polymorphism (SNP) type errors due to interference from mitochondrial and nuclear genomic DNA. Before these errors can be corrected, the error sites must be found. This is done by completing a draft chloroplast genome sequence, mapping the raw data to it, and inspecting the overall mapping pattern in the CLC Assembly Viewer. Regions with assembly errors contain many mismapped raw reads (sequence fragments) and can be corrected as follows:

4-1. False gap

Gaps that are not real but are caused by assembly errors usually appeared as N in the assembled sequence. As shown in FIG. 4, the assembler created a false gap containing one N even though the sequences on the left and right sides of the misassembled site overlapped each other. When the overlapping portions were manually merged into one sequence to give a complete sequence without N, the raw reads mapped cleanly to the modified sequence. In rice, the modified sequence matched the reference sequence of the cultivar Nipponbare, and in ginseng the resolution of the false gap was verified by PCR and sequencing.
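The merging of overlapping flanks around a false gap can be sketched as below. This is a simplified illustration under stated assumptions, not the procedure used in the CLC software: the function name `merge_false_gap`, the minimum overlap of 8 bases and the demo sequence are all hypothetical, and the sketch assumes the scaffold contains a single run of N.

```python
def merge_false_gap(scaffold: str, min_overlap: int = 8) -> str:
    """Remove a run of N when the sequences flanking it overlap each other.

    Assumes a single N run. If the end of the left flank equals the start
    of the right flank, the two are merged; otherwise the scaffold is
    returned unchanged and the gap is treated as real.
    """
    first_n = scaffold.find("N")
    if first_n == -1:
        return scaffold
    last_n = scaffold.rfind("N")
    left, right = scaffold[:first_n], scaffold[last_n + 1:]
    # Try the longest possible overlap first.
    for k in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left[-k:] == right[:k]:
            return left + right[k:]
    return scaffold


# Hypothetical example: the last 10 bases of the left flank repeat at the
# start of the right flank, with one N inserted between them.
true_seq = "ACGTTGCAAGGCTTACCGGATTACGATCCA"
scaffold = true_seq[:20] + "N" + true_seq[10:]
print(merge_false_gap(scaffold) == true_seq)
```

After such a merge, the raw reads should be mapped again to confirm that they align cleanly across the repaired site.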

4-2. Tandem repeats

NGS produces significantly more data than Sanger sequencing but increases the chance of assembly errors because of its short read length (around 100 bp). In particular, tandem repeat regions frequently caused errors in de novo genome assembly through changes in copy number. Repeats that are longer than the read length, or that occur in tandem or interspersed within the genome, resulted in repeat collapse and rearrangement. When the repeat unit is shorter than the read length, the error could be resolved by adjusting the k-mer value according to the repeat length. As shown in FIG. 5, the collapse of an 18 bp tandem repeat was corrected from 2 copies to 4 copies when the k-mer value was set to the maximum of 64. When raw data are mapped to a draft chloroplast sequence assembled with fewer tandem repeat copies than the original, mismapped reads are visible where the repeat collapse occurred, and because these mismapped raw reads are included, the depth is also significantly higher than in the surrounding area. This makes it possible to predict and correct errors in the repeat copy number. Since the unit size of tandem repeats in most plant chloroplast genomes is less than 100 bp, almost all such errors can be detected and eliminated in this way.
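The depth-based prediction of repeat copy number can be illustrated with a simple proportional estimate. This is an assumed heuristic for illustration, not a formula given in this description: if reads from all true copies pile up on the collapsed copies, the depth over the repeat rises in proportion to the missing copies. The function name and the depth values are hypothetical.

```python
def estimate_repeat_copies(assembled_copies: int,
                           repeat_depth: float,
                           flank_depth: float) -> int:
    """Estimate the true tandem repeat copy number from read depth.

    Scales the assembled copy number by the ratio of the depth over the
    repeat to the depth of the flanking single-copy sequence.
    """
    if flank_depth <= 0:
        raise ValueError("flank_depth must be positive")
    scaled = round(assembled_copies * repeat_depth / flank_depth)
    # Never report fewer copies than are already assembled.
    return max(assembled_copies, scaled)


# Hypothetical depths: an 18 bp repeat assembled as 2 copies shows about
# twice the depth of the surrounding region.
print(estimate_repeat_copies(assembled_copies=2, repeat_depth=410.0, flank_depth=200.0))
```

An estimate of this kind only flags a candidate; the corrected sequence should still be checked by remapping the raw reads.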

4-3. Homopolymers

Homopolymers cause many problems not only in genomic DNA but also in the chloroplast genome. The rice and ginseng chloroplast genomes contained 95 and 91 homopolymers of 8 bases or longer, respectively, most of which were adenine (A) or thymine (T) repeats (Table 6).

Table 6

Errors in these homopolymer regions may be caused by sequencing errors, but chloroplast genome assembly errors were predicted to arise in particular from homologous chloroplast sequence fragments inserted into mitochondrial or nuclear DNA, in which mutations have accumulated at homopolymer sites and which interfere with assembly. In rice, DNA fragments derived from the chloroplast genome are inserted and distributed throughout the chromosomes. At position 78,424 bp of the rice chloroplast genome (NCBI accession No. GU592207.1) there is a homopolymer of 17 Ts, and sequences similar to the surrounding region, differing mainly in the T polymer, are distributed on 10 rice chromosomes (FIGS. 6A and 6B). The initial assembly with the Os3 data set gave T8, which was judged to be a misassembly caused by similar chloroplast-derived sequences on nuclear chromosomes 5, 6, 7 and 9. As a correction method, candidate sequences were generated with each T homopolymer length present in the raw data, the raw data were mapped to them, and the homopolymer length with the highest depth was selected. Because the chloroplast genome sequence is present at high depth in NGS data, the candidate with the highest depth was judged to be the most accurate chloroplast genome sequence. Indeed, when draft chloroplast genome sequences were constructed with T homopolymers of 7, 8, 9, 10, 11, 12, 15 and 17 T and paired-end reads were mapped at 100% similarity, the highest depth, 33.14, was obtained for the sequence with the 17 T homopolymer, which is identical to the reference chloroplast genome sequence (FIG. 6C). The influence of chloroplast-derived sequences in nuclear DNA on assembly was particularly strong in rice chloroplast genome assembly when the amount of WGS data exceeded 5-fold rice genome coverage, where the error probability and the formation of chimeric assemblies increased.
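The candidate-based homopolymer correction can be sketched as follows. As a stated simplification, exact substring matching of reads stands in for read mapping at 100% similarity, and support is counted as the number of matching reads rather than as mapped depth. The function name, flanking sequences and reads are hypothetical; only the list of candidate lengths comes from the text.

```python
from typing import Dict, Iterable, List, Tuple


def best_homopolymer_length(reads: List[str],
                            left_flank: str,
                            right_flank: str,
                            lengths: Iterable[int],
                            base: str = "T") -> Tuple[int, Dict[int, int]]:
    """Pick the homopolymer length supported by the most reads.

    For each candidate length, build flank + run + flank and count the
    reads that contain it exactly. The flanks must not end or start with
    the homopolymer base, so that runs of different length cannot match.
    """
    support: Dict[int, int] = {}
    for n in lengths:
        candidate = left_flank + base * n + right_flank
        support[n] = sum(candidate in read for read in reads)
    best = max(support, key=support.get)
    return best, support


# Hypothetical reads: five chloroplast-like reads with 17 Ts and two reads
# from a nuclear copy carrying only 8 Ts.
left, right = "ACGGA", "GCCTA"
reads = ["GG" + left + "T" * 17 + right + "CA"] * 5 + [left + "T" * 8 + right] * 2
best, support = best_homopolymer_length(reads, left, right, [7, 8, 9, 10, 11, 12, 15, 17])
print(best, support)
```

Because chloroplast reads greatly outnumber reads from nuclear insertions, the best-supported candidate is taken as the chloroplast sequence, in line with the reasoning above.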

4-4. False single nucleotide polymorphisms (SNPs) caused by interference from homologous mitochondrial and nuclear DNA

Increasing the amount of WGS used in the initial assembly can cause chloroplast-derived DNA fragments inserted into the mitochondrial and nuclear genomes to be incorrectly incorporated into the assembly, leading to single nucleotide polymorphism (SNP) errors. This error occurred very rarely, owing to the nature of the CLC assembler, which assembles using a de Bruijn graph (Compeau et al., Nat Biotechnol, 29: 987-991) (FIG. 7b). This type of error appears as if it were an SNP and was identified by checking the raw reads mapped to the draft assembly. The draft chloroplast genome from the Os5 set showed guanine (G) and thymine (T) at 51,940 and 51,944 bp, but most raw reads (186 of 212) had T and A at these positions, indicating that the draft was incorrect (FIG. 7A). The G and T bases were observed in only 24 of the 212 reads and were found to originate from the mitochondria. The positions were therefore corrected to T and A, the major bases supported by 186 of the 212 reads.
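The majority-based correction described above amounts to a simple vote over the reads covering the site. The sketch below uses the read counts given in the text (186 and 24 of 212); the remaining 2 reads are not characterized in the text and are represented by a placeholder. The function name `majority_call` is hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def majority_call(observed: List[str]) -> Tuple[str, float]:
    """Return the most frequent allele and its fraction among the reads."""
    counts = Counter(observed)
    allele, n = counts.most_common(1)[0]
    return allele, n / len(observed)


# Each string holds the bases a read shows at positions 51,940 and 51,944.
# "??" is a placeholder for the 2 reads the text does not describe.
observed = ["TA"] * 186 + ["GT"] * 24 + ["??"] * 2
allele, fraction = majority_call(observed)
print(allele, round(fraction, 3))
```

Here the draft bases G and T are supported by only a minority of reads, so the site is corrected to the majority bases T and A.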

Example 5. Optimization of conditions for sequencing the complete chloroplast genome

Genomic DNA was prepared from plant leaves, a paired-end library of 300-500 bp was constructed from at least 1 μg of DNA, and about 1 Gbp of WGS data was generated using a HiSeq2000 or MiSeq (Illumina, USA) platform. Low-quality sequences were removed from the generated data set. To determine the proportion of chloroplast sequences in the data (the chloroplast contamination rate), related sequences were retrieved from published databases and the reads were mapped to them using the CLC reference assembly tool. Once the chloroplast contamination rate was determined, an amount of WGS data corresponding to about 100-500 times chloroplast genome coverage was extracted and assembled using the CLC assembler. At this stage it is helpful to set the k-mer value to 64 in order to prevent misassembly of tandem duplications. After assembly, chloroplast contigs in the assembled contig data set were identified by comparison with known chloroplast sequences using BLAST, their order was determined, and the overlapping portions between contigs were found in a gap-filling process to produce a single chloroplast contig. The raw data was then mapped to the resulting chloroplast contig to locate faulty regions, and false gaps, tandem duplication errors, homopolymer errors, and SNP errors were corrected on the basis of the raw data mapping.
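The amount of WGS data to extract for a target chloroplast coverage follows from simple arithmetic: the required bases equal the target coverage times the chloroplast genome size, divided by the fraction of the data that is chloroplast-derived. The sketch below is a rough illustration. The chloroplast fraction of 0.05 and the read length of 100 bp are assumed example values, and the function names are hypothetical.

```python
import math
import random


def wgs_bases_needed(target_coverage, cp_genome_size, cp_fraction):
    """Total WGS bases so that chloroplast reads reach target_coverage."""
    if not 0 < cp_fraction <= 1:
        raise ValueError("cp_fraction must be in (0, 1]")
    return target_coverage * cp_genome_size / cp_fraction


def reads_needed(target_coverage, cp_genome_size, cp_fraction, read_length):
    """Number of reads of a given length that provide the required bases."""
    total = wgs_bases_needed(target_coverage, cp_genome_size, cp_fraction)
    return math.ceil(total / read_length)


def subsample_reads(reads, n, seed=0):
    """Randomly draw n reads without replacement (all reads if n is larger)."""
    reads = list(reads)
    if n >= len(reads):
        return reads
    return random.Random(seed).sample(reads, n)


if __name__ == "__main__":
    # 134,591 bp is the rice chloroplast length given in this document;
    # the 5% chloroplast fraction and 100 bp read length are assumptions.
    for coverage in (100, 300, 500):
        n = reads_needed(coverage, 134_591, 0.05, 100)
        print(coverage, "x chloroplast coverage needs about", n, "reads")
```

The chloroplast fraction would in practice come from the reference mapping step described above, so the subsample size differs for each species and library.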

Example 6. Assembly method for complete nuclear ribosomal DNA unit sequences

The gene region of the 45S transcription unit, the internal transcribed spacer 1 (ITS1), and the ITS2 region have relatively stable structures and have been used as main targets for research on plant evolution and differentiation. One unit of plant 45S nrDNA is known to be about 6-18 kb in length and has been reported to be homogenized within a plant species by concerted evolution, although some heterogeneous forms also exist. The length difference among nrDNA units is mainly due to the length diversity of tandem subrepeat elements present in the intergenic spacers (IGS) between genes. In addition, the tandem subrepeat elements present in the IGS generate heterogeneous forms in the genome by unequal crossing over, and thus the complete nrDNA unit is not included even in completely sequenced genomes.

In the present invention, along with chloroplast genome assembly, a simple and accurate method was developed for completing 45S nrDNA repeat units, which are an important target of research on plant evolution and differentiation. The protocol presents the most representative sequence of the nrDNA unit by completing the IGS sequence together with the 45S nrDNA transcription unit; it does not imply that no other heterogeneous nrDNA forms are present.

Contigs assembled from a random set and identified as nrDNA sequences had a nearly complete 45S sequence and contained all or part of the IGS sequence. In nrDNA sequence assembly, heterogeneous forms occur in the genome, making it difficult to identify a representative sequence, and collapse of repeats caused by the tandem array produced Ns, so that in some cases a single unit could not be completed. However, the following steps almost complete the nrDNA sequence.

First, in the assembly of conserved sequences within the 45S region, SNPs could occur because of the presence of heterogeneous forms. This was found mainly in ITS1 and ITS2, and selecting the most frequent nucleotide was advantageous for finding a representative form. Reads of heterogeneous sequences could also be selected to find different types simultaneously.

Second, repeats collapsed because of the arrangement of subrepeat elements in the IGS. The ginseng IGS showed repeats of various sizes ranging from 8 bp to 641 bp. The 641 bp repeat unit is present in 3.5 copies, followed by 337 bp and 149 bp units, which resulted in incorrect assembly in many cases. Because repeats larger and smaller than the read length exist at the same time and there are some sequence differences between the repeat units, the collapse of the repeats can be resolved on the basis of the mapping information of the paired-end reads.

Third, if a contig generated by de novo assembly contained only the 45S rDNA gene region and part of the IGS sequence, two copies of the contig were placed in tandem, exploiting the tandemly arrayed nature of 45S rDNA, and 50-200 placeholder nucleotides (N) were artificially inserted between them to create a new contig. The GapCloser program (SOAPdenovo package) was run repeatedly on the new contig file to remove the placeholder nucleotides, and raw data mapping completed the final unit.
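The construction of the artificial tandem contig in the third step can be sketched as below. This illustrates only the preparation of the input sequence, on the assumption that the inserted nucleotides are written as N. The function names are hypothetical, and the subsequent gap-closing run is not shown.

```python
def make_tandem_scaffold(contig, gap_len=100):
    """Join two copies of a contig with an artificial gap of N.

    The 50-200 range follows the text; writing the gap as N is an assumption.
    """
    if not 50 <= gap_len <= 200:
        raise ValueError("gap_len is expected to be between 50 and 200")
    return contig + "N" * gap_len + contig


def format_fasta(name, sequence, width=60):
    """Format a sequence as a FASTA record with fixed line width."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Toy contig standing in for a 45S rDNA contig with partial IGS.
    toy_contig = "ACGTTGCA" * 5
    scaffold = make_tandem_scaffold(toy_contig, gap_len=50)
    print(format_fasta("nrDNA_tandem_scaffold", scaffold), end="")
```

Because the two copies flank the gap, reads that extend from the end of the unit into the missing IGS portion and back into the start of the unit can close the gap and thereby complete one full repeat unit.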

Example 7. Verification using chloroplast and nuclear ribosomal DNA of the rice cultivar 'Nipponbare'

The method was verified by completing the chloroplast and nrDNA sequences of Nipponbare, the standard rice cultivar for genome sequencing, for which nearly complete genome and chloroplast sequences are available. To check the finished sequences for assembly errors, they were compared with the reference sequences, and additional reconfirmation experiments were performed by PCR and ABI sequencing.

As a result, the 134,591 bp chloroplast sequence and the 7,928 bp rDNA sequence completed by the method of the present invention were exactly identical to the reference sequences, indicating that the method of the present invention is accurate (FIG. 8). The nrDNA repeat unit of rice, including the IGS, consists of 5,877 bp of 45S (18S-5.8S-26S) and 2,051 bp of IGS, which includes the external transcribed spacer (ETS) and the non-transcribed spacer (NTS). It was confirmed that a 254 bp subrepeat unit, known as an enhancer of the 45S transcription unit, is arranged in three copies (FIG. 9B, D). In the finished rice reference genome information, it was confirmed that about 4.5 copies of 100 units of the unit completed in the present invention were present at the upper end portion of chromosome 9 (GenBank No. OSJNBb0013K10; AP008245.2).

Example 8. De novo assembly of the chloroplast genome for various plant species

According to the above-described method of the present invention, the chloroplast genomes of various plant species were newly completed, including moss, rice species, and species with relatively large genomes such as ginseng, American ginseng (Panax quinquefolius), and onion. The ginseng chloroplast genome was 156,248 bp and that of American ginseng was 156,088 bp, with a sequence variation of about 0.1% between the two species. Repetitive sites such as tandem duplication regions were reconfirmed by PCR and ABI sequencing to verify that the correct chloroplast genome had been generated. On the other hand, 127 SNPs and 71 indels were found relative to the GenBank chloroplast genome sequence (GenBank No. AY582139.1,66), which may be caused by differences in material or by sequencing errors. When the chloroplast genome sequences of 12 ginseng varieties were completed by the method of the present invention, the variation among the ginseng varieties was found to be within 10 SNPs and 7 indels, and it was therefore estimated that the previously reported chloroplast genome sequence contains sequencing errors introduced in the PCR walking process.

In addition to WGS data for seven rice varieties, 0.25 to 4 Gbp of WGS data for nine rice species and Leersia was provided by the Oryza Map Alignment Project at the Arizona Genomics Institute, and some application examples in which the chloroplast genome was completed by the method of the present invention are shown in Table 7.

TABLE 7

In addition, a phylogenetic tree based on the decoded chloroplast genome sequences confirmed the relationships among the rice species. In rice, japonica and indica varieties were clearly distinguished, and no variation was observed between varieties belonging to the same subspecies group. In addition, the varieties derived from indica and japonica hybrids, Unification, Dasan, and Miryang No. 23, were shown to have the same chloroplast type as their final maternal parent (FIG. 12).

Example 9. De novo assembly of complete nuclear ribosomal DNA units for rice and ginseng species

The total lengths of the nrDNA units of the ginseng cultivar and American ginseng were 11,091 bp and 11,169 bp, respectively, and the lengths of the 45S transcription units were 5,856 bp and 5,853 bp, respectively. The IGS lengths were 5,235 bp and 5,316 bp, respectively, about 3,200 bp longer than that of rice, and showed almost no homology with the rice IGS (FIG. 9B and 9C). PCR yielded products of the length predicted for the IGS from the completed sequence, indicating that the assembly was accurate (FIGS. 9D and 9E). However, in the ginseng IGS amplification an additional band of about 500 bp was amplified (FIGS. 9E and 9F), and it was speculated that this band was amplified from an rDNA-derived fragment. The subrepeat known as the enhancer of the 45S transcription unit, located in the 5' ETS within the IGS, was present as 3.5 copies of a 641 bp unit in the ginseng cultivar and 3.5 copies of a 640 bp unit in American ginseng (variable: 2 3 5, 95% concordance). Of the 17 rices, the sequence of wild rice (Oryza nivara), unlike that of the other species, could not be completed, presumably because of heterogeneous rDNA in the wild rice genome. When the 45S nrDNA regions of the 17 rices were compared, the variation was particularly severe in the ITS1 and ITS2 regions, but some variation was also observed in the gene region. Ginseng and American ginseng differed markedly in the IGS, and SNPs were observed in some gene and ITS regions. In addition, phylogenetic analysis was performed with the 45S rDNA region sequences of 13 ginsengs and 16 rices (excluding O. nivara) obtained by the present invention (FIGS. 13 and 14).

Example 10. Verification of marker development, including species and variety identification, after completion of the chloroplast sequence

Using the method of the present invention, it was possible to complete the chloroplast and rDNA sequences for more than 100 species of various plants, and in the case of moss the mitochondrial genome could also be completed. Based on the completed sequences, various PCR markers for discriminating between species can be developed efficiently, and these can be used as barcoding markers for species discrimination, classification, and identification of the origin of herbal medicines (FIG. 15). In addition, when the chloroplast sequences of different varieties of the same species were completed, variety-specific markers could be developed. Three specific markers were developed between two ginseng varieties, which could be used for variety identification and variety protection, showing that the method is very useful as a means of marker development (FIG. 16).

In addition, unique markers between species and within species were developed from the nuclear rDNA sequence, confirming that the rDNA sequence can be used effectively to develop markers for species classification and variety identification (FIG. 17).

Example 11. Verification of completion of the mitochondrial genomes of moss and mushroom by the same method as chloroplast sequence completion

It was confirmed that the mitochondrial genome can be completed with the method of the present invention by applying it to moss and mushrooms. Organisms such as animals, fish and insects have no chloroplasts, but their mitochondria have a very stable structure, are small (16 kb circular DNA), and are present in a large number of copies per cell. It was therefore verified that their mitochondrial genomes can be completed even more easily by applying the method used to complete chloroplasts in plants. Moss and mushrooms have larger mitochondria. Nevertheless, in the case of moss, the chloroplast and mitochondrial genomes could be completed at the same time (FIG. 20), and in the case of mushrooms, which have mitochondria but no chloroplasts, the result of completing the mitochondrial genome is presented (FIG. 21). For fish, insects, mushrooms, moss and the like, the mitochondrial genome can be completed and used to extract useful information such as barcoding markers for species classification.

Claims

A method for deciphering, alone or simultaneously, the complete genome sequence of the chloroplast, mitochondrial or nuclear ribosomal DNA of an organism, the method comprising:

(a) deciphering the nucleotide sequence of the whole genome of the organism by next generation sequencing (NGS);

(b) generating an NGS data set based on the amount of genomic coverage of the chloroplast, mitochondrial or nuclear ribosomal DNA using the reads (sequences) generated by the sequencing of step (a);

(c) assembling the reads of the NGS data set generated in step (b) using assembly software;

(d) separating, from the contigs produced by the assembly of step (c), the contigs comprising at least one sequence selected from the group consisting of chloroplast, mitochondrial and nuclear ribosomal DNA (nrDNA) sequences; and

(e) linking the separated contigs of step (d) using a sequence comparison program and correcting errors that occurred during assembly.
The method of claim 1, wherein the step (e) of deciphering the complete sequence of the chloroplast or mitochondria is performed by aligning and linking the contigs comprising the chloroplast sequences among the separated contigs of step (d) into a complete circular sequence, and then mapping the generated raw data sequences and eliminating assembly errors.
The method of claim 1, wherein the step (e) of deciphering the complete sequence of the nuclear ribosomal DNA comprises artificially arranging in tandem two contigs comprising 45S rDNA sequences among the separated contigs of step (d), creating an artificial gap between them, filling the physical gaps of the gene and intergenic spacer (IGS) regions using a gap closer program to complete the 45S rDNA unit, mapping the raw data sequences to the completed 45S rDNA unit, and eliminating assembly errors.
The method of claim 1, wherein the organism is a microorganism, lower photosynthetic organism, mushroom, higher plant with a large genome, insect, fish or animal.
The method of claim 1, wherein the NGS data set is an amount capable of covering 50-500 times the chloroplast genome.
The method of claim 1 wherein the assembly software is CLC de novo assembly software or SOAP de novo assembly software.
The method of claim 1, wherein the assembly error is a sequence error, a false gap, a tandem repeat error, a homopolymer error, or a single nucleotide polymorphism (SNP) error.
A recording medium having recorded thereon a computer readable program for performing the method of any one of claims 1 to 7.