WO2014085891A1

WO2014085891A1 - Method and use for verification of mounting errors in genomes

Info

Publication number: WO2014085891A1
Application number: PCT/BR2013/000543
Authority: WO
Inventors: Roberto Hirochi HERAI; Michel Eduardo Beleza YAMAGISHI
Original assignee: Universidade Estadual De Campinas - Unicamp; Empresa Brasileira De Pesquisa Agropecuária - Embrapa
Priority date: 2012-12-05
Filing date: 2013-12-03
Publication date: 2014-06-12
Also published as: BR102012031096B1; BR102012031096A2

Abstract

The present invention relates to a method for verification of mounting errors in genomes of organisms, which genomes have been sequenced or produced synthetically, which utilizes frequency relationships between nucleotide sequence fragments of a genome. The method is of use in the verification of errors in a mounted genome based on fragments produced using different types of sequencing technology and assists in the construction of synthetic genomes, for example. The present invention may be used to compress genetic material since the frequency relationships determined by the method allow reduction of the nucleotide sequence complexity in such a manner as to represent the content thereof in compressed form, thereby reducing the space required for storage thereof.

Description

METHOD AND USE FOR VERIFICATION OF ASSEMBLY ERRORS IN GENOMES

Field of the invention

The present invention relates to a method for verifying assembly errors in the genomes of organisms, whether sequenced or synthetically produced, that uses frequency ratios between nucleotide sequence fragments of a genome.

The method can be applied to error checking in a genome assembled from fragments produced by different sequencing technologies (Illumina, 454, SOLiD, PacBio, among others) and can assist in the construction of synthetic genomes, for example in genetically hybrid or transgenic organisms. In addition, the proposed method can be used to compress genetic material: the frequency ratios determined by the method reduce the complexity of nucleotide sequences so that their content can be represented in compressed form, reducing the space required for storage.

Background of the Invention

In recent years, the falling cost and refinement of large-scale sequencing technologies have made it possible, in principle, to determine the DNA and RNA sequences of any living organism. However, all existing sequencing technologies, from the earliest to the latest, fragment the original molecule. The fragments are then analyzed by computational tools that use fragment-overlap methods to attempt to reconstruct the original molecule. The high frequency of repetitive sequences, errors from sequencing equipment, and failures and inaccuracies in the laboratory processes used to purify samples for sequencing can all introduce errors that are reflected in the final assembly of the sequenced fragments. Because of these problems, some computational tools filter out sequenced fragments that are of low quality or that come from artifacts introduced before or during the sequencing process. Despite their usefulness, such tools are limited to removing the fragments, or parts of them, that may generate inconsistencies in genome assembly.

After the final stage of genome assembly, the only methods available to verify its quality more accurately are those based on similarity, which require a reference genome from an organism of the same species as, or genetically close to, the one that was sequenced (Meader et al., 2010). The problem is that many species do not yet have sequenced organisms, and even genomes that have already been sequenced can contain assembly errors, which are then transferred to other organisms that rely on the flawed assembly. Other methods rely on quality information from the sequenced fragments to try to improve the quality of an assembly (Haiminen et al., 2011).

Existing technological limitations on the extraction of complete nucleotide sequences (DNA or RNA molecules) require that they be fragmented into short pieces, which in turn requires computational tools for their reconstruction. These tools are based on methods that take reads and look for overlapping regions at their ends in order to extend them and ultimately reconstruct the original molecule. Despite the proven quality of such assembly methodologies in generating sequences equivalent to entire genomes, there are still no efficient ways to handle several kinds of situations, whether biological or computational. Biological problems such as heterozygosity, ploidy and the quality of a sample that may be degrading can introduce errors in the assembly process, the main problem being the high frequency of repetitive sequences in a genome.

In the technological and computational context there are two major problems. The first is the current inability of sequencing equipment to extract genetic material corresponding to entire nucleotide sequences, which makes it necessary to fragment the molecule at different regions. The second is the inability of computational tools to accurately handle the repetitive sequences present in genomes, which are often much longer than the average size of the reads generated by the sequencing process. In addition, the computational assembly steps require correct and thorough configuration of the software used, because the parameters vary according to the type of organism, the sequencing equipment and the available computing resources. Other factors, such as the preparation and/or contamination of the genetic material and the lack of strict control in the purification of the samples to be sequenced, also influence the final assembly of a genome. Although most computational tools try to address these factors, they take a conservative approach, which reduces the number of assembly errors but also the extent to which the original molecule is reconstructed.

Assembly problems have motivated the development of methodologies that seek to minimize or even correct possible assembly errors in genomes. Experimental strategies are much more costly and complex, so strictly computational methodologies are preferred. Some use genome information from closely related organisms as a reference for identifying assembly problems. Others use information from regions conserved between groups of organisms to locate possible errors in specific regions of the assembled genome. Still others use information on the nucleotide frequencies of the genome, the best known being the method based on "Chargaff's Rules". Here, the frequencies of the nucleotides A (adenine), C (cytosine), G (guanine) and T (thymine) are counted, and the following two frequency relationships hold: A ≈ T and C ≈ G. Because there is still no efficient method that can replace all the others, the best strategy has been to combine different approaches to provide greater assurance of the quality of a genome, whether assembled or synthetically produced.
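As an illustration of the nucleotide-frequency check described above, the following minimal Python sketch counts A, C, G and T on a single strand and reports how far the strand is from A ≈ T and C ≈ G. The function names and the toy sequence are assumptions introduced here for illustration only; they do not come from the patent.

```python
from collections import Counter


def nucleotide_frequencies(sequence):
    """Return the relative frequency of A, C, G and T in one strand."""
    # Ignore ambiguous symbols such as N (an assumption of this sketch).
    counts = Counter(base for base in sequence.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return {base: counts[base] / total for base in "ACGT"}


def chargaff_deviation(sequence):
    """Return (|F(A) - F(T)|, |F(C) - F(G)|) for one strand."""
    freq = nucleotide_frequencies(sequence)
    return abs(freq["A"] - freq["T"]), abs(freq["C"] - freq["G"])


# Toy strand, invented for illustration; real genomes are far longer.
toy = "ATGCGCATTAGCTAAT"
print(nucleotide_frequencies(toy))
print(chargaff_deviation(toy))
```

Small deviations are only to be expected on long sequences; a short toy strand such as the one above is not expected to satisfy the relationships.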

Despite the need for methods to verify the quality of genome assemblies, there are still no methods similar to the one proposed in this patent. It presents a method that, for the first time, uses frequency relationships between nucleotide fragments of a genome to verify the quality of its assembly.

Brief Description of the Invention

The frequency ratios described by the method can be used to check for assembly errors in a genome that has been reconstructed from nucleotide sequence fragments or has been synthetically constructed. Its use makes it possible to obtain sequences that are closer to the original sequence extracted from any organism in nature. In the case of synthetic molecules, such as synthetic genomes, the method can be used to verify whether their construction approximates a natural genome, and thus to suggest rearrangements that make it more biologically efficient. Another advantage of the method is that it allows nucleotide sequences to be compressed, making their transfer faster and reducing the space required for their storage.

Brief Description of the Figures

Figure 1 shows an application of the method of the present invention based on the frequency ratios of the words F(w_k) and F(R(w_k)) to identify assembly errors in genomes, for word sizes ranging from 2 to 8. Thirty-two genomes were considered and, under the method, the unexpected deviation of at least 0.01 in the sum of the word frequencies in the genome of the bacterium Xylella fastidiosa 9a5c demonstrates the existence of assembly errors.

Figure 2 shows an application of the method of the present invention, considering frequency ratios for fragments of size ranging from 1 to 8. It can be clearly seen that, regardless of the value of k, the frequency ratios do not hold for the genome of the bacterium Xylella fastidiosa 9a5c, assembled by Simpson et al. (2000).

Figure 3 presents a comparison of genome assemblies of the species of X. fastidiosa ssp. using NCBI's Gmap software.

Figure 4 shows a multiple alignment of genome assemblies of X. fastidiosa ssp. The X. fastidiosa 9a5c genome blocks (Xf_9a5c_DNA.fas) at the bottom represent regions that have undergone inversions or translocations relative to the other genomes, which are structurally almost identical.

Detailed Description of the Invention

The invention describes a set of oligonucleotide frequency parity rules that are observed in various genomes and can be applied to: check for assembly errors in reconstructed genomes from sequence fragments; evaluate the quality of synthetic genomes in the same way as is done in an assembled genome; compress nucleotide sequences to reduce the physical space they occupy in a computer system. The following sections briefly contextualize and describe details of the method proposed in the present invention.

The method presented is strictly computational, can be applied to verify the existence of assembly errors in a genome, and has been tested on several organisms. Other applications are described in other sections. The method extends the rules proposed by Chargaff.

The method considers the existence of two frequency ratios (Equations 1 and 2) of words that are invariant with respect to each other. These relationships take into account a sequence w of length k and the following operators on w_k: R(w_k), the reverse sequence of w_k; C(w_k), the complementary sequence of w_k, where the complement of A is T and the complement of C is G, and vice versa; and R(C(w_k)), the reverse complement of w_k.
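The three operators can be sketched in a few lines of Python. The function names below are assumptions chosen for illustration; the behaviour follows the definitions above and is checked against the word ATC used in the Examples section.

```python
# Translation table for the complement: A <-> T and C <-> G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse(word):
    """R(w): the word read backwards."""
    return word[::-1]


def complement(word):
    """C(w): each nucleotide replaced by its complement."""
    return word.translate(COMPLEMENT)


def reverse_complement(word):
    """R(C(w)): complement the word, then reverse it."""
    return reverse(complement(word))


print(reverse("ATC"))             # CTA
print(complement("ATC"))          # TAG
print(reverse_complement("ATC"))  # GAT
```

Because R and C act independently on position and on letter, applying them in either order gives the same result, so R(C(w)) and C(R(w)) denote the same word.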

The method for verifying assembly errors in genomes comprises the following steps:

i) fragment the genome into words of size k, where k represents the size of a genome subword, ranging from 3 to 8;

ii) calculate, for each word w_k, the word frequency F(w_k);

iii) calculate, from the word frequencies, the sum of the word frequencies for each type of operator, in order to obtain

iv) apply the frequency ratios of Equations 1 and 2;

v) detect an assembly error if the variation is greater than 0.01.
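The steps above can be sketched in Python as follows. Equations 1 and 2 are not reproduced in this text, so the sketch rests on stated assumptions rather than on the patented formulas: words are taken with an overlapping sliding window, the Generating Set is taken as the alphabetically first word of each equivalence class, and the "variation" is taken as the spread (maximum minus minimum) between the frequency sums obtained for w, R(w), C(w) and R(C(w)). All function names are hypothetical.

```python
from collections import Counter
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Identity plus the three operators defined in the description.
OPERATORS = {
    "w": lambda w: w,
    "R(w)": lambda w: w[::-1],
    "C(w)": lambda w: w.translate(COMPLEMENT),
    "R(C(w))": lambda w: w.translate(COMPLEMENT)[::-1],
}


def word_frequencies(genome, k):
    """Steps i and ii: F(w) for every word of size k (overlapping windows assumed)."""
    genome = genome.upper()
    counts = Counter()
    for i in range(len(genome) - k + 1):
        word = genome[i:i + k]
        if set(word) <= set("ACGT"):  # skip windows containing N or other symbols
            counts[word] += 1
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def generating_set(k):
    """Assumed Generating Set: the alphabetically first word of each class."""
    representatives = set()
    for letters in product("ACGT", repeat=k):
        word = "".join(letters)
        representatives.add(min(op(word) for op in OPERATORS.values()))
    return sorted(representatives)


def operator_sums(freqs, k):
    """Step iii: sum of F(op(w)) over the Generating Set, for each operator."""
    gs = generating_set(k)
    return {name: sum(freqs.get(op(w), 0.0) for w in gs)
            for name, op in OPERATORS.items()}


def check_assembly(genome, k, threshold=0.01):
    """Steps iv and v (assumed form): flag if the sums spread beyond the threshold."""
    sums = operator_sums(word_frequencies(genome, k), k)
    spread = max(sums.values()) - min(sums.values())
    return sums, spread, spread > threshold


# Toy input, invented for illustration; a real check would read a full genome.
sums, spread, flagged = check_assembly("ACGTTGCAACGTAGCTAGGATCC", k=3)
print(sums, spread, flagged)
```

A short toy sequence will usually be flagged, since the relationships are reported for whole genomes. In practice the check would be repeated for each k from 3 to 8.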

Examples

Considering a fictitious case where k = 3 (sequence fragments of size 3 obtained from an assembled or synthetic genome), there are 64 possible words in all. With only 20 of them, applying the operators described, it is possible to represent the entire set of possible words in a MathTable, as shown in Table 1. These 20 words form what is defined as the Generating Set (GS), and each of them, together with the words derived from it using the operators C, R and C(R), forms an Equivalence Class (EC).

For example, given the word w_3 = ATC (equivalence class 6 of Table 1), of size k = 3, applying the operators described gives R(ATC) = CTA, C(ATC) = TAG and R(C(ATC)) = GAT. Analysis of the frequency of these words in the genomes of 32 organisms, including animals, plants, microorganisms and model organisms in general, confirms the frequency relationships described by Equations 1 and 2. Even if the GS is selected at random, the frequency ratios remain valid for the 20 ECs, for k ranging from 3 to 10.
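The count of 20 equivalence classes for k = 3 can be reproduced by grouping all 64 words under the operators R, C and R(C). The Python sketch below is illustrative only; the function names are assumptions, and the class numbering of Table 1 is not reproduced.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def equivalence_class(word):
    """Return the set {w, R(w), C(w), R(C(w))} for a word w."""
    comp = word.translate(COMPLEMENT)
    return frozenset({word, word[::-1], comp, comp[::-1]})


def equivalence_classes(k):
    """Group all 4**k words of size k into equivalence classes."""
    return {equivalence_class("".join(p)) for p in product("ACGT", repeat=k)}


classes = equivalence_classes(3)
print(len(classes))                      # 20
print(sorted(equivalence_class("ATC")))  # ['ATC', 'CTA', 'GAT', 'TAG']
```

For k = 3, words that read the same in both directions (such as AAA or ACA) belong to classes of only two words, which is why 64 words fall into 20 classes rather than 16.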

This way of representing a set of genome words with only GS also makes it possible to use frequency ratios for nucleotide sequence compression.

Table 1 - MathTable with words of length k = 3. There are 20 Equivalence Classes generated from the application of operators C, R and C (R) on the Generating Set.

The described method has already been applied to analyze the genomes of 32 organisms, including animals, plants, microorganisms and model organisms whose genome assemblies are of high quality (Arabidopsis thaliana, Drosophila melanogaster, Oryza sativa, Danio rerio, Escherichia coli, Homo sapiens, among others). For all organisms analyzed, with the exception of two cases, the frequency relationships described by the method were confirmed for words whose size k ranged from 3 to 8 (Figure 1). Table 2 illustrates the MathTable calculated for the genome (hg19 version) of Homo sapiens. The lower part of the table shows the proportional numerical value obtained by calculating the frequency of the words considered by the frequency ratios described by Equations 1 and 2 of this patent. For all other organisms with high-quality genomes, the final frequencies for the 3 operators (C(w), R(w) and C(R(w))), considered over a word w of size k, are also very close, never varying by more than 0.01 in either direction.

Figure 1 represents the application of the method based on the frequency ratios of the words F(w_k) and F(R(w_k)) to identify assembly errors in genomes for word sizes ranging from 2 to 8. Thirty-two genomes were considered and, under the method, the unexpected deviation of at least 0.01 in the sum of the word frequencies in the genome of the bacterium Xylella fastidiosa 9a5c demonstrates the existence of assembly errors. The exceptions were the genome of the RNA virus HIV and the genome of the bacterium Xylella fastidiosa 9a5c. In both, the frequency ratios of the method varied by more than 0.01, deviating strongly from the other organisms (Figure 1). The virus was the only organism considered in the analysis whose genome is formed by RNA rather than DNA. As a result, its MathTable showed great variation from what would be expected for a genome with good assembly quality. It can be clearly observed that the MathTable calculated for the bacterium X. fastidiosa (Table 3) does not follow the frequency ratios proposed in this work.

To make the assembly errors easier to visualize, other sequence comparison methods were used. In one of them, a pairwise alignment of the Xylella genomes was performed (Figure 3), revealing several inversions in the Xylella fastidiosa 9a5c genome with respect to the others. The position of the inversions can be clearly observed in a multiple alignment of the bacterial genomes, as shown in Figure 4, in which several blocks that are out of order relative to the other bacterial subspecies represent the regions with errors.

Table 3 - MathTable with Generating Sets (GS) for the primate organism Homo sapiens, when k = 3.

The sequencing and subsequent assembly of genomes using sequencing technologies have become increasingly common. Such technologies are based on the fragmentation of molecules, which are then reconstructed using computational sequence-overlap tools. However, several factors, ranging from the high frequency of repetitive sequences in genomes to the generation of artifacts (contamination or poor data quality), make the assembly process quite complex. Despite the importance of methods to verify the final quality of a genome, whether assembled or synthetically constructed, there are still no methods based on frequency relationships between fragments of a genome. The present work describes a new method that can be used for the validation step of a genome. In the method, a genome is fragmented into fixed-length sequences. For each fragment, its frequency and the frequencies of the fragments equivalent to its reverse, complement and reverse complement are determined. Using a set of frequency relationships, we demonstrate that it is possible to validate a genome, whether assembled or synthetic, and briefly describe how the presented method can also be used as an alternative for nucleotide sequence compression.

References

Haiminen N, Kuhn DN, Parida L, Rigoutsos I. Evaluation of methods for de novo genome assembly from high-throughput sequencing reads reveals dependencies that affect the quality of the results. PLoS One. 2011; 6 (9): e24182.

Krishnan NM, Pattnaik S, Jain P, Gaur P, Choudhary R, Vaidyanathan S, Deepak S, Hariharan AK, Krishna PB, Nair J, Varghese L, Valivarthi NK, Dhas K, Ramaswamy K, Panda B. A Draft of the Genome and Four Transcriptomes of a Medicinal and Pesticidal Angiosperm Azadirachta indica. BMC Genomics. 2012 Sep 9; 13 (1): 464.

Meader S, Hillier LW, Locke D, Ponting CP, Lunter G. Genome assembly quality: assessment and improvement using the neutral indel model. Genome Res. 2010 May; 20 (5): 675-84.

Simpson AJ, Reinach FC, Arruda P, Abreu FA, Acencio M et al. The genome sequence of the plant pathogen Xylella fastidiosa. The Xylella fastidiosa Consortium of the Organization for Nucleotide Sequencing and Analysis. Nature. 2000 Jul 13; 406 (6792): 151-9.

Yamagishi MEB, Herai RH. Chargaff's "Grammar of Biology": New Fractal-like Rules. arXiv; 2011.

Claims

1. Method for verifying assembly errors in genomes comprising the following steps:

ii) calculate, for each word w_k, the word frequency F(w_k);

iv) apply the frequency ratios of Equations 1 and 2, and

v) detect an assembly error if the variation is greater than 0.01.

Use of the method described in claim 1, which is applicable for verifying genome assembly errors.