WO2014085891A1 - Method and use for verification of mounting errors in genomes

Info

Publication number
WO2014085891A1
Authority
WO
WIPO (PCT)
Prior art keywords
genome
genomes
frequency
assembly
errors
Application number
PCT/BR2013/000543
Other languages
French (fr)
Portuguese (pt)
Inventor
Roberto Hirochi HERAI
Michel Eduardo Beleza YAMAGISHI
Original Assignee
Universidade Estadual De Campinas - Unicamp
Empresa Brasileira De Pesquisa Agropecuária - Embrapa
Application filed by Universidade Estadual De Campinas - Unicamp and Empresa Brasileira De Pesquisa Agropecuária - Embrapa
Publication of WO2014085891A1

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/20: Sequence assembly

Definitions

  • The exceptions were the genome of the HIV RNA virus and that of the bacterium Xylella fastidiosa 9a5c. In both, the frequency relationships of the method varied by more than 0.01, a large deviation compared with the other organisms (Figure 1).
  • Sequencing genomes and then assembling them with sequencing technologies has become increasingly common. These technologies fragment the molecules, which are then reconstructed by computational tools based on sequence overlap. However, several factors, from the high frequency of repetitive sequences in genomes to the generation of artifacts (contamination or poor data quality), make the assembly process quite complex. Despite the importance of methods for verifying the final quality of a genome, whether assembled or synthetically constructed, there are still no methods based on the frequency relationships between fragments of a genome. The present work describes a new method that can be used in the validation step of a genome. In the method, a genome is fragmented into fixed-length sequences.
  • Yamagishi MEB, Herai RH. Chargaff's "Grammar of Biology": New Fractal-like Rules. arXiv; 2011.

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Biotechnology (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Chemical & Material Sciences (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Molecular Biology (AREA)
  • Genetics & Genomics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The present invention relates to a method for verifying assembly errors in the genomes of organisms, whether sequenced or produced synthetically, which uses frequency relationships between nucleotide sequence fragments of a genome. The method is useful for verifying errors in a genome assembled from fragments produced by different types of sequencing technology and assists, for example, in the construction of synthetic genomes. The present invention may also be used to compress genetic material, since the frequency relationships determined by the method reduce the complexity of nucleotide sequences so that their content can be represented in compressed form, thereby reducing the space required for storage.

Description

METHOD AND USE FOR VERIFICATION OF ASSEMBLY ERRORS IN GENOMES
Field of the Invention
The present invention is a method for verifying assembly errors in the genomes of sequenced or synthetically produced organisms. The method uses frequency relationships between nucleotide sequence fragments of a genome.
The method can be applied to check for errors in a genome assembled from fragments produced by different sequencing technologies (Illumina, 454, SOLiD, PacBio, among others), and it can assist in the construction of synthetic genomes, for example in genetically hybrid or transgenic organisms. In addition, the proposed method can be used to compress genetic material: the frequency relationships determined by the method reduce the complexity of nucleotide sequences so that their content can be represented in compressed form, reducing the space required for storage.
Background of the Invention
In recent years, the falling cost and continued refinement of large-scale sequencing technologies for genetic material have made it possible to determine the DNA and RNA of all living organisms. However, all existing sequencing technologies, from the oldest to the most recent, rely on fragmenting the original molecule. The fragments are then analyzed by computational tools that use fragment-overlap methods to try to reconstruct the original molecule. The high frequency of repetitive sequences, errors originating in the sequencing equipment, and failures or imprecision in the laboratory processes used to purify the samples can all introduce errors that carry over into the final assembly of the sequenced fragments. Because of these problems, some computational tools filter out sequenced fragments that are of low quality or that result from artifacts introduced before or during sequencing. Useful as they are, such tools only remove the fragments, or parts of them, that could cause inconsistencies in the genome assembly. After the final assembly step, the only existing methods that allow the quality of a genome to be verified more rigorously are similarity-based: they require a reference genome from an organism of the same species, or from one genetically close to the sequenced organism (Meader et al., 2010). The problem is that many species still have no sequenced representative, and even genomes that have been sequenced may contain assembly errors, which then propagate to other organisms whose assemblies rely on the flawed reference. Other methods use quality information from the sequenced fragments to try to improve the quality of an assembly (Haiminen et al., 2011).
Existing technological limitations on extracting complete nucleotide sequences (DNA or RNA molecules) require that they be fragmented into short pieces (reads), which makes computational tools necessary for their reconstruction. These tools take the reads and search for overlapping regions at their ends in order to extend them and, ultimately, reconstruct the original molecule. Despite the proven ability of such assembly methodologies to generate sequences equivalent to entire genomes, there are still no efficient ways to handle several kinds of situations, whether biological or computational. Issues such as heterozygosity, ploidy level, and the biological quality of a sample that may be degrading can introduce errors into the assembly process, the main one being the high frequency of repetitive sequences in a genome. In the technological and computational context there are two major problems. The first is the current inability of sequencing equipment to extract genetic material corresponding to entire nucleotide sequences, which makes it necessary to fragment the molecule in different regions. The second is the inability of computational tools to handle accurately the high frequency of repetitive sequences present in genomes, which are often much larger than the average size of the reads generated by sequencing. In addition, the computational assembly steps require correct and careful configuration of the software used, because the parameters vary with the type of organism, the sequencing equipment, and the available computing resources.
Other factors, such as the preparation and/or contamination of the genetic material and the lack of strict control over the purification of the samples to be sequenced, also influence the final assembly of a genome. Although most computational tools try to address these factors, they take a conservative approach, which reduces the number of assembly errors but also the degree to which the original molecule is reconstructed.
Assembly problems have motivated the development of methodologies that seek to minimize or even correct possible assembly errors in genomes. Experimental strategies are far more costly and complex, so strictly computational methodologies are preferred. Some use genome information from closely related organisms as a reference for identifying assembly problems. Others use information from regions conserved across groups of organisms to locate possible errors in specific regions of the assembled genome. Still others use nucleotide frequency information from the genome, the best known being the method based on "Chargaff's rules". In that method, the frequencies of the nucleotides A (adenine), C (cytosine), G (guanine), and T (thymine) are counted, and the following two frequency relationships are checked: A ≈ T and C ≈ G. Because no single efficient method can yet replace all the others, the best strategy has been to combine different approaches to give greater assurance of the quality of a genome, whether assembled or synthetically produced.
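The Chargaff-style check described above (count A, C, G, and T, then test A ≈ T and C ≈ G) can be illustrated with a minimal Python sketch. The function names and the tolerance of 0.01 are assumptions made for illustration; the text does not state a tolerance for this particular check.

```python
from collections import Counter


def base_frequencies(sequence):
    """Return the relative frequency of A, C, G and T in a DNA string.

    Characters other than A, C, G, T (for example N) are ignored.
    """
    counts = Counter(base for base in sequence.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return {base: counts[base] / total for base in "ACGT"}


def chargaff_holds(sequence, tolerance=0.01):
    """Check A ~= T and C ~= G within a tolerance.

    The tolerance value is an illustrative assumption, not taken from the text.
    """
    freq = base_frequencies(sequence)
    return (abs(freq["A"] - freq["T"]) <= tolerance
            and abs(freq["C"] - freq["G"]) <= tolerance)
```

On a real genome the input would be the assembled sequence read from a FASTA file; here a plain string is enough to show the idea.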
Despite the need for methods to verify the quality of genome assemblies, there are still no methods similar to the one proposed in this patent. It presents a method that, for the first time, uses information on the frequency relationships between nucleotide fragments of a genome to verify the quality of a genome assembly.
Brief Description of the Invention
The present invention is a method for verifying assembly errors in the genomes of sequenced or synthetically produced organisms. The method uses frequency relationships between nucleotide sequence fragments of a genome.
The frequency relationships described by the method can be used to check for assembly errors in a genome that has been reconstructed from nucleotide sequence fragments or constructed synthetically. Using the method makes it possible to obtain sequences that are closer to the original sequence as extracted from any naturally occurring organism. For synthetic molecules, such as synthetic genomes, the method can be used to check whether the construction brings the molecule closer to a natural genome, and thus to suggest reorganizing it to be more biologically efficient. Another advantage is that the method allows nucleotide sequences to be compressed, making their transfer faster and reducing the space required for their storage.
Brief Description of the Figures
Figure 1 shows an application of the method of the present invention, based on frequency relationships of the words F(w_k) and F(R(w_k)), to identify assembly errors in genomes, for word sizes ranging from 2 to 8. Thirty-two genomes were considered and, according to the method, the unexpected deviation of at least 0.01 in the sum of the word frequencies in the genome of the bacterium Xylella fastidiosa 9a5c demonstrates the existence of assembly errors.
Figure 2 shows an application of the method of the present invention, considering the frequency relationships for fragment sizes ranging from 1 to 8. Regardless of the value of k, the frequency relationships clearly fail to hold for the genome of the bacterium Xylella fastidiosa 9a5c, assembled by Simpson et al. (2000).
Figure 3 shows a comparison of the genome assemblies of X. fastidiosa ssp. using NCBI's Gmap software.
Figure 4 shows a multiple alignment of the genome assemblies of X. fastidiosa ssp. The blocks of the X. fastidiosa 9a5c genome (Xf_9a5c_DNA.fas) at the bottom represent regions that have undergone inversions or translocations relative to the other genomes, which are structurally almost identical.
Detailed Description of the Invention
The present invention is a method for verifying assembly errors in the genomes of sequenced or synthetically produced organisms. The method uses frequency relationships between nucleotide sequence fragments of a genome.
The invention describes a set of oligonucleotide frequency parity rules that are observed in various genomes and can be applied to: check for assembly errors in genomes reconstructed from sequence fragments; evaluate the quality of synthetic genomes in the same way as for an assembled genome; and compress nucleotide sequences to reduce the physical space they occupy in a computer system. The following sections give context and briefly describe the details of the method proposed in the present invention.
A new, strictly computational method is presented that can be applied to check for assembly errors in a genome; it has been tested on several organisms. Other applications are mentioned in other sections. The method is an extension of the rules proposed by Chargaff.
The method considers the existence of two frequency relationships (Equations 1 and 2) between words that are invariant with respect to each other. These relationships involve a sequence w of length k, written w_k, and the following operators on w_k: R(w_k), the reverse sequence of w_k; C(w_k), the complementary sequence of w_k, in which the complement of A is T and the complement of C is G; and R(C(w_k)), the reverse complement of w_k.
[Equations 1 and 2: image imgf000010_0001 in the original publication, not reproduced here]
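The three operators on a word w_k (reverse, complement, and reverse complement) can be sketched in Python as follows. The function names are chosen here for illustration and do not come from the patent; the complement pairs A with T and C with G, as stated in the text.

```python
# Translation table for the complement operator: A <-> T, C <-> G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse(word):
    """R(w): the word read backwards."""
    return word[::-1]


def complement(word):
    """C(w): each base replaced by its complementary base."""
    return word.translate(COMPLEMENT)


def reverse_complement(word):
    """R(C(w)): complement the word, then reverse it."""
    return reverse(complement(word))
```

Because reversing and complementing act independently on position and on base identity, applying them in either order gives the same word.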
The method for verifying assembly errors in genomes comprises the following steps:
i) fragment the genome into words of length k, where k is the length of a subword of the genome and may range from 3 to 8;
ii) calculate, for each word w_k, the word frequency F(w_k);
Figure imgf000010_0002
iii) calculate, from the word frequencies, the sum of the word frequencies for each type of operator, so as to obtain
Figure imgf000010_0003
iv) apply the frequency relations of Equations 1 and 2;
v) flag an assembly error if the variation is greater than 0.01.
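The steps above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation. Equations 1 and 2 appear only as an image in this text, so the sketch assumes that the quantities compared are the sums, over one representative word per equivalence class, of F(w), F(R(w)), F(C(w)) and F(R(C(w))), and that the "variation" is the spread between the largest and smallest of these four sums. The function names, the choice of the lexicographically smallest word as representative, and the skipping of windows that contain characters other than A, C, G and T are also assumptions.

```python
from collections import Counter
from itertools import product

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def kmer_frequencies(genome, k):
    """Steps i-ii: slide a window of length k and return relative frequencies."""
    genome = genome.upper()
    valid = set("ACGT")
    counts = Counter()
    for i in range(len(genome) - k + 1):
        word = genome[i:i + k]
        if valid.issuperset(word):  # assumption: skip windows with N, gaps, etc.
            counts[word] += 1
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def generating_set(k):
    """One representative (first in lexicographic order) per equivalence class."""
    seen, reps = set(), []
    for letters in product("ACGT", repeat=k):
        w = "".join(letters)
        if w in seen:
            continue
        c = w.translate(_COMPLEMENT)
        seen.update({w, w[::-1], c, c[::-1]})
        reps.append(w)
    return reps


def operator_sums(genome, k):
    """Step iii: sum the frequencies of w, R(w), C(w) and R(C(w)) over the set."""
    freqs = kmer_frequencies(genome, k)
    sums = {"w": 0.0, "R": 0.0, "C": 0.0, "RC": 0.0}
    for w in generating_set(k):
        c = w.translate(_COMPLEMENT)
        sums["w"] += freqs.get(w, 0.0)
        sums["R"] += freqs.get(w[::-1], 0.0)
        sums["C"] += freqs.get(c, 0.0)
        sums["RC"] += freqs.get(c[::-1], 0.0)
    return sums


def flag_assembly_error(genome, k, threshold=0.01):
    """Steps iv-v under the stated assumption: flag if the spread exceeds 0.01."""
    sums = operator_sums(genome, k)
    variation = max(sums.values()) - min(sums.values())
    return variation > threshold, variation


if __name__ == "__main__":
    toy = "ACGTTGCAAGGCTTAACCGGATCGATTACAGGA"  # made-up sequence, not real data
    for k in range(3, 9):
        print(k, flag_assembly_error(toy, k))
```

A sequence this short is only useful for checking that the code runs; the 0.01 threshold in the text refers to whole genomes.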
Examples
Consider a hypothetical case with k = 3 (sequence fragments of length 3 obtained from an assembled or synthetic genome). There are 64 words in all, and with only 20 of them, by applying the operators described above, the entire set of possible words can be represented in a MathTable, as shown in Table 1. These 20 words make up what is called the Generating Set (GS), and each of them, together with the words derived from it by the operators C, R and C(R), forms an Equivalence Class (EC).
For example, given the word w_3 = ATC (equivalence class 6 in Table 1), of length k = 3, applying the operators gives R(ATC) = CTA, C(ATC) = TAG and R(C(ATC)) = GAT. Analysis of the frequency of these words in the genomes of 32 organisms, including animals, plants, microorganisms and model organisms in general, confirms the frequencies described by Equations 1 and 2. Even if the GS is selected at random, the frequency relations remain valid for the 20 ECs, for k ranging from 3 to 10.
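The count of 20 equivalence classes for k = 3 can be checked by enumeration. The sketch below is illustrative only: the function names are hypothetical, and the classes are produced as unordered sets, so they carry no numbering and do not reproduce the class numbers of Table 1.

```python
from itertools import product

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def equivalence_class(word):
    """The set {w, R(w), C(w), R(C(w))} for a given word w."""
    c = word.translate(_COMPLEMENT)
    return frozenset({word, word[::-1], c, c[::-1]})


def equivalence_classes(k):
    """All distinct equivalence classes over the 4**k words of length k."""
    words = ("".join(p) for p in product("ACGT", repeat=k))
    return {equivalence_class(w) for w in words}


if __name__ == "__main__":
    classes = equivalence_classes(3)
    print(len(classes))                       # 20
    print(sorted(equivalence_class("ATC")))   # ['ATC', 'CTA', 'GAT', 'TAG']
```

For k = 3, the 16 words that read the same in both directions (such as AAA or ACA) satisfy R(w) = w and fall into 8 classes of two words each; the remaining 48 words form 12 classes of four words, giving 20 classes in total.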
Representing a set of genome words with only the GS in this way also makes it feasible to use the frequency relations for compressing nucleotide sequences.
Table 1 - MathTable with words of length k = 3. There are 20 Equivalence Classes, generated by applying the operators C, R and C(R) to the Generating Set.
Figure imgf000012_0001
The method has been applied to the genomes of 32 organisms, including animals, plants, microorganisms and model organisms with high-quality genome assemblies (Arabidopsis thaliana, Drosophila melanogaster, Oryza sativa, Danio rerio, Escherichia coli and Homo sapiens, among others). For all organisms analyzed, with two exceptions, the frequency relations described by the method were confirmed for word lengths k from 3 to 8 (Figure 1). Table 2 shows, for illustration, the MathTable calculated for the genome (version hg19) of Homo sapiens. The lower part of the table gives the proportional numerical value obtained by calculating the frequency of the words considered by the frequency relations of Equations 1 and 2 of this patent. For all other organisms whose genome assemblies are of high quality, the final frequencies for the three operators (C(w), R(w) and C(R(w))), taken over a word w of length k, are also very close, never differing by more than 0.01 in either direction.
Figure imgf000013_0001
Figure 1 shows the application of the method, based on the frequency relations of the words F(w_k) and F(R(w_k)), to identify assembly errors in genomes, for word lengths from 2 to 8. Thirty-two genomes were considered and, according to the method, the unexpected deviation of at least 0.01 in the sum of word frequencies in the genome of the bacterium Xylella fastidiosa (9a5c) demonstrates the existence of assembly errors. The exceptions were the genome of the RNA virus HIV and that of the bacterium Xylella fastidiosa 9a5c. In both, the frequency relations of the method varied by more than 0.01 and deviated strongly from those of the other organisms (Figure 1). The virus was the only organism in our analysis whose genome consists of RNA; all the others have DNA genomes. Consequently, its MathTable varied greatly from what would be expected for a genome with good assembly quality. It can be clearly seen that the MathTable calculated for the bacterium X. fastidiosa (Table 3) does not follow the frequency relations proposed in this work.
To make the assembly errors easier to visualize, other sequence comparison methods were used. In one of them, a pairwise alignment of the Xylella genomes (Figure 3) revealed several inversions in the Xylella fastidiosa 9a5c genome relative to the others. The positions of the inversions can be seen clearly in a multiple alignment of the bacterial genomes (Figure 4), in which several blocks that are out of order relative to the other subspecies of the bacterium mark the regions with errors.
Table 3 - MathTable with the Generating Sets (GS) for the primate organism Homo sapiens, when k = 3.
Figure imgf000015_0001
Genome sequencing and subsequent assembly using sequencing technologies have become increasingly common. These technologies fragment the molecules, which are then reconstructed with computational tools based on sequence overlap. However, several factors, from the high frequency of repetitive sequences in genomes to the generation of artifacts (contamination or poor data quality), make the assembly process quite complex. Despite the importance of methods for verifying the final quality of a genome, whether assembled or synthetically constructed, there are as yet no methods based on frequency relations among the fragments of a genome. The present work describes a new method that can be used in the validation step of a genome. In the method, a genome is fragmented into sequences of fixed length. For each fragment, its frequency is determined, together with the frequencies of the fragments corresponding to its reverse, complement and reverse complement. Using a set of frequency relations, we demonstrate that it is possible to validate a genome, whether assembled or synthetic, and we briefly describe how the method can also be used as an alternative for compressing nucleotide sequences.
References
Haiminen N, Kuhn DN, Parida L, Rigoutsos I. Evaluation of methods for de novo genome assembly from high-throughput sequencing reads reveals dependencies that affect the quality of the results. PLoS One. 2011;6(9):e24182.
Krishnan NM, Pattnaik S, Jain P, Gaur P, Choudhary R, Vaidyanathan S, Deepak S, Hariharan AK, Krishna PB, Nair J, Varghese L, Valivarthi NK, Dhas K, Ramaswamy K, Panda B. A Draft of the Genome and Four Transcriptomes of a Medicinal and Pesticidal Angiosperm Azadirachta indica. BMC Genomics. 2012 Sep 9;13(1):464.
Meader S, Hillier LW, Locke D, Ponting CP, Lunter G. Genome assembly quality: assessment and improvement using the neutral indel model. Genome Res. 2010 May;20(5):675-84.
Simpson AJ, Reinach FC, Arruda P, Abreu FA, Acencio M et al. The genome sequence of the plant pathogen Xylella fastidiosa. The Xylella fastidiosa Consortium of the Organization for Nucleotide Sequencing and Analysis. Nature. 2000 Jul 13;406(6792):151-9.
Yamagishi MEB, Herai RH. Chargaff's "Grammar of Biology": New Fractal-like Rules. arXiv; 2011.

Claims

1. Method for verifying assembly errors in genomes, characterized in that it comprises the following steps:
i) fragment the genome into words of length k, where k is the length of a subword of the genome and may range from 3 to 8;
ii) calculate, for each word w_k, the word frequency F(w_k);
Figure imgf000018_0001
iii) calculate, from the word frequencies, the sum of the word frequencies for each type of operator, so as to obtain
Figure imgf000018_0002
iv) apply the frequency relations of Equations 1 and 2; and
v) flag an assembly error if the variation is greater than 0.01.
2. Use of the method described in claim 1, characterized in that it is applicable to the verification of genome assembly errors.
PCT/BR2013/000543 2012-12-05 2013-12-03 Method and use for verification of mounting errors in genomes WO2014085891A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
BRBR1020120310961 2012-12-05
BR102012031096A BR102012031096B1 (en) 2012-12-05 2012-12-05 method and use for verifying assembly errors in genomes

Publications (1)

Publication Number Publication Date
WO2014085891A1 (en) 2014-06-12

Family

ID=50882688

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/BR2013/000543 WO2014085891A1 (en) 2012-12-05 2013-12-03 Method and use for verification of mounting errors in genomes

Country Status (2)

Country Link
BR (1) BR102012031096B1 (en)
WO (1) WO2014085891A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001063543A2 (en) * 2000-02-22 2001-08-30 Pe Corporation (Ny) Method and system for the assembly of a whole genome using a shot-gun data set
WO2008098014A2 (en) * 2007-02-05 2008-08-14 Applied Biosystems, Llc System and methods for indel identification using short read sequencing

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHEN T ET AL.: "Trie-Based Data Structures for Sequence Assembly", THE EIGHTH SYMPOSIUM ON COMBINATORIAL PATTERN MATCHING, 1997, 11 June 1997 (1997-06-11), pages 1 - 17 *
ISTVANICK W ET AL.: "Dynamic methods for fragment assembly in large scale genome sequencing projects", SYSTEM SCIENCES, 1993, PROCEEDING OF THE, TWENTY-SIXTH HAWAII INTERNATIONAL CONFERENCE ON WAITEA, HI, USA, 5 January 1993 (1993-01-05) - 8 January 1993 (1993-01-08), LOS ALAMITOS, CA , USA , IEEE , US,A, pages 534 - 543 *

Also Published As

Publication number Publication date
BR102012031096B1 (en) 2019-10-22
BR102012031096A2 (en) 2014-09-16

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13860331

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 13860331

Country of ref document: EP

Kind code of ref document: A1