EP2626802B1 - Assembly of metagenomic sequences - Google Patents

Assembly of metagenomic sequences

Info

Publication number
EP2626802B1
Authority
EP
European Patent Office
Prior art keywords
metagenomic
sequences
sequence
metagenomic sequences
vectors
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
EP12169566.2A
Other languages
German (de)
French (fr)
Other versions
EP2626802A3 (en)
EP2626802A2 (en)
Inventor
Sharmila Shekhar Mande
Tarini Shankar Ghosh
Varun Mehra
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tata Consultancy Services Ltd
Original Assignee
Tata Consultancy Services Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tata Consultancy Services Ltd
Publication of EP2626802A2
Publication of EP2626802A3
Application granted
Publication of EP2626802B1
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/20 Sequence assembly
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/30 Unsupervised data analysis

Definitions

  • the present subject matter relates, in general, to the field of metagenomics and, in particular, to assembly of metagenomic sequences.
  • Metagenomics provides information pertaining to taxonomic diversity and physiology of various organisms present in the environmental sample.
  • a facility involved in genomic study, such as a research laboratory or a clinic, typically uses high capacity platforms, such as next generation sequencing (NGS) platforms, capable of generating huge volumes of metagenomic data every year.
  • the metagenomic data thus generated may be further analyzed, for example, to determine various organisms present in the metagenomic data and to identify the functional roles of the various genes they encompass.
  • the metagenomic data may be stored for further analysis and future studies.
  • each year metagenomic data is generated in huge volumes, in the range of hundreds of terabytes (TB), and stored in repositories for future studies.
  • nucleotide sequences such as DNA or RNA sequences constituting the metagenomic data are generally assembled into larger sequences called contigs.
  • the process of assembly typically involves performing a pairwise comparison of the nucleotide sequences, numbering in millions, thus requiring huge computational resources and infrastructure.
  • an attempt to assemble nucleotide sequences, originating from genomes of a large number of organisms belonging to diverse taxonomic groups, may result in formation of erroneous chimeric sequences, which may affect the results of analyses of the metagenomic data.
  • a variety of assembly techniques have been used for assembly of the metagenomic sequences derived from various organisms present in a given environmental sample into their corresponding contigs.
  • Conventional assembly techniques involve comparing the metagenomic sequences with predetermined oligonucleotide frequency based models and tagging the metagenomic sequences to the models showing highest similarity. The metagenomic sequences tagged to similar models may then be assembled into contigs.
  • metagenomic sequences belonging to unknown genomes may not show significant similarity to any of the models and may not be assembled into contigs, thus resulting in ambiguity and less efficient analysis.
  • Another conventional technique involves assembling the metagenomic sequences based on taxonomic origin of each of the metagenomic sequences.
  • the metagenomic sequences having similar taxonomic origin may be assembled together to form contigs.
  • the metagenomic sequences may not be efficiently assembled using the above approach, for example, when the metagenomic sequences belong to an organism that may not have been taxonomically classified. Metagenomic sequences belonging to such unknown organisms may thus not be assembled into the contigs, leading to ambiguous results and analysis of the metagenomic data.
  • Another conventional technique involves assembling the metagenomic sequences based on oligonucleotide usage patterns of the metagenomic sequences.
  • the metagenomic sequences having similar oligonucleotide usage patterns may be initially grouped into clusters, using clustering techniques, such as K-means. Subsequently, metagenomic sequences belonging to a single cluster may be assembled into contigs.
  • each of the metagenomic sequences is transformed into an n-dimensional vector, such that each of the n dimensions corresponds to the frequency of a specific oligonucleotide, of a given length, in the metagenomic sequences.
  • the metagenomic sequences may be grouped into clusters based on a relative difference obtained between their corresponding n-dimensional vectors.
  • clustering the metagenomic sequences based on the frequencies of oligonucleotides of longer length may result in erroneous clustering, for example, in case of metagenomic sequences having lengths of less than 1000 bps.
  • assembling the metagenomic sequences belonging to such ambiguous clusters may result in incorrect contigs.
  • assembling the metagenomic sequences based on the frequencies may require increased time and computational resources, due to time required for computing the frequencies as well as distances between the n-dimensional vectors.
  • the paper discloses a one-dimensional signature, OFDEG, derived from the oligonucleotide frequency profile of a DNA sequence, and shows that it is possible to obtain a meaningful phylogenetic signal for relatively short DNA sequences.
  • the method for assembly of metagenomic sequences includes representing each of a plurality of metagenomic sequences in three dimensional space to obtain a plurality of sequence vectors, wherein the representing comprises: determining frequencies of possible tetra-nucleotides for each of the plurality of metagenomic sequences; obtaining an intermediate vector corresponding to each of the plurality of metagenomic sequences based on the determined frequencies; and transforming, for each of the plurality of metagenomic sequences, the intermediate vector into a sequence vector, based on a set of reference points, to obtain the plurality of sequence vectors;
  • the method further includes defining, based on the plurality of sequence vectors, a cuboid having a plurality of grids in the three dimensional space, wherein the cuboid encompasses the plurality of metagenomic sequences, wherein the defining further comprises: ascertaining the three dimensional coordinates for each of the plurality of metagenomic sequences based on the plurality of sequence vectors; determining, for each axis of the three dimensional space, a farthest coordinate and a closest coordinate from among the three dimensional coordinates; and calculating the length of the cuboid along each axis based on the difference between the farthest coordinate and the closest coordinate in the corresponding axis;
  • the method further includes progressively traversing the plurality of grids to assemble the plurality of metagenomic sequences into one or more contigs, wherein a contig includes metagenomic sequences originating from the same genome, wherein the traversing further comprises: obtaining, for each of the plurality of grids, one or more metagenomic sequences from among the plurality of metagenomic sequences, wherein the one or more metagenomic sequences are located within coordinates defined by the grid and immediate neighbors of the grid in the cuboid; and assembling, for each of the plurality of grids, the corresponding one or more metagenomic sequences into the one or more contigs.
  • the system(s) for assembly of metagenomic sequences includes a processor (104) and a memory (106) coupled to the processor (104), the memory (106) comprising modules configured to perform the aforementioned method.
  • metagenome: genetic material extracted directly from either a biological or an environmental sample.
  • the genetic material is sequenced to generate a plurality of nucleotide sequences, such as DNA or RNA sequences.
  • the nucleotide sequences, also known as metagenomic sequences, may be subsequently assembled into genomic fragments, called contigs, corresponding to genomes of organisms residing in the environmental sample.
  • the contigs may be further analyzed, for example, to estimate taxonomic diversity and the functional profiles of the organisms present in the environmental sample.
  • the present subject matter describes methods and systems for assembly of metagenomic sequences into contigs using an optimized method of data partitioning.
  • although the description refers to metagenomic data having metagenomic sequences corresponding to fragments of different genomes, it will be understood that the methods and systems for assembly can be implemented for genomic data having genomic fragments from the same genome as well, albeit with a few variations, as will be understood by a person skilled in the art.
  • metagenomic data having a plurality of metagenomic sequences is received for assembly into a plurality of contigs.
  • each of the contigs constitutes metagenomic sequences corresponding to a distinct genome, with each genome being associated with a distinct organism residing in the environmental sample. Further, the contigs thus generated may be processed using a subsequent iteration of the above described process in order to obtain longer contigs or a complete genome corresponding to an organism residing in the environmental sample.
  • Each of the metagenomic sequences obtained from the metagenomic data is initially transformed into a 256 dimensional vector, hereinafter referred to as an intermediate vector, based on the frequencies of all possible tetra-nucleotides in that metagenomic sequence.
  • a plurality of intermediate vectors thus obtained are transformed into a plurality of sequence vectors in three dimensional space, such that each metagenomic sequence is represented as a sequence vector in the three dimensional space.
  • the metagenomic sequences are represented as the sequence vectors using, for example, a set of reference points obtained based on a plurality of reference genomes.
  • a cuboid may be defined in the three dimensional space such that the cuboid encloses the sequence vectors corresponding to all the metagenomic sequences. Further, the cuboid may be divided into a plurality of equally sized smaller cuboids, hereinafter referred to as grids, such that each grid includes the sequence vectors and, in turn, the metagenomic sequences located within the coordinates defined by the particular grid in the cuboid.
  • each of the grids may be analyzed, using a method of progressive traversal, to identify and group all the metagenomic sequences which may belong to a particular genome.
  • the grids are traversed such that, in each step of traversal, metagenomic sequences present in a grid and its neighboring grids, collectively referred to as a cluster of grids, are obtained.
  • the metagenomic sequences thus obtained may be further assembled into contigs such that the metagenomic sequences having similar taxonomic origin are combined to form a single contig.
  • metagenomic sequences that have not been assembled during traversal of a particular grid may be considered for assembly during traversal of a subsequent grid.
  • metagenomic sequences unassembled during traversal of a grid '000' may be considered for assembly along with metagenomic sequences obtained during traversal of the subsequent grid, i.e., a grid '100'.
  • indexes of unassembled sequences and assembled sequences along with the contigs may be prepared and stored for further reference and/or analyses.
  • the present subject matter thus provides an efficient and easy method for assembly of metagenomic sequences into contigs using an optimized method of data partitioning. Partitioning the metagenomic sequences into the sequence vectors and the plurality of grids effectively reduces computational time required for analyzing and assembling the metagenomic sequences. Further, using the method of progressive traversal and assembling the metagenomic sequences of one cluster of grids at a time helps in optimizing resources required for an efficient assembly of the metagenomic sequences.
  • Fig. 1(a) illustrates a metagenomic sequences assembly system 100, according to an implementation of the present subject matter.
  • the metagenomic sequences assembly system 100 can be implemented in systems that include, but are not limited to, desktop computers, multiprocessor systems, laptops, network computers, cloud servers, minicomputers, mainframe computers, and the like.
  • the metagenomic sequences assembly system 100, hereinafter referred to as the system 100, includes interface(s) 102, one or more processor(s) 104, and a memory 106 coupled to the processor(s) 104.
  • the interfaces 102 may include a variety of software and hardware interfaces, for example, interfaces for peripheral device(s), such as a keyboard, a mouse, an external memory, and a printer. Further, the interfaces 102 may enable the system 100 to communicate with other devices, such as web servers and external databases.
  • the interfaces 102 can facilitate multiple communications within a wide variety of networks and protocol types, including wired networks, for example, local area network (LAN), cable, etc., and wireless networks, such as Wireless LAN (WLAN), cellular, or satellite.
  • the interfaces 102 may include one or more ports for connecting a number of computing systems with one another or to another server computer.
  • the processor(s) 104 can be a single processing unit or a number of units, all of which could include multiple computing units.
  • the processor 104 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions.
  • the processor 104 is configured to fetch and execute computer-readable instructions and data stored in the memory 106.
  • the memory 106 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes.
  • the modules 108 include routines, programs, objects, components, data structures, etc., which perform particular tasks or implement particular abstract data types.
  • the modules 108 further include a grid generation module 112, a sequence assembly module 114, and other module(s) 116.
  • the other modules 116 may include programs that supplement applications on the system 100, for example, programs in the operating system.
  • the data 110 serves, amongst other things, as a repository for storing data processed, received, and generated by one or more of the modules 108.
  • the data 110 includes grid data 118, assembled data 120, and other data 122.
  • the other data 122 includes data generated as a result of the execution of one or more modules in the modules 108.
  • the system 100 is associated with a metagenomic data repository (not shown in the figure).
  • the metagenomic data repository can be either external or internal to the system 100.
  • the metagenomic data repository includes a plurality of metagenomic data files having metagenomic data generated by a metagenomic data generation platform, such as an NGS based platform.
  • the metagenomic data includes a plurality of metagenomic sequences corresponding to genomes of a plurality of organisms residing in the environmental sample.
  • metagenomic data having a plurality of metagenomic sequences is received by the system 100 for being assembled into a plurality of contigs.
  • a contig may be understood as a group of metagenomic sequences corresponding to a distinct genome, with each genome being associated with a distinct organism residing in the environmental sample corresponding to the metagenomic data.
  • the grid generation module 112 receives and stores the metagenomic data having the metagenomic sequences in the grid data 118. Further, the grid generation module 112 represents each of the metagenomic sequences in three dimensional space to obtain a plurality of sequence vectors.
  • the grid generation module 112 initially determines the frequencies of all possible tetra-nucleotides for each of the metagenomic sequences. Based on the determination, the grid generation module 112 represents each metagenomic sequence as a 256 dimensional vector, hereinafter referred to as an intermediate vector, with one dimension per possible tetra-nucleotide (4^4 = 256). Further, the grid generation module 112 may transform each of the intermediate vectors into a three dimensional sequence vector.
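The intermediate vector described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `tetra_frequency_vector`, the use of overlapping windows, the normalisation to relative frequencies, and the skipping of windows that contain symbols other than A, C, G, or T are all assumptions.

```python
from itertools import product

# All 4**4 = 256 tetra-nucleotides in a fixed (lexicographic) order.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]
INDEX = {t: i for i, t in enumerate(TETRAMERS)}


def tetra_frequency_vector(seq):
    """Return a 256 dimensional vector of tetra-nucleotide frequencies.

    Assumptions (not stated in the source): overlapping windows, counts
    normalised to relative frequencies, ambiguous windows skipped.
    """
    counts = [0] * 256
    seq = seq.upper()
    for i in range(len(seq) - 3):
        idx = INDEX.get(seq[i:i + 4])
        if idx is not None:  # window contains only A, C, G, T
            counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * 256
```

For instance, the 8 bp sequence ACGTACGT has five overlapping windows, two of which are ACGT, so that dimension receives a frequency of 2/5.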
  • the grid generation module 112 obtains a sequence vector by computing a distance between the corresponding intermediate vector and a set of reference points.
  • the grid generation module 112 obtains the set of reference points using a plurality of reference genomes retrieved from a reference database, for example, a database of all currently sequenced genomes.
  • the grid generation module 112 obtains the plurality of reference genomes such that each reference genome corresponds to a different genus.
  • the grid generation module 112 may retrieve reference genomes corresponding to 237 completely sequenced microbial genomes from a known genomic database, such as National Center for Biotechnology Information (NCBI) database.
  • the grid generation module 112 subsequently fragments each of the plurality of reference genomes into a plurality of non-overlapping reference fragments. For instance, in the previous example, the grid generation module 112 splits the 237 reference genomes into a plurality of non-overlapping reference fragments of, say, 1000 base pairs each. Further, the grid generation module 112 analyzes each of the reference fragments to compute a corresponding 256 dimensional fragment vector having frequencies of all possible tetra-nucleotides. Fragment vectors thus obtained are subsequently clustered into fragment clusters by the grid generation module 112 using any known clustering process. For instance, the grid generation module 112 may use the K-means clustering approach for clustering of the fragment vectors to obtain the fragment clusters. In one implementation, the grid generation module 112 uses the K-means clustering approach to obtain k fragment clusters, wherein the value of k may be determined using the formula given in equation 1.
  • n is equal to the number of reference fragments obtained from the reference genomes.
  • the grid generation module 112 may obtain a total of 631 fragment clusters using the reference fragments obtained from the 237 reference genomes. Further, the grid generation module 112 determines, for each of the fragment clusters, a cluster vector corresponding to the centroid of each fragment cluster. Based on the determination, the grid generation module 112 subsequently identifies three least correlated cluster vectors. In one implementation, the grid generation module 112 obtains a pairwise dot product between unit vectors corresponding to the cluster vectors and identifies a set of three cluster vectors having least pairwise dot product amongst them as the set of reference points. The grid generation module 112 thus identifies three cluster vectors as the reference points and stores the set of reference points in the grid data 118.
  • the set of reference points thus generated represent nucleotide usage patterns observed in the known biological realm, thus ensuring a correct representation of the metagenomic sequences in the three dimensional space.
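The selection of the three least correlated cluster vectors can be sketched as below. The source only states that pairwise dot products between unit vectors are used; scoring a triple by the sum of its three pairwise dot products, the exhaustive search, and the name `pick_reference_points` are assumptions made for illustration. Centroids are assumed to be non-zero vectors.

```python
from itertools import combinations
from math import sqrt


def unit(v):
    """Scale a non-zero vector to unit length."""
    norm = sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pick_reference_points(centroids):
    """Return the three centroids whose unit vectors are least correlated.

    Assumption: a triple is scored by the sum of its three pairwise
    dot products, and the triple with the smallest score is chosen.
    """
    units = [unit(c) for c in centroids]
    n = len(units)
    sim = [[dot(units[i], units[j]) for j in range(n)] for i in range(n)]
    best = min(
        combinations(range(n), 3),
        key=lambda t: sim[t[0]][t[1]] + sim[t[0]][t[2]] + sim[t[1]][t[2]],
    )
    return [centroids[i] for i in best]
```

An exhaustive search over all triples is cubic in the number of clusters, so for several hundred cluster vectors a real implementation would likely want a faster strategy; the brute-force form is kept here only for clarity.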
  • the reference points may be used by the grid generation module 112 to determine the sequence vectors corresponding to the metagenomic sequences, for example, by computing a distance between the corresponding intermediate vector and the set of reference points.
  • the sequence vectors help in determining Cartesian coordinates for the metagenomic sequences in three dimensional space.
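One way to read the transformation above is that each coordinate of the sequence vector is the distance from the intermediate vector to one of the three reference points. The sketch below assumes Euclidean distance, which the source does not specify, and the name `sequence_vector` is hypothetical.

```python
from math import dist


def sequence_vector(intermediate, reference_points):
    """Map an intermediate vector to (x, y, z) Cartesian coordinates.

    Assumption: each coordinate is the Euclidean distance to one of the
    three reference points, taken in the order given.
    """
    if len(reference_points) != 3:
        raise ValueError("exactly three reference points are expected")
    return tuple(dist(intermediate, r) for r in reference_points)
```

With 256 dimensional inputs the same call applies unchanged; the two dimensional values used for checking are only for readability.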
  • the grid generation module 112 defines a cuboid 124, as illustrated in Fig. 1(b), in the three dimensional space based on the sequence vectors.
  • the cuboid 124 is generated such that it encompasses all the metagenomic sequences under consideration.
  • the grid generation module 112 initially determines three dimensional coordinates, i.e., x, y, and z coordinates of each of the metagenomic sequences based on the sequence vectors. Further, the grid generation module 112 determines, for each of the x, y, and z directions of the three dimensional space, a farthest coordinate and a closest coordinate.
  • the farthest coordinate in each direction may be defined as a maximum value in the corresponding direction among the three dimensional coordinates of the metagenomic sequences, i.e., the coordinate placed at a maximum distance from a point of origin in the three dimensional space.
  • the closest coordinate in each direction may be defined as a minimum value from among the three dimensional coordinates of the metagenomic sequences, i.e., the coordinate placed at a least distance from the point of origin.
  • the grid generation module 112 may subsequently define the cuboid 124 such that length of the cuboid 124 in each of the x, y, and z directions is equal to a difference between the farthest coordinate and the closest coordinate in the corresponding direction.
  • the cuboid 124 thus obtained may be saved by the grid generation module 112 in the grid data 118.
  • the grid generation module 112 may divide the cuboid 124 into a plurality of grids, as illustrated in Fig. 1(b), such that each grid includes the sequence vectors, and in turn the metagenomic sequences, located within coordinates defined by the particular grid in the cuboid 124.
  • the grids may be equally sized. Data related to the grids thus obtained may be stored by the grid generation module 112 in the grid data 118.
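The cuboid definition and its division into equally sized grids can be sketched as follows. The number of grids per axis (`n`) is a hypothetical parameter, as are the function names; the source does not state how many grids are used or how points on the far boundary are assigned.

```python
def define_cuboid(points):
    """Return the closest corner and the per-axis lengths of the cuboid."""
    closest = tuple(min(p[a] for p in points) for a in range(3))
    farthest = tuple(max(p[a] for p in points) for a in range(3))
    lengths = tuple(f - c for c, f in zip(closest, farthest))
    return closest, lengths


def grid_index(point, closest, lengths, n):
    """Return the (x, y, z) index of the grid containing the point."""
    idx = []
    for a in range(3):
        if lengths[a] == 0:
            idx.append(0)
        else:
            i = int((point[a] - closest[a]) / lengths[a] * n)
            # Assumption: the farthest coordinate belongs to the last grid.
            idx.append(min(i, n - 1))
    return tuple(idx)


def partition(points, n):
    """Group sequence indices by the grid that encloses their sequence vector."""
    closest, lengths = define_cuboid(points)
    grids = {}
    for seq_id, p in enumerate(points):
        grids.setdefault(grid_index(p, closest, lengths, n), []).append(seq_id)
    return grids
```

Storing only non-empty grids in a dictionary keeps memory proportional to the number of sequences rather than to n cubed.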
  • the sequence assembly module 114 may analyze the cuboid 124 to assemble the metagenomic sequences into contigs.
  • the sequence assembly module 114 may use a method of progressive traversal to assemble the metagenomic sequences into contigs. Using the method of progressive traversal allows the sequence assembly module 114 to traverse the grids such that in each step of traversal, metagenomic sequences present in a grid under consideration and its neighboring grids, collectively referred to as a cluster of grids, are obtained.
  • the sequence assembly module 114 identifies a grid, say, grid 'ABC', for analyses and traverses through the cluster of grids, formed by the grid 'ABC' and its immediate neighbors, in all three directions of the three dimensional space, as illustrated in Fig. 1(c).
  • the sequence assembly module 114 may traverse through the grid 'ABC' and seven immediate neighbors of the grid 'ABC', i.e., grids (A+1)BC, A(B+1)C, AB(C+1), (A+1)(B+1)C, A(B+1)(C+1), (A+1)B(C+1), (A+1)(B+1)(C+1), as illustrated in Fig. 1(c).
  • the sequence assembly module 114 obtains a selective subset of metagenomic sequences, i.e., the metagenomic sequences encompassed by the cluster of grids for assembling into one or more contigs.
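The cluster of grids for a grid 'ABC' can be enumerated as in the sketch below. The names `cluster_of_grids` and `sequences_in_cluster`, the dictionary layout of `grids`, and the clipping of neighbors that fall outside the cuboid (with `n` grids per axis) are assumptions for illustration.

```python
from itertools import product


def cluster_of_grids(grid, n):
    """Return grid 'ABC' and its immediate neighbors in the +x, +y, +z directions.

    Offsets of 0 or 1 on each axis give the grid itself plus the seven
    neighbors listed in the text; neighbors outside the cuboid are dropped.
    """
    a, b, c = grid
    return [
        (a + da, b + db, c + dc)
        for da, db, dc in product((0, 1), repeat=3)
        if a + da < n and b + db < n and c + dc < n
    ]


def sequences_in_cluster(grid, grids, n):
    """Collect the sequences held by every grid in the cluster.

    `grids` is assumed to map a grid index to a list of sequences.
    """
    selected = []
    for g in cluster_of_grids(grid, n):
        selected.extend(grids.get(g, []))
    return selected
```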
  • the sequence assembly module 114 may use any known method of sequence assembly, such as CAP3, SSAKE, SHARCGS, VCAKE, Newbler, Celera Assembler, ABySS, AllPaths, Velvet, Euler, and SOAPdenovo for assembling the selective subset of metagenomic sequences. Further, the sequence assembly module 114 assembles the selective metagenomic sequences into one or more contigs such that the metagenomic sequences originating from the same genome have a higher probability of getting combined to form a single contig.
  • the above method of grid partitioning helps in clustering metagenomic sequences of similar origin together, and thus results in a high probability of metagenomic sequences originating from the same genome being combined into a single contig.
  • the contigs thus obtained include metagenomic sequences probably originating from the same genome, thus providing an efficient assembly of the metagenomic sequences.
  • the metagenomic sequences left unassembled by the sequence assembly module 114 during a particular step of traversal, for example, due to absence of overlapping metagenomic sequences originating from the same genome, may be considered for assembly during traversal of a subsequent grid.
  • the sequence assembly module 114 may consider the metagenomic sequences unassembled during traversal of a grid '100' for assembly along with selective metagenomic sequences obtained during traversal of the subsequent grid, i.e., a grid '200'.
  • the sequence assembly module 114 may thus traverse through all the grids and obtain a plurality of contigs.
  • the plurality of contigs thus obtained by the sequence assembly module 114 is saved in the assembled data 120.
  • the sequence assembly module 114 first performs the traversal in the X direction, followed by Y direction and finally in direction of the Z axis.
  • the sequence assembly module 114 may combine the contigs, received after traversal of all the grids, into longer contigs or an entire genome. The contigs thus obtained may be saved in the assembled data 120. Further, the metagenomic sequences remaining unassembled after the traversal through the grids may also be saved in the assembled data 120. In addition, the sequence assembly module 114 may generate and store indexes of the unassembled metagenomic sequences and assembled metagenomic sequences along with the contigs in the assembled data 120 for further reference and/or analyses.
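The progressive traversal with carry-over of unassembled sequences can be sketched as follows. Everything named here is hypothetical: `assemble` stands in for any external assembler (such as those listed above) and is assumed to return a pair of new contigs and leftover sequences, `toy_assemble` is a placeholder that merely groups identifiers by their first character, and removing a sequence from further consideration once it has been assembled is an assumption the source does not spell out.

```python
def progressive_traversal(grids, n, assemble):
    """Traverse grids in X, then Y, then Z order, carrying leftovers forward.

    grids: dict mapping (x, y, z) to a list of hashable sequence identifiers.
    assemble: callable taking a list and returning (contigs, unassembled).
    """
    contigs, carried, done = [], [], set()
    for z in range(n):              # Z varies slowest
        for y in range(n):
            for x in range(n):      # X varies fastest: '000', then '100', ...
                cluster = [(x + dx, y + dy, z + dz)
                           for dz in (0, 1) for dy in (0, 1) for dx in (0, 1)]
                pool = list(carried)
                for g in cluster:
                    pool.extend(grids.get(g, []))
                # Drop duplicates and already assembled sequences, keep order.
                pool = [s for s in dict.fromkeys(pool) if s not in done]
                if not pool:
                    continue
                new_contigs, carried = assemble(pool)
                contigs.extend(new_contigs)
                done.update(set(pool) - set(carried))
    return contigs, carried


def toy_assemble(pool):
    """Placeholder assembler: identifiers sharing a first character 'overlap'."""
    groups = {}
    for s in pool:
        groups.setdefault(s[0], []).append(s)
    contigs = [g for g in groups.values() if len(g) > 1]
    leftover = [g[0] for g in groups.values() if len(g) == 1]
    return contigs, leftover
```

In a small run where grid '000' holds a1 and b1, grid '100' holds a2, and grid '200' holds b2, the sequence b1 is left over at grid '000' and is assembled with b2 when grid '100' and its neighbors are traversed.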
  • simHC, simMC, and simLC are data sets from Fidelity of Analysis of Metagenomic Samples (FAMeS).
  • conventional techniques, such as CAP3.
  • the simHC data sets are defined as data sets for which all constituting genomes are represented equally.
  • the simMC data sets are defined as data sets in which a first half of the genomes have a high representation, whereas the remaining half of the genomes have a low representation.
  • the simLC data sets are defined as data sets in which a few genomes are overrepresented as compared to other genomes.
  • a first validation was performed for determining the resolving power of the metagenomic sequences assembly system to obtain grids containing taxonomically similar metagenomic sequences, which may facilitate their assembly into contigs.
  • the three distinct sets of metagenomes were provided as inputs to the system 100, and a cuboid, such as the cuboid 124, was defined based on the three distinct sets.
  • the cuboid was further divided into a plurality of grids and analyzed to determine taxonomic affiliations of the metagenomic sequences covered in each grid. Based on the determination, purity of each grid was ascertained at phylum level of taxonomic classification.
  • the plot 200 depicts percentage of metagenomic sequences covered in pure grids achieved for each data set using the system 100.
  • the three data sets used for validation are represented on a horizontal axis 202, while percentage of metagenomic sequences covered in pure grids obtained for the three data sets is represented on a vertical axis 204.
  • the purity levels of grids obtained for the simLC, simMC, and simHC data sets are represented by a bar 206, a bar 208, and a bar 210, respectively.
  • percentage of metagenomic sequences covered in pure grids was more than 60% for all the three data sets. Further, the percentage for the simLC and the simMC datasets was more than 70%. Such a high percentage of metagenomic sequences covered in the pure grids thus illustrates efficiency of the system 100 in pre-partitioning the metagenomic data for assembly.
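The percentage of metagenomic sequences covered in pure grids could be computed along the lines below. The source does not define when a grid counts as pure; treating a grid as pure when the share of its most common phylum reaches a `threshold` (default 1.0, i.e., a single phylum) is an assumption, as are the function name and the input layout.

```python
from collections import Counter


def percent_in_pure_grids(grid_phyla, threshold=1.0):
    """Percentage of sequences that fall in pure grids.

    grid_phyla: dict mapping a grid index to the list of phylum labels of
    the sequences it contains (labels are assumed to be known, as in a
    simulated data set).
    """
    total = covered = 0
    for phyla in grid_phyla.values():
        if not phyla:
            continue
        total += len(phyla)
        top_count = Counter(phyla).most_common(1)[0][1]
        if top_count / len(phyla) >= threshold:
            covered += len(phyla)
    return 100.0 * covered / total if total else 0.0
```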
  • Fig. 3 illustrates a method 300 for assembly of metagenomic sequences, in accordance with an implementation of the present subject matter.
  • Fig. 4 illustrates a method 304 for generating a set of reference points for assembly of the metagenomic sequences according to an embodiment of the present subject matter.
  • the methods 300 and 304 are implemented in a computing device, such as the metagenomic sequences assembly system 100.
  • the methods may be described in the general context of computer executable instructions.
  • computer executable instructions can include routines, programs, objects, components, data structures, procedures, modules, functions, etc., that perform particular functions or implement particular abstract data types.
  • the methods may also be practiced in a distributed computing environment where functions are performed by remote processing devices that are linked through a communications network.
  • metagenomic data having a plurality of metagenomic sequences to be assembled is received, for example, by the system 100.
  • the metagenomic data is obtained from a metagenomic data repository associated with the system 100.
  • the metagenomic data includes a plurality of metagenomic sequences corresponding to genomes of a plurality of organisms residing in the environmental sample for which the metagenomic data is generated.
  • the metagenomic data may be obtained by the grid generation module 112 and stored in the grid data 118.
  • each of the plurality of metagenomic sequences is represented in three dimensional space to obtain a plurality of sequence vectors.
  • a sequence vector is obtained for each of the metagenomic sequences by the grid generation module 112 using, for example, a set of reference points such that each sequence is represented as a unique point in three dimensional space.
  • Each sequence vector represents the corresponding metagenomic sequence in three dimensional space thus facilitating an easy and efficient partitioning of the metagenomic sequences for assembling into contigs.
  • representing each of the plurality of metagenomic sequences includes determining frequencies of possible tetra-nucleotides for each of the plurality of metagenomic sequences, obtaining an intermediate vector corresponding to each of the plurality of metagenomic sequences based on the determined frequencies, and transforming, for each of the plurality of metagenomic sequences, the intermediate vector into a sequence vector based on a set of reference points, to obtain a plurality of sequence vectors.
  • the set of reference points used for transforming the metagenomic data sets may be obtained based on a plurality of reference fragments as will be described in greater detail with reference to fig. 4 .
  • a cuboid is defined in the three dimensional space based on the plurality of sequence vectors, for example, by the grid generation module 112.
  • defining further includes ascertaining three dimensional coordinates for each of the plurality of metagenomic sequences based on the plurality of sequence vectors.
  • the sequence vectors obtained for the metagenomic sequences are analyzed to determine a farthest coordinate and a closest coordinate for each of the x, y, and z axes of the three dimensional space.
  • length of the cuboid in each of the x, y, and z axes may be ascertained as a value equal to a difference between the farthest coordinate and the closest coordinate in the corresponding axis. Subsequently the cuboid may be defined in the three dimensional space such that it encompasses all the metagenomic sequences obtained for being assembled.
  • the cuboid is divided into a plurality of smaller equally sized cuboids, hereinafter referred to as grids.
  • the grid generation module 112 is configured to divide the cuboid into the plurality of grids such that each grid includes all the metagenomic sequences whose sequence vectors lie in the coordinates covered by the grid under consideration.
  • the plurality of grids is progressively traversed to assemble the plurality of metagenomic sequences into one or more contigs.
  • a contig from among the one or more contigs includes metagenomic sequences originating from the same genome.
  • the plurality of grids may be traversed by a sequence assembly module, such as the sequence assembly module 114.
  • the sequence assembly module 114 is configured to traverse the grids such that in each traversal, metagenomic sequences residing in the grid under consideration and its immediate neighbors are obtained and assembled into one or more contigs.
  • traversing further includes obtaining, for each of the plurality of grids, one or more metagenomic sequences from among the plurality of metagenomic sequences.
  • the one or more metagenomic sequences are located within coordinates defined by the grid and immediate neighbors of the grid in the cuboid. Further, all the metagenomic sequences unassembled during a particular step of traversal are considered for assembly during a next step of traversal and so on till all the grids are traversed to obtain the contigs.
  • the contigs may be further assembled into a plurality of longer contigs or complete genomes. The longer contigs or genomes thus obtained include metagenomic sequences probably originating from the same genome. Additionally, the contigs and the sequences remaining unassembled at the end of the traversal of the grids may be stored in the assembled data 120 of the system 100.
  • the method 304 generates a set of reference points for representing the metagenomic sequences in the three dimensional space for assembling into contigs, according to an example embodiment of the present subject matter.
  • each of the plurality of reference genomes is split into a plurality of reference fragments.
  • a plurality of reference genomes corresponding to distinct genera is obtained from a reference database, such as a database of all sequenced genomes.
  • each of the reference genomes is fragmented into the plurality of reference fragments, for example, by the grid generation module 112, and the reference fragments are stored in the grid data 118.
  • a plurality of fragment vectors, one corresponding to each of the reference fragments, is computed, for example, by the grid generation module 112.
  • each of the reference fragments is analyzed to compute a corresponding fragment vector having frequencies of all 256 possible tetra-nucleotides.
  • fragment vectors obtained are clustered to obtain one or more fragment clusters, for example, by the grid generation module 112.
  • the fragment vectors are clustered into one or more fragment clusters using any known clustering process, such as the K-means approach.
  • the fragment vectors may be clustered into a total of 631 clusters using the K-means approach.
  • a cluster vector corresponding to a centroid of each fragment cluster is computed.
  • each of the fragment clusters is analyzed to ascertain a corresponding cluster vector.
  • the computed cluster vectors may be further stored in the grid data 118.
  • a set of reference points is obtained based on cluster vectors corresponding to the fragment clusters.
  • the cluster vectors corresponding to the fragment clusters are analyzed, for example, by the grid generation module 112 to ascertain three least correlated cluster vectors as the set of reference points.
  • the least correlated cluster vectors may be identified based on pairwise dot products computed for unit vectors corresponding to the cluster vectors. Further, the three cluster vectors having the least pairwise dot products amongst them may be identified as the set of reference points.
  • the set of reference points may be further used for representing the metagenomic sequences in the three dimensional space.

Landscapes

  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Chemical & Material Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Public Health (AREA)
  • Bioethics (AREA)
  • Software Systems (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Micro-Organisms Or Cultivation Processes Thereof (AREA)

Description

    TECHNICAL FIELD
  • The present subject matter relates, in general, to the field of metagenomics and, in particular, to assembly of metagenomic sequences.
    BACKGROUND
  • The study of genetic material recovered directly from an environmental sample, by sequencing the genetic material, is referred to as metagenomics. Metagenomics provides information pertaining to taxonomic diversity and physiology of various organisms present in the environmental sample.
  • A facility, such as a research laboratory or a clinic, involved in genomic study typically uses high capacity platforms, such as next generation sequencing (NGS) platforms, capable of generating huge volumes of metagenomic data every year. The metagenomic data thus generated may be further analyzed, for example, to determine various organisms present in the metagenomic data and to identify the functional roles of the various genes they encompass. Generally, the metagenomic data may be stored for further analysis and future studies. Thus, each year metagenomic data is generated in huge volumes, in the range of hundreds of terabytes (TB), and stored in repositories for future studies.
  • In order to analyze the metagenomic data, nucleotide sequences, such as DNA or RNA sequences constituting the metagenomic data are generally assembled into larger sequences called contigs. The process of assembly typically involves performing a pairwise comparison of the nucleotide sequences, numbering in millions, thus requiring huge computational resources and infrastructure. Furthermore, an attempt to assemble nucleotide sequences, originating from genomes of a large number of organisms belonging to diverse taxonomic groups, may result in formation of erroneous chimeric sequences, which may affect the results of analyses of the metagenomic data.
  • A variety of assembly techniques have been used for assembly of the metagenomic sequences derived from various organisms present in a given environmental sample into their corresponding contigs. Conventional assembly techniques involve comparing the metagenomic sequences with predetermined oligonucleotide frequency based models and tagging the metagenomic sequences to the models showing highest similarity. The metagenomic sequences tagged to similar models may then be assembled into contigs. However, metagenomic sequences belonging to unknown genomes may not show significant similarity to any of the models and may not be assembled into contigs, thus resulting in ambiguity and less efficient analysis.
  • Another conventional technique involves assembling the metagenomic sequences based on taxonomic origin of each of the metagenomic sequences. The metagenomic sequences having similar taxonomic origin may be assembled together to form contigs. However, the metagenomic sequences may not be efficiently assembled using the above approach, for example, when the metagenomic sequences belong to an organism that may not have been taxonomically classified. Metagenomic sequences belonging to such unknown organisms may thus not be assembled into the contigs, leading to ambiguous results and analysis of the metagenomic data.
  • Another conventional technique involves assembling the metagenomic sequences based on oligonucleotide usage patterns of the metagenomic sequences. According to the technique, the metagenomic sequences having similar oligonucleotide usage patterns may be initially grouped into clusters, using clustering techniques, such as K-means. Subsequently, metagenomic sequences belonging to a single cluster may be assembled into contigs. For the purpose, each of the metagenomic sequences is transformed into an n-dimensional vector, such that each of the n dimensions corresponds to the frequency of a specific oligonucleotide, of a given length, in the metagenomic sequences. Further, the metagenomic sequences may be grouped into clusters based on a relative difference obtained between their corresponding n-dimensional vectors. However, clustering the metagenomic sequences based on the frequencies of oligonucleotides of longer length may result in erroneous clustering, for example, in case of metagenomic sequences having lengths of less than 1000 bps. Further, assembling the metagenomic sequences belonging to such ambiguous clusters may result in incorrect contigs. Moreover, assembling the metagenomic sequences based on the frequencies may require increased time and computational resources, due to time required for computing the frequencies as well as distances between the n-dimensional vectors.
  • Prior Art:
  • The paper titled "The oligonucleotide frequency derived error gradient and its application to the binning of metagenome fragments", by SAEED ISAAM et al., published in BMC GENOMICS, BIOMED CENTRAL LTD, LONDON, UK, vol. 10, no. Suppl 3, 3 December 2009 (2009-12-03), page S10, discloses a one-dimensional signature, OFDEG, derived from the oligonucleotide frequency profile of a DNA sequence, and shows that it is possible to obtain a meaningful phylogenetic signal for relatively short DNA sequences.
    SUMMARY
  • The invention is defined by the appended claims.
  • Method(s) and system(s) for assembly of metagenomic sequences are described herein. The method for assembly of metagenomic sequences includes representing each of a plurality of metagenomic sequences in three dimensional space to obtain a plurality of sequence vectors, wherein representing comprises: determining frequencies of possible tetra-nucleotides for each of the plurality of metagenomic sequences; obtaining an intermediate vector corresponding to each of the plurality of metagenomic sequences based on the determined frequencies; and transforming, for each of the plurality of metagenomic sequences, the intermediate vector into a sequence vector to obtain the plurality of sequence vectors based on a set of reference points.
  • The method further includes defining, based on the plurality of sequence vectors, a cuboid having a plurality of grids in the three dimensional space, wherein the cuboid encompasses the plurality of metagenomic sequences, wherein defining further comprises: ascertaining three dimensional coordinates for each of the plurality of metagenomic sequences based on the plurality of sequence vectors; determining, for each axis of the three dimensional space, a farthest coordinate and a closest coordinate from among the three dimensional coordinates; and calculating a length of the cuboid along each axis based on the difference between the farthest coordinate and the closest coordinate in the corresponding axis.
  • The method further includes traversing progressively the plurality of grids to assemble the plurality of metagenomic sequences into one or more contigs, wherein a contig includes metagenomic sequences originating from the same genome, wherein the traversing further comprises: obtaining, for each of the plurality of grids, one or more metagenomic sequences from among the plurality of metagenomic sequences, wherein the one or more metagenomic sequences are located within coordinates defined by the grid and immediate neighbors of the grid in the cuboid; and assembling, for each of the plurality of grids, the corresponding one or more metagenomic sequences into the one or more contigs.
  • The system(s) for assembly of metagenomic sequences include a processor (104) and a memory (106) coupled to the processor (104), the memory (106) comprising modules configured to perform the aforementioned method.
    BRIEF DESCRIPTION OF THE DRAWINGS
  • The detailed description is described with reference to the accompanying figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The same numbers are used throughout the drawings for reference to like features and components.
    • Fig. 1(a) illustrates a metagenomic sequences assembly system, in accordance with an embodiment of the present subject matter.
    • Fig. 1(b) illustrates a cuboid generated by the metagenomic sequences assembly system for assembly of metagenomic sequences, in accordance with an embodiment of the present subject matter.
    • Fig. 1(c) illustrates a pictorial representation of progressive traversal performed on the cuboid by the metagenomic sequences assembly system, in accordance with an embodiment of the present subject matter.
    • Fig. 2 illustrates a plot depicting percentage of metagenomic sequences covered in pure grids achieved using the metagenomic sequences assembly system, according to an embodiment of the present subject matter.
    • Fig. 3 illustrates a method for assembly of metagenomic sequences, in accordance with an embodiment of the present subject matter.
    • Fig. 4 illustrates a method of generating a set of reference points for assembly of the metagenomic sequences, in accordance with an embodiment of the present subject matter.
    DETAILED DESCRIPTION
  • Systems and methods for assembly of metagenomic sequences are described herein. Generally, genetic material extracted directly from either a biological or an environmental sample, i.e., metagenome, is processed and stored as metagenomic data for research or medical purposes. The genetic material is sequenced to generate a plurality of nucleotide sequences, such as DNA or RNA sequences. The nucleotide sequences, also known as metagenomic sequences, may be subsequently assembled into genomic fragments, called contigs, corresponding to genomes of organisms residing in the environmental sample. The contigs may be further analyzed, for example, to estimate taxonomic diversity and the functional profiles of the organisms present in the environmental sample.
  • The present subject matter describes methods and systems for assembly of metagenomic sequences into contigs using an optimized method of data partitioning. Although the description herein is provided in considerable detail with respect to metagenomic data having metagenomic sequences corresponding to fragments of different genomes constituting the metagenomic data, it will be understood that the methods and systems for assembly can be implemented for genomic data having genomic fragments from the same genome as well, albeit with a few variations, as will be understood by a person skilled in the art. According to an embodiment of the present subject matter, metagenomic data having a plurality of metagenomic sequences is received for assembly into a plurality of contigs. As will be understood, each of the contigs constitutes metagenomic sequences corresponding to a distinct genome, with each genome being associated with a distinct organism residing in the environmental sample. Further, the contigs thus generated may be processed using a subsequent iteration of the above described process in order to obtain longer contigs or a complete genome corresponding to an organism residing in the environmental sample.
  • Each of the metagenomic sequences obtained from the metagenomic data is initially transformed into a 256 dimensional vector, hereinafter referred to as an intermediate vector, based on frequencies of all possible tetra-nucleotides for that metagenomic sequence. The plurality of intermediate vectors thus obtained is transformed into a plurality of sequence vectors in three dimensional space, such that each metagenomic sequence is represented as a sequence vector in the three dimensional space. In one implementation, the metagenomic sequences are represented as the sequence vectors using, for example, a set of reference points obtained based on a plurality of reference genomes. Further, based on the sequence vectors, a cuboid may be defined in the three dimensional space such that the cuboid encloses the sequence vectors corresponding to all the metagenomic sequences. Further, the cuboid may be divided into a plurality of equally sized smaller cuboids, hereinafter referred to as grids, such that each grid includes the sequence vectors and, in turn, the metagenomic sequences located within the coordinates defined by the particular grid in the cuboid.
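The first transformation described above, from a sequence to its 256 dimensional intermediate vector, can be sketched as follows. This is a minimal illustration only: the function name, the lexicographic ordering of the tetra-nucleotides, the skipping of windows containing ambiguous bases, and the normalisation to relative frequencies are all assumptions that the description does not specify.

```python
from itertools import product

# All 256 tetra-nucleotides in a fixed order. The lexicographic order is an
# assumption; the only requirement is that every sequence uses the same order.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]
INDEX = {kmer: i for i, kmer in enumerate(TETRAMERS)}


def intermediate_vector(sequence):
    """Return a 256-dimensional tetra-nucleotide frequency vector."""
    seq = sequence.upper()
    counts = [0] * 256
    # Slide a window of length 4 over the sequence, one base at a time.
    for i in range(len(seq) - 3):
        idx = INDEX.get(seq[i:i + 4])
        if idx is not None:  # assumption: skip windows with N or other symbols
            counts[idx] += 1
    total = sum(counts)
    # Assumption: relative frequencies; raw counts would be an equally valid
    # reading of "frequencies" in the description.
    return [c / total for c in counts] if total else [0.0] * 256
```

Normalising by the number of counted windows keeps vectors comparable across sequences of different lengths, which is one plausible reason to prefer relative frequencies here.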
  • Furthermore, each of the grids may be analyzed, using a method of progressive traversal, to identify and group all the metagenomic sequences which may belong to a particular genome. In one implementation, the grids are traversed such that, in each step of traversal, metagenomic sequences present in a grid and its neighboring grids, collectively referred to as a cluster of grids, are obtained. The metagenomic sequences thus obtained may be further assembled into contigs such that the metagenomic sequences having similar taxonomic origin are combined to form a single contig. Further, metagenomic sequences that have not been assembled during traversal of a particular grid, for example, due to absence of overlapping metagenomic sequences of similar taxonomic origin, may be considered for assembly during traversal of a subsequent grid. For example, the metagenomic sequences unassembled during traversal of a grid '000' may be considered for assembly along with metagenomic sequences obtained during traversal of the subsequent grid, i.e., a grid '100'. On traversal of all the grids, indexes of unassembled sequences and assembled sequences along with the contigs may be prepared and stored for further reference and/or analyses.
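The carry-forward behaviour of the progressive traversal can be sketched as below. The assembler itself is outside the scope of this passage, so it appears as a hypothetical callback `assemble` that returns the contigs it formed and the sequences it could not place. The function name, the parameter names, and the rule that each grid's sequences enter the pool only once are assumptions made for illustration.

```python
def progressive_traversal(grid_order, grid_contents, neighbors, assemble):
    """Traverse grids in order, carrying unassembled sequences forward.

    grid_order    : list of grid ids in traversal order, e.g. ["000", "100"]
    grid_contents : dict mapping grid id -> list of sequences in that grid
    neighbors     : dict mapping grid id -> list of neighboring grid ids
    assemble      : hypothetical callback taking a list of sequences and
                    returning (contigs, unassembled_sequences)
    """
    contigs, carried, consumed = [], [], set()
    for grid in grid_order:
        # Start from whatever the previous step left unassembled.
        pool = list(carried)
        for g in [grid] + list(neighbors.get(grid, [])):
            # Assumption: a grid's sequences are pooled once; afterwards they
            # only reappear through the carried-forward unassembled list.
            if g not in consumed:
                pool.extend(grid_contents.get(g, []))
                consumed.add(g)
        new_contigs, carried = assemble(pool)
        contigs.extend(new_contigs)
    return contigs, carried
```

With this structure, sequences left over from grid '000' are offered to the assembler again together with the sequences of grid '100', and whatever remains after the last grid is returned as the unassembled set.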
  • The present subject matter thus provides an efficient and easy method for assembly of metagenomic sequences into contigs using an optimized method of data partitioning. Partitioning the metagenomic sequences into the sequence vectors and the plurality of grids effectively reduces computational time required for analyzing and assembling the metagenomic sequences. Further, using the method of progressive traversal and assembling the metagenomic sequences of one cluster of grids at a time helps in optimizing resources required for an efficient assembly of the metagenomic sequences.
  • Although the description herein is with reference to metagenomic data, the systems and methods may be implemented for other data, such as genomic data, as well, albeit with a few variations, as will be understood by a person skilled in the art.
  • These and other advantages of the present subject matter are described in greater detail in conjunction with the following figures. While aspects of the described systems and methods for assembly of metagenomic sequences can be implemented in any number of different computing systems, environments, and/or configurations, the embodiments are described in the context of the following exemplary system(s).
  • Fig. 1(a) illustrates a metagenomic sequences assembly system 100, according to an implementation of the present subject matter. The metagenomic sequences assembly system 100 can be implemented in systems that include, but are not limited to, desktop computers, multiprocessor systems, laptops, network computers, cloud servers, minicomputers, mainframe computers, and the like. In one implementation, the metagenomic sequences assembly system 100, hereinafter referred to as the system 100, includes interface(s) 102, one or more processor(s) 104, and a memory 106 coupled to the processor(s) 104.
  • The interfaces 102 may include a variety of software and hardware interfaces, for example, interfaces for peripheral device(s), such as a keyboard, a mouse, an external memory, and a printer. Further, the interfaces 102 may enable the system 100 to communicate with other devices, such as web servers and external databases. The interfaces 102 can facilitate multiple communications within a wide variety of networks and protocol types, including wired networks, for example, local area network (LAN), cable, etc., and wireless networks, such as Wireless LAN (WLAN), cellular, or satellite. For the purpose, the interfaces 102 may include one or more ports for connecting a number of computing systems with one another or to another server computer.
  • The processor(s) 104 can be a single processing unit or a number of units, all of which could include multiple computing units. The processor 104 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the processor 104 is configured to fetch and execute computer-readable instructions and data stored in the memory 106.
  • The memory 106 may include any computer-readable medium known in the art including, for example, volatile memory, such as static random access memory (SRAM) and dynamic random access memory (DRAM), and/or non-volatile memory, such as read only memory (ROM), erasable programmable ROM, flash memories, hard disks, optical disks, and magnetic tapes. The memory 106 also includes module(s) 108 and data 110.
  • The modules 108, amongst other things, include routines, programs, objects, components, data structures, etc., which perform particular tasks or implement particular abstract data types. The modules 108 further include a grid generation module 112, a sequence assembly module 114, and other module(s) 116. The other modules 116 may include programs that supplement applications on the system 100, for example, programs in the operating system. On the other hand, the data 110 serves, amongst other things, as a repository for storing data processed, received, and generated by one or more of the modules 108. The data 110 includes grid data 118, assembled data 120, and other data 122. The other data 122 includes data generated as a result of the execution of one or more modules in the modules 108.
  • In one implementation, the system 100 is associated with a metagenomic data repository (not shown in the figure). The metagenomic data repository, as will be understood, can be either external or internal to the system 100. The metagenomic data repository includes a plurality of metagenomic data files having metagenomic data generated by a metagenomic data generation platform, such as an NGS based platform. The metagenomic data, as will be understood, includes a plurality of metagenomic sequences corresponding to genomes of a plurality of organisms residing in the environmental sample. Although the description of the system 100 and the methods herein is provided in considerable detail with respect to metagenomic data having metagenomic sequences, it will be understood that the methods and systems for assembly can be implemented for genomic data having genomic fragments as well, albeit with a few variations, as will be understood by a person skilled in the art.
  • According to an embodiment of the present subject matter, metagenomic data having a plurality of metagenomic sequences is received by the system 100 for being assembled into a plurality of contigs. A contig may be understood as a group of metagenomic sequences corresponding to a distinct genome, with each genome being associated with a distinct organism residing in the environmental sample corresponding to the metagenomic data. In one implementation, the grid generation module 112 receives and stores the metagenomic data having the metagenomic sequences in the grid data 118. Further, the grid generation module 112 represents each of the metagenomic sequences in three dimensional space to obtain a plurality of sequence vectors. In one embodiment, the grid generation module 112 initially determines the frequencies of all possible tetra-nucleotides for each of the metagenomic sequences. Based on the determination, the grid generation module 112 represents the metagenomic sequences as 256 dimensional vectors. Thus, for each of the metagenomic sequences, the grid generation module 112 obtains a 256 dimensional vector, hereinafter referred to as intermediate vectors. The intermediate vector corresponding to each of a plurality of metagenomic sequences is based on the determined frequencies. Further, the grid generation module 112 may transform each of the intermediate vectors to the three dimensional sequence vectors.
  • In an example implementation, the grid generation module 112 obtains a sequence vector by computing a distance between the corresponding intermediate vector and a set of reference points. In one implementation, the grid generation module 112 obtains the set of reference points using a plurality of reference genomes retrieved from a reference database, for example, a database of all currently sequenced genomes. Further, the grid generation module 112 obtains the plurality of reference genomes such that each reference genome corresponds to a different genus. For example, the grid generation module 112 may retrieve reference genomes corresponding to 237 completely sequenced microbial genomes from a known genomic database, such as National Center for Biotechnology Information (NCBI) database.
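The mapping from an intermediate vector to a three dimensional sequence vector can be sketched as one distance per reference point. The description says only that "a distance" is computed, so the use of Euclidean distance and the function name below are assumptions.

```python
import math


def sequence_vector(intermediate, reference_points):
    """Map an intermediate vector to (x, y, z) via three reference points.

    Each coordinate is the distance from the intermediate vector to one
    reference point. Euclidean distance is an assumption; the description
    does not name the distance measure.
    """
    if len(reference_points) != 3:
        raise ValueError("exactly three reference points are expected")
    return tuple(math.dist(intermediate, ref) for ref in reference_points)
```

Because each sequence is reduced to three distances, later steps operate on three numbers per sequence rather than on 256.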
  • The grid generation module 112 subsequently fragments each of the plurality of reference genomes into a plurality of non-overlapping reference fragments. For instance, in the previous example, the grid generation module 112 splits the 237 reference genomes into a plurality of non-overlapping reference fragments of, say, 1000 base pairs each. Further, the grid generation module 112 analyzes each of the reference fragments to compute a corresponding 256 dimensional fragment vector having frequencies of all possible tetra-nucleotides. Fragment vectors thus obtained are subsequently clustered into fragment clusters by the grid generation module 112 using any known clustering process. For instance, the grid generation module 112 may use the K-means clustering approach for clustering of the fragment vectors to obtain the fragment clusters. In one implementation, the grid generation module 112 uses the K-means clustering approach to obtain k fragment clusters, wherein the value of k may be determined using the formula given in equation 1.
  • [Equation 1: expression for k, not legible in this text] (1), where n is equal to the number of reference fragments obtained from the reference genomes.
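The fragmentation step and the count n that feeds equation 1 can be sketched as follows. Equation 1 itself is not reproduced in this text, so the cluster-count function below uses the common rule of thumb k ≈ sqrt(n/2) purely as an assumption and should not be read as the formula of the patent. Dropping a trailing piece shorter than the fragment length is likewise an assumption, as are both function names.

```python
import math


def split_into_fragments(genome, fragment_length=1000):
    """Split a genome string into non-overlapping fragments.

    Assumption: a trailing piece shorter than fragment_length is dropped.
    """
    usable = len(genome) - len(genome) % fragment_length
    return [genome[i:i + fragment_length]
            for i in range(0, usable, fragment_length)]


def assumed_cluster_count(n):
    """ASSUMPTION: rule of thumb k = sqrt(n / 2); NOT taken from equation 1."""
    return max(1, round(math.sqrt(n / 2)))
```

Under this assumed rule, a value of k = 631 would correspond to roughly 796,000 reference fragments. The text does not state n, so this is only a plausibility check and not a confirmation of the formula.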
  • Referring to the example discussed above, the grid generation module 112 may obtain a total of 631 fragment clusters using the reference fragments obtained from the 237 reference genomes. Further, the grid generation module 112 determines, for each of the fragment clusters, a cluster vector corresponding to the centroid of each fragment cluster. Based on the determination, the grid generation module 112 subsequently identifies three least correlated cluster vectors. In one implementation, the grid generation module 112 obtains a pairwise dot product between unit vectors corresponding to the cluster vectors and identifies a set of three cluster vectors having least pairwise dot product amongst them as the set of reference points. The grid generation module 112 thus identifies three cluster vectors as the reference points and stores the set of reference points in the grid data 118. It would be understood that the set of reference points thus generated represent nucleotide usage patterns observed in the known biological realm, thus ensuring a correct representation of the metagenomic sequences in the three dimensional space. Further, the reference points may be used by the grid generation module 112 to determine the sequence vectors corresponding to the metagenomic sequences, for example, by computing a distance between the corresponding intermediate vector and the set of reference points. The sequence vectors, as will be understood, help in determining Cartesian coordinates for the metagenomic sequences in three dimensional space.
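The selection of the three least correlated cluster vectors can be sketched as an exhaustive search over triples. The description does not say how the three pairwise dot products of a triple are combined into one score, so minimising their sum is an assumption, as are the function names.

```python
import math
from itertools import combinations


def unit(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def least_correlated_triple(cluster_vectors):
    """Return indices of the three least correlated cluster vectors.

    Assumption: a triple is scored by the SUM of its three pairwise dot
    products between unit vectors, and the lowest-scoring triple wins.
    """
    units = [unit(v) for v in cluster_vectors]
    n = len(units)
    # Precompute all pairwise dot products once.
    dots = [[sum(a * b for a, b in zip(units[i], units[j]))
             for j in range(n)] for i in range(n)]
    return min(
        combinations(range(n), 3),
        key=lambda t: dots[t[0]][t[1]] + dots[t[0]][t[2]] + dots[t[1]][t[2]],
    )
```

For 631 cluster vectors this search visits about 41.7 million triples, which is feasible as a one-time precomputation but slow in pure Python; a vectorised implementation would be the practical choice.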
  • Further, the grid generation module 112 defines a cuboid 124, as illustrated in Fig. 1(b), in the three dimensional space based on the sequence vectors. The cuboid 124 is generated such that it encompasses all the metagenomic sequences under consideration. For the purpose, the grid generation module 112 initially determines three dimensional coordinates, i.e., x, y, and z coordinates of each of the metagenomic sequences based on the sequence vectors. Further, the grid generation module 112 determines, for each of the x, y, and z directions of the three dimensional space, a farthest coordinate and a closest coordinate. The farthest coordinate in each direction may be defined as a maximum value in the corresponding direction among the three dimensional coordinates of the metagenomic sequences, i.e., the coordinate placed at a maximum distance from a point of origin in the three dimensional space. The closest coordinate in each direction may be defined as a minimum value from among the three dimensional coordinates of the metagenomic sequences, i.e., the coordinate placed at a least distance from the point of origin. The grid generation module 112 may subsequently define the cuboid 124 such that length of the cuboid 124 in each of the x, y, and z directions is equal to a difference between the farthest coordinate and the closest coordinate in the corresponding direction. Defining the boundaries of the cuboid 124 based on the farthest coordinate and the closest coordinate in each direction ensures that sequence vectors corresponding to all the metagenomic sequences are encompassed within the cuboid 124. The cuboid 124 thus obtained may be saved by the grid generation module 112 in the grid data 118.
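  • The definition of the cuboid reduces to finding, for each axis, the minimum and the maximum coordinate. An illustrative sketch, in which the function name and the tuple-based return value are assumptions, may read:

```python
def define_cuboid(points):
    """Return (closest, farthest, lengths) per axis for a list of (x, y, z) points."""
    closest = tuple(min(p[axis] for p in points) for axis in range(3))
    farthest = tuple(max(p[axis] for p in points) for axis in range(3))
    # Edge length per axis = farthest coordinate minus closest coordinate.
    lengths = tuple(f - c for c, f in zip(closest, farthest))
    return closest, farthest, lengths
```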
  • Further, the grid generation module 112 may divide the cuboid 124 into a plurality of grids, as illustrated in the Fig. 1 (b), such that each grid includes the sequence vectors, and in turn the metagenomic sequences, located within coordinates defined by the particular grid in the cuboid 124. In one implementation, the grids may be equally sized. Data related to the grids thus obtained may be stored by the grid generation module 112 in the grid data 118.
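  • A possible way of assigning each sequence vector to an equally sized grid is sketched below. The number of grids per axis (grids_per_axis) is a hypothetical parameter, since the description does not fix the grid count, and the function names are likewise assumptions.

```python
from collections import defaultdict


def grid_index(point, closest, lengths, grids_per_axis):
    """Map an (x, y, z) point to the integer index of the grid containing it."""
    idx = []
    for axis in range(3):
        if lengths[axis] == 0:  # degenerate axis: all points share one coordinate
            idx.append(0)
            continue
        cell = int((point[axis] - closest[axis]) / lengths[axis] * grids_per_axis)
        # A point on the farthest face is placed in the last grid.
        idx.append(min(cell, grids_per_axis - 1))
    return tuple(idx)


def bin_sequences(points, closest, lengths, grids_per_axis):
    """Group sequence ids by grid; points maps sequence id -> (x, y, z)."""
    grids = defaultdict(list)
    for seq_id, point in points.items():
        grids[grid_index(point, closest, lengths, grids_per_axis)].append(seq_id)
    return dict(grids)
```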
  • Based on the grids thus obtained, the sequence assembly module 114 may analyze the cuboid 124 to assemble the metagenomic sequences into contigs. In one implementation, the sequence assembly module 114 may use a method of progressive traversal to assemble the metagenomic sequences into contigs. Using the method of progressive traversal allows the sequence assembly module 114 to traverse the grids such that in each step of traversal, metagenomic sequences present in a grid under consideration and its neighboring grids, collectively referred to as a cluster of grids, are obtained. Initially, the sequence assembly module 114 identifies a grid, say, grid 'ABC', for analyses and traverses through the cluster of grids, formed by the grid 'ABC' and its immediate neighbors, in all three directions of the three dimensional space, as illustrated in the Fig. 1 (c). In one implementation, the sequence assembly module 114 may traverse through the grid 'ABC' and seven immediate neighbors of the grid 'ABC', i.e., grids (A+1)BC, A(B+1)C, AB(C+1), (A+1)(B+1)C, A(B+1)(C+1), (A+1)B(C+1), (A+1)(B+1)(C+1), as illustrated in Fig. 1(c). Based on the traversal, the sequence assembly module 114 obtains a selective subset of metagenomic sequences, i.e., the metagenomic sequences encompassed by the cluster of grids for assembling into one or more contigs.
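  • The cluster of grids formed by a grid 'ABC' and its seven immediate neighbors corresponds to the index offsets 0 and 1 along each axis. A hedged sketch follows, in which the function names and the dictionary layout (grid index mapped to a list of sequence identifiers) are assumptions:

```python
from itertools import product


def cluster_of_grids(grid, grids_per_axis):
    """Return grid (A, B, C) plus its neighbors at offsets 0/1 on each axis,
    skipping neighbors that would fall outside the cuboid."""
    a, b, c = grid
    cells = []
    for da, db, dc in product((0, 1), repeat=3):
        cell = (a + da, b + db, c + dc)
        if all(i < grids_per_axis for i in cell):
            cells.append(cell)
    return cells


def sequences_in_cluster(grid, grids, grids_per_axis):
    """Collect the sequence ids held by every grid of the cluster."""
    selected = []
    for cell in cluster_of_grids(grid, grids_per_axis):
        selected.extend(grids.get(cell, []))
    return selected
```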
  • In an example implementation, the sequence assembly module 114 may use any known method of sequence assembly, such as CAP3, SSAKE, SHARCGS, VCAKE, Newbler, Celera Assembler, ABySS, AllPaths, Velvet, Euler, and SOAPdenovo, for assembling the selective subset of metagenomic sequences. Further, the sequence assembly module 114 assembles the selective metagenomic sequences into one or more contigs such that the metagenomic sequences originating from the same genome have a higher probability of getting combined to form a single contig. This higher probability results from the above method of grid partitioning, which helps in clustering the metagenomic sequences of similar origin together. The contigs thus obtained include metagenomic sequences probably originating from the same genome, thus providing an efficient assembly of the metagenomic sequences. Additionally, the metagenomic sequences left unassembled by the sequence assembly module 114 during a particular step of traversal, for example, due to the absence of overlapping metagenomic sequences originating from the same genome, may be considered for assembly during traversal of a subsequent grid. For instance, the sequence assembly module 114 may consider the metagenomic sequences unassembled during traversal of a grid '100' for assembly along with selective metagenomic sequences obtained during traversal of the subsequent grid, i.e., a grid '200'. The sequence assembly module 114 may thus traverse through all the grids and obtain a plurality of contigs. The plurality of contigs thus obtained by the sequence assembly module 114 is saved in the assembled data 120. In one implementation, the sequence assembly module 114 first performs the traversal in the X direction, followed by the Y direction, and finally in the direction of the Z axis.
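  • The progressive traversal with carry-over of unassembled sequences may, purely as an illustration, be sketched as below. The assemble argument is a hypothetical callable standing in for an external assembler; it is assumed to take a list of sequence identifiers and to return the contigs together with the identifiers left unassembled. Excluding already assembled sequences from later steps is also an assumption of this sketch.

```python
from itertools import product


def progressive_traversal(grids, grids_per_axis, assemble):
    """Traverse grids with x varying fastest, then y, then z.

    grids: dict mapping (x, y, z) grid index -> list of sequence ids.
    assemble: hypothetical callable, list of ids -> (contigs, unassembled ids).
    """
    contigs, carried, used = [], [], set()
    for z, y, x in product(range(grids_per_axis), repeat=3):
        selected = list(carried)  # sequences left over from the previous step
        for dx, dy, dz in product((0, 1), repeat=3):
            for seq in grids.get((x + dx, y + dy, z + dz), []):
                if seq not in used and seq not in selected:
                    selected.append(seq)
        if not selected:
            continue
        new_contigs, carried = assemble(selected)
        contigs.extend(new_contigs)
        used.update(s for s in selected if s not in carried)
    return contigs, carried


def toy_assemble(ids):
    """Toy stand-in: ids sharing a first letter are merged into one 'contig'."""
    groups = {}
    for s in ids:
        groups.setdefault(s[0], []).append(s)
    merged = [tuple(g) for g in groups.values() if len(g) > 1]
    unassembled = [g[0] for g in groups.values() if len(g) == 1]
    return merged, unassembled
```

  • In this toy usage, a sequence 'b1' that finds no partner in the first cluster of grids is carried forward and merged with 'b2' in the next step, mirroring the grid '100' and grid '200' example above.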
  • Further, the sequence assembly module 114 may combine the contigs, received after traversal of all the grids, into longer contigs or an entire genome. The contigs thus obtained may be saved in the assembled data 120. Further, the metagenomic sequences remaining unassembled after the traversal through the grids may also be saved in the assembled data 120. In addition, the sequence assembly module 114 may generate and store indexes of the unassembled metagenomic sequences and assembled metagenomic sequences along with the contigs in the assembled data 120 for further reference and/or analyses.
  • VALIDATION AND RESULTS
  • For the purpose of validation, three distinct sets of simulated metagenomic data were downloaded from the online repository of simulated metagenomes present in the Fidelity of Analysis of Metagenomic Samples (FAMeS) database and assembled using the system 100 in accordance with the present embodiment. The results for assembly of the three distinct metagenomic data sets, i.e., simHC, simMC, and simLC, were further compared with conventional techniques, such as CAP3. The simHC data sets are defined as data sets for which all constituting genomes are represented equally. The simMC data sets are defined as data sets in which a first half of the genomes have a high representation, whereas the remaining half of the genomes have a low representation. The simLC data sets are defined as data sets in which a few genomes are overrepresented as compared to other genomes.
  • Further, the experiments were performed for two different validations. A first validation was performed for determining the resolving power of the metagenomic sequences assembly system to obtain grids containing taxonomically similar metagenomic sequences, which may facilitate their assembly into contigs. Initially the three distinct sets of metagenomes were provided as inputs to the system 100 and a cuboid, such as the cuboid 124 was defined based on the three distinct sets. The cuboid was further divided into a plurality of grids and analyzed to determine taxonomic affiliations of the metagenomic sequences covered in each grid. Based on the determination, purity of each grid was ascertained at phylum level of taxonomic classification. For this purpose, all grids having at least 70 % of the metagenomic sequences belonging to a single phylum were ascertained as 'phylum-level-pure' grids. Results obtained after splitting the cuboid into grids using the system 100 are depicted in bar plot 200 illustrated in Fig. 2.
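  • The purity criterion may be illustrated with the following sketch, which marks a grid as 'phylum-level-pure' when at least 70% of its sequences share one phylum and reports the percentage of sequences covered in pure grids. The function names and the input layout (grid index mapped to a list of phylum labels, one per sequence) are assumptions.

```python
from collections import Counter


def is_phylum_pure(phyla, threshold=0.70):
    """True if the most common phylum accounts for at least threshold of the grid."""
    if not phyla:
        return False
    _, top_count = Counter(phyla).most_common(1)[0]
    return top_count / len(phyla) >= threshold


def percent_sequences_in_pure_grids(grid_phyla, threshold=0.70):
    """Percentage of all sequences that lie in phylum-level-pure grids."""
    total = sum(len(p) for p in grid_phyla.values())
    pure = sum(len(p) for p in grid_phyla.values() if is_phylum_pure(p, threshold))
    return 100.0 * pure / total if total else 0.0
```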
  • The plot 200 depicts percentage of metagenomic sequences covered in pure grids achieved for each data set using the system 100. In the plot 200, the three data sets used for validation are represented on a horizontal axis 202, while percentage of metagenomic sequences covered in pure grids obtained for the three data sets is represented on a vertical axis 204. In one implementation, purity level of grids obtained for the simLC data set are represented by a bar 206, for the simMC data set are represented by a bar 208, and for the simHC data set are represented by a bar 210. As illustrated in the plot 200, percentage of metagenomic sequences covered in pure grids was more than 60% for all the three data sets. Further, the percentage for the simLC and the simMC datasets was more than 70%. Such a high percentage of metagenomic sequences covered in the pure grids thus illustrates efficiency of the system 100 in pre-partitioning the metagenomic data for assembly.
  • Additionally, a second validation was performed for applicability of the grid assembly approach for assembly of metagenomic sequences. For the purpose, the three data sets were initially processed to obtain the plurality of grids and then assembled into contigs using the CAP3 assembly technique. Results thus obtained were compared with contigs obtained using only the CAP3 assembly technique. Results obtained after assembly of the metagenomic sequences using the system 100 and the conventional techniques were further analyzed based on three parameters, i.e., the average length of contigs, purity of the contigs, and number of metagenomic sequences assigned to the contigs as summarized in Table 1.
    Table 1
                                            simHC                simMC                simLC
    Contig details                System 100     CAP3  System 100     CAP3  System 100     CAP3
    0-3000                              8613     7023       10000     8677        5881     6884
    3000-6000                             17       13         873      694         385       93
    6000-9000                              0        0          52       44         132      170
    9000-12000                             0        0           4        6          76       21
    12000-15000                            0        0           1        0          49        5
    15000-18000                            0        0           0        0          35        3
    18000-21000                            0        0           0        0          20        7
    21000-24000                            0        0           0        0           6        7
    24000-27000                            0        0           0        0           1        2
    27000-30000                            0        0           0        0           3        0
    30000-33000                            0        0           0        0           1        0
    33000-36000                            0        0           0        0           0        0
    36000-39000                            0        0           0        0           1        0
    Total number of contigs             8630     7036       10930     9421        6590     7192
    Average length (bp)                 1336     1347        1782     1732        2088     1846
    Percentage of pure contigs        93.20%   88.57%      98.31%   96.79%      98.37%   95.72%
    Time taken (minutes)            89 (184)       75   145 (240)      128   152 (300)      114
    No. of sequences in contigs        19996    15694       41734    36491       37793      37…
  • As illustrated in Table 1, the percentage of pure contigs obtained using the system 100 was higher than the percentage achieved using the conventional technique for all the three data sets. Further, the contigs obtained using the system 100 comprised a larger number of metagenomic sequences than the contigs obtained using the CAP3 technique, thus indicating high efficiency in assembly of the metagenomic sequences. Additionally, the average length of contigs obtained using the system 100 for the simMC and the simLC data sets (1782 bp and 2088 bp, respectively) was longer than the average length achieved using the conventional technique (1732 bp and 1846 bp, respectively). The system 100 may thus be efficiently used for generating contigs of higher length and purity.
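  • The comparison may be made concrete with a short calculation on the values reported in Table 1. The sketch below computes the relative change in average contig length and the gain in the percentage of pure contigs (in percentage points) of the system 100 over CAP3; the variable and function names are assumptions.

```python
# Values copied from Table 1 as (System 100, CAP3) per data set.
average_length_bp = {"simHC": (1336, 1347), "simMC": (1782, 1732), "simLC": (2088, 1846)}
pure_contigs_pct = {"simHC": (93.20, 88.57), "simMC": (98.31, 96.79), "simLC": (98.37, 95.72)}


def relative_change_pct(system_value, cap3_value):
    """Relative change of the System 100 value with respect to CAP3, in percent."""
    return 100.0 * (system_value - cap3_value) / cap3_value


length_change = {
    name: round(relative_change_pct(s, c), 2) for name, (s, c) in average_length_bp.items()
}
purity_gain = {name: round(s - c, 2) for name, (s, c) in pure_contigs_pct.items()}
```

  • On these figures, the average contig length changes by about -0.8% for simHC, +2.9% for simMC, and +13.1% for simLC, while the percentage of pure contigs rises by about 4.6, 1.5, and 2.7 percentage points, respectively.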
  • Fig. 3 illustrates a method 300 for assembly of metagenomic sequences, in accordance with an implementation of the present subject matter. Fig. 4 illustrates a method 304 for generating a set of reference points for assembly of the metagenomic sequences, according to an embodiment of the present subject matter. The methods 300 and 304 are implemented in a computing device, such as the metagenomic sequences assembly system 100.
  • The methods may be described in the general context of computer executable instructions. Generally, computer executable instructions can include routines, programs, objects, components, data structures, procedures, modules, functions, etc., that perform particular functions or implement particular abstract data types. The methods may also be practiced in a distributed computing environment where functions are performed by remote processing devices that are linked through a communications network.
  • The order in which the methods are described is not intended to be construed as a limitation, and some of the described method blocks can be combined in any order to implement the method. Furthermore, the methods can be implemented in any suitable hardware, software, firmware, or combination thereof.
  • At block 302, metagenomic data having a plurality of metagenomic sequences to be assembled is received, for example, by the system 100. In one implementation, the metagenomic data is obtained from a metagenomic data repository associated with the system 100. The metagenomic data includes a plurality of metagenomic sequences corresponding to genomes of a plurality of organisms residing in the environmental sample for which the metagenomic data is generated. In an implementation, the metagenomic data may be obtained by the grid generation module 112 and stored in the grid data 118.
  • At block 304, each of the plurality of metagenomic sequences is represented in three dimensional space to obtain a plurality of sequence vectors. A sequence vector is obtained for each of the metagenomic sequences by the grid generation module 112 using, for example, a set of reference points such that each sequence is represented as a unique point in three dimensional space. Each sequence vector represents the corresponding metagenomic sequence in three dimensional space, thus facilitating an easy and efficient partitioning of the metagenomic sequences for assembling into contigs. In one implementation, representing each of the plurality of metagenomic sequences includes determining frequencies of possible tetra-nucleotides for each of the plurality of metagenomic sequences, obtaining an intermediate vector corresponding to each of the plurality of metagenomic sequences based on the determined frequencies, and transforming, for each of the plurality of metagenomic sequences, the intermediate vector into a sequence vector based on a set of reference points to obtain the plurality of sequence vectors. Further, the set of reference points used for the transformation may be obtained based on a plurality of reference fragments, as will be described in greater detail with reference to Fig. 4.
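  • A minimal sketch of the transformation is given below, in which each coordinate of the sequence vector is the distance between the intermediate vector and one of the three reference points. The use of the Euclidean distance and the function name are assumptions, since the description only states that a distance is computed.

```python
from math import dist


def sequence_vector(intermediate_vector, reference_points):
    """Map an intermediate vector to (x, y, z) coordinates.

    Each coordinate is the Euclidean distance (assumed metric) to one of the
    three reference points.
    """
    if len(reference_points) != 3:
        raise ValueError("exactly three reference points are expected")
    return tuple(dist(intermediate_vector, ref) for ref in reference_points)
```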
  • At block 306, a cuboid is defined in the three dimensional space based on the plurality of sequence vectors, for example, by the grid generation module 112. In one implementation, defining further includes ascertaining three dimensional coordinates for each of the plurality of metagenomic sequences based on the plurality of sequence vectors. The sequence vectors obtained for the metagenomic sequences are analyzed to determine a farthest coordinate and a closest coordinate for each of the x, y, and z axes of the three dimensional space. Based on the determination, length of the cuboid in each of the x, y, and z axes may be ascertained as a value equal to a difference between the farthest coordinate and the closest coordinate in the corresponding axis. Subsequently the cuboid may be defined in the three dimensional space such that it encompasses all the metagenomic sequences obtained for being assembled.
  • At block 308, the cuboid is divided into a plurality of smaller equally sized cuboids, hereinafter referred to as grids. The grid generation module 112 is configured to divide the cuboid into the plurality of grids such that each grid includes all the metagenomic sequences whose sequence vectors lie in the coordinates covered by the grid under consideration.
  • At block 310, the plurality of grids is progressively traversed to assemble the plurality of metagenomic sequences into one or more contigs. Each contig from among the one or more contigs includes metagenomic sequences originating from the same genome. In one implementation, the plurality of grids may be traversed by a sequence assembly module, such as the sequence assembly module 114. The sequence assembly module 114 is configured to traverse the grids such that in each traversal, metagenomic sequences residing in the grid under consideration and its immediate neighbors are obtained and assembled into one or more contigs. In another implementation, traversing further includes obtaining, for each of the plurality of grids, one or more metagenomic sequences from among the plurality of metagenomic sequences. The one or more metagenomic sequences are located within coordinates defined by the grid and immediate neighbors of the grid in the cuboid. Further, all the metagenomic sequences unassembled during a particular step of traversal are considered for assembly during a next step of traversal, and so on, till all the grids are traversed to obtain the contigs. The contigs may be further assembled into a plurality of longer contigs or complete genomes. The longer contigs or genomes thus obtained include metagenomic sequences probably originating from the same genome. Additionally, the contigs and the sequences remaining unassembled at the end of the traversal of the grids may be stored in the assembled data 120 of the system 100.
  • Referring to Fig. 4, the method 304 generates a set of reference points for representing the metagenomic sequences in the three dimensional space for assembling into contigs, according to an example embodiment of the present subject matter.
  • At block 402, each of the plurality of reference genomes is split into a plurality of reference fragments. In an example implementation, a plurality of reference genomes corresponding to distinct genera is obtained from a reference database, such as a database of all sequenced genomes. Further, each of the reference genomes is fragmented into the plurality of reference fragments, for example, by the grid generation module 112 and stored in the grid data 118.
  • At block 404, a plurality of fragment vectors corresponding to the reference fragments is computed, for example, by the grid generation module 112. In one implementation, each of the reference fragments is analyzed to compute a corresponding fragment vector having frequencies of all possible 256 tetra-nucleotides.
  • At block 406, fragment vectors obtained are clustered to obtain one or more fragment clusters, for example, by the grid generation module 112. In an example implementation, the fragment vectors are clustered into one or more fragment clusters using any known clustering process, such as the K-means approach. For instance, the fragment vectors may be clustered into a total of 631 clusters using the K-means approach.
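  • For illustration only, a plain K-means (Lloyd's algorithm) clustering that also yields the centroid of each cluster may be sketched as follows. The number of clusters k is passed in as a parameter rather than derived from equation 1, and the random initialization, the seed, and the iteration cap are assumptions of this sketch.

```python
import random
from math import dist


def kmeans(vectors, k, iterations=100, seed=0):
    """Cluster vectors into k groups; return (centroids, assignment)."""
    rng = random.Random(seed)
    # Initialize centroids from k distinct input vectors.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assignment = None
    for _ in range(iterations):
        # Assign every vector to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: dist(v, centroids[c])) for v in vectors
        ]
        if new_assignment == assignment:
            break  # converged: assignments no longer change
        assignment = new_assignment
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment
```

  • The returned centroids play the role of the cluster vectors computed at block 408.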
  • At block 408, a cluster vector corresponding to a centroid of each fragment cluster is computed. In one implementation, each of the fragment clusters is analyzed to ascertain a corresponding cluster vector. The computed cluster vectors may be further stored in the grid data 118.
  • At block 410, a set of reference points is obtained based on cluster vectors corresponding to the fragment clusters. The cluster vectors corresponding to the fragment clusters are analyzed, for example, by the grid generation module 112 to ascertain three least correlated cluster vectors as the set of reference points. In one implementation, the least correlated cluster vectors may be identified based on pairwise dot products computed for unit vectors corresponding to the cluster vectors. Further, the three cluster vectors having the least pairwise dot products amongst them may be identified as the set of reference points. The set of reference points may be further used for representing the metagenomic sequences in the three dimensional space.

Claims (5)

  1. A computer-implemented method for assembly of metagenomic sequences comprising:
    representing each of a plurality of metagenomics sequence in three-dimensional space to obtain a plurality of sequence vector, wherein representing comprises:
    determining frequencies of possible tetra-nucleotides for each of the plurality of metagenomics sequence;
    obtaining an intermediate vector corresponding to each of a plurality of metagenomic sequences based on the determined frequencies; and
    transforming, for each of the plurality of metagenomic sequences, the intermediate vector into a sequence vector to obtain a plurality of sequence vectors based on a set of reference points;
    defining, based on the plurality of sequence vectors, a cuboid having a plurality of grids in the three dimensional space, wherein the cuboid encompasses the plurality of metagenomic sequences, wherein defining further comprises:
    ascertaining three dimensional coordinates for each of the plurality of metagenomic sequences based on the plurality of sequence vectors;
    determining, for each axis of the three dimensional space, a farthest coordinate and a closest coordinate from among the three dimensional coordinates; and
    calculating length of the cuboid in the each axis based on the difference between the farthest coordinate and the closest coordinate in the corresponding axis;
    traversing progressively the plurality of grids to assemble the plurality of metagenomics sequence into one or more contigs, wherein a contig includes metagenomics sequence originating from the same genome, wherein the traversing further comprises:
    obtaining, for each of the plurality of grids, one or more metagenomic sequences from among the plurality of metagenomics sequences, wherein the one or more metagenomic sequences are located within coordinates defined by the grid and immediate neighbors of the grid in the cuboid; and
    assembling, for each of the plurality of grids, the corresponding one or more metagenomic sequences into the one or more contigs.
  2. The method as claimed in claim 1, wherein the transforming comprises computing a distance between the intermediate vector and a set of reference points.
  3. The method as claimed in claim 1,
    wherein the method further comprises:
    splitting each of a plurality of reference genomes, containing one representative from each microbial genus, into a plurality of non-overlapping reference fragments;
    computing a fragment vector for each of the plurality of reference fragments;
    clustering fragment vectors to obtain one or more fragment clusters;
    assessing, for each of the fragment clusters, a cluster vector corresponding to a centroid of the fragment cluster; and
    identifying three least correlated cluster vectors from among cluster vectors as the set of reference points.
  4. A metagenomic sequences assembly system (100) comprising:
    a processor (104); and
    a memory (106) coupled to the processor (104), the memory (106) comprising modules configured to perform the method of claims 1-3.
  5. A computer-readable medium having embodied thereon a computer program, which when it is run on a computer, executes the method of claim 1.
EP12169566.2A 2012-02-10 2012-05-25 Assembly of metagenomic sequences Active EP2626802B1 (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
IN388MU2012 2012-02-10

Publications (3)

Publication Number Publication Date
EP2626802A2 EP2626802A2 (en) 2013-08-14
EP2626802A3 EP2626802A3 (en) 2015-02-25
EP2626802B1 EP2626802B1 (en) 2016-11-16

Family

ID=46229206

Family Applications (1)

Application Number Title Priority Date Filing Date
EP12169566.2A Active EP2626802B1 (en) 2012-02-10 2012-05-25 Assembly of metagenomic sequences

Country Status (3)

Country Link
US (1) US9372959B2 (en)
EP (1) EP2626802B1 (en)
CN (1) CN103246829B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2653991B1 (en) * 2012-02-24 2017-07-26 Tata Consultancy Services Limited Prediction of horizontally transferred gene
KR101560274B1 (en) * 2013-05-31 2015-10-14 삼성에스디에스 주식회사 Apparatus and Method for Analyzing Data
US20170308645A1 (en) * 2016-04-25 2017-10-26 Tata Consultancy Services Limited Method and system for representing compositional properties of a biological sequence fragment and applications thereof
CN106055928B (en) * 2016-05-29 2018-09-14 吉林大学 A kind of sorting technique of macro genome contig
WO2018119882A1 (en) * 2016-12-29 2018-07-05 中国科学院深圳先进技术研究院 Method and device for data classification of metagenomes
US10733214B2 (en) 2017-03-20 2020-08-04 International Business Machines Corporation Analyzing metagenomics data
US11023485B2 (en) * 2018-09-18 2021-06-01 International Business Machines Corporation Cube construction for an OLAP system
IL294909A (en) 2020-02-13 2022-09-01 Zymergen Inc Metagenomic library and natural product discovery platform
CN112466404B (en) * 2020-12-14 2024-02-02 浙江师范大学 Metagenome contig unsupervised clustering method and system
CN113611359B (en) * 2021-08-13 2022-08-05 江苏先声医学诊断有限公司 Method for improving strain assembly efficiency of metagenome nanopore sequencing data

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006138823A (en) * 2004-11-15 2006-06-01 Sony Corp Method for standardizing expressed amount in gene, program, and system
FR2891278B1 (en) * 2005-09-23 2008-07-04 Vigilent Technologies Sarl METHOD FOR DETERMINING THE STATUS OF A SET OF CELLS AND SYSTEM FOR CARRYING OUT SAID METHOD
CN101751517B (en) * 2008-12-12 2014-02-26 深圳华大基因科技服务有限公司 Method and system for fast processing genome short sequence mapping
EP2390810B1 (en) * 2010-05-26 2019-10-16 Tata Consultancy Services Limited Taxonomic classification of metagenomic sequences
EP2390811B1 (en) * 2010-05-26 2016-12-28 Tata Consultancy Services Limited Identification of ribosomal DNA sequences
EP2653991B1 (en) * 2012-02-24 2017-07-26 Tata Consultancy Services Limited Prediction of horizontally transferred gene

Also Published As

Publication number Publication date
US9372959B2 (en) 2016-06-21
US20130325428A1 (en) 2013-12-05
EP2626802A3 (en) 2015-02-25
CN103246829B (en) 2017-12-01
CN103246829A (en) 2013-08-14
EP2626802A2 (en) 2013-08-14

Similar Documents

Publication Publication Date Title
EP2626802B1 (en) Assembly of metagenomic sequences
Nikolenko et al. BayesHammer: Bayesian clustering for error correction in single-cell sequencing
Tzeng et al. Multidimensional scaling for large genomic data sets
Li et al. Ultrafast clustering algorithms for metagenomic sequence analysis
Schbath et al. Mapping reads on a genomic sequence: an algorithmic overview and a practical comparative analysis
Kim et al. Using single cell sequencing data to model the evolutionary history of a tumor
EP2390810B1 (en) Taxonomic classification of metagenomic sequences
Ren et al. Inference of Markovian properties of molecular sequences from NGS data and applications to comparative genomics
Comin et al. Clustering of reads with alignment-free measures and quality values
He et al. Informative SNP selection methods based on SNP prediction
KR20220069943A (en) Single-cell RNA-SEQ data processing
Reddy et al. MetaCAA: A clustering-aided methodology for efficient assembly of metagenomic datasets
Lücking et al. PICS-Ord: unlimited coding of ambiguous regions by pairwise identity and cost scores ordination
Uddin et al. A fast and efficient algorithm for DNA sequence similarity identification
Celik et al. Biological cartography: Building and benchmarking representations of life
Bloch et al. Optimization of co-evolution analysis through phylogenetic profiling reveals pathway-specific signals
Prezza et al. Detecting mutations by ebwt
Vasimuddin et al. Identification of significant computational building blocks through comprehensive investigation of NGS secondary analysis methods
Stanberry et al. Visualizing the protein sequence universe
Doğan et al. Automatic identification of highly conserved family regions and relationships in genome wide datasets including remote protein sequences
Tapinos et al. Alignment by numbers: sequence assembly using compressed numerical representations
US9116839B2 (en) Prediction of horizontally transferred gene
Rathod et al. Understanding Data Analysis and Why Should We Do It?
Vasimuddin et al. Identification of significant computational building blocks through comprehensive deep dive of ngs secondary analysis methods
Shi et al. Sparse learning based linear coherent bi-clustering

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

AK Designated contracting states

Kind code of ref document: A2

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

AX Request for extension of the european patent

Extension state: BA ME

PUAL Search report despatched

Free format text: ORIGINAL CODE: 0009013

AK Designated contracting states

Kind code of ref document: A3

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

AX Request for extension of the european patent

Extension state: BA ME

RIC1 Information provided on ipc code assigned before grant

Ipc: G06F 19/24 20110101AFI20150120BHEP

17P Request for examination filed

Effective date: 20150824

RBV Designated contracting states (corrected)

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

17Q First examination report despatched

Effective date: 20151126

GRAP Despatch of communication of intention to grant a patent

Free format text: ORIGINAL CODE: EPIDOSNIGR1

INTG Intention to grant announced

Effective date: 20160602

GRAS Grant fee paid

Free format text: ORIGINAL CODE: EPIDOSNIGR3

GRAA (expected) grant

Free format text: ORIGINAL CODE: 0009210

AK Designated contracting states

Kind code of ref document: B1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

REG Reference to a national code

Ref country code: GB

Ref legal event code: FG4D

REG Reference to a national code

Ref country code: CH

Ref legal event code: EP

REG Reference to a national code

Ref country code: IE

Ref legal event code: FG4D

REG Reference to a national code

Ref country code: AT

Ref legal event code: REF

Ref document number: 846531

Country of ref document: AT

Kind code of ref document: T

Effective date: 20161215

REG Reference to a national code

Ref country code: DE

Ref legal event code: R096

Ref document number: 602012025345

Country of ref document: DE

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: LV

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20161116

REG Reference to a national code

Ref country code: NL

Ref legal event code: MP

Effective date: 20161116

REG Reference to a national code

Ref country code: LT

Ref legal event code: MG4D

REG Reference to a national code

Ref country code: AT

Ref legal event code: MK05

Ref document number: 846531

Country of ref document: AT

Kind code of ref document: T

Effective date: 20161116

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: SE

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20161116

Ref country code: NO

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20170216

Ref country code: LT

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20161116

Ref country code: GR

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20170217

Ref country code: NL

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20161116

REG Reference to a national code

Ref country code: FR

Ref legal event code: PLFP

Year of fee payment: 6

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: FI

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20161116

Ref country code: ES

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20161116

Ref country code: PT

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20170316

Ref country code: PL

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20161116

Ref country code: AT

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20161116

Ref country code: HR

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20161116

Ref country code: RS

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20161116

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: EE

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20161116

Ref country code: CZ

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20161116

Ref country code: DK

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20161116

Ref country code: RO

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20161116

Ref country code: SK

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20161116

REG Reference to a national code

Ref country code: DE

Ref legal event code: R097

Ref document number: 602012025345

Country of ref document: DE

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: SM

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20161116

Ref country code: LU

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20170531

Ref country code: IT

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20161116

Ref country code: BG

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20170216

Ref country code: BE

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20161116

PLBE No opposition filed within time limit

Free format text: ORIGINAL CODE: 0009261

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: NO OPPOSITION FILED WITHIN TIME LIMIT

26N No opposition filed

Effective date: 20170817

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: SI

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20161116

REG Reference to a national code

Ref country code: CH

Ref legal event code: PL

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: MC

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20161116

REG Reference to a national code

Ref country code: IE

Ref legal event code: MM4A

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: CH

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20170531

Ref country code: LI

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20170531

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: LU

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20170525

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: IE

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20170525

REG Reference to a national code

Ref country code: FR

Ref legal event code: PLFP

Year of fee payment: 7

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: MT

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20170525

REG Reference to a national code

Ref country code: DE

Ref legal event code: R079

Ref document number: 602012025345

Country of ref document: DE

Free format text: PREVIOUS MAIN CLASS: G06F0019240000

Ipc: G16B0040000000

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: HU

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT; INVALID AB INITIO

Effective date: 20120525

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: CY

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20161116

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: MK

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20161116

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: TR

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20161116

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: AL

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20161116

Ref country code: IS

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20170316

P01 Opt-out of the competence of the unified patent court (upc) registered

Effective date: 20230526

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: FR

Payment date: 20230421

Year of fee payment: 12

Ref country code: DE

Payment date: 20230425

Year of fee payment: 12

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: GB

Payment date: 20230420

Year of fee payment: 12