WO2016139546A1 - Mapping of short reads in sequencing in platforms - Google Patents


Info

Publication number
WO2016139546A1
Authority
WO
WIPO (PCT)
Prior art keywords
reference genome
mapping
short read
mismatches
alignment
Prior art date
Application number
PCT/IB2016/050840
Other languages
French (fr)
Inventor
Santhi NATARAJAN
Debnath PAL
S.K. Nandy
Original Assignee
Indian Institute Of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Indian Institute Of Science
Publication of WO2016139546A1

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10: Sequence alignment; Homology search
    • G16B50/00: ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/30: Data warehousing; Computing architectures

Landscapes

  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Apparatus Associated With Microorganisms And Enzymes (AREA)

Abstract

The present disclosure relates to a scalable hardware accelerator for mapping and alignment of short reads with a reference genome that incorporates a mapper in the architecture with masks designed for identifying absolute match, single mismatch, two mismatches, and the like, in a three stage datapath. It is possible to map the short reads for any number of mismatches depending on the requirement with corresponding number of stages in the datapath and corresponding masks. Availability of multiple masks does away with need for multiple indexing steps on reference genome, corresponding to mask selection and corresponding storage of voluminous data. In another aspect adequate parallelism is incorporated to speed up the mapping process.

Description

MAPPING OF SHORT READS IN SEQUENCING PLATFORMS
TECHNICAL FIELD
[1] The present disclosure generally relates to the field of bioinformatics and molecular biology. In particular, it pertains to a scalable hardware accelerator to map and align genomic data.
BACKGROUND
[2] Background description includes information that may be useful in understanding the present invention. It is not an admission that any of the information provided herein is prior art or relevant to the presently claimed invention, or that any publication specifically or implicitly referenced is prior art.
[3] Latest technical advances in genomic sequencing have revolutionized many aspects of biology and medicine. These advances have dramatically lowered the cost and exponentially increased the throughput of DNA sequencing. As a result, sequencing technology is now being applied to a rapidly widening array of scientific and medical problems, from basic biology to forensics, ecology, evolutionary studies, agriculture, drug discovery, and the growing field of personalized medicine.
[4] Sequencing machines determine the nucleotide sequence of short DNA fragments, typically a few tens to hundreds of bases, called short reads. With present day sequencing technologies this can be done in a massively parallel manner, yielding much higher throughput than older sequencing technologies - on the order of tens of billions of bases per day from one machine. For comparison, the human genome is approximately 3 billion bases in length.
[5] For most applications, a complete genetic sequence of an organism is not determined de novo. Rather, in most instances, for the organism in question, a "reference" genome sequence has already been determined and is known. Since the short reads are derived from randomly fragmenting genome of one organism for which a reference genome sequence is already known, the first step for data analysis is ordering of all of these fragments to determine the overall gene sequence of the individual sample using the reference genome sequence effectively as a template, i.e., mapping these short read fragments to the reference genome sequence. In this analysis, a determination is made concerning the best location in the reference genome to which each short read maps, and is referred to as the short read mapping problem.
[6] The short read mapping problem is technically challenging, both because of the volume of data and because sample sequences may not be identical to the reference genome sequence but, as expected, will contain a wide variety of individual genetic variations. Due to the sheer volume of data, e.g., a billion short reads from a single sample, the speed or runtime of the data analysis is significant, with the data analysis now becoming the effective bottleneck in genome sequencing. In addition, successful sequencing should exhibit sensitivity to genetic variations in order to map sequences that are not completely identical to the reference, whether because of technical errors in the sequencing or because of genetic differences between the subject and the reference genome.
[7] Biologists and other researchers use sequence alignment as a fundamental comparison method to find common patterns between sequences, predict protein structure, identify important genetic regions, and facilitate drug design. For example, sequence alignment is used to derive flu vaccines by identifying DNA signatures of pathogens. Since biological sequence alignment is now an essential tool used in molecular biology and biomedical applications it is essential that alignment results are available in a timely manner. The growing volume of genomic data and the complexity of sequence alignment present a challenge in obtaining accurate alignment results in a timely manner.
[8] A number of software tools are available in the art that perform short read alignment with the reference genome, for example BWA, Novoalign, Bowtie, SOAP2, BFAST, SSAHA2, Mpscan, GASSST, and Churchill. These software-based approaches have a number of limitations, such as the use of heuristic algorithms for mapping, which reduces accuracy compared to exact algorithms. In addition, they take more time to align millions of short reads, making short read mapping the major task affecting the throughput and performance of the sequencing pipeline.
[9] Very few attempts have been made to develop short read mapping accelerators in hardware. One such model for short read mapping has been developed based on research sponsored by Washington Technology Center (WTC) and Pico Computing. This platform performs mapping of 50 million 76 bp short reads from one end of a paired-end Illumina GA IIx run on human exome data. The hardware is based on a 24-FPGA Pico Computing system. The platform uses the BFAST algorithm for indexing, and the Smith-Waterman and Needleman-Wunsch algorithms for scoring. However, this platform is not scalable, and the time taken for alignment is determined by the problem size. Furthermore, accuracy is compromised by the heuristics involved.
[10] The Smith-Waterman algorithm is a dynamic programming method for determining similarity between a pair of nucleotide or protein sequences. It ensures an optimal alignment between the sequences. The algorithm is used for performing pairwise local alignment of DNA or protein sequences. A pairwise local alignment finds highly related subsequences of two sequences. It identifies subsequences that are preserved during the course of evolution. It is highly useful for dissimilar sequences suspected to contain regions of similarity within their larger sequence context. The alignment need not include the entire length of the two sequences. The method is very sensitive in detecting similarity between two sequences sharing evolutionary origin along the entire length, while part of the sequence is under strong enough selection pressure to preserve valid similarity.
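As an illustration of the dynamic programming recurrence described above, the following is a minimal software sketch of Smith-Waterman local alignment scoring. The function name and the scoring parameters (match = 2, mismatch = -1, linear gap = -1) are assumptions chosen for the example and are not taken from this disclosure; traceback is omitted for brevity.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between sequences a and b.

    Scoring values are illustrative assumptions, not values from the disclosure.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] is the best score of a local alignment ending at a[i-1] and b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = H[i - 1][j] + gap      # gap in sequence b
            left = H[i][j - 1] + gap    # gap in sequence a
            # Clamping at zero lets an alignment restart anywhere (local alignment).
            H[i][j] = max(0, diag, up, left)
            best = max(best, H[i][j])
    return best
```

Because every cell is clamped at zero, a short read embedded in a longer sequence scores the same as the read aligned against itself, which is why the alignment need not span the entire length of both sequences.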
[11] There have been several attempts to accelerate the Smith-Waterman dynamic programming algorithm in hardware. However, these implementations suffer from various shortcomings: the sequence length considered for alignment is limited by the hardware size; the architectures are not inherently scalable; they do not perform traceback with forward scan in overlapped mode; their performance is limited by hardware I/O bandwidth; and they incur severe processing overhead in software when the alignment matrix is recalculated. They also have severe memory bottleneck issues.
[12] There is therefore a need for a solution that overcomes the drawbacks of the known methods and provides a hardware accelerator for accurate mapping and alignment of short reads in high throughput Sequencing Platforms.
[13] All publications herein are incorporated by reference to the same extent as if each individual publication or patent application were specifically and individually indicated to be incorporated by reference. Where a definition or use of a term in an incorporated reference is inconsistent or contrary to the definition of that term provided herein, the definition of that term provided herein applies and the definition of that term in the reference does not apply.
[14] In some embodiments, the numbers expressing quantities of ingredients, properties such as concentration, reaction conditions, and so forth, used to describe and claim certain embodiments of the invention are to be understood as being modified in some instances by the term "about." Accordingly, in some embodiments, the numerical parameters set forth in the written description and attached claims are approximations that can vary depending upon the desired properties sought to be obtained by a particular embodiment. In some embodiments, the numerical parameters should be construed in light of the number of reported significant digits and by applying ordinary rounding techniques. Notwithstanding that the numerical ranges and parameters setting forth the broad scope of some embodiments of the invention are approximations, the numerical values set forth in the specific examples are reported as precisely as practicable. The numerical values presented in some embodiments of the invention may contain certain errors necessarily resulting from the standard deviation found in their respective testing measurements.
[15] As used in the description herein and throughout the claims that follow, the meaning of "a," "an," and "the" includes plural reference unless the context clearly dictates otherwise. Also, as used in the description herein, the meaning of "in" includes "in" and "on" unless the context clearly dictates otherwise.
[16] The recitation of ranges of values herein is merely intended to serve as a shorthand method of referring individually to each separate value falling within the range. Unless otherwise indicated herein, each individual value is incorporated into the specification as if it were individually recited herein. All methods described herein can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. The use of any and all examples, or exemplary language (e.g. "such as") provided with respect to certain embodiments herein is intended merely to better illuminate the invention and does not pose a limitation on the scope of the invention otherwise claimed. No language in the specification should be construed as indicating any non-claimed element essential to the practice of the invention.
[17] Groupings of alternative elements or embodiments of the invention disclosed herein are not to be construed as limitations. Each group member can be referred to and claimed individually or in any combination with other members of the group or other elements found herein. One or more members of a group can be included in, or deleted from, a group for reasons of convenience and/or patentability. When any such inclusion or deletion occurs, the specification is herein deemed to contain the group as modified thus fulfilling the written description of all Markush groups used in the appended claims.
OBJECTS OF THE INVENTION
[18] An object of the present disclosure is to overcome the drawbacks of existing methods of short read mapping and alignment with a reference genome sequence.
[19] Another object of the present disclosure is to provide a hardware accelerator that overcomes the drawbacks of existing methods of short read mapping and alignment with a reference genome sequence.
[20] Another object of the present disclosure is to use a cost function model of the dynamic programming algorithm for short read alignment with a reference genome sequence to achieve accurate results.
[21] Another object of the present disclosure is to provide an accelerator architecture that reduces storage requirement while indexing, mapping and aligning, thus overcoming bottleneck in existing methods.
[22] Another object of the present disclosure is to provide a mapper in the architecture with adequate masks, accuracy and parallelism so that minimal time is spent in mapping.
[23] Yet another object is to provide a mapper in the accelerator architecture that requires the reference genome to be indexed only once.
Yet another object is to provide a mapper in the accelerator architecture that ensures that every short read is checked for a map across the length of the entire reference genome leaving no location and no read unmapped.
SUMMARY
[24] Aspects of the present disclosure relate to mapping of short reads with a reference genome. In an aspect, the disclosure provides a hardware accelerator that can speed up the process of mapping of short reads (also referred to simply as read hereinafter) with the reference genome.
[25] In an aspect, the disclosed hardware architecture does not depend on heuristic algorithms and performs exact mapping, avoiding the errors introduced by heuristics. In another aspect, the proposed architecture uses a cost function model of the dynamic programming algorithm for aligning reads which are mapped to a reference, which ensures an optimal alignment between the sequences.
[26] In another aspect of the present disclosure, a mapper in the proposed architecture can be provided with adequate masks designed to identify an absolute match, a single mismatch, two mismatches, and so on. Availability of such adequate masks does away with the need for multiple indexing steps on the reference genome, corresponding to mask selection. In another aspect, adequate parallelism is incorporated to speed up the mapping process.
[27] In yet another aspect of the present disclosure, the reference genome needs to be indexed only once, and then every short read can be paired with all entries of the index table, which ensures that every short read is checked for a map across the length of the entire genome, leaving no location unmapped.
[28] In another aspect, the disclosed architecture reduces storage requirements for the high throughput sequencing pipeline as there is no need for multiple indexing steps on the reference genome, corresponding to mask selection, and for further processing of index table, in terms of hashing, tree search etc. Thus, storage is required only for the reference genome, short reads, and single index table for reference genome thereby overcoming the bottleneck created by storage and retrieval of data and making the mapping process faster and more efficient.
Various objects, features, aspects and advantages of the inventive subject matter will become more apparent from the following detailed description of preferred embodiments, along with the accompanying drawing figures in which like numerals represent like components.
BRIEF DESCRIPTION OF THE DRAWINGS
[29] The accompanying drawings are included to provide a further understanding of the present disclosure, and are incorporated in and constitute a part of this specification. The drawings illustrate exemplary embodiments of the present disclosure and, together with the description, serve to explain the principles of the present disclosure.
[30] FIG. 1 illustrates an exemplary block diagram of general computational infrastructure for data transport from sequencing platforms for storage and analysis including the hardware accelerator for genomic data mapping and alignment in the data path in accordance with embodiments of the present disclosure.
[31] FIG. 2 illustrates an exemplary block diagram of hardware accelerator architecture with the streaming interface in accordance with embodiments of the present disclosure.
[32] FIG. 3 illustrates an exemplary block diagram indicating multiple datapaths for mapper within the hardware accelerator architecture in accordance with embodiments of the present disclosure.
DETAILED DESCRIPTION
[33] The following is a detailed description of embodiments of the disclosure depicted in the accompanying drawings. The embodiments are in such detail as to clearly communicate the disclosure. However, the amount of detail offered is not intended to limit the anticipated variations of embodiments; on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the present disclosure as defined by the appended claims.
[34] Each of the appended claims defines a separate invention, which for infringement purposes is recognized as including equivalents to the various elements or limitations specified in the claims. Depending on the context, all references below to the "invention" may in some cases refer to certain specific embodiments only. In other cases it will be recognized that references to the "invention" will refer to subject matter recited in one or more, but not necessarily all, of the claims.
[35] Various terms as used herein are shown below. To the extent a term used in a claim is not defined below, it should be given the broadest definition persons in the pertinent art have given that term as reflected in printed publications and issued patents at the time of filing.
[36] Embodiments of the present disclosure relate to mapping of short reads with a reference genome. In an aspect, the present disclosure provides a hardware accelerator that can speed up the process of mapping of short reads with the reference genome.
[37] In another embodiment, the disclosed hardware accelerator acts as a mapping and alignment cluster, hosting many hardware accelerators that can run in parallel. Genomic data can be streamed to these clusters over any streaming protocol/methodology/technology, thereby transporting raw short reads and reference index elements for the chosen application and species.
[38] In another embodiment, the disclosed hardware accelerator cluster is within the streaming network with local host storing the genomic data in the form of reference genome and raw reads. In another embodiment, the reference genome can be indexed by a local host, and reference index elements can be streamed to the hardware accelerator cluster for mapping and alignment.
[39] In an embodiment, the disclosed architecture is inherently scalable, with read length not limited by the reconfigurable hardware size. Time taken by the accelerator can be independent of the problem size with streaming limitations posed only from interconnections, bus architecture etc. The design can include adequate genome indexing, mapping, alignment, and streaming techniques.
[40] In an aspect, the disclosed hardware architecture can be configured such that it does not depend on heuristic algorithms and performs exact mapping, avoiding the errors introduced by heuristics. In another aspect, it uses a cost function model of the dynamic programming algorithm for pairwise sequence alignment. In an embodiment, the dynamic programming algorithm can be the Smith-Waterman algorithm, which determines similarity between a pair of nucleotide or protein sequences and ensures an optimal alignment between them.
[41] In another embodiment of the present disclosure, mappers in the proposed architecture can be provided with adequate masks designed for identifying an absolute match, a single mismatch, two mismatches, and so on, wherein the availability of adequate masks does away with the need for multiple indexing steps on the reference genome corresponding to mask selection. Existing software tools use different methods of indexing the genome, including simple indexing techniques, hashing algorithms, transform algorithms and several other heuristic techniques. In these tools, much of the time is spent in this initial indexing, and complexity increases with the depth of indexing and the length of the short reads and genome.
[42] In another embodiment, one time indexing of the reference genome in the disclosed architecture reduces storage requirements for the sequencing platform pipeline. As there is no need for multiple indexing steps on reference genome, corresponding to mask selection, and for further processing of index table, in terms of hashing, tree search etc. there is no corresponding storage requirement. Existing techniques have tremendous storage requirements of these indices, hash tables and pointers. This is much more than the actual storage requirements for the short reads and reference genome. Thus, one time indexing of the reference genome does away with these storage requirements and associated bottlenecks.
[43] In another embodiment, storage is required only for the reference genome, the short reads and a single index table for the reference genome. In conventional hardware acceleration platforms, storage requirements are met in the form of memory within the accelerator (FPGA/ASIC), as rich DDR memories on board and as secondary storage, which results in much of the processing time being spent in retrieving data from storage and handling memory bottleneck issues. Eliminating or reducing the storage requirement thus overcomes the bottleneck created by storage and retrieval of data and makes the mapping process faster and more efficient.
[44] In another embodiment, the process of mapping can be preceded by one time indexing of the reference genome to generate a Reference Index Table (RIT), after which all short reads can be paired with all entries of the index table to find a suitable hit, ensuring that every short read is checked for a map across the length of the entire genome and leaving no location unmapped. In another embodiment, adequate parallelism can be incorporated to speed up the mapping process.
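The one time indexing and exhaustive pairing described above can be sketched in software as follows. The layout of the RIT is not specified in this disclosure, so the sketch assumes, purely for illustration, that each RIT entry is an (offset, segment) pair holding one read-length window of the reference; the names build_rit, hamming and map_reads are hypothetical.

```python
def build_rit(reference, window):
    """Index the reference once: one (offset, segment) entry per window position.

    The entry layout is an assumption made for this sketch.
    """
    return [(i, reference[i:i + window]) for i in range(len(reference) - window + 1)]


def hamming(a, b):
    """Count positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def map_reads(reads, rit, max_mismatches=2):
    """Pair every read with every RIT entry and keep hits within the mismatch budget.

    Assumes each read has the same length as the RIT window.
    """
    hits = {}
    for read in reads:
        candidates = []
        for offset, segment in rit:
            distance = hamming(read, segment)
            if distance <= max_mismatches:
                candidates.append((offset, distance))
        hits[read] = candidates
    return hits
```

Since every read is compared against every entry, no reference location is skipped; the cost is a number of comparisons equal to the number of reads multiplied by the number of RIT entries, which is the workload the parallel hardware datapaths are intended to absorb.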
[45] In an aspect the disclosed hardware architecture is a high availability solution that exploits re-configurability of the target platform on which it is realized. As would be apparent to a person skilled in art, the disclosed coarse grain reconfigurable architecture that hosts the parallel pipeline of hardware kernels can provide the necessary and sufficient features to make the design fault tolerant.
[46] A system for mapping and aligning a short read with a reference genome according to the present disclosure can include a number of functional modules, such as a streaming module that can be configured to stream the short reads and reference genome data through the system hardware. The system can also include a mapping module configured to map the short reads with the reference genome based on one time indexing of the reference genome to generate a reference index table. In an embodiment, the mapping module can incorporate a plurality of masks to map the short reads with the reference index table in a single pass. These masks can correspond to an absolute match, a single mismatch, two mismatches, three mismatches and so on. During mapping, each short read is checked for mapping across the entire length of said reference genome.
[47] The system can further include an alignment module to align a mapped short read with the reference genome. In an embodiment the system can comprise multiple data paths in parallel such that each of parallel data paths comprises a mapping module followed by an alignment module.
[48] A method for mapping and aligning a short read with a reference genome in accordance with the present disclosure can comprise the steps of streaming the short read and the reference genome through system hardware. The reference genome can be preprocessed for one time indexing and generation of a reference index table. At the next step, the short read can be mapped against the reference genome based on the reference index table. In an embodiment, a plurality of masks are configured to map the short read with the reference index table in a single pass. The plurality of masks can correspond to an absolute match, a single mismatch, two mismatches, three mismatches and so on. Thereafter, the mapped short read can be aligned with the reference genome. During the process, the genomic data can be streamed through multiple data paths configured in parallel, each of which can comprise a mapper followed by an aligner.
[49] In an alternate embodiment, the hardware accelerator can function with independent, decoupled mapper and aligner models as well, where the mapper alone is present in the pipeline initially, which can perform mapping and filtering of read-RIT pairs, which are probable candidates for alignment. Thereafter, the filtered read-RIT pairs can be streamed to the accelerator again, which has only aligner present in the datapath, which can perform alignment on the incoming pairs, and produce the scores and associated results for each pair for streaming as output data.
[50] FIG. 1 illustrates an exemplary block diagram 100 of general computational infrastructure for data transport from sequencing platforms for storage and analysis including the hardware accelerator for genomic data alignment in the data path in accordance with embodiments of the present disclosure. Sequencing research/clinical facilities 102-1, 102-2, 102-3 etc. (collectively referred to as 102 hereinafter) typically produce terabytes of data, creating a huge demand for data storage coupled with server facility for secured access of data for analysis. These storage facilities such as 104 can be centralized or local storages, which demands physical transfer 106 of raw data from sequencing platform to storages 104. Once stored these data can be again physically transferred to a network, cloud, or custom made facility for genomic data alignment and visualization and preparation of final report for application.
[51] In an embodiment, such a facility for genomic data alignment and visualization and preparation of final report for application can incorporate a host 108 and a number of hardware accelerators such as 112-1, 112-2, 112-3 etc. (collectively referred to as 112 hereinafter). Hardware accelerators 112 can act as alignment clusters running in parallel with genomic data that is streamed 110 to them over any streaming protocol/methodology.
[52] In an embodiment, the raw reads and reference genome can be downloaded and stored in the host 108, wherein the reference genome can be indexed by the host 108 to generate RIT elements. These reads and RIT elements can then be paired, converted to binary, and then into hex to embed them as payload within a frame to be streamed to the hardware accelerators 112.
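The conversion of a read-RIT pair to binary and then to hex can be illustrated with the short sketch below. The disclosure does not state the bit encoding or the frame layout, so the 2-bit code (A=00, C=01, G=10, T=11), the ordering of read before RIT element, and the names to_bits and pair_to_hex are all assumptions made for the example; the frame header itself is not modeled.

```python
# Assumed 2-bit code per base; the disclosure does not specify the encoding.
ENCODING = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def to_bits(seq):
    """Pack a nucleotide string into an integer, two bits per base, first base highest."""
    value = 0
    for base in seq:
        value = (value << 2) | ENCODING[base]
    return value


def pair_to_hex(read, rit_element):
    """Concatenate read and RIT element bits and render them as a hex payload string."""
    total_bits = 2 * (len(read) + len(rit_element))
    value = (to_bits(read) << (2 * len(rit_element))) | to_bits(rit_element)
    width = (total_bits + 3) // 4  # hex digits needed, keeping leading zeros
    return format(value, "0{}x".format(width))
```

For example, under this assumed encoding the read ACGT packs to the byte 0x1b, so pairing it with an identical RIT element yields the payload string 1b1b.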
[53] In another embodiment, the reference genome can be indexed only once and does not require multiple indexing steps corresponding to mask selection or further processing of the index table in terms of hashing, tree search etc. As there is no need for multiple indexing steps on the reference genome, corresponding to mask selection, or for further processing of the index table in terms of hashing, tree search etc., there is no corresponding storage requirement. Existing techniques have tremendous storage requirements for these indices, hash tables and pointers. This is much more than the actual storage requirements for the short reads and reference genome. Thus, one time indexing of the reference genome does away with these storage requirements and attendant bottlenecks.
[54] In another embodiment, storage is required only for the reference genome, the short reads, and a single index table for the reference genome. In existing hardware acceleration platforms, storage requirements are met in the form of memory within the accelerator (FPGA/ASIC), as rich DDR memories on board and as secondary storage, which results in much of the processing time being spent in retrieving data from storage and handling memory bottleneck issues. Eliminating or reducing the storage requirement thus overcomes the bottleneck created by storage and retrieval of data and makes the mapping process faster and more efficient.
[55] In an embodiment, the genomic alignment cluster provides alignment data at accelerated speeds to a visualization and analysis engine 114 to derive the reports from alignment.
[56] FIG. 2 illustrates an exemplary block diagram 200 of hardware accelerator architecture with the streaming interface in accordance with embodiments of the present disclosure. Hardware accelerator 112 can incorporate stream receive block 202 that can receive the stream duly embedded with genomic data from the host 108, and can preprocess them to extract short reads and RITs before transferring the extracted data to short read to reference mapper 204 (referred to as mapper 204 hereinafter).
[57] In an embodiment, mapper 204 can map an incoming short read against the reference genome by looking for a hit in the RIT. A single short read can be paired against each of the elements within the RIT to identify a map. The choice of masks, read length and seed length decides the complexity of the mapper 204. Mapping can be performed for different conditions such as absolute match of the short read and RIT element, single mismatch, two mismatches, three mismatches and so on. There can be a mask corresponding to each of these conditions.
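The masked comparison described above can be sketched in software as follows. This is a minimal illustration under stated assumptions, not the disclosed circuit: bases are packed two bits each, a read and a RIT element are compared by XOR, and a mask zeroes out the positions that a given mismatch condition is allowed to ignore. The encoding and the names `pack`, `mask_ignoring` and `matches_under_mask` are hypothetical.

```python
ENCODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}  # assumed 2-bit encoding


def pack(seq: str) -> int:
    """Pack a base string into an integer, two bits per base, first base highest."""
    word = 0
    for base in seq:
        word = (word << 2) | ENCODE[base]
    return word


def mask_ignoring(n: int, ignored=()) -> int:
    """Build a mask with 11 at every compared position and 00 at ignored ones."""
    mask = (1 << (2 * n)) - 1
    for i in ignored:
        mask &= ~(0b11 << (2 * (n - 1 - i)))
    return mask


def matches_under_mask(read: str, element: str, ignored=()) -> bool:
    """True if read and element agree at every position the mask keeps."""
    if len(read) != len(element):
        raise ValueError("read and RIT element must have equal length")
    diff = pack(read) ^ pack(element)      # non-zero bit pairs mark mismatches
    return (diff & mask_ignoring(len(read), ignored)) == 0


print(matches_under_mask("ACGT", "ACGT"))                  # absolute match
print(matches_under_mask("ACGT", "ACCT"))                  # differs at position 2
print(matches_under_mask("ACGT", "ACCT", ignored=(2,)))    # single-mismatch mask
```

An empty `ignored` tuple corresponds to the absolute-match condition; a mask ignoring one or two positions corresponds to the single- and two-mismatch conditions.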
[58] In an embodiment, for two-mismatch mapping, the mapper 204 can be a three-stage datapath, as mapping for up to two mismatches is expected to cover over 99% of the reference genome. The first stage performs an absolute match, the second stage performs parallel mapping for all possible single mismatches, and the third stage performs parallel mapping for all possible two mismatches. Thus, mapping results are available three clock cycles after the short read-RIT pairs are scheduled into the mapper 204. Further, since the mapper 204 evaluates multiple masks in parallel for every mapping criterion (absolute match, single mismatch, two mismatches, etc.), mapping can be achieved in minimal time.
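The staged structure can be illustrated with a short sequential Python sketch. It assumes that stage k tries every mask that ignores exactly k positions, so a read of length L has C(L, k) masks at stage k; hardware would evaluate the masks of a stage in parallel, whereas this sketch simply loops over them. The names `map_stage` and `matches_ignoring`, and the read length of 36 used for the count, are hypothetical.

```python
from itertools import combinations
from math import comb


def matches_ignoring(read: str, element: str, ignored: set) -> bool:
    """True if read and element agree everywhere except the ignored positions."""
    return all(r == e for i, (r, e) in enumerate(zip(read, element)) if i not in ignored)


def map_stage(read: str, element: str, max_mismatches: int = 2):
    """Return the first stage (0 = absolute match, 1, 2, ...) that reports a hit.

    Returns None when no mask up to max_mismatches yields a match.
    """
    for k in range(max_mismatches + 1):
        masks = combinations(range(len(read)), k)   # all masks of stage k
        if any(matches_ignoring(read, element, set(m)) for m in masks):
            return k
    return None


print(map_stage("ACGT", "ACGT"))   # hit in the absolute-match stage
print(map_stage("ACGT", "ACCT"))   # hit in the single-mismatch stage
print(map_stage("ACGT", "TCCT"))   # hit in the two-mismatch stage
print(map_stage("ACGT", "TGCA"))   # no hit within two mismatches

# Number of masks per stage for an assumed read length of 36.
print([comb(36, k) for k in range(3)])
```

For the assumed 36-base read, the three stages carry 1, 36 and 630 masks respectively, which gives a sense of how the choice of read length and mismatch budget drives mapper complexity.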
[59] It should be appreciated that mapping up to two mismatches is only a preferred embodiment. It is possible to map the short reads for any number of mismatches, depending on the requirement, with a corresponding number of stages in the datapath and corresponding masks. Since these stages work in parallel in the datapath, the time required for completing the mapping would not change appreciably even with a larger number of stages.
[60] In an aspect, hardware accelerator 112 can also incorporate a short read aligner 206 (also referred to as aligner) that can receive the short read and RIT element pair shortlisted by the mapper 204, and work on them to come up with an alignment score. The aligner 206 can have multiple parallel kernels. The hardware accelerator 112 can further incorporate a stream TX block 208 that can collect the results of alignment from parallel datapaths of aligner 206 and prepare them to stream back to the host 108.
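To make the notion of an alignment score concrete, the sketch below computes a global dynamic-programming score for one read and RIT element pair. The scoring scheme (match +1, mismatch -1, linear gap -2) and the choice of global alignment are assumptions made for illustration; the text here does not specify the aligner's cost function, and `alignment_score` is a hypothetical name.

```python
def alignment_score(read: str, element: str,
                    match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Global alignment score by dynamic programming, keeping one row at a time.

    Assumed scoring: +1 per match, -1 per mismatch, -2 per gap position.
    """
    prev = [j * gap for j in range(len(element) + 1)]   # row for the empty read prefix
    for i in range(1, len(read) + 1):
        curr = [i * gap]                                # column for the empty element prefix
        for j in range(1, len(element) + 1):
            same = read[i - 1] == element[j - 1]
            diag = prev[j - 1] + (match if same else mismatch)
            up = prev[j] + gap                          # gap in the element
            left = curr[j - 1] + gap                    # gap in the read
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


print(alignment_score("ACGT", "ACGT"))
print(alignment_score("ACGT", "ACCT"))
print(alignment_score("ACGT", "ACG"))
```

Each pair is scored independently of every other pair, which is what allows several aligner kernels to run side by side on different shortlisted pairs.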
[61] In an embodiment, there can be multiple datapaths provisioned within the hardware accelerator architecture. FIG. 3 illustrates an exemplary block diagram 300 indicating multiple datapaths for the mapper within the hardware accelerator architecture in accordance with embodiments of the present disclosure. As illustrated, there can be multiple mappers, such as 204-1 to 204-N, in the hardware accelerator 112, each linked to a corresponding aligner 206, such as 206-1 to 206-N, making the architecture of the proposed hardware accelerator 112 scalable and configurable to handle any genome mapping problem irrespective of its size. FIG. 3 also illustrates a three-stage datapath for a two-mismatch mapping condition, as an example, wherein the first stage 302-1 to 302-N performs an absolute match, followed by a second stage 304-1 to 304-N that performs parallel mapping for all possible single mismatches, and then by a third stage 306-1 to 306-N that performs parallel mapping for all possible two mismatches.
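The arrangement of N mapper and aligner pairs can be mimicked in software by splitting the RIT across worker lanes. The sketch below is only a structural illustration: it uses Python threads, which do not provide true hardware parallelism, a Hamming-distance filter as the mapper, and the count of matching bases as a stand-in for the aligner's score. The names `lane` and `map_parallel` and the interleaved partitioning are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def lane(read: str, rit_slice, max_mismatches: int = 2):
    """One datapath: mapper filter, then a stand-in aligner (score = matching bases)."""
    hits = []
    for pos, element in rit_slice:
        d = hamming(read, element)
        if d <= max_mismatches:                 # mapper stage shortlists the pair
            hits.append((pos, len(read) - d))   # placeholder for the aligner score
    return hits


def map_parallel(read: str, reference: str, n_lanes: int = 2):
    """Split the RIT across n_lanes datapaths and merge their results."""
    n = len(read)
    rit = [(p, reference[p:p + n]) for p in range(len(reference) - n + 1)]
    slices = [rit[i::n_lanes] for i in range(n_lanes)]   # interleaved partition
    with ThreadPoolExecutor(max_workers=n_lanes) as pool:
        parts = list(pool.map(lambda s: lane(read, s), slices))
    return sorted(hit for part in parts for hit in part)


print(map_parallel("ACGT", "ACGTACGTTA", n_lanes=2))
```

Because every lane sees a disjoint slice of the RIT, adding lanes changes only how the work is divided, not the merged result.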
[62] In an alternate embodiment, the hardware accelerator can function with independent, decoupled mapper and aligner models as well, where the mapper alone is present in the pipeline initially, which can perform mapping and filtering of read-RIT pairs, which are probable candidates for alignment. Thereafter, the filtered read-RIT pairs can be streamed to the accelerator again, which has only aligner present in the datapath, which can perform alignment on the incoming pairs, and produce the scores and associated results for each pair for streaming as output data.
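The decoupled arrangement amounts to two passes over the data, which can be sketched with two Python generators. In this illustration the first pass keeps only read-RIT pairs within an assumed mismatch budget, and the second pass scores the surviving pairs with whatever scoring function is supplied. The names `mapper_pass`, `aligner_pass` and `count_matches` are hypothetical, and the simple match count stands in for a real alignment score.

```python
def mapper_pass(reads, rit, max_mismatches: int = 2):
    """Pass 1: mapper only. Yield the read-RIT pairs that are alignment candidates."""
    for read in reads:
        for pos, element in rit:
            mismatches = sum(a != b for a, b in zip(read, element))
            if mismatches <= max_mismatches:
                yield read, pos, element


def aligner_pass(pairs, score):
    """Pass 2: aligner only. Score each filtered pair with the given function."""
    for read, pos, element in pairs:
        yield read, pos, score(read, element)


def count_matches(read: str, element: str) -> int:
    """Placeholder score: number of matching positions."""
    return sum(a == b for a, b in zip(read, element))


rit = [(0, "ACGT"), (1, "CGTA"), (2, "GTAC")]              # toy RIT, for illustration
candidates = list(mapper_pass(["ACGT", "GTAA"], rit))      # output of the first pass
results = list(aligner_pass(candidates, count_matches))    # second pass over that output
print(results)
```

Materialising `candidates` between the two passes mirrors the filtered pairs being streamed back to the accelerator for the aligner-only configuration.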
While the foregoing describes various embodiments of the invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof. The scope of the invention is determined by the claims that follow. The invention is not limited to the described embodiments, versions or examples, which are included to enable a person having ordinary skill in the art to make and use the invention when combined with information and knowledge available to the person having ordinary skill in the art.
ADVANTAGES OF THE INVENTION
[63] The present disclosure overcomes the drawbacks of existing methods of short read mapping and alignment with a reference genome sequence.
[64] The present disclosure provides a hardware accelerator that overcomes the drawbacks of existing methods of short read mapping and alignment with a reference genome sequence.
[65] The present disclosure uses a cost function model of the dynamic programming algorithm for short read alignment with a reference genome sequence and thus provides accurate results.
[66] The present disclosure provides an accelerator architecture that reduces the storage requirement during indexing, mapping and alignment, thus overcoming a bottleneck in existing methods.
[67] The present disclosure provides a mapper in the architecture with adequate masks, accuracy and parallelism so that minimal time is spent in mapping.
[68] The present disclosure provides a mapper in the accelerator architecture that requires the reference genome to be indexed only once.
[69] The present disclosure provides a mapper in the accelerator architecture that ensures that every short read is checked for a map across the length of the entire reference genome leaving no location and no read unmapped.

Claims

1. A system for mapping and aligning a short read with a reference genome comprising: a streaming module configured to stream said short read and said reference genome through a system hardware;
a mapping module configured to map said short read with said reference genome based on one time indexing of said reference genome to generate a reference index table, wherein said mapping module incorporates a plurality of masks to map said short read with said reference index table in a single pass; and
an alignment module to align a mapped short read with said reference genome.
2. The system of claim 1, wherein said plurality of masks comprise one or a combination of absolute match, single mismatch, two mismatches, three mismatches and the like.
3. The system of claim 1, wherein said short read is checked for mapping across entire length of said reference genome.
4. The system of claim 1, wherein said system comprises multiple data paths in parallel, and wherein each of said parallel data paths comprises a mapping module followed by an alignment module.
5. A method for mapping and aligning a short read with a reference genome comprising:
streaming said short read and said reference genome through a system hardware;
mapping said short read with said reference genome based on one time indexing of said reference genome to generate a reference index table, wherein a plurality of masks are configured to map said short read with said reference index table in a single pass; and
aligning a mapped short read with said reference genome.
6. The method of claim 5, wherein said plurality of masks comprise one or a combination of absolute match, single mismatch, two mismatches, three mismatches and the like.
7. The method of claim 5, wherein said short read is checked for mapping across entire length of said reference genome.
8. The method of claim 5, wherein said system comprises multiple data paths in parallel, and wherein each of said parallel data paths comprises a mapping module followed by an alignment module.
9. A hardware accelerator for mapping and aligning a short read with a reference genome comprising:
a streaming interface configured to enable streaming of said short read and said reference genome;
a mapper configured to map said short read with said reference genome based on one time indexing of said reference genome to generate a reference index table, wherein said mapper incorporates a plurality of masks to map said short read with said reference index table in a single pass; and
an aligner to align a mapped short read with said reference genome.
10. The hardware accelerator of claim 9, wherein said plurality of masks comprise one or a combination of absolute match, single mismatch, two mismatches, three mismatches and the like.
PCT/IB2016/050840 2015-03-05 2016-02-17 Mapping of short reads in sequencing in platforms WO2016139546A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
IN1089/CHE/2015 2015-03-05
IN1089CH2015 2015-03-05

Publications (1)

Publication Number Publication Date
WO2016139546A1 (en) 2016-09-09

Family

ID=56849254

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2016/050840 WO2016139546A1 (en) 2015-03-05 2016-02-17 Mapping of short reads in sequencing in platforms

Country Status (1)

Country Link
WO (1) WO2016139546A1 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140297196A1 (en) * 2013-03-15 2014-10-02 Pico Computing, Inc. Hardware Acceleration of Short Read Mapping for Genomic and Other Types of Analyses

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16758516

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16758516

Country of ref document: EP

Kind code of ref document: A1