NZ789138A - Bioinformatics systems, apparatus, and methods for performing secondary and/or tertiary processing - Google Patents

Bioinformatics systems, apparatus, and methods for performing secondary and/or tertiary processing

Info

Publication number
NZ789138A
Authority
NZ
New Zealand
Prior art keywords
instances
server
computing
analysis platform
accordance
Application number
NZ789138A
Inventor
Mark Hahm
Rami Mehio
Eric Ojard
Amnon Ptashek
Michael Ruehle
Gavin Stone
Rooyen Pieter Van
Original Assignee
Illumina Inc
Application filed by Illumina Inc

Abstract

A system, method, and apparatus for executing a bioinformatics analysis on genetic sequence data are provided. Particularly, a genomics analysis platform for executing a sequence analysis pipeline is provided. The genomics analysis platform includes one or more of a first integrated circuit, where each first integrated circuit forms a central processing unit (CPU) that is responsive to one or more software algorithms that are configured to instruct the CPU to perform a first set of genomic processing steps of the sequence analysis pipeline. Additionally, a second integrated circuit is also provided, where each second integrated circuit forms a field programmable gate array (FPGA), the FPGA being configured by firmware to arrange a set of hardwired digital logic circuits that are interconnected by a plurality of physical interconnects to perform a second set of genomic processing steps of the sequence analysis pipeline, the set of hardwired digital logic circuits of each FPGA being arranged as a set of processing engines to perform the second set of genomic processing steps. A shared memory is also provided.

Description

A system, method, and apparatus for executing a bioinformatics analysis on genetic sequence data are provided. Particularly, a genomics analysis platform for executing a sequence analysis pipeline is provided. The genomics analysis platform includes one or more of a first integrated circuit, where each first integrated circuit forms a central processing unit (CPU) that is responsive to one or more software algorithms that are configured to instruct the CPU to perform a first set of genomic processing steps of the sequence analysis pipeline. Additionally, a second integrated circuit is also provided, where each second integrated circuit forms a field programmable gate array (FPGA), the FPGA being configured by firmware to arrange a set of hardwired digital logic circuits that are interconnected by a plurality of physical interconnects to perform a second set of genomic processing steps of the sequence analysis pipeline, the set of hardwired digital logic circuits of each FPGA being arranged as a set of processing engines to perform the second set of genomic processing steps. A shared memory is also provided.
NZ 789138

BIOINFORMATICS SYSTEMS, APPARATUSES, AND METHODS FOR PERFORMING SECONDARY AND/OR TERTIARY PROCESSING

Cross-Reference to Related Application

The current application claims priority to U.S. Application No. 62/347,080, filed June 7, 2016, U.S. Application No. 62/399,582, filed September 26, 2016, U.S. Application No. 62/414,637, filed October 28, 2016, U.S. Application No. 15/404,146, filed January 11, 2017, U.S. Application No. 62/462,869, filed February 23, 2017, U.S. Application No. 62/469,442, filed March 9, 2017, and U.S. Application No. 15/497,149, filed April 25, 2017, the disclosure of each of which is incorporated herein by reference in its entirety.
Field of the Disclosure

The subject matter described herein relates to bioinformatics, and more particularly to systems, apparatuses, and methods for implementing bioinformatic protocols, such as performing one or more functions for analyzing genomic data on an integrated circuit, such as on a hardware processing platform.
Background to the Disclosure

As described in detail herein, some major computational challenges for high-throughput DNA sequencing analysis are the explosive growth in available genomic data, the need for increased accuracy and sensitivity when gathering that data, and the need for fast, efficient, and accurate computational tools when performing analysis on a wide range of sequencing data sets derived from such genomic data.
Keeping pace with the increased sequencing throughput generated by Next Gen Sequencers has typically been manifested as multithreaded software tools that are executed on ever greater numbers of faster processors in computer clusters with expensive high availability storage that requires substantial power and significant IT support costs.
Importantly, future increases in sequencing throughput rates will translate into accelerating real dollar costs for these secondary processing solutions.
The devices, systems, and methods of their use described herein are provided, at least in part, so as to address these and other such challenges.
Summary of the Disclosure

The present disclosure is directed to devices, systems, and methods for employing the same in the performance of one or more genomics and/or bioinformatics protocols on data generated through a primary processing procedure, such as on genetic sequence data. For instance, in various aspects, the devices, systems, and methods herein provided are configured for performing secondary and/or tertiary analysis protocols on genetic data, such as data generated by the sequencing of RNA and/or DNA, e.g., by a Next Gen Sequencer ("NGS"). In particular embodiments, one or more secondary processing pipelines for processing genetic sequence data are provided. In other embodiments, one or more tertiary processing pipelines for processing genetic sequence data are provided, such as where the pipelines, and/or individual elements thereof, deliver superior sensitivity and improved accuracy on a wider range of sequence derived data than is currently available in the art.
For example, provided herein is a system, such as for executing one or more of a sequence and/or genomic analysis pipeline on genetic sequence data and/or other data derived therefrom. In various embodiments, the system may include one or more of an electronic data source that provides digital signals representing a plurality of reads of genetic and/or genomic data, such as where each of the plurality of reads of genomic data includes a sequence of nucleotides. The system may further include a memory, e.g., a DRAM, or a cache, such as for storing one or more of the sequenced reads, one or a plurality of genetic reference sequences, and one or more indices of the one or more genetic reference sequences.
The system may additionally include one or more integrated circuits, such as an FPGA, ASIC, or sASIC, and/or a CPU and/or a GPU, which integrated circuit, e.g., with respect to the FPGA, ASIC, or sASIC, may be formed of a set of hardwired digital logic circuits that are interconnected by a plurality of physical electrical interconnects. The system may additionally include a quantum computing processing unit, for use in implementing one or more of the methods disclosed herein.
In various embodiments, one or more of the plurality of electrical interconnects may include an input to the one or more integrated circuits that may be connected or connectable with the electronic data source, e.g., directly, via a suitable wired connection, or indirectly, such as via a wireless network connection (for instance, a cloud or hybrid cloud). Regardless of a connection with the sequencer, an integrated circuit of the disclosure may be configured for receiving the plurality of reads of genomic data, e.g., directly from the sequencer or from an associated memory. The reads may be digitally encoded in a standard FASTQ or BCL file format. Accordingly, the system may include an integrated circuit having one or more electrical interconnects that may be a physical interconnect that includes a memory interface so as to allow the integrated circuit to access the memory.
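As an illustration of the FASTQ encoding mentioned above, the following is a minimal software sketch of reading such a file. It assumes the common four-line record layout and Phred+33 quality encoding, neither of which is specified in this disclosure; the names `Read` and `parse_fastq` are hypothetical.

```python
import io
from typing import Iterator, List, NamedTuple


class Read(NamedTuple):
    name: str
    bases: str        # sequence of nucleotides
    quals: List[int]  # per-base Phred quality scores


def parse_fastq(handle) -> Iterator[Read]:
    """Yield reads from a FASTQ stream (assumes 4-line records, Phred+33)."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of stream
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        bases = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        if len(bases) != len(qual):
            raise ValueError("sequence/quality length mismatch: " + header)
        yield Read(header[1:], bases, [ord(c) - 33 for c in qual])


# Hypothetical one-record input used only for demonstration.
example = io.StringIO("@read1\nACGT\n+\nIIII\n")
reads = list(parse_fastq(example))
```

Each yielded record carries the nucleotide sequence that the downstream mapping and alignment stages would consume.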
Particularly, the hardwired digital logic circuits of the integrated circuit may be arranged as a set of processing engines, such as where each processing engine may be formed of a subset of the hardwired digital logic circuits so as to perform one or more steps in the sequence, genomic, and/or tertiary analysis pipeline, as described herein below, on the plurality of reads of genomic data as well as on other data derived therefrom. For instance, each subset of the hardwired digital logic circuits may be in a wired configuration to perform the one or more steps in the analysis pipeline. Additionally, where the integrated circuit is an FPGA, such steps in the sequence and/or further analysis process may include the partial reconfiguration of the FPGA during the analysis process.
Particularly, the set of processing engines may include a mapping module, e.g., in a wired configuration, to access, according to at least some of the sequence of nucleotides in a read of the plurality of reads, the index of the one or more genetic reference sequences, from the memory via the memory interface, so as to map the read to one or more segments of the one or more genetic reference sequences based on the index. Additionally, the set of processing engines may include an alignment module in the wired configuration to access the one or more genetic reference sequences from the memory via the memory interface to align the read, e.g., the mapped read, to one or more positions in the one or more segments of the one or more genetic reference sequences, e.g., as received from the mapping module and/or stored in the memory.
Further, the set of processing engines may include a sorting module so as to sort each aligned read according to the one or more positions in the one or more genetic reference sequences. Furthermore, the set of processing engines may include a variant call module, such as for processing the mapped, aligned, and/or sorted reads, such as with respect to a reference genome, to thereby produce an HMM readout and/or variant call file for use with and/or detailing the variations between the sequenced genetic data and the genomic reference data. In various instances, one or more of the plurality of physical electrical interconnects may include an output from the integrated circuit for communicating result data from the mapping module and/or the alignment and/or sorting and/or variant call modules.
Particularly, with respect to the mapping module, in various embodiments, a system for executing a mapping analysis pipeline on a plurality of reads of genetic data using an index of genetic reference data is provided. In various instances, the genetic sequence, e.g., read, and/or the genetic reference data may be represented by a sequence of nucleotides, which may be stored in a memory of the system. The mapping module may be included within the integrated circuit and may be formed of a set of pre-configured and/or hardwired digital logic circuits that are interconnected by a plurality of physical electrical interconnects, which physical electrical interconnects may include a memory interface for allowing the integrated circuit to access the memory. In more particular embodiments, the hardwired digital logic circuits may be arranged as a set of processing engines, such as where each processing engine is formed of a subset of the hardwired digital logic circuits to perform one or more steps in the sequence analysis pipeline on the plurality of reads of genomic data.
For instance, in one embodiment, the set of processing engines may include a mapping module in a hardwired configuration, where the mapping module, and/or one or more processing engines thereof, is configured for receiving a read of genomic data, such as via one or more of a plurality of physical electrical interconnects, and for extracting a portion of the read in such a manner as to generate a seed therefrom. In such an instance, the read may be represented by a sequence of nucleotides, and the seed may represent a subset of the sequence of nucleotides represented by the read. The mapping module may include or be connectable to a memory that includes one or more of the reads, one or more of the seeds of the reads, at least a portion of one or more of the reference genomes, and/or one or more indexes, such as an index built from the one or more reference genomes. In certain instances, a processing engine of the mapping module employs the seed and the index to calculate an address within the index based on the seed.
Once an address has been calculated or otherwise derived and/or stored, such as in an onboard or offboard memory, the address may be accessed in the index in the memory so as to receive a record from the address, such as a record representing position information in the genetic reference sequence. This position information may then be used to determine one or more matching positions from the read to the genetic reference sequence based on the record. Then at least one of the matching positions may be output to the memory via the memory interface.
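The seed-and-index lookup described above can be pictured in software as follows. This is only a sketch under stated assumptions: a Python dictionary stands in for the hash-addressed index, the seed length `SEED_LEN` is an arbitrarily small illustrative value, and `build_index` and `map_read` are hypothetical names rather than elements of the disclosed hardware.

```python
from collections import defaultdict

SEED_LEN = 4  # illustrative only; practical seeds are considerably longer


def build_index(reference, k=SEED_LEN):
    """Record every reference position at which each k-base seed occurs."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read(read, index, k=SEED_LEN):
    """Return candidate reference start positions, best supported first."""
    votes = defaultdict(int)
    for offset in range(len(read) - k + 1):
        seed = read[offset:offset + k]       # extract a seed from the read
        for ref_pos in index.get(seed, ()):  # look up the record for that seed
            votes[ref_pos - offset] += 1     # implied start of the whole read
    return sorted(votes, key=lambda p: (-votes[p], p))


reference = "TTGACCGTAAGCTAGGCT"
index = build_index(reference)
candidates = map_read("CGTAAGC", index)
```

In this toy example every seed of the read points to the same implied start, so a single candidate position is returned for the alignment stage to refine.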
WO 14320 PCT/US2017/036424
In another embodiment, a set of the processing engines may include an alignment module, such as in a pre-configured and/or hardwired configuration. In this instance, one or more of the processing engines may be configured to receive one or more of the mapped positions for the read data via one or more of the plurality of physical electrical interconnects. Then the memory (internal or external) may be accessed for each mapped position to retrieve a segment of the reference sequence/genome corresponding to the mapped position. An alignment of the read to each retrieved reference segment may be calculated along with a score for the alignment. Once calculated, at least one best-scoring alignment of the read may be selected and output. In various instances, the alignment module may also implement a dynamic programming algorithm when calculating the alignment, such as one or more of a Smith-Waterman algorithm, e.g., with linear or affine gap scoring, a gapped alignment algorithm, and/or a gapless alignment algorithm. In particular instances, the calculating of the alignment may include first performing a gapless alignment to each reference segment, and based on the gapless alignment scores, selecting reference segments with which to further perform gapped alignments.
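The two-stage scoring described above, a gapless pass followed by gapped alignment of the shortlisted segments, might be sketched in software as below. The scoring values, the `keep` cutoff, and the function names are assumptions chosen for illustration, and the gapped step uses Smith-Waterman with a linear gap penalty only; it is not a description of the disclosed circuits.

```python
def gapless_score(read, segment, match=1, mismatch=-1):
    """Score the read against a segment position by position, with no gaps."""
    return sum(match if a == b else mismatch for a, b in zip(read, segment))


def smith_waterman(read, segment, match=2, mismatch=-1, gap=-2):
    """Best local alignment score using a linear gap penalty."""
    cols = len(segment) + 1
    prev = [0] * cols
    best = 0
    for i in range(1, len(read) + 1):
        curr = [0] * cols
        for j in range(1, cols):
            step = match if read[i - 1] == segment[j - 1] else mismatch
            curr[j] = max(0,                  # start a new local alignment
                          prev[j - 1] + step, # match or mismatch
                          prev[j] + gap,      # gap in the segment
                          curr[j - 1] + gap)  # gap in the read
            best = max(best, curr[j])
        prev = curr
    return best


def best_alignment(read, segments, keep=2):
    """Shortlist segments by gapless score, then pick the best gapped score."""
    shortlist = sorted(segments, key=lambda s: gapless_score(read, s),
                       reverse=True)[:keep]
    return max(shortlist, key=lambda s: smith_waterman(read, s))


chosen = best_alignment("ACGTAC", ["ACGAAC", "TTTTTT", "ACGTAC"])
```

The inexpensive gapless pass limits how many segments reach the quadratic-cost dynamic programming step, which is the motivation for ordering the stages this way.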
In various embodiments, a variant call module may be provided for performing improved variant call functions that, when implemented in one or both of software and/or hardware configurations, deliver superior processing speed, better processed result accuracy, and enhanced overall efficiency compared with the methods, devices, and systems currently known in the art. Specifically, in one aspect, improved methods for performing variant call operations in software and/or in hardware, such as for performing one or more HMM operations on genetic sequence data, are provided. In another aspect, novel devices including an integrated circuit for performing such improved variant call operations, where at least a portion of the variant call operation is implemented in hardware, are provided.
Accordingly, in various instances, the methods disclosed herein may include mapping, by a first subset of hardwired and/or quantum digital logic circuits, a plurality of reads to one or more segments of one or more genetic reference sequences. Additionally, the methods may include accessing, by the integrated and/or quantum circuits, e.g., by one or more of the plurality of physical electrical interconnects, from the memory or a cache associated therewith, one or more of the mapped reads and/or one or more of the genetic reference sequences; and aligning, by a second subset of the hardwired and/or quantum digital logic circuits, the plurality of mapped reads to the one or more segments of the one or more genetic reference sequences.
In various embodiments, the method may additionally include accessing, by the integrated and/or quantum circuit, e.g., by one or more of the plurality of physical electrical interconnects from a memory or a cache associated therewith, the aligned plurality of reads. In such an instance the method may include sorting, by a third subset of the hardwired and/or quantum digital logic circuits, the aligned plurality of reads according to their positions in the one or more genetic reference sequences. In certain instances, the method may further include outputting, such as by one or more of the plurality of physical electrical interconnects of the integrated and/or quantum circuit, result data from the mapping and/or the aligning and/or the sorting, such as where the result data includes positions of the mapped and/or aligned and/or sorted plurality of reads.
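The sorting step above orders aligned reads by their reference positions. A minimal software sketch follows, assuming each aligned read carries a reference sequence name and a start position; the `AlignedRead` record and `sort_aligned` function are hypothetical names introduced for illustration.

```python
from typing import NamedTuple


class AlignedRead(NamedTuple):
    name: str
    chrom: str  # reference sequence the read aligned to
    pos: int    # start position within that reference sequence


def sort_aligned(reads, chrom_order):
    """Order reads by reference sequence, then by position within it."""
    rank = {chrom: i for i, chrom in enumerate(chrom_order)}
    return sorted(reads, key=lambda r: (rank[r.chrom], r.pos))


reads = [
    AlignedRead("r1", "chr2", 50),
    AlignedRead("r2", "chr1", 900),
    AlignedRead("r3", "chr1", 120),
]
ordered = sort_aligned(reads, ["chr1", "chr2"])
names = [r.name for r in ordered]
```

Position-ordered output lets a later variant calling stage walk the reference once and see all reads overlapping each locus together.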
In some instances, the method may additionally include using the obtained result data, such as by a further subset of the hardwired and/or quantum digital logic circuits, for the purpose of determining how the mapped, aligned, and/or sorted data, derived from the subject's sequenced genetic sample, differs from a reference sequence, so as to produce a variant call file delineating the genetic differences between the two samples. Accordingly, in various embodiments, the method may further include accessing, by the integrated and/or quantum circuit, e.g., by one or more of the plurality of physical electrical interconnects from a memory or a cache associated therewith, the mapped and/or aligned and/or sorted plurality of reads. In such an instance the method may include performing a variant call function, e.g., an HMM or paired HMM operation, on the accessed reads, by a third or fourth subset of the hardwired and/or quantum digital logic circuits, so as to produce a variant call file detailing how the mapped, aligned, and/or sorted reads vary from that of one or more reference, e.g., haplotype, sequences.
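To make the paired HMM operation above more concrete, the following sketch computes a forward-algorithm likelihood of a read given a candidate haplotype using three states (match, insertion, deletion). All parameter values, the uniform start distribution, and the name `pair_hmm_likelihood` are assumptions for illustration; the sketch ignores per-base quality scores and is not the disclosed implementation.

```python
def pair_hmm_likelihood(read, haplotype, error=0.01, gap_open=0.001, gap_ext=0.1):
    """Forward-algorithm estimate of P(read | haplotype) in a 3-state pair HMM."""
    if not read or not haplotype:
        raise ValueError("read and haplotype must be non-empty")
    n, m = len(read), len(haplotype)
    match_to_match = 1.0 - 2.0 * gap_open  # remain in the match state
    gap_to_match = 1.0 - gap_ext           # close an insertion or deletion
    M = [[0.0] * (m + 1) for _ in range(n + 1)]  # read base paired with haplotype base
    I = [[0.0] * (m + 1) for _ in range(n + 1)]  # read base with no haplotype partner
    D = [[0.0] * (m + 1) for _ in range(n + 1)]  # haplotype base skipped by the read
    # Assumed start condition: the read may begin at any haplotype offset.
    for j in range(m + 1):
        D[0][j] = 1.0 / (m + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            emit = 1.0 - error if read[i - 1] == haplotype[j - 1] else error / 3.0
            M[i][j] = emit * (match_to_match * M[i - 1][j - 1]
                              + gap_to_match * (I[i - 1][j - 1] + D[i - 1][j - 1]))
            # Inserted read bases are emitted uniformly over A, C, G, T.
            I[i][j] = 0.25 * (gap_open * M[i - 1][j] + gap_ext * I[i - 1][j])
            D[i][j] = gap_open * M[i][j - 1] + gap_ext * D[i][j - 1]
    # The read may end at any haplotype position.
    return sum(M[n][j] + I[n][j] for j in range(1, m + 1))


# Hypothetical haplotypes differing at a single base.
haplotypes = {"ref": "GATTACAGG", "alt": "GATTGCAGG"}
read = "TTGCA"
scores = {name: pair_hmm_likelihood(read, hap) for name, hap in haplotypes.items()}
best = max(scores, key=scores.get)
```

A variant caller would evaluate many reads against each candidate haplotype in this way and combine the resulting likelihoods when deciding which haplotypes the sample supports.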
Accordingly, in accordance with particular aspects of the disclosure, presented herein is a compact hardware, e.g., chip based, or quantum accelerated platform for performing secondary and/or tertiary analyses on genetic and/or genomic sequencing data.
Particularly, a platform or pipeline of hardwired and/or quantum digital logic circuits that have specifically been designed for performing secondary and/or tertiary genetic analysis, such as on sequenced genetic data, or genomic data derived therefrom, is provided. Particularly, a set of hardwired digital and/or quantum logic circuits, which may be arranged as a set of processing engines, may be provided, such as where the processing engines may be present in a preconfigured and/or hardwired and/or quantum configuration on a processing platform of the disclosure, and may be specifically designed for performing secondary mapping and/or aligning and/or variant call operations related to genetic analysis on DNA and/or RNA data, and/or may be specifically designed for performing other tertiary processing on the results data.
In particular instances, the present devices, systems, and methods of employing the same in the performance of one or more genomics and/or bioinformatics secondary and/or tertiary processing protocols have been optimized so as to deliver processing speeds that are orders of magnitude faster than standard secondary processing pipelines that are implemented in software. Additionally, the pipelines and/or components thereof as set forth herein provide better sensitivity and accuracy on a wide range of sequence derived data sets for the purposes of genomics and bioinformatics processing. In various instances, one or more of these operations may be performed by an integrated circuit that is part of or configured as a general purpose central processing unit and/or a graphics processing unit and/or a quantum processing unit.
For example, genomics and bioinformatics are fields concerned with the application of information technology and computer science to the field of genetics and/or molecular biology. In particular, bioinformatics techniques can be applied to process and analyze various genetic and/or genomic data, such as from an individual, so as to determine qualitative and quantitative information about that data that can then be used by various practitioners in the development of prophylactic, therapeutic, and/or diagnostic methods for preventing, treating, ameliorating, and/or at least identifying diseased states and/or their potential, and thus improving the safety, quality, and effectiveness of health care on an individualized level. Hence, because of their focus on advancing personalized healthcare, the genomics and bioinformatics fields promote individualized healthcare that is proactive, instead of reactive, and this gives the subject in need of treatment the opportunity to become more involved in their own wellness. An advantage of employing the genetics, genomics, and/or bioinformatics technologies disclosed herein is that the qualitative and/or quantitative analyses of molecular biological, e.g., genetic, data can be performed on a broader range of sample sets at a much higher rate of speed and oftentimes more accurately, thus expediting the emergence of a personalized healthcare system.
Particularly, in various embodiments, the genomics and/or bioinformatics related tasks may form a genomics pipeline that includes one or more of a micro-array analysis pipeline, a genome, e.g., whole genome, analysis pipeline, genotyping analysis pipeline, exome analysis pipeline, epigenome analysis pipeline, metagenome analysis pipeline, microbiome analysis pipeline, genotyping analysis pipeline, including joint genotyping, variants analysis pipelines, including structural variants, somatic variants, and GATK, as well as RNA sequencing and other genetic analyses pipelines.
Accordingly, to make use of these advantages there exist enhanced and more accurate software implementations for performing one or a series of such bioinformatics based analytical techniques, such as for deployment by a general purpose CPU and/or GPU, and/or which may be implemented in one or more quantum circuits of a quantum processing platform. However, a common characteristic of traditionally configured software based bioinformatics methods and systems is that they are labor intensive, take a long time to execute on such general purpose processors, and are prone to errors. Therefore, bioinformatics systems as implemented herein that could perform these algorithms, such as implemented in software by a CPU and/or GPU or quantum processing unit, in a less labor and/or processing intensive manner with a greater percentage accuracy would be useful.
Such implementations have been developed and are presented herein, such as where the genomics and/or bioinformatics analyses are performed by optimized software run on a CPU and/or GPU and/or quantum computer in a system that makes use of the genetic sequence data derived by the processing units and/or integrated circuits of the disclosure.
Further, it is to be noted that the cost of analyzing, storing, and sharing this raw digital data has far outpaced the cost of producing it. Accordingly, also presented herein are "just in time" storage and/or retrieval methods that optimize the storage of such data in a manner that substitutes the speed of regenerating the data in exchange for the cost of storing such data collectively. Hence, the data generation, analysis, and "just in time" or "JIT" storage methods presented herein solve a key bottleneck that is a long felt but unmet obstacle standing between the ever-growing raw data generation and storage and the real medical insight being sought from it.
Presented herein, therefore, are systems, apparatuses, and methods for implementing genomics and/or bioinformatic protocols or portions thereof, such as for performing one or more functions for analyzing genomic data, for instance, on one or both of an integrated circuit, such as on a hardware processing platform, and a general purpose processor, such as for performing one or more bioanalytic operations in software and/or on firmware. For example, as set forth herein below, in various implementations, an integrated circuit and/or quantum circuit is provided so as to accelerate one or more processes in a primary, secondary, and/or tertiary processing platform. In various instances, the integrated circuit may be employed in performing genetic analytic related tasks, such as mapping, aligning, variant calling, compressing, decompressing, and the like, in an accelerated manner, and as such the integrated circuit may include a hardware accelerated configuration.
Additionally, in various instances, an integrated and/or quantum circuit may be provided such as where the circuit is part of a processing unit that is configured for performing one or more genomics and/or bioinformatics protocols on the generated mapped and/or aligned and/or variant called data.
Particularly, in a first embodiment, a first integrated circuit may be formed of an FPGA, ASIC, and/or sASIC that is coupled to or otherwise attached to the motherboard and configured, or in the case of an FPGA may be programmable by firmware to be configured, as a set of hardwired digital logic circuits that are adapted to perform at least a first set of sequence analysis functions in a genomics analysis pipeline, such as where the integrated circuit is configured as described herein above to include one or more digital logic circuits that are arranged as a set of processing engines, which are adapted to perform one or more steps in a mapping, aligning, and/or variant calling operation on the genetic data so as to produce sequence analysis results data. The first integrated circuit may further include an output, e.g., formed of a plurality of physical electrical interconnects, such as for communicating the result data from the mapping and/or the alignment and/or other procedures to the memory.
Additionally, a second integrated and/or quantum circuit may be included, coupled to or otherwise attached to the motherboard, and in communication with the memory via a communications interface. The second integrated and/or quantum circuit may be formed as a central processing unit (CPU) or graphics processing unit (GPU) or quantum processing unit (QPU) that is configured for receiving the mapped and/or aligned and/or variant called sequence analysis result data and may be adapted to be responsive to one or more software algorithms that are configured to instruct the CPU or GPU to perform one or more genomics and/or bioinformatics functions of the genomic analysis pipeline on the mapped, aligned, and/or variant called sequence analysis result data. Specifically, the genomics and/or bioinformatics related tasks may form a genomics analysis pipeline that includes one or more of a micro-array analysis, a genome pipeline, e.g., whole genome analysis pipeline, genotyping analysis pipeline, exome analysis pipeline, epigenome analysis pipeline, metagenome analysis pipeline, microbiome analysis pipeline, genotyping analyses pipelines, including joint genotyping, variants analyses pipelines, including structural variants, somatic variants, and GATK, as well as RNA sequencing analysis pipeline and other genetic analyses pipelines.
For instance, in one embodiment, the CPU and/or GPU and/or QPU of the second integrated circuit may include software that is configured for arranging the genome analysis pipeline for executing a whole genome analysis pipeline, such as a whole genome analysis pipeline that includes one or more of genome-wide variation analysis, whole-exome DNA analysis, whole transcriptome RNA analysis, gene function analysis, protein function analysis, protein binding analysis, quantitative gene analysis, and/or a gene assembly analysis. In certain instances, the whole genome analysis pipeline may be performed for the purposes of one or more of ancestry analysis, personal medical history analysis, disease diagnostics, drug discovery, and/or protein profiling. In a particular instance, the whole genome analysis pipeline is performed for the purposes of oncology analysis. In various instances, the results of this data may be made available, e.g., globally, throughout the system.
In various instances, the CPU and/or GPU and/or a quantum processing unit (QPU) of the second integrated and/or quantum circuit may include software that is configured for arranging the genome analysis pipeline for executing a genotyping analysis, such as a genotyping analysis including joint genotyping. For instance, the joint genotyping analysis may be performed using a Bayesian probability calculation, such as a Bayesian probability calculation that results in an absolute probability that a given determined genotype is a true genotype. In other instances, the software may be configured for performing a metagenome analysis so as to produce metagenome result data that may in turn be employed in the performance of a microbiome analysis.
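The Bayesian probability calculation mentioned above can be illustrated with a single-sample sketch that combines genotype likelihoods with priors to obtain normalized posterior probabilities. The numbers below are invented for demonstration, the function name `genotype_posteriors` is hypothetical, and an actual joint genotyping analysis would pool evidence across many samples rather than treat one sample in isolation.

```python
def genotype_posteriors(likelihoods, priors):
    """Bayes' rule: posterior(g) is proportional to likelihood(g) * prior(g)."""
    joint = {g: likelihoods[g] * priors[g] for g in likelihoods}
    total = sum(joint.values())
    if total == 0:
        raise ValueError("all genotypes have zero joint probability")
    return {g: p / total for g, p in joint.items()}


# Illustrative values only: P(observed reads | genotype) and P(genotype).
likelihoods = {"0/0": 1e-6, "0/1": 0.2, "1/1": 0.01}
priors = {"0/0": 0.998, "0/1": 0.0015, "1/1": 0.0005}

posteriors = genotype_posteriors(likelihoods, priors)
called = max(posteriors, key=posteriors.get)
```

Because the posteriors are normalized over all candidate genotypes, the value attached to the called genotype can be read as the probability that the call is the true genotype under the stated priors.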
In certain instances, the first and/or second integrated circuit and/or the memory may be housed on an expansion card, such as a peripheral component interconnect (PCI) card. For instance, in various embodiments, one or more of the integrated circuits may be one or more chips coupled to a PCIe card or otherwise associated with the motherboard. In various instances, the integrated and/or quantum circuit(s) and/or chip(s) may be a component within a sequencer or computer, or server, such as part of a server farm. In particular embodiments, the integrated and/or quantum circuit(s) and/or expansion card(s) and/or computer(s) and/or server(s) may be accessible via the internet, e.g., cloud.
Further, in some instances, the memory may be a volatile random access memory (RAM), e.g., a dynamic random access memory (DRAM). Particularly, in various embodiments, the memory may include at least two memories, such as a first memory that is an HMEM, e.g., for storing the reference haplotype sequence data, and a second memory that is an RMEM, e.g., for storing the read of genomic sequence data. In particular instances, each of the two memories may include a write port and/or a read port, such as where the write port and the read port each access a separate clock. Additionally, each of the two memories may include a flip-flop configuration for storing a multiplicity of genetic sequence and/or processing result data.
Accordingly, in another aspect, the system may be configured for sharing memory resources amongst its component parts, such as in relation to performing some computational tasks via software, such as run by the CPU and/or GPU and/or quantum processing platform, and/or performing other computational tasks via firmware, such as via the hardware of an associated integrated circuit, e.g., FPGA, ASIC, and/or sASIC. This may be achieved in a number of different ways, such as by a direct loose or tight coupling between the CPU/GPU/QPU and the FPGA, e.g., chip or PCIe card. Such configurations may be particularly useful when distributing operations related to the processing of the large data structures associated with genomics and/or bioinformatics analyses to be used and accessed by both the CPU/GPU/QPU and the associated integrated circuit. Particularly, in various embodiments, when processing data through a genomics pipeline, as herein described, such as to accelerate overall processing function, timing, and efficiency, a number of different operations may be run on the data, which operations may involve both software and hardware processing components.
Consequently, data may need to be shared and/or otherwise communicated between the software component(s) running on the CPU and/or GPU and/or QPU and/or the hardware component embodied in the chip, e.g., an FPGA. Accordingly, one or more of the various steps in the genomics and/or bioinformatics processing pipeline, or a portion thereof, may be performed by one device, e.g., the CPU/GPU/QPU, and one or more of the various steps may be performed by a hardwired device, e.g., the FPGA. In such an instance, the CPU/GPU/QPU and/or the FPGA may be communicably coupled in such a manner as to allow the efficient transmission of such data, which coupling may involve the shared use of memory resources. To achieve such distribution of tasks and the sharing of information for the performance of such tasks, the various CPUs/GPUs/QPUs may be loosely or tightly coupled to one another and/or the hardware devices, e.g., FPGA, or other chip set, such as by a quick path interconnect.
Particularly, in various embodiments, a genomics analysis platform is provided. For instance, the platform may include a motherboard, a memory, and a plurality of integrated and/or quantum circuits, such as forming one or more of a CPU/GPU/QPU, a mapping module, an alignment module, a sorting module, and/or a variant call module. Specifically, in particular embodiments, the platform may include a first integrated and/or quantum circuit, such as an integrated circuit forming a central processing unit (CPU) or graphics processing unit (GPU), or a quantum circuit forming a quantum processor, that is responsive to one or more software or other algorithms that are configured to instruct the CPU/GPU/QPU to perform one or more sets of genomics analysis functions, as described herein, such as where the CPU/GPU/QPU includes a first set of physical electronic interconnects to connect with the motherboard. In various instances, the memory may also be attached to the motherboard and may further be electronically connected with the CPU/GPU/QPU, such as via at least a portion of the first set of physical electronic interconnects. In such instances, the memory may be configured for storing a plurality of reads of genomic data, and/or at least one or more genetic reference sequences, and/or an index of the one or more genetic reference sequences.
Additionally, the platform may include one or more of another integrated circuit(s), such as where each of the other integrated circuits forms a field programmable gate array (FPGA) having a second set of physical electronic interconnects to connect with the CPU/GPU/QPU and the memory, such as via a point-to-point interconnect protocol. In such an instance, such as where the integrated circuit is an FPGA, the FPGA may be programmable by firmware to configure a set of hardwired digital logic circuits that are interconnected by a plurality of physical interconnects to perform a second set of genomics analysis functions, e.g., mapping, aligning, variant calling, etc. Particularly, the hardwired digital logic circuits of the FPGA may be arranged as a set of processing engines to perform one or more pre-configured steps in a sequence analysis pipeline of the genomics analysis, such as where the set(s) of processing engines include one or more of a mapping and/or aligning and/or variant call module, which modules may be formed of the separate or the same subsets of processing engines.
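As a hedged illustration of the kind of computation an aligning processing engine may carry out, the following is a minimal software sketch of the Smith-Waterman local alignment recurrence. It is not the hardwired implementation described herein: the function name `smith_waterman`, the scoring values, and the use of a simple linear gap penalty are assumptions chosen for brevity.

```python
from typing import Tuple


def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1,
                   gap: int = -2) -> Tuple[int, Tuple[int, int]]:
    """Return the best local alignment score of a and b and the cell where it ends.

    Uses a linear gap penalty (an assumption of this sketch). The returned
    position is (i, j) with 1-based indices into a and b; (0, 0) means that
    no positively scoring local alignment exists.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # h[i][j] holds the best score of a local alignment ending at a[i-1], b[j-1].
    h = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            h[i][j] = max(
                0,                      # start a new local alignment
                h[i - 1][j - 1] + sub,  # match or mismatch
                h[i - 1][j] + gap,      # gap in b
                h[i][j - 1] + gap,      # gap in a
            )
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)
    return best, best_pos
```

Each cell depends only on its upper, left, and upper-left neighbours, so all cells on one anti-diagonal can be computed at the same time. That dependency pattern is what makes the recurrence well suited to parallel hardware.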
As indicated, the system may be configured to include one or more processing engines, and in various embodiments, an included processing engine may itself be configured for determining one or more transition probabilities for the sequence of nucleotides of the read of genomic sequence going from one state to another, such as from a match state to an

Claims (29)

Claims
1. A genomics analysis platform for executing a sequence analysis pipeline, the genomics analysis platform comprising: a set of a first computing instance type, each first computing instance type having a CPU configured by one or more software algorithms for receiving one or more of a BCL file and/or a FASTQ file representing genomic sequence data, and performing one or more preprocessing steps on the genomic sequence data to produce a first set of result data; a set of a second computing instance type, each second computing instance type having an FPGA comprising a set of hardwired digital logic circuits configured by firmware to arrange a set of processing engines for accessing at least a portion of the first set of result data and performing mapping and aligning processing steps of the sequence analysis pipeline to produce a second set of result data, wherein performing the aligning processing steps comprises performing Smith-Waterman steps; and a second set of the first computing instance type for accessing at least a portion of the second set of result data and performing variant calling processing steps of the sequence analysis pipeline to produce a third set of result data, wherein performing the variant calling processing steps comprises performing one or more Hidden-Markov Model (HMM) processing steps.
2. The genomics analysis platform in accordance with claim 1, further comprising a reconfigurable storage connected with the set of first computing instances and the set of second computing instances, the reconfigurable storage being switchable between one or more of the set of first computing instances and one or more of the set of second computing instances for read and write access by successively active computing instances of the set of first computing instances or second computing instances.
3. The genomics analysis platform in accordance with claim 1, wherein the reconfigurable storage is connected to the set of first computing instances and the set of second computing instances via a data communication network.
4. The genomics analysis platform in accordance with claim 1, further comprising a work flow manager having a load estimator logic to estimate a data load of the sequence analysis pipeline to be performed by the set of first computing instances or the set of second computing instances, the work flow manager further being configured to instantiate a number of the set of first computing instances or the set of the second computing instances based on the estimated data load.
5. The genomics analysis platform in accordance with claim 4, wherein the load estimator logic is further configured to instantiate the number of the set of first computing instances or the set of second computing instances to be instantiated to optimize an efficiency of the genomics analysis platform for processing the data load.
6. The genomics analysis platform in accordance with claim 1, further comprising a set of third computing instances, each third computing instance having at least one graphical processing unit (GPU) that is responsive to one or more graphical processing algorithms that are configured to instruct the GPU to perform a third set of genomic processing steps of the sequence analysis pipeline, and wherein the reconfigurable storage is connected with the set of third server instances via the data communication network.
7. A genomics analysis platform for executing a sequence analysis pipeline, the genomics analysis platform comprising: a set of a first computing instance type, each first computing instance type having a CPU configured by one or more software algorithms for performing a first set of genomic processing steps of the sequence analysis pipeline to produce a first set of result data; a set of a second computing instance type, each second computing instance type having an FPGA comprising a set of hardwired digital logic circuits configured by firmware to arrange a set of processing engines for accessing at least a portion of the first set of result data and performing a second set of genomic processing steps comprising mapping and aligning processing steps of the sequence analysis pipeline to produce a second set of result data; a second set of the first computing instance type for accessing at least a portion of the second set of result data and performing a third set of genomic processing steps of the sequence analysis pipeline to produce a third set of result data; and a second set of the second computing instance type for accessing at least a portion of the third set of result data and performing a fourth set of genomic processing steps of the sequence analysis pipeline to produce a fourth set of result data.
8. The genomics analysis platform in accordance with claim 7, further comprising: a reconfigurable storage external to the sets of the first and second computing instance types, the reconfigurable storage being communicably switchable between the CPUs of the first computing instance type and the FPGAs of the second computing instance type, to receive, store and provide access to result data.
9. The genomics analysis platform in accordance with claim 7, wherein the first set of genomic processing steps includes one or more of image processing, base calling, error correction, BCL conversion, and FASTQ processing.
10. The genomics analysis platform in accordance with claim 9, wherein the mapping includes performing one or more of a Burrows-Wheeler transform and a hash-table function.
11. The genomics analysis platform in accordance with claim 9, wherein the aligning includes one or more of performing a Smith-Waterman, a Needleman-Wunsch, a gapless, and a gapped alignment.
12. The genomics analysis platform in accordance with claim 11, wherein the second set of genomic processing steps includes one or more of sorting and deduplication.
13. The genomics analysis platform in accordance with claim 12, wherein the second set of genomic processing steps includes performing a Smith-Waterman operation; or wherein the third set of genomic processing steps includes a Hidden Markov Model operation.
14. The genomics analysis platform in accordance with claim 7, wherein the reconfigurable storage is connected to the set of first computing instances and the set of second computing instances via a data communication network.
15. The genomics analysis platform in accordance with claim 7, further comprising a work flow manager having a load estimator logic to estimate a data load of the sequence analysis pipeline to be performed by the set of first computing instances or the set of second computing instances, the work flow manager further being configured to instantiate a number of the set of first computing instances or the set of the second computing instances based on the estimated data load.
16. The genomics analysis platform in accordance with claim 15, wherein the load estimator logic is further configured to instantiate the number of the set of first computing instances or the set of second computing instances to be instantiated to optimize an efficiency of the genomics analysis platform for processing the data load.
17. The genomics analysis platform in accordance with claim 16, further comprising a set of third computing instances, each third computing instance having at least one graphical processing unit (GPU) that is responsive to one or more graphical processing algorithms that are configured to instruct the GPU to perform a third set of genomic processing steps of the sequence analysis pipeline, and wherein the reconfigurable storage is connected with the set of third server instances via the data communication network.
18. The genomics analysis platform in accordance with claim 7, further comprising an on-demand cloud computing platform comprising a set of server computers and a set of databases, the set of server computers defining a virtual set of server computer instances that is configurable to provide the set of first server instances and the set of second server instances, and the set of databases defining a virtual database storage that is configurable to provide the reconfigurable storage.
19. A genomics analysis platform for executing a sequence analysis pipeline, the genomics analysis platform comprising: a set of first server instances and a set of second server instances connected with a data communication network, each first server instance having at least one central processing unit (CPU) that is responsive to one or more software algorithms that are configured to instruct the CPU to perform a first set of genomic processing steps of the sequence analysis pipeline, each second server instance having at least one field programmable gate array (FPGA), each FPGA being configured by firmware to arrange a set of hardwired digital logic circuits that are interconnected by a plurality of physical electrical interconnects to form a set of processing engines to perform a second set of genomic processing steps of the sequence analysis pipeline; and an elastic reconfigurable storage connectable with the set of first server instances and the set of second server instances via the data communication network, the elastic reconfigurable storage being switchable between one or more of the set of first server instances and one or more of the set of second server instances for read and write access by successively active server instances of the set of first server instances or second server instances.
20. The genomics analysis platform in accordance with claim 19, further comprising an elastic controller having load estimator logic to estimate a data load of the sequence analysis pipeline to be performed by the set of first server instances or the set of second server instances, the elastic control node further being configured to instantiate a number of the first server instances or the second server instances based on the estimated data load.
21. The genomics analysis platform in accordance with claim 20, wherein the load estimator logic is further configured to instantiate the number of the first server instances or the second server instances to be instantiated to optimize an efficiency of the genomics analysis platform for processing the data load.
22. The genomics analysis platform in accordance with claim 21, further comprising a set of third server instances, each third server instance having at least one graphical processing unit (GPU) that is responsive to one or more graphical processing algorithms that are configured to instruct the GPU to perform a third set of genomic processing steps of the sequence analysis pipeline, and wherein the reconfigurable storage is connected with the set of third server instances via the data communication network.
23. The genomics analysis platform in accordance with claim 21, wherein the second set of genomic processing steps includes mapping and aligning, and the second set of results data includes one or more of a mapped and aligned read.
24. The genomics analysis platform in accordance with claim 23, wherein the aligning includes performing a Smith-Waterman operation.
25. The genomics analysis platform in accordance with claim 19, further comprising an on-demand cloud computing platform comprising a set of server computers and a set of databases, the set of server computers defining a virtual set of server computer instances that is configurable to provide the set of first server instances and the set of second server instances, and the set of databases defining a virtual database storage that is configurable to provide the reconfigurable storage.
26. A genomics analysis platform for executing a sequence analysis pipeline, the genomics analysis platform comprising: one or more of a first server instance, each first server instance having at least one central processing unit (CPU) forming a compute node of the genomic analysis platform, each CPU being responsive to one or more software algorithms that are configured to instruct the CPU to perform a first set of genomic processing steps of the sequence analysis pipeline; one or more of a second server instance, each second server instance having at least one field programmable gate array (FPGA), each FPGA being configured by firmware to arrange a set of hardwired digital logic circuits that are interconnected by a plurality of physical electrical interconnects to form a set of processing engines to perform a second set of genomic processing steps of the sequence analysis pipeline; and an elastic reconfigurable storage external to the one or more first server instances and second server instances, the reconfigurable storage being communicably switchable between the one or more first server instances and the second server instances to receive and store result data of the first or second set of genomic processing steps of the sequence analysis pipeline, and provide access to the result data by a next server instance.
27. The genomics analysis platform in accordance with claim 26, further comprising an elastic controller having load estimator logic to estimate a data load of the sequence analysis pipeline to be performed by the set of first server instances or the set of second server instances, the elastic control node further being configured to instantiate a number of the first server instances or the second server instances based on the estimated data load.
28. The genomics analysis platform in accordance with claim 27, wherein the load estimator logic is further configured to instantiate the number of the first server instances or the second server instances to be instantiated to optimize an efficiency of the genomics analysis platform for processing the data load.
29. The genomics analysis platform in accordance with claim 28, further comprising an on-demand cloud computing platform comprising a set of server computers and a set of databases, the set of server computers defining a virtual set of server computer instances that is configurable to provide the set of first server instances and the set of second server instances, and the set of databases defining a virtual database storage that is configurable to provide the reconfigurable storage.
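For illustration only, and forming no part of the claims, the load estimator logic recited above may be sketched as a function that converts an estimated data load into a number of server instances to instantiate. Everything in the sketch is an assumption: the names `InstanceType`, `estimate_load_gigabases`, and `instances_needed`, the per-instance throughput figure, the completion deadline, and the sizing rule itself are hypothetical and are not taken from the specification.

```python
import math
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class InstanceType:
    """Hypothetical description of one kind of server instance."""
    name: str
    gigabases_per_hour: float  # assumed sustained throughput per instance
    max_instances: int         # assumed upper limit on instances of this type


def estimate_load_gigabases(file_sizes_bytes: Iterable[float],
                            bytes_per_base: float = 1.0) -> float:
    """Estimate the data load from input file sizes (a deliberately crude proxy)."""
    return sum(file_sizes_bytes) / bytes_per_base / 1e9


def instances_needed(load_gigabases: float, itype: InstanceType,
                     deadline_hours: float) -> int:
    """Return how many instances to start so the load finishes by the deadline."""
    if deadline_hours <= 0:
        raise ValueError("deadline_hours must be positive")
    if load_gigabases <= 0:
        return 0  # nothing to process, so start nothing
    per_instance = itype.gigabases_per_hour * deadline_hours
    needed = math.ceil(load_gigabases / per_instance)
    # Never exceed the configured limit for this instance type.
    return min(needed, itype.max_instances)
```

For example, with an assumed FPGA instance type rated at 50 gigabases per hour and limited to 8 instances, a 100 gigabase load with a one hour deadline yields 2 instances, while a 1000 gigabase load is capped at 8.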
NZ789138A 2016-06-07 2017-06-07 Bioinformatics systems, apparatus, and methods for performing secondary and/or tertiary processing NZ789138A (en)

Applications Claiming Priority (7)

Application Number Priority Date Filing Date Title
US62/347,080 2016-06-07
US62/399,582 2016-09-26
US62/414,637 2016-10-28
US15/404,146 2017-01-11
US62/462,869 2017-02-23
US62/469,442 2017-03-09
US15/497,149 2017-04-25

Publications (1)

Publication Number Publication Date
NZ789138A (en) 2022-07-01


Similar Documents

Publication Publication Date Title
EP3465507B1 (en) Genetic multi-region joint detection and variant calling
Blom et al. Exact and complete short-read alignment to microbial genomes using Graphics Processing Unit programming
Brøndum et al. Strategies for imputation to whole genome sequence using a single or multi-breed reference population in cattle
CA3042239A1 (en) Bioinformatics systems, apparatuses, and methods for performing secondary and/or tertiary processing
EP3837690B1 (en) Systems and methods for using neural networks for germline and somatic variant calling
US20160171153A1 (en) Bioinformatics Systems, Apparatuses, And Methods Executed On An Integrated Circuit Processing Platform
Liu et al. CUSHAW3: sensitive and accurate base-space and color-space short-read alignment with hybrid seeding
Alser et al. From molecules to genomic variations: Accelerating genome analysis via intelligent algorithms and architectures
JP2019510323A5 (en)
CN104992079B (en) Protein-ligand based on sampling study binds site estimation method
Chen et al. A hybrid short read mapping accelerator
Souilmi et al. Scalable and cost-effective NGS genotyping in the cloud
AU2021203941B2 (en) Bioinformatics systems, apparatuses, and methods executed on an integrated circuit processing platform
Lee et al. Scalable metagenomics alignment research tool (SMART): a scalable, rapid, and complete search heuristic for the classification of metagenomic sequences from complex sequence populations
Kearse et al. The Geneious 6.0.3 read mapper
Alballa et al. TranCEP: Predicting the substrate class of transmembrane transport proteins using compositional, evolutionary, and positional information
Houtgast et al. An efficient GPU-accelerated implementation of genomic short read mapping with BWA-MEM
WO2016139545A1 (en) Hardware accelerator for alignment of short reads in sequencing platforms
Vineetha et al. SPARK-MSNA: Efficient algorithm on Apache Spark for aligning multiple similar DNA/RNA sequences with supervised learning
Denti et al. Shark: fishing relevant reads in an RNA-Seq sample
NZ789138A (en) Bioinformatics systems, apparatus, and methods for performing secondary and/or tertiary processing
Yue et al. A systematic review on the state-of-the-art strategies for protein representation
Leung et al. SV-AUTOPILOT: optimized, automated construction of structural variation discovery and benchmarking pipelines
Köster et al. Massively parallel read mapping on GPUs with the q-group index and PEANUT
Han et al. HycDemux: a hybrid unsupervised approach for accurate barcoded sample demultiplexing in nanopore sequencing