NZ789138A - Bioinformatics systems, apparatus, and methods for performing secondary and/or tertiary processing - Google Patents
Bioinformatics systems, apparatus, and methods for performing secondary and/or tertiary processing
- Publication number
- NZ789138A
- Authority
- NZ
- New Zealand
- Prior art keywords
- instances
- server
- computing
- analysis platform
- accordance
Abstract
A system, method and apparatus for executing a bioinformatics analysis on genetic sequence data is provided. Particularly, a genomics analysis platform for executing a sequence analysis pipeline is provided. The genomics analysis platform includes one or more of a first integrated circuit, where each first integrated circuit forms a central processing unit (CPU) that is responsive to one or more software algorithms that are configured to instruct the CPU to perform a first set of genomic processing steps of the sequence analysis pipeline. Additionally, a second integrated circuit is also provided, where each second integrated circuit forms a field programmable gate array (FPGA), the FPGA being configured by firmware to arrange a set of hardwired digital logic circuits that are interconnected by a plurality of physical interconnects to perform a second set of genomic processing steps of the sequence analysis pipeline, the set of hardwired digital logic circuits of each FPGA being arranged as a set of processing engines to perform the second set of genomic processing steps. A shared memory is also provided.
Description
NZ 789138
BIOINFORMATICS SYSTEMS, APPARATUSES, AND METHODS FOR
PERFORMING SECONDARY AND/OR TERTIARY PROCESSING
Cross-Reference to Related Application
The current application claims priority to U.S. Application No. 62/347,080,
filed June 7, 2016, U.S. Application No. 62/399,582, filed September 26, 2016, U.S.
Application No. 62/414,637, filed October 28, 2016, U.S. Application No. 15/404,146, filed
January 11, 2017, U.S. Application No. 62/462,869, filed February 23, 2017, U.S.
Application No. 62/469,442, filed March 9, 2017, and U.S. Application No. 15/497,149, filed
April 25, 2017, the disclosures of each application are incorporated herein by reference in
their entireties.
Field of the Disclosure
The subject matter described herein relates to bioinformatics, and more
particularly to systems, apparatuses, and methods for implementing bioinformatic protocols,
such as performing one or more functions for analyzing genomic data on an integrated
circuit, such as on a hardware processing platform.
Background to the Disclosure
As described in detail herein, some major computational challenges for high-throughput
DNA sequencing analysis are to address the explosive growth in available genomic
data, the need for increased accuracy and sensitivity when gathering that data, and the need
for fast, efficient, and accurate computational tools when performing analysis on a wide
range of sequencing data sets derived from such genomic data.
Keeping pace with such increased sequencing throughput generated by Next
Gen Sequencers has typically been manifested as multithreaded software tools that have been
executed on ever greater numbers of faster processors in computer clusters with expensive
high availability storage that requires substantial power and significant IT support costs.
Importantly, future increases in sequencing throughput rates will translate into accelerating
real dollar costs for these secondary processing solutions.
The devices, systems, and methods of their use described herein are provided,
at least in part, so as to address these and other such challenges.
Summary of the Disclosure
The present disclosure is directed to devices, systems, and methods for
employing the same in the performance of one or more genomics and/or bioinformatics
protocols on data generated through a primary processing procedure, such as on genetic
sequence data. For instance, in various aspects, the devices, systems, and methods herein
provided are configured for performing secondary and/or tertiary analysis protocols on
genetic data, such as data generated by the sequencing of RNA and/or DNA, e.g., by a Next
Gen Sequencer ("NGS"). In particular embodiments, one or more secondary processing
pipelines for processing genetic sequence data are provided. In other embodiments, one or
more tertiary processing pipelines for processing genetic sequence data are provided, such as
where the pipelines, and/or individual elements thereof, deliver superior sensitivity and
improved accuracy on a wider range of sequence derived data than is currently available in
the art.
For example, provided herein is a system, such as for executing one or more of
a sequence and/or genomic analysis pipeline on genetic sequence data and/or other data
derived therefrom. In various embodiments, the system may include one or more of an
electronic data source that provides digital signals representing a plurality of reads of genetic
and/or genomic data, such as where each of the plurality of reads of genomic data include a
sequence of nucleotides. The system may further include a memory, e.g., a DRAM, or a
cache, such as for storing one or more of the sequenced reads, one or a plurality of genetic
reference sequences, and one or more indices of the one or more genetic reference sequences.
The system may additionally include one or more integrated circuits, such as a FPGA, ASIC,
or sASIC, and/or a CPU and/or a GPU, which integrated circuit, e.g., with respect to the
FPGA, ASIC, or sASIC may be formed of a set of hardwired digital logic circuits that are
interconnected by a plurality of physical electrical interconnects. The system may
additionally include a quantum computing processing unit, for use in implementing one or
more of the methods disclosed herein.
In various embodiments, one or more of the plurality of electrical
interconnects may include an input to the one or more integrated circuits that may be
connected or connectable, e.g., directly, via a suitable wired connection, or indirectly such as
via a wireless network connection (for instance, a cloud or hybrid cloud), with the electronic
data source. Regardless of a connection with the sequencer, an integrated circuit of the
disclosure may be configured for receiving the plurality of reads of genomic data, e.g.,
directly from the sequencer or from an associated memory. The reads may be digitally
encoded in a standard FASTQ or BCL file format. Accordingly, the system may include an
integrated circuit having one or more electrical interconnects that may be a physical
interconnect that includes a memory interface so as to allow the integrated circuit to access
the memory.
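By way of illustration only, reads that are digitally encoded in the FASTQ format mentioned above might be loaded as in the following Python sketch. It assumes the common four-line record layout and Phred+33 quality encoding; the names `Read` and `parse_fastq` are hypothetical and are not taken from the disclosure.

```python
import io
from typing import Iterator, List, NamedTuple


class Read(NamedTuple):
    name: str         # record identifier without the leading "@"
    bases: str        # sequence of nucleotides
    quals: List[int]  # per-base Phred quality scores


def parse_fastq(handle, phred_offset: int = 33) -> Iterator[Read]:
    """Yield reads from a text handle holding four-line FASTQ records."""
    while True:
        header = handle.readline()
        if header == "":
            return                      # end of input
        header = header.strip()
        if not header:
            continue                    # tolerate blank lines between records
        bases = handle.readline().strip()
        plus = handle.readline().strip()
        qual = handle.readline().strip()
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(bases) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield Read(header[1:], bases, [ord(c) - phred_offset for c in qual])


if __name__ == "__main__":
    demo = io.StringIO("@r1\nACGT\n+\nII!I\n")
    for read in parse_fastq(demo):
        print(read.name, read.bases, read.quals)
```

In such a sketch the parsed reads would then be handed to the mapping stage; BCL input would require a separate decoder and is not shown.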
Particularly, the hardwired digital logic circuit of the integrated circuit may be
arranged as a set of processing engines, such as where each processing engine may be formed
of a subset of the hardwired digital logic circuits so as to perform one or more steps in the
sequence, genomic, and/or tertiary analysis pipeline, as described herein below, on the
plurality of reads of genomic data as well as on other data derived therefrom. For instance,
each subset of the hardwired digital logic circuits may be in a wired configuration to perform
the one or more steps in the analysis pipeline. Additionally, where the integrated circuit is an
FPGA, such steps in the sequence and/or further analysis process may include the partial
reconfiguration of the FPGA during the analysis process.
Particularly, the set of processing engines may include a mapping module,
e.g., in a wired configuration, to access, according to at least some of the sequence of
nucleotides in a read of the plurality of reads, the index of the one or more genetic reference
sequences, from the memory via the memory interface, so as to map the read to one or more
segments of the one or more genetic reference sequences based on the index. Additionally,
the set of processing engines may include an alignment module in the wired configuration to
access the one or more genetic reference sequences from the memory via the memory
interface to align the read, e.g., the mapped read, to one or more positions in the one or more
segments of the one or more genetic reference sequences, e.g., as received from the mapping
module and/or stored in the memory.
Further, the set of processing engines may include a sorting module so as to
sort each aligned read according to the one or more positions in the one or more genetic
reference sequences. Furthermore, the set of processing engines may include a variant call
module, such as for processing the mapped, aligned, and/or sorted reads, such as with respect
to a reference genome, to thereby produce an HMM readout and/or variant call file for use
with and/or detailing the variations between the sequenced genetic data and the
genomic reference data. In various instances, one or more of the plurality of physical
electrical interconnects may include an output from the integrated circuit for communicating
result data from the mapping module and/or the alignment and/or sorting and/or variant call
modules.
Particularly, with respect to the mapping module, in various embodiments, a
system for executing a mapping analysis pipeline on a plurality of reads of genetic data using
an index of genetic reference data is provided. In various instances, the genetic sequence,
e.g., read, and/or the genetic reference data may be represented by a sequence of nucleotides,
which may be stored in a memory of the system. The mapping module may be included
within the integrated circuit and may be formed of a set of pre-configured and/or hardwired
digital logic circuits that are interconnected by a plurality of physical electrical interconnects,
which physical electrical interconnects may include a memory interface for allowing the
integrated circuit to access the memory. In more particular embodiments, the hardwired
digital logic circuits may be arranged as a set of processing engines, such as where each
processing engine is formed of a subset of the hardwired digital logic circuits to perform one
or more steps in the sequence analysis pipeline on the plurality of reads of genomic data.
For instance, in one embodiment, the set of processing engines may include a
mapping module in a hardwired configuration, where the mapping module, and/or one or
more processing engines thereof is configured for receiving a read of genomic data, such as
via one or more of a plurality of physical electrical interconnects, and for extracting a portion
of the read in such a manner as to generate a seed therefrom. In such an instance, the read
may be represented by a sequence of nucleotides, and the seed may represent a subset of the
sequence of nucleotides represented by the read. The mapping module may include or be
connectable to a memory that includes one or more of the reads, one or more of the seeds of
the reads, at least a portion of one or more of the reference genomes, and/or one or more
indexes, such as an index built from the one or more reference genomes. In certain instances, a
processing engine of the mapping module may employ the seed and the index to calculate an
address within the index based on the seed.
Once an address has been calculated or otherwise derived and/or stored, such
as in an onboard or offboard memory, the address may be accessed in the index in the
memory so as to receive a record from the address, such as a record representing position
information in the genetic reference sequence. This position information may then be used to
determine one or more matching positions from the read to the genetic reference sequence
based on the record. Then at least one of the matching positions may be output to the memory
via the memory interface.
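The seed-and-index lookup described above may be illustrated, purely as a software sketch, in Python. It assumes fixed-length seeds and uses a Python dictionary in place of the hardware address calculation (the dictionary's internal hashing stands in for computing an address from the seed); `build_index` and `map_read` are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, List


def build_index(reference: str, k: int) -> Dict[str, List[int]]:
    """Record every reference position at which each k-base seed occurs."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return dict(index)


def map_read(read: str, index: Dict[str, List[int]], k: int) -> List[int]:
    """Return candidate start positions of the read on the reference.

    Each seed is a k-base subset of the read. Looking the seed up yields a
    record of reference positions; subtracting the seed's offset in the read
    gives the implied start of the whole read. Candidates supported by more
    seeds are listed first.
    """
    votes = defaultdict(int)
    for offset in range(len(read) - k + 1):
        seed = read[offset:offset + k]
        for ref_pos in index.get(seed, ()):
            votes[ref_pos - offset] += 1
    return sorted(votes, key=lambda start: (-votes[start], start))


if __name__ == "__main__":
    reference = "ACGTACGGTACCTTAG"
    index = build_index(reference, k=4)
    print(map_read("GGTACC", index, k=4))
```

The candidate positions returned by such a lookup correspond to the matching positions that would be passed on to alignment.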
WO 14320 PCT/US2017/036424
In another embodiment, a set of the processing engines may include an
alignment module, such as in a pre-configured and/or hardwired configuration. In this
instance, one or more of the processing engines may be configured to receive one or more of
the mapped positions for the read data via one or more of the plurality of physical electrical
interconnects. Then the memory (internal or external) may be accessed for each mapped
position to retrieve a segment of the reference sequence/genome corresponding to the mapped
position. An alignment of the read to each retrieved reference segment may be calculated
along with a score for the alignment. Once calculated, at least one best-scoring alignment of
the read may be selected and output. In various instances, the alignment module may also
implement a dynamic programming algorithm when calculating the alignment, such as one or
more of a Smith-Waterman algorithm, e.g., with linear or affine gap scoring, a gapped
alignment algorithm, and/or a gapless alignment algorithm. In particular instances, the
calculating of the alignment may include first performing a gapless alignment to each
reference segment, and based on the gapless alignment results, selecting reference segments
with which to further perform gapped alignments.
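As a non-limiting software sketch of the two-stage approach just described, the following Python code first scores each reference segment without gaps and then runs a Smith-Waterman alignment with affine gap scoring on the best-ranked segments only. The scoring values, the `keep` cutoff, and all function names are assumptions chosen for illustration, not parameters taken from the disclosure.

```python
from typing import List, Tuple


def gapless_score(read: str, segment: str, match: int = 1, mismatch: int = -1) -> int:
    """Best ungapped score of the read slid along the segment (floor of 0)."""
    best = 0
    for start in range(len(segment) - len(read) + 1):
        window = segment[start:start + len(read)]
        score = sum(match if a == b else mismatch for a, b in zip(read, window))
        best = max(best, score)
    return best


def smith_waterman_affine(read: str, segment: str, match: int = 2, mismatch: int = -3,
                          gap_open: int = -5, gap_extend: int = -1) -> int:
    """Best local alignment score; gap_open is charged for the first gap base."""
    neg_inf = float("-inf")
    n = len(segment)
    h_prev = [0] * (n + 1)          # best score ending at previous row
    f_prev = [neg_inf] * (n + 1)    # gap-in-segment state, previous row
    best = 0
    for i in range(1, len(read) + 1):
        h_row = [0] * (n + 1)
        f_row = [neg_inf] * (n + 1)
        e = neg_inf                 # gap-in-read state along the current row
        for j in range(1, n + 1):
            e = max(e + gap_extend, h_row[j - 1] + gap_open)
            f_row[j] = max(f_prev[j] + gap_extend, h_prev[j] + gap_open)
            sub = match if read[i - 1] == segment[j - 1] else mismatch
            h_row[j] = max(0, h_prev[j - 1] + sub, e, f_row[j])
            best = max(best, h_row[j])
        h_prev, f_prev = h_row, f_row
    return best


def align_read(read: str, segments: List[str], keep: int = 2) -> Tuple[int, int]:
    """Return (best gapped score, segment index) after a gapless prefilter."""
    if not segments:
        raise ValueError("at least one reference segment is required")
    ranked = sorted(range(len(segments)),
                    key=lambda i: gapless_score(read, segments[i]), reverse=True)
    scored = [(smith_waterman_affine(read, segments[i]), i) for i in ranked[:keep]]
    return max(scored, key=lambda pair: pair[0])
```

Running the inexpensive gapless pass first limits the quadratic-time gapped alignment to the segments most likely to yield the best-scoring alignment.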
In various embodiments, a variant call module may be provided for
performing improved variant call functions that when implemented in one or both of software
and/or hardware configurations generate superior processing speed, better processed result
accuracy, and enhanced overall efficiency than the methods, devices, and systems currently
known in the art. Specifically, in one aspect, improved methods for performing variant call
operations in software and/or in hardware, such as for performing one or more HMM
operations on genetic sequence data, are provided. In another aspect, novel devices including
an integrated circuit for performing such improved variant call operations, where at least a
portion of the variant call operation is implemented in hardware, are provided.
Accordingly, in various instances, the methods disclosed herein may include
mapping, by a first subset of hardwired and/or quantum digital logic circuits, a plurality of
reads to one or more segments of one or more genetic reference sequences. Additionally, the
methods may include accessing, by the integrated and/or quantum circuits, e.g., by one or
more of the plurality of physical electrical interconnects, from the memory or a cache
associated therewith, one or more of the mapped reads and/or one or more of the genetic
reference sequences; and aligning, by a second subset of the hardwired and/or quantum
digital logic circuits, the plurality of mapped reads to the one or more segments of the one or
more genetic reference sequences.
In various embodiments, the method may additionally include accessing, by
the integrated and/or quantum circuit, e.g., by one or more of the plurality of physical
electrical interconnects from a memory or a cache associated therewith, the aligned plurality
of reads. In such an instance the method may include sorting, by a third subset of the
hardwired and/or quantum digital logic circuits, the aligned plurality of reads according to
their positions in the one or more genetic reference sequences. In certain instances, the
method may further include outputting, such as by one or more of the plurality of physical
electrical interconnects of the integrated and/or quantum circuit, result data from the mapping
and/or the aligning and/or the sorting, such as where the result data includes positions of the
mapped and/or aligned and/or sorted plurality of reads.
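The sorting step above can be pictured, in software terms only, as ordering aligned reads by reference sequence and then by position. The Python sketch below assumes each aligned read carries a reference name and a start position; `AlignedRead` and `sort_alignments` are hypothetical names.

```python
from typing import List, NamedTuple, Sequence


class AlignedRead(NamedTuple):
    name: str
    reference: str   # name of the reference sequence, e.g. a chromosome
    position: int    # start position of the alignment on that reference


def sort_alignments(reads: Sequence[AlignedRead],
                    reference_order: Sequence[str]) -> List[AlignedRead]:
    """Sort by the order of the reference sequences, then by position."""
    rank = {name: i for i, name in enumerate(reference_order)}
    return sorted(reads, key=lambda r: (rank[r.reference], r.position))
```

Position-sorted reads place all evidence for a given locus together, which is what a downstream variant call step would typically consume.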
In some instances, the method may additionally include using the obtained
result data, such as by a further subset of the hardwired and/or quantum digital logic circuits,
for the purpose of determining how the mapped, aligned, and/or sorted data, derived from the
subject's sequenced genetic sample, differs from a reference sequence, so as to produce a
variant call file delineating the genetic differences between the two samples. Accordingly, in
various embodiments, the method may further include accessing, by the integrated and/or
quantum circuit, e.g., by one or more of the plurality of physical electrical interconnects from
a memory or a cache associated therewith, the mapped and/or aligned and/or sorted plurality
of reads. In such an instance the method may include performing a variant call function, e.g.,
an HMM or paired HMM operation, on the accessed reads, by a third or fourth subset of the
hardwired and/or quantum digital logic circuits, so as to produce a variant call file detailing
how the mapped, aligned, and/or sorted reads vary from that of one or more reference, e.g.,
haplotype, sequences.
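To make the paired HMM operation more concrete, the following Python sketch computes the likelihood of a read given a candidate haplotype with a three-state (match, insertion, deletion) forward recursion. It is a simplified illustration under stated assumptions: a single constant base error rate and constant gap open and extension probabilities are used in place of per-base qualities, and the function name and default values are hypothetical.

```python
def pair_hmm_likelihood(read: str, haplotype: str, base_error: float = 0.01,
                        gap_open: float = 0.001, gap_extend: float = 0.1) -> float:
    """Forward-algorithm estimate of P(read | haplotype) for a pair HMM."""
    if not haplotype:
        raise ValueError("haplotype must not be empty")
    rows, cols = len(read) + 1, len(haplotype) + 1
    match = [[0.0] * cols for _ in range(rows)]
    insert = [[0.0] * cols for _ in range(rows)]
    delete = [[0.0] * cols for _ in range(rows)]
    # The read may start at any haplotype position with equal probability.
    for j in range(cols):
        delete[0][j] = 1.0 / len(haplotype)
    stay_match = 1.0 - 2.0 * gap_open   # match -> match
    close_gap = 1.0 - gap_extend        # insertion/deletion -> match
    for i in range(1, rows):
        for j in range(1, cols):
            same = read[i - 1] == haplotype[j - 1]
            emit = 1.0 - base_error if same else base_error / 3.0
            match[i][j] = emit * (stay_match * match[i - 1][j - 1]
                                  + close_gap * (insert[i - 1][j - 1]
                                                 + delete[i - 1][j - 1]))
            insert[i][j] = gap_open * match[i - 1][j] + gap_extend * insert[i - 1][j]
            delete[i][j] = gap_open * match[i][j - 1] + gap_extend * delete[i][j - 1]
    # The read may end at any haplotype position.
    return sum(match[-1][j] + insert[-1][j] for j in range(1, cols))
```

Comparing such likelihoods across candidate haplotypes indicates which haplotype best explains a read; practical implementations usually work in log space or with scaling to avoid underflow on long reads.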
Accordingly, in accordance with particular aspects of the disclosure, presented
herein is a compact hardware, e.g., chip based, or quantum accelerated platform for
performing secondary and/or tertiary analyses on genetic and/or genomic sequencing data.
Particularly, a platform or pipeline of hardwired and/or quantum digital logic circuits that
have specifically been designed for performing secondary and/or tertiary genetic analysis,
such as on sequenced genetic data, or genomic data derived therefrom, is provided.
Particularly, a set of hardwired digital and/or quantum logic circuits, which may be arranged
as a set of processing engines, may be provided, such as where the processing engines may be
present in a preconfigured and/or hardwired and/or quantum configuration on a processing
platform of the disclosure, and may be specifically designed for performing secondary
mapping and/or aligning and/or variant call operations related to genetic analysis on DNA
and/or RNA data, and/or may be specifically designed for performing other tertiary
processing on the results data.
In particular instances, the present devices, systems, and methods of
employing the same in the performance of one or more genomics and/or bioinformatics
secondary and/or tertiary processing protocols, have been optimized so as to deliver an
improvement in processing speed that is orders of magnitude faster than standard secondary
processing pipelines that are implemented in software. Additionally, the pipelines and/or
components thereof as set forth herein provide better sensitivity and accuracy on a wide range
of sequence derived data sets for the purposes of genomics and bioinformatics processing. In
various instances, one or more of these operations may be performed by an integrated
circuit that is part of or configured as a general purpose central processing unit and/or a
graphics processing unit and/or a quantum processing unit.
For example, genomics and bioinformatics are fields concerned with the
application of information technology and computer science to the field of genetics and/or
molecular biology. In particular, bioinformatics techniques can be applied to process and
analyze various genetic and/or genomic data, such as from an individual, so as to determine
qualitative and quantitative information about that data that can then be used by various
practitioners in the development of prophylactic, therapeutic, and/or diagnostic methods for
preventing, treating, ameliorating, and/or at least identifying diseased states and/or their
potential, and thus, improving the safety, quality, and effectiveness of health care on an
individualized level. Hence, because of their focus on advancing personalized healthcare,
genomics and bioinformatics fields promote individualized healthcare that is proactive,
instead of reactive, and this gives the subject in need of treatment the opportunity to become
more involved in their own wellness. An advantage of employing the genetics, genomics,
and/or bioinformatics technologies disclosed herein is that the qualitative and/or quantitative
analyses of molecular biological, e.g., genetic, data can be performed on a broader range of
sample sets at a much higher rate of speed and often times more accurately, thus expediting
the emergence of a personalized healthcare system. Particularly, in various embodiments, the
genomics and/or bioinformatics related tasks may form a genomics pipeline that includes one
or more of a micro-array analysis pipeline, a genome, e.g., whole genome analysis pipeline,
genotyping analysis pipeline, exome analysis pipeline, epigenome analysis pipeline,
metagenome analysis pipeline, microbiome analysis pipeline, genotyping analysis pipeline,
including joint genotyping, variants analysis pipelines, including structural variants, somatic
variants, and GATK, as well as RNA sequencing and other genetic analyses pipelines.
Accordingly, to make use of these advantages there exist enhanced and more
accurate software implementations for performing one or a series of such bioinformatics
based analytical techniques, such as for deployment by a general purpose CPU and/or GPU
and/or may be implemented in one or more quantum circuits of a quantum processing
platform. However, common characteristics of traditionally configured software based
bioinformatics methods and systems are that they are labor intensive, take a long time to
execute on such general purpose processors, and are prone to errors. Therefore,
bioinformatics systems as implemented herein that could perform these algorithms, such as
implemented in software by a CPU and/or GPU or quantum processing unit in a less labor
and/or processing intensive manner with a greater percentage accuracy would be useful.
Such implementations have been developed and are presented herein, such as
where the genomics and/or bioinformatics analyses are performed by optimized software run
on a CPU and/or GPU and/or quantum computer in a system that makes use of the genetic
sequence data derived by the processing units and/or integrated circuits of the disclosure.
Further, it is to be noted that the cost of analyzing, storing, and sharing this raw digital data
has far outpaced the cost of producing it. Accordingly, also presented herein are "just in
time" storage and/or retrieval methods that optimize the storage of such data in a manner that
substitutes the speed of regenerating the data in exchange for the cost of storing such data
collectively. Hence, the data generation, analysis, and "just in time" or "JIT" storage methods
presented herein solve a key bottleneck that is a long felt but unmet obstacle standing
between the ever-growing raw data generation and storage and the real medical insight being
sought from it.
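The "just in time" idea of regenerating derived data rather than storing it can be pictured with the following minimal Python sketch. It assumes that only the source data is retained and that derived results are recomputed on every request; the class `JitStore` and its methods are hypothetical and do not describe any particular storage scheme of the disclosure.

```python
from typing import Callable, Dict, Hashable


class JitStore:
    """Keeps only source data and regenerates derived results on request."""

    def __init__(self, regenerate: Callable):
        self._regenerate = regenerate              # e.g. a fast analysis pipeline
        self._sources: Dict[Hashable, object] = {}
        self.regenerations = 0                     # how often results were rebuilt

    def put(self, key: Hashable, source) -> None:
        """Store the source data only; no derived result is persisted."""
        self._sources[key] = source

    def get(self, key: Hashable):
        """Rebuild the derived result from the stored source data."""
        self.regenerations += 1
        return self._regenerate(self._sources[key])
```

The trade only pays off when regeneration is fast relative to the cost of keeping the derived data, which is the premise stated above for accelerated processing.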
Presented herein, therefore, are systems, apparatuses, and methods for
implementing genomics and/or bioinformatic protocols or portions thereof, such as for
performing one or more functions for analyzing genomic data, for instance, on one or both of
an integrated circuit, such as on a hardware processing platform, and a general purpose
processor, such as for performing one or more bioanalytic operations in software and/or on
firmware. For example, as set forth herein below, in various implementations, an integrated
circuit and/or quantum circuit is provided so as to accelerate one or more processes in a
primary, secondary, and/or tertiary processing platform. In various instances, the integrated
circuit may be employed in performing genetic analytic related tasks, such as mapping,
aligning, variant calling, compressing, decompressing, and the like, in an accelerated manner,
and as such the integrated circuit may include a hardware accelerated configuration.
Additionally, in various instances, an integrated and/or quantum circuit may be provided such
as where the circuit is part of a processing unit that is configured for performing one or more
genomics and/or bioinformatics protocols on the generated mapped and/or aligned and/or
variant called data.
Particularly, in a first embodiment, a first integrated circuit may be formed of
an FPGA, ASIC, and/or sASIC that is coupled to or otherwise attached to the motherboard
and configured, or in the case of an FPGA may be programmable by firmware to be
configured, as a set of hardwired digital logic circuits that are adapted to perform at least a
first set of sequence analysis functions in a genomics analysis pipeline, such as where the
integrated circuit is configured as described herein above to include one or more digital logic
circuits that are arranged as a set of processing engines, which are adapted to perform one or
more steps in a mapping, aligning, and/or variant calling operation on the genetic data so as
to produce sequence analysis results data. The first integrated circuit may further include an
output, e.g., formed of a plurality of physical electrical interconnects, such as for
communicating the result data from the mapping and/or the alignment and/or other
procedures to the memory.
Additionally, a second integrated and/or quantum circuit may be included,
coupled to or otherwise attached to the motherboard, and in communication with the memory
via a communications interface. The second integrated and/or quantum circuit may be formed
as a central processing unit (CPU) or graphics processing unit (GPU) or quantum processing
unit (QPU) that is configured for receiving the mapped and/or aligned and/or variant called
sequence analysis result data and may be adapted to be responsive to one or more software
algorithms that are configured to instruct the CPU or GPU to perform one or more genomics
and/or bioinformatics functions of the genomic analysis pipeline on the mapped, aligned,
and/or variant called sequence analysis result data. Specifically, the genomics and/or
bioinformatics related tasks may form a genomics analysis pipeline that includes one or more
of a micro-array analysis, a genome pipeline, e.g., whole genome analysis pipeline,
genotyping analysis pipeline, exome analysis pipeline, epigenome analysis pipeline,
metagenome analysis pipeline, microbiome analysis pipeline, genotyping analyses pipelines,
including joint genotyping, variants analyses pipelines, including structural variants, somatic
variants, and GATK, as well as RNA sequencing analysis pipeline and other genetic analyses
pipelines.
For instance, in one embodiment, the CPU and/or GPU and/or QPU of the
second integrated circuit may include software that is configured for arranging the genome
analysis pipeline for executing a whole genome analysis pipeline, such as a whole genome
analysis pipeline that includes one or more of genome-wide variation analysis, whole-exome
DNA analysis, whole transcriptome RNA analysis, gene function analysis, protein function
analysis, protein binding analysis, quantitative gene analysis, and/or a gene assembly
analysis. In certain instances, the whole genome analysis pipeline may be performed for the
purposes of one or more of ancestry analysis, personal medical history analysis, disease
diagnostics, drug discovery, and/or protein profiling. In a particular instance, the whole
genome analysis pipeline is performed for the purposes of oncology analysis. In various
instances, the results of this data may be made available, e.g. globally, throughout the system.
In various instances, the CPU and/or GPU and/or a quantum processing unit
(QPU) of the second integrated and/or quantum circuit may include software that is
configured for arranging the genome analysis pipeline for executing a genotyping analysis,
such as a genotyping analysis including joint genotyping. For instance, the joint genotyping
analysis may be performed using a Bayesian probability calculation, such as a Bayesian
probability calculation that results in an absolute probability that a given determined
genotype is a true genotype. In other instances, the software may be configured for
performing a metagenome analysis so as to produce metagenome result data that may in turn
be employed in the performance of a microbiome analysis.
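As a simple illustration of a Bayesian genotype probability, the Python sketch below applies Bayes' rule to the counts of reference and alternate alleles observed at one site in one diploid sample. The flat default priors, the single sequencing error rate, and the function name are assumptions; in a joint genotyping setting the priors would more plausibly be informed by the other samples in the cohort, which is not modelled here.

```python
from typing import Dict, Optional


def genotype_posteriors(ref_count: int, alt_count: int, error: float = 0.01,
                        priors: Optional[Dict[str, float]] = None) -> Dict[str, float]:
    """Posterior probability of each diploid genotype given allele counts."""
    if priors is None:
        priors = {"0/0": 1 / 3, "0/1": 1 / 3, "1/1": 1 / 3}
    # Probability that a single read shows the alternate allele under each genotype.
    alt_prob = {"0/0": error, "0/1": 0.5, "1/1": 1.0 - error}
    # Prior times binomial likelihood; the binomial coefficient is the same for
    # every genotype and cancels in the normalisation below.
    unnormalised = {
        g: priors[g] * (p ** alt_count) * ((1.0 - p) ** ref_count)
        for g, p in alt_prob.items()
    }
    total = sum(unnormalised.values())
    return {g: value / total for g, value in unnormalised.items()}
```

The largest posterior identifies the called genotype, and its value is the probability, under the model's assumptions, that the call is the true genotype.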
In certain instances, the first and/or second integrated circuit and/or the
memory may be housed on an expansion card, such as a peripheral component interconnect
(PCI) card. For instance, in various embodiments, one or more of the integrated circuits may
be one or more chips coupled to a PCIe card or otherwise associated with the motherboard. In
various instances, the integrated and/or quantum circuit(s) and/or chip(s) may be a
component within a sequencer or computer, or server, such as part of a server farm. In
particular embodiments, the integrated and/or quantum circuit(s) and/or expansion card(s)
and/or computer(s) and/or server(s) may be accessible via the internet, e.g., cloud.
Further, in some instances, the memory may be a volatile random access
memory (RAM), e.g., a direct access memory (DRAM). Particularly, in various
embodiments, the memory may include at least two memories, such as a first memory that is
an HMEM, e.g., for storing the reference haplotype sequence data, and a second memory that
is an RMEM, e.g., for storing the read of genomic sequence data. In particular instances, each
of the two memories may include a write port and/or a read port, such as where the write port
and the read port each access a separate clock. Additionally, each of the two memories
may include a flip-flop configuration for storing a multiplicity of genetic sequence and/or
processing result data.
Accordingly, in another aspect, the system may be configured for sharing
memory resources amongst its component parts, such as in relation to performing some
computational tasks via software, such as run by the CPU and/or GPU and/or quantum
processing platform, and/or performing other computational tasks via firmware, such as via
the hardware of an associated integrated circuit, e.g., FPGA, ASIC, and/or sASIC. This may
be achieved in a number of different ways, such as by a direct loose or tight coupling between
the CPU/GPU/QPU and the FPGA, e.g., chip or PCIe card. Such configurations may be
particularly useful when distributing operations related to the processing of the large data
structures associated with genomics and/or bioinformatics analyses to be used and accessed
by both the CPU/GPU/QPU and the associated integrated circuit. Particularly, in various
embodiments, when processing data through a genomics pipeline, as herein described, such
as to accelerate overall processing function, timing, and efficiency, a number of different
operations may be run on the data, which operations may involve both software and hardware
processing components.
Consequently, data may need to be shared and/or otherwise communicated,
between the software component(s) running on the CPU and/or GPU and/or QPU and/or the
hardware component embodied in the chip, e.g., an FPGA. Accordingly, one or more of the
various steps in the genomics and/or bioinformatics processing pipeline, or a portion thereof,
may be performed by one device, e.g., the CPU/GPU/QPU, and one or more of the various
steps may be performed by a hardwired device, e.g., the FPGA. In such an instance, the
CPU/GPU/QPU and/or the FPGA may be communicably coupled in such a manner to allow
the efficient transmission of such data, which coupling may involve the shared use of
memory resources. To achieve such distribution of tasks and the sharing of information for
the performance of such tasks, the various CPUs/GPUs/QPUs may be loosely or tightly
coupled to one another and/or the hardware devices, e.g., FPGA, or other chip set, such as by
a quick path interconnect.
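By way of non-limiting illustration only, the division of pipeline steps between software and hardwired devices, with results handed off through shared memory, may be sketched as follows. Everything in the sketch runs as ordinary Python on a CPU; the device labels are mere tags, and the names `run_pipeline`, `STAGES`, and the three toy step functions are hypothetical and do not appear in the specification.

```python
def preprocess(shared):
    """Software step: normalize the raw reads held in shared memory."""
    return [read.upper() for read in shared["reads"]]


def map_reads(shared):
    """Stand-in for a hardware mapping step: exact-match position of each read."""
    reference = shared["reference"]
    return {read: reference.find(read) for read in shared["preprocess"]}


def count_mapped(shared):
    """Software step: count the reads that were placed on the reference."""
    return sum(position >= 0 for position in shared["map"].values())


# Each stage is (result key, device tag, step function). The device tag only
# records where the step would notionally execute in a real platform.
STAGES = [
    ("preprocess", "CPU", preprocess),
    ("map", "FPGA", map_reads),
    ("count_mapped", "CPU", count_mapped),
]


def run_pipeline(stages, shared):
    """Run the stages in order, leaving every result in the shared store."""
    trace = []
    for name, device, step in stages:
        shared[name] = step(shared)  # the next stage reads this entry
        trace.append((name, device))
    return trace


if __name__ == "__main__":
    memory = {"reference": "ACGTACGGT", "reads": ["acgt", "cggt", "tttt"]}
    print(run_pipeline(STAGES, memory))
    print(memory["map"], memory["count_mapped"])
```

The point of the sketch is only that each step consumes the preceding step's output from a common store rather than receiving a private copy, which is the role the shared memory resources play in the coupling described above.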
Particularly, in various embodiments, a genomics analysis platform is
provided. For instance, the platform may include a motherboard, a memory, and a plurality of
integrated and/or quantum circuits, such as forming one or more of a CPU/GPU/QPU, a
mapping module, an alignment module, a sorting module, and/or a variant call module.
Specifically, in particular embodiments, the platform may include a first integrated and/or
quantum circuit, such as an integrated circuit forming a central processing unit (CPU) or
graphics processing unit (GPU), or a quantum circuit forming a quantum processor, that is
responsive to one or more software or other algorithms that are configured to instruct the
CPU/GPU/QPU to perform one or more sets of genomics analysis functions, as described
herein, such as where the CPU/GPU/QPU includes a first set of physical electronic
interconnects to connect with the motherboard. In various instances, the memory may also be
attached to the motherboard and may further be electronically connected with the
CPU/GPU/QPU, such as via at least a portion of the first set of physical electronic
interconnects. In such instances, the memory may be configured for storing a plurality of
reads of genomic data, and/or at least one or more genetic reference sequences, and/or an
index of the one or more genetic reference sequences.
Additionally, the platform may include one or more of another integrated
circuit(s), such as where each of the other integrated circuits forms a field programmable gate
array (FPGA) having a second set of physical electronic interconnects to connect with the
CPU/GPU/QPU and the memory, such as via a point-to-point interconnect protocol. In such
an instance, such as where the integrated circuit is an FPGA, the FPGA may be
programmable by firmware to configure a set of hardwired digital logic circuits that are
interconnected by a plurality of physical interconnects to perform a second set of genomics
analysis functions, e.g., mapping, aligning, variant calling, etc. Particularly, the hardwired
digital logic circuits of the FPGA may be arranged as a set of processing engines to perform
one or more pre-configured steps in a sequence analysis pipeline of the genomics analysis,
such as where the set(s) of processing engines include one or more of a mapping and/or
aligning and/or variant call module, which modules may be formed of the separate or the
same subsets of processing engines.
As indicated, the system may be configured to include one or more processing
engines, and in various embodiments, an included processing engine may itself be configured
for determining one or more transition probabilities for the sequence of nucleotides of the
read of genomic sequence going from one state to another, such as from a match state to an
Claims (29)
1. A genomics analysis platform for executing a sequence analysis pipeline, the genomics analysis platform comprising: a set of a first computing instance type, each first computing instance type having a CPU configured by one or more software algorithms for receiving one or more of a BCL file and/or a FASTQ file representing genomic sequence data, and performing one or more preprocessing steps on the genomic sequence data to produce a first set of result data; a set of a second computing instance type, each second computing instance type having an FPGA comprising a set of hardwired digital logic circuits configured by firmware to arrange a set of processing engines for accessing at least a portion of the first set of result data and performing mapping and aligning processing steps of the sequence analysis pipeline to produce a second set of result data, wherein performing the aligning processing steps comprises performing Smith-Waterman steps; and a second set of the first computing instance type for accessing at least a portion of the second set of result data and performing variant calling processing steps of the sequence analysis pipeline to produce a third set of result data, wherein performing the variant calling processing steps comprises performing one or more Hidden-Markov Model (HMM) processing steps.
2. The genomics analysis platform in accordance with claim 1, further comprising a reconfigurable storage connected with the set of first computing instances and the set of second computing instances, the reconfigurable storage being switchable between one or more of the set of first computing instances and one or more of the set of second computing instances for read and write access by successively active computing instances of the set of first computing instances or second computing instances.
3. The genomics analysis platform in accordance with claim 1, wherein the reconfigurable storage is connected to the set of first computing instances and the set of second computing instances via a data communication network.
4. The genomics analysis platform in accordance with claim 1, further comprising a work flow manager having a load estimator logic to estimate a data load of the sequence analysis pipeline to be performed by the set of first computing instances or the set of second computing instances, the work flow manager further being configured to instantiate a number of the set of first computing instances or the set of the second computing instances based on the estimated data load.
5. The genomics analysis platform in accordance with claim 4, wherein the load estimator logic is further configured to instantiate the number of the set of first computing instances or the set of second computing instances to be instantiated to optimize an efficiency of the genomics analysis platform for processing the data load.
6. The genomics analysis platform in accordance with claim 1, further comprising a set of third computing instances, each third computing instance having at least one graphical processing unit (GPU) that is responsive to one or more graphical processing algorithms that are configured to instruct the GPU to perform a third set of genomic processing steps of the sequence analysis pipeline, and wherein the reconfigurable storage is connected with the set of third server instances via the data communication network.
7. A genomics analysis platform for executing a sequence analysis pipeline, the genomics analysis platform comprising: a set of a first computing instance type, each first computing instance type having a CPU configured by one or more software algorithms for performing a first set of genomic processing steps of the sequence analysis pipeline to produce a first set of result data; a set of a second computing instance type, each second computing instance type having an FPGA comprising a set of hardwired digital logic circuits configured by firmware to arrange a set of processing engines for accessing at least a portion of the first set of result data and performing a second set of genomic processing steps comprising mapping and aligning processing steps of the sequence analysis pipeline to produce a second set of result data; a second set of the first computing instance type for accessing at least a portion of the second set of result data and performing a third set of genomic processing steps of the sequence analysis pipeline to produce a third set of result data; and a second set of the second computing instance type for accessing at least a portion of the third set of result data and performing a fourth set of genomic processing steps of the sequence analysis pipeline to produce a fourth set of result data.
8. The genomics analysis platform in accordance with claim 7, further comprising: a reconfigurable storage external to the sets of the first and second computing instance types, the reconfigurable storage being communicably switchable between the CPUs of the first computing instance type and the FPGAs of the second computing instance type, to receive, store and provide access to result data.
9. The genomics analysis platform in accordance with claim 7, wherein the first set of genomic processing steps includes one or more of image processing, base calling, error correction, BCL conversion, and FASTQ processing.
10. The genomics analysis platform in accordance with claim 9, wherein the mapping includes performing one or more of a Burrows-Wheeler transform and a hash-table function.
11. The genomics analysis platform in accordance with claim 9, wherein the aligning includes one or more of performing a Smith-Waterman, a Needleman-Wunsch, a gapless, and a gapped alignment.
12. The genomics analysis platform in accordance with claim 11, wherein the second set of genomic processing steps includes one or more of sorting and deduplication.
13. The genomics analysis platform in accordance with claim 12, wherein the second set of genomic processing steps includes performing a Smith-Waterman operation; or wherein the third set of genomic processing steps includes a Hidden Markov Model operation.
14. The genomics analysis platform in accordance with claim 7, wherein the reconfigurable storage is connected to the set of first computing instances and the set of second computing instances via a data communication network.
15. The genomics analysis platform in accordance with claim 7, further comprising a work flow manager having a load estimator logic to estimate a data load of the sequence analysis pipeline to be performed by the set of first computing instances or the set of second computing instances, the work flow manager further being configured to instantiate a number of the set of first computing instances or the set of the second computing instances based on the estimated data load.
16. The genomics analysis platform in accordance with claim 15, wherein the load estimator logic is further configured to instantiate the number of the set of first computing instances or the set of second computing instances to be instantiated to optimize an efficiency of the genomics analysis platform for processing the data load.
17. The genomics analysis platform in accordance with claim 16, further comprising a set of third computing instances, each third computing instance having at least one graphical processing unit (GPU) that is responsive to one or more graphical processing algorithms that are configured to instruct the GPU to perform a third set of genomic processing steps of the sequence analysis pipeline, and wherein the reconfigurable storage is connected with the set of third server instances via the data communication network.
18. The genomics analysis platform in accordance with claim 7, further comprising an on-demand cloud computing platform comprising a set of server computers and a set of databases, the set of server computers defining a virtual set of server computer instances that is configurable to provide the set of first server instances and the set of second server instances, and the set of databases defining a virtual database storage that is configurable to provide the reconfigurable storage.
19. A genomics analysis platform for executing a sequence analysis pipeline, the genomics analysis platform comprising: a set of first server instances and a set of second server instances connected with a data communication network, each first server instance having at least one central processing unit (CPU) that is responsive to one or more software algorithms that are configured to instruct the CPU to perform a first set of genomic processing steps of the sequence analysis pipeline, each second server instance having at least one field programmable gate array (FPGA), each FPGA being configured by firmware to arrange a set of hardwired digital logic circuits that are interconnected by a plurality of physical electrical interconnects to form a set of processing engines to perform a second set of genomic processing steps of the sequence analysis pipeline; and an elastic reconfigurable storage connectable with the set of first server instances and the set of second server instances via the data communication network, the elastic reconfigurable storage being switchable between one or more of the set of first server instances and one or more of the set of second server instances for read and write access by successively active server instances of the set of first server instances or second server instances.
20. The genomics analysis platform in accordance with claim 19, further comprising an elastic controller having load estimator logic to estimate a data load of the sequence analysis pipeline to be performed by the set of first server instances or the set of second server instances, the elastic control node further being configured to instantiate a number of the first server instances or the second server instances based on the estimated data load.
21. The genomics analysis platform in accordance with claim 20, wherein the load estimator logic is further configured to instantiate the number of the first server instances or the second server instances to be instantiated to optimize an efficiency of the genomics analysis platform for processing the data load.
22. The genomics analysis platform in accordance with claim 21, further comprising a set of third server instances, each third server instance having at least one graphical processing unit (GPU) that is responsive to one or more graphical processing algorithms that are configured to instruct the GPU to perform a third set of genomic processing steps of the sequence analysis pipeline, and wherein the reconfigurable storage is connected with the set of third server instances via the data communication network.
23. The genomics analysis platform in accordance with claim 21, wherein the second set of genomic processing steps includes mapping and aligning, and the second set of results data includes one or more of a mapped and aligned read.
24. The genomics analysis platform in accordance with claim 23, wherein the aligning includes performing a Smith-Waterman operation.
25. The genomics analysis platform in accordance with claim 19, further comprising an on-demand cloud computing platform comprising a set of server computers and a set of databases, the set of server computers defining a virtual set of server computer instances that is configurable to provide the set of first server instances and the set of second server instances, and the set of databases defining a virtual database storage that is configurable to provide the reconfigurable storage.
26. A genomics analysis platform for executing a sequence analysis pipeline, the genomics analysis platform comprising: one or more of a first server instance, each first server instance having at least one central processing unit (CPU) forming a compute node of the genomic analysis platform, each CPU being responsive to one or more software algorithms that are configured to instruct the CPU to perform a first set of genomic processing steps of the sequence analysis pipeline; one or more of a second server instance, each second server instance having at least one field programmable gate array (FPGA), each FPGA being configured by firmware to arrange a set of hardwired digital logic circuits that are interconnected by a plurality of physical electrical interconnects to form a set of processing engines to perform a second set of genomic processing steps of the sequence analysis pipeline; and an elastic reconfigurable storage external to the one or more first server instances and second server instances, the reconfigurable storage being communicably switchable between the one or more first server instances and the second server instances to receive and store result data of the first or second set of genomic processing steps of the sequence analysis pipeline, and provide access to the result data by a next server instance.
27. The genomics analysis platform in accordance with claim 26, further comprising an elastic controller having load estimator logic to estimate a data load of the sequence analysis pipeline to be performed by the set of first server instances or the set of second server instances, the elastic control node further being configured to instantiate a number of the first server instances or the second server instances based on the estimated data load.
28. The genomics analysis platform in accordance with claim 27, wherein the load estimator logic is further configured to instantiate the number of the first server instances or the second server instances to be instantiated to optimize an efficiency of the genomics analysis platform for processing the data load.
29. The genomics analysis platform in accordance with claim 28, further comprising an on-demand cloud computing platform comprising a set of server computers and a set of databases, the set of server computers defining a virtual set of server computer instances that is configurable to provide the set of first server instances and the set of second server instances, and the set of databases defining a virtual database storage that is configurable to provide the reconfigurable storage.
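By way of non-limiting illustration only, the Smith-Waterman operation recited in the claims above may be sketched in software as follows. The claims place this operation in FPGA processing engines; the sketch is a plain software rendering of the well-known local alignment recurrence and is not the claimed implementation. The function name `smith_waterman_score`, the linear gap penalty, and the scoring values are assumptions chosen for this example.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Uses the Smith-Waterman recurrence with a linear gap penalty:
    each cell is the maximum of zero, a diagonal (match/mismatch) move,
    and a gap in either sequence.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # Score matrix with a zero-filled first row and first column.
    h = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = h[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = h[i - 1][j] + gap    # gap in b
            left = h[i][j - 1] + gap  # gap in a
            # Clamping at zero lets a local alignment start anywhere.
            h[i][j] = max(0, diagonal, up, left)
            best = max(best, h[i][j])
    return best


if __name__ == "__main__":
    print(smith_waterman_score("TTACGTTT", "GGACGTGG"))
```

Because every cell depends only on its upper, left, and upper-left neighbours, cells along an anti-diagonal are independent of one another, which is one reason this recurrence is commonly mapped onto parallel hardware.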
Applications Claiming Priority (7)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US62/347,080 | 2016-06-07 | ||
US62/399,582 | 2016-09-26 | ||
US62/414,637 | 2016-10-28 | ||
US15/404,146 | 2017-01-11 | ||
US62/462,869 | 2017-02-23 | ||
US62/469,442 | 2017-03-09 | ||
US15/497,149 | 2017-04-25 |
Publications (1)
Publication Number | Publication Date |
---|---|
NZ789138A (en) | 2022-07-01 |