US20170177597A1 - Biological data systems - Google Patents
- Publication number
- US20170177597A1 (application US 15/386,729)
- Authority
- US
- United States
- Prior art keywords
- genomic sequence
- memory
- file
- index
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/13—File access structures, e.g. distributed indices
-
- G06F17/30091—
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/16—File or folder operations, e.g. details of user interfaces specifically adapted to file systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/188—Virtual file systems
-
- G06F17/30115—
-
- G06F17/30233—
-
- G06F19/28—
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
- G16B50/30—Data warehousing; Computing architectures
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
- G16B50/50—Compression of genetic data
Definitions
- second generation genome sequencing machines sequence each base of a subject sample multiple times. This is achieved by synthesis of a large number of nucleic acid reads covering the various portions of the sample. The average number of reads generated for each base can be as high as 30 or more using current genome sequencers, resulting in a large amount of data when sequencing genomes such as the human genome, which contains more than 3 billion base pairs.
- researchers wishing to process such large amounts of read data confront limitations posed by available computing resources.
- FIG. 1 shows a diagram of an exemplary computer that may be used to construct a system for executing the processes described herein.
- FIG. 2 shows a diagram of an exemplary system for performing the methods described herein.
- FIG. 4 depicts a flow diagram of a method for processing data.
- FIG. 5 shows various elements of the systems and processes described herein.
- FIG. 6 shows results of an experiment.
- Next-generation or second-generation sequencing devices, such as the HiSeq X Ten, employ massively parallel sequencing chemistry, creating millions of fragments of DNA against a template DNA sequence derived from a biological sample of interest.
- researchers have developed computational methods to align these fragments, known as reads, against a reference DNA sequence. Variations found in the aligned reads, or alignments, in comparison to the reference DNA sequence are identified for further downstream analysis.
- Raw reads generated by a sequencing device can be aligned against a reference sequence using an alignment program.
- Illumina sequencing devices can produce a bcl basecall file, which may be aligned directly by ISAAC aligner software, or can be converted into a fastq file, which is then aligned by software such as ISAAC aligner or BWA and Sambamba.
- the resulting file contains alignments, which are reads that have been matched to positions in a reference sequence.
- One common alignment file format is the SAM file. See Sequence Alignment/Map Format Specification by the SAM/BAM Format Specification Working Group. Because of their large size, SAM files will often be compressed to produce a BAM file. Even as a compressed format, BAM files may still be large. In order to facilitate use of a BAM file in later computational applications, BAM files are often used in conjunction with a file index, such as a BAI file. A file index in this context contains information that allows random access of a compressed file. BAI files can be used to quickly retrieve alignments overlapping a specified region of a sequence without going through all of the alignments contained in the BAM file.
- Existing methods of using a BAI and BAM file for retrieving alignments typically include a step of decompressing some portion of the BAM file. It may be advantageous in some applications to avoid decompression of a file containing genomic sequence data, such as a BAM file, since decompressing and recompressing parts of the file incur costs in time and consumption of computing resources.
- FIG. 1 is a block diagram of an exemplary computer or computing system 100 that may be used to construct a system for executing the processes described herein.
- Computer 100 includes a processor 102 for executing instructions.
- the processor 102 represents one processor or a multiprocessor system including several processors (e.g., two, four, eight, or another suitable number).
- Processor 102 may include any suitable processor capable of executing instructions.
- processor 102 may be general-purpose or embedded processors implementing any of a variety of instruction set architectures (ISAs), such as the x86, PowerPC, SPARC, or MIPS ISAs, or any other suitable ISA.
- each of the processors may commonly, but not necessarily, implement the same ISA.
- executable instructions are stored in a memory 104 , which is accessible by and coupled to the processor 102 .
- Memory 104 is any device allowing information, such as executable instructions and/or other data, to be stored and retrieved.
- a memory may be volatile memory, nonvolatile memory or a combination of one or more volatile and one or more nonvolatile memory.
- Computer 100 may, in some embodiments, include a user interface device 110 for receiving data from or presenting data to user 108 .
- User 108 may interact indirectly with computer 100 via another computer.
- User interface device 110 may include, for example, a keyboard, a pointing device, a mouse, a stylus, a touch sensitive panel (e.g., a touch pad or a touch screen), a gyroscope, an accelerometer, a position detector, an audio input device or any combination thereof.
- user interface device 110 receives data from user 108 , while another device (e.g., a presentation device) presents data to user 108 .
- user interface device 110 has a single component, such as a touch screen, that both outputs data to and receives data from user 108 .
- user interface device 110 operates as a component or presentation device for presenting or conveying information to user 108 .
- user interface device 110 may include, without limitation, a display device (e.g., a liquid crystal display (LCD), organic light emitting diode (OLED) display, or electronic ink display), an audio output device (e.g., a speaker or headphones) or both.
- user interface device 110 includes an output adapter, such as a video adapter, an audio adapter or both.
- An output adapter is operatively coupled to processor 102 and configured to be operatively coupled to an output device, such as a display device or an audio output device.
- Computer 100 also includes a network communication interface 112 , which enables computer 100 to communicate with a remote device (e.g., another computer) via a communication medium, such as a wired or wireless packet network.
- computer 100 may transmit or receive data via network communication interface 112 .
- User interface device 110 or network communication interface 112 may be referred to collectively as an input interface and may be configured to receive information from user 108 .
- Any server, compute node, controller or object store (or storage, used interchangeably) described herein may be implemented as one or more computers (whether local or remote).
- Object stores include memory for storing and accessing data.
- One or more computers or computing systems 100 can be used to execute program instructions to perform any of the methods and operations described herein.
- a system comprises a memory and a processor coupled to the memory, wherein the memory comprises program instructions executable by the processor to perform any of the methods and operations described herein.
- FIG. 2 shows a diagram of an exemplary system for performing the steps, operations and methods described herein.
- the user interacts with one or more local computers 201 in communication with one or more remote servers (controllers) 203 by way of one or more networks 202 .
- User 108 via his or her local computers 201 , instructs controllers 203 to initiate processing.
- the remote controllers 203 may themselves be in communication with each other through one or more networks 202 and may be further connected to one or more remote compute nodes 204 / 205 , also via one or more networks 207 .
- Controllers 203 provision one or more compute nodes 204 / 205 to process the data, such as genomic sequence data.
- Remote compute nodes 204 / 205 may be connected to one or more object storage 206 via one or more networks 208 .
- the data such as genomic sequence data, may be stored in object storage 206 .
- one or more networks shown in FIG. 2 overlap.
- the user interacts with one or more local computers in communication with one or more remote computers by way of one or more networks.
- the remote computers may themselves be in communication with each other through the one or more networks.
- a subset of local computers is organized as a cluster or a cloud as understood in the art.
- some or all of the remote computers are organized as a cluster or a cloud.
- a user interacts with a local computer in communication with a cluster or a cloud via one or more networks.
- a user interacts with a local computer in communication with a remote computer via one or more networks.
- a file such as a genomic sequence file or an index, is stored in an object store, for example, in a local or remote computer (such as a cloud).
- FIG. 3 shows a schematic of the data of interest.
- a reference sequence 301 contains a known sequence of characters, each numbered serially to provide a position or location within the sequence.
- the characters can be any data, biological data being of particular interest.
- the characters of a reference sequence refer to genomic data or genomic sequence data, such as the various nucleic acids that compose DNA or RNA.
- a sequencing device receives a sample containing genetic material and generates a collection of short strings of nucleic acids, referred to as reads 302 . Any of a whole host of algorithms and software known in the art can be used to match or align the reads against a reference sequence.
- Genomic sequence files contain the genomic sequence data described herein, as well as other information.
- a genomic sequence file can comprise data on a plurality of alignments against one or more reference sequences.
- reads, positions and optional information, such as read quality, are arranged in some logical manner in a genomic sequence file.
- the SAM and BAM formats may contain alignments sorted by reference ID and then by the leftmost position coordinate of the alignment.
- SAM and BAM files, as well as other types of genomic sequence files, also may include header data, which may include format version, reference sequence names and lengths and other identifiers.
- Genomic sequence files can also include an end-of-file (EOF) string or marker, which is a series of data that is interpreted to indicate the end of the file. In BAM files, the end-of-file marker is 28 bytes.
- BGZF block compression is used, for example, to generate a BAM file from a SAM file.
- sequential blocks of a SAM file are bgzipped into compressed BGZF blocks.
- In BGZF block compression, the blocks before and after compression do not exceed 64 KiB (65,536 bytes). BAM files are thus a concatenation of BGZF blocks.
- a compressed file can be randomly accessed at various points throughout the compressed file in order to grab certain portions of the data. It is noted that the maximum size of a compressed block of data can be no greater than the maximum size of the same block uncompressed.
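As a minimal sketch of the block structure described above, the following Python walks a BGZF byte string block by block using only the block-size field in each block header, so no block is decompressed. The header layout follows the SAM/BAM format specification; the helper names `make_bgzf_block` and `block_offsets` are illustrative assumptions and do not come from this disclosure.

```python
import struct
import zlib

# The 28-byte BAM end-of-file marker, which is an empty BGZF block.
BGZF_EOF = bytes.fromhex(
    "1f8b0804" "00000000" "00ff" "0600"   # gzip header with the FEXTRA flag set
    "42430200" "1b00"                     # 'BC' subfield, BSIZE = 27 (block length - 1)
    "0300"                                # empty raw-deflate stream
    "00000000" "00000000"                 # CRC32 and uncompressed size, both zero
)


def make_bgzf_block(payload: bytes) -> bytes:
    """Build one BGZF block (illustrative helper, used here to create test data)."""
    compressor = zlib.compressobj(6, zlib.DEFLATED, -15)  # -15 selects raw deflate
    cdata = compressor.compress(payload) + compressor.flush()
    bsize = 12 + 6 + len(cdata) + 8 - 1  # total block length minus one
    header = struct.pack("<4BI2BH", 0x1F, 0x8B, 8, 4, 0, 0, 0xFF, 6)
    extra = struct.pack("<2BHH", ord("B"), ord("C"), 2, bsize)
    trailer = struct.pack("<II", zlib.crc32(payload) & 0xFFFFFFFF, len(payload))
    return header + extra + cdata + trailer


def block_offsets(data: bytes):
    """Return (start, length) for every BGZF block without decompressing any of them."""
    offsets = []
    pos = 0
    while pos < len(data):
        if data[pos:pos + 4] != b"\x1f\x8b\x08\x04":
            raise ValueError(f"no BGZF block header at byte {pos}")
        (xlen,) = struct.unpack_from("<H", data, pos + 10)
        field, extra_end, bsize = pos + 12, pos + 12 + xlen, None
        while field < extra_end:  # scan the gzip extra subfields for 'BC'
            si1, si2, slen = struct.unpack_from("<2BH", data, field)
            if (si1, si2, slen) == (66, 67, 2):
                (bsize,) = struct.unpack_from("<H", data, field + 4)
            field += 4 + slen
        if bsize is None:
            raise ValueError(f"block at byte {pos} has no BSIZE subfield")
        offsets.append((pos, bsize + 1))
        pos += bsize + 1
    return offsets
```

Because each block records its own length, a reader can jump from header to header and copy whole blocks verbatim.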
- FIG. 4 depicts a flow diagram of a method for processing data.
- these data are genomic sequence data (for example, reads or alignments) contained in a genomic sequence file, such as those described herein (in particular, a BAM file).
- the genomic sequence file comprises a header and genomic sequence data.
- the genomic sequence file comprises a header, genomic sequence data and an end-of-file marker.
- the genomic sequence file is compressed and no portion thereof is decompressed in performing the processes described herein, such as in FIG. 4 .
- genomic sequence data are compressed and no portion thereof is decompressed in performing the processes described herein, such as in FIG. 4 .
- a first index is written into memory.
- the first index generally contains data referring to locations in a compressed genomic sequence file where an alignment or chunks of alignments can be found.
- a chunk is a set of compressed alignments whose size is based on a predetermined amount of memory in which the uncompressed data fit.
- the locations in the first index may be given as a virtual file offset, an integer in which one portion refers to a byte offset into the compressed file to the beginning of a compressed block and another portion that refers to a byte offset into the data stream of such uncompressed block.
- a virtual file offset could be a 32-bit unsigned integer in which the leftmost 16 bits refer to a byte offset into the compressed file to the beginning of a compressed block and the rightmost 16 bits refer to the byte offset into the data stream of the uncompressed block.
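The packing described above can be sketched in a few lines of Python. This sketch assumes a 16-bit within-block field, as in the example in the text; the SAM/BAM specification uses the same 16-bit low field inside a 64-bit value, so the same shift applies there. The function names are hypothetical.

```python
UOFFSET_BITS = 16                       # width of the within-block (uncompressed) offset
UOFFSET_MASK = (1 << UOFFSET_BITS) - 1


def pack_voffset(block_start: int, within_block: int) -> int:
    """Combine a compressed-file byte offset and a within-block offset into one integer."""
    if not 0 <= within_block <= UOFFSET_MASK:
        raise ValueError("within-block offset does not fit in the low field")
    return (block_start << UOFFSET_BITS) | within_block


def unpack_voffset(voffset: int):
    """Split a virtual file offset back into (block_start, within_block)."""
    return voffset >> UOFFSET_BITS, voffset & UOFFSET_MASK
```

For example, a block starting at byte 1000 of the compressed file with a within-block offset of 42 packs to 1000 * 65536 + 42.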
- the first index may also contain other data, such as bins, to facilitate indexing and fast retrieval. Alignments can be grouped into or associated with such bins and arranged or ordered in the file index accordingly. Thus, a first index may contain data on one or more reference sequences, each divided into bins, which themselves contain alignments.
- the file index is a BAI file.
- a magic string is followed by a number of reference sequences, each of which has a number of bins, each of which contains a number of chunks, each of which is characterized by a beginning virtual file offset and an end virtual file offset representing the start and end of the chunk respectively.
- a set of virtual file offsets each of which represents the first alignment in a series of 16 kbp intervals across the reference sequence, thus forming a linear index.
- the BAI file optionally ends with an integer representing the number of unplaced unmapped reads.
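The BAI layout outlined above (magic string, references, bins, chunks, linear index, optional trailing count) can be read with a short parser. This is a sketch that assumes the little-endian field widths given in the SAM/BAM specification; `parse_bai` and the dictionary keys it returns are illustrative names, not part of this disclosure.

```python
import struct


def parse_bai(buf: bytes) -> dict:
    """Parse a BAI index held in memory into plain Python containers."""
    if buf[:4] != b"BAI\x01":
        raise ValueError("missing BAI magic string")
    pos = 4
    (n_ref,) = struct.unpack_from("<i", buf, pos)
    pos += 4
    refs = []
    for _ in range(n_ref):
        (n_bin,) = struct.unpack_from("<i", buf, pos)
        pos += 4
        bins = {}
        for _ in range(n_bin):
            bin_id, n_chunk = struct.unpack_from("<Ii", buf, pos)
            pos += 8
            chunks = []
            for _ in range(n_chunk):
                # Each chunk is a (begin, end) pair of virtual file offsets.
                chunks.append(struct.unpack_from("<QQ", buf, pos))
                pos += 16
            bins[bin_id] = chunks
        (n_intv,) = struct.unpack_from("<i", buf, pos)
        pos += 4
        linear = list(struct.unpack_from(f"<{n_intv}Q", buf, pos))
        pos += 8 * n_intv
        refs.append({"bins": bins, "linear": linear})
    n_no_coor = None
    if len(buf) - pos >= 8:  # optional count of unplaced unmapped reads
        (n_no_coor,) = struct.unpack_from("<Q", buf, pos)
    return {"refs": refs, "n_no_coor": n_no_coor}
```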
- a first index is written into nonvolatile memory.
- a URL is used to access the contents of a file, such as the first index.
- the file is located on a computer remote from the local computer that is used to analyze or store some or all of contents of the file.
- the remote computer in some instances may store the file in an object store.
- a URL to the first index is written into memory.
- the first index accessed at the URL is written into memory.
- the preallocated memory can be calculated with high precision in order to minimize the amount of preallocated memory that does not have data written to it. That is, it is possible to calculate down to a single byte the amount of memory that needs to be preallocated. For a remote file accessed by the HTTP protocol, this calculation could comprise making an HTTP HEAD request. Thus, in some embodiments, memory is preallocated for the first index.
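One way to size the buffer exactly is sketched below, assuming the server reports a `Content-Length` header in response to a HEAD request. The function names `remote_size` and `preallocate` and the injectable `opener` parameter are assumptions made for illustration; they are not part of this disclosure.

```python
import urllib.request


def remote_size(url: str, opener=urllib.request.urlopen) -> int:
    """Ask the server for the file size with a HEAD request (no body is transferred)."""
    request = urllib.request.Request(url, method="HEAD")
    with opener(request) as response:
        return int(response.headers["Content-Length"])


def preallocate(size: int) -> bytearray:
    """Reserve exactly `size` bytes of memory to receive the file contents."""
    return bytearray(size)
```

Passing the opener as a parameter keeps the sketch testable without network access; a caller would normally leave it at its default.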
- Asynchronous I/O methods can optionally be used to quickly retrieve data from any file stored remotely.
- a first index is loaded into memory by asynchronous I/O.
- the use of asynchronous I/O allows the program's execution to continue while data is being transmitted.
- To read or write from a network connection, the program performs non-blocking read or write system calls to the kernel, and monitors all pending operations using a single polling call to the kernel. During the time that no read or write requests can be fulfilled, the program is free to perform other processing tasks.
- a number of different asynchronous I/O libraries are widely available for this purpose. For example, libuv is a multiplatform support library designed around an event-driven asynchronous I/O model. With asynchronous I/O, a plurality of simultaneous connections between a local computer and a remote computer can be utilized to transfer data between the computers.
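The idea of several transfers in flight at once can be sketched with Python's standard `asyncio` module, used here as a stand-in for an event-driven library such as libuv. The coroutine `fetch_range` is a hypothetical placeholder that slices an in-memory byte string; a real client would await a ranged network read at that point.

```python
import asyncio


async def fetch_range(source: bytes, start: int, end: int) -> bytes:
    """Placeholder for one ranged read; yields to the event loop as real I/O would."""
    await asyncio.sleep(0)  # a real implementation would await socket I/O here
    return source[start:end]


async def fetch_all(source: bytes, intervals, max_connections: int = 4) -> bytes:
    """Fetch all intervals concurrently, with at most `max_connections` in flight."""
    limit = asyncio.Semaphore(max_connections)

    async def guarded(start, end):
        async with limit:
            return await fetch_range(source, start, end)

    # gather() returns results in the order the coroutines were passed in,
    # so the pieces can be joined directly.
    parts = await asyncio.gather(*(guarded(s, e) for s, e in intervals))
    return b"".join(parts)
```

While one transfer is waiting, the event loop is free to advance the others, which is the behavior the text describes.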
- a byte range of the genomic sequence file is calculated.
- the byte range is determined based on the data contained in the first index and a genomic range selected and input by the user.
- the genomic range is a coordinate range (such as one or more identification or index numbers) referring to one or more reference sequences.
- the reference sequence can represent any structure of interest, in particular a biological structure such as a chromosome.
- the genomic range consists of whole contiguous reference sequences. A whole contiguous reference sequence is one that has not been divided further. Calculating the byte range of the genomic sequence file may be done by a processor.
- the bins into which the reference sequence has been divided can be traversed to determine the lowest and highest virtual file offsets of the chunks contained within the bins. Bins that contain metadata and not any actual sequences (i.e., pseudobins) may be ignored. The lowest and highest virtual file offsets of the aggregate of these chunks determine the region of the genomic sequence file containing the alignment data of interest to the user.
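The traversal described above reduces to a minimum and maximum over chunk boundaries. The sketch below assumes bins are held as a mapping from bin number to a list of (begin, end) virtual file offsets, that the within-block field is 16 bits wide, and that the metadata pseudobin carries the number 37450 as written by common BAI producers; all names are illustrative.

```python
PSEUDO_BIN = 37450  # assumed metadata bin number; it holds no alignments


def virtual_offset_span(bins: dict):
    """Return (lowest, highest) virtual file offsets over all real bins, or None."""
    lowest = highest = None
    for bin_id, chunks in bins.items():
        if bin_id == PSEUDO_BIN:
            continue  # metadata only, ignored as described above
        for beg, end in chunks:
            lowest = beg if lowest is None else min(lowest, beg)
            highest = end if highest is None else max(highest, end)
    if lowest is None:
        return None
    return lowest, highest


def compressed_positions(span):
    """Drop the within-block bits to get byte positions of the first and last blocks."""
    low_v, high_v = span
    return low_v >> 16, high_v >> 16
```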
- the entire genomic sequence file is retrieved.
- the lowest virtual file offset is not found and the genomic sequence file is smaller than the maximum size of a compression block plus an end-of-file string
- at least the first compression block is retrieved (by retrieving the maximum compression block size starting at the beginning of the file) along with the end-of-file string. Because the contents of an end-of-file string may be known in advance, in some instances, instead of retrieving the end-of-file string, such string may be simply appended to retrieved data.
- “low position” 501 indicates the beginning of the first compressed block of the genomic sequence file 504 to retrieve.
- the highest virtual file offset is a “high position” 502 indicating the beginning of the last compressed block of the genomic sequence file to retrieve.
- the maximum size of a compressed block of data is added to the high position resulting in a new high position 503 that is used to determine the end point of the length of data to retrieve from the genomic sequence file. If the new high position is greater than the size of the genomic sequence file less one plus the EOF marker, then the high position is simply set to the end of the genomic sequence file.
- the header of the genomic sequence file is written into memory. Since the header is located at the beginning of the genomic sequence file, in these embodiments, at least the first compressed block is retrieved by retrieving the maximum size of a compressed block starting from the beginning of the genomic sequence file. If the location of the low position is less than or equal to the maximum size of a compressed block, then the low position is reset to the beginning of the genomic sequence file (zero). If the location of the low position is greater than the maximum size of a compressed block, then a “first-block offset” between the end of the first block retrieved and the low position is determined and stored for later use. In some embodiments, the header of the genomic sequence file is written into nonvolatile memory.
- the length between the low and high position may then be divided into a set of intervals by which portions of the genomic sequence file are retrieved.
- the EOF of the genomic sequence file is retrieved.
- an additional interval equal to the length of the EOF is appended to the intervals.
- the EOF of the genomic sequence file is not retrieved.
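The position arithmetic described above can be sketched as follows. This is one reading of the text under stated assumptions: the new high position is simply clamped to the file size (a simplification of the EOF-aware comparison), the header is fetched as the first 65,536 bytes, and the interval length of 1 MiB is an arbitrary choice. The function name and the keys of the returned dictionary are hypothetical.

```python
MAX_BLOCK = 65536   # maximum size of one compressed block, in bytes
EOF_LEN = 28        # length of the BAM end-of-file marker, in bytes


def plan_retrieval(low, high, file_size, interval=1 << 20, want_eof=True):
    """Turn low/high block positions into byte intervals to fetch from the file."""
    # Extend past the start of the last block so the whole block is covered.
    new_high = min(high + MAX_BLOCK, file_size)
    if low <= MAX_BLOCK:
        # The header fetch would overlap the data, so start from byte zero.
        low, header, first_block_offset = 0, None, 0
    else:
        header = (0, MAX_BLOCK)
        # Gap between the end of the header fetch and the low position.
        first_block_offset = low - MAX_BLOCK
    intervals = [(start, min(start + interval, new_high))
                 for start in range(low, new_high, interval)]
    eof = None
    if want_eof and new_high < file_size:
        eof = (max(new_high, file_size - EOF_LEN), file_size)
    return {"header": header, "intervals": intervals, "eof": eof,
            "first_block_offset": first_block_offset}
```

Because the end-of-file marker is a known constant, a caller could also skip the `eof` interval and append the marker locally.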
- a portion of the genomic sequence file is written into memory.
- a URL to the genomic sequence file is written into memory, and a portion of the genomic sequence file accessed at the URL is written into memory (for example, nonvolatile memory).
- the portion of the genomic sequence file accessed at the URL and written into memory comprises genomic sequence data located within the byte range of block 402
- the portion of the genomic sequence file accessed at the URL and written into memory comprises a header and genomic sequence data located within the byte range of block 402 .
- a process comprises writing into the memory a URL to the genomic sequence file and transferring into memory (for example, nonvolatile memory) a portion of the genomic sequence file accessed at the URL, wherein the portion comprises genomic sequence data.
- the portion of the genomic sequence data that is written into memory will generally be greater than or equal to 1 byte but less than the entirety of the genomic data.
- a portion of the genomic sequence file is written into nonvolatile memory.
- one or more portions of the genomic sequence file are retrieved using an asynchronous I/O library.
- the same functions used in writing the first index into memory can also be used to write portions of the genomic sequence file.
- exemplary libraries include libuv.
- a second genomic sequence file is generated.
- the second genomic sequence file will generally comprise various portions of an original or first genomic sequence file, such portions including any combination of some or all of the header, some or all of the genomic sequence data and some or all of the end-of-file marker.
- the second genomic sequence file comprises a header and genomic sequence data.
- the second genomic sequence file comprises a header, genomic sequence data and an end-of-file marker.
- the second genomic sequence file comprises genomic sequence data. It may be advantageous to avoid decompression and recompression of the first genomic sequence file. Thus, the various portions of a first or original genomic sequence file can be retrieved and, without decompression, assembled into the second genomic sequence file.
- no portion of the first genomic sequence file is decompressed.
- the header is not decompressed.
- the genomic sequence data are not decompressed.
- no portion of the header or genomic sequence data is decompressed.
- memory (such as nonvolatile memory) is preallocated for the second genomic sequence file.
- the second genomic sequence file is not a proper compressed file. In these embodiments, the second genomic sequence file cannot be linearly decompressed using the compression algorithm that generated the original file whose portions were used to assemble the second genomic sequence file.
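Assembly without decompression amounts to copying byte ranges into a preallocated buffer. The sketch below works on an in-memory byte string for simplicity; the function name `assemble` and its parameters are illustrative assumptions rather than the disclosed implementation.

```python
def assemble(first_file: bytes, header_range, data_range, eof: bytes = b"") -> bytes:
    """Concatenate raw byte ranges of the first file; nothing is decompressed."""
    pieces = []
    if header_range is not None:
        pieces.append(first_file[header_range[0]:header_range[1]])
    pieces.append(first_file[data_range[0]:data_range[1]])
    pieces.append(eof)

    out = bytearray(sum(len(p) for p in pieces))  # preallocated to the exact size
    pos = 0
    for piece in pieces:
        out[pos:pos + len(piece)] = piece
        pos += len(piece)
    return bytes(out)
```

Since the header range may end in the middle of a compressed block, the result is generally not decompressible end to end, which is consistent with the statement above that the second file need not be a proper compressed file.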
- a second index is calculated. Because only a subset of the first genomic sequence file will have been retrieved, a second index different from the first index is needed in order to use such subset. Using the first index as a template to generate a second index, all reference sequences outside of the genomic range can be deleted. Thus, only a subset of the chunks described in the first index will need to be retained in many cases.
- the virtual file offsets of the start and end points of each chunk also need to be adjusted to reflect their locations in a second genomic sequence file.
- Each start and end point is bit-shifted right by the bit-length of the maximum uncompressed offset and then the first-block offset is subtracted.
- the bits representing the uncompressed offset are then appended to this new value to produce a new start and end point. This process is repeated for each chunk of each reference sequence that is to be retained.
- the first-block offset is also used in a similar manner to adjust the locations of the first alignment contained in each interval of the linear index.
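The adjustment of each virtual file offset can be sketched as a shift, a subtraction, and a re-pack. The sketch assumes a 16-bit within-block field and a first-block offset measured in bytes of the compressed file; the function names are hypothetical.

```python
UOFFSET_BITS = 16
UOFFSET_MASK = (1 << UOFFSET_BITS) - 1


def shift_voffset(voffset: int, first_block_offset: int) -> int:
    """Relocate one virtual file offset into the second genomic sequence file."""
    block_start = voffset >> UOFFSET_BITS   # byte position of the compressed block
    within_block = voffset & UOFFSET_MASK   # unchanged, since blocks are copied verbatim
    new_start = block_start - first_block_offset
    if new_start < 0:
        raise ValueError("offset lies before the retained region")
    return (new_start << UOFFSET_BITS) | within_block


def shift_chunks(chunks, first_block_offset: int):
    """Apply the same relocation to the begin and end of every retained chunk."""
    return [(shift_voffset(beg, first_block_offset),
             shift_voffset(end, first_block_offset)) for beg, end in chunks]


def shift_linear_index(linear, first_block_offset: int):
    """Relocate the per-interval offsets of the linear index in the same way."""
    return [shift_voffset(v, first_block_offset) for v in linear]
```

The within-block bits are carried over unchanged because the compressed blocks themselves are not modified.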
- the second index is written into memory.
- the second index is written into nonvolatile memory.
- the steps or operations described herein, such as shown in FIG. 4 may be arranged in a different order. For example, in some embodiments, after block 402 where a byte range of the genomic sequence file has been calculated, block 404 (calculating a second index) and block 405 (writing the second index into memory) can take place before block 403 (writing into memory a portion of the genomic sequence file). Certain steps or operations described sequentially may in some cases be performed concurrently.
- the genomic sequence file is compressed and no portion thereof is decompressed in performing the process.
- genomic sequence data are compressed and no portion thereof is decompressed in performing the process.
- the information or data in a second index and in a second genomic sequence file are passed to one or more downstream algorithms or software for additional processing.
- the second genomic sequence file contains genomic sequences that have been aligned to a certain position in the reference sequence.
- Various other analyses known in the art can be performed to extract additional information from the second genomic sequence file. For example, it may be useful to determine if the alignments indicate a variant with respect to a reference sequence.
- the second genomic sequence file is passed as input to variant call software.
- the second genomic sequence file and the second index are passed as input to variant call software. Many variant call algorithms and software are known in the art.
- the variant call software is GATK (HaplotypeCaller).
- the downstream algorithm or software could employ a method for parallel processing of the files.
- one or more executions of the variant call software are simultaneously initiated, wherein each execution uses one or more threads.
- each of the executions is run on a separate cloud instance.
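Launching several executions at once can be sketched with the Python standard library. Each entry in `commands` is assumed to be a complete command line for one execution of a variant caller on one second genomic sequence file; the actual command lines depend on the installed software and are deliberately not shown. The function name `run_parallel` is hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_parallel(commands, max_workers: int = 4):
    """Run each command as its own process and return their outputs in order."""

    def run(command):
        # check=True raises if the process exits with a non-zero status.
        result = subprocess.run(command, capture_output=True, text=True, check=True)
        return result.stdout

    # Threads are used only to wait on the child processes; the work itself
    # happens in the separate processes, which run simultaneously.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run, commands))
```

On a single machine this bounds concurrency by `max_workers`; running each execution on a separate cloud instance would replace the local process launch with a remote one.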
- Other types of analysis that might be performed on the information or data in the second index and second genomic sequence file include multi-sample processing, annotation and filtering of variants, data aggregation, association analysis, population structure analysis and visualization, among others.
- Any of the disclosed operations, combinations of operations or methods can be implemented as computer-executable instructions stored on one or more computer-readable storage media (e.g., one or more optical media discs, volatile memory components (such as DRAM or SRAM), or non-volatile memory components (such as flash memory or hard drives)) and executed on a computer (e.g., any commercially available computer, including smart phones or other mobile devices that include computing hardware).
- Any of the computer-executable instructions for implementing the disclosed techniques as well as any data created and used during implementation of the disclosed embodiments can be stored on one or more computer-readable storage media.
- the computer-executable instructions can be part of, for example, a dedicated software application or a software application that is accessed or downloaded via a web browser or other software application (such as a remote computing application).
- Such software can be executed, for example, on a single local computer (e.g., any suitable commercially available computer) or in a network environment (e.g., via the Internet, a wide-area network, a local-area network, a client-server network (such as a cloud computing network), or other such network) using one or more network computers.
- any functionality described herein can be performed, at least in part, by one or more hardware logic components, instead of software.
- illustrative types of hardware logic components include field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), system-on-a-chip systems (SOCs), complex programmable logic devices (CPLDs), etc.
- FIG. 6 shows the time savings gained by utilizing the systems and processes described herein.
- GATK HaplotypeCaller was run on files produced by a system described herein and on BAM files that were sliced by Sambamba Slice. As shown in FIG. 6 , the presently described system finished processing the largest reference sequence in 3 hours and 15 minutes, compared to about 4 hours using files produced by Sambamba Slice.
Abstract
Description
- This application claims priority to U.S. Provisional Patent Application No. 62/271,204 filed Dec. 22, 2015, the disclosure of which is hereby incorporated by reference in its entirety for all purposes.
- In order to achieve high confidence in calling bases in genomic sequencing applications, second generation genome sequencing machines sequence each base of a subject sample multiple times. This is achieved by synthesis of a large number of nucleic acid reads covering the various portions of the sample. The average number of reads generated for each base can be as high as 30 or more using current genome sequencers, resulting in a large amount of data when sequencing genomes such as the human genome, which contains more than 3 billion base pairs. Researchers wishing to process such large amounts of read data confront limitations posed by available computing resources. These and other challenges are addressed by the systems and processes described herein.
- FIG. 1 shows a diagram of an exemplary computer that may be used to construct a system for executing the processes described herein.
- FIG. 2 shows a diagram of an exemplary system for performing the methods described herein.
- FIG. 3 shows a schematic of the data of interest.
- FIG. 4 depicts a flow diagram of a method for processing data.
- FIG. 5 shows various elements of the systems and processes described herein.
- FIG. 6 shows results of an experiment.
- Next-generation or second-generation sequencing devices, such as the HiSeq X Ten, employ massively parallel sequencing chemistry, creating millions of fragments of DNA against a template DNA sequence derived from a biological sample of interest. Researchers have developed computational methods to align these fragments, known as reads, against a reference DNA sequence. Variations found in the aligned reads, or alignments, in comparison to the reference DNA sequence are identified for further downstream analysis.
- Raw reads generated by a sequencing device (or optionally a preprocessed version of the reads) can be aligned against a reference sequence using an alignment program. For example, Illumina sequencing devices can produce a bcl basecall file, which may be aligned directly by ISAAC aligner software, or can be converted into a fastq file, which is then aligned by software such as ISAAC aligner or BWA and Sambamba. The resulting file contains alignments, which are reads that have been matched to positions in a reference sequence.
- One common alignment file format is the SAM file. See Sequence Alignment/Map Format Specification by the SAM/BAM Format Specification Working Group. Because of their large size, SAM files will often be compressed to produce a BAM file. Even as a compressed format, BAM files may still be large. In order to facilitate use of a BAM file in later computational applications, BAM files are often used in conjunction with a file index, such as a BAI file. A file index in this context contains information that allows random access of a compressed file. BAI files can be used to quickly retrieve alignments overlapping a specified region of a sequence without going through all of the alignments contained in the BAM file. Existing methods of using a BAI and BAM file for retrieving alignments typically include a step of decompressing some portion of the BAM file. It may be advantageous in some applications to avoid decompression of a file containing genomic sequence data, such as a BAM file, since decompressing and recompressing parts of the file incur costs in time and consumption of computing resources.
- FIG. 1 is a block diagram of an exemplary computer or computing system 100 that may be used to construct a system for executing the processes described herein. Computer 100 includes a processor 102 for executing instructions. The processor 102 represents one processor or a multiprocessor system including several processors (e.g., two, four, eight, or another suitable number). Processor 102 may include any suitable processor capable of executing instructions. For example, in various embodiments processor 102 may be a general-purpose or embedded processor implementing any of a variety of instruction set architectures (ISAs), such as the x86, PowerPC, SPARC, or MIPS ISAs, or any other suitable ISA. In multiprocessor systems, each of the processors may commonly, but not necessarily, implement the same ISA. In some embodiments, executable instructions are stored in a memory 104, which is accessible by and coupled to the processor 102. Memory 104 is any device allowing information, such as executable instructions and/or other data, to be stored and retrieved. A memory may be volatile memory, nonvolatile memory or a combination of one or more volatile and one or more nonvolatile memories. Thus, the memory 104 may comprise, for example, random access memory (RAM), read-only memory (ROM), hard disk drives, solid-state drives, USB flash drives, memory cards accessed via a memory card reader, floppy disks accessed via an associated floppy disk drive, optical discs accessed via an optical disc drive, magnetic tapes accessed via an appropriate tape drive, and/or other memory components, or a combination of any two or more of these memory components. In addition, the RAM may comprise, for example, static random access memory (SRAM), dynamic random access memory (DRAM), or magnetic random access memory (MRAM) and other such devices. The ROM may comprise, for example, a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), or other like memory device.
- Computer 100 may, in some embodiments, include a user interface device 110 for receiving data from or presenting data to user 108. User 108 may interact indirectly with computer 100 via another computer. User interface device 110 may include, for example, a keyboard, a pointing device, a mouse, a stylus, a touch sensitive panel (e.g., a touch pad or a touch screen), a gyroscope, an accelerometer, a position detector, an audio input device or any combination thereof. In some embodiments, user interface device 110 receives data from user 108, while another device (e.g., a presentation device) presents data to user 108. In other embodiments, user interface device 110 has a single component, such as a touch screen, that both outputs data to and receives data from user 108. In such embodiments, user interface device 110 operates as a component or presentation device for presenting or conveying information to user 108. For example, user interface device 110 may include, without limitation, a display device (e.g., a liquid crystal display (LCD), organic light emitting diode (OLED) display, or electronic ink display), an audio output device (e.g., a speaker or headphones) or both. In some embodiments, user interface device 110 includes an output adapter, such as a video adapter, an audio adapter or both. An output adapter is operatively coupled to processor 102 and configured to be operatively coupled to an output device, such as a display device or an audio output device.
- Computer 100 includes a storage interface 116 that enables computer 100 to communicate with one or more data stores, which store virtual disk images, software applications, or any other data suitable for use with the systems and processes described herein. In exemplary embodiments, storage interface 116 couples computer 100 to a storage area network (SAN) (e.g., a Fibre Channel network), a network-attached storage (NAS) system (e.g., via a packet network) or both. The storage interface 116 may be integrated with network communication interface 112.
- Computer 100 also includes a network communication interface 112, which enables computer 100 to communicate with a remote device (e.g., another computer) via a communication medium, such as a wired or wireless packet network. For example, computer 100 may transmit or receive data via network communication interface 112. User interface device 110 or network communication interface 112 may be referred to collectively as an input interface and may be configured to receive information from user 108. Any server, compute node, controller or object store (or storage, used interchangeably) described herein may be implemented as one or more computers (whether local or remote). Object stores include memory for storing and accessing data. One or more computers or computing systems 100 can be used to execute program instructions to perform any of the methods and operations described herein. Thus, in some embodiments, a system comprises a memory and a processor coupled to the memory, wherein the memory comprises program instructions executable by the processor to perform any of the methods and operations described herein.
- FIG. 2 shows a diagram of an exemplary system for performing the steps, operations and methods described herein. From the point of view of user 108, the user interacts with one or more local computers 201 in communication with one or more remote servers (controllers) 203 by way of one or more networks 202. User 108, via his or her local computers 201, instructs controllers 203 to initiate processing. The remote controllers 203 may themselves be in communication with each other through one or more networks 202 and may be further connected to one or more remote compute nodes 204/205, also via one or more networks 207. Controllers 203 provision one or more compute nodes 204/205 to process the data, such as genomic sequence data. Remote compute nodes 204/205 may be connected to one or more object stores 206 via one or more networks 208. The data, such as genomic sequence data, may be stored in object storage 206. In some embodiments, one or more networks shown in FIG. 2 overlap. In some embodiments, the user interacts with one or more local computers in communication with one or more remote computers by way of one or more networks. The remote computers may themselves be in communication with each other through the one or more networks. In some embodiments, a subset of local computers is organized as a cluster or a cloud as understood in the art. In some embodiments, some or all of the remote computers are organized as a cluster or a cloud. In some embodiments, a user interacts with a local computer in communication with a cluster or a cloud via one or more networks. In some embodiments, a user interacts with a local computer in communication with a remote computer via one or more networks. In some embodiments, a file, such as a genomic sequence file or an index, is stored in an object store, for example, in a local or remote computer (such as a cloud).
- FIG. 3 shows a schematic of the data of interest. A reference sequence 301 contains a known sequence of characters, each numbered serially to provide a position or location within the sequence. The characters can be any data, biological data being of particular interest. In exemplary embodiments, the characters of a reference sequence refer to genomic data or genomic sequence data, such as the various nucleic acids that compose DNA or RNA. A sequencing device receives a sample containing genetic material and generates a collection of short strings of nucleic acids, referred to as reads 302. Any of a whole host of algorithms and software known in the art can be used to match or align the reads against a reference sequence. Examples include BBMap, BigBWA, Bowtie, BWA, CUSHAW, GEM, iSAAC, Novoalign, SOAP and Stampy. These algorithms determine the best or likely location within a reference sequence where some, most or all of the nucleic acids of a read match or align with the nucleic acids of the reference sequence or of the reverse complement of the reference sequence. One of skill in the art will appreciate that a read may be determined to match even in the presence of nucleic acid insertions, deletions or substitutions in either the reference sequence, the read or both. Alignment algorithms thus output at least a position in the reference sequence where a read is matched. Reads that have been associated or matched with some position in a reference sequence can be referred to as mapped reads or alignments 303.
- Genomic sequence files contain the genomic sequence data described herein, as well as other information. A genomic sequence file can comprise data on a plurality of alignments against one or more reference sequences. In exemplary embodiments, reads, positions and optional information, such as read quality, are arranged in some logical manner in a genomic sequence file. For example, the SAM and BAM formats may contain alignments sorted by reference ID and then by the leftmost position coordinate of the alignment. SAM and BAM files, as well as other types of genomic sequence files, also may include header data, which may include format version, reference sequence names and lengths and other identifiers. Genomic sequence files can also include an end-of-file (EOF) string or marker, which is a series of data that is interpreted to indicate the end of the file. In BAM files, the end-of-file marker is 28 bytes. Thus, genomic sequence files can contain various combinations of a header, genomic sequence data and end-of-file marker.
- Because of the large number of alignments that may be needed for certain applications, it may be useful to compress files containing such alignments in order to conserve storage resources. Any of a number of compression algorithms known in the art may be used for such purpose. In exemplary embodiments, BGZF block compression is used, for example, to generate a BAM file from a SAM file. In those embodiments, sequential blocks of a SAM file are bgzipped into compressed BGZF blocks. Using BGZF block compression, the blocks before and after compression do not exceed 64 KiB (65,536 bytes). BAM files are thus a concatenation of BGZF blocks. Through the use of a file index (sometimes referred to simply as an index), a compressed file can be randomly accessed at various points throughout the compressed file in order to grab certain portions of the data. It is noted that the maximum size of a compressed block of data can be no greater than the maximum size of the same block uncompressed.
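- As an illustration of why a BGZF-compressed file can be handled block by block without inflating it, the following is a minimal sketch in Python. It assumes the block layout given in the SAM/BAM format specification (a gzip header whose extra field carries a 'BC' subfield holding BSIZE, the total block size minus one). The function name `bgzf_block_sizes` is hypothetical and is not taken from the described system.

```python
import struct

# The 28-byte empty BGZF block that terminates a BAM file.
BGZF_EOF = bytes.fromhex(
    "1f8b0804" "00000000" "00ff" "0600"   # gzip header, XLEN = 6
    "4243" "0200" "1b00"                  # 'BC' subfield, BSIZE = 27
    "0300" "00000000" "00000000"          # empty deflate stream, CRC32, ISIZE
)


def bgzf_block_sizes(data):
    """Return the compressed size of every BGZF block in `data`.

    Only the block headers are read; no block is decompressed.
    """
    sizes = []
    pos = 0
    while pos < len(data):
        # Fixed 12-byte gzip header: ID1, ID2, CM, FLG, MTIME, XFL, OS, XLEN.
        id1, id2, cm, flg, _mtime, _xfl, _os, xlen = struct.unpack_from(
            "<BBBBIBBH", data, pos
        )
        if (id1, id2, cm) != (31, 139, 8) or not flg & 4:
            raise ValueError("not a BGZF block at offset %d" % pos)
        # Scan the extra subfields for 'BC', which carries BSIZE.
        extra = data[pos + 12 : pos + 12 + xlen]
        off = 0
        bsize = None
        while off + 4 <= len(extra):
            si1, si2, slen = struct.unpack_from("<BBH", extra, off)
            if (si1, si2, slen) == (66, 67, 2):
                (bsize,) = struct.unpack_from("<H", extra, off + 4)
                break
            off += 4 + slen
        if bsize is None:
            raise ValueError("BSIZE subfield missing at offset %d" % pos)
        sizes.append(bsize + 1)   # BSIZE stores the block size minus one
        pos += bsize + 1
    return sizes
```

- Because BSIZE is a 16-bit field, no block reported by this sketch can exceed 65,536 bytes, which matches the block size limit stated above.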
- FIG. 4 depicts a flow diagram of a method for processing data. In exemplary embodiments, these data are genomic sequence data (for example, reads or alignments) contained in a genomic sequence file, such as those described herein (in particular, a BAM file). In exemplary embodiments, the genomic sequence file comprises a header and genomic sequence data. In some embodiments, the genomic sequence file comprises a header, genomic sequence data and an end-of-file marker. In exemplary embodiments, the genomic sequence file is compressed and no portion thereof is decompressed in performing the processes described herein, such as in FIG. 4. In exemplary embodiments, genomic sequence data are compressed and no portion thereof is decompressed in performing the processes described herein, such as in FIG. 4.
- In block 401, a first index is written into memory. The first index generally contains data referring to locations in a compressed genomic sequence file where an alignment or chunks of alignments can be found. A chunk is a set of compressed alignments whose size is based on a predetermined amount of memory in which the uncompressed data fit. The locations in the first index may be given as a virtual file offset, an integer in which one portion refers to a byte offset into the compressed file to the beginning of a compressed block and another portion refers to a byte offset into the uncompressed data stream of that block. For example, a virtual file offset could be a 64-bit unsigned integer in which the leftmost 48 bits refer to a byte offset into the compressed file to the beginning of a compressed block and the rightmost 16 bits refer to the byte offset into the data stream of the uncompressed block. The first index may also contain other data, such as bins, to facilitate indexing and fast retrieval. Alignments can be grouped into or associated with such bins and arranged or ordered in the file index accordingly. Thus, a first index may contain data on one or more reference sequences, each divided into bins, which themselves contain alignments. In exemplary embodiments, the file index is a BAI file. In the general arrangement of a BAI file, a magic string is followed by a number of reference sequences, each of which has a number of bins, each of which contains a number of chunks, each of which is characterized by a beginning virtual file offset and an end virtual file offset representing the start and end of the chunk respectively. For each reference sequence, following the chunks is a set of virtual file offsets each of which represents the first alignment in a series of 16 kbp intervals across the reference sequence, thus forming a linear index. The BAI file optionally ends with an integer representing the number of unplaced unmapped reads. In some embodiments, a first index is written into nonvolatile memory.
- In some embodiments, a URL is used to access the contents of a file, such as the first index. In these embodiments, the file is located on a computer remote from the local computer that is used to analyze or store some or all of the contents of the file. The remote computer in some instances may store the file in an object store. In exemplary embodiments, a URL to the first index is written into memory. In some embodiments, the first index accessed at the URL is written into memory.
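- The virtual file offset used by the first index can be sketched as follows. This is a minimal illustration in Python, assuming the layout used by BAI files in which the rightmost 16 bits hold the offset within the uncompressed block and the remaining high-order bits hold the byte offset of the compressed block. The names `pack_voffset` and `unpack_voffset` are hypothetical.

```python
UOFFSET_BITS = 16                       # bits reserved for the in-block offset
UOFFSET_MASK = (1 << UOFFSET_BITS) - 1  # 0xFFFF


def pack_voffset(coffset, uoffset):
    """Combine a compressed-file byte offset and an in-block offset."""
    if coffset < 0:
        raise ValueError("compressed offset must be non-negative")
    if not 0 <= uoffset <= UOFFSET_MASK:
        raise ValueError("in-block offset must fit in 16 bits")
    return (coffset << UOFFSET_BITS) | uoffset


def unpack_voffset(voffset):
    """Split a virtual file offset into (compressed offset, in-block offset)."""
    return voffset >> UOFFSET_BITS, voffset & UOFFSET_MASK
```

- For instance, `unpack_voffset(pack_voffset(100000, 513))` gives back the pair `(100000, 513)`: the first value locates a compressed block in the file and the second locates an alignment inside that block once it is inflated.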
- Before retrieving the data from a remote computer, it may be useful to calculate the amount of space needed locally and to preallocate memory before writing the data. The preallocated memory can be calculated with high precision in order to minimize the amount of preallocated memory that does not have data written to it. That is, it is possible to calculate down to a single byte the amount of memory that needs to be preallocated. For a remote file accessed by the HTTP protocol, this calculation could comprise making an HTTP HEAD request. Thus, in some embodiments, memory is preallocated for the first index.
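- The size query and preallocation described above might look like the following minimal sketch, written in Python with only the standard library. It assumes the server reports a `Content-Length` header in response to a HEAD request. The function names `remote_size` and `preallocate` are hypothetical, and `truncate` is used here as a simple stand-in: on many file systems it produces a sparse file rather than reserving disk blocks.

```python
import os
import urllib.request


def remote_size(url):
    """Return the size in bytes of a remote file using an HTTP HEAD request."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        # Assumption: the server includes Content-Length in its response.
        return int(response.headers["Content-Length"])


def preallocate(path, size):
    """Create a local file of exactly `size` bytes to receive the data."""
    with open(path, "wb") as handle:
        handle.truncate(size)   # extends the empty file to `size` bytes
    return os.path.getsize(path)
```

- A caller would pass the result of `remote_size` for the index URL to `preallocate`, so that the local file is sized to the byte before any data arrive.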
- Asynchronous I/O methods can optionally be used to quickly retrieve data from any file stored remotely. In exemplary embodiments, a first index is loaded into memory by asynchronous I/O. The use of asynchronous I/O allows the program's execution to continue while data is being transmitted. To read or write from a network connection, the program performs non-blocking read or write system calls to the kernel, and monitors all pending operations using a single polling call to the kernel. During the time that no read or write requests can be fulfilled, the program is free to perform other processing tasks. A number of different asynchronous I/O libraries are widely available for this purpose. For example, libuv is a multiplatform support library designed around an event-driven asynchronous I/O model. With asynchronous I/O, a plurality of simultaneous connections between a local computer and a remote computer can be utilized to transfer data between the computers.
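- The event-driven pattern described above can be illustrated with a minimal sketch. Python's standard `asyncio` module is used here in place of libuv purely for illustration, and `fetch_range` is a hypothetical stand-in that reads from an in-memory byte string where a real program would issue a non-blocking network read.

```python
import asyncio


async def fetch_range(source, start, end):
    """Stand-in for one non-blocking transfer of bytes [start, end)."""
    await asyncio.sleep(0)          # yield to the event loop, as real I/O would
    return source[start:end]


async def fetch_all(source, ranges):
    """Run all transfers concurrently and join the parts in request order."""
    tasks = [fetch_range(source, start, end) for start, end in ranges]
    parts = await asyncio.gather(*tasks)   # results keep the order of `tasks`
    return b"".join(parts)


def download(source, ranges):
    """Synchronous entry point that drives the event loop."""
    return asyncio.run(fetch_all(source, ranges))
```

- While any one transfer is waiting, the event loop is free to advance the others, which is the property that allows several simultaneous connections to be serviced from a single thread.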
- In block 402, a byte range of the genomic sequence file is calculated. Here, the byte range is determined based on the data contained in the first index and a genomic range selected and input by the user. The genomic range is a coordinate range (such as one or more identification or index numbers) referring to one or more reference sequences. The reference sequence can represent any structure of interest, in particular a biological structure such as a chromosome. In some embodiments, the genomic range consists of whole contiguous reference sequences. A whole contiguous reference sequence is one that has not been divided further. Calculating the byte range of the genomic sequence file may be done by a processor.
- For each reference sequence in the genomic range, the bins into which the reference sequence has been divided can be traversed to determine the lowest and highest virtual file offsets of the chunks contained within the bins. Bins that contain metadata and not any actual sequences (i.e., pseudobins) may be ignored. The lowest and highest virtual file offsets of the aggregate of these chunks determine the region of the genomic sequence file containing the alignment data of interest to the user.
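- The traversal of bins described above can be sketched as follows. This is a minimal illustration in Python that assumes the first index has already been parsed into a nested dictionary of the hypothetical form `{reference_id: {bin_number: [(chunk_begin, chunk_end), ...]}}`. The pseudobin number 37450 is the metadata bin defined for BAI files in the SAM/BAM format specification.

```python
PSEUDO_BIN = 37450   # BAI metadata bin; it holds no alignment chunks


def voffset_span(index, reference_ids):
    """Return (lowest, highest) virtual file offsets for the chosen references.

    Returns None when no chunk is found for any of the references.
    """
    lowest = None
    highest = None
    for ref_id in reference_ids:
        for bin_number, chunks in index.get(ref_id, {}).items():
            if bin_number == PSEUDO_BIN:
                continue                      # skip metadata-only bins
            for begin, end in chunks:
                lowest = begin if lowest is None else min(lowest, begin)
                highest = end if highest is None else max(highest, end)
    if lowest is None:
        return None
    return lowest, highest
```

- Returning `None` when no chunk is found lets the caller handle separately the case in which the lowest virtual file offset is not found.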
- In the case where the lowest virtual file offset is not found and the genomic sequence file is smaller than or equal to the maximum size of a compression block plus an end-of-file string, the entire genomic sequence file is retrieved. In the case where the lowest virtual file offset is not found and the genomic sequence file is larger than the maximum size of a compression block plus an end-of-file string, at least the first compression block is retrieved (by retrieving the maximum compression block size starting at the beginning of the file) along with the end-of-file string. Because the contents of an end-of-file string may be known in advance, in some instances, instead of retrieving the end-of-file string, such string may be simply appended to retrieved data.
- Referring to FIG. 5, the lowest virtual file offset is a “low position” indicating the beginning of the first compressed block of the genomic sequence file 504 to retrieve. The highest virtual file offset is a “high position” 502 indicating the beginning of the last compressed block of the genomic sequence file to retrieve. Because each of the compressed blocks of a genomic sequence file may be of varying sizes, the location of the end of the last compressed block to be retrieved is not known. Thus, the maximum size of a compressed block of data is added to the high position, resulting in a new high position 503 that is used to determine the end point of the length of data to retrieve from the genomic sequence file. If the new high position is greater than the size of the genomic sequence file less one plus the EOF marker, then the high position is simply set to the end of the genomic sequence file.
- In exemplary embodiments, the header of the genomic sequence file is written into memory. Since the header is located at the beginning of the genomic sequence file, in these embodiments, at least the first compressed block is retrieved by retrieving the maximum size of a compressed block starting from the beginning of the genomic sequence file. If the location of the low position is less than or equal to the maximum size of a compressed block, then the low position is reset to the beginning of the genomic sequence file (zero). If the location of the low position is greater than the maximum size of a compressed block, then a “first-block offset” between the end of the first block retrieved and the low position is determined and stored for later use. In some embodiments, the header of the genomic sequence file is written into nonvolatile memory.
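- The low position, new high position and first-block offset described above can be worked through in a short sketch. The following Python code is an illustration under stated assumptions: virtual file offsets carry the compressed byte offset above their rightmost 16 bits, the maximum block size is 65,536 bytes, and the end-of-file comparison is simplified to clamping the new high position at the file size. The function name `plan_retrieval` is hypothetical.

```python
MAX_BLOCK = 65536    # maximum size of a compressed block, in bytes
UOFFSET_BITS = 16    # low-order bits holding the in-block offset


def plan_retrieval(low_voffset, high_voffset, file_size):
    """Return (low, new_high, first_block_offset) as byte positions."""
    low = low_voffset >> UOFFSET_BITS      # start of first block of interest
    high = high_voffset >> UOFFSET_BITS    # start of last block of interest

    # The end of the last block is unknown, so overshoot by one maximum block
    # and clamp to the end of the file (a simplification of the EOF test).
    new_high = min(high + MAX_BLOCK, file_size)

    if low <= MAX_BLOCK:
        # The header region and the data of interest overlap: read from zero.
        low = 0
        first_block_offset = 0
    else:
        # Gap between the end of the header region and the low position.
        first_block_offset = low - MAX_BLOCK
    return low, new_high, first_block_offset
```

- With a low position of 200,000 and a high position of 900,000 in a 10,000,000-byte file, the sketch gives a new high position of 965,536 and a first-block offset of 134,464, the number of bytes skipped between the header region and the retrieved data.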
- The length between the low and high position may then be divided into a set of intervals by which portions of the genomic sequence file are retrieved. In some embodiments, the EOF of the genomic sequence file is retrieved. In those embodiments, an additional interval equal to the length of the EOF is appended to the intervals. In some embodiments, the EOF of the genomic sequence file is not retrieved.
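- Dividing the length between the low and high positions into intervals can be sketched as below. This is a minimal Python illustration using half-open byte intervals; the interval size is left to the caller because it is not specified above, and the function names `split_intervals` and `with_eof` are hypothetical. The 28-byte default is the length of the BAM end-of-file marker.

```python
def split_intervals(low, high, step):
    """Split the byte span [low, high) into consecutive pieces of `step` bytes.

    The final piece is shorter when the span is not a multiple of `step`.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    return [(start, min(start + step, high)) for start in range(low, high, step)]


def with_eof(intervals, file_size, eof_length=28):
    """Append one extra interval covering the end-of-file marker."""
    return list(intervals) + [(file_size - eof_length, file_size)]
```

- Each resulting interval can then be retrieved independently, which fits the use of several simultaneous connections.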
- In block 403, a portion of the genomic sequence file is written into memory. In some embodiments, a URL to the genomic sequence file is written into memory, and a portion of the genomic sequence file accessed at the URL is written into memory (for example, nonvolatile memory). In some embodiments, the portion of the genomic sequence file accessed at the URL and written into memory comprises genomic sequence data located within the byte range of block 402, and in exemplary embodiments, the portion of the genomic sequence file accessed at the URL and written into memory comprises a header and genomic sequence data located within the byte range of block 402. Thus, in some embodiments, a process comprises writing into the memory a URL to the genomic sequence file and transferring into memory (for example, nonvolatile memory) a portion of the genomic sequence file accessed at the URL, wherein the portion comprises genomic sequence data. The portion of the genomic sequence data that is written into memory will generally be greater than or equal to 1 byte but less than the entirety of the genomic data. In some embodiments, a portion of the genomic sequence file is written into nonvolatile memory.
- In some embodiments, one or more portions of the genomic sequence file are retrieved using an asynchronous I/O library. The same functions used in writing the first index into memory can also be used to write portions of the genomic sequence file. Thus, exemplary libraries include libuv.
- In some embodiments, a second genomic sequence file is generated. The second genomic sequence file will generally comprise various portions of an original or first genomic sequence file, such portions including any combination of some or all of the header, some or all of the genomic sequence data and some or all of the end-of-file marker. In exemplary embodiments, the second genomic sequence file comprises a header and genomic sequence data. In some embodiments, the second genomic sequence file comprises a header, genomic sequence data and an end-of-file marker. In some embodiments, the second genomic sequence file comprises genomic sequence data. It may be advantageous to avoid decompression and recompression of the first genomic sequence file. Thus, the various portions of a first or original genomic sequence file can be retrieved and, without decompression, assembled into the second genomic sequence file. In exemplary embodiments, no portion of the first genomic sequence file is decompressed. In exemplary embodiments, the header is not decompressed. In exemplary embodiments, the genomic sequence data are not decompressed. In exemplary embodiments, no portion of the header or genomic sequence data is decompressed.
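- Assembly of a second genomic sequence file without decompression can be sketched as a plain concatenation of byte ranges. The following Python code is an illustration only: `read_range` is a hypothetical callable that returns the bytes `[start, end)` of the first file, the low position is assumed to have been reset to zero already when it falls inside the header region, and the end-of-file marker is appended from a known constant rather than retrieved.

```python
# The 28-byte BAM end-of-file marker (an empty BGZF block).
BGZF_EOF = bytes.fromhex(
    "1f8b0804" "00000000" "00ff" "0600"
    "4243" "0200" "1b00"
    "0300" "00000000" "00000000"
)


def assemble_slice(read_range, low, high, max_block=65536, eof=BGZF_EOF):
    """Build the second file from raw byte ranges of the first file.

    No byte is inflated or deflated; the ranges are copied as they are.
    """
    parts = []
    if low > 0:
        # Header region: one maximum-size block from the start of the file.
        parts.append(read_range(0, max_block))
    parts.append(read_range(low, high))   # compressed data of interest
    parts.append(eof)                     # known marker, appended directly
    return b"".join(parts)
```

- Because the header region is cut at a fixed length and not at a block boundary, a file built this way is not necessarily a proper compressed file, and it is intended to be used together with a recalculated index.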
- In some embodiments, memory (such as nonvolatile memory) is preallocated for the second genomic sequence file.
- In some embodiments, the second genomic sequence file is not a proper compressed file. In these embodiments, the second genomic sequence file cannot be linearly decompressed using the compression algorithm that generated the original file whose portions were used to assemble the second genomic sequence file.
- In block 404, a second index is calculated. Because only a subset of the first genomic sequence file will have been retrieved, a second index different from the first index is needed in order to use such subset. Using the first index as a template to generate a second index, all reference sequences outside of the genomic range can be deleted. Thus, only a subset of the chunks described in the first index will need to be retained in many cases.
- When calculating a second index for use with a second genomic sequence file, the virtual file offsets of the start and end points of each chunk also need to be adjusted to reflect their locations in a second genomic sequence file. Each start and end point is bit-shifted right by the bit-length of the maximum uncompressed offset and then the first-block offset is subtracted. The bits representing the uncompressed offset are then appended to this new value to produce a new start and end point. This process is repeated for each chunk of each reference sequence that is to be retained. The first-block offset is also used in a similar manner to adjust the locations of the first alignment contained in each interval of the linear index.
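- The adjustment of each virtual file offset described above can be written out as a short sketch. This Python illustration assumes a 16-bit uncompressed offset in the low-order bits of each virtual file offset; the function names `shift_voffset` and `reindex_chunks` are hypothetical.

```python
UOFFSET_BITS = 16
UOFFSET_MASK = (1 << UOFFSET_BITS) - 1


def shift_voffset(voffset, first_block_offset):
    """Move one virtual file offset into the second file's coordinates."""
    coffset = voffset >> UOFFSET_BITS       # drop the uncompressed offset bits
    uoffset = voffset & UOFFSET_MASK        # keep them for re-appending
    new_coffset = coffset - first_block_offset
    if new_coffset < 0:
        raise ValueError("offset lies before the retrieved data")
    return (new_coffset << UOFFSET_BITS) | uoffset


def reindex_chunks(chunks, first_block_offset):
    """Apply the shift to the start and end point of every chunk."""
    return [
        (shift_voffset(begin, first_block_offset),
         shift_voffset(end, first_block_offset))
        for begin, end in chunks
    ]
```

- For example, with a first-block offset of 134,464, a chunk that began at compressed byte 200,000 in the first file begins at byte 65,536 in the second file, immediately after the header region, while its offset within the uncompressed block is unchanged. The same `shift_voffset` call can be applied to each entry of the linear index.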
- In block 405, the second index is written into memory. In some embodiments, the second index is written into nonvolatile memory.
- One of skill in the art will appreciate that in some embodiments, the steps or operations described herein, such as shown in FIG. 4, may be arranged in a different order. For example, in some embodiments, after block 402 where a byte range of the genomic sequence file has been calculated, block 404 (calculating a second index) and block 405 (writing the second index into memory) can take place before block 403 (writing into memory a portion of the genomic sequence file). Certain steps or operations described sequentially may in some cases be performed concurrently. In some of these various embodiments, the genomic sequence file is compressed and no portion thereof is decompressed in performing the process. In exemplary embodiments, genomic sequence data are compressed and no portion thereof is decompressed in performing the process.
- In some embodiments, the information or data in a second index and in a second genomic sequence file are passed to one or more downstream algorithms or software for additional processing. The second genomic sequence file contains genomic sequences that have been aligned to a certain position in the reference sequence. Various other analyses known in the art can be performed to extract additional information from the second genomic sequence file. For example, it may be useful to determine if the alignments indicate a variant with respect to a reference sequence. Thus, in some embodiments, the second genomic sequence file is passed as input to variant call software. In some embodiments, the second genomic sequence file and the second index are passed as input to variant call software. Many variant call algorithms and software are known in the art. Examples include CRISP, GATK, SAMtools, SNVer, SomaticSniper, CNVnator, RDXplorer, CONTRA, ExomeCNV, BreakDancer, Breakpointer, CLEVER, GASVPro and SVMerge. In exemplary embodiments, the variant call software is GATK (HaplotypeCaller). The downstream algorithm or software could employ a method for parallel processing of the files. Thus, in some embodiments, one or more executions of the variant call software are simultaneously initiated, wherein each execution uses one or more threads. In some embodiments, each of the executions is run on a separate cloud instance. Other types of analysis that might be performed on the information or data in the second index and second genomic sequence file include multi-sample processing, annotation and filtering of variants, data aggregation, association analysis, population structure analysis and visualization, among others.
- Any of the disclosed operations, combinations of operations or methods can be implemented as computer-executable instructions stored on one or more computer-readable storage media (e.g., one or more optical media discs, volatile memory components (such as DRAM or SRAM), or non-volatile memory components (such as flash memory or hard drives)) and executed on a computer (e.g., any commercially available computer, including smart phones or other mobile devices that include computing hardware). Any of the computer-executable instructions for implementing the disclosed techniques as well as any data created and used during implementation of the disclosed embodiments can be stored on one or more computer-readable storage media. The computer-executable instructions can be part of, for example, a dedicated software application or a software application that is accessed or downloaded via a web browser or other software application (such as a remote computing application). Such software can be executed, for example, on a single local computer (e.g., any suitable commercially available computer) or in a network environment (e.g., via the Internet, a wide-area network, a local-area network, a client-server network (such as a cloud computing network), or other such network) using one or more network computers.
- It should be understood that the disclosed technology is not limited to any specific computer language or program. For instance, the disclosed technology can be implemented by software written in C, C++, Java, Perl, JavaScript or any other suitable programming language. Likewise, the disclosed technology is not limited to any particular computer or type of hardware. Certain details of suitable computers and hardware are well known in the art.
- It should also be well understood that any functionality described herein can be performed, at least in part, by one or more hardware logic components, instead of software. For example, and without limitation, illustrative types of hardware logic components that can be used include field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), system-on-a-chip systems (SOCs), complex programmable logic devices (CPLDs), etc.
- FIG. 6 shows the time savings gained by utilizing the systems and processes described herein. GATK HaplotypeCaller was run on files produced by a system described herein and on BAM files that were sliced by Sambamba Slice. As shown in FIG. 6, the presently described system finished processing the largest reference sequence in 3 hours and 15 minutes, compared to about 4 hours using files produced by Sambamba Slice.
- The articles “a,” “an” and “the” as used herein do not exclude a plural number of the referent, unless context clearly dictates otherwise. The conjunction “or” is not mutually exclusive, unless context clearly dictates otherwise. The term “include” refers to nonexhaustive examples.
- All references, publications, patent applications, issued patents, records, databases, websites and URLs cited herein are incorporated by reference in their entirety for all purposes.
Claims (24)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US15/386,729 US20170177597A1 (en) | 2015-12-22 | 2016-12-21 | Biological data systems |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201562271204P | 2015-12-22 | 2015-12-22 | |
US15/386,729 US20170177597A1 (en) | 2015-12-22 | 2016-12-21 | Biological data systems |
Publications (1)
Publication Number | Publication Date |
---|---|
US20170177597A1 (en) | 2017-06-22 |
Family
ID=59066360
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US15/386,729 Abandoned US20170177597A1 (en) | 2015-12-22 | 2016-12-21 | Biological data systems |
Country Status (1)
Country | Link |
---|---|
US (1) | US20170177597A1 (en) |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108647317A (en) * | 2018-05-10 | 2018-10-12 | 东软集团股份有限公司 | Generation method, device and the storage medium and electronic equipment of delta file |
US10230390B2 (en) * | 2014-08-29 | 2019-03-12 | Bonnie Berger Leighton | Compressively-accelerated read mapping framework for next-generation sequencing |
CN109712674A (en) * | 2019-01-14 | 2019-05-03 | 深圳市泰尔迪恩生物信息科技有限公司 | Annotations database index structure, quick gloss hereditary variation method and system |
US20200095628A1 (en) * | 2018-09-21 | 2020-03-26 | doc.ai, Inc. | Ordinal position-specific and hash-based efficient comparison of sequencing results |
US10957433B2 (en) | 2018-12-03 | 2021-03-23 | Tempus Labs, Inc. | Clinical concept identification, extraction, and prediction system and related methods |
US11037685B2 (en) | 2018-12-31 | 2021-06-15 | Tempus Labs, Inc. | Method and process for predicting and analyzing patient cohort response, progression, and survival |
US11151081B1 (en) * | 2018-01-03 | 2021-10-19 | Amazon Technologies, Inc. | Data tiering service with cold tier indexing |
US11295841B2 (en) | 2019-08-22 | 2022-04-05 | Tempus Labs, Inc. | Unsupervised learning and prediction of lines of therapy from high-dimensional longitudinal medications data |
US20220368347A1 (en) * | 2019-10-18 | 2022-11-17 | Koninklijke Philips N.V. | System and method for effective compression, representation and decompression of diverse tabulated data |
US20220383988A1 (en) * | 2017-05-24 | 2022-12-01 | Petagene Ltd. | Data processing system and method |
US11532397B2 (en) | 2018-10-17 | 2022-12-20 | Tempus Labs, Inc. | Mobile supplementation, extraction, and analysis of health records |
US11640859B2 (en) | 2018-10-17 | 2023-05-02 | Tempus Labs, Inc. | Data based cancer research and treatment systems and methods |
US11875903B2 (en) | 2018-12-31 | 2024-01-16 | Tempus Labs, Inc. | Method and process for predicting and analyzing patient cohort response, progression, and survival |
Cited By (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10230390B2 (en) * | 2014-08-29 | 2019-03-12 | Bonnie Berger Leighton | Compressively-accelerated read mapping framework for next-generation sequencing |
US20220383988A1 (en) * | 2017-05-24 | 2022-12-01 | Petagene Ltd. | Data processing system and method |
US11151081B1 (en) * | 2018-01-03 | 2021-10-19 | Amazon Technologies, Inc. | Data tiering service with cold tier indexing |
CN108647317A (en) * | 2018-05-10 | 2018-10-12 | 东软集团股份有限公司 | Generation method, device and the storage medium and electronic equipment of delta file |
US11664089B2 (en) | 2018-09-21 | 2023-05-30 | Sharecare AI, Inc. | Bin-specific and hash-based efficient comparison of sequencing results |
US20200095628A1 (en) * | 2018-09-21 | 2020-03-26 | doc.ai, Inc. | Ordinal position-specific and hash-based efficient comparison of sequencing results |
US11699504B2 (en) | 2018-09-21 | 2023-07-11 | Sharecare AI, Inc. | Hash-based efficient comparison of sequencing results |
US11551784B2 (en) * | 2018-09-21 | 2023-01-10 | Sharecare AI, Inc. | Ordinal position-specific and hash-based efficient comparison of sequencing results |
US11640859B2 (en) | 2018-10-17 | 2023-05-02 | Tempus Labs, Inc. | Data based cancer research and treatment systems and methods |
US11532397B2 (en) | 2018-10-17 | 2022-12-20 | Tempus Labs, Inc. | Mobile supplementation, extraction, and analysis of health records |
US11651442B2 (en) | 2018-10-17 | 2023-05-16 | Tempus Labs, Inc. | Mobile supplementation, extraction, and analysis of health records |
US10957433B2 (en) | 2018-12-03 | 2021-03-23 | Tempus Labs, Inc. | Clinical concept identification, extraction, and prediction system and related methods |
US11769572B2 (en) | 2018-12-31 | 2023-09-26 | Tempus Labs, Inc. | Method and process for predicting and analyzing patient cohort response, progression, and survival |
US11309090B2 (en) | 2018-12-31 | 2022-04-19 | Tempus Labs, Inc. | Method and process for predicting and analyzing patient cohort response, progression, and survival |
US11875903B2 (en) | 2018-12-31 | 2024-01-16 | Tempus Labs, Inc. | Method and process for predicting and analyzing patient cohort response, progression, and survival |
US11037685B2 (en) | 2018-12-31 | 2021-06-15 | Tempus Labs, Inc. | Method and process for predicting and analyzing patient cohort response, progression, and survival |
US11830587B2 (en) | 2018-12-31 | 2023-11-28 | Tempus Labs | Method and process for predicting and analyzing patient cohort response, progression, and survival |
US11699507B2 (en) | 2018-12-31 | 2023-07-11 | Tempus Labs, Inc. | Method and process for predicting and analyzing patient cohort response, progression, and survival |
CN109712674A (en) * | 2019-01-14 | 2019-05-03 | 深圳市泰尔迪恩生物信息科技有限公司 | Annotations database index structure, quick gloss hereditary variation method and system |
US11295841B2 (en) | 2019-08-22 | 2022-04-05 | Tempus Labs, Inc. | Unsupervised learning and prediction of lines of therapy from high-dimensional longitudinal medications data |
US20220368347A1 (en) * | 2019-10-18 | 2022-11-17 | Koninklijke Philips N.V. | System and method for effective compression, representation and decompression of diverse tabulated data |
US11916576B2 (en) * | 2019-10-18 | 2024-02-27 | Koninklijke Philips N.V. | System and method for effective compression, representation and decompression of diverse tabulated data |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20170177597A1 (en) | Biological data systems | |
US11030131B2 (en) | Data processing performance enhancement for neural networks using a virtualized data iterator | |
US10372723B2 (en) | Efficient query processing using histograms in a columnar database | |
US9946569B1 (en) | Virtual machine bring-up with on-demand processing of storage requests | |
US9348514B2 (en) | Efficiency sets in a distributed system | |
US20150127619A1 (en) | File System Metadata Capture and Restore | |
US11204935B2 (en) | Similarity analyses in analytics workflows | |
Howison | High-throughput compression of FASTQ data with SeqDB | |
CN106557571A (en) | A kind of data duplicate removal method and device based on K V storage engines | |
US9933971B2 (en) | Method and system for implementing high yield de-duplication for computing applications | |
US20150248430A1 (en) | Efficient encoding and storage and retrieval of genomic data | |
Mohamed et al. | Accelerating data-intensive genome analysis in the cloud | |
Zhan et al. | Optimization of ceph reads/writes based on multi-threaded algorithms | |
KR20160113167A (en) | Optimized data condenser and method | |
US20170169159A1 (en) | Repetition identification | |
US10572452B1 (en) | Context-based read-ahead for B+ tree data structures in a deduplication system | |
KR102425596B1 (en) | Systems and methods for low latency hardware memory management | |
Bicer et al. | A compression framework for multidimensional scientific datasets | |
Byun et al. | A column-aware index management using flash memory for read-intensive databases | |
TW201441850A (en) | System and method of reducing pressure of file server | |
Padmanabhan et al. | In situ exploratory data analysis for scientific discovery | |
CN117667853A (en) | Data reading method, device, computer equipment and storage medium |
Legal Events
Code | Title | Description |
---|---|---|
AS | Assignment | Owner name: DNANEXUS, INC., CALIFORNIA; Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNOR: ASIMENOS, GEORGE; REEL/FRAME: 045334/0976; Effective date: 20170130 |
AS | Assignment | Owner name: INNOVATUS LIFE SCIENCES LENDING FUND I, LP, NEW YO; Free format text: SECURITY INTEREST; ASSIGNOR: DNANEXUS, INC.; REEL/FRAME: 047361/0920; Effective date: 20181030 |
STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED |
STCB | Information on status: application discontinuation | Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |
AS | Assignment | Owner name: PERCEPTIVE CREDIT HOLDINGS II, LP, NEW YORK; Free format text: PATENT SECURITY AGREEMENT; ASSIGNOR: DNANEXUS, INC.; REEL/FRAME: 050831/0452; Effective date: 20191025 |
AS | Assignment | Owner name: DNANEXUS, INC., CALIFORNIA; Free format text: TERMINATION AND RELEASE OF INTELLECTUAL PROPERTY SECURITY AGREEMENT; ASSIGNOR: INNOVATUS LIFE SCIENCES LENDING FUND I, LP; REEL/FRAME: 050858/0937; Effective date: 20191025 |