WO2015073735A1 - Systems and methods for transmission and pre-processing of sequencing data - Google Patents
- Publication number
- WO2015073735A1 (PCT/US2014/065562)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- annotation
- output files
- group
- analysis
- transport
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/11—File system administration, e.g. details of archiving or snapshots
- G06F16/116—Details of conversion of file system types or formats
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
- G16B50/10—Ontologies; Annotations
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/06—Protocols specially adapted for file transfer, e.g. file transfer protocol [FTP]
Definitions
- the field of the invention is systems and methods of transmission and pre-processing of genomic sequencing data, especially as it relates to annotation, queuing, and mass transfer of genomic sequencing files from one or more sequencers to a sequence analysis engine.
- one or more network nodes may include a packet generator that generates a data packet including a first header containing network routing information and a second header with attributes associated with a layered data model of existing knowledge representative of the biological sequence data as described in US 2012/0236861 and US 2012/0233201. Handling of high volumes of sequence information in a facility is described in US 2014/0278461. However, none of the known systems and methods is especially suitable to manage vast quantities of data in a manner that would streamline subsequent analysis, especially as such analysis relates to particular analysis needs or requirements by a medical professional.
- inventive subject matter is drawn to various systems and methods in which multiple omic sequences from one or more data sources (e.g., sequencing device) are fed to a transport server that pre-processes and groups the sequences into a transport group that is then routed to a sequence analysis engine.
- pre-processing and grouping is done on the basis of machine-specific annotations in the omic sequences and an annotation input from a user. In that way, the omic sequences can be grouped in real time, and routed to a downstream sequence analysis engine.
- because omic sequences are preferably grouped such that all sequences required for sequence analysis are in a single transport group (i.e., in one logical unit), delays associated with interrupted sequence analysis (e.g., due to lack of one or more sequences for analysis or time spent loading missing sequences) are reduced, and more typically entirely avoided.
- Such advantage is particularly beneficial where the sequence analysis engine is used to process numerous omic data from numerous users and/or patient samples.
- the systems and methods contemplated herein allow a sequence analysis engine to operate at maximum speed as all data relevant for an analytic task by the sequence analysis engine are provided in a single group or matching/corresponding groups.
- the inventors contemplate a transit system for delivery of a plurality of omic sequences that includes a transport server comprising a transit engine and an annotation engine.
- the transport server is coupled to one or more sequencing devices that provide multiple omic output files to the transport server, wherein each of the omic output files comprises sequence data and a machine-specific annotation, and the transport server is further coupled to a sequence analysis engine (e.g., BAM server) that receives a transport group from the transport server.
- the annotation engine annotates the omic output files using an annotation input from a user to so form annotated omic output files
- the transit engine groups the annotated omic output files into the transport group based on both the machine-specific annotation and the annotation input from the user.
- the transit engine then transfers the transport group to the sequence analysis engine.
- the omic output files are genomic output files (e.g., whole genome or exome), RNA-omic output files, or proteomic output files, and where the output file is a nucleotide sequence, it is preferred that the genomic output file is in SAM format, BAM format, VCF format, FASTQ format, or FASTA format.
- the system will also include a temporary data storage device coupled between the plurality of sequencing devices and the transport server, and that the sequencing devices provide the omic output files to the transport server via the temporary data storage.
- at least one of the sequencing devices is configured to receive a feedback signal from the transport server and/or the sequence analysis engine.
- the machine-specific annotation comprises an annotation that includes a date and/or time identifier, a sequencing device identifier, a lane identifier, a quality score, and/or pair member identifier
- the annotation input from the user will typically include an analysis type annotation (e.g., whole genome analysis, exome enrichment analysis, transcriptome analysis, and proteome analysis) and/or a patient specific annotation (e.g., patient identifier, a tissue identifier, a tissue status identifier, and a health record identifier).
- the transit engine will group the annotated omic output files in real time, and/or that the transit engine will group the annotated omic output files independent of actual sequences in the annotated omic output files.
- the transit engine will transmit the transport group upon completion of forming the transport group, or may use a predetermined grouping mode for a machine-specific annotation.
- the transit engine encrypts the transport group, and/or provides or adds a unique ID to the transport group.
- the transport server may receive the omic output files from the sequencing devices in an encrypted form, optionally upon request to the sequencing devices.
- the inventors also contemplate a method of transferring multiple omic sequences in which a transport server having a transit engine and an annotation engine is provided.
- the transport server receives multiple omic output files from respective multiple sequencing devices, wherein each of the omic output files includes sequence data and a machine-specific annotation.
- the annotation engine is then used by a user to annotate the omic output files to so form annotated omic output files, and the transit engine then groups the annotated omic output files into a transport group, preferably in real time. Most preferably, the grouping will be based on both the machine-specific annotation and the annotation input from the user.
- the transport server will then deliver the transport group to a sequence analysis engine (e.g., BAM server).
- omic output files may have numerous types of content, but are typically genomic output files (e.g., exomes, whole genome, etc.), RNA-omic output files (e.g., transcriptome), or proteomic output files, which will preferably be converted from a raw format into a SAM format or a BAM format.
- the omic output files may be temporarily stored in a data storage device prior to the step of receiving the plurality of omic output files by the transport server.
- the transport server may provide a feedback signal to one or more of the sequencing devices and/or the sequence analysis engine.
- the machine-specific annotation includes a date and/or time identifier, a sequencing device identifier, a lane identifier, a quality score, and/or pair member identifier, and/or the annotation input from the user includes an analysis type annotation (e.g., whole genome analysis, exome enrichment analysis, transcriptome analysis, and proteome analysis) and/or a patient specific annotation (e.g., patient identifier, a tissue identifier, a tissue status identifier, and a health record identifier).
- the transport group is delivered upon completion of forming the transport group, or upon a predetermined delivery schedule or protocol.
- the transit engine will provide or add a unique ID to the transport group.
- the inventors also contemplate a method of transferring omic sequences in which a transport server receives multiple omic output files, each comprising sequence data and a machine-specific annotation.
- the omic output files are then grouped into a transport group using an annotation input from a user in addition to the machine-specific annotation.
- the transport group is then transferred from the transport server to a downstream analytic device (e.g., BAM server).
- the grouping is performed independently of the sequence data, and even more preferably in real-time.
- the annotation input from the user includes an analysis type annotation (e.g., whole genome analysis, exome enrichment analysis, transcriptome analysis, and proteome analysis) and a patient specific annotation (e.g., patient identifier, a tissue identifier, a tissue status identifier, and a health record identifier).
- the transport group is transferred from the transport server to the downstream analytic device upon completion of the transport group.
- the omic output files may be provided by a database storing omic output files or by a plurality of sequencing devices.
- a transport server produces a transport group from multiple omic output files, wherein the omic output files are grouped according to a machine-specific annotation and an annotation input from a user.
- the omic output files in the transport group will have a SAM format or a BAM format
- the annotation input from the user includes an analysis type annotation (e.g., whole genome analysis, exome enrichment analysis, transcriptome analysis, and proteome analysis) and/or a patient specific annotation (e.g., patient identifier, a tissue identifier, a tissue status identifier, and a health record identifier).
- Figure 1 is an exemplary illustration of a transmission and pre-processing system for omics sequences according to the inventive subject matter.
- any language directed to a computer should be read to include any suitable combination of computing devices, including servers, interfaces, systems, databases, agents, peers, engines, controllers, modules, or other types of computing devices operating individually or collectively.
- the computing devices comprise a processor configured to execute software instructions stored on a tangible, non-transitory computer readable storage medium (e.g., hard drive, FPGA, PLA, solid state drive, RAM, flash, ROM, etc.).
- the software instructions preferably configure or program the computing device to provide the roles, responsibilities, or other functionality as discussed below with respect to the disclosed apparatus.
- the disclosed technologies can be embodied as a computer program product that includes a non-transitory computer readable medium storing the software instructions that causes a processor to execute the disclosed steps.
- the various servers, systems, databases, or interfaces exchange data using standardized protocols or algorithms, possibly based on HTTP, HTTPS, AES, public-private key exchanges, web service APIs, known financial transaction protocols, or other electronic information exchanging methods.
- Data exchanges among devices can be conducted over a packet-switched network (e.g., the Internet, LAN, WAN, VPN, or other type of packet-switched network), a circuit-switched network, a cell-switched network, or other type of network.
- inventive subject matter provides many example embodiments of the inventive subject matter. Although each embodiment represents a single combination of inventive elements, the inventive subject matter is considered to include all possible combinations of the disclosed elements. Thus if one embodiment comprises elements A, B, and C, and a second embodiment comprises elements B and D, then the inventive subject matter is also considered to include other remaining combinations of A, B, C, or D, even if not explicitly disclosed.
- sequence analysis for numerous omics sequences provided by one or more data sources and delivered to a sequence analysis engine can be readily improved by pre-processing and/or grouping of the omics sequences to so form logical units that are then fed to the sequence analysis engine, and that are processed without the need for retrieval of further sequences required for the same analysis. It should be especially noted that such preprocessing and/or grouping will significantly reduce processing time required by the sequence analysis engine, and may also significantly reduce the time to completion where the sequence analysis was compromised by invalid and/or missing data as such data can be requested and sent to the sequence analysis engine in an efficient and coordinated manner.
- pre-processing and/or grouping is performed using both machine-specific annotations and user annotation(s).
- the inventors contemplate a transport server that lines up and/or groups multiple omics sequences for analysis based on user and (sequencing) device parameters without regard to the actual omic sequences being transmitted. Therefore, and viewed from a different perspective, a user will be able to set up a user-defined rule for sequence analysis, in which the rule determines the real-time grouping of the omics output files into one or more transport groups.
- Figure 1 exemplarily illustrates a transit system 100 for delivery of a plurality of omic sequences from a number of sequencing devices to a sequence analysis engine.
- the omic sequences comprise sequence data (e.g., nucleic acid sequences) and a machine-specific annotation.
- system 100 comprises multiple sequencing devices 110a, 110b, and 110c that produce a plurality of omic output files 112a, 112b, and 112c from a plurality of patient samples, which may be from the same patient or from different patients (not shown).
- the sequencing devices 110a, 110b, and 110c are informationally coupled to the transport server 120 via wide area network 102, and all of the omic output files 112d are directly or indirectly (e.g., via temporary data storage device 150) routed to the transport server 120.
- Example sequencing devices include the Oxford Nanopore MinION, or any of the Illumina® MiSeq or HiSeq devices.
- contemplated systems include a transport server 120 that includes an annotation engine 122 and a transit engine 124, and the transport server 120 is coupled via wide area network 102 to the sequencing devices 110a-110c so that the sequencing devices can provide respective omic output files to the transport server.
- the transport server is also coupled via wide area network 102 to a sequence analysis engine 140 that receives a transport group 126 from the transport server 120.
- Annotation engine 122 is preferably configured to annotate the omic output files using an annotation input from an input device 130 of a user (e.g., medical professional) to so form annotated omic output files 126.
- the transit engine 124 is configured (most typically via one or more predefined rules) to group the annotated omic output files into the transport group based on the machine-specific annotation and the annotation input from the user. Once grouped, the transit engine then transfers the transport group to the sequence analysis engine 140 (e.g., BAM server).
- although the transport server 120, the sequence analysis engine 140, and input device 130 are illustrated as individual computing devices, it should be appreciated that each device could take on different forms.
- the collection of devices could be implemented as a cloud-based service; perhaps a for-fee service.
- Stakeholders e.g., insurance companies, physicians, oncologists, pharma companies, patients, other analysis engines, etc.
- the services can be provided via web services interfaces (e.g., WSDL, SOAP, HTTP, REST, BEEP, etc.) possibly through a network accessible API.
- the devices can be singular devices having one or more applications installed on the computing devices.
- the devices can comprise a single, unitary device providing all the roles or responsibilities for the three devices.
- a user has provided (directly or indirectly) a sequencing facility with one or more samples (e.g., a tumor sample and a matched normal sample from the same patient) for whole genome analysis.
- the user uses a suitable security measure (e.g., a one-time use key that is preferably linked to the sequence reads) to access the sequencing facility for download, while the sequencing facility will typically use a corresponding security measure (e.g., same or matching key) for upload to the user.
- the sequence information will be encrypted in at least one segment of transport.
- sequence information may be encrypted by an encryption module of the sequencing device, or an encryption device that is informationally coupled to the sequencing device. While it is generally contemplated that the sequencing devices will be co-located in a single sequencing facility, it should be recognized that co-location is not critical to the inventive subject matter.
- sequencing device is not limiting to the inventive subject matter, but that all devices that produce an omic output are deemed suitable for use herein.
- especially preferred devices include nucleic acid sequencing devices that provide genomic raw data, or genomic data converted to SAM format, BAM format, VCF format, FASTQ format, or FASTA format.
- proteomics high throughput devices and RNA analysis devices are also contemplated herein. While it is contemplated that a patient sample can be exclusively analyzed on a single sequencing device, it is also contemplated that the sample can be analyzed using two or more different sequencing devices.
- the sequencing devices may also be configured to receive one or more feedback signals from the transport server, sequence analysis engine, and/or user via the user input device. For example, where the sequence analysis engine determines that certain regions in the genome require a higher reading threshold, the sequence analysis engine may provide feedback to the transport server and/or sequencing device to perform further analysis for that region. On the other hand, where the transport engine determines that a device parameter of a particular sequencing device fails to satisfy a specific predetermined level (e.g., data of one or more lanes below predetermined quality score), the transport engine may provide instructions to the sequencing device to change an operational parameter or to go offline.
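The quality-driven feedback described above can be illustrated with a minimal sketch. The names `LaneReport` and `feedback_for`, the threshold of 30, the two-lane cutoff, and the instruction strings are assumptions made for illustration only; the source does not specify them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LaneReport:
    """Hypothetical per-lane summary taken from machine-specific annotations."""
    device_id: str
    lane: int
    mean_quality: float  # assumed to be a mean quality score for the lane


def feedback_for(reports, min_quality=30.0, max_bad_lanes=2):
    """Return {device_id: instruction} for devices with lanes below min_quality."""
    bad = {}
    for report in reports:
        if report.mean_quality < min_quality:
            bad.setdefault(report.device_id, []).append(report.lane)

    feedback = {}
    for device_id, lanes in bad.items():
        # Assumed policy: a few bad lanes trigger a parameter change,
        # many bad lanes take the device offline.
        action = "go_offline" if len(lanes) > max_bad_lanes else "adjust_parameters"
        feedback[device_id] = {"action": action, "lanes": sorted(lanes)}
    return feedback


reports = [
    LaneReport("110a", 1, 35.2),
    LaneReport("110a", 2, 24.9),
    LaneReport("110b", 1, 18.0),
    LaneReport("110b", 2, 19.5),
    LaneReport("110b", 3, 21.1),
]
signals = feedback_for(reports)
```

In this sketch, devices with no failing lanes receive no feedback signal at all.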
- the device will (preferably automatically) attach to the omic output file a machine-specific annotation.
- suitable machine-specific annotations include a date and/or time identifier, a sequencing device identifier, a lane identifier, a quality score, and/or a pair member identifier.
- the data flowing through transit system 100 can be secured through multiple techniques.
- the omic data can be sent over secure communication links, possibly via secure FTP, HTTPS, SSL, or other protocol.
- higher strength implementations of cryptographic protocols or algorithms are more preferred.
- the computational overhead or other cost associated with cryptographic protocols can dictate using less secure implementations of cryptographic protocols or algorithms.
- while AES-128 might be sufficient for most consumers, AES-256 could be used for circumstances where confidentiality is of greater import than computational costs.
- the omic data can be stored within secured memories, possibly memories or storage modules that adhere to one or more levels of FIPS 140.
- Other suitable algorithms include 3DES, Twofish, Blowfish, XXTEA, PGP, or other known algorithms or those yet to be invented. It should be appreciated that at least some data from the omics files (e.g., a sequence of a patient's genome) could form a basis for a token or key with respect to the implementations of the cryptographic protocols or algorithms. Thus, only an entity having access to the patient's omic data could unlock or gain access to the data.
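One way a key could be derived from a portion of a patient's omic data is sketched below. This is an illustration under stated assumptions, not the method of the disclosure: the choice of PBKDF2 with SHA-256, the salt handling, the iteration count, and the name `derive_key` are all hypothetical, and the sketch covers key derivation only. The resulting 32 bytes would be suitable as an AES-256 key in a separate cipher implementation, which is not shown.

```python
import hashlib
import os


def derive_key(omic_fragment: bytes, salt: bytes, iterations: int = 200_000) -> bytes:
    """Derive a 256-bit key from a fragment of omic data (illustrative only)."""
    # PBKDF2-HMAC-SHA256 from the standard library; dklen=32 yields 32 bytes.
    return hashlib.pbkdf2_hmac("sha256", omic_fragment, salt, iterations, dklen=32)


salt = os.urandom(16)                      # random salt, stored alongside the data
fragment = b"ACGTTGCAAGGCTTACGGATCCAT"     # made-up sequence fragment
key = derive_key(fragment, salt)
```

An entity holding the same fragment and salt can re-derive the same key, whereas a different fragment yields a different key.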
- the data source(s) that provide the omic data will in most cases automatically annotate the omic data using device-specific parameters, and that such annotation will be in a predefined format.
- a typical sequencing device will provide sequencing data in FASTQ or FASTA format, and as such include an instrument name, flowcell ID and/or name, index number for a multiplexed sample, indication as to the member of a pair (for paired-end or mate pair reads), etc.
- the device-specific parameters may also include a quality value with respect to the read, and where desired optional sequence annotations (e.g., sequence identifier and/or description).
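To illustrate how such device-specific parameters might be read from a FASTQ record, the sketch below parses a header line in the colon-delimited layout commonly emitted by Illumina software (instrument, run, flowcell, lane, tile, x, y, followed by read number, filter flag, control bits, and index). The header value, the `ReadHeader` type, and the `parse_header` function are made up for illustration; other instruments use different layouts.

```python
from typing import NamedTuple


class ReadHeader(NamedTuple):
    instrument: str
    run_number: int
    flowcell_id: str
    lane: int
    pair_member: int  # 1 or 2 for paired-end reads
    index: str        # index (barcode) of a multiplexed sample


def parse_header(line: str) -> ReadHeader:
    """Parse one Illumina-style FASTQ header line into its annotation fields."""
    if not line.startswith("@"):
        raise ValueError("FASTQ header lines start with '@'")
    left, _, right = line[1:].strip().partition(" ")
    fields = left.split(":")
    read_fields = right.split(":")
    if len(fields) != 7 or len(read_fields) != 4:
        raise ValueError("unexpected header layout: " + line)
    instrument, run, flowcell, lane = fields[:4]
    pair_member, _filtered, _control, index = read_fields
    return ReadHeader(instrument, int(run), flowcell, int(lane), int(pair_member), index)


header = parse_header("@M01234:55:000000000-A1B2C:1:1101:15589:1332 1:N:0:ATCACG")
```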
- the data source(s) may provide the omic data directly in a streaming fashion, or from an intermediary data storage, or even from a temporary data storage device that is coupled between the sequencing device(s) and the transport server.
- the raw sequence data output files are converted to a file type that is suitable for analysis by the sequence analysis engine.
- the file type for the sequence analysis engine is a SAM or BAM file.
- the user will operate a dedicated transport server via a user input device (e.g., computer or mobile device connected to a wide area network), which may be co-located with the user, or may be remotely located and accessed by the user via a terminal or other appropriate interface.
- the user will annotate the omic output files (e.g., sequence reads) from the data source (e.g., sequencing device) using an annotation input that is specific to the upload of the omic data.
- the transport server will include an annotation engine to allow the user to perform such annotation.
- annotation may also be provided via a separate annotation module that is then coupled to the transport server. While the nature of the annotation input is not limiting to the inventive subject matter, it should be appreciated that the annotation input will typically bear at least some significance to the sample and/or patient, and most typically include an analysis type annotation and a patient specific annotation.
- the analysis type annotation may be specific to the particular protocol or technique used for sample preparation, sample procedure, etc., and thus may include reference to whole genome analysis, exome enrichment analysis, transcriptome analysis, proteome analysis, etc.
- the patient specific annotation will generally relate to some information that is at least to some degree associated with the patient.
- patient specific annotation will typically include a patient identifier, a tissue identifier, a tissue status identifier (e.g., matched normal, diseased, primary tumor, recurring tumor, metastatic tumor, etc.), a health record identifier (e.g., type of disease, status of patient), electronic medical record identifier, etc.
- User annotation may further include the type of desired analysis (e.g., a request to compare tumor versus matched normal, or tumor versus earlier tumor sample or other reference).
- the user will provide a second layer of information to the omics data that will allow association of the omics information with information that is uniquely relevant to the patient, the specific type of patient sample (e.g., diseased versus control, or before and during/after treatment with a drug), and the type of analysis ordered (e.g., whole genome analysis or exome or transcriptome analysis).
- Such dual information content i.e., machine- specific annotation and the annotation input from the user
- analysis can be performed with minimal interruptions that would otherwise be due to missing or incomplete omics information.
- the transit engine will be configured to transmit the transport group upon completion of forming the transport group as defined by the user (and appropriate rules governing grouping function).
- grouping according to a predetermined grouping mode for machine-specific annotation is also contemplated.
- Grouping is typically performed at the transport server using the transit engine and both the user annotation and the machine-specific annotation such that a group of sequences is formed that is a complete group of sequences with respect to a particular analytic task by the sequence analysis engine. Therefore, in at least one aspect of the inventive subject matter, grouping may be driven by matching normal and diseased samples, which may be refined by matching genomic regions between the samples, or by specific patient, or patient history, as well as by disease type using different patient samples. Matching may further be driven by quality measures of the omic output file and other machine-specific annotations (e.g., exclusion of omic files coming from a particular lane or device).
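The grouping described above can be sketched as follows. The sketch assumes a hypothetical `AnnotatedFile` record, a group key of patient identifier plus analysis type, and a completeness rule requiring both a tumor and a matched normal sample; none of these specifics are prescribed by the source. Note that the function never inspects the sequence data itself, only the annotations.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotatedFile:
    path: str
    patient_id: str      # user annotation
    analysis_type: str   # user annotation
    tissue_status: str   # user annotation
    lane: int            # machine-specific annotation
    quality: float       # machine-specific annotation


def group_files(files, required=("tumor", "matched_normal"),
                min_quality=30.0, excluded_lanes=()):
    """Group files by annotations and split groups into complete and pending."""
    groups = defaultdict(list)
    for f in files:
        # Machine-specific annotations can exclude a file from any group.
        if f.quality < min_quality or f.lane in excluded_lanes:
            continue
        groups[(f.patient_id, f.analysis_type)].append(f)

    complete, pending = {}, {}
    for key, members in groups.items():
        present = {m.tissue_status for m in members}
        if set(required) <= present:
            complete[key] = members
        else:
            pending[key] = members
    return complete, pending


files = [
    AnnotatedFile("t1.bam", "P1", "whole_genome", "tumor", 1, 36.0),
    AnnotatedFile("n1.bam", "P1", "whole_genome", "matched_normal", 2, 35.0),
    AnnotatedFile("t2.bam", "P2", "whole_genome", "tumor", 1, 34.0),
    AnnotatedFile("n2.bam", "P2", "whole_genome", "matched_normal", 3, 12.0),
]
complete, pending = group_files(files)
```

Here the second patient's group stays pending because the matched normal file falls below the assumed quality threshold.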
- the grouping may be performed using an a priori or default grouping that is based on the machine-specific annotations, which may then be modified or tuned on the basis of the user annotations.
- grouping of the annotated omic output files can be performed independent of actual sequences in the annotated omic output files, but as a function of specific requirements by a user (e.g., as a function of a desired type of analysis, patient history, type of disease, etc.).
- grouping may be driven or modified by a feedback signal from the sequence analysis engine and/or the omic data source.
- the sequence analysis engine may provide feedback to the transport server to include additional omic data for a particular genomic region, or the omic data source may provide feedback to the transport server that no further omic data are being delivered.
- the transport server may also provide feedback to the omic data source to repeat a particular analysis, or to the sequence analysis engine to indicate presence or absence of particular data.
- the grouping is preferably performed in substantially real-time (i.e., as omics data are delivered or made available), that the groups are sent to the sequence analysis engine with a group-specific ID, and that the group is sent only upon completion of the grouping by the transport server.
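A minimal sketch of this behavior, assuming files arrive one at a time and that a group is considered complete once a fixed number of files has arrived, is given below. The `TransportGrouper` class, the fixed-count completeness rule, and the use of a random UUID as the group-specific ID are assumptions for illustration.

```python
import uuid


class TransportGrouper:
    """Collects files as they arrive and sends each group only once complete."""

    def __init__(self, expected_per_group, send):
        self.expected = expected_per_group  # assumed completeness criterion
        self.send = send                    # callable that delivers a finished group
        self.open_groups = {}

    def on_file(self, group_key, file_name):
        members = self.open_groups.setdefault(group_key, [])
        members.append(file_name)
        if len(members) < self.expected:
            return None                     # group still incomplete, nothing sent
        del self.open_groups[group_key]
        group = {"id": str(uuid.uuid4()), "key": group_key, "files": members}
        self.send(group)
        return group


sent = []
grouper = TransportGrouper(expected_per_group=2, send=sent.append)
first = grouper.on_file("patient-1/whole_genome", "tumor.bam")
second = grouper.on_file("patient-1/whole_genome", "normal.bam")
```

In a deployment the `send` callable would hand the group to the sequence analysis engine; here it simply appends to a list.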
- the transport group is preferably encrypted prior to delivery to the sequence analysis engine.
- User annotations can take on many different forms or a broad spectrum of information depending on the nature of the analysis project at hand. Further, the nature of the user annotation can depend on the role or responsibilities of the user with respect to the analysis ecosystem. Consider, for example, where the user has the role of a system administrator of the transport server 120 or the sequence analysis engine 140. The system administrator might create an annotation indicating available network bandwidth or storage capacity. The transport server 120 can package omic data to ensure the resulting logical unit respects such limitations.
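As one possible reading of the administrator example, the sketch below packs files into logical units that each stay within a capacity limit taken from an administrator annotation. The first-fit-decreasing strategy, the function name `pack_units`, and the file sizes are assumptions; the source does not say how the limit is to be honored.

```python
def pack_units(file_sizes, max_unit_bytes):
    """Greedily pack files into units whose total size stays within the limit."""
    units = []
    # Place the largest files first (first-fit decreasing heuristic).
    for name, size in sorted(file_sizes.items(), key=lambda kv: kv[1], reverse=True):
        if size > max_unit_bytes:
            raise ValueError(name + " exceeds the unit limit on its own")
        for unit in units:
            if unit["bytes"] + size <= max_unit_bytes:
                unit["files"].append(name)
                unit["bytes"] += size
                break
        else:
            # No existing unit has room, so open a new one.
            units.append({"files": [name], "bytes": size})
    return units


sizes = {"a.bam": 60, "b.bam": 50, "c.bam": 40, "d.bam": 30}  # made-up sizes
units = pack_units(sizes, max_unit_bytes=100)
```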
- the user could be a physician.
- the physician might include a user annotation that comprises the physician's unique identifier (e.g., physician registry identifier, national provider identifier (NPI), etc.), a diagnosis code (e.g., ICD-9, ICD-10, DSM, etc.), procedure codes (e.g., CPT, etc.), or other physician related information.
- Additional user annotations could include insurance coverage, urgency information, priority information, data ownership information, or other attributes.
- the user annotations could be normalized according to an a priori defined user annotation namespace or ontology where each type of user annotation could comprise attributes (i.e., a dimension in the namespace) that take on specific values (i.e., a metric for the dimension).
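Normalization against such a namespace might look like the following sketch. The namespace contents, the convention that `None` marks a free-form attribute, and the `normalize` function are hypothetical; the allowed values are borrowed from the examples given elsewhere in this document.

```python
# Hypothetical namespace: attribute name -> allowed values (None = free-form).
USER_NAMESPACE = {
    "analysis_type": {"whole_genome", "exome_enrichment", "transcriptome", "proteome"},
    "tissue_status": {"matched_normal", "diseased", "primary_tumor",
                      "recurring_tumor", "metastatic_tumor"},
    "patient_id": None,
}


def _canonical(text):
    """Lower-case a label and replace spaces with underscores."""
    return str(text).strip().lower().replace(" ", "_")


def normalize(raw):
    """Map raw user input onto namespace attributes and permitted values."""
    normalized = {}
    for attribute, value in raw.items():
        key = _canonical(attribute)
        if key not in USER_NAMESPACE:
            raise KeyError("unknown attribute: " + key)
        allowed = USER_NAMESPACE[key]
        if allowed is None:
            normalized[key] = str(value).strip()
            continue
        candidate = _canonical(value)
        if candidate not in allowed:
            raise ValueError("value not allowed for " + key + ": " + candidate)
        normalized[key] = candidate
    return normalized


annotation = normalize({"Analysis Type": "Whole Genome", "patient_id": " P-001 "})
```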
- Machine-specific annotations, in a similar vein to the user annotations, can also take on a broad spectrum of values to reflect the nature of one or more specific machines or their corresponding states.
- the machine-specific annotations could pertain to one or more devices within ecosystem 100, including sequencing devices 110a through 110c, transport server 120, input device 130, or even sequence analysis engine 140.
- Example machine-specific annotations could include device identifiers (e.g., IP addresses, MAC addresses, serial numbers, model numbers, etc.), device bandwidth (e.g., Gpb/second, network bandwidth, etc.), analysis metrics, available machine learning or analysis algorithms, device location, costs to process, CPU availability (e.g., MFLOPs, available threads, available cores, etc.), or other machine-related attributes.
- the machine-specific annotations could adhere to a machine attribute namespace.
- the machine-specific annotations can be compiled according to the machine attribute namespace as a machine-specific annotation data structure (e.g., a vector, a tuple, etc.).
- the annotation engine 122 can thus tag or bind the output files with the data structure, possibly as metadata in the form of an XML file.
- the roles or responsibilities of the annotation engine 122 can be integrated into sequencing devices 110a through 110c, possibly even as an aftermarket adapter.
- the transit engine 124 is configured to execute one or more software instructions that embody rules according to which the output files are grouped together.
- the rules can be provided by the user via input device 130 or could be installed within transport server 120.
- the rules can be implemented as a script or other code that operates based on the user and machine-specific annotations.
- transit engine 124 could comprise a script-based run-time (e.g., Python, Ruby, Java, .NET, etc.) that provides an API capable of accessing output files 112a through 112c as well as their corresponding annotations.
- a user can then write a script, or otherwise cause a script to execute, via the APIs, to process the output files in order to build transport group 126.
- the rules can include requirements, conditions, or other criteria that depend on the annotations or their values, possibly based on the a priori defined namespaces.
- a simple example could include rules that seek to bind all output files that correspond to a specific physician.
- the transit engine 124 queries, according to the physician-based rule, for all output files having the physician's identifier. The result set could then be compiled together to form a single logical unit representing that physician's requested work product. It should be appreciated that the rules or scripts could comprise quite complex rules that govern grouping the output files into transport group 126.
- transport group 126 is considered to be a single logical unit with respect to processing the output files. This approach is considered quite advantageous because it enables the computing devices to optimize computational resources from both a global perspective (e.g., with respect to all files) while also respecting local efficiencies (e.g., very specific requests).
- rules or scripts under which the transit engine 124 operates can be considered as the definition of a logical unit processing as defined with respect to the
- system 100 comprises a for-fee genomic processing service available for oncologists.
- An oncologist could submit an urgent request (i.e., a user annotation with an urgency level, a high dollar value request, a time deadline, etc.) to the system to identify a known drug that might have a positive impact on the patient's immediate care.
- the transit engine 124 can identify all output files having the patient identifier and output files relating to reference genomes associated with one or more known drugs.
- the transit engine 124 can determine which files might require additional reads or data based on sequencing device annotations. Yet further, the transit engine 124 can use device attributes associated with sequence analysis engine 140, which could include device availability or capacity. If sufficient capacity is available, the transit engine 124 can group the related output files together as a logical unit, possibly tagged with the urgency level, and submit the logical unit to the sequence analysis engine 140 for immediate processing.
- the logical unit could be transmitted as a binary file, a text file, or even a serialized file (e.g., XML, YAML, JSON, etc.) or other format.
- Because the transit engine 124 can combine output files together as a logical unit to address optimization needs of system 100 or a stakeholder, one should further appreciate that logical units can be constructed to address myriad possible optimization metrics.
- Example metrics that could represent a goal or concern for processing transport group 126 include monetary cost, bandwidth, network or processing latency, geographical constraints, security or confidentiality levels, electrical power consumption costs, priority, urgency, importance, patient life expectancy, or other metrics.
- With respect to the sequence analysis engine, it is generally contemplated that all known sequence analysis engines are deemed suitable for use herein. However, it is especially preferred that the sequence analysis engine is configured to use a SAM or BAM file as an input file (e.g., BAMserver), and particularly preferred sequence analysis engines include those that produce a local alignment by incrementally synchronizing the first and second sequence strings using a known position of at least one of a plurality of corresponding sub-strings, wherein the local alignment is used to generate a local differential string between the first and second sequence strings within the local alignment. Such a local differential string is then used to update a differential genetic sequence object in a differential sequence database. Examples of such sequence analysis engines are described in US 2012/0066001, WO 2013/074058, and WO 2014/058987, all of which are incorporated by reference herein.
- a temporary data storage device may be coupled between the sequencing devices and the transport server so as to allow for buffering.
- a temporary buffer could include a personalized genomic data card having a large capacity memory (e.g., preferably greater than 200GB, 500GB, 1TB, 2TB, or more) and a processor.
- the personalized data card can store one or more omic output files of the patient that owns the card.
- the patient's card could comprise a solid state disk drive having a credit card contact pad.
- the patient can authorize the transport server or other entity to access their genomic data on the card.
- longer term storage may be implemented in cases where the same patient is subject to testing over a prolonged period of time (e.g., prior to treatment and after treatment/follow-up).
- Example long term storage solutions include a SAN, NAS, RAID, cloud-based storage, a clinical operating system data custodian, or other type of storage.
- the transit system 100 can include one or more sample databases, possibly including a file system, configured to store sequences of the patient's samples.
- a transit system for delivery of multiple omic sequences will include a transport server having a transit engine and an annotation engine.
- the transport server is typically (directly or indirectly) coupled to one or more sequencing devices that provide omic output files.
- the annotation engine is configured to annotate the plurality of omic output files using an annotation input from a user to thereby form annotated omic output files.
- the transit engine is configured to group the annotated omic output files into the transport group based on the machine-specific annotation and the annotation input from the user.
- the transit engine is configured to transfer the transport group to the sequence analysis engine.
- the inventors therefore also contemplate a method of transferring omic sequences using a transport server having a transit engine and an annotation engine.
- Especially contemplated methods include a step of receiving, by the transport server, omic output files (e.g., genomic output files, RNA-omic output files, or proteomic output files) from sequencing devices, wherein each of the omic output files comprises sequence data and a machine-specific annotation.
- the annotation engine annotates the omic output files using annotation input from a user to thereby form annotated omic output files.
- the transit engine groups the annotated omic output files into a transport group, wherein grouping is based on the machine-specific annotation and the annotation input from the user.
- the transport server delivers the transport group to a sequence analysis engine.
- a transport server receives multiple omic output files comprising sequence data and a machine-specific annotation.
- the omic output files are then grouped into a transport group using an annotation input from a user and the machine-specific annotation, and the transport group is then transferred from the transport server to a downstream analytic device.
- Such group transfer will advantageously lead to a method of reducing the processing time for genomic analysis in a sequence analysis engine in which a transport server produces a transport group from multiple omic output files, wherein the omic output files are grouped according to a machine-specific annotation and an annotation input from a user.
- the sequence analysis engine then receives the transport group, wherein the sequence analysis engine processes the transport group as a logical unit.
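The receive, annotate, group, and deliver flow summarized above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation described in the application: the `OmicOutputFile` class, the `annotate`, `group`, `physician_rule`, and `serialize` functions, the annotation keys (`physician_npi`, `urgency`, `device_id`), and the file names are all hypothetical. Encryption and network delivery are omitted.

```python
from dataclasses import dataclass, field
import json
import uuid


@dataclass
class OmicOutputFile:
    """One sequencing output file plus its annotations (hypothetical structure)."""
    path: str
    machine_annotation: dict                        # e.g. {"device_id": "110a"}
    user_annotation: dict = field(default_factory=dict)


def annotate(files, user_input):
    """Annotation engine step: return copies of the files with the user input attached."""
    return [
        OmicOutputFile(f.path, dict(f.machine_annotation),
                       {**f.user_annotation, **user_input})
        for f in files
    ]


def physician_rule(npi, allowed_devices):
    """Build a rule that checks a user annotation (NPI) and a machine annotation (device)."""
    def rule(f):
        return (f.user_annotation.get("physician_npi") == npi
                and f.machine_annotation.get("device_id") in allowed_devices)
    return rule


def group(files, rule):
    """Transit engine step: collect matching files into one logical unit with its own ID."""
    return {
        "group_id": str(uuid.uuid4()),              # group-specific ID
        "files": [
            {"path": f.path, "machine": f.machine_annotation, "user": f.user_annotation}
            for f in files if rule(f)
        ],
    }


def serialize(transport_group):
    """Serialize the logical unit as JSON, one of the formats the description mentions."""
    return json.dumps(transport_group, sort_keys=True)


# Files as received from three sequencing devices (assumed names and IDs).
received = [
    OmicOutputFile("run1.bam", {"device_id": "110a"}),
    OmicOutputFile("run2.bam", {"device_id": "110b"}),
    OmicOutputFile("run3.bam", {"device_id": "110c"}),
]

# Two physicians annotate different subsets of the files.
annotated = (
    annotate(received[:2], {"physician_npi": "1234567890", "urgency": "high"})
    + annotate(received[2:], {"physician_npi": "0987654321"})
)

# Group everything belonging to the first physician and serialize it for delivery.
tg = group(annotated, physician_rule("1234567890", {"110a", "110b"}))
payload = serialize(tg)
```

Expressing the rule as an ordinary function mirrors the description's suggestion that grouping rules can be supplied as scripts. A more complex rule (for example, one that also checks analysis engine capacity) would only need a different predicate.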
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Biophysics (AREA)
- Biotechnology (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Medical Informatics (AREA)
- General Health & Medical Sciences (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioethics (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Chemical & Material Sciences (AREA)
- Analytical Chemistry (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Medical Treatment And Welfare Office Work (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
- Measuring And Recording Apparatus For Diagnosis (AREA)
- Investigating Or Analysing Biological Materials (AREA)
Priority Applications (7)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CA2932556A CA2932556A1 (en) | 2013-11-13 | 2014-11-13 | Systems and methods for transmission and pre-processing of sequencing data |
| JP2016531069A JP6472798B2 (en) | 2013-11-13 | 2014-11-13 | System and method for transmission and preprocessing of sequencing data |
| CN201480071385.9A CN106687965B (en) | 2013-11-13 | 2014-11-13 | Systems and methods for delivering and preprocessing sequencing data |
| KR1020167015398A KR20160133400A (en) | 2013-11-13 | 2014-11-13 | Systems and methods for transmission and pre-processing of sequencing data |
| EP14862192.3A EP3069285A4 (en) | 2013-11-13 | 2014-11-13 | Systems and methods for transmission and pre-processing of sequencing data |
| AU2014348566A AU2014348566B2 (en) | 2013-11-13 | 2014-11-13 | Systems and methods for transmission and pre-processing of sequencing data |
| AU2019203427A AU2019203427A1 (en) | 2013-11-13 | 2019-05-15 | Systems and methods for transmission and pre-processing of sequencing data |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US201361903903P | 2013-11-13 | 2013-11-13 | |
| US61/903,903 | 2013-11-13 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2015073735A1 (en) | 2015-05-21 |
Family
ID=53044715
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/US2014/065562 Ceased WO2015073735A1 (en) | 2013-11-13 | 2014-11-13 | Systems and methods for transmission and pre-processing of sequencing data |
Country Status (8)
| Country | Link |
|---|---|
| US (2) | US10193956B2 (en) |
| EP (1) | EP3069285A4 (en) |
| JP (1) | JP6472798B2 (en) |
| KR (1) | KR20160133400A (en) |
| CN (2) | CN106687965B (en) |
| AU (2) | AU2014348566B2 (en) |
| CA (1) | CA2932556A1 (en) |
| WO (1) | WO2015073735A1 (en) |
Families Citing this family (9)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US8666760B2 (en) * | 2005-12-30 | 2014-03-04 | Carefusion 303, Inc. | Medication order processing and reconciliation |
| US10176295B2 (en) | 2013-09-26 | 2019-01-08 | Five3 Genomics, Llc | Systems, methods, and compositions for viral-associated tumors |
| US10380645B2 (en) * | 2014-03-07 | 2019-08-13 | DO-THEDOC Inc. | System for securely transmitting medical records and for providing a sponsorship opportunity |
| WO2018183493A1 (en) * | 2017-03-29 | 2018-10-04 | Nantomics, Llc | Signature-hash for multi-sequence files |
| WO2019027767A1 (en) * | 2017-07-31 | 2019-02-07 | Illumina Inc. | Sequencing system with multiplexed biological sample aggregation |
| KR102304544B1 (en) | 2018-04-30 | 2021-09-24 | 서울대학교 산학협력단 | Genome data model managing apparatus for precision medicine and managing method thereof |
| CN112037866B (en) * | 2020-09-15 | 2024-06-11 | 中国科学院微生物研究所 | Method, device, electronic device and medium for querying strain genome sequencing information |
| CN112185460B (en) * | 2020-09-23 | 2022-07-08 | 谱度众合(武汉)生命科技有限公司 | Heterogeneous data independent proteomics mass spectrometry analysis system and method |
| EP4607519A1 (en) * | 2024-02-20 | 2025-08-27 | Siemens Healthcare Diagnostics Products GmbH | Computer-implemented method for controlling a data processing, processing device, computer program and data carrier |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20030211504A1 (en) * | 2001-10-09 | 2003-11-13 | Kim Fechtel | Methods for identifying nucleic acid polymorphisms |
| US20070020651A1 (en) * | 2001-05-25 | 2007-01-25 | Dnaprint Genomics, Inc. | Compositions and methods for the inference of pigmentation traits |
| US20090170717A1 (en) * | 2004-07-02 | 2009-07-02 | The Government Of The United States Of America, As Represented By The Secretary Of The Navy | Re-sequencing pathogen microarray |
| US20120095693A1 (en) * | 2010-08-31 | 2012-04-19 | Annai Systems, Inc. | Method and systems for processing polymeric sequence data and related information |
Family Cites Families (14)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US7235358B2 (en) * | 2001-06-08 | 2007-06-26 | Expression Diagnostics, Inc. | Methods and compositions for diagnosing and monitoring transplant rejection |
| US7743233B2 (en) | 2005-04-05 | 2010-06-22 | Intel Corporation | Sequencer address management |
| US9407942B2 (en) * | 2008-10-03 | 2016-08-02 | Finitiv Corporation | System and method for indexing and annotation of video content |
| WO2011032725A1 (en) * | 2009-09-18 | 2011-03-24 | Kinogea, Inc. | Method and system for building and using a centralised and harmonised relational protein and peptide database |
| US20110288785A1 (en) | 2010-05-18 | 2011-11-24 | Translational Genomics Research Institute (Tgen) | Compression of genomic base and annotation data |
| KR102042253B1 (en) * | 2010-05-25 | 2019-11-07 | 더 리젠츠 오브 더 유니버시티 오브 캘리포니아 | Bambam: parallel comparative analysis of high-throughput sequencing data |
| JP6420543B2 (en) * | 2011-01-19 | 2018-11-07 | コーニンクレッカ フィリップス エヌ ヴェKoninklijke Philips N.V. | Genome data processing method |
| US8913552B2 (en) * | 2011-01-24 | 2014-12-16 | International Business Machines Corporation | Spatiotemporal annotation of data packets in wireless networks |
| US9215162B2 (en) | 2011-03-09 | 2015-12-15 | Annai Systems Inc. | Biological data networks and methods therefor |
| CN103582887B (en) * | 2011-06-07 | 2017-07-04 | 皇家飞利浦有限公司 | The method and sequencing device of nucleotide sequence data are provided |
| KR102003660B1 (en) * | 2011-07-13 | 2019-07-24 | 더 멀티플 마이얼로머 리서치 파운데이션, 인크. | Methods for data collection and distribution |
| US20130091126A1 (en) * | 2011-10-11 | 2013-04-11 | Life Technologies Corporation | Systems and methods for analysis and interpretation of nucleic acid sequence data |
| KR20190099105A (en) * | 2011-12-08 | 2019-08-23 | 파이브3 제노믹스, 엘엘씨 | Distributed system providing dynamic indexing and visualization of genomic data |
| US20140278461A1 (en) | 2013-03-15 | 2014-09-18 | Memorial Sloan-Kettering Cancer Center | System and method for integrating a medical sequencing apparatus and laboratory system into a medical facility |
2014
- 2014-11-13 EP EP14862192.3A patent/EP3069285A4/en not_active Ceased
- 2014-11-13 JP JP2016531069A patent/JP6472798B2/en not_active Expired - Fee Related
- 2014-11-13 CN CN201480071385.9A patent/CN106687965B/en not_active Expired - Fee Related
- 2014-11-13 US US14/541,068 patent/US10193956B2/en active Active
- 2014-11-13 CN CN201910873177.5A patent/CN110570906A/en active Pending
- 2014-11-13 AU AU2014348566A patent/AU2014348566B2/en not_active Ceased
- 2014-11-13 WO PCT/US2014/065562 patent/WO2015073735A1/en not_active Ceased
- 2014-11-13 KR KR1020167015398A patent/KR20160133400A/en not_active Withdrawn
- 2014-11-13 CA CA2932556A patent/CA2932556A1/en not_active Withdrawn
2018
- 2018-12-17 US US16/222,750 patent/US20190124135A1/en not_active Abandoned
2019
- 2019-05-15 AU AU2019203427A patent/AU2019203427A1/en not_active Withdrawn
Non-Patent Citations (2)
| Title |
|---|
| ERNEST TURRO, RNA-SEQ MAPPING PRACTICAL, 29 October 2012 (2012-10-29), XP055343886 * |
| See also references of EP3069285A4 * |
Also Published As
| Publication number | Publication date |
|---|---|
| AU2019203427A1 (en) | 2019-06-06 |
| CA2932556A1 (en) | 2015-05-21 |
| JP6472798B2 (en) | 2019-02-20 |
| CN110570906A (en) | 2019-12-13 |
| CN106687965A (en) | 2017-05-17 |
| AU2014348566B2 (en) | 2019-02-28 |
| KR20160133400A (en) | 2016-11-22 |
| US20190124135A1 (en) | 2019-04-25 |
| JP2017504093A (en) | 2017-02-02 |
| US20150134662A1 (en) | 2015-05-14 |
| EP3069285A1 (en) | 2016-09-21 |
| AU2014348566A1 (en) | 2016-06-09 |
| CN106687965B (en) | 2019-10-01 |
| EP3069285A4 (en) | 2017-08-30 |
| US10193956B2 (en) | 2019-01-29 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| AU2014348566B2 (en) | Systems and methods for transmission and pre-processing of sequencing data | |
| US10957429B2 (en) | Healthcare analysis stream management | |
| US8949427B2 (en) | Administering medical digital images with intelligent analytic execution of workflows | |
| Liu et al. | Big data as an e-health service | |
| US20140365241A1 (en) | System for pre-hospital patient information exchange and methods of using same | |
| US20150310228A1 (en) | Secured mobile genome browsing devices and methods therefor | |
| CN107430584B (en) | Reading data from storage via PCI EXPRESS fabric with fully connected mesh topology | |
| CN107533526B (en) | Write data to storage via PCI EXPRESS fabric with fully connected mesh topology | |
| Pinthong et al. | A simple grid implementation with Berkeley Open Infrastructure for Network Computing using BLAST as a model | |
| CN105279366A (en) | Computer system and method for analyzing data | |
| US20130326122A1 (en) | Distributed memory access in a network | |
| TW200532470A (en) | Embedded computer system for data transmission between multiple micro-processors and method thereof |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 14862192; Country of ref document: EP; Kind code of ref document: A1 |
| | DPE1 | Request for preliminary examination filed after expiration of 19th month from priority date (pct application filed from 20040101) | |
| | ENP | Entry into the national phase | Ref document number: 2016531069; Country of ref document: JP; Kind code of ref document: A |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | ENP | Entry into the national phase | Ref document number: 2932556; Country of ref document: CA |
| | ENP | Entry into the national phase | Ref document number: 2014348566; Country of ref document: AU; Date of ref document: 20141113; Kind code of ref document: A. Ref document number: 20167015398; Country of ref document: KR; Kind code of ref document: A |
| | REEP | Request for entry into the european phase | Ref document number: 2014862192; Country of ref document: EP |
| | WWE | Wipo information: entry into national phase | Ref document number: 2014862192; Country of ref document: EP |