EP2761518A1 - System and method for facilitating network-based transactions involving sequence data - Google Patents
Info
- Publication number
- EP2761518A1 (application number EP12835985.8A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- data
- biological
- information
- data units
- file
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
- G16B50/30—Data warehousing; Computing architectures
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
- G16B50/40—Encryption of genetic data
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
- G16B50/50—Compression of genetic data
Definitions
- This application is generally directed to processing polymeric sequence information, including biopolymeric sequence information such as DNA sequence information, and to transmission of such sequence information between locations within a network.
- DNA sequencing is the process of determining the ordering of nucleotide bases (adenine (A), guanine (G), cytosine (C) and thymine (T)) in molecular DNA.
- nucleotide base sequence data collected from DNA sequencing operations are stored in multiple different formats in a number of different databases.
- databases also contain scientific information related to the DNA sequence data including, for example, information concerning single nucleotide polymorphisms (SNPs), gene expression, and copy number variations.
- transcriptomic and proteomic data are also present in multiple formats in multiple databases. This renders it impractical to exchange and process the sources of DNA sequence data and related information collected in various locations, thereby hampering the potential for scientific discoveries and advancements.
- This disclosure is generally directed to a method of processing, transmitting, and otherwise facilitating network-based transactions involving polymeric sequence information. More particularly but not exclusively, in one aspect the disclosure describes systems and methods for facilitating uploading, downloading and other network-based transactions involving sequence information, such as large files of genomic sequence data. These transactions may involve communicating such large files of sequence information between entities such as, for example, genome sequence centers (GSCs), genome data repositories (GDRs), genome data analysis companies (GDACs) and/or data coordination centers (DCCs). Each of these entities may be either public institutions or privately-owned enterprises.
- the sequencing data involved in such transactions may be generated by, for example, a GSC, which receives a purified prep of a patient's chromosomal and/or mitochondrial DNA, or an RNA prep, for sequencing.
- the patient's identification will typically be anonymized with a series of codes to label the specific aliquot from a sample preparation and the organ, tissue or cell types.
- other information including but not limited to EMR data, clinical and pharmacological data, as well as other network metadata that is specific to the particular patient can be collected by the DCCs but kept separate from the genomic data.
- the sequence data that is generated by the GSCs may be provided to or otherwise transferred within a biological data network, which may also be referred to herein as a BioIntelligent or "bIQ" network.
- An exemplary bIQ network is described within, for example, U.S. Patent Application Publication No. 2012/0233201.
- Metadata relating to the sequence data may be collected and utilized during the processing of the sequence data throughout the bIQ network in order to, for example, facilitate data coordination, correlation, privacy, security, validation and authentication.
- the genome storage repository includes a receive interface for receiving, from over a network, a plurality of portions of at least one file of biological sequence data conveyed over the network in accordance with a parallel file transfer process.
- the genome storage repository further includes a controller in communication with the receive interface and the data repository. The controller generates a reconstructed file of biological sequence data by reconstructing the at least one file of biological sequence data using the plurality of portions of the at least one file of biological sequence data.
- the disclosure is directed to a subscriber node operable within a biological data network.
- the subscriber node includes a receive interface for receiving, over one or more data links of the biological data network, a plurality of biological data units containing encoded genomic information wherein the encoded genomic information represents genomic information encoded relative to a reference sequence.
- the subscriber node further includes a controller for processing the plurality of biological data units.
- the disclosure is also directed to a genome storage repository including a data repository containing encoded genomic information and biological information relating to the encoded genomic information.
- the genome storage repository also includes a controller for generating a plurality of data units containing the encoded genomic information and the biological information.
- a transmit interface operates to transfer the plurality of data units to a subscriber device over a network.
- the disclosure pertains to a node operable within a biological data network.
- the node includes a receive interface for receiving a plurality of data units from one or more data links of the biological data network wherein each of the plurality of data units includes a payload representative of encoded genomic information and a header representative of biological information relating to the encoded genomic information.
- the node further includes a data repository and a controller for storing the plurality of data units within the data repository.
- the disclosure relates to a subscriber node having a receive interface for receiving, from over a network, an encrypted data unit containing encoded genomic information wherein the encoded genomic information represents genomic information encoded relative to a reference sequence using a plurality of instructions.
- the subscriber node further includes a controller for decrypting the encrypted data unit using a subscriber key.
- the disclosure further pertains to a method which includes receiving, from over a network, a plurality of portions of at least one file of biological sequence data conveyed over the network in accordance with a parallel file transfer process wherein ones of the plurality of portions are transferred substantially simultaneously in multiple data streams.
- the method also includes generating a reconstructed file of biological sequence data by reconstructing the at least one file of biological sequence data using the plurality of portions of the at least one file of biological sequence data.
- the at least one file of biological sequence data is then stored within a data repository.
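The receive-and-reconstruct steps above can be sketched as follows. This is a minimal illustration, not the GeneTorrent protocol itself: the portion size, the `(offset, chunk)` portion layout, and the thread pool standing in for multiple network streams are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
import hashlib


def split_into_portions(data: bytes, portion_size: int) -> list[tuple[int, bytes]]:
    """Cut a file's bytes into (offset, chunk) portions (hypothetical layout)."""
    return [(off, data[off:off + portion_size])
            for off in range(0, len(data), portion_size)]


def transfer(portion: tuple[int, bytes]) -> tuple[int, bytes]:
    """Stand-in for sending one portion over its own data stream."""
    offset, chunk = portion
    return offset, bytes(chunk)  # a real stream would cross the network here


def reconstruct(portions: list[tuple[int, bytes]], total_size: int) -> bytes:
    """Rebuild the file by writing each portion back at its offset."""
    buf = bytearray(total_size)
    for offset, chunk in portions:
        buf[offset:offset + len(chunk)] = chunk
    return bytes(buf)


original = b"ACGT" * 1000 + b"TTGACA"
portions = split_into_portions(original, portion_size=512)

# Several portions are in flight at once, one per worker.
with ThreadPoolExecutor(max_workers=4) as pool:
    received = list(pool.map(transfer, portions))

received.reverse()  # simulate out-of-order arrival across streams
rebuilt = reconstruct(received, len(original))
digest_matches = (hashlib.sha256(rebuilt).digest()
                  == hashlib.sha256(original).digest())
```

Carrying the offset with each portion is what lets the receiver reassemble the file regardless of arrival order; the digest comparison is one simple way a repository could confirm the reconstructed file before storing it.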
- the disclosure relates to a method which includes receiving, over one or more data links of a biological data network, a plurality of biological data units containing encoded genomic information wherein the encoded genomic information represents genomic information encoded relative to a reference sequence.
- the method further includes processing the plurality of biological data units and storing the plurality of biological data units within a memory unit.
- the disclosure pertains to a method which includes establishing a data repository containing encoded genomic information and biological information relating to the encoded genomic information. The method further includes generating a plurality of data units containing the encoded genomic information and the biological information. The plurality of data units are then transferred to a subscriber device over a network.
- the disclosure is also directed to a method which includes receiving a plurality of data units from one or more data links of a biological data network wherein each of the plurality of data units includes a payload representative of encoded genomic information and a header representative of biological information relating to the encoded genomic information.
- the method also includes storing the plurality of data units within a data repository.
- the disclosure pertains to a method which includes receiving, from over a network, an encrypted data unit containing encoded genomic information wherein the encoded genomic information represents genomic information encoded relative to a reference sequence using a plurality of instructions.
- the method also includes decrypting the encrypted data unit using a subscriber key so as to generate a decrypted data unit and storing the decrypted data unit within a memory.
- the disclosure relates to a genome storage repository including a data repository containing encoded genomic information and biological information relating to the encoded genomic information.
- the genome storage repository includes a receive interface for receiving, from over a network, a processing request from an analysis node.
- the genome storage repository further includes a controller operative to process, in response to the processing request, at least the genomic information in accordance with an analysis program in order to generate analysis results.
- the genome storage repository may further include a transmit interface configured to transmit the analysis results over the network to the analysis node.
- the receive interface may be further configured to receive the analysis program from the analysis node.
- the disclosure relates to a method which includes establishing a data repository containing encoded genomic information and biological information relating to the encoded genomic information.
- the method also includes receiving, from over a network, a processing request from an analysis node.
- the method includes processing, in response to the processing request, at least the genomic information in accordance with an analysis program in order to generate analysis results.
- the method may also contemplate transmitting the analysis results over the network to the analysis node and receiving the analysis program from the analysis node.
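A rough sketch of this request/response pattern is shown below. The class and field names are hypothetical, and a Python callable stands in for the analysis program received from the analysis node; the disclosure does not specify how such a program is packaged or executed.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProcessingRequest:
    analysis_node: str               # hypothetical node identifier
    sample_ids: list[str]            # which stored sequences to process
    program: Callable[[str], dict]   # analysis program supplied by the node


class Repository:
    """Toy repository that runs analysis next to the stored data."""

    def __init__(self, sequences: dict[str, str]):
        self._sequences = sequences

    def handle(self, request: ProcessingRequest) -> dict:
        # Run the supplied program on each requested sequence and return
        # only the analysis results, not the sequence data itself.
        results = {
            sid: request.program(self._sequences[sid])
            for sid in request.sample_ids
            if sid in self._sequences
        }
        return {"to": request.analysis_node, "results": results}


def gc_content(seq: str) -> dict:
    """Example analysis program: fraction of G and C bases."""
    return {"gc_fraction": (seq.count("G") + seq.count("C")) / len(seq)}


repo = Repository({"s1": "GGCC", "s2": "ATAT", "s3": "ACGT"})
reply = repo.handle(ProcessingRequest("node-A", ["s1", "s3"], gc_content))
```

Running the program at the repository means only the (typically small) analysis results travel back over the network rather than the full sequence files.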
- FIG. 1 illustratively represents a genome sequence data network incorporating a high capacity, high throughput network-based genome storage repository (GSR).
- FIG. 2 illustrates a first exemplary implementation of a genome sequencing center (GSC) configured to operate in a biological data network.
- FIG. 3 depicts a second exemplary implementation of a GSC configured to operate in a biological data network.
- FIG. 4 illustrates an exemplary implementation of a network-based genome storage repository.
- FIG. 5 depicts a codec schema representative of various encoding, decoding, encryption, decryption and transcoding operations which may be effected within a biological data network.
- FIG. 6 shows one manner in which a distributed conditional access system (DCAS) may be employed for the management of access to the data within a biological data network.
- FIG. 7 illustratively represents the incorporation of a distributed conditional access system (DCAS) within an alternative data network.
- FIG. 8 illustrates one manner in which the encode/decode and encrypt/decrypt schema described with reference to FIGS. 1-7 may be utilized to mediate genomic-based transactions among various users of a biological data network.
- FIG. 9 is a flowchart of an encoding and encryption process which may be employed within a biological data network.
- FIG. 10 illustrates a comparative sequence analysis process used to minimize apparent biological differences between a reference and a sample sequence entry.
- FIG. 11 is a flowchart of an alternate encoding and encryption process capable of being employed within a biological data network.
- FIG. 12 provides a high-level view of the architecture of a GeneTorrent™ system configured to enable a cluster of servers to transfer parallel streams of file information to a user system.
- FIGS. 13-18 illustrate exemplary operation of one embodiment of a Transactor.
- FIGS. 19A-19B illustrate an exemplary GeneTorrent™ upload operation.
- FIGS. 20A-20B provide an illustration of a secure GeneTorrent™ download workflow between client-side GeneTorrent™ data consumers and various server-side components.
- FIG. 21 illustrates an exemplary software architecture of a system capable of providing GeneTorrent™ file transfer capability.
- FIG. 22 illustrates an exemplary system architecture capable of supporting the software architecture of FIG. 21.
- FIG. 1 illustratively represents a genome sequence data network 100 incorporating a high capacity, high throughput network-based genome storage repository (GSR) 110.
- the network may also be referred to herein as a bIQ network.
- the GSR 110, which contains genomic sequence data and related information, is in network communication with one or more genome sequencing centers (GSCs) 114, one or more genome data analysis centers (GDACs) 116, and one or more subscriber systems 120.
- In an exemplary embodiment such network communication is designed to take place over one or more existing wide area networks, such as the Internet.
- the GSR 110 may function as a central repository for the GSCs 114 to store, and GDACs 116 to retrieve, sequence data and associated metadata.
- a typical workflow scenario involving the network 100 may begin with submission of a tissue sample to a GSC 114 or associated institution for preparation of genome analyte.
- the workflow continues with DNA/RNA sequencing and characterization by a GSC 114 and upload of the resultant sequence data and related information to the GSR 110.
- the sequence data produced by the GSC 114 is in a BAM format or other conventional format and is transferred to the GSR 110 using the GeneTorrent™ techniques described in the above-referenced provisional patent application nos. 61/539,942 and 61/662,996.
- the received BAM files may be encoded into the bIQ format described hereinafter and in the above-referenced patent applications.
- the bIQ-formatted data may then be downloaded to subscriber systems 120 using GeneTorrent™ techniques or otherwise made available for further processing by one or more genome data analysis centers (GDACs) 116.
- the GSR 110 also synchronizes with a data coordination center (DCC) 124 or equivalent system configured to provide the primary coordination portal for researchers or other personnel involved with a particular research initiative, project or commercial endeavor.
- the applicable DCC 124 maintains the higher-level study attributes and clinical data associated with each tissue sample.
- the GSR 110 will query the applicable DCC 124 to verify that submitted data is associated with a valid sample.
- the DCC 124 can also retrieve catalog information from an external source and allow users to perform queries across project, sample and sequence data.
- the GSC 114 uploads the resultant sequence data and associated metadata to the GSR 110 and may transfer other metadata, e.g., Sample and Data Relationship Format (SDRF) metadata, to a project data portal provided by the DCC 124.
- the GSR 110 will generally synchronize information, and otherwise coordinate closely, with the one or more DCCs 124 respectively providing coordination portals for various projects or groups of researchers.
- each of the DCCs 124 maintains the higher-level study attributes associated with at least one such project as well as clinical data associated with each sample.
- the GSR 110 will query the appropriate DCC 124 to verify that data submitted by a GSC 114 is associated with a valid sample.
- some or all of the DCCs 124 may retrieve catalog information in order to enable users at the GDACs 116 to perform queries across project, sample and sequence data.
- queries from GDACs 116 will be received through a portal or other interface established by the GSR 110.
- the repository 110 consults an external user authentication database (not shown) in connection with authorization of users for uploading, downloading, and/or querying of sequence information.
- users may be authorized for different roles with respect to different projects coordinated by the DCCs 124.
- a unique ID ("UUID") is assigned to each aliquot of the tissue samples provided to or processed by a particular GSC 114.
- the UUID may, for example, be included within anonymized metadata associated with each physical aliquot sample and electronically transmitted by the GSC 114 to the DCC 124.
- metadata may include, for example, information identifying the tissue source site, sample type, analyte type, patient ID, and other information characterizing the sample or the facilities/equipment used to obtain the sample.
- the DCC 124 then creates a new sample record based upon this metadata, which is associated with the UUID corresponding to the aliquot.
- This metadata can then be retrieved from the DCC 124 through, for example, a web interface which may or may not be provided by a data portal of the DCC 124.
- the GSC 114 to which the sample is provided will perform sequencing and thereby generate BAM file(s), or other files of predefined type, containing the resultant sequence information.
- the GSC 114 then defines an analysis object ("Analysis object"), which in one embodiment includes a metadata file and the BAM file(s) corresponding to the metadata.
- the GSC 114 also assigns a UUID to the Analysis object.
- An upload client (described below) at the GSC 114 then initiates the sequence submission process by passing a user certification/session token and the submission metadata to the GSR 110 for validation. If validation is successful, the GSR 110 will create a database entry for the Analysis object and each of its constituent BAM files. As is discussed below, the GSR 110 will then track the status of the submission as it moves from loading, through any validation or transfer errors, until it is ready for download by a subscriber system 120.
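One way to picture the status tracking described above is as a small state machine. The state names and allowed transitions below are assumptions chosen for illustration; the source only says that a submission moves from loading, through any validation or transfer errors, until it is ready for download.

```python
from enum import Enum


class Status(Enum):
    # Hypothetical state names; the source does not enumerate them.
    LOADING = "loading"
    VALIDATION_ERROR = "validation_error"
    TRANSFER_ERROR = "transfer_error"
    READY = "ready_for_download"


# Assumed transition table: errors can be retried, READY is terminal.
ALLOWED = {
    Status.LOADING: {Status.VALIDATION_ERROR, Status.TRANSFER_ERROR, Status.READY},
    Status.VALIDATION_ERROR: {Status.LOADING},
    Status.TRANSFER_ERROR: {Status.LOADING},
    Status.READY: set(),
}


class SubmissionRecord:
    """Database-entry stand-in for one Analysis object."""

    def __init__(self, analysis_uuid: str):
        self.analysis_uuid = analysis_uuid
        self.status = Status.LOADING
        self.history = [Status.LOADING]

    def advance(self, new_status: Status) -> None:
        if new_status not in ALLOWED[self.status]:
            raise ValueError(f"{self.status.name} -> {new_status.name} not allowed")
        self.status = new_status
        self.history.append(new_status)


record = SubmissionRecord("example-analysis-uuid")
record.advance(Status.TRANSFER_ERROR)  # a stream failed mid-upload
record.advance(Status.LOADING)         # upload client retries
record.advance(Status.READY)           # now visible to subscriber systems
```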
- each metadata file may include references to the UUIDs corresponding to all of the sequence data files (e.g., BAM files or other sequence data files of predefined type) and aliquots linked to the bio-specimen data (i.e., data related to the initial tissue sample) maintained within the DCC 124.
- this information may be included within a separate file which is independently provided by the GSC 114 to the GSR 110 as part of the sequence submission process.
- the GSR 110 may then verify that these UUIDs correspond to valid UUIDs stored within the DCC 124 before creating a corresponding submission record and UUID corresponding to each Analysis object (and potentially each individual BAM file of the Analysis object) to be uploaded.
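The verification step can be sketched as a set comparison between the UUIDs referenced in the submission metadata and those known to the DCC. The metadata keys (`analysis_uuid`, `aliquot_uuids`, `bam_files`) and the shape of the returned record are hypothetical.

```python
import uuid


def validate_submission(metadata: dict, dcc_known_uuids: set[str]) -> dict:
    """Check referenced aliquot UUIDs against the DCC, then build a record.

    `metadata` is a hypothetical parsed form of the submission metadata file.
    """
    unknown = set(metadata["aliquot_uuids"]) - dcc_known_uuids
    if unknown:
        raise ValueError(f"unknown aliquot UUIDs: {sorted(unknown)}")
    return {
        "analysis_uuid": metadata["analysis_uuid"],
        # Optionally give each individual BAM file its own UUID as well.
        "files": {name: str(uuid.uuid4()) for name in metadata["bam_files"]},
    }


aliquot = str(uuid.uuid4())
known_at_dcc = {aliquot}

submission = {
    "analysis_uuid": str(uuid.uuid4()),
    "aliquot_uuids": [aliquot],
    "bam_files": ["tumor.bam", "normal.bam"],
}
record = validate_submission(submission, known_at_dcc)
```

A submission that references an aliquot the DCC has never registered is rejected before any submission record is created.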
- the sequence data associated with a given submission may be suppressed, and new sequence data can be submitted for the same sample. This may occur with respect to cases in which, for example, it is desired to "top off" a previous submission with more complete coverage.
- the GSR 110 maintains a list of "valid" bio-specimens (e.g., tissue samples) for a particular project and regularly synchronizes this list to corresponding information maintained at the corresponding DCC 124. This enables the sequence information corresponding to a particular sample to be redacted at the GSR 110 in response to information received from the DCC 124. For example, if the owner of a particular tissue sample at some point revokes consent relating to the download of sequence information derived from the sample, such sequence information could be redacted at the GSR 110.
- the metadata information associated with such redacted sequence information could be searched in response to queries submitted by subscriber systems 120 and/or GDACs 116, but the associated, redacted sequence information would not be available for download.
- only users of a subscriber system 120 having a certain authorization or subscription level would be permitted to download sequence information corresponding to metadata identified in response to a query received from such a system 120; that is, such sequence information would appear to be redacted or otherwise suppressed or unavailable when identified in metadata returned in response to queries received from unauthorized users.
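The redaction and authorization behavior described above might look like the following in outline. The `Entry` fields, the integer subscription level, and the method names are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    sample_id: str
    metadata: dict
    sequence: str
    redacted: bool = False
    required_level: int = 0   # hypothetical subscription level


class Catalog:
    def __init__(self, entries: list[Entry]):
        self.entries = {e.sample_id: e for e in entries}

    def sync_with_dcc(self, valid_sample_ids: set[str]) -> None:
        # Samples no longer listed as valid by the DCC are redacted.
        for e in self.entries.values():
            e.redacted = e.sample_id not in valid_sample_ids

    def search(self, **criteria) -> list[str]:
        # Metadata stays searchable even when the sequence is redacted.
        return [e.sample_id for e in self.entries.values()
                if all(e.metadata.get(k) == v for k, v in criteria.items())]

    def download(self, sample_id: str, user_level: int):
        e = self.entries[sample_id]
        if e.redacted or user_level < e.required_level:
            return None   # appears unavailable to this user
        return e.sequence


catalog = Catalog([
    Entry("s1", {"tissue": "lung"}, "ACGTACGT", required_level=1),
    Entry("s2", {"tissue": "lung"}, "TTGACATT"),
])
catalog.sync_with_dcc({"s1"})   # consent for s2 has been revoked
hits = catalog.search(tissue="lung")
```

After the synchronization, both samples still match the metadata query, but only `s1` can be downloaded, and only by a user whose level is at least 1.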
- the GSC 114 may utilize a high-speed, parallelized file transfer process to transfer the BAM file(s) associated with the Analysis object to the GSR 110.
- the BAM file(s) are encrypted using a key specific to the particular session in which the file(s) are transferred.
- the associated metadata, which will generally be included within an encrypted file of inconsequential size relative to the size of the Analysis object, may then be separately sent to the GSR 110 using a conventional file transfer process.
- the encrypted BAM file(s) are decrypted and the sequence data included therein is encoded into the bIQ format for storage, typically together with all or part of the metadata.
- a substantially similar or identical high-speed, parallelized file transfer process may then be used to communicate the encoded sequence data and related metadata of interest to the requesting subscriber system 120.
- the encoded sequence data and related metadata is encrypted using both a key specific to the particular session in which the transfer occurs and a key unique to the requesting subscriber system 120.
- FIG. 2 illustrates a first exemplary implementation of a GSC 114.
- One or more high-speed sequencing machines 202 are operative to generate sequence reads, which are then aligned and mapped to a reference sequence in alignment / mapping module 206. Variants may also be called.
- the module 206 produces BAM files comprised of sequence alignment data; that is, binary versions of sequence alignment/mapping (SAM) files.
- the BAM files produced by the module 206 are provided to an input interface 210 of a processing module 220.
- a processor 224 operates to store the received BAM files along with related metadata within a file storage unit 228 and executes an encryption module 240 to encrypt this information using a key associated with, for example, a particular data transfer session.
- the processor 224 executes the instructions of a GeneTorrent™ upload client 230 to transfer the BAM files within the file storage unit 228 to the GSR 110 via a network interface 236.
- the metadata stored within the file storage unit 228, which will typically be only a small fraction of the size of the associated BAM files, is transferred to the GSR 110 using conventional network transmission techniques.
- FIG. 3 depicts a second exemplary implementation of a GSC 114.
- one or more high-speed sequencing machines 302 are operative to generate sequence reads, which are then provided to an input interface 310 of a processing module 320.
- a processor 324 operates to store the received sequence data reads along with related metadata within a storage unit 326.
- the processor 324 executes the instructions of an encoding module 336 in order to encode each sequence read (i.e., segment of biological sequence data) stored within the storage unit 326 into a formatted biological data unit comprised of a header and a payload (such format also being referred to herein as the bIQ format).
- each biological data unit may be representative of or contain an encoded representation of a segment of biological sequence data.
- this encoded representation comprises a set of instructions which are at least implicitly defined relative to a reference sequence 338.
- the header of each biological data unit may include biological or other information relating to the encoded information included within or represented by its payload.
- this header information includes information stored within one or more layered data tables 340.
- the header information may include DNA-related information included within one or more DNA layer tables 342, RNA-related information included within one or more RNA layer tables 344, protein-related information included within one or more protein layer tables 346, or information from other layer tables 350.
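The header-plus-payload structure of a biological data unit can be illustrated with a simple container type. The byte layout used here (two big-endian length fields, a JSON header, then the raw payload) and the header field names are assumptions made for the sketch; the source does not define the bIQ wire format in this passage.

```python
import json
import struct
from dataclasses import dataclass

_LENGTHS = ">II"  # assumed prefix: header length, payload length


@dataclass
class BiologicalDataUnit:
    header: dict     # layered annotation (DNA, RNA, protein, other)
    payload: bytes   # encoded representation of a sequence segment

    def pack(self) -> bytes:
        head = json.dumps(self.header, sort_keys=True).encode("utf-8")
        return struct.pack(_LENGTHS, len(head), len(self.payload)) + head + self.payload

    @classmethod
    def unpack(cls, blob: bytes) -> "BiologicalDataUnit":
        head_len, body_len = struct.unpack_from(_LENGTHS, blob, 0)
        start = struct.calcsize(_LENGTHS)
        header = json.loads(blob[start:start + head_len].decode("utf-8"))
        payload = blob[start + head_len:start + head_len + body_len]
        return cls(header, payload)


unit = BiologicalDataUnit(
    header={
        "dna": {"chromosome": "chr7", "start": 55019017},   # hypothetical fields
        "rna": {"expressed": True},
        "protein": {},
    },
    payload=b"\x01\x02\x03\x04",
)
wire = unit.pack()
restored = BiologicalDataUnit.unpack(wire)
```

Keeping the annotation in a self-describing header lets a node inspect or route a unit without decoding its payload.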
- the processor 324 stores the biological data units comprising encoded sequence information and related metadata within a file storage unit 328 and may execute an encryption module 332 to encrypt the biological data units using a key associated with, for example, a particular data transfer session.
- the processor 324 may operate upon the sequence reads received from the input interface 310 to create biological data units substantially simultaneously with storing such reads within the storage unit 326.
- the processor 324 further executes the instructions of a GeneTorrent™ upload client 330 to transfer the biological data units within the file storage unit 328 to the GSR 110 via a network interface 360 in the manner described below.
- a GSC 114 may be configured to transfer, using a GeneTorrent™ upload client, either BAM files or encoded sequence information (i.e., biological data units) to the GSR 110 to enable distribution of the subject genomic information to subscriber systems. It should be appreciated that in embodiments in which the subject genomic information is encoded into biological data units at a GSC 114, an encoding process similar or identical to that described with reference to FIG. 3 may occur at the GSR 110. This approach is described below with reference to FIG. 5.
- FIG. 4 depicts an exemplary implementation of the GSR 110.
- the GSR 110 is configured to receive BAM files and related metadata from the GSCs 114. That is, in the embodiment of FIG. 4 it is assumed that the sequence reads from the sequencing machines within the GSCs 114 are not being encoded into biological data units prior to being transmitted to the GSR 110. In embodiments in which such biological data units are generated at the GSC 114, it would be unnecessary to include a similar sequence encoding capability within the GSR 110.
- the GSR 110 includes an input interface 410 configured to receive the BAM files and related metadata transferred from a GSC 114.
- a processor 424 of the GSR 110 executes the instructions of a GeneTorrent™ application 430 disposed to interact with the GeneTorrent™ upload client executed at the GSC 114.
- GSR 110 includes a storage processor 425 operative to store the received BAM files along with the related metadata within a storage unit 426.
- the processor 424 executes the instructions of an encoding module 436 in order to encode each sequence read (i.e., segment of biological sequence data) stored within the storage unit 426 into a formatted biological data unit comprised of a header and a payload.
- the payload of each biological data unit may be representative of or contain an encoded representation of a segment of biological sequence data.
- this encoded representation comprises a set of instructions which are at least implicitly defined relative to a reference sequence 438.
- the header of each biological data unit may include biological or other information relating to the encoded information included within or represented by its payload.
- this header information includes information stored within one or more layered data tables 440.
- the header information may include DNA-related information included within one or more DNA layer tables 442, RNA-related information included within one or more RNA layer tables 444, protein-related information included within one or more protein layer tables 446, or information from other layer tables 450.
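To make the idea of a payload expressed as instructions relative to a reference sequence concrete, the sketch below encodes a sample as a list of "copy from reference" and "literal" instructions. This two-instruction set is an assumption for illustration and is not the instruction set of the bIQ format; the alignment comes from Python's standard `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher


def encode(reference: str, sample: str) -> list[tuple]:
    """Express `sample` as instructions relative to `reference`."""
    ops = []
    matcher = SequenceMatcher(None, reference, sample, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))            # reuse reference[i1:i2]
        elif tag == "delete":
            continue                                # reference span is skipped
        else:                                       # "replace" or "insert"
            ops.append(("literal", sample[j1:j2]))  # bases not in the reference
    return ops


def decode(reference: str, ops: list[tuple]) -> str:
    """Rebuild the sample by replaying the instructions in order."""
    parts = []
    for op in ops:
        if op[0] == "copy":
            parts.append(reference[op[1]:op[2]])
        else:
            parts.append(op[1])
    return "".join(parts)


reference = "ACGTACGTACGTACGTACGT"
sample = "ACGTACGAACGTACGTTTACGT"   # one substitution and one insertion
instructions = encode(reference, sample)
roundtrip = decode(reference, instructions)
```

Because most of a sample typically matches the reference, the instruction list is dominated by short `copy` entries, which is what makes reference-relative encoding compact.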
- the storage processor 425 stores the biological data units comprising encoded sequence information and related metadata within a file storage unit 428 and may execute an encryption module 432 to encrypt the biological data units using one or more encryption keys. For example, in one embodiment execution of the encryption module 432 effects encryption using both a key associated with a particular data transfer session and a key associated with the subscriber system to which the encrypted biological data units are being transferred.
- the processor 424 further executes the instructions of the GeneTorrent™ application 430 to transfer the encrypted biological data units within the file storage unit 428 to a GeneTorrent™ download client within the requesting subscriber system via a network interface 460.
- FIG. 5 depicts a codec schema 500 representative of the various encoding, decoding, encryption, decryption and transcoding operations which may be effected within the data network 100.
- the schema 500 includes an encoder 510 for performing an encode element 512 and an encrypt element 514 with respect to a file of sequence data 516.
- the file of sequence data 516 may be of, for example, a mapped format or a variants call format (VCF).
- the encoder 510 is representative of the encoding and encryption operations which may occur within a GSC 114 in a manner consistent with the present disclosure.
- the encoder 510 may align and map sequence reads to a reference sequence and call variants. During this first stage the data can be expected to be in many different formats and operated upon by several different versions of algorithms and analytical tools. In one embodiment the sequence data that is generated and processed by the encoder 510 is not yet accessible to other components of the data network 100 or to other biological networks in communication therewith.
- each biological data unit may include a header containing information relevant to the sequence information encoded within the payload of the biological data unit.
- the headers of each biological data unit may comprise layers of annotation and other information and may effectively function as tags for the sequence information included within the files 516.
- Metadata may also be directly embedded with the sequence data included within the payloads of biological data units to enhance and facilitate data processing operations elsewhere within the network 100.
- the schema 500 further includes a network-based distributor 520 configured to receive encrypted and encoded files or segments of sequence data for distribution to requesting subscribers.
- the distributor 520 may, for example, be representative of the functionality implemented within an exemplary implementation of the GSR 110.
- the distributor 520 includes a receive element 522 for receiving the encrypted and encoded sequence data transmitted by the encoder 510 over a network.
- a decrypt element 524 decrypts the encrypted and encoded sequence data and provides the unencrypted result to a storage element 526 for storage within the distributor 520.
- a retrieve element 528 cooperates with the storage element 526 to retrieve the encoded sequence information corresponding to the request or query.
- An encrypt element 530 then encrypts the retrieved, encoded sequence information prior to transmission over a network to the requesting decoder 540.
- this encryption is performed using a first encryption key associated with the data transfer session in which the encoded sequence information is transmitted and a second encryption key specific to the requesting decoder 540.
- each decoder 540 includes a decrypt element 544 and a decode element 542 for decrypting and decoding, respectively, the encoded and encrypted sequence information received from the distributor 520. As a result of these operations each decoder 540 produces a file 560 of sequence data corresponding to a reconstructed version of one of the files of sequence data 516 provided to the encoder 510.
- Each decoder 540 may, for example, be representative of the functionality implemented within an exemplary implementation of a subscriber system 120.
- the schema 500 also includes a transcoder 570 having a transcode element 572.
- the transcoder 570 is operative to add data to, or associate additional data with, the encoded sequence information managed by the storage element 526 of the distributor 520.
- additional data may be created as a consequence of processing the encoded sequence data within the distributor 520 using analysis programs or tools provided to the distributor by the transcoder 570.
- data may comprise new knowledge from analysis of the encoded sequence data conducted at the transcoder or new information added from network metadata analysis.
- the results of the processing initiated by the transcoder 570 may be returned to the transcoder 570 for storage.
- the transcoder 570 may, for example, be representative of the functionality implemented within an exemplary implementation of a GDAC 116.
- the schema 500 further includes a data manager 580 configured with an entitlement element 582 and a catalog 584.
- the entitlement element 582 receives authorization information 586 and is responsible for enforcing conditional access control throughout the network 100. That is, the entitlement element 582 regulates access to the information within the files 516 distributed throughout the network 100.
- conditional access control effected by the entitlement element 582 is distributed among the elements of the network 100.
- This distributed approach may be desirable in view of the nature of the sequence data and metadata being conditionally accessed during the execution of transactions involving such information.
- data may include sensitive or other preferably private information concerning individuals associated with sequence information potentially available throughout the network 100 and throughout systems linked to the network 100.
- a distributed approach to regulating access to such sensitive information may be advantageous since data access may be controlled at multiple points within the network 100.
- FIG. 6 illustratively represents the incorporation of a distributed conditional access system (DCAS) within the network 100.
- users on the network 100 are authenticated using a system of highly distributed conditional access points. This may involve, for example, using an encoder to perform high speed pattern matching in a manner that is consistent with the standardized compression and encryption format. The encoder is able to efficiently couple these two processes together for best compression with highest security.
- the data may be formatted in such a way that it can be used in a standard compression and encryption format that is consistent with all GSCs approved for medical and pharmaceutical grade sequencing.
- a distributed conditional access system may be employed for the management of access to the data within the network 100.
- Such access may be based on, for example, a combination of qualifications including, without limitation, a consent requirement, medical or health alerts, analytical reports, updating the data with current findings and social reports.
- conditional access of the sequence data may be effected at each and every transaction point. Development of such a common format could, for example, be based upon input provided by various agencies, individuals and institutions.
- digital rights management will be mediated by the DCAS.
- the general specifications of rights management could be developed to be consistent with, for example, regulatory guidelines set by a genotype and phenotype expert group or other organization. For example, such guidelines may specify those authorized to access the stream of germline variants versus those authorized to access somatic variants files.
- One aspect of such guidelines could address an individual's rights with regard to genome sequence data, while another aspect could focus upon gene differential expression from RNA-Seq data.
- the common format will preferably be optimized to encode and encrypt this data and will provide guidelines to regulate transmission and storage of this highly sensitive data.
- the encryption scheme should involve granularity to the extent where access to any component of the data can be filtered and regulated to the nth degree in order to enable various levels of user accessibility.
- the disclosed system provides for the highest level of privacy by utilizing an approach to access control that is highly-distributive and easily regulated at nearly every transaction point.
- conditional access control functionality should be present at the GSC 114 where sequence data is produced.
- the particular GSC 114 generates genome sequences for many different research groups, consortiums, research projects, clinics, pharma and individuals and all of this data will be sent to different places.
- the various data consumers will have different levels of access to the dataset.
- a typical scenario might involve a case where one GDAC 116 is entitled to view all sequence variants, somatic and germline combined while another might be entitled to access somatic mutations only.
- an encrypted content key, Key c may be generated for one set of genome sequence data files and separate subscriber keys, Key s , generated for subscribers having different levels of entitlement to access the data.
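- By way of illustration only, the entitlement scenario described above (one GDAC entitled to all variants, another to somatic mutations only) might be sketched as follows. The names `Variant`, `ENTITLEMENTS` and `filter_variants` are hypothetical and do not appear in the disclosure; the sketch shows only the filtering decision, not key handling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    position: int
    ref: str
    alt: str
    origin: str  # "somatic" or "germline"


# Hypothetical entitlement table: subscriber id -> variant origins it may view.
ENTITLEMENTS = {
    "gdac-a": {"somatic", "germline"},
    "gdac-b": {"somatic"},
}


def filter_variants(subscriber, variants):
    """Return only the variants the subscriber is entitled to view."""
    allowed = ENTITLEMENTS.get(subscriber, set())  # unknown subscribers see nothing
    return [v for v in variants if v.origin in allowed]


variants = [
    Variant(100, "A", "G", "germline"),
    Variant(250, "C", "T", "somatic"),
]
```

In this sketch a subscriber absent from the table receives an empty result, which reflects a deny-by-default assumption rather than anything stated in the disclosure.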
- the data might be sent directly from a GSC 114 and post processed to a GSR 110.
- the source of this data will require access from multiple subscribers and different types of results will be published to several orders of magnitude more destinations.
- the GSR 110 may be equipped with DCAS as a main transaction point for regulation of queries, subscriptions, publishing and function requests.
- the genome data transaction system provides a sequence data validation service which uses network-wide data coordination protocols.
- FIG. 7 illustratively represents the incorporation of a distributed conditional access system (DCAS) within an alternative data network 700.
- the network 700 is similar to the network 100, but includes multiple network-based genome storage repositories 110 linked to a data coordination center (DCC) 710.
- subscriber systems are disposed to query and interface with a GDAC 116 rather than with a GSR 110.
- FIG. 8 an illustration is provided of one manner in which the encode/decode and encrypt/decrypt schema described with reference to FIGS. 1-7 may be utilized to mediate genomic-based transactions among various users of the network 100. As was described with reference to FIGS.
- data that is presented to a GSC 114 in a BAM format or any other format capable of being encoded into the bIQ format may be efficiently transmitted to a remote location within the network 100.
- One advantage of the bIQ format is that the data can be operated upon in the compressed format, which obviates the need for conversion between a format suitable for compression and one optimized for processing and/or data security.
- the GSC 114 receives various aliquots of highly purified analytes containing preparations of genomic and mitochondrial DNA and RNA. Using several different sequencing platforms, DNA-Seq and RNA-Seq data is generated.
- the GSC 114 will generally store the raw sequence reads in the format of the platform or machine generating the reads (e.g., within BAM files).
- the GSC 114 will also typically store the metadata for such platform or machine, information relating to the operator, the date of the sequence run, and other related information. This metadata information can be incorporated into biological data containing the compressed and encoded sequence data, which are then generally encrypted prior to being transmitted from the GSC 114 to the GSR 110 or, in other embodiments, to a GDAC.
- the encoder device utilized in the GSC 114 may be comprised of hardware and software configured in a manner that is capable of processing BAM files at the rate of the stream.
- the encoder preferably matches dictionary word patterns and uses a compression and encryption scheme that enables secure transmission and entitlement management of the transactions that involve this data.
- the codec model of FIG. 5 provides a mechanism for commercial transactions involving transmission, exchange and analysis of genome sequence data.
- a doctor that is treating a cancer patient that is a difficult case can simply order a genome data analysis report (GDA Report) using a process that is similar to ordering a blood chemistry report today.
- the doctor may order a whole-genome sequence data analysis.
- the entire process can be mediated by the system described herein, with contractual relations involving the various entities being indicated in the outer layer of FIG. 6.
- the workflow of FIG. 6 may be summarized as follows: • An oncologist experiences difficulty treating a cancer patient
- Tissue sample is taken at a biospecimen core resource (BCR) facility
- GSC genome sequencing center
- the preprocessed data is compared against other preprocessing data and metadata to ensure the highest quality processing
- GDACs genome data analysis centers
- the analysis that is carried out at the GDAC may involve access to the patient's EMR or relevant information from EMR
- GDACs may also have access to certain relevant drug interaction databases to generate highest quality GDA reports
- the present approach enables a high level of data protection and coordination from the time of a doctor's decision that a patient's genome data and other molecular markers may be relevant to the treatment of the patient.
- This scheme provides a first-in-kind mechanism to offer a genomic data electronic transaction model with state of the art entitlement management system.
- the information that is contained in a full genome data analysis report along with information from the EMR and the various metadata can be used to populate a semantic database and linked with other data such as but not limited to research publications, off-label drug data from pharmaceutical companies, drugs in the development pipeline, upcoming drug trials, communications between experts and other such related information of any type that might be relevant to the case.
- the organization of the various types of data will allow meaningful usefulness of the vast amount of data that can be integrated into a medical decision making process.
- sequence data 904 is encoded/compressed into the payloads 910 of biological data units (stage 908).
- Biological and other information 912 specific to the sequence information is used as the basis of headers 914 inserted into such biological data units (stage 916).
- each header 914 includes information relating to a different layer of a biological data model associated with the sequence information.
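- As a minimal sketch of a biological data unit having layered headers and an encoded payload, the following may be considered. The 2-bit-per-base packing, the `BiologicalDataUnit` class and the helper names are assumptions made for illustration; the disclosure does not specify the payload encoding.

```python
from dataclasses import dataclass

# Assumed 2-bit code per base; not specified by the disclosure.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def pack_bases(seq):
    """Pack an A/C/G/T string into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | BASE_TO_BITS[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        out.append(byte)
    return bytes(out)


def unpack_bases(data, length):
    """Invert pack_bases; length trims the padding of the final byte."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BITS_TO_BASE[(byte >> shift) & 3])
    return "".join(bases[:length])


@dataclass
class BiologicalDataUnit:
    headers: dict   # hypothetical: layer name -> annotation for that layer
    payload: bytes  # packed sequence data
    length: int     # number of bases encoded in the payload


def make_unit(seq, headers):
    return BiologicalDataUnit(dict(headers), pack_bases(seq), len(seq))
```

Keeping one header entry per layer (for example DNA, RNA and protein) mirrors the layered data model described above while leaving the content of each layer open.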
- the type of information that could be embedded and interwoven with the sequence data in the form of headers 914 could include, for example and without limitation, information concerning genotype, gene expression levels, methylation, micro RNA interactions, drug response, clinical, environmental and any such related annotation specific to the sequence files. See, for example, U.S. Patent Application Serial No. 13/223,071, entitled “METHOD AND SYSTEMS FOR PROCESSING POLYMERIC SEQUENCE DATA AND RELATED INFORMATION"; U.S. Patent Application Serial No. 13/223,077, entitled “METHOD AND SYSTEMS FOR PROCESSING POLYMERIC SEQUENCE DATA AND RELATED INFORMATION"; U.S. Patent Application Serial No.
- analysis may review new variants risk correlations data, or drug efficacy and response data.
- in the case of the former, it may be useful to have that layer of information packaged with the sequence data because of its specificity and pertinence, and because it might be information that is referenced regularly.
- in the case of the latter, it might be more reasonable and efficient to include the detailed drug data as well as other drug relationships in a linked drug database.
- a content encryption key (Key c ) 920 is generated based upon the content of the data and a subscriber key (Key s ) 924 is separately generated for each of the authorized users that have permission to access the data.
- the Key c may be derived from a package of the header 914 and the sequence data of the biological data unit 930 being transferred (e.g., the sequence data included within the received BAM or VCF file).
- the biological data unit 930 comprised of the header 914 and payload 910 is encrypted using the Key c (stage 934) and is further encrypted using the Key s (stage 938).
- the encrypted biological data unit 930 and public content key Key c are transmitted from the GSC 114 to the GSR 110 and/or GDAC 116, and subsequently to a user of a subscriber system 120 (e.g. to a researcher, doctor, patient, etc.).
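- The two-stage encryption described above (a content key derived from the packaged header and sequence data, followed by a subscriber key) may be sketched as follows. This is an illustration under stated assumptions only: the SHA-256 based XOR keystream is a toy stand-in chosen so the example needs nothing beyond the standard library, it is not secure, and it is not the cipher of the disclosure. All function names are hypothetical.

```python
import hashlib
import json


def keystream_xor(key, data):
    """Toy cipher: XOR data with SHA-256(key || offset) blocks. NOT secure."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        block = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        out.extend(b ^ k for b, k in zip(data[offset:offset + 32], block))
    return bytes(out)


def derive_content_key(header, payload):
    """Derive Key c from a package of the header and the sequence data."""
    packaged = json.dumps(header, sort_keys=True).encode() + payload
    return hashlib.sha256(packaged).digest()


def encrypt_unit(header, payload, subscriber_key):
    key_c = derive_content_key(header, payload)
    unit = json.dumps(header, sort_keys=True).encode() + b"\n" + payload
    inner = keystream_xor(key_c, unit)            # corresponds to stage 934
    outer = keystream_xor(subscriber_key, inner)  # corresponds to stage 938
    return key_c, outer


def decrypt_unit(key_c, subscriber_key, ciphertext):
    inner = keystream_xor(subscriber_key, ciphertext)
    unit = keystream_xor(key_c, inner)
    header_bytes, payload = unit.split(b"\n", 1)  # header JSON has no newline
    return json.loads(header_bytes), payload
```

A practical implementation would presumably substitute an authenticated cipher from a vetted cryptographic library for `keystream_xor`; only the layering of the two keys is the point of the sketch.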
- a biologically-intelligent nucleic acid sequence compression data format capable of being used in, for example, the bIQ network described above with reference to FIGS. 1-9.
- this data format is specifically designed for highly efficient compression, processing, movement, transmission and security of large volumes of DNA and RNA sequences.
- a dictionary approach is used to generate a reference sequence and to then determine the delta between this sequence and the sequence(s) being compressed.
- biological knowledge is integrated into the compression scheme using operation codes. For example, insertions and deletions that may represent thousands of bases can be represented by a single opcode instruction.
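- For illustration, a reference-plus-opcode representation of the kind described above might be decoded as in the following sketch. The opcode names (`SUB`, `INS`, `DEL`) and tuple layout are hypothetical and are not the opcodes of the bIQ format; the sketch shows only that a deletion of thousands of bases can be carried by a single instruction.

```python
def apply_opcodes(reference, opcodes):
    """Rebuild a sample sequence from a reference and a list of opcodes.

    Hypothetical opcode tuples, sorted by 0-based reference position:
      ("SUB", pos, base)   replace the reference base at pos
      ("INS", pos, bases)  insert bases before reference position pos
      ("DEL", pos, count)  skip count reference bases starting at pos
    """
    out = []
    cursor = 0  # next reference position not yet consumed
    for op, pos, arg in opcodes:
        out.append(reference[cursor:pos])  # copy the unchanged stretch
        if op == "SUB":
            out.append(arg)
            cursor = pos + 1
        elif op == "INS":
            out.append(arg)
            cursor = pos
        elif op == "DEL":
            cursor = pos + arg
        else:
            raise ValueError(f"unknown opcode: {op}")
    out.append(reference[cursor:])  # copy the tail of the reference
    return "".join(out)
```

Only the differences are stored per sample, so the cost of a sample grows with the number of variant events rather than with the length of the sequence.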
- the bIQ format disclosed herein and in the co-pending patent applications referenced herein facilitates the integration of knowledge concerning a sequence into its representation in order to improve compression and meaningful processing of the data. For instance, a base at any given position in a sequence can be substituted by any of the other three bases. However, in every case of a base substitution one of the 3 options has a significantly different biological impact than the other two.
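- The observation that one of the three possible substitutions differs in kind from the other two corresponds to the standard distinction between a transition (purine to purine, or pyrimidine to pyrimidine) and a transversion. A minimal sketch follows; the function name is an assumption made for illustration.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref, alt):
    """Classify a single-base substitution as a transition or a transversion."""
    if ref == alt:
        raise ValueError("ref and alt are identical; not a substitution")
    pair = {ref, alt}
    same_class = pair <= PURINES or pair <= PYRIMIDINES
    return "transition" if same_class else "transversion"
```

For any reference base exactly one alternative yields a transition and the remaining two yield transversions.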
- the bIQ network facilitates the transmission of, for example, the sequence data generated by DNA and RNA sequencing processes. These sequencing processes generate files of various file formats including, for example, the BAM, CRAM and VCF format. In one embodiment the bIQ network is capable of receiving sequence data in any of these file formats.
- Sequence Alignment/Map (SAM) files are the precursors of BAM files, which are essentially a binary version of SAM files.
- the SAM file that is generated from a sequencing run is a TAB-delimited ASCII format consisting of an optional header section and a telemetric sequence data section for the raw read sequences streaming from the sequencing machine.
- the header information that is associated with BAM files is typically attached at the head of the sequence data.
- the lines in the header start with a '@' sign, while alignment lines do not.
- '@HD' is usually the first line in the BAM files to indicate the start of the header lines in the file.
- the '@HD' line of the header will usually have an information field for the version number of the file format being used (VN) as well as the sorting order of the alignments (SO).
- the coordinates for alignments are keyed and sorted by the reference sequence name field (RNAME) as well as the base position field (POS).
- the next set of lines in the BAM header are usually the lines that represent the reference sequence dictionary, which are the lines that contain the information that defines the alignment sorting order of the BAM file. These lines are indicated by a '@SQ' line. Each of these lines has six information fields.
- the first field in the @SQ line is the (SN) which is the field that contains the reference sequence name.
- Each line in the file should have a different identifier for this field.
- This is an information field in the header of BAM files whose value is used in the RNAME and RNEXT fields of the alignment records, and it is a major coordinate sort key.
- X, Y and Mito are some of the tags that are used in this field.
- the balance of the information fields in the @SQ line include the reference sequence length (LN), the URI for the sequence file (UR), the identification of the genome assembly that was used (AS), the MD5 checksum without spaces (M5) and the species that the sequence maps to (SP). It is interesting to note that Epstein-Barr virus is one of the species sequenced in the current example.
- the next line in the header is the read group, indicated by '@RG', which includes several information fields. Much of the sequence machine metadata can be associated with these header lines.
- these lines include an identification number (ID). If there are multiple read group lines in the BAM file header then each line should have a unique id number.
- the @RG line includes the sequencing technology or platform (PL) that was used to generate the sequence. This may include but is not limited to Illumina, SOLiD, IONTORRENT, PACBIO and others.
- the platform unit (PU) is a unique identifier for the actual unit used.
- the reference sequencing library that is used to calibrate the analyte concentration is found in the library field, denoted by LB, and the date and time of the run are indicated by DT.
- the sample identifier and the genome sequencing center are indicated by SM and CN, respectively.
- the program lines in the header, indicated by '@PG', contain a program identification field (ID). Multiple program lines may exist in the BAM header and each would require a unique program ID.
- the program name (PN), command line (CL) and version number (VN) fields might also be included on this header line.
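- The header structure described above (TAB-delimited lines beginning with '@', each carrying TAG:VALUE fields) can be read with a short routine such as the following sketch. The function name and the sample values are illustrative assumptions; the record types and tags shown (@HD, @SQ, @RG, @PG, VN, SO, SN, LN, ID, PL, SM, CN, PN) are those discussed above.

```python
def parse_sam_header(lines):
    """Group SAM header lines by record type (HD, SQ, RG, PG)."""
    header = {}
    for line in lines:
        if not line.startswith("@"):
            break  # alignment lines do not start with '@'
        record_type, *fields = line.rstrip("\n").split("\t")
        if record_type == "@CO":
            continue  # comment lines carry free text, not TAG:VALUE pairs
        # Split on the first ':' only, since values such as URIs contain colons.
        tags = dict(field.split(":", 1) for field in fields)
        header.setdefault(record_type[1:], []).append(tags)
    return header


# Illustrative header; the values are made up for the example.
sam_lines = [
    "@HD\tVN:1.4\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:248956422",
    "@SQ\tSN:chrM\tLN:16569",
    "@RG\tID:rg1\tPL:ILLUMINA\tSM:sample1\tCN:GSC",
    "@PG\tID:aligner1\tPN:aligner\tVN:0.1",
    "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
]
header = parse_sam_header(sam_lines)
```

Because several @SQ, @RG and @PG lines may be present, each record type maps to a list, and the uniqueness of SN and ID values noted above could be checked over those lists.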
- there are a number of considerations relevant to the compression of nucleic acid sequence information including, without limitation, footprint, processing feasibility, efficient movement between memory elements, transmission or network and security.
- CRAM is a new and efficient method for raw DNA sequence data storage using reference-based compression.
- This reference sequence based compression technique would likely be suitable and sufficient if sequence variation were limited to single nucleotide polymorphisms. In that case, all sequence entries would be identical length and compression, multiple sequence alignments, comparative sequence analysis and processing would be a lot easier to handle.
- although the CRAM method uses a reference for compression, it should be appreciated that the reference is suboptimal in that it is only used to compress on the order of 70% of the generated sequence reads. Moreover, the algorithm is lossy in that some read sequences are not compressed or encoded whatsoever.
- VCF Variant Call Format
- VCF files are usually stored in a compressed format that can be indexed for fast and efficient random access to data when retrieving information on variant alleles from any position on the reference genome.
- In order to interrogate these files, a stack of software called “VCFtools” is used to implement various utilities for processing including, for example, for slicing, merging, inter-leaving, performing format validation, comparing, annotating and performing basic statistical correlations.
- VCFtools and the genome analysis toolkit (GATK) developed by The Genome Sequencing and Analysis Group (GSA) in Medical and Population Genetics at the Broad Institute also provide a general Perl and Python API.
- the VCF file is comprised of a header and a body section. Both file types are reference-based, which is instrumental for navigating the base sequence data. However, whereas the focus of BAM files is to capture a substantial amount of information concerning the sequencing of a sample, the VCF file concentrates on the differences between the reference and sample sequence.
- the header is flexible and extendable with regards to the type and amount of metadata it contains. VCF files are highly-annotated to the extent that they may apply to a particular variant, as a whole or to each genotype. In addition to genotypic annotations, others that are commonly used may include filters, genotype quality score, genotype likelihoods, dbSNP membership, haplotype data, ancestral allele, mobile element information, read depth, mapping quality and other such related information.
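- The header-and-body organization of a VCF file may be illustrated by the following minimal reader. It is a sketch only: it handles the eight fixed columns and simple INFO entries, the function name is hypothetical, and the sample record is made up for the example.

```python
def parse_vcf(lines):
    """Split VCF text into meta-information lines and variant records."""
    meta, columns, records = [], [], []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            meta.append(line[2:])            # header: meta-information
        elif line.startswith("#"):
            columns = line[1:].split("\t")   # header: column names
        elif line:
            record = dict(zip(columns, line.split("\t")))
            record["POS"] = int(record["POS"])
            info = {}
            for item in record["INFO"].split(";"):
                key, _, value = item.partition("=")
                info[key] = value if value else True  # bare keys are flags
            record["INFO"] = info
            records.append(record)
    return meta, records


vcf_lines = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\trs1\tA\tG\t50\tPASS\tDP=14;DB",
]
meta, records = parse_vcf(vcf_lines)
```

In the sample record `DP` carries read depth and the `DB` flag marks dbSNP membership, two of the annotation types mentioned above.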
- FIG. 10 illustrates a comparative sequence analysis process 1000 used to minimize apparent biological differences between a reference and a sample sequence entry.
- a source database is selected having sequence entries all within the same species.
- one entry within the database is selected (stage 1020).
- the bIQ compression algorithm is then executed by applying the reference sequences against the source database (stage 1030).
- A dictionary compression scheme is then executed in order to identify features which may be used to update the selected reference sequence and thereby enable higher compression of the sequence entries (stage 1040).
- stage 1040 may involve executing the compression algorithm to create a variants profile for each of the entries within the database and analyzing the resulting variants file. Such an analysis could include, for example, determining if the majority of the entries within the database have the same sequence polymorphisms.
- the selected sequence entry may have a nucleotide base that is an "A" at a particular location, but the majority of the entries may instead have a "G" at the specified location.
- the resulting variants data would indicate a transition instruction at that location (as opposed to a transversion which would result in a T or C substitution).
- the selected reference sequence is updated with the result of the data analysis described above. For example, in the scenario described above a "G" would be placed at the specified position.
- stages 1020, 1030 and 1040 are repeated until it is determined (in stage 1050) that further updating of the reference sequence is unlikely to yield further improvements in compression. This may be determined by, for example, comparing the current reference sequence to the dictionary entries and determining whether any changes to the reference would enhance compression performance. That is, the reference sequence will essentially be reduced to a sequence having a minimum number of mutations or structural variants.
- modifications may be made to the type of information that is collected and maintained in the headers of these sequence files (e.g., BAM and VCF sequence data formats).
- the sequence that is used to calibrate the data need not be selected from one of the entries. It could simply be generated or initially assigned by looking at the common entry for each of the positions. For example, if at position 100 more than 50% of the entries have a C then the reference should have a C at that position. In order to develop the minimum reference sequence, substitute a C for recursive optimization of the ideal sequence used for referencing. Doing this for the most common variants would find that the ideal minimum sequence would generate a highly-compressed database of mapped and unmapped raw reads.
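- The majority rule described above (a base is assigned to the reference where more than 50% of the entries share it) may be sketched as follows. The sketch assumes the entries are already aligned and of equal length, and it falls back to a supplied starting sequence where no base has a majority; both the fallback behavior and the function name are assumptions.

```python
from collections import Counter


def consensus_reference(entries, fallback):
    """Build a reference taking, at each position, the base held by >50% of entries."""
    result = []
    for i in range(len(fallback)):
        counts = Counter(entry[i] for entry in entries)
        base, count = counts.most_common(1)[0]
        # Strict majority required; otherwise keep the starting reference base.
        result.append(base if count * 2 > len(entries) else fallback[i])
    return "".join(result)
```

Each position at which the majority base replaces the starting base removes one substitution from every entry in that majority, which is the sense in which the updated reference improves compression.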
- Two sequences that are compared have similarities and differences that can become intimately involved in operation coding of DNA sequence data.
- one sequence as relates to the other allows for one entry to serve as the control reference sequence. This provides an opportunity to use this method to compress the relative differences using biological instructions.
- FIG. 11 a flowchart is provided of an alternate encoding and encryption process 1100 capable of being employed within the network 100.
- the process 1100 employs the dictionary compression and reference sequence modification techniques described above.
- sequence reads are received from a next-generation sequence machine.
- An optimized reference sequence is then generated from these sequence reads (stage 1120).
- a dictionary is created (stage 1130).
- biological data units are encoded, assembled and stored within a GSR (stage 1140).
- the stored biological data units are then encrypted (stage 1150) prior to being transferred to a subscriber system (stage 1160).
- deletions or insertions can be applied to the selected minimum reference sequence as an updated version for improved compression.
- a premature termination codon PTC
- a specific control reference sequence based on a minimum delta value may be selected, and then a dictionary may be generated from the resulting dataset. For example, all the minor variant alleles in the BRCA1 gene (not limited to any one gene) that correlate with all known clinical and pharmacological effects can be used in a dictionary scheme.
- Each mutation event within each sample entry that results in a phenotypic effect, as well as silent mutations that are common in several entries, can be placed in a dictionary using this approach for further compression of the sequence data.
- the algorithm is able to take advantage of specific difference values from the references that are common to multiple entries.
- sequence files generated by a GSC 114 may be securely transferred to the GSR 110 in parallel fashion through the GeneTorrentTM data transfer application.
- this application is instantiated as the GeneTorrentTM application 430 installed on the GSR 110 and the GeneTorrentTM upload client 230 installed on a GSC 114.
- the GeneTorrentTM application 430 cooperates with a GeneTorrentTM download client installed on a subscriber system 120.
- GeneTorrentTM upload client 230 cooperate to effect submission of a set of one or more sequence data files (e.g., BAM files) to the GSR 110.
- effecting such a submission involves adding the submission to one or more catalogs maintained by the GSR 110 and/or DCC 124, verifying the associated metadata to be uploaded, storing and indexing the metadata for search, storing the sequence data in replicated persistent storage within the GSR 110, and setting access rules based on, for example, consent agreements associated with the tissue samples from which the sequence data files are derived.
- GeneTorrentTM download client within a subscriber system 120 cooperate to retrieve a bundle of one or more sequence data files from the GSR 110.
- retrieving a sequence data file from the GSR 110 includes verifying the requesting user is authorized to view the data within the file, storing the sequence data in local persistent storage at the subscriber system, and verifying that the transfer was performed correctly.
- the actual transfers of the sequence data files are preferably authenticated (i.e., only users associated with the appropriate permissions relative to the file may access its sequence data) and authorized (i.e., only users authorized in view of project- specific or other rules maintained by the GSR 110 and/or DCC 124 are permitted to download the identified sequence data file).
- Such transfers are also preferably secured in that the sequence data is strongly encrypted when transiting the network and reliable (i.e., files may be presumed to have been transferred essentially intact and uncorrupted unless the GeneTorrentTM application provides an indication to the contrary).
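- The disclosure does not state how a client verifies that a transfer was performed correctly. One common approach, shown here purely as an assumption, is to compare a digest of the received bytes against a digest published with the file; the function names are hypothetical.

```python
import hashlib
import io


def stream_digest(stream, chunk_size=1 << 20):
    """Compute a SHA-256 hex digest of a binary stream, reading in chunks."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def transfer_intact(expected_digest, received_stream):
    """Return True when the received data matches the expected digest."""
    return stream_digest(received_stream) == expected_digest
```

Reading in fixed-size chunks keeps memory use flat, which matters for sequence files that may be many gigabytes in size.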
- each GeneTorrentTM client provides a command line interface to the end user. Through this interface one of two operating modes typically may be invoked: upload and download. When operative in upload mode, the GeneTorrentTM client operates in concert with the GeneTorrentTM application 430 to upload files to the GSR 110. When operative in download mode, the GeneTorrentTM client and the GeneTorrentTM application 430 cooperate to download files to the client from GSR 110. In addition, the GeneTorrentTM application 430 may enter an "actor" mode during which multiple GeneTorrentTM server instances are created for use in performing parallel transfers to/from the GSR 110.
- the GeneTorrentTM application 430 executes on one or more application processors to manage file transfers to/from GeneTorrentTM clients at GSCs 114 and to/from GeneTorrentTM clients at GDACs 116.
- multiple GeneTorrentTM server processes executing on the application processors listen for download requests, and multiple GeneTorrentTM upload actor instances are spawned when an upload request is received from a GSC 114 (or, in certain cases, from a GDAC 116).
- application server instances (“AppServer Instances") executing on the application processors may be configured as either GeneTorrentTM upload actor instances or GeneTorrentTM download actor instances.
- the allocation of AppServer Instances among GeneTorrentTM upload and download actor instances may be made in accordance with, for example, the number and type of upload and download requests received from peer GeneTorrentTM instances at the GSCs 114 and GDACs 116. For example, during periods in which a higher number of download requests are received from GDACs 116 relative to the number of upload requests from GSCs 114, more of the AppServer Instances executing on the application processors may be configured as GeneTorrentTM download actor instances. Conversely, more of the AppServer Instances executing on the application processors may be configured as GeneTorrentTM upload actor instances during times in which a relatively larger number of upload requests are received.
- the system dynamically load balances across the application processors to allocate capacity for multiple upload and download processes, allowing it to better respond to the normal fluctuations in GSC and GDAC workflows. Moreover, performance with respect to a particular GeneTorrentTM upload or download session may be enhanced by allocating a relatively larger number of GeneTorrentTM actor instances to such process.
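- The demand-driven allocation of AppServer Instances between upload and download actor instances might, for example, follow a simple proportional rule such as the one sketched below. The rule, the function name and the tie-handling are assumptions; the disclosure states only that allocation may follow the number and type of requests received.

```python
def allocate_instances(total, upload_requests, download_requests):
    """Split AppServer instances between upload and download actors by demand.

    Returns (upload_instances, download_instances). Assumes total >= 2.
    """
    pending = upload_requests + download_requests
    if pending == 0:
        upload = total // 2  # no demand: split evenly
        return upload, total - upload
    upload = round(total * upload_requests / pending)
    # Keep at least one instance for each direction that has pending work.
    if upload_requests and upload == 0:
        upload = 1
    if download_requests and upload == total:
        upload = total - 1
    return upload, total - upload
```

Re-running the allocation as request counts change gives the dynamic rebalancing behavior described above.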
- Analysis objects are the primary container for submitting and downloading sequence data.
- Each Analysis object may include one or more binary sequence Alignment/Mapping (BAM) files and is associated with an XML metadata file.
- the payload of each BAM file contains both the sequencing data (in bases, quality scores, and read names produced by the sequencing instrument) and read placements with annotations about strand, alignment, and quality features.
- Raw sequence read files, such as .srf files, can also be submitted along with the BAM files.
- each data submission includes a file of submission metadata compliant with the SRA 1.3 XML schema.
- when making a new data submission, a user will create and save a user authentication key via an authentication Web page hosted by or in association with the GSR 110.
- the user may then invoke an application executed by the GSC 114 to create a unique identifier (UUID) to associate with the Analysis object. Assigning a UUID to the Analysis object ensures that the submission can be subsequently uniquely identified relative to all other submissions provided to the GSR 110.
- the user may then create a directory at the GSC 114 and copy the XML metadata file (e.g., "analysis.xml") and sequence data files relating to the Analysis object into the directory.
- sequence data files may include additional files of type other than BAM, such as legacy formats or proprietary formats containing raw read data.
- the RNA-seq raw read data could be submitted along with the alignment data in the BAM.
- these additional files will be uploaded, stored and downloaded along with the BAM file as part of the same Analysis object.
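As a non-limiting sketch of the submission preparation described above, the following Python fragment generates a UUID for an Analysis object and copies the metadata file and sequence data files into a new directory. The function name `stage_submission` and the choice to name the directory after the UUID are assumptions made for illustration only.

```python
import shutil
import uuid
from pathlib import Path


def stage_submission(base_dir, metadata_file, data_files):
    """Create a submission directory and copy the Analysis object's files into it."""
    # A random (version 4) UUID identifies the Analysis object.
    analysis_id = str(uuid.uuid4())
    # Assumption: the directory is named after the UUID.
    submission_dir = Path(base_dir) / analysis_id
    submission_dir.mkdir(parents=True)
    # Copy the XML metadata file (e.g., analysis.xml) and every sequence data file.
    for src in [metadata_file, *data_files]:
        shutil.copy2(src, submission_dir / Path(src).name)
    return analysis_id, submission_dir
```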
- the GSR 110 maintains a list of users permitted to upload new submission sequence and metadata. This list may be maintained by, for example, an out-of-band interaction between personnel representing each GSC 114 and operations staff of the GSR 110.
- sequence data may be further constrained by applicable project consent authorization constraints. For example, consents from owners of sequence data relating to those users eligible to download such data may be received by the GSR 110 in one or more files on a regular (e.g., daily) basis.
- the GSR 110 may then update one or more internal authorization tables to reflect any changes.
- each file of sequence data within the GSR 110 is associated with a project coordinated by the DCC 124 through the identifier (e.g., UUID) assigned to the biospecimen from which the sequence data file was derived.
- the GSR 110 may receive this tag as part of the sequence data submission process.
- the GSR 110 may then confirm with the DCC 124 that the identifier is valid.
- the DCC 124 may also provide information on whether the sample has been redacted.
- uploading of a new submission of sequence-related data generally involves several operations.
- the user at the applicable GSC is authenticated and the submission "package" of files to be uploaded is validated.
- the Analysis object with associated metadata is added to a repository catalog associated with one or both of the applicable DCC 124 and the GSR 110.
- the set of one or more sequence data files included within the submission package is then transferred to the GSR 110.
- the correctness of the transfer may then be verified, and its legitimacy may be confirmed with reference to information maintained within the DCC 124.
- the upload process is then generally concluded by setting appropriate authorizations for access to the information within the new Analysis object.
- a user will typically transfer a plurality of files related to sequencing of a sample to the GSR 110.
- these files, which are all associated with the same Analysis object, may include one or more XML files containing metadata about the sequence data files of interest.
- the Analysis object may, but need not, also include one or more sequence data files (e.g., BAM files) associated with the metadata.
- the GeneTorrentTM client 230 will locate all of the sequence data file(s) (e.g., BAM file(s)) listed in the analysis.xml file within the directory created during the submission stage.
- the GeneTorrentTM client 230 will connect to an API provided by the GSR 110 and pass a GeneTorrentTM object file ("GTO"), which is used by a GTO ExecutiveTM subsystem to initiate the upload.
- the GTO ExecutiveTM subsystem will identify the address of the upload user and generate the required digital certificates.
- the GTO ExecutiveTM subsystem will spawn multiple GeneTorrentTM upload actor instances, which will begin uploading a first of the one or more sequence data files listed in the analysis.xml file.
- the GeneTorrentTM upload client 230 then segments the file and begins parallel file transfer sessions of the file pieces over SSL.
- the GeneTorrentTM protocol will manage transmission errors on any of the file pieces and will reassemble the file at the GSR 110.
- the GSR 110 will perform a series of validation steps prior to making the data available for download.
- these steps may include, for example, computing the MD5 checksum and comparing it against the value in the XML metadata file, verifying the name of the transferred sequence data file matches the name in the XML metadata file, and validating that the headers of the transferred sequence data file match the header information in the XML metadata file.
- the DCC 124 will be queried to determine if the sample is valid and is in an active state (e.g. has not been redacted). If the sample cannot be found, the state will be set to "verifying sample”. If the sample is found, but has been redacted, the state will be set to "suppressed”. In both cases, the GSR 110 will periodically poll the DCC 124 to see if the state has changed.
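The post-transfer checks and the sample-state rule described above may be sketched, purely as an example, in the following way. The function names, the shape of `sample_record`, and the state label "live" for an active sample are hypothetical; the disclosure names only the "verifying sample" and "suppressed" states. Header validation is omitted from this sketch.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 checksum of a file without loading it all into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def validate_transfer(path, expected_name, expected_md5):
    """Return a list of problems; an empty list means both checks passed."""
    problems = []
    if Path(path).name != expected_name:
        problems.append("file name mismatch")
    if md5_of_file(path) != expected_md5.lower():
        problems.append("MD5 mismatch")
    return problems


def sample_state(sample_record):
    """Map a DCC lookup result to a repository state.

    sample_record is None when the DCC does not know the sample, otherwise
    a dict with a boolean "redacted" entry (an assumed shape).
    """
    if sample_record is None:
        return "verifying sample"
    if sample_record.get("redacted"):
        return "suppressed"
    return "live"  # hypothetical label for a valid, active sample
```

In both non-active states the repository would call `sample_state` again after each periodic poll of the DCC.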
- a user may issue a metadata-related query to the GSR 110.
- queries are directed to the DCC 124.
- the user may specify values for one or more metadata attribute fields within the query.
- the GSR 110 may respond with zero, one, or more URIs referencing Analysis object(s) having metadata matching the specified attribute values.
- a doctor with access to the content-aware bIQ network uses one integrated system that is capable of monitoring and coordinating all of these different data types.
- the process of coordinating the data is obviated by the content-aware network.
- it may currently require several months to institute desired changes to a file containing genomic sequence data (e.g., an update to the header of a BAM file at a data coordination center).
- the not-yet-coordinated data sits in a staging area and is not accessible to interested users at a GDAC.
- the bIQ network enables coordination of networked genomic data in a number of different ways. For example, changes to a reference sequence, or modification of the format that is used to store and transmit the sequence data, can be easily facilitated by the bIQ network.
- a network user with relevant algorithms at a GDAC may wish to send a query to find how many of the subjects with the ApoE marker had been treated with a particular drug for a different illness involving overlapping biological pathways. This might be an off-label drug that could be highly effective for treating a certain type or stage of Alzheimer's.
- the metadata that is available on the network should be useful for making statistical corrections to determine confidence in any correlations found with MCI scores or brain images (MRI, PET, etc.). All of this data will be distributed across the network, and results are aggregated to publish a result.
- Another level of correlation of this data may exist when DNA and RNA are prepared at various BCRs by different technicians, sequenced at different GSCs on different platforms, and mapped and variant-called by different tools; correlation analysis can then be done to establish a standard of quality. For example, are certain machine errors increased at a certain GSC at certain times of the day or when a particular technician is working?
- data that is stored on the bIQ network is partitioned in a manner that is consistent with maintaining the highest level of privacy of data.
- the network may be configured to permit individuals to be able to give dynamic consent to anyone requesting access to their molecular expression and genomic sequence data that is kept at a GDR.
- an individual's data might be stored at a GDR and each query request for access to that particular set of files would alert the owner of the data (the patient).
- the owner can grant access to the data using several different bIQ network compatible devices including but not limited to a cell phone.
- Privacy is also enhanced by the manner in which the relevant data will be compressed and encrypted for transmission and to facilitate other transactions. For example, certain data that is intended to stay private can be encrypted and compressed in a manner that is consistent with generating different levels of privatized genomic variants data.
- data on the network can be accessed and processed by moving applications to the stored data rather than by moving the data from storage or otherwise copying the data.
- data can be accessed or information about the data can be conditionally accessed by network queries by authorized users.
- the privacy of the data can be controlled, partitioned and filtered based on many features including but not limited to the type of variant (SNPs versus indels versus copy number variations versus chromosomal rearrangements).
- alternative splicing variants, triplet expansions, repeat sequence, methylation profile and other related types of modification or variants data may reveal non-obvious genotypic or phenotypic information that should be kept private.
- the bIQ network may be configured to permit a specific given set of minor alleles to be accessible to one set of users but not to other users. There may even exist a scenario where certain regions of the genome are requested by the genome owner and/or subject to remain private from everyone including the owner and/or subject.
- a mechanism is created to coordinate the validation process.
- Such a mechanism would involve a means to synchronize the sequence data content, information in the header, and the various sources of metadata collected at the various steps in the work-flow of the molecular data.
- the data that is generated at a BCR is stored in files with metadata information that relates directly to the type of biological specimen that is being used: organ type, tissue type or cell type, for example.
- TCP transmission control protocol
- a "peer-to-peer” network of computers harnesses the bandwidth and computational power of the computers participating in the network. This contrasts with conventional "client-server” approaches, in which computing power and bandwidth are concentrated in a relatively small number of servers.
- Such peer-to-peer networks may facilitate the transfer of files through a set of connections established between participating peers.
- BitTorrent is a popular file distribution program currently used in peer-to-peer networks.
- a peer within a BitTorrent system may be any computer running an instance of a client program implementing the BitTorrent protocol.
- Each BitTorrent client is capable of preparing, requesting, and transmitting any type of computer file over a network in accordance with the BitTorrent protocol.
- BitTorrent is designed to enable distribution of large amounts of data without consuming correspondingly large amounts of computational and bandwidth resources.
- the peer distributing a data file generally treats the file as being comprised of a number of identically-sized pieces, usually with byte sizes of a power of 2, and typically between 32 kB and 16 MB each.
- the peer creates a hash for each piece, using the SHA-1 hash function, and records the hash value in the torrent file.
- the hash of the piece is compared to the recorded hash to test that the piece is free of errors.
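The piece-hashing and verification steps just described can be illustrated with the short sketch below. The function names and the in-memory `bytes` input are assumptions chosen for brevity; a real client would read pieces from disk.

```python
import hashlib


def piece_hashes(data, piece_length):
    """Split data into fixed-size pieces and return the SHA-1 digest of each piece."""
    return [
        hashlib.sha1(data[offset:offset + piece_length]).digest()
        for offset in range(0, len(data), piece_length)
    ]


def verify_piece(index, piece, hashes):
    """Check a received piece against the digest recorded for its index."""
    return hashlib.sha1(piece).digest() == hashes[index]
```

A receiving peer would call `verify_piece` on each piece as it arrives and request the piece again if the check fails.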
- Peers that provide a complete file are called "seeders”, and the peer providing the initial copy of the file may be called the "initial seeder”.
- the tracker maintains records of which peers are “seeds” (i.e., a peer having the complete file(s) being distributed) and of the other peers in the applicable "swarm” (i.e., the set of seeds and peers involved in the distribution of the file(s)). During the distribution process peers periodically report information to the tracker and request and receive information concerning other peers to which they may connect.
- Users interested in obtaining a file or files using BitTorrent may, using a web browser installed on a local machine, navigate to a website listing the torrent and download it. Once downloaded, the torrent may be opened in a BitTorrent client stored on the local machine. Once the torrent is opened, the BitTorrent client establishes a connection with the tracker. At this point the tracker provides the BitTorrent client with a list of peers currently downloading the file or files of interest.
- the disclosed GeneTorrentTM high-speed file transfer system utilizes a tracker to enable a plurality of peers to cooperatively distribute a file of interest.
- the GeneTorrentTM system incorporates a Transactor which is integrated within or otherwise operates in conjunction with the tracker.
- the Transactor is a program which operates to immediately identify and make a record of those clients (i.e., actors) which request a certain file of interest (e.g., "file X").
- the Transactor will also generally be configured to determine the authentication and entitlement of each actor based on authorization rules and using a secure key distribution scheme.
- the GeneTorrentTM approach effectively "parallelizes" the transfer of file information and reduces the burden on the initial seed or seeds of file X. Moreover, the use of parallel streams within the GeneTorrentTM system minimizes the effect of a multiplicative decrease in the speed of any one stream resulting from the characteristics of TCP discussed above. Thus, use of the GeneTorrentTM approach may reduce the likelihood of bottlenecks developing around overburdened seed servers in connection with the transfer of very large data files.
- one major rate limiting step involves the ineffectiveness on the part of the research community to make decisions as a group to set the highest standards for the genomic data space. For example, it may be particularly important to develop standards around the quality of service that is used to touch genomic, transcriptomic, proteomic and other large volumes of omics data. Moreover, this type of biological data will typically require extreme security considerations. The dataset might contain genotypic and phenotypic information that could have profound effects on an individual if it is breached.
- user-level authentication and dataset authorization are performed before peers can initiate a GeneTorrentTM transfer.
- a GeneTorrentTM peer desiring to initiate a transfer first contacts a control component (hereinafter also termed a "GT Exec") on a GeneTorrentTM repository and passes the user credentials.
- the GT Exec authenticates that the credentials have not expired and correspond to a known user.
- the GT Exec may then further verify that the user is authorized to perform the requested action (e.g., upload or download) on the specific data files identified in the request.
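One possible way to express the GT Exec checks described above (credentials unexpired, user known, action authorized on the specific files) is sketched below. The `Credential` and `GtExec` classes, their fields, and the permission table layout are hypothetical and are offered only as an example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Credential:
    user: str
    expires_at: datetime


@dataclass
class GtExec:
    known_users: set
    # Assumed layout: (user, action) -> set of file identifiers the user may act on.
    permissions: dict

    def authorize(self, credential, action, file_ids, now=None):
        """Return True only if every check passes for every requested file."""
        now = now or datetime.now(timezone.utc)
        if credential.expires_at <= now:
            return False  # credentials have expired
        if credential.user not in self.known_users:
            return False  # credentials do not correspond to a known user
        allowed = self.permissions.get((credential.user, action), set())
        return set(file_ids) <= allowed
```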
- GeneTorrentTM enables multiple sending actors to transfer file pieces to multiple receiving actors over parallel streams.
- parallel M:N transfers avoid many of the bottlenecks that occur in 1:1 transfers, such as issues with disk I/O, CPU utilization, large bandwidth-delay products in the WAN, and other side-effects of transferring very large data sets. Error detection and error recovery are built in, automatic and very robust.
- the protocol is content-agnostic, allowing data file formats to evolve without impacting the underlying transfer mechanisms.
- the protocol scales asymmetrically - M:N transfers for high volume producers and 1:N downloaders for the periodic user are supported by the same application.
- the protocol is capable of saturating the available network bandwidth and reacts well to dynamic changes in the congestion levels in the transport network.
- the GeneTorrentTM protocol is capable of rapidly and efficiently transmitting large biological sequence data files.
- a dynamic encryption key distribution system is integrated into the file transfer system to facilitate secure transfers. This allows for encryption of data in multiple layers that can be controlled using a hierarchical entitlement scheme.
- one GTO file can be encrypted in a format that allows for multiple downloaders to access different layers of data from the files.
- FIG. 12 provides a high-level view of the architecture of a GeneTorrentTM system configured to form multiple instances of a GTO file so as to enable a cluster of servers to transfer parallel streams of file information to a user system.
- integrated within or layered "on top of" the architecture of FIG. 12 is a highly secure encryption system.
- an actor first locates a Torrent file describing the target data as an initial step in participating in a GeneTorrentTM parallel file transfer.
- a Torrent file may comprise a static "bencoded" dictionary including the Announce URL, an info dictionary, and other optional fields.
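For readers unfamiliar with bencoding, a minimal encoder for the standard BitTorrent encoding (integers, byte strings, lists, and dictionaries with sorted keys) is sketched below. The function name `bencode` and the example field values in the usage note are illustrative assumptions; the fields of an actual GeneTorrentTM Torrent file are not specified here.

```python
def bencode(value):
    """Encode ints, strings/bytes, lists and dicts using BitTorrent bencoding."""
    if isinstance(value, bool):
        raise TypeError("bencoding has no boolean type")
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        # Byte strings are written as <length>:<contents>.
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(item) for item in value) + b"e"
    if isinstance(value, dict):
        # Keys must be byte strings and appear in sorted order.
        encoded = {
            (key.encode("utf-8") if isinstance(key, str) else key): item
            for key, item in value.items()
        }
        body = b"".join(bencode(key) + bencode(encoded[key]) for key in sorted(encoded))
        return b"d" + body + b"e"
    raise TypeError(f"cannot bencode {type(value).__name__}")
```

For example, `bencode({"announce": "http://t/a", "info": {"length": 10}})` yields a dictionary whose first key is the Announce URL and whose second is the info dictionary.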
- GeneTorrentTM uses dynamic one-time Gene Torrent Object files to bootstrap a secure and encrypted file transfer based on bi-directionally authenticated SSL sessions.
- the Torrent file will generally be structured in order to accommodate the efficient transfer of very large files.
- the task of generating the SHA-1 hashes for all the "pieces" of a very large file would be computationally expensive and impose an unnecessary I/O burden on the local storage system.
- one or more seeders cache the torrent data for reuse.
- Each large data file will have an associated static Torrent file which will be stored in the same directory.
- This torrent file may comprise a "normal" Torrent, i.e., it may lack SSL certificate information.
- certificate information and any other additional data fields may instead be dynamically inserted into the Gene Torrent Object file at the time of a download request, thus creating a one-time-use Torrent with authentication keys.
- the Transactor at least partially enables a GeneTorrentTM system to transmit a file of interest in multiple parallel streams to a requesting entity.
- the Transactor clusters individual actors which have requested file X into a cast of participating actors comprising an affinity group.
- the Transactor may determine which actors are assigned to a particular cast based upon, for example, the file requested, the location of the file (i.e., with which actor(s) the file is currently stored), as well as the credentials of the actors requesting access to the file.
- the actor exchanges messages with other actors within the cast in order to determine and receive the portions of the file of interest currently possessed by the cast. That is, a requesting leecher actor is proactively directed to a feeder affinity group such that the leecher receives as much of the requested file as possible without, to the extent possible, incrementing the burden on the seed of file X.
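The clustering of requesting actors into casts can be pictured with the following simplified sketch, which groups actors by the file requested and admits only actors entitled to that file. The function name `build_casts` and the data shapes are assumptions; the location of the file, which the Transactor may also consider, is left out for brevity.

```python
from collections import defaultdict


def build_casts(requests, entitlements):
    """Group actors into casts keyed by requested file.

    requests: iterable of (actor, file_id) pairs in arrival order.
    entitlements: dict mapping actor -> set of file_ids the actor may access.
    """
    casts = defaultdict(list)
    for actor, file_id in requests:
        # Only actors whose credentials cover the file join its cast.
        if file_id in entitlements.get(actor, set()):
            casts[file_id].append(actor)
    return dict(casts)
```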
- FIGS. 13-18 illustrate exemplary operation of one embodiment of a Transactor.
- Actor 1 makes a first request to the GeneTorrentTM network for file X.
- the only copy of file X is stored at the Repository.
- file X could be a file in multiple locations when the request is made to download this file on the network.
- the GeneTorrentTM system can achieve multiple instances of the same file because of the key distribution used to certify copies of file X that it is from one original file.
- GeneTorrentTM can generate multiple instances of secure parallelized streams from a certified copy of file X from one or more actors to one or more actors.
- the line of authentication can be traced to one original copy.
- the Transactor adds the data certification on top of a "smart-tracker” that tracks not only who has which file but also biological and clinical knowledge about the files (BioIntelligence).
- the SmartTracker may track file specific information contained in these sequence data files on variants, gene expression, copy number variations as well as any disease that might be associated.
- the Transactor uses the SmartTracker, conditional access control and a robust encryption key distribution to assign high affinity actors to a cast based on the file X request and essentially on any field of information contained in the sequence data file.
- in addition to the Transactor assigning actors to a cast because they are all interested in a particular file X, the actors might be clustered based on information about the file.
- if file X is the genome sequence for an individual with disease Y, and if it is known that mutations in certain genes on chromosome 17 are associated with the disease, then the Transactor can be more effective in building out a well-defined affinity cast in the early stages of an impending transmission request load to limit any bottleneck.
- the GeneTorrentTM protocol provides security for biological sequence data in transit by running a well-established protocol over Secure Socket Layer (SSL) connections between the trusted GeneTorrentTM actors involved in the transfer of file pieces on the bIQ network.
- the SSL connections will be bi-directionally authenticated in the manner described below.
- the GeneTorrentTM client software runs on both the source system(s) and the Genome Data Repository.
- the web service interface (WSI) and Tracker run only on the repository systems and mediate the file transfers.
- the first step is for an exchange of digitally signed certificates to take place.
- the uploading source and the GDR have mutually authenticated by exchanging digitally signed certificates that can be traced to a trusted third party, i.e., an Internet CA.
- the certificates are specific to this Gene Torrent Object file and the file it represents, are immune to a replay attack, and are not subject to man-in-the-middle interception.
- the SSL connections will use the AES-128 cipher, which is a more robust (and FIPS-compliant) cipher than the RC4 cipher typically used.
- the use of SSL and the necessary certificate management is a novel and key advantage of GeneTorrentTM over public solutions.
- FIGS. 20A-20B provide an illustration of a secure GeneTorrentTM download workflow between the client-side GeneTorrentTM data consumers and the server-side WSI/Data Manager at GDR, Tracker and GeneTorrentTM actors.
- the network flow data that is available on the system can provide powerful statistical correlation data from comparative sequence data analysis. Consider a case where sequence variants data that is transmitted from various GSCs is monitored for quality assurance. Furthermore, the Tracker might be configured to track pieces of a .gto file to control duplication and distribution of this data.
- the proposed protocol will be able to scale to accommodate multiple server side GeneTorrentTM processes and download requests will be load balanced across processes to optimize server performance.
- the Load Balancer module will talk to the WSI over a specified network interface to receive GTO files for each download session. It will then place the files in the appropriate GeneTorrentTM Peer work queues based on system load and quality of service.
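As one hedged illustration of how the Load Balancer module might place a GTO file in a Peer work queue, the sketch below picks the queue with the fewest pending sessions. Using queue length as the sole measure of system load, and ignoring quality of service, are simplifying assumptions; the names `assign_session` and `queues` are hypothetical.

```python
def assign_session(queues, gto_name):
    """Append a GTO file to the least-loaded Peer work queue and return that queue's name.

    queues: dict mapping peer name -> list of pending GTO file names.
    Ties go to the peer that appears first in the dict.
    """
    target = min(queues, key=lambda name: len(queues[name]))
    queues[target].append(gto_name)
    return target
```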
- GeneTorrentTM Client: provides a CLI for secure, high-performance upload and download. Only download functionality will be integrated with the WSI for the POC; basic file upload will be provided in GeneTorrentTM clients and server processes for POC testing purposes. The client runs on POSIX workstations with sufficient storage and performance and will be installed at customer sites (GSCs / GDACs).
- the server Data Manager is run on the Repository application processor and is responsible for ensuring the integrity and security of the data and providing external interfaces for searching, uploading and downloading data.
- the Data Manager includes a SQL database with all of the sequence metadata, user information and system monitoring data.
- the POC will use a MySQL database, but this may be replaced with PostgreSQL or an alternate DB in other releases.
- External interfaces are implemented as RESTful Web Service Interfaces (WSI).
- the WSI uses Apache and a Solr search index.
- GeneTorrentTM Server Hub: in one embodiment, the GeneTorrentTM Server includes the Tracker and multiple Peer processes.
- the Repository Data Executive and the GeneTorrentTM Server Hub may be collectively referred to herein as the "GT Executive” or the “GT Exec”.
- FIG. 21 illustrates an exemplary software architecture of a system capable of providing GeneTorrentTM file transfer capability.
- FIG. 22 illustrates a corresponding exemplary system architecture.
- the network operating system for the GeneTorrentTM system functions in three (3) modes: upload, download, and seeder.
- Upload mode is used to upload Gene Torrent Object files to a Genome Data Repository.
- Download mode that is represented in the SW suite is used to download files from the various GDRs on the network.
- Seeder mode is a mode used within each GDR to create GeneTorrentTM server instances that seed data to download actors on the bIQ.
- analysis.xml file name may
- —download VARIABLE is a .gto file, a URI, or an XML
- although the GeneTorrentTM system may be controlled via a command-line in the manner discussed above, in other embodiments the GeneTorrentTM system may be indirectly controlled by a third party application and/or service.
- This form of interaction may be characterized as a form of "remote control" in that an entity external to the GeneTorrentTM system directs control of upload and download transfers.
- the external entity may reside on the same machine(s) as the GeneTorrentTM system components or it may reside in an entirely different network, operating in a command and control fashion from afar.
- the GeneTorrentTM system will be capable of ingesting files of any format containing genome and transcriptome sequence data and any additional metadata files that are associated. These files are validated, encoded and encrypted in order to maximize transmission rate.
- the GeneTorrentTM method may be applied to transfer very large files of biological sequence data along with files containing other data and information having a very specific relationship. It is the information in these files that is encrypted and configured in layers associated with a layered data model.
- all of the data that is associated with a whole genome or whole exome sequence data could be encrypted within the same layer with one or more keys.
- This information may include, without limitation, annotation data concerning functional regions of the genome, genes, promoters, repeat sequences, DNA methylations, SNPs, CNVs, structural variants including chromosomal rearrangements.
- a second layer of data would include gene expression data including data from splicing, RNA processing, mRNA-Seq and miRNA-Seq data.
- Another layer of encrypted data associated with the sequence files may include protein function assay results or protein level measurements.
- Other layers of encryption may include clinical test results and information on drug metabolism and response.
- GDAC genome data analysis center
- the owner of the data or an agent will receive a prompt for consent to use the data, and the user may then be authorized to access those regions of the genome with association to the specific disease based on layered encryption.
- the system is designed to track and coordinate all the data contained within these ancillary files. As a result, the various nodes on the network have awareness of the location of data as well as the compute clusters and algorithms that are available. In essence, the encrypted layered data is a component part of how the network provides the content awareness and biointelligence.
- the operating system of the network is configured in such a manner that allows authorized users to be able to access the various layers of encryption with a consent-based conditional access system. For example, if a user is authorized to access the data then the network will know where this data resides and be able to operate on it.
- the biointelligence of the network may reside in the many different types of information and associated data that relate specifically to genome sequence data.
- all of the information associated with every file is searchable on a network-wide basis.
- a user with the proper authorization would be able to submit a query relating to any type of information from Table 1 and receive a response identifying all the genome sequence data files on the network that are accessible to the user based on the consent given by the respective owners of the data within the queried files.
- a query can be made with reference to all of the genome and transcriptome data that has been uploaded to the network during a predetermined period (e.g., within the last 60 days).
- the response to the query would come from multiple genome data local area networks (gLANs).
- the network OS would monitor the consent for access to data and user authorization and be able to effectively authenticate users.
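A consent-filtered, time-bounded query of the kind described above might look like the following sketch. The catalog record fields (`uuid`, `uploaded_at`, `consented_users`) and the function name `recent_accessible` are hypothetical, and a single in-memory list stands in for the multiple gLANs that would each answer part of the query.

```python
from datetime import datetime, timedelta, timezone


def recent_accessible(catalog, user, now, days=60):
    """Return UUIDs of files uploaded within the last `days` days that `user` may access."""
    cutoff = now - timedelta(days=days)
    return [
        entry["uuid"]
        for entry in catalog
        # Keep only recent uploads whose owners have consented to this user.
        if entry["uploaded_at"] >= cutoff and user in entry["consented_users"]
    ]
```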
- the Nextegen Data Repository comprises high-speed fiber optic storage and computing infrastructure capable of facilitating the acquisition, secure storage, searching, and secure sharing of genome sequence data and phenotype metadata with users authorized to access it.
- the repository has the following attributes:
- Sequence Producing Centers are anticipated to be a primary source of sequence data. At such Centers, digital representations of biological samples are generated by processing such samples. Optionally, research centers may also upload genome data. In addition, phenotype information and high-level features derived from the raw sequence, the metadata, is produced at the DCC and other sources. Software and associated hardware will typically be located at these centers to perform the genome sequence transfer and ingestion processes. Such software will generally be capable of checking data format and validity prior to initiating uploading to the Data Repository. The software will also preferably perform transfer of genome and metadata information to the Data Repository and support high capacity transfers at very high data rates (e.g., 10 Gigabits/second).
- the user access software will be provided to the primary and secondary research center sites to enable downloading of information in the repository.
- cancer type lung, ovarian, etc
- the case/sample ID will have various extensions for the various types of files made for each case:
- Implementation of the techniques, blocks, steps and means described above may be done in various ways. For example, these techniques, blocks, steps and means may be implemented in hardware, software, or a combination thereof.
- the processing units may be implemented within one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), processors, controllers, micro-controllers, microprocessors, other electronic units designed to perform the functions described above, and/or a combination thereof.
- the embodiments may be described as a process which is depicted as a flowchart, a flow diagram, a data flow diagram, a structure diagram, or a block diagram. Although a flowchart may describe the operations as a sequential process, many of the operations can be performed in parallel or concurrently. In addition, the order of the operations may be re-arranged. A process is terminated when its operations are completed, but could have additional steps not included in the figure. A process may correspond to a method, a function, a procedure, a subroutine, a subprogram, etc. When a process corresponds to a function, its termination corresponds to a return of the function to the calling function or the main function.
- the term “storage medium” may represent one or more memories for storing data, including read only memory (ROM), random access memory (RAM), magnetic RAM, core memory, magnetic disk storage mediums, optical storage mediums, flash memory devices and/or other machine readable mediums for storing information.
- the term “machine-readable medium” includes, but is not limited to, portable or fixed storage devices, optical storage devices, wireless channels, and/or various other storage mediums capable of storing, containing, or carrying instruction(s) and/or data.
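The description above says that software at the sequence producing centers checks data format and validity before uploading to the Data Repository, and that transfers may run at rates such as 10 Gigabits/second. The following is a minimal sketch of what such a pre-upload check might look like. It is an illustration only: the source does not name a file format, so FASTQ is assumed here, and the function names `validate_fastq`, `sha256_hex`, and `transfer_seconds` are hypothetical rather than part of the described system.

```python
import hashlib
from typing import Iterable

# Assumption: sequences use the plain nucleotide alphabet plus N (unknown base).
VALID_BASES = set("ACGTN")


def validate_fastq(lines: Iterable[str]) -> int:
    """Return the number of records; raise ValueError on the first malformed one."""
    rows = [ln.rstrip("\n") for ln in lines]
    if len(rows) % 4 != 0:
        raise ValueError("FASTQ input must contain four lines per record")
    for i in range(0, len(rows), 4):
        header, seq, sep, qual = rows[i:i + 4]
        rec = i // 4 + 1
        if not header.startswith("@"):
            raise ValueError(f"record {rec}: header must start with '@'")
        if not seq or set(seq.upper()) - VALID_BASES:
            raise ValueError(f"record {rec}: empty or non-ACGTN sequence")
        if not sep.startswith("+"):
            raise ValueError(f"record {rec}: separator must start with '+'")
        if len(qual) != len(seq):
            raise ValueError(f"record {rec}: quality and sequence lengths differ")
    return len(rows) // 4


def sha256_hex(data: bytes) -> str:
    """Checksum a payload so the receiving side can verify it after transfer."""
    return hashlib.sha256(data).hexdigest()


def transfer_seconds(size_bytes: int, rate_gbit_per_s: float = 10.0) -> float:
    """Idealized transfer time at the given line rate, ignoring protocol overhead."""
    return size_bytes * 8 / (rate_gbit_per_s * 1e9)
```

Under these assumptions, a file would be uploaded only if `validate_fastq` returns without raising. As a rough guide, `transfer_seconds(125_000_000_000)` gives 100 seconds for a 125 GB file at an idealized 10 Gigabits/second; real throughput would be lower.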
Applications Claiming Priority (6)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201161539942P | 2011-09-27 | 2011-09-27 | |
US201161539931P | 2011-09-27 | 2011-09-27 | |
US13/417,184 US20120233201A1 (en) | 2011-03-09 | 2012-03-09 | Biological data networks and methods therefor |
US201261650417P | 2012-05-22 | 2012-05-22 | |
US201261662996P | 2012-06-22 | 2012-06-22 | |
PCT/US2012/057668 WO2013049420A1 (en) | 2011-09-27 | 2012-09-27 | System and method for facilitating network-based transactions involving sequence data |
Publications (2)
Publication Number | Publication Date |
---|---|
EP2761518A1 (en) | 2014-08-06 |
EP2761518A4 (en) | 2016-01-27 |
Family
ID=47996410
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP12835985.8A Withdrawn EP2761518A4 (en) | 2011-09-27 | 2012-09-27 | System and method for facilitating network-based transactions involving sequence data |
Country Status (2)
Country | Link |
---|---|
EP (1) | EP2761518A4 (en) |
WO (1) | WO2013049420A1 (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP2816496A1 (en) | 2013-06-19 | 2014-12-24 | Sophia Genetics S.A. | Method to manage raw genomic data in a privacy preserving manner in a biobank |
KR102068451B1 (en) * | 2014-09-03 | 2020-01-20 | 난트헬쓰, 인코포레이티드 | Synthetic genomic variant-based secure transaction devices, systems and methods |
WO2016083949A1 (en) * | 2014-11-25 | 2016-06-02 | Koninklijke Philips N.V. | Secure transmission of genomic data |
WO2016154254A1 (en) * | 2015-03-23 | 2016-09-29 | Private Access, Inc. | System, method and apparatus to enhance privacy and enable broad sharing of bioinformatic data |
IL298101A (en) * | 2020-09-14 | 2023-01-01 | Illumina Inc | Custom data files for personalized medicine |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5776767A (en) * | 1995-12-12 | 1998-07-07 | Visible Genetics Inc. | Virtual DNA sequencer |
US20020029113A1 (en) * | 2000-08-22 | 2002-03-07 | Yixin Wang | Method and system for predicting splice variant from DNA chip expression data |
TW588243B (en) * | 2002-07-31 | 2004-05-21 | Trek 2000 Int Ltd | System and method for authentication |
WO2006084391A1 (en) * | 2005-02-11 | 2006-08-17 | Smartgene Gmbh | Computer-implemented method and computer-based system for validating dna sequencing data |
2012
- 2012-09-27 WO PCT/US2012/057668 patent/WO2013049420A1/en active Application Filing
- 2012-09-27 EP EP12835985.8A patent/EP2761518A4/en not_active Withdrawn
Also Published As
Publication number | Publication date |
---|---|
EP2761518A4 (en) | 2016-01-27 |
WO2013049420A1 (en) | 2013-04-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20130246460A1 (en) | System and method for facilitating network-based transactions involving sequence data | |
AU2013277986B2 (en) | System and method for secure, high-speed transfer of very large files | |
CN110164508B (en) | Biological information data providing method, biological information data storing method, and biological information data transmission system based on multi-block chain | |
Simonyan et al. | High-performance integrated virtual environment (HIVE): a robust infrastructure for next-generation sequence data analysis | |
Xin et al. | High-performance web services for querying gene and variant annotation | |
US10586612B2 (en) | Cloud-like medical-information service | |
US7653634B2 (en) | System for the processing of information between remotely located healthcare entities | |
US8982879B2 (en) | Biological data networks and methods therefor | |
Leekitcharoenphon et al. | snpTree-a web-server to identify and construct SNP trees from whole genome sequence data | |
US20110110568A1 (en) | Web enabled medical image repository | |
US20110153351A1 (en) | Collaborative medical imaging web application | |
WO2013049420A1 (en) | System and method for facilitating network-based transactions involving sequence data | |
Ortega et al. | ETDB-Caltech: a blockchain-based distributed public database for electron tomography | |
Seth et al. | Securing bioinformatics cloud for big data: Budding buzzword or a glance of the future | |
Zolfaghari et al. | Cryptography in hierarchical coded caching: System model and cost analysis | |
US8266706B2 (en) | Cryptographically controlling access to documents | |
US20180322246A1 (en) | System and method for secure, high-speed transfer of very large files | |
Nieroda et al. | iRODS metadata management for a cancer genome analysis workflow | |
Kettimuthu et al. | A data management framework for distributed biomedical research environments | |
Parker et al. | Building infrastructure for African human genomic data management | |
Hachiya et al. | The NBDC-DDBJ imputation server facilitates the use of controlled access reference panel datasets in Japan | |
Wienbrandt et al. | EagleImp-Web: A Fast and Secure Genotype Phasing and Imputation Web Service using Field-Programmable Gate Arrays | |
Kottha et al. | Accessing bio-databases with OGSA-DAI-a performance analysis | |
Joia | Towards Reproducible and Privacy-preserving Analyses Across Federated Repositories for Omics data | |
Xu et al. | SNPTrack TM: an integrated bioinformatics system for genetic association studies |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase | Free format text: ORIGINAL CODE: 0009012 |
| 17P | Request for examination filed | Effective date: 20140404 |
| AK | Designated contracting states | Kind code of ref document: A1 Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR |
| DAX | Request for extension of the european patent (deleted) | |
| RA4 | Supplementary search report drawn up and despatched (corrected) | Effective date: 20160105 |
| RIC1 | Information provided on ipc code assigned before grant | Ipc: G06F 19/28 20110101AFI20151221BHEP |
| R17P | Request for examination filed (corrected) | Effective date: 20140404 |
| STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN |
| 18D | Application deemed to be withdrawn | Effective date: 20180404 |