CN107004068B - Secure transmission of genomic data - Google Patents

Secure transmission of genomic data

Info

Publication number
CN107004068B
Authority
CN
China
Prior art keywords
vcf
data
annotated
encoded
chromosome
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201580064030.1A
Other languages
Chinese (zh)
Other versions
CN107004068A (en)
Inventor
V·阿格拉瓦尔
N·迪米特罗娃
R·J·克拉辛斯基
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Koninklijke Philips NV
Original Assignee
Koninklijke Philips NV
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Koninklijke Philips NV
Publication of CN107004068A
Application granted
Publication of CN107004068B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/606Protecting data by securing the transmission between two devices or processes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/62Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245Protecting personal data, e.g. for financial or medical purposes
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10Sequence alignment; Homology search
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00ICT programming tools or database systems specially adapted for bioinformatics
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/40Encryption of genetic data
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/50Compression of genetic data
    • HELECTRICITY
    • H03ELECTRONIC CIRCUITRY
    • H03MCODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • HELECTRICITY
    • H03ELECTRONIC CIRCUITRY
    • H03MCODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/3059Digital compression and data reduction techniques where the original information is represented by a subset or similar information, e.g. lossy compression
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00Network architectures or network communication protocols for network security
    • H04L63/04Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks
    • H04L63/0428Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks wherein the data content is protected, e.g. by encrypting or encapsulating the payload
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records

Abstract

The volume of genomic data and the sensitivity of the information call for intelligent and efficient ways to transmit genomic data securely. Although encryption schemes exist, there is a need to first reduce the massive amount of information and then apply encoding and encryption methods, an approach that is effective in terms of both cost and the security of the genomic data. The present invention discloses novel techniques for encoding processed variant information and sending it to a remote site in a manner that ensures secure delivery. The protocol not only encodes and encrypts the information; it also compresses the information that needs to be transmitted.

Description

Secure transmission of genomic data
Technical Field
Embodiments of the present invention relate generally to secure data transmission, and more particularly to systems and methods for secure transmission of large amounts of data subject to privacy restrictions and other security issues on otherwise unsecure networks.
Background
Sequencing technologies such as genomic sequencing and SNP genotyping are capable of generating large amounts of genomic data. For example, a variant call file (VCF) used to store data on sequence variants found on chromosomes can be hundreds of gigabytes in size.
Researchers and health care providers often need to transfer genomic data from one location to another, geographically remote location. Since private or dedicated networks that span long distances can be prohibitively expensive or may otherwise include insecure spans, data is often transmitted over insecure networks. Genomic data may be associated with a particular patient and thus present privacy concerns; indeed, transmission may be subject to legal regulations relating to the storage and transmission of such data. In addition, the more the data is processed to identify patient-specific abnormalities, the more sensitive the information becomes, and thus the greater the need for a secure transmission mechanism.
The amount of data and the sensitivity of the information necessitate the development of effective techniques to safely deliver genomic data. The prior art does not necessarily consider the nature of the genomic data (including variant data) nor the quality of the specific data being transmitted.
Therefore, there is a need for an efficient and secure system for transmitting genomic data over an unsecured network.
Disclosure of Invention
In general, aspects of the systems, methods, and apparatus described herein relate to improved systems and methods for transmitting genomic data between geographically remote sites over an insecure network through novel techniques of processing, reducing, encoding, and encrypting data prior to transmission. Specific details for applying the system to the transmission of variant information including Single Nucleotide Polymorphisms (SNPs) have been set forth, but one of ordinary skill in the art will recognize that the embodiments described herein have broader application.
According to one aspect of the present invention, a system for transforming data sequenced from a genome and processed into a Variant Call File (VCF) includes a first processing module and a second processing module, each of the first processing module and the second processing module including a computer processor and a computer readable tangible medium. The first processing module is operable to: reduce the VCF to an annotated VCF based on reference data, the annotated VCF predominantly comprising non-redundant variant data from the VCF; encode the annotated VCF; and store the encoded VCF. The second processing module is operable to: receive the encoded VCF; and augment the encoded VCF.
In one embodiment, the reference data comprises reference allele data and alternative allele data from a database of short genomic variations (SNPs). In one embodiment, encoding the annotated VCF comprises transforming chromosome number and chromosome position data of the annotated VCF using a mathematical coordinate system.
According to another aspect of the present invention, there is provided a method performed by a computer processor of transforming data sequenced from a genome of a patient and processed into a Variant Call File (VCF), and the method comprising the steps of: reducing the VCF to an annotated VCF that primarily includes non-redundant variant data from the VCF; encoding the annotated VCF; and storing the encoded VCF on a computer readable tangible medium.
In one embodiment, reducing the VCF includes removing variant calls whose associated quality data does not meet a predetermined threshold. In one embodiment, reducing the VCF comprises removing known variations using a reference database of short genomic variation (SNP) data. The known variation may include one or more of reference allele information and alternative allele information.
In one embodiment, encoding the annotated VCF comprises transforming chromosome number and chromosome position data of the annotated VCF using a mathematical coordinate system. Converting the chromosome number and chromosome position data of the annotated VCF using a mathematical coordinate system may include converting the chromosome number and chromosome position data of the annotated VCF to a circular coordinate system based on a modulus value. The method may further comprise encrypting the modulus value and initiating transmission of the encrypted modulus value and the encoded VCF file to a second terminal over a network connection.
In one embodiment, encoding the annotated VCF comprises converting chromosome number and chromosome position data of the annotated VCF using one of cartesian coordinates, polar coordinates, or linear coordinates. In one embodiment, the method further comprises applying a frequency domain transform to the annotated VCF prior to encoding the annotated VCF. In one embodiment, the method further comprises transmitting the encoded VCF to a second terminal over a network connection.
According to another aspect of the present invention, there is provided a method performed by a computer processor of transforming data sequenced from a genome of a patient and processed into a Variant Call File (VCF), and the method comprising the steps of: receiving a VCF encoded using a mathematical coordinate system; and augmenting the encoded VCF with the reference allele data and the alternative allele data using a reference database of short genomic variation (SNP) data.
In one embodiment, the method further comprises decoding the encoded VCF using a modulus value.
The foregoing and other features and advantages of the invention will become further apparent from the following description, the accompanying drawings, and the claims. Other aspects and advantages of the present invention will be apparent to those skilled in the art based on the present disclosure.
Drawings
In the drawings, like reference numerals generally refer to the same parts throughout the different views. In the following description, various embodiments of the present invention are described with reference to the following drawings, in which:
Fig. 1 is a schematic diagram of a secure transmission system according to an exemplary embodiment of the present invention.
Fig. 2 is a diagram of a transmitting station according to an exemplary embodiment of the present invention.
Fig. 3 is a diagram of a receiving station according to an exemplary embodiment of the present invention.
Fig. 4 is a flowchart illustrating an exemplary operation of the transmitting station illustrated in fig. 2.
Fig. 5 is a flowchart illustrating exemplary operations of the receiving station illustrated in fig. 3.
Detailed Description
Described herein are various embodiments of methods and systems according to the present invention. These embodiments are exemplary and should not be construed as limiting the scope of the invention as it would be understood by those of ordinary skill in the art.
As known to those of ordinary skill in the art, genomic data is output from a sequencing machine. The raw data output from a sequencing machine can be hundreds of gigabytes in size. The raw data is typically compared to and aligned with a reference genome to create an aligned file, e.g., a Variant Call File (VCF), which is of a smaller order of magnitude than the raw data but still too large to be readily transmitted to a remote site.
An exemplary embodiment of a system for secure transmission of genomic data over an otherwise insecure network is illustrated in fig. 1. The transmission system 1 includes a transmitting station 100, a receiving station 200, a network 300, and a database 400.
The transmission station 100 includes a processing module 110 and an I/O unit 120. The processing module 110 processes the VCF file to produce a reduced file, as discussed below, for secure transmission to a receiving station 200, which is typically remote from the transmitting station 100. The I/O unit 120 handles the transmission of the reduced file (which may also be encrypted and/or encoded).
Receiving station 200 includes a processing module 210 and an I/O unit 220. The I/O unit 220 handles the reception of the reduced file (which may also be encrypted and/or encoded). The processing module 210 processes the reduced file and restores the reduced file to the original VCF file, etc.
In the transmission system 1, the reduced, encoded and encrypted file is at least partially transmitted over the network 300. The network 300 may include, or may interface to, any one or more of the following: the internet, an intranet, a Personal Area Network (PAN), a Local Area Network (LAN), a Wide Area Network (WAN), etc.
The database 400 includes genomic data information that may be associated with an alignment file (i.e., data that has been previously aligned to a reference genome). If there is any data from the alignment file in the database 400, the alignment file may be annotated with reference information from the database 400, the annotation itself replacing the data that is otherwise stored in the database 400 and accessible by the receiving station 200.
For example, in one exemplary embodiment, the file that has been previously aligned to the reference genomic data is a Variant Call File (VCF) of variant data, and the database 400 is a database of known variants, e.g., SNP data. Databases of SNP data are known to those of ordinary skill in the art and are maintained, for example, by the National Center for Biotechnology Information (NCBI) of the National Institutes of Health.
Typical entries in a VCF include the following information relevant to the reconstruction of the genome: the chromosome on which the single nucleotide variation (or small insertion or deletion) is located, the location on the chromosome, the reference base (A, C, G, T, or N), the alternative base (A, C, G, or T), the quality of the variant call, and the nature of the variant call (homozygous/heterozygous). The entries in the VCF may include other information not relevant to the reconstruction process discussed herein.
For a known variation in VCF, genomic coordinates for a location on a chromosome are sufficient to determine reference and alternative allele data for the variation from information stored in the database of SNP data, and can therefore be used to reduce VCF. Chromosomal coordinates include the chromosome number and the location of the variation on the chromosome.
Database 400 may be a searchable database and may include, or interface to, a relational database. Other databases may also be used, such as a database in query format, a database in Structured Query Language (SQL) format, or a similar data storage device, query format, platform, or resource. Database 400 may include a single database or a collection of databases, dedicated databases, or other types of databases. In one embodiment, database 400 may store or cooperate with other databases to store various data and information described herein. In some embodiments, database 400 may include a file management system, program, or application for storing and maintaining data and information used or generated by the various features and functions of the systems and methods described herein.
An exemplary embodiment of a transmitting station 100 is illustrated in fig. 2. The processing module 110 includes a reduction module 111, an encoding module 112, an encryption module 113, and a storage module 114. The reduction module 111 reduces the previously aligned data file or VCF, for example, using annotations based on known genomic data stored in a database or other techniques described more fully herein.
The encoding module 112 encodes the reduced file. In examples where VCF has been reduced, the encoding module 112 may replace the variant data with genomic coordinates (i.e., chromosome number and location) encoded using a coordinate system (e.g., cartesian coordinates, polar coordinates, etc.) discussed in more detail below. The encryption module 113 encrypts the VCF using encryption techniques known in the art (e.g., symmetric encryption or asymmetric encryption). The storage module 114 may store the reduction, encoding, and encryption performed by the reduction module 111, the encoding module 112, and the encryption module 113, and intermediate steps thereof.
An exemplary embodiment of a receiving station 200 is illustrated in fig. 3. The processing module 210 of the receiving station 200 comprises a decryption module 211, a decoding module 212, an expansion module 213, and a storage module 214. The decryption module 211 decrypts the encrypted genome data file received via the I/O unit 220. The decoding module 212 decodes the encoded file received from the transmitting station 100 using the coordinate scheme employed during the encoding process. The expansion module 213 expands the decoded, reduced file. In the example of a VCF that is reduced with reference to known variant data stored in dbSNP, the same database or a database containing the same information can be used to replace annotations in the reduced VCF with corresponding genomic data. For example, the alternative allele data and the reference allele data can be looked up in the dbSNP database and "re-added" to the VCF entries.
An exemplary operation of the transmission system 1 for transmitting VCFs will now be described with reference to fig. 4 and 5. The transmitting station 100 receives genome sequencing data (step S100). The sequencing data may be unprocessed, or it may have been previously aligned to a reference genome. If it was not previously aligned to the reference genome, the genome sequencing data is processed and aligned to the reference genome (step S101). Next, the reduction module 111 reduces the VCF (step S102). To reduce the VCF, reference is made to a database of known variations (dbSNP), usually indexed by chromosome. For each data entry in the VCF, if the variation is already known, the information in the entry can be reduced to the chromosome and the location of the variation on the chromosome. The more information stored in dbSNP, the more the VCF can potentially be reduced.
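The reduction of step S102 can be sketched as follows. This is a minimal illustration only: the `VcfEntry` type, the `KNOWN_VARIANTS` dictionary standing in for a dbSNP lookup, and the function names are assumptions introduced here and do not appear in the original disclosure.

```python
from typing import NamedTuple, Optional


class VcfEntry(NamedTuple):
    chrom: str
    pos: int
    ref: Optional[str]  # None once the entry has been reduced
    alt: Optional[str]  # None once the entry has been reduced


# Hypothetical stand-in for a dbSNP lookup: (chromosome, position) -> (ref, alt).
KNOWN_VARIANTS = {
    ("1", 1000): ("A", "G"),
    ("2", 5000): ("C", "T"),
}


def reduce_entry(entry, known):
    """Keep only chromosome and position when the variant is already known."""
    if known.get((entry.chrom, entry.pos)) == (entry.ref, entry.alt):
        return VcfEntry(entry.chrom, entry.pos, None, None)
    return entry  # novel variant: keep the allele data


def reduce_vcf(entries, known):
    return [reduce_entry(e, known) for e in entries]


entries = [
    VcfEntry("1", 1000, "A", "G"),  # known variant, will be reduced
    VcfEntry("1", 2000, "T", "C"),  # locus not in the database
    VcfEntry("2", 5000, "C", "A"),  # known locus, different alternative allele
]
reduced = reduce_vcf(entries, KNOWN_VARIANTS)
```

In this sketch an entry is reduced only when both the reference and the alternative allele match the database record, so that the receiving side can restore them without ambiguity; that matching rule is a design assumption of the example.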
According to one exemplary embodiment, removing variant data that does not meet a predetermined quality threshold may further reduce the variant data in the VCF. When variant calls are robust (of higher quality), reconstruction of the genome is more reliable. In this exemplary embodiment, variant calls that meet a predetermined quality threshold are retained, and lower quality variant calls are removed or skipped in the creation of the file for transmission. One of ordinary skill in the art will appreciate that the threshold for the quality of the variant may vary depending on the type of variant caller used. For example, for Illumina next generation sequencing data, a SNP would need to be covered by at least 20 reads.
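A quality filter of this kind might look like the sketch below. The dictionary layout, the `depth` field name, and the default of 20 reads (taken from the Illumina example above) are illustrative assumptions; a real threshold depends on the variant caller in use.

```python
def filter_by_depth(calls, min_depth=20):
    """Keep only variant calls covered by at least `min_depth` reads."""
    return [c for c in calls if c["depth"] >= min_depth]


# Hypothetical variant calls; the values are made up for illustration.
calls = [
    {"chrom": "1", "pos": 1000, "alt": "G", "depth": 35},
    {"chrom": "1", "pos": 2000, "alt": "C", "depth": 12},
    {"chrom": "2", "pos": 5000, "alt": "T", "depth": 20},
]
kept = filter_by_depth(calls)
```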
Next, the reduced VCF may be encoded to further reduce the size of the file (step S103). Encoding of genomic coordinates (i.e., chromosome number and location) may be accomplished using a coordinate system according to one exemplary embodiment. Although cartesian, polar, linear, and cyclic coordinate systems are used according to the exemplary embodiments described herein, any suitable coordinate system may be used.
Cartesian coordinate encoding
The transformation of genomic coordinates to Cartesian coordinates may be performed by placing the set of chromosomes in question (e.g., a set of 24 chromosomes) along the x-axis such that the center of each chromosome lies on the x-axis, i.e., the y coordinate of the center of each chromosome is zero.
While the range of x coordinates is, for example, [1 … 24], the range of y coordinates will be [−α/2 … α/2], where α is the number of nucleotide bases present on the chromosome. For each chromosome, the axis coordinate y = 0 will be shifted to a new position α', where:
[Equation image GDA0002871671940000061 is not reproduced in this text.]
Polar coordinates
Polar coordinates (r, θ), which represent length (radius) and angle for the genomic position, can be obtained by a transformation from the above Cartesian coordinates (x, y) such that:
[Equation image GDA0002871671940000071 is not reproduced in this text.]
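Because the original equations are not reproduced in this text, the following sketch relies on stated assumptions rather than on the patent's own formulas: the y coordinate is taken as the position minus the integer midpoint α // 2 of the chromosome, and the polar coordinates use the standard conversion r = sqrt(x² + y²), θ = atan2(y, x). The chromosome lengths are toy values, and all names are hypothetical.

```python
import math

# Toy chromosome indices and lengths (alpha); not real genome data.
CHROM_INDEX = {"chr1": 1, "chr2": 2}
CHROM_LENGTH = {"chr1": 1000, "chr2": 800}


def to_cartesian(chrom, pos):
    """x is the chromosome index; y is the position relative to the midpoint."""
    return CHROM_INDEX[chrom], pos - CHROM_LENGTH[chrom] // 2


def from_cartesian(x, y):
    chrom = next(name for name, i in CHROM_INDEX.items() if i == x)
    return chrom, y + CHROM_LENGTH[chrom] // 2


def to_polar(x, y):
    """Standard Cartesian-to-polar conversion."""
    return math.hypot(x, y), math.atan2(y, x)


def from_polar(r, theta):
    # Round to recover the integer coordinates from floating point values.
    return round(r * math.cos(theta)), round(r * math.sin(theta))


x, y = to_cartesian("chr2", 750)
r, theta = to_polar(x, y)
recovered = from_cartesian(*from_polar(r, theta))
```

The round trip recovers the original chromosome and position as long as the floating point error stays well below 0.5, which holds for coordinates of the size used here.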
Linear coordinates
Linear coordinates can be obtained by converting the genome from its organization into chromosomes to a single string of approximately 3 billion base pairs (the number of base pairs in the human genome). This conversion can be performed by concatenating the nucleotide bases from each of the chromosomes in the conventional chromosomal order (chr1 … chr22, followed by chrX and chrY, respectively) into a string. Thus, the range of linear coordinates would be a ∈ [1 … 3,209,286,105].
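The concatenation described above amounts to adding a per-chromosome offset to each position. The sketch below uses three toy chromosomes with made-up lengths in place of the real human chromosome lengths; the names and the 1-based position convention are assumptions of the example.

```python
from bisect import bisect_left
from itertools import accumulate

# Toy chromosome order and lengths; not real genome data.
CHROM_ORDER = ["chr1", "chr2", "chrX"]
CHROM_LENGTH = {"chr1": 1000, "chr2": 800, "chrX": 500}

# OFFSETS[i] is the number of bases that precede chromosome i in the string.
OFFSETS = [0] + list(accumulate(CHROM_LENGTH[c] for c in CHROM_ORDER))


def to_linear(chrom, pos):
    """Map a 1-based (chromosome, position) pair to a 1-based linear coordinate."""
    return OFFSETS[CHROM_ORDER.index(chrom)] + pos


def from_linear(a):
    """Invert to_linear by locating the chromosome whose range contains a."""
    i = bisect_left(OFFSETS, a, lo=1) - 1
    return CHROM_ORDER[i], a - OFFSETS[i]


a = to_linear("chr2", 1)
```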
Circular coordinates
To obtain circular coordinates, the chromosomal locations are mapped to a circular coordinate system, where points on the circle represent nucleotide positions and the angular separation of these points represents the location coordinates. If the number of locations exceeds the number of possible representations in a span of one turn (2π), then the value can be scaled down using modulo arithmetic.
In one exemplary embodiment, modulo arithmetic may be used to reduce the complexity of VCFs encoded using a circular coordinate system. Using the modulus (n) to wrap locations around the circle, the linear value of the location, a, can be converted to a location on the circle as follows:
[Equation image GDA0002871671940000072 is not reproduced in this text.]
The transformed coordinate is a' = f(n, q, r), where n is the modulus value, q is the quotient of the division, and r is the remainder. For each location in the VCF file, the encoded file will have the following information: (i) the quotient of the modulo operation; (ii) the remainder of the modulo operation, expressed as an angle; and (iii) the alternative allele at the location.
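The modulo encoding can be illustrated with the division identity a = q·n + r. The mapping of the remainder to an angle as 2πr/n is an assumption made for this sketch, since the original equation is not reproduced in this text; the function names and the modulus value are likewise hypothetical.

```python
import math


def encode_circular(a, n):
    """Split a linear coordinate into a quotient and a remainder angle."""
    q, r = divmod(a, n)
    angle = 2 * math.pi * r / n  # remainder expressed as an angle in [0, 2*pi)
    return q, angle


def decode_circular(q, angle, n):
    """Recover the linear coordinate; the modulus n acts as the decoding key."""
    r = round(angle * n / (2 * math.pi))
    return q * n + r


n = 1_000_003  # arbitrary illustrative modulus
q, angle = encode_circular(123_456_789, n)
decoded = decode_circular(q, angle, n)
```

Decoding with a different modulus yields a different coordinate, which is why the modulus value can serve as a key in this scheme.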
The modulus value "n" may serve as a key to decode the information in the VCF. The modulus value may be a constant or may be calculated by a random number generator. The modulus values may be sent with the VCF or, alternatively, may be sent via a different channel. In one exemplary embodiment, the other channel is a secure channel. The secure channel may also be used, for example, to communicate patient identification information.
The modulus value may be encrypted using encryption techniques known to those of ordinary skill in the art. In such an exemplary embodiment, where the modulus values are utilized in order to decode the patient's variant information, the remote site would be required to decrypt the modulus values and then decode the variant coordinates, thereby undergoing two levels of decryption.
After the reducing step and the encoding step, the reduced and encoded VCF may be encrypted by the encryption module 113 (step S104). Any suitable encryption technique may be utilized, including symmetric encryption techniques and asymmetric encryption techniques.
In an exemplary embodiment, the encryption step may be preceded by a DNA spectral analysis step in which the A, C, G, and T bases of the alternative allele are transformed into the spectral domain, for example using a Fourier transform or other frequency transform. Upon receipt, the spectral DNA will be transformed back to the A, C, G, and T bases of the alternative allele.
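One possible way to realize such a frequency transform is sketched below. The text does not specify how bases are represented numerically, so the mapping of A, C, G, and T to the four complex units is purely an assumption of this example, as are the function names. A plain discrete Fourier transform is used for clarity rather than speed.

```python
import cmath

# Assumed numeric representation of the bases (not taken from the disclosure).
BASE_TO_COMPLEX = {"A": 1 + 0j, "C": 1j, "G": -1 + 0j, "T": -1j}


def dft(values, inverse=False):
    """Naive discrete Fourier transform (O(n^2)), or its inverse."""
    n = len(values)
    sign = 1 if inverse else -1
    out = [
        sum(v * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t, v in enumerate(values))
        for k in range(n)
    ]
    return [v / n for v in out] if inverse else out


def bases_to_spectrum(bases):
    return dft([BASE_TO_COMPLEX[b] for b in bases])


def spectrum_to_bases(spectrum):
    """Invert the transform and snap each value to the nearest base."""
    values = dft(spectrum, inverse=True)
    return "".join(
        min(BASE_TO_COMPLEX, key=lambda b: abs(BASE_TO_COMPLEX[b] - v)) for v in values
    )


spectrum = bases_to_spectrum("GATTACA")
restored = spectrum_to_bases(spectrum)
```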
During the operation described in fig. 4, including after the encryption step, the results of the respective steps may be stored (step S105).
The processed file may then be transmitted to the receiving station 200 over a network 300, which network 300 may be unsecured or include an unsecured span. The restoration of the original file at the receiving station 200 according to an exemplary embodiment of the present invention will now be described with reference to fig. 5. The process of recovery is essentially a process of applying the reduction step, the encoding step, and the encryption step in reverse to the transmitted file.
If the file has been encrypted, the encrypted file is decrypted by the decryption module 211 (step S201). In an exemplary embodiment that includes a step of frequency transformation into the spectral domain, the spectral DNA information will be transformed back to the A, C, G, and T bases of the alternative allele. In one exemplary embodiment, the encryption scheme used by the encryption module 113 is known a priori to the decryption module 211. In another exemplary embodiment, the encryption scheme is transmitted to the decryption module 211 together with the transmitted file or to the decryption module 211 after the file is transmitted using the same channel or a separate channel.
In embodiments where a modulus value is used in the encryption process, the value is then decrypted and then used by the decryption module 211 to perform decryption of the encrypted VCF.
Next, the decoding module 212 decodes the decrypted file (step S202). In one exemplary embodiment, the encoding scheme used by the encoding module 112 is known a priori to the decoding module 212. In another exemplary embodiment, the encoding scheme is transmitted to the decoding module 212 together with the transmitted file or to the decoding module 212 after the file is transmitted using the same channel or a separate channel.
Next, the expansion module 213 expands the decoded file with reference to the database of known variations (step S203). During the operation depicted in FIG. 5, the results of the various steps may be stored, including storing the recovered VCF after the final expansion step (step S204). After recovery, the VCF may be transmitted for further processing as needed (step S205).
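The expansion of step S203 can be sketched as the inverse of the reduction: entries that carry only a chromosome and a position have their allele data looked up again. The tuple layout and the `KNOWN_VARIANTS` dictionary standing in for a dbSNP lookup are assumptions of this example.

```python
# Hypothetical stand-in for a dbSNP lookup: (chromosome, position) -> (ref, alt).
KNOWN_VARIANTS = {
    ("1", 1000): ("A", "G"),
    ("2", 5000): ("C", "T"),
}


def expand_entry(entry, known):
    """Re-add reference and alternative alleles to a reduced entry."""
    chrom, pos, ref, alt = entry
    if ref is None and alt is None:
        ref, alt = known[(chrom, pos)]
    return chrom, pos, ref, alt


def expand_vcf(entries, known):
    return [expand_entry(e, known) for e in entries]


reduced = [
    ("1", 1000, None, None),  # reduced at the transmitting station
    ("1", 2000, "T", "C"),    # novel variant, transmitted in full
]
expanded = expand_vcf(reduced, KNOWN_VARIANTS)
```

The sketch assumes that both stations consult the same database contents; a reduced entry whose locus is missing at the receiving side would raise a `KeyError` here.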
The transmitting station 100 and receiving station 200 may be incorporated into a computer station where operations are initiated by a human operator, automated, or both. The transmitting station 100 may also be incorporated into a network device (e.g., a server or router) that includes the capability to identify the VCF being transmitted and perform the exemplary operations described herein. The network device may be a gateway that routes data between networks and over which such data is transmitted to a network including the receiving station 200, wherein genome sequencing data is reduced, encoded, and encrypted according to the exemplary embodiments described herein. Receiving station 200 may also be included in a network device (e.g., a network gateway) that includes the ability to identify reduced, encoded, and encrypted VCFs and recover genomic data according to exemplary embodiments described herein.
The transmission system as shown in fig. 1, 2 and 3 may be or may include a computer system. The transmission system may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types.
Those skilled in the art will appreciate that the invention may be practiced with various computer system configurations, including hand-held wireless devices such as mobile telephones or tablets, multiprocessor systems, microprocessor-based or programmable consumer electronics, minicomputers, mainframe computers, and the like. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The transmission system may include a plurality of software processing modules stored in a memory as described above and running on a processor in the manner described herein. The program modules may be written in any suitable programming language that is converted to machine language or object code to allow one or more processors to execute the instructions.
The computer system may include a general purpose computing device in the form of a computer including a processing unit, a system memory, and a system bus that couples various system components including the system memory to the processing unit.
The processing unit that executes commands and instructions may be a general purpose computer, but may also utilize any of a variety of other technologies, including special purpose computers, minicomputers, mainframe computers, programmed microprocessors, microcontrollers, peripheral integrated circuit elements, CSICs (customer specific integrated circuits), ASICs (application specific integrated circuits), logic circuits, digital signal processors, programmable logic devices such as FPGAs (field programmable gate arrays), PLDs (programmable logic devices), PLAs (programmable logic arrays), RFID integrated circuits, smart chips, or any other device or arrangement of devices capable of implementing the steps of the processes of the invention.
It should be appreciated that the processors and/or memories of the computer system need not be physically at the same location. Each processor and each memory used by the computer system may be in geographically distinct locations and connected to communicate with each other in any suitable manner. Additionally, it should be appreciated that each of the processors and/or memories may comprise different physical components of the apparatus.
The computing environment may also include other removable/non-removable, volatile/nonvolatile computer storage media.
The foregoing describes specific embodiments of the present invention. It is expressly noted, however, that the present invention is not limited to these embodiments, but rather, that additions and modifications to what is expressly described herein are intended to be included within the scope of the present invention. Furthermore, it is to be understood that the features of the various embodiments described herein are not mutually exclusive and can exist in various combinations and permutations, even if such combinations or permutations are not expressly stated herein, without departing from the spirit and scope of the invention. Indeed, variations, modifications, and other embodiments of the invention described herein will occur to those skilled in the art without departing from the spirit and scope of the invention. As such, the invention is not limited solely to the foregoing illustrative description.

Claims (10)

1. A system for securely transmitting data sequenced from a genome of a patient, the system comprising:
a first processing module comprising a computer processor and a computer readable tangible medium, wherein the first processing module is configured to:
reduce a known variant call file (VCF) from genomic sequencing data to an annotated VCF by reducing the known variants in the VCF based on reference data from a database of known variants, the annotated VCF comprising primarily non-redundant variant data from the VCF, wherein the annotated VCF comprises chromosome number and chromosome position data;
encode the annotated VCF by converting the chromosome number and chromosome position data of the annotated VCF to a circular coordinate system by wrapping positions around a circle with a modulus value, wherein chromosome positions are mapped to the circular coordinate system, and wherein a point on the circle represents a nucleotide position and an angular separation of the point represents a position coordinate; and
store the encoded annotated VCF; and
a second processing module comprising a computer processor and a computer readable tangible medium, wherein the second processing module is configured to:
receive the modulus value;
receive the encoded annotated VCF;
decode the encoded annotated VCF using the received modulus value; and
augment the decoded annotated VCF with the reference data from the database of known variants to form the VCF.
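The encoding and decoding recited in claim 1 can be illustrated with a minimal Python sketch. The claim does not fix a concrete formula, so the sketch assumes one plausible reading: a position is split into a count of complete turns around a circle of `modulus` points plus the angle of the point reached on the circle. The function names, the tuple layout, and the example modulus are hypothetical and are not taken from the patent.

```python
import math


def encode_position(chrom, pos, modulus):
    """Wrap a chromosome position around a circle of `modulus` points.

    Assumed mapping (not specified by the claim): the number of complete
    turns and the angle of the point on the circle together stand in for
    the original position.
    """
    turns, residue = divmod(pos, modulus)
    angle = 2.0 * math.pi * residue / modulus  # angular position in radians
    return chrom, turns, angle


def decode_position(encoded, modulus):
    """Invert encode_position; the same modulus value must be supplied."""
    chrom, turns, angle = encoded
    residue = round(angle * modulus / (2.0 * math.pi))
    return chrom, turns * modulus + residue


if __name__ == "__main__":
    modulus = 1_000_003          # hypothetical shared modulus value
    record = (7, 55_191_822)     # (chromosome number, position), illustrative only
    encoded = encode_position(*record, modulus)
    assert decode_position(encoded, modulus) == record
```

Under this reading, the receiver needs the modulus value to recover the positions, which is consistent with the claim having the second processing module receive the modulus value separately from the encoded annotated VCF.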
2. The system of claim 1, wherein the reference data comprises reference allele data and alternative allele data from a database of short genomic variations (SNPs).
3. A computer-implemented method of securely transmitting data sequenced from a genome of a patient, the method comprising:
providing a computer processor configured to:
reduce a known variant call file (VCF) from genomic sequencing data to an annotated VCF by reducing the known variants in the VCF based on reference data from a database of known variants, the annotated VCF comprising primarily non-redundant variant data from the VCF, wherein the annotated VCF comprises chromosome number and chromosome position data;
encode the annotated VCF by converting the chromosome number and chromosome position data of the annotated VCF to a circular coordinate system by wrapping positions around a circle with a modulus value, wherein chromosome positions are mapped to the circular coordinate system, and wherein a point on the circle represents a nucleotide position and an angular separation of the point represents a position coordinate; and
store the encoded annotated VCF on a computer readable tangible medium.
4. The method of claim 3, wherein reducing the VCF comprises removing variant calls whose associated quality data does not meet a predetermined threshold.
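As a rough illustration of the quality-based reduction in claim 4, the following Python sketch drops data lines whose quality falls below a threshold. It assumes that the "associated quality data" is the QUAL column of a tab-separated VCF line (the sixth column) and that records with a missing QUAL value (`.`) are dropped; both choices, and the function name, are assumptions made for the example.

```python
def filter_by_quality(vcf_lines, min_qual):
    """Keep header lines and data lines whose QUAL is at least min_qual.

    Assumption: quality is read from the QUAL column (index 5) and
    records with a missing QUAL (".") are removed.
    """
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):
            kept.append(line)  # meta-information and column header lines
            continue
        fields = line.rstrip("\n").split("\t")
        qual = fields[5]
        if qual != "." and float(qual) >= min_qual:
            kept.append(line)
    return kept
```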
5. The method of claim 3, wherein reducing the VCF comprises removing known variations using a reference database of short genomic variation (SNP) data.
6. The method of claim 5, wherein the known variation comprises one or more of reference allele information and alternative allele information.
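The reduction of claims 5 and 6, together with the matching augmentation on the receiving side, can be sketched as follows. The sketch assumes that a known variant is one whose identifier and alleles both match an entry in a reference table, so that its reference and alternative alleles can be omitted and looked up again later. The in-memory dictionary, the record layout, and the identifiers are hypothetical stand-ins for a real reference database.

```python
# Hypothetical stand-in for a reference database of known variants:
# identifier -> (reference allele, alternative allele).
KNOWN_VARIANTS = {
    "rs0000001": ("A", "G"),
    "rs0000002": ("C", "T"),
}


def reduce_record(record, known):
    """Drop REF/ALT when they can be recovered from the reference table."""
    chrom, pos, var_id, ref, alt = record
    if known.get(var_id) == (ref, alt):
        return (chrom, pos, var_id, None, None)
    return record  # novel or non-matching variant: keep alleles


def augment_record(record, known):
    """Restore REF/ALT that reduce_record removed."""
    chrom, pos, var_id, ref, alt = record
    if ref is None and alt is None and var_id in known:
        ref, alt = known[var_id]
    return (chrom, pos, var_id, ref, alt)
```

A record that is not in the reference table passes through both functions unchanged, so only the redundant allele information is removed.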
7. The method of claim 3, further comprising encrypting the modulus value and initiating transmission of the encrypted modulus value and the encoded annotated VCF to a second terminal over a network connection.
8. The method of claim 3, further comprising performing a step of DNA profiling in which A, C, G and T bases of the alternative allele in the encoded annotated VCF are transformed into the spectral domain using frequency domain transformation prior to decrypting the encoded annotated VCF.
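Claim 8 does not name a specific transformation. One way to picture a frequency domain transformation of A, C, G and T bases is the binary indicator approach, sketched below in Python: each base gets a 0/1 sequence marking where it occurs, and a discrete Fourier transform is taken of each sequence. The choice of indicator sequences and a plain DFT, and the function name, are assumptions made for illustration only.

```python
import cmath


def indicator_spectra(sequence):
    """Return the DFT of the binary indicator sequence of each base.

    Assumed technique: one 0/1 indicator sequence per base, transformed
    with a direct O(n^2) discrete Fourier transform.
    """
    n = len(sequence)
    spectra = {}
    for base in "ACGT":
        indicator = [1.0 if s == base else 0.0 for s in sequence]
        spectra[base] = [
            sum(
                x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(indicator)
            )
            for k in range(n)
        ]
    return spectra
```

In this representation the zero-frequency coefficient for a base equals the number of times that base occurs in the sequence.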
9. The method of claim 3, further comprising transmitting the encoded annotated VCF to a second terminal over a network connection.
10. A computer-implemented method of securely transmitting data sequenced from a genome of a patient, the method comprising:
providing a computer processor configured to:
receive a VCF encoded by converting chromosome number and chromosome position data from a variant call file (VCF) of genomic sequencing data to a circular coordinate system by wrapping positions around a circle with a modulus value, wherein a chromosome position is mapped to the circular coordinate system, and wherein a point on the circle represents a nucleotide position and an angular separation of the point represents a position coordinate;
receive the modulus value;
decode the encoded VCF using the received modulus value; and
augment the decoded VCF with reference allele data and alternative allele data using a reference database of short genomic variation (SNP) data.
CN201580064030.1A 2014-11-25 2015-11-18 Secure transmission of genomic data Active CN107004068B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US201462084146P 2014-11-25 2014-11-25
US62/084,146 2014-11-25
PCT/IB2015/058912 WO2016083949A1 (en) 2014-11-25 2015-11-18 Secure transmission of genomic data

Publications (2)

Publication Number Publication Date
CN107004068A (en) 2017-08-01
CN107004068B (en) 2021-08-24

Family

ID=55022623

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201580064030.1A Active CN107004068B (en) 2014-11-25 2015-11-18 Secure transmission of genomic data

Country Status (6)

Country Link
US (1) US10957420B2 (en)
EP (1) EP3224752B1 (en)
JP (1) JP6788587B2 (en)
CN (1) CN107004068B (en)
RU (1) RU2753245C2 (en)
WO (1) WO2016083949A1 (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10395759B2 (en) 2015-05-18 2019-08-27 Regeneron Pharmaceuticals, Inc. Methods and systems for copy number variant detection
MX2019004126A (en) * 2016-10-11 2019-06-12 Genomsys Sa Method and system for the transmission of bioinformatics data.
US20180314842A1 (en) * 2017-04-27 2018-11-01 Awakens, Inc. Computing system with genomic information access mechanism and method of operation thereof
CN108563923B (en) * 2017-12-05 2020-08-18 华南理工大学 Distributed storage method and system for genetic variation data
CN109192245A (en) * 2018-07-26 2019-01-11 中山大学 The GDS-Huffman compression method of genetic mutation data
WO2020158842A1 (en) * 2019-02-01 2020-08-06 株式会社東芝 Terminal device, data processing method, and program
EP3792923A1 (en) * 2019-09-16 2021-03-17 Siemens Healthcare GmbH Method and device for exchanging information regarding the clinical implications of genomic variations
US11562057B2 (en) 2020-02-05 2023-01-24 Quantum Digital Solutions Corporation Ecosystem security platforms for enabling data exchange between members of a digital ecosystem using digital genomic data sets
IL304962A (en) 2021-02-04 2023-10-01 Quantum Digital Solutions Corp Cyphergenics-based ecosystem security platforms
JP2023014547A (en) * 2021-07-19 2023-01-31 国立研究開発法人情報通信研究機構 Personal information protection management system for genome data

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101914628A (en) * 2010-09-02 2010-12-15 深圳华大基因科技有限公司 Method and system for detecting polymorphism locus of genome target region
CN102460155A (en) * 2009-04-29 2012-05-16 考利达基因组股份有限公司 Method and system for calling variations in a sample polynucleotide sequence with respect to a reference polynucleotide sequence

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7391816B2 (en) * 2003-09-17 2008-06-24 Intel Corporation Decoding upstream V.92-encoded signals
RU2419137C2 (en) * 2006-02-13 2011-05-20 иПостал Сервисез, Инк. System and method to hand over documents and to control circulation of documents
US20110288785A1 (en) * 2010-05-18 2011-11-24 Translational Genomics Research Institute (Tgen) Compression of genomic base and annotation data
US20120102054A1 (en) 2010-10-25 2012-04-26 Life Technologies Corporation Systems and Methods for Annotating Biomolecule Data
RU2013140708A (en) 2011-02-04 2015-03-10 Конинклейке Филипс Н.В. METHOD FOR ASSESSING INFORMATION FLOW IN BIOLOGICAL NETWORKS
WO2012122549A2 (en) 2011-03-09 2012-09-13 Lawrence Ganeshalingam Biological data networks and methods therefor
US20130246460A1 (en) 2011-03-09 2013-09-19 Annai Systems, Inc. System and method for facilitating network-based transactions involving sequence data
EP2732423A4 (en) * 2011-07-13 2014-11-26 Multiple Myeloma Res Foundation Inc Methods for data collection and distribution
EP2761518A4 (en) * 2011-09-27 2016-01-27 Lawrence Ganeshalingam System and method for facilitating network-based transactions involving sequence data
WO2013067001A1 (en) 2011-10-31 2013-05-10 The Scripps Research Institute Systems and methods for genomic annotation and distributed variant interpretation
US20130245958A1 (en) * 2012-03-15 2013-09-19 Siemens Aktiengesellschaft Accurate comparison and validation of single nucleotide variants
US9483610B2 (en) * 2013-01-17 2016-11-01 Edico Genome, Corp. Bioinformatics systems, apparatuses, and methods executed on an integrated circuit processing platform
US20140278461A1 (en) 2013-03-15 2014-09-18 Memorial Sloan-Kettering Cancer Center System and method for integrating a medical sequencing apparatus and laboratory system into a medical facility

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
" Metaseq: privacy preserving meta-analysis of sequencing-based association studies";Singh A P;《Biocomputing》;20130228;第1-12页 *
"Genome compression: a novel approach for large collections";S Deorowicz;《Bioinformatics》;20130821;第2572–2578页 *

Also Published As

Publication number Publication date
JP2018503167A (en) 2018-02-01
RU2753245C2 (en) 2021-08-12
EP3224752B1 (en) 2022-07-13
EP3224752A1 (en) 2017-10-04
WO2016083949A1 (en) 2016-06-02
US10957420B2 (en) 2021-03-23
US20170262579A1 (en) 2017-09-14
JP6788587B2 (en) 2020-11-25
CN107004068A (en) 2017-08-01
RU2017122194A3 (en) 2019-12-06
RU2017122194A (en) 2018-12-26

Similar Documents

Publication Publication Date Title
CN107004068B (en) Secure transmission of genomic data
US10652010B2 (en) Fully homomorphic encrypted ciphertext query method and system
CN106610995B (en) Method, device and system for creating ciphertext index
CN106817358B (en) Encryption and decryption method and device for user resources
JP6289680B2 (en) Packet transmission device, packet reception device, packet transmission program, and packet reception program
CN104038336A (en) Data encryption method based on 3DES
CN104966026A (en) Arithmetical operation system
CN112394974A (en) Code change comment generation method and device, electronic equipment and storage medium
CN112287366A (en) Data encryption method and device, computer equipment and storage medium
CA3061776A1 (en) Key information processing method and apparatus, electronic device and computer readable medium
CN111818087B (en) Block chain node access method, device, equipment and readable storage medium
CN117056961A (en) Privacy information retrieval method and computer readable storage medium
CN112994887B (en) Communication encryption method and system suitable for power Internet of things terminal
CN101169776B (en) Data encryption method and device for promoting central processing unit operation efficiency
Mohamed et al. Compression and encryption technique on securing TFTP packet
KR101438312B1 (en) Method of data encryption and encrypted data transmitter-receiver system using thereof
CN113360923A (en) Data interaction method, device and system and electronic equipment
CN113938270A (en) Data encryption method and device capable of flexibly reducing complexity
CN115935299A (en) Authorization control method, device, computer equipment and storage medium
CN108924104B (en) E-government affair encryption and decryption method
CN113347176B (en) Encryption method and device for data communication, computer equipment and readable storage medium
CN110351084B (en) Secret processing method for urban basic mapping data
CN117040913B (en) Cloud resource sharing data security transmission method and system
CN109376721B (en) Fingerprint feature extraction method, fingerprint registration method, fingerprint identification method and device
CN117688595A (en) Homomorphic encryption performance improving method and system based on trusted execution environment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant