CN110660450A - Safety counting query and integrity verification device and method based on encrypted genome data - Google Patents

Safety counting query and integrity verification device and method based on encrypted genome data

Info

Publication number
CN110660450A
Authority
CN
China
Prior art keywords
data
query
terminal
encrypted
snp
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910899612.1A
Other languages
Chinese (zh)
Inventor
王雷
邹赛
朱贤友
陈治平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing College of Electronic Engineering
Changsha University
Original Assignee
Chongqing College of Electronic Engineering
Changsha University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing College of Electronic Engineering, Changsha University
Priority to CN201910899612.1A
Publication of CN110660450A

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/64 Protecting data integrity, e.g. using checksums, certificates or signatures

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Chemical & Material Sciences (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Genetics & Genomics (AREA)
  • Analytical Chemistry (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Bioethics (AREA)
  • Computer Hardware Design (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method for securely sharing and computing on genome data stored on a cloud server that is not fully trusted. First, the method processes the biomedical data so that the privacy and integrity of the shared data are ensured. Second, the method stores the data in a hashMap structure, which improves encryption efficiency and enables efficient secure count queries. The method is evaluated on an existing single nucleotide polymorphism (SNP) sequence database, and the experiments show that count queries are completed more flexibly and that the method is easy to implement and highly secure in practical applications.

Description

Safety counting query and integrity verification device and method based on encrypted genome data
Technical Field
The invention relates to the technical field of bioinformatics, and in particular to a secure count query and integrity verification device based on encrypted genome data.
Background
Clinical medical practice plays a crucial role in healthcare, and genomics research is becoming increasingly widespread. Genomics research helps to identify potential correlations between a disease and a gene. To this end, biomedical researchers have conducted large-scale investigations of patients' clinical status and DNA sequence data, and most analyses of genetic data are based on the genome-wide association studies (GWAS) of the U.S. National Institutes of Health (NIH). To improve the accuracy of such research, data from different sources must be aggregated, and various service systems for sharing and storing data have been established, such as the database of Genotypes and Phenotypes (dbGaP) in the United States and the biobank program of the Wellcome Trust in the United Kingdom. Genomic data concerns patient privacy, and its disclosure may cause social and legal problems; for example, a health insurance company may refuse to insure a person once it learns that the person carries a mutation in a gene associated with the likelihood of a particular cancer. Because genetic data is so sensitive, sharing it among multiple institutions requires that genomic data be stored and accessed through privacy-preserving methods.
Biomedical research increasingly depends on large amounts of genomic and clinical data, and scholars have accordingly paid great attention to how to ensure the privacy of the individuals concerned and the overall security of the system when these data are shared and managed. In the past, to protect the sensitive information of a subject, important identifiers that could identify individuals were deleted when multiple organizations shared data. However, research has shown that the identity of a subject can still be easily inferred using automated methods. Current research uses cryptographic protocols to share, manage and analyze biomedical data, and outsources the encrypted data to third-party cloud service providers with vast storage and computing resources in order to protect data privacy. At the same time, these third parties have themselves become potential targets through which the privacy of research subjects may be violated.
In the prior art, there are security frameworks that manage clinical genomic data in a centralized database in a specific form, and these frameworks propose methods for securely sharing and storing genomic data on a semi-honest cloud service. First, the data owner sends clinical information to a third-party organization for encryption and authentication, and the third-party organization sends the aggregated data to a cloud service, where it is stored in a certain structure. The cloud service then performs a query on behalf of the researcher (e.g., the number of patient records with hypertension and a particular genomic variant) and returns the query result to the researcher. However, returning the query result directly to the researcher cannot rule out an attacker compromising the researcher and capturing the encryption protocol, so a data risk remains.
Disclosure of Invention
In order to allow scientific research to be carried out by querying shared medical data without revealing the identity of the data subjects, the invention provides a secure count query and integrity verification device based on encrypted genome data, which comprises a data owner terminal, a certification authority server, a cloud server, an agency terminal and a researcher terminal;
the certification authority server receives the patient records from the data owner terminal and encrypts them;
the data owner terminal processes the patients' genome data according to a specified format and sends it to the certification authority server in plaintext form;
after obtaining the data shared by different data owners, the certification authority server constructs a searchable hash map that aggregates the shared data in encrypted form and sends the hash map to the cloud server;
the cloud server acquires the encrypted data from the certification authority server, processes the encrypted query request sent by the agency terminal, executes the query, and sends the query result to the agency terminal in encrypted form;
the researcher terminal, in response to a user operation, sends the researcher's query request to the agency terminal;
after receiving the researcher's query request, the agency terminal encrypts the query request data and sends the encrypted query request to the cloud server; after obtaining the query result from the cloud server, it decrypts the query result using the private key and sends it to the researcher terminal.
Further, the cloud server obtains the frequency with which a given set of SNP values of a specific gene appears in the database, and then normalizes this value by the total number of records;
the cloud server expresses the total count of a count query in terms of a database D and a query Q, where $D = \{S_1, S_2, \ldots, S_n\}$ denotes the SNP sequences of n patients; the count query is defined as finding the number of patients in D that satisfy the multiple query conditions of Q; and the gene sequence of the m SNP sites of a patient is denoted $S = \{d_1, d_2, \ldots, d_m\}$, where $d_i$ ($1 \le i \le m$) represents the SNP value at the i-th site of the patient.
Further, after receiving the data provided by a data owner, the certification authority server creates a map Entity for the SNP sequence of each patient and processes all the data to generate a mapping table M, where M comprises the genotype and phenotype information of each patient; after M is created, the certification authority server creates or updates the Entity information in M for each new record from a data owner;
each Entity contains the following:
key: the key value of the Entity, an index value that points to a specific data item,
geno: the SNP sequence of a patient, denoted $S = \{d_1, d_2, \ldots, d_m\}$; sid is the unique identifier of an SNP site and represents a specific position in the gene sequence, where the site identifier of base pair $d_i$ ($1 \le i \le m$) is expressed as
Figure BDA0002211388830000031
I: the phenotype data set corresponding to the genome sequence,
count: the frequency of occurrence of the current SNP sequence in the database, representing the total number of records matching the genotype and phenotype,
next: the key value of the next Entity.
Further, the certification authority server creates the hash table with a complexity of O(mn), where m is the number of records in the database and n is the number of different SNP sites in each sequence record; each Entity is denoted θ, and its components are written θ(key, geno, I, count, next).
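As a plaintext illustration of the Entity structure described above, the following Python sketch aggregates patient records into a map keyed by SNP sequence. It is only a sketch under stated assumptions: encryption is omitted, a plain set stands in for the Bloom filter value I, and the names `Entity`, `build_map` and `phenotypes` are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Set, Tuple


@dataclass
class Entity:
    key: int                    # index value of this entry
    geno: Tuple[str, ...]       # SNP sequence, e.g. ("AG", "CT")
    phenotypes: Set[str]        # assumption: plain set standing in for I
    count: int = 1              # number of records sharing this SNP sequence
    next: Optional[int] = None  # key of the next Entity, if any


def build_map(records: Iterable[Tuple[List[str], str]]) -> Dict[Tuple[str, ...], Entity]:
    """Aggregate (snp_sequence, phenotype) records into Entity objects.

    Each of the m records is scanned once and its n SNP sites are copied
    into a tuple, so construction takes O(mn) time.
    """
    table: Dict[Tuple[str, ...], Entity] = {}
    last_created: Optional[Entity] = None
    for snps, phenotype in records:
        geno = tuple(snps)
        entity = table.get(geno)
        if entity is None:
            # New SNP sequence: create an Entity and link it to the previous one.
            entity = Entity(key=len(table), geno=geno, phenotypes={phenotype})
            table[geno] = entity
            if last_created is not None:
                last_created.next = entity.key
            last_created = entity
        else:
            # Known SNP sequence: update the existing Entity.
            entity.count += 1
            entity.phenotypes.add(phenotype)
    return table
```

Calling `build_map` on two records with the sequence `["AG", "CT"]` and one with `["AA", "CC"]` yields two Entity objects, the first with `count == 2`.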
Further, the certification authority server processes the phenotype data using a Bloom filter: for each SNP sequence recorded in the genome data, it inserts the corresponding phenotype into the Bloom filter, and if several phenotypes are associated with the same genome sequence, all of them are inserted into the Bloom filter of that genome sequence entry; each Entity in M contains a Bloom filter value I, which represents the phenotype information corresponding to the genome sequence;
the certification authority server sets the hash function H used in the Bloom filter and the public alphabet Σ, which is the set of all possible phenotypes; for each genome sequence entry, the corresponding phenotype is inserted into the Bloom filter as follows: each phenotype in Σ is mapped to a unique number, and that number is inserted into the Bloom filter;
the certification authority server selects a random key k for a pseudo-random function (PRF) F and encrypts each Bloom filter into
Figure BDA0002211388830000041
the certification authority server sends the private key sk to the agency terminal, and provides the PRF key k and the function $F_k$ for the Bloom filter;
the certification authority server matches the hash positions of the phenotype value $I_q$ of the query request against the hash positions of the value $I_i$ of each Entity;
the certification authority server transmits the PRF key k to the agency terminal, and the encrypted Bloom filter $I'_i$ is decrypted using the formula
Figure BDA0002211388830000042
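The phenotype handling described above can be illustrated with a small unencrypted Bloom filter in Python. This is a minimal sketch, not the patented construction: the PRF encryption step is left out because its formula is not reproduced here, SHA-256 with an index prefix stands in for the hash function H, and the names `BloomFilter`, `PHENOTYPE_IDS` and `matches` are hypothetical.

```python
import hashlib

# Assumption: a toy alphabet mapping each phenotype to a unique number.
PHENOTYPE_IDS = {"hypertension": 1, "diabetes": 2, "alzheimers": 3}


class BloomFilter:
    def __init__(self, size: int = 64, num_hashes: int = 3) -> None:
        self.size = size
        self.num_hashes = num_hashes
        self.bits = 0  # bit array packed into a Python int

    def _positions(self, item: int):
        # Derive num_hashes positions from SHA-256 (stand-in for H).
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: int) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item: int) -> bool:
        # False means definitely absent; True may be a false positive.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def matches(entity_filter: BloomFilter, query_filter: BloomFilter) -> bool:
    """True if every hash position set in the query is also set in the entity."""
    return (entity_filter.bits & query_filter.bits) == query_filter.bits
```

A filter holding the numbers for "hypertension" and "diabetes" matches a query filter holding only "hypertension". Because Bloom filters admit false positives, a match indicates only that the phenotype may be present.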
Further, the certification authority server generates a key pair (pk, sk), where pk is the public key and sk is the private key; each entry in the hash table is encrypted using the public key pk, and the encryption of an entry is denoted $E_{pk}(\theta)$;
the certification authority server denotes the hash map obtained by encrypting all entries of the hash map M as
Figure BDA0002211388830000043
and sends
Figure BDA0002211388830000044
to the cloud server;
the certification authority server transmits the key pair (pk, sk) to the agency terminal.
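The invention names the Paillier cryptosystem for the key pair (pk, sk). The following Python sketch shows textbook Paillier key generation, encryption, decryption and homomorphic addition. It is illustrative only: the two small primes are toy values that give no security, the function names `keygen`, `encrypt`, `decrypt` and `add` are hypothetical, and `pow(x, -1, n)` requires Python 3.8 or later.

```python
import math
import secrets


def keygen(p: int = 1789, q: int = 1861):
    """Return (pk, sk) for textbook Paillier. p and q are toy primes (insecure)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    g = n + 1                 # common simple choice of generator
    mu = pow(lam, -1, n)      # with g = n + 1, mu is the inverse of lam mod n
    return (n, g), (lam, mu)


def encrypt(pk, m: int) -> int:
    """Encrypt an integer m with 0 <= m < n."""
    n, g = pk
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1  # random r in [1, n - 1]
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n2) * pow(r, n, n2)) % n2


def decrypt(pk, sk, c: int) -> int:
    n, _ = pk
    lam, mu = sk
    x = pow(c, lam, n * n)
    return ((x - 1) // n * mu) % n  # L(x) = (x - 1) / n, then multiply by mu


def add(pk, c1: int, c2: int) -> int:
    """Homomorphic addition: the product of ciphertexts encrypts the sum."""
    n, _ = pk
    return (c1 * c2) % (n * n)
```

With these helpers, `decrypt(pk, sk, add(pk, encrypt(pk, 3), encrypt(pk, 4)))` recovers 7, which is the additive property that lets a server combine encrypted counts without holding sk.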
Further, the cloud server sends the query result $R_q$ to the agency terminal; the agency terminal decrypts the SNP values using the key sk, verifies the integrity of the data using the Hamming code check bits, and returns the query result that satisfies the query conditions:
$R_q = \{E_{pk}(count) \mid E_{pk}(B_1) \mid E_{pk}(B_2) \mid \ldots \mid E_{pk}(B_{num}) \mid I'_i\}$
the agency terminal decrypts each SNP value using the private key sk to obtain the decrypted SNP value
Figure BDA0002211388830000051
and, from
Figure BDA0002211388830000052
recalculates the Hamming code check bits as T';
the agency terminal compares T' with $T_q$ to determine whether the data is intact:
if $T' = T_q$, the agency terminal judges that the data is safe and trustworthy;
if $T' \neq T_q$ and $T' \in \{h_A, h_C, h_G, h_T\}$, the agency terminal uses the check bits $T_q$ to correct the returned
Figure BDA0002211388830000053
and updates the hash map;
if $T' \neq T_q$ and
Figure BDA0002211388830000054
the agency terminal judges that the original data
Figure BDA0002211388830000055
has been tampered with and changed to other content, in which case the returned count result is not trustworthy for the patient's SNP sequence.
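To illustrate the idea of recomputing check bits and correcting a returned value, the following Python sketch applies a standard Hamming(7,4) code to a single SNP (a pair of nucleotides). It is an assumption-laden example rather than the patented scheme: the 2-bit encoding of each base, the choice of Hamming(7,4), and the names `encode_snp` and `check_snp` are all hypothetical, and the patent's own check values $h_A, h_C, h_G, h_T$ are not modelled.

```python
# Assumption: each nucleotide is encoded in two bits.
BASE_BITS = {"A": (0, 0), "C": (0, 1), "G": (1, 0), "T": (1, 1)}
BITS_BASE = {bits: base for base, bits in BASE_BITS.items()}


def encode_snp(snp: str) -> list:
    """Encode a two-letter SNP such as "AG" as a 7-bit Hamming(7,4) codeword."""
    d1, d2, d3, d4 = BASE_BITS[snp[0]] + BASE_BITS[snp[1]]
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def check_snp(codeword: list):
    """Recompute the parity checks; return (snp, syndrome).

    syndrome == 0 means the codeword is consistent. Otherwise the syndrome
    is the 1-based position of a single flipped bit, which is corrected
    before decoding.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the bit the syndrome points at
    snp = BITS_BASE[(c[2], c[4])] + BITS_BASE[(c[5], c[6])]
    return snp, syndrome
```

A codeword for "AG" with one flipped bit is decoded back to "AG" with a non-zero syndrome. Hamming(7,4) corrects only single-bit errors, so heavier modification has to be handled separately, as in the tampering case described above.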
The invention also provides a secure count query and integrity verification method based on encrypted genome data, which comprises the following steps:
the certification authority server receives the patient records from the data owner terminal and encrypts them;
the data owner terminal processes the patients' genome data according to a specified format and sends it to the certification authority server in plaintext form;
after obtaining the data shared by different data owners, the certification authority server constructs a searchable hash map that aggregates the shared data in encrypted form and sends the hash map to the cloud server;
the cloud server acquires the encrypted data from the certification authority server, processes the encrypted query request sent by the agency terminal, executes the query, and sends the query result to the agency terminal in encrypted form;
the researcher terminal, in response to a user operation, sends the researcher's query request to the agency terminal;
after receiving the researcher's query request, the agency terminal encrypts the query request data and sends the encrypted query request to the cloud server; after obtaining the query result from the cloud server, it decrypts the query result using the private key and sends it to the researcher terminal.
The invention has the beneficial effects that:
(1) The invention provides a secure system query framework. The invention encrypts the gene sequence with the Paillier cryptosystem, stores the encrypted data in a third-party cloud service, and performs count queries without the cloud service knowing the secret key, which ensures data privacy. An agency is introduced to decrypt and verify the returned encrypted query result, which ensures data integrity.
(2) The invention provides a technique for processing the raw data. The invention adds check bits to the gene data using Hamming codes, and verifies the credibility of the count query result by checking the returned gene sequences.
(3) The model provided by the invention gives security guarantees for data privacy, query privacy and output privacy. The query initiator completes a secure interactive query through the agency and the cloud server.
(4) The invention provides experimental studies that demonstrate the feasibility of the proposed method. For a query over 5000 records and 40 SNP sites, the methods of Kantarcioglu et al. and Canim et al. took 30 min and 80 s, respectively, whereas the proposed method takes 3.9 s for the same query, a significant reduction in time.
(5) The method of the invention allows the outsourced protocol to match the stored encrypted data against the query in the ciphertext domain to complete the count query. To protect data privacy, the invention encrypts the gene data using homomorphic encryption. To verify the integrity of the query result, the invention adds check bits to the genome data using Hamming codes, so that the query result can be verified, with error correction, on the basis of the Hamming check bits; the method performs well in ensuring the security and integrity of genome data.
Drawings
FIG. 1 is a diagram illustrating an apparatus according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating test results according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating test results according to an embodiment of the present invention.
FIG. 4 is a diagram illustrating test results according to an embodiment of the present invention.
Detailed Description
The invention provides a method for securely sharing and computing on genome data stored on a cloud server that is not fully trusted. First, the method processes the biomedical data so that the privacy and integrity of the shared data are ensured. Second, the method stores the data in a hashMap structure, which improves encryption efficiency and enables efficient secure count queries. The method is evaluated on an existing single nucleotide polymorphism (SNP) sequence database, and the experiments show that count queries are completed more flexibly and that the method is easy to implement and highly secure in practical applications.
1. Design principles and advantageous effects of the invention
The objective of the invention is to design a secure count query framework for outsourced genome data, which determines, through a count query, the number of records in the database that match the query conditions. The proposed framework processes and stores data using a secure encryption mechanism, and performs a security evaluation of each queried stored record through a third-party agency terminal; the proposed system framework is shown in FIG. 1. The specific contributions of the invention are as follows:
(1) The invention provides a secure system query framework. The invention encrypts the gene sequence with the Paillier cryptosystem, stores the encrypted data in a third-party cloud service, and performs count queries without the cloud service knowing the secret key, which ensures data privacy. An agency terminal is introduced to decrypt and verify the returned encrypted query result, which ensures data integrity.
(2) The invention provides a technique for processing the raw data. The invention adds check bits to the gene data using Hamming codes, and verifies the credibility of the count query result by checking the returned gene sequences.
(3) The model provided by the invention gives security guarantees for data privacy, query privacy and output privacy. The query initiator completes a secure interactive query through the agency terminal and the cloud server.
(4) The invention provides experimental studies that demonstrate the feasibility of the proposed method. For a query over 5000 records and 40 SNP sites, the methods of Kantarcioglu et al. and Canim et al. took 30 min and 80 s, respectively, whereas the proposed method takes 3.9 s for the same query.
2. Related concepts of the invention
The human genome contains basic data on human biology and private diagnostic information, so querying personal genome data is highly sensitive. To protect data privacy, Ayday et al. proposed a privacy protection method and system based on homomorphic encryption, which uses a storage and processing unit to store sensitive data in encrypted form. Bruekers et al. proposed solutions for the semi-honest attacker model based on limited homomorphic encryption of DNA data, whose complexity depends to some extent on the number of errors to be tolerated. Eppstein et al. used a privacy-enhanced invertible Bloom filter (PIBF) and proposed a privacy-preserving comparison method that operates on compressed genomic data.
A privacy protection protocol for genome data may itself leak sensitive data. Addressing the privacy of the query protocol, Atallah et al. proposed the first privacy-preserving sequence comparison algorithm. However, because of its excessive computational requirements, Jha et al. introduced garbled circuits for sequence comparison and distance calculation using a secure two-party computation protocol. The main drawback of this solution is its inability to handle large-scale computations. Troncoso-Pastoriza et al. provided privacy for DNA matching in a semi-honest setting by proposing a secure matching protocol based on the oblivious evaluation of automata, but the processing time of the protocol is rather long. To reduce the communication time of the Troncoso-Pastoriza et al. protocol, Blanton and Aliasgari proposed a secure outsourcing protocol.
For the secure outsourcing of genome data for count queries, Kantarcioglu et al. originally proposed an encryption model involving two third parties, but this model does not provide query privacy. Perl et al. proposed a method for searching biomedical data using a Bloom filter combined with homomorphic encryption; they fully outsource the search task to a third-party cloud server. Canim et al. used tamper-resistant cryptographic hardware to provide a data storage server (DS) that enables secure storage, sharing and querying of genomic data with a single third party; the size of a query is limited by the memory size of the tamper-resistant hardware. Xie et al. proposed a meta-analysis approach for secure genetic association studies that keeps the data at the site of the corresponding data owner. Ignatenko and Petkovic proposed a solution for searching and matching DNA sequences in private DNA databases, which represents DNA sequences as context trees whose index information is used for privacy-preserving matching and similarity search. The context tree is constructed using a general compression technique known as context tree weighting (CTW).
The invention provides a secure count query method for genome data that protects privacy and integrity. The method allows the outsourced protocol to match the stored encrypted data against the query in the ciphertext domain to complete the count query. To protect data privacy, the invention encrypts the gene data using homomorphic encryption. To verify the integrity of the query result, the invention adds check bits to the genome data using Hamming codes, so that the query result can be verified, with error correction, on the basis of the Hamming check bits; the method performs well in ensuring the security and integrity of genome data.
3. Model of the device of the invention
3.1 System design
This section describes the security framework proposed by the invention. As shown in FIG. 1, the framework contains five major participants: a data owner, a certification authority server, a cloud server, an agency terminal and a researcher terminal. Each entity is responsible for specific tasks, which together ensure the security and proper functioning of the whole system. The information flow in the framework has two phases: a data integration phase and a query processing phase. In the data integration phase, the certification authority server receives the patient records from the data owner and encrypts them. In the query processing phase, the agency terminal encrypts the researcher's query conditions and sends them to the cloud server, which executes the query.
The data owner comprises multiple institutions that own genomic data, such as hospitals, academic research institutions or government research institutions. These institutions process the patients' genome data in a prescribed format and send it in plaintext to the certification authority server.
The certification authority server, as a third-party authority, is crucial to the security of the framework. Its main tasks are:
1) Processing the data. After the certification authority server obtains the data shared by different data owners, it constructs a searchable hashMap that aggregates the shared data in encrypted form and sends it to the cloud server. User query operations are essentially performed on the encrypted index tree. The index tree contains all records from the shared data, and when records are added or deleted, the certification authority server can update the tree accordingly.
2) Generating keys. All sensitive data stored in the hash table is encrypted with a key, and the authority must manage the public key used for data encryption and the private key used for decryption.
The cloud service acquires the encrypted data from the certification authority server and is responsible for processing the encrypted query request sent by the agency terminal. The cloud server executes the query and sends the encryption result to the agency terminal.
The agency terminal handles all communication with the researchers. After receiving a researcher's query request, the agency terminal encrypts the query data and sends the encrypted query request to the cloud server; it then decrypts the query result and sends the decrypted result to the researcher terminal.
The researcher represents any individual or organization interested in performing queries on the shared data stored on the cloud server. The researcher sends a query request to the agency terminal for encryption; after receiving the encrypted query result, the agency terminal decrypts it using the private key and sends it to the researcher terminal, where the researcher obtains the final output.
3.2 Attack model and security goals
The invention aims to ensure that the cloud server learns nothing about the shared genome data, and that neither the certification authority server nor the cloud server learns the queries performed by a researcher terminal. The invention assumes that the certification authority server is a trusted entity. The certification authority server performs authentication in the manner of an NIH Data Access Committee (DAC): it is responsible for generating and encrypting the hash table, and it verifies the identity of individuals and organizations applying for access to the data. Within the framework, the cloud server is assumed to be semi-honest: it follows the protocol correctly and does not intend to produce erroneous results maliciously. Unlike the certification authority server, the cloud server stores a large amount of sensitive data and is where query processing is executed; if the cloud server is compromised, an attacker can obtain a large amount of data, forge query results, or provide incomplete query results to the user. The harm caused by a compromised cloud server therefore greatly outweighs the risk posed by a compromised certification authority server, and the invention focuses on the case in which the cloud server is attacked.
The security goals of the invention are mainly the following two:
(1) Data privacy and query privacy: data privacy means that the cloud server cannot learn the plaintext data stored by the data owner, which ensures that an attacker cannot understand the data stored on the cloud server; query privacy means that the cloud server cannot learn the actual value of a researcher terminal's query request, which ensures that an attacker cannot understand or infer the query requests received by the cloud server.
(2) Data integrity: the goal is to verify the correctness of the query results returned to the researcher terminal. If the query result returned to the agency terminal by the cloud server has been deleted or forged by an attacker, the agency terminal can detect this, and can analyze and correct the query result based on the protocol provided by the invention.
4. Data processing method adopted in the invention
The human genome is the complete set of genetic information of a human being. It is located in the chromosomes, each of which contains genes responsible for controlling various functions of the human body. Genome sequence data is composed of the four nucleotide bases {A, C, G, T}, and differences in how the bases are arranged on the DNA strands make each individual unique. Most of the DNA sequence is conserved across the population, with approximately 0.5% of each person's DNA differing from the reference genome due to genetic variation. Several types of variation exist in the genome, such as single nucleotide polymorphisms (SNPs), copy number variations (CNVs) and rearrangements. A single nucleotide polymorphism is a DNA sequence polymorphism caused by the variation of a single nucleotide on an allele, and is the most common form of DNA variation. Most SNPs have no effect on human health, but some SNPs directly cause certain diseases. Analysis of SNP sequences is common in genomic data research. The dbGaP database contains the genomic sequences of patients, i.e., the site information associated with the SNPs, together with the diagnoses and corresponding clinical information. The invention assumes that a sequence S is composed of multiple SNPs, denoted $S = \{\alpha_1, \alpha_2, \ldots, \alpha_\gamma\}$, where $\alpha_i$, the SNP at position i, is represented by a pair of nucleotides.
Table 1 Data sequence of SNPs
4.1 secure count query
In genomic data research, the most common task is to determine how many samples in the database satisfy certain characteristics. The count query is an important step in genetic association studies and helps the researcher terminal determine which genes affect specific human diseases. For example, the researcher terminal may ask whether there is an association between various SNPs of the DSP1 gene and a diagnosis of Alzheimer's disease in an individual. Such SNP-disease association studies are becoming increasingly common in human genomics research. A biological researcher needs to obtain the frequency with which a set of SNP values for a specific gene (e.g., SNP_1 = AG ∩ SNP_2 = CT ∩ Diagnosis = Alzheimer's Disease) appears in the database, which is then normalized by the total number of records. Formally, consider a database D and a query Q, where D = {S_1, S_2, …, S_n} denotes the SNP sequences of n patients. A count query is defined as finding the number of patients in D that satisfy all of the query conditions q of Q. Assuming that the gene sequence of the m SNP sites of a patient is represented as S = {d_1, d_2, …, d_m}, where d_i (1 ≤ i ≤ m) is the SNP value at the i-th site of the patient, the total number of count queries can be expressed as:
Figure BDA0002211388830000131
A count query is a simple operation if the data is stored in plaintext, and conventional database management systems support it. However, such systems cannot query encrypted data. To answer count queries over encrypted values without decrypting the data, the present invention proposes, in section 5, a secure data processing method based on a hashMap, which counts the records that satisfy user-specified conditions without decrypting the data stored on the cloud server.
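As an illustration of the plaintext case only, the count query defined above can be sketched in Python as follows. The toy database, the field names `snps` and `diagnosis`, and the function `count_query` are hypothetical and are not taken from the invention; the sketch shows what is being counted, not how the encrypted query is executed.

```python
from typing import Dict, List

# Hypothetical toy database: one record per patient, holding the SNP base
# pairs (indexed by site) and a diagnosis label.
DATABASE: List[dict] = [
    {"snps": ["AG", "CT", "TT"], "diagnosis": "Alzheimer's Disease"},
    {"snps": ["AG", "CC", "TT"], "diagnosis": "Alzheimer's Disease"},
    {"snps": ["AG", "CT", "CC"], "diagnosis": "None"},
    {"snps": ["AA", "CT", "TT"], "diagnosis": "Alzheimer's Disease"},
]


def count_query(database: List[dict], snp_conditions: Dict[int, str],
                diagnosis: str) -> int:
    """Count records matching every (1-based site -> base pair) condition
    and the requested diagnosis."""
    total = 0
    for record in database:
        snps_ok = all(record["snps"][site - 1] == value
                      for site, value in snp_conditions.items())
        if snps_ok and record["diagnosis"] == diagnosis:
            total += 1
    return total


# SNP_1 = AG and SNP_2 = CT and Diagnosis = Alzheimer's Disease
matches = count_query(DATABASE, {1: "AG", 2: "CT"}, "Alzheimer's Disease")
frequency = matches / len(DATABASE)  # count normalized by the number of records
print(matches, frequency)
```

The same conjunction of conditions is what the encrypted protocol of section 5 has to evaluate without exposing the SNP values to the cloud server.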
4.2 data encryption
To protect patient privacy, the certificate authority server needs to encrypt the data in the database to prevent leakage of sensitive data. In this section, the invention will provide an encryption scheme that is relevant for the framework of the invention.
4.2.1 Hamming code
A Hamming code is a linear error-correcting code that inserts check bits into the transmitted message stream to detect and correct bit errors that may occur during data storage and transmission. Hamming codes use the concept of parity bits, adding check bits at specific positions to the data bits in order to verify the validity of the data. The invention uses Hamming codes to process gene data, so that it can not only verify whether the data is valid but also locate and correct the error when the data is wrong.
A Hamming code is constructed by inserting k binary check bits into the original data, turning the original n-bit data into a z-bit code. The code must satisfy the inequality 2^k − 1 ≥ z, where z = n + k. In the resulting z-bit code, the check bits are placed at positions 2^(j−1) (j ≥ 1, 2^(j−1) < z), and the original data bits are placed in order in the remaining positions. The detailed coding rules of the Hamming code are as follows:
1) Fill positions 2^(j−1) of the new code (the check-bit positions) with 0, then fill the remaining positions of the new code in order with the source code.
2) The check bits are encoded as follows: the j-th check bit starts at position 2^(j−1) of the new code, takes the XOR of 2^(j−1) bits, skips 2^(j−1) bits, takes the XOR of the next group of 2^(j−1) bits, and so on; the result is written to position 2^(j−1).
For example, for the nucleotide cytosine C, the ASCII code is (C)_2 = 1000011 and n = 7. According to 2^k − 1 ≥ z with z = n + k, the minimum value of k is 4, so 4 check bits are added and z = 11.
First, the check-bit positions 1, 2, 4 and 8 (Hamming code positions start at 1) are filled with 0 and the remaining positions are filled with the source code, giving 00100000011.
The 1st check bit is located at position 1 (2^(1−1)) of the new code. The XOR of bits 1, 3, 5, 7, 9 and 11 is 0 ⊕ 1 ⊕ 0 ⊕ 0 ⊕ 0 ⊕ 1 = 0, so position 1 is filled with 0, giving 00100000011.
The 2nd check bit is located at position 2 (2^(2−1)). The XOR of bits 2, 3, 6, 7, 10 and 11 is 0 ⊕ 1 ⊕ 0 ⊕ 0 ⊕ 1 ⊕ 1 = 1, so position 2 is filled with 1, giving 01100000011.
The 3rd check bit is located at position 4 (2^(3−1)). The XOR of bits 4, 5, 6 and 7 is 0 ⊕ 0 ⊕ 0 ⊕ 0 = 0, so position 4 is filled with 0, giving 01100000011.
The 4th check bit is located at position 8 (2^(4−1)). The XOR of bits 8 to 11 is 0 ⊕ 0 ⊕ 1 ⊕ 1 = 0, so position 8 is filled with 0, giving 01100000011. The final binary Hamming code for C is 01100000011.
The present invention uses Hamming codes to process the raw data and uses their error-correction function to check count query results. For the four nucleotides A, C, G and T, the binary check bits of their ASCII codes are calculated as {h_A, h_C, h_G, h_T}, where h_A = 0001, h_C = 0100, h_G = 1101 and h_T = 0011. A nucleotide pair d_i is denoted B_i after the detection bits t are added, where t is a combination of any two elements of {h_A, h_C, h_G, h_T}, giving 2^4 = 16 possible combinations. For example, the binary representation of d_i = CC is 10000111000011; after the detection bits 0100 and 0100 are inserted at positions 1, 2, 4 and 8 of each base, it becomes 0110000001101100000011.
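The coding rules and the check-bit values quoted above can be reproduced with a short Python sketch. The function names `hamming_encode`, `check_bits` and `encode_pair` are hypothetical, and the sketch assumes, as the worked example suggests, that each base is taken as its 7-bit ASCII code and that even parity (XOR) is used.

```python
def hamming_encode(data_bits: str) -> str:
    """Insert XOR check bits at positions 1, 2, 4, 8, ... (1-based)."""
    n = len(data_bits)
    k = 0
    while 2 ** k - 1 < n + k:        # smallest k with 2^k - 1 >= n + k
        k += 1
    z = n + k
    code = [0] * (z + 1)             # index 0 unused, positions are 1-based
    data = iter(data_bits)
    for pos in range(1, z + 1):
        if pos & (pos - 1):          # not a power of two: a data position
            code[pos] = int(next(data))
    for j in range(k):
        p = 1 << j                   # check-bit position 2^j
        parity = 0
        for pos in range(1, z + 1):
            if pos & p and pos != p:  # every position this check bit covers
                parity ^= code[pos]
        code[p] = parity
    return "".join(str(bit) for bit in code[1:])


def check_bits(base: str) -> str:
    """Check bits (positions 1, 2, 4, 8) for the 7-bit ASCII code of a base."""
    code = hamming_encode(format(ord(base), "07b"))
    return code[0] + code[1] + code[3] + code[7]


def encode_pair(pair: str) -> str:
    """Hamming-encode each base of a nucleotide pair and concatenate."""
    return "".join(hamming_encode(format(ord(b), "07b")) for b in pair)


for base in "ACGT":
    print(base, check_bits(base))
print(encode_pair("CC"))
```

Run on the four bases, this sketch yields the check bits h_A, h_C, h_G and h_T listed in the text, and the encoding of CC agrees with the 22-bit string given above.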
4.2.2 Paillier cryptosystem
To keep the architecture simple and flexible, the present invention uses the Paillier cryptosystem. This encryption algorithm is semantically secure, which guarantees that a computationally bounded adversary holding the ciphertext cannot obtain information about the plaintext. The Paillier cryptosystem is a probabilistic asymmetric public-key encryption algorithm that produces different ciphertexts when the same message is encrypted multiple times. A pair of keys is generated, a private key sk and a public key pk; the public key is used to encrypt data and the private key to decrypt it. The Paillier cryptosystem can be defined as follows (following its Wikipedia description).
Key generation stage: two large prime numbers p and q are selected randomly and independently of each other such that gcd(pq, (p−1)(q−1)) = 1. Compute η = pq and λ = lcm(p−1, q−1), and choose a random integer g such that g ∈ Z*_{η^2}.
Ensure that η divides the order of g by checking that the following modular multiplicative inverse exists: μ = (L(g^λ mod η^2))^(−1) mod η, where the function L is defined as L(u) = (u − 1)/η.
This gives the public key (η, g) and the private key (λ, μ).
Encryption stage: for a message ω (0 < ω < η) to be encrypted, select a random number γ (0 < γ < η) and compute the ciphertext c = g^ω · γ^η mod η^2.
Decryption stage: for a ciphertext c ∈ Z*_{η^2} to be decrypted, the plaintext is ω = L(c^λ mod η^2) · μ mod η.
Homomorphic properties: given two plaintexts ω_1 and ω_2, the product of their ciphertexts decrypts to the sum of the plaintexts: D_sk(E_pk(ω_1) · E_pk(ω_2) mod η^2) = (ω_1 + ω_2) mod η.
Raising the ciphertext of one plaintext to the power of another plaintext decrypts to the product of the two plaintexts: D_sk(E_pk(ω_1)^(ω_2) mod η^2) = ω_1 · ω_2 mod η.
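The key generation, encryption, decryption and homomorphic properties described above can be checked with a toy Python sketch. It is illustrative only: the primes are tiny (the experiments in section 6 use 1024-bit Paillier), the choice g = η + 1 is a common simplification assumed here rather than the random g of the text, and the function names are hypothetical. It requires Python 3.8 or later for the modular inverse via `pow`.

```python
import math
import random


def L(u: int, eta: int) -> int:
    """L(u) = (u - 1) / eta, an exact integer division in this scheme."""
    return (u - 1) // eta


def keygen(p: int, q: int):
    eta = p * q
    assert math.gcd(eta, (p - 1) * (q - 1)) == 1
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)   # lcm(p-1, q-1)
    g = eta + 1                                          # assumed choice of g
    mu = pow(L(pow(g, lam, eta * eta), eta), -1, eta)    # modular inverse
    return (eta, g), (lam, mu)


def encrypt(pk, omega: int) -> int:
    eta, g = pk
    while True:                                          # pick gamma coprime to eta
        gamma = random.randrange(1, eta)
        if math.gcd(gamma, eta) == 1:
            break
    eta2 = eta * eta
    return pow(g, omega, eta2) * pow(gamma, eta, eta2) % eta2


def decrypt(pk, sk, c: int) -> int:
    eta, _ = pk
    lam, mu = sk
    return L(pow(c, lam, eta * eta), eta) * mu % eta


pk, sk = keygen(61, 53)          # toy primes, far too small for real use
eta2 = pk[0] ** 2
c1, c2 = encrypt(pk, 12), encrypt(pk, 30)
total = decrypt(pk, sk, c1 * c2 % eta2)        # additive property: 12 + 30
scaled = decrypt(pk, sk, pow(c1, 5, eta2))     # scalar property: 12 * 5
print(total, scaled)
```

Because γ is drawn at random, encrypting the same plaintext twice generally gives different ciphertexts, which is the probabilistic behavior the text relies on.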
Assume there are n patients in the database and that Data = (S_1, S_2, …, S_n) denotes the gene sequence dataset in the database. The SNP sequence of the j-th patient can be represented as
Figure BDA0002211388830000166
where 1 ≤ j ≤ n. During encryption, each genome data item
Figure BDA0002211388830000167
with the detection bits added by the Hamming code is represented as
Figure BDA0002211388830000168
The encryption of
Figure BDA0002211388830000169
can be expressed as
Figure BDA00022113888300001610
where "|" indicates that the detection bits are appended to the base pair
Figure BDA00022113888300001612
The ciphertext of each encrypted data item is represented as:
Figure BDA00022113888300001613
5 The following explains the system design of the present invention
This section introduces the data processing model used by the present invention. First, the certification authority server processes the patient gene sequences and clinical information in the database. The clinical information of a patient mainly consists of the diagnoses of diseases for certain phenotypes. Suppose that the set of disease types is I = {I_1, I_2, …, I_g}, where 1 ≤ g ≤ n and each value in the set represents one disease. Each patient may have one or more diseases. The following subsections discuss how the patients' genotype and phenotype data are integrated into the hashMap. The genotype sequences and phenotypes of 5 patients are shown in Table 1.
5.1 Generation of hashMap
After receiving the data provided by a data owner, the certification authority server creates an Entity of the hashMap for the SNP sequence of each patient and, after processing all the data, generates a mapping table M that contains the genotype and phenotype information of each patient. After M has been created, the certification authority server creates or updates the Entity information in M for each new record from the data owner. Each Entity contains the following:
key: the key value of the Entity, an index value that points to the specific data.
geno: the geno part of the value of the Entity, i.e. the SNP sequence of each patient. For example, the SNP sequence of one patient is denoted S = {d_1, d_2, …, d_m}, and sid is the unique identifier of an SNP site, representing a specific position in the gene sequence; each base pair d_i (1 ≤ i ≤ m) has its own site identifier.
I: the phenotype data set corresponding to the genome sequence.
count: the frequency of occurrence of the current SNP sequence in the database, i.e. the total number of records matching the genotype and phenotype.
next: the key value of the next Entity.
Figure BDA0002211388830000181
Table 2 Hash Map for Table1
The complexity of creating the hash table is O(mn), where m is the number of records in the database and n is the number of different SNP sites in each sequence record. The invention defines each Entity as θ, with its components denoted θ(key, geno, I, count, next). Table 2 is obtained from the data of Table 1 according to the characteristics of the hash table.
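A minimal Python sketch of the Entity structure θ(key, geno, I, count, next) and of the construction of M is given below. The class layout, the function `build_hashmap`, the sequential integer keys and the sample records are assumptions made for illustration (the contents of Table 1 are not reproduced here); the sketch also assumes that records with an identical SNP sequence share one Entity whose count is incremented.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class Entity:
    key: int                                    # index value of this Entity
    geno: Tuple[str, ...]                       # SNP base pairs, one per site
    I: Set[str] = field(default_factory=set)    # phenotype data set
    count: int = 0                              # records with this SNP sequence
    next: Optional[int] = None                  # key of the next Entity


def build_hashmap(records: List[Tuple[List[str], List[str]]]) -> Dict[int, Entity]:
    """Build M from (SNP sequence, phenotypes) records in one pass."""
    index: Dict[Tuple[str, ...], int] = {}      # geno -> key of its Entity
    M: Dict[int, Entity] = {}
    for snps, phenotypes in records:
        geno = tuple(snps)
        if geno not in index:                   # first time this sequence is seen
            key = len(M) + 1
            index[geno] = key
            M[key] = Entity(key=key, geno=geno)
            if key > 1:
                M[key - 1].next = key           # chain the previous Entity
        entity = M[index[geno]]
        entity.I.update(phenotypes)
        entity.count += 1
    return M


# Hypothetical records: (SNP base pairs, diagnosed phenotypes)
records = [
    (["AG", "CC"], ["I1"]),
    (["AG", "CC"], ["I2"]),
    (["TT", "CC"], ["I1"]),
]
M = build_hashmap(records)
for entity in M.values():
    print(entity)
```

Each record is visited once and each of its sites is read once, which is consistent with the O(mn) construction cost stated above.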
5.2 encryption hashMap
5.2.1 encryption of bloom filters
Each Entity in M contains a Bloom filter value I, which represents the phenotype information corresponding to the genomic sequence. The certification authority server processes the phenotype data using the Bloom filter technique. For each SNP record in the genomic sequence, the corresponding phenotype is inserted into the Bloom filter. If multiple phenotypes are associated with the same genomic sequence, all of them are inserted into the Bloom filter of that genomic sequence entry.
The certification authority server sets the hash function H used in the Bloom filter and the domain of the public alphabet Σ, which is the set of all possible phenotypes. Each phenotype in Σ is mapped to a unique number, and for each entry in the genomic sequence it is this number that is inserted into the Bloom filter.
The Bloom filters of all Entities in M have the same length. To encrypt the Bloom filters, the present invention uses AES in CTR mode with a key size of 128 bits. The certification authority server selects a random key k for a pseudorandom function (PRF) F and encrypts each Bloom filter as
Figure BDA0002211388830000191
This encryption is performed at the same time as the other data of the Entity is encrypted. Besides sending the private key sk to the agency terminal, the certification authority server also provides the key k of the PRF F_k and the hash function used for the Bloom filter. To match the phenotype of a query during a search, the hash positions of the phenotype value I_q of the query request are matched against the hash positions of I_i of each Entity. If every position set to 1 in I_q is also set to 1 in I_i, the phenotype of the query matches the phenotype stored in that Entity.
Since the Bloom filter of each Entity is encrypted, it must first be decrypted before any of its positions can be checked. The certification authority server provides the PRF key k to the agency terminal for this purpose, and the encrypted Bloom filter I'_i is decrypted using secure function evaluation (SFE) in a circuit.
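The plaintext matching rule for the Bloom filters (every position set in I_q must also be set in I_i) can be sketched in Python as follows. The filter length, the number of hash positions and the use of SHA-256 as the hash function H are assumptions chosen for illustration, and the AES-CTR encryption and SFE-based decryption are deliberately left out.

```python
import hashlib
from typing import Iterable, List

FILTER_BITS = 64      # assumed filter length, identical for every Entity
NUM_HASHES = 3        # assumed number of hash positions per phenotype


def positions(phenotype_id: int) -> List[int]:
    """Hash positions of one phenotype number (SHA-256 stands in for H)."""
    out = []
    for seed in range(NUM_HASHES):
        digest = hashlib.sha256(f"{seed}:{phenotype_id}".encode()).digest()
        out.append(int.from_bytes(digest[:8], "big") % FILTER_BITS)
    return out


def make_filter(phenotype_ids: Iterable[int]) -> int:
    """Bloom filter as an integer bit mask with the hashed positions set."""
    bits = 0
    for pid in phenotype_ids:
        for pos in positions(pid):
            bits |= 1 << pos
    return bits


def matches(query_filter: int, entity_filter: int) -> bool:
    """True if every position set in I_q is also set in I_i."""
    return (query_filter & entity_filter) == query_filter


I_i = make_filter([1, 3])     # Entity associated with phenotypes 1 and 3
I_q = make_filter([1])        # query for phenotype 1
print(matches(I_q, I_i))
```

As with any Bloom filter, a phenotype that was never inserted can occasionally appear to match (a false positive), while an inserted phenotype always matches.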
5.2.2 Paillier Cryptosystem encryption
The certification authority server encrypts all Entities in M using the Paillier cryptosystem. It generates a key pair (pk, sk) for the homomorphic encryption scheme using the key generation algorithm, where pk is the public key and sk is the private key, and encrypts each entry in the hash table using the public key pk. To keep the search process fast while maintaining the security of the system, the present invention encrypts only the sensitive attributes of each Entity. Because the cloud server is semi-honest in the system of the invention and must let the querier follow the next key value of each Entity, the next field is not encrypted. The encryption of an Entity is denoted E_pk(θ); after encryption, each entry has the form θ(key, E_pk(geno), E_pk(count), I'_i, next). The hashMap in which all entries of M have been encrypted is represented as
Figure BDA0002211388830000201
Finally, the certificate authority server sends
Figure BDA0002211388830000202
to the cloud server. Meanwhile, the certificate authority server transmits the key pair (pk, sk) to the agency terminal.
5.3 query encryption hashMap
The cloud server receives
Figure BDA0002211388830000203
from the certificate authority server and executes the encrypted query requests sent by the agency terminal. The main idea of the query is to find the entries whose phenotype matches the query phenotype I_q and whose gene sequence stored in geno matches at the queried sites. The cloud server needs to know the sid values in the query request of the researcher terminal so that it can pair the SNP at each such site of the geno stored in an Entity with the SNP at the same site in the query conditions. Since the SNP values are all encrypted and the encryption scheme used by the present invention is probabilistic, the cloud server cannot itself determine whether the values match. Instead, the cloud server sends the encrypted values to the agency terminal, which decrypts the encrypted SNPs and checks them for equality.
Suppose that a researcher terminal sends a query request Q_q = (q_1 ∩ q_2 ∩ … ∩ q_num ∩ I_q) with num query conditions, where q_l (1 ≤ l ≤ num) is the specific gene value of the SNP at a certain site. For example, if the conditions in query Q_q are SNP_2 = CC, SNP_3 = TT, SNP_5 = AG and I_q = I_1, then q_1 represents the condition SNP_2 = CC. The agency terminal adds a detection code to the base pair of each query condition q, giving B_q = (T_q | q). The query is encrypted with the public key pk, and the query for each site can be represented as θ_q(sid, E_pk(B_q)). The complete encrypted query request is the conjunction E_pk(θ_q) = E_pk(θ_1) ∩ E_pk(θ_2) ∩ … ∩ E_pk(θ_num) ∩ I'_q.
For example, to run the query Q_q described above on M in Table 2, the SNP values at sites sid = 2, sid = 3 and sid = 5 of geno in each Entity are traversed and sent to the agency terminal for decryption and verification, which identifies the cases in the database that meet the query conditions. As shown in Table 2, only case #1 satisfies the query conditions, so the count returned to the researcher terminal is 1.
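The division of work between the cloud server and the agency terminal in this query can be sketched in Python as follows. This is a sketch under strong assumptions: `enc` and `agency_decrypt` are stand-ins that only mimic a probabilistic cipher and provide no security, the phenotype check is done on a plain set rather than an encrypted Bloom filter, and the sample Entities are hypothetical rather than the contents of Table 2.

```python
import secrets
from typing import Dict, List


def enc(value):
    """Stand-in for E_pk: a random tag makes equal plaintexts look different.
    NOT real encryption."""
    return (secrets.token_hex(4), value)


def agency_decrypt(ciphertext):
    """Stand-in for decryption with sk at the agency terminal."""
    return ciphertext[1]


def agency_matches(encrypted_values: List[tuple], expected: List[str]) -> bool:
    """Agency terminal: decrypt the SNP values and compare with the query."""
    return [agency_decrypt(c) for c in encrypted_values] == expected


def run_count_query(entities: List[dict], conditions: Dict[int, str],
                    phenotype: str) -> int:
    sids = sorted(conditions)
    expected = [conditions[s] for s in sids]
    total = 0
    for entity in entities:                       # cloud traverses every Entity
        if phenotype not in entity["I"]:          # phenotype must match I_q
            continue
        candidates = [entity["geno"][s] for s in sids]   # still encrypted
        if agency_matches(candidates, expected):         # checked by the agency
            total += agency_decrypt(entity["count"])
    return total


def make_entity(base_pairs: List[str], phenotypes: set, count: int) -> dict:
    geno = {sid: enc(bp) for sid, bp in enumerate(base_pairs, start=1)}
    return {"geno": geno, "I": phenotypes, "count": enc(count)}


entities = [
    make_entity(["AG", "CC", "TT", "CT", "AG"], {"I1"}, 1),
    make_entity(["AG", "CC", "TT", "CT", "AA"], {"I1"}, 1),
    make_entity(["AG", "CC", "TT", "CT", "AG"], {"I2"}, 1),
]
# SNP_2 = CC, SNP_3 = TT, SNP_5 = AG, I_q = I1
result = run_count_query(entities, {2: "CC", 3: "TT", 5: "AG"}, "I1")
print(result)
```

In this toy data only the first Entity satisfies all three SNP conditions together with the phenotype, so a single case is counted.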
5.4 data integrity analysis
The cloud server sends the query result R_q to the agency terminal, and the agency terminal decrypts the SNP values using the private key sk. To protect the information exchanged between the agency terminal and the cloud server, the invention uses the Hamming code detection bits to check the integrity of the data. For a returned query result that meets the query conditions:
R_q = {E_pk(count) | E_pk(B_1) | E_pk(B_2) | … | E_pk(B_num) | I'_i}
the agency terminal decrypts each SNP value using the private key sk, obtaining the decrypted SNP value
Figure BDA0002211388830000211
For
Figure BDA0002211388830000212
the invention recalculates the Hamming code detection bits as T' and compares T' with T_q to determine whether the data is intact.
1) T' = T_q: the data is intact, safe and credible.
2) T' ≠ T_q and T' ∈ {h_A, h_C, h_G, h_T}: the returned geno value has been tampered with but is still a combination of the four nucleotides A, C, G, T. The returned
Figure BDA0002211388830000213
can be corrected using the detection bits T_q, and the hashMap can be updated. The returned count value is no longer trusted.
3) T' ≠ T_q and
Figure BDA0002211388830000214
In this case the raw data
Figure BDA0002211388830000215
has been tampered with and changed to other content, and the returned count for the patient's SNP sequence is not trustworthy.
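The three cases above can be sketched as a small Python classifier. The sketch assumes that T' is recomputed base by base from the decrypted pair (four check bits per base over the 7-bit ASCII code), that case 2 means every 4-bit group of T' belongs to {h_A, h_C, h_G, h_T}, and that case 3 covers everything else; the function names and the returned labels are hypothetical.

```python
DATA_POS = [3, 5, 6, 7, 9, 10, 11]   # data positions in the 11-bit code


def check_bits(ch: str) -> str:
    """Hamming check bits (positions 1, 2, 4, 8) of a 7-bit ASCII symbol."""
    bits = format(ord(ch) & 0x7F, "07b")
    code = dict(zip(DATA_POS, map(int, bits)))
    out = []
    for p in (1, 2, 4, 8):
        parity = 0
        for pos, bit in code.items():
            if pos & p:               # this data position is covered by p
                parity ^= bit
        out.append(str(parity))
    return "".join(out)


VALID = {check_bits(base) for base in "ACGT"}    # {h_A, h_C, h_G, h_T}


def verify(decrypted_pair: str, t_q: str) -> str:
    """Classify a decrypted base pair against the query's detection bits T_q."""
    t_prime = "".join(check_bits(ch) for ch in decrypted_pair)
    if t_prime == t_q:
        return "intact"                           # case 1
    groups = [t_prime[i:i + 4] for i in range(0, len(t_prime), 4)]
    if all(g in VALID for g in groups):
        return "tampered-correctable"             # case 2
    return "tampered-untrusted"                   # case 3


t_q = check_bits("C") + check_bits("C")           # detection bits for CC
print(verify("CC", t_q), verify("CT", t_q), verify("C#", t_q))
```

A non-nucleotide symbol could in principle share its check bits with one of the four bases, so this sketch illustrates the decision logic rather than a complete tamper detector.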
For the security analysis, the present invention assumes that only the disclosure of the geno sequences in the database to the participants leads to serious security problems. The security of the proposed privacy protocol and system is evaluated by considering what the participants can infer at the different stages of the system. The leakage profiles of the different participants in the proposed model are given below.
Data leakage in the hashMap construction and encryption phase: the certification authority server is a trusted entity responsible for generating and encrypting M, so no data is leaked at this stage.
Leakage to the cloud server in each query: during query execution the cloud server never handles plaintext information, so no data is disclosed to the cloud server in the query phase.
Leakage to the agency terminal and the researcher terminal in each query: the main role of the agency terminal is to encrypt the query request of the researcher terminal with the key and send it to the cloud server, then to receive the query result returned by the cloud server, decrypt it and return it to the researcher terminal. The SNP query results returned to the agency terminal are verified using the Hamming code technique, and the researcher terminal cannot directly deduce any information from the Hamming code detection bits. Privacy leakage through the output itself is not considered here. Therefore, the system is safe and reliable.
6 The following describes the experimental analysis data of the present invention
Since the beginning of the Human Genome Project, researchers have reported a large number of SNPs. The availability of good SNP sites as candidate genes makes genome-wide association studies of candidate genes possible. Linkage disequilibrium (LD) techniques have been widely used to develop high-quality SNP marker maps. When applied to disease-gene mapping, LD is assessed by association analysis, which compares allele or haplotype frequencies between affected individuals (e.g., diagnosed with Alzheimer's disease) and control individuals (e.g., not diagnosed with Alzheimer's disease). Toivonen et al. proposed an LD mapping data mining method called Haplotype Pattern Mining (HPM), which is used for generating a simulated SNP data set. The practicality of the proposed data privacy protection framework is evaluated in test experiments on a simulated SNP data set. The system is implemented in the Java programming language. The experimental hardware is an Intel i7-4710MQ CPU (4 cores, 1.6 GHz) with 8 GB of memory; the software environment is the Windows 10 operating system and the IntelliJ IDEA platform. In the experiments, two different machines run the cloud server and the certification authority server. Hamming codes are used to add detection bits to the gene data, and 1024-bit Paillier encryption is used.
In the secure count query experiments, data sets with 500, 1000 and 5000 SNP records were used, together with four different query sizes involving 10, 20, 30 and 40 randomly selected SNP sequences. The influence of the number of records on the hashMap construction time was tested. The test results are compared with the index tree method used by Mohammad Zahidul Hasan et al. and the method based on cryptographic hardware (SCPs) used by M. Experimental results show that the proposed method performs better in terms of encryption time and query execution time. The experiments are analyzed from three aspects:
(1) HashMap generation time. This is the time required to process the genomic database and construct the hashMap from the genotypes and phenotypes. The creation time for data sets containing 500 and 1000 SNPs is shown in FIG. 2. The experiments show that the time consumed increases with the number of SNPs. The method of the invention takes longer to create its structure than the index tree used by Mohammad Zahidul Hasan et al., but the creation time, measured in seconds, is a small proportion of the total experimental running time.
(2) Encryption time. The encryption time in the method of the invention is mainly spent encrypting the base pairs and the count value of each Entity; the CTR scheme used to encrypt the Bloom filters takes little time and can be ignored. The encryption time also depends on the total number of sequences in the SNP data set. FIG. 3 shows the encryption times for data sets with 500 and 1000 SNPs. Encryption time grows as the number of SNPs increases, and the method of the invention has a clear advantage in encryption time over the index tree. For 1000 records, for example, the invention requires only 2.11 min.
(3) Query time. The query time is the time from when the researcher terminal submits the query request until the result is returned. To measure it, queries involving 10, 20, 30 and 40 randomly selected SNP sequences were executed on 5000 records. The execution times of these query sizes on the encrypted hashMap are shown in FIG. 4. Because the phenotype information of the entries in the hashMap must be searched and matched, all entries in the hashMap are traversed to find the gene sequences that meet the query conditions. FIG. 2 compares the query execution time with that of the other methods; the execution time of the present invention increases linearly with the query size and is measured in seconds.
Figure BDA0002211388830000241
Table 3 Comparison of query execution time on 5000 records
By varying the query size of the count query and the total number of records in the data set, the present invention analyzes the time needed to store data and execute queries using a hashMap and compares it with previously proposed methods. The method effectively protects data privacy by storing the data in a hashMap and using a homomorphic encryption scheme. It supports the storage of large data sets. The processing and encryption of the sensitive data of each Entity does not directly affect the number of records. The invention adds detection bits to the original data using the error-correction capability of Hamming codes and verifies the integrity of the returned results at the agency terminal, which effectively ensures the accuracy of the results returned to the researcher terminal. These characteristics enable the proposed method to execute count queries on encrypted data more securely and efficiently.
Based on an analysis of existing schemes for the secure count query problem on encrypted genome data, the invention provides a safe and effective genome data outsourcing method. To achieve data privacy, the method stores the data of the database in a hashMap and outsources the stored data to a third-party cloud server. A third-party agency terminal enables secure count queries by the researcher terminal on the cloud service. To verify data integrity, the invention adds detection bits to the genome data using Hamming codes, so that no sensitive gene data is disclosed in the data processing and query execution stages.

Claims (8)

1. A security counting inquiry and integrity verification device based on encrypted genome data is characterized by comprising a data owner terminal, an authentication mechanism server, a cloud server, an agency terminal and a researcher terminal;
the certification authority terminal receives the patient record from the data owner terminal and encrypts the patient record;
the data owner terminal processes the genome data of the patient according to a specified format and sends the genome data of the patient to the certification authority server in a plaintext form;
after obtaining data shared by different data owners, the authentication mechanism server constructs a searchable hash map aggregating and sharing encrypted data and sends the hash map to the cloud service;
the cloud service acquires encrypted data from the authentication mechanism, processes an encrypted query request sent by the agency mechanism terminal, executes the query request and sends a query result to the agency mechanism terminal in an encrypted manner;
the research personnel terminal responds to the user operation and sends the research personnel query request to the agency terminal;
the agency terminal encrypts the query request data after receiving the scientific research personnel query request, sends the encrypted query request to the cloud server, and decrypts the query result by using the private key after obtaining the query result from the cloud server and sends the query result to the research personnel terminal.
2. The apparatus for secure count query and integrity verification based on encrypted genomic data according to claim 1,
the cloud server generates a group of SNP data sets of the specific genes and frequency values of the specific genes appearing in the database, and then normalizes the values into total record numbers;
the cloud server represents the total number of counting queries as:
Figure FDA0002211388820000011
where D is the database and Q is the query, D = {S_1, S_2, …, S_n} denotes the SNP sequences of n patients, the count query is defined as finding the number of patients in D that satisfy the plurality of query conditions q of Q, and the gene sequence of the m SNP sites of a patient is denoted S = {d_1, d_2, …, d_m}, where d_i (1 ≤ i ≤ m) represents the SNP value of the i-th site of the patient.
3. The apparatus for secure count query and integrity verification based on encrypted genomic data according to claim 1,
after receiving data provided by a data owner, the certification authority server creates an Entity of a map for the SNP sequence of each patient, processes all the data to generate a mapping table M, wherein the M comprises genotype and phenotype information of each patient, and after the M is created, the certification authority server creates and updates the Entity information of the M for each new record from the data owner;
each Entity contains the following:
key: the key value of the Entity, an index value that points to the specific data,
geno: the SNP sequence of each patient, where the SNP sequence of one patient is denoted S = {d_1, d_2, …, d_m}, sid is the unique identifier of an SNP site and represents a specific position in the gene sequence, and each base pair d_i (1 ≤ i ≤ m) has its own site identifier,
I: the phenotype data set corresponding to the genome sequence,
count: the frequency of occurrence of the current SNP sequence in the database, representing the total number of records matching the genotype and phenotype,
next: the key value of the next Entity.
4. The apparatus for secure count query and integrity verification based on encrypted genomic data according to claim 3,
the complexity of creating the hash table by the certification authority server is O(mn), where m is the number of records in the database and n is the number of different SNP sites in each sequence record; each Entity is defined as θ, and its components are denoted θ(key, geno, I, count, next).
5. The apparatus for secure count query and integrity verification based on encrypted genomic data according to claim 1,
the certification authority server processes phenotype data using the Bloom filter technique: for each SNP record in a genome sequence, it inserts the corresponding phenotype into the Bloom filter, and if multiple phenotypes are associated with the same genome sequence, it inserts all of them into the Bloom filter of that genome sequence entry; each Entity in M contains a Bloom filter value I that represents the phenotype information corresponding to the genome sequence;
the certification authority server sets the hash function H used in the Bloom filter and the domain of the public alphabet Σ, which is the set of all possible phenotypes; each phenotype in Σ is mapped to a unique number, and for each entry in the genomic sequence this number is inserted into the Bloom filter;
the certification authority server selects a random key k for a pseudorandom function F and encrypts each Bloom filter as
Figure FDA0002211388820000031
the certification authority server sends the private key sk to the agency terminal and provides the key k of the PRF F_k and the hash function used for the Bloom filter;
the certification authority server matches the hash positions of the phenotype value I_q of the query request against the hash positions of I_i of each Entity;
the certification authority server sends the PRF key k to the agency terminal, and decryption of the encrypted Bloom filter I'_i by the formula
Figure FDA0002211388820000032
is performed.
6. The apparatus for secure count query and integrity verification based on encrypted genomic data according to claim 1,
the certificate authority server generates a key pair (pk, sk), where pk is the public key and sk is the private key, encrypts each entry in the hash table using the public key pk, and denotes the encryption of an entry as E_pk(θ);
the certification authority server represents the hash map in which all entries of M have been encrypted as
Figure FDA0002211388820000041
and sends
Figure FDA0002211388820000042
to the cloud server;
the certificate authority server transmits the key pair (pk, sk) to the agency terminal.
7. The apparatus for secure count query and integrity verification based on encrypted genomic data according to claim 1,
the cloud server sends the query result R_q to the agency terminal; the agency terminal decrypts the SNP values using the private key sk, verifies the integrity of the data using Hamming code check bits, and returns the query result that satisfies the query condition:
R_q = {E_pk(count) | E_pk(B_1) | E_pk(B_2) | … | E_pk(B_num) | I_i′}
the agency terminal decrypts each SNP value using the private key sk to obtain the decrypted SNP value
Figure FDA0002211388820000043
and, from
Figure FDA0002211388820000044
recalculates the Hamming code check bits as T';
the agency terminal compares T' with T_q to determine whether the data is intact:
if T' = T_q, the agency terminal judges the data to be secure and trustworthy;
if T' ≠ T_q and T' ∈ {h_A, h_C, h_G, h_T}, the agency terminal uses the check bits T_q to correct the returned
Figure FDA0002211388820000045
and updates the hash map;
if T' ≠ T_q and
Figure FDA0002211388820000046
the agency terminal judges that the original data has been tampered with and altered to other content, in which case the returned query result count is not trustworthy for the patient's SNP sequence.
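The three-way decision in this claim can be sketched as follows. The sketch assumes that h_A, h_C, h_G and h_T are the check values of the four nucleotides, that each nucleotide is stored as a hypothetical 4-bit data word, and that the check bits are the three parity bits of a Hamming(7,4) code; the patent's own encoding is shown only in the figures, so none of these choices should be read as the claimed method.

```python
# Hypothetical 4-bit data words for the four nucleotides (not from the patent).
NUCLEOTIDE_BITS = {
    "A": (0, 0, 0, 1),
    "C": (0, 0, 1, 0),
    "G": (0, 1, 0, 0),
    "T": (1, 0, 0, 0),
}


def hamming_check_bits(data):
    """Parity bits of a Hamming(7,4) code for the data bits (d1, d2, d3, d4)."""
    d1, d2, d3, d4 = data
    return (d1 ^ d2 ^ d4, d1 ^ d3 ^ d4, d2 ^ d3 ^ d4)


# h_A, h_C, h_G, h_T and the reverse lookup from check bits to nucleotide.
H = {nuc: hamming_check_bits(bits) for nuc, bits in NUCLEOTIDE_BITS.items()}
CHECK_TO_NUCLEOTIDE = {check: nuc for nuc, check in H.items()}


def verify(decrypted_bits, t_q):
    """Return (status, nucleotide) for a decrypted SNP value and stored check bits T_q."""
    t_prime = hamming_check_bits(decrypted_bits)
    expected = CHECK_TO_NUCLEOTIDE.get(t_q)
    if t_prime == t_q:
        # Case 1: recomputed and stored check bits agree.
        return ("intact", expected)
    if t_prime in CHECK_TO_NUCLEOTIDE and expected is not None:
        # Case 2: the value is still a valid nucleotide, so restore it from T_q.
        return ("corrected", expected)
    # Case 3: the value was altered to something outside the nucleotide set.
    return ("tampered", None)
```

With this toy encoding some non-nucleotide words share check bits with a nucleotide (for example, the word 1111 yields the same parity as A), so the sketch shows the control flow rather than a sound detection scheme.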
8. A secure count query and integrity verification method based on encrypted genomic data, comprising the following steps:
the certification authority terminal receives the patient record from the data owner terminal and encrypts the patient record;
the data owner terminal processes the genome data of the patient according to a specified format and sends the genome data of the patient to the certification authority server in a plaintext form;
after obtaining the data shared by the different data owners, the certification authority server constructs a searchable hash map that aggregates the shared data in encrypted form and sends the hash map to the cloud server;
the cloud server acquires the encrypted data from the certification authority server, processes the encrypted query request sent by the agency terminal, executes the query request, and sends the query result to the agency terminal in encrypted form;
the researcher terminal, in response to a user operation, sends the researcher's query request to the agency terminal;
after receiving the researcher's query request, the agency terminal encrypts the query request data and sends the encrypted query request to the cloud server; after obtaining the query result from the cloud server, the agency terminal decrypts the query result using the private key and sends it to the researcher terminal.
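The message flow of the method can be sketched with a small role simulation. This is a simplified stand-in, not the claimed construction: records are indexed with deterministic HMAC-SHA256 tokens instead of a public-key encrypted hash map, the count is returned in the clear rather than encrypted under pk, and all class, method, and SNP names are hypothetical.

```python
import hashlib
import hmac


def token(key, snp_id, allele):
    """Deterministic keyed token for one (SNP, allele) pair."""
    message = f"{snp_id}={allele}".encode()
    return hmac.new(key, message, hashlib.sha256).hexdigest()


class CertificationAuthority:
    """Receives plaintext records from data owners and builds the searchable index."""

    def __init__(self, key):
        self.key = key

    def build_index(self, records):
        return [{token(self.key, s, a) for s, a in rec.items()} for rec in records]


class CloudServer:
    """Stores only tokens and answers count queries without seeing plaintext."""

    def __init__(self, index):
        self.index = index

    def count(self, query_tokens):
        return sum(1 for rec in self.index if query_tokens <= rec)


class AgencyTerminal:
    """Turns a researcher's query into tokens and relays the result."""

    def __init__(self, key, cloud):
        self.key = key
        self.cloud = cloud

    def handle(self, query):
        tokens = {token(self.key, s, a) for s, a in query.items()}
        return self.cloud.count(tokens)


def demo():
    key = b"shared-demo-key"
    records = [
        {"rs1": "A", "rs2": "G"},
        {"rs1": "A", "rs2": "T"},
        {"rs1": "C", "rs2": "G"},
    ]
    cloud = CloudServer(CertificationAuthority(key).build_index(records))
    agency = AgencyTerminal(key, cloud)
    # The researcher terminal asks: how many patients have rs1 = A?
    return agency.handle({"rs1": "A"})
```

Deterministic tokens reveal which records share a value, which is a weaker privacy guarantee than randomized encryption; the sketch keeps them only because they make the division of roles easy to follow.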
CN201910899612.1A 2019-09-23 2019-09-23 Safety counting query and integrity verification device and method based on encrypted genome data Pending CN110660450A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910899612.1A CN110660450A (en) 2019-09-23 2019-09-23 Safety counting query and integrity verification device and method based on encrypted genome data

Publications (1)

Publication Number Publication Date
CN110660450A true CN110660450A (en) 2020-01-07

Family

ID=69038942

Country Status (1)

Country Link
CN (1) CN110660450A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108170753A (en) * 2017-12-22 2018-06-15 北京工业大学 A kind of method of Key-Value data base encryptions and Safety query in shared cloud
CN110263570A (en) * 2019-05-10 2019-09-20 电子科技大学 A kind of gene data desensitization method for realizing efficient similarity query and access control

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
MOHAMMAD ZAHIDUL HASAN ET AL.: "《Secure count query on encrypted genomic data》", 《JOURNAL OF BIOMEDICAL INFORMATICS》 *
MURAT KANTARCIOGLU ET AL.: "《A Cryptographic Approach to Securely Share and Query Genomic Sequences》", 《IEEE TRANSACTIONS ON INFORMATION TECHNOLOGY IN BIOMEDICINE》 *
ZAHIDUL HASAN ET AL.: "《Secure Count Queries on Encrypted Genomic Data: a survey》", 《IEEE INTERNET COMPUTING》 *
徐东作 等: "《转录因子结合位点预测算法的研究与应用》", 31 December 2009, 上海:上海大学出版社 *
王晓东 等: "《网络通信与网络互联》", 31 March 2014, 北京:高等教育出版社 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111967048A (en) * 2020-08-19 2020-11-20 西安电子科技大学 Efficient matching and privacy protection method and system for genome data similarity
CN111967048B (en) * 2020-08-19 2022-11-29 西安电子科技大学 Efficient matching and privacy protection method and system for genome data similarity
CN112416948A (en) * 2020-12-15 2021-02-26 暨南大学 Verifiable gene data outsourcing query protocol and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination