CN110660450A

CN110660450A - Safety counting query and integrity verification device and method based on encrypted genome data

Info

Publication number: CN110660450A
Application number: CN201910899612.1A
Authority: CN
Inventors: 王雷; 邹赛; 朱贤友; 陈治平
Original assignee: Chongqing College of Electronic Engineering; Changsha University
Current assignee: Chongqing College of Electronic Engineering; Changsha University
Priority date: 2019-09-23
Filing date: 2019-09-23
Publication date: 2020-01-07

Abstract

The invention provides a method for securely sharing and computing on genome data on a cloud server that is not fully trusted. First, the method processes the biomedical data so as to ensure the privacy and integrity of the shared data. Second, the method stores the data in a hashMap, which improves encryption efficiency and enables efficient secure count queries. The method is evaluated on an existing single nucleotide polymorphism (SNP) sequence database, and the experiments show that the count query is flexible, easy to implement and highly secure in practical applications.

Description

Safety counting query and integrity verification device and method based on encrypted genome data

Technical Field

The invention relates to the technical field of biological information, in particular to a safety counting inquiry and integrity verification device based on encrypted genome data.

Background

Clinical medical practice plays a crucial role in healthcare, and genomics research is becoming increasingly widespread. Genomics research helps to identify potential correlations between diseases and genes; to this end, biomedical researchers have conducted large-scale investigations of patients' clinical status and DNA sequence data, and most analyses of gene data are based on the genome-wide association studies (GWAS) of the US National Institutes of Health. To improve the accuracy of such research, data from different sources must be aggregated, and various systems for sharing and storing data have been established, such as the database of Genotypes and Phenotypes (dbGaP) in the United States and the biobank program in the United Kingdom supported by the Wellcome Trust. Genomic data concerns patient privacy, and its disclosure may cause social and legal problems; for example, a health insurance company may refuse to insure a person once it learns that the person carries a gene mutation associated with a particular cancer. Given the sensitivity of genetic data, institutions that share such data must store and access it through privacy-preserving methods.

Biomedical research increasingly depends on large amounts of genomic and clinical data, and scholars have paid great attention to how to ensure the privacy of the individuals concerned and the overall security of the system when such data is shared and managed. In the past, to protect the sensitive information of subjects, important identifiers that could identify individuals were deleted when multiple organizations shared data. However, research has shown that the identity of a subject can easily be inferred using automated methods. Current research uses encryption protocols to share, manage and analyze biomedical data, and outsources the encrypted data to third-party cloud service providers with vast storage and computing resources in order to protect the privacy of the data. At the same time, these third parties have themselves become potential targets of attacks that could violate the privacy of research subjects.

In the prior art, there are security frameworks that manage clinical genomic data in a centralized database in a specific form, and these frameworks propose methods for the secure sharing and storage of genomic data on a cloud service that is not fully honest. First, the data owner sends clinical information to a third-party organization for encryption and authentication, and the third-party organization sends a large amount of aggregated data to a cloud service, where it is stored in a certain structure. The cloud service then performs a query on behalf of the researcher (e.g., the number of patient records with hypertension and a particular genomic variant) and returns the query result to the researcher. However, returning the query result directly to the researcher does not rule out an attacker compromising the researcher and capturing the encryption protocol, so the data remains at risk.

Disclosure of Invention

To enable scientific research that queries shared medical data without revealing the identity of the data subjects, the invention provides a safety counting query and integrity verification device based on encrypted genome data, which comprises a data owner terminal, a certification authority server, a cloud server, an agency terminal and a researcher terminal;

the certification authority server receives the patient records from the data owner terminal and encrypts them;

the data owner terminal processes the genome data of the patients according to a specified format and sends it to the certification authority server in plaintext form;

after obtaining the data shared by different data owners, the certification authority server constructs a searchable hash map that aggregates the shared encrypted data and sends the hash map to the cloud server;

the cloud server acquires the encrypted data from the certification authority server, processes the encrypted query requests sent by the agency terminal, executes each query and sends the query result to the agency terminal in encrypted form;

the researcher terminal responds to user operations and sends the researcher's query request to the agency terminal;

after receiving the researcher's query request, the agency terminal encrypts the query data and sends the encrypted query request to the cloud server; after obtaining the query result from the cloud server, it decrypts the result using the private key and sends it to the researcher terminal.

Further, the cloud server obtains the frequency with which a set of SNP values of a specific gene appears in the database, and then normalizes this value to the total number of records;

the cloud server represents the total number of counting queries as:

where $D$ is the database and $Q$ is the query; $D = \{S_1, S_2, \ldots, S_n\}$ denotes the SNP sequences of $n$ patients; the count query is defined as the number of patients in $D$ that satisfy the multiple query conditions of $Q$; and the gene sequence over the $m$ SNP sites of a patient is denoted $S = \{d_1, d_2, \ldots, d_m\}$, where $d_i$ ($1 \le i \le m$) represents the SNP value at the $i$-th site of the patient.

Further, after receiving the data provided by a data owner, the certification authority server creates a map Entity for the SNP sequence of each patient and processes all the data to generate a mapping table $M$, which contains the genotype and phenotype information of each patient; once $M$ has been created, the certification authority server creates or updates the Entity information in $M$ for each new record from a data owner;

each Entity contains the following:

key: the key of the Entity, an index value that points to a specific data item,

geno: the SNP sequence of a patient, written $S = \{d_1, d_2, \ldots, d_m\}$; sid is the unique identifier of an SNP site and denotes a specific position in the gene sequence, and each base pair $d_i$ ($1 \le i \le m$) has its own site identifier,

I: the phenotype data set corresponding to the genome sequence,

count: the frequency of occurrence of the current SNP sequence in the database, i.e. the total number of records matching the genotype and phenotype,

next: the key of the next Entity.

Further, the certification authority server creates the hash table with complexity $O(mn)$, where $m$ is the number of records in the database and $n$ is the number of different SNP sites in each sequence record; each Entity is denoted $\theta$, and its components are written $\theta(\text{key}, \text{geno}, I, \text{count}, \text{next})$.
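The Entity structure and the construction of the mapping table described above can be sketched as follows. This is a minimal plaintext illustration, not the patented implementation: the names `Entity` and `build_map`, the use of sequential integers as keys, and the use of a plain set in place of the Bloom filter value I are all assumptions made for the example, and encryption is omitted.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List, Optional, Tuple


@dataclass
class Entity:
    key: int                 # index value of this entry (hypothetical: sequential int)
    geno: Tuple[str, ...]    # SNP sequence (d_1, ..., d_m)
    I: FrozenSet[str]        # phenotype set (stand-in for the Bloom filter value I)
    count: int               # number of records sharing this genotype and phenotype
    next: Optional[int]      # key of the next Entity, None for the last one


def build_map(records: Iterable[Tuple[List[str], List[str]]]) -> Dict[int, Entity]:
    """Aggregate (SNP sequence, phenotypes) records into a table of Entities.

    Each record is hashed once over its SNP sites, so building the table
    for m records of n sites each costs O(mn).
    """
    counts = Counter((tuple(geno), frozenset(pheno)) for geno, pheno in records)
    items = list(counts.items())
    table: Dict[int, Entity] = {}
    for key, ((geno, pheno), count) in enumerate(items):
        nxt = key + 1 if key + 1 < len(items) else None
        table[key] = Entity(key, geno, pheno, count, nxt)
    return table


records = [
    (["AG", "CT"], ["hypertension"]),
    (["AG", "CT"], ["hypertension"]),
    (["AA", "CT"], ["diabetes"]),
]
M = build_map(records)
```

In this sketch identical records collapse into one Entity whose `count` is incremented, which mirrors the role of the count field as the number of records matching a given genotype and phenotype.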

Further, the certification authority server processes the phenotype data using a Bloom filter: for each SNP record in the genome sequence it inserts the corresponding phenotype into the Bloom filter, and if several phenotypes are associated with the same genome sequence, it inserts all of them into the Bloom filter of that genome sequence entry; each Entity in $M$ contains a Bloom filter value $I$ that represents the phenotype information corresponding to the genome sequence;

the certification authority server fixes the hash function $H$ used in the Bloom filter and the common alphabet $\Sigma$, which is the set of all possible phenotypes; for each entry in the genome sequence, each corresponding phenotype in $\Sigma$ is mapped to a unique number, and that number is inserted into the Bloom filter;

the certification authority server selects a random key $k$ for a pseudo-random function (PRF) $F$ and encrypts each Bloom filter into

The certification authority server sends the private key $sk$ to the agency terminal, and provides the PRF key $k$ and the hash function $F_k$ for the Bloom filter;

the certification authority server matches the hash positions of the phenotype value $I_q$ of the query request against the hash positions of $I_i$ in each Entity;

the certification authority server transmits the PRF key $k$ to the agency terminal, and for the encrypted Bloom filter $I_i'$, using the formula

Decryption is performed.
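The phenotype handling described above can be illustrated with a small plaintext Bloom filter. This is only a sketch under stated assumptions: the filter size, the number of hash functions, the SHA-256 based position function and the sample alphabet `SIGMA` are hypothetical choices, and the PRF-based encryption and decryption of the filter is left out because its formulas are not given in the text.

```python
import hashlib
from typing import Iterable, List


class BloomFilter:
    def __init__(self, size: int = 64, num_hashes: int = 3) -> None:
        self.size = size
        self.num_hashes = num_hashes
        self.bits = 0  # bit array stored as a Python int

    def _positions(self, number: int) -> List[int]:
        # Derive num_hashes positions from seeded SHA-256 digests (assumed H).
        positions = []
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{number}".encode()).digest()
            positions.append(int.from_bytes(digest[:8], "big") % self.size)
        return positions

    def add(self, number: int) -> None:
        for pos in self._positions(number):
            self.bits |= 1 << pos

    def might_contain(self, number: int) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(number))


# Hypothetical alphabet: every possible phenotype is mapped to a unique number.
SIGMA = {"hypertension": 1, "diabetes": 2, "alzheimers": 3}


def phenotype_filter(phenotypes: Iterable[str]) -> BloomFilter:
    """Insert the numbers of all phenotypes of one genome sequence entry."""
    bf = BloomFilter()
    for name in phenotypes:
        bf.add(SIGMA[name])
    return bf


def matches(entity_filter: BloomFilter, query_filter: BloomFilter) -> bool:
    """True if every hash position set in the query is also set in the entity."""
    return entity_filter.bits & query_filter.bits == query_filter.bits


I_i = phenotype_filter(["hypertension", "diabetes"])
I_q = phenotype_filter(["diabetes"])
```

As with any Bloom filter, a positive match may be a false positive, while a negative match is always correct; the sketch therefore only guarantees that inserted phenotypes are found.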

Further, the certification authority server generates a key pair $(pk, sk)$, where $pk$ is the public key and $sk$ is the private key; each entry in the hash table is encrypted using the public key $pk$, and the encryption of an entry is denoted $E_{pk}(\theta)$;

the certification authority server encrypts all the entries of the hash map $M$ in this way and sends the resulting encrypted hash map to the cloud server;

the certification authority server transmits the key pair $(pk, sk)$ to the agency terminal.

Further, the cloud server sends the query result $R_q$ to the agency terminal; the agency terminal decrypts the SNP values using the key $sk$, verifies the integrity of the data using the Hamming code check bits, and returns the query result that satisfies the query conditions:

$$R_q = \{E_{pk}(\text{count}) \mid E_{pk}(B_1) \mid E_{pk}(B_2) \mid \ldots \mid E_{pk}(B_{num}) \mid I_i'\}$$

the agency terminal decrypts each SNP value using the private key $sk$ and recalculates the Hamming code check bits of the decrypted SNP value as $T'$;

the agency terminal compares $T'$ with $T_q$ to determine whether the data is intact:

if $T' = T_q$, the agency terminal judges the data to be secure and trustworthy;

if $T' \ne T_q$ and $T' \in \{h_A, h_C, h_G, h_T\}$, the agency terminal uses the check bits $T_q$ to correct the returned value and updates the hash map;

if $T' \ne T_q$ and $T' \notin \{h_A, h_C, h_G, h_T\}$, the agency terminal judges that the original data has been tampered with and changed to other content, in which case the count returned for the patient's SNP sequence is not trustworthy.
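The three-way decision made by the agency terminal can be sketched as a small function. The check-bit values are those given for A, C, G and T in the Hamming code part of the description; the function name, the string labels and the reading of the last case as "not one of the four valid check-bit patterns" are assumptions made for illustration.

```python
# Check bits of the four bases, as listed in the description.
VALID_CHECK_BITS = {"A": "0001", "C": "0100", "G": "1101", "T": "0011"}


def check_integrity(t_prime: str, t_q: str) -> str:
    """Classify a decrypted base by comparing recomputed and stored check bits.

    t_prime: check bits recomputed from the decrypted SNP value (T').
    t_q:     check bits returned with the query result (T_q).
    """
    if t_prime == t_q:
        return "trusted"        # data is intact
    if t_prime in VALID_CHECK_BITS.values():
        return "correctable"    # a valid base pattern: correct and update the map
    return "tampered"           # not a valid pattern: the count is not trustworthy


results = [
    check_integrity("0100", "0100"),
    check_integrity("0001", "0100"),
    check_integrity("1111", "0100"),
]
```

Here the three sample calls correspond, in order, to the intact, correctable and tampered cases.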

The invention also provides a safety counting inquiry and integrity verification method based on encrypted genome data, which comprises the following steps:

The invention has the beneficial effects that:

(1) The invention provides a secure query framework. The invention encrypts the gene sequences with the Paillier cryptosystem, stores the encrypted data in a third-party cloud service, and performs count queries without the cloud service knowing the key, which ensures the privacy of the data. The agency introduced by the invention decrypts and verifies the encrypted query results that are returned, which ensures the integrity of the data.

(2) The invention provides a technique for processing raw data. The invention adds check bits (detection bits) to the gene data using Hamming codes and verifies the credibility of the count query result by checking the returned gene sequences.

(3) The model provided by the invention guarantees data privacy, query privacy and output privacy. The query initiator completes a secure interactive query through the agency and the cloud server.

(4) The invention provides experimental studies that demonstrate the feasibility of the proposed method. For a query over 5000 records and 40 SNP sites, the methods of Kantarcioglu et al. and Canim et al. took 30 min and 80 s, respectively, whereas the method proposed by the invention takes 3.9 s to perform the same query, a significant reduction.

(5) The method of the invention allows the outsourcing protocol to match the stored encrypted data with the query command in the ciphertext state to complete the counting query. In order to protect data privacy, the present invention encrypts gene data using a homomorphic encryption method. In order to verify the integrity of the query result, the invention adds the detection bit to the genome data by using the Hamming code technology, can verify the query result based on the error correction of the Hamming code detection bit, and has good performance in the aspect of ensuring the safety and the integrity of the genome data.

Drawings

FIG. 1 is a diagram illustrating an apparatus according to an embodiment of the present invention.

FIG. 2 is a diagram illustrating test results according to an embodiment of the present invention.

FIG. 3 is a diagram illustrating test results according to an embodiment of the present invention.

FIG. 4 is a diagram illustrating test results according to an embodiment of the present invention.

Detailed Description

1. The following describes the design principles and advantageous effects of the present invention

The objective of the invention is to design a framework for secure count queries on outsourced genome data, in which a count query determines the number of records in the database that match the query conditions. The framework processes and stores data using a secure encryption mechanism and evaluates the security of each queried stored record through a third-party agency terminal; the system framework is shown in fig. 1. The specific contributions of the invention are as follows:

(1) The invention provides a secure query framework. The invention encrypts the gene sequences with the Paillier cryptosystem, stores the encrypted data in a third-party cloud service, and performs count queries without the cloud service knowing the key, which ensures the privacy of the data. The agency terminal introduced by the invention decrypts and verifies the encrypted query results that are returned, which ensures the integrity of the data.

(2) The invention provides a technique for processing raw data. The invention adds check bits (detection bits) to the gene data using Hamming codes and verifies the credibility of the count query result by checking the returned gene sequences.

(3) The model provided by the invention guarantees data privacy, query privacy and output privacy. The query initiator completes a secure interactive query through the agency terminal and the cloud server.

(4) The invention provides experimental studies that demonstrate the feasibility of the proposed method. For a query over 5000 records and 40 SNP sites, the methods of Kantarcioglu et al. and Canim et al. took 30 min and 80 s, respectively. The method proposed by the invention takes 3.9 s to perform the same query.

2. The following describes the concepts related to the present invention

The human genome contains basic data on human biology as well as private diagnostic information, so querying personal genome data is highly sensitive. To protect data privacy, Ayday et al. proposed a privacy protection method and system based on homomorphic encryption that uses a storage and processing unit to store sensitive data in encrypted form. Bruekers et al. proposed solutions for the semi-honest attacker model based on limited DNA homomorphic encryption, the complexity of which depends to some extent on the number of errors to be tolerated. Eppstein et al. used a privacy-enhanced invertible Bloom filter (PIBF) and proposed a privacy-preserving comparison method that operates on compressed genomic data.

Privacy protection protocols for genome data may themselves leak sensitive data. Aiming at the privacy of the query protocol, Atallah et al. proposed the first privacy-preserving sequence comparison algorithm. However, because of its heavy computational requirements, Jha et al. introduced garbled circuits for sequence comparison and distance calculation using a secure two-party computation protocol. The main drawback of this solution is that it cannot handle large-scale computations. Troncoso-Pastoriza et al. provided secure and private DNA matching in a semi-honest setting and proposed a protocol for secure matching through oblivious evaluation of automata, but the processing time of the protocol is rather long. Blanton and Aliasgari proposed a secure outsourcing protocol in order to reduce the communication time of the protocol of Troncoso-Pastoriza et al.

For the secure outsourcing of genome data for count queries, Kantarcioglu et al. first proposed an encryption model involving two third parties, but this model does not provide query privacy. Perl et al. proposed a method for searching biomedical data using a Bloom filter in combination with homomorphic encryption; they fully outsource the task of searching to a third-party cloud server. Canim et al. use tamper-resistant cryptographic hardware at a data storage server (DS) to enable secure storage, sharing and querying of genomic data with a single third party, but the size of the query is limited by the memory size of the tamper-resistant hardware. Xie et al. proposed a meta-analysis approach for secure genetic association studies that stores the data at the site of the corresponding data owner. Ignatenko and Petkovic proposed a solution for searching and matching DNA sequences in private DNA databases, which represents DNA sequences as context trees that serve as index information for privacy-preserving matching and similarity search. The context tree is constructed using a general compression technique known as context tree weighting (CTW).

The invention provides a method for protecting privacy and integrity of genome data safety counting query. The method of the invention allows the outsourcing protocol to match the stored encrypted data with the query command in the ciphertext state to complete the counting query. In order to protect data privacy, the present invention encrypts gene data using a homomorphic encryption method. In order to verify the integrity of the query result, the invention adds the detection bit to the genome data by using the Hamming code technology, can verify the query result based on the error correction of the Hamming code detection bit, and has good performance in the aspect of ensuring the safety and the integrity of the genome data.

3. The following describes a model of the apparatus of the present invention

3.1 System design

In this section, the security framework proposed by the present invention is described. As shown in fig. 1, the framework contains five major participants: the data owner, the certification authority server, the cloud server, the agency terminal and the researcher terminal. Each entity is responsible for specific tasks that together ensure the security and proper functioning of the whole system. The flow of information in the framework comprises two phases: a data integration phase and a query processing phase. During the data integration phase, the certification authority server receives the patient records from the data owner and encrypts them. In the query processing phase, the agency terminal encrypts the researcher's query conditions and sends the encrypted query to the cloud server, which executes it.

The data owner is comprised of a plurality of institutions that own the genomic data, which may be hospitals, academic research institutions or government research institutions, etc. These institutions process the patient's genome data in a prescribed format and send it in the clear to the certification authority server.

The certification authority server, as a third-party authority, is crucial to the security of the framework. Its main tasks are:

1) Processing the data. After the certification authority server obtains the data shared by different data owners, it constructs a searchable hashMap that aggregates the shared encrypted data and sends it to the cloud server. User queries are essentially performed on this encrypted index tree. The index tree contains all records from the shared data, and when records are added or deleted, the certification authority server can update the tree accordingly.

2) Generating keys. The sensitive data stored in the hash table is all encrypted with a key, and the executing organization needs to manage the public key used for data encryption and the private key used for decryption.

The cloud service acquires the encrypted data from the certification authority server and is responsible for processing the encrypted query request sent by the agency terminal. The cloud server executes the query and sends the encryption result to the agency terminal.

The agency terminal handles all communications with the scientific researchers. After receiving the query request of the scientific research personnel, the agency terminal encrypts the query data, sends the encrypted query request to the cloud server, and decrypts the query result and sends the decrypted query result to the terminal of the research personnel.

The researcher represents any individual or organization interested in performing queries on shared data stored in the cloud server. The scientific research personnel sends the inquiry request to the agency terminal for encryption, the agency terminal decrypts the result by using the private key after receiving the encrypted inquiry result and sends the result to the research personnel terminal, and the research personnel terminal obtains the final output result.

3.2 Attack model and security goals

The goal of the invention is that the cloud server learns nothing about the shared genome data, and that neither the certification authority server nor the cloud server learns anything about the queries performed by a researcher terminal. The invention assumes that the certification authority server is a trusted entity. Like the Data Access Committee (DAC) of the NIH, the certification authority server is responsible for generating and encrypting the hash table and for verifying the identity of the individuals and organizations that apply for access to the data. Within the framework, the invention assumes that the cloud server is semi-honest: it follows the protocol correctly and does not intend to produce erroneous results maliciously. Unlike the certification authority server, however, the cloud server stores a large amount of sensitive data and is where query processing is executed; if it is compromised, an attacker can obtain a large amount of data, forge query results or return incomplete query results to the user. The harm caused by a compromised cloud server therefore greatly outweighs the risk of a compromised certification authority server, and the invention focuses on the case in which the cloud server is attacked.

The security goals of the present invention are mainly the following two:

(1) Data privacy and query privacy: data privacy means that the cloud server cannot learn the plaintext data stored by the data owner, which ensures that an attacker cannot understand the data stored in the cloud server; query privacy means that the cloud server cannot learn the actual values in a researcher terminal's query request, which ensures that an attacker cannot understand or infer the query requests received by the cloud server.

(2) Data integrity: the goal is to verify the correctness of the query results returned to the researcher terminal. If a query result returned by the cloud server to the agency terminal has been deleted or forged by an attacker, the agency terminal can detect this and can analyze and correct the query result based on the protocol provided by the invention.

4. The following explains the data processing method adopted in the present invention

The human genome is the complete set of genetic information of a human being; it is located in the chromosomes, each of which contains genes responsible for controlling various functions of the human body. Genome sequence data is composed of the four nucleotide bases {A, C, G, T}, and differences in the arrangement of the bases on the DNA strands make individuals unique. Most of the DNA sequence is conserved throughout the population, with approximately 0.5% of each person's DNA differing from the reference genome due to genetic variation. Several types of variation exist in the genome, such as single nucleotide polymorphisms (SNPs), copy number variations (CNVs), rearrangements, and the like. A single nucleotide polymorphism is a DNA sequence polymorphism caused by variation of a single nucleotide on an allele, and is the most common form of DNA variation. Most SNPs have no effect on human health; however, some SNPs directly cause certain diseases. Analysis of SNP sequences is common in genomic data research. The dbGaP database contains patients' genomic sequences, i.e. the SNP site information, together with the associated diagnoses and the corresponding clinical information. The present invention assumes that a sequence $S$ is composed of a plurality of SNPs, denoted $S = \{\alpha_1, \alpha_2, \ldots, \alpha_\gamma\}$, where $\alpha_i$, the SNP at position $i$, is represented by a pair of nucleotides.

Table 1 Data sequence of SNPs

4.1 Secure count query

For genomic data studies, the most common task is to determine how many samples in the database satisfy certain characteristics. The count query is an important step in genetic association research and helps researchers determine the genes that affect specific human diseases. For example, a researcher may ask whether there is an association between various SNPs of the DSP1 gene and a diagnosis of Alzheimer's disease in individuals. Similar SNP-disease association studies are becoming more common in human genomics research. A biological researcher needs to obtain the frequency with which a set of SNP values of a specific gene (e.g., $\text{SNP}_1 = \text{AG} \cap \text{SNP}_2 = \text{CT} \cap \text{diagnosis} = \text{Alzheimer's disease}$) appears in the database, which is then normalized to the total number of records. For example, consider a database $D$ and a query $Q$, where $D = \{S_1, S_2, \ldots, S_n\}$ denotes the SNP sequences of $n$ patients. A count query can be defined as finding the number of patients in $D$ that satisfy the multiple query conditions of $Q$. Assuming that the gene sequence over the $m$ SNP sites of a patient is represented by $S = \{d_1, d_2, \ldots, d_m\}$, where $d_i$ ($1 \le i \le m$) represents the SNP value at the $i$-th site of the patient, the total number of count queries can be expressed as:

Count queries are a simple operation if the data is stored in the clear, and conventional database management systems support them. However, such database management systems cannot be used to query encrypted data. To answer count queries over encrypted values without decrypting the data, the present invention proposes in section 5 a secure data processing method, hashMap, which counts the records that satisfy the user-specified conditions without decrypting the data stored in the cloud server.
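For reference, the plaintext version of the count query is easy to state in code. The sketch below is an illustration only: the function names, the representation of a query as a mapping from 0-based site index to required SNP value, and the sample data are assumptions, and it deliberately works on unencrypted data.

```python
from typing import Dict, List, Sequence


def count_query(database: Sequence[Sequence[str]], query: Dict[int, str]) -> int:
    """Count the SNP sequences in D that satisfy every condition in Q.

    database: list of SNP sequences S = (d_1, ..., d_m), one per patient.
    query:    maps a 0-based site index i to the required SNP value.
    """
    return sum(
        all(seq[i] == value for i, value in query.items())
        for seq in database
    )


def frequency(database: Sequence[Sequence[str]], query: Dict[int, str]) -> float:
    """Normalize the count to the total number of records."""
    return count_query(database, query) / len(database)


D: List[List[str]] = [
    ["AG", "CT", "AA"],
    ["AG", "CT", "GG"],
    ["AA", "CT", "AA"],
]
Q = {0: "AG", 1: "CT"}   # SNP_1 = AG and SNP_2 = CT
matching = count_query(D, Q)
```

With the sample data two of the three sequences match both conditions. The difficulty addressed by the invention is obtaining the same count when the sequences are stored only in encrypted form.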

4.2 Data encryption

To protect patient privacy, the certificate authority server needs to encrypt the data in the database to prevent leakage of sensitive data. In this section, the invention will provide an encryption scheme that is relevant for the framework of the invention.

4.2.1 Hamming code

Hamming code is a linear error-correcting code that inserts check bits into a transmitted message stream in order to detect and correct data bit errors that may occur during data storage and transmission. Hamming codes use the concept of parity bits, adding check bits at specific positions among the data bits to verify the validity of the data. The invention uses Hamming codes to process gene data, which not only verifies whether the data is valid but also, when the data is wrong, indicates where the error is and corrects it.

A Hamming code is realized by inserting $k$ binary check bits into the original data, turning the original $n$-bit data into a $z$-bit code. The encoding must satisfy the inequality $2^k - 1 \ge z$, where $z = n + k$. In the resulting $z$-bit code, the positions $2^{k-1}$ ($k \ge 1$, $2^{k-1} < z$) hold the check bits, and the remaining positions hold the bits of the original code in order. The detailed coding rules of the Hamming code are as follows:

1) Fill the positions $2^{k-1}$ of the new code (i.e., the check bit positions) with 0, and then fill the remaining positions of the new code in order with the source code.

2) The check bits are encoded as follows: the $k$-th check bit is computed starting from position $2^{k-1}$ of the new code by taking the exclusive OR of $2^{k-1}$ consecutive bits, skipping $2^{k-1}$ bits, taking the exclusive OR of the next group of $2^{k-1}$ bits, and so on; the result is filled into position $2^{k-1}$.

For example, for the nucleotide cytosine C, $(C)_2 = 1000011$ and $n = 7$; according to $2^k - 1 \ge z$ with $z = n + k$, the minimum value of $k$ is 4. Four check bits are then added:

First, the positions $2^{k-1}$ are filled with 0, giving 00100000011. The 1st check bit is located at position 1 ($2^{1-1}$) of the new code (Hamming code positions start at 1), and the exclusive OR of bits 1, 3, 5, 7, 9 and 11 is calculated as

$$0 \oplus 1 \oplus 0 \oplus 0 \oplus 0 \oplus 1 = 0$$

Position 1 of the new code is filled with 0, giving 00100000011.

The 2nd check bit is located at position 2 ($2^{2-1}$) of the new code; the exclusive OR of bits 2, 3, 6, 7, 10 and 11 is $0 \oplus 1 \oplus 0 \oplus 0 \oplus 1 \oplus 1 = 1$, so position 2 of the new code is filled with 1, giving 01100000011. The 3rd check bit is located at position 4 ($2^{3-1}$) of the new code; the exclusive OR of bits 4, 5, 6 and 7 is $0 \oplus 0 \oplus 0 \oplus 0 = 0$, so position 4 of the new code is filled with 0, giving 01100000011. The 4th check bit is located at position 8 ($2^{4-1}$) of the new code, and the exclusive OR of bits 8 to 11 is calculated as

$$0 \oplus 0 \oplus 1 \oplus 1 = 0$$

Position 8 of the new code is filled with 0, giving 01100000011. The final binary Hamming code for C is 01100000011.

The present invention uses Hamming codes to process the raw data and uses their error correction capability to check count query results. For the four nucleotides A, C, G and T, the binary check bits $\{h_A, h_C, h_G, h_T\}$ corresponding to their ASCII codes are calculated, where $h_A = 0001$, $h_C = 0100$, $h_G = 1101$ and $h_T = 0011$. A nucleotide pair $d_i$ is denoted $B_i$ after the check bits $t$ are added, where $t$ is a combination of any two elements of the set $\{h_A, h_C, h_G, h_T\}$, giving $2^4$ possible combinations. For example, the binary representation of $d_i = \text{CC}$ is 10000111000011, which becomes 0110000001101100000011 after the check bits 0100, 0100 are added at positions 1, 2, 4 and 8 of each base.
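The coding rules and the worked example for C can be reproduced with a short encoder. This is a sketch under the assumptions stated in the text (7-bit ASCII codes, check bits at the power-of-two positions, even parity via exclusive OR); the function names are hypothetical and error correction is not shown.

```python
def hamming_encode(data_bits: str) -> str:
    """Insert check bits at positions 1, 2, 4, 8, ... (1-based) of the code."""
    n = len(data_bits)
    k = 0
    while 2 ** k - 1 < n + k:      # smallest k with 2^k - 1 >= n + k
        k += 1
    z = n + k
    code = [0] * (z + 1)           # index 0 unused so positions are 1-based

    # Rule 1: check positions stay 0, the other positions take the source bits.
    source = iter(data_bits)
    for pos in range(1, z + 1):
        if (pos & (pos - 1)) != 0:  # pos is not a power of two
            code[pos] = int(next(source))

    # Rule 2: each check bit is the XOR of the positions it covers.
    for j in range(k):
        p = 1 << j
        parity = 0
        for pos in range(1, z + 1):
            if pos & p and pos != p:
                parity ^= code[pos]
        code[p] = parity
    return "".join(str(bit) for bit in code[1:])


def check_bits(code: str) -> str:
    """Read back the check bits stored at the power-of-two positions."""
    bits, p = [], 1
    while p <= len(code):
        bits.append(code[p - 1])
        p <<= 1
    return "".join(bits)


def encode_base(base: str) -> str:
    """Hamming-encode the 7-bit ASCII code of a nucleotide letter."""
    return hamming_encode(format(ord(base), "07b"))


h = {base: check_bits(encode_base(base)) for base in "ACGT"}
B_cc = encode_base("C") + encode_base("C")
```

Running the sketch gives the code 01100000011 for C, the check bits 0001, 0100, 1101 and 0011 for A, C, G and T, and 0110000001101100000011 for the pair CC, in agreement with the values stated above.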

4.2.2 Paillier cryptosystem

To keep the architecture simple and flexible, the present invention uses the Paillier cryptosystem. The encryption algorithm is semantically secure, which guarantees that a computationally bounded adversary holding only the ciphertext cannot obtain plaintext information. The Paillier cryptosystem is a probabilistic asymmetric public-key encryption algorithm: encrypting the same message several times produces different ciphertexts. The algorithm generates a pair of keys, a private key sk and a public key pk. The public key is used to encrypt data and the private key is used to decrypt it. The Paillier cryptosystem may be defined as follows.

Key generation stage: two large prime numbers $p$ and $q$ are selected randomly and independently of each other such that $\gcd(pq, (p-1)(q-1)) = 1$. Calculate $\eta = pq$ and $\lambda = \mathrm{lcm}(p-1, q-1)$, and choose a random integer $g \in \mathbb{Z}^{*}_{\eta^2}$.

Ensure that $\eta$ divides the order of $g$ by checking that the following modular multiplicative inverse exists: $\mu = (L(g^{\lambda} \bmod \eta^2))^{-1} \bmod \eta$, where $L$ is defined as $L(x) = (x-1)/\eta$.

From this, the public key $(\eta, g)$ and the private key $(\lambda, \mu)$ are obtained.

Encryption stage: for a message $\omega$ ($0 < \omega < \eta$) to be encrypted, select a random number $\gamma$ ($0 < \gamma < \eta$) and calculate the ciphertext $c = g^{\omega} \cdot \gamma^{\eta} \bmod \eta^2$.

Decryption stage: for a ciphertext $c \in \mathbb{Z}^{*}_{\eta^2}$ to be decrypted, the plaintext is $\omega = L(c^{\lambda} \bmod \eta^2) \cdot \mu \bmod \eta$.

Homomorphic properties: given two plaintexts $\omega_1$ and $\omega_2$, the product of their ciphertexts decrypts to the sum of the plaintexts: $D_{sk}(E_{pk}(\omega_1) \cdot E_{pk}(\omega_2) \bmod \eta^2) = \omega_1 + \omega_2 \bmod \eta$.

Raising the ciphertext of one plaintext to the power of another plaintext decrypts to the product of the two plaintexts: $D_{sk}(E_{pk}(\omega_1)^{\omega_2} \bmod \eta^2) = \omega_1 \omega_2 \bmod \eta$.
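The key generation, encryption, decryption and homomorphic properties described above can be tried out with a toy Python sketch. It is only an illustration under stated assumptions: the primes 61 and 53 are far too small for real use (the experiments in this document use 1024-bit keys), the choice $g = \eta + 1$ is one commonly used valid value rather than something stated in the text, and the function names are hypothetical.

```python
import math
import random


def L(x: int, eta: int) -> int:
    """L(x) = (x - 1) / eta, evaluated with exact integer division."""
    return (x - 1) // eta


def keygen(p: int, q: int):
    """Return ((eta, g), (lam, mu)) for primes p and q."""
    eta = p * q
    if math.gcd(eta, (p - 1) * (q - 1)) != 1:
        raise ValueError("p and q do not satisfy the gcd condition")
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)   # lcm(p-1, q-1)
    g = eta + 1                                        # assumed choice of g
    mu = pow(L(pow(g, lam, eta * eta), eta), -1, eta)  # modular inverse (Python 3.8+)
    return (eta, g), (lam, mu)


def encrypt(pk, omega: int) -> int:
    eta, g = pk
    while True:                                        # random gamma coprime to eta
        gamma = random.randrange(1, eta)
        if math.gcd(gamma, eta) == 1:
            break
    return (pow(g, omega, eta * eta) * pow(gamma, eta, eta * eta)) % (eta * eta)


def decrypt(pk, sk, c: int) -> int:
    eta, _ = pk
    lam, mu = sk
    return (L(pow(c, lam, eta * eta), eta) * mu) % eta


if __name__ == "__main__":
    pk, sk = keygen(61, 53)                            # toy primes, illustration only
    c1, c2 = encrypt(pk, 15), encrypt(pk, 27)
    eta_sq = pk[0] ** 2
    print(decrypt(pk, sk, (c1 * c2) % eta_sq))         # sum of the plaintexts
    print(decrypt(pk, sk, pow(c1, 27, eta_sq)))        # product of the plaintexts
```

The additive property is what allows encrypted count values to be accumulated without decrypting the individual entries.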

Assume there are $n$ patients in the database, and let $Data = (S^1, S^2, \ldots, S^n)$ denote the gene sequence data set in the database. The SNP sequence of the $j$-th patient can be represented as $S^j = (d_1^j, d_2^j, \ldots, d_m^j)$, where $1 \le j \le n$. In the encryption process, detection bits $T_i^j$ are added to each genome datum $d_i^j$ by the Hamming code, giving $B_i^j = (T_i^j \mid d_i^j)$, where "|" indicates that the detection bits $T_i^j$ are added to the base pair $d_i^j$. The ciphertext of each datum is then $E_{pk}(B_i^j)$.

5 System design of the present invention

This section introduces the data processing model used by the present invention. First, the certification authority server processes the patient gene sequences and clinical information in the database. The clinical information of a patient mainly consists of the diagnosis of diseases for certain phenotypes. Suppose that the set of disease types is $I = (I_1, I_2, \ldots, I_g)$, where $1 \le g \le n$ and each value in the set represents one disease. Each patient may have one or more diseases. The following subsections discuss how the genotype and phenotype data of the patient cases are integrated into the hashMap. The genotype sequences and phenotype types of 5 patients are shown in Table 1.

5.1 Generation of hashMap

After receiving data provided by a data owner, the certification authority server creates an Entity of the hashMap for the SNP sequence of each patient, and generates a mapping table M after processing all the data, wherein the mapping table M comprises the genotype and phenotype information of each patient. After creating M, the certification authority server needs to create and update the Entity information of M for each new record from the data owner. Each Entity contains the following:

key: the key value of the Entity, which points to a specific datum and serves as the index value.

geno: part of the value of the Entity. It is the SNP sequence of the patient. For example, the SNP sequence of one patient is denoted as $S = (d_1, d_2, \ldots, d_m)$, and sid is the unique identifier of an SNP site, representing a specific position in the gene sequence. Each base pair $d_i$ ($1 \le i \le m$) is associated with its own site identifier.

I: the phenotype data set corresponding to the genome sequence.

count: the frequency with which the current SNP sequence occurs in the database, that is, the total number of records matching the genotype and phenotype.

next: the key value of the next Entity.

Table 2 HashMap for Table 1

The complexity of creating the hash table is O(mn), where m is the number of records in the database and n is the number of different SNP sites in each sequence record. The invention denotes each Entity as θ, with components θ(key, geno, I, count, next). Table 2 is obtained from the data of Table 1 according to the characteristics of the hash table.
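The construction of the mapping table M can be sketched as follows. This is an illustrative Python sketch, not the implementation described in this document: the `Entity` field names mirror θ(key, geno, I, count, next), but the class, the `build_hashmap` function, the use of sequential integers as keys and the sample records are all assumptions made for the example (the contents of Table 1 are not reproduced here).

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Optional, Set, Tuple


@dataclass
class Entity:
    key: int                                            # index value of this Entity
    geno: Tuple[str, ...]                               # SNP sequence (d_1, ..., d_m)
    phenotypes: Set[str] = field(default_factory=set)   # plays the role of I
    count: int = 0                                      # number of matching records
    next: Optional[int] = None                          # key of the next Entity


def build_hashmap(records: Iterable[Tuple[Iterable[str], Iterable[str]]]) -> Dict[int, Entity]:
    """Create or update one Entity per distinct SNP sequence."""
    table: Dict[int, Entity] = {}
    key_of: Dict[Tuple[str, ...], int] = {}
    for geno, phenotypes in records:
        geno = tuple(geno)
        key = key_of.get(geno)
        if key is None:                                 # first time this sequence is seen
            key = len(table)
            key_of[geno] = key
            table[key] = Entity(key=key, geno=geno)
            if key > 0:
                table[key - 1].next = key               # link the previous Entity
        entity = table[key]
        entity.phenotypes.update(phenotypes)            # keep every associated phenotype
        entity.count += 1
    return table


if __name__ == "__main__":
    # Hypothetical records: (SNP sequence, phenotypes).
    sample = [
        (["AG", "CC", "TT"], ["I1"]),
        (["AA", "CT", "TT"], ["I2"]),
        (["AG", "CC", "TT"], ["I3"]),
    ]
    for entity in build_hashmap(sample).values():
        print(entity)
```

Each record is touched once and each of its sites is read once when the sequence is turned into a tuple, which is consistent with the O(mn) bound stated above.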

5.2 encryption hashMap

5.2.1 encryption of bloom filters

Each Entity in M contains a Bloom filter value I, representing the phenotype information corresponding to the genomic sequence. The certification authority server processes the phenotype data using Bloom filter technology. For each SNP record in the genomic sequence, the corresponding phenotype is inserted into the Bloom filter. If multiple phenotypes are associated with the same genomic sequence, the invention inserts all of them into the Bloom filter of that genomic sequence entry.

The certification authority server sets the hash function H used in the Bloom filter and the domain of the public alphabet Σ. The domain of Σ is the set of all possible phenotypes. For each entry in the genomic sequence, the corresponding phenotype is inserted into the Bloom filter. Each phenotype in Σ is mapped to a unique number, and that number is inserted into the Bloom filter.

The Bloom filters of all Entities in M have the same length. To encrypt the Bloom filters, the present invention uses AES in CTR mode with a key size of 128 bits. The certification authority server selects a random key k for a pseudo-random function (PRF) F and encrypts each Bloom filter $I_i$ into an encrypted Bloom filter $I'_i$.

This encryption is done while the other data of the Entity is being encrypted. Besides sending the private key sk to the agency terminal, the certification authority server also needs to provide the key k of the PRF $F_k$ and the hash function of the Bloom filter. To match the phenotypes of a query during a search, the hash positions of the phenotype value $I_q$ of the query request are matched against the hash positions of $I_i$ in each Entity. If all positions set to 1 in $I_q$ are also set to 1 in $I_i$, the phenotype from the query matches the phenotype stored in that Entity.

Since the Bloom filter stored in each Entity is encrypted, it must first be decrypted before any of its positions can be checked. The certification authority server provides the PRF key k to the agency terminal, and the encrypted Bloom filter $I'_i$ is decrypted using secure function evaluation (SFE) inside a circuit.
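The matching rule for Bloom filters (every position set to 1 in $I_q$ must also be 1 in $I_i$) can be illustrated with a short Python sketch. It works on unencrypted filters only; the AES-CTR encryption and the SFE decryption step are deliberately left out. The filter size, the number of hash functions, the use of SHA-256 to derive positions and the function names are all assumptions of this sketch, not details taken from the document.

```python
import hashlib
from typing import Iterable, List


def positions(item: str, size: int, num_hashes: int) -> List[int]:
    """Derive num_hashes bit positions for an item (salted SHA-256, an assumption)."""
    result = []
    for i in range(num_hashes):
        digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
        result.append(int.from_bytes(digest[:8], "big") % size)
    return result


def make_filter(items: Iterable[str], size: int = 64, num_hashes: int = 3) -> int:
    """Build a Bloom filter, stored as an integer bit mask of fixed length."""
    bits = 0
    for item in items:
        for pos in positions(item, size, num_hashes):
            bits |= 1 << pos
    return bits


def matches(query_bits: int, stored_bits: int) -> bool:
    """True if every bit set in the query filter is also set in the stored filter."""
    return (query_bits & stored_bits) == query_bits


if __name__ == "__main__":
    stored = make_filter(["I1", "I2"])        # phenotypes of one Entity (hypothetical)
    print(matches(make_filter(["I1"]), stored))
    print(matches(make_filter(["I1", "I2"]), stored))
```

As with any Bloom filter, a match may occasionally be a false positive, while a phenotype that was inserted is never missed.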

5.2.2 Paillier Cryptosystem encryption

The certification authority server encrypts all Entities in M using the Paillier cryptosystem. It generates a key pair (pk, sk) for the homomorphic encryption scheme using the key generation algorithm, where pk is the public key and sk is the private key, and encrypts each entry in the hash table using the public key pk. To keep the whole search process fast enough while maintaining the security of the system, the present invention encrypts only the sensitive attributes of each Entity. Because the cloud server is semi-honest in the system of the invention and must be able to tell the querier the key value of the next Entity, the next information does not need to be encrypted. The encryption of an Entity is denoted $E_{pk}(\theta)$; after encryption, each entry has the form $\theta(key, E_{pk}(geno), E_{pk}(count), I'_i, next)$. The hashMap in which all entries of M are encrypted is referred to as the encrypted hashMap.

Finally, the certification authority server sends the encrypted hashMap to the cloud server. Meanwhile, the certification authority server transmits the key pair (pk, sk) to the agency terminal.

5.3 query encryption hashMap

The cloud server receives the encrypted hashMap sent by the certification authority server and executes the encrypted query requests sent by the agency terminal. The main idea of the query is to find the records whose phenotype satisfies $I_q$ and whose gene sequence matches the queried values at the specific sites stored in geno. The cloud server needs to know the sid in the query request of the researcher terminal and match the SNP at the specific site of geno stored in the Entity with the SNP at that site in the query condition of the researcher terminal. Since the values of the SNPs are all encrypted and the encryption scheme used by the present invention is probabilistic, the cloud server cannot determine whether the values match. The cloud server therefore sends the encrypted values to the agency terminal, and the agency terminal decrypts the encrypted SNPs and checks them for equality.

Suppose that a researcher terminal sends a query request $Q_q = (q_1 \cap q_2 \cap \ldots \cap q_{num} \cap I_q)$ with num query conditions, where $q_l$ ($1 \le l \le num$) represents a specific gene value of the SNP at a certain site. For example, the conditions in query $Q_q$ are $SNP_2 = CC$, $SNP_3 = TT$, $SNP_5 = AG$, $I_q = I_1$, and $q_1$ represents the condition $SNP_2 = CC$. The agency terminal adds a detection code to the base pair of each query condition q: $B_q = (T_q \mid q)$. The query is encrypted with the public key pk, and the query for each site can be represented as $\theta_q(sid, E_{pk}(B_q))$. After encryption, the conjunctive query request is: $E_{pk}(\theta_q) = E_{pk}(\theta_1) \cap E_{pk}(\theta_2) \cap \ldots \cap E_{pk}(\theta_{num}) \cap I'_q$.

For example, when the query $Q_q$ described above is executed on M in Table 2, the SNP values at sites sid 2, sid 3 and sid 5 of geno in each Entity are traversed, and these values are sent to the agency terminal for decryption and verification, so as to obtain the cases in the database that meet the query conditions. As shown in Table 2, only case #1 satisfies the query conditions, so the count result returned to the researcher terminal is count = 1.
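The logic of the conjunctive count query can be sketched on plaintext data in Python. The sketch ignores encryption entirely: in the scheme described above the cloud server only traverses ciphertexts, and the equality checks happen at the agency terminal after decryption. The dictionary layout of an entity, the `count_query` function and the sample data are hypothetical and do not reproduce Table 2.

```python
from typing import Dict, Iterable, Mapping


def count_query(entities: Iterable[Mapping], conditions: Dict[int, str], phenotype: str) -> int:
    """Sum the count of every entity that satisfies all SNP conditions and the phenotype."""
    total = 0
    for entity in entities:
        if phenotype not in entity["I"]:
            continue                                   # phenotype condition I_q fails
        geno = entity["geno"]                          # maps sid -> base pair
        if all(geno.get(sid) == value for sid, value in conditions.items()):
            total += entity["count"]
    return total


if __name__ == "__main__":
    # Hypothetical entities keyed by site identifier (sid).
    entities = [
        {"geno": {1: "AG", 2: "CC", 3: "TT", 4: "GG", 5: "AG"}, "I": {"I1"}, "count": 1},
        {"geno": {1: "AA", 2: "CC", 3: "CT", 4: "GG", 5: "AG"}, "I": {"I1"}, "count": 1},
        {"geno": {1: "AG", 2: "CC", 3: "TT", 4: "GG", 5: "AG"}, "I": {"I2"}, "count": 1},
    ]
    query = {2: "CC", 3: "TT", 5: "AG"}                # SNP_2 = CC, SNP_3 = TT, SNP_5 = AG
    print(count_query(entities, query, "I1"))
```

Because every entity is visited once per query, the running time grows with the number of entries and the number of query conditions.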

5.4 data integrity analysis

The cloud server sends the query result $R_q$ to the agency terminal, and the agency terminal decrypts the SNP values using the key sk. To protect the information exchanged between the agency terminal and the cloud server, the invention uses the Hamming code detection bits to check the integrity of the data. For a returned query result meeting the query conditions:

R_q＝{E_pk(count)|E_pk(B₁)|E_pk(B₂)|…|E_pk(B_num)|I_i′}

The invention recalculates the Hamming code detection bits of the decrypted data as $T'$ and compares $T'$ with $T_q$ to determine whether the data is complete.

1) $T' = T_q$: the data is complete, safe and credible;

2) $T' \ne T_q$ and $T' \in \{h_A, h_C, h_G, h_T\}$: the geno value in the query has been tampered with, but is still a combination of the four nucleotides A, C, G, T. The returned data can be corrected by means of the detection bits $T_q$, and the hashMap can be updated. The returned count value count is no longer trusted.

3) $T' \ne T_q$ and $T' \notin \{h_A, h_C, h_G, h_T\}$: the original data has been tampered with and changed to other content, and the returned query result count is not credible for the patient's SNP sequence.
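The three cases of the integrity check can be expressed as a small Python classifier. This is a sketch under stated assumptions: the check-bit values come from the text above ($h_A = 0001$, $h_C = 0100$, $h_G = 1101$, $h_T = 0011$), the detection bits of a base pair are taken to be the concatenation of the two per-base values, and the function names and return labels are hypothetical.

```python
# Check bits per nucleotide, as given in the text.
H = {"A": "0001", "C": "0100", "G": "1101", "T": "0011"}

# All 2^4 = 16 combinations of two per-base check-bit values.
VALID_PAIR_BITS = {a + b for a in H.values() for b in H.values()}


def detection_bits(pair: str) -> str:
    """Detection bits T for a base pair such as 'CC' (concatenation is an assumption)."""
    return "".join(H[base] for base in pair)


def classify(t_prime: str, t_q: str) -> str:
    """Map recomputed bits T' and expected bits T_q to one of the three cases."""
    if t_prime == t_q:
        return "intact"                   # case 1: data is complete
    if t_prime in VALID_PAIR_BITS:
        return "tampered_nucleotide"      # case 2: still a valid A/C/G/T combination
    return "tampered_other"               # case 3: changed to other content


if __name__ == "__main__":
    expected = detection_bits("CC")
    print(classify(detection_bits("CC"), expected))
    print(classify(detection_bits("AG"), expected))
    print(classify("11111111", expected))
```

In the second and third cases the returned count should not be trusted; only the second case leaves enough structure for the returned data to be corrected from $T_q$.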

Regarding security analysis, the present invention considers only the disclosure of the geno sequences in the database to the participants, since such disclosure leads to serious security problems. The invention evaluates the proposed privacy protocol and the security of the system by considering what information the participants are able to infer at different stages of the system. The leakage profiles of the different participants in the proposed model are given below.

Data leakage in the hashMap construction and encryption phase: the certification authority server is a trusted entity responsible for generating and encrypting M, so no data is leaked at the certification authority server at this stage.

Leakage to the cloud server in each query: during query execution the cloud server never handles plaintext information, so no data is disclosed to the cloud server in the query phase.

Leakage of each query to the agency terminal and the researcher terminal: the agency terminal is mainly responsible for encrypting the query request of the researcher terminal with the key and sending it to the cloud server, receiving the query result returned by the cloud server, decrypting it and returning the decrypted result to the researcher terminal. The SNP query results returned to the agency terminal are verified with the Hamming code technique, and the researcher terminal cannot directly deduce any information from the Hamming code detection bits. The present invention does not consider privacy leakage through the output itself. Under these assumptions the system is safe and reliable.

6 Experimental analysis of the present invention

Since the beginning of the Human Genome Project, researchers have reported a large number of SNPs. The availability of good SNP sites as candidate genes makes it possible to study the association of candidate genes across the whole genome. Linkage disequilibrium (LD) techniques have been widely used to develop high-quality SNP marker maps. When applied to disease-gene mapping, LD is assessed by association analysis, which compares allele or haplotype frequencies between affected individuals (e.g., diagnosed with Alzheimer's disease) and control individuals (e.g., not diagnosed with Alzheimer's disease). Toivonen et al. proposed an LD mapping data mining method called Haplotype Pattern Mining (HPM) for generating a simulated SNP data set. The practicality of the data privacy protection framework provided by the invention is evaluated by test experiments on a simulated SNP data set. The system is implemented in Java. The hardware environment is an Intel i7-4710MQ CPU (4 cores, 1.6 GHz) with 8 GB of memory; the software environment is the Windows 10 operating system and the IntelliJ IDEA platform. In the experiments, two different machines run the cloud server and the certification authority server. Hamming codes are used to add detection bits to the gene data, and 1024-bit Paillier encryption is used.

In the secure count query experiments, the present invention uses SNP data sets of 500, 1000 and 5000 records and four query sizes involving 10, 20, 30 and 40 randomly selected SNP sequences. The influence of the number of records on the hashMap construction time is tested. The test results are compared with the index tree method used by Mohammad Zahidul Hasan et al. and with the cryptographic hardware (SCP) method used by M. Experimental results show that the method provided by the invention is superior in encryption time and query execution time. The experiments are analyzed from three aspects:

(1) HashMap generation time. This is the time required to process a genomic database and construct a hashMap from the genotypes and phenotypes. The present invention analyzes the creation time for data sets containing 500 and 1000 SNPs, as shown in FIG. 2. The experiments show that the time consumption increases with the number of SNPs. Compared with the index tree used by Mohammad Zahidul Hasan et al., the hashMap of the invention takes longer to create, but the creation time, measured in seconds, accounts for only a small proportion of the overall running time.

(2) Encryption time. The encryption time of the method is mainly spent on encrypting the base pairs and the count value of each Entity; the CTR scheme used to encrypt the Bloom filters takes little time and can be ignored. The encryption time also depends on the total number of sequences in the SNP data set. As shown in FIG. 3, the present invention analyzes the encryption times for data sets of 500 and 1000 SNPs. Although the encryption time grows as the number of SNPs increases, the method of the invention has a clear advantage in encryption time over the index tree. For 1000 records, for example, the invention requires only 2.11 min.

(3) Query time. The query time is the time from the researcher terminal submitting the query request until the returned result is obtained. To measure the query time, the present invention randomly selects SNP sequences of size 10, 20, 30 and 40 and performs the queries on 5000 records. The execution times of these query sizes on the encrypted hashMap are listed in FIG. 4. Because the phenotype information in the entries of the hashMap must be searched and matched, all entries in the hashMap are traversed to obtain the gene sequences meeting the query conditions. FIG. 2 shows the comparison of the query execution time with that of the compared methods; the execution time of the present invention increases linearly with the number of queries and is on the order of seconds.

Table 3 Comparison of query execution time on 5000 records

By varying the record size of the count query and the total size of the queried data sets, the present invention analyzes the time needed to store data and execute queries using a hashMap and compares it with previously proposed methods. The method effectively protects the privacy of the data by storing the data in a hashMap and using a homomorphic encryption scheme. The method supports the storage of large data sets. The processing and encryption of the sensitive data of an Entity does not directly affect the number of records. The invention uses the error correction capability of Hamming codes to add detection bits to the original data, and the integrity of the returned results is verified at the agency terminal, which effectively ensures the accuracy of the results returned to the researcher terminal. These characteristics allow the method to execute count queries on encrypted data more safely and efficiently.

Addressing the problem of secure count queries on encrypted genome data, and based on an analysis of existing schemes, the invention provides a safe and effective method for outsourcing genome data. To achieve data privacy, the method stores the data of the database in a hashMap and outsources the stored data to a third-party cloud server. A third-party agency terminal enables the researcher terminal to run secure count queries on the cloud service. To verify the integrity of the data, the invention adds detection bits to the genome data using Hamming codes, so that no sensitive gene data is revealed in the data processing and query execution stages.

Claims

1. A security counting inquiry and integrity verification device based on encrypted genome data is characterized by comprising a data owner terminal, an authentication mechanism server, a cloud server, an agency terminal and a researcher terminal;

2. The apparatus for secure count query and integrity verification based on encrypted genomic data according to claim 1,

the cloud server generates a group of SNP data sets of the specific genes and frequency values of the specific genes appearing in the database, and then normalizes the values into total record numbers;

the cloud server represents the total number of counting queries as:

3. The apparatus for secure count query and integrity verification based on encrypted genomic data according to claim 1,

after receiving data provided by a data owner, the certification authority server creates an Entity of a map for the SNP sequence of each patient, processes all the data to generate a mapping table M, wherein the M comprises genotype and phenotype information of each patient, and after the M is created, the certification authority server creates and updates the Entity information of the M for each new record from the data owner;

each Entity contains the following:

key: the key value of the Entity, which points to a specific datum and serves as the index value,

I: the phenotype data set corresponding to the genome sequence,

next: the key value of the next Entity.

4. The apparatus for secure count query and integrity verification based on encrypted genomic data according to claim 3,

the complexity of creating the hash table by the certification authority server is O(mn), where m is the number of records in the database and n is the number of different SNP sites in each sequence record, and each Entity is defined as θ, with components denoted θ(key, geno, I, count, next).

5. The apparatus for secure count query and integrity verification based on encrypted genomic data according to claim 1,

the certification authority server processes the phenotype data by using Bloom filter technology: for each SNP record in a genome sequence, the corresponding phenotype is inserted into the Bloom filter; if multiple phenotypes are associated with the same genome sequence, all of them are inserted into the Bloom filter of that genome sequence entry; each Entity in M contains a Bloom filter value I which represents the phenotype information corresponding to the genome sequence;

the certification authority server sends the PRF key k to the agency terminal, and the encrypted Bloom filter $I'_i$ is decrypted with that key.

6. The apparatus for secure count query and integrity verification based on encrypted genomic data according to claim 1,

the certification authority server generates a key pair (pk, sk), where pk is the public key and sk is the private key, encrypts each entry in the hash table using the public key pk, and denotes the encryption of an entry as $E_{pk}(\theta)$;

and sends the encrypted hashMap to the cloud server;

7. The apparatus for secure count query and integrity verification based on encrypted genomic data according to claim 1,

the cloud server sends the query result $R_q$ to the agency terminal; the agency terminal decrypts the SNP values using the key sk and verifies the integrity of the data using the Hamming code detection bits; for a returned query result meeting the query conditions:

R_q＝{E_pk(count)|E_pk(B₁)|E_pk(B₂)|…|E_pk(B_num)|I_i′}

the Hamming code detection bits of the decrypted data are recalculated as $T'$;

the returned data is corrected and the hashMap is updated;

if $T' \ne T_q$ and $T' \notin \{h_A, h_C, h_G, h_T\}$, the agency terminal judges that the original data has been tampered with and changed to other content, in which case the returned query result count is not credible for the patient's SNP sequence.

8. A safety counting inquiry and integrity verification method based on encrypted genome data comprises the following steps: