CN115622684B - Privacy computation heterogeneous acceleration method and device based on fully homomorphic encryption - Google Patents

Privacy computation heterogeneous acceleration method and device based on fully homomorphic encryption

Info

Publication number
CN115622684B
CN115622684B
Authority
CN
China
Prior art keywords
algorithm
ciphertext
ntt
polynomial
memory
Prior art date
Legal status
Active
Application number
CN202211433166.3A
Other languages
Chinese (zh)
Other versions
CN115622684A (en)
Inventor
蒋琳
赵鑫
刘虎成
陈倩
方俊彬
王轩
张加佳
李君一
Current Assignee
Jinan University
Shenzhen Graduate School Harbin Institute of Technology
Original Assignee
Jinan University
Shenzhen Graduate School Harbin Institute of Technology
Priority date
Filing date
Publication date
Application filed by Jinan University, Shenzhen Graduate School Harbin Institute of Technology
Priority to CN202211433166.3A
Publication of CN115622684A
Application granted
Publication of CN115622684B
Status: Active

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 9/00 Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L 9/008 involving homomorphic encryption
    • H04L 9/08 Key distribution or management, e.g. generation, sharing or updating, of cryptographic keys or passwords
    • H04L 9/0861 Generation of secret information including derivation or calculation of cryptographic keys or passwords
    • H04L 9/0894 Escrow, recovery or storing of secret information, e.g. secret key escrow or cryptographic key storage
    • H04L 9/0897 involving additional devices, e.g. trusted platform module [TPM], smartcard or USB
    • H04L 2209/00 Additional information or applications relating to cryptographic mechanisms or cryptographic arrangements for secret or secure communication H04L 9/00
    • H04L 2209/12 Details relating to cryptographic hardware or logic circuitry
    • H04L 2209/125 Parallelization or pipelining, e.g. for accelerating processing of cryptographic operations

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a privacy computation heterogeneous acceleration method and device based on fully homomorphic encryption. By using the memory hierarchy of the GPU, the number of memory-access-heavy tasks scheduled on an SM at the same time is reduced, more shared memory is allocated to raise the memory hit rate, and communication with global memory is reduced. A heterogeneous computing stream is designed so that limited hardware resources are shared both temporally and spatially. A key challenge in implementing the NTT/INTT algorithms on a GPU is allocating threads efficiently to achieve high utilization: for optimal performance all threads should be kept busy, and the workload of each thread should be equal.

Description

Privacy computation heterogeneous acceleration method and device based on fully homomorphic encryption
Technical Field
The invention belongs to the technical field of privacy computation, and particularly relates to a privacy computation heterogeneous acceleration method and device based on fully homomorphic encryption.
Background
In recent years data has become an important resource and an important factor of production. Compared with traditional factors of production, data can only become a strategic resource that circulates freely if it does so securely, so the step of guaranteeing data privacy and security cannot be avoided. Modern cryptography has found application in countless digital systems and components and has become an important tool for securing data and privacy. However, encryption technology itself (including the widely used public key encryption, PKE) still has a limitation: sensitive data must be decrypted before it can be processed and analyzed.
Privacy Computing, also called Privacy-Enhancing Computing or Confidential Computing, is a family of technologies and systems for realizing multi-party collaborative computation on data without transferring or revealing the original data, so that the data is "usable but not visible". Among the many privacy computing technologies, Fully Homomorphic Encryption (FHE) is a form of encryption that supports algebraic operations directly on encrypted data (ciphertext) without decrypting the ciphertext before the operation. The computed result is also encrypted, and only the owner of the private key can decrypt and access it; the result is equivalent to performing the same operation on the unencrypted version (plaintext) of the data. This property allows an untrusted third party to operate directly on the ciphertext without the private key, avoiding the leakage of sensitive user information that would result if the third party had to decrypt the ciphertext during the computation.
Compared with differential privacy, homomorphic encryption comes with cryptographic security proofs and therefore offers higher security. Differential privacy by its nature still requires the data to be transmitted elsewhere and usually requires a trade-off between accuracy and privacy, so its security is lower than that of homomorphic encryption. Compared with secure multi-party computation, homomorphic encryption has the advantage of fewer interaction rounds: secure multi-party computation requires multiple rounds of interaction among several participants to produce a result, and the communication cost is high. Fully homomorphic encryption provides a more general capability for data privacy protection and a stronger foundation for future data infrastructure, and can in theory perfectly resolve the tension between data protection and data circulation. However, the technical bottleneck of fully homomorphic encryption in terms of computational efficiency greatly limits its practicality and wider adoption.
Regarding improvements to the efficiency of homomorphic encryption, the homomorphic encryption algorithm has mainly been optimized on the CPU along three lines: the bootstrapping algorithm, ciphertext packing, and noise-free FHE schemes. Bootstrapping reduces the accumulated noise to a level comparable to that of a fresh ciphertext, but because it repeatedly evaluates the decryption circuit its computational overhead is large and makes FHE inefficient, so much follow-up work has focused on optimizing bootstrapping. Besides optimizing bootstrapping, another effective means of "speeding up" fully homomorphic encryption is batching, also known as packing, which allows multiple data values to be encrypted into one ciphertext and enables single-instruction-multiple-data (SIMD) operations to be performed homomorphically. The third research direction is to construct noise-free FHE schemes. Noise is added to guarantee the security of the FHE scheme, but it also brings the burden of noise control; noise-free FHE schemes are generally considered insecure, although this conclusion has not been rigorously proven.
Disclosure of Invention
The invention mainly aims to overcome the defects and shortcomings of the prior art and provide a privacy computation heterogeneous acceleration method and device based on fully homomorphic encryption.
In order to achieve the purpose, the invention adopts the following technical scheme:
In a first aspect, the present invention provides a privacy computation heterogeneous acceleration method based on fully homomorphic encryption, including the following steps:
S1, a user sets the required public parameters pp according to the requirements; on the basis of the public parameters pp, the key generation algorithm of the fully homomorphic encryption is called to generate a private key sk, a public key pk and an evaluation key evk;
S2, a coding algorithm is called to encode the input and map it onto the integer domain to form a polynomial;
S3, an encryption algorithm is called to encrypt the plaintext and output a ciphertext;
S4, after the data has been encrypted into ciphertext, homomorphic operations under the ciphertext are executed, with the following computation process:
S41, the ciphertext and the parameters related to the computation are copied from CPU memory to the corresponding GPU memory, the ciphertext polynomial ring over very large integers is decomposed with the Chinese remainder theorem, and the ciphertext polynomial in vector form is decomposed into matrix form;
S42, each row of the decomposed ciphertext polynomial matrix is regarded as an independent NTT operation group, and a correspondingly optimized NTT algorithm is selected according to the computation scale of the ciphertext matrix and the hardware resources of the GPU to convert the ciphertext polynomial matrix into a ciphertext point-value representation matrix; in the optimized NTT algorithm, the NTT module on the GPU adopts the negative wrapped convolution (NWC) to halve the computation scale of the NTT algorithm, and the NWC corrects the result offset introduced by the polynomial ring by generating a preprocessing sequence with which the array elements to be transformed are preprocessed;
S43, after the ciphertext point-value representation matrix is obtained, homomorphic computation is performed according to the required operation function; a Barrett reduction algorithm is implemented on the GPU, replacing the division in the modular reduction by two independent shift operations and one multiplication, so that for a modulus of bit length K the intermediate variables of the computation are at most 2K bits;
S44, after the operation function has been executed, the INTT algorithm is called to convert the computed ciphertext point-value representation matrix back into a ciphertext polynomial matrix;
S45, the ICRT algorithm is called to collect each column of the ciphertext polynomial matrix and recover the final ciphertext polynomial vector, and the computation result in GPU memory is written back to CPU memory;
S5, a decryption algorithm is called to decrypt the computed ciphertext polynomial and obtain a plaintext polynomial;
S6, a decoding algorithm is called to decode the plaintext polynomial into the computation result in plaintext form.
As a preferred technical solution, the public parameters pp include the parameters pp_RLWE of the RLWE hardness problem.
As a preferred technical solution, in step S41 the ciphertext polynomial ring over very large integers is decomposed with the Chinese remainder theorem, and the ciphertext polynomial in vector form is decomposed into matrix form; for a ciphertext polynomial c(X) in Z_q[X]/(X^N + 1), this specifically comprises the following steps:
S411, the ciphertext polynomial is expressed as a coefficient vector (c_0, c_1, ..., c_{N-1});
S412, a sequence of k primes (p_1, p_2, ..., p_k) is generated;
S413, each entry of the coefficient vector is decomposed with the CRT algorithm into its residues modulo p_1, ..., p_k, giving a k x N residue matrix;
S414, each row is regarded as an independent NTT operation group;
S415, finally, the ICRT algorithm is used to collect the result of each column and recover the final calculation result.
As a preferred technical solution, the preprocessing of the array elements is merged into the butterfly transform by generating a special twiddle factor sequence; the NWC-NTT generates the special twiddle factor sequence as follows:
PSI_list = [1, ψ^{N/2}, ψ^{N/4}, ψ^{3N/4}, ψ^{N/8}, ψ^{3N/8}, ψ^{5N/8}, ψ^{7N/8}, ..., ψ^1, ψ^3, ψ^5, ..., ψ^{N-5}, ψ^{N-3}, ψ^{N-1}], where ψ is a primitive 2N-th root of unity modulo the coefficient modulus, i.e. the square of ψ is the N-th root of unity used by the ordinary NTT. By traversing only PSI_list, the negative wrapped convolution operation is merged into the butterfly transform of the NTT algorithm, yielding a modified NTT algorithm (the NWC-NTT); the optimized NTT algorithm has a space complexity of O(N).
As a preferred technical scheme, the NTT operation adopts a CT butterfly structure, accepts input in a standard order, and generates output in a bit-reversal order; INTT operations take inputs in bit-reversed order using the GS butterfly structure and produce outputs in standard order.
As a preferred technical solution, the memory access pattern is optimized according to the data access regularity of the NTT algorithm, specifically:
during the computation of NTT/INTT, each element of the polynomial coefficient array poly_coeff is accessed log2(n) times; a single request is issued to global memory to copy an element of poly_coeff into shared memory, and the subsequent log2(n) accesses are served from shared memory, thereby avoiding log2(n) global-memory accesses for a single element; the read/write speed of shared memory is far better than that of global memory.
As a preferred technical solution, the thread workload is dynamically allocated according to the computation scale and the hardware resources, specifically:
each thread independently processes one butterfly operation and is assigned the required operands and the corresponding twiddle factor in each iteration;
the number of operands consumed by a single butterfly operation equals the radix of the NTT algorithm; for each butterfly of the radix-2 NTT algorithm, two elements of the array poly_coeff participate in the computation, and one GPU thread is used for these two elements; the thread uses the target_idx variable to compute the index of the first array element it processes in each iteration, the variable step determines the second element of poly_coeff assigned to the same thread, and each thread uses the step_group index variable to locate the corresponding twiddle factor in the precomputed twiddle factor array psis.
As a preferred technical solution, the kernels are invoked dynamically according to the butterfly grouping of each iteration of the NTT algorithm, specifically:
a hybrid invocation method is proposed, which groups the elements of the array poly_coeff to be processed according to the size of step; when the NTT algorithm starts, the butterflies span large strides and the groups are processed in the multi-kernel mode; the size of step_group halves with each iteration of the algorithm, and once a whole group fits within a single GPU block the computation switches from the multi-kernel mode to the single-kernel mode.
In a second aspect, the present invention also provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores computer program instructions executable by the at least one processor to enable the at least one processor to perform the above privacy computation heterogeneous acceleration method based on fully homomorphic encryption.
In a third aspect, the present invention further provides a computer-readable storage medium storing a program which, when executed by a processor, implements the above privacy computation heterogeneous acceleration method based on fully homomorphic encryption.
Compared with the prior art, the invention has the following advantages and beneficial effects:
For the hardware architecture of the GPU, the system optimizes the fully homomorphic encryption algorithm at both the memory and the instruction level: Blocks in the GPU are allocated dynamically according to the computational load, tasks with a large amount of computation are split, tasks with a small amount of computation are merged into larger ones, and memory-access contention during result merging is controlled. By exploiting the memory hierarchy of the GPU, the number of memory-access-heavy tasks scheduled on an SM at the same time is reduced, more shared memory is allocated to raise the memory hit rate, and communication with global memory is reduced. A heterogeneous computing stream is designed so that limited hardware resources are shared both temporally and spatially. Different computational tasks and different blocks are distributed to different GPU computing units by means of large-integer decomposition, matrix splitting and similar techniques, achieving spatial parallelism. Data transfers are hidden behind computation through pipelining, and temporal parallelism is designed and realized by exploiting the fact that different modules of the fully homomorphic encryption algorithm can execute at the same time. A challenge in implementing the NTT/INTT algorithms on a GPU is allocating threads efficiently to achieve high utilization: for optimal performance all threads should be kept busy, and the workload of each thread should be equal. The invention adopts a new number theoretic transform architecture to overcome the challenges posed by the complex data dependencies and irregular access patterns of the NTT.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a schematic structural diagram of the CKKS heterogeneous computation scheme according to an embodiment of the invention;
FIG. 2 is a block diagram of the privacy computation heterogeneous acceleration system based on fully homomorphic encryption according to an embodiment of the present invention;
FIG. 3 is an example of CT-NTT and GS-INTT with n = 8 according to an embodiment of the present invention;
FIG. 4 is a diagram of a Radix2-NTT iterative memory access example with n = 4096 according to an embodiment of the present invention;
FIG. 5 is a diagram of Radix2-NTT step_group grouping iterations according to an embodiment of the present invention;
FIG. 6 is a diagram of a hybrid-radix example with N = 8 according to an embodiment of the present invention;
FIG. 7 is a step_group hybrid-invocation iteration diagram of the NTT of the present invention;
FIG. 8 is a diagram illustrating the fully homomorphic encryption privacy computing model in a cloud environment according to an embodiment of the present invention;
FIG. 9 is a block diagram of a computer-readable storage medium according to an embodiment of the present invention;
fig. 10 is a block diagram of an electronic device according to an embodiment of the invention.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by a person skilled in the art that the embodiments described herein can be combined with other embodiments.
Homomorphic encryption algorithms based on the (R)LWE hardness problem map plaintexts over the integer or real field to ciphertexts in a polynomial ring, and the amount of computation under ciphertext is on the order of ten thousand times that on plaintext. In terms of computational efficiency, the basic operations of homomorphic evaluation under ciphertext involve addition and multiplication of integer-coefficient polynomials of very large dimension, which makes fully homomorphic computation extremely inefficient.
As shown in fig. 1, aiming at the performance problem of the low computational efficiency of fully homomorphic encryption, the invention designs and implements a GPU-oriented NTT algorithm with higher parallelism and lower computational complexity, and realizes a CPU + GPU heterogeneous privacy computation method based on the CKKS homomorphic encryption scheme, thereby greatly improving the efficiency of homomorphic computation under ciphertext.
In the heterogeneous scheme of the invention, the CPU serves as a general-purpose processor with few computing units and low parallelism; tasks such as data preprocessing, encoding, batching of requests, thread scheduling and memory mapping are executed on the CPU, while computation-intensive tasks such as ciphertext addition, ciphertext multiplication and NTT/INTT are executed on the GPU. To break through the efficiency bottleneck of homomorphic encryption, the method mainly uses the GPU to accelerate homomorphic computation under ciphertext.
As shown in fig. 2, the privacy computation heterogeneous acceleration method based on fully homomorphic encryption of the present embodiment includes the following steps:
S1, set the required public parameters according to the requirements: Setup(1^λ) → pp takes the security parameter λ as input and returns the public parameters pp, including the parameters pp_RLWE of the RLWE hardness problem. On this basis, KeyGen(pp) → (sk, pk, evk) is called to generate the private key sk, the public key pk and the evaluation key evk.
S2, the CKKS scheme uses the rich structure of the integer polynomial ring to realize the mapping from the plaintext space to the ciphertext space. Real-world data mostly appears in the form of complex vectors, so the encoding algorithm Encode(z; Δ) is invoked first, where Δ is the scaling factor: the input z, an n-dimensional complex vector, is encoded and mapped onto the integer ring to form a polynomial m(X) in R = Z[X]/(X^N + 1), where Z denotes the set of integers, R denotes the set of all polynomials in X over Z reduced modulo (X^N + 1), and (X^N + 1) is a univariate irreducible polynomial of degree N.
S3, the CKKS scheme used in this embodiment is a public-key encryption scheme based on the RLWE hardness problem, in which the generated public key is used for encryption and may be shared, while the private key is used for decryption and must be kept secret. The encryption algorithm Enc(pk, m) → ct is invoked: it encrypts the plaintext m in R and outputs the ciphertext ct in R_q × R_q, where q > 1 is an integer, Z_q denotes the set of integers modulo q, Z_q[X] denotes the set of all polynomials in X over Z_q, and R_q = Z_q[X]/(X^N + 1) denotes the corresponding polynomial residue class ring.
S4, after the data has been encrypted into ciphertext, homomorphic operations under the ciphertext are executed. To break through the bottleneck of the large amount and low efficiency of fully homomorphic computation under ciphertext, a homomorphic evaluation method with lower computational complexity and higher parallelism is designed and implemented on the GPU, exploiting the GPU's capacity for massively parallel computation. The specific computation process is as follows:
S41, the ciphertext and the parameters related to the computation are copied from CPU memory to the corresponding GPU memory; the ciphertext polynomial ring over very large integers is first decomposed with the Chinese Remainder Theorem (CRT), and the ciphertext polynomial in vector form is decomposed into matrix form, i.e. a k × N residue matrix with one row per CRT modulus.
It can be understood that in lattice-based homomorphic encryption schemes the dimension and the coefficients of the generated ciphertext polynomials are very large, and evaluating functions on the ciphertext is very time-consuming: a naive algorithm for adding two degree-N polynomials has complexity O(N), and multiplication has complexity O(N^2). To reduce the computational complexity, the commonly used tool is the Number Theoretic Transform (NTT), which lowers the complexity of polynomial multiplication to O(N log N); however, for particularly large-scale multiplications of big integers the efficiency is still far from ideal.
Therefore, the invention first optimizes the NTT algorithm in terms of reducing computational complexity and improving parallelism. One important factor in reducing the computational complexity of the NTT is avoiding multiplications of very large integers; the invention adopts the Chinese Remainder Theorem (CRT) to decompose a very large integer into k small integers that can be computed in parallel, performing the large-scale integer operation as many low-cost small operations, and then obtains the best computational efficiency through the GPU's massive parallelism. For a ciphertext polynomial c(X) in Z_q[X]/(X^N + 1), N is on the order of 2^10 to 2^15 and the coefficient size is 128 to 256 bits; the specific steps of decomposing the large-integer polynomial with the CRT algorithm are as follows:
S21, the ciphertext polynomial is expressed as a coefficient vector (c_0, c_1, ..., c_{N-1});
S22, a sequence of k primes (p_1, p_2, ..., p_k) is generated;
S23, each entry of the coefficient vector is decomposed with the CRT algorithm into its residues modulo p_1, ..., p_k, giving a k × N residue matrix;
S24, each row is regarded as an independent NTT operation group;
S25, finally, the ICRT algorithm is used to collect the result of each column and recover the final calculation result.
The entries decomposed by the CRT algorithm have no dependency on one another, a structure that is well suited to the single-instruction multiple-data parallel computing mode of the GPU, in which a single control unit dispatches instructions to each pipeline and all processing elements execute the same instruction simultaneously. Here k is a balancing factor related to storage and memory access, and its value affects the scale of the computational tasks. Specifically, increasing k decreases the time of a single computation but increases the computation size and the number of memory accesses; conversely, decreasing k increases the time of a single computation while decreasing the computation size and the number of memory accesses.
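As an aid to understanding, the following is a minimal CPU-side Python sketch of the CRT/ICRT decomposition and recovery described above; the moduli and coefficients are illustrative assumptions only, an actual CKKS implementation would choose NTT-friendly primes p_i ≡ 1 (mod 2N), and the GPU version processes the rows of the residue matrix in parallel:

```python
# Reference sketch (not the GPU implementation): split large coefficients into
# residues modulo k small primes, and recover them with the inverse CRT.
from functools import reduce

def crt_decompose(coeffs, moduli):
    """Return the k x N residue matrix: row j holds the coefficients mod p_j."""
    return [[c % p for c in coeffs] for p in moduli]

def icrt_recover(matrix, moduli):
    """Recombine the residue matrix column by column via the CRT."""
    P = reduce(lambda a, b: a * b, moduli)           # product of all moduli
    recovered = []
    for column in zip(*matrix):                      # one column per coefficient
        x = 0
        for r, p in zip(column, moduli):
            Pi = P // p
            x += r * Pi * pow(Pi, -1, p)             # r_j * P_j * (P_j^{-1} mod p_j)
        recovered.append(x % P)
    return recovered

# Illustrative small NTT-friendly primes (assumed) and toy coefficients.
moduli = [12289, 40961, 65537]
coeffs = [123456789012, 987654321098, 42, 7]         # all below the product of the moduli
matrix = crt_decompose(coeffs, moduli)
assert icrt_recover(matrix, moduli) == coeffs
```

Arithmetic on the residue matrix (row-wise NTTs and point-wise products modulo each p_j) then replaces arithmetic on the original large coefficients, which is what lets the workload map onto independent GPU blocks.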
Furthermore, the NTT module on the GPU of the present invention employs the negative wrapped convolution (NWC) to halve the computation scale of the NTT algorithm; the negative wrapped convolution corrects the result offset introduced by the polynomial ring by generating a preprocessing sequence with which the array elements to be transformed are preprocessed.
In the invention, this preprocessing of the array elements is merged into the butterfly transform by generating a special twiddle factor sequence. The NWC-NTT generates the special twiddle factor sequence as follows:
PSI_list = [1, ψ^{N/2}, ψ^{N/4}, ψ^{3N/4}, ψ^{N/8}, ψ^{3N/8}, ψ^{5N/8}, ψ^{7N/8}, ..., ψ^1, ψ^3, ψ^5, ..., ψ^{N-5}, ψ^{N-3}, ψ^{N-1}], where ψ is a primitive 2N-th root of unity modulo the coefficient modulus, i.e. the square of ψ is the N-th root of unity used by the ordinary NTT. By traversing only PSI_list, the negative wrapped convolution operation is merged into the butterfly transform of the NTT algorithm. The optimized NTT algorithm has a space complexity of O(N), and the computation scale is halved.
In the NTT transform, the butterfly operation is invoked cyclically as the arithmetic logic unit. The NTT operation adopts the CT butterfly structure, taking input in standard order and producing output in bit-reversed order; the INTT operation adopts the GS butterfly structure, taking input in bit-reversed order and producing output in standard order. As shown in FIG. 3, the pre-processing and post-processing steps are eliminated at the cost of using two different butterfly structures; the parallel acceleration scheme on the GPU is designed around this modified NWC-NTT algorithm, and an operation example of the NWC-NTT algorithm with n = 8 is shown in FIG. 3.
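To make the merged negacyclic butterflies concrete, the following Python sketch is a CPU reference of the CT-based forward NWC-NTT and the GS-based inverse transform, with the ψ powers folded into bit-reversed twiddle tables; the toy parameters q = 17, n = 8 and ψ = 3 are assumptions chosen only so that ψ is a primitive 2n-th root of unity, and the GPU kernels parallelize the inner butterfly loops rather than running them serially as here:

```python
# CPU reference sketch of the merged NWC-NTT / INTT (CT forward, GS inverse).
# q, n and psi below are assumed toy parameters; psi must be a primitive
# 2n-th root of unity mod q (so psi^n = -1), which encodes the negacyclic wrap.

def bit_reverse(x, bits):
    r = 0
    for _ in range(bits):
        r = (r << 1) | (x & 1)
        x >>= 1
    return r

def make_tables(n, q, psi):
    bits = n.bit_length() - 1
    psi_inv = pow(psi, q - 2, q)                       # q is prime, so Fermat inverse
    fwd = [pow(psi, bit_reverse(i, bits), q) for i in range(n)]   # PSI_list analogue
    inv = [pow(psi_inv, bit_reverse(i, bits), q) for i in range(n)]
    return fwd, inv

def ntt_ct(a, fwd, q):
    """Forward negacyclic NTT with CT butterflies: standard order in, bit-reversed out."""
    a = list(a); n = len(a); t = n; m = 1
    while m < n:
        t //= 2
        for i in range(m):
            s = fwd[m + i]                             # twiddle for this butterfly group
            for j in range(2 * i * t, 2 * i * t + t):
                u, v = a[j], a[j + t] * s % q
                a[j], a[j + t] = (u + v) % q, (u - v) % q
        m *= 2
    return a

def intt_gs(a, inv, q):
    """Inverse negacyclic NTT with GS butterflies: bit-reversed order in, standard out."""
    a = list(a); n = len(a); t = 1; m = n
    while m > 1:
        h = m // 2; j1 = 0
        for i in range(h):
            s = inv[h + i]
            for j in range(j1, j1 + t):
                u, v = a[j], a[j + t]
                a[j], a[j + t] = (u + v) % q, (u - v) * s % q
            j1 += 2 * t
        t *= 2; m = h
    n_inv = pow(n, q - 2, q)
    return [x * n_inv % q for x in a]

def negacyclic_mul_schoolbook(a, b, q):
    """Multiply a and b modulo (X^n + 1), coefficient arithmetic mod q."""
    n = len(a); c = [0] * n
    for i in range(n):
        for j in range(n):
            k = (i + j) % n
            sign = -1 if i + j >= n else 1             # X^n = -1 wraps with a sign flip
            c[k] = (c[k] + sign * a[i] * b[j]) % q
    return c

q, n, psi = 17, 8, 3                                   # assumed: 3 has order 16 mod 17
fwd, inv = make_tables(n, q, psi)
a, b = [1, 2, 3, 4, 5, 6, 7, 8], [3, 1, 4, 1, 5, 9, 2, 6]
pointwise = [x * y % q for x, y in zip(ntt_ct(a, fwd, q), ntt_ct(b, fwd, q))]
assert intt_gs(pointwise, inv, q) == negacyclic_mul_schoolbook(a, b, q)
```

The final assertion checks the round trip against schoolbook multiplication modulo (X^n + 1), i.e. the negacyclic product that the NWC is meant to compute.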
Furthermore, in addition to reducing the computational complexity and fitting the parallel computation flow at the algorithm-design level, the invention balances the workload and reduces memory-access contention at the hardware-structure level: Blocks in the GPU are allocated dynamically according to the computational load, tasks with an excessive amount of computation are split, tasks with a small amount of computation are merged into larger ones, memory-access contention during result merging is controlled, and the GPU hardware resources are fully utilized in both time and space. Based on the memory hierarchy of the GPU, the number of memory-access-heavy tasks scheduled on an SM at the same time is reduced, more shared memory is allocated to raise the memory hit rate, and communication with global memory is reduced. For this purpose the invention further proposes three technical points:
(1) Optimizing the memory access pattern according to the data access regularity of the NTT algorithm;
the optimization of the memory access mode is a key part for realizing the fully homomorphic acceleration on the GPU, and the effective use of the GPU memory hierarchy plays an important role in realizing the overall performance of the fully homomorphic encryption algorithm. For the input polynomial coefficient array poly _ coeff and the twiddle factor array𝑝𝑠𝑖𝑠The naive solution to access different elements is to use global memory, however such performance is also the worst. Another idea is to utilize shared memory: since the elements of the array poly _ coeff are processed in a GPU block, the shared memory provides the best choice for threads located in the same block. Here twiddle factor sequence sets𝑝𝑠𝑖𝑠Each element of (a) is only accessed once per block and does not change its value, so it is loaded into constant memory before the GPU core starts up. During the computation of NTT/INTT, the element of poly _ coeff is accessed log 2 (n) times, the present invention issues a request to global memory to copy an element of poly _ coeff to shared memory,then accessing the shared memory log for it 2 (n) times, thereby avoiding sending out log to global memory for single element 2 (n) the read-write speed of the shared memory is far better than that of the global memory.
It can be understood that an n-point radix-2 NTT algorithm consists of log2(n) iterations, each containing n/2 butterfly operations, with the NTT groups halving after each iteration. In addition, most current GPUs support at most 1024 threads per block, so when the input array has 2048 elements one GPU thread is used to schedule two elements of the array poly_coeff. When processing an array with more than 2048 elements, the invention divides the array into groups of 2048 elements and processes them in different NTT computation iterations using multiple GPU blocks. Because different GPU blocks cannot access each other's shared memory and would otherwise have to go through the slow global memory, the invention adopts a strategy of staging data in shared memory to improve its utilization and reduce the number of reads and writes between shared and global memory. As shown in fig. 4, in the n = 4096 Radix2-NTT iterative access example the array is divided into 2 groups of 2048 elements processed by 2 blocks. It can be seen that only 1024 elements of Block 0 and Block 1 change in the second iteration, so when multiple GPU blocks process one NTT operation group the unchanged elements are kept staged in shared memory according to the access regularity of the NTT algorithm and only the changed elements are written back to global memory, thereby avoiding frequent reads and writes of global memory.
(2) Dynamically distributing thread workload according to the calculation scale and hardware resources;
the challenge in implementing the NTT/INTT algorithm in the GPU is to efficiently allocate threads to achieve high utilization, all threads should be busy for optimal performance, the workload of each thread should be equal, and each thread independently processes a butterfly operation in the present invention, and allocates the required operands and corresponding twiddle factors to each thread in each iteration.
It can be understood that the number of operands consumed by a single butterfly operation equals the radix of the NTT algorithm. As shown in FIG. 5, for each butterfly of the radix-2 NTT algorithm two elements of the array poly_coeff participate in the computation, and one GPU thread is used for these two elements. The thread uses the target_idx variable to compute the index of the first array element it processes in each iteration, the variable step determines the second element of poly_coeff assigned to the same thread, and each thread uses the step_group index variable to locate the corresponding twiddle factor in the precomputed twiddle factor array psis.
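One way to realize this indexing is sketched below as a host-side Python model; the variable names follow the description above, the step_group-based twiddle index is an assumption consistent with the bit-reversed twiddle table of the NWC-NTT sketch, and the real computation is a per-thread CUDA expression rather than a Python loop:

```python
# Host-side model of the per-iteration butterfly-to-thread mapping for a radix-2 NTT.
def butterfly_schedule(n):
    """Yield, for each iteration, the (target_idx, partner_idx, twiddle_idx) of every thread."""
    m, step = 1, n // 2                  # m twiddle groups per iteration; butterflies span `step`
    while m < n:
        plan = []
        for tid in range(n // 2):        # one thread per butterfly
            step_group = tid // step                     # twiddle group of this thread
            target_idx = 2 * step_group * step + tid % step
            plan.append((target_idx, target_idx + step, m + step_group))
        yield step, plan
        m *= 2
        step //= 2

for step, plan in butterfly_schedule(8):
    print("step =", step, plan)
```

For n = 8 the first iteration pairs element i with element i + 4 using a single twiddle factor, and each later iteration halves step while doubling the number of twiddle groups.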
In order to make full use of the computing resources in each GPU block and reduce read/write operations between threads and the shared cache, the invention implements Radix2-NWC-NTT, Radix4-NWC-NTT, Radix8-NWC-NTT and Radix16-NWC-NTT on the GPU.
TABLE 1 NTT comparison of different radices
As shown in Table 1, the larger the NTT radix, the fewer the iterations and the fewer the total accesses to shared memory, but the more computational resources a single butterfly operation requires; by choosing the radix, the NTT algorithm can be matched to the maximum parallelism offered by the GPU hardware. However, different radices must be matched to different computation scales, so in this implementation the invention designs a butterfly structure with hybrid invocation: for an N-point radix-k NTT structure, when the i-th iteration reaches N < k^i, one round of an NTT structure with a correspondingly smaller radix is called to complete the transform. FIG. 6 shows a hybrid invocation example with N = 8; the hybrid-invocation butterfly structure improves flexibility while retaining the optimal parallel strategy.
(3) Dynamically invoking kernels according to the butterfly grouping of each iteration of the NTT algorithm;
In the single-kernel mode, 1024 threads are allocated per GPU block and each group of 2 × step consecutive elements is assigned to the same GPU block, so that no GPU block has to wait for other GPU blocks to process other parts of the array poly_coeff; the idea is to schedule in the order dictated by the algorithm, thereby working around the limited parallelism offered by the GPU and avoiding memory-access contention during result merging. In the multi-kernel mode, n / 2048 GPU blocks are scheduled for each kernel call and all run simultaneously; because each kernel call incurs some overhead, this performs worse than the single-kernel approach when processing smaller groups. In order to fully exploit the parallel potential of multi-kernel invocation while keeping the invocation overhead as low as possible, the invention proposes a hybrid invocation method, which groups the elements of the array poly_coeff to be processed according to the size of step: when the NTT algorithm starts, the butterflies span large strides and the groups are processed in the multi-kernel mode; the size of step_group halves with each iteration of the algorithm, and once a whole group fits within a single GPU block the computation switches from the multi-kernel mode to the single-kernel mode.
As shown in fig. 7, in the multi-kernel phase the input array is accessed from global memory; in the single-kernel phase the array elements are copied to shared memory once and accessed there in the remaining iterations. The INTT algorithm differs slightly from the NTT because it starts from the positions where step is small and merges them after each iteration.
S42, each row of the decomposed ciphertext polynomial matrix is regarded as an independent NTT operation group, and Radix2-NWC-NTT, Radix4-NWC-NTT, Radix8-NWC-NTT and Radix16-NWC-NTT are implemented on the GPU; the correspondingly optimized NTT algorithm is selected according to the computation scale of the ciphertext matrix and the hardware resources of the GPU to convert the ciphertext polynomial matrix, row by row, into the ciphertext point-value representation matrix.
S43, after the point-value representation of the ciphertext is obtained, homomorphic computation is performed according to the required operation function. Because the ciphertext polynomial coefficients are operated on in a quotient ring, a lightweight and efficient modular reduction algorithm is the key to high-performance polynomial multiplication; the Barrett reduction algorithm is implemented on the GPU, replacing the division in the modular reduction by two independent shift operations and one multiplication, so that for a modulus of bit length K the intermediate variables of the computation are at most 2K bits. By exploiting the GPU's massive parallelism, the operations on the ciphertext matrix far exceed the computational efficiency of a CPU using the same method.
S44, after the operation function has been executed, the INTT algorithm is called to convert the computed ciphertext point-value representation matrix back into a ciphertext polynomial matrix.
S45, the ICRT algorithm is called to collect each column of the ciphertext polynomial matrix and recover the final ciphertext polynomial vector, and the computation result in GPU memory is written back to CPU memory.
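A minimal sketch of the Barrett reduction used in step S43 above, written as a scalar Python reference; the modulus is an illustrative NTT-friendly prime chosen here only as an assumption, the precondition is that the input lies below the square of the modulus, and the GPU kernel applies the same steps independently in each thread:

```python
# Reference sketch of Barrett reduction: the quotient is estimated with two shifts
# and one multiplication using a precomputed constant, so intermediates stay within
# roughly 2K bits for a K-bit modulus, and no hardware division is needed.

def barrett_precompute(q):
    k = q.bit_length()
    mu = (1 << (2 * k)) // q                 # mu = floor(2^(2k) / q), computed once
    return k, mu

def barrett_reduce(x, q, k, mu):
    """Compute x mod q for 0 <= x < q*q without dividing by q."""
    t = ((x >> (k - 1)) * mu) >> (k + 1)     # quotient estimate, short by at most 2
    r = x - t * q
    while r >= q:                            # at most two conditional subtractions
        r -= q
    return r

q = 7681                                     # small NTT-friendly prime, illustration only
k, mu = barrett_precompute(q)
for x in (0, q - 1, q, 2 * q + 5, 123456, (q - 1) * (q - 1)):
    assert barrett_reduce(x, q, k, mu) == x % q
```

Since the point-wise products of residues are always below the square of the corresponding CRT modulus, this precondition is naturally satisfied in the ciphertext-matrix arithmetic described above.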
S5, the decryption algorithm Dec(sk, ct') is called to decrypt the computed ciphertext polynomial and obtain the plaintext polynomial.
S6, the decoding algorithm Decode(m'; Δ) is called, with Δ as the scaling factor, to decode the plaintext polynomial into the computation result in plaintext form.
Compared with the prior art of other fully homomorphic encryption implementations, the framework of the invention can make maximal use of the GPU's massive parallelism. On the basis of the GPU implementation of the NTT/INTT algorithm, the invention implements on the GPU every computation module of the homomorphic encryption scheme that operates under ciphertext, including the ciphertext addition module, the multiplication module, the key switching module and the relinearization module.
In another embodiment of the present application, a fully homomorphic encryption scheme with heterogeneous acceleration is constructed on the basis of the GPU implementation of each computation module, and a fully homomorphic encryption privacy computing method in a cloud environment is designed, as shown in fig. 8, specifically comprising the following steps:
Step 1: the Key Generation Center (KGC) sets the public parameters according to the users' requirements and calls the setup function of the fully homomorphic scheme to generate the parameter set mkparam required by the computation, including the parameters of the RLWE hardness problem; it publishes this common parameter set, which the participants and the cloud server receive for the subsequent steps, and sends the generated parameters to the cloud server and each data owner.
Step 2: each data owner calls the key generation function to independently generate its own private key sk_i and public keys (ciphertext expansion key, bootstrapping key and key-switching key), and sends the public-key part to the cloud server.
Step 3: each data owner calls the encryption function to encrypt its data. This step is completed independently by each participant, after which the ciphertext is uploaded to the cloud server.
Step 4: the cloud server uses heterogeneous computing to perform efficient homomorphic operations on the ciphertexts from the different data owners, including ciphertext expansion, bootstrapping and key switching, and returns the final computation results to all participants in ciphertext form.
Step 5: each participant calls the decryption function and decrypts with its private key to obtain the plaintext result m, i.e. the computation result.
Because of the powerful and complete homomorphic computing capability of the fully homomorphic encryption algorithm, the privacy computing architecture in the cloud environment is very simple. The intermediate server node only touches ciphertext and provides the computing function, so it can be regarded as a pure computing node. The users comprise one or more data holders. The users encrypt their own data and upload it together with the public keys to the computing server; the server uses the public keys to perform the ciphertext computation and distributes the results to the users. The users then decrypt in a distributed, step-by-step manner: each user decrypts the corresponding ciphertext only with its own private key, and finally the partially decrypted plaintexts are combined through interaction among the users into the complete plaintext.
It should be noted that, for the sake of simplicity, the foregoing method embodiments are described as a series of acts or combinations, but those skilled in the art should understand that the present invention is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present invention.
As shown in fig. 9, in another embodiment of the present application a storage medium 100 is further provided, in which a memory 101 stores a program that, when executed by a processor 102, implements the privacy computation heterogeneous acceleration method based on fully homomorphic encryption, specifically:
S1, set the required public parameters according to the requirements: Setup(1^λ) → pp takes the security parameter λ as input and returns the public parameters pp, including the parameters pp_RLWE of the RLWE hardness problem; on this basis, KeyGen(pp) → (sk, pk, evk) is called to generate the private key sk, the public key pk and the evaluation key evk;
S2, call the encoding algorithm Encode(z; Δ), where Δ is the scaling factor: encode the input z, an n-dimensional complex vector, and map it onto the integer ring to form a polynomial m(X) in R = Z[X]/(X^N + 1), where Z denotes the set of integers, R denotes the set of all polynomials in X over Z reduced modulo (X^N + 1), and (X^N + 1) is a univariate irreducible polynomial of degree N;
S3, call the encryption algorithm Enc(pk, m) → ct: encrypt the plaintext m in R and output the ciphertext ct in R_q × R_q, where q > 1 is an integer, Z_q denotes the set of integers modulo q, Z_q[X] denotes the set of all polynomials in X over Z_q, and R_q = Z_q[X]/(X^N + 1) denotes the corresponding polynomial residue class ring;
S4, after the data has been encrypted into ciphertext, execute homomorphic operations under the ciphertext, with the following computation process:
S41, copy the ciphertext and the parameters related to the computation from CPU memory to the corresponding GPU memory, decompose the ciphertext polynomial ring over very large integers with the Chinese remainder theorem, and decompose the ciphertext polynomial in vector form into matrix form (a k × N residue matrix with one row per CRT modulus);
S42, regard each row of the decomposed ciphertext polynomial matrix as an independent NTT operation group, and select the correspondingly optimized NTT algorithm according to the computation scale of the ciphertext matrix and the hardware resources of the GPU to convert the ciphertext polynomial matrix into a ciphertext point-value representation matrix; in the optimized NTT algorithm, the NTT module on the GPU adopts the negative wrapped convolution (NWC) to halve the computation scale of the NTT algorithm, and the NWC corrects the result offset introduced by the polynomial ring by generating a preprocessing sequence with which the array elements to be transformed are preprocessed;
S43, after the ciphertext point-value representation matrix is obtained, perform homomorphic computation according to the required operation function; implement the Barrett reduction algorithm on the GPU, replacing the division in the modular reduction by two independent shift operations and one multiplication, so that for a modulus of bit length K the intermediate variables of the computation are at most 2K bits;
S44, after the operation function has been executed, call the INTT algorithm to convert the computed ciphertext point-value representation matrix back into a ciphertext polynomial matrix;
S45, call the ICRT algorithm to collect each column of the ciphertext polynomial matrix and recover the final ciphertext polynomial vector, and write the computation result in GPU memory back to CPU memory;
S5, call the decryption algorithm Dec(sk, ct') to decrypt the computed ciphertext polynomial and obtain the plaintext polynomial;
S6, call the decoding algorithm Decode(m'; Δ), with Δ as the scaling factor, to decode the plaintext polynomial into the computation result in plaintext form.
it should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
Referring to fig. 10, in an embodiment, an electronic device for implementing a privacy computing heterogeneous acceleration method based on fully homomorphic encryption is provided, where the electronic device 200 may include a first processor 201, a first memory 202, and a bus, and may further include a computer program stored in the first memory 202 and executable on the first processor 201, such as a privacy computing heterogeneous acceleration program 203 based on fully homomorphic encryption.
The first memory 202 includes at least one type of readable storage medium, which includes flash memory, removable hard disk, multimedia card, card-type memory (e.g., SD or DX memory, etc.), magnetic memory, magnetic disk, optical disk, etc. The first memory 202 may in some embodiments be an internal storage unit of the electronic device 200, such as a removable hard disk of the electronic device 200. The first memory 202 may also be an external storage device of the electronic device 200 in other embodiments, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the electronic device 200. Further, the first memory 202 may also include both an internal storage unit and an external storage device of the electronic device 200. The first memory 202 may be used to store not only application software installed in the electronic device 200 and various types of data, such as codes of the privacy computing heterogeneous acceleration program 203 based on the fully homomorphic encryption, but also temporarily store data that has been output or is to be output.
The first processor 201 may be composed of an integrated circuit in some embodiments, for example, a single packaged integrated circuit, or may be composed of a plurality of integrated circuits packaged with the same function or different functions, and includes one or more Central Processing Units (CPUs), microprocessors, digital Processing chips, graphics processors, and combinations of various control chips. The first processor 201 is a Control Unit (Control Unit) of the electronic device, connects various components of the whole electronic device by using various interfaces and lines, and executes various functions of the electronic device 200 and processes data by running or executing programs or modules stored in the first memory 202 and calling data stored in the first memory 202.
Fig. 10 shows only an electronic device having components, and those skilled in the art will appreciate that the structure shown in fig. 10 does not constitute a limitation of the electronic device 200, and may include fewer or more components than those shown, or some components may be combined, or a different arrangement of components.
The privacy computation heterogeneous acceleration program 203 based on the fully homomorphic encryption stored in the first memory 202 of the electronic device 200 is a combination of a plurality of instructions that, when executed in the first processor 201, can implement:
s1, setting required public parameters according to requirements by a userppAt a common parameterppOn the basis of the key generation algorithm, the key generation algorithm of the fully homomorphic encryption is called to generate a private keyskPublic keypkAnd evaluating the keyevk
S2, calling a coding algorithm to code the input, and mapping the input to an integer field to form a polynomial form;
s3, calling an encryption algorithm to encrypt a plaintext and outputting a ciphertext;
s4, after the data are encrypted into the ciphertext, homomorphic operation under the ciphertext is executed, and the calculation process is as follows:
s41, copying the ciphertext and the calculation related parameters in the CPU memory to a corresponding GPU memory, decomposing a ciphertext polynomial ring of a super-large integer by adopting a Chinese remainder theorem, and decomposing the ciphertext polynomial in a vector form into a matrix form;
s42, regarding each line of the decomposed ciphertext polynomial matrix as an independent NTT operation group, and selecting a corresponding optimized NTT algorithm according to the calculation scale of the ciphertext matrix and the hardware resource of the GPU to convert the ciphertext polynomial matrix into a ciphertext point value expression matrix; the optimized NTT algorithm is that an NTT module on a GPU adopts a negative closure convolution NWC to reduce the calculation scale of half of the NTT algorithm, and the closure convolution NWC preprocesses array elements needing NTT operation by generating a preprocessing sequence to correct the result offset brought by a polynomial ring;
s43, after a ciphertext point value expression matrix is obtained, homomorphic calculation is carried out according to a required operation function, a Barrett reduction algorithm is realized on a GPU, and the step of multiplying the operands is divided into two independent shifting operations and one multiplication operation, so that the size of a calculation intermediate variable of a modulus with the bit length of K bits is reduced to 2K bits at most;
S44. After the operation function has been executed, the INTT algorithm is invoked to convert the computed ciphertext point-value representation matrix back into a ciphertext polynomial matrix;
S45. The ICRT algorithm is invoked to collect the rows of the ciphertext polynomial matrix and restore them into the final ciphertext polynomial vector, and the computation result in GPU memory is written back to CPU memory;
S5. A decryption algorithm is invoked to decrypt the computed ciphertext polynomial to obtain a plaintext polynomial;
S6. A decoding algorithm is invoked to decode the plaintext polynomial into the computation result in plaintext form.
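The Barrett reduction of step S43 can be illustrated with a minimal CUDA sketch. It assumes a prime modulus q of at most 31 bits (so that products of two residues fit in 64 bits), a precomputed constant mu = floor(2^(2k)/q) with k the bit length of q, and the illustrative function name barrett_reduce; it is a sketch of the general technique under these assumptions, not the patented implementation.

#include <cstdint>

// Sketch of step S43 (assumption: q is a prime of at most 31 bits, k is its bit
// length, and mu = floor(2^(2k) / q) is precomputed on the host). x is at most
// 2k bits wide, e.g. the product of two residues smaller than q.
__host__ __device__ inline uint32_t barrett_reduce(uint64_t x, uint32_t q,
                                                   uint64_t mu, int k)
{
    uint64_t t = (x >> (k - 1)) * mu;   // first shift, then the single multiplication
    t >>= (k + 1);                      // second shift: t now approximates x / q
    uint64_t r = x - t * (uint64_t)q;   // candidate remainder, still within 64 bits
    while (r >= q) r -= q;              // at most two correction subtractions
    return (uint32_t)r;
}

// Illustrative use in a point-wise ciphertext multiplication (a, b < q):
//   uint64_t prod = (uint64_t)a * b;
//   uint32_t c    = barrett_reduce(prod, q, mu, k);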
Further, if the modules/units integrated in the electronic device 200 are implemented in the form of software functional units and sold or used as independent products, they may be stored in a non-volatile computer-readable storage medium. The computer-readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, or a read-only memory (ROM).
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database, or another medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), and direct Rambus dynamic RAM (DRDRAM).
The technical features of the above embodiments can be combined arbitrarily; for brevity, not all possible combinations of these technical features are described, but as long as a combination contains no contradiction it should be considered within the scope of this specification.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (10)

1. A privacy computation heterogeneous acceleration method based on fully homomorphic encryption, characterized by comprising the following steps:
S1. A user sets the required public parameter pp according to requirements; on the basis of the public parameter pp, a key generation algorithm of the fully homomorphic encryption scheme is called to generate a private key sk, a public key pk, and an evaluation key evk;
S2. A coding algorithm is called to encode the input, mapping it to an integer field in polynomial form;
S3. An encryption algorithm is called to encrypt the plaintext and output a ciphertext;
S4. After the data have been encrypted into ciphertext, homomorphic operations under the ciphertext are executed; the computation process is as follows:
S41. The ciphertext and the computation-related parameters in CPU memory are copied to the corresponding GPU memory; the ciphertext polynomial ring over very large integers is decomposed by the Chinese remainder theorem, and the ciphertext polynomial in vector form is decomposed into matrix form;
S42. Each row of the decomposed ciphertext polynomial matrix is treated as an independent NTT operation group, and a corresponding optimized NTT algorithm is selected according to the computation scale of the ciphertext matrix and the hardware resources of the GPU, converting the ciphertext polynomial matrix into a ciphertext point-value representation matrix; in the optimized NTT algorithm, the NTT module on the GPU adopts the negative wrapped convolution (NWC) to halve the computation scale of the NTT algorithm, and the NWC preprocesses the array elements requiring NTT operation by generating a preprocessing sequence, correcting the result offset introduced by the polynomial ring;
S43. After the ciphertext point-value representation matrix is obtained, homomorphic computation is carried out according to the required operation function; a Barrett reduction algorithm is implemented on the GPU, splitting the modular reduction step into two independent shift operations and one multiplication operation, so that for a modulus with a bit length of K bits the intermediate variables of the computation are at most 2K bits wide;
S44. After the operation function has been executed, the INTT algorithm is called to convert the computed ciphertext point-value representation matrix back into a ciphertext polynomial matrix;
S45. The ICRT algorithm is called to collect the rows of the ciphertext polynomial matrix and restore them into the final ciphertext polynomial vector, and the computation result in GPU memory is written back to CPU memory;
S5. A decryption algorithm is called to decrypt the computed ciphertext polynomial to obtain a plaintext polynomial;
S6. A decoding algorithm is called to decode the plaintext polynomial into the computation result in plaintext form.
2. The privacy computation heterogeneous acceleration method based on fully homomorphic encryption according to claim 1, characterized in that the public parameter pp includes the parameters pp_RLWE of the RLWE hardness problem.
3. The privacy computation heterogeneous acceleration method based on fully homomorphic encryption according to claim 1, characterized in that in step S41 the ciphertext polynomial ring over very large integers is decomposed by the Chinese remainder theorem and the ciphertext polynomial in vector form is decomposed into a matrix, which specifically comprises the following steps:
S411. Represent the ciphertext polynomial by its coefficient vector (c_0, c_1, ..., c_{N-1}), where N denotes the highest order of the polynomial;
S412. Generate a sequence of k primes {p_1, p_2, ..., p_k};
S413. Decompose each entry of the coefficient vector with the CRT algorithm, obtaining a k×N matrix whose j-th row holds the residues (c_0 mod p_j, c_1 mod p_j, ..., c_{N-1} mod p_j);
S414. Treat each row as an independent NTT operation group;
S415. Finally, collect the results of each column with the ICRT algorithm to recover the final computation result.
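As a minimal illustration of steps S411 to S414, the following host-side sketch (C++/CUDA host code) decomposes a coefficient vector into the k×N residue matrix; the function name crt_decompose is illustrative, and uint64_t coefficients stand in for the very large integers of the actual scheme.

#include <cstdint>
#include <vector>

// Sketch of S411-S413: each coefficient c_i is reduced modulo every prime p_j,
// giving a k x N matrix; each row is then an independent NTT operation group
// (S414), and the ICRT recombines the columns afterwards (S415).
std::vector<std::vector<uint64_t>> crt_decompose(
        const std::vector<uint64_t>& coeffs,   // c_0 ... c_{N-1} (illustrative width)
        const std::vector<uint64_t>& primes)   // p_1 ... p_k
{
    std::vector<std::vector<uint64_t>> matrix(
        primes.size(), std::vector<uint64_t>(coeffs.size()));
    for (size_t j = 0; j < primes.size(); ++j)       // one row per prime p_j
        for (size_t i = 0; i < coeffs.size(); ++i)   // one column per coefficient c_i
            matrix[j][i] = coeffs[i] % primes[j];
    return matrix;
}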
4. The method according to claim 3, wherein the preprocessing of the array elements is integrated into the butterfly transform by generating a special twiddle factor sequence, the twiddle factor sequence being:
PSI_list = [1, ψ^(N/2), ψ^(N/4), ψ^(3N/4), ψ^(N/8), ψ^(3N/8), ψ^(5N/8), ψ^(7N/8), ..., ψ^1, ψ^3, ψ^5, ..., ψ^(N-5), ψ^(N-3), ψ^(N-1)],
where ψ is a primitive 2N-th root of unity modulo the ciphertext modulus, that is, the square root of the primitive N-th root of unity used by the NTT; the NTT only needs to traverse the N elements of PSI_list, so that the negative wrapped convolution operation is merged into the butterfly transform of the NTT algorithm, and the space complexity of the optimized NTT algorithm is O(N).
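The twiddle factor sequence of claim 4 can be generated as in the following sketch, assuming psi is a primitive 2N-th root of unity modulo q and using an illustrative square-and-multiply helper pow_mod; the code reproduces the listed sequence PSI_list (per butterfly stage, the odd powers in increasing order).

#include <cstdint>
#include <vector>

// Square-and-multiply helper written for this example (not part of the claims).
static uint64_t pow_mod(uint64_t base, uint64_t exp, uint64_t q)
{
    unsigned __int128 acc = 1, b = base % q;
    while (exp) {
        if (exp & 1) acc = acc * b % q;
        b = b * b % q;
        exp >>= 1;
    }
    return (uint64_t)acc;
}

// Builds PSI_list = [1, psi^(N/2), psi^(N/4), psi^(3N/4), psi^(N/8), psi^(3N/8),
// ..., psi^1, psi^3, ..., psi^(N-1)]: for the stage with m butterfly groups it
// stores psi^((2j+1)*N/(2m)) for j = 0..m-1, giving N entries and O(N) storage.
std::vector<uint64_t> build_psi_list(uint64_t psi, uint64_t q, size_t N)
{
    std::vector<uint64_t> table;
    table.push_back(1);
    for (size_t m = 1; m < N; m <<= 1)
        for (size_t j = 0; j < m; ++j)
            table.push_back(pow_mod(psi, (2 * j + 1) * (N / (2 * m)), q));
    return table;
}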
5. The method of claim 1, wherein the NTT operation employs the CT (Cooley-Tukey) butterfly structure, accepting inputs in standard order and producing outputs in bit-reversed order, while the INTT operation employs the GS (Gentleman-Sande) butterfly structure, accepting inputs in bit-reversed order and producing outputs in standard order.
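A minimal device-side sketch of the two butterfly structures named in claim 5, assuming a modulus of at most 31 bits so that intermediate sums and products fit in the chosen integer types; mod_mul, ct_butterfly, and gs_butterfly are illustrative names (the Barrett routine sketched earlier could replace mod_mul).

#include <cstdint>

// Modular multiplication for the example; assumes q < 2^31.
__device__ inline uint32_t mod_mul(uint32_t a, uint32_t b, uint32_t q)
{
    return (uint32_t)(((uint64_t)a * b) % q);
}

// CT (Cooley-Tukey) butterfly used by the forward NTT:
// (a, b) -> (a + psi*b, a - psi*b) mod q.
__device__ inline void ct_butterfly(uint32_t& a, uint32_t& b,
                                    uint32_t psi, uint32_t q)
{
    uint32_t t = mod_mul(b, psi, q);
    b = (a + q - t) % q;
    a = (a + t) % q;
}

// GS (Gentleman-Sande) butterfly used by the INTT:
// (a, b) -> (a + b, (a - b)*psi_inv) mod q.
__device__ inline void gs_butterfly(uint32_t& a, uint32_t& b,
                                    uint32_t psi_inv, uint32_t q)
{
    uint32_t t = (a + q - b) % q;
    a = (a + b) % q;
    b = mod_mul(t, psi_inv, q);
}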
6. The privacy computation heterogeneous acceleration method based on fully homomorphic encryption according to claim 3, characterized in that the memory access mode is optimized according to the data access pattern of the NTT algorithm, specifically:
during the NTT/INTT computation, each element of the polynomial coefficient array poly_coeff is accessed log2(N) times; a single request is therefore issued to global memory to copy an element of poly_coeff into shared memory, and the subsequent log2(N) accesses to that element are served from shared memory, avoiding log2(N) global-memory requests for a single element, the read/write speed of shared memory being far better than that of global memory.
7. The privacy computation heterogeneous acceleration method based on fully homomorphic encryption according to claim 3, characterized in that thread workloads are dynamically allocated according to the computation scale and the hardware resources, specifically:
each thread independently processes one butterfly operation and is allocated, in each iteration, the required operands and the corresponding twiddle factors; the number of operands fetched by a single butterfly operation equals the radix of the NTT algorithm, so for each butterfly operation of the radix-2 NTT algorithm two elements of the array poly_coeff participate in the computation and one GPU thread is used for these two elements; the thread uses the target_idx variable to compute the index of the first array element it processes in each iteration, the variable step is used to determine the second element of poly_coeff allocated to the same thread, and each thread uses the index variable step_group to track and access the corresponding twiddle factor in the twiddle factor array psi.
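The per-thread indexing of claim 7 might look like the following device-side sketch for one radix-2 iteration; the operand indexing via target_idx and step follows the claim, while the twiddle-table layout (indexing psi at step_group + group, with step_group taken as the number of groups in the current iteration) is an assumption made for this example.

#include <cstdint>

// One radix-2 iteration as seen by a single thread (tid enumerates the N/2
// butterflies). step is the distance between the paired coefficients and
// step_group the number of groups in this iteration; ct_butterfly is the
// routine from the claim-5 sketch above.
__device__ void radix2_iteration(uint32_t* s_coeff, const uint32_t* psi,
                                 int tid, int step, int step_group, uint32_t q)
{
    int group      = tid / step;                     // which butterfly group
    int target_idx = group * 2 * step + tid % step;  // first operand handled by the thread
    uint32_t w     = psi[step_group + group];        // twiddle factor for this group (assumed layout)

    uint32_t a = s_coeff[target_idx];
    uint32_t b = s_coeff[target_idx + step];         // second operand, step positions away
    ct_butterfly(a, b, w, q);
    s_coeff[target_idx]        = a;
    s_coeff[target_idx + step] = b;
}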
8. The privacy computation heterogeneous acceleration method based on fully homomorphic encryption according to claim 7, characterized in that kernels are dynamically invoked according to the butterfly transform groups of each iteration of the NTT algorithm, specifically:
a hybrid invocation method is provided, which groups the elements of the array poly_coeff to be processed according to the size of the variable step; when the NTT algorithm starts, the grouping is determined by the initial value of step, and with each iteration of the algorithm this value is halved; when it falls below the switching threshold, the invocation is switched from the multi-kernel mode to the single-kernel mode.
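A hedged host-side sketch of the hybrid invocation of claim 8: ntt_stage_kernel and ntt_tail_kernel are illustrative names whose bodies are omitted, and the switching condition, here the point at which step no longer exceeds the thread-block size, is an assumption, since the exact threshold of the claim is given by a formula not reproduced above.

#include <cuda_runtime.h>
#include <cstdint>

// Kernel bodies are omitted in this sketch; the names are illustrative only.
__global__ void ntt_stage_kernel(uint32_t* poly_coeff, const uint32_t* psi,
                                 int step, int step_group, uint32_t q);
__global__ void ntt_tail_kernel(uint32_t* poly_coeff, const uint32_t* psi,
                                int first_step, uint32_t q);

// Host-side hybrid invocation: one launch per iteration while step is large,
// then a single launch once the remaining work of each group fits in one block.
void launch_ntt(uint32_t* d_coeff, const uint32_t* d_psi, int N, uint32_t q,
                int threads_per_block = 256)
{
    int step_group = 1;        // number of butterfly groups in the current iteration
    int step       = N / 2;    // distance between the paired coefficients
    int blocks     = (N / 2 + threads_per_block - 1) / threads_per_block;

    while (step > threads_per_block) {               // multi-kernel mode
        ntt_stage_kernel<<<blocks, threads_per_block>>>(d_coeff, d_psi,
                                                        step, step_group, q);
        step_group <<= 1;                            // groups double each iteration
        step       >>= 1;                            // step is halved each iteration
    }
    // Single-kernel mode: the remaining iterations run inside one launch,
    // keeping their data in shared memory between stages.
    ntt_tail_kernel<<<N / (2 * threads_per_block), threads_per_block,
                      2 * threads_per_block * sizeof(uint32_t)>>>(
        d_coeff, d_psi, step, q);
}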
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores computer program instructions executable by the at least one processor to enable the at least one processor to perform the privacy computation heterogeneous acceleration method based on fully homomorphic encryption of any one of claims 1-8.
10. A computer-readable storage medium storing a program, characterized in that the program, when executed by a processor, implements the privacy computation heterogeneous acceleration method based on fully homomorphic encryption of any one of claims 1-8.
CN202211433166.3A 2022-11-16 2022-11-16 Privacy computation heterogeneous acceleration method and device based on fully homomorphic encryption Active CN115622684B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211433166.3A CN115622684B (en) 2022-11-16 2022-11-16 Privacy computation heterogeneous acceleration method and device based on fully homomorphic encryption

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211433166.3A CN115622684B (en) 2022-11-16 2022-11-16 Privacy computation heterogeneous acceleration method and device based on fully homomorphic encryption

Publications (2)

Publication Number Publication Date
CN115622684A CN115622684A (en) 2023-01-17
CN115622684B true CN115622684B (en) 2023-03-28

Family

ID=84878880

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211433166.3A Active CN115622684B (en) 2022-11-16 2022-11-16 Privacy computation heterogeneous acceleration method and device based on fully homomorphic encryption

Country Status (1)

Country Link
CN (1) CN115622684B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116366248B (en) * 2023-05-31 2023-09-29 山东大学 Kyber implementation method and system based on compact instruction set expansion
CN116633526B (en) * 2023-07-21 2023-10-31 北京数牍科技有限公司 Data processing method, device, equipment and medium
CN116743349B (en) * 2023-08-14 2023-10-13 数据空间研究院 Paillier ciphertext summation method, system, device and storage medium
CN117251871B (en) * 2023-11-16 2024-03-01 支付宝(杭州)信息技术有限公司 Data processing method and system for secret database
CN117349868B (en) * 2023-12-04 2024-04-12 粤港澳大湾区数字经济研究院(福田) Fully homomorphic encryption and decryption method based on GPU, electronic equipment and storage medium
CN117439731B (en) * 2023-12-21 2024-03-12 山东大学 Privacy protection big data principal component analysis method and system based on homomorphic encryption
CN117521164B (en) * 2024-01-08 2024-05-03 南湖实验室 Self-adaptive homomorphic encryption method based on trusted execution environment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114697113A (en) * 2022-03-30 2022-07-01 医渡云(北京)技术有限公司 Hardware accelerator card-based multi-party privacy calculation method, device and system
CN114710320A (en) * 2022-03-03 2022-07-05 湖南科技大学 Edge calculation privacy protection method based on block chain and multi-key fully homomorphic encryption

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101763241B (en) * 2010-01-20 2012-02-08 西安电子科技大学 Large integer modular arithmetic device for realizing signature algorithm in ECC cryptosystem and modular method therefor
US10733819B2 (en) * 2018-12-21 2020-08-04 2162256 Alberta Ltd. Secure and automated vehicular control using multi-factor authentication

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114710320A (en) * 2022-03-03 2022-07-05 湖南科技大学 Edge calculation privacy protection method based on block chain and multi-key fully homomorphic encryption
CN114697113A (en) * 2022-03-30 2022-07-01 医渡云(北京)技术有限公司 Hardware accelerator card-based multi-party privacy calculation method, device and system

Also Published As

Publication number Publication date
CN115622684A (en) 2023-01-17

Similar Documents

Publication Publication Date Title
CN115622684B (en) Privacy computation heterogeneous acceleration method and device based on fully homomorphic encryption
Al Badawi et al. Implementation and performance evaluation of RNS variants of the BFV homomorphic encryption scheme
Al Badawi et al. High-performance FV somewhat homomorphic encryption on GPUs: An implementation using CUDA
Gupta et al. Pqc acceleration using gpus: Frodokem, newhope, and kyber
US10467389B2 (en) Secret shared random access machine
Wu et al. Secure and efficient outsourced k-means clustering using fully homomorphic encryption with ciphertext packing technique
Fan et al. Parallelization of RSA algorithm based on compute unified device architecture
Gao et al. CuNH: Efficient GPU implementations of post-quantum KEM NewHope
Alkım et al. Compact and simple RLWE based key encapsulation mechanism
JP2022531593A (en) Systems and methods for adding and comparing integers encrypted by quasigroup operations in AES counter mode encryption
Wang et al. HE-Booster: an efficient polynomial arithmetic acceleration on GPUs for fully homomorphic encryption
Choi et al. Fast implementation of SHA-3 in GPU environment
US11804968B2 (en) Area efficient architecture for lattice based key encapsulation and digital signature generation
Wan et al. TESLAC: accelerating lattice-based cryptography with AI accelerator
Dolev et al. Secret shared random access machine
Pu et al. Fastplay-a parallelization model and implementation of smc on cuda based gpu cluster architecture
Han et al. cuGimli: optimized implementation of the Gimli authenticated encryption and hash function on GPU for IoT applications
Takeshita et al. Gps: Integration of graphene, palisade, and sgx for large-scale aggregations of distributed data
Seo SIKE on GPU: Accelerating supersingular isogeny-based key encapsulation mechanism on graphic processing units
Lei et al. Accelerating homomorphic full adder based on fhew using multicore cpu and gpus
Zheng Encrypted cloud using GPUs
Yuan et al. Portable Implementation of Postquantum Encryption Schemes and Key Exchange Protocols on JavaScript-Enabled Platforms
CN109150494A (en) Method, storage medium, equipment and the system of enciphering and deciphering algorithm are constructed in mobile terminal
Doumi et al. Performance evaluation of parallel international data encryption algorithm on iman1 super computer
Kamal et al. Enhanced implementation of the NTRUEncrypt algorithm using graphics cards

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant