CN115622684B - Privacy computation heterogeneous acceleration method and device based on fully homomorphic encryption - Google Patents

Privacy computation heterogeneous acceleration method and device based on fully homomorphic encryption

Info

Publication number
CN115622684B
CN115622684B
Authority
CN
China
Prior art keywords
algorithm
ciphertext
ntt
polynomial
memory
Prior art date
Legal status
Active
Application number
CN202211433166.3A
Other languages
Chinese (zh)
Other versions
CN115622684A (en)
Inventor
蒋琳
赵鑫
刘虎成
陈倩
方俊彬
王轩
张加佳
李君一
Current Assignee
Jinan University
Shenzhen Graduate School Harbin Institute of Technology
Original Assignee
Jinan University
Shenzhen Graduate School Harbin Institute of Technology
Priority date
Filing date
Publication date
Application filed by Jinan University, Shenzhen Graduate School Harbin Institute of Technology
Priority to CN202211433166.3A
Publication of CN115622684A
Application granted
Publication of CN115622684B
Status: Active

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 9/00 Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L 9/008 involving homomorphic encryption
    • H04L 9/08 Key distribution or management, e.g. generation, sharing or updating, of cryptographic keys or passwords
    • H04L 9/0861 Generation of secret information including derivation or calculation of cryptographic keys or passwords
    • H04L 9/0894 Escrow, recovery or storing of secret information, e.g. secret key escrow or cryptographic key storage
    • H04L 9/0897 involving additional devices, e.g. trusted platform module [TPM], smartcard or USB
    • H04L 2209/00 Additional information or applications relating to cryptographic mechanisms or cryptographic arrangements for secret or secure communication H04L 9/00
    • H04L 2209/12 Details relating to cryptographic hardware or logic circuitry
    • H04L 2209/125 Parallelization or pipelining, e.g. for accelerating processing of cryptographic operations

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a privacy computation heterogeneous acceleration method and device based on fully homomorphic encryption. By using the memory hierarchy of the GPU, the number of memory-access-heavy tasks scheduled on an SM at the same time is reduced, more shared memory is allocated to raise the memory hit rate, and communication with global memory is reduced. A heterogeneous computing stream is designed so that limited hardware resources are shared both temporally and spatially. A key challenge in implementing the NTT/INTT algorithms on a GPU is allocating threads efficiently to achieve high utilization: for optimal performance all threads should be kept busy, and the workload of each thread should be equal.

Description

Privacy computation heterogeneous acceleration method and device based on fully homomorphic encryption
Technical Field
The invention belongs to the technical field of privacy computation, and particularly relates to a privacy computation heterogeneous acceleration method and device based on fully homomorphic encryption.
Background
In recent years data has become an important resource and an important factor of production. Compared with traditional factors of production, data can only become a strategic resource that circulates freely if it does so securely, so the step of guaranteeing data privacy and security cannot be avoided. Modern cryptography has found application in countless digital systems and components and has become an important tool for securing data and privacy. However, encryption technology itself (including the widely used public key encryption, PKE) still has a limitation: sensitive data must be decrypted before it can be processed and analyzed.
Privacy Computing, also called Privacy-Enhancing Computing or Confidential Computing, is a family of technologies and systems for realizing multi-party collaborative computation on data without transferring or revealing the original data, so that the data is "usable but not visible". Among the many privacy computing technologies, Fully Homomorphic Encryption (FHE) is a form of encryption that supports algebraic operations directly on encrypted data (ciphertext) without decrypting the ciphertext before the operation. The computed result is also encrypted, and only the owner of the private key can decrypt and access it; the result is equivalent to performing the same operation on the unencrypted version (plaintext) of the data. This property allows an untrusted third party to operate directly on the ciphertext without the private key, avoiding the leakage of sensitive user information that would result if the third party had to decrypt the ciphertext during the computation.
Compared with differential privacy, homomorphic encryption comes with cryptographic security proofs and therefore offers higher security. Differential privacy by its nature still requires the data to be transmitted elsewhere and usually requires a trade-off between accuracy and privacy, so its security is lower than that of homomorphic encryption. Compared with secure multi-party computation, homomorphic encryption has the advantage of fewer interaction rounds: secure multi-party computation requires multiple rounds of interaction among several participants to produce a result, and the communication cost is high. Fully homomorphic encryption provides a more general capability for data privacy protection and a stronger foundation for future data infrastructure, and can in theory perfectly resolve the tension between data protection and data circulation. However, the technical bottleneck of fully homomorphic encryption in terms of computational efficiency greatly limits its practicality and wider adoption.
Regarding improvements to the efficiency of homomorphic encryption, the homomorphic encryption algorithm has mainly been optimized on the CPU along three lines: the bootstrapping algorithm, ciphertext packing, and noise-free FHE schemes. Bootstrapping reduces the accumulated noise to a level comparable to that of a fresh ciphertext, but because it repeatedly evaluates the decryption circuit its computational overhead is large and makes FHE inefficient, so much follow-up work has focused on optimizing bootstrapping. Besides optimizing bootstrapping, another effective means of "speeding up" fully homomorphic encryption is batching, also known as packing, which allows multiple data values to be encrypted into one ciphertext and enables single-instruction-multiple-data (SIMD) operations to be performed homomorphically. The third research direction is to construct noise-free FHE schemes. Noise is added to guarantee the security of the FHE scheme, but it also brings the burden of noise control; noise-free FHE schemes are generally considered insecure, although this conclusion has not been rigorously proven.
Disclosure of Invention
The invention mainly aims to overcome the defects and shortcomings of the prior art and provide a privacy computation heterogeneous acceleration method and device based on fully homomorphic encryption.
In order to achieve the purpose, the invention adopts the following technical scheme:
In a first aspect, the present invention provides a privacy computation heterogeneous acceleration method based on fully homomorphic encryption, including the following steps:
S1, a user sets the required public parameters pp according to the requirements; on the basis of the public parameters pp, the key generation algorithm of the fully homomorphic encryption is called to generate a private key sk, a public key pk and an evaluation key evk;
S2, a coding algorithm is called to encode the input and map it onto the integer domain to form a polynomial;
S3, an encryption algorithm is called to encrypt the plaintext and output a ciphertext;
S4, after the data has been encrypted into ciphertext, homomorphic operations under the ciphertext are executed, with the following computation process:
S41, the ciphertext and the parameters related to the computation are copied from CPU memory to the corresponding GPU memory, the ciphertext polynomial ring over very large integers is decomposed with the Chinese remainder theorem, and the ciphertext polynomial in vector form is decomposed into matrix form;
S42, each row of the decomposed ciphertext polynomial matrix is regarded as an independent NTT operation group, and a correspondingly optimized NTT algorithm is selected according to the computation scale of the ciphertext matrix and the hardware resources of the GPU to convert the ciphertext polynomial matrix into a ciphertext point-value representation matrix; in the optimized NTT algorithm, the NTT module on the GPU adopts the negative wrapped convolution (NWC) to halve the computation scale of the NTT algorithm, and the NWC corrects the result offset introduced by the polynomial ring by generating a preprocessing sequence with which the array elements to be transformed are preprocessed;
S43, after the ciphertext point-value representation matrix is obtained, homomorphic computation is performed according to the required operation function; a Barrett reduction algorithm is implemented on the GPU, replacing the division in the modular reduction by two independent shift operations and one multiplication, so that for a modulus of bit length K the intermediate variables of the computation are at most 2K bits;
S44, after the operation function has been executed, the INTT algorithm is called to convert the computed ciphertext point-value representation matrix back into a ciphertext polynomial matrix;
S45, the ICRT algorithm is called to collect each column of the ciphertext polynomial matrix and recover the final ciphertext polynomial vector, and the computation result in GPU memory is written back to CPU memory;
S5, a decryption algorithm is called to decrypt the computed ciphertext polynomial and obtain a plaintext polynomial;
S6, a decoding algorithm is called to decode the plaintext polynomial into the computation result in plaintext form.
As a preferred technical solution, the public parameters pp include the parameters pp_RLWE of the RLWE hardness problem.
As a preferred technical solution, in step S41 the ciphertext polynomial ring over very large integers is decomposed with the Chinese remainder theorem, and the ciphertext polynomial in vector form is decomposed into matrix form; for a ciphertext polynomial c(X) in Z_q[X]/(X^N + 1), this specifically comprises the following steps:
S411, the ciphertext polynomial is expressed as a coefficient vector (c_0, c_1, ..., c_{N-1});
S412, a sequence of k primes (p_1, p_2, ..., p_k) is generated;
S413, each entry of the coefficient vector is decomposed with the CRT algorithm into its residues modulo p_1, ..., p_k, giving a k x N residue matrix;
S414, each row is regarded as an independent NTT operation group;
S415, finally, the ICRT algorithm is used to collect the result of each column and recover the final calculation result.
As a preferred technical solution, the preprocessing of the array elements is merged into the butterfly transform by generating a special twiddle factor sequence; the NWC-NTT generates the special twiddle factor sequence as follows:
PSI_list = [1, ψ^{N/2}, ψ^{N/4}, ψ^{3N/4}, ψ^{N/8}, ψ^{3N/8}, ψ^{5N/8}, ψ^{7N/8}, ..., ψ^1, ψ^3, ψ^5, ..., ψ^{N-5}, ψ^{N-3}, ψ^{N-1}], where ψ is a primitive 2N-th root of unity modulo the coefficient modulus, i.e. the square of ψ is the N-th root of unity used by the ordinary NTT. By traversing only PSI_list, the negative wrapped convolution operation is merged into the butterfly transform of the NTT algorithm, yielding a modified NTT algorithm (the NWC-NTT); the optimized NTT algorithm has a space complexity of O(N).
As a preferred technical scheme, the NTT operation adopts a CT butterfly structure, accepts input in a standard order, and generates output in a bit-reversal order; INTT operations take inputs in bit-reversed order using the GS butterfly structure and produce outputs in standard order.
As a preferred technical solution, the memory access pattern is optimized according to the data access regularity of the NTT algorithm, specifically:
during the computation of NTT/INTT, each element of the polynomial coefficient array poly_coeff is accessed log2(n) times; a single request is issued to global memory to copy an element of poly_coeff into shared memory, and the subsequent log2(n) accesses are served from shared memory, thereby avoiding log2(n) global-memory accesses for a single element; the read/write speed of shared memory is far better than that of global memory.
As a preferred technical solution, the thread workload is dynamically allocated according to the computation scale and the hardware resources, specifically:
each thread independently processes one butterfly operation and is assigned the required operands and the corresponding twiddle factor in each iteration;
the number of operands consumed by a single butterfly operation equals the radix of the NTT algorithm; for each butterfly of the radix-2 NTT algorithm, two elements of the array poly_coeff participate in the computation, and one GPU thread is used for these two elements; the thread uses the target_idx variable to compute the index of the first array element it processes in each iteration, the variable step determines the second element of poly_coeff assigned to the same thread, and each thread uses the step_group index variable to locate the corresponding twiddle factor in the precomputed twiddle factor array psis.
As a preferred technical solution, the kernels are invoked dynamically according to the butterfly grouping of each iteration of the NTT algorithm, specifically:
a hybrid invocation method is proposed, which groups the elements of the array poly_coeff to be processed according to the size of step; when the NTT algorithm starts, the butterflies span large strides and the groups are processed in the multi-kernel mode; the size of step_group halves with each iteration of the algorithm, and once a whole group fits within a single GPU block the computation switches from the multi-kernel mode to the single-kernel mode.
In a second aspect, the present invention also provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores computer program instructions executable by the at least one processor to enable the at least one processor to perform the above privacy computation heterogeneous acceleration method based on fully homomorphic encryption.
In a third aspect, the present invention further provides a computer-readable storage medium storing a program which, when executed by a processor, implements the above privacy computation heterogeneous acceleration method based on fully homomorphic encryption.
Compared with the prior art, the invention has the following advantages and beneficial effects:
For the hardware architecture of the GPU, the system optimizes the fully homomorphic encryption algorithm at both the memory and the instruction level: Blocks in the GPU are allocated dynamically according to the computational load, tasks with a large amount of computation are split, tasks with a small amount of computation are merged into larger ones, and memory-access contention during result merging is controlled. By exploiting the memory hierarchy of the GPU, the number of memory-access-heavy tasks scheduled on an SM at the same time is reduced, more shared memory is allocated to raise the memory hit rate, and communication with global memory is reduced. A heterogeneous computing stream is designed so that limited hardware resources are shared both temporally and spatially. Different computational tasks and different blocks are distributed to different GPU computing units by means of large-integer decomposition, matrix splitting and similar techniques, achieving spatial parallelism. Data transfers are hidden behind computation through pipelining, and temporal parallelism is designed and realized by exploiting the fact that different modules of the fully homomorphic encryption algorithm can execute at the same time. A challenge in implementing the NTT/INTT algorithms on a GPU is allocating threads efficiently to achieve high utilization: for optimal performance all threads should be kept busy, and the workload of each thread should be equal. The invention adopts a new number theoretic transform architecture to overcome the challenges posed by the complex data dependencies and irregular access patterns of the NTT.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a schematic structural diagram of the CKKS heterogeneous computation scheme according to an embodiment of the invention;
FIG. 2 is a block diagram of the privacy computation heterogeneous acceleration system based on fully homomorphic encryption according to an embodiment of the present invention;
FIG. 3 is an example of CT-NTT and GS-INTT with n = 8 according to an embodiment of the present invention;
FIG. 4 is a diagram of a Radix2-NTT iterative memory access example with n = 4096 according to an embodiment of the present invention;
FIG. 5 is a diagram of Radix2-NTT step_group grouping iterations according to an embodiment of the present invention;
FIG. 6 is a diagram of a hybrid-radix example with N = 8 according to an embodiment of the present invention;
FIG. 7 is a step_group hybrid-invocation iteration diagram of the NTT of the present invention;
FIG. 8 is a diagram illustrating the fully homomorphic encryption privacy computing model in a cloud environment according to an embodiment of the present invention;
FIG. 9 is a block diagram of a computer-readable storage medium according to an embodiment of the present invention;
fig. 10 is a block diagram of an electronic device according to an embodiment of the invention.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by a person skilled in the art that the embodiments described herein can be combined with other embodiments.
Homomorphic encryption algorithms based on the (R)LWE hardness problem map plaintexts over the integer or real field to ciphertexts in a polynomial ring, and the amount of computation under ciphertext is on the order of ten thousand times that on plaintext. In terms of computational efficiency, the basic operations of homomorphic evaluation under ciphertext involve addition and multiplication of integer-coefficient polynomials of very large dimension, which makes fully homomorphic computation extremely inefficient.
As shown in fig. 1, aiming at the performance problem of the low computational efficiency of fully homomorphic encryption, the invention designs and implements a GPU-oriented NTT algorithm with higher parallelism and lower computational complexity, and realizes a CPU + GPU heterogeneous privacy computation method based on the CKKS homomorphic encryption scheme, thereby greatly improving the efficiency of homomorphic computation under ciphertext.
In the heterogeneous scheme of the invention, the CPU serves as a general-purpose processor with few computing units and low parallelism; tasks such as data preprocessing, encoding, batching of requests, thread scheduling and memory mapping are executed on the CPU, while computation-intensive tasks such as ciphertext addition, ciphertext multiplication and NTT/INTT are executed on the GPU. To break through the efficiency bottleneck of homomorphic encryption, the method mainly uses the GPU to accelerate homomorphic computation under ciphertext.
As shown in fig. 2, the privacy computation heterogeneous acceleration method based on fully homomorphic encryption of the present embodiment includes the following steps:
S1, set the required public parameters according to the requirements: Setup(1^λ) → pp takes the security parameter λ as input and returns the public parameters pp, including the parameters pp_RLWE of the RLWE hardness problem. On this basis, KeyGen(pp) → (sk, pk, evk) is called to generate the private key sk, the public key pk and the evaluation key evk.
S2, the CKKS scheme uses the rich structure of the integer polynomial ring to realize the mapping from the plaintext space to the ciphertext space. Real-world data mostly appears in the form of complex vectors, so the encoding algorithm Encode(z; Δ) is invoked first, where Δ is the scaling factor: the input z, an n-dimensional complex vector, is encoded and mapped onto the integer ring to form a polynomial m(X) in R = Z[X]/(X^N + 1), where Z denotes the set of integers, R denotes the set of all polynomials in X over Z reduced modulo (X^N + 1), and (X^N + 1) is a univariate irreducible polynomial of degree N.
S3, the CKKS scheme used in this embodiment is a public-key encryption scheme based on the RLWE hardness problem, in which the generated public key is used for encryption and may be shared, while the private key is used for decryption and must be kept secret. The encryption algorithm Enc(pk, m) → ct is invoked: it encrypts the plaintext m in R and outputs the ciphertext ct in R_q × R_q, where q > 1 is an integer, Z_q denotes the set of integers modulo q, Z_q[X] denotes the set of all polynomials in X over Z_q, and R_q = Z_q[X]/(X^N + 1) denotes the corresponding polynomial residue class ring.
S4, after the data has been encrypted into ciphertext, homomorphic operations under the ciphertext are executed. To break through the bottleneck of the large amount and low efficiency of fully homomorphic computation under ciphertext, a homomorphic evaluation method with lower computational complexity and higher parallelism is designed and implemented on the GPU, exploiting the GPU's capacity for massively parallel computation. The specific computation process is as follows:
S41, the ciphertext and the parameters related to the computation are copied from CPU memory to the corresponding GPU memory; the ciphertext polynomial ring over very large integers is first decomposed with the Chinese Remainder Theorem (CRT), and the ciphertext polynomial in vector form is decomposed into matrix form, i.e. a k × N residue matrix with one row per CRT modulus.
It can be understood that in lattice-based homomorphic encryption schemes the dimension and the coefficients of the generated ciphertext polynomials are very large, and evaluating functions on the ciphertext is very time-consuming: a naive algorithm for adding two degree-N polynomials has complexity O(N), and multiplication has complexity O(N^2). To reduce the computational complexity, the commonly used tool is the Number Theoretic Transform (NTT), which lowers the complexity of polynomial multiplication to O(N log N); however, for particularly large-scale multiplications of big integers the efficiency is still far from ideal.
Therefore, the invention first optimizes the NTT algorithm in terms of reducing computational complexity and improving parallelism. One important factor in reducing the computational complexity of the NTT is avoiding multiplications of very large integers; the invention adopts the Chinese Remainder Theorem (CRT) to decompose a very large integer into k small integers that can be computed in parallel, performing the large-scale integer operation as many low-cost small operations, and then obtains the best computational efficiency through the GPU's massive parallelism. For a ciphertext polynomial c(X) in Z_q[X]/(X^N + 1), N is on the order of 2^10 to 2^15 and the coefficient size is 128 to 256 bits; the specific steps of decomposing the large-integer polynomial with the CRT algorithm are as follows:
S21, the ciphertext polynomial is expressed as a coefficient vector (c_0, c_1, ..., c_{N-1});
S22, a sequence of k primes (p_1, p_2, ..., p_k) is generated;
S23, each entry of the coefficient vector is decomposed with the CRT algorithm into its residues modulo p_1, ..., p_k, giving a k × N residue matrix;
S24, each row is regarded as an independent NTT operation group;
S25, finally, the ICRT algorithm is used to collect the result of each column and recover the final calculation result.
The entries decomposed by the CRT algorithm have no dependency on one another, a structure that is well suited to the single-instruction multiple-data parallel computing mode of the GPU, in which a single control unit dispatches instructions to each pipeline and all processing elements execute the same instruction simultaneously. Here k is a balancing factor related to storage and memory access, and its value affects the scale of the computational tasks. Specifically, increasing k decreases the time of a single computation but increases the computation size and the number of memory accesses; conversely, decreasing k increases the time of a single computation while decreasing the computation size and the number of memory accesses.
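As an aid to understanding, the following is a minimal CPU-side Python sketch of the CRT/ICRT decomposition and recovery described above; the moduli and coefficients are illustrative assumptions only, an actual CKKS implementation would choose NTT-friendly primes p_i ≡ 1 (mod 2N), and the GPU version processes the rows of the residue matrix in parallel:

```python
# Reference sketch (not the GPU implementation): split large coefficients into
# residues modulo k small primes, and recover them with the inverse CRT.
from functools import reduce

def crt_decompose(coeffs, moduli):
    """Return the k x N residue matrix: row j holds the coefficients mod p_j."""
    return [[c % p for c in coeffs] for p in moduli]

def icrt_recover(matrix, moduli):
    """Recombine the residue matrix column by column via the CRT."""
    P = reduce(lambda a, b: a * b, moduli)           # product of all moduli
    recovered = []
    for column in zip(*matrix):                      # one column per coefficient
        x = 0
        for r, p in zip(column, moduli):
            Pi = P // p
            x += r * Pi * pow(Pi, -1, p)             # r_j * P_j * (P_j^{-1} mod p_j)
        recovered.append(x % P)
    return recovered

# Illustrative small NTT-friendly primes (assumed) and toy coefficients.
moduli = [12289, 40961, 65537]
coeffs = [123456789012, 987654321098, 42, 7]         # all below the product of the moduli
matrix = crt_decompose(coeffs, moduli)
assert icrt_recover(matrix, moduli) == coeffs
```

Arithmetic on the residue matrix (row-wise NTTs and point-wise products modulo each p_j) then replaces arithmetic on the original large coefficients, which is what lets the workload map onto independent GPU blocks.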
Furthermore, the NTT module on the GPU of the present invention employs the negative wrapped convolution (NWC) to halve the computation scale of the NTT algorithm; the negative wrapped convolution corrects the result offset introduced by the polynomial ring by generating a preprocessing sequence with which the array elements to be transformed are preprocessed.
In the invention, this preprocessing of the array elements is merged into the butterfly transform by generating a special twiddle factor sequence. The NWC-NTT generates the special twiddle factor sequence as follows:
PSI_list = [1, ψ^{N/2}, ψ^{N/4}, ψ^{3N/4}, ψ^{N/8}, ψ^{3N/8}, ψ^{5N/8}, ψ^{7N/8}, ..., ψ^1, ψ^3, ψ^5, ..., ψ^{N-5}, ψ^{N-3}, ψ^{N-1}], where ψ is a primitive 2N-th root of unity modulo the coefficient modulus, i.e. the square of ψ is the N-th root of unity used by the ordinary NTT. By traversing only PSI_list, the negative wrapped convolution operation is merged into the butterfly transform of the NTT algorithm. The optimized NTT algorithm has a space complexity of O(N), and the computation scale is halved.
In the NTT transform, the butterfly operation is invoked cyclically as the arithmetic logic unit. The NTT operation adopts the CT butterfly structure, taking input in standard order and producing output in bit-reversed order; the INTT operation adopts the GS butterfly structure, taking input in bit-reversed order and producing output in standard order. As shown in FIG. 3, the pre-processing and post-processing steps are eliminated at the cost of using two different butterfly structures; the parallel acceleration scheme on the GPU is designed around this modified NWC-NTT algorithm, and an operation example of the NWC-NTT algorithm with n = 8 is shown in FIG. 3.
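To make the merged negacyclic butterflies concrete, the following Python sketch is a CPU reference of the CT-based forward NWC-NTT and the GS-based inverse transform, with the ψ powers folded into bit-reversed twiddle tables; the toy parameters q = 17, n = 8 and ψ = 3 are assumptions chosen only so that ψ is a primitive 2n-th root of unity, and the GPU kernels parallelize the inner butterfly loops rather than running them serially as here:

```python
# CPU reference sketch of the merged NWC-NTT / INTT (CT forward, GS inverse).
# q, n and psi below are assumed toy parameters; psi must be a primitive
# 2n-th root of unity mod q (so psi^n = -1), which encodes the negacyclic wrap.

def bit_reverse(x, bits):
    r = 0
    for _ in range(bits):
        r = (r << 1) | (x & 1)
        x >>= 1
    return r

def make_tables(n, q, psi):
    bits = n.bit_length() - 1
    psi_inv = pow(psi, q - 2, q)                       # q is prime, so Fermat inverse
    fwd = [pow(psi, bit_reverse(i, bits), q) for i in range(n)]   # PSI_list analogue
    inv = [pow(psi_inv, bit_reverse(i, bits), q) for i in range(n)]
    return fwd, inv

def ntt_ct(a, fwd, q):
    """Forward negacyclic NTT with CT butterflies: standard order in, bit-reversed out."""
    a = list(a); n = len(a); t = n; m = 1
    while m < n:
        t //= 2
        for i in range(m):
            s = fwd[m + i]                             # twiddle for this butterfly group
            for j in range(2 * i * t, 2 * i * t + t):
                u, v = a[j], a[j + t] * s % q
                a[j], a[j + t] = (u + v) % q, (u - v) % q
        m *= 2
    return a

def intt_gs(a, inv, q):
    """Inverse negacyclic NTT with GS butterflies: bit-reversed order in, standard out."""
    a = list(a); n = len(a); t = 1; m = n
    while m > 1:
        h = m // 2; j1 = 0
        for i in range(h):
            s = inv[h + i]
            for j in range(j1, j1 + t):
                u, v = a[j], a[j + t]
                a[j], a[j + t] = (u + v) % q, (u - v) * s % q
            j1 += 2 * t
        t *= 2; m = h
    n_inv = pow(n, q - 2, q)
    return [x * n_inv % q for x in a]

def negacyclic_mul_schoolbook(a, b, q):
    """Multiply a and b modulo (X^n + 1), coefficient arithmetic mod q."""
    n = len(a); c = [0] * n
    for i in range(n):
        for j in range(n):
            k = (i + j) % n
            sign = -1 if i + j >= n else 1             # X^n = -1 wraps with a sign flip
            c[k] = (c[k] + sign * a[i] * b[j]) % q
    return c

q, n, psi = 17, 8, 3                                   # assumed: 3 has order 16 mod 17
fwd, inv = make_tables(n, q, psi)
a, b = [1, 2, 3, 4, 5, 6, 7, 8], [3, 1, 4, 1, 5, 9, 2, 6]
pointwise = [x * y % q for x, y in zip(ntt_ct(a, fwd, q), ntt_ct(b, fwd, q))]
assert intt_gs(pointwise, inv, q) == negacyclic_mul_schoolbook(a, b, q)
```

The final assertion checks the round trip against schoolbook multiplication modulo (X^n + 1), i.e. the negacyclic product that the NWC is meant to compute.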
Furthermore, in addition to reducing the computational complexity and fitting the parallel computation flow at the algorithm-design level, the invention balances the workload and reduces memory-access contention at the hardware-structure level: Blocks in the GPU are allocated dynamically according to the computational load, tasks with an excessive amount of computation are split, tasks with a small amount of computation are merged into larger ones, memory-access contention during result merging is controlled, and the GPU hardware resources are fully utilized in both time and space. Based on the memory hierarchy of the GPU, the number of memory-access-heavy tasks scheduled on an SM at the same time is reduced, more shared memory is allocated to raise the memory hit rate, and communication with global memory is reduced. For this purpose the invention further proposes three technical points:
(1) Optimizing the memory access pattern according to the data access regularity of the NTT algorithm;
the optimization of the memory access mode is a key part for realizing the fully homomorphic acceleration on the GPU, and the effective use of the GPU memory hierarchy plays an important role in realizing the overall performance of the fully homomorphic encryption algorithm. For the input polynomial coefficient array poly _ coeff and the twiddle factor array𝑝𝑠𝑖𝑠The naive solution to access different elements is to use global memory, however such performance is also the worst. Another idea is to utilize shared memory: since the elements of the array poly _ coeff are processed in a GPU block, the shared memory provides the best choice for threads located in the same block. Here twiddle factor sequence sets𝑝𝑠𝑖𝑠Each element of (a) is only accessed once per block and does not change its value, so it is loaded into constant memory before the GPU core starts up. During the computation of NTT/INTT, the element of poly _ coeff is accessed log 2 (n) times, the present invention issues a request to global memory to copy an element of poly _ coeff to shared memory,then accessing the shared memory log for it 2 (n) times, thereby avoiding sending out log to global memory for single element 2 (n) the read-write speed of the shared memory is far better than that of the global memory.
It can be understood that an n-point radix-2 NTT algorithm consists of log2(n) iterations, each containing n/2 butterfly operations, with the NTT groups halving after each iteration. In addition, most current GPUs support at most 1024 threads per block, so when the input array has 2048 elements one GPU thread is used to schedule two elements of the array poly_coeff. When processing an array with more than 2048 elements, the invention divides the array into groups of 2048 elements and processes them in different NTT computation iterations using multiple GPU blocks. Because different GPU blocks cannot access each other's shared memory and would otherwise have to go through the slow global memory, the invention adopts a strategy of staging data in shared memory to improve its utilization and reduce the number of reads and writes between shared and global memory. As shown in fig. 4, in the n = 4096 Radix2-NTT iterative access example the array is divided into 2 groups of 2048 elements processed by 2 blocks. It can be seen that only 1024 elements of Block 0 and Block 1 change in the second iteration, so when multiple GPU blocks process one NTT operation group the unchanged elements are kept staged in shared memory according to the access regularity of the NTT algorithm and only the changed elements are written back to global memory, thereby avoiding frequent reads and writes of global memory.
(2) Dynamically distributing thread workload according to the calculation scale and hardware resources;
the challenge in implementing the NTT/INTT algorithm in the GPU is to efficiently allocate threads to achieve high utilization, all threads should be busy for optimal performance, the workload of each thread should be equal, and each thread independently processes a butterfly operation in the present invention, and allocates the required operands and corresponding twiddle factors to each thread in each iteration.
It can be understood that the number of operands consumed by a single butterfly operation equals the radix of the NTT algorithm. As shown in FIG. 5, for each butterfly of the radix-2 NTT algorithm two elements of the array poly_coeff participate in the computation, and one GPU thread is used for these two elements. The thread uses the target_idx variable to compute the index of the first array element it processes in each iteration, the variable step determines the second element of poly_coeff assigned to the same thread, and each thread uses the step_group index variable to locate the corresponding twiddle factor in the precomputed twiddle factor array psis.
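One way to realize this indexing is sketched below as a host-side Python model; the variable names follow the description above, the step_group-based twiddle index is an assumption consistent with the bit-reversed twiddle table of the NWC-NTT sketch, and the real computation is a per-thread CUDA expression rather than a Python loop:

```python
# Host-side model of the per-iteration butterfly-to-thread mapping for a radix-2 NTT.
def butterfly_schedule(n):
    """Yield, for each iteration, the (target_idx, partner_idx, twiddle_idx) of every thread."""
    m, step = 1, n // 2                  # m twiddle groups per iteration; butterflies span `step`
    while m < n:
        plan = []
        for tid in range(n // 2):        # one thread per butterfly
            step_group = tid // step                     # twiddle group of this thread
            target_idx = 2 * step_group * step + tid % step
            plan.append((target_idx, target_idx + step, m + step_group))
        yield step, plan
        m *= 2
        step //= 2

for step, plan in butterfly_schedule(8):
    print("step =", step, plan)
```

For n = 8 the first iteration pairs element i with element i + 4 using a single twiddle factor, and each later iteration halves step while doubling the number of twiddle groups.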
In order to make full use of the computing resources in each GPU block and reduce read/write operations between threads and the shared cache, the invention implements Radix2-NWC-NTT, Radix4-NWC-NTT, Radix8-NWC-NTT and Radix16-NWC-NTT on the GPU.
TABLE 1 NTT comparison of different radices
As shown in Table 1, the larger the NTT radix, the fewer the iterations and the fewer the total accesses to shared memory, but the more computational resources a single butterfly operation requires; by choosing the radix, the NTT algorithm can be matched to the maximum parallelism offered by the GPU hardware. However, different radices must be matched to different computation scales, so in this implementation the invention designs a butterfly structure with hybrid invocation: for an N-point radix-k NTT structure, when the i-th iteration reaches N < k^i, one round of an NTT structure with a correspondingly smaller radix is called to complete the transform. FIG. 6 shows a hybrid invocation example with N = 8; the hybrid-invocation butterfly structure improves flexibility while retaining the optimal parallel strategy.
(3) Dynamically invoking kernels according to the butterfly grouping of each iteration of the NTT algorithm;
In the single-kernel mode, 1024 threads are allocated per GPU block and each group of 2 × step consecutive elements is assigned to the same GPU block, so that no GPU block has to wait for other GPU blocks to process other parts of the array poly_coeff; the idea is to schedule in the order dictated by the algorithm, thereby working around the limited parallelism offered by the GPU and avoiding memory-access contention during result merging. In the multi-kernel mode, n / 2048 GPU blocks are scheduled for each kernel call and all run simultaneously; because each kernel call incurs some overhead, this performs worse than the single-kernel approach when processing smaller groups. In order to fully exploit the parallel potential of multi-kernel invocation while keeping the invocation overhead as low as possible, the invention proposes a hybrid invocation method, which groups the elements of the array poly_coeff to be processed according to the size of step: when the NTT algorithm starts, the butterflies span large strides and the groups are processed in the multi-kernel mode; the size of step_group halves with each iteration of the algorithm, and once a whole group fits within a single GPU block the computation switches from the multi-kernel mode to the single-kernel mode.
As shown in fig. 7, in the multi-kernel phase the input array is accessed from global memory; in the single-kernel phase the array elements are copied to shared memory once and accessed there in the remaining iterations. The INTT algorithm differs slightly from the NTT because it starts from the positions where step is small and merges them after each iteration.
S42, each row of the decomposed ciphertext polynomial matrix is regarded as an independent NTT operation group, and Radix2-NWC-NTT, Radix4-NWC-NTT, Radix8-NWC-NTT and Radix16-NWC-NTT are implemented on the GPU; the correspondingly optimized NTT algorithm is selected according to the computation scale of the ciphertext matrix and the hardware resources of the GPU to convert the ciphertext polynomial matrix, row by row, into the ciphertext point-value representation matrix.
S43, after the point-value representation of the ciphertext is obtained, homomorphic computation is performed according to the required operation function. Because the ciphertext polynomial coefficients are operated on in a quotient ring, a lightweight and efficient modular reduction algorithm is the key to high-performance polynomial multiplication; the Barrett reduction algorithm is implemented on the GPU, replacing the division in the modular reduction by two independent shift operations and one multiplication, so that for a modulus of bit length K the intermediate variables of the computation are at most 2K bits. By exploiting the GPU's massive parallelism, the operations on the ciphertext matrix far exceed the computational efficiency of a CPU using the same method.
S44, after the operation function has been executed, the INTT algorithm is called to convert the computed ciphertext point-value representation matrix back into a ciphertext polynomial matrix.
S45, the ICRT algorithm is called to collect each column of the ciphertext polynomial matrix and recover the final ciphertext polynomial vector, and the computation result in GPU memory is written back to CPU memory.
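A minimal sketch of the Barrett reduction used in step S43 above, written as a scalar Python reference; the modulus is an illustrative NTT-friendly prime chosen here only as an assumption, the precondition is that the input lies below the square of the modulus, and the GPU kernel applies the same steps independently in each thread:

```python
# Reference sketch of Barrett reduction: the quotient is estimated with two shifts
# and one multiplication using a precomputed constant, so intermediates stay within
# roughly 2K bits for a K-bit modulus, and no hardware division is needed.

def barrett_precompute(q):
    k = q.bit_length()
    mu = (1 << (2 * k)) // q                 # mu = floor(2^(2k) / q), computed once
    return k, mu

def barrett_reduce(x, q, k, mu):
    """Compute x mod q for 0 <= x < q*q without dividing by q."""
    t = ((x >> (k - 1)) * mu) >> (k + 1)     # quotient estimate, short by at most 2
    r = x - t * q
    while r >= q:                            # at most two conditional subtractions
        r -= q
    return r

q = 7681                                     # small NTT-friendly prime, illustration only
k, mu = barrett_precompute(q)
for x in (0, q - 1, q, 2 * q + 5, 123456, (q - 1) * (q - 1)):
    assert barrett_reduce(x, q, k, mu) == x % q
```

Since the point-wise products of residues are always below the square of the corresponding CRT modulus, this precondition is naturally satisfied in the ciphertext-matrix arithmetic described above.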
S5, the decryption algorithm Dec(sk, ct') is called to decrypt the computed ciphertext polynomial and obtain the plaintext polynomial.
S6, the decoding algorithm Decode(m'; Δ) is called, with Δ as the scaling factor, to decode the plaintext polynomial into the computation result in plaintext form.
Compared with the prior art of other fully homomorphic encryption implementations, the framework of the invention can make maximal use of the GPU's massive parallelism. On the basis of the GPU implementation of the NTT/INTT algorithm, the invention implements on the GPU every computation module of the homomorphic encryption scheme that operates under ciphertext, including the ciphertext addition module, the multiplication module, the key switching module and the relinearization module.
In another embodiment of the present application, a fully homomorphic encryption scheme with heterogeneous acceleration is constructed on the basis of the GPU implementation of each computation module, and a fully homomorphic encryption privacy computing method in a cloud environment is designed, as shown in fig. 8, specifically comprising the following steps:
Step 1: the Key Generation Center (KGC) sets the public parameters according to the users' requirements and calls the setup function of the fully homomorphic scheme to generate the parameter set mkparam required by the computation, including the parameters of the RLWE hardness problem; it publishes this common parameter set, which the participants and the cloud server receive for the subsequent steps, and sends the generated parameters to the cloud server and each data owner.
Step 2: each data owner calls the key generation function to independently generate its own private key sk_i and public keys (ciphertext expansion key, bootstrapping key and key-switching key), and sends the public-key part to the cloud server.
Step 3: each data owner calls the encryption function to encrypt its data. This step is completed independently by each participant, after which the ciphertext is uploaded to the cloud server.
Step 4: the cloud server uses heterogeneous computing to perform efficient homomorphic operations on the ciphertexts from the different data owners, including ciphertext expansion, bootstrapping and key switching, and returns the final computation results to all participants in ciphertext form.
Step 5: each participant calls the decryption function and decrypts with its private key to obtain the plaintext result m, i.e. the computation result.
Because of the powerful and complete homomorphic computing capability of the fully homomorphic encryption algorithm, the privacy computing architecture in the cloud environment is very simple. The intermediate server node only touches ciphertext and provides the computing function, so it can be regarded as a pure computing node. The users comprise one or more data holders. The users encrypt their own data and upload it together with the public keys to the computing server; the server uses the public keys to perform the ciphertext computation and distributes the results to the users. The users then decrypt in a distributed, step-by-step manner: each user decrypts the corresponding ciphertext only with its own private key, and finally the partially decrypted plaintexts are combined through interaction among the users into the complete plaintext.
It should be noted that, for the sake of simplicity, the foregoing method embodiments are described as a series of acts or combinations, but those skilled in the art should understand that the present invention is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present invention.
As shown in fig. 9, in another embodiment of the present application a storage medium 100 is further provided, in which a memory 101 stores a program that, when executed by a processor 102, implements the privacy computation heterogeneous acceleration method based on fully homomorphic encryption, specifically:
S1, set the required public parameters according to the requirements: Setup(1^λ) → pp takes the security parameter λ as input and returns the public parameters pp, including the parameters pp_RLWE of the RLWE hardness problem; on this basis, KeyGen(pp) → (sk, pk, evk) is called to generate the private key sk, the public key pk and the evaluation key evk;
S2, call the encoding algorithm Encode(z; Δ), where Δ is the scaling factor: encode the input z, an n-dimensional complex vector, and map it onto the integer ring to form a polynomial m(X) in R = Z[X]/(X^N + 1), where Z denotes the set of integers, R denotes the set of all polynomials in X over Z reduced modulo (X^N + 1), and (X^N + 1) is a univariate irreducible polynomial of degree N;
S3, call the encryption algorithm Enc(pk, m) → ct: encrypt the plaintext m in R and output the ciphertext ct in R_q × R_q, where q > 1 is an integer, Z_q denotes the set of integers modulo q, Z_q[X] denotes the set of all polynomials in X over Z_q, and R_q = Z_q[X]/(X^N + 1) denotes the corresponding polynomial residue class ring;
S4, after the data has been encrypted into ciphertext, execute homomorphic operations under the ciphertext, with the following computation process:
S41, copy the ciphertext and the parameters related to the computation from CPU memory to the corresponding GPU memory, decompose the ciphertext polynomial ring over very large integers with the Chinese remainder theorem, and decompose the ciphertext polynomial in vector form into matrix form (a k × N residue matrix with one row per CRT modulus);
S42, regard each row of the decomposed ciphertext polynomial matrix as an independent NTT operation group, and select the correspondingly optimized NTT algorithm according to the computation scale of the ciphertext matrix and the hardware resources of the GPU to convert the ciphertext polynomial matrix into a ciphertext point-value representation matrix; in the optimized NTT algorithm, the NTT module on the GPU adopts the negative wrapped convolution (NWC) to halve the computation scale of the NTT algorithm, and the NWC corrects the result offset introduced by the polynomial ring by generating a preprocessing sequence with which the array elements to be transformed are preprocessed;
S43, after the ciphertext point-value representation matrix is obtained, perform homomorphic computation according to the required operation function; implement the Barrett reduction algorithm on the GPU, replacing the division in the modular reduction by two independent shift operations and one multiplication, so that for a modulus of bit length K the intermediate variables of the computation are at most 2K bits;
S44, after the operation function has been executed, call the INTT algorithm to convert the computed ciphertext point-value representation matrix back into a ciphertext polynomial matrix;
S45, call the ICRT algorithm to collect each column of the ciphertext polynomial matrix and recover the final ciphertext polynomial vector, and write the computation result in GPU memory back to CPU memory;
S5, call the decryption algorithm Dec(sk, ct') to decrypt the computed ciphertext polynomial and obtain the plaintext polynomial;
S6, call the decoding algorithm Decode(m'; Δ), with Δ as the scaling factor, to decode the plaintext polynomial into the computation result in plaintext form.
it should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
Referring to fig. 10, in an embodiment, an electronic device for implementing a privacy computing heterogeneous acceleration method based on fully homomorphic encryption is provided, where the electronic device 200 may include a first processor 201, a first memory 202, and a bus, and may further include a computer program stored in the first memory 202 and executable on the first processor 201, such as a privacy computing heterogeneous acceleration program 203 based on fully homomorphic encryption.
The first memory 202 includes at least one type of readable storage medium, which includes flash memory, removable hard disk, multimedia card, card-type memory (e.g., SD or DX memory, etc.), magnetic memory, magnetic disk, optical disk, etc. The first memory 202 may in some embodiments be an internal storage unit of the electronic device 200, such as a removable hard disk of the electronic device 200. The first memory 202 may also be an external storage device of the electronic device 200 in other embodiments, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the electronic device 200. Further, the first memory 202 may also include both an internal storage unit and an external storage device of the electronic device 200. The first memory 202 may be used to store not only application software installed in the electronic device 200 and various types of data, such as codes of the privacy computing heterogeneous acceleration program 203 based on the fully homomorphic encryption, but also temporarily store data that has been output or is to be output.
The first processor 201 may be composed of an integrated circuit in some embodiments, for example, a single packaged integrated circuit, or may be composed of a plurality of integrated circuits packaged with the same function or different functions, and includes one or more Central Processing Units (CPUs), microprocessors, digital Processing chips, graphics processors, and combinations of various control chips. The first processor 201 is a Control Unit (Control Unit) of the electronic device, connects various components of the whole electronic device by using various interfaces and lines, and executes various functions of the electronic device 200 and processes data by running or executing programs or modules stored in the first memory 202 and calling data stored in the first memory 202.
Fig. 10 shows only an electronic device having components, and those skilled in the art will appreciate that the structure shown in fig. 10 does not constitute a limitation of the electronic device 200, and may include fewer or more components than those shown, or some components may be combined, or a different arrangement of components.
The privacy computation heterogeneous acceleration program 203 based on the fully homomorphic encryption stored in the first memory 202 of the electronic device 200 is a combination of a plurality of instructions that, when executed in the first processor 201, can implement:
s1, setting required public parameters according to requirements by a userppAt a common parameterppOn the basis of the key generation algorithm, the key generation algorithm of the fully homomorphic encryption is called to generate a private keyskPublic keypkAnd evaluating the keyevk
S2, calling a coding algorithm to code the input, and mapping the input to an integer field to form a polynomial form;
s3, calling an encryption algorithm to encrypt a plaintext and outputting a ciphertext;
s4, after the data are encrypted into the ciphertext, homomorphic operation under the ciphertext is executed, and the calculation process is as follows:
s41, copying the ciphertext and the calculation related parameters in the CPU memory to a corresponding GPU memory, decomposing a ciphertext polynomial ring of a super-large integer by adopting a Chinese remainder theorem, and decomposing the ciphertext polynomial in a vector form into a matrix form;
s42, regarding each line of the decomposed ciphertext polynomial matrix as an independent NTT operation group, and selecting a corresponding optimized NTT algorithm according to the calculation scale of the ciphertext matrix and the hardware resource of the GPU to convert the ciphertext polynomial matrix into a ciphertext point value expression matrix; the optimized NTT algorithm is that an NTT module on a GPU adopts a negative closure convolution NWC to reduce the calculation scale of half of the NTT algorithm, and the closure convolution NWC preprocesses array elements needing NTT operation by generating a preprocessing sequence to correct the result offset brought by a polynomial ring;
s43, after a ciphertext point value expression matrix is obtained, homomorphic calculation is carried out according to a required operation function, a Barrett reduction algorithm is realized on a GPU, and the step of multiplying the operands is divided into two independent shifting operations and one multiplication operation, so that the size of a calculation intermediate variable of a modulus with the bit length of K bits is reduced to 2K bits at most;
S44. After the operation function has been executed, the INTT algorithm is invoked to convert the computed ciphertext point-value representation matrix back into a ciphertext polynomial matrix;
S45. The ICRT algorithm is invoked to collect the rows of the ciphertext polynomial matrix and restore them into the final ciphertext polynomial vector, and the computation result in GPU memory is written back to CPU memory;
S5. A decryption algorithm is invoked to decrypt the computed ciphertext polynomial to obtain a plaintext polynomial;
S6. A decoding algorithm is invoked to decode the plaintext polynomial into the computation result in plaintext form.
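The Barrett reduction of step S43 can be illustrated with a minimal CUDA sketch. It assumes a prime modulus q of at most 31 bits (so that products of two residues fit in 64 bits), a precomputed constant mu = floor(2^(2k)/q) with k the bit length of q, and the illustrative function name barrett_reduce; it is a sketch of the general technique under these assumptions, not the patented implementation.

#include <cstdint>

// Sketch of step S43 (assumption: q is a prime of at most 31 bits, k is its bit
// length, and mu = floor(2^(2k) / q) is precomputed on the host). x is at most
// 2k bits wide, e.g. the product of two residues smaller than q.
__host__ __device__ inline uint32_t barrett_reduce(uint64_t x, uint32_t q,
                                                   uint64_t mu, int k)
{
    uint64_t t = (x >> (k - 1)) * mu;   // first shift, then the single multiplication
    t >>= (k + 1);                      // second shift: t now approximates x / q
    uint64_t r = x - t * (uint64_t)q;   // candidate remainder, still within 64 bits
    while (r >= q) r -= q;              // at most two correction subtractions
    return (uint32_t)r;
}

// Illustrative use in a point-wise ciphertext multiplication (a, b < q):
//   uint64_t prod = (uint64_t)a * b;
//   uint32_t c    = barrett_reduce(prod, q, mu, k);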
Further, if the modules/units integrated in the electronic device 200 are implemented in the form of software functional units and sold or used as independent products, they may be stored in a non-volatile computer-readable storage medium. The computer-readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, or a read-only memory (ROM).
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database, or another medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), and direct Rambus dynamic RAM (DRDRAM).
The technical features of the above embodiments can be combined arbitrarily; for brevity, not all possible combinations of these technical features are described, but as long as a combination contains no contradiction it should be considered within the scope of this specification.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (10)

1. A privacy computation heterogeneous acceleration method based on fully homomorphic encryption, characterized by comprising the following steps:
S1. A user sets the required public parameter pp according to requirements; on the basis of the public parameter pp, a key generation algorithm of the fully homomorphic encryption scheme is called to generate a private key sk, a public key pk, and an evaluation key evk;
S2. A coding algorithm is called to encode the input, mapping it to an integer field in polynomial form;
S3. An encryption algorithm is called to encrypt the plaintext and output a ciphertext;
S4. After the data have been encrypted into ciphertext, homomorphic operations under the ciphertext are executed; the computation process is as follows:
S41. The ciphertext and the computation-related parameters in CPU memory are copied to the corresponding GPU memory; the ciphertext polynomial ring over very large integers is decomposed by the Chinese remainder theorem, and the ciphertext polynomial in vector form is decomposed into matrix form;
S42. Each row of the decomposed ciphertext polynomial matrix is treated as an independent NTT operation group, and a corresponding optimized NTT algorithm is selected according to the computation scale of the ciphertext matrix and the hardware resources of the GPU, converting the ciphertext polynomial matrix into a ciphertext point-value representation matrix; in the optimized NTT algorithm, the NTT module on the GPU adopts the negative wrapped convolution (NWC) to halve the computation scale of the NTT algorithm, and the NWC preprocesses the array elements requiring NTT operation by generating a preprocessing sequence, correcting the result offset introduced by the polynomial ring;
S43. After the ciphertext point-value representation matrix is obtained, homomorphic computation is carried out according to the required operation function; a Barrett reduction algorithm is implemented on the GPU, splitting the modular reduction step into two independent shift operations and one multiplication operation, so that for a modulus with a bit length of K bits the intermediate variables of the computation are at most 2K bits wide;
S44. After the operation function has been executed, the INTT algorithm is called to convert the computed ciphertext point-value representation matrix back into a ciphertext polynomial matrix;
S45. The ICRT algorithm is called to collect the rows of the ciphertext polynomial matrix and restore them into the final ciphertext polynomial vector, and the computation result in GPU memory is written back to CPU memory;
S5. A decryption algorithm is called to decrypt the computed ciphertext polynomial to obtain a plaintext polynomial;
S6. A decoding algorithm is called to decode the plaintext polynomial into the computation result in plaintext form.
2. The privacy computation heterogeneous acceleration method based on fully homomorphic encryption according to claim 1, characterized in that the public parameter pp includes the parameters pp_RLWE of the RLWE hardness problem.
3. The privacy computation heterogeneous acceleration method based on fully homomorphic encryption according to claim 1, characterized in that in step S41 the ciphertext polynomial ring over very large integers is decomposed by the Chinese remainder theorem and the ciphertext polynomial in vector form is decomposed into a matrix, which specifically comprises the following steps:
S411. Represent the ciphertext polynomial by its coefficient vector (c_0, c_1, ..., c_{N-1}), where N denotes the highest order of the polynomial;
S412. Generate a sequence of k primes {p_1, p_2, ..., p_k};
S413. Decompose each entry of the coefficient vector with the CRT algorithm, obtaining a k×N matrix whose j-th row holds the residues (c_0 mod p_j, c_1 mod p_j, ..., c_{N-1} mod p_j);
S414. Treat each row as an independent NTT operation group;
S415. Finally, collect the results of each column with the ICRT algorithm to recover the final computation result.
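As a minimal illustration of steps S411 to S414, the following host-side sketch (C++/CUDA host code) decomposes a coefficient vector into the k×N residue matrix; the function name crt_decompose is illustrative, and uint64_t coefficients stand in for the very large integers of the actual scheme.

#include <cstdint>
#include <vector>

// Sketch of S411-S413: each coefficient c_i is reduced modulo every prime p_j,
// giving a k x N matrix; each row is then an independent NTT operation group
// (S414), and the ICRT recombines the columns afterwards (S415).
std::vector<std::vector<uint64_t>> crt_decompose(
        const std::vector<uint64_t>& coeffs,   // c_0 ... c_{N-1} (illustrative width)
        const std::vector<uint64_t>& primes)   // p_1 ... p_k
{
    std::vector<std::vector<uint64_t>> matrix(
        primes.size(), std::vector<uint64_t>(coeffs.size()));
    for (size_t j = 0; j < primes.size(); ++j)       // one row per prime p_j
        for (size_t i = 0; i < coeffs.size(); ++i)   // one column per coefficient c_i
            matrix[j][i] = coeffs[i] % primes[j];
    return matrix;
}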
4. The method according to claim 3, wherein the preprocessing of the array elements is integrated into the butterfly transform by generating a special twiddle factor sequence, the twiddle factor sequence being:
PSI_list = [1, ψ^(N/2), ψ^(N/4), ψ^(3N/4), ψ^(N/8), ψ^(3N/8), ψ^(5N/8), ψ^(7N/8), ..., ψ^1, ψ^3, ψ^5, ..., ψ^(N-5), ψ^(N-3), ψ^(N-1)],
where ψ is a primitive 2N-th root of unity modulo the ciphertext modulus, that is, the square root of the primitive N-th root of unity used by the NTT; the NTT only needs to traverse the N elements of PSI_list, so that the negative wrapped convolution operation is merged into the butterfly transform of the NTT algorithm, and the space complexity of the optimized NTT algorithm is O(N).
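The twiddle factor sequence of claim 4 can be generated as in the following sketch, assuming psi is a primitive 2N-th root of unity modulo q and using an illustrative square-and-multiply helper pow_mod; the code reproduces the listed sequence PSI_list (per butterfly stage, the odd powers in increasing order).

#include <cstdint>
#include <vector>

// Square-and-multiply helper written for this example (not part of the claims).
static uint64_t pow_mod(uint64_t base, uint64_t exp, uint64_t q)
{
    unsigned __int128 acc = 1, b = base % q;
    while (exp) {
        if (exp & 1) acc = acc * b % q;
        b = b * b % q;
        exp >>= 1;
    }
    return (uint64_t)acc;
}

// Builds PSI_list = [1, psi^(N/2), psi^(N/4), psi^(3N/4), psi^(N/8), psi^(3N/8),
// ..., psi^1, psi^3, ..., psi^(N-1)]: for the stage with m butterfly groups it
// stores psi^((2j+1)*N/(2m)) for j = 0..m-1, giving N entries and O(N) storage.
std::vector<uint64_t> build_psi_list(uint64_t psi, uint64_t q, size_t N)
{
    std::vector<uint64_t> table;
    table.push_back(1);
    for (size_t m = 1; m < N; m <<= 1)
        for (size_t j = 0; j < m; ++j)
            table.push_back(pow_mod(psi, (2 * j + 1) * (N / (2 * m)), q));
    return table;
}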
5. The method of claim 1, wherein the NTT operation employs the CT (Cooley-Tukey) butterfly structure, accepting inputs in standard order and producing outputs in bit-reversed order, while the INTT operation employs the GS (Gentleman-Sande) butterfly structure, accepting inputs in bit-reversed order and producing outputs in standard order.
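A minimal device-side sketch of the two butterfly structures named in claim 5, assuming a modulus of at most 31 bits so that intermediate sums and products fit in the chosen integer types; mod_mul, ct_butterfly, and gs_butterfly are illustrative names (the Barrett routine sketched earlier could replace mod_mul).

#include <cstdint>

// Modular multiplication for the example; assumes q < 2^31.
__device__ inline uint32_t mod_mul(uint32_t a, uint32_t b, uint32_t q)
{
    return (uint32_t)(((uint64_t)a * b) % q);
}

// CT (Cooley-Tukey) butterfly used by the forward NTT:
// (a, b) -> (a + psi*b, a - psi*b) mod q.
__device__ inline void ct_butterfly(uint32_t& a, uint32_t& b,
                                    uint32_t psi, uint32_t q)
{
    uint32_t t = mod_mul(b, psi, q);
    b = (a + q - t) % q;
    a = (a + t) % q;
}

// GS (Gentleman-Sande) butterfly used by the INTT:
// (a, b) -> (a + b, (a - b)*psi_inv) mod q.
__device__ inline void gs_butterfly(uint32_t& a, uint32_t& b,
                                    uint32_t psi_inv, uint32_t q)
{
    uint32_t t = (a + q - b) % q;
    a = (a + b) % q;
    b = mod_mul(t, psi_inv, q);
}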
6. The privacy computation heterogeneous acceleration method based on fully homomorphic encryption according to claim 3, characterized in that the memory access mode is optimized according to the data access pattern of the NTT algorithm, specifically:
during the NTT/INTT computation, each element of the polynomial coefficient array poly_coeff is accessed log2(N) times; a single request is therefore issued to global memory to copy an element of poly_coeff into shared memory, and the subsequent log2(N) accesses to that element are served from shared memory, avoiding log2(N) global-memory requests for a single element, the read/write speed of shared memory being far better than that of global memory.
7. The privacy computation heterogeneous acceleration method based on fully homomorphic encryption according to claim 3, characterized in that thread workloads are dynamically allocated according to the computation scale and the hardware resources, specifically:
each thread independently processes one butterfly operation and is allocated, in each iteration, the required operands and the corresponding twiddle factors; the number of operands fetched by a single butterfly operation equals the radix of the NTT algorithm, so for each butterfly operation of the radix-2 NTT algorithm two elements of the array poly_coeff participate in the computation and one GPU thread is used for these two elements; the thread uses the target_idx variable to compute the index of the first array element it processes in each iteration, the variable step is used to determine the second element of poly_coeff allocated to the same thread, and each thread uses the index variable step_group to track and access the corresponding twiddle factor in the twiddle factor array psi.
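The per-thread indexing of claim 7 might look like the following device-side sketch for one radix-2 iteration; the operand indexing via target_idx and step follows the claim, while the twiddle-table layout (indexing psi at step_group + group, with step_group taken as the number of groups in the current iteration) is an assumption made for this example.

#include <cstdint>

// One radix-2 iteration as seen by a single thread (tid enumerates the N/2
// butterflies). step is the distance between the paired coefficients and
// step_group the number of groups in this iteration; ct_butterfly is the
// routine from the claim-5 sketch above.
__device__ void radix2_iteration(uint32_t* s_coeff, const uint32_t* psi,
                                 int tid, int step, int step_group, uint32_t q)
{
    int group      = tid / step;                     // which butterfly group
    int target_idx = group * 2 * step + tid % step;  // first operand handled by the thread
    uint32_t w     = psi[step_group + group];        // twiddle factor for this group (assumed layout)

    uint32_t a = s_coeff[target_idx];
    uint32_t b = s_coeff[target_idx + step];         // second operand, step positions away
    ct_butterfly(a, b, w, q);
    s_coeff[target_idx]        = a;
    s_coeff[target_idx + step] = b;
}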
8. The privacy computation heterogeneous acceleration method based on fully homomorphic encryption according to claim 7, characterized in that kernels are dynamically invoked according to the butterfly transform groups of each iteration of the NTT algorithm, specifically:
a hybrid invocation method is provided, which groups the elements of the array poly_coeff to be processed according to the size of the variable step; when the NTT algorithm starts, the grouping is determined by the initial value of step, and with each iteration of the algorithm this value is halved; when it falls below the switching threshold, the invocation is switched from the multi-kernel mode to the single-kernel mode.
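A hedged host-side sketch of the hybrid invocation of claim 8: ntt_stage_kernel and ntt_tail_kernel are illustrative names whose bodies are omitted, and the switching condition, here the point at which step no longer exceeds the thread-block size, is an assumption, since the exact threshold of the claim is given by a formula not reproduced above.

#include <cuda_runtime.h>
#include <cstdint>

// Kernel bodies are omitted in this sketch; the names are illustrative only.
__global__ void ntt_stage_kernel(uint32_t* poly_coeff, const uint32_t* psi,
                                 int step, int step_group, uint32_t q);
__global__ void ntt_tail_kernel(uint32_t* poly_coeff, const uint32_t* psi,
                                int first_step, uint32_t q);

// Host-side hybrid invocation: one launch per iteration while step is large,
// then a single launch once the remaining work of each group fits in one block.
void launch_ntt(uint32_t* d_coeff, const uint32_t* d_psi, int N, uint32_t q,
                int threads_per_block = 256)
{
    int step_group = 1;        // number of butterfly groups in the current iteration
    int step       = N / 2;    // distance between the paired coefficients
    int blocks     = (N / 2 + threads_per_block - 1) / threads_per_block;

    while (step > threads_per_block) {               // multi-kernel mode
        ntt_stage_kernel<<<blocks, threads_per_block>>>(d_coeff, d_psi,
                                                        step, step_group, q);
        step_group <<= 1;                            // groups double each iteration
        step       >>= 1;                            // step is halved each iteration
    }
    // Single-kernel mode: the remaining iterations run inside one launch,
    // keeping their data in shared memory between stages.
    ntt_tail_kernel<<<N / (2 * threads_per_block), threads_per_block,
                      2 * threads_per_block * sizeof(uint32_t)>>>(
        d_coeff, d_psi, step, q);
}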
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores computer program instructions executable by the at least one processor to enable the at least one processor to perform the privacy computation heterogeneous acceleration method based on fully homomorphic encryption of any one of claims 1-8.
10. A computer-readable storage medium storing a program, characterized in that the program, when executed by a processor, implements the privacy computation heterogeneous acceleration method based on fully homomorphic encryption of any one of claims 1-8.
CN202211433166.3A 2022-11-16 2022-11-16 Privacy computation heterogeneous acceleration method and device based on fully homomorphic encryption Active CN115622684B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211433166.3A CN115622684B (en) 2022-11-16 2022-11-16 Privacy computation heterogeneous acceleration method and device based on fully homomorphic encryption

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211433166.3A CN115622684B (en) 2022-11-16 2022-11-16 Privacy computation heterogeneous acceleration method and device based on fully homomorphic encryption

Publications (2)

Publication Number Publication Date
CN115622684A CN115622684A (en) 2023-01-17
CN115622684B true CN115622684B (en) 2023-03-28

Family

ID=84878880

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211433166.3A Active CN115622684B (en) 2022-11-16 2022-11-16 Privacy computation heterogeneous acceleration method and device based on fully homomorphic encryption

Country Status (1)

Country Link
CN (1) CN115622684B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116366248B (en) * 2023-05-31 2023-09-29 山东大学 Kyber implementation method and system based on compact instruction set expansion
CN116633526B (en) * 2023-07-21 2023-10-31 北京数牍科技有限公司 Data processing method, device, equipment and medium
CN116743349B (en) * 2023-08-14 2023-10-13 数据空间研究院 Paillier ciphertext summation method, system, device and storage medium
CN117251871B (en) * 2023-11-16 2024-03-01 支付宝(杭州)信息技术有限公司 Data processing method and system for secret database
CN117349868B (en) * 2023-12-04 2024-04-12 粤港澳大湾区数字经济研究院(福田) Fully homomorphic encryption and decryption method based on GPU, electronic equipment and storage medium
CN117439731B (en) * 2023-12-21 2024-03-12 山东大学 Privacy protection big data principal component analysis method and system based on homomorphic encryption
CN117521164B (en) * 2024-01-08 2024-05-03 南湖实验室 Self-adaptive homomorphic encryption method based on trusted execution environment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114697113A (en) * 2022-03-30 2022-07-01 医渡云(北京)技术有限公司 Hardware accelerator card-based multi-party privacy calculation method, device and system
CN114710320A (en) * 2022-03-03 2022-07-05 湖南科技大学 Edge calculation privacy protection method based on block chain and multi-key fully homomorphic encryption

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101763241B (en) * 2010-01-20 2012-02-08 西安电子科技大学 Large integer modular arithmetic device for realizing signature algorithm in ECC cryptosystem and modular method therefor
US10733819B2 (en) * 2018-12-21 2020-08-04 2162256 Alberta Ltd. Secure and automated vehicular control using multi-factor authentication

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114710320A (en) * 2022-03-03 2022-07-05 湖南科技大学 Edge calculation privacy protection method based on block chain and multi-key fully homomorphic encryption
CN114697113A (en) * 2022-03-30 2022-07-01 医渡云(北京)技术有限公司 Hardware accelerator card-based multi-party privacy calculation method, device and system

Also Published As

Publication number Publication date
CN115622684A (en) 2023-01-17

Similar Documents

Publication Publication Date Title
CN115622684B (en) Privacy computation heterogeneous acceleration method and device based on fully homomorphic encryption
Al Badawi et al. Implementation and performance evaluation of RNS variants of the BFV homomorphic encryption scheme
Al Badawi et al. High-performance FV somewhat homomorphic encryption on GPUs: An implementation using CUDA
Gupta et al. Pqc acceleration using gpus: Frodokem, newhope, and kyber
US10467389B2 (en) Secret shared random access machine
Wu et al. Secure and efficient outsourced k-means clustering using fully homomorphic encryption with ciphertext packing technique
Fan et al. Parallelization of RSA algorithm based on compute unified device architecture
Gao et al. CuNH: Efficient GPU implementations of post-quantum KEM NewHope
Alkım et al. Compact and simple RLWE based key encapsulation mechanism
JP2022531593A (en) Systems and methods for adding and comparing integers encrypted by quasigroup operations in AES counter mode encryption
Wang et al. HE-Booster: an efficient polynomial arithmetic acceleration on GPUs for fully homomorphic encryption
Choi et al. Fast implementation of SHA-3 in GPU environment
US11804968B2 (en) Area efficient architecture for lattice based key encapsulation and digital signature generation
Wan et al. TESLAC: accelerating lattice-based cryptography with AI accelerator
Dolev et al. Secret shared random access machine
Pu et al. Fastplay-a parallelization model and implementation of smc on cuda based gpu cluster architecture
Han et al. cuGimli: optimized implementation of the Gimli authenticated encryption and hash function on GPU for IoT applications
Takeshita et al. Gps: Integration of graphene, palisade, and sgx for large-scale aggregations of distributed data
Seo SIKE on GPU: Accelerating supersingular isogeny-based key encapsulation mechanism on graphic processing units
Lei et al. Accelerating homomorphic full adder based on fhew using multicore cpu and gpus
Zheng Encrypted cloud using GPUs
Yuan et al. Portable Implementation of Postquantum Encryption Schemes and Key Exchange Protocols on JavaScript-Enabled Platforms
CN109150494A (en) Method, storage medium, equipment and the system of enciphering and deciphering algorithm are constructed in mobile terminal
Doumi et al. Performance evaluation of parallel international data encryption algorithm on iman1 super computer
Kamal et al. Enhanced implementation of the NTRUEncrypt algorithm using graphics cards

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant