CN116048811A - Fully homomorphic encryption neural network reasoning acceleration method and system based on resource multiplexing - Google Patents

Fully homomorphic encryption neural network reasoning acceleration method and system based on resource multiplexing

Info

Publication number
CN116048811A
CN116048811A
Authority
CN
China
Prior art keywords
homomorphic
neural network
homomorphic encryption
multiplexing
layer
Prior art date
Legal status
Pending
Application number
CN202310113879.XA
Other languages
Chinese (zh)
Inventor
鞠雷
诸怡兰
王心瑶
张伟
周梓梦
Current Assignee
Shandong University
Original Assignee
Shandong University
Priority date
Filing date
Publication date
Application filed by Shandong University filed Critical Shandong University
Priority to CN202310113879.XA
Publication of CN116048811A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F15/781On-chip cache; Off-chip memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3867Concurrent instruction execution, e.g. pipeline or look ahead using instruction pipelines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/008Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols involving homomorphic encryption
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Neurology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present disclosure provides a fully homomorphic encryption neural network reasoning acceleration method and system based on resource multiplexing, comprising: acquiring information of a fully homomorphic encryption neural network to be accelerated and hardware resource information of an FPGA; and inputting them into a pre-constructed hardware resource allocation model to obtain an optimal hardware resource allocation scheme for performing the operation processing of the fully homomorphic encryption neural network on the FPGA. The processing strategy of the hardware resource allocation model is as follows: parallel and pipeline optimization is performed within each network layer for the fully homomorphic encryption operations and the fully homomorphic encryption neural network, and homomorphic basic operation modules are multiplexed across the network layers; meanwhile, for the on-chip storage space of the FPGA, on-chip buffer multiplexing at different granularities is performed based on the operation division of the fully homomorphic encryption neural network; finally, the optimal resource allocation scheme is obtained by taking the minimization of the time for the fully homomorphic encryption neural network to perform reasoning on encrypted data as the objective.

Description

Fully homomorphic encryption neural network reasoning acceleration method and system based on resource multiplexing
Technical Field
The disclosure belongs to the technical field of fully homomorphic encryption application, and particularly relates to a fully homomorphic encryption neural network reasoning acceleration method and system based on resource multiplexing.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Using the fully homomorphic encryption (FHE: Fully Homomorphic Encryption) technique to protect convolutional neural network (CNN) data is a currently popular direction; however, combining the two presents a number of obstacles in practical applications, specifically including: on the one hand, fully homomorphic encryption can only support operations such as addition and multiplication, while common CNNs contain nonlinear activation functions such as ReLU; on the other hand, the computational cost of fully homomorphic encrypted data is huge, a computation on one ciphertext being 5 to 6 orders of magnitude more expensive than the corresponding plaintext computation, and the neural network itself requires complex computation over many nodes and many rounds, so the fully homomorphic encrypted convolutional neural network (FHE-CNN) needs to compute a large amount of data and at the same time needs a large amount of storage space to store and schedule the generated data, which makes the operation of the FHE-CNN slow. For example, CryptoNets, the first work to combine FHE and CNN, takes 205 seconds to perform reasoning on an encrypted 5-layer neural network, which cannot satisfy practical performance requirements.
Against the above background, many works have optimized and accelerated FHE-CNN, specifically including:
(1) Works based on CPU implementation: these works only consider how to improve the algorithm to reduce ciphertext operations and thus reduce reasoning time, and do not consider hardware-level optimization.
(2) Works based on optimization at a lower hardware level: common hardware platforms include graphics processors (Graphics Processing Unit, GPU), field programmable gate arrays (Field-Programmable Gate Array, FPGA) and application specific integrated circuits (Application Specific Integrated Circuit, ASIC). The inventors found that such works have the following problems:
1) Some of the methods only optimize the homomorphic operations in isolation, without optimizing the homomorphic operations in combination with the specific application;
2) The performance bottleneck of FHE-CNN storage is not considered and the computation process is not optimized.
Disclosure of Invention
The present disclosure provides a fully homomorphic encryption neural network reasoning acceleration method and system based on resource multiplexing; the scheme is based on a fully homomorphic encryption operation optimization method and an on-chip buffer multiplexing strategy so as to effectively improve the reasoning acceleration effect of the fully homomorphic encryption neural network.
According to a first aspect of the embodiments of the present disclosure, there is provided a fully homomorphic encryption neural network reasoning acceleration method based on resource multiplexing, including:
acquiring information of a fully homomorphic encryption neural network to be accelerated and hardware resource information of an FPGA;
inputting the information of the fully homomorphic encryption neural network and the hardware resource information into a pre-constructed hardware resource allocation model to obtain an optimal hardware resource allocation scheme for performing the operation processing of the fully homomorphic encryption neural network on the FPGA;
the processing strategy of the hardware resource allocation model is as follows: parallel and pipeline optimization is performed within each network layer for the fully homomorphic encryption operations and the fully homomorphic encryption neural network, and homomorphic basic operation modules are multiplexed across the network layers; meanwhile, for the on-chip storage space of the FPGA, on-chip buffer multiplexing at different granularities is performed based on the operation division of the fully homomorphic encryption neural network; finally, the optimal resource allocation scheme is obtained by taking the minimization of the time for the fully homomorphic encryption neural network to perform reasoning on encrypted data as the objective, wherein the operations of the fully homomorphic encryption neural network are divided according to granularity into three levels: homomorphic basic operations, homomorphic operations and network layer operations.
Further, the hardware resource allocation model is specifically:

minimize  Σ_{lr ∈ Layer} LAT_lr
subject to  Σ_{op ∈ OP} DSP_op ≤ DSP_max
            max_{lr ∈ Layer} BRAM_lr ≤ BRAM_max

where LAT_lr is the running time of layer lr, Layer represents all layers contained in the fully homomorphic encryption neural network, lr represents one layer, OP represents the set of all homomorphic operations, op represents one homomorphic operation, DSP_max represents the number of DSPs of the FPGA development board, BRAM_max represents the total number of BRAMs of the FPGA development board, DSP_op is the number of DSPs occupied by homomorphic operation op, and BRAM_lr is the number of BRAMs used by layer lr.
Further, each network layer operation comprises a plurality of homomorphic operations, each homomorphic operation comprises a plurality of homomorphic basic operations, wherein the operation of the homomorphic encryption neural network adopts a pipeline and parallel processing mode, and the homomorphic basic operations are used as units for pipelining.
Further, the parallel and pipeline optimization is performed within each network layer for the fully homomorphic encryption operations and the fully homomorphic encryption neural network, and homomorphic basic operation modules are multiplexed across the network layers, specifically: following the order of homomorphic basic operations, homomorphic operations and network layer operations, the internal parallelism of the operations at the different granularities is set separately, so that the parallel effects of the operations at different granularities are superimposed;
or, the homomorphic basic operations are divided into two classes according to whether they traverse the coefficients of an RNS polynomial in one round or in multiple rounds, and different parallelism is set for the two classes so that their running times are similar;
or, to realize a pipeline that crosses homomorphic operations within a network layer, the same parallelism is set for different homomorphic operations, and the homomorphic operations are divided into two classes based on the complexity of the data dependencies among their RNS polynomials;
or, the network layer operations are divided into two classes according to whether they contain KeySwitch operations, and the pipeline is designed separately for the two classes.
Further, for the on-chip storage space of the FPGA, on-chip buffer multiplexing at different granularities is performed based on the operation division of the fully homomorphic encryption neural network, specifically: the space storing one RNS polynomial is taken as the storage unit of the on-chip buffer, and the on-chip buffers are divided into two classes according to the class division of the homomorphic basic operations; the multiplexing of the buffers comprises multiplexing within a homomorphic operation, multiplexing within a network layer, and multiplexing between network layers.
Further, the homomorphic basic operations include: modular multiplication, modular addition, modular subtraction, the modulo operation, the fast number theoretic transform (NTT) and its inverse (INTT);
or, the homomorphic operations comprise plaintext addition, ciphertext multiplication, rescaling (Rescale), relinearization (Relinearize) and rotation (Rotate) operations;
or, the network layers comprise a homomorphic convolution layer, a homomorphic activation layer and a homomorphic fully connected layer.
According to a second aspect of the embodiments of the present disclosure, there is provided a fully homomorphic encryption neural network reasoning acceleration system based on resource multiplexing, including:
the data acquisition unit is used for acquiring information of the fully homomorphic encryption neural network to be accelerated and hardware resource information of the FPGA;
the resource allocation unit is used for inputting the information of the fully homomorphic encryption neural network and the hardware resource information into a pre-constructed hardware resource allocation model to obtain an optimal hardware resource allocation scheme when the fully homomorphic encryption neural network performs operation processing in the FPGA;
the processing strategy of the hardware resource allocation model is as follows: parallel and flow optimization is carried out in each network layer aiming at full homomorphic encryption operation and full homomorphic encryption neural network, and multiplexing of homomorphic basic operation modules is adopted among each network layer; meanwhile, for the on-chip storage space of the FPGA, on-chip buffer area multiplexing under different granularities is carried out based on the operation division of the fully homomorphic encryption neural network; finally, the optimal resource allocation scheme is obtained by taking the time of minimizing the data encryption reasoning of the fully homomorphic encryption neural network as a target, wherein the operation of the fully homomorphic encryption neural network is divided into three levels of operation of homomorphic basic operation, homomorphic operation and network layer operation according to different granularities.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic device, including a memory, a processor, and a computer program running on the memory, where the processor implements the fully homomorphic encryption neural network reasoning acceleration method based on resource multiplexing when executing the program.
According to a fourth aspect of embodiments of the present disclosure, there is provided a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the described fully homomorphic encryption neural network reasoning acceleration method based on resource multiplexing.
Compared with the prior art, the beneficial effects of the present disclosure are:
(1) The scheme supports automatic deployment from a neural network application to an optimized fully homomorphic encryption hardware implementation; the adopted fully homomorphic encryption technique effectively and reliably protects data security during neural network reasoning, greatly simplifies the flow from the neural network to encrypted reasoning and hardware deployment, performs hardware acceleration optimization oriented to actual application requirements, and achieves a good deployment effect;
(2) Aiming at the bottlenecks of FPGA hardware deployment, namely low utilization efficiency of computing resources and insufficient on-chip storage space, the scheme provides a fully homomorphic operator optimization technique and an on-chip buffer multiplexing technique, which effectively alleviate the difficulties in the actual deployment process, give full play to the hardware advantages of the FPGA, and improve both the reasoning speed and the energy efficiency of the encrypted neural network;
(3) The high-level synthesis technology adopted by the scheme has the advantages of flexible and easy programming, a short development period and strong portability, and facilitates design space exploration for various application requirements; in particular, hardware deployment can be evaluated for different neural networks and different types of FPGA development boards, a hardware resource configuration scheme with the optimal running speed can be generated, and the code for deployment can be generated. In addition, the method can be adjusted for other optimization targets and is therefore extensible.
Additional aspects of the disclosure will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the disclosure.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate and explain the exemplary embodiments of the disclosure and together with the description serve to explain the disclosure, and do not constitute an undue limitation on the disclosure.
FIG. 1 is a flow diagram of a NKS layer (i.e., a layer that does not contain a KeySwitch operation) as described in embodiments of the present disclosure;
FIG. 2 is a flow diagram of a KS layer (i.e., a layer containing KeySwitch operations) as described in embodiments of the present disclosure;
FIG. 3 is a schematic diagram of homomorphic matrix multiplication according to an embodiment of the present disclosure, in which "RO" represents rotation and "+" represents homomorphic addition;
FIGS. 4 (a) through 4 (e) are diagrams illustrating buffer multiplexing within different homomorphic operations according to embodiments of the present disclosure;
FIGS. 5 (a) to 5 (b) are diagrams illustrating buffer multiplexing within FHE-CNN network layers according to embodiments of the present disclosure;
FIG. 6 is an overall design framework of the fully homomorphic encryption neural network reasoning acceleration method based on resource multiplexing in an embodiment of the present disclosure;
fig. 7 (a) to 7 (b) are schematic diagrams of parallelism configuration results based on the Lola-MNIST network according to the embodiments of the present disclosure;
FIG. 8 is a schematic diagram of an on-chip memory optimization effect according to an embodiment of the disclosure;
FIG. 9 is a diagram of the on-chip computing unit DSP optimization effect described in embodiments of the present disclosure;
FIG. 10 is an effect diagram of design space exploration based on an inference acceleration method as described in embodiments of the present disclosure;
FIG. 11 is a flowchart of an isomorphic encryption neural network reasoning acceleration method based on resource multiplexing according to an embodiment of the present disclosure;
Detailed Description
The disclosure is further described below with reference to the drawings and examples.
It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the present disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments in accordance with the present disclosure. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
Embodiments of the present disclosure and features of embodiments may be combined with each other without conflict.
Term interpretation:
fully homomorphic encryption (Fully Homomorphic Encryption) is an algorithm in the field of cryptography, characterized by the ability to arbitrarily compute ciphertext without decryption.
High-level Synthesis (HLS) is a process of automatically converting a logical structure of a High-level language description into a circuit model of a low-level language description.
An FPGA (Field Programmable Gate Array) is a semi-custom circuit that can be programmed to implement custom functions on the basis of an existing integrated circuit.
Neural Networks (Neural Networks), which are a subset of machine learning, are used to make classification predictions for data.
Embodiment one:
the aim of this embodiment is to provide a fully homomorphic encryption neural network reasoning acceleration method based on resource multiplexing.
Currently, the mainstream technologies for computing on private data include secure multiparty computation (MPC: Multi-party Computation), homomorphic encryption (Homomorphic Encryption), and the like. Homomorphic encryption is a cryptography-based encryption scheme with more reliable security. Fully homomorphic encryption (Fully Homomorphic Encryption) is a branch of homomorphic encryption that can support an unlimited number of ciphertext additions and multiplications and therefore has broader application prospects. Besides homomorphic addition and homomorphic multiplication, operations such as rotation (Rotate) are collectively referred to as homomorphic operations (Homomorphic Operation). Fully homomorphic encryption schemes have gone through three generations of technology, among which the second-generation schemes represented by BGV, BFV and CKKS are currently the most computationally efficient and the most widely used, receive the most attention from academia and industry, and already have a number of mature open-source libraries such as SEAL, PALISADE, HElib and HEAAN.
Because of the various problems in the existing optimization and acceleration of fully homomorphic encryption neural networks, this embodiment provides a fully homomorphic encryption neural network reasoning acceleration method based on resource multiplexing. The scheme fully considers that the FPGA has strong raw computing power and reconfigurability, is superior to the GPU in terms of energy consumption and superior to the ASIC in terms of flexibility, and is therefore an ideal hardware platform for accelerating fully homomorphic applications. Combining the flexible programmability of the FPGA, the scheme uses a High-level Synthesis (HLS) tool and, in view of the application characteristics of FHE-CNN and the on-chip resource limitations of the FPGA, provides an optimal resource configuration method that accelerates the computation of FHE-CNN through design space exploration of the on-chip resource configuration.
As shown in fig. 11, the scheme specifically includes:
acquiring information of a fully homomorphic encryption neural network to be accelerated and hardware resource information of an FPGA;
inputting the information of the fully homomorphic encryption neural network and the hardware resource information into a pre-constructed hardware resource allocation model to obtain an optimal hardware resource allocation scheme for performing the operation processing of the fully homomorphic encryption neural network on the FPGA;
the processing strategy of the hardware resource allocation model is as follows: parallel and pipeline optimization is performed within each network layer for the fully homomorphic encryption operations and the fully homomorphic encryption neural network, and homomorphic basic operation modules are multiplexed across the network layers; meanwhile, for the on-chip storage space of the FPGA, on-chip buffer multiplexing at different granularities is performed based on the operation division of the fully homomorphic encryption neural network; finally, the optimal resource allocation scheme is obtained by taking the minimization of the time for the fully homomorphic encryption neural network to perform reasoning on encrypted data as the objective, wherein the operations of the fully homomorphic encryption neural network are divided according to granularity into three levels: homomorphic basic operations, homomorphic operations and network layer operations.
Further, the hardware resource allocation model is specifically:

minimize  Σ_{lr ∈ Layer} LAT_lr
subject to  Σ_{op ∈ OP} DSP_op ≤ DSP_max
            max_{lr ∈ Layer} BRAM_lr ≤ BRAM_max

where LAT_lr is the running time of layer lr, Layer represents all layers contained in the fully homomorphic encryption neural network, lr represents one layer, OP represents the set of all homomorphic operations, op represents one homomorphic operation, DSP_max represents the number of DSPs of the FPGA development board, BRAM_max represents the total number of BRAMs of the FPGA development board, DSP_op is the number of DSPs occupied by homomorphic operation op, and BRAM_lr is the number of BRAMs used by layer lr.
Further, each network layer operation comprises a plurality of homomorphic operations, each homomorphic operation comprises a plurality of homomorphic basic operations, wherein the operation of the homomorphic encryption neural network adopts a pipeline and parallel processing mode, and the homomorphic basic operations are used as units for pipelining.
Further, the parallel and pipeline optimization is performed within each network layer for the fully homomorphic encryption operations and the fully homomorphic encryption neural network, and homomorphic basic operation modules are multiplexed across the network layers, specifically: following the order of homomorphic basic operations, homomorphic operations and network layer operations, the internal parallelism of the operations at the different granularities is set separately, so that the parallel effects of the operations at different granularities are superimposed;
or, the homomorphic basic operations are divided into two classes according to whether they traverse the coefficients of an RNS polynomial in one round or in multiple rounds, and different parallelism is set for the two classes so that their running times are similar;
or, to realize a pipeline that crosses homomorphic operations within a network layer, the same parallelism is set for different homomorphic operations, and the homomorphic operations are divided into two classes based on the complexity of the data dependencies among their RNS polynomials;
or, the network layer operations are divided into two classes according to whether they contain KeySwitch operations, and the pipeline is designed separately for the two classes.
Further, for the on-chip storage space of the FPGA, on-chip buffer multiplexing at different granularities is performed based on the operation division of the fully homomorphic encryption neural network, specifically: the space storing one RNS polynomial is taken as the storage unit of the on-chip buffer, and the on-chip buffers are divided into two classes according to the class division of the homomorphic basic operations; the multiplexing of the buffers comprises multiplexing within a homomorphic operation, multiplexing within a network layer, and multiplexing between network layers.
Further, the homomorphic basic operations include: modular multiplication, modular addition, modular subtraction, the modulo operation, the fast number theoretic transform (NTT) and its inverse (INTT);
or, the homomorphic operations comprise plaintext addition, ciphertext multiplication, rescaling (Rescale), relinearization (Relinearize) and rotation (Rotate) operations;
or, the network layers comprise a homomorphic convolution layer, a homomorphic activation layer and a homomorphic fully connected layer.
Furthermore, the information of the fully homomorphic encryption neural network mainly comprises the polynomial degree of the homomorphic encryption parameters, and the hardware resource information of the FPGA mainly comprises the rated number of DSP resources and the number of BRAM resources.
In particular, for easy understanding, the following detailed description of the embodiments will be given with reference to the accompanying drawings:
the following describes the scheme of the present embodiment in detail from three aspects:
(1) Fully homomorphic encryption operation optimization
On the basis of combining fully homomorphic encryption with a neural network, parallel and pipeline optimization is designed for the fully homomorphic encryption operations and the FHE-CNN layers, and homomorphic modules are multiplexed on this basis. Specifically:
The operations after combining homomorphic encryption with the neural network are divided into three levels: the highest level contains a number of intermediate-level operations, the intermediate level contains a number of lowest-level operations, and parallel strategies are designed separately for the different levels. The lowest-level operations are the homomorphic basic operations (hereinafter simply called basic operations), including modular multiplication (Modular mult), modular addition (Modular add), modular subtraction (Modular sub), the modulo operation, the fast number theoretic transform (NTT) and the inverse transform (INTT). The intermediate level consists of the homomorphic operations, including plaintext-ciphertext addition (PC-add), ciphertext-ciphertext addition (CC-add), plaintext-ciphertext multiplication (PC-mult), ciphertext-ciphertext multiplication (CC-mult), rescaling (Rescale), relinearization (Relinearize) and rotation (Rotate), the latter two being collectively referred to as KeySwitch operations. The highest level is the operation of an FHE-CNN network layer, where an FHE-CNN layer refers to a homomorphic convolution layer, a homomorphic activation layer, a homomorphic fully connected layer and the like.
The internal parallelism of each of the three levels is set separately, so that the parallel effects from the lowest level to the highest level are superimposed and fine-grained and coarse-grained parallelism are combined. The lowest-level parallelism refers to parallelism inside a basic operation that computes one RNS polynomial (in the FHE scheme one ciphertext is split into several RNS polynomials for computation). When a basic operation has only one computation unit, that unit traverses all coefficients of the polynomial; when a basic operation runs several computation units in parallel, each unit processes a part of the coefficients and the duration of the basic operation is reduced. The intermediate-level parallelism is the parallelism inside a homomorphic operation, taking a ciphertext as the unit: one ciphertext contains several RNS polynomials, and intermediate-level parallelism means computing several RNS polynomials at the same time, that is, running several basic-operation modules in parallel. There are sometimes data dependencies between RNS polynomials that prevent parallel computation; these cases are handled separately below. The highest-level parallelism refers to parallelism inside an FHE-CNN layer: a layer needs to process many ciphertexts, and highest-level parallelism means computing several ciphertexts at a time, that is, running several homomorphic-operation modules in parallel.
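To make the stacking of the three parallelism levels concrete, the following sketch counts the sequential "waves" of work issued at each granularity; it is only an illustration under assumed names, not part of the patented framework, and the example figures (25 ciphertexts, 6 RNS polynomials, N = 8192) are borrowed from the experiment section below.

```python
from math import ceil

def three_level_waves(n_ciphertexts, polys_per_ct, coeffs_per_poly,
                      p_layer, p_op, p_basic):
    """Sequential work 'waves' when the three parallelism levels are stacked:
    ciphertexts in parallel inside the layer (p_layer), RNS polynomials in
    parallel inside a homomorphic operation (p_op), and coefficient lanes in
    parallel inside a basic operation (p_basic).  The counts multiply because
    the levels nest, which is why their parallel effects superimpose."""
    return (ceil(n_ciphertexts / p_layer)
            * ceil(polys_per_ct / p_op)
            * ceil(coeffs_per_poly / p_basic))

# e.g. 25 input ciphertexts, 6 RNS polynomials each, N = 8192 coefficients
print(three_level_waves(25, 6, 8192, p_layer=1, p_op=3, p_basic=4))
```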
In combination with the parallel design, we design a pipeline inside the FHE-CNN layer. An FHE-CNN layer contains various homomorphic operations, which in turn contain various basic operations, and we pipeline with the basic operation as the unit. Because pipelining at the basic-operation level is a fine-grained pipelining mode, the utilization of computing resources is improved and the use of storage space is reduced. To improve pipeline efficiency, parallelism is configured inside the basic operations so that the running times of the basic operations are similar, which reduces pipeline bubbles (a bubble means that some module of the pipeline stalls and does no work). We divide the basic operations into two classes according to their internal computation flow: one class is NTT and INTT, which require multiple rounds of computation over the coefficients of one RNS polynomial, and the other class is the remaining basic operations, which traverse the coefficients of the RNS polynomial only once. Different parallelism is typically set for the two classes so that their running times are approximately equal.
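A minimal sketch of this balancing step, assuming the cycle-count models given in the later modeling section ((N/2)·log2 N butterfly steps for NTT/INTT and a single pass over N coefficients for the other basic operations); the function names and candidate range are illustrative, and a real configuration also has to weigh DSP cost, which is why the experiments below do not simply equalize the two run times.

```python
import math

def ntt_cycles(N, nc):       # log2(N) rounds, one butterfly covers two coefficients
    return math.ceil((N // 2) * math.log2(N) / nc)

def other_cycles(N, p):      # a single pass over the N coefficients
    return math.ceil(N / p)

def balance_ntt_parallelism(N, p_other, nc_candidates=range(1, 33)):
    """Pick the NTT/INTT parallelism whose cycle count is closest to that of
    the other basic operations, so the basic-operation pipeline has few bubbles."""
    return min(nc_candidates,
               key=lambda nc: abs(ntt_cycles(N, nc) - other_cycles(N, p_other)))

print(balance_ntt_parallelism(8192, p_other=1))  # e.g. with N = 8192
```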
Since the pipeline inside the FHE-CNN layer works at the granularity of basic operations, we set the same intermediate-level parallelism for the different homomorphic operations so that data can flow through a pipeline that crosses homomorphic operations within the layer. Because of the data dependencies between the RNS polynomials inside homomorphic operations, we divide the homomorphic operations into two classes: one class is the KeySwitch operations, and the rest form the other class. Because the complex data dependencies of the KeySwitch operation affect pipelining across it and other operations, we further divide the FHE-CNN layers into two classes: layers containing KeySwitch operations, referred to as "KS layers", and layers not containing KeySwitch operations, referred to as "NKS layers". The pipeline is designed separately for these two classes of layers.
As shown in FIG. 1, the horizontal axis represents time, the vertical axis represents homomorphic operations, and the arrows represent the flow of data. Each block represents a basic operation; the pipeline interval (Pipeline Interval) is the time of one basic operation, and the pipeline inside Rescale extends into the preceding and following homomorphic operations, which illustrates the cross-operation pipeline design.
As shown in fig. 2, the horizontal axis represents time, and the vertical axis represents homomorphic operation. Since the KeySwitch operation requires computation of a plurality of basic operations to compute the next ciphertext, the pipeline interval of this scheme depends on the pipeline interval of the KeySwitch operation.
The KS layer is typically used to compute homomorphic matrix multiplication, in which a homomorphic vector multiplication requires multiple successive rounds of rotation and homomorphic addition, as shown in FIG. 3. Since each rotation can only start after the previous addition has completed, the rotations of a single vector multiplication cannot be pipelined. In this case we compute several vector multiplications, which have no data dependencies between them, at the same time: for example, the KeySwitch module computes the first rotation of vector multiplication A, then the first rotation of vector multiplication B, then the second rotation of vector multiplication A, and so on. In this way the pipelining of the KeySwitch module can still be exploited for acceleration.
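The interleaving just described amounts to a round-robin issue order over independent vector multiplications; the short sketch below only illustrates that order (all names are assumptions), not the FPGA implementation of the KeySwitch module.

```python
def interleaved_rotation_schedule(num_vector_mults, num_rounds):
    """Issue order for the rotation+addition rounds of several independent
    homomorphic vector multiplications: round r of one vector multiplication
    depends only on its own round r-1, so a round-robin across independent
    vector multiplications keeps the shared KeySwitch pipeline filled."""
    return [(v, r) for r in range(num_rounds) for v in range(num_vector_mults)]

# Two independent vector multiplications (A = 0, B = 1), three rounds each:
print(interleaved_rotation_schedule(2, 3))
# [(0, 0), (1, 0), (0, 1), (1, 1), (0, 2), (1, 2)]
```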
For the FHE-CNN layers, we adopt a basic-operation multiplexing method. Because one layer finishes computing before the next layer starts and no two layers are ever computed at the same time, different layers can multiplex the same basic-operation modules. Therefore, during configuration, parallelism is set for the three levels, the corresponding computation circuits are instantiated on the FPGA, and the same circuits are reused by different layers, which achieves the final acceleration and improves resource utilization.
(2) On-chip buffer multiplexing
On-chip Buffer multiplexing design is carried out on the on-chip storage space, and performance loss caused by off-chip data transmission is reduced through multiplexing, so that an acceleration effect is achieved; the method specifically comprises the following steps:
the on-chip buffer (Buffer) takes the space storing one RNS polynomial as its unit. Since the two classes into which the basic operations are divided typically have different degrees of parallelism, the Buffers require different partitioning (Partition). According to the partitioning, the Buffers are divided into two types: Buffers for the NTT/INTT operations, denoted Bn, and Buffers for the remaining basic operations, denoted Bb. Because the NTT/INTT Buffers are partitioned more finely, a Bn can be multiplexed as a Bb, but not vice versa.
The on-chip buffer multiplexing optimization technique is divided into three levels of multiplexing, the lowest level is multiplexing in homomorphic operation, the middle level is buffer multiplexing in FHE-CNN layer, and the highest level is multiplexing between different FHE-CNN layers.
The lowest-level multiplexing means that the same Buffer is reused as much as possible within one homomorphic operation, so as to reduce data movement and transfers to and from off-chip storage. FIGS. 4(a) to 4(e) show the Buffer usage for each homomorphic operation. Taking FIG. 4(a) as an example, the input and output of CC-add multiplex the Buffer Bb1; taking FIG. 4(c) as an example, KeySwitch requires, besides Bb1 for its input and output, Bn1 for storing intermediate data during computation as well as data read from off-chip (DDR).
The intermediate-level multiplexing means that adjacent operations within an FHE-CNN layer use the same Buffer; when the data dependencies do not allow adjacent operations to share a Buffer, different Buffers are used for the computation and non-adjacent Buffers can be multiplexed. FIGS. 5(a) to 5(b) show the Buffer multiplexing of the NKS layer and the KS layer. For the NKS layer, the data to be computed is read from off-chip, stored in Bn1 after the PC-mult operation, the Rescale operation is performed on Bn1, and the result is accumulated into Bb1; the same procedure is then performed on the remaining ciphertexts, with Bb1 holding the accumulated result. For the KS layer, data is read from off-chip and from Bb1 into PC-mult, the result is stored in Bn1, Rescale is performed on Bn1, and the result is used as the input of KeySwitch; Bn2 serves as the intermediate buffer of KeySwitch and its result is stored in Bb1, so Bb1 is multiplexed; Bn1 stores the result of the CC-add accumulation, so Bn1 is also multiplexed.
The highest-level multiplexing means that different FHE-CNN layers share the same group of Buffers. Since one layer finishes computing before the next layer starts, the computation processes of two layers never overlap; apart from the single Buffer needed to hand the output of one layer to the next layer as its input, the Buffers of one layer can therefore be multiplexed by another layer, and the total number of Buffers ultimately used is the maximum of the individual Buffer usage over all layers.
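The consequence of the highest-level multiplexing can be stated in one line: since layers never execute concurrently, the network-wide Buffer demand is the per-layer maximum rather than the per-layer sum. The sketch below, with purely illustrative per-layer counts, only records that rule; the real figures come from the BRAM model in the next subsection.

```python
def total_buffer_demand(per_layer_buffers):
    """Layers run strictly one after another, so apart from the single Buffer
    that hands one layer's output to the next layer as its input, a layer can
    reuse the previous layer's Buffers; the total demand is therefore the
    maximum over layers, not their sum."""
    return max(per_layer_buffers.values())

# Illustrative per-layer Buffer counts only.
print(total_buffer_demand({"Cnv1": 12, "Act1": 6, "Fc1": 30, "Act2": 6, "Fc2": 20}))  # -> 30
```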
(3) Hardware resource allocation model construction
Modeling the resource configuration of the FHE-CNN on the FPGA on the basis of the first two steps, and exploring the design space to obtain a configuration scheme with the best performance through the characteristics of the given FHE-CNN and the given FPGA; the method specifically comprises the following steps:
modeling the time delay (Latency) of isomorphic operation optimization:
the number of cycles of one type of basic operation, including NTT/INTT, is as follows:
Figure SMS_7
wherein N represents the number of terms, nc, of the ciphertext polynomial NTT Representing the parallelism inside the NTT. The NTT operation is to traverse log for N numbers 2 Each NTT check is calculated N times, two numbers at a time.
The cycle count of the remaining basic operations other than NTT/INTT is:

LAT_basic = N / p
where N represents the number of terms of the ciphertext polynomial and p represents the parallelism within the operation.
To calculate the cycle count of an FHE-CNN layer, the cycle count of the pipeline interval is obtained first. Since the pipeline interval is determined by the slowest basic operation:

LAT_b = max{LAT_basic, LAT_NTT}
PI = ⌈L / P_intra⌉ · LAT_b

where LAT_b represents the maximum cycle count for performing a basic operation on one ciphertext polynomial, PI represents the cycle count of the pipeline interval, P_intra represents the parallelism inside a homomorphic operation, and L represents the level of the ciphertext.
Further, the cycle counts LAT_KS and LAT_NKS of the KS layer and the NKS layer are computed from the pipeline interval, where P_inter^KS and P_inter^NKS respectively represent the number of parallel homomorphic-operation modules in the two classes of layers, N_in represents the number of input ciphertexts, L represents the level of the ciphertext, and PI represents the cycle count of the pipeline interval.
Modeling DSP resources used by homomorphic operation:
DSP_op = P_inter · P_intra · Const_DSP

where P_inter and P_intra respectively represent the parallelism inside the FHE-CNN layer (the number of parallel homomorphic-operation modules) and the parallelism inside the homomorphic operation op, and Const_DSP represents the minimum number of DSP resources required when there is no parallelism.
Modeling the on-chip buffers used:
the number of BRAMs used by KS and NKS layers is modeled separately, and since buffers are divided into two classes, bn and Bb, calculations are performed separately for these two classes. The number of BRAMs for KS layer is:
Figure SMS_16
Figure SMS_17
wherein, the liquid crystal display device comprises a liquid crystal display device,
Figure SMS_18
and->
Figure SMS_19
The constants representing the BRAM used by KS layer buffers, bn and Bb represent the number of BRAM used by these two types of buffers. / >
Figure SMS_20
And->
Figure SMS_21
Representing internal homomorphism of KS layerParallelism of operations and parallelism within the layer homomorphic operation.
The number of BRAMs of the NKS layer is

BRAM_NKS = Bn_NKS + Bb_NKS

where Const_Bn^NKS and Const_Bb^NKS are constants describing the BRAM used by the NKS-layer Buffers, and P_inter^NKS and P_intra^NKS represent the parallelism of the homomorphic operations inside the NKS layer and the parallelism inside those homomorphic operations.
An optimization formulation for the resource allocation of the FHE-CNN on the FPGA is constructed: the network parameters of the FHE-CNN and the numbers of available FPGA resources are taken as input, and the parallelism of each layer and of each homomorphic operation is obtained through the computing-resource and storage models above, so that the time for reasoning on the encrypted data is minimized. The overall formulation is:

minimize  Σ_{lr ∈ Layer} LAT_lr
subject to  Σ_{op ∈ OP} DSP_op ≤ DSP_max
            max_{lr ∈ Layer} BRAM_lr ≤ BRAM_max

where OP represents the set of all homomorphic operations and op represents one homomorphic operation; Layer represents all layers contained in the neural network and lr represents one of the layers, with lr ∈ {KS, NKS}. LAT_lr is either LAT_KS or LAT_NKS, both of which are accumulated from LAT_b, the difference lying in the specific ways the KS and NKS layers are computed. For the BRAM resources, BRAM_lr is either BRAM_KS or BRAM_NKS, with BRAM_KS = Bn_KS + Bb_KS and BRAM_NKS = Bn_NKS + Bb_NKS.
DSP_max represents the number of DSPs of the FPGA development board and BRAM_max represents the total number of BRAMs of the FPGA development board. Under the constraints that the sum of the DSPs of all homomorphic operations does not exceed DSP_max and that the BRAM count of the layer using the most BRAMs does not exceed the total BRAM count, the sum of the times of all layers, i.e. the reasoning time of the whole network, is minimized.
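As an illustration of how the formulation above can be searched, the following sketch enumerates parallelism configurations, discards those that violate the DSP and BRAM constraints, and keeps the one with the smallest modeled whole-network reasoning time. It is a simplified stand-in rather than the patented framework: the basic-operation and pipeline-interval models follow the formulas given above, but the per-layer latency, per-operation DSP and per-layer BRAM models are deliberately crude placeholders, and all constants, names and search ranges are assumptions.

```python
import math
from itertools import product

# Cycle models following the formulas above.
def lat_ntt(N, nc_ntt):
    return math.ceil((N // 2) * math.log2(N) / nc_ntt)   # log2(N) rounds, 2 coefficients per butterfly

def lat_basic(N, p):
    return math.ceil(N / p)                               # one pass over the N coefficients

def pipeline_interval(N, nc_ntt, p_basic, level, p_intra):
    lat_b = max(lat_basic(N, p_basic), lat_ntt(N, nc_ntt))
    return math.ceil(level / p_intra) * lat_b             # polynomials per ciphertext assumed equal to the level

# Crude stand-ins for LAT_KS / LAT_NKS, DSP_op and BRAM_lr.
def layer_latency(n_in, pi, p_inter):
    return math.ceil(n_in / p_inter) * pi

def op_dsp(p_inter, p_intra, const_dsp=8):
    return p_inter * p_intra * const_dsp                  # DSP_op = P_inter * P_intra * Const_DSP

def layer_bram(p_inter, p_intra, const_bram=4):
    return p_inter * p_intra * const_bram                 # placeholder scaling only

def explore(N, layers, n_ops, dsp_max, bram_max):
    """Brute-force design space exploration: minimize the summed layer latency
    subject to the DSP-sum and BRAM-max constraints of the formulation above."""
    best = None
    for nc_ntt, p_basic, p_intra, p_inter in product(range(1, 33), (1, 2, 4), range(1, 7), range(1, 5)):
        if n_ops * op_dsp(p_inter, p_intra) > dsp_max:     # sum of DSPs over the homomorphic operations
            continue
        if layer_bram(p_inter, p_intra) > bram_max:        # layer using the most BRAMs
            continue
        total = sum(layer_latency(n_in, pipeline_interval(N, nc_ntt, p_basic, lvl, p_intra), p_inter)
                    for n_in, lvl in layers)               # objective: whole-network reasoning time
        if best is None or total < best[0]:
            best = (total, dict(nc_ntt=nc_ntt, p_basic=p_basic, p_intra=p_intra, p_inter=p_inter))
    return best

# Lola-MNIST-like inputs: (input ciphertexts, ciphertext level) per layer, ACU9EG-like budgets.
layers = [(25, 6), (1, 5), (275, 4), (1, 3), (70, 2)]
print(explore(N=8192, layers=layers, n_ops=7, dsp_max=2520, bram_max=912))
```

The patented framework formulates this search as an integer linear programming model and solves it automatically, as the framework description below notes; the brute-force loop here only makes the objective and constraints explicit.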
Further, to demonstrate the effectiveness of the scheme described in this embodiment, the following experiments were performed:
in order to test the effect of the method described in this embodiment, the performance test and the demonstration of the optimization effect of the method are performed using a specific neural network.
Specifically, the method described in this embodiment uses the design framework of FIG. 6. First, the information of the homomorphic encryption application combined with the neural network and the hardware resources of the given FPGA are extracted. The homomorphic encryption application information includes the homomorphic encryption parameter polynomial degree N, the ciphertext modulus Q and the small moduli q_i, where Q = ∏_{0≤i<L} q_i and L represents the number of small moduli; the hardware resources of the FPGA include the rated number of DSP resources of the board as well as the number and specification of its BRAM resources. These serve as the input of the dedicated accelerator generation framework. Then, an integer linear programming model is constructed based on the parameterized homomorphic encryption operator library and the two techniques of on-chip computation and storage resource management, and automatic design space exploration is performed to reach the optimal design, with the optimal hardware configuration scheme as the output. Finally, the corresponding bitstream file is generated by the Vivado HLS tool and programmed onto the FPGA to obtain the accelerated implementation of the application.
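As an illustration of the two groups of inputs just listed, the sketch below bundles them into simple records that a design-space-exploration step could consume; all field names and concrete values are assumptions and do not reflect the framework's actual interface (the moduli are placeholders, and the ACU9EG BRAM block count is an estimate from its 32.1 Mbit capacity).

```python
from dataclasses import dataclass
from functools import reduce

@dataclass
class FHEAppInfo:
    """Homomorphic-encryption application information extracted from the FHE-CNN."""
    N: int              # polynomial degree of the homomorphic encryption parameters
    small_moduli: list  # the small moduli q_i; the ciphertext modulus Q is their product

    @property
    def Q(self):
        return reduce(lambda a, b: a * b, self.small_moduli, 1)

    @property
    def L(self):
        return len(self.small_moduli)   # number of small moduli

@dataclass
class FPGAResources:
    """Hardware resource information of the target FPGA development board."""
    name: str
    dsp_max: int        # rated number of DSP resources
    bram_max: int       # number of BRAM blocks

# Illustrative inputs only.
app = FHEAppInfo(N=8192, small_moduli=[(1 << 30) - 35] * 7)
board = FPGAResources(name="ALINX ACU9EG", dsp_max=2520, bram_max=912)
print(app.L, app.Q.bit_length(), board.dsp_max)
```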
Experiment 1:
the neural network selected in this experiment is Lola-MNIST, a five-layer network for prediction on the MNIST dataset, described in Table 1:
TABLE 1 Lola-MNIST network description
The number of homomorphic operations contained in each layer of the FHE-CNN obtained by combining with homomorphic encryption, the final reasoning accuracy and the model data size are shown in Table 2:
TABLE 2 basic information of Lola-MNIST network
The feasibility of deploying the FHE-CNN on embedded FPGAs is verified by experimental tests on two low-power FPGA development boards. Specifically, one is the mid-range FPGA ALINX ACU9EG (with a Xilinx Zynq UltraScale+ MPSoC XCZU9EG device), with 2,520 DSP units and 32.1 Mbit of on-chip BRAM. The other is the high-end FPGA ALINX ACU15EG (with a Xilinx Zynq UltraScale+ MPSoC XCZU15EG device), with 3,528 DSP units, 26.2 Mbit of on-chip BRAM and 31.5 Mbit of on-chip URAM.
After combining this network with homomorphic encryption, the layers were classified into KS layers and NKS layers: Cnv1 is an NKS layer, Act1 is an NKS layer, Fc1 is a KS layer, Act2 is an NKS layer, and Fc2 is a KS layer. The numbers of input ciphertexts extracted for the layers are 25, 1, 275, 1 and 70, respectively. The homomorphic parameter is taken as N = 8192; the ciphertext level of the Cnv1 layer of the FHE-CNN is 6 and then decreases by 1 with each layer until Fc2, whose level is 2.
The solution space of the FHE-CNN is as follows: the lowest-level parallelism ranges from 1 to 32, the intermediate-level parallelism from 1 to 6, and the highest-level parallelism from 1 to infinity. Traversing all solutions of the solution space finally gives, on the FPGA ALINX ACU9EG board, a parallelism of 4 for the basic operations NTT/INTT and 1 for the other basic operations; for the parallelism inside the homomorphic operations, PC-Mult is 3, CC-Mult is 1, Rescale is 3 and KeySwitch is 3; and the parallelism inside the FHE-CNN layers is 1 module for every operation except KeySwitch, which uses 2 modules. The board frequency was set to 100 MHz and the final on-board verification time was 0.24 seconds. For the FPGA ALINX ACU15EG board, the parallelism of the basic operations NTT/INTT is 4 and that of the other basic operations is 1; for the parallelism inside the homomorphic operations, PC-Mult is 3, CC-Mult is 1, Rescale is 3 and KeySwitch is 3; and the parallelism inside the FHE-CNN layers is 1 module for every operation except KeySwitch, which uses 3 modules. The board frequency was set to 100 MHz and the final on-board verification time was 0.19 seconds. Compared with the most advanced CPU implementation (the implementation of the Lola scheme), the speed is improved by 11.58 times and the energy efficiency by 1019.04 times.
TABLE 3 acceleration results
Furthermore, for the implementations on the different FPGAs, we present the inter- and intra-parallelism solutions finally obtained by the integer linear programming, see FIGS. 7(a) and 7(b). It can be seen that the parallelism of the KeySwitch operation is set higher on the ACU15EG than on the ACU9EG, because the ACU15EG has more resources than the ACU9EG, and increasing the KeySwitch parallelism reduces the latency. The parallelism of the CC-mult operation is 1 on both boards, because the CC-mult module is called only a few times, so lowering its parallelism has little influence on the total latency, and the resources saved by the low parallelism of CC-mult are used for other bottleneck operations.
To demonstrate the effect of the homomorphic operator optimization technique and the on-chip buffer multiplexing technique, we ran a comparison against a control group that used neither optimization. As shown in fig. 8, the on-chip buffer multiplexing experiment shows that the Fc1 layer takes the most inference time; by means of BRAM multiplexing, the BRAM utilization of the Fc1 layer is raised from 25.8% to 84.8%, improving the inference speed of this layer by 6.63 times.
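The idea behind on-chip buffer multiplexing can be illustrated with the following minimal Python sketch (a made-up buffer-pool abstraction, not the hardware design itself): RNS-polynomial-sized buffer slots are released after use, so later homomorphic operations reuse the same on-chip storage instead of allocating new BRAM.

class BufferPool:
    def __init__(self, n_slots, poly_len):
        # one slot per RNS polynomial; 64-bit coefficients assumed for this sketch
        self.free = [bytearray(8 * poly_len) for _ in range(n_slots)]

    def acquire(self):
        if not self.free:
            raise RuntimeError("pool exhausted: enlarge the BRAM budget")
        return self.free.pop()

    def release(self, slot):
        self.free.append(slot)

pool = BufferPool(n_slots=4, poly_len=8192)
a = pool.acquire()      # e.g. used by an NTT inside a KeySwitch operation
pool.release(a)         # freed once that operation finishes
b = pool.acquire()      # the next homomorphic operation reuses the same slot
assert a is b           # the same on-chip buffer is multiplexed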
The results of the homomorphic operator optimization technique are shown in fig. 9. Through operator optimization and resource multiplexing, the DSP resources available to each individual layer are increased, so the homomorphic operation parallelism rises; combined with the on-chip buffer multiplexing technique, the inference time of every layer is reduced, verifying the optimization effect of these techniques.
In addition, the acceleration optimization framework adopts high-level synthesis (HLS) technology, which is flexible, easy to program and short in development time; combined with the framework's self-configurable programming flexibility, design space exploration can be carried out for FPGA development boards with different resources to find the optimal design. We experimented with this: fig. 10 shows the designs generated by the optimization framework with the number of BRAMs ranging from 350 to 1500, where each point represents a configuration and the configuration points on the red line reach Pareto optimality. Few optimal configuration points are available when the number of BRAM resources is low, since some on-chip BRAM is still needed to store temporary data even at the lowest parallelism.
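A minimal sketch of how such a performance-resource Pareto frontier can be extracted from candidate configurations is shown below; the (BRAM, latency) pairs are placeholder values, not measured data:

def pareto_front(configs):
    # configs: list of (bram_used, latency) pairs; keep points not dominated
    # by any configuration that uses no more BRAM and is strictly faster
    front = []
    for c in sorted(configs):                    # sort by BRAM, then latency
        if not front or c[1] < front[-1][1]:     # strictly better latency so far
            front.append(c)
    return front

candidates = [(350, 0.95), (500, 0.60), (600, 0.70),
              (900, 0.31), (1200, 0.24), (1500, 0.24)]
print(pareto_front(candidates))   # -> [(350, 0.95), (500, 0.6), (900, 0.31), (1200, 0.24)]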
Experiment 2:
This experiment uses the Lola-Cifar network to test the effect of the FPGA accelerator design method based on fully homomorphic encryption neural network inference. Lola-Cifar is also a five-layer network; compared with Lola-MNIST, its weights are larger in scale, it requires more homomorphic operations, and its input data are larger. The network is described in Table 4:
TABLE 4 Lola-Cifar network description
[table provided as an image in the original publication]
After combination with homomorphic encryption, the homomorphic operation counts contained in each layer of the obtained FHE-CNN, together with the final inference accuracy and the model data size, are shown in Table 5:
TABLE 5 Lola-Cifar network basic information
[table provided as an image in the original publication]
The FPGA development boards used are, respectively, the ALINX ACU9EG (with a Xilinx Zynq UltraScale+ MPSoC XCZU9EG device) and the ALINX ACU15EG (with a Xilinx Zynq UltraScale+ MPSoC XCZU15EG device). The hardware optimization acceleration framework of the fully homomorphic encryption neural network automatically generates the optimal configuration solution, and the acceleration results are shown in Table 6:
TABLE 6 acceleration results
[table provided as an image in the original publication]
Here KS, λ, N and log Q respectively denote the number of KeySwitch operations, the security level, the homomorphic encryption parameter polynomial degree and the bit width of the homomorphic encryption ciphertext modulus. TDP denotes the power of the hardware platform used for each implementation. In actual on-board measurements, the final optimized design improves the inference speed by 13.49 times compared with the Lola scheme and the energy efficiency by 1187.12 times.
Experiments 1 and 2 demonstrate that, for multiple FHE-CNNs, the acceleration framework improves speed by more than 10 times and energy efficiency by more than 1000 times compared with CPU acceleration. In addition, the framework customizes configuration schemes for two different FPGA boards, generates the performance-resource Pareto-optimal solutions, and obtains the optimal configuration results. The framework realizes automatic generation of the optimal hardware resource configuration for a given neural network and a given FPGA development board, providing a solution for the optimal hardware deployment of FHE-CNNs.
Embodiment two:
This embodiment aims to provide a fully homomorphic encryption neural network reasoning acceleration system based on resource multiplexing.
A fully homomorphic encryption neural network reasoning acceleration system based on resource multiplexing, comprising:
the data acquisition unit is used for acquiring information of the fully homomorphic encryption neural network to be accelerated and hardware resource information of the FPGA;
the resource allocation unit is used for inputting the information of the fully homomorphic encryption neural network and the hardware resource information into a pre-constructed hardware resource allocation model to obtain an optimal hardware resource allocation scheme when the fully homomorphic encryption neural network performs operation processing in the FPGA;
the processing strategy of the hardware resource allocation model is as follows: parallel and pipeline optimization is carried out within each network layer for the fully homomorphic encryption operations and the fully homomorphic encryption neural network, and multiplexing of homomorphic basic operation modules is adopted between network layers; meanwhile, for the on-chip storage space of the FPGA, on-chip buffer multiplexing at different granularities is carried out based on the operation division of the fully homomorphic encryption neural network; finally, the optimal resource allocation scheme is obtained with the objective of minimizing the encrypted-data inference time of the fully homomorphic encryption neural network, wherein the operations of the fully homomorphic encryption neural network are divided by granularity into three levels: homomorphic basic operations, homomorphic operations and network layer operations.
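A non-authoritative structural sketch of these two units is given below; the class names, fields and the dummy allocation model are assumptions introduced only for illustration and do not define the patented system:

class DataAcquisitionUnit:
    def collect(self, network_info, fpga_info):
        # packs the FHE-CNN information and the FPGA resource information
        return {"network": network_info, "fpga": fpga_info}

class ResourceAllocationUnit:
    def __init__(self, allocation_model):
        self.model = allocation_model    # e.g. the ILP-based model of embodiment one

    def allocate(self, inputs):
        # the model is expected to return the configuration that minimizes
        # encrypted-data inference time under the DSP/BRAM constraints
        return self.model.solve(inputs)

class DummyModel:                        # stand-in for the real allocation model
    def solve(self, inputs):
        return {"ntt_parallelism": 4, "keyswitch_modules": 2}

unit_in  = DataAcquisitionUnit().collect({"N": 8192, "layers": 5},
                                         {"dsp": 2520, "bram": 912})
unit_out = ResourceAllocationUnit(DummyModel()).allocate(unit_in)
print(unit_out)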
Specifically, the system in this embodiment corresponds to the method in the first embodiment, and the technical details thereof have been described in the first embodiment, so that details are not repeated here.
In further embodiments, there is also provided:
An electronic device comprising a memory, a processor, and computer instructions stored in the memory and executable on the processor, which, when executed by the processor, perform the method of embodiment one. For brevity, the description is not repeated here.
It should be understood that in this embodiment the processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory may include read only memory and random access memory and provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store information of the device type.
A computer readable storage medium storing computer instructions which, when executed by a processor, perform the method of embodiment one.
The method in the first embodiment may be embodied directly as steps executed by a hardware processor, or executed by a combination of hardware and software modules in the processor. The software modules may reside in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers, or other storage media well known in the art. The storage medium is located in the memory, and the processor reads the information in the memory and, in combination with its hardware, performs the steps of the above method. To avoid repetition, a detailed description is not provided here.
Those of ordinary skill in the art will appreciate that the elements of the various examples described in connection with the present embodiments, i.e., the algorithm steps, can be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The fully homomorphic encryption neural network reasoning acceleration method and system based on resource multiplexing can be implemented as described above and have broad application prospects.
The foregoing description of the preferred embodiments of the present disclosure is provided for illustration only and is not intended to limit the disclosure; various modifications and changes may be made to the present disclosure by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present disclosure shall be included in the protection scope of the present disclosure.

Claims (10)

1. A fully homomorphic encryption neural network reasoning acceleration method based on resource multiplexing, characterized by comprising the following steps:
acquiring information of a fully homomorphic encryption neural network to be accelerated and hardware resource information of an FPGA;
inputting the information and the hardware resource information of the fully homomorphic encryption neural network into a pre-constructed hardware resource allocation model to obtain an optimal hardware resource allocation scheme when the fully homomorphic encryption neural network performs operation processing in an FPGA;
the processing strategy of the hardware resource allocation model is as follows: parallel and pipeline optimization is carried out within each network layer for the fully homomorphic encryption operations and the fully homomorphic encryption neural network, and multiplexing of homomorphic basic operation modules is adopted between network layers; meanwhile, for the on-chip storage space of the FPGA, on-chip buffer multiplexing at different granularities is carried out based on the operation division of the fully homomorphic encryption neural network; finally, the optimal resource allocation scheme is obtained with the objective of minimizing the encrypted-data inference time of the fully homomorphic encryption neural network, wherein the operations of the fully homomorphic encryption neural network are divided by granularity into three levels: homomorphic basic operations, homomorphic operations and network layer operations.
2. The fully homomorphic encryption neural network reasoning acceleration method based on resource multiplexing as claimed in claim 1, wherein the hardware resource allocation model is specifically expressed as follows:
minimize Σ_{lr ∈ Layer} LAt_lr
subject to Σ_{op ∈ OP} DSP_op ≤ DSP_max
Σ_{lr ∈ Layer} BRAM_lr ≤ BRAM_max
wherein LAt_lr is the running time of layer lr, Layer denotes the set of all layers contained in the fully homomorphic encryption neural network, lr denotes one layer, OP denotes the set of all homomorphic operations, op denotes one homomorphic operation, DSP_max denotes the number of DSPs owned by the FPGA development board, BRAM_max denotes the total number of BRAMs owned by the FPGA development board, DSP_op is the number of DSPs occupied by the homomorphic operation op, and BRAM_lr is the number of BRAMs used by the layer lr.
3. The fully homomorphic encryption neural network reasoning acceleration method based on resource multiplexing as claimed in claim 1, wherein each network layer operation comprises a plurality of homomorphic operations, each homomorphic operation comprises a plurality of homomorphic basic operations, wherein the operation of the fully homomorphic encryption neural network adopts a pipeline and parallel processing mode, and the homomorphic basic operations are used as units for pipelining.
4. The fully homomorphic encryption neural network reasoning acceleration method based on resource multiplexing as claimed in claim 1, wherein parallel and pipeline optimization is performed within each network layer for the fully homomorphic encryption operations and the fully homomorphic encryption neural network, and multiplexing of homomorphic basic operation modules is adopted between network layers, specifically: according to the order of homomorphic basic operation, homomorphic operation and network layer operation, the parallelism inside operations of different granularities is set respectively, so that the parallel effects of the operations of different granularities are superimposed;
or, the homomorphic basic operations are divided into two kinds according to whether they traverse the coefficients of the RNS polynomial for one round or multiple rounds, and different parallelism is set for the two kinds so that their running times are similar;
or, to realize pipelining across homomorphic operations within a network layer, the same parallelism is set for different homomorphic operations, and the homomorphic operations are divided into two types based on the complexity of the data dependency among RNS polynomials;
or, for different network layer operations, dividing the network layer operations into two categories according to whether the network layer operations comprise KeySwitch operations or not, and respectively carrying out pipeline design on the different categories.
5. The method for accelerating reasoning of the fully homomorphic encryption neural network based on resource multiplexing as claimed in claim 1, wherein the on-chip buffer multiplexing of the FPGA storage space at different granularities, based on the operation division of the fully homomorphic encryption neural network, is specifically: the space storing one RNS polynomial is taken as the storage unit of the on-chip buffer, and the on-chip buffers are divided into two types according to the classification of the homomorphic basic operations; wherein the buffer multiplexing comprises multiplexing within a homomorphic operation, multiplexing within a network layer, and multiplexing between network layers.
6. The method for accelerating reasoning of the fully homomorphic encryption neural network based on resource multiplexing as claimed in claim 1, wherein the homomorphic basic operations comprise: modular multiplication, modular addition, modular subtraction, modular reduction, and the fast number-theoretic transform and its inverse;
or, the homomorphic operations comprise plaintext addition, ciphertext multiplication, rescaling, relinearization and rotation;
or, the network layers comprise a homomorphic convolution layer, a homomorphic activation layer and a homomorphic fully connected layer.
7. The method for accelerating reasoning of the fully homomorphic encryption neural network based on resource multiplexing as claimed in claim 1, wherein the information of the fully homomorphic encryption neural network mainly comprises the homomorphic encryption parameter polynomial degree, and the hardware resource information of the FPGA mainly comprises the rated number of DSP resources and the number of BRAM resources.
8. A fully homomorphic encryption neural network reasoning acceleration system based on resource multiplexing, characterized by comprising:
the data acquisition unit is used for acquiring information of the fully homomorphic encryption neural network to be accelerated and hardware resource information of the FPGA;
the resource allocation unit is used for inputting the information of the fully homomorphic encryption neural network and the hardware resource information into a pre-constructed hardware resource allocation model to obtain an optimal hardware resource allocation scheme when the fully homomorphic encryption neural network performs operation processing in the FPGA;
the processing strategy of the hardware resource allocation model is as follows: parallel and pipeline optimization is carried out within each network layer for the fully homomorphic encryption operations and the fully homomorphic encryption neural network, and multiplexing of homomorphic basic operation modules is adopted between network layers; meanwhile, for the on-chip storage space of the FPGA, on-chip buffer multiplexing at different granularities is carried out based on the operation division of the fully homomorphic encryption neural network; finally, the optimal resource allocation scheme is obtained with the objective of minimizing the encrypted-data inference time of the fully homomorphic encryption neural network, wherein the operations of the fully homomorphic encryption neural network are divided by granularity into three levels: homomorphic basic operations, homomorphic operations and network layer operations.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the fully homomorphic encryption neural network reasoning acceleration method based on resource multiplexing as claimed in any one of claims 1-7.
10. A non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the fully homomorphic encryption neural network reasoning acceleration method based on resource multiplexing as claimed in any one of claims 1-7.
CN202310113879.XA 2023-02-14 2023-02-14 Fully homomorphic encryption neural network reasoning acceleration method and system based on resource multiplexing Pending CN116048811A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310113879.XA CN116048811A (en) 2023-02-14 2023-02-14 Fully homomorphic encryption neural network reasoning acceleration method and system based on resource multiplexing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310113879.XA CN116048811A (en) 2023-02-14 2023-02-14 Fully homomorphic encryption neural network reasoning acceleration method and system based on resource multiplexing

Publications (1)

Publication Number Publication Date
CN116048811A true CN116048811A (en) 2023-05-02

Family

ID=86116433

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310113879.XA Pending CN116048811A (en) 2023-02-14 2023-02-14 Fully homomorphic encryption neural network reasoning acceleration method and system based on resource multiplexing

Country Status (1)

Country Link
CN (1) CN116048811A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102616119B1 (en) * 2023-08-10 2023-12-21 한국과학기술원 Hardware architecture for accelerating torus fully homomorphic encryption(tfhe) with streaming core and folded fully pipelined fft
CN117114055A (en) * 2023-10-24 2023-11-24 北京航空航天大学 FPGA binary neural network acceleration method for industrial application scene
CN117114055B (en) * 2023-10-24 2024-04-09 北京航空航天大学 FPGA binary neural network acceleration method for industrial application scene
CN117910026A (en) * 2024-03-20 2024-04-19 泉城省实验室 Acceleration method of fully homomorphic encryption neural network based on coarse-granularity reconfigurable hardware

Similar Documents

Publication Publication Date Title
CN116048811A (en) Fully homomorphic encryption neural network reasoning acceleration method and system based on resource multiplexing
Samardzic et al. F1: A fast and programmable accelerator for fully homomorphic encryption
Dave et al. RAMP: Resource-aware mapping for CGRAs
Pinar et al. On smoothing exact penalty functions for convex constrained optimization
Salmon et al. Parallel random numbers: as easy as 1, 2, 3
Hermans et al. Speed records for NTRU
Pradhan et al. Finding all-pairs shortest path for a large-scale transportation network using parallel Floyd-Warshall and parallel Dijkstra algorithms
Jang et al. Quantum analysis of AES
CN105335331B (en) A kind of SHA256 realization method and systems based on extensive coarseness reconfigurable processor
Yin et al. Memory-aware loop mapping on coarse-grained reconfigurable architectures
Feldmann et al. F1: A fast and programmable accelerator for fully homomorphic encryption (extended version)
CN115828831B (en) Multi-core-chip operator placement strategy generation method based on deep reinforcement learning
Cathébras et al. Data flow oriented hardware design of RNS-based polynomial multiplication for SHE acceleration
Zhu et al. FxHENN: FPGA-based acceleration framework for homomorphic encrypted CNN inference
Bos et al. Dilithium for memory constrained devices
Mkhinini et al. HLS design of a hardware accelerator for homomorphic encryption
CN106155979B (en) A kind of DES algorithm secret key expansion system and extended method based on coarseness reconstruction structure
US20220164189A1 (en) Systems and methods for improved mapping of computational loops on reconfigurable architectures
Seo et al. Pseudo random number generator and hash function for embedded microprocessors
Cui et al. High-speed elliptic curve cryptography on the NVIDIA GT200 graphics processing unit
Lesparre et al. Evaluation of synchronous dataflow graph mappings onto distributed memory architectures
Neda et al. CiFlow: Dataflow Analysis and Optimization of Key Switching for Homomorphic Encryption
Biswas et al. Design space exploration of systolic realization of QR factorization on a runtime reconfigurable platform
Wang et al. Efficient GPU implementations of post-quantum signature XMSS
Yin et al. Exact memory-and communication-aware scheduling of dnns on pipelined edge tpus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination