CN116049907A

CN116049907A - Paillier homomorphic encryption processor and processing method thereof

Info

Publication number: CN116049907A
Application number: CN202310109089.4A
Authority: CN
Inventors: 伍毅夫; 石贵铭; 张武科
Original assignee: Arctic Xiongxin Information Technology Tianjin Co ltd
Current assignee: Arctic Xiongxin Information Technology Tianjin Co ltd
Priority date: 2023-02-13
Filing date: 2023-02-13
Publication date: 2023-05-02

Abstract

The invention discloses a Paillier homomorphic encryption processor and a processing method thereof, which belong to the technical field of chip design. The processor comprises a top controller, a global memory system and a plurality of ciphertext processing elements with bit-serial reconfigurable data streams; the top controller is used for controlling the data flow inside the chip and the calculation of the ciphertext processing elements; the global memory system is used for storing input and output data required for internal computation of the chip; the ciphertext processing elements are used for calculating and processing the modular operations required by the ciphertext; the top controller controls the global memory system to exchange data with the plurality of ciphertext processing elements via a bus. The invention is an ASIC-based chip dedicated to accelerating the Paillier homomorphic encryption algorithm, and improves computing efficiency and performance by supporting a bit-stream sparse computing mode in the architecture.

Description

Paillier homomorphic encryption processor and processing method thereof

Technical Field

The invention belongs to the technical field of chip design, and particularly relates to a Paillier homomorphic encryption processor.

Background

Cloud computing is currently at the center of a large number of emerging information applications that provide a variety of reliable high-performance services based on a large amount of personal and institutional data. Paillier Homomorphic Encryption (PHE) is one of the key privacy computing technologies, and gives ciphertext the same utility as plaintext. In the partially homomorphic encryption scheme shown in fig. 1 (a), the client encrypts its plaintext into ciphertext and sends it to the server, which performs homomorphic evaluation and returns the encrypted result to the client.

However, this process avoids the leakage of plaintext at the cost of several computational overheads, and the present invention enumerates 3 major challenges in FIG. 1 (b). First, ciphertext-domain computation requires costly large-integer modular arithmetic operations (e.g., modular multiplication (ModMul), modular inversion (ModInv), and modular exponentiation (ModExp)) whose energy and delay are several orders of magnitude higher. Second, the client performs encryption and decryption, which feature independent vector operations, while the server inevitably performs evaluation using Multiply and Accumulate (MAC) operations. Third, the diversity of tasks requires computational scalability to meet latency and throughput requirements in the cloud.

The Paillier homomorphic encryption algorithm currently performs poorly on general-purpose processors (CPUs) and graphics processors (GPUs). As shown in the left half of fig. 1 (a), homomorphic encryption means that performing a corresponding operation on ciphertext yields a result equivalent to performing the original operation on plaintext. Specifically, addition on plaintext corresponds to modular multiplication on ciphertext, and multiplication of plaintext by a constant corresponds to modular exponentiation on ciphertext.
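The correspondence between plaintext and ciphertext operations can be illustrated with a minimal software sketch. It is not part of the disclosed hardware: the function names, the choice g = n + 1 and the tiny primes are assumptions made only for illustration, and the key size is far too small to be secure.

```python
import math
import random


def keygen(p, q):
    """Toy Paillier key pair from two small primes (illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    g = n + 1                                          # assumed generator choice
    mu = pow(lam, -1, n)                               # valid for g = n + 1
    return (n, g), (lam, mu)


def encrypt(pub, m, rng):
    """c = g^m * r^n mod n^2 with a random r coprime to n."""
    n, g = pub
    n2 = n * n
    r = rng.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = rng.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2


def decrypt(pub, priv, c):
    """m = L(c^lam mod n^2) * mu mod n, where L(u) = (u - 1) / n."""
    n, _ = pub
    lam, mu = priv
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n


if __name__ == "__main__":
    rng = random.Random(0)
    pub, priv = keygen(61, 53)
    n2 = pub[0] ** 2
    c1, c2 = encrypt(pub, 17, rng), encrypt(pub, 25, rng)
    # Plaintext addition is ciphertext modular multiplication.
    print(decrypt(pub, priv, (c1 * c2) % n2))
    # Plaintext multiplication by a constant is ciphertext modular exponentiation.
    print(decrypt(pub, priv, pow(c1, 6, n2)))
```

In this sketch the server-side evaluation needs only modular multiplication and modular exponentiation modulo n², which are the operations the processor accelerates.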

In summary, the prior art uses specific circuits to implement modular exponentiation and modular multiplication operations, but does not implement the related schemes of bit stream sparse operations.

Disclosure of Invention

The invention aims to provide a Paillier homomorphic encryption processor and a processing method thereof.

In order to achieve the purpose, the invention is realized by adopting the following technical scheme:

a Paillier homomorphic encryption processor comprising: a top controller, a global memory system, and a plurality of ciphertext processing elements having a bit-serial reconfigurable data stream;

the top controller is used for controlling the data flow inside the chip and the calculation of the ciphertext processing element;

a global memory system for storing input and output data required for internal computation of the chip;

the ciphertext processing element is used for calculating and processing the modulo operation required by the ciphertext;

the top controller controls the global memory system to interact data with the plurality of ciphertext processing elements via the bus.

As a further improvement of the present invention, each ciphertext processing element includes a weight buffer, an input buffer, an output buffer, a look-up table unit for bit-serial processing of partial-sum (Psum) terms, a Montgomery unit, and an extended Euclidean unit;

the weight buffer is used for storing the weight multiplied by the ciphertext;

the input buffer is used for storing the calculation ciphertext;

the output buffer is used for storing the calculation result;

the look-up table unit for bit-serial processing of partial-sum terms is used for storing intermediate calculation results of the ciphertext;

the Montgomery unit performs Montgomery modular multiplication for the Paillier algorithm, which may be organized as modular multiplication and modular exponentiation;

the extended Euclidean unit is configured to use the extended Euclidean algorithm for modular inversion.

As a further refinement of the present invention, the Montgomery unit includes a plurality of multipliers for computationally intensive Montgomery, wherein a first stage is 1 multiplier for updating a result of a first stage of the Montgomery algorithm and a second stage is 2 multipliers for updating a result of a second stage of the Montgomery algorithm.

As a further refinement of the present invention, the multipliers carry out a two-stage Montgomery operation: the first-stage Montgomery operation is used for updating the first-stage parameter value and the associated intermediate variables for the second stage; the second-stage Montgomery operation is used for performing large integer multiplications on intermediate Montgomery results.

As a further development of the invention, the extended Euclidean unit comprises a local register file for the control-intensive extended Euclidean algorithm, a lightweight branch detector and an execution unit; the execution unit includes 3 adders and a shifter unit for executing the operation of each branch in one cycle.

As a further development of the invention, the extended Euclidean unit is also arranged to forward the update result to the lightweight branch detector before forwarding it to the pipeline register in each cycle.

A method of processing a Paillier homomorphic encryption processor, comprising: an n-level bit serial mode for multiply-accumulate operations and a pipeline mode for conventional modulo operations; the data flow in the two modes is reconfigured by the computing unit controller, and the same resource set is reused;

the Paillier homomorphic encryption algorithm adopts bit-stream sparse calculation; in the bit-stream sparse calculation part, the bit sparsity of the weights is exploited by multiplying the ciphertext and the weights bit by bit, and the bits are then aligned in the partial sum (Psum).

As a further improvement of the present invention, the n-level bit-serial mode exploits the bit sparsity of the weights via a bit-1 selector, and comprises the following steps:

S1, splitting the partial sum (Psum) into a plurality of registers, wherein the registers correspond respectively to the m bits of the weight; for each weight, the selector searches for the positions of the 1 bits to indicate an update of the corresponding Psum terms;

S2, aligning the partial-sum terms by performing Montgomery modular multiplication a plurality of times according to their bit positions;

S3, further processing the negative-sign term of the partial sum by one extended Euclidean operation and two Montgomery modular multiplications;

S4, accumulating all the partial-sum terms and obtaining a final result.

As a further refinement of the present invention, the pipeline mode is reconfigured to perform the basic modular operation of Paillier, including:

wherein modular multiplication and modular exponentiation are achieved by repeatedly invoking the Montgomery modular multiplication unit, while the extended Euclidean unit achieves modular inversion; in the case of modular exponentiation, the corresponding weight reaches the set bit width, so a fast exponentiation algorithm is introduced by reusing the bit-1 selector of the n-level bit-serial mode to skip sparse bits in the same way.

As a further improvement of the present invention, the bit stream thinning calculation includes:

S10, the computation is ciphertext 1 × 0b00001011 + ciphertext 2 × 0b10111100;

firstly, the ciphertexts and the weights are multiplied bit by bit; specifically, ciphertext 1 is Montgomery-modular-multiplied into partial sums 0, 1 and 3, and ciphertext 2 is Montgomery-modular-multiplied into partial sums 2, 3, 4, 5 and 7;

S20, aligning partial sums 0 to 7 by bit position, wherein partial sum i undergoes i Montgomery modular multiplications;

S30, carrying out inverse processing on partial sum 7;

and S40, summing all partial sums.

Compared with the prior art, the invention has the following beneficial effects:

The invention designs an ASIC-based acceleration chip dedicated to the Paillier homomorphic encryption algorithm, and improves computing efficiency and performance by supporting a bit-stream sparse computing mode in the architecture. The Montgomery arithmetic unit uses wide (high-bit-width) multipliers to accelerate high-bit-width multiplication, and the modular inverse arithmetic unit forwards intermediate results so that the utilization of the arithmetic unit reaches 100%. The energy efficiency of the accelerator of the present invention exceeds that of the CPU by several orders of magnitude. For bit-sparsity efficiency, the present invention evaluates four data patterns with different sparsity rates. The results show that skipping 0 bits achieves a 3.3-fold acceleration at 75% sparsity. FIG. 6 also shows a comparison table with previous work on custom hardware designs for homomorphic encryption. The Paillier accelerator (PH-EPU) of the present invention supports all types of Homomorphic Encryption (HE) tasks (encryption, decryption, and evaluation) with a high degree of flexibility. It also supports dynamic bit widths for ciphertext and weights while using bit-serial sparsity to optimize performance. The throughput of typical Paillier operations reaches 23 to 68 MOPS (millions of operations per second) on Montgomery modular multiplication (Mont) and 38 KOPS (thousands of operations per second) on modular inversion (ModInv), with efficiencies of 0.18 to 0.52 μJ/Op (microjoules per operation) and 105.3 μJ/Op, respectively. Throughput for Montgomery modular multiplication (Mont) is 115-fold to 340-fold higher and energy efficiency is 30.4-fold to 87.8-fold higher than current Paillier processors.

Further, in the bit-stream sparse calculation part, the bit sparsity of the weights is exploited by first multiplying the ciphertext and the weights bit by bit and then aligning the bits in the partial sum (Psum); since each bit position is processed uniformly, a large amount of per-bit computation is saved.

Furthermore, the invention also establishes a system level experiment flow to simulate the complete processing flow between the client and the server. The runtime library loads the input data and execution instructions onto the card through the host CPU.

Drawings

Fig. 1 shows a PHE (partially homomorphic encryption) scheme in the prior art, where (a) is the homomorphic encryption scheme and (b) is the main challenge of the prior art.

Fig. 2 shows the overall architecture;

Fig. 3 details the implementation of the MU (Montgomery modular multiplication unit) and the SU (extended Euclidean unit);

Fig. 4 depicts the data flow inside the BSRD-PE (ciphertext processing element with a bit-serial reconfigurable data stream) for two typical computation modes of Paillier;

Fig. 5 illustrates an efficient task deployment flow on the PE (computing unit) and the chip;

Fig. 6 illustrates the efficiency of the key basic operations (i.e., Mont and ModInv) and of the bit-serial optimization;

Fig. 7 shows a micrograph and the characteristics of both the chip and the PCIe card.

Detailed Description

In order that those skilled in the art will better understand the present invention, a technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.

It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

As shown in fig. 2, a first object of the present invention is to provide a Paillier homomorphic encryption processor, including: a top controller, a global memory system, and a plurality of ciphertext processing elements having a bit-serial reconfigurable data stream;

a global memory system for storing the chip internal computing input and output data;

and the ciphertext processing element is used for calculating and processing the modulo operation required by the ciphertext.

The top controller controls the global memory system to exchange data with the plurality of ciphertext processing elements via the bus. This embodiment of the invention uses a 256-bit bus for the data exchange.

Specifically, the high performance Paillier homomorphic encryption processor Unit (PH-EPU) of the present invention comprises:

1) Two efficient modular processing units (i.e., a Montgomery modular multiplication unit and an extended Euclidean unit) with dynamic bit widths for basic Paillier homomorphic operations;

2) Efficient ciphertext processing elements featuring a reconfigurable data stream that uses bit-serial computation for different task modes;

3) Instruction-based control for scalable computation flows across workloads of various sizes.

As a specific example, the present invention provides a 28 nm processor that occupies 42.96 mm², operates at 500 MHz and 0.9 V, and supports Paillier encryption, decryption, and homomorphic evaluation. The PH-EPU of the present embodiment achieves up to 14.9 times acceleration compared with an Intel Core i9-9900 desktop CPU with 16 cores, and the PCIe card system of the present embodiment with 8 chips achieves latency as low as 1/22.8 of that of an Intel Xeon Platinum 8260M server CPU with 192 cores.

FIG. 2 is a diagram of an overall architecture of a chip according to an embodiment of the invention, including a top controller, a global memory system with 3 registers, and 16 ciphertext processing elements (BSRD-PE) with bit-serial reconfigurable data streams. Wherein the top controller manages data interactions between the input/output buffers and the 16 BSRD-PEs via the 256-bit bus.

In the present application, each ciphertext processing element comprises a weight buffer, an input buffer, an output buffer, a look-up table unit for bit-serial processing of partial-sum (Psum) terms, a Montgomery unit and an extended Euclidean unit;

the weight buffer is used for storing the weight multiplied by the ciphertext;

the input buffer is used for storing the calculation ciphertext;

the output buffer is used for storing the calculation result;

the Montgomery unit performs Montgomery modular multiplication for the Paillier algorithm, which may be organized as modular multiplication and modular exponentiation; the extended Euclidean unit is configured to use the extended Euclidean algorithm for modular inversion.

In a specific example, each BSRD-PE includes a 9 KB weight buffer, an 8 KB input buffer, a 1 KB output buffer, a look-up table of partial-sum (Psum) terms for bit-serial processing, a Montgomery modular multiplication unit (MU), and an extended Euclidean unit (SU). The MU performs Montgomery modular multiplication (Mont) for the Paillier algorithm, which may be organized as modular multiplication and modular exponentiation.

The Montgomery unit includes a plurality of multipliers for computationally intensive Montgomery, wherein the first stage is 1 multiplier for updating the results of the first stage of the Montgomery algorithm and the second stage is 2 multipliers for updating the results of the second stage of the Montgomery algorithm.

Wherein the multiplier employs two stages of Montgomery operations, a first stage of Montgomery operations for updating a first stage parameter value and related intermediate variables for a second stage; the second stage Montgomery operations are used to perform large integer multiplications of intermediate Montgomery results.

Also, the extended Euclidean unit presented herein contains a local register file for the control-intensive extended Euclidean algorithm, a lightweight branch detector, and an execution unit; the execution unit includes 3 adders and a shifter unit for executing the operation of each branch in one cycle.

The SU uses the extended Euclidean modular inversion algorithm (Stein) for modular inversion. These two basic execution units address the first challenge, modular arithmetic, via dedicated arithmetic hardware. However, using such a generic modular exponentiation data flow to perform the associated multiply-accumulate operations in the evaluation phase is inefficient, so the present invention additionally attaches a partial-sum (Psum) term look-up table (LUT) coupled to a bit-1 selector in order to exploit bit-level sparsity. The entire data path helps to efficiently handle the diverse computing modes of the second challenge. The scale-out computation required by the third challenge is achieved by an instruction-based control flow combined with a system-level automated deployment flow.

FIG. 3 details the implementation of the MU (Montgomery arithmetic unit) and the SU (modular inverse arithmetic unit). The MU mainly includes three 256-bit multipliers for the computationally intensive Mont operation. The two-stage Montgomery modular multiplication consists of several similar iterations, the number of which depends on the width (B) of the multiplier.

In addition, the extended Euclidean unit forwards the update results to the lightweight branch detector in each cycle before forwarding them to the pipeline registers.

Four alternatives for B are evaluated, and B = 256 bits is chosen because it achieves the highest cost efficiency. Thus, based on 256-bit operands, the first stage updates the first-stage parameter (q) value and the associated intermediate variables for the second stage. The second stage performs a large integer multiplication (4096 b × 256 b) for the intermediate Montgomery result, which takes 16 cycles using a 256-bit multiplier. The SU contains a local register file for the control-intensive extended Euclidean algorithm, a lightweight Branch Detector (BD) and an Execution Unit (EU). Only the key bits for branch detection are input to the BD to eliminate redundant outputs, while the EU contains three 4102-bit adders and one shifter unit to perform the operation of each branch in one cycle.

The Montgomery unit comprises a plurality of multipliers for the computationally intensive Montgomery operation, wherein the first stage uses one 256-bit multiplier for updating the results of the first stage of the Montgomery algorithm and the second stage uses two 256-bit multipliers for updating the results of the second stage of the Montgomery algorithm.
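The two-stage iteration can be sketched in software as a word-serial Montgomery multiplication. This is one plausible reading of the description and not the disclosed circuit: the function name `mont_mul`, its parameters and the mapping of the two wide products onto the two second-stage multipliers are assumptions. Python integers stand in for the 4096-bit registers.

```python
def mont_mul(a, b, n, word_bits=256, num_words=16):
    """Return a * b * R^-1 mod n with R = 2^(word_bits * num_words).

    Requires an odd modulus n < R and operands a, b < n.
    """
    if n % 2 == 0 or n >= 1 << (word_bits * num_words):
        raise ValueError("n must be odd and smaller than R")
    mask = (1 << word_bits) - 1
    n_prime = (-pow(n, -1, 1 << word_bits)) & mask  # -n^-1 mod 2^B
    t = 0
    for i in range(num_words):
        a_i = (a >> (i * word_bits)) & mask         # next B-bit word of a
        # Stage 1: one B x B multiplication chain on the low words yields q.
        t0 = ((t & mask) + a_i * (b & mask)) & mask
        q = (t0 * n_prime) & mask
        # Stage 2: two wide-by-B-bit products (assumed to map onto the two
        # second-stage multipliers); the sum is divisible by 2^B by construction.
        t = (t + a_i * b + q * n) >> word_bits
    return t - n if t >= n else t                   # single final correction
```

With `word_bits=256` and a 4096-bit operand, each wide product in stage 2 is a 4096 b × 256 b multiplication, which a single 256-bit multiplier would process in 4096 / 256 = 16 word steps, consistent with the 16 cycles stated above.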

To avoid pipeline bubbles caused by branch detection, the scheme also forwards the update results to the BD stage in each cycle before forwarding them to the pipeline registers, which keeps the SU fully utilized.
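A software sketch of a binary (Stein-style) extended Euclidean inversion shows why the unit is control-intensive yet needs only adders and a shifter: every branch is a shift, an addition or a subtraction. The function name and the exact branch structure are assumptions for illustration; the branch set of the actual SU is not specified here.

```python
import math


def mod_inverse_binary(a, n):
    """Return a^-1 mod n for odd n > 1 using only shifts, adds and subtracts."""
    if n <= 1 or n % 2 == 0:
        raise ValueError("n must be an odd integer greater than 1")
    if math.gcd(a, n) != 1:
        raise ValueError("a is not invertible modulo n")
    u, v = a % n, n
    x1, x2 = 1, 0  # invariants: x1 * a = u (mod n), x2 * a = v (mod n)
    while u != 1 and v != 1:
        while u % 2 == 0:                            # branch: halve u
            u >>= 1
            x1 = x1 >> 1 if x1 % 2 == 0 else (x1 + n) >> 1
        while v % 2 == 0:                            # branch: halve v
            v >>= 1
            x2 = x2 >> 1 if x2 % 2 == 0 else (x2 + n) >> 1
        if u >= v:                                   # branch: subtract smaller
            u -= v
            x1 -= x2
            if x1 < 0:
                x1 += n
        else:
            v -= u
            x2 -= x1
            if x2 < 0:
                x2 += n
    return x1 if u == 1 else x2
```

Each loop iteration depends on the parity and ordering of the values just updated, which is the data dependence that the forwarding path to the branch detector is meant to hide.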

This embodiment of the invention provides dedicated modular arithmetic units, wherein the Montgomery arithmetic unit uses 256-bit wide multipliers to accelerate high-bit-width multiplication, and the modular inverse arithmetic unit forwards intermediate results so that the utilization of the arithmetic unit reaches 100%.

In the bit-stream sparse calculation part, the bit sparsity of the weights is exploited by first multiplying the ciphertext and the weights bit by bit and then aligning the bits in the partial sum; since the invention processes each bit position uniformly, a large amount of per-bit computation is saved.

In summary, the invention covers the Paillier chip architecture, the efficient computing units for Montgomery and modular inverse operations, and the bit-stream sparse computing architecture within the computing units.

The invention also provides a processing method of the Paillier homomorphic encryption processor, which comprises the following steps: an n-level bit serial mode for multiply-accumulate operations and a pipeline mode for conventional modulo operations; the data flow in the two modes is reconfigured by the computing unit controller, and the same resource set is reused;

As depicted in fig. 4, the data flow inside the bit-stream sparse computation unit supports the following two typical computation modes of Paillier: a 4-level bit-serial mode for the associated multiply-accumulate operations, and a classical pipeline mode for regular modular operations.

The data flows in both modes are reconfigured by the compute unit controller, reusing the same set of resources.

Wherein the n-level bit-serial mode exploits the bit sparsity of the weights via the bit-1 selector; comprising

S4, accumulating all the partial-sum terms and obtaining a final result.

The specific scheme is as follows:

Mode 1 (4-level bit-serial mode) exploits the bit sparsity of the weights via a bit-1 selector to improve performance.

Step 1: the partial sum (Psum) is decoupled into 8 terms, each holding the accumulation result for one of the 8 bits of the weights. For each weight, the selector searches for the positions of the 1 bits to indicate an update of the corresponding Psum terms.

Step 2: after all weights have been processed, the Psum terms are aligned by performing Montgomery modular multiplication several times according to their bit positions, because step 1 does not take the magnitude associated with each bit position into account.

Step 3: the negative-sign term of Psum (i.e., Psum 7) is further processed by one extended Euclidean operation and two Montgomery modular multiplications.

Step 4: all Psum terms are accumulated to give the final result.

Wherein the pipeline mode is reconfigured to perform the Paillier's basic modular operations, including:

For example, mode 2 is reconfigured to perform the basic modular operations of Paillier, in which modular multiplication and modular exponentiation are achieved by repeatedly invoking the MU, while the SU achieves modular inversion. In the case of modular exponentiation, the corresponding weight reaches 4096 b, so a fast exponentiation algorithm is introduced that skips sparse bits in the same way by reusing the bit-1 selector of mode 1 (4-level bit-serial mode).
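The effect of skipping sparse bits in modular exponentiation can be sketched with a standard square-and-multiply loop in which a multiplication is issued only for exponent bits equal to 1. The function name and the returned multiplication counter are illustrative assumptions; plain modular arithmetic is used in place of Montgomery form.

```python
def modexp_skip_zero_bits(base, exponent, modulus):
    """Right-to-left square-and-multiply; multiplies only on 1 bits.

    Returns (base^exponent mod modulus, number of multiplications issued).
    """
    result = 1 % modulus
    power = base % modulus          # holds base^(2^i) at bit position i
    multiplies = 0
    while exponent:
        if exponent & 1:            # bit is 1: update the result
            result = result * power % modulus
            multiplies += 1
        exponent >>= 1              # bit is 0: nothing to multiply, skip it
        if exponent:
            power = power * power % modulus
    return result, multiplies
```

For an exponent such as 0b10000001, only two multiplications are issued in addition to the squarings, so the cost falls with the number of 0 bits in the weight.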

The specific bit stream sparse calculation mode is shown in the mode 1 of fig. 4, and includes:

first step: for ciphertext 1 × 0b00001011 + ciphertext 2 × 0b10111100, the ciphertexts and the weights are multiplied bit by bit;

second step: partial sums 0 to 7 are aligned by bit position, namely, partial sum i undergoes i Montgomery modular multiplications;

third step: inverse processing is carried out on partial sum 7;

fourth step: all partial sums are summed.
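The four steps can be traced in a minimal software sketch. Several points are assumptions rather than statements of the disclosed design: weights are treated as 8-bit two's-complement values (so that bit 7 is the negative-sign term), alignment of partial sum i is modelled as i modular squarings, plain modular arithmetic replaces Montgomery form, and the function name is hypothetical. In the ciphertext domain, "summing" the partial sums is carried out as modular multiplication.

```python
def bit_serial_mac(ciphertexts, weights, modulus, weight_bits=8):
    """Evaluate prod(c_j ^ w_j) mod modulus using per-bit partial sums.

    Returns (result, number of Psum updates issued in step 1).
    """
    psum = [1] * weight_bits
    updates = 0
    # Step 1: for each weight, visit only the positions of its 1 bits.
    for c, w in zip(ciphertexts, weights):
        bits = w & ((1 << weight_bits) - 1)      # two's-complement pattern
        while bits:
            i = (bits & -bits).bit_length() - 1  # lowest remaining 1 bit
            psum[i] = psum[i] * c % modulus
            updates += 1
            bits &= bits - 1                     # clear that bit
    # Step 2: align Psum i with i squarings, giving psum[i] ^ (2 ^ i).
    for i in range(weight_bits):
        for _ in range(i):
            psum[i] = psum[i] * psum[i] % modulus
    # Step 3: the sign-bit term carries a negative weight, so invert it.
    psum[-1] = pow(psum[-1], -1, modulus)
    # Step 4: accumulate all aligned terms.
    result = 1
    for term in psum:
        result = result * term % modulus
    return result, updates
```

For the weights 0b00001011 and 0b10111100, step 1 issues 3 + 5 = 8 partial-sum updates, touching partial sums 0, 1, 3 and 2, 3, 4, 5, 7 as in the example above, instead of one update per weight bit.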

FIG. 5 illustrates an efficient task deployment flow on the PE and the chip. The compiler performs task specification, memory allocation, and instruction generation to generate runtime libraries and instructions for a full-height, full-length (FHFL) PCIe card that integrates a host FPGA and 8 PH-EPU chips.

In general, for convolution, tasks at the chip level are assigned with priority along the output channel so that ciphertext can be broadcast and bandwidth saved, because the bandwidth requirement of ciphertext is much greater than that of the plaintext weights. At the PE level, the invention uses a search-optimized partitioning scheme to determine the parallelism along the height (ParaH), width (ParaW) and channel (ParaK) of the 3D output feature map over the 16 PEs. The invention also establishes a system-level experiment flow to simulate the complete processing flow between the client and the server. The runtime library loads the input data and execution instructions onto the card through the host CPU.
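A PE-level partition search of this kind can be sketched as an exhaustive enumeration of (ParaH, ParaW, ParaK) factorizations of the PE count. The cost model used here, the largest output tile assigned to any single PE, is a hypothetical stand-in chosen for illustration; the actual search objective is not specified in this description.

```python
from itertools import product
from math import ceil


def best_partition(height, width, channels, num_pes=16):
    """Return ((ParaH, ParaW, ParaK), per-PE tile size) minimizing the tile.

    Only factorizations with ParaH * ParaW * ParaK == num_pes are considered.
    """
    best = None
    for para_h, para_w in product(range(1, num_pes + 1), repeat=2):
        if num_pes % (para_h * para_w):
            continue                              # not a factorization
        para_k = num_pes // (para_h * para_w)
        # Hypothetical cost: output elements handled by the busiest PE.
        tile = (ceil(height / para_h)
                * ceil(width / para_w)
                * ceil(channels / para_k))
        candidate = (tile, (para_h, para_w, para_k))
        if best is None or candidate < best:
            best = candidate
    return best[1], best[0]
```

For example, `best_partition(8, 8, 4)` selects a split whose busiest PE handles 16 of the 256 output elements, that is, a perfectly balanced assignment under this cost model.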

To illustrate the scalability of the system of the present invention, the present invention evaluates a series of tasks with different workload amounts. End-to-end evaluation results including both software time and hardware time in addition to communication time between the client and server show that the PH-EPU chip and 8-chip card of the present invention are significantly better than desktop and server CPUs and exhibit better scalability.

FIG. 6 first illustrates the key basic operations (i.e., montgomery modular multiplication and modular inversion) and the efficiency of bit-serial optimization. The energy efficiency of the accelerator of the present invention exceeds the energy efficiency of the CPU by several orders of magnitude. For bit sparsity efficiency, the present invention evaluates four data patterns of different sparsity rates.

The results show that the effective use of a 0-bit skip achieves a 3.3-fold acceleration at 75% sparsity. The figure also shows a comparison table with previous work on custom hardware designs for homomorphic encryption.

It follows that the PH-EPU of the present invention supports all types of homomorphic encryption tasks (encryption, decryption and evaluation) with a high degree of flexibility. It also supports dynamic bit width for ciphertext and weights while using bit-serial sparsity for optimizing performance. The throughput of the typical operation in Paillier reaches 23 to 68MOPS on Montgomery modular multiplication calculations and 38KOPS on modular inverse calculations, with efficiencies of 0.18 to 0.52 μJ/Op and 105.3 μJ/Op, respectively. Throughput for Montgomery modular multiplication is 115-fold to 340-fold higher and energy efficiency is 30.4-fold to 87.8-fold higher than current Paillier processors.

Fig. 7 shows a micrograph and the characteristics of both the chip and the PCIe card. As can be seen, the chip was fabricated in a UMC 28 nm process with an area of 42.96 mm². The operating frequency is 0.5-500 MHz. The supported data bit widths are: ciphertext bit widths of 1024, 2048 and 4096 bits; weight bit widths of 2, 4 and 8 bits (bit-stream sparse mode) and up to 4096 bits (modular exponentiation mode). The power consumption is 4-12 W. The throughput of a typical operation in Paillier reaches 23 to 68 MOPS on Montgomery modular multiplication. The PCIe board carries 8 chips and provides 120 W of power, and its throughput reaches 181 to 355 MOPS on Montgomery modular multiplication. The DDR3, off-chip and PCIe bandwidths are 100 Gb/s, 12.5 Gb/s and 32 Gb/s, respectively.

It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Finally, it should be noted that: the above embodiments are only for illustrating the technical aspects of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the above embodiments, it should be understood by those of ordinary skill in the art that: modifications and equivalents may be made to the specific embodiments of the invention without departing from the spirit and scope of the invention, which is intended to be covered by the claims.

Claims

1. A Paillier homomorphic encryption processor comprising: a top controller, a global memory system, and a plurality of ciphertext processing elements having a bit-serial reconfigurable data stream;

2. The Paillier homomorphic encryption processor of claim 1, wherein,

each ciphertext processing element comprises a weight buffer, an input buffer, an output buffer, a look-up table unit for bit-serial processing of partial-sum (Psum) terms, a Montgomery unit and an extended Euclidean unit;

the weight buffer is used for storing the weight multiplied by the ciphertext;

the input buffer is used for storing the calculation ciphertext;

the output buffer is used for storing the calculation result;

3. The Paillier homomorphic encryption processor of claim 2, wherein,

the Montgomery unit comprises a plurality of multipliers for the computationally intensive Montgomery operation, wherein the first stage is 1 multiplier for updating the results of the first stage of the Montgomery algorithm and the second stage is 2 multipliers for updating the results of the second stage of the Montgomery algorithm.

4. The Paillier homomorphic encryption processor of claim 3, wherein,

the multiplier employs two stages of Montgomery operations, a first stage of Montgomery operations for updating a first stage parameter value and related intermediate variables for a second stage; the second stage Montgomery operations are used to perform large integer multiplications of intermediate Montgomery results.

5. The Paillier homomorphic encryption processor of claim 2, wherein,

the extended Euclidean unit comprises a local register file for the control-intensive extended Euclidean algorithm, a lightweight branch detector and an execution unit; the execution unit includes 3 adders and a shifter unit for executing the operation of each branch in one cycle.

6. The Paillier homomorphic encryption processor of claim 2, wherein,

the extended Euclidean unit is further arranged to forward the update result to the lightweight branch detector in each cycle before forwarding it to the pipeline register.

7. A method of processing a Paillier homomorphic encryption processor according to any one of claims 1 to 6, comprising: an n-level bit serial mode for multiply-accumulate operations and a pipeline mode for conventional modulo operations; the data flow in the two modes is reconfigured by the computing unit controller, and the same resource set is reused;

8. The processing method according to claim 7, wherein the n-level bit-serial mode exploits the bit sparsity of the weights via a bit-1 selector; comprising

S4, accumulating all the partial-sum terms and obtaining a final result.

9. The processing method of claim 8, wherein the pipeline mode is reconfigured to perform a basic modular operation of Paillier, comprising:

10. The processing method of claim 7, wherein the bit stream sparse computation comprises:

S10, ciphertext 1 × 0b00001011 + ciphertext 2 × 0b10111100;

S20, aligning partial sums 0 to 7 by bit position, wherein partial sum i undergoes i Montgomery modular multiplications;

s30, carrying out inverse processing on the partial sum 7;

and S40, summing all partial sums.