CN114594925A

CN114594925A - Efficient modular multiplication circuit suitable for SM2 encryption operation and operation method thereof

Info

Publication number: CN114594925A
Application number: CN202210265484.7A
Authority: CN
Inventors: 沈展; 陈付龙; 谢冬
Original assignee: Anhui Normal University
Current assignee: Anhui Normal University
Priority date: 2022-03-17
Filing date: 2022-03-17
Publication date: 2022-06-07

Abstract

The invention discloses an efficient modular multiplication circuit suitable for SM2 encryption operations. It applies the divide-and-conquer idea of the Karatsuba algorithm with a two-level expansion, performs the large-number multiplication partially in parallel, and uses the prime field P₂₅₆ recommended in the national cryptographic algorithm to perform large-number modular multiplication. The algorithm first obtains the multiplication result in 3 cycles and then performs the reduction using the special form of P₂₅₆. During the operation, the operands are first expanded once by divide and conquer; three 64-bit Karatsuba multipliers then run in parallel to produce three partial products (each partial product is computed with an improved Karatsuba algorithm), and modular reduction is performed after the three parts are accumulated, which saves time and resources. A comparison experiment shows that, on an Artix-7 development board at 100 MHz, one modular multiplication consumes only 13.45k LUTs and completes within 0.04 µs, optimizing both resource consumption and execution time.

Description

Efficient modular multiplication circuit suitable for SM2 encryption operation and operation method thereof

Technical Field

The invention belongs to the technical field of circuit operation, and particularly relates to an efficient modular multiplication circuit suitable for SM2 encryption operation and an operation method thereof.

Background

Elliptic Curve Cryptography (ECC) and RSA cryptography algorithms are two very popular and powerful public key cryptography algorithms. However, at the same security level, the number of key bits for ECC is shorter compared to RSA. The 256-bit ECC algorithm in the prime field has the same security level as the 3072-bit RSA algorithm. In addition, elliptic curve cryptography systems consume fewer hardware resources. The modular multiplication operation is the most time-consuming operation in the encryption process of the elliptic curve, so the speed of the modular multiplication becomes the bottleneck in the encryption operation process of the elliptic curve, and how to accelerate the modular multiplication operation is the key point for improving the encryption speed of the elliptic curve.

The SM2 encryption algorithm is an encryption algorithm with independent intellectual property rights and is of great significance for improving the information security of China. Fig. 1 shows the four levels of elliptic curve cryptography. Since SM2 encryption is based on elliptic curve cryptography, the modular operations at the lowest level are the basis of the entire encryption operation. Among the four modular operations, modular multiplication and modular inversion take far more time than the other two, and modular multiplication is called far more often than modular inversion, so completing modular multiplication more efficiently is the core of speeding up the SM2 algorithm.

In the encryption operation process, large-number modular multiplication (A × B mod P) is the lowest-level operation with the heaviest time and resource consumption. Many algorithms therefore first compute the product (A × B = C) and then take the modulus (C mod P). Typically, one comparison and decision is made per bit in a loop over the binary representation, so a 256-bit large integer takes 256 cycles, which makes this scheme the most time-consuming. Radix-8 interleaved modular multiplication has also been proposed, reducing the number of cycles to 32; even so, the speed-up is not ideal, because one point addition in SM2 encryption requires at least 9 large-number modular multiplications.
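The bit-serial scheme described above can be sketched in software as follows. This is a minimal illustration, not the circuit of any cited scheme: the function name `interleaved_modmul` and its return convention are assumptions made for this example. It processes one bit of A per iteration, so a 256-bit operand needs 256 iterations.

```python
def interleaved_modmul(a, b, p, bits=256):
    """Bit-serial (interleaved) modular multiplication, MSB first.

    Hypothetical helper for illustration: returns (a*b mod p, iterations).
    Requires 0 <= a, b < p < 2**bits.
    """
    assert 0 <= a < p and 0 <= b < p
    acc = 0
    for i in range(bits - 1, -1, -1):
        acc <<= 1                 # double the running result
        if (a >> i) & 1:
            acc += b              # add the multiplicand when the bit is set
        # acc < 3p at this point, so two compare-and-subtract steps suffice
        if acc >= p:
            acc -= p
        if acc >= p:
            acc -= p
    return acc, bits
```

Each loop iteration corresponds to one compare-and-decide step per bit, which is why the cycle count equals the operand width.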

For this case, it has been proposed to exploit the special form of the prime field value P₂₅₆ used in SM2 encryption to speed up the modular reduction. The modular division is replaced by modular additions and subtractions, which reduces the complexity of the algorithm; since addition and subtraction are much faster than multiplication, the modular reduction can be completed entirely within one cycle. How to quickly obtain C₁₅～C₀ (C is divided into sixteen 32-bit segments C₁₅～C₀) is another key problem in modular multiplication design. Because multiplication consumes far more resources and time than addition and subtraction, the large-number multiplication also needs to be designed carefully. The existing Karatsuba-Ofman algorithm based on a one-level expansion needs a 129-bit multiplier, and multiplier resource consumption grows rapidly with the operand width, so the resource consumption of that scheme is excessive.

Disclosure of Invention

The invention provides an efficient modular multiplication circuit suitable for SM2 encryption operation, and aims to balance resource consumption and time consumption.

The invention is realized as follows. An efficient modular multiplication circuit suitable for SM2 encryption operations comprises:

8 3-to-1 selectors, MUX1 to MUX8; 2 128-bit subtractors, Sub1 and Sub2; 2 exclusive-OR gates, XOR gate 1 and XOR gate 2; 3 64-bit multipliers, MULT1 to MULT3; 2 64-bit subtractors, SUB1 and SUB2; 3 expanders, EXT1 to EXT3; 1 128-bit adder, ADD1; 3 512-bit adders, ADD2 to ADD4; 1 256-bit adder, ADD5; 1 2-to-1 selector, MUX; 1 512-bit register, R512; a 128-bit adder-subtractor 1; a 512-bit adder-subtractor 2; a shifter; and a modular reduction unit.

Input 1 of MUX1 receives A₃A₂, input 2 receives A₇A₆, and input 3 receives a₃a₂; input 1 of MUX2 receives B₃B₂, input 2 receives B₇B₆, and input 3 receives b₃b₂; input 1 of MUX3 receives A₁A₀, input 2 receives A₅A₄, and input 3 receives a₁a₀; input 1 of MUX4 receives B₁B₀, input 2 receives B₅B₄, and input 3 receives b₁b₀; input 1 of MUX5 receives A₁A₀, input 2 receives A₅A₄, and input 3 receives a₁a₀; input 1 of MUX6 receives A₃A₂, input 2 receives A₇A₆, and input 3 receives a₃a₂; input 1 of MUX7 receives B₃B₂, input 2 receives B₇B₆, and input 3 receives b₃b₂; input 1 of MUX8 receives B₁B₀, input 2 receives B₅B₄, and input 3 receives b₁b₀;

Input 1 of subtractor Sub1 receives A₃A₂A₁A₀ and input 2 receives A₇A₆A₅A₄; its output 1 is connected to input 3 of MUX5 and MUX3, its output 2 is connected to input 3 of MUX6 and MUX1, and its output 3 is connected to XOR gate 1. Input 1 of subtractor Sub2 receives B₃B₂B₁B₀ and input 2 receives B₇B₆B₅B₄; its output 1 is connected to input 3 of MUX7 and MUX2, its output 2 is connected to the inputs of MUX8 and MUX4, and its output 3 is connected to XOR gate 1; the output of XOR gate 1 is connected to adder-subtractor 2;

The outputs of MUX1 and MUX2 are connected to MULT1, the outputs of MUX3 and MUX4 are connected to MULT2, the outputs of MUX5 and MUX6 are connected to SUB1, and the outputs of MUX7 and MUX8 are connected to SUB2; output 1 of SUB1 and of SUB2 is connected to MULT3, and output 2 of SUB1 and of SUB2 is connected to XOR gate 2;

Output 1 of MULT1 is connected to EXT1 and its output 2 to adder ADD1; output 1 of MULT2 is connected to EXT2 and its output 2 to adder ADD1; the outputs of EXT1 and EXT2 are connected to adder ADD2; the outputs of adder ADD1, MULT3 and XOR gate 2 are connected to adder-subtractor 1; the output of adder-subtractor 1 is connected to EXT3; the outputs of EXT3 and ADD2 are connected to ADD3; the outputs of ADD3 and MUX are connected to adder-subtractor 2; output 1 of adder-subtractor 2 is connected to register R512 and its output 2 to ADD4; output 1 of register R512 is connected to the modular reduction unit, its output 2 is connected to ADD4, and its output 4 is connected to ADD4 through adder ADD5 and the shifter; the output of ADD4 is connected to MUX;

wherein A and B are the 256-bit multiplier and multiplicand respectively, A＝A₇A₆A₅A₄A₃A₂A₁A₀, B＝B₇B₆B₅B₄B₃B₂B₁B₀, Aᵢ (7 ≥ i ≥ 0) and Bᵢ (7 ≥ i ≥ 0) are all 32-bit segments, a₃a₂a₁a₀＝A₃A₂A₁A₀-A₇A₆A₅A₄, and b₃b₂b₁b₀＝B₇B₆B₅B₄-B₃B₂B₁B₀.

The invention further provides an operation method based on the efficient modular multiplication circuit suitable for SM2 encryption operations, which specifically comprises the following steps:

S1: in the initialization stage, the register is cleared;

S2: the control signals of the eight 3-to-1 selectors MUX1 to MUX8 are all 0, so the data at input 1 is selected and output; the control signals of the three expanders EXT1 to EXT3 are all 0, so EXT1 to EXT3 expand their inputs by 128 bits, 0 bits and 64 bits respectively; the control signal of the 2-to-1 selector MUX is 1, so the data from register R512 is selected and output, and the operation result is accumulated into register R512;

S3: the control signals of the eight 3-to-1 selectors MUX1 to MUX8 are all 1, so the data at input 3 is selected and output; the control signals of the three expanders EXT1 to EXT3 are all 1, so EXT1 to EXT3 expand their inputs by 384 bits, 256 bits and 320 bits respectively; the control signal of the 2-to-1 selector MUX is 1, so the data from register R512 is selected and output, and the operation result is accumulated into register R512;

S4: the control signals of the eight 3-to-1 selectors MUX1 to MUX8 are all 2, so the data at input 2 is selected and output; the control signals of the three expanders EXT1 to EXT3 are all 2, so EXT1 to EXT3 expand their inputs by 256 bits, 128 bits and 192 bits respectively; the control signal of the 2-to-1 selector MUX is 0, so the data from adder ADD4 is selected and output;

S5: in the MOD state, the modular reduction unit completes the modular reduction of the multiplication result in one cycle.

The invention applies the divide-and-conquer idea of the Karatsuba algorithm with a two-level expansion, performs the large-number multiplication partially in parallel, and uses the prime field P₂₅₆ recommended in the national cryptographic algorithm to perform large-number modular multiplication. The algorithm first obtains the multiplication result in 3 cycles and then performs the reduction using the special form of P₂₅₆. During the operation, the operands are first expanded once by divide and conquer; three 64-bit Karatsuba multipliers then run in parallel to produce three partial products (each partial product is computed with an improved Karatsuba algorithm), and modular reduction is performed after the three parts are accumulated, which saves time and resources.

Drawings

FIG. 1 is a schematic diagram of the architecture levels of elliptic curve cryptography according to an embodiment of the present invention;

FIG. 2 is a circuit diagram of a 64-bit Karatsuba multiplier according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of the calculation process of A × B according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of an efficient modular multiplication circuit suitable for SM2 encryption operations according to an embodiment of the present invention;

FIG. 5 is a state transition diagram of the controller according to an embodiment of the present invention;

FIG. 6 is an internal structural diagram of the modular reduction module according to an embodiment of the present invention.

Detailed Description

The following detailed description of the embodiments of the present invention will be given in order to provide those skilled in the art with a more complete, accurate and thorough understanding of the inventive concept and technical solutions of the present invention.

(I) Large-number multiplication

The Karatsuba algorithm is an efficient algorithm for large-integer multiplication. Based on the divide-and-conquer idea, it splits the multiplication of the multiplier and multiplicand into several smaller partial products, reducing the number of multiplications from 4 to 3.

For example, two large integers A and B of 2W bits are represented as follows:

A＝A₁*2^W+A₀

B＝B₁*2^W+B₀

The ordinary multiplication of A and B proceeds as follows:

A*B＝A₁B₁*2^(2W)+(A₁B₀+A₀B₁)*2^W+A₀B₀ (1)

As formula (1) shows, obtaining A × B requires 4 multiplications: A₁B₁, A₁B₀, A₀B₁ and A₀B₀. The term A₁B₀+A₀B₁ can be rewritten as formula (2).

A₁B₀+A₀B₁＝(A₀+A₁)(B₀+B₁)-A₁B₁-A₀B₀ (2)

This process is the basic idea of the Karatsuba algorithm: using additions and subtractions, the four multiplications in formula (1) are reduced to three. More generally, the n-way Karatsuba algorithm saves n(n-1)/2 multiplications, reducing the count from n² to (n²+n)/2.
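Formulas (1) and (2) can be checked with a short software sketch. The function name `karatsuba_once` is an assumption made for this example; Python integers are unbounded, so the possible overflow of A₀+A₁ in fixed-width hardware does not show up here.

```python
def karatsuba_once(a, b, w):
    """Multiply two 2w-bit integers with three w-bit-scale products, per formula (2)."""
    mask = (1 << w) - 1
    a1, a0 = a >> w, a & mask          # A = A1*2^W + A0
    b1, b0 = b >> w, b & mask          # B = B1*2^W + B0
    hi = a1 * b1                       # A1*B1
    lo = a0 * b0                       # A0*B0
    # Formula (2): A1*B0 + A0*B1 = (A0+A1)(B0+B1) - A1*B1 - A0*B0
    mid = (a0 + a1) * (b0 + b1) - hi - lo
    # Formula (1) with the middle term replaced
    return (hi << (2 * w)) + (mid << w) + lo
```

Only three products appear (`hi`, `lo` and the one inside `mid`); the fourth multiplication of formula (1) is traded for additions and subtractions.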

In a software implementation, the sum A₀+A₁ in formula (2) may overflow, and likewise B₀+B₁, so formula (2) is modified as follows:

A₁B₀+A₀B₁＝(A₀-A₁)(B₁-B₀)+A₁B₁+A₀B₀ (3)

Implementing formula (3) requires considering the sign of (A₀-A₁) and the sign of (B₁-B₀), and hence the sign of the product (A₀-A₁)(B₁-B₀). If this scheme is adopted in circuit design, additional devices such as subtractors, comparators and multiplexers are needed, which increases the resource consumption of the circuit. Therefore, for the multiplication with W = 32, two 32-bit multipliers and one 33-bit multiplier can be used to implement formula (2) and compute the product of two 64-bit numbers. The advantage is that the sign problem does not arise with formula (2). The logic structure of the 64-bit Karatsuba multiplier is shown in FIG. 2.
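The sign handling discussed above can be illustrated with a sketch of the subtractive form, using the identity A₁B₀+A₀B₁ = (A₀-A₁)(B₁-B₀)+A₁B₁+A₀B₀. The function name `karatsuba_signed` is an assumption for this example; the magnitude-plus-sign treatment mirrors the subtractor and comparator logic the text mentions, without claiming to match the patented circuit exactly.

```python
def karatsuba_signed(a, b, w):
    """Multiply two 2w-bit integers using the subtractive middle term."""
    mask = (1 << w) - 1
    a1, a0 = a >> w, a & mask
    b1, b0 = b >> w, b & mask
    hi, lo = a1 * b1, a0 * b0
    da, db = a0 - a1, b1 - b0           # either difference may be negative
    negative = (da < 0) != (db < 0)     # XOR of the two sign bits
    mag = abs(da) * abs(db)             # operands fit in w bits, no overflow
    mid = hi + lo - mag if negative else hi + lo + mag
    return (hi << (2 * w)) + (mid << w) + lo
```

Compared with the additive form, the third multiplier stays at w bits instead of w+1 bits, at the cost of the extra sign logic.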

The Karatsuba algorithm is a divide-and-conquer method for large-number multiplication, and exploiting the parallelism of circuits makes divide and conquer even more efficient. Let two 256-bit large integers A and B be represented as follows:

A[255:0]＝A₇A₆A₅A₄A₃A₂A₁A₀；

B[255:0]＝B₇B₆B₅B₄B₃B₂B₁B₀；

where Aᵢ (7 ≥ i ≥ 0) and Bᵢ (7 ≥ i ≥ 0) are all 32-bit segments, so the calculation of C = A × B proceeds as shown in FIG. 3.

For Part1 of FIG. 3, A₃A₂A₁A₀And B₃B₂B₁B₀Two 128-bit numbers are multiplied, and a one-time karatsuba expansion can be used, transformed as follows:

Part1＝A₃A₂*B₃B₂*2¹²⁸+[(A₁A₀-A₃A₂)*(B₃B₂-B₁B₀)+A₁A₀*B₁B₀+A₃A₂*B₃B₂]*2⁶⁴+A₁A₀*B₁B₀

The expansion uses three 64-bit Karatsuba multipliers as shown in FIG. 2 to multiply two 128-bit numbers. So that the same basic 64-bit Karatsuba multiplier can be used, the expansion of Part1 adopts the improved form of formula (3) (neither A₁A₀-A₃A₂ nor B₃B₂-B₁B₀ exceeds 64 bits).

Similarly, for Part3 of FIG. 3, A₇A₆A₅A₄And B₇B₆B₅B₄Two 128-bit numbers are multiplied, using the same expansion as described above:

Part3＝A₇A₆*B₇B₆*2¹²⁸+[(A₅A₄-A₇A₆)*(B₇B₆-B₅B₄)+A₇A₆*B₇B₆+A₅A₄*B₅B₄]*2⁶⁴+A₅A₄*B₅B₄

for Part2 of FIG. 3, the following results were obtained by first performing the algorithm of karatsuba once:

Part2＝(A₃A₂A₁A₀-A₇A₆A₅A₄)*(B₇B₆B₅B₄-B₃B₂B₁B₀)+Part1+Part3

Let a＝a₃a₂a₁a₀＝A₃A₂A₁A₀-A₇A₆A₅A₄ and b＝b₃b₂b₁b₀＝B₇B₆B₅B₄-B₃B₂B₁B₀.

Here, the signs of a and b are determined in order to obtain the sign of the product a × b. Substituting a and b into Part2 gives the expression Part2', as follows:

Part2’＝a₃a₂a₁a₀*b₃b₂b₁b₀+Part1+Part3

The a₃a₂a₁a₀*b₃b₂b₁b₀ term in Part2' is expanded by a second-level Karatsuba step, giving:

Part2’＝a₃a₂*b₃b₂*2¹²⁸+((a₁a₀-a₃a₂)*(b₃b₂-b₁b₀)+a₁a₀*b₁b₀+a₃a₂*b₃b₂)*2⁶⁴+a₁a₀*b₁b₀+Part1+Part3

Part1, Part2' and Part3 are merged to obtain the final result C:

C＝A*B＝Part3*2²⁵⁶+Part2’*2¹²⁸+Part1
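The two-level expansion can be checked in software. The sketch below is an illustration under stated assumptions: the names `mul128` and `mul256` are hypothetical, the differences a and b are handled as a magnitude plus a sign, and Python's built-in `*` stands in for the basic 64-bit multiplier.

```python
MASK64 = (1 << 64) - 1
MASK128 = (1 << 128) - 1

def mul128(x, y):
    """128 x 128-bit product from three 64-bit products (form of formula (3))."""
    x1, x0 = x >> 64, x & MASK64
    y1, y0 = y >> 64, y & MASK64
    hi, lo = x1 * y1, x0 * y0
    dx, dy = x0 - x1, y1 - y0
    mag = abs(dx) * abs(dy)
    mid = hi + lo + (-mag if (dx < 0) != (dy < 0) else mag)
    return (hi << 128) + (mid << 64) + lo

def mul256(a, b):
    """256 x 256-bit product assembled as Part3*2^256 + Part2'*2^128 + Part1."""
    a_hi, a_lo = a >> 128, a & MASK128
    b_hi, b_lo = b >> 128, b & MASK128
    part1 = mul128(a_lo, b_lo)                 # A3A2A1A0 * B3B2B1B0
    part3 = mul128(a_hi, b_hi)                 # A7A6A5A4 * B7B6B5B4
    da, db = a_lo - a_hi, b_hi - b_lo          # a and b of the text (signed)
    ab = mul128(abs(da), abs(db))              # |a| * |b|
    part2 = part1 + part3 + (-ab if (da < 0) != (db < 0) else ab)
    return (part3 << 256) + (part2 << 128) + part1
```

Each call to `mul128` uses three 64-bit products, so `mul256` uses nine in total, matching the count given later in the description.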

The 64-bit multiplication uses formula (2), that is, the structure shown in FIG. 2, which gives the simplest basic operation unit; the three 128-bit operations Part1, Part2' and Part3 use the improved scheme of formula (3), so that the 64-bit Karatsuba multiplier can be reused. Using the two formulas together optimizes resource consumption.

(II) Modular operation

The special form of the prime field value P₂₅₆ used in SM2 encryption is exploited to speed up the modular reduction. P₂₅₆ can be expressed as a sum and difference of powers of 2, P₂₅₆＝2²⁵⁶－2²²⁴－2⁹⁶+2⁶⁴－1, which allows fast reduction; the residues of the higher powers of 2 can then be derived as follows:

2²⁵⁶(mod P₂₅₆)≡2²²⁴+2⁹⁶－2⁶⁴+1(mod P₂₅₆)；

2²⁸⁸(modP₂₅₆)≡2²⁵⁶+2¹²⁸－2⁹⁶+2³²(mod P₂₅₆)≡2²²⁴+2¹²⁸－2⁶⁴+2³²+1(mod P₂₅₆)；

2³²⁰(mod P₂₅₆)≡2²⁵⁶+2¹⁶⁰－2⁹⁶+2⁶⁴+2³²(mod P₂₅₆)

≡2²²⁴+2¹⁶⁰+2³²+1(mod P₂₅₆)；

2³⁵²(mod P₂₅₆)≡2²⁵⁶+2¹⁹²+2⁶⁴+2³²(mod P₂₅₆)

≡2²²⁴+2¹⁹²+2⁹⁶+2³²+1(mod P₂₅₆)；

2³⁸⁴(mod P₂₅₆)≡2²⁵⁶+2²²⁴+2¹²⁸+2⁶⁴+2³²(mod P₂₅₆)

≡2*2²²⁴+2¹²⁸+2⁹⁶+2³²+1(mod P₂₅₆)；

2⁴¹⁶(mod P₂₅₆)≡2*2²⁵⁶+2¹⁶⁰+2¹²⁸+2⁶⁴+2³²(mod P₂₅₆)

≡2*2²²⁴+2¹⁶⁰+2¹²⁸+2*2⁹⁶－2⁶⁴+2³²+2(mod P₂₅₆)；

2⁴⁴⁸(mod P₂₅₆)≡2*2²⁵⁶+2¹⁹²+2¹⁶⁰+2*2¹²⁸-2⁹⁶+2⁶⁴+2*2³²(mod P₂₅₆)

≡2*2²²⁴+2¹⁹²+2¹⁶⁰+2*2¹²⁸+2⁹⁶-2⁶⁴+2*2³²+2(mod P₂₅₆)；

2⁴⁸⁰(mod P₂₅₆)≡2*2²⁵⁶+2²²⁴+2¹⁹²+2*2¹⁶⁰+2¹²⁸-2⁹⁶+2*2⁶⁴+2*2³²(mod P₂₅₆)

≡3*2²²⁴+2¹⁹²+2*2¹⁶⁰+2¹²⁸+2⁹⁶+2*2³²+2(mod P₂₅₆)；
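The eight congruences above can be verified numerically. The sketch below assumes the SM2 prime P₂₅₆ = 2²⁵⁶ - 2²²⁴ - 2⁹⁶ + 2⁶⁴ - 1; the table `REDUCED` and the function `check_reductions` are names introduced for this example, and each table entry transcribes the final reduced form of one congruence as (coefficient, exponent) pairs.

```python
P256 = 2**256 - 2**224 - 2**96 + 2**64 - 1

# 2^k (mod P256) written as a list of (coefficient, exponent) terms
REDUCED = {
    256: [(1, 224), (1, 96), (-1, 64), (1, 0)],
    288: [(1, 224), (1, 128), (-1, 64), (1, 32), (1, 0)],
    320: [(1, 224), (1, 160), (1, 32), (1, 0)],
    352: [(1, 224), (1, 192), (1, 96), (1, 32), (1, 0)],
    384: [(2, 224), (1, 128), (1, 96), (1, 32), (1, 0)],
    416: [(2, 224), (1, 160), (1, 128), (2, 96), (-1, 64), (1, 32), (2, 0)],
    448: [(2, 224), (1, 192), (1, 160), (2, 128), (1, 96), (-1, 64), (2, 32), (2, 0)],
    480: [(3, 224), (1, 192), (2, 160), (1, 128), (1, 96), (2, 32), (2, 0)],
}

def check_reductions():
    """Return True if every tabulated form is congruent to 2^k modulo P256."""
    return all(
        (2**k - sum(c * 2**e for c, e in terms)) % P256 == 0
        for k, terms in REDUCED.items()
    )
```

Each row follows from the previous one by multiplying by 2³² and substituting the first congruence for 2²⁵⁶.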

The result C of A × B can be written in the following form:

C＝C₁₅*2⁴⁸⁰+C₁₄*2⁴⁴⁸+C₁₃*2⁴¹⁶+C₁₂*2³⁸⁴+C₁₁*2³⁵²+C₁₀*2³²⁰+C₉*2²⁸⁸+C₈*2²⁵⁶+C₇*2²²⁴+C₆*2¹⁹²+C₅*2¹⁶⁰+C₄*2¹²⁸+C₃*2⁹⁶+C₂*2⁶⁴+C₁*2³²+C₀

Substituting the congruences above and grouping terms gives the following algorithm for C mod P₂₅₆:

C mod P₂₅₆＝

3*C₁₅*2²²⁴+C₁₅*2¹⁹²+2*C₁₅*2¹⁶⁰+C₁₅*2¹²⁸+C₁₅*2⁹⁶+2*C₁₅*2³²+2*C₁₅
+2*C₁₄*2²²⁴+C₁₄*2¹⁹²+C₁₄*2¹⁶⁰+2*C₁₄*2¹²⁸+C₁₄*2⁹⁶-C₁₄*2⁶⁴+2*C₁₄*2³²+2*C₁₄
+2*C₁₃*2²²⁴+C₁₃*2¹⁶⁰+C₁₃*2¹²⁸+2*C₁₃*2⁹⁶-C₁₃*2⁶⁴+C₁₃*2³²+2*C₁₃
+2*C₁₂*2²²⁴+C₁₂*2¹²⁸+C₁₂*2⁹⁶+C₁₂*2³²+C₁₂
+C₁₁*2²²⁴+C₁₁*2¹⁹²+C₁₁*2⁹⁶+C₁₁*2³²+C₁₁
+C₁₀*2²²⁴+C₁₀*2¹⁶⁰+C₁₀*2³²+C₁₀
+C₉*2²²⁴+C₉*2¹²⁸-C₉*2⁶⁴+C₉*2³²+C₉
+C₈*2²²⁴+C₈*2⁹⁶-C₈*2⁶⁴+C₈
+C₇*2²²⁴+C₆*2¹⁹²+C₅*2¹⁶⁰+C₄*2¹²⁸+C₃*2⁹⁶+C₂*2⁶⁴+C₁*2³²+C₀ (mod P₂₅₆)；

The 512-bit value C is written as C₁₅C₁₄…C₁C₀, where each Cᵢ (15 ≥ i ≥ 0) is a 32-bit segment.

The 256-bit integers S1 to S14 are defined as follows:

S1＝(C₇,C₆,C₅,C₄,C₃,C₂,C₁,C₀)；S2＝(C₈,C₁₁,C₁₀,C₉,C₈,0,C₁₃,C₁₂)；

S3＝(C₉,0,0,0,C₁₅,0,C₉,C₈)；S4＝(C₁₀,0,0,C₁₅,C₁₄,0,C₁₀,C₉)；

S5＝(C₁₅,C₁₄,C₁₃,C₁₂,C₁₁,0,C₁₂,C₁₁)；S6＝(C₁₁,C₁₅,C₁₄,C₁₃,C₁₂,0,C₁₁,C₁₀)；

S7＝(C₁₂,0,0,0,0,0,0,0)；S8＝(C₁₃,0,0,0,0,0,0,C₁₃)；

S9＝(C₁₄,0,0,0,0,0,C₁₄,C₁₄)；S10＝(C₁₅,0,C₁₅,C₁₄,C₁₃,0,C₁₅,C₁₅)；

S11＝(0,0,0,0,0,C₈,0,0)；S12＝(0,0,0,0,0,C₉,0,0)；

S13＝(0,0,0,0,0,C₁₃,0,0)；S14＝(0,0,0,0,0,C₁₄,0,0)

S2 to S14 are read in the same way as S1; taking S1 as an example:

S1＝C₇*2²²⁴+C₆*2¹⁹²+C₅*2¹⁶⁰+C₄*2¹²⁸+C₃*2⁹⁶+C₂*2⁶⁴+C₁*2³²+C₀。

The return value is:

Result＝(S1+S2+S3+S4+S5+S6+2S7+2S8+2S9+2S10-S11-S12-S13-S14)mod P₂₅₆。
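The S1 to S14 formulation can be exercised with a short software model. This is a sketch under assumptions: the helper names `word_vec` and `sm2_fast_reduce` are introduced here, and the final `% P256` is a software stand-in for the few conditional additions and subtractions of P₂₅₆ that hardware would use to bring the sum into range.

```python
P256 = 2**256 - 2**224 - 2**96 + 2**64 - 1

def word_vec(words):
    """Pack eight 32-bit words, most significant first, into one integer."""
    value = 0
    for w in words:
        value = (value << 32) | w
    return value

def sm2_fast_reduce(c):
    """Reduce a product c < 2^512 modulo P256 using the vectors S1..S14."""
    C = [(c >> (32 * i)) & 0xFFFFFFFF for i in range(16)]   # C[0] is C0
    s = {
        1: (C[7], C[6], C[5], C[4], C[3], C[2], C[1], C[0]),
        2: (C[8], C[11], C[10], C[9], C[8], 0, C[13], C[12]),
        3: (C[9], 0, 0, 0, C[15], 0, C[9], C[8]),
        4: (C[10], 0, 0, C[15], C[14], 0, C[10], C[9]),
        5: (C[15], C[14], C[13], C[12], C[11], 0, C[12], C[11]),
        6: (C[11], C[15], C[14], C[13], C[12], 0, C[11], C[10]),
        7: (C[12], 0, 0, 0, 0, 0, 0, 0),
        8: (C[13], 0, 0, 0, 0, 0, 0, C[13]),
        9: (C[14], 0, 0, 0, 0, 0, C[14], C[14]),
        10: (C[15], 0, C[15], C[14], C[13], 0, C[15], C[15]),
        11: (0, 0, 0, 0, 0, C[8], 0, 0),
        12: (0, 0, 0, 0, 0, C[9], 0, 0),
        13: (0, 0, 0, 0, 0, C[13], 0, 0),
        14: (0, 0, 0, 0, 0, C[14], 0, 0),
    }
    S = {k: word_vec(v) for k, v in s.items()}
    total = (S[1] + S[2] + S[3] + S[4] + S[5] + S[6]
             + 2 * (S[7] + S[8] + S[9] + S[10])
             - S[11] - S[12] - S[13] - S[14])
    return total % P256
```

The signed sum `total` is small relative to P₂₅₆ (at most a handful of multiples), which is what makes a single-cycle reduction plausible in hardware.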

(III) Data path design of the large-number modular multiplier modulo P₂₅₆

As can be seen from the expansions of Part1, Part2' and Part3, each expression needs three 64-bit multiplications, so the final result C of the large-number multiplication needs nine 64-bit multiplications; with the 64-bit Karatsuba multiplier of FIG. 2, this is equivalent to 27 32-bit multipliers. Counted according to the conventional 8-way Karatsuba algorithm, the number of 32-bit multiplications is (8²+8)/2 = 36. The circuit design method of the invention therefore reduces the number of multiplications substantially. Since multiplication is the most time- and resource-consuming operation in large-number multiplication, optimizing it gives the greatest reduction in operation time and resource consumption.
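The multiplication counts quoted above follow from two simple formulas, sketched below. The function names are assumptions for this example; `recursive_karatsuba_count(3)` models three nested two-way splits (256 to 128 to 64 to 32 bits), and the 33-bit multiplier inside each 64-bit unit is counted as one 32-bit-class multiplier, as the text does.

```python
def schoolbook_count(n):
    """Word multiplications for n-word operands without Karatsuba."""
    return n * n

def n_way_karatsuba_count(n):
    """Word multiplications for a single n-way Karatsuba split: (n^2 + n) / 2."""
    return (n * n + n) // 2

def recursive_karatsuba_count(levels):
    """Word multiplications when a two-way split is applied `levels` times."""
    return 3 ** levels
```

For eight 32-bit words these give 64, 36 and 27 multiplications respectively.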

Further analysis shows that the multiplication of A and B can be completed in a single pass within only one cycle; this takes the least time, but the resource consumption is extremely high, since nine 64-bit basic Karatsuba multipliers must work simultaneously. The multiplication can also be completed over multiple cycles, and there are two multi-cycle schemes. The 9-cycle scheme needs only one 64-bit basic Karatsuba multiplier and has the lowest resource consumption but the highest time consumption. The 3-cycle scheme needs three 64-bit basic Karatsuba multipliers and balances resource and time consumption. The three schemes are compared in Table 1.

Table 1. Comparison of resource consumption and time consumption

Comparing the three schemes: relative to the most resource-saving 9-cycle scheme, the 3-cycle scheme reduces the time from 9 cycles to 3 cycles (6 fewer cycles) while using 6 more 32-bit multipliers, a gain ratio of 6/6 = 1. The 1-cycle scheme reduces the time from 9 cycles to 1 cycle (8 fewer cycles) while using 24 more 32-bit multipliers, a gain ratio of 8/24 ≈ 0.33, so its cost efficiency is poor. The 3-cycle scheme is therefore the balanced choice, taking both resource consumption and time consumption into account.
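The ratios in this comparison can be reproduced with a few lines of arithmetic. The table `SCHEMES` and the function `gain_ratio` are names introduced for this sketch; the cycle and multiplier figures are the ones stated in the text (one, three or nine 64-bit Karatsuba multipliers, each counted as three 32-bit multipliers).

```python
# scheme name -> (multiplication cycles, equivalent 32-bit multipliers)
SCHEMES = {
    "9-cycle": (9, 3),
    "3-cycle": (3, 9),
    "1-cycle": (1, 27),
}

def gain_ratio(name, baseline="9-cycle"):
    """Cycles saved per additional 32-bit multiplier, relative to the baseline."""
    base_cycles, base_mults = SCHEMES[baseline]
    cycles, mults = SCHEMES[name]
    return (base_cycles - cycles) / (mults - base_mults)
```

`gain_ratio("3-cycle")` evaluates to 6/6 and `gain_ratio("1-cycle")` to 8/24, the two ratios used in the argument.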

FIG. 4 is a schematic structural diagram of the efficient modular multiplication circuit suitable for SM2 encryption operations according to an embodiment of the present invention. For convenience of description, only the parts relevant to the embodiment are shown. The circuit comprises:

8 3-to-1 selectors, MUX1 to MUX8:

Input 1 of MUX1 receives A₃A₂, input 2 receives A₇A₆, and input 3 receives a₃a₂;

Input 1 of MUX2 receives B₃B₂, input 2 receives B₇B₆, and input 3 receives b₃b₂;

Input 1 of MUX3 receives A₁A₀, input 2 receives A₅A₄, and input 3 receives a₁a₀;

Input 1 of MUX4 receives B₁B₀, input 2 receives B₅B₄, and input 3 receives b₁b₀;

Input 1 of MUX5 receives A₁A₀, input 2 receives A₅A₄, and input 3 receives a₁a₀;

Input 1 of MUX6 receives A₃A₂, input 2 receives A₇A₆, and input 3 receives a₃a₂;

Input 1 of MUX7 receives B₃B₂, input 2 receives B₇B₆, and input 3 receives b₃b₂;

Input 1 of MUX8 receives B₁B₀, input 2 receives B₅B₄, and input 3 receives b₁b₀;

2 128-bit subtractors, Sub1 and Sub2; 2 exclusive-OR gates, XOR gate 1 and XOR gate 2;

Input 1 of subtractor Sub1 receives A₃A₂A₁A₀ and input 2 receives A₇A₆A₅A₄; its output 1 is connected to input 3 of MUX5 and MUX3, its output 2 is connected to input 3 of MUX6 and MUX1, and its output 3 is connected to XOR gate 1;

Input 1 of subtractor Sub2 receives B₃B₂B₁B₀ and input 2 receives B₇B₆B₅B₄; its output 1 is connected to input 3 of MUX7 and MUX2, its output 2 is connected to the inputs of MUX8 and MUX4, and its output 3 is connected to XOR gate 1; the output of XOR gate 1 is connected to adder-subtractor 2;

3 64-bit multipliers, MULT1 to MULT3; 2 64-bit subtractors, SUB1 and SUB2;

3 expanders, EXT1 to EXT3; 1 128-bit adder, ADD1; 3 512-bit adders, ADD2 to ADD4; 1 2-to-1 selector, MUX;

Output 1 of MULT1 is connected to EXT1 and its output 2 to adder ADD1; output 1 of MULT2 is connected to EXT2 and its output 2 to adder ADD1; the outputs of EXT1 and EXT2 are connected to adder ADD2; the outputs of adder ADD1, MULT3 and XOR gate 2 are connected to adder-subtractor 1; the output of adder-subtractor 1 is connected to EXT3; the outputs of EXT3 and ADD2 are connected to ADD3; the outputs of ADD3 and MUX are connected to adder-subtractor 2; output 1 of adder-subtractor 2 is connected to register R512 and its output 2 to ADD4;

1 512-bit register, R512; a 128-bit adder-subtractor 1; a 512-bit adder-subtractor 2; 1 256-bit adder, ADD5; a shifter; and a modular reduction unit;

Output 1 of register R512 is connected to the modular reduction unit, its output 2 is connected to ADD4, its output 3 is connected to MUX, and its output 4 is connected to ADD4 through adder ADD5 and the shifter; the output of ADD4 is connected to MUX.

A subtractor subtracts its two inputs, an adder adds its two inputs, and a multiplier multiplies its two inputs. A 3-to-1 selector selects one of its three inputs for output, and the 2-to-1 selector selects one of its two inputs for output. An expander extends the bit width of its input data. An adder-subtractor performs addition when its control input is 0 and subtraction when it is 1. An exclusive-OR gate outputs 0 when its two inputs have the same sign (both positive or both negative) and 1 when one is positive and the other negative. The register stores its input data. The modular reduction unit is composed of several modular adders; its internal structure is shown in FIG. 6, where s1 to s14 are the S1 to S14 of the return value Result in the modular operation. The shifter shifts its input data left by 128 bits.

In the embodiment of the present invention, the multi-cycle controller that generates the control signals is shown in FIG. 5. The operation process of the efficient modular multiplication circuit suitable for SM2 encryption operations is as follows:

S1: in the initialization stage, the register is cleared;

S2, calculation of Part1: the control signals of the eight 3-to-1 selectors MUX1 to MUX8 are all 0, so the data at input 1 is selected and output; the control signals of the three expanders EXT1 to EXT3 are all 0, so EXT1 to EXT3 expand their inputs by 128 bits, 0 bits and 64 bits respectively; the control signal of the 2-to-1 selector MUX is 1, so the data from register R512 is selected and output, and the operation result is accumulated into register R512;

S3, calculation of Part3: the control signals of the eight 3-to-1 selectors MUX1 to MUX8 are all 1, so the data at input 3 is selected and output; the control signals of the three expanders EXT1 to EXT3 are all 1, so EXT1 to EXT3 expand their inputs by 384 bits, 256 bits and 320 bits respectively; the control signal of the 2-to-1 selector MUX is 1, so the data from register R512 is selected and output, and the operation result is accumulated into register R512;

S4, calculation of Part2: the control signals of the eight 3-to-1 selectors MUX1 to MUX8 are all 2, so the data at input 2 is selected and output; the control signals of the three expanders EXT1 to EXT3 are all 2, so EXT1 to EXT3 expand their inputs by 256 bits, 128 bits and 192 bits respectively; the control signal of the 2-to-1 selector MUX is 0, so the data from adder ADD4 is selected and output. This data is the accumulated value (Part1 + Part3) shifted left by 128 bits and then added to the partial result in register R512 (the upper 256 bits of R512 hold Part3 and the lower 256 bits hold Part1).

S5: in the MOD state, the modular reduction unit completes the modular reduction of the multiplication result in one cycle. The calculation of Part1, the calculation of Part3, the calculation of Part2 and the modular reduction each take one cycle, so the whole modular multiplication requires four cycles.
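The four-cycle schedule can be modelled in software to see how the expander offsets place each product. This is a behavioural sketch under assumptions, not a description of the actual netlist: the names `cycle` and `modmul_schedule` are hypothetical, the comments map steps to components by inference from the description, the operands for each cycle are chosen by the Part being computed, and `% P256` stands in for the modular reduction unit.

```python
P256 = 2**256 - 2**224 - 2**96 + 2**64 - 1
MASK64 = (1 << 64) - 1
MASK128 = (1 << 128) - 1
MASK256 = (1 << 256) - 1

def cycle(x, y, shifts):
    """One cycle: three 64-bit products of 128-bit x and y, placed by the expanders."""
    x1, x0 = x >> 64, x & MASK64
    y1, y0 = y >> 64, y & MASK64
    hi = x1 * y1                                     # MULT1
    lo = x0 * y0                                     # MULT2
    dx, dy = x0 - x1, y1 - y0                        # SUB1, SUB2
    mag = abs(dx) * abs(dy)                          # MULT3
    negative = (dx < 0) != (dy < 0)                  # XOR gate 2
    mid = hi + lo - mag if negative else hi + lo + mag   # ADD1, adder-subtractor 1
    s_hi, s_lo, s_mid = shifts                       # EXT1, EXT2, EXT3 offsets
    return (hi << s_hi) + (lo << s_lo) + (mid << s_mid)  # ADD2, ADD3

def modmul_schedule(a, b):
    """Model of S1..S5 for 256-bit a and b; returns a*b mod P256."""
    a_hi, a_lo = a >> 128, a & MASK128
    b_hi, b_lo = b >> 128, b & MASK128
    r512 = 0                                         # S1: clear R512
    r512 += cycle(a_lo, b_lo, (128, 0, 64))          # S2: Part1
    r512 += cycle(a_hi, b_hi, (384, 256, 320))       # S3: Part3 * 2^256
    da, db = a_lo - a_hi, b_hi - b_lo                # Sub1, Sub2 (a and b)
    negative = (da < 0) != (db < 0)                  # XOR gate 1
    ab = cycle(abs(da), abs(db), (256, 128, 192))    # S4: |a|*|b| * 2^128
    part1, part3 = r512 & MASK256, r512 >> 256
    base = r512 + ((part1 + part3) << 128)           # ADD5, shifter, ADD4
    r512 = base - ab if negative else base + ab      # adder-subtractor 2
    return r512 % P256                               # S5: MOD
```

The offsets (128, 0, 64), (384, 256, 320) and (256, 128, 192) are the base positions 0, 256 and 128 of Part1, Part3 and Part2' plus the in-part offsets 128, 0 and 64.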

The invention completes a modular multiplication in 4 cycles, taking 0.04 µs on an Artix-7 hardware platform and consuming 13.45k LUTs. Optimizing resource consumption and optimizing time are basic principles of circuit design. For comparison with other schemes, the product of resource consumption and time is calculated; to make the comparison uniform, the clock frequency is normalized to 100 MHz, that is, each result is multiplied by the clock frequency of the hardware platform used by that scheme and then divided by 100. The smaller the value, the better the performance. By this measure, the present scheme is significantly superior to the other schemes.
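The normalized figure of merit described above can be written out as a small calculation. The function names and the choice of units (thousands of LUTs, microseconds, MHz) are assumptions for this sketch; only the figures quoted in the text for the present design are used.

```python
def latency_us(cycles, clock_mhz):
    """Execution time in microseconds for a given cycle count and clock."""
    return cycles / clock_mhz

def area_time_product(kluts, time_us, clock_mhz, ref_mhz=100.0):
    """Resource-time product normalized to the reference clock (smaller is better)."""
    return kluts * time_us * clock_mhz / ref_mhz
```

For the present design, `latency_us(4, 100)` gives 0.04 µs and `area_time_product(13.45, 0.04, 100)` gives about 0.538.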

Table 2. Comparison with other schemes

Note:

Scheme 1: Liu Yang. Design and implementation of a cipher logic accelerator for the national cryptographic algorithm SM2 [D]. Anhui University, 2021.

Scheme 2: Marzouqi H, Al-Qutayr M, Salah K. A High-Speed FPGA Implementation of an RSD-Based ECC Processor [J]. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2016, 24(1): 151-.

Scheme 3: Md S H, Yinan K. FPGA-based efficient modular multiplication for Elliptic Curve Cryptography [C]. 2015 International Telecommunication Networks and Applications Conference (ITNAC), Sydney, NSW, 2015: 191-195.

Scheme 4: Khalid J, Wang Xiaojun, Mike S. High performance hardware support for elliptic curve cryptography over general prime field [J]. Microprocessors and Microsystems, 2017, Volume 51: 331-.

Scheme 5: Ali S Y, Khalid J, Shoaib A, et al. Redundant-Signed-Digit-Based High Speed Elliptic Curve Cryptographic Processor [J]. Journal of Circuits Systems & Computers, 2018: S0218126619500816.

Scheme 6: Ali S Y, Khalid J, Shoaib A, et al. A high-speed RSD-based flexible ECC processor for arbitrary curves over general prime field [J]. 1858-1878.

Scheme 7: Islam M M, Hossain M S, Shahjalal M, et al. Area-Time Efficient Hardware Implementation of Modular Multiplication for Elliptic Curve Cryptography [J]. IEEE Access, 2020, vol. 8: 73898-73906.

Scheme 8: Kudithi T, Potdar M, Saktive R. Radix-4 Interleaved Modular Multiplication for Cryptographic Applications [C]. 2019 International Conference on Vision Towards Emerging Trends in Communication and Networking (ViTECoN), Vellore, India, 2019: 1-5.

Scheme 9: Zhang T, Zhu J, Liu Y, Chen F. The Novel Efficient Dual-field FIPS Modular Multiplication [J]. KSII Transactions on Internet and Information Systems, 2020, vol. 14, no. 2: 738-756.

The invention applies the divide-and-conquer idea of the Karatsuba algorithm with a two-level expansion, performs the large-number multiplication partially in parallel, and uses the prime field P₂₅₆ recommended in the national cryptographic algorithm to perform large-number modular multiplication. The algorithm first obtains the multiplication result in 3 cycles and then performs the reduction using the special form of P₂₅₆. During the operation, the operands are first expanded once by divide and conquer; three 64-bit Karatsuba multipliers then run in parallel to produce three partial products (each partial product is computed with an improved Karatsuba algorithm), and modular reduction is performed after the three parts are accumulated, which saves time and resources. A comparison experiment shows that, on an Artix-7 development board at 100 MHz, one modular multiplication consumes only 13.45k LUTs and completes within 0.04 µs, optimizing both resource consumption and execution time.

The invention has been described by way of example, and it is to be understood that its specific implementation is not limited to the details of construction and arrangement shown, but is within the scope of the invention.

Claims

1. An efficient modular multiplication circuit suitable for SM2 encryption operations, the efficient modular multiplication circuit suitable for SM2 encryption operations comprising:

wherein A and B are the 256-bit multiplier and multiplicand respectively, A＝A₇A₆A₅A₄A₃A₂A₁A₀, B＝B₇B₆B₅B₄B₃B₂B₁B₀, Aᵢ (7 ≥ i ≥ 0) and Bᵢ (7 ≥ i ≥ 0) are all 32-bit segments, a₃a₂a₁a₀＝A₃A₂A₁A₀-A₇A₆A₅A₄, and b₃b₂b₁b₀＝B₇B₆B₅B₄-B₃B₂B₁B₀.

2. An operation method of the efficient modular multiplication circuit suitable for SM2 encryption operations according to claim 1, wherein the method specifically comprises the following steps:

S1: in the initialization stage, the register is cleared;