Disclosure of Invention
The technical problem to be solved by the invention is to provide a fast RSA cryptographic coprocessor supporting dual fields, which uses cascade connections among its functional units to effectively improve RSA encryption and decryption performance, supports switching between different finite fields, and fully reuses hardware resources.
The technical solution adopted by the invention is a fast RSA cryptographic coprocessor supporting dual fields, comprising:
a field control register for receiving an externally input control signal;
a control register for receiving an externally input control signal;
a RAM storage unit for storing externally input operands and operation results;
a binary extension field, connected to the output of the field control register and receiving the control signal of the field control register;
a prime field, connected to the output of the field control register and receiving the control signal of the field control register;
and a dual-field modular multiplication unit, connected to the control register, the RAM storage unit, the binary extension field and the prime field respectively, which operates on the external operands stored in the RAM storage unit according to the control signal of the field control register and stores the result back into the RAM storage unit.
The RAM storage unit comprises a first single-port RAM storage unit, a second single-port RAM storage unit and a third single-port RAM storage unit.
The dual-field modular multiplication unit comprises a state machine unit for emulating the execution of the algorithm, and a multiply-accumulate unit that unifies the modular multiplication operation into the form a + x × y + b by fusing the algorithm structures of the two finite fields.
The state machine unit comprises the following elements, which receive the operands output from the RAM storage unit: a fourth multiplexer receiving operand Xi, a seventh multiplexer receiving operand Yi, a first multiplexer receiving operands Xi and Tj, an exclusive-OR gate receiving operands Ti and Nj, and a third multiplexer receiving operand Zi. It further comprises a Ca memory and a Cb memory, each connected to the binary extension field output of the multiply-accumulate unit, which store the accumulated carries at different times; and an X memory, a Y memory and a Z memory, connected to the outputs of the first, second and third multiplexers respectively, which store operands. The other input of the exclusive-OR gate receives an external Inv signal, and the gate is connected to an input of the second multiplexer. Inputs of the first, second and third multiplexers are also connected to the prime field output of the multiply-accumulate unit. Inputs of the third and fourth multiplexers are also connected to the output of the Ca memory. The output of the Cb memory is connected to inputs of the fourth and fifth multiplexers. The outputs of the X, Y and Z memories are connected to inputs of the fifth, sixth and seventh multiplexers respectively, and the other input of the fifth multiplexer receives the constant 1. The outputs of the fourth, fifth, sixth and seventh multiplexers together form the output of the state machine unit and are connected to the multiply-accumulate unit.
The multiply-accumulate unit consists of a multiply-accumulator whose inputs receive the 64-bit binary addend a, addend b, multiplier X and multiplier Y supplied by the RAM storage unit, and whose outputs provide the prime field result c and the binary extension field result d. The multiply-accumulator comprises a first adder, a second adder, a third adder and a dual-field multiplier, which multiplies the received multipliers X and Y and passes the results to the second adder and the third adder respectively. The inputs of the first adder receive the binary addends a and b, and its output is connected to inputs of the second adder and the third adder. The output of the second adder provides the prime field result c, and the output of the third adder provides the binary extension field result d.
The dual-field multiplier comprises 64 half-adder/full-adder arrays connected in series, a Wallace tree connected to the carry outputs of the 64 half-adder/full-adder arrays, and a carry-propagate adder connected to the carry output and the sum output of the Wallace tree. The input of the first of the 64 half-adder/full-adder arrays receives the multipliers X and Y supplied by the RAM storage unit; the output of the last array is connected to an input of the carry-propagate adder and to the second adder; and the output of the carry-propagate adder is connected to the third adder.
The fast RSA cryptographic coprocessor supporting dual fields of the invention builds on prior research into the RSA modular exponentiation algorithm and the Montgomery modular multiplication algorithm, combines it with methods for resisting side-channel attacks, and realizes a dedicated hardware cryptographic acceleration module with a degree of side-channel resistance. Compared with implementations on a general-purpose processor, an application-specific integrated circuit, an FPGA and the like, the invention offers advantages in performance and security. Compared with other RSA encryption hardware, the invention adds dual-field support, provides additional data channels, and cascades the functional units, which avoids writing back large amounts of redundant data and improves RSA encryption and decryption performance. It also supports switching between different finite fields and fully reuses hardware resources, while its area is less than 20 percent larger than that of a cryptographic module supporting only single-field operation, which is a marked benefit.
Detailed Description
The fast RSA cryptographic coprocessor supporting dual fields according to the invention is described in detail below with reference to the embodiments and the accompanying drawings.
The fast RSA cryptographic coprocessor supporting dual fields of the invention uses the Montgomery ladder algorithm at the modular exponentiation layer and the FIOS algorithm at the modular multiplication layer. The modular multiplication and modular exponentiation algorithms were studied and considered as a whole, and similar operations share hardware in order to reduce area. The RAMs in the architecture are specially connected so as to reduce repeated data movement during modular exponentiation and save data transfer time. The hardware implementation is configurable, so that encryption and decryption support operation over different finite fields and can meet the requirements of different users; to support the two most commonly used finite fields, an efficient 64-bit by 64-bit dual-field multiplier is designed. In addition, based on a study of side-channel attacks, attack resistance is built into the whole design, from the initial algorithm research to the later hardware design, so that the hardware can effectively resist power analysis attacks and fault attacks. On this basis, the design of the hardware modular multiplication module is further improved to prevent power leakage from the modular multiplication.
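The Montgomery ladder mentioned above can be illustrated in software. The following is a minimal sketch, not the hardware of the invention: the function name `montgomery_ladder_pow` is hypothetical, and ordinary `%` reduction stands in for the FIOS modular multiplier. It shows the property that matters for side-channel resistance, namely that every exponent bit triggers exactly one multiplication and one squaring.

```python
def montgomery_ladder_pow(base: int, exponent: int, modulus: int) -> int:
    """Compute base**exponent mod modulus with a Montgomery ladder.

    Illustrative only: Python integers and branches are not constant-time,
    so this models the operation sequence, not the timing behaviour.
    """
    r0 = 1 % modulus          # invariant: r1 == r0 * base (mod modulus)
    r1 = base % modulus
    for i in reversed(range(exponent.bit_length())):
        if (exponent >> i) & 1:
            r0 = (r0 * r1) % modulus   # one multiplication
            r1 = (r1 * r1) % modulus   # one squaring
        else:
            r1 = (r0 * r1) % modulus   # one multiplication
            r0 = (r0 * r0) % modulus   # one squaring
    return r0
```

Because both branches perform the same pair of operations, the sequence of modular multiplications does not depend on the exponent bits, which is the usual argument for using the ladder against simple power analysis.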
The fast RSA cryptographic coprocessor supporting dual fields has a dedicated instruction set, and a user can dynamically change the finite field of operation by sending a specific instruction through the reserved interface. To make the coprocessor easy to integrate into an SoC (system on chip), the invention uses single-port RAM interface signals for external interconnection, and all main data and RAMs of the system are 64 bits wide.
As shown in Fig. 1, the fast RSA cryptographic coprocessor supporting dual fields of the invention comprises: a field control register 1 for receiving an externally input control signal; a control register 2 for receiving an externally input control signal; a RAM storage unit 3 for storing externally input operands and operation results; a binary extension field 5, connected to the output of the field control register 1 and receiving the control signal of the field control register 1; a prime field 6, connected to the output of the field control register 1 and receiving the control signal of the field control register 1; and a dual-field modular multiplication unit 4, connected to the control register 2, the RAM storage unit 3, the binary extension field 5 and the prime field 6 respectively, which operates on the external operands stored in the RAM storage unit 3 according to the control signal of the field control register 1 and stores the result back into the RAM storage unit 3.
The RAM storage unit 3 comprises a first single-port RAM storage unit 31, a second single-port RAM storage unit 32, and a third single-port RAM storage unit 33. The dual-field modular multiplication unit 4 comprises a state machine unit 41 for emulating the execution of the algorithm, and a multiply-accumulate unit 42 that unifies the modular multiplication operation into the form a + x × y + b by fusing the algorithm structures of the two finite fields.
The state machine unit 41 of the invention is designed around FIOS (Finely Integrated Operand Scanning), an optimized form of the Montgomery algorithm. The algorithm splits the operands X and Y and the modulus N into r-bit words for processing, which suits hardware implementation well and makes efficient use of registers. In addition, all operations in the algorithm can be reduced to a single form of operation, which saves hardware resources. The algorithm is used in two variants: a modular multiplication algorithm for the prime field and a modular multiplication algorithm for the binary extension field.
1. Modular multiplication algorithm in the prime field
The algorithm given in Table 1 is a high-radix Montgomery modular multiplication algorithm: large operands are divided into blocks of small words that take part in the operation. The invention implements a high-radix modular multiplier with a word width of 64 bits.
Table 1. FIOS algorithm for the prime field
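As a rough software illustration of a word-serial FIOS Montgomery multiplication in the prime field, the sketch below follows the textbook form of the algorithm and should not be read as the exact step list of Table 1. The names `to_words` and `fios_montgomery_multiply` are hypothetical, and the two carry variables `ca` and `cb` are only assumed to play a role similar to the Ca and Cb memories described for the state machine unit. Every inner step has the form a + x × y + b. The modular inverse via `pow(n, -1, m)` needs Python 3.8 or later.

```python
def to_words(value, count, word_bits):
    """Split an integer into `count` little-endian words of `word_bits` bits."""
    mask = (1 << word_bits) - 1
    return [(value >> (word_bits * k)) & mask for k in range(count)]


def fios_montgomery_multiply(x, y, n, word_bits=64):
    """Return x*y*R^-1 mod n, with R = 2**(word_bits*s), for odd n and 0 <= x, y < n."""
    mask = (1 << word_bits) - 1
    s = -(-n.bit_length() // word_bits)            # number of words in n
    xs, ys, ns = (to_words(v, s, word_bits) for v in (x, y, n))
    n0_inv = (-pow(n, -1, 1 << word_bits)) & mask  # -n^-1 mod 2**word_bits
    t = [0] * (s + 1)                              # running result, one spare word
    for i in range(s):
        # Word 0: multiply step, then choose m so the low word cancels.
        acc = t[0] + xs[0] * ys[i]
        ca, low = acc >> word_bits, acc & mask
        m = (low * n0_inv) & mask
        cb = (low + m * ns[0]) >> word_bits        # low word of this sum is zero
        for j in range(1, s):
            acc = t[j] + xs[j] * ys[i] + ca        # multiplication chain
            ca, low = acc >> word_bits, acc & mask
            acc = low + m * ns[j] + cb             # reduction chain
            cb = acc >> word_bits
            t[j - 1] = acc & mask                  # store shifted down one word
        acc = t[s] + ca + cb
        t[s - 1] = acc & mask
        t[s] = acc >> word_bits
    result = sum(word << (word_bits * k) for k, word in enumerate(t))
    return result - n if result >= n else result   # final conditional subtraction
```

The multiplication and the reduction are interleaved inside one inner loop, so each result word is read and written once per outer iteration; this is what distinguishes the finely integrated form from variants that run the two passes separately.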
2. Modular multiplication algorithm in the binary extension field
In the binary extension field, all data can be regarded as the coefficients of polynomials, so operations on the data become operations on polynomial coefficients; addition, for example, becomes bitwise addition modulo 2. Correspondingly, the partial products in a multiplication are added according to the same rule. Table 2 shows the FIOS algorithm for the binary extension field.
Table 2. FIOS algorithm for the binary extension field
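The polynomial view of binary extension field arithmetic can be made concrete with a short sketch. It is not a reproduction of Table 2; the helper names `gf2_add`, `gf2_mul` and `gf2_mod` are hypothetical, and integers are used as bit vectors of polynomial coefficients. The check uses the AES field GF(2^8) with modulus x^8 + x^4 + x^3 + x + 1 (0x11B) purely as a familiar small example.

```python
def gf2_add(a, b):
    """Add two polynomials over GF(2): bitwise addition modulo 2, no carries."""
    return a ^ b


def gf2_mul(a, b):
    """Multiply two polynomials over GF(2) by XOR-ing shifted partial products."""
    result = 0
    while b:
        if b & 1:
            result ^= a       # partial products are added with XOR
        a <<= 1
        b >>= 1
    return result


def gf2_mod(value, modulus):
    """Reduce a polynomial modulo `modulus` by XOR-ing shifted copies of it."""
    degree = modulus.bit_length() - 1
    while value.bit_length() - 1 >= degree:
        value ^= modulus << (value.bit_length() - 1 - degree)
    return value


# (x + 1)^2 = x^2 + 1 over GF(2): the cross terms cancel.
square = gf2_mul(0b11, 0b11)
# {57} * {83} in the AES field, reduced modulo 0x11B.
aes_product = gf2_mod(gf2_mul(0x57, 0x83), 0x11B)
```

Compared with integer arithmetic, the only change is that `+` becomes `^`, which is why a single datapath can serve both fields once the carries are made optional.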
3. Comparison of the algorithms in the two fields
The FIOS algorithm has essentially the same structure in the prime field and the binary extension field. Apart from the differences in the basic addition and multiplication operations in the two fields, there are two further differences:
3.1. In the binary extension field the modulus N is usually wider than the multipliers, typically by 2 bits; for example, a 256-bit modulus occupies 258 bits. The most significant of the excess bits is 1, so the 2 excess bits of N have the value 0x2, unlike in the prime field. These 2 excess bits are therefore added into the calculation during the last iteration of the second-level loop of the algorithm (e.g., step 6 in Table 2).
3.2. Operations in the binary extension field generate no carry, so the subtraction in the last step is never executed and can be removed outright.
4. Architecture of the dual-field modular multiplier
By fusing the algorithm structures of the two finite fields, the modular multiplication operation is unified into the form a + x × y + b. This allows the arithmetic resources to be reused efficiently, which saves hardware resources and reduces hardware area. Fig. 2 shows the logic structure of the dual-field modular multiplier.
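The unified operation can be sketched as a single function with a field-select flag. This is a behavioural model under assumptions: the names `carryless_multiply` and `dual_field_mac` and the flag `binary_field` are hypothetical, and the (high word, low word) return convention is chosen only for illustration.

```python
WORD_BITS = 64
MASK = (1 << WORD_BITS) - 1


def carryless_multiply(x, y):
    """Multiply x and y as polynomials over GF(2) (XOR of shifted partial products)."""
    result = 0
    while y:
        if y & 1:
            result ^= x
        x <<= 1
        y >>= 1
    return result


def dual_field_mac(a, x, y, b, binary_field):
    """Return (high word, low word) of a + x*y + b in the selected field.

    Prime field: ordinary integer arithmetic with carries.
    Binary extension field: additions become XOR, the product is carry-less.
    """
    if binary_field:
        total = a ^ carryless_multiply(x, y) ^ b
    else:
        total = a + x * y + b
    return total >> WORD_BITS, total & MASK
```

With four 64-bit inputs the prime field result never exceeds 128 bits, since (2^64 - 1) + (2^64 - 1)^2 + (2^64 - 1) = 2^128 - 1, so two output words are always enough.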
As shown in Fig. 2, the state machine unit 41 of the invention comprises the following elements, which receive the operands output from the RAM storage unit 3: a fourth multiplexer 415 receiving operand Xi, a seventh multiplexer 418 receiving operand Yi, a first multiplexer 412 receiving operands Xi and Tj, an exclusive-OR gate 411 receiving operands Ti and Nj, and a third multiplexer 414 receiving operand Zi. It further comprises a Ca memory 419 and a Cb memory 4120, each connected to the binary extension field output of the multiply-accumulate unit 42, which store the accumulated carries at different times; and an X memory 4121, a Y memory 4122 and a Z memory 4123, connected to the outputs of the first multiplexer 412, the second multiplexer 413 and the third multiplexer 414 respectively, which store operands. The other input of the exclusive-OR gate 411 receives an external Inv signal, and the gate is connected to an input of the second multiplexer 413. Inputs of the first multiplexer 412, the second multiplexer 413 and the third multiplexer 414 are also connected to the prime field output of the multiply-accumulate unit 42. Inputs of the third multiplexer 414 and the fourth multiplexer 415 are also connected to the output of the Ca memory 419. The output of the Cb memory 4120 is connected to inputs of the fourth multiplexer 415 and the fifth multiplexer 416. The outputs of the X memory 4121, the Y memory 4122 and the Z memory 4123 are connected to inputs of the fifth multiplexer 416, the sixth multiplexer 417 and the seventh multiplexer 418 respectively, and the other input of the fifth multiplexer 416 receives the constant 1. The outputs of the fourth multiplexer 415, the fifth multiplexer 416, the sixth multiplexer 417 and the seventh multiplexer 418 together form the output of the state machine unit 41, which is connected to the multiply-accumulate unit 42.
Reducing the number of divisions in the computation is an effective way to increase speed. The modular multiplication algorithm proposed by Montgomery in 1985 quickly replaced the classical modular reduction algorithm. The Montgomery algorithm does not rely on comparison and division of long integers: numbers are represented as residues modulo N, and reduction modulo N is converted into division by a power of 2, which in hardware is simply a shift. The algorithm is therefore very well suited to hardware implementation and is the most widely used.
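How the division by N turns into a shift can be seen in a small Montgomery reduction sketch. The function name `montgomery_reduce` and the toy parameters below are assumptions made for illustration; the conversion into Montgomery form uses `%` here only to keep the example short. The modular inverse via `pow(n, -1, m)` needs Python 3.8 or later.

```python
def montgomery_reduce(t, n, r_bits):
    """Return t * 2**(-r_bits) mod n for odd n and 0 <= t < n * 2**r_bits."""
    mask = (1 << r_bits) - 1
    n_neg_inv = (-pow(n, -1, 1 << r_bits)) & mask   # -n^-1 mod 2**r_bits
    m = ((t & mask) * n_neg_inv) & mask             # makes t + m*n divisible by 2**r_bits
    u = (t + m * n) >> r_bits                       # exact division, done as a shift
    return u - n if u >= n else u


# Toy example: n = 97, R = 2**8.
n, r_bits = 97, 8
a, b = 45, 76
a_mont = (a << r_bits) % n                            # a * R mod n
b_mont = (b << r_bits) % n                            # b * R mod n
product_mont = montgomery_reduce(a_mont * b_mont, n, r_bits)   # a*b*R mod n
product = montgomery_reduce(product_mont, n, r_bits)           # back to a*b mod n
```

Inside the reduction only masking, multiplication, addition and one shift occur; the single comparison at the end replaces the trial divisions of classical reduction.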
Basic addition and multiplication differ markedly between the prime field and the binary extension field. The key point is that operations in the binary extension field are polynomial operations and, unlike ordinary arithmetic, generate no carries. Data in the binary extension field can be regarded as the coefficients of a corresponding polynomial, so addition is polynomial addition: only terms of the same degree, that is, bits at the same position, are added, no carry arises, and the addition is modulo 2. Binary extension field addition can therefore be expressed as a bitwise exclusive-OR of the binary data. Since multiplication can be decomposed into a sum of partial products, the binary extension field product is obtained by separating out the exclusive-OR result while the partial products are added; adding back the carries generated during that addition then yields the ordinary product. The structure of the 64-bit multiply-accumulator is shown in Fig. 3, and the principle of the dual-field multiplier is shown in Fig. 4.
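The idea of separating the exclusive-OR result from the carries can be modelled arithmetically. The sketch below is only a behavioural model, not the adder-array, Wallace-tree and carry-propagate structure of Fig. 4; the name `dual_field_multiply` is hypothetical. It relies on the identity s + p = (s XOR p) + 2·(s AND p) for each added partial product.

```python
def dual_field_multiply(x, y, word_bits=64):
    """Return (integer product, binary extension field product) of x and y.

    xor_sum collects the carry-free sum of the partial products, which is the
    binary extension field product. carry_sum collects the carries that the
    carry-free addition dropped; adding them back gives the integer product.
    """
    xor_sum = 0
    carry_sum = 0
    for i in range(word_bits):
        if (y >> i) & 1:
            partial = x << i
            carry_sum += (xor_sum & partial) << 1   # carries of this addition
            xor_sum ^= partial                      # carry-free part
    return xor_sum + carry_sum, xor_sum


integer_product, binary_product = dual_field_multiply(3, 3)
```

For 3 × 3 the carry-free result is 5 (x^2 + 1) and the dropped carries sum to 4, so the integer product 9 is recovered by one extra addition; this is the software analogue of obtaining both field results from one multiplier.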
As shown in Fig. 3, the multiply-accumulate unit 42 consists of a multiply-accumulator whose inputs receive the 64-bit binary addend a, addend b, multiplier X and multiplier Y supplied by the RAM storage unit 3, and whose outputs provide the prime field result c and the binary extension field result d. The multiply-accumulator comprises a first adder 421, a second adder 422, a third adder 423 and a dual-field multiplier 424, which multiplies the received multipliers X and Y and passes the results to the second adder 422 and the third adder 423 respectively. The inputs of the first adder 421 receive the binary addends a and b, and its output is connected to inputs of the second adder 422 and the third adder 423. The output of the second adder 422 provides the prime field result c, and the output of the third adder 423 provides the binary extension field result d.
As shown in Fig. 4, the dual-field multiplier 424 comprises 64 half-adder/full-adder arrays 4241 connected in series, a Wallace tree 4242 connected to the carry outputs of the 64 half-adder/full-adder arrays 4241, and a carry-propagate adder 4243 connected to the carry output and the sum output of the Wallace tree 4242. The input of the first of the 64 half-adder/full-adder arrays 4241 receives the multipliers X and Y supplied by the RAM storage unit 3; the output of the last array is connected to an input of the carry-propagate adder 4243 and to the second adder 422; and the output of the carry-propagate adder 4243 is connected to the third adder 423.