GB2321979A - Modular multiplication circuit - Google Patents
- Publication number
- GB2321979A GB9701958A
- Authority
- GB
- United Kingdom
- Prior art keywords
- bit
- modular multiplication
- value
- multiplication circuit
- modular
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/60—Methods or arrangements for performing computations using a digital non-denominational number representation, i.e. number representation without radix; Computing devices using combinations of denominational and non-denominational quantity representations, e.g. using difunction pulse trains, STEELE computers, phase computers
- G06F7/72—Methods or arrangements for performing computations using a digital non-denominational number representation, i.e. number representation without radix; Computing devices using combinations of denominational and non-denominational quantity representations, e.g. using difunction pulse trains, STEELE computers, phase computers using residue arithmetic
- G06F7/728—Methods or arrangements for performing computations using a digital non-denominational number representation, i.e. number representation without radix; Computing devices using combinations of denominational and non-denominational quantity representations, e.g. using difunction pulse trains, STEELE computers, phase computers using residue arithmetic using Montgomery reduction
Abstract
A modular multiplication circuit for performing modular multiplication has a 64-bit modular multiplier (Mul1) that receives B and N binary data streams (bstr, nstr), and 64-bit parallel data values A (ML1) and Y0 (ML2). The modular multiplier multiplies A*B and N*Y0 on alternate clock phases to decrease the time required to perform the multiplication operations. The binary data streams are presented as 2-bit binary streams to further decrease the processing time.
Description
MODULAR MULTIPLICATION CIRCUIT
FIELD OF THE INVENTION
This invention relates generally to a circuit for performing modular multiplication and particularly, though not exclusively, for implementing the Montgomery
Reduction Algorithm.
BACKGROUND OF THE INVENTION
Modular multiplication is extensively used in implementing cryptographic methods such as RSA cryptography.
The Montgomery algorithm is one of the most efficient techniques for performing modular multiplication. Its use is particularly effective where high performance is required so as to minimize the computation time.
The Montgomery proof is given in Appendix 1 and the Montgomery Reduction Algorithm is outlined below:
Montgomery Algorithm
To enact the P operator on A*B we follow the process outlined below:
(1) $X = A*B + S$ (S initially zero)
(2) $Y = (X*J) \bmod 2^n$ (where J is a pre-calculated constant)
(3) $Z = X + Y*N$
(4) $S = Z / 2^n$
(5) $P = S \pmod N$ (N is subtracted from S if $S \ge N$)
Thus $P = P(A*B)_N$ (the result in the Montgomery Field of numbers)
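As a minimal sketch of steps (1) to (5), the following Python function computes the Montgomery product for small integers. The function name and the choice J = -N^(-1) mod 2^n are assumptions made for illustration: Appendix 1 writes Q as -A*B*J with J = N^(-1), and here the sign is folded into the constant so that step (3) can remain an addition.

```python
def montgomery_product(a, b, n_mod, n_bits):
    """Return a*b*2^(-n_bits) mod n_mod by following steps (1)-(5).

    Assumes n_mod is odd, n_mod < 2**n_bits, and a, b < n_mod.
    """
    r = 1 << n_bits
    # Assumption: J = -N^(-1) mod 2^n, so that Z below is divisible by 2^n.
    j = (-pow(n_mod, -1, r)) % r
    x = a * b                  # (1) X = A*B + S, with S initially zero
    y = (x * j) % r            # (2) Y = (X*J) mod 2^n
    z = x + y * n_mod          # (3) Z = X + Y*N
    s = z >> n_bits            # (4) S = Z / 2^n (exact, low n bits are zero)
    return s - n_mod if s >= n_mod else s   # (5) P = S (mod N)
```

The three-argument `pow` with a negative exponent requires Python 3.8 or later. The result is A*B*I (mod N) in the notation of Appendix 1, where I is the inverse of 2^n modulo N.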
In financial applications where smart cards or portable data carriers are used as a means of ensuring a high level of security during the transaction, Public Key
Cryptography is becoming increasingly popular. Public Key
Cryptography offers a higher level of protection than the traditional symmetric or private key methods but until recently has been expensive to implement. Advances in technology have now made the implementation of such methods cost effective.
RSA Public Key capability has been designed into smartcard or portable data carrier's microcontrollers which also include an on-chip co-processor which has been specifically designed to perform modular multiplications for operands each of 512 bit length. The co-processor is directly driven by the microcontroller's CPU under software control by a program stored either in ROM or in
EEPROM. Such a co-processor which implements the
Montgomery algorithm for modular reduction without the division process is known from European Patent Publication
EP-0601907-A.
As will be discussed in detail hereafter, such a known co-processor suffers from a number of disadvantages.
BRIEF DESCRIPTION OF THE DRAWINGS
One circuit for performing modular multiplication which can be utilized to implement the Montgomery Reduction
Algorithm and which is suitable for implementing as a co-processor will now be described, by way of example only, with reference to the accompanying drawings, in which:
FIG. 1 shows a block schematic diagram of a known, prior art co-processor for performing modular multiplication to implement the Montgomery
Reduction Algorithm in accordance with the present invention;
FIG. 2 schematically illustrates a block diagram of an embodiment of a new improved modular multiplication circuit for performing modular multiplication in accordance with the present invention;
FIG. 3 schematically illustrates a block diagram of a bit-pair adder stage used in the circuit of
FIG. 2 in accordance with the present invention;
FIG. 4 schematically illustrates a block diagram of a multiplier and associated circuitry used in the circuit of FIG. 2 in accordance with the present invention;
FIG. 5A schematically illustrates a block diagram of an arrangement used in the circuit of FIG. 2 for generating component serial bit streams from a random access memory utilizing a parallel-serial interface in accordance with the present invention;
FIG. 5B schematically illustrates a block diagram of a dual port register arrangement used in the circuit of FIG. 2 in accordance with the present invention;
FIG. 6 schematically illustrates a block diagram of an arrangement to perform modular exponentiation in accordance with the present invention;
FIG. 7 schematically illustrates a block diagram
using the circuit of FIG. 2, in implementing the
Chinese Remainder Theorem in accordance with the
present invention; and
FIG. 8 is a graph illustrating a timing diagram
for the circuit of FIG. 2 in accordance with the
present invention.
FIG. 9 illustrates a block diagram of a portable
data carrier that utilizes the modular
multiplication circuit of FIG. 2 in accordance
with the present invention.
DETAILED DESCRIPTION OF THE DRAWINGS
Known Co-processor Operation
FIG. 1 shows a diagram of a known, prior art hardware implementation of a co-processor which performs the
Montgomery algorithm for both full-mode 512 bit and half-mode 256 bit operands. The diagram shows the execution unit, which basically comprises three 512 bit clocked shift registers and two parallel-serial multipliers.
The B value and the modulus N are preloaded into the B and
N registers respectively. Register S is used to store the intermediate result after each rotation of 544 clock cycles. Initially this register will be cleared. The pre-calculated Montgomery Constant, Jo, is loaded into the co-processor via a 32 bit shift register and latched in
Latch2.
The A value is shifted in 4 bytes (32 bits) at a time (Ai) via multiplexer M2~1;2 and latched in Latch1. The value in the B register is serially clocked one bit at a time into a first parallel-serial multiplier ML1. The output of this multiplier, at node nA, is the value Ai*B.
The value Ai*B is then summed at adder Ad1 with the intermediate value stored in register S to produce the value X = Ai*B + S.
For the first 32 clock cycles, the first 32 bit portion of the X value is fed via multiplexer M3~1;4 into a second parallel - serial multiplier ML2, where it is multiplied by the value Jo. The output from ML2 at node nD is the value Yo = X*Jo = (A*B*Jo). Yo is fed back through a 32 bit shift register and latched in Latch2 via multiplexer
M.
After the first 32 clock cycles, multiplexer M3 1;4 switches and feeds the modulus N into the multiplier ML2, where N is multiplied by Yo to produce the value Yo*N.
This value is then summed, over the next 544 clock cycles, with X at adder Ad2 to produce the value Z = X + Yo*N.
The last 32 bits of this calculation are zero and only the 512 most significant bits are saved back in the S register. This completes one full rotation. Sixteen rotations, using a 32 bit multiplication, are required to perform the full 512 bit by 512 bit multiplication, which gives:
P = A*B.I (modN) = P(A*B)N (the result in the Montgomery
Field of numbers).
To recover the required result P is multiplied by H (a pre-calculated Montgomery constant as described by equation 7 in Appendix 1) to give the result in the field of real numbers:
R = A*B (modN)= P(P.H)N
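The rotation scheme described above can be sketched in software as a digit-serial Montgomery multiplication. This is an illustrative model only, not the circuit: the function names are hypothetical, the digit width `w` stands in for the 32 bit Ai slices, and J0 is assumed to be -N^(-1) mod 2^w so that the low `w` bits of Z are zero, as the text states.

```python
def montgomery_rotations(a, b, n_mod, n_bits, w=32):
    """Return a*b*2^(-n_bits) mod n_mod using n_bits // w rotations.

    Assumes n_mod is odd, n_mod < 2**n_bits, b < n_mod, a < 2**n_bits.
    """
    r = 1 << w
    j0 = (-pow(n_mod, -1, r)) % r        # assumption: J0 = -N^(-1) mod 2^w
    s = 0                                # register S, initially cleared
    for i in range(n_bits // w):         # 16 rotations for 512 / 32
        a_i = (a >> (w * i)) & (r - 1)   # next w-bit slice Ai of A
        x = a_i * b + s                  # X = Ai*B + S
        y0 = (x * j0) % r                # Y0 from the first w bits of X
        z = x + y0 * n_mod               # Z = X + Y0*N
        s = z >> w                       # drop the w zero bits, keep the rest
    return s - n_mod if s >= n_mod else s


def recover(p, n_mod, n_bits, w=32):
    """Return P(P.H)N with H = 2^(2n) mod N, leaving the Montgomery field."""
    h = pow(2, 2 * n_bits, n_mod)
    return montgomery_rotations(p, h, n_mod, n_bits, w)
```

Calling `recover(montgomery_rotations(a, b, n, bits), n, bits)` yields A*B (mod N), which mirrors the two-step use of the P operator described in the text.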
RSA Public Key Cryptography
Implementing the RSA public key cryptographic system requires calculating values of the form $M^d \pmod N$ where the exponent d may be up to n bits long (where n is the number of binary digits in N). This is done by performing repeated squaring operations and multiply operations depending upon the value of each bit of the exponent value d, taken in sequence. For a 512 bit exponent, approximately 768 modular operations are required. Using the Prior Art implementation shown in FIG. 1, this leads to the following performance calculation for a 512 bit RSA signature at a clock rate of 20MHz:
$t_{rsa} = (544 \times 16 \times 768) \times (50 \times 10^{-6}\ \mathrm{mS})$
$t_{rsa} = 334.23\ \mathrm{mS}$
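The performance figure can be checked with a few lines of Python. The function name is hypothetical, and the breakdown of 768 operations into 512 squarings plus about 256 multiplies assumes that roughly half of the exponent bits are 1; the text itself only gives the total.

```python
def rsa_time_ms(cycles_per_rotation, rotations, modular_ops, clock_hz):
    """Total time in milliseconds for a run of modular operations."""
    total_cycles = cycles_per_rotation * rotations * modular_ops
    return total_cycles / clock_hz * 1e3


exponent_bits = 512
# Assumption: one squaring per bit plus a multiply for about half the bits.
modular_ops = exponent_bits + exponent_bits // 2

t_rsa = rsa_time_ms(544, 16, modular_ops, 20e6)
print(f"{t_rsa:.2f} mS")
```

At 20MHz each clock cycle lasts 50 ns, which is the 50 * 10^-6 mS factor in the calculation above.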
Disadvantages of the Known Co-Processor Architecture
The prior art co-processor architecture shown in FIG. 1 is integrated onto a single silicon chip together with a microcontroller. The prior art co-processor is directly driven by the microcontroller's CPU under software control by a program stored either in ROM or in EEPROM. Such an arrangement suffers from a number of drawbacks:
The prior art co-processor performance is severely limited owing to the interaction with the software drivers in the external CPU that controls the co-processor,
The external CPU is restricted by the prior art co-processor dependence on the external CPU providing the Ai value during the calculation,
The prior art co-processor suffers from long calculation times thereby limiting the RSA Public Key Signature/Authentication applications and other applications where it can be used.
To generate an RSA signature, if the prime factors (p & q) of N are known then it is possible to use the Chinese
Remainder Theorem (CRT) to substantially speed up the calculation time. Appendix 2 states the Chinese Remainder
Theorem and details its application to RSA. The prior art co-processor architecture is simply a modular multiplier and does not allow easy implementation of CRT. As a result a substantial CPU overhead tends to negate the advantage of using CRT. Typical performance times (CPU time, co-processor time and total time) for the known coprocessor arrangement processing different lengths of signature using the Chinese Remainder Theorem are:
Table 1
                          CPU        Co-processor   Total
512 bit CRT Signature     95.5 mS    92 mS          187.5 mS
768 bit CRT Signature     568 mS     348 mS         916 mS
1024 bit CRT Signature    375 mS     680 mS         1055 mS
New Modular Multiplication Circuit
FIG. 2 illustrates the general arrangement for a modular multiplication circuit that is suitable for use as a co-processor. The modular multiplication circuit offers improved performance and flexibility to overcome the disadvantages of the prior art co-processor, as discussed hereinbefore. All data paths of the modular multiplication circuit are 2 bits wide (unless a wider bit width is clearly required, such as at the 64-bit inputs to the multipliers Mul1 and Mul2) to allow bit-pair operations. The intermediate S value and the B value are stored in dual port RAM as these storage areas are overwritten at various stages of the calculation.
Features of the modular multiplication circuit which provide the improvements will be discussed in detail hereinafter (descriptions typically refer to 512 bit calculations for convenience).
Bit Pair Calculation
The modular multiplication circuit uses bit-pair multiplication, addition, and subtraction. Instead of using a single serial loop clocking scheme as in the coprocessor of FIG. 1, the serial bit stream in the new improved co-processor is examined two bits at a time per clock period. As will be described in detail hereinafter, each serial bit stream is split into two component bit streams (an odd and an even bit stream so that bits from the originating serial bit stream are alternately assigned to the odd and even component serial bit streams). The two component bit streams are processed in parallel, one bit being presented by each of the component bit streams, at the same time to form a bit-pair for calculation. This means that the adders, subtracters, and parallel-serial multipliers evaluate and compute results two bits at a time. This change in architecture immediately doubles the performance for the same clock frequency. An immediate advantage is that the computational throughput is almost doubled without a corresponding doubling of power dissipation.
As will be seen hereinafter, the modular multiplication circuit also utilizes a 2-by-64-bit multiplier and interleaves the calculations of A*B and Yo*N to further increase the calculation speed by a factor of approximately two. The 2-by-64-bit modular multiplier is first used to calculate a 64-bit value for Yo where Y0 is calculated as X*J, and X is calculated as A*B+S.
FIG. 3 illustrates a bit-pair adder that forms the basis of the modular multiplication circuit's bit-pair multipliers, adders, and subtracters. The modular multiplication circuit's bit-pair multiplication, addition, and subtraction is described below by referring to the bit-pair adder. Initially the elements of the bit-pair adder are set to zero. The bits Ao and Bo from the odd data stream are added in a carry-save half-adder to produce odd sum and odd carry outputs So and Co, respectively. The bits AE and BE from the even data stream are input to a carry-save full-adder which produces even sum and even carry outputs SE and CE, respectively.
The signals So and CO are logically combined with the signal CE to produce a signal CE.So + CO which is input to the full-adder to produce an output or result Se. The signals So and CE are XORed to produce a result (odd result) for the odd bit of the bit-pair addition, and the signal SE forms the result (even result) for the even bit of the bit-pair addition. Such carry-save half-adders are well known by those skilled in the art.
The bit-pair subtracter uses the same circuitry as the bit-pair adder described above, except that for use as a subtracter the initial values So and CE are set to logical "1" and the data stream to be subtracted is inverted before input to the half and full-adder, respectively.
The subtraction is thus achieved by two's complement addition.
The bit-pair multiplier (which is 2-bit by 64-bit multiplier) is formed using bit-pair adders as described above. As the odd and even bits of the serial data streams are presented to the multiplier, the multiplication process proceeds by addition as follows:
if the two input serial data bits are "00", a zero value is added;
if the two input serial data bits (Ao and BE or Ao and BE) are "01", the 64-bit value is added;
if the two input serial data bits are "10", the 64-bit value is left-shifted by one bit, then added;
if the two input serial data bits are "11", a pre-calculated value of three times the 64-bit value is added. The value of three times the 64-bit value is stored in a register denoted by 3X.
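The four cases listed above amount to a radix-4 serial-parallel multiplication. The sketch below models that selection in Python under stated assumptions: the function name is hypothetical, the serial operand is consumed least significant bit pair first, and each pair is read as (odd bit, even bit) with the even bit as the lower-order one.

```python
def bit_pair_multiply(parallel_value, serial_value, serial_bits):
    """Multiply by consuming the serial operand two bits per step."""
    three_x = 3 * parallel_value          # pre-calculated "3X" register
    addend_for_pair = {
        0b00: 0,                          # "00": add zero
        0b01: parallel_value,             # "01": add the parallel value
        0b10: parallel_value << 1,        # "10": add the value shifted left
        0b11: three_x,                    # "11": add the stored 3X value
    }
    accumulator = 0
    steps = (serial_bits + 1) // 2        # one step per bit pair
    for step in range(steps):
        pair = (serial_value >> (2 * step)) & 0b11
        accumulator += addend_for_pair[pair] << (2 * step)
    return accumulator
```

Because each step retires two serial bits, a 512 bit serial operand takes 256 steps rather than 512, which is where the doubling in throughput comes from.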
Improved Y0 Calculation
As described above, the modular multiplication circuit uses a bit-pair multiplication scheme to enhance performance. In the prior art architecture, shown in FIG.
1, this would have involved greatly complicating the Y0 control. By using additional logic and a 3X function for both the Jo and Yo paths feeding multiplier MUL2, this complication has been avoided as shown in FIG. 2. The ML1 and Ai registers are both 2-bit serial input multipliers that multiply the 64-bit parallel number within the multiplier by a value of up to three which is the maximum multiplication value that can be applied by the 2-bit serial inputs, thus the 3X nomenclature.
In the prior art (FIG. 1), for the first 32 clock cycles at the start of a rotation, Jo is multiplied by X (X=A*B+S). Yo, the result of the multiplication of X*Jo, is fed back during these first 32 clock cycles and latched in Latch2, after which time Yo is fed into MUL2 and used to generate the product Yo*N over the following 544 clock cycles.
Referring now to FIGS. 2 and 4, in the modular multiplication circuit, Jo is initially loaded into a storage element, such as a latch, from the databus interface via multiplexer Mx1. During multiply operations, the value of Jo is retained by circulating Jo back into the input of the register labeled Jo via Mx1.
At the start of a rotation, register SR is cleared except for SR~bit32, which is set. The output of SR~bitk, if a logical 1, will enable data to be clocked through the latch ML2 from the MSB down to the kth bit pair. After the first clock cycle in any given rotation, bits 63 and 62 in ML2 are no longer required and the first two bits of Y0 can be fed back and latched into ML2~bit63 and ML2~bit62. During the first clock cycle the logical 1 at SR~bit32 is clocked to SR~bit31, at which point ML2~bit63 and ML2~bit62 are enabled. On the second clock cycle the logical 1 at SR~bit31 shifts to SR~bit30. SR~bit31 is reloaded with a logical 1 and now ML2~bit63, ML2~bit62, ML2~bit61 and ML2~bit60 are enabled. The next two output bits from MUL2 are clocked into ML2~bit63 and ML2~bit62. The bits previously in ML2~bit63 and ML2~bit62 are shifted to ML2~bit61 and ML2~bit60 respectively. The process repeats until, after 32 clock cycles, 64 bits of Y0 have been fed back and loaded into ML2. On the subsequent 256 clock cycles Y0 is multiplied by the modulus N.
Register Replacement
As discussed above, the prior art architecture utilizes three 512 bit clocked serial shift registers (B, S and N registers). Data (i.e. value B and modulus value N) are loaded from external memory into the B and N registers respectively by the external CPU via a bus interface. The external CPU feeds the A value into the co-processor, 4 bytes at a time. The external CPU subsequently loads the result back into memory from either the B or S register once the calculation is complete. This scheme consumes power and adds CPU overhead.
Referring now to FIGS. 5A and 5B, the modular multiplication circuit utilizes a simple 8 bit parallel to serial interface, placed between the RAM and the modular multiplier, together with an automatic RAM pointer mechanism. FIGS. 5A and 5B illustrate the RAM interface.
Each alternate bit is loaded into a 4 bit clocked shift register. There are two such 4 bit clocked serial shift registers forming the odd and even component serial bit streams. These two component serial bit streams are then fed into the modular multiplier.
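The byte-to-stream split can be modelled in a few lines of Python. This is a sketch under assumptions that the text does not fix: the function names are hypothetical, bits are taken least significant first, bit 0 is treated as the first even bit, and bytes are consumed in list order.

```python
def split_byte(byte):
    """Split one byte into its odd and even 4 bit component streams."""
    even = [(byte >> i) & 1 for i in (0, 2, 4, 6)]   # even shift register
    odd = [(byte >> i) & 1 for i in (1, 3, 5, 7)]    # odd shift register
    return odd, even


def bit_pairs(data):
    """Yield one 2 bit value per clock, one bit from each component stream."""
    for byte in data:
        odd, even = split_byte(byte)
        for odd_bit, even_bit in zip(odd, even):
            yield (odd_bit << 1) | even_bit


# Each byte read from RAM supplies four clocks' worth of bit pairs.
pairs = list(bit_pairs([0b10110100]))
```

Reassembling the pairs with `sum(p << (2 * k) for k, p in enumerate(pairs))` returns the original byte, which confirms that no bits are lost or reordered by the split.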
FIG. 5B illustrates the arrangement for writing data back into the RAM. The RAM is configured with a double sided or "dual port" arrangement, where right and left side arrays share a central row decoder. With this arrangement, for a given decoded row, data can be read from the left side array while at the same time data is being written back into the right side array. The advantage of this scheme is that data in RAM is never loaded into registers by an external CPU via load and store instructions. Data is simply downloaded into the serial interface automatically when needed by the modular multiplication circuit to perform operations. This significantly reduces power consumption over the prior art by replacing each 512 bit clocked shift register by an 8 bit clocked shift register interface.
Utilizing a mechanism as illustrated in FIG. 6, an automatic RAM pointer and downloading mechanism obviates the need for intervention by an external CPU. The data in
RAM is referenced by the RAM pointer and transferred to the serial interface and clocked out. The RAM pointer automatically increments in readiness for the next data transfer. This scheme has the further advantage over the prior art co-processor in that it allows greater flexibility in handling varying operand lengths. The prior art circuit of FIG. 1 performs a 32-bit by 512-bit multiply per rotation. The number of rotations is determined by the operand length. For operations using less than 256 bits (e.g. 64 X 64), eight 256-bit rotations are still required by the prior art co-processor. The method of the modular multiplication circuit of FIG. 2 allows the operand length to be varied in increments of 32 bits. Once the operand length is chosen the number of rotations required for the calculation is automatically determined as a multiple of 64 bits (e.g. 64 X 64 = 1 rotation, not eight). Thus, the modular multiplication circuit of FIG. 2 has improved performance.
Direct Exponentiation
In order to perform exponentiation operations as required for RSA Public Key systems, the prior art co-processor required the external CPU to regulate the exponentiation process under software control by examining each exponent bit in sequence. The current bit is used to decide whether to perform a modular square or a modular multiply. The exponent value is stored in memory and is read by the external CPU one byte at a time as needed.
The current bit value is determined by an instruction sequence. As the co-processor requires the external CPU to provide the A value during the modular operation, the determination of the exponent bit can only happen between modular operations. Only then can the external CPU control the co-processor mode of operation.
FIG. 6 illustrates that by making use of an exponentiation automatic RAM pointing mechanism, the circuit is now controlled automatically during the exponentiation process. This exponentiation automatic RAM pointing mechanism is similar to the automatic RAM pointing mechanism described hereinbefore. At the end of each modular operation (square or multiply), a signal, EOP, is generated by the circuit. This causes the control logic to shift the RAM pointer in the counter register to the next exponent bit. In this way, the next modular operation can be selected and started immediately without the intervention of the external CPU. If the exponent bit is a logical 1, two modular operations (square followed by multiply) are performed.
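The bit-by-bit control flow can be illustrated with a short square-and-multiply sketch. It is a software model only: the function name is hypothetical, the exponent is assumed to be scanned from its most significant bit, and ordinary modular arithmetic stands in for the Montgomery operations performed by the circuit.

```python
def modexp_with_count(m, d, n_mod):
    """Return (m**d mod n_mod, number of modular operations performed)."""
    result = 1
    operations = 0
    for bit in bin(d)[2:]:                    # exponent bits, MSB first
        result = (result * result) % n_mod    # square for every exponent bit
        operations += 1
        if bit == "1":                        # a 1 bit adds a multiply
            result = (result * m) % n_mod
            operations += 1
    return result, operations
```

The operation count is the exponent bit length plus the number of 1 bits, so an exponent with about half of its bits set needs roughly 1.5 operations per bit.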
Typical performance times (external CPU time, circuit time and total time) for the circuit processing different lengths of signature using the Chinese Remainder Theorem are:
Table 2
                          CPU (CRT)  Co-processor   Total
512 bit CRT Signature     34 mS      46 mS          80 mS
768 bit CRT Signature     82 mS      168 mS         250 mS
1024 bit CRT Signature    220 mS     340 mS         560 mS
The improvements provided by the modular multiplication circuit arrangement are readily apparent from a comparison of the times in Table 2 with those for the prior art processor presented in Table 1 above.
Other Arithmetic Operations
In order to further reduce the external CPU overhead required, the circuit of FIG. 2 has two additional arithmetic operations, namely an addition and a subtraction function.
Addition
Values stored in the B-RAM and S-RAM may be summed together. Referring to FIG. 2, multiplexers Mx2 and Mx6 are set to give a logical 0 output. This means that the output from Sub1 is equal to the input, bstr. Likewise for subtracter Sub2, the output will be equal to the input, sstr. Data from the B-RAM (bstr) and S-RAM (sstr) are fed serially through subtracters Sub1 and Sub2. The output from Sub1 (bstr) is fed to adder Add1 via multiplexer Mx5, where it is summed with the output from Sub2 (sstr). The result is returned via multiplexers Mx7 and Mx8 to the B-RAM.
Subtraction
Values stored in the S-RAM or N-RAM may be optionally subtracted from the value stored in the B-RAM.
Referring to FIG. 2, in either case, the data from either the S-RAM or N-RAM is fed serially via multiplexer Mx2 to subtracter Sub1, where it is subtracted from the value stored in the B-RAM. The result is fed back via multiplexers Mx7 and Mx8 to either the S-RAM or B-RAM.
The inclusion of these additional functions allows an efficient implementation of modular exponentiation using the Chinese Remainder Theorem, as outlined below.
CRT Engine
If the prime factors of the modulus N are known, the CRT may be used to reduce the computation time for a given RSA signature process. The Chinese Remainder Theorem and its application in generating an RSA signature are given in Appendix 2.
The prior art co-processor architecture has a significant external CPU overhead in using the CRT technique. This is because the co-processor is first used to evaluate $r_p = M_p^r \pmod p$ and $r_q = M_q^s \pmod q$. The final result is then evaluated under software control by the external CPU. The processing times given immediately above show the significant external CPU contribution to the performance degradation.
The inclusion of the addition and subtraction arithmetic functions to the modular multiplication circuit's exponentiation functions, as described previously, allows the modular multiplication circuit to act as a CRT engine in a way that drastically reduces this external CPU overhead. FIG. 7 shows how this is implemented. If the prime factors of N are known, then in order to compute $R = M^d \pmod N$ using the Montgomery Method and CRT, use is made of the following pre-calculated values: u, Jp, Jq, Hp, Hq, $r = d \bmod (p-1)$ and $s = d \bmod (q-1)$. These values would typically be stored in an EEPROM so that the values can be changed when appropriate. The following sequence of calculations is followed:
$M_p = M \pmod p$ (1)
$M_q = M \pmod q$ (2)
$r_p = M_p^r \pmod p$ (3)
$r_q = M_q^s \pmod q$ (4)
$a = r_q \pmod p$ (5)
$b = r_p - a$ (6)
$c = b * u \pmod p$ (7)
$g = c * q$ (8)
$R = g + r_q$ (9)
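The nine-step sequence can be traced with small numbers in Python. This is a sketch under assumptions: the function name is hypothetical, the constant u is taken to be q^(-1) mod p (the text lists u as pre-calculated but does not define it here), and Python's built-in `pow` replaces the Montgomery exponentiation, so Jp, Jq, Hp and Hq are not needed.

```python
def rsa_crt(m, d, p, q):
    """Compute m**d mod (p*q) by following steps (1) to (9)."""
    u = pow(q, -1, p)        # assumption: u is the inverse of q modulo p
    r = d % (p - 1)          # pre-calculated exponent for the p branch
    s = d % (q - 1)          # pre-calculated exponent for the q branch
    m_p = m % p              # (1)
    m_q = m % q              # (2)
    r_p = pow(m_p, r, p)     # (3)
    r_q = pow(m_q, s, q)     # (4)
    a = r_q % p              # (5)
    b = r_p - a              # (6) may be negative; (7) reduces it modulo p
    c = (b * u) % p          # (7)
    g = c * q                # (8) ordinary multiplication
    return g + r_q           # (9)


# Toy example: p = 61, q = 53, N = 3233, d = 2753.
signature = rsa_crt(65, 2753, 61, 53)
```

The sketch assumes that M is coprime to N. The two exponentiations work on operands of half the length of N, which is the source of the speed-up that CRT provides.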
The modular multiplication circuit of FIG. 2 has all of the necessary functionality to be able to efficiently calculate the required result using the CRT method.
Appropriate sequencing of events to allow the modular multiplication circuit to perform this calculation are described below. The control of the sequence may be under software control using the external CPU. In this case the external CPU overhead is minimal, otherwise control of the sequence of calculations may be done using a dedicated hardware state machine.
In the above sequence of calculations, operations (1), (2), (3), (4), (5) and (7) are all modular operations that use the Montgomery Algorithm and use the co-processor as a modular multiplier. The memory pointing mechanism of the modular multiplication circuit now allows these intermediate results to be returned to pre-designated locations within memory in readiness for subsequent stages of the CRT calculation.
Stage (8) uses the ordinary multiply function, while stages (6) and (9) make use of the new arithmetic operations that are now available within the new coprocessor, namely, addition and subtraction.
Interleaved Operation Of Mul1
Referring back to FIG. 2 and to FIG. 8, the interleaved operation of the modular multiplier circuit can be seen.
Mul1 operates on both phases of a clock to decrease the time required to multiply A*B and N*Yo. This operation method reduces the calculation time by an additional factor of approximately two when compared to the prior art co-processor implementation.
First X is calculated by selecting 64 bits of A from the A-RAM through multiplexer Mx1 and storing them in ML1 to be applied to the 64 inputs of Mul1. At the same time 2 bits of B from the B-RAM are selected through Sub1 and through multiplexer Mx4 to a 2-bit wide serial input of Mul1. The two B bits are multiplied by the 64 A bits during one cycle of Mul1. The result of the cycle is applied through multiplexer Mx5 to Add1 and added to 2 bits of S from the S-RAM to form the first 2 bits of X. These first 2 bits of X are stored in ML2. This cycle is repeated 32 times until the 64 B bits are multiplied by A and added to the 64 S bits to form 64 X bits in ML2. This is described in the section referred to hereinbefore as "Improved Y0 Calculation".
Next, X*J is calculated to form Yo by selecting the 64 bits of X from ML2 to the inputs of Mul1 and selecting 2 of the 64 bits of J from the Jo register through Mx4 to Mul1. The two bit result of the multiplication is routed through Mx5 and Add1 to be stored in the last two bits of ML2. Thus, Y0 is shifted into bits from which X is shifted out. This cycle is repeated 32 times until a 64 bit Yo is stored in ML2.
FIG. 8 illustrates how the modular multiplier circuit can calculate A*B and N*Yo on alternate clock cycles once Y0 is formed in ML2. In FIG. 8, INPUT CLOCK represents a base clock from which a system clock that controls the modular multiplier circuit operation is derived. This system clock is shown as CONTROL in FIG. 8. Mul1 performs a 64-by-2 multiply operation during each phase of CONTROL. When CONTROL is "high", Mul1 multiplies A*B; when CONTROL is "low", Mul1 multiplies N*Yo. CKB represents a clock that clocks the B-RAM to provide 2 output bits and form the bstr signal. CKN represents a clock that clocks the N-RAM to provide 2 output bits and form the nstr signal.
At time t0, CKB selects the B-RAM to provide two bits through Sub1 and Mx4 to Mul1 while a 64 bit A value is applied to Mul1 from ML1. Since CONTROL is "high", A*B is calculated and held in Mul1. At time t1, CKN selects the N-RAM to provide two bits through Mx4 to Mul1 while a 64 bit Yo value is applied to Mul1 from ML2. Since CONTROL is "low", N*Yo is calculated and added to A*B inside Mul1.
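The interleaving can be modelled as a single accumulator that receives two partial products per clock period, one on each phase of CONTROL. The sketch below is a behavioural illustration only: the function name is hypothetical, bit pairs are assumed to arrive least significant first, and the addition of S and the final shift by 64 bits are left out.

```python
def interleaved_rotation(a_i, y0, b, n_mod, n_bits):
    """Accumulate a_i*b + y0*n_mod with two 64-by-2 multiplies per clock."""
    accumulator = 0
    clock_periods = n_bits // 2                 # 256 periods for 512 bits
    for k in range(clock_periods):
        b_pair = (b >> (2 * k)) & 0b11          # two bits of bstr (CKB)
        n_pair = (n_mod >> (2 * k)) & 0b11      # two bits of nstr (CKN)
        # CONTROL "high": multiply the 64 bit A value by the B bit pair.
        accumulator += (a_i * b_pair) << (2 * k)
        # CONTROL "low": multiply the 64 bit Y0 value by the N bit pair.
        accumulator += (y0 * n_pair) << (2 * k)
    return accumulator
```

Both products are finished in `n_bits // 2` clock periods, whereas performing them one after the other with the same 2 bit serial input would take twice as many.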
CRT Engine Operation
Reference is now made to FIG. 7. For the purposes of the following description, the A-RAM area, B-RAM area,
N-RAM and S-RAM area are divided into regions AL and
AH, BL and BH, NL and NH, SL and SH respectively. The message M is first stored in the B-RAM and the prime factors p and q are stored in memory N-RAM areas NL and
NH respectively. The message M is then multiplied by 1 modulo p to give the result Mp. This value is initially returned to the S-RAM area, SL and then transferred to the A-RAM area AL. In a similar fashion the value rotation. Next the product g = c*q is formed by invoking the ordinary multiply function. The result is stored in S-RAM area, S. Finally the result R = g + rq is calculated by applying the newly incorporated addition function. This value is returned to either the S-RAM or B-RAM.
FIG. 9 illustrates a block diagram of a portable data carrier 200 that utilizes the modular multiplication circuit of FIG. 2. Carrier 200 includes a control unit such as a microcomputer 201 having a modular multiplication circuit 202 with a memory 203. One example of a prior art portable data carrier is given by the reference disclosed in U.S. Patent 4,471,216 issued to Robert L.J. Herve on September 11, 1984.
It will be appreciated that various modifications to the above described modular multiplication circuit will be apparent to a person of ordinary skill in the art, and may be made without departing from the scope of the invention as set out in the claims provided hereinafter.
Appendix 1
Montgomery Modular Reduction Technique
The Montgomery function $P(A*B)_N$ performs a multiplication modulo $N$ of the product $A*B$ into the P field. The retrieval from the P field back into the normal modular field is performed by enacting $P$ on the result of $P(A*B)_N$ and a precalculated constant $H$.
Thus if $P \equiv P(A*B)_N$, then $P(P \cdot H)_N \equiv A*B \pmod{N}$.
Proof
We require to calculate $R = A*B \pmod{N}$.
First find $Q$ such that: $P \cdot 2^n = A*B + Q \cdot N$ (where $N$ is odd) (1)
Note: $I \cdot 2^n \equiv 1 \pmod{N}$ (and $n$ is the bit length of $N$) (2)
Multiply equation (1) by $I$ to give: $P \cdot I \cdot 2^n = A*B \cdot I + Q \cdot I \cdot N$ (3)
Consider the left side of (3), from (2): $P \cdot I \cdot 2^n \equiv P \pmod{N}$ (4)
Consider the right side of (3), then from (4):
$P \equiv (A*B \cdot I + Q \cdot I \cdot N) \pmod{N}$, and therefore:
$P \equiv A*B \cdot I \pmod{N} = P(A*B)_N$ (5)
Consider $P(P \cdot H)_N$, then from (5):
$P(P \cdot H)_N \equiv A*B \cdot I^2 \cdot H \pmod{N}$ (6)
Clearly if $H$ is defined as $I^{-2}$ then:
$R \equiv P(P \cdot H)_N \equiv A*B \pmod{N}$ (7)
Equation (7) gives the desired result.
From (2) above, $H = 2^{2n} \pmod{N}$ and is a precalculated constant depending only on $N$ and $n$.
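As a numerical check of equations (5) to (7), the following sketch models the Montgomery function directly from its definition, P(A*B)N = A*B*I (mod N), and confirms that a second pass with H = 2^(2n) mod N recovers A*B (mod N). The helper name `montgomery_product` and the small test values are assumptions made for illustration; the three-argument `pow` with a negative exponent needs Python 3.8 or later.

```python
def montgomery_product(a, b, n_mod, n_bits):
    """Reference model of P(A*B)N = A*B*I (mod N), where I*2^n = 1 (mod N)."""
    i_const = pow(2, -n_bits, n_mod)  # I from equation (2); requires odd N
    return (a * b * i_const) % n_mod


# Small illustrative values (assumptions, not taken from the patent).
N = 239                 # odd modulus
n = N.bit_length()      # n is the bit length of N, here 8
A, B = 123, 201

H = pow(2, 2 * n, N)                    # H = 2^(2n) mod N = I^-2 mod N
P = montgomery_product(A, B, N, n)      # equation (5): P = A*B*I (mod N)
R = montgomery_product(P, H, N, n)      # equation (7): back to the normal field
assert R == (A * B) % N
```

This reference model uses a modular inverse for clarity; it does not reflect how a hardware implementation would avoid the division.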
It next requires that $Q$ be found. From (1) it can be seen that:
$(A*B \cdot I + Q \cdot I \cdot N) \pmod{2^n} = 0$ (8)
This implies:
$A*B \cdot I \pmod{2^n} = -Q \cdot I \cdot N \pmod{2^n}$ and therefore,
$Q \equiv -N^{-1} \cdot A*B \pmod{2^n}$ (9)
For odd $N$, $J = N^{-1}$ such that $N \cdot J = I \cdot 2^n + 1$.
Hence $Q \equiv -A*B*J \pmod{2^n}$.
Note, $J$ is also a precalculated constant depending only on $N$ and $n$.
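A short sketch of how Q from equation (9) makes the division by 2^n in equation (1) exact. The function name `montgomery_reduce_via_q` is hypothetical, and J is taken here as the inverse of N modulo 2^n, which is the property equation (9) relies on. The returned P may lie in the range 0 to 2N, so the check compares it with A*B*I modulo N.

```python
def montgomery_reduce_via_q(a, b, n_mod, n_bits):
    """Return P with P * 2^n = A*B + Q*N, using Q = -A*B*J (mod 2^n)."""
    r = 1 << n_bits                      # 2^n
    j_const = pow(n_mod, -1, r)          # J = N^-1 (mod 2^n); exists because N is odd
    q = (-(a * b) * j_const) % r         # equation (9)
    total = a * b + q * n_mod            # right side of equation (1)
    assert total % r == 0                # low n bits cancel, so the shift is exact
    return total >> n_bits


# Same illustrative values as a stand-alone check (assumptions).
N, n, A, B = 239, 8, 123, 201
P = montgomery_reduce_via_q(A, B, N, n)
I = pow(2, -n, N)                        # I from equation (2)
assert P % N == (A * B * I) % N          # agrees with equation (5)
```

Only multiplications, an addition and a shift remain at run time, since J depends only on N and n and can be precalculated.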
Appendix 2
Chinese Remainder Theorem
The Chinese Remainder Theorem may be stated as follows.
For a given set of integers $m_0, m_1, m_2, \ldots, m_k$ such that $\gcd(m_1, m_2, m_3, \ldots, m_k) = 1$, then for any set of integers $r_0, r_1, r_2, \ldots, r_k$ such that $r_i < m_i$ ($0 \le i \le k$), there exists a unique integer $X$ such that
$X \pmod{m_i} = r_i$ ($0 \le i \le k$) and $X < m_0 m_1 m_2 \cdots m_k$.
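The statement can be illustrated with a small solver. This is a sketch under the assumption that the moduli are pairwise coprime, which holds for the two distinct primes used in the RSA case; the function name `crt` is hypothetical. `math.prod` and the modular inverse via `pow` need Python 3.8 or later.

```python
from math import prod


def crt(residues, moduli):
    """Return the unique X in [0, prod(moduli)) with X % m_i == r_i.

    Assumes the moduli are pairwise coprime.
    """
    total = prod(moduli)
    x = 0
    for r_i, m_i in zip(residues, moduli):
        partial = total // m_i                       # product of the other moduli
        x += r_i * partial * pow(partial, -1, m_i)   # contributes r_i mod m_i, 0 mod the rest
    return x % total


x = crt([2, 3, 2], [3, 5, 7])
assert x == 23
assert (x % 3, x % 5, x % 7) == (2, 3, 2)
```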
Chinese Remainder Theorem as applied to RSA
In the RSA system the modulus N is the product of two large prime factors, p and q. As p and q are prime, then gcd(p, q) = 1 {gcd = greatest common divisor}.
Therefore, for some integers $r_p$ and $r_q$ such that $r_p < p$ and $r_q < q$, there exists a unique integer $R$ ($R < N$) such that $R \pmod{p} = r_p$ and $R \pmod{q} = r_q$.
In general we have:
$(M \bmod N) \bmod p = X \bmod p = r_p$
$(M \bmod N) \bmod q = X \bmod q = r_q$
Suppose that $R = M^d \pmod{N}$, then we can use the Chinese Remainder Theorem as follows:
$r_p = R \bmod p = (M^d \pmod{N}) \bmod p$
$r_q = R \bmod q = (M^d \pmod{N}) \bmod q$
Also suppose $d = k*(p-1) + r$, then by the Euler-Fermat Theorem
$r_p = (M^{p-1})^k M^r \pmod{p} = 1^k M^r \pmod{p} = (M \bmod p)^r \bmod p$
Similarly if $d = j*(q-1) + s$,
$r_q = (M^{q-1})^j M^s \pmod{q} = 1^j M^s \pmod{q} = (M \bmod q)^s \bmod q$
Also, $r = d \pmod{p-1}$ and $s = d \pmod{q-1}$
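The exponent reduction can be checked numerically. The sketch below assumes M is not a multiple of the prime, which is the condition under which the Euler-Fermat step applies; the function name `reduced_residue` and the toy key values are illustrative assumptions only.

```python
def reduced_residue(m, d, prime):
    """Return (M mod prime)^(d mod (prime - 1)) mod prime.

    Equals M^d mod prime when prime does not divide M.
    """
    return pow(m % prime, d % (prime - 1), prime)


# Toy values for illustration (far too small for real use).
p, q = 61, 53
d = 2753
M = 65

assert reduced_residue(M, d, p) == pow(M, d, p)   # r_p with the short exponent r
assert reduced_residue(M, d, q) == pow(M, d, q)   # r_q with the short exponent s
```

Both the base and the exponent shrink to roughly the size of the prime, which is where the saving in computation comes from.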
Hence in order to calculate $R$, where $R = M^d \pmod{N}$:
1) Compute:
a) $r_p = (M \bmod p)^{d \bmod (p-1)} \bmod p$
b) $r_q = (M \bmod q)^{d \bmod (q-1)} \bmod q$
2) Find $u$ with $0 < u < p$ and
$u * q = 1 \pmod{p}$
3) Use one of:
a) $R = (((r_p - (r_q \bmod p)) * u) \bmod p) * q + r_q$
(where $a > r_q \bmod p$)
b) $R = (((r_p + p - (r_q \bmod p)) * u) \bmod p) * q + r_q$
(where $a < r_q \bmod p$)
Thus the problem of calculating $R = M^d \pmod{N}$, where M, N and d are n binary digit values, is reduced to one of calculating two values $r_p$ and $r_q$ involving n/2 binary digit values. This represents a considerable saving in computation time.
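Steps 1 to 3 can be put together in a few lines. This is a minimal sketch rather than the circuit's procedure: the function name `rsa_crt_power` and the toy primes are assumptions, and Python's `%` always returns a non-negative result, so forms a) and b) of step 3 collapse into a single expression. The intermediate name `c` follows the product g = c*q mentioned in the CRT engine description.

```python
def rsa_crt_power(m, d, p, q):
    """Compute M^d mod (p*q) from the two half-size results r_p and r_q."""
    r_p = pow(m % p, d % (p - 1), p)          # step 1a
    r_q = pow(m % q, d % (q - 1), q)          # step 1b
    u = pow(q, -1, p)                         # step 2: u*q = 1 (mod p), Python 3.8+
    c = ((r_p - (r_q % p)) * u) % p           # step 3: covers both forms a) and b)
    return c * q + r_q                        # R = g + r_q with g = c*q


# Toy key for illustration only: N = 61 * 53 = 3233, e = 17, d = 2753.
p, q, e, d = 61, 53, 17, 2753
N = p * q
M = 65

R = rsa_crt_power(M, d, p, q)
assert R == pow(M, d, N)          # matches the direct full-size computation
assert pow(R, e, N) == M          # and inverts under the public exponent
```

By construction R is congruent to r_q modulo q and to r_p modulo p, and it is smaller than p*q, so it is the unique value the theorem guarantees.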
Claims (7)
1. A modular multiplication circuit comprising:
a storage element for receiving a data value Jo;
a storage element for receiving a data value A;
a storage element for receiving a data value B;
a storage element for receiving a data value Yo;
a storage element for receiving a data value N; and
a 64-bit multiplier having a 64-bit parallel input and a serial input, the 64-bit multiplier coupled to receive the A value and the Yo value on the 64-bit parallel input, and the B value and the N value on the serial input.
2. The modular multiplication circuit according to claim 1 wherein the serial input is a 2-bit serial input to form a 2-bit X 64-bit multiplier.
3. The modular multiplication circuit according to claims 1 or 2 wherein the 64-bit multiplier interleaves multiplication operations of A*B and N*Yo on alternate phases of a control clock.
4. The modular multiplication circuit according to any previous claim wherein the A value is stored in a 64-bit parallel register.
5. The modular multiplication circuit according to any preceding claim wherein the storage element for receiving the data value A, the storage element for receiving the data value B, and the storage element for receiving the data value N are each a random access memory having an automatic RAM pointer mechanism for each RAM.
5. The modular multiplication circuit according to any preceding claim further including a splitter means for splitting each of the B and N binary data streams into component data streams comprising respectively alternate bits of the binary data streams and wherein the modular multiplication is arranged to process the component data streams in parallel.
6. The modular multiplication circuit according to any preceding claim wherein the modular multiplication circuit is arranged to perform the Chinese Remainder Theorem.
7. The modular multiplication circuit according to any preceding claim further including a portable data carrier utilizing the modular multiplication circuit.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB9701958A GB2321979B (en) | 1997-01-30 | 1997-01-30 | Modular multiplication circuit |
Publications (3)
Publication Number | Publication Date |
---|---|
GB9701958D0 (en) | 1997-03-19 |
GB2321979A (en) | 1998-08-12 |
GB2321979B (en) | 2002-11-13 |
Family
ID=10806855
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB9701958A Expired - Fee Related GB2321979B (en) | 1997-01-30 | 1997-01-30 | Modular multiplication circuit |
Country Status (1)
Country | Link |
---|---|
GB (1) | GB2321979B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4276607A (en) * | 1979-04-09 | 1981-06-30 | Sperry Rand Corporation | Multiplier circuit which detects and skips over trailing zeros |
WO1986002181A1 (en) * | 1984-09-28 | 1986-04-10 | Motorola, Inc. | A digital signal processor for single cycle multiply/accumulation |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
FR2724741B1 (en) * | 1994-09-21 | 1996-12-20 | Sgs Thomson Microelectronics | ELECTRONIC CIRCUIT FOR MODULAR CALCULATION IN A FINISHED BODY |
FR2726668B1 (en) * | 1994-11-08 | 1997-01-10 | Sgs Thomson Microelectronics | METHOD OF IMPLEMENTING MODULAR REDUCTION ACCORDING TO THE MONTGOMERY METHOD |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1299797A2 (en) * | 2000-05-15 | 2003-04-09 | M-Systems Flash Disk Pioneers Ltd. | Extending the range of computational fields of integers |
EP1299797A4 (en) * | 2000-05-15 | 2008-04-02 | Sandisk Il Ltd | Extending the range of computational fields of integers |
US7904719B2 (en) | 2000-05-15 | 2011-03-08 | Sandisk Il Ltd. | Extending the range of computational fields of integers |
Also Published As
Publication number | Publication date |
---|---|
GB2321979B (en) | 2002-11-13 |
GB9701958D0 (en) | 1997-03-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP0890147B1 (en) | Co-processor for performing modular multiplication | |
JP4955182B2 (en) | Integer calculation field range extension | |
US7277540B1 (en) | Arithmetic method and apparatus and crypto processing apparatus for performing multiple types of cryptography | |
US6185596B1 (en) | Apparatus & method for modular multiplication & exponentiation based on Montgomery multiplication | |
US5764554A (en) | Method for the implementation of modular reduction according to the Montgomery method | |
JPH09274560A (en) | Power remainder operation circuit, power remainder operation system and operation method for power remainder operation | |
US8078661B2 (en) | Multiple-word multiplication-accumulation circuit and montgomery modular multiplication-accumulation circuit | |
Rankine | Thomas—a complete single chip RSA device | |
US5121429A (en) | Digital signal processing | |
US6922717B2 (en) | Method and apparatus for performing modular multiplication | |
GB2318892A (en) | Co-processor for performing modular multiplication | |
US7113593B2 (en) | Recursive cryptoaccelerator and recursive VHDL design of logic circuits | |
Koppermann et al. | Fast FPGA implementations of Diffie-Hellman on the Kummer surface of a genus-2 curve | |
GB2321979A (en) | Modular multiplication circuit | |
WO2023043467A1 (en) | A method and architecture for performing modular addition and multiplication sequences | |
JP2000207387A (en) | Arithmetic unit and cipher processor | |
Raghuram et al. | A programmable processor for cryptography | |
GB2318890A (en) | Co-processor for performing modular multiplication | |
KR100297110B1 (en) | Modular multiplier | |
GB2318891A (en) | Co-processor for performing modular multiplication | |
GB2332542A (en) | Multiplication circuitry | |
JP3210420B2 (en) | Multiplication circuit over integers | |
JP3137599B2 (en) | Circuit for calculating the remainder of B raised to the power of C modulo n | |
Güneysu | Establishing Dedicated Functions on FPGA Devices for High-Performance Cryptography |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 732E | Amendments to the register in respect of changes of name or changes affecting rights (sect. 32/1977) | |
 | 732E | Amendments to the register in respect of changes of name or changes affecting rights (sect. 32/1977) | Free format text: REGISTERED BETWEEN 20090917 AND 20090923 |
 | PCNP | Patent ceased through non-payment of renewal fee | Effective date: 20140130 |