WO2003029957A1 - Montgomery multiplier with dual independent channels - Google Patents

Montgomery multiplier with dual independent channels

Info

Publication number
WO2003029957A1
Authority
WO
WIPO (PCT)
Prior art keywords
parameters
montgomery
values
perform
control codes
Prior art date
Application number
PCT/US2002/029160
Other languages
French (fr)
Inventor
Mike Ruehle
Original Assignee
Intel Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Intel Corporation
Priority to EP02761648A (EP1430393A1)
Publication of WO2003029957A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00: Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/60: Methods or arrangements for performing computations using a digital non-denominational number representation, i.e. number representation without radix; Computing devices using combinations of denominational and non-denominational quantity representations, e.g. using difunction pulse trains, STEELE computers, phase computers
    • G06F7/72: Methods or arrangements for performing computations using a digital non-denominational number representation, i.e. number representation without radix; Computing devices using combinations of denominational and non-denominational quantity representations, e.g. using difunction pulse trains, STEELE computers, phase computers using residue arithmetic
    • G06F7/728: Methods or arrangements for performing computations using a digital non-denominational number representation, i.e. number representation without radix; Computing devices using combinations of denominational and non-denominational quantity representations, e.g. using difunction pulse trains, STEELE computers, phase computers using residue arithmetic using Montgomery reduction

Definitions

  • the invention pertains generally to computers. In particular, it pertains to a linear systolic array Montgomery multiplier.
  • Exponentiation of large numbers has many uses, including public key encryption such as the Rivest-Shamir-Adleman (RSA) algorithm for data encryption and decryption.
  • a common approach to performing the exponentiation, known as the 'square and multiply' technique, performs a square operation (multiplying the accumulated result by itself) for each bit in the exponent, and a 'multiply' operation (multiplying the accumulated result by a base number) for every '1' bit in the exponent.
  • a typical exponentiation using 1024-bit numbers requires over 1500 operations, with each operation involving 1024-bit numbers.
  • Montgomery multipliers are frequently used to perform the RSA and similar algorithms more efficiently by using a transform.
  • Montgomery multipliers perform a transform of the operation needed for exponentiation by performing the operation A x B mod M (the remainder of A times B divided by M), which for large numbers is much more efficient than a direct approach to performing the math.
  • Some Montgomery multipliers use a linear systolic array, i.e., a chain of identical processing elements (PEs), with each PE working on a portion (typically four bits) of each of the large numbers involved. The chain contains enough PEs to hold the largest of the numbers involved, including interim results. Carries and other interim values of the operation are fed in both directions between adjacent PEs.
  • each PE processes data for one clock cycle, then waits for one clock cycle to receive interim values from the adjacent PEs.
  • Adjacent PEs are one clock cycle out of sync, i.e., the odd-numbered PEs are processing while the even-numbered PEs are waiting, and vice-versa. This means that each PE is idle half the time, and those idle cycles represent wasted resources.
  • the idle cycles, which can be considered a separate channel, can be used to perform another operation.
  • two of the three parameters e.g., B and M
  • the conventional approach to utilizing some of these wasted cycles in a square-and-multiply operation is to perform the squares for an exponentiation in one channel, and to perform the multiplies for the same exponentiation in the alternate channel. Since the average exponent contains a '1' in approximately half the bit positions, only half the cycles in the alternate channel are used for multiplies, while the remaining cycles in that channel, about 25% of the total cycles in both channels, remain idle and wasted.
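The 25% figure follows from simple counting. A minimal sketch, assuming an exponent of `bits` bits in which a fraction `ones` of the positions are '1' (the function name and its parameters are illustrative and do not appear in the disclosure):

```python
from fractions import Fraction

def idle_fraction(bits: int, ones: Fraction) -> Fraction:
    """Fraction of all cycle slots (both channels) left idle when squares
    run in one channel and multiplies run in the alternate channel."""
    square_slots = bits              # one square per exponent bit
    multiply_slots = bits            # the alternate channel offers as many slots
    used_multiplies = bits * ones    # only '1' bits need a multiply
    idle = multiply_slots - used_multiplies
    return idle / (square_slots + multiply_slots)
```

For a 1024-bit exponent with half of its bits set, `idle_fraction(1024, Fraction(1, 2))` evaluates to one quarter, matching the estimate above.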
  • FIG. 1 shows a system according to one embodiment of the invention.
  • Fig. 2 shows a linear systolic array Montgomery multiplier according to one embodiment of the invention.
  • Fig. 3 shows a chart of two Montgomery multiplications propagating through a linear systolic array Montgomery multiplier according to one embodiment of the invention.
  • FIG. 4 shows a schematic of a processing element according to one embodiment of the invention.
  • Fig. 5 shows a flow chart of a method according to one embodiment of the invention.
  • LSAMM: linear systolic array Montgomery multiplier
  • the invention may be implemented in hardware, software, or firmware.
  • a machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer).
  • a machine-readable medium may include read only memory (ROM); random access memory (RAM); magnetic disk storage media; optical storage media; flash memory devices; electrical, optical, acoustical or other form of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.), and others.
  • Fig. 1 shows a system according to one embodiment of the invention.
  • system 100 includes input/output (I/O) logic 130 coupled to a processor (CPU) 110, an accelerated graphics processor (AGP) 120, a memory 140, an LSAMM 150, and an I/O bus 170 coupled to various I/O devices not shown.
  • the LSAMM 150 is connected to memory bus 160, where the LSAMM 150 can be addressed by the CPU 110 and/or other devices as a block of memory, but various other embodiments have the LSAMM 150 connected to the system in other ways.
  • any device that can write two sets of parameters e.g., A1, B1, M1, and A2, B2, M2
  • read two final results R1, R2
  • 'A', 'B', and 'M' are used throughout this disclosure to represent the parameters in a Montgomery multiplication, but these are generic designations. Any other terms may be used without departing from the invention.
  • Fig. 2 shows a linear systolic array Montgomery multiplier according to one embodiment of the invention.
  • a chain of serially-connected PEs is used to perform two concurrent Montgomery multiplications, one Montgomery multiplication with a first set of parameters A1, B1, M1, and the other Montgomery multiplication with a second set of parameters A2, B2, M2, with each parameter being distributed through the PEs at one digit per PE.
  • a digit is defined as the number of bits of each parameter processed by each PE.
  • a digit is a four-bit hexadecimal number, and each PE operates on four bits of A and/or four bits of B and/or four bits of M at a time in a given operation. In other embodiments, digits of other sizes are used, and the distribution of bits in the PEs differs from the examples accordingly.
  • the illustrated embodiment of Fig. 2 shows a chain with a quantity of (N+3) PEs, numbered 210-0 through 210-(N+2), and labeled as PE-0 through PE-(N+2).
  • the LSAMM 150 operates on 1024-bit parameters, the value of N is 256, and the number of PEs is 259, including 256 PEs to hold 256 hexadecimal digits of the parameters, and 3 additional PEs to accommodate interim results during processing.
  • Montgomery multiplier (MM) controller 220 controls the process by transferring parameters A1, B1, M1, and A2, B2, M2 into the chain.
  • each parameter is passed from MM controller 220 to PE-0 one digit at a time and propagates through the chain of PEs from right to left.
  • MM controller 220 also passes sets of control codes C1, C2 and sets of other values Q1, Q2, which are described later, into PE-0 for propagation through the chain of PEs.
  • the chain of PEs is set up for a Montgomery multiplication by sequentially passing each parameter B1, B2, M1, and M2, one digit at a time, from MM controller 220 into the chain of PEs until each parameter is distributed throughout PE-0 through PE-N.
  • multiples of each parameter B1, B2, M1, M2 are calculated within the PEs and stored within the PEs.
  • parameter A1 is then sequentially passed one digit at a time to the left from MM controller 220 to PE-0, PE-1, PE-2, etc., in alternating clock cycles, to operate on the stored multiples of B1 and M1.
  • A2 is sequentially passed one digit at a time to the left from MM controller 220 to PE-0, PE-1, PE-2, etc. to operate on stored multiples of B2 and M2.
  • the multiplication operations are complete, with the results R1 and R2 residing in the PEs. If the Montgomery multiplications are finished, the results R1 and R2 are then passed to the right through the PEs and into MM controller 220. In one embodiment, if M1, M2 remain unchanged and if the results R1, R2 provide the values of B1, B2 for the next Montgomery multiplications (as is often the case in RSA operations), multiples of R1, R2 are calculated and stored in the PEs as multiples of B1, B2 without being shifted into the MM controller 220.
  • New values of A1, A2 are then shifted through the PEs as before to perform the next pair of Montgomery multiplications.
  • the second channel is provided with 'no-operation' control codes so that no multiplication will take place in that channel, and the other parameters for the non-operating channel do not have to be loaded.
  • dummy values can be used for parameters on the unused channel, and the results on that channel ignored.
  • parameters A1, A2, B1, B2, M1, M2, and results R1, R2 each represent a large number that is distributed through the chain of PEs, one digit per PE.
  • Control codes are each small enough to fit into a single PE, and each control code directs the PE in which it is currently located to perform an operation on the values currently contained in that PE during a given clock cycle.
  • C1 in Fig. 2 represents multiple control codes for channel 1, the individual control codes generically designated herein as c1, with each c1 being fed into PE-0 and propagated up the chain of PEs along with the digits of the associated parameters A1 and/or B1 and/or M1, to control the operation of the PE in which c1 resides in any given clock cycle.
  • C2 represents multiple control codes c2 for channel 2.
  • Q1 and Q2 in Fig. 2 also represent multiple parameters q1 and q2, each small enough to fit into each PE, and are passed through each PE to the left along with the digits of the associated parameters to further define the operation in each PE.
  • C1 is a predetermined sequence of values c1
  • C2 is a predetermined sequence of values c2.
  • Q1 is a predetermined sequence of values q1
  • Q2 is a predetermined sequence of values q2.
  • the various values of q1 and q2 are determined in PE-0 and propagated through the chain from PE-0.
  • each PE performs an internal operation on channel 1 during one clock cycle as specified by that PE's current values of c1, q1 and the digit of A1, then performs another internal operation on channel 2 during the next clock cycle as specified by the PE's current values of c2, q2, and the digit of A2.
  • the even-numbered PEs perform operations in a first Montgomery multiplication in the odd-numbered clock cycles and perform operations in a second Montgomery multiplication in the even-numbered clock cycles, while the odd-numbered PEs perform operations in the first Montgomery multiplication in the even-numbered clock cycles and perform operations in the second Montgomery multiplication in the odd-numbered clock cycles.
  • these alternating cycles are created by the MM controller 220 by providing PE-0 the digits of A1, c1, and q1 on even-numbered clock cycles, and providing PE-0 the digits of A2, c2, and q2 on odd-numbered clock cycles.
  • Fig. 3 shows a chart of two Montgomery multiplications propagating through a linear systolic array Montgomery multiplier according to one embodiment of the invention.
  • a single Montgomery multiplication can require hundreds of PEs and thousands of clock cycles to complete.
  • the chart shows only the first few PEs and clock cycles.
  • the dark areas represent the clock cycles in which the indicated PEs perform work on a first operation, and are labeled as channel 1.
  • the crosshatched areas represent the clock cycles in which the indicated PEs perform work on a second operation, and are labeled as channel 2.
  • the clock cycles of channel 1 are interleaved with the clock cycles of channel 2.
  • FIG. 3 shows parameters in the format 'xYz', in which x represents a digit of A, Q, or C as shown, Y represents channel 1 or channel 2, and z represents which digit (0 - N) of the indicated parameter is being supplied.
  • [0023] The operations shown by Fig. 3 are described as follows:
  • PE-0, and PE-1 performs the same operation that was performed by PE-0 in clock cycle 1, although the internal data on which it operates may be different. Also at the beginning of clock cycle 2, the first digit of A2 (a20), the associated parameter q20, and the associated control code c20 are provided by MM controller 220 to PE-0. During clock cycle 2, PE-0 performs an internal operation defined by control code c20, and which may further be defined by a20 and q20.
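The interleaving described for Fig. 3 can be sketched as a small schedule function. This is only an illustration under stated assumptions: clock cycles are numbered from 1, PE-0 starts channel 1 in cycle 1 and channel 2 in cycle 2, and each value moves one PE to the left per cycle. The function names `channel_at` and `digit_at` are hypothetical and not part of the disclosure.

```python
def channel_at(pe: int, cycle: int) -> int:
    """Channel (1 or 2) that PE number `pe` serves in clock cycle `cycle`."""
    # Even-numbered PEs serve channel 1 in odd cycles; odd-numbered PEs
    # serve channel 1 in even cycles. Both cases reduce to a parity test.
    return 1 if (pe + cycle) % 2 == 1 else 2

def digit_at(pe: int, cycle: int):
    """Index of the A digit seen by `pe` in `cycle`, or None if no digit has arrived yet."""
    ch = channel_at(pe, cycle)
    # Assumed timing: digit i of channel ch enters PE-0 in cycle 2*i + ch
    # and then advances one PE to the left on every clock cycle.
    steps = cycle - pe - ch
    return steps // 2 if steps >= 0 else None
```

Under these assumptions, `channel_at(0, 2)` gives channel 2 and `digit_at(0, 2)` gives digit 0, which corresponds to PE-0 working on a20 during clock cycle 2.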
  • FIG. 4 shows a schematic of a processing element according to one embodiment of the invention.
  • PE 210 includes two storage elements (B-RAM 412 and M-RAM 414), and processing logic that includes PE control logic 410, two address registers (Q-register 424 and A-register 422), two adders (S+B Adder 430 and S+B+M Adder 440), two multiplexers (first multiplexer 435 and second multiplexer 455), two carry registers (Carry-1-register 432 and Carry-2-register 442), an accumulation register (S-register 445), a Channel selection register 450, and a results register (R-register 460).
  • PE 210 is generic to every PE 210-x in the chain.
  • connections shown at the bottom of Fig. 4 are common to multiple PEs, connections shown to the right interface with the PE to the right, and connections shown to the left interface with the PE to the left, with outputs from one PE connected to similarly-named inputs of the adjacent PE.
  • Exceptions are PE-0, which interfaces to MM controller 220 on the right, and PE-(N+2), which has no PE to its left.
  • Clk, Carry-In-1, Carry-Out-1, Carry-In-2, Carry-Out-2, Chnl-In, Chnl-Out, and all internal connections to propagate those signals contain one bit each, while Cntl-In and Cntl-Out contain the number of bits necessary to identify each of the various control codes. All the remaining connections shown in Fig. 4 contain the number of bits being processed by each PE, such as four bits each for the illustrated embodiment. In one embodiment, each PE also includes other inputs and outputs as necessary, e.g., a Reset input (not shown).
  • Control logic 410 latches a control code received from the PE to the right, uses that control code to control the logic elements of the present PE during a current clock cycle, and then passes the control code to the PE to the left.
  • Storage element B-RAM 412 is used to store one digit of each multiple of B that is stored in the PE chain, while storage element M-RAM 414 is used to store one digit of each multiple of M that is stored in the PE chain.
  • A-register 422 and Q-register 424 hold the addresses that select the desired locations within B-RAM 412 and M-RAM 414, respectively, (both for reading and for writing) and also pass these addresses to the PE to the left.
  • S+B Adder 430 is used to add the contents of a selected location in B-RAM 412 to the contents of the S-register in the PE to the left, including any carry bit received through the Carry-In-1 input from the S+B Adder in the PE to the right.
  • Carry-1-Register 432 latches any carry bit from S+B Adder 430 and provides it as a carry bit to the S+B Adder in the PE to the left during the next clock cycle.
  • S+B+M Adder 440 adds the output of S+B Adder 430 to the contents of a selected location in M-RAM 414.
  • Any received carry bit is provided from the PE to the right through the Carry-In-2 input, and any generated carry bit is latched into Carry-2-Register 442 for use by the PE to the left in the next clock cycle.
  • the output of S+B+M Adder 440 is latched into S-register 445, which acts as an accumulation register for interim results.
  • the output of S-register 445 is distributed to each of B-RAM 412, M-RAM 414, first multiplexer 435, and the S-Out output for use by the PE to the right.
  • R-register 460 latches the output of S-register 445 if the right-hand input of second multiplexer 455 is selected, and otherwise latches the contents of the R-register in the PE to the left.
  • Channel selection register 450 is coupled to an address bit of both B-RAM 412 and M-RAM 414 to select either a first bank of addresses or a second bank of addresses in both storage elements.
  • A-register 422 and Q-register 424 select specific locations as described above.
  • the channel selection values are propagated from right to left through the Channel selection registers of the PEs along with the values in the A- and Q-registers.
  • the channel selection values are part of the control codes.
  • [0031] In the embodiment of Fig. 4, Clk is used to latch data into the Control logic, into the Q-, A-, S-, R-, Carry-1-, Carry-2-, and Channel selection registers, and to clock write operations in the B- and M-RAMs, while both adders, both multiplexers, and the read operations in the B- and M-RAMs are combinatorial, i.e., any change at an input is propagated through to the logic element's output regardless of clock status.
  • the B- and M-RAMs use a clocked input for read as well as write operations.
  • clock speed is chosen so that worst-case combinatorial delays in PE 210 are less than one clock cycle. Specific connections from the Clk input to other circuit elements are not shown in Fig. 4 to avoid making the figure overly complex.
  • Control logic 410 contains the logic necessary to control the operation of
  • control logic 410 includes a decoder circuit to convert the control code to necessary control signals.
  • the control code is simply latched, with each bit of the control code specifying a particular control signal.
  • the control codes specify operations that include but are not limited to: selecting one of the two inputs of first multiplexer 435, selecting one of the two inputs of second multiplexer 455, writing to B-RAM 412, writing to M-RAM 414, resetting one or more of the A, Q, S and R registers, and inhibiting the clock signal to various logic elements.
  • the storage elements include random access memories (RAM), labeled B-RAM and M-RAM to indicate the parameters being stored. Even though the terms 'B-RAM' and 'M-RAM' are used throughout the disclosure, in some embodiments, types of storage elements other than RAM are used.
  • B-RAMs 412 in the PE chain provide a first bank of storage space for values of (0 x B1), (1 x B1), (2 x B1), etc., and a second bank of storage space for values of (0 x B2), (1 x B2), (2 x B2), etc.
  • B-RAM 412 includes 32 4-bit storage locations, 16 locations to hold the digits of (0 x B1) through (15 x B1) and another 16 locations to hold corresponding digits of (0 x B2) through (15 x B2).
  • M-RAM 414 includes 32 4-bit storage locations to hold corresponding digits of (0 x M1) through (15 x M1) and (0 x M2) through (15 x M2).
  • each PE processes a number of bits other than four, the number of locations in each RAM are changed accordingly to address and store the required number of multiples for each set of parameters.
  • a single value of a given parameter is always used for both Montgomery multiplications.
  • the corresponding storage element provides storage for multiples of only the single value of the parameter, and the connection from Channel selection register 450 to that storage element is eliminated so that both Montgomery multiplications read from the same bank of multiples.
  • M-RAM 414 contains sixteen locations to store a digit of the multiples (0 x M) through (15 x M), and Channel selection register 450 does not control an address input line to M-RAM 414.
  • An embodiment that is designed to hold two independent values of M can also be used for applications that have a single value of M by making M1 and M2 have the same value.
  • Channel selection register 450 is a latch containing a one-bit selection value that is propagated through the PEs along with control codes and values in the A- and Q-registers. When the one-bit selection value is in one state it selects the first bank in B-RAM 412 and in M-RAM 414, and when the one-bit selection value is in another state it selects the second bank in B-RAM 412 and M-RAM 414. Thus Channel selection register 450 can select the bank of values for the channel that is operable in a given PE during a given clock cycle.
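The two-bank layout can be modelled in a few lines. The sketch below builds the 32-entry RAM image for one PE and forms a read address from the channel bit and a four-bit select value. It assumes the channel selection bit is the most significant address bit, which the text does not state explicitly, and the names `digit`, `build_ram` and `ram_address` are illustrative only.

```python
DIGIT_BITS = 4                   # four-bit digits, as in the illustrated embodiment
MULTIPLES = 1 << DIGIT_BITS      # 16 multiples: 0 x P through 15 x P

def digit(value: int, pe: int) -> int:
    """Digit number `pe` of `value` (digit 0 is the least significant)."""
    return (value >> (DIGIT_BITS * pe)) & (MULTIPLES - 1)

def build_ram(p1: int, p2: int, pe: int) -> list:
    """RAM image for one PE: lower bank for channel 1, upper bank for channel 2."""
    lower = [digit(k * p1, pe) for k in range(MULTIPLES)]
    upper = [digit(k * p2, pe) for k in range(MULTIPLES)]
    return lower + upper

def ram_address(channel_bit: int, select: int) -> int:
    """Assumed addressing: channel bit on top, A- or Q-register value below it."""
    return (channel_bit << DIGIT_BITS) | select
```

Reading the same address from every PE and reassembling the digits recovers the full multiple, for example 7 x B2 when `channel_bit` is 1 and `select` is 7. A multiple such as 15 x B can be one digit longer than B itself, which is one reason a chain needs more PEs than the parameter has digits.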
  • Fig. 5 shows a flow chart of a method according to one embodiment of the invention. The illustrated embodiment of Fig. 5 sets up two Montgomery multiplications in blocks 510-545, performs the two Montgomery multiplications concurrently in blocks 550, 560, and propagates the results out of the PEs in block 570.
  • the logic of PE 210 can be used in various ways, depending on the operation being performed at the time. In one embodiment, the PEs can perform each of the following, which are described in more detail in the following sections:
  • each Montgomery multiplication starts with initial values for B and M in the PEs.
  • the result of one Montgomery multiplication is an initial value for the next Montgomery multiplication, so that the new initial value does not have to be loaded.
  • S-register 445 contains a digit of the final result, which is then loaded as an initial value for the next Montgomery multiplication into B-RAM 412 (or M-RAM 414) at an address specified by A-register 422 (or Q-register 424) and Channel selection register 450.
  • B-RAM 412 and/or M-RAM 414 require one or more initial values that are not contained in the PE, so the initial values are loaded as shown in blocks 510, 520, 530, and 540 of Fig. 5.
  • an initial value is propagated through the PEs through the S-registers until each digit is in its proper PE, whereupon the digit is written into the corresponding RAM.
  • adders 430 and 440 will pass through the value from S-In unchanged and load it into S-register 445.
  • initial values can be propagated into and through the chain of PEs through the S-registers.
  • the S-registers are designed to pass data from left to right, in one embodiment the values begin propagating through the
  • MM controller 220 has a separate output to S-In of PE-(N+2) and feeds the digits of the initial value directly into PE-(N+2).
  • MM controller 220 feeds the digits of the initial value into an address register (A- or Q-) of PE-0 and propagates the digits through the chain of PEs from right to left.
  • a loopback circuit then loops the address register output of PE-(N+2) back to the S-In input of PE-(N+2), from where the digits are propagated from left to right through the S-registers as before.
  • the contents of the S-register are loaded into a specified location of the B- or M-RAM.
  • S-registers or are propagated through the address registers first, in one embodiment only one initial value is propagated through the PEs at a time, without interleaving, until every digit of the initial value is in its proper PE. Multiples of that digit are then calculated and stored as described in one of the next two sections before the next initial value is propagated into the S-registers.
  • B1 is loaded by propagating the digits of B1 through the S-registers in block 510, and multiples of B1 are calculated and stored at block 515.
  • B2 is loaded by propagating the digits of B2 through the S-registers in block 520, and multiples of B2 are calculated and stored in block 525.
  • M1 and M2 are separately propagated into place and their multiples separately calculated and stored in blocks 530, 535, 540, and 545. Although these parameters are shown being handled in the order B1, B2, M1, M2, in one embodiment the parameters may be handled in any order.
  • [0046] In another embodiment, the digits of B1 and B2 are interleaved while being concurrently propagated into the PEs and stored in the storage elements, and the digits of M1 and M2 are likewise interleaved while being concurrently propagated into the PEs and stored in the storage elements.
  • the multiples of each parameter are then calculated separately as described in the previous paragraph.
  • Pre-calculate multiples of B1 and B2 and store in the B-RAMs
  • a digit of each multiple of B1 is calculated and stored in the B-RAM 412 as shown in block 515 by executing the following in each PE:
  • [0048] 1) Clear the contents of the first location of the lower bank in B-RAM 412.
  • this operation is performed by zeroing the contents of S-register 445, zeroing the contents of A-register 422, setting Channel selection register 450 to zero, and setting B-RAM 412 to 'write' so that the zeroes of S-register 445 are written into the first location of the lower bank in B-RAM 412.
  • [0049] 2) Load the correct digit of B1 into S-register 445 through the process previously described in the section 'Load Initial Values into the B-RAMs and/or M-RAMs'.
  • M-RAM 414 is a temporary holding place for this value, and can be cleared at the end of the pre-calculation steps.
  • B-RAM 412 so that the changing value in S-register 445 is stored into successive locations 0, 1, 2, 3, etc. in B-RAM 412.
  • the result in B-RAM 412 is that location 0 contains a digit of 0 x B1, location 1 contains the same digit of 1 x B1, location 2 contains the same digit of 2 x B1, location 3 contains the same digit of 3 x B1, etc.
  • a digit of each multiple of M1 is calculated and stored in the M-RAM 414 as shown in block 535 by executing the following in each PE:
  • all locations in M-RAM 414 are cleared to zero, so that if M-RAM 414 is implemented with a design that always reads the selected location (even when in write mode), the outputs will not interfere with the additions performed in paragraph 5) below.
  • [0057] 2) Load the correct digit of M1 into S-register 445 through the process described above under the section 'Load Initial Values into the B-RAMs and/or M-RAMs'.
  • M-RAM 414 contains a digit of 1 x M1, location 2 contains the same digit of 2 x M1, location 3 contains the same digit of 3 x M1, etc.
  • 6) Zero S-register 445 and Q-register 424 and write the zero contents of S-register 445 into location 0 of M-RAM 414.
  • A-register 422 and Q-register 424 are set to zero through a control code. In another embodiment, the contents of A-register 422 and Q-register 424 are set to zero by propagating the zero value through the PE chain as are other values in the A- and Q-registers.
  • Block 550 of Fig. 5 covers performing two Montgomery multiplications in alternating clock cycles.
  • the operation within each PE is triggered and controlled by feeding the correct values of a, q and the control codes into PE-0 in the correct sequence, and the rest of the operation is automatic, based on the circuitry of the PEs.
  • each PE performs in the following manner in a particular Montgomery multiplication involving A1, B1, and Q1:
  • Channel selection register 450 is cleared to address the lower banks of B-RAM 412 and M-RAM 414, which contain multiples of B1 and M1.
  • A-register 422 latches a digit of A1 to select a digit of a multiple of B1 in B-RAM 412.
  • Q-register 424 latches a q value to select a digit of a multiple of M1 in M-RAM 414.
  • Control logic 410 latches a control code to control the logic elements of PE 210 during the current clock cycle. All three values are received from the PE to the right (or from MM controller 220 in the case of PE-0) and are passed on to the PE to the left on the following clock cycle.
  • S+B Adder 430 the selected location of B-RAM 412 is added to the current contents of the S-register in the PE to the left.
  • Carry bits are propagated from right to left using the Carry-In-1 input and the Carry-Out-1 output so that S+B Adder 430 of the current PE acts in concert with the S+B Adders of the other PEs to add the value of a selected multiple of B1 to a right-shifted (by one digit) value of an interim result in the S registers.
  • S+B+M Adder 440 uses propagating carry bits at Carry-In-2 and Carry-Out-2 to perform a larger addition in concert with the S+B+M Adders of the other PEs.
  • the left-hand input of first multiplexer 435 is selected to add the selected multiple of M1 from M-RAM 414 to the aforementioned output of S+B Adder 430.
  • the sum is latched into S-register 445 as the new interim result, completing the operation that was defined by the control code of the current clock cycle.
  • Channel selection register 450 is set to a '1' to select the upper banks of B-RAM 412 and M-RAM 414, which contain the same digit of multiples of B2 and M2.
  • a digit of A2 is latched into A-register 422, a corresponding value of q is latched into Q-register 424, and a control code for this operation is latched into Control logic 410.
  • the value received from the S-register in the PE to the left is the value that was generated in the previous cycle when the PE to the left was working on the multiplication involving A2, B2, and M2, so the correct values for this particular multiplication are maintained.
  • In the next clock cycle, the PE returns to working on the multiplication involving A1, B1, and M1, using new values for A-register 422, Q-register 424, and the control code.
  • the value in S-register 445 is a digit of the final result of the first Montgomery multiplication.
  • the value in S-register 445 is a digit of the final result of the second Montgomery multiplication.
  • the contents of S-Register 445 are loaded into R-register 460 through the right-hand input of multiplexer 455 in every PE, then the contents of all R-registers 460 are passed through each other to the right into MM controller 220 by selecting the left-hand input of the multiplexer 455 in every PE.
  • R-register 460 and second multiplexer 455 are not included in the PEs, and the result is passed to the right through the S-registers of every PE using the S-In and S-Out connections, in much the same manner as original parameters were loaded as described above under 'Load Initial Values into the B-RAMs and/or M-RAMs'.
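The arithmetic that the chain of PEs carries out digit by digit can be summarised in software. The following is a minimal sketch of a standard radix-16 Montgomery multiplication on whole integers, not a model of the patented circuit: it ignores the per-PE carry chains and the clock-level timing, computes each q from the low digit of the running sum (the text only says that q values are determined in PE-0), and uses illustrative names throughout. As is usual for Montgomery multiplication, the value returned is A x B scaled by the inverse of 16 raised to the number of digits, modulo M.

```python
DIGIT_BITS = 4
RADIX = 1 << DIGIT_BITS

def montgomery_multiply(a: int, b: int, m: int, n_digits: int) -> int:
    """Return a * b * RADIX**(-n_digits) mod m, for odd m, a < RADIX**n_digits, b < m."""
    if m % 2 == 0:
        raise ValueError("modulus must be odd")
    # q is chosen so that the low digit of s + q*m is zero (needs Python 3.8+).
    m_neg_inv = (-pow(m, -1, RADIX)) % RADIX
    multiples_b = [k * b for k in range(RADIX)]   # what the B-RAMs hold
    multiples_m = [k * m for k in range(RADIX)]   # what the M-RAMs hold
    s = 0
    for i in range(n_digits):
        a_digit = (a >> (DIGIT_BITS * i)) & (RADIX - 1)
        s += multiples_b[a_digit]                 # role of the S+B adders
        q = (s * m_neg_inv) % RADIX
        s += multiples_m[q]                       # role of the S+B+M adders
        s >>= DIGIT_BITS                          # low digit is zero: shift right one digit
    # s stays below b + m, so a single conditional subtraction is enough.
    return s - m if s >= m else s
```

Looking up `multiples_b[a_digit]` and `multiples_m[q]` mirrors the way the A- and Q-registers select pre-calculated multiples instead of multiplying during the operation.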

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Complex Calculations (AREA)

Abstract

A linear systolic array Montgomery multiplier circuit that concurrently processes two separate Montgomery multiplications on alternate clock cycles, without a requirement to have any common parameters between the two multiplications. Multiples of two different parameters are stored in storage elements for each multiplication. Two sets of these multiples, one set for each of the two multiplications, are stored in separate storage banks and accessed on alternate clock cycles by each processing element in the array. Two sequences of control codes for the two multiplications are interleaved as they are fed into the first processing element.

Description

MONTGOMERY MULTIPLIER WITH DUAL INDEPENDENT CHANNELS
BACKGROUND OF THE INVENTION
1. Field of the Invention
[0001] The invention pertains generally to computers. In particular, it pertains to a linear systolic array Montgomery multiplier.
2. Description of the Related Art
[0002] Exponentiation of large numbers has many uses, including public key encryption such as the Rivest-Shamir-Adleman (RSA) algorithm for data encryption and decryption. A common approach to performing the exponentiation, known as the 'square and multiply' technique, performs a square operation (multiplying the accumulated result by itself) for each bit in the exponent, and a 'multiply' operation (multiplying the accumulated result by a base number) for every '1' bit in the exponent. Assuming an equal number of '1's and '0's in the average exponent, a typical exponentiation using 1024-bit numbers requires over 1500 operations, with each operation involving 1024-bit numbers.
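As an illustration of the operation count given above, the following is a minimal Python sketch of left-to-right square-and-multiply. It is not part of the disclosed circuit: the function name `square_and_multiply` is hypothetical, and each modular multiplication is simply delegated to Python's built-in integer arithmetic rather than to a Montgomery multiplier.

```python
def square_and_multiply(base, exponent, modulus):
    """Left-to-right binary exponentiation (illustrative sketch only).

    Returns (result, squares, multiplies) so the operation count
    can be inspected alongside the result.
    """
    result = 1
    squares = 0
    multiplies = 0
    for bit in bin(exponent)[2:]:          # most significant bit first
        # One square for every bit of the exponent.
        result = (result * result) % modulus
        squares += 1
        if bit == "1":
            # One extra multiply by the base for every '1' bit.
            result = (result * base) % modulus
            multiplies += 1
    return result, squares, multiplies
```

For a 1024-bit exponent with about half of its bits set, the counters come to roughly 1024 squares plus 512 multiplies, which is consistent with the estimate of over 1500 operations.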
[0003] Montgomery multipliers are frequently used to perform the RSA and similar algorithms more efficiently by using a transform. Montgomery multipliers perform a transform of the operation needed for exponentiation by performing the operation A x B mod M (the remainder of A times B divided by M), which for large numbers is much more efficient than a direct approach to performing the math. Some Montgomery multipliers use a linear systolic array, i.e., a chain of identical processing elements (PEs), with each PE working on a portion (typically four bits) of each of the large numbers involved. The chain contains enough PEs to hold the largest of the numbers involved, including interim results. Carries and other interim values of the operation are fed in both directions between adjacent PEs.
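To make the digit-at-a-time nature of the computation concrete, here is a minimal software sketch of a radix-16 Montgomery multiplication. It is an assumption-laden illustration, not the disclosed hardware: the names `montgomery_multiply`, `m_prime`, and `num_digits` are hypothetical, the final conditional subtraction is a common software convention, and Python 3.8 or later is assumed for the modular inverse via `pow`. As Montgomery multiplication is commonly defined, the result carries an extra factor of 16^(-num_digits) modulo M, which is what the transform mentioned above accounts for.

```python
DIGIT_BITS = 4                 # one hexadecimal digit per step (assumed)
RADIX = 1 << DIGIT_BITS        # 16

def montgomery_multiply(a, b, m, num_digits):
    """Return a * b * RADIX**(-num_digits) mod m (illustrative sketch).

    Assumes m is odd and a, b < m < RADIX**num_digits.
    """
    # m_prime satisfies m * m_prime == -1 (mod RADIX).
    m_prime = (-pow(m, -1, RADIX)) % RADIX
    s = 0                                   # running interim result
    for i in range(num_digits):
        a_digit = (a >> (DIGIT_BITS * i)) & (RADIX - 1)
        s += a_digit * b                    # add a multiple of B
        q = (s * m_prime) % RADIX           # chosen so the low digit cancels
        s = (s + q * m) >> DIGIT_BITS       # add a multiple of M, drop low digit
    if s >= m:                              # s < b + m < 2m at this point
        s -= m
    return s
```

Each loop iteration adds one multiple of B and one multiple of M (each between 0 and 15 times the parameter) to the interim result, which is why storing precomputed multiples of B and M is attractive in a hardware implementation.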
[0004] Because of the clocked chain design of a linear systolic array Montgomery multiplier (LSAMM), each PE processes data for one clock cycle, then waits for one clock cycle to receive interim values from the adjacent PEs. Adjacent PEs are one clock cycle out of sync, i.e., the odd-numbered PEs are processing while the even-numbered PEs are waiting, and vice-versa. This means that each PE is idle half the time, and those idle cycles represent wasted resources. The idle cycles, which can be considered a separate channel, can be used to perform another operation. However, in a conventional LSAMM circuit, two of the three parameters (e.g., B and M) must be the same in both channels. With this limitation, the conventional approach to utilizing some of these wasted cycles in a square-and-multiply operation is to perform the squares for an exponentiation in one channel, and to perform the multiplies for the same exponentiation in the alternate channel. Since the average exponent contains a '1' in approximately half the bit positions, only half the cycles in the alternate channel are used for multiplies, while the remaining cycles in that channel, about 25% of the total cycles in both channels, remain idle and wasted.
BRIEF DESCRIPTION OF THE DRAWINGS
[0005] The invention may best be understood by referring to the following description and accompanying drawings that are used to illustrate embodiments of the invention. In the drawings:
[0006] Fig. 1 shows a system according to one embodiment of the invention.
[0007] Fig. 2 shows a linear systolic array Montgomery multiplier according to one embodiment of the invention.
[0008] Fig. 3 shows a chart of two Montgomery multiplications propagating through a linear systolic array Montgomery multiplier according to one embodiment of the invention.
[0009] Fig. 4 shows a schematic of a processing element according to one embodiment of the invention.
[0010] Fig. 5 shows a flow chart of a method according to one embodiment of the invention.
DETAILED DESCRIPTION OF THE INVENTION
[0011] In the following description, numerous specific details are set forth to provide a thorough understanding of the invention. However, it is understood that the invention may be practiced without these specific details. In other instances, well-known circuits, structures and techniques have not been shown in detail in order not to obscure the invention.
[0012] Various portions of the description refer to parts of the invention with the terms 'right', 'left', 'right-hand', 'left-hand', 'right-most', or 'left-most'. These terms refer to relative orientation as shown in the figures, and should not be interpreted as limitations on the physical implementation of the invention.
[0013] Various embodiments use a linear systolic array Montgomery multiplier (LSAMM) that can perform separate operations on two channels, without a requirement that two of the parameters be the same on both channels. In one embodiment, one channel is used to perform both squares and multiplies for a first operation, while the other channel is used to perform both squares and multiplies for a second operation. Although each operation can take 50% longer than it would using one channel for squaring and the alternate channel for multiplying in the same operation, two operations may be performed at once so that total throughput is greater than in a conventional LSAMM.
[0014] The invention may be implemented in hardware, software, or firmware. The invention may also be implemented as instructions stored on a machine-readable medium, which can be read and executed by at least one processor to perform the operations described herein. A machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer). For example, a machine-readable medium may include read only memory (ROM); random access memory (RAM); magnetic disk storage media; optical storage media; flash memory devices; electrical, optical, acoustical or other form of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.), and others.
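The 25% idle figure for the conventional arrangement and the 50% figure for dual-channel operation can be checked with a short calculation. The sketch below is a simplified model, not a statement of measured performance: it assumes every Montgomery multiplication occupies one equal-length slot on a channel, that exactly half the exponent bits are '1', and that in the conventional arrangement each multiply fully overlaps a square. The function name `compare_throughput` is hypothetical.

```python
from fractions import Fraction

def compare_throughput(exponent_bits, ones_fraction=Fraction(1, 2)):
    """Compare conventional and dual-independent channel use (simplified model).

    Returns (conventional_idle_fraction, per_operation_slowdown, throughput_gain).
    """
    squares = Fraction(exponent_bits)
    multiplies = squares * ones_fraction

    # Conventional: squares on one channel, multiplies on the other,
    # so one exponentiation takes as long as its squares.
    conventional_time = squares
    conventional_idle = (squares - multiplies) / (2 * squares)

    # Dual independent channels: each channel performs the squares and
    # multiplies of its own exponentiation one after another.
    dual_time = squares + multiplies

    slowdown = dual_time / conventional_time            # per exponentiation
    throughput_gain = (2 * conventional_time) / dual_time  # two at once
    return conventional_idle, slowdown, throughput_gain
```

Under these assumptions the model gives an idle fraction of 1/4, a per-operation slowdown of 3/2, and a throughput gain of 4/3 for two concurrent exponentiations.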
[0015] Fig. 1 shows a system according to one embodiment of the invention. In the illustrated embodiment of Fig. 1, system 100 includes input/output (I/O) logic 130 coupled to a processor (CPU) 110, an accelerated graphics processor (AGP) 120, a memory 140, an LSAMM 150, and an I/O bus 170 coupled to various I/O devices not shown. In the embodiment of Fig. 1, the LSAMM 150 is connected to memory bus 160, where the LSAMM 150 can be addressed by the CPU 110 and/or other devices as a block of memory, but various other embodiments have the LSAMM 150 connected to the system in other ways. In the embodiment of Fig. 1, any device that can write two sets of parameters (e.g., A1, B1, M1, and A2, B2, M2) to the LSAMM 150 and read two final results (R1, R2) from the LSAMM 150 can initiate two concurrent Montgomery multiplications in the LSAMM 150. Although the terms 'A', 'B', and 'M' are used throughout this disclosure to represent the parameters in a Montgomery multiplication, these are generic designations. Any other terms may be used without departing from the invention.
[0016] Fig. 2 shows a linear systolic array Montgomery multiplier according to one embodiment of the invention. In the illustrated embodiment of Fig. 2, a chain of serially-connected PEs is used to perform two concurrent Montgomery multiplications, one Montgomery multiplication with a first set of parameters A1, B1, M1, and the other Montgomery multiplication with a second set of parameters A2, B2, M2, with each parameter being distributed through the PEs at one digit per PE. In the context of the invention, a digit is defined as the number of bits of each parameter processed by each PE. In example embodiments used throughout this disclosure, a digit is a four-bit hexadecimal number, and each PE operates on four bits of A and/or four bits of B and/or four bits of M at a time in a given operation. In other embodiments, digits of other sizes are used, and the distribution of bits in the PEs differs from the examples accordingly.
[0017] The illustrated embodiment of Fig. 2 shows a chain with a quantity of (N+3) PEs, numbered 210-0 through 210-(N+2), and labeled as PE-0 through PE-(N+2). In one embodiment, the LSAMM 150 operates on 1024-bit parameters, the value of N is 256, and the number of PEs is 259, including 256 PEs to hold 256 hexadecimal digits of the parameters, and 3 additional PEs to accommodate interim results during processing. Montgomery multiplier (MM) controller 220 controls the process by transferring parameters A1, B1, M1, and A2, B2, M2 into the chain. In one embodiment, each parameter is passed from MM controller 220 to PE-0 one digit at a time and propagates through the chain of PEs from right to left. MM controller 220 also passes sets of control codes C1, C2 and sets of other values Q1, Q2, which are described later, into PE-0 for propagation through the chain of PEs.
[0018] In one embodiment, the chain of PEs is set up for a Montgomery multiplication by sequentially passing each parameter B1, B2, M1, and M2, one digit at a time, from MM controller 220 into the chain of PEs until each parameter is distributed throughout PE-0 through PE-N. In one embodiment, multiples of each parameter B1, B2, M1, M2 are calculated within the PEs and stored within the PEs. To perform the actual Montgomery multiplications, parameter A1 is then sequentially passed one digit at a time to the left from MM controller 220 to PE-0, PE-1, PE-2, etc., in alternating clock cycles, to operate on the stored multiples of B1 and M1. In the intervening clock cycles, parameter A2 is sequentially passed one digit at a time to the left from MM controller 220 to PE-0, PE-1, PE-2, etc., to operate on stored multiples of B2 and M2. When all digits of A1 and A2 have passed through the chain of PEs, the multiplication operations are complete, with the results R1 and R2 residing in the PEs. If the Montgomery multiplications are finished, the results R1 and R2 are then passed to the right through the PEs and into MM controller 220. In one embodiment, if M1, M2 remain unchanged and if the results R1, R2 provide the values of B1, B2 for the next Montgomery multiplications (as is often the case in RSA operations), multiples of R1, R2 are calculated and stored in the PEs as multiples of B1, B2 without being shifted into the MM controller 220. New values of A1, A2 are then shifted through the PEs as before to perform the next pair of Montgomery multiplications.
[0019] In one embodiment, if only one Montgomery multiplication is to be performed, the second channel is provided with 'no-operation' control codes so that no multiplication will take place in that channel, and the other parameters for the non-operating channel do not have to be loaded. In another embodiment, dummy values can be used for parameters on the unused channel, and the results on that channel ignored.
[0020] In the illustrated embodiment, parameters A1, A2, B1, B2, M1, M2, and results R1, R2 each represent a large number that is distributed through the chain of PEs, one digit per PE. Control codes are each small enough to fit into a single PE, and each control code directs the PE in which it is currently located to perform an operation on the values currently contained in that PE during a given clock cycle. C1 in Fig. 2 represents multiple control codes for channel 1, the individual control codes generically designated herein as c1, with each c1 being fed into PE-0 and propagated up the chain of PEs along with the digits of the associated parameters A1 and/or B1 and/or M1, to control the operation of the PE in which c1 resides in any given clock cycle. Similarly, C2 represents multiple control codes c2 for channel 2. Q1 and Q2 in Fig. 2 also represent multiple parameters q1 and q2, each small enough to fit into each PE, and are passed through the PEs to the left along with the digits of the associated parameters to further define the operation in each PE. For each type of operation, C1 is a predetermined sequence of values c1, and C2 is a predetermined sequence of values c2. For some types of operations Q1 is a predetermined sequence of values q1, and Q2 is a predetermined sequence of values q2. For other types of operations, the various values of q1 and q2 are determined in PE-0 and propagated through the chain from PE-0.
[0021] During operation of the LSAMM 150, each PE passes information to adjacent PEs in both directions. In one embodiment, each PE performs an internal operation on channel 1 during one clock cycle as specified by that PE's current values of c1, q1 and the digit of A1, then performs another internal operation on channel 2 during the next clock cycle as specified by the PE's current values of c2, q2, and the digit of A2. In one embodiment, the even-numbered PEs perform operations in a first Montgomery multiplication in the odd-numbered clock cycles and perform operations in a second Montgomery multiplication in the even-numbered clock cycles, while the odd-numbered PEs perform operations in the first Montgomery multiplication in the even-numbered clock cycles and perform operations in the second Montgomery multiplication in the odd-numbered clock cycles. In one embodiment, these alternating cycles are created by the MM controller 220 by providing PE-0 the digits of A1, c1, and q1 on even-numbered clock cycles, and providing PE-0 the digits of A2, c2, and q2 on odd-numbered clock cycles.
Alternate cycle processing
[0022] Fig. 3 shows a chart of two Montgomery multiplications propagating through a linear systolic array Montgomery multiplier according to one embodiment of the invention. A single Montgomery multiplication can require hundreds of PEs and thousands of clock cycles to complete. To avoid making Fig. 3 overly complex, the chart shows only the first few PEs and clock cycles. In the illustrated embodiment of Fig. 3, the dark areas represent the clock cycles in which the indicated PEs perform work on a first operation, and are labeled as channel 1. The crosshatched areas represent the clock cycles in which the indicated PEs perform work on a second operation, and are labeled as channel 2. As can be seen, for any specific PE, the clock cycles of channel 1 are interleaved with the clock cycles of channel 2. Fig. 3 shows parameters in the format 'xYz', in which x represents a digit of A, Q, or C as shown, Y represents channel 1 or channel 2, and z represents which digit (0 - N) of the indicated parameter is being supplied.
[0023] The operations shown by Fig. 3 are described as follows:
[0024] 1) At the beginning of clock cycle 1, the first digit of A1 (a1₀), the associated parameter q1₀, and the associated control code c1₀ are provided by MM controller 220 to PE-0. During clock cycle 1, PE-0 performs an internal operation defined by control code c1₀, and which may further be defined by a1₀ and q1₀.
[0025] 2) At the beginning of clock cycle 2, a1₀, q1₀, and c1₀ are passed to PE-1 by PE-0, and PE-1 performs the same operation that was performed by PE-0 in clock cycle 1, although the internal data on which it operates may be different. Also at the beginning of clock cycle 2, the first digit of A2 (a2₀), the associated parameter q2₀, and the associated control code c2₀ are provided by MM controller 220 to PE-0. During clock cycle 2, PE-0 performs an internal operation defined by control code c2₀, and which may further be defined by a2₀ and q2₀.
[0026] 3) At the beginning of clock cycle 3: 1) a1₀, q1₀, and c1₀ are passed to PE-2 by PE-1 to perform the operation defined by those values in PE-2, 2) a2₀, q2₀, and c2₀ are passed to PE-1 by PE-0 to perform the operation defined by those values in PE-1, and 3) a1₁, q1₁, and c1₁ are provided by MM controller 220 to PE-0 to perform a new operation defined by those values in PE-0.
[0027] 4) In subsequent clock cycles, this process continues, with the specific values of a, q, and c being propagated through the chain of PEs to perform the operation defined by those specific values in each PE. Although the specific values of a, q, and c are passed from PE to PE unchanged, each PE has internal data to be operated upon by those specific values, and the internal data may be different in each PE.
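The propagation pattern in steps 1) through 4) can be tabulated with a few lines of Python. This is only a bookkeeping sketch of which channel and digit index occupy each PE in each clock cycle, following the cycle numbering used in the steps above; the function name `schedule` and the tuple layout are hypothetical, and no arithmetic is modeled.

```python
def schedule(num_pes, num_cycles):
    """Tabulate which (channel, digit_index) each PE works on per clock cycle.

    Returns a dict mapping cycle number (starting at 1) to a list with one
    entry per PE (index 0 is PE-0); None means the PE has nothing yet.
    """
    table = {}
    for cycle in range(1, num_cycles + 1):
        row = []
        for pe in range(num_pes):
            # Item k entered PE-0 in cycle k + 1 and moves one PE left per cycle.
            k = cycle - 1 - pe
            if k < 0:
                row.append(None)
            else:
                channel = 1 if k % 2 == 0 else 2   # channels alternate at PE-0
                row.append((channel, k // 2))      # digit index within channel
        table[cycle] = row
    return table
```

For three PEs and three cycles, the third row places digit 1 of channel 1 in PE-0, digit 0 of channel 2 in PE-1, and digit 0 of channel 1 in PE-2, matching step 3).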
Processing Element
[0028] Fig. 4 shows a schematic of a processing element according to one embodiment of the invention. In the illustrated embodiment of Fig. 4, PE 210 includes two storage elements (B-RAM 412 and M-RAM 414), and processing logic that includes PE control logic 410, two address registers (Q-register 424 and A-register 422), two adders (S+B Adder 430 and S+B+M Adder 440), two multiplexers (first multiplexer 435 and second multiplexer 455), two carry registers (Carry-1-register 432 and Carry-2-register 442), an accumulation register (S-register 445), a Channel selection register 450, and a results register (R-register 460). Although a single PE 210 is described, in one embodiment PE 210 is generic to every PE 210-x in the chain. In the illustrated embodiment, connections shown at the bottom of Fig. 4 are common to multiple PEs, connections shown to the right interface with the PE to the right, and connections shown to the left interface with the PE to the left, with outputs from one PE connected to similarly-named inputs of the adjacent PE. Exceptions are PE-0, which interfaces to MM controller 220 on the right, and PE-(N+2), which has no PE to its left.
[0029] In one embodiment Clk, Carry-In-1, Carry-Out-1, Carry-In-2, Carry-Out-2, Chnl-In, Chnl-Out, and all internal connections to propagate those signals contain one bit each, while Cntl-In and Cntl-Out contain the number of bits necessary to identify each of the various control codes. All the remaining connections shown in Fig. 4 contain the number of bits being processed by each PE, such as four bits each for the illustrated embodiment. In one embodiment, each PE also includes other inputs and outputs as necessary, e.g., a Reset input (not shown).
[0030] In one embodiment, the various logic elements of Fig. 4 perform the following operations: Control logic 410 latches a control code received from the PE to the right, uses that control code to control the logic elements of the present PE during a current clock cycle, and then passes the control code to the PE to the left. Storage element B-RAM 412 is used to store one digit of each multiple of B that is stored in the PE chain, while storage element M-RAM 414 is used to store one digit of each multiple of M that is stored in the PE chain. A-register 422 and Q-register 424 hold the addresses that select the desired locations within B-RAM 412 and M-RAM 414, respectively (both for reading and for writing), and also pass these addresses to the PE to the left. S+B Adder 430 is used to add the contents of a selected location in B-RAM 412 to the contents of the S-register in the PE to the left, including any carry bit received through the Carry-In-1 input from the S+B Adder in the PE to the right. Carry-1-Register 432 latches any carry bit from S+B Adder 430 and provides it as a carry bit to the S+B Adder in the PE to the left during the next clock cycle. When the left-hand input of first multiplexer 435 is selected, S+B+M Adder 440 adds the output of S+B Adder 430 to the contents of a selected location in M-RAM 414. When the right-hand input of first multiplexer 435 is selected, S+B+M Adder 440 adds the contents of S-Register 445 to the contents of the selected location in M-RAM 414. Any received carry bit is provided from the PE to the right through the Carry-In-2 input, and any generated carry bit is latched into Carry-2-Register 442 for use by the PE to the left in the next clock cycle. The output of S+B+M Adder 440 is latched into S-register 445, which acts as an accumulation register for interim results. The output of S-register 445 is distributed to each of B-RAM 412, M-RAM 414, first multiplexer 435, and the S-Out output for use by the PE to the right. R-register 460 latches the output of S-register 445 if the right-hand input of second multiplexer 455 is selected, and otherwise latches the contents of the R-register in the PE to the left. Channel selection register 450 is coupled to an address bit of both B-RAM 412 and M-RAM 414 to select either a first bank of addresses or a second bank of addresses in both storage elements. Within each bank, A-register 422 and Q-register 424 select specific locations as described above. In one embodiment, the channel selection values are propagated from right to left through the Channel selection registers of the PEs along with the values in the A- and Q-registers. In one embodiment, the channel selection values are part of the control codes.
[0031] In the embodiment of Fig. 4, Clk is used to latch data into the Control logic, into the Q-, A-, S-, R-, Carry-1-, Carry-2-, and Channel selection registers, and to clock write operations in the B- and M-RAMs, while both adders, both multiplexers, and the read operations in the B- and M-RAMs are combinatorial, i.e., any change at an input is propagated through to the logic element's output regardless of clock status. In another embodiment, the B- and M-RAMs use a clocked input for read as well as write operations. In one embodiment, clock speed is chosen so that worst-case combinatorial delays in PE 210 are less than one clock cycle. Specific connections from the Clk input to other circuit elements are not shown in Fig. 4 to avoid making the figure overly complex.
[0032] Control logic 410 contains the logic necessary to control the operation of PE 210, based on control codes received through Cntl-In. In one embodiment, control logic 410 includes a decoder circuit to convert the control code to necessary control signals. In another embodiment, the control code is simply latched, with each bit of the control code specifying a particular control signal. In one embodiment, the control codes specify operations that include but are not limited to: selecting one of the two inputs of first multiplexer 435, selecting one of the two inputs of second multiplexer 455, writing to B-RAM 412, writing to M-RAM 414, resetting one or more of the A, Q, S and R registers, and inhibiting the clock signal to various logic elements.
[0033] Because a Montgomery multiplication operates with multiples of B and M, one embodiment pre-calculates the multiples within the PEs, using the same logic that is used for the Montgomery multiplication. In the illustrated embodiment of Fig. 4, the storage elements include random access memories (RAM), labeled B-RAM and M-RAM to indicate the parameters being stored. Even though the terms 'B-RAM' and 'M-RAM' are used throughout the disclosure, in some embodiments, types of storage elements other than RAM are used. Collectively, all the B-RAMs 412 in the PE chain provide a first bank of storage space for values of (0 x B1), (1 x B1), (2 x B1), etc., and a second bank of storage space for values of (0 x B2), (1 x B2), (2 x B2), etc. In one embodiment in which each PE operates on a hexadecimal digit, B-RAM 412 includes 32 4-bit storage locations, 16 locations to hold the digits of (0 x B1) through (15 x B1) and another 16 locations to hold corresponding digits of (0 x B2) through (15 x B2). Similarly in the same embodiment, M-RAM 414 includes 32 4-bit storage locations to hold corresponding digits of (0 x M1) through (15 x M1) and (0 x M2) through (15 x M2). In an embodiment in which each PE processes a number of bits other than four, the number of locations in each RAM is changed accordingly to address and store the required number of multiples for each set of parameters.
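The two-bank layout can be pictured as a 32-entry array in which the channel selection bit supplies one address bit and the A- or Q-register value supplies the remaining four. The sketch below assumes the channel selection bit is the most significant address bit, so that the second bank sits directly above the first; the disclosure does not state which address bit is used, and the names `ram_address` and `b_ram` are hypothetical.

```python
DIGIT_BITS = 4                          # hexadecimal digits (illustrated embodiment)
LOCATIONS_PER_BANK = 1 << DIGIT_BITS    # 16 multiples: 0 x B through 15 x B

def ram_address(channel_select, register_value):
    """Combine the channel bit and a 4-bit register value into a RAM address.

    Assumption: the channel bit is the top address bit, so channel 0 uses
    locations 0-15 and channel 1 uses locations 16-31.
    """
    if channel_select not in (0, 1):
        raise ValueError("channel_select must be 0 or 1")
    if not 0 <= register_value < LOCATIONS_PER_BANK:
        raise ValueError("register_value out of range")
    return (channel_select << DIGIT_BITS) | register_value

# One PE's B-RAM: 32 four-bit locations, two banks of 16.
b_ram = [0] * (2 * LOCATIONS_PER_BANK)
```

With this layout, switching channels on alternate clock cycles changes only one address bit, while the register value continues to select the multiple within the chosen bank.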
[0034] In some embodiments, a single value of a given parameter is always used for both Montgomery multiplications. In these embodiments, the corresponding storage element provides storage for multiples of only the single value of the parameter, and the connection from Channel selection register 450 to that storage element is eliminated so that both Montgomery multiplications read from the same bank of multiples. In one such embodiment, M-RAM 414 contains sixteen locations to store a digit of the multiples (0 x M) through (15 x M), and Channel selection register 450 does not control an address input line to M-RAM 414. An embodiment that is designed to hold two independent values of M can also be used for applications that have a single value of M by making Ml and M2 have the same value.
[0035] In one embodiment, Channel selection register 450 is a latch containing a one-bit selection value that is propagated through the PEs along with control codes and values in the A- and Q-registers. When the one-bit selection value is in one state it selects the first bank in B-RAM 412 and in M-RAM 414, and when the one-bit selection value is in another state it selects the second bank in B-RAM 412 and M-RAM 414. Thus Channel selection register 450 can select the bank of values for the channel that is operable in a given PE during a given clock cycle.
[0036] Fig. 5 shows a flow chart of a method according to one embodiment of the invention. The illustrated embodiment of Fig. 5 sets up two Montgomery multiplications in blocks 510-545, performs the two Montgomery multiplications concurrently in blocks 550, 560, and propagates the results out of the PEs in block 570.
[0037] The logic of PE 210 can be used in various ways, depending on the operation being performed at the time. In one embodiment, the PEs can perform each of the following, which are described in more detail in the following sections:
[0038] 1) Load initial values into the B-RAMs and/or M-RAMs.
[0039] 2) Pre-calculate multiples of B1 and B2 and store in the B-RAMs.
[0040] 3) Pre-calculate multiples of M1 and M2 and store in the M-RAMs.
[0041] 4) Perform concurrent Montgomery multiplications.
[0042] The following descriptions pertain to both Fig. 4 and Fig. 5.
Load Initial Values into the B-RAMs and/or M-RAMs
[0043] In an LSAMM, each Montgomery multiplication starts with initial values for B and M in the PEs. Under some conditions, the result of one Montgomery multiplication is an initial value for the next Montgomery multiplication, so that the new initial value does not have to be loaded. In one embodiment, at the end of a multiplication S-register 445 contains a digit of the final result, which is then loaded as an initial value for the next Montgomery multiplication into B-RAM 412 (or M-RAM 414) at an address specified by A-register 422 (or Q-register 424) and Channel selection register 450.
[0044] Under other conditions, B-RAM 412 and/or M-RAM 414 require one or more initial values that are not contained in the PE, so the initial values are loaded as shown in blocks 510, 520, 530, and 540 of Fig. 5. In one embodiment, an initial value is propagated through the PEs through the S-registers until each digit is in its proper PE, whereupon the digit is written into the corresponding RAM. With reference to Fig. 4, by zeroing the outputs of B-RAM 412 and M-RAM 414, and selecting the left-hand input of first multiplexer 435, adders 430 and 440 will pass through the value from S-In unchanged and load it into S-register 445. Thus initial values can be propagated into and through the chain of PEs through the S-registers. However, because the S-registers are designed to pass data from left to right, in one embodiment the values begin propagating through the S-registers starting with PE-(N+2). In one embodiment, MM controller 220 has a separate output to S-In of PE-(N+2) and feeds the digits of the initial value directly into PE-(N+2). In another embodiment, MM controller 220 feeds the digits of the initial value into an address register (A- or Q-) of PE-0 and propagates the digits through the chain of PEs from right to left. A loopback circuit then loops the address register output of PE-(N+2) back to the S-In input of PE-(N+2), from where the digits are propagated from left to right through the S-registers as before. When a digit of the initial value is within the S-register of its correct PE, the contents of the S-register are loaded into a specified location of the B- or M-RAM.
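The left-to-right movement of an initial value through the S-registers behaves like a shift register fed at the left-most PE. The following sketch models only that data movement, under simplifying assumptions: one shift per step, no alternate-cycle timing, and unused upper PEs padded with zero digits. The function name `load_through_s_registers` and the list representation (index i stands for PE-i) are hypothetical.

```python
def load_through_s_registers(digits, num_pes):
    """Shift an initial value into the chain through the S-registers (sketch).

    digits[i] is the digit destined for PE-i (least significant first).
    Returns the S-register contents, indexed by PE number.
    """
    # Pad so that every PE, including the extra upper PEs, receives a digit.
    padded = list(digits) + [0] * (num_pes - len(digits))
    s = [0] * num_pes
    # The digit for PE-0 is fed first because it has the furthest to travel.
    for incoming in padded:
        # Every PE latches the S value of the PE to its left (index + 1);
        # the left-most PE latches the newly fed digit from S-In.
        s = s[1:] + [incoming]
    return s
```

After as many shifts as there are PEs, each digit sits in the S-register of its own PE and could then be written into the B- or M-RAM as described above.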
[0045] Regardless of whether the initial values are propagated solely through the S-registers or are propagated through the address registers first, in one embodiment only one initial value is propagated through the PEs at a time, without interleaving, until every digit of the initial value is in its proper PE. Multiples of that digit are then calculated and stored as described in one of the next two sections before the next initial value is propagated into the S-registers. For example, in the illustrated embodiment of Fig. 5, B1 is loaded by propagating the digits of B1 through the S-registers in block 510, and multiples of B1 are calculated and stored at block 515. Then B2 is loaded by propagating the digits of B2 through the S-registers in block 520, and multiples of B2 are calculated and stored in block 525. In a similar manner, if initial values of M1 and M2 are required, M1 and M2 are separately propagated into place and their multiples separately calculated and stored in blocks 530, 535, 540, and 545. Although these parameters are shown being handled in the order B1, B2, M1, M2, in one embodiment the parameters may be handled in any order.
[0046] In another embodiment, the digits of B1 and B2 are interleaved while being concurrently propagated into the PEs and stored in the storage elements, and the digits of M1 and M2 are likewise interleaved while being concurrently propagated into the PEs and stored in the storage elements. The multiples of each parameter are then calculated separately as described in the previous paragraph.
Pre-calculate multiples of B1 and B2 and store in the B-RAMs
[0047] In the illustrated embodiment, a digit of each multiple of B1 is calculated and stored in the B-RAM 412 as shown in block 515 by executing the following in each PE:
[0048] 1) Clear the contents of the first location of the lower bank in B-RAM 412. In one embodiment, this operation is performed by zeroing the contents of S-register 445, zeroing the contents of A-register 422, setting Channel selection register 450 to zero, and setting B-RAM 412 to 'write' so that the zeroes of S-register 445 are written into the first location of the lower bank in B-RAM 412.
[0049] 2) Load the correct digit of B1 into S-register 445 through the process previously described in the section 'Load Initial Values into the B-RAMs and/or M-RAMs'.
[0050] 3) To calculate all multiples of B1, clear Q-register 424, set M-RAM 414 to 'write', and write the digit of B1 from S-register 445 into location 0 of the lower bank of M-RAM 414. M-RAM 414 is a temporary holding place for this value, and can be cleared at the end of the pre-calculation steps.
[0051] 4) Set M-RAM 414 to 'read' and leave Q-register 424 and Channel selection register 450 cleared to continuously read the digit of B1 from M-RAM 414. Set B-RAM 412 to 'write', clear S-register 445 and set A-register 422 to '0'.
[0052] 5) Select the right-hand input of multiplexer 435 so that S+B+M Adder 440 will add the digit of B1 from M-RAM 414 to the current value in S-register 445, and latch that sum as the new value in S-register 445, including the effect of any relevant carry bit received at Carry-In-2. (Any carry bit produced by this addition is latched into Carry-2-Reg 442 for use by the PE to the left.)
[0053] 6) Increment the value in A-register 422 with each new value in S-register 445 so that the changing value in S-register 445 is stored into successive locations 0, 1, 2, 3, etc. in B-RAM 412. After incrementing through all multiples of B1, the result in B-RAM 412 is that location 0 contains a digit of 0 x B1, location 1 contains the same digit of 1 x B1, location 2 contains the same digit of 2 x B1, location 3 contains the same digit of 3 x B1, etc. When this process has been applied to PEs 0 through N, the pre-calculation and storage of multiples of B1 is complete.
[0054] To calculate multiples of B2 and store them in B-RAM 412 as shown in block 525, repeat 1) through 6), but select the upper bank of B-RAM 412 by setting Channel selection register 450 to '1', and load the proper digit of B2 into S-register 445.
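The repeated-addition procedure of steps 1) through 6) can be illustrated in software. The following is a minimal sketch, not the circuit itself: the function names `to_digits`, `fill_bank`, and `read_multiple`, the radix of 16, the four-digit operand width, the extra PE for the overflow digit, and the sample values of B1 and B2 are all assumptions made for illustration. Carries ripple through the whole chain within one loop pass here, whereas the hardware latches each carry for the PE to the left.

```python
def to_digits(value, radix, count):
    """Split a non-negative integer into `count` little-endian digits."""
    digits = []
    for _ in range(count):
        value, digit = divmod(value, radix)
        digits.append(digit)
    return digits


def fill_bank(bram, channel, b_digits, radix):
    """Fill one bank so that bram[pe][channel][j] is digit `pe` of j * B."""
    num_pes = len(bram)
    # Digit of B parked in each PE (the temporary holding place of step 3).
    held = list(b_digits) + [0] * (num_pes - len(b_digits))
    s = [0] * num_pes  # S-register of each PE, cleared first (step 4)
    for j in range(radix):  # A-register counts 0, 1, 2, ... (step 6)
        for pe in range(num_pes):
            bram[pe][channel][j] = s[pe]  # store the current S digit at location j
        carry = 0
        for pe in range(num_pes):  # PE-0 is the rightmost PE (step 5)
            total = s[pe] + held[pe] + carry
            s[pe] = total % radix
            carry = total // radix  # handed to the PE on the left


def read_multiple(bram, channel, j, radix):
    """Reassemble j * B from the digits held at location j of every PE."""
    return sum(bram[pe][channel][j] * radix**pe for pe in range(len(bram)))


RADIX = 16       # assumed digit size of 4 bits
NUM_DIGITS = 4   # assumed operand width in digits
B1, B2 = 0xBEEF, 0x1234  # hypothetical sample parameters

# bram[pe][channel][location]; one extra PE holds the overflow digit.
bram = [[[0] * RADIX for _ in range(2)] for _ in range(NUM_DIGITS + 1)]
fill_bank(bram, 0, to_digits(B1, RADIX, NUM_DIGITS), RADIX)  # lower bank
fill_bank(bram, 1, to_digits(B2, RADIX, NUM_DIGITS), RADIX)  # upper bank
```

In this model, location j of the selected bank in every PE together holds j x B, so a digit of A can later be used directly as the address.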
Pre-calculate multiples of Ml and M2 and store in the M-RAMs
[0055] In the illustrated embodiment, a digit of each multiple of M1 is calculated and stored in the M-RAM 414 as shown in block 535 by executing the following in each PE:
[0056] 1) Clear the contents of the first location of the lower bank in M-RAM 414. In one embodiment, this operation is performed by zeroing the contents of S-register 445, zeroing the contents of Q-register 424, setting the Channel selection register 450 to zero, and setting M-RAM 414 to 'write' so that the zeroes of S-register 445 are written into the first location of the lower bank in M-RAM 414. In one embodiment, all locations in M-RAM 414 are cleared to zero, so that if M-RAM 414 is implemented with a design that always reads the selected location (even when in write mode), the outputs will not interfere with the additions performed in step 5) below.
[0057] 2) Load the correct digit of M1 into S-register 445 through the process described above under the section 'Load Initial Values into the B-RAMs and/or M-RAMs'.
[0058] 3) Clear Q-register 424, and write the digit of M1 from S-register 445 into location 0 of M-RAM 414. Location 0 is a temporary holding place for this value, and can be cleared at the end of the pre-calculation steps.
[0059] 4) Select the right-hand input of multiplexer 435 so that S+B+M Adder 440 will add a value read from M-RAM 414 to the current value in S-register 445, and store that sum as the new value in S-register 445, including the effect of any relevant carry bit received at Carry-In-2. (Any carry bit produced by this addition is latched into Carry-2-Reg 442 for use by the PE to the left.) In this manner, the value in S-register 445 will successively change through the same digit of 1 x M1, 2 x M1, 3 x M1, etc. with each addition.
[0060] 5) Alternate the contents of Q-register 424 between an incrementing counter and zero: 1, 0, 2, 0, 3, 0, etc. When Q-register 424 holds a zero, place M-RAM 414 in a read state to read the value of M1 out of location 0. When Q-register 424 holds one of the incrementing counter values, place M-RAM 414 in a write state to write the accumulated value from S-register 445 into that location. In this manner, the digit of M1 is read from location 0 in M-RAM 414 and added to the accumulated multiple of M1 in S-register 445, including the effect of any received carry bit. The sum is then written to a location in M-RAM 414 that increments with each write operation. The result in M-RAM 414 is that location 1 contains a digit of 1 x M1, location 2 contains the same digit of 2 x M1, location 3 contains the same digit of 3 x M1, etc.
[0061] 6) Zero S-register 445 and Q-register 424 and write the zero contents of S-register 445 into location 0 of M-RAM 414. When the process has been applied to PEs 0 through N, pre-calculation and storage of multiples of M1 is complete.
[0062] To calculate multiples of M2 and store them in M-RAM 414 as shown in block 545, repeat 1) through 6), but select the upper bank of M-RAM 414 by setting Channel selection register 450 to '1', and load the proper digit of M2 into S-register 445.
[0063] In one embodiment for implementing the foregoing operations, the contents of A-register 422 and Q-register 424 are set to zero through a control code. In another embodiment, the contents of A-register 422 and Q-register 424 are set to zero by propagating the zero value through the PE chain, as are other values in the A- and Q-registers.
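The alternating write/read pattern of the Q-register in steps 3) through 6) can be modeled for a single bank as follows. This is a hedged sketch under stated assumptions: the function name `precompute_m_bank`, the radix of 16, the extra PE for the overflow digit, and the sample modulus are hypothetical, and carries are rippled immediately instead of being latched per PE.

```python
def precompute_m_bank(m_digits, radix):
    """Return mram[pe][j] = digit `pe` of j * M, built with one RAM per PE."""
    num_pes = len(m_digits) + 1   # assumed extra PE for the overflow digit
    s = list(m_digits) + [0]      # step 2: S-registers hold the digits of M
    mram = [[0] * radix for _ in range(num_pes)]
    for pe in range(num_pes):
        mram[pe][0] = s[pe]       # step 3: park the digit of M in location 0
    for q in range(1, radix):
        for pe in range(num_pes):
            mram[pe][q] = s[pe]   # Q = q: write the accumulated digit
        carry = 0
        for pe in range(num_pes):  # Q = 0: read location 0 and add it to S
            total = s[pe] + mram[pe][0] + carry
            s[pe] = total % radix
            carry = total // radix  # passed to the PE on the left
    for pe in range(num_pes):
        mram[pe][0] = 0           # step 6: location 0 ends up holding 0 x M
    return mram


RADIX = 16    # assumed digit size of 4 bits
M1 = 0xF123   # hypothetical sample modulus
m1_digits = [(M1 // RADIX**i) % RADIX for i in range(4)]
mram = precompute_m_bank(m1_digits, RADIX)
```

Unlike the B-RAM procedure, the same memory serves as both the source of the addend and the destination of the sums, which is why location 0 must be cleared only at the end.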
Perform Concurrent Montgomery Multiplications
[0064] Block 550 of Fig. 5 covers performing two Montgomery multiplications in alternating clock cycles. In one embodiment, the operation within each PE is triggered and controlled by feeding the correct values of a, q and the control codes into PE-0 in the correct sequence, and the rest of the operation is automatic, based on the circuitry of the PEs. In the illustrated embodiment of Fig. 4, each PE performs in the following manner in a particular Montgomery multiplication involving A1, B1, and M1: Channel selection register 450 is cleared to address the lower banks of B-RAM 412 and M-RAM 414, which contain multiples of B1 and M1. A-register 422 latches a digit of A1 to select a digit of a multiple of B1 in B-RAM 412, Q-register 424 latches a q value to select a digit of a multiple of M1 in M-RAM 414, and Control logic 410 latches a control code to control the logic elements of PE 210 during the current clock cycle. All three values are received from the PE to the right (or from MM controller 220 in the case of PE-0) and are passed on to the PE to the left on the following clock cycle. Using S+B Adder 430, the contents of the selected location of B-RAM 412 are added to the current contents of the S-register in the PE to the left. Carry bits are propagated from right to left using the Carry-In-1 input and the Carry-Out-1 output so that S+B Adder 430 of the current PE acts in concert with the S+B Adders of the other PEs to add the value of a selected multiple of B1 to a right-shifted (by one digit) value of an interim result in the S-registers. In a similar manner, S+B+M Adder 440 uses propagating carry bits at Carry-In-2 and Carry-Out-2 to perform a larger addition in concert with the S+B+M Adders of the other PEs. The left-hand input of first multiplexer 435 is selected to add the selected multiple of M1 from M-RAM 414 to the aforementioned output of S+B Adder 430. The sum is latched into S-register 445 as the new interim result, completing the operation that was defined by the control code of the current clock cycle.
[0065] In the following clock cycle, a similar process is followed for the multiplication involving A2, B2, and M2, with these differences: Channel selection register 450 is set to a '1' to select the upper banks of B-RAM 412 and M-RAM 414, which contain the same digit of multiples of B2 and M2. A digit of A2 is latched into A-register 422, a corresponding value of q is latched into Q-register 424, and a control code for this operation is latched into Control logic 410. The value received from the S-register in the PE to the left is the value that was generated in the previous cycle when the PE to the left was working on the multiplication involving A2, B2, and M2, so the correct values for this particular multiplication are maintained.
[0066] In the next clock cycle, the PE returns to working on the multiplication involving A1, B1, and M1, using new values for A-register 422, Q-register 424, and the control code. When all digits of A1 have propagated through the PE, the value in S-register 445 is a digit of the final result of the first Montgomery multiplication. One cycle later, when all digits of A2 have propagated through the PE, the value in S-register 445 is a digit of the final result of the second Montgomery multiplication. When all digits of A1 and A2 have propagated through all PEs, both Montgomery multiplications are complete as determined at block 560 of Fig. 5.
[0067] In a series of consecutive Montgomery multiplications, if the results R1, R2 are to be used as the new values of B1, B2 in the next Montgomery multiplications, the digits of each result in S-register 445 are loaded into B-RAM 412 in two consecutive clock cycles as digits of B1, B2, and multiples are calculated as described above under 'Pre-calculate multiples of B1 and B2 and store in the B-RAMs'. If both results are final results, in block 570 of Fig. 5 the results are propagated through the PEs to the right until all digits of the results have propagated into MM controller 220, from where the results can be made available to other devices in the system. In one embodiment, the contents of S-register 445 are loaded into R-register 460 through the right-hand input of multiplexer 455 in every PE, then the contents of all R-registers 460 are passed through each other to the right into MM controller 220 by selecting the left-hand input of multiplexer 455 in every PE. In another embodiment, R-register 460 and second multiplexer 455 are not included in the PEs, and the result is passed to the right through the S-registers of every PE using the S-In and S-Out connections, in much the same manner as original parameters were loaded as described above under 'Load Initial Values into the B-RAMs and/or M-RAMs'.
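The alternation between the two channels can be sketched at the level of whole numbers instead of individual PE digits. The example below is an illustration only: the function name `interleaved_montgomery`, the radix of 16, and the sample operands are assumptions, and the way each q value is derived (from the low digit of the interim sum and the negated inverse of the modulus) is the standard Montgomery rule, not a detail stated in this section. Each channel keeps its own interim result and its own lookup tables, which stand in for the lower and upper banks of the B-RAMs and M-RAMs.

```python
def interleaved_montgomery(channels, radix, num_digits):
    """Run one Montgomery multiplication per channel, alternating channels.

    channels: list of (a, b, m) tuples with m coprime to radix.
    Returns one interim-result value S per channel, where
    S * radix**num_digits is congruent to a * b modulo m.
    """
    state = []
    for a, b, m in channels:
        state.append({
            "a_digits": [(a // radix**i) % radix for i in range(num_digits)],
            "b_mult": [j * b for j in range(radix)],  # stands in for a B-RAM bank
            "m_mult": [j * m for j in range(radix)],  # stands in for an M-RAM bank
            # Assumed standard Montgomery constant: m_prime * m == -1 (mod radix).
            "m_prime": next(x for x in range(radix)
                            if (x * m) % radix == radix - 1),
            "s": 0,
        })
    for i in range(num_digits):
        for ch in state:  # channel 0 in one cycle, channel 1 in the next
            t = ch["s"] + ch["b_mult"][ch["a_digits"][i]]  # add selected multiple of B
            q = (t % radix) * ch["m_prime"] % radix        # selects a multiple of M
            ch["s"] = (t + ch["m_mult"][q]) // radix       # exact right shift by one digit
    return [ch["s"] for ch in state]


RADIX = 16
NUM_DIGITS = 4
# Hypothetical (A, B, M) operands for channel 1 and channel 2; both moduli are odd.
CHANNELS = [(0x1234, 0x0BCD, 0xF123), (0x0ABC, 0x4321, 0xC0DF)]
results = interleaved_montgomery(CHANNELS, RADIX, NUM_DIGITS)
```

Because the two channels share no state, neither result depends on the order in which the channels are visited within an iteration, which is what allows the hardware to interleave them on alternating clock cycles.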
[0068] The foregoing description is intended to be illustrative and not limiting. Variations will occur to those of skill in the art. Those variations are intended to be included in the invention, which is limited only by the spirit and scope of the appended claims.

Claims

I claim:
1. An apparatus comprising: a processing element of a linear systolic array Montgomery multiplier, including: a first storage element to store a first set of values for a first computation and a second set of values for a second computation; a second storage element to store a third set of values for the first computation and a fourth set of values for the second computation; and processing logic coupled to the first and second storage elements to perform the first computation during a first clock cycle and to perform the second computation during a second clock cycle immediately following the first clock cycle.
2. The apparatus of claim 1, wherein: the first storage element includes a first bank of storage locations to store the first set of values and a second bank of storage locations to store the second set of values.
3. The apparatus of claim 2, wherein: the processing logic includes a first address register to provide a first address to the first bank of storage locations during the first clock cycle and to provide a second address to the second bank of storage locations during the second clock cycle.
4. The apparatus of claim 2, wherein: the processing logic includes a selection register to select between the first bank and the second bank.
5. The apparatus of claim 2, wherein: the processing logic includes control logic to receive control codes, the control codes including a channel selection value to select between the first bank and the second bank.
6. The apparatus of claim 3, wherein: the second storage element includes a third bank of storage locations to store the third set of values and a fourth bank of storage locations to store the fourth set of values.
7. The apparatus of claim 6, wherein: the processing logic includes a second address register to provide a third address to the third bank of storage locations during the first clock cycle and to provide a fourth address to the fourth bank of storage locations during the second clock cycle.
8. The apparatus of claim 1, wherein: the processing logic includes control logic to receive a first control code for the first computation during the first clock cycle and a second control code for the second computation during the second clock cycle.
9. The apparatus of claim 1, wherein: the first set of values includes no values in common with the second set of values.
10. The apparatus of claim 1, wherein: the third set of values includes no values in common with the fourth set of values.
11. An apparatus comprising: a linear systolic array Montgomery multiplier circuit including: a chain of serially-connected processing elements to perform a first Montgomery multiplication in a first Montgomery multiplier channel and to perform a second Montgomery multiplication in a second Montgomery multiplier channel; and a controller coupled to the chain to provide a first set of parameters and control codes to the chain to perform the first Montgomery multiplication and to provide a second set of parameters and control codes to the chain to perform the second Montgomery multiplication.
12. The apparatus of claim 11, wherein: the controller is further to provide a first channel selection value to perform the first Montgomery multiplication and a second channel selection value to perform the second Montgomery multiplication.
13. The apparatus of claim 11, wherein: the controller is further to provide the first set of parameters and control codes to the chain during a first set of clock cycles and to provide the second set of parameters and control codes to the chain during a second set of clock cycles.
14. The apparatus of claim 13, wherein: the controller is further to provide first and second initial values to the chain before providing the first and second sets of parameters and control codes.
15. A system comprising: a processor; a main memory coupled to the processor; a linear systolic array Montgomery multiplier circuit coupled to the processor and including: a plurality of processing elements connected together in a chain of processing elements to perform a first Montgomery multiplication during a first set of clock cycles and to perform a second Montgomery multiplication during a second set of clock cycles interleaved with the first set of clock cycles; and a controller coupled to the chain to provide a first set of parameters and control codes to the chain to perform the first Montgomery multiplication and to provide a second set of parameters and control codes to the chain to perform the second Montgomery multiplication.
16. The system of claim 15, wherein: the controller is further to provide the first set of parameters and control codes to the chain during the first set of clock cycles and to provide the second set of parameters and control codes to the chain during the second set of clock cycles.
17. The system of claim 16, wherein: the controller is further to provide first and second initial values to the chain before providing the first and second sets of parameters and control codes.
18. The system of claim 15, wherein: the controller is further to provide a first channel selection value to perform a first Montgomery multiplication and a second channel selection value to perform a second Montgomery multiplication.
19. A method comprising: storing multiples of first and second parameters in a linear systolic array to perform a first Montgomery multiplication; storing multiples of third and fourth parameters in the linear systolic array to perform a second Montgomery multiplication, the third and fourth parameters having different values than the first and second parameters; and performing the first and second Montgomery multiplications concurrently.
20. The method of claim 19, wherein concurrently performing includes: providing a first set of control codes to the linear systolic array to control the first Montgomery multiplication; providing a second set of control codes to the linear systolic array to control the second Montgomery multiplication; and interleaving the first set of control codes with the second set of control codes as said first and second sets are provided to the linear systolic array.
21. The method of claim 19, wherein concurrently performing includes: performing portions of the first and second Montgomery multiplications in alternating clock cycles in a particular processing element of the linear systolic array.
22. A machine-readable medium that provides instructions, which when executed by a set of one or more processors, cause said set of processors to perform operations comprising: causing a Montgomery multiplier to execute operations including: receiving first, second, third, and fourth parameters; storing multiples of the first and second parameters in processing elements of a linear systolic array to perform a first Montgomery multiplication; storing multiples of the third and fourth parameters in the processing elements of the linear systolic array to perform a second Montgomery multiplication, the third and fourth parameters having different values than the first and second parameters; and performing the first and second Montgomery multiplications concurrently using the multiples of the first, second, third and fourth parameters.
23. The medium of claim 22, wherein said performing includes: providing a first set of control codes to the linear systolic array to control the first Montgomery multiplication; providing a second set of control codes to the linear systolic array to control the second Montgomery multiplication; and interleaving the first set of control codes with the second set of control codes as said first and second sets are provided to the linear systolic array.
24. The medium of claim 22, wherein said performing includes: performing portions of the first and second Montgomery multiplications in alternating clock cycles in a particular processing element of the linear systolic array.
PCT/US2002/029160 2001-09-28 2002-09-13 Montgomery multiplier with dual independent channels WO2003029957A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP02761648A EP1430393A1 (en) 2001-09-28 2002-09-13 Montgomery multiplier with dual independent channels

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US09/965,915 US6732133B2 (en) 2001-09-28 2001-09-28 Montgomery multiplier with dual independent channels
US09/965,915 2001-09-28

Publications (1)

Publication Number Publication Date
WO2003029957A1 true WO2003029957A1 (en) 2003-04-10

Family

ID=25510669

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2002/029160 WO2003029957A1 (en) 2001-09-28 2002-09-13 Montgomery multiplier with dual independent channels

Country Status (4)

Country Link
US (1) US6732133B2 (en)
EP (1) EP1430393A1 (en)
TW (1) TWI223191B (en)
WO (1) WO2003029957A1 (en)

Also Published As

Publication number Publication date
US6732133B2 (en) 2004-05-04
TWI223191B (en) 2004-11-01
EP1430393A1 (en) 2004-06-23
US20030065694A1 (en) 2003-04-03

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BY BZ CA CH CN CO CR CU CZ DE DM DZ EC EE ES FI GB GD GE GH HR HU ID IL IN IS JP KE KG KP KR LC LK LR LS LT LU LV MA MD MG MN MW MX MZ NO NZ OM PH PL PT RU SD SE SG SI SK SL TJ TM TN TR TZ UA UG UZ VN YU ZA ZM

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ UG ZM ZW AM AZ BY KG KZ RU TJ TM AT BE BG CH CY CZ DK EE ES FI FR GB GR IE IT LU MC PT SE SK TR BF BJ CF CG CI GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
WWE Wipo information: entry into national phase

Ref document number: 2002761648

Country of ref document: EP

WWP Wipo information: published in national office

Ref document number: 2002761648

Country of ref document: EP

NENP Non-entry into the national phase

Ref country code: JP

WWW Wipo information: withdrawn in national office

Country of ref document: JP

WWW Wipo information: withdrawn in national office

Ref document number: 2002761648

Country of ref document: EP