GB2389678A

GB2389678A - Finite field processor reconfigurable for varying sizes of field.

Info

Publication number: GB2389678A
Application number: GB0213683A
Authority: GB
Inventors: Mohammed Benaissa
Original assignee: University of Sheffield
Current assignee: University of Sheffield
Priority date: 2002-06-14
Filing date: 2002-06-14
Publication date: 2003-12-17
Also published as: WO2003107177A3; GB0213683D0; AU2003250361A8; AU2003250361A1; WO2003107177A2

Abstract

A finite field processor has input registers for storing the operands, an arithmetic unit with sub-units to perform arithmetic operations on the operands and an output register to store the result. The input registers may have multiple operand registers, the arithmetic unit has the same number of sub-units and the control unit can produce the same number of control signals to configure the field sizes of the multiple sub-units. The arithmetic sub-units can be operated independently and concurrently. The arithmetic sub-units can also co-operate, in response to control signals, to operate on operands that have field sizes greater than the dynamic range of any one arithmetic sub-unit. The sub-units can be arranged to operate on different sizes of field at the same time. The operands could be global and represent the coefficients of a polynomial. The processor could be a Domain processor, particularly a Galois field processor. Also disclosed is a reconfigurable processor with a plurality of arithmetic units each with a plurality of sub-units, each sub-unit having dynamic ranges that are defined by predetermined field sizes.

Description

PROCESSOR

Field of the Invention

The present invention relates to a processor and, more particularly, to a finite field processor.

Background to the Invention

Data integrity and data security are crucial in many communication and storage applications. Furthermore, the current, and forecast, proliferation in communication systems, in particular over optical networks and wireless networks, creates an even more acute need for advanced error control coding and encryption functionality that can operate in real-time and within various environments. This, in many cases, requires optimised, dedicated implementations that can be programmed or configured to suit the operating environment.

Galois Fields (GF) play an important role in the areas of data integrity and data security. Many important error-control coding and cryptographic schemes are based on GF(2^m) arithmetic. Examples of such error-control coding schemes include Reed-Solomon (RS) codes, BCH codes, and, promisingly, their corresponding Block Turbo Codes. Elliptic curve cryptosystems, and the new Advanced Encryption Standard (AES) using the Rijndael algorithm, are examples of cryptographic schemes that use GF(2^m) arithmetic.

Although the underlying arithmetic for all of the above applications is the same, no attempts have been made to develop a single architecture suitable for implementing all these applications efficiently. The main reasons are the level of complexity, the need for security, the discrepancy in the size of the finite field used in the different applications, and the level of performance required. That is why most reported hardware implementations of these applications have been algorithm specific, which means that circuits designed for one application cannot be used easily for another, even though their underlying operations are similar.

It is an object of the present invention at least to mitigate some of the problems of the prior art.

Summary of Invention

Accordingly, a first aspect of the present invention provides a processor comprising a first register for storing at least a first operand in at least a portion of at least a first operand register of the first register; a second register for storing at least a second operand in at least a portion of at least a first operand register of the second register; an arithmetic unit comprising at least a first arithmetic sub-unit, responsive to a first set of control signals, to perform a first arithmetic operation using the first and second operands; a result register for storing the result of performing the arithmetic operation using the first and second operands; a control unit for producing the first set of control signals; the first set of control signals being used to configure the field size of the first arithmetic sub-unit to influence the size of respective portions of the first and second operands used in performing the first arithmetic operation.

Advantageously, not only do embodiments of the present invention address at least some, and preferably most or all, of the above limitations by being multi-functional, but they also enable, by dynamic configuration, the implementation of any of the above applications under different performance constraints. This is particularly useful in cases where a single application is required to operate under variable parameters to optimise performance.

Embodiments provide a processor in which the first and second registers comprise at least respective second operand registers for storing third and fourth operands respectively; the third and fourth operands having comparable field sizes; the arithmetic unit comprises at least a second arithmetic sub-unit, responsive to a second set of control signals, to perform a second arithmetic operation using at least the third and fourth operands; and the control unit produces the second set of control signals; the second set of control signals being used to configure the field size of the second arithmetic sub-unit to influence the size of the respective portions of the third and fourth operands used in the second arithmetic operation.

Preferred embodiments provide a processor as claimed in any preceding claim, wherein the first and second registers each comprise j operand registers, the arithmetic unit comprises j arithmetic sub-units, each of the j arithmetic sub-units being capable of performing a respective arithmetic operation using respective operands derived from respective operand registers of the j operand registers; the control unit comprising means to produce j sets of control signals to configure the field sizes of the j arithmetic sub-units respectively to influence the size of respective portions of the respective operands used in performing the respective arithmetic operations.

A problem with conventional processors is the serial architecture in which only a single operation can be performed at any one time. Suitably, embodiments provide a processor as claimed in any preceding claim, in which any one of the arithmetic sub-units is operable to perform a respective arithmetic operation independently of any other arithmetic sub-unit.

Preferred embodiments provide a processor in which at least selected arithmetic sub-units of the arithmetic unit are operable to perform respective arithmetic operations substantially concurrently.

Conventional processors are constrained to operate with fixed field sizes, which leads to inefficiencies. Accordingly, preferred embodiments provide a processor in which at least a selected plurality of the arithmetic sub-units co-operate, in response to respective control signals, to perform an arithmetic operation on an operand that has a field size greater than a dynamic range of any one of the selected plurality of arithmetic sub-units.

Advantageously, varying field sizes can be accommodated by the processor.

Preferably, embodiments provide a processor in which at least a first arithmetic sub-unit and corresponding operand registers are arranged to operate over a first field size and at least a second arithmetic sub-unit and corresponding operand registers are arranged to operate over at least a second field size, where the first and second field sizes are different.

Embodiments provide a processor in which the first and second arithmetic sub-units form part of the arithmetic unit and are arranged to operate over different field sizes.

Embodiments provide a processor in which at least a pair of operand registers of the first register is arranged to store respective parts of a first global operand and at least a pair of operand registers of the second register is arranged to store respective parts of a second global operand.

Embodiments provide a processor in which the first and second global operands represent respective polynomials and the pairs of operand registers are used to store coefficients of the respective polynomials.

Embodiments provide a processor further comprising a plurality of storage registers for storing data representing at least one of the operands and corresponding results; and a bus for routing data from the plurality of storage registers to at least one of the arithmetic units' input and result registers.

An aspect of the present invention provides a reconfigurable processor comprising arithmetic units for performing finite field arithmetic; each arithmetic unit having a plurality of arithmetic sub-units, each arithmetic sub-unit having respective dynamic ranges that are defined by respective predetermined field sizes; the dynamic range being the selected size of the data unit capable of being operated on, or stored within, a sub-unit up to the respective predetermined field size; each arithmetic sub-unit being configured by respective sets of control signals to define at least one relationship with another sub-unit; the at least one relationship being such that the respective arithmetic sub-units co-operate to represent a second data unit having a dynamic range that is greater than at least a selectable one of the respective dynamic ranges of the plurality of the arithmetic sub-units.

Embodiments are provided wherein the processor is a Domain processor such as, for example, a Galois field processor. Embodiments can be realised as a standalone GF processor or as a co-processor or peripheral to a commercial platform. A number of such processors can be configured into an array processor for very high performance.

Embodiments of the present invention are configurable, in terms of p and q, depending on the application and the level of performance required to suit the needs of the application.

Embodiments exhibit multi-functionality and, as such, may be suitable for FEC and Encryption applications and any other applications requiring operations over Galois Fields (as in DSPs for example) without loss of performance.

Still further, embodiments can be realised that are programmable to support upgrading and that can be used as a standalone GF processor, with a corresponding instruction set, or as a co-processor or peripheral for a separate platform. The programmability also allows software upgrades to be used which, in turn, prolong the useful life of embodiments.

A still further advantage of embodiments of the present invention resides in cost savings. A single GF processor, according to an embodiment, will be able to replace many application specific hardware architectures, giving significant cost savings in terms of silicon real estate and power consumption.

Furthermore, within embodiments the underlying circuitry may be optimised. Therefore, at least a reduced, and preferably minimal, loss of performance will be exhibited as compared to application specific hardware architectures. Embodiments can also be realised for a high cost, high speed and high throughput system where an array of GF processors can operate in parallel. Any upgrades in this case would involve a change of software instead of hardware.

A further advantage of embodiments of the present invention is that they allow random redundant operations to be included in the software. Therefore, embodiments are very difficult to attack using differential power attacks and timing attacks.

Brief Description of the Drawings

Embodiments of the present invention will now be described, by way of example only, with reference to the accompanying drawings in which:
Figure 1 shows a Polynomial A(x) in a Register Location (SIMD);
Figure 2 illustrates a Single GF Element in a Register Location (SISD);
Figure 3 depicts P Parallel Element Multiplication over GF(2^q);
Figure 4 shows Polynomial Multiplication of α(x) and β(x);
Figure 5 illustrates P Parallel Element Division over GF(2^q) where a_i/b_i;
Figure 6 depicts Element Arithmetic over GF(2^n);
Figure 7 shows a q+1 Cascaded Bit Slice;
Figure 8 illustrates a mapping of F'(y) to [C_i]'(y);
Figure 9 shows a Modified Shift Register for [C_i]'(y) = [C_i]'(y).y Operation;
Figure 10 illustrates a Modified Shift Register for [B_i](y) = [B_i](y).y Operation;
Figure 11 depicts an Internal Block Diagram of Cascadable MLU;
Figure 12 shows Cascaded MLUs with Control and Configuration Logic;
Figure 13 illustrates an RS Data Path;
Figure 14 depicts a UV Data Path;
Figure 15 illustrates a Bit Slice of W(y) = y.W(y) mod G(y) Operation;
Figure 16 shows a Bit Slice of Z(y) = Z(y)/y mod G(y) Operation;
Figure 17 depicts an Internal Block Diagram of a DLU without Control Module;
Figure 18 illustrates a Cascadable DLU with Control & Configuration Logic;

Figure 19 shows a CLU Count Module;
Figure 20 depicts an Internal Block Diagram of the GF Processor;
Figure 21 illustrates an Implemented Register File Structure;
Figure 22 shows the content of a Register Location in SIMD Mode;
Figure 23 depicts content of a Register Location in SISD Mode;
Figure 24 demonstrates an ACCO R3, R5, C4, R6 Operation;
Figure 25 shows P Parallel Arithmetic Logic Units;
Figure 26 demonstrates a MULT/DIVI R1,R2,R3 Operation;
Figure 27 depicts a Processor State Machine;
Figure 28 illustrates an example of DIVI/MULT Pipeline Operation;
Figure 29 depicts a Register File with G(x) and I(x);
Figure 30 depicts a Syndrome Computation 1;
Figure 31 shows a Syndrome Computation 2;
Figure 32 illustrates a Chien Search Module;
Figure 33 depicts a Register File with Pre-Computed Constants;
Figure 34 shows a Register File with Cipher Key and Data;
Figure 35 illustrates a Forward MixColumn Constants Register File; and
Figure 36 depicts an Inverse MixColumn Constants in Register File.

Description of Preferred Embodiments

With the embodiments of the present invention described hereafter, a Register is defined as an entity whose width is the size of the input of the arithmetic unit. The arithmetic unit will have at least one input register and one output register.

An operand register by itself, or in conjunction with one or more other operand registers, makes up the whole of a register as defined above.

The arithmetic unit consists of one or more arithmetic sub-units of the same type (referred to as logic units, LUs). Each arithmetic sub-unit can operate, independently from and concurrently with other arithmetic sub-units within the same arithmetic unit, on operands in its respective operand registers. Each arithmetic sub-unit will have at least one input operand register and one output operand register.

The dynamic range of an arithmetic unit is the size of the data unit or field over which the arithmetic unit can, or is configured to, operate.

Preferred embodiments of a Configurable GF(2^{pq}) Galois Field Processor have two modes of operation. The first mode of operation is a Single Instruction Multiple Data (SIMD) mode that allows the processor to operate efficiently on the whole, or part, of a polynomial having coefficients that are elements of a Galois Field GF(2^n), where n ≤ q. In SIMD mode, the processor can operate on at most p coefficients at one time. This mode is useful for applications that operate over smaller Galois Fields, such as those used for Reed-Solomon Codes and AES.

The second mode of operation is a Single Instruction Single Data (SISD) mode that allows the processor to be configured for large field operations. The largest Galois Field size that preferred embodiments can accommodate is GF(2^{pq}). This mode is useful for applications like Elliptic Curve Cryptography and DSP.

Figure 1 shows the structure of a register file 100 according to an embodiment. The structure of the register file in the GF processor clearly illustrates the design philosophy. The register file 100 is made up of a number of register locations 102 to 106, each m bits wide, where m = p.q. In the SIMD mode, each register location is sub-divided into p coefficient locations 108 to 116, each q bits wide.

The data in each register location 102 to 106 can be viewed as a polynomial of order p - 1, having its coefficients, as elements of a Galois Field GF(2^q), stored in the p coefficient locations.

A polynomial with an order larger than (p - 1) can be stored in two or more register locations. If n < q, the coefficients are stored left-justified in each coefficient location 108 to 116 and the least significant (q - n) bits are set to zero.
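The left-justified SIMD layout described above can be sketched in a few lines of Python. This is only an illustrative model, not the patented circuit: the function names `pack_register` and `unpack_register` are hypothetical, and placing coefficient a_0 in the least significant coefficient location is an assumption, since the text does not state the ordering.

```python
def pack_register(coeffs, p, q, n):
    """Pack up to p n-bit coefficients into one m = p*q bit register word.

    coeffs[0] is taken to be the lowest-order coefficient (an assumption).
    Each coefficient is left-justified in its q-bit coefficient location,
    so the least significant (q - n) bits of every location are zero.
    """
    if not 0 < n <= q:
        raise ValueError("need 0 < n <= q")
    if len(coeffs) > p:
        raise ValueError("polynomial needs more than one register location")
    word = 0
    for i, c in enumerate(coeffs):
        if c >> n:
            raise ValueError("coefficient does not fit in n bits")
        word |= (c << (q - n)) << (i * q)
    return word


def unpack_register(word, p, q, n):
    """Recover the p coefficients (unused locations read back as 0)."""
    mask = (1 << q) - 1
    return [((word >> (i * q)) & mask) >> (q - n) for i in range(p)]


# Two GF(2^3) coefficients in a p = 4, q = 4 register location.
word = pack_register([0b101, 0b011], p=4, q=4, n=3)
print(hex(word))
```

With p = 4 and q = 4, each 3-bit coefficient occupies the upper three bits of its 4-bit location and the lowest bit of each location stays zero.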

Figure 2 illustrates a register location 200 configured for SISD mode in which a, or each, register location 200 has only one coefficient location (i.e. p = 1 and q = m). It should be noted that if n < m, the GF element is stored left-justified with the least significant (m - n) bits being set to zero.

In both modes, all arithmetic operations of the GF processor revolve around the register file 100. The operands of the operation are loaded from the register file 100 and the results are stored back in the register file 100.

This unique architecture allows embodiments of the GF processor to be configured dynamically to allow a high efficiency operation. Embodiments use a configurable multiplication/division architecture that allows the GF processor to operate in, and switch between, the two modes. This will be described in greater detail hereafter.

In SIMD mode, the register locations 102 to 106 hold polynomials with coefficients as elements of a Galois Field. GF arithmetic operations (addition, multiplication and division/inversion) similarly follow this convention.

Addition in a Galois Field GF(2^q) is simply the exclusive-or (XOR) between two elements of the Galois Field. In the SIMD mode, there are two sub-types of addition; namely, polynomial addition and summation addition. Polynomial addition is performed between two polynomials in two separate register locations. For example,

$$C(x) = A(x) \oplus B(x)$$

where

$$A(x) = a_{p-1}x^{p-1} + a_{p-2}x^{p-2} + \cdots + a_2x^2 + a_1x + a_0,$$

$$B(x) = b_{p-1}x^{p-1} + b_{p-2}x^{p-2} + \cdots + b_2x^2 + b_1x + b_0,$$

and

$$C(x) = (a_{p-1} \oplus b_{p-1})x^{p-1} + (a_{p-2} \oplus b_{p-2})x^{p-2} + \cdots + (a_2 \oplus b_2)x^2 + (a_1 \oplus b_1)x + (a_0 \oplus b_0).$$
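Because addition in a field of characteristic 2 has no carries, polynomial addition over a whole register location reduces to one wide XOR. The Python sketch below illustrates this under the assumption that a register location is modelled as an integer of p.q bits; `lanes` and `poly_add` are hypothetical helper names, not part of the described instruction set.

```python
def lanes(word, p, q):
    """Split a p*q bit register word into its p coefficient locations."""
    mask = (1 << q) - 1
    return [(word >> (i * q)) & mask for i in range(p)]


def poly_add(reg_a, reg_b):
    """Polynomial addition: XOR of two packed register locations."""
    return reg_a ^ reg_b


# p = 4 coefficient locations of q = 4 bits each (example values only).
A, B = 0x3A5C, 0x0FF1
C = poly_add(A, B)
print(hex(C))  # 0x35ad

# The single wide XOR equals p independent coefficient-wise XORs.
per_lane = [a ^ b for a, b in zip(lanes(A, 4, 4), lanes(B, 4, 4))]
print(per_lane == lanes(C, 4, 4))  # True
```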

Summations are needed in some cases, for example, to calculate the roots of a polynomial. This operation will sum the elements in all the coefficient locations of a register location 102 to 106, that is, it will calculate $\sum_i a_i$, where $a_i$ is the element stored in the ith coefficient location.

To avoid confusion herein, multiplication between two polynomials with coefficients as elements of a Galois Field shall be called Polynomial Multiplication, whereas multiplication between two elements of a Galois Field shall be called Element Multiplication.

Within element multiplication, two elements of a Galois Field GF(2^q) are multiplied together modulo an Irreducible Polynomial. Since in SIMD mode a register location 102 to 106 is made up of several Coefficient Locations, using the configurable multiplication architecture, P parallel element multiplications can be executed at the same time as shown in figure 3.

Polynomial multiplication comprises a sequence of element multiplication operations and addition operations and is illustrated by the general example below.

Let α(x) = a x^2 + b x + c and β(x) = d x^2 + e x + f, where a, b, c, d, e, f are elements of a Galois Field. The polynomial multiplication α(x).β(x) = χ(x) is given by:

$$\chi(x) = [(ax^2 + bx + c) \cdot dx^2] \oplus [(ax^2 + bx + c) \cdot ex] \oplus [(ax^2 + bx + c) \cdot f]$$

$$= ad\,x^4 + (bd \oplus ae)\,x^3 + (be \oplus cd \oplus af)\,x^2 + (ce \oplus bf)\,x + cf$$

χ(x) can thus be calculated by element multiplying α(x), which is stored in a Register Location, with each coefficient of β(x) in turn and then shifting and adding the intermediate results. This process is illustrated by figure 4.
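The multiply, shift and add procedure above can be modelled in software as follows. This is a sketch under stated assumptions: the field GF(2^4) with irreducible polynomial y^4 + y + 1 is chosen purely for illustration, `gf_mul` is a generic software element multiplier rather than the MLU circuit, and both function names are hypothetical.

```python
def gf_mul(a, b, q=4, g=0b10011):
    """Element multiplication in GF(2^q) modulo g (g has bit q set).

    The default g = y^4 + y + 1 is an example choice only.
    """
    r = 0
    for _ in range(q):
        if b & 1:
            r ^= a          # add the current multiple of a
        b >>= 1
        a <<= 1             # multiply a by y
        if a >> q:
            a ^= g          # reduce modulo g
    return r


def poly_mul(alpha, beta, q=4, g=0b10011):
    """Polynomial multiplication; coefficient lists are lowest order first.

    alpha is element-multiplied by each coefficient of beta in turn, and
    the intermediate result is shifted and added (XORed) into chi.
    """
    chi = [0] * (len(alpha) + len(beta) - 1)
    for shift, coeff in enumerate(beta):
        for i, a in enumerate(alpha):
            chi[i + shift] ^= gf_mul(a, coeff, q, g)
    return chi


a, b, c, d, e, f = 3, 7, 9, 5, 2, 11          # arbitrary GF(2^4) elements
chi = poly_mul([c, b, a], [f, e, d])          # alpha(x) . beta(x)
print(chi[4] == gf_mul(a, d))                 # x^4 coefficient is ad
```

The inner loop corresponds to the p parallel element multiplications that the hardware performs in one step; here it is simply serialised.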

Embodiments also support two types of division, that is, division between two polynomials with coefficients as elements of a Galois Field, which shall be known as Polynomial Division, and division between two elements of a Galois Field, which shall be known as Element Division.

The operation of Element Division is similar to that of Element Multiplication. A number, p, of parallel Element Divisions can be executed at the same time as shown in figure 5. However, care must be taken to distinguish between the numerator and the denominator.

Polynomial Division comprises a sequence of element multiplication, element division and element addition operations. The result consists of a remainder polynomial as well as a quotient polynomial. It is more complicated than polynomial multiplication and there are a number of different approaches to it. An example of such an approach is given below. However, one skilled in the art will realise that embodiments can be realised which use alternative approaches to polynomial division.

Given α(x) and β(x), expressions for χ(x) and δ(x) should be determined that satisfy the expression:

$$\chi(x) \cdot \beta(x) \oplus \delta(x) = \alpha(x)$$

where α(x), β(x), χ(x) and δ(x) are polynomials with coefficients as elements of a Galois Field, that is, α(x) ÷ β(x) should be calculated to obtain the quotient, χ(x), and the remainder, δ(x). Note that the order of α(x) must be greater than or equal to the order of β(x).

Let α(x) = c x^2 + d x + e and β(x) = a x + b. Therefore, χ(x) and δ(x) must have the order and form χ(x) = f x + g and δ(x) = h. Hence, the steps involved in the polynomial division are:

1. Perform an Element Division of c/a. The result of this division provides the most significant coefficient of χ(x) (i.e. f = c/a).

2. Element Multiply throughout β(x) with f, align the result with the leading term of α(x) and add it to α(x). The result will be (cb/a ⊕ d) x + e.

3. Perform another Element Division of (cb/a ⊕ d)/a, where (cb/a ⊕ d) is taken from the result in part 2. The result of this division is the least significant coefficient of χ(x) (i.e. g = (cb/a ⊕ d)/a).

4. Element Multiply throughout β(x) with g and add to the result of part 2. This is the remainder δ(x).
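The four steps above generalise to a short long-division loop, sketched here in Python. The field GF(2^4) with polynomial y^4 + y + 1 is an illustrative assumption, the brute-force inverse in `gf_inv` stands in for the hardware Element Division, and all function names are hypothetical.

```python
def gf_mul(a, b, q=4, g=0b10011):
    """Element multiplication in GF(2^q) modulo g (example g = y^4+y+1)."""
    r = 0
    for _ in range(q):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> q:
            a ^= g
    return r


def gf_inv(a, q=4, g=0b10011):
    """Multiplicative inverse by exhaustive search (small fields only)."""
    for x in range(1, 1 << q):
        if gf_mul(a, x, q, g) == 1:
            return x
    raise ZeroDivisionError("zero has no inverse")


def poly_divmod(alpha, beta, q=4, g=0b10011):
    """Divide alpha by beta; coefficient lists are highest order first.

    Returns (quotient, remainder) such that
    quotient * beta XOR remainder == alpha.
    """
    rem = list(alpha)
    quot = []
    lead_inv = gf_inv(beta[0], q, g)
    steps = len(alpha) - len(beta) + 1
    for k in range(steps):
        coeff = gf_mul(rem[k], lead_inv, q, g)      # element division
        quot.append(coeff)
        for j, b in enumerate(beta):                # multiply beta, add
            rem[k + j] ^= gf_mul(coeff, b, q, g)
    return quot, rem[steps:]


a, b, c, d, e = 3, 5, 7, 2, 9                       # arbitrary elements
(f, g_coef), (delta,) = poly_divmod([c, d, e], [a, b])
print(gf_mul(f, a) == c)                            # f = c / a
```

For the quadratic example in the text, the loop runs twice: the first pass performs steps 1 and 2, the second performs steps 3 and 4.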

In the Single Instruction Single Data Mode, each register location 102 to 106 contains just one element of a large Galois Field, that is, the whole of a register location is used to store a coefficient (i.e. Register Location = Coefficient Location). As a result, the GF Processor can only do one arithmetic operation (Addition, Element Multiplication or Element Division) at a time.

Figure 6 illustrates Element Arithmetic in SISD mode, which comprises an operation between two large Galois Field elements that are stored in two register locations R1 and R2 with the result being stored in another register location.

At the heart of embodiments of the GF processor are the Configurable Multiplier and Divider Architectures over GF(2^q), which are described hereafter in greater detail with reference to figures 7 through to 19. These architectures allow the GF processor to be configured to operate on vastly different Galois field sizes without a corresponding loss in efficiency.

The basic building block of the architectures is a Logic Unit (LU), which can be a Division Logic Unit (DLU) for a Configurable Division Architecture or a Multiplication Logic Unit (MLU) for a Configurable Multiplication Architecture. Each of these Logic Units is capable of performing field element arithmetic of a size up to GF(2^q). In preferred embodiments, the LU takes in three elements in parallel, which are: an Irreducible Polynomial G(α) of order q, characteristic 2; and Input Operands A(α), B(α), where A(α), B(α) ∈ GF(2^q).

The logic units produce respective outputs, which are C(α) = A(α)/B(α) mod G(α) after 2q clock cycles and C(α) = A(α).B(α) mod G(α) after q clock cycles for division and multiplication logic units respectively.

By cascading p of these LUs together, the resultant entity can be configured to perform finite field operations using a field size of up to GF(2^m), where m = p.q. Furthermore, in preferred embodiments, each of these LUs can be configured dynamically to function as p GF(2^q) arithmetic circuits operating in parallel or in any combinations as required according to a set of control signals, which are:

MSBlock[i], which has the values True or False and determines whether a particular LU is the most significant block of the operation;

LSBlock[i], which has the values True or False and determines whether a particular LU is the least significant block of the operation; and

LSBitPos[i], which has a value of 0 to q-1 and determines the least significant bit position of the operation in the least significant LU block of the operation;

where 0 < i ≤ p. It will be appreciated that all operations are Left Justified.

An illustration of the use of these control signals now follows in the context of configuring a GF(2^16) Configurable LU (p = 4, q = 4). The illustration is applicable equally to DLUs and MLUs.

Four GF(2^4) LUs are cascaded to form a Configurable Galois Field arithmetic circuit. The circuit can be configured to perform element arithmetic operations (multiplication, division or inversion) for a field size of up to GF(2^16). Alternatively, the four GF(2^4) LUs can also be configured to perform four field element operations in parallel, up to a field size of GF(2^4), or two element operations in parallel up to a field size of GF(2^8), or any other valid combinations.

To perform element arithmetic operations in GF(2^16), the control signals are initialised as shown in Table 1 below:

              LU 4    LU 3    LU 2    LU 1
MSBlock[i]    True    False   False   False
LSBlock[i]    False   False   False   True
LSBitPos[i]   -       -       -       0

TABLE 1

The operation needs all four LUs. MSBlock[4] and LSBlock[1] are set to True, which means that the most significant block of the operation is LU 4 and the least significant block of the operation is LU 1. LSBitPos[1] is set to 0, which means that the least significant bit of the operation is in bit 0 of LU 1. Again, all data vectors are left justified.

To perform element operations in GF(2^9), the control signals are initialised as shown in Table 2 below.

              LU 4    LU 3    LU 2    LU 1
MSBlock[i]    True    False   False   True
LSBlock[i]    False   False   True    True
LSBitPos[i]   -       -       3       0

TABLE 2

In this case, as LUs 4 to 2 are configured to do the GF(2^9) operation, MSBlock[4] and LSBlock[2] are set to True. LSBitPos[2] is set to 3, which means that the least significant bit of the GF(2^9) operation is in bit 3 of LU 2. Since LU 1 is not involved in the GF(2^9) operation, it can be used for an element division operation up to GF(2^4). In this case, it is configured to do a GF(2^4) operation, with MSBlock[1] and LSBlock[1] set to True and LSBitPos[1] set to 0.

Table 3 shows the 4 cascaded LUs configured to perform four GF(2^3) operations in parallel.

              LU 4    LU 3    LU 2    LU 1
MSBlock[i]    True    True    True    True
LSBlock[i]    True    True    True    True
LSBitPos[i]   1       1       1       1

TABLE 3

It will be appreciated by one skilled in the art that by using these control signals, the configurable architecture for the multiplier or divider can be dynamically configured for use by the GF Processor for different applications over different field sizes.
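One way to see how the three control signals follow from a requested set of field sizes is the Python sketch below. It is an illustration only: the function name `configure_lus` is hypothetical, and the policy of allocating LUs from the most significant end downward is an assumption chosen so that the outputs match the left-justified examples in Tables 1 to 3.

```python
def configure_lus(field_sizes, p=4, q=4):
    """Derive MSBlock, LSBlock and LSBitPos for a cascade of p q-bit LUs.

    field_sizes lists the exponent n of each GF(2^n) operation, starting
    with the operation that occupies the most significant LUs. Results
    are dicts keyed by LU number 1..p; LSBitPos is None where unused.
    """
    ms_block = {i: False for i in range(1, p + 1)}
    ls_block = {i: False for i in range(1, p + 1)}
    ls_bit_pos = {i: None for i in range(1, p + 1)}
    top = p                                  # next free LU from the top
    for n in field_sizes:
        if n < 1:
            raise ValueError("field size exponent must be positive")
        blocks = -(-n // q)                  # ceil(n / q) LUs needed
        bottom = top - blocks + 1
        if bottom < 1:
            raise ValueError("operations do not fit in the cascade")
        ms_block[top] = True
        ls_block[bottom] = True
        ls_bit_pos[bottom] = blocks * q - n  # zero padding below the LSB
        top = bottom - 1
    return ms_block, ls_block, ls_bit_pos


ms, ls, pos = configure_lus([9, 4])          # the Table 2 configuration
for lu in (4, 3, 2, 1):
    print(lu, ms[lu], ls[lu], pos[lu])
```

Calling `configure_lus([16])` and `configure_lus([3, 3, 3, 3])` gives the settings of Table 1 and Table 3 respectively.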

Embodiments will now be described for multiplier and divider implementations of configurable architectures. Most multiplication or division algorithms over Galois Fields can be used as the underlying algorithm for the Configurable Architecture.

An Implementation of a Configurable Multiplier Architecture

Based on the principles described above, an embodiment of an architecture for performing configurable GF multiplication will now be described.

The underlying basic element multiplication algorithm used is based on that described in P. A. Scott, S. E. Tavares, and L. E. Peppard, "A Fast VLSI Multiplier for GF(2^m)," IEEE Journal on Selected Areas in Communications, vol. SAC-4, pp. 62-66, 1986, which is incorporated herein by reference for all purposes. This algorithm has been selected due to its simplicity, although most other existing multiplication algorithms could equally well be used.

The algorithm operates over a polynomial basis and has been mapped onto a semi-systolic architecture adapted to allow element multiplication over variable field sizes and different irreducible polynomials.

A description of the Multiplication Algorithm can be summarised as follows: C(y) needs to be determined, where C(y) = A(y).B(y) mod G(y); where the irreducible polynomial over GF(2) is

$$G(y) = y^q + g_{q-1}y^{q-1} + g_{q-2}y^{q-2} + \cdots + g_1y + g_0 \qquad \text{(Equation 1)}$$

and C(y), A(y), B(y) ∈ GF(2^q) in the polynomial basis.

$$A(y) = a_{q-1}y^{q-1} + a_{q-2}y^{q-2} + \cdots + a_1y + a_0 \qquad \text{(Equation 2)}$$

$$B(y) = b_{q-1}y^{q-1} + b_{q-2}y^{q-2} + \cdots + b_1y + b_0 \qquad \text{(Equation 3)}$$

$$C(y) = c_{q-1}y^{q-1} + c_{q-2}y^{q-2} + \cdots + c_1y + c_0 \qquad \text{(Equation 4)}$$

The multiplication operation can be broken down into q shift and add operations as described below:

Multiplication Algorithm

    C'(y) = all zeros
    for j = 0 to q-1
        // Part A
        if b_{q-1} == '1' then
            if c'_q == '1' then
                C'(y) = C'(y) + G(y) + A(y);
            else
                C'(y) = C'(y) + A(y);
            end if;
        else
            if c'_q == '1' then
                C'(y) = C'(y) + G(y);
            else
                C'(y) = C'(y);
            end if;
        end if;
        // End of Part A
        // Part B
        B(y) = B(y).y
        C'(y) = C'(y).y
        // End of Part B
    end loop;
    C(y) = C'(y)/y

From the algorithm, Part A is generalised into the equation below:

$$C'(y) = C'(y) + G(y) \cdot c'_q + A(y) \cdot b_{q-1} \qquad \text{(Equation 5)}$$

where C'(y) is a degree q partial product

$$C'(y) = c'_q y^q + c'_{q-1}y^{q-1} + c'_{q-2}y^{q-2} + \cdots + c'_1y + c'_0.$$

Breaking Equation 5 into bit slices gives

$$c'_k = c'_k \oplus (g_k \wedge c'_q) \oplus (a_k \wedge b_{q-1}) \qquad \text{(Equation 6)}$$

where k = 0 to q.

For a GF(2^q) operation, q + 1 bit-slices are cascaded for Part A of the multiplication algorithm as shown in figure 7.

The element B(y) is loaded into a shift register and is shifted q times during the course of the algorithm. Hence, at the jth iteration, the most significant bit of the shift register will be bit b_{q-1-j} of the original B(y). Part B {B(y) = B(y).y and C'(y) = C'(y).y} is simply a Logical Left Shift Operation.
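The shift and add multiplication algorithm can be modelled bit-for-bit in Python as a check on the reasoning above. This is a behavioural sketch, not the semi-systolic circuit: polynomials are held as integers with bit k representing the coefficient of y^k, and the name `mlu_multiply` is hypothetical. The example field GF(2^4) with G(y) = y^4 + y + 1 is an assumption made only for the demonstration.

```python
def mlu_multiply(a, b, g, q):
    """Return C(y) = A(y).B(y) mod G(y) over GF(2^q).

    a and b are q-bit integers; g is the (q+1)-bit irreducible
    polynomial with bit q set. Addition of polynomials is XOR.
    """
    top = 1 << q            # position of c'_q in the partial product
    msb = 1 << (q - 1)      # position of b_{q-1} in the shift register
    c = 0                   # C'(y) = all zeros
    for _ in range(q):
        # Part A (Equation 5): C' = C' + G.c'_q + A.b_{q-1}
        if c & top:
            c ^= g
        if b & msb:
            c ^= a
        # Part B: logical left shift of B(y) and C'(y)
        b = (b << 1) & (top - 1)
        c <<= 1
    return c >> 1           # C(y) = C'(y)/y


# y^3 . y = y^4 = y + 1 in GF(2^4) with G(y) = y^4 + y + 1
print(bin(mlu_multiply(0b1000, 0b0010, 0b10011, 4)))  # 0b11
```

Testing c'_q before adding A(y) is equivalent to the nested conditions in the algorithm, because adding A(y), which has degree at most q - 1, never changes bit q.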

Modifications for Element Multiplication over Variable Field Size on a Single MLU

It can be appreciated from Equation 5 that for every iteration, two bits of information, namely c'_q and b_{q-1}, are required to calculate the partial product. Since the element vector B(y) is loaded into a shift register and shifted towards the Most Significant Bit for each iteration, the required bit of B(y) will always be at the MSB of the shift register. Similarly, c'_q is the MSB of the partial product C'(y) at the jth iteration.

Furthermore, the MLU can support variable field size and variable irreducible polynomial multiplication. Assume that the width of the data bus of the MLU is q bits. It can be appreciated from the following that the same circuit can support multiplication operations over GF(2^n) for n ≤ q.

Let D(y), E(y) and F(y) ∈ GF(2^n) and n ≤ q:

$$D(y) = d_{n-1}y^{n-1} + d_{n-2}y^{n-2} + \cdots + d_1y + d_0$$

$$E(y) = e_{n-1}y^{n-1} + e_{n-2}y^{n-2} + \cdots + e_1y + e_0$$

$$F(y) = f_{n-1}y^{n-1} + f_{n-2}y^{n-2} + \cdots + f_1y + f_0$$

and let H(y) be the irreducible polynomial of GF(2^n):

$$H(y) = y^n + h_{n-1}y^{n-1} + h_{n-2}y^{n-2} + \cdots + h_1y + h_0$$

A(y), B(y) and G(y) are fed into the MLU left justified, where

$$A(y) = D(y) \cdot y^{q-n}, \qquad B(y) = E(y) \cdot y^{q-n}, \qquad G(y) = H(y) \cdot y^{q-n}.$$

The result will be F(y) = C(y)/y^{q-n}, or alternatively, the most significant n bits of C(y). Hence from Equation 5, it can be appreciated that the MLU does not need to be modified to allow it to operate over different field sizes (up to a maximum of q) and with different irreducible polynomials. However, the number of iterations will vary depending on the field size and this is controlled by a Control Logic Module, which is described hereafter.
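The claim that an unmodified q-bit MLU handles a smaller field when its inputs are left justified can be checked with the following Python sketch. It assumes that the Control Logic Module runs n iterations for a GF(2^n) operation, which is an interpretation of "the number of iterations will vary depending on the field size"; the function name `mlu_multiply_var` and the example polynomial y^3 + y + 1 are likewise illustrative assumptions.

```python
def mlu_multiply_var(d, e, h, n, q):
    """Multiply d.e mod h in GF(2^n) on a q-bit datapath (n <= q).

    d, e are n-bit integers and h is the (n+1)-bit irreducible
    polynomial of GF(2^n) with bit n set.
    """
    if n > q:
        raise ValueError("field does not fit on a single q-bit MLU")
    s = q - n
    a, b, g = d << s, e << s, h << s    # left justify: multiply by y^(q-n)
    top, msb = 1 << q, 1 << (q - 1)
    c = 0
    for _ in range(n):                  # n iterations instead of q
        if c & top:                     # Part A, Equation 5
            c ^= g
        if b & msb:
            c ^= a
        b = (b << 1) & (top - 1)        # Part B, left shifts
        c <<= 1
    return (c >> 1) >> s                # F(y) = C(y) / y^(q-n)


# The same GF(2^3) product computed on 3, 4, 8 and 16 bit datapaths.
for width in (3, 4, 8, 16):
    print(width, mlu_multiply_var(0b100, 0b010, 0b1011, 3, width))
```

Every datapath width gives the same product, since the partial product stays a multiple of y^(q-n) throughout and the final right shift removes that factor.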

Modifications for Element Multiplication over Field Sizes > GF(2^q) on Cascaded MLUs

Embodiments can be realised in which two or more MLUs are cascaded together so that the resultant entity can support element multiplication over field sizes larger than GF(2^q).

Let D(y), E(y) be the inputs and F(y) be the output, where D(y), E(y), F(y)eGF(2n) for n>q. Also, let H(y) be the irreducible polynomial of the Galois Field GF(2n).

10 Assuming there are p MLUs cascaded together, the largest field that is supportable is GF(2Pq), i.e. n <

p.q. Let [A1](y), [B,](y) be the ith input elements and [C](y) be the i'h multiplication output elements of the ith MLU Block, where i = 1,2,....,p. Also, let [Gl](y) be 15 the ith irreducible polynomial inputs to the ith MLU Block.

Note that the degree of [A_i](y), [B_i](y) and [C_i](y) is (q-1). Let [C_i]'(y) be the degree q partial product of the ith MLU, i.e.

[A_i](y) = [a_i]_{q-1}y^{q-1} + [a_i]_{q-2}y^{q-2} + ... + [a_i]_1 y + [a_i]_0
[B_i](y) = [b_i]_{q-1}y^{q-1} + [b_i]_{q-2}y^{q-2} + ... + [b_i]_1 y + [b_i]_0
[C_i](y) = [c_i]_{q-1}y^{q-1} + [c_i]_{q-2}y^{q-2} + ... + [c_i]_1 y + [c_i]_0
[C_i]'(y) = [c_i]'_q y^q + [c_i]'_{q-1}y^{q-1} + [c_i]'_{q-2}y^{q-2} + ... + [c_i]'_1 y + [c_i]'_0
[G_i](y) = y^q + [g_i]_{q-1}y^{q-1} + [g_i]_{q-2}y^{q-2} + ... + [g_i]_1 y + [g_i]_0

D(y) and E(y) are left justified so that their most significant bits are aligned to the MSBs of [A_p](y) and [B_p](y) respectively. In other words,

D(y) = D(y).y^{pq-n} = d_{n-1}y^{pq-1} + d_{n-2}y^{pq-2} + ... + d_1 y^{pq-n+1} + d_0 y^{pq-n}
E(y) = E(y).y^{pq-n} = e_{n-1}y^{pq-1} + e_{n-2}y^{pq-2} + ... + e_1 y^{pq-n+1} + e_0 y^{pq-n}
H(y) = H(y).y^{pq-n} = y^{pq} + h_{n-1}y^{pq-1} + h_{n-2}y^{pq-2} + ... + h_1 y^{pq-n+1} + h_0 y^{pq-n}

and

[A_p](y) ⇐ d_{n-1}y^{pq-1} + d_{n-2}y^{pq-2} + ... + d_{n-(q-1)}y^{(p-1)q+1} + d_{n-q}y^{(p-1)q}
[B_p](y) ⇐ e_{n-1}y^{pq-1} + e_{n-2}y^{pq-2} + ... + e_{n-(q-1)}y^{(p-1)q+1} + e_{n-q}y^{(p-1)q}
[G_p](y) ⇐ h_{n-1}y^{pq-1} + h_{n-2}y^{pq-2} + ... + h_{n-(q-1)}y^{(p-1)q+1} + h_{n-q}y^{(p-1)q}

Note that the MSB of H(y) is always '1'.

i.e.
[a_p]_{q-1} = d_{n-1}, [a_p]_{q-2} = d_{n-2}, ..., [a_p]_1 = d_{n-(q-1)}, [a_p]_0 = d_{n-q}
[b_p]_{q-1} = e_{n-1}, [b_p]_{q-2} = e_{n-2}, ..., [b_p]_1 = e_{n-(q-1)}, [b_p]_0 = e_{n-q}
[g_p]_{q-1} = h_{n-1}, [g_p]_{q-2} = h_{n-2}, ..., [g_p]_1 = h_{n-(q-1)}, [g_p]_0 = h_{n-q}

Furthermore, the input operands of the (p-1)th MLU will be:

[A_{p-1}](y) ⇐ d_{n-q-1}y^{(p-1)q-1} + d_{n-q-2}y^{(p-1)q-2} + ... + d_{n-(2q-1)}y^{(p-2)q+1} + d_{n-2q}y^{(p-2)q}
[B_{p-1}](y) ⇐ e_{n-q-1}y^{(p-1)q-1} + e_{n-q-2}y^{(p-1)q-2} + ... + e_{n-(2q-1)}y^{(p-2)q+1} + e_{n-2q}y^{(p-2)q}
[G_{p-1}](y) ⇐ h_{n-q-1}y^{(p-1)q-1} + h_{n-q-2}y^{(p-1)q-2} + ... + h_{n-(2q-1)}y^{(p-2)q+1} + h_{n-2q}y^{(p-2)q}

and the input operands of the ith MLU will be:

[A_i](y) ⇐ d_{n-(p-i)q-1}y^{iq-1} + d_{n-(p-i)q-2}y^{iq-2} + ... + d_{n-(p-i)q-(q-1)}y^{(i-1)q+1} + d_{n-(p-i)q-q}y^{(i-1)q}
[B_i](y) ⇐ e_{n-(p-i)q-1}y^{iq-1} + e_{n-(p-i)q-2}y^{iq-2} + ... + e_{n-(p-i)q-(q-1)}y^{(i-1)q+1} + e_{n-(p-i)q-q}y^{(i-1)q}
[G_i](y) ⇐ h_{n-(p-i)q-1}y^{iq-1} + h_{n-(p-i)q-2}y^{iq-2} + ... + h_{n-(p-i)q-(q-1)}y^{(i-1)q+1} + h_{n-(p-i)q-q}y^{(i-1)q}

It will be appreciated that in such a cascaded operation, the MSBs of [G_i](y) of the less significant MLUs will not be used.

This will continue until the entire D(y) and E(y) vectors are mapped onto the input operands [A_i](y) and [B_i](y) of the k consecutive most significant MLUs, where i = p, (p-1), ..., (p-k+1). Similarly, the output F(y) will be left justified, i.e.

F(y) = F(y).y^{pq-n} = f_{n-1}y^{pq-1} + f_{n-2}y^{pq-2} + ... + f_1 y^{pq-n+1} + f_0 y^{pq-n}

and

[C_p](y) = f_{n-1}y^{pq-1} + f_{n-2}y^{pq-2} + ... + f_{n-(q-1)}y^{(p-1)q+1} + f_{n-q}y^{(p-1)q}
[C_{p-1}](y) = f_{n-q-1}y^{(p-1)q-1} + f_{n-q-2}y^{(p-1)q-2} + ... + f_{n-(2q-1)}y^{(p-2)q+1} + f_{n-2q}y^{(p-2)q}
[C_i](y) = f_{n-(p-i)q-1}y^{iq-1} + f_{n-(p-i)q-2}y^{iq-2} + ... + f_{n-(p-i)q-(q-1)}y^{(i-1)q+1} + f_{n-(p-i)q-q}y^{(i-1)q}

and so on for k consecutive most significant MLUs.
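The mapping of a left-justified operand onto the cascaded blocks can be illustrated with a short Python sketch. The function name map_to_blocks and the convention that the returned list starts with the most significant block (block p) are assumptions made purely for illustration.

```python
def map_to_blocks(d, n, p, q):
    """Left justify an n-bit operand in a p*q-bit word and split it into
    p blocks of q bits, most significant block (block p) first."""
    if n > p * q:
        raise ValueError("operand does not fit in p cascaded q-bit blocks")
    aligned = d << (p * q - n)          # D(y).y^(pq-n)
    mask = (1 << q) - 1
    # Block i holds the coefficients of y^(iq-1) down to y^((i-1)q)
    return [(aligned >> ((i - 1) * q)) & mask for i in range(p, 0, -1)]


# n = 10, p = 4, q = 4: only the k = 3 most significant blocks carry data
print([format(b, "04b") for b in map_to_blocks(0b1111111111, 10, 4, 4)])
```

With these parameters the ten operand bits occupy blocks 4, 3 and the upper half of block 2, while block 1 stays at zero.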

Defining a partial product, F'(y), to be

F'(y) = f'_n y^n + f'_{n-1}y^{n-1} + f'_{n-2}y^{n-2} + ... + f'_1 y + f'_0

and redefining Equation 5, the partial product at the ith MLU becomes

[C_i]'(y) = [C_i]'(y) + [G_i](y).f'_n + [A_i](y).e_{n-1}    Equation 7

where

[C_i]'(y) = f'_{n-(p-i)q-1}y^{iq-1} + f'_{n-(p-i)q-2}y^{iq-2} + ... + f'_{n-(p-i)q-(q-1)}y^{(i-1)q+1} + f'_{n-(p-i)q-q}y^{(i-1)q}

Note here that the degree of the partial product [C_i]'(y) is q. For cascaded MLU operation, the most significant bit, y^q, of each partial product [C_i]'(y) will not be used, with the exception of the most significant MLU, which is mapped to f'_n. The first part of figure 8 illustrates this with n = 10, p = 4 and q = 4.

Figure 8 clearly shows that for multiplication over cascaded MLUs, f'_n of Equation 7 will be multiplexed from [c_4]'_4 of the Most Significant MLU (4th) to the less significant MLU Blocks. Furthermore, since the cascaded MLUs are configurable for cascaded operation as well as parallel operation, a multiplexer is needed to select the correct most significant bit of the partial product: [c_p]'_q, taken from the most significant MLU, for cascaded n > q operation, or [c_i]'_q, taken from the ith MLU, for single n < q operation, based on the control signals MSBlock[i] and LSBlock[i].

From the algorithm, it can be appreciated that the partial product needs to be shifted during the iterations. In cascaded operation, this shift operation will span several MLUs and hence the design of the MLU of embodiments takes this into account as shown in figure 10. Figure 9 shows the modified shift register needed to enable the shifting of the partial product.

From Equation 7, it can be appreciated that e_{n-1}, or [b_i]_{q-1} where the ith MLU is most significant, should also be multiplexed for cascaded or single operation and that E(y) also needs shifting.

Figure 10 shows the modified shift register needed to enable the shifting of E(y).

Figure 11 illustrates an internal block diagram of a cascadable MLU 1100. The cascadable MLU 1100 comprises a q-cascaded bit slice module 1102 for performing the multiplication described above with reference to figure 7. The bit slice module 1102 comprises a first input 1104 for receiving the [A_i](y) polynomial. A second input 1106 for receiving the [G_i](y) polynomial is also provided. The bit slice module 1102 produces the above mentioned [C_i]'(y) as an output 1108. This output 1108 is coupled to both a result register 1110, which produces the above mentioned [C_i](y), and a modified shift register 1112, described above with reference to figure 9, for calculating [C_i]'(y) = [C_i]'(y).y. The shift register 1112 receives bit [c_{i-1}]'_{q-1} as an input 1114 from the next least significant MLU (not shown). The modified shift register 1112 produces an output 1116, which is [c_i]'_{q-1}, and an output to be fed to the next most significant MLU.

The output from the result register 1110 is controlled using an output enable control signal 1118. A further modified shift register 1120 is provided for calculating [B_i](y) = [B_i](y).y. This second modified shift register 1120 receives, as an input 1122, the above mentioned [B_i](y). The second modified shift register 1120 also receives, as an input 1124, [b_{i-1}]_{q-1}. Two outputs 1126 and 1128 are produced by the second modified shift register 1120, which are [b_{i+1}]_0, for output to the next most significant MLU, and [b_i]_{q-1}, for output to a multiplexer 1130. The first modified shift register 1112 produces a further output 1117, [c_i]'_q, which is fed to a second multiplexer 1132. The second multiplexer produces an output 1134, f'_n, having received the output 1117 from the first modified shift register 1112 and a further input signal 1136, [c_p]'_q, from the next least significant MLU. The first multiplexer receives an input 1131, [b_p]_{q-1}, from the next least significant MLU.

Both the first 1130 and second 1132 multiplexers are controlled using the MSBlock[i] control signal. Furthermore, both of the modified shift registers are, as described above, responsive to the MSBlock[i] and LSBlock[i] control signals.

Figure 12 shows cascaded MLUs 1200 together with external multiplexers 1202 to 1208 and the Control and Configuration Module 1214.

Multiplication Control and Configuration Logic Module

The control logic 1214 in the Configurable Multiplication Unit, that is, the cascaded MLUs 1210 and 1212, is responsible for controlling the external multiplexers 1202 to 1208 so that the correct information, f'_n and e_{n-1} in the present example or embodiment, is multiplexed to the correct MLU. This is achieved by decoding the control signals MSBlock[i] and LSBlock[i], which determine the span of the operation. Together with the LSBitPos[i] control signal, the control logic is arranged to calculate the number of iterations required for the multiplication operation (n clock cycles for GF(2^n)) for each of the MLUs. An output signal, OutputEnable[i], will be set high when the multiplication result is ready at [C_i](y).

An Implementation of Configurable Division Architecture

The design of embodiments of the Configurable GF Divider will now be described.

The underlying basic element division algorithm for embodiments is based on that described in H. Brunner, A. Curiger, and M. Hofstetter, "On Computing Multiplicative Inverses in GF(2^m)," IEEE Transactions on Computers, vol. 42, pp. 1010, 1993 and J. H. Guo and C. L. Wang, "Systolic Array Implementation of Euclid's Algorithm for Inversion and Division in GF(2^m)," IEEE Transactions on Computers, vol. 47, pp. 1161-1167, 1998, which are incorporated herein by reference for all purposes.

Although the embodiments described herein use the element division algorithms described in the above two papers, embodiments of the present invention are not limited to such an arrangement. Embodiments can be realised in which other division algorithms could equally well be used.

As before, this algorithm operates over a polynomial basis and has been mapped onto a semi-systolic architecture to allow element division over variable field sizes and different irreducible polynomials.

Brunner et al, in the first of the above papers, proposed a variation of Euclid's GCD Algorithm that is amenable to being used in inversion, which was modified for use as a systolic array implementation by Guo and Wang in the second of the above papers.

One skilled in the art will appreciate that C(y) = A(y)/B(y) mod G(y) needs to be calculated, where the irreducible polynomial over GF(2^q) is

G(y) = y^q + g_{q-1}y^{q-1} + g_{q-2}y^{q-2} + ... + g_1 y + g_0    Equation 8

Therefore, let γ be the root of G(y). Hence,

γ^q = g_{q-1}γ^{q-1} + g_{q-2}γ^{q-2} + ... + g_1 γ + g_0    Equation 9

It will also be appreciated by one skilled in the art that A(y), B(y), C(y) are elements of the field GF(2^q) described in the polynomial basis.

A(y) = a_{q-1}y^{q-1} + a_{q-2}y^{q-2} + ... + a_1 y + a_0    Equation 10
B(y) = b_{q-1}y^{q-1} + b_{q-2}y^{q-2} + ... + b_1 y + b_0    Equation 11
C(y) = c_{q-1}y^{q-1} + c_{q-2}y^{q-2} + ... + c_1 y + c_0    Equation 12

The Division Algorithm is described below:

    R = B(y); S = G(y); U = A(y); V = 0; count = 0;
    for i = 1 to 2q do
      if r_q == 0 then       (* r_q denotes the coefficient of the term y^q of R *)
        R = y.R mod G(y);
        U = y.U mod G(y);    (* W(y) = W(y).y mod G(y) *)
        count = count + 1;
      else
        if s_q == 1 then     (* s_q denotes the coefficient of the term y^q of S *)
          S = S + R;
          V = V + U;
        end if;
        S = y.S mod G(y);
        if count == 0 then
          R exchange with S;
          U exchange with V;
          U = y.U mod G(y);  (* W(y) = W(y).y mod G(y) *)
          count = count + 1;
        else
          U = U/y mod G(y);  (* Z(y) = Z(y)/y mod G(y) *)
          count = count - 1;
        end if;
      end if;
    end loop;
    (* Result C(y) is in U at this instance *)
    (* Count must be equal to zero at this instance *)

Key Operations of Division Algorithm

The key operations in the algorithm are U = y.U mod G(y) and U = U/y mod G(y).

It will be appreciated that the R = y.R mod G(y) operation is a logical left shift and that this operation is executed only when r_q = 0. As for S = y.S mod G(y), the operation S = S + R will be executed if s_q = 1 and, since r_q = 1, the resultant is s_q = 0. Hence, S = y.S mod G(y) is also a logical left shift operation.
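The division algorithm set out above can be followed step by step in software. The Python sketch below is illustrative only and is not the hardware of the embodiments; the function name gf_divide and the integer encoding of polynomials (bit i holding the coefficient of y^i, with g including the y^q term) are assumptions of this sketch. It tests r_q for zero in the first branch, in line with Tables 4 and 5.

```python
def gf_divide(a, b, g, q):
    """Return A(y)/B(y) mod G(y) in GF(2^q).

    g is the irreducible polynomial including the y^q term (so g_0 = 1).
    """
    if b == 0:
        raise ZeroDivisionError("division by the zero element")
    top = 1 << q                        # position of the y^q coefficient

    def mul_y(x):                       # x.y mod G(y)
        x <<= 1
        return x ^ g if x & top else x

    def div_y(x):                       # x/y mod G(y), relies on g_0 = 1
        return (x ^ g) >> 1 if x & 1 else x >> 1

    r, s, u, v, count = b, g, a, 0, 0
    for _ in range(2 * q):
        if not r & top:                 # r_q == 0
            r <<= 1                     # plain left shift, no reduction needed
            u = mul_y(u)
            count += 1
        else:
            if s & top:                 # s_q == 1
                s ^= r
                v ^= u
            s <<= 1                     # s_q is now 0, so a plain left shift
            if count == 0:
                r, s = s, r
                u, v = v, u
                u = mul_y(u)
                count += 1
            else:
                u = div_y(u)
                count -= 1
    return u                            # count is back at zero here


# Inverse of y in GF(2^3) with G(y) = y^3 + y + 1
print(bin(gf_divide(0b001, 0b010, 0b1011, 3)))
```

Because R and S only ever need plain left shifts, the reduction modulo G(y) is confined to the U path, which is the observation made in the paragraph above.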

W(y) = y.W(y) mod G(y) Operation

The calculation of W(y) is illustrated above in the algorithm. Let

W(y) = w_{q-1}y^{q-1} + w_{q-2}y^{q-2} + ... + w_1 y + w_0    Equation 12

and

W'(y) = w'_{q-1}y^{q-1} + w'_{q-2}y^{q-2} + ... + w'_1 y + w'_0 = y.W mod G(y)    Equation 13

Rearranging Equation 8 gives

y^q mod G(y) = g_{q-1}y^{q-1} + g_{q-2}y^{q-2} + ... + g_1 y + g_0    Equation 14

Substituting Equation 12 and Equation 14 into Equation 13 and then comparing coefficients, the following equations are derived:

w'_0 = w_{q-1}.g_0    Equation 15
w'_i = w_{q-1}.g_i + w_{i-1}, 1 ≤ i ≤ q-1    Equation 16

Z(y) = Z(y)/y mod G(y) Operation

The calculation of Z(y) is illustrated in the above algorithm. Let

Z(y) = z_{q-1}y^{q-1} + z_{q-2}y^{q-2} + ... + z_1 y + z_0    Equation 17

As g_0 = '1', from Equation 14,

1 = (y^{q-1} + g_{q-1}y^{q-2} + g_{q-2}y^{q-3} + ... + g_2 y + g_1).y mod G(y)    Equation 18

Dividing both sides by y, we have:

y^{-1} mod G(y) = y^{q-1} + g_{q-1}y^{q-2} + g_{q-2}y^{q-3} + ... + g_2 y + g_1    Equation 19

Let

Z'(y) = z'_{q-1}y^{q-1} + z'_{q-2}y^{q-2} + ... + z'_1 y + z'_0 = Z/y mod G(y)    Equation 20

Substituting Equation 17 and Equation 19 into Equation 20 and comparing coefficients gives

z'_{q-1} = z_0    Equation 21
z'_i = z_{i+1} + z_0.g_{i+1}, 0 ≤ i ≤ q-2    Equation 22

wherein z'_{q-1} and z'_i represent the bits of Equation 20.
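The coefficient equations for the two key operations can be checked bit by bit. The following Python sketch is an illustration under stated assumptions: polynomials are held as lists of q bits with index i holding the coefficient of y^i, the list g omits the y^q term, and the function names are hypothetical.

```python
def mul_by_y_bits(w, g):
    """W'(y) = y.W(y) mod G(y), one output bit per Equations 15 and 16."""
    q = len(w)
    msb = w[q - 1]
    out = [msb & g[0]]                                       # Equation 15
    out += [(msb & g[i]) ^ w[i - 1] for i in range(1, q)]    # Equation 16
    return out


def div_by_y_bits(z, g):
    """Z'(y) = Z(y)/y mod G(y), one output bit per Equations 21 and 22."""
    q = len(z)
    lsb = z[0]
    out = [z[i + 1] ^ (lsb & g[i + 1]) for i in range(q - 1)]  # Equation 22
    out.append(lsb)                                            # Equation 21
    return out


# GF(2^3) with G(y) = y^3 + y + 1, i.e. g_0 = 1, g_1 = 1, g_2 = 0
g = [1, 1, 0]
print(mul_by_y_bits([0, 0, 1], g))   # y.y^2 = y^3 = y + 1
print(div_by_y_bits([1, 0, 0], g))   # 1/y = y^2 + 1
```

Each output bit depends on only one neighbouring input bit and one shared bit (w_{q-1} or z_0), which is what makes a bit-sliced realisation straightforward.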

Data Paths

From the algorithm, it can be appreciated that there are primarily two data paths, that is, one for the U and V vectors and the other for the R and S vectors.

RS Path

Figure 13 shows the block diagram for the RS path 1300. Depending on the current values of r_q, s_q and CountIsZero (True if Count = 0), the new values of R and S, that is, R' and S', are computed for the current iteration. This is illustrated in Table 4 below.

Condition                                  R'          S'
r_q = 0                                    y.R         S
r_q = 1, s_q = 0, CountIsZero = False      R           y.S
r_q = 1, s_q = 1, CountIsZero = False      R           y.(S+R)
r_q = 1, s_q = 1, CountIsZero = True       y.(S+R)     R
r_q = 1, s_q = 0, CountIsZero = True       y.S         R

TABLE 4

UV Path

Figure 14 shows the block diagram for the UV path 1400. The new values of U and V, that is, U' and V', are computed as illustrated in Table 5 below.

Condition                                  U'                    V'
r_q = 0                                    y.U mod G(y)          V
r_q = 1, s_q = 0, CountIsZero = False      U/y mod G(y)          V
r_q = 1, s_q = 1, CountIsZero = False      U/y mod G(y)          V + U
r_q = 1, s_q = 1, CountIsZero = True       y.(V+U) mod G(y)      U
r_q = 1, s_q = 0, CountIsZero = True       y.V mod G(y)          U

TABLE 5

Modification for Element Division over Variable Field Size on a Single DLU

Embodiments of a single DLU are capable of performing division and inversion operations over a field size of GF(2^n), where n ≤ q. To enable embodiments to operate on field sizes less than GF(2^q), the following needs to be appreciated by one skilled in the art.

It can be seen from Equation 15 and Equation 16, as well as from the division algorithm, that bit variables are needed for the division operation. The bit variables are r_q, s_q and w_{q-1}, taken from the most significant bits of R(y), S(y) and W(y), and two bit variables, z_0 and g_0, which are taken from the least significant bits of Z(y) from Equation 17 and G(y) respectively.

Therefore, embodiments are arranged such that the DLU accepts left-justified data vectors (A(y), B(y), G(y)). Hence, the most significant bit of the data vectors will always be at the most significant bit of the data paths of the DLU. The LSB position of the data vectors can then be determined by the LSBitPos control signal and, from it, z_0 and g_0 can be multiplexed.

Generalising Equation 15 and Equation 16 gives

w'_i = w_{q-1}.g_i + (w_{i-1} AND NOT(isLSB_i)), 0 ≤ i ≤ q-1    Equation 23

where isLSB_i = '1' when the ith bit is the least significant bit of the operation and is derived from the LSBitPos control signal by the Control Module.

Figure 15 shows a bit sliced logic circuit of Equation 23. Cascading n-1 of these bit-slices together will yield an n-bit W = y.W mod G(y) operator.

Figure 16 shows a bit sliced logic circuit of Equation 22 of the Z = Z/y mod G(y) operator. Both Equation 21 and Equation 22 require z_0 (the least significant bit of Z(y)), which can be determined using a multiplexer selected by LSBitPos.

Modifications for Element Division over Field Sizes > GF(2^q) on Cascaded DLUs

To realise embodiments of DLUs that can operate using a field size that is greater than GF(2^q), embodiments cascade two or more DLUs together so that the resultant entity can support element division over GF(2^n) where n > q.

Let [R_i](y), [S_i](y), [U_i](y), [Z_i](y), [W_i](y), [Z'_i](y) and [W'_i](y) represent the vectors R(y), S(y), U(y), Z(y), W(y), Z'(y) and W'(y) of the ith DLU. [r_q]_i, [s_q]_i, [z_q]_i and [w_q]_i are the most significant bits of [R_i](y), [S_i](y), [Z_i](y) and [W_i](y). [z_0]_i represents the least significant bit of [Z_i](y).

Let the irreducible polynomial of a field GF(2^n) be:

G(y) = y^n + g_{n-1}y^{n-1} + g_{n-2}y^{n-2} + ... + g_1 y + g_0    Equation 24

with A(y), B(y), C(y) ∈ GF(2^n) as:

A(y) = a_{n-1}y^{n-1} + a_{n-2}y^{n-2} + ... + a_1 y + a_0    Equation 25
B(y) = b_{n-1}y^{n-1} + b_{n-2}y^{n-2} + ... + b_1 y + b_0    Equation 26
C(y) = c_{n-1}y^{n-1} + c_{n-2}y^{n-2} + ... + c_1 y + c_0    Equation 27

One skilled in the art will appreciate that C(y) = A(y)/B(y) mod G(y) should be computed to give effect to the desired division for field sizes greater than GF(2^q).

Let [A_i](y), [B_i](y) and [G_i](y) be the input operands of the ith DLU.

Assuming there are p cascaded DLUs, the operands are mapped onto the inputs of the k consecutive most significant DLUs, aligned left justified, i.e.

A(y) = A(y).y^{pq-n} = a_{n-1}y^{pq-1} + a_{n-2}y^{pq-2} + ... + a_1 y^{pq-n+1} + a_0 y^{pq-n}
B(y) = B(y).y^{pq-n} = b_{n-1}y^{pq-1} + b_{n-2}y^{pq-2} + ... + b_1 y^{pq-n+1} + b_0 y^{pq-n}
G(y) = G(y).y^{pq-n} = y^{pq} + g_{n-1}y^{pq-1} + g_{n-2}y^{pq-2} + ... + g_1 y^{pq-n+1} + g_0 y^{pq-n}

and

[A_p](y) ⇐ a_{n-1}y^{pq-1} + a_{n-2}y^{pq-2} + ... + a_{n-(q-1)}y^{(p-1)q+1} + a_{n-q}y^{(p-1)q}
[B_p](y) ⇐ b_{n-1}y^{pq-1} + b_{n-2}y^{pq-2} + ... + b_{n-(q-1)}y^{(p-1)q+1} + b_{n-q}y^{(p-1)q}
[G_p](y) ⇐ g_{n-1}y^{pq-1} + g_{n-2}y^{pq-2} + ... + g_{n-(q-1)}y^{(p-1)q+1} + g_{n-q}y^{(p-1)q}

Note that the MSB of G(y) is always '1'.

i.e.
[a_p]_{q-1} = a_{n-1}, [a_p]_{q-2} = a_{n-2}, ..., [a_p]_1 = a_{n-(q-1)}, [a_p]_0 = a_{n-q}
[b_p]_{q-1} = b_{n-1}, [b_p]_{q-2} = b_{n-2}, ..., [b_p]_1 = b_{n-(q-1)}, [b_p]_0 = b_{n-q}
[g_p]_{q-1} = g_{n-1}, [g_p]_{q-2} = g_{n-2}, ..., [g_p]_1 = g_{n-(q-1)}, [g_p]_0 = g_{n-q}

Furthermore, the input operands of the (p-1)th DLU will be:

[A_{p-1}](y) ⇐ a_{n-q-1}y^{(p-1)q-1} + a_{n-q-2}y^{(p-1)q-2} + ... + a_{n-(2q-1)}y^{(p-2)q+1} + a_{n-2q}y^{(p-2)q}
[B_{p-1}](y) ⇐ b_{n-q-1}y^{(p-1)q-1} + b_{n-q-2}y^{(p-1)q-2} + ... + b_{n-(2q-1)}y^{(p-2)q+1} + b_{n-2q}y^{(p-2)q}
[G_{p-1}](y) ⇐ g_{n-q-1}y^{(p-1)q-1} + g_{n-q-2}y^{(p-1)q-2} + ... + g_{n-(2q-1)}y^{(p-2)q+1} + g_{n-2q}y^{(p-2)q}

and the input operands of the ith DLU will be:

[A_i](y) ⇐ a_{n-(p-i)q-1}y^{iq-1} + a_{n-(p-i)q-2}y^{iq-2} + ... + a_{n-(p-i)q-(q-1)}y^{(i-1)q+1} + a_{n-(p-i)q-q}y^{(i-1)q}
[B_i](y) ⇐ b_{n-(p-i)q-1}y^{iq-1} + b_{n-(p-i)q-2}y^{iq-2} + ... + b_{n-(p-i)q-(q-1)}y^{(i-1)q+1} + b_{n-(p-i)q-q}y^{(i-1)q}
[G_i](y) ⇐ g_{n-(p-i)q-1}y^{iq-1} + g_{n-(p-i)q-2}y^{iq-2} + ... + g_{n-(p-i)q-(q-1)}y^{(i-1)q+1} + g_{n-(p-i)q-q}y^{(i-1)q}

Similarly,

Z(y) = Z(y).y^{pq-n} = z_{n-1}y^{pq-1} + z_{n-2}y^{pq-2} + ... + z_1 y^{pq-n+1} + z_0 y^{pq-n}
W(y) = W(y).y^{pq-n} = w_{n-1}y^{pq-1} + w_{n-2}y^{pq-2} + ... + w_1 y^{pq-n+1} + w_0 y^{pq-n}
Z'(y) = Z'(y).y^{pq-n} = z'_{n-1}y^{pq-1} + z'_{n-2}y^{pq-2} + ... + z'_1 y^{pq-n+1} + z'_0 y^{pq-n}
W'(y) = W'(y).y^{pq-n} = w'_{n-1}y^{pq-1} + w'_{n-2}y^{pq-2} + ... + w'_1 y^{pq-n+1} + w'_0 y^{pq-n}

Modification of RS Path and UV Path

The two Logical Left Shift units in the RS Path, which perform the R(y) = y.R(y) and S(y) = y.S(y) operations, have to be modified. If the DLU is not the least significant block, the bits left shifted into the Logical Left Shift units are from the next least significant DLU. If the DLU is the Least Significant Block (i.e. LSBlock = True), then '0's should be shifted in.

From Equation 23, it can be appreciated that if the ith DLU is not the least significant DLU then

[w'_i]_0 = w_{n-1}.g_0 + [w_{i-1}]_{q-1}

where w_{n-1} = [w_p]_{q-1} and g_0 is the LSB of G(y). Otherwise, [w'_i]_0 = w_{n-1}.g_0.

From Equation 22, if the ith DLU is not the most significant block, then

[z'_i]_{q-1} = [z_{i+1}]_0 + z_0.[g_{i+1}]_0

If the ith DLU is the most significant block (i.e. i = p), then

[z'_i]_{q-1} = z_0

where z_0 is the LSB of Z(y).

It should be noted that z_0 and g_0 are multiplexed from the Least Significant DLU in the operation and w_{n-1} is multiplexed from the most significant DLU.

Figure 17 shows the internal block diagram of a DLU 1700 with all the interconnections required to enable cascadable DLU operation. Figure 18 shows a cascadable DLU 1800 with Control and Configuration Logic 1802.

Count Module

It can be seen from the division algorithm that some sort of a counter is preferable. Embodiments of the present invention preferably utilise a special counter, which is shown in figure 19. The counter 1900 is a shift register 1902 that can shift in both directions. The shift register 1902 is initialised to all zeros except for its least significant bit, which is set to '1'. When a count = count + 1 operation is encountered, the shift register will left shift, whereas if a count = count - 1 operation is encountered, the shift register 1902 will right shift.

Hence, the position of the bit in the shift register 1902 containing the '1' will be the current value of count, i.e. when bit 0 of shift register 1902 is '1', count = 0. The word size of the shift register 1902 is 2q, as the value of count will not exceed 2q. In the case of cascading two or more DLUs to operate over a field size of > GF(2^q), the count modules of the DLUs can also be concatenated to form a larger counter. For example, for two DLUs, the maximum value of count will be 4q.

Division Control and Configuration Logic Module

The main function of the Control and Configuration Logic Module 1802 is the same as that of its counterpart in the MLU. It multiplexes the bit variables required for an element division based on the field size of the required operation. However, the control module 1802 in the DLU is much more complex than that of the MLU because it needs to multiplex a greater number of bit variables.

Furthermore, these bit variables can come from both the most significant DLU and the least significant DLU. As with the MLU, the DLU determines or derives the information necessary for multiplexing from the control signals MSBlock[i], LSBlock[i] and LSBitPos[i].

Also through the control signals, the control and configuration module 1802 keeps track of the field size of the operation and determines the number of iterations needed to complete the element division operation.
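The bidirectional shift-register counter of the Count Module section can be modelled in a few lines. The Python class below is a behavioural sketch only; the class name ShiftCounter, its method names and the overflow and underflow checks are assumptions added for illustration and do not appear in the description.

```python
class ShiftCounter:
    """One-hot counter: the position of the single '1' bit is the count."""

    def __init__(self, q):
        self.width = 2 * q      # word size stated in the text
        self.reg = 1            # all zeros except the least significant bit

    def increment(self):
        """count = count + 1 is a left shift."""
        if self.reg >> (self.width - 1):
            raise OverflowError("counter would shift out of the register")
        self.reg <<= 1

    def decrement(self):
        """count = count - 1 is a right shift."""
        if self.reg & 1:
            raise OverflowError("counter is already zero")
        self.reg >>= 1

    @property
    def value(self):
        return self.reg.bit_length() - 1

    @property
    def is_zero(self):
        # The CountIsZero test reduces to reading bit 0
        return bool(self.reg & 1)


counter = ShiftCounter(q=4)
counter.increment()
counter.increment()
counter.decrement()
print(counter.value, counter.is_zero)
```

A one-hot encoding of this kind means the zero test needs no comparator, since it is a single bit of the register.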

Implementation of an Embodiment of a GF(2^{pq}) Processor

An embodiment of a GF(2^{pq}) processor, which will show the internal workings thereof, will now be described. An instruction set is also defined. One skilled in the art will appreciate that the design of the instruction set will vary as it is dependent on the parameters p and q as well as the intended application. Therefore, the instruction set defined below should not be taken as being limiting, notwithstanding its providing an example of an implementation of an instruction set. Figure 20 shows a block diagram of the prototype GF processor 2000, which was constructed by the inventor.

In this case, the processor 2000 is designed for Reed-Solomon codes as well as both the AES and Elliptic Curve Cryptography.

The programming of the processor to realise Reed-Solomon Codecs, AES Encryption and Decryption as well as Elliptic Curve Point Addition will be described hereafter. Referring to figure 20, there is shown, schematically, the architecture of a GF processor 2000.

The processor 2000 comprises a data input/output port 2002 for receiving and outputting data to other parts of a communication system (not shown), a communication port 2004 for external control and handshaking signals, and an instruction input/output port 2006 which provides a dynamic update of instructions to be executed. It can be seen that the processor 2000 also comprises a configurable GF multiplier 2008 and a configurable GF divider 2010, both of which have been described above.

The processor 2000 also comprises a module 2012 for performing GF addition and summation functions. The data to be processed by the multiplier, divider and addition and summation modules is stored in the register file 2014. An embodiment of a structure of the register file 2014 has been described above. A data manipulation module 2016 is provided to perform shift operations and to move coefficient locations, that is, the REPO, REPA and SHPX instructions. The processor 2000 also comprises a Chien search module 2018, which is used to find the roots of the error location polynomial and locate the errors. A root counter 2020 is provided to keep track of the error locations. Having located the errors, error location registers 2022 are then updated. Using the Forney algorithm, the error magnitudes are then evaluated and error-magnitude registers 2024 are then updated. A data bus 2026 is provided for routing data throughout the processor 2000. Furthermore, a control bus 2028 is provided for controlling the above elements. The processor is controlled using a processor state machine 2028, a state register 2030, which keeps track of the current state of the processor, a program counter 2032, branch logic 2034, an instruction decoder 2036 and, optionally, an instruction ROM/RAM 2038.

Register File Structure

Figure 21 shows the register file structure 2100 in greater detail. Each register location R0 to R? is p.q bits wide, which corresponds to the internal data bus width of the processor. It can also be considered as p q-bit wide coefficient locations. The data stored in a register location will depend on the application for which the processor 2000 is currently configured. For example, when implementing an RS Codec, a register location can contain a polynomial, or part thereof, with coefficients as elements of the Galois field GF(2^q) or smaller (hence the name Coefficient Location). The coefficients are stored left justified.

As an illustration of the present invention, an embodiment of a GF processor with p = 8 and q = 8 will be described. The total width of the register file is 64 bits (p multiplied by q). Assume that the processor is to be used to implement an RS(31,25) codec. The Galois field this particular codec operates on is GF(2^5). This also means that the coefficients of the polynomials used in the codec are elements of the Galois Field GF(2^5) or, alternatively, that 5 bits are required to represent these elements. Assume that the Galois field is generated by the irreducible polynomial p(y) = y^5 + y^2 + 1. The way a polynomial A(x) with coefficients as elements of the field GF(2^5) is stored in a register location is shown below:

A(x) = α^11 x^7 + α^14 x^6 + α^5 x^5 + α^3 x^4 + α^28 x^3 + α x^2 + α^18 x + α^2

or in binary:

A(x) = (00111)x^7 + (11101)x^6 + (00101)x^5 + (01000)x^4 + (10110)x^3 + (00010)x^2 + (00011)x + (00100)

Since the size of each coefficient in this case is five bits, and the size of the coefficient location specified above is 8 bits, each binary representation of the coefficients of A(x) is stored left justified as shown in figure 22.

It will be appreciated that the GF processor will be operating in SIMD mode in this case.
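The left-justified storage of the example polynomial can be reproduced numerically. The Python sketch below is illustrative only; the function name pack_coefficients is hypothetical, and placing the x^7 coefficient in the most significant coefficient location is an assumption, since figure 22 is not reproduced here.

```python
def pack_coefficients(coeffs, n, q):
    """Pack n-bit coefficients, highest power first, into q-bit coefficient
    locations of one register word, each coefficient left justified."""
    reg = 0
    for c in coeffs:
        if c >> n:
            raise ValueError("coefficient wider than n bits")
        reg = (reg << q) | (c << (q - n))
    return reg


# Binary coefficients of A(x) from the text, x^7 down to x^0
a_coeffs = [0b00111, 0b11101, 0b00101, 0b01000,
            0b10110, 0b00010, 0b00011, 0b00100]
print(format(pack_coefficients(a_coeffs, 5, 8), "016X"))
# prints 38E82840B0101820
```

Each 5-bit coefficient occupies the upper five bits of its 8-bit location, leaving the lower three bits at zero.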

On the other hand, assuming the data to be stored is an element defined over a Galois Field of size GF(2^60), it will be stored in the manner shown in figure 23. The processor is then said to be operating in Single Instruction Single Data (SISD) mode.

This structure is advantageous because it provides a way of storing data of different field sizes for different applications in exactly the same way without ambiguity. Furthermore, all computations in Reed-Solomon decoding involve polynomials with coefficients from an underlying Galois Field. As will be appreciated by one skilled in the art, Reed-Solomon decoding is about solving a set of polynomial equations to detect and correct errors. The structure lends itself to computations involving polynomials, as there are p Logic Units that can compute p polynomial coefficients at the same time.

The value of p is selected with respect to some cost function involving, for example, power, speed and/or area, when designing the GF processor.

As an example, suppose the embodiment of a GF processor is designed for the RS(31,25) triple error correcting code, for which the underlying Galois Field is GF(2^5) (i.e. q ≥ 5). The received vector will have 31 coefficients. Since triple error correcting is envisaged, it will have six syndromes, (2t), and the syndrome polynomial will have six coefficients. If an Extended Euclidean Algorithm is used for finding the error location and error magnitude polynomials, it will operate on, at most, seven, that is, (2t + 1), coefficients at a time. The value of p (i.e. the number of coefficients that can be processed in parallel) will, therefore, determine the performance of the GF processor for use in decoding the RS(31,25) code.

An advantage of the embodiments of the present invention resides in its inherent flexibility. The configurable architecture provides a very straightforward way of configuring the Multiplier and Divider circuits for operation over different field sizes. Indeed, embodiments of the GF processor can be used for one Galois Field operation of size up to GF(2^{pq}) or p GF(2^q) Field operations. An instruction set for the GF processor will be described in detail below. However, as an example, the MULT instruction may be arranged to multiply the GF(2^{pq}) field element in register file location R1 with that in location R2 and to save the result to Rdes, or it may be arranged to multiply the p GF(2^q) field elements in register file location R1 with another p GF(2^q) field elements in location R2 and to save the result to Rdes. Hence, the GF Processor will automatically know the data structure stored in the register file by reference to the application it is currently running (i.e. the Galois Field size the current application operates on, etc).

Galois Field Processor Instruction Set

Addition

Galois Field addition is realised using the bitwise exclusive or (XOR) between two operands with elements in the corresponding Galois Field. Embodiments can be realised in which there are two instructions that deal with GF addition, namely ADDP, which refers to "add polynomial", and ACCU, "accumulate". The syntax of the ADDP and ACCU instructions is as follows:

ADDP Ra, Rb, Rdes

Adds the polynomial in register location Ra to that in register location Rb and saves the result to register location Rdes. This instruction will XOR each element in each coefficient location in Ra with that of the corresponding coefficient location in Rb and save the result in Rdes.

ACCU Ra, Rb, Ca, Rdes (Not Available in SISD mode)

Accumulates all coefficient locations of register location Ra, replaces coefficient location Ca of register location Rb with the result and saves to register location Rdes. In preferred embodiments, it should be noted that data in register locations Ra and Rb is not changed. How can this be correct when the accumulated coefficients of Ra are stored in Rb?

Figure 24 shows an example of the operation of the ACCU instruction.
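The effect of the two addition instructions can be sketched as follows. This Python model is an illustration under stated assumptions: a register location is represented as a list of p integers, one per coefficient location, and the function names addp and accu simply echo the instruction mnemonics. Returning a new list models the result being written to Rdes while Ra and Rb are left unchanged.

```python
from functools import reduce
from operator import xor


def addp(ra, rb):
    """ADDP: XOR corresponding coefficient locations of Ra and Rb."""
    return [x ^ y for x, y in zip(ra, rb)]


def accu(ra, rb, ca):
    """ACCU: XOR all coefficient locations of Ra together and place the
    sum in coefficient location ca of a copy of Rb."""
    rdes = list(rb)                 # Rb itself is not modified
    rdes[ca] = reduce(xor, ra, 0)
    return rdes


ra = [1, 2, 4, 8]
rb = [0, 0, 0, 0]
print(addp([1, 2, 3, 4], [4, 3, 2, 1]))
print(accu(ra, rb, 2))
```

In this model the accumulated value appears only in the destination, which is consistent with the statement that the data in Ra and Rb is not changed.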

Multiplication and Division

To give effect to multiplication and division, embodiments of the GF processor make use of the Configurable Multiplier and Divider Architectures described above. The configurable multiplier and divider units have p parallel multiplication/division logic units, which are each capable of performing GF operations up to GF(2^q). They can also be configured to operate as a unit to perform GF operations up to a field size of GF(2^{pq}). This is illustrated in figure 25.

Embodiments can be realised in which the two instructions for multiplication and division are MULT and DIVI respectively. The multiply and divide instructions are defined as

MULT Ra, Rb, Rdes

This instruction multiplies the content of Ra with the content of Rb, coefficient location by coefficient location, and saves the result in Rdes.

DIVI Ra, Rb, Rdes

This instruction divides the content of Ra by the content of Rb, coefficient location by coefficient location, and saves the result in Rdes. Ra is the numerator and Rb is the denominator.

One skilled in the art should note that this is not polynomial multiplication or division. Figure 26 shows an example of the operation of the multiply/divide instructions.

In Parallel Operation Mode (SIMD), the MULT and DIVI instructions will perform p parallel operations per instruction. In Single Operation Mode (SISD), they will only perform one operation per instruction.

Data Manipulation Instructions
Embodiments use data manipulation instructions to move data around the register file. Preferred embodiments may provide the following data manipulation instructions.

REPA Ra, Ca, Rdes (Not Usable in SISD mode)
This instruction replaces all coefficient locations in Rdes by the data in coefficient location Ca of register Ra. It should be noted that this instruction is not available in Single Operation Mode.

REPO Ra, Rb, Ca, Cb, Rdes (Not Usable in SISD mode)
This instruction replaces the data in coefficient location Cb of Rb by the data in coefficient location Ca of register Ra and saves the result into register Rdes, the result being the content of Rb except for coefficient location Cb, which is replaced by Ca of Ra. In preferred embodiments, the contents of Ra and Rb are not changed. It should be noted that this instruction is not available in Single Operation Mode.

SHPX Ra, Dir, #Num, Rdes (Not Usable in SISD mode)
This instruction shifts the contents of register Ra by #Num coefficient locations towards the MSB if Dir is MSB, or by #Num coefficient locations towards the LSB if Dir is LSB, and saves the result into register Rdes. It should be noted that the content of Ra is not changed and that this instruction is not available in Single Operation Mode.

COPY Ra, Rdes
This instruction copies the contents of register location Ra into register location Rdes.
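A minimal Python sketch of the data manipulation instructions is given below. It models a register as a list whose index 0 is coefficient location 1 (the LSB end). The function names are hypothetical, and filling vacated locations with zeros in SHPX is an assumption, since the text does not state what is shifted in.

```python
def repa(ra, ca):
    """REPA: broadcast coefficient location ca of ra to every location."""
    return [ra[ca - 1]] * len(ra)


def repo(ra, rb, ca, cb):
    """REPO: copy of rb with location cb replaced by location ca of ra."""
    out = list(rb)
    out[cb - 1] = ra[ca - 1]
    return out


def shpx(ra, direction, num):
    """SHPX: shift by num locations; vacated locations are zero (assumption)."""
    p = len(ra)
    if direction == "MSB":
        return ([0] * num + ra)[:p]
    return (ra + [0] * num)[num:]


def copy(ra):
    """COPY: duplicate the register contents."""
    return list(ra)


ra = [1, 2, 3, 4, 5, 6, 7, 8]
rb = [0] * 8
```

Each function returns a new list, mirroring the statement that the source registers are not changed unless they are also named as the destination.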

Conditional Branching Instructions
These instructions provide the testing and branching capabilities that are preferable for usable code. In preferred embodiments, there are two status flags associated with these instructions, isZero and isCountZero.

LCNT #Initial
This instruction loads the Global Counter with an initial value, either #Initial or the least significant 16 bits of register location Ra. It will set the isCountZero flag to TRUE if the value is zero, otherwise to FALSE.

DECT
This instruction decrements the Global Counter by one. If the value of the Global Counter is zero, the isCountZero flag will be set. The flag can only be reset by the BEQZ, BNEQ and LCNT instructions.

TEST Ra, Ca
For Parallel Operation Mode, this instruction tests coefficient location Ca of register location Ra. If the content is zero, the instruction sets the isZero flag.

For Single Operation Mode, this instruction tests register location Ra. If the content is zero, the instruction sets the isZero flag.

BEQZ #Num, Dir, Type
This instruction is Branch if equal to Zero.

If Type = Counter, the instruction will test the isCountZero flag. If Type = Contents, the instruction will test the isZero flag.

If TRUE, the instruction will cause execution to branch #Num number of instructions. If Dir = '0', a branch forward is effected. If Dir = '1', a branch backwards is effected.

The instruction resets the isCountZero or isZero flag to FALSE according to Type.

BNEQ #Num, Dir, Type
This instruction is Branch if not equal to Zero.

If Type = Counter, the instruction will test the isCountZero flag. If Type = Contents, the instruction will test the isZero flag.

If FALSE, the instruction will cause execution to branch #Num number of instructions. If Dir = '0', a branch forward is effected. If Dir = '1', a branch backwards is effected.

The instruction resets the isCountZero or isZero flag to FALSE according to Type.

JUMP #Num, Dir
This instruction is Branch #Num number of instructions. If Dir = '0', a branch forward is executed. If Dir = '1', a branch backwards is executed.

Miscellaneous instructions (Setup and Interface)
The preferred embodiments have a processor design that is based on a Load Store Architecture. Data-oriented instructions (like the GF arithmetic instructions) operate around the register file. Communication with the external world is done via special interface instructions, which perform handshaking functions to load data into or retrieve data from the register file.

STAT #state, #action
This instruction is Set the state of the GF processor. It is used to determine the type of processing performed by the processor. The type of processing varies with the value of #state. When the value of #state is:
'1' RS Codec - Encoding is performed;
'2' RS Codec - Decoding is performed (Syndrome Computation / Euclidean Algorithm);
'3' RS Codec - Decoding is performed (Chien Search / Forney Algorithm);
'4' AES Forward Encryption is performed; and
'5' AES Inverse Encryption is performed.

The value of #action, in some embodiments, is optional. The effect of the value of #action varies as follows:
When '0', the value has no effect; and
When it is any other value, the effect is to toggle handshaking signals from the GF Processor to the external world.

SETC Ra, #Type
This instruction is used to set the irreducible polynomial for the GF Multiplier and GF Divisor.

The irreducible polynomial is saved in Ra left justified, without its MSB. For example, if p = 8, q = 8 and the Galois Field is $GF(2^5)$ with irreducible polynomial
$P(y) = y^5 + y^2 + 1$
then $y^2 + 1$, or (00101000), will be saved in the coefficient locations of Ra.

It should be noted that this instruction can also be used to set up other parameters of the GF Processor using Ra, for example to tell the processor the Field Size, operating mode etc.
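The left-justified encoding used by SETC can be illustrated with a short Python sketch. The function name and its argument layout are hypothetical; only the encoding rule (drop the MSB, then left justify in a q-bit coefficient location) is taken from the description above.

```python
def setc_encoding(poly, m, q):
    """Return the q-bit word saved for an irreducible polynomial of degree m.

    poly holds the full polynomial including its x^m term; the MSB is
    dropped and the remaining m bits are left justified in q bits.
    """
    without_msb = poly & ((1 << m) - 1)
    return without_msb << (q - m)


# P(y) = y^5 + y^2 + 1 stored in an 8-bit coefficient location
word = setc_encoding(0b100101, 5, 8)
print(format(word, "08b"))
```

For the example in the text this prints 00101000, matching the value stated for $GF(2^5)$ with q = 8.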

LOAD #from, Ra
This instruction loads data from external sources to the register file. The external source of the data is determined by the value of #from.

STOR Ra, #to
This instruction saves data from the register file to external sources. The external destination of the data is determined by the value of #to.

AFFT Ra, #forward/inverse, Rdes
This instruction is a special instruction for use by the Advanced Encryption Standard only. The instruction performs the forward or inverse affine transform required by the S-box, as will be appreciated by one skilled in the art.

Processor State Machine
Figure 27 shows a Processor State Machine for embodiments of the present invention. All instructions except MULT, DIVI, LOAD, STOR and other communications and interface instructions follow the middle branch.

The main reason behind this design is that the number of clock cycles needed to complete the multiplication or division depends on the current Galois Field size m. For multiplication, the number of clock cycles is (m + n), where n is the overhead required to load the instruction and the operands and to save the result. For division, the number of clock cycles is (2m + n). Therefore, for large values of m, there is a risk that the processor will be idle while awaiting results when it could execute other instructions that do not depend on the results of the multiplication or division (i.e. do not violate data dependencies).

For example, Table 6 below shows a program segment of a number of instructions that are to be executed, which illustrates the above.

a) MULT R1, R2, R1
b) ADDP R3, R4, R3
c) ADDS R1, R2, R1
Table 6

Assuming that the processor is configured to perform operations in $GF(2^m)$, where m = 16, the MULT instruction will take (16 + n) clock cycles to finish. Instead of waiting (16 + n) clock cycles, the processor will try to execute the next instruction if there are no data dependencies. In this case, instruction (b) does not require the result of the MULT instruction, which will be saved in R1, and hence will be executed. However, it can be appreciated that the third instruction needs the result in R1 and, if the MULT operation has not finished by the time instruction (b) has finished executing, the pipeline will stall. Figure 28 provides an illustration of the above.

The result of Multiplication or Division is saved into the register file immediately after the current instruction has finished executing.

This feature is useful for large field size multiplication and division operations. Programs written for the processor can make use of this feature to reduce the number of clock cycles needed to execute them.

Applications Applications of embodiments of the present invention will now be described, again, by way of example only.

Programming the GF Processor for RS Codecs
An embodiment of the present invention will be described with reference to the GF Processor being programmed to perform RS coding with the instruction set.

The steps involved are RS encoding with codeword generation, and RS decoding with syndrome computation, Euclidean Algorithm, Chien Search and Forney Algorithm.

The RS code used will be an RS(n,k) code with t = (n-k)/2 and the underlying field $GF(2^m)$, where $n = 2^m - 1$.

Non-Systematic RS Encoding
In non-systematic RS Encoding for an RS(n,k) code, an information polynomial I(x) of degree k-1 is multiplied by a generator polynomial of degree n-k to get a codeword C(x) of degree n-1.

Let
$G(x) = x^{n-k} + g_{n-k-1}x^{n-k-1} + \cdots + g_1 x + g_0$
$I(x) = i_{k-1}x^{k-1} + i_{k-2}x^{k-2} + \cdots + i_1 x + i_0$
$C(x) = c_{n-1}x^{n-1} + c_{n-2}x^{n-2} + \cdots + c_1 x + c_0$

The RS encoding algorithm is as follows:
Let $T(x) = t_{n-k}x^{n-k} + t_{n-k-1}x^{n-k-1} + \cdots + t_1 x + t_0 = 0$;
for count in 0 to k-1 loop
T(x) = T(x) + i_count.G(x);
c_count = t_0;
T(x) = T(x) / x;
end loop;

The implementation of the above algorithm is illustrated by the example below.
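The accumulate-and-shift loop above can be sketched in Python as follows. This is a minimal software model under stated assumptions: the field is taken to be $GF(2^5)$ with $x^5 + x^2 + 1$, the generator polynomial is built from the roots $\alpha^1$ to $\alpha^6$, the information symbols are arbitrary, and all function names are hypothetical. After the k iterations the remaining n-k coefficients of C(x) are read from T(x), which is how the sketch completes the codeword.

```python
PRIM_POLY, M = 0b100101, 5  # assumed field: GF(2^5), x^5 + x^2 + 1


def gf_mul(a, b):
    """Multiply two GF(2^5) elements by shift-and-reduce."""
    result = 0
    for _ in range(M):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M):
            a ^= PRIM_POLY
    return result


def poly_mul(a, b):
    """Reference schoolbook product; index i holds the coefficient of x^i."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] ^= gf_mul(x, y)
    return out


def rs_encode_nonsystematic(info, gen):
    """Compute C(x) = I(x).G(x) with the T(x) accumulate-and-shift loop."""
    r = len(gen) - 1                      # n - k
    t = [0] * (r + 1)                     # T(x) = 0
    c = []
    for count in range(len(info)):        # k iterations
        t = [tc ^ gf_mul(info[count], g) for tc, g in zip(t, gen)]
        c.append(t[0])                    # c_count = t_0
        t = t[1:] + [0]                   # T(x) = T(x) / x
    return c + t[:r]                      # remaining coefficients sit in T(x)


ALPHA_POW = [1]
for _ in range(30):
    ALPHA_POW.append(gf_mul(ALPHA_POW[-1], 2))

gen = [1]
for j in range(1, 7):                     # roots alpha^1 .. alpha^6 (assumed)
    gen = poly_mul(gen, [ALPHA_POW[j], 1])

info = list(range(1, 26))                 # 25 arbitrary information symbols
codeword = rs_encode_nonsystematic(info, gen)
```

The result has 31 coefficients and agrees with the direct product `poly_mul(info, gen)`, so the loop is simply a serialised polynomial multiplication that emits one codeword coefficient per iteration.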

Assume that an embodiment of the GF Processor is arranged such that p = 8 and that RS(31,25) encoding is required with the following generator polynomial and information polynomial:
$G(x) = x^6 + \alpha^{10}x^5 + \alpha^{9}x^4 + \alpha^{24}x^3 + \alpha^{16}x^2 + \alpha^{24}x + \alpha^{21}$
I(x) = a x + ax23 + a9XI6 + o24Xi5 + a.9X8, +a2X7 + as 6 + 9 5 t3 4 5 a'x3 + a20X2 +a22X + ct,26
G(x) and I(x) are loaded into the register file as shown in figure 29. The instructions needed to implement the RS Encoding algorithm for RS(31,25) are listed below.

Initially count = 0;
1) REPA (R1 | R2 | R3 | R4), C((count mod p) + 1), R5
All the coefficient locations in R5 now contain the coefficient i_count.

2) MULT R5, Pa, R5 R5 now contains the polynomial iCount.G(x).

15 3) ADDP R5, R6, R6

This instruction performs T(x) = T(x) + i_count.G(x).

4) REPO R6, R7, C1, C((count mod p) + 1), (R7 | R8 | R9 | R10)
This instruction saves the coefficient to coefficient location (count mod p) + 1 in R7, R8, R9 or R10, as the degree of C(x) is 30 and hence 31 coefficient locations, or 4 register locations, are needed to store its coefficients.

5) SHPX R6, LSB, #1, R6

This instruction performs the T(x) = T(x) / x operation.
Increment count;
Steps 1 to 5 are repeated k times to compute C(x).

Syndrome Computation
In Syndrome Computation, the 2t consecutive roots of the generator polynomial are substituted into the received vector to get 2t syndromes. If all 2t syndromes are zero, the received vector is a code word and hence no errors are present. Otherwise, a decoding algorithm is used to locate and determine the errors and to correct the received vector.

Let
$R(x) = r_0 + r_1 x + r_2 x^2 + \cdots + r_{n-1}x^{n-1}$ (Equation 28)
be the received vector of an RS(n,k) code; let p be the number of parallel multiplication logic units available in the GF Processor.

Rewriting Equation 28 into sections of p gives
$R(x) = r_0 + r_1 x + r_2 x^2 + \cdots + r_{p-1}x^{p-1}$
$\quad + r_p x^p + r_{p+1}x^{p+1} + \cdots + r_{2p-1}x^{2p-1}$
$\quad + \cdots$
$\quad + r_{fp}x^{fp} + r_{fp+1}x^{fp+1} + \cdots + r_{n-1}x^{n-1}$ (Equation 29)
where f is (n/p) - 1 rounded up to the nearest integer, so that the last section starts at $r_{fp}$.

Taking out common factors gives
$R(x) = (r_0 + r_1 x + \cdots + r_{p-1}x^{p-1})$
$\quad + x^{p}(r_p + r_{p+1}x + \cdots + r_{2p-1}x^{p-1})$
$\quad + x^{2p}(r_{2p} + r_{2p+1}x + \cdots + r_{3p-1}x^{p-1})$
$\quad + \cdots$
$\quad + x^{fp}(r_{fp} + r_{fp+1}x + \cdots + r_{n-1}x^{n-fp-1})$ (Equation 30)

To calculate the syndromes, assume that the code can correct t errors, i.e. it will have 2t syndromes.

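To find the syndromes, substitute 2t consecutive powers of $\alpha$ into R(x).

For $S_1$, $R(\alpha)$ becomes
$R(\alpha) = (r_0 + r_1\alpha + \cdots + r_{p-1}\alpha^{p-1})$
$\quad + \alpha^{p}(r_p + r_{p+1}\alpha + \cdots + r_{2p-1}\alpha^{p-1})$
$\quad + \alpha^{2p}(r_{2p} + r_{2p+1}\alpha + \cdots + r_{3p-1}\alpha^{p-1})$
$\quad + \cdots + \alpha^{fp}(r_{fp} + r_{fp+1}\alpha + \cdots + r_{n-1}\alpha^{n-fp-1})$ (Equation 31)

For $S_2$, $R(\alpha^2)$ is
$R(\alpha^2) = (r_0 + r_1\alpha^2 + \cdots + r_{p-1}\alpha^{2(p-1)})$
$\quad + \alpha^{2p}(r_p + r_{p+1}\alpha^2 + \cdots + r_{2p-1}\alpha^{2(p-1)})$
$\quad + \alpha^{4p}(r_{2p} + r_{2p+1}\alpha^2 + \cdots + r_{3p-1}\alpha^{2(p-1)})$
$\quad + \cdots + \alpha^{2fp}(r_{fp} + r_{fp+1}\alpha^2 + \cdots + r_{n-1}\alpha^{2(n-fp-1)})$ (Equation 32)

For $S_{2t}$, $R(\alpha^{2t})$ is
$R(\alpha^{2t}) = (r_0 + r_1\alpha^{2t} + \cdots + r_{p-1}\alpha^{2t(p-1)})$
$\quad + \alpha^{2tp}(r_p + r_{p+1}\alpha^{2t} + \cdots + r_{2p-1}\alpha^{2t(p-1)})$
$\quad + \alpha^{4tp}(r_{2p} + r_{2p+1}\alpha^{2t} + \cdots + r_{3p-1}\alpha^{2t(p-1)})$
$\quad + \cdots + \alpha^{2tfp}(r_{fp} + r_{fp+1}\alpha^{2t} + \cdots + r_{n-1}\alpha^{2t(n-fp-1)})$ (Equation 33)

From Equation 31, Equation 32 and Equation 33, a recursive pattern can be seen. Firstly, one skilled in the art appreciates that the following values should be pre-calculated:
$\{1, \alpha, \alpha^2, \ldots, \alpha^{p-1}\}$ (Equation 34)
$\{1, \alpha^{p}, \alpha^{2p}, \ldots, \alpha^{fp}\}$ (Equation 35)
This allows the syndromes to be calculated from $S_1$.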

From the above, it can be seen that the coefficients of R(x) are multiplied first by the elements of Equation 34 and then multiplied throughout by the individual elements of Equation 35, as shown in figure 30.

The results are first saved and then XOR summed together to get $S_1$. $S_2$ can be calculated from the saved pre-summed multiplication results of $S_1$, which is shown in figure 31.

The results are then saved again and XOR summed to get $S_2$. It will be appreciated that repeating this 2t times will yield the required 2t syndromes.

The above method provides a recursive way of computing the syndromes without requiring too many precomputed values.
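The recursion can be sketched in Python as follows. The sketch assumes $GF(2^5)$ with $x^5 + x^2 + 1$, p = 8 and an arbitrary received vector; the function names are hypothetical. Each pass multiplies the saved products by the elements of Equation 34 and Equation 35 again and XOR sums them, and a Horner evaluation is included only as an independent reference.

```python
PRIM_POLY, M = 0b100101, 5  # assumed field: GF(2^5), x^5 + x^2 + 1


def gf_mul(a, b):
    result = 0
    for _ in range(M):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M):
            a ^= PRIM_POLY
    return result


def gf_pow(a, e):
    result = 1
    for _ in range(e):
        result = gf_mul(result, a)
    return result


def poly_eval(coeffs, x):
    """Horner evaluation, used here only as a reference."""
    acc = 0
    for c in reversed(coeffs):
        acc = gf_mul(acc, x) ^ c
    return acc


def syndromes_recursive(received, two_t, p):
    """Return [S_1, ..., S_2t] by re-multiplying the saved products."""
    n = len(received)
    sections = (n + p - 1) // p
    low = [gf_pow(2, i) for i in range(p)]              # Equation 34
    high = [gf_pow(2, p * s) for s in range(sections)]  # Equation 35
    work = list(received)
    syndromes = []
    for _ in range(two_t):
        # after j passes, work[i] holds r_i . alpha^(i.j)
        work = [gf_mul(gf_mul(w, low[i % p]), high[i // p])
                for i, w in enumerate(work)]
        s = 0
        for w in work:
            s ^= w                                      # XOR sum
        syndromes.append(s)
    return syndromes


received = [(7 * i + 3) % 32 for i in range(31)]        # arbitrary test vector
synd = syndromes_recursive(received, 6, 8)
```

Only p + f + 1 constants are precomputed, yet every syndrome is obtained, because the products saved for $S_j$ are reused as the starting point for $S_{j+1}$.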

Extended Euclidean Algorithm for Error Location and Error Magnitude Polynomials Computations
The purpose of the Euclidean Algorithm is to compute the Greatest Common Divisor pair of the Key Equation for decoding Reed Solomon codes. The Key Equation is shown below:
$\Lambda(x)[1 + S(x)] \equiv \Omega(x) \bmod x^{2t+1}$
where $\Lambda(x)$ is the Error Location Polynomial, $\Omega(x)$ is the Error Magnitude Polynomial and S(x) is the Syndrome Polynomial.

Euclid's algorithm for non-binary decoding is as follows:
1) Compute the Syndrome Polynomial S(x);
2) Set the following initial conditions:
$R_{-1}(x) = x^{2t+1}$, $T_{-1}(x) = 0$, $R_0(x) = 1 + S(x)$, $T_0(x) = 1$;
3) Using the extended algorithm, compute the successive remainders $R_i(x)$ and the corresponding $T_i(x)$ until $\deg[R_i(x)] \le t$:
$R_i = R_{i-2} + Q_i R_{i-1}$
$T_i = T_{i-2} + Q_i T_{i-1}$
4) At this instance, $T_i(x) = \Lambda(x)$ and $R_i(x) = \Omega(x)$.

The number of iterations required by the algorithm will depend on the error correction capability of the code it is decoding.

For example, the decoding steps for the Euclidean algorithm of a triple error correcting code are shown below:

i     R_i          Q_i       T_i
-1    x^(2t+1)     -         0
0     1 + S(x)     -         T_0 = 1
1     R_1(x)       Q_1(x)    T_1(x) = Q_1(x)
2     R_2(x)       Q_2(x)    T_2(x) = T_1(x).Q_2(x) + 1
3     R_3(x)       Q_3(x)    T_3(x) = T_2(x).Q_3(x) + T_1(x)

$R_i$ is the remainder of dividing $R_{i-2}$ by $R_{i-1}$ and $Q_i$ is the quotient. When the degree of $R_i(x)$ is $\le 3$, $R_i(x)$ will be the Error Magnitude Polynomial and $T_i(x)$ will be the Error Location Polynomial.

There are two key operations in the algorithm, polynomial GF multiplication and polynomial GF division, which are described below.

Polynomial GF Multiplication
The steps involved in GF multiplication are similar to those of non-systematic RS encoding but on a smaller scale.

Two polynomials, with elements of a field $GF(2^q)$ as coefficients, are multiplied. Assume that the two polynomials are:
$A(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_g x^g$
$B(x) = b_0 + b_1 x + b_2 x^2 + \cdots + b_g x^g$
The multiplication is effected as follows.

Assume B(x) is in register R0 and A(x) is in R1. The LSB coefficients will be in coefficient location 1.

1) REPA R1, C1, R2
This instruction sets all coefficient locations in R2 to $a_0$.

2) MULT R0, R2, R3
Multiply register R0 with R2 and save the result to R3. The contents of R3 after this operation will be:
$a_0 b_0 + a_0 b_1 x + a_0 b_2 x^2 + \cdots + a_0 b_g x^g$

3) SHPX R0, MSB, #1, R0
Shift B(x) by one coefficient location towards the MSB and save back to R0.

4) REPA R1, C2, R2
This instruction sets all coefficient locations in R2 to $a_1$.

5) MULT R0, R2, R4
Multiply register R0 with R2 and save the result to R4. The contents of R4 after this operation will be:
$0 + a_1 b_0 x + a_1 b_1 x^2 + a_1 b_2 x^3 + \cdots + a_1 b_g x^{g+1}$

6) ADDP R3, R4, R3
Add the polynomial in R3 to the polynomial in R4 and save the result in R3. The polynomial in R3 after this instruction will be:
$a_0 b_0 + (a_1 b_0 + a_0 b_1)x + (a_1 b_1 + a_0 b_2)x^2 + \cdots + (a_1 b_{g-1} + a_0 b_g)x^g + a_1 b_g x^{g+1}$
Step 3 onwards are repeated until all coefficients in A(x) have been multiplied and added to the partial sum.

Polynomial GF Division
Two polynomials, with elements of a field $GF(2^q)$ as coefficients, are divided as follows. Assume that the two polynomials are:
$A(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_g x^g$
$B(x) = b_0 + b_1 x + b_2 x^2 + \cdots + b_{g-1}x^{g-1}$
One skilled in the art appreciates that polynomials that satisfy the following should be determined.

$A(x) = C(x) + Q(x)B(x)$
This can be achieved using the following steps.

Assume B(x) is in register R0 and A(x) is in R1. The LSB coefficients will be in coefficient location 1. Also assume, in this case, that the degree of B(x) is always one less than that of A(x).

7) REPA R0, C(g-1), R2
This instruction sets all coefficient locations in R2 to $b_{g-1}$. Cg means the coefficient location that contains the coefficient of $x^g$.

8) REPA R1, Cg, R3
This instruction sets all coefficient locations in R3 to $a_g$.

9) DIVI R3, R2, R3
Divide all the $a_g$ in R3 throughout by $b_{g-1}$. After this instruction, all coefficient locations in R3 will be $a_g/b_{g-1}$. (Note: $a_g/b_{g-1}$ is the MSB coefficient of Q(x).)

10) REPO R3, R4, C1, C2, R4
This instruction saves the MSB coefficient of Q(x) into coefficient location 2 of R4.

11) MULT R3, R0, R3
Multiply the polynomial in R3 by that in R0 and save to R3. After this instruction, the polynomial in R3 will be:
$a_g(b_0/b_{g-1}) + a_g(b_1/b_{g-1})x + a_g(b_2/b_{g-1})x^2 + \cdots + a_g x^{g-1}$

12) SHPX R3, MSB, #1, R3
Shift R3 by one coefficient location towards the MSB and save back to R3. The polynomial in R3 will then be:
$0 + a_g(b_0/b_{g-1})x + a_g(b_1/b_{g-1})x^2 + a_g(b_2/b_{g-1})x^3 + \cdots + a_g x^g$

13) ADDP R3, R1, R1
Add the polynomial in R3 to A(x) in R1 and save back into R1. After step 13, the polynomial in R1 will be the result of the partial division, with degree $\le g-1$.

14) REPA R1, C(g-1), R3
This instruction replicates C(g-1) from R1 and saves it into R3, which means that all coefficient locations in R3 now hold the coefficient of $x^{g-1}$ from R1.

15) DIVI R3, R2, R3
This instruction divides the value in each coefficient location in R3 by that of the corresponding one in R2 and stores the result in R3, see, for example, figure 26.

16) REPO R3, R4, C1, C1, R4
This instruction saves the LSB coefficient of Q(x) into coefficient location C1 of R4. R4 now contains the quotient Q(x).

17) MULT R3, R0, R3
This instruction multiplies the value in each coefficient location in R3 by that of the corresponding one in R0 and stores the result in R3, see figure 26.

18) ADDP R3, R1, R1
This instruction performs a polynomial addition of the value in each coefficient location in R3 with that of the corresponding one in R1 and stores the result in R1.

Steps 14 to 18 repeat the process of steps 7 to 13, that is, they perform another partial division, and after step 18, register R1 will contain the remainder of the polynomial division, C(x). The quotient Q(x) is in R4.

In this case, the degree of C(x) will be $\le g-2$ and Q(x) will have 2 coefficients (degree 1).

If the degree of B(x) is less than the degree of A(x) by more than one, the steps above need to be changed accordingly. For example, if the difference in degree is 2, that is, if the degree of A(x) is g and the degree of B(x) is g-2, then 3 partial divisions will be needed to compute the polynomial division. Furthermore, Q(x) will now have 3 coefficients (degree 2) and the degree of C(x) will be $\le g-3$.

The TEST instruction can be used to determine the degree of B(x) by checking if the most significant coefficient of B(x) is zero.
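The polynomial multiplication and division procedures described above can be modelled in Python as a sketch. Registers are lists with index 0 standing for coefficient location 1; the comments name the instruction each line imitates. The field ($GF(2^8)$ with $x^8 + x^4 + x^3 + x + 1$), the function names and the general-degree division loop are assumptions for illustration, not the processor's microcode.

```python
POLY, M = 0x11B, 8  # assumed field for the example


def gf_mul(a, b):
    result = 0
    for _ in range(M):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M):
            a ^= POLY
    return result


def gf_inv(a):
    result = 1
    for _ in range((1 << M) - 2):
        result = gf_mul(result, a)
    return result


def poly_mul_registers(a, b, p):
    """Multiply A(x) by B(x); needs len(a) + len(b) - 1 <= p."""
    r0 = b + [0] * (p - len(b))
    r1 = a + [0] * (p - len(a))
    r3 = [0] * p
    for loc in range(1, len(a) + 1):
        r2 = [r1[loc - 1]] * p                       # REPA R1, C_loc, R2
        r4 = [gf_mul(x, y) for x, y in zip(r0, r2)]  # MULT R0, R2, R4
        r3 = [x ^ y for x, y in zip(r3, r4)]         # ADDP R3, R4, R3
        r0 = [0] + r0[:-1]                           # SHPX R0, MSB, #1, R0
    return r3


def poly_divmod(a, b):
    """Return (Q, C) with A(x) = C(x) + Q(x)B(x); b[-1] must be non-zero."""
    rem = list(a)
    deg_b = len(b) - 1
    inv_lead = gf_inv(b[-1])
    quot = [0] * (len(a) - len(b) + 1)
    for shift in range(len(a) - len(b), -1, -1):
        coef = gf_mul(rem[shift + deg_b], inv_lead)  # REPA, REPA, DIVI
        quot[shift] = coef                           # REPO into the quotient
        for i, bc in enumerate(b):                   # MULT, SHPX, ADDP
            rem[shift + i] ^= gf_mul(coef, bc)
    return quot, rem[:deg_b]


a = [0x57, 0x83, 0x1A, 0x02]
b = [0x0E, 0x09, 0x01]
product = poly_mul_registers(a, b, 8)[: len(a) + len(b) - 1]
quotient, remainder = poly_divmod(product, b)
```

Dividing the product by B(x) returns A(x) with a zero remainder, which serves as a consistency check between the two procedures.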

Chien Search and Forney Algorithm
Preferred embodiments of the present invention comprise a specialized module to perform the Chien Search and part of the Forney algorithm. In such embodiments, the implementation is an extension of the MULT instruction and is accessed only when the processor is in state 3 (Chien Search/Forney Algorithm State). The instruction STAT is used to change the state of the processor.

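The Chien search algorithm is a systematic way of evaluating the roots of the error location polynomial $\Lambda(x)$.
$\Lambda(x) = \lambda_t x^t + \cdots + \lambda_1 x + 1$, where $\lambda_i \in GF(2^m)$ (Equation 36)
The algorithm works by substituting every inverse of the elements of $GF(2^m)$ into $\Lambda(x)$ and evaluating $\Lambda(x)$. If the result is zero, then the element is a root of $\Lambda(x)$. It also makes use of the following identities:
$\alpha^{0} = \alpha^{u-1} = 1,\ \alpha^{-1} = \alpha^{u-2},\ \alpha^{-2} = \alpha^{u-3},\ \ldots$, where $u = 2^m$
It is a recursive process where $\alpha^{-1}$ is first substituted into $\Lambda(x)$ and evaluated for zero.
$\Lambda(\alpha^{u-2}) = \Lambda(\alpha^{-1}) = \lambda_t\alpha^{-t} + \lambda_{t-1}\alpha^{-(t-1)} + \cdots + \lambda_1\alpha^{-1} + 1$ (Equation 37)
If it is zero then $\alpha^{-1}$ (or $\alpha^{u-2}$) is a root of $\Lambda(x)$ and an error is present at location u-2. Next substitute $\alpha^{-2}$ (or $\alpha^{u-3}$) into $\Lambda(x)$, which is equivalent to:
$\Lambda(\alpha^{u-3}) = \Lambda(\alpha^{-2}) = (\lambda_t\alpha^{-t})\alpha^{-t} + (\lambda_{t-1}\alpha^{-(t-1)})\alpha^{-(t-1)} + \cdots + (\lambda_1\alpha^{-1})\alpha^{-1} + 1$ (Equation 38)
Similarly, to evaluate $\alpha^{-3}$ (or $\alpha^{u-4}$):
$\Lambda(\alpha^{u-4}) = \Lambda(\alpha^{-3}) = (\lambda_t\alpha^{-t}\alpha^{-t})\alpha^{-t} + (\lambda_{t-1}\alpha^{-(t-1)}\alpha^{-(t-1)})\alpha^{-(t-1)} + \cdots + (\lambda_1\alpha^{-1}\alpha^{-1})\alpha^{-1} + 1$ (Equation 39)
Generalising,
$\Lambda(\alpha^{u-1-i}) = \Lambda(\alpha^{-i}) = \lambda_t\alpha^{-it} + \lambda_{t-1}\alpha^{-i(t-1)} + \cdots + \lambda_1\alpha^{-i} + 1$ (Equation 40)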

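As can be seen above, the Chien Search process is just a recursive multiplication of the terms of $\Lambda(\alpha^{-i})$ by a set of precomputed powers of $\alpha$, $\{\alpha^{-1}, \alpha^{-2}, \ldots, \alpha^{-t}\}$. This is repeated $2^m - 1$ times, once for each non-zero element of $GF(2^m)$. While this provides the error locations, the magnitudes of the errors still need to be calculated. The Forney algorithm is used to compute the error magnitudes according to the following equation:
$e_j = \dfrac{\Omega(\alpha^{-j})}{\alpha^{-j}\Lambda'(\alpha^{-j})}$ (Equation 41)
where $e_j$ is the error magnitude at location j and $\alpha^{-j}$ is a root of $\Lambda(x)$.

$\Lambda'(x)$ is defined as the derivative of $\Lambda(x)$.
$\Lambda'(x) = t\lambda_t x^{t-1} + (t-1)\lambda_{t-1}x^{t-2} + \cdots + 2\lambda_2 x + \lambda_1$ (Equation 42)
If t is odd,
$\Lambda'(x) = t\lambda_t x^{t-1} + (t-2)\lambda_{t-2}x^{t-3} + \cdots + \lambda_1 = \lambda_t x^{t-1} + \lambda_{t-2}x^{t-3} + \cdots + \lambda_1$ (Equation 43)
If t is even,
$\Lambda'(x) = (t-1)\lambda_{t-1}x^{t-2} + (t-3)\lambda_{t-3}x^{t-4} + \cdots + \lambda_1 = \lambda_{t-1}x^{t-2} + \lambda_{t-3}x^{t-4} + \cdots + \lambda_1$ (Equation 44)
Furthermore, if t is odd,
$x\Lambda'(x) = \lambda_t x^{t} + \lambda_{t-2}x^{t-2} + \cdots + \lambda_1 x$ (Equation 45)
If t is even,
$x\Lambda'(x) = \lambda_{t-1}x^{t-1} + \lambda_{t-3}x^{t-3} + \cdots + \lambda_1 x$ (Equation 46)
From Equation 45 and Equation 46, it can be seen that $x\Lambda'(x)$ is actually $\Lambda(x)$ without its even power terms.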

Hence, $x\Lambda'(x)$ can be calculated at the same time as searching for the roots of $\Lambda(x)$, by summing the odd power terms of $\Lambda(\alpha^{-i})$ if $\alpha^{-i}$ is a root of $\Lambda(x)$.

$\Omega(\alpha^{-i})$ can also be evaluated at the same time as the roots of $\Lambda(x)$, using the same method as that for $\Lambda(\alpha^{-i})$ and the precomputed powers of $\alpha$, $\{\alpha^{-1}, \alpha^{-2}, \ldots, \alpha^{-t}\}$.
$\Omega(\alpha^{u-1-i}) = \Omega(\alpha^{-i}) = \omega_t\alpha^{-it} + \omega_{t-1}\alpha^{-i(t-1)} + \cdots + \omega_1\alpha^{-i} + 1$ (Equation 47)
The Chien Search Module is designed with the above considerations. Figure 32 shows the module. This module is coupled to the GF multiplier and is activated whenever the GF processor is in State 3.
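The recursive evaluation used by the Chien Search can be sketched in Python as follows. The sketch assumes $GF(2^5)$ with $x^5 + x^2 + 1$ and a locator polynomial constructed for the example with roots $\alpha^{-3}$ and $\alpha^{-10}$; the function names and the returned (i, odd-term sum) pairs are hypothetical choices for illustration.

```python
PRIM_POLY, M = 0b100101, 5  # assumed field: GF(2^5), x^5 + x^2 + 1


def gf_mul(a, b):
    result = 0
    for _ in range(M):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M):
            a ^= PRIM_POLY
    return result


def gf_pow(a, e):
    result = 1
    for _ in range(e):
        result = gf_mul(result, a)
    return result


def chien_search(lam):
    """lam = [1, lambda_1, ..., lambda_t]; returns (i, x.Lambda'(x)) per root alpha^-i."""
    n = (1 << M) - 1
    alpha_inv = gf_pow(2, n - 1)                       # alpha^-1
    steps = [gf_pow(alpha_inv, k) for k in range(len(lam))]
    work = list(lam)
    found = []
    for i in range(1, n + 1):                          # 2^m - 1 passes
        # after i passes, work[k] holds lambda_k . alpha^(-i.k)
        work = [gf_mul(w, s) for w, s in zip(work, steps)]
        total = 0
        odd_sum = 0
        for k, w in enumerate(work):
            total ^= w
            if k % 2 == 1:
                odd_sum ^= w                           # odd power terms only
        if total == 0:
            found.append((i % n, odd_sum))
    return found


# Lambda(x) = (1 + alpha^3 x)(1 + alpha^10 x)
lam = [1, gf_pow(2, 3) ^ gf_pow(2, 10), gf_pow(2, 13)]
roots = chien_search(lam)
```

Each pass costs one multiplication per coefficient, and the odd-term sum collected at a root is the denominator $\alpha^{-i}\Lambda'(\alpha^{-i})$ needed by the Forney equation.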

To illustrate how to implement the Chien Search on the GF Processor, consider the following example of RS(31,25).

Assume that the following values have been calculated and stored in register location R0:
$\{\alpha^{-1}, \alpha^{-2}, \ldots, \alpha^{-t}\}$
Assume that $\Lambda(x) + 1$ and $\Omega(x)$ are stored in R1 as shown in figure 33. It should be noted that the coefficient of $x^0$ for $\Lambda(x)$ is omitted. This is compensated for in the Chien Search Module by testing the result of the root evaluation for '1' instead of '0'. The steps for the Chien Search are as follows:
1) STAT #3
2) MULT R0, R1, R1
The operation MULT R0, R1, R1 is executed $2^m - 1$ times.

Whenever a MULT instruction is executed in State 3, an extra clock cycle at the end of the MULT operation will access the Chien Search Module. If the IsRoot signal is high, the values $\Omega(\alpha^{-j})$ and $\alpha^{-j}\Lambda'(\alpha^{-j})$ are saved into the error magnitude registers. At the end of the Chien Search, a special division instruction, DIVC, will divide all $\Omega(\alpha^{-j})$ by $\alpha^{-j}\Lambda'(\alpha^{-j})$, which will give the error magnitudes, which are saved back into the error magnitude registers.

Programming the GF Processor for the Advanced Encryption Standard
The Rijndael Block Cipher has been chosen for the Advanced Encryption Standard (AES). The GF processor can also be used to implement the AES. This section will describe how to program the GF Processor to perform the AES. Details of the operation of the AES can be found in J. Daemen and V. Rijmen, "AES Proposal: The Rijndael Block Cipher," 2000, which is incorporated herein by reference for all purposes. Operations described in Rijndael operate over the Galois field $GF(2^8)$ with a non-primitive, irreducible polynomial m(x), which is given by
$m(x) = x^8 + x^4 + x^3 + x + 1$
The cipher consists of a number of Round Transformations (Nr) that operate on the 'state', which consists of Nb columns of 4-byte words. The cipher key is similar, with Nk columns of 4-byte words. The number of round transformations Nr is pre-determined by Nk and Nb as shown in Table D.

Nr        Nb = 4    Nb = 6    Nb = 8
Nk = 4    10        12        14
Nk = 6    12        12        14
Nk = 8    14        14        14
Table D

Each Round Transformation consists of four different transformations, which are:
RoundTransform (State, RoundKey)
{
ByteSub (State);
ShiftRow (State);
MixColumn (State);
AddRoundKey (State, RoundKey);
}
The RoundKey is derived from the Cipher Key by means of a Key Schedule and Key Expansion.

The full cipher operation will be:
Cipher (Input, Cipher Key)
{
RoundKey = Cipher Key;
State = Input;
AddRoundKey (State, RoundKey);
For I in 1 to Nr-1 loop
RoundKey = KeyExpansion (RoundKey, RC); -- Interleaved Key Expansion
RoundTransform (State, RoundKey);
End loop;
-- Final Round;
RoundKey = KeyExpansion (RoundKey);
ByteSub (State);
ShiftRow (State);
MixColumn (State);
AddRoundKey (State, RoundKey);
Output = State;
}
In line with J. Daemen and V. Rijmen, "AES Proposal: The Rijndael Block Cipher," 2000, data vectors in this section will be described in hexadecimal notation.

An example where Nk = 4 and Nb = 4 (i.e. Nr = 10) will be used to illustrate how to program a GF Processor where p = 8 and q = 8 for Rijndael operation.

The input and the cipher key for this example are:
Input: 3243F6A8885A308D313198A2E0370734

Input State:
32 88 31 E0
43 5A 31 37
F6 30 98 07
A8 8D A2 34

Cipher Key: 2B7E151628AED2A6ABF7158809CF4F3C

Cipher Key:
2B 28 AB 09
7E AE F7 CF
15 D2 15 4F
16 A6 88 3C

The Input and the Cipher Key are loaded into the Register File as shown in figure 34.

The ByteSub Transformation (Forward and Inverse)
The two instructions that are used are DIVI and AFFT.

The ByteSub Transformation is invertible and consists of two separate operations, which are the Forward ByteSub Transformation and the Inverse ByteSub Transformation.

The Forward ByteSub Transformation consists of the following steps.

1. Take the multiplicative inverse of each individual byte of the State over $GF(2^8)$. '0x00' is mapped onto itself.
DIVI Ra, Rb, Rc;
Ra contains a one in every coefficient location. Rb contains the state.

2. Apply a Forward Affine Transformation over GF(2) to each individual byte of the State.
AFFT Rc, #forward, Rc;

The Inverse ByteSub Transformation consists of the following steps.

1. Apply an Inverse Affine Transformation over GF(2) to each individual byte of the State.
AFFT Rc, #inverse, Rc;

2. Take the multiplicative inverse of each individual byte of the State over $GF(2^8)$. '0x00' is mapped onto itself.
DIVI Ra, Rc, Rd;
Ra contains a one in every coefficient location. Rc contains the state.
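The two ByteSub steps can be sketched in Python for a single byte. The inversion stands in for DIVI with an all-ones numerator and the affine functions stand in for AFFT; the function names are hypothetical, and the rotation-based form of the affine transforms is the usual software formulation of the Rijndael S-box rather than anything specified by the processor.

```python
POLY, M = 0x11B, 8  # Rijndael field polynomial


def gf_mul(a, b):
    result = 0
    for _ in range(M):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M):
            a ^= POLY
    return result


def gf_inv(a):
    """a^254, which maps 0x00 onto itself."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def rotl8(b, s):
    return ((b << s) | (b >> (8 - s))) & 0xFF


def affine_forward(b):
    return b ^ rotl8(b, 1) ^ rotl8(b, 2) ^ rotl8(b, 3) ^ rotl8(b, 4) ^ 0x63


def affine_inverse(b):
    return rotl8(b, 1) ^ rotl8(b, 3) ^ rotl8(b, 6) ^ 0x05


def byte_sub(b):
    """Forward ByteSub: inversion (DIVI) followed by the affine map (AFFT)."""
    return affine_forward(gf_inv(b))


def inv_byte_sub(b):
    """Inverse ByteSub: inverse affine map (AFFT) followed by inversion (DIVI)."""
    return gf_inv(affine_inverse(b))
```

In SIMD mode the processor would apply the same two steps to all p bytes of a register location at once; the sketch shows one coefficient location only.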

For example, if a certain state stored in R0 and R1 is to be ByteSub Forward Transformed, the sequence of instructions will be:
DIVI R31, R0, R0; -- R31 contains all one coefficients
DIVI R31, R1, R1;
AFFT R0, #forward, R0; -- Forward ByteSub
AFFT R1, #forward, R1; -- Results in R0 and R1;

Applying the Inverse ByteSub Transform gives
AFFT R0, #inverse, R0; -- R31 contains all
AFFT R1, #inverse, R1; -- one coefficients
DIVI R31, R0, R0; -- Inverse ByteSub Results in
DIVI R31, R1, R1; -- R0 and R1;

The ShiftRow Transformation
The ShiftRow Transformation cyclically shifts the rows of the state according to different offsets. Row 0 is not shifted, but Row 1 is shifted by C1 bytes, Row 2 is shifted by C2 bytes and Row 3 is shifted by C3 bytes.

The values of C1, C2 and C3 are dependent on the block length Nb. In the present example, Nb = 4, so: C1 = 1; C2 = 2; C3 = 3. For example, applying the ShiftRow Transformation to the input vector gives:

Before ShiftRow     After ShiftRow
32 88 31 E0         32 88 31 E0
43 5A 31 37         5A 31 37 43
F6 30 98 07         98 07 F6 30
A8 8D A2 34         34 A8 8D A2

The sequence of instructions required to effect this operation is:
-- Initial Conditions: R4, R5, R6, R7 = All zeros
-- ShiftRow (R0, R1)
REPO R0 R4 C1 C1 R4 -- First Column
REPO R0 R4 C6 C2 R4
REPO R1 R4 C3 C3 R4
REPO R1 R4 C8 C4 R4
REPO R0 R5 C5 C1 R5 -- Second Column
REPO R1 R5 C2 C2 R5
REPO R1 R5 C7 C3 R5
REPO R0 R5 C4 C4 R5
REPO R1 R6 C1 C1 R6 -- Third Column
REPO R1 R6 C6 C2 R6
REPO R0 R6 C3 C3 R6
REPO R0 R6 C8 C4 R6
REPO R1 R7 C5 C1 R7 -- Fourth Column
REPO R0 R7 C2 C2 R7
REPO R0 R7 C7 C3 R7
REPO R1 R7 C4 C4 R7
-- R4, R5, R6, R7 now contain the shifted columns of the input state.
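The effect of the ShiftRow sequence on the example input can be checked with a short Python sketch. The state is held as a list of columns, one per register location as in the sequence above; the function name is hypothetical and the row offsets are those given for Nb = 4.

```python
def shift_row(columns):
    """Row r is shifted cyclically by r positions (C1 = 1, C2 = 2, C3 = 3)."""
    nb = len(columns)
    return [[columns[(c + r) % nb][r] for r in range(4)] for c in range(nb)]


# Input state, one column per entry
state = [
    [0x32, 0x43, 0xF6, 0xA8],
    [0x88, 0x5A, 0x30, 0x8D],
    [0x31, 0x31, 0x98, 0xA2],
    [0xE0, 0x37, 0x07, 0x34],
]
shifted = shift_row(state)
rows = [[shifted[c][r] for c in range(4)] for r in range(4)]
```

Here `rows` reproduces the state after ShiftRow row by row, while `shifted` holds the four columns that the REPO sequence gathers into R4 to R7.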

-- One Column per Register Location

The inverse of the ShiftRow Transformation is the opposite of the forward transform. However, in practice, the inverse ShiftRow Transformation can be integrated with the Inverse MixColumn Transformation, so the inverse transformation need not be specifically performed, as described hereafter.

The MixColumn Transformation (Forward and Inverse)
The Forward MixColumn Transformation treats each column of the state as a respective polynomial over $GF(2^8)$, multiplied, modulo $x^4 + 1$, with a fixed polynomial C(x) given by:
$C(x) = '03'x^3 + '01'x^2 + '01'x + '02'$
The polynomial multiplication can be written as a matrix multiplication. Let A(x) be a column of the state and $B(x) = C(x) \otimes A(x)$:
$\begin{pmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{pmatrix} = \begin{pmatrix} 02 & 03 & 01 & 01 \\ 01 & 02 & 03 & 01 \\ 01 & 01 & 02 & 03 \\ 03 & 01 & 01 & 02 \end{pmatrix} \begin{pmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \end{pmatrix}$
Following the notation in the ShiftRow Transformation, assume that the resultant state of the Forward ShiftRow Transformation is in R4, R5, R6, R7 (one column per Register Location), and that each row of the Multiplication Matrix is stored in the Least Significant Coefficient Locations of R25, R26, R27, R28 as shown in figure 35.

The sequence of instructions for the Forward MixColumn Transformation is as follows:
-- Initial Conditions: R8, R9, R10 contain all zeros
-- ForwardMixColumn (R4, R5, R6, R7)
-- First Column of State
MULT R4 R25 R8
ACCU R8 R9 C1 R9 -- C1 of R9 contains b0 of column 1
MULT R4 R26 R8
ACCU R8 R9 C2 R9 -- C2 of R9 contains b1 of column 1
MULT R4 R27 R8
ACCU R8 R9 C3 R9 -- C3 of R9 contains b2 of column 1
MULT R4 R28 R8
ACCU R8 R9 C4 R9 -- C4 of R9 contains b3 of column 1
-- Second Column of State
MULT R5 R25 R8
ACCU R8 R9 C5 R9 -- C5 of R9 contains b0 of column 2
MULT R5 R26 R8
ACCU R8 R9 C6 R9 -- C6 of R9 contains b1 of column 2
MULT R5 R27 R8
ACCU R8 R9 C7 R9 -- C7 of R9 contains b2 of column 2
MULT R5 R28 R8
ACCU R8 R9 C8 R9 -- C8 of R9 contains b3 of column 2
-- Third Column of State
MULT R6 R25 R8
ACCU R8 R10 C1 R10 -- C1 of R10 contains b0 of column 3
MULT R6 R26 R8
ACCU R8 R10 C2 R10 -- C2 of R10 contains b1 of column 3
MULT R6 R27 R8
ACCU R8 R10 C3 R10 -- C3 of R10 contains b2 of column 3
MULT R6 R28 R8
ACCU R8 R10 C4 R10 -- C4 of R10 contains b3 of column 3
-- Fourth Column of State
MULT R7 R25 R8
ACCU R8 R10 C5 R10 -- C5 of R10 contains b0 of column 4
MULT R7 R26 R8
ACCU R8 R10 C6 R10 -- C6 of R10 contains b1 of column 4
MULT R7 R27 R8
ACCU R8 R10 C7 R10 -- C7 of R10 contains b2 of column 4
MULT R7 R28 R8
ACCU R8 R10 C8 R10 -- C8 of R10 contains b3 of column 4
-- R9, R10 now contain the MixColumn Transformed State
-- End ForwardMixColumn;

As the polynomial C(x) is co-prime to $x^4 + 1$, it is invertible. The Inverse MixColumn Transformation makes use of this property by multiplying with a specific multiplication polynomial D(x), which is defined by:
$C(x) \otimes D(x) = 1$

D(x) is given by:
$D(x) = '0B'x^3 + '0D'x^2 + '09'x + '0E'$
The Inverse Transformation is very similar to the Forward Transform. Assume that the input state to be transformed is saved similarly in R4, R5, R6, R7 (i.e. one column per Register Location), and that the Inverse Multiplication Matrix is saved in R25, R26, R27, R28 as shown in figure 36.
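The MULT followed by ACCU pattern used for MixColumn can be sketched in Python for one column. The sketch assumes that ACCU accumulates by XOR, which is addition in $GF(2^8)$; the function names and the test column are illustrative. The inverse matrix rows are the circulant rows derived from D(x).

```python
POLY, M = 0x11B, 8  # Rijndael field polynomial


def gf_mul(a, b):
    result = 0
    for _ in range(M):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M):
            a ^= POLY
    return result


def accu(register):
    """Model of the accumulation in ACCU: XOR sum of all locations (assumed)."""
    total = 0
    for c in register:
        total ^= c
    return total


FORWARD_ROWS = [[2, 3, 1, 1], [1, 2, 3, 1], [1, 1, 2, 3], [3, 1, 1, 2]]
INVERSE_ROWS = [
    [0x0E, 0x0B, 0x0D, 0x09],
    [0x09, 0x0E, 0x0B, 0x0D],
    [0x0D, 0x09, 0x0E, 0x0B],
    [0x0B, 0x0D, 0x09, 0x0E],
]


def mix_column(column, rows):
    """One MULT and one ACCU per matrix row, as in the instruction sequence."""
    out = []
    for row in rows:
        products = [gf_mul(a, r) for a, r in zip(column, row)]  # MULT
        out.append(accu(products))                              # ACCU
    return out


column = [0xDB, 0x13, 0x53, 0x45]
mixed = mix_column(column, FORWARD_ROWS)
restored = mix_column(mixed, INVERSE_ROWS)
```

Applying the inverse rows to `mixed` restores the original column, which illustrates the relation $C(x) \otimes D(x) = 1$.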

The sequence of instructions for the Inverse MixColumn Transformation is as follows:

-- Initial Conditions R8, R9, R10 contain all zeros;
-- InverseMixColumn (R4, R5, R6, R7)
-- First Column of Input State
MULT R4 R25 R8
ACCU R8 R9 C1 R9
MULT R4 R26 R8
ACCU R8 R9 C6 R9
MULT R4 R27 R8
ACCU R8 R10 C3 R10
MULT R4 R28 R8
ACCU R8 R10 C8 R10
-- Second Column of Input State
MULT R5 R25 R8
ACCU R8 R9 C5 R9
MULT R5 R26 R8
ACCU R8 R10 C2 R10
MULT R5 R27 R8
ACCU R8 R10 C7 R10
MULT R5 R28 R8
ACCU R8 R9 C4 R9
-- Third Column of Input State
MULT R6 R25 R8
ACCU R8 R10 C1 R10
MULT R6 R26 R8
ACCU R8 R10 C6 R10
MULT R6 R27 R8
ACCU R8 R9 C3 R9
MULT R6 R28 R8
ACCU R8 R9 C8 R9
-- Fourth Column of Input State
MULT R7 R25 R8
ACCU R8 R10 C5 R10
MULT R7 R26 R8
ACCU R8 R9 C2 R9
MULT R7 R27 R8
ACCU R8 R9 C7 R9
MULT R7 R28 R8
ACCU R8 R10 C4 R10
-- End InverseMixColumn
-- R9, R10 now contain the Inverse MixColumn AND the Inverse ShiftRow Transformed
-- State

The program segment above makes use of the functionality of the ACCU instruction to perform Inverse MixColumn and Inverse ShiftRow at the same time, removing the need for a separate Inverse ShiftRow transformation.
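The idea of folding the Inverse ShiftRow into the destination of each accumulated byte can be illustrated in software. The sketch below is a Python model under stated assumptions, not the processor's actual MULT and ACCU semantics: the state is held as four columns of four bytes, the inverse matrix is the circulant built from D(x), and output columns 1 and 2 map to coefficient locations 1 to 8 of R9 while columns 3 and 4 map to R10. All function names are hypothetical.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two GF(2^8) elements modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return result


# First row of the circulant inverse matrix derived from D(x).
D_ROW = [0x0E, 0x0B, 0x0D, 0x09]


def inv_mix_byte(col, r):
    """Row r of the inverse matrix times one input column (one MULT + ACCU pair)."""
    acc = 0
    for j in range(4):
        acc ^= gf_mul(D_ROW[(j - r) % 4], col[j])
    return acc


def inv_mix_column(col):
    return [inv_mix_byte(col, r) for r in range(4)]


def inv_shift_rows(state):
    """Cyclically shift row r to the right by r columns; state[c][r] layout."""
    out = [[0] * 4 for _ in range(4)]
    for c in range(4):
        for r in range(4):
            out[(c + r) % 4][r] = state[c][r]
    return out


def separate(state):
    """Inverse MixColumn followed by a distinct Inverse ShiftRow pass."""
    return inv_shift_rows([inv_mix_column(col) for col in state])


def combined(state):
    """Write each result byte straight to its shifted position."""
    out = [[0] * 4 for _ in range(4)]
    for c, col in enumerate(state):
        for r in range(4):
            out[(c + r) % 4][r] = inv_mix_byte(col, r)
    return out


def destination(c, r):
    """Assumed register and coefficient location for row r of input column c (0-based)."""
    col = (c + r) % 4
    return ("R9" if col < 2 else "R10", 4 * (col % 2) + r + 1)


state = [[0x8E, 0x4D, 0xA1, 0xBC], [0x01, 0x02, 0x03, 0x04],
         [0xFF, 0x00, 0x55, 0xAA], [0x10, 0x20, 0x30, 0x40]]
print(combined(state) == separate(state))
print(destination(0, 1))
```

For example, `destination(0, 1)` gives location C6 of R9, which matches the second ACCU instruction of the first column in the listing above.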

The AddRoundKey Transformation

This is a bitwise XOR of the RoundKey with the state. The instruction used is ADDP.

Assume that the state is stored in R0 and R1 and the current RoundKey is stored in R2 and R3. This transformation can be done by:

ADDP R0 R2 R4
ADDP R1 R3 R5

so that R4 and R5 contain the output state. It should be noted that this operation is valid only for Nk = Nb.

This transformation is its own inverse.
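As a minimal Python sketch of the two ADDP instructions, each register can be modelled as eight bytes and the addition as a bytewise XOR. The register contents below are arbitrary illustrative values, and `add_round_key` is a hypothetical helper name.

```python
def add_round_key(state: bytes, round_key: bytes) -> bytes:
    """Bytewise XOR of two equally sized registers (addition in GF(2^8))."""
    if len(state) != len(round_key):
        raise ValueError("state and round key must have the same length")
    return bytes(s ^ k for s, k in zip(state, round_key))


# Illustrative register contents: state in R0, R1 and RoundKey in R2, R3.
r0 = bytes(range(0, 8))
r1 = bytes(range(8, 16))
r2 = bytes([0xA5] * 8)
r3 = bytes([0x3C] * 8)

r4 = add_round_key(r0, r2)  # models ADDP R0 R2 R4
r5 = add_round_key(r1, r3)  # models ADDP R1 R3 R5

# Applying the same RoundKey again restores the original state.
print(add_round_key(r4, r2) == r0, add_round_key(r5, r3) == r1)
```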

The Key Expansion

The RoundKeys used in the AddRoundKey Transformation are derived from the cipher key by means of the key expansion. The total number of RoundKey columns is equal to the block length Nb multiplied by the number of rounds Nr.

Initially, the KeyExpansion function takes Nk columns of the Cipher Key and produces the next Nk columns of the RoundKeys to be used in round one. The KeyExpansion will then take the previous Nk columns of the RoundKeys to determine the next Nk columns of RoundKeys, and continue until the required number of RoundKey columns is reached, as can be appreciated from J. Daemen and V. Rijmen, "AES Proposal: The Rijndael Block Cipher," 2000.

Each round will only use Nb columns of the RoundKeys for the AddRoundKey Transformation.

In this example, programming the embodiments of the present invention for key expansion with Nk = 4 and Nb = 4, and with the Key Expansion computed dynamically, will be examined. The KeyExpansion Function in this case is shown below:

KeyExpansion (RoundKey[Nk], RC) -- Nk = 1 to 4
{ -- RC is the Last Round Constant
Rcon = ( RC, '00', '00', '00');
temp = RoundKey[1];
temp = ByteSub(RotByte(temp)) XOR Rcon;
NextRoundKey[1] = temp;
NextRoundKey[2] = RoundKey[2] XOR temp;
NextRoundKey[3] = RoundKey[3] XOR temp;
NextRoundKey[4] = RoundKey[4] XOR temp;
RC = RC · x;
Return (NextRoundKey, RC);
}

The ByteSub function is the same as that above except for the scale, that is, it operates on one column instead of one state.

The RotByte Function is defined as follows:

RotByte ( RoundKeyColumn[4] ) -- input is one column of 4 bytes
{
temp[1] = RoundKeyColumn[2];
temp[2] = RoundKeyColumn[3];
temp[3] = RoundKeyColumn[4];
temp[4] = RoundKeyColumn[1];
Return(temp);
}

This function can be realised using the REPO and SHPX instructions. For example, if one skilled in the art wanted to perform a RotByte on a RoundKeyColumn that is stored in coefficient locations 5 to 8 of R1 as shown in figure 34, only the rotated 4 bytes in coefficient locations 5 to 8 of R1 are required, so:

1) SHPX R1, LSB, #4, R11
This instruction takes the contents of R1, shifts it by 4 coefficient locations towards the LSB and saves it in R11. This ensures that R11 contains only the RoundKeyColumn in question, in coefficient locations 1 to 4 of R11. The shift operation introduces zeros into coefficient locations 5 to 8.

2) REPO R11 R11 C1 C5 R11
This instruction copies the contents of coefficient location 1 of R11 to coefficient location 5 of R11.

3) SHPX R11, LSB, #1, R11
This instruction again shifts R11 by one coefficient location towards the LSB. R11 now contains the result of RotByte in coefficient locations 1 to 4, with zeros in coefficient locations 5 to 8.
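The three steps above can be traced with a small Python model. This is only a sketch of the behaviour described in the text: a register is taken to be a list of eight bytes with index 0 holding coefficient location 1, and `shpx_lsb` and `repo` are hypothetical stand-ins for the SHPX and REPO instructions, not their actual definitions.

```python
def shpx_lsb(reg, count):
    """Shift a register by `count` coefficient locations towards the LSB.

    Index 0 models coefficient location 1; vacated locations are zero-filled.
    """
    return reg[count:] + [0] * count


def repo(reg, src, dst):
    """Copy coefficient location `src` to location `dst` (both 1-based)."""
    out = list(reg)
    out[dst - 1] = reg[src - 1]
    return out


# RoundKeyColumn (A0, A1, A2, A3) held in coefficient locations 5 to 8 of R1.
r1 = [0x00, 0x00, 0x00, 0x00, 0xA0, 0xA1, 0xA2, 0xA3]

r11 = shpx_lsb(r1, 4)   # step 1: column now in locations 1 to 4
r11 = repo(r11, 1, 5)   # step 2: first byte duplicated into location 5
r11 = shpx_lsb(r11, 1)  # step 3: rotated column in locations 1 to 4

print([f"{v:02X}" for v in r11])
```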

Rcon(RoundNum) is called the round constant function and is defined as

Rcon[i] = ( RC[i], '00', '00', '00' )

RC[i] represents an element in GF(2^8) with a value of $x^{i-1}$, so that:

RC[1] = 1 (i.e. '01')
RC[2] = x (i.e. '02')
RC[3] = $x^2$ (i.e. '04')

Generalising,

$RC[i] = x \cdot RC[i-1]$

Thus, the Round Constant can be computed dynamically by recursively using the most recent RC.
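The recursion for RC can be sketched in a few lines of Python. The reduction polynomial x^8 + x^4 + x^3 + x + 1 is assumed from the AES proposal, and the function names are illustrative.

```python
def xtime(a: int) -> int:
    """Multiply a GF(2^8) element by x, reducing modulo 0x11B."""
    a <<= 1
    if a & 0x100:
        a ^= 0x11B
    return a


def round_constants(count: int):
    """Return RC[1] .. RC[count], each computed from the previous one."""
    rc = 0x01
    out = []
    for _ in range(count):
        out.append(rc)
        rc = xtime(rc)
    return out


print([f"{v:02X}" for v in round_constants(10)])
```

The first three values are '01', '02' and '04', as listed above. The reduction first takes effect at the ninth value, where '80' multiplied by x becomes '1B'.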

Assume that the last Nk = 4 columns of RoundKey are in the least significant coefficient locations of R0, R1, R2 and R3, and RC is in C1 of R4. Also, assume that Ra is all zeros and C1 of Rb contains '01'.

-- KeyExpansion (R0, R1, R2, R3, R4)
-- RotByte(R0);
REPO R0 R0 C1 C5 R0
SHPX R0 LSB #1 R0
-- End RotByte;
-- ByteSub(R0);
DIVI Ra R0 R0
AFFT R0 #forward R0
-- End ByteSub;
ADDP R0 R4 R0 -- Add RCon = temp;
ADDP R0 R1 R1 -- RoundKey[2] XOR temp;
ADDP R0 R2 R2 -- RoundKey[3] XOR temp;
ADDP R0 R3 R3 -- RoundKey[4] XOR temp;
-- New RoundKeys in R0, R1, R2, R3
MULT R4 Rb R4 -- x · RC[i-1]
-- End KeyExpansion
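The ByteSub step in the listing is a field inversion (DIVI) followed by an affine transformation (AFFT). The following Python sketch models that pair for a single byte. It assumes the Rijndael reduction polynomial and affine constant '63' from the AES proposal, and it maps zero to zero in the inversion by convention; how the DIVI instruction itself treats a zero divisor is not stated here. The function names are hypothetical.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two GF(2^8) elements modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return result


def gf_inv(a: int) -> int:
    """Multiplicative inverse in GF(2^8), computed as a^254; 0 maps to 0."""
    if a == 0:
        return 0
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def affine_forward(b: int) -> int:
    """Rijndael forward affine transform: b XOR its four left rotations XOR '63'."""
    out = b
    for shift in range(1, 5):
        out ^= ((b << shift) | (b >> (8 - shift))) & 0xFF
    return out ^ 0x63


def byte_sub(b: int) -> int:
    """Inversion followed by the affine transform, as in the DIVI / AFFT pair."""
    return affine_forward(gf_inv(b))


def byte_sub_column(column):
    """Apply ByteSub to each of the four bytes of one column."""
    return [byte_sub(b) for b in column]


print([f"{v:02X}" for v in byte_sub_column([0x00, 0x01, 0x53, 0xCA])])
```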

Programming the GF Processor for Elliptic Curve Point Addition

The application of embodiments of the present invention to Elliptic Curve Point Addition will now be examined.

An Elliptic Curve over GF(2^m) is made up of a number of points, whose coordinates are expressed as elements of the Galois Field GF(2^m). For secure use in Public Key Cryptography, m is usually very large. Elliptic Curve Public Key Cryptography is very computationally intensive, and one of the most repeated operations is the Elliptic Curve Point Addition.

A Non-Supersingular Elliptic Curve, as can be appreciated from M. Rosing, Implementing elliptic curve cryptography. Greenwich, CT: Manning, 1999, which is incorporated herein for all purposes, is defined as:

$y^2 + xy = x^3 + a_2 x^2 + a_6$, where $a_2$ and $a_6 \in GF(2^m)$ (Equation 48)

Given that two points $P = (x_1, y_1)$ and $Q = (x_2, y_2)$ lie on an Elliptic Curve, point addition R = P + Q is defined as below, where $R = (x_3, y_3)$.

If P ≠ Q then

$\theta = \dfrac{y_2 - y_1}{x_2 - x_1}$ (Equation 49)

and

$x_3 = \theta^2 + \theta + x_2 + x_1 + a_2$ (Equation 50)

$y_3 = \theta (x_3 + x_1) + x_3 + y_1$ (Equation 51)

If P = Q then

$\theta = x_1 + \dfrac{y_1}{x_1}$ or $\theta = x_2 + \dfrac{y_2}{x_2}$ (Equation 52)

and

$x_3 = \theta^2 + \theta + a_2$ (Equation 53)

$y_3 = x_1^2 + (\theta + 1)\, x_3$ (Equation 54)

These operations can be decomposed into element multiplication, element division and addition, since addition is the same as subtraction. The GF Processor is first configured to operate in SIMD mode.

Considering Equation 49, θ can be calculated by:

ADDP R1,R2,R5 -- where R1 contains y1, R2 contains y2
ADDP R3,R4,R6 -- where R3 contains x1, R4 contains x2
DIVI R5,R6,R7 -- R7 now contains θ

Similarly, the same can be performed for the other equations to produce the results of the Elliptic Curve Point Addition.
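The point addition and doubling formulas can be exercised on a toy field. The Python sketch below is an illustration only: it assumes GF(2^4) with the reduction polynomial x^4 + x + 1 and the arbitrary curve coefficients a2 = 1 and a6 = 1, which are far too small for cryptographic use. The point at infinity is not modelled, so the cases x1 = x2 with P ≠ Q and doubling with x1 = 0 raise an error. All names are hypothetical.

```python
M = 4
MOD = 0b10011  # x^4 + x + 1, an assumed reduction polynomial for GF(2^4)


def gf_mul(a: int, b: int) -> int:
    """Multiply two GF(2^M) elements."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M):
            a ^= MOD
    return result


def gf_div(a: int, b: int) -> int:
    """Divide a by b in GF(2^M) using b^(2^M - 2) as the inverse of b."""
    if b == 0:
        raise ZeroDivisionError("division by zero in GF(2^m)")
    inv = 1
    for _ in range((1 << M) - 2):
        inv = gf_mul(inv, b)
    return gf_mul(a, inv)


def on_curve(p, a2, a6):
    """Check y^2 + xy = x^3 + a2 x^2 + a6 (Equation 48)."""
    x, y = p
    lhs = gf_mul(y, y) ^ gf_mul(x, y)
    rhs = gf_mul(gf_mul(x, x), x) ^ gf_mul(a2, gf_mul(x, x)) ^ a6
    return lhs == rhs


def point_add(p, q, a2):
    """Equations 49 to 54; addition and subtraction are both XOR."""
    x1, y1 = p
    x2, y2 = q
    if p == q:
        theta = x1 ^ gf_div(y1, x1)                      # Equation 52
        x3 = gf_mul(theta, theta) ^ theta ^ a2           # Equation 53
        y3 = gf_mul(x1, x1) ^ gf_mul(theta ^ 1, x3)      # Equation 54
    else:
        theta = gf_div(y2 ^ y1, x2 ^ x1)                 # Equation 49
        x3 = gf_mul(theta, theta) ^ theta ^ x2 ^ x1 ^ a2  # Equation 50
        y3 = gf_mul(theta, x3 ^ x1) ^ x3 ^ y1            # Equation 51
    return (x3, y3)


A2, A6 = 1, 1
points = [(x, y) for x in range(1 << M) for y in range(1 << M)
          if on_curve((x, y), A2, A6)]
print(len(points), "affine points on the toy curve")
```

Every sum of two points with different x coordinates, and every doubling of a point with non-zero x, should again satisfy Equation 48, which gives a simple consistency check on the formulas.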

References

[1] M. A. Hasan and A. G. Wassal, "VLSI Algorithms, Architectures, and Implementation of a Versatile GF(2^m) Processor," IEEE Transactions on Computers, vol. 49, pp. 1064-1073, 2000.

[2] G. Orlando and C. Paar, "A Scalable GF(p) Elliptic Curve Processor Architecture for Programmable Hardware," in Cryptographic Hardware and Embedded Systems - CHES 2001, vol. 2162, Lecture Notes in Computer Science, C. K. Koc, D. Naccache, and C. Paar, Eds.: Springer-Verlag, 2001, pp. 348-363.

[3] G. Orlando and C. Paar, "A High-Performance Reconfigurable Elliptic Curve Processor for GF(2^m)," in Cryptographic Hardware and Embedded Systems - CHES 2000, Second International Workshop, Worcester, MA, USA, August 17-18, 2000, Proceedings, vol. 1965, Lecture Notes in Computer Science, C. K. Koc and C. Paar, Eds.: Springer-Verlag, 2000, pp. 41-56.

[4] K. Leung, K. Ma, W. Wong, and P. Leong, "FPGA Implementation of a Microcoded Elliptic Curve Cryptographic Processor," presented at Field-programmable custom computing machines; 2000 IEEE symposium on field-programmable custom computing machines, Napa Valley, CA, 2000.

[5] P. A. Scott, S. E. Tavares, and L. E. Peppard, "A Fast VLSI Multiplier for GF(2^m)," IEEE Journal on Selected Areas in Communications, vol. SAC-4, pp. 62-66, 1986.

[6] H. Brunner, A. Curiger, and M. Hofstetter, "On Computing Multiplicative Inverses in GF(2^m)," IEEE Transactions on Computers, vol. 42, pp. 1010, 1993.

[7] J. H. Guo and C. L. Wang, "Systolic Array Implementation of Euclid's Algorithm for Inversion and Division in GF(2^m)," IEEE Transactions on Computers, vol. 47, pp. 1161-1167, 1998.

[8] J. Daemen and V. Rijmen, "AES Proposal: The Rijndael Block Cipher," 2000.

[9] M. Rosing, Implementing elliptic curve cryptography. Greenwich, CT: Manning, 1999.

All references are incorporated herein by reference for all purposes.

The reader's attention is directed to all papers and documents which are filed concurrently with or previous to this specification in connection with this application and which are open to public inspection with this specification, and the contents of all such papers and documents are incorporated herein by reference.

All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and/or all of the steps of any method or process so disclosed, may be combined in any combination, except combinations where at least some of such features and/or steps are mutually exclusive.

Each feature disclosed in this specification (including any accompanying claims, abstract and drawings), may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise. Thus, unless expressly stated otherwise, each feature disclosed is one example only of a generic series of equivalent or similar features.

The invention is not restricted to the details of any foregoing embodiments. The invention extends to any novel one, or any novel combination, of the features disclosed in this specification (including any accompanying claims, abstract and drawings), or to any novel one, or any novel combination, of the steps of any method or process so disclosed.

Claims

CLAIMS

1. A processor comprising
a first register for storing at least a first operand in at least a portion of at least a first operand register of the first register;
a second register for storing at least a second operand in at least a portion of at least a first operand register of the second register;
an arithmetic unit comprising at least a first arithmetic sub-unit, responsive to a first set of control signals, to perform a first arithmetic operation using the first and second operands;
a result register for storing the result of performing the arithmetic operation using the first and second operands;
a control unit for producing the first set of control signals; the first set of control signals being used to configure the field size of the first

2. A processor as claimed in claim 1, in which
the first and second registers comprise at least respective second operand registers for storing third and fourth operands respectively; the third and fourth operands having comparable field sizes;
the arithmetic unit comprises at least a second arithmetic sub-unit, responsive to a second set of control signals, to perform a second arithmetic operation using at least the third and fourth operands; and
the control unit produces the second set of control signals; the second set of control signals being used to configure the field size of the second arithmetic sub-unit to influence the size of the respective portions of the third and fourth operands used in the second arithmetic operation.

3. A processor as claimed in any preceding claim, wherein
the first and second registers each comprise j operand registers,
the arithmetic unit comprises j arithmetic sub-units, each of the j arithmetic sub-units being capable of performing a respective arithmetic operation using respective operands derived from respective operand registers of the j operand registers;
the control unit comprising means to produce j sets of control signals to configure the field sizes of the j arithmetic sub-units respectively to influence the size of respective portions of the respective operands used in performing the respective arithmetic operations.

4. A processor as claimed in any preceding claim, in which any one of the arithmetic sub-units is operable to perform a respective arithmetic operation independently of any other arithmetic sub-unit.

5. A processor as claimed in any preceding claim, in which at least selected arithmetic sub-units of the arithmetic unit are operable to perform respective arithmetic operations substantially concurrently.

6. A processor as claimed in any preceding claim, in which at least a selected plurality of the arithmetic sub-units co-operate, in response to respective control signals, to perform an arithmetic operation on an operand that has a field size greater than a dynamic range of any one of the selected plurality of arithmetic sub-units.

7. A processor as claimed in any preceding claim, in which at least a first arithmetic sub-unit and corresponding operand registers are arranged to operate over a first field size and at least a second arithmetic sub-unit and corresponding operand registers are arranged to operate over at least a second field size, where the first and second field sizes are different.

8. A processor as claimed in claim 7, in which the first and second arithmetic sub-units form part of the arithmetic unit and are arranged to operate over different field sizes.

9. A processor as claimed in all preceding claims, in which at least a pair of operand registers of the first register is arranged to store respective parts of a first global operand and at least a pair of operand registers of the second register is arranged to store respective parts of a second global operand.

10. A processor as claimed in claim 9, in which the first and second global operands represent respective polynomials and the pairs of operand registers are used to store coefficients of the respective polynomials.

11. A processor as claimed in any preceding claim, further comprising a plurality of storage registers for storing data representing at least one of the operands and corresponding results; and a bus for routing data from the plurality of storage registers to at least one of the arithmetic units' input and result registers.

12. An arithmetic unit, arithmetic sub-unit or processor substantially as described herein with reference to and/or as illustrated in the accompanying drawings.

13. A reconfigurable processor comprising arithmetic units for performing finite field arithmetic; each arithmetic unit having a plurality of arithmetic sub-units, each arithmetic sub-unit having respective dynamic ranges that are defined by respective predetermined field sizes; the dynamic range being the selected size of the data unit capable of being operated on, or stored within, a sub-unit up to the respective predetermined field size; each arithmetic sub-unit being configured by respective sets of control signals to define at least one relationship with another sub-unit; the at least one relationship being such that the respective arithmetic sub-units co-operate to represent a second data unit having a dynamic range that is greater than at least a selectable one of the respective dynamic ranges of the plurality of the arithmetic sub-units.

14. A processor as claimed in any preceding claim, wherein the processor is a Domain processor.


15. A processor as claimed in any preceding claim, wherein the processor is a Galois field processor.
