EP1859344A2

EP1859344A2 - Method and device for calculating a polynom multiplication, in particular for elliptical curve cryptography

Info

Publication number: EP1859344A2
Application number: EP06708654A
Authority: EP
Inventors: Peter Langendoerfer; Zoya Dyka; Steffen Peter
Original assignee: IHP GmbH
Current assignee: IHP GmbH
Priority date: 2005-03-04
Filing date: 2006-03-06
Publication date: 2007-11-28
Also published as: DE102005028662B4; US8477935B2; WO2006092448A3; WO2006092448A2; DE102005028662A1; US20090136022A1

Abstract

Communication channel security is required in particular for wireless networks. The use of encryption mechanisms implemented in software is limited by the computing and energy capacities of mobile terminals. Hardware solutions for cryptographic operations, in turn, are costly. The invention addresses all of these points together. It relates to a hardware accelerator for polynomial multiplication in extended Galois fields (GF), in which the known Karatsuba method is applied iteratively. With the invention, the required chip area can be reduced, for example, from 6.2 mm² to 2.1 mm². The inventive solution also reduces energy consumption by about 30% compared with known prior-art solutions.

Description

Berlin, March 6, 2006

Applicant / owner: IHP GMBH Official file: new registration

IHP GmbH - Innovation for High Performance Microelectronics / Institute for Innovative Microelectronics

Im Technologiepark 25, D-15236 Frankfurt (Oder)

Method and apparatus for calculating polynomial multiplication, in particular for elliptic curve cryptography

The invention relates to a method and an apparatus for calculating a polynomial multiplication. Furthermore, it relates to a method for encrypting data and an encryption unit.

1. Introduction

Mobile devices are penetrating more and more areas of everyday life. More and more sensitive information is exchanged between mobile devices or between mobile devices and fixed communication endpoints. Data exchange is normally protected by encryption mechanisms. Due to the scarce resources of mobile devices, however, a comprehensive use of cryptographic methods is not possible. This is especially true for Public Key Cryptography, which is typically used to provide a secure channel between the communication partners and to create digital signatures. Public key cryptography uses so-called asymmetric encryption techniques. In this case, a public key is used to encrypt data, which according to its name is made known to third parties. Decryption of data encrypted with the public key can only be done with a private key that only the recipient of the message has. A decryption of the encrypted message with the public key, however, is virtually impossible. This practical impossibility of decryption is due to the asymmetry of the encryption scheme, which uses a public key based encryption algorithm requiring only relatively few computation steps. However, decryption, which involves a mathematical inversion of the encryption algorithm, requires so many computational steps when the public key alone is known that the time required for such a decryption attempt is virtually infinitely large, even using the most modern and elaborate computational technology.

Well-known asymmetric encryption methods are RSA and the Diffie-Hellman method, together with the digital signature algorithm DSA based thereon.

More recently, Elliptic Curve Cryptography (ECC) has been increasingly developed. The advantage of ECC over the other methods mentioned is that shorter keys can be used without reducing the security of the encryption. In addition, ECC operations are faster than those of the RSA method. An introduction to elliptic curve cryptography has been published on the Internet at http://www.deviceforge.com/articles/AT4234154468.html.

ECC encryption is based on the calculation of a product of two operands called "kP", where P is a point on an elliptic curve (EC) and k is a large number. The kP multiplication is based on point doubling and point addition. All EC point operations are based on addition, subtraction, squaring, multiplication and division in a selected Galois field (GF).

Hardware accelerators for public-key cryptography operations are ideal tools for reducing computation time and energy consumption. However, a direct implementation of cryptographic operations leads to a relatively large space requirement on a chip. This complicates the application of hardware accelerators from an economic point of view. The boundary conditions of the design of hardware accelerators are therefore the required calculation time, the energy consumption and the space requirement.

2. State of the art

Hereinafter, known methods for polynomial multiplication in a polynomial basis will be described. For this purpose, polynomial multiplication is first discussed in general, and then known methods for accelerating polynomial multiplication are explained.

2.1 Polynomial multiplication

In a Galois field GF(2ⁿ), addition and subtraction are XOR operations. Therefore, and for easier understanding of the formulas, the common representation of polynomials $A(x) = \sum_{i=0}^{n-1} a_i x^i$ is modified here to $A(x) = \bigoplus_{i=0}^{n-1} a_i x^i$. Within this application, the XOR operation is marked as "⊕". The symbol "+" always denotes an ordinary addition.

The product of two polynomials $A(x) = \bigoplus_{i=0}^{n-1} a_i x^i$ and $B(x) = \bigoplus_{i=0}^{n-1} b_i x^i$ is the polynomial

$$C(x) = A(x) \cdot B(x) = \bigoplus_{i=0}^{2n-2} c_i x^i \qquad (1)$$

where $c_i = \bigoplus_{k+l=i} a_k \cdot b_l$, i.e.:

$$
\begin{aligned}
c_0 &= a_0 \cdot b_0 \\
c_1 &= a_1 \cdot b_0 \oplus a_0 \cdot b_1 \\
&\;\;\vdots \\
c_{n-1} &= a_{n-1} \cdot b_0 \oplus a_{n-2} \cdot b_1 \oplus \dots \oplus a_0 \cdot b_{n-1} \\
&\;\;\vdots \\
c_{2n-3} &= a_{n-1} \cdot b_{n-2} \oplus a_{n-2} \cdot b_{n-1} \\
c_{2n-2} &= a_{n-1} \cdot b_{n-1}
\end{aligned}
$$

A direct implementation of formula (1) requires n² partial multiplications and (n-1)² XOR operations on partial products to calculate the coefficients $c_i$. All operands in formula (1) are only one bit long. When the elliptic curve B-233 is used, both polynomials A(x) and B(x) are 233 bits long. This means that a total of 233² one-bit partial multiplications and 232² XOR operations are required.
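For illustration only, the direct evaluation of formula (1) and its operation counts can be sketched in Python. The function name `schoolbook_mul` and the representation of a polynomial as a list of bits (index = power of x) are assumptions made for this sketch and do not appear in the application.

```python
def schoolbook_mul(a_bits, b_bits):
    """Direct evaluation of formula (1) over GF(2).

    a_bits, b_bits: lists of 0/1 coefficients of equal length n.
    Returns (c_bits, and_count, xor_count).
    """
    n = len(a_bits)
    assert len(b_bits) == n
    c = [None] * (2 * n - 1)          # None: no partial product stored yet
    and_count = xor_count = 0
    for k in range(n):
        for l in range(n):
            p = a_bits[k] & b_bits[l]  # one-bit partial multiplication
            and_count += 1
            if c[k + l] is None:
                c[k + l] = p           # first term of c_i needs no XOR
            else:
                c[k + l] ^= p          # every further term costs one XOR
                xor_count += 1
    return c, and_count, xor_count


# A(x) = 1 + x + x^3, B(x) = 1 + x^2 + x^3
c_small, ands_small, xors_small = schoolbook_mul([1, 1, 0, 1], [1, 0, 1, 1])

# Operation counts for 233-bit operands, as used with B-233
_, ands_233, xors_233 = schoolbook_mul([1] * 233, [1] * 233)
```

For n = 4 the sketch performs 16 partial multiplications and 9 XOR operations, in line with n² and (n-1)².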

2.2 Karatsuba-based polynomial multiplication method

For a polynomial multiplication with the original Karatsuba method, both operands must be split into two equal parts. When the length n of the operands is odd, they must be padded with a leading "0". Denoting by $a_i$ the i-th bit and by $a^i$ the i-th segment of operand A(x), the operands can be written as follows:

$$A(x) = a_{n-1} \dots a_{n/2}\, a_{n/2-1} \dots a_1 a_0 = (a_{n-1} \dots a_{n/2}) \cdot x^{n/2} \oplus (a_{n/2-1} \dots a_1 a_0) = a^1 \cdot x^{n/2} \oplus a^0 \qquad (2)$$

The polynomial B(x) can be represented in the same way. Karatsuba's formula for the product C(x) = A(x)·B(x) is:

$$C(x) = a^0 b^0 \oplus \left[a^0 b^0 \oplus a^1 b^1 \oplus (a^0 \oplus a^1)(b^0 \oplus b^1)\right] \cdot x^{n/2} \oplus a^1 b^1 \cdot x^{n} \qquad (3)$$
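As a minimal sketch, one application of formula (3) can be written in Python with operands held as integers whose bits are the polynomial coefficients. The helper names `clmul` and `karatsuba_once` are hypothetical, and the shift-and-XOR routine stands in for whatever partial multiplier is actually used.

```python
def clmul(a, b):
    """Carry-less product of two GF(2) polynomials stored as ints."""
    r = 0
    while b:
        if b & 1:
            r ^= a        # add (XOR) the shifted multiplicand
        a <<= 1
        b >>= 1
    return r


def karatsuba_once(a, b, n):
    """One level of formula (3) for operands of at most n bits."""
    if n % 2:
        n += 1            # odd length: pad with a leading 0
    h = n // 2
    mask = (1 << h) - 1
    a0, a1 = a & mask, a >> h          # a = a1 * x^h XOR a0
    b0, b1 = b & mask, b >> h
    lo = clmul(a0, b0)                 # a^0 b^0
    hi = clmul(a1, b1)                 # a^1 b^1
    mid = clmul(a0 ^ a1, b0 ^ b1)      # (a^0 XOR a^1)(b^0 XOR b^1)
    return lo ^ ((lo ^ hi ^ mid) << h) ^ (hi << n)


a_demo = int("9f" * 29, 16)            # arbitrary 232-bit test operands
b_demo = int("c3" * 29, 16)
c_demo = karatsuba_once(a_demo, b_demo, 233)
```

Only three half-length products are formed instead of four, at the price of additional XOR operations.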

The publication Bailey, D. V.; Paar, C.: Efficient Arithmetic in Finite Field Extensions with Application in Elliptic Curve Cryptography. Journal of Cryptology, vol. 14, no. 3, pp. 153-176, 2001, proposes a procedure for applying Karatsuba's idea. Hereinafter, this method is referred to as Bailey's method. In this procedure, the operands are divided into three parts. Bailey's method requires six partial multiplications of n/3-bit operands. It can be combined with the original Karatsuba formula for operands whose length is divisible by six.

US 2004/0109561 A1 describes a method for multiplying numbers over a Galois field GF(2^m). In this method, a recursive algorithm is used to decompose a product into a number of subproducts until the remaining size is small enough for a non-recursive algorithm to complete the multiplication. A disadvantage of the method described in this document is its relatively low area efficiency.

3. Technical problem of the invention

The technical problem underlying the invention is to provide a method and a device for polynomial multiplication, which allow an area-efficient realization of elementary mathematical operations.

4. Summary of the invention

According to a first aspect, a method for calculating a polynomial multiplication is specified, with the steps:

- providing coefficients $a_i$, $b_i$ of two polynomials $A(x) = \bigoplus_{i=0}^{n-1} a_i x^i$ and $B(x) = \bigoplus_{i=0}^{n-1} b_i x^i$, where "⊕" indicates an addition or an XOR operation, $a_i$ and $b_i$ are binary one-bit values, i.e. 0 or 1, and x' = 1,

- selecting either two or more than two fragments, one from each polynomial, as operands for a partial multiplication,

- partially multiplying the selected fragments to obtain a partial product,

wherein the selecting step is performed iteratively according to a predefined selection plan, using fragments for which the respective multiplying step requires only one calculation step to form a partial product of the fragments,

- accumulating the partial products, wherein the accumulation is controlled in accordance with a predefined accumulation plan for accumulating a partial product currently present at the output of the partial multiplier with one or more terms of a result polynomial that is developed further from iteration step to iteration step.

The inventive iterative application of the selection and multiplication steps enables a hardware implementation with a reduced area and power consumption. This applies in particular in comparison with a hardware implementation of a recursive algorithm as described in US 2004/0109561 A1. For example, the chip area needed to calculate the product of two 233-bit operands is 2.18 mm² (2-segment layout). These gains come at the cost of only a slightly increased calculation time. An additional area saving is achieved by the inventive use of a predefined accumulation plan, which allows the accumulation to be controlled by control signals in a simple switching arrangement instead of through complicated data paths between hard-wired components of the polynomial multiplier. In this way, an area requirement of only 1.52 mm² (4-segment structure) was achieved for an inventive iterative Karatsuba multiplier for 233 bits, and the use of the accumulation plan does not increase the number of clock cycles.

For the standard application of the Karatsuba method, by contrast, 6.2 mm² are needed. The solution according to the invention also reduces the energy consumption by 30% compared to the known completely recursive approaches.

The inventive method develops its advantages, in particular in a hardware implementation. However, it can also be implemented in the form of software. A software implementation of the method according to the invention has the advantage of better performance with respect to the calculation of the polynomial product compared to known software solutions for polynomial multiplication.

The inventive method also offers increased flexibility compared with known methods.

In the following description, fragments of polynomials are sometimes referred to as segments, with the same meaning. Likewise, fragmentation is also referred to as segmentation.

In a preferred embodiment of the method according to the invention, the accumulation step is carried out in parallel with the iterative fragmentation and comprises a partial accumulation step which is carried out after an iteration of a multiplication step. In a further preferred embodiment of the method according to the invention, the selection, multiplication and accumulation steps are based on a Karatsuba method for carrying out a polynomial multiplication. An example of the application of the Karatsuba method will be explained in more detail below in connection with the description of FIGS. 1 and 2 with reference to the formulas (3) to (8).

In another embodiment, prior to the selecting step, the selection schedule and the accumulation schedule are selected from a plurality of stored selection and accumulation schedules corresponding to a respective length of the polynomials and a predetermined word width of a partial multiplier used to perform the partial multiplying step. The length of the partial polynomials corresponds to the length of the data words which can be applied to the input of the polynomial multiplier and which are to be multiplied by one another according to a polynomial multiplication. This embodiment particularly benefits from the flexibility specified according to the invention, so that one and the same hardware can be used for different polynomial lengths.

According to a second aspect of the invention, there is provided a method of encrypting data comprising a step of computing a product of two polynomials. According to the invention, the product of the polynomials is calculated according to the method of the first aspect of the invention or according to one of the described embodiments.

Preferably, the encryption of the data is performed by elliptic curve cryptography. That is, the encryption of the data is performed by means of an elliptic curve over a Galois field. One of the following curves is preferably used: the curve B-163 over the Galois field GF(2¹⁶³), the curve B-233 over the Galois field GF(2²³³), the curve B-283 over the Galois field GF(2²⁸³), the curve B-409 over the Galois field GF(2⁴⁰⁹), the curve B-571 over the Galois field GF(2⁵⁷¹), the curve K-163 over the Galois field GF(2¹⁶³), the curve K-233 over the Galois field GF(2²³³), the curve K-283 over the Galois field GF(2²⁸³), the curve K-409 over the Galois field GF(2⁴⁰⁹), or the curve K-571 over the Galois field GF(2⁵⁷¹).

According to a third aspect of the invention, there is provided a polynomial multiplier, comprising

- a selection unit that is designed to receive coefficients $a_i$, $b_i$ of two polynomials $A(x) = \bigoplus_{i=0}^{n-1} a_i x^i$ and $B(x) = \bigoplus_{i=0}^{n-1} b_i x^i$, where "⊕" is an addition or an XOR operation, $a_i$ and $b_i$ are one-bit binary values, i.e. 0 or 1, and x' = 1, and to select either one or two fragments, one of each polynomial, as operands for a partial multiplication and to provide them at its output in a working step, the fragments having such a length that the partial multiplication requires only one calculation step,

wherein the selection unit is additionally designed to iteratively select the fragments in successive operating steps according to a predefined selection plan,

- at least one partial multiplier connected to the selection unit and configured to carry out a partial multiplication of the two or more than two operands in an iteration step and to provide the resulting partial product at its output,

- an accumulation unit, which is connected to the partial multiplier and which is designed to calculate the complete polynomial product by accumulating, iteration step by iteration step, the partial products received from the partial multiplier,

- an accumulation control unit, which is connected to the partial multiplier and the accumulation unit and which is designed to output, depending on the current iteration step, a control signal that instructs the accumulation unit, in accordance with a predetermined accumulation plan, to accumulate a partial product currently present at the output of the partial multiplier with one or more terms of a result polynomial that is refined from iteration step to iteration step.

The polynomial multiplier according to the third aspect of the invention is characterized by a reduced area requirement and a reduced power consumption, together with increased flexibility compared to the prior art with regard to the word length of the data words to be multiplied. The latter results from the use of a programmable accumulation plan by the accumulation control unit.

In one embodiment of the polynomial multiplier according to the invention, the accumulation unit contains a number of XOR gates, each XOR gate being connected at one input to one or more terms of the result polynomial and at another input to the partial multiplier. In this case, the accumulation control unit preferably has a number of control logic gates that corresponds to the number of XOR gates of the accumulation unit. Each control logic gate is associated with a respective XOR gate. The accumulation control unit is designed to generate a predetermined set of control signals in a respective iteration step and to apply them to the control logic gates, wherein a respective control signal determines whether or not a respective XOR gate is activated for accumulation in the current iteration step. This embodiment enables a particularly simple, space-saving design of the accumulation control unit. A current accumulation step is controlled, for example, by a control word that corresponds to the set of control signals. Each control bit of the control word corresponds to a control signal and controls, via a respective control logic gate, the activation of a particular XOR gate in the respective accumulation step. The control logic gates are preferably AND gates with two inputs, of which a first input is supplied with a logical one and a second input with a respective control signal.
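The effect of such a control word can be illustrated with a small Python model. This is only a sketch under simplifying assumptions: each control bit is assumed to enable the XOR gates of one whole result segment, the AND gating is modelled as a bit mask, and the names `accumulate`, `result_segments` and `control_word` are hypothetical. The gate-level wiring of an actual embodiment may differ.

```python
def accumulate(result_segments, partial_product, control_word, n):
    """XOR a partial product into every result segment whose control bit is 1.

    result_segments: list of n-bit ints (accumulator registers).
    partial_product: n-bit int at the output of the partial multiplier.
    control_word:    int; bit k enables accumulation into segment k.
    """
    all_ones = (1 << n) - 1
    updated = []
    for k, segment in enumerate(result_segments):
        enable = (control_word >> k) & 1
        gate_mask = all_ones if enable else 0   # AND gating by the control bit
        updated.append(segment ^ (partial_product & gate_mask))
    return updated


# One accumulation step: segments 0 and 2 are enabled, 1 and 3 are not.
step1 = accumulate([0, 0, 0, 0], 0b1010, 0b0101, 4)
# Applying the same step again cancels the contribution (XOR is self-inverse).
step2 = accumulate(step1, 0b1010, 0b0101, 4)
```

In this model the data path is the same in every iteration step; only the control word changes, which is the property the paragraph above attributes to the control-signal-based design.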

In a further preferred embodiment of the polynomial multiplier according to the invention, the partial multiplier is designed to perform a partial multiplication calculation step in one clock cycle.

Preferably, not only the accumulation is controlled by a programmable controller, but also the selection of the fragments. One embodiment has a selection control unit connected to the selection unit and configured to output, according to the current iteration step, a control signal instructing the selection unit to select a respective fragment of predetermined size from each polynomial in accordance with the predetermined selection plan. The hardware implementation may be similar to that of the accumulation control unit with control logic gates, so that here too only a corresponding set of control signals has to be delivered. This simplifies and shortens the data paths and thereby increases area efficiency.

The selection control unit and the accumulation control unit are preferably integrated in a single multiplication control unit.

Preferably, the accumulation control unit and the selection control unit are arranged to select, on receipt of polynomials at the input of the selection unit, the selection plan and the accumulation plan corresponding to a respective length of the polynomials and a predetermined word width of the partial multiplier from a plurality of stored selection and accumulation plans. This embodiment takes advantage of the high flexibility of the polynomial multiplier.

In a further preferred embodiment of the polynomial multiplier according to the invention, the partial multiplier is designed to perform a partial multiplication of two operands provided at the output of the selection unit by means of a Karatsuba formula for a partial product C(x) = A(x)·B(x) of the type

$$C(x) = a^0 b^0 \oplus \left[a^0 b^0 \oplus a^1 b^1 \oplus (a^0 \oplus a^1)(b^0 \oplus b^1)\right] \cdot x^{n/2} \oplus a^1 b^1 \cdot x^{n},$$

and to provide the partial product at its output.

A preferred embodiment of the polynomial multiplier is formed in the form of an integrated circuit.

A fourth aspect of the invention relates to an encryption unit for encrypting data by performing a polynomial multiplication, having a data input for unencrypted data, a polynomial multiplier according to the third aspect of the invention or one of its embodiments connected to the data input, and a data output connected to the polynomial multiplier for outputting the encrypted data.

A preferred embodiment of the encryption unit is designed to encrypt the data by means of elliptic curve cryptography, that is to say in particular to encrypt the data present at the data input by means of an elliptic curve over a Galois field. For this purpose, one of the curves already described in connection with the method of the second aspect of the invention may be used.

The encryption unit may be implemented in the form of an integrated circuit or in the form of an executable computer program. A further aspect of the invention therefore forms a data carrier with an executable program which implements a method for encrypting data according to the second aspect of the invention or one of its embodiments.

5. Brief description of the figures

In the following, the invention and further preferred embodiments will be explained in detail with reference to the figures, in which:

FIG. 1 shows the structure of an embodiment of a polynomial multiplier,

FIG. 2 shows a detailed representation of an exemplary embodiment of a method for polynomial multiplication using the example of a 4-word Karatsuba polynomial multiplication in GF(2^m),

FIG. 3 shows a block diagram, oriented on the data flow, of an embodiment of a method for data encryption,

FIG. 4 shows a block diagram of an embodiment of an encryption unit,

FIG. 5 shows a block diagram of a further exemplary embodiment of a polynomial multiplier according to the invention, and

FIG. 6 shows a schematic view of the partial multiplier and the accumulation unit of FIG. 5 in more detail.

6. Detailed description of embodiments

First, the invention will be explained with reference to a concrete example. Reference is made to FIGS. 1 and 2 in parallel. Figure 1 shows the structure of an embodiment of a polynomial multiplier. A flow diagram of an exemplary embodiment of the method according to the invention, in which a four-word iterative Karatsuba polynomial multiplication in GF (2 ^m ) is used by way of example, is shown in FIG.

The following description uses as an example the curve B-233 over a Galois field GF (2 ²³³ ), which is recommended by the National Institute of Standards and Technology of the United States of America (NIST) and is particularly suitable for implementation as hardware.

Elliptic curve cryptography (ECC) guarantees the same level of security as the well-known RSA method, but allows the use of significantly shorter keys. In addition, ECC operations are faster than those of the RSA method. Although ECC is less computationally intensive than RSA, computing the product of a 233-bit number k and a point P with two 233-bit coordinates requires a relatively large amount of energy and time. This operation is referred to as "kP" multiplication, where P is a point on an elliptic curve (EC) and k is a large number. The kP multiplication can be done using the "double and add" method (point doubling and point addition) or using the Montgomery method.
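The control flow of the "double and add" method can be sketched independently of the curve arithmetic. In the following Python sketch the group operations are passed in as functions; the stand-in group of integers modulo 1009 is an assumption chosen only so that the example runs without elliptic-curve arithmetic. Real point doubling and point addition on a curve such as B-233 are not shown, and the name `double_and_add` is hypothetical.

```python
def double_and_add(k, P, add, double, identity):
    """Left-to-right binary method for k*P in an additive group.

    add(Q, R) and double(Q) are the group operations; identity is the
    neutral element. One doubling per bit of k, one addition per 1-bit.
    """
    Q = identity
    for bit in bin(k)[2:]:        # most significant bit first
        Q = double(Q)
        if bit == "1":
            Q = add(Q, P)
    return Q


# Stand-in group: integers modulo 1009 under addition (not an elliptic curve).
def add_mod(x, y):
    return (x + y) % 1009


def double_mod(x):
    return (2 * x) % 1009


result = double_and_add(233, 5, add_mod, double_mod, 0)
```

On an elliptic curve over GF(2^m), each call to `add` or `double` would itself consist of several field multiplications, which is where a fast polynomial multiplier pays off.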

Regardless of the method used, the result of the kP multiplication must be reduced. The reduction is performed using so-called irreducible polynomials and can be a very elaborate operation in the Galois field GF(2^m). The irreducible polynomial for B-233 is the trinomial $f(x) = x^{233} \oplus x^{74} \oplus 1$.
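A reduction modulo a trinomial can be sketched in a few lines of Python. The sketch assumes the reduction trinomial that NIST specifies for B-233, x^233 + x^74 + 1, and cancels one leading term per loop pass; this bit-serial form is chosen for readability and is not the reduction circuit of the application. The names `reduce_b233`, `M` and `T` are hypothetical.

```python
M, T = 233, 74                      # f(x) = x^233 + x^74 + 1 (NIST B-233)
F = (1 << M) | (1 << T) | 1


def reduce_b233(c):
    """Reduce a GF(2) polynomial, stored as an int, modulo f(x)."""
    while c.bit_length() > M:
        d = c.bit_length() - 1 - M  # leading term is x^(M + d)
        c ^= F << d                 # subtract x^d * f(x); the leading term cancels
    return c


# The product of two 233-bit operands has up to 465 bits before reduction.
unreduced = (1 << 464) | (1 << 300) | 0b1011
reduced = reduce_b233(unreduced)
```

Because f(x) has only three terms, each cancellation touches three bit positions, which is why trinomials are attractive for hardware.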

In the Galois field GF(2^m), addition and subtraction are XOR operations. Therefore, and for easier understanding of the formulas, the ordinary representation of polynomials $A(x) = \sum_{i=0}^{n-1} a_i x^i$ is modified here to $A(x) = \bigoplus_{i=0}^{n-1} a_i x^i$. In this context, the XOR operation is marked as "⊕". The symbol "+" always denotes an ordinary addition. The division of polynomials is normally done in two steps: first, the inverse of the divisor is determined using the irreducible polynomial, and then the inverse is multiplied by the dividend.

The advantage of the Montgomery method is that an inverse has to be calculated at most twice. This is achieved in the Montgomery method by a larger number of multiplications, which require less computational effort than computing the inverse. This applies in particular when using the efficient polynomial multiplier proposed here.

One approach of the present invention is to apply the original Karatsuba method iteratively. The inventive method is therefore referred to as the iterative Karatsuba method. The main advantages of this method are:

- a smaller area of hardware accelerators due to the ability to perform partial multiplications serially

- a lower number of XOR operations compared to the recursive variant of Karatsuba's method.

The Karatsuba formula for the product C(x) = A(x)·B(x) is:

$$C(x) = a^0 b^0 \oplus \left[a^0 b^0 \oplus a^1 b^1 \oplus (a^0 \oplus a^1)(b^0 \oplus b^1)\right] \cdot x^{n/2} \oplus a^1 b^1 \cdot x^{n}$$

According to the invention, Karatsuba's formula is applied iteratively to calculate the partial products $a^i b^i$. In this case a total of $s^{\log_2 3} \approx s^{1.58}$ partial multiplications are required, where s is the number of segments. The number of segments s into which the operands must be decomposed is determined by the length of the input words of the multiplier and can be determined before the calculation as follows: s = length of the operand / word length of the multiplier.
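The relation between operand length, word length and the number of partial multiplications can be worked through in Python. Rounding the segment count up to the next power of two is an assumption of this sketch (it keeps the halving of formula (3) applicable at every level); the function names are hypothetical.

```python
import math


def segments(operand_bits, word_bits):
    """Number of segments s = operand length / word length, rounded up
    to the next power of two (assumption of this sketch)."""
    s = math.ceil(operand_bits / word_bits)
    return 1 << (s - 1).bit_length()


def partial_multiplications(s):
    """s^(log2 3) for s a power of two, i.e. 3^(log2 s)."""
    return 3 ** (s.bit_length() - 1)


# 233-bit operands with 128-, 64- and 32-bit partial multipliers
plan = {w: (segments(233, w), partial_multiplications(segments(233, w)))
        for w in (128, 64, 32)}
```

Under these assumptions, 233-bit operands give 2, 4 and 8 segments with 3, 9 and 27 partial multiplications respectively, compared with s² = 4, 16 and 64 for a direct segment-wise multiplication.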

This method can be used to speed up both software and hardware implementations. In software implementations, the Karatsuba method is usually used until both operands are the length of a data word.

The principle of the inventive iterative application of Karatsuba's formula will now be explained by way of example in which the operands are split into four segments. First, Karatsuba's formula is used to obtain a formula for a product using only one-segment long operands for partial multiplication.

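Initially, however, there are two operands, each 4n bits long. Each operand can be represented and decomposed in the form of a sum of two 2n-bit parts:

$$A(x) = a^3a^2 \cdot x^{2n} \oplus a^1a^0, \qquad B(x) = b^3b^2 \cdot x^{2n} \oplus b^1b^0 \qquad (4)$$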

The result of using Karatsuba's formula is:

$$
\begin{aligned}
C(x) ={}& a^1a^0 \cdot b^1b^0 \\
&\oplus \left[a^1a^0 \cdot b^1b^0 \oplus a^3a^2 \cdot b^3b^2 \oplus a^{13}a^{02} \cdot b^{13}b^{02}\right] \cdot x^{2n} \\
&\oplus a^3a^2 \cdot b^3b^2 \cdot x^{4n}
\end{aligned}
\qquad (5)
$$

in which

$$a^{13}a^{02} = a^{13} \cdot x^{n} \oplus a^{02} = (a^1 \oplus a^3) \cdot x^{n} \oplus (a^0 \oplus a^2) = (a^1 \cdot x^{n} \oplus a^0) \oplus (a^3 \cdot x^{n} \oplus a^2) = a^1a^0 \oplus a^3a^2 \qquad (6)$$

and

$$b^{13}b^{02} = b^1b^0 \oplus b^3b^2 \qquad (7)$$

Each element having two segments can be represented as $a^i a^j = a^i \cdot x^{n} \oplus a^j$. For each partial multiplication in formula (5), with the operands given by equations (6) and (7), Karatsuba's formula is applied again. The calculation is carried out by iteratively applying steps 204 and 206 (FIG. 2), taking into account the case distinction determined by the current clock cycle ("clk"), which represents the selection plan of this exemplary embodiment.

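The final result is shown in the following formula (8):

$$
\begin{aligned}
C(x) ={}& a^3b^3 \cdot x^{6n} \\
&\oplus \left(a^2b^2 \oplus a^3b^3 \oplus a^{23}b^{23}\right) \cdot x^{5n} \\
&\oplus \left(a^1b^1 \oplus a^2b^2 \oplus a^3b^3 \oplus a^{13}b^{13}\right) \cdot x^{4n} \\
&\oplus \left(a^0b^0 \oplus a^1b^1 \oplus a^2b^2 \oplus a^3b^3 \oplus a^{01}b^{01} \oplus a^{02}b^{02} \oplus a^{13}b^{13} \oplus a^{23}b^{23} \oplus a^{0123}b^{0123}\right) \cdot x^{3n} \\
&\oplus \left(a^0b^0 \oplus a^1b^1 \oplus a^2b^2 \oplus a^{02}b^{02}\right) \cdot x^{2n} \\
&\oplus \left(a^0b^0 \oplus a^1b^1 \oplus a^{01}b^{01}\right) \cdot x^{n} \\
&\oplus a^0b^0
\end{aligned}
\qquad (8)
$$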
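Formula (8) can be checked numerically. The following Python sketch forms the nine single-segment partial products and combines them exactly as the coefficients of formula (8) prescribe, then compares the result with a plain shift-and-XOR multiplication. Here a multi-index such as 13 denotes the XOR of segments 1 and 3, as in equation (6). The names `clmul` and `karatsuba4` are assumptions of this sketch.

```python
def clmul(a, b):
    """Carry-less product of two GF(2) polynomials stored as ints."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def karatsuba4(a, b, n):
    """Evaluate formula (8) for two operands of 4 segments of n bits each."""
    mask = (1 << n) - 1
    A = [(a >> (i * n)) & mask for i in range(4)]
    B = [(b >> (i * n)) & mask for i in range(4)]

    def p(*idx):
        # partial product of the XOR of the selected segments
        x = y = 0
        for i in idx:
            x ^= A[i]
            y ^= B[i]
        return clmul(x, y)

    p0, p1, p2, p3 = p(0), p(1), p(2), p(3)
    p01, p23, p02, p13 = p(0, 1), p(2, 3), p(0, 2), p(1, 3)
    p0123 = p(0, 1, 2, 3)

    coeff = [                                   # coefficient of x^(k*n), k = 0..6
        p0,
        p0 ^ p1 ^ p01,
        p0 ^ p1 ^ p2 ^ p02,
        p0 ^ p1 ^ p2 ^ p3 ^ p01 ^ p02 ^ p13 ^ p23 ^ p0123,
        p1 ^ p2 ^ p3 ^ p13,
        p2 ^ p3 ^ p23,
        p3,
    ]
    c = 0
    for k, ck in enumerate(coeff):
        c ^= ck << (k * n)
    return c


a_demo = int("9f" * 29, 16)     # arbitrary 232-bit operands, 4 segments of 64 bits
b_demo = int("c3" * 29, 16)
c_demo = karatsuba4(a_demo, b_demo, 64)
```

Nine partial products are used, in agreement with $s^{\log_2 3}$ for s = 4.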

Each of the operands on the right-hand side of formula (8) is one segment long, so that the resulting partial product is (2n-1) bits long. The bits from n-1 to 0 of the product $a^i \cdot b^i$ are denoted $a^i b^i[0]$ and the bits from 2n-1 to n are denoted $a^i b^i[1]$:

$$a^i b^i = a^i b^i[1] \cdot x^{n} \oplus a^i b^i[0] \qquad (9)$$

With the notation introduced in equation (9), formula (8) can be represented as shown in Table 1 below:

Table 1: Accumulation schedule according to formula (8)

All columns in Table 1 listed under the heading "Result Segments" represent a particular segment $c^i$ of the result, which is obtained from the partial products of the selected fragments mentioned above. For each partial product, two rows are provided in Table 1: one row contains the lower portion ($a^x b^x[0]$) and the second row the upper portion ($a^x b^x[1]$) of the product, as stated above.

A segment $c^i$ can be calculated as the XOR of all rows that are marked with the "⊕" symbol in the column of Table 1 associated with that segment. For example, $c^5$ can be calculated as follows:

$$c^5 = a^1b^1[1] \oplus a^2b^2[0] \oplus a^2b^2[1] \oplus a^3b^3[0] \oplus a^3b^3[1] \oplus \left((a^1 \oplus a^3)(b^1 \oplus b^3)[1]\right) \oplus \left((a^2 \oplus a^3)(b^2 \oplus b^3)[0]\right) \qquad (10)$$

Each segment $c^i$ can be calculated iteratively, i.e. step by step as the partial products are calculated, starting with $a^0 b^0$ and ending with $(a^0 \oplus a^1 \oplus a^2 \oplus a^3)(b^0 \oplus b^1 \oplus b^2 \oplus b^3)$. Subsequently, the calculation of the segments of products is started using the already existing results (step 208 in FIG. 2). For example:

Table 2: Example of calculation of product segments using existing results

In parallel with the calculation of the segments, an accumulation step is carried out in each iteration step in accordance with the accumulation plan represented by Table 1. In this way, the result C(x) is completed step by step, iteration by iteration, from the first to the last row of the partial products to be calculated.

This iterative calculation of the product C(x) reduces the area requirement of a hardware multiplier. Only one partial multiplier for single-segment operands is needed. After each new clock signal, this multiplier delivers the next partial product. In this way the segments of the product C(x) are collected. Thus, in the example given above, after nine clock cycles, all segments contain the correct product of the polynomial multiplication.
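The clock-by-clock behaviour described above can be modelled in Python. The sketch below is an illustration under stated assumptions: the order of the nine partial products in `SELECTION` and the table `POWERS` are derived here from formula (8) and are not copied from Table 1 or FIG. 2, and all names are hypothetical. In each clock cycle, one partial product is formed and its lower half [0] and upper half [1] are XORed into the result segments.

```python
def clmul(a, b):
    """Stand-in for the single-segment partial multiplier."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


# Selection plan: which segments are XORed to form the operands in each clock.
SELECTION = [(0,), (1,), (2,), (3,), (0, 1), (2, 3), (0, 2), (1, 3), (0, 1, 2, 3)]

# Accumulation plan: powers k of x^n at which each partial product occurs in (8).
POWERS = {
    (0,): (0, 1, 2, 3), (1,): (1, 2, 3, 4), (2,): (2, 3, 4, 5), (3,): (3, 4, 5, 6),
    (0, 1): (1, 3), (2, 3): (3, 5), (0, 2): (2, 3), (1, 3): (3, 4),
    (0, 1, 2, 3): (3,),
}


def rows_for_segment(j):
    """(operand indices, half) pairs that are XORed into result segment c^j."""
    low = [(idx, 0) for idx in SELECTION if j in POWERS[idx]]
    high = [(idx, 1) for idx in SELECTION if (j - 1) in POWERS[idx]]
    return low + high


def iterative_karatsuba4(a, b, n):
    mask = (1 << n) - 1
    A = [(a >> (i * n)) & mask for i in range(4)]
    B = [(b >> (i * n)) & mask for i in range(4)]
    c = [0] * 8                                  # result segments c^0 .. c^7
    for idx in SELECTION:                        # one pass = one clock cycle
        x = y = 0
        for i in idx:                            # selection unit
            x ^= A[i]
            y ^= B[i]
        pp = clmul(x, y)                         # partial multiplier
        lo, hi = pp & mask, pp >> n              # halves [0] and [1]
        for k in POWERS[idx]:                    # accumulation unit
            c[k] ^= lo
            c[k + 1] ^= hi
    result = 0
    for k, seg in enumerate(c):
        result |= seg << (k * n)
    return result


a_demo = int("9f" * 29, 16)
b_demo = int("c3" * 29, 16)
c_demo = iterative_karatsuba4(a_demo, b_demo, 64)
```

With this plan, `rows_for_segment(5)` yields the seven terms of equation (10), and the loop runs for nine clock cycles with a single partial multiplier.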

With the described iterative hardware solution, the chip area required to compute the product of two 233-bit operands is 2.1 mm². For the standard application of the Karatsuba method, by contrast, 6.2 mm² are needed. The solution according to the invention also reduces the energy consumption by 30% compared to the original solution. These gains come only at the cost of an increased calculation time. In one embodiment, a polynomial multiplication requires three clock cycles, while in the original Karatsuba method only one clock cycle is needed.

Similarly, the iterative approach of the invention may be applied to the Bailey method; the result is referred to in this application as the iterative Bailey method.

The structure and the essential parameters of an embodiment of a hardware implementation of the iterative Karatsuba method will be explained below. The structure of an iterative Karatsuba accelerator consists of three essential parts, cf. FIG. 1:

A selection unit 100 makes certain parts of both operands available to a downstream partial multiplier at each new clock signal at its output. A partial multiplier 102 computes the partial product of the operands provided by the selection unit and provides the results of a product accumulation unit.

The product accumulation unit 104 calculates the end result of the product from the partial products that it receives from the partial multiplier. The theoretical basis and the exact sequence of steps were explained in detail in sections 4 and 5 above.

The performance data, the chip area and the energy requirement of a polynomial multiplier are significantly influenced by the partial multiplier used. The larger the input signals of the partial multiplier, the faster the partial multiplier. On the other hand, this also leads to a relatively large space requirement. Therefore, the hardware design has to make a decision between calculation time and chip area. However, this only applies as long as the partial multiplier alone is taken into account. In addition, the area of the selection and product accumulation units is important for the polynomial multiplier. The chip area required by the product accumulation unit depends on the area requirement of the partial multiplier in an inversely proportional manner. That is, the smaller the partial multiplier, the larger the product accumulation unit. This results from the fact that in the case of small partial multipliers, more intermediate results must be stored in order to perform the final computation of the polynomial product. For example, the area of the product accumulation unit is 0.649 mm ² when the partial multiplier accepts 128 bit long operands. In contrast, the area is 1.466 mm ² if the maximum accepted length of the operands is only 32 bits.

In order to obtain a well-adapted design for a polynomial multiplier, various partial multipliers have been realized. Three one-clock partial multipliers each were used for the Karatsuba method according to the invention and for an iterative Bailey method according to the invention. These partial multipliers accept operands with a maximum length of 128, 64 and 32 bits respectively. They were synthesized using the Applicant's circuit library and proprietary 0.25 μm CMOS technology. Table 3 shows the parameters area, time and energy consumption of each of these six partial multipliers. The values were determined using the design analysis tool from Synopsys.

Table 3: Parameters of the produced partial multipliers

Embodiments with an accumulation control unit and a selection control unit, which are integrated in a multiplication control unit, can further increase the flexibility with regard to the required total duration, the possible clock frequencies and the required chip area. Table 4 below compares parameters of different 233-bit iterative Karatsuba multipliers. It can be seen that a 1-clock multiplier requires the least overall time for a polynomial multiplication, but also requires by far the largest chip area.

It is apparent from the data in Table 4 that a 4-segment Karatsuba implementation requires less chip area than an 8-segment implementation. This reflects the fact that, in embodiments with a higher fragmentation, the logic for selecting the fragments (segments) and for accumulating the partial products has a considerable influence on the space requirement. In the 8-segment implementation, these logic parts require more than 75% of the chip area occupied by the multiplier. Overall, the selection and accumulation logic takes 0.30 mm² in the 2-segment multiplier, 0.78 mm² in the 4-segment multiplier and 1.18 mm² in the 8-segment multiplier. The segmentation has a high impact on the required chip area because of the resulting complicated data path. For this reason, in a preferred embodiment, the accumulation control unit and the selection control unit are formed with an alternative structure, which will be described in more detail below with reference to FIG.

Table 4: Parameters of different 233-bit iterative Karatsuba multipliers

For a benchmarking test, polynomial multipliers were made with an implementation of the following methods:

- Iterative Karatsuba method

- Iterative Bailey method

- Recursive Karatsuba method according to the prior art

- Recursive Bailey method according to the prior art

For the first two method implementations, three polynomial multipliers with different partial multipliers were used (see Table 4) to determine the influence of each partial multiplier on the performance parameters. These multipliers were named so that the name indicates the method used. For example, the name iterative_Karatsuba_8segments means: iterative Karatsuba method in which incoming operands are fragmented into eight segments.

The two recursive multipliers use the original Karatsuba or Bailey formula down to one-bit operands. Both multipliers supply the polynomial product after one clock cycle. They differ in the length of the input operands. The Karatsuba multiplier always expects two 256-bit input values, while the Bailey multiplier expects two 243-bit input values.

Because of the intended use of these multipliers for EC B-233, the expected input values are only 233 bits long. Therefore, the operands were padded with leading zeros where necessary. The result of the multiplication is always 465 bits long.
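The recursive scheme and the zero padding described above can be illustrated in software. The following Python sketch is a behavioural model under stated assumptions, not the synthesized circuit: the function names are hypothetical, operands are held in Python integers (bit i is the coefficient of x^i), and the recursion halves a 256-bit operand until one-bit operands remain.

```python
def clmul_schoolbook(a: int, b: int) -> int:
    """Carry-less product in GF(2)[x] by shift-and-XOR; used as a reference."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def karatsuba_recursive(a: int, b: int, width: int) -> int:
    """Recursive Karatsuba in GF(2)[x]; width must be a power of two."""
    if width == 1:
        return a & b  # a one-bit multiplication is a single AND
    half = width // 2
    mask = (1 << half) - 1
    a0, a1 = a & mask, a >> half
    b0, b1 = b & mask, b >> half
    low = karatsuba_recursive(a0, b0, half)
    high = karatsuba_recursive(a1, b1, half)
    mid = karatsuba_recursive(a0 ^ a1, b0 ^ b1, half)
    # Addition and subtraction are both XOR in GF(2)[x].
    return low ^ ((low ^ high ^ mid) << half) ^ (high << width)


a = (1 << 232) | 1      # x^232 + 1, a 233-bit operand
b = (1 << 232) | 0b10   # x^232 + x
# Passing width=256 corresponds to padding the 233-bit operands with leading zeros.
product = karatsuba_recursive(a, b, 256)
assert product == clmul_schoolbook(a, b)
print(product.bit_length())  # 465: degree 232 + degree 232 = degree 464
```

Two operands of degree at most 232 yield a product of degree at most 464, which is why the result occupies 465 bits regardless of the padded input width.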

All polynomial multipliers were synthesized using Applicants' 0.25 μm CMOS circuit library. The parameters of the implemented polynomial multipliers are given in Table 4. The data contained was obtained from Synopsys' Design Analyzer using various types of analysis reports.

Table 5: Parameters of the synthesized polynomial multipliers

The results presented in Table 5 clearly show that iterative application of the Karatsuba and Bailey methods significantly reduces the required chip area. If the number of iterations is kept small, the inventive approach also helps to reduce energy consumption. In these design variants, a smaller space requirement and a lower energy consumption are achieved at the expense of a slower execution time. Increasing the number of iterations reduces the required chip area, but also increases the required power consumption and computing time. These implementations are only useful if cost is the deciding parameter.

The iterative application of the Karatsuba method for polynomial multiplications thus makes it possible to reduce the required chip area and the energy required to perform elliptic curve cryptography on mobile terminals. Different methods for polynomial multiplication in GF (2 ⁿ ) were analyzed and different polynomial multiplication algorithms were implemented. Various partial multipliers were made. They were used to implement a number of iterative polynomial multipliers to determine the best possible variant for use in mobile devices. Our results clearly show that the inventive iterative approach leads to significantly better results in terms of chip area and energy consumption than the original direct applications.

FIG. 3 shows a data flow-oriented block diagram of an exemplary embodiment of a device for data encryption, which is referred to below as encryption unit 300. The encryption unit 300 contains a read-only memory 302 in which the coordinates of a base point G of a predetermined elliptic curve are stored. A random number generator 304 generates in each case a random number k per section M for user data to be encrypted. A memory 306 contains a public key S of the recipient of the message. A data segmenter 308 decomposes incoming and to-be-encrypted payload data into payload portions M of a predetermined length.

As part of the encryption of a payload data section M, the product kG of the base point G with the current random number k is calculated in a Galois field multiplier 310, which contains an iterative polynomial multiplier according to the invention. This is symbolized by a block 310.1. Furthermore, the product kS of the same current random number k and the public key S is determined in the Galois field multiplier 310, which is symbolized by the block 310.2. It is conceivable to provide two independent Galois field multipliers for the calculation of the products kG and kS. Preferably, however, there is only one Galois field multiplier so as not to unnecessarily increase the area requirement. The associated time delay is tolerable for most encryption applications.

In a transformation unit 312, the payload data section provided by the data segmenter 308 is checked for correspondence with the X coordinate of a point of the elliptic curve. If necessary, undefined bits of the payload data section M are changed, so that a modified payload data section M* is created. The undefined bits can be changed freely without the risk of modifying the payload. This modification therefore has no influence on the useful information contained in the payload data M*. After each generation of a modified payload data section M*, it is checked again whether this modified payload data section coincides with the X coordinate of a point on the elliptic curve.

The mode of operation of the transformation unit 312 is explained in more detail below by means of an example. A payload section contains, for example, the text "Zojka", represented by the data symbol sequence (5A, 6F, 6A, 6B, 61, 00), where the last data symbol "00" is not fixed and can be changed so that the sequence of data symbols matches the X coordinate of a point on the elliptic curve. Assuming that, in a first step, the unspecified data symbol is set to "01", the transformation unit 312 will determine that the resulting sequence of data symbols does not correspond to any point on the elliptic curve. If, however, the unspecified data symbol is set to "02", the transformation unit 312 will determine that the resulting sequence of data symbols corresponds to a point on the elliptic curve having the following Y coordinate: 7D3C7D654AAB7068E1DA366C49588A27F252D410. The transformation unit 312 passes the X and Y coordinates of the determined point of the elliptic curve to an input of an adder 314, at whose other input the product kS of the public key with the current random number k is present. The adder 314 passes the sum kS + Y to an output unit 316, which is supplied with the product kG at a further input. The output unit 316 assembles the product kG and the sum kS + Y into a data sequence and outputs it. The output can be either serial or parallel.
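The search performed by the transformation unit can be sketched on a toy scale. The following Python model is an illustration only and makes several assumptions that are not in the source: a tiny field GF(2^8) with the reduction polynomial x^8 + x^4 + x^3 + x + 1, arbitrary curve coefficients a = b = 1 in the curve equation y² + xy = x³ + ax² + b, three undefined pad bits, a brute-force search for y, and a pad search that starts at 0. All names are hypothetical.

```python
M_BITS = 8                # toy field GF(2^8); the text uses far larger fields
REDUCTION_POLY = 0x11B    # x^8 + x^4 + x^3 + x + 1 (assumed for illustration)
CURVE_A, CURVE_B = 1, 1   # arbitrary toy coefficients; b must be non-zero
PAD_BITS = 3              # number of undefined low-order bits (assumed)


def gf_mul(a: int, b: int) -> int:
    """Multiply two field elements with interleaved reduction."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> M_BITS:
            a ^= REDUCTION_POLY
    return result


def find_y(x: int):
    """Return some y with y^2 + x*y = x^3 + a*x^2 + b, or None (brute force)."""
    x2 = gf_mul(x, x)
    rhs = gf_mul(x2, x) ^ gf_mul(CURVE_A, x2) ^ CURVE_B
    for y in range(1 << M_BITS):
        if (gf_mul(y, y) ^ gf_mul(x, y)) == rhs:
            return y
    return None


def embed(payload: int):
    """Try pad values in ascending order until payload||pad is a valid X coordinate."""
    for pad in range(1 << PAD_BITS):
        x = (payload << PAD_BITS) | pad
        y = find_y(x)
        if y is not None:
            return x, y
    return None  # no pad value leads to a curve point


print(embed(0b10110))
```

Because only the pad bits vary, the payload can always be recovered from the X coordinate by discarding the low-order pad bits, which mirrors the statement that the modification does not affect the useful information.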

The encryption unit 300 can be implemented both in the form of hardware and in the form of software.

FIG. 4 shows a block diagram of a further embodiment of an encryption unit, which is referred to below as encryption unit 400. The representation of FIG. 4 shows a hardware implementation.

The individual units of the encryption unit are connected via a central bus 402. Connected to the bus 402 is a control unit 404 which contains control logic for performing the Montgomery method. The control unit 404 controls the cooperation of the units described below. Via an input / output unit 406, a base point G, a public key S and a user data section M can be supplied to the encryption unit 400. The input / output unit 406 is simultaneously configured to assemble and output an encrypted message generated by the encryption unit 400, as described in connection with the output unit 316 of FIG.

A random number generator 408 provides a random number k for each incoming payload section M. A Karatsuba polynomial multiplier 410 is connected to a polynomial reduction unit 412. An inversion unit 414, also connected to the data bus 402, is configured to form the multiplicative inverse of a polynomial. Furthermore, an adder 416 is provided, as well as a polynomial squaring unit 418 connected to another polynomial reducer 420.

The mode of operation of the encryption unit 400 corresponds to the mode of operation illustrated with reference to FIG. 3, wherein the control unit 404 performs the function of the transformation unit 312.

Various variants of the encryption unit 400 are possible. For example, it may be provided that the base point G and / or the public key S are not supplied via the input / output unit 406, but stored permanently in a memory. For this purpose, the memory 422 also used in the polynomial multiplication can be used.

On the other hand, it is also conceivable to externally supply not only the base point G, the public key S and the user data M, but also the current random number k and not to integrate the random number generator 408 into the encryption unit 400. In this way, the area requirement of the encryption unit 400 can be further reduced in a hardware implementation.

The addition performed by adder 416 is based on the XOR operation, as previously mentioned.

The software variant was compared with the polynomial multiplier of the Miracle Library, version 4.7 (Shamus Software Ltd., Ireland, http://indiqo.ie/~mscott/), which uses a recursive Karatsuba approach, and with an implementation. Microsoft Visual C++ 6.0 was used for the implementation of the software variant. The measurements were made on a PC (Intel Pentium III processor, 800 MHz, Microsoft Windows XP Professional Version 2002, 256 MB RAM) and, for comparison, on a PDA (Pocket PC iPAQ, Hewlett-Packard Company, model h5500, 400 MHz Intel XScale processor, 48 MB ROM, 128 MB RAM, OS: Pocket PC 2003 Prem with Outlook 2002). First, 1 million 233-bit operands were stored in a data file. These operands were used for all measurements to ensure comparability of the results. Entry number n of the data file was multiplied by entry number n + 1 until all operands were used. The test thus consisted of performing 1 million multiplications.

The comparison of the average calculation times on the PC shows an increase in performance compared to the recursive application of the Karatsuba approach:

- up to 17% when using compiler optimizations and

- up to 37% without using the compiler optimizations.

On the PDA, the following performance improvements compared to the recursive Karatsuba method have resulted:

- up to 11% when using compiler optimizations and

- up to 17% without using the compiler optimizations.

These values show that the method according to the invention has advantages over known methods not only in a hardware implementation but also in a software implementation.

FIG. 5 shows a block diagram of an embodiment of a polynomial multiplier 500 according to the invention.

The structure of the polynomial multiplier 500 is similar in many parts to the structure of the polynomial multiplier described in FIG. Thus, the polynomial multiplier 500 also includes a selection unit 502, a partial multiplier 504, and a product accumulation unit 506. However, whereas in the exemplary embodiment of FIG. 1 the operation of the selection and accumulation units is predetermined by hard-wired data paths, so that a hard-wired selection plan and a hard-wired accumulation plan are executed clock cycle by clock cycle, in the present exemplary embodiment a separate multiplier control unit 508 is provided, into which a selection control unit 508.1 and an accumulation control unit 508.2 are integrated. The multiplier control unit is accordingly connected to the selection unit 502 on the one hand and to the accumulation unit 506 on the other hand.

Also shown are input registers 510 and 512, which are connected upstream of the selector 502 and configured to receive incoming data words A and B whose product is to be calculated by the polynomial multiplier. The calculated product C will be present at the output of an output register 514, shown here as part of the accumulation unit 506.

In a preferred embodiment, the accumulation unit 506 simultaneously integrates a reduction unit, so that a product C reduced to the word length of the incoming data words can be output.

However, it is alternatively also conceivable to connect the reduction unit downstream of the polynomial multiplier 500. In this case, the output register 514 must provide a correspondingly larger word width.
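As a sketch of what such a reduction unit computes, the following Python model reduces a product of up to 465 bits back to 233 bits. The source does not state the reduction polynomial here; the trinomial x^233 + x^74 + 1 is the one specified for the NIST curves B-233 and K-233 and is therefore an assumption about the field representation. The function name is hypothetical.

```python
M = 233
# Assumed reduction polynomial f(x) = x^233 + x^74 + 1 (NIST B-233 / K-233).
F = (1 << 233) | (1 << 74) | 1


def reduce_b233(c: int) -> int:
    """Reduce the polynomial c modulo F by clearing set bits from the top down."""
    for bit in range(c.bit_length() - 1, M - 1, -1):
        if (c >> bit) & 1:
            # XOR a shifted copy of F so that its leading term cancels this bit.
            c ^= F << (bit - M)
    return c


unreduced = (1 << 464) | (1 << 300) | 0x5A6F  # some 465-bit product
reduced = reduce_b233(unreduced)
assert reduced.bit_length() <= M
print(hex(reduced))
```

If the reduction is integrated into the accumulation unit, only the 233-bit result has to be held in the output register; with a downstream reduction unit, the output register has to hold the full 465-bit product first.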

The mode of operation of the polynomial multiplier 500 of FIG. 5, which has changed compared to the exemplary embodiment of FIG. 1, will be explained in more detail below with reference to FIG.

Figure 6 is a schematic view showing the partial multiplier and the accumulation unit of Figure 5 in more detail. The representation is schematic in that, to clarify the operation of the accumulation in an iteration step, it shows several structural elements more than once, as will be explained in more detail below. The structure of the accumulation unit is described using the example of an iterative Karatsuba multiplier with a 4-segment multiplication of 233-bit data words, as also exemplified in FIG. In this case, 8 product segments c0 to c7 are available at the end of each iteration in the accumulation unit, as was explained in Table 1 in a corresponding example. These segments c0 to c7 are located in respective registers 516 to 530.

The structure of registers 516 to 530 is shown three times in FIG. 6 in order to illustrate the sequence of an iteration step. The register structure shown on the left side of the figure represents an initial register state, the register structure shown in broken lines in the middle represents a temporary intermediate state which is not stored, and the register structure shown on the right represents the end state of each iteration step. However, as mentioned, all three representations refer to one and the same register structure 516 to 530 of the actual accumulation unit 506.

For an n × n partial multiplier 504, the data width of the register structure is 8n.

The transition from the initial register state to the final register state during an iteration step results from a number of XOR operations performed on a total of seven XOR gates 532 through 544. Each XOR gate combines 2n bits from two adjacent registers with the current result of a partial multiplication, which is available in a register 504.1 at the output of the partial multiplier 504.

Each XOR gate is connected on the input side to a control logic gate. The state of the respective control logic gate decides whether an XOR operation of the respective register with the current partial product from the register 504.1 is actually performed. The control logic gates in the present embodiment are AND gates 546 to 558. The result of the partial multiplication is present at one input of each AND gate, and a control bit cw[0], cw[1], ..., cw[6] of a control word (cw) is present at the other input. Depending on the value of the respective control bit in the current iteration cycle, the respective AND gate therefore passes or blocks the partial multiplication result for accumulation at the respective registers 516 to 530. If the control bit is set, an addition, i.e. an XOR operation, takes place; if the control bit is not set, no XOR operation takes place at the respective XOR gate. By controlling the operations in each accumulation cycle, i.e. in each iteration step, an accumulation plan according to Table 1 can be implemented in this way.
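The control-word mechanism can be modelled in a few lines of Python. The sketch below is a behavioural illustration under stated assumptions, not the circuit of FIG. 6: the word width n = 64, all function names and the plan itself are hypothetical. In particular, the plan is derived here from a two-level Karatsuba decomposition and is not copied from Table 1. Bit i of a control word is assumed to enable the XOR gate that spans registers c_i and c_(i+1).

```python
N = 64  # assumed word width n of the n x n partial multiplier (4 x 64 >= 233)


def clmul(a: int, b: int) -> int:
    """Carry-less product in GF(2)[x]; stands in for the one-clock partial multiplier."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


# Hypothetical plan: (indices of the fragments XORed into each operand, control word).
PLAN = [
    ((0,),         0b0001111),
    ((1,),         0b0011110),
    ((0, 1),       0b0001010),
    ((2,),         0b0111100),
    ((3,),         0b1111000),
    ((2, 3),       0b0101000),
    ((0, 2),       0b0001100),
    ((1, 3),       0b0011000),
    ((0, 1, 2, 3), 0b0001000),
]


def multiply_with_control_words(a: int, b: int, n: int = N) -> int:
    mask = (1 << n) - 1
    a_frag = [(a >> (i * n)) & mask for i in range(4)]
    b_frag = [(b >> (i * n)) & mask for i in range(4)]
    acc = 0  # models the 8n-bit register structure c0..c7
    for indices, cw in PLAN:
        op_a = op_b = 0
        for i in indices:              # selection: XOR of the chosen fragments
            op_a ^= a_frag[i]
            op_b ^= b_frag[i]
        partial = clmul(op_a, op_b)    # one partial multiplication per iteration
        for bit in range(7):           # seven AND-gated XOR stages
            if (cw >> bit) & 1:
                acc ^= partial << (bit * n)
    return acc


a = (1 << 232) | 0xDEADBEEF
b = (1 << 232) | (0x1234567 << 100) | 1
assert multiply_with_control_words(a, b) == clmul(a, b)
```

Changing the multiplication scheme in this model only requires a different list of control words, which reflects the stated advantage that different accumulation plans can be implemented without changing the data path.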

This form of control of the accumulation with the help of a control word saves complicated data paths and leads to a considerable space saving. Another advantage of this embodiment is that different accumulation schedules can be implemented in this way. These may be available in a memory or may be subsequently stored in the accumulation control unit 508.2.

The selection control can be realized in the same way and is not shown here.

The area advantage achieved with the present exemplary embodiment increases with the segmentation of the data words provided for multiplication. While in the embodiment of Figure 1 the data path becomes more complicated with increasing segmentation and therefore occupies more and more chip area (from 0.30 mm² for a 2-segment method through 0.78 mm² for a 4-segment method up to 1.18 mm² for an 8-segment method), the area requirement of the multiplier control unit 508, including both the selection control unit 508.1 and the accumulation control unit 508.2, remains almost constant and is in the range of 0.30 mm².

Claims

1. Method for calculating a polynomial multiplication, comprising the steps:

Providing coefficients $a_i$, $b_i$ of two polynomials $A(x) = \bigoplus_{i=0}^{n-1} a_i x^i$ and $B(x) = \bigoplus_{i=0}^{n-1} b_i x^i$, where "$\oplus$" indicates an addition or an XOR operation and $a_i$, $b_i$ are binary one-bit values, i.e. 0 or 1, and $x^i = 1$,

partially multiplying the selected fragments to obtain a partial product

wherein the selection step is carried out iteratively according to a predefined selection plan, and wherein fragments are used in which the respective multiplying step for forming a partial product of respective fragments requires only one calculation step,

Accumulating the partial products, the accumulation being controlled in accordance with a predefined accumulation plan for accumulating a partial product currently present at the output of the partial multiplier with one or more terms of a result polynomial that is developed further from iteration step to iteration step.

2. The method of claim 1, wherein the accumulating is performed in parallel with the fragmenting and comprises a partial accumulating step according to the accumulation schedule performed after a respective iteration of a multiplying step.


3. The method of claim 1 or 2, wherein in a clock cycle of a computing device for performing the method, a calculation step is performed.

4. A method according to any one of the preceding claims, wherein the selection, multiplication and accumulation steps are based on a Karatsuba method for performing a polynomial multiplication.

5. A method according to any one of the preceding claims, wherein before the selecting step, the selection schedule and the accumulation schedule are selected from a plurality of stored selection and accumulation schedules corresponding to a respective length of the polynomials and a predetermined word width of a partial multiplier used to perform the partial multiplying step.

6. A method for encrypting data, comprising a step of calculating a product of two polynomials according to a method according to one of claims 1 to 5.

7. The method of claim 6, wherein the encryption of the data is performed by means of an elliptic curve over a Galois field.

8. The method of claim 7, wherein the elliptic curve is the curve B-163 over the Galois field GF(2¹⁶³), or the curve B-233 over the Galois field GF(2²³³), or the curve B-283 over the Galois field GF(2²⁸³), or the curve B-409 over the Galois field GF(2⁴⁰⁹), or the curve B-571 over the Galois field GF(2⁵⁷¹), or the curve K-163 over the Galois field GF(2¹⁶³), or the curve K-233 over the Galois field GF(2²³³), or the curve K-283 over the Galois field GF(2²⁸³), or the curve K-409 over the Galois field GF(2⁴⁰⁹), or the curve K-571 over the Galois field GF(2⁵⁷¹).

9. Polynomial multiplier, with

a selection unit that is designed to select in one operation, from coefficients $a_i$, $b_i$ of two polynomials $A(x) = \bigoplus_{i=0}^{n-1} a_i x^i$ and $B(x) = \bigoplus_{i=0}^{n-1} b_i x^i$, where "$\oplus$" is an addition or an XOR operation, either two or more than two fragments, one of each polynomial, as operands for a partial multiplication and to provide them at its output, the fragments having such a length that the partial multiplication requires only one calculation step,

an accumulation unit, which is connected to the partial multiplier and which is designed to calculate the complete polynomial product by accumulating, iteration step by iteration step, the partial products which it has received from the partial multiplier,

an accumulation control unit, which is connected to the partial multiplier and the accumulation unit and is designed to output, depending on the current iteration step, a control signal which instructs the accumulation unit to accumulate a partial product currently present at the output of the partial multiplier with one or more terms of a result polynomial that is developed further from iteration step to iteration step, in accordance with a predetermined accumulation plan.

10. The polynomial multiplier of claim 9, wherein the accumulation unit includes a number of XOR gates, each XOR gate connected at one input to one or more terms of the resulting result polynomial and at another input to the partial multiplier.

11. Polynomial multiplier according to claim 10, wherein the accumulation control unit comprises a number of control logic gates corresponding to the number of XOR gates of the accumulation unit, each control logic gate being associated with a respective XOR gate, and wherein the accumulation control unit is adapted to generate, in a respective iteration step, a predetermined set of control signals and to apply it to the control logic gates, wherein a respective control signal determines whether or not a respective XOR gate is activated for accumulation in the current iteration step.

12. Polynomial multiplier according to claim 11, wherein the control logic gates are AND gates with two inputs, of which a first input is applied to a logical one and a second input is supplied with a control signal.

13. Polynomial multiplier according to one of claims 9 to 12, with a selection control unit, which is connected to the selection unit and is adapted to output, depending on the current iteration step, a control signal which instructs the selection unit, in accordance with the predetermined selection plan, to output a respective fragment of predetermined size from each polynomial.

14. Polynomial multiplier according to one of claims 9 to 13, which is designed to perform a partial multiplication of two operands provided at the output of the selection unit by means of a Karatsuba formula for a partial product C(x) = A(x)·B(x), of the type

$C(x) = a^0 b^0 \oplus \left[ a^0 b^0 \oplus a^1 b^1 \oplus (a^0 \oplus a^1)(b^0 \oplus b^1) \right] \cdot x^{n/2} \oplus a^1 b^1 \cdot x^{n}$

and to provide the partial product at its output.

15. A polynomial multiplier according to any one of claims 9 to 14, wherein the accumulation control unit and the selection control unit are arranged, upon receiving polynomials at the input of the selection unit, to select the selection plan and the accumulation plan corresponding to a respective length of the polynomials and a predetermined word width of the partial multiplier from a plurality of stored selection and accumulation plans.

16. The polynomial multiplier according to any one of claims 9 to 15, wherein said accumulation control unit and said selection control unit form an integrated multiplier control unit.

17. Polynomial multiplier according to one of claims 9 to 16, which is formed in the form of an integrated circuit.

18. An encryption unit for encrypting data by performing a polynomial multiplication, comprising a data input for unencrypted data, a polynomial multiplier according to one of claims 9 to 17 connected to the data input, and a data output connected to the polynomial multiplier for outputting the encrypted data.

19. An encryption unit according to claim 18, which is designed to encrypt the data by means of elliptic curve cryptography.

20. Encryption unit according to one of claims 18 or 19, which is implemented in the form of an integrated circuit or in the form of an executable computer program.

21. An executable program medium implementing a method of encrypting data according to claim 6.