WO2022272289A1 - Semiconductor intellectual property core, methods to improve data processor performance, and side-channel attack on hmac-sha-2 - Google Patents


Info

Publication number
WO2022272289A1
Authority
WO
WIPO (PCT)
Prior art keywords
field
ring
hash function
multiplications
hash
Prior art date
Application number
PCT/US2022/073122
Other languages
French (fr)
Inventor
Yaacov Belenky
Ury KREIMER
Alexander KESLER
Original Assignee
FortifyIQ, Inc.
Filing date
Publication date
Priority claimed from US17/444,832 external-priority patent/US20220414227A1/en
Application filed by FortifyIQ, Inc. filed Critical FortifyIQ, Inc.
Publication of WO2022272289A1 publication Critical patent/WO2022272289A1/en

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00 Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/002 Countermeasures against attacks on cryptographic mechanisms
    • H04L9/003 Countermeasures against attacks on cryptographic mechanisms for power analysis, e.g. differential power analysis [DPA] or simple power analysis [SPA]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00 Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/06 Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols the encryption apparatus using shift registers or memories for block-wise or stream coding, e.g. DES systems or RC4; Hash functions; Pseudorandom sequence generators
    • H04L9/0618 Block ciphers, i.e. encrypting groups of characters of a plain text message using fixed encryption transformation
    • H04L9/0631 Substitution permutation network [SPN], i.e. cipher composed of a number of stages or rounds each involving linear and nonlinear transformations, e.g. AES algorithms
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00 Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/32 Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols including means for verifying the identity or authority of a user of the system or for message authentication, e.g. authorization, entity authentication, data integrity or data verification, non-repudiation, key authentication or verification of credentials
    • H04L9/3236 Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols including means for verifying the identity or authority of a user of the system or for message authentication, e.g. authorization, entity authentication, data integrity or data verification, non-repudiation, key authentication or verification of credentials using cryptographic hash functions
    • H04L9/3242 Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols including means for verifying the identity or authority of a user of the system or for message authentication, e.g. authorization, entity authentication, data integrity or data verification, non-repudiation, key authentication or verification of credentials using cryptographic hash functions involving keyed hash functions, e.g. message authentication codes [MACs], CBC-MAC or HMAC
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L2209/00 Additional information or applications relating to cryptographic mechanisms or cryptographic arrangements for secret or secure communication H04L9/00
    • H04L2209/12 Details relating to cryptographic hardware or logic circuitry

Definitions

  • Some described embodiments are in the field of data processor design and implementation. Other described embodiments are in the field of side-channel attacks on cryptographic algorithms and, more specifically, side-channel attacks on hash-based message authentication code (HMAC) implementations, and testing of HMAC implementations for vulnerability to such side-channel attacks.
  • HMAC hash-based message authentication code
  • SCA Side Channel Attacks
  • DPA differential power analysis
  • SPA simple power analysis
  • fault injection is a common category of cyber-attack used by hackers and intelligence agencies to penetrate sensitive systems in order to perform cryptographic key extraction.
  • New types of side channel attacks are being conceived all the time.
  • Any device that performs a cryptographic operation should withstand side channel attacks and several security certifications explicitly require such side channel attack resistance tests.
  • HMAC approaches that are implemented using hash functions can, therefore, be a target of bad actors seeking to discover protected information.
  • hash functions such as SHA-2 hash functions
  • HMAC-SHA-2 family hash functions
  • described embodiments relate to improving the function of data processors.
  • a general aspect relates to improvement of the exponentiation algorithm in a redundant AES calculation.
  • this aspect is embodied by one or more methods.
  • This aspect contributes to an improvement of calculation speed in a data processor by shortening the critical path in a hardware implementation of raising to the power of 254 in GF(2^8).
  • this path shortening contributes to an increase in the frequency at which such a design can be used.
  • this aspect is embodied by an IP core.
  • this aspect is embodied by a method.
  • Another general aspect relates to limiting the degree of polynomials over a ring during multiplication operations.
  • this aspect is embodied by an IP core.
  • this aspect is embodied by a method.
  • a semiconductor intellectual property (IP) core including a transformation engine designed and configured to generate redundant representations of each element of a field GF(2^8) using a polynomial of degree no higher than 7 + d, where d > 0 is a redundancy parameter, the redundant representations belonging to a ring; wherein the transformation engine represents a same field element in one of 2^d ways (pairwise differing by terms that are multiples of P(x)), and at one or more moments of the calculations replaces a redundant representation of a field element Z with a redundant representation chosen out of the 2^d representations of the field element Z.
  • the redundant representation out of the 2^d representations is chosen randomly. Alternatively or additionally, in some embodiments, d > 5. Alternatively or additionally, in some embodiments, d > 8. Alternatively or additionally, in some embodiments, the element of a field includes a byte of data within a block of a block cipher and a cryptographic key. Alternatively or additionally, in some embodiments, the block cipher is selected from the group consisting of AES, SM4, and ARIA. Alternatively or additionally, in some embodiments, the transformation engine computes X^Y by performing a series of:
  • a method of building different representations of the Galois Field (GF) implemented by logic circuitry including: redundantly representing each element of a field GF(2^8) using a polynomial of degree no higher than 7 + d, where d > 0 is a redundancy parameter, to generate redundant representations belonging to a ring; representing a same field element in one of 2^d ways (pairwise differing by terms that are multiples of P(x)); and at one or more moments of the calculations replacing a value representing a field element Z with any of the various representations of the field element Z.
  • one of the various representations of the field element Z is chosen randomly.
  • the element of a field comprises a byte of data within a block of a block cipher and a cryptographic key.
  • the block cipher is selected from the group consisting of AES, SM4, and ARIA.
  • the method includes computing X^Y by performing a series of:
  • a method of improving performance of a data processor including: in a ring of characteristic 2, computing X^254 by performing a series of: (i) multiplications of two different elements of the field; and (ii) raising an element of the field to a power Z, wherein Z is a power of 2 (such operation being a linear transformation); wherein the total number of multiplications is limited to 4, the total number of linear transformations is limited to 4, the number of multiplications executed sequentially is limited to 3 (meaning that some 2 of 4 multiplications can be executed in parallel), and the number of linear transformations executed sequentially is limited to 2.
  • elements of the ring redundantly represent elements of a field, and after at least one transformation according to (i) or (ii) the result representing a field element Z is replaced with one of the redundant representations of a same element Z of the field.
  • the redundant representation of a same element Z of the field is chosen randomly.
  • the ring includes a field GF(2^8) as a subring.
  • a method of improving performance of a data processor comprising: in a ring of characteristic 2, computing X^254 by performing a series of:
  • elements of the ring redundantly represent elements of a field; and after at least one transformation according to (i) or (ii) the result representing a field element Z is replaced with one of the redundant representations of a same element Z of the field.
  • the redundant representation of a same element Z of the field is chosen randomly.
  • the ring includes a field GF(2^8) as a subring.
  • a semiconductor intellectual property (IP) core for improving performance of a data processor comprising: a transformation engine designed and configured to perform, in a ring of characteristic 2, computation of X^254 by performing a series of:
  • elements of the ring redundantly represent elements of a field; and after at least one transformation according to (i) or (ii) the result representing a field element Z is replaced with one of the redundant representations of a same element Z of the field.
  • the redundant representation of a same element Z of the field is chosen randomly.
  • the ring includes a field GF(2^8) as a subring.
  • a method of improving performance of a data processor including: in a ring of characteristic 2, computing X^254 by performing a series of:
  • elements of the ring redundantly represent elements of a field and after at least one transformation according to (i) or (ii) the result representing a field element Z is replaced with one of the redundant representations of a same element Z of the field.
  • the redundant representation of a same element Z of the field is chosen randomly.
  • the ring includes a field GF(2^8) as a subring.
  • an intellectual property (IP) core including: circuitry that improves performance of a data processor by: in a ring of characteristic 2, computing X^254 by performing a series of:
  • elements of the ring redundantly represent elements of a field and after at least one transformation according to (i) or (ii) the result representing a field element Z is replaced with one of the redundant representations of a same element Z of the field.
  • the redundant representation of a same element Z of the field is chosen randomly.
  • the ring includes a field GF(2^8) as a subring.
  • an intellectual property (IP) core including: circuitry that improves performance of a data processor by: in a ring of characteristic 2, computing X^254 by performing a series of:
  • elements of the ring redundantly represent elements of a field and after at least one transformation according to (i) or (ii) the result representing a field element Z is replaced with one of the redundant representations of a same element Z of the field.
  • the redundant representation of a same element Z of the field is chosen randomly.
  • the ring includes a field GF(2^8) as a subring.
  • a semiconductor intellectual property (IP) core including a transformation engine designed and configured to perform on elements of a finite ring R represented as GF(p)[x]/(PQ) a sequence of operations comprising at least one member selected from the group consisting of multiplication and raising to an integer power; wherein p is a prime number; wherein P is a polynomial of degree n irreducible over GF(p); wherein Q is a polynomial of degree d over GF(p); wherein elements of R redundantly represent elements of a finite field GF(p^n) represented as GF(p)[x]/(P); and wherein a result (Z) of at least one of the operations in the sequence is replaced with an element (Z*) of R such that Z and Z* redundantly represent a same element of F.
  • a method of building different representations of a finite ring R represented as GF(p)[x]/(PQ) implemented by logic circuitry including: a sequence of operations comprising at least one member selected from the group consisting of multiplication and raising to an integer power; wherein p is a prime number; wherein P is a polynomial of degree n irreducible over GF(p); wherein Q is a polynomial of degree d over GF(p); wherein elements of R redundantly represent elements of a finite field GF(p^n) represented as GF(p)[x]/(P); and wherein a result (Z) of at least one of the operations in the sequence is replaced with an element (Z*) of R such that Z and Z* redundantly represent a same element of F.
  • a method for testing for vulnerability of an implementation of a hash-based message authentication code (HMAC) algorithm to a side-channel attack can include mounting a template attack on a hash function used to implement the HMAC algorithm.
  • the template attack can include generating, based on first side-channel leakage information associated with execution of the hash function, a plurality of template tables, each template table of the plurality corresponding, respectively, with a subset of bit positions of an internal state of the hash function.
  • the template attack can further include generating, based on second side-channel leakage information, a plurality of hypotheses for an internal state of an invocation of the hash function based on a secret key.
  • the method can further include generating, using the hash function, respective hash values generated from each of the plurality of hypotheses and a message.
  • the method can also include comparing each of the respective hash values with a hash value generated using the secret key and the message.
  • the method can still further include, based on the comparison, determining vulnerability of the HMAC algorithm implementation based on a hash value of the respective hash values matching the hash value generated using the secret key and the message.
  • Implementations can include one or more of the following features.
  • the implementation of the HMAC algorithm can be one of a hardware implementation, a software implementation, or a simulator implementation.
  • One round of a compression function of the hash function can be calculated per calculation cycle of the hash function.
  • a plurality of rounds of a compression function of the hash function can be calculated per calculation cycle of the hash function.
  • Each template table of the plurality of template tables can include a plurality of rows that are indexed using values of bits of the respective subset of bit positions.
  • the rows can include respective side-channel leakage information of the first side-channel leakage information associated with the index values.
  • Generating the template tables can include normalizing a value of the respective side- channel information based on an average value of a plurality of values of the respective side-channel leakage information.
  • the plurality of rows of the template tables can be further indexed using at least one of carry bit values corresponding with the subset of bits of the internal state of the hash function, or bit values of a portion of a message schedule used to calculate the hash function.
  • Collecting the first side-channel leakage information can include executing the hash function using a known message schedule as a first input block of the hash function.
  • the first side-channel leakage information can be collected based on a Hamming distance model.
  • Each subset of bit positions of the internal state can include a respective two-bit subset of each word of the internal state of the hash function.
  • the hash function can be a hash function of the Secure Hash Algorithm 2 (SHA-2) standard.
  • Each template table of the plurality of template tables further corresponds with a respective execution round of a compression function of the hash function.
  • Determining each hypothesis of the plurality of hypotheses can include determining values of respective subsets of bits of the internal state of the hash function in correspondence with the plurality of the template tables.
  • the hash function can be implemented in hardware.
  • One execution round of a compression function of the hash function can be completed in one clock cycle of the hardware implementation.
  • Multiple execution rounds of a compression function of the hash function can be completed in one clock cycle of the hardware implementation.
  • the first side-channel leakage information and the second side-channel leakage information can include at least one of respective power consumption over time, electromagnetic emissions over time, or cache miss patterns.
  • a method of forging a hash-based message authentication code can include collecting, while executing an implementation of a hash function used to produce the HMAC, first side-channel leakage information corresponding with overwriting values of an internal state of the hash function.
  • the method can also include generating a plurality of template tables, each template table corresponding, respectively, with a subset of bits of the internal state of the hash function.
  • Each template table of the plurality of template tables can include rows that are indexed using values of the respective subset of bits.
  • the rows can include respective side-channel leakage information of the first side-channel leakage information associated with the index values.
  • the method can also include collecting second side-channel leakage information associated with producing the HMAC, and identifying, based on comparison of the second side-channel leakage information with the plurality of template tables, a first plurality of hypotheses for an internal state of an inner invocation of the hash function.
  • the method can still further include identifying, based on comparison of the second side-channel leakage information with the plurality of template tables, a second plurality of hypotheses for an internal state of an outer invocation of the hash function.
  • the method can also include selecting, using pairs of hypotheses each including a hypothesis of the first plurality of hypotheses and a hypothesis of the second plurality of hypotheses, a first hypothesis of the first plurality of hypotheses and a second hypothesis of the second plurality of hypotheses for forging the HMAC.
  • Implementations can include one or more of the following features.
  • generating the template tables can include normalizing a value of the respective side-channel information based on an average value of a plurality of values of the respective side-channel leakage information.
  • Collecting the first side-channel leakage information can include executing a single invocation of the hash function using a known message schedule as a first input block of the hash function.
  • the first side-channel leakage information can be collected based on a Hamming distance model.
  • the template tables can be further indexed using at least one of carry bit values corresponding with the subset of bits of the internal state of the hash function, or bit values of a portion of a message schedule used to calculate the hash function.
  • the subset of bits of the internal state can include respective two-bit subsets of each word of the internal state of the hash function.
  • the hash function can be a hash function of the Secure Hash Algorithm 2 (SHA-2) standard.
  • the hash function can be implemented in hardware.
  • One execution round of a compression function of the hash function can be completed in one clock cycle of the hardware implementation.
  • Multiple execution rounds of a compression function of the hash function can be completed in one clock cycle of the hardware implementation.
  • the hash function can be implemented in software.
  • Selecting the first hypothesis of the first plurality of hypotheses and the second hypothesis of the second plurality of hypotheses for forging the HMAC can include performing a brute force attack.
  • the first side-channel leakage information and the second side-channel leakage information can include at least one of respective power consumption over time, electromagnetic emissions over time, or cache miss patterns.
  • FIG. 1 is a schematic representation of an IP core according to some exemplary embodiments
  • FIG. 2 is a simplified flow diagram of a method according to some exemplary embodiments
  • FIG. 3 is a schematic representation of operations performed by a transformation engine of an IP core according to some exemplary embodiments.
  • FIG. 4 is a schematic representation of operations performed by a transformation engine of an IP core according to some exemplary embodiments.
  • FIG. 5 is a flowchart illustrating a method for testing an HMAC implementation for vulnerability to a side-channel attack according to an aspect.
  • FIG. 6 is a flowchart illustrating an example method for performing a side-channel template attack on an HMAC implementation (e.g., hardware, software, simulation, etc.) according to an aspect.
  • FIG. 7 is a block diagram illustrating an experimental setup for performing side-channel template attacks and associated vulnerability testing on an HMAC implementation according to an aspect.
  • FIG. 8 is a diagram illustrating a SHA-256 algorithm block diagram according to an aspect.
  • FIG. 9 is a diagram schematically illustrating three execution rounds of a compression function of a SHA-256 hash function according to an aspect.
  • FIGs. 10A and 10B are diagrams illustrating operation of an adder used to build template tables for use in a side-channel attack according to an aspect.
  • FIG. 11 is a graph illustrating standard deviations between trace samples according to an aspect.
  • Some embodiments described herein relate to methods and/or IP cores that contribute to improvement in speed and/or performance of a data processor. Specifically, some embodiments can be used to improve performance of AES calculations.
  • FIG. 1 is a schematic representation of a semiconductor intellectual property (IP) core indicated generally as 100.
  • Depicted exemplary IP core 100 includes a transformation engine 110 designed and configured to generate redundant representations of each element 121 of a field GF(2^8) 120 using a Galois Field polynomial 140 of degree no higher than 7 + d. Redundant representations of a polynomial X are generated by adding a product CP, where P is a fixed irreducible polynomial of degree 8 and C is a polynomial of degree less than d, where d > 0 is a redundancy parameter, and the redundant representations belong to a ring.
  • transformation engine 110 is designed and configured to represent each element 121.
  • transformation engine 110 represents a same field element in one of 2^d ways (pairwise differing by terms that are multiples of P(x)), and at one or more moments of the calculations randomly replaces a redundant representation of a field element Z with a redundant representation chosen out of the 2^d representations of the field element Z. In some embodiments the choice out of the 2^d representations is performed randomly.
  • the transformation engine transforms it to one of elements 130_1 to 130_(2^d), each one of which redundantly represents said element of field 120.
  • for example, with d = 24 the ring 150 will include 4,294,967,296 elements 130, with 16,777,216 elements 130 corresponding to each of elements 121.
  • transformation engine 110 employs d > 5; d > 6; d > 7; d > 8; d > 9; d > 12; d > 14; d > 16; d > 18; d > 20; d > 24; d > 32; d > 48; or intermediate or greater values of d.
  • transformation engine 110 represents a same field element 121 in one of 2^d ways (pairwise differing by terms that are multiples of P(x)), and at each moment of the calculations chooses any of the various representations.
  • each of elements 121 of field 120 includes a byte of data within a block of a block cipher or a cryptographic key.
  • the block cipher is selected from the group consisting of AES, SM4, and ARIA.
  • transformation engine 110 computes X^Y by performing a series of: (i) multiplications of two different elements of the field; and (ii) raising an element of the field to a power Z wherein Z is a power of 2.
  • after at least one transformation according to (i) or (ii) the result representing a field element Z is replaced with one of 2^d redundant representations of the field element Z. In some embodiments the replacement is done with one out of the 2^d representations chosen randomly.
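  • By way of a non-authoritative illustration of the redundant representations described above, the following Python sketch packs GF(2) polynomials into integers and works in the ring GF(2)[x]/(P*Q). The choices P(x) = x^8 + x^4 + x^3 + x + 1 (the AES polynomial), d = 8, Q(x) = x^8 + 1, and all function names are assumptions made for the sketch rather than details of the described embodiments.

    import secrets

    # Sketch only: redundant GF(2^8) arithmetic in the ring GF(2)[x]/(P*Q),
    # with polynomials over GF(2) packed into Python integers.
    P = 0x11B          # x^8 + x^4 + x^3 + x + 1 (assumed field polynomial)
    D = 8              # redundancy parameter d (assumed value)
    Q = (1 << D) | 1   # Q(x) = x^8 + 1, an arbitrary degree-d polynomial (assumed)

    def poly_mul(a, b):
        """Carry-less multiplication of two GF(2)[x] polynomials."""
        r = 0
        while b:
            if b & 1:
                r ^= a
            a <<= 1
            b >>= 1
        return r

    def poly_mod(a, m):
        """Remainder of a modulo m in GF(2)[x]."""
        dm = m.bit_length() - 1
        while a.bit_length() - 1 >= dm:
            a ^= m << (a.bit_length() - 1 - dm)
        return a

    PQ = poly_mul(P, Q)   # ring modulus of degree 8 + d

    def randomize(z):
        """Replace a ring element z by another representative of the same
        field element: z* = z + C*P for a random C of degree less than d."""
        return poly_mod(z ^ poly_mul(secrets.randbits(D), P), PQ)

    def ring_mul(a, b):
        """Multiply in the ring and re-randomize the result."""
        return randomize(poly_mod(poly_mul(a, b), PQ))

    def to_field(z):
        """Project a redundant representation back to GF(2^8)."""
        return poly_mod(z, P)

    # Two randomized encodings still multiply to the correct field element
    # ({57} * {83} = {c1} in the AES field).
    x, y = randomize(0x57), randomize(0x83)
    assert to_field(ring_mul(x, y)) == poly_mod(poly_mul(0x57, 0x83), P) == 0xC1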
  • FIG. 2 is a simplified flow diagram of a method of building different representations of the Galois Field (GF), indicated generally as 200, according to some exemplary embodiments.
  • Depicted exemplary method 200 is implemented by logic circuitry and comprises redundantly representing 210 each element of a field GF(2^8) using a polynomial of degree no higher than 7 + d, where d > 0 is a redundancy parameter, to generate redundant representations belonging to a ring.
  • method 200 includes representing a same field element in one of 2^d ways (pairwise differing by terms that are multiples of P(x)), and at one or more moments of the calculations replacing a value representing a field element Z with any of the various representations of the field element Z.
  • the replacing is performed with a redundant representation of Z chosen randomly.
  • method 200 employs d > 5 (220); d > 6; d > 7; d > 8 (230); d > 9; d > 12; d > 14; d > 16; d > 18; d > 20; d > 24; d > 32; d > 48; or intermediate or greater values of d.
  • method 200 includes representing 240 a same field element in one of 2^d ways (pairwise differing by terms that are multiples of P(x)), and at each moment of the calculations choosing any of the various representations.
  • the element of a field includes 250 a byte of data within a block of a block cipher or a cryptographic key. In actual practice, transformation using the polynomial is only conducted on those elements of the field, or portions thereof, which are being used in calculations.
  • the block cipher is selected from the group consisting of AES, SM4, and ARIA.
  • method 200 includes computing X^Y by performing a series of: (i) multiplications of two different elements of the field; and (ii) raising an element of the field to a power Z wherein Z is a power of 2.
  • the number of multiplications (i) is at least two less than the number of ones (1s) in the binary representation of Y.
  • the result representing a field element Z is replaced with one of the redundant representations of the field element Z.
  • Z is replaced with a redundant representation chosen at random.
  • Y = 254.
  • a number of multiplications (i) is 4 or less.
  • Rijndael (AES) is presented here as an example.
  • the common block ciphers SM4 and ARIA are very similar to Rijndael.
  • Rijndael operates on blocks that are 128-bits in length. There are actually three variants of the Rijndael cipher, each of which uses a different key length. The permissible key lengths are 128, 192, and 256 bits. Even a 128-bit key is large enough to prevent any exhaustive search. Of course, a large key is no good without a strong design.
  • the fundamental unit operated upon is a byte, that is, 8 bits.
  • Bytes are thought of in two different ways in Rijndael. Let the byte be given in terms of its bits as b7, b6, ..., b0.
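  • The second of the two ways is the standard polynomial view used by Rijndael (FIPS 197), in which the bits b7 ... b0 are read as the coefficients of a degree-7 polynomial. A minimal Python sketch (illustrative only; the function name is assumed) renders that view:

    def byte_to_poly(b):
        """Render a byte b7..b0 as the polynomial b7*x^7 + ... + b1*x + b0."""
        terms = ["x^%d" % i if i > 1 else ("x" if i == 1 else "1")
                 for i in range(7, -1, -1) if (b >> i) & 1]
        return " + ".join(terms) if terms else "0"

    print(byte_to_poly(0x57))   # x^6 + x^4 + x^2 + x + 1 (the FIPS 197 example byte {57})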
  • a method of improving performance of a data processor includes: in a ring of characteristic 2, computing X^254 by performing a series of:
  • elements of the ring redundantly represent elements of a field, and after at least one transformation according to (i) or (ii) the result representing a field element Z is replaced with one of the redundant representations of a same element Z of the field. In some embodiments replacement is with a redundant representation of Z chosen at random.
  • the ring includes a field GF(2^8) as a subring. Embodiments which employ this method shorten the critical path in a hardware implementation of raising to the power of 254 in GF(2^8). In some embodiments this shortening of the critical path contributes to an increase in the frequency at which such a design can be used.
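  • As a non-authoritative illustration of computing X^254 from multiplications and raisings to powers of 2, the following Python sketch uses one well-known 4-multiplication addition chain over GF(2^8) with the AES polynomial (an assumed choice). This chain uses 4 multiplications with a sequential multiplication depth of 3 (the multiplication producing x^14 lies off the critical multiplication path); it is illustrative only and is not necessarily the chain that also meets the bound on sequentially executed linear transformations recited above.

    P = 0x11B   # x^8 + x^4 + x^3 + x + 1 (assumed field polynomial)

    def gf_mul(a, b):
        """Multiply in GF(2^8) = GF(2)[x]/(P)."""
        r = 0
        for _ in range(8):
            if b & 1:
                r ^= a
            b >>= 1
            a <<= 1
            if a & 0x100:
                a ^= P
        return r

    def gf_pow2k(a, k):
        """Raise to the power 2^k (a linear transformation in characteristic 2)."""
        for _ in range(k):
            a = gf_mul(a, a)
        return a

    def power_254(x):
        a = gf_pow2k(x, 1)   # x^2    (linear)
        b = gf_mul(x, a)     # x^3    (multiplication 1)
        c = gf_pow2k(b, 2)   # x^12   (linear)
        d = gf_mul(b, c)     # x^15   (multiplication 2)
        f = gf_mul(a, c)     # x^14   (multiplication 3, independent of d)
        e = gf_pow2k(d, 4)   # x^240  (linear)
        return gf_mul(e, f)  # x^254 = x^240 * x^14 (multiplication 4)

    # Sanity check: x * x^254 == 1 for every nonzero x, i.e. x^254 is the inverse.
    assert all(gf_mul(x, power_254(x)) == 1 for x in range(1, 256))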
  • a method of improving performance of a data processor includes: in a ring of characteristic 2, computing X^254 by performing a series of:
  • Embodiments which employ this method shorten the critical path in a hardware implementation of raising to the power of 254 in GF(2^8). In some embodiments this shortening of the critical path contributes to an increase in the frequency at which such a design can be used.
  • a semiconductor intellectual property (IP) core for improving performance of a data processor.
  • the IP core includes a transformation engine designed and configured to perform, in a ring of characteristic 2, computation of X^254 by performing a series of:
  • elements of the ring redundantly represent elements of a field, and after at least one transformation according to (i) or (ii) the result representing a field element Z is replaced with one of the redundant representations of a same element Z of the field.
  • the redundant representation of a same element Z of the field is chosen randomly.
  • the ring includes a field GF(2^8) as a subring.
  • a method of improving performance of a data processor includes: in a ring of characteristic 2, computing X^254 by performing a series of:
  • elements of the ring redundantly represent elements of a field, and after at least one transformation according to (i) or (ii) the result representing a field element Z is replaced with one of the redundant representations of a same element Z of the field. In some embodiments the redundant representation of a same element Z is chosen randomly.
  • the ring includes a field GF(2^8) as a subring.
  • Embodiments which employ this method give a slightly shorter critical path than the method described immediately above, because 1 rather than 2 linear transformations are performed sequentially. However, use of this method increases the gate count relative to the method described immediately above (more multiplications and more linear transformations).
  • Embodiments which employ this method shorten the critical path in a hardware implementation of raising to the power of 254 in GF(2^8). In some embodiments this shortening of the critical path contributes to an increase in the frequency at which such a design can be used.
  • an intellectual property (IP) core comprising: circuitry that improves performance of a data processor by: in a ring of characteristic 2, computing X^254 by performing a series of:
  • elements of the ring redundantly represent elements of a field and after at least one transformation according to (i) or (ii) the result representing a field element Z is replaced with one of the redundant representations of a same element Z of the field.
  • the redundant representation of a same element Z of the field is chosen randomly.
  • the ring includes a field GF(2^8) as a subring.
  • an exemplary IP core for improvement of exponentiation algorithm in redundant AES calculation can be provided.
  • an intellectual property (IP) core comprising: circuitry that improves performance of a data processor by: in a ring of characteristic 2, computing X^254 by performing a series of:
  • elements of the ring redundantly represent elements of a field and after at least one transformation according to (i) or (ii) the result representing a field element Z is replaced with one of the redundant representations of a same element Z of the field.
  • the redundant representation of a same element Z of the field is randomly chosen.
  • the ring includes a field GF(2^8) as a subring.
  • FIG. 3 is a schematic representation, indicated generally as 300, of operations performed by a transformation engine of an IP core according to some exemplary embodiments.
  • FIG. 4 is a schematic representation, indicated generally as 400, of operations performed by a transformation engine of an IP core according to some exemplary embodiments.
  • a transformation engine designed and configured to perform on elements (312 and 412 in FIG. 3 and FIG. 4 respectively) of a finite ring R (310 and 410 in FIG. 3 and FIG. 4 respectively) represented as GF(p)[x]/(PQ) a sequence of operations comprising at least one member selected from the group consisting of multiplication (FIG. 3) and raising to an integer power (FIG. 4);
  • wherein p is a prime number; wherein P is a polynomial of degree n irreducible over GF(p); wherein Q is a polynomial of degree d over GF(p); wherein elements of R redundantly represent elements of a finite field GF(p^n) represented as GF(p)[x]/(P); and wherein a result (Z) (314 and 414 respectively) of at least one of said operations in said sequence is replaced with an element (Z*) (316 and 416 respectively) of R such that Z and Z* redundantly represent a same element of F.
  • the element Z* is chosen randomly.
  • p = 2.
  • n = 8.
  • any element of R represents an element A mod P of F.
  • the sequence of operations takes an element X of R representing an element Y of F and calculates an element Z of R representing an element Y^n of F.
  • the sequence of operations consists only of multiplications and raisings to the powers of 2^k.
  • n = 254.
  • FIG. 3 depicts calculation of a product of two elements A and B in the ring R, which represent H(A) and H(B) respectively.
  • FIG. 4 depicts calculation of raising an element A in the ring R, which represents H(A), to a power n.
  • the IP core performs a method of building different representations of a finite ring R represented as GF(p)[x]/(PQ) implemented by logic circuitry.
  • the method includes a sequence of operations with at least one member selected from the group consisting of multiplication and raising to an integer power; wherein p is a prime number; wherein P is a polynomial of degree n irreducible over GF(p); wherein Q is a polynomial of degree d over GF(p); wherein elements of R redundantly represent elements of a finite field GF(p^n) represented as GF(p)[x]/(P); and wherein a result (Z) of at least one of said operations in said sequence is replaced with an element (Z*) of R such that Z and Z* redundantly represent a same element of F.
  • the sequence of operations takes an element X of R representing an element Y of F and calculates an element Z of R representing an element Y^n of F.
  • the sequence of operations consists only of multiplications and raisings to the powers of p^k.
  • n = 254.

Side-Channel Attack on HMAC-SHA-2 and Associated Testing
  • Some embodiments described in this disclosure are directed to approaches for side-channel attacks on cryptographic algorithms. More specifically, this disclosure describes implementations of side-channel attacks on hash-based message authentication code (HMAC) implementations (e.g., with respect to FIGs. 5 to 11), as well as testing of HMAC implementations for vulnerability to such side-channel attacks, where such testing can be implemented using the approaches for mounting side-channel attacks described herein.
  • the example implementations described herein include performing a template attack on an HMAC implementation.
  • Side-channel attacks are a class of attacks that can be used to expose secret information (e.g., secret keys, key derivatives, etc.) of cryptographic algorithms by observing side effects of algorithm execution. For instance, such secret information can be leaked (e.g., determined) from various channels during algorithm execution. For instance, such channels can include execution timing, electromagnetic emanation, cache miss patterns, exotic channels such as acoustics, and so forth.
  • power side-channel attacks of different types such as simple power analysis (SPA), differential power analysis (DPA) and correlation power analysis (CPA) remain the most prevalent forms of side-channel attacks used to attack cryptographic algorithms.
  • the HMAC algorithm, as shown in Equation (1) below, includes two invocations of an underlying hash function on a secret key.
  • K0 is a known function of a secret key K
  • M is an input message
  • ipad and opad are known constants.
  • the two Hash invocations are referred to as an inner hash and an outer hash, where the variable part of the inputs to the outer hash is an output of the inner hash.
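  • The HMAC construction referred to as Equation (1) is the standard one (FIPS 198-1 / RFC 2104): HMAC(K, M) = Hash((K0 ⊕ opad) || Hash((K0 ⊕ ipad) || M)). A short Python sketch of this construction for Hash = SHA-256 (64-byte block size), cross-checked against the standard library, is shown below; it restates the public standard rather than any implementation particular to this disclosure.

    import hashlib, hmac

    BLOCK = 64   # SHA-256 block size in bytes

    def hmac_sha256(key: bytes, msg: bytes) -> bytes:
        # K0: the key, hashed if longer than a block, then zero-padded to the block size.
        k0 = hashlib.sha256(key).digest() if len(key) > BLOCK else key
        k0 = k0.ljust(BLOCK, b"\x00")
        ipad = bytes(b ^ 0x36 for b in k0)
        opad = bytes(b ^ 0x5C for b in k0)
        inner = hashlib.sha256(ipad + msg).digest()    # inner hash
        return hashlib.sha256(opad + inner).digest()   # outer hash

    key, msg = b"secret key", b"message"
    assert hmac_sha256(key, msg) == hmac.new(key, msg, hashlib.sha256).digest()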
  • a round function contains a single addition operation involving an input word.
  • the result of this addition is stored in a state register, which can be used as a target for a power correlation attack.
  • a round function contains two addition operations involving the input word performed in parallel, where the results are sampled in two different sub-words (e.g., A and E as described in the SHA-2 standard) of an internal state register. Because it is difficult to separate the respective side-channel leakages from the two additions that are executed in parallel, a naive power analysis attack is highly unlikely to be successful.
  • the approaches for mounting a side-channel attack on HMAC include performing a profiling (learning) stage. Without such a profiling stage, it is very difficult to apply a successful DPA/CPA attack on the HMAC-SHA-2 cryptographic algorithm due, at least, to the following two factors.
  • in a side-channel attack on the inner hash, only a derivative of the key can be found rather than the key itself; therefore, the outer hash must be attacked as well. However, the outer hash can only be attacked using known messages, as opposed to chosen messages.
  • the first factor above is important because a correlation-based attack model assumes control/knowledge of the data and a constant key.
  • the first invocation of the SHA-2 compression function works with a constant string (K0 ⊕ ipad)
  • the second invocation mixes a result of the first invocation with a message (or with an initial part of a message). Accordingly, a successful attack on the second invocation will reveal Hash(K0 ⊕ ipad), but not the key itself. Therefore, Hash(K0 ⊕ opad) must be derived separately.
  • because the input to the outer hash is an output of the inner hash, an adversary can possess knowledge of, but not control of, the inner hash result.
  • the approaches for mounting a template attack disclosed herein include performing power analysis of SHA-2 that can incrementally reveal/determine, e.g., in subsets of 1, 2 or 3 bits, respective internal states (e.g., secret internal states) of the inner and outer hash function invocations, which can be, for example, pseudo-random inputs. That is, because a DPA-type analysis is difficult as explained above, the disclosed approaches for obtaining secrets from an HMAC-SHA-2 implementation include performing power analysis by profiling the underlying hash function, or mounting a template attack as described herein.
  • the attack should be performed using an implementation of a target device, or an implementation of a device very similar to the target device, that can be operated with known data (e.g., a known message).
  • the profiling stage can be performed once.
  • the following attack stage can, using a smaller number of traces, then be performed on the profiled device or like devices using template tables that are constructed during the profiling using the approaches described herein.
  • template tables can be constructed using a Multivariate Gaussian Model to build the templates.
  • a maximum likelihood approach can be used to match the power traces collected during the attack phase to the template tables.
  • the approaches for template attacks described herein can be based on the described template tables and Euclidean distance for matching.
  • the addition operation discussed above is split into 2-bit slices that include carry-in and carry-out bits, and, for each slice, a power profile is built.
  • the attack works in successive iterations, matching the slices starting from the least significant and, for each iteration, going to the following slice using the calculated carry-in from the previous iteration.
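  • A rough sketch of that slice-by-slice loop is shown below; it recovers a secret 32-bit addend two bits at a time, re-deriving the carry-in of each slice from the slices already recovered. The function template_distance(), the parameter names, and the scoring convention (smaller is better) are placeholders assumed for the sketch, not elements of any particular tool.

    def carry_into(x, y, i):
        """Carry bit into bit position i when adding x and y."""
        return (((x & ((1 << i) - 1)) + (y & ((1 << i) - 1))) >> i) & 1

    def recover_addend(w, traces, template_distance, slice_bits=2):
        """Recover delta such that the device computed (delta + w) mod 2^32,
        testing 2^slice_bits hypotheses per slice against the template tables."""
        delta = 0
        for k in range(0, 32, slice_bits):
            carry_in = carry_into(delta, w, k)   # carry from the slices already matched
            best_guess, best_score = 0, None
            for guess in range(1 << slice_bits):
                score = template_distance(traces, k, guess, carry_in)
                if best_score is None or score < best_score:
                    best_guess, best_score = guess, score
            delta |= best_guess << k
        return delta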
  • an underlying SHA-2 function should be directly accessible without use of the associated HMAC wrapper. That is, the SHA-2 function should be configured to be invoked independently.
  • FIG. 5 is a flowchart illustrating a method 500 for testing an HMAC implementation for vulnerability to a side-channel attack according to an aspect. That is, FIG. 5 illustrates an example method for testing susceptibility of an HMAC implementation and/or a hash function primitive (e.g., a SHA-2 hash function) of an HMAC implementation to a side-channel attack (e.g., a template attack).
  • the method of FIG. 5 can be implemented using the approaches for mounting a template attack described herein.
  • the method 500 of FIG. 5 is provided by way of example and for purposes of illustration, and other methods for testing HMAC and/or SHA-2 implementations for such vulnerability using the disclosed approaches are possible. For purposes of brevity and clarity, some details of the disclosed template attack approaches are not described with respect to FIG. 5, but are, instead, described below.
  • the example method 500 of FIG. 5 includes, as noted above, mounting a template attack on a hash function used to implement an HMAC algorithm and/or on the HMAC implementation.
  • the template attack includes, at block 510, generating, based on first side-channel leakage information associated with execution of the hash function (e.g., when executing a set of profiling vectors), a plurality of template tables.
  • Each template table of the plurality of template tables can correspond, respectively, with a subset of bit positions of an internal state of the hash function.
  • the method 500 includes generating, based on second side-channel leakage information (e.g., when executing a set of attack vectors on the HMAC implementation), a plurality of hypotheses for an internal state of an invocation of the hash function based on a secret key.
  • the method 500 includes generating, using the hash function, respective hash values generated from each of the plurality of hypotheses and a message and, at block 540, comparing each of the respective hash values with a hash value generated using the secret key and the message.
  • the method 500 includes, based on the comparison, determining vulnerability of the HMAC algorithm based on a hash value of the respective hash values matching the hash value generated using the secret key and the message. That is, if a calculated hash value matches the generated hash value, the HMAC implementation is considered to be vulnerable to side-channel attacks.
  • FIG. 6 is a flowchart illustrating a method 600 for mounting a template attack on an HMAC implementation according to an aspect.
  • the template attack of FIG. 6, which can be implemented using the approaches described herein, can be the basis of testing for side-channel attacked vulnerability, such as using the method 500 of FIG. 5, or can be implemented in other ways and/or in other applications.
  • FIG. 6 is provided by way of example and for purposes of illustration. That is, other methods for performing (mounting, executing, implementing, etc.) a template attack on a given HMAC implementation using the approaches described herein are possible.
  • some details of the disclosed template attack approaches are not described with respect to FIG. 6, but are, instead, described below.
  • the example method 600 of FIG. 6 includes, at block 610, collecting, while executing a hash function used to produce the HMAC (e.g., using profiling vectors), first side-channel leakage information corresponding with overwriting values of an internal state of the hash function.
  • the first side-channel information can be based on a Hamming distance model.
  • the method 600 includes generating a plurality of template tables, each template table corresponding, respectively, with a subset of bits of the internal state of the hash function.
  • Each template table of the plurality of template tables at block 620 can include rows that are indexed using values of the respective subset of bits.
  • the rows of the template table can include respective side- channel leakage information of the first side-channel leakage information that is associated with the index values.
  • the method 600 includes collecting second side-channel leakage information associated with producing the HMAC (e.g., using a set of attack vectors).
  • the method 600 includes identifying (or selecting), based on comparison of the second side-channel leakage information with the plurality of template tables, a first plurality of hypotheses for an internal state of an inner invocation of the hash function.
  • In the example of FIG. 6, the method 600 further includes, at block 650, identifying, based on comparison of the second side-channel leakage information with the plurality of template tables, a second plurality of hypotheses for an internal state of an outer invocation of the hash function and, at block 660, selecting, using pairs of hypotheses each including a hypothesis of the first plurality of hypotheses and a hypothesis of the second plurality of hypotheses, a first hypothesis of the first plurality of hypotheses and a second hypothesis of the second plurality of hypotheses for forging the HMAC.
  • the operation at block 660 can include performing a brute force attack using the hypotheses identified at blocks 640 and 650 to identify the correct hypotheses for respective internal states (e.g., based on a respective secret key) for an inner SHA invocation and an outer SHA invocation of the HMAC implementation being attacked.
  • FIG. 7 is a block diagram schematically illustrating an experimental setup 700 for performing side-channel template attacks and associated vulnerability testing on an HMAC implementation according to an aspect.
  • the experimental setup (setup) 700 of FIG. 7 is given by way of example and for purposes of illustration. Additional example details of experimental setups are described below.
  • the setup includes external data 710 that is applied to an HMAC implementation 720.
  • the external data 710 can include, for example, learning or profiling vectors and attack vectors, as well as other data used for performing a template attack, such as the various parameters described herein.
  • the HMAC implementation 720 of FIG. 7 includes a secret key (K) 722 and a hash function 724, which can be used in implementing an HMAC construction in accordance with Equation 1 presented above.
  • the hash function 724, which is described herein, by way of example, as a SHA-2 (e.g., SHA-256) hash function implementation, should be invokable independently of the HMAC implementation 720.
  • the setup 700 also includes a side-channel leakage measurement device or block 730, which is configured to collect side-channel leakage information associated with executing the hash function 724 using the learning or profiling vectors, and to collect side-channel information associated with executing the HMAC implementation 720 using the attack vectors.
  • the side-channel leakage measurement 730 can be configured to provide side-channel leakage information (e.g., associated with the learning vectors) to a profiling module 740, which can be configured to generate template tables 750, such as those described herein.
  • the side-channel leakage measurement 730 can be further configured to provide side-channel leakage information (e.g., associated with the attack vectors) to an attack module 760, where the attack module can be configured to perform a multi-step attack on the HMAC implementation 720, such as using the techniques described herein.
  • FIG. 8 is a block diagram illustrating a SHA-256 hash function implementation according to an aspect.
  • FIG. 9 is a diagram illustrating the specific notation for SHA-256 used herein. It is noted that, for purposes of brevity, some details of the SHA-256 hash function implementation shown in FIG. 8 not directly relevant to the disclosed approaches may not be specifically described herein.
  • In FIG. 8, an execution flow 800 for a SHA-256 hash function is shown.
  • a message 810 (e.g., of arbitrary length)
  • the pre-processing stage 820 generates a message schedule 830 based on 512-bit chunks or blocks.
  • the message schedule 830, which is generated by expanding a corresponding 512-bit block, can then be output, as sixty-four (64) 32-bit words, to 64 respective compression function stages (stage 0 to stage 63), of which compression stage 0 840, compression stage 1 850, and compression stage 63 860 are shown.
  • the compression function stages can also be referred to as rounds (calculation rounds).
  • FIG. 8 also illustrates a detailed diagram of two 256-bit wide compression stages of the illustrated SHA-256 hash function (e.g., compression stages 840, 850).
  • the compression function CF(Sj, Blj) for SHA-2 hash functions is calculated in the following steps (as is shown for SHA-256 in FIG. 8):
  • the internal state (initial internal state) Ri-1 is split into eight t-bit words Ai-1, Bi-1, Ci-1, Di-1, Ei-1, Fi-1, Gi-1, Hi-1.
  • a next internal state Ri is calculated from Ri-1
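  • For concreteness, one round of the SHA-256 compression function (FIPS 180-4) can be sketched in Python as below; only the new A and E words involve fresh computation, and the message-schedule word w enters through a single addition, which is the behavior targeted by the attack described later. The example input values are illustrative only.

    M32 = 0xFFFFFFFF

    def rotr(x, n):
        return ((x >> n) | (x << (32 - n))) & M32

    def sha256_round(state, k, w):
        """One compression round: state is (A, B, C, D, E, F, G, H), k the round
        constant and w the message-schedule word for this round."""
        a, b, c, d, e, f, g, h = state
        s1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
        ch = (e & f) ^ (~e & g)
        t1 = (h + s1 + ch + k + w) & M32            # the only term involving w
        s0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
        maj = (a & b) ^ (a & c) ^ (b & c)
        t2 = (s0 + maj) & M32
        # Every word except the new A and E is a plain copy from the previous round.
        return ((t1 + t2) & M32, a, b, c, (d + t1) & M32, e, f, g)

    # Example: the SHA-256 initial hash value, the first round constant, and an
    # arbitrary schedule word (illustrative values only).
    H0 = (0x6A09E667, 0xBB67AE85, 0x3C6EF372, 0xA54FF53A,
          0x510E527F, 0x9B05688C, 0x1F83D9AB, 0x5BE0CD19)
    print(["%08x" % word for word in sha256_round(H0, 0x428A2F98, 0)])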
  • FIG. 9 schematically illustrates, for a SHA-2 implementation using the alternate notation, an initial internal state 910 and resulting, respective internal states 920, 930, 940 for three successive rounds.
  • arrows show copy operations, where all words of a given internal state that have incoming arrows receive an exact copy of a word from the internal state of the previous round.
  • the remaining words of the internal states (without incoming arrows, or copied values) receive results of manipulated data from the previous round (e.g., newly calculated or generated words).
  • the words of the initial state R-1 910 are designated as A-1, A-2, A-3, A-4, E-1, E-2, E-3, E-4.
  • the state Ri after round i (e.g., states 920, 930, 940, and so forth) can be designated as Ai, ..., Ai-3, Ei, ..., Ei-3.
  • Equation 9 is different from T1 in Equation 2, in that the calculation of ΔEi does not include Wi (the respective portion of the message schedule for a given round) as an addend. Therefore, ΔAi and ΔEi depend on the previous state, but not on Wi. In particular, ΔA0 and ΔE0 depend only on the initial state R-1.
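  • Read together with FIGs. 10A and 10B, this is consistent with the decomposition Ai = (ΔAi + Wi) mod 2^32 and Ei = (ΔEi + Wi) mod 2^32, in which the message-schedule word Wi is the only round-dependent addend to each newly calculated state word.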
  • HMAC is a Message Authentication Code (MAC) algorithm that is based on a hash function, where an HMAC construction is defined by Equation 1 presented above.
  • derivation of a modified key K0 from a secret key K: regardless of the size of K, the size of K0 is equal to a block size of the function Hash used to implement the HMAC construction.
  • the two applications of the function Hash during the HMAC calculation can be referred to as an “inner” application or invocation and an “outer” application or invocation.
  • Hash is a function from the SHA-2 family, e.g. SHA-256
  • S_in = CF(S0, K0 ⊕ ipad); S_out = CF(S0, K0 ⊕ opad)
  • both S_in and S_out depend only on K.
  • the goal of the disclosed attack approaches is to find S_in and S_out. Since it is difficult to invert a compression function (e.g., of a SHA-2 hash function), it follows that it is difficult to derive K or K0 from S_in and S_out.
  • the various factors and values, in particular traces and input words are numbered starting from 0, with the exception being the initial words A and E of a SHA-256 internal state, which are numbered starting from -4, such as described above and shown in FIG. 9.
  • the bits in each word are also numbered starting from 0, where index 0 corresponds to the least significant bits of each word.
  • Carry(x, y, i) stands for the carry bit into the bit position i when adding x and y.
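  • As a concrete reading of this notation (a sketch; the helper name is assumed), Carry(x, y, i) can be computed from the bits of x and y below position i:

    def Carry(x, y, i):
        """Carry bit into bit position i when adding x and y."""
        return (((x & ((1 << i) - 1)) + (y & ((1 << i) - 1))) >> i) & 1

    assert Carry(0b0111, 0b0001, 3) == 1   # 7 + 1 carries into bit 3
    assert Carry(0b0100, 0b0001, 3) == 0   # 4 + 1 does not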
  • the conventional objective is to obtain, discover or derive a secret key K.
  • CF(M) is calculated using a SHA-256 engine on a variety of one-block messages M, and an associated profiling set of power traces is acquired from side-channel leakage measurements. These traces are processed to generate the template tables described below, where the template tables are then used for matching during a multi-step attack stage, as is also described below.
  • the secret key K is unknown, and the input message M is known, but not necessarily controlled by the adversary.
  • the attack stage is applied (performed, mounted, executed, etc.) twice, first on an inner hash calculation and then on an outer hash calculation.
  • a set of power traces (the attack set) is acquired for the calculation of HMACSHA256(K, M) for a variety of messages M (e.g., attack vectors).
  • it may be sufficient to record or capture only certain parts of every trace, e.g., such as respective portions corresponding to the first two rounds of the second block calculation in both the inner SHA-256 invocation and the outer SHA-256 invocation. It is noted that, because the first blocks of both the inner SHA-256 invocation and the outer SHA-256 are constant, being dependent only on the secret key K, any side-channel data corresponding to the first block bears no useful information for forging an associated HMAC signature.
  • S_in can be determined using the template tables (generated during the profiling stage) and the portions of the traces (e.g., the attack traces) corresponding to the inner SHA-256 invocation. Knowing S_in, it is possible to calculate SHA256((K0 ⊕ ipad) || M) for every trace, thus obtaining the input message to the outer SHA-256. After determining S_in, S_out is determined using the same template tables and the portions of the traces (e.g., the attack traces) corresponding to the outer SHA-256 invocation.
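  • A non-authoritative sketch of what recovering S_in and S_out buys the adversary is shown below: an HMAC-SHA-256 tag for a new short message can be computed by continuing the hash from those two states, without ever learning K or K0. The function sha256_compress(state, block) is a hypothetical stand-in for a full SHA-256 compression-function implementation (not provided here), and the sketch handles only messages short enough to fit a single padded second block.

    import struct

    def pad_as_second_block(data: bytes, total_len_bytes: int) -> bytes:
        """SHA-256 padding for data that starts at byte offset 64 of the hash input."""
        assert len(data) <= 55, "sketch handles a single padded block only"
        block = data + b"\x80" + b"\x00" * (55 - len(data))
        return block + struct.pack(">Q", total_len_bytes * 8)

    def forge_hmac_sha256(s_in, s_out, message: bytes, sha256_compress) -> bytes:
        # Inner hash: continue from S_in (the state after compressing K0 xor ipad).
        inner = sha256_compress(s_in, pad_as_second_block(message, 64 + len(message)))
        inner_digest = b"".join(struct.pack(">I", w) for w in inner)
        # Outer hash: continue from S_out (the state after compressing K0 xor opad).
        outer = sha256_compress(s_out, pad_as_second_block(inner_digest, 64 + 32))
        return b"".join(struct.pack(">I", w) for w in outer)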
  • example approaches for mounting an HMAC-SHA-2 template attack are first described presuming that a compression function of a corresponding SHA-256 implementation calculates one round (of the CF) in one clock cycle.
  • example approaches for applying the disclosed approaches to SHA-256 implementations that calculate two or three rounds-per-cycle are described.
  • a set of traces is collected (e.g., as side-channel leakage information associated with implementation of an associated hash function) and a fixed-size set of template tables is generated from the collected traces.
  • Different template tables of the generated template tables can correspond to different execution rounds and/or to different bit positions of words of a corresponding hash function’s internal state.
  • a set of all the traces can be split into a set of disjoint sets, where a given line of a respective template table can correspond to one of these sets, and can contain the corresponding traces averaged over that set.
  • These disjoint sets are characterized by values of specific bits in the SHA-256 round function calculation, as described below, as illustrated by FIGs. 10A and 10B.
  • FIG. 10A illustrates operation of an example 2-bit adder unit 1000 that can be used to build template tables for use in the side-channel attack approaches described herein, while FIG. 10B illustrates example corresponding template table entries.
  • a respective adder unit can be used to build each template table, where entries in the tables are indexed by a 12-bit vector, including the adder inputs, inclusive of carry bit(s) and a previous state of the corresponding bits of the state register.
  • FIGs. 10A and 10B illustrate an example calculation of table entries for A_i.
  • the adder unit 1000 of FIG. 10A schematically illustrates part of an addition operation of an input word W_i corresponding to a trace and a word ΔA_i (such as described herein).
  • an input word W_i 1010 contains bits 1011 at positions 2k+3 ... 2k before round i
  • a word ΔA_i 1020 contains bits 0010 at the same positions before round i.
  • a two-bit adder 1030 receives input bits 11 from the input word W_i 1010, bits 10 from the word ΔA_i 1020, and a carry bit 0 from the addition at lower bit positions.
  • a two-bit adder 1040 receives, as input bits, 10 from the input word W_i 1010, bits 00 from the word ΔA_i 1020, and the carry bit 1 from the adder 1030.
  • FIG. 10B illustrates a portion of a template table 1060, showing line indices corresponding to the example trace discussed, e.g., in a table corresponding to round i and bit positions 2k+1 ... 2k and in a table corresponding to round i and bit positions 2k+3 ... 2k+2.
  • index 1070 of the trace in the table 1060 corresponding to round i and bit positions 2k+1 ... 2k is 00 10 00 10 00 11, where (from left to right): 1) bits 00 correspond to assumed zero bits of ΔE_i,
  • bits 10 correspond to bits 10 of ΔA_i,
  • bits 00 correspond to assumed zero bits of E_{i-1},
  • bits 10 correspond to bits 10 of A_{i-1},
  • bits 00 correspond to carry bits to bit position 2k in both additions of W_i with ΔE_i and of W_i with ΔA_i, and
  • bits 11 correspond to bits 11 of W_i.
  • an index 1080 of the example trace in the table 1060 corresponding to round i and bit positions 2k+3 ... 2k+2 is 00 00 00 01 01 10, where (from left to right):
  • bits 00 correspond to assumed zero bits of ΔE_i,
  • bits 00 correspond to bits 00 of ΔA_i,
  • bits 00 correspond to assumed zero bits of E_{i-1},
  • bits 01 correspond to bits 01 of A_{i-1},
  • bits 01 correspond to carry bits to bit position 2k+2 in the additions of W_i with ΔE_i and of W_i with ΔA_i, and
  • bits 10 correspond to bits 10 of W_i.
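A possible way to assemble such a 12-bit line index for a 2-bit window k is sketched below (illustrative only; the field ordering of ΔE_i, ΔA_i, E_{i-1}, A_{i-1}, the two carry bits, and then W_i follows the example indices above, and the helper names are not taken from any particular implementation):

```python
def window(x: int, k: int) -> int:
    """Bits [2k+1:2k] of a 32-bit word."""
    return (x >> (2 * k)) & 0b11

def carry(x: int, y: int, i: int) -> int:
    """Carry bit into bit position i when adding x and y."""
    mask = (1 << i) - 1
    return ((x & mask) + (y & mask)) >> i

def table_index(dE: int, dA: int, E_prev: int, A_prev: int, W: int, k: int) -> int:
    """12-bit template-table line index for 2-bit window k (field order per FIG. 10B)."""
    fields = [
        window(dE, k),                                       # bits of Delta-E_i
        window(dA, k),                                       # bits of Delta-A_i
        window(E_prev, k),                                   # bits of E_{i-1}
        window(A_prev, k),                                   # bits of A_{i-1}
        (carry(W, dE, 2 * k) << 1) | carry(W, dA, 2 * k),    # the two carry bits into position 2k
        window(W, k),                                        # known message bits W_i[2k+1:2k]
    ]
    index = 0
    for f in fields:
        index = (index << 2) | f                             # each field is two bits wide
    return index
```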
  • the vectors A_{i-1}, ΔA_i, E_{i-1}, ΔE_i can be found or determined by splitting those vectors into windows of size J bits for different values of i.
  • a value of J determines the size of the template tables, so it should be kept reasonably small.
  • the traces will be divided into 2^(5J+2) > 2^17 groups.
  • One aim of the profiling set is to characterize a part of the side-channel information corresponding to the overwriting of A_{i-1} and E_{i-1}.
  • the calculation is split into two-bit units, indexed by k. For this purpose, for every k, where 0 ≤ k < 16, and for some value(s) of the round index i, corresponding traces can be split into 2^12 groups, according to the values of the following bits: 1) two bits ΔE_i[2k+1:2k]; 2) two bits ΔA_i[2k+1:2k]; 3) two bits E_{i-1}[2k+1:2k]; 4) two bits A_{i-1}[2k+1:2k]; 5) the carry bit Carry(W_i, ΔE_i, 2k); 6) the carry bit Carry(W_i, ΔA_i, 2k); and 7) two bits W_i[2k+1:2k].
  • Such 12-bit vectors can be split into three groups, where the 8 unknown bits of data items (items 1-4 in the list above) are designated as g, the two carry bits obtained from iteration k-1 (items 5-6 in the list above) are designated as c, and the two known message bits W_i[2k+1:2k] (item 7 in the list above) are designated as w.
  • An average value of a sample number s over all traces with specific values g, c, w at the round number i at the bit position k can be designated M^{i,k}_{g,c,w,s}.
  • Points of Interest can be identified from the template tables using the following approach. It is noted, for purposes of this discussion, that both of the indices i and s, as discussed herein, correspond to a time offset in the calculation. Therefore, if the points on the time axis corresponding to these two indices are far apart, no dependency of M^{i,k}_{g,c,w,s} on w should be expected. For instance, a sample taken in round j should not depend on the bits of the input in round i, if i and j are sufficiently spread apart (e.g., spread apart in time and/or rounds). Because correspondence between the two indices may not necessarily be known, a priori, a technique to find out which pairs (i, s) bear relevant information and to drop all other pairs can be used. For example, one of the two following techniques can be used:
  • an average level of a signal (e.g., side-channel leakage) is likely to be different between respective profiling sets and attack sets.
  • This difference can be due, in part, to the fact that the calculations in the first round of the second block of the attack set start from the same (unknown) internal state, while in the profiling set, the internal state before a round is distributed uniformly.
  • values can be normalized by subtracting an average over four values with the same i, k, g, c, s, and all possible values of w.
  • profiling traces can be reused. This is due, at least in part, to the fact that every round of the SHA-2 (or HMAC) calculation is executed on the same hardware. Therefore, it is expected that the points of interest at different rounds will have a same distribution regardless of the round index. Assuming that an initial internal state in the profiling set is chosen randomly, the only significant difference in the distribution would be that the sample indices of the points of interest are shifted according to the round index. For example, if n samples are taken at every round, then the distribution of M^{i,k}_{g,c,w,s} would not depend on the value of i. For this reason, an optimization can be used which enables more information to be extracted (determined) from a same number of traces.
  • data corresponding to different rounds i can be merged, such that two traces, one with specific values of g, c, w at the bit position k at round i_1 and the other with the same values of g, c, w at the same bit position at round i_2, are classified to a same group, while shifting them so that the sample number n·i_1 + s of the first trace corresponds to the sample number n·i_2 + s of the second trace.
  • the result of this approach is a set of averaged samples at POIs, every averaged sample being characterized by the values of g, c, w, the bit position k, and the POI index s.
  • These averaged samples can be organized into 16 tables T_k.
  • the table T_k has 2^10 rows T_{g,c} corresponding to all possible values of g, c, and 4p columns corresponding to the 4 values of w and p POIs. Every row is then represented as a 4p-dimensional vector.
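An illustrative way to organize the averaged POI samples into the 16 tables T_k is sketched below (the array layouts, parameter names, and the use of NumPy are assumptions, not a description of any particular tool; the grouping values (g, c, w) are presumed to be computable for every profiling trace from its known inputs):

```python
import numpy as np

NUM_WINDOWS = 16   # number of 2-bit windows k in a 32-bit word

def build_tables(traces, groups, poi, samples_per_round, num_rounds):
    """
    traces:  NumPy array (num_traces, trace_len) of profiling power traces.
    groups:  groups[t][i][k] = (g, c, w) for trace t, round i, window k.
    poi:     the p selected sample offsets within one round.
    Returns tables[k][(g, c)]: a 4p-dimensional averaged vector (4 values of w, p POIs).
    """
    poi = np.asarray(poi)
    p = len(poi)
    sums = [dict() for _ in range(NUM_WINDOWS)]
    counts = [dict() for _ in range(NUM_WINDOWS)]
    for t, trace in enumerate(np.asarray(traces)):
        for i in range(num_rounds):
            # Rounds run on the same hardware, so traces from different rounds are merged
            # by shifting the sample offsets by i * samples_per_round (the n*i + s alignment).
            samples = trace[i * samples_per_round + poi]
            for k in range(NUM_WINDOWS):
                g, c, w = groups[t][i][k]
                row = sums[k].setdefault((g, c), np.zeros((4, p)))
                cnt = counts[k].setdefault((g, c), np.zeros((4, 1)))
                row[w] += samples
                cnt[w] += 1
    return [{gc: (sums[k][gc] / np.maximum(counts[k][gc], 1)).ravel() for gc in sums[k]}
            for k in range(NUM_WINDOWS)]
```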
  • both the inner SHA-256 invocation and the outer SHA-256 invocation can be attacked in the same manner, where each respective attack can include the steps described below.
  • the attack stage, as with the profiling stage, is described with reference to the alternative SHA-2 notation described above with respect to, for example, FIG. 9.
  • a first step (Step 1) of an attack stage can include finding A_{-1}, E_{-1} of an internal state of a corresponding SHA-256 invocation (e.g., inner or outer).
  • vectors of dimension 4p where p is the number of the points of interest
  • bit discovery can be done in parallel for all four of the words A_{-1}, E_{-1}, ΔA_0, ΔE_0, finding two bits of each word in every iteration, starting from the least significant bit(s).
  • V_c is close to the vector T_{g,c}.
  • g represents bits 2k + 1 : 2k of the four words.
  • L_2 stands for the Euclidean metric.
  • the value of g for which O_g has the minimal value is taken. Then, the bit discovery can proceed to the next iteration, k + 1, for k < 15.
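One iteration of this selection can be sketched as follows (illustrative only; measured is the 4p-dimensional vector built from the averaged attack traces for the current carry hypothesis c, and table_k is the table T_k produced during profiling):

```python
import numpy as np

def best_window_hypothesis(measured: np.ndarray, table_k: dict, c: int) -> int:
    """Return the 8-bit value g whose template row T_{g,c} is closest (L_2) to the measurement."""
    best_g, best_dist = None, float("inf")
    for g in range(256):                         # 2 bits of each of the four unknown words
        row = table_k.get((g, c))
        if row is None:
            continue
        dist = np.linalg.norm(measured - row)    # Euclidean distance O_g
        if dist < best_dist:
            best_g, best_dist = g, dist
    return best_g
```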
  • in Step 2, the words A_{-2}, A_{-3}, E_{-2}, E_{-3} of a respective SHA-256 invocation can be discovered.
  • all possible hypotheses about the bits of A_{-2}, A_{-3}, E_{-2}, E_{-3} can be made, where, for each hypothesis, corresponding measured vectors and corresponding vectors from the template table can be calculated and/or determined. Hypotheses with the lowest Euclidean distances can then be selected.
  • A_{-2}, A_{-3}, E_{-2}, E_{-3} can be found iteratively, e.g., by finding two bits of every word in every iteration.
  • in Step 3, the words A_{-4}, E_{-4} of an internal state of a respective SHA-256 invocation can be found. Where the other words of the internal state are already known, a simple linear calculation suffices to find A_{-4}, E_{-4}.
  • the disclosed template attack approaches can be extended to HMAC implementations where more than one calculation round of a corresponding hash function is performed per clock cycle.
  • attack approaches can be applied to HMAC implementations with up to three rounds per clock cycle, with some modifications, as described below.
  • the number of rounds per clock cycle, e.g., 2 or 3, is designated as d.
  • changes to Step 1 and changes to Step 2 should be made, as described below.
  • the calculated values of A_0 and E_0 overwrite A_{-d} and E_{-d}, rather than A_{-1} and E_{-1}.
  • the four words found in the first step are A_{-d}, E_{-d}, ΔA_0, ΔE_0, rather than A_{-1}, E_{-1}, ΔA_0, ΔE_0.
  • the first step of an attack can be performed in exactly the same manner as in the case of one round per clock cycle, as described above.
  • in Step 2, after the first step, A_{-d} and E_{-d} are already known, while A_{-1}, E_{-1}, A_{-(5-d)}, E_{-(5-d)} are still unknown. Accordingly, the selected hypotheses in this case are for A_{-1}, E_{-1}, A_{-(5-d)}, E_{-(5-d)}.
  • both a profiling stage and an attack stage of an example template attack can be performed using a single SHA-2 invocation. Accordingly, successful recovery of SHA-2 output from power traces can be sufficient for forging an HMAC SHA-2 signature.
  • a register-transfer level (RTL) description of the SHA-256 implementation was synthesized for the following two target platforms:
  • the netlist was simulated using SideChannel Studio, a pre-silicon side-channel leakage simulator by FortifyIQ.
  • the simulator includes two stages: the first (ScopeIQ) performs a power-aware functional simulation of the netlist and generates power traces, and the second stage (ScoreIQ) runs the analysis.
  • FIG. 11 is a graph 1100 illustrating standard deviation of trace samples according to an aspect.
  • in the graph 1100, the standard deviation of the averaged trace samples over time for a constant i, from the simulation measurements 1110 and the FPGA-based measurements 1120, is shown.
  • points of interest were selected in a way that increases (e.g., maximizes) the standard deviation of the trace samples for a given data set. For instance, a standard deviation for each sample in the averaged power trace for a fixed i over all possible values of the vector (g, c, w, k) was calculated.
  • the graph 1100 of FIG. 11 illustrates the normalized standard deviation of the averaged samples over time in the first five rounds (e.g., first five SHA-256 execution rounds) for respective traces taken from the simulation and from the FPGA.
  • the simulator is cycle-based, and therefore it produces a single power sample per cycle. In the FPGA-based setup, four samples per cycle were taken.
  • the standard deviation data demonstrates that the first four execution rounds (0 to 3) can provide the most information about the trace data.
  • the slight difference between the simulator data 1110 and the FPGA data 1120 can be attributed to noise in the FPGA environment, in contrast to the simulator environment, where no noise was presumed.
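The standard-deviation criterion for selecting points of interest described above can be sketched as follows (illustrative; averaged holds one averaged trace per group (g, c, w, k) for a fixed i, and the offsets of the p samples with the largest spread across groups are returned):

```python
import numpy as np

def select_pois(averaged: np.ndarray, p: int) -> np.ndarray:
    """
    averaged: array of shape (num_groups, num_samples) of per-group averaged traces.
    Returns the offsets of the p samples whose standard deviation across groups is largest.
    """
    std_per_sample = averaged.std(axis=0)
    return np.sort(np.argsort(std_per_sample)[-p:])
```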
  • an attack stage of a template attack includes three steps, in which steps 1 and 2 produce a prioritized list of hypotheses for an unknown hash function internal state, and step 3 includes simple calculations.
  • Step 1 (e.g., finding A_{-1}, E_{-1})
  • Step 1 can include choosing the q1 best hypotheses for bits 0, 1, where q1 is a parameter expressing the number of selected hypotheses for a first stage of the disclosed template attack approaches.
  • the q1 best hypotheses for bits 2k - 1 : 0 can be selected, and then combined with 256 hypotheses for bits 2k + 1 : 2k, which results in obtaining a total of 256·q1 hypotheses for bits 2k + 1 : 0.
  • Step 2 (e.g., finding A_{-2}, A_{-3}, E_{-2}, E_{-3}) can then be performed for each one of these q1 hypotheses separately.
  • Step 2 can be performed in a similar way to Step 1 by using 2-bit windows (or using 1-bit windows, 3-bit windows, etc.), where the best q2 hypotheses are selected at each iteration, q2 being a parameter expressing the number of selected hypotheses for a second stage of the disclosed template attack approaches.
  • q2 hypotheses for each of the q1 hypotheses from Step 1 are obtained, which results in a total of q1·q2 hypotheses for a full initial (e.g., unknown) internal state of the inner SHA-256.
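Keeping the q1 (or q2) best hypotheses at every iteration, instead of a single best one, is essentially a beam search over the 2-bit windows; a sketch is shown below (illustrative only; score is presumed to return the Euclidean distance of a partial hypothesis against the corresponding template table, as in Step 1):

```python
def beam_search(score, num_windows: int, beam_width: int):
    """
    score(bits, k): distance of the partial hypothesis `bits` (windows 0..k) to the templates.
    Returns the `beam_width` best full hypotheses, each as a list of 8-bit window values.
    """
    beam = [([], 0.0)]
    for k in range(num_windows):
        candidates = []
        for bits, dist in beam:
            for g in range(256):                  # 256 hypotheses for the next 2-bit window
                candidates.append((bits + [g], dist + score(bits + [g], k)))
        candidates.sort(key=lambda item: item[1])
        beam = candidates[:beam_width]            # keep the q1 (or q2) best hypotheses
    return [bits for bits, _ in beam]
```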
  • the outer SHA invocation can be attacked in the same way, e.g., by repeating the attack for each of the hypotheses, resulting in a total of (q1·q2)^2 iterations.
  • the following observation helped significantly accelerate the process of attacking the outer SHA invocation. That is, using the technique for finding POIs described above, it is possible to find a correct hypothesis by correlation. Namely, for each of the q1·q2 hypotheses for the inner SHA initial state, and for every trace from a subset of the attack traces, the Hamming distance hd_g can be calculated according to Equation 15, above, and its correlation with samples at the points of interest at round d can also be calculated.
  • the outer SHA invocation (e.g., SHA-256) can be attacked, with the assumption that the tested hypothesis is correct. Namely, for every trace, the output from the inner SHA-256 invocation can be calculated and, in the same way, the outer SHA-256 invocation can be attacked, obtaining a total of q1·q2 hypotheses for a full initial internal state of the outer SHA-256. The correct hypothesis can then be found by a brute-force attack.
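The correlation shortcut mentioned above can be sketched as follows (illustrative only; hd_per_hypothesis holds the predicted Hamming distances hd_g for each inner-state hypothesis over a subset of attack traces, and samples holds the corresponding measurements at a point of interest in round d):

```python
import numpy as np

def select_inner_hypothesis(hd_per_hypothesis, samples) -> int:
    """
    hd_per_hypothesis: array (num_hypotheses, num_traces) of predicted Hamming distances.
    samples:           array (num_traces,) of leakage samples at a point of interest.
    Returns the index of the hypothesis with the strongest (absolute) Pearson correlation.
    """
    corr = [abs(np.corrcoef(hd, samples)[0, 1]) for hd in np.asarray(hd_per_hypothesis)]
    return int(np.argmax(corr))
```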
  • different values of q1, q2 may be used to mount a successful template attack using the approaches described herein.
  • Template attacks, including the approaches for mounting a template attack described here, include performing a profiling stage. Accordingly, if an HMAC implementation (e.g., hardware or software) is solely dedicated to calculating HMAC values using a fixed key, e.g., does not allow arbitrary, or independent, hash value (e.g., SHA-2) calculations, then a template attack using the approaches described herein cannot be mounted. However, there are some considerations when implementing such a mitigation approach. First, access to pure hash function (e.g., SHA-2) units or primitives should be blocked in all commercial implementations of a given HMAC implementation, otherwise an attacker may exploit an HMAC unit with an independently accessible hash function primitive for profiling.
  • Second, if a device also includes a hash function unit that provides plain hash function (e.g., SHA-2) functionality, that unit should be based on a different architecture, otherwise it could be possible to use that included unit for performing a profiling stage.
  • a similar, but less restrictive, mitigation approach is to define an execution policy that prevents large numbers of consecutive invocations of a pure hash function used to implement a given HMAC implementation. For instance, time intervals between hash function invocations could be enforced.
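As a purely illustrative software analogy of such a policy (not a description of any particular product), a wrapper around a plain-hash service could enforce a minimum interval between consecutive invocations:

```python
import hashlib
import time

MIN_INTERVAL_S = 0.05   # illustrative policy parameter
_last_call = 0.0

def rate_limited_sha256(data: bytes) -> bytes:
    """Plain-hash service that enforces a minimum interval between consecutive invocations."""
    global _last_call
    wait = MIN_INTERVAL_S - (time.monotonic() - _last_call)
    if wait > 0:
        time.sleep(wait)
    _last_call = time.monotonic()
    return hashlib.sha256(data).digest()
```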
  • a power analysis resistant SHA-256 engine can be implemented using an adapted version of one of the methods developed for other cryptographic modules.
  • Implementations of the various techniques described herein may be implemented in digital electronic circuitry, or in computer hardware, firmware, software, or in combinations of them. Implementations may be implemented as a computer program product, i.e., a non-transitory computer program tangibly embodied in an information carrier, e.g., in a machine-readable storage device (e.g., a computer-readable medium, a tangible computer-readable medium), for processing by, or to control the operation of, data processing apparatus, e.g., a programmable processor, a computer, or multiple computers.
  • a non-transitory tangible computer-readable storage medium can be configured to store instructions that when executed cause a processor to perform a process.
  • a computer program such as the computer program(s) described above, can be written in any form of programming language, including compiled or interpreted languages, and can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
  • a computer program can be deployed to be processed on one computer or on multiple computers at one site or distributed across multiple sites and interconnected by a communications network.
  • Method steps may be performed by one or more programmable processors executing a computer program to perform functions by operating on input data and generating output. Method steps also may be performed by, and an apparatus may be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
  • processors suitable for the processing of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer.
  • a processor will receive instructions and data from a read-only memory or a random access memory or both.
  • Elements of a computer may include at least one processor for executing instructions and one or more memory devices for storing instructions and data.
  • a computer also may include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks.
  • Information carriers suitable for embodying computer program instructions and data include all forms of non-volatile memory, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks.
  • the processor and the memory may be supplemented by, or incorporated in, special purpose logic circuitry.
  • implementations may be implemented on a computer having a display device, e.g., a cathode ray tube (CRT), a light emitting diode (LED), or liquid crystal display (LCD) display device, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer.
  • Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input.
  • Implementations may be implemented in a computing system that includes a back-end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front-end component, e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation, or any combination of such back-end, middleware, or front-end components.
  • Components may be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (LAN) and a wide area network (WAN), e.g., the Internet.
  • the terms “comprising” and “including” or grammatical variants thereof are to be taken as specifying inclusion of the stated features, integers, actions or components without precluding the addition of one or more additional features, integers, actions, components or groups thereof.
  • This term is broader than, and includes, the terms "consisting of" and "consisting essentially of" as defined by the Manual of Patent Examining Procedure of the United States Patent and Trademark Office.
  • any recitation that an embodiment "includes" or "comprises" a feature is a specific statement that sub-embodiments "consist essentially of" and/or "consist of" the recited feature.
  • method refers to manners, means, techniques and procedures for accomplishing a given task including, but not limited to, those manners, means, techniques and procedures either known to, or readily developed from known manners, means, techniques and procedures by practitioners of architecture and/or computer science.
  • Implementations of methods and systems involve performing or completing selected tasks or steps manually, automatically, or a combination thereof.
  • several selected steps could be implemented by hardware or by software on any operating system of any firmware or a combination thereof.
  • selected steps could be implemented as a chip or a circuit.
  • selected steps could be implemented as a plurality of software instructions being executed by a computer using any suitable operating system.
  • selected steps of the method and system could be described as being performed by a data processor, such as a computing platform for executing a plurality of instructions.
  • features used to describe a method can be used to characterize an apparatus and features used to describe an apparatus can be used to characterize a method.
  • various exemplary embodiments can exclude any specific feature, part, component, module, process or element which is not specifically disclosed herein. Specifically, the described embodiments have been described in the context of certain calculation types but might also be used in the context of other calculation types.

Abstract

In one general aspect, a method of improving performance of a data processor can include, in a ring of characteristic 2, computing X^254 by performing a series of: (i) multiplications of two different elements of the field; and (ii) raising an element of the field to a power Z, wherein Z is a power of 2 (such operation being a linear transformation). The total number of multiplications can be limited to 4, the total number of linear transformations can be limited to 4, the number of multiplications executed sequentially can be limited to 3 (meaning that some 2 of 4 multiplications can be executed in parallel), and the number of linear transformations executed sequentially can be limited to 2.

Description

SEMICONDUCTOR INTELLECTUAL PROPERTY CORE, METHODS TO IMPROVE DATA PROCESSOR PERFORMANCE, AND SIDE-CHANNEL ATTACK ON HMAC-SHA-2
CROSS -REFERENCE TO RELATED APPLICATIONS
This PCT application claims the benefit of, and priority to:
U.S. Provisional Application No. 63/202,831 filed on June 25, 2021;
U.S. Provisional Application 63/221,483 filed on July 14, 2021; and U.S. Non-Provisional Application No. 17/444,832 filed on August 11, 2021, each of these earlier applications being fully incorporated herein by reference in their entireties.
TECHNICAL FIELD
Some described embodiments are in the field of data processor design and implementation. Other described embodiments are in the field of side-channel attacks on cryptographic algorithms and, more specifically, to side-channel attacks on hash-based message authentication code (HMAC) implementations, and testing of HMAC implementations for vulnerability to such side- channel attacks.
BACKGROUND
The world is increasingly dependent on digital data processors. Any improvement in the function of such data processors is potentially beneficial.
Side Channel Attacks (SCA) such as differential power analysis (DP A), simple power analysis (SPA), and fault injection are a common category of cyber-attack used by hackers and intelligence agencies to penetrate sensitive systems in order to perform cryptographic key extraction. New types of side channel attacks are being conceived all the time.
Any device that performs a cryptographic operation should withstand side channel attacks and several security certifications explicitly require such side channel attack resistance tests.
Additionally, side-channel attacks can pose a threat to cryptographic algorithms and, more specifically, data and/or information that is sought to be protected using such cryptographic algorithms. As an example, hash functions (hash algorithms), such as hash functions of the secure hash algorithm 2 (SHA-2) family, e.g., if at least some of the inputs to the hash function are secret, may be an interesting target for an attacker (e.g., bad actor, adversary, etc.) seeking to obtain such protected information. Hash-based message authentication code (HMAC) implementations (e.g., hardware and/or software) are one example of cryptographic algorithms, where the inputs are at least partially secret. HMAC approaches that are implemented using hash functions, such as SHA-2 hash functions, can, therefore, be a target of bad actors seeking to discover protected information. However, due to the construction of HMAC implementations, current side-channel attacks are not capable of mounting a successful attack on HMAC approaches implemented using SHA-2 family hash functions (HMAC-SHA-2). Accordingly, it follows that it is not possible to determine susceptibility (e.g., to test for vulnerability) of an HMAC implementation to side-channel attacks.
SUMMARY
In a broad aspect, described embodiments relate to improving the function of data processors.
A general aspect relates to improvement of the exponentiation algorithm in a redundant AES calculation. In some embodiments this aspect is embodied by one or more methods. This aspect contributes to an improvement of calculation speed in a data processor by shortening the critical path in a hardware implementation of raising to the power of 254 in GF(2^8). In some embodiments this path shortening contributes to an increase in the frequency at which such a design can be used. In some embodiments this aspect is embodied by an IP core. In other exemplary embodiments, this aspect is embodied by a method.
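As an illustration of the kind of multiply-and-square schedule involved, the sketch below computes X^254 in GF(2^8) (the AES field is assumed) using four multiplications arranged so that two of them can be evaluated in parallel; it is one possible schedule and is not necessarily the specific schedule of the described embodiments or claims.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiplication in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return result

def gf_pow2k(a: int, k: int) -> int:
    """Raise to the power 2^k (a linear transformation in characteristic 2)."""
    for _ in range(k):
        a = gf_mul(a, a)
    return a

def gf_pow254(x: int) -> int:
    t1 = gf_mul(x, gf_pow2k(x, 1))                     # x^3  (multiplication 1)
    t2 = gf_mul(t1, gf_pow2k(t1, 2))                   # x^15 (multiplication 2)
    t3 = gf_mul(t1, gf_pow2k(x, 2))                    # x^7  (multiplication 3, parallel with 2)
    return gf_mul(gf_pow2k(t2, 1), gf_pow2k(t3, 5))    # x^30 * x^224 = x^254 (multiplication 4)

# x^254 is the multiplicative inverse of x in GF(2^8) for x != 0, as used in the AES S-box.
assert all(gf_mul(x, gf_pow254(x)) == 1 for x in range(1, 256))
```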
Another general aspect relates to limiting the degree of polynomials over a ring during multiplication operations. In some embodiments this aspect is embodied by an IP core. In other exemplary embodiments, this aspect is embodied by a method.
It will be appreciated that the various aspects described above relate to solution of technical problems related to improving calculation speed in a data processor.
Alternatively or additionally, it will be appreciated that the various aspects described above relate to solution of technical problems related to improving function of a data processor.
In some exemplary embodiments, there is provided a semiconductor intellectual property (IP) core including a transformation engine designed and configured to generate redundant representations of each element of a field GF(2^8) using a polynomial of degree no higher than 7 + d, where d > 0 is a redundancy parameter, the redundant representations belonging to a ring; wherein the transformation engine represents a same field element by one of 2^d various ways (pairwise differing by terms that are multiples of P(x)), and at one or more moment(s) of calculations replaces a redundant representation of a field element Z with a redundant representation chosen out of the 2^d representations of the field element Z.
In some embodiments the redundant representation out of the 2^d representations is chosen randomly. Alternatively or additionally, in some embodiments d > 5. Alternatively or additionally, in some embodiments d > 8. Alternatively or additionally, in some embodiments, the element of a field includes a byte of data within a block of a block cipher and a cryptographic key. Alternatively or additionally, in some embodiments, the block cipher is selected from the group consisting of AES, SM4, and ARIA. Alternatively or additionally, in some embodiments, the transformation engine computes X^Y by performing a series of:
(i) multiplications of two different elements of the field; and
(ii) raising an element of the field to a power Z wherein Z is a power of 2; wherein the number of multiplications (i) is at least two less than the number of ones (1s) in the binary representation of Y; and wherein after at least one transformation according to (i) or (ii) the result representing a field element Z is replaced with one of the 2^d redundant representations of the field element Z. Alternatively or additionally, in some embodiments, the one of 2^d redundant representations of the field element Z is chosen randomly. Alternatively or additionally, in some embodiments, Y=254. Alternatively or additionally, in some embodiments, a number of multiplications (i) is 4 or less.
In some exemplary embodiments, there is provided a method of building different representations of the Galois Field (GF) implemented by logic circuitry including: redundantly representing each element of a field GF(2^8) using a polynomial of degree no higher than 7 + d, where d > 0 is a redundancy parameter, to generate redundant representations belonging to a ring; representing a same field element by one of 2^d various ways (pairwise differing by terms that are multiples of P(x)), and at one or more moment(s) of calculations replacing a value representing a field element Z with any of the various representations of the field element Z. In some embodiments one of the various representations of the field element Z is chosen randomly. Alternatively or additionally, in some embodiments d > 5. Alternatively or additionally, in some embodiments d > 8. Alternatively or additionally, in some embodiments the element of a field comprises a byte of data within a block of a block cipher and a cryptographic key. Alternatively or additionally, in some embodiments the block cipher is selected from the group consisting of AES, SM4, and ARIA. Alternatively or additionally, in some embodiments the method includes computing X^Y by performing a series of:
(i) multiplications of two different elements of the field; and
(ii) raising an element of the field to a power Z wherein Z is a power of 2; wherein the number of multiplications (i) is at least two less than the number of ones (1s) in the binary representation of Y; and wherein after at least one transformation according to (i) or (ii) the result representing a field element Z is replaced with one of the redundant representations of the field element Z. Alternatively or additionally, in some embodiments the one of the redundant representations of the field element Z is chosen randomly. Alternatively or additionally, in some embodiments Y=254. Alternatively or additionally, in some embodiments a number of multiplications
(i) is 4 or less.
In some exemplary embodiments, there is provided a method of improving performance of a data processor including: in a ring of characteristic 2, computing X^254 by performing a series of: (i) multiplications of two different elements of the field; and (ii) raising an element of the field to a power Z, wherein Z is a power of 2 (such operation being a linear transformation); wherein the total number of multiplications is limited to 4, the total number of linear transformations is limited to 4, the number of multiplications executed sequentially is limited to 3 (meaning that some 2 of 4 multiplications can be executed in parallel), and the number of linear transformations executed sequentially is limited to 2. In some embodiments elements of the ring redundantly represent elements of a field and after at least one transformation according to (i) or (ii) the result representing a field element Z is replaced with one of the redundant representations of a same element Z of the field. Alternatively or additionally, in some embodiments the redundant representation of a same element Z of the field is chosen randomly. Alternatively or additionally, in some embodiments the ring includes a field GF(2^8) as a subring.
In some exemplary embodiments, there is provided a method of improving performance of a data processor comprising: in a ring of characteristic 2, computing X^254 by performing a series of:
(i) multiplications of two different elements of the ring; and
(ii) raising an element of the ring to a power Z, wherein Z is a power of 2 (such operation being a linear transformation); wherein the total number of multiplications is limited to 4, the total number of linear transformations is limited to 3, the number of multiplications executed sequentially is limited to 3 (meaning that some 2 of 4 multiplications can be executed in parallel), and the number of linear transformations executed sequentially is limited to 3. In some embodiments elements of the ring redundantly represent elements of a field; and after at least one transformation according to (i) or (ii) the result representing a field element Z is replaced with one of the redundant representations of a same element Z of the field. Alternatively or additionally, in some embodiments the redundant representation of a same element Z of the field is chosen randomly. Alternatively or additionally, in some embodiments the ring includes a field GF(2^8) as a subring.
In some exemplary embodiments, there is provided a semiconductor intellectual property (IP) core for improving performance of a data processor comprising: a transformation engine designed and configured to perform, in a ring of characteristic 2, computation of X^254 by performing a series of:
(i) multiplications of two different elements of the ring; and
(ii) raising an element of the ring to a power Z, wherein Z is a power of 2 (such operation being a linear transformation); wherein the total number of multiplications is limited to 4, the total number of linear transformations is limited to 3, the number of multiplications executed sequentially is limited to 3 (meaning that some 2 of 4 multiplications can be executed in parallel), and the number of linear transformations executed sequentially is limited to 3. In some embodiments elements of the ring redundantly represent elements of a field; and after at least one transformation according to (i) or (ii) the result representing a field element Z is replaced with one of the redundant representations of a same element Z of the field. Alternatively or additionally, in some embodiments the redundant representation of a same element Z of the field is chosen randomly. Alternatively or additionally, in some embodiments the ring includes a field GF(2^8) as a subring.
In some exemplary embodiments, there is provided a method of improving performance of a data processor including: in a ring of characteristic 2, computing X^254 by performing a series of:
(i) multiplications of two different elements of the field; and
(ii) raising an element of the field to a power Z wherein Z is a power of 2 (such operation being a linear transformation); wherein the total number of multiplications is limited to 7, the total number of linear transformations is limited to 6, the number of multiplications executed sequentially is limited to 3, and the number of linear transformations executed sequentially is limited to 1.
In some embodiments, elements of the ring redundantly represent elements of a field and after at least one transformation according to (i) or (ii) the result representing a field element Z is replaced with one of the redundant representations of a same element Z of the field. Alternatively or additionally, in some embodiments the redundant representation of a same element Z of the field is chosen randomly. Alternatively or additionally, in some embodiments the ring includes a field GF(2^8) as a subring.
In some exemplary embodiments, there is provided an intellectual property (IP) core including: circuitry that improves performance of a data processor by: in a ring of characteristic 2, computing X^254 by performing a series of:
(i) multiplications of two different elements of the field; and
(ii) raising an element of the field to a power Z, wherein Z is a power of 2 (such operation being a linear transformation); wherein the total number of multiplications is limited to 4, the total number of linear transformations is limited to 4, the number of multiplications executed sequentially is limited to 3 (meaning that some 2 of 4 multiplications can be executed in parallel), and the number of linear transformations executed sequentially is limited to 2.
In some embodiments elements of the ring redundantly represent elements of a field and after at least one transformation according to (i) or (ii) the result representing a field element Z is replaced with one of the redundant representations of a same element Z of the field. Alternatively or additionally, in some embodiments the redundant representation of a same element Z of the field is chosen randomly. Alternatively or additionally, in some embodiments the ring includes a field GF(2^8) as a subring.
In some exemplary embodiments, there is provided an intellectual property (IP) core including: circuitry that improves performance of a data processor by: in a ring of characteristic 2, computing X^254 by performing a series of:
(i) multiplications of two different elements of the field; and
(ii) raising an element of the field to a power Z wherein Z is a power of 2 (such operation being a linear transformation); wherein the total number of multiplications is limited to 7, the total number of linear transformations is limited to 6, the number of multiplications executed sequentially is limited to 3, and the number of linear transformations executed sequentially is limited to 1.
In some embodiments, elements of the ring redundantly represent elements of a field and after at least one transformation according to (i) or (ii) the result representing a field element Z is replaced with one of the redundant representations of a same element Z of the field. Alternatively or additionally, in some embodiments the redundant representation of a same element Z of the field is chosen randomly. Alternatively or additionally, in some embodiments the ring includes a field GF(2^8) as a subring.
In some exemplary embodiments, there is provided a semiconductor intellectual property (IP) core including a transformation engine designed and configured to perform on elements of a finite ring R represented as GF(p)[x]/(PQ) a sequence of operations comprising at least one member selected from the group consisting of multiplication and raising to an integer power; wherein p is a prime number; wherein P is a polynomial of degree n irreducible over GF(p); wherein Q is a polynomial of degree d over GF(p); wherein elements of R redundantly represent elements of a finite field F = GF(p^n) represented as GF(p)[x]/(P); and wherein a result (Z) of at least one of the operations in the sequence is replaced with an element (Z*) of R such that Z and Z* redundantly represent a same element of F. In some embodiments, the element Z* is chosen randomly. Alternatively or additionally, in some embodiments p = 2. Alternatively or additionally, in some embodiments n = 8. Alternatively or additionally, in some embodiments any element A of R represents the element A mod P of F. Alternatively or additionally, in some embodiments the replacement of Z by Z* is performed by calculating Z* = Z + CP wherein C is a polynomial of a degree less than d. Alternatively or additionally, in some embodiments the sequence of operations takes an element X of R representing an element Y of F and calculates an element Z of R representing an element Y^n of F. Alternatively or additionally, in some embodiments the sequence of operations consists only of multiplications and raisings to the powers of 2^k. Alternatively or additionally, in some embodiments n = 254.
In some exemplary embodiments, there is provided a method of building different representations of a finite ring R represented as GF(p)[x]/(PQ) implemented by logic circuitry including: a sequence of operations comprising at least one member selected from the group consisting of multiplication and raising to an integer power; wherein p is a prime number; wherein P is a polynomial of degree n irreducible over GF(p); wherein Q is a polynomial of degree d over GF(p); wherein elements of R redundantly represent elements of a finite field F = GF(p^n) represented as GF(p)[x]/(P); and wherein a result (Z) of at least one of the operations in the sequence is replaced with an element (Z*) of R such that Z and Z* redundantly represent a same element of F. In some embodiments the element Z* is chosen randomly. Alternatively or additionally, in some embodiments p = 2. Alternatively or additionally, in some embodiments n = 8. Alternatively or additionally, in some embodiments any element A of R represents the element A mod P of F. Alternatively or additionally, in some embodiments the replacement of Z by Z* is performed by calculating Z* = Z + CP wherein C is a polynomial of a degree less than d. Alternatively or additionally, in some embodiments the sequence of operations takes an element X of R representing an element Y of F and calculates an element Z of R representing an element Y^n of F. Alternatively or additionally, in some embodiments the sequence of operations consists only of multiplications and raisings to the powers of 2^k. Alternatively or additionally, in some embodiments n = 254.
In another general aspect, a method for testing for vulnerability of an implementation of a hash-based message authentication code (HMAC) algorithm to a side-channel attack can include mounting a template attack on a hash function used to implement the HMAC algorithm. The template attack can include generating, based on first side-channel leakage information associated with execution of the hash function, a plurality of template tables, each template table of the plurality corresponding, respectively, with a subset of bit positions of an internal state of the hash function. The template attack can further include generating, based on second side-channel leakage information, a plurality of hypotheses for an internal state of an invocation of the hash function based on a secret key. The method can further include generating, using the hash function, respective hash values generated from each of the plurality of hypotheses and a message. The method can also include comparing each of the respective hash values with a hash value generated using the secret key and the message. The method can still further include, based on the comparison, determining vulnerability of the HMAC algorithm implementation based on a hash value of the respective hash values matching the hash value generated using the secret key and the message.
Implementations can include one or more of the following features. For example, the implementation of the HMAC algorithm can be one of a hardware implementation, a software implementation, or a simulator implementation.
One round of a compression function of the hash function can be calculated per calculation cycle of the hash function. A plurality of rounds of a compression function of the hash function can be calculated per calculation cycle of the hash function.
Each template table of the plurality of template tables can include a plurality of rows that are indexed using values of bits of the respective subset of bit positions. The rows can include respective side-channel leakage information of the first side-channel leakage information associated with the index values. Generating the template tables can include normalizing a value of the respective side- channel information based on an average value of a plurality of values of the respective side-channel leakage information. The plurality of rows of the template tables can be further indexed using at least one of carry bit values corresponding with the subset of bits of the internal state of the hash function, or bit values of a portion of a message schedule used to calculate the hash function.
Collecting the first side-channel leakage information can include executing the hash function using a known message schedule as a first input block of the hash function. The first side-channel leakage information can be collected based on a Hamming distance model.
Each subset of bit positions of the internal state can include a respective two-bit subset of each word of the internal state of the hash function.
The hash function can be a hash function of the Secure Hash Algorithm 2 (SHA-2) standard.
Each template table of the plurality of template tables further corresponds with a respective execution round of a compression function of the hash function.
Determining each hypothesis of the plurality of hypotheses can include determining values of respective subsets of bits of the internal state of the hash function in correspondence with the plurality of the template tables.
The hash function can be implemented in hardware. One execution round of a compression function of the hash function can be completed in one clock cycle of the hardware implementation. Multiple execution rounds of a compression function of the hash function can be completed in one clock cycle of the hardware implementation.
The first side-channel leakage information and the second side-channel leakage information can include at least one of respective power consumption over time, electromagnetic emissions over time, or cache miss patterns. In another general aspect, a method of forging a hash-based message authentication code (HMAC) can include collecting, while executing an implementation of a hash function used to produce the HMAC, first side-channel leakage information corresponding with overwriting values of an internal state of the hash function. The method can also include generating a plurality of template tables, each template table corresponding, respectively, with a subset of bits of the internal state of the hash function. Each template table of the plurality of template tables can include rows that are indexed using values of the respective subset of bits. The rows can include respective side-channel leakage information of the first side-channel leakage information associated with the index values. The method can also include collecting second side-channel leakage information associated with producing the HMAC, and identifying, based on comparison of the second side-channel leakage information with the plurality of template tables, a first plurality of hypotheses for an internal state of an inner invocation the hash function. The method can still further include identifying, based on comparison of the second side-channel leakage information with the plurality of template tables, a second plurality of hypotheses for an internal state of an outer invocation of the hash function. The method can also include selecting, using pairs of hypotheses each including a hypothesis of the first plurality of hypotheses and a hypothesis of the second plurality of hypotheses, a first hypothesis of the first plurality of hypotheses and a second hypothesis of the second plurality of hypotheses for forging the HMAC.
Implementations can include one or more of the following features. For example, generating the template tables can include normalizing a value of the respective side-channel information based on an average value of a plurality of values of the respective side-channel leakage information.
Collecting the first side-channel leakage information can include executing a single invocation of the hash function using a known message schedule as a first input block of the hash function. The first side-channel leakage information can be collected based on a Hamming distance model.
The template tables can be further indexed using at least one of carry bit values corresponding with the subset of bits of the internal state of the hash function, or bit values of a portion of a message schedule used to calculate the hash function.
The subset of bits of the internal state can include respective two-bit subsets of each word of the internal state of the hash function.
The hash function can be a hash function of the Secure Hash Algorithm 2 (SHA-2) standard.
Each template table of the plurality of template tables can correspond with a respective execution round of a compression function of the hash function. Determining each hypothesis of the first plurality of hypotheses and each hypothesis of the second plurality of hypotheses can include determining respective subsets of bits of the internal state of the hash function in correspondence with the plurality of the template tables.
The hash function can be implemented in hardware. One execution round of a compression function of the hash function can be completed in one clock cycle of the hardware implementation. Multiple execution rounds of a compression function of the hash function can be completed in one clock cycle of the hardware implementation.
The hash function can be implemented in software.
Selecting the first hypothesis of the first plurality of hypotheses and the second hypothesis of the second plurality of hypotheses for forging the HMAC can include performing a brute force attack.
The first side-channel leakage information and the second side-channel leakage information can include at least one of respective power consumption over time, electromagnetic emissions over time, or cache miss patterns.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a schematic representation of an IP core according to some exemplary embodiments;
FIG. 2 is a simplified flow diagram of a method according to some exemplary embodiments;
FIG. 3 is a schematic representation of operations performed by a transformation engine of an IP core according to some exemplary embodiments; and
FIG. 4 is a schematic representation of operations performed by a transformation engine of an IP core according to some exemplary embodiments.
FIG. 5 is a flowchart illustrating a method for testing an HMAC implementation for vulnerability to a side-channel attack according to an aspect.
FIG. 6 is a flowchart illustrating an example method for performing a side-channel template attack on an HMAC implementation (e.g., hardware, software, simulation, etc.) according to an aspect.
FIG. 7 is a block diagram illustrating an experimental setup for performing side-channel template attacks and associated vulnerability testing on an HMAC implementation according to an aspect.
FIG. 8 is a diagram illustrating a SHA-256 algorithm block diagram according to an aspect.
FIG. 9 is a diagram schematically illustrating three execution rounds of a compression function of a SHA-256 hash function according to an aspect.
FIGs. 10A and 10B are diagrams illustrating operation of an adder used to build template tables for use in a side-channel attack according to an aspect. FIG. 11 is a graph illustrating standard deviations between trace samples according to an aspect.
In the drawings, like reference symbols may indicate like and/or similar components (elements, structures, etc.) in different views. The drawings illustrate generally, by way of example, but not by way of limitation, various implementations discussed in the present disclosure. Reference symbols shown in one drawing may not be repeated for the same, and/or similar elements in related views. Reference symbols that are repeated in multiple drawings may not be specifically discussed with respect to each of those drawings, but are provided for context between related views. Also, not all like elements in the drawings are specifically referenced with a reference symbol when multiple instances of an element are illustrated.
DETAILED DESCRIPTION
Some embodiments described herein (e.g., with respect to FIGs. 1-4) relate to methods and/or IP cores that contribute to improvement in speed and/or performance of a data processor. Specifically, some embodiments can be used to improve performance of AES calculations.
The principles and operation of a method and/or intellectual property (IP) core according to exemplary embodiments may be better understood with reference to the drawings and accompanying descriptions.
Before explaining at least one embodiment in detail, it is to be understood that the described embodiments are not limited in their application to the details set forth in the following description or exemplified by the Examples. The described example embodiments can be practiced or carried out in various ways. Also, it is to be understood that the phraseology and terminology employed herein is for the purpose of description and should not be regarded as limiting.
Exemplary IP core
FIG. 1 is a schematic representation of a semiconductor intellectual property (IP) core indicated generally as 100. Depicted exemplary IP core 100 includes a transformation engine 110 designed and configured to generate redundant representations of each element 121 (1-256) of a field GF(2^8) 120 using a Galois Field polynomial 140 of the form GF(2^(7+d)). Redundant representations of a polynomial X are generated by adding a product CP, where P is a fixed irreducible polynomial of degree 8, and C is a polynomial of a degree less than d, where d > 0 is a redundancy parameter and the redundant representations belong to a ring. Although transformation engine 110 is designed and configured to represent each element 121 (1-256) of a field GF(2^8) 120 using a polynomial 140, in actual practice, transformation using the polynomial is only conducted on those elements 121, or portions thereof, which are being used in calculations. In some embodiments transformation engine 110 represents a same field element by one of 2^d various ways (pairwise differing by terms that are multiples of P(x)), and at one or more moment(s) of calculations randomly replaces a redundant representation of a field element Z with a redundant representation chosen out of the 2^d representations of the field element Z. In some embodiments the choosing out of the 2^d representations is performed randomly.
Given any element 121 (1-256) in field 120, the transformation engine transforms it to one of elements 130_1 to 130_(2^d), each one of which redundantly represents said element of field 120. The 256 sets of 2^d elements 130 form ring 150. (A single set is depicted in the figure for clarity.) For example, if d = 9, the ring 150 will include 131,072 elements 130, with 512 elements 130 corresponding to each of elements 121.
For example if d=24, the ring 150 will include 4,294,967,296 elements 130 with 16,777,216 elements 130 corresponding to each of elements 121.
In some exemplary embodiments, increasing the value of redundancy parameter d contributes to an increase in security with respect to various types of attacks. According to various exemplary embodiments, transformation engine 110 employs d > 5; d > 6; d > 7; d > 8; d > 9; d > 12; d > 14; d > 16; d > 18; d > 20; d > 24; d > 32; d > 48 or intermediate or greater values of d.
In some exemplary embodiments, transformation engine 110 represents a same field element 121 by one of 2^d various ways (pairwise differing by terms that are multiples of P(x)), and at each moment of calculations chooses any of the various representations. Alternatively or additionally, in some embodiments each of elements 121 of field 120 includes a byte of data within a block of a block cipher or a cryptographic key. According to various exemplary embodiments, the block cipher is selected from the group consisting of AES, SM4, and ARIA.
In some embodiments, transformation engine 110 computes X^Y by performing a series of: (i) multiplications of two different elements of the field; and (ii) raisings of an element of the field to a power Z, wherein Z is a power of 2. According to these embodiments, the number of multiplications (i) is at least two less than the number of ones (1s) in the binary representation of Y. In some embodiments, Y=254. Alternatively or additionally, in some embodiments the number of multiplications (i) is 4 or less. In some exemplary embodiments, after at least one transformation according to (i) or (ii), the result representing a field element Z is replaced with one of the 2^d redundant representations of the field element Z. In some embodiments the replacement is done with one out of the 2^d representations chosen randomly.
FIG. 2 is a simplified flow diagram of a method of building different representations of the Galois Field (GF), indicated generally as 200, according to some exemplary embodiments. Depicted exemplary method 200 is implemented by logic circuitry and comprises redundantly representing 210 each element of a field GF(2^8) using a polynomial of degree no higher than 7 + d, where d > 0 is a redundancy parameter, to generate redundant representations belonging to a ring. In some embodiments method 200 includes representing a same field element in one of 2^d various ways (pairwise differing by terms that are multiples of P(x)), and at one or more moment(s) of calculations replacing a value representing a field element Z with any of the various representations of the field element Z. In some embodiments the replacing is performed with a redundant representation of Z chosen randomly.
According to various exemplary embodiments, method 200 employs d > 5 (220); d > 6; d > 7; d > 8 (230); d > 9; d > 12; d > 14; d > 16; d > 18; d > 20; d > 24; d > 32; d > 48; or intermediate or greater values of d. In some embodiments, method 200 includes representing 240 a same field element in one of 2^d various ways (pairwise differing by terms that are multiples of P(x)), and at each moment of calculations choosing any of the various representations. In some embodiments, the element of a field includes 250 a byte of data within a block of a block cipher or a cryptographic key. In actual practice, transformation using the polynomial is only conducted on those elements of the field, or portions thereof, which are being used in calculations.
According to various exemplary embodiments, the block cipher is selected from the group consisting of AES, SM4, and ARIA.
In some embodiments, method 200 includes computing X^Y by performing a series of: (i) multiplications of two different elements of the field; and (ii) raisings of an element of the field to a power Z, wherein Z is a power of 2. According to these embodiments the number of multiplications (i) is at least two less than the number of ones (1s) in the binary representation of Y. In some embodiments, after at least one transformation according to (i) or (ii), the result representing a field element Z is replaced with one of the redundant representations of the field element Z. In some embodiments Z is replaced with a redundant representation chosen at random.
In some embodiments, Y=254. Alternatively or additionally, in some embodiments a number of multiplications (i) is 4 or less.
Mathematical definition of the block-ciphers.
In many block-ciphers (AES, SM4, ARIA) messages are broken into blocks of a predetermined length, and each block is encrypted independently of the others.
Rijndael (AES) is presented here as an example. The common block ciphers SM4 and ARIA are very similar to Rijndael.
Rijndael operates on blocks that are 128 bits in length. There are actually three variants of the Rijndael cipher, each of which uses a different key length. The permissible key lengths are 128, 192, and 256 bits. Even a 128-bit key is large enough to prevent any exhaustive search. Of course, a large key is no good without a strong design.
Within a block, the fundamental unit operated upon is a byte, that is, 8 bits. Bytes are thought of in two different ways in Rijndael. Let the byte be given in terms of its bits as b7, b6, ... , b0.
Consider each bit as an element in GF(2), a finite field of two elements. First, one may think of a byte as a vector (b7, b6, ..., b0) ∈ GF(2)^8.
Second, one may think of a byte as an element of GF(2^8), in the following way: Consider the polynomial ring GF(2)[X]. It is possible to mod out by any polynomial to produce a factor ring. If this polynomial is irreducible and of degree n, then the resulting factor ring is isomorphic to GF(2^n). In Rijndael, the irreducible polynomial q(x) = x^8 + x^4 + x^3 + x + 1 is used to mod out and obtain a representation for GF(2^8). A byte is then represented in GF(2^8) by the polynomial b7*x^7 + b6*x^6 + ... + b1*x + b0.
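As an illustrative aside (not part of the claimed subject matter), the byte-as-polynomial view can be exercised with a few lines of Python; the worked product {57}·{83} = {c1} is the example given in the AES specification.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes as polynomials over GF(2), reduced modulo
    q(x) = x^8 + x^4 + x^3 + x + 1, i.e. multiplication in GF(2^8)."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1B   # reduce x^8 to x^4 + x^3 + x + 1
    return r

# {57} * {83} = {c1}, the worked example from the AES specification
assert gf_mul(0x57, 0x83) == 0xC1
```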
Arithmetic operations in F256 are described in detail in WO2020/148771 which is fully incorporated herein by reference.
Exemplary method for improvement of exponentiation algorithm in redundant AES calculation
In some exemplary embodiments, there is provided a method of improving performance of a data processor. In some embodiments the method includes: in a ring of characteristic 2, computing X^254 by performing a series of:
(i) multiplications of two different elements of the ring; and
(ii) raising an element of the ring to a power Z, wherein Z is a power of 2 (such operation being a linear transformation); wherein the total number of multiplications is limited to 4, the total number of linear transformations is limited to 4, the number of multiplications executed sequentially is limited to 3 (meaning that some 2 of 4 multiplications can be executed in parallel), and the number of linear transformations executed sequentially is limited to 2.
In some exemplary embodiments, elements of the ring redundantly represent elements of a field, and after at least one transformation according to (i) or (ii) the result representing a field element Z is replaced with one of the redundant representations of the same element Z of the field. In some embodiments the replacement is with a redundant representation of Z chosen at random. Alternatively or additionally, in some embodiments the ring includes the field GF(2^8) as a subring. Embodiments which employ this method shorten the critical path in a hardware implementation of raising to the power of 254 in GF(2^8). In some embodiments this shortening of the critical path contributes to an increase in the frequency at which such a design can be used.
An example of such an implementation begins with x^2 = x·x and x^3 = x·x^2 and ends with x^254 = x^14·x^240 (the intermediate steps appear as an equation image in the published application).
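For illustration, the following Python sketch gives one exponentiation chain that satisfies the stated bounds (4 multiplications in total, 4 linear transformations in total, at most 3 multiplications and 2 linear transformations executed in sequence); because the intermediate steps of the example are not reproduced in the text, this particular chain is an assumption and not necessarily the exact chain of the figure.

```python
def gf_mul(a, b):
    # GF(2^8) multiplication modulo x^8 + x^4 + x^3 + x + 1
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        hi = a & 0x80
        a = (a << 1) & 0xFF
        if hi:
            a ^= 0x1B
    return r

def pow2k(a, k):
    # Raising to the power 2^k -- a linear transformation in characteristic 2
    for _ in range(k):
        a = gf_mul(a, a)
    return a

def x_to_254(x):
    x2   = pow2k(x, 1)          # linear, depth 1
    x3   = gf_mul(x, x2)        # multiplication 1 (sequential level 1)
    x12  = pow2k(x3, 2)         # linear, depth 2
    x48  = pow2k(x3, 4)         # linear, depth 2 (parallel with x12)
    x192 = pow2k(x3, 6)         # linear, depth 2 (parallel with x12)
    x14  = gf_mul(x2, x12)      # multiplication 2 (sequential level 2)
    x240 = gf_mul(x48, x192)    # multiplication 3 (parallel with x14)
    return gf_mul(x14, x240)    # multiplication 4 (sequential level 3): x^254

# x^254 is the multiplicative inverse in GF(2^8), so x * x^254 = 1 for x != 0
assert all(gf_mul(x, x_to_254(x)) == 1 for x in range(1, 256))
```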
In some exemplary embodiments, there is provided a method of improving performance of a data processor. In some embodiments the method includes: in a ring of characteristic 2, computing X^254 by performing a series of:
(i) multiplications of two different elements of the ring; and
(ii) raising an element of the ring to a power Z, wherein Z is a power of 2 (such operation being a linear transformation); wherein the total number of multiplications is limited to 4, the total number of linear transformations is limited to 3, the number of multiplications executed sequentially is limited to 3 (meaning that some 2 of 4 multiplications can be executed in parallel), and the number of linear transformations executed sequentially is limited to 3.
Embodiments which employ this method shorten the critical path in a hardware implementation of raising to the power of 254 in GF(28). In some embodiments this shortening of the critical path contributes to an increase in the frequency at which such a design can be used.
An example of such an implementation begins with x^2 = x·x and x^3 = x·x^2 (the remaining steps appear as an equation image in the published application).
In some exemplary embodiments, there is provided a semiconductor intellectual property (IP) core for improving performance of a data processor. The IP core includes a transformation engine designed and configured to perform, in a ring of characteristic 2, computation of X^254 by performing a series of:
(i) multiplications of two different elements of the ring; and
(ii) raising an element of the ring to a power Z, wherein Z is a power of 2 (such operation being a linear transformation); wherein the total number of multiplications is limited to 4, the total number of linear transformations is limited to 3, the number of multiplications executed sequentially is limited to 3 (meaning that some 2 of 4 multiplications can be executed in parallel), and the number of linear transformations executed sequentially is limited to 3.
In some embodiments, elements of the ring redundantly represent elements of a field, and after at least one transformation according to (i) or (ii) the result representing a field element Z is replaced with one of the redundant representations of the same element Z of the field.
Alternatively or additionally, in some embodiments the redundant representation of the same element Z of the field is chosen randomly. Alternatively or additionally, in some embodiments the ring includes the field GF(2^8) as a subring.
Additional exemplary method for improvement of exponentiation algorithm in redundant AES calculation
In some exemplary embodiments, there is provided a method of improving performance of a data processor. In some embodiments the method includes: in a ring of characteristic 2, computing X^254 by performing a series of:
(i) multiplications of two different elements of the ring; and
(ii) raising an element of the ring to a power Z wherein Z is a power of 2 (such operation being a linear transformation); wherein the total number of multiplications is limited to 7, the total number of linear transformations is limited to 6, the number of multiplications executed sequentially is limited to 3, and the number of linear transformations executed sequentially is limited to 1.
In some embodiments elements of the ring redundantly represent elements of a field, and after at least one transformation according to (i) or (ii) the result representing a field element Z is replaced with one of the redundant representations of the same element Z of the field. In some embodiments the redundant representation of the same element Z is chosen randomly.
In some embodiments the ring includes the field GF(2^8) as a subring. Embodiments which employ this method give a slightly shorter critical path than the method described immediately above, because 1 rather than 2 linear transformations are performed sequentially. However, use of this method increases (relative to the method described immediately above) the gate count (more multiplications and more linear transformations). Embodiments which employ this method shorten the critical path in a hardware implementation of raising to the power of 254 in GF(2^8). In some embodiments this shortening of the critical path contributes to an increase in the frequency at which such a design can be used.
An example of such an implementation includes x^14 = x^5·x^9, x^240 = x^48·x^192, and x^254 = x^14·x^240 (the remaining steps appear as an equation image in the published application).
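The following Python sketch completes this example in one way that is consistent with the stated bounds (7 multiplications and 6 linear transformations in total, with at most 3 multiplications and 1 linear transformation executed in sequence); only the three products shown above are taken from the text, and the remaining steps are assumptions.

```python
def gf_mul(a, b):
    # GF(2^8) multiplication modulo x^8 + x^4 + x^3 + x + 1
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        hi = a & 0x80
        a = (a << 1) & 0xFF
        if hi:
            a ^= 0x1B
    return r

def pow2k(a, k):
    for _ in range(k):
        a = gf_mul(a, a)
    return a

def x_to_254_wide(x):
    # 6 linear transformations, all applied directly to x (1 sequential level)
    x4, x8, x16, x32, x64, x128 = (pow2k(x, k) for k in (2, 3, 4, 5, 6, 7))
    # 7 multiplications, at most 3 in sequence (groups of 4, 2 and 1 in parallel)
    x5, x9    = gf_mul(x, x4),    gf_mul(x, x8)       # level 1
    x48, x192 = gf_mul(x16, x32), gf_mul(x64, x128)   # level 1
    x14, x240 = gf_mul(x5, x9),   gf_mul(x48, x192)   # level 2
    return gf_mul(x14, x240)                          # level 3: x^254

assert all(gf_mul(x, x_to_254_wide(x)) == 1 for x in range(1, 256))
```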
Exemplary IP core for improvement of exponentiation algorithm in redundant AES calculation
In some exemplary embodiments, there is provided an intellectual property (IP) core comprising: circuitry that improves performance of a data processor by: in a ring of characteristic 2, computing X^254 by performing a series of:
(i) multiplications of two different elements of the ring; and
(ii) raising an element of the ring to a power Z, wherein Z is a power of 2 (such operation being a linear transformation); wherein the total number of multiplications is limited to 4, the total number of linear transformations is limited to 4, the number of multiplications executed sequentially is limited to 3 (meaning that some 2 of 4 multiplications can be executed in parallel), and the number of linear transformations executed sequentially is limited to 2.
In some exemplary embodiments, elements of the ring redundantly represent elements of a field, and after at least one transformation according to (i) or (ii) the result representing a field element Z is replaced with one of the redundant representations of the same element Z of the field. In some embodiments the redundant representation of the same element Z of the field is chosen randomly.
Alternatively or additionally, in some embodiments the ring includes the field GF(2^8) as a subring.
Additionally, an exemplary IP core for improvement of exponentiation algorithm in redundant AES calculation can be provided.
In some exemplary embodiments, there is provided an intellectual property (IP) core comprising: circuitry that improves performance of a data processor by: in a ring of characteristic 2, computing X^254 by performing a series of:
(i) multiplications of two different elements of the ring; and (ii) raising an element of the ring to a power Z wherein Z is a power of 2 (such operation being a linear transformation); wherein the total number of multiplications is limited to 7, the total number of linear transformations is limited to 6, the number of multiplications executed sequentially is limited to 3, and the number of linear transformations executed sequentially is limited to 1.
In some exemplary embodiments, elements of the ring redundantly represent elements of a field, and after at least one transformation according to (i) or (ii) the result representing a field element Z is replaced with one of the redundant representations of the same element Z of the field. In some embodiments the redundant representation of the same element Z of the field is chosen randomly.
Alternatively or additionally, in some embodiments the ring includes the field GF(2^8) as a subring.
Further additional exemplary IP core for improvement of exponentiation algorithm in redundant AES calculation
FIG. 3 is a schematic representation, indicated generally as 300, of operations performed by a transformation engine of an IP core according to some exemplary embodiments.
FIG. 4 is a schematic representation, indicated generally as 400, of operations performed by a transformation engine of an IP core according to some exemplary embodiments.
Some exemplary embodiments relate to a semiconductor intellectual property (IP) core comprising a transformation engine designed and configured to perform, on elements (312 and 412 in FIG. 3 and FIG. 4 respectively) of a finite ring R (310 and 410 in FIG. 3 and FIG. 4 respectively) represented as GF(p)[x]/(PQ), a sequence of operations comprising at least one member selected from the group consisting of multiplication (FIG. 3) and raising to an integer power (FIG. 4); wherein p is a prime number; wherein P is a polynomial of degree n irreducible over GF(p); wherein Q is a polynomial of degree d over GF(p); wherein elements of R redundantly represent elements of a finite field F = GF(p^n) represented as GF(p)[x]/(P); and wherein a result (Z) (314 and 414 respectively) of at least one of said operations in said sequence is replaced with an element (Z*) (316 and 416 respectively) of R such that Z and Z* redundantly represent a same element of F. In some embodiments the element Z* is chosen randomly. Alternatively or additionally, in some embodiments p = 2. Alternatively or additionally, in some embodiments n = 8. Alternatively or additionally, in some embodiments any element A of R represents an element A mod P of F. Alternatively or additionally, in some embodiments the replacement of Z by Z* is performed by the calculation Z* = Z + CP, wherein C is a polynomial of a degree less than d. Alternatively or additionally, in some embodiments the sequence of operations takes an element X of R representing an element Y of F and calculates an element Z of R representing an element Y^n of F. Alternatively or additionally, in some embodiments the sequence of operations consists only of multiplications and raisings to powers of 2^k. Alternatively or additionally, in some embodiments n = 254.
FIG. 3 depicts calculation of a product of two elements A and B in the ring R, which represent H(A) and H(B) respectively. The ring product AB represents H(AB)=H(A)H(B), so that multiplication of elements H(A) and H(B) of the field F is replaced with multiplication of their representations A and B in the ring R.
FIG. 4 depicts calculation of raising an element A of the ring R, which represents H(A), to a power n. The ring power A^n represents H(A^n) = H(A)^n, so that raising of an element H(A) of the field F to a power is replaced with raising of its representation A in the ring R to that power.
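A brief Python sketch of this compatibility for p = 2, with P taken as the AES polynomial and Q as an arbitrary degree-8 polynomial (both assumptions of the example): reducing a ring result modulo P yields the same field element as performing the operation in the field directly.

```python
import secrets

P = 0x11B                            # irreducible degree-8 polynomial P (assumed)
Q = (1 << 8) | secrets.randbits(8)   # arbitrary degree-8 polynomial Q (d = 8, assumed)

def clmul(a, b):                     # GF(2)[x] multiplication of bit-mask polynomials
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def pmod(a, m):                      # reduction modulo m in GF(2)[x]
    while a.bit_length() >= m.bit_length():
        a ^= m << (a.bit_length() - m.bit_length())
    return a

PQ = clmul(P, Q)

def pmul(a, b, m):                   # multiplication modulo m
    return pmod(clmul(a, b), m)

def ppow(a, n, m):                   # square-and-multiply exponentiation modulo m
    r = 1
    while n:
        if n & 1:
            r = pmul(r, a, m)
        a = pmul(a, a, m)
        n >>= 1
    return r

H = lambda a: pmod(a, P)             # projection R -> F (reduce modulo P)

A, B = secrets.randbits(16), secrets.randbits(16)   # elements of R (deg(PQ) = 16)
assert H(pmul(A, B, PQ)) == pmul(H(A), H(B), P)     # H(AB) = H(A)H(B)
assert H(ppow(A, 254, PQ)) == ppow(H(A), 254, P)    # H(A^n) = H(A)^n
```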
In some embodiments the IP core performs a method, implemented by logic circuitry, of building different representations of a finite ring R represented as GF(p)[x]/(PQ). The method includes a sequence of operations with at least one member selected from the group consisting of multiplication and raising to an integer power; wherein p is a prime number; wherein P is a polynomial of degree n irreducible over GF(p); wherein Q is a polynomial of degree d over GF(p); wherein elements of R redundantly represent elements of a finite field F = GF(p^n) represented as GF(p)[x]/(P); and wherein a result (Z) of at least one of said operations in said sequence is replaced with an element (Z*) of R such that Z and Z* redundantly represent a same element of F. In some embodiments the element Z* is chosen randomly. Alternatively or additionally, in some embodiments p = 2. Alternatively or additionally, in some embodiments n = 8. Alternatively or additionally, in some embodiments any element A of R represents an element A mod P of F. Alternatively or additionally, in some embodiments the replacement of Z by Z* is performed by the calculation Z* = Z + CP, wherein C is a polynomial of a degree less than d.
Alternatively or additionally, in some embodiments the sequence of operations takes an element X of R representing an element Y of F and calculates an element Z of R representing an element Y^n of F. Alternatively or additionally, in some embodiments the sequence of operations consists only of multiplications and raisings to powers of p^k. Alternatively or additionally, in some embodiments n = 254.
Side-Channel Attack on HMAC-SHA-2 and Associated Testing
Some embodiments described in this disclosure are directed to approaches for side-channel attacks on cryptographic algorithms. More specifically, this disclosure describes implementations of side-channel attacks on hash-based message authentication code (HMAC) implementations (e.g., with respect to FIGs. 5 to 11), as well as testing of HMAC implementations for vulnerability to such side-channel attacks, where such testing can be implemented using the approaches for mounting side-channel attacks described herein. The example implementations described herein include performing a template attack on an HMAC implementation. These approaches are generally described with respect to attacks on HMAC-SHA-2 implementations, though the described approaches could be used for mounting a side-channel attack, or testing for vulnerability to a side-channel attack, for other cryptographic algorithm implementations.
In this disclosure, initially, considerations for mounting a side-channel attack and an overview of the side-channel attack implementations disclosed herein are described. Example methods for testing for vulnerability of an HMAC implementation to the disclosed side-channel attack implementations, and for mounting a complete side-channel attack on an HMAC implementation are then described. After discussion of those methods, details regarding a SHA-2 (specifically SHA-256) hash function and HMAC implementations in the context of the disclosed side-channel attack implementations are described, followed by a discussion of example details of an HMAC-SHA-2 side-channel attack. Further, following discussion of the example side-channel attack approaches, experimental results for such approaches are discussed, as well as suggestions for mitigating susceptibility to such attacks.
Considerations for side-channel attacks on HMAC and template attack overview
Side-channel attacks are a class of attacks that can be used to expose secret information (e.g., secret keys, key derivatives, etc.) of cryptographic algorithms by observing side effects of algorithm execution. For instance, such secret information can be leaked (e.g., determined) from various channels during algorithm execution. For instance, such channels can include execution timing, electromagnetic emanation, cache miss patterns, exotic channels such as acoustics, and so forth. However, power side-channel attacks of different types, such as simple power analysis (SPA), differential power analysis (DPA) and correlation power analysis (CPA) remain the most prevalent forms of side-channel attacks used to attack cryptographic algorithms.
While successful approaches for side-channel attacks on many prevalent cryptographic algorithms have been developed, relatively few attacks on cryptographic hash functions, such as SHA-1 and SHA-2 hash algorithms (hash functions), are known. While hash function primitives, such as those in the SHA-2 family, may not themselves involve or use secret information, hash-based message authentication code (HMAC) algorithms implemented with such hash functions can use a secret key to generate a keyed digest of a corresponding message. As HMAC is widely used, it is desirable to ensure that its implementations are secure (e.g., not susceptible to side-channel analysis or attack). However, due to a general belief that no practical attacks on such HMAC implementations exist, there has not been significant effort in developing approaches for assessing their vulnerability to side-channel attacks.
One difficulty of attacking HMAC implementations is the structure of the HMAC algorithm, which, as shown in Equation (1) below, includes two invocations of an underlying hash function on a secret key.
HMAC(K, M) = Hash((K0 ⊕ opad) || Hash((K0 ⊕ ipad) || M)) (1)
where K0 is a known function of a secret key K, M is an input message, and ipad and opad are known constants. The two Hash invocations are referred to as an inner hash and an outer hash, where the variable part of the inputs to the outer hash is an output of the inner hash. Accordingly, even if an adversary has full control over the input data (e.g., the message M) and manages to break the inner hash, the input to the outer hash becomes known, yet not chosen, which limits the ability of the adversary to successfully forge a corresponding HMAC signature. While susceptibility of HMAC implementations, particularly HMAC-SHA-2 implementations, to side-channel attacks has been addressed by several researchers, no successful full side-channel attack has been developed.
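For reference only, Equation (1) can be rendered directly in Python for SHA-256 (ipad = 0x36 and opad = 0x5C repeated to the 64-byte block size, as in the HMAC standard); the helper name below is ours.

```python
import hashlib
import hmac  # used only to cross-check the sketch

BLOCK = 64                       # SHA-256 block size in bytes
IPAD = bytes([0x36]) * BLOCK
OPAD = bytes([0x5C]) * BLOCK

def hmac_sha256(key: bytes, msg: bytes) -> bytes:
    # K0: the key, hashed if longer than one block, then zero-padded to the block size
    k0 = hashlib.sha256(key).digest() if len(key) > BLOCK else key
    k0 = k0.ljust(BLOCK, b"\x00")
    inner = hashlib.sha256(bytes(a ^ b for a, b in zip(k0, IPAD)) + msg).digest()
    return hashlib.sha256(bytes(a ^ b for a, b in zip(k0, OPAD)) + inner).digest()

assert hmac_sha256(b"secret key", b"message") == \
       hmac.new(b"secret key", b"message", hashlib.sha256).digest()
```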
With respect to HMAC-SHA-2 implementations, such algorithms introduce additional complexity, e.g., as compared to HMAC implementations using SHA-1 hash functions. Although the compression functions of both SHA-1 and SHA-2 algorithms mainly include arithmetic operations, there is a substantial difference between hash functions of the two families, as discussed below, that accounts for at least some of this additional complexity.
Specifically, in SHA-1 hash functions, a round function contains a single addition operation involving an input word. The result of this addition is stored in a state register, which can be used as a target for a power correlation attack. In contrast, for SHA-2 hash functions, a round function contains two addition operations involving the input word performed in parallel, where the results are sampled in two different sub-words (e.g., A and E as described in the SHA-2 standard) of an internal state register. Because it is difficult to separate the respective side-channel leakages from the two additions that are executed in parallel, a naive power analysis attack is highly unlikely to be successful.
Based, at least, on the foregoing considerations, the approaches for mounting a side-channel attack on HMAC (e.g., HMAC-SHA-2) implementations disclosed herein include performing a profiling (learning) stage. Without such a profiling stage, it is very difficult to apply a successful DPA/CPA attack on the HMAC-SHA-2 cryptographic algorithm due, at least, to the following two factors. First, using a side-channel attack on the inner hash only, a derivative of the key can be found rather than the key itself; therefore, the outer hash must be attacked as well, but it can only be attacked using known messages, as opposed to chosen messages. Second, attacking SHA-2 using DPA with a known message, as compared to a chosen message, is difficult due to its prevalently linear nature.
The first factor above is important because a correlation-based attack model assumes control/knowledge of the data and a constant key. However, since the first invocation of the SHA-2 compression function works with a constant string (K0 ⊕ ipad), it cannot be successfully attacked using DPA. Further, the second invocation mixes a result of the first invocation with a message (or with an initial part of a message). Accordingly, a successful attack on the second invocation will reveal Hash(K0 ⊕ ipad), but not the key itself. Therefore, Hash(K0 ⊕ opad) must be derived separately. Also, because the input to the outer hash is an output of the inner hash, an adversary can possess knowledge of, but not control of, the inner hash result. As for the second factor above, known-message attacks work well on nonlinear functions for which even a 1-bit change in the input completely changes the output. In contrast, however, due to the linearity and large word sizes of the SHA-2 algebraic constructions, DPA may choose related, but wrong, key hypotheses.
Therefore, in view of at least the considerations above, the approaches for mounting a template attack disclosed herein include performing power analysis of SHA-2 that can incrementally reveal/determine, e.g., in subsets of 1, 2 or 3 bits, respective internal states (e.g., secret internal states) of the inner and outer hash function invocations, which can be, for example, pseudo-random inputs. That is, because a DPA-type analysis is difficult as explained above, the disclosed approaches for obtaining secrets from an HMAC-SHA-2 implementation include performing power analysis by profiling the underlying hash function, or mounting a template attack as described herein.
In order to mount a template attack, such as those described herein, the attack should be performed using an implementation of a target device, or an implementation of a device very similar to the target device, that can be operated with known data (e.g., a known message). In such approaches, the profiling stage can be performed once. The following attack stage can, using a smaller number of traces, then be performed on the profiled device or like devices using template tables that are constructed during the profiling using the approaches described herein. For instance, such template tables can be constructed using a Multivariate Gaussian Model to build the templates. Further, a maximum likelihood approach can be used to match the power traces collected during the attack phase to the template tables. The approaches for template attacks described herein can be based on the described template tables and Euclidean distance for matching. In the examples described herein, during a profiling stage, the addition operation discussed above is split into 2-bit slices that include carry-in and carry-out bits, and, for each slice, a power profile is built. The attack works in successive iterations, matching the slices starting from the least significant and, for each iteration going to the following slice, using the calculated carry-in from the previous iteration.
While the disclosed approaches, for purposes of illustration and example, are described in the context of HMAC-SHA-2 implementations, e.g., using a SHA-256 hash function, these approaches can be applied to other members of the SHA-2 family of hash functions, or to approaches based on other hash functions. Furthermore, as described below, the disclosed template attack implementations, while generally discussed in the context of a single round-per-cycle implementation, can be applied in multiple rounds-per-cycle implementations.
To mount the disclosed template attack approaches, an underlying SHA-2 function should be directly accessible without use of the associated HMAC wrapper. That is, the SHA-2 function should be configured to be invoked independently.
While there have been attempts to implement side-channel attacks on HMAC-SHA-2 implementations using power analysis techniques, those attacks have been unsuccessful in forging HMAC signatures. This disclosure describes approaches for mounting a successful template attack on HMAC-SHA-2 implementations, which approaches have been experimentally verified. These experiments were performed based on an open-source hardware SHA-256 implementation that was implemented in two ways, e.g., using a pre-silicon side-channel leakage simulator, and using a field-programmable gate array (FPGA). In both experimental implementations, the disclosed template attack approaches provided for discovery of key derivatives that allow for successfully forging HMAC signatures. On the FPGA implementation, an example attack (e.g., trace acquisition and analysis) took approximately two hours, including a profiling stage and attack stage, as described below, and about half an hour excluding the profiling stage (e.g., for only the attack stage).
Example methods
FIG. 5 is a flowchart illustrating a method 500 for testing an HMAC implementation for vulnerability to a side-channel attack according to an aspect. That is, FIG. 5 illustrates an example method for testing susceptibility of an HMAC implementation and/or a hash function primitive (e.g., a SHA-2 hash function) of an HMAC implementation to a side-channel attack (e.g., a template attack). The method of FIG. 5 can be implemented using the approaches for mounting a template attack described herein. Further, the method 500 of FIG. 5 is provided by way of example and for purposes of illustration, and other methods for testing HMAC and/or SHA-2 implementations for such vulnerability using the disclosed approaches are possible. For purposes of brevity and clarity, some details of the disclosed template attack approaches are not described with respect to FIG. 5, but are, instead, described below.
The example method 500 of FIG. 5 includes, as noted above, mounting a template attack on a hash function used to implement an HMAC algorithm and/or on the HMAC implementation. In the method 500, the template attack includes, at block 510, generating, based on first side-channel leakage information associated with execution of the hash function (e.g., when executing a set of profiling vectors), a plurality of template tables. Each template table of the plurality of template tables can correspond, respectively, with a subset of bit positions of an internal state of the hash function. At block 520, the method 500 includes generating, based on second side-channel leakage information (e.g., when executing a set of attack vectors on the HMAC implementation), a plurality of hypotheses for an internal state of an invocation of the hash function based on a secret key. At block 530, the method 500 includes generating, using the hash function, respective hash values generated from each of the plurality of hypotheses and a message and, at block 540, comparing each of the respective hash values with a hash value generated using the secret key and the message. At block 550, the method 500 includes, based on the comparison, determining vulnerability of the HMAC algorithm based on a hash value of the respective hash values matching the hash value generated using the secret key and the message. That is, if a calculated hash value matches the generated hash value, the HMAC implementation is considered to be vulnerable to side-channel attacks.
FIG. 6 is a flowchart illustrating a method 600 for mounting a template attack on an HMAC implementation according to an aspect. The template attack of FIG. 6, which can be implemented using the approaches described herein, can be the basis of testing for side-channel attack vulnerability, such as using the method 500 of FIG. 5, or can be implemented in other ways and/or in other applications. As with FIG. 5, FIG. 6 is provided by way of example and for purposes of illustration. That is, other methods for performing (mounting, executing, implementing, etc.) a template attack on a given HMAC implementation using the approaches described herein are possible. For purposes of brevity and clarity, some details of the disclosed template attack approaches are not described with respect to FIG. 6, but are, instead, described below.
The example method 600 of FIG. 6 includes, at block 610, collecting, while executing a hash function used to produce the HMAC (e.g., using profiling vectors), first side-channel leakage information corresponding with overwriting values of an internal state of the hash function. In an implementation, the first side-channel information can be based on a Hamming distance model. At block 620, the method 600 includes generating a plurality of template tables, each template table corresponding, respectively, with a subset of bits of the internal state of the hash function. Each template table of the plurality of template tables at block 620 can include rows that are indexed using values of the respective subset of bits. The rows of the template table can include respective side-channel leakage information of the first side-channel leakage information that is associated with the index values.
At block 630, the method 600 includes collecting second side-channel leakage information associated with producing the HMAC (e.g., using a set of attack vectors). At block 640, the method 600 includes identifying (or selecting), based on comparison of the second side-channel leakage information with the plurality of template tables, a first plurality of hypotheses for an internal state of an inner invocation of the hash function. In the example of FIG. 6, the method 600 further includes, at block 650, identifying, based on comparison of the second side-channel leakage information with the plurality of template tables, a second plurality of hypotheses for an internal state of an outer invocation of the hash function and, at block 660, selecting, using pairs of hypotheses each including a hypothesis of the first plurality of hypotheses and a hypothesis of the second plurality of hypotheses, a first hypothesis of the first plurality of hypotheses and a second hypothesis of the second plurality of hypotheses for forging the HMAC. The operation at block 660 can include performing a brute-force attack using the hypotheses identified at blocks 640 and 650 to identify the correct hypotheses for respective internal states (e.g., based on a respective secret key) for an inner SHA invocation and an outer SHA invocation of the HMAC implementation being attacked.
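Purely as a schematic sketch of this flow (the helper signatures below are hypothetical placeholders, not an API of the described implementation), the blocks of FIG. 6 can be arranged as follows in Python:

```python
from typing import Callable, Dict, List, Sequence, Tuple

Trace = Tuple[bytes, List[float]]   # (input data, power samples) -- assumed format

def template_attack(
    collect_profiling_traces: Callable[[], Sequence[Trace]],
    collect_attack_traces: Callable[[], Sequence[Trace]],
    build_templates: Callable[[Sequence[Trace]], Dict],
    match_state: Callable[[Dict, Sequence[Trace], str], List[bytes]],
    forge: Callable[[bytes, bytes, bytes], bytes],
    real_hmac: Callable[[bytes], bytes],
    message: bytes,
):
    templates = build_templates(collect_profiling_traces())      # blocks 610-620
    attack_traces = collect_attack_traces()                      # block 630
    s_in_hyps = match_state(templates, attack_traces, "inner")   # block 640
    s_out_hyps = match_state(templates, attack_traces, "outer")  # block 650
    for s_in in s_in_hyps:                                       # block 660: brute force
        for s_out in s_out_hyps:                                 # over hypothesis pairs
            if forge(s_in, s_out, message) == real_hmac(message):
                return s_in, s_out                               # correct pair found
    return None                                                  # attack failed
```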
Experimental Setup
FIG. 7 is a block diagram schematically illustrating an experimental setup 700 for performing side-channel template attacks and associated vulnerability testing on an HMAC implementation according to an aspect. As with the methods of FIGs. 5 and 6, the experimental setup (setup) 700 of FIG. 7 is given by way of example and for purposes of illustration. Additional details of example experimental setups are described below.
As shown in FIG. 7, the setup includes external data 710 that is applied to an HMAC implementation 720. The external data 710 can include, for example, learning or profiling vectors and attack vectors, as well as other data used for performing a template attack, such as the various parameters described herein. The HMAC implementation 720 of FIG. 7 includes a secret key (K) 722 and a hash function 724, which can be used in implementing an HMAC construction in accordance with Equation 1 presented above. As described herein, the hash function 724, which is described herein, by way of example, as a SHA-2 (e.g., SHA-256) hash function implementation, should be invokable independently of the HMAC implementation 720. In mounting a template attack, as described herein, the learning vectors of the external data 710 are applied to the hash function (e.g., as known data, with a known key), while the attack vectors of the external data 710 are applied to the HMAC implementation using the secret key 722. As shown in FIG. 7, the setup 700 also includes a side-channel leakage measurement device or block 730, which is configured to collect side-channel leakage information associated with executing the hash function 724 using the learning or profiling vectors, and to collect side-channel information associated with executing the HMAC implementation 720 using the attack vectors.
As further shown in FIG. 7, the side-channel leakage measurement 730 can be configured to provide side-channel leakage information (e.g., associated with the learning vectors) to a profiling module 740, which can be configured to generate template tables 750, such as those described herein. The side-channel leakage measurement 730 can be further configured to provide side-channel leakage information (e.g., associated with the attack vectors) to an attack module 760, where the attack module can be configured to perform a multi-step attack on the HMAC implementation 720, such as using the techniques described herein.
SHA-2 and HMAC
As noted above, the template attack and associated side-channel attack vulnerability testing approaches disclosed herein are described as being mounted on, or applied to, an HMAC implementation (e.g., as defined in Equation 1) that is implemented using a SHA-2 hash function, with specific reference to a SHA-256 hash. As context for discussion of these approaches, the following is a discussion of SHA-2 (SHA-256) and HMAC, as well as a specific (alternate) notation for SHA-256 that is used for discussion of the disclosed template attack approaches. For instance, FIG. 8 is a block diagram illustrating a SHA-256 hash function implementation according to an aspect, while FIG. 9 is a diagram illustrating the specific notation for SHA-256 used herein. It is noted that, for purposes of brevity, some details of the SHA-256 hash function implementation shown in FIG. 8 that are not directly relevant to the disclosed approaches may not be specifically described herein.
Referring to FIG. 8, an execution flow 800 for a SHA-256 hash function is shown. As illustrated in FIG. 8, a message 810 (e.g., of arbitrary length) is provided to a pre-processing stage 820. The pre-processing stage 820 generates a message schedule 830 based on 512-bit chunks or blocks. The message schedule 830, which is generated by expanding a corresponding 512-bit block, can then be output, as sixty-four (64) 32-bit words, to 64 respective compression function stages (stage 0 to stage 63), of which compression stage 0 840, compression stage 1 850, and compression stage 63 860 are shown. The compression function stages can also be referred to as rounds (calculation rounds). FIG. 8 also illustrates a detailed diagram of two 256-bit wide compression stages of the illustrated SHA-256 hash function (e.g., compression stages 840, 850).
The SHA-2 family of hash algorithms (including the SHA-256 function of FIG. 8) utilize the Merkle-Damgard construction, in which the input message 810 (properly padded) is represented as a sequence of blocks Bl0, Bl1, ..., Bln-1, and the hash function is iteratively calculated (using the 64 compression stages and an arithmetic stage 870) as Sj+1 = CF(Sj, Blj) for j ∈ [0, 1, ..., n-1], where CF is the hash algorithm's compression function, S0 is a predetermined constant, and Sn is the final output (the hash value 880). The compression function CF(Sj, Blj) for SHA-2 hash functions is calculated in the following steps (as is shown for SHA-256 in FIG. 8):
1. The message schedule 830 expands the input block Blj to a sequence of s × t-bit "words" W0, W1, ..., Ws-1, where s = 64, t = 32 for SHA-224 and SHA-256; and s = 80, t = 64 for SHA-512/224, SHA-512/256, SHA-384 and SHA-512. The particular details of how the expansion algorithm operates do not affect the approaches for executing a template attack described herein.
2. The round function RF is applied s times (e.g., by the compression stages 840, 850, ..., 860) so that Ri = RF(Ri-1, Wi, Ki) for i ∈ [0, 1, ..., s-1], where Ki are predefined "round constants" and R-1 = Sj.
3. An output of the compression function CF is then calculated as a word-wise sum modulo 2^t of R-1 = Sj and Rs-1.
For the round function RF for SHA-2, the internal state (initial internal state) Ri-1 is split into eight t-bit words Ai-1, Bi-1, Ci-1, Di-1, Ei-1, Fi-1, Gi-1, Hi-1. A next internal state Ri is calculated from Ri-1 (the previous internal state) based on Equations 2-8 below, where ⊞ denotes addition modulo 2^t:
T1 = Hi-1 ⊞ Σ1(Ei-1) ⊞ Ch(Ei-1, Fi-1, Gi-1) ⊞ Ki ⊞ Wi (2)
T2 = Σ0(Ai-1) ⊞ Maj(Ai-1, Bi-1, Ci-1) (3)
Ai = T1 ⊞ T2 (4)
Ei = Di-1 ⊞ T1 (5)
Bi = Ai-1, Fi = Ei-1 (6)
Ci = Bi-1, Gi = Fi-1 (7)
Di = Ci-1, Hi = Gi-1 (8)
It is noted, as is relevant for the disclosed template attack approaches, that in every round (e.g., at every RF execution), only two words of the internal state are calculated, while the remaining six words of the internal state are copied from the previous internal state under a different name, such as is illustrated in FIG. 8 for compression stages (RF executions) 840, 850.
As noted above, for convenience in describing the disclosed template attack approaches, a different, or alternate notation is used for the SHA-2 internal state, in which every word of the internal state receives a unique name that does not change from round to round. This notation is illustrated in FIG. 9, which schematically illustrates, for a SHA-2 implementation using the alternate notation, an initial internal state 910 and resulting, respective internal states 920, 930, 940 for three successive rounds. In the example of FIG. 9, as in FIG. 8, arrows show copy operations, where all words of a given internal state that have incoming arrows receive an exact copy of a word from the internal state of the previous round. The remaining words of the internal states (without incoming arrows, or copied values) receive results of manipulated data from the previous round (e.g., newly calculated or generated words).
As shown in FIG. 9, using the alternate notation, the words of the initial state R-1 910 are designated as A-1, A-2, A-3, A-4, E-1, E-2, E-3, E-4. The state Ri after round i (e.g., states 920, 930, 940, and so forth) can be designated as Ai, Ai-1, Ai-2, Ai-3, Ei, Ei-1, Ei-2, Ei-3. The purpose for use of this indexing for describing the disclosed template attack and testing approaches is to assign the index 0 to the result of the first calculation, and to assign negative indices to words of the internal state that are merely copies of the initial state 910, as is illustrated by FIG. 9. Therefore, using the alternate notation, the only two words of the internal state that are newly calculated or generated at every round are Ai and Ei, and they are calculated using the following Equations 9-14:
ei = Ei-4 ⊞ Σ1(Ei-1) ⊞ Ch(Ei-1, Ei-2, Ei-3) ⊞ Ki (9)
ai = Σ0(Ai-1) ⊞ Maj(Ai-1, Ai-2, Ai-3) (10)
ΔEi = Ai-4 ⊞ ei (11)
ΔAi = ei ⊞ ai (12)
Ei = ΔEi ⊞ Wi (13)
Ai = ΔAi ⊞ Wi (14)
Note that ei in Equation 9 is different from T1 in Equation 2, in that the calculation of ei does not include Wi (the respective portion of the message schedule for a given round) as an addend. Therefore, ΔAi and ΔEi depend on the previous state, but not on Wi. In particular, ΔA0 and ΔE0 depend only on the initial state R-1.
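As an illustration (assuming the standard SHA-256 round function and the equations as reconstructed above), the following Python sketch checks that adding Wi only at the end, per Equations 13-14, produces the same new state words as the conventional formulation in which Wi is an addend of T1:

```python
import secrets

MASK = 0xFFFFFFFF

def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK

def big_sigma1(e): return rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
def big_sigma0(a): return rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
def ch(e, f, g):   return (e & f) ^ (~e & MASK & g)
def maj(a, b, c):  return (a & b) ^ (a & c) ^ (b & c)

def round_standard(state, k, w):
    a, b, c, d, e, f, g, h = state
    t1 = (h + big_sigma1(e) + ch(e, f, g) + k + w) & MASK
    t2 = (big_sigma0(a) + maj(a, b, c)) & MASK
    return ((t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g)

def round_alternate(state, k, w):
    a, b, c, d, e, f, g, h = state
    e_i = (h + big_sigma1(e) + ch(e, f, g) + k) & MASK   # Eq. 9: no Wi addend
    a_i = (big_sigma0(a) + maj(a, b, c)) & MASK          # Eq. 10
    delta_e = (d + e_i) & MASK                           # Eq. 11
    delta_a = (e_i + a_i) & MASK                         # Eq. 12
    return ((delta_a + w) & MASK, a, b, c,               # Eq. 14
            (delta_e + w) & MASK, e, f, g)               # Eq. 13

st = tuple(secrets.randbits(32) for _ in range(8))
k, w = secrets.randbits(32), secrets.randbits(32)
assert round_standard(st, k, w) == round_alternate(st, k, w)
```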
As previously discussed, HMAC is a Message Authentication Code (MAC) algorithm that is based on a hash function, where an HMAC construction is defined by Equation 1 presented above. In HMAC implementations, a modified key K0 is derived from a secret key K such that, regardless of the size of K, the size of K0 is equal to the block size of the function Hash used to implement the HMAC construction. The two applications of the function Hash during the HMAC calculation can be referred to as an "inner" application or invocation and an "outer" application or invocation.
If Hash is a function from the SHA-2 family, e.g. SHA-256, then for a fixed K the first application of the SHA-256 compression function in the inner SHA-256 calculates Sin = CF(S0, K0 ⊕ ipad), and in the outer SHA-256 calculates Sout = CF(S0, K0 ⊕ opad). Note that both Sin and Sout depend only on K. The goal of the disclosed attack approaches is to find Sin and Sout. Since it is difficult to invert a compression function (e.g., of a SHA-2 hash function), it follows that it is difficult to derive K or K0 from Sin and Sout. However, in order to mount a successful attack, such derivation of K or K0 is not necessary, because an attacker who knows both Sin and Sout (for an HMAC construction based on SHA-256) can forge HMAC-SHA256(K, M) for any message M, which is the ultimate goal of an attack on a MAC algorithm.
It follows that, in order to find Sin and Sout in such implementations, both the inner and outer SHA-256 must be attacked. In the disclosed approaches, there is a subtle difference (consideration) between mounting the two attacks. That is, when attacking the inner SHA-256, an attacker may choose the message M. This is not the case with the outer SHA-256, because the variable part of the input to it is the output of the inner SHA-256, Sin, which may be known to the attacker, but cannot be chosen arbitrarily. This factor makes designing an attack on the outer SHA-256 more difficult. The approaches for mounting a template attack described below work for attacking both the inner hash function invocation and the outer hash function invocation of HMAC constructions (e.g., implemented using SHA-2 hash functions).
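To make the forging claim concrete, the following self-contained Python sketch reimplements the SHA-256 compression function (this is an illustrative reimplementation, not the attacked hardware design; the helper names are ours) and shows that knowledge of Sin and Sout alone, without K, suffices to compute a valid HMAC-SHA256 tag for an arbitrary message.

```python
import hashlib, hmac, struct

K = [0x428A2F98, 0x71374491, 0xB5C0FBCF, 0xE9B5DBA5, 0x3956C25B, 0x59F111F1, 0x923F82A4, 0xAB1C5ED5,
     0xD807AA98, 0x12835B01, 0x243185BE, 0x550C7DC3, 0x72BE5D74, 0x80DEB1FE, 0x9BDC06A7, 0xC19BF174,
     0xE49B69C1, 0xEFBE4786, 0x0FC19DC6, 0x240CA1CC, 0x2DE92C6F, 0x4A7484AA, 0x5CB0A9DC, 0x76F988DA,
     0x983E5152, 0xA831C66D, 0xB00327C8, 0xBF597FC7, 0xC6E00BF3, 0xD5A79147, 0x06CA6351, 0x14292967,
     0x27B70A85, 0x2E1B2138, 0x4D2C6DFC, 0x53380D13, 0x650A7354, 0x766A0ABB, 0x81C2C92E, 0x92722C85,
     0xA2BFE8A1, 0xA81A664B, 0xC24B8B70, 0xC76C51A3, 0xD192E819, 0xD6990624, 0xF40E3585, 0x106AA070,
     0x19A4C116, 0x1E376C08, 0x2748774C, 0x34B0BCB5, 0x391C0CB3, 0x4ED8AA4A, 0x5B9CCA4F, 0x682E6FF3,
     0x748F82EE, 0x78A5636F, 0x84C87814, 0x8CC70208, 0x90BEFFFA, 0xA4506CEB, 0xBEF9A3F7, 0xC67178F2]
IV = [0x6A09E667, 0xBB67AE85, 0x3C6EF372, 0xA54FF53A, 0x510E527F, 0x9B05688C, 0x1F83D9AB, 0x5BE0CD19]
M32 = 0xFFFFFFFF

def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & M32

def compress(state, block):
    """The SHA-256 compression function CF(state, block)."""
    w = list(struct.unpack(">16I", block))
    for i in range(16, 64):
        s0 = rotr(w[i - 15], 7) ^ rotr(w[i - 15], 18) ^ (w[i - 15] >> 3)
        s1 = rotr(w[i - 2], 17) ^ rotr(w[i - 2], 19) ^ (w[i - 2] >> 10)
        w.append((w[i - 16] + s0 + w[i - 7] + s1) & M32)
    a, b, c, d, e, f, g, h = state
    for i in range(64):
        t1 = (h + (rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)) + ((e & f) ^ (~e & M32 & g)) + K[i] + w[i]) & M32
        t2 = ((rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)) + ((a & b) ^ (a & c) ^ (b & c))) & M32
        h, g, f, e, d, c, b, a = g, f, e, (d + t1) & M32, c, b, a, (t1 + t2) & M32
    return [(x + y) & M32 for x, y in zip(state, (a, b, c, d, e, f, g, h))]

def sha256_from_state(state, data, already_hashed):
    """Continue SHA-256 from a known 8-word state after `already_hashed` bytes."""
    msg = data + b"\x80" + b"\x00" * ((55 - already_hashed - len(data)) % 64) \
          + struct.pack(">Q", (already_hashed + len(data)) * 8)
    for off in range(0, len(msg), 64):
        state = compress(state, msg[off:off + 64])
    return struct.pack(">8I", *state)

def forge_hmac_sha256(s_in, s_out, message):
    """Forge HMAC-SHA256(K, message) knowing only Sin and Sout (never K itself)."""
    inner = sha256_from_state(s_in, message, 64)   # inner hash: 64 key-block bytes already consumed
    return sha256_from_state(s_out, inner, 64)     # outer hash: 64 key-block bytes already consumed

# Self-check: derive Sin/Sout from a known key, forge, and compare with the real HMAC.
key = b"\x11" * 64
s_in = compress(IV, bytes(b ^ 0x36 for b in key))
s_out = compress(IV, bytes(b ^ 0x5C for b in key))
msg = b"forge me"
assert forge_hmac_sha256(s_in, s_out, msg) == hmac.new(key, msg, hashlib.sha256).digest()
```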
In the discussion of the disclosed template attack approaches, the various factors and values, in particular traces and input words, are numbered starting from 0, with the exception being the initial words A and E of a SHA-256 internal state, which are numbered starting from -4, such as described above and shown in FIG. 9. The bits in each word are also numbered starting from 0, where index 0 corresponds to the least significant bit of each word.
• X[i : j] stands for bits j ... i of the word X (32 > i ≥ j ≥ 0).
• Carry(x, y, i) stands for the carry bit into the bit position i when adding x and y.
• Wi^t stands for the ith input word corresponding to the trace with index t in both the profiling set and the attack set.
• Ai^t and Ei^t stand for the words Ai and Ei respectively in the calculation corresponding to the trace with index t. Note that for negative lower indices i, the words Ai^t, Ei^t do not depend on t: for the first (profiling) attack stage they are the words of the standard initial state S0, and for the second and third attack stages they are the secret words of Sin or Sout, respectively. For this reason, for purposes of clarity, the upper index is omitted when the lower index is negative.
Template attack on HMAC-SHA-2
When mounting a side-channel attack on a cryptographic algorithm, the conventional objective is to obtain, discover or derive a secret key K. However, as described above, in HMAC constructions, the secret key K does not interact directly with data that an adversary can know. Accordingly, the secret key K cannot be obtained by statistical analysis. Nevertheless, since the ultimate goal of an adversary is to be able to forge signatures, for an attack on an HMAC implementation, it is sufficient to obtain the two values Sin = CF(S0, K0 ⊕ ipad) and Sout = CF(S0, K0 ⊕ opad) for an HMAC implementation (or similar implementation) being attacked.
For the template attack approaches on HMAC implementations disclosed herein, it is presumed that, when conducting a profiling stage, an adversary has access to a pure hash function (e.g., SHA-256) invocation, independent of an associated HMAC construction or implementation. In the profiling stage, for the disclosed examples, CF(M) is calculated using a SHA-256 engine on a variety of one-block messages M, and an associated profiling set of power traces is acquired from side-channel leakage measurements. These traces are processed to generate the template tables described below, where the template tables are then used for matching during a multi-step attack stage, as is also described below.
In the attack stage, the secret key K is unknown, and the input message M is known, but not necessarily controlled by the adversary. The attack stage is applied (performed, mounted, executed, etc.) twice, first on an inner hash calculation and then on an outer hash calculation. In the attack stage, a set of power traces (the attack set) is acquired for the calculation of HMAC-SHA256(K, M) for a variety of messages M (e.g., attack vectors). In an implementation, it may be sufficient to record or capture only certain parts of every trace, e.g., respective portions corresponding to the first two rounds of the second block calculation in both the inner SHA-256 invocation and the outer SHA-256 invocation. It is noted that, because the first blocks of both the inner SHA-256 invocation and the outer SHA-256 invocation are constant, being dependent only on the secret key K, any side-channel data corresponding to the first block bears no useful information for forging an associated HMAC signature.
In the disclosed template attack approaches, Sin can be determined using the template tables (generated during the profiling stage) and the portions of the traces (e.g., the attack traces) corresponding to the inner SHA-256 invocation. Knowing Sin, it is possible to calculate SHA256((K0 ⊕ ipad) || M) for every trace, thus obtaining the input message to the outer SHA-256. After determining Sin, Sout is determined using the same template tables and the portions of the traces (e.g., the attack traces) corresponding to the outer SHA-256 invocation.
In the following discussion, example approaches for mounting an HMAC-SHA-2 template attack are first described presuming that a compression function of a corresponding SHA-256 implementation calculates one round (of the CF) in one clock cycle. Following that discussion, example approaches for applying the disclosed approaches to SHA-256 implementations that calculate two or three rounds-per-cycle are described.
As described herein, in a profiling stage of an HMAC template attack, or in testing for vulnerability to a side-channel attack, a set of traces is collected (e.g., as side-channel leakage information associated with implementation of an associated hash function) and a fixed-size set of template tables is generated from the collected traces. Different template tables of the generated template tables can correspond to different execution rounds and/or to different bit positions of words of a corresponding hash function’s internal state. In each table of the template tables, a set of all the traces can be split into a set of disjoint sets, where a given line of a respective template table can correspond to one of these sets, and can contain the corresponding traces averaged over that set. These disjoint sets, in example implementations, are characterized by values of specific bits in the SHA-256 round function calculation, as described below, as illustrated by FIGs. 10A and 10B.
FIG. 10A illustrates operation of an example 2-bit adder unit 1000 that can be used to build template tables for use in the side-channel attack approaches described herein, while FIG. 10B illustrates example corresponding template table entries. In example implementations, a respective adder unit can be used to build each template table, where entries in the tables are indexed by a 12-bit vector, including the adder inputs, inclusive of carry bit(s) and a previous state of the corresponding bits of the state register. As an example, FIGs. 10A and 10B illustrate an example calculation of table entries for Ai.
For instance, the adder unit 1000 of FIG. 10A schematically illustrates part of an addition operation of an input word Wi corresponding to a trace and a word ΔAi (such as described herein). In FIG. 10A, the input word Wi 1010 contains bits 1011 at positions 2k+3 ... 2k before round i, and the word ΔAi 1020 contains bits 0010 at the same positions before round i. In this example, a two-bit adder 1030, for positions 2k+1 ... 2k, receives input bits 11 from the input word Wi 1010, bits 10 from the word ΔAi 1020, and a carry bit 0 from the addition at lower bit positions. The adder 1030 then calculates 11 + 10 + 0 = 101 in binary, of which the two least significant bits 01 are bits of a new state at positions 2k+1 ... 2k, and the most significant bit 1 is passed as a carry bit to another two-bit adder 1040 for positions 2k+3 ... 2k+2. Similarly, the two-bit adder 1040 receives, as input bits, 10 from the input word Wi 1010, bits 00 from the word ΔAi 1020, and the carry bit 1 from the adder 1030. The adder 1040 then calculates 10 + 00 + 1 = 011 in binary, of which the two least significant bits 11 are bits of a new state at positions 2k+3 ... 2k+2, and the most significant bit 0 is passed to a next two-bit adder (not shown) as a carry bit. Calculated bits 01 from adder 1030 replace previously stored bits 10 at positions 2k+1 ... 2k in a register 1050, and calculated bits 11 from adder 1040 replace previously stored bits 01 at positions 2k+3 ... 2k+2 in the register 1050.
FIG. 10B illustrates a portion of a template table 1060, showing line indices corresponding to the example trace discussed, e.g., in a table corresponding to round i and bit positions 2k+1 ... 2k and in a table corresponding to round i and bit positions 2k+3 ... 2k+2. For simplicity it is assumed, in this illustrative example, that all relevant bits from the addition between the input word Wi 1010 and a word ΔEi (not shown) are zeros. Then, in this instance, the index 1070 of the trace in the table 1060 corresponding to round i and bit positions 2k+1 ... 2k is 00 10 00 10 00 11, where (from left to right):
1) bits 00 correspond to the assumed zero bits of ΔEi,
2) bits 10 correspond to bits 10 of ΔAi,
3) bits 00 correspond to the assumed zero bits of Ei-1,
4) bits 10 correspond to bits 10 of Ai-1,
5) bits 00 correspond to the carry bits into bit position 2k in both the addition of Wi with ΔEi and the addition of Wi with ΔAi, and
6) bits 11 correspond to bits 11 of Wi.
Similarly, as shown in FIG. 10B, an index 1080 of the example trace in the table 1060 corresponding to round i and bit positions 2k+3 ... 2k+2 is 00 00 00 01 01 10, where (from left to right):
1) bits 00 correspond to the assumed zero bits of ΔEi,
2) bits 00 correspond to bits 00 of ΔAi,
3) bits 00 correspond to the assumed zero bits of Ei-1,
4) bits 01 correspond to bits 01 of Ai-1,
5) bits 01 correspond to the carry bits into bit position 2k+2 in both the addition of Wi with ΔEi and the addition of Wi with ΔAi, and
6) bits 10 correspond to bits 10 of Wi.
Continuing from the discussion above regarding execution of a SHA-256 calculation using the described alternate notation, in round i, two new values are calculated: Ai = ΔAi ⊞ Wi and Ei = ΔEi ⊞ Wi. If one round is calculated in one cycle, these values overwrite Ai-1 and Ei-1, respectively.
In the disclosed approaches for mounting an HMAC attack, the vectors Ai-1, ΔAi, Ei-1, ΔEi can be found or determined by splitting those vectors into windows of size J bits for different values of i. In such an approach, the value of J determines the size of the template tables, so it should be kept reasonably small. By way of example, for J = 3, the traces would be divided into 2^(5J+2) = 2^17 groups. By way of comparison, for J = 1, a one-bit addition with carry is a linear operation, and in general it is more difficult to mount side-channel attacks on linear functions. Accordingly, choosing J = 2 can be a good trade-off between accuracy and complexity, and the following description presumes that J = 2.
One aim of the profiling set (vector set or chosen messages) is to characterize a part of the side-channel information corresponding to the overwriting of Ai-1 and Ei-1 with the new values Ai = ΔAi ⊞ Wi and Ei = ΔEi ⊞ Wi. To achieve that aim, in an example implementation, the calculation is split into two-bit units, indexed by k. For this purpose, for every k, where 0 ≤ k < 16, and for some value(s) of the round index i, corresponding traces can be split into 2^12 groups, according to the values of the following bits:
1. Ai-1[2k + 1 : 2k] (2 bits)
2. Ei-1[2k + 1 : 2k] (2 bits)
3. ΔAi[2k + 1 : 2k] (2 bits)
4. ΔEi[2k + 1 : 2k] (2 bits)
5. Carry(ΔAi, Wi, 2k) (1 bit)
6. Carry(ΔEi, Wi, 2k) (1 bit)
7. Wi[2k + 1 : 2k] (2 bits)
Such 12-bit vectors, as defined above, can be split into three groups, where the 8 unknown bits of data items (items 1-4 in the list above) are designated as g, the two carry bits obtained from iteration k-1 (items 5-6 in the list above) are designated as c, and the two known message bits W_i[2k+1 : 2k] (item 7 in the list above) are designated as w. An average value of a sample number s over all traces with specific values g, c, w at the round number i and the bit position k can be designated M^{i,k}_{g,c,w,s}. An illustrative sketch of this classification and averaging is given below.
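The following Python sketch illustrates the classification and averaging just described; the helper names (bits, carry, classify, build_averages), the bit packing of g, and the use of numpy are assumptions made for the example only.

```python
from collections import defaultdict
import numpy as np

def bits(x: int, k: int, j: int = 2) -> int:
    """Extract the j-bit window x[j*k + j - 1 : j*k]."""
    return (x >> (j * k)) & ((1 << j) - 1)

def carry(x: int, y: int, pos: int) -> int:
    """Carry into bit position pos of the addition x + y (zero for pos = 0)."""
    mask = (1 << pos) - 1
    return ((x & mask) + (y & mask)) >> pos & 1

def classify(k: int, a_prev: int, e_prev: int, d_a: int, d_e: int, w_word: int):
    """Return the (g, c, w) key of a trace for window k (J = 2)."""
    g = (bits(a_prev, k) << 6) | (bits(e_prev, k) << 4) | (bits(d_a, k) << 2) | bits(d_e, k)
    c = (carry(d_a, w_word, 2 * k) << 1) | carry(d_e, w_word, 2 * k)
    w = bits(w_word, k)
    return g, c, w

def build_averages(traces, keys):
    """Average the traces (numpy arrays) that share the same (g, c, w) key."""
    groups = defaultdict(list)
    for trace, key in zip(traces, keys):
        groups[key].append(trace)
    return {key: np.mean(np.stack(ts), axis=0) for key, ts in groups.items()}
```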
Points of Interest (POIs) can be identified from the template tables using the following approach. It is noted, for purposes of this discussion, that both of the indices i and s, as discussed herein, correspond to a time offset in the calculation. Therefore, if the points on the time axis corresponding to these two indices are far apart, no dependency of M^{i,k}_{g,c,w,s} on w should be expected. For instance, a sample taken in round j should not depend on the bits of the input in round i, if i and j are sufficiently spread apart (e.g., spread apart in time and/or rounds). Because correspondence between the two indices may not necessarily be known a priori, a technique to find out which pairs (i, s) bear relevant information and to drop all other pairs can be used. For example, one of the two following techniques can be used:
1. For every round i and for every trace index t, calculate, using Equation 15 below:
hd^t_i = HD(A^t_i, A^t_{i-1}) + HD(E^t_i, E^t_{i-1})   (15)
where HD stands for Hamming distance.
Then for every s, a correlation coefficient between the vectors hd^t_i and T^t_s (the sample with index s of the trace with index t) can be calculated. Pairs (i, s) with low correlations can then be dropped or ignored.
2. For every round i and for every s, a standard deviation of M^{i,k}_{g,c,w,s} over all values of k, g, c, w can be calculated. Pairs (i, s) with low standard deviations can then be dropped or ignored.
In experiments on HMAC implementations, the above two techniques yield similar results. In the experimental results described herein, the second technique was used to determine the points of interest. As a result, for every entry (an averaged trace) in the table, only several points of interest remained. In this discussion, the number of selected or determined POIs is designated as p.
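The following sketch illustrates both selection techniques; the array shapes, thresholds, and function names are assumptions for the example, and numpy is assumed to be available.

```python
import numpy as np

def pois_by_std(M: np.ndarray, num_pois: int) -> np.ndarray:
    """Technique 2: keep the sample indices s with the largest standard deviation of
    the averaged traces M[k, g, c, w, s] over all (k, g, c, w), for a fixed round i."""
    flat = M.reshape(-1, M.shape[-1])           # (groups, samples)
    return np.argsort(flat.std(axis=0))[::-1][:num_pois]

def pois_by_correlation(hd: np.ndarray, traces: np.ndarray, threshold: float) -> np.ndarray:
    """Technique 1: keep the sample indices whose correlation with the per-trace
    Hamming distance of Equation 15 exceeds a threshold. hd: (T,), traces: (T, S)."""
    hd_c = hd - hd.mean()
    tr_c = traces - traces.mean(axis=0)
    corr = (hd_c @ tr_c) / (np.linalg.norm(hd_c) * np.linalg.norm(tr_c, axis=0) + 1e-12)
    return np.flatnonzero(np.abs(corr) > threshold)
```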
In mounting the disclosed template attack approaches, an average level of a signal (e.g., side-channel leakage) is likely to be different between respective profiling sets and attack sets. This difference can be due, in part, to the fact that the calculations in the first round of the second block of the attack set start from the same (unknown) internal state, while in the profiling set, the internal state before a round is distributed uniformly. To accommodate for this difference, the values M^{i,k}_{g,c,w,s} can be normalized by subtracting an average over the four values M^{i,k}_{g,c,w*,s} with the same i, k, g, c, s and all possible values of w*.
When performing experimental attacks, using such normalization results in successful attacks (presuming enough traces have been acquired). In experiments without normalization with similar amounts of traces, on both FPGA and simulation implementations, the attacks have failed, and would require a significantly higher number of traces to be potentially successful.
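A minimal sketch of this normalization follows, assuming the averaged values are stored in a numpy array whose second-to-last axis enumerates the four values of w (an illustrative layout, not the described implementation).

```python
import numpy as np

def normalize_over_w(M: np.ndarray) -> np.ndarray:
    """Subtract, from every entry, the average over the four values of w that share
    the same i, k, g, c and sample index s (M[..., w, s] layout assumed)."""
    return M - M.mean(axis=-2, keepdims=True)
```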
In implementing template attacks using the approaches described herein, profiling traces can be reused. This is due, at least in part, to the fact that every round of the SHA-2 (or HMAC) calculation is executed on the same hardware. Therefore, it is expected that the points of interest at different rounds will have a same distribution regardless of the round index. Assuming that an initial internal state in the profiling set is chosen randomly, the only significant difference in the distribution would be that the sample indices of the points of interest are shifted according to the round index. For example, if n samples are taken at every round, then the distribution of M^{i,k}_{g,c,w,n·i+s} would not depend on the value of i. For this reason, an optimization can be used which enables more information to be extracted (determined) from a same number of traces. For instance, data corresponding to different rounds i can be merged, such that two traces, one with specific values of g, c, w at the bit position k at round i_1 and the other with the same values of g, c, w at the same bit position at round i_2, are classified to a same group, while shifting them so that the sample number n·i_1 + s of the first trace corresponds to the sample number n·i_2 + s of the second trace. The result of this approach is a set of averaged samples at POIs, every averaged sample being characterized by the values of g, c, w, the bit position k, and the POI index s. These averaged samples can be organized into 16 tables T_k. The table T_k has 2^10 rows T^k_{g,c} corresponding to all possible values of g, c, and 4p columns corresponding to 4 values of w and p POIs. Every row is then represented as a 4p-dimensional vector.
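An illustrative sketch of this merging and table construction is given below; the record layout, helper names, and the handling of empty groups are assumptions for the example.

```python
from collections import defaultdict
import numpy as np

def build_tables(records, n, poi, num_k=16):
    """records: iterable of (i, k, g, c, w, trace) with trace a numpy array;
    n: samples per round; poi: POI offsets within a round.
    Returns tables T_k as {k: {(g, c): 4p-dimensional row vector}}."""
    poi = np.asarray(poi)
    acc = defaultdict(list)                          # (k, g, c, w) -> list of POI sample vectors
    for i, k, g, c, w, trace in records:
        acc[(k, g, c, w)].append(trace[n * i + poi])  # shift samples by the round offset
    tables = {}
    for k in range(num_k):
        rows = {}
        for g in range(256):
            for c in range(4):
                parts = [np.mean(acc[(k, g, c, w)], axis=0) if acc.get((k, g, c, w))
                         else np.zeros(len(poi)) for w in range(4)]
                rows[(g, c)] = np.concatenate(parts)  # 4 values of w times p POIs
        tables[k] = rows
    return tables
```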
In an attack stage of the disclosed template attack approaches on an HMAC-SHA-256 implementation, both the inner SHA-256 invocation and the outer SHA-256 invocation can be attacked in the same manner, where each respective attack can include the steps described below. The attack stage, as with the profiling stage, is described with reference to the alternative SHA-2 notation described above with respect to, for example, FIG. 9.
In implementations, a first step (Step 1) of an attack stage can include finding A_{-1}, E_{-1} of an internal state of a corresponding SHA-256 invocation (e.g., inner or outer). In order to find a group of bits of A_{-1}, E_{-1}, based on the set of traces acquired during the attack stage, for every k < 16, vectors of dimension 4p (where p is the number of the points of interest) can be built, and the closest vectors in the table T_k can be identified. This process can be done iteratively from the least significant (k = 0) to the most significant (k = 15) bits, as described below, for instance in subsets of J bits (e.g., J = 1, 2 or 3). In parallel to the discovery of bits of A_{-1}, E_{-1}, the corresponding bits of ΔA_0 and ΔE_0 are found, as shown by the calculations of Equations 7-12 presented above. In the disclosed approaches, such bit discovery can be done in parallel for all four of the words A_{-1}, E_{-1}, ΔA_0, ΔE_0, finding two bits of each word in every iteration, starting from the least significant bit(s).
For instance, in an iteration k, finding the pair of bits 2k + 1 : 2k of these four words is attempted, assuming that the bits 2k - 1 : 0 of all four words are already known. This allows for calculating Carry(ΔA_0, W_0, 2k) and Carry(ΔE_0, W_0, 2k) for every trace t. With these calculated carry values, all the relevant traces can be split into several subsets U_c according to the two carry bits c. Although four possible values for c exist, in practice, the actual number of non-empty subsets is always strictly less than 4. For example, for k = 0 there is only one possible combination (0, 0), because Carry(x, y, 0) ≡ 0. For k > 0 and ΔA_0[2k - 1 : 0] = ΔE_0[2k - 1 : 0], clearly Carry(ΔA_0, W_0, 2k) = Carry(ΔE_0, W_0, 2k), and only two combinations (0, 0) and (1, 1) are possible. Finally, if ΔA_0[2k - 1 : 0] ≠ ΔE_0[2k - 1 : 0], e.g., ΔA_0[2k - 1 : 0] > ΔE_0[2k - 1 : 0], then Carry(ΔA_0, W_0, 2k) ≥ Carry(ΔE_0, W_0, 2k), so one of the four combinations is excluded, and only three remain.
Every non-empty set U_c can then be subdivided into four subsets U_{c,w} according to w = W_0[2k + 1 : 2k]. Finally, samples at the p points of interest can be averaged over U_{c,w} for all four values of w, resulting in a vector V_c of dimension 4p for every non-empty subset U_c.
The expectation is that, for every c for which U_c is not empty, V_c is close to the vector T^k_{g,c}, where g represents bits 2k + 1 : 2k of the four words. To guess the correct g, for every g a sum σ_g = Σ_c ||V_c − T^k_{g,c}||_{L2} (taken over the non-empty subsets U_c) is calculated, where L2 stands for the Euclidean metric. The value of g for which σ_g has the minimal value is taken. Then, the bit discovery can proceed to the next iteration, k + 1, for k < 15.
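An illustrative sketch of one such Step 1 iteration follows; the helpers bits and carry repeat the earlier sketch, and the handling of empty subsets is an assumption for the example.

```python
from collections import defaultdict
import numpy as np

def bits(x, k, j=2):
    return (x >> (j * k)) & ((1 << j) - 1)

def carry(x, y, pos):
    mask = (1 << pos) - 1
    return ((x & mask) + (y & mask)) >> pos & 1

def step1_iteration(k, traces, w0_words, d_a0_low, d_e0_low, poi, T_k):
    """traces: (num_traces, num_samples) numpy array; w0_words: known W_0 per trace;
    d_a0_low, d_e0_low: already-recovered low bits of dA_0, dE_0;
    T_k: {(g, c): 4p-dimensional template row}. Returns the best guess for g."""
    poi = np.asarray(poi)
    subsets = defaultdict(list)                   # (c, w) -> list of POI sample vectors
    for t in range(len(traces)):
        c = (carry(d_a0_low, w0_words[t], 2 * k) << 1) | carry(d_e0_low, w0_words[t], 2 * k)
        subsets[(c, bits(w0_words[t], k))].append(traces[t][poi])
    V = {}                                        # measured 4p-dimensional vector per carry value c
    for c in range(4):
        if any(subsets.get((c, w)) for w in range(4)):
            V[c] = np.concatenate([np.mean(subsets[(c, w)], axis=0) if subsets.get((c, w))
                                   else np.zeros(len(poi)) for w in range(4)])
    sigma = {g: sum(np.linalg.norm(V[c] - T_k[(g, c)]) for c in V) for g in range(256)}
    return min(sigma, key=sigma.get)              # g with the minimal summed L2 distance
```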
In a second step (Step 2) of the disclosed template attack approaches, the words A_{-2}, A_{-3}, E_{-2}, E_{-3} of a respective SHA-256 invocation can be discovered. In this stage, all possible hypotheses about the bits of A_{-2}, A_{-3}, E_{-2}, E_{-3} can be made, where, for each hypothesis, corresponding measured vectors and corresponding vectors from the template table can be calculated and/or determined. Hypotheses with the lowest Euclidean distances can then be selected. Similar to the first step, A_{-2}, A_{-3}, E_{-2}, E_{-3} can be found iteratively, e.g., by finding two bits of every word in every iteration.
In iteration k, finding the pair of bits 2k + 1 : 2k of these four words is attempted, assuming that bits 2k - 1 : 0 of all four words are already known. In addition, the words ΔA_0, ΔE_0, A_{-1}, E_{-1} are known from the first step. This allows for calculating the following values for every trace t (note that the functions Maj and Ch are bit-wise):
[Values computed per trace, reproduced in the original as an image (Figure imgf000038_0001).] From these values, hypotheses for the bits A_{-2}[2k + 1 : 2k], A_{-3}[2k + 1 : 2k], E_{-2}[2k + 1 : 2k], E_{-3}[2k + 1 : 2k] can be obtained. The combination with the lowest sum of the distances is assumed to be the correct combination.
In a third step (Step 3) of the attack stage, the words A_{-4}, E_{-4} of an internal state of a respective SHA-256 invocation can be found. Where the words recovered in the first two steps, A_{-1}, E_{-1}, ΔA_0, ΔE_0, A_{-2}, A_{-3}, E_{-2}, E_{-3}, are already known, a simple linear calculation suffices to find A_{-4}, E_{-4}.
Rewriting Equations 9-12 for i = 0 we have, as Equations 16-19:
[Equations 16-19, reproduced in the original as an image (Figure imgf000039_0001).]
Here A_{-4} and E_{-4} now remain the only unknowns in these expressions, and they can be found as follows, using Equations 20-22.
[Equations 20-22, reproduced in the original as an image (Figure imgf000039_0002).]
The disclosed template attack approaches can be extended to HMAC implementations where more than one calculation round of a corresponding hash function is performed per clock cycle. For example, such attack approaches can be applied to HMAC implementations with up to three rounds per clock cycle, with some modifications, as described below. In this discussion, the number of rounds per clock cycle (e.g., 2 or 3) is designated as d.
First, changes to template table calculations should be made. For instance, because in such multiple-rounds-per-cycle implementations A_i and E_i overwrite A_{i-d} and E_{i-d}, respectively, rather than A_{i-1} and E_{i-1}, the classification of the traces for building the template tables M^{i,k}_{g,c,w,s} should be based on the following values:
1. A_{i-d}[2k + 1 : 2k] (2 bits)
2. E_{i-d}[2k + 1 : 2k] (2 bits)
3. ΔA_i[2k + 1 : 2k] (2 bits)
4. ΔE_i[2k + 1 : 2k] (2 bits)
5. Carry(ΔA_i, W_i, 2k) (1 bit)
6. Carry(ΔE_i, W_i, 2k) (1 bit)
7. W_i[2k + 1 : 2k] (2 bits)
Note the change of the indices of A and E in the first two lines above, compared to the previously discussed example of one round per clock cycle.
In addition to the changes to template table calculations discussed above, for multiple rounds per cycle implementations, there should be separation of template tables based on a round index modulo d. Since in every clock cycle, d rounds are calculated, if any two round numbers are different modulo d, then they likely use different physical gates. Therefore, different template tables should be built based on the round number modulo d.
Additionally, for multiple-rounds-per-cycle implementations, changes to Step 1 and changes to Step 2 should be made. For instance, for Step 1, in the first clock cycle, the calculated values of A_0 and E_0 overwrite A_{-d} and E_{-d}, rather than A_{-1} and E_{-1}. For this reason, the four words found in the first step are A_{-d}, E_{-d}, ΔA_0, ΔE_0, rather than A_{-1}, E_{-1}, ΔA_0, ΔE_0. With this exception, the first step of an attack can be performed in exactly the same manner as in the case of one round per clock cycle, as described above. For Step 2, after the first step, A_{-d} and E_{-d} are already known, while A_{-1}, E_{-1}, A_{-(5-d)}, E_{-(5-d)} are still unknown. Accordingly, the selected hypotheses in this case are for A_{-1}, E_{-1}, A_{-(5-d)}, E_{-(5-d)}.
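A short illustrative sketch of the changed overwrite model for d rounds per clock cycle follows (the function and dictionary names are illustrative assumptions).

```python
def hamming_distance(x: int, y: int) -> int:
    return bin(x ^ y).count("1")

def overwrite_leakage(a_hist, e_hist, a_new, e_new, i, d):
    """a_hist, e_hist: dicts mapping a round index to the word value it produced.
    With d rounds per cycle, the values of round i overwrite those of round i - d."""
    leak = hamming_distance(a_new, a_hist[i - d]) + hamming_distance(e_new, e_hist[i - d])
    a_hist[i], e_hist[i] = a_new, e_new
    return leak
```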
Experimental results
Using the approaches described herein, both a profiling stage and an attack stage of an example template attack can be performed using a single SHA-2 invocation. Accordingly, successful recovery of SHA-2 output from power traces can be sufficient for forging an HMAC SHA-2 signature.
To experimentally evaluate the disclosed template attack approaches, a low-area SHA-256 hardware implementation was used. Register-transfer level (RTL) code of the SHA-256 implementation was synthesized for the following two target platforms:
1. ASIC netlist using a Yosys synthesizer and a NanGate FreePDK45 Open Cell Library - The netlist was simulated using SideChannel Studio, a pre-silicon side-channel leakage simulator by FortifyIQ. The simulator includes two stages: the first (ScopeIQ) performs a power-aware functional simulation of the netlist and generates power traces, and the second stage (ScoreIQ) runs the analysis.
2. Two CW305 Artix FPGA target boards by NewAE Technology with the Keysight E36100B Series DC Power Supply for power stabilization - Traces were collected using the NewAE Technology ChipWhisperer-Lite kit and, after extracting points of interest, such as described herein, the traces were analyzed using ScoreIQ, in a similar fashion as the simulation-based traces. A power signal was obtained by measuring current via a shunt resistor connected serially to the FPGA supply line. Power trace acquisition by the ScopeIQ simulator for the first platform was performed in Amazon cloud in 64 parallel threads. Trace analysis by ScoreIQ for both platforms ran on a local macOS machine.
FIG. 11 is a graph 1100 illustrating standard deviation of trace samples according to an aspect. In the graph 1100, the standard deviation of M^{i,k}_{g,c,w,s} over time for a constant i from the simulation measurement 1110 and the FPGA-based measurements 1120 is shown. In the experimental data of FIG. 11, points of interest were selected in a way that increases (e.g., maximizes) the standard deviation of the trace samples for a given data set. For instance, a standard deviation for each sample in the averaged power trace M^{i,k}_{g,c,w,s} for a fixed i over all possible values of the vector (g, c, w, k) was calculated. The graph 1100 of FIG. 11 illustrates the normalized standard deviation of M^{i,k}_{g,c,w,s} over time in the first five rounds (e.g., first five SHA-256 execution rounds) for respective traces taken from the simulation and from the FPGA. In this example, the simulator is cycle-based, and therefore it produces a single power sample per cycle. In the FPGA-based setup, four samples per cycle were taken.
As can be seen from FIG. 11, for both the FPGA and the simulator-based experiments, the standard deviation data demonstrates that the first four execution rounds (0 to 3) can provide the most information about the trace data. The slight difference between the simulator data 1110 and the FPGA data 1120 can be attributed to noise in the FPGA environment, in contrast to the simulator environment, where no noise was presumed.
Experiments that were performed using a known key demonstrate that the number of traces needed for success, for both the attack stage and the profiling stage, can be significantly reduced by considering a few finalist hypotheses, as compared to approaches where only a single best hypothesis is selected, such as in the approaches described below.
In the approaches described herein, an attack stage of a template attack includes three steps, in which Steps 1 and 2 produce a prioritized list of hypotheses for an unknown hash function internal state, and Step 3 includes simple calculations. Step 1 (e.g., finding A_{-1}, E_{-1}) can include choosing the q_1 best hypotheses for bits 0, 1, where q_1 is a parameter expressing the number of selected hypotheses for the first step of the disclosed template attack approaches. For subsequent bit windows (k > 0), the q_1 best hypotheses for bits 2k - 1 : 0 can be selected, and then combined with 256 hypotheses for bits 2k + 1 : 2k, which results in obtaining a total of 256·q_1 hypotheses for bits 2k + 1 : 0. From these 256·q_1 hypotheses, the best q_1 hypotheses for the next step can be selected using the approaches described herein. Finally, q_1 hypotheses are obtained for the full values of A_{-1}, ΔA_0, E_{-1}, ΔE_0.
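An illustrative sketch of this hypothesis selection as a beam search over the 2-bit windows follows; the score callable stands for the σ_g distance computation of Step 1 and is an assumption for the example.

```python
def beam_search(num_windows: int, beam_width: int, score):
    """Keep the beam_width best partial hypotheses; each hypothesis is a tuple of
    per-window 8-bit values g, scored by the accumulated template distance."""
    beam = [((), 0.0)]
    for k in range(num_windows):
        candidates = [(hyp + (g,), acc + score(k, hyp, g))
                      for hyp, acc in beam for g in range(256)]
        candidates.sort(key=lambda item: item[1])   # lower distance is better
        beam = candidates[:beam_width]
    return [hyp for hyp, _ in beam]
```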
Step 2 (e.g., finding A_{-2}, A_{-3}, E_{-2}, E_{-3}) can then be performed for each one of these q_1 hypotheses separately. Step 2 can be performed in a similar way to Step 1 by using 2-bit windows (or using 1-bit windows, 3-bit windows, etc.), where the best q_2 hypotheses are selected at each iteration, q_2 being a parameter expressing the number of selected hypotheses for the second step of the disclosed template attack approaches. At the end of Step 2, q_2 hypotheses for each of the q_1 hypotheses from Step 1 are obtained, which results in a total of q_1·q_2 hypotheses for a full initial (e.g., unknown) internal state of the inner SHA-256.
After obtaining q_1·q_2 hypotheses for the inner SHA invocation, the outer SHA invocation can be attacked in the same way, e.g., by repeating the attack for each of the hypotheses, resulting in a total of (q_1·q_2)^2 iterations. However, the following observation helped significantly accelerate the process of attacking the outer SHA invocation. That is, using the technique for finding POIs described above, it is possible to find a correct hypothesis by correlation. Namely, for each of the q_1·q_2 hypotheses for the inner SHA initial state, and for every trace from a subset of the attack traces, the Hamming distance hd^t_δ can be calculated according to Equation 15, above, and its correlation with samples at the points of interest at round δ can also be calculated. If the hypothesis is correct, the correlations are expected to be significantly above a noise level. Experimentally, it was found that in both FPGA and simulation setups, such an approach consistently works with an arbitrary subset of 7K traces, δ = 6, and a threshold value of 5% to distinguish between significant correlations and noise. In other implementations, different values for the foregoing may apply to achieve successful results.
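An illustrative sketch of this correlation check follows; the 5% threshold echoes the experiment described above, while the array shapes and function names are assumptions for the example.

```python
import numpy as np

def max_poi_correlation(hd: np.ndarray, poi_samples: np.ndarray) -> float:
    """hd: (T,) per-trace Hamming distances under one hypothesis (Equation 15);
    poi_samples: (T, p) attack-trace samples at the points of interest.
    Returns the largest absolute Pearson correlation over the p POIs."""
    hd_c = hd - hd.mean()
    sc = poi_samples - poi_samples.mean(axis=0)
    corr = (hd_c @ sc) / (np.linalg.norm(hd_c) * np.linalg.norm(sc, axis=0) + 1e-12)
    return float(np.max(np.abs(corr)))

def find_correct_hypothesis(hypotheses, hd_per_hypothesis, poi_samples, threshold=0.05):
    """Return the first hypothesis whose correlation clearly exceeds the noise level."""
    for hyp, hd in zip(hypotheses, hd_per_hypothesis):
        if max_poi_correlation(np.asarray(hd, dtype=float), poi_samples) > threshold:
            return hyp
    return None
```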
If one of the hypotheses has passed the foregoing test, the outer SHA invocation (e.g., SHA-256) can be attacked, with the assumption that the tested hypothesis is correct. Namely, for every trace, the output from the inner SHA-256 invocation can be calculated and, in the same way, the outer SHA-256 invocation can be attacked, obtaining a total of q_1·q_2 hypotheses for a full initial internal state of the outer SHA-256. The correct hypothesis can then be found by a brute-force attack. In an example experimental setup, such as the setups described herein, the values q_1 = 15, q_2 = 10 were used to successfully determine initial internal states of both an inner SHA-256 invocation and an outer SHA-256 invocation of the attacked HMAC implementation. In other implementations, different values of q_1, q_2 may be used to mount a successful template attack using the approaches described herein.
[Content presented in the original publication as an image (Figure imgf000042_0001).]
Template attacks, including the approaches for mounting a template attack described here, include performing a profiling stage. Accordingly, if an HMAC implementation (e.g., hardware or software) is solely dedicated to calculating HMAC values using a fixed key, e.g., does not allow arbitrary, or independent, hash value (e.g., SHA-2) calculations, then a template attack using the approaches described herein cannot be mounted. However, there are some considerations when implementing such a mitigation approach. First, access to pure hash function (e.g., SHA-2) units or primitives should be blocked in all commercial implementations of a given HMAC implementation, otherwise an attacker may exploit an HMAC unit with an independently accessible hash function primitive for profiling. Second, if somewhere in a given implementation there is a hash function unit that provides plain hash function (e.g., SHA-2) functionality, the unit should be based on a different architecture, otherwise it could be possible to use that included unit for performing a profiling stage.
A similar, but less restrictive, mitigation approach is to define an execution policy that prevents large numbers of consecutive invocations of a pure hash function used to implement a given HMAC implementation. For instance, time intervals between hash function invocations could be enforced, as in the illustrative sketch below.
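For illustration only, a minimal sketch of such a policy (an assumed construction, not one recited above) that enforces a minimum interval between invocations of the plain hash primitive:

```python
import time

class RateLimitedHash:
    """Wraps a plain hash primitive and enforces a minimum time interval between calls,
    limiting how quickly an attacker could collect a profiling set."""
    def __init__(self, hash_fn, min_interval_s: float):
        self._hash_fn = hash_fn
        self._min_interval_s = min_interval_s
        self._last_call = 0.0

    def digest(self, message: bytes) -> bytes:
        wait = self._min_interval_s - (time.monotonic() - self._last_call)
        if wait > 0:
            time.sleep(wait)                      # enforce the interval between invocations
        self._last_call = time.monotonic()
        return self._hash_fn(message)
```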
Alternatively, a power analysis resistant SHA-256 engine can be implemented using an adapted version of one of the methods developed for other cryptographic modules.
Implementations of the various techniques described herein may be implemented in digital electronic circuitry, or in computer hardware, firmware, software, or in combinations of them. Implementations may be implemented as a computer program product, i.e., a non-transitory computer program tangibly embodied in an information carrier, e.g., in a machine-readable storage device (e.g., a computer-readable medium, a tangible computer-readable medium), for processing by, or to control the operation of, data processing apparatus, e.g., a programmable processor, a computer, or multiple computers. In some implementations, a non-transitory tangible computer-readable storage medium can be configured to store instructions that when executed cause a processor to perform a process. A computer program, such as the computer program(s) described above, can be written in any form of programming language, including compiled or interpreted languages, and can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program can be deployed to be processed on one computer or on multiple computers at one site or distributed across multiple sites and interconnected by a communications network.
Method steps may be performed by one or more programmable processors executing a computer program to perform functions by operating on input data and generating output. Method steps also may be performed by, and an apparatus may be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
Processors suitable for the processing of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read-only memory or a random access memory or both. Elements of a computer may include at least one processor for executing instructions and one or more memory devices for storing instructions and data. Generally, a computer also may include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. Information carriers suitable for embodying computer program instructions and data include all forms of non-volatile memory, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The processor and the memory may be supplemented by, or incorporated in special purpose logic circuitry.
To provide for interaction with a user, implementations may be implemented on a computer having a display device, e.g., a cathode ray tube (CRT), a light emitting diode (LED), or liquid crystal display (LCD) display device, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input.
Implementations may be implemented in a computing system that includes a back-end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front-end component, e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation, or any combination of such back-end, middleware, or front-end components. Components may be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (LAN) and a wide area network (WAN), e.g., the Internet.
While certain features of the described implementations have been illustrated as described herein, many modifications, substitutions, changes and equivalents will now occur to those skilled in the art. It is, therefore, to be understood that the appended claims are intended to cover all such modifications and changes as fall within the scope of the implementations. It should be understood that they have been presented by way of example only, not limitation, and various changes in form and details may be made. Any portion of the apparatus and/or methods described herein may be combined in any combination, except mutually exclusive combinations. The implementations described herein can include various combinations and/or sub-combinations of the functions, components and/or features of the different implementations described.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art working in this area. Although suitable methods and materials are described herein, methods and materials similar or equivalent to those described herein can be used in the practice of the present example embodiments. In case of conflict, the patent specification, including definitions, will control. All materials, methods, and examples are illustrative only and are not intended to be limiting.
As used herein, the terms “comprising” and “including” or grammatical variants thereof are to be taken as specifying inclusion of the stated features, integers, actions or components without precluding the addition of one or more additional features, integers, actions, components or groups thereof. These terms are broader than, and include, the terms “consisting of” and “consisting essentially of” as defined by the Manual of Patent Examining Procedure of the United States Patent and Trademark Office. Thus, any recitation that an embodiment “includes” or “comprises” a feature is a specific statement that sub-embodiments “consist essentially of” and/or “consist of” the recited feature.
The phrase "consisting essentially of' or grammatical variants thereof when used herein are to be taken as specifying the stated features, integers, steps or components but do not preclude the addition of one or more additional features, integers, steps, components or groups thereof but only if the additional features, integers, steps, components or groups thereof do not materially alter the basic and novel characteristics of the claimed composition, device or method.
The phrase “adapted to” as used in this specification and the accompanying claims imposes additional structural limitations on a previously recited component.
The term "method" refers to manners, means, techniques and procedures for accomplishing a given task including, but not limited to, those manners, means, techniques and procedures either known to, or readily developed from known manners, means, techniques and procedures by practitioners of architecture and/or computer science.
Implementations of methods and systems, according to example embodiments, involve performing or completing selected tasks or steps manually, automatically, or a combination thereof. Moreover, according to actual instrumentation and equipment of exemplary embodiments of methods, apparatus and systems, several selected steps could be implemented by hardware or by software on any operating system of any firmware, or a combination thereof. For example, as hardware, selected steps could be implemented as a chip or a circuit. As software, selected steps could be implemented as a plurality of software instructions being executed by a computer using any suitable operating system. In any case, selected steps of the method and system could be described as being performed by a data processor, such as a computing platform for executing a plurality of instructions.
It is expected that many data processor types will be developed and the scope of at least some of the appended claims can include all such new technologies a priori.
A variety of numerical indicators have been utilized in this disclosure. It should be understood that these numerical indicators could vary even further based upon a variety of engineering principles, materials, intended use and designs incorporated into the various embodiments. Additionally, components and/or actions ascribed to exemplary embodiments and depicted as a single unit may be divided into subunits. Conversely, components and/or actions ascribed to exemplary embodiments and depicted as sub-units/individual actions may be combined into a single unit/action with the described/depicted function.
Alternatively, or additionally, features used to describe a method can be used to characterize an apparatus and features used to describe an apparatus can be used to characterize a method.
It should be further understood that the individual features described hereinabove can be combined in all possible combinations and sub-combinations to produce additional embodiments. The examples given above are exemplary in nature and are not intended to limit the scope of, at least some of, the following claims. Each recitation of an embodiment that includes a specific feature, part, component, module or process is an explicit statement that additional embodiments not including the recited feature, part, component, module or process exist.
Alternatively or additionally, various exemplary embodiments can exclude any specific feature, part, component, module, process or element which is not specifically disclosed herein. Specifically, the described embodiments have been described in the context of certain calculation types but might also be used in the context of other calculation types.
The terms “include", and “have” and their conjugates as used herein mean “including but not necessarily limited to”.

Claims

CLAIMS What is claimed is:
1. A method of improving performance of a data processor comprising: in a ring of characteristic 2, computing X^254, X being an element of the ring, by performing a series of:
(i) multiplications of two different elements of the ring; and
(ii) performing a linear transformation by raising an element of the ring to a power Z, wherein Z is a power of 2; wherein a total number of multiplications is limited to 4, a total number of linear transformations is limited to 4, a number of multiplications executed sequentially is limited to 3, and a number of linear transformations executed sequentially is limited to 2.
2. A method according to claim 1, wherein elements of said ring redundantly represent elements of a field; and wherein after at least one transformation according to (i) or (ii) a result representing a field element Z is replaced with one of said redundant representations of a same element Z of said field.
3. A method according to claim 2, wherein said redundant representations of the same element Z of said field is chosen randomly.
4. A method according to any of the preceding claims, wherein said ring includes a field GF(2^8) as a subring.
5. A method of improving performance of a data processor comprising: in a ring of characteristic 2, computing X^254, X being an element of the ring, by performing a series of:
(i) multiplications of two different elements of the ring; and
(ii) performing a linear transformation by raising an element of the ring to a power
Z, wherein Z is a power of 2; wherein a total number of multiplications is limited to 4, a total number of linear transformations is limited to 3, a number of multiplications executed sequentially is limited to 3, and a number of linear transformations executed sequentially is limited to 3.
6. A method according to claim 5, wherein elements of said ring redundantly represent elements of a field; and wherein after at least one transformation according to (i) or (ii) a result representing a field element Z is replaced with one of said redundant representations of a same element Z of said field.
7. A method according to claim 6, wherein said redundant representations of the same element Z of said field is chosen randomly.
8. A method according to any of claims 5 to 7, wherein said ring includes a field GF(2^8) as a subring.
9. A semiconductor intellectual property (IP) core for improving performance of a data processor comprising: a transformation engine designed and configured to perform, in a ring of characteristic 2, computation of X^254, X being an element of the ring, by performing a series of:
(i) multiplications of two different elements of the ring; and
(ii) performing a linear transformation by raising an element of the ring to a power
Z, wherein Z is a power of 2; wherein a total number of multiplications is limited to 4, a total number of linear transformations is limited to 3, a number of multiplications executed sequentially is limited to 3, and a number of linear transformations executed sequentially is limited to 3.
10. An IP core according to claim 9, wherein elements of said ring redundantly represent elements of a field; and wherein after at least one transformation according to (i) or (ii) a result representing a field element Z is replaced with one of said redundant representations of a same element Z of said field.
11. An IP core according to claim 10, wherein said redundant representations of the same element Z of said field is chosen randomly.
12. An IP core according to any of claims 9 to 11, wherein said ring includes a field GF(2^8) as a subring.
13. A method of improving performance of a data processor comprising: in a ring of characteristic 2, computing X^254, X being an element of the ring, by performing a series of:
(i) multiplications of two different elements of the ring; and
(ii) performing a linear transformation by raising an element of the ring to a power Z wherein Z is a power of 2; wherein a total number of multiplications is limited to 7, a total number of linear transformations is limited to 6, a number of multiplications executed sequentially is limited to 3, and a number of linear transformations executed sequentially is limited to 1.
14. A method according to claim 13, wherein elements of said ring redundantly represent elements of a field; and wherein after at least one transformation according to (i) or (ii) a result representing a field element Z is replaced with one of said redundant representations of a same element Z of said field.
15. A method according to claim 14, wherein said redundant representation of the same element Z of said field is chosen randomly.
16. A method of improving performance of a data processor according to any of claims 13 to 15, wherein said ring includes a field GF(2^8) as a subring.
17. An intellectual property (IP) core comprising: circuitry that improves performance of a data processor by: in a ring of characteristic 2, computing X^254, X being an element of the ring, by performing a series of:
(i) multiplications of two different elements of the ring; and (ii) performing a linear transformation by raising an element of the ring to a power Z, wherein Z is a power of 2; wherein a total number of multiplications is limited to 4, a total number of linear transformations is limited to 4, a number of multiplications executed sequentially is limited to 3, up to 2 of the 4 multiplications being executed in parallel, and a number of linear transformations executed sequentially is limited to 2.
18. An IP core according to claim 17, wherein elements of said ring redundantly represent elements of a field; and wherein after at least one transformation according to (i) or (ii) a result representing a field element Z is replaced with one of said redundant representations of a same element Z of said field.
19. An IP core according to claim 18, wherein said redundant representation of the same element Z of said field is chosen randomly.
20. An IP core according to any of claims 17 to 19, wherein said ring includes a field GF(2^8) as a subring.
21. An intellectual property (IP) core comprising: circuitry that improves performance of a data processor by: in a ring of characteristic 2, computing X^254, X being an element of the ring, by performing a series of:
(i) multiplications of two different elements of the ring; and
(ii) performing a linear transformation by raising an element of the field to a power Z wherein Z is a power of 2; wherein a total number of multiplications is limited to 7, a total number of linear transformations is limited to 6, a number of multiplications executed sequentially is limited to 3, and a number of linear transformations executed sequentially is limited to 1.
22. An IP core according to claim 21, wherein elements of said ring redundantly represent elements of a field; and wherein after at least one transformation according to (i) or (ii) a result representing a field element Z is replaced with one of said redundant representations of a same element Z of said field.
23. An IP core according to claim 22, wherein said redundant representation of the same element Z of said field is chosen randomly.
24. An IP core according to any of claims 21 to 23, wherein said ring includes a field GF(2^8) as a subring.
25. A semiconductor intellectual property (IP) core comprising a transformation engine designed and configured to generate redundant representations of each element of a field GF(2^8) using a polynomial of degree no higher than 7 + d, where d > 0 is a redundancy parameter, said redundant representations belonging to a ring; wherein said transformation engine represents a same field element by one of 2^d various ways, which differ pairwise by terms that are multiples of P(x), and at one or more moments of calculations replaces a redundant representation of a field element Z with a said redundant representation chosen out of 2^d representations of said field element Z.
26. An IP core according to claim 25, wherein said redundant representation out of 2^d representations is chosen randomly.
27. An IP core according to claim 25 or 26, wherein d > 5.
28. An IP core according to claim 27, wherein d > 8.
29. An IP core according to any of claims 25 to 28, wherein said element of a field comprises a byte of data within a block of a block cipher and a cryptographic key.
30. An IP core according to claim 29, wherein said block cipher is selected from the group consisting of AES, SM4, and ARIA.
31. An IP core according to any of claims 25 to 30, wherein said transformation engine computes X^Y by performing a series of:
(i) multiplications of two different elements of the field; and
(ii) raising an element of the field to a power Z, wherein Z is a power of 2; wherein the number of multiplications (i) is at least two less than the number of ones (1s) in the binary representation of Y; and wherein after at least one transformation according to (i) or (ii) a result representing a field element Z is replaced with one of said 2^d redundant representations of said field element Z.
32. An IP core according to claim 31, wherein said one of 2^d redundant representations of said field element Z is chosen randomly.
33. An IP core according to claim 31 or 32, wherein Y=254.
34. An IP core according to claim 33, wherein a number of multiplications (i) is 4 or less.
35. A method of building different representations of the Galois Field (GF) implemented by logic circuitry comprising: redundantly representing each element of a field GF(2^8) using a polynomial of degree no higher than 7 + d, where d > 0 is a redundancy parameter, to generate redundant representations belonging to a ring; representing a same field element by one of 2^d various ways, which differ pairwise by terms that are multiples of P(x); and at one or more moments of calculations replacing a value representing a field element Z with any of said various representations of said field element Z.
36. A method according to claim 35, wherein one of said various representations of said field element Z is chosen randomly.
37. A method according to claim 35 or 36, wherein d>5.
38. A method according to claim 37, wherein d>8.
39. A method according to any of claims 35 to 38, wherein said element of a field comprises a byte of data within a block of a block cipher and a cryptographic key.
40. A method according to claim 39, wherein said block cipher is selected from the group consisting of AES, SM4, and ARIA.
41. A method according to any of claims 35 to 40, comprising computing X^Y by performing a series of:
(i) multiplications of two different elements of the field; and
(ii) raising an element of the field to a power Z, wherein Z is a power of 2; wherein the number of multiplications (i) is at least two less than the number of ones (1s) in the binary representation of Y; and wherein after at least one transformation according to (i) or (ii) the result representing a field element Z is replaced with one of said redundant representations of said field element Z.
42. A method according to claim 41, wherein said one of said redundant representations of said field element Z is chosen randomly.
43. A method according to claim 41 or 42, wherein Y=254.
44. A method according to claim 41, wherein a number of multiplications (i) is 4 or less.
45. A semiconductor intellectual property (IP) core comprising a transformation engine designed and configured to perform, on elements of a finite ring R represented as GF(p)[x]/(PQ), a sequence of operations comprising at least one member selected from the group consisting of multiplication and raising to an integer power; wherein p is a prime number; wherein P is a polynomial of degree n irreducible over GF(p); wherein Q is a polynomial of degree d over GF(p); wherein elements of R redundantly represent elements of a finite field F = GF(p^n) represented as GF(p)[x]/(P); and wherein a result (Z) of at least one of said operations in said sequence is replaced with an element (Z*) of R such that Z and Z* redundantly represent a same element of F.
46. An IP core according to claim 45, wherein said element Z* is chosen randomly.
47. An IP core according to claim 45 or claim 46, wherein p = 2.
48. An IP core according to any of claims 45 to 47, wherein n = 8.
49. An IP core according to any of claims 45 to 48, wherein any element A of R represents an element A mod P of F.
50. An IP core according to claim 49, wherein said replacement of Z by Z* is performed by calculation Z* = Z + CP wherein C is a polynomial of a degree less than d.
51. An IP core according to any of claims 45 to 50, wherein said sequence of operations takes an element A of R representing an element Y of F and calculates an element Z of R representing an element Y^n of F.
52. An IP core according to claim 51, wherein said sequence of operations consists only of multiplications and raisings to powers of p^k.
53. An IP core according to claim 51 or claim 52, wherein n = 254.
54. A method of building different representations of a finite ring R represented as GF(p)[x]/(PQ) implemented by logic circuitry comprising: a sequence of operations comprising at least one member selected from the group consisting of multiplication and raising to an integer power; wherein p is a prime number; wherein P is a polynomial of degree n irreducible over GF(p); wherein Q is a polynomial of degree d over GF(p); wherein elements of R redundantly represent elements of a finite field F = GF(p^n) represented as GF(p)[x]/(P); and wherein a result (Z) of at least one of said operations in said sequence is replaced with an element (Z*) of R such that Z and Z* redundantly represent a same element of F.
55. A method according to claim 54, wherein said element Z* is chosen randomly.
56. A method according to claim 54 or claim 55, wherein p = 2.
57. A method according to any of claims 54 to 56, wherein n = 8.
58. A method according to any of claims 54 to 57, wherein any element A of R represents an element A mod P of F.
59. A method according to claim 58, wherein said replacement of Z by Z* is performed by calculation Z* = Z + CP wherein C is a polynomial of a degree less than d.
60. A method according to any of claims 54 to 59, wherein said sequence of operations takes an element A of R representing an element Y of F and calculates an element Z of R representing an element Y^n of F.
61. A method according to claim 60, wherein said sequence of operations consists only of multiplications and raisings to powers of p^k.
62. A method according to claim 60 or claim 61, wherein n = 254.
63. A method for testing for vulnerability of an implementation of a hash-based message authentication code (HMAC) algorithm to a side-channel attack, the method comprising: mounting a template attack on a hash function used to implement the HMAC algorithm, the template attack including: generating, based on first side-channel leakage information associated with execution of the hash function, a plurality of template tables, each template table of the plurality corresponding, respectively, with a subset of bit positions of an internal state of the hash function; and generating, based on second side-channel leakage information, a plurality of hypotheses for an internal state of an invocation of the hash function based on a secret key; generating, using the hash function, respective hash values generated from each of the plurality of hypotheses and a message; comparing each of the respective hash values with a hash value generated using the secret key and the message; and based on the comparison, determining vulnerability of the HMAC algorithm implementation based on a hash value of the respective hash values matching the hash value generated using the secret key and the message.
64. The method of claim 63, wherein the implementation of the HMAC algorithm is one of: a hardware implementation; a software implementation; or a simulator implementation.
65. The method of claim 63, wherein one round of a compression function of the hash function is calculated per calculation cycle of the hash function.
66. The method of claim 63, wherein a plurality of rounds of a compression function of the hash function are calculated per calculation cycle of the hash function.
67. The method of claim 63, wherein each template table of the plurality of template tables includes a plurality of rows that are indexed using values of bits of the respective subset of bit positions, the rows including respective side-channel leakage information of the first side-channel leakage information associated with the index values.
68. The method of claim 67, wherein generating the template tables includes normalizing a value of the respective side-channel information based on an average value of a plurality of values of the respective side-channel leakage information.
69. The method of claim 67, wherein the plurality of rows of the template tables are further indexed using at least one of: carry bit values corresponding with the subset of bits of the internal state of the hash function; or bit values of a portion of a message schedule used to calculate the hash function.
70. The method of claim 63, wherein collecting the first side-channel leakage information includes executing the hash function using a known message schedule as a first input block of the hash function, the first side-channel leakage information being collected based on a Hamming distance model.
71. The method of claim 63, wherein each subset of bit positions of the internal state includes a respective two-bit subset of each word of the internal state of the hash function.
72. The method of claim 63, wherein the hash function is a hash function of the Secure Hash Algorithm 2 (SHA-2) standard.
73. The method of claim 63, wherein each template table of the plurality of template tables further corresponds with a respective execution round of a compression function of the hash function.
74. The method of claim 63, wherein determining each hypothesis of the plurality of hypotheses includes determining values of respective subsets of bits of the internal state of the hash function in correspondence with the plurality of the template tables.
75. The method of claim 63, wherein the hash function is implemented in hardware.
76. The method of claim 75, wherein one execution round of a compression function of the hash function is completed in one clock cycle of the hardware implementation.
77. The method of claim 75, wherein multiple rounds of an execution round of a compression function of the hash function are completed in one clock cycle of the hardware implementation.
78. The method of claim 63, wherein the first side-channel leakage information and the second side-channel leakage information include at least one of respective: power consumption over time; electromagnetic emissions over time; or cache miss patterns.
79. A method of forging a hash-based message authentication code (HMAC), the method comprising: collecting, while executing an implementation of a hash function used to produce the HMAC, first side-channel leakage information corresponding with overwriting values of an internal state of the hash function; generating a plurality of template tables, each template table corresponding, respectively, with a subset of bits of the internal state of the hash function, each template table of the plurality of template tables including rows that are indexed using values of the respective subset of bits, the rows including respective side-channel leakage information of the first side-channel leakage information associated with the index values; collecting second side-channel leakage information associated with producing the HMAC; identifying, based on comparison of the second side-channel leakage information with the plurality of template tables, a first plurality of hypotheses for an internal state of an inner invocation of the hash function; identifying, based on comparison of the second side-channel leakage information with the plurality of template tables, a second plurality of hypotheses for an internal state of an outer invocation of the hash function; and selecting, using pairs of hypotheses each including a hypothesis of the first plurality of hypotheses and a hypothesis of the second plurality of hypotheses, a first hypothesis of the first plurality of hypotheses and a second hypothesis of the second plurality of hypotheses for forging the HMAC.
80. The method of claim 79, wherein generating the template tables includes normalizing a value of the respective side-channel information based on an average value of a plurality of values of the respective side-channel leakage information.
81. The method of claim 79, wherein collecting the first side-channel leakage information includes executing a single invocation of the hash function using a known message schedule as a first input block of the hash function, the first side-channel leakage information being collected based on a Hamming distance model.
82. The method of claim 79, wherein the template tables are further indexed using at least one of: carry bit values corresponding with the subset of bits of the internal state of the hash function; or bit values of a portion of a message schedule used to calculate the hash function.
83. The method of claim 79, wherein the subset of bits of the internal state includes respective two-bit subsets of each word of the internal state of the hash function.
84. The method of claim 79, wherein the hash function is a hash function of the Secure Hash Algorithm 2 (SHA-2) standard.
85. The method of claim 79, wherein each template table of the plurality of template tables further corresponds with a respective execution round of a compression function of the hash function.
86. The method of claim 79, wherein determining each hypothesis of the first plurality of hypotheses and each hypothesis of the second plurality of hypotheses includes determining respective subsets of bits of the internal state of the hash function in correspondence with the plurality of the template tables.
87. The method of claim 79, wherein the hash function is implemented in hardware.
88. The method of claim 87, wherein one execution round of a compression function of the hash function is completed in one clock cycle of the hardware implementation.
89. The method of claim 87, wherein multiple rounds of an execution round of a compression function of the hash function are completed in one clock cycle of the hardware implementation.
90. The method of claim 79, wherein the hash function is implemented in software.
91. The method of claim 79, wherein selecting the first hypothesis of the first plurality of hypotheses and the second hypothesis of the second plurality of hypotheses for forging the HMAC includes performing a brute force attack.
92. The method of claim 79, wherein the first side-channel leakage information and the second side-channel leakage information include at least one of respective: power consumption over time; electromagnetic emissions over time; or cache miss patterns.
PCT/US2022/073122 2021-06-25 2022-06-23 Semiconductor intellectual property core, methods to improve data processor performance, and side-channel attack on hmac-sha-2 WO2022272289A1 (en)

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
US202163202831P 2021-06-25 2021-06-25
US63/202,831 2021-06-25
US202163221483P 2021-07-14 2021-07-14
US63/221,483 2021-07-14
US17/444,832 US20220414227A1 (en) 2021-06-25 2021-08-11 Side-channel attack on hmac-sha-2 and associated testing
US17/444,832 2021-08-11

Publications (1)

Publication Number Publication Date
WO2022272289A1 true WO2022272289A1 (en) 2022-12-29

Family

ID=84545982

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2022/073122 WO2022272289A1 (en) 2021-06-25 2022-06-23 Semiconductor intellectual property core, methods to improve data processor performance, and side-channel attack on hmac-sha-2

Country Status (1)

Country Link
WO (1) WO2022272289A1 (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040213406A1 (en) * 2001-09-13 2004-10-28 Victor Halperin Hacking prevention system
US20120321085A1 (en) * 2010-03-17 2012-12-20 Nds Limited Data Expansion Using an Approximate Method
US20150365228A1 (en) * 2014-06-16 2015-12-17 Cisco Technology, Inc. Differential power analysis countermeasures
WO2020148771A1 (en) * 2019-01-17 2020-07-23 Fortifyiq Inc Methods for protecting computer hardware from cyber threats

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116389361A (en) * 2023-04-24 2023-07-04 中科驭数(北京)科技有限公司 Flow distribution method, device, equipment and storage medium of kernel in DPU
CN116389361B (en) * 2023-04-24 2024-03-19 中科驭数(北京)科技有限公司 Flow distribution method, device, equipment and storage medium of kernel in DPU


Legal Events

Code 121 (Ep: the epo has been informed by wipo that ep was designated in this application): Ref document number: 22829516; Country of ref document: EP; Kind code of ref document: A1
Code WWE (Wipo information: entry into national phase): Ref document number: 2022829516; Country of ref document: EP
Code NENP (Non-entry into the national phase): Ref country code: DE
Code ENP (Entry into the national phase): Ref document number: 2022829516; Country of ref document: EP; Effective date: 20240125