WO2022029443A1 - Method and apparatus for reducing the risk of successful side channel and fault injection attacks

Method and apparatus for reducing the risk of successful side channel and fault injection attacks

Info

Publication number
WO2022029443A1
Authority
WO
WIPO (PCT)
Prior art keywords
task
queue
tasks
pipeline
executable tasks
Prior art date
Application number
PCT/GB2021/052034
Other languages
English (en)
Inventor
Jeremy Simon THORNTON
Original Assignee
Pugged Code Limited
Priority date
Filing date
Publication date
Priority claimed from GBGB2012352.7A external-priority patent/GB202012352D0/en
Priority claimed from GBGB2105109.9A external-priority patent/GB202105109D0/en
Application filed by Pugged Code Limited filed Critical Pugged Code Limited
Publication of WO2022029443A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/70Protecting specific internal or peripheral components, in which the protection of a component leads to protection of the entire computer
    • G06F21/71Protecting specific internal or peripheral components, in which the protection of a component leads to protection of the entire computer to assure secure computing or processing of information
    • G06F21/75Protecting specific internal or peripheral components, in which the protection of a component leads to protection of the entire computer to assure secure computing or processing of information by inhibiting the analysis of circuitry or operation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/002Countermeasures against attacks on cryptographic mechanisms
    • H04L9/003Countermeasures against attacks on cryptographic mechanisms for power analysis, e.g. differential power analysis [DPA] or simple power analysis [SPA]
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/002Countermeasures against attacks on cryptographic mechanisms
    • H04L9/004Countermeasures against attacks on cryptographic mechanisms for fault attacks
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L2209/00Additional information or applications relating to cryptographic mechanisms or cryptographic arrangements for secret or secure communication H04L9/00
    • H04L2209/08Randomization, e.g. dummy operations or using noise

Definitions

  • This invention relates to a method and apparatus for reducing the risk of successful side channel and fault injection attacks when executing a computer algorithm.
  • the execution of the computer algorithm may be immune to side channel and fault injection attacks.
  • BACKGROUND: There are many situations where computer algorithms perform tasks which one might wish to keep secret, but they are open to Side Channel Attacks [SCAs]. One such situation arises in cryptography. [03] Traditionally, cryptography has sought to improve the security of encrypted data by increasing the complexity of the mathematical function, the cipher, that turns plaintext into ciphertext.
  • One highly successful SCA is Differential Power Analysis [DPA], described by Kocher et al in a paper published at the Annual International Cryptology Conference in 1999.
  • US2017099134 (A1) [ Kocher et al] describes a method and apparatus for conducting DPAs on devices to establish how vulnerable they are to DPAs.
  • Electromagnetic [EM] emissions can also be used for successful SCAs. Electric current flowing through a conductor induces electromagnetic emanations, and these emanations are a source of side-channel information. As the power consumption in a device varies while data is being processed, so does the electromagnetic field, and one can expect to extract secret information from a relevant analysis, known as ElectroMagnetic Analysis [EMA].
  • Both DPA and EMA can be improved by using various wavelet transform denoising methods, typically based on singular spectral analysis [SSA] and detrended fluctuation analysis [DFA], as described for example in “Improved wavelet transform for noise reduction in power analysis attacks” [Ai et al, 2016, IEEE International Conference on Signal and Image Processing (ICSIP)].
  • The principal signal component in SSA can be selected adaptively by DFA, and the residual part can be denoised by wavelet transform to retrieve important information. Using wavelet transforms in such a way improves the SCA success rate whilst significantly decreasing the necessary number of power consumption traces.
  • DPA and EMA rely on two facts.
  • Examples include “A Network-based Asynchronous Architecture for Cryptographic Devices”, a PhD thesis by Spadavecchia published by Edinburgh University in 2005, “Improving DPA Security by Using Globally-Asynchronous Locally-Synchronous Systems” by Gürkaynak et al published in the Proceedings of the 31st European Solid-State Circuits Conference 2005, and “Countering power analysis attacks by exploiting characteristics of multicore processors” by Yang et al published in IEICE Electronics Express in 2018, Volume 15, Issue 7, Pages 20180084. Most of these approaches have been published as modelling exercises to establish chip architectures which can function with some randomness in core use. However, for these approaches to be implemented typically requires new multicore hardware for every user, which is a considerable expense for banks and the like issuing cards with new chips.
  • Another category of attack is the fault injection attack [FIA]. FIAs can be generalised and relatively easy to mount, such as glitching the voltage up or down out of the working range, or generating sudden out-of-working-range temperature changes. More specialised and costly FIAs involving lasers or ion beams can be focussed on much smaller parts of a processor. FIAs may sometimes lead to a processor dumping the task it was performing, and the dump can be harvested and analysed. Sometimes an FIA can lead to a specific bit flip which changes the processing. Comparing “normal” running with the bit flip can reveal useful information.
  • the invention relates to apparatus and methods for enhancing security when executing a computer algorithm as defined in the independent claims. Further features are defined in the dependent claims.
  • apparatus for enhancing security when executing a computer algorithm comprising separately executable tasks, each of which produces an electrical signal when executed, the apparatus comprising: memory in which at least part of the computer algorithm is stored as at least one pipeline, wherein the at least one pipeline comprises a plurality of non-commutative separately executable tasks; and a processor which is configured to: receive a plurality of inputs to be processed by the computer algorithm; randomise the order of execution of the plurality of non-commutative separately executable tasks within the pipeline; attempt to execute each of the plurality of non-commutative separately executable tasks from the randomised order of execution once by determining whether preconditions for execution are met and, when the preconditions are met, executing the task and storing an output of the executed task; and repeat the randomising and attempting steps until the computer algorithm has processed the plurality of inputs; whereby, at each repetition, the electrical signals produced when executing the tasks are randomised to enhance security.
  • the plurality of separately executable tasks may be considered to be randomised at a pipeline level.
  • apparatus for enhancing security when executing a computer algorithm comprising separately executable tasks, each of which produces an electrical signal when executed, the apparatus comprising: memory in which at least part of the computer algorithm is stored as at least one pipeline, wherein the at least one pipeline comprises at least one task which comprises a set of commutative operations; and a processor which is configured to: receive a plurality of inputs to be processed by the computer algorithm; randomise an order of execution for the commutative operations of the at least one task within the pipeline; execute each of the at least one tasks within the pipeline once; and repeat the randomising and executing steps until the computer algorithm has processed all inputs; whereby, at each repetition, the electrical signals produced when executing the at least one task are randomised to enhance security.
  • the plurality of separately executable tasks may be considered to be randomised at a task level.
  • apparatus for enhancing security when executing a computer algorithm comprising memory in which at least part of the computer algorithm is stored as at least one pipeline comprising a plurality of separately executable tasks; and a processor which is configured to randomise the plurality of separately executable tasks at a pipeline level and/or at a task level; and execute the randomised plurality of separately executable tasks whereby electrical signals produced when executing each task are randomised to enhance security.
  • a method for enhancing security when executing a computer algorithm comprising storing at least part of the computer algorithm as at least one pipeline comprising a plurality of separately executable tasks; randomising the plurality of separately executable tasks at a pipeline level and/or at a task level; and executing the randomised plurality of separately executable tasks whereby electrical signals produced when executing each task are randomised to enhance security.
  • a method for enhancing security when executing a computer algorithm comprising separately executable tasks, each of which produces an electrical signal when executed, the method comprising: storing at least part of the computer algorithm as at least one pipeline, wherein the at least one pipeline comprises a plurality of non-commutative separately executable tasks; receiving a plurality of inputs to be processed by the computer algorithm; randomising the order of execution of the plurality of separately executable tasks within the pipeline; attempting to execute each of the plurality of separately executable tasks from the randomised order of execution once by determining whether preconditions for execution are met and, when the preconditions are met, executing the task and storing an output of the executed task; and repeating the randomising and attempting steps until the computer algorithm has processed the plurality of inputs; whereby, at each repeating step, the electrical signals produced when executing the tasks are randomised to enhance security.
  • a method for enhancing security when executing a computer algorithm comprising separately executable tasks, each of which produces an electrical signal when executed, the method comprising: storing at least part of the computer algorithm as at least one pipeline, wherein the at least one pipeline comprises at least one task which comprises a set of commutative operations; receiving a plurality of inputs to be processed by the computer algorithm; randomising an order of execution for the commutative operations of the at least one task within the pipeline; executing each of the at least one tasks within the pipeline once; and repeating the randomising and executing steps until the computer algorithm has processed all inputs; whereby, at each repetition, the electrical signals produced when executing the at least one task are randomised to enhance security.
  • the randomising at a pipeline level and/or a task level may be considered to be a Random Out Of Order Execution [ROOOX] approach. Both commutative and non-commutative portions of the algorithm can be randomised separately or together. Moreover, there may be random mixing both within individual runs and between different runs of the algorithm.
  • an algorithm may be considered to be unrolled, de-nested, and split into portions of sufficiently fine granularity that they can all be randomised and presented in a wait-free asynchronous manner.
  • when randomising at a pipeline level, each task may be presented for execution in a random order but is executed in the correct order for each input.
  • all tasks in the random order may meet the preconditions for execution and thus there may appear to be a random out of order execution of the pipeline. However, each task is still being executed in the correct order for the associated input.
  • An executable task may be considered to be a portion of program code residing in memory that can be run by processing hardware, e.g. the processor.
  • Such a task may be a suitably intercommunicating, wait-free, asynchronously executing sub step of a larger computer algorithm or algorithms.
  • By asynchronous, it is meant that the tasks do not operate synchronously; they are not in lock-step and operate in a non-blocking manner so that the algorithm is guaranteed to complete in a finite number of steps regardless of the speed of operation of the tasks.
  • By wait-free, it is meant that all tasks will complete in finite time; in other words, there is no blocking wait-state and the processor cannot be indefinitely blocked or starved.
  • the separately executable tasks may be of sufficiently fine granularity, where what is sufficient depends on several factors.
  • the size of an executable task is not merely the physical size that the task occupies in memory but also the amount of time that is spent executing the program instructions that define that particular task.
  • the executable task may, for example, be as small as a call or write function, or as large as all the steps necessary to expand an encryption key.
  • Each task may be free of repetitions during a single execution episode, for example to assist in reducing any information that will be leaked at the point of execution.
  • the algorithm may comprise a plurality of rounds of component actions.
  • the granularity of the task may be considered to be coarse when the algorithm is split into tasks, each of which constitutes a round within the algorithm.
  • the granularity of the task may be considered to be medium when the algorithm is split into tasks, each of which constitutes a component action within the algorithm.
  • the granularity of the task may be considered to be fine when the algorithm is split into tasks, each of which constitutes an operation within a component action.
  • the apparatus may further comprise at least one buffer comprising a first queue and a second queue.
  • the at least one buffer may be termed a double buffer.
  • Randomising the plurality of separately executable tasks at a pipeline level may be achieved by generating a pointer which is associated with each executable task; storing the plurality of pointers in temporal order in the first queue; randomising the order of the plurality of pointers; and transferring the randomised plurality of pointers to the second queue from which each pointer is retrievable to execute the associated task.
  • By randomising the order of the pointers the order in which the plurality of tasks are presented for execution is randomised.
  • A pointer is a programming language object that stores the memory address of another value located in computer memory. A pointer references a location in memory, and obtaining the value stored at that location is known as dereferencing the pointer.
  • the plurality of separately executable tasks may be executed by retrieving a pointer at the head of the second queue (for example, in response to an instruction issued by the processor); attempting to execute the task associated with the retrieved pointer; returning the pointer to the first queue; and repeating the retrieving, attempting and returning steps until the second queue is empty.
  • the processor may be configured to determine whether the second queue is empty and when it is determined that the second queue is empty, randomise the order of the plurality of pointers; and transfer the randomised plurality of pointers to the second queue.
  • the randomisation of the pointers in the first queue and transfer to the second queue may be done repeatedly and indefinitely or until the computer algorithm has processed all inputs.
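  • By way of illustration only, the double-buffer mechanism described above might be sketched in Python as follows (the class and method names are ours, not the patent's; a production implementation would use lock-free queues and a stronger randomness source):

    import random
    from collections import deque

    class DoubleBufferedRandomQueue:
        """Sketch of the double buffer: a 'pending' queue of task
        pointers awaiting an execution attempt and a 'future' queue
        of recycled pointers. When pending runs dry, future is
        shuffled and the two queues swap roles."""

        def __init__(self, task_pointers):
            self.pending = deque()
            self.future = deque(task_pointers)

        def pop(self):
            """Return the next task pointer, reshuffling the future
            queue into the pending queue whenever pending is empty."""
            if not self.pending:
                shuffled = list(self.future)
                random.shuffle(shuffled)          # randomise temporal order
                self.pending = deque(shuffled)    # swap queue roles
                self.future = deque()
            return self.pending.popleft()

        def push(self, task_pointer):
            """Recycle a task pointer into the future queue."""
            self.future.append(task_pointer)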
  • Attempting to execute the task may comprise determining whether preconditions for execution are met and, when the preconditions are met, executing the task or, when the preconditions are not met, proceeding straight to the returning step.
  • the preconditions may comprise determining whether there is an intermediate output associated with the task, i.e. whether the task has already been executed and its output stored.
  • the preconditions may comprise determining whether there is an intermediate input associated with the task, i.e. whether the previous task has already been executed and its output stored so that the task can be executed.
  • Although the repetitive cycling of unmet tasks adds time, the wait-free action reduces it; overall, the asynchronous nature of the process according to the invention reduces processing time.
  • the pointers may be retrieved from (i.e. exit) the second queue in an asynchronous, wait-free manner.
  • A non-uniform execution is achieved and any semblance of a pattern is obliterated, which reduces the chance of a successful side channel attack.
  • Although the plurality of executable tasks are presented in a random order, the continual testing to see whether each task can be executed ensures that they are executed in order, which, it will be appreciated, is important for non-commutative tasks.
  • the apparatus may comprise a first buffer and a second buffer each comprising a first and second queue.
  • Each of the first and second buffer may be a temporal double buffer.
  • the memory may store a first pipeline comprising a plurality of separately executable tasks and a second pipeline comprising a plurality of separately executable tasks.
  • the first pipeline may be an encryption pipeline for encrypting plaintext and the second pipeline may be a decryption pipeline for decrypting cipher text.
  • Randomisation may be achieved by generating a pointer which is associated with each executable task in the first and second pipelines; storing the plurality of pointers for the plurality of separately executable tasks in the first pipeline in temporal order in the first queue of the first buffer; storing the plurality of pointers for the plurality of separately executable tasks in the second pipeline in temporal order in the first queue of the second buffer; randomising the order of the plurality of pointers in each of the first queues; transferring the randomised plurality of pointers from the first queue of the first buffer to the second queue of the first buffer; and transferring the randomised plurality of pointers from the first queue of the second buffer to the second queue of the second buffer.
  • the pointers in each of the second buffers are retrievable for wait-free execution as described above.
  • the apparatus may comprise at least one circular buffer for storing a plurality of pointers with each pointer being associated with one of the plurality of tasks and a plurality of queues connected to the at least one circular buffer with each queue storing the output from each of the plurality of tasks when it is executed.
  • such a circular buffer may be termed a single producer, single consumer (SPSC) queue.
  • the stored output may be termed an intermediate output.
  • the processor may be configured to randomise the order in which the plurality of separately executable tasks within the pipeline (or at a pipeline level) are presented for execution by randomising the order of the plurality of pointers within the at least one circular buffer.
  • the processor is configured to attempt to execute each of the plurality of separately executable tasks from the randomised order by: retrieving a pointer at the head of the circular buffer; attempting to execute the task associated with the retrieved pointer; returning the pointer to the circular buffer; and repeating the retrieving, attempting and returning steps until all pointers within the circular buffer have been attempted. Once all the pointers within the circular buffer have been attempted, the order of the plurality of pointers may be randomised again and the process repeats. [35] There may be a plurality of processing cores for executing the plurality of separately executable tasks. The plurality of processing cores may be spread across a network or may be on the same chip, or in the same device.
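  • A minimal sketch of such a circular buffer in Python follows (the class name is illustrative; the failing push/pop returns correspond to the "output full" and "input empty" preconditions discussed elsewhere, and keep the queue non-blocking):

    class CircularBuffer:
        """Fixed-capacity ring buffer used as an inter-task queue for
        intermediate outputs (an SPSC-style queue in this sketch)."""

        def __init__(self, capacity):
            self.slots = [None] * capacity
            self.capacity = capacity
            self.head = 0    # next slot to read
            self.tail = 0    # next slot to write
            self.count = 0

        def push(self, item):
            """Store an intermediate output; False when full (output-full)."""
            if self.count == self.capacity:
                return False
            self.slots[self.tail] = item
            self.tail = (self.tail + 1) % self.capacity
            self.count += 1
            return True

        def pop(self):
            """Retrieve the oldest item; None when empty (input-empty)."""
            if self.count == 0:
                return None
            item = self.slots[self.head]
            self.head = (self.head + 1) % self.capacity
            self.count -= 1
            return item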
  • each core may involve several logical threads and a pointer may be associated with each of the logical threads. Randomisation of the location in which a task is executed may mean that it is not possible to rely on tracking the emanations from a particular core for an SCA.
  • a first processing core may be configured to retrieve the pointers from the second queue of the first buffer (and hence execute the associated tasks of the first pipeline) and a second processing core may be configured to retrieve the pointers from the second queue of the second buffer (and hence execute the associated tasks of the second pipeline). It will be appreciated that two is also illustrative in this example and the number of processing cores may be selected to match the number of pipelines.
  • the feature of utilising a plurality of cores may be used for parallel processing; e.g. for encryption and decryption or other similar substantially embarrassingly parallel problems.
  • an apparatus with a plurality of cores generally follows Gustafson-Barsis' Law and benefits from linear scaling.
  • the apparatus may be adapted to allow for recruiting and retiring processing cores from a pool of processing cores. When a core is recruited to the pool, a pointer to the core, or pointers to its logical cores, may be generated and supplied to the plurality of pointers.
  • the pointers to the processing cores may then be randomised, leading to randomness in location of execution of a program (i.e. topological randomisation).
  • a core When a core is retired, its respective pointer, or pointers to its logical cores, may be removed from the plurality of pointers.
  • randomisation may be introduced by generating a pointer which is associated with each of the plurality of processing cores/threads.
  • the pointers may be randomised using the double buffer and/or the SPSC described above.
  • the randomisation may be introduced by storing the plurality of pointers in topological order in the first queue; randomising the order of the plurality of pointers; and transferring the randomised plurality of pointers to the second queue from which each pointer is retrievable to determine which of the plurality of processing cores/threads is to execute a task.
  • a double buffer may be termed a topological buffer because it randomises the location in which the task is executed as opposed to the order in which a task is executed as in a temporal double buffer described above.
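  • A minimal sketch of such a topological buffer, reusing the DoubleBufferedRandomQueue sketched earlier to shuffle core identifiers rather than task pointers (run_on is a hypothetical placement primitive standing in for whatever core-affinity mechanism the platform provides):

    def run_on(core, task):
        # Hypothetical placement primitive: a real system would pin
        # execution to the named core/thread; this stub just calls the task.
        return task()

    core_queue = DoubleBufferedRandomQueue(["core0", "core1", "core2", "core3"])

    def dispatch(task):
        """Execute a task at a randomly selected location."""
        core = core_queue.pop()    # where to execute is now random
        result = run_on(core, task)
        core_queue.push(core)      # recycle the core identifier
        return result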
  • the randomness may be enhanced by combining temporal and location randomisation.
  • the apparatus may comprise a first buffer having a first temporal queue and a second temporal queue; a second buffer having a first topological queue and a second topological queue; and a scheduler which is configured to cooperate with the processor to randomise the plurality of separately executable tasks at a pipeline level both temporally and topologically.
  • a functional change in temporal and topological ordering of execution may thus be achieved by the mixing of pointers to both the program code of the executable tasks, and location of execution, such that, when and where the tasks are physically executed is entirely random.
  • the at least one buffer may be located within a register of the processor, particularly if pointers are embodied as small enough pieces of memory that can readily fit inside a processor’s register.
  • the at least one buffer may be separate from the processor.
  • the processor may thus be configured to randomise the plurality of separately executable tasks at a pipeline level by communicating or cooperating with the at least one buffer.
  • a plurality of shuffled double buffers may be used as explained above to shuffle pointers to executable tasks and/or a plurality of shuffled double buffers may be used as explained above to shuffle pointers to locations of execution.
  • Randomisation may be introduced within the pipeline (at a pipeline level) or within the task itself (at a task level). Both types of randomisation may be used separately or combined. Randomisation within the task may be used when the task comprises a set of commutative operations.
  • the processor may then be configured to randomise an order of execution of the commutative operations within the task.
  • the commutative operations may have a set of indices and randomising the order of execution of commutative operations may comprise randomising the set of indices for the commutative operations before execution of the selected task.
  • the location of execution of each commutative operation may also be randomised.
  • Randomisation may be introduced into the plurality of separately executable tasks at a task level by selecting at least one task of the plurality of separately executable tasks, wherein the selected task comprises a set of discrete actions, wherein each set of actions has a set of indices indicating the order in which each action is executed; and randomising the set of indices for the discrete actions before execution of the selected task.
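  • By way of a hedged sketch, randomising the indices of a set of commutative actions could be expressed as below (each action is assumed to be a zero-argument callable whose effect does not depend on execution order):

    import random

    def execute_commutative(actions):
        """Execute a set of commutative actions in a freshly
        shuffled order; the indices are re-randomised on every call."""
        indices = list(range(len(actions)))
        random.shuffle(indices)   # randomise the set of indices
        for i in indices:
            actions[i]()          # order differs on each execution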
  • the at least one task may be selected by determining the length of time required to execute each task in the pipeline and selecting one or more tasks which take longer to execute than other tasks. The tasks which take longer to execute may be more vulnerable to successful SCAs.
  • the at least one task may be selected based on the number of commutative actions within the plurality of separately executable tasks.
  • the at least one task may contain the most commutative actions or a greater proportion of commutative actions than non-commutative actions.
  • the randomisation at a task level may also comprise separating the or each selected task into the set of discrete actions.
  • the set of discrete actions may be obtained by unrolling one or more loops within the task into the discrete actions, wherein each discrete action is commutative.
  • the set of indices may be obtained from the loop indices.
  • the at least one task may be selected based on the number of times the at least one task repeats in the pipeline. For example, the selected task may have more repeats.
  • randomisation at a task level may be introduced for each occurrence of the same task, for example by identifying each occurrence of a repeating task and randomising the set of indices for each occurrence.
  • randomisation at a task level may be introduced for each repeat of at least one task or each repeat of several tasks within the pipelines. For example, where there are four tasks which repeat the same number of times within the algorithm, there may be considered to be 25% randomisation at a task level when randomisation is introduced for each repeat of one of the four tasks.
  • the computer algorithm may be an encryption algorithm encrypting plaintext or decrypting to plaintext.
  • the plurality of separately executable tasks may comprise AddRoundKey, SubBytes [or inverse], ShiftRows [or inverse], and MixColumns [or inverse].
  • the AddRoundKey may be the selected task because this is rich in commutative actions and is repeated multiple times in the pipeline.
  • any computer algorithm where there are one or more parts that one wishes to remain secret and immune from SCAs can benefit from the invention.
  • algorithms involved in blockchain processing would benefit from having the relevant part or parts of blockchain processing being kept secret.
  • the invention may, for example, be applied to Secure Multiparty Computation [SMC].
  • SMC Secure Multiparty Computation
  • Machine learning algorithms have many repeated processes that can be converted to execute in accordance with the invention. Moreover, in-house and custom banking algorithms which evaluate stock and foreign exchange for purchase would also benefit from having parts immune from SCAs.
  • the computer algorithm may be a hashing algorithm, e.g. SHA256.
  • apparatus for controlling encryption (or decryption) which comprises memory in which the encryption (or decryption) algorithm is stored as at least one pipeline comprising a plurality of separately executable tasks; and a processor which is configured to receive at least one plaintext input (ciphertext input); randomise the plurality of separately executable tasks at a pipeline level and/or at a task level; and execute the randomised plurality of separately executable tasks to encrypt plaintext (or decipher ciphertext).
  • the encryption algorithm can be any one from the many known to the skilled person. In an earlier competition run by the National Institute of Standards and Technology (NIST), there were five algorithms referred to as the “AES finalists”, namely Rijndael, Serpent, Twofish, RC6 and MARS.
  • Rijndael's algorithm has multiple stages, each of which comprises one or more steps selected from: AddRoundKey, SubBytes or its inverse, ShiftRows or its inverse, and MixColumns or its inverse.
  • Each of these steps may be considered to be an executable task which is of sufficiently small granularity for the method and system in which randomisation may be introduced at the pipeline level.
  • the granularity may be considered to be fine.
  • the executable task for AddRoundKey comprises two nested loops; an outer loop executes four iterations of the inner loop, which itself has four iterations, resulting in sixteen iterations of the EXclusive OR (XOR) operation, which form a set of commutative operations.
  • Each outer loop may be considered to comprise a set of four discrete actions with each set of actions having a set of indices (in this example, one to four).
  • each inner loop may be considered to comprise a set of four discrete actions with each set of actions having a set of indices (in this example, also one to four).
  • the randomisation at a task level may thus be introduced by randomising the set of indices for at least one of the inner and outer loops.
  • shuffling of indices can also be applied to the 16 discrete actions of the SubBytes task, the four discrete actions (cycles) of the MixColumns task and the four discrete actions (stages) of the ShiftRows task. There may also be shuffling within each of the cycles of the MixColumns and ShiftRows tasks.
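  • As an illustration only, AddRoundKey with shuffled loop indices might be sketched as follows (state and round_key are assumed to be 4x4 arrays of byte values; a production implementation would draw its randomness from a CSPRNG rather than Python's random module):

    import random

    def add_round_key_shuffled(state, round_key):
        """XOR the round key into the state with both the outer (row)
        and inner (column) loop index sets randomly shuffled; the
        sixteen XOR operations are commutative, so the result is
        unchanged while the execution order differs on every call."""
        rows = list(range(4))
        cols = list(range(4))
        random.shuffle(rows)   # shuffle outer-loop indices
        random.shuffle(cols)   # shuffle inner-loop indices
        for r in rows:
            for c in cols:
                state[r][c] ^= round_key[r][c]
        return state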
  • the AES finalist Rijndael can be run with different key sizes, e.g. 128, 192 and 256 bits, and different numbers of rounds: 10 rounds for AES-128, 12 rounds for AES-192 and 14 rounds for AES-256.
  • in NIST's more recent post-quantum cryptography standardisation process, the finalists are Classic McEliece, Crystals-Kyber, NTRU and Saber in the first group, and Crystals-Dilithium, Falcon and Rainbow in the second group. Three of the first group admit that they are vulnerable to side channel attack without modification, and NTRU is silent on the matter. All these finalists are computer algorithms which can be modified to execute in accordance with the invention. [49] Many of the digital signature candidates are based on the Keccak family, which also underpins SHA3, and this group are non-prime based algorithms, so they are safe from quantum computers running Shor's algorithm.
  • the Rijndael AES finalist is also non-prime based.
  • several pipelines may be processed.
  • several encryption algorithms may be decomposed into executable tasks in different pipelines running on different plaintext feeds or decryption modes.
  • the apparatus typically receives a plurality of inputs, for example in the context of encryption, there may be several sources of plaintext or ciphertext.
  • the plurality of inputs may be processed at the same time, for example a video stream together with an audio stream. In this regard, the temporal mixing of chunks of each source further adds to the randomness and thereby the resistance to SCAs.
  • Randomisation may be introduced at the task level, e.g.
  • true randomness is the quality or state of lacking any pattern or principle of organization, being entirely unpredictable and therefore unable to convey any information whatsoever. It will be appreciated that true randomness is, in fact, an abstract concept that can only be achieved when a system is at the theoretical limit of disorganization or maximum entropy. However, in cryptographic theory, the term “true randomness” describes a quality of the order of events such that they are not derived from any deterministic logic. Some physical phenomenon that is expected to be random, such as atmospheric noise, thermal noise, other external electromagnetic phenomena, or quantum phenomena, must be sampled and measured. [53] True randomness may be slow.
  • “Cryptographically Secure Pseudo-Randomness” depends upon the mathematical technique used to generate the Pseudo-Random Numbers (PRNs).
  • depending on the size and “quality” of the sequence-initiating seed, the output will have a greater or lesser degree of randomness.
  • that sequences of numbers can possess varying “amounts” of randomness may seem odd, but using various statistical tests it is possible to assess the amount of information that such sequences reveal and to assign to them degrees of entropy.
  • the apparatus may comprise a random number generator to assist the processor when randomising the plurality of executable tasks at a pipeline or task level.
  • the processor may be configured to randomise the plurality of separately executable tasks at a pipeline level and/or a task level by communicating with the random number generator.
  • Random number generators fall into two main categories: fast but imperfect pseudo-random number generators (PRNGs), and slow but perfect-entropy True Random Number Generators (TRNGs), which for all practical purposes cannot be guessed or predetermined and typically use external physical phenomena to generate true randomness.
  • the temporal and topological scrambling and/or the index shuffling may use a random number generator.
  • the choices are implementation dependent, but at the very least a fast PRNG is required to drive the shuffling permutations of indices and pointers needed by the methods described below.
  • a combination of both PRNG and TRNG may be preferred where a slow TRNG is used, at intervals (implementation dependent), to seed the PRNG.
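  • The seeding pattern might be sketched as follows; note that Python's random module is a Mersenne Twister rather than a NIST-approved CSPRNG, so this illustrates only the structure of a TRNG-seeded fast PRNG, with os.urandom standing in for the slow entropy source:

    import os
    import random

    class ReseededPRNG:
        """Fast PRNG re-seeded at intervals from a slow entropy source."""

        def __init__(self, reseed_interval=10_000):
            self._rng = random.Random(os.urandom(32))  # seed from OS entropy
            self._interval = reseed_interval
            self._uses = 0

        def shuffle(self, items):
            """Shuffle in place, re-seeding every reseed_interval calls."""
            self._uses += 1
            if self._uses >= self._interval:
                self._rng.seed(os.urandom(32))  # slow source re-seeds fast PRNG
                self._uses = 0
            self._rng.shuffle(items)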
  • The temporal and topological scrambling may use both a hardware-generated source of true randomness, such as the on-chip entropy source present in most modern processors, and a demand-led fallback to periodically re-seeded, software-based, NIST-approved, cryptographically secure pseudorandom number generators (CSPRNGs).
  • CSPRNGs and on-chip true randomness may add further to the entropy, thwarting SCAs, since the attacker cannot establish a correlation with a known CSPRNG.
  • the invention also thwarts FIAs.
  • FIAs have a high dependency on each task being executed at exactly the same time and in the same order in each iteration of an algorithm.
  • the random out of order execution of each task for each iteration, which is key to the inventive techniques described above, means that there is no consistency between one round and another.
  • a fault injection attacker has no reliable event to focus upon because they can never expect task X to occur at time Y in a repeatable fashion.
  • present techniques may be embodied as a system, method or computer program product. Accordingly, present techniques may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. For example, the automatic encryption of data as it is stored on a memory chip is commonplace and the invention may be integrated into such memory chips as part of their encrypting process.
  • present techniques may take the form of a computer program product embodied in a computer readable medium having computer readable program code embodied thereon.
  • the computer readable medium may be a computer readable signal medium or a computer readable storage medium.
  • a computer readable medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing.
  • Computer program code for carrying out operations of the present techniques may be written in any combination of one or more programming languages, including object oriented programming languages and conventional procedural programming languages. Code components may be embodied as procedures, methods or the like, and may comprise sub-components which may take the form of instructions or sequences of instructions at any of the levels of abstraction, from the direct machine instructions of a native instruction set to high-level compiled or interpreted language constructs.
  • Embodiments of the present techniques also provide a non-transitory data carrier carrying code which, when implemented on a processor, causes the processor to carry out any of the methods described herein.
  • the techniques further provide processor control code to implement the above-described methods, for example on a general purpose computer system or on a digital signal processor (DSP).
  • DSP digital signal processor
  • the techniques also provide a carrier carrying processor control code to, when running, implement any of the above methods, in particular on a non-transitory data carrier.
  • the code may be provided on a carrier such as a disk, a microprocessor, CD- or DVD-ROM, programmed memory such as non-volatile memory (e.g. Flash) or read-only memory (firmware), or on a data carrier such as an optical or electrical signal carrier.
  • Code (and/or data) to implement embodiments of the techniques described herein may comprise source, object or executable code in a conventional programming language (interpreted or compiled) such as Python, C, or assembly code, code for setting up or controlling an ASIC (Application Specific Integrated Circuit) or FPGA (Field Programmable Gate Array), or code for a hardware description language such as Verilog (RTM) or VHDL (Very high speed integrated circuit Hardware Description Language).
  • a controller which includes a microprocessor, working memory and program memory coupled to one or more of the components of the system.
  • a logical method may suitably be embodied in a logic apparatus comprising logic elements to perform the steps of the above-described methods, and such logic elements may comprise components such as logic gates in, for example, a programmable logic array or application-specific integrated circuit.
  • Such a logic arrangement may further be embodied in enabling elements for temporarily or permanently establishing logic structures in such an array or circuit using, for example, a virtual hardware descriptor language, which may be stored and transmitted using fixed or transmittable carrier media.
  • the present techniques may be realised in the form of a data carrier having functional data thereon, said functional data comprising functional computer data structures to, when loaded into a computer system or network and operated upon thereby, enable said computer system to perform all the steps of the above-described method.
  • Fig. 3 is a flowchart illustrating the method of randomising at a pipeline level which may be carried out on the apparatus of Fig. 1;
  • Fig. 4 shows the double buffered random queue of Fig. 2 together with the steps in the process;
  • Fig. 5a is a flowchart for AES encryption and decryption, which is an example of a computer program which can be processed using the method and apparatus described above;
  • Fig. 5b is a schematic diagram illustrating different levels of task granularity applied to the AES algorithm;
  • Fig.6 is a diagram of an AES dataflow communicating task pipeline of ten rounds for a 128-bit key relating to the encryption process illustrated in Fig.5
  • Fig.7a is a key for each of the steps in Fig.7b and 7c
  • Fig.7b and 7c are alternate representations of the pipeline in Fig.6 using the key in Fig.7a
  • Figs.8a and 8b are representations of two alternative arrangements having two pipelines using the key in Fig.7a
  • Fig.9a is a variation of the apparatus shown in Fig.1
  • Fig.9b is a representation of the pipeline for Fig.9a using the key in Fig.7a
  • Figs. 10a and 10b illustrate a further arrangement of the apparatus in two configurations;
  • Fig.10c is a representation of
  • Fig. 11c is a variation of the apparatus shown in Fig. 9a with six processing cores and related buffered random queues;
  • Fig. 12a is a schematic illustration of a circular buffer which is used to implement an interoperation communication queue, e.g. SPSC;
  • Figs. 12b and 12c are schematic illustrations of pipelines using SPSC queues at the coarse and medium granularity levels respectively;
  • Fig. 12d is a schematic flowchart for implementing random out of order execution of the pipeline of Fig. 12c;
  • Figs. 13a and 13b are schematic representations of the commutative operations within ShiftRows and MixColumns of the AES algorithm, respectively;
  • Fig.13c is a flowchart of the method of randomising at a task level which may be carried out on the apparatus of Fig.13d;
  • Fig.13d is a schematic block diagram of the apparatus for carrying out the method of Fig.13c;
  • Figs. 13e and 13f are schematic representations of tuneable shuffling of the commutative operations within AddRoundKey and SubBytes of the AES algorithm, respectively;
  • Figs.14a and 14b are screen shots from an oscilloscope measuring the test apparatus carrying out standard AES encryption and AES encryption using the method of Fig.13c, respectively;
  • Figs. 15a to 15e are test results indicating the performance of AES encryption without any mixing and with different levels of mixing as implemented using the method of Fig.13c;
  • Fig. 16 is a schematic diagram illustrating different levels of task granularity applied to the Keccak-f algorithm;
  • an algorithm can be adapted so that it is amenable to randomised out of order execution (ROOOX) and therefore resistant to SCAs and FIAs.
  • Adapting the algorithm may comprise examining the parts (for example tasks and/or operations) within the algorithm and the order in which they occur. Commuting steps are those which can be executed in any order, and non-commuting steps must be executed in a specific order. An element of the adaptation may thus include identifying the commuting and non-commuting steps. Different techniques for random execution may then be applied to commuting and non-commuting tasks (or operations), as explained in more detail below. [98] The process of modifying the part(s) or whole of the algorithm to enable ROOOX may be termed “pugging”.
  • the dictionary definitions of pugging include the act or process of working and tempering clay to make it plastic and of uniform consistency, and deadening sound, e.g. by laying mortar or the like between the joists under the boards of a floor or within a partition.
  • the goal of pugging an algorithm is to enable flexibility of execution to the point where the randomised out of order execution of its parts prevents an attacker gaining any useful information from the analysis of its side-channel emissions during execution.
  • the act of pugging may be described as an approach that breaks apart, or decomposes, an algorithm into its smaller parts in order to understand the relationship between its execution and the side-channel information which is leaked during execution.
  • the decomposed parts may then be randomised in time (temporal), or in both time (temporal) and place (topological). As described below, this is achieved by using a system which is able to change the order in which parts are executed by one or more processors and may also change the location (i.e. processor/processing core) which is executing the part of the algorithm.
  • Commuting and non-commuting operations need to be randomised differently, and thus a successful implementation requires one or both of: a) a means of elastically executing commuting operations temporally, or temporally and topologically; and b) a means of elastically executing non-commuting operations temporally, or temporally and topologically.
  • Non-Commutative Operations: [100] Fig. 1 shows a schematic diagram of one system for executing a computer algorithm so that the risk of successful side channel attacks is reduced.
  • this is the simplest arrangement with a single core microprocessor 10 which accesses pointers from a double buffered random queue (DBRQ) 12.
  • the DBRQ is long enough to store a plurality of pointers each of which points to a task within the computer algorithm.
  • the DBRQ may thus be considered to be a pipeline comprising a plurality of separately executable tasks.
  • a pipeline is particularly suitable when tasks are non-commuting and thus need to be executed in a specific order.
  • the connections in the pipeline retain the logical order of the processing but physical execution of the algorithm’s order is random.
  • Non-commuting tasks are critically dependent on the immediate outcome of the neighbouring tasks within the pipeline.
  • non-commuting tasks may also be termed operations, where an operation is defined in mathematics as a function which takes zero or more input values (called operands) to a well-defined output value.
  • A pointer is a programming language object that stores the memory address of another value located in computer memory.
  • a pointer references a location in memory, and obtaining the value stored at that location is known as dereferencing the pointer.
  • the actual format and content of a pointer variable is dependent on the underlying computer architecture. However, as schematically shown in Fig.1, the processor 10 accesses a memory 14 based on the pointer from the DBRQ 12.
  • Each pointer (ptr) is represented with a single symbol, e.g.
  • the memory may comprise a volatile memory, such as random access memory (RAM), for use as temporary memory, and/or non-volatile memory such as Flash, read only memory (ROM), or electrically erasable programmable ROM (EEPROM), for storing data, programs, or instructions, for example.
  • each pointer points to a piece of executable code.
  • Each piece of executable code may be part of a larger computer algorithm.
  • a computer algorithm is a sequence of steps that take an input and convert it into an output via a series of intermediate input and output stages. Traditionally, a single hardware execution resource steps and loops through each of the computational stages in the form of commands.
  • a task may be defined as a unit of execution or a unit of work. Alternative terms include process, light-weight process, thread (for execution), step, request or query (for work).
  • a generic task is one where the details of purpose are not relevant, such as in the scheduling of tasks, and may be represented by a single symbol, e.g. a black box. This black box can be a placeholder for any of the tasks, e.g. the serial fractions, within an algorithm.
  • a memory pointer may also be represented with a single symbol, e.g. an empty box.
  • such a DBRQ as shown in Fig. 1 is suitable for use with a decomposed 10-round, 128-bit key AES encryption pipeline. Accordingly, in this example, there are 38 pointers, but it will be appreciated that the length may be adjusted to suit the computer algorithm which is being executed on the system.
  • the purpose of a task is of no concern to the DBRQ, and so the DBRQ may be conveniently represented as two short groups of either vertical or horizontal (as shown in Fig. 1) task pointer symbols. On either side of the task pointer symbols, Fig. 1 shows a PUSHable input and a POPable queried output to complete the symbolic representation of the DBRQ.
  • the apparatus may comprise a random number generator 16 to assist the processor when randomising the plurality of executable tasks at a pipeline level using the DBRQ.
  • the temporal scrambling may use both a hardware-generated source of true randomness, such as the on-chip entropy source present in most modern processors, and a demand-led fallback to periodically re-seeded, software-based, NIST-approved, cryptographically secure pseudorandom number generators (CSPRNGs).
  • Fig.3 is a schematic flowchart illustrating the steps which may be carried out by the system of Fig.1 with the DBRQ of Fig.2.
  • In a first step S300, the processor issues a first POP instruction to the DBRQ, which results in an internal POP request to the pending task queue in the DBRQ.
  • the processor determines whether the pending task queue is empty in step S302. When it is determined that the pending task queue is empty, the pointers in the full future task queue are randomly shuffled in step S304. The randomised pointers in the future task queue are then swapped into the empty pending task queue in step S306. In other words, the failed POP request effectively triggers a request to swap the empty pending task queue and the full future task queue which exchanges their roles. [110] As shown in Fig.3, a new POP instruction may then be issued to the DBRQ (step S300).
  • This time, the determination of whether the pending task queue is empty (step S302) will find that the pending task queue is not empty.
  • the now full, randomly ordered, pending queue can fulfil the POP request as set out at step S308.
  • the DBRQ (particularly the pending task queue) releases a task to the processor for execution.
  • step S310 There is an attempt to execute the task at step S310.
  • intermediate input data may be needed for the task to be executed, thus it is not always possible to execute a task.
  • the processor issues a DBRQ PUSH instruction to recycle the completed task pointer at step S312.
  • the DBRQ stores this recycled pointer in the future task queue which is initially empty.
  • Steps S300, S302, S308, S310 and S312 are then repeated for each task pointer in the pending task queue so that the future task queue is gradually filled.
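  • Tying these flowchart steps to the DoubleBufferedRandomQueue sketched earlier, the processor's main loop might look as follows (task_pointers, inputs_remain and try_execute are hypothetical stand-ins for the pipeline state and the precondition-checked execution attempt):

    dbrq = DoubleBufferedRandomQueue(task_pointers)
    while inputs_remain():       # until the algorithm has processed all inputs
        ptr = dbrq.pop()         # S300/S308: POP (S302-S306 reshuffle on empty)
        try_execute(ptr)         # S310: executes only if preconditions are met
        dbrq.push(ptr)           # S312: recycle the pointer to the future queue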
  • Fig.4 is a schematic system and flowchart illustrating a snapshot of the process of Fig.3.
  • the reference numbers for features in common with Fig.2 and 3 are used in Fig.4.
  • the future task queue 22 comprises a number (in this example, three) of task pointers which have been recycled following their execution, if possible, by the processor 10.
  • a task pointer 30 (which has the sequential order number 10) is shown as being fetched by the POP instruction (S300) to be executed by the processor 10.
  • Another task pointer 32 (which has the sequential order number 13) is shown within the processor 10 for execution.
  • FIG. 4 illustrates the process of attempting to execute each task in more detail.
  • the processor determines whether the output is full, i.e. has the task already been executed in a previous cycle through the task pointers. If the processor determines that the output is full, the task pointer is recycled (step S312) and returned to the future task queue.
  • The processor then determines whether the input is empty (step S402). Some of the tasks will be dependent on other tasks having already been performed. If the input is empty (i.e. the task cannot be completed until at least one other task is completed), the task pointer is recycled (step S312) and returned to the future task queue. If the input is not empty, the task is ready for execution.
  • the checks to determine whether the input is empty and/or the output is full may be considered to be preconditions which need to be met before executing the task. [114] As explained above, the pointer references a location in memory. Once the preconditions are met, the next step S404 is to read the input from the memory 14. The appropriate task is then carried out in step S406.
  • the mechanism of keeping two separate queues, one for pending task pointers and one for future task pointers, and only applying a random transformation to the temporal ordering when the queues are swapped means that the double buffered random queue ensures two critical fairness guarantees, namely: 1. although each task will be presented for execution in a random order, it is guaranteed to be executed at least once within the execution time of the total number of tasks in the pipeline; and 2. although the tasks in the pipeline will be presented for execution randomly, the whole pipeline is guaranteed to be executed. [117] The importance of the critical fairness guarantees may be illustrated by considering a short algorithm which comprises several tasks, e.g. f(x) = 3·sin(x²) + 1.
  • this function is composed of four tasks (or operations) performed in the sequence square, sine, multiplication and addition.
  • the ordered sequence of the operations can be broken into steps with the intermediate (or intermediary) outputs represented by a, b, c and d.
  • the DBRQ may allow the algorithm to execute more than one input, which may be represented as x0, x1... xn. These inputs may be represented in a queue as [xn... x0].
  • the pipeline process may be represented as: [xn ... x0] → square → [] → sine → [] → multiply 3 → [] → add 1 → f(x)
  • a pointer to each of these tasks is initially included in the future task queue and then the pointers are randomly shuffled into the pending task queue. If the order of the task pointers is randomly shuffled to be 3, 4, 1, 2 in the pending task pointer queue, following the process above will mean that the task pointer to task 3 is selected first and there is an attempt to execute step 3. The precondition checks will show that there is no intermediate input b0, so the step will fail and thus there is no change to the representation above.
  • there will then be an attempt to execute step 4, which will also fail and return the same result as above. Only when the third task pointer is taken from the pending task queue will the attempt on the task be successful, because the required input to step 1, i.e. x0, is available and the intermediate output a0 can be output. Thus, the result below is returned, and it is noted that the input queue length is reduced: [xn ... x1] → square → [a0] → sine → [] → multiply 3 → [] → add 1 → f(x). Step 2 “sine” can also now be implemented because the intermediate input a0 is available.
  • step 4 “add” fails because there is no intermediate input c0 and the result above is returned again.
  • Step 3 receives the intermediate input b0 and can return intermediate output c0, and so the result below is returned: [xn ... x2] → square → [a1] → sine → [] → multiply 3 → [c0] → add 1 → f(x)
  • step 2 “sine” is processed and the intermediate input a1 is available, so the result below is returned: [xn ... x2] → square → [] → sine → [b1] → multiply 3 → [c0] → add 1 → f(x)
  • the task pointers are randomly shuffled again and moved to the pending task queue.
  • step 3 “multiply” receives the intermediate input b1 and can return intermediate output c1 and so the result below is returned: [xn ... x2] → square → [] → sine → [] → multiply 3 → [c1, c0] → add 1 → f(x)
  • step 4 “add” receives the intermediate input c0 (there is only a single processor) and can return intermediate output d0 and so the result below is returned: [xn ... x2] → square → [] → sine → [] → multiply 3 → [c1] → add 1 → [d0]
  • Step 2 “sine” tries to execute but there is no intermediate input so the result above is unchanged.
  • step 1 “square” is processed and the intermediate input x2 is available so the result below is returned: [xn ... x3] → square → [a2] → sine → [] → multiply 3 → [c1] → add 1 → [d0]
  • the pipeline has now processed its first element to completion but as shown there are additional inputs to process. Given that the pipeline is not empty, processing continues using randomly permuted out of order execution (or randomised out of order execution – the terms may be used interchangeably) until all inputs have been processed. In other words, the tasks in the pipeline continue to be presented for execution in a random order but are only executed if the associated intermediate input is available. Thus, for each input, the tasks are executed in the correct order.
  • in the first pass through the pipeline, only a first intermediate output a0 will be generated, i.e. only one of the tasks can be executed.
  • the intermediate output a0 can then be used in the next pass through the pipeline to generate first intermediate output b0 and if there is more than one input, a second intermediate output a1 will also be generated, i.e. two of the tasks can be executed.
  • Intermediate output c0 is generated in the third pass together with a second intermediate output b1 and a third intermediate output a2. In other words, three of the tasks execute when there are multiple inputs.
  • the first output d0 is generated on the last pass and thus the algorithm has produced an output after four executions.
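  • purely as an illustration, the whole four-stage example (square, sine, multiply by 3, add 1 – i.e. f(x) = 3·sin(x²) + 1) could be sketched in C++ as below; the simple std::deque queues and all names are assumptions of this sketch, not the wait-free queues described later in this disclosure:

    #include <algorithm>
    #include <cmath>
    #include <deque>
    #include <functional>
    #include <iostream>
    #include <random>
    #include <vector>

    int main() {
        std::deque<double> in {1.0, 2.0, 3.0}, a, b, c, out;   // inputs and intermediates
        // Each stage fails fast if its input queue is empty; otherwise it consumes one
        // item, applies its operation and pushes the result to its output queue.
        auto stage = [](std::deque<double>& src, std::deque<double>& dst, auto op) {
            return [&src, &dst, op] {
                if (src.empty()) return;                        // precondition failed
                dst.push_back(op(src.front()));
                src.pop_front();
            };
        };
        std::vector<std::function<void()>> tasks {
            stage(in, a,   [](double x) { return x * x; }),        // 1: square
            stage(a,  b,   [](double x) { return std::sin(x); }),  // 2: sine
            stage(b,  c,   [](double x) { return 3.0 * x; }),      // 3: multiply 3
            stage(c,  out, [](double x) { return x + 1.0; })       // 4: add 1
        };
        std::vector<int> order {0, 1, 2, 3};
        std::mt19937 rng(std::random_device{}());
        while (out.size() < 3) {                              // until all 3 inputs are processed
            std::shuffle(order.begin(), order.end(), rng);    // shuffle once per turn
            for (int i : order) tasks[i]();                   // attempt each task exactly once
        }
        for (double d : out) std::cout << d << '\n';          // f(x) values, in input order
        return 0;
    }

Despite the random order of attempts, each input still passes through square, sine, multiply and add in the correct logical order, exactly as in the worked example above.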
  • the algorithm which is executed as described above may be any algorithm with any number of tasks arranged in a pipeline. Regardless of the action of a pipeline task, there are defining features common to all tasks. For example, failing any of the preconditions required for execution should happen in the shortest amount of time possible so that the task pointer can be recycled almost immediately. Such a failure-fast-track mechanism is key to minimising the overheads inherent in decomposing an algorithm into communicating sequential tasks.
  • a task may comprise a minimal number of repeated sections and, as explained below, may further have the order of its loop indices randomly shuffled each time.
  • all repeating sections in a task are essential, short, discrete and randomly shuffled - otherwise they should be broken out into separate tasks.
  • a task with looping sections could have each of the loop stages broken out into separate smaller communicating tasks. However, doing so rapidly balloons the unavoidable overheads of inter-task communication and randomised scheduling affecting performance and scalability. Therefore, a balance must be struck between breaking up an algorithm into ever smaller tasks - the task granularity - and the overheads of the system (inter-task communication, temporal randomisation, topological randomisation and scheduling).
  • Examples of algorithms which may be separated into tasks and processed as described above include encryption algorithms and these can be any one from the many known to the skilled person.
  • Examples of encryption algorithms are the five algorithms referred to as the “AES finalists” in the NIST selection process, namely Rijndael, Serpent, Twofish, RC6 and MARS.
  • the Rijndael AES encryption algorithm will be used as an example algorithm.
  • the algorithm may be decomposed into a task pipeline with several discrete tasks.
  • Fig.5a illustrates the steps or tasks involved in encrypting plaintext [on the left] and decrypting cipher text [on the right].
  • Each stage consists of one or more steps selected from: AddRoundKey, SubBytes or its inverse, ShiftRows or its inverse, and MixColumns or its inverse. These steps occur in certain orders at certain times and some are grouped in a loop. It would seem that each named step is a suitable place to start decomposing the AES algorithm into tasks that communicate with each other via inputs and outputs to form a communicating task pipeline.
  • the AES algorithm can be decomposed into smaller constituent parts, all the way down to the microcode of the CPU if needed. How far an algorithm is broken up into smaller parts is termed the granularity of decomposition and examples of levels of granularity are set out below:
* Coarse grained – such as splitting the AES algorithm into its constituent rounds
* Medium grained – such as splitting a round into its component actions, e.g. SubBytes
* Fine grained – such as splitting SubBytes into its program language operations, e.g. C++
* Very fine grained – such as splitting the program language operations into their assembly language components
* Ultra fine grained – such as splitting the assembly language opcodes into their on-chip microcode.
  • Fig.5b illustrates the coarse, medium and fine grained composition.
  • the coarse grained level of decomposition comprises 10 rounds. The order of execution of these rounds is strictly non-commutative.
  • the medium grained level of decomposition comprises 4 sub-steps in each of rounds 1 to n-1 (AddRoundKey, SubBytes, ShiftRows and MixColumns) and 3 sub-steps in the last round (AddRoundKey, SubBytes and ShiftRows). The sub-steps in each round are also strictly non-commutative.
  • the encryption algorithm of Fig.5a has been decomposed in Fig.6 to show in detail the tasks which communicate with each other via intermediate inputs and outputs [shown by arrows which represent the data flows] to form a communicating task pipeline.
  • N = 10 rounds
  • the resulting AES decomposed task pipeline (medium granularity) consists of 36 separate task duplicates of the four basic tasks, an input stage (plain block, p) and an output stage (encrypted block, e), making 38 task based stages in total with 40 separate communication channels.
  • the key size used for an AES cipher specifies the number of transformation rounds that convert the input, called the plaintext, into the final output called the ciphertext.
  • the full algorithm will be executed after at most 38 runs of the pipeline.
  • a data block will pass through the pipeline and be fully encrypted after at most 38 runs of the pipeline.
  • the processor would confirm that the output connection is not full and the input connection is not empty and then read a 128-bit data block from the input connection.
  • the next step in the AES pipeline would then be applied to the read data block to transform the data block.
  • the transformed (or modified) data block would then be written to the output connection.
  • the pointer would be recycled to the back of the queue with a PUSH instruction and the whole process would be repeated for the next pointer in the queue until the queue is completed.
  • Completing the steps in sequence means that the process is vulnerable to a SCA.
  • a simple, naïve randomisation may be applied to the single queue of pointers, for example by selecting a pointer within the queue at random rather than selecting the pointer at the head of the queue.
  • a disadvantage of such a process is that a random transformation must take place each time a task pointer is selected rather than each time the queues are swapped in the process described above.
  • a naïve randomisation is likely to lead to a pipeline that almost never executes from beginning to end.
  • the probability P0 of selecting the first stage of the pipeline is 1/38, the probability P1 of selecting the next stage of the pipeline is also 1/38, and so on up to P37.
  • the probability of the data block progressing from the first stage of the pipeline to the second stage is P0 × P1, i.e. approximately 0.000676.
  • Fig.8a illustrates two abbreviated multi-stage AES encryption pipelines p1 and p2, each consisting of the four basic task types identified above, repeated and combined with input and output tasks to a total of 38 tasks.
  • Fig. 8b shows a mechanism for splitting the incoming data stream into plain data blocks for pipelines p1 and p2, known as a demultiplexer (demux), and a mechanism for recombining the encrypted data blocks from p1 and p2, known as a multiplexer (mux).
  • the data blocks are split, encrypted by each pipeline and then recombined for onward transmission.
  • the two pipelines may be buffered together.
  • Each pipeline could have a different plaintext or ciphertext feed.
  • pipeline A might be involved with encrypting a video feed while pipeline B may be involved with encrypting an audio stream from a different source.
  • the pointers to the 76 tasks of both the pipelines can be randomised in two possible ways.
  • a first way is shown in Fig. 9a and comprises a single processor 10 and a single DBRQ 92 with an enlarged capacity when compared with Fig.1.
  • as shown in Figs.10a and 10b, there may be two DBRQs 112A and 112B (each having a capacity for 38 task pointers) and a scheduler 100.
  • the memory and each of the pending and future task queues of the DBRQs are omitted for ease of display.
  • Fig.9b shows the workings of the embodiment of Fig.9a.
  • the single DBRQ contains all of the 76 tasks from the first and second AES pipeline p 1 and p 2 .
  • the single processor processes each task pointer(ptr) randomly presented from the pending task queue of the single DBRQ as described above and then returns the processed task pointer to the future task queue of the single DBRQ.
  • the first DBRQ 112A contains the 38 tasks from the first AES pipeline p 1 and the second DBRQ 112B contains the 38 tasks from the second pipeline p 2 .
  • the single processor 10 processes task pointers (ptr) in a round-robin fashion as determined by the scheduler 100.
  • Fig.10b shows the single processor processing each task pointer from the pending task queue of the first DBRQ 112A as described above and then returning the processed task pointer to the future task queue of the first DBRQ 112A.
  • Fig.10c shows the single processor processing a task pointer from the pending task queue of the second DBRQ 112B as described above and then returning the processed task pointer to the future task queue of the second DBRQ 112B.
  • the round robin fashion means that the arrangements shown in Figs.10b and 10c are alternated, with a task from the first DBRQ 112A being completed followed by a task from the second DBRQ 112B, followed by another task from the first DBRQ 112A and so on.
  • Fig 11a shows a variation of the arrangement shown in Fig.9a with a dual processing core 110. As in Fig.9a, there is a single DBRQ 120 of double capacity.
  • the single DBRQ 120 randomises the temporal order in which the tasks are executed and may be termed a temporal DBRQ.
  • the temporal DBRQ 120 is connected to a scheduler 200 which issues two POP instructions to the temporal DBRQ 120. In response to the POP instructions, when there are pointers in the pending task queue, two pointers are pushed to the scheduler 200. The number of pointers corresponds to the number of processing cores and it will be appreciated that two is merely indicative.
  • a second DBRQ which may be termed a topological DBRQ 122 is also connected to the scheduler 200 to determine which core is to attempt to execute each of the two pointers.
  • the topological DBRQ 122 is implemented in the same manner as the temporal DBRQ 120 but comprises a pool of pointers to the physical means of executing the task. In this case, there are two cores and hence two pointers, one for each core. The pair of pointers are initially in the future task queue of the topological DBRQ 122. A POP instruction to the empty pending task queue will trigger a shuffle of the core pointers before they are swapped to the pending task queue. The shuffle randomises which core will be selected. The scheduler uses the combined results from the temporal and topological DBRQs to determine which core is to attempt to execute which task at each moment in time.
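  • a minimal C++ sketch of how such a scheduler might combine the two DBRQs is given below; the reduced Dbrq class, the Task placeholder and the attempt_on helper are assumptions of this sketch, not definitions from this disclosure:

    #include <algorithm>
    #include <random>
    #include <utility>
    #include <vector>

    struct Task { /* illustrative placeholder for a pipeline task */ };

    // Illustrative dispatch helper: runs the precondition checks and, if they pass,
    // executes the task on the given core (body omitted in this sketch).
    void attempt_on(int /*core*/, Task* /*task*/) {}

    // Reduced double buffered random queue: pointers are popped from a pending buffer
    // and recycled to a future buffer; the shuffle happens only when the buffers swap.
    template <typename T>
    class Dbrq {
    public:
        explicit Dbrq(std::vector<T> items) : future_(std::move(items)) {}
        T pop() {                                      // assumes the DBRQ is never empty
            if (pending_.empty()) {
                std::shuffle(future_.begin(), future_.end(), rng_);
                pending_.swap(future_);                // random transformation on the swap
            }
            T item = pending_.back();
            pending_.pop_back();
            return item;
        }
        void push(T item) { future_.push_back(std::move(item)); }  // recycle the pointer
    private:
        std::vector<T> pending_, future_;
        std::mt19937 rng_{std::random_device{}()};
    };

    // One scheduling turn: pair a randomly chosen task (temporal randomisation) with a
    // randomly chosen core (topological randomisation), then recycle both pointers.
    void schedule_turn(Dbrq<Task*>& temporal, Dbrq<int>& topological, int attempts) {
        for (int i = 0; i < attempts; ++i) {
            Task* task = temporal.pop();
            int core = topological.pop();
            attempt_on(core, task);
            temporal.push(task);
            topological.push(core);
        }
    }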
  • Fig.11b shows a variant of the arrangement of Figs.10a and 10b and follows on from the embodiment in Fig. 8, in that there are now two temporal DBRQs 112A, 112B combined with a single topological DBRQ 122.
  • a pointer from each of the temporal DBRQs 112A, 112B is fed to a scheduler 300, which also receives pointers regarding which core to use from topological DBRQ 122.
  • the individual cores on the dual processing core 110 thus randomly receive tasks to execute.
  • the randomisation is both temporal and in location.
  • Figs.11a and 11b are scalable to multiple cores or any combination of different processing elements (e.g. processors, network machines, logical cores, physical cores) by using a topological DBRQ of appropriate capacity.
  • Fig.11c shows a processor 210 having six cores, A, B, C, D, E, F.
  • a temporal DBRQ 120 comprising a plurality of task pointers, for example 38 as described above.
  • a topological buffer 122 comprising a plurality of core pointers, for example six to match the number of cores.
  • Figs. 12a to 12d illustrate an alternative implementation of random out of order execution for non-commutative tasks. As set out above, this can be achieved when various conditions are met, including a wait-free communication queue and asynchronous operation.
  • a mechanism that enables elasticity without breaking the logical, algorithmic order can typically be implemented using two fundamental components: a wait-free communication queue and an asynchronous execution shuffler.
  • the wait-free communication queue must provide queued buffering of intermediate results between non-commuting operations.
  • the shuffler must enable shuffling of the physical order of execution of the non-commuting operations in a way that maintains the logical order of the underlying algorithm and meets the various conditions.
  • a simple circular buffer can be used to implement the interoperation communication queue.
  • Fig. 12a shows an example of this implementation.
  • the buffering queue must be wait-free to enable asynchronous operation and thus the Abstract Data Type (ADT) operations of the queue must be conditional. This is typically achieved by first testing whether the operation can take place, returning a Boolean value of true on success and false on failure. For example, the operation to “push” an item onto the back (tail) of the queue must first test whether the queue is full, returning false if so and otherwise returning true after the item has been added to the tail of the queue and the size of the queue has been increased by one. Similarly, the operation to remove an item from the front (head) of the queue must first check whether the queue is empty, returning false if so and otherwise returning true after the item has been removed and the size of the queue has been reduced by one.
  • the performance of the circular buffer implementation may be improved by restricting the capacity of the queue to a power of 2.
  • Such a “2^n” queue has the advantage of being able to replace the slower modulo division step, used to ensure circular progression of the indices, with a faster bit mask operation.
  • the communication queue between non-commuting operations must be synchronised to prevent race conditions. This can be achieved by using the data structure known as a single-producer, single consumer (SPSC) queue.
  • the wait-free feature can be implemented using light-weight concurrency data types known as “atomic variables” for the head and tail indices.
  • an implementation in C++ is shown below:

    /**
     * Concurrent wait-free, lock-free, single-producer, single-consumer, bounded FIFO queue.
     * Capacity must be an unsigned power of 2 to enable the performance advantage of bitwise masking.
     */
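  • the body of the listing is not reproduced in this record; a minimal sketch consistent with the description above (conditional push/pop, power-of-2 capacity with bit masking, atomic head and tail indices) might read:

    #include <atomic>
    #include <cstddef>

    template <typename T, std::size_t Capacity>        // Capacity must be a power of 2
    class SpscQueue {
        static_assert((Capacity & (Capacity - 1)) == 0, "Capacity must be a power of 2");
    public:
        // Conditional push: returns false (fails fast) if the queue is full.
        bool push(const T& item) {
            const std::size_t tail = tail_.load(std::memory_order_relaxed);
            const std::size_t head = head_.load(std::memory_order_acquire);
            if (tail - head == Capacity) return false;
            buffer_[tail & mask_] = item;              // bit mask replaces modulo division
            tail_.store(tail + 1, std::memory_order_release);
            return true;
        }
        // Conditional pop: returns false (fails fast) if the queue is empty.
        bool pop(T& item) {
            const std::size_t head = head_.load(std::memory_order_relaxed);
            const std::size_t tail = tail_.load(std::memory_order_acquire);
            if (head == tail) return false;
            item = buffer_[head & mask_];
            head_.store(head + 1, std::memory_order_release);
            return true;
        }
    private:
        static constexpr std::size_t mask_ = Capacity - 1;
        T buffer_[Capacity];
        std::atomic<std::size_t> head_{0};             // consumer index (atomic variable)
        std::atomic<std::size_t> tail_{0};             // producer index (atomic variable)
    };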
  • a longer pipeline could be implemented consisting of 40 stages connected by 39 wait-free queues. This is schematically illustrated in Fig.12c.
  • the permutations of the pointers in the pipelines are 10! and 40! respectively. Both the arrangements in Figs. 12b and 12c can be executed using wait-free single-producer, single-consumer queues.
  • each pipeline can be executed out of order and asynchronously as follows:
* the function pointer sequence is shuffled
* each function pointer from the shuffled sequence is taken and execution is attempted
* the function checks whether it has a non-empty input queue and a non-full output queue and, if so, executes
* once all the functions in the sequence have attempted execution, the sequence is shuffled again and the process repeats until there are no more input bytes.
[150] In this way, at least one function will be able to make progress at each run of the shuffled sequence of pointers (in a similar manner to the simpler four stage example pipeline described above).
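  • a minimal sketch of that driver loop is given below, assuming a vector of stage wrappers that each perform their own precondition checks (returning without effect if the checks fail) and a finished predicate, both of which are illustrative names:

    #include <algorithm>
    #include <functional>
    #include <random>
    #include <vector>

    // Attempt every stage exactly once per turn, reshuffling the physical order of
    // execution between turns; the logical order is preserved by the connecting queues.
    void run_pipeline(std::vector<std::function<void()>>& stages,
                      const std::function<bool()>& finished) {
        std::vector<std::size_t> order(stages.size());
        for (std::size_t i = 0; i < order.size(); ++i) order[i] = i;
        std::mt19937 rng(std::random_device{}());
        while (!finished()) {
            std::shuffle(order.begin(), order.end(), rng);   // new random order per turn
            for (std::size_t i : order) stages[i]();         // attempt each stage once
        }
    }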
  • Fig. 12d schematically illustrates a flowchart for implementing the pipeline of Fig. 12c.
  • the pipeline requires 39 connecting queues of bytes, an input stream of plain text bytes, an output stream of cipher text bytes and a vector of function wrappers 20 to the pipeline stages.
  • in the vector of function wrappers, f1 is AddRoundKey-1, f2 is SubBytes-1, f3 is ShiftRows-1, f4 is MixColumns-1, f5 is AddRoundKey-2, f6 is SubBytes-2 and so on.
  • the shuffle module 24 shuffles the array of indices and hence the function pointer sequence is shuffled.
  • a CPU 10 and a memory 14 are also part of the system.
  • as shown, the array and the vector of function wrappers show the sequence without any shuffling, but it will be appreciated that any order may be generated by the RNG and shuffle module.
  • each function pointer from the shuffled sequence is taken in turn and execution is attempted.
  • before execution of a function in the vector of function wrappers is attempted, there are first checks to see whether the output queue is full and whether the input queue is empty. These checks may be performed by the CPU 10 or any other suitable processing unit or processor (not shown). [154] When the output queue is not full and the input queue is not empty, the function executes. The execution may be done by the CPU 10. The output of the execution is sent to the appropriate output queue which may be stored in memory 14.
  • if one or both of the pre-conditions is not met, the function is not executed.
  • the next function is then selected according to the next function pointer in the sequence and the pre-condition checks are repeated. Once execution of all the functions in the sequence has been attempted, the sequence is shuffled again, and the process repeats until there are no more input bytes. Despite the random out of order physical execution of each stage f in the pipeline, the logical order is maintained as a directed graph of function nodes connected by wait-free queues, i.e. the algorithm is preserved despite the random order of execution of its parts.
  • a multi-core and/or multi-processor implementation could use any scheduling approach but perhaps the simplest is to use the “grand central dispatch” approach.
  • each DBRQ needs to have a switch to turn on the random transformation once all the buffers are loaded; this applies to the single core, single pipeline case as well.
  • each of the tasks is a non-commutative operation within the overall AES computer algorithm. In other words, each task must be carried out in the specified order.
  • Each task may thus be considered to be a communicating sequential process (CSP) “bubble”. All of the CSP bubbles are connected together into a pipeline using asynchronous wait-free queues as communication channels between the bubbles.
  • the bubbles may be executed randomly out of order provided that at each turn, execution of all bubbles is attempted exactly once before the order is shuffled and the next turn proceeds. This is achieved by use of the temporal DBRQ as explained above. It will be clear that each queue in the double buffer must be at least as long as the number of bubbles in the pipeline to avoid random stalls. [158] Initially, the process means that no final output will appear unless the random order generated is the correct order – a 1 in 38! chance for the example AES algorithm. However, the process is guaranteed to produce output after a number of runs (i.e. shuffle-and-attempted-execution cycles), with the number of runs being less than or equal to the number of bubbles (38 in the case of the example AES algorithm).
  • the normal distribution of successful runs means that typically the pipeline will produce output after about 16 runs for the example AES algorithm. Additionally, in practice the pipeline has at least one data item in each of its communication queues after a short while (as shown above) and thus at every turn of executing the pipeline bubbles, each and every bubble will execute successfully and the pipeline no longer stalls.
  • Commutative Operations [159] As explained above, one of the design features for all tasks is that any repeated section of code within the task must itself be short and discrete. Nevertheless, a task may comprise a minimal number of repeats and these repeating tasks may be commutative, i.e. may be carried out in any order.
  • AES operates on a 4x4 column major order array of bytes which may be termed the state.
  • Each XOR operation may be considered to be a discrete action and the indices i, j indicate the order in which the action is executed. Given that those mounting SCAs are aware of these loops and their rhythmical and predictable computations, they can seek and exploit the power consumption and EMF signals arising during their execution. There is no requirement to execute loop(s) in an ordered fashion if the action being looped over is discrete, i.e. not dependent on previous action(s), and as demonstrated in the code below, the 16 iterations of the XOR operation for each AddRoundKey task are commutative, i.e. may be carried out in any order.
  • the AddRoundKey task may be re-designed to exploit the commutative nature and introduce loop index shuffling – an example is shown below (the listing has been reconstructed from the flattened original; the signature and the XOR loop body are inferred from the surrounding description):

Code Block 2 – Loop Index Shuffled AddRoundKey

    // This function adds the round key to state using Loop Index Shuffling.
    // The round key is added to the state by an XOR function.
    void rijndael::add_round_key(const uint8_t* rkey, uint8_t* state) {
        if (blend_add[add_counter++]) {      // if the next item in the blend sequence is true(1)...
                                             // (NB add_counter is 8 bit so will overflow back to zero)
            shuffled_add.shuffle();          // shuffle the block array index sequence
            for (auto& j : shuffled_add) {   // use the shuffled indices...
                state[j] ^= rkey[j];         // ...to XOR each round key byte into the state
            }
        }
        // (remainder of the original listing, e.g. the ordered fallback path, is not reproduced here)
    }
  • Loop index shuffling thus results in 20,922,789,888,000 possible random permutations in the order in which the memory locations are addressed by the XOR action inside the AddRoundKey loops.
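  • the shuffled_add member used in Code Block 2 is not defined in this record; a minimal sketch of such a reshuffleable index sequence (illustrative names, standard library shuffle) might be:

    #include <algorithm>
    #include <array>
    #include <cstdint>
    #include <random>

    // Hypothetical helper: a reshuffleable sequence of loop indices 0..N-1.
    template <std::size_t N>
    class ShuffledSequence {
    public:
        ShuffledSequence() {
            for (std::size_t i = 0; i < N; ++i) indices_[i] = static_cast<uint8_t>(i);
        }
        void shuffle() { std::shuffle(indices_.begin(), indices_.end(), rng_); }
        auto begin() const { return indices_.begin(); }
        auto end() const { return indices_.end(); }
    private:
        std::array<uint8_t, N> indices_;
        std::mt19937 rng_{std::random_device{}()};   // in practice a true or cryptographically
                                                     // secure RNG may be preferred, as described above
    };

    // e.g. ShuffledSequence<16> shuffled_add;  — 16! = 20,922,789,888,000 possible orderings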
  • Loop index shuffling provides temporal randomisation of the execution of the operations within the AddRoundKey task. Additionally, each of the 16 steps is a tiny embarrassingly parallel problem which may also be subject to topological randomisation. Given a multi-core processor which is amenable to parallel execution, each step could randomly be assigned to one of the cores. A Gustafson-Barsis linear scaling (up to 16 cores or hyperthreads) suggests that the performance penalty of mixing may be overcome.
  • the s-box is a 16x16 matrix of byte values consisting of all the possible values of an 8-bit sequence. However, the s-box is not just a random permutation of these values and there is a well-defined method for creating the s-box tables. As before, each byte is separately replaced and this may be expressed as follows: [168] Each of these 16 replacements may be termed a discrete action and each is commutative and can be carried out in any order. Loop index shuffling can be applied in a similar manner to that in the AddRoundKey task and there are the same number of possible permutations, i.e. 16!.
  • one format for the loop index shuffled arrangement may be expressed as follows: [169]
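  • the expression itself is not reproduced in this record; a minimal sketch of a loop-index-shuffled SubBytes, assuming a standard 256-entry s-box lookup table and the ShuffledSequence helper sketched above, might be:

    #include <cstdint>

    // Replace each of the 16 state bytes via the s-box, visiting them in shuffled order.
    // sbox points to the standard 256-byte AES substitution table.
    void sub_bytes_shuffled(uint8_t* state, const uint8_t* sbox,
                            ShuffledSequence<16>& shuffled_sub) {
        shuffled_sub.shuffle();              // new random order for the 16 replacements
        for (auto j : shuffled_sub) {
            state[j] = sbox[state[j]];       // commutative: each byte is replaced independently
        }
    }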
  • the ShiftRows task the bytes in each row are cyclically shifted by a certain offset.
  • the first row is left unchanged, each byte of the second row is shifted one to the left.
  • the third and fourth rows are shifted by offsets of two and three respectively.
  • the ShiftRows task changes the order of bytes in each row of the state for all rows except the first. This may be represented as follows: [170]
  • Fig. 13a schematically illustrates the nested commutative steps within ShiftRows which may be pugged, i.e. randomised for out of order execution.
  • the MixColumns task is a 32-bit operation that transforms four bytes of each column in the state.
  • the new bytes of the column are obtained by the given constant matrix multiplication in the Galois Field GF(2^8).
  • each column is transformed using a fixed matrix and may be expressed as: [173]
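  • the matrix itself is not reproduced in this record; for reference, standard AES MixColumns multiplies each column by the fixed circulant matrix with rows (2 3 1 1), (1 2 3 1), (1 1 2 3) and (3 1 1 2) over GF(2^8). A minimal sketch of one column transform, using the usual multiply-by-x helper, is:

    #include <cstdint>

    // Multiply by x (i.e. by 2) in GF(2^8) with the AES reduction polynomial 0x11B.
    static inline uint8_t xtime(uint8_t b) {
        return static_cast<uint8_t>((b << 1) ^ ((b >> 7) * 0x1B));
    }

    // Standard AES MixColumns applied to one 4-byte column, in place.
    void mix_single_column(uint8_t col[4]) {
        const uint8_t a0 = col[0], a1 = col[1], a2 = col[2], a3 = col[3];
        const uint8_t all = a0 ^ a1 ^ a2 ^ a3;
        col[0] = static_cast<uint8_t>(a0 ^ all ^ xtime(a0 ^ a1));  // 2*a0 + 3*a1 + a2 + a3
        col[1] = static_cast<uint8_t>(a1 ^ all ^ xtime(a1 ^ a2));  // a0 + 2*a1 + 3*a2 + a3
        col[2] = static_cast<uint8_t>(a2 ^ all ^ xtime(a2 ^ a3));  // a0 + a1 + 2*a2 + 3*a3
        col[3] = static_cast<uint8_t>(a3 ^ all ^ xtime(a3 ^ a0));  // 3*a0 + a1 + a2 + 2*a3
    }

The four columns are independent of one another, so the column loop is itself commutative and amenable to the same loop index shuffling.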
  • a finer grained mixing could occur inside each of these steps too.
  • Fig.13c illustrates the main steps in the operation and Fig.13d is a schematic representation of a hardware implementation which is adapted from that shown in Fig.1 and thus the same reference numbers are used where appropriate.
  • the first step is to determine the task to be executed (step S400). This may be determined by using a buffer as shown in Fig.13d. In this arrangement the buffer has all 38 tasks in temporal order but it will be appreciated that the DBRQs or SPSCs described above can be used if randomisation at both pipeline and task level is required. It will also be appreciated that the determining step may be derived from the algorithm itself depending on its implementation.
  • the next step may be to read an input from memory 14 (step S402).
  • the input may be original plaintext (or ciphertext) or an intermediate input generated by a previous task in the algorithm. If the DBRQ or SPSC is used as described above, the pre-condition checks may also be carried out before attempting to read an input to make sure that the task can be executed.
  • the system comprises a random number generator 16 which may be the same as the ones described above in relation to Fig.1.
  • the random number generator 16 provides the random number to a shuffle module 24.
  • the shuffle module is shown as a separate module but it will be appreciated that the functionality may be incorporated in the CPU 10.
  • the shuffle module 24 introduces a random shuffling permutation of the loop indices as shown in the array of indices 26.
  • the array of indices 26 contains the 16 indices for the AddRoundKey task but this is merely exemplary.
  • the trade-offs between speed of execution and SCA resistance may be balanced by using a blend of fully shuffled operations together with standard ordered execution.
  • Fig.13e shows a similar schematic representation of the fork-join parallelisation method for the SubBytes task.
  • the tuneable approach of selecting the execution pathway as shown in Figs.13e or 13f can be applied to any of the four stages of the AES algorithm (or to any other algorithm). If mixing is applied to all occurrences of one task, the mixing levels may be considered to be 25% mixing when one of the four tasks is mixed, 50% when two of the four tasks are mixed, 75% when three of the four tasks are mixed, and 100% when all tasks are mixed as described above. It will be appreciated that different levels of mixing may be achieved by tuning one or more of the tasks. In other words, mixing may be applied to one task or any number of the tasks or some but not all of the repeating tasks in the pipeline. [181] Shuffling the indices requires a random number to be generated.
  • the task may request a new random number for each loop, e.g. 16 times, or may more optimally request the number of random numbers which are required at the same time.
  • the random number may be requested before the task is executed or during the execution of the task.
  • the order in which the random number is obtained during the method may also be optional. Where there are 16 cycles to be randomised, this may for example, be achieved by requesting a single 64 bit random number at the start of the 16 shuffles and breaking the single random number into 16 four bit numbers. This could be done quickly, for example by masking the lower nibble of the 64 bit random number and then shifting it right by 4 bits for the next round.
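  • a minimal sketch of that extraction (illustrative name only) might be:

    #include <array>
    #include <cstdint>

    // Split one 64-bit random number into 16 four-bit values by masking the lower
    // nibble and then shifting right by 4 bits for the next round.
    std::array<uint8_t, 16> split_nibbles(uint64_t r) {
        std::array<uint8_t, 16> out{};
        for (int i = 0; i < 16; ++i) {
            out[i] = static_cast<uint8_t>(r & 0xF);   // mask the lower nibble
            r >>= 4;                                  // shift for the next value
        }
        return out;
    }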
  • the task may now be executed (step S406).
  • the task may be executed as shown on a single CPU 10.
  • each of the commutative cycles, e.g. 16 or 4 cycles, is suitable for parallel implementation on multiple cores.
  • the output is written (step S408) for this task and the process repeats again, e.g. by issuing a POP instruction to obtain the next task in the pipeline from the buffer.
  • the pipeline may even be full of intermediate results, one for each of the different inputs.
  • the number of tasks that are executable with each execution of the pipeline thus may also vary.
  • the number of permutations is also dependent on the number of runs of the algorithm which are needed to fully process all the inputs. It will be appreciated that, for a given plurality of sequential blocks of plaintext, the blocks earlier in the sequence will have been enciphered or will be having later tasks executed, while the blocks later in the sequence will be having earlier tasks executed or will be waiting to join. This “overlapping” of runs of the algorithm applies to any algorithm after the pipeline has been executed once and before the final execution of the pipeline. [185] Loop Index Shuffling can be employed for commutative sections of a task.
  • the scheduler either truly randomly or cryptographically securely randomly selects a pointer from the pools of pointers and asynchronously executes a small serial fraction of an algorithm (e.g. a cipher algorithm) if the pre-conditions are met.
  • the pointers are then returned to their respective pool to be randomly used again.
  • the execution can be on a randomly selected hardware resource.
  • the computational steps of an algorithm are randomised in at least time and also optionally location.
  • loop index shuffling may be used.
  • the randomisation means that the electrical signals released by each small computational step are also random. Such random electrical noise can still be harvested externally at a distance but it can no longer be interpreted in time or possibly also location.
  • a testing method was developed to perform a side channel attack on a particular hardware platform to show whether the encryption key can be found using a correlation power analysis (CPA).
  • CPA uses changes in the power consumption of the microprocessor and these can be measured by monitoring the changes in the current consumed by the device.
  • the testbed comprises a processor for executing the encryption algorithm, e.g. an Arduino Uno™ board, which is based on the ATmega328P microcontroller.
  • Figs. 14a and 14b show two screen shots from an oscilloscope, e.g. a 1GSa/s storage oscilloscope, which is monitoring the voltage drop across the resistor in the test set up described above.
  • Fig.14a is a screen shot of the power trace when the Arduino Uno™ is executing standard AES encryption using a 128 bit key.
  • the varying (lower jagged) line is the voltage drop across the monitoring resistor and the straight (upper) line is a timing pulse to identify when the encryption is taking place.
  • the ten rounds of the AES encryption process are clearly shown in the varying trace.
  • Fig.14b is a screen shot of the power trace for the same AES encryption using a 128 bit key but with the mixing (shuffling) process applied to all four sub sections of the AES encryption. In other words, 100% mixing is applied.
  • the ten rounds of the AES encryption process are still clearly shown in the varying trace but the signal is significantly noisier.
  • CPA uses traces such as those illustrated in Figs.14a and 14b and applies a statistical analysis to the results of multiple encryptions to yield the encryption key.
  • One method for performing such a statistical analysis uses a ChipWhisperer-Lite, which is an open-source hardware and software product optimised for SCA. In this test bed arrangement, it was also necessary to include a bidirectional level shifter between the Arduino Uno™, which uses 5 V logic levels, and the ChipWhisperer hardware, which uses 3.3 V TTL logic levels.
  • Fig.15a shows the data from a power spectrum side channel attack on standard AES using 25 different 16-byte encryption keys. After 170 traces, 24 of the 25 keys are correctly returned by the SCA and for the final key, 15 of the 16 sub-keys are returned.
  • standard AES is clearly vulnerable to a successful SCA with only a relatively small number of traces. It is noted that not all the keys are returned with fewer traces, e.g.
  • Figs. 15b to 15e show that the same AES algorithm which has been subject to varying amounts of the mixing (shuffling) process described above are much less vulnerable to a successful SCA.
  • Fig. 15b shows that if mixing is applied to one of the four tasks (for example AddRoundKey), none of the 25 encryption keys are completely returned after 170 traces of the SCA. There are three keys for which 9 and 10 subkeys are respectively returned and 10 subkeys is the highest number of subkeys returned.
  • Fig.15c shows that if mixing is applied to two of the four tasks (for example AddRoundKey and SubBytes), the maximum number of subkeys which is returned for any one of the 25 encryption keys is reduced to just 5. Similarly, increasing the mixing to three of the four tasks (for example, mixing within each of AddRoundKey, SubBytes and ShiftRows), reduces the maximum number of subkeys which is returned for any one of the 25 encryption keys still further to just 1. Finally, when all of the tasks (e.g. AddRoundKey, SubBytes, ShiftRows and MixColumns) are subject to the mixing process described above, none of the sub keys are determined by the SCA.
  • in a series of timing tests, the 128-bit AES encryption process has been: 1. executed as normal, unmixed AES encryption; 2. randomised at a task level by truncated randomising of the AddRoundKey stage (mix1) as described above; 3. randomised at a task level by truncated randomising of the SubBytes stage (mix2) as described above; 4. randomised at a task level by truncated randomising of both the AddRoundKey stage and the SubBytes stage (mix1 & mix2); 5. randomised with an initial single step randomisation at the start of the algorithm randomising the AddRoundKey stage (mix1); 6. randomised with an initial single step randomisation at the start of the algorithm randomising the SubBytes stage (mix2); 7. randomised with an initial single step randomisation at the start of the algorithm randomising both the AddRoundKey stage and the SubBytes stage (mix1 & mix2); 8. randomised (mix1) combined with the randomisation introduced at a pipeline level (parallel); 9. randomised (mix2) combined with the randomisation introduced at a pipeline level (parallel); 10. randomised (mix1 & mix2) combined with the randomisation introduced at a pipeline level (parallel).
  • the second test shows that this first type of randomisation (mix_1) is sufficient to foil 170 traces of a CPA side channel attack but the trade-off is an increase in time to 0.193028 seconds/100,000 encrypts compared to 0.0768093 seconds/100,000 encrypts for the normal AES encryption.
  • the third test results show an increase in time to 0.212729 seconds/100,000 encrypts for this second type of randomisation compared to 0.0768093 seconds/100,000 encrypts for the normal AES encryption.
  • both options for 25% mixing result in a similar increase in processing time.
  • the 50% mixing results in an additional increase in processing time when compared to 25% mixing as shown in the results for test four.
  • the results of tests two to four indicate that there is a performance penalty for introducing the randomisation. This may be an acceptable penalty given the improvement in reducing the risk of a successful side channel attack as demonstrated in Figs.15a to 15e.
  • the performance may be improved by introducing the randomisation in a single step at the start of the algorithm rather than introducing a truncated randomisation at each step as explained above. Tests five to seven indicate the improved performance.
  • Test five shows that the first type of randomisation (mix_1) only results in an increase in time to 0.110798 seconds/100,000 encrypts compared to 0.0768093 seconds/100,000 encrypts for the unmixed, normal AES encryption.
  • the second type of randomisation (mix_2) only results in an increase in time to 0.116044 seconds/100,000 encrypts compared to 0.0768093 seconds/100,000 encrypts for the unmixed, normal AES encryption.
  • the second type of randomisation (mix_2) combined with the parallel randomisation (parallel) results in an increase in time to 2.7305 seconds/100,000 encrypts compared to 0.0768093 seconds/100,000 encrypts for the unmixed, normal AES encryption.
  • the combination of the first and second type of randomisation (mix_1,mix_2) and parallel mixing results in an increase in time to 5.67733 seconds/100,000 encrypts compared to 0.0768093 seconds/100,000 encrypts for the unmixed, normal AES encryption.
  • Keccak is a versatile cryptographic function. Best known as a hash function, it nevertheless can be used for authentication, (authenticated) encryption and pseudo-random number generation.
  • Keccak-f cryptographic permutation algorithm can be decomposed into smaller constituent parts, all the way down to the microcode of the CPU if needed.
  • Figure 16 illustrates examples of coarse, medium and fine grained decomposition for the Keccak-f cryptographic permutation algorithm:
* Coarse grained – such as splitting into its constituent rounds (in this case 24)
* Medium grained – such as splitting a round into its component actions, e.g. θ (theta)
* Fine grained – such as splitting θ (theta) into its program language operations, e.g. C++
* Very fine grained – such as splitting the program language operations into their assembly language components
* Ultra fine grained – such as splitting the assembly language opcodes into their on-chip microcode.
  • a canonical implementation of the init/update/finalise (IUF) paradigm can be considered as the basic block permutation function consisting of 24 rounds of the five classical Keccak-f steps: θ (theta), ρ (rho), π (pi), χ (chi) and ι (iota).
  • the main SHA-3 submission uses 64-bit words and so implementing the basic block permutation function consists of 24 rounds of the five steps. Each of the five steps and each of the rounds must be implemented in a specified order and thus they are non-commuting operations. The opportunities for shuffling the commuting operations become apparent after examining the loops within the theta, rho, pi, chi and iota stages.
  • the theta (θ) task is schematically illustrated on the right-hand side of Figure 16, which shows that the first and second steps each have an array of indices {0, 1, 2, 3, 4} and the final step has an array of indices {0, 5, 10, 15, 20}.
  • the theta (θ) task within the Keccak-f algorithm has a set of nested shuffled sequences and randomisation (or mixing) may thus be introduced by randomising the set of indices.
  • the theta (θ) task can be randomly executed at a task level by shuffling the indices.
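  • purely as an illustration, a loop-index-shuffled theta step might be sketched as below, using the standard Keccak theta computation and the ShuffledSequence helper assumed earlier; each of the three loops is commutative, so its indices may be visited in any order:

    #include <cstdint>

    static inline uint64_t rotl64(uint64_t v, unsigned n) {
        return (v << n) | (v >> (64 - n));
    }

    // Hedged sketch of a shuffled theta: column parities first, then application to
    // each column, with both the x indices {0..4} and the column offsets
    // {0, 5, 10, 15, 20} visited in shuffled order.
    void theta_shuffled(uint64_t s[25], ShuffledSequence<5>& shuffled_x,
                        ShuffledSequence<5>& shuffled_col) {
        uint64_t bc[5];
        shuffled_x.shuffle();
        for (auto x : shuffled_x) {                   // column parities, any order
            bc[x] = s[x] ^ s[x + 5] ^ s[x + 10] ^ s[x + 15] ^ s[x + 20];
        }
        shuffled_x.shuffle();
        for (auto x : shuffled_x) {                   // apply to each column, any order
            const uint64_t t = bc[(x + 4) % 5] ^ rotl64(bc[(x + 1) % 5], 1);
            shuffled_col.shuffle();
            for (auto y : shuffled_col) {             // offsets 0, 5, 10, 15, 20, any order
                s[x + 5u * y] ^= t;
            }
        }
    }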
  • unaltered code for the theta (θ) stage could be blended randomly with the altered code.
  • the blending could be tuned to deliver SCA protection at nominal throughput.
  • the same fork-join parallelisation described above could be used to randomise the location of the code execution.
  • the ρ (rho) stage comprises a bitwise rotation of each of the 25 words by a different triangular number (0, 1, 3, 6, 10, 15, ...) and the π (pi) stage permutes the 25 words in a fixed pattern.
  • sequence_t rho_pi_indices {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23};
  • the chi (χ) stage is schematically illustrated in Fig. 18 and, like the theta (θ) stage, there are nested sequences which can be shuffled to add a deeper level of mixing than for the combined ρ (rho) and π (pi) stages.
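  • as an illustration, a loop-index-shuffled chi step might be sketched as follows; the five rows of the state are independent and, once a row has been snapshotted, the five lane updates within it are also independent, so both loops can be visited in shuffled order (names are illustrative):

    #include <cstdint>

    // Hedged sketch of a shuffled chi using the standard Keccak chi computation.
    void chi_shuffled(uint64_t s[25], ShuffledSequence<5>& shuffled_rows,
                      ShuffledSequence<5>& shuffled_lanes) {
        shuffled_rows.shuffle();
        for (auto y : shuffled_rows) {                                // rows in any order
            uint64_t row[5];
            for (unsigned x = 0; x < 5; ++x) row[x] = s[5u * y + x];  // snapshot the row
            shuffled_lanes.shuffle();
            for (auto x : shuffled_lanes) {                           // lane updates in any order
                s[5u * y + x] = row[x] ^ ((~row[(x + 1) % 5]) & row[(x + 2) % 5]);
            }
        }
    }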
  • in the ι (iota) stage, the operation is to exclusive-or a round constant into one word of the state.
  • the ι (iota) stage is one of the non-commuting tasks which is amenable to alteration as described below.
  • the ρ (rho) and π (pi) stages may be combined into a single subroutine.
  • the 24 round loop may be unrolled and expressed (for example using C++) as:

    static void keccakf(uint64_t s[25]) {
        int i, j, round; uint64_t t, bc[5];   // working variables (unused once unrolled)
        // round 1
        theta(s); rho_pi(s); chi(s); iota(s);
        // round 2
        theta(s); rho_pi(s); chi(s); iota(s);
        ...
        // round 23
        theta(s); rho_pi(s); chi(s); iota(s);
        // round 24
        theta(s); rho_pi(s); chi(s); iota(s);
    }

[215]
  • This sequence of non-commuting functions, theta, rho_pi, chi and iota can now be randomly executed out of order using one of the techniques described above.
  • each of the functions may be interconnected with wait-free single-producer, single-consumer (SPSC) queues to result in a pipeline.
  • This pipeline is schematically illustrated in Fig.19a.
  • the pipeline has 96 functions with 95 connecting queues of bytes.
  • Fig.19b schematically illustrates the system to implement the pipeline of Fig.19a.
  • the pipeline can be executed asynchronously and out of order by virtue of the wait free connections carrying intermediate results.
  • the system comprises an array 122 of indices for each of the function pointers and a vector 120 of function wrappers to the pipeline stages.
  • each function pointer from the shuffled sequence is taken in turn and execution is attempted.
  • when both pre-conditions are met (the input queue is not empty and the output queue is not full), the function executes. The execution may be done by the CPU 110. The output of the execution is sent to the appropriate output queue which may be stored in memory 114. If one or both of the pre-conditions is not met, the function is not executed. The next function is then selected according to the next function pointer in the sequence and the pre-condition checks are repeated.
  • at least one function will be able to make progress at each turn of the shuffled sequence and after at most 96 turns of the shuffled sequence, each queue will have at least one intermediate result.
  • once each queue holds an intermediate result, i.e. after at most 96 turns, every function can make progress on every turn.
  • the probability of having to wait 96 turns before the pipeline is fully primed, and no longer stalls, is vanishingly small.
  • loading is likely to follow a Gaussian distribution and the pipeline is likely to be full in about 48 turns.
  • FIG. 20 is a flowchart illustrating the key steps in the method (or carried out by the processor). As shown in a first step S2000, at least part of the computer algorithm is stored as at least one pipeline, for example in memory. The at least one pipeline comprises a plurality of separately executable tasks.
  • FIG. 20 shows two routes for achieving randomisation. Although these are shown as separate options, they can be combined as described above.
  • in the first route, the plurality of separately executable tasks are randomised at a pipeline level. Typically this is used for non-commutative tasks which must be carried out in a specific order.
  • the first step is to randomise the order in which tasks are presented for execution (step S2010), for example using the DBRQ buffer or circular buffers with pointers as described above. Once the order is randomised, there is an attempt to execute each task once from the randomised order (step S2012).
  • in the second route, the plurality of separately executable tasks are randomised at a task level.
  • Each task which is randomised comprises multiple operations. These operations may be commutative operations and may thus be executed in any order.
  • the first step may be randomising an order of execution of the set of operations within the at least one task (step S2020), for example using loop index shuffling as described above. For commutative operations, there is no need to check if the operation can be executed because they may be carried out in any order.
  • in step S2022, the operations may be executed in the order in which they appear in the randomised order and hence the task is executed.
  • the next step is to check whether all the inputs which were received in step S2002 have been processed. In other words, there is a check to see if all the outputs have been generated (step S2030). For the encryption example described above, this may involve checking whether the input plain text has been processed by the algorithm to be a cipher text. If the outputs are not ready, as indicated by the arrows, the method loops back to the randomising steps.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Storage Device Security (AREA)

Abstract

Described herein are methods and apparatus for improving security when executing a computer algorithm comprising separately executable tasks, each of which produces an electrical signal when executed. The apparatus comprises a memory in which at least part of the computer algorithm is stored as at least one pipeline, the pipeline(s) comprising a plurality of separately executable tasks; and a processor. The processor is configured to: receive a plurality of inputs to be processed by the computer algorithm; randomise the plurality of separately executable tasks at a pipeline level and/or at a task level; execute the randomised plurality of separately executable tasks; and repeat the randomising and executing steps until the computer algorithm has processed the plurality of inputs. In this way, on each repetition, the electrical signals produced when executing the plurality of separately executable tasks are randomised, improving security.
PCT/GB2021/052034 2020-08-07 2021-08-05 Procédé et appareil pour réduire le risque d'attaques fructueuses par canal auxiliaire et injection d'erreurs WO2022029443A1 (fr)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
GBGB2012352.7A GB202012352D0 (en) 2020-08-07 2020-08-07 Method and apparatus for reducing the risk of successful side channel attacks
GB2012352.7 2020-08-07
GB2105109.9 2021-04-09
GBGB2105109.9A GB202105109D0 (en) 2021-04-09 2021-04-09 Method and apparatus for reducing the risk of successful side channel and fault injection attacks

Publications (1)

Publication Number Publication Date
WO2022029443A1 true WO2022029443A1 (fr) 2022-02-10

Family

ID=77358300

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2021/052034 WO2022029443A1 (fr) 2020-08-07 2021-08-05 Procédé et appareil pour réduire le risque d'attaques fructueuses par canal auxiliaire et injection d'erreurs

Country Status (1)

Country Link
WO (1) WO2022029443A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114328001A (zh) * 2022-03-11 2022-04-12 紫光同芯微电子有限公司 用于ram受到故障注入攻击的检测方法、装置和存储介质

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8143A (en) 1851-06-10 dtjtcher
US20060288239A1 (en) 2003-04-22 2006-12-21 Francesco Pessolano Electronic circuit device for cryptographic applications
US20080019507A1 (en) 2006-06-29 2008-01-24 Incard S.A. Method for Protecting IC Cards Against Power Analysis Attacks
US20100166177A1 (en) 2008-12-31 2010-07-01 Incard S.A. Method for protecting a cryptographic device against spa, dpa and time attacks
GB2494731A (en) 2011-09-06 2013-03-20 Nds Ltd Dummy and secret control signals for a circuit
US20140013425A1 (en) 2012-07-03 2014-01-09 Honeywell International Inc. Method and apparatus for differential power analysis protection
GB2524335A (en) 2014-03-22 2015-09-23 Primary Key Associates Ltd Methods and apparatus for resisting side channel attack
US20160171252A1 (en) 2014-12-16 2016-06-16 Cryptography Research, Inc Buffer access for side-channel attack resistance
US20170099134A1 1998-01-02 2017-04-06 Cryptography Research, Inc. Differential power analysis - resistant cryptographic processing
EP3624390A1 (fr) * 2018-09-17 2020-03-18 Secure-IC SAS Dispositifs et procédés de protection de programmes cryptographiques

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8143A (en) 1851-06-10 dtjtcher
US20170099134A1 1998-01-02 2017-04-06 Cryptography Research, Inc. Differential power analysis - resistant cryptographic processing
US20060288239A1 (en) 2003-04-22 2006-12-21 Francesco Pessolano Electronic circuit device for cryptographic applications
US20080019507A1 (en) 2006-06-29 2008-01-24 Incard S.A. Method for Protecting IC Cards Against Power Analysis Attacks
US20100166177A1 (en) 2008-12-31 2010-07-01 Incard S.A. Method for protecting a cryptographic device against spa, dpa and time attacks
GB2494731A (en) 2011-09-06 2013-03-20 Nds Ltd Dummy and secret control signals for a circuit
US20140013425A1 (en) 2012-07-03 2014-01-09 Honeywell International Inc. Method and apparatus for differential power analysis protection
GB2524335A (en) 2014-03-22 2015-09-23 Primary Key Associates Ltd Methods and apparatus for resisting side channel attack
US20160171252A1 (en) 2014-12-16 2016-06-16 Cryptography Research, Inc Buffer access for side-channel attack resistance
EP3624390A1 (fr) * 2018-09-17 2020-03-18 Secure-IC SAS Dispositifs et procédés de protection de programmes cryptographiques

Non-Patent Citations (16)

* Cited by examiner, † Cited by third party
Title
"PhD by Spadavecchia", 2005, EDINBURGH UNIVERSITY, article "A Network-based Asynchronous Architecture for Cryptographic Devices"
AI ET AL.: "Improved wavelet transform for noise reduction in power analysis attacks", IEEE INTERNATIONAL CONFERENCE ON SIGNAL AND IMAGE PROCESSING (ICSIP, 2016
BADDAMJ. DAEMEN ET AL.: "Evaluation of Dynamic Voltage and Frequency Scaling as a Differential Power Analysis Countermeasure", 20TH INTERNATIONAL CONFERENCE ON VLSI DESIGN, January 2007 (2007-01-01)
BATINA ET AL.: "NN: Reverse Engineering of Neural Network Architectures Through Electromagnetic Side Channel", USENIX SECURITY SYMPOSIUM, 2019, pages 515 - 532
GONGYE ET AL.: "The 57th Annual Design Automation Conference", 2020, article "Reverse-Engineering Deep Neural Networks using Floating-Point Timing Side-Channels", pages: 1 - 6
GURKAYNAK ET AL.: "Improving DPA Security by Using Globally-Asynchronous Locally-Synchronous Systems", PROCEEDINGS OF THE 31ST EUROPEAN SOLID-STATE CIRCUITS CONFERENCE, 2005
HERBST CHRISTOPH ET AL: "An AES Smart Card Implementation Resistant to Power Analysis Attacks", 6 June 2006, ADVANCES IN DATABASES AND INFORMATION SYSTEMS; [LECTURE NOTES IN COMPUTER SCIENCE; LECT.NOTES COMPUTER], SPRINGER INTERNATIONAL PUBLISHING, CHAM, PAGE(S) 239 - 252, ISBN: 978-3-319-10403-4, XP047503379 *
HERBST ET AL.: "An AES Smart Card Implementation Resistant to Power Analysis Attacks", ACNS, 2006
KOCHER ET AL.: "Differential Power Analysis", ANNUAL INTERNATIONAL CRYPTOLOGY CONFERENCE IN DECEMBER 1999
O'FLYNNCHEN: "Side channel power analysis of an AES-256 bootloader", 2015 IEEE 28TH CANADIAN CONFERENCE ON ELECTRICAL AND COMPUTER ENGINEERING (CCECE), HALIFAX, NS, CANADA, 2015, pages 750 - 755
QUISQUATER ET AL.: "The Proceedings of the International Conference on Research in Smart Cards", vol. 2140, 2001, SPRINGER-VERLAG, article "ElectroMagnetic Analysis (EMA): Measures and Countermeasures for Smart Cards"
RIVAIN ET AL.: "Higher-order Masking and Shuffling for Software Implementations of Block Ciphers", CHES, 2009
V. RIJMEN ET AL.: "Resistance Against Implementation Attacks: A Comparative Study of the AES Proposals", SECOND ADVANCED ENCRYPTION STANDARD (AES) CANDIDATE CONFERENCE, 1999
VEYRAT-CHARVILLON ET AL.: "Shuffling against Side-channel attacks: A Comprehensive study with Cautionary Note", ASIACRYPT, 2012
XIANG ET AL.: "Open DNN Box by Power Side-Channel Attack", IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS, vol. 67, no. 11, November 2020 (2020-11-01), pages 2717 - 2721, XP011817547, DOI: 10.1109/TCSII.2020.2973007
YANG ET AL.: "Countering power analysis attacks by exploiting characteristics of multicore processors", IEICE ELECTRONICS EXPRESS, vol. 15, 2018, pages 20180084

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114328001A (zh) * 2022-03-11 2022-04-12 紫光同芯微电子有限公司 用于ram受到故障注入攻击的检测方法、装置和存储介质

Similar Documents

Publication Publication Date Title
JP5120830B2 (ja) 共用のハードウェアを利用して暗号文及びメッセージ認証コードを生成するための方法及びシステム
CA2497935C (fr) Chiffrement en continu avec rotation des tampons
US8094816B2 (en) System and method for stream/block cipher with internal random states
Rahimunnisa et al. FPGA implementation of AES algorithm for high throughput using folded parallel architecture
EP3839788A1 (fr) Chiffrement à longueur de bit paramétrable
CA2486713A1 (fr) Moteur cryptographique d'equipement technique base sur la norme avancee de chiffrement (aes)
JP2007520951A (ja) 電力解析攻撃対策保護
US8976960B2 (en) Methods and apparatus for correlation protected processing of cryptographic operations
Abd Ali et al. Novel encryption algorithm for securing sensitive information based on feistel cipher
US20050120065A1 (en) Pseudorandom number generator for a stream cipher
WO2008013083A1 (fr) Générateur de nombres pseudo-aléatoires, dispositif de cryptage de flux et programme
Mandal et al. Sycon: A new milestone in designing ASCON-like permutations
WO2022029443A1 (fr) Procédé et appareil pour réduire le risque d'attaques fructueuses par canal auxiliaire et injection d'erreurs
Zeh et al. Risc-v cryptographic extension proposals volume I: Scalar & entropy source instructions
McKague Design and analysis of RC4-like stream ciphers
WO2007129197A1 (fr) Appareil et procédé cryptographiques
US11303436B2 (en) Cryptographic operations employing non-linear share encoding for protecting from external monitoring attacks
Maistri et al. Implementation of the advanced encryption standard on gpus with the nvidia cuda framework
Monfared et al. BSRNG: a high throughput parallel bitsliced approach for random number generators
Fanfakh et al. Simultaneous encryption and authentication of messages over GPUs
Abdulwahed Chaos-Based Advanced Encryption Standard
Irwin et al. Using media processors for low-memory AES implementation
Noura et al. DKEMA: GPU-based and dynamic key-dependent efficient message authentication algorithm
Rodrigues et al. Fast white-box implementations of dedicated ciphers on the ARMv8 architecture
Hars et al. Pseudorandom recursions: small and fast pseudorandom number generators for embedded applications

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21755549

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21755549

Country of ref document: EP

Kind code of ref document: A1