Cryptographic system component

Publication number
US20070157030A1
Authority
US
Grant status
Application
Patent type
Prior art keywords
processing
data
unit
logic
memory
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US11323329
Inventor
Wajdi Feghali
William Hasenplaugh
Gilbert Wolrich
Daniel Cutter
Vinodh Gopal
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Intel Corp
Original Assignee
Intel Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRICAL DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60Protecting data
    • G06F21/602Providing cryptographic facilities or services

Abstract

In general, in one aspect, the disclosure describes a system integrated on a single die that includes a first processor core to receive commands from at least one other processor core to perform at least one specified transformative operation on specified data, multiple processing units to perform transformative operations on data, a shared memory, and logic to transfer data between one of the multiple processing units and the shared memory.

Description

    REFERENCE TO RELATED APPLICATIONS
  • [0001]
    This relates to co-pending U.S. Patent Application Ser. No. ______, attorney docket 42390.P22797 filed on the same day as the present application, and entitled “MULTIPLIER”.
  • [0002]
    This relates to co-pending U.S. Patent Application Ser. No. ______, attorney docket 42390.P22799 filed on the same day as the present application, and entitled “CRYPTOGRAPHY PROCESSING UNITS AND MULTIPLIER”.
  • BACKGROUND
  • [0003]
    Cryptography can protect data from unwanted access. Cryptography typically involves mathematical operations on data (encryption) that makes the original data (plaintext) unintelligible (ciphertext). Reverse mathematical operations (decryption) restore the original data from the ciphertext. Typically, decryption relies on additional data such as a cryptographic key. A cryptographic key is data that controls how a cryptography algorithm processes the plaintext. In other words, different keys generally cause the same algorithm to output different ciphertext for the same plaintext. Absent a needed decryption key, restoring the original data is, at best, an extremely time consuming mathematical challenge.
  • [0004]
    Cryptography is used in a variety of situations. For example, a document on a computer may be encrypted so that only authorized users of the document can decrypt and access the document's contents. Similarly, cryptography is often used to encrypt the contents of packets traveling across a public network. While malicious users may intercept these packets, these malicious users access only the ciphertext rather than the plaintext being protected.
  • [0005]
    Cryptography covers a wide variety of applications beyond encrypting and decrypting data. For example, cryptography is often used in authentication (i.e., reliably determining the identity of a communicating agent), the generation of digital signatures, and so forth.
  • [0006]
    Current cryptographic techniques rely heavily on intensive mathematical operations. For example, many schemes involve the multiplication of very large numbers. For instance, many schemes use a type of modular arithmetic known as modular exponentiation, which involves raising a large number to some power and reducing it with respect to a modulus (i.e., taking the remainder when divided by a given modulus). The mathematical operations required by cryptographic schemes can consume considerable processor resources. For example, a processor of a networked computer participating in a secure connection may devote a significant portion of its computation power to encryption and decryption tasks, leaving fewer processor resources for other operations.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • [0007]
    FIG. 1 is a diagram of a cryptographic component.
  • [0008]
    FIG. 2 is a flow diagram illustrating operation of a cryptographic component.
  • [0009]
    FIG. 3 is a diagram of a processor including a cryptographic component.
  • [0010]
    FIG. 4 is a diagram illustrating processing unit architecture.
  • [0011]
    FIG. 5 is a diagram of logic interconnecting shared memory and the processing units.
  • [0012]
    FIG. 6 is a diagram of a set of processing units coupled to a multiplier.
  • [0013]
    FIG. 7 is a diagram of a programmable processing unit.
  • [0014]
    FIG. 8 is a diagram illustrating operation of an instruction to cause transfer of data from an input buffer into a data bank.
  • [0015]
    FIGS. 9-11 are diagrams illustrating operation of instructions to cause an arithmetic logic unit operation.
  • [0016]
    FIG. 12 is a diagram illustrating concurrent operation of datapath instructions.
  • [0017]
    FIG. 13 is a diagram illustrating different sets of variables corresponding to different hierarchical scopes of program execution.
  • [0018]
    FIG. 14 is a diagram illustrating windowing of an exponent.
  • [0019]
    FIG. 15 is a diagram of windowing logic.
  • [0020]
    FIG. 16 is a diagram illustrating operation of a hardware multiplier.
  • [0021]
    FIG. 17 is a diagram of a hardware multiplier.
  • [0022]
    FIGS. 18-20 are diagrams of different types of processing units.
  • [0023]
    FIG. 21 is a diagram of a processor having multiple processor cores.
  • [0024]
    FIG. 22 is a diagram of a processor core.
  • [0025]
    FIG. 23 is a diagram of a network forwarding device.
  • DETAILED DESCRIPTION
  • [0026]
    FIG. 1 depicts a sample implementation of a system component 100 to perform cryptographic operations. The component 100 can be integrated into a variety of systems. For example, the component 100 can be integrated within the die of a processor or found within a processor chipset. The system component 100 can off-load a variety of cryptographic operations from other system processor(s). The component 100 provides high performance at relatively modest clock speeds and is area efficient.
  • [0027]
    As shown, the sample component 100 may be integrated on a single die that includes multiple processing units 106-112 coupled to shared memory logic 104. The shared memory logic 104 includes memory that can act as a staging area for data and control structures being operated on by the different processing units 106-112. For example, data may be stored in memory and then sent to different processing units 106-112 in turn, with each processing unit performing some task involved in cryptographic operations and returning the potentially transformed data back to the shared memory logic 104.
  • [0028]
    The processing units 106-112 are constructed to perform different operations involved in cryptography such as encryption, decryption, authentication, and key generation. For example, processing unit 106 may perform hashing algorithms (e.g., MD5 (Message Digest 5) and/or SHA (Secure Hash Algorithm)) while processing unit 110 performs cipher operations (e.g., DES (Data Encryption Standard), 3DES (Triple DES), AES (Advanced Encryption Standard), RC4 (ARCFOUR), and/or Kasumi).
  • [0029]
    As shown, the shared memory logic 104 is also coupled to a RAM (random access memory) 114. In operation, data can be transferred from the RAM 114 for processing by the processing units 106-112. Potentially, transformed data (e.g., encrypted or decrypted data) is returned to the RAM 114. Thus, the RAM 114 may represent a nexus between the component 100 and other system components (e.g., processor cores requesting cryptographic operations on data in RAM 114). The RAM 114 may be external to the die hosting the component 100.
  • [0030]
    The sample implementation shown includes a programmable processor core 102 that controls operation of the component 100. As shown, the core 102 receives commands to perform cryptographic operations on data. Such commands can identify the requesting agent (e.g., core), a specific set of operations to perform (e.g., cryptographic protocol), the data to operate on (e.g., the location of a packet payload), and additional cryptographic context data such as a cryptographic key, initial vector, and/or residue from a previous cryptographic operation. In response to a command, the core 102 can execute program instructions that transfer data between RAM 114, shared memory, and the processing units 106-112.
  • [0031]
    A program executed by the core 102 can perform a requested cryptographic operation in a single pass through program code. As an example, FIG. 2 illustrates processing of a command to encrypt packet “A” stored in RAM 114 by a program executed by core 102. For instance, another processor core (not shown) may send the command to component 100 to prepare transmission of packet “A” across a public network. As shown, the sample program: (1) reads the packet and any associated cryptography context (e.g., keys, initial vectors, or residue) into shared memory from RAM 114; (2) sends the data to an aligning processing unit 106 that writes the data back into shared memory 104 aligned on a specified byte boundary; (3) sends the data to a cipher processing unit 108 that performs a transformative cipher operation on the data before sending the transformed data to memory 104; and (4) transfers the transformed data to RAM 114. The core 102 may then generate a signal or message notifying the processor core that issued the command that encryption is complete.
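    The four steps above can be sketched in software. The following is only an illustrative model, not the described hardware: the function names, the dictionary standing in for RAM 114, the treatment of alignment as zero-padding, and the XOR placeholder used in place of a real cipher are all assumptions made for the sketch.

```python
# Hypothetical software model of the four-step flow; every name here is illustrative.

def align_unit(data: bytes, boundary: int) -> bytes:
    """Stand-in aligning unit: zero-pad data to a multiple of `boundary` bytes (assumed behaviour)."""
    pad = (-len(data)) % boundary
    return data + bytes(pad)


def cipher_unit(data: bytes, key: bytes) -> bytes:
    """Stand-in cipher unit: repeating-key XOR. A placeholder only, NOT a secure cipher."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


def encrypt_command(ram: dict, packet_addr: str, key_addr: str, out_addr: str) -> None:
    """Handle one 'encrypt packet' command in a single pass."""
    shared = {}
    # (1) read the packet and its cryptographic context from RAM into shared memory
    shared["packet"] = ram[packet_addr]
    shared["key"] = ram[key_addr]
    # (2) push to the aligning unit, pull the aligned data back into shared memory
    shared["packet"] = align_unit(shared["packet"], 8)
    # (3) push to the cipher unit, pull the transformed data back into shared memory
    shared["packet"] = cipher_unit(shared["packet"], shared["key"])
    # (4) transfer the transformed data back to RAM
    ram[out_addr] = shared["packet"]


ram = {"pkt_A": b"hello", "key_A": b"\x5a\xa5"}
encrypt_command(ram, "pkt_A", "key_A", "out_A")
```

    In this model each processing unit is a pure function, so the controlling program is a straight-line sequence of transfers with no loops or retries.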
  • [0032]
    The processor core 102 may be a multi-threaded processor core including storage for multiple program counters and contexts associated with multiple, respective, threads of program execution. That is, in FIG. 2, thread 130 may be one of multiple threads. The core 102 may switch between thread contexts to mask latency associated with processing unit 106-112 operation. For example, thread 130 may include an instruction (not shown) explicitly relinquishing thread 130 execution after an instruction sending data to the cipher processing unit 108 until receiving an indication that the transformed data has been written into shared memory 104. Alternately, the core 102 may use pre-emptive context switching that automatically switches contexts after certain events (e.g., requesting operation of a processing unit 106-112 or after a certain amount of execution time). Thread switching enables a different thread to perform other operations such as processing of a different packet in what would otherwise be wasted core 102 cycles. Throughput can potentially be increased by adding additional contexts to the core 102. In a multi-threaded implementation, threads can be assigned to commands in a variety of ways, for example, by a dispatcher thread that assigns threads to commands or by threads dequeuing commands when the threads are available.
  • [0033]
    FIG. 3 illustrates a sample implementation of a processor 124 including a cryptographic system component 100. As shown, the component 100 receives commands from processor core(s) 118-122. In this sample implementation, core 102 is integrated into the system component 100 and services commands from the other cores 118-122. In an alternate implementation, processing core 102 may not be integrated within the component. Instead, cores 118-122 may have direct control over component 100 operation. Alternately, one of cores 118-122 may be designated for controlling the cryptographic component 100 and servicing requests received from the other cores 118-122. This latter approach can lessen the expense and die footprint of the component 100.
  • [0034]
    As shown in FIG. 4, the different processing units 106-112 may feature the same uniform interface architecture to the shared memory logic 104. This uniformity eases the task of programming by making interaction with each processing unit very similar. The interface architecture also enables the set of processing units 106-112 included within the component 100 to be easily configured. For example, to increase throughput, a component 100 can be configured to include multiple copies of the same processing unit. For instance, if the component 100 is likely to be included in a system that will perform a large volume of authentication operations, the component 100 may be equipped with multiple hash processing units. Additionally, the architecture enables new processing units to be easily integrated into the component 100. For example, when a new cryptography algorithm emerges, a processing unit to implement the algorithm can be made available.
  • [0035]
    In the specific implementation shown in FIG. 4, each processing unit includes an input buffer 142 that receives data from shared memory logic 104 and an output buffer 140 that stores data to transfer to shared memory logic 104. The processing unit 106 also includes processing logic 144 such as programmable or dedicated hardware (e.g., an Application Specific Integrated Circuit (ASIC)) to operate on data received by input buffer 142 and write operation results to buffer 140. In the example shown, buffers 140, 142 may include memory and logic (not shown) that queue data in the buffers based on the order in which data is received. For example, the logic may feature head and tail pointers into the memory and may append newly received data to the tail.
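    The head-and-tail queuing described for the buffers can be modelled as a small ring buffer. This is a minimal sketch under assumptions: the class name `FifoBuffer`, the fixed capacity, and the error handling are illustrative and are not taken from the disclosure.

```python
class FifoBuffer:
    """Illustrative first-in, first-out buffer with head and tail pointers."""

    def __init__(self, capacity: int):
        self.mem = [None] * capacity
        self.head = 0    # index of the oldest queued entry
        self.tail = 0    # index where the next entry will be written
        self.count = 0   # number of entries currently queued

    def push(self, word):
        """Append newly received data at the tail."""
        if self.count == len(self.mem):
            raise OverflowError("buffer full")
        self.mem[self.tail] = word
        self.tail = (self.tail + 1) % len(self.mem)
        self.count += 1

    def pop(self):
        """Remove and return the oldest entry from the head."""
        if self.count == 0:
            raise IndexError("buffer empty")
        word = self.mem[self.head]
        self.head = (self.head + 1) % len(self.mem)
        self.count -= 1
        return word
```

    Data leaves such a buffer in the order in which it arrived, which matches the queuing behaviour described for buffers 140, 142.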
  • [0036]
    In the sample implementation shown, the input buffer 142 is coupled to the shared memory logic 104 by a different bus 146 than the bus 148 coupling the output buffer 140 to the shared memory logic 104. These buses 146, 148 may be independently clocked with respect to other system clocks. Additionally, the buses 146, 148 may be private to component 100, shielding internal operation of the component 100. Potentially, the input buffers 142 of multiple processing units may share the same bus 146; likewise, the output buffers 140 may share bus 148. Of course, a variety of other communication schemes may be implemented such as a single shared bus instead of dual-buses or dedicated connections between the shared memory logic 104 and the processing units 106-112.
  • [0037]
    Generally, each processing unit is affected by at least two commands received by the shared memory logic 104: (1) a processing unit READ command that transfers data from the shared memory logic 104 to the processing unit input buffer 142 and (2) a processing unit WRITE command that transfers data from the output buffer 140 of the processing unit to the shared memory logic 104. Both commands can identify the target processing unit and the data being transferred. The uniformity of these instructions across different processing units can ease component 100 programming. In the specific implementation shown, a processing unit READ instruction causes a data push from shared memory to a respective target processing unit's 106-112 input buffer 142 via bus 146, while a processing unit WRITE instruction causes a data pull from a target processing unit's 106-112 output buffer 140 into shared memory via bus 148. Thus, to process data, a core 102 program may issue a command to first push data to the processing unit and later issue a command to pull the results written into the processing unit's output buffer 140. Of course, a wide variety of other inter-component 100 communication schemes may be used.
  • [0038]
    FIG. 5 depicts shared memory logic 104 of the sample implementation. As shown, the logic 104 includes a READ queue and a WRITE queue for each processing unit (labeled “PU”). Commands to transfer data to/from the banks of shared memory (banks a-n) are received at an inlet queue 180 and sorted into the queues 170-171 based on the target processing unit and the type of command (e.g., READ or WRITE). In addition to commands targeting processing units, the logic 104 also permits cores external to the component (e.g., cores 118-122) to READ (e.g., pull) or WRITE (e.g., push) data from/to the memory banks and features an additional pair of queues (labeled “cores”) for these commands. Arbiters 182-184 dequeue commands from the queues 170-171. For example, each arbiter 182-184 may use a round robin or other servicing scheme. The arbiters 182-184 forward the commands to another queue 172-178 based on the type of command. For example, commands pushing data to an external core are enqueued in queue 176 while commands pulling data from an external core are enqueued in queue 172. Similarly, commands pushing data to a processing unit are enqueued in queue 178 while commands pulling data from a processing unit are enqueued in queue 174. When a command reaches the head of a queue, the logic 104 initiates a transfer of data to/from the memory banks, either with the processing unit using buses 146 or 148 as appropriate, or by sending/receiving data over a bus coupling the component 100 to the cores 118-122. The logic 104 also includes circuitry to permit transfer (pushes and pulls) of data between the memory banks and the external RAM 114.
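    The sorting and arbitration just described can be approximated in a few lines. This sketch is a simplification under assumptions: the function names and the `(target, kind, payload)` command tuple are hypothetical, and a single round-robin pass per command type stands in for arbiters 182-184 and the downstream queues 172-178.

```python
from collections import deque


def sort_commands(inlet):
    """Sort (target, kind, payload) commands into one queue per target and command type."""
    queues = {}
    for target, kind, payload in inlet:
        queues.setdefault((target, kind), deque()).append(payload)
    return queues


def round_robin(queues, kind):
    """Drain every queue of one command type, visiting targets in round-robin order."""
    targets = sorted(t for (t, k) in queues if k == kind)
    order = []
    while any(queues[(t, kind)] for t in targets):
        for t in targets:
            q = queues[(t, kind)]
            if q:
                order.append((t, q.popleft()))
    return order


inlet = [
    ("PU0", "READ", "a"),
    ("PU0", "READ", "b"),
    ("PU1", "READ", "c"),
    ("PU0", "WRITE", "d"),
]
queues = sort_commands(inlet)
read_order = round_robin(queues, "READ")
write_order = round_robin(queues, "WRITE")
```

    Because each target has its own queue, a burst of commands aimed at one processing unit does not starve commands aimed at another.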
  • [0039]
    The logic 104 shown in FIG. 5 is merely an example, and a wide variety of other architectures may be used. For example, an implementation need not sort the commands into per processing unit queues, although this queuing can ensure fairness among requests. Additionally, the architecture reflected in FIG. 5 could be turned on its head. That is, instead of the logic 104 receiving commands that deliver and retrieve data to/from the memory banks, commands may be routed to the processing units which in turn issue requests to access the shared memory banks.
  • [0040]
    Many cryptographic protocols, such as public-key exchange protocols, require modular multiplication (e.g., [A×B] mod m) and/or modular exponentiation (e.g., A^exponent mod m) of very large numbers. While computationally expensive, these operations are critical to many secure protocols such as a Diffie-Hellman exchange, DSA signatures, RSA signatures, and RSA encryption/decryption. FIG. 6 depicts a dedicated hardware multiplier 156 coupled to multiple processing units 150-154. The processing units 150-154 can send data (e.g., a pair of variable length multi-word vector operands) to the multiplier 156 and can consume the results. To multiply very large numbers, the processing units 150-154 can decompose a multiplication into a set of smaller partial products that can be more efficiently performed by the multiplier 156. For example, multiplication of two 1024-bit operands can be computed as four sets of 512-bit×512-bit multiplications or sixteen sets of 256-bit×256-bit multiplications.
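    The decomposition into partial products can be checked with ordinary integer arithmetic. The sketch below uses the schoolbook method, which is one possible decomposition and not necessarily the one a processing unit program would use; the function name `split_multiply` and its parameters are illustrative assumptions.

```python
def split_multiply(a: int, b: int, bits: int = 1024, chunk: int = 512):
    """Multiply two operands of at most `bits` bits using only chunk-by-chunk products.

    Returns (product, number_of_partial_products).
    """
    mask = (1 << chunk) - 1
    n = bits // chunk
    # Slice each operand into n chunk-sized words, least significant first.
    a_parts = [(a >> (i * chunk)) & mask for i in range(n)]
    b_parts = [(b >> (j * chunk)) & mask for j in range(n)]
    result = 0
    products = 0
    for i, ai in enumerate(a_parts):
        for j, bj in enumerate(b_parts):
            # Each partial product is shifted to its place value and accumulated.
            result += (ai * bj) << ((i + j) * chunk)
            products += 1
    return result, products


a = (1 << 1023) + 12345
b = (1 << 1000) + 67890
product_512, count_512 = split_multiply(a, b, chunk=512)
product_256, count_256 = split_multiply(a, b, chunk=256)
```

    With 512-bit chunks the loop performs four partial multiplications, and with 256-bit chunks it performs sixteen, matching the counts given above.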
  • [0041]
    The most efficient use of the multiplier 156 may vary depending on the problem at hand (e.g., the size of the operands). To provide flexibility in how the processing units 150-154 use the multiplier 156, the processing units 150-154 shown in FIG. 6 may be programmable. The programs may be dynamically downloaded to the processing units 150-154, along with data to operate on, from the shared memory logic 104 via interface 158. The program selected for download to a given processing unit 150-154 can change in accordance with the problem assigned to the processing unit 150-154 (e.g., a particular protocol and/or operand size). The programmability of the units 150-154 permits component 100 operation to change as new security protocols, algorithms, and implementations are introduced. In addition, a programmer can carefully tailor processing unit 150-154 operation based on the specific algorithm and operand size required by a protocol. Since the processing units 150-154 can be dynamically reprogrammed on the fly (during operation of the component 100), the same processing units 150-154 can be used to perform operations for different protocols/protocol options by simply downloading the appropriate software instructions.
  • [0042]
    As described above, each processing unit 150-154 may feature an input buffer and an output buffer (see FIG. 4) to communicate with shared memory logic 104. The multiplier 156 and processing units 150-154 may communicate using these buffers. For example, a processing unit 150-154 may store operands to multiply in a pair of output queues in the output buffer for consumption by the multiplier 156. The multiplier 156 results may be then transferred to the processing unit 150-154 upon completion. The same processing unit 150-154 input and output buffers may also be used to communicate with shared memory logic 104. For example, the input buffer of a processing unit 150-154 may receive program instructions and operands from shared memory logic 104. The processing unit 150-154 may similarly store the results of program execution in an output buffer for transfer to the shared memory logic 104 upon completion of program execution.
  • [0043]
    To coordinate these different uses of a processing unit's input/output buffers, the processing units 150-154 provide multiple modes of operation that can be selected by program instructions executed by the processing units. For example, in “I/O” mode, the buffers of processing unit 150-154 exclusively exchange data with shared memory logic unit 104 via interface 158. In “run” mode, the buffers of the unit 150-154 exclusively exchange data with multiplier 156 instead. Additional processing unit logic (not shown) may interact with the interface 158 and the multiplier 156 to indicate the processing unit's current mode.
  • [0044]
    As an example, in operation, a core may issue a command to shared memory logic 104 specifying a program to download to a target processing unit and data to be processed. The shared memory logic 104, in turn, sends a signal, via interface 158, awakening a given processing unit from a “sleep” mode into I/O mode. The input buffer of the processing unit then receives a command from the shared memory logic 104 identifying, for example, the size of a program being downloaded, initial conditions, the starting address of the program instructions in shared memory, and program variable values. To avoid unnecessary loading of program code, if the program size is specified as zero, the previously loaded program will be executed. This optimizes initialization of a processing unit when requested to perform the same operation in succession.
  • [0045]
    After loading the program instructions and setting the variables and initial conditions to the specified values, an instruction in the downloaded program changes the mode of the processing unit from I/O mode to run mode. The processing unit can then write operands to multiply to its output buffers and receive delivery of the multiplier 156 results in its input buffer. Eventually, the program instructions write the final result into the output buffer of the processing unit and change the mode of the processing unit back to I/O mode. The final results are then transferred from the unit's output buffer to the shared memory logic 104 and the unit returns to sleep mode.
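    The mode sequence described above (sleep, I/O, run, I/O, sleep) and the reuse of a previously loaded program when the program size is zero can be modelled as a small state machine. This is a sketch under assumptions: the class and method names are hypothetical, and a Python callable stands in for downloaded program instructions.

```python
class ProcessingUnitModel:
    """Illustrative model of processing unit mode changes; not the actual hardware logic."""

    def __init__(self):
        self.mode = "sleep"
        self.program = None  # last program loaded into the (modelled) control store

    def wake(self):
        """Signal from the shared memory logic: sleep -> I/O."""
        if self.mode != "sleep":
            raise RuntimeError("unit is not asleep")
        self.mode = "io"

    def load(self, program_size: int, program=None):
        """Load a program in I/O mode; a size of zero keeps the previous program."""
        if self.mode != "io":
            raise RuntimeError("programs are loaded in I/O mode")
        if program_size != 0:
            self.program = program

    def execute(self, operands):
        """The program switches to run mode, computes, then switches back to I/O mode."""
        if self.mode != "io":
            raise RuntimeError("execution starts from I/O mode")
        self.mode = "run"
        result = self.program(operands)
        self.mode = "io"
        return result

    def finish(self):
        """After results are transferred out: I/O -> sleep."""
        if self.mode != "io":
            raise RuntimeError("results are transferred in I/O mode")
        self.mode = "sleep"


pu = ProcessingUnitModel()
pu.wake()
pu.load(1, lambda ops: ops[0] * ops[1])
first = pu.execute((3, 4))
pu.finish()

pu.wake()
pu.load(0)                 # size zero: reuse the previously loaded program
second = pu.execute((5, 6))
pu.finish()
```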
  • [0046]
    FIG. 7 depicts a sample implementation of a programmable processing unit 150. As shown, the processing unit 150 includes an arithmetic logic unit 216 that performs operations such as addition, subtraction, and logical operations such as boolean AND-ing and OR-ing of vectors. The arithmetic logic unit 216 is coupled to, and can operate on, operands stored in different memory resources 220, 212, 214 integrated within the processing unit 150. For example, as shown, the arithmetic logic unit 216 can operate on operands provided by a memory divided into a pair of data banks 212, 214 with each data bank 212, 214 independently coupled to the arithmetic logic unit 216. As described above, the arithmetic logic unit 216 is also coupled to and can operate on operands stored in input queue 220 (e.g., data transferred to the processing unit 150, for example, from the multiplier or shared memory logic 104). The size of operands used by the arithmetic logic unit 216 to perform a given operation can vary and can be specified by program instructions.
  • [0047]
    As shown, the arithmetic logic unit 216 may be coupled to a shifter 218 that can programmatically shift the arithmetic logic unit 216 output. The resulting output of the arithmetic logic unit 216/shifter 218 can be “re-circulated” back into a data bank 212, 214. Alternately, or in addition, results of the arithmetic logic unit 216/shifter 218 can be written to an output buffer 222 divided into two parallel queues. Again, the output queues 222 can store respective sets of multiplication operands to be sent to the multiplier 156 or can store the final results of program execution to be transferred to shared memory.
  • [0048]
    The components described above form a cyclic datapath. That is, operands flow from the input buffer 220, data banks 212, 214 through the arithmetic logic unit 216 and either back into the data banks 212, 214 or to the output buffer(s) 222. Operation of the datapath is controlled by program instructions stored in control store 204 and executed by control logic 206. The control logic 206 has a store of global variables 208 and a set of variable references 202 (e.g., pointers) into data stored in data banks 212, 214.
  • [0049]
    A sample instruction set that can be implemented by control logic 206 is described in the attached Appendix A. Other implementations may vary in instruction operation and syntax.
  • [0050]
    Generally, the control logic 206 includes instructions (“setup” instructions) to assign variable values, instructions (“exec” and “fexec” instructions) to perform mathematical and logical operations, and control flow instructions such as procedure calls and conditional branching instructions. The conditional branching instructions can operate on a variety of condition codes generated by the arithmetic logic unit 216/shifter 218 such as carry, msb (if the most significant bit=1), lsb (if the least significant bit=1), negative, zero (if the last quadword=0), and zero_vector (if the entire operand=0). Additionally, the processing unit 150 provides a set of user accessible bits that can be used as conditions for conditional instructions.
  • [0051]
    The control logic 206 includes instructions that cause data to move along the processing unit 150 datapath. For example, FIG. 8 depicts the sample operation of a “FIFO” instruction that, when the processing unit is in “run” mode, pops data from the input queue 220 for storage in a specified data bank 212, 214. In “I/O” mode, the FIFO instruction can, instead, transfer data and instructions from the input queue 220 to the control store 204.
  • [0052]
    FIG. 9 depicts sample operation of an “EXEC” instruction that supplies operands to the arithmetic logic unit 216. In the example shown, the source operands are supplied by data banks 212, 214 and the output is written to an output queue 222. As shown in FIG. 10, an EXEC instruction can alternately store results back into one of the data banks 212, 214 (in the case shown, bank B 214).
  • [0053]
    FIG. 11 depicts sample operation of an “FEXEC” (FIFO EXEC) instruction that combines aspects of the FIFO and EXEC instructions. Like an EXEC instruction, an FEXEC instruction supplies operands to the arithmetic logic unit 216. However, instead of operands being supplied exclusively by the data banks 212, 214, an operand can be supplied from the input queue 220.
  • [0054]
    Potentially, different ones of the datapath instructions can be concurrently operating on the datapath. For example, as shown in FIG. 12, an EXEC instruction may follow a FIFO instruction during the execution of a program. While these instructions may take multiple cycles to complete, assuming the instructions do not access overlapping portions of the data banks 212, 214, the control logic 206 may issue the EXEC instruction before the FIFO instruction completes. To ensure that the concurrent operation does not deviate from the results of in-order operation, the control logic 206 may determine whether concurrent operation would destroy data coherency. For example, if the preceding FIFO instruction writes data to a portion of data bank A that sources an operand in the subsequent EXEC instruction, the control logic 206 awaits writing of the data by the FIFO instruction into the overlapping data bank portion before starting operation of the EXEC instruction on the datapath.
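    The coherency check described above amounts to an overlap test between the region a pending FIFO instruction writes and the regions a following EXEC instruction reads. The sketch below is an assumption-laden illustration: the `(bank, start, length)` tuples and the function names are hypothetical, and only the source operands mentioned in the text are checked.

```python
def ranges_overlap(a, b):
    """Return True if two (bank, start, length) regions share at least one location."""
    bank_a, start_a, len_a = a
    bank_b, start_b, len_b = b
    return (
        bank_a == bank_b
        and start_a < start_b + len_b
        and start_b < start_a + len_a
    )


def can_issue_early(fifo_dest, exec_sources):
    """An EXEC may issue before the FIFO completes only if it reads none of the FIFO's destination."""
    return not any(ranges_overlap(fifo_dest, src) for src in exec_sources)


# FIFO writes words 0-3 of bank A; EXEC reads words 4-7 of bank A and words 0-3 of bank B.
independent = can_issue_early(("A", 0, 4), [("A", 4, 4), ("B", 0, 4)])
# FIFO writes words 0-3 of bank A; EXEC reads words 2-5 of bank A.
dependent = can_issue_early(("A", 0, 4), [("A", 2, 4)])
```

    In the dependent case the EXEC instruction would be held until the FIFO instruction has written the overlapping portion of the data bank.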
  • [0055]
    In addition to concurrent operation of multiple datapath instructions, the control logic 206 may execute other instructions concurrently with operations caused by datapath instructions. For example, the control logic 206 may execute control flow logic instructions (e.g., a conditional branch) and variable assignment instructions before previously initiated datapath operations complete. More specifically, in the implementation shown, FIFO instructions may issue concurrently with any branch instruction or any setup instruction except a mode instruction. FIFO instructions may issue concurrently with any execute instruction provided the destination banks for both are mutually exclusive. FEXEC and EXEC instructions may issue concurrently with any mode instructions and instructions that do not rely on the existence of particular condition states. EXEC instructions, however, may not issue concurrently with FEXEC instructions.
  • [0056]
    The processing unit 150 provides a number of features that can ease the task of programming cryptographic operations. For example, programs implementing many algorithms can benefit from recursion or other nested execution of subroutines or functions. As shown in FIG. 13, the processing unit may maintain different scopes 250-256 of variables and conditions that correspond to different depths of nested subroutine/function execution. The control logic uses one of the scopes 250-256 as the current scope. For example, the current scope in FIG. 13 is scope 252. While a program executes, the variable and condition values specified by this scope are used by the control logic 206. For example, a reference to variable “A0” by an instruction would be associated with A0 of the current scope 252. The control logic 206 can automatically increment or decrement the scope index in response to procedure calls (e.g., subroutine calls, function calls, or method invocations) and procedure exits (e.g., returns), respectively. For example, upon a procedure call, the current scope may advance to scope 254 before returning to scope 252 after a procedure return.
  • [0057]
    As shown, each scope 250-256 features a set of pointers into data banks A and B 212, 214. Thus, the A variables and B variables accessed by a program are de-referenced based on the current scope. In addition, each scope 250-256 stores a program counter that can be used to set program execution to the place where a calling procedure left off. Each scope also stores an operand scale value that identifies a base operand size. The instructions access the scale value to determine the size of operands being supplied to the arithmetic logic unit or multiplier. For example, an EXEC instruction may specify operands of N×current-scope-scale size. Each scope further contains Index and Index Compare values. These values are used to generate an Index Compare condition that can be used in conditional branching instructions when the two are equal. A scope may include a set of user bits that can be used as conditions for conditional instructions.
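    The per-scope state can be pictured as an array of records indexed by call depth. The following is a rough sketch, not the described hardware: the field and method names are hypothetical, and storing the return address in the caller's scope is an assumption about how the saved program counter is used.

```python
from dataclasses import dataclass, field


@dataclass
class Scope:
    """Illustrative per-scope state: bank pointers, saved program counter, scale, index values."""
    a_ptrs: dict = field(default_factory=dict)   # variable name -> offset in data bank A
    b_ptrs: dict = field(default_factory=dict)   # variable name -> offset in data bank B
    program_counter: int = 0
    scale: int = 1
    index: int = 0
    index_compare: int = 0


class ScopeFile:
    """Fixed set of scopes with a current-scope index moved by procedure calls and returns."""

    def __init__(self, depth: int = 4):
        self.scopes = [Scope() for _ in range(depth)]
        self.current = 0

    def call(self, return_pc: int):
        """Save where the caller left off, then advance to the next scope."""
        if self.current + 1 >= len(self.scopes):
            raise RuntimeError("scope depth exceeded")
        self.scopes[self.current].program_counter = return_pc
        self.current += 1

    def ret(self) -> int:
        """Return to the calling scope and give back its saved program counter."""
        if self.current == 0:
            raise RuntimeError("no calling scope")
        self.current -= 1
        return self.scopes[self.current].program_counter

    def operand_size(self, n: int) -> int:
        """Operand size expressed as N times the current scope's scale."""
        return n * self.scopes[self.current].scale

    def index_compare_condition(self) -> bool:
        """Condition that holds when Index equals Index Compare in the current scope."""
        s = self.scopes[self.current]
        return s.index == s.index_compare


sf = ScopeFile()
sf.scopes[0].scale = 8   # outer procedure works on larger operands
sf.scopes[1].scale = 4   # nested procedure works on smaller operands
outer_size = sf.operand_size(2)
sf.call(return_pc=10)
inner_size = sf.operand_size(2)
resume_at = sf.ret()
```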
  • [0058]
    In addition to providing access to data in the current scope, the processing unit instruction set also provides instructions (e.g., “set scope <target scope>”) that provide explicit access to scope variables in a target scope other than the current scope. For example, a program may set up, in advance, the diminishing scales associated with an ensuing set of recursive/nested subroutine calls. In general, the instruction set includes an instruction to set each of the scope fields. In addition, the instruction set includes an instruction (e.g., “copy_scope”) to copy an entire set of scope values from the current scope to a target scope. Additionally, the instruction set includes instructions to permit scope values to be computed based on the values included in a different scope (e.g., “set variable relative”).
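    The scope mechanism described above can be modeled in a few lines of Python. This is only an illustrative sketch: the class names, field names, and the choice to save the return address in the caller's scope are assumptions made for the example, not details taken from the disclosure.

```python
from dataclasses import dataclass, replace


@dataclass
class Scope:
    # Hypothetical field names loosely following the description above.
    a_ptr: int = 0          # pointer into data bank A
    b_ptr: int = 0          # pointer into data bank B
    pc: int = 0             # program counter saved for the return
    scale: int = 0          # base operand size
    index: int = 0
    index_compare: int = 0


class ScopeFile:
    """Toy model of a fixed set of scopes with a current-scope index."""

    def __init__(self, depth=4):
        self.scopes = [Scope() for _ in range(depth)]
        self.current = 0

    def call(self, return_pc):
        # Assumption: the caller's scope remembers where to resume.
        if self.current + 1 >= len(self.scopes):
            raise OverflowError("no deeper scope available")
        self.scopes[self.current].pc = return_pc
        self.current += 1

    def ret(self):
        if self.current == 0:
            raise IndexError("return from outermost scope")
        self.current -= 1
        return self.scopes[self.current].pc

    def set_scale(self, target, scale):
        # Explicit access to a scope other than the current one.
        self.scopes[target].scale = scale

    def copy_scope(self, target):
        # Copy every field of the current scope into the target scope.
        self.scopes[target] = replace(self.scopes[self.current])

    def index_compare_condition(self):
        s = self.scopes[self.current]
        return s.index == s.index_compare


# Pre-load diminishing scales before a chain of nested calls.
scope_file = ScopeFile(depth=4)
for depth, scale in enumerate([512, 256, 128, 64]):
    scope_file.set_scale(depth, scale)
scope_file.call(return_pc=10)   # the callee now sees scale 256
```

    In this model a nested call needs no explicit save or restore of operand sizes, because each depth already carries its own scale.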
  • [0059]
    In addition to the scope support described above, the processing unit 150 also can include logic to reduce the burden of exponentiation. As described above, many cryptographic operations require exponentiation of large numbers. For example, FIG. 14 depicts an exponent 254 raising some number, g, to the 6,015,455,113-th power. To raise a number to this large exponent 254, many algorithms reduce the operation to a series of simpler mathematical operations. For example, an algorithm can process the exponent 254 as a bit string, proceeding bit-by-bit from left to right (most-significant-bit to least-significant-bit). For example, starting with an initial value of “1”, the algorithm can square the value for each “0” encountered in the bit string. For each “1” encountered in the bit string, the algorithm can square the value and multiply by g. For example, to determine the value of 2^9, the algorithm would operate on the binary exponent of 1001b as follows:
    step              exponent bit    operation       value
    initialization                                    1
    bit 1             1               1 ^ 2 * 2       2
    bit 2             0               2 ^ 2           4
    bit 3             0               4 ^ 2           16
    bit 4             1               16 ^ 2 * 2      512
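    The bit-by-bit procedure in the table can be written as a short Python function. This is a sketch of the general left-to-right square-and-multiply method; the function name is chosen for illustration.

```python
def square_and_multiply(g: int, exponent: int) -> int:
    """Left-to-right binary exponentiation, one exponent bit per step."""
    value = 1
    for bit in bin(exponent)[2:]:      # most significant bit first
        value = value * value          # always square
        if bit == "1":
            value = value * g          # multiply only on a 1 bit
    return value


print(square_and_multiply(2, 0b1001))  # 512, matching the table
```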
  • [0060]
    To reduce the computational demands of this algorithm, an exponent can be searched for windows of bits that correspond to pre-computed values. For example, in the trivially small example of 2^9, a bit pattern of “10” corresponds to g^2 (4). Thus, identifying the “10” window value in exponent “1001” enables the algorithm to simply square the value for each bit within the window and multiply by the precomputed value. Thus, an algorithm using windows could proceed:
    step                  exponent bit    operation       value
    initialization                                        1
    bit 1                 1               1 ^ 2           1
    bit 2                 0               1 ^ 2           1
    window “10” value                     1 * 4           4
    bit 3                 0               4 ^ 2           16
    bit 4                 1               16 ^ 2 * 2      512
  • [0061]
    Generally, this technique reduces the number of multiplications needed to perform an exponentiation (though not in this trivially small example). Additionally, the same window may appear many times within an exponent 254 bit string, thus the same precomputed value can be used.
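    One common way to realize the saving is to use windows aligned on fixed boundaries, with a table of precomputed powers. The Python sketch below is an assumption-laden illustration rather than the disclosed logic: the function name and the returned multiplication counter are invented for the example, and a modulus argument is added only to keep the numbers small.

```python
def fixed_window_pow(g, exponent, modulus, window_bits=4):
    """Exponentiation using aligned windows and a precomputed table.

    Returns (result, number of multiplications by a table entry).
    """
    # table[w] holds g**w mod modulus for every possible window value w.
    table = [1 % modulus]
    for _ in range((1 << window_bits) - 1):
        table.append(table[-1] * g % modulus)

    bits = bin(exponent)[2:]
    bits = "0" * (-len(bits) % window_bits) + bits  # pad to whole windows

    value = 1 % modulus
    table_multiplies = 0
    for start in range(0, len(bits), window_bits):
        for _ in range(window_bits):               # one squaring per bit
            value = value * value % modulus
        window = int(bits[start:start + window_bits], 2)
        if window:                                 # skip all-zero windows
            value = value * table[window] % modulus
            table_multiplies += 1
    return value, table_multiplies
```

    With 4-bit windows the 33-bit exponent 6,015,455,113 needs at most nine table multiplications, whereas the bit-by-bit method multiplies once for every “1” bit.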
  • [0062]
    Potentially, an exponent 254 may be processed in regularly positioned window segments of N-bits. For example, a first window may be the four most significant bits of exponent 254 (e.g., “0001”), a second window may be the next four most significant bits (e.g., “0110”) and so forth. Instead of regularly occurring windows, however, FIG. 14 depicts a scheme that uses sliding windows. That is, a window of some arbitrary size of N-bits can be found at any point within the exponent rather than aligned on an N-bit boundary. For example, FIG. 14 shows a bit string 256 identifying the location of 4-bit windows found within exponent 254. For example, an exponent window of “1011” is found at location 256 a and an exponent window of “1101” is found at location 256 b. Upon finding a window, the window bits are zeroed. For example, as shown, a window of “0011” is found at location 256 c. Zeroing the exponent bits enables a window of “0001” to be found at location 256 d.
  • [0063]
    FIG. 15 shows logic 210 used to implement a sliding window scheme. As shown, the logic 210 includes a set of M register bits (labeled C 4 to C −4) that perform a left shift operation that enables windowing logic 250 to access M-bits of an exponent string at a time as the exponent bits stream through the logic 210. Based on the register bits and an identification of a window size 252, the windowing logic 250 can identify the location of a window-size pattern of non-zero bits within the exponent. By searching within a set of bits larger than the window-size, the logic 250 can identify windows irrespective of location within the exponent bit string. Additionally, the greater swath of bits included in the search permits the logic 250 to select from different potential windows found within the M-bits (e.g., windows with the greatest number of “1” bits). For example, in FIG. 14, the exponent 254 begins with bits of “0001”, however this potential window is not selected in favor of the window “1011” using “look-ahead” bits (C −1-C −4).
  • [0064]
    Upon finding a window of non-zero bits, the logic 210 can output a “window found” signal identifying the index of the window within the exponent string. The logic 210 can also output the pattern of non-zero bits found. This pattern can be used as a lookup key into a table of pre-computed window values. Finally, the logic 210 zeroes the bits within the window and continues to search for window-sized bit-patterns.
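    For comparison, the following Python sketch shows the common textbook form of sliding-window exponentiation in software. It is not a model of logic 210: it uses no look-ahead selection among candidate windows, it trims each window so that it begins and ends with a “1” bit, and all names and the modulus argument are assumptions made for the example.

```python
def find_windows(exponent, window_bits=4):
    """Scan MSB to LSB; return (start, end, value) for each window found.

    start/end are positions in the bit string; each window starts and
    ends with a 1 bit, so its value is always odd.
    """
    bits = bin(exponent)[2:]
    windows, i = [], 0
    while i < len(bits):
        if bits[i] == "0":
            i += 1                      # zeros between windows are skipped
            continue
        j = min(i + window_bits, len(bits))
        while bits[j - 1] == "0":       # trim trailing zeros
            j -= 1
        windows.append((i, j, int(bits[i:j], 2)))
        i = j
    return windows


def sliding_window_pow(g, exponent, modulus, window_bits=4):
    # Precompute only the odd powers g^1, g^3, g^5, ...
    g %= modulus
    g_squared = g * g % modulus
    table = {1: g}
    for w in range(3, 1 << window_bits, 2):
        table[w] = table[w - 2] * g_squared % modulus

    value, pos = 1 % modulus, 0
    for start, end, w in find_windows(exponent, window_bits):
        for _ in range(end - pos):      # square past zeros and window bits
            value = value * value % modulus
        value = value * table[w] % modulus   # window value is the lookup key
        pos = end
    for _ in range(len(bin(exponent)) - 2 - pos):   # trailing zero bits
        value = value * value % modulus
    return value
```

    Because windows are not tied to fixed boundaries, runs of zero bits cost only squarings, and the lookup table needs entries for odd window values only.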
  • [0065]
    The logic 210 shown can be included in a processing unit. For example, FIG. 7 depicts the logic 210 as receiving the output of shifter 218 which rotates bits of an exponent through the logic 210. The logic 210 is also coupled to control logic 206. The control logic 206 can feature instructions that control operation of the windowing logic (e.g., to set the window size and/or select fixed or sliding window operation) and to respond to logic 210 output. For example, the control logic 206 can include a conditional branching instruction that operates on the “window found” output of the logic 210. For example, a program can branch on a window found condition and use the output index to lookup a precomputed value for the window.
  • [0066]
    As described above, the processing units may have access to a dedicated hardware multiplier 156. Before turning to a sample implementation (FIG. 17), FIG. 16 illustrates sample operation of a multiplier implementation. In FIG. 16 the multiplier 156 operates on two operands, A 256 and B 258, over a series of clock cycles. As shown, the operands are handled by the multiplier as sets of segments, though the number of segments and/or the segment size for each operand differs. For instance, in the example shown, the N-bits of operand A are divided into 8-segments (0-7) while operand B is divided into 2-segments (0-1).
  • [0067]
    As shown, the multiplier operates by successively multiplying a segment of operand A with a segment of operand B until all combinations of partial products of the segments are generated. For example, in cycle 2, the multiplier multiplies segment 0 of operand B (B0) with segment 0 of operand A (A0) 262 a while in cycle 17 2621 the multiplier multiplies segment 1 of operand B (B1) with segment 7 of operand A (A7). The partial products are shown in FIG. 16 as boxed sets of bits. As shown, based on the respective position of the segments within the operands, the sets of bits are shifted with respect to one another. For example, multiplication of the least significant segments of A and B (B0×A0) 262 a results in the least significant set of resulting bits, while multiplication of the most significant segments of A and B (B1×A7) 2621 results in the most significant set of resulting bits. The addition of the results of the series of partial products represents the multiplication of operands A 256 and B 258.
  • [0068]
    Sequencing computation of the series of partial products can incrementally yield bits of the final multiplication result well before the final cycle. For example, FIG. 16 identifies when bits of a given significance can be retired as arrowed lines spanning the bits. For example, after completing B0×A0 in cycle 2, the least significant bits of the final result are known since subsequent partial product results do not affect these bits. Similarly, after completing B0×A1 in cycle 3, bits can be retired since only partial products 262 a and 262 b affect the sum of these least significant bits. As shown, each cycle may not result in bits being retired. For example, multiplication of different segments can yield bits occupying the exact same significance. For example, the results of B0×A4 in cycle 6 and B1×A0 in cycle 7 exactly overlap. Thus, no bits are retired in cycle 6.
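    The sequencing and retirement behavior can be reproduced with a small Python model. The sketch assumes 64-bit operands, with A split into eight 8-bit segments and B into two 32-bit segments, and it orders partial products by increasing significance so that B0×A4 falls immediately before B1×A0. The segment widths and all names are illustrative choices, and the model sums full-precision integers rather than carry/save vectors.

```python
def split(value, seg_bits, count):
    """Split value into `count` segments, least significant first."""
    mask = (1 << seg_bits) - 1
    return [(value >> (i * seg_bits)) & mask for i in range(count)]


def segmented_multiply(a, b, a_seg_bits=8, a_segs=8, b_segs=2):
    total_bits = a_seg_bits * a_segs
    b_seg_bits = total_bits // b_segs
    seg_a = split(a, a_seg_bits, a_segs)
    seg_b = split(b, b_seg_bits, b_segs)

    # (shift, j, i) sorted by significance; ties put the B0 product first.
    schedule = sorted(
        (i * a_seg_bits + j * b_seg_bits, j, i)
        for j in range(b_segs)
        for i in range(a_segs)
    )

    accumulator, retired_upto, retire_log = 0, 0, []
    for step, (shift, j, i) in enumerate(schedule):
        accumulator += (seg_a[i] * seg_b[j]) << shift
        # Bits below the next partial product's shift can no longer change.
        if step + 1 < len(schedule):
            next_shift = schedule[step + 1][0]
        else:
            next_shift = 2 * total_bits          # final burst
        retire_log.append((f"B{j}xA{i}", next_shift - retired_upto))
        retired_upto = next_shift
    return accumulator, retire_log


product, log = segmented_multiply(0x0123456789ABCDEF, 0xFEDCBA9876543210)
for cycle, (name, bits_retired) in enumerate(log, start=2):
    print(cycle, name, bits_retired)   # cycle 6 (B0xA4) retires 0 bits
```

    In this model the step corresponding to cycle 6 retires nothing because the next partial product has the same significance, and the last step retires a larger burst of bits.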
  • [0069]
    FIG. 17 shows a sample implementation of a multiplier 156 in greater detail. The multiplier 156 can process operands as depicted in FIG. 16. As shown, the multiplier 156 features a set of multipliers 306-312 configured in parallel. While the multipliers may be N-bit×N-bit multipliers, N need not be a power of two. For example, for a 512-bit×512-bit multiplier 156, each multiplier may be a 67-bit×67-bit multiplier. Additionally, the multiplier 156 itself is not restricted to operands that are a power of two.
  • [0070]
    The multipliers 306-312 are supplied segments of the operands in turn, for example, as shown in FIG. 16. For instance, in a first cycle, segment 0 of operand A is supplied to each multiplier 306-312 while sub-segments d-a of segment 0 of operand B are respectively supplied to each multiplier 306-312. That is, multiplier 312 may receive segment 0 of operand A and segment 0, sub-segment a of operand B while multiplier 310 receives segment 0 of operand A and segment 0, sub-segment b of operand B in a given cycle.
  • [0071]
    The outputs of the multipliers 306-312 are shifted 314-318 based on the significance of the respective segments within the operands. For example, shifter 318 shifts the results of Bnb×An 314 with respect to the results of Bna×An 312 to reflect the significance of sub-segment b relative to sub-segment a.
  • [0072]
    The shifted results are sent to an accumulator 320. In the example shown, the multiplier 156 uses a carry/save architecture where operations produce a vector that represents the results absent any carries to more significant bit positions and a vector that stores the carries. Addition of the two vectors can be postponed until the final results are needed. While FIG. 17 depicts a multiplier 156 that features a carry/save architecture, other implementations may use other schemes (e.g., a carry/propagate adder), though a carry/save architecture may be many times more area and power efficient.
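    The carry/save idea can be illustrated with Python integers standing in for bit vectors. The sketch below is a generic 3-to-2 compression step, not the accumulator 320 itself, and the function names are illustrative.

```python
def carry_save_add(sum_vec, carry_vec, addend):
    """Fold one addend into a (sum, carry) pair without propagating carries.

    Invariant: new_sum + new_carry == sum_vec + carry_vec + addend.
    """
    new_sum = sum_vec ^ carry_vec ^ addend                 # per-bit sum
    new_carry = ((sum_vec & carry_vec)
                 | (sum_vec & addend)
                 | (carry_vec & addend)) << 1              # per-bit carry out
    return new_sum, new_carry


def accumulate(values):
    sum_vec, carry_vec = 0, 0
    for v in values:
        sum_vec, carry_vec = carry_save_add(sum_vec, carry_vec, v)
    return sum_vec, carry_vec


s, c = accumulate([13, 250, 7, 1023])
total = s + c    # the single carry-propagating addition, done only at the end
```

    Each accumulation step is a layer of independent bitwise operations, so only the final addition of the two vectors needs a carry chain.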
  • [0073]
    As shown in FIG. 16, sequencing of the segment multiplications can result in the output of bits by the multipliers 306-312 that are not affected by subsequent output by the multipliers 306-312. For example, in FIG. 16, the least significant bits output by the multipliers 306-312 can be sent to the accumulator 320 in cycle-2. The accumulator 320 can retire such bits as they are produced. For example, the accumulator 320 can output retired bits to a pair of FIFOs 322, 324 that store the accumulated carry/save vectors respectively. The multiplier 156 includes logic 326, 328, 336, 338 that shifts the remaining carry/save vectors in the multiplier by a number of bits corresponding to the number of bits retired. For example, if the accumulator 320 sends the least significant 64-bits to the FIFOs 322, 324, the remaining accumulator 320 vectors can be right shifted by 64-bits. As shown, the logic can shift the accumulator 320 vectors by a variable amount.
  • [0074]
    As described above, the FIFOs 322, 324 store bits of the carry/save vectors retired by the accumulator 320. The FIFOs 322, 324, in turn, feed an adder 330 that sums the retired portions of carry/save vectors. The FIFOs 322, 324 can operate to smooth feeding of bits to the adder 330 such that the adder 330 is continuously fed retired portions in each successive cycle until the final multiplier result is output. In other words, as shown in FIG. 16, not all cycles (e.g., cycle-6) result in retiring bits. Without FIFOs 322, 324, the adder 330 would stall when these cycles-without-retirement filter down through the multiplier 156. Instead, by filling the FIFOs 322, 324 with the retired bits and deferring dequeuing of FIFO 322, 324 bits until enough bits are retired, the FIFOs 322, 324 can ensure continuous operation of the adder 330. The FIFOs 322, 324, however, need not be as large as the number of bits in the final multiplier 156 result. Instead the FIFOs 322, 324 may only be large enough to store a sufficient number of retired bits such that “skipped” retirement cycles do not stall the adder 330 and large enough to accommodate the burst of retired bits in the final cycles.
  • [0075]
    The multiplier 156 acts as a pipeline that propagates data through the multiplier stages in a series of cycles. As shown, the multiplier features two queues 302, 304 that store operands to be multiplied. To support the partial product multiplication scheme described above, the width of the queues 302, 304 may vary with each queue being the width of 1-operand-segment. The queues 302, 304 prevent starvation of the pipeline. That is, as the multipliers complete multiplication of one pair of operands, the start of the multiplication of another pair of operands can immediately follow. For example, after the results of B1×A7 are output to the FIFOs 322, 324, logic 326, 328 can zero the accumulator 320 vectors to start multiplication of two new dequeued operands. Additionally, due to the pipeline architecture, the multiplication of two operands may begin before the multiplier receives the entire set of segments in the operands. For example, the multiplier may begin A×B as soon as segments A0 and B0 are received. In such operation, the FIFOs 322, 324 can not only smooth output of the adder 330 for a given pair of operands but can also smooth output of the adder 330 across different sets of operands. For example, after an initial delay as the pipeline fills, the multiplier 156 may output portions of the final multiplication results for multiple multiplication problems with each successive cycle. That is, after the cycle outputting the most significant bits of A×B, the least significant bits of C×D are output.
  • [0076]
    The multiplier 156 can obtain operands, for example, by receiving data from the processing unit output buffers. To determine which processing unit to service, the multiplier may feature an arbiter (not shown). For example, the arbiter may poll each processing unit in turn to determine whether a given processing unit has a multiplication to perform. To ensure multiplier 156 cycles are not wasted, the arbiter may determine whether a given processing unit has enqueued a sufficient amount of the operands and whether the processing unit has sufficient space in its input buffer to hold the results before selecting the processing unit for service.
  • [0077]
    The multiplier 156 is controlled by a state machine (not shown) that performs selection of the segments to supply to the multipliers, controls shifting, initiates FIFO dequeuing, and so forth.
  • [0078]
    Potentially, a given processing unit may decompose a given algorithm into a series of multiplications. To enable a processing unit to quickly complete a series of operations without interruption from other processing units competing for use of the multiplier 156, the arbiter may detect a signal provided by the processing unit that signals the arbiter to continue servicing additional sets of operands provided by the processing unit currently being serviced by the multiplier. In the absence of such a signal, the arbiter resumes servicing of the other processing units for example by resuming round-robin polling of the processing units.
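    The arbitration policy described above (round-robin polling, readiness checks, and a signal that keeps the multiplier assigned to one processing unit) might be modeled as follows. The class, its method, and the `ready`/`hold` inputs are hypothetical names for the example; the disclosure does not show the arbiter.

```python
class Arbiter:
    """Toy round-robin arbiter with a per-unit 'keep servicing me' signal."""

    def __init__(self, n_units):
        self.n_units = n_units
        self.next_unit = 0      # where round-robin polling resumes
        self.current = None     # unit serviced most recently

    def select(self, ready, hold):
        """Pick the unit to service next.

        ready[i]: unit i has enough operands queued and room for results.
        hold[i]:  unit i asks to continue its series of multiplications.
        """
        cur = self.current
        if cur is not None and hold[cur] and ready[cur]:
            return cur                           # stay with the same unit
        for offset in range(self.n_units):       # resume round-robin polling
            unit = (self.next_unit + offset) % self.n_units
            if ready[unit]:
                self.current = unit
                self.next_unit = (unit + 1) % self.n_units
                return unit
        self.current = None
        return None                              # nothing to service


arbiter = Arbiter(3)
all_ready = [True, True, True]
first = arbiter.select(all_ready, [False, False, False])    # unit 0
second = arbiter.select(all_ready, [True, False, False])    # unit 0 holds
third = arbiter.select(all_ready, [False, False, False])    # moves to unit 1
```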
  • [0079]
    Though the description above described a variety of processing units, a wide variety of processing units may be included in the component 100. For example, FIG. 18 depicts an example of a “bulk” processing unit. As shown, the unit includes an endian swapper to change data between big-endian and little-endian representations. The bulk processing unit also includes logic to perform CRC (Cyclic Redundancy Check) operations on data as specified by a programmable generator polynomial.
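    In software terms, the two bulk operations mentioned above might look like the following Python sketch. It assumes an MSB-first, unreflected CRC with no final XOR and a register width of at least 8 bits; the function names and parameters are illustrative and do not describe the hardware of FIG. 18.

```python
def crc_msb_first(data: bytes, poly: int, width: int, init: int = 0) -> int:
    """Bitwise CRC with a caller-supplied generator polynomial.

    `poly` omits the implicit top bit (e.g. 0x1021 for x^16+x^12+x^5+1).
    Assumes width >= 8, no bit reflection, and no final XOR.
    """
    top_bit = 1 << (width - 1)
    mask = (1 << width) - 1
    crc = init
    for byte in data:
        crc ^= byte << (width - 8)           # bring in the next 8 bits
        for _ in range(8):
            if crc & top_bit:
                crc = ((crc << 1) ^ poly) & mask
            else:
                crc = (crc << 1) & mask
    return crc


def swap_endianness(word: int, n_bytes: int = 4) -> int:
    """Reverse the byte order of an n_bytes-wide unsigned word."""
    return int.from_bytes(word.to_bytes(n_bytes, "big"), "little")


print(hex(crc_msb_first(b"123456789", 0x1021, 16)))   # 0x31c3 (CRC-16/XMODEM)
print(hex(swap_endianness(0x11223344)))               # 0x44332211
```

    Changing `poly`, `width`, and `init` selects a different CRC without changing the loop, which is the sense in which the generator polynomial is programmable in this sketch.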
  • [0080]
    FIG. 19 depicts an example of an authentication/hash processing unit. As shown the unit stores data (“common authentication data structures”) that are used for message authentication that are shared among the different authentication algorithms (e.g., configuration and state registers). The unit also includes dedicated hardware logic responsible for the data processing for each algorithm supported (e.g., MD5 logic, SHA logic, AES logic, and Kasumi logic). The overall operation of the unit is controlled by control logic and a finite state machine (FSM). The FSM controls the loading and unloading of data in the authentication data buffer, tracks the amount of data in the data buffer, sends a start signal to the appropriate authentication core, controls the source of data that gets loaded into the data buffer, and sends information to padding logic to help determine padding data.
  • [0081]
    FIG. 20 depicts an example of a cipher processing unit. The unit can perform encryption and decryption, among other tasks, for a variety of different cryptographic algorithms. As shown, the unit includes registers to store state information including a configuration register (labeled “config”), counter register (labeled “ctr”), key register, parameter register, RC4 state register, and IV (Initial Vector) register. The unit also includes multiplexors and XOR gates to support CBC (Cipher Block Chaining), F8, and CTR (Counter) modes. The unit also includes dedicated hardware logic for multiple ciphers that include the logic responsible for the algorithms supported (e.g., AES logic, 3DES logic, Kasumi logic, and RC4 logic). The unit also includes control logic and a state machine. The logic block is responsible for controlling the overall behavior of the cipher unit including enabling the appropriate datapath depending on the mode the cipher unit is in (e.g., in encryption CBC mode, the appropriate IV is chosen to generate the encrypt IV while the decrypt IV is set to 0), selecting the appropriate inputs into the cipher cores throughout the duration of cipher processing (e.g., the IV, the counter, and the key to be used), and generating control signals that determine what data to send to the output datapath based on the command issued by the core 102. This block also initiates and generates the necessary control signals for RC4 key expansion and AES key conversion.
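    The role of the XOR gates in CBC mode can be seen in a short Python sketch. The chaining functions below follow the standard CBC definition; the block function is a deliberately trivial, insecure stand-in invented for the example (it is not AES, 3DES, Kasumi, or RC4), and all names are hypothetical.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def cbc_encrypt(blocks, iv, encrypt_block):
    """XOR each plaintext block with the previous ciphertext, then encrypt."""
    previous, out = iv, []
    for block in blocks:
        previous = encrypt_block(xor_bytes(block, previous))
        out.append(previous)
    return out


def cbc_decrypt(blocks, iv, decrypt_block):
    """Decrypt each block, then XOR with the previous ciphertext block."""
    previous, out = iv, []
    for block in blocks:
        out.append(xor_bytes(decrypt_block(block), previous))
        previous = block
    return out


# Toy 8-byte block function, for illustration only: NOT a real cipher.
TOY_KEY = bytes(range(1, 9))


def toy_encrypt(block: bytes) -> bytes:
    return xor_bytes(block[1:] + block[:1], TOY_KEY)   # rotate, then XOR key


def toy_decrypt(block: bytes) -> bytes:
    rotated = xor_bytes(block, TOY_KEY)
    return rotated[-1:] + rotated[:-1]                 # undo the rotation


iv = bytes(8)
plaintext = [b"ABCDEFGH", b"ABCDEFGH"]
ciphertext = cbc_encrypt(plaintext, iv, toy_encrypt)
recovered = cbc_decrypt(ciphertext, iv, toy_decrypt)
```

    Because each block is XORed with the preceding ciphertext block, the two identical plaintext blocks above produce different ciphertext blocks.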
  • [0082]
    The processing units shown in FIGS. 18-20 are merely examples of different types of processing units and the component may feature many different types of units other than those shown. For example, the component may include a unit to perform pseudo random number generation, a unit to perform Reed-Solomon coding, and so forth.
  • [0083]
    The techniques described above can be implemented in a variety of ways and in different environments. For example, the techniques may be integrated within a network processor. As an example, FIG. 21 depicts an example of network processor 400 that can be programmed to process packets. The network processor 400 shown is an Intel® Internet eXchange network Processor (IXP). Other processors feature different designs.
  • [0084]
    The network processor 400 shown features a collection of programmable processing cores 402 on a single integrated semiconductor die 400. Each core 402 may be a Reduced Instruction Set Computer (RISC) processor tailored for packet processing. For example, the cores 402 may not provide floating point or integer division instructions commonly provided by the instruction sets of general purpose processors. Individual cores 402 may provide multiple threads of execution. For example, a core 402 may store multiple program counters and other context data for different threads.
  • [0085]
    As shown, the network processor 400 also features an interface 420 that can carry packets between the processor 400 and other network components. For example, the processor 400 can feature a switch fabric interface 420 (e.g., a Common Switch Interface (CSIX)) that enables the processor 400 to transmit a packet to other processor(s) or circuitry connected to a switch fabric. The processor 400 can also feature an interface 420 (e.g., a System Packet Interface (SPI) interface) that enables the processor 400 to communicate with physical layer (PHY) and/or link layer devices (e.g., MAC or framer devices). The processor 400 may also include an interface 404 (e.g., a Peripheral Component Interconnect (PCI) bus interface) for communicating, for example, with a host or other network processors.
  • [0086]
    As shown, the processor 400 includes other resources shared by the cores 402 such as the cryptography component 100, internal scratchpad memory, and memory controllers 416, 418 that provide access to external memory. The network processor 400 also includes a general purpose processor 406 (e.g., a StrongARM® XScale® or Intel Architecture core) that is often programmed to perform “control plane” or “slow path” tasks involved in network operations while the cores 402 are often programmed to perform “data plane” or “fast path” tasks.
  • [0087]
    The cores 402 may communicate with other cores 402 via the shared resources (e.g., by writing data to external memory or the scratchpad 408). The cores 402 may also intercommunicate via neighbor registers directly wired to adjacent core(s) 402. The cores 402 may also communicate via a CAP (CSR (Control Status Register) Access Proxy) 410 unit that routes data between cores 402.
  • [0088]
    FIG. 22 depicts a sample core 402 in greater detail. The core 402 architecture shown in FIG. 22 may also be used in implementing the core 102 shown in FIG. 1. As shown the core 402 includes an instruction store 512 to store program instructions. The core 402 may include an ALU (Arithmetic Logic Unit), Content Addressable Memory (CAM), shifter, and/or other hardware to perform other operations. The core 402 includes a variety of memory resources such as local memory 502 and general purpose registers 504. The core 402 shown also includes read and write transfer registers 508, 510 that store information being sent to/received from targets external to the core. The core 402 also includes next neighbor registers 506, 516 that store information being directly sent to/received from other cores 402. The data stored in the different memory resources may be used as operands in the instructions. As shown, the core 402 also includes a commands queue 524 that buffers commands (e.g., memory access commands) being sent to targets external to the core.
  • [0089]
    To interact with the cryptography component 100, threads executing on the core 402 may send commands via the commands queue 524. These commands may identify transfer registers within the core 402 as the destination for command results (e.g., a completion message and/or the location of encrypted data in memory). In addition, the core 402 may feature an instruction set to reduce idle core cycles while waiting, for example for completion of a request by the cryptography component 100. For example, the core 402 may provide a ctx_arb (context arbitration) instruction that enables a thread to swap out of execution until receiving a signal associated with component 100 completion of an operation.
  • [0090]
    FIG. 23 depicts a network device that can process packets using a cryptography component. As shown, the device features a collection of blades 608-620 holding integrated circuitry interconnected by a switch fabric 610 (e.g., a crossbar or shared memory switch fabric). As shown the device features a variety of blades performing different operations such as I/O blades 608 a-608 n, data plane switch blades 618 a-618 b, trunk blades 612 a-612 b, control plane blades 614 a-614 n, and service blades. The switch fabric, for example, may conform to CSIX or other fabric technologies such as HyperTransport, InfiniBand, PCI, Packet-Over-SONET, RapidIO, and/or UTOPIA (Universal Test and Operations PHY Interface for ATM).
  • [0091]
    Individual blades (e.g., 608 a) may include one or more physical layer (PHY) devices (not shown) (e.g., optic, wire, and wireless PHYs) that handle communication over network connections. The PHYs translate between the physical signals carried by different network mediums and the bits (e.g., “0”-s and “1”-s) used by digital systems. The line cards 608-620 may also include framer devices (e.g., Ethernet, Synchronous Optical Network (SONET), High-Level Data Link Control (HDLC) framers or other “layer 2” devices) 602 that can perform operations on frames such as error detection and/or correction. The blades 608 a shown may also include one or more network processors 604, 606 that perform packet processing operations for packets received via the PHY(s) 602 and direct the packets, via the switch fabric 610, to a blade providing an egress interface to forward the packet. Potentially, the network processor(s) 606 may perform “layer 2” duties instead of the framer devices 602. The network processors 604, 606 may feature techniques described above.
  • [0092]
    While FIGS. 21-23 described specific examples of a network processor and a device incorporating network processors, the techniques may be implemented in a variety of architectures including general purpose processors, network processors and network devices having designs other than those shown. Additionally, the techniques may be used in a wide variety of network devices (e.g., a router, switch, bridge, hub, traffic generator, and so forth). Further, many of the techniques described above may be found in components other than components to perform cryptographic operations.
  • [0093]
    The term circuitry as used herein includes hardwired circuitry, digital circuitry, analog circuitry, programmable circuitry, and so forth. The programmable circuitry may operate on computer programs disposed on a computer readable medium.
  • [0094]
    Other embodiments are within the scope of the following claims.

Claims (27)

1. A system integrated on a single die, comprising:
a first processor core to receive commands from at least one other processor core, the commands requesting performance of at least one specified transformative operation on specified data;
a set of multiple processing units comprising logic to perform transformative operations on data;
a shared memory coupled to the set of multiple processing units; and
logic to receive commands from the first processor core, the commands to transfer data between a one of the set of multiple processing units and the shared memory.
2. The system of claim 1, wherein at least two processing units of the multiple processing units perform the same operations and wherein at least two processing units of the multiple processing units perform different operations.
3. The system of claim 1, wherein at least one of the processing units of the set of multiple processing units comprises a processing unit to perform encryption.
4. The system of claim 1, wherein at least one of the processing units of the set of multiple processing units comprises at least one programmable processing unit.
5. The system of claim 4, wherein the at least one programmable processing unit receives instructions to execute from the shared memory.
6. The system of claim 1, wherein each processing unit in the set of multiple processing units comprises a processing unit having an input buffer to store data transferred from the shared memory, an output buffer to store data to be transferred to the shared memory, and a logic block to operate on data received by the input buffer.
7. The system of claim 1, wherein the transformative operation comprises at least one of: data encryption, data decryption, and data hashing.
8. The system of claim 1, wherein the commands comprise commands specifying a particular processing unit of the set of multiple processing units and an operation selected from one of the following group: transfer of data from the shared memory to the particular processing unit and transfer of data from the particular processing unit to the shared memory.
9. The system of claim 1, further comprising logic to transfer data from the shared memory to a randomly accessible memory external to the silicon die and to transfer data from the randomly accessible memory external to the silicon die to the shared memory.
10. The system of claim 1, wherein the first processor core comprises a processor core having storage for multiple program counters to provide multiple threads of execution.
11. The system of claim 1, wherein the logic enqueues the received commands based on a target one of the processing units specified by the command.
12. The system of claim 11, wherein the logic enqueues the received commands based on whether a command specifies a transfer to the target one of the processing units or a command specifies a transfer from the target one of the processing units.
13. The system of claim 1, further comprising a first bus coupling the first processor core with the at least one other processor core, and a second bus coupling the logic and the processing units.
14. The system of claim 1, wherein the at least one other processor core comprises multiple processor cores.
15. A method comprising:
receiving a command at a first processor core, the first processor core providing multiple threads of program execution, the command specifying at least one cryptographic operation to perform on specified data;
causing a thread of the multiple threads to:
cause the data to be transferred to a shared memory;
cause multiple ones of a set of multiple processing units coupled to the shared memory to perform operations based on the data in a sequence specified by the thread, each of the multiple ones of the set of multiple processing units operating on data from the shared memory and returning processed data to the shared memory.
16. The method of claim 15, wherein the at least one cryptographic operation comprises at least one selected from a group comprising: encryption, decryption, authentication, and generation of a cryptographic key.
17. The method of claim 15, wherein causing multiple ones of the set of multiple processing units to perform operations based on the data comprises transferring executable instructions of a program to implement at least one of the operations to at least one of the multiple processing units in response to the command, the executable instructions including at least one conditional branch of execution.
18. A computer program disposed on a computer readable medium, the program comprising instructions for causing a processor to:
receive a command specifying at least one cryptographic operation to perform on specified data;
cause a thread of the multiple threads to service the command, the thread comprising a thread to:
cause the data to be transferred to a shared memory;
cause multiple ones of a set of multiple processing units coupled to the shared memory to perform operations based on the data in a sequence specified by the thread, each of the multiple ones of the set of multiple processing units operating on data from the shared memory and returning processed data to the shared memory.
19. The program of claim 18, wherein the at least one cryptographic operation comprises at least one selected from a group comprising: encryption, decryption, authentication, and generation of a cryptographic key.
20. The program of claim 18, wherein causing multiple ones of the set of multiple processing units to perform operations based on the data comprises transferring executable instructions of a program to implement at least one of the operations to at least one of the multiple processing units in response to the command, the executable instructions including at least one conditional branch of execution.
21. A system, comprising:
an Ethernet MAC (media access controller); and
multiple processor cores integrated on a single die, a first of the processor cores to receive commands from the other processor cores to perform at least one specified transformative operation on specified data;
a set of multiple processing units comprising logic to perform transformative operations on data;
a shared memory coupled to the set of multiple processing units; and
logic to receive commands from the first processor core, the commands to transfer data between a one of the set of multiple processing units and the shared memory.
22. The system of claim 21, wherein at least one of the processing units of the set of multiple processing units comprises a processing unit having dedicated hardware to perform encryption.
23. The system of claim 21,
wherein at least one of the processing units of the set of multiple processing units comprises at least one programmable processing unit; and
wherein the at least one programmable processing unit receives instructions to execute from the shared memory in response to a command received by the logic to transfer data from the shared memory to the programmable processing unit.
24. The system of claim 21, wherein each processing unit in the set of multiple processing units comprises a processing unit having an input buffer to store data transferred from the shared memory, an output buffer to store data to be transferred to the shared memory, and a logic block to operate on data received by the input buffer.
25. The system of claim 21, wherein the transformative operation comprises at least one of: data encryption, data decryption, and data hashing.
26. The system of claim 21, wherein the commands comprise commands specifying a particular processing unit of the set of multiple processing units and an operation selected from one of the following group: transfer data from the shared memory to the particular processing unit and transfer data from the particular processing unit to the shared memory.
27. The system of claim 21, further comprising a first bus coupling the first processor core with the at least one other processor core, and a second bus coupling the logic and the processing units.
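For illustration only, the data flow recited in claims 15 and 21 through 26 can be modeled as a minimal Python sketch: a command is serviced by staging data in a shared memory, moving it to a named processing unit, letting that unit's logic block operate on its input buffer, and returning the result to the shared memory, in a sequence chosen by the servicing thread. All names here (`ProcessingUnit`, `SharedMemoryLogic`, `service_command`, `xor_transform`, `demo`) are hypothetical and do not appear in the application. Software objects stand in for what the claims describe as hardware, and the XOR transform is a toy placeholder, not a real cipher.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class ProcessingUnit:
    """Hypothetical unit: input buffer, output buffer, and a logic block (cf. claim 24)."""
    logic: Callable[[bytes], bytes]
    input_buffer: bytes = b""
    output_buffer: bytes = b""

    def run(self) -> None:
        # The logic block operates on whatever the input buffer currently holds.
        self.output_buffer = self.logic(self.input_buffer)


@dataclass
class SharedMemoryLogic:
    """Hypothetical stand-in for the logic that moves data between memory and units."""
    memory: Dict[str, bytes] = field(default_factory=dict)
    units: Dict[str, ProcessingUnit] = field(default_factory=dict)

    def to_unit(self, unit: str, key: str) -> None:
        # Command: transfer data from the shared memory to the named unit (cf. claim 26).
        self.units[unit].input_buffer = self.memory[key]

    def from_unit(self, unit: str, key: str) -> None:
        # Command: transfer data from the named unit to the shared memory (cf. claim 26).
        self.memory[key] = self.units[unit].output_buffer


def xor_transform(data: bytes) -> bytes:
    # Toy reversible transform used as a placeholder for an encryption unit.
    return bytes(b ^ 0x5A for b in data)


def sha256_digest(data: bytes) -> bytes:
    # Placeholder for a hashing unit, using the standard-library SHA-256.
    return hashlib.sha256(data).digest()


# One step of the sequence: (unit name, source key, destination key).
Step = Tuple[str, str, str]


def service_command(logic: SharedMemoryLogic, data: bytes, sequence: List[Step]) -> None:
    """Stage the data in shared memory, then run the units in the given order."""
    logic.memory["input"] = data
    for unit, src, dst in sequence:
        logic.to_unit(unit, src)      # shared memory -> unit input buffer
        logic.units[unit].run()       # unit transforms its input
        logic.from_unit(unit, dst)    # unit output buffer -> shared memory


def demo() -> Dict[str, bytes]:
    logic = SharedMemoryLogic(units={
        "cipher": ProcessingUnit(xor_transform),
        "hash": ProcessingUnit(sha256_digest),
    })
    service_command(logic, b"hello", [
        ("cipher", "input", "ciphertext"),
        ("hash", "ciphertext", "digest"),
    ])
    return logic.memory
```

In this sketch the sequence list plays the role of the order "specified by the thread": the second unit reads the first unit's result from the shared memory rather than receiving it directly, which mirrors the claim language that each unit operates on data from the shared memory and returns processed data to it.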
US11323329 2005-12-30 2005-12-30 Cryptographic system component Abandoned US20070157030A1 (en)

Priority Applications (1)

Application Number Publication Number Priority Date Filing Date Title
US11323329 US20070157030A1 (en) 2005-12-30 2005-12-30 Cryptographic system component

Applications Claiming Priority (4)

Application Number Publication Number Priority Date Filing Date Title
US11323329 US20070157030A1 (en) 2005-12-30 2005-12-30 Cryptographic system component
US11354670 US7475229B2 (en) 2005-12-30 2006-02-14 Executing instruction for processing by ALU accessing different scope of variables using scope index automatically changed upon procedure call and exit
US11354404 US7900022B2 (en) 2005-12-30 2006-02-14 Programmable processing unit with an input buffer and output buffer configured to exclusively exchange data with either a shared memory logic or a multiplier based upon a mode instruction
US11647892 US20070192626A1 (en) 2005-12-30 2006-12-28 Exponent windowing

Publications (1)

Publication Number Publication Date
US20070157030A1 (en) 2007-07-05

Family

ID=38226059

Family Applications (3)

Application Number Status Publication Number Priority Date Filing Date Title
US11323329 Abandoned US20070157030A1 (en) 2005-12-30 2005-12-30 Cryptographic system component
US11354670 Expired - Fee Related US7475229B2 (en) 2005-12-30 2006-02-14 Executing instruction for processing by ALU accessing different scope of variables using scope index automatically changed upon procedure call and exit
US11647892 Abandoned US20070192626A1 (en) 2005-12-30 2006-12-28 Exponent windowing

Country Status (1)

Country Link
US (3) US20070157030A1 (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7725512B1 (en) * 2006-04-26 2010-05-25 Altera Corporation Apparatus and method for performing multiple exclusive or operations using multiplication circuitry
US8261049B1 (en) 2007-04-10 2012-09-04 Marvell International Ltd. Determinative branch prediction indexing
US8046775B2 (en) * 2006-08-14 2011-10-25 Marvell World Trade Ltd. Event-based bandwidth allocation mode switching method and apparatus
US7941643B2 (en) * 2006-08-14 2011-05-10 Marvell World Trade Ltd. Multi-thread processor with multiple program counters
US8667254B1 (en) * 2008-05-15 2014-03-04 Xilinx, Inc. Method and apparatus for processing data in an embedded system
US20120096281A1 (en) * 2008-12-31 2012-04-19 Eszenyi Mathew S Selective storage encryption
US20100169570A1 (en) * 2008-12-31 2010-07-01 Michael Mesnier Providing differentiated I/O services within a hardware storage controller
US9329936B2 (en) 2012-12-31 2016-05-03 Intel Corporation Redundant execution for reliability in a super FMA ALU

Citations (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6092158A (en) * 1997-06-13 2000-07-18 Intel Corporation Method and apparatus for arbitrating between command streams
US6209087B1 (en) * 1998-06-15 2001-03-27 Cisco Technology, Inc. Data processor with multiple compare extension instruction
US6209098B1 (en) * 1996-10-25 2001-03-27 Intel Corporation Circuit and method for ensuring interconnect security with a multi-chip integrated circuit package
US6298411B1 (en) * 1999-01-05 2001-10-02 Compaq Computer Corporation Method and apparatus to share instruction images in a virtual cache
US20020027988A1 (en) * 1998-08-26 2002-03-07 Roy Callum Cryptographic accelerator
US20020091826A1 (en) * 2000-10-13 2002-07-11 Guillaume Comeau Method and apparatus for interprocessor communication and peripheral sharing
US20020108048A1 (en) * 2000-12-13 2002-08-08 Broadcom Corporation Methods and apparatus for implementing a cryptography engine
US20030084309A1 (en) * 2001-10-22 2003-05-01 Sun Microsystems, Inc. Stream processor with cryptographic co-processor
US20030123120A1 (en) * 2001-12-31 2003-07-03 Hewlett Gregory J. Pulse width modulation sequence generation
US20030174699A1 (en) * 2002-03-12 2003-09-18 Van Asten Kizito Gysbertus Antonius High-speed packet memory
US20040039928A1 (en) * 2000-12-13 2004-02-26 Astrid Elbe Cryptographic processor
US20040083354A1 (en) * 2002-10-24 2004-04-29 Kunze Aaron R. Processor programming
US20040225885A1 (en) * 2003-05-05 2004-11-11 Sun Microsystems, Inc Methods and systems for efficiently integrating a cryptographic co-processor
US20040230813A1 (en) * 2003-05-12 2004-11-18 International Business Machines Corporation Cryptographic coprocessor on a general purpose microprocessor
US6850999B1 (en) * 2002-11-27 2005-02-01 Cisco Technology, Inc. Coherency coverage of data across multiple packets varying in sizes
US6854044B1 (en) * 2002-12-10 2005-02-08 Altera Corporation Byte alignment circuitry
US20050141715A1 (en) * 2003-12-29 2005-06-30 Sydir Jaroslaw J. Method and apparatus for scheduling the processing of commands for execution by cryptographic algorithm cores in a programmable network processor
US20050278502A1 (en) * 2003-03-28 2005-12-15 Hundley Douglas E Method and apparatus for chaining multiple independent hardware acceleration operations
US20070130445A1 (en) * 2005-12-05 2007-06-07 Intel Corporation Heterogeneous multi-core processor having dedicated connections between processor cores

Family Cites Families (53)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5075840A (en) 1989-01-13 1991-12-24 International Business Machines Corporation Tightly coupled multiprocessor instruction synchronization
US5291611A (en) 1991-04-23 1994-03-01 The United States Of America As Represented By The Secretary Of The Navy Modular signal processing unit
US5983004A (en) * 1991-09-20 1999-11-09 Shaw; Venson M. Computer, memory, telephone, communications, and transportation system and methods
JPH08185320A (en) 1994-12-28 1996-07-16 Mitsubishi Electric Corp Semiconductor integrated circuit
US6282290B1 (en) * 1997-03-28 2001-08-28 Mykotronx, Inc. High speed modular exponentiator
US6356636B1 (en) 1998-07-22 2002-03-12 Motorola, Inc. Circuit and method for fast modular multiplication
US6442715B1 (en) * 1998-11-05 2002-08-27 STMicroelectronics N.V. Look-ahead reallocation disk drive defect management
US6442751B1 (en) * 1998-12-14 2002-08-27 International Business Machines Corporation Determination of local variable type and precision in the presence of subroutines
US6397241B1 (en) 1998-12-18 2002-05-28 Motorola, Inc. Multiplier cell and method of computing
US6567832B1 (en) * 1999-03-15 2003-05-20 Matsushita Electric Industrial Co., Ltd. Device, method, and storage medium for exponentiation and elliptic curve exponentiation
US6427196B1 (en) 1999-08-31 2002-07-30 Intel Corporation SRAM controller for parallel processor architecture including address and command queue and arbiter
US6668317B1 (en) 1999-08-31 2003-12-23 Intel Corporation Microengine for parallel processor architecture
US6606704B1 (en) 1999-08-31 2003-08-12 Intel Corporation Parallel multithreaded processor with plural microengines executing multiple threads each microengine having loadable microcode
US6983350B1 (en) 1999-08-31 2006-01-03 Intel Corporation SDRAM controller for parallel processor architecture
US7681018B2 (en) 2000-08-31 2010-03-16 Intel Corporation Method and apparatus for providing large register address space while maximizing cycletime performance for a multi-threaded register file set
US6532509B1 (en) 1999-12-22 2003-03-11 Intel Corporation Arbitrating command requests in a parallel multi-threaded processing system
US6307789B1 (en) 1999-12-28 2001-10-23 Intel Corporation Scratchpad memory
US6463072B1 (en) 1999-12-28 2002-10-08 Intel Corporation Method and apparatus for sharing access to a bus
US6625654B1 (en) 1999-12-28 2003-09-23 Intel Corporation Thread signaling in multi-threaded network processor
US6584522B1 (en) 1999-12-30 2003-06-24 Intel Corporation Communication between processors
US6631462B1 (en) 2000-01-05 2003-10-07 Intel Corporation Memory shared between processing threads
US7240204B1 (en) 2000-03-31 2007-07-03 State Of Oregon Acting By And Through The State Board Of Higher Education On Behalf Of Oregon State University Scalable and unified multiplication methods and apparatus
US6745220B1 (en) * 2000-11-21 2004-06-01 Matsushita Electric Industrial Co., Ltd. Efficient exponentiation method and apparatus
JP3950638B2 (en) * 2001-03-05 2007-08-01 株式会社日立製作所 Tamper modular arithmetic processing method
US6868476B2 (en) 2001-08-27 2005-03-15 Intel Corporation Software controlled content addressable memory in a general purpose execution datapath
US7487505B2 (en) 2001-08-27 2009-02-03 Intel Corporation Multithreaded microprocessor with register allocation based on number of active threads
US7225281B2 (en) 2001-08-27 2007-05-29 Intel Corporation Multiprocessor infrastructure for providing flexible bandwidth allocation via multiple instantiations of separate data buses, control buses and support mechanisms
US6748412B2 (en) * 2001-09-26 2004-06-08 Intel Corporation Square-and-multiply exponent processor
US6738831B2 (en) 2001-12-12 2004-05-18 Intel Corporation Command ordering
US7437724B2 (en) 2002-04-03 2008-10-14 Intel Corporation Registers for data transfers
US20040133788A1 (en) * 2003-01-07 2004-07-08 Perkins Gregory M. Multi-precision exponentiation method and apparatus
US6941438B2 (en) 2003-01-10 2005-09-06 Intel Corporation Memory interleaving
US20050010761A1 (en) 2003-07-11 2005-01-13 Alwyn Dos Remedios High performance security policy database cache for network processing
US7373514B2 (en) 2003-07-23 2008-05-13 Intel Corporation High-performance hashing system
US7747020B2 (en) 2003-12-04 2010-06-29 Intel Corporation Technique for implementing a security algorithm
US20050138366A1 (en) 2003-12-19 2005-06-23 Pan-Loong Loh IPSec acceleration using multiple micro engines
US7543142B2 (en) 2003-12-19 2009-06-02 Intel Corporation Method and apparatus for performing an authentication after cipher operation in a network processor
US20050135604A1 (en) 2003-12-22 2005-06-23 Feghali Wajdi K. Technique for generating output states in a security algorithm
US20050149744A1 (en) 2003-12-29 2005-07-07 Intel Corporation Network processor having cryptographic processing including an authentication buffer
US7529924B2 (en) 2003-12-30 2009-05-05 Intel Corporation Method and apparatus for aligning ciphered data
US7171604B2 (en) 2003-12-30 2007-01-30 Intel Corporation Method and apparatus for calculating cyclic redundancy check (CRC) on data using a programmable CRC engine
US7653196B2 (en) 2004-04-27 2010-01-26 Intel Corporation Apparatus and method for performing RC4 ciphering
US7433469B2 (en) 2004-04-27 2008-10-07 Intel Corporation Apparatus and method for implementing the KASUMI ciphering process
US7627764B2 (en) 2004-06-25 2009-12-01 Intel Corporation Apparatus and method for performing MD5 digesting
US7539718B2 (en) 2004-09-16 2009-05-26 Intel Corporation Method and apparatus for performing Montgomery multiplications
US20060059219A1 (en) 2004-09-16 2006-03-16 Koshy Kamal J Method and apparatus for performing modular exponentiations
US7418543B2 (en) 2004-12-21 2008-08-26 Intel Corporation Processor having content addressable memory with command ordering
US7555630B2 (en) 2004-12-21 2009-06-30 Intel Corporation Method and apparatus to provide efficient communication between multi-threaded processing elements in a processor unit
JP4450737B2 (en) 2005-01-11 2010-04-14 富士通株式会社 The semiconductor integrated circuit
CN101228462A (en) * 2005-07-19 2008-07-23 日东电工株式会社 Polarizing plate and image display device
US7900022B2 (en) 2005-12-30 2011-03-01 Intel Corporation Programmable processing unit with an input buffer and output buffer configured to exclusively exchange data with either a shared memory logic or a multiplier based upon a mode instruction
US7725624B2 (en) 2005-12-30 2010-05-25 Intel Corporation System and method for cryptography processing units and multiplier
US20070157030A1 (en) * 2005-12-30 2007-07-05 Feghali Wajdi K Cryptographic system component

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070192626A1 (en) * 2005-12-30 2007-08-16 Feghali Wajdi K Exponent windowing
US20070297601A1 (en) * 2006-06-27 2007-12-27 Hasenplaugh William C Modular reduction using folding
US8229109B2 (en) 2006-06-27 2012-07-24 Intel Corporation Modular reduction using folding
US20080189560A1 (en) * 2007-02-05 2008-08-07 Freescale Semiconductor, Inc. Secure data access methods and apparatus
US8464069B2 (en) * 2007-02-05 2013-06-11 Freescale Semiconductor, Inc. Secure data access methods and apparatus
US8689078B2 (en) 2007-07-13 2014-04-01 Intel Corporation Determining a message residue
US20090158132A1 (en) * 2007-12-18 2009-06-18 Vinodh Gopal Determining a message residue
US20090157784A1 (en) * 2007-12-18 2009-06-18 Vinodh Gopal Determining a message residue
US8042025B2 (en) 2007-12-18 2011-10-18 Intel Corporation Determining a message residue
US7886214B2 (en) 2007-12-18 2011-02-08 Intel Corporation Determining a message residue
US8189792B2 (en) 2007-12-28 2012-05-29 Intel Corporation Method and apparatus for performing cryptographic operations
US20090168999A1 (en) * 2007-12-28 2009-07-02 Brent Boswell Method and apparatus for performing cryptographic operations
US20100220854A1 (en) * 2009-02-27 2010-09-02 Atmel Corporation Data security system
US9191211B2 (en) * 2009-02-27 2015-11-17 Atmel Corporation Data security system
US20110153994A1 (en) * 2009-12-22 2011-06-23 Vinodh Gopal Multiplication Instruction for Which Execution Completes Without Writing a Carry Flag
US20110158403A1 (en) * 2009-12-26 2011-06-30 Mathew Sanu K On-the-fly key generation for encryption and decryption
US9544133B2 (en) 2009-12-26 2017-01-10 Intel Corporation On-the-fly key generation for encryption and decryption
US20140149724A1 (en) * 2011-04-01 2014-05-29 Robert C. Valentine Vector friendly instruction format and execution thereof
US9513917B2 (en) * 2011-04-01 2016-12-06 Intel Corporation Vector friendly instruction format and execution thereof

Also Published As

Publication number Publication date Type
US20070192626A1 (en) 2007-08-16 application
US20070174372A1 (en) 2007-07-26 application
US7475229B2 (en) 2009-01-06 grant

Legal Events

Date Code Title Description
AS Assignment

Owner name: INTEL CORPORATION, CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:FEGHALI, WAJDI K.;HASENPLAUGH, WILLIAM D.;WOLRICH, GILBERT M.;AND OTHERS;REEL/FRAME:023233/0835

Effective date: 20060124