GB2407179A - Unified SIMD processor - Google Patents

Unified SIMD processor

Info

Publication number
GB2407179A
GB2407179A
Authority
GB
United Kingdom
Prior art keywords
data
scalar
simd
data processor
execution unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB0409815A
Other versions
GB0409815D0 (en)
Inventor
David Stuttard
David Williams
James Packer
Colin Davidson
Neil Hickey
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ClearSpeed Technology PLC
Original Assignee
Clearspeed Solutions Ltd
ClearSpeed Technology PLC
ClearSpeed Technology Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Clearspeed Solutions Ltd, ClearSpeed Technology PLC, ClearSpeed Technology Ltd filed Critical Clearspeed Solutions Ltd
Publication of GB0409815D0 publication Critical patent/GB0409815D0/en
Priority to PCT/GB2004/004377 priority Critical patent/WO2005037326A2/en
Publication of GB2407179A publication Critical patent/GB2407179A/en
Withdrawn legal-status Critical Current

Classifications

    • G06F9/3842 Speculative instruction execution
    • G06F9/3851 Instruction issuing from multiple instruction streams, e.g. multistreaming
    • G06F9/3885 Concurrent instruction execution using a plurality of independent parallel functional units
    • G06F9/3887 Concurrent instruction execution using a plurality of independent parallel functional units controlled by a single instruction for multiple data lanes [SIMD]
    • G06F9/3889 Concurrent instruction execution using a plurality of independent parallel functional units controlled by multiple instructions, e.g. MIMD, decoupled access or execute
    • G06F15/8007 Architectures of general purpose stored program computers comprising an array of processing units with common control: single instruction multiple data [SIMD] multiprocessors
    • G06F15/8015 One dimensional arrays, e.g. rings, linear arrays, buses

Abstract

A data processor comprises a single control unit and an integral execution unit adapted to operate on both parallel and scalar data under the same instruction set. The execution unit comprises a SIMD part comprising an array of processing elements adapted to operate on the parallel data and a scalar part adapted to operate on the scalar data. Substantially all the instructions are operable on the parallel and/or the scalar data. Preferably, both the scalar part and the SIMD part have a complete range of addressing modes that allow them to support: scalar variables pointing to data in scalar memory; scalar variables pointing to data in memory in the SIMD part of the execution unit; SIMD variables pointing to memory in the SIMD part of the execution unit; and SIMD variables pointing to data in scalar memory. The scalar part, the SIMD part, and I/O means are adapted to be operated in parallel by the same instruction stream.

Description

Unified SIMD Processor
Background to the Invention
The invention consists of a significant improvement to a well known type of parallel processor. Single-Instruction Multiple-Data (SIMD) architectures are widely used in a number of computing applications. They have the advantage of high performance, low power consumption and easy scalability. A SIMD processor differs from the more commonly used multiprocessor system (where a number of processors, each independently executing their own program code, are used together to solve a problem) in that it executes a single stream of instructions but each instruction operates on multiple data items in parallel.
There are a range of SIMD processor architectures. These range from an intelligent memory type of approach, where each column of a memory array has a simple 1-bit Arithmetic Logic Unit (ALU), through to more complex architectures where a usually smaller array of more complex processing elements is fed with a single instruction stream. They have, however, been difficult to program in the past.
In all these cases the array of SIMD processing elements is fed with instructions and data by a dedicated controller. Often the controller is the largest and most complex part of the design because of the need to match the data processing rate achieved by the SIMD array. The array controller is not itself 'intelligent' and is, in turn, controlled by a general purpose processor which also runs the bulk of the application software. This host processor will execute most of the application code; it will send data to be processed to the SIMD array with an indication of the function to be performed, typically a pointer to the start of a sequence of SIMD array instructions or microinstructions. An exemplary prior art SIMD processor is outlined in Figure 1.
This introduces two programming problems. Firstly, the array controller can only be programmed at a very low level: a series of microcode, or possibly assembler, instructions to be executed by the array. This is because the array controller is not designed to work with a compiler. Secondly, this programming is completely separated from the main application code running on the host processor. This may be 'wrapped up' by providing C functions or C++ classes which make them appear more integrated and reduce the complexity of passing control between the host processor and the SIMD array. Also, in the simplest form, the host processor and the SIMD array will have completely separate memory spaces for code and data. Any of the standard shared memory techniques can be used to address this, with all the resulting problems: complex arbitration, cache coherency issues, contention for memory bandwidth, inter-processor synchronization, greater cost, etc. A preferable solution is to provide a fully integrated architecture where the same processor can transparently process both scalar and SIMD operations in a fully integrated fashion. This means that a single program can be written, in a standard high-level language such as C, which includes all parts of the application - both 'normal' (scalar) data and parallel data processing.
Prior Art
The basic SIMD processing model has been in use for many years. Some of the earliest examples are the ILLIAC IV, developed in 1972, which had 64 processors with floating point and memory, and Goodyear's STARAN (1975), which was a 1-bit architecture. Many variants have been developed since then.
There is also a recent trend to put small-scale SIMD extensions in standard processor architectures. This is exemplified by the Intel MMX and SSE extensions to the x86 architecture. This allows, for example, the values in two 32-bit registers to be added as if each register contained four independent 8-bit values. This is not considered relevant to the current discussion for a number of reasons: the SIMD processing is done on a very small scale; only a limited range of SIMD functionality is provided; there are distinct instructions for operating on SIMD data; and the same hardware is used for SIMD and for scalar operations (although separate registers may be provided for the SIMD operands). The idea of providing indexed addressing modes in a SIMD processor is disclosed in our copending patent application GB9908225.7. This invention extends that idea by using a more flexible implementation of it in combination with a more powerful array controller to provide a unified processor architecture.
Summary of the Invention
Applicant's approach, after many years of experience with developing systems based on SIMD coprocessors, was to create a new architecture where instructions are decoded and despatched to the appropriate part of the execution unit: either the scalar (mono) execution unit or the SIMD (poly) execution unit. Instruction execution on these is decoupled so that, for example, a mono instruction followed by a poly instruction will execute concurrently. The support for mono and poly data is, as far as possible, identical - this makes it practical to write a compiler to target the processor.
This greatly simplifies the task of programming the processor.
This means a single program runs on the processor which does both mono and poly operations as required - typically it will do both at the same time. A single C program can be written which manages program control flow, operates on mono data, and operates on poly data.
To support this, and further simplify programming, this unity of the architecture is reflected in the instruction set and programming model. The processing elements in the SIMD array and the mono execution unit have a common instruction set. For instance, an add instruction can be used with operands which are either poly registers or mono registers (or immediate values); the appropriate execution unit will execute the instruction. Mono and poly operands can also be mixed to allow, for example, a single mono value to be added to a poly register which holds a different value on every processing element.
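By way of illustration only, the following C fragment sketches such a mixed operation using the 'poly' keyword introduced later in this description; the identifiers are invented for this sketch and do not appear in the original text.

poly int add_bias(poly int sample, int bias)
{
    /* 'bias' is mono (a single value); 'sample' is poly (a different
       value on every PE). The single add below is despatched to the
       poly execution unit, with the mono operand supplied to all PEs. */
    return sample + bias;
}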
It is appropriate to note that this does not require the mono execution unit to have exactly the same micro-architecture as the poly execution unit. For example, the poly processing elements could have an 8-bit ALU, while the mono execution unit has a 32-bit ALU.
A block schematic diagram of a processor in accordance with the invention is shown in Figure 2.
The most significant features contributing to the novelty and inventiveness of this invention include the following concepts: The processor provides a unified programming model: a single program runs on a processor providing both scalar and SIMD operations.
This is supported with a unified instruction set: e.g. an add instruction can work with either mono or poly operands (or a mixture of the two). Poly operations can freely use mono registers in expressions and, more importantly, for addressing memory.
Having a full range of addressing modes and completely separate memory blocks in each PE makes it simple to support poly data structures and poly pointers on the SIMD array (previous SIMD architectures either provide a single address to all PEs or use sequences of instructions to construct an address in memory).
Mono and poly operations are of equal standing: wherever possible, all instructions can be used with either mono or poly operands (there are some exceptions because a small set of operations, mainly for control, only make sense for one or the other).
The mono and poly execution units are loosely coupled so that their operation can be overlapped. Compilers are extremely good at scheduling instructions to take advantage of this flexibility - maximizing the efficiency of the system.
There are some compiler optimizations specific to this architecture: for example, it may be better, in some instances, to move an operation that would normally be done on the mono execution unit on to the poly execution unit - either because it can then be overlapped with other operations on the mono execution unit or because the result is needed on the poly execution unit anyway. Again, the compiler can detect such cases and make the appropriate optimizations.
A common code and data space is available to both mono and poly operations making it easy to generate code for, and share data between, the two execution units.
The overall architecture is suitable for targeting by a high level language compiler - again for both mono and poly operations.
To this end, the invention provides, in a first aspect, a data processor comprising a single control unit and an integral execution unit adapted to operate on both parallel and scalar data under the same instruction set.
According to a second aspect, the invention provides a data processor comprising a single control unit and an integral execution unit adapted to operate on both parallel and scalar data under the same instruction set, and wherein the execution unit comprises a SIMD part comprising an array of processing elements adapted to operate on said parallel data and a scalar part adapted to operate on said scalar data.
According to a third aspect, the invention provides a data processor comprising a single control unit and an integral execution unit adapted to operate on both parallel and scalar data under the same instruction set, and wherein substantially all said instructions are operable on said parallel and/or said scalar data.
According to a fourth aspect, the invention provides a data processor comprising a single control unit and an integral execution unit adapted to operate on both parallel and scalar data under the same instruction set, and wherein the execution unit comprises a SIMD part comprising an array of processing elements adapted to operate on said parallel data and a scalar part adapted to operate on said scalar data, and wherein substantially all said instructions are operable on said parallel and/or said scalar data.
The single control unit and the integral execution unit may be adapted to operate asynchronously on both the parallel and scalar data under the same instruction set.
The execution unit preferably comprises a SIMD part comprising an array of processing elements adapted to operate on the parallel data and a scalar part adapted to operate on the scalar data.
Preferably, the SIMD part and the scalar part differ only in that each processing element of the array contains local memory. The SIMD part and the scalar part preferably have a common memory area for instructions and scalar data. Each PE may contain an ALU and one or more multiply-accumulate function units or an ALU and one or more floating point function units. Preferably, each PE has multiple enable bits to support nested conditional code.
The SIMD and the scalar execution unit conveniently have a similar set of status/result flags which can be used to control branching on the scalar part of the execution unit and conditional execution on the SIMD part of the execution unit in a similar manner.
There is preferably a communication path between PEs in the SIMD part and means are preferably provided for data to be transferred to and from the scalar part of the execution unit and the SIMD part of the execution unit.
Advantageously the control unit is adapted to fetch and execute multiple instruction threads simultaneously. A semaphore unit may use semaphores to synchronize between instruction threads. The semaphores may be used to synchronize between I/O operations and instruction threads. The control unit may be adapted to schedule and execute threads based on their priority and the state of any semaphores they are waiting on and may include a set of the semaphores for use in synchronizing threads.
Each processing element of the SIMD array may comprise one or more function units, a register file, said local memory, and I/O means.
Both the scalar part and the SIMD part preferably have a complete range of addressing modes that allow them to support: scalar variables pointing to data in scalar memory; scalar variables pointing to data in memory in the SIMD part of the execution unit; SIMD variables pointing to memory in the SIMD part of the execution unit; and SIMD variables pointing to data in scalar memory.
Substantially all the instructions are preferably operable on either the parallel data or the scalar data or a combination of both the scalar data and the parallel data.
The processor may further comprise a compiler adapted to produce code from a single source program to operate on both the scalar and the parallel data. The compiler is preferably adapted to detect parallel data automatically.
The scalar part, the SIMD part, and the I/O means are preferably adapted to be operated in parallel by the same instruction stream.
A programmer or the compiler is adapted to use the multi-threaded execution to schedule I/O operations such that they occur concurrently with computation and data is always available when required.
The I/O means may comprise programmed I/O means adapted to enable each PE to transfer data to and from external scalar memory and wherein each PE provides the address it wants to read data from or write data to. The memory accesses may be consolidated to minimize the number of memory transfers when multiple PEs are accessing the same memory area.
Each PE is preferably adapted to transfer data to and from external scalar memory, in which case the address for each PE is generated automatically based on the PE number and the amount of data being transferred.
The I/O means may comprise streaming I/O means which distributes an incoming data stream to all the PEs in the SIMD part and collects data from all the PEs to generate an output stream independently of program execution on the SIMD part of the execution unit. The streaming I/O means may only use enabled PEs to take part in the streaming I/O operation. The size of data distributed to, and collected from, each PE may be different for each PE.
Brief Description of the Drawings
The invention will be described with reference to the following drawings, in which:
Figure 1 shows a prior art SIMD processor;
Figure 2 shows the architecture in accordance with the present invention;
Figure 3 is a comparison between RISC and MTAP architecture;
Figure 4 is a schematic diagram of an MTAP execution unit;
Figure 5 indicates schematically the execution of an instruction;
Figure 6 is a graph indicating the power efficiency of the MTAP architecture;
Figure 7 is a block schematic diagram depicting an exemplary MTAP processor; and
Figure 8 is a block diagram of a specific implementation of the MTAP architecture as applied to an evaluation chip.
Detailed Description of the Illustrated Embodiments
The design of this processor architecture was driven by the desire to make it practical and efficient to program in a high level language. In order to make this possible a number of features are required. First, the overall programming model needs to be simple and 'regular' - i.e. the same operations and addressing modes can be applied to all data types.
The execution unit for mono data is based on well known principles used in many RISC processors. This makes it simple for a compiler to target and easy for a user to understand.
The functionality supported by each PE is then made as similar as possible to the mono execution unit: each PE has an ALU, a register file and memory, and supports a range of addressing modes for transferring data between memory and registers.
It is not necessary for the PEs to be identical to the mono execution unit in every detail: e.g. the number of registers and the width and complexity of the ALU are likely to be different in practice. The poly execution unit gets its performance largely from the number of PEs brought to bear on the problem rather than the processing power of the individual PEs. To balance this, the mono execution unit is likely to require a wider and more complex ALU. This is reasonable because the mono execution unit is only instantiated once, while the PEs are replicated many times and need to be efficiently implemented.
The number of clock cycles required to execute an instruction on the mono and poly execution units is highly variable. Both will execute some instructions in a single cycle. But some instructions could take several cycles on the mono execution unit or on the poly execution unit. It is therefore essential that the two execution units are only loosely coupled rather than operating in lock-step; this allows them to overlap their execution of instructions. The hardware will synchronize their operations when necessary: e.g. when they need to access a shared resource, or one execution unit requires data from the other.
Transfers between the registers and memory, and other I/O transfers, can take place concurrently with ALU operations. To simplify the task for the compiler, these interactions are all interlocked using register scoreboarding (see our copending patent application GB 9908203.4).
An important aspect of the PE architecture to support compiled code is the existence of a full set of addressing modes: direct, indirect and indexed. Indirect and indexed addressing by the PEs is described in the previously mentioned copending patent application GB 9908225.7.
Reference may also be had to our copending patent application GB 0321186.9 for details of the "ClearConnect" bus referred to in the context of the present invention.
In a high level language compiler for this architecture, it is only necessary to introduce a simple modification: the keyword 'poly' used in a declaration identifies data that is distributed across the SIMD array. Similar methods have been used before, for example the 'shape' keyword used in C*, a language used for programming the "Connection Machine". However, it is only with the present architecture that poly variables become 'first class' objects - i.e. they can be used everywhere that 'normal' mono data can be: in expressions, as function arguments, in conditional statements, etc. Pointers to both mono and poly data are allowed, as are pointers of both mono and poly types. Furthermore, poly variables can be mixed quite freely with mono data in all these contexts. This provides a very simple and regular programming model for the user.
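As a hedged sketch of what 'first class' poly objects permit (the variable names are illustrative, and the pointer declaration assumes a C-like composition of the 'poly' qualifier that the text implies but does not spell out):

poly int v;                 /* one instance of v on every PE           */
int      n = 10;            /* ordinary mono variable                  */

v = get_penum();            /* poly value used in an expression        */
if (v < n)                  /* poly condition: enables/disables PEs    */
    v = v * 2 + n;          /* poly and mono data mixed freely         */

poly int *p = &v;           /* a poly pointer to poly data (per PE)    */
*p += 1;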
Note that it would also be possible for a compiler to detect the parallelism in the program or data automatically: this is frequently done with 'vectorizing' Fortran compilers used with supercomputers, for example. The same techniques could be applied to this architecture. The architecture will now be described in more detail.
The Multi-Threaded Array Processing (MTAP) architecture has been developed to address a number of problems in high performance, high data rate processing.
A number of applications are characterized by very high data rates, flexible processing requirements and hard real-time constraints. We use the term data flow processing to describe this class of problem.
The MTAP processor delivers on the three primary requirements for data flow applications:
1. It directly addresses the data bandwidth, and has a clear scalability path for future requirements.
2. It provides the raw horsepower for the processing functions required, on the maximum data rate that the system will encounter. That processing power can scale with increasing function demands.
3. It stores the data close to the processing core to maximize bandwidth and minimize latency.
MTAP Overview
The MTAP architecture defines a family of embedded processors with parallel data processing capability. Figure 3 compares a standard processor and the MTAP architecture. As can be seen, the MTAP processor has a standard, RISC-like, control unit with instruction fetch, caches and I/O mechanisms. This is coupled to a highly parallel execution unit which provides the performance and scalability of the MTAP architecture.
The processor is designed in a highly modular fashion which allows many details of a specific implementation to be easily customized for the target application.
To simplify the integration of the processor into a variety of systems, the processor can also be configured to be big- or little-endian.
Control unit
The control unit fetches, decodes and dispatches instructions to the execution units.
The processor executes a fairly standard, three operand instruction set. The manner in which the same instruction stream is decoded and issued in parallel to the mono part, the poly part and the I/O part is schematically indicated in Figure 5.
The control unit also provides hardware support for multi-threaded execution, allowing fast swapping between multiple threads. The threads are prioritized and are intended primarily to support efficient overlap of I/O and compute. This can be used to hide the latency of external data accesses.
The MTAP processor can also include instruction and data caches to minimize the latency of memory accesses. The size and type of these are configurable options.
Alternatively, the caches may be replaced with SRAM to provide a complete embedded solution.
The control unit also includes a control port which is used for initializing and debugging the processor; it includes support for breakpoints, single stepping and the examination of internal state.
Execution unit
The execution unit consists of a number of processing elements (PEs). This allows it to process data elements in parallel. Each PE consists of an ALU, registers, memory and I/O.
The number of PEs in a processor core is a configurable parameter which allows performance to be scaled to meet the needs of the application (see Figure 6). The execution unit can be thought of as two, largely independent, parts (see Figure 4).
One PE forms the mono execution unit; this is dedicated to processing mono (i.e. scalar or non-parallel) data. The mono execution unit also handles program flow control such as branching and thread switching. The rest of the PEs form the poly execution unit which processes parallel (poly) data.
The poly execution unit may consist of tens, hundreds or even thousands of PEs. This array of PEs operates in a synchronous manner, similar to SIMD, where every PE executes the same instruction on its piece of data. Each PE also has its own independent local memory; this provides fast access to the data being processed. For example, one PE at 400 MHz has a memory bandwidth of 1.6 Gbytes/s. An array with 256 such PEs has an aggregate bandwidth of over 400 Gbytes/s, with single cycle latency. The number of registers and the amount of memory in each PE are configurable options.
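(These figures are consistent with each PE making one 4-byte access per cycle: 400 MHz x 4 bytes = 1.6 Gbytes/s per PE, and 256 x 1.6 Gbytes/s is approximately 410 Gbytes/s across the array.)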
Input/output
The MTAP processor core has two basic I/O mechanisms. The first, Programmed I/O (PIO), is the normal mechanism used for accessing memory external to the MTAP core: it supports random accesses to variable sized data by the PEs. The second mechanism is Streaming I/O (SIO) which allows chunks of contiguous data to be streamed directly into the memory within PEs.
Each of these supports a variety of addressing modes, described in more detail later.
Programming model
From a programmer's perspective, the MTAP processor appears as a single processor running a single C program. This is very different from some other parallel processing models where the programmer has to explicitly program multiple independent processors, or can only access the processor via function calls or some other indirect mechanism.
The MTAP processor executes a single instruction stream; each instruction is sent to one of the functional units: this may be the mono or poly execution unit or one of the I/O controllers. The processor can despatch an instruction on every cycle.
For multi-cycle instructions, the operation of the functional units can be overlapped.
So, for example, an I/O operation can be started with one instruction and on the next cycle the mono execution unit could start a multiply instruction (which requires several cycles to execute). While these operations are proceeding, the poly execution unit can continue to execute instructions.
The main change from programming a standard processor is the concept of operating on parallel data. Data to be processed is assigned to variables which have an instance on every PE and are operated on in parallel: we call this poly data. This can be thought of as a data vector which is distributed across the array.
Variables only requiring a single instance (e.g. loop control variables) are known as mono variables - they behave exactly like normal variables on a sequential processor.
Applicant provides a compiler which uses a simple extension to standard C to identify data which is to be processed in parallel. The new keyword poly is used in a declaration to define data which exists, and is processed, on every PE in the poly execution unit.
Example
To give a feel for the way the MTAP processor is programmed, the following represents a fragment of code which will calculate 64 values of a sine function across the PE array in a single operation.
#include <fnext.h>
#include <math.h>

#define PI 3.1415926535897932384
#define NUMBER_OF_PES 64

poly float angle, sine;
poly int pe;

/* get PE number: 0 .. n-1 */
pe = get_penum();

/* convert to an angle in range 0 to Pi */
angle = pe * PI / NUMBER_OF_PES;

/* calculate sine of angle on each PE */
sine = sinp(angle);

This code uses the library function get_penum() to get a unique value in the variable pe on each PE. This is scaled to give a range of values for angle between 0 and pi across the PEs.
Finally, the library function sinp() is called to calculate the sine of these values on all PEs simultaneously. The sinp function is the poly equivalent of the standard sine function; it takes a poly argument and returns the appropriate value on every PE.
Performance
The performance of an MTAP core with 64 PEs, on a 0.13 micron process and running at 400 MHz, is shown in Table 1. The columns show the performance achieved with the basic PE, the PE with an additional Multiply-Accumulate (MAC) unit, and the PE with an FPU extension.
Table 1 - Performance Comparison

                                 Base    8-bit MAC     FPU
MIPS                           25,600       25,600  25,600
MMACS (x 10^6)                      -       25,600  25,600
MFLOPS                              -            -  51,200
Memory Bandwidth (Gbytes/s)       100          100     100
Inter-PE Bandwidth (Gbytes/s)      50           50      50
Relative Area                     1.0          1.1     1.3

However, the core can be scaled well beyond this number of PEs, to 256 or even thousands of PEs. It also scales down efficiently to lower numbers of PEs, or lower clock rates, enabling low cost and low power systems to be built.
Figure 6 shows how performance and power scale with the number of PEs. By choosing the number of PEs and the clock frequency, the designer has a great deal of flexibility to tune performance, bandwidth, cost and power for the needs of a given application.
Architecture Details
The following sections describe the MTAP processor in more detail. A block diagram of the processor is shown in Figure 7. This is a conceptual view, which reflects the programming model. In practice, the control and mono execution units form a tightly coupled pipeline which provides overall control of the processor.
Interfaces
The processor has a number of external interfaces. The widths of each of these are configurable. All these interfaces currently use the Virtual Component Interface Standard defined by the VSI Alliance.
There are three categories of interface:

Mono data & instructions
This is a single Advanced VCI (AVCI) interface which is used for mono loads and stores, and for instruction fetching.

Poly data
This uses one or more AVCI interfaces for PIO and SIO data. The number of physical interfaces corresponds to the number of PIO and SIO channels implemented.

Control
There are two control interfaces which use the Peripheral VCI (PVCI) protocol. One is used for initialization and debug. The other allows the MTAP processor to generate interrupts to the host system.
Instruction set
The processor has a fairly standard RISC-like instruction set. Most instructions can operate on mono or poly operands and are executed by the appropriate part of the execution unit. Some instructions are only relevant to either the mono or poly execution unit; for example, all program flow control is handled by the mono unit.
The instruction set provides a standard set of functions on both mono and poly execution units:
• Integer and floating point adds, subtracts, multiplies, divides
• Logical operations: and, or, not, xor
• Arithmetic and logical shifts
• Data comparisons: equal, not equal, greater than, less than, etc.
• Data movement between registers
• Loads and stores between registers and memory

To give a feel for the nature of assembly code for the MTAP processor, a few lines of code are shown below. This simple example loads a value from memory into a mono register, gets the PE number into a register on each PE and then adds these two values together, producing a different result on every PE.
ld    0:m4, 0x3000      // load mono reg 0 from memory
penum 8:p4              // PE number into poly reg 8
add   4:p4, 0:m4, 8:p4  // add; result in poly reg 4

(Register operands take the form byte-address:width, with m for mono and p for poly registers, following the byte-addressed register model described under 'Registers' below.)

Control unit
The control unit fetches instructions from memory, decodes them and despatches them to the appropriate functional unit.
The controller includes a scheduler to provide hardware support for multithreaded code. This is a vital part of the architecture for achieving the performance potential of the MTAP processor. Because of the highly parallel architecture of the poly execution unit, there can be significant latencies if all PEs need to read or write external data.
When part of an application stalls because it is waiting for data from external memory, the processor can switch to another code thread that is ready to run. This serves to hide the latency of accesses and keep the processor busy.
The number of threads is a configurable option. The threads are prioritized: the processor will run the highest priority thread that is ready to run; a higher priority thread can pre-empt a lower priority thread when it becomes ready to run. Threads are synchronized - with each other and with hardware, such as I/O engines - via semaphores.
In the simplest case of multi-threaded code, a program would have two threads: one for I/O and one for compute. By pre-fetching data in the I/O thread, the programmer (or the compiler) can ensure the data is available when it is required by the execution units and the processor can run without stalling.
Execution units
This section describes the common aspects of the poly and mono execution units.
ALU operations
Instructions for arithmetic and logical operations are provided in several versions:
• Various sizes: 1, 2, 3 and 4 bytes, and combinations of sizes
• Various data types: signed and unsigned integers, and floating-point
• Mixes of signed/unsigned operations

Status register
Associated with the ALU is a status register; this contains five status bits that provide information about the result of the last ALU operation. When set, these bits indicate:
• Most significant bit set
• Carry generated
• Overflow generated
• Negative result
• Zero result

Registers
To support operations on data of different widths, the registers in the PEs can be accessed very flexibly. The register files are best thought of as an array of bytes which can be addressed as registers of 1 to 8 bytes wide.
The mono and poly registers are addressed in a consistent way using byte addresses and widths specified in bytes. The mono register file is 16 bits wide and so all addresses and widths must be a multiple of 2. There are no alignment restrictions on poly register accesses.
Addressing modes
Load and store instructions are used to transfer data between memory and registers. In the case of the mono execution unit, these transfer data to and from memory external to the MTAP processor.
Poly loads and stores transfer data between the PE register file and PE memory. Data is transferred between the PEs and external memory using the I/O functions described later.
There are three addressing modes for loads and stores which can be used for both mono and poly data. These are:

Direct
The address to be read/written is specified as an immediate value.

Indirect
The address is specified in a register.

Indexed
The address is calculated by adding an offset to a base address in a register. The offset must be an immediate value.
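Only the direct form appears in the assembly fragment earlier in this description; the following sketch adds assumed syntax for the indirect and indexed forms, so the bracketed operand notation is an illustration rather than the documented encoding:

ld 0:p4, 0x100        // direct: address is the immediate value 0x100
ld 0:p4, [8:p4]       // indirect: address taken from register 8 (assumed syntax)
ld 0:p4, [8:p4 + 16]  // indexed: base register plus immediate offset (assumed syntax)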
Conditional code
The main difference between code for the mono and the poly execution units is the handling of conditional execution.
The mono unit uses conditional jumps to branch around code, typically based on the result of the previous instructions. This means that mono conditions affect both mono and poly operations. This is just like a standard RISC architecture.
The poly unit uses a set of enable bits (described in more detail below) to control whether each PE will have its state changed by the instructions it executes. This provides per-PE predicated operation. The enable state can be changed based on the result of a previous operation.
The following sections describe the architectural features of the two execution units in more detail.
Mono execution unit
The mono execution unit is a 16-bit processing element consisting of:
• A 16-bit ALU with optional extensions
• A register file of configurable size
• Status and control registers

The mono ALU extensions include a multiplier, a barrel shifter and a normalizer, which is important for accelerating software implementations of floating point operations.
As well as handling mono data, the mono unit is responsible for program flow control (branching), thread switching and other control functions. The mono execution unit also has overall control of I/O operations of poly data. Results from these operations are returned to a register in the mono unit.
Conditional execution
The mono execution unit handles conditional execution in the same way as a traditional processor. A set of conditional and unconditional jump instructions use the result of previous operations to jump over conditional code, back to the start of loops, etc.

Multi-threaded execution
The MTAP processor supports several hardware threads. There is a hardware scheduler in the control unit and the mono execution unit maintains multiple banks of critical registers for fast context switching.
The threads are prioritized (0 being highest priority). Control of execution between the threads is performed using semaphores under programmer control. Higher priority threads will only yield to lower priority threads when they are stalled on yielding instructions (such as semaphore wait operations). Lower priority threads can be pre-empted at any time by higher priority ones.
Semaphores are special registers that can be incremented or decremented with atomic (non-interruptible) operations called signal and wait. A signal instruction will increment a semaphore. A wait will decrement a semaphore unless the semaphore is 0, in which case it will stall until the semaphore is signalled by another thread.
Semaphores can also be accessed by hardware units (such as the I/O controllers) to synchronize these with software.
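The following C-style sketch illustrates the signal/wait pattern for overlapping I/O with compute; the semaphore type and every function name here are hypothetical, as the document does not define a library interface:

/* Hypothetical two-thread structure: an I/O thread pre-fetches data and
   signals a semaphore; the compute thread waits on it. A wait stalls
   (allowing another thread to run) while the semaphore is 0. */
sem_t data_ready;

void io_thread(void)
{
    for (;;) {
        fetch_next_block();        /* hypothetical PIO/SIO transfer */
        sem_signal(&data_ready);   /* increment: data is available  */
    }
}

void compute_thread(void)
{
    for (;;) {
        sem_wait(&data_ready);     /* decrement, or stall until signalled */
        process_block();           /* poly computation on the new data    */
    }
}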
Poly execution unit
The poly execution unit is an array of Processing Elements (PEs). Each PE in the poly execution unit consists of:
• An 8-bit ALU with optional extensions such as a multiply-accumulate (MAC) unit
• A register file of configurable size
• Status and enable registers
• A block of memory of configurable size
• An inter-PE communication path
• One or more I/O channels

Load and store instructions move data between a PE's register file and memory, while the ALU operates on data in the register file. Data is transferred in to, and out of, the PE's memory using I/O instructions.
ALU
The poly ALU is used for performing arithmetic and logical operations on values held in the PE register file.
While the ALU is only 8 bits wide, instructions exist for multi-byte arithmetic which is handled by iteration. ALU extensions exist to accelerate functions for floating point, DSP, etc. For example, an optional integer multiply-accumulate unit (MAC) can be included in the ALU. This can deliver an 8 x 8 bit MAC result every cycle, or a 16 x 16 bit MAC every four cycles. The accumulated result can be up to 64 bits wide.
The standard ALU includes basic hardware support to accelerate floating point operations.
Conditional behaviour
The SIMD nature of the PE array prohibits each PE having its own branch unit (branching being handled by the mono execution unit). Instead, each PE can control whether its state should be updated by the current instruction by enabling or disabling itself; this is rather like the predicated instructions in some RISC CPUs.
Enable state
A PE's enable state is determined by a number of bits in the enable register. If all these bits are set to one, then a PE is enabled and executes instructions normally. If one or more of the enable bits is zero, then the PE is disabled and most instructions it receives will be ignored (instructions on the enable state itself, for example, are not disabled).
The enable register is treated as a stack, and new bits can be pushed onto the top of the stack allowing nested predicated execution. The result of a test, either a 1 or a 0, is pushed onto the enable stack. This bit can later be popped from the top of the stack to remove the effect of that condition. This makes handling nested conditions and loops very efficient. Note that, although the enable stack is of fixed size, the compiler handles saving and restoring the state automatically, so there are no limitations on compiled code. When programming at the assembler level, it is the programmer's responsibility to manage the stack.
Instructions
Conditional execution on the poly execution unit is supported by a set of poly conditional instructions: if, else, endif, etc. These manage the enable bits to allow different PEs to execute each branch of an if...else construct in C, for example. These also support nested conditions by pushing and popping the condition value on the enable stack.
As a simple example, consider the following code fragment:

// disable PEs where reg 32 is non-zero
if.eq 32:p1, 0     // push result onto stack
// increment reg 8 on enabled PEs
add 8:p4, 8:p4, 1
// return all PEs to original enable state
endif              // pop enable stack

Here, the initial if instruction compares the two operands on each PE. If they are equal it pushes 1 onto the top of the enable stack - this leaves those PEs enabled if they were previously enabled and disabled if they were already disabled. If the two operands are not equal, a 0 is pushed onto the stack - this disables the corresponding PEs. The following add instruction is sent to all PEs, but only acted on by those that are still enabled. Finally, the endif instruction pops the enable stack, returning all PEs to their original enable state.
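For orientation, the fragment above corresponds roughly to the following C, on the assumption that poly register 32 holds x and poly register 8 holds count (a reconstruction, not taken from the document):

poly int x, count;

if (x == 0)       /* compiles to if.eq: pushes a per-PE enable bit */
    count += 1;   /* the add is ignored on PEs left disabled      */
                  /* the end of the if compiles to endif (pop)    */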
Forced loads and stores
Poly loads and stores are normally predicated by the enable state of the PE. However, because there are instances where it is necessary to load and store data regardless of the current enable state, the instruction set includes forced loads and stores. These will change the state of the PE even if it is disabled.
I/O mechanisms
There are two I/O mechanisms provided for transferring data between PE memory and devices outside the MTAP core. Programmed I/O (PIO) extends the load/store model: it is used for transfers of small amounts of data between PE memory and external memory. Streaming I/O (SIO) is used to efficiently transfer contiguous chunks of data in and out of PE memory.
The number of PIO and SIO channels in a processor is a configurable parameter. A processor will always have at least one PIO channel as this is required for normal program execution.
Multiple I/O channels can run simultaneously.
I/O architecture
The I/O systems consist of three parts: Controller, Engine and Node.

Controller
The PIO and SIO controllers decode I/O instructions and coordinate with the rest of the control unit and the mono processor. The controllers synchronize with software threads via semaphores.

Engine
The I/O engines are basically DMA engines which manage the actual data transfer. There is a Controller and Engine for each I/O channel. A single Controller can manage several I/O Engines.

Node
There is an I/O Node in each PE. The I/O Engine activates each Node in turn, allowing the data transfers to be serialized. The Nodes provide buffering of data to minimize the impact of I/O on the performance of the PEs.
Programmed I/O (PIO)
PIO is closely coupled to program execution and is the normal way for the processor to transfer data to and from the outside world (e.g. external memory, hardware accelerators or a host processor).
The PIO mechanism provides a number of addressing modes:

Direct addressed
Each PE provides an external memory address for its data. This provides random access to data.

Strided
The external memory address is incremented for each PE. In each case, the size of data transferred to each PE is the same.
When multiple PEs are accessing memory then the transfers can be consolidated so as to perform the minimum number of external accesses. So, for example, if half the processors are reading one location and the other half reading another, then only two memory reads would be performed. In fact, consolidation can be better than that: because the bus transfers are packetized, even transfers from nearby addresses can be effectively consolidated.
Streaming I/O (SIO)
SIO is used for streaming high-bandwidth data directly to and from PE memory. This is less flexible but very efficient for transferring blocks of data in and out of the system. This is typically used to stream data to and from memory-mapped I/O devices. Each PE can transfer a different size block of data.
Streaming I/O can be made very efficient in a number of ways. For example, multi-threaded code allows data I/O and compute to be fully overlapped. Also, multiple data buffers can be allocated in PE memory so that different chunks of data can be input, processed and streamed out simultaneously.
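A schematic sketch of the multi-buffering idea, with two buffers for brevity; every function name and the buffer layout are assumptions made for this illustration:

/* Hypothetical double buffering in PE memory: stream block i+1 in while
   block i is processed, then stream the results out. */
poly unsigned char buf[2][BLOCK_SIZE];
int in = 0, out = 1;

for (;;) {
    sio_stream_in(buf[in]);     /* start input of the next block   */
    process(buf[out]);          /* compute on the previous block   */
    sio_stream_out(buf[out]);   /* stream the results back out     */
    in ^= 1;
    out ^= 1;                   /* swap the buffer roles           */
}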
Swazzle
Finally, the PEs are able to communicate with one another via what is known as the swazzle path, which connects the register file of each PE with the register files of its left and right neighbours. On each cycle, PEn can perform a register-to-register transfer of 16 bits to either its left or right neighbour, PEn-1 or PEn+1, while simultaneously receiving data from the other neighbour.
Swazzle instructions use multiples of 2 bytes in their arguments, so the source and destination registers must be 2-byte aligned. Instructions are provided to shift data left or right through the array, and to swap data between adjacent PEs.
The enable state of a PE affects its participation in a swazzle operation in the following way: if a PE is enabled, then its registers may be updated by a neighbouring PE, regardless of the neighbouring PE's enable state. Conversely, if a PE is disabled, its register file will not be altered by a neighbour under any circumstance. A disabled PE will still provide data to an enabled neighbour.
The data written into the registers of the PEs at the ends of the swazzle path can be set by the mono execution unit.
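A sketch of how the swazzle path might be used from C; the intrinsic name and its end-of-array argument are invented for this illustration, since the document names no specific swazzle operations:

/* Hypothetical intrinsic: each enabled PE receives the 16-bit value held
   by its left neighbour; the mono unit supplies the value entering the
   array at the leftmost PE. */
poly short v = (poly short)get_penum();
v = swazzle_right(v, 0);   /* PEn now holds PEn-1's value; PE0 holds 0 */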
Host interface
The interfaces to the host system are used for three basic purposes: initialization and booting, access to host services, and debugging.
Initialization
There are a number of stages of initialization required to start code running on the MTAP processor. These are normally handled transparently by the development tools, but an overview is provided here as background information.
First, the application code (including bootstrap) is loaded into memory.
Next, the host system initializes the state of the control unit, caches and mono execution unit by a series of writes to the PVCI port. The last of these specifies the start address of the code to execute and tells the processor to start fetching instructions.
Finally, the boot code does any remaining initialization of the processor including any set-up of the PEs (e.g. setting the PE number) before running the application code.
Host services
Once the application program is running it will need to access host resources such as the file system.
A protocol is defined between the run-time libraries and a device driver on the host system. This is interrupt based and allows the code running on the MTAP processor to make calls to the application or operating system running on the host.
Debugging
The processor includes hardware support for breakpoints and single-stepping. These are controlled via registers accessed through the PVCI interface.
This interface also allows the debugger, running on the host, to interrogate and update the state of the processor to support fully interactive debugging.
Case Study: Device EV1
To illustrate the use of an MTAP processor in a real device, the architecture of Applicant's evaluation chip EV1 is described in this section. This device is a simple embodiment of Applicant's MTAP processor using Applicant's bus to interface the various ports of the processor to a memory block and to external pins.
EV1 architecture
Figure 8 shows the top level architecture of the EV1 device. The five ports of the MTAP processor and one port of the embedded memory block are connected to a single ClearConnect channel (ClearConnect is Applicant's bus structure) of six nodes and two lanes in opposing directions. The two external ports at each end of the bus allow the ClearConnect bus to be connected from one chip to another. This means a system can be built from multiple EV1 devices to provide the required performance.
The bus ports can also be used to connect to an FPGA to provide other functions, such as peripherals, an external memory controller or a host interface.
The EV1 chip also includes on-chip SRAM which provides fast access to code and data.
MTAP processor
The MTAP processor core in the EV1 has the following specification:

General
• Implemented on 0.13 micron process
• Clock speed: 200 MHz
• 4 Kbyte instruction cache: 4-way, 256 lines x 4 instructions, with manual and auto pre-fetch
• 4 Kbyte data cache: 4-way, 256 lines x 16 bytes

Mono execution unit
• 64 byte register file
• Support for 8 threads

Poly execution unit
• Array of 48 PEs
• MAC extension to the ALU
• 4 Kbytes SRAM per PE
• 64 byte register file

One PIO channel
• AVCI port: 32-bit address, 64-bit data
• Transfer size: 4, 8, 16, 32 bytes per PE
• Address modes: direct and strided

One SIO channel
• AVCI port: 32-bit address, 64-bit data
• Transfer size: up to 128 bytes per PE

Control interfaces
• Interrupt port: 32-bit address, 32-bit data PVCI
• Register port: 32-bit address, 32-bit data PVCI

This gives a performance of:
• 9,600 MIPS
• 9.6 billion 8x8 MACs / second
• 38 Gbytes/s memory bandwidth
• 18 Gbytes/s inter-PE bandwidth
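(These figures follow from the configuration above: 48 PEs x 200 MHz gives 9,600 million PE operations per second; one 8 x 8 MAC per PE per cycle gives 9.6 billion MACs per second; and 48 PEs x 200 MHz x 4 bytes is approximately 38 Gbytes/s of aggregate PE memory bandwidth.)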
Example Applications
Here we give a couple of examples of how the MTAP processor can be applied to specific applications.
Network processing
In this case, the data to be processed consists of a stream of packets of variable size.
Each packet consists of header information and a data payload. In its simplest form, the problem is to examine various fields in the header, do some sort of lookup function and route the packet to the appropriate output port. This processing must be done in real time.
In the network processing application, the SIO channels are used for continuously streaming packets into, and out of, PE memory. Within each PE, the software maintains several buffers so that, while one set of packets is being processed, the previous set can be output and, simultaneously, the next set can be loaded.
The PIO channels are used by the PEs to send requests to another subsystem that handles lookups. This could be on-chip, for example Applicant's Table Lookup Engine, or an off-chip solution such as CAMs.
Bio-informatics
The developing field of in-silico drug discovery uses computer simulation rather than the 'wet science' of test tubes to explore the behaviour of potential new drugs.
This includes molecular simulation of individual molecules and the interactions of molecules. For example, the docking of a ligand molecule (drug) into a protein. This simulation requires the calculation of the interaction energies of all atoms of one molecule with all the atoms of the other. This process is repeated for thousands of configurations of the molecules. This results in an embarrassingly parallel problem ideally suited to the MTAP architecture.
Each PE is allocated a different configuration of protein and ligand, and performs all of the atom-atom interaction energy calculations. The memory available within a PE is insufficient to hold the details of an entire molecule; however, the molecule can be split into pieces and each piece processed in turn. The fetching of atoms can be overlapped with the atom-atom processing so that, even with the floating point accelerated MTAP processor, the task remains compute bound. The processing currently performed in most production code is a simplistic model of the interaction of atoms. The processing power that the MTAP architecture makes available allows more sophisticated models to be considered, improving the accuracy and thus value of the results.
The MTAP processor is also suited to other bio-informatic tasks such as genetic sequence comparison and pharmacophore fingerprinting.
Software Development Kit
The Software Development Kit (SDK) provides a complete set of development tools: C compiler, assembler, linker, debugger, profiler, etc. The software tools use a central configuration file to define the attributes of the target processor that code is being generated for.
Simulation tools
Cycle- and bit-accurate simulations of the core are available in C and Verilog. These can be parameterized for the number of PEs, the size of memory, and other options. A generic configuration of the C simulator is shipped with the SDK. Once the target system is defined, the simulator can be configured to match that specification.
This allows application development to start before the system architecture is fully defined, and then proceed in parallel with the silicon implementation.
Once the target hardware (or a simulation of it) is available, the application code can be run and debugged on that target.
Debugger
A major challenge when developing code for a parallel architecture is understanding the state of the machine when the code is stopped at a breakpoint or error. The data-parallel nature of the MTAP processor minimizes the difficulty here: a single instruction stream is executing and only the data is different on each PE.
Applicant's 2nd generation debugger builds on previous experience and provides a variety of ways to visualize the state of PE memory and registers. As well as traditional source-level symbolic views of data, and low-level dumps of memory and data, the debugger provides user-defined picture views which show the contents of PE memory or registers in graphical form. This provides a quickly understood view of the overall state of the system.
The debugger works identically with all of the simulators and with target hardware. It supports all the features expected from a modern development tool: graphical user interface and command line operation, breakpoints, watchpoints, single-stepping, etc.

Claims (32)

  1. A data processor comprising a single control unit and an integral execution unit adapted to operate on both parallel and scalar data under the same instruction set.
  2. A data processor comprising a single control unit and an integral execution unit adapted to operate on both parallel and scalar data under the same instruction set, and wherein the execution unit comprises a SIMD part comprising an array of processing elements adapted to operate on said parallel data and a scalar part adapted to operate on said scalar data.
  3. A data processor comprising a single control unit and an integral execution unit adapted to operate on both parallel and scalar data under the same instruction set, and wherein substantially all said instructions are operable on said parallel and/or said scalar data.
  4. A data processor comprising a single control unit and an integral execution unit adapted to operate on both parallel and scalar data under the same instruction set, and wherein the execution unit comprises a SIMD part comprising an array of processing elements adapted to operate on said parallel data and a scalar part adapted to operate on said scalar data, and wherein substantially all said instructions are operable on said parallel and/or said scalar data.
  5. A data processor as claimed in claim 1, wherein said scalar part and said SIMD part are adapted to operate asynchronously.
  6. A data processor as claimed in any of claims 1, 3 or 5, wherein the execution unit comprises a SIMD part comprising an array of processing elements (PEs) adapted to operate on said parallel data and a scalar part adapted to operate on said scalar data.
  7. A data processor as claimed in any of claims 2, 4 or 6, wherein the SIMD part and the scalar part differ only in that each processing element of the SIMD part contains local memory.
  8. A data processor as claimed in claim 6, wherein said SIMD part and said scalar part have a common memory area for instructions and scalar data.
  9. A data processor as claimed in claim 6, wherein each PE contains an ALU and one or more multiply-accumulate function units.
  10. A data processor as claimed in claim 6, wherein each PE contains an ALU and one or more floating point function units.
  11. A data processor as claimed in claim 6, wherein each PE has multiple enable bits to support nested conditional code.
  12. A data processor as claimed in claim 6, wherein said PEs and said scalar part each have a similar set of status/result flags which can be used to control branching, on the scalar part of the execution unit, and conditional execution, on the SIMD part of the execution unit, in a similar manner.
  13. A data processor as claimed in claim 6, further comprising a communication path between PEs in the SIMD part.
  14. A data processor as claimed in claim 6, wherein means are provided for data to be transferred to and from the scalar part of the execution unit and the SIMD part of the execution unit.
  15. A data processor as claimed in any of the preceding claims, wherein the control unit is adapted to fetch and execute multiple instruction threads simultaneously.
  16. A data processor as claimed in any of the preceding claims, further comprising a semaphore unit adapted to use semaphores to synchronize between instruction threads.
  17. A data processor as claimed in claim 16, wherein said semaphores are adapted to synchronize between I/O operations and instruction threads.
  18. A data processor as claimed in claim 16, wherein said control unit is adapted to schedule and execute threads based on their priority and the state of any semaphores they are waiting on.
  19. A data processor as claimed in claim 16, wherein said control unit includes a set of said semaphores adapted to be used to synchronize threads.
  20. A data processor as claimed in either of claims 2 or 3, wherein each processing element of the SIMD array comprises one or more function units, a register file, said local memory, and I/O means.
  21. A data processor as claimed in any of claims 2, 4, 6 or 7, wherein both said scalar part and said SIMD part have a complete range of addressing modes that allow them to support: scalar variables pointing to data in scalar memory; scalar variables pointing to data in memory in the SIMD part of the execution unit; SIMD variables pointing to memory in the SIMD part of the execution unit; and SIMD variables pointing to data in scalar memory.
  22. A data processor as claimed in any of claims 1, 2 or 5-21, wherein substantially all said instructions are operable on said parallel data or on said scalar data or on a combination of both said parallel data and said scalar data.
  23. A data processor as claimed in claim 22, further comprising a compiler adapted to produce code from a single source program to operate on both said scalar data and said parallel data.
  24. A data processor as claimed in claim 22, wherein said compiler is adapted to detect parallel data automatically.
  25. A data processor as claimed in claim 23 or 24, wherein a programmer or said compiler is adapted to use the multi-threaded execution to schedule I/O operations such that they occur concurrently with computation and data is always available when required.
  26. A data processor as claimed in claim 20, wherein said scalar part, said SIMD part and said I/O means are adapted to be operated in parallel by the same instruction stream.
  27. A data processor as claimed in claim 26, wherein said I/O means comprises programmed I/O means adapted to enable each PE to transfer data to and from external scalar memory and wherein each PE provides the address it wants to read data from or write data to.
  28. A data processor as claimed in claim 27, wherein memory accesses are consolidated to minimize the number of memory transfers when multiple PEs are accessing the same memory area.
  29. A data processor as claimed in claim 26, wherein each PE is adapted to transfer data to and from external scalar memory and wherein the address for each PE is generated automatically based on the PE number and the amount of data being transferred.
  30. A data processor as claimed in claim 26, wherein said I/O means comprises streaming I/O means which distributes an incoming data stream to all the PEs in the SIMD part and collects data from all the PEs to generate an output stream independently of program execution on the SIMD part of the execution unit.
  31. A data processor as claimed in claim 30, wherein the streaming I/O means only uses enabled PEs to take part in the streaming I/O operation.
  32. A data processor as claimed in claim 30, wherein the size of data distributed to, and collected from, each PE is different for each PE.
GB0409815A 2003-10-13 2004-04-30 Unified SIMD processor Withdrawn GB2407179A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/GB2004/004377 WO2005037326A2 (en) 2003-10-13 2004-10-13 Unified simd processor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB0323950A GB0323950D0 (en) 2003-10-13 2003-10-13 Unified simid processor

Publications (2)

Publication Number Publication Date
GB0409815D0 GB0409815D0 (en) 2004-06-09
GB2407179A true GB2407179A (en) 2005-04-20

Family

ID=29433806

Family Applications (2)

Application Number Title Priority Date Filing Date
GB0323950A Ceased GB0323950D0 (en) 2003-10-13 2003-10-13 Unified simid processor
GB0409815A Withdrawn GB2407179A (en) 2003-10-13 2004-04-30 Unified SIMD processor

Family Applications Before (1)

Application Number Title Priority Date Filing Date
GB0323950A Ceased GB0323950D0 (en) 2003-10-13 2003-10-13 Unified simid processor

Country Status (1)

Country Link
GB (2) GB0323950D0 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090006773A1 (en) * 2005-05-20 2009-01-01 Yuji Yamaguchi Signal Processing Apparatus
US7890733B2 (en) 2004-08-13 2011-02-15 Rambus Inc. Processor memory system
WO2020002783A1 (en) * 2018-06-29 2020-01-02 Vsora Asynchronous processor architecture
WO2020002782A1 (en) * 2018-06-29 2020-01-02 Vsora Processor memory access

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0454985A2 (en) * 1990-05-04 1991-11-06 International Business Machines Corporation Scalable compound instruction set machine architecture
US5187796A (en) * 1988-03-29 1993-02-16 Computer Motion, Inc. Three-dimensional vector co-processor having I, J, and K register files and I, J, and K execution units
US5197135A (en) * 1990-06-26 1993-03-23 International Business Machines Corporation Memory management for scalable compound instruction set machines with in-memory compounding
US5303356A (en) * 1990-05-04 1994-04-12 International Business Machines Corporation System for issuing instructions for parallel execution subsequent to branch into a group of member instructions with compoundability in dictation tag
EP0681236A1 (en) * 1994-05-05 1995-11-08 Rockwell International Corporation Space vector data path
US5537606A (en) * 1995-01-31 1996-07-16 International Business Machines Corporation Scalar pipeline replication for parallel vector element processing
US6317819B1 (en) * 1996-01-11 2001-11-13 Steven G. Morton Digital signal processor containing scalar processor and a plurality of vector processors operating from a single instruction

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5187796A (en) * 1988-03-29 1993-02-16 Computer Motion, Inc. Three-dimensional vector co-processor having I, J, and K register files and I, J, and K execution units
EP0454985A2 (en) * 1990-05-04 1991-11-06 International Business Machines Corporation Scalable compound instruction set machine architecture
US5303356A (en) * 1990-05-04 1994-04-12 International Business Machines Corporation System for issuing instructions for parallel execution subsequent to branch into a group of member instructions with compoundability in dictation tag
US5502826A (en) * 1990-05-04 1996-03-26 International Business Machines Corporation System and method for obtaining parallel existing instructions in a particular data processing configuration by compounding instructions
US5197135A (en) * 1990-06-26 1993-03-23 International Business Machines Corporation Memory management for scalable compound instruction set machines with in-memory compounding
EP0681236A1 (en) * 1994-05-05 1995-11-08 Rockwell International Corporation Space vector data path
US5537606A (en) * 1995-01-31 1996-07-16 International Business Machines Corporation Scalar pipeline replication for parallel vector element processing
US6317819B1 (en) * 1996-01-11 2001-11-13 Steven G. Morton Digital signal processor containing scalar processor and a plurality of vector processors operating from a single instruction

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7890733B2 (en) 2004-08-13 2011-02-15 Rambus Inc. Processor memory system
US9037836B2 (en) 2004-08-13 2015-05-19 Rambus Inc. Shared load-store unit to monitor network activity and external memory transaction status for thread switching
US9836412B2 (en) 2004-08-13 2017-12-05 Rambus Inc. Processor memory system
US20090006773A1 (en) * 2005-05-20 2009-01-01 Yuji Yamaguchi Signal Processing Apparatus
US8464025B2 (en) * 2005-05-20 2013-06-11 Sony Corporation Signal processing apparatus with signal control units and processor units operating based on different threads
WO2020002783A1 (en) * 2018-06-29 2020-01-02 Vsora Asynchronous processor architecture
WO2020002782A1 (en) * 2018-06-29 2020-01-02 Vsora Processor memory access
FR3083351A1 (en) * 2018-06-29 2020-01-03 Vsora ASYNCHRONOUS PROCESSOR ARCHITECTURE
FR3083350A1 (en) * 2018-06-29 2020-01-03 Vsora MEMORY ACCESS OF PROCESSORS
US11640302B2 (en) 2018-06-29 2023-05-02 Vsora SMID processing unit performing concurrent load/store and ALU operations

Also Published As

Publication number Publication date
GB0409815D0 (en) 2004-06-09
GB0323950D0 (en) 2003-11-12

Similar Documents

Publication Publication Date Title
Colwell et al. A VLIW architecture for a trace scheduling compiler
US5822606A (en) DSP having a plurality of like processors controlled in parallel by an instruction word, and a control processor also controlled by the instruction word
Brewer Instruction set innovations for the Convey HC-1 computer
Dongarra et al. High-performance computing systems: Status and outlook
EP1137984B1 (en) A multiple-thread processor for threaded software applications
US6088783A (en) DPS having a plurality of like processors controlled in parallel by an instruction word, and a control processor also controlled by the instruction word
US5978838A (en) Coordination and synchronization of an asymmetric, single-chip, dual multiprocessor
US20050251644A1 (en) Physics processing unit instruction set architecture
EP1102163A2 (en) Microprocessor with improved instruction set architecture
JPH11154144A (en) Method and device for interfacing processor to coprocessor
Stallings Reduced instruction set computer architecture
US20100174868A1 (en) Processor device having a sequential data processing unit and an arrangement of data processing elements
WO2000033183A9 (en) Method and structure for local stall control in a microprocessor
WO2006082091A2 (en) Low latency massive parallel data processing device
Eyre The digital signal processor derby
Krashinsky Vector-thread architecture and implementation
GB2407179A (en) Unified SIMD processor
Heath Microprocessor architectures and systems: RISC, CISC and DSP
WO2005037326A2 (en) Unified simd processor
Soliman Design, implementation, and evaluation of a low-complexity vector-core for executing scalar/vector instructions
Leppänen Scalability optimizations for multicore soft processors
Choquette et al. High performance RISC microprocessors
Khatri Implementation, Verification and Validation of an OpenRISC-1200 Soft-core Processor on FPGA
Cheikh Energy-efficient digital electronic systems design for edge-computing applications, through innovative RISC-V compliant processors
de Melo RISC-V Processing System with streaming support

Legal Events

Date Code Title Description
COOA Change in applicant's name or ownership of the application

Owner name: CLEARSPEED TECHNOLOGY PLC

Free format text: FORMER APPLICANT(S): CLEARSPEED SOLUTIONS LIMITED

WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)