WO2006121444A1 - Vector processor with special purpose registers and high speed memory access
- Publication number
- WO2006121444A1 (PCT/US2005/016485)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- vector
- register
- instruction
- address
- data
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3877—Concurrent instruction execution, e.g. pipeline or look ahead using a slave processor, e.g. coprocessor
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/30007—Arrangements for executing specific machine instructions to perform operations on data operands
- G06F9/3001—Arithmetic instructions
- G06F9/30014—Arithmetic instructions with variable precision
- G06F9/30018—Bit or string instructions
- G06F9/30032—Movement instructions, e.g. MOVE, SHIFT, ROTATE, SHUFFLE
- G06F9/30036—Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
- G06F9/30038—Instructions to perform operations on packed data, e.g. vector, tile or matrix operations using a mask
- G06F9/3004—Arrangements for executing specific machine instructions to perform operations on memory
- G06F9/30043—LOAD or STORE instructions; Clear instruction
- G06F9/30094—Condition code generation, e.g. Carry, Zero flag
- G06F9/30098—Register arrangements
- G06F9/30105—Register structure
- G06F9/30109—Register structure having multiple operands in a single register
- G06F9/3012—Organisation of register space, e.g. banked or distributed register file
- G06F9/3013—Organisation of register space, e.g. banked or distributed register file according to data content, e.g. floating-point registers, address registers
- G06F9/30181—Instruction operation extension or modification
- G06F9/32—Address formation of the next instruction, e.g. by incrementing the instruction counter
- G06F9/322—Address formation of the next instruction, e.g. by incrementing the instruction counter for non-sequential address
- G06F9/325—Address formation of the next instruction, e.g. by incrementing the instruction counter for non-sequential address for loops, e.g. loop detection or loop counter
- G06F9/34—Addressing or accessing the instruction operand or the result ; Formation of operand address; Addressing modes
- G06F9/345—Addressing or accessing the instruction operand or the result ; Formation of operand address; Addressing modes of multiple operands or results
- G06F9/3455—Addressing or accessing the instruction operand or the result ; Formation of operand address; Addressing modes of multiple operands or results using stride
- G06F9/3836—Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
- G06F9/3842—Speculative instruction execution
Definitions
- This invention relates to processors for executing stored programs, and in particular to a vector processor employing special purpose registers to reduce instruction width.
- Vector processors are processors which provide high level operations on vectors, that is, linear arrays of numbers.
- A typical vector operation might add two 64-entry floating-point vectors to obtain a single 64-entry vector.
- One vector instruction is equivalent to a loop in which each iteration computes one of the 64 elements of the result, updates the indices, and branches back to the beginning.
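As a rough illustration of this equivalence, the following Python sketch contrasts the scalar loop with a stand-in for a single vector add. The function names and test data are hypothetical and chosen only for illustration; they are not taken from the patent.

```python
def scalar_loop_add(a, b):
    """Scalar version: one result, one index update and one branch per iteration."""
    result = [0.0] * len(a)
    i = 0
    while i < len(a):            # branch back to the beginning
        result[i] = a[i] + b[i]  # compute one element of the result
        i += 1                   # update the index
    return result


def vector_add(a, b):
    """Stand-in for one vector instruction that produces the whole result vector."""
    return [x + y for x, y in zip(a, b)]


a = [float(i) for i in range(64)]
b = [2.0 * i for i in range(64)]
assert scalar_loop_add(a, b) == vector_add(a, b)
```

The loop overhead (index update and branch) appears once per element in the scalar version and not at all in the vector version, which is the saving the text describes.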
- Vector operations are particularly useful for image processing or scientific and engineering applications where large amounts of data must be processed in a generally repetitive manner. In a vector processor, the computation of each result is independent of the computation of previous results, thereby allowing a deep pipeline without generating data dependencies or conflicts. In essence, the absence of data dependencies is determined by the particular application to which the vector processor is applied, or by the compiler when a particular vector operation is specified.
- A typical vector processor includes a pipelined scalar unit together with a vector unit. In vector-register processors, the vector operations, except loads and stores, use the vector registers.
- Typical prior art vector processors include machines provided by Cray Research and various supercomputers from Japanese manufacturers such as Hitachi, NEC, and Fujitsu. Processors such as provided by these companies, however, are usually physically quite large, requiring cabinets filled with circuit boards. Such machines therefore are expensive, consume large amounts of power, and are generally not suited for applications where cost is a significant factor in the selection of a particular processor.
- This invention provides a vector processor with limited instruction width, but which provides features of a processor having a greater instruction width by virtue of a special purpose register, and the referencing of that register by various instructions.
- This enables a limited width instruction to address the vector memory and provide the functionality of a larger processor, but without requiring the space, multiple integrated circuits, and higher power consumption of a larger processor.
- The simplicity of the design enables implementation on a single integrated circuit, thereby shortening signal propagation delays and increasing clock speed.
- The special purpose registers are set up by a scalar processor, and their contents are then reused without the need to issue new instructions from the scalar processor on each clock cycle. All vector instructions include a special field which indexes into these special registers to retrieve the attributes needed for executing the vector instructions.
- The vector processor includes a set of vector registers for storing data to be used in the execution of instructions and a vector functional unit which is coupled to the vector registers for executing instructions.
- The functional unit executes the instructions in response to operation codes provided to it, and those operation codes include a field which references a special register.
- Each vector instruction includes a length and a starting point, and a special register is used to store the information about the length and starting point for each vector instruction.
- The invention also provides a memory organization for efficient use of the processor.
- A memory architecture is provided in which pipelined accesses are made to groups of banks of SRAM memories.
- A retry capability is provided to allow multiple accesses to the same bank. Data is moved into and out of the banks of SRAM using a parallel loading technique from a shift register.
- The memory system includes a group of access ports for enabling access to the memory, a set of address lines and a set of data lines coupled to the access ports to receive address information and data from the access ports, and a pipelined series of address decoder stages coupled to the address lines.
- As addresses arrive, they are transferred from decoder to decoder, and each decoder compares the address on the address lines with a set of addresses assigned to that decoder corresponding to the memory banks associated with it.
- A first set of memory banks is coupled to the address lines and the data lines between a first address decoder and a second address decoder in the series of address decoders, and a second set of memory banks is coupled to the address lines and the data lines after the second address decoder in the series of address decoders.
- A shift register connected to each of the sets of memory banks enables block loads and stores to the memory banks.
- An additional aspect of the invention is the provision of instructions for invoking the special register described above.
- This register stores information about the length and starting point for each vector instruction.
- A computer-implemented method for executing a vector instruction, which includes an operation code and references to various registers, includes the step of decoding the vector instruction to obtain information about the operation code defining the particular mathematical, logical, or other type of operation to be performed on a vector.
- The vector instruction is decoded to obtain the address of a first vector register where at least one vector upon which the operation is to be performed is stored, the address of a second vector register where the result of the operation is to be stored, and the address of a third register which stores the starting element and the vector length.
- The vector instruction is then executed using information from the first and third registers.
- Figure 1 is a block diagram illustrating the overall processor architecture of a preferred embodiment
- Figure 2 is a block diagram illustrating internal components of the vector processor
- Figure 3 is a diagram illustrating further details about the vector processor
- Figure 4 is a diagram illustrating the data paths for the vector processor
- Figure 5 is a block diagram illustrating the special purpose registers within a single vector pipe in the vector processor
- Figure 5b is a diagram illustrating the G register of Figure 5
- Figure 6 is a block diagram illustrating how the vector registers communicate with memory
- Figure 7 illustrates the format for a typical vector instruction for a single vector pipe
- Figure 8 illustrates a typical vector instruction for multiple vector pipes
- Figure 9 illustrates a skip and repeat operation
- Figure 10 illustrates the Move One Scalar to G Register (m1sg) instruction
- Figure 11 illustrates the Move Two Immediates to G Register (m2ig) instruction
- Figure 12 illustrates the Move Two Scalars to G Register (m2sg) instruction
- Figure 13 illustrates the Move Three Scalars to G Register (m3sg) instruction
- Figure 14 illustrates the Move Higher G Register to Scalar (mhgs) instruction
- Figure 15 illustrates the Move Immediate to G Register (mi(vlg,seg,rg,skg,sg)) instruction
- Figure 16 illustrates the Multi-Pipe Move Immediate to G Register (mmi(vlg,seg,rg,skg,sg)) instruction
- Figure 17 illustrates the Multi-Pipe Move Scalar Register to G Register (mms(vlg,seg,rg,skg,sg)) instruction
- Figure 18 illustrates the Multi-Pipe Move Scalar to Higher G Register (mmshg) instruction
- Figure 19 illustrates the Multi-Pipe Move Scalar to Lower G Register (mmslg) instruction
- Figure 20 illustrates the Move Scalar Register to G Register (ms(vlg,seg,rg,skg,sg)) instruction
- Figure 21 illustrates the Move Scalar to Higher G Register (mshg) instruction
- Figure 22 illustrates the Move Scalar to Lower G Register (mslg) instruction
- Figure 23 illustrates the Vector Load Byte Indexed (vlbi) instruction
- Figure 24 illustrates the Vector Load Byte Offset (vlbo) instruction
- Figure 25 illustrates the Vector Load Doublet Indexed (vldi) instruction
- Figure 26 illustrates the Vector Load Doublet Offset (vldo) instruction
- Figure 27 illustrates the Vector Store Byte Indexed (vstbi) instruction
- Figure 28 illustrates the Vector Store Byte Masked Indexed (vstbmi) instruction
- Figure 29 illustrates the Vector Store Byte Masked Offset (vstbmo) instruction
- Figure 30 illustrates the Vector Store Byte Offset (vstbo) instruction
- Figure 31 illustrates the Vector Store Doublet Indexed (vstdi) instruction
- Figure 32 illustrates the Vector Store Doublet Masked Index (vstdmi) instruction
- Figure 33 illustrates the Vector Store Doublet Masked Offset (vstdmo) instruction
- Figure 34 illustrates the Vector Store Doublet Offset (vstdo) instruction
- Figure 35 is a block diagram of a vector memory system
- Figure 36 is a more detailed illustration of the vector memory system
- Figure 37 is a block diagram illustrating in more detail one memory bank
- Figure 38 illustrates the store control pipeline
- Figure 39 illustrates the load control pipeline
- Figure 40 is a block diagram illustrating in more detail the load data path
- Figure 41 is a block diagram illustrating how the groups of banks interface with the DMA shift register
- Figure 42 is a diagram illustrating the input signals provided to one memory bank
- Figure 43 is a more detailed diagram of the bank priority encoder
- Figure 44 is a block diagram illustrating details of the bank index multiplexer
- Figure 45 illustrates the 5:1 multiplexer for selecting the write data for a particular bank and the input and output signals for the memory bank
- This invention provides a vector processor which may be implemented on a single integrated circuit.
- Five vector processors together with the data input/output unit and a DRAM controller are implemented on a single integrated circuit chip.
- This chip provides a video encoder which is capable of generating bit streams which are compliant with MPEG-2, Windows Media 9, and H.264 standards.
- FIG. 1 is a block diagram illustrating the basic structure of a microcontroller.
- The microcontroller includes a scalar processor 10, four independent 16-bit vector processors 20, high speed static random access memory 30, and an input/output (I/O) interface 40. Interfaces to the microcontroller include two 64-bit wide unidirectional buses 50 (one input and one output) for communication with synchronous DRAM, and two 32-bit wide unidirectional buses 60 (one input and one output) used for programmed I/O.
- The vector register memory 30 is implemented in SRAM and consists of four banks of 16 vector registers. Each register has 32 elements, thereby providing a total of 2,048 vector register elements. The use of a large VSRAM to provide memory 30 enables maintaining an entire data set for an algorithm in a memory that has very fast access time compared to the relatively slower DRAM memory.
- FIG. 2 is a more detailed block diagram of the microcontroller shown more simply in Figure 1.
- The scalar processor includes an instruction unit, an integer execution unit, and two register file banks.
- The integer execution unit typically includes a shifter, an adder, a multiplier, and logical functions.
- The two register file banks 70 are shown coupled to the scalar processor 10.
- The scalar processor is coupled to a 32-kByte instruction cache 80, an 8-kByte scratch memory 90, and a 4-kByte set-associative data cache 100.
- The data cache is coupled to the SRAM 30.
- The scalar processor will typically be a single-issue design with hardware interlocks.
- Instructions issue in order and complete in order with instruction decode requiring one clock are 32 bits, but support 32, 16, and 8-bit data values. All execution units complete in one clock except the multiplier which requires four clocks, data cache loads which require three clocks, and the 32-bit shift which requires two clocks.
- The two banks of 32-entry scalar register files 70 provide one file for the supervisor and another file for applications. As shown in Figure 2, each element in the register file is 32 bits, and the scratch memory 90 provides storage for any spilling of the registers.
- Scalar processor 10 accesses the register files using read ports 110 and write port 120. Simple instructions are executed in the scalar processor in a nine-clock pipeline of icache fetch, icache hit and way select, instruction decode, operand fetch, execute 0, execute 1, execute 2, execute 3, and writeback.
- The scalar processor 10 has four condition code registers (c0, c1, c2, c3), each with a single flag bit. These 1-bit flags reflect the overflow (O) and carry (C) conditions. The meaning of the condition code flag depends on the type of instruction that set the flag:
- An instruction that specifies a condition code register to be set as a result of the operation performed also modifies the cC flag. For example, an instruction that compares two registers for equality and chooses c2 as the condition code register destination will set the flag. In contrast, a logical instruction such as the logical-and instruction cannot specify a condition code register and so leaves all condition code flags unmodified.
- A branch on condition instruction will not modify the cC flag.
- A cC register is used as a carry-in, and if there is an overflow from the operation, the same cC register is modified.
- The Vector Mask registers (mM) 110 are used to store condition codes for the vector functional units. Each vector pipe has eight M registers that store a single bit for each element in the vector register. If the vector length is set to 32, then the M register is 32 bits. The meaning of the condition code flag depends on the type of instruction that set the flag:
- The M register can be moved to a scalar register and a bit reduction operation performed to check if any flags were set during the vector operation.
- The Mask registers can also be used to hold carry values for instructions that have a carry-in. For example, double precision (32-bit) arithmetic requires:
- vaddu nVD,nVA,nVB,mM add low bits unsigned, carry to mM
- vaddc nVD,nVA,nVB,mM add high bits with carry from mM
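The carry hand-off between the two instructions can be sketched in Python as follows. This is a simplified model under stated assumptions: the functions `vaddu`, `vaddc`, and `add32` are hypothetical stand-ins named after the mnemonics above, the mask is modeled as a list of one carry bit per element, and any carry out of the high half is simply discarded.

```python
MASK16 = 0xFFFF  # each vector element is 16 bits wide


def vaddu(va, vb):
    """Unsigned 16-bit add per element; returns (sums, per-element carry bits)."""
    sums, carries = [], []
    for a, b in zip(va, vb):
        total = a + b
        sums.append(total & MASK16)
        carries.append(total >> 16)  # 1 if the add carried out of bit 15
    return sums, carries


def vaddc(va, vb, carries):
    """16-bit add per element with the carry-in taken from the mask bits."""
    return [(a + b + c) & MASK16 for a, b, c in zip(va, vb, carries)]


def add32(x, y):
    """Add vectors of 32-bit values by splitting them into low and high 16-bit halves."""
    lo, mask = vaddu([v & MASK16 for v in x], [v & MASK16 for v in y])
    hi = vaddc([v >> 16 for v in x], [v >> 16 for v in y], mask)
    return [(h << 16) | l for h, l in zip(hi, lo)]


assert add32([0x0001FFFF], [0x00000001]) == [0x00020000]
```

The low halves are added first so that the mask bits are available as the carry-in when the high halves are added.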
- Vector Mask registers can also be used with shift instructions on the vector side. For example, if a shift instruction shifts out any value of 1, the vector mask is set. This can be used to find the largest number in a vector and then scale the vector accordingly.
- The M register is also used in the vector merge instruction. In this case, the mask bit selects whether the element from source one or the element from source two is written to the destination register.
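A minimal sketch of the merge behavior, in Python, might look like the following. The text does not state which mask value selects which source, so the polarity here (a mask bit of 1 selects source one) is an assumption, and `vector_merge` is a hypothetical name.

```python
def vector_merge(mask_bits, src1, src2):
    """Per element, pick src1 where the mask bit is 1 and src2 where it is 0.

    The polarity is an assumption made for illustration only.
    """
    return [a if m else b for m, a, b in zip(mask_bits, src1, src2)]


merged = vector_merge([1, 0, 1, 0], [10, 20, 30, 40], [1, 2, 3, 4])
assert merged == [10, 2, 30, 4]
```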
- Figure 2 also shows more detail for the block diagram of the vector processor.
- The architecture has four vector processors 20, each with four 16-bit wide functional units (for a total of 16).
- The vector unit receives its data from the 128 banks of the on-chip SRAM 30. Data is transferred under program control of the scalar processor 10 using a DMA controller and channel 130.
- The data is transferred from the DRAM backing store through the high-speed system bus 140 to the SRAM.
- Data from the SRAM is transferred by the memory controller to the register files under the control of the scalar processor 10, and is interlocked with the appropriate instructions in the hardware.
- The memory interface has a capacity of twelve 16-bit simultaneous transfers per clock.
- Figure 3 illustrates typical bandwidths of the vector processor in a preferred implementation.
- Figure 4 shows the vector unit register organization. There are four vector register banks 200, each with 16 vector registers. Each vector register has 32 register elements that are 16 bits wide. Each of the four banks is identical, with five read ports and four write ports. Each 32-entry vector register has two read ports and one write port.
- The vector function units 210 are capable of running two operations at the same time in each vector unit.
- Four vector functional units can have eight operations occurring simultaneously.
- Each vector function unit is capable of four reads and two writes simultaneously.
- The SRAM 30 buffers feed the vector registers 200 using memory controllers. These memory controllers are programmed by the scalar processor 10, but are located in each of the functional units 210. There are three memory controllers in each functional unit: two for loads and one for stores.
- The vector processor 210 supports chaining. For example, if the first instruction issued is a multiply that stores the result in a vector register, a second instruction can issue on the next clock that reads the result in the register file from the first operation, and performs a different operation on the result of the first multiply. The hardware automatically schedules the second instruction when the result of the first operation is complete by register scoreboarding of the vector register elements.
- Figure 5 is a block diagram of a single vector pipe 220.
- The single vector pipe includes a vector functional unit 210 and 16 vector registers 200. These units are coupled to a load/store control 230 and another set of registers 240.
- The vector pipe is coupled to the SRAM 30 as also shown.
- The vector pipe includes, within the load/store control, eight G registers 235 and an address control block 236.
- The special "G" register file 235 is organized as eight 48-bit registers. This register file is capable of one read and one write, and can be read and written by various instructions, as well as read by the SRAM load store controller 236. As will be described below in more detail, vector load and store operations use the "G" register file to obtain the desired values for a series of parameters. In the preferred embodiment these parameters include (1) vector length, (2) starting element, (3) repeat, (4) skip, and (5) stride.
- The bit positions where these values are stored are:
- gG[47:42] ← (6-b Vector Length)
- gG[41:37] ← (5-b Starting Element)
- gG[36:31] ← (6-b Repeat)
- gG[30:15] ← (16-b Skip)
- gG[14:0] ← (15-b Stride)
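The field layout listed above can be sketched as a pack/unpack pair in Python. The names `GFields`, `pack_g`, and `unpack_g` are hypothetical, and treating every field as an unsigned integer is an assumption, since the text here does not say whether skip or stride are signed.

```python
from collections import namedtuple

GFields = namedtuple("GFields", "vector_length starting_element repeat skip stride")

# (field name, high bit, low bit) for the 48-bit G register, as listed in the text
LAYOUT = (
    ("vector_length", 47, 42),
    ("starting_element", 41, 37),
    ("repeat", 36, 31),
    ("skip", 30, 15),
    ("stride", 14, 0),
)


def pack_g(fields):
    """Pack the five parameters into one 48-bit integer (fields assumed unsigned)."""
    word = 0
    for name, hi, lo in LAYOUT:
        width = hi - lo + 1
        value = getattr(fields, name)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word |= value << lo
    return word


def unpack_g(word):
    """Extract the five parameters from a 48-bit G register value."""
    return GFields(*(((word >> lo) & ((1 << (hi - lo + 1)) - 1)) for _, hi, lo in LAYOUT))


g = GFields(vector_length=32, starting_element=0, repeat=1, skip=0, stride=1)
assert unpack_g(pack_g(g)) == g
```

The field widths (6 + 5 + 6 + 16 + 15) sum to 48 bits, which matches the stated register width.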
- The G register is illustrated in more detail in Figure 5b.
- That instruction includes an index into the G register to specify the desired parameters for that operation.
- The G field in the vector instruction will be three bits in length.
- The vector pipe shown in Figure 5 also includes a special purpose dual-ported register file referred to as the "M" register.
- This register holds vector mask data. It is organized as eight 32-bit registers, and can be read or written by various instructions. The operation of these mask registers was described above.
- Each vector pipe also has a special purpose 40-bit register file called aACC.
- This register file holds the 40-bit result of each MAC instruction, and each of the two add/sub reduction 24-bit Accumulators.
- The Accumulator is loaded from the ACC register file at the beginning of each MAC or reduction operation. At the end of the operation the final result in the Accumulator is stored in the ACC register.
- This register file is dual-ported to allow two operations to occur at the same time.
- Figure 6 is a block diagram of the high-speed SRAM and memory controller.
- The vector registers are capable of 32 reads and 16 writes per pipe; however, only five reads and four writes can occur at the same time. Since only one load or store instruction can be issued at a time, obtaining twelve operations takes either twelve vector instructions, or a multi-pipe load or store operation where the attributes for each operation are located in the local G register.
- For each vector register file there are five read ports - two ports for the function unit on pipe 0, two ports for the function unit on pipe 1 and one port for store data.
- Each vector pipe has four write ports - one port for the function unit on pipe 0, one port for the function unit on pipe 1, one port for loads on pipe 0 and one port for loads on pipe 1.
- The SRAM is composed of 128 memory banks. Each memory bank is organized as 512 x 16 bits and is capable of one read or one write per clock. Each bank has twelve address ports, eight read ports, and four write ports. Only one address port and one read or write port are selected for action in one clock. Addressing for the banks uses bits 1 through 7 to determine the bank address; therefore, a sequential block of 256 bytes will address all of the banks.
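- The bank selection rule can be illustrated with a short sketch. The bank number follows the text (address bits 1 through 7). The row mapping is an assumption: the text does not say which bits select the row, so the nine bits directly above the bank field are used here. The function names are hypothetical.

```python
NUM_BANKS = 128  # from the text; each bank is 512 x 16 bits


def bank_of(byte_address):
    """Bank number taken from address bits 1 through 7."""
    return (byte_address >> 1) & 0x7F


def row_of(byte_address):
    """Row within the bank. Assumption: bits 8 through 16 (9 bits, 512 rows)."""
    return (byte_address >> 8) & 0x1FF


# A sequential, aligned 256-byte block reaches every one of the 128 banks.
banks = [bank_of(a) for a in range(0x1000, 0x1000 + 256)]
assert sorted(set(banks)) == list(range(NUM_BANKS))
```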
- a high speed interface is provided to all banks of the SRAM.
- the interface accumulates 256 bytes in a buffer, and then transfers all 256 bytes in four clocks to all of the banks.
- This 256-byte buffer is read or written from the SRAM on 256-byte boundaries. If any vectors are in flight, they are held for one clock while the read or write occurs.
- The Memory Controller routes each of the potential twelve reads or writes from the vector register to the proper banks. Since each vector register may have up to 32 elements, a stride of one assures that 32 consecutive banks will be addressed. Since a bank can read or write on every clock, there is no bank conflict between addresses in the same vector; however, there may be bank conflicts with addresses from other vectors that are executing.
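- A minimal sketch of how conflicts between two concurrently executing vectors might be spotted. It assumes that the stride counts 16-bit elements, that each vector issues one element per clock, and that both vectors start on the same clock. None of this timing detail is stated in the text, and the function names are hypothetical.

```python
def banks_touched(base_byte_address, stride_elements, length=32):
    """Bank number (address bits 1-7) for each element of a strided vector.

    Assumption: the stride counts 16-bit elements, i.e. 2 bytes per step.
    """
    return [
        ((base_byte_address + 2 * stride_elements * i) >> 1) & 0x7F
        for i in range(length)
    ]


def conflicting_clocks(vec_a, vec_b):
    """Clocks at which two vectors issuing in lockstep hit the same bank."""
    return [clk for clk, (a, b) in enumerate(zip(vec_a, vec_b)) if a == b]


unit_stride = banks_touched(0, 1)   # banks 0, 1, 2, ... 31
double_stride = banks_touched(0, 2)  # banks 0, 2, 4, ... 62
offset_unit = banks_touched(32, 1)   # banks 16, 17, ... 47
```

- With these inputs, `double_stride` and `offset_unit` meet at bank 32 on clock 16, while `unit_stride` never collides with a second unit-stride vector that starts 64 bytes later.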
- FIG. 7 is a diagram of a typical vector instruction, "Vector Add (vadd)," which employs the G register.
- the vadd instruction provides an addition function.
- the vector pipe is selected by the 3-bit P field 270.
- the arithmetic functional unit is selected by the hardware.
- the vector register as specified by the VA field 271 has each element added to the vector element of the vector register vVB 272, with each result element placed into the vVD vector register 273.
- the 3-bit M field 274 selects the vector pipe M register that contains the vector mask registers. If the sum has overflowed, a one is placed in the M register.
- the G field 275 selects the appropriate G register containing the starting element and vector length.
- a typical implementation is:
- Furthermore, in the figures associated with many of the following instructions, reference is made to fields 0x0, 0x1, etc. This nomenclature is intended to indicate that the bits so marked designate hexadecimal 0, hexadecimal 1, etc. In addition, "P" refers to the vector processor pipe number and "G" to the G register.
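- The vadd behavior described above can be sketched in a few lines. Several details are assumptions rather than statements from the text: elements are taken to be 16-bit signed values, an overflowing sum is wrapped (it might equally be saturated), mask bit i is taken to correspond to element i, and the vector length is taken to count elements beginning at the starting element. The function name is hypothetical.

```python
def vadd(va, vb, vd, start, length):
    """Add elements of va and vb into vd; return a mask of overflowed elements.

    Assumptions: 16-bit signed elements, wrap-around on overflow, and
    mask bit i marks an overflow in element i.
    """
    mask = 0
    for i in range(start, min(start + length, len(vd))):
        total = va[i] + vb[i]
        if not -0x8000 <= total <= 0x7FFF:
            mask |= 1 << i  # record the overflow for this element
        # Wrap the sum into the signed 16-bit range.
        vd[i] = ((total + 0x8000) & 0xFFFF) - 0x8000
    return mask
```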
- FIG. 8 is a diagram of a typical multi-pipe vector operation, in this case "Multi-Pipe Vector Add (mvadd)," which also employs the G register.
- mvadd: Multi-Pipe Vector Add
- the format of the mvadd instruction is:
- This instruction is used on all four pipes at the same time.
- the arithmetic functional unit is selected by the hardware.
- Each element of the vector register specified by the VA field 280 is added to the vector element of vector register vVB 281.
- the result element is placed into the vVD vector register 282.
- the 3-bit M field 283 selects the vector pipe M register that contains the vector mask registers. If the sum has an overflow, a 1 is placed in the M register.
- the G field 284 selects the appropriate G register containing the starting element and vector length.
- a typical implementation is:
- the G register is set up by the scalar processor and then used over and over without the necessity of issuing new vector instructions.
- the G register provides the special attributes needed for execution of the instructions, such as vadd and mvadd. In the case of these instructions the G register provides the vector length and the starting field, thereby providing an indication of how many computations are required and where the addressing starts.
- the repeat, skip and stride relate to how an address sequence is generated for vector load and store instructions.
- the starting address of the first element is computed in the scalar pipe.
- a stride value is then added to this address and accumulated on every subsequent clock.
- a skip value is also added to this address stream every nth cycle defined by the repeat field.
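- The address sequence described above can be sketched as a generator. The text leaves some details open, so this assumes that the skip is added on top of the stride after every `repeat` elements and that a repeat of zero disables the skip. The function name and the example numbers are illustrative only.

```python
def address_sequence(start, stride, skip, repeat, length):
    """Yield one address per clock for a vector load or store.

    Assumptions: the skip is added in addition to the stride after every
    `repeat` elements, and repeat == 0 means the skip is never applied.
    """
    address = start
    for i in range(length):
        yield address
        address += stride
        if repeat and (i + 1) % repeat == 0:
            address += skip


# Example: walk a 4-element-wide block out of rows that are 16 elements long.
block = list(address_sequence(start=0, stride=1, skip=12, repeat=4, length=10))
```

- Under these assumptions the example visits addresses 0 to 3, then 16 to 19, then 32 and 33: four contiguous elements per row, followed by a jump to the start of the block in the next row.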
- the overall impact of the G register is the enablement of a richer opcode set, but without need for long instruction words.
- the scalar processor reloads the G register when vector operations occur.
- The vector operations typically require 32 clocks, thereby providing the scalar processor the opportunity to reload the G register.
- This capability is enhanced by the vector operation remembering the contents of the G register when the vector operation begins execution. This enables the G register to be reloaded immediately.
- the stride feature of the G register is particularly beneficial for video applications in which blocks of pixels from a serial data stream are addressed and processed. The stride allows addressing of the SRAM to step from one location to another where those locations are not contiguous, but are evenly spaced.
- the vector processor described above includes many instructions facilitating operations with the G register. These instructions are discussed next.
- the "Move Two Immediates to G Register (m2ig)" instruction is shown in Figure 11.
- The format of the instruction is:
  - gG[47:42] ← gG[47:42] (vector length)
  - gG[41:37] ← gG[41:37] (starting element)
  - gG[36:31] ← rS[5:0] (repeat)
  - gG[30:15] ← rA[15:0] (skip)
  - gG[14:0] ← rB[14:0] (stride)
- the vector pipe is selected by the 3-bit P field.
- the high-order 17 bits of the gG register are sent to the scalar general-purpose D register.
- a typical implementation is:
- the vector pipe is selected by the 3-bit P field.
- The Stride and Skip immediates are each 12-bit signed values. (An assembly error will occur if more than twelve bits are specified.)
- the immediate values as shown in Table 1 are sent to the selected gG register.
- the MSB of Stride has the sign extended to form a 15-bit value.
- the MSB of Skip has the sign extended to form a 16-bit value.
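- The sign extension of the two 12-bit immediates can be illustrated as follows. The helper name `sign_extend` is hypothetical, and results are returned as raw bit patterns of the target width.

```python
def sign_extend(value, from_bits, to_bits):
    """Sign-extend a from_bits-wide field to to_bits (raw bit pattern)."""
    value &= (1 << from_bits) - 1
    if value & (1 << (from_bits - 1)):
        # Copy the MSB into every bit between from_bits and to_bits.
        value |= ((1 << to_bits) - 1) & ~((1 << from_bits) - 1)
    return value


stride_field = sign_extend(0xFFF, 12, 15)  # -1 as a 15-bit stride pattern
skip_field = sign_extend(0xFFF, 12, 16)    # -1 as a 16-bit skip pattern
```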
- The vector pipe is selected by the 3-bit P field.
- The immediate value for the vector length is in bits [16:11] (0x20).
- The starting element is in bits [25:21] (0x00) of the instruction, and is sent to the vector pipe and stored in the addressed gG register.
- A typical implementation is:
  - gG[47:42] ← I[16:11] (vector length)
  - gG[41:37] ← I[25:21] (starting element)
  - gG[36:31] ← gG[36:31]
  - gG[30:15] ← gG[30:15]
  - gG[14:0] ← gG[14:0]
- the vector pipe is selected by the 3-bit P field. Portions of the contents of the two general registers rA and rB are sent to the selected vector pipe, and stored in the addressed gG register.
- General-purpose register A contains the 5-bit starting element, and general-purpose register B contains the 6-bit vector length.
- a typical implementation is:
- The vector pipe is selected by the 3-bit P field. Portions of the contents of the three general registers rA, rB, and rS are sent to the selected vector pipe and stored in the addressed gG register.
- General-purpose register S contains the 6-bit repeat, general-purpose register A contains the 16-bit skip, and general-purpose register B contains the 15-bit stride.
- a typical implementation is:
- a typical implementation is:
- mmslg rA,gG: For this instruction all vector pipes are selected. The contents of general-purpose register rA are sent to all of the vector pipes and stored in the lower 31 bits [30:0] of the addressed gG register in each pipe.
- a typical implementation of the instruction is:
- The vector pipe is selected by the 3-bit P field.
- The contents of general-purpose scalar register rA are sent to the selected vector pipe and then stored in the selected gG register.
- Table 4 shows which bits from the general-purpose register rA go to the fields of register gG.
- the vector byte data is loaded from the Effective Address (EA) in the SRAM to the specified destination vector register vVD and sign-extended.
- EA Effective Address
- The 6-bit signed offset is sign-extended and shifted left five bit positions, and then added to the contents of general-purpose register rA to form the effective SRAM address.
- the 3-bit P field contains the pipe number, which has a value from 0-3. The upper bit of the P field is reserved for future expansion.
- the G field is used to select one of eight local registers that contains the values for stride, skip, the vector starting element, and vector length that will be used for this operation. Each pipe has one G register file.
- the EA refers to the SRAM.
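- The offset form of the effective address calculation can be sketched as below. The shift amount follows the text for this instruction (five bit positions) and is left as a parameter because other instructions in this description shift by six. Treating rA as a 32-bit value whose sum wraps modulo 2^32 is an assumption, and the function name is hypothetical.

```python
def effective_address(ra, offset6, shift=5):
    """EA = rA + (sign_extend(6-bit offset) << shift).

    Assumption: rA is a 32-bit unsigned value and the sum wraps modulo 2**32.
    """
    offset6 &= 0x3F
    signed = offset6 - 0x40 if offset6 & 0x20 else offset6
    return (ra + (signed << shift)) & 0xFFFFFFFF
```

- For example, with rA = 0x1000 an offset of 1 gives 0x1020, and an offset of 0x3F (that is, -1) gives 0x0FE0.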
- a typical implementation of the instruction is:
- the vector pipe is selected by the 3-bit P field.
- The contents of general-purpose register rA are sent to the selected vector pipe and stored in the upper seventeen bits [47:31] of the addressed gG register.
- a typical implementation of the instruction is:
- The vector pipe is selected by the 3-bit P field.
- the contents of general register rA are sent to the selected vector pipe and stored in the lower 31 bits [30:0] of the addressed gG register.
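- The two partial updates of a gG register (upper seventeen bits and lower 31 bits) can be sketched as masked writes. Which bits of rA supply each field is given by tables that are not reproduced here, so the use of the low-order bits of rA is an assumption. The helper names are hypothetical.

```python
UPPER_MASK = ((1 << 17) - 1) << 31  # gG bits [47:31]
LOWER_MASK = (1 << 31) - 1          # gG bits [30:0]


def write_upper(g, ra):
    """Replace gG[47:31]. Assumption: the low 17 bits of rA supply the value."""
    return (g & LOWER_MASK) | ((ra & 0x1FFFF) << 31)


def write_lower(g, ra):
    """Replace gG[30:0]. Assumption: the low 31 bits of rA supply the value."""
    return (g & UPPER_MASK) | (ra & LOWER_MASK)
```

- The upper field holds the vector length, starting element, and repeat; the lower field holds the skip and stride. Each write leaves the other field untouched.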
- a typical implementation of the instruction is:
- The vector data is loaded from the Effective Address (EA) in the SRAM to the specified destination vector register vVD.
- the index from the contents of general-purpose register rB is added to the contents of general-purpose register rA to form the effective SRAM address.
- The index (rB) is a signed value, and the base (rA) register is an unsigned value.
- the byte in memory addressed by the EA is loaded into the low-order eight bits of general-purpose vector register vVD.
- the high-order bits of general-purpose register vVD are replaced with bit seven of the loaded value.
- The 3-bit P field contains the pipe number, which has a value from 0-3.
- the upper bit of the P field is reserved for future expansion.
- the G field is used to select one of eight local registers that contains the values for stride, skip, repeat, the vector starting element, and vector length that will be used for this operation.
- Each pipe has one G register file.
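- The indexed byte load described above can be sketched as two small steps: form the address from an unsigned base and a signed index, then sign-extend the loaded byte. The 16-bit element width and the 32-bit wrap-around are assumptions, and the function names are hypothetical.

```python
def indexed_ea(ra, rb):
    """EA = unsigned base rA + signed index rB (assumed 32-bit, wrapping)."""
    rb &= 0xFFFFFFFF
    index = rb - (1 << 32) if rb & 0x80000000 else rb
    return (ra + index) & 0xFFFFFFFF


def load_byte_element(memory, ea, element_bits=16):
    """Load one byte and replicate bit 7 into the high-order element bits.

    Assumption: vector elements are 16 bits wide.
    """
    byte = memory[ea] & 0xFF
    if byte & 0x80:
        byte |= ((1 << element_bits) - 1) & ~0xFF
    return byte
```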
- a typical implementation of the instruction is:
- The vector data is loaded from the Effective Address (EA) in the SRAM to the specified destination vector register vVD.
- the index from the contents of general-purpose register rB is added to the contents of general-purpose register rA to form the effective SRAM address.
- The index (rB) is a signed value, and the base (rA) register is an unsigned value.
- the byte in the memory as addressed by the EA is loaded into general-purpose vector register vVD.
- The 3-bit P field contains the pipe number, which has a value from 0-3.
- the upper bit of the P field is reserved for future expansion.
- the G field is used to select one of eight local registers that contains the values for stride, skip, the vector starting element, and vector length that will be used for this operation. Each pipe has one G register file.
- a typical implementation of the instruction is:
- vldo vVD,rA,O,P,gG: For this instruction the vector data is loaded from the Effective Address (EA) in the SRAM to the specified destination vector register vVD.
- the 6-bit signed offset is sign-extended and shifted left six bit positions, and then added to the contents of general-purpose register rA to form the effective SRAM address.
- the 3-bit P field contains the pipe number, which has a value from 0-3. The upper bit of the P field is reserved for future expansion.
- the G field is used to select one of eight local registers that contains the values for stride, skip, the vector starting element, and the vector length that will be used for this operation. Each pipe has one G register file.
- the EA refers to the SRAM.
- a typical implementation of the instruction is:
- the vector data is sent from the specified vector register vVS to the Effective Address (EA) in the SRAM.
- the index from the contents of general-purpose register rB is added to the contents of general-purpose register rA to form the effective SRAM address.
- The 3-bit P field contains the pipe number, which has a value from 0-3. The upper bit of the P field is reserved for future expansion.
- the G field is used to select one of eight local registers that contains the values for stride, skip, the vector starting element, and vector length that will be used for this operation. Each pipe has one G register file.
- the index (rB) is a signed value, and the base (rA) register is an unsigned value.
- A typical implementation of the instruction is:
  - SRAM EA ← (rB[31:0] + rA[31:0] + gG)
  - SRAM EA[7:0] ← (vVS[7:0])
- the vector data is sent from the specified vector register vVS to the Effective Address (EA) in the SRAM.
- the index from the contents of general-purpose register rB is added to the contents of general-purpose register rA to form the effective SRAM address.
- the value in each element vVS is stored in the effective SRAM address only if the corresponding mask bit for that vector element is set to 1.
- The 3-bit P field contains the pipe number, which has a value from 0-3. The upper bit of the P field is reserved for future expansion.
- the G field is used to select one of eight local registers that contains the values for stride, skip, repeat, the vector starting element, and vector length that will be used for this operation.
- Each pipe has one G register file.
- The index (rB) is a signed value, and the base (rA) register is an unsigned value.
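- The masked store described above can be sketched as follows. The SRAM is modelled as a plain dictionary, and the pairing of mask bit i with vector element i is an assumption. The function name is hypothetical.

```python
def masked_store(sram, addresses, vvs, mask):
    """Store element i to addresses[i] only when mask bit i is 1.

    Assumption: mask bit i corresponds to vector element i.
    """
    for i, (address, value) in enumerate(zip(addresses, vvs)):
        if (mask >> i) & 1:
            sram[address] = value


sram = {}
masked_store(sram, addresses=[0, 2, 4, 6], vvs=[10, 11, 12, 13], mask=0b0101)
```

- Here only elements 0 and 2 are written, so `sram` ends up holding values at addresses 0 and 4 and nothing at addresses 2 and 6.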
- a typical implementation of the instruction is:
- the vector data is sent from the specified vector register vVS to the Effective Address (EA) in the SRAM.
- the contents of general-purpose register rA are added to the offset to form the effective SRAM address.
- the value in each element vVS is stored in the effective SRAM address only if the corresponding mask bit for that vector element is set to 1.
- the 3-bit P field contains the pipe number which has a value from 0-3. The upper bit of the P field is reserved for future expansion.
- the G field is used to select one of eight local registers that contains the values for stride, skip, repeat, the vector starting element, and vector length that will be used for this operation.
- Each pipe has one G register file.
- The Immediate (T) is a signed value, and the base (rA) register is an unsigned value.
- a typical implementation of the instruction is:
- the vector data is sent from the specified vector register vVS to the Effective Address (EA) in the SRAM.
- the signed offset is sign-extended, shifted left six bit positions, and added to the contents of general-purpose register rA to form the effective SRAM address.
- The 3-bit P field contains the pipe number, which has a value from 0-3. The upper bit of the P field is reserved for future expansion.
- the G field is used to select one of eight local registers that contains the values for stride, skip, the vector starting element, and vector length that will be used for this operation. Each pipe has one G register file.
- the index (rB) is a signed value and the base (rA) register is an unsigned value.
- a typical implementation of the instruction is:
- vstdi vVS,rA,rB,P,gG: For this instruction the vector data is sent from the specified vector register vVS to the Effective Address (EA) in the SRAM.
- the index from the contents of general-purpose register rB is added to the contents of general-purpose register rA to form the effective SRAM address.
- the 3-bit P field contains the pipe number which has a value from 0-3. The upper bit of the P field is reserved for future expansion.
- the G field is used to select one of eight local registers that contains the values for stride, skip, the vector starting element, and vector length that will be used for this operation. Each pipe has one G register file.
- The index (rB) is a signed value, and the base (rA) register is an unsigned value.
- a typical implementation of the instruction is:
- the vector data is sent from the specified vector register vVS to the Effective Address (EA) in the SRAM.
- the index from the contents of general-purpose register rB is added to the contents of general-purpose register rA to form the effective SRAM address.
- the value in each element vVS is stored in the effective SRAM address only if the corresponding mask bit for that vector element is set to 1.
- the 3-bit P field contains the pipe number which has a value from 0-3. The upper bit of the P field is reserved for future expansion.
- the G field is used to select one of eight local registers that contains the values for stride, skip, repeat, the vector starting element, and vector length that will be used for this operation.
- Each pipe has one G register file.
- The index (rB) is a signed value, and the base (rA) register is an unsigned value.
- a typical implementation of the instruction is:
- the vector data is sent from the specified vector register vVS to the Effective Address (EA) in the SRAM.
- the contents of general-purpose register rA are added to the offset to form the effective SRAM address.
- the value in each element vVS is stored in the effective SRAM address only if the corresponding mask bit for that vector element is set to 1.
- the 3-bit P field contains the pipe number which has a value from 0-3. The upper bit of the P field is reserved for future expansion.
- the G field is used to select one of the eight local registers that contains the values for stride, skip, repeat, the vector starting element, and vector length that will be used for this operation.
- Each pipe has one G register file.
- The offset (O) is a signed value, and the base (rA) register is an unsigned value.
- a typical implementation of the instruction is:
- the vector data is sent from the specified vector register vVS to the Effective Address (EA) in the SRAM.
- the 6-bit signed offset is sign-extended, shifted left six bit positions, and added to the contents of general-purpose register rA to form the effective SRAM address.
- the 3-bit P field contains the pipe number which has a value from 0-3. The upper bit of the P field is reserved for future expansion.
- the G field is used to select one of eight local registers that contains the values for stride, skip, the vector starting element, and vector length that will be used for this operation. Each pipe has one G register file.
- The index (rB) is a signed value, and the base (rA) register is an unsigned value.
- a typical implementation of the instruction is:
- FIG. 35 is a block diagram of a vector memory system according to a preferred embodiment.
- The vector memory system is coupled to the vector pipes 220 to receive read control information and write control information, as well as address information.
- Write data is provided over four 16-bit ports 313, read data over eight 16-bit ports 315, and 64 bits are provided for direct memory access (DMA) data input 311 and output 317.
- the memory system includes 128k bytes of memory organized as 128 banks of single ported memory, each one of which is 512 by 16 bits. (This architecture is discussed below in conjunction with Figure 36.)
- the DMA bus 311, 317 provides single cycle read and write of 256 bytes and supports doublet reads and doublet writes. Eight read accesses per clock and four write accesses per clock are enabled.
- the vector memory system has a four clock cycle latency as also discussed below.
- the vector memory system is coupled to a scalar cache 310, also implemented as SRAM.
- The cache interfaces with the vector memory system over two buses, a 128-bit-wide cache line fill bus 312 and a 32-bit-wide quadlet store bus 314.
- the cache tags 316 are depicted.
- Scalar cache 310 is a 4k byte cache which is four-way set associative. It is a write-through cache with 16 byte lines.
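- The cache geometry follows directly from the three figures given (4k bytes, four ways, 16-byte lines). The sketch below works out the number of sets and splits an address into tag, set index, and byte offset. The use of the low-order address bits for the offset and index is the conventional arrangement and is assumed here; the text does not state it. The names are hypothetical.

```python
CACHE_BYTES = 4 * 1024
LINE_BYTES = 16
WAYS = 4
SETS = CACHE_BYTES // (LINE_BYTES * WAYS)  # 64 sets


def split_address(address):
    """Return (tag, set_index, byte_offset).

    Assumption: conventional indexing, with the offset in the lowest bits
    and the set index directly above it.
    """
    offset = address % LINE_BYTES
    set_index = (address // LINE_BYTES) % SETS
    tag = address // (LINE_BYTES * SETS)
    return tag, set_index, offset
```

- A 16-byte line is 128 bits, which matches the width of the cache line fill bus 312, so a full line can be delivered in a single transfer.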
- the external invalidate interfaces include DMA write operation to reload the vector memory.
- the invalidate sources also include a vector store from any of the vector pipes 0-3.
- FIG. 36 is a more detailed illustration of the vector memory system 30.
- the memory system includes a 256 byte, double buffered DMA shift register 320 and 128 banks of SRAM memory 330.
- the banks of memory are arranged as four groups 332, 334, 336, and 338. Each group includes 32 banks of memory.
- the banks are addressed via a bus 340 with address information supplied over port 345 to retry control 350. The details of the ports and retry control are discussed below.
- The addresses appear on bus 340; however, they pass through a 4-stage pipeline where they are compared with the addresses for each bank. For example, the addresses on bus 340 first pass through stage 342, then second stage 344, then third stage 346, and finally fourth stage 348.
- stage 342 registers the "match" enabling data to be written to or read from the read/write ports of the memory, in a manner explained below.
- Each bank is addressable by a 7-bit address, with two bits designating the group, and five bits designating the bank within that group. Because the address information arriving on bus 340 may address multiple banks within one group, or even the same bank multiple times, within a given period, a retry control 350 is provided.
- The retry control enables a subsequent address directed toward the same bank (which is thus not recognized by the downstream address decoding stages 344, 346 and 348) to be fed back via bus 360 to retry control 350. In this manner the same address can be "retried" against the banks a number of times until the access is granted.
- a retry control line 361 is used to trigger the retry control 350.
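- The retry behavior can be illustrated with a toy simulation. It is a simplified model, not the circuit: each bank is assumed to grant one access per clock, retried requests are considered ahead of new ones, and the four-stage pipeline latency is ignored. The assignment of the two high bits of the 7-bit address to the group is also an assumption. All names are hypothetical.

```python
def split_bank(bank7):
    """Assumption: the two high bits select the group, the five low bits the bank."""
    return bank7 >> 5, bank7 & 0x1F


def simulate_retries(arrivals):
    """Return, per clock, the sorted bank numbers that were granted access.

    arrivals[t] lists the 7-bit bank numbers requested at clock t.
    """
    pending, granted, clock = [], [], 0
    while clock < len(arrivals) or pending:
        new = arrivals[clock] if clock < len(arrivals) else []
        busy, retry = set(), []
        for bank in pending + new:  # retried requests are considered first
            if bank in busy:
                retry.append(bank)  # conflict: feed back for another try
            else:
                busy.add(bank)
        granted.append(sorted(busy))
        pending = retry
        clock += 1
    return granted
```

- For example, two requests to bank 5 and one to bank 9 on the first clock, followed by another request to bank 5 on the second clock, take three clocks to drain: banks 5 and 9 are served first, then bank 5 on each of the next two clocks.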
- the data in the 128 banks of SRAM is loaded and unloaded using a double buffered DMA shift register 320.
- the shift register is loaded and then its contents transferred out in parallel to a buffer.
- the 256 bytes are loaded into the 128 banks in parallel.
- Figure 37 is a block diagram illustrating in more detail one bank 330 in one group of the 128 banks shown in Figure 36.
- the bank can receive addresses, write data, and read/write control signals.
- The signals are decoded by a 12:1 priority encoder 370 using a priority which is discussed below. That circuit enables a 12:1 multiplexer circuit 372 to pass the appropriate information to bank 330.
- Figures 38-45 illustrate the vector memory system in further detail.
- Figure 38 illustrates the store control pipeline, and Figure 39 the load control pipeline, both of which were represented by bus 340 in Figure 36.
- reference numbers have been used corresponding to those in Figure 36.
- the input signals to the multiplexer 360 include DMA write signals, vector pipe write signals, and scalar cache write signals, all as shown.
- (The 2-bit write request signal (Vpipe WRT REQ) for the vector pipe enables writes for the upper byte, the lower byte, or both bytes.)
- multiplexer 360 selects one of these three sets of input data and provides that set of inputs to the multiplexer 364.
- Multiplexer 364 enables the retry control, and will select the retry bus 360 if there has been a bank conflict or collision in the address information earlier provided, for example, if successive writes are to the same bank. If there has been no bank conflict, then the information from multiplexer 360 is placed on the bus 340 and provided to stage 0 (342) for determination about whether that bank address falls within the group of banks 0 - 31 in group 332.
- First priority is always given to retrying information from a previous cycle when a bank conflict has occurred.
- Second priority is assigned to the DMA controller for reloading the banks of memory, as discussed with regard to Figure 36.
- Third priority is given to vector store operations, and lowest priority is given to the write through scalar cache. Once the appropriate store control information is placed on bus 340, it is transferred to the banks based upon the bank address in the manner described with respect to Figure 36.
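- The four-level priority scheme for the store control pipeline can be sketched as a simple fixed-priority selection. The source names used in the tuple are illustrative labels, not signal names from the figures.

```python
# Highest priority first, following the order given in the text.
STORE_PRIORITY = ("retry", "dma_write", "vector_store", "scalar_cache_write")


def select_store_source(requests):
    """Return the highest-priority requesting source, or None if idle.

    requests: a set of source labels (labels are illustrative only).
    """
    for source in STORE_PRIORITY:
        if source in requests:
            return source
    return None
```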
- Figure 39 is a diagram similar to Figure 38, but with a load control pipeline instead of the store control pipeline shown in Figure 38.
- the 3:1 multiplexer 360 receives DMA read requests, vector pipe read requests, and scalar cache read requests, together with associated address information.
- The selected read signals are provided to the second multiplexer 354, which chooses those selected read signals unless a bank conflict has arisen and a retry is required, all in the same manner as discussed with respect to Figure 38.
- the priorities for the load control pipeline in Figure 39 at multiplexer 360 are the same as in Figure 38. In particular, read retries have top priority, followed by DMA read access, vector reads, with scalar cache line fills having lowest priority. (If there has been a miss in the scalar cache, the load pipes are used to refill the cache.)
- Figure 40 is a block diagram illustrating in more detail the load data path from the 128 memory banks 330 (first discussed in conjunction with Figure 36) to the read output terminals.
- a multiplexer 370 selects which bank has information provided as output data.
- the return data buses 390 are illustrated near the right hand side of the diagram.
- the multiplexers are controlled by a bank priority encoder which is discussed below in conjunction with Figure 43.
- FIG. 41 is a block diagram illustrating how the groups 332, 334, 336, and 338 of banks of memory 330 interface with the DMA shift register.
- Shift register 320 is illustrated across the lower portion of the diagram. As shown there, the shift register shifts 64 bits at a time to a 256-byte buffer 372, 374, 376, 378 depicted as a flip-flop for DMA read and write data.
- Each buffer includes a 3:1 multiplexer coupled to the flip-flop to select from data to be written to the banks of memory, data being read from the banks of memory, or data buffered for later writes.
- The shift register is parallel-loaded: it reads all banks and then shifts the data out.
- Figure 42 is a diagram illustrating the input signals provided to one memory bank 330 shown above in other figures.
- the memory bank includes eight load interfaces (designated load 0 - load 7), four store interfaces (designated store 0 - store 3), one DMA read interface and one DMA write interface. All of these are input signals to the memory bank.
- the bank output signal consists of a 16-bit read data output.
- FIG. 43 is a more detailed description of the bank priority encoder 370 shown in block form in Figure 37.
- the bank priority encoder 370 receives the load and store requests together with the DMA requests. The particular encoder is selected by the bank ID. Among all of the groups of input signals, DMA requests have the highest priority, followed by the priorities in the order listed at the lower portion of the figure.
- the output from the bank priority encoder includes bank read and bank write enable signals, select bank index signals, select write data signals, and steer read data signals.
- Figure 44 is a block diagram illustrating details of the bank index multiplexer 372 within a memory bank. This multiplexer was illustrated in block form as multiplexer 372 in Figure 37. As shown in Figure 44, the index multiplexer 372 receives load and store bank index signals for all eight load buses and four store buses. A select bank index control signal selects the 9-bit output signal providing the bank index.
- Figure 45 illustrates the 5:1 multiplexer for selecting the write data for a particular bank. As shown there, the four store buses and the DMA write bus are provided as inputs to the multiplexer. The select write data signal chooses one of the five to provide the bank write data output.
- the particular input and output signals for the memory cells themselves are illustrated. These include the bank read enable, bank write enables (for upper and lower bytes), the bank write data and the bank index.
- the output from the SRAM consists of the bank read data signals.
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Computational Mathematics (AREA)
- Mathematical Analysis (AREA)
- Mathematical Optimization (AREA)
- Pure & Applied Mathematics (AREA)
- Complex Calculations (AREA)
Abstract
A vector processor (20) comprising a set of vector registers (200) for storing data to be used in the execution of instructions, and a vector functional unit (210) coupled to the vector registers (200) for executing the instructions. The functional unit (210) executes the instructions using operation codes provided to it, which contain a field referencing a special register (235). The special register (235) contains information about the length and starting point of each vector instruction. The processor (20) includes a high-speed memory access system (130) for speeding up operations.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/US2005/016485 WO2006121444A1 (fr) | 2005-05-10 | 2005-05-10 | Vector processor with special purpose registers and high speed memory access |
IL187262A IL187262A0 (en) | 2005-05-10 | 2007-11-08 | Vector processor with special purpose registers and high speed memory access |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/US2005/016485 WO2006121444A1 (fr) | 2005-05-10 | 2005-05-10 | Vector processor with special purpose registers and high speed memory access |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2006121444A1 true WO2006121444A1 (fr) | 2006-11-16 |
Family
ID=37396838
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2005/016485 WO2006121444A1 (fr) | Vector processor with special purpose registers and high speed memory access | 2005-05-10 | 2005-05-10 |
Country Status (2)
Country | Link |
---|---|
IL (1) | IL187262A0 (fr) |
WO (1) | WO2006121444A1 (fr) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2014114997A1 (fr) * | 2013-01-23 | 2014-07-31 | International Business Machines Corporation | Vector element rotate and insert under mask instruction |
US9436467B2 (en) | 2013-01-23 | 2016-09-06 | International Business Machines Corporation | Vector floating point test data class immediate instruction |
US9471311B2 (en) | 2013-01-23 | 2016-10-18 | International Business Machines Corporation | Vector checksum instruction |
US9703557B2 (en) | 2013-01-23 | 2017-07-11 | International Business Machines Corporation | Vector galois field multiply sum and accumulate instruction |
US9715385B2 (en) | 2013-01-23 | 2017-07-25 | International Business Machines Corporation | Vector exception code |
US9740482B2 (en) | 2013-01-23 | 2017-08-22 | International Business Machines Corporation | Vector generate mask instruction |
CN114443143A (zh) * | 2022-01-30 | 2022-05-06 | 上海阵量智能科技有限公司 | Instruction processing method, apparatus, chip, electronic device, and storage medium |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4760518A (en) * | 1986-02-28 | 1988-07-26 | Scientific Computer Systems Corporation | Bi-directional databus system for supporting superposition of vector and scalar operations in a computer |
- 2005-05-10: WO application PCT/US2005/016485, published as WO2006121444A1 (fr); status: active, Application Filing
- 2007-11-08: IL application IL187262A, published as IL187262A0 (en); status: unknown
Cited By (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2014114997A1 (fr) * | 2013-01-23 | 2014-07-31 | International Business Machines Corporation | Vector element rotate and insert under mask instruction |
GB2525357A (en) * | 2013-01-23 | 2015-10-21 | Ibm | Vector element rotate and insert under mask instruction |
JP2016510461A (ja) * | 2013-01-23 | 2016-04-07 | International Business Machines Corporation | Computer program, computer system, and method for processing a Vector Element Rotate and Insert Under Mask instruction |
US9436467B2 (en) | 2013-01-23 | 2016-09-06 | International Business Machines Corporation | Vector floating point test data class immediate instruction |
US9471311B2 (en) | 2013-01-23 | 2016-10-18 | International Business Machines Corporation | Vector checksum instruction |
US9471308B2 (en) | 2013-01-23 | 2016-10-18 | International Business Machines Corporation | Vector floating point test data class immediate instruction |
US9513906B2 (en) | 2013-01-23 | 2016-12-06 | International Business Machines Corporation | Vector checksum instruction |
US9703557B2 (en) | 2013-01-23 | 2017-07-11 | International Business Machines Corporation | Vector galois field multiply sum and accumulate instruction |
US9715385B2 (en) | 2013-01-23 | 2017-07-25 | International Business Machines Corporation | Vector exception code |
US9727334B2 (en) | 2013-01-23 | 2017-08-08 | International Business Machines Corporation | Vector exception code |
US9733938B2 (en) | 2013-01-23 | 2017-08-15 | International Business Machines Corporation | Vector checksum instruction |
US9740483B2 (en) | 2013-01-23 | 2017-08-22 | International Business Machines Corporation | Vector checksum instruction |
US9740482B2 (en) | 2013-01-23 | 2017-08-22 | International Business Machines Corporation | Vector generate mask instruction |
US9778932B2 (en) | 2013-01-23 | 2017-10-03 | International Business Machines Corporation | Vector generate mask instruction |
US9804840B2 (en) | 2013-01-23 | 2017-10-31 | International Business Machines Corporation | Vector Galois Field Multiply Sum and Accumulate instruction |
US9823924B2 (en) | 2013-01-23 | 2017-11-21 | International Business Machines Corporation | Vector element rotate and insert under mask instruction |
US9823926B2 (en) | 2013-01-23 | 2017-11-21 | International Business Machines Corporation | Vector element rotate and insert under mask instruction |
US10101998B2 (en) | 2013-01-23 | 2018-10-16 | International Business Machines Corporation | Vector checksum instruction |
US10146534B2 (en) | 2013-01-23 | 2018-12-04 | International Business Machines Corporation | Vector Galois field multiply sum and accumulate instruction |
US10203956B2 (en) | 2013-01-23 | 2019-02-12 | International Business Machines Corporation | Vector floating point test data class immediate instruction |
US10338918B2 (en) | 2013-01-23 | 2019-07-02 | International Business Machines Corporation | Vector Galois Field Multiply Sum and Accumulate instruction |
US10606589B2 (en) | 2013-01-23 | 2020-03-31 | International Business Machines Corporation | Vector checksum instruction |
US10671389B2 (en) | 2013-01-23 | 2020-06-02 | International Business Machines Corporation | Vector floating point test data class immediate instruction |
US10877753B2 (en) | 2013-01-23 | 2020-12-29 | International Business Machines Corporation | Vector galois field multiply sum and accumulate instruction |
CN114443143A (zh) * | 2022-01-30 | 2022-05-06 | 上海阵量智能科技有限公司 | Instruction processing method, apparatus, chip, electronic device, and storage medium |
Also Published As
Publication number | Publication date |
---|---|
IL187262A0 (en) | 2008-02-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20060259737A1 (en) | | Vector processor with special purpose registers and high speed memory access |
US20070150697A1 (en) | | Vector processor with multi-pipe vector block matching |
US7434024B2 (en) | | SIMD processor with register addressing, buffer stall and methods |
US5996057A (en) | | Data processing system and method of permutation with replication within a vector register file |
US6334176B1 (en) | | Method and apparatus for generating an alignment control vector |
JP2006004042A (ja) | | Data processing device |
CN101689107A (zh) | | Method and system for expanding a conditional instruction into an unconditional instruction and a select instruction |
US5083267A (en) | | Horizontal computer having register multiconnect for execution of an instruction loop with recurrance |
WO2006121444A1 (fr) | | Vector processor with special purpose registers and high speed memory access |
US7308559B2 (en) | | Digital signal processor with cascaded SIMD organization |
CN108319559B (zh) | | Data processing apparatus and method for controlling vector memory access |
JP3789583B2 (ja) | | Data processing device |
EP1974254B1 (fr) | | Early conditional selection of an operand |
US11755320B2 (en) | | Compute array of a processor with mixed-precision numerical linear algebra support |
JP2002529847A (ja) | | Digital signal processor having a bit FIFO |
US7143268B2 (en) | | Circuit and method for instruction compression and dispersal in wide-issue processors |
WO2000068783A2 (fr) | | Digital signal processor computation core |
US7134000B2 (en) | | Methods and apparatus for instruction alignment including current instruction pointer logic responsive to instruction length information |
US7107302B1 (en) | | Finite impulse response filter algorithm for implementation on digital signal processor having dual execution units |
US7577824B2 (en) | | Methods and apparatus for storing expanded width instructions in a VLIW memory for deferred execution |
KR20000048531A (ko) | | Input operand control in a data processing apparatus |
US5752271A (en) | | Method and apparatus for using double precision addressable registers for single precision data |
CN111814093A (zh) | | Processing method and processing device for a multiply-accumulate instruction |
JPH1078872A (ja) | | Multiple-instruction parallel issue/execution management device |
CN112463218B (zh) | | Instruction issue control method and circuit, data processing method and circuit |
Legal Events
Code | Title | Description |
---|---|---|
121 | EP: the EPO has been informed by WIPO that EP was designated in this application | |
WWE | WIPO information: entry into national phase | Ref document number: 187262; Country of ref document: IL |
ENP | Entry into the national phase | Ref document number: 2008511091; Country of ref document: JP; Kind code of ref document: A |
NENP | Non-entry into the national phase | Ref country code: DE |
NENP | Non-entry into the national phase | Ref country code: RU |
NENP | Non-entry into the national phase | Ref country code: JP |
122 | EP: PCT application non-entry in European phase | Ref document number: 05754533; Country of ref document: EP; Kind code of ref document: A1 |