WO2006033056A2 - Micro processor device and method for shuffle operations - Google Patents

Micro processor device and method for shuffle operations Download PDF

Info

Publication number
WO2006033056A2
WO2006033056A2 (PCT/IB2005/053019)
Authority
WO
WIPO (PCT)
Prior art keywords
vector
index
processing
memory
elements
Prior art date
Application number
PCT/IB2005/053019
Other languages
French (fr)
Other versions
WO2006033056A3 (en)
Inventor
Cornelis H. Van Berkel
Balakrishnan Srinivasan
Original Assignee
Koninklijke Philips Electronics N.V.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Koninklijke Philips Electronics N.V. filed Critical Koninklijke Philips Electronics N.V.
Priority to JP2007533015A priority Critical patent/JP2008513903A/en
Priority to EP05782929A priority patent/EP1794671A2/en
Priority to CN200580039646.XA priority patent/CN101061460B/en
Publication of WO2006033056A2 publication Critical patent/WO2006033056A2/en
Publication of WO2006033056A3 publication Critical patent/WO2006033056A3/en

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30032Movement instructions, e.g. MOVE, SHIFT, ROTATE, SHUFFLE
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/80Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
    • G06F15/8053Vector processors
    • G06F15/8061Details on data memory access
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30025Format conversion instructions, e.g. Floating-Point to Integer, decimal conversion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30036Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/34Addressing or accessing the instruction operand or the result ; Formation of operand address; Addressing modes
    • G06F9/345Addressing or accessing the instruction operand or the result ; Formation of operand address; Addressing modes of multiple operands or results
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/34Addressing or accessing the instruction operand or the result ; Formation of operand address; Addressing modes
    • G06F9/345Addressing or accessing the instruction operand or the result ; Formation of operand address; Addressing modes of multiple operands or results
    • G06F9/3455Addressing or accessing the instruction operand or the result ; Formation of operand address; Addressing modes of multiple operands or results using stride

Definitions

  • the present invention relates to a micro processor device comprising a vector processor architecture with at least one functional vector processor unit comprising memory means for storing plural index vectors and processing means, the functional vector processor unit being arranged to receive a processing instruction and at least one input vector to be processed, the memory means being arranged to provide the processing means with one of said plural index vectors in accordance with the processing instruction, and the processing means being arranged to generate in response to said instruction at least one output vector having the elements of the at least one input vector rearranged in accordance with the one index vector provided.
  • the invention relates to a method for processing vectors comprising the steps: receiving a processing instruction and at least one input vector to be processed, storing plural index vectors in first memory means, selecting one of said plural index vectors in accordance with the processing instruction, and generating at least one output vector in response to said instruction, the output vector having the elements of the at least one input vector rearranged in accordance with the one index vector provided.
  • such micro processor devices, hereinafter referred to as vector processors, and such a method have been well established for several decades.
  • a vector processor provides vector instruction sets instead of or in addition to scalar instructions as provided by microprocessors employing scalar processor architecture only (as opposed to the above vector or parallel architecture).
  • Each instruction typically specifies operand vector(s) containing plural data words or vector elements, its length (the number of vector elements), and an operation to be applied.
  • the advantage of vector processing is that instead of successively operating on single data words, it allows operating - within one of said vector instructions - on entire vectors at the same time, thereby enhancing calculating speed.
  • the vector processor will continuously fetch the entire vector from an external memory, then the vector will be continuously operated, and finally be continuously stored back to the external memory by another single access.
  • Vector processing therefore is a Single Instruction Multiple Data (SIMD) parallel processing technique.
  • vector instructions specifying operations on one or more vectors such as shifting or shuffling a vector or adding, subtracting, multiplying, or dividing two or more vectors element-wise; vector-scalar instructions specifying operations on a vector and a scalar such as a scalar product; vector-scalar instructions (vector-reductions) specifying operations on one or more vectors and delivering a scalar such as cross product; and vector- memory instructions specifying load and store operations to transfer data between the external memory and the integrated vector register.
  • even more sophisticated instructions can be provided.
  • VMU vector memory unit
  • IDU additional instruction distribution unit
  • vector-memory instructions allowing the processor to access memory according to different access patterns addressing the vector elements in the external memory. If the vector elements are all contiguous in the memory, that is the data words constituting the vector being requested are located in adjacent memory addresses, then fetching the vector from a set of memory banks is an easy task. This access is commonly called a unit-stride access. In some cases, the data words to be fetched are separated within the memory by a definite constant displacement. This is called a strided access or a stride-n access, wherein n denotes the distance of memory-addresses between two neighboring vector elements. In this case the instruction further specifies the stride n in order to allow the VMU to fetch all data in a single memory access.
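The unit-stride and stride-n access patterns described above can be sketched as follows (an illustrative Python model of the access semantics only, not the VMU hardware; the function name `stride_load` is our own):

```python
def stride_load(memory, start, n, length):
    """Stride-n access: consecutive vector elements lie n addresses apart.

    n = 1 is the unit-stride case of contiguous data words.
    """
    return [memory[start + k * n] for k in range(length)]

mem = list(range(100, 132))            # toy external memory contents
unit = stride_load(mem, 0, 1, 4)       # unit-stride: [100, 101, 102, 103]
strided = stride_load(mem, 0, 4, 4)    # stride-4:    [100, 104, 108, 112]
```

In hardware the VMU fetches all `length` words in a single memory access; the list comprehension merely states which addresses that access covers.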
  • data words to be loaded from the memory as elements of a vector and/or to be stored back to the memory are not even separated by a constant displacement but rather located at (pre-calculated) arbitrary positions and/or in arbitrary order in memory.
  • the order of, say, P memory banks of the external memory generally does not match the retrieved/delivered order of the vector elements.
  • In order to allow the processor to access the arbitrarily distributed element data in a single vector-memory instruction, the VMU must further be provided with the addresses indicating all memory locations where the vector elements are stored. This is accomplished by a vector-memory instruction, called "gather instruction", providing an address vector containing the address elements.
  • a so-called “scatter-instruction”, another vector-memory instruction, is provided allowing the processor to store the vector elements to the memory according to a given address vector in a single memory access, too.
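The gather and scatter semantics can be modelled as follows (a minimal Python sketch assuming a flat word-addressed memory; the names are illustrative, not the patent's instruction mnemonics):

```python
def gather(memory, address_vector):
    # load one vector element from each (arbitrary, pre-calculated) address
    return [memory[a] for a in address_vector]

def scatter(memory, address_vector, vector):
    # store each vector element back to its (arbitrary) address
    for a, x in zip(address_vector, vector):
        memory[a] = x

mem = list(range(200, 216))
v = gather(mem, [3, 0, 7, 12])   # -> [203, 200, 207, 212]
scatter(mem, [1, 2], [0, 0])     # mem[1] and mem[2] are now 0
```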
  • the functional unit described in the opening paragraph is hereinafter also referred to as shuffle unit (SFU).
  • Programming the shuffle unit involves providing it with the above index vector containing a "shuffle pattern". Each element in the shuffle pattern specifies the position of the source element.
  • a first register 110 provides a start address (100) of the memory.
  • a second register 120 provides the length (4) of the vector to be loaded.
  • the shuffle pattern or index vector 130 specifies for example, value 4 at position 1.
  • the content of element 4 of the input (vector) which in this case is the fourth element in the memory 140 after the start address, must be copied to position 1 of the output vector 150, which is the vector requested by the program, and so on.
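The element-wise copy described above amounts to the following (a minimal sketch; zero-based positions are assumed, whereas the figure's example appears to count positions from one):

```python
def shuffle(input_vector, index_vector):
    # each element of the index vector (the shuffle pattern) names the source
    # position in the input vector to be copied to that output position
    return [input_vector[src] for src in index_vector]

out = shuffle(['a', 'b', 'c', 'd'], [3, 0, 1, 2])  # -> ['d', 'a', 'b', 'c']
```

Note that an index value may appear more than once, so a shuffle can also duplicate elements.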
  • the appropriate shuffle pattern, and more precisely the processing instruction input into the SFU is determined by the least significant bits of the address elements in the address vector.
  • the shuffle pattern is directly determined by the code.
  • several shuffle patterns are stored near the shuffle processing means in dedicated shuffle memory means.
  • the SFU 200 comprises an array 210 of P multiplexers 212, 214,..., 216 each with P inputs for the input vector 240 to be processed and with one input for the assigned element of the index vector or shuffle pattern.
  • the shuffle pattern is chosen from the shuffle memory 220 according to an input instruction 222, for example, extracted from the address vector or the program code for rearrangement of the input vector elements in order to obtain the requested output vector 250.
  • Alternative realizations can be based on switch matrices or the like.
  • the data words are fetched by the VMU and afterwards rearranged again in order to comply with the position of the requested address. Again, the appropriate shuffle pattern can be obtained from the least significant bits of the address elements.
  • a shuffle according to the pre-computed shuffle pattern can be applied to obtain data in the proper order.
  • few shuffle patterns suffice for a given application, such as scatter or gather memory access or a computer code, assuming that access patterns to the memory banks can be reused without modification. If such a restriction, however, were lifted, either the number of shuffle patterns would increase dramatically, or a shuffle pattern would have to be preceded by a rotation operation, or the like. The former could lead to a lot of additional memory traffic; the latter would cost an additional operation cycle and hence calculation speed.
  • The object of the present invention therefore is to advance the above micro processor in order to minimize the number of shuffle configuration patterns without suffering a decrease in calculating speed.
  • this object is achieved by a vector processor as mentioned in the opening paragraph wherein said functional vector processor further comprises pre-processing means arranged to receive a parameter and to process the elements of the one index vector dependent on said parameter before generating said at least one output vector in accordance with the processed index vector.
  • a vector processor with FUs, e.g. SFUs, comprising such pre-processing means combines the functionality of shuffling elements in a data (or address) vector according to a prescribed shuffle pattern with the functionality of further processing the data vector as indicated by said parameter, which preferably is a scalar value.
  • Whereas conventionally a shuffling operation and a further data processing operation require two successive steps and two switch networks, each having its own control, or two reuses of one network, shuffle and further data processing according to the present invention can be done in a single step and the control can be combined into a single control step.
  • Since actually the index vector itself is processed, and since both the index vector and the parameter, as in many architectures, arrive one or more clock cycles prior to the input data vectors, the processing of the index vector can be executed in advance. Finally, the shuffling of the data vector on the basis of the (pre-)processed index vector can be executed in a single clock cycle.
  • a vector processor according to the invention can be used to accelerate a large class of algorithms, in particular in combination with scatter-gather memory access, thereby keeping the additional memory traffic at a low level.
  • the (pre-)processing of the index vector generally could be any arithmetic or logical operation on the index vector and the parameter or scalar value.
  • bit-manipulation operations can be performed in a single step.
  • Bit manipulation in this context denotes operations like those described in connection with broadband media processors by Craig Hansen in "MicroUnity's MediaProcessor Architecture", IEEE Micro, August 1996, p. 36-38.
  • These generalized switching instructions alter the arrangement of vector elements in different manners. Thereby, many commonly required re-arrangements are performed in a single instruction, and even arbitrary re-arrangements can be obtained by a sequence of three instructions.
  • Group-shuffle, group- swizzle, group-extract, group-compress, group-deposit, group-merge-deposit, group- withdraw, group-shift, and group-rotate operations are some examples of such single instruction operations.
  • Several parameters are decoded from an immediate field of the specific instruction, exactly specifying the "degree" of re-arrangement.
  • three instruction parameters generally specify the size (in bits) over which vectors are shuffled, the size of the vectors and the degree of shuffle. In other instructions the specific number of parameters may be reduced.
  • said functional vector processor unit further comprises second memory means for storing plural parameters, said second memory means being arranged to provide said pre-processing means with one of said parameters in accordance with the processing instruction.
  • These second memory means are also referred to as offset memory. This can be useful when the parameter, being a scalar value, is not a compile-time constant.
  • a scalar processing unit that operates in parallel with the vector unit can compute these "offsets" and store them in the offset memory indicated.
  • pre-processing means are arranged to receive as parameter a scalar value having a sign and to process the elements of said index vector dependent on said scalar value and said sign. This allows for a greater variety of resulting index vectors without enhancing the number of pre-processing functionalities.
  • the said pre-processing means are arranged to execute a modulo addition of each element of the one index vector and said parameter.
  • Processing the one index vector includes adding to each element a constant value modulo P (preferably the length of the vector) which, consequently, results in a combined shuffle and rotate operation on the data vector.
  • the parameter or scalar value hereinafter is also referred to as rotation offset (L).
  • This implementation takes into account that shuffle patterns used in a typical application are correlated and often are rotations of a previous shuffle pattern. Note that in many applications, the rotation offsets also are reused multiple times in succession. Therefore, particularly by integrating the rotation functionality into the SFU, the additional memory traffic can be kept at a low level.
  • These second memory means in case of a rotation operation are referred to as rotation-offset memory.
  • a negative signed L could denote a right rotation direction.
  • a positive signed L could denote a left rotation direction
  • Rotation of an input vector is a special case of a shuffle. Therefore, a left rotation by a number of +L places, with L < P, can be specified by the shuffle pattern [L, L+1, L+2, ..., P-1, 0, 1, ..., L-1] obtained by pre-processing the shuffle pattern [0, 1, 2, ..., P-1], which maps each input vector element on the same position in the processed output vector. More precisely, the pre-processed shuffle pattern can be denoted as [(0+L) modulo P, (1+L) modulo P, (2+L) modulo P, ..., (P-1+L) modulo P].
  • Because the SFU according to the present invention increments the elements of the index vector or shuffle pattern by the value of the rotation offset L (note that this pre-increment must operate modulo P), the number of individual index vectors in the shuffle memory means can be drastically reduced, namely by a factor of P, without suffering a decrease of calculating speed.
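The pre-increment can be sketched as follows (a Python model with position-ascending index lists, so the identity pattern reads [0, 1, ..., P-1] rather than the high-to-low listing order used above):

```python
P = 8

def pre_rotate(index_vector, L):
    # modulo addition of the rotation offset L to every index element
    return [(i + L) % P for i in index_vector]

def shuffle(data, idx):
    return [data[src] for src in idx]

identity = list(range(P))      # the single stored pattern [0, 1, ..., 7]
data = list('abcdefgh')
rotated = shuffle(data, pre_rotate(identity, 3))
# left rotation by 3: ['d', 'e', 'f', 'g', 'h', 'a', 'b', 'c']
```

One stored identity pattern plus P offset values thus replaces P stored rotation patterns.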
  • said pre-processing means are arranged to execute a saturated addition of each element of said one index vector and said parameter. This saturated addition of the index vector elements and the parameter or scalar value results in a combined shuffle and shift operation on the input data or address vector.
  • Shift left/right by L positions can be seen as a special case of rotate left/right where the L vacant positions at the right or left side (depending on the shifting direction) are filled with a pre-set constant, e.g. the value 0.
  • source-index values -1 and P-1 refer to the pre-set constant.
  • A value of -1 indicates that the corresponding element in the target register should not be over-written.
  • the index values can range from -1 to P-1.
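One plausible reading of the saturated addition with the -1 sentinel is sketched below (an assumption on our part: indices that leave the range 0..P-1 collapse to -1, which selects the pre-set constant):

```python
P = 8
FILL = 0  # pre-set constant for the vacant positions

def pre_shift(index_vector, L):
    # saturated addition: out-of-range sums are pinned to the sentinel -1
    return [i + L if 0 <= i + L < P else -1 for i in index_vector]

def shuffle_with_fill(data, idx):
    return [FILL if src == -1 else data[src] for src in idx]

data = list(range(10, 18))
shifted = shuffle_with_fill(data, pre_shift(list(range(P)), 3))
# shift by 3: [13, 14, 15, 16, 17, 0, 0, 0] - vacancies filled with FILL
```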
  • pre-processing means are arranged to execute an XOR operation on each element of said one index vector and said parameter.
  • This operation on the index vector with a subsequent shuffling operation of the input (data) vector can be used to achieve "butterfly shuffling" operations as will be explained herein below.
  • Butterfly shuffling is used extensively in kernels like the FFT, DCT and FHT (Fast Hadamard Transform).
  • the (step) size or increment of the butterfly operations usually changes from stage to stage.
  • An implementation according to this aspect of the invention is advantageous since the increment is directly determined by the scalar input value that is the operand in the XOR. operation executed on each index vector element. Otherwise, a new shuffle pattern would have to be loaded each time the size of the butterfly changes leading to more data traffic.
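The XOR pre-processing can be sketched as follows; the scalar operand directly sets the butterfly increment, so no new shuffle pattern needs to be loaded between stages (illustrative Python with P = 8):

```python
P = 8

def pre_butterfly(index_vector, s):
    # XOR the scalar s into every index element on the binary level
    return [i ^ s for i in index_vector]

identity = list(range(P))
small = pre_butterfly(identity, 1)  # [1, 0, 3, 2, 5, 4, 7, 6]: pairwise swap
wide = pre_butterfly(identity, 2)   # [2, 3, 0, 1, 6, 7, 4, 5]: swap in pairs
```

Changing the butterfly size from stage to stage therefore only requires a new scalar operand, not a new index vector from the shuffle memory.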
  • the above object is further achieved by a method as mentioned in the opening paragraph in which the further steps of receiving a parameter and processing the elements of said one index vector dependent on said parameter are executed prior to the step of generating said at least one output vector.
  • the main application area is vector processing as it is applied, for example, in CVP (Philips Research); see C.H. (Kees) van Berkel, Patrick P.E. Meuwissen, Nur Engin, and S. Balakrishnan, "CVP: A Programmable Co-Vector Processor for 3G Mobile Baseband Processing," in Proceedings of the World Wireless Congress 2003; and in OnDSP (PS-Dresden, formerly Systemonic) and in EVP (PS DSP Innovation Centre).
  • the above invention can speed up a number of signal processing kernels drastically. This is especially true for applications that are (close to being) memory bound and that have irregular access patterns. Examples include: video codecs, FFT, format conversion, interleaving, etc.
  • Fig. 1 shows the functionality of a state-of-the-art shuffling unit (without pre-rotation);
  • Fig. 2 is a block diagram showing a state-of-the-art implementation of a shuffling unit (SFU);
  • Fig. 3 is a block diagram showing an implementation of a FU according to the present invention with pre-rotation capability;
  • Fig. 4 is a block diagram showing the control of the FU according to Fig. 3;
  • Fig. 5 is a section of a program code extracted from an implementation of the Golay correlator;
  • Fig. 6 shows an illustration of the memory access according to the code of Fig. 5;
  • Fig. 7 is another demonstration for the applicability of combined shuffle and rotate functionality by means of a program code for bit-reversed accesses in fast Fourier Transformation (FFT);
  • FFT fast Fourier Transformation
  • Fig. 8 shows the bit-reversed permutation for a 32 point FFT memory access pattern
  • Figs. 9A and 9B illustrate two butterfly shuffling operations with different sizes executed on the same 32-element input vector.
  • a FU 300 with pre-rotation capability according to the embodiment of the present invention schematically shown in Fig. 3 comprises processing means, more precisely an array 310 of P multiplexers 312, 314, ..., 316, depicted as parallel devices but which can also be implemented by fewer devices (down to one) and serialized processing steps.
  • Each multiplexer has P inputs corresponding to the P elements of the input vector 320.
  • One further input is provided for the assigned element of the index vector or shuffle pattern that is chosen from memory means of the FU 300, the shuffle memory 320, according to an input instruction 322.
  • Before the index vector elements are input to the multiplexers 312, 314, 316, they are subjected to a modulo addition by pre-processing means, more precisely by a combiner 360 consisting of several (P) modulo adders.
  • the direction and the magnitude of rotation are determined by a scalar input 332.
  • the input instruction 322 as well as the scalar input 332 are extracted, for example, from an address vector or a program code.
  • After the index vector itself has been processed, it is input for shuffling the input vector 340 into the multiplexer array 310 in order to obtain the requested output vector 350.
  • the block diagram of Fig. 4 shows in a more generalized way that the combined shuffle and data processing (rotation, shift, butterfly, etc. operation) of the input data or address vector 440 executed by the FU can be initiated by a single control step.
  • the pre-processing means 460 receives the shuffle pattern 430 as well as the scalar input 432 and outputs a single instruction vector to the processing means 410. This instruction vector contains the pre-processed shuffle pattern.
  • Fig. 5 shows a section of program code extracted from an implementation of the Golay correlator.
  • the Golay correlator, for example, is used in 3rd generation mobile technology for cell search procedures using a hierarchical correlation sequence for the primary synchronization code (PSC).
  • vectors of 8 (complex) elements are fetched by four memory accesses (read1 through read4), whereby the access vector at ptr (assumed aligned) is aligned again after the fourth increment because the rotation offset equals two.
  • The memory access according to the code of Fig. 5 is illustrated in Fig. 6. Therein, two consecutive vector locations in memory 610, 620, 630, 640 are shown on the left and the corresponding output from the memory (the input vector from the point of view of the SFU) on the right 611, 621, 631, 641. The final shuffle-rotated data vector 612, 622, 632, 642 is depicted directly below the memory output.
  • In the first memory access (read1) no rearrangement is required, which can be achieved using an index vector with element values being equal to their positions, whereas each following memory access requires a plain rotation.
  • the shuffle pattern [7, 6, 5, 4, 3, 2, 1, 0] used for a first memory access without shuffle (read1) leads to a first output vector 612 with elements having the same order as obtained from memory 610. Then, the same shuffle pattern, on the fly, is subjected to a modulo 8 addition with a scalar value of 6, thereby providing a shuffle pattern [5, 4, 3, 2, 1, 0, 7, 6]. This pre-processed shuffle pattern rotates the input data vector 621 obtained from memory access (read2) left by two elements in order to obtain the output data vector 622.
  • the initial shuffle pattern [7, 6, 5, 4, 3, 2, 1, 0] is subjected to a modulo 8 addition by a scalar value of 4, thereby providing the shuffle pattern [3, 2, 1, 0, 7, 6, 5, 4].
  • This shuffle pattern is used to shuffle and (actually plain) rotate the input vector 631 obtained by the next memory access (read3) to the output vector 632 with elements being rotated left by four.
  • the final input vector 641 obtained by memory access (read4) is rotated left by 6 elements using the shuffle pattern [1, 0, 7, 6, 5, 4, 3, 2], which results from the modulo addition of the initial shuffle pattern and the scalar value of 2.
  • a single shuffle pattern was used along with suitable "rotate" values.
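The reuse of the single pattern with the offsets 0, 6, 4 and 2 can be reproduced as follows (a Python sketch with position-ascending index lists, so the printed patterns appear mirrored relative to the high-to-low listing used above):

```python
P = 8
base = list(range(P))  # the single shuffle pattern, reused for every access

def pre_rotate(idx, L):
    # on-the-fly modulo addition of the rotation offset
    return [(i + L) % P for i in idx]

patterns = {read: pre_rotate(base, off)
            for read, off in [('read1', 0), ('read2', 6),
                              ('read3', 4), ('read4', 2)]}
# read2 -> [6, 7, 0, 1, 2, 3, 4, 5], read3 -> [4, 5, 6, 7, 0, 1, 2, 3], ...
```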
  • bit-reversed permutation is performed on an input data array before the FFT is performed. It is also known that bit-reversed permutations can be performed using shuffle functionality. In such a bit reversal the input data are re-organized on a binary level, i.e. in bit-reversed order, utilizing a function such as "permute_bitrev" shown in Fig. 7.
  • the input in this example is assumed to consist of two arrays of the complex data, one containing the real part and the other containing the imaginary part. In the example vector processor, however, the input data is stored as a single array of complex numbers with the real and imaginary parts of each complex number stored in adjacent memory locations.
  • In Fig. 8, a bit-reversal access pattern for a 32 point FFT is shown, together with how this can be improved according to the invention utilizing shuffle operations with pre-rotation. Bit reversal can be achieved very efficiently in an architecture that supports the "gather" operation described earlier. Regarding Fig. 8, it is assumed that a vector can hold a maximum of eight complex data elements and the number of memory banks (vector index) is eight, wherein a memory location in a bank can store one complex data element.
  • the access patterns consist of a (vector number, memory bank number) tuple.
  • a more intelligent way of organizing data in this case is to rotate the starting address of successive vector elements by one memory bank. This would then involve wrapping around the data as illustrated in Table 1 below in order to write the data into a vector.
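The 32-point bit-reversed gather addresses can be sketched as follows; the skew rule at the end is our assumption of what Table 1 illustrates (rotating the starting bank of each successive vector by one), not a verbatim reproduction:

```python
def bit_reverse(i, bits=5):
    # reverse the low `bits` bits of i (5 bits for a 32 point FFT)
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

perm = [bit_reverse(i) for i in range(32)]
# first gather address vector: [0, 16, 8, 24, 4, 20, 12, 28]

def skewed_bank(vector_no, element_no, banks=8):
    # hypothetical skew: rotate the starting bank by one per vector,
    # wrapping around, so successive vectors do not collide on a bank
    return (element_no + vector_no) % banks
```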
  • Table 1 Skewing input data to avoid memory bank conflicts in the 32 point FFT bit-reversal scheme
  • Another embodiment of the present invention utilizes combined butterfly and shuffling operations as illustrated in Figs. 9A and 9B.
  • An input vector 911 with 32 elements is assumed in these examples.
  • a plain butterfly operation is applied.
  • the index vector (not shown) initially supplied from the shuffle memory in this example has the content [31, 30, ..., 1, 0] mapping each element of an input vector on the same position in an output vector. Any other initial shuffle pattern (or index vector) can be used as well.
  • the index vector is subjected to a pre-processing. More precisely, according to Fig. 9A the corresponding pre-processing means execute XOR operations on each element of the index vector on a binary level.
  • the second operand in the case of Fig. 9A is an input scalar value of one.
  • the XOR operation on the index vector results in input (data) vector elements pair- wise swapping places.
  • the size or increment of the butterfly operation executed on the input vector in this case is one since each input vector element only jumps one position.
  • the operand value equals four.
  • the size of the butterfly operation executed on the input vector in this case is four since each input vector element jumps four positions.
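The claim that the element displacement equals the XOR operand can be checked directly for the 32-element case of Figs. 9A and 9B (a sketch; power-of-two operands are assumed):

```python
P = 32

def butterfly_indices(s):
    # XOR pre-processing of the identity index vector with scalar s
    return [i ^ s for i in range(P)]

# scalar 1 (Fig. 9A): every element moves exactly one position
jumps_1 = [abs((i ^ 1) - i) for i in range(P)]
# scalar 4 (Fig. 9B): every element moves exactly four positions
jumps_4 = [abs((i ^ 4) - i) for i in range(P)]
```

XOR with a power of two flips a single bit, so the displacement is always that power of two, matching the butterfly sizes of one and four in the figures.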
  • a micro processor device comprising a vector processor architecture with at least one functional vector processor unit may be provided with several different and/or identical pre-processing means arranged to receive the same and/or different parameters or scalar values and to process the elements of the index vector subsequently and/or in parallel. In the case of several parameters they can be decoded from a single or different instructions and/or can be derived from the same or different second memory means.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • Complex Calculations (AREA)
  • Advance Control (AREA)
  • Executing Machine-Instructions (AREA)

Abstract

The present invention relates to a micro processor device comprising a vector processor architecture with a functional vector processor unit comprising first memory means for storing plural index vectors and processing means, the functional vector processor unit being arranged to receive a processing instruction and at least one input vector to be processed, said first memory means being arranged to provide the processing means with one of said plural index vectors in accordance with the processing instruction, and the processing means being arranged to generate in response to said instruction at least one output vector having the elements of the at least one input vector rearranged in accordance with the one index vector provided. The functional vector processor unit further comprises pre-processing means arranged to receive a parameter and to process the elements of the one index vector dependent on said parameter before generating said at least one output vector in accordance with the processed index vector. The invention further relates to a method for processing vectors with such a functional vector-processing unit.

Description

Micro processor device and method for shuffle operations
The present invention relates to a micro processor device comprising a vector processor architecture with at least one functional vector processor unit comprising memory means for storing plural index vectors and processing means, the functional vector processor unit being arranged to receive a processing instruction and at least one input vector to be processed, the memory means being arranged to provide the processing means with one of said plural index vectors in accordance with the processing instruction, and the processing means being arranged to generate in response to said instruction at least one output vector having the elements of the at least one input vector rearranged in accordance with the one index vector provided. Correspondingly the invention relates to a method for processing vectors comprising the steps: receiving a processing instruction and at least one input vector to be processed, storing plural index vectors in first memory means, selecting one of said plural index vectors in accordance with the processing instruction, and generating at least one output vector in response to said instruction, the output vector having the elements of the at least one input vector rearranged in accordance with the one index vector provided. Such micro processor devices, hereinafter referred to as vector processors, and such methods have been well established for several decades. A vector processor provides vector instruction sets instead of or in addition to scalar instructions as provided by microprocessors employing a scalar processor architecture only (as opposed to the above vector or parallel architecture). Each instruction typically specifies operand vector(s) containing plural data words or vector elements, its length (the number of vector elements), and an operation to be applied.
The advantage of vector processing is that instead of successively operating on single data words, it allows operating - within one of said vector instructions - on entire vectors at the same time, thereby enhancing calculating speed. For example, by a single memory access, the vector processor will continuously fetch the entire vector from an external memory, then the vector will be continuously operated on, and finally be continuously stored back to the external memory by another single access. Vector processing therefore is a Single Instruction Multiple Data (SIMD) parallel processing technique. The calculating speed can be enhanced even further by using a vector register architecture which allows the processor to keep intermediate results in its integrated vector register close to the other functional units of the processor, thereby reducing temporary storage requirements, inter-instruction latency, etc.
There are many different types of vector instructions some of which can be classified as: vector-vector instructions specifying operations on one or more vectors such as shifting or shuffling a vector or adding, subtracting, multiplying, or dividing two or more vectors element-wise; vector-scalar instructions specifying operations on a vector and a scalar such as a scalar product; vector-reduction instructions specifying operations on one or more vectors and delivering a scalar such as a cross product; and vector-memory instructions specifying load and store operations to transfer data between the external memory and the integrated vector register. Depending on the task, even more sophisticated instructions can be implemented, for example, specifying array operations producing an array from vectors, or the like. Also, instructions specifying logical operations on vectors are possible. In recent developments several vector instructions are packed in a single instruction word and executed in parallel, thereby once again enhancing the processing speed. This processing type is called the Very Long Instruction Word (VLIW) parallel processing technique.
All of these instructions are executed by corresponding functional units (FU). For example, the FU responsible for vector-memory instructions, referred to as vector memory unit (VMU), contains the integrated vector register, and processing means for receiving and implementing the vector-memory instructions. According to the instruction as described below, these processing means will load the requested vector elements from the external memory into its vector register or store vector elements into the external memory. Normally, the VMU is the only FU connected to the "external world" outside the processor. Integrated in VLIW capable vector processors there is an additional instruction distribution unit (IDU) for receiving the very long instruction word, and sequencing and distributing the instructions to other functional units.
There are several vector-memory instructions allowing the processor to access memory according to different access patterns addressing the vector elements in the external memory. If the vector elements are all contiguous in the memory, that is, the data words constituting the requested vector are located at adjacent memory addresses, then fetching the vector from a set of memory banks is an easy task. This access is commonly called a unit-stride access. In some cases, the data words to be fetched are separated within the memory by a definite constant displacement. This is called a strided access or a stride-n access, wherein n denotes the distance in memory addresses between two neighboring vector elements. In this case the instruction further specifies the stride n in order to allow the VMU to fetch all data in a single memory access.
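As an illustration (not taken from the patent text; the function name is our own), the address sequences generated by unit-stride and stride-n accesses can be sketched as:

```python
def stride_addresses(start, n, count):
    # Addresses touched by a stride-n vector-memory access;
    # n = 1 yields the unit-stride (contiguous) case.
    return [start + i * n for i in range(count)]
```

For example, a stride-3 access of four elements starting at address 100 touches the addresses 100, 103, 106, 109.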
However, sometimes data words to be loaded from the memory as elements of a vector and/or to be stored back to the memory are not even separated by a constant displacement but rather located at (pre-calculated) arbitrary positions and/or in arbitrary order in memory. Also, the order of say P memory banks of the external memory does generally not match the retrieved/delivered order of the vector elements. In order to allow the processor to access the arbitrarily distributed data elements in a single vector-memory instruction the VMU must further be provided with the addresses indicating all memory locations where the vector elements are stored. This is accomplished by a vector-memory instruction, called a "gather instruction", providing an address vector containing the address elements. Accordingly, a so-called "scatter instruction", another vector-memory instruction, is provided allowing the processor to store the vector elements to the memory according to a given address vector in a single memory access, too. In the case where data words are located at arbitrary positions in memory, the functional unit described in the opening paragraph, hereinafter also referred to as shuffle unit (SFU), is required to rearrange the data elements obtained from the memory. Programming the shuffle unit involves providing it with the above index vector containing a "shuffle pattern". Each element in the shuffle pattern specifies the position of the source element. In Fig. 1 an illustration of a gather instruction is given. Therein a first register 110 provides a start address (100) of the memory. A second register 120 provides the length (4) of the vector to be loaded. The shuffle pattern or index vector 130 specifies, for example, value 4 at position 1.
Hence, the content of element 4 of the input (vector), which in this case is the fourth element in the memory 140 after the start address, must be copied to position 1 of the output vector 150, which is the vector requested by the program, and so on.
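The gather-and-shuffle behaviour described with reference to Fig. 1 can be sketched as follows. This is a simplified, zero-based behavioural model; the function and variable names are illustrative only and do not appear in the patent. Output position i receives the fetched element selected by element i of the shuffle pattern:

```python
def gather_shuffle(memory, start, length, shuffle_pattern):
    # Fetch `length` contiguous words beginning at `start`, then place the
    # fetched element named by shuffle_pattern[i] at position i of the output.
    fetched = memory[start:start + length]
    return [fetched[i] for i in shuffle_pattern]
```

With the identity pattern the fetched vector is passed through unchanged; any other pattern rearranges the fetched elements in a single conceptual step.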
In an alternative shuffling scheme, instead of specifying the source position as in the above example, the destination position of an element within the vector would be provided. This is, however, less general.
In scatter or gather operations the appropriate shuffle pattern, and more precisely the processing instruction input into the SFU, is determined by the least significant bits of the address elements in the address vector. In other applications, for example, a software code according to which it may be required to read a segment (of size P) of an FFT input block in bit-reversed address order, the shuffle pattern is directly determined by the code. In case of the above vector processor, several shuffle patterns are stored near the shuffle processing means in dedicated shuffle memory means. An example of a known hardware realization of a SFU is given schematically in Fig. 2. The SFU 200 comprises an array 210 of P multiplexers 212, 214,..., 216 each with P inputs for the input vector 240 to be processed and with one input for the assigned element of the index vector or shuffle pattern. The shuffle pattern is chosen from the shuffle memory 220 according to an input instruction 222, for example, extracted from the address vector or the program code for rearrangement of the input vector elements in order to obtain the requested output vector 250. Alternative realizations can be based on switch matrices or the like.
After the address vector elements are rearranged by the SFU the data words are fetched by the VMU and afterwards rearranged again in order to comply with the position of the requested address. Again, the appropriate shuffle pattern can be obtained from the least significant bits of the address elements.
In this way, a shuffle according to the pre-computed shuffle pattern can be applied to obtain data in the proper order. Typically, few shuffle patterns suffice for a given application, such as scatter or gather memory access or a computer code, assuming that access patterns to the memory banks can be reused without modification. If such a restriction were lifted, however, either the number of shuffle patterns would increase dramatically, or a shuffle pattern would have to be preceded by a rotation operation, or the like. The former could lead to a lot of additional memory traffic; the latter costs an additional operation cycle and hence calculation speed.
The object of the present invention therefore is to advance the above micro processor in order to minimize the number of shuffle configuration patterns without suffering a decrease in calculating speed.
According to a first aspect of the invention this object is achieved by a vector processor as mentioned in the opening paragraph wherein said functional vector processor further comprises pre-processing means arranged to receive a parameter and to process the elements of the one index vector dependent on said parameter before generating said at least one output vector in accordance with the processed index vector.
A vector processor with FUs, e.g. SFUs, comprising such pre-processing means combines the functionality of shuffling elements in a data (or address) vector according to a prescribed shuffle pattern, and the functionality of further processing the data vector indicated by said parameter which preferably is a scalar value. While in state-of-the-art hardware implementations a shuffling operation and a further data processing (reordering or the like) require two successive steps and two switch networks each having its own control, or two network reuses, shuffle and further data processing according to the present invention can be done in a single step and the control can be combined into a single control step. This is possible because the index vector itself is processed and, since both the index vector and the parameter, as in many architectures, arrive one or more clock cycles prior to the input data vectors, the processing of the index vector can be executed in advance. Finally, the shuffling of the data vector on the basis of the (pre-)processed index vector can be executed in a single clock cycle. A vector processor according to the invention can be used to accelerate a large class of algorithms, in particular in combination with scatter-gather memory access, thereby keeping the additional memory traffic at a low level. The (pre-)processing of the index vector generally could be any arithmetic or logical operation on the index vector and the parameter or scalar value. By employment of the present invention, in principle, even bit-manipulation operations can be performed in a single step. Bit manipulation in this context denotes operations like those described in connection with broadband media processors by Craig Hansen in "MicroUnity's MediaProcessor Architecture", IEEE Micro, August 1996, p. 36-38. These generalized switching instructions alter the arrangement of vector elements in different manners.
Thereby, many commonly required re-arrangements are performed in a single instruction and even arbitrary re-arrangements can be obtained by a sequence of three instructions. Group-shuffle, group-swizzle, group-extract, group-compress, group-deposit, group-merge-deposit, group-withdraw, group-shift, and group-rotate operations are some examples of such single instruction operations. Several parameters are decoded from an immediate field of the specific instruction, exactly specifying the "degree" of re-arrangement. In the case of group-shuffle instructions, for example, three instruction parameters generally specify the size (in bits) over which vectors are shuffled, the size of the vectors and the degree of shuffle. In other instructions the specific number of parameters may be reduced.
According to a second aspect of the invention which constitutes a further development of the first aspect said functional vector processor unit further comprises second memory means for storing plural parameters, said second memory means being arranged to provide said pre-processing means with one of said parameters in accordance with the processing instruction. These second memory means are also referred to as offset memory. This can be useful when the parameter, being a scalar value, is not a compile-time constant. A scalar processing unit that operates in parallel with the vector unit can compute these "offsets" and store them in the indicated offset memory. According to a third aspect of the invention which constitutes a further development of the first or second aspects said pre-processing means are arranged to receive as parameter a scalar value having a sign and to process the elements of said index vector dependent on said scalar value and said sign. This allows for a greater variety of resulting index vectors without enhancing the number of pre-processing functionalities.
According to a fourth aspect of the invention which constitutes a further development of any one of the first to third aspects the said pre-processing means are arranged to execute a modulo addition of each element of the one index vector and said parameter.
Processing the one index vector according to this aspect includes adding to each element a constant value modulo P (preferably the length of the vector) which, consequently, results in a combined shuffle and rotate operation on the data vector. In these cases, the parameter or scalar value hereinafter is also referred to as rotation offset (L). This implementation takes into account that shuffle patterns used in a typical application are correlated and often are rotations of a previous shuffle pattern. Note that in many applications, the rotation offsets also are reused multiple times in succession. Therefore, particularly by integrating the rotation functionality into the SFU the additional memory traffic can be kept at a low level. These second memory means in the case of a rotation operation are referred to as rotation-offset memory.
In a combination of modulo addition according to the fourth aspect and a signed scalar value according to the third aspect, for example, a negatively signed L could denote a right rotation direction, a positively signed L could denote a left rotation direction, and L=0 denotes zero rotation of the input vector (if present, without reloading the rotation-offset memory). Rotation of an input vector is a special case of a shuffle. Therefore, a left rotation by a number of +L places, with L<P, can be specified by the shuffle pattern [L, L+1, L+2, ..., P-1, 0, 1, ..., L-1] obtained by pre-processing the shuffle pattern [0, 1, 2, ..., P-1] which maps each input vector element on the same position in the processed output vector. More precisely, the pre-processed shuffle pattern can be denoted as [(0+L) modulo P, (1+L) modulo P, (2+L) modulo P, ..., ((P-1)+L) modulo P]. Since the SFU according to the present invention (pre-)increments the elements of the index vector or shuffle pattern with the value of the rotation offset L - note that this pre-increment must operate modulo P - the number of individual index vectors in the shuffle memory means can be drastically reduced, namely by a factor of P, without suffering a decrease in calculating speed. According to a fifth aspect of the invention which constitutes a further development of any one of the first to third aspects said pre-processing means are arranged to execute a saturated addition of each element of said one index vector and said parameter. This saturated addition of the index vector elements and the parameter or scalar value results in a combined shuffle and shift operation on the input data or address vector. Shift left/right for L positions can be seen as a special case of rotate left/right where the L vacant positions at the right or left side (depending on the shifting direction) are filled with a pre-set constant, e.g. value 0.
This is achieved by replacing the modulo addition/subtraction by said saturated addition/subtraction, where source-index values -1 and P-1 refer to the pre-set constant. A value of -1 indicates that the corresponding element in the target register should not be overwritten. Thus the index values can range from -1 to P-1.
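The pre-processing variants of the fourth and fifth aspects can be sketched as follows. This is a behavioural model under our own naming; in particular, the use of -1 as the marker for the pre-set constant is one reading of the saturated case described above:

```python
P = 8  # vector length, chosen for illustration

def preprocess_modulo(index_vector, offset):
    # Fourth aspect: modulo addition -> combined shuffle and rotate.
    return [(i + offset) % P for i in index_vector]

def preprocess_saturated(index_vector, offset):
    # Fifth aspect: saturated addition -> combined shuffle and shift;
    # indices pushed outside 0..P-1 map to -1, i.e. the pre-set constant.
    return [i + offset if 0 <= i + offset < P else -1
            for i in index_vector]
```

Applying `preprocess_modulo` to the identity pattern [0, 1, ..., 7] with offset 3 yields [3, 4, 5, 6, 7, 0, 1, 2], i.e. exactly the left-rotation pattern [L, L+1, ..., P-1, 0, ..., L-1] of the fourth aspect.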
According to a sixth aspect of the invention which constitutes a further development of any one of the first to third aspects said pre-processing means are arranged to execute an XOR operation on each element of said one index vector and said parameter.
This operation on the index vector with a subsequent shuffling operation of the input (data) vector can be used to achieve "butterfly shuffling" operations as will be explained herein below. Butterfly shuffling is used extensively in kernels like the FFT, DCT and FHT (Fast Hadamard Transform). In these kernels the (step) size or increment of the butterfly operations usually changes from stage to stage. An implementation according to this aspect of the invention is advantageous since the increment is directly determined by the scalar input value that is the operand in the XOR operation executed on each index vector element. Otherwise, a new shuffle pattern would have to be loaded each time the size of the butterfly changes, leading to more data traffic. According to a seventh aspect of the invention the above object is further achieved by a method as mentioned in the opening paragraph in which the further steps of receiving a parameter and processing the elements of said one index vector dependent on said parameter are executed prior to the step of generating said at least one output vector.
The main application area is vector processing as it is applied for example in CVP (Research), see C.H. (Kees) van Berkel, Patrick P.E. Meuwissen, Nur Engin, and S. Balakrishnan, "CVP: A Programmable Co Vector Processor for 3G Mobile Baseband Processing," in the Proceedings of the World Wireless Congress 2003, and in OnDSP (PS-Dresden, formerly Systemonic), and in EVP (PS DSP Innovation Centre). The above invention can speed up a number of signal processing kernels drastically. This is especially true for applications that are (close to being) memory bound and that have irregular access patterns. Examples include: video codecs, FFT, format conversion, interleaving, etc.
The above and other objects, features and advantages of the present invention will become apparent from the following description of preferred embodiments thereof taken in conjunction with the accompanying drawings in which
Fig. 1 shows the functionality of a state-of-the-art shuffling unit (without pre-rotation); Fig. 2 is a block diagram showing a state-of-the-art implementation of a shuffling unit (SFU);
Fig. 3 is a block diagram showing an implementation of a FU according to the present invention with pre-rotation capability;
Fig. 4 is a block diagram showing the control of the FU according to Fig. 3; Fig. 5 is a section of a program code extracted from an implementation of the Golay correlator;
Fig. 6 shows an illustration of the memory access according to the code of Fig. 5;
Fig. 7 is another demonstration for the applicability of combined shuffle and rotate functionality by means of a program code for bit-reversed accesses in fast Fourier Transformation (FFT);
Fig. 8 shows the bit-reversed permutation for a 32 point FFT memory access pattern; and
Figs. 9A and 9B illustrate two butterfly shuffling operations with different sizes executed on the same 32-element input vector.
An FU 300 with pre-rotation capability according to the embodiment of the present invention schematically shown in Fig. 3 comprises processing means and, more precisely, an array 310 of P multiplexers 312, 314, ..., 316 depicted as parallel devices but which can also be implemented by fewer devices (down to one) and serialized processing steps. Each multiplexer has P inputs corresponding to the P elements of the input vector 340. One further input is provided for the assigned element of the index vector or shuffle pattern that is chosen from memory means of the FU 300, the shuffle memory 320, according to an input instruction 322. However, before the index vector elements are input to the multiplexers 312, 314, 316 they are subjected to a modulo addition by pre-processing means and, more precisely, by a combiner 360 consisting of several (P) modulo adders. The direction and the magnitude of rotation are determined by a scalar input 332. The input instruction 322 as well as the scalar input 332 are extracted, for example, from an address vector or a program code. After the index vector itself is processed it is input for shuffling the input vector 340 into the multiplexer array 310 in order to obtain the requested output vector 350.
The block diagram of Fig. 4 shows in a more generalized way that the combined shuffle and data processing (rotation, shift, butterfly, etc. operation) of the input data or address vector 440 being executed by the FU can be initiated by a single control step. The pre-processing means 460 receives the shuffle pattern 430 as well as the scalar input 432 and outputs a single instruction vector to the processing means 410. This instruction vector contains the pre-processed shuffle pattern.
The section of program code according to Fig. 5 is extracted from an implementation of the Golay correlator. The Golay correlator, for example, is used in 3rd generation mobile technology for cell search procedures using a hierarchical correlation sequence for the primary synchronization code (PSC). This is just one of numberless examples showing that shuffle patterns used in software applications and memory accesses are correlated and even often are rotations of a previous shuffle pattern with rotation offsets reused multiple times in succession. In this example, vectors of 8 (complex) elements are fetched by four memory accesses (read1 through read4), whereby the access vector at ptr (assumed aligned) is aligned again after the fourth increment because the rotation offset equals two.
The memory access according to the code of Fig. 5 is illustrated in Fig. 6. Therein, two consecutive vector locations in memory 610, 620, 630, 640 are shown on the left and the corresponding output from the memory (the input vector from the point of view of the SFU) on the right 611, 621, 631, 641. The final shuffle-rotated data vector 612, 622, 632, 642 is depicted directly below the memory output. For simplification and a better intelligibility the processing of the input vectors is a plain rotation. Plain rotation means there is no rearrangement required in the first memory access (read1), which can be achieved using an index vector with element values being equal to their positions, and each following memory access requires a rotation. In detail: the shuffle pattern [7, 6, 5, 4, 3, 2, 1, 0] used for a first memory access without shuffle (read1) leads to a first output vector 612 with elements having the same order as obtained from memory 610. Then, the same shuffle pattern - on the fly - is subjected to a modulo 8 addition with a scalar value of 6, thereby providing a shuffle pattern [5, 4, 3, 2, 1, 0, 7, 6]. This pre-processed shuffle pattern rotates the input data vector 621 obtained from memory access (read2) left by two elements in order to obtain the output data vector 622. In the next step the initial shuffle pattern [7, 6, 5, 4, 3, 2, 1, 0] is subjected to a modulo 8 addition with a scalar value of 4, thereby providing the shuffle pattern [3, 2, 1, 0, 7, 6, 5, 4]. This shuffle pattern is used to shuffle and (actually plainly) rotate the input vector 631 obtained by the next memory access (read3) to the output vector 632 with elements being rotated left by four. And the final input vector 641 obtained by memory access (read4) is rotated left by 6 elements using the shuffle pattern [1, 0, 7, 6, 5, 4, 3, 2] which results from the modulo addition of the initial shuffle pattern and the scalar value of 2. This results in an output vector 642.
In this particular application, due to the vector processor according to the invention, a single shuffle pattern was used along with suitable "rotate" values instead of four individual shuffle patterns.
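The pattern reuse in the walkthrough above can be reproduced in a few lines. This is a sketch; the base pattern and the offsets 6, 4, 2 are the values quoted in the example, while the function name is our own:

```python
def rotate_pattern(pattern, offset, P=8):
    # Pre-increment every index vector element modulo P,
    # as done by the SFU's combiner of modulo adders.
    return [(e + offset) % P for e in pattern]

base = [7, 6, 5, 4, 3, 2, 1, 0]   # shuffle pattern of read1
read2 = rotate_pattern(base, 6)   # expected: [5, 4, 3, 2, 1, 0, 7, 6]
read3 = rotate_pattern(base, 4)   # expected: [3, 2, 1, 0, 7, 6, 5, 4]
read4 = rotate_pattern(base, 2)   # expected: [1, 0, 7, 6, 5, 4, 3, 2]
```

A single stored pattern plus three scalar rotation offsets thus replaces four individually stored shuffle patterns.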
It is commonly known that in many Fast Fourier Transform (FFT) implementations a bit-reversed permutation is performed on an input data array before the FFT is performed. It is also known that bit-reversed permutations can be performed using shuffle functionality. In such a bit reversal the input data are re-organized on a binary level, i.e. in bit-reversed order, utilizing a function such as "permute_bitrev" shown in Fig. 7. The input in this example is assumed to consist of two arrays of the complex data, one containing the real part and the other containing the imaginary part. In the example vector processor, however, the input data is stored as a single array of complex numbers with the real and imaginary parts of each complex number stored in adjacent memory locations. Therefore, the permutation is executed on the array of complex numbers. The function bitrev() in Fig. 7 returns the numbers shown in the leftmost column of Fig. 8. Fig. 8 shows a bit-reversal access pattern for a 32-point FFT and how this can be improved according to the invention utilizing shuffle operations with pre-rotation. Bit reversal can be achieved very efficiently in an architecture that supports the "gather" operation described earlier. Regarding Fig. 8, it is assumed that a vector can hold a maximum of eight complex data elements and the number of memory banks (vector index) is eight, wherein a memory location in a bank can store one complex data element. The access patterns consist of a (vector number, memory bank number) tuple. It is further assumed that the data items are arranged in memory by, say, a DMA controller in the shown fashion. If a naive memory organization for the data is assumed, the access pattern detailed in Fig. 8 will lead to memory bank conflicts if each bank has a single port. For example, the first group of bit-reversed accesses, shown between the first and second dotted lines in Fig.
8, uses four accesses to memory bank 0 (vector index = 0) and four accesses to memory bank 4 (vector index = 4). Hence, serialized accesses to memory will be required, leading to inefficient use of the available memory bandwidth.
A more intelligent way of organizing data in this case is to rotate the starting address of successive vector elements by one memory bank. This would then involve wrapping around the data as illustrated in Table 1 below in order to write the data into a vector.
Table 1 : Skewing input data to avoid memory bank conflicts in the 32 point FFT bit-reversal scheme
With the data organization shown in Table 1, all the data items with the vector indices shown in a particular access can be fetched with no bank conflicts. Therefore, for every access the full bandwidth of the memory can be used. However, the data returned by the memory system will have to be re-arranged to obtain the vector in the desired order of data elements as shown in Fig. 8. For example, using the source encoding scheme detailed with reference to Fig. 8, the shuffle pattern for the bit-reversed accesses will be [0,2,1,3,4,6,5,7] for the first set (vector number = 0), [2,4,3,5,6,0,7,1] for the next set (vector number = 1), and subsequently [1,3,2,4,5,7,6,0] and [3,5,4,6,7,1,0,2]. Note that if we divide the array into blocks of vectors, all numbers in the shuffle pattern are incremented by the bit-reversed block number. In the example, we have four blocks that can be numbered sequentially starting from 0. The block numbers will then be 0,1,2,3. The bit-reversed block numbers will then be 0,2,1,3. This is precisely the increment to each element in the shuffle pattern. The increment achieves the rotation of the shuffled data. A similar scheme can be implemented for FFTs with a different number of points (powers of two).
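The relation between the four shuffle patterns listed above can be reproduced as follows. This is a sketch with our own helper function; the base pattern and the four expected patterns are the ones quoted in the text:

```python
def bitrev(x, bits):
    # Reverse the lowest `bits` bits of x, e.g. bitrev(1, 2) == 2.
    r = 0
    for _ in range(bits):
        r = (r << 1) | (x & 1)
        x >>= 1
    return r

base = [0, 2, 1, 3, 4, 6, 5, 7]   # pattern for vector number 0
# Each block's pattern is the base pattern incremented (modulo 8) by the
# bit-reversed block number, i.e. by 0, 2, 1, 3 for blocks 0, 1, 2, 3:
patterns = [[(e + bitrev(block, 2)) % 8 for e in base] for block in range(4)]
```

A single stored base pattern and the bit-reversed block number as rotation offset therefore suffice for all four access sets.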
Another embodiment of the present invention utilizes combined butterfly and shuffling operations as illustrated in Figs. 9A and 9B. An input vector 911 with 32 elements is assumed in these examples. Actually, a plain butterfly operation is applied. In other words, the index vector (not shown) initially supplied from the shuffle memory in this example has the content [31, 30, ..., 1, 0] mapping each element of an input vector on the same position in an output vector. Any other initial shuffle pattern (or index vector) can be used as well. However, before shuffling the input vector the index vector is subjected to a pre-processing. More precisely, according to Fig. 9A the corresponding pre-processing means execute XOR operations on each element of the index vector on a binary level. Thereby, the second operand in the case of Fig. 9A is an input scalar value of one. Hence, the XOR operation on the index vector results in input (data) vector elements pair-wise swapping places. The size or increment of the butterfly operation executed on the input vector in this case is one since each input vector element only jumps one position.
According to Fig. 9B the operand value equals four. Thus, by executing the XOR operation on the bit pattern of each index vector element and subsequently shuffling the input vector, eight input vector element blocks each containing four consecutive vector elements are formed and pair-wise swap places. The size of the butterfly operation executed on the input vector in this case is four since each input vector element jumps four positions.
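The butterfly sizes of Figs. 9A and 9B can be modelled as follows (a behavioural sketch; the function name is our own). XORing each element of an identity index vector with the scalar operand, and then shuffling with the result, swaps blocks of `size` consecutive elements pair-wise:

```python
def butterfly_pattern(P, size):
    # XOR pre-processing of the identity index vector [0, 1, ..., P-1];
    # `size` should be a power of two for a well-formed butterfly.
    return [i ^ size for i in range(P)]
```

For P = 32 and size 1 every element swaps with its neighbour (each element jumps one position); for size 4 blocks of four consecutive elements swap places, so each element jumps four positions, as in Fig. 9B.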
The use of the present invention is not limited to memory access, FFT, DCT or FHT applications. It can be applied in any kind of application and it is most advantageous for applications (re-)using shuffle patterns which can be deduced from formerly applied patterns by one or several subsequent arithmetic or logical operations. Consequently, a micro processor device according to the invention comprising a vector processor architecture with at least one functional vector processor unit may be provided with several different and/or identical pre-processing means arranged to receive the same and/or different parameters or scalar values and to process the elements of the index vector subsequently and/or in parallel. In the case of several parameters they can be decoded from a single instruction or from different instructions and/or can be derived from the same or different second memory means.

Claims

CLAIMS:
1. Micro processor device comprising a vector processor architecture with at least one functional vector processor unit comprising first memory means for storing plural index vectors and processing means, the functional vector processor unit being arranged to receive a processing instruction and at least one input vector to be processed, said first memory means being arranged to provide the processing means with one of said plural index vectors in accordance with the processing instruction, and the processing means being arranged to generate in response to said instruction at least one output vector having the elements of the at least one input vector rearranged in accordance with the one index vector provided, whereby said functional vector processor unit further comprises pre-processing means arranged to receive a parameter and to process the elements of said one index vector dependent on said parameter before generating said at least one output vector in accordance with the processed index vector.
2. Micro processor device according to claim 1, wherein said functional vector processor unit further comprises second memory means for storing plural parameters, said second memory means being arranged to provide said pre-processing means with one of said plural parameters in accordance with the processing instruction.
3. Micro processor device according to claim 1, wherein said pre-processing means are arranged to receive as parameter a scalar value having a sign and to process the elements of said index vector dependent on said scalar value and said sign.
4. Micro processor device according to claim 1, wherein said pre-processing means are arranged to execute a modulo addition of each element of said one index vector and said parameter.
5. Micro processor device according to claim 1, wherein said pre-processing means are arranged to execute a saturated addition of each element of said one index vector and said parameter.
6. Micro processor device according to claim 1, wherein said pre-processing means are arranged to execute an XOR-operation on each element of said one index vector and said parameter.
7. Method for processing vectors comprising the steps of: receiving a processing instruction and at least one input vector to be processed; storing plural index vectors in first memory means; selecting one of said plural index vectors in accordance with the processing instruction; generating at least one output vector in response to said instruction, the output vector having the elements of the at least one input vector rearranged in accordance with the one index vector provided; characterized by the further steps, prior to the step of generating said at least one output vector, of receiving a parameter and processing the elements of said one index vector dependent on said parameter.
8. Method according to claim 7, wherein said step of processing the elements of said one index vector comprises a modulo addition of each element of the index vector and said parameter.
9. Method according to claim 7, wherein said step of processing the elements of said one index vector comprises a saturated addition of each element of the index vector and said parameter.
10. Method according to claim 7, wherein said step of processing the elements of said one index vector comprises an XOR-operation on each element of the index vector and said parameter.
PCT/IB2005/053019 2004-09-21 2005-09-14 Micro processor device and method for shuffle operations WO2006033056A2 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
JP2007533015A JP2008513903A (en) 2004-09-21 2005-09-14 Microprocessor device and method for shuffle operations
EP05782929A EP1794671A2 (en) 2004-09-21 2005-09-14 Micro processor device and method for shuffle operations
CN200580039646.XA CN101061460B (en) 2004-09-21 2005-09-14 Micro processor device and method for shuffle operations

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP04104559.2 2004-09-21
EP04104559 2004-09-21

Publications (2)

Publication Number Publication Date
WO2006033056A2 true WO2006033056A2 (en) 2006-03-30
WO2006033056A3 WO2006033056A3 (en) 2006-10-26

Family

ID=35385641

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2005/053019 WO2006033056A2 (en) 2004-09-21 2005-09-14 Micro processor device and method for shuffle operations

Country Status (4)

Country Link
EP (1) EP1794671A2 (en)
JP (1) JP2008513903A (en)
CN (1) CN101061460B (en)
WO (1) WO2006033056A2 (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008126041A1 (en) * 2007-04-16 2008-10-23 Nxp B.V. Method of storing data, method of loading data and signal processor
EP2674855A1 (en) 2012-06-14 2013-12-18 ST-Ericsson SA An element selection unit and a method therein
WO2014150636A1 (en) * 2013-03-15 2014-09-25 Qualcomm Incorporated Vector indirect element vertical addressing mode with horizontal permute
US9292286B2 (en) 2011-10-18 2016-03-22 Panasonic Intellectual Property Management Co., Ltd. Shuffle pattern generating circuit, processor, shuffle pattern generating method, and instruction sequence
US9411584B2 (en) 2012-12-29 2016-08-09 Intel Corporation Methods, apparatus, instructions, and logic to provide vector address conflict detection functionality
US9411592B2 (en) 2012-12-29 2016-08-09 Intel Corporation Vector address conflict resolution with vector population count functionality
CN107003846A (en) * 2014-12-23 2017-08-01 英特尔公司 The method and apparatus for loading and storing for vector index
US9959247B1 (en) 2017-02-17 2018-05-01 Google Llc Permuting in a matrix-vector processor
WO2018093439A2 (en) 2016-09-30 2018-05-24 Intel Corporation Processors, methods, systems, and instructions to load multiple data elements to destination storage locations other than packed data registers
EP3608776A1 (en) * 2018-08-11 2020-02-12 Intel Corporation Systems, apparatuses, and methods for generating an index by sort order and reordering elements based on sort order
CN114297138A (en) * 2021-12-10 2022-04-08 龙芯中科技术股份有限公司 Vector shuffling method, processor and electronic equipment

Families Citing this family (5)

Publication number Priority date Publication date Assignee Title
US9513905B2 (en) * 2008-03-28 2016-12-06 Intel Corporation Vector instructions to enable efficient synchronization and parallel reduction operations
US8547385B2 (en) * 2010-10-15 2013-10-01 Via Technologies, Inc. Systems and methods for performing shared memory accesses
US8688957B2 (en) 2010-12-21 2014-04-01 Intel Corporation Mechanism for conflict detection using SIMD
US10013253B2 (en) * 2014-12-23 2018-07-03 Intel Corporation Method and apparatus for performing a vector bit reversal
DE102017207876A1 (en) * 2017-05-10 2018-11-15 Robert Bosch Gmbh Parallel processing

Citations (1)

Publication number Priority date Publication date Assignee Title
US20030046559A1 (en) 2001-08-31 2003-03-06 Macy William W. Apparatus and method for a data storage device with a plurality of randomly located data

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
CN1304086B (en) * 1999-12-13 2010-06-16 凌阳科技股份有限公司 Microcontroller structure capable of raising information access efficiency
EP1261912A2 (en) * 2000-03-08 2002-12-04 Sun Microsystems, Inc. Processing architecture having sub-word shuffling and opcode modification
US6922716B2 (en) * 2001-07-13 2005-07-26 Motorola, Inc. Method and apparatus for vector processing
CN1246770C (en) * 2003-02-13 2006-03-22 上海交通大学 Digital signal processor with modulus address arithmetic

Patent Citations (1)

Publication number Priority date Publication date Assignee Title
US20030046559A1 (en) 2001-08-31 2003-03-06 Macy William W. Apparatus and method for a data storage device with a plurality of randomly located data

Cited By (30)

Publication number Priority date Publication date Assignee Title
WO2008126041A1 (en) * 2007-04-16 2008-10-23 Nxp B.V. Method of storing data, method of loading data and signal processor
US20100211749A1 (en) * 2007-04-16 2010-08-19 Van Berkel Cornelis H Method of storing data, method of loading data and signal processor
US8489825B2 (en) 2007-04-16 2013-07-16 St-Ericsson Sa Method of storing data, method of loading data and signal processor
US9292286B2 (en) 2011-10-18 2016-03-22 Panasonic Intellectual Property Management Co., Ltd. Shuffle pattern generating circuit, processor, shuffle pattern generating method, and instruction sequence
EP2674855A1 (en) 2012-06-14 2013-12-18 ST-Ericsson SA An element selection unit and a method therein
WO2013186155A1 (en) 2012-06-14 2013-12-19 St-Ericsson Sa An element selection unit and a method therein
US9350584B2 (en) 2012-06-14 2016-05-24 Telefonaktiebolaget Lm Ericsson (Publ) Element selection unit and a method therein
US9411584B2 (en) 2012-12-29 2016-08-09 Intel Corporation Methods, apparatus, instructions, and logic to provide vector address conflict detection functionality
US9411592B2 (en) 2012-12-29 2016-08-09 Intel Corporation Vector address conflict resolution with vector population count functionality
WO2014150636A1 (en) * 2013-03-15 2014-09-25 Qualcomm Incorporated Vector indirect element vertical addressing mode with horizontal permute
US9639503B2 (en) 2013-03-15 2017-05-02 Qualcomm Incorporated Vector indirect element vertical addressing mode with horizontal permute
EP3238026A4 (en) * 2014-12-23 2018-08-01 Intel Corporation Method and apparatus for vector index load and store
CN107003846A (en) * 2014-12-23 2017-08-01 英特尔公司 The method and apparatus for loading and storing for vector index
CN107003846B (en) * 2014-12-23 2021-02-26 英特尔公司 Method and apparatus for vector index load and store
EP3519948A4 (en) * 2016-09-30 2020-08-19 Intel Corporation Processors, methods, systems, and instructions to load multiple data elements to destination storage locations other than packed data registers
WO2018093439A2 (en) 2016-09-30 2018-05-24 Intel Corporation Processors, methods, systems, and instructions to load multiple data elements to destination storage locations other than packed data registers
CN109791487A (en) * 2016-09-30 2019-05-21 英特尔公司 Processor, method, system and instruction for the destination storage location being loaded into multiple data elements in addition to packed data register
CN109791487B (en) * 2016-09-30 2023-10-20 英特尔公司 Processor, method, system and instructions for loading multiple data elements
US11068264B2 (en) 2016-09-30 2021-07-20 Intel Corporation Processors, methods, systems, and instructions to load multiple data elements to destination storage locations other than packed data registers
US10592583B2 (en) 2017-02-17 2020-03-17 Google Llc Permuting in a matrix-vector processor
US9959247B1 (en) 2017-02-17 2018-05-01 Google Llc Permuting in a matrix-vector processor
US10614151B2 (en) 2017-02-17 2020-04-07 Google Llc Permuting in a matrix-vector processor
US10956537B2 (en) 2017-02-17 2021-03-23 Google Llc Permuting in a matrix-vector processor
US10216705B2 (en) 2017-02-17 2019-02-26 Google Llc Permuting in a matrix-vector processor
US11748443B2 (en) 2017-02-17 2023-09-05 Google Llc Permuting in a matrix-vector processor
EP3944077A1 (en) * 2018-08-11 2022-01-26 INTEL Corporation Systems, apparatuses, and methods for generating an index by sort order and reordering elements based on sort order
EP4191405A1 (en) * 2018-08-11 2023-06-07 INTEL Corporation Systems, apparatuses, and methods for generating an index by sort order and reordering elements based on sort order
EP3608776A1 (en) * 2018-08-11 2020-02-12 Intel Corporation Systems, apparatuses, and methods for generating an index by sort order and reordering elements based on sort order
CN114297138A (en) * 2021-12-10 2022-04-08 龙芯中科技术股份有限公司 Vector shuffling method, processor and electronic equipment
CN114297138B (en) * 2021-12-10 2023-12-26 龙芯中科技术股份有限公司 Vector shuffling method, processor and electronic equipment

Also Published As

Publication number Publication date
CN101061460B (en) 2011-03-30
JP2008513903A (en) 2008-05-01
CN101061460A (en) 2007-10-24
EP1794671A2 (en) 2007-06-13
WO2006033056A3 (en) 2006-10-26

Similar Documents

Publication Publication Date Title
WO2006033056A2 (en) Micro processor device and method for shuffle operations
US11468003B2 (en) Vector table load instruction with address generation field to access table offset value
ES2954562T3 (en) Hardware accelerated machine learning
US20190188151A1 (en) Two address translations from a single table look-aside buffer read
US9557994B2 (en) Data processing apparatus and method for performing N-way interleaving and de-interleaving operations where N is an odd plural number
US8200948B2 (en) Apparatus and method for performing re-arrangement operations on data
US11860790B2 (en) Streaming engine with separately selectable element and group duplication
US10810131B2 (en) Streaming engine with multi dimensional circular addressing selectable at each dimension
GB2411973A (en) Constant generation in SIMD processing
GB2409059A (en) A data processing apparatus and method for moving data between registers and memory
US11921636B2 (en) Streaming engine with flexible streaming engine template supporting differing number of nested loops with corresponding loop counts and loop offsets
US20050125639A1 (en) Table lookup operation within a data processing system
JP4624098B2 (en) Processor address generation unit
GB2409063A (en) Vector by scalar operations
US11804858B2 (en) Butterfly network on load data return
US10140239B1 (en) Superimposing butterfly network controls for pattern combinations
US8489825B2 (en) Method of storing data, method of loading data and signal processor
US20030221089A1 (en) Microprocessor data manipulation matrix module
US10983912B2 (en) Streaming engine with compressed encoding for loop circular buffer sizes
US11604648B2 (en) Vector bit transpose

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KM KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NG NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SM SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): BW GH GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LT LU LV MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
WWE Wipo information: entry into national phase

Ref document number: 2005782929

Country of ref document: EP

NENP Non-entry into the national phase

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: 2007533015

Country of ref document: JP

WWE Wipo information: entry into national phase

Ref document number: 200580039646.X

Country of ref document: CN

WWP Wipo information: published in national office

Ref document number: 2005782929

Country of ref document: EP