WO2006033056A2

WO2006033056A2 - Micro processor device and method for shuffle operations

Info

Publication number: WO2006033056A2
Application number: PCT/IB2005/053019
Authority: WO
Inventors: Cornelis H. Van Berkel; Balakrishnan Srinivasan
Original assignee: Koninklijke Philips Electronics N.V.
Priority date: 2004-09-21
Filing date: 2005-09-14
Publication date: 2006-03-30
Also published as: JP2008513903A; WO2006033056A3; CN101061460B; CN101061460A; EP1794671A2

Abstract

The present invention relates to a micro processor device comprising a vector processor architecture with a functional vector processor unit comprising first memory means for storing plural index vectors and processing means, the functional vector processor unit being arranged to receive a processing instruction and at least one input vector to be processed, said first memory means being arranged to provide the processing means with one of said plural index vectors in accordance with the processing instruction, and the processing means being arranged to generate in response to said instruction at least one output vector having the elements of the at least one input vector rearranged in accordance with the one index vector provided. The functional vector processor unit further comprises pre-processing means arranged to receive a parameter and to process the elements of the one index vector dependent on said parameter before generating said at least one output vector in accordance with the processed index vector. The invention further relates to a method for processing vectors with such a functional vector-processing unit.

Description

Micro processor device and method for shuffle operations

The present invention relates to a micro processor device comprising a vector processor architecture with at least one functional vector processor unit comprising memory means for storing plural index vectors and processing means, the functional vector processor unit being arranged to receive a processing instruction and at least one input vector to be processed, the memory means being arranged to provide the processing means with one of said plural index vectors in accordance with the processing instruction, and the processing means being arranged to generate in response to said instruction at least one output vector having the elements of the at least one input vector rearranged in accordance with the one index vector provided. Correspondingly, the invention relates to a method for processing vectors comprising the steps: receiving a processing instruction and at least one input vector to be processed, storing plural index vectors in first memory means, selecting one of said plural index vectors in accordance with the processing instruction, and generating at least one output vector in response to said instruction, the output vector having the elements of the at least one input vector rearranged in accordance with the one index vector provided.

Such micro processor devices, hereinafter referred to as vector processors, and such a method have been well established for several decades. A vector processor provides vector instruction sets instead of or in addition to scalar instructions as provided by microprocessors employing a scalar processor architecture only (as opposed to the above vector or parallel architecture). Each instruction typically specifies the operand vector(s) containing plural data words or vector elements, the vector length (the number of vector elements), and an operation to be applied. The advantage of vector processing is that, instead of successively operating on single data words, it allows operating on entire vectors at the same time within one of said vector instructions, thereby enhancing calculating speed. For example, by a single memory access, the vector processor will continuously fetch the entire vector from an external memory, then the vector will be continuously operated on, and finally be continuously stored back to the external memory by another single access. Vector processing therefore is a Single Instruction Multiple Data (SIMD) parallel processing technique. The calculating speed can be enhanced even further by using a vector register architecture which allows the processor to keep intermediate results in its integrated vector register close to the other functional units of the processor, thereby reducing temporary storage requirements, inter-instruction latency, etc.

There are many different types of vector instructions, some of which can be classified as: vector-vector instructions specifying operations on one or more vectors such as shifting or shuffling a vector or adding, subtracting, multiplying, or dividing two or more vectors element-wise; vector-scalar instructions specifying operations on a vector and a scalar such as a scalar product; vector-scalar instructions (vector-reductions) specifying operations on one or more vectors and delivering a scalar such as cross product; and vector-memory instructions specifying load and store operations to transfer data between the external memory and the integrated vector register. Depending on the task, even more sophisticated instructions can be implemented, for example, specifying array operations producing an array from vectors, or the like. Also, instructions specifying logical operations on vectors are possible. In recent developments several vector instructions are packed in a single instruction word and executed in parallel, thereby once again enhancing the processing speed. This processing type is called Very Long Instruction Word (VLIW) parallel processing technique.

All of these instructions are executed by corresponding functional units (FU). For example, the FU responsible for vector-memory instructions, referred to as vector memory unit (VMU), contains the integrated vector register, and processing means for receiving and implementing the vector-memory instructions. According to the instruction as described below, these processing means will load the requested vector elements from the external memory into its vector register or store vector elements into the external memory. Normally, the VMU is the only FU connected to the "external world" outside the processor. Integrated in VLIW capable vector processors there is an additional instruction distribution unit (IDU) for receiving the very long instruction word, and sequencing and distributing the instructions to other functional units.

There are several vector-memory instructions allowing the processor to access memory according to different access patterns addressing the vector elements in the external memory. If the vector elements are all contiguous in the memory, that is the data words constituting the vector being requested are located in adjacent memory addresses, then fetching the vector from a set of memory banks is an easy task. This access is commonly called a unit-stride access. In some cases, the data words to be fetched are separated within the memory by a definite constant displacement. This is called a strided access or a stride-n access, wherein n denotes the distance of memory-addresses between two neighboring vector elements. In this case the instruction further specifies the stride n in order to allow the VMU to fetch all data in a single memory access.
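The unit-stride and stride-n access patterns described above can be sketched as a simple address computation. This is only an illustration: the function name `strided_addresses` and the example base address are assumptions, not part of the described processor.

```python
def strided_addresses(base, stride, length):
    """Return the memory addresses touched by a stride-n vector access.

    base   : address of the first vector element (hypothetical value)
    stride : distance n between neighbouring vector elements
    length : number of vector elements
    """
    return [base + i * stride for i in range(length)]


# Unit-stride access: the elements sit in adjacent memory addresses.
unit = strided_addresses(100, 1, 4)
print(unit)      # [100, 101, 102, 103]

# Stride-3 access: a constant displacement of 3 between elements.
strided = strided_addresses(100, 3, 4)
print(strided)   # [100, 103, 106, 109]
```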

However, sometimes data words to be loaded from the memory as elements of a vector and/or to be stored back to the memory are not even separated by a constant displacement but rather located at (pre-calculated) arbitrary positions and/or in arbitrary order in memory. Also, the order of, say, P memory banks of the external memory does generally not match the retrieved/delivered order of the vector elements. In order to allow the processor to access the arbitrarily distributed data elements in a single vector-memory instruction, the VMU must further be provided with the addresses indicating all memory locations where the vector elements are stored. This is accomplished by a vector-memory instruction, called "gather-instruction", providing an address vector containing the address elements. Accordingly, a so-called "scatter-instruction", another vector-memory instruction, is provided allowing the processor to store the vector elements to the memory according to a given address vector in a single memory access, too.

In the case when data words are located at arbitrary positions in memory, the functional unit described in the opening paragraph, hereinafter also referred to as shuffle unit (SFU), is required to rearrange the data elements obtained from the memory. Programming the shuffle unit involves providing it with the above index vector containing a "shuffle pattern". Each element in the shuffle pattern specifies the position of the source element. In Fig. 1 an illustration of a gather instruction is given. Therein a first register 110 provides a start address (100) of the memory. A second register 120 provides the length (4) of the vector to be loaded. The shuffle pattern or index vector 130 specifies, for example, value 4 at position 1. Hence, the content of element 4 of the input (vector), which in this case is the fourth element in the memory 140 after the start address, must be copied to position 1 of the output vector 150, which is the vector requested by the program, and so on.
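The source-position encoding of a shuffle pattern can be illustrated with a minimal sketch. The names `shuffle` and `gather`, the sample memory contents, and the 0-based positions are assumptions made for this example; Fig. 1 itself appears to count positions from 1.

```python
def shuffle(input_vector, index_vector):
    """Rearrange a vector: output position i takes the input element
    whose (0-based) position is given by index_vector[i]."""
    return [input_vector[src] for src in index_vector]


def gather(memory, start, length, index_vector):
    """Fetch `length` consecutive words beginning at `start`, then
    rearrange them with the shuffle pattern (hypothetical helper)."""
    window = memory[start:start + length]
    return shuffle(window, index_vector)


memory = ["a", "b", "c", "d", "e", "f"]   # hypothetical memory contents
result = gather(memory, start=1, length=4, index_vector=[3, 0, 2, 1])
print(result)   # ['e', 'b', 'd', 'c']
```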

In an alternative shuffling scheme, instead of specifying the source position as in the above example, the destination position of an element within the vector would be provided. This is, however, less general.

In scatter or gather operations the appropriate shuffle pattern, and more precisely the processing instruction input into the SFU, is determined by the least significant bits of the address elements in the address vector. In other applications, for example, a software code according to which it may be required to read a segment (of size P) of an FFT input block in bit-reversed address order, the shuffle pattern is directly determined by the code. In case of the above vector processor, several shuffle patterns are stored near the shuffle processing means in dedicated shuffle memory means. An example of a known hardware realization of a SFU is given schematically in Fig. 2. The SFU 200 comprises an array 210 of P multiplexers 212, 214,..., 216 each with P inputs for the input vector 240 to be processed and with one input for the assigned element of the index vector or shuffle pattern. The shuffle pattern is chosen from the shuffle memory 220 according to an input instruction 222, for example, extracted from the address vector or the program code for rearrangement of the input vector elements in order to obtain the requested output vector 250. Alternative realizations can be based on switch matrices or the like.

After the address vector elements are rearranged by the SFU the data words are fetched by the VMU and afterwards rearranged again in order to comply with the position of the requested address. Again, the appropriate shuffle pattern can be obtained from the least significant bits of the address elements.

In this way, a shuffle according to the pre-computed shuffle pattern can be applied to obtain data in the proper order. Typically, few shuffle patterns suffice for a given application, such as scatter or gather memory access or a computer code, assuming that access patterns to the memory banks can be reused without modification. If such a restriction, however, would be lifted, either the number of shuffle patterns would increase dramatically, or a shuffle pattern has to be preceded by a rotation operation, or the like. The former could lead to a lot of additional memory traffic, the latter costs an additional operation cycle and hence calculation speed.

Object of the present invention therefore is to advance the above micro processor in order to minimize the number of shuffle configuration patterns without suffering a decrease of calculating speed.

According to a first aspect of the invention this object is achieved by a vector processor as mentioned in the opening paragraph wherein said functional vector processor further comprises pre-processing means arranged to receive a parameter and to process the elements of the one index vector dependent on said parameter before generating said at least one output vector in accordance with the processed index vector.

A vector processor with FUs, e.g. SFUs, comprising such pre-processing means combines the functionality of shuffling elements in a data (or address) vector according to a prescribed shuffle pattern, and the functionality of further processing the data vector indicated by said parameter, which preferably is a scalar value. While in state-of-the-art hardware implementations a shuffling operation and a further data processing (reordering or the like) require two successive steps and two switch networks each having its own control, or two network reuses, shuffle and further data processing according to the present invention can be done in a single step and the control can be combined into a single control step. This is possible because it is actually the index vector itself that is processed: since both the index vector and the parameter, as in many architectures, arrive one or more clock cycles prior to the input data vectors, the processing of the index vector can be executed in advance. Finally, the shuffling of the data vector on the basis of the (pre-)processed index vector can be executed in a single clock cycle. A vector processor according to the invention can be used to accelerate a large class of algorithms, in particular in combination with scatter-gather memory access, thereby keeping the additional memory traffic at a low level. The (pre-)processing of the index vector generally could be any arithmetic or logical operation on the index vector and the parameter or scalar value. By employment of the present invention, in principle, even bit-manipulation operations can be performed in a single step. Bit manipulation in this context denotes operations like those described in connection with broadband media processors by Craig Hansen in "MicroUnity's MediaProcessor Architecture", IEEE Micro, August 1996, p. 36-38. These generalized switching instructions alter the arrangement of vector elements in different manners. Thereby, many commonly required re-arrangements are performed in a single instruction and even arbitrary re-arrangements can be obtained by a sequence of three instructions. Group-shuffle, group-swizzle, group-extract, group-compress, group-deposit, group-merge-deposit, group-withdraw, group-shift, and group-rotate operations are some examples of such single instruction operations. Several parameters are decoded from an immediate field of the specific instruction, exactly specifying the "degree" of re-arrangement. In case of group-shuffle instructions, for example, three instruction parameters generally specify the size (in bits) over which vectors are shuffled, the size of the vectors and the degree of shuffle. In other instructions the specific number of parameters may be reduced.

According to a second aspect of the invention which constitutes a further development of the first aspect said functional vector processor unit further comprises second memory means for storing plural parameters, said second memory means being arranged to provide said pre-processing means with one of said parameters in accordance with the processing instruction. These second memory means are also referred to as offset memory. This can be useful when the parameter, being a scalar value, is not a compile time constant. A scalar processing unit that operates in parallel with the vector unit can compute these "offsets" and store them in the offset memory indicated.

According to a third aspect of the invention which constitutes a further development of the first or second aspects said pre-processing means are arranged to receive as parameter a scalar value having a sign and to process the elements of said index vector dependent on said scalar value and said sign. This allows for a greater variety of resulting index vectors without enhancing the number of pre-processing functionalities.

According to a fourth aspect of the invention which constitutes a further development of any one of the first to third aspects the said pre-processing means are arranged to execute a modulo addition of each element of the one index vector and said parameter.

Processing the one index vector according to this aspect includes adding to each element a constant value modulo P (preferably the length of the vector) which, consequently, results in a combined shuffle and rotate operation on the data vector. In these cases, the parameter or scalar value hereinafter is also referred to as rotation offset (L). This implementation takes into account that shuffle patterns used in a typical application are correlated and often are rotations of a previous shuffle pattern. Note that in many applications, the rotation offsets also are reused multiple times in succession. Therefore, particularly by integrating the rotation functionality into the SFU the additional memory traffic can be kept at a low level. These second memory means in case of a rotation operation are referred to as rotation-offset memory.

In a combination of modulo addition according to the fourth aspect and a signed scalar value according to the third aspect, for example, a negative signed L could denote a right rotation direction, a positive signed L could denote a left rotation direction, and L=0 denotes 0-rotation of the input vector (if present, without reloading the rotation-offset memory). Rotation of an input vector is a special case of a shuffle. Therefore, a left rotation by a number of +L places, with L<P, can be specified by shuffle pattern [L, L+1, L+2, ..., P-1, 0, 1, ..., L-1] obtained by pre-processing the shuffle pattern [0, 1, 2, ..., P-1] which maps each input vector element on the same position in the processed output vector. More precisely, the pre-processed shuffle pattern can be denoted as [(0+L) modulo P, (1+L) modulo P, (2+L) modulo P, ..., (P-1+L) modulo P]. Since the SFU according to the present invention (pre-)increments the elements of the index vector or shuffle pattern with the value of the rotation offset L (note that this pre-increment must operate modulo P), the number of individual index vectors in the shuffle memory means can be drastically reduced, namely by a factor of P, without suffering a decrease of calculating speed.

According to a fifth aspect of the invention which constitutes a further development of any one of the first to third aspects said pre-processing means are arranged to execute a saturated addition of each element of said one index vector and said parameter. This saturated addition of the index vector elements and the parameter or scalar value results in a combined shuffle and shift operation on the input data or address vector. Shift left/right for L positions can be seen as a special case of rotate left/right where the L vacant positions at the right or left side (depending on the shifting direction) are filled with a pre-set constant, e.g. value 0. This is achieved by replacing the modulo addition/subtraction by said saturated addition/subtraction, where source-index values -1 and P-1 refer to the pre-set constant. A value of -1 indicates that the corresponding element in the target register should not be overwritten. Thus the index values can range from -1 to P-1.
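The combined shuffle and rotate operation obtained by modulo addition can be sketched as follows. This is a minimal illustration, assuming 0-based positions and the sign convention given in the text (positive L rotates left, negative L rotates right); the names `pre_rotate` and `shuffle` are hypothetical, and the saturated-addition (shift) variant is not modelled here.

```python
def shuffle(input_vector, index_vector):
    """Output position i takes the input element at position index_vector[i]."""
    return [input_vector[src] for src in index_vector]


def pre_rotate(index_vector, offset, p):
    """Pre-process a shuffle pattern: add the signed rotation offset L to
    every element, modulo P. Python's % already yields a value in 0..P-1
    for negative offsets."""
    return [(idx + offset) % p for idx in index_vector]


P = 8
identity = list(range(P))            # pattern [0, 1, ..., P-1]
data = list("abcdefgh")

left3 = pre_rotate(identity, 3, P)
print(left3)                          # [3, 4, 5, 6, 7, 0, 1, 2]
print("".join(shuffle(data, left3)))  # defghabc  (rotated left by 3)

right3 = pre_rotate(identity, -3, P)
print(right3)                         # [5, 6, 7, 0, 1, 2, 3, 4]
print("".join(shuffle(data, right3))) # fghabcde  (rotated right by 3)
```

In this sketch one stored pattern yields all P rotated variants, which is the reduction by a factor of P mentioned above.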

According to a sixth aspect of the invention which constitutes a further development of any one of the first to third aspects said pre-processing means are arranged to execute an XOR operation on each element of said one index vector and said parameter.

This operation on the index vector with a subsequent shuffling operation of the input (data) vector can be used to achieve "butterfly shuffling" operations as will be explained herein below. Butterfly shuffling is used extensively in kernels like the FFT, DCT and FHT (Fast Hadamard Transform). In these kernels the (step) size or increment of the butterfly operations usually changes from stage to stage. An implementation according to this aspect of the invention is advantageous since the increment is directly determined by the scalar input value that is the operand in the XOR operation executed on each index vector element. Otherwise, a new shuffle pattern would have to be loaded each time the size of the butterfly changes, leading to more data traffic.

According to a seventh aspect of the invention the above object is further achieved by a method as mentioned in the opening paragraph in which the further steps of receiving a parameter and processing the elements of said one index vector dependent on said parameter are executed prior to the step of generating said at least one output vector.

The main application area is vector processing as it is applied for example in CVP (Research), see C.H. (Kees) van Berkel, Patrick P.E. Meuwissen, Nur Engin, and S. Balakrishnan, "CVP: A Programmable Co Vector Processor for 3G Mobile Baseband Processing," in the Proceedings of the World Wireless Congress 2003, and in OnDSP (PS-Dresden, formerly Systemonic), and in EVP (PS DSP Innovation Centre). The above invention can speed up a number of signal processing kernels drastically. This is especially true for applications that are (close to being) memory bound and that have irregular access patterns. Examples include: video codecs, FFT, format conversion, interleaving, etc.

The above and other objects, features and advantages of the present invention will become apparent from the following description of preferred embodiments thereof taken in conjunction with the accompanying drawings in which

Fig. 1 shows the functionality of a state-of-the-art shuffling unit (without pre-rotation);

Fig. 2 is a block diagram showing a state-of-the-art implementation of a shuffling unit (SFU);

Fig. 3 is a block diagram showing an implementation of a FU according to the present invention with pre-rotation capability;

Fig. 4 is a block diagram showing the control of the FU according to Fig. 3;

Fig. 5 is a section of a program code extracted from an implementation of the Golay correlator;

Fig. 6 shows an illustration of the memory access according to the code of Fig. 5;

Fig. 7 is another demonstration for the applicability of combined shuffle and rotate functionality by means of a program code for bit-reversed accesses in fast Fourier Transformation (FFT);

Fig. 8 shows the bit-reversed permutation for a 32 point FFT memory access pattern; and

Figs. 9A and 9B illustrate two butterfly shuffling operations with different sizes executed on the same 32-element input vector.

A FU 300 with pre-rotation capability according to the embodiment of the present invention schematically shown in Fig. 3 comprises processing means and, more precisely, an array 310 of P multiplexers 312, 314, ..., 316 depicted as parallel devices but which can also be implemented by fewer (down to one) devices and serialized processing steps. Each multiplexer has P inputs corresponding to the P elements of the input vector 340. One further input is provided for the assigned element of the index vector or shuffle pattern that is chosen from memory means of the FU 300, the shuffle memory 320, according to an input instruction 322. However, before the index vector elements are input to the multiplexers 312, 314, 316 they are subjected to a modulo addition by pre-processing means and, more precisely, by a combiner 360 consisting of several (P) modulo adders. The direction and the magnitude of rotation are determined by a scalar input 332. The input instruction 322 as well as the scalar input 332 are extracted for example from an address vector or a program code. After the index vector itself is processed it is input into the multiplexer array 310 for shuffling the input vector 340 in order to obtain the requested output vector 350.

The block diagram of Fig. 4 shows in a more generalized way that the combined shuffle and data processing (rotation, shift, butterfly, etc.) of the input data or address vector 440 being executed by the FU can be initiated by a single control step. The pre-processing means 460 receives the shuffle pattern 430 as well as the scalar input 432 and outputs a single instruction vector to the processing means 410. This instruction vector contains the pre-processed shuffle pattern.

The section of program code according to Fig. 5 is extracted from an implementation of the Golay correlator. The Golay correlator, for example, is used in 3rd generation mobile technology for cell search procedures using a hierarchical correlation sequence for the primary synchronization code (PSC). This is just one of countless examples showing that shuffle patterns used in software applications and memory accesses are correlated and often are even rotations of a previous shuffle pattern, with rotation offsets reused multiple times in succession. In this example, vectors of 8 (complex) elements are fetched by four memory accesses (read1 through read4), whereby the access vector at ptr (assumed to be aligned) is aligned again after the fourth increment because the rotation offset equals two.

The memory access according to the code of Fig. 5 is illustrated in Fig. 6. Therein, two consecutive vector locations in memory 610, 620, 630, 640 are shown on the left and the corresponding output from the memory (input vector from the point of view of the SFU) on the right 611, 621, 631, 641. The final shuffle-rotated data vector 612, 622, 632, 642 is depicted directly below the memory output. For simplification and better intelligibility the processing of the input vectors is a plain rotation. Plain rotation means there is no rearrangement required in the first memory access (read1), which can be achieved using an index vector with element values being equal to their positions, and each following memory access requires a rotation. In detail: the shuffle pattern [7, 6, 5, 4, 3, 2, 1, 0] used for a first memory access without shuffle (read1) leads to a first output vector 612 with elements having the same order as obtained from memory 610. Then, the same shuffle pattern is subjected, on the fly, to a modulo 8 addition with a scalar value of 6, thereby providing a shuffle pattern [5, 4, 3, 2, 1, 0, 7, 6]. This pre-processed shuffle pattern rotates the input data vector 621 obtained from memory access (read2) left by two elements in order to obtain the output data vector 622. In the next step the initial shuffle pattern [7, 6, 5, 4, 3, 2, 1, 0] is subjected to a modulo 8 addition with a scalar value of 4, thereby providing the shuffle pattern [3, 2, 1, 0, 7, 6, 5, 4]. This shuffle pattern is used to shuffle and (actually plain) rotate the input vector 631 obtained by the next memory access (read3) to the output vector 632 with elements being rotated left by four. And the final input vector 641 obtained by memory access (read4) is rotated left by 6 elements using the shuffle pattern [1, 0, 7, 6, 5, 4, 3, 2] which results from the modulo addition of the initial shuffle pattern and the scalar value of 2. This results in an output vector 642. In this particular application, due to the vector processor according to the invention, a single shuffle pattern was used along with suitable "rotate" values instead of four individual shuffle patterns.
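The four shuffle patterns of the Golay example can be reproduced from the single stored pattern by modulo 8 addition. The sketch below only recomputes the patterns quoted in the text; the name `add_modulo` is an assumption, and the lists are written in the same order as in the description.

```python
def add_modulo(pattern, scalar, p=8):
    """Add a scalar to every shuffle-pattern element, modulo P."""
    return [(idx + scalar) % p for idx in pattern]


BASE_PATTERN = [7, 6, 5, 4, 3, 2, 1, 0]   # pattern used for read1

# Scalar values applied for read1..read4, as given in the description.
patterns = {scalar: add_modulo(BASE_PATTERN, scalar) for scalar in (0, 6, 4, 2)}

for scalar, pattern in patterns.items():
    print(scalar, pattern)
# 0 [7, 6, 5, 4, 3, 2, 1, 0]
# 6 [5, 4, 3, 2, 1, 0, 7, 6]
# 4 [3, 2, 1, 0, 7, 6, 5, 4]
# 2 [1, 0, 7, 6, 5, 4, 3, 2]
```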

It is commonly known that in many Fast Fourier Transform (FFT) implementations bit-reversed permutation is performed on an input data array before the FFT is performed. It is also known that bit-reversed permutations can be performed using shuffle functionality. In such a bit reversal the input data are re-organized on a binary level, i.e. in bit-reversed order, utilizing a function such as "permute_bitrev" shown in Fig. 7. The input in this example is assumed to consist of two arrays of the complex data, one containing the real part and the other containing the imaginary part. In the example vector processor, however, the input data is stored as a single array of complex numbers with the real and imaginary parts of each complex number stored in adjacent memory locations. Therefore, the permutation is executed on the array of complex numbers. The function bitrev() in Fig. 7 returns the numbers shown in the leftmost column of Fig. 8. Fig. 8 shows a bit-reversal access pattern for a 32 point FFT and how this can be improved according to the invention utilizing shuffle operations with pre-rotation. Bit reversal can be achieved very efficiently in an architecture that supports the "gather" operation described earlier. Regarding Fig. 8, it is assumed that a vector can hold a maximum of eight complex data elements and the number of memory banks (vector index) is eight, wherein a memory location in a bank can store one complex data element. The access patterns consist of a (vector number, memory bank number) tuple. It is further assumed that the data items are arranged in memory by, say, a DMA controller in the shown fashion. If a naive memory organization for the data is assumed, the access pattern detailed in Fig. 8 will lead to memory bank conflicts if each bank has a single port. For example, the first group of bit-reversed accesses, shown between the first and second dotted lines in Fig. 8, uses four accesses to memory bank 0 (vector index = 0) and four accesses to memory bank 4 (vector index = 4). Hence, serialized accesses to memory will be required, leading to inefficient use of the available memory bandwidth.
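The bank conflicts of the first group of accesses can be checked with a short sketch. It assumes that the naive organization places data element k in memory bank k modulo 8; the helper names `bitrev` and `naive_banks` are chosen for this illustration and are not taken from Fig. 7.

```python
def bitrev(value, bits):
    """Reverse the lowest `bits` bits of `value`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def naive_banks(first, count=8, bits=5, banks=8):
    """Banks hit by `count` consecutive bit-reversed accesses of a
    32 point FFT, assuming element k is stored in bank k % banks."""
    return [bitrev(i, bits) % banks for i in range(first, first + count)]


first_group = naive_banks(0)
print([bitrev(i, 5) for i in range(8)])  # [0, 16, 8, 24, 4, 20, 12, 28]
print(first_group)                        # [0, 0, 0, 0, 4, 4, 4, 4]
```

With single-ported banks, the eight accesses of this group would therefore be spread over only two banks.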

A more intelligent way of organizing data in this case is to rotate the starting address of successive vector elements by one memory bank. This would then involve wrapping around the data as illustrated in Table 1 below in order to write the data into a vector.

Table 1 : Skewing input data to avoid memory bank conflicts in the 32 point FFT bit-reversal scheme

With the data organization shown in Table 1, all the data items with the vector indices shown in a particular access can be fetched with no bank conflicts. Therefore, for every access the full bandwidth of memory can be used. However, the data returned by the memory system will have to be re-arranged to obtain the vector in the desired order of data elements as shown in Fig. 8. For example, using the source encoding scheme detailed with reference to Fig. 8, the shuffle pattern for the bit-reversed accesses will be [0,2,1,3,4,6,5,7] for the first set (vector number = 0), [2,4,3,5,6,0,7,1] for the next set (vector number = 1), and subsequently [1,3,2,4,5,7,6,0] and [3,5,4,6,7,1,0,2]. Note that if we divide the array into blocks of vectors, all numbers in the shuffle pattern are incremented by the bit-reversed block number. In the example, we have four blocks that can be numbered sequentially starting from 0. The block numbers will then be 0,1,2,3. The bit-reversed block numbers will then be 0,2,1,3. This is precisely the increment to each element in the shuffle pattern. The increment achieves the rotation of the shuffled data. A similar scheme can be implemented for FFTs with a different number of points (power of two).
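The relation between the four shuffle patterns and the bit-reversed block number can be verified with a small sketch. It rests on an assumption about the skewed layout, namely that the element at address 8*v + b is stored in bank (b + v) modulo 8, which is one reading of "rotating the starting address of successive vectors by one memory bank"; the table itself is not reproduced here. The names `bitrev`, `skewed_bank` and `pattern_for_block` are hypothetical.

```python
def bitrev(value, bits):
    """Reverse the lowest `bits` bits of `value`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def skewed_bank(address, banks=8):
    """Assumed skewing: vector v is rotated by v banks."""
    vector_number, bank = divmod(address, banks)
    return (bank + vector_number) % banks


def pattern_for_block(block, bits=5, banks=8):
    """Bank (source position) supplying each element of output block."""
    first = block * banks
    return [skewed_bank(bitrev(i, bits), banks) for i in range(first, first + banks)]


base = pattern_for_block(0)
for block in range(4):
    pattern = pattern_for_block(block)
    increment = bitrev(block, 2)          # bit-reversed block number
    # Each pattern is the base pattern plus the increment, modulo 8.
    assert pattern == [(x + increment) % 8 for x in base]
    print(block, increment, pattern)
# 0 0 [0, 2, 1, 3, 4, 6, 5, 7]
# 1 2 [2, 4, 3, 5, 6, 0, 7, 1]
# 2 1 [1, 3, 2, 4, 5, 7, 6, 0]
# 3 3 [3, 5, 4, 6, 7, 1, 0, 2]
```

Under this assumed layout every access touches eight distinct banks, and a single stored pattern plus a rotation offset reproduces all four patterns quoted above.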

Another embodiment of the present invention utilizes combined butterfly and shuffling operations as illustrated in Figs. 9A and 9B. An input vector 911 with 32 elements is assumed in these examples. Actually, a plain butterfly operation is applied. In other words, the index vector (not shown) initially supplied from the shuffle memory in this example has the content [31, 30, ..., 1, 0] mapping each element of an input vector on the same position in an output vector. Any other initial shuffle pattern (or index vector) can be used as well. However, before shuffling the input vector the index vector is subjected to a pre-processing. More precisely, according to Fig. 9A the corresponding pre-processing means execute XOR operations on each element of the index vector on a binary level. Thereby, the second operand in the case of Fig. 9A is an input scalar value of one. Hence, the XOR operation on the index vector results in input (data) vector elements pairwise swapping places. The size or increment of the butterfly operation executed on the input vector in this case is one since each input vector element only jumps one position.

According to Fig. 9B the operand value equals four. Thus, by executing the XOR operation on the bit pattern of each index vector element and subsequently shuffling the input vector, eight blocks of input vector elements, each containing four consecutive vector elements, are formed and swap places pairwise. The size of the butterfly operation executed on the input vector in this case is four, since each input vector element jumps four positions.
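A minimal sketch of the size-four case follows, assuming an identity index vector, source encoding, and a butterfly size that is a power of two smaller than the vector length; the function name `butterfly` is hypothetical.

```python
def butterfly(input_vector, size):
    """Shuffle with an identity index vector XORed with `size`.

    Assumes `size` is a power of two smaller than len(input_vector),
    so every XORed index stays inside the vector.
    """
    return [input_vector[i ^ size] for i in range(len(input_vector))]


data = list(range(32))
result = butterfly(data, 4)

# Split the result into the eight blocks of four consecutive elements.
blocks = [result[k:k + 4] for k in range(0, 32, 4)]
for block in blocks:
    print(block)
```

Applying the same operation twice restores the original order, because XOR with a fixed value is its own inverse.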

The use of the present invention is not limited to memory access, FFT, DCT or FHT applications. It can be applied in any kind of application, and it is most advantageous for applications (re-)using shuffle patterns which can be deduced from formerly applied patterns by one or several subsequent arithmetic or logical operations. Consequently, a micro processor device according to the invention, comprising a vector processor architecture with at least one functional vector processor unit, may be provided with several different and/or identical pre-processing means arranged to receive the same and/or different parameters or scalar values and to process the elements of the index vector subsequently and/or in parallel. In the case of several parameters, these can be decoded from a single instruction or from different instructions and/or can be derived from the same or different second memory means.
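Several pre-processing means applied one after the other can be sketched as a list of stages, each pairing an operation with its own parameter. This is an illustration only: the stage representation, the function names, and the clamping range of 0 to n-1 for the saturated addition are assumptions, and the XOR stage assumes a power-of-two vector length with a parameter smaller than that length.

```python
def mod_add(n):
    """Modulo addition over a vector length of n."""
    return lambda index, parameter: (index + parameter) % n


def sat_add(n):
    """Saturated addition, clamped to the valid index range 0..n-1 (assumed)."""
    return lambda index, parameter: max(0, min(n - 1, index + parameter))


def xor(index, parameter):
    """Bitwise XOR of index and parameter."""
    return index ^ parameter


def preprocess(index_vector, stages):
    """Apply each (operation, parameter) stage in turn to every index element."""
    for operation, parameter in stages:
        index_vector = [operation(index, parameter) for index in index_vector]
    return index_vector


identity = list(range(8))

# XOR with 1 followed by a modulo addition of 2.
chained = preprocess(identity, [(xor, 1), (mod_add(8), 2)])

# A signed parameter with saturated addition.
saturated = preprocess(identity, [(sat_add(8), -3)])

print(chained)
print(saturated)
```

Each stage could take its parameter from a different source, for instance from the instruction itself or from a parameter memory, without changing the stored index vector.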

Claims

1. Micro processor device comprising a vector processor architecture with at least one functional vector processor unit comprising first memory means for storing plural index vectors and processing means, the functional vector processor unit being arranged to receive a processing instruction and at least one input vector to be processed, said first memory means being arranged to provide the processing means with one of said plural index vectors in accordance with the processing instruction, and the processing means being arranged to generate in response to said instruction at least one output vector having the elements of the at least one input vector rearranged in accordance with the one index vector provided, whereby said functional vector processor unit further comprises pre-processing means arranged to receive a parameter and to process the elements of said one index vector dependent on said parameter before generating said at least one output vector in accordance with the processed index vector.

2. Micro processor device according to claim 1, wherein said functional vector processor unit further comprises second memory means for storing plural parameters, said second memory means being arranged to provide said pre-processing means with one of said plural parameters in accordance with the processing instruction.

3. Micro processor device according to claim 1, wherein said pre-processing means are arranged to receive as parameter a scalar value having a sign and to process the elements of said index vector dependent on said scalar value and said sign.

4. Micro processor device according to claim 1, wherein said pre-processing means are arranged to execute a modulo addition of each element of said one index vector and said parameter.

5. Micro processor device according to claim 1, wherein said pre-processing means are arranged to execute a saturated addition of each element of said one index vector and said parameter.

6. Micro processor device according to claim 1, wherein said pre-processing means are arranged to execute an XOR-operation on each element of said one index vector and said parameter.

7. Method for processing vectors comprising the steps: receiving a processing instruction and at least one input vector to be processed; storing plural index vectors in first memory means; selecting one of said plural index vectors in accordance with the processing instruction; and generating at least one output vector in response to said instruction, the output vector having the elements of the at least one input vector rearranged in accordance with the one index vector provided, characterized by the further steps, prior to the step of generating said at least one output vector, of receiving a parameter and processing the elements of said one index vector dependent on said parameter.

8. Method according to claim 7, wherein said step of processing the elements of said one index vector comprises a modulo addition of each element of the index vector and said parameter.

9. Method according to claim 7, wherein said step of processing the elements of said one index vector comprises a saturated addition of each element of the index vector and said parameter.

10. Method according to claim 7, wherein said step of processing the elements of said one index vector comprises an XOR-operation on each element of the index vector and said parameter.