WO2007057832A2 - Vector shuffle unit - Google Patents

Publication number
WO2007057832A2
Authority
WO
WIPO (PCT)
Prior art keywords
vector
shuffle
multiplexer units
bit
control information
Prior art date
Application number
PCT/IB2006/054214
Other languages
French (fr)
Other versions
WO2007057832A3 (en)
Inventor
David E. Leane
Jean-Paul C. F. H. Smeets
Willem E. H. Kloosterhuis
Original Assignee
Nxp B.V.
Priority date
Filing date
Publication date
Application filed by Nxp B.V.

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/76 Arrangements for rearranging, permuting or selecting data according to predetermined rules, independently of the content of the data
    • G06F7/762 Arrangements for rearranging, permuting or selecting data according to predetermined rules, independently of the content of the data having at least two separately controlled rearrangement levels, e.g. multistage interconnection networks

Abstract

A vector shuffle unit (50) comprises a number of base multiplexer units (mux0, mux1, mux2, mux3), which are connected to an output multiplexer (11). The vector shuffle unit (50) can be configured to shuffle a vector having any one of a number of different element sizes (for example 8, 16 and 32 bit element sizes). A power saving circuit (15) is provided for reducing the power consumption in the base multiplexer units mux1, mux2 and mux3, by masking the inputs to these multiplexer units when performing shuffle operations on certain element sizes. For example, no masking is required for mux0 as it is always needed for each of the 8, 16 and 32 bit element sizes. Multiplexer units mux1 and mux3 are only used for 8 bit elements and can be masked together as they are always used together. Mux2 is only used for 8 and 16 bit elements and requires its own power saving circuitry.

Description

Data processing apparatus and method
The invention relates to a data processing apparatus and method, and in particular to a data processing apparatus and method having power reduction when processing vectors.
The invention further relates to a device, such as a mobile phone, PDA or the like, comprising such data processing apparatus.
Power efficiency for processor based equipment is becoming increasingly important. A number of techniques have been used to reduce power usage. These include designing the processor's circuitry to use less power, or designing the processor in a manner which allows power usage to be managed. Also, for a given processor architecture, power consumption can be saved by optimizing its programming.
The mapping of several similar operations onto one piece of hardware is quite common in the area of processor design. This often means that the result is sub-optimal for each of the specific operations. Therefore, the mapping of several operations onto one piece of hardware tends to result in higher power dissipation per operation when compared to dedicated circuitry being provided for each specific operation.
For example, in many data processing applications there is the need to shuffle vectors on a per element basis. The most common element sizes to be supported are of 8, 16 and 32 bits. All of these element sizes are usually supported by providing only 8 bit element size support, with the 16 and 32 bit element sizes then being catered for as just a subset of the 8 bit element size support. In other words, in a standard instruction set the vector shuffling only needs to be described for 8 bit element support because all of the other sizes (16 and 32 bit) are in essence a subset of this instruction. An example of a vector shuffle operation for a vector with 32 elements is given in Fig. 1. For example, this vector can be 256 bits, with 32 elements of 8 bits each.
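The idea that 16 and 32 bit shuffles are a subset of the 8 bit shuffle can be sketched in a few lines of Python. This is only an illustration: the function names `shuffle_bytes` and `expand_pattern`, and the representation of a shuffle as a list of source indices, are assumptions made here and do not appear in the application.

```python
def shuffle_bytes(vector, pattern):
    """Byte-granular shuffle: output byte i is input byte pattern[i]."""
    return [vector[src] for src in pattern]


def expand_pattern(element_pattern, element_bits):
    """Rewrite an element-level pattern as the equivalent byte-level pattern.

    Each element is element_bits // 8 bytes wide, so selecting element `src`
    means selecting its bytes in order.
    """
    width = element_bits // 8
    return [src * width + offset
            for src in element_pattern
            for offset in range(width)]


vector = list(range(8))                   # a 64 bit vector holding bytes 0..7
swap_words = expand_pattern([1, 0], 32)   # swap the two 32 bit elements
print(swap_words)                         # [4, 5, 6, 7, 0, 1, 2, 3]
print(shuffle_bytes(vector, swap_words))  # [4, 5, 6, 7, 0, 1, 2, 3]
```

Any 16 or 32 bit shuffle expands to a byte pattern in this way, which is why an instruction set needs to describe only the 8 bit case.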
Referring to Fig. 2, consider the basic operation of a vector shuffle unit for a 64 bit vector, i.e. using 8 bytes, in which an output vector 3 can be a shuffled version of the input vector 5. The 64 bit vector shuffle operation is performed using a vector shuffle unit configured around four base multiplexers (mux0, mux1, mux2, mux3 - not shown). Each base multiplexer has a distance of 4 bytes between its inputs. Since the vector in this example has 8 elements of 1 byte, each of the base multiplexers will be a 2:1 multiplexer.
In particular, Fig. 2 shows the configuration for the first base multiplexer, mux0. As can be seen, for the first base multiplexer mux0, byte 0 in the output vector 3 can therefore come from byte 0 or byte 4 of the input vector 5. It will be appreciated that this configuration provides all the shuffling options for shuffling vector elements of 32 bits. In other words, Fig. 2 shows how an input vector having 64 bits, comprising two elements of 32 bits each (i.e. first element being bytes 0-3 and the second element being bytes 4-7), can be shuffled to provide an output vector 3 in which bytes 0-3 come from bytes 4-7 of the input, and bytes 4-7 come from bytes 0-3 of the input.
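A minimal sketch of the first base multiplexer, written in Python, may help to see why it alone covers every 32 bit shuffle of a 64 bit vector. The text only states the byte 0 case (byte 0 or byte 4); extending this to "output byte i selects input byte i or input byte i + 4, modulo 8" is an assumption made here, as are the function name `mux0` and the per-byte `select` list.

```python
def mux0(vector, select):
    """Model the first base multiplexer as one 2:1 mux per output byte.

    select[i] == 0 passes input byte i; select[i] == 1 passes the byte
    4 positions away (assumed to wrap around modulo the vector length).
    """
    n = len(vector)
    return [vector[(i + 4) % n] if select[i] else vector[i]
            for i in range(n)]


vector = list("abcdefgh")   # bytes 0-3 form element 0, bytes 4-7 form element 1

print(mux0(vector, [0] * 8))                   # identity
print(mux0(vector, [1] * 8))                   # the two 32 bit elements swapped
print(mux0(vector, [0, 0, 0, 0, 1, 1, 1, 1]))  # element 0 in both halves
print(mux0(vector, [1, 1, 1, 1, 0, 0, 0, 0]))  # element 1 in both halves
```

With two 32 bit elements there are only four possible output arrangements, and the four select settings above produce each of them.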
Fig. 3 shows the connection to byte 0 for each of the base multiplexers (mux0, mux1, mux2, mux3) in a 64 bit vector comprising 8 bytes. As indicated above, in the first base multiplexer, mux0, byte 0 in the output can come from byte 0 or byte 4 of the input. In the second base multiplexer, mux1, byte 0 in the output can come from byte 1 or byte 5 of the input. In the third base multiplexer, mux2, byte 0 in the output can come from byte 2 or byte 6 of the input. In the fourth base multiplexer, mux3, byte 0 in the output can come from byte 3 or byte 7 of the input. In this way, the input vector can be shuffled using the four base multiplexers such that byte 0 in the output can be derived from any of the input bytes 0 to 7.
Fig. 4 shows a conventional vector shuffle unit 1 that is capable of performing vector shuffle operations for vectors having element sizes of 8, 16 and 32 bits. In other words, the circuit shown in Fig. 4 is an example whereby several similar operations have been mapped onto one piece of hardware, which results in sub-optimal performance for the specific operations, as will be explained below. The vector shuffle unit 1 comprises a register 7 for storing an input vector 5. The register 7 is connected to each one of four base multiplexer units, mux0, mux1, mux2, mux3, using appropriate bus connections 9. The output of each base multiplexer unit mux0, mux1, mux2, mux3 is connected to an output multiplexer 11, again using appropriate bus connections 13. Table 1 below illustrates how, for certain element sizes, only some of the base multiplexer units mux0, mux1, mux2, mux3 are utilized.
Element size (bits)    Base multiplexers needed
8                      mux0, mux1, mux2 and mux3
16                     mux0 and mux2
32                     mux0
Table 1
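Table 1 can be reproduced from the Fig. 3 connections with a short Python sketch. The selection rule used here is an interpretation rather than something the text states: a base multiplexer is treated as needed for a given element size only if at least one of the bytes it offers to output byte 0 sits at the start of an element, since only such a byte keeps its position within the element. The names `byte0_sources` and `muxes_needed` are hypothetical.

```python
VECTOR_BYTES = 8   # 64 bit vector
MUX_DISTANCE = 4   # distance in bytes between the two inputs of a base mux


def byte0_sources(mux):
    """Input bytes that base multiplexer `mux` can route to output byte 0."""
    return mux, (mux + MUX_DISTANCE) % VECTOR_BYTES


def muxes_needed(element_bits):
    """Base multiplexers that can deliver an element-aligned byte to byte 0."""
    element_bytes = element_bits // 8
    return [mux for mux in range(4)
            if any(src % element_bytes == 0 for src in byte0_sources(mux))]


for bits in (8, 16, 32):
    names = ", ".join(f"mux{m}" for m in muxes_needed(bits))
    print(f"{bits:>2} bit elements: {names}")
```

Under this rule the sketch yields all four multiplexers for 8 bit elements, mux0 and mux2 for 16 bit elements, and mux0 alone for 32 bit elements, which agrees with Table 1.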
As can be seen, in the conventional hardware which is configured to allow resource sharing, i.e. different sized vector elements to be shuffled, power is wasted when shuffling certain sized elements. This is because some of the base multiplexer units will be consuming power unnecessarily. For example, when processing a 64 bit vector with an element size of 32 bits, base multiplexer units mux1, mux2, and mux3 will be consuming power unnecessarily, because their results are not used. This results in higher power dissipation per operation when compared to dedicated circuitry.
The aim of the present invention is to provide a data processing apparatus and method for shuffling vectors having different sized elements, but without wasting power.
According to a first aspect of the invention, there is provided a data processing apparatus for performing vector shuffle operations on an input vector having a plurality of elements, each element comprising a predetermined number of data bits, and the number of data bits defining the size of an element. The data processing apparatus comprises a plurality of multiplexer units configured to shuffle at least an input vector comprising elements of a first size or an input vector comprising elements of a second size. A power saving circuit is connected to receive control information indicative of the element size of a vector being shuffled. The power saving circuit is configured to disable operation of one or more of the multiplexer units in accordance with the received control information.
This invention allows maximum reuse of the vector shuffle hardware (resource sharing) while minimizing the power dissipated.
According to a second aspect of the invention, there is provided a method of reducing power in a data processing apparatus configured to perform vector shuffle operations on an input vector having a plurality of elements, each element comprising a predetermined number of data bits, and the number of data bits defining the size of an element. The method comprises the step of providing a plurality of multiplexer units for shuffling at least an input vector comprising elements of a first size or an input vector comprising elements of a second size. The method also comprises the step of providing a power saving circuit for masking an input vector from one or more of the plurality of multiplexer units, by receiving control information indicative of the element size of a vector being shuffled, and disabling the operation of one or more of the multiplexer units by masking the input vector therefrom, in accordance with the received control information.
According to a third aspect of the invention, there is provided a vector shuffle instruction for performing vector shuffle operations on a vector having a plurality of elements, each element comprising a predetermined number of data bits, and the number of data bits defining the size of an element, wherein the vector shuffle instruction comprises at least one data bit for indicating the element size of the vector being shuffled.
For a better understanding of the present invention, and to show more clearly how it may be carried into effect, reference will now be made, by way of example only, to the following drawings in which:
Fig. 1 shows a basic vector shuffle operation on a vector having 256 bits, with 32 elements of 8 bits each;
Fig. 2 shows a basic shuffle operation on a vector with 64 bits, with each element having a granularity of 32 bits;
Fig. 3 shows how the first byte is obtained for each multiplexer in a shuffle operation with 8 bits granularity, on a vector of 64 bits;
Fig. 4 shows a conventional vector shuffle unit for shuffling 8, 16 and 32 bit element sizes;
Fig. 5 shows a vector shuffle unit having a power saving circuit according to the present invention.
Fig. 5 shows a vector shuffle unit 50 according to the present invention. In a similar manner to Fig. 4, the vector shuffle unit 50 comprises a plurality of base multiplexer units (mux0, mux1, mux2, mux3), which are connected to an output multiplexer 11. However, unlike Fig. 4, a power saving circuit 15 is provided for reducing the power dissipation in the base multiplexer units. In particular, power dissipation in base multiplexer units mux1, mux2 and mux3 can be reduced by masking the inputs to these multiplexer units, because of the element size, when their result is not needed. No masking is required for mux0 as it is always needed for 8, 16 and 32 bit element sizes (as seen from Table 1). Mux1 and mux3 are only used for 8 bit element sizes and can therefore be masked together as they are always used together. Mux2 is only used for 8 and 16 bit elements and requires its own masking circuitry within the power saving circuitry 15.
The power saving circuit 15 is disposed between the input register 7 and the base multiplexer units. The power saving circuit 15 receives first and second control bits 17, 19. The first and second control bits 17, 19 form part of, or are derived from, the instruction set, for example, part of a vector shuffle instruction.
The first control bit 17 can be set "high" to indicate when a 16 bit element is being shuffled, and set "low" at other times. The second control bit 19 can be set "high" to indicate when a 32 bit element is being shuffled, and set "low" at other times. The first and second control bits 17, 19 are connected to an OR gate 21. The output of the OR gate 21 is connected to the input of a first AND gate 23. The AND gate 23 is connected to receive its other input from the register 7, and has its output connected to the second and fourth multiplexer units, mux1, mux3. Thus, it can be seen that the first AND gate 23 receives the input vector 5 via bus connection 9 at one input, which can be masked using the signal received from the OR gate 21. A second AND gate 25 is connected to receive the input vector 5 at its first input, and the second control bit 19 at its other input. The output of the second AND gate 25 is connected to the third multiplexer unit mux2. Thus, it can be seen that the second AND gate 25 receives the input vector 5 via bus connection 9 at one input, which can be masked using the signal received from the second control bit 19.
When processing a vector having a granularity of 8 bit elements, the first and second control bits 17, 19 will be set low. This in turn will result in each of the AND gates 23 and 25 having one of its inputs set low, thus resulting in the input vector 5 being connected to each of the multiplexer units mux0, mux1, mux2 and mux3 in the normal way. In other words, multiplexer unit mux0 will receive the input vector directly, multiplexer units mux1 and mux3 will receive their inputs via the first AND gate 23, while multiplexer unit mux2 will receive its input from the second AND gate 25.
When processing a vector having a granularity of 16 bit elements, the first control bit 17 is set high. This has the effect of setting the output of the OR gate 21 high, which in turn provides a high signal on one of the input connections to the first AND gate 23. This has the effect of masking the input vector from the multiplexer units mux1 and mux3, which are connected to the output of the first AND gate 23. Since the input to the second AND gate 25 is connected to the second control bit 19 (i.e. the control bit for the 32 bit element which will be set low), the multiplexer unit mux2 will receive the input vector 5 at its input in the normal manner. In this way, only base multiplexer units mux0 and mux2 are used when processing a 16 bit vector. Power is therefore saved because base multiplexer units mux1 and mux3 are masked from operation.
When processing a vector having a granularity of 32 bit elements, the second control bit 19 is set high. This has the effect of setting the output of the OR gate 21 high, which in turn provides a high signal on one of the input connections to the first AND gate 23. This has the effect of masking the input vector 5 from the multiplexer units mux1 and mux3, which are connected to the output of the first AND gate 23. In addition, since the input to the second AND gate 25 is also set high (i.e. because this input is connected to the second control bit 19), the multiplexer unit mux2 will also be masked from receiving the input vector 5. In this way, only base multiplexer unit mux0 is used when processing a 32 bit vector. Power is therefore saved because base multiplexer units mux1, mux2 and mux3 are masked from operation.
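The three cases described for the power saving circuit 15 can be modelled in software as follows. This Python sketch is an illustration only. It assumes that a high control signal forces the gated bus to zero, which implies that the control input is inverted at each AND gate; the text does not spell out that polarity. The helper names `control_bits`, `and_mask`, `mux_inputs` and `active_muxes` are hypothetical, and the mapping from element size to control bits follows the description of control bits 17 and 19.

```python
def control_bits(element_bits):
    """Hypothetical decode: bit 17 is high for 16 bit, bit 19 for 32 bit elements."""
    bit17 = 1 if element_bits == 16 else 0
    bit19 = 1 if element_bits == 32 else 0
    return bit17, bit19


def and_mask(vector, mask):
    """Model an AND gate on the bus; a high mask forces every byte to zero.

    Assumption: the mask is inverted before it reaches the AND gate.
    """
    enable = 0x00 if mask else 0xFF
    return [byte & enable for byte in vector]


def mux_inputs(vector, element_bits):
    """Return the bus value seen by each base multiplexer."""
    bit17, bit19 = control_bits(element_bits)
    or21 = bit17 | bit19                # OR gate 21
    gate23 = and_mask(vector, or21)     # first AND gate 23 feeds mux1 and mux3
    gate25 = and_mask(vector, bit19)    # second AND gate 25 feeds mux2
    return {"mux0": list(vector), "mux1": gate23, "mux2": gate25, "mux3": gate23}


def active_muxes(vector, element_bits):
    """Names of the muxes whose input bus is not held at zero (non-zero vector assumed)."""
    return [name for name, seen in mux_inputs(vector, element_bits).items()
            if any(seen)]


vector = [0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88]
for bits in (8, 16, 32):
    print(bits, active_muxes(vector, bits))
```

Only mux0 is never gated, which matches the observation that it is needed for every supported element size.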
It is noted that in the analysis above, it is assumed that when the shuffling unit is not active, the inputs are kept "low". As will be appreciated from the above, by differentiating the different element sizes in the instruction set (i.e. providing first and second control bits 17, 19), a power saving opportunity is made possible in the hardware. The power saving circuit 15 detects the power saving opportunity using the first and second control bits 17, 19, and masks the appropriate busses. It will be appreciated that modifications are required to both the instruction set and the hardware circuitry in order to realize the power saving.
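One rough way to see the effect of masking is to count bit toggles on each multiplexer input bus while a sequence of vectors is shuffled. The Python sketch below does this. It is a crude proxy for dynamic power, not a power model from the application: the toggle metric, the function names `bit_toggles` and `bus_toggles`, and the test vectors are all assumptions, while the set of masked multiplexers per element size follows Table 1. Buses are taken to start "low", as in the analysis above.

```python
MUX_NAMES = ("mux0", "mux1", "mux2", "mux3")

# Multiplexers whose input bus is masked for each element size (from Table 1).
MASKED = {8: set(), 16: {"mux1", "mux3"}, 32: {"mux1", "mux2", "mux3"}}


def bit_toggles(prev, curr):
    """Number of bit positions that differ between two bus values."""
    return sum(bin(a ^ b).count("1") for a, b in zip(prev, curr))


def bus_toggles(vectors, element_bits):
    """Count input-bus bit toggles per base multiplexer over a vector sequence."""
    masked = MASKED[element_bits]
    totals = {name: 0 for name in MUX_NAMES}
    prev = {name: [0] * 8 for name in MUX_NAMES}   # buses idle low
    for vec in vectors:
        for name in MUX_NAMES:
            # A masked bus is held at zero, so it never toggles.
            seen = [0] * len(vec) if name in masked else list(vec)
            totals[name] += bit_toggles(prev[name], seen)
            prev[name] = seen
    return totals


vectors = [[0xFF] * 8, [0x00] * 8, [0xFF] * 8]   # worst case: every bit flips
for bits in (8, 16, 32):
    print(bits, bus_toggles(vectors, bits))
```

In this worst-case sequence each unmasked bus toggles 192 times, while a masked bus stays at zero toggles, so the 32 bit case shows activity on the mux0 bus only.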
This means of power saving can be applied to all hardware that has to support shuffling vectors on a per element basis where there is a manner (e.g. instruction set) of differentiating multiple element sizes. Although the preferred embodiment has been described in relation to a vector shuffle unit configured to shuffle 8, 16 or 32 bit elements, it will be appreciated that the invention could also be used with a vector shuffle unit configured to shuffle fewer, or more, differently sized elements. It will also be appreciated that, although the preferred embodiment refers to the control signals 17, 19 having a logic high signal for indicating a particular state, a logic low signal could also be used, with the power saving circuitry adapted accordingly to give the same logic output. Furthermore, although the preferred embodiment has been described using AND gates in the power saving circuit, it will be appreciated that other logic circuitry can be used to provide operand isolation.
Also, although the preferred embodiment has been described in relation to shuffling an input vector, the invention may equally be used with more than one input vector, for example in a system having two to one shuffle units.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the appended claims. The word "comprising" does not exclude the presence of elements or steps other than those listed in a claim, "a" or "an" does not exclude a plurality, and a single processor or other unit may fulfill the functions of several units recited in the claims. Any reference signs in the claims shall not be construed so as to limit their scope.

Claims

CLAIMS:
1. A data processing apparatus for performing vector shuffle operations on an input vector having a plurality of elements, each element comprising a predetermined number of data bits, and the number of data bits defining the size of an element, the data processing apparatus comprising: a plurality of multiplexer units configured to shuffle at least an input vector comprising elements of a first size or an input vector comprising elements of a second size; a power saving circuit connected to receive control information indicative of the element size of a vector being shuffled; wherein the power saving circuit is configured to disable operation of one or more of the multiplexer units in accordance with the received control information.
2. A data processing apparatus as claimed in claim 1, wherein the control information is contained in a vector shuffle instruction forming part of an instruction set.
3. A data processing apparatus as claimed in claim 1 or 2, wherein the power saving circuit comprises logic circuitry for masking an input vector from the one or more multiplexer units in accordance with the received control information.
4. A data processing apparatus as claimed in any one of the preceding claims, wherein the data processing apparatus is configured to shuffle vectors having 8, 16 or 32 bit element sizes, the apparatus comprising: first, second, third and fourth base multiplexer units forming the plurality of multiplexer units; a first logic gate for masking the second and fourth base multiplexer units when either a first control bit or a second control bit in the control information is enabled; and a second logic gate for masking the third base multiplexer unit when the second control bit in the control information is enabled.
5. A method of reducing power in a data processing apparatus configured to perform vector shuffle operations on an input vector having a plurality of elements, each element comprising a predetermined number of data bits, and the number of data bits defining the size of an element, the method comprising the steps of: providing a plurality of multiplexer units for shuffling at least an input vector comprising elements of a first size or an input vector comprising elements of a second size; providing a power saving circuit for masking an input vector from one or more of the plurality of multiplexer units; receiving control information indicative of the element size of a vector being shuffled; and disabling the operation of one or more of the multiplexer units by masking the input vector therefrom, in accordance with the received control information.
6. A method as claimed in claim 5, wherein the control information is received from a vector shuffle instruction forming part of an instruction set.
7. A method as claimed in claim 5 or 6, wherein the step of disabling the operation of one or more of the multiplexer units comprises the step of masking an input vector from the one or more multiplexer units in accordance with the received control information.
8. A method as claimed in any one of claims 5 to 7, wherein the data processing apparatus is configured to shuffle vectors having 8, 16 or 32 bit element sizes, the method comprising the steps of: providing first, second, third and fourth base multiplexer units as the plurality of multiplexer units; masking the second and fourth base multiplexer units when either a first control bit or a second control bit in the control information is enabled; and masking the third base multiplexer unit when the second control bit in the control information is enabled.
9. A vector shuffle instruction for performing vector shuffle operations on a vector having a plurality of elements, each element comprising a predetermined number of data bits, and the number of data bits defining the size of an element, wherein the vector shuffle instruction comprises at least one data bit for indicating the element size of the vector being shuffled.
10. A device comprising a data processing apparatus as claimed in any one of claims 1 to 4.
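
The masking logic recited in claims 4 and 8 can be illustrated with a minimal Python sketch. It models only the two logic gates and the four base multiplexer units named in those claims. The function names `masked_units` and `active_units` are hypothetical, and the mapping of control bits to element sizes noted in the comments (no bit set for 8-bit elements, first bit for 16-bit, second bit for 32-bit) is an assumption made for illustration; the claims do not state which bit corresponds to which size.

```python
def masked_units(ctrl_bit1: bool, ctrl_bit2: bool) -> dict:
    """Return, for base multiplexer units 1 to 4, whether the unit is masked.

    Assumed (not stated in the claims): no bit set = 8-bit elements,
    first bit set = 16-bit elements, second bit set = 32-bit elements.
    """
    # First logic gate: masks units 2 and 4 when either control bit is enabled.
    gate1 = ctrl_bit1 or ctrl_bit2
    # Second logic gate: masks unit 3 when the second control bit is enabled.
    gate2 = ctrl_bit2
    # Unit 1 is never masked by either gate in this model.
    return {1: False, 2: gate1, 3: gate2, 4: gate1}


def active_units(ctrl_bit1: bool, ctrl_bit2: bool) -> list:
    """List the base multiplexer units that still receive the input vector."""
    masks = masked_units(ctrl_bit1, ctrl_bit2)
    return [unit for unit, masked in masks.items() if not masked]


if __name__ == "__main__":
    for bits in [(False, False), (True, False), (False, True)]:
        print(bits, active_units(*bits))
```

Under these assumptions, all four units remain active when neither control bit is enabled, units 1 and 3 remain active when only the first control bit is enabled, and only unit 1 remains active when the second control bit is enabled.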
PCT/IB2006/054214 2005-11-15 2006-11-13 Vector shuffle unit WO2007057832A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP05110769 2005-11-15
EP05110769.6 2005-11-15

Publications (2)

Publication Number Publication Date
WO2007057832A2 (en) 2007-05-24
WO2007057832A3 (en) 2007-08-02

Family

ID=37907071

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2006/054214 WO2007057832A2 (en) 2005-11-15 2006-11-13 Vector shuffle unit

Country Status (2)

Country Link
TW (1) TW200811705A (en)
WO (1) WO2007057832A2 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013095620A1 (en) 2011-12-23 2013-06-27 Intel Corporation Apparatus and method of improved insert instructions
CN108241504A (en) 2011-12-23 2018-07-03 英特尔公司 The device and method of improved extraction instruction
US9632980B2 (en) 2011-12-23 2017-04-25 Intel Corporation Apparatus and method of mask permute instructions
US9946540B2 (en) 2011-12-23 2018-04-17 Intel Corporation Apparatus and method of improved permute instructions with multiple granularities
WO2013095637A1 (en) 2011-12-23 2013-06-27 Intel Corporation Apparatus and method of improved permute instructions

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS62105524A (en) * 1985-11-01 1987-05-16 Nec Corp Signal selecting circuit
EP0757312A1 (en) * 1995-08-01 1997-02-05 Hewlett-Packard Company Data processor
US20030095547A1 (en) * 2001-11-21 2003-05-22 Schofield William G.J. 2n-1 Shuffling network
US6622242B1 (en) * 2000-04-07 2003-09-16 Sun Microsystems, Inc. System and method for performing generalized operations in connection with bits units of a data word

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LEE R B: "Subword permutation instructions for two-dimensional multimedia processing in microSIMD architectures" APPLICATION-SPECIFIC SYSTEMS, ARCHITECTURES, AND PROCESSORS, 2000. PROCEEDINGS. IEEE INTERNATIONAL CONFERENCE ON JULY 10-12, 2000, PISCATAWAY, NJ, USA,IEEE, 10 July 2000 (2000-07-10), pages 3-14, XP010507732 ISBN: 0-7695-0716-6 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017112170A1 (en) * 2015-12-20 2017-06-29 Intel Corporation Instruction and logic for vector permute
CN108292271A (en) * 2015-12-20 2018-07-17 英特尔公司 Instruction for vector permutation and logic
US10467006B2 (en) 2015-12-20 2019-11-05 Intel Corporation Permutating vector data scattered in a temporary destination into elements of a destination register based on a permutation factor
CN108292271B (en) * 2015-12-20 2024-03-29 英特尔公司 Instruction and logic for vector permutation

Also Published As

Publication number Publication date
TW200811705A (en) 2008-03-01
WO2007057832A3 (en) 2007-08-02

Similar Documents

Publication Publication Date Title
US7571303B2 (en) Reconfigurable integrated circuit
KR101310044B1 (en) Incresing workload performance of one or more cores on multiple core processors
KR101050554B1 (en) Masking in Data Processing Systems Applicable to Development Interfaces
US9348792B2 (en) Coarse-grained reconfigurable processor and code decompression method thereof
EP1870813A1 (en) Page processing circuits, devices, methods and systems for secure demand paging and other operations
JP6239130B2 (en) System and method for reducing memory bus bandwidth according to workload
EP2580657B1 (en) Information processing device and method
US10678710B2 (en) Protection scheme for embedded code
WO2007057832A2 (en) Vector shuffle unit
US8275975B2 (en) Sequencer controlled system and method for controlling timing of operations of functional units
US20200042321A1 (en) Low power back-to-back wake up and issue for paired issue queue in a microprocessor
US5784642A (en) System for establishing a transfer mode between system controller and peripheral device
US9697163B2 (en) Data path configuration component, signal processing device and method therefor
US20140325183A1 (en) Integrated circuit device, asymmetric multi-core processing module, electronic device and method of managing execution of computer program code therefor
US9000804B2 (en) Integrated circuit device comprising clock gating circuitry, electronic device and method for dynamically configuring clock gating
DE102022121048A1 (en) SELECTING THE POWER SUPPLY FOR A HOST SYSTEM
US20050210219A1 (en) Vliw processsor
US20060101291A1 (en) System, method, and apparatus for reducing power consumption in a microprocessor
US9442788B2 (en) Bus protocol checker, system on chip including the same, bus protocol checking method
TW202340967A (en) Encoding byte information on a data bus
US9501584B2 (en) Apparatus and method for distributing a search key in a ternary memory array
US20120005438A1 (en) Input/output control apparatus and information processing apparatus
EP2585907B1 (en) Accelerating execution of compressed code
JP2005327062A (en) Control method for input/output terminal module and input/output terminal module
KR20150002319A (en) Processor of heterogeneous cluster architecture

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase in: Ref country code: DE
122 Ep: pct application non-entry in european phase (Ref document number: 06821408; Country of ref document: EP; Kind code of ref document: A2)