WO2007057832A2

WO2007057832A2 - Vector shuffle unit

Info

Publication number: WO2007057832A2
Application number: PCT/IB2006/054214
Authority: WO
Inventors: David E. Leane; Jean-Paul C. F. H. Smeets; Willem E. H. Kloosterhuis
Original assignee: Nxp B.V.
Priority date: 2005-11-15
Filing date: 2006-11-13
Publication date: 2007-05-24
Also published as: TW200811705A; WO2007057832A3

Abstract

A vector shuffle unit (50) comprises a number of base multiplexer units (mux0, mux1, mux2, mux3), which are connected to an output multiplexer (11). The vector shuffle unit (50) can be configured to shuffle a vector having any one of a number of different element sizes (for example 8, 16 and 32 bit element sizes). A power saving circuit (15) is provided for reducing the power consumption in the base multiplexer units mux1, mux2 and mux3, by masking the inputs to these multiplexer units when performing shuffle operations on certain element sizes. For example, no masking is required for mux0 as it is always needed for each of the 8, 16 and 32 bit element sizes. Multiplexer units mux1 and mux3 are only used for 8 bit elements and can be masked together as they are always used together. Mux2 is only used for 8 and 16 bit elements and requires its own power saving circuitry.

Description

Data processing apparatus and method

The invention relates to a data processing apparatus and method, and in particular to a data processing apparatus and method having power reduction when processing vectors.

The invention further relates to a device, such as a mobile phone, PDA or the like, comprising such a data processing apparatus.

Power efficiency for processor based equipment is becoming increasingly important. A number of techniques have been used to reduce power usage. These include designing the processor's circuitry to use less power, or designing the processor in a manner which allows power usage to be managed. Also, for a given processor architecture, power consumption can be saved by optimizing its programming.

The mapping of several similar operations onto one piece of hardware is quite common in the area of processor design. This often means that the result is sub-optimal for each of the specific operations. Therefore, the mapping of several operations onto one piece of hardware tends to result in higher power dissipation per operation when compared to dedicated circuitry being provided for each specific operation.

For example, in many data processing applications there is the need to shuffle vectors on a per element basis. The most common element sizes to be supported are 8, 16 and 32 bits. All of these element sizes are usually supported by providing only 8 bit element size support, with the 16 and 32 bit element sizes then being catered for as just a subset of the 8 bit element size support. In other words, in a standard instruction set the vector shuffling only needs to be described for 8 bit element support because all of the other sizes (16 and 32 bit) are in essence a subset of this instruction. An example of a vector shuffle operation for a vector with 32 elements is given in Fig. 1. For example, this vector can be 256 bits, with 32 elements of 8 bits each.
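The "subset" relationship can be sketched as follows: an element-level shuffle pattern for 16 or 32 bit elements expands into a byte-level pattern, which the 8 bit shuffle then executes. This is an illustrative sketch only; the function names `expand_to_byte_indices` and `shuffle_bytes` are assumptions and are not taken from any instruction set.

```python
def expand_to_byte_indices(element_indices, element_bytes):
    """Expand an element-level shuffle pattern into a byte-level one.

    element_indices[i] is the source element for output element i.
    """
    byte_indices = []
    for src in element_indices:
        for offset in range(element_bytes):
            byte_indices.append(src * element_bytes + offset)
    return byte_indices


def shuffle_bytes(vector, byte_indices):
    """Byte-granular shuffle: output byte i is input byte byte_indices[i]."""
    return bytes(vector[i] for i in byte_indices)


vec = bytes(range(8))  # 64 bit vector holding bytes 0..7
# Swap the two 32 bit elements using only the byte-level shuffle.
pattern32 = expand_to_byte_indices([1, 0], element_bytes=4)
swapped = shuffle_bytes(vec, pattern32)
```

The byte-level shuffle is fully general, which is why hardware sized for it does more work than a 16 or 32 bit shuffle strictly requires.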

Referring to Fig. 2, consider the basic operation of a vector shuffle unit for a 64 bit vector, i.e. using 8 bytes, in which an output vector 3 can be a shuffled version of the input vector 5. The 64 bit vector shuffle operation is performed using a vector shuffle unit configured around four base multiplexers (mux0, mux1, mux2, mux3 - not shown). Each base multiplexer has a distance of 4 bytes between its inputs. Since the vector in this example has 8 elements of 1 byte, each of the base multiplexers will be a 2:1 multiplexer.

In particular, Fig. 2 shows the configuration for the first base multiplexer, mux0. As can be seen, for the first base multiplexer mux0, byte 0 in the output vector 3 can therefore come from byte 0 or byte 4 of the input vector 5. It will be appreciated that this configuration provides all the shuffling options for shuffling vector elements of 32 bits. In other words, Fig. 2 shows how an input vector having 64 bits, comprising two elements of 32 bits each (i.e. first element being bytes 0-3 and the second element being bytes 4-7), can be shuffled to provide an output vector 3 in which bytes 0-3 come from bytes 4-7 of the input, and bytes 4-7 come from bytes 0-3 of the input.
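The role of the first base multiplexer in Fig. 2 can be sketched as a row of 2:1 selections with a distance of 4 bytes between the inputs. This is a behavioural model only: the per-byte select list `sel` and the wrap-around used for output bytes 4-7 are assumptions made for illustration, since the text gives only the connections for byte 0 and the overall 32 bit swap.

```python
def mux0(vector, sel):
    """Behavioural model of the first base multiplexer for an 8-byte vector.

    For output byte i, sel[i] == 0 picks input byte i and sel[i] == 1
    picks the input byte 4 positions away (modulo 8, an assumption).
    """
    assert len(vector) == 8 and len(sel) == 8
    return bytes(vector[(i + 4 * sel[i]) % 8] for i in range(8))


vec = bytes(range(8))
unchanged = mux0(vec, [0] * 8)  # every byte stays in place
swapped = mux0(vec, [1] * 8)    # the two 32 bit elements trade places
```

With only these two choices per byte, this single multiplexer already covers every shuffle of two aligned 32 bit elements.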

Fig. 3 shows the connection to byte 0 for each of the base multiplexers (mux0, mux1, mux2, mux3) in a 64 bit vector comprising 8 bytes. As indicated above, in the first base multiplexer, mux0, byte 0 in the output can come from byte 0 or byte 4 of the input. In the second base multiplexer, mux1, byte 0 in the output can come from byte 1 or byte 5 of the input. In the third base multiplexer, mux2, byte 0 in the output can come from byte 2 or byte 6 of the input. In the fourth base multiplexer, mux3, byte 0 in the output can come from byte 3 or byte 7 of the input. In this way, the input vector can be shuffled using the four base multiplexers such that byte 0 in the output can be derived from any of the input bytes 0 to 7.

Fig. 4 shows a conventional vector shuffle unit 1 that is capable of performing vector shuffle operations for vectors having element sizes of 8, 16 and 32 bits. In other words, the circuit shown in Fig. 4 is an example whereby several similar operations have been mapped onto one piece of hardware, which results in sub-optimal performance for the specific operations, as will be explained below. The vector shuffle unit 1 comprises a register 7 for storing an input vector 5. The register 7 is connected to each one of four base multiplexer units, mux0, mux1, mux2, mux3, using appropriate bus connections 9. The output of each base multiplexer unit mux0, mux1, mux2, mux3 is connected to an output multiplexer 11, again using appropriate bus connections 13. Table 1 below illustrates how, for certain element sizes, only some of the base multiplexer units mux0, mux1, mux2, mux3 are utilized.

Element size (bits)    Base multiplexers needed
8                      mux0, mux1, mux2 and mux3
16                     mux0 and mux2
32                     mux0

Table 1
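Table 1 can be reproduced from the byte 0 connections of Fig. 3. The sketch below assumes, beyond what the text states explicitly, that elements are aligned and that bytes keep their position within an element, so output byte 0 can only be fed from the first byte of some element. `BYTE0_SOURCES` and `base_muxes_needed` are illustrative names.

```python
# Input bytes that each base multiplexer can route to output byte 0 (Fig. 3).
BYTE0_SOURCES = {
    "mux0": (0, 4),
    "mux1": (1, 5),
    "mux2": (2, 6),
    "mux3": (3, 7),
}


def base_muxes_needed(element_bits):
    """Return the base multiplexers that can ever supply output byte 0.

    Assumes aligned elements: output byte 0 is the first byte of an
    element, so its source must also be the first byte of an element.
    """
    element_bytes = element_bits // 8
    return [
        name
        for name, sources in BYTE0_SOURCES.items()
        if all(src % element_bytes == 0 for src in sources)
    ]


for bits in (8, 16, 32):
    print(bits, base_muxes_needed(bits))
```

Under these assumptions the multiplexers left out for a given element size never contribute to the result, which is the waste addressed by the invention.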

As can be seen, in the conventional hardware which is configured to allow resource sharing, i.e. different sized vector elements to be shuffled, power is wasted when shuffling certain sized elements. This is because some of the base multiplexer units will be consuming power unnecessarily. For example, when processing a 64 bit vector with an element size of 32 bits, base multiplexer units mux1, mux2, and mux3 will be consuming power unnecessarily, because their results are not used. This results in higher power dissipation per operation when compared to dedicated circuitry.

The aim of the present invention is to provide a data processing apparatus and method for shuffling vectors having different sized elements, but without wasting power.

According to a first aspect of the invention, there is provided a data processing apparatus for performing vector shuffle operations on an input vector having a plurality of elements, each element comprising a predetermined number of data bits, and the number of data bits defining the size of an element. The data processing apparatus comprises a plurality of multiplexer units configured to shuffle at least an input vector comprising elements of a first size or an input vector comprising elements of a second size. A power saving circuit is connected to receive control information indicative of the element size of a vector being shuffled. The power saving circuit is configured to disable operation of one or more of the multiplexer units in accordance with the received control information.

This invention allows maximum reuse of the vector shuffle hardware (resource sharing) while minimizing the power dissipated.

According to a second aspect of the invention, there is provided a method of reducing power in a data processing apparatus configured to perform vector shuffle operations on an input vector having a plurality of elements, each element comprising a predetermined number of data bits, and the number of data bits defining the size of an element. The method comprises the step of providing a plurality of multiplexer units for shuffling at least an input vector comprising elements of a first size or an input vector comprising elements of a second size. The method also comprises the step of providing a power saving circuit for masking an input vector from one or more of the plurality of multiplexer units, by receiving control information indicative of the element size of a vector being shuffled, and disabling the operation of one or more of the multiplexer units by masking the input vector therefrom, in accordance with the received control information.

According to a third aspect of the invention, there is provided a vector shuffle instruction for performing vector shuffle operations on a vector having a plurality of elements, each element comprising a predetermined number of data bits, and the number of data bits defining the size of an element, wherein the vector shuffle instruction comprises at least one data bit for indicating the element size of the vector being shuffled.

For a better understanding of the present invention, and to show more clearly how it may be carried into effect, reference will now be made, by way of example only, to the following drawings in which:

Fig. 1 shows a basic vector shuffle operation on a vector having 256 bits, with 32 elements of 8 bits each;

Fig. 2 shows a basic shuffle operation on a vector with 64 bits, with each element having a granularity of 32 bits;

Fig. 3 shows how the first byte is obtained for each multiplexer in a shuffle operation with 8 bits granularity, on a vector of 64 bits;

Fig. 4 shows a conventional vector shuffle unit for shuffling 8, 16 and 32 bit element sizes;

Fig. 5 shows a vector shuffle unit having a power saving circuit according to the present invention.

Fig. 5 shows a vector shuffle unit 50 according to the present invention. In a similar manner to Fig. 4, the vector shuffle unit 50 comprises a plurality of base multiplexer units (mux0, mux1, mux2, mux3), which are connected to an output multiplexer 11. However, unlike Fig. 4, a power saving circuit 15 is provided for reducing the power dissipation in the base multiplexer units. In particular, power dissipation in base multiplexer units mux1, mux2 and mux3 can be reduced by masking the inputs to these multiplexer units, depending on the element size, when their result is not needed. No masking is required for mux0 as it is always needed for 8, 16 and 32 bit element sizes (as seen from Table 1). Mux1 and mux3 are only used for 8 bit element sizes and can therefore be masked together as they are always used together. Mux2 is only used for 8 and 16 bit elements and requires its own masking circuitry within the power saving circuit 15.

The power saving circuit 15 is disposed between the input register 7 and the base multiplexer units. The power saving circuit 15 receives first and second control bits 17, 19. The first and second control bits 17, 19 form part of, or are derived from, the instruction set, for example, part of a vector shuffle instruction.

The first control bit 17 can be set "high" to indicate when a 16 bit element is being shuffled, and set "low" at other times. The second control bit 19 can be set "high" to indicate when a 32 bit element is being shuffled, and set "low" at other times. The first and second control bits 17, 19 are connected to an OR gate 21. The output of the OR gate 21 is connected to the input of a first AND gate 23. The AND gate 23 is connected to receive its other input from the register 7, and has its output connected to the second and fourth multiplexer units, mux1, mux3. Thus, it can be seen that the first AND gate 23 receives the input vector 5 via bus connection 9 at one input, which can be masked using the signal received from the OR gate 21.

A second AND gate 25 is connected to receive the input vector 5 at its first input, and the second control bit 19 at its other input. The output of the second AND gate 25 is connected to the third multiplexer unit mux2. Thus, it can be seen that the second AND gate 25 receives the input vector 5 via bus connection 9 at one input, which can be masked using the signal received from the second control bit 19.

When processing a vector having a granularity of 8 bit elements, the first and second control bits 17, 19 will be set low. This in turn will result in each of the AND gates 23 and 25 having one of its inputs set low, thus resulting in the input vector 5 being connected to each of the multiplexer units mux0, mux1, mux2 and mux3 in the normal way. In other words, multiplexer unit mux0 will receive the input vector directly, multiplexer units mux1 and mux3 will receive their inputs via the first AND gate 23, while multiplexer unit mux2 will receive its input from the second AND gate 25.
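The masking performed by the two AND gates can be sketched at bus level. The text describes a high control signal as masking the input and a low one as passing it; for a plain AND gate this implies that the signal is inverted before it reaches the gate, so the sketch makes that inversion explicit as an assumption. The 64 bit bus width and the function names are likewise assumptions for illustration.

```python
BUS_MASK = (1 << 64) - 1  # assumed 64 bit input vector bus


def and_gate_bus(vector, enable):
    """AND every bus bit with a 1-bit enable: passes the vector or forces 0."""
    return vector & (BUS_MASK if enable else 0)


def power_saving_circuit(vector, bit17, bit19):
    """Return the input seen by each base multiplexer.

    bit17 is high for 16 bit elements, bit19 is high for 32 bit elements.
    The inversions (`not`) are an assumption needed for a high control
    signal to mask through an AND gate.
    """
    or21 = bit17 or bit19                       # OR gate 21
    gate23 = and_gate_bus(vector, not or21)     # feeds the second and fourth units
    gate25 = and_gate_bus(vector, not bit19)    # feeds the third unit
    return {"mux0": vector, "mux1": gate23, "mux2": gate25, "mux3": gate23}


# 8 bit elements: both control bits low, every multiplexer sees the vector.
inputs = power_saving_circuit(0x0706050403020100, bit17=False, bit19=False)
```

A masked bus is held at zero regardless of the input vector, so the multiplexer behind it sees no changing data.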

When processing a vector having a granularity of 16 bit elements, the first control bit 17 is set high. This has the effect of setting the output of the OR gate 21 high, which in turn provides a high signal on one of the input connections to the first AND gate 23. This has the effect of masking the input vector from the multiplexer units mux1 and mux3, which are connected to the output of the first AND gate 23. Since the input to the second AND gate 25 is connected to the second control bit 19 (i.e. the control bit for the 32 bit element which will be set low), the multiplexer unit mux2 will receive the input vector 5 at its input in the normal manner. In this way, only base multiplexer units mux0 and mux2 are used when processing a vector with 16 bit elements. Power is therefore saved because base multiplexer units mux1 and mux3 are masked from operation.

When processing a vector having a granularity of 32 bit elements, the second control bit 19 is set high. This has the effect of setting the output of the OR gate 21 high, which in turn provides a high signal on one of the input connections to the first AND gate 23. This has the effect of masking the input vector 5 from the multiplexer units mux1 and mux3, which are connected to the output of the first AND gate 23. In addition, since the input to the second AND gate 25 is also set high (i.e. because this input is connected to the second control bit 19), the multiplexer unit mux2 will also be masked from receiving the input vector 5. In this way, only base multiplexer unit mux0 is used when processing a vector with 32 bit elements. Power is therefore saved because base multiplexer units mux1, mux2 and mux3 are masked from operation.
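The three cases above can be summarised as a small truth table. The sketch below derives the two control bits from the element size and lists which base multiplexers remain unmasked; the function names are illustrative, and the mapping from element size to control bits follows the description rather than any actual instruction encoding.

```python
def control_bits(element_bits):
    """Map an element size to (first control bit, second control bit)."""
    if element_bits not in (8, 16, 32):
        raise ValueError("unsupported element size")
    return element_bits == 16, element_bits == 32


def active_muxes(element_bits):
    """List the base multiplexers whose inputs are not masked."""
    bit17, bit19 = control_bits(element_bits)
    mask_1_and_3 = bit17 or bit19   # output of the OR gate
    mask_2 = bit19                  # second control bit alone
    active = ["mux0"]               # the first unit is never masked
    if not mask_1_and_3:
        active.append("mux1")
    if not mask_2:
        active.append("mux2")
    if not mask_1_and_3:
        active.append("mux3")
    return active


for bits in (8, 16, 32):
    print(bits, active_muxes(bits))
```

The resulting sets match the "base multiplexers needed" column of Table 1, so no unit that is required for a given element size is ever masked.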

It is noted that in the analysis above, it is assumed that when the shuffling unit is not active, the inputs are kept "low". As will be appreciated from the above, by differentiating the different element sizes in the instruction set (i.e. providing first and second control bits 17, 19), a power saving opportunity is made possible in the hardware. The power saving circuit 15 detects the power saving opportunity using the first and second control bits 17, 19, and masks the appropriate busses. It will be appreciated that modifications are required to both the instruction set and the hardware circuitry in order to realize the power saving.
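Why holding a masked bus low saves power can be illustrated with a toggle count, since dynamic power in CMOS logic grows with the number of signal transitions. The sketch below counts bit flips on a multiplexer input bus over a sequence of cycles, with and without masking. The 8 bit vector values are made up for illustration, and the toggle count is only a rough proxy for switching activity, not a power measurement.

```python
def toggles(sequence):
    """Count bit transitions between consecutive bus values."""
    total = 0
    for prev, cur in zip(sequence, sequence[1:]):
        total += bin(prev ^ cur).count("1")
    return total


# Illustrative input values arriving on successive cycles.
vectors = [0x00, 0xFF, 0x0F, 0xF0]

unmasked = toggles(vectors)             # bus follows the input vector
masked = toggles([0 for _ in vectors])  # bus held low by the mask
```

In this model a masked bus never toggles, so the logic behind it does not switch for as long as the mask is applied.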

This means of power saving can be applied to all hardware that has to support shuffling vectors on a per element basis where there is a manner (e.g. instruction set) of differentiating multiple element sizes. Although the preferred embodiment has been described in relation to a vector shuffle unit configured to shuffle 8, 16 or 32 bit elements, it will be appreciated that the invention could also be used with a vector shuffle unit configured to shuffle fewer, or more, differently sized elements. It will also be appreciated that, although the preferred embodiment refers to the control signals 17, 19 having a logic high signal for indicating a particular state, a logic low signal could also be used, with the power saving circuitry adapted accordingly to give the same logic output. Furthermore, although the preferred embodiment has been described using AND gates in the power saving circuit, it will be appreciated that other logic circuitry can be used to provide operand isolation.

Also, although the preferred embodiment has been described in relation to shuffling an input vector, the invention may equally be used with more than one input vector, for example in a system having two to one shuffle units.

It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the appended claims. The word "comprising" does not exclude the presence of elements or steps other than those listed in a claim, "a" or "an" does not exclude a plurality, and a single processor or other unit may fulfill the functions of several units recited in the claims. Any reference signs in the claims shall not be construed so as to limit their scope.

Claims

1. A data processing apparatus for performing vector shuffle operations on an input vector having a plurality of elements, each element comprising a predetermined number of data bits, and the number of data bits defining the size of an element, the data processing apparatus comprising: a plurality of multiplexer units configured to shuffle at least an input vector comprising elements of a first size or an input vector comprising elements of a second size; a power saving circuit connected to receive control information indicative of the element size of a vector being shuffled; wherein the power saving circuit is configured to disable operation of one or more of the multiplexer units in accordance with the received control information.

2. A data processing apparatus as claimed in claim 1, wherein the control information is contained in a vector shuffle instruction forming part of an instruction set.

3. A data processing apparatus as claimed in claim 1 or 2, wherein the power saving circuit comprises logic circuitry for masking an input vector from the one or more multiplexer units in accordance with the received control information.

4. A data processing apparatus as claimed in any one of the preceding claims, wherein the data processing apparatus is configured to shuffle vectors having 8, 16 or 32 bit element sizes, the apparatus comprising: first, second, third and fourth base multiplexer units forming the plurality of multiplexer units; a first logic gate for masking the second and fourth base multiplexer units when either a first control bit or a second control bit in the control information is enabled; and a second logic gate for masking the third base multiplexer unit when the second control bit in the control information is enabled.

5. A method of reducing power in a data processing apparatus configured to perform vector shuffle operations on an input vector having a plurality of elements, each element comprising a predetermined number of data bits, and the number of data bits defining the size of an element, the method comprising the steps of: providing a plurality of multiplexer units for shuffling at least an input vector comprising elements of a first size or an input vector comprising elements of a second size; providing a power saving circuit for masking an input vector from one or more of the plurality of multiplexer units; receiving control information indicative of the element size of a vector being shuffled; and disabling the operation of one or more of the multiplexer units by masking the input vector therefrom, in accordance with the received control information.

6. A method as claimed in claim 5, wherein the control information is received from a vector shuffle instruction forming part of an instruction set.

7. A method as claimed in claim 5 or 6, wherein the step of disabling the operation of one or more of the multiplexer units comprises the step of masking an input vector from the one or more multiplexer units in accordance with the received control information.

8. A method as claimed in any one of claims 5 to 7, wherein the data processing apparatus is configured to shuffle vectors having 8, 16 or 32 bit element sizes, the method comprising the steps of: providing first, second, third and fourth base multiplexer units as the plurality of multiplexer units; masking the second and fourth base multiplexer units when either a first control bit or a second control bit in the control information is enabled; and masking the third base multiplexer unit when the second control bit in the control information is enabled.

9. A vector shuffle instruction for performing vector shuffle operations on a vector having a plurality of elements, each element comprising a predetermined number of data bits, and the number of data bits defining the size of an element, wherein the vector shuffle instruction comprises at least one data bit for indicating the element size of the vector being shuffled.

10. Device comprising a data processing apparatus according to any of claims 1-4.