GB2475653A - Select-and-insert instruction for a data processor - Google Patents


Info

Publication number
GB2475653A
GB2475653A GB1104112A GB201104112A
Authority
GB
United Kingdom
Prior art keywords
value
input
bits
output
data
Prior art date
Legal status
Granted
Application number
GB1104112A
Other versions
GB201104112D0 (en)
GB2475653B (en)
Inventor
Dominic Hugo Symes
Daniel Kershaw
Mladen Wilder
Current Assignee
ARM Ltd
Original Assignee
ARM Ltd
Advanced Risc Machines Ltd
Priority date
Filing date
Publication date
Application filed by ARM Ltd, Advanced Risc Machines Ltd filed Critical ARM Ltd
Priority to GB1104112A
Publication of GB201104112D0
Publication of GB2475653A
Application granted
Publication of GB2475653B
Legal status: Active

Classifications

    • H03M13/4169 Sequence estimation using the Viterbi algorithm or Viterbi processors, implementing path management using traceback
    • G06F9/22 Microcontrol or microprogram arrangements
    • G06F9/26 Address formation of the next micro-instruction; microprogram storage or retrieval arrangements
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30036 Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
    • G06F9/345 Addressing or accessing the instruction operand or the result; addressing modes of multiple operands or results
    • G06F9/355 Indexed addressing
    • G06F9/3885 Concurrent instruction execution using a plurality of independent parallel functional units
    • G06F9/3887 Concurrent instruction execution using a plurality of independent parallel functional units controlled by a single instruction for multiple data lanes [SIMD]
    • H03M13/23 Error detection or forward error correction using convolutional codes, e.g. unit memory codes
    • H03M13/256 Error detection or forward error correction by signal space coding with trellis coding, e.g. with convolutional codes and TCM
    • H03M13/41 Sequence estimation using the Viterbi algorithm or Viterbi processors
    • H03M13/4161 Sequence estimation using the Viterbi algorithm or Viterbi processors, implementing path management
    • H04L1/0052 Realisations of complexity reduction techniques, e.g. pipelining or use of look-up tables
    • H04L1/0054 Maximum-likelihood or sequential decoding, e.g. Viterbi, Fano, ZJ algorithms

Abstract

A data processing system has a select-and-insert instruction, which takes two input values. The instruction shifts the first value by n bits, selects n bits from the second value and concatenates the shifted value with the selected bits to produce a result. If the first value is left shifted, then the selected bits form the least significant bits of the result. If the first value is right shifted, then the selected bits form the most significant bits of the result. The instruction may be used in a Viterbi decoder with the first input being a Viterbi decoder state and the second value being a Viterbi trellis value.

Description

SELECT-AND-INSERT INSTRUCTIONS WITHIN DATA PROCESSING SYSTEMS
This invention relates to the field of data processing systems. More particularly, this invention relates to data processing systems supporting program instructions tailored to high data throughput requirements.
It is known within data processing systems to perform data processing operations which require a high data throughput and the manipulation of large amounts of data. An example of such manipulations is the Viterbi algorithm calculations commonly used when transmitting data over a noisy communication channel. While these techniques can be highly successful in resisting data loss arising due to noise on the channel, they bring with them a high computational load. These high levels of computation present a significant challenge in producing low overhead (in terms of size, cost and energy consumption) systems capable of performing the required processing.
One particular challenge within Viterbi decoding is that the trellis traceback algorithm requires access to a two-dimensional array of data values with one dimension of the array being stepped through at a constant rate and the other dimension being accessed "randomly" depending upon the current state of the decoder.
Known software Viterbi implementations (e.g. C54x) implement these requirements by using one instruction to step through the dimension which changes at a constant rate and another instruction to apply the value for the randomly accessed dimension when seeking to form the composite address for accessing the two-dimensional array.
A problem situation that arises concerns the manipulation of data values in a manner that depends directly upon the data values to be manipulated. Conventionally this requires multiple instructions, i.e. first to examine the data to identify the manipulation to be performed and then to separately perform that manipulation.
Viewed from another aspect the present invention provides apparatus for processing data comprising: data processing circuitry responsive to control signals to perform data processing operations; and instruction decoder circuitry coupled to said data processing circuitry and responsive to program instructions to generate said control signals; wherein said instruction decoder circuitry is responsive to a select-and-insert instruction having as input operands at least a first input value and a second input value to generate control signals to control said data processing circuitry to form an output value equal to that given by performing the steps of: shifting said first input value by N bit positions to form a shifted value, where N is an integer value greater than zero; selecting N bits from within said second input value in dependence upon said first input value; and concatenating said shifted value and said N bits to form said output value.
The present technique recognises the bottleneck that is introduced by the need to perform manipulations upon data values in dependence upon those data values themselves in circumstances where these manipulations are frequently required and where high data throughput is required. More particularly, the present technique recognises a particular class of such situations for which it is desirable to provide hardware support. These correspond to a select-and-insert instruction in which a first input value is shifted by a variable number N of bit positions to form a shifted value, N bits from within a second input value are selected in dependence upon the first input value, and then the shifted value and the selected N bits are concatenated to form an output value. This particular combination of manipulations is one which is frequently required in certain fields where high volumes of data are to be processed, desirably with a high level of efficiency.
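As an illustration of the behaviour just described, the following is a minimal C model of the select-and-insert operation for the left-shift case, assuming N = 1, a 32-bit first input value (the decoder state) and a 64-bit second input value; the operand widths and the use of the bottom six state bits as the selector are assumptions made for the sketch, not details taken from the claims.

```c
#include <stdint.h>

/* Minimal C model of the described select-and-insert operation
 * (left-shift variant).  N = 1; the 32-bit state and 64-bit trellis
 * word widths, and the use of the bottom 6 state bits as the bit
 * selector, are illustrative assumptions. */
static inline uint32_t select_and_insert(uint32_t state, uint64_t trellis_word)
{
    unsigned k = state & 0x3F;                      /* bottom 6 bits select a bit */
    uint32_t selected = (uint32_t)((trellis_word >> k) & 1u);
    return (state << 1) | selected;                 /* shift, then concatenate    */
}
```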
Whilst the above select-and-insert instruction could be used in other circumstances, it is particularly well suited to use when the first input value is a Viterbi decoder state value and the second input value is a Viterbi trellis data value. The instruction then provides a high efficiency mechanism for tracing back through the Viterbi trellis data values to reconstruct decoder state and decode the signals required.
It will be appreciated that the first input value could be left shifted with the N bits concatenated to form the least significant bits of the output data value. Alternatively, the first input value could be right shifted and the N bits concatenated with the shifted value to form the most significant bits of the output value. The number of bit positions shifted and the number of bits inserted can take a variety of values, but a value of one is often useful.
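A corresponding C sketch of the right-shifted alternative, under the same illustrative assumptions (the selector field and the m-bit output width are assumptions, not taken from the claims), is given below.

```c
#include <stdint.h>

/* Right-shift variant: the selected bit becomes the most significant
 * bit of an m-bit output value.  The selector field and the value of
 * m are illustrative assumptions. */
static inline uint32_t select_and_insert_msb(uint32_t state, uint64_t trellis_word,
                                             unsigned m /* state width in bits */)
{
    unsigned k = state & 0x3F;                      /* assumed selector bits      */
    uint32_t selected = (uint32_t)((trellis_word >> k) & 1u);
    return (state >> 1) | (selected << (m - 1));    /* insert at the MSB position */
}
```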
The present technique is well suited to pipelined implementation when the first input value is a Viterbi decoder state value and the second input value is a multi-bit Viterbi trellis data value loaded from a memory by a load instruction executed in a processing cycle preceding the processing cycle in which the select-and-insert instruction is executed. In these circumstances, the latency associated with accessing the Viterbi trellis data value with the load instruction can be compensated for since the bits which will be required from that Viterbi trellis data value to be inserted into the Viterbi decoder state value can be determined and selected later by the select-and-insert instruction. The load can thus effectively load all of the bit values which might be required and the select-and-insert instruction can then select the bit values which are actually required for the manipulation to be performed.
The provision of the select-and-insert instruction is complemented by the provision of the previously discussed address calculation instruction as together these instructions can significantly reduce the processing bottlenecks which would otherwise be present and obstruct a high efficiency implementation of, in particular, a Viterbi software decoder. This is particularly beneficial when the trellis is generated by parallel data processing units, such as in a SIMD machine. In this case the scalar traceback processing becomes a bottleneck.
Viewed from another aspect the present invention provides a method of processing data using data processing circuitry responsive to control signals to perform data processing operations and instruction decoder circuitry coupled to said data processing circuitry and responsive to program instructions to generate said control signals, said method comprising the steps of: decoding a select-and-insert instruction having as input operands at least a first input value and a second input value to generate control signals; controlling said data processing circuitry with said control signals to calculate an output value equal to that given by performing the steps of: shifting said first input value by N bit positions to form a shifted value, where N is an integer value greater than zero; selecting N bits from within said second input value in dependence upon said first input value; and concatenating said shifted value and said N bits to form said output value.
Viewed from a further aspect the present invention provides apparatus for processing data comprising: data processing means for performing data processing operations in response to control signals; and instruction decoder means coupled to said data processing means for generating said control signals in response to program instructions; wherein said instruction decoder means, in response to a select-and-insert instruction having as input operands at least a first input value and a second input value, generates control signals to control said data processing means to calculate an output value equal to that given by performing the steps of: shifting said first input value by N bit positions to form a shifted value, where N is an integer value greater than zero; selecting N bits from within said second input value in dependence upon said first input value; and concatenating said shifted value and said N bits to form said output value.
Viewed from a further aspect the present invention provides a virtual machine implementation of an apparatus for processing data, said virtual machine implementation being responsive to a select-and-insert instruction having as input operands at least a first input value and a second input value to calculate an output value equal to that given by performing the steps of: shifting said first input value by N bit positions to form a shifted value, where N is an integer value greater than zero; selecting N bits from within said second input value in dependence upon said first input value; and concatenating said shifted value and said N bits to form said output value.
Embodiments of the invention will now be described, by way of example only, with reference to the accompanying drawings in which: Figure 1 schematically illustrates an integrated circuit suitable for software radio processing; Figure 2 schematically illustrates a Viterbi coding and decoding system; Figure 3 schematically illustrates Viterbi trellis data; Figure 4 schematically illustrates updating of Viterbi decoder state data during traceback; Figure 5 schematically illustrates a two-dimensional array of Viterbi trellis data being traversed as part of a traceback operation; Figure 6 schematically illustrates an instruction decoder responsive to program instructions for controlling data processing circuitry; Figure 7 schematically illustrates the operation of an address calculation instruction; Figure 8 is a flow diagram schematically illustrating the processing performed by an address calculation instruction; Figure 9 illustrates the syntax of an address calculation instruction; Figure 10 schematically illustrates the operation of a select-and-insert instruction; Figure 11 schematically illustrates an alternative operation of a select-and-insert instruction; Figure 12 is a flow diagram schematically illustrating the operation of a select-and-insert instruction; Figure 13 illustrates the syntax of a select-and-insert instruction; Figure 14 is an example code sequence illustrating the use of a select-and-insert instruction in combination with an address calculation instruction to perform Viterbi traceback operations; and Figure 15 is a diagram schematically illustrating a virtual machine implementation for executing program code utilising the address calculation instruction and select-and-insert instruction of the current techniques.
Figure 1 shows an integrated circuit 2 adapted to perform software radio processing functions. Software radio places heavy demands upon the processing capabilities of such a programmable integrated circuit. The data throughputs required are large and it is important to balance the different elements provided within the integrated circuit 2 in order that all the elements are used with a high degree of efficiency. In the illustrated example, thirty-two parallel lanes, each sixteen bits wide, for performing multiplication, addition and shuffle operations upon arithmetic values are provided. Each of these lanes includes a multiplier 4, an adder 6 and a shuffle unit 8. 16-bit data words are taken from a respective lane within an input value register 10 to provide input operands to the multiplier 4, the adder 6 and the shuffle unit 8. The multiplier 4, the adder 6 and the shuffle unit 8 form a three-cycle deep pipeline such that the results of a calculation will be available three cycles after the calculation is issued into the pipeline. The respective processing lanes are controlled by a 256-bit very long instruction word (VLIW) instruction stored within an instruction register 12. This VLIW instruction also includes a scalar instruction supplied to a scalar processor 14.
The scalar processor 14 operates in parallel with the previously discussed thirty two parallel lanes and serves primarily to perform control and higher level decoding operations.
The scalar processor 14 also controls an address generation unit 16 which is responsible for generating memory access addresses supplied to a memory 18 for accessing data values therefrom (which are fed to the operand register 10 for processing in the thirty-two parallel lanes as well as to the scalar processor 14 itself). The scalar processor 14 also has a three-cycle pipeline depth and the memory 18 has a three-cycle latency. Matching the pipeline depths/latency of the address generation unit 16, the thirty-two parallel lanes and the memory 18 simplifies efficient coding and allows more flexibility in the scheduling of instructions.
One of the tasks of the address generation unit 16 in performing Viterbi decoding is to undertake the traceback operations through the Viterbi trellis data which has been calculated by the thirty-two parallel lanes. The thirty-two parallel lanes, each comprising a multiplier 4, an adder 6 and a shuffle unit 8, are responsible for the data processing necessary to compute the probability coefficients and branch values to be associated with each state node within the Viterbi decoding process. Such a highly parallel data processing engine is well suited to this computationally intensive task. Once the Viterbi trellis data has been calculated it is necessary to analyse this calculated data so as to extract therefrom the bit stream which has been decoded. This task is performed by the address generation unit 16. The thirty-two parallel lanes write the Viterbi trellis data to the memory 18 from where it is read and analysed by the address generation unit 16. The address generation unit 16 also tracks the Viterbi decoder state data which provides the decoded data stream.
Viterbi decoding in itself is a well known technique within the field of data and signal processing. Viterbi decoding will not be described herein in detail.
Figure 2 illustrates at a high level the processing that is performed. An input data stream 20 is subject to convolution encoding and the addition of some parity data by a convolutional encoder 22. This Viterbi encoded data is then transmitted over a noisy data channel (e.g. a wireless data channel) to a Viterbi decoder 24. The Viterbi decoder 24 applies Viterbi decoding algorithms to the received data to form Viterbi trellis data 26, which can then be subject to traceback processing by a traceback processor 28 to generate an output data stream 30 corresponding to the input datastream 20.
Figure 3 schematically illustrates Viterbi trellis data. In this example each Viterbi decoder state is taken to have four possible values, m3 to m0. These four possible states at each time t have a value associated with them indicating how probable it is that the decoder has reached that state given the preceding sequence of bits that have been received. The transition from one possible decoder state to the next possible decoder state can have two potential targets selected between in dependence upon the received bit associated with that transition.
The trellis data comprises a large number of computed elements representing the probabilities of states and the bit sequences which have led to those states. Calculating this trellis data is computationally intensive and is performed by the wide multi-lane data engine illustrated in Figure 1. When the trellis data has been formed in this way, another processing unit, such as the address generation unit 16, is used to analyse this trellis data and "traceback" therethrough. This type of processing is in itself known. It will be appreciated that in practice a Viterbi decoder will have many more than four possible states at each time, making the Viterbi traceback data significantly larger in volume and more complex to analyse.
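For concreteness, one way such trellis traceback data might be laid out in memory is sketched below in C: a two-dimensional array with one row per time step and one packed decision bit per possible decoder state. The sizes and the packing into 64-bit words are illustrative assumptions, not details taken from the embodiment.

```c
#include <stdint.h>

/* Illustrative layout for the trellis traceback data: one row per time
 * step, one decision bit per decoder state packed into 64-bit words.
 * NUM_STATES and NUM_STEPS are example values only. */
#define NUM_STATES    64                            /* decoder states per time step */
#define NUM_STEPS     1024                          /* trellis length in time steps */
#define WORDS_PER_ROW ((NUM_STATES + 63) / 64)

typedef struct {
    uint64_t decision[NUM_STEPS][WORDS_PER_ROW];    /* decision[t] is row t         */
} viterbi_trellis_t;
```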
Figure 4 schematically illustrates a small part of the traceback operation performed as part of typical Viterbi decoding. The decoder has been determined at time t to be in a given state that is most probable given the already decoded trellis data which has been traversed.
Stored within the trellis data for the time t and the state in which the decoder currently is, is an indication of which preceding state at the time t-1 is the most probable preceding state. This indicates the state to which the decoder is traced back and the bit value which is deemed to have been decoded by that change of state. The change of state will also be accompanied by a change in the decoder state value which is achieved, in this example, by left shifting the current state value and shifting into the bottom of that state value a bit indicating which of the two options for the preceding bit has been deemed the most probable, and accordingly deemed to have been decoded. This shifted value with an inserted new bit then forms the new state of the decoder at time t-1. The process repeats at time t-1 and a further bit is decoded; traceback through the Viterbi trellis data is so made.
Figure 5 is another example illustration of this process. At the various times t the decoder state in this example can have sixteen possible values. With each of these values there is an associated bit indicating the most likely path by which that state will have been reached from the two possible preceding states at an earlier time. This path is then followed back to that preceding state, which will in itself have an indicator to the preceding state to which traceback is to be performed. Thus, in the example illustrated, the state at time t is "0101". The bit stored within the trellis data indicating the preceding state associated with that state is a "1", indicating that a "1" is to be shifted into the bottom of the state value as it is left shifted to form the state value for the preceding state at time t-1. In this way, the state value for the preceding state is formed as "1011". Data is stored within the trellis data associated with this state at time t-1 indicating the next state to be adopted. Thus, the trellis data shown in Figure 5 is subject to a traceback operation during which the decoder state is updated and is used to generate the decoded data stream in accordance with known techniques.
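The Figure 5 step can be checked with a few lines of C; the 4-bit state width mirrors the sixteen-state example, and the update is simply the left shift with the decision bit inserted at the bottom.

```c
#include <stdio.h>
#include <stdint.h>

/* Worked example of the Figure 5 traceback step: state "0101" at time t,
 * a stored decision bit of 1, left shift within a 4-bit state value. */
int main(void)
{
    uint8_t state = 0x5;                                  /* 0101                 */
    uint8_t decision = 1;                                 /* bit from the trellis */
    state = (uint8_t)(((state << 1) | decision) & 0xF);   /* keep 4 bits          */
    printf("new state = %x\n", state);                    /* prints b, i.e. 1011  */
    return 0;
}
```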
Figure 6 illustrates a portion of the integrated circuit 2 of Figure 1 in more detail. The scalar processor 14 is provided with a scalar instruction register 32 (which is part of the VLIW instruction register 12) for storing a scalar instruction to be executed. An instruction decoder 34 is responsive to the scalar instruction in the scalar instruction register 32 to generate control signals supplied to data processing circuitry 36. The data processing circuitry 36 performs data processing operations in response to the control signals supplied thereto in order to perform the desired data processing operations specified by the instruction within the scalar instruction register 32. The instruction decoder 34 is circuitry configured to be responsive to the bit patterns within the scalar instruction register 32 to generate the desired control signals for supply to the data processing circuitry 36. The data processing circuitry 36 typically includes a wide variety of different functional elements, such as an adder 38, a shifter 40 and general purpose combinatorial logic 42. It will be appreciated that a wide variety of other forms of circuitry may be provided within the data processing circuitry 36 to achieve the desired functions. It will further be appreciated that the selection of which program instructions are to be supported by the instruction decoder 34 is a critical one in system design. A general purpose processor can normally accomplish most processing tasks desired if enough program instructions and processor cycles are dedicated to those tasks.
However, this is not sufficient when high efficiency is also required, as it is desirable to perform such processing quickly and with low energy consumption. In this way, the selection of which processing operations are to be supported within the instruction bit space and natively supported by the data processing circuitry 36 is critical in achieving good levels of efficiency. The present techniques concern the identification and selection of particular forms of data processing instruction which are surprisingly advantageous and accordingly desirable to support natively.
Figure 7 illustrates the operations performed by an address calculation instruction supported by the instruction decoder 34 and the data processing circuitry 36. The input address value 44 is divided into a first portion 46 and a second portion 48 in dependence upon a size value 50. The size value 50 is in this example specified as a value representing the logarithm of the size of a mask to be applied to the input address value 44 to split it into the first portion 46 and the second portion 48. Also supplied as input operands to the address calculation instruction in this example are an offset value stored within a register specified as a register field within the instruction and a state value stored within a register specified as a register field within the instruction. The address calculation instruction serves to add an offset value to the first portion 46. In the example illustrated, this offset value is "-1", which effectively results in a decrement of the first portion. If the first portion is indexing a two dimensional data array, then the high order bits of the first portion can be considered to form the base address for that two dimensional array and the lower bits of the first portion represent the row address within that array. In this case the array is aligned; more generally the high order bits are the base address plus the row address. The number of bits of this lower portion of the first portion representing the row address varies depending upon the row size. In the example of Viterbi trellis data, each row can correspond to a different time t with data values corresponding to the different decoder states at that time t.
The manipulation performed upon the second portion 48 of the input address value 44 is to set the second portion 48 to a value specified by the state input operand, being a value held within a register specified by a register field within the address calculation instruction and subject to masking of that state value to select the relevant bits thereof which are to be used as the second portion 48.
In this way, a new address can be formed as the output address value 52 by adding an offset value to the most significant bit portion of the input address value and setting the least significant bit portion of the input address value to a new value which can effectively be randomly selected. Thus, if a two dimensional data structure is considered, the modification to the first portion 46 steps through the rows of the data structure in a regular fashion (e.g. one row at a time, two rows at a time, etc.) with the setting of the second portion 48 of the address value allowing a random column position within the two-dimensional data structure to be selected for access using the output address value calculated. In the context of traversing Viterbi trellis data it will be seen that this instruction is well suited to this task since such trellis data is regularly traversed, typically one row at a time, with a random next column needing to be accessed at each access. Thus, by appropriately loading the state value into the register to be used to form the second portion, and setting the desired offset, the new address following a traceback step can be calculated with a single instruction.
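A minimal C model of this address calculation, under the assumption that the size value gives the bit position at which the input address is split (i.e. the log2 of the mask) and that the operands are 32 bits wide, might look as follows; the function name and widths are illustrative, not the instruction's actual encoding.

```c
#include <stdint.h>

/* Minimal C model of the described address calculation: split the input
 * address at bit position 'size' (size < 32 assumed), add the (possibly
 * negative) offset to the upper portion, and set the lower portion from
 * the masked state value.  Widths and names are illustrative assumptions. */
static inline uint32_t addr_calc(uint32_t addr, int32_t offset,
                                 uint32_t state, unsigned size)
{
    uint32_t mask   = (1u << size) - 1u;                  /* selects the second portion */
    uint32_t first  = (addr >> size) + (uint32_t)offset;  /* e.g. offset = -1 decrements*/
    uint32_t second = state & mask;                       /* "randomly" chosen column   */
    return (first << size) | second;                      /* concatenate the portions   */
}
```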
Figure 8 schematically illustrates the operation of the instruction decoder 34 when encountering an address calculation instruction. At step 54, the instruction decoder 34 identifies the scalar instruction within the scalar instruction register 32 as an address calculation instruction. At step 56 the input address value is split into a first portion and a second portion in dependence upon the size value specified in association with the address calculation instruction. At step 58 a non-zero offset (which may be positive or negative) is added to the first portion. At step 60 the second portion is set to a value determined directly or indirectly from the address calculation instruction. At step 62 the first portion and the second portion which have been modified are concatenated to form the output address value 52.
It will be appreciated that the sequence of operations shown in Figure 8 is linear whereas in practice these operations may be performed in a different order and/or with varying degrees of parallelism. Figure 8 is intended to represent the functionality provided rather than the precise way in which such functionality is provided. The various options for the provision of this functionality will be familiar to those in this technical field.
Figure 9 schematically illustrates the syntax of an address calculation instruction in accordance with the present technique. As will be seen, the instruction includes a field identifying a register storing the offset value to be used, a field identifying a register storing a value at least part of which is to be used to set the second portion when forming the output address value and further a field (in this case an immediate) specifying a size value to be used when dividing the input address value into a first portion and a second portion. The variability of the size value allows different widths of two dimensional data array to be appropriately addressed using the address calculation instruction depending upon the circumstances.
Figure 10 schematically illustrates the operations to be performed as part of traceback when updating a decoder state value. Figure 10 illustrates the example in which the state value is left shifted and a new bit value is inserted in the least significant bit position. As illustrated in Figure 10, a first input value 64 is provided in conjunction with a second input value 66. The operation of the select-and-insert instruction is to use, in this example, the bottom three bits of the first input value 64 to select a bit within the second input value 66 which is to be inserted in the least significant bit position of the new value to be formed as the output value 68 after it has been left shifted and had the bit inserted at its least significant bit position. It will be appreciated that the width of the portion of the first input value 64 used to select the bits or bit within the second input value 66 to be inserted can vary depending upon the width of the second input value 66. Similarly, the number of bits to be inserted with each instruction can vary and be more generally N bits. In many circumstances, such as a simple Viterbi traceback, the shift by one bit position and the inserting of one bit will be normal.
Figure 11 illustrates a variant of the select-and-insert instruction, with in this case the first input operand 70 being subject to a right shift and the selected bit or bits from the second input value 72 being inserted at the most significant bit position within the new state value.
The state value in this example is M bits wide and accordingly there are 2^M possible one-bit values which can be selected within the second input value 72 for insertion. The output value 74 represents the traceback Viterbi decoder state at time t-1. It will be appreciated from Figures 10 and 11 that the second input values 66, 72 include more than just the bit(s) which are to be inserted and used to update the state values when these are shifted. This is advantageous since the second input value 66, 72 can be fetched from memory by an instruction issued several cycles earlier before it is known precisely which of the bits from that fetched value will need to be inserted within the state value to update the state value when that update is required. Thus, the latency associated with the memory access can effectively be hidden by fetching more than just the bit(s) which will be required and then later selecting the desired bit(s) from the fetched second input value to perform the desired update. In practice memories are accessed with access mechanisms/paths wider than a single bit (e.g. typically byte or word access) and accordingly the fetching of more than just the single bit or N bits required to be inserted does not in practice consume more energy than would otherwise be the case.
Figure 12 is a flow diagram schematically illustrating the processing performed by the select-and-insert instructions of Figures 10 and 11. At step 76 the instruction decoder 34 identifies from the bit pattern within the scalar instruction register 32 that a select-and-insert instruction has been received. It then generates the appropriate control signals to configure the data processing circuitry 36 to perform the above described data processing operations.
At step 78 the first input value is shifted by N bits to form a shifted value. At step 80, N bits are selected from the second input value, as pointed to by the first input value. More specifically, the selected bits from within said second input value are bits (K*N) to (K*N)+(N-1) where K is the bottom M bits of the first input value. In this case, the Viterbi trellis data value includes 2^M possible N-bit portions to be concatenated with said shifted value, permitting up to M cycles to load said Viterbi trellis data value from said memory whilst permitting said data processing circuitry to execute a sequence of said select-and-insert instructions in a manner providing a throughput capable of forming one output value per clock cycle. At step 82 the shifted value and the selected N bits are concatenated to form the output value. As previously discussed, it will be appreciated that the flow diagram of Figure 12 represents the processing as sequential operations, but in practice this could be performed in a different order and with varying degrees of parallelism.
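The bit-numbering rule just quoted can be written out directly in C; the sketch below assumes the left-shift variant and that K*N + N does not exceed the width of the second input value, and the function name is hypothetical.

```c
#include <stdint.h>

/* General form of the selection rule: K is the bottom M bits of the
 * first input value and bits (K*N) .. (K*N)+(N-1) of the second input
 * value are selected.  Left-shift variant assumed; K*N + N must not
 * exceed 64 for this sketch. */
static inline uint64_t select_and_insert_n(uint64_t state, uint64_t second,
                                           unsigned n, unsigned m)
{
    uint64_t k        = state & ((1ull << m) - 1ull);       /* K = bottom M bits  */
    uint64_t selected = (second >> (k * n)) & ((1ull << n) - 1ull);
    return (state << n) | selected;                          /* shift, concatenate */
}
```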
Figure 13 schematically illustrates the syntax of a select-and-insert instruction. This instruction includes a first input operand and a second input operand, each in the form of a register specifier pointing to a register holding respectively the current state value and trellis data value as part of the Viterbi decoding.
Figure 14 is an example code sequence showing the use of the address calculation instruction and the select-and-insert instruction in a code fragment for performing Viterbi traceback operations. In this example it will in particular be seen that the first triplet of instructions terminates with a load to register d4 and this value is not needed until that triplet of instructions is returned to in the next loop cycle. This permits the latency associated with this load to be tolerated without stalling the instruction processing. Furthermore, since the value within the register d4 contains more than just the bits which are to be inserted, the various options for which bits will be inserted can be catered for.
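Since the Figure 14 assembly sequence is not reproduced here, the following C sketch shows the shape of such a traceback loop built from the two operations modelled earlier. The arrangement assumed (each trellis row split into words of decision bits, the upper state bits choosing the word via the address calculation and the bottom m bits choosing the bit via select-and-insert) is one plausible reading of the description, not the patent's actual code.

```c
#include <stdint.h>

/* C sketch of a Viterbi traceback loop built from the two operations
 * described above.  Names, widths and the trellis layout are
 * illustrative assumptions only. */
static inline uint32_t addr_calc_step(uint32_t addr, int32_t offset,
                                      uint32_t column, unsigned size)
{
    uint32_t mask = (1u << size) - 1u;
    return (((addr >> size) + (uint32_t)offset) << size) | (column & mask);
}

static inline uint32_t select_insert_bit(uint32_t state, uint64_t word, unsigned m)
{
    uint64_t k = state & ((1ull << m) - 1ull);
    return (state << 1) | (uint32_t)((word >> k) & 1ull);
}

static void viterbi_traceback(const uint64_t *trellis, uint32_t addr,
                              uint32_t state, unsigned size, unsigned m,
                              uint8_t *out_bits, unsigned num_steps)
{
    for (unsigned i = 0; i < num_steps; i++) {
        uint64_t word = trellis[addr];                       /* loaded early; the exact */
                                                             /* bit is chosen later     */
        state = select_insert_bit(state, word, m);           /* shift in decoded bit    */
        out_bits[i] = (uint8_t)(state & 1u);                 /* newly decoded bit       */
        addr = addr_calc_step(addr, -1, state >> m, size);   /* previous row; word from */
                                                             /* the upper state bits    */
    }
}
```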
Figure 15 illustrates a virtual machine implementation of the present techniques. It will be appreciated that the above has described the implementation of the present invention in terms of apparatus and methods for operating specific processing hardware supporting the instructions concerned. It will be appreciated by those in this technical field that it is also possible to provide so-called "virtual machine" implementations of hardware devices. These virtual machine implementations run on a host processor 84 running a host operating system 86 supporting a virtual machine program 88. Typically large powerful processors are required to provide virtual machine implementations which execute at a reasonable speed, but such an approach may be justified in certain circumstances, such as a desire to run code native to another processor for compatibility or reuse reasons. The virtual machine program 88 provides an application program interface to an application program 90 which is the same as the application program interface which would be provided by the real hardware which is the device being modelled by the virtual machine program 88. Thus, the program instructions, including the address calculation instruction and the select-and-insert instruction described above, may be executed from within the application program 90 using the virtual machine program 88 to model their interaction with the virtual machine hardware.

Claims (10)

  1. Apparatus for processing data comprising: data processing circuitry responsive to control signals to perform data processing operations; and instruction decoder circuitry coupled to said data processing circuitry and responsive to program instructions to generate said control signals; wherein said instruction decoder circuitry is responsive to a select-and-insert instruction having as input operands at least a first input value and a second input value to generate control signals to control said data processing circuitry to form an output value equal to that given by performing the steps of: shifting said first input value by N bit positions to form a shifted value, where N is an integer value greater than zero; selecting N bits from within said second input value in dependence upon said first input value; and concatenating said shifted value and said N bits to form said output value.
  2. Apparatus as claimed in claim 1, wherein said first input value is left-shifted and said N bits are concatenated with said shifted value to form N least significant bits of said output value.
  3. Apparatus as claimed in claim 1, wherein said first input value is right-shifted and said N bits are concatenated with said shifted value to form N most significant bits of said output value.
  4. Apparatus as claimed in any one of claims 1 to 3, wherein N=1.
  5. Apparatus as claimed in any one of claims 1 to 4, wherein said first input value is a Viterbi decoder state value and said second input value is a Viterbi trellis data value.
  6. Apparatus as claimed in claim 5, wherein said Viterbi trellis data value is a multi-bit data value loaded from a memory by a load instruction executed in a processing cycle preceding a processing cycle in which said select-and-insert instruction is executed.
  7. Apparatus as claimed in claim 6, wherein said Viterbi trellis data value includes 2^M possible N-bit portions to be concatenated with said shifted value, permitting up to M cycles to load said Viterbi trellis data value from said memory whilst permitting said data processing circuitry to execute a sequence of said select-and-insert instructions in a manner providing a throughput capable of forming one output value per clock cycle.
  8. Apparatus as claimed in claim 7, wherein said N bits selected from within said second input value are bits (K*N) to (K*N)+(N-1) where K is the bottom M bits of the first input value.
  9. Apparatus as claimed in any one of claims 1 to 8, wherein said instruction decoder circuitry is responsive to an address calculation instruction having as input operands at least an input address value and a size value to generate control signals to control said data processing circuitry to calculate an output address value equal to that given by performing the steps of: splitting said input address value at a position dependent upon said size value into an input first portion and an input second portion; adding a non-zero offset value to said input first portion to form an output first portion; setting an output second portion to a value; and concatenating said output first portion and said output second portion to form said output address value.
  10. A method of processing data using data processing circuitry responsive to control signals to perform data processing operations and instruction decoder circuitry coupled to said data processing circuitry and responsive to program instructions to generate said control signals, said method comprising the steps of: decoding a select-and-insert instruction having as input operands at least a first input value and a second input value to generate control signals; controlling said data processing circuitry with said control signals to calculate an output value equal to that given by performing the steps of: shifting said first input value by N bit positions to form a shifted value, where N is an integer value greater than zero; selecting N bits from within said second input value in dependence upon said first input value; and concatenating said shifted value and said N bits to form said output value.
  11. A method as claimed in claim 10, wherein said first input value is left-shifted and said N bits are concatenated with said shifted value to form N least significant bits of said output value.
  12. A method as claimed in claim 10, wherein said first input value is right-shifted and said N bits are concatenated with said shifted value to form N most significant bits of said output value.
  13. A method as claimed in any one of claims 10 to 12, wherein N=1.
  14. A method as claimed in any one of claims 10 to 13, wherein said first input value is a Viterbi decoder state value and said second input value is a Viterbi trellis data value.
  15. A method as claimed in claim 14, wherein said Viterbi trellis data value is a multi-bit data value loaded from a memory by a load instruction executed in a processing cycle preceding a processing cycle in which said select-and-insert instruction is executed.
  16. A method as claimed in claim 15, wherein said Viterbi trellis data value includes 2^M possible N-bit portions to be concatenated with said shifted value, permitting up to M cycles to load said Viterbi trellis data value from said memory whilst permitting said data processing circuitry to execute a sequence of said select-and-insert instructions in a manner providing a throughput capable of forming one output value per clock cycle.
  17. A method as claimed in claim 16, wherein said N bits selected from within said second input value are bits (K*N) to (K*N)+(N-1) where K is the bottom M bits of the first input value.
  18. A method as claimed in any one of claims 10 to 17, comprising decoding an address calculation instruction having as input operands at least an input address value and a size value to generate control signals; and controlling said data processing circuitry using said control signals to calculate an output address value equal to that given by performing the steps of: splitting said input address value at a position dependent upon said size value into an input first portion and an input second portion; adding a non-zero offset value to said input first portion to form an output first portion; setting an output second portion to a value; and concatenating said output first portion and said output second portion to form said output address value.
  19. A virtual machine implementation of an apparatus for processing data, said virtual machine implementation being responsive to a select-and-insert instruction having as input operands at least a first input value and a second input value to calculate an output value equal to that given by performing the steps of: shifting said first input value by N bit positions to form a shifted value, where N is an integer value greater than zero; selecting N bits from within said second input value in dependence upon said first input value; and concatenating said shifted value and said N bits to form said output value.
GB1104112A 2007-03-12 2007-03-12 Select and insert instructions within data processing systems Active GB2475653B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB1104112A GB2475653B (en) 2007-03-12 2007-03-12 Select and insert instructions within data processing systems

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB0704735A GB2447427B (en) 2007-03-12 2007-03-12 Address calculation within data processing systems
GB1104112A GB2475653B (en) 2007-03-12 2007-03-12 Select and insert instructions within data processing systems

Publications (3)

Publication Number Publication Date
GB201104112D0 GB201104112D0 (en) 2011-04-27
GB2475653A true GB2475653A (en) 2011-05-25
GB2475653B GB2475653B (en) 2011-07-13

Family

ID=37988816

Family Applications (2)

Application Number Title Priority Date Filing Date
GB0704735A Active GB2447427B (en) 2007-03-12 2007-03-12 Address calculation within data processing systems
GB1104112A Active GB2475653B (en) 2007-03-12 2007-03-12 Select and insert instructions within data processing systems

Family Applications Before (1)

Application Number Title Priority Date Filing Date
GB0704735A Active GB2447427B (en) 2007-03-12 2007-03-12 Address calculation within data processing systems

Country Status (2)

Country Link
US (2) US7814302B2 (en)
GB (2) GB2447427B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9652210B2 (en) * 2007-08-28 2017-05-16 Red Hat, Inc. Provisioning a device with multiple bit-size versions of a software component
GB2488980B (en) * 2011-03-07 2020-02-19 Advanced Risc Mach Ltd Address generation in a data processing apparatus
US9021233B2 (en) * 2011-09-28 2015-04-28 Arm Limited Interleaving data accesses issued in response to vector access instructions
CN107908427B (en) * 2011-12-23 2021-11-09 英特尔公司 Instruction for element offset calculation in multi-dimensional arrays
US9489196B2 (en) 2011-12-23 2016-11-08 Intel Corporation Multi-element instruction with different read and write masks
DE102012010102A1 (en) * 2012-05-22 2013-11-28 Infineon Technologies Ag Method and device for data processing
US20160179530A1 (en) * 2014-12-23 2016-06-23 Elmoustapha Ould-Ahmed-Vall Instruction and logic to perform a vector saturated doubleword/quadword add
US9996350B2 (en) 2014-12-27 2018-06-12 Intel Corporation Hardware apparatuses and methods to prefetch a multidimensional block of elements from a multidimensional array
US20220374236A1 (en) * 2021-05-20 2022-11-24 Huawei Technologies Co., Ltd. Method and system for optimizing address calculations

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5487159A (en) * 1993-12-23 1996-01-23 Unisys Corporation System for processing shift, mask, and merge operations in one instruction
US20040193848A1 (en) * 2003-03-31 2004-09-30 Hitachi, Ltd. Computer implemented data parsing for DSP
US20050188182A1 (en) * 1999-12-30 2005-08-25 Texas Instruments Incorporated Microprocessor having a set of byte intermingling instructions
GB2411978A (en) * 2004-03-10 2005-09-14 Advanced Risc Mach Ltd Using multiple registers to shift and insert data into a packed format
US7047396B1 (en) * 2000-06-22 2006-05-16 Ubicom, Inc. Fixed length memory to memory arithmetic and architecture for a communications embedded processor system

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3735355A (en) * 1971-05-12 1973-05-22 Burroughs Corp Digital processor having variable length addressing
JPH01204147A (en) * 1988-02-09 1989-08-16 Toshiba Corp Address qualifying circuit
EP0656712A1 (en) * 1993-11-16 1995-06-07 AT&T Corp. Viterbi equaliser using variable length tracebacks
US5490178A (en) * 1993-11-16 1996-02-06 At&T Corp. Power and time saving initial tracebacks
JPH07333720A (en) * 1994-06-08 1995-12-22 Minolta Co Ltd Camera provided with magnetic recording function
US6148388A (en) * 1997-07-22 2000-11-14 Seagate Technology, Inc. Extended page mode with memory address translation using a linear shift register
US5987490A (en) * 1997-11-14 1999-11-16 Lucent Technologies Inc. Mac processor with efficient Viterbi ACS operation and automatic traceback store
GB2402757B (en) * 2003-06-11 2005-11-02 Advanced Risc Mach Ltd Address offset generation within a data processing system
US7275204B2 (en) * 2004-09-30 2007-09-25 Marvell International Ltd. Distributed ring control circuits for Viterbi traceback

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5487159A (en) * 1993-12-23 1996-01-23 Unisys Corporation System for processing shift, mask, and merge operations in one instruction
US20050188182A1 (en) * 1999-12-30 2005-08-25 Texas Instruments Incorporated Microprocessor having a set of byte intermingling instructions
US7047396B1 (en) * 2000-06-22 2006-05-16 Ubicom, Inc. Fixed length memory to memory arithmetic and architecture for a communications embedded processor system
US20040193848A1 (en) * 2003-03-31 2004-09-30 Hitachi, Ltd. Computer implemented data parsing for DSP
GB2411978A (en) * 2004-03-10 2005-09-14 Advanced Risc Mach Ltd Using multiple registers to shift and insert data into a packed format

Also Published As

Publication number Publication date
GB0704735D0 (en) 2007-04-18
US7895417B2 (en) 2011-02-22
GB201104112D0 (en) 2011-04-27
US20100217958A1 (en) 2010-08-26
US7814302B2 (en) 2010-10-12
GB2475653B (en) 2011-07-13
US20080229073A1 (en) 2008-09-18
GB2447427A (en) 2008-09-17
GB2447427B (en) 2011-05-11

Similar Documents

Publication Publication Date Title
US7895417B2 (en) Select-and-insert instruction within data processing systems
CN109643233B (en) Data processing apparatus having a stream engine with read and read/forward operand encoding
EP2569694B1 (en) Conditional compare instruction
JP4484925B2 (en) Method and apparatus for control flow management in SIMD devices
US5680597A (en) System with flexible local control for modifying same instruction partially in different processor of a SIMD computer system to execute dissimilar sequences of instructions
US7177876B2 (en) Speculative load of look up table entries based upon coarse index calculation in parallel with fine index calculation
KR101137403B1 (en) Fast vector masking algorithm for conditional data selection in simd architectures
CN108780395B (en) Vector prediction instruction
JPH04313121A (en) Instruction memory device
CN109952559B (en) Streaming engine with individually selectable elements and group replication
KR20070026434A (en) Apparatus and method for control processing in dual path processor
CN107851013B (en) Data processing apparatus and method
JP2005332361A (en) Program command compressing device and method
KR20080005574A (en) Cyclic redundancy code error detection
US5742621A (en) Method for implementing an add-compare-select butterfly operation in a data processing system and instruction therefor
KR20100085131A (en) Optimized viterbi decoder and gnss receiver
CN108351762A (en) Use the redundant representation of the numerical value of overlapping bit
IL256403A (en) Vector length querying instruction
JP7324754B2 (en) Add instruction with vector carry
US20140122839A1 (en) Apparatus and method of execution unit for calculating multiple rounds of a skein hashing algorithm
US10235167B2 (en) Microprocessor with supplementary commands for binary search and associated search method
KR20100101585A (en) Accelerating traceback on a signal processor
WO2002039272A1 (en) Method and apparatus for reducing branch latency
CN105404588B (en) Processor and method for generating one or more addresses of data storage operation therein
Ates et al. Multi-Gbps Fano Decoding Algorithm on GPGPU