GB2475653A - Select-and-insert instruction for a data processor - Google Patents


Info

Publication number
GB2475653A
GB2475653A GB1104112A GB201104112A
Authority
GB
United Kingdom
Prior art keywords
value
input
bits
output
data
Prior art date
Legal status
Granted
Application number
GB1104112A
Other versions
GB201104112D0 (en)
GB2475653B (en)
Inventor
Dominic Hugo Symes
Daniel Kershaw
Mladen Wilder
Current Assignee
ARM Ltd
Original Assignee
ARM Ltd
Advanced Risc Machines Ltd
Priority date
Filing date
Publication date
Application filed by ARM Ltd, Advanced Risc Machines Ltd filed Critical ARM Ltd
Priority to GB1104112A
Publication of GB201104112D0
Publication of GB2475653A
Application granted
Publication of GB2475653B
Legal status: Active

Classifications

    • H03M13/4169 Sequence estimation using the Viterbi algorithm or Viterbi processors, implementing path management using traceback
    • G06F9/22 Microcontrol or microprogram arrangements
    • G06F9/26 Address formation of the next micro-instruction; microprogram storage or retrieval arrangements
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30036 Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
    • G06F9/345 Addressing or accessing the instruction operand or the result; addressing modes of multiple operands or results
    • G06F9/355 Indexed addressing
    • G06F9/3885 Concurrent instruction execution using a plurality of independent parallel functional units
    • G06F9/3887 Concurrent instruction execution using a plurality of independent parallel functional units controlled by a single instruction for multiple data lanes [SIMD]
    • H03M13/23 Error detection or forward error correction using convolutional codes, e.g. unit memory codes
    • H03M13/256 Error detection or forward error correction by signal space coding with trellis coding, e.g. with convolutional codes and TCM
    • H03M13/41 Sequence estimation using the Viterbi algorithm or Viterbi processors
    • H03M13/4161 Sequence estimation using the Viterbi algorithm or Viterbi processors, implementing path management
    • H04L1/0052 Realisations of complexity reduction techniques, e.g. pipelining or use of look-up tables
    • H04L1/0054 Maximum-likelihood or sequential decoding, e.g. Viterbi, Fano, ZJ algorithms

Abstract

A data processing system has a select-and-insert instruction, which takes two input values. The instruction shifts the first value by n bits, selects n bits from the second value and concatenates the shifted value with the selected bits to produce a result. If the first value is left shifted, then the selected bits form the least significant bits of the result. If the first value is right shifted, then the selected bits form the most significant bits of the result. The instruction may be used in a Viterbi decoder with the first input being a Viterbi decoder state and the second value being a Viterbi trellis value.

Description

SELECT-AND-INSERT INSTRUCTIONS WITHIN DATA PROCESSING SYSTEMS
This invention relates to the field of data processing systems. More particularly, this invention relates to data processing systems supporting program instructions tailored to high data throughput requirements.
It is known within data processing systems to perform data processing operations which require a high data throughput and the manipulation of large amounts of data. An example of such manipulations is the Viterbi algorithm calculations commonly used when transmitting data over a noisy communication channel. While these techniques can be highly successful in resisting data loss arising due to noise on the channel, they bring with them a high computational load. These high levels of computation present a significant challenge in producing low overhead (in terms of size, cost and energy consumption) systems capable of performing the required processing.
One particular challenge within Viterbi decoding is that the trellis traceback algorithm requires access to a two-dimensional array of data values with one dimension of the array being stepped through at a constant rate and the other dimension being accessed "randomly" depending upon the current state of the decoder.
Known software Viterbi implementations (e.g. C54x) implement these requirements by using one instruction to step through the dimension which changes at a constant rate and another instruction to apply the value for the randomly accessed dimension when seeking to form the composite address for accessing the two-dimensional array.
A problem situation that arises concerns the manipulation of data values in a manner that depends directly upon the data values to be manipulated. Conventionally this requires multiple instructions, i.e. first to examine the data to identify the manipulation to be performed and then to separately perform that manipulation.
Viewed from another aspect the present invention provides apparatus for processing data comprising: data processing circuitry responsive to control signals to perform data processing operations; and instruction decoder circuitry coupled to said data processing circuitry and responsive to program instructions to generate said control signals; wherein said instruction decoder circuitry is responsive to a select-and-insert instruction having as input operands at least a first input value and a second input value to generate control signals to control said data processing circuitry to form an output value equal to that given by performing the steps of: shifting said first input value by N bit positions to form a shifted value, where N is an integer value greater than zero; selecting N bits from within said second input value in dependence upon said first input value; and concatenating said shifted value and said N bits to form said output value.
The present technique recognises the bottleneck that is introduced by the need to perform manipulations upon data values in dependence upon those data values themselves in circumstances where these manipulations are frequently required and where high data throughput is required. More particularly, the present technique recognises a particular class of such situations for which it is desirable to provide hardware support. These correspond to a select-and-insert instruction in which a first input value is shifted by a variable number N of bit positions to form a shifted value, N bits from within a second input value are selected in dependence upon the first input value, and then the shifted value and the selected N bits are concatenated to form an output value. This particular combination of manipulations is one which is frequently required in certain fields where high volumes of data are to be processed, desirably with a high level of efficiency.
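As an illustration of the behaviour just described, the following is a minimal C model of the select-and-insert operation for the left-shift case, assuming N = 1, a 32-bit first input value (the decoder state) and a 64-bit second input value; the operand widths and the use of the bottom six state bits as the selector are assumptions made for the sketch, not details taken from the claims.

```c
#include <stdint.h>

/* Minimal C model of the described select-and-insert operation
 * (left-shift variant).  N = 1; the 32-bit state and 64-bit trellis
 * word widths, and the use of the bottom 6 state bits as the bit
 * selector, are illustrative assumptions. */
static inline uint32_t select_and_insert(uint32_t state, uint64_t trellis_word)
{
    unsigned k = state & 0x3F;                      /* bottom 6 bits select a bit */
    uint32_t selected = (uint32_t)((trellis_word >> k) & 1u);
    return (state << 1) | selected;                 /* shift, then concatenate    */
}
```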
Whilst the above select-and-insert instruction could be used in other circumstances, it is particularly well suited to use when the first input value is a Viterbi decoder state value and the second input value is a Viterbi trellis data value. The instruction then provides a high efficiency mechanism for tracing back through the Viterbi trellis data values to reconstruct decoder state and decode the signals required.
It will be appreciated that the first input value could be left shifted with the N bits concatenated to form the least significant bits of the output data value. Alternatively, the first input value could be right shifted and the N bits concatenated with the shifted value to form the most significant bits of the output value. The number of bit positions shifted and the number of bits inserted can take a variety of values, but a value of one is often useful.
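A corresponding C sketch of the right-shifted alternative, under the same illustrative assumptions (the selector field and the m-bit output width are assumptions, not taken from the claims), is given below.

```c
#include <stdint.h>

/* Right-shift variant: the selected bit becomes the most significant
 * bit of an m-bit output value.  The selector field and the value of
 * m are illustrative assumptions. */
static inline uint32_t select_and_insert_msb(uint32_t state, uint64_t trellis_word,
                                             unsigned m /* state width in bits */)
{
    unsigned k = state & 0x3F;                      /* assumed selector bits      */
    uint32_t selected = (uint32_t)((trellis_word >> k) & 1u);
    return (state >> 1) | (selected << (m - 1));    /* insert at the MSB position */
}
```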
The present technique is well suited to pipelined implementation when the first input value is a Viterbi decoder state value and the second input value is a multi-bit Viterbi trellis data value loaded from a memory by a load instruction executed in a processing cycle preceding the processing cycle in which the select-and-insert instruction is executed. In these circumstances, the latency associated with accessing the Viterbi trellis data value with the load instruction can be compensated for since the bits which will be required from that Viterbi trellis data value to be inserted into the Viterbi decoder state value can be determined and selected later by the select-and-insert instruction. The load can thus effectively load all of the bit values which might be required and the select-and-insert instruction can then select the bit values which are actually required for the manipulation to be performed.
The provision of the select-and-insert instruction is complemented by the provision of the previously discussed address calculation instruction as together these instructions can significantly reduce the processing bottlenecks which would otherwise be present and obstruct a high efficiency implementation of, in particular, a Viterbi software decoder. This is particularly beneficial when the trellis is generated by parallel data processing units, such as in a SIMD machine. In this case the scalar traceback processing becomes a bottleneck.
Viewed from another aspect the present invention provides a method of processing data using data processing circuitry responsive to control signals to perform data processing operations and instruction decoder circuitry coupled to said data processing circuitry and responsive to program instructions to generate said control signals, said method comprising the steps of: decoding a select-and-insert instruction having as input operands at least a first input value and a second input value to generate control signals; controlling said data processing circuitry with said control signals to calculate an output value equal to that given by performing the steps of: shifting said first input value by N bit positions to form a shifted value, where N is an integer value greater than zero; selecting N bits from within said second input value in dependence upon said first input value; and concatenating said shifted value and said N bits to form said output value.
Viewed from a further aspect the present invention provides apparatus for processing data comprising: data processing means for performing data processing operations in response to control signals; and instruction decoder means coupled to said data processing means for generating said control signals in response to program instructions; wherein said instruction decoder means, in response to a select-and-insert instruction having as input operands at least a first input value and a second input value, generates control signals to control said data processing means to calculate an output value equal to that given by performing the steps of: shifting said first input value by N bit positions to form a shifted value, where N is an integer value greater than zero; selecting N bits from within said second input value in dependence upon said first input value; and concatenating said shifted value and said N bits to form said output value.
Viewed from a further aspect the present invention provides a virtual machine implementation of an apparatus for processing data, said virtual machine implementation being responsive to a select-and-insert instruction having as input operands at least a first input value and a second input value to calculate an output value equal to that given by performing the steps of: shifting said first input value by N bit positions to form a shifted value, where N is an integer value greater than zero; selecting N bits from within said second input value in dependence upon said first input value; and concatenating said shifted value and said N bits to form said output value.
Embodiments of the invention will now be described, by way of example only, with reference to the accompanying drawings in which: Figure 1 schematically illustrates an integrated circuit suitable for software radio processing; Figure 2 schematically illustrates a Viterbi coding and decoding system; Figure 3 schematically illustrates Viterbi trellis data; Figure 4 schematically illustrates updating of Viterbi decoder state data during traceback; Figure 5 schematically illustrates a two-dimensional array of Viterbi trellis data being traversed as part of a traceback operation; Figure 6 schematically illustrates an instruction decoder responsive to program instructions for controlling data processing circuitry; Figure 7 schematically illustrates the operation of an address calculation instruction; Figure 8 is a flow diagram schematically illustrating the processing performed by an address calculation instruction; Figure 9 illustrates the syntax of an address calculation instruction; Figure 10 schematically illustrates the operation of a select-and-insert instruction; Figure 11 schematically illustrates an alternative operation of a select-and-insert instruction; Figure 12 is a flow diagram schematically illustrating the operation of a select-and-insert instruction; Figure 13 illustrates the syntax of a select-and-insert instruction; Figure 14 is an example code sequence illustrating the use of a select-and-insert instruction in combination with an address calculation instruction to perform Viterbi traceback operations; and Figure 15 is a diagram schematically illustrating a virtual machine implementation for executing program code utilising the address calculation instruction and select-and-insert instruction of the current techniques.
Figure 1 shows an integrated circuit 2 adapted to perform software radio processing functions. Software radio places heavy demands upon the processing capabilities of such a programmable integrated circuit. The data throughputs required are large and it is important to balance the different elements provided within the integrated circuit 2 in order that all the elements are used with a high degree of efficiency. In the illustrated example, thirty-two parallel lanes, each sixteen bits wide, for performing multiplication, addition and shuffle operations upon arithmetic values are provided. Each of these lanes includes a multiplier 4, an adder 6 and a shuffle unit 8. 16-bit data words are taken from a respective lane within an input value register 10 to provide input operands to the multiplier 4, the adder 6 and the shuffle unit 8. The multiplier 4, the adder 6 and the shuffle unit 8 form a three-cycle deep pipeline such that the results of a calculation will be available three cycles after the calculation is issued into the pipeline. The respective processing lanes are controlled by a 256-bit very long instruction word (VLIW) instruction stored within an instruction register 12. This VLIW instruction also includes a scalar instruction supplied to a scalar processor 14.
The scalar processor 14 operates in parallel with the previously discussed thirty two parallel lanes and serves primarily to perform control and higher level decoding operations.
The scalar processor 14 also controls an address generation unit 16 which is responsible for generating memory access addresses supplied to a memory 18 for accessing data values therefrom (which are fed to the operand register 10 for processing in the thirty-two parallel lanes as well as to the scalar processor 14 itself). The scalar processor 14 also has a three-cycle pipeline depth and the memory 18 has a three-cycle latency. Matching the pipeline depths/latency of the address generation unit 16, the thirty-two parallel lanes and the memory 18 simplifies efficient coding and allows more flexibility in the scheduling of instructions.
One of the tasks of the address generation unit 16 in performing Viterbi decoding is to undertake the traceback operations through the Viterbi trellis data which has been calculated by the thirty-two parallel lanes. The thirty-two parallel lanes, each comprising a multiplier 4, an adder 6 and a shuffle unit 8, are responsible for the data processing necessary to compute the probability coefficients and branch values to be associated with each state node within the Viterbi decoding process. Such a highly parallel data processing engine is well suited to this computationally intensive task. Once the Viterbi trellis data has been calculated it is necessary to analyse this calculated data so as to extract therefrom the bit stream which has been decoded. This task is performed by the address generation unit 16. The thirty-two parallel lanes write the Viterbi trellis data to the memory 18 from where it is read and analysed by the address generation unit 16. The address generation unit 16 also tracks the Viterbi decoder state data which provides the decoded data stream.
Viterbi decoding in itself is a well known technique within the field of data and signal processing. Viterbi decoding will not be described herein in detail.
Figure 2 illustrates at a high level the processing that is performed. An input data stream 20 is subject to convolution encoding and the addition of some parity data by a convolutional encoder 22. This Viterbi encoded data is then transmitted over a noisy data channel (e.g. a wireless data channel) to a Viterbi decoder 24. The Viterbi decoder 24 applies Viterbi decoding algorithms to the received data to form Viterbi trellis data 26, which can then be subject to traceback processing by a traceback processor 28 to generate an output data stream 30 corresponding to the input datastream 20.
Figure 3 schematically illustrates Viterbi trellis data. In this example each Viterbi decoder state is taken to have four possible values, m3 to m0. These four possible states at each time t have a value associated with them indicating how probable it is that the decoder has reached that state given the preceding sequence of bits that have been received. The transition from one possible decoder state to the next possible decoder state can have two potential targets selected between in dependence upon the received bit associated with that transition.
The trellis data comprises a large number of computed elements representing the probabilities of states and the bit sequences which have led to those states. Calculating this trellis data is computationally intensive and is performed by the wide multi-lane data engine illustrated in Figure 1. When the trellis data has been formed in this way, another processing unit, such as the address generation unit 16, is used to analyse this trellis data and "traceback" therethrough. This type of processing is in itself known. It will be appreciated that in practice a Viterbi decoder will have many more than four possible states at each time, making the Viterbi traceback data significantly larger in volume and more complex to analyse.
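For concreteness, one way such trellis traceback data might be laid out in memory is sketched below in C: a two-dimensional array with one row per time step and one packed decision bit per possible decoder state. The sizes and the packing into 64-bit words are illustrative assumptions, not details taken from the embodiment.

```c
#include <stdint.h>

/* Illustrative layout for the trellis traceback data: one row per time
 * step, one decision bit per decoder state packed into 64-bit words.
 * NUM_STATES and NUM_STEPS are example values only. */
#define NUM_STATES    64                            /* decoder states per time step */
#define NUM_STEPS     1024                          /* trellis length in time steps */
#define WORDS_PER_ROW ((NUM_STATES + 63) / 64)

typedef struct {
    uint64_t decision[NUM_STEPS][WORDS_PER_ROW];    /* decision[t] is row t         */
} viterbi_trellis_t;
```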
Figure 4 schematically illustrates a small part of the traceback operation performed as part of typical Viterbi decoding. The decoder has been determined at time t to be in a given state that is most probable given the already decoded trellis data which has been traversed.
Stored within the trellis data for the time t and the state in which the decoder currently is, is an indication of which preceding state at the time t-1 is the most probable preceding state. This indicates the state to which the decoder is traced back and the bit value which is deemed to have been decoded by that change of state. The change of state will also be accompanied by a change in the decoder state value which is achieved, in this example, by left shifting the current state value and shifting into the bottom of that state value a bit indicating which of the two options for the preceding bit has been deemed the most probable, and accordingly deemed to have been decoded. This shifted value with an inserted new bit then forms the new state of the decoder at time t-1. The process repeats at time t-1 and a further bit is decoded; traceback through the Viterbi trellis data is so made.
Figure 5 is another example illustration of this process. At the various times t the decoder state in this example can have sixteen possible values. With each of these values there is an associated bit indicating the most likely path by which that state will have been reached from the two possible preceding states at an earlier time. This path is then followed back to that preceding state, which will in itself have an indicator to the preceding state to which traceback is to be performed. Thus, in the example illustrated, the state at time t is "0101". The bit stored within the trellis data indicating the preceding state associated with that state is a "1", indicating that a "1" is to be shifted into the bottom of the state value as it is left shifted to form the state value for the preceding state at time t-1. In this way, the state value for the preceding state is formed as "1011". Data is stored within the trellis data associated with this state at time t-1 indicating the next state to be adopted. Thus, the trellis data shown in Figure 5 is subject to a traceback operation during which the decoder state is updated and is used to generate the decoded data stream in accordance with known techniques.
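The Figure 5 step can be checked with a few lines of C; the 4-bit state width mirrors the sixteen-state example, and the update is simply the left shift with the decision bit inserted at the bottom.

```c
#include <stdio.h>
#include <stdint.h>

/* Worked example of the Figure 5 traceback step: state "0101" at time t,
 * a stored decision bit of 1, left shift within a 4-bit state value. */
int main(void)
{
    uint8_t state = 0x5;                                  /* 0101                 */
    uint8_t decision = 1;                                 /* bit from the trellis */
    state = (uint8_t)(((state << 1) | decision) & 0xF);   /* keep 4 bits          */
    printf("new state = %x\n", state);                    /* prints b, i.e. 1011  */
    return 0;
}
```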
Figure 6 illustrates a portion of the integrated circuit 2 of Figure 1 in more detail. The scalar processor 14 is provided with a scalar instruction register 32 (which is part of the VLIW instruction register 12) for storing a scalar instruction to be executed. An instruction decoder 34 is responsive to the scalar instruction in the scalar instruction register 32 to generate control signals supplied to data processing circuitry 36. The data processing circuitry 36 performs data processing operations in response to the control signals supplied thereto in order to perform the desired data processing operations specified by the instruction within the scalar instruction register 32. The instruction decoder 34 is circuitry configured to be responsive to the bit patterns within the scalar instruction register 32 to generate the desired control signals for supply to the data processing circuitry 36. The data processing circuitry 36 typically includes a wide variety of different functional elements, such as an adder 38, a shifter 40 and general purpose combinatorial logic 42. It will be appreciated that a wide variety of other forms of circuitry may be provided within the data processing circuitry 36 to achieve the desired functions. It will further be appreciated that the selection of which program instructions are to be supported by the instruction decoder 34 is a critical one in system design. A general purpose processor can normally accomplish most processing tasks desired if enough program instructions and processor cycles are dedicated to those tasks.
However, this is not sufficient when high efficiency is also required, as it is desirable to perform such processing quickly and with low energy consumption. In this way, the selection of which processing operations are to be supported within the instruction bit space and natively supported by the data processing circuitry 36 is critical in achieving good levels of efficiency. The present techniques concern the identification and selection of particular forms of data processing instruction which are surprisingly advantageous and accordingly desirable to support natively.
Figure 7 illustrates the operations performed by an address calculation instruction supported by the instruction decoder 34 and the data processing circuitry 36. The input address value 44 is divided into a first portion 46 and a second portion 48 in dependence upon a size value 50. The size value 50 is in this example specified as a value representing the logarithm of the size of a mask to be applied to the input address value 44 to split it into the first portion 46 and the second portion 48. Also supplied as input operands to the address calculation instruction in this example are an offset value stored within a register specified as a register field within the instruction and a state value stored within a register specified as a register field within the instruction. The address calculation instruction serves to add an offset value to the first portion 46. In the example illustrated, this offset value is "-1", which effectively results in a decrement of the first portion. If the first portion is indexing a two dimensional data array, then the high order bits of the first portion can be considered to form the base address for that two dimensional array and the lower bits of the first portion represent the row address within that array. In this case the array is aligned; more generally the high order bits are the base address plus the row address. The number of bits of this lower portion of the first portion representing the row address varies depending upon the row size. In the example of Viterbi trellis data, each row can correspond to a different time t with data values corresponding to the different decoder states at that time t.
The manipulation performed upon the second portion 48 of the input address value 44 is to set the second portion 48 to a value specified by the state input operand, being a value held within a register specified by a register field within the address calculation instruction and subject to masking of that state value to select the relevant bits thereof which are to be used as the second portion 48.
In this way, a new address can be formed as the output address value 52 by adding an offset value to the most significant bit portion of the input address value and setting the least significant bit portion of the input address value to a new value which can effectively be randomly selected. Thus, if a two dimensional data structure is considered, the modification to the first portion 46 steps through the rows of the data structure in a regular fashion (e.g. one row at a time, two rows at a time, etc.) with the setting of the second portion 48 of the address value allowing a random column position within the two-dimensional data structure to be selected for access using the output address value calculated. In the context of traversing Viterbi trellis data it will be seen that this instruction is well suited to this task since such trellis data is regularly traversed, typically one row at a time, with a random next column needing to be accessed at each access. Thus, by appropriately loading the state value into the register to be used to form the second portion, and setting the desired offset, the new address following a traceback step can be calculated with a single instruction.
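A minimal C model of this address calculation, under the assumption that the size value gives the bit position at which the input address is split (i.e. the log2 of the mask) and that the operands are 32 bits wide, might look as follows; the function name and widths are illustrative, not the instruction's actual encoding.

```c
#include <stdint.h>

/* Minimal C model of the described address calculation: split the input
 * address at bit position 'size' (size < 32 assumed), add the (possibly
 * negative) offset to the upper portion, and set the lower portion from
 * the masked state value.  Widths and names are illustrative assumptions. */
static inline uint32_t addr_calc(uint32_t addr, int32_t offset,
                                 uint32_t state, unsigned size)
{
    uint32_t mask   = (1u << size) - 1u;                  /* selects the second portion */
    uint32_t first  = (addr >> size) + (uint32_t)offset;  /* e.g. offset = -1 decrements*/
    uint32_t second = state & mask;                       /* "randomly" chosen column   */
    return (first << size) | second;                      /* concatenate the portions   */
}
```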
Figure 8 schematically illustrates the operation of the instruction decoder 34 when encountering an address calculation instruction. At step 54, the instruction decoder 34 identifies the scalar instruction within the scalar instruction register 32 as an address calculation instruction. At step 56 the input address value is split into a first portion and a second portion in dependence upon the size value specified in association with the address calculation instruction. At step 58 a non-zero offset (which may be positive or negative) is added to the first portion. At step 60 the second portion is set to a value determined directly or indirectly from the address calculation instruction. At step 62 the first portion and the second portion which have been modified are concatenated to form the output address value 52.
It will be appreciated that the sequence of operations shown in Figure 8 is linear whereas in practice these operations may be performed in a different order and/or with varying degrees of parallelism. Figure 8 is intended to represent the functionality provided rather than the precise way in which such functionality is provided. The various options for the provision of this functionality will be familiar to those in this technical field.
Figure 9 schematically illustrates the syntax of an address calculation instruction in accordance with the present technique. As will be seen, the instruction includes a field identifying a register storing the offset value to be used, a field identifying a register storing a value at least part of which is to be used to set the second portion when forming the output address value and further a field (in this case an immediate) specifying a size value to be used when dividing the input address value into a first portion and a second portion. The variability of the size value allows different widths of two dimensional data array to be appropriately addressed using the address calculation instruction depending upon the circumstances.
Figure 10 schematically illustrates the operations to be performed as part of traceback when updating a decoder state value. Figure 10 illustrates the example in which the state value is left shifted and a new bit value is inserted in the least significant bit position. As illustrated in Figure 10, a first input value 64 is provided in conjunction with a second input value 66. The operation of the select-and-insert instruction is to use, in this example, the bottom three bits of the first input value 64 to select a bit within the second input value 66 which is to be inserted in the least significant bit position of the new value to be formed as the output value 68 after it has been left shifted and had the bit inserted at its least significant bit position. It will be appreciated that the width of the portion of the first input value 64 used to select the bits or bit within the second input value 66 to be inserted can vary depending upon the width of the second input value 66. Similarly, the number of bits to be inserted with each instruction can vary and be more generally N bits. In many circumstances, such as a simple Viterbi traceback, the shift by one bit position and the inserting of one bit will be normal.
Figure 11 illustrates a variant of the select-and-insert instruction, with in this case the first input operand 70 being subject to a right shift and the selected bit or bits from the second input value 72 being inserted at the most significant bit position within the new state value.
The state value in this example is M bits wide and accordingly there are 2^M possible one-bit values which can be selected within the second input value 72 for insertion. The output value 74 represents the traceback Viterbi decoder state at time t-1. It will be appreciated from Figures 10 and 11 that the second input values 66, 72 include more than just the bit(s) which are to be inserted and used to update the state values when these are shifted. This is advantageous since the second input value 66, 72 can be fetched from memory by an instruction issued several cycles earlier before it is known precisely which of the bits from that fetched value will need to be inserted within the state value to update the state value when that update is required. Thus, the latency associated with the memory access can effectively be hidden by fetching more than just the bit(s) which will be required and then later selecting the desired bit(s) from the fetched second input value to perform the desired update. In practice memories are accessed with access mechanisms/paths wider than a single bit (e.g. typically byte or word access) and accordingly the fetching of more than just the single bit or N bits required to be inserted does not in practice consume more energy than would otherwise be the case.
Figure 12 is a flow diagram schematically illustrating the processing performed by the select-and-insert instructions of Figures 10 and 11. At step 76 the instruction decoder 34 identifies from the bit pattern within the scalar instruction register 32 that a select-and-insert instruction has been received. It then generates the appropriate control signals to configure the data processing circuitry 36 to perform the above described data processing operations.
At step 78 the first input value is shifted by N bits to form a shifted value. At step 80, N bits are selected from the second input value, as pointed to by the first input value. More specifically, the selected bits from within said second input value are bits (K*N) to (K*N)+(N-1) where K is the bottom M bits of the first input value. In this case, the Viterbi trellis data value includes 2^M possible N-bit portions to be concatenated with said shifted value, permitting up to M cycles to load said Viterbi trellis data value from said memory whilst permitting said data processing circuitry to execute a sequence of said select-and-insert instructions in a manner providing a throughput capable of forming one output value per clock cycle. At step 82 the shifted value and the selected N bits are concatenated to form the output value. As previously discussed, it will be appreciated that the flow diagram of Figure 12 represents the processing as sequential operations, but in practice this could be performed in a different order and with varying degrees of parallelism.
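The bit-numbering rule just quoted can be written out directly in C; the sketch below assumes the left-shift variant and that K*N + N does not exceed the width of the second input value, and the function name is hypothetical.

```c
#include <stdint.h>

/* General form of the selection rule: K is the bottom M bits of the
 * first input value and bits (K*N) .. (K*N)+(N-1) of the second input
 * value are selected.  Left-shift variant assumed; K*N + N must not
 * exceed 64 for this sketch. */
static inline uint64_t select_and_insert_n(uint64_t state, uint64_t second,
                                           unsigned n, unsigned m)
{
    uint64_t k        = state & ((1ull << m) - 1ull);       /* K = bottom M bits  */
    uint64_t selected = (second >> (k * n)) & ((1ull << n) - 1ull);
    return (state << n) | selected;                          /* shift, concatenate */
}
```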
Figure 13 schematically illustrates the syntax of a select-and-insert instruction. This instruction includes a first input operand and a second input operand, each in the form of a register specifier pointing to a register holding respectively the current state value and trellis data value as part of the Viterbi decoding.
Figure 14 is an example code sequence showing the use of the address calculation instruction and the select-and-insert instruction in a code fragment for performing Viterbi traceback operations. In this example it will in particular be seen that the first triplet of instructions terminates with a load to register d4 and this value is not needed until that triplet of instructions is returned to in the next loop cycle. This permits the latency associated with this load to be tolerated without stalling the instruction processing. Furthermore, since the value within the register d4 contains more than just the bits which are to be inserted, the various options for which bits will be inserted can be catered for.
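Since the Figure 14 assembly sequence is not reproduced here, the following C sketch shows the shape of such a traceback loop built from the two operations modelled earlier. The arrangement assumed (each trellis row split into words of decision bits, the upper state bits choosing the word via the address calculation and the bottom m bits choosing the bit via select-and-insert) is one plausible reading of the description, not the patent's actual code.

```c
#include <stdint.h>

/* C sketch of a Viterbi traceback loop built from the two operations
 * described above.  Names, widths and the trellis layout are
 * illustrative assumptions only. */
static inline uint32_t addr_calc_step(uint32_t addr, int32_t offset,
                                      uint32_t column, unsigned size)
{
    uint32_t mask = (1u << size) - 1u;
    return (((addr >> size) + (uint32_t)offset) << size) | (column & mask);
}

static inline uint32_t select_insert_bit(uint32_t state, uint64_t word, unsigned m)
{
    uint64_t k = state & ((1ull << m) - 1ull);
    return (state << 1) | (uint32_t)((word >> k) & 1ull);
}

static void viterbi_traceback(const uint64_t *trellis, uint32_t addr,
                              uint32_t state, unsigned size, unsigned m,
                              uint8_t *out_bits, unsigned num_steps)
{
    for (unsigned i = 0; i < num_steps; i++) {
        uint64_t word = trellis[addr];                       /* loaded early; the exact */
                                                             /* bit is chosen later     */
        state = select_insert_bit(state, word, m);           /* shift in decoded bit    */
        out_bits[i] = (uint8_t)(state & 1u);                 /* newly decoded bit       */
        addr = addr_calc_step(addr, -1, state >> m, size);   /* previous row; word from */
                                                             /* the upper state bits    */
    }
}
```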
Figure 15 illustrates a virtual machine implementation of the present techniques. It will be appreciated that the above has described the implementation of the present invention in terms of apparatus and methods for operating specific processing hardware supporting the instructions concerned. It will be appreciated by those in this technical field that it is also possible to provide so-called "virtual machine" implementations of hardware devices. These virtual machine implementations run on a host processor 84 running a host operating system 86 supporting a virtual machine program 88. Typically large powerful processors are required to provide virtual machine implementations which execute at a reasonable speed, but such an approach may be justified in certain circumstances, such as a desire to run code native to another processor for compatibility or reuse reasons. The virtual machine program 88 provides an application program interface to an application program 90 which is the same as the application program interface which would be provided by the real hardware which is the device being modelled by the virtual machine program 88. Thus, the program instructions, including the address calculation instruction and the select-and-insert instruction described above, may be executed from within the application program 90 using the virtual machine program 88 to model their interaction with the virtual machine hardware.

Claims (10)

  1. Apparatus for processing data comprising: data processing circuitry responsive to control signals to perform data processing operations; and instruction decoder circuitry coupled to said data processing circuitry and responsive to program instructions to generate said control signals; wherein said instruction decoder circuitry is responsive to a select-and-insert instruction having as input operands at least a first input value and a second input value to generate control signals to control said data processing circuitry to form an output value equal to that given by performing the steps of: shifting said first input value by N bit positions to form a shifted value, where N is an integer value greater than zero; selecting N bits from within said second input value in dependence upon said first input value; and concatenating said shifted value and said N bits to form said output value.
  2. Apparatus as claimed in claim 1, wherein said first input value is left-shifted and said N bits are concatenated with said shifted value to form N least significant bits of said output value.
  3. Apparatus as claimed in claim 1, wherein said first input value is right-shifted and said N bits are concatenated with said shifted value to form N most significant bits of said output value.
  4. Apparatus as claimed in any one of claims 1 to 3, wherein N=1.
  5. Apparatus as claimed in any one of claims 1 to 4, wherein said first input value is a Viterbi decoder state value and said second input value is a Viterbi trellis data value.
  6. Apparatus as claimed in claim 5, wherein said Viterbi trellis data value is a multi-bit data value loaded from a memory by a load instruction executed in a processing cycle preceding a processing cycle in which said select-and-insert instruction is executed.
  7. Apparatus as claimed in claim 6, wherein said Viterbi trellis data value includes 2^M possible N-bit portions to be concatenated with said shifted value, permitting up to M cycles to load said Viterbi trellis data value from said memory whilst permitting said data processing circuitry to execute a sequence of said select-and-insert instructions in a manner providing a throughput capable of forming one output value per clock cycle.
  8. Apparatus as claimed in claim 7, wherein said N bits selected from within said second input value are bits (K*N) to (K*N)+(N-1) where K is the bottom M bits of the first input value.
  9. Apparatus as claimed in any one of claims 1 to 8, wherein said instruction decoder circuitry is responsive to an address calculation instruction having as input operands at least an input address value and a size value to generate control signals to control said data processing circuitry to calculate an output address value equal to that given by performing the steps of: splitting said input address value at a position dependent upon said size value into an input first portion and an input second portion; adding a non-zero offset value to said input first portion to form an output first portion; setting an output second portion to a value; and concatenating said output first portion and said output second portion to form said output address value.
  10. A method of processing data using data processing circuitry responsive to control signals to perform data processing operations and instruction decoder circuitry coupled to said data processing circuitry and responsive to program instructions to generate said control signals, said method comprising the steps of: decoding a select-and-insert instruction having as input operands at least a first input value and a second input value to generate control signals; controlling said data processing circuitry with said control signals to calculate an output value equal to that given by performing the steps of: shifting said first input value by N bit positions to form a shifted value, where N is an integer value greater than zero; selecting N bits from within said second input value in dependence upon said first input value; and concatenating said shifted value and said N bits to form said output value.
  11. A method as claimed in claim 10, wherein said first input value is left-shifted and said N bits are concatenated with said shifted value to form N least significant bits of said output value.
  12. A method as claimed in claim 10, wherein said first input value is right-shifted and said N bits are concatenated with said shifted value to form N most significant bits of said output value.
  13. A method as claimed in any one of claims 10 to 12, wherein N=1.
  14. A method as claimed in any one of claims 10 to 13, wherein said first input value is a Viterbi decoder state value and said second input value is a Viterbi trellis data value.
  15. A method as claimed in claim 14, wherein said Viterbi trellis data value is a multi-bit data value loaded from a memory by a load instruction executed in a processing cycle preceding a processing cycle in which said select-and-insert instruction is executed.
  16. A method as claimed in claim 15, wherein said Viterbi trellis data value includes 2^M possible N-bit portions to be concatenated with said shifted value, permitting up to M cycles to load said Viterbi trellis data value from said memory whilst permitting said data processing circuitry to execute a sequence of said select-and-insert instructions in a manner providing a throughput capable of forming one output value per clock cycle.
  17. A method as claimed in claim 16, wherein said N bits selected from within said second input value are bits (K*N) to (K*N)+(N-1) where K is the bottom M bits of the first input value.
  18. A method as claimed in any one of claims 10 to 17, comprising decoding an address calculation instruction having as input operands at least an input address value and a size value to generate control signals; and controlling said data processing circuitry using said control signals to calculate an output address value equal to that given by performing the steps of: splitting said input address value at a position dependent upon said size value into an input first portion and an input second portion; adding a non-zero offset value to said input first portion to form an output first portion; setting an output second portion to a value; and concatenating said output first portion and said output second portion to form said output address value.
  19. A virtual machine implementation of an apparatus for processing data, said virtual machine implementation being responsive to a select-and-insert instruction having as input operands at least a first input value and a second input value to calculate an output value equal to that given by performing the steps of: shifting said first input value by N bit positions to form a shifted value, where N is an integer value greater than zero; selecting N bits from within said second input value in dependence upon said first input value; and concatenating said shifted value and said N bits to form said output value.
GB1104112A 2007-03-12 2007-03-12 Select and insert instructions within data processing systems Active GB2475653B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB1104112A GB2475653B (en) 2007-03-12 2007-03-12 Select and insert instructions within data processing systems

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB0704735A GB2447427B (en) 2007-03-12 2007-03-12 Address calculation within data processing systems
GB1104112A GB2475653B (en) 2007-03-12 2007-03-12 Select and insert instructions within data processing systems

Publications (3)

Publication Number Publication Date
GB201104112D0 GB201104112D0 (en) 2011-04-27
GB2475653A true GB2475653A (en) 2011-05-25
GB2475653B GB2475653B (en) 2011-07-13

Family

ID=37988816

Family Applications (2)

Application Number Title Priority Date Filing Date
GB0704735A Active GB2447427B (en) 2007-03-12 2007-03-12 Address calculation within data processing systems
GB1104112A Active GB2475653B (en) 2007-03-12 2007-03-12 Select and insert instructions within data processing systems

Family Applications Before (1)

Application Number Title Priority Date Filing Date
GB0704735A Active GB2447427B (en) 2007-03-12 2007-03-12 Address calculation within data processing systems

Country Status (2)

Country Link
US (2) US7814302B2 (en)
GB (2) GB2447427B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9652210B2 (en) * 2007-08-28 2017-05-16 Red Hat, Inc. Provisioning a device with multiple bit-size versions of a software component
GB2488980B (en) * 2011-03-07 2020-02-19 Advanced Risc Mach Ltd Address generation in a data processing apparatus
US9021233B2 (en) * 2011-09-28 2015-04-28 Arm Limited Interleaving data accesses issued in response to vector access instructions
CN107908427B (en) * 2011-12-23 2021-11-09 英特尔公司 Instruction for element offset calculation in multi-dimensional arrays
US9489196B2 (en) 2011-12-23 2016-11-08 Intel Corporation Multi-element instruction with different read and write masks
DE102012010102A1 (en) * 2012-05-22 2013-11-28 Infineon Technologies Ag Method and device for data processing
US20160179530A1 (en) * 2014-12-23 2016-06-23 Elmoustapha Ould-Ahmed-Vall Instruction and logic to perform a vector saturated doubleword/quadword add
US9996350B2 (en) 2014-12-27 2018-06-12 Intel Corporation Hardware apparatuses and methods to prefetch a multidimensional block of elements from a multidimensional array
US20220374236A1 (en) * 2021-05-20 2022-11-24 Huawei Technologies Co., Ltd. Method and system for optimizing address calculations

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5487159A (en) * 1993-12-23 1996-01-23 Unisys Corporation System for processing shift, mask, and merge operations in one instruction
US20040193848A1 (en) * 2003-03-31 2004-09-30 Hitachi, Ltd. Computer implemented data parsing for DSP
US20050188182A1 (en) * 1999-12-30 2005-08-25 Texas Instruments Incorporated Microprocessor having a set of byte intermingling instructions
GB2411978A (en) * 2004-03-10 2005-09-14 Advanced Risc Mach Ltd Using multiple registers to shift and insert data into a packed format
US7047396B1 (en) * 2000-06-22 2006-05-16 Ubicom, Inc. Fixed length memory to memory arithmetic and architecture for a communications embedded processor system

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3735355A (en) * 1971-05-12 1973-05-22 Burroughs Corp Digital processor having variable length addressing
JPH01204147A (en) * 1988-02-09 1989-08-16 Toshiba Corp Address qualifying circuit
EP0656712A1 (en) * 1993-11-16 1995-06-07 AT&T Corp. Viterbi equaliser using variable length tracebacks
US5490178A (en) * 1993-11-16 1996-02-06 At&T Corp. Power and time saving initial tracebacks
JPH07333720A (en) * 1994-06-08 1995-12-22 Minolta Co Ltd Camera provided with magnetic recording function
US6148388A (en) * 1997-07-22 2000-11-14 Seagate Technology, Inc. Extended page mode with memory address translation using a linear shift register
US5987490A (en) * 1997-11-14 1999-11-16 Lucent Technologies Inc. Mac processor with efficient Viterbi ACS operation and automatic traceback store
GB2402757B (en) * 2003-06-11 2005-11-02 Advanced Risc Mach Ltd Address offset generation within a data processing system
US7275204B2 (en) * 2004-09-30 2007-09-25 Marvell International Ltd. Distributed ring control circuits for Viterbi traceback

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5487159A (en) * 1993-12-23 1996-01-23 Unisys Corporation System for processing shift, mask, and merge operations in one instruction
US20050188182A1 (en) * 1999-12-30 2005-08-25 Texas Instruments Incorporated Microprocessor having a set of byte intermingling instructions
US7047396B1 (en) * 2000-06-22 2006-05-16 Ubicom, Inc. Fixed length memory to memory arithmetic and architecture for a communications embedded processor system
US20040193848A1 (en) * 2003-03-31 2004-09-30 Hitachi, Ltd. Computer implemented data parsing for DSP
GB2411978A (en) * 2004-03-10 2005-09-14 Advanced Risc Mach Ltd Using multiple registers to shift and insert data into a packed format

Also Published As

Publication number Publication date
GB0704735D0 (en) 2007-04-18
US7895417B2 (en) 2011-02-22
GB201104112D0 (en) 2011-04-27
US20100217958A1 (en) 2010-08-26
US7814302B2 (en) 2010-10-12
GB2475653B (en) 2011-07-13
US20080229073A1 (en) 2008-09-18
GB2447427A (en) 2008-09-17
GB2447427B (en) 2011-05-11

Similar Documents

Publication Publication Date Title
US7895417B2 (en) Select-and-insert instruction within data processing systems
CN109643233B (en) Data processing apparatus having a stream engine with read and read/forward operand encoding
EP2569694B1 (en) Conditional compare instruction
JP4484925B2 (en) Method and apparatus for control flow management in SIMD devices
US5680597A (en) System with flexible local control for modifying same instruction partially in different processor of a SIMD computer system to execute dissimilar sequences of instructions
US7177876B2 (en) Speculative load of look up table entries based upon coarse index calculation in parallel with fine index calculation
KR101137403B1 (en) Fast vector masking algorithm for conditional data selection in simd architectures
CN108780395B (en) Vector prediction instruction
JPH04313121A (en) Instruction memory device
CN109952559B (en) Streaming engine with individually selectable elements and group replication
KR20070026434A (en) Apparatus and method for control processing in dual path processor
CN107851013B (en) Data processing apparatus and method
JP2005332361A (en) Program command compressing device and method
KR20080005574A (en) Cyclic redundancy code error detection
US5742621A (en) Method for implementing an add-compare-select butterfly operation in a data processing system and instruction therefor
KR20100085131A (en) Optimized viterbi decoder and gnss receiver
CN108351762A (en) Use the redundant representation of the numerical value of overlapping bit
IL256403A (en) Vector length querying instruction
JP7324754B2 (en) Add instruction with vector carry
US20140122839A1 (en) Apparatus and method of execution unit for calculating multiple rounds of a skein hashing algorithm
US10235167B2 (en) Microprocessor with supplementary commands for binary search and associated search method
KR20100101585A (en) Accelerating traceback on a signal processor
WO2002039272A1 (en) Method and apparatus for reducing branch latency
CN105404588B (en) Processor and method for generating one or more addresses of data storage operation therein
Ates et al. Multi-Gbps Fano Decoding Algorithm on GPGPU