GB2514099A

GB2514099A - A data processing apparatus and method for performing a transform between spatial and frequency domains when processing video data

Info

Publication number: GB2514099A
Application number: GB1308186.4A
Authority: GB
Inventors: Dominic Hugo Symes; Tomas Edso
Original assignee: ARM Ltd; Advanced Risc Machines Ltd
Current assignee: ARM Ltd
Priority date: 2013-05-07
Filing date: 2013-05-07
Publication date: 2014-11-19
Anticipated expiration: 2033-05-07
Also published as: CN104144346A; GB201308186D0; US9378186B2; JP2014241585A; JP6357345B2; CN104144346B; GB2514099B; US20140337396A1

Abstract

A data processing apparatus for performing transforms between spatial and frequency domains when processing video data comprises transform circuitry to receive N input values (i.e. relating to video data or transformed video data) and, for each input value, to generate a set of M internal input values. The M internal input values are supplied to base circuitry 215 which performs an operation equivalent to multiplying a matrix of the M internal input values by a matrix of coefficients (e.g. supplied from coefficient generation circuitry 220). The matrix of coefficients may be a Hankel matrix, which is a square matrix with constant skew diagonals, where each element of the array identifies a coefficient. The matrix multiplication returns a set of M internal output values which are provided to the transform circuitry. At the transform circuitry, each set of M internal output values is used to derive an output value (i.e. relating to transformed video data or video data), corresponding to the original input value. The process is repeated to produce N output values, corresponding to the N input values. A method for performing transforms between spatial and frequency domains when processing video data is also claimed. The apparatus and method described can be used during encoding of video data to transform the data from the spatial to the frequency domain, or during decoding of video data to perform an inverse transform of the data from the frequency to the spatial domain. The transforms performed may be a Discrete Cosine Transform (DCT) and Inverse Discrete Cosine Transform (IDCT) respectively. The conversion between N input values and M internal input values is performed using permutation circuitry 200 and adding/summation circuitry 210, with corresponding circuitry 225, 240 to derive N output values from the M internal output values. The apparatus and method described are intended to enable more efficient processing of larger transform units (e.g. 32x32 blocks of pixel data) when encoding/decoding video data in accordance with the High Efficiency Video Coding (HEVC) standard.

Description

A DATA PROCESSING APPARATUS AND METHOD FOR PERFORMING A TRANSFORM BETWEEN SPATIAL AND FREQUENCY DOMAINS WHEN PROCESSING VIDEO DATA

FIELD OF THE INVENTION

The present invention relates to techniques for performing a transform between spatial and frequency domains when processing video data. Such transforms are typically performed by both video encoders and video decoders, with a video encoder performing a forward transform to convert a video signal from the spatial domain to the frequency domain, and a video decoder performing a corresponding inverse transform in order to convert the encoded signal from the frequency domain back to the spatial domain.

DESCRIPTION OF THE PRIOR ART

There are various known transforms for converting signals between the spatial and frequency domains. A commonly used transform is the discrete cosine transform.

Contemporary video encoders and decoders may be required to perform video encoding and decoding operations in accordance with a number of video standards, such as MPEG2, MPEG4, H.263, H.264 high profile, VP8, VC-1 and so on. It is known that a particularly computationally intensive part of the video encoding and decoding process is the performance of the transform operation.

Video encoding and decoding has typically been performed on the basis of 8 x 8 blocks of pixel data, wherein four 8 x 8 blocks of luma (Y) data and two 8 x 8 blocks of chroma (Cb and Cr) data represent a given macroblock of the video data. The transform operations are performed on all six 8 x 8 blocks for each macroblock to produce six transformed output 8 x 8 blocks.

Until recently, only relatively small transform operations have been needed, such as 8x8 transforms in the above mentioned examples. However, with the introduction of high definition video, newer video standards are emerging, such as the HEVC standard, which requires transform operations to be performed on larger arrays, for example 16x16 and 32x32. Many of the techniques developed to efficiently perform the smaller sized transforms have been found not to be scalable to such larger transforms.

Considering specifically the example of a discrete cosine transform (DCT), various papers have studied larger DCTs, and techniques have been developed for enabling such large DCTs to be efficiently implemented by Fast Fourier Transform (FFT) style methods when repeated multiplications are permitted (i.e. the result of one multiplication is fed as an input to a further multiplication). For example, the two papers by Feig & Winograd entitled "On the Multiplicative Complexity of Discrete Cosine Transforms", IEEE Trans. Information Theory, Volume 38, No. 4, July 1992, and "Fast Algorithms for the Discrete Cosine Transform", IEEE Trans. Signal Processing, Volume 40, No. 9, September 1992, discuss possible algorithms for optimising DCTs which reduce the number of multiplication operations required.

However, generally these techniques require the earlier mentioned repeated multiplications, particularly for the larger transform sizes.

However, in video standards, there is often a requirement for the outputs of at least the decoding operation to be bit exact, since in video processing the contents of certain pictures are predicted from the previous picture. Taking the specific example of the HEVC standard, the inverse transform operation performed during decoding must be implemented to exactly match the output of a reference fixed-point version of the transform using integer multiplies. As a result, the known optimisation techniques that use repeated multiplications (typically in combination with shift operations) cannot be used due to the rounding errors introduced.

A known technique which avoids the need for such repeated multiplications, and hence can be used when bit exact results are required, uses repeated (A+B, A-B) butterflies to reduce the number of multiply operations required. When considering the example of a 32x32 transform, then without any optimisation this would require 32x32 multiplications for each one dimensional transform, i.e. 1024 multiplications.

Through the use of such known butterfly techniques, the number of multiplications for that specific scenario can be reduced to 342.
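As a small-scale illustration of the butterfly idea, the sketch below applies it to a 4-point forward transform rather than the 32-point case discussed above. The coefficient matrix is the 4-point core transform matrix of the HEVC standard; it is not given in this text and is brought in here only as an example. The function names are hypothetical, and the scaling shift that would follow in a real codec is omitted.

```python
# 4-point forward core transform matrix of HEVC (integer coefficients).
DCT4 = [
    [64,  64,  64,  64],
    [83,  36, -36, -83],
    [64, -64, -64,  64],
    [36, -83,  83, -36],
]

def forward_direct(x):
    """Plain matrix-vector product: 4 x 4 = 16 multiplications."""
    return [sum(c * v for c, v in zip(row, x)) for row in DCT4]

def forward_butterfly(x):
    """Same result using (A+B, A-B) butterflies: 6 multiplications."""
    e0, e1 = x[0] + x[3], x[1] + x[2]   # sums feed the even outputs
    o0, o1 = x[0] - x[3], x[1] - x[2]   # differences feed the odd outputs
    y0 = 64 * (e0 + e1)                 # second butterfly on the even part
    y2 = 64 * (e0 - e1)
    y1 = 83 * o0 + 36 * o1
    y3 = 36 * o0 - 83 * o1
    return [y0, y1, y2, y3]

x = [1, 2, 3, 4]
print(forward_direct(x))     # [640, -285, 0, -25]
print(forward_butterfly(x))  # [640, -285, 0, -25]
```

No product is fed into a further multiplication, so the integer results match the direct product exactly.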

Nevertheless, this is still a significant number of multiplications to perform, and this number of multiplications needs to be repeated for every one dimensional transformation. For example, video encoding and decoding typically uses two dimensional DCTs, and hence by way of example using the HEVC standard, each block of video data to be processed may consist of an array of 32x32 data values.

Typically the two dimensional discrete cosine transform is implemented by performing a series of one dimensional transforms applied to each row and each column of the array, and hence in the above example would involve the performance of 32 one dimensional transforms to cover each row of the array, followed by 32 one dimensional transforms to cover each of the columns. Hence, 64 one dimensional transforms will be required for each block of video data, and each one dimensional transform would require 342 multiplication operations in accordance with the specific butterfly technique discussed earlier.
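The per-block cost implied by these figures can be worked out directly. The short sketch below only multiplies out the numbers quoted above; the function name is hypothetical.

```python
def multiplies_per_block(n, per_1d):
    """Multiplications for one n x n block using row-column decomposition."""
    one_d_transforms = 2 * n          # n row transforms, then n column transforms
    return one_d_transforms * per_1d

print(multiplies_per_block(32, 1024))  # 65536, no optimisation
print(multiplies_per_block(32, 342))   # 21888, butterfly figure quoted above
```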

There is a continual desire to provide higher performance and lower area cost video encoders and decoders, and accordingly it would be desirable to reduce the number of multiplications required during performance of forward and inverse transform operations on video data. This desire is becoming more and more acute as the size of the transformations to be supported increases in accordance with the newer video standards such as the HEVC standard.

SUMMARY OF THE INVENTION

Viewed from a first aspect, the present invention provides a data processing apparatus for performing a transform between spatial and frequency domains when processing video data, the data processing apparatus comprising: transform circuitry configured to receive N input values and to perform a sequence of operations to generate N output values representing the transform of said N input values between the spatial and frequency domains; a base circuitry configured to receive M internal input values generated by the transform circuitry, where M is greater than or equal to 4, and to perform a base operation equivalent to matrix multiplication of said M internal input values by a matrix comprising an array of coefficients c and having the form

$$
\begin{pmatrix}
c_0 & c_1 & c_2 & \cdots & c_{M-1} \\
c_1 & c_2 & c_3 & \cdots & c_M \\
c_2 & c_3 & c_4 & \cdots & c_{M+1} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
c_{M-1} & c_M & c_{M+1} & \cdots & c_{2M-2}
\end{pmatrix}
$$

in order to generate M internal output values for returning to the transform circuitry; and the transform circuitry being arranged during performance of said sequence of operations to generate from the N input values multiple sets of said M internal input values, to provide each set of M internal input values to the base circuitry in order to cause multiple sets of said M internal output values to be produced, and to derive the N output values from said multiple sets of M internal output values.

In accordance with the present invention, the data processing apparatus is configured to make repeated use of a base circuitry that is configured to perform a base operation equivalent to matrix multiplication of M internal input values by a matrix comprising an array of coefficients c and having the form

$$
\begin{pmatrix}
c_0 & c_1 & c_2 & \cdots & c_{M-1} \\
c_1 & c_2 & c_3 & \cdots & c_M \\
c_2 & c_3 & c_4 & \cdots & c_{M+1} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
c_{M-1} & c_M & c_{M+1} & \cdots & c_{2M-2}
\end{pmatrix}
$$

M is greater than or equal to 4, and in one example M is equal to 4. Transform circuitry manipulates the originally provided N input values in order to generate multiple sets of M internal input values, with each set of M internal input values being passed through the base circuitry. Hence, the base circuitry is used iteratively for each of the sets of M internal input values produced by the transform circuitry. The transform circuitry then derives the N output values from the multiple sets of M internal output values produced by the base circuitry.

Often the above described technique of the present invention will be employed repetitively on a series of one dimensional transforms in order to implement a two dimensional transform. For each one dimensional transform, a set of N input values will be input to the transform circuitry of the data processing apparatus. In accordance with the present invention, each set of M internal input values is only passed once through the base circuitry for a particular provided set of N input values, and none of the internal output values generated by the base circuitry are used as the input to a subsequent iteration of the multiplication performed by the base circuitry. Hence there are no repeated multiplications performed when using the apparatus of the present invention, and accordingly this enables a bit exact result to be generated by the apparatus, as required by modern standards such as the HEVC video standard.

The particular form of matrix employed within the base circuitry is a square matrix with constant skew diagonals (i.e. positive sloping diagonals), and is also known as a Hankel matrix. The inventors of the present invention have realised that for even the larger transforms required by modern video processing standards, the required transform between spatial and frequency domains can be factorised in a manner that enables repeated matrix multiplications of a smaller size using the Hankel matrix.
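To make the matrix form concrete, the following minimal sketch builds an M x M Hankel matrix from its 2M-1 coefficients and computes the product that the base operation must be equivalent to. The function names and the coefficient values are illustrative assumptions; the straightforward product shown here uses M x M multiplications and says nothing about how the base circuitry is actually built.

```python
def hankel(coeffs, m):
    """Build the m x m matrix whose (i, j) entry is c[i + j]."""
    if len(coeffs) != 2 * m - 1:
        raise ValueError("a Hankel matrix of size m needs 2m - 1 coefficients")
    return [[coeffs[i + j] for j in range(m)] for i in range(m)]

def hankel_times_vector(coeffs, x):
    """Reference result: y[i] = sum over j of c[i + j] * x[j]."""
    m = len(x)
    return [sum(coeffs[i + j] * x[j] for j in range(m)) for i in range(m)]

c = [1, 2, 3, 4, 5, 6, 7]   # arbitrary values; 2M - 1 = 7 coefficients for M = 4
x = [1, 0, -1, 2]           # one set of M internal input values

for row in hankel(c, 4):
    print(row)              # each skew diagonal holds a single coefficient
print(hankel_times_vector(c, x))  # [6, 8, 10, 12]
```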

The base operation performed by the base circuitry is able to handle any particular instance of the Hankel matrix of the above mentioned form, and hence is able to perform an operation equivalent to matrix multiplication of the M internal input values by the Hankel matrix irrespective of the values allocated to the coefficients $c_0$ to $c_{2M-2}$. Hence, by way of example, the base circuitry is able to perform the required operation even if all of the coefficients $c_0$ to $c_{2M-2}$ have different values, and/or if the values of the coefficients vary for each set of M internal input values provided to the base circuitry.
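This passage does not say how the base circuitry computes the product. Purely as an illustration, the sketch below uses a well-known divide-and-conquer identity for Hankel matrices, which forms a 2x2 block product with three half-size products instead of four; it is an assumption that anything like this is used in the apparatus. The function name is hypothetical, and the coefficient differences are treated as precomputed constants, so only products of a coefficient with a data value are counted.

```python
def hankel_mul(c, x):
    """Multiply the Hankel matrix given by c (length 2m-1) by x (length m).

    m must be a power of two. Returns (result, number_of_multiplications).
    """
    m = len(x)
    if m == 1:
        return [c[0] * x[0]], 1
    h = m // 2
    # Block view: H = [[A, B], [B, D]], each block a Hankel matrix of size h.
    a, b, d = c[0:2 * h - 1], c[h:3 * h - 1], c[2 * h:4 * h - 1]
    x0, x1 = x[:h], x[h:]
    p1, n1 = hankel_mul(b, [u + v for u, v in zip(x0, x1)])      # B (x0 + x1)
    p2, n2 = hankel_mul([u - v for u, v in zip(a, b)], x0)       # (A - B) x0
    p3, n3 = hankel_mul([u - v for u, v in zip(d, b)], x1)       # (D - B) x1
    y0 = [u + v for u, v in zip(p1, p2)]                         # A x0 + B x1
    y1 = [u + v for u, v in zip(p1, p3)]                         # B x0 + D x1
    return y0 + y1, n1 + n2 + n3

result, count = hankel_mul([1, 2, 3, 4, 5, 6, 7], [1, 0, -1, 2])
print(result)  # [6, 8, 10, 12]
print(count)   # 9
```

The outputs are formed by adding products only, so no product becomes an input to another multiplication and the integer result is exact for any coefficient values.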

Further, the technique of the present invention may be used for both encoding and decoding and is readily scalable for varying sizes of N.

It has been found that the above arrangement enables a significant reduction in the number of multiplications required in order to perform a transform of the N input values between the spatial and frequency domains. For example, considering the earlier mentioned 32-point transform required by the new HEVC video standard, it has been found that in accordance with the technique of the present invention approximately a third of the number of multiplications are required when compared with the partial butterfly approach used by the HEVC reference software. The HEVC reference software (also known as HM-8.0 at http://r2d2n3po.tistory.com/61) is a C code implementation of the Standard used as a reference implementation to the paper Standard.

The transform performed between the spatial and frequency domains can take a variety of forms, but in one embodiment is a discrete cosine transform.

There are a number of ways in which the transform circuitry can be arranged to generate each set of M internal input values provided to the base circuitry. However, in one embodiment the transform circuitry comprises permutation circuitry configured to permute the received N input values in order to produce K groups of input values, where K=N/M and hence each group has M members, each member being one of said received N input values. Adder circuitry is then configured to perform at least one of addition and subtraction operations on corresponding members from selected groups in order to generate each set of said M internal input values.

The actual permutation performed by the permutation circuitry will depend upon whether the apparatus is being used to perform a forward transform from the spatial to the frequency domain or an inverse transform from the frequency to the spatial domain. Similarly, the addition and subtraction operations performed by the adder circuitry will differ depending on whether a forward transform or an inverse transform is being performed.

The manner in which the transform circuitry is configured to derive the N output values from the multiple sets of M internal output values may vary dependent on embodiment. However, in one embodiment the transform circuitry further comprises further adder circuitry configured to perform at least one of addition and subtraction operations on the multiple sets of said M internal output values produced by the base circuitry in order to produce N intermediate output values. The additions/subtractions performed by the adder circuitry and the further adder circuitry will depend upon whether the apparatus is being used to perform a forward transform or an inverse transform.

In one embodiment, the transform circuitry further comprises shift circuitry configured to perform a shift operation on the N intermediate output values in order to generate shifted intermediate output values. It should be noted that the shift operation is only performed once, after the N intermediate output values have been generated following the iterative operation of the base circuitry on the various sets of M internal input values. This serves to ensure the exact nature of the results. The once shifted values are output and not recirculated (except potentially to form an input value for another related one dimensional transform).

In one embodiment, the shift operation includes a saturate operation. Again, as with the shift operation, the saturate operation is only performed once.
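A minimal sketch of a single shift-and-saturate step is given below, together with a two-term example of why the shift is applied once to the final sum rather than to partial results. The rounding offset, the 16-bit output range and the function name are assumptions made for illustration; the text above only states that a shift and a saturate are each performed once.

```python
def shift_and_saturate(value, shift, bits=16):
    """Round, shift right once, then clamp to a signed `bits`-bit range."""
    rounded = (value + (1 << (shift - 1))) >> shift   # rounding offset is an assumption
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, rounded))

# Shifting once after summation differs from shifting each term first.
terms = [3, 3]
once = sum(terms) >> 1               # (3 + 3) >> 1
early = sum(t >> 1 for t in terms)   # (3 >> 1) + (3 >> 1)
print(once, early)                   # 3 2

print(shift_and_saturate(100, 3))      # 13
print(shift_and_saturate(1 << 20, 2))  # 32767
```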

In one embodiment, the transform circuitry further comprises further permutation circuitry configured to permute the N shifted intermediate output values in order to generate said N output values. The permutation performed by the further permutation circuitry will be dependent on the permutation performed by the permutation circuitry on the N input values.

In one embodiment, the apparatus further comprises coefficient generation circuitry configured to generate, for each set of M internal input values, a corresponding set of coefficient values to be used by the base circuitry when performing the base operation. Hence, the set of coefficient values can be set for each iteration of the base circuitry.

As mentioned earlier, the apparatus can be used to perform either a forward transform from the spatial to the frequency domain or an inverse transform from the frequency to the spatial domain. In one embodiment the apparatus is configurable so that it can be switched between performing either a forward transform or an inverse transform. In one particular embodiment, the corresponding set of coefficient values generated by the coefficient generation circuitry for each set of M internal input values are the same irrespective of whether the data processing apparatus is configured to perform the forward transform or is configured to perform the inverse transform.

Hence, whilst the operations of the permutation circuitry, adder circuitry, further adder circuitry and further permutation circuitry will be modified dependent on whether the apparatus is performing a forward transform or an inverse transform, the basic operation of the base circuitry is unchanged, and exactly the same coefficients are generated by the coefficient generation circuitry assuming the apparatus is still operating in accordance with the same video standard.

Whilst the apparatus of embodiments performs multiple iterations of the earlier described base operation, it will typically still be necessary to perform a small transform, in particular an MxM transform. Hence, in one embodiment, the transform circuitry is further configured to generate a further set of M internal input values for provision to the base circuitry, and the base circuitry is configured to perform a discrete cosine transform on said further set of M internal input values by performing a discrete cosine transform operation equivalent to matrix multiplication of said further set of M internal input values by a discrete cosine transform matrix.

In one particular embodiment, the data processing apparatus is configured to perform a forward discrete cosine transform during encoding of the video data, and the base circuitry is configured to perform as the discrete cosine transform operation a forward discrete transform operation following performance of the base operation on said multiple sets of M internal input values.

In contrast, if the data processing apparatus is configured to perform an inverse discrete cosine transform during decoding of the video data, the base circuitry is configured to perform as the discrete cosine transform operation an inverse discrete transform operation prior to performance of the base operation on said multiple sets of M internal input values.

The value of N may vary dependent on embodiment. In one embodiment, N is a multiple of M. In one particular embodiment, N is constrained to be a power of two.

As mentioned earlier, M may be greater than or equal to 4, and in one embodiment M is set equal to 4. Hence, in that embodiment, all of the multiplications performed are in respect of a 4x4 matrix, irrespective of the size of N.

The adder circuitry can be configured in a variety of ways, but in one embodiment the adder circuitry is configured as SIMD circuitry providing M lanes of parallel processing for performing said at least one of addition and subtraction operations in parallel in order to generate each set of said M internal input values.

Similarly, in one embodiment the further adder circuitry may be configured as SIMD circuitry providing M lanes of parallel processing for performing said at least one of addition and subtraction operations in parallel on each set of said M internal output values produced by the base circuitry.

As mentioned earlier, the number of multiplications required to transform the N input values between spatial and frequency domains is significantly reduced when using the techniques of the above described embodiments. In one particular embodiment, the data processing apparatus is configured to operate on video data blocks comprising an N x N array of data values by separately performing, on each row and each column of N data values, said transform between the spatial and frequency domains, and the total number of multiplications performed by said base circuitry for each said row or each said column is $3^{n-1} + 3^{n-2} + \dots + 9 + Z$, where $Z \le 9$, and where $N = 2^n$. The value of Z depends on the number of multiplications required to perform the single MxM discrete cosine transform, and in one specific implementation configured to operate on a 32x32 array (i.e. N=32) and where M=4, it has been found that six multiplies are required for the single 4x4 discrete cosine transform (i.e. Z=6). From the above equation, this results in 123 multiplies being required, this being approximately a third of the number of multiplies that would be required by the earlier mentioned partial butterfly approach.
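The count can be checked numerically. The sketch below evaluates the sum of powers of three from 9 up to 3^(n-1) and adds Z, using the values quoted above (n = 5, Z = 6) and the figure of 342 for the partial butterfly approach; the function name is hypothetical.

```python
def base_multiplies(n, z=6):
    """3^(n-1) + 3^(n-2) + ... + 9 + Z for N = 2^n, with n >= 3."""
    return sum(3 ** k for k in range(2, n)) + z

total = base_multiplies(5)       # 81 + 27 + 9 + 6 for N = 32
print(total)                     # 123
print(round(total / 342, 2))     # 0.36, i.e. roughly a third of 342
```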

It has been found that the apparatus of the above described embodiments offers significant flexibility. Not only can the same apparatus be configured to perform both forward transforms and inverse transforms, but in addition the apparatus may be configurable to support different video standards. In particular, in one embodiment the apparatus is configurable to support different video standards by causing the coefficient generation circuitry to set the corresponding set of the coefficients supplied to the base circuitry for each set of M internal input values dependent on a currently selected video standard.

Viewed from a second aspect, the present invention provides a method of performing a transform between spatial and frequency domains when processing video data, the method comprising: employing transform circuitry to receive N input values and to perform a sequence of operations to generate N output values representing the transform of said N input values between the spatial and frequency domains; employing a base circuitry to receive M internal input values generated by the transform circuitry, where M is greater than or equal to 4, and to perform a base operation equivalent to matrix multiplication of said M internal input values by a matrix comprising an array of coefficients c and having the form

$$
\begin{pmatrix}
c_0 & c_1 & c_2 & \cdots & c_{M-1} \\
c_1 & c_2 & c_3 & \cdots & c_M \\
c_2 & c_3 & c_4 & \cdots & c_{M+1} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
c_{M-1} & c_M & c_{M+1} & \cdots & c_{2M-2}
\end{pmatrix}
$$

in order to generate M internal output values for returning to the transform circuitry; and performance of said sequence of operations by the transform circuitry comprising: generating from the N input values multiple sets of said M internal input values; providing each set of M internal input values to the base circuitry in order to cause multiple sets of said M internal output values to be produced; and deriving the N output values from said multiple sets of M internal output values.

Viewed from a third aspect, the present invention provides a data processing apparatus for performing a transform between spatial and frequency domains when processing video data, the data processing apparatus comprising: transform means for receiving N input values and for performing a sequence of operations to generate N output values representing the transform of said N input values between the spatial and frequency domains; base circuitry means for receiving M internal input values generated by the transform means, where M is greater than or equal to 4, and for performing a base operation equivalent to matrix multiplication of said M internal input values by a matrix comprising an array of coefficients c and having the form

$$
\begin{pmatrix}
c_0 & c_1 & c_2 & \cdots & c_{M-1} \\
c_1 & c_2 & c_3 & \cdots & c_M \\
c_2 & c_3 & c_4 & \cdots & c_{M+1} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
c_{M-1} & c_M & c_{M+1} & \cdots & c_{2M-2}
\end{pmatrix}
$$

in order to generate M internal output values for returning to the transform means; and the transform means, during performance of said sequence of operations, for generating from the N input values multiple sets of said M internal input values, for providing each set of M internal input values to the base circuitry means in order to cause multiple sets of said M internal output values to be produced, and for deriving the N output values from said multiple sets of M internal output values.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention will be described further, by way of example only, with reference to embodiments thereof as illustrated in the accompanying drawings, in which:

Figure 1 is a diagram schematically illustrating how a frame of video data is processed in a block by block manner in accordance with a known technique;

Figure 2 illustrates a one dimensional DCT transform operation in accordance with a known approach;

Figure 3A illustrates how the NxN inverse DCT transform matrix may be factorised in accordance with one embodiment, in order to implement the NxN matrix by a series of smaller linear correlation matrices and a single smaller inverse DCT matrix with the inputs and outputs to those matrices being subjected to various addition and subtraction operations;

Figure 3B illustrates how the NxN forward DCT transform matrix may be factorised in accordance with one embodiment, in order to implement the NxN matrix by a series of smaller linear correlation matrices and a single forward DCT matrix with the inputs and outputs to those matrices being subjected to various addition and subtraction operations;

Figure 4 is a block diagram schematically illustrating components provided within a data processing apparatus in accordance with one embodiment in order to perform a transform between spatial and frequency domains for N received input values;

Figure 5 schematically illustrates components provided within the adder circuitry and further adder circuitry of Figure 4 in accordance with one embodiment; and

Figures 6A and 6B provide a flow diagram illustrating the steps performed by the circuitry of Figure 4 in order to process one set of N input values in accordance with one embodiment.

DESCRIPTION OF EMBODIMENTS

Figure 1 illustrates a frame 10 of video data, the frame being considered as an array of blocks 15, each block comprising an NxN array of data values 20. Each data value will typically comprise multiple bits, for example 16 bits of data. When performing an encoding operation on input video data, each such block 15 will be subjected to a two dimensional transform operation to convert the data from the spatial to the frequency domain. Typically a forward discrete cosine transform (FDCT) operation will be performed in order to perform such encoding. Similarly, when decoding an encoded frame of video data, each block will be subjected to a two dimensional inverse discrete cosine transform (IDCT) operation in order to convert the received encoded signal from the frequency domain to the spatial domain.

In practice, the two dimensional DCT operation is performed by a series of one dimensional DCT operations. For example, it is typically the case that a one dimensional DCT operation will be performed on each of the rows to produce some intermediate results, and this will then be followed by a corresponding series of one dimensional DCT operations performed on each column of those intermediate results.

Accordingly, for an NxN block, 2N one dimensional DCT operations will need to be performed in order to implement the required two dimensional DCT operation.
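The row-then-column structure can be sketched generically, leaving the one dimensional transform as a parameter. In the example a 2-point sum/difference stands in for a real one dimensional DCT so that the result is easy to check by hand; the function names are hypothetical, and any scaling between the two passes is omitted.

```python
def transform_2d(block, transform_1d):
    """Apply a 1D transform to every row, then to every column of the result."""
    after_rows = [transform_1d(list(row)) for row in block]
    after_cols = [transform_1d(list(col)) for col in zip(*after_rows)]
    # after_cols holds the transformed columns; transpose back to rows.
    return [list(row) for row in zip(*after_cols)]

calls = []  # records each 1D transform performed

def sum_difference(v):
    """Stand-in 2-point transform (not a DCT), used only for illustration."""
    calls.append(1)
    return [v[0] + v[1], v[0] - v[1]]

out = transform_2d([[1, 2], [3, 4]], sum_difference)
print(out)         # [[10, -2], [-4, 0]]
print(len(calls))  # 4, i.e. 2N one dimensional transforms for N = 2
```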

Figure 2 schematically illustrates a standard one dimensional DCT operation performed on a row or a column of input values $x_0$ to $x_{N-1}$ 50. These N input values 50 are multiplied by an NxN matrix 60 comprising an NxN matrix of coefficient values.

As illustrated schematically in Figure 2, the coefficients in each row are multiplied by the corresponding input values, with the results then being added to produce an associated output value. Accordingly, N output values $X_0$ to $X_{N-1}$ 70 will be generated.

From Figure 2, it will be appreciated that a large number of multiplications need to be performed for each one dimensional transform operation. For example, considering the situation where N is 32, then each one dimensional transform operation will require 32x32 multiplications, i.e. 1024 multiplications. As mentioned earlier, known butterfly techniques can be used to reduce the number of multiplications. In particular, considering again the example where N is 32, this would enable the number of multiplications to be reduced to 342. However, it would be desirable to further reduce the number of multiplications required to perform each one dimensional transform operation whilst still enabling a bit exact result to be achieved.

Figure 3A schematically illustrates the factorisation approach adopted in accordance with the described embodiments for an inverse transform operation in order to enable the NxN IDCT matrix to be degenerated into a series of smaller matrices. In particular, in accordance with the described embodiments, the NxN IDCT matrix 100 is effectively replaced by a matrix 110, the matrix 110 being larger (i.e. having a longer diagonal) than the matrix 100, but containing fewer non-zero elements, and indeed typically a large number of the coefficient values in the matrix are zero. Along the diagonal path through the matrix 110, a number of smaller MxM matrices are defined. In this particular example it is assumed that M is equal to 4, and as shown a series of L4 matrices 120 are provided, each L4 matrix being a Hankel matrix of the earlier described form, i.e. a linear correlation matrix with constant skew diagonals. The actual values of the coefficients in one instance of the L4 matrix 120 will typically be different to the values in another instance of the L4 matrix.

As also shown, an initial IDCT (I4) matrix 130 is provided, this being an IDCT matrix of size 4x4. A base circuit provided within the apparatus of the described embodiments can be used to iteratively perform matrix multiplications using each of these defined 4x4 matrix instances 130, 120, starting with the matrix in the top left of the matrix 110. However, the M internal input values provided to the base circuit need to be separately derived for each iteration, based on the supplied N input values. As will be discussed later with reference to Figure 4, this is achieved by using permute circuitry to permute the received N input values in order to produce K groups of input values, where K=N/M and hence each group has M members, each member being one of the received N input values. Further, adder circuitry is then used to perform a series of addition and subtraction operations on corresponding members from selected groups in order to generate each set of M internal input values. The required addition and subtraction operations that need to be performed are defined within the matrix 105 of 0s and +/-1 values shown in Figure 3A. Similarly, the internal output values generated by the base circuit need manipulation before they can be used to generate the N output values corresponding to the IDCT of the N input values. In particular, as will be discussed later with reference to Figure 4, further adder circuitry is used to perform a series of addition and subtraction operations on the multiple sets of M internal output values produced by the base circuitry, with the matrix 115 of 0s and +/-1 values identifying the required addition and subtraction operations.

Figure 3B illustrates how the same basic factorisation approach can be used to perform a forward DCT operation. In particular, the NxN FDCT matrix 140 is effectively decomposed into the matrix 150 comprising a similar arrangement of smaller MxM (in this case 4x4) matrices. In particular, a series of L4 matrices 120 are again provided, but in this instance an FDCT 4x4 (F4) matrix 160 is provided instead of the IDCT 4x4 (T4) matrix 130 of Figure 3A. In addition, the base circuitry performs the F4 matrix multiplication 160 as a final iteration, whereas in the example of Figure 3A the T4 matrix multiplication is performed as an initial iteration by the base circuitry. Again, a matrix of 0s and +/-1 values 145 is used to define the addition and subtraction operations to be performed by the adder circuitry when generating the internal input values to provide to the base circuitry for each iteration, and similarly a matrix 155 of 0s and +/-1s is used to identify the addition and subtraction operations required by the further adder circuitry used to process the internal output values generated by the base circuitry.

Considering the relative sizes of the various matrices shown in Figures 3A and 3B, then assuming the original matrices 100, 140 are NxN, and if $N = 2^n = 4 \times 2^{n-2}$, then the width of the matrix 110 or 150 will be $4 \times (3^{n-3} + 3^{n-4} + \dots + 3 + 1 + 1)$.

Considering the specific example where N = 32, n = 5 and hence the width of the matrix 110 or 150 will be 4x(9+3+1+1), i.e. 56. Hence, each of the matrices 110 and 150 will be 56x56 matrices. In that instance, the matrices 105 and 145 will be 32(across)x56(down), and the matrices 115 and 155 will be 56(across)x32(down).
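The size relationship can be checked with a few lines of C. This is only a sketch under the assumption that M = 4 and N = 2^n with n of at least 3; the function name factored_width is hypothetical and does not appear in the described embodiments.

```c
#include <assert.h>

/* Hypothetical helper (not from the source): width of the block-diagonal
 * matrix 110 or 150 for N = 2^n and M = 4, i.e.
 * 4 * (3^(n-3) + 3^(n-4) + ... + 3 + 1 + 1). */
unsigned int factored_width(unsigned int n)
{
    unsigned int blocks = 1;  /* the single T4 (or F4) block */
    unsigned int pow3 = 1;    /* 3^0 */
    unsigned int e;

    assert(n >= 3);           /* N >= 8 */
    for (e = 0; e + 3 <= n; e++) {  /* e = 0 .. n-3 */
        blocks += pow3;       /* L4 blocks contributed by this power of 3 */
        pow3 *= 3;
    }
    return 4 * blocks;        /* each block is 4 wide */
}
```

For n = 5 the loop accumulates 1 + 3 + 9 L4 blocks plus the one T4 block, i.e. 14 blocks, matching the 56x56 figure quoted for N = 32.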

Figure 4 is a block diagram illustrating components provided within the data processing apparatus in accordance with one embodiment. Permute circuitry 200 is arranged to receive each set of N input values, and is configured to perform a permutation on those N input values in order to produce K groups of input values stored within internal storage 205. K is equal to N/M and hence each group has M members, where each member is one of the received N input values. Hence, by way of example, if N is 32 and M is 4, there will be eight groups provided within the storage 205.

The adder circuitry 210 is then used to generate each set of M internal input values to be provided to the base circuitry 215 (also referred to herein as the LM circuit). The adder circuitry is configured to operate on corresponding members from selected groups, and hence in one iteration may operate on member 0 from a selected number of the groups, and in another iteration may operate on member 1 from a number of the groups. As discussed earlier with reference to Figures 3A and 3B, a matrix 105, 145 is referenced by the adder circuitry in order to determine the required addition and subtraction operations for any particular iteration.

Considering the example of Figure 3A where an inverse transform is performed, it will be appreciated that during a first iteration the adder circuitry generates M internal input values to be subjected to a matrix multiplication using the TM matrix forming an MxM IDCT matrix. For the particular example of Figure 3A, it is assumed that M is 4, and accordingly the M internal input values generated by the adder circuitry 210 during a first iteration will be subject to a matrix multiplication by the T4 matrix 130 using the base circuitry 215. The coefficient generation circuitry 220 generates the values of the coefficients contained within the T4 matrix 130. The base circuitry 215 will then perform the required matrix multiplication operation in order to generate M internal output values which are routed to the further adder circuitry 225. The further adder circuitry then performs the addition and subtraction operations defined by the matrix 115 in order to generate intermediate output values stored within the storage 230. These intermediate output values can be considered to form K groups of intermediate output values, where again each group has M members, each member being one of the intermediate output values. The storage 230 is populated such that by the time all iterations have been performed, the storage 230 is populated with all of the N intermediate output values.

Returning to the example of Figure 3A, following the first iteration where the small 4x4 IDCT transformation is performed, a series of matrix multiplications will then be performed over multiple iterations to multiply generated sets of M internal input values by the various linear correlation matrices 120, in Figure 3A these linear correlation matrices each taking the form of a 4x4 Hankel matrix. For each iteration, the adder circuitry 210 will reference the matrix 105 in order to determine the appropriate addition and subtraction operations to be performed when generating each set of internal input values, and the coefficient generation circuitry 220 will generate the appropriate coefficient values for each iteration. Similarly the further adder circuitry 225 will reference the matrix 115 in order to determine the appropriate addition and subtraction operations to be performed on the internal output values generated by the base circuitry during each iteration.

Once all of the required iterations have been performed, and the storage 230 has been populated with the N intermediate output values, those intermediate output values are passed through the shift and saturate circuit 235, where a shift and saturate operation is performed in order to generate shifted and saturated intermediate output values. It should be noted that the shift and saturate operation is only performed once, after all of the N intermediate output values have been generated, and this serves to ensure the bit exact nature of the results, in particular avoiding rounding errors that would be introduced by iteratively performing shifting and saturating operations at multiple stages during the process.
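The single shift and saturate step can be illustrated in C. This is a minimal sketch, not the circuit 235 itself: the function name shift_saturate is hypothetical, the rounding offset is an assumed parameter, the output is assumed to be 16 bits wide, and the right shift of a negative value is assumed to be arithmetic (which the C language leaves implementation-defined).

```c
#include <stdint.h>

/* Hypothetical helper: round, shift and clamp one 32-bit intermediate
 * output value to the 16-bit output range. Applied once per value, after
 * all additions and multiplications have completed.
 * Assumes >> on a negative int32_t is an arithmetic shift. */
int16_t shift_saturate(int32_t v, unsigned int shift, int32_t round)
{
    int32_t r = (v + round) >> shift;   /* rounding offset, then shift */

    if (r > INT16_MAX)
        r = INT16_MAX;                  /* saturate high */
    if (r < INT16_MIN)
        r = INT16_MIN;                  /* saturate low */
    return (int16_t)r;
}
```

Because the intermediate values are kept at full 32-bit precision until this point, no rounding error accumulates across iterations.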

The shifted and saturated intermediate output values are then passed to the further permute circuitry 240 which is configured to permute the N shifted and saturated intermediate output values in order to generate the required N output values.

The permutation performed by the further permute circuitry 240 is dependent on the permutation performed by the permute circuitry 200 on the input values.

The circuitry of Figure 4 can be configured to perform either a forward transform during encoding of video data to transform that data from the spatial to the frequency domain, or an inverse transform during decoding of video data to transform that data from the frequency domain to the spatial domain. The permutations performed by the permute circuitry 200 and the further permute circuitry 240 will depend upon whether the apparatus is being used to perform a forward transform or an inverse transform.

Similarly the addition and subtraction operations performed by the adder circuitry 210 and the further adder circuitry 225 will differ depending on whether a forward transform or an inverse transform is being performed.

The coefficient generation circuitry 220 will need to generate coefficients for the T4 matrix 130 when performing the inverse transform operation or for the F4 matrix 160 when performing the forward transformation. However, the coefficients generated for each instance of the L4 matrices 120 are in some cases (e.g. for large HEVC matrices, where the forward matrix is the transpose of the inverse matrix) unchanged when reconfiguring the apparatus between performance of a forward transform and an inverse transform. Accordingly, when performing the multiple iterations of the L4 matrix multiplication, the operation of the base circuitry is unchanged in those cases, and exactly the same coefficients are generated by the coefficient generation circuitry 220.

The apparatus can also be used to implement various video standards.

However, the coefficients generated by the coefficient generation circuitry 220 will vary between the different video standards. In contrast, the basic operation of the permute circuitry 200, adder circuitry 210, further adder circuitry 225 and further permute circuitry 240 is typically unaffected by changing the video standard.

Figure 5 illustrates components provided within the adder circuitry 210 and further adder circuitry 225 in accordance with one embodiment. In this example, it is assumed that M equals 4, and hence once the K groups of M input values have been stored within the storage 205 of Figure 4, it will be appreciated that each group has four members. In this example the storage 205 is considered to form part of the adder circuitry 210 and includes four register banks VB0 300, VB1 305, VB2 310 and VB3 315, each register bank having sufficient registers to store corresponding members from each of the groups. Hence, considering the example where N is 32, there will be eight groups, each with four members, and each of the register banks 300, 305, 310, 315 will provide eight registers in order to enable the corresponding members from each of the eight groups to be stored therein.

The permuted input values produced by the permute circuitry 200 are typically buffered within a RAM and then loaded serially via the in0 and in1 inputs (two data values per cycle) in the permuted order into the relevant register banks 300, 305, 310, 315.

Corresponding two-input adders 320, 325, 330, 335 are provided in association with each of the register banks 300, 305, 310, 315 and, in the embodiment shown, during each clock cycle two of those adders may be used to generate internal input values to provide to the L4 circuit 340 whilst the other two adders are optionally used to generate intermediate values fed back and stored within the associated register bank.

For each matrix multiplication performed by the L4 circuit 340, four internal input values are required, and accordingly it takes two clock cycles to generate the inputs required for each matrix multiplication performed by the L4 circuit 340.

In one embodiment, the various adder circuits 320, 325, 330, 335 perform butterfly operations and Karatsuba recursion in order to break down the 32 point transforms to 4 point transforms processed by the L4 circuit 340.

The operation performed by the adder circuitry 210 is in this embodiment performed within a first pipeline stage P0, with the operation of the L4 circuit 340 then occupying four pipeline stages P1 to P4. The further adder circuitry 225 then occupies a sixth pipeline stage P5. As shown, the final stage of the L4 circuit 340 produces two internal output values per cycle which are temporarily stored within the registers 345, 350. Each value stored in the register 345 will then be provided to the adder circuit 375 or the adder circuit 380 as appropriate, and similarly each value stored in the register 350 will be provided to the adder circuit 385 or the adder circuit 390 as appropriate. Initially, the adder circuits 375, 380, 385, 390 will route those values back into the associated register banks WB0 355, WB1 360, WB2 365 and WB3 370. The adder circuits 375, 380, 385, 390 will then use butterfly operations and Karatsuba recursion to reconstruct the 32 point transform output from the 4 point L4 transform outputs and in the latter stages of the process this will result in the two register banks 392 and 394 being populated with N intermediate output values.

Once all the multiplications have been performed by the L4 circuit, the contents of the registers 392, 394 will represent the N intermediate output values, and these will then be routed through the shift and saturate circuits 396, 398 to generate shifted and saturated intermediate output values which can then be routed to the further permute circuit 240 to permute the values back into the final order required for the N output values.

The circuitry of Figure 5 can be used to support numbers of input values that are multiples of four, in one particular embodiment the number of input values being limited to be a power of 2, such that the N input values may be 4 input values, 8 input values, 16 input values, 32 input values, etc. In situations where N is actually set equal to 4, then as shown in Figure 5 bypass paths may be provided around the adder circuits 320, 325, 330, 335 and the further adder circuits 375, 380, 385, 390. This optimisation hence allows a low latency path through the circuitry when N is set equal to M. However, the shift and saturate stage of the pipeline path P6 will still be used for all transform sizes.

The multiple adder circuits 320, 325, 330, 335 can be arranged in a SIMD manner so that they operate in parallel to perform four sets of additions/subtractions.

However, in one embodiment, the operations of adder circuits D0 and D2 320, 330 are staggered with respect to the operations of the adder circuits 325, 335, such that in a first cycle, only adder circuits 320 and 330 are used, and generate two internal input values for provision to the L4 circuit, whilst in the next cycle adders 325 and 335 are used to generate two further internal input values for provision to the L4 circuit.

During that second cycle, the adders 320 and 330 can again be used, but this time will generate intermediate values for routing back to their respective register banks 300, 310. In the next cycle, all four adders can again be used, with the adders 320, 330 providing the internal input values to the L4 circuit, and the adders 325, 335 generating intermediate values for routing back to their respective register banks 305, 315. This provides an efficient mechanism for providing two internal input values per cycle to the L4 circuit 340, whilst also enabling intermediate additions and subtractions to be performed in parallel. The output adder circuits 375, 380, 385, 390 can be arranged in a similar manner to perform SIMD addition and subtraction operations.

In one embodiment, each provided input value is 16 bits in size, and the register banks 300, 305, 310, 315 have 18 bit inputs and outputs to accommodate the increased size of the operands that may be generated by virtue of the additions performed by the adder circuits 320, 325, 330, 335. Hence the adder circuits 320, 325, 330, 335 also have 18 bit inputs and outputs. Within the further adder circuitry 225, each of the register banks 355, 360, 365, 370 and adder circuits 375, 380, 385, 390 have 32 bit input and output widths in one embodiment, to accommodate the sizes of the internal output values that may be generated as a result of the multiplications performed within the L4 circuit 340. The operation of the shift and saturate circuits 396, 398 takes the relevant 32-bit inputs received from the register banks 392, 394 and produces 16-bit outputs, i.e. output values that are of the same size as the input values.

Figures 6A and 6B provide a flow diagram illustrating the operation of the circuitry of Figure 4 in accordance with one embodiment. At step 400, N input values are provided to the permute circuitry 200, whereafter at step 405 a permutation is performed in order to create K groups of M values (with the permutation being dependent on whether the apparatus is configured to perform an FDCT or an IDCT).

At step 410, it is determined whether the apparatus is configured to perform an IDCT, and if so the process proceeds to step 415 where the adder circuitry 210 is used to generate M internal input values to be subjected to an IDCT operation. At step 420, those M internal input values are passed through the base circuitry 215 in order to perform matrix multiplication using an MxM IDCT matrix, with the appropriate coefficient values being provided by the coefficient generation circuitry 220.

The process then proceeds to step 425, where the adder circuitry 210 is used to perform addition and subtraction operations on corresponding members from selected groups within the storage 205 in order to generate multiple sets of M internal input values to be subjected to multiplication by the Hankel matrix. At step 430, each set of internal input values is then passed sequentially through the base circuitry 215 in order to cause multiple iterations of the matrix multiplication to be performed using the Hankel matrix (also referred to as an LM linear correlation matrix). As discussed previously, the coefficient generation circuitry 220 will typically generate separate sets of coefficient values for each iteration.

The process then proceeds to step 435, where it is determined whether an FDCT is being performed. If not, then the process proceeds directly to step 450.

Conversely, if an FDCT is being performed, then as shown in Figure 6A, steps 415 and 420 will have been bypassed, and in their place steps 440 and 445 will then be performed following performance of steps 425 and 430. In particular, at step 440, the adder circuitry 210 is used to generate M internal input values to be subjected to the FDCT matrix multiplication using an MxM FDCT matrix. Thereafter, at step 445, those M internal input values are passed through the base circuitry 215 which then performs the required matrix multiplication using the FDCT matrix. Again the coefficient generation circuitry 220 generates the appropriate coefficient values for the MxM FDCT matrix.

Following step 445, or directly following step 435 in the event that an IDCT is being performed, the process proceeds to step 450 where the further adder circuitry is used to perform add and/or subtract operations on the multiple sets of M internal output values generated by the base circuitry 215 in order to produce N intermediate output values. Whilst in Figure 6B, step 450 is shown as being performed after all of the iterations of the base operation performed by the base circuitry have been performed, it will be appreciated that in alternative embodiments the further adder circuitry may operate on each set of M internal output values as they are generated by the base circuitry.

Once step 450 has been performed, the storage 230 will contain N intermediate output values. At step 455, the shift and saturate circuit 235 applies a shift and saturate operation to the intermediate output values in order to generate shifted and saturated intermediate output values. The further permute circuitry 240 then performs a further permute operation on the output values provided by the shift and saturate circuit 235 in order to generate N output values. At this point, the N output values will represent the bit exact transform of the N input values. The actual permutation performed by the further permute circuitry 240 will be dependent on whether an FDCT or an IDCT is being performed.

By using the mechanism of the above described embodiments, it has been found that the number of multiplications required for each one dimensional transform can be significantly reduced, whilst maintaining a bit exact result as required by modern video standards such as the HEVC standard. The technique may be used for encoding and decoding, and is readily scalable for varying sizes of N. In one embodiment, the total number of multiplications performed by the base circuitry 215 for each one dimensional transform is $3^{n-1} + 3^{n-2} + \dots + 9 + Z$, where $Z \le 9$, and where $N = 2^n$. The value of Z depends on the number of multiplications required to perform the single MxM discrete cosine transform, and in one specific implementation configured to operate on a 32x32 array (i.e. N=32) and where M=4, it has been found that six multiplies are required for the single 4x4 discrete cosine transform (i.e. Z=6).

From the above equation, this results in 123 multiplies being required, this being approximately a third of the number of multiplies that would be required by the known partial butterfly approach.

The number of the iterations of the LM matrix required can be derived directly from the above equation. In particular, for the example where M is equal to 4, and accordingly multiple iterations of an L4 matrix multiplication are performed by the base circuitry, then nine multiplications are required to implement each L4 matrix multiplication. If N is 32, then as discussed earlier 123 multiplies are required, this including six multiplies required to perform the single 4x4 discrete cosine transform.

Hence, 117 multiplies are required to implement the multiplications of the L4 matrix multiplication, and in particular there will be 13 iterations of the L4 matrix, each requiring nine multiplications.
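The arithmetic above can be reproduced with a short C sketch. It assumes M = 4, N = 2^n with n of at least 3, nine multiplies per L4 iteration, and Z multiplies for the single 4x4 discrete cosine transform; the function names are hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: number of L4 iterations, 3^(n-3) + ... + 3 + 1. */
unsigned int l4_iterations(unsigned int n)
{
    unsigned int count = 0;
    unsigned int pow3 = 1;
    unsigned int e;

    assert(n >= 3);                 /* N >= 8 */
    for (e = 0; e + 3 <= n; e++) {  /* e = 0 .. n-3 */
        count += pow3;
        pow3 *= 3;
    }
    return count;
}

/* Hypothetical helper: total multiplies per one dimensional transform,
 * nine per L4 iteration plus z for the single 4x4 transform. */
unsigned int total_multiplies(unsigned int n, unsigned int z)
{
    return 9 * l4_iterations(n) + z;
}
```

With n = 5 and Z = 6 this gives 13 iterations and 9 x 13 + 6 = 123 multiplies, in agreement with the figures quoted in the text.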

The following additional information is provided relating to a specific embodiment.

Inverse transform algorithm description

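This section describes how to calculate the operations required for an N-point inverse discrete cosine transform where the number of points N is a power of two. The projections are first defined:

$$p_N(2nN + r) = \begin{cases} r & \text{if } 0 \le r < N \\ 2N - r & \text{if } N \le r < 2N \end{cases}$$

$$s_N(4nN + r) = \begin{cases} +1 & \text{if } 0 \le r < N \text{ or } 3N < r < 4N \\ 0 & \text{if } r = N \text{ or } r = 3N \\ -1 & \text{if } N < r < 3N \end{cases}$$

Then the N-point inverse discrete cosine transformation $T_N(c_0, \dots, c_{N-1})$ can be defined by the matrix with elements at row i, column j given by:

$$[T_N]_{ij} = s_N((2i+1)j)\; c_{p_N((2i+1)j)}$$

In practice the coefficients $c_k$ are scaled integral or fractional estimates of $c(k) = \cos(k\pi/2N)$, but no reliance is made on the coefficients having specific values, only that the matrix has the above form. The following matrices show $T_N$ for small N.

$$T_2 = \begin{pmatrix} c_0 & c_1 \\ c_0 & -c_1 \end{pmatrix} \qquad T_4 = \begin{pmatrix} c_0 & c_1 & c_2 & c_3 \\ c_0 & c_3 & -c_2 & -c_1 \\ c_0 & -c_3 & -c_2 & c_1 \\ c_0 & -c_1 & c_2 & -c_3 \end{pmatrix}$$

$$T_8 = \begin{pmatrix} c_0 & c_1 & c_2 & c_3 & c_4 & c_5 & c_6 & c_7 \\ c_0 & c_3 & c_6 & -c_7 & -c_4 & -c_1 & -c_2 & -c_5 \\ c_0 & c_5 & -c_6 & -c_1 & -c_4 & c_7 & c_2 & c_3 \\ c_0 & c_7 & -c_2 & -c_5 & c_4 & c_3 & -c_6 & -c_1 \\ c_0 & -c_7 & -c_2 & c_5 & c_4 & -c_3 & -c_6 & c_1 \\ c_0 & -c_5 & -c_6 & c_1 & -c_4 & -c_7 & c_2 & -c_3 \\ c_0 & -c_3 & c_6 & c_7 & -c_4 & c_1 & -c_2 & c_5 \\ c_0 & -c_1 & c_2 & -c_3 & c_4 & -c_5 & c_6 & -c_7 \end{pmatrix}$$

The input vector x and output vector y are related by the equation $y = T_N x$.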
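The element rule can be exercised with a small C sketch that emits, for each matrix position, which coefficient index is used and with what sign. The names proj_p, proj_s and tn_pattern are hypothetical and chosen for illustration only; the sketch assumes N is a power of two of at least 2.

```c
#include <assert.h>

/* Projection p_N: folds k into the range 0..N. */
unsigned int proj_p(unsigned int k, unsigned int n)
{
    k %= 2 * n;
    return (k >= n) ? 2 * n - k : k;
}

/* Sign s_N: +1, 0 or -1 depending on where k falls modulo 4N. */
int proj_s(unsigned int k, unsigned int n)
{
    k %= 4 * n;
    if (k == n || k == 3 * n)
        return 0;
    return (k > n && k < 3 * n) ? -1 : 1;
}

/* Hypothetical helper: describes T_N as an index/sign pattern, so that
 * T_N[i][j] = sgn[i*n + j] * c[idx[i*n + j]]. */
void tn_pattern(unsigned int n, unsigned int *idx, int *sgn)
{
    unsigned int i, j;

    assert(n >= 2);
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            unsigned int m = (2 * i + 1) * j;
            idx[i * n + j] = proj_p(m, n);
            sgn[i * n + j] = proj_s(m, n);
        }
    }
}
```

For N = 4, row 1 of the pattern comes out as indices (0, 3, 2, 1) with signs (+, +, -, -), which is the second row of the 4-point matrix given above.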

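The following is then further defined:

$$q_N(k) = p_{2N}(3^k) \qquad \text{(Eq 1)}$$

This is an odd value between 1 and 2N - 1.

$$t_N(k) = s_{2N}(3^k) \qquad \text{(Eq 2)}$$

Two permutations, $P_N$ and $Q_N$, are defined by:

$$
\begin{aligned}
P_N x = \big(\, & x(0), \\
& t_1(0)\,x(\tfrac{N}{2}\,q_1(0)), \\
& t_2(0)\,x(\tfrac{N}{4}\,q_2(0)),\ t_2(1)\,x(\tfrac{N}{4}\,q_2(1)), \\
& t_4(0)\,x(\tfrac{N}{8}\,q_4(0)),\ \dots,\ t_4(3)\,x(\tfrac{N}{8}\,q_4(3)), \\
& \dots, \\
& t_{N/2}(0)\,x(q_{N/2}(0)),\ \dots,\ t_{N/2}(\tfrac{N}{2}-1)\,x(q_{N/2}(\tfrac{N}{2}-1)) \,\big)
\end{aligned}
\qquad \text{(Eq 3)}
$$

$$Q_N y = \big(\, y((q_N(0)-1)/2),\ y((q_N(1)-1)/2),\ \dots,\ y((q_N(N-1)-1)/2) \,\big) \qquad \text{(Eq 4)}$$

The permutation $P_N$ is a signed permutation to a linear vector (the description is split over multiple rows to make the pattern clear). The permutation $Q_N$ is a reordering of the values without change of sign.

Permuting the input, output and coefficient values, $\hat{x} = P_N x$, $\hat{c} = P_N c$, $\hat{y} = Q_N y$, gives a new permuted transform $\hat{T}_N$ such that $\hat{y} = \hat{T}_N(\hat{c}_0, \dots, \hat{c}_{N-1})\,\hat{x}$. If the Hankel matrix is further defined:

$$L_N(c_0, \dots, c_{2N-1}) = \begin{pmatrix} c_0 & c_1 & \cdots & c_{N-1} \\ c_1 & c_2 & \cdots & c_N \\ \vdots & \vdots & & \vdots \\ c_{N-1} & c_N & \cdots & c_{2N-2} \end{pmatrix}$$

Then the first relation (R1) is:

$$\hat{T}_N(\hat{c}_0, \dots, \hat{c}_{N-1}) = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix} \begin{pmatrix} \hat{T}_{N/2}(\hat{c}_0, \dots, \hat{c}_{N/2-1}) & 0 \\ 0 & L_{N/2}(\hat{c}_{N/2}, \dots, \hat{c}_{N-1}, -\hat{c}_{N/2}, \dots, -\hat{c}_{N-1}) \end{pmatrix}$$

The second relation (R2) is:

$$L_{2N}(c_0, \dots, c_{4N-1}) = \begin{pmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \end{pmatrix} \begin{pmatrix} L_N(c_0 - c_N, \dots, c_{2N-1} - c_{3N-1}) & 0 & 0 \\ 0 & L_N(c_{2N} - c_N, \dots, c_{4N-1} - c_{3N-1}) & 0 \\ 0 & 0 & L_N(c_N, \dots, c_{3N-1}) \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{pmatrix}$$

In both relations each entry 1 denotes an identity block and each entry 0 a zero block of the appropriate size.

Relation (R1) reduces $\hat{T}_N$ to $\hat{T}_{N/2}$ and $L_{N/2}$ followed by N additions and subtractions.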

Relation (R2) reduces $L_N$ to N/2 additions followed by three multiplications by $L_{N/2}$, followed by N additions. This does not include the coefficient subtractions, but the coefficients are assumed to be fixed and the subtracted coefficient values can be calculated in advance.
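Relation (R2) can be checked numerically by comparing a direct Hankel product against one level of the three-way split. The following C sketch does this under stated assumptions: the names hankel_direct and hankel_split are hypothetical, the half size h is limited to 16 so that fixed local buffers suffice, and the coefficient differences are computed on the fly rather than in advance.

```c
#include <assert.h>
#include <stdint.h>

/* Direct n x n Hankel (linear correlation) product:
 * y[i] = sum over j of c[i+j] * x[j], using c[0] .. c[2n-2]. */
void hankel_direct(int32_t *y, const int32_t *x, const int32_t *c,
                   unsigned int n)
{
    unsigned int i, j;

    for (i = 0; i < n; i++) {
        int32_t acc = 0;
        for (j = 0; j < n; j++)
            acc += c[i + j] * x[j];
        y[i] = acc;
    }
}

/* One level of relation (R2): a 2h x 2h Hankel product built from
 * three h x h products. c must hold 4h-1 coefficients. */
void hankel_split(int32_t *y, const int32_t *x, const int32_t *c,
                  unsigned int h)
{
    int32_t xs[16], ys[16], ca[31], cb[31];
    unsigned int i;

    assert(h >= 1 && h <= 16);         /* limit of the local buffers */
    for (i = 0; i < h; i++)
        xs[i] = x[i] + x[h + i];       /* h additions before */
    for (i = 0; i + 1 < 2 * h; i++) {
        ca[i] = c[i] - c[h + i];       /* coefficients for the low half */
        cb[i] = c[2 * h + i] - c[h + i]; /* coefficients for the high half */
    }
    hankel_direct(ys, xs, c + h, h);   /* shared middle block */
    hankel_direct(y, x, ca, h);
    hankel_direct(y + h, x + h, cb, h);
    for (i = 0; i < h; i++) {          /* 2h additions after */
        y[i] += ys[i];
        y[h + i] += ys[i];
    }
}
```

The split replaces four h x h block products with three, which is where the factor of 3 per recursion level in the multiplication counts comes from.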

For $N = 2^n \ge 8$, repeating relations (R1) and (R2) recursively reduces $\hat{T}_N$ to additions, followed by one multiplication by $\hat{T}_4$ and $(3^{n-3} + \dots + 3 + 1)$ multiplications by matrices of the form $L_4$, followed by additions and subtractions.

Inverse transform example

This section illustrates how to apply the theory of the previous section to the practical case of N = 16.

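Starting with input vector $x = (x_0, x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9, x_{10}, x_{11}, x_{12}, x_{13}, x_{14}, x_{15})$, this is permuted to $\hat{x} = (x_0, x_8, x_4, x_{12}, x_2, x_6, -x_{14}, x_{10}, x_1, x_3, x_9, -x_5, -x_{15}, x_{13}, -x_7, x_{11})$.

The permuted vector is split into 4 groups or vectors of 4 elements each:

$$X_0 = (x_0, x_8, x_4, x_{12}), \quad X_1 = (x_2, x_6, -x_{14}, x_{10}), \quad X_2 = (x_1, x_3, x_9, -x_5), \quad X_3 = (-x_{15}, x_{13}, -x_7, x_{11})$$

In a similar way the coefficients are permuted and grouped:

$$C_0 = (c_0, c_8, c_4, c_{12}), \quad C_1 = (c_2, c_6, -c_{14}, c_{10}), \quad C_2 = (c_1, c_3, c_9, -c_5), \quad C_3 = (-c_{15}, c_{13}, -c_7, c_{11})$$

The output vector is split into four permuted groups:

$$Y_0 = (y_0, y_1, y_4, y_{13}), \quad Y_1 = (y_8, y_6, y_{12}, y_5), \quad Y_2 = (y_{15}, y_{14}, y_{11}, y_2), \quad Y_3 = (y_7, y_9, y_3, y_{10})$$

Applying relation R1 gives the equations:

$$Y_0 = W_0 + W_2, \quad Y_1 = W_1 + W_3, \quad Y_2 = W_0 - W_2, \quad Y_3 = W_1 - W_3$$

$$\begin{pmatrix} W_0 \\ W_1 \end{pmatrix} = \hat{T}_8(C_0, C_1) \begin{pmatrix} X_0 \\ X_1 \end{pmatrix}, \qquad \begin{pmatrix} W_2 \\ W_3 \end{pmatrix} = L_8(C_2, C_3, -C_2, -C_3) \begin{pmatrix} X_2 \\ X_3 \end{pmatrix}$$

Applying relations R1 and R2 gives the equations:

$$W_0 = W'_0 + W'_1, \quad W_1 = W'_0 - W'_1, \quad W_2 = W'_2 + W'_4, \quad W_3 = W'_3 + W'_4$$

$$W'_0 = \hat{T}_4(C_0)\,X_0, \qquad W'_1 = L_4(C_1, -C_1)\,X_1$$

$$W'_2 = L_4(C_2 - C_3,\ C_3 + C_2)\,X_2, \qquad W'_3 = L_4(-C_2 - C_3,\ -C_3 + C_2)\,X_3, \qquad W'_4 = L_4(C_3, -C_2)\,(X_2 + X_3)$$

This reduces the transform to one $\hat{T}_4$ and four $L_4$ operations.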
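The input ordering used in this example can be generated programmatically. The C sketch below follows the signed permutation of (Eq 3) by stepping through successive powers of 3; the function names are hypothetical, and the output is expressed as a source index and a sign for each permuted position rather than as permuted data.

```c
#include <assert.h>

/* Projection p_N and sign s_N as defined in the algorithm description. */
unsigned int perm_p(unsigned int k, unsigned int n)
{
    k %= 2 * n;
    return (k >= n) ? 2 * n - k : k;
}

int perm_s(unsigned int k, unsigned int n)
{
    k %= 4 * n;
    if (k == n || k == 3 * n)
        return 0;
    return (k > n && k < 3 * n) ? -1 : 1;
}

/* Hypothetical helper: fills src[] and sign[] so that the permuted input
 * satisfies x_hat[d] = sign[d] * x[src[d]] for d = 0 .. n-1.
 * n must be a power of two. */
void signed_permutation(unsigned int n, unsigned int *src, int *sign)
{
    unsigned int k, i, m, q;

    assert(n >= 2);
    src[0] = 0;
    sign[0] = 1;
    for (k = 1, m = n / 2; k < n; k <<= 1, m >>= 1) {
        q = 1;                           /* q steps through powers of 3 */
        for (i = 0; i < k; i++) {
            src[k + i] = m * perm_p(q, 2 * k);
            sign[k + i] = perm_s(q, 2 * k);
            q = (q * 3) % (8 * n);
        }
    }
}
```

For n = 16 this yields the source indices 0, 8, 4, 12, 2, 6, 14, 10, 1, 3, 9, 5, 15, 13, 7, 11, with negative signs at the positions holding indices 14, 5, 15 and 7.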

Forward transform algorithm description

The N-point forward discrete cosine transformation $F_N(c_0, \dots, c_{N-1}) = T_N^{\mathsf{T}}$ can be defined by the matrix with elements at row i, column j given by:

$$[F_N]_{ij} = s_N((2j+1)i)\; c_{p_N((2j+1)i)}$$

Defining $\hat{F}_N = P_N F_N Q_N^{-1}$, swapping the input and output permutations, and inverting relation (R1), the relation (R3) below is obtained:

$$\hat{F}_N(\hat{c}_0, \dots, \hat{c}_{N-1}) = \begin{pmatrix} \hat{F}_{N/2}(\hat{c}_0, \dots, \hat{c}_{N/2-1}) & 0 \\ 0 & L_{N/2}(\hat{c}_{N/2}, \dots, \hat{c}_{N-1}, -\hat{c}_{N/2}, \dots, -\hat{c}_{N-1}) \end{pmatrix} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}$$

For $N = 2^n \ge 8$, repeating relations (R3) and (R2) recursively reduces $\hat{F}_N$ to additions and subtractions, followed by one multiplication by $\hat{F}_4$ and $(3^{n-3} + \dots + 3 + 1)$ multiplications by matrices of the form $L_4$, followed by additions. The relations can be applied in a similar way to the inverse transform example.

Example implementation

The following example C code implements the inverse transform $T_N$ and the forward transform $F_N$ in the functions fact_idct_1d_i16 and fact_fdct_1d_i16 respectively.

    #include <assert.h>
    #include <stdint.h>

    /* MAX_ITRANS_SIZE is not defined in the listing; 32 (the largest
     * transform size discussed) is assumed here. */
    #ifndef MAX_ITRANS_SIZE
    #define MAX_ITRANS_SIZE 32
    #endif

    /* Calculate projection p()
     *
     * p(2*k*n+r) = r      if 0<=r<n
     *              2*n-r  if n<=r<2*n
     */
    static unsigned int p_n(unsigned int k, unsigned int n)
    {
        k = k % (2*n);
        if (k>=n) k = 2*n-k;
        return k;
    }

    /* Calculate sign s()
     *
     * s(4*k*n+r) = +1 if 0<=r<n or 3*n<r<4*n
     *               0 if r==n or r==3*n
     *              -1 if n<r<3*n
     */
    static int s_n(unsigned int k, unsigned int n)
    {
        int s = +1;
        k = k % (4*n);
        if (k==n || k==3*n) s = 0;
        else if (k>=n && k<3*n) s = -1;
        return s;
    }

    /* Linear correlation
     *
     * y[i] = x[0]*c[i] + ... + x[n-1]*c[i+n-1]
     *
     * The nxn correlation is broken down into 4x4 operations by recursion.
     */
    static void L_n(
        int32_t *y,        // output (n elements)
        const int32_t *x,  // input (n elements)
        const int32_t *c,  // coefficients (2*n-1 elements)
        unsigned int n     // size
    )
    {
        assert(n>=4);
        if (n==4)
        {
            /* Implement L4 in 9 multiplies
             * The coefficients can be pre-calculated
             *
             * The right-hand sides of four products (marked below) are
             * missing from the source listing; they are reconstructed
             * here so that y = L4 * x holds.
             */
            int32_t v0 = (x[0]+x[1])*(c[1]-c[3]);
            int32_t v1 = (x[2]+x[3])*(c[5]-c[3]);            /* reconstructed */
            int32_t v2 = (x[0]+x[1]+x[2]+x[3])*c[3];
            int32_t w0 = v0 + v2;
            int32_t w1 = v1 + v2;
            v0 = x[0]*((c[0]-c[2])-(c[1]-c[3]));             /* reconstructed */
            v1 = x[2]*((c[4]-c[2])-(c[5]-c[3]));             /* reconstructed */
            v2 = (x[0]+x[2])*(c[2]-c[3]);
            y[0] = w0 + v0 + v2;
            y[2] = w1 + v1 + v2;
            v0 = x[1]*((c[2]-c[4])-(c[1]-c[3]));
            v1 = x[3]*((c[6]-c[4])-(c[5]-c[3]));             /* reconstructed */
            v2 = (x[1]+x[3])*(c[4]-c[3]);
            y[1] = w0 + v0 + v2;
            y[3] = w1 + v1 + v2;
        }
        else /* n>4 */
        {
            int32_t x2[MAX_ITRANS_SIZE/4];
            int32_t c0[MAX_ITRANS_SIZE/2];
            int32_t c1[MAX_ITRANS_SIZE/2];
            int32_t y2[MAX_ITRANS_SIZE/4];
            unsigned int i;
            unsigned int n2 = n>>1;
            /* Additions subtractions before recursion */
            for (i=0; i<n2; i++)
            {
                x2[i] = x[i] + x[n2+i];
            }
            /* coefficients can be pre-calculated */
            for (i=0; i<n-1; i++)
            {
                c0[i] = c[i] - c[n2+i];
                c1[i] = c[n+i] - c[n2+i];
            }
            /* Recurse */
            L_n(y2, x2, c+n2, n2);    // L4 on x[i]+x[(n/2)+i]
            L_n(y, x, c0, n2);        // L4 on x[i]
            L_n(y+n2, x+n2, c1, n2);  // L4 on x[(n/2)+i]
            /* Further additions/subtractions after recursion */
            for (i=0; i<n2; i++)
            {
                y[i] = y[i] + y2[i];
                y[n2+i] = y[n2+i] + y2[i];
            }
        }
    }

    /* Calculate matrix T^_n(c) with permuted input and output */
    static void TT_n(
        int32_t *y,        // output
        const int32_t *x,  // input
        const int32_t *c,  // coefficients
        unsigned int n     // size
    )
    {
        /* Factorized version */
        assert(n>=4);
        if (n==4)
        {
            /* 4-point IDCT in 6 multiplies */
            int32_t v0,v1,v2;
            assert(c[0]==c[1]);
            v0 = x[2]*c[2];
            v1 = x[3]*c[3];
            v2 = (x[0]+x[1])*c[0];
            y[0] = v2+v0+v1;
            y[2] = v2-v0-v1;
            v0 = x[2]*c[3];
            v1 = x[3]*c[2];
            v2 = (x[0]-x[1])*c[0];
            y[1] = v2+v0-v1;
            y[3] = v2-v0+v1;
        }
        else /* n>4 */
        {
            int32_t c1[MAX_ITRANS_SIZE];  // expanded correlation
            unsigned int i;
            unsigned int n2 = n>>1;
            /* Coefficients can be pre-calculated */
            for (i=0; i<n2; i++)
            {
                c1[i] = c[n2+i];
                c1[n2+i] = -c1[i];
            }
            /* Recurse */
            TT_n(y, x, c, n2);
            L_n(y+n2, x+n2, c1, n2);
            /* Further additions/subtractions after recursion */
            for (i=0; i<n2; i++)
            {
                int32_t y0 = y[i];
                int32_t y1 = y[i+n2];
                y[i] = y0+y1;
                y[i+n2] = y0-y1;
            }
        }
    }

    /* Calculate matrix F^_n(c) with permuted input and output */
    static void FF_n(
        int32_t *y,        // output
        int32_t *x,        // input (modified by butterflies)
        const int32_t *c,  // coefficients
        unsigned int n     // size
    )
    {
        /* Factorized version */
        assert(n>=4);
        if (n==4)
        {
            /* 4-point FDCT in 6 multiplies */
            int32_t v0,v1,v2;
            assert(c[0]==c[1]);
            v0 = (x[0]-x[2])*c[2];
            v1 = (x[1]-x[3])*c[3];
            v2 = (x[0]+x[1]+x[2]+x[3])*c[0];
            y[0] = v2;
            y[2] = v0+v1;
            v0 = (x[0]-x[2])*c[3];
            v1 = (x[1]-x[3])*c[2];
            v2 = (x[0]-x[1]+x[2]-x[3])*c[0];
            y[1] = v2;
            y[3] = v0-v1;
        }
        else /* n>4 */
        {
            int32_t c1[MAX_ITRANS_SIZE];  // expanded correlation
            unsigned int i;
            unsigned int n2 = n>>1;
            /* coefficients can be pre-calculated */
            for (i=0; i<n2; i++)
            {
                c1[i] = c[n2+i];
                c1[n2+i] = -c1[i];
            }
            /* Additions subtractions before recursion */
            for (i=0; i<n2; i++)
            {
                int32_t x0 = x[i];
                int32_t x1 = x[i+n2];
                x[i] = x0+x1;
                x[i+n2] = x0-x1;
            }
            /* Recurse */
            FF_n(y, x, c, n2);
            L_n(y+n2, x+n2, c1, n2);
        }
    }

    /* Factorized 1D linear integer IDCT */
    void fact_idct_1d_i16(
        int16_t *y,          // Output
        const int16_t *x,    // Input
        const int16_t *c,    // Coefficients
        unsigned int n,      // Size
        unsigned int shift,  // right shift
        int32_t R            // round
    )
    {
        int32_t Y[MAX_ITRANS_SIZE];
        int32_t X[MAX_ITRANS_SIZE];
        int32_t C[MAX_ITRANS_SIZE];
        unsigned int i;
        unsigned int k;
        unsigned int m;
        unsigned int q;
        unsigned int p;
        int s;
        /* Apply input signed permutation */
        X[0] = x[0];
        C[0] = c[0];
        m = n/2;
        for (k=1; k<n; k=k<<1, m=m>>1)
        {
            q = 1;
            for (i=0; i<k; i++)
            {
                p = m*p_n(q, 2*k);
                s = s_n(q, 2*k);
                X[k+i] = (s>0) ? x[p] : -x[p];
                C[k+i] = (s>0) ? c[p] : -c[p];
                q = (q*3) % (8*n);
            }
        }
        /* Calculate permuted transform T^_n */
        TT_n(Y,X,C,n);  // note X and Y are modified
        /* Apply output (unsigned) permutation */
        q = 1;
        for (i=0; i<n; i++)
        {
            p = p_n(q, 2*n) >> 1;
            y[p] = (Y[i] + R) >> shift;
            q = (q*3) % (8*n);
        }
    }

    /* Factorized 1D linear integer FDCT */
    void fact_fdct_1d_i16(
        int16_t *y,          // Output
        const int16_t *x,    // Input
        const int16_t *c,    // Coefficients
        unsigned int n,      // Size
        unsigned int shift,  // right shift
        int32_t R            // round
    )
    {
        int32_t Y[MAX_ITRANS_SIZE];
        int32_t X[MAX_ITRANS_SIZE];
        int32_t C[MAX_ITRANS_SIZE];
        unsigned int i;
        unsigned int k;
        unsigned int m;
        unsigned int q;
        unsigned int p;
        int s;
        /* Apply input (unsigned) permutation */
        q = 1;
        for (i=0; i<n; i++)
        {
            p = p_n(q, 2*n) >> 1;
            X[i] = x[p];
            q = (q*3) % (8*n);
        }
        /* Apply coefficient signed permutation */
        C[0] = c[0];
        m = n/2;
        for (k=1; k<n; k=k<<1, m=m>>1)
        {
            q = 1;
            for (i=0; i<k; i++)
            {
                p = m*p_n(q, 2*k);
                s = s_n(q, 2*k);
                C[k+i] = (s>0) ? c[p] : -c[p];
                q = (q*3) % (8*n);
            }
        }
        /* Calculate permuted transform F^_n */
        FF_n(Y,X,C,n);
        /* Apply output signed permutation */
        y[0] = (Y[0]+R)>>shift;
        m = n/2;
        for (k=1; k<n; k=k<<1, m=m>>1)
        {
            q = 1;
            for (i=0; i<k; i++)
            {
                p = m*p_n(q, 2*k);
                s = s_n(q, 2*k);
                int32_t yy = (s>0) ? Y[k+i] : -Y[k+i];
                y[p] = (yy+R)>>shift;
                q = (q*3) % (8*n);
            }
        }
    }

From the above described embodiments, it will be appreciated that such embodiments provide a scalable mechanism for performing both forward and inverse transforms for varying sizes of N, which results in a significant reduction in the number of multiplications required in order to perform the transform, and which produces a bit exact result.

Although particular embodiments of the invention have been described herein, it will be apparent that the invention is not limited thereto, and that many modifications and additions may be made within the scope of the invention. For example, various combinations of the features of the following dependent claims could be made with the features of the independent claims without departing from the scope of the present invention.

Claims

CLAIMS1. A data processing apparatus for performing a transform between spatial and frequency domains when processing video data, the data processing apparatus comprising: transform circuitry configured to receive N input values and to perform a sequence of operations to generate N output values representing the transform of said N input values between the spatial and frequency domains; a base circuitry coniigured to receive M internal input values generated by the transform circuitry, where M is greater than or equal to 4, and to perform a base operation equivalent to matrix multiplication of said M internal input values by a matrix comprising an array of coefficients c and having the form C0 c c2...Ci C Cg...c2 c3 c4... CM_I cM1 CM CM 1 *.. C2M2 in order to generate M internal output values for returning to the transform circuitry; and the transform circuitry being arranged during performance of said sequence of operations to generate from the N input values multiple sets of said M internal input values, to provide each set of M internal input values to the base circuitry in ordcr to cause multiple sets of said M internal output values to he produced, and to derive the N output values from said multiple sets of M internal output values.
2. A data processing apparatus as claimed in Claim 1, wherein: said matrix comprising an array of coefficients c is a Hankel matrix; and the base circuitry is configured to perform said base operation equivalent to matrix multiplication of said M internal input values by said Hankel matrix irrespective of the values allocated to the coefficients $c_0$ to $c_{2M-2}$.
3. A data processing apparatus as claimed in Claim 1, wherein the transform performed between spatial and frequency domains is a discrete cosine transform.
4. A data processing apparatus as claimed in any of claims 1 to 3, wherein the transform circuitry comprises: permutation circuitry configured to permute the received N input values in order to produce K groups of input values, where K=N/M and hence each group has M members, each member being one of said received N input values; and adder circuitry configured to perform at least one of addition and subtraction operations on corresponding members from selected groups in order to generate each set of said M internal input values.
5. A data processing apparatus as claimed in Claim 4, wherein the transform circuitry further comprises: further adder circuitry configured to perform at least one of addition and subtraction operations on the multiple sets of said M internal output values produced by the base circuitry in order to produce N intermediate output values.
6. A data processing apparatus as claimed in Claim 5, wherein the transform circuitry further comprises: shift circuitry configured to perform a shift operation on the N intermediate output values in order to generate shifted intermediate output values.
7. A data processing apparatus as claimed in Claim 6 wherein said shift operation includes a saturate operation.
8. A data processing apparatus as claimed in Claim 6 or Claim 7, wherein the transform circuitry further comprises: further permutation circuitry configured to permute the N shifted intermediate output values in order to generate said N output values.
9. A data processing apparatus as claimed in any preceding claim, further comprising coefficient generation circuitry configured to generate for each set of M internal input values a corresponding set of coefficient values to be used by the base circuitry when performing the base operation.
10. A data processing apparatus as claimed in any preceding claim, wherein the data processing apparatus is configurable to perform one of a forward transform from the spatial to the frequency domain and an inverse transform from the frequency to the spatial domain.
11. A data processing apparatus as claimed in Claim 10 when dependent on Claim 9, wherein said corresponding set of coefficient values generated by the coefficient generation circuitry for each set of M internal input values are the same irrespective of whether the data processing apparatus is configured to perform the forward transform or is configured to perform the inverse transform.
12. A data processing apparatus as claimed in any preceding claim when dependent on Claim 3, wherein the transform circuitry is further configured to generate a further set of M internal input values for provision to the base circuitry, and the base circuitry is configured to perform a discrete cosine transform on said further set of M internal input values by performing a discrete cosine transform operation equivalent to matrix multiplication of said further set of M internal input values by a discrete cosine transform matrix.
13. A data processing apparatus as claimed in Claim 12, wherein said data processing apparatus is configured to perform a forward discrete cosine transform during encoding of the video data, and the base circuitry is configured to perform as the discrete cosine transform operation a forward discrete transform operation following performance of the base operation on said multiple sets of M internal input values.
14. A data processing apparatus as claimed in Claim 12, wherein said data processing apparatus is configured to perform an inverse discrete cosine transform during decoding of the video data, and the base circuitry is configured to perform as the discrete cosine transform operation an inverse discrete transform operation prior to performance of the base operation on said multiple sets of M internal input values.
15. A data processing apparatus as claimed in any preceding claim, wherein N is a multiple of M.
16. A data processing apparatus as claimed in Claim 15, wherein N is a power of 2.
17. A data processing apparatus as claimed in any preceding claim, wherein M = 4.
18. A data processing apparatus as claimed in any preceding claim when dependent on Claim 4, wherein said adder circuitry is configured as SIMD circuitry providing M lanes of parallel processing for performing said at least one of addition and subtraction operations in parallel in order to generate each set of said M internal input values.
19. A data processing apparatus as claimed in any preceding claim when dependent on Claim 5, wherein said further adder circuitry is configured as SIMD circuitry providing M lanes of parallel processing for performing said at least one of addition and subtraction operations in parallel on each set of said M internal output values produced by the base circuitry.
20. A data processing apparatus as claimed in any preceding claim, wherein: the data processing apparatus is configured to operate on video data blocks comprising an N x N array of data values by separately performing, on each row and each column of N data values, said transform between the spatial and frequency domains; and the total number of multiplications performed by said base circuitry for each said row or each said column is $3^{n-1} + 3^{n-2} + \cdots + 9 + Z$, where $Z \le 9$, and where $N = 2^n$.
21. A data processing apparatus as claimed in any preceding claim when dependent on Claim 9, wherein said data processing apparatus is configurable to support different video Standards by causing the coefficient generation circuitry to set the corresponding set of the coefficients supplied to the base circuitry for each set of M internal input values dependent on a currently selected video Standard.
22. A method of performing a transform between spatial and frequency domains when processing video data, the method comprising: employing transform circuitry to receive N input values and to perform a sequence of operations to generate N output values representing the transform of said N input values between the spatial and frequency domains; employing a base circuitry to receive M internal input values generated by the transform circuitry, where M is greater than or equal to 4, and to perform a base operation equivalent to matrix multiplication of said M internal input values by a matrix comprising an array of coefficients c and having the form
$$\begin{pmatrix} c_0 & c_1 & c_2 & \cdots & c_{M-1} \\ c_1 & c_2 & c_3 & \cdots & c_M \\ c_2 & c_3 & c_4 & \cdots & c_{M+1} \\ \vdots & \vdots & \vdots & & \vdots \\ c_{M-1} & c_M & c_{M+1} & \cdots & c_{2M-2} \end{pmatrix}$$
in order to generate M internal output values for returning to the transform circuitry; and performance of said sequence of operations by the transform circuitry comprising: generating from the N input values multiple sets of said M internal input values; providing each set of M internal input values to the base circuitry in order to cause multiple sets of said M internal output values to be produced; and deriving the N output values from said multiple sets of M internal output values.
23. A data processing apparatus for performing a transform between spatial and frequency domains when processing video data, the data processing apparatus comprising: transform means for receiving N input values and for performing a sequence of operations to generate N output values representing the transform of said N input values between the spatial and frequency domains; base circuitry means for receiving M internal input values generated by the transform means, where M is greater than or equal to 4, and for performing a base operation equivalent to matrix multiplication of said M internal input values by a matrix comprising an array of coefficients c and having the form
$$\begin{pmatrix} c_0 & c_1 & c_2 & \cdots & c_{M-1} \\ c_1 & c_2 & c_3 & \cdots & c_M \\ c_2 & c_3 & c_4 & \cdots & c_{M+1} \\ \vdots & \vdots & \vdots & & \vdots \\ c_{M-1} & c_M & c_{M+1} & \cdots & c_{2M-2} \end{pmatrix}$$
in order to generate M internal output values for returning to the transform means; and the transform means, during performance of said sequence of operations, for generating from the N input values multiple sets of said M internal input values, for providing each set of M internal input values to the base circuitry means in order to cause multiple sets of said M internal output values to be produced, and for deriving the N output values from said multiple sets of M internal output values.
24. A data processing apparatus for performing a transform between spatial and frequency domains when processing video data, substantially as hereinbefore described with reference to figures 3A to 6B.
25. A method of performing a transform between spatial and frequency domains when processing video data, substantially as hereinbefore described with reference to figures 3A to 6B.
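
By way of a worked illustration of the multiplication count recited in Claim 20, the following minimal C sketch tabulates the recited total against a direct N x N matrix multiplication of one row or column. The function names claimed_multiplications and direct_multiplications are hypothetical, and the sketch assumes that the recited series runs from 3 to the power n-1 down to 3 squared (the "9" term), with Z supplied by the caller.

```c
/* Total recited in Claim 20 for N = 2^n:
   3^(n-1) + 3^(n-2) + ... + 9 + Z.
   Assumption: the series stops at 3^2 = 9, so for n < 3 only Z remains. */
static long claimed_multiplications(int n, int z)
{
    long total = z;
    long term = 9;                      /* 3^2 */
    for (int e = 2; e <= n - 1; e++) {
        total += term;
        term *= 3;
    }
    return total;
}

/* Multiplications for a direct N x N matrix-vector product, N = 2^n. */
static long direct_multiplications(int n)
{
    long size = 1L << n;
    return size * size;
}
```

With Z taken at its upper bound of 9, the sketch gives 18 multiplications for N = 8, 45 for N = 16 and 126 for N = 32, against 64, 256 and 1024 respectively for the direct product.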