AU2008255265A1 - Prediction region memory access reduction - Google Patents

Prediction region memory access reduction

Info

Publication number
AU2008255265A1
Authority
AU
Australia
Prior art keywords
fetch
bits
buffer
chroma
samples
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
AU2008255265A
Inventor
Christopher James ROSEWARNE
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Canon Inc
Original Assignee
Canon Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Canon Inc filed Critical Canon Inc
Priority to AU2008255265A
Publication of AU2008255265A1


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 12/00: Accessing, addressing or allocating within memory systems or architectures
    • G06F 12/02: Addressing or allocation; Relocation
    • G06F 12/0215: Addressing or allocation; Relocation with look ahead addressing means

Description

S&F Ref: 886300

AUSTRALIA
PATENTS ACT 1990
COMPLETE SPECIFICATION FOR A STANDARD PATENT

Name and Address of Applicant: Canon Kabushiki Kaisha, of 30-2, Shimomaruko 3-chome, Ohta-ku, Tokyo, 146, Japan
Actual Inventor(s): Christopher James Rosewarne
Address for Service: Spruson & Ferguson, St Martins Tower, Level 35, 31 Market Street, Sydney NSW 2000 (CCN 3710000177)
Invention Title: Prediction region memory access reduction

The following statement is a full description of this invention, including the best method of performing it known to me/us:

PREDICTION REGION MEMORY ACCESS REDUCTION

TECHNICAL FIELD

The current invention relates to video decoding and, in particular, to minimising duplicated memory data fetches when retrieving reference picture data from a system memory.

BACKGROUND

A video decoder produces decoded video data from the contents of a compressed video bit-stream. To produce an output picture, the video decoder must read, or refer to, previously decoded pictures, which typically requires a large system memory for storage. In embedded systems, where system memory is commonly off-chip, memory access is a time- and power-consuming operation. In a video coding system, motion compensation accounts for the majority of system memory accesses, so it is desirable to minimise the memory accesses required during motion compensation when processing video.

Motion vectors for neighbouring regions of an image tend to refer to overlapping regions of the same previously decoded picture. This results in many more memory accesses than the theoretical minimum being performed. Accessing system memory, which commonly exists as synchronous dynamic RAM, is very time consuming.

One method for reducing the number of memory accesses is to retain in a cache those samples, required to process a particular region of the picture, that will be re-used for a future region.
A problem with this method is that implementing a cache to hold samples at particular locations is computationally intensive, due to the need to recall caching information for each sample in the picture being processed.

Another method for reducing the number of accesses to system memory merges regions into a larger group for fetching from main memory. This approach requires decision logic to determine when a partition should be added to the group or processed separately. One drawback of this method is that the grouping computation does not extend across a region-group boundary, resulting in inefficiency where a neighbouring region group uses samples from the same memory locations.

Another method pre-fetches samples that are likely to be read by a future partition and stores them for possible subsequent use. The samples to pre-fetch are generally spatially local, lying to the right of or below the current region. However, the motion vector value may place a subsequent region in a different part of the picture, so that the pre-fetched sample data is not required to calculate the next region.

Another method reduces memory accesses during motion compensation by combining several small access requests into a single larger request. In this approach, the set of small accesses must be examined and merged into a single larger access that spans all the memory required. Regions that overlap, or that can be merged, must be determined by examining the motion vector associated with each region, a process that is computationally expensive.
SUMMARY

In accordance with one aspect of the present disclosure there is provided a method for storing memory addresses in a buffer to fetch sample data from reference pictures in a system memory for a current partition, the method comprising the steps of:

providing a first two-dimensional array of fetch-bits, representing the fetches that are performed for previous partitions;

generating a second two-dimensional array of fetch-bits, representing the fetches required from the reference picture for the partition;

determining a third two-dimensional array of fetch-bits by determining the intersection of the second set of fetch-bits and the inverted bits of the first set of fetch-bits;

performing the fetches corresponding to the third two-dimensional array of fetch-bits into a local memory buffer;

setting the bits in the first two-dimensional array of fetch-bits corresponding to the bits set in the third two-dimensional array of fetch-bits; and

updating the corresponding entries in the buffer to the locations of the fetched data in the local memory buffer.

In a specific implementation, firstly, using the fractional motion vector and the integer motion vector, each partition in an inter-predicted macroblock is mapped onto a two-dimensional fetch bit-map in which each bit corresponds to a location from which to read sample data, relative to the current macroblock location.

Each bit in the fetch bit-map corresponds to an entry in a two-dimensional circular buffer that stores the address of the corresponding fetched data in a Local RAM Buffer. The Local RAM Buffer is a sliding window that stores fetches performed on the picture buffer in chronological order.

A previous fetch bit-map, indicating which fetches are currently stored in the Local RAM Buffer, is used as input. It is compared with the required fetches for each partition in the macroblock. Any 0 bits indicate data that must be fetched.
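The summarised steps can be sketched in Python. This is a minimal illustration, not the claimed hardware: the 2D lists of 0/1 values stand in for the fetch-bit arrays, and all names (`fetches_to_perform`, `perform_fetches`, `fetch_fn`) are hypothetical.

```python
# Sketch of the summarised method. The 2D lists of 0/1 values stand in
# for the hardware fetch-bit arrays; all names are hypothetical.

def fetches_to_perform(first, second):
    """Third array = intersection of the second array with the inverted
    bits of the first array: fetches required but not yet performed."""
    return [[s & (1 - f) for f, s in zip(frow, srow)]
            for frow, srow in zip(first, second)]

def perform_fetches(first, third, entries, local_buffer, fetch_fn):
    """Perform each fetch flagged in `third`, append the data to the
    local memory buffer, set the bit in `first`, and record the
    local-buffer location in the corresponding buffer entry."""
    for r, row in enumerate(third):
        for c, bit in enumerate(row):
            if bit:
                local_buffer.append(fetch_fn(r, c))   # read from system memory
                first[r][c] = 1                       # fetch now performed
                entries[r][c] = len(local_buffer) - 1 # where the data landed
```

Running `fetches_to_perform` a second time for the same required region yields an all-zero third array, which is exactly the duplicate-fetch saving the method targets.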
This data is fetched, with the corresponding bit being set and the Local RAM Buffer address where the data resides being written to the corresponding entry in the circular buffer. Sample generation may be performed once the data is fetched.

On advancing to the next macroblock (except at the right edge of the picture), the fetch bit-map is shifted one bit to the left, corresponding to the macroblock advancing one position to the right. The correspondence between the fetch bit-map and the circular buffer is also shifted to compensate for the bit-shift within the fetch bit-map, thus preserving references to Local RAM Buffer addresses.

BRIEF DESCRIPTION OF THE DRAWINGS

One or more embodiments of the invention will now be described with reference to the following drawings, in which:

Fig. 1 is a block diagram of a typical video processing architecture.

Fig. 2 is a block diagram of the internal components of a video decoding system.

Fig. 3 shows a detailed block diagram of the prediction components in a video decoder.

Fig. 4 shows a representation of the sample storage for one luma fetch from the system memory.

Fig. 5 shows a representation of the sample storage for one chroma fetch from the system memory containing both chroma channels in 4:2:0 chroma format.

Fig. 6 shows a representation of the relationship between the luma and the chroma channels in a specified memory format for sample data.

Fig. 7 shows the required data fetched from a reference picture in system memory for a partition contained within a single macroblock.

Fig. 8 shows the required data fetched from a reference picture in system memory for a partition crossing a macroblock boundary.

Fig. 9 shows the additional samples required to generate the current partition samples.

Fig. 10 shows the format of a register used to store a two-dimensional fetch bit-map of fetches from system memory that are stored in the local memory buffer.

Fig. 11 shows how the fetch bit-map of samples to be fetched for a partition is determined.

Fig. 12 shows how a fetch bit-map is advanced when the current macroblock has been completely processed.

Fig. 13 shows the layout of the circular memory buffer, with its locations corresponding to a two-dimensional fetch bit-map representing the data in a local memory and its data corresponding to the addresses within the local memory.

Fig. 14 shows how the circular buffer is advanced when the current macroblock has been completely processed.

Fig. 15 shows a flow-chart for the process of updating the circular memory buffer.

DETAILED DESCRIPTION INCLUDING BEST MODE

In general, standard video formats (e.g. MPEG-2 or H.264) encode video as a series of pictures. A picture is divided into a set of fixed-size macroblocks, typically 16 samples high and 16 samples wide. A pixel is represented as luma and chroma samples in the YCbCr colour space: Y denotes the luma channel, Cb the first chroma channel and Cr the second chroma channel.

A previously decoded picture required for future picture generation is known as a reference picture. A macroblock that refers to one reference picture is known as a predicted macroblock. A macroblock that refers to two reference pictures is known as a bi-predicted macroblock.

Luma and chroma samples may be provided at different resolutions, or sampling formats. Common sampling formats are 4:4:4, 4:2:2 and 4:2:0. In 4:4:4 sampling, luma and chroma samples are at the same resolution. In 4:2:2 sampling, chroma samples are at half the horizontal resolution of luma samples. In 4:2:0 sampling, chroma samples are at half the horizontal and vertical resolution of luma samples. A macroblock in the 4:2:0 chroma format consists of a 16x16 Y sample region, an 8x8 Cb sample region and an 8x8 Cr sample region representing the picture at the location of the macroblock.
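The per-macroblock plane sizes implied by these sampling formats can be computed with a small helper. This is an illustrative sketch only; the names (`CHROMA_DIVISORS`, `macroblock_plane_sizes`) are my own, not from the specification.

```python
# Chroma subsampling divisors (horizontal, vertical) for the common
# sampling formats described above.
CHROMA_DIVISORS = {"4:4:4": (1, 1), "4:2:2": (2, 1), "4:2:0": (2, 2)}

def macroblock_plane_sizes(chroma_format, mb_size=16):
    """Return (luma, chroma) plane sizes, each as a (width, height)
    tuple, for one macroblock in the given chroma format."""
    dx, dy = CHROMA_DIVISORS[chroma_format]
    return (mb_size, mb_size), (mb_size // dx, mb_size // dy)
```

For 4:2:0 this reproduces the 16x16 Y and 8x8 Cb/Cr regions described above.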
Predicted and bi-predicted macroblocks are divided into one or more partitions, in blocks of 16x16, 16x8, 8x16 or 8x8 luma samples (and corresponding chroma samples) each, filling the entire macroblock. 8x8 partitions may be further divided into smaller partitions of 4x8, 8x4 or 4x4 luma samples (and corresponding chroma samples).

For each partition, predicted sample data is generated by referring to data in previously decoded pictures, offset horizontally and vertically by an amount specified by a motion vector associated with the partition. A filtering operation is applied to calculate the sample data for the partition. Filtering is performed horizontally and vertically, with the type of filtering operation dependent on the fractional component of the partition's motion vector. For example, in H.264 the fractional component allows specification of reference sample data aligned down to one quarter of the spacing of the available samples.

Slices are an arrangement of macroblocks in a defined order within each picture. The defined slice types are I, P and B slices. Inter-coded macroblocks are either predicted or bi-predicted macroblocks and may occur in a P slice or a B slice. Intra-coded macroblocks have no motion prediction and can occur in any I, P or B slice.

A video exists in three dimensions: the horizontal spatial axis, the vertical spatial axis and the temporal axis. Intra-coding exploits spatial redundancies within a picture. Inter-coding exploits temporal redundancies between pictures, taking advantage of the similarities between pictures. A picture store is required at both the encoder and the decoder to make reference pictures available. Rather than encoding pictures separately, inter-coding encodes the difference between a current picture and one or more reference pictures, known as the residual of the picture.
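The growth of the reference region with fractional motion can be sketched as follows. This assumes an H.264-style 6-tap luma interpolation filter (so a fractional direction needs 5 extra samples); the function name and parameters are hypothetical, chosen for illustration.

```python
def luma_reference_size(part_w, part_h, frac_x, frac_y, taps=6):
    """Reference region (width, height) needed for one partition,
    assuming an H.264-style 6-tap luma interpolation filter: taps-1
    extra samples in each direction whose fractional motion-vector
    component is non-zero."""
    return (part_w + (taps - 1 if frac_x else 0),
            part_h + (taps - 1 if frac_y else 0))
```

Under this assumption a 4x4 partition with fractional motion in both directions needs a 9x9 reference region, matching the sub-pixel case discussed in this specification.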
The amount of detail remaining in the residual picture is reduced by compensating for motion between the current picture and the reference picture(s). Motion vectors, which consist of an integer and a fractional component, are associated with each partition of an inter-predicted macroblock. Motion vectors specify to the decoder the spatial offset in the reference picture from which sample data should be obtained. Motion compensation refers to interpolated samples in the reference picture(s), known as sub-pixel motion compensation, when the motion vector contains a non-zero fractional component. H.264 supports half-pixel and quarter-pixel sub-pixel motion compensation. For a 4x4 partition with sub-pixel motion compensation, a 9x9 reference sample region is required.

Fig. 1 is a block diagram of a video processing architecture according to an embodiment of the invention. It shows an encoded video bit-stream 101 input into the video decoding system 102. The video decoding system 102 decodes the video bit-stream 101 to generate a video picture. The video picture is passed to a display engine 105 via the output buffer 103. The display engine 105 is connected to a display device 106.

The video decoding system 102 accesses the system memory 104. System memory 104 retains video pictures for reference when decoding future pictures. The display engine 105 controls the display of the decoded video data from the output buffer 103 on the display device 106. The display device 106 displays video data output from the video decoding system 102 via the output buffer 103. Video streams 101 are input into the video decoding system 102 via the video bit-stream external interface.

Video Decoding System Architecture

Fig. 2 is a block diagram of the internal components of a preferred Video Decoding System 200, which may be used as the system 102.
Processing Stages

The system 200 as illustrated includes five video decoding processing stages: Syntax Processing 204, Residual Processing 206, Prediction Assistant Processing 208, Prediction Processing 209 and Post Processing 210. Each processing stage performs operations according to the video format of the encoded bit-stream. Each processing stage performs specific operations on data associated with one macroblock from the video stream and then outputs the results of those operations to the next processing stage. This process is repeated until the video stream is fully decoded, or until there is an interruption request from the operation controller.

The first processing stage is the Syntax Processing stage 204. The Syntax Processing stage 204 receives the encoded bit-stream as input via its input core interface bus 203 and performs syntax decoding operations on that encoded bit-stream according to the video codec format of the encoded bit-stream. The output of the Syntax Processing stage 204 is a decoded series of syntax elements associated with the current macroblock. Decoded syntax elements associated with reference pictures and motion vectors are output to the Prediction Assistant 208; other decoded syntax elements are output to the Residual Processing stage 206 via a core interface bus 207.

The second processing stage is the Residual Processing stage 206. The Residual Processing stage 206 receives a series of syntax elements associated with a macroblock from the Syntax Processing stage 204 via the core interface bus 207. The Residual Processing stage 206 performs reverse quantisation and transformation operations on the data from the syntax elements to generate the current macroblock of residual image data. The macroblock of residual image data is output to the Prediction Processing stage 209 via a core interface bus 216.

The third processing stage is the Prediction Assistant 208.
The Prediction Assistant 208 uses reference picture information and motion vectors to fetch reference data for the current macroblock. A request for reference data is sent to the System Memory Interface Controller 217, which retrieves the required data from System Memory 104 via a Memory Bus 218. The reference data is sent to the Prediction Processing stage 209 to be combined with the residual macroblock from the Residual Processing stage 206 to reconstruct the current macroblock. The Prediction Assistant stage 208 is only used when reference data is required to process the current macroblock.

The fourth processing stage is the Prediction Processing stage 209. The Prediction Processing stage receives a macroblock of the residual image from the Residual Processing stage 206 and, if required for this macroblock, a macroblock of reference data from the Prediction Assistant stage 208. The Prediction Processing stage 209 combines the macroblock of residual data with the macroblock of reference data to reconstruct the current macroblock. The reconstructed macroblock is sent to the Post Processing stage 210 via a core interface bus 213.

The fifth processing stage in the system is the Post Processing stage 210. The function performed by the Post Processing stage 210 depends upon the video codec format of the video stream being decoded. For example, when a stream encoded with the H.264 video codec is being decoded by the video decoding system, the Post Processing stage 210 performs deblocking filtering operations according to the H.264 video codec standard. The Post Processing stage 210 requests the System Memory Interface Controller 217 to store the final macroblock in the System Memory 104.

Operation Controller

The Video Decoding System 200 contains an Operation Controller 205 which controls the operation of the video decoding system. The Operation Controller 205 has an interface external to the video decoding processing system.
The Operation Controller 205 is connected to each processing stage 204, 206, 208, 209 and 210 via control buses. The Operation Controller 205 receives information relating to the status of the processing stages via these control buses. The Operation Controller can also send control commands to the processing stages via these control buses to control the operation of the video decoding system.

Input Interface Block

The input interface block 201 receives video stream data via the input interface 202 from the input channels 103 within the input stream manager 102 and forwards the video stream data to the first processing stage 204 via a core interface bus 203. The input interface block 201 also receives control information from the external system 211 via the input interface 202. This control information is sent to the Operation Controller 205 via interface 215.

Sample Generation

The sample generation operation may require multiple input samples to produce one output sample, with the operation applied to each sample in the partition. Multiple input samples are required when fractional values are present in the corresponding motion vector for the luma channel in the horizontal or vertical direction. For example, in H.264 a fractional horizontal component of the motion vector requires a 6-wide by 1-high input sample array from the reference picture to produce one output sample, and a fractional vertical component requires a 1-wide by 6-high input sample array. For the chroma channel, the data dependencies for each chroma sample depend on the motion vector position on the 4:2:0 sample grid. This position is specified by the fractional motion vector value and the lowest bit of the motion vector. For example, in H.264, when the horizontal and vertical positions are both zero, one reference chroma sample is required.
When the horizontal position is nonzero and the vertical position is zero, a 2-wide by 1-high set of reference chroma samples is required. When the horizontal position is zero and the vertical position is nonzero, a 1-wide by 2-high set of reference chroma samples is required. When the horizontal and vertical positions are both nonzero, a 2-wide by 2-high set of reference chroma samples is required.

Fig. 9 shows the relationship between the luma samples being generated for a given 4x4 block and the input samples required to generate those output samples when fractional spacing is used horizontally and vertically. The required samples 903, from the reference picture, are used for sample generation in the current partition 902. The dimensions of the set of required samples 903 may be larger than the current partition 902 due to the extra sample dependencies introduced by fractional motion vector components. The required samples 903 may extend outside the macroblock boundary 901, depending on the partition's location and motion vector.

Prediction Components

Fig. 3 shows the prediction components in the video decoding system 200. 307 is the system memory used to store the current and previous picture sample data. 309 is the System Memory Interface Controller 217. 310 is the Prediction Processing stage 209. The Prediction Assistant block 301 co-ordinates the operation of the prediction processing stage 310 and drives the memory interface controller 309.

The bus arbiter 305 enables multiple devices inside the video decoding system 200 to access the system memory 307. The bus arbiter 305 accepts inputs from the other processor cores 308 in the video decoding system 200 and from the direct memory access controller 306. The direct memory access controller 306 fetches sample data from reference pictures under control of the prediction assistant 301.
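The per-sample chroma dependency rule from the Sample Generation section above reduces to a two-line function. This is an illustrative sketch; the name `chroma_reference_size` is hypothetical.

```python
def chroma_reference_size(pos_x, pos_y):
    """Reference chroma samples (width, height) needed per output chroma
    sample in H.264 4:2:0 prediction: 2 samples in each direction whose
    position on the chroma sample grid is non-zero, otherwise 1."""
    return (2 if pos_x else 1, 2 if pos_y else 1)
```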
Sample data fetched by the direct memory access controller 306 is written to a local memory buffer 304. The local buffer 304 receives reference sample data without any address information and stores it such that each new reference sample data fetch is stored at a monotonically increasing address. The local memory buffer 304 is fixed in size; when it reaches full capacity, data at the lowest address is discarded to avoid exceeding that capacity.

The prediction assistant 301 controls the fetching of reference sample data by the direct memory access controller 306 and the sample generation block 302. The sample generation block 302 is instructed to retrieve reference sample data from the local memory buffer 304, and operates on the fetched data according to the specified motion vector fractional component for the partition currently being processed by the prediction assistant 301. The processed sample data is then output to a processing block 303, which recombines the processed sample data with residual data to produce a picture.

Fig. 4 shows a representation of the sample storage for one luma fetch from the system memory 307. Samples 410 are retrieved by performing a read transaction from the system memory. Transactions are always aligned to the boundaries of a macroblock 400.

Fig. 5 shows a representation of the sample storage for one fetch of chroma samples from the system memory 307. Fetches are aligned to the macroblock grid 501 and, due to the use of the 4:2:0 chroma format, cover an area depicted as 502. There are two chroma channels, Cr and Cb, with the samples of each stored interleaved in the system memory 307.

Sample Data Fetch Format

Fig. 7 shows the required data fetched from a reference picture in system memory 307 for a partition. 701 shows a macroblock boundary.
The samples passed to the sample generation block 302 are shown in 703. In this figure, all samples to be generated 902 are contained within a single macroblock 701. The samples read from system memory 307 are shown in 702 and contain all samples required by the sample generation block 302. The samples read from system memory 307 are at least the same size as the samples required by the sample generation block 302 and may extend to the left and/or to the right of the required samples, to the next macroblock boundary 701. Several samples may be read from system memory 307 but not used by the sample generation block 302 for the current partition.

Fig. 8 shows the required data fetched from a reference picture in system memory for a partition. 801 shows a macroblock boundary. The samples passed to the sample generation block are shown in 803. In this figure, the samples required by the sample generation block fall across a boundary between two macroblocks 801. The samples read from system memory 802 contain all samples required by the samples 803 passed to the sample generation block.

The prediction assistant 301 needs to track which samples have been read from the system memory 307 and stored in the local memory buffer 304. This tracking is performed by the circular memory buffer 311. Fig. 10 shows the format of a register used to store a fetch bit-map 1004 of fetches from system memory 307 which are stored in the local memory buffer 304. One bit (1002, 1003) of the fetch bit-map 1004 represents one fetch from system memory. The bits in the fetch bit-map 1004 are arranged as a two-dimensional array of fetch-bits, corresponding to the region of the current macroblock and the surrounding area. The dimensions of the two-dimensional array define the extent of the fetch bit-map 1004 surrounding the current macroblock and are limited by the hardware timing constraints of the integrated circuit used to implement the register. The bits used to represent the current macroblock are shown in 1003, the bits used to depict the fetches neighbouring the current macroblock are shown in 1002, and non-existing bits, depicting the rest of the image, are shown in 1001. These non-existing bits 1001 are not implemented in hardware, so the presence of fetches in the local memory for this part of the picture is not recorded in the bit-map.

Circular Memory Buffer

Fig. 13 shows the layout of the circular memory buffer 1301, contained within the Prediction Assistant 301. Each location in 1301 corresponds to a bit in the two-dimensional fetch bit-map 1002. When a bit in the fetch bit-map 1004 is set, the corresponding location in the circular memory buffer 1301 contains the address at which the fetched data may be read from the local memory buffer 304. The circular memory buffer is a two-dimensional array: for each column the row data is stored consecutively, with each new column stored such that the first row of the current column is consecutive to the last row of the previous column. These column start addresses are depicted in 1310. When addressing the circular buffer 1301, a row and column address is provided, the column address being offset by addition of a value and then taking the modulus with the number of columns in the circular memory buffer 1301.

In order to perform the sample generation for the region 803, the fetches 802 are performed. Fig. 15 shows a flow-chart for a process 1500 of updating the circular memory buffer 1301.
In an exemplary implementation, the fetch bit-map 1101 stored in the prediction assistant, retained from the previous macroblock 1501, has bits set in 1002 and 1003 indicating which of these fetches have already been performed. The prediction assistant operates on each partition in the macroblock, generating a fetch bit-map 1004 indicating which fetches need to be performed for that partition (1111, 1502). If any fetches required for the partition fall outside the area covered by the fetch bit-map 1004, that partition is handled by a mechanism separate from this invention. The prediction assistant determines which fetches need to be performed by taking the intersection (1103, 1503) of the fetches required for the current partition 1111 and the inverted bits of the fetch bit-map containing the fetches already performed 1110. The output of the intersection operation 1103 is a fetch bit-map 1104 representing the fetches required to generate samples for the current partition 1112. These fetches are performed in step 1504 and the data is written to the local memory buffer in step 1505. The new fetches 1104 are then logically ORed back into the fetch bit-map 1101 stored in the prediction assistant in step 1506. This prevents duplicate fetches from system memory for the current macroblock.

Advancing to the Next Macroblock

When the last region in the current macroblock has been processed by the prediction assistant, processing may advance to the next macroblock. Fig. 12 shows the current fetch bit-map 1201 being advanced for the next macroblock. The current fetch bit-map 1201 is shifted by one bit to the left and the rightmost bits are all cleared 1202. This shifted fetch bit-map 1202 is stored in the Prediction Assistant to track which fetches have been performed and stored in the local memory buffer 304.
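The update and advance behaviour described above can be modelled in a few lines of Python. This is a toy software model, not the register hardware of the specification; the class and method names (`FetchTracker`, `needed`, `record`, `advance`) are hypothetical.

```python
class FetchTracker:
    """Toy model of the fetch bit-map and circularly addressed buffer."""

    def __init__(self, rows, cols):
        self.rows, self.cols = rows, cols
        self.done = [[0] * cols for _ in range(rows)]     # fetch bit-map
        self.addr = [[None] * cols for _ in range(rows)]  # circular buffer
        self.base = 0                                     # circular base column

    def needed(self, required):
        # Intersection of the required bits with the inverted done bits.
        return [[required[r][c] & (1 - self.done[r][c])
                 for c in range(self.cols)] for r in range(self.rows)]

    def record(self, r, c, local_addr):
        # Set the fetch bit and store the local-buffer address, using
        # modulo column addressing into the circular buffer.
        self.done[r][c] = 1
        self.addr[r][(c + self.base) % self.cols] = local_addr

    def lookup(self, r, c):
        return self.addr[r][(c + self.base) % self.cols]

    def advance(self):
        # Shift the bit-map one position left, clearing the rightmost column.
        for r in range(self.rows):
            self.done[r] = self.done[r][1:] + [0]
        # Advance the circular base so stored addresses remain reachable
        # at their shifted bit-map positions, without moving any data.
        self.base = (self.base + 1) % self.cols
```

After `advance()`, a fetch recorded at column c is found at column c-1, mirroring the macroblock moving one position to the right while the Local RAM Buffer addresses stay put.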
The circular memory buffer (311, 1301, 1402), arranged as a two-dimensional array of locations in memory, may have its entries accessed relative to one of multiple starting locations 1401, each being a base address; the buffer is thus circularly addressable. A modulo operation is applied to the relative address to prevent it exceeding the boundary of the circular memory buffer 1402, causing a wrap-around to the start of the circular memory buffer 1402 instead. The starting location is incremented when advancing to the next macroblock to the right of the current macroblock.

Memory Format

In one embodiment, the fetch bit-map 1004 represents two luma fetches aligned to the macroblock boundary 601, occurring on consecutive even raster scan lines 602 and odd raster scan lines 603, and a chroma fetch 605 also aligned to the macroblock boundary 604. The area represented by the luma and chroma fetches on the sample grid is identical, enabling a single fetch bit-map 1004 to be used for both the luma and chroma channels. Samples are stored in the system memory 104 in rows in a particular pattern: a luma even row 602, then a luma odd row 603, followed by chroma 605. The chroma is stored with the samples of both channels interleaved.

INDUSTRIAL APPLICABILITY

The arrangements described are applicable to the computer and data processing industries and particularly to the decoding of video data streams.

The foregoing describes only some embodiments of the present invention, and modifications and/or changes can be made thereto without departing from the scope and spirit of the invention, the embodiments being illustrative and not restrictive.

(Australia Only) In the context of this specification, the word "comprising" means "including principally but not necessarily solely" or "having" or "including", and not "consisting only of".
Variations of the word "comprising", such as "comprise" and 20 "comprises" have correspondingly varied meanings. 1864663_1 886300_speci

Claims (3)

1. A method for storing memory addresses in a buffer to fetch sample data from reference pictures in a system memory for a current partition, said method comprising the steps of:

providing a first two-dimensional array of fetch-bits, representing the fetches that are performed for previous partitions;

generating a second two-dimensional array of fetch-bits representing the fetches required from the reference picture for the partition;

determining a third two-dimensional array of fetch-bits by determining the intersection of the second set of fetch-bits and inverted bits of the first set of fetch-bits;

performing the fetches corresponding to the third two-dimensional array of fetch-bits into a local memory buffer;

setting the bits in the first two-dimensional array of fetch-bits corresponding to the bits set in the third two-dimensional array of fetch-bits; and

updating the corresponding entries in said buffer to the locations of the fetched data in the local memory buffer.
2. A method according to claim 1, wherein:

said first 2D fetch-bit array and said buffer are retained on advancing to the next macroblock;

said first 2D fetch-bit array is shifted to the left for the next macroblock, with the rightmost bits being cleared; and

said buffer is a circularly addressable buffer, with the base address being incremented in modulo fashion when progressing to the next macroblock, corresponding to the shift in the first 2D fetch-bit array.
3. A method according to claim 2, wherein:

chroma is stored in said system memory in chroma format 4:2:0;

samples are stored in said system memory in rows in the pattern: luma even row, luma odd row and chroma, said chroma being stored with interlacing of the samples between both channels;

said bits in the first, second and third 2D fetch-bit arrays represent a three-row fetch of two luma rows and one chroma row; and

said fetches to the local buffer correspond to 3 fetches from main memory containing two luma rows and one chroma row.

Dated this 12th day of December 2008
CANON KABUSHIKI KAISHA
Patent Attorneys for the Applicant
Spruson & Ferguson
AU2008255265A 2008-12-12 2008-12-12 Prediction region memory access reduction Abandoned AU2008255265A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2008255265A AU2008255265A1 (en) 2008-12-12 2008-12-12 Prediction region memory access reduction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
AU2008255265A AU2008255265A1 (en) 2008-12-12 2008-12-12 Prediction region memory access reduction

Publications (1)

Publication Number Publication Date
AU2008255265A1 true AU2008255265A1 (en) 2010-07-01

Family

ID=42289181

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2008255265A Abandoned AU2008255265A1 (en) 2008-12-12 2008-12-12 Prediction region memory access reduction

Country Status (1)

Country Link
AU (1) AU2008255265A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109429069A (en) * 2017-08-23 2019-03-05 珠海全志科技股份有限公司 A kind of display stream condensing encoder and display engine
CN109429069B (en) * 2017-08-23 2021-08-24 珠海全志科技股份有限公司 Display stream compression encoder and display engine

Similar Documents

Publication Publication Date Title
KR100907843B1 (en) Caching method and apparatus for video motion compensation
KR100695141B1 (en) Memory access apparatus and method, data write/read apparatus and method for use in image processing system
US20060133504A1 (en) Deblocking filters for performing horizontal and vertical filtering of video data simultaneously and methods of operating the same
EP2061250B1 (en) Deblocking filter
JP5266096B2 (en) Method and apparatus for generating an image data stream, method and apparatus for reconstructing a current image from an image data stream
US20100020879A1 (en) Method for decoding a block of a video image
WO2006073862A1 (en) Offset buffer for intra-prediction of digital video
WO2016032765A1 (en) Chroma cache architecture in block processing pipelines
JP2008061156A (en) Motion picture processing apparatus
US6205181B1 (en) Interleaved strip data storage system for video processing
EP1147671B1 (en) Method and apparatus for performing motion compensation in a texture mapping engine
Wang et al. VLSI implementation of HEVC motion compensation with distance biased direct cache mapping for 8K UHDTV applications
US20050047510A1 (en) Data processing device for MPEG
JPH08294115A (en) Apparatus and method for decoding mpeg
AU2008255265A1 (en) Prediction region memory access reduction
KR101331093B1 (en) Method and Apparatus for Interleaving Reference Pixel within a Single Bank of Video Memory, Video Codec System Including the Same
JP2009130599A (en) Moving picture decoder
KR100556341B1 (en) Vedeo decoder system having reduced memory bandwidth
US20170134740A1 (en) Enhanced data processing apparatus employing multiple-block based pipeline and operation method thereof
EP2073553A1 (en) Method and apparatus for performing de-blocking filtering of a video picture
KR100269426B1 (en) Motion compensator having an improved frame memory
KR100243471B1 (en) Address generator in frame memory
KR100243470B1 (en) Address generator in frame memory
KR100672376B1 (en) Motion compensation method
KR19990005602A (en) Dual Prime Field Prediction Method in Frame Memory

Legal Events

Date Code Title Description
MK4 Application lapsed section 142(2)(d) - no continuation fee paid for the application