WO2005088974A1 - Efficient block matching architecture - Google Patents

Efficient block matching architecture

Info

Publication number
WO2005088974A1
WO2005088974A1 (PCT/IE2005/000023)
Authority
WO
WIPO (PCT)
Prior art keywords
block
pixels
absolute differences
value
processing
Prior art date
Application number
PCT/IE2005/000023
Other languages
English (en)
Inventor
Daniel Larkin
Valentin Muresan
Sean Marlow
Noel Murphy
Noel O'Connor
Alan Smeaton
Original Assignee
Dublin City University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dublin City University
Publication of WO2005088974A1


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/90 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using coding techniques not provided for in groups H04N19/10-H04N19/85, e.g. fractals
    • H04N19/93 Run-length coding
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/156 Availability of hardware or computational resources, e.g. encoding based on power-saving criteria
    • H04N19/20 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using video object coding
    • H04N19/21 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using video object coding with binary alpha-plane coding for video objects, e.g. context-based arithmetic encoding [CAE]
    • H04N19/42 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
    • H04N19/43 Hardware specially adapted for motion estimation or compensation
    • H04N19/433 Hardware specially adapted for motion estimation or compensation characterised by techniques for memory access
    • H04N19/436 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation, using parallelised computational arrangements
    • H04N19/50 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/503 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N19/51 Motion estimation or motion compensation
    • H04N19/53 Multi-resolution motion estimation; Hierarchical motion estimation
    • H04N19/557 Motion estimation characterised by stopping computation or iteration based on certain criteria, e.g. error magnitude being too large or early exit

Definitions

  • MPEG-4 is a relatively new multimedia standard targeting low bit-rate video and one of its main innovative features is video and audio object-based processing.
  • MPEG-4 provides a suite of tools to efficiently code audio and visual data objects separately. This is much more powerful than previous generations of the MPEG standard since performing the encoding on an object basis allows much greater levels of interactivity between users and the objects.
  • shape/alpha information as well as texture information must be coded.
  • BAB binary alpha block
  • the BM-based ME algorithms can be further split into two sets of sub-algorithms which have to be optimised together in order to achieve the best results. Firstly, a block-matching (BM) implementation has to be found that is low cost, with low power consumption and a short throughput delay. Secondly, a fast search approach, which determines the search window, is required to reduce the number of points searched and, subsequently, the uncompressed input data set. This reduction leads to computation reduction and, indirectly, to power consumption reductions.
  • the BM-based ME algorithms evaluate the motion vector by minimizing a distance criterion, which is meant to ease the search for the global minimum.
  • the BM algorithm can be based on different distance criteria. Many of them are based on square root, multiplication and division operations.
  • SAD is a simplified mean absolute difference (MAD) criterion.
  • a search strategy finds appropriate candidate blocks within a search window.
  • a block-matching algorithm evaluates a distortion measure, which is a level of similarity, between each candidate block in the search window and the current block in the current frame of video data in order to decide how the current frame should be encoded, e.g. the video data thereof processed, with a view to minimising the transmission bandwidth required between encoder and decoder to accurately reconstruct the encoded video stream at the decoder.
  • SAD is one such distortion measure.
  • a SAD value is calculated for each block match between the current block and the reference blocks in the search window.
  • the SAD formula for a 16 x 16 pixel frame block, known as a macroblock, is: SAD(B_curr, B_ref) = Σ_{i=1}^{16} Σ_{j=1}^{16} |B_curr(i,j) - B_ref(i,j)|
  • B_curr is the block under consideration in the current frame
  • B_ref is the block at the current search location in the search frame.
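For illustration, a minimal software sketch of this macroblock-level SAD (not taken from the patent text; the function and argument names are hypothetical):

```python
def sad_16x16(b_curr, b_ref):
    """Sum of absolute differences between two 16x16 blocks.

    b_curr, b_ref: 16x16 nested lists (rows of pixel values) for the current
    block and the block at the current search location.
    """
    total = 0
    for i in range(16):
        for j in range(16):
            total += abs(b_curr[i][j] - b_ref[i][j])
    return total
```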
  • the reference block with the lowest value SAD is chosen for further processing in order to encode the current block in the current frame. Motion estimation is complete for this macroblock and the algorithm proceeds to the next macroblock in the current frame.
  • the search and block-matching routine is repeated for all frames.
  • Motion estimation therefore requires a suitable search strategy to be used in conjunction with the block-matching algorithm.
  • Some typical search algorithms which are commonly used in the field are outlined below:
    • Full search (exhaustive) algorithm - different hardware and software implementations of the full search ME algorithm;
    • Linear search strategies - the next search step is adjacent to the previous search step: circular search, diamond search, hexagonal search;
    • Reduced-pixel-information search algorithms - reduce the complexity either by decreasing the number of matching points or reducing the pixel-value information (binary algorithms, pixel sub-sampling);
    • Fast exhaustive search algorithms - reduce the computational load by cancelling early the non-optimal search positions from the search window without losing optimality, hence they are also called SAD-steps cancellation search strategies;
    • Fast heuristic search strategies - reduce the computational load by decreasing the number of search positions from a traditional full search algorithm's search window based on a heuristic that does not guarantee optimality.
  • reduced-pixel-information search algorithms reduce the complexity of the block-matching algorithm either by decreasing the number of absolute differences (e.g. pixel sub-sampling) or reducing the pixel-value information (e.g. binary motion estimation algorithms).
  • the reduced-pel information approaches reduce the pel information (usually by edge extraction, frame processing or pel sub-sampling). Once this is carried out, virtually any search strategy can be implemented on a reduced-bit frame representation.
  • Other adaptive reduced-bit BM architectures vary the size of the pel information, so that acceptable compressed image quality is maintained.
  • the drawback of the reduced-pel information approaches is that they do not guarantee an optimum match.
  • Another important category is the binary search algorithm approaches, which are also employed for shape coding.
  • An alternative approach to reduce the power consumption of the motion estimation architecture is to provide memory optimisation.
  • Such implementations aim to make an efficient use of the memory bandwidth and to therefore achieve energy savings over the processing times. They usually either exploit the memory data flow or adopt the more traditional memory banking optimisation techniques. Basically, they re-arrange and re-map the content of the on-chip memory in order to achieve the highest memory access efficiency. This is achieved by a high degree of on-chip memory content re-use, parallel pel information access and memory access interleaving. Power can be saved at memory level by reducing the number of accesses to large frames/memories.
  • a memory interleaving technique has previously been used to alleviate the huge memory bandwidth required for a tree architecture-based motion estimator.
  • Memory interleaving is a redundant pixel distribution technique, which ensures that wherever the accessed block is located, its pixels can each be accessed from a different memory bank in parallel. This was made possible by replicating each pixel's information across each of the memory banks accessible in parallel. However, the price paid is an increase in the memory capacity needed and in the power consumed by the memory modules accessed in parallel.
  • An alternative power saving implementation which has been developed provides a reduced number of SAD operations for the block-matching data path. This results in a reduction of power consumed when compared to a conventional SAD-based block-matching implementation.
  • By eliminating (i.e. cancelling) a significant number of SAD operations, the number of memory accesses will indirectly decrease as well. Greater power savings will therefore result, not only in the architecture's data path, but also in its memory.
  • a reduction in the number of clock cycles (steps) is achieved at the same time. This has implications for the total energy consumption, because the above power is consumed over a shorter period of time.
  • the static power component is also smaller due to the reduced area.
  • Equation 2 " SAD(B curr ,B rcf ) XOR B rcf (i , j)
  • Equation 2 has been previously implemented in a parallel manner in 1 -Dimensional or 2-Dimensional systolic array architectures and regarded as a "solved" problem due to the computation complexity reduction. Having regard to the fact that the motion within the macro-blocks exhibits a high degree of non-uniformity, however, the processing overhead can be further reduced by employing early SAD cancellation techniques.
  • the early exit mechanism provides a method for cancelling a SAD calculation for a particular block in a search area when the calculated SAD value for that block exceeds a minimum SAD value previously calculated for another block in that search area.
  • the early exit mechanism provides polygon matching when motion estimation is carried out for video objects rather than video frames.
  • run length coding is used in conjunction with a reformulated SAD equation to access only relevant data from memory, in order to further reduce said computational complexity. Reducing the number of computations which need to be carried out is of considerable benefit for processing throughput. But reducing the number of SAD computations also leads to smaller dynamic power consumption in the datapath and the memory.
  • an apparatus for processing audio-visual data, including data input means, processing means, memory means and networking means, wherein the memory means stores instructions and the visual data as a sequence of frames defined by a plurality of blocks of picture screen elements (pixels) configured with respective red, green, blue and alpha values and the instructions configure the processing means to process a sum of absolute differences between alpha values of pixels of a first block thereof in a current frame and alpha values of pixels of a first block thereof in a reference frame as a minimum sum of absolute differences. The instructions also configure the processing means to process respective partial sums of absolute differences between the alpha values of the pixels of portions of the block in the current frame and alpha values of pixels of portions of subsequent blocks thereof in the reference frame.
  • the instructions also configure the processing means to cease processing the partial sums of absolute differences when any of the partial sums of absolute differences exceeds the minimum sum of absolute differences.
  • the instructions also configure the processing means to cease processing the partial sums of absolute differences when the accumulation of the partial sums of absolute differences exceeds the minimum sum of absolute differences and to declare the accumulated partial sums of absolute differences as the minimum sum if it is less than the sum of absolute differences processed from the first reference block, such that a motion vector for shape is estimated for encoding of the current frame.
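A minimal sketch of this partial-sum early-termination behaviour, assuming blocks are supplied as rows of pixel values and that one "portion" corresponds to one row (names and granularity are illustrative, not the patent's own implementation):

```python
def sad_with_early_exit(curr_block, ref_block, min_sad):
    """Accumulate per-row partial SADs and stop as soon as the running total
    exceeds the minimum SAD found so far.  Returns the full SAD when it beats
    min_sad, or None to signal that processing was ceased early."""
    acc = 0
    for row_c, row_r in zip(curr_block, ref_block):
        acc += sum(abs(c - r) for c, r in zip(row_c, row_r))
        if acc >= min_sad:       # partial sums already exceed the minimum
            return None          # cease processing this candidate block
    return acc                   # candidate for the new minimum SAD
```

A surrounding search loop would keep min_sad, update it whenever the function returns a value, and move on to the next candidate block whenever it returns None.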
  • the apparatus may be coupled to processing means via a data bus and receive picture screen element (pixel) data therefrom.
  • the apparatus for implementing block-based motion estimation in a system that compresses data in the form of sequential frames or sequential binary alpha planes, the apparatus being adapted to find the best match for a selected block in a current frame from reference blocks within a search area of a previous frame, the best match reference block being the block within the search area whose value for the sum of absolute differences between its pixel values (SAD value) and the pixel values for the selected block is a minimum, the device comprising: a) identifying means for identifying a block within the search area of the previous frame to be a reference block and providing a deaccumulator associated with the reference block, the deaccumulator during the first match associated with the selected block being the maximum value of the bits allocated in the register for storing the block-level minimum SAD value, or subsequently being the current block-level minimum SAD value; b) calculating means for calculating the absolute difference between a pixel value in the selected block of the current frame and a pixel value in the reference block of the
  • a method for processing audiovisual data, the visual data being a sequence of frames, each of which being defined by a plurality of blocks of picture screen elements configured with respective red, green, blue and alpha values, the method comprising the step of processing a sum of absolute differences between alpha values of pixels of a first block thereof in a current frame and alpha values of pixels of a first block thereof in a reference frame as a minimum sum of absolute differences.
  • the method also comprises the step of processing respective partial sums of absolute differences between the alpha values of the pixels of portions of the block in the current frame and alpha values of pixels of portions of subsequent blocks thereof in the reference frame.
  • the method also comprises the step of ceasing the processing of the partial sums of absolute differences when any of the partial sums of absolute differences exceeds the minimum sum of absolute differences.
  • the method also comprises the step of ceasing the processing of the partial sums of absolute differences when the accumulation of the partial sums of absolute differences exceeds the minimum sum of absolute differences and the step of declaring the accumulated partial sums of absolute differences as the minimum sum if it is less than the sum of absolute differences processed from the first reference block, such that a motion vector for shape is estimated for encoding of the current frame.
  • the best match reference block according to the method may preferably be the block within the search area with a minimum value for the sum of absolute differences (SAD value) between its pixel values and the pixel values for the selected block during the first match attempt.
  • the best match reference block according to the method may alternately be the block within the search area with the current block-level minimum SAD value.
  • the method may advantageously comprise repeating steps of identifying a reference block, providing a deaccumulator associated with the reference block, calculating the absolute difference and deaccumulating the absolute difference for each pixel in the reference block until the deaccumulator has a negative value and updating the minimum SAD value if all the pixels in the reference block have been processed, for one or more further reference blocks within the search area of the previous frame.
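The de-accumulation described above can be sketched in software as follows (an illustrative analogy of the hardware register behaviour; the names are assumptions):

```python
def match_with_deaccumulator(curr_block, ref_block, min_sad):
    """Load the current minimum SAD into a de-accumulator, subtract each
    per-pixel absolute difference, and abandon the candidate as soon as the
    de-accumulator goes negative (it can no longer beat the best match)."""
    dacc = min_sad
    for row_c, row_r in zip(curr_block, ref_block):
        for c, r in zip(row_c, row_r):
            dacc -= abs(c - r)
            if dacc < 0:         # sign change: cancel this reference block
                return None
    # All pixels processed without a sign change: a new (or equal) best match;
    # its SAD is the amount actually de-accumulated.
    return min_sad - dacc
```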
  • the method is preferably conducted in parallel for at least one block of a macroblock, wherein if at least one of the blocks satisfies the condition of the deaccumulator reaching a negative value, the remaining steps are terminated for all of the blocks, and wherein the updating condition comprises updating the minimum SAD value by the summed value of the deaccumulator associated with each block, provided their summed value is greater than zero.
  • a macroblock may usefully comprise four 8x8 blocks.
  • a macroblock may comprise sixteen 4x4 blocks.
  • the blocks of the macroblocks are subdivided using pixel sub-sampling.
  • the calculating and updating is implemented using hardware components.
  • the calculating and updating is implemented in software.
  • the frames are associated with a video object having a video object plane containing information as to those pixels which form the object and those pixels which are outside the object, and wherein the steps are only carried out for those pixels which are inside the video object.
  • the video object is analysed in the same way as a video frame.
  • This invention can therefore work on top of the block-matching architecture mentioned when the video sequence consists of segmented video objects (i.e. MPEG-4 VOPs).
  • the video object plane may be run length encoded.
  • the method may further comprise a memory resource, which provides for the storage of the reference, previous and the next frame.
  • Run length code of the current block is preferably generated in parallel to the processing of the minimum sum of absolute differences, whereby only relevant pixel data is accessed to process the partial sums of absolute differences.
  • the run length code is advantageously generated as a sequence of pairs of elements, a first element being equal to the number of pixels until the next white pixel is encountered and a second element being equal to the number of contiguous white pixels.
  • Figure 1 shows a preferred embodiment of the present invention in an environment, including at least one audio-visual data processing terminal and at least one remote terminal;
  • Figure 2 provides an example of the audio-visual data processing terminal, which includes processing means, memory means and communicating means;
  • Figure 3 details the processing steps according to which the audio-visual data processing device of Figures 1 and 2 operates, including a step of loading instructions and processing visual data at runtime;
  • Figure 6 further details the Motion Vector for Shape processing step of Figure 5, including a step of processing a Sum of Absolute Differences from frame blocks shown in Figure 4;
  • Figure 7 further details the Sum of Absolute Differences processing step of Figure 6, including a step of sub-sampling pixels of frame blocks shown in Figure 4;
  • Figure 8 further details the Motion Vector for Shape processing step of Figure 5 in an alternative embodiment, including a step of processing run length encoding from a frame block shown in Figure 4;
  • Figure 9 shows an alternative embodiment of the present invention as an alternative configuration of the terminal shown in Figures 1 to 8, comprising a plurality of processing elements;
  • Figure 10 provides a detailed representation of a processing element shown in Figure 9;
  • Figure 12 illustrates a BM sequence in which a final update is required
  • Figure 14 provides a graphical representation of the energy-throughput trade-off
  • Figure 15 shows another alternative embodiment of the present invention as yet another alternative configuration of the terminal shown in Figure 9;
  • Figure 16 is a block diagram representation of a block-wise ME memory architecture according to the present invention.
  • Figure 17 is a block diagram representation of a video frame memory architecture according to the present invention.
  • FIG. 1 A preferred embodiment of the present invention is shown in an environment in Figure 1, which includes at least one network-connected apparatus under the form of a mobile telephone handset 101.
  • Mobile phone 101 is configured with audio-visual data capturing means, such as a built-in camera 102, and may connect to remote terminals over any of a plurality of wireless networks 103 including a low-bandwidth Global System for Mobile Communication ('GSM') network, or higher-bandwidth General Packet Radio Service ('GPRS') network, or yet higher-bandwidth 'G3' network.
  • 'GSM' Global System for Mobile Communication
  • 'GPRS' General Packet Radio Service
  • Mobile phone 101 receives or emits data encoded as a digital signal over wireless networks 102, wherein said signal is relayed respectively to or from mobile phone 101 by the geographically-closest communication link relay 104 of a plurality thereof.
  • Said plurality of communication link relays allows said digital signal to be routed between mobile phone 101 and its intended recipient or from its remote emitter.
  • Mobile phone 101 may also connect to a remote, but proximate terminal over a local wireless network 105 such as the medium bandwidth 'BluetoothTM' network.
  • Mobile phone 101 may also connect to a remote terminal 108, such as a desktop computer or a portable computer, a variation thereof being a personal digital assistant, over a Wide Area Network ('WAN') 109, such as the Internet, by way of a remote gateway 110.
  • Gateway 110 is for instance a communication network switch coupling digital signal traffic between wireless telecommunication networks, such as the network 102 within which the example wireless data transmission 105 takes place, and said wide area network (WAN) 109.
  • Said gateway 110 further provides protocol conversion if required, for instance if mobile phone 101 broadcasts (106) data to said terminal 108, which is itself only connected to the WAN 109.
  • Mobile phone 101 firstly includes processing means in the form of a general-purpose central processing unit (CPU) 201, which is for instance an Intel ARM X-Scale processor manufactured by the Intel Corporation of Santa Clara, California, USA, for acting as the main controller of mobile phone 101 and processing data.
  • Mobile phone 101 next includes memory means 202, which includes non-volatile random-access memory (NVRAM) 203 totalling 512 kilobytes in this embodiment.
  • NVRAM 203 provides non-volatile storage of instructions and initialising data for processing means 201 when not in operation.
  • Memory means 202 preferably also includes volatile random-access memory (RAM) 204 totalling 16 megabytes in this embodiment.
  • RAM 204 provides volatile storage of data for processing means 201 in operation.
  • CPU 201, NVRAM 203 and RAM 204 are connected by a data input/output bus 205, over which they communicate and to which further components of mobile phone 101 are similarly linked in order to provide wireless communication functionality and receive input data.
  • Network communication functionality is provided by a modem 206, which provides the interface to external communication systems, such as the GSM, GPRS or G3 cellular telephone networks 103 shown in Figure 1.
  • An analogue-to-digital converter 207 receives analogue voice data from the user of mobile phone 101 through a microphone 208, or from remote devices connected to the GSM network only and processes it into digital data.
  • An aerial 209 is preferably provided to amplify the network communication operation.
  • Analogue input data may be received from microphone 208 and digital data may be locally input with data input interface 210, which is a keypad.
  • a third data input interface 211 is provided as a CCD camera configured to capture visual data as a digital video frame defined by a plurality of picture screen elements (pixels), each having respective red, green, blue and alpha numerical values, wherein said alpha value may be zero or one.
  • Power may be provided to mobile phone 101 by an electrical converter 212 connected to an internal module battery 213.
  • Output data in addition to digital output data broadcast over networks 103, includes local visual data output by CPU 201 to a Video Display Unit 214 and audio data output by CPU 201 to a Speaker Unit 215. Said arrangement is described herein by way of a generalised data processing architecture only, in order to not unnecessarily obscure the present description, and it will be readily understood by those skilled in the art that such arrangement may vary to a fairly large extent.
  • visual data is input from CCD camera 211 and, in the example, is a shot of a car driving by a tree.
  • Said visual data is a frame of pixels having respective RGB and alpha values, said RGB values having respective values ranging between 0 and 256 representative of a hue in a colour look-up table and said alpha value being equal to 255 (white) or 0 (black), according to whether the object the alpha-valued pixel represents belongs to a VOP image region or not, respectively.
  • the input to be coded can be a VOP image region of arbitrary shape and the shape and location of the region can vary from frame to frame.
  • Successive VOP's belonging to the same physical object in a scene are referred to as Video Objects (VO's) - a sequence of VOP's of possibly arbitrary shape and position.
  • the shape (alpha), motion and texture information of the VOP's belonging to the same VO is encoded and transmitted or coded into a separate VOL (Video Object Layer).
  • relevant information needed to identify each of the VOL's - and how the various VOL's are composed at the receiver to reconstruct the entire original sequence is also included in the bitstream. This allows the separate decoding of each VOP and the required flexible manipulation of the video sequence.
  • the video source input assumed for the VOL structure either already exists in terms of separate entities (i.e. is generated with chroma-key technology based on the afore-mentioned alpha channel) or is generated by means of on-line or off-line segmentation algorithms.
  • the coding structure simply degenerates into a single layer representation which supports conventional image sequences of rectangular shape: the MPEG-4 content-based approach can thus be seen as a logical extension of the conventional MPEG-1 and MPEG-2 coding approach towards image input sequences of arbitrary shape.
  • the shape information of a VOP is coded prior to coding motion vectors based on the VOP image window Macroblock grid and is available to both encoder and decoder. In subsequent processing steps only the motion and texture information for the Macroblocks belonging to the VOP image are coded (which includes the standard Macroblocks as well as the shape Macroblocks), wherein the shape information is referred to as "alpha planes" in the context of the MPEG-4 VM.
  • visual data is captured over a period of time, resulting in a plurality of sequential frames, the ordered processing for output of which translates into a video sequence depicting motion of said objects.
  • said sequence of frames includes I-frames, P-frames and B-frames, which are interleaved in a sequence such as IBBPBBP(...) or IBPBPBPBP(...).
  • the former is more difficult to encode but provides a higher compression ratio than the latter.
  • said I-, P-, and B-frames are encoded by the mobile phone 101.
  • I-frames Intra-coded frames
  • P-frames Predicted frames
  • a new P-frame is first predicted by 'predicting' the values of each new pixel from processing the last I- or P-frame.
  • said B-frames (Bi-directional frames) are encoded as differences from the last or next I- or P-frame according to the present invention.
  • B-frames use prediction as for P-frames, but for each block of 16 by 16 pixels thereof, either the previous I- or P-frame is used, or the next I- or P-frame.
  • a question is asked at step 305, as to whether mobile phone 101 is in a broadcast mode and should distribute the encoded frame output from encoding step 304 across a network it is connected to, for instance to remote mobile phone 107 or remote computer 108.
  • step 305 If the question of step 305 is answered positively, the communication instructions loaded at step 302 configure CPU 201 to broadcast said output encoded data to said remote recipients 107, 108 at step 306, or possibly to both of said recipients 107, 108 in a particularly advantageous embodiment. Thereafter, another question is asked at step 307, as to whether more input data has been received from CCD camera 211, which is also asked if said mobile phone 101 is not in a broadcast mode and the question of step 305 is answered negatively. If the question of step 307 is answered positively, control proceeds back to step 303, whereby the next frame is input, encoded and broadcast if appropriate as described hereinbefore, and so on and so forth.
  • FIG. 4 An example of the input data processed at the apparatus of Figures 1, 2 and 3, including alpha data of two frames and a difference therebetween, is shown in Figure 4.
  • a first frame 401 in a sequence 402 thereof depicts a car 403 driving by a tree 404.
  • Said frame 401 is a reference frame at time t-1.
  • a second frame 405 in said sequence 402 depicts said car 403 still driving by said tree 404, having moved in relation thereto and by a distance relative to the previous frame 401, figuratively shown as a dashed line arrow.
  • Said frame 405 is a current frame at time t0.
  • the car 403 constitutes a first object and the tree 404 constitutes a second object in the sequence 402 of MPEG4 frames of the example, and the description will hereinafter focus upon the encoding of said car object only, for the purpose of not unnecessarily obscuring the present description.
  • This binary pixel value is typified by the representation thereof as white or black pixels in alpha frames 406, 407.
  • current BAB 408 in frame 407 is processed by the instructions described hereinbefore, in order to predict an optimum encoding parameter which is a motion vector for shape for the encoding of said BAB 408, and the eventual encoding of said frame 407 when all such current BABs 408 have been processed in said frame 407.
  • Partial SAD values are concurrently processed for arrays of pixels subsampled from said current BAB and each reference BAB in turn according to the search sequence, wherein said sequential SAD processing according to said search pattern is cancelled as soon as a minimum SAD value is exceeded by said partial SAD value.
  • the step 304 of encoding input data is further detailed in Figure 5.
  • a first check is performed upon the frame type by a question asked at step 501, as to whether said current frame 405 is a P-frame or a B-frame. If the question of step 501 is answered negatively, the current frame 405 is an I-frame and motion estimation processing is redundant, whereby the frame is encoded according to the known prior art and control proceeds to the question of step 307 for selection and frame type questioning of the next frame in the video sequence.
  • step 501 the question of step 501 is answered positively and motion estimation may be processed and instructions loaded at step 302 configure CPU 201 to select a first macroblock 408 of 16 by 16 RGB pixels in the current frame 405 at step 502 and to select a corresponding macroblock 409 of 16 by 16 RGB pixels in the reference frame 401 at step 503, so as to attempt a match in order to output a motion estimation and derive a motion vector for shape predictor therefrom.
  • reference macroblock RGB data is processed to output a first valid motion vector predictor from motion vectors present in the reference shape frame (motion vector for shape) or texture frame (motion vector associated with the RGB texture) pairs, to select it as a motion vector predictor for shape (MVPS).
  • MVPS motion vector predictor for shape
  • An alpha macroblock has an invalid motion vector if it is transparent, or if it belongs to an intra block, e.g. an I-frame, which has been eliminated by question 501.
  • the motion vector of a texture macroblock is invalid if it is transparent, or if the current Video Object Plane belongs to a B-VOP, or if the current Video Object has binary information only and no texture information. If no neighbouring vector is valid, said first valid MVPS is set to zero.
  • said first valid MVPS is then processed for motion compensation (MVPS MC).
  • the encoding instructions configure CPU 201 to select a first sub-block of 4 by 4 pixels of the current 16 by 16 pixels macroblock 408 and a corresponding first sub-block of 4 by 4 pixels of the reference 16 by 16 pixels macroblock 409 tentatively matched by means of the MVPS MC at step 506, such that the comparison of the respective alpha data thereof yields an answer to a question asked at step 507, as to whether the comparison error is less than a threshold AlphaTH, disclosed by Noel Brady in "MPEG-4 Standardized Methods for the Compression of Arbitrarily Shaped Video Objects" (IEEE Transactions on Circuits and Systems for Video Technology, Vol. 9, No. 8, December 1999).
  • step 507 If the question of step 507 is answered positively, another question is asked as to whether another sub-block of 4 by 4 pixels remains to be processed for error comparison in the current 16 by 16 pixels macroblock 408 at step 508. If the question of step 508 is answered positively, control returns to step 505, whereby the next sub-block may be selected and processed for error comparison, until such time as all such sub- blocks have been processed and the question of step 508 is answered negatively, whereby said macroblock 408 may be encoded at step 509 with said motion-compensated, first valid motion vector for shape predictor (MVPS).
  • MVPS motion vector for shape predictor
  • step 507 the question of step 507 is answered negatively, signifying that at least one pair of current and reference sub-blocks features an error greater than AlphaTH, which would result in video artefacts if the current frame was to be encoded on the basis of said MVPS MC of step 505.
  • Control proceeds to step 510, wherein a motion vector for shape (MVS) must be processed and output according to the present invention in order to satisfactorily encode said macroblock 408 and prevent said artefacts.
  • a final question is asked at step 512 as to whether another current macroblock remains to be processed in the current frame 405.
  • step 512 If the question of step 512 is answered positively, control returns to step 502 for the selection of said next macroblock and the subsequent motion estimation and encoding thereof. Eventually, all macroblocks of the current frame 405 have been encoded, thus the frame has been encoded and the question of step 512 is answered negatively, whereby said frame may be broadcast according to step 306.
  • the step 510 of processing a motion vector for shape for optimum, power-efficient and resource-efficient encoding of digitised input visual data is further detailed in Figure 6.
  • the search strategy parameters for finding a matching block suitable for outputting a motion vector for shape are declared which, in the example, define an exhaustive search 411.
  • CPU 201 selects a first reference macroblock BAB_n of 16 by 16 binary alpha pixels, which is then sub-sampled at step 6025 into a plurality of reference blocks PB_n of 8 by 8 binary alpha pixels.
  • a first question is asked at step 604, as to whether the SAD_n value output at step 602 is the first such output value.
  • the question of step 604 is always answered positively when the first reference BAB_1 macroblock is processed, whereby said SAD_1 value is temporarily stored in RAM 204 as a minimum SAD_n value.
  • the question of step 604 is always answered negatively, until such time as another current BAB macroblock is selected in a subsequent iteration of the processing step 502.
  • a second question is then asked at step 605, as to whether said temporarily stored minimum SAD_n value output at step 603 should be updated as a result of the output of step 603.
  • the answer to the question of step 605 is provided by the processing within step 603, which will be further described hereinbelow, and is always answered positively when the first reference BAB_1 macroblock is processed.
  • step 605 if the question of step 605 is answered positively, the SAD_n value output at step 603 is smaller than the currently stored minimum SAD_1 value, whereby said minimum SAD value is updated from said SAD_1 value to said SAD_n value at step 606.
  • If the question of step 605 is answered negatively, the SAD_n value output at step 603 is larger than the currently stored minimum SAD value and a third question is asked at step 6065, as to whether another reference block PB_{n+1} remains to be processed in the reference macroblock BAB_n according to the search parameters of step 501.
  • step 6065 If the question of step 6065 is answered positively, control returns to step 6025 for the selection of said next reference block PB_{n+1} and the processing of a partial SAD therefrom, until such time as all sub-samples of the currently-selected reference macroblock BAB have been processed.
  • the question of step 6065 is therefore eventually answered negatively and a fourth question is asked at step 607, as to whether another reference macroblock BAB_{n+1} remains to be processed according to the search parameters of step 501.
  • step 607 If the question of step 607 is answered positively, control returns to step 602, whereby CPU 201 selects the next reference macroblock BAB_{n+1} of 16 by 16 binary alpha pixels, then processes a sum of absolute differences (SAD_{n+1}) value at step 603 as the distance between the alpha-valued pixels of said next reference BAB_{n+1} macroblock and the alpha-valued pixels of said current BAB macroblock selected at the previous step 502. Alternatively, the question of step 607 is answered negatively, signifying that all reference macroblocks BAB_n have been processed.
  • SAD_{n+1} sum of absolute differences
  • the currently stored minimum SAD_n value therefore represents the smallest distance between macroblock 408 and the reference macroblock 409 corresponding to said minimum SAD_n value, which is the matching block from which a motion vector for shape (MVS) is processed at step 608 in order to encode said current BAB 408 at step 511.
  • MVS motion vector for shape
  • the instructions of step 302 configure CPU 201 to sub-sample the current macroblock 408 selected at step 502 and the current reference macroblock 409 selected at step 602 into four sub-blocks or arrays of 8 by 8 alpha-valued pixels at step 6025, and to process four per-pixel SADs in parallel.
  • FIG. 1 only one processing pipeline is fully detailed for the purpose of not unnecessarily obscuring the present description.
  • the sample size is declared at step 701 as a variable from which the number of required iterations of each processing pipeline is derived, i.e. the number of partial SADs to process for the selected sub-block PB_n.
  • a first pixel is selected in the sample and the SAD value is processed at step 703a from said selected reference pixel and its corresponding pixel in the block of the current pixel data 405.
  • a question is asked at step 704a, as to whether said SAD value exceeds the currently stored minimum SAD value which, if answered positively, causes a SAD cancellation flag to be set at step 705a.
  • second, third and fourth pixels are selected, respective SAD values processed therefrom and cancellation flags possibly set as described above according to steps 702b to 705b, 702c to 705c and 702d to 705d respectively.
  • a second question is asked at step 706a, as to whether all four flags corresponding to the four processing pipelines have been set. If the question of step 706a is answered negatively, signifying that only up to three processing pipelines have output a cancellation signal, a third question is asked at step 707a, as to whether another pixel remains to be processed in the PB n sample. If the question of step 707a is answered positively, control returns to step 702a, whereby said next pixel is selected and so on and so forth. Said second and third questions are likewise asked at 706b and 707b, 706c and 707c and 706d and 707d, such that all of the pixels in the PB n sample are rapidly processed.
  • step 704a if said question is answered negatively, the cancellation flag is not set and control proceeds directly to step 707a for the possible selection of another pixel in the sample for the processing thereof. If the question of step 707a is answered negatively, however, all of the pixels in the PB n sample have been processed and the respective partial SAD values generated therefrom in an uninterrupted manner at steps 703a, 703b, 703c and 703d are added at step 708 and an updated signal set so as to answer the question of step 605 positively.
  • the reference sample contains at least four pixels with a SAD value greater than the stored minimum SAD value, whereby processing of the remaining pixels in the PB n sample is unnecessary as it is unlikely to contribute data for finding a match for the purpose of motion estimation, such that unnecessary processing, memory accesses and power consumption are avoided.
  • Control thus skips step 708 and immediately cancels any further iteration of the processing upon pixels of the currently-selected block PB_n, whereby questions 604 and 605 are both answered negatively and the next PB_n sample or the next reference macroblock BAB_{n+1} may be selected to attempt a match.
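One possible software rendering of the Figure 7A flow, with four pipelines advancing in lock-step over the four sub-sampled 8x8 blocks; the shared macroblock-level minimum SAD as the threshold and the flat pixel lists are assumptions of this sketch:

```python
def match_four_pipelines(curr_subblocks, ref_subblocks, min_sad):
    """curr_subblocks / ref_subblocks: four flat lists of 64 pixel values each.
    Each pipeline raises a cancellation flag once its partial SAD exceeds the
    stored minimum; the match is abandoned only when all four flags are set."""
    partial = [0, 0, 0, 0]
    flags = [False, False, False, False]
    for px in range(len(curr_subblocks[0])):   # 64 pixels per 8x8 sub-block
        for p in range(4):                     # the four parallel pipelines
            partial[p] += abs(curr_subblocks[p][px] - ref_subblocks[p][px])
            if partial[p] > min_sad:           # assumed shared threshold
                flags[p] = True
        if all(flags):                         # Figure 7A cancellation rule
            return None
    return sum(partial)                        # full SAD for the macroblock
```

Under the Figure 7B variant described next, the cancellation test would simply be `if any(flags)` instead of `if all(flags)`.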
  • FIG. 7B An alternative embodiment of the present invention is shown in Figure 7B, wherein the flag-setting steps 705a, 705b, 705c and 705d are replaced with process cancelling steps 705a, 705b, 705c and 705d, and the question of steps 706a, 706b, 706c and 706d is removed, such that further SAD processing is cancelled when a single flag is set.
  • This embodiment is particularly suited to a requirement for extremely low power consumption, but would only be preferred to achieve this particular benefit as it may impact the accuracy and real-time capacity of the encoding.
  • the functionality of the present invention described above may be implemented by way of a dedicated device interfaced with CPU 201 of the networked terminal, i.e. mobile phone 101, over bus 205, in place of the set of instructions loaded at step 302 and with varying degrees of parallelism depending on the critical requirements (area, power, throughput, technology) of said networked terminal.
  • Such an embodiment is particularly suited to accelerate the parallel processing described in Figures 7A and 7B with least power consumption and in cases where CPU 201 may not be configurable for efficient parallel processing itself.
  • FIG. 8 An alternative embodiment is shown in Figure 8 as an alternative configuration of the terminal shown in Figures 1 to 8, wherein the instructions configuring CPU 201 to process motion estimation according to the present invention are replaced with an additional hardware component 801 within mobile phone 101.
  • Said component 801 will hereinafter be referred to as a hardware accelerator, but it will be readily understood by those skilled in the art that this denomination is not limitative and any other such descriptive term may be used.
  • Accelerator 801 is configured to receive input visual data from said CPU 201 over bus 205 and process said data to achieve the SAD cancellation substantially as hereinbefore described and output encoded current frame data.
  • Accelerator 801 includes either 4 parallel processing elements (4xPE) or, in yet another alternative embodiment which will be further described hereinbelow, 16 parallel processing elements (16xPE), and both architectures are independent of the search strategy employed and can use a range of searches including modified full search or circular search strategies 411, 412. Both embodiments use less hardware resources than a typical implementation with comparable throughput.
  • Figure 9 shows a first 4xPE parallel architecture of the alternative embodiment 801 of the present invention.
  • each of the four parallel PEs 901 to 904 generates a partial PSAD_n value.
  • the decision-making unit 905 uses these accumulated PSAD_n values to make a SAD cancellation decision.
  • the cancellation can occur at any point during the block-matching processing. If the processing progresses to completion without a SAD cancellation then an update stage is invoked, embodying step 606, wherein the question is asked as to whether the accumulated PSAD value is less than the minimum SAD_n encountered so far.
  • PREV_DACC_REGi registers store the DACC_REGi values when the SAD cancellation has not occurred.
  • the preliminary results of the current match are thus handed over to the update stage while the 4 PEs can be committed to a new block-match;
  • B_SAD_REGi register stores the block-level minimum SAD distortion (best match) calculated in each block-level PE so far in the search for the best match, i.e. within which PSAD_1, PSAD_2, PSAD_3 and PSAD_4 are stored;
  • TOTAL_DACC_REG is the register that stores the total de-accumulator value, that is the sum of all the DACC_REGi values, i.e. within which PSAD_1, PSAD_2, PSAD_3 and PSAD_4 are accumulated;
  • TOTAL_MIN_SAD_REG stores the total minimum SAD value. This value is the minimum SAD value found so far during the motion estimation search steps run for the current reference block.
  • the update logic consists of a single adder/subtractor and an accumulator and the update is carried out sequentially.
  • the architecture shown in Figure 9 processes motion estimation on one macroblock 408 at a time.
  • the macroblock 408 is split into four 8x8 blocks as per step 701 by using a simple pixel sub-sampling technique.
  • Each of the four PEs then operates on one such 8x8 block.
  • input visual data is stored already sub-sampled and re-mapped when it is stored in the local memory 204.
  • Said pixel data re-mapping is done in both the current frame 405 and the reference frame 401.
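One common way to realise such a sub-sampling and re-mapping is to split the macroblock into its four interleaved pixel phases; the exact pattern is not given in this extract, so the phase-based split below is an assumption:

```python
def subsample_macroblock(mb):
    """Split a 16x16 macroblock (nested lists) into four 8x8 sub-blocks by
    taking the four interleaved phases (even/odd row x even/odd column).
    Assumed pattern: the patent only calls it 'simple pixel sub-sampling'."""
    blocks = []
    for dr in (0, 1):
        for dc in (0, 1):
            blocks.append([[mb[r][c] for c in range(dc, 16, 2)]
                           for r in range(dr, 16, 2)])
    return blocks
```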
  • the two stages, the block-matching stage embodying step 603 and the update stage embodying step 606, run in parallel.
  • An initialization process runs initially for the block-level matching stage and its function is to calculate an initial minimum SAD_n value and its corresponding block SAD_n values at the beginning of every motion estimation search for a new "current" block.
  • Block-matching attempts to find the best match for the current block by searching and comparing reference blocks in the search window.
  • the partial SAD_n calculation will give a negative result if DACC_REGi is greater than the previous best match saved in the SAD_BMi register: this is a trigger for SAD cancellation.
  • SAD cancellation can operate in two different modes.
  • the update stage is by default idle.
  • the control unit 905 activates said update stage only when the block-level matching stage is not cancelled prematurely (i.e. when step 603 proceeds directly to step 608) and an investigation has to be carried out, to determine whether a better match has been found in the case of SAD calculations not having been cancelled yet after the 8x8 steps. If after full processing of the block (TOT_curr / inverse TOT_curr + 2 cycles) SAD cancellation does not occur, potentially a new minimum SAD_n value has been found.
  • the update logic is launched and a new block level match is launched in parallel.
  • SAME is the number of white pixels common to the current block and the reference block;
  • DIFF_curr is the number of white pixels in the current block but not in the reference block;
  • TOT_ref is the total number of white pixels in the reference block; and DIFF_ref is the number of white pixels in the reference block but not in the current block.
  • TOT_curr is calculated only once per search; TOT_ref can be updated in one clock cycle after initial calculation; and the incremental addition of DIFF_curr allows early cancellation if the current minimum SAD is exceeded, while irrelevant data is not accessed.
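These quantities are linked by simple identities; the reformulated SAD they support (presumably the Equations 7 to 9 referred to below, which do not survive in this extract) can be reconstructed as:

```latex
% Reconstruction from the definitions above (not copied from the patent text):
\begin{align}
  TOT_{curr} &= SAME + DIFF_{curr} \\
  TOT_{ref}  &= SAME + DIFF_{ref} \\
  SAD(B_{curr}, B_{ref}) &= DIFF_{curr} + DIFF_{ref} \\
             &= TOT_{curr} - TOT_{ref} + 2\, DIFF_{ref} \\
             &= TOT_{ref} - TOT_{curr} + 2\, DIFF_{curr}
\end{align}
```

The last form only requires adding TOT_ref, subtracting TOT_curr and doubling each DIFF_curr contribution, which is consistent with the worst-case operation count of 2 + TOT_curr discussed below.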
  • a Run length code consists of two elements, X and Y.
  • X represents the number of pixels until the next white pixel is encountered.
  • Y is the number of consecutive white pixels from and including the transitional pixel (0 to 1 pixel value change).
  • the summation of all X and Y values should equal 256.
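A minimal sketch of generating such (X, Y) pairs from a flattened binary alpha block (256 values of 0 or 1 for a 16x16 block; the function name is illustrative):

```python
def run_length_encode(alpha_pixels):
    """Encode a flat list of 0/1 alpha values as (X, Y) pairs: X pixels until
    the next white pixel, then Y contiguous white pixels.  All X and Y values
    sum to the block size (256 for a 16x16 block)."""
    pairs, i, n = [], 0, len(alpha_pixels)
    while i < n:
        x = 0
        while i < n and alpha_pixels[i] == 0:   # length of the black gap
            x += 1
            i += 1
        y = 0
        while i < n and alpha_pixels[i] == 1:   # length of the white run
            y += 1
            i += 1
        pairs.append((x, y))
    return pairs
```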
  • the run length code is generated in parallel with the first match of the search step. It is possible to do this during the first match because SAD cancellation is not possible, since no minimum SAD_n value has been found. As a result it is not possible to load an initial minimum de-accumulation value into the SAD PE, therefore SAD cancellation is not possible.
  • the encoding logic can subsequently be powered down until the next search position.
  • Equation 9 it becomes apparent that the total number of operations possible (when SAD cancellation does not occur) is equal to the addition of TOT_ref, the subtraction of TOT_curr and the summation of the XOR result between each white pixel in the DIFF_curr area in the current block 408 and the co-located pixel in the reference block 409. Since there are TOT_curr white pixels in the current frame 405, there are maximally TOT_curr summations. Therefore the total number of operations in the worst-case scenario is equivalent to 2 + TOT_curr.
  • Equation 7 will now be reformulated to allow this. Using equation 3, 4 and equation 5 the following can be derived:
  • Equation 7 This alternative form of Equation 7 is highly desirable for the computational reasons outlined above.
  • Equation 8 due to the RLE DIFF_ref component in the equation.
  • by way of a solution, the run length code should be manipulated.
  • the run length codes for the white pixels in the reference alpha block are not required, because the position of the black pixels in the current alpha block can be easily generated by using the inverse thereof.
  • XOR-ing the black pixels against the co-located pixels in the reference alpha block generates DIFF_ref.
  • the XOR logic only outputs a result of one when a pixel with a value of 1 is encountered in the reference block, and not all white pixels in the reference block require processing, because all the inverse run length pixels addressed in the current block have a value of 0 and these are the only pixels that can contribute to the SAD. This therefore guarantees only relevant pixels are processed.
  • the inverse run length code can be generated from the regular run length code.
  • inverse run length pairs have the pattern of being offset by one component from the regular run length pairs.
  • the first inverse run length code (IRL0) is slightly different but is simply a matter of having an X value of zero (since there is no previous value to offset) and a Y equal to the regular RL0 X value. This covers the situation when the first pixel in the macroblock is black. If the first pixel in a macroblock is white, then IRL0 will equal (0,0). This condition can be ignored and the process skips to IRL1, whereby the remaining inverse run length pairs follow the regular offset pattern.
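A sketch of deriving the inverse run length pairs from the regular ones using the offset-by-one pattern just described (hypothetical helper, operating on the (X, Y) pairs of the earlier sketch):

```python
def inverse_run_length(pairs):
    """IRL0 = (0, X0) when the block starts with black pixels (dropped when it
    would be (0, 0)); every further IRLk = (Y[k-1], X[k])."""
    if not pairs:
        return []
    inverse = []
    x0 = pairs[0][0]
    if x0 != 0:                      # block starts with x0 black pixels
        inverse.append((0, x0))
    for k in range(1, len(pairs)):   # offset-by-one pattern
        inverse.append((pairs[k - 1][1], pairs[k][0]))
    return inverse
```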
  • the Equation 9 form of SAD calculation is used whenever TOT_curr is greater than half the number of pixels in a macroblock.
  • when TOT_curr is greater than (NxN)/2 (where N is the size of the macroblock), the Most Significant Bit (MSB) equals one.
  • MSB Most Significant Bit
  • TOTref w has been decremented to zero, all DIFFref and SAME pixels have already been examined.
  • the TOTref sign change allows SAD cancellation regardless of whether more inverse run length codes exist for the current alpha block in memory as these codes only relate to common black pixel values.
  • An unmodified version of TOTref is needed for the fast update thereof. This condition occurs when the search moves to the next new position: TOTref is updated by adding subtracting one row/column relative to the last TOT ⁇ ef value.
  • a copy of the TOTref register, decTOTref, is therefore used for decrementing, allowing early SAD cancellation.
  • FIG. 10 provides a detailed view of the SAD Processing Element (PE) 901.
  • the minimum SADn value encountered so far is loaded into DACC_REG 1001.
  • TOTcurr / TOTref is added to DACC_REG (depending on whether the MSB of TOTref is 0 or 1 respectively).
  • DACC_REG is de-accumulated by the value of either TOTref / TOTcurr, again depending on whether the MSB of TOTref is 0 or 1 respectively. If a sign change occurs at this point, the minimum SADn value has already been exceeded and no further processing is required. If a sign change has not occurred, the next run length code is fetched from memory 204.
  • the run length pair code is processed unmodified.
  • when the MSB of TOTcurr is 1, the inverse run length code is processed. In either case, the run length code processing results in an X, Y macroblock address. The X, Y address is used to retrieve the relevant pixel from the reference macroblock 409 and the current macroblock 408.
  • the pixel values are XOR-ed and the result is left-shifted by one place and then subtracted from DACC_REG 1001. If a sign change occurs, early SAD cancellation is possible. If a sign change does not occur, the remaining pixels in the current run length code are processed. If the SAD calculation is not cancelled, subsequent run length codes for the current macroblock are fetched from memory and the processing repeats. If inverse run length DIFFcurr pixels are used, TOTref is loaded in parallel into the decTOTref register 1002. During each processing cycle the result of the bit inversion is decremented from decTOTref 1002. If a sign change occurs, early SAD cancellation is possible, since the maximum number of TOTref pixels have already been examined (a behavioural sketch of the de-accumulation and cancellation appears after this list).
  • TOTref 1002 can be updated in one clock cycle.
  • the updated TOTref is calculated by subtracting the previous row or column (depending on search window movement) from TOTref and adding the new row or column. A zero result from the XOR logic indicates no net change to the number of pixels (a sketch of this fast update appears after this list).
  • Figures 11, 12, and 13 depict possible scenarios of BM step sequences that emphasize the parallelism between the block-matching stage and the update stage.
  • the block-level matching and update stages are overlapped on the time scale.
  • the update stage is idle by default and activated only when a block-level match was not cancelled after 64 steps.
  • Figure 12 describes the case where the update step is undertaken, this time after 11 steps, and adds one cycle to the current match's cycle count.
  • Figure 13 describes the same case as the previous figure, but with the BM cancellation condition met at some point in the middle of the update stage. A new BM is then launched, skipping the previous one without affecting the update stage.
  • Another embodiment of the present invention is shown in Figure 15 as an alternative configuration of the accelerator shown in Figures 8 to 10, wherein 16 PEs are implemented instead of four.
  • the structure described previously uses four parallel processing elements 901 to 904.
  • 16 processing elements are used in this embodiment to improve throughput further.
  • the drawback with this embodiment is that the potential to reduce redundant calculations is diminished.
  • Another consequence of the 16xPE structure is that more operations are required for the update stage.
  • 16 PREV_DACC_REGi registers are needed to hold the previous de-accumulation values, as well as 16 BM_SADi registers to hold the current best SADn value for each of the 16 blocks.
  • the datapath block-matching PE structure remains identical.
  • the major modification required in the architecture is within the update logic.
  • the update proceeded in a sequential manner, requiring at most 11 cycles. If the same structure were adopted for the 16xPE, the minimum update time would increase to 16 cycles. However, the block size has now been reduced to 4 by 4 pixels, meaning that the maximum SADn value calculation time is 16 cycles. Therefore the update logic would run for longer than the SAD calculation.
  • the sequential update is replaced by an adder tree structure. The minimum time required for the update is once again 5 cycles. The update stage can now process the current SADn value calculation without stalling.
  • the fifteen adders can be pipelined so as to reduce the number of adders to 8 and therefore reduce the required silicon area, though this comes at the expense of extra control logic. If area is not a concern but minimal update latency is required, the adder tree may be designed to complete in a single clock cycle (an adder-tree sketch appears after this list).
  • the block-level SADn values can be updated in 1 to 2 clock cycles depending on the constraints of the particular implementation.
  • the other major change compared to the 4xPE architecture is that memory also needs to be remapped from four blocks of 8x8 pixels to sixteen blocks of 4x4 pixels. The same pixel sub-sampling technique as previously described is used.
  • since the 16xPE architecture caters for exactly the same block size as that required by ACQ, architecture re-use is possible with minor modifications to the logic.
  • the DACC_REGi should be initially loaded with the threshold value AlphaTH and decremented from this point. This modification requires only a small amount of additional control logic.
  • when the PE is not being used for motion estimation, the logic is idle.
  • ACQ functionality is required before and after motion estimation so the logic can be efficiently reused. This reduces area requirements and therefore indirectly static power consumption. SAD cancellation is also possible, in a similar manner to the fast heuristic match seen previously. If one PE exceeds the AlphaTH value, no further processing is required for the binary alpha block. So from a power consumption perspective, the benefits of using the 16xPE for ACQ are two-fold.
  • the video memory is sub-sampled in 4 or 16 blocks, respectively, depending on the architecture (i.e. 4xPE or 16xPE).
  • Figure 16 depicts a block diagram of the memory that stores a video frame. The same memory architecture is employed to store the search area and macroblock information.
  • FIG. 17 depicts the video frame memory. Normally, two video frames (previous and current) are stored at one time. We add a new frame memory that is loaded with the next video frame, while the other two are processed to generate the motion vectors.
  • a simple modulo-3 counter can be used to multiplex between the three frame memories so that each of them periodically becomes the previous, current and next frame memory (a sketch of this rotation appears after this list).
  • a similar technique is employed for the Search Area RAM and the Current Block RAM memories, but with only two memory modules each: Search Area RAM and New Search Area RAM, and Current Block RAM and New Current Block RAM.
  • the 16x16 macroblock can be equally divided by pixel sub-sampling into four 8x8, sixteen 4x4, sixty-four 2x2, or two hundred and fifty-six 1x1 sub-blocks, respectively (a sub-sampling sketch appears after this list).
  • the 256 pixel macroblock would be divided into 64 blocks of 2x2 pixels, or 256 blocks of 1x1 pixels, respectively, and this is deemed the extreme case where the architecture tends to be similar to a 2D systolic array.
  • the SAD cancellation mechanism ceases to be efficient as the opportunity to cancel the SAD operations is significantly reduced.
  • a more trivial implementation is then a design with only one processing element (1xPE), which would be the simple implementation of a sequential motion estimator.
  • the block-matching architecture of the present invention runs well with any search strategy that can provide the next search step within a clock cycle.
  • it is also suitable for fractional-pel search strategies (e.g. half-pel motion estimation, quarter-pel motion estimation), where the video data (i.e. pixel) information is extended from integer pixel grids to non-integer pixel positions.
  • it will be transparent to the block-matching architecture whether the search space is made out of integer-pixel or non-integer pixel positions.
  • a bilinear interpolation approach can be used before block- matching to generate the pixel data in the non-integer pixel positions.
  • the search strategy is implemented in the Address Generation Unit. Any simple search strategy that can generate the coordinates of the next match within a clock cycle can be implemented here.
  • the full search strategies, linear search strategies (i.e. circular and diamond shape) and fast heuristic search strategies (i.e. logarithmic) have been successfully tested.
  • the search strategy is implemented by means of a finite state machine that has to be able to provide each clock cycle (step) the position of the next macroblock to be matched in the search window.
  • the finite state machine controls two pointers (x, y) that give the row and column indexes in the search window's table.
  • the input to the finite state machine is the type of the selected search strategy.
  • the architecture achieves cancellation of up to 92% (e.g. for the akiyo test sequence with a circular search strategy) of the total number of SAD operations normally executed by a classic ME implementation (e.g. a 1D SA), where the number of SAD operations is usually constant. As a result of this SAD cancellation, the throughput is also significantly improved.
  • an adaptive ME architecture is obtained, where the throughput and the number of SAD operations depend on the video sequence content. This is in contrast to other ME implementations, which execute the same number of SAD operations regardless of the video sequence processed.
  • the pixel sub-sampling techniques employed in the 4xPE and 16xPE architectures process in parallel lower-resolution versions of the video data.
  • in the 4xPE architecture this sub-sampled data is fed to the four processing elements by four different memory blocks.
  • the sixteen processing elements are fed in parallel with sub-sampled video data by sixteen memory blocks.
  • Another advantage of the present invention is that MPEG-4's polygon matching modes can easily be attached to the block-matching architecture proposed here. Moreover, the run-length based polygon matching approach achieves up to an 80% improvement (depending on the video object data), both in terms of the number of SAD operations and throughput, over the classic polygon matching approach.
  • the architectures shown in Figures 8 to 17 embody a low-power, fast, exhaustive binary block-matching motion estimation hardware accelerator for use in MPEG-4 binary shape coding. However, their use is not limited to motion estimation for shape coding, as they may have application in texture motion estimation when coupled with an 8-bit to 1-bit pixel filter.
  • the present invention comprises a binary SAD cancellation method tightly coupled with an efficient run length coding memory design.
  • the throughput is increased and the processing time reduced. This equates to less dynamic power consumed, since there are fewer operations.
  • static power is also indirectly reduced because less area is required compared to the conventional 1-Dimensional and 2-Dimensional systolic arrays of prior art architectures.
  • a further embodiment option is to "time share" the CPU 201 configured according to the instructions loaded at step 302 for a virtual 4xPE or 16xPE architecture. This will increase throughput while maintaining most of the area benefits associated with the fully serial approach.
  • the area for this embodiment is not as small as the simple fully serial design since multiple DACC_REG and BM_MIN_SAD registers are still required similar to the 4xPE and 16xPE design.
  • the 4xPE and 16xPE architectures have a potential voltage reduction benefit. Since these architectures run in parallel, the voltage can be reduced while the processing duration is maintained. This is highly desirable since power has a quadratic relationship with voltage.
  • the pixel sub-sampling means that multiple smaller memories can be used instead of one large memory: instead of one RAM, 4 or 16 smaller RAMs can be used for the 4xPE and 16xPE architectures respectively. This leads to a power consumption benefit, as RAM power consumption does not scale linearly.
  • the computation load is balanced between the processing elements (PE) by employing the pixel sub- sampling technique to distribute lower resolution versions of the same macroblock (highly similar for natural videos) to each of the PEs.
  • all the PEs complete their block-level matching operations almost at the same time (dealing with lower resolution versions of the same macroblock).
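The inverse run length generation referred to above can be modelled in software. The sketch below is illustrative only: it assumes (X, Y) run length pairs whose components, prefixed with a zero and re-paired with an offset of one position, yield the inverse pairs. The function name and pair interpretation are assumptions made for illustration, not the patented hardware.

```python
def inverse_run_lengths(rle_pairs):
    """Derive inverse run length pairs from regular (X, Y) run length pairs.

    IRL0 has X = 0 (there is no previous value to offset) and Y = RL0's X;
    subsequent inverse pairs are the same component stream offset by one.
    """
    flat = [0]                               # leading zero gives IRL0's X
    for x, y in rle_pairs:
        flat.extend((x, y))
    # Re-pair the offset stream; a trailing unpaired component is dropped.
    inverse = [(flat[i], flat[i + 1]) for i in range(0, len(flat) - 1, 2)]
    # If the first pixel of the macroblock was white, IRL0 == (0, 0) and is skipped.
    if inverse and inverse[0] == (0, 0):
        inverse = inverse[1:]
    return inverse

# Example: regular pairs (3, 2) and (1, 4) -> inverse pairs [(0, 3), (2, 1)]
print(inverse_run_lengths([(3, 2), (1, 4)]))
```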
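The de-accumulation behaviour of the SAD processing element can be summarised with a minimal software sketch. It models only the cancellation principle (start from the best SADn found so far, subtract each per-pixel XOR contribution, stop on a sign change); the RLE-driven addressing, register widths and register names of the actual datapath are omitted or assumed.

```python
def block_match_with_cancellation(cur_block, ref_block, best_sad):
    """Behavioural sketch of early SAD cancellation for binary alpha blocks.

    The de-accumulation register starts at the best SAD found so far and is
    decremented by each per-pixel difference; a sign change means this search
    position can never beat the current best, so processing stops early.
    """
    dacc_reg = best_sad
    sad = 0
    for c_row, r_row in zip(cur_block, ref_block):
        for c_pix, r_pix in zip(c_row, r_row):
            diff = c_pix ^ r_pix      # binary pixels: XOR equals absolute difference
            sad += diff
            dacc_reg -= diff
            if dacc_reg < 0:          # sign change: cancel this candidate
                return None
    return sad                        # new best SAD for this search position
```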
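The fast TOTref update when the search window slides by one row or column can be expressed as a simple recurrence. This is a behavioural sketch under the assumption that old_line and new_line hold the binary alpha pixels leaving and entering the window; it is not the register-transfer implementation.

```python
def fast_update_totref(tot_ref, old_line, new_line):
    """Slide-by-one update of TOTref (illustrative).

    Only the pixels leaving (old_line) and entering (new_line) the window are
    examined; positions whose old and new values XOR to zero contribute no net
    change, so the full window never needs to be re-counted.
    """
    return tot_ref - sum(old_line) + sum(new_line)
```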
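The adder tree used in the 16xPE update stage can be sketched as a binary reduction. With sixteen block-level inputs this uses fifteen two-input additions arranged in four levels; the pipelined and single-cycle variants mentioned above are scheduling choices over the same tree. The sketch below is a software model, not the RTL.

```python
def adder_tree_sum(values):
    """Reduce a list of block-level SAD values with a binary adder tree.

    Each loop iteration corresponds to one tree level; 16 inputs need 4 levels
    and 15 adders in total.
    """
    level = list(values)
    while len(level) > 1:
        if len(level) % 2:
            level.append(0)                          # pad so every adder has two inputs
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]

# Example: sixteen partial SADs reduced to one block-level total.
print(adder_tree_sum(list(range(16))))               # prints 120
```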
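The modulo-3 rotation of the three frame memories can be modelled as below. The class and method names are illustrative assumptions; the point is that a single modulo-3 counter periodically re-labels the same three physical memories as previous, current and next.

```python
class TripleFrameBuffer:
    """Three physical frame memories rotated by a modulo-3 counter (sketch)."""

    def __init__(self):
        self.mem = [None, None, None]    # three physical frame memories
        self.phase = 0                   # modulo-3 counter

    def roles(self):
        """Return which physical memory currently plays each logical role."""
        return {"previous": self.mem[self.phase % 3],
                "current":  self.mem[(self.phase + 1) % 3],
                "next":     self.mem[(self.phase + 2) % 3]}

    def advance(self, incoming_frame):
        """Step the counter and load the incoming frame into the new 'next' slot,
        while the other two memories are processed to generate motion vectors."""
        self.phase = (self.phase + 1) % 3
        self.mem[(self.phase + 2) % 3] = incoming_frame
```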
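One common way to realise the pixel sub-sampling described above is phase decimation: pixel (r, c) of the macroblock is routed to sub-block (r mod k, c mod k), giving k*k lower-resolution versions of the same macroblock (k = 2 for the 4xPE mapping, k = 4 for the 16xPE mapping). The decimation pattern below is an assumption for illustration, not a statement of the patented memory mapping.

```python
def subsample_macroblock(mb, k):
    """Split an N x N macroblock into k*k sub-blocks of (N//k) x (N//k) pixels
    by phase decimation: pixel (r, c) goes to sub-block (r % k, c % k)."""
    n = len(mb)
    size = n // k
    flat = [[[] for _ in range(k)] for _ in range(k)]
    for r in range(n):
        for c in range(n):
            flat[r % k][c % k].append(mb[r][c])
    # Reshape each sub-block's flat pixel list into size x size rows.
    return [[[blk[i * size:(i + 1) * size] for i in range(size)] for blk in row]
            for row in flat]

# Example: a 16x16 macroblock split for the 4xPE case (k = 2) yields four 8x8 blocks.
mb = [[(r + c) % 2 for c in range(16)] for r in range(16)]
blocks = subsample_macroblock(mb, 2)
print(len(blocks), len(blocks[0]), len(blocks[0][0]), len(blocks[0][0][0]))  # 2 2 8 8
```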

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computing Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The present invention relates to an apparatus (101) for coding audiovisual data (405, 407) comprising data input means (211), processing means (201), memory means (202, 203, 204) and networking means (206, 207). The memory means (203 or 204) stores instructions and data (405, 407) in the form of a sequence (402) of frames (401, 405) defined by a plurality of blocks (408, 409) of pixels configured with red, green, blue and alpha values. The processing means (201) processes (603) a sum of absolute differences between alpha values of pixels of a first block (408) in a current frame (405) and alpha values of pixels of a first block in a reference frame (401) as a minimum sum of absolute differences (606). The processing means (201) processes (70) respective partial sums of absolute differences between alpha values of pixels of portions of the block (408) in the current frame (405) and alpha values of pixels of portions of subsequent blocks (409) in the reference frame (401). The processing means (201) stops processing the partial sums of absolute differences when any of the partial sums of absolute differences exceeds (70) the minimum sum of absolute differences. The processing means (201) declares the accumulated partial sums of absolute differences to be the minimum sum if it is less than the sum of absolute differences processed from the first reference block, such that a motion vector for the shape is estimated (510) for coding (511) the current frame (405).
PCT/IE2005/000023 2004-03-15 2005-03-15 Architecture efficace de mise en correspondance de blocs WO2005088974A1 (fr)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
IE20040162 2004-03-15
IE2004/0162 2004-03-15
IE20040480 2004-07-15
IE2004/0480 2004-07-15

Publications (1)

Publication Number Publication Date
WO2005088974A1 true WO2005088974A1 (fr) 2005-09-22

Family

ID=34961698

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IE2005/000023 WO2005088974A1 (fr) 2004-03-15 2005-03-15 Architecture efficace de mise en correspondance de blocs

Country Status (1)

Country Link
WO (1) WO2005088974A1 (fr)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2003094528A1 (fr) * 2002-05-03 2003-11-13 Qualcomm Incorporated Techniques de sortie precoces destinees a une estimation du mouvement video numerique

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2003094528A1 (fr) * 2002-05-03 2003-11-13 Qualcomm Incorporated Techniques de sortie precoces destinees a une estimation du mouvement video numerique

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
CHANG H-C ET AL: "VLSI architecture Design of MPEG-4 Shape Coding", IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, vol. 12, no. 9, September 2002 (2002-09-01), pages 741 - 751, XP002329682 *
DONGHOON YU ET AL: "A fast motion estimation algorithm for MPEG-4 shape coding", IMAGE PROCESSING, 2000. PROCEEDINGS. 2000 INTERNATIONAL CONFERENCE ON SEPTEMBER 10-13, 2000, PISCATAWAY, NJ, USA,IEEE, vol. 1, 10 September 2000 (2000-09-10), pages 876 - 879, XP010530755, ISBN: 0-7803-6297-7 *
LEE K-B ET AL: "A Memory-Efficient Binary Motion Estimation Architecture for MPEG-4 Shape Coding", PROCEEDINGS OF THE 16TH EUROPEAN CONFERENCE ON CIRCUIT THEORY AND DESIGN, ECCTD 03, CRACOW, POLAND, 1 September 2003 (2003-09-01), pages II-93 - 96, XP008047373 *
LEE K-B ET AL: "OPTIMAL FRAME MEMORY AND DATA TRANSFER SCHEME FOR MPEG-4 SHAPE CODING", IEEE TRANSACTIONS ON CONSUMER ELECTRONICS, IEEE INC. NEW YORK, US, vol. 50, no. 1, February 2004 (2004-02-01), pages 342 - 348, XP001198149, ISSN: 0098-3063 *
LENGWEHASATIT K ET AL: "Computationally scalable partial distance based fast search motion estimation", IMAGE PROCESSING, 2000. PROCEEDINGS. 2000 INTERNATIONAL CONFERENCE ON SEPTEMBER 10-13, 2000, PISCATAWAY, NJ, USA,IEEE, vol. 1, 10 September 2000 (2000-09-10), pages 824 - 827, XP010530742, ISBN: 0-7803-6297-7 *
PANTISOPONE K ET AL: "A fast motion estimation method for MPEG-4 arbitrarily shaped objects", IMAGE PROCESSING, 2000. PROCEEDINGS. 2000 INTERNATIONAL CONFERENCE ON SEPTEMBER 10-13, 2000, PISCATAWAY, NJ, USA,IEEE, vol. 3, 10 September 2000 (2000-09-10), pages 624 - 627, XP010529544, ISBN: 0-7803-6297-7 *
WANG Y-C ET AL: "An Efficient Architecture of Binary Motion Estimation for MPEG-4 Shape Coding", VISUAL COMMUNICATIONS AND IMAGE PROCESSING 2001, PROCEEDINGS OF THE SPIE, vol. 4310, 2001, pages 959 - 967, XP002329681 *

Similar Documents

Publication Publication Date Title
EP1958448B1 (fr) Prediction de blocs adjacents multidimensionnels pour codage video
JP3801886B2 (ja) ハイブリッド型高速動き推定方法
Chen et al. A new block-matching criterion for motion estimation and its implementation
US20150172687A1 (en) Multiple-candidate motion estimation with advanced spatial filtering of differential motion vectors
KR100937616B1 (ko) 계산적으로 제약된 비디오 인코딩
JP2008523724A (ja) 動画像符号化のための動き推定技術
Chan et al. Experiments on block-matching techniques for video coding
KR20050012806A (ko) 비디오 인코딩 및 디코딩 기술
CN101621696B (zh) 允许分数视频运动估计和双向视频运动估计的选择性使用方法和编码器
KR100910327B1 (ko) 디지털 비디오 모션 추정을 위한 조기 종료 기술
KR20110036886A (ko) 움직임 추정 반복 탐색의 개선 방법 및 시스템, 다음 탐색 영역의 중심점 결정 방법 및 시스템, 지역적 최소값의 회피 방법 및 시스템
US20070133689A1 (en) Low-cost motion estimation apparatus and method thereof
US20090028241A1 (en) Device and method of coding moving image and device and method of decoding moving image
KR20050012794A (ko) 비디오 인코딩을 위한 모션 추정 기술
KR100221171B1 (ko) 조밀한 이동벡터필드를 재생하는 방법 및 장치
EP1683361B1 (fr) Procede d'estimation de mouvement co-situe a puissance optimisee
WO2005088974A1 (fr) Architecture efficace de mise en correspondance de blocs
WO2010044566A2 (fr) Appareil de codage/décodage de film et procédé et appareil pour compensation du mouvement de blocs se chevauchant et compensation du mouvement de blocs hybrides correspondants
KR100900058B1 (ko) 다양한 멀티미디어 코덱에 사용되는 움직임 추정 연산 방법및 그 연산회로
Alfonso et al. An innovative, programmable architecture for ultra-low power motion estimation in reduced memory MPEG-4 encoder
US11509940B1 (en) Video apparatus with reduced artifact and memory storage for improved motion estimation
KR20100097387A (ko) 고속 움직임 추정을 위한 부분 블록정합 방법
Larkin et al. A low complexity hardware architecture for motion estimation
KR20070061214A (ko) 저비용 움직임 추정 장치 및 움직임 추정 방법
Ma et al. WZ frame reconstruction algorithm based on side information improved in distributed video coding

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SM SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LT LU MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase

Ref country code: DE

WWW Wipo information: withdrawn in national office

Country of ref document: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 69(1) EPC - FORM EPO 1205A DATED 28-11-2006

122 Ep: pct application non-entry in european phase

Ref document number: 05718813

Country of ref document: EP

Kind code of ref document: A1