US20130223532A1 - Motion estimation and in-loop filtering method and device thereof

Publication number
US20130223532A1
Authority
US
United States
Prior art keywords
macroblock
current macroblock
absolute difference
sum
line segment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US13/777,434
Inventor
Yinglai XI
Qiang Li
Jumei LI
Jianbin HE
Jinfeng ZHOU
Zhichong CHEN
Liu Yang
Dong Li
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Intel Corp
Original Assignee
Via Telecom Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Via Telecom Inc
Assigned to VIA TELECOM, INC. reassignment VIA TELECOM, INC. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: CHEN, ZHICHONG, HE, JIANBIN, LI, DONG, LI, JUMEI, LI, QIANG, XI, YINGLAI, YANG, LIU, ZHOU, JINFENG
Publication of US20130223532A1
Assigned to VIA TELECOM CO., LTD. reassignment VIA TELECOM CO., LTD. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: VIA TELECOM, INC.
Priority to US14/818,886 (US10469868B2)
Assigned to INTEL CORPORATION reassignment INTEL CORPORATION ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: VIA TELECOM CO., LTD.

Classifications

    • H04N19/00569
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/503 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N19/51 Motion estimation or motion compensation
    • H04N19/547 Motion estimation performed in a transform domain
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/117 Filters, e.g. for pre-processing or post-processing
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/12 Selection from among a plurality of transforms or standards, e.g. selection between discrete cosine transform [DCT] and sub-band transform or selection between H.263 and H.264
    • H04N19/122 Selection of transform size, e.g. 8x8 or 2x4x8 DCT; Selection of sub-band transforms of varying structure or type
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/42 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
    • H04N19/43 Hardware specially adapted for motion estimation or compensation
    • H04N19/433 Hardware specially adapted for motion estimation or compensation characterised by techniques for memory access
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/503 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N19/51 Motion estimation or motion compensation
    • H04N19/56 Motion estimation with initialisation of the vector search, e.g. estimating a good candidate to initiate a search
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/80 Details of filtering operations specially adapted for video compression, e.g. for pixel interpolation
    • H04N19/82 Details of filtering operations specially adapted for video compression, e.g. for pixel interpolation involving filtering within a prediction loop
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/503 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N19/51 Motion estimation or motion compensation
    • H04N19/523 Motion estimation or motion compensation with sub-pixel accuracy
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/503 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N19/51 Motion estimation or motion compensation
    • H04N19/533 Motion estimation using multistep search, e.g. 2D-log search or one-at-a-time search [OTS]

Definitions

  • the present invention relates to video processing, and in particular to a motion estimation acceleration circuit and an in-loop filtering acceleration circuit that reuse data in the overlapped portions of neighboring macroblocks to reduce memory bandwidth.
  • Video compression standards such as MPEG2, H.264, and VC-1 are widely used in the video codec (coding/decoding) systems on the market.
  • motion estimation and de-blocking filtering may account for the largest share of operations. If a video codec system performs motion estimation and de-blocking filtering in software only, they may place a serious burden on the processing unit.
  • some previously used macroblock data may be read from the external memory repeatedly, so that the memory bandwidth for accessing the external memory is wasted.
  • a motion estimation acceleration circuit applied in a video encoding system supporting multiple video codec standards comprises: a start searching point prediction unit, configured to determine a start searching point according to multiple neighboring macroblocks of a current macroblock, wherein the current macroblock corresponds to a searching window; and an integer pixel estimation unit, configured to determine a best candidate pixel according to a first line segment where the start searching point is located, a second line segment on the first line segment, and a third line segment beneath the first line segment, wherein the integer pixel estimation unit further determines whether the best candidate pixel is located at the first line segment, if so, the integer pixel estimation unit sets a candidate motion vector corresponding to the best candidate pixel as a first current macroblock motion vector; if not, the integer pixel estimation unit dynamically adjusts the second line segment or the third line segment in the searching window to update the best candidate pixel, and retrieve the first current macroblock motion vector corresponding to the updated best candidate pixel.
  • a motion estimation method has the following steps of: determining a start searching point according to multiple neighboring macroblocks of a current macroblock, wherein the current macroblock corresponds to a searching window; determining a best candidate pixel according to a first line segment where the start searching point is located, and a second/third line segment on/beneath the first line segment; determining whether the best candidate pixel is located at the first line segment; if so, setting a candidate motion vector corresponding to the best candidate pixel as a first motion vector of the current macroblock; and if not, dynamically adjusting the second line segment or the third line segment in the searching window to update the best candidate pixel, and retrieving the first motion vector of the current macroblock corresponding to the updated best candidate pixel.
  • an in-loop filtering acceleration circuit applied in a video codec system supporting the H.264 standard and the VC-1 standard comprises a processing unit to perform video processing to generate at least one reconstructed macroblock and a value of boundary strength corresponding to each edge of the reconstructed macroblock.
  • the in-loop filtering acceleration circuit comprises: multiple one-dimensional (1D) filters configured to perform a filtering process; and a filter selection unit configured to select one of the 1D filters according to the value of the boundary strength to perform the filtering process to the reconstructed macroblock, wherein the in-loop filtering acceleration circuit further divides the reconstructed macroblock into multiple 8×8 blocks and multiple 4×4 blocks, performs the filtering process to horizontal edges of the 8×8 blocks of the reconstructed macroblock row by row according to a first predefined order, and performs the filtering process to horizontal edges of the 4×4 blocks row by row from top to bottom, wherein the in-loop filtering acceleration circuit further performs the filtering process to vertical edges of the 8×8 blocks column by column according to a second predefined order, and performs the filtering process to vertical edges of the 4×4 blocks column by column from left to right.
  • an in-loop filtering method applied in an in-loop filtering acceleration circuit of a video codec system supporting the H.264 standard and the VC-1 standard comprises a processing unit to perform video processing to generate at least one reconstructed macroblock and a value of boundary strength corresponding to each edge of the reconstructed macroblock.
  • the method comprises the following steps of: dividing the reconstructed macroblock into multiple 8×8 blocks and multiple 4×4 blocks; selecting one of multiple 1D filters according to the value of the boundary strength to perform the filtering process to the reconstructed macroblock; performing the filtering process to horizontal edges of the 8×8 blocks of the reconstructed macroblock row by row according to a predefined order, and performing the filtering process to horizontal edges of the 4×4 blocks row by row from top to bottom; and performing the filtering process to vertical edges of the 8×8 blocks column by column according to another predefined order, and performing the filtering process to vertical edges of the 4×4 blocks column by column from left to right.
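  • The filtering order described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the circuit's implementation: the names `edge_schedule` and `assign_filters` and both lookup tables are hypothetical, the two "predefined orders" (not spelled out in this passage) are assumed to be top-to-bottom and left-to-right, and 4×4 edges that coincide with 8×8 edges are assumed to be handled in the 8×8 pass.

```python
def edge_schedule(mb_size=16):
    """List (direction, block_size, offset) tuples in the described order."""
    edges8 = list(range(0, mb_size, 8))                       # 8x8 block edges
    edges4 = [p for p in range(0, mb_size, 4) if p % 8 != 0]  # remaining 4x4 edges
    schedule = [("horizontal", 8, y) for y in edges8]         # rows of 8x8 blocks
    schedule += [("horizontal", 4, y) for y in edges4]        # top to bottom
    schedule += [("vertical", 8, x) for x in edges8]          # columns of 8x8 blocks
    schedule += [("vertical", 4, x) for x in edges4]          # left to right
    return schedule


def assign_filters(boundary_strength, filters):
    """Pair every edge with the 1D filter selected by its boundary strength.

    boundary_strength maps an edge tuple to a strength value and filters maps
    a strength value to a callable; both tables are placeholders.
    """
    return [(edge, filters[boundary_strength[edge]]) for edge in edge_schedule()]
```

  • Keeping the schedule separate from the filter table mirrors the split between the filter selection unit and the 1D filters, so a different standard could swap the table without touching the edge order.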
  • FIG. 1 is a block diagram illustrating a video encoding system according to an embodiment of the invention
  • FIG. 2 is a diagram illustrating prediction of the start search point in the motion estimation method according to an embodiment of the invention
  • FIG. 3 is a diagram illustrating the motion estimation method according to an embodiment of the invention.
  • FIG. 4 is a diagram illustrating overlapped searching windows of horizontally neighboring macroblocks according to an embodiment of the invention.
  • FIGS. 5A-5D are diagrams illustrating the architecture of the searching window buffer according to an embodiment of the invention.
  • FIG. 6 is a schematic diagram of the motion estimation acceleration circuit 122 according to an embodiment of the invention.
  • FIG. 7 is a block diagram illustrating the hardware architecture of the integer pixel estimation unit 151 according to an embodiment of the invention.
  • FIG. 8 is a structure diagram illustrating a processing element in the integer pixel estimation unit 151 according to an embodiment of the invention.
  • FIGS. 9A and 9B are portions of a diagram illustrating the hardware architecture of the half pixel estimation unit 152 according to an embodiment of the invention.
  • FIG. 10 is a diagram illustrating the in-loop filtering sequence in the H.264 standard according to an embodiment of the invention.
  • FIG. 11 is a diagram illustrating the in-loop filtering sequence in the VC-1 standard according to an embodiment of the invention.
  • FIG. 12 is a diagram illustrating the architecture of the de-blocking filter buffer 145 according to an embodiment of the invention.
  • FIGS. 13A-13D are portions of a diagram illustrating the sequence of data accessing in the de-blocking filter buffer 145 according to an embodiment of the invention.
  • FIGS. 14A and 14B are portions of a diagram illustrating the hardware architecture of the in-loop filtering acceleration circuit 124 according to an embodiment of the invention.
  • FIGS. 15A and 15B are diagrams illustrating the working principle of the filter selection unit 1410 according to an embodiment of the invention.
  • FIGS. 16A-16F are diagrams illustrating the architecture of each H.264 1D filter according to an embodiment of the invention.
  • FIGS. 17A-17B are portions of a diagram illustrating the architecture of the VC-1 filter in the in-loop filtering acceleration circuit 124 according to an embodiment of the invention.
  • FIG. 18 is a block diagram illustrating a video codec system according to an embodiment of the invention.
  • FIGS. 19A and 19B are portions of a flow chart illustrating the motion estimation method according to an embodiment of the invention.
  • FIG. 1 is a block diagram illustrating a video encoding system according to an embodiment of the invention.
  • the video encoding system 100 may comprise a processing unit 110, an encoding module 120, an external storage unit 130 and a DMA controller 160.
  • the processing unit 110 may be a controller configured to execute a hardware accelerator control program, and execute an entropy encoding program, a bit rate control program, and a boundary extension program.
  • the processing unit 110 may be a central processing unit (CPU), a digital signal processor (DSP) or other equivalent circuits implementing the same functions.
  • the encoding module 120 may comprise a hardware accelerator controller 121 , a motion estimation acceleration circuit 122 , a DCT and quantization accelerator 123 , an in-loop filtering acceleration circuit 124 , and an internal storage unit 140 .
  • the encoding module 120 can be divided into a hardware encoding unit and a software encoding unit (not shown in FIG. 1 ). That is, each component in the encoding module 120 may be implemented by hardware or a DSP (i.e. software) configured to perform encoding processes, such as motion estimation, motion compensation, discrete cosine transform/inverse transform (DCT/iDCT), quantization/inverse quantization, zig-zag scan, and in-loop filtering.
  • the motion estimation acceleration circuit 122 and the in-loop filtering acceleration circuit 124 are dedicated digital logic circuits or hardware to implement encoding processes, such as motion estimation and in-loop filtering processing.
  • the hardware accelerator controller 121, the motion estimation acceleration circuit 122, the DCT and quantization accelerator 123, and the in-loop filtering acceleration circuit 124 in the encoding module 120 of FIG. 1 are implemented by hardware.
  • the hardware components, such as the processing unit 110 and the encoding module 120, may utilize a frame level flow control method, in which the CPU may decode the next frame while the hardware components of the encoding module 120 decode the current frame.
  • the data flow between each component (e.g. all hardware, or integrated by hardware/software) in the encoding module 120 may be macroblock level flow control.
  • the external storage unit 130 is configured to store reference frames, reconstructed frames, decoding parameters, and run-last-level codes (i.e. RLL codes).
  • the external storage unit 130 may be a volatile memory component (e.g. random access memory, such as DRAM, SRAM) and/or a non-volatile memory component (e.g. ROM, hardware accelerator, CDROM).
  • the DMA controller 160 is configured to retrieve macroblock data and encoding parameters corresponding to the encoding process.
  • the hardware accelerator controller 121 in the encoding module 120 may read the required macroblock data (e.g. the current macroblock and reference macroblock) from the external storage unit 130 to the internal storage unit 140 through the DMA controller 160 .
  • the processing unit 110 may control each component in the encoding module 120 .
  • the processing unit 110 may set and check register values associated with the hardware accelerator controller 121 , and then activate the encoding module 120 to encode the current frame. It is necessary for the processing unit 110 to request and register a corresponding DMA channel, check status of the DMA channel, and set registers associated with the DMA controller 160 to activate the DMA controller.
  • the encoding module 120 may start to encode the current frame. It should be noted that the encoding module 120 and the processing unit 110 are controlled by a frame level flow. Before finishing the encoding procedure of each current frame by the hardware accelerator, the processing unit 110 (i.e.
  • an encoding program e.g. program codes
  • the encoding program may detect whether the hardware encoding unit has completed the encoding procedure of the current frame.
  • the processing unit 110 may execute other programs having higher priority and being ready for execution. Specifically, when the encoding module 120 has finished the encoding procedure of the current frame, the encoding module 120 may generate an interrupt signal. Accordingly, an interrupt service program executed by the processing unit 110 may send an event completion signal to the encoding program. Then, the encoding program may retake control of the processing unit 110 to encode the next frame.
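  • The frame level hand-off described above can be sketched in Python as follows. This is a software analogy only: `encode_frames` and the `hw_encode` callback are hypothetical names, a worker thread stands in for the hardware encoding unit, and a `threading.Event` plays the role of the interrupt signal and the event completion signal.

```python
import threading


def encode_frames(frames, hw_encode):
    """Encode frames one at a time, waiting for a completion event per frame."""
    done = threading.Event()
    results = []

    def hardware(frame):
        results.append(hw_encode(frame))  # stand-in for the hardware encoding unit
        done.set()                        # stand-in for the interrupt / completion signal

    for frame in frames:
        done.clear()
        worker = threading.Thread(target=hardware, args=(frame,))
        worker.start()
        # The encoding program blocks here, so other ready programs could run.
        done.wait()
        worker.join()
    return results
```

  • For example, `encode_frames([1, 2, 3], lambda f: f * 2)` processes the three placeholder frames strictly in order, one completion event per frame.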
  • the processing unit 110 may further execute various programs to perform encoding post-processing, such as executing an entropy encoding program, a bit rate control program and a boundary extension program.
  • the entropy encoding program may indicate that the processing unit 110 reads encoding parameters and RLL codes from the external storage unit 130 to perform entropy encoding, and outputs a video bitstream of an image.
  • the bit rate control program may indicate that the processing unit 110 may calculate quantization parameters of the next frame according to encoding results of the current frame, the total bit rate, and the frame rate.
  • the boundary extension program may indicate that the processing unit 110 performs boundary extension on the reconstructed frame output by the hardware encoding unit, which is used for motion estimation of the next frame.
  • the internal storage unit 140 may comprise a residue macroblock buffer 141 , a first-in-first-out (FIFO) buffer 142 , a current macroblock buffer 143 , a searching window buffer 144 , and a de-blocking filter buffer 145 .
  • the residue macroblock buffer 141 is configured to store residue values of macroblocks for motion compensation.
  • the FIFO buffer 142 is configured to store encoding parameters and RLL codes, wherein the encoding parameters are from the hardware accelerator controller 121 , and the RLL codes are from the DCT and quantization accelerator 123 .
  • the current macroblock buffer 143 is configured to store the current macroblock.
  • the searching window buffer 144 is configured to store macroblocks in the searching window for motion estimation.
  • the de-blocking filter buffer 145 is configured to store reconstructed macroblocks after motion compensation and filtered macroblocks generated by the in-loop filtering acceleration circuit 124 .
  • the in-loop filtering acceleration circuit 124 reads reconstructed macroblocks, which are generated by the DCT and quantization accelerator 123 , from the de-blocking filter buffer 145 , and performs in-loop filtering to the reconstructed macroblocks to generate filtered macroblocks, and writes the filtered macroblocks into the de-blocking filter buffer 145 .
  • the hardware accelerator controller 121 may set and manage each component in the encoding module 120. For example, when the motion estimation acceleration circuit 122 in the encoding module 120 has completed encoding of a macroblock, the motion estimation acceleration circuit 122 may send a first interrupt signal to the hardware accelerator controller 121. Meanwhile, the hardware accelerator controller 121 may set and activate subsequent corresponding accelerators and acceleration circuits. When hardware (e.g. the in-loop filtering acceleration circuit 124) in the encoding module 120 has completed encoding of a frame, the hardware accelerator controller 121 may send a second interrupt signal to the processing unit 110. Then, the processing unit 110 may write the encoding parameters to registers (not shown) inside the hardware accelerator controller 121 directly, so that the hardware accelerator controller 121 may set each hardware component in the encoding module 120.
  • the motion estimation acceleration circuit 122 in the invention may use a prediction-based 12-point line searching algorithm to complete motion estimation of integer pixels (details will be described later), and to perform motion estimation of half pixels.
  • the motion estimation acceleration circuit 122 may search for eight points while performing motion estimation of half pixels, and the interpolation and motion estimation of half pixels can be executed in parallel.
  • the motion estimation method for integer pixels provided in the invention may comprise the following four steps of: (1) predicting the start searching point; (2) 12-point line searching based on an 8×8 block; (3) motion searching of 16×16 macroblocks; and (4) determining the macroblock mode for motion estimation.
  • FIG. 2 is a diagram illustrating prediction of the start search point in the motion estimation method according to an embodiment of the invention.
  • FIGS. 19A and 19B are portions of a flow chart illustrating the motion estimation method according to an embodiment of the invention.
  • the motion estimation acceleration circuit 122 may confirm the start searching point for every macroblock before performing motion estimation.
  • the motion estimation acceleration circuit 122 may predict the start searching point by using motion vectors of neighboring macroblocks. As illustrated in FIG.
  • motion vectors MVa, MVb, MVc and MVd of a left neighboring macroblock A, an upper neighboring macroblock B, an upper-right neighboring macroblock C and an upper-left neighboring macroblock D of the current macroblock E are referenced to predict the start searching point.
  • the four pixels pointed to by the motion vectors MVa, MVb, MVc, and MVd of the four neighboring macroblocks of the current macroblock E are checked, and the sum of absolute difference (SAD) corresponding to each of the four points is calculated.
  • the point with the least SAD value is regarded as the start searching point for motion estimation.
  • some neighboring macroblocks may not exist if the current macroblock is located at the boundary of the image. In that case, a zero-valued motion vector may be used to substitute the motion vectors of the non-existing neighboring macroblocks, and the predicted reference point is set to the zero point.
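  • The start searching point prediction described above can be sketched in Python as follows. This is a minimal illustration, assuming frames are lists of pixel rows and that a motion vector is a `(dx, dy)` offset of the block's top-left corner; the names `sad` and `predict_start_point` are hypothetical and `None` is used here to mark a neighboring macroblock that does not exist.

```python
def sad(cur, ref, top, left):
    """Sum of absolute differences between block `cur` and the same-sized
    region of `ref` whose top-left corner is (top, left)."""
    return sum(
        abs(cur[y][x] - ref[top + y][left + x])
        for y in range(len(cur))
        for x in range(len(cur[0]))
    )


def predict_start_point(cur, ref, mb_top, mb_left, neighbor_mvs):
    """Return (motion vector, SAD) of the best of the neighbours' vectors.

    neighbor_mvs holds the vectors of macroblocks A, B, C and D; a missing
    neighbour (None) is replaced by the zero-valued motion vector.
    """
    candidates = [mv if mv is not None else (0, 0) for mv in neighbor_mvs]
    best_mv, best_sad = None, None
    for dx, dy in candidates:
        cost = sad(cur, ref, mb_top + dy, mb_left + dx)
        if best_sad is None or cost < best_sad:  # keep the least SAD seen so far
            best_mv, best_sad = (dx, dy), cost
    return best_mv, best_sad
```

  • Ties are resolved in favor of the first candidate in the list, which is a choice made for this sketch rather than something the text specifies.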
  • FIG. 3 is a diagram illustrating the motion estimation method according to an embodiment of the invention.
  • the motion estimation method used in the motion estimation acceleration circuit 122 is based on searching the 12-point line segments of integer pixels.
  • FIGS. 19A and 19B show a flow chart illustrating the motion estimation method according to an embodiment of the invention.
  • Step 1: as illustrated in FIG. 3, the current macroblock is divided into four 8×8 blocks.
  • the motion estimation acceleration circuit 122 may search for three 12-point line segments p−1, p and p+1 taking the pixel-word at which the start point S 1 is located as center, and thus there are 36 candidate pixels, such as the white points illustrated in FIG. 3.
  • a SAD 16×16 value of a candidate point can be obtained by summing the four SAD 8×8 values corresponding to the same candidate point (i.e. 36 SAD 16×16 values in total). If the reference point corresponding to the least SAD 16×16 value (e.g. the best reference point, such as the gray point illustrated in FIG. 3) is located on the line segment p+1, step 2 is performed. If the best reference point is located on the line segment p−1, step 3 is performed. Otherwise, step 4 is performed.
  • Step 4: the motion estimation acceleration circuit 122 may set the motion vector MV 16×16 of the 16×16 macroblock to the motion vector corresponding to the least SAD 16×16 value, and set the motion vectors MV 8×8 of the four 8×8 blocks to the motion vector corresponding to the least SAD 8×8 value.
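  • The candidate layout and the SAD bookkeeping of these steps can be sketched in Python as follows. This is an illustration under assumptions: frames are lists of pixel rows, a candidate point is taken to be the top-left corner of the matching region, pixel words are assumed to be aligned to multiples of four pixels, and the function names are hypothetical.

```python
def block_sad(cur, ref, by, bx, ref_top, ref_left, size=8):
    """SAD between the size x size block of `cur` at (by, bx) and the
    co-located block of the candidate region in `ref`."""
    return sum(
        abs(cur[by + y][bx + x] - ref[ref_top + by + y][ref_left + bx + x])
        for y in range(size)
        for x in range(size)
    )


def candidate_sads(cur_mb, ref, ref_top, ref_left):
    """Return (four SAD 8x8 values, SAD 16x16) for one candidate point."""
    sad8 = [block_sad(cur_mb, ref, by, bx, ref_top, ref_left)
            for by in (0, 8) for bx in (0, 8)]
    return sad8, sum(sad8)  # SAD 16x16 is the sum of the four SAD 8x8 values


def initial_candidates(start_x, start_y):
    """36 points: three 12-point segments p-1, p, p+1 around the pixel word."""
    word = (start_x // 4) * 4  # assumed 4-pixel word alignment
    return [(x, start_y + dy)
            for dy in (-1, 0, 1)
            for x in range(word - 4, word + 8)]
```

  • Computing the four SAD 8×8 values first and summing them gives the SAD 16×16 value at no extra pixel cost, which is why one pass can serve both the 16×16 and the 8×8 motion vectors.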
  • details of the aforementioned steps 1-4 can be described with the steps illustrated in FIGS. 19A and 19B:
  • In step S 1901, the current macroblock is divided into at least one 8×8 block.
  • For each 8×8 block, taking a pixel word comprising four pixels where the start searching point is located as center, 36 initial candidate points can be retrieved from a first line segment, a second line segment and a third line segment (i.e. the first/second/third line segments are aligned, as shown in FIG. 3), wherein the first line segment comprises the pixel word and four neighboring pixels at each of the right and left sides of the pixel word, and the second line segment is on the first line segment, and the third line segment is beneath the first line segment;
  • In step S 1902, a first SAD value of each initial candidate point relative to each 8×8 block is calculated, thereby obtaining an initial current macroblock SAD value corresponding to each initial candidate point.
  • a first least current macroblock SAD value can be obtained according to the initial current macroblock SAD values;
  • In step S 1903, it is determined whether a best reference point corresponding to the first least current macroblock SAD value is located on the second line segment or not. If so, step (d) (i.e. step S 1905) is performed. If not, it is further determined whether the reference point corresponding to the first least current macroblock SAD value is located on the third line segment (step S 1904). If so, step (g) (i.e. step S 1909) is performed. Otherwise, step (j) (i.e. step S 1912) is performed;
  • In step S 1905, it is determined whether the second line segment is located on a boundary of a searching window corresponding to the current macroblock or not. If so, step (j) (i.e. step S 1912) is performed. If not, the second line segment is moved down by a pixel, and the moved second line segment is adjusted horizontally to generate 12 first refined candidate points according to a pixel word where the best reference point is located (step S 1906), and step (e) is performed;
  • In step S 1907, a second sub-macroblock SAD value of each first refined candidate point relative to each 8×8 block is calculated, thereby obtaining a second current macroblock SAD value corresponding to each first refined candidate point. Then, a second least current macroblock SAD value can be obtained according to the second current macroblock SAD value corresponding to each first refined candidate point;
  • In step S 1908, it is determined whether the second least current macroblock SAD value is larger than the first least current macroblock SAD value. If so, step (j) (i.e. step S 1912) is performed. If not, the second least current macroblock SAD value is set to the first least current macroblock SAD value, and step (d) (i.e. step S 1905) is performed.
  • In step S 1909, it is determined whether the third line segment is located on a boundary of the searching window corresponding to the current macroblock. If so, step (j) (i.e. step S 1912) is performed. If not, the third line segment is moved up by one pixel, and the moved third line segment is adjusted horizontally to generate 12 second refined candidate points according to a pixel word where the best reference point is located (step S 1913), and step (h) (i.e. step S 1910) is performed;
  • In step S 1910, a third sub-macroblock SAD value of each second refined candidate point relative to each 8×8 block is calculated, thereby obtaining a third current macroblock SAD value corresponding to each second refined candidate point. Then, a third least current macroblock SAD value can be obtained according to the third current macroblock SAD value corresponding to each second refined candidate point;
  • In step S 1911, it is determined whether the third least current macroblock SAD value is larger than the first least current macroblock SAD value. If so, step (j) (i.e. step S 1912) is performed. If not, the third least current macroblock SAD value is set to the first least current macroblock SAD value, and step (g) (i.e. step S 1909) is performed;
  • step S 1912 the current macroblock integer pixel motion vector is set to a first motion vector corresponding to the first least current macroblock SAD value, and multiple sub-macroblock motion vectors corresponding to the 8 ⁇ 8 blocks in the current macroblock are set to multiple motion vectors pointing to the second sub-macroblock SAD values or the third sub-macroblock SAD values.
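The downward and upward refinement steps above can be read as a row-by-row loop. The following is a minimal sketch under stated assumptions, not the circuit's actual control logic: `sad_at(x, y)` is a hypothetical callback returning the current macroblock SAD value of a candidate point, and `line_candidates` assumes the 12-point segment covers the pixel word holding the best point plus one word on each side, which is only one possible reading of the horizontal adjustment.

```python
def line_candidates(best_x, y, word=4, points=12):
    # Assumption: the segment starts one pixel word to the left of the word
    # that holds best_x, giving 12 horizontally adjacent candidates.
    start = (best_x // word) * word - word
    return [(start + i, y) for i in range(points)]


def refine(sad_at, best, best_sad, step, y_min, y_max):
    """Move the line segment by `step` rows (+1 down, -1 up) while the SAD does not get worse."""
    x, y = best
    line_y = y
    # Stop as soon as the segment already sits on the window boundary.
    while y_min <= line_y + step <= y_max:
        line_y += step
        cands = line_candidates(x, line_y)
        sad, point = min((sad_at(cx, cy), (cx, cy)) for cx, cy in cands)
        if sad > best_sad:
            break                      # larger SAD: keep the previous best point
        best_sad, (x, y) = sad, point  # otherwise adopt the new least SAD value
    return (x, y), best_sad
```

Calling `refine` once with `step=1` and once with `step=-1` corresponds loosely to the downward and upward passes; how the two results are combined in step S 1912 is not modeled here.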
  • the motion estimation acceleration circuit 122 may take the reference point corresponding to the least SAD 16×16 value as a center, and search for eight half pixels around the center. If the SAD 8×8 or SAD 16×16 value corresponding to the half pixels is smaller than the SAD value of integer pixels, the motion estimation acceleration circuit 122 may update the motion vectors corresponding to the 8×8 blocks or the 16×16 macroblock.
  • the motion estimation acceleration circuit 122 may determine whether an INTER mode (i.e. for 16×16 macroblocks) or an INTER4V mode (i.e. for 8×8 blocks) is used for encoding the current macroblock according to a rate distortion optimization (RDO) value.
  • the mode with a smaller RDO value may have a higher priority, and the motion estimation acceleration circuit 122 may select the mode with a smaller RDO value as the encoding mode for the current macroblock.
  • the current frame and the reference frame for motion estimation are stored in the external storage unit 130
  • the current macroblock and the searching window are stored in the internal storage unit 140
  • the hardware accelerator controller 121 may read the current macroblock and the searching window from the external storage unit 130 , and write the current macroblock and the searching window to the internal storage unit 140 .
  • the current macroblock is stored in the current macroblock buffer 143
  • the pixels of the searching window are stored in the searching window buffer 144 .
  • each pixel may have an 8-bit accuracy, and neighboring pixels in the horizontal direction are placed into the same pixel word.
  • FIG. 4 is a diagram illustrating overlapped searching windows of horizontally neighboring macroblocks according to an embodiment of the invention.
  • a search range for motion estimation used in the motion estimation acceleration circuit 122 is (−16, 15.5), and the size of the corresponding searching window may be 48×48 pixels.
  • the overlapped portion between the searching windows of the two horizontally neighboring macroblocks is 32×48 pixels.
  • the searching window buffer 144 in the invention is implemented in the architecture of four memory banks.
  • Each memory bank may store a region of 16×48 pixels.
  • the motion estimation acceleration circuit 122 may access a 48×48 searching window comprising three memory banks, whereas the remaining memory bank is accessed by the DMA controller 160 . That is, the DMA controller 160 may read the region of 16×48 pixels for motion estimation of the next macroblock from the external storage unit 130 to the searching window buffer 144 . Since there are four memory banks in the searching window buffer 144 , the calculation of motion estimation and the loading of the searching window of the next macroblock can be performed in parallel.
  • FIGS. 5A~5D are diagrams illustrating the architecture of the searching window buffer according to an embodiment of the invention.
  • the searching window comprises three different memory banks in the searching window buffer 144 alternately.
  • the DMA controller 160 may write the region of 16×48 pixels for motion estimation of the next macroblock into the memory bank 4 , memory bank 1 , memory bank 2 and memory bank 3 sequentially, as illustrated in FIGS. 5A~5D .
  • the motion estimation acceleration circuit 122 may read the 48×48 searching window from the external storage unit 130 when starting calculation for motion estimation of the first macroblock in each row. For calculation of motion estimation of the remaining macroblocks in each row, the motion estimation acceleration circuit 122 may only have to read a region of 16×48 pixels from the external storage unit 130 . Therefore, the invention may reduce the memory bandwidth for accessing the external storage unit 130 effectively.
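The bank rotation and the resulting bandwidth saving can be illustrated with a short sketch. The function names are hypothetical, and the pairing of macroblock index 0 with memory banks 1 to 3 is an assumption chosen so that the DMA order comes out as bank 4, 1, 2, 3, as stated above.

```python
MB = 16      # macroblock width in pixels
WIN = 48     # searching window width and height in pixels
BANKS = 4    # memory banks in the searching window buffer


def bank_roles(mb_index):
    """Return (banks read by motion estimation, bank filled by DMA), numbered 1..4."""
    first = mb_index % BANKS
    window = [(first + i) % BANKS + 1 for i in range(3)]
    dma = (first + 3) % BANKS + 1
    return window, dma


def pixels_read_per_row(mbs_in_row, reuse=True):
    """Pixels fetched from external storage for one row of macroblocks."""
    if not reuse:
        # Every macroblock reloads its full 48x48 window.
        return mbs_in_row * WIN * WIN
    # Full window once, then one new 16x48 strip per remaining macroblock.
    return WIN * WIN + (mbs_in_row - 1) * MB * WIN
```

For a row of 22 macroblocks (a 352-pixel-wide frame, chosen only as an example), the sketch gives 18432 pixels with strip reuse against 50688 pixels without it.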
  • FIG. 6 is a schematic diagram of the motion estimation acceleration circuit 122 according to an embodiment of the invention.
  • the motion estimation acceleration circuit 122 may comprise a start searching point prediction unit 150 , an integer pixel estimation unit 151 , a half pixel estimation unit 152 , and a prediction difference calculating unit 153 .
  • Each component in the motion estimation acceleration circuit 122 may execute a calculating procedure associated with its name, respectively.
  • the start searching point prediction unit 150 may search for and predict the start point for motion estimation, as described in section B-1 and illustrated in FIG. 2 .
  • the start searching point prediction unit 150 may read pixels of the searching window and the current macroblock from the searching window buffer 144 and the current macroblock buffer 143 , respectively, according to motion vectors of the neighboring macroblocks of the current macroblock. Then, the start searching point prediction unit 150 may further calculate SAD values of the candidate points, and select a start searching point prediction value by comparing all the SAD values. Further, the start searching point prediction unit 150 may transmit the start searching point prediction value to the integer pixel estimation unit 151 , so that the integer pixel estimation unit 151 may perform a 12-point line segment searching process for motion estimation.
  • the integer pixel estimation unit 151 may read pixels of the searching window and the current macroblock from the searching window buffer 144 and the current macroblock buffer 143 , respectively. Then, the integer pixel estimation unit 151 may calculate SAD values of all candidate points, and determine motion vectors of integer pixels by comparing all the SAD values. The integer pixel estimation unit 151 may transmit the motion vectors of integer pixels to the half pixel estimation unit 152 .
  • the half pixel estimation unit 152 may perform calculation of interpolation and motion estimation of half pixels.
  • the half pixel estimation unit 152 may read pixels of the searching window and the current macroblock from the searching window buffer 144 and the current macroblock buffer 143 , respectively, and generate reference macroblocks by interpolation.
  • the half pixel estimation unit 152 may further calculate SAD values of all candidate points, and determine motion vectors for half pixels by comparing all the SAD values.
  • the prediction difference calculating unit 153 may read pixels of the best reference macroblock from the searching window buffer 144 according to the motion vectors for half pixels generated by the half pixel estimation unit 152 .
  • the prediction difference calculating unit 153 may further obtain residue values by calculating the difference between pixels of the current macroblock and pixels of the best reference macroblock, and write the residue values into the residue macroblock buffer 141 .
  • FIG. 7 is a block diagram illustrating the hardware architecture of the integer pixel estimation unit 151 according to an embodiment of the invention.
  • the integer pixel estimation unit 151 may implement the aforementioned 12-point line segment searching algorithm by using a systolic array comprising 12 parallel processing elements (PE).
  • the 12 processing elements of the integer pixel estimation unit 151 may be divided into four sub-arrays, wherein the first sub-array comprises processing elements PE 1 , PE 5 and PE 9 ; the second sub-array comprises processing elements PE 2 , PE 6 and PE 10 ; the third sub-array comprises processing elements PE 3 , PE 7 and PE 11 ; and the fourth sub-array comprises processing elements PE 4 , PE 8 and PE 12 .
  • Each processing element may have two input terminals, and pixels in the searching window buffer 144 can be broadcast to all 12 processing elements. Pixels of the current macroblock may be reordered into four sets of input data, and the four sets of input data are transmitted to the four sub-arrays, respectively. In addition, the transmission path of the input data is sequential in each sub-array (e.g. PE 1 → PE 5 → PE 9 ). Also, eight 32-bit flip-flops are used as delaying units in four transmission paths of pixels of the current macroblock.
  • the integer pixel estimation unit 151 may access the current macroblock buffer 143 and the searching window buffer 144 simultaneously via two different physical channels (e.g. memory channels).
  • pixels are stored in the format of pixel words in the current macroblock buffer 143 and the searching window buffer 144 , and thus a pixel word of the current macroblock and a pixel word of the searching window can be read simultaneously from the current macroblock buffer 143 and the searching window buffer 144 every clock cycle, wherein each pixel word is divided into four pixels to be written into the register arrays (e.g. RA 0 , RA 1 , RA 2 , and RA 3 ).
  • the integer pixel estimation unit 151 writes pixels b 0 ~ b 3 of the searching window into the register array RB, and writes pixels a 0 ~ a 3 of the current macroblock into the register array RA 0 .
  • pixels a 0 ~ a 3 are arranged into different orders and written into the register arrays RA 1 , RA 2 and RA 3 , as illustrated in FIG. 7 .
  • the integer pixel estimation unit 151 may broadcast the pixels b 0 ~ b 3 of the searching window stored in the register array RB to all the 12 processing elements, and transmit pixels of the current macroblock stored in the register arrays RA 0 ~ RA 3 to the four sub-arrays through four transmission paths.
  • While the processing elements PE 1 ~ PE 4 have received pixels of the current macroblock and the searching window for calculation, the processing elements PE 5 ~ PE 12 are idling since they have not received the pixels of the current macroblock yet.
  • the integer pixel estimation unit 151 may keep on reading the current macroblock buffer 143 and the searching window buffer 144 , store pixels b 4 ~ b 7 of the searching window to the register array RB, and store pixels a 4 ~ a 7 of the current macroblock to the register array RA 0 .
  • the integer pixel estimation unit 151 may further reorder the pixels a 4 ~ a 7 of the current macroblock and substitute some pixels in the register arrays RA 1 ~ RA 3 with the reordered pixels a 4 ~ a 7 , as illustrated in FIG. 7 .
  • the integer pixel estimation unit 151 may broadcast the pixels b 4 ~ b 7 of the searching window to all the 12 processing elements. Pixels of the current macroblock stored in the register arrays RA 0 ~ RA 3 are transmitted to the four sub-arrays via four different transmission paths, so that the pixels can be transmitted sequentially in the processing elements in each sub-array.
  • the processing elements PE 1 ~ PE 8 have received pixels of the current macroblock and the searching window for calculation, but the processing elements PE 9 ~ PE 12 are idling since they have not received the pixels of the current macroblock yet.
  • the integer pixel estimation unit 151 may keep on reading the searching window buffer 144 , and store the pixels b 8 ~ b 11 of the searching window into the register array RB.
  • the integer pixel estimation unit 151 may further reorder pixels a 4 ~ a 7 of the current macroblock stored in the register array RA 0 , and substitute some pixels in the register arrays RA 1 ~ RA 3 with the reordered pixels, as illustrated in FIG. 7 .
  • the integer pixel estimation unit 151 may broadcast the pixels b 8 ~ b 11 of the searching window to all the 12 processing elements. Pixels of the current macroblock stored in the register arrays RA 0 ~ RA 3 are transmitted to the four sub-arrays via four different transmission paths, so that the pixels can be transmitted sequentially in the processing elements in each sub-array. Meanwhile, the integer pixel estimation unit 151 may keep reading the searching window buffer 144 , and store the pixels b 12 ~ b 15 of the searching window into the register array RB. Therefore, all processing elements on the four transmission paths have received pixel data for calculation in the fourth clock cycle.
  • the integer pixel estimation unit 151 may broadcast the pixels b 12 ~ b 15 of the searching window to all the 12 processing elements. Also, the processing elements PE 1 ~ PE 4 are idling since they do not receive any new pixels of the current macroblock, and the processing elements PE 5 ~ PE 12 have received pixels of the searching window and pixels of the current macroblock from the delaying units FF 0 ~ FF 7 for calculation. Meanwhile, the integer pixel estimation unit 151 may keep reading the searching window buffer 144 , and store the pixels b 16 ~ b 19 of the searching window into the register array RB.
  • the integer pixel estimation unit 151 has completed calculation of difference values of a pixel row (e.g. 12 integer pixels). Further, each processing element may comprise an accumulator, and the integer pixel estimation unit 151 may accumulate and store the difference values corresponding to the 12 candidate points, and calculation of a SAD 8×8 value of the 12 candidate points can be completed by repeating the aforementioned steps 8 times. Then, the least SAD 8×8 value can be obtained by using the comparators, and thus a corresponding motion vector MV 8×8 can be obtained.
  • the integer pixel estimation unit 151 may keep calculating the SAD 8×8 value of the 12 candidate points in the other three 8×8 blocks, thereby obtaining twelve SAD 16×16 values.
  • the integer pixel estimation unit 151 may further obtain the least SAD 16×16 value by using the comparators, thereby obtaining the corresponding motion vector MV 16×16 .
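In software terms, the work of the 12 processing elements amounts to evaluating 12 horizontally adjacent candidates at once. The following is a minimal sequential sketch, assuming the macroblock and searching window are given as lists of pixel rows; the function names are illustrative and the systolic data flow of FIG. 7 is not modeled.

```python
def sad_block(cur, ref, cx, cy, rx, ry, size=8):
    """Sum of absolute differences between a size x size block of cur and of ref."""
    return sum(
        abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
        for j in range(size)
        for i in range(size)
    )


def line_segment_search(cur, ref, x0, y0, points=12):
    """Evaluate `points` horizontally adjacent candidates starting at (x0, y0).

    cur is a 16x16 macroblock and ref is the searching window.
    Returns (best 16x16 candidate, its SAD, best candidate for each 8x8 block).
    """
    offsets = [(0, 0), (8, 0), (0, 8), (8, 8)]   # the four 8x8 blocks
    sad16 = []
    best8 = [None] * 4
    for k in range(points):
        x = x0 + k
        sad8 = [sad_block(cur, ref, ox, oy, x + ox, y0 + oy) for ox, oy in offsets]
        for b, s in enumerate(sad8):
            if best8[b] is None or s < best8[b][0]:
                best8[b] = (s, (x, y0))          # least SAD 8x8 so far
        sad16.append((sum(sad8), (x, y0)))       # SAD 16x16 is the sum of four SAD 8x8
    least, point = min(sad16)
    return point, least, [p for _, p in best8]
```

The per-block minima correspond to the MV 8×8 candidates and the minimum of the summed values to MV 16×16.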
  • FIG. 8 is a structure diagram illustrating a processing element in the integer pixel estimation unit 151 according to an embodiment of the invention.
  • the processing element may comprise four SAD calculating units and an accumulator.
  • the processing element may receive four pixels of the current macroblock and four pixels of the searching window, and calculate absolute difference values of the four pixel pairs.
  • the processing element may selectively accumulate the four absolute difference values.
  • the corresponding control signal is a fixed 4-bit value in clock cycles for performing calculation of motion estimation. Control signals between neighboring processing elements in the same set may have a one-clock-cycle delay. Accordingly, eight 4-bit flip-flops are used as delaying units in the integer pixel estimation unit 151 to distribute the control signal of each processing element.
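The selective accumulation can be sketched as follows. This is an illustration under assumptions: the 4-bit control signal is modeled as a mask in which bit i enables pixel pair i, a bit ordering the text does not specify.

```python
def pe_cycle(acc, cur4, ref4, enable):
    """One clock cycle of a processing element.

    cur4, ref4: four pixels of the current macroblock and of the searching window.
    enable: 4-bit mask (assumed layout), bit i gates the absolute difference of pair i.
    """
    for i in range(4):
        if (enable >> i) & 1:
            acc += abs(cur4[i] - ref4[i])
    return acc
```

Gating individual pairs lets one broadcast searching window word serve candidates that start at different pixel positions inside that word.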
  • a motion vector point of an integer pixel is often taken as a center, and eight candidate half pixels around the center are searched while performing searching of half pixels.
  • the reference macroblock corresponding to the eight half pixels is generated after linear interpolation of integer pixels.
  • h, v, d denote the half pixels in the horizontal direction, vertical direction and diagonal direction, respectively;
  • A 1 and A 2 denote the integer pixels horizontally neighboring to the half pixel h;
  • A 1 and A 3 denote the integer pixels vertically neighboring to the half pixel v; and
  • A 1 ~ A 4 denote the integer pixels neighboring to the half pixel d
  • the interpolation for half pixels in different directions can be expressed as the following equations:
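The equations themselves do not survive in this text. The sketch below assumes the conventional bilinear form with round-half-up and no rounding-control term; this is an assumption consistent with "linear interpolation", not a reproduction of the original equations.

```python
def half_h(a1, a2):
    """Horizontal half pixel between A1 and A2 (assumed rounding: add 1, shift by 1)."""
    return (a1 + a2 + 1) >> 1


def half_v(a1, a3):
    """Vertical half pixel between A1 and A3."""
    return (a1 + a3 + 1) >> 1


def half_d(a1, a2, a3, a4):
    """Diagonal half pixel at the center of A1..A4 (assumed rounding: add 2, shift by 2)."""
    return (a1 + a2 + a3 + a4 + 2) >> 2
```

With 8-bit pixels the largest intermediate sum in `half_d` is 1022, which fits in 10 bits.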
  • FIGS. 9A and 9B are portions of a diagram illustrating the hardware architecture of the half pixel estimation unit 152 according to an embodiment of the invention.
  • the half pixel estimation unit 152 may comprise 4 sets of 10-bit adders and 3 sets of rounding and shifting units to implement interpolation of half pixels.
  • the half pixel estimation unit 152 may further comprise eight parallel processing elements to implement searching of half pixels, as illustrated in FIGS. 9A and 9B .
  • pixels are stored in the format of pixel words in the current macroblock buffer 143 and the searching window buffer 144 .
  • the half pixel estimation unit 152 may read a pixel word of the current macroblock from the current macroblock buffer 143 and a pixel word of the searching window from the searching window buffer 144 simultaneously.
  • Each pixel word is unpacked into four pixels, and the unpacked four pixels are written into the register arrays (e.g. RA 10 and RA 11 ).
  • the current macroblock register comprises two ping-pong register arrays RA 10 and RA 11 , and each of the register arrays RA 10 and RA 11 may comprise eight 8-bit registers.
  • the searching window register is comprised of two ping-pong register arrays RB 10 and RB 11 , and each of the register arrays RB 10 and RB 11 may comprise ten 8-bit registers.
  • the half pixel estimation unit 152 may read eight pixels in the first row of the current macroblock from the current macroblock buffer 143 , and write the eight pixels in the first row into the register array RA 10 .
  • the half pixel estimation unit 152 may read eight pixels in the second row of the current macroblock from the current macroblock buffer 143 , and write the eight pixels in the second row into the register array RA 11 .
  • the half pixel estimation unit 152 may read 10 pixels in the first row of the searching window from the searching window buffer 144 , and write the 10 pixels in the first row to the register array RB 10 .
  • the half pixel estimation unit 152 may read 10 pixels in the second row of the searching window from the searching window buffer 144 , and write the 10 pixels in the second row to the register array RB 11 .
  • the half pixel estimation unit 152 may further read pixels in a subsequent new row of the current macroblock from the current macroblock buffer 143 , thereby substituting a prior row stored in the register array RA 10 or RA 11 with the new row.
  • the half pixel estimation unit 152 may further read pixels in a subsequent new row of the searching window from the searching window buffer 144 , thereby substituting a prior row stored in the register array RB 10 or RB 11 with the new row.
  • the half pixel estimation unit 152 may simultaneously generate 9 half pixels in a row in the horizontal direction, 8 half pixels in a column in the vertical direction, and 9 half pixels in a row in the diagonal direction, so that the criterion to search for eight candidate half pixels simultaneously can be satisfied. Further, two lines, each comprising 10 integer pixels, are required when the half pixel estimation unit 152 generates the aforementioned half pixels in different directions. In addition, the half pixel estimation unit 152 may read the two lines from the searching window buffer 144 , and write the two lines into the register arrays RB 10 and RB 11 , respectively. Since pixels are stored in the format of pixel words, the half pixel estimation unit 152 has to read three pixel words continuously from the searching window buffer 144 while reading 10 integer pixels in a line.
  • the half pixel estimation unit 152 may further unpack the three pixel words into 12 integer pixels, and align the integer pixels according to the locations of the motion vectors of integer pixels in the pixel words, thereby truncating two invalid integer pixels.
  • the half pixel estimation unit 152 may comprise 8 parallel processing elements PE 21 ~ PE 28 , and the processing elements PE 21 ~ PE 28 are divided into 3 groups.
  • the first group comprises the processing elements PE 21 ~ PE 24 , configured to calculate SAD values of four candidate half pixels in the diagonal direction.
  • the second group comprises the processing elements PE 25 and PE 26 , configured to calculate SAD values of two candidate half pixels in the vertical direction.
  • the third group comprises the processing elements PE 27 and PE 28 , configured to calculate SAD values of two candidate half pixels in the horizontal direction.
  • the half pixel estimation unit 152 may broadcast the pixels of the current macroblock stored in the register array RA 10 to the processing elements PE 23 , PE 24 and PE 26 through a first broadcasting path, and broadcast the pixels of the current macroblock stored in the register array RA 11 to the processing elements PE 21 , PE 22 , PE 25 , PE 27 and PE 28 through a second broadcasting path. Then, when the half pixel estimation unit 152 has completed calculation of interpolation of half pixels in a row, the broadcasting paths from the register arrays RA 10 and RA 11 may be interchanged.
  • the nine half pixels d 0 ~ d 8 in the diagonal direction generated by the half pixel estimation unit 152 are divided into two groups. For example, the half pixels d 0 ~ d 7 are transmitted to the processing elements PE 21 and PE 23 , and the half pixels d 1 ~ d 8 are transmitted to the processing elements PE 22 and PE 24 . Similarly, the nine half pixels h 0 ~ h 8 in the horizontal direction generated by the half pixel estimation unit 152 are divided into two groups. For example, the half pixels h 0 ~ h 7 are transmitted to the processing element PE 27 , and the half pixels h 1 ~ h 8 are transmitted to the processing element PE 28 . In addition, the eight half pixels v 0 ~ v 7 in the vertical direction generated by the half pixel estimation unit 152 are transmitted to the processing elements PE 25 and PE 26 simultaneously.
  • each processing element in the half pixel estimation unit 152 may comprise four SAD calculating units and an accumulator (as shown in FIG. 8 ), and it may take two clock cycles to complete calculation of SAD values of half pixels in a line.
  • the half pixel estimation unit 152 may accumulate the SAD values of half pixels in 8 lines to obtain eight SAD 8×8 values.
  • the half pixel estimation unit 152 may select the least SAD 8×8 value of half pixels by using the comparators, and compare the least SAD 8×8 value of half pixels with the least SAD 8×8 value of integer pixels, thereby obtaining the resulting motion vector MV 8×8 (i.e. the motion vector corresponding to the least SAD 8×8 value after comparison).
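The final comparison between half-pixel and integer-pixel results can be sketched as below. The offset table, the function name and the choice to express the result in half-pixel units are illustrative assumptions; the sketch takes the eight SAD 8×8 values as given rather than computing them.

```python
# Eight half-pixel candidates around the integer point, in half-pixel units.
HALF_OFFSETS = [(-1, -1), (0, -1), (1, -1),
                (-1, 0),           (1, 0),
                (-1, 1),  (0, 1),  (1, 1)]


def select_mv(int_mv, int_sad, half_sads):
    """Pick between the integer-pixel result and the best half-pixel candidate.

    int_mv is in integer pixels; the returned vector is in half-pixel units.
    half_sads maps each offset in HALF_OFFSETS to its SAD 8x8 value.
    """
    best_off, best_sad = min(half_sads.items(), key=lambda kv: kv[1])
    mvx, mvy = 2 * int_mv[0], 2 * int_mv[1]
    if best_sad < int_sad:   # a half pixel wins only if its SAD is strictly smaller
        return (mvx + best_off[0], mvy + best_off[1]), best_sad
    return (mvx, mvy), int_sad
```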
  • an in-loop filter is a necessary component in a video encoding system and a video decoding system for the H.264 and VC-1 standards.
  • the in-loop filter may reduce the discontinuity between neighboring macroblocks generated by the processes, such as DCT/iDCT and quantization/inverse quantization, thereby enhancing the image quality after motion compensation and increasing the efficiency for video encoding.
  • the codec module 1820 may comprise a hardware accelerator controller 1821 , a codec processing unit 1822 , an in-loop filtering acceleration circuit 1823 , an external storage unit 1830 and an internal storage unit 1840 .
  • the codec processing unit 1822 can be implemented by hardware circuits (i.e. hardware) or DSPs (i.e. software) configured to perform decoding processes, such as motion compensation, intra-frame prediction, inverse DCT, inverse quantization and zig-zag scan.
  • the functionality of the in-loop filtering acceleration circuit 1823 is identical to that of the in-loop filtering acceleration circuit 124 , and the details will not be described here. In the following sections, only the details of the in-loop filtering acceleration circuit 124 will be described.
  • the internal storage unit 1840 may comprise a searching window buffer 1841 , a first FIFO buffer 1842 , a de-blocking filter buffer 1843 , and a second FIFO buffer 1844 .
  • the searching window buffer 1841 is configured to store reference macroblocks for motion compensation.
  • the first FIFO buffer 1842 is configured to store RLL codes.
  • the de-blocking filter buffer 1843 is configured to store reconstructed macroblocks after motion compensation executed by the codec processing unit 1822 , and filtered macroblocks generated by the in-loop filtering acceleration circuit 1823 .
  • the in-loop filtering acceleration circuit 1823 may read the reconstructed macroblocks generated by the codec processing unit 1822 from the de-blocking filter buffer 1843 , perform in-loop filtering to the reconstructed macroblocks, and write the filtered macroblocks into the de-blocking filter buffer 1843 .
  • the second FIFO buffer 1844 is configured to store decoding parameters generated by the processing unit 1810 .
  • FIG. 10 is a diagram illustrating the in-loop filtering sequence in the H.264 standard according to an embodiment of the invention.
  • Y denotes a luminance macroblock
  • U and V denote a respective chrominance macroblock.
  • the filtering sequence for an in-loop filter in the H.264 standard is defined as follows: for each frame, the vertical edges of all 4×4 blocks are filtered first, from top to bottom and from left to right. Then, the horizontal edges of all 4×4 blocks are filtered, also from top to bottom and from left to right.
  • the in-loop filtering acceleration circuit 124 may perform video encoding/decoding by macroblock, and the edges to be filtered in each macroblock are the black bold lines illustrated in FIG. 10 .
  • the blocks filled with diagonal lines represent the current luminance macroblock and current chrominance macroblocks, and the white blocks represent neighboring luminance macroblocks and neighboring chrominance macroblocks of the current luminance macroblock and current chrominance macroblocks, respectively.
  • the in-loop filtering acceleration circuit 124 may re-define the filtering sequence for filtering edges of 4×4 blocks in a 16×16 macroblock as the order of numbers illustrated in FIG. 10 .
  • the in-loop filtering acceleration circuit 124 may filter the vertical edges of all 4×4 blocks from left to right and from top to bottom, and filter the horizontal edges of all 4×4 blocks from top to bottom and from left to right.
  • the in-loop filtering acceleration circuit 124 may use the neighboring macroblocks having overlapped edges effectively to reduce the memory bandwidth for accessing the external storage unit 130 by using the filtering sequence defined for the H.264 standard in the invention.
  • the in-loop filtering acceleration circuit 124 may read two 4×4 blocks located at the left/right side of the vertical edge from the de-blocking filter buffer 145 , and write the two 4×4 blocks to the transposition register arrays TA and TB (as shown in FIG. 14A , and details will be described later).
  • When the in-loop filtering acceleration circuit 124 has completed filtering of a vertical edge, it is not necessary for the in-loop filtering acceleration circuit 124 to write the 4×4 block, which is located at the right side of the vertical edge, back to the de-blocking filter buffer 145 .
  • the 4×4 block can be preserved in the de-blocking filter buffer 145 , so that the 4×4 block can be used as the block located at the left side of the next vertical edge. Accordingly, accessing (i.e. writing and reading) of a 4×4 block can be saved when the in-loop filtering acceleration circuit 124 performs filtering to a vertical edge. Similarly, another accessing operation (i.e. writing and reading) of a 4×4 block can be saved when the in-loop filtering acceleration circuit 124 performs filtering to a horizontal edge.
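The saving can be made concrete by counting block accesses along one row of consecutive vertical edges. This is a rough accounting sketch, assuming every edge in the row is filtered and that, without reuse, both blocks are read and written back for each edge; the function name is hypothetical.

```python
def block_accesses(edges, reuse):
    """Count 4x4 block reads and writes for `edges` consecutive vertical edges in a block row."""
    if not reuse:
        # Both neighboring blocks are fetched and stored for every edge.
        return {"reads": 2 * edges, "writes": 2 * edges}
    # With reuse, the right-hand block of one edge is kept and becomes the
    # left-hand block of the next edge, so each block is read and written once.
    return {"reads": edges + 1, "writes": edges + 1}
```

For the four vertical edges crossed by one block row of a 16×16 macroblock, this gives 5 reads and 5 writes instead of 8 and 8.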
  • FIG. 11 is a diagram illustrating the filtering sequence for an in-loop filter in the VC-1 standard according to an embodiment of the invention.
  • Y denotes a luminance macroblock
  • U and V denote a respective chrominance macroblock.
  • the filtering sequence in the in-loop filter defined by the VC-1 standard can be expressed as the following criteria:
  • When the in-loop filtering acceleration circuit 124 encodes or decodes a frame by macroblock, some edges of the current macroblock are not filtered by the in-loop filtering acceleration circuit 124 due to the limitation of the filtering sequence of the VC-1 standard, wherein the limitation may indicate that the right edge and the bottom edge are not filtered while performing in-loop filtering for each macroblock. Accordingly, the edges can only be filtered while the in-loop filtering acceleration circuit 124 performs filtering of the next macroblock or the macroblock exactly on the next line (i.e. the line beneath the current line).
  • the edges to be filtered may comprise some internal edges of the current macroblock, and some edges of the up, left, and upper-left neighboring macroblocks, such as the black bolded lines illustrated in FIG. 11 .
  • the blocks filled with diagonal lines are the luminance macroblock and chrominance macroblocks of the current macroblock, and the white blocks are the luminance macroblock and chrominance macroblocks of the neighboring macroblocks.
  • the in-loop filtering acceleration circuit 124 may re-define the filtering sequence for filtering edges of 4×4 blocks in a 16×16 macroblock as the order of numbers illustrated in FIG. 11 .
  • the in-loop filtering acceleration circuit 124 may filter horizontal edges. That is, the in-loop filtering acceleration circuit 124 may filter the horizontal edges of 8×8 blocks from bottom to top, and filter the horizontal edges of 4×4 blocks from top to bottom.
  • the in-loop filtering acceleration circuit 124 may filter vertical edges. That is, the in-loop filtering acceleration circuit 124 may filter the vertical edges of 8×8 blocks from right to left, and filter the vertical edges of 4×4 blocks from left to right.
  • the in-loop filtering acceleration circuit 124 may use the neighboring macroblocks having overlapped edges effectively to reduce the memory bandwidth for accessing the external storage unit 130 by using the filtering sequence re-defined for the VC-1 standard in the invention.
  • the reconstructed macroblocks generated by the in-loop filtering acceleration circuit 124 may compose a reconstructed frame, which is stored in the external storage unit 130 .
  • the pixels of the reconstructed macroblocks before in-loop filtering and pixels of the macroblocks after in-loop filtering are stored in the de-blocking filter buffer 145 of the internal storage unit 140 with the format of pixel words (e.g. word32 format). Briefly, each pixel has an 8-bit accuracy, and four horizontally adjacent pixels are placed into the same pixel word.
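Packing and unpacking a pixel word can be sketched as follows. The byte order is an assumption (the leftmost pixel is placed in the least significant byte); the text only states that four horizontally adjacent 8-bit pixels share one 32-bit word.

```python
def pack_word(p0, p1, p2, p3):
    """Pack four 8-bit pixels into one 32-bit pixel word (p0 in the low byte, by assumption)."""
    for p in (p0, p1, p2, p3):
        if not 0 <= p <= 255:
            raise ValueError("pixel out of 8-bit range")
    return p0 | (p1 << 8) | (p2 << 16) | (p3 << 24)


def unpack_word(word):
    """Split a 32-bit pixel word back into its four 8-bit pixels."""
    return [(word >> (8 * i)) & 0xFF for i in range(4)]
```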
  • the DCT and quantization accelerator 123 may write the reconstructed macroblock after motion compensation or spatial compensation into the de-blocking filter buffer 145 .
  • the hardware accelerator controller 121 may read the required neighboring macroblocks for in-loop filtering from the external storage unit 130 , and write the macroblocks into the de-blocking filter buffer 145 .
  • the hardware accelerator controller 121 may copy the reconstructed macroblocks and neighboring macroblocks after in-loop filtering to the external storage unit 130 by using the DMA controller 160 .
  • FIG. 12 is a diagram illustrating the architecture of the de-blocking filter buffer 145 according to an embodiment of the invention.
  • the de-blocking filter buffer 145 may have an architecture of four memory banks, so that the operations of reading, writing and filtering macroblocks can be executed in parallel to increase performance of the video encoding system 100 .
  • Each memory bank may store the current macroblock and certain lines of luminance/chrominance pixels above the current macroblock.
  • the de-blocking filter buffer 145 may store four lines of luminance/chrominance pixels above the current macroblock for the H.264 standard. Alternatively, the de-blocking filter buffer 145 may store 8 lines of luminance/chrominance pixels above the current macroblock for the VC-1 standard.
  • Two neighboring memory banks (e.g. memory banks 1 and 2 ) in the de-blocking filter buffer 145 are configured to store the current macroblock, the left neighboring macroblock, and two upper neighboring luminance and chrominance macroblocks, and the in-loop filtering acceleration circuit 124 may read the two neighboring memory banks simultaneously to perform the in-loop filtering process.
  • Other hardware accelerators or the DSP processor (e.g. the DCT and quantization accelerator 123) in the encoding module 120 may write the reconstructed macroblocks into a memory bank (e.g. memory bank 3) of the de-blocking filter buffer 145.
  • the hardware accelerator controller 121 may further read the upper neighboring macroblock of the reconstructed macroblock from the external storage unit 130 , and write the upper neighboring macroblock into a memory bank (e.g. memory bank 3 ) of the de-blocking filter buffer 145 .
  • the hardware accelerator controller 121 may further copy the reconstructed macroblock and the upper neighboring macroblock after in-loop filtering, which are stored in a memory bank (e.g. memory bank 0 ) of the de-blocking filter buffer 145 , to the external storage unit 130 .
  • FIGS. 13A˜13D are portions of a diagram illustrating the sequence of data accessing in the de-blocking filter buffer 145 according to an embodiment of the invention.
  • different hardware accelerators or the DSP processor of the encoding module 120 should access different memory banks of the de-blocking filter buffer 145 circularly via the DMA controller 160, as illustrated in FIGS. 13A˜13D.
  • three different indices are used in the de-blocking filter buffer 145 to prevent different hardware accelerators and the DMA controller 160 from accessing the same memory bank of the de-blocking filter buffer 145 .
  • the three aforementioned indices are configured to control different hardware accelerators and the DMA controller 160 to access different memory banks of the de-blocking filter buffer 145 .
  • the control mechanism of the indices can be expressed in the following steps:
  • the reading index rd_index is set to 0.
  • the DMA controller 160 may read the memory bank to which the reading index rd_index is pointing. Each time the DMA controller 160 has completed reading a macroblock and its upper neighboring macroblock, the DMA controller 160 may increment the reading index rd_index by 1.
  • the filter index filter_index is set to 0.
  • the in-loop filtering acceleration circuit 124 may access the two memory banks pointed to by filter_index and (filter_index−1). Each time the in-loop filtering acceleration circuit 124 has completed in-loop filtering of a macroblock, the in-loop filtering acceleration circuit 124 may increment the filter index filter_index by 1.
  • the writing index wr_index is set to 0.
  • the writing index wr_index is larger than (rd_index+2)
  • other hardware accelerators/the DSP processor, and the hardware accelerator controller 121 may write macroblock data to the memory bank to which the writing index wr_index is pointing. Each time other hardware accelerators/the DSP processor and the hardware accelerator controller 121 have completed writing of a macroblock and its upper neighboring macroblock, the aforementioned components may increment the writing index wr_index by 1.
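  • The purpose of the three indices can be illustrated with the short sketch below. It is not the control logic of the embodiment: the conditions under which each component is allowed to proceed are not reproduced here. The sketch assumes that each index is a running counter, that the memory bank is the counter modulo four (the description only states that access is circular), and that the filter uses the banks at filter_index and filter_index−1. The function names are hypothetical.

```python
NUM_BANKS = 4  # the de-blocking filter buffer 145 has four memory banks


def banks_in_use(rd_index, filter_index, wr_index):
    """Map the three indices to the memory banks each agent touches.

    Assumption: each index is a running counter and the bank number is the
    counter modulo NUM_BANKS.
    """
    return {
        "dma_read": {rd_index % NUM_BANKS},
        "filter": {filter_index % NUM_BANKS, (filter_index - 1) % NUM_BANKS},
        "write": {wr_index % NUM_BANKS},
    }


def has_conflict(rd_index, filter_index, wr_index):
    """Return True if two agents would touch the same memory bank."""
    used = banks_in_use(rd_index, filter_index, wr_index)
    all_banks = [b for banks in used.values() for b in banks]
    return len(all_banks) != len(set(all_banks))
```

  • With rd_index = 0, filter_index = 2 and wr_index = 3, the sketch yields bank 0 for reading out, banks 1 and 2 for filtering, and bank 3 for writing, which matches the example bank assignment given for the de-blocking filter buffer 145. Advancing all three indices by 1 keeps the four banks disjoint.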
  • FIGS. 14A and 14B are portions of a diagram illustrating the hardware architecture of the in-loop filtering acceleration circuit 124 according to an embodiment of the invention.
  • the filtering parameter such as boundary strength (BS) in the H.264 standard is calculated by the processing unit 110 .
  • the processing unit 110 may control the in-loop filtering acceleration circuit 124 by the hardware accelerator controller 121 .
  • the processing unit 110 may determine whether each edge should be filtered or not.
  • boundary strength is not defined in the VC-1 standard, and thus there are only two conditions, specifically, to be filtered or not, for each edge in the VC-1 standard.
  • two cases of boundary strength are defined for the VC-1 standard in the invention. That is, if the processing unit 110 determines that the edge should be filtered, the value of boundary strength is set to 5. Conversely, if the processing unit 110 determines that the edge should not be filtered, the value of boundary strength is set to 0.
  • the in-loop filtering acceleration circuit 124 only has to read macroblock data from the de-blocking filter buffer 145 , and select an appropriate one-dimensional (1D) filter according to filtering parameters, such as the value of boundary strength, to perform in-loop filtering of the corresponding edge.
  • the in-loop filtering acceleration circuit 124 may comprise two transposition register arrays TA and TB, a filter selection unit 1410, and multiple 1D filters (e.g. G_FILTER 0˜G_FILTER 1, S_FILTER 0˜S_FILTER 3 and V_FILTER). Since the reconstructed macroblock to be filtered is stored in the de-blocking filter buffer 145 with a format of pixel words, the in-loop filtering acceleration circuit 124 may read one pixel word from the de-blocking filter buffer 145 every clock cycle, unpack the pixel word into four pixels, and write the four pixels into the transposition register arrays TA and TB.
  • the in-loop filtering acceleration circuit 124 may read pixels in a 4×4 block column by column, or row by row, freely by using the transposition register arrays TA and TB, so that the same hardware circuit (e.g. 1D filter) can be used to filter horizontal edges and vertical edges.
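  • The role of a transposition register array can be sketched in software as reading the same 4×4 block either by rows or by columns. The sketch below is only an analogy for the register array; the function names are hypothetical and the block is modeled as a list of four rows.

```python
def read_rows(block):
    """Return the 4x4 block row by row (the order in which it was written)."""
    return [list(row) for row in block]


def read_columns(block):
    """Return the 4x4 block column by column (the transposed view)."""
    return [list(col) for col in zip(*block)]
```

  • A 1D filter that always consumes one line of four pixels can then be fed rows for one edge direction and columns for the other, without a second filter circuit.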
  • the in-loop filtering acceleration circuit 124 may start to perform in-loop filtering, and the procedures for in-loop filtering are described as the following steps:
  • step (3) If the processing unit 110 determines that the boundary strength BS of the current edge is equal to 5, it may indicate that the filtering process is to filter the current edge in the VC-1 standard, and step (4) is performed to select a 1D filter of the VC-1 standard. Otherwise, step (3) is performed.
  • , d 1
  • , and d 2
  • and d 4
  • the in-loop filtering acceleration circuit 124 may select a 1D filter according to the value of boundary strength to perform filtering of input pixels p 0˜p 3 and q 0˜q 3.
  • the in-loop filtering acceleration circuit 124 may select an H.264 strong filter (S_FILTER).
  • G_FILTER H.264 general filter
  • V_FILTER VC-1 filter
  • the in-loop filtering acceleration circuit 124 may write output pixels p 0′˜p 3′ back to the transposition register array TA, and write output pixels q 0′˜q 3′ back to the transposition register array TB.
  • the in-loop filtering acceleration circuit 124 may write 4×4 blocks, which are above the horizontal edge or located at the left side of the vertical edge, back to the de-blocking filter buffer 145. If a horizontal edge is processed, the in-loop filtering acceleration circuit 124 may read pixels by column, and four adjacent pixels in a column are packed into a pixel word to be written into the de-blocking filter buffer 145. If a vertical edge is processed, the in-loop filtering acceleration circuit 124 may read pixels by row, and four adjacent pixels in a row are packed into a pixel word to be written into the de-blocking filter buffer 145.
  • the filter selection unit 1410 of the in-loop filtering acceleration circuit 124 is configured to calculate filter selection parameters (e.g. d 0, d 3 and d 4) according to the input pixels, and to select a corresponding 1D filter according to the calculated filter selection parameters.
  • four filters are included in the H.264 strong filters, such as S_FILTER 0 , S_FILTER 1 , S_FILTER 2 , and S_FILTER 3 .
  • the parameters received by the filter selection unit 1410 may comprise boundary strength BS, a chrominance parameter chroma, a clipping parameter c 0 , a bit rate parameter alpha, a quantization parameter PQuant, and filter selection parameters d 0 , d 3 and d 4 .
  • the boundary strength BS is determined by the processing unit 110 .
  • the chrominance parameter chroma may indicate that the current macroblock is a luminance macroblock or a chrominance macroblock.
  • If the chrominance parameter chroma is 1, it may indicate that the current macroblock is a chrominance macroblock. Otherwise, it may indicate that the current macroblock is a luminance macroblock.
  • c 0 is a clipping parameter, which is obtained from a look-up table according to the boundary strength BS, used in H.264 general filters.
  • alpha is a bit rate parameter generated by the processing unit 110 while decoding a bitstream.
  • the quantization parameter PQuant is generated by the processing unit 110 .
  • the filter selection unit 1410 of the in-loop filtering acceleration circuit 124 may calculate filter selection parameters d 0 , d 3 and d 4 according to the input pixels.
  • the working principle of the filter selection unit 1410 is shown in FIGS. 15A and 15B .
  • the filter selection unit 1410 may select the filter type according to the boundary strength. Then, the filter selection unit 1410 may determine the 1D filter(s) to be used according to other parameters.
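  • The first selection stage, choosing a filter type from the boundary strength, can be sketched as below. This is a hedged illustration, not the logic of FIGS. 15A and 15B. It assumes the H.264 convention for values 0 to 4 (0 means the edge is not filtered, 1 to 3 use the general filter, 4 uses the strong filter) and treats 5 as the value that routes an edge to the VC-1 filter, as in the filtering steps described above. The function name and the returned strings are hypothetical, and the second stage, which picks the individual 1D filters from chroma, c 0, alpha, PQuant and d 0, d 3 and d 4, is not modeled.

```python
def select_filter_type(bs):
    """Choose a filter family from the boundary strength value.

    Assumptions: 0 means "do not filter", 1-3 select the H.264 general
    filter and 4 the H.264 strong filter (as in the H.264 standard), and
    5 marks an edge that is handled by the VC-1 filter.
    """
    if bs == 0:
        return None
    if 1 <= bs <= 3:
        return "G_FILTER"
    if bs == 4:
        return "S_FILTER"
    if bs == 5:
        return "V_FILTER"
    raise ValueError(f"unexpected boundary strength: {bs}")
```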
  • FIGS. 16A˜16F are diagrams illustrating the architecture of each H.264 1D filter according to an embodiment of the invention.
  • the in-loop filtering acceleration circuit 124 may start to perform filtering. It should be noted that a filtering procedure is generally completed by a certain number of 1D filters.
  • Each 1D filter may select a portion of input pixels p 0˜p 3 and q 0˜q 3 as an input, and perform calculation of the selected input pixels to obtain 1 or 2 results (i.e. filtered pixels), and substitute one or two pixels of the input pixels with the filtered pixels, thereby generating output pixels (e.g. pout, pout 1 or pout 2 in FIGS. 16A˜16F). Then, the output pixels are written back to the transposition register arrays TA or TB.
  • The architectures of the H.264 strong 1D filters (e.g. S_FILTER 0, S_FILTER 1, S_FILTER 2, and S_FILTER 3) and the H.264 general 1D filters (e.g. G_FILTER 0 and G_FILTER 1) are illustrated in FIGS. 16A˜16F. Each 1D filter comprises a certain number of adders, shifters and clipping units, wherein pin 0˜pin 4 denote input pins in different 1D filters, and pout, pout 1 and pout 2 denote the output pixels of the different 1D filters.
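  • As an illustration of how adders, shifters and clipping units combine in a 1D filter, the sketch below computes the two pixels nearest the edge with the normal (boundary strength below 4) luminance filter equation of the H.264 standard. It is taken from the standard rather than from FIGS. 16A˜16F, so it does not claim to match G_FILTER 0 or G_FILTER 1 exactly. The function names are hypothetical, c stands for the clipping bound already derived from c 0, and the per-sample threshold tests that decide whether a line is filtered at all are omitted.

```python
def clip3(lo, hi, x):
    """Clamp x to the range [lo, hi]."""
    return max(lo, min(hi, x))


def h264_normal_filter_p0_q0(p1, p0, q0, q1, c):
    """Filter the two pixels nearest the edge (8-bit samples).

    c is the clipping bound for the correction term; how it is derived
    from c0 is left out of this sketch.
    """
    # Two subtractions, one left shift, two additions and one right shift.
    delta = clip3(-c, c, (((q0 - p0) << 2) + (p1 - q1) + 4) >> 3)
    p0_out = clip3(0, 255, p0 + delta)
    q0_out = clip3(0, 255, q0 - delta)
    return p0_out, q0_out
```

  • For a step edge with p1 = p0 = 60 and q0 = q1 = 70, the unclipped correction is 4, so a bound of c = 4 moves the pixels to 64 and 66, while c = 2 limits the change to 62 and 68.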
  • FIGS. 17A˜17B are portions of a diagram illustrating the architecture of the VC-1 filter in the in-loop filtering acceleration circuit 124 according to an embodiment of the invention.
  • the VC-1 filter V_FILTER may comprise two parts. The first part may perform calculation of eight input pixels p 0˜p 3 and q 0˜q 3 to generate four internal parameters a 0 ,
  • the second part may further substitute the input pixels p 0 and q 0 with the output pixels p 0 ′ and q 0 ′, and write the output pixels back to the transposition register arrays TA and TB.
  • When the in-loop filtering acceleration circuit 124 performs filtering of horizontal edges, the horizontal edges of a 4×4 block in the third row should be filtered first.
  • When the in-loop filtering acceleration circuit 124 performs filtering of vertical edges, the vertical edges of a 4×4 block in the third column should be filtered first.
  • the flag 3rd_pel_pair is set to 1. Then, the VC-1 filter should further determine another flag filter_other_3_pixels. If the flag filter_other_3_pixels is 1, pixels in the remaining three rows or columns should be further filtered. Otherwise, the filtering process of the pixels in the remaining three rows or columns can be skipped.
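  • The ordering rule for the VC-1 filter (third row or column first, the other three only if required) can be sketched as a small control loop. The per-line arithmetic of V_FILTER is deliberately left out: filter_line is a hypothetical callback standing in for it, and the way it derives the filter_other_3_pixels decision is an assumption of this sketch.

```python
def vc1_filter_segment(lines, filter_line):
    """Filter the four pixel lines that cross one 4-pixel edge segment.

    lines       -- four rows or columns of pixels crossing the edge
    filter_line -- hypothetical callback (line, third_pel_pair) that filters
                   one line in place; when third_pel_pair is True its return
                   value is used as the filter_other_3_pixels flag
    Returns the indices of the lines that were filtered, in order.
    """
    third = 2  # the third row/column, counting from zero
    filter_other_3_pixels = filter_line(lines[third], True)
    processed = [third]
    if filter_other_3_pixels:
        for i in (0, 1, 3):
            filter_line(lines[i], False)
            processed.append(i)
    return processed
```

  • With a callback that returns a true flag the lines are visited in the order 2, 0, 1, 3; with a false flag only line 2 is touched, which is the saving the two flags are meant to provide.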
  • the in-loop filtering acceleration circuit 124 is used to perform filtering processes of horizontal edges, vertical edges and diagonal lines.
  • the in-loop filtering acceleration circuit 124 may comply with the H.264 standard (e.g. Baseline profile) and the VC-1 standard (e.g. Simple profile and Main profile).
  • the 1D filters in the in-loop filtering acceleration circuit 124 can be upgraded to comply with other video codec standards.

Abstract

A motion estimation method is provided. The method has the following steps of: determining a start searching point according to multiple neighboring macroblocks of a current macroblock, wherein the current macroblock corresponds to a searching window; determining a best candidate pixel according to a first line segment where the start searching point is located, and a second/third line segment above/beneath the first line segment; determining whether the best candidate pixel is located at the first line segment; if so, setting a candidate motion vector corresponding to the best candidate pixel as a first motion vector of the current macroblock; and if not, dynamically adjusting the second line segment or the third line segment in the searching window to update the best candidate pixel, and retrieving the first motion vector of the current macroblock corresponding to the updated best candidate pixel.

Description

    CROSS REFERENCE TO RELATED APPLICATIONS
  • This Application claims priority of China Patent Application No. 201210046566.9, filed on Feb. 27, 2012, the entirety of which is incorporated by reference herein.
  • BACKGROUND OF THE INVENTION
  • 1. Field of the Invention
  • The present invention relates to video processing, and in particular relates to a motion estimation acceleration circuit and an in-loop filtering acceleration circuit that recursively use data in the overlapped portions of neighboring macroblocks to reduce memory bandwidth.
  • 2. Description of the Related Art
  • Video compression standards, such as the MPEG2, H.264 and VC-1 standards, are widely used in the video codec (coding/decoding) systems on the market. In a video codec system, motion estimation and de-blocking filtering may account for the largest share of operations. If a video codec system performs motion estimation and de-blocking filtering by software only, it may place a serious burden on the processing unit. In addition, when a conventional hardware circuit performs motion estimation and de-blocking filtering, some previously used macroblock data may be read from the external memory repeatedly, so that the memory bandwidth for accessing the external memory is wasted.
  • BRIEF SUMMARY OF THE INVENTION
  • In an exemplary embodiment, a motion estimation acceleration circuit applied in a video encoding system supporting multiple video codec standards is provided. The circuit comprises: a start searching point prediction unit, configured to determine a start searching point according to multiple neighboring macroblocks of a current macroblock, wherein the current macroblock corresponds to a searching window; and an integer pixel estimation unit, configured to determine a best candidate pixel according to a first line segment where the start searching point is located, a second line segment above the first line segment, and a third line segment beneath the first line segment, wherein the integer pixel estimation unit further determines whether the best candidate pixel is located at the first line segment; if so, the integer pixel estimation unit sets a candidate motion vector corresponding to the best candidate pixel as a first current macroblock motion vector; if not, the integer pixel estimation unit dynamically adjusts the second line segment or the third line segment in the searching window to update the best candidate pixel, and retrieves the first current macroblock motion vector corresponding to the updated best candidate pixel.
  • In another exemplary embodiment, a motion estimation method is provided. The method has the following steps of: determining a start searching point according to multiple neighboring macroblocks of a current macroblock, wherein the current macroblock corresponds to a searching window; determining a best candidate pixel according to a first line segment where the start searching point is located, and a second/third line segment above/beneath the first line segment; determining whether the best candidate pixel is located at the first line segment; if so, setting a candidate motion vector corresponding to the best candidate pixel as a first motion vector of the current macroblock; and if not, dynamically adjusting the second line segment or the third line segment in the searching window to update the best candidate pixel, and retrieving the first motion vector of the current macroblock corresponding to the updated best candidate pixel.
  • In yet another exemplary embodiment, an in-loop filtering acceleration circuit applied in a video codec system supporting the H.264 standard and the VC-1 standard is provided. The video codec system comprises a processing unit to perform video processing to generate at least one reconstructed macroblock and a value of boundary strength corresponding to each edge of the reconstructed macroblock. The in-loop filtering acceleration circuit comprises: multiple one-dimensional (1D) filters configured to perform a filtering process; and a filter selection unit configured to select one of the 1D filters according to the value of the boundary strength to perform the filtering process to the reconstructed macroblock, wherein the in-loop filtering acceleration circuit further divides the reconstructed macroblock into multiple 8×8 blocks and multiple 4×4 blocks, performs the filtering process to horizontal edges of the 8×8 blocks of the reconstructed macroblock row by row according to a first predefined order, and performs the filtering process to horizontal edges of the 4×4 blocks row by row from top to bottom, wherein the in-loop filtering acceleration circuit further performs the filtering process to vertical edges of the 8×8 blocks column by column according to a second predefined order, and performs the filtering process to vertical edges of the 4×4 blocks column by column from left to right.
  • In yet another exemplary embodiment, an in-loop filtering method applied in an in-loop filtering acceleration circuit of a video codec system supporting the H.264 standard and the VC-1 standard is provided. The video codec system comprises a processing unit to perform video processing to generate at least one reconstructed macroblock and a value of boundary strength corresponding to each edge of the reconstructed macroblock. The method comprises the following steps of: dividing the reconstructed macroblock into multiple 8×8 blocks and multiple 4×4 blocks; selecting one of multiple 1D filters according to the value of the boundary strength to perform the filtering process to the reconstructed macroblock; performing the filtering process to horizontal edges of the 8×8 blocks of the reconstructed macroblock row by row according to a predefined order, and performing the filtering process to horizontal edges of the 4×4 blocks row by row from top to bottom; and performing the filtering process to vertical edges of the 8×8 blocks column by column according to another predefined order, and performing the filtering process to vertical edges of the 4×4 blocks column by column from left to right.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • The present invention can be more fully understood by reading the subsequent detailed description and examples with references made to the accompanying drawings, wherein:
  • FIG. 1 is a block diagram illustrating a video encoding system according to an embodiment of the invention;
  • FIG. 2 is a diagram illustrating prediction of the start search point in the motion estimation method according to an embodiment of the invention;
  • FIG. 3 is a diagram illustrating the motion estimation method according to an embodiment of the invention;
  • FIG. 4 is a diagram illustrating overlapped searching windows of horizontally neighboring macroblocks according to an embodiment of the invention;
  • FIGS. 5A˜5D are diagrams illustrating the architecture of the searching window buffer according to an embodiment of the invention;
  • FIG. 6 is a schematic diagram of the motion estimation acceleration circuit 122 according to an embodiment of the invention;
  • FIG. 7 is a block diagram illustrating the hardware architecture of the integer pixel estimation unit 151 according to an embodiment of the invention;
  • FIG. 8 is a structure diagram illustrating a processing element in the integer pixel estimation unit 151 according to an embodiment of the invention;
  • FIGS. 9A and 9B are portions of a diagram illustrating the hardware architecture of the half pixel estimation unit 152 according to an embodiment of the invention;
  • FIG. 10 is a diagram illustrating the in-loop filtering sequence in the H.264 standard according to an embodiment of the invention;
  • FIG. 11 is a diagram illustrating the in-loop filtering sequence in the VC-1 standard according to an embodiment of the invention;
  • FIG. 12 is a diagram illustrating the architecture of the de-blocking filter buffer 145 according to an embodiment of the invention;
  • FIGS. 13A˜13D are portions of a diagram illustrating the sequence of data accessing in the de-blocking filter buffer 145 according to an embodiment of the invention;
  • FIGS. 14A and 14B are portions of a diagram illustrating the hardware architecture of the in-loop filtering acceleration circuit 124 according to an embodiment of the invention;
  • FIGS. 15A and 15B are diagrams illustrating the working principle of the filter selection unit 1410 according to an embodiment of the invention;
  • FIGS. 16A˜16F are diagrams illustrating the architecture of each H.264 1D filter according to an embodiment of the invention;
  • FIGS. 17A˜17B are portions of a diagram illustrating the architecture of the VC-1 filter in the in-loop filtering acceleration circuit 124 according to an embodiment of the invention;
  • FIG. 18 is a block diagram illustrating a video codec system according to an embodiment of the invention;
  • FIGS. 19A and 19B are portions of a flow chart illustrating the motion estimation method according to an embodiment of the invention.
  • DETAILED DESCRIPTION OF THE INVENTION
  • The following description is of the best-contemplated mode of carrying out the invention. This description is made for the purpose of illustrating the general principles of the invention and should not be taken in a limiting sense. The scope of the invention is best determined by reference to the appended claims.
  • A. System Architecture
  • FIG. 1 is a block diagram illustrating a video encoding system according to an embodiment of the invention. The video encoding system 100 may comprise a processing unit 110, an encoding module 120, an external storage unit 130 and a DMA controller 160. During the video encoding procedure (e.g. MPEG2, H.263, and MPEG4 standards), the processing unit 110 may be a controller configured to execute a hardware accelerator control program, and execute an entropy encoding program, a bit rate control program, and a boundary extension program. For example, the processing unit 110 may be a central processing unit (CPU), a digital signal processor (DSP) or other equivalent circuits implementing the same functions.
  • The encoding module 120 may comprise a hardware accelerator controller 121, a motion estimation acceleration circuit 122, a DCT and quantization accelerator 123, an in-loop filtering acceleration circuit 124, and an internal storage unit 140. In an embodiment, the encoding module 120 can be divided into a hardware encoding unit and a software encoding unit (not shown in FIG. 1). That is, each component in the encoding module 120 may be implemented by hardware or a DSP (i.e. software) configured to perform encoding processes, such as motion estimation, motion compensation, discrete cosine transform/inverse transform (DCT/iDCT), quantization/inverse quantization, zig-zag scan, and in-loop filtering. However, the motion estimation acceleration circuit 122 and the in-loop filtering acceleration circuit 124 are dedicated digital logic circuits or hardware to implement encoding processes, such as motion estimation and in-loop filtering processing.
  • For ease of explanation, the hardware accelerator controller 121, the motion estimation acceleration circuit 122, the DCT and quantization accelerator 123, and the in-loop filtering acceleration circuit 124 in the encoding module 120 of FIG. 1 are implemented by hardware. The hardware components, such as the processing unit 110 and the encoding module 120, may utilize a frame level flow control method indicating that the CPU may decode the next frame when the hardware components of the encoding module 120 decode the current frame. The data flow between each component (e.g. all hardware, or integrated by hardware/software) in the encoding module 120 may be macroblock level flow control. The external storage unit 130 is configured to store reference frames, reconstructed frames, decoding parameters, and run-last-level codes (i.e. RLL codes). For example, the external storage unit 130 may be a volatile memory component (e.g. random access memory, such as DRAM, SRAM) and/or a non-volatile memory component (e.g. ROM, hardware accelerator, CDROM). The DMA controller 160 is configured to retrieve macroblock data and encoding parameters corresponding to the encoding process. The hardware accelerator controller 121 in the encoding module 120 may read the required macroblock data (e.g. the current macroblock and reference macroblock) from the external storage unit 130 to the internal storage unit 140 through the DMA controller 160.
  • In an embodiment, the processing unit 110 may control each component in the encoding module 120. First, the processing unit 110 may set and check register values associated with the hardware accelerator controller 121, and then activate the encoding module 120 to encode the current frame. It is necessary for the processing unit 110 to request and register a corresponding DMA channel, check status of the DMA channel, and set registers associated with the DMA controller 160 to activate the DMA controller. After the processing unit 110 activates the encoding module 120 and the DMA controller 160, the encoding module 120 may start to encode the current frame. It should be noted that the encoding module 120 and the processing unit 110 are controlled by a frame level flow. Before the hardware accelerator finishes the encoding procedure of each current frame, the processing unit 110 (i.e. software) may pre-execute an encoding program (e.g. program codes) for performing calculation of entropy encoding and bit rate control of the previous frame. The encoding program may detect whether the hardware encoding unit has completed the encoding procedure of the current frame. When the encoding module 120 has not finished the encoding procedure of the current frame yet, the processing unit 110 may execute other programs having higher priority and being ready for execution. Specifically, when the encoding module 120 has finished the encoding procedure of the current frame, the encoding module 120 may generate an interrupt signal. Accordingly, an interrupt service program executed by the processing unit 110 may send an event completion signal to the encoding program. Then, the encoding program may retake control of the processing unit 110 to encode the next frame.
  • In another embodiment, the processing unit 110 may further execute various programs to perform encoding post-processing, such as executing an entropy encoding program, a bit rate control program and a boundary extension program. The entropy encoding program may indicate that the processing unit 110 reads encoding parameters and RLL codes from the external storage unit 130 to perform entropy encoding, and outputs a video bitstream of an image. The bit rate control program may indicate that the processing unit 110 calculates quantization parameters of the next frame according to encoding results of the current frame, the total bit rate, and the frame rate. The boundary extension program may indicate that the processing unit 110 performs boundary extension to the reconstructed frame outputted by the hardware encoding unit, which is used for calculation of motion estimation of the next frame.
  • In an embodiment, the internal storage unit 140 may comprise a residue macroblock buffer 141, a first-in-first-out (FIFO) buffer 142, a current macroblock buffer 143, a searching window buffer 144, and a de-blocking filter buffer 145. The residue macroblock buffer 141 is configured to store residue values of macroblocks for motion compensation. The FIFO buffer 142 is configured to store encoding parameters and RLL codes, wherein the encoding parameters are from the hardware accelerator controller 121, and the RLL codes are from the DCT and quantization accelerator 123. The current macroblock buffer 143 is configured to store the current macroblock. The searching window buffer 144 is configured to store macroblocks in the searching window for motion estimation. The de-blocking filter buffer 145 is configured to store reconstructed macroblocks after motion compensation and filtered macroblocks generated by the in-loop filtering acceleration circuit 124. In addition, the in-loop filtering acceleration circuit 124 reads reconstructed macroblocks, which are generated by the DCT and quantization accelerator 123, from the de-blocking filter buffer 145, and performs in-loop filtering to the reconstructed macroblocks to generate filtered macroblocks, and writes the filtered macroblocks into the de-blocking filter buffer 145.
  • The hardware accelerator controller 121 may set and manage each component in the encoding module 120. For example, when the motion estimation acceleration circuit 122 in the encoding module 120 has completed encoding of a macroblock, the motion estimation acceleration circuit 122 may send a first interrupt signal to the hardware accelerator controller 121. Meanwhile, the hardware accelerator controller 121 may set and activate subsequent corresponding accelerators and acceleration circuits. When hardware (e.g. the in-loop filtering acceleration circuit 124) in the encoding module 120 has completed encoding of a frame, the hardware accelerator controller 121 may send a second interrupt signal to the processing unit 110. Then, the processing unit 110 may write the encoding parameters to registers (not shown) inside the hardware accelerator controller 121 directly, so that the hardware accelerator controller 121 may set each hardware component in the encoding module 120.
  • B. Motion Estimation Method
  • B-1. Prediction of the Start Searching Point
  • The motion estimation acceleration circuit 122 in the invention may use a prediction-based 12-point line searching algorithm to complete motion estimation of integer pixels (details will be described later), and to perform motion estimation of half pixels. The motion estimation acceleration circuit 122 may search for eight points while performing motion estimation of half pixels, and the interpolation and motion estimation of half pixels can be executed in parallel. The motion estimation method for integer pixels provided in the invention may comprise the following four steps of: (1) predicting the start searching point; (2) 12-point line searching based on an 8×8 block; (3) motion searching of 16×16 macroblocks; and (4) determining the macroblock mode for motion estimation.
  • FIG. 2 is a diagram illustrating prediction of the start searching point in the motion estimation method according to an embodiment of the invention. FIGS. 19A and 19B are portions of a flow chart illustrating the motion estimation method according to an embodiment of the invention. Referring to FIGS. 2, 19A and 19B, the motion estimation acceleration circuit 122 may confirm the start searching point for every macroblock before performing motion estimation. The motion estimation acceleration circuit 122 may predict the start searching point by using motion vectors of neighboring macroblocks. As illustrated in FIG. 2, motion vectors MVa, MVb, MVc and MVd of a left neighboring macroblock A, an upper neighboring macroblock B, an upper-right neighboring macroblock C and an upper-left neighboring macroblock D of the current macroblock E are referenced to predict the start searching point. First, the four points pointed to by the motion vectors MVa, MVb, MVc, and MVd of the four neighboring macroblocks of the current macroblock E are checked, and the sum of absolute differences (SAD) corresponding to each of the four points is calculated. The point with the least SAD value is regarded as the start searching point for motion estimation. It should be noted that some neighboring macroblocks may not exist if the current macroblock is located at the boundary of the image. In that case, a zero-valued motion vector may be used to substitute the motion vectors of the non-existing neighboring macroblocks, and the predicted reference point is set to the zero point.
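  • As an illustration only, the start point prediction described above can be sketched in software as follows. The function names `sad` and `predict_start_point`, the list-of-lists frame layout, and the use of `None` for a non-existing neighbor are assumptions made for this sketch and are not taken from the embodiment; the hardware operates on buffers and pixel words rather than on Python lists.

```python
def sad(cur, ref, x, y):
    """Sum of absolute differences between `cur` and the same-sized block of `ref` at (x, y)."""
    n = len(cur)
    return sum(abs(cur[r][c] - ref[y + r][x + c]) for r in range(n) for c in range(n))


def predict_start_point(cur, ref, mb_x, mb_y, neighbor_mvs):
    """Return the neighbor motion vector (dx, dy) whose target point has the least SAD.

    neighbor_mvs holds MVa, MVb, MVc, MVd; a missing neighbor is given as None.
    """
    # A non-existing neighbor is replaced by the zero-valued motion vector.
    candidates = [mv if mv is not None else (0, 0) for mv in neighbor_mvs]
    # Assumption: every candidate lies inside `ref`; no clipping is performed here.
    return min(candidates, key=lambda mv: sad(cur, ref, mb_x + mv[0], mb_y + mv[1]))
```

  • In this sketch, ties are resolved in favor of the first candidate in the list, which is a choice of the sketch and not something the description specifies.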
  • B-2. 12-Point Line Segment Searching of Integer Pixels
  • FIG. 3 is a diagram illustrating the motion estimation method according to an embodiment of the invention. The motion estimation method used in the motion estimation acceleration circuit 122 is based on searching the 12-point line segments of integer pixels. FIGS. 19A and 19B show a flow chart illustrating the motion estimation method according to an embodiment of the invention.
  • Four steps are described in the motion estimation method. Step 1: as illustrated in FIG. 3, the current macroblock is divided into four 8×8 blocks. For each 8×8 block, the motion estimation acceleration circuit 122 may search three 12-point line segments p−1, p and p+1, taking the pixel word at which the start point S1 is located as center, and thus there are 36 candidate points, such as the white points illustrated in FIG. 3. Then, a SAD16×16 value of a candidate point can be obtained by summing the four SAD8×8 values corresponding to the same candidate point (i.e. 36 SAD16×16 values in total). If the reference point corresponding to the least SAD16×16 value (e.g. the best reference point, such as the gray point illustrated in FIG. 3) is located on the line segment p+1, step 2 is performed. If the best reference point is located on the line segment p−1, step 3 is performed. Otherwise, step 4 is performed.
  • Step 2: the motion estimation acceleration circuit 122 sets the value p=p+1, and searches for 12 candidate points on the line segment p. Furthermore, the locations of the 12 candidate points on the line segment p+1 should be adjusted horizontally according to the location of the best reference point on the line segment p, and thus it can be ensured that the pixel word of the middle four points and the pixel word of the best reference point on the line segment p are located in the same row. Then, the 12 candidate points on the line segment p are searched, and the SAD16×16 value of each candidate point can be obtained by summing the four SAD8×8 values corresponding to the same candidate point. If the reference point corresponding to the least SAD16×16 value (i.e. the best reference point) is located on the line segment p, step 4 is performed. Otherwise, step 2 is performed repeatedly until the reference point corresponding to the least SAD16×16 value is located on the line segment p or the boundary of a 48×48 searching window is reached.
  • Step 3: the motion estimation acceleration circuit 122 sets the value p=p−1, and searches for the 12 candidate points on the line segment p. Furthermore, the locations of the 12 candidate points on the line segment p+1 should be adjusted horizontally according to the location of the best reference point on the line segment p, and thus it can be ensured that the pixel word of the middle four points and the pixel word of the best reference point on the line segment p are located in the same row. Then, the 12 candidate points on the line segment p are searched, and the SAD16×16 value of each candidate point can be obtained by summing the four SAD8×8 values corresponding to the same candidate point. If the reference point corresponding to the least SAD16×16 value (i.e. the best reference point) is located on the line segment p, step 4 is performed. Otherwise, step 3 is performed repeatedly until the reference point corresponding to the least SAD16×16 value is located on the line segment p or the boundary of a 48×48 searching window is reached.
  • Step 4: the motion estimation acceleration circuit 122 may set the motion vector MV16×16 of the 16×16 macroblock to the motion vector corresponding to the least SAD16×16 value, and set the motion vector MV8×8 of each of the four 8×8 blocks to the motion vector corresponding to the least SAD8×8 value of that block.
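  • The relationship between the SAD8×8 values, the SAD16×16 values and the resulting motion vectors in steps 1 to 4 can be illustrated with the following sketch. The names `sad8`, `BLOCKS` and `evaluate_candidates`, and the representation of the macroblock and the searching window as lists of rows, are hypothetical and serve only to show how one pass over the candidate points yields both MV16×16 and the four MV8×8 values.

```python
# Top-left corners (x, y) of the four 8x8 blocks inside a 16x16 macroblock.
BLOCKS = [(0, 0), (8, 0), (0, 8), (8, 8)]


def sad8(cur, ref, bx, by, x, y):
    """SAD of the 8x8 block of `cur` at (bx, by) against `ref` displaced to candidate (x, y)."""
    return sum(
        abs(cur[by + r][bx + c] - ref[y + by + r][x + bx + c])
        for r in range(8)
        for c in range(8)
    )


def evaluate_candidates(cur, ref, candidates):
    """Return (best16, best8): the least (SAD16x16, point) and, per block, the least (SAD8x8, point)."""
    best16 = None
    best8 = [None] * 4
    for point in candidates:
        sads = [sad8(cur, ref, bx, by, point[0], point[1]) for bx, by in BLOCKS]
        total = sum(sads)  # SAD16x16 is the sum of the four SAD8x8 values
        if best16 is None or total < best16[0]:
            best16 = (total, point)
        for i, s in enumerate(sads):
            if best8[i] is None or s < best8[i][0]:
                best8[i] = (s, point)
    return best16, best8
```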
  • Referring to FIGS. 19A and 19B, details of the aforementioned steps 1˜4 can be described with the steps illustrated in FIGS. 19A and 19B:
  • (a) In step S1901, the current macroblock is divided into at least one 8×8 block. For each 8×8 block, taking a pixel word comprising four pixels at which the start searching point is located as center, 36 initial candidate points can be retrieved from a first line segment, a second line segment and a third line segment (i.e. the first/second/third line segments are aligned, as shown in FIG. 3), wherein the first line segment comprises the pixel word and four neighboring pixels at each of the right and left sides of the pixel word, the second line segment is above the first line segment, and the third line segment is beneath the first line segment;
  • (b) In step S1902, a first SAD value of each initial candidate point relative to each 8×8 block is calculated, thereby obtaining an initial current macroblock SAD value corresponding to each initial candidate point. Thus, a first least current macroblock SAD value can be obtained according to the initial current macroblock SAD values;
  • (c) In step S1903, it is determined whether a best reference point corresponding to the first least current macroblock SAD value is located on the second line segment or not. If so, step (d) (i.e. step S1905) is performed. If not, it is further determined whether the reference point corresponding to the first least current macroblock SAD value is located on the third line segment (step S1904). If so, step (g) (i.e. step S1909) is performed. Otherwise, step (j) (i.e. step S1912) is performed;
  • (d) In step S1905, it is determined whether the second line segment is located on a boundary of a searching window corresponding to the current macroblock or not. If so, step (j) (i.e. step S1912) is performed. If not, the second line segment is moved down by a pixel, and the moved second line segment is adjusted horizontally to generate 12 first refined candidate points according to a pixel word where the best reference point is located (step S1906), and step (e) is performed;
  • (e) In step S1907, a second sub-macroblock SAD value of each first refined candidate point relative to each 8×8 block is calculated, thereby obtaining a second current macroblock SAD value corresponding to each first refined candidate point. Then, a second least current macroblock SAD value can be obtained according to the second current macroblock SAD value corresponding to each first refined candidate point;
  • (f) In step S1908, it is determined whether the second least current macroblock SAD value is larger than the first least current macroblock SAD value. If so, step (j) (i.e. step S1912) is performed. If not, the second least current macroblock SAD value is set to the first least current macroblock SAD value, and step (d) (i.e. step S1905) is performed;
  • (g) In step S1909, it is determined whether the third line segment is located on a boundary of the searching window corresponding to the current macroblock. If so, step (j) (i.e. step S1912) is performed. If not, the third line segment is moved up by one pixel, and the moved third line segment is adjusted horizontally to generate 12 second refined candidate points according to a pixel word where the best reference point is located (step S1913), and step (h) (i.e. step S1910) is performed;
  • (h) In step S1910, a third sub-macroblock SAD value of each second refined candidate point relative to each 8×8 block is calculated, thereby obtaining a third current macroblock SAD value corresponding to each second refined candidate point. Then, a third least current macroblock SAD value can be obtained according to the third current macroblock SAD value corresponding to each second refined candidate point;
  • (i) In step S1911, it is determined whether the third least current macroblock SAD value is larger than the first least current macroblock SAD value. If so, step (j) (i.e. step S1912) is performed. If not, the third least current macroblock SAD value is set to the first least current macroblock SAD value, and step (g) (i.e. step S1909) is performed;
  • (j) In step S1912, the current macroblock integer pixel motion vector is set to a first motion vector corresponding to the first least current macroblock SAD value, and multiple sub-macroblock motion vectors corresponding to the 8×8 blocks in the current macroblock are set to multiple motion vectors pointing to the second sub-macroblock SAD values or the third sub-macroblock SAD values.
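  • A simplified software sketch of the line segment search in steps (a) to (j) is given below. It rests on several assumptions that are not part of the embodiment: candidate points are coordinates relative to the top-left corner of the searching window, pixel words are aligned to multiples of four columns, the SAD is computed over the whole macroblock rather than as four SAD8×8 values, and the names `sad`, `best_on_segment` and `line_search` are hypothetical. The sketch is agnostic about which direction is "up"; it simply keeps moving toward the row that held the best point.

```python
def sad(cur, ref, x, y):
    """Sum of absolute differences between `cur` and the same-sized block of `ref` at (x, y)."""
    n = len(cur)
    return sum(abs(cur[r][c] - ref[y + r][x + c]) for r in range(n) for c in range(n))


def best_on_segment(cur, ref, cx, row):
    """Least (SAD, point) on the 12-point segment centred on the pixel word containing column cx."""
    max_x = len(ref[0]) - len(cur)
    word = (cx // 4) * 4  # assumed word alignment: four pixels per pixel word
    # The pixel word plus four pixels on each side, clipped to the searching window.
    points = [(x, row) for x in range(word - 4, word + 8) if 0 <= x <= max_x]
    return min((sad(cur, ref, x, y), (x, y)) for x, y in points)


def line_search(cur, ref, start_x, start_y):
    """Return (best_point, best_sad) found by the three-segment start and the iterative refinement."""
    max_y = len(ref) - len(cur)
    rows = [r for r in (start_y - 1, start_y, start_y + 1) if 0 <= r <= max_y]
    best_sad, best_pt = min(best_on_segment(cur, ref, start_x, r) for r in rows)
    step = best_pt[1] - start_y  # -1, 0 or +1: direction of the segment that held the best point
    while step != 0:
        row = best_pt[1] + step
        if not 0 <= row <= max_y:  # boundary of the searching window reached
            break
        cand_sad, cand_pt = best_on_segment(cur, ref, best_pt[0], row)
        if cand_sad > best_sad:  # the new segment is worse: keep the previous best point
            break
        best_sad, best_pt = cand_sad, cand_pt
    return best_pt, best_sad
```

  • Because each new segment is re-centred on the pixel word of the current best point, the search can drift horizontally while it advances one row at a time, which is the purpose of the horizontal adjustment in steps (d) and (g).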
  • B-3. 8-Point Searching Based on Half Pixels
  • The motion estimation acceleration circuit 122 may take the reference point corresponding to the least SAD16×16 value as center, and search the eight half pixels around the center. If the SAD8×8 or SAD16×16 value corresponding to a half pixel is smaller than the corresponding SAD value of the integer pixels, the motion estimation acceleration circuit 122 may update the motion vectors corresponding to the 8×8 blocks or the 16×16 macroblock.
  • B-4. Decision of Macroblock Mode for Motion Estimation
  • For the MPEG4 standard, the motion estimation acceleration circuit 122 may determine whether an INTER mode (i.e. for 16×16 macroblocks) or an INTER4V mode (i.e. for 8×8 blocks) is used for encoding the current macroblock according to a rate distortion optimization (RDO) value. The mode with a smaller RDO value may have a higher priority, and the motion estimation acceleration circuit 122 may select the mode with a smaller RDO value as the encoding mode for the current macroblock.
  • C. Storage Format of Current Macroblock Buffer and Searching Window Buffer
  • In an embodiment, the current frame and the reference frame for motion estimation are stored in the external storage unit 130, and the current macroblock and the searching window are stored in the internal storage unit 140. When starting the encoding process, the hardware accelerator controller 121 may read the current macroblock and the searching window from the external storage unit 130, and write the current macroblock and the searching window to the internal storage unit 140. The current macroblock is stored in the current macroblock buffer 143, and the pixels of the searching window are stored in the searching window buffer 144. For the current macroblock and the searching window, each pixel may have an 8-bit accuracy, and neighboring pixels in the horizontal direction are placed into the same pixel word.
  • FIG. 4 is a diagram illustrating overlapped searching windows of horizontally neighboring macroblocks according to an embodiment of the invention. In an embodiment, a search range for motion estimation used in the motion estimation acceleration circuit 122 is (−16, 15.5), and the size of the corresponding searching window may be 48×48 pixels. As illustrated in FIG. 4, the overlapped portion between the searching windows of the two horizontally neighboring macroblocks is 32×48 pixels.
  • In order to reduce the memory bandwidth for accessing the external storage unit 130 by using the overlapped portion effectively, the searching window buffer 144 in the invention is implemented in the architecture of four memory banks. Each memory bank may store a region of 16×48 pixels. The motion estimation acceleration circuit 122 may access a 48×48 searching window comprising three memory banks, whereas the remaining memory bank is accessed by the DMA controller 160. That is, the DMA controller 160 may read the region of 16×48 pixels for motion estimation of the next macroblock from the external storage unit 130 to the searching window buffer 144. Since there are four memory banks in the searching window buffer 144, it can be ensured that the calculation of motion estimation and accessing of the searching window of the next macroblock can be performed in parallel.
  • FIGS. 5A˜5D are diagrams illustrating the architecture of the searching window buffer according to an embodiment of the invention. Given that four neighboring macroblocks are MB1, MB2, MB3 and MB4, when the motion estimation acceleration circuit 122 performs motion estimation of the current macroblock by using the respective macroblocks MB1, MB2, MB3, and MB4, the searching window comprises three different memory banks in the searching window buffer 144 alternately. Meanwhile, the DMA controller 160 may write the region of 16×48 pixels for motion estimation of the next macroblock into the memory bank 4, memory bank 1, memory bank 2 and memory bank 3 sequentially, as illustrated in FIGS. 5A˜5D. Accordingly, the motion estimation acceleration circuit 122 may read the 48×48 searching window from the external storage unit 130 when starting calculation for motion estimation of the first macroblock in each row. For calculation of motion estimation of the remaining macroblocks in each row, the motion estimation acceleration circuit 122 may only have to read a region of 16×48 pixels from the external storage unit 130. Therefore, the invention may reduce the memory bandwidth for accessing the external storage unit 130 effectively.
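  • The bank rotation and the resulting saving in external memory traffic can be illustrated with the short sketch below. The DMA write order (bank 4, 1, 2, 3) follows the description above; the assumption that the motion estimation circuit reads the three banks other than the one being written, and the names `bank_schedule` and `external_pixels_per_row`, are introduced only for this sketch.

```python
def bank_schedule(num_macroblocks, num_banks=4):
    """For each macroblock in a row, return (banks read for motion estimation, bank written by DMA).

    Banks are numbered from 1. Three banks form the 48x48 searching window while
    the remaining bank is filled with the next 16x48 region.
    """
    schedule = []
    for i in range(num_macroblocks):
        me_banks = [(i + k) % num_banks + 1 for k in range(3)]
        dma_bank = (i + 3) % num_banks + 1
        schedule.append((me_banks, dma_bank))
    return schedule


def external_pixels_per_row(num_macroblocks, reuse_overlap=True):
    """Pixels fetched from external storage for one row of macroblocks."""
    if reuse_overlap:
        # Full 48x48 window for the first macroblock, then one 16x48 region per macroblock.
        return 48 * 48 + (num_macroblocks - 1) * 16 * 48
    return num_macroblocks * 48 * 48
```

  • For a row of 10 macroblocks, this sketch gives 9216 fetched pixels with overlap reuse against 23040 without it.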
  • D. Architecture of Motion Estimation Acceleration Circuit
  • FIG. 6 is a schematic diagram of the motion estimation acceleration circuit 122 according to an embodiment of the invention. The motion estimation acceleration circuit 122 may comprise a start searching point prediction unit 150, an integer pixel estimation unit 151, a half pixel estimation unit 152, and a prediction difference calculating unit 153. Each component in the motion estimation acceleration circuit 122 may execute a calculating procedure associated with its name, respectively. For example, the start searching point prediction unit 150 may search for and predict the start point for motion estimation, as described in section B-1 and illustrated in FIG. 2. After the motion estimation acceleration circuit 122 is activated, the start searching point prediction unit 150 may read pixels of the searching window and the current macroblock from the searching window buffer 144 and the current macroblock buffer 143, respectively, according to motion vectors of the neighboring macroblocks of the current macroblock. Then, the start searching point prediction unit 150 may further calculate SAD values of the candidate points, and select a start searching point prediction value by comparing all the SAD values. Further, the start searching point prediction unit 150 may transmit the start searching point prediction value to the integer pixel estimation unit 151, so that the integer pixel estimation unit 151 may perform a 12-point line segment searching process for motion estimation.
  • The integer pixel estimation unit 151 may read pixels of the searching window and the current macroblock from the searching window buffer 144 and the current macroblock buffer 143, respectively. Then, the integer pixel estimation unit 151 may calculate SAD values of all candidate points, and determine motion vectors of integer pixels by comparing all the SAD values. The integer pixel estimation unit 151 may transmit the motion vectors of integer pixels to the half pixel estimation unit 152.
  • The half pixel estimation unit 152 may perform calculation of interpolation and motion estimation of half pixels. The half pixel estimation unit 152 may read pixels of the searching window and the current macroblock from the searching window buffer 144 and the current macroblock buffer 143, respectively, and generate reference macroblocks by interpolation. The half pixel estimation unit 152 may further calculate SAD values of all candidate points, and determine motion vectors for half pixels by comparing all the SAD values.
  • The prediction difference calculating unit 153 may read pixels of the best reference macroblock from the searching window buffer 144 according to the motion vectors for half pixels generated by the half pixel estimation unit 152. The prediction difference calculating unit 153 may further obtain residue values by calculating the differences between pixels of the best reference macroblock and pixels of the current macroblock, and write the residue values into the residue macroblock buffer 141.
  • E. Hardware Architecture for Searching Integer Pixels
  • FIG. 7 is a block diagram illustrating the hardware architecture of the integer pixel estimation unit 151 according to an embodiment of the invention. In an embodiment, the integer pixel estimation unit 151 may implement the aforementioned 12-point line segment searching algorithm by using a systolic array comprising 12 parallel processing elements (PEs). As illustrated in FIG. 7, the 12 processing elements of the integer pixel estimation unit 151 may be divided into four sub-arrays, wherein the first sub-array comprises processing elements PE1, PE5 and PE9; the second sub-array comprises processing elements PE2, PE6 and PE10; the third sub-array comprises processing elements PE3, PE7 and PE11; and the fourth sub-array comprises processing elements PE4, PE8 and PE12. Each processing element may have two input terminals, and pixels in the searching window buffer 144 can be broadcast to all 12 processing elements. Pixels of the current macroblock may be reordered into four sets of input data, and the four sets of input data are transmitted to the four sub-arrays, respectively. In addition, the transmission path of the input data is sequential in each sub-array (e.g. PE1→PE5→PE9). Also, eight 32-bit flip-flops are used as delaying units in the four transmission paths of pixels of the current macroblock.
  • Since the current macroblock and the searching window are respectively stored in the current macroblock buffer 143 and the searching window buffer 144, the integer pixel estimation unit 151 may access these two buffers simultaneously via two different physical channels (e.g. memory channels). In addition, pixels are stored in the format of pixel words in the current macroblock buffer 143 and the searching window buffer 144, and thus a pixel word of the current macroblock and a pixel word of the searching window can be read simultaneously from the current macroblock buffer 143 and the searching window buffer 144 every clock cycle, wherein each pixel word is divided into four pixels to be written into the register arrays (e.g. RA0, RA1, RA2, and RA3).
  • In the first clock cycle, the integer pixel estimation unit 151 writes pixels b0˜b3 of the searching window into the register array RB, and writes pixels a0˜a3 of the current macroblock into the register array RA0. In addition, pixels a0˜a3 are arranged into different orders and written into the register arrays RA1, RA2 and RA3, as illustrated in FIG. 7.
  • In the second clock cycle, the integer pixel estimation unit 151 may broadcast the pixels b0˜b3 of the searching window stored in the register array RB to all the 12 processing elements, and transmit pixels of the current macroblock stored in the register arrays RA0˜RA3 to the four sub-arrays through four transmission paths. In the second clock cycle, the processing elements PE1˜PE4 have received pixels of the current macroblock and the searching window for calculation, whereas the processing elements PE5˜PE12 are idling since they have not received the pixels of the current macroblock yet. Meanwhile, the integer pixel estimation unit 151 may keep on reading the current macroblock buffer 143 and the searching window buffer 144, store pixels b4˜b7 of the searching window to the register array RB, and store pixels a4˜a7 of the current macroblock to the register array RA0. The integer pixel estimation unit 151 may further reorder the pixels a4˜a7 of the current macroblock and substitute some pixels in the register arrays RA1˜RA3 with the reordered pixels a4˜a7, as illustrated in FIG. 7.
  • In the third clock cycle, the integer pixel estimation unit 151 may broadcast the pixels b4˜b7 of the searching window to all the 12 processing elements. Pixels of the current macroblock stored in the register arrays RA0˜RA3 are transmitted to the four sub-arrays via four different transmission paths, so that the pixels can be transmitted sequentially in the processing elements in each sub-array. In the third clock cycle, the processing elements PE1˜PE8 have received pixels of the current macroblock and the searching window for calculation, but the processing elements PE9˜PE12 are idling since they have not received the pixels of the current macroblock yet. Meanwhile, the integer pixel estimation unit 151 may keep on reading the searching window buffer 144, and store the pixels b8˜b11 of the searching window into the register array RB. The integer pixel estimation unit 151 may further reorder pixels a4˜a7 of the current macroblock stored in the register array RA0, and substitute some pixels in the register arrays RA1˜RA3 with the reordered pixels, as illustrated in FIG. 7.
  • In the fourth clock cycle, the integer pixel estimation unit 151 may broadcast the pixels b8˜b11 of the searching window to all the 12 processing elements. Pixels of the current macroblock stored in the register arrays RA0˜RA3 are transmitted to the four sub-arrays via four different transmission paths, so that the pixels can be transmitted sequentially in the processing elements in each sub-array. Meanwhile, the integer pixel estimation unit 151 may keep reading the searching window buffer 144, and store the pixels b12˜b15 of the searching window into the register array RB. Therefore, all processing elements on the four transmission paths have received pixel data for calculation in the fourth clock cycle.
  • In the fifth clock cycle, the integer pixel estimation unit 151 may broadcast the pixels b12˜b15 of the searching window to all the 12 processing elements. Also, the processing elements PE1˜PE4 are idling since they do not receive any new pixels of the current macroblock, and the processing elements PE5˜PE12 have received pixels of the searching window and pixels of the current macroblock from the delaying units FF0˜FF7 for calculation. Meanwhile, the integer pixel estimation unit 151 may keep reading the searching window buffer 144, and store the pixels b16˜b19 of the searching window into the register array RB.
  • In the sixth clock cycle, the integer pixel estimation unit 151 has completed calculation of difference values of a pixel row (e.g. 12 integer pixels). Further, each processing element may comprise an accumulator, and the integer pixel estimation unit 151 may accumulate and store the difference values corresponding to the 12 candidate points, and calculation of a SAD8×8 value of the 12 candidate points can be completed by repeating the aforementioned steps 8 times. Then, the least SAD8×8 value can be obtained by using the comparators, and thus a corresponding motion vector MV8×8 can be obtained. The integer pixel estimation unit 151 may keep calculating the SAD8×8 value of the 12 candidate points in the other three 8×8 blocks, thereby obtaining twelve SAD16×16 values. The integer pixel estimation unit 151 may further obtain the least SAD16×16 value by using the comparators, thereby obtaining the corresponding motion vector MV16×16.
  • FIG. 8 is a structure diagram illustrating a processing element in the integer pixel estimation unit 151 according to an embodiment of the invention. As illustrated in FIG. 8, the processing element may comprise four SAD calculating units and an accumulator. In every clock cycle, the processing element may receive four pixels of the current macroblock and four pixels of the searching window, and calculate absolute difference values of the four pixel pairs. The processing element may selectively accumulate the four absolute difference values. For each processing element, the corresponding control signal is a fixed 4-bit value in clock cycles for performing calculation of motion estimation. Control signals between neighboring processing elements in the same set may have a one-clock-cycle delay. Accordingly, eight 4-bit flip-flops are used as delaying units in the integer pixel estimation unit 151 to distribute the control signal of each processing element.
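  • The behavior of a single processing element can be modeled with the following sketch. The class name `ProcessingElement`, the `clock` method, and the convention that bit i of the 4-bit control value enables the i-th pixel pair are assumptions for illustration; the description above only states that the four absolute differences are accumulated selectively under a 4-bit control signal.

```python
class ProcessingElement:
    """Software model of one PE: four absolute-difference units feeding one accumulator."""

    def __init__(self):
        self.acc = 0

    def clock(self, cur4, ref4, mask=0b1111):
        """Process four pixel pairs in one clock cycle and return the running SAD.

        Assumed convention: bit i of `mask` enables accumulation of pair i.
        """
        for i in range(4):
            if (mask >> i) & 1:
                self.acc += abs(cur4[i] - ref4[i])
        return self.acc
```

  • Calling `clock` once per cycle with a fixed mask mirrors the statement that each processing element uses a fixed 4-bit control value during the motion estimation cycles.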
  • F. Hardware Architecture for Half-Pixel Interpolation and Searching
  • In the MPEG4 and H.263 video codec standards, the point indicated by an integer-pixel motion vector is often taken as a center, and the eight candidate half pixels around the center are searched while performing searching of half pixels. The reference macroblock corresponding to each of the eight half pixels is generated by linear interpolation of integer pixels. There are three modes for interpolation of half pixels, namely horizontal interpolation, vertical interpolation, and diagonal interpolation. Given that h, v, d denote the half pixels in the horizontal direction, vertical direction and diagonal direction, respectively; A1 and A2 denote the integer pixels horizontally neighboring the half pixel h; A1 and A3 denote the integer pixels vertically neighboring the half pixel v; and A1˜A4 denote the integer pixels neighboring the half pixel d, the interpolation for half pixels in different directions can be expressed as the following equations:

  • h=(A1+A2+1)>>1;

  • v=(A1+A3+1)>>1;

  • d=(A1+A2+A3+A4+2)>>2;
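  • The three interpolation equations translate directly into integer arithmetic, as the following sketch shows. The function name `half_pixels` is hypothetical, and the sketch assumes that A1 is the top-left integer pixel, A2 its right neighbor, A3 the pixel below A1, and A4 the pixel diagonally opposite A1.

```python
def half_pixels(a1, a2, a3, a4):
    """Return (h, v, d) for one 2x2 group of integer pixels, using rounding shifts."""
    h = (a1 + a2 + 1) >> 1            # horizontal half pixel between A1 and A2
    v = (a1 + a3 + 1) >> 1            # vertical half pixel between A1 and A3
    d = (a1 + a2 + a3 + a4 + 2) >> 2  # diagonal half pixel at the centre of A1..A4
    return h, v, d
```

  • For 8-bit pixels, the largest intermediate sum is 4×255+2=1022, which fits in 10 bits.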
  • FIGS. 9A and 9B are portions of a diagram illustrating the hardware architecture of the half pixel estimation unit 152 according to an embodiment of the invention. The half pixel estimation unit 152 may comprise 4 sets of 10-bit adders and 3 sets of rounding and shifting units to implement interpolation of half pixels. The half pixel estimation unit 152 may further comprise eight parallel processing elements to implement searching of half pixels, as illustrated in FIGS. 9A and 9B. As described in aforementioned embodiments, pixels are stored in the format of pixel words in the current macroblock buffer 143 and the searching window buffer 144. The half pixel estimation unit 152 may read a pixel word of the current macroblock from the current macroblock buffer 143 and a pixel word of the searching window from the searching window buffer 144 simultaneously. Each pixel word is unpacked into four pixels, and the unpacked four pixels are written into the register arrays (e.g. RA10 and RA11). In an embodiment, the current macroblock register comprises two ping-pong register arrays RA10 and RA11, and each of the register arrays RA10 and RA11 may comprise eight 8-bit registers. The searching window register comprises two ping-pong register arrays RB10 and RB11, and each of the register arrays RB10 and RB11 may comprise ten 8-bit registers.
  • When the half pixel estimation unit 152 starts to perform interpolation of half pixels, the half pixel estimation unit 152 may read eight pixels in the first row of the current macroblock from the current macroblock buffer 143, and write the eight pixels in the first row into the register array RA10. Similarly, the half pixel estimation unit 152 may read eight pixels in the second row of the current macroblock from the current macroblock buffer 143, and write the eight pixels in the second row into the register array RA11. The half pixel estimation unit 152 may read 10 pixels in the first row of the searching window from the searching window buffer 144, and write the 10 pixels in the first row to the register array RB10. Similarly, the half pixel estimation unit 152 may read 10 pixels in the second row of the searching window from the searching window buffer 144, and write the 10 pixels in the second row to the register array RB11. When the half pixel estimation unit 152 has completed calculation of interpolation and searching of half pixels in a row, the half pixel estimation unit 152 may further read pixels in a subsequent new row of the current macroblock from the current macroblock buffer 143, thereby substituting a prior row stored in the register array RA10 or RA11 with the new row. The half pixel estimation unit 152 may further read pixels in a subsequent new row of the searching window from the searching window buffer 144, thereby substituting a prior row stored in the register array RB10 or RB11 with the new row. While calculating interpolation of half pixels, the half pixel estimation unit 152 may simultaneously generate 9 half pixels in a row in the horizontal direction, 8 half pixels in a column in the vertical direction, and 9 half pixels in a row in the diagonal direction, so that the criterion to search for eight candidate half pixels simultaneously can be satisfied. 
Further, two lines, each comprising 10 integer pixels, are required when the half pixel estimation unit 152 generates the aforementioned half pixels in different directions. In addition, the half pixel estimation unit 152 may read the two lines from the searching window buffer 144, and write the two lines into the register arrays RB10 and RB11, respectively. Since pixels are stored in the format of pixel words (i.e. each comprising four integer pixels) in the searching window buffer 144, the half pixel estimation unit 152 has to read three consecutive pixel words from the searching window buffer 144 to read the 10 integer pixels in a line. The half pixel estimation unit 152 may further unpack the three pixel words into 12 integer pixels, and align the integer pixels according to the location of the integer-pixel motion vector within the pixel words, thereby truncating the two invalid integer pixels.
  • The half pixel estimation unit 152 may comprise 8 parallel processing elements PE21˜PE28, and the processing elements PE21˜PE28 are divided into 3 groups. The first group comprises the processing elements PE21˜PE24, configured to calculate SAD values of four candidate half pixels in the diagonal direction. The second group comprises the processing elements PE25 and PE26, configured to calculate SAD values of two candidate half pixels in the vertical direction. The third group comprises the processing elements PE27 and PE28, configured to calculate SAD values of two candidate half pixels in the horizontal direction. When the half pixel estimation unit 152 calculates interpolation of half pixels in the first row, the half pixel estimation unit 152 may broadcast the pixels of the current macroblock stored in the register array RA10 to the processing elements PE23, PE24 and PE26 through a first broadcasting path, and broadcast the pixels of the current macroblock stored in the register array RA11 to the processing elements PE21, PE22, PE25, PE27 and PE28 through a second broadcasting path. Then, when the half pixel estimation unit 152 has completed calculation of interpolation of half pixels in a row, the broadcasting paths from the register arrays RA10 and RA11 may be interchanged. The nine half pixels d0˜d8 in the diagonal direction generated by the half pixel estimation unit 152 are divided into two groups. For example, the half pixels d0˜d7 are transmitted to the processing elements PE21 and PE23, and the half pixels d1˜d8 are transmitted to the processing elements PE22 and PE24. Similarly, the nine half pixels h0˜h8 in the horizontal direction generated by the half pixel estimation unit 152 are divided into two groups. For example, the half pixels h0˜h7 are transmitted to the processing element PE27, and the half pixels h1˜h8 are transmitted to the processing element PE28.
In addition, the eight half pixels v0˜v7 in the vertical direction generated by the half pixel estimation unit 152 are transmitted to the processing elements PE25 and PE26 simultaneously.
  • In an embodiment, each processing element in the half pixel estimation unit 152 may comprise four SAD calculating units and an accumulator (as shown in FIG. 7), and it may take two clock cycles to complete calculation of SAD values of half pixels in a line. The half pixel estimation unit 152 may accumulate the SAD values of half pixels in 8 lines to obtain eight SAD8×8 values. The half pixel estimation unit 152 may select the least SAD8×8 value of half pixels by using the comparators, and compare the least SAD8×8 value of half pixels with the least SAD8×8 value of integer pixels, thereby obtaining the resulting motion vector MV8×8 (i.e. the motion vector corresponding to the least SAD8×8 value after comparison).
  • The half pixel estimation unit 152 may sum up the four SAD8×8 values corresponding to each of the 8 candidate half pixels, thereby obtaining 8 SAD16×16 values. Then, the half pixel estimation unit 152 may select the least SAD16×16 value of half pixels by using the comparators, and compare the least SAD16×16 value of half pixels with the least SAD16×16 value of integer pixels, thereby obtaining the resulting motion vector MV16×16 (i.e. the motion vector corresponding to the least SAD16×16 value after comparison).
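  • The SAD16×16 selection described above can be sketched in software as follows. The function names and the return format are illustrative assumptions; the tie-breaking rule (the integer pixel result is kept when the two SAD values are equal) is also an assumption, since the text does not specify it.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized pixel blocks."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )


def select_16x16(half_sad8x8, integer_sad16x16):
    """Pick the best 16x16 candidate.

    half_sad8x8      -- 8 entries (one per candidate half pixel), each a
                        list of the four SAD8x8 values of that candidate
    integer_sad16x16 -- least SAD16x16 value found at integer precision
    Returns (kind, candidate_index, sad_value).
    """
    # Sum the four SAD8x8 values of each candidate into one SAD16x16 value.
    sad16 = [sum(four) for four in half_sad8x8]
    best = min(range(len(sad16)), key=sad16.__getitem__)
    # Assumption: the integer result wins ties.
    if sad16[best] < integer_sad16x16:
        return ("half", best, sad16[best])
    return ("integer", None, integer_sad16x16)
```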
  • G. Definition of Loop Filtering Sequence
  • Encoding processes and decoding processes in video codec standards, such as the H.264 or VC-1 standards, are controlled in a frame level flow, and the order for processing the boundaries in the in-loop filtering processes is defined in the video codec standards. In addition, the hardware accelerators in the encoding module 120 may perform the encoding process macroblock by macroblock. In the invention, a filtering order of the boundaries of 4×4 blocks in a 16×16 macroblock is further defined based on the definition in the video codec standards, thereby using the overlapped portion of neighboring macroblocks effectively to reduce the memory bandwidth for accessing the external storage unit 130.
  • It should be noted that an in-loop filter is a necessary component in a video encoding system and a video decoding system for the H.264 and VC-1 standards. The in-loop filter may reduce the discontinuity between neighboring macroblocks generated by the processes, such as DCT/iDCT and quantization/inverse quantization, thereby enhancing the image quality after motion compensation and increasing the efficiency for video encoding.
  • FIG. 18 is a block diagram illustrating a video codec system according to an embodiment of the invention. Referring to FIG. 1 and FIG. 18, the in-loop filtering acceleration circuit 124 is not only applied in the video encoding system 100, but also applied in a video codec system 1800. The video codec system 1800 may comprise a processing unit 1810, a codec module 1820, and an external storage unit 1830. The processing unit 1810 may be a controller, configured to execute a hardware acceleration control program, and execute decoding pre-processing and post-processing, such as an entropy decoding program and a decoding parameters calculating program, respectively. For example, the processing unit 1810 may be a central processing unit (CPU), a digital signal processor (DSP) or other equivalent circuits implementing the same functions.
  • The codec module 1820 may comprise a hardware accelerator controller 1821, a codec processing unit 1822, an in-loop filtering acceleration circuit 1823, an external storage unit 1830 and an internal storage unit 1840. In an embodiment, the codec processing unit 1822 can be implemented by hardware circuits (i.e. hardware) or DSPs (i.e. software) configured to perform decoding processes, such as motion compensation, intra-frame prediction, inverse DCT, inverse quantization and zig-zag scan. The functionality of the in-loop filtering acceleration circuit 1823 is identical to that of the in-loop filtering acceleration circuit 124, and the details will not be described here. In the following sections, only the details of the in-loop filtering acceleration circuit 124 will be described.
  • The external storage unit 1830 is configured to store reference frames, reconstructed frames, decoding parameters, and RLL codes. The external storage unit 1830 may be a volatile memory component (e.g. random access memory, such as DRAM or SRAM) and/or a non-volatile memory component (e.g. ROM, hard disk, CDROM).
  • The internal storage unit 1840 may comprise a searching window buffer 1841, a first FIFO buffer 1842, a de-blocking filter buffer 1843, and a second FIFO buffer 1844. The searching window buffer 1841 is configured to store reference macroblocks for motion compensation. The first FIFO buffer 1842 is configured to store RLL codes. The de-blocking filter buffer 1843 is configured to store reconstructed macroblocks after motion compensation executed by the codec processing unit 1822, and filtered macroblocks generated by the in-loop filtering acceleration circuit 1823. In addition, the in-loop filtering acceleration circuit 1823 may read the reconstructed macroblocks generated by the codec processing unit 1822 from the de-blocking filter buffer 1843, perform in-loop filtering to the reconstructed macroblocks, and write the filtered macroblocks into the de-blocking filter buffer 1843. The second FIFO buffer 1844 is configured to store decoding parameters generated by the processing unit 1810.
  • G-1. In-Loop Filtering Sequence in H.264 Standard
  • FIG. 10 is a diagram illustrating the in-loop filtering sequence in the H.264 standard according to an embodiment of the invention. As illustrated in FIG. 10, Y denotes a luminance macroblock, and U and V denote respective chrominance macroblocks. The filtering sequence for an in-loop filter in the H.264 standard is defined as follows: for each frame, the vertical edges of all 4×4 blocks are filtered first, and the vertical edges are filtered from top to bottom and from left to right. Then, the horizontal edges of all 4×4 blocks are filtered, and the horizontal edges are also filtered from top to bottom and from left to right.
  • The in-loop filtering acceleration circuit 124 may perform video encoding/decoding by macroblock, and the edges to be filtered in each macroblock are the black bold lines illustrated in FIG. 10. The blocks filled with diagonal lines represent the current luminance macroblock and current chrominance macroblocks, and the white blocks represent neighboring luminance macroblocks and neighboring chrominance macroblocks of the current luminance macroblock and current chrominance macroblocks, respectively.
  • Based on the filtering sequence defined in the H.264 standard, the in-loop filtering acceleration circuit 124 may re-define the filtering sequence for filtering edges of 4×4 blocks in a 16×16 macroblock as the order of numbers illustrated in FIG. 10. First, the in-loop filtering acceleration circuit 124 may filter the vertical edges of all 4×4 blocks from left to right and from top to bottom, and then filter the horizontal edges of all 4×4 blocks from top to bottom and from left to right. Briefly, by using the filtering sequence defined for the H.264 standard in the invention, the in-loop filtering acceleration circuit 124 may use the neighboring macroblocks having overlapped edges effectively to reduce the memory bandwidth for accessing the external storage unit 130. For example, when filtering a vertical edge of a 4×4 block, the in-loop filtering acceleration circuit 124 may read two 4×4 blocks located at the left/right side of the vertical edge from the de-blocking filter buffer 145, and write the two 4×4 blocks to the transposition register arrays TA and TB (as shown in FIG. 14A, and details will be described later). When the in-loop filtering acceleration circuit 124 has completed filtering of a vertical edge, it is not necessary for the in-loop filtering acceleration circuit 124 to write the 4×4 block, which is located at the right side of the vertical edge, back to the de-blocking filter buffer 145. That is, the 4×4 block can be preserved in the transposition register array, so that the 4×4 block can be used as the block located at the left side of the next vertical edge. Accordingly, one access (i.e. writing and reading) of a 4×4 block can be saved when the in-loop filtering acceleration circuit 124 performs filtering of a vertical edge. Similarly, another access (i.e. writing and reading) of a 4×4 block can be saved when the in-loop filtering acceleration circuit 124 performs filtering of a horizontal edge.
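  • The saving described above can be illustrated by counting buffer accesses along one row of 4×4 blocks. The sketch below is a simplified model under stated assumptions: the function name is hypothetical, each edge touches exactly two 4×4 blocks, and with reuse the right block of one edge is kept as the left block of the next edge instead of being written back and read again.

```python
def count_buffer_accesses(num_edges, reuse):
    """Count 4x4-block reads and writes for consecutive vertical edges.

    num_edges -- number of vertical edges filtered along one row of blocks
    reuse     -- True if the right block is kept in the register array
                 and reused as the left block of the next edge
    Returns (reads, writes) against the de-blocking filter buffer.
    """
    reads = writes = 0
    for edge in range(num_edges):
        # Without reuse both blocks are read; with reuse only the new
        # right block is read, except for the very first edge.
        reads += 1 if (reuse and edge > 0) else 2
        # Without reuse both blocks are written back; with reuse only the
        # left block is, until the last edge flushes the right block too.
        writes += 1 if (reuse and edge < num_edges - 1) else 2
    return reads, writes
```

  • Under these assumptions, four edges cost 8 reads and 8 writes without reuse, and 5 reads and 5 writes with reuse.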
  • G-2. In-Loop Filtering Sequence in VC-1 Standard
  • FIG. 11 is a diagram illustrating the filtering sequence for an in-loop filter in the VC-1 standard according to an embodiment of the invention. As illustrated in FIG. 11, Y denotes a luminance macroblock, and U and V denote respective chrominance macroblocks. For each frame, the filtering sequence in the in-loop filter defined by the VC-1 standard can be expressed as the following criteria:
  • (a) horizontal edges of all 8×8 blocks are filtered from left to right and from top to bottom;
  • (b) horizontal edges of all 4×4 blocks are filtered from left to right and from top to bottom;
  • (c) vertical edges of all 8×8 blocks are filtered from top to bottom and from left to right; and
  • (d) vertical edges of all 4×4 blocks are filtered from top to bottom and from left to right.
  • When the in-loop filtering acceleration circuit 124 encodes or decodes a frame by macroblock, some edges of the current macroblock are not filtered by the in-loop filtering acceleration circuit 124 due to the limitation of the filtering sequence of the VC-1 standard, wherein the limitation may indicate that the right edge and the bottom edge are not filtered while performing in-loop filtering for each macroblock. Accordingly, the edges can only be filtered while the in-loop filtering acceleration circuit 124 performs filtering of the next macroblock or the macroblock exactly on the next line (i.e. the line beneath the current line). Therefore, when the in-loop filtering acceleration circuit 124 performs filtering of each macroblock, the edges to be filtered may comprise some internal edges of the current macroblock, and some edges of the up, left, and upper-left neighboring macroblocks, such as the black bolded lines illustrated in FIG. 11. In addition, the blocks filled with diagonal lines are the luminance macroblock and chrominance macroblocks of the current macroblock, and the white blocks are the luminance macroblock and chrominance macroblocks of the neighboring macroblocks.
  • Based on the filtering sequence defined in the VC-1 standard, the in-loop filtering acceleration circuit 124 may re-define the filtering sequence for filtering edges of 4×4 blocks in a 16×16 macroblock as the order of numbers illustrated in FIG. 11. First, the in-loop filtering acceleration circuit 124 may filter horizontal edges. That is, the in-loop filtering acceleration circuit 124 may filter the horizontal edges of 8×8 blocks from bottom to top, and filter the horizontal edges of 4×4 blocks from top to bottom. Then, the in-loop filtering acceleration circuit 124 may filter vertical edges. That is, the in-loop filtering acceleration circuit 124 may filter the vertical edges of 8×8 blocks from right to left, and filter the vertical edges of 4×4 blocks from left to right. Briefly, the in-loop filtering acceleration circuit 124 may use the neighboring macroblocks having overlapped edges effectively to reduce the memory bandwidth for accessing the external storage unit 130 by using the filtering sequence re-defined for the VC-1 standard in the invention.
  • H. Storage Format of Pixels for In-Loop Filtering
  • The reconstructed macroblocks generated by the in-loop filtering acceleration circuit 124 may compose a reconstructed frame, which is stored in the external storage unit 130. The pixels of the reconstructed macroblocks before in-loop filtering and pixels of the macroblocks after in-loop filtering are stored in the de-blocking filter buffer 145 of the internal storage unit 140 with the format of pixel words (e.g. word32 format). Briefly, each pixel has an 8-bit accuracy, and four horizontally adjacent pixels are placed into the same pixel word. Before performing in-loop filtering, the DCT and quantization accelerator 123 may write the reconstructed macroblock after motion compensation or spatial compensation into the de-blocking filter buffer 145. Then, the hardware accelerator controller 121 may read the required neighboring macroblocks for in-loop filtering from the external storage unit 130, and write the macroblocks into the de-blocking filter buffer 145. When in-loop filtering has completed, the hardware accelerator controller 121 may copy the reconstructed macroblocks and neighboring macroblocks after in-loop filtering to the external storage unit 130 by using the DMA controller 160.
  • Referring to FIG. 10 and FIG. 11, the left, upper, and upper-left neighboring macroblocks of the current macroblock are used while performing in-loop filtering for the current macroblock. FIG. 12 is a diagram illustrating the architecture of the de-blocking filter buffer 145 according to an embodiment of the invention. To make the neighboring macroblocks convenient to read, the de-blocking filter buffer 145 may have an architecture of four memory banks, so that the operations of reading, writing and filtering macroblocks can be executed in parallel to increase performance of the video encoding system 100. Each memory bank may store the current macroblock and certain lines of luminance/chrominance pixels above the current macroblock. For example, the de-blocking filter buffer 145 may store four lines of luminance/chrominance pixels above the current macroblock for the H.264 standard. Alternatively, the de-blocking filter buffer 145 may store 8 lines of luminance/chrominance pixels above the current macroblock for the VC-1 standard. Two neighboring memory banks (e.g. memory banks 1 and 2) in the de-blocking filter buffer 145 are configured to store the current macroblock, the left neighboring macroblock, and two upper neighboring luminance and chrominance macroblocks, and the in-loop filtering acceleration circuit 124 may read the two neighboring memory banks simultaneously to perform the in-loop filtering process. Other hardware accelerators or the DSP processor (e.g. the DCT and quantization accelerator 123) in the encoding module 120 may write the reconstructed macroblocks into a memory bank (e.g. memory bank 3) of the de-blocking filter buffer 145. In addition, the hardware accelerator controller 121 may further read the upper neighboring macroblock of the reconstructed macroblock from the external storage unit 130, and write the upper neighboring macroblock into a memory bank (e.g. memory bank 3) of the de-blocking filter buffer 145. Also, the hardware accelerator controller 121 may further copy the reconstructed macroblock and the upper neighboring macroblock after in-loop filtering, which are stored in a memory bank (e.g. memory bank 0) of the de-blocking filter buffer 145, to the external storage unit 130.
  • FIGS. 13A˜13D are portions of a diagram illustrating the sequence of data accessing in the de-blocking filter buffer 145 according to an embodiment of the invention. In order to perform reading, writing and in-loop filtering of macroblocks simultaneously, different hardware accelerators or the DSP processor of the encoding module 120 should access different memory banks of the de-blocking filter buffer 145 circularly via the DMA controller 160, as illustrated in FIGS. 13A˜13D. In order to synchronize reading, writing and in-loop filtering of macroblocks, three different indices are used in the de-blocking filter buffer 145 to prevent different hardware accelerators and the DMA controller 160 from accessing the same memory bank of the de-blocking filter buffer 145. The three aforementioned indices, such as a reading index rd_index, a filter index filter_index, and a writing index wr_index, are configured to control different hardware accelerators and the DMA controller 160 to access different memory banks of the de-blocking filter buffer 145. The control mechanism of the indices can be expressed in the following steps:
  • (a) The reading index rd_index points to the memory bank accessed by the DMA controller 160, and is initially set to 0. When (rd_index+1) is smaller than the filter index filter_index, the DMA controller 160 may read the memory bank to which the reading index rd_index is pointing. Every time the DMA controller 160 has completed reading a macroblock and its upper neighboring macroblock, the DMA controller 160 may increment the reading index rd_index by 1.
  • (b) The filter index filter_index points to the memory bank accessed by the in-loop filtering acceleration circuit 124, and is initially set to 0. When the filter index filter_index is smaller than the writing index wr_index, the in-loop filtering acceleration circuit 124 may access the two memory banks pointed to by filter_index and (filter_index−1). Every time the in-loop filtering acceleration circuit 124 has completed in-loop filtering of a macroblock, the in-loop filtering acceleration circuit 124 may increment the filter index filter_index by 1.
  • (c) The writing index wr_index points to the memory bank accessed by other hardware accelerators, the DSP processor, and the hardware accelerator controller 121, and is initially set to 0. When the writing index wr_index is larger than (rd_index+2), other hardware accelerators/the DSP processor, and the hardware accelerator controller 121 may write macroblock data to the memory bank to which the writing index wr_index is pointing. Every time other hardware accelerators/the DSP processor and the hardware accelerator controller 121 have completed writing of a macroblock and its upper neighboring macroblock, the aforementioned components may increment the writing index wr_index by 1.
  • I. Hardware Architecture of In-Loop Filtering Acceleration Circuit
  • FIGS. 14A and 14B are portions of a diagram illustrating the hardware architecture of the in-loop filtering acceleration circuit 124 according to an embodiment of the invention. In the invention, the filtering parameter, such as boundary strength (BS), in the H.264 standard is calculated by the processing unit 110. In addition, the processing unit 110 may control the in-loop filtering acceleration circuit 124 by the hardware accelerator controller 121. For the VC-1 standard, the processing unit 110 may determine whether each edge should be filtered or not. For the H.264 standard, 5 levels of boundary strength, such as BS=0˜4, are defined for edges of a macroblock. However, boundary strength is not defined in the VC-1 standard, and thus there are only two conditions, specifically, to be filtered or not, for each edge in the VC-1 standard. For convenience in selecting the type of filters, two cases of boundary strength are defined for the VC-1 standard in the invention. That is, if the processing unit 110 determines that the edge should be filtered, the value of boundary strength is set to 5. Conversely, if the processing unit 110 determines that the edge should not be filtered, the value of boundary strength is set to 0. Accordingly, the in-loop filtering acceleration circuit 124 only has to read macroblock data from the de-blocking filter buffer 145, and select an appropriate one-dimensional (1D) filter according to filtering parameters, such as the value of boundary strength, to perform in-loop filtering of the corresponding edge.
  • As illustrated in FIGS. 14A and 14B, the in-loop filtering acceleration circuit 124 may comprise two transposition register arrays TA and TB, a filter selection unit 1410, and multiple 1D filters (e.g. G_FILTER0˜G_FILTER1, S_FILTER0˜S_FILTER3 and V_FILTER). Since the reconstructed macroblock to be filtered is stored in the de-blocking filter buffer 145 with a format of pixel words, the in-loop filtering acceleration circuit 124 may read one pixel word from the de-blocking filter buffer 145 every clock cycle, unpack the pixel word into four pixels, and write the four pixels into the transposition register arrays TA and TB. Accordingly, only four clock cycles are taken for the in-loop filtering acceleration circuit 124 to read pixels of a 4×4 block from the de-blocking filter buffer 145 to the transposition register arrays TA and TB. Pixels should be read column by column or row by row while filtering horizontal edges and vertical edges, respectively. However, the accessing of the de-blocking filter buffer 145 is more effective only when data is read or written row by row. The in-loop filtering acceleration circuit 124 may read pixels in a 4×4 block column by column, or row by row, freely by using the transposition register arrays TA and TB, so that the same hardware circuit (e.g. 1D filter) can be used to filter horizontal edges and vertical edges. When two 4×4 blocks are written into the transposition register arrays TA and TB, the in-loop filtering acceleration circuit 124 may start to perform in-loop filtering, and the procedures for in-loop filtering are described as the following steps:
  • (1) Four pixels p0, p1, p2 and p3 are read from the transposition register array TA and four pixels q0, q1, q2 and q3 are read from the transposition register array TB column by column or row by row according to the current filtering direction (e.g. horizontal direction or vertical direction). The processing unit 110 may determine the boundary strength of the current edge. If BS=0, the current edge is not filtered, and step (1) is repeated.
  • (2) If the processing unit 110 determines that the boundary strength BS of the current edge is equal to 5, it may indicate that the filtering process is to filter the current edge in the VC-1 standard, and step (4) is performed to select a 1D filter of the VC-1 standard. Otherwise, step (3) is performed.
  • (3) The in-loop filtering acceleration circuit 124 may calculate filter selection parameters d0=|p0−q0|, d1=|p1−p0|, and d2=|q0−q1|, and compare the parameters d0˜d2 with threshold values α and β. If the in-loop filtering acceleration circuit 124 determines that the criterion (d0<α && d1<β && d2<β) is not satisfied, the current edge is not filtered, and step (1) is performed. If the criterion is satisfied, the in-loop filtering acceleration circuit 124 may further determine whether the current macroblock is a luminance macroblock in the H.264 standard. If so, the in-loop filtering acceleration circuit 124 may calculate filter selection parameters d3=|p2−p0| and d4=|q2−q0|, and step (4) is performed to select a 1D filter of the H.264 standard. If not, step (4) is performed.
  • (4) The in-loop filtering acceleration circuit 124 may select a 1D filter according to the value of boundary strength to perform filtering of input pixels p0˜p3 and q0˜q3. When the value of boundary strength BS is 4, the in-loop filtering acceleration circuit 124 may select an H.264 strong filter (S_FILTER). When the value of boundary strength BS is from 1 to 3, the in-loop filtering acceleration circuit 124 may select an H.264 general filter (G_FILTER). When the value of boundary strength BS is 5, the in-loop filtering acceleration circuit 124 may select a VC-1 filter (V_FILTER). If the filtering of edges has not been completed yet, step (1) is performed. When the filtering of edges has completed, the in-loop filtering acceleration circuit 124 may write output pixels p0′˜p3′ back to the transposition register array TA, and write output pixels q0′˜q3′ back to the transposition register array TB.
  • (5) When the filtering of edges has completed, the in-loop filtering acceleration circuit 124 may write 4×4 blocks, which are above the horizontal edge or located at the left side of the vertical edge, back to the de-blocking filter buffer 145. If a horizontal edge is processed, the in-loop filtering acceleration circuit 124 may read pixels by column, and four adjacent pixels in a column are packed into a pixel word to be written into the de-blocking filter buffer 145. If a vertical edge is processed, the in-loop filtering acceleration circuit 124 may read pixels by row, and four adjacent pixels in a row are packed into a pixel word to be written into the de-blocking filter buffer 145.
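  • The role of the transposition register arrays TA and TB in the steps above can be sketched as a 4×4 array that is loaded row by row from pixel words and then accessed either by row or by column. This is a software illustration only: the class and method names are hypothetical, and the byte order inside a pixel word (lowest byte taken as the leftmost pixel) is an assumption.

```python
class TranspositionArray:
    """4x4 pixel array readable and writable by row or by column."""

    def __init__(self):
        self.cells = [[0] * 4 for _ in range(4)]

    def load_words(self, words):
        """Load one 4x4 block from four pixel words, one word per row."""
        if len(words) != 4:
            raise ValueError("a 4x4 block needs exactly four pixel words")
        for r, word in enumerate(words):
            # Assumption: the lowest byte holds the leftmost pixel.
            self.cells[r] = [(word >> (8 * c)) & 0xFF for c in range(4)]

    def read(self, index, by_column):
        """Return four pixels from row `index` or column `index`."""
        if by_column:
            return [self.cells[r][index] for r in range(4)]
        return list(self.cells[index])

    def write(self, index, pixels, by_column):
        """Write four filtered pixels back to a row or a column."""
        if by_column:
            for r in range(4):
                self.cells[r][index] = pixels[r]
        else:
            self.cells[index] = list(pixels)

    def store_words(self):
        """Pack each row back into one pixel word."""
        return [sum(p << (8 * c) for c, p in enumerate(row))
                for row in self.cells]
```

  • With such an array, the same 1D filter can be fed with columns when a horizontal edge is filtered and with rows when a vertical edge is filtered, while the buffer itself is still accessed one row-oriented pixel word at a time.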
  • In an embodiment, the filter selection unit 1410 of the in-loop filtering acceleration circuit 124 is configured to calculate filter selection parameters (e.g. d0, d3 and d4) according to the input pixels, and to select a corresponding 1D filter according to the calculated filter selection parameters. There are three types of 1D filters in the in-loop filtering acceleration circuit 124: H.264 strong filters, H.264 general filters, and a VC-1 filter. The H.264 strong filters comprise four filters, namely S_FILTER0, S_FILTER1, S_FILTER2, and S_FILTER3. The H.264 general filters comprise two filters, namely G_FILTER0 and G_FILTER1. The VC-1 filter comprises only one filter, V_FILTER. The parameters received by the filter selection unit 1410 may comprise boundary strength BS, a chrominance parameter chroma, a clipping parameter c0, a bit rate parameter alpha, a quantization parameter PQuant, and filter selection parameters d0, d3 and d4. For example, the boundary strength BS is determined by the processing unit 110. The chrominance parameter chroma may indicate whether the current macroblock is a luminance macroblock or a chrominance macroblock. If the chrominance parameter chroma is 1, it may indicate that the current macroblock is a chrominance macroblock. Otherwise, it may indicate that the current macroblock is a luminance macroblock. Further, c0 is a clipping parameter used in the H.264 general filters, which is obtained from a look-up table according to the boundary strength BS. Also, alpha is a bit rate parameter generated by the processing unit 110 while decoding a bitstream. The quantization parameter PQuant is generated by the processing unit 110. As described in the aforementioned embodiment, the filter selection unit 1410 of the in-loop filtering acceleration circuit 124 may calculate filter selection parameters d0, d3 and d4 according to the input pixels.
  • The working principle of the filter selection unit 1410 is shown in FIGS. 15A and 15B. First, the filter selection unit 1410 may select the filter type according to the boundary strength. Then, the filter selection unit 1410 may determine the 1D filter(s) to be used according to other parameters.
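  • The selection flow described in steps (1) to (4) above can be sketched as follows. This is a hedged illustration, not the circuit of FIGS. 15A and 15B: the function name and return values are hypothetical, only the filter family is chosen (the choice among S_FILTER0˜S_FILTER3 or G_FILTER0˜G_FILTER1 by d3, d4 and chroma is not modeled), and d1 and d2 are compared against β as in the sample-filtering condition of the H.264 standard, which is an assumption about the intended criterion.

```python
def select_filter_family(bs, p, q, alpha, beta):
    """Choose the 1D filter family for one line of pixels across an edge.

    bs          -- boundary strength: 0 (skip), 1..4 (H.264), 5 (VC-1)
    p, q        -- [p0, p1, p2, p3] and [q0, q1, q2, q3]
    alpha, beta -- H.264 threshold values
    Returns "V_FILTER", "S_FILTER", "G_FILTER", or None (edge not filtered).
    """
    if bs == 0:
        return None            # step (1): edge is not filtered
    if bs == 5:
        return "V_FILTER"      # step (2): VC-1 edge marked for filtering
    # Step (3): H.264 filter selection parameters.
    d0 = abs(p[0] - q[0])
    d1 = abs(p[1] - p[0])
    d2 = abs(q[0] - q[1])
    if not (d0 < alpha and d1 < beta and d2 < beta):
        return None
    # Step (4): BS = 4 selects the strong filter, BS = 1..3 the general one.
    return "S_FILTER" if bs == 4 else "G_FILTER"
```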
  • FIGS. 16A˜16F are diagrams illustrating the architecture of each H.264 1D filter according to an embodiment of the invention. In another embodiment, when the filter selection unit 1410 has determined the 1D filter(s) to be used, the in-loop filtering acceleration circuit 124 may start to perform filtering. It should be noted that a filtering procedure is generally completed by a certain number of 1D filters. Each 1D filter may select a portion of the input pixels p0˜p3 and q0˜q3 as an input, perform calculation on the selected input pixels to obtain one or two results (i.e. filtered pixels), and substitute one or two of the input pixels with the filtered pixels, thereby generating output pixels (e.g. pout, pout1 or pout2 in FIGS. 16A˜16F). Then, the output pixels are written back to the transposition register array TA or TB.
  • Four H.264 strong 1D filters (e.g. S_FILTER0, S_FILTER1, S_FILTER2, and S_FILTER3) and two H.264 general 1D filters (e.g. G_FILTER0 and G_FILTER1) are illustrated in FIGS. 16A˜16F, respectively. Each 1D filter comprises a certain number of adders, shifters and clipping units, wherein pin0˜pin4 denote input pins in different 1D filters, and pout, pout1 and pout2 denote the output pixels of the different 1D filters.
  • FIGS. 17A˜17B are portions of a diagram illustrating the architecture of the VC-1 filter in the in-loop filtering acceleration circuit 124 according to an embodiment of the invention. As illustrated in FIGS. 17A and 17B, the VC-1 filter V_FILTER may comprise two parts. The first part may perform calculation of eight input pixels p0˜p3 and q0˜q3 to generate four internal parameters a0, |a0|, a3 and delta. The second part may perform filtering by using the four internal parameters and a quantization parameter PQuant to generate two output pixels p0′ and q0′. Then, the second part may further substitute the input pixels p0 and q0 with the output pixels p0′ and q0′, and write the output pixels back to the transposition register arrays TA and TB. When the in-loop filtering acceleration circuit 124 performs filtering of horizontal edges, the horizontal edges of a 4×4 block in the third row should be filtered first. Similarly, when the in-loop filtering acceleration circuit 124 performs filtering of vertical edges, the vertical edges of a 4×4 block in the third column should be filtered first. If the input pixels p0˜p3 and q0˜q3 are located on the horizontal edge of a 4×4 block in the third row or the vertical edge of a 4×4 block in the third column, the flag 3rd_pel_pair is set to 1. Then, the VC-1 filter should further determine another flag filter_other3_pixels. If the flag filter_other3_pixels is 1, pixels in the remaining three rows or columns should be further filtered. Otherwise, the filtering process of the pixels in the remaining three rows or columns can be skipped.
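  • The control flow around the two flags described above can be sketched as follows. This is only an illustration of the ordering: the function names are hypothetical, the VC-1 filter arithmetic itself is not reproduced and is passed in as a caller-supplied function, and the sketch assumes that each group consists of four pixel pairs across the edge, of which the third is evaluated first.

```python
def filter_vc1_group(pairs, filter_pair):
    """Filter one group of four pixel pairs across an edge.

    pairs       -- list of four (p, q) tuples, each holding the pixels
                   p0..p3 and q0..q3 of one line across the edge
    filter_pair -- caller-supplied stand-in for V_FILTER with signature
                   filter_pair(p, q, third_pel_pair) ->
                   (p_out, q_out, filter_other3_pixels)
    Returns the list of four (p, q) tuples after filtering.
    """
    out = list(pairs)
    # The third pixel pair is processed first with 3rd_pel_pair = 1.
    p_out, q_out, filter_other3 = filter_pair(pairs[2][0], pairs[2][1], True)
    out[2] = (p_out, q_out)
    # The remaining three pairs are filtered only if the flag is set.
    if filter_other3:
        for i in (0, 1, 3):
            p_out, q_out, _ = filter_pair(pairs[i][0], pairs[i][1], False)
            out[i] = (p_out, q_out)
    return out
```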
  • For those skilled in the art, it should be appreciated that the in-loop filtering acceleration circuit 124 is used to perform filtering processes of horizontal edges, vertical edges and diagonal lines. Also, the in-loop filtering acceleration circuit 124 may comply with the H.264 standard (e.g. Baseline profile) and the VC-1 standard (e.g. Simple profile and Main profile). In addition, the 1D filters in the in-loop filtering acceleration circuit 124 can be upgraded to comply with other video codec standards.
  • While the invention has been described by way of example and in terms of the preferred embodiments, it is to be understood that the invention is not limited to the disclosed embodiments. To the contrary, it is intended to cover various modifications and similar arrangements (as would be apparent to those skilled in the art). Therefore, the scope of the appended claims should be accorded the broadest interpretation so as to encompass all such modifications and similar arrangements.

Claims (27)

What is claimed is:
1. A motion estimation acceleration circuit applied in a video encoding system supporting multiple video codec standards, comprising:
a start searching point prediction unit, configured to determine a start searching point according to multiple neighboring macroblocks of a current macroblock, wherein the current macroblock corresponds to a searching window; and
an integer pixel estimation unit, configured to determine a best candidate pixel according to a first line segment at where the start searching point is located, a second line segment on the first line segment, and a third line segment beneath the first line segment,
wherein the integer pixel estimation unit further determines whether the best candidate pixel is located at the first line segment,
if so, the integer pixel estimation unit sets a candidate motion vector corresponding to the best candidate pixel as a first current macroblock motion vector;
if not, the integer pixel estimation unit dynamically adjusts the second line segment or the third line segment in the searching window to update the best candidate pixel, and retrieves the first current macroblock motion vector corresponding to the updated best candidate pixel.
2. The motion estimation acceleration circuit as claimed in claim 1, wherein the start searching point prediction unit further calculates multiple macroblock reference pixels pointed by multiple second neighboring macroblock motion vectors, and calculates multiple neighboring macroblock sum of absolute difference values corresponding to the macroblock reference points, wherein the start searching point prediction unit further assigns one of the macroblock reference points corresponding to the least one of the neighboring macroblock sum of absolute difference values as the start searching points.
3. The motion estimation acceleration circuit as claimed in claim 2, wherein when the current macroblock is located at a boundary of a current frame, the start searching point prediction unit further substitutes the second neighboring macroblock motion vectors with a zero motion vector, and sets the start searching point as the zero point.
4. The motion estimation acceleration circuit as claimed in claim 2, wherein the integer pixel estimation unit calculates the first current macroblock motion vector by executing the following steps of:
(a) dividing the current macroblock into at least one 8×8 block, taking a pixel word having four pixels, including the start searching point as center for each 8×8 block, and retrieving 36 initial candidate points from the first line segment, the second line segment and the third line segment, wherein the first line segment comprises the pixel word and four neighboring pixels left and right to the pixel word;
(b) calculating a first sub-macroblock sum of absolute difference value of each initial candidate point relative to each 8×8 block to obtain an initial current macroblock sum of absolute difference value corresponding to each initial candidate point, and obtaining a first least current macroblock sum of absolute difference value according to the initial current macroblock sum of absolute difference values;
(c) determining whether a best candidate point corresponding to the first least current macroblock sum of absolute difference value is located at the second line segment,
if so, executing step (d);
if not, determining whether the best candidate point corresponding to the first least current macroblock sum of absolute difference value is located at the third line segment,
if so, executing step (g); and
if not, executing step (j);
(d) determining whether the second line segment is located at a boundary of the searching window corresponding to the current macroblock,
if so, executing step (j); and
if not, moving down the second line segment by 1 pixel, adjusting the moved second line segment in a horizontal direction to generate 12 first refined candidate points according to a pixel word at which the best candidate point is located, and executing step (e);
(e) calculating a second sub-macroblock sum of absolute difference value of each 8×8 block relative to the first refined candidate points to obtain a second current macroblock sum of absolute difference value corresponding to each first refined candidate point, and obtaining a second least current macroblock sum of absolute difference value according to the second current macroblock sum of absolute difference value corresponding to each first refined candidate point;
(f) determining whether the second least current macroblock sum of absolute difference value is larger than the first least current macroblock sum of absolute difference value,
if so, executing step (j); and
if not, setting the second least current macroblock sum of absolute difference value as the first least current macroblock sum of absolute difference value, and executing step (d);
(g) determining whether the third line segment is located at the boundary of the searching window corresponding to the current macroblock,
if so, executing step (j); and
if not, moving up the third line segment by 1 pixel, adjusting the moved third line segment in the horizontal direction to generate 12 second refined candidate points according to the pixel word at which the best candidate point is located, and executing step (h);
(h) calculating a third sub-macroblock sum of absolute difference value of each 8×8 block relative to the second refined candidate points to obtain a third current macroblock sum of absolute difference value corresponding to each second refined candidate point, and obtaining a third least current macroblock sum of absolute difference value according to the third current macroblock sum of absolute difference value corresponding to each second refined candidate point;
(i) determining whether the third least current macroblock sum of absolute difference value is larger than the first least current macroblock sum of absolute difference value,
if so, executing step (j); and
if not, setting the third least current macroblock sum of absolute difference value as the first least current macroblock sum of absolute difference value, and executing step (g); and
(j) setting a first motion vector corresponding to the first least current macroblock sum of absolute difference value as the current macroblock integer pixel motion vector, and setting multiple motion vectors pointing to the second sub-macroblock sum of absolute difference value or the third sub-macroblock sum of absolute difference value as multiple sub-macroblock motion vectors corresponding to the 8×8 blocks in the current macroblock.
5. The motion estimation acceleration circuit as claimed in claim 4, further comprising:
a half pixel estimation unit, configured to search for eight half pixels around the best candidate point as the center, wherein when multiple half pixel sub-macroblock sum of absolute difference values or a half pixel current macroblock sum of absolute difference value corresponding to the half pixels are smaller than the sub-macroblock motion vectors or the current macroblock motion vector, the half pixel estimation unit further sets the half pixel sub-macroblock sum of absolute difference values or the half pixel current macroblock sum of absolute difference value as the sub-macroblock motion vectors or the current macroblock motion vector, respectively.
6. The motion estimation acceleration circuit as claimed in claim 1, further comprising:
a prediction difference calculating unit configured to determine an encoding mode of the current macroblock according to a rate distortion optimization value.
7. The motion estimation acceleration circuit as claimed in claim 1, wherein the motion estimation acceleration circuit further reads pixels of the searching window from a searching window buffer having four memory banks, wherein the motion estimation acceleration circuit further reads three of the four memory banks sequentially, and pixels of the searching window, which are used for filtering a next macroblock, are read by a DMA controller from an external storage unit to the one of the four memory banks which is not being read by the motion estimation acceleration circuit.
8. The motion estimation acceleration circuit as claimed in claim 4, wherein the integer pixel estimation unit comprises 12 processing elements in parallel, and multiple flip-flops, wherein the processing elements are divided into four groups, wherein the integer pixel estimation unit further broadcasts pixels of the searching window to the processing elements, and the pixels of the current macroblock are reordered into four sets of data, which are transmitted to the processing elements through four transmission paths.
9. The motion estimation acceleration circuit as claimed in claim 5, wherein the half pixel estimation unit further comprises 4 sets of 10-bit adders, 3 sets of rounding and shifting units, and 8 processing elements in parallel, wherein the 10-bit adders and the rounding and shifting units are configured to calculate interpolation of the half pixels, and the processing elements are configured to search for the half pixels.
10. The motion estimation acceleration circuit as claimed in claim 1, wherein the video codec standards supported by the motion estimation acceleration circuit comprise MPEG2, MPEG4 and H.263.
11. A motion estimation method applied in a motion estimation acceleration circuit in a video encoding system supporting multiple video codec standards, comprising:
determining a start searching point according to multiple neighboring macroblocks of a current macroblock, wherein the current macroblock corresponds to a searching window;
determining a best candidate pixel according to a first line segment where the start searching point is located, a second line segment on the first line segment, and a third line segment beneath the first line segment;
determining whether the best candidate pixel is located at the first line segment;
if so, setting a candidate motion vector corresponding to the best candidate pixel as a first motion vector of the current macroblock; and
if not, dynamically adjusting the second line segment or the third line segment in the searching window to update the best candidate pixel, and retrieving the first motion vector of the current macroblock corresponding to the updated best candidate pixel.
12. The motion estimation method as claimed in claim 11, wherein the step of determining the start searching point further comprises:
calculating multiple macroblock reference pixels pointed by multiple second neighboring macroblock motion vectors;
calculating multiple neighboring macroblock sum of absolute difference values corresponding to the macroblock reference points; and
assigning one of the macroblock reference points corresponding to the least one of the neighboring macroblock sum of absolute difference values as the start searching point.
13. The motion estimation method as claimed in claim 12, further comprising:
substituting the second neighboring macroblock motion vectors with a zero motion vector when the current macroblock is located at a boundary of a current frame; and
setting the start searching point as the zero point.
14. The motion estimation method as claimed in claim 13, wherein the step of calculating the current macroblock motion vector further comprises the following steps of:
(a) dividing the current macroblock into at least one 8×8 block, taking a pixel word having four pixels including the start searching point as center for each 8×8 block, and retrieving 36 initial candidate points from the first line segment, the second line segment and the third line segment, wherein the first line segment comprises the pixel word and four neighboring pixels left and right to the pixel word;
(b) calculating a first sub-macroblock sum of absolute difference value of each initial candidate point relative to each 8×8 block to obtain an initial current macroblock sum of absolute difference value corresponding to each initial candidate point, and obtaining a first least current macroblock sum of absolute difference value according to the initial current macroblock sum of absolute difference values;
(c) determining whether a best candidate point corresponding to the first least current macroblock sum of absolute difference value is located at the second line segment,
if so, executing step (d);
if not, determining whether the best candidate point corresponding to the first least current macroblock sum of absolute difference value is located at the third line segment,
if so, executing step (g); and
if not, executing step (j);
(d) determining whether the second line segment is located at a boundary of the searching window corresponding to the current macroblock,
if so, executing step (j); and
if not, moving down the second line segment by 1 pixel, adjusting the moved second line segment in a horizontal direction to generate 12 first refined candidate points according to a pixel word at which the best candidate point is located, and executing step (e);
(e) calculating a second sub-macroblock sum of absolute difference value of each 8×8 block relative to the first refined candidate points to obtain a second current macroblock sum of absolute difference value corresponding to each first refined candidate point, and obtaining a second least current macroblock sum of absolute difference value according to the second current macroblock sum of absolute difference value corresponding to each first refined candidate point;
(f) determining whether the second least current macroblock sum of absolute difference value is larger than the first least current macroblock sum of absolute difference value,
if so, executing step (j); and
if not, setting the second least current macroblock sum of absolute difference value as the first least current macroblock sum of absolute difference value, and executing step (d);
(g) determining whether the third line segment is located at the boundary of the searching window corresponding to the current macroblock,
if so, executing step (j); and
if not, moving up the third line segment by 1 pixel, adjusting the moved third line segment in the horizontal direction to generate 12 second refined candidate points according to the pixel word at which the best candidate point is located, and executing step (h);
(h) calculating a third sub-macroblock sum of absolute difference value of each 8×8 block relative to the second refined candidate points to obtain a third current macroblock sum of absolute difference value corresponding to each second refined candidate point, and obtaining a third least current macroblock sum of absolute difference value according to the third current macroblock sum of absolute difference value corresponding to each second refined candidate point;
(i) determining whether the third least current macroblock sum of absolute difference value is larger than the first least current macroblock sum of absolute difference value,
if so, executing step (j); and
if not, setting the third least current macroblock sum of absolute difference value as the first least current macroblock sum of absolute difference value, and executing step (g); and
(j) setting a first motion vector corresponding to the first least current macroblock sum of absolute difference value as the current macroblock integer pixel motion vector, and setting multiple motion vectors pointing to the second sub-macroblock sum of absolute difference value or the third sub-macroblock sum of absolute difference value as multiple sub-macroblock motion vectors corresponding to the 8×8 blocks in the current macroblock.
15. The motion estimation method as claimed in claim 14, further comprising:
searching for eight half pixels around the best candidate point as the center; and
when multiple half pixel sub-macroblock sum of absolute difference values or a half pixel current macroblock sum of absolute difference value corresponding to the half pixels are smaller than the sub-macroblock motion vectors or the current macroblock motion vector, setting the half pixel sub-macroblock sum of absolute difference values or the half pixel current macroblock sum of absolute difference value as the sub-macroblock motion vectors or the current macroblock motion vector, respectively.
16. The motion estimation method as claimed in claim 11, further comprising:
determining an encoding mode of the current macroblock according to a rate distortion optimization value.
17. The motion estimation method as claimed in claim 11, wherein pixels of the searching window are read from a searching window buffer having four memory banks, and the method further comprises:
reading three of the four memory banks sequentially; and
reading pixels of the searching window, which are used for filtering a next macroblock, to one of the four memory banks which is not read via a DMA controller.
18. The motion estimation method as claimed in claim 11, wherein the video codec standards supported by the motion estimation method comprise MPEG2, MPEG4 and H.263.
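The start searching point selection recited in claims 12 and 13 can be illustrated with a minimal software sketch. It is not the claimed circuit: the function names `sad` and `predict_start_point`, the list-of-lists frame layout, the `at_frame_boundary` flag, and the rule of skipping candidates that fall outside the reference frame are all assumptions made for illustration.

```python
def sad(cur, ref, x, y, size):
    """Sum of absolute differences between block `cur` and the
    size x size block of `ref` whose top-left corner is (x, y)."""
    return sum(
        abs(cur[r][c] - ref[y + r][x + c])
        for r in range(size)
        for c in range(size)
    )


def predict_start_point(cur, ref, mb_x, mb_y, neighbor_mvs,
                        size=16, at_frame_boundary=False):
    """Return ((dx, dy), sad_value) for the start searching point.

    neighbor_mvs holds the (dx, dy) motion vectors of neighboring
    macroblocks (hypothetical input format). At a frame boundary the
    neighbor vectors are replaced by the zero motion vector.
    """
    if at_frame_boundary or not neighbor_mvs:
        candidates = [(0, 0)]
    else:
        candidates = list(neighbor_mvs)

    height, width = len(ref), len(ref[0])
    best = None
    for dx, dy in candidates:
        x, y = mb_x + dx, mb_y + dy
        # Assumption: candidates pointing outside the reference frame are skipped.
        if x < 0 or y < 0 or x + size > width or y + size > height:
            continue
        cost = sad(cur, ref, x, y, size)
        if best is None or cost < best[1]:
            best = ((dx, dy), cost)

    if best is None:
        # Fall back to the zero point when no candidate was usable.
        best = ((0, 0), sad(cur, ref, mb_x, mb_y, size))
    return best
```

The candidate with the least sum of absolute difference value becomes the start searching point; the integer pixel search of claim 14 would then begin from that position.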
19. An in-loop filtering acceleration circuit applied in a video codec system supporting the H.264 standard and the VC-1 standard, the video codec system comprising a processing unit to perform video processing to generate at least one reconstructed macroblock and a value of boundary strength corresponding to each edge of the reconstructed macroblock, the circuit comprising:
multiple one-dimensional (1D) filters configured to perform a filtering process; and
a filter selection unit configured to select one of the 1D filters according to the value of the boundary strength to perform the filtering processing to the reconstructed macroblock,
wherein the in-loop filtering acceleration circuit further divides the reconstructed macroblock into multiple 8×8 blocks and multiple 4×4 blocks, performs the filtering process to horizontal edges of the 8×8 blocks of the reconstructed macroblock row by row from bottom to top, and performs the filtering process to horizontal edges of the 4×4 blocks row by row from top to bottom,
wherein the in-loop filtering acceleration circuit further performs the filtering process to vertical edges of the 8×8 blocks column by column from right to left, and performs the filtering process to vertical edges of the 4×4 blocks column by column from left to right.
20. The in-loop filtering acceleration circuit as claimed in claim 19, wherein the 1D filters comprise multiple H.264 strong filters, multiple H.264 general filters, and a VC-1 filter, and the 1D filters further perform the filtering process to horizontal edges or vertical edges of one of the 8×8 blocks.
21. The in-loop filtering acceleration circuit as claimed in claim 20, wherein when the value of boundary strength corresponding to an edge is 0, the in-loop filtering acceleration circuit does not perform the filtering process;
wherein when the value of boundary strength corresponding to the edge is between 1 and 3, the filter selection unit selects the H.264 general filters to perform the filtering process on the edge;
wherein when the value of boundary strength corresponding to the edge is 4, the filter selection unit selects the H.264 strong filters to perform the filtering process on the edge; and
wherein when the value of boundary strength corresponding to the edge is 5, the filter selection unit selects the VC-1 filter to perform the filtering process on the edge.
22. The in-loop filtering acceleration circuit as claimed in claim 19, further comprising:
multiple transposition register arrays, configured to store a portion of the reconstructed macroblock, and transpose pixels of the reconstructed macroblock, so that the transposed pixels of the reconstructed macroblock are read by the 1D filters row by row or column by column.
23. The in-loop filtering acceleration circuit as claimed in claim 20, wherein the filter selection unit further calculates multiple filter selection parameters according to the pixels of the reconstructed macroblock, and selects one of the H.264 strong filters, the H.264 general filters and the VC-1 filter according to the value of boundary strength, a luminance parameter, a clipping parameter, a bit rate parameter and the filter selection parameters to perform the filtering process.
24. An in-loop filtering method applied in an in-loop filtering acceleration circuit of a video codec system supporting the H.264 standard and the VC-1 standard, the video codec system comprising a processing unit to perform video processing to generate at least one reconstructed macroblock and a value of boundary strength corresponding to each edge of the reconstructed macroblock, the method comprising:
dividing the reconstructed macroblock into multiple 8×8 blocks and multiple 4×4 blocks;
selecting one of multiple 1D filters according to the value of the boundary strength to perform the filtering processing to the reconstructed macroblock;
performing the filtering process to horizontal edges of the 8×8 blocks of the reconstructed macroblock row by row from bottom to top, and performing the filtering process to horizontal edges of the 4×4 blocks row by row from top to bottom; and
performing the filtering process to vertical edges of the 8×8 blocks column by column from right to left, and performing the filtering process to vertical edges of the 4×4 blocks column by column from left to right.
25. The in-loop filtering method as claimed in claim 24, wherein the 1D filters comprise multiple H.264 strong filters, multiple H.264 general filters, and a VC-1 filter.
26. The in-loop filtering method as claimed in claim 24, wherein the step of selecting one of the 1D filters according to the value of boundary strength further comprises:
selecting the H.264 general filters to perform the filtering process on the edge when the value of boundary strength corresponding to the edge is between 1 and 3;
selecting the H.264 strong filters to perform the filtering process on the edge when the value of boundary strength corresponding to the edge is 4; and
selecting the VC-1 filter to perform the filtering process on the edge when the value of boundary strength corresponding to the edge is 5.
27. The in-loop filtering method as claimed in claim 24, further comprising:
calculating multiple filter selection parameters according to the pixels of the reconstructed macroblock; and
selecting one of the H.264 strong filters, the H.264 general filters and the VC-1 filter according to the value of boundary strength, a luminance parameter, a clipping parameter, a bit rate parameter and the filter selection parameters to perform the filtering process.
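The boundary strength mapping recited in claims 21 and 26 can be sketched in software as follows. This is an illustrative sketch only: the names `select_filter` and `plan_edges`, the string labels for the filter families, and the (label, boundary strength) edge format are hypothetical, and the additional selection parameters of claims 23 and 27 (luminance, clipping, bit rate and filter selection parameters) are deliberately left out.

```python
def select_filter(boundary_strength):
    """Map a boundary strength value to a filter family name.

    Returns None when the edge is not filtered (boundary strength 0).
    """
    if boundary_strength == 0:
        return None
    if 1 <= boundary_strength <= 3:
        return "h264_general"
    if boundary_strength == 4:
        return "h264_strong"
    if boundary_strength == 5:
        return "vc1"
    raise ValueError(f"unsupported boundary strength: {boundary_strength}")


def plan_edges(edge_strengths):
    """Pair each edge label with its selected filter, dropping edges
    whose boundary strength is 0.

    edge_strengths is a list of (label, boundary_strength) tuples,
    already arranged in the order the edges are to be processed.
    """
    plan = []
    for label, strength in edge_strengths:
        chosen = select_filter(strength)
        if chosen is not None:
            plan.append((label, chosen))
    return plan
```

For example, `plan_edges([("e0", 0), ("e1", 2), ("e2", 4), ("e3", 5)])` leaves out `e0` and assigns the general, strong and VC-1 filters to the remaining edges in turn.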
US13/777,434 2012-02-27 2013-02-26 Motion estimation and in-loop filtering method and device thereof Abandoned US20130223532A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US14/818,886 US10469868B2 (en) 2012-02-27 2015-08-05 Motion estimation and in-loop filtering method and device thereof

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201210046566.9A CN102547296B (en) 2012-02-27 2012-02-27 Motion estimation accelerating circuit and motion estimation method as well as loop filtering accelerating circuit
CN201210046566.9 2012-02-27

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US14/818,886 Division US10469868B2 (en) 2012-02-27 2015-08-05 Motion estimation and in-loop filtering method and device thereof

Publications (1)

Publication Number Publication Date
US20130223532A1 true US20130223532A1 (en) 2013-08-29

Family

ID=46353094

Family Applications (2)

Application Number Title Priority Date Filing Date
US13/777,434 Abandoned US20130223532A1 (en) 2012-02-27 2013-02-26 Motion estimation and in-loop filtering method and device thereof
US14/818,886 Active 2034-11-21 US10469868B2 (en) 2012-02-27 2015-08-05 Motion estimation and in-loop filtering method and device thereof

Country Status (2)

Country Link
US (2) US20130223532A1 (en)
CN (1) CN102547296B (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140341308A1 (en) * 2013-05-15 2014-11-20 Texas Instruments Incorporated Optimized edge order for de-blocking filter
US20140355691A1 (en) * 2013-06-03 2014-12-04 Texas Instruments Incorporated Multi-threading in a video hardware engine
US20150288979A1 (en) * 2012-12-18 2015-10-08 Liu Yang Video frame reconstruction
CN105578197A (en) * 2015-12-24 2016-05-11 福州瑞芯微电子股份有限公司 Master control system for realizing inter-frame prediction
CN105578195A (en) * 2015-12-24 2016-05-11 福州瑞芯微电子股份有限公司 H.264 inter-frame prediction system
US20160189425A1 (en) * 2012-09-28 2016-06-30 Qiang Li Determination of augmented reality information
US20180189587A1 (en) * 2016-12-29 2018-07-05 Intel Corporation Technologies for feature detection and tracking
US20190313095A1 (en) * 2016-12-28 2019-10-10 Sony Corporation Image processing apparatus and image processing method
US20220182608A1 (en) * 2018-12-25 2022-06-09 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Prediction method for decoding and apparatus, and computer storage medium
CN117114971A (en) * 2023-08-01 2023-11-24 北京城建设计发展集团股份有限公司 Pixel map-to-vector map conversion method and system

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014009864A2 (en) * 2012-07-09 2014-01-16 Squid Design Systems Pvt Ltd Programmable variable block size motion estimation processor
US10057590B2 (en) * 2014-01-13 2018-08-21 Mediatek Inc. Method and apparatus using software engine and hardware engine collaborated with each other to achieve hybrid video encoding
US10354394B2 (en) 2016-09-16 2019-07-16 Dolby Laboratories Licensing Corporation Dynamic adjustment of frame rate conversion settings
US10977809B2 (en) 2017-12-11 2021-04-13 Dolby Laboratories Licensing Corporation Detecting motion dragging artifacts for dynamic adjustment of frame rate conversion settings
AU2019351346B2 (en) * 2018-09-24 2023-07-13 Huawei Technologies Co., Ltd. Image processing device and method for performing quality optimized deblocking
CN111935484B (en) * 2020-09-28 2021-01-19 广州佰锐网络科技有限公司 Video frame compression coding method and device

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5428403A (en) * 1991-09-30 1995-06-27 U.S. Philips Corporation Motion vector estimation, motion picture encoding and storage
US20020054642A1 (en) * 2000-11-04 2002-05-09 Shyh-Yih Ma Method for motion estimation in video coding
US6483876B1 (en) * 1999-12-28 2002-11-19 Sony Corporation Methods and apparatus for reduction of prediction modes in motion estimation
US20030072374A1 (en) * 2001-09-10 2003-04-17 Sohm Oliver P. Method for motion vector estimation
US20040062308A1 (en) * 2002-09-27 2004-04-01 Kamosa Gregg Mark System and method for accelerating video data processing
US20050201463A1 (en) * 2004-03-12 2005-09-15 Samsung Electronics Co., Ltd. Video transcoding method and apparatus and motion vector interpolation method
US20070183504A1 (en) * 2005-12-15 2007-08-09 Analog Devices, Inc. Motion estimation using prediction guided decimated search
US20100080297A1 (en) * 2008-09-30 2010-04-01 Microsoft Corporation Techniques to perform fast motion estimation
US20100118961A1 (en) * 2008-11-11 2010-05-13 Electronics And Telecommunications Research Institute High-speed motion estimation apparatus and method

Family Cites Families (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2003021936A2 (en) * 2001-09-05 2003-03-13 Emblaze Semi Conductor Ltd Method for reducing blocking artifacts
US7084929B2 (en) * 2002-07-29 2006-08-01 Koninklijke Philips Electronics N.V. Video data filtering arrangement and method
US20050013494A1 (en) * 2003-07-18 2005-01-20 Microsoft Corporation In-loop deblocking filter
KR100614647B1 (en) * 2004-07-02 2006-08-22 삼성전자주식회사 Register array structure for effective edge filtering operation of deblocking filter
TWI295140B (en) 2005-05-20 2008-03-21 Univ Nat Chiao Tung A dual-mode high throughput de-blocking filter
SG140508A1 (en) * 2006-08-31 2008-03-28 St Microelectronics Asia Multimode filter for de-blocking and de-ringing
US20080084932A1 (en) 2006-10-06 2008-04-10 Microsoft Corporation Controlling loop filtering for interlaced video frames
US9961372B2 (en) * 2006-12-08 2018-05-01 Nxp Usa, Inc. Adaptive disabling of deblock filtering based on a content characteristic of video information
TWI335764B (en) * 2007-07-10 2011-01-01 Faraday Tech Corp In-loop deblocking filtering method and apparatus applied in video codec
JP5232175B2 (en) * 2008-01-24 2013-07-10 パナソニック株式会社 Video compression device
CN101267556B (en) * 2008-03-21 2011-06-22 海信集团有限公司 Quick motion estimation method and video coding and decoding method
CN101272498B (en) * 2008-05-14 2010-06-16 杭州华三通信技术有限公司 Video encoding method and device
CN101742292B (en) 2008-11-14 2013-03-27 北京中星微电子有限公司 Image content information-based loop filtering method and filter
CN101404773B (en) * 2008-11-25 2011-03-30 江苏大学 Image encoding method based on DSP
CN101489131A (en) * 2009-01-22 2009-07-22 上海广电(集团)有限公司中央研究院 Center eccentric motion estimation implementing method
CN101511019A (en) * 2009-03-13 2009-08-19 广东威创视讯科技股份有限公司 Method and apparatus for estimating variable mode motion
CN101715127B (en) 2009-11-19 2013-07-24 无锡中星微电子有限公司 Loop filter method and loop filter system
KR20110125153A (en) * 2010-05-12 2011-11-18 에스케이 텔레콤주식회사 Method and apparatus for filtering image and encoding/decoding of video data using thereof
CN102075757B (en) * 2011-02-10 2013-08-28 北京航空航天大学 Video foreground object coding method by taking boundary detection as motion estimation reference
CN102088610B (en) * 2011-03-08 2013-12-18 开曼群岛威睿电通股份有限公司 Video codec and motion estimation method thereof

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160189425A1 (en) * 2012-09-28 2016-06-30 Qiang Li Determination of augmented reality information
US9691180B2 (en) * 2012-09-28 2017-06-27 Intel Corporation Determination of augmented reality information
US20150288979A1 (en) * 2012-12-18 2015-10-08 Liu Yang Video frame reconstruction
US11700396B2 (en) 2013-05-15 2023-07-11 Texas Instruments Incorporated Optimized edge order for de-blocking filter
US9872044B2 (en) * 2013-05-15 2018-01-16 Texas Instruments Incorporated Optimized edge order for de-blocking filter
US11202102B2 (en) 2013-05-15 2021-12-14 Texas Instruments Incorporated Optimized edge order for de-blocking filter
US10652582B2 (en) 2013-05-15 2020-05-12 Texas Instruments Incorporated Optimized edge order for de-blocking filter
US20140341308A1 (en) * 2013-05-15 2014-11-20 Texas Instruments Incorporated Optimized edge order for de-blocking filter
US20140355691A1 (en) * 2013-06-03 2014-12-04 Texas Instruments Incorporated Multi-threading in a video hardware engine
US11736700B2 (en) 2013-06-03 2023-08-22 Texas Instruments Incorporated Multi-threading in a video hardware engine
US11228769B2 (en) * 2013-06-03 2022-01-18 Texas Instruments Incorporated Multi-threading in a video hardware engine
CN105578195A (en) * 2015-12-24 2016-05-11 福州瑞芯微电子股份有限公司 H.264 inter-frame prediction system
CN105578197A (en) * 2015-12-24 2016-05-11 福州瑞芯微电子股份有限公司 Master control system for realizing inter-frame prediction
US10924735B2 (en) * 2016-12-28 2021-02-16 Sony Corporation Image processing apparatus and image processing method
US20190313095A1 (en) * 2016-12-28 2019-10-10 Sony Corporation Image processing apparatus and image processing method
CN108257176A (en) * 2016-12-29 2018-07-06 英特尔公司 For the technology of feature detect and track
US20180189587A1 (en) * 2016-12-29 2018-07-05 Intel Corporation Technologies for feature detection and tracking
US20220182608A1 (en) * 2018-12-25 2022-06-09 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Prediction method for decoding and apparatus, and computer storage medium
US20220182610A1 (en) * 2018-12-25 2022-06-09 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Prediction method for decoding and apparatus, and computer storage medium
US20220182609A1 (en) * 2018-12-25 2022-06-09 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Prediction method for decoding and apparatus, and computer storage medium
US11677936B2 (en) 2018-12-25 2023-06-13 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Prediction method for decoding and apparatus, and computer storage medium
US11683477B2 (en) * 2018-12-25 2023-06-20 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Prediction method for decoding and apparatus, and computer storage medium
US11683478B2 (en) * 2018-12-25 2023-06-20 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Prediction method for decoding and apparatus, and computer storage medium
US11785208B2 (en) * 2018-12-25 2023-10-10 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Prediction method for decoding and apparatus, and computer storage medium
CN117114971A (en) * 2023-08-01 2023-11-24 北京城建设计发展集团股份有限公司 Pixel map-to-vector map conversion method and system

Also Published As

Publication number Publication date
US10469868B2 (en) 2019-11-05
US20150341658A1 (en) 2015-11-26
CN102547296A (en) 2012-07-04
CN102547296B (en) 2015-04-01

Similar Documents

Publication Publication Date Title
US10469868B2 (en) Motion estimation and in-loop filtering method and device thereof
US10735727B2 (en) Method of adaptive filtering for multiple reference line of intra prediction in video coding, video encoding apparatus and video decoding apparatus therewith
US9948934B2 (en) Estimating rate costs in video encoding operations using entropy encoding statistics
US20060126740A1 (en) Shared pipeline architecture for motion vector prediction and residual decoding
US8737476B2 (en) Image decoding device, image decoding method, integrated circuit, and program for performing parallel decoding of coded image data
AU2015213342A1 (en) Video decoder, video encoder, video decoding method, and video encoding method
US9918088B2 (en) Transform and inverse transform circuit and method
US8514937B2 (en) Video encoding apparatus
CN111669597B (en) Palette decoding apparatus and method
US20140177726A1 (en) Video decoding apparatus, video decoding method, and integrated circuit
US9420308B2 (en) Scaled motion search section with parallel processing and method for use therewith
US9918079B2 (en) Electronic device and motion compensation method
US9300975B2 (en) Concurrent access shared buffer in a video encoder
US8989268B2 (en) Method and apparatus for motion estimation for video processing
US20120163462A1 (en) Motion estimation apparatus and method using prediction algorithm between macroblocks
KR20210096282A (en) Inter prediction method and apparatus
US8189672B2 (en) Method for interpolating chrominance signal in video encoder and decoder
CN116320401A (en) Video encoding and decoding method and related device
JP2009260494A (en) Image coding apparatus and its control method
US20100220786A1 (en) Method and apparatus for multiple reference picture motion estimation
US20030123555A1 (en) Video decoding system and memory interface apparatus
US8737478B2 (en) Motion estimation apparatus and method
EP2073553A1 (en) Method and apparatus for performing de-blocking filtering of a video picture
US11622106B2 (en) Supporting multiple partition sizes using a unified pixel input data interface for fetching reference pixels in video encoders
KR100269426B1 (en) Motion compensator having an improved frame memory

Legal Events

Date Code Title Description
AS Assignment

Owner name: VIA TELECOM, INC., CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:XI, YINGLAI;LI, QIANG;LI, JUMEI;AND OTHERS;REEL/FRAME:029878/0406

Effective date: 20130219

AS Assignment

Owner name: VIA TELECOM CO., LTD., CAYMAN ISLANDS

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:VIA TELECOM, INC.;REEL/FRAME:031298/0044

Effective date: 20130912

AS Assignment

Owner name: INTEL CORPORATION, CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:VIA TELECOM CO., LTD.;REEL/FRAME:037096/0075

Effective date: 20151020

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION