TECHNIQUES TO PERFORM VIDEO STABILIZATION AND DETECT VIDEO SHOT BOUNDARIES BASED ON COMMON PROCESSING ELEMENTS
Field
The subject matter disclosed herein relates generally to techniques to perform video stabilization and detect video shot boundaries using common processing elements.
Related Art
Video stabilization aims to improve visual qualities of video sequences captured by digital video cameras. When cameras are hand held or mounted on unstable platforms, the captured video can appear shaky because of undesired camera motions, which lead to a degraded viewer experience. Video stabilization techniques can be employed to remove or reduce the undesired motions among the captured video frames.
A video usually consists of scenes, and each scene includes one or more shots. A shot is defined as a sequence of frames captured by a single camera in a single continuous action. The change from one shot to another, also known as shot transition, includes two key types: abrupt transition (CUT) and gradual transition (GT). Video shot boundary detection aims to detect shot boundary frames. Video shot boundary detection can be applied in various applications, such as intra frame identification in video coding, video indexing, video retrieval, and video editing.
Brief Description of the Drawings
Embodiments of the present invention are illustrated by way of example, and not by way of limitation, in the drawings and in which like reference numerals refer to similar elements.
FIG. 1 depicts a block diagram format of a video stabilization system in accordance with an embodiment.
FIG. 2 shows a block diagram of an inter-frame dominant motion estimation module, in accordance with an embodiment.
FIG. 3 provides a flow diagram of a process performed to improve video stabilization, in accordance with an embodiment.
FIG. 4 depicts a block diagram of a shot boundary detection system, in accordance with an embodiment.
FIG. 5 provides a process of a shot boundary decision scheme, in accordance with an embodiment.
FIG. 6 depicts a block diagram of a system that performs video stabilization and shot boundary detection, in accordance with an embodiment.
FIG. 7 depicts an example of identification of a matched block in a reference frame using a search window where the matched block corresponds to a target block in a current frame.
Detailed Description
Reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, the appearances of the phrase "in one embodiment" or "an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment.
Furthermore, the particular features, structures, or characteristics may be combined in one or more embodiments.
A graphics processing system may need to support multiple video processing features as well as various video encoding or decoding standards. Various embodiments permit a graphics processing system to support both video stabilization and video shot boundary detection features. In particular, various embodiments permit a graphics processing system to use certain processing capabilities for both video stabilization and shot boundary detection. In some embodiments, down sampling and block motion search features of a graphics processing system are used for both video stabilization and video shot boundary detection. Reuse of features may reduce the cost of manufacturing a graphics processing system and also reduce the size of the graphics processing system.
Various embodiments are capable of encoding or decoding video or still images in accordance with a variety of standards, such as but not limited to: MPEG-4 Part 10 advanced video codec (AVC)/H.264. The H.264 standard has been prepared by the Joint Video Team (JVT), which includes ITU-T SG16 Q.6, also known as VCEG (Video Coding Expert Group), and the ISO/IEC JTC1/SC29/WG11 (2003), known as MPEG (Motion Picture Expert Group). In addition, embodiments may be used in a variety of still image or video compression systems including, but not limited to, object oriented video coding, model based video coding, scalable video coding, as well as MPEG-2 (ISO/IEC 13818-1 (2000), available from the International Organization for Standardization, Geneva, Switzerland), VC-1 (SMPTE 421M (2006), available from SMPTE, White Plains, NY 10601), as well as variations of MPEG-4, MPEG-2, and VC-1.
FIG. 1 depicts a video stabilization system 100, in block diagram format, in accordance with an embodiment. Video stabilization system 100 includes inter-frame dominant motion estimation (DME) block 102, trajectory computation block 104, trajectory smoothing block 106, and jitter compensation block 108. Inter-frame DME block 102 is to determine camera vibration between two consecutive frames in a video sequence. Inter-frame DME block 102 is to identify local motion vectors and then determine the dominant motion parameters based on those local motion vectors.
Trajectory computation block 104 is to calculate the motion trajectory with those determined dominant motion parameters. Trajectory smoothing block 106 is to smooth the calculated motion trajectory to provide a smoother trajectory. Jitter compensation module 108 is to reduce jitter in the smoother trajectory.
FIG. 2 shows a block diagram of an inter-frame dominant motion estimation module 200, in accordance with an embodiment. Module 200 includes frame down- sampling block 202, reference buffer 204, block motion search block 206, iterative least square solver block 208, and motion up-scaling block 210.
Down-sampling block 202 is to down-scale input frames to a smaller size. For example, a down-sampling factor of approximately 4-5 may be used, although other values can be used. In some embodiments, down-sampling block 202 provides smaller sized frames that are approximately 160x120 pixels. A resulting downscaled frame has fewer blocks. A block may be 8x8, 16x16, or other sizes due to the design of the common processing element. Generally, a 16x16 block is used. The downscaling process also down-scales block motion vectors. In various embodiments, a motion vector represents a vertical and horizontal displacement of a pixel, a block, or an image between frames. Downscaling the frames also downscales the x and y motions between two frames. For example, if the down-sampling factor is 4 and the motion vector is (20, 20), the downscaled motion vector will be approximately (5, 5) in the downscaled frames. As a result, a window/region-limited block motion search on a smaller picture can encompass larger motions on the original frames. Accordingly, the processing speed and processing resources used to process blocks can be reduced.
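The relationship between down-sampling and motion vector scale can be sketched as follows. This is a minimal illustration only; the block-averaging filter and the function names are assumptions, not the hardware down-sampler described above.

```python
import numpy as np

def downsample(frame, factor):
    """Down-scale a grayscale frame by simple block averaging (a stand-in
    for whatever filter the hardware down-sampler actually uses)."""
    h, w = frame.shape
    h -= h % factor
    w -= w % factor  # crop so dimensions are multiples of the factor
    blocks = frame[:h, :w].reshape(h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(1, 3))

# Down-scaling the frames also down-scales the motion between them:
# a (20, 20) motion vector at full resolution becomes roughly (5, 5)
# after down-sampling by a factor of 4, as in the example above.
factor = 4
full_res_mv = (20, 20)
downscaled_mv = tuple(v // factor for v in full_res_mv)
print(downscaled_mv)  # (5, 5)
```

This is why a window-limited search on the small frame can cover larger motions on the original frame: the search range effectively scales by the down-sampling factor.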
Down-sampling block 202 is to store the down-sampled frames into reference buffer 204. Reference buffer 204 may be a region in memory that is available for use at least in performing video stabilization and shot boundary detection. The region may be a buffer or a portion of a buffer. For example, if the region is a portion of a buffer, the other portions of the same buffer can be used simultaneously or at other times by other applications or processes. In various embodiments, for video stabilization and shot boundary detection, a single reference frame is used. Accordingly, the size of the reference buffer can be set to store one frame. At each updating of the reference buffer, a reference frame can be replaced with another reference frame.
Block motion search block 206 is to receive a down-sampled current frame from down-sampling block 202 and also to receive the down-sampled previous reference frame from reference buffer 204. Block motion search block 206 is to identify a local motion vector of selected blocks within a pre-defined search window. For example, the identified motion vector can be the motion vector associated with a block in a search window with the lowest sum of absolute difference (SAD) with respect to a target block in the current frame. The block in the search window may be a macroblock or a small block, such as 8x8 pixels, although other sizes can be used. In some embodiments, the block size is 16x16 pixels and the search window can be set to 48x32 pixels. In various embodiments, block motion search block 206 does not search for motion vectors associated with blocks on frame borders.
In some embodiments, block motion search block 206 is to determine a sum of absolute difference (SAD) for macro blocks of each frame. For example, determining a SAD for each macro block in a frame may include comparing each 16x16 pixel macro block of a reference frame with a 16x16 pixel macro block in a current frame. For example, in some embodiments, all macro blocks within a 48x32 pixel search window of a reference frame can be compared with a target 16x16 pixel macro block in a current frame. The target macro block can be picked one by one or in a chessboard pattern. For a full search, all macroblocks in a 48x32 search window may be compared with the target macro block. Accordingly, 32x16 (512) macroblocks can be compared. When moving a 16x16 macroblock within a 48x32 search window, there are 32x16 positions to which it can move.
Accordingly, in this example, 512 SADs are determined.
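A full-search SAD comparison over the 48x32 window can be sketched as follows. This is an illustrative reference implementation, not the hardware search engine; the function names and the window clipping at frame borders are assumptions.

```python
import numpy as np

BLOCK = 16             # 16x16 target blocks
WIN_W, WIN_H = 48, 32  # search window size used in the text

def sad(a, b):
    """Sum of absolute differences between two equally sized blocks."""
    return int(np.abs(a.astype(np.int32) - b.astype(np.int32)).sum())

def full_search(ref, cur, x, y):
    """Full block motion search: compare the 16x16 target block at (x, y)
    in the current frame with candidate blocks inside a 48x32 window of
    the reference frame centered on (x, y), and return the local motion
    vector of the lowest-SAD candidate.  Stepping a 16x16 block through a
    48x32 window gives 32x16 = 512 displacement positions, matching the
    count in the text."""
    target = cur[y:y + BLOCK, x:x + BLOCK]
    x0 = max(0, x - (WIN_W - BLOCK) // 2)
    y0 = max(0, y - (WIN_H - BLOCK) // 2)
    best_mv, best_sad = (0, 0), None
    for yy in range(y0, min(y0 + WIN_H - BLOCK, ref.shape[0] - BLOCK + 1)):
        for xx in range(x0, min(x0 + WIN_W - BLOCK, ref.shape[1] - BLOCK + 1)):
            s = sad(ref[yy:yy + BLOCK, xx:xx + BLOCK], target)
            if best_sad is None or s < best_sad:
                best_sad, best_mv = s, (xx - x, yy - y)
    return best_mv, best_sad
```

When the reference frame is simply a shifted copy of the current frame, the returned motion vector is that shift and the best SAD is zero.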
FIG. 7 depicts an example of identification of a matched block in a reference frame using a search window where the matched block corresponds to a target block in a current
frame. An exemplary block motion search may include the following steps.
(1) Select multiple target blocks in a current frame. Let the coordinates of the target blocks be (x_i, y_i), where i is the block index. Target blocks in the current frame can be selected one by one, although other selection techniques can be used, such as selecting them in a chessboard pattern.
(2) For target block i in the current frame, block motion search is used in the search window to identify the matched block and obtain a local motion vector (mvx_i, mvy_i). Finding a matched block in the search window in the reference frame for target block i can include comparing all candidate blocks in a reference frame search window with the target block, and the one with the minimum SAD is regarded as the matched block.
(3) After block motion search for block i, calculate: x'_i = x_i + mvx_i and y'_i = y_i + mvy_i. Then, (x_i, y_i) and (x'_i, y'_i) are regarded as a pair.
(4) After performing block motion search for all selected target blocks in the current frame, multiple pairs (x_i, y_i) and (x'_i, y'_i) are obtained.
As shown in FIG. 7, for one target block (x, y) in a current frame, the 48x32 search window is specified in a reference frame, and the position of the search window can be centered on (x, y). After finding the matched block in the search window by block motion search, the local motion vector (mvx, mvy) for the target block is obtained. The coordinates of the matched block (x', y') are x' = x + mvx, y' = y + mvy. Then, (x, y) and (x', y') are regarded as a pair.
Referring again to FIG. 2, iterative least square solver 208 is to determine dominant motion parameters based on at least two identified local motion vectors. In some embodiments, iterative least square solver 208 is to apply the similarity motion model shown in FIG. 2 to approximate the dominant inter-frame motion parameters. The similarity motion model can also be written in the format of equation (1) below.
x' = a*x + b*y + c
y' = -b*x + a*y + d     (1)
where:
(x', y') represents the matched block coordinates in a reference frame, (x, y) represents the block coordinates in the current frame, and (a, b, c, d) represents the dominant motion parameters, where parameters a and b relate to rotation and parameters c and d relate to translation.
For example, block coordinates (x', y') and (x, y) could be defined as the top-left corner, bottom-right corner, or block center of a block, as long as consistently used. For a block whose coordinates are (x, y) and whose identified local motion vector (from block 206) is (mvx, mvy), the coordinates (x', y') of its matched block are obtained by x' = x + mvx and y' = y + mvy. In various embodiments, all (x, y) and (x', y') pairs of a frame are used in equation (1). Iterative least squares solver block 208 is to determine the motion parameters (a, b, c, d) by solving equation (1) using the Least Squares (LS) technique.
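The least squares solve of equation (1) can be sketched as follows. Each coordinate pair contributes two linear equations in the four unknowns (a, b, c, d), so two or more pairs suffice. The function name is an assumption; this is an illustrative sketch, not the solver's actual implementation.

```python
import numpy as np

def solve_similarity(pairs):
    """Least-squares estimate of the dominant motion parameters (a, b, c, d)
    in the similarity model of equation (1):
        x' =  a*x + b*y + c
        y' = -b*x + a*y + d
    from a list of ((x, y), (x', y')) block-coordinate pairs."""
    A, z = [], []
    for (x, y), (xp, yp) in pairs:
        A.append([x,  y, 1, 0]); z.append(xp)   # row for x' =  a*x + b*y + c
        A.append([y, -x, 0, 1]); z.append(yp)   # row for y' = -b*x + a*y + d
    params, *_ = np.linalg.lstsq(np.asarray(A, float), np.asarray(z, float),
                                 rcond=None)
    return params  # array([a, b, c, d])
```

Given noise-free pairs generated by a known similarity transform, the fit recovers that transform exactly.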
Outlier local motion vectors may negatively impact estimation of dominant motions if considered by iterative least square solver 208. Outlier local motion vectors may be identified by block motion search block 206 if some blocks in a current frame are selected from an area that includes foreground objects or repeated similar patterns. In various embodiments, iterative least square solver 208 uses an iterative least square (ILS) solver to reduce the effect of the outlier local motion vectors by identifying and removing outlier location motion vectors from consideration. In such embodiments, after
determining dominant motion parameters using equation (1) above, iterative least square solver 208 is to determine the squared estimation error (SEE) of each remaining block position (x_i, y_i) in the current frame. Block position (x_i, y_i) can be the top-left corner, bottom-right corner, or block center, as long as consistently used.
SEE_i = (a*x_i + b*y_i + c - x'_i)^2 + (-b*x_i + a*y_i + d - y'_i)^2     (2)
A local motion vector is regarded as an outlier if its corresponding squared estimation error (SEE) satisfies equation (3).
SEE_i > T * (1/n) * (SEE_1 + SEE_2 + ... + SEE_n)     (3)
where,
T is a constant, which can be empirically set to 1.4, although other values can be used and
n is the number of remaining blocks in the current frame.
Equations (1)-(3) above are repeated until no outlier local motion vectors are detected or the number of remaining blocks is less than a predefined threshold number. For example, the threshold number can be 12, although other numbers can be used. In each iteration of equations (1)-(3), the detected outlier motion vectors and blocks associated with the outlier motion vectors are not considered. Instead, motion vectors associated with the remaining blocks are considered. After removing outlier local motion vectors from consideration, iterative least squares block 208 performs equation (1) to determine the motion parameters.
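The fit / outlier-rejection loop of equations (1)-(3) can be sketched as follows. The thresholds T = 1.4 and a minimum of 12 remaining blocks follow the text; the function names and exact termination details are assumptions.

```python
import numpy as np

def fit_similarity(pairs):
    """One least-squares solve of equation (1) for (a, b, c, d)."""
    A, z = [], []
    for (x, y), (xp, yp) in pairs:
        A.append([x,  y, 1, 0]); z.append(xp)
        A.append([y, -x, 0, 1]); z.append(yp)
    p, *_ = np.linalg.lstsq(np.asarray(A, float), np.asarray(z, float),
                            rcond=None)
    return p

def iterative_fit(pairs, T=1.4, min_blocks=12):
    """Iterative least squares: fit the model, compute each block's squared
    estimation error (equation (2)), discard blocks whose SEE exceeds T
    times the mean SEE (equation (3)), and repeat until no outliers remain
    or fewer than min_blocks blocks are left."""
    pairs = list(pairs)
    while True:
        a, b, c, d = fit_similarity(pairs)
        see = [(a*x + b*y + c - xp)**2 + (-b*x + a*y + d - yp)**2
               for (x, y), (xp, yp) in pairs]
        thresh = T * sum(see) / len(see)
        inliers = [p for p, e in zip(pairs, see) if e <= thresh]
        if len(inliers) == len(pairs) or len(inliers) < min_blocks:
            return a, b, c, d
        pairs = inliers
```

With a handful of gross outliers (e.g. blocks on a moving foreground object), the loop discards them and the returned parameters match the dominant motion of the remaining blocks.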
Motion up-scaling block 210 is to up-scale the translation motion parameters, c and d, according to the inverse of the down-sampling factor applied by down-sampling block 202. Because the down-sampling process does not affect the rotation and scaling motions between two frames, the parameters a and b may not be upscaled.
Referring again to FIG. 1, trajectory computation block 104 is to determine a trajectory. For example, trajectory computation block 104 is to determine the motion trajectory of frame j, T_j, using the accumulated motion as defined in equation (4).
T_j = M_j * T_(j-1)     (4)
where,
M_j is the global motion matrix between frames j and j-1 and is based on the dominant motion parameters (a, b, c, d) determined for the current frame (referred to as frame j).
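Accumulating the per-frame dominant motion into a trajectory can be sketched by representing each motion model as a 3x3 homogeneous matrix. The multiplication order (new motion applied on the left of the running trajectory) and the identity initial condition are assumptions consistent with accumulating inter-frame motion.

```python
import numpy as np

def motion_matrix(a, b, c, d):
    """3x3 homogeneous matrix for the similarity model of equation (1)."""
    return np.array([[a,  b, c],
                     [-b, a, d],
                     [0., 0., 1.]])

def trajectory(per_frame_params):
    """Accumulate per-frame dominant motion parameters into motion
    trajectories T_j = M_j * T_(j-1), starting from the identity."""
    T = np.eye(3)
    out = []
    for a, b, c, d in per_frame_params:
        T = motion_matrix(a, b, c, d) @ T
        out.append(T.copy())
    return out
```

For two pure translations the accumulated trajectory is simply their sum, which is a quick sanity check on the composition.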
An inter-frame global motion vector includes camera intended motion and camera jitter motion. Trajectory smoothing block 106 is to reduce camera jitter motion from an inter-frame global motion vector. In various embodiments, trajectory smoothing block 106 is to reduce camera jitter motion by using motion trajectory smoothing. The low frequency component of the motion trajectory is recognized as the camera intended movement. After trajectory computation block 104 determines the motion trajectory of each frame, trajectory smoothing block 106 is to increase the smoothness of the motion trajectory using a low-pass filter, such as but not limited to a Gaussian filter. The Gaussian filter window can be set to 2n+1 frames. The filtering process introduces an n frame delay. Experimental results show that n can be set to 5, although other values can be used. The smoother motion trajectory, T'_j, can be determined using equation (5).
T'_j = Σ_(k=-n..n) g(k) * T_(j+k)     (5)
where g(k) is the Gaussian filter kernel. A Gaussian filter is a low-pass filter, with g(k) = (1 / sqrt(2*pi*δ^2)) * exp(-k^2 / (2*δ^2)). After specifying its variation value δ, the filter coefficients can be calculated. In some embodiments, the variation value is set to 1.5, but it can be set to other values. A larger variation value may produce a smoother motion trajectory.
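The Gaussian kernel and the trajectory smoothing of equation (5) can be sketched as follows. Normalizing the kernel weights to sum to 1 (so a constant trajectory passes through unchanged) is an assumption; the text does not state how the coefficients are normalized.

```python
import numpy as np

def gaussian_kernel(n=5, delta=1.5):
    """Gaussian filter kernel g(k) = (1/sqrt(2*pi*delta^2)) * exp(-k^2/(2*delta^2))
    over a (2n+1)-frame window, normalized so the weights sum to 1
    (normalization assumed)."""
    k = np.arange(-n, n + 1)
    g = np.exp(-k**2 / (2.0 * delta**2)) / np.sqrt(2.0 * np.pi * delta**2)
    return g / g.sum()

def smooth_trajectory(T, j, n=5, delta=1.5):
    """Smoothed trajectory T'_j = sum over k of g(k) * T_(j+k), per
    equation (5).  T is a sequence of per-frame trajectory matrices; using
    frames j-n .. j+n introduces the n-frame delay noted in the text."""
    g = gaussian_kernel(n, delta)
    return sum(g[k + n] * np.asarray(T[j + k], float) for k in range(-n, n + 1))
```

A larger delta flattens the kernel, which smooths the trajectory more aggressively, matching the observation above about the variation value.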
Jitter compensation block 108 is to compensate jitter in the un-smoothed original trajectory. Camera jitter motion is the high frequency component of the trajectory. The high frequency component of the trajectory is the difference between the original trajectory and the smoothed trajectory. Jitter compensation block 108 is to compensate the high frequency component and provide a more stabilized current frame. For example, the more stabilized frame representation, frame F'(j), for the current frame may be obtained by warping current frame F(j) with the jitter motion parameters.
After performing trajectory smoothing for the j-th current frame F(j), the motion differences between T(j) and T'(j) (shown in equations (4) and (5)) are regarded as jitter motions. Jitter motions can be represented by jitter motion parameters (a', b', c', d'). The following describes a manner to determine (a', b', c', d') from the difference between T(j) and T'(j). Suppose the motion parameters of T(j) are (a1, b1, c1, d1) and the smoothed motion parameters of T'(j) are (a2, b2, c2, d2). Setting θ1 = arctan(b1/a1) and θ2 = arctan(b2/a2), the jitter motion parameters are determined as follows:
a' = cos(θ1 - θ2), b' = sin(θ1 - θ2), c' = c1 - c2, d' = d1 - d2. An exemplary warping process is as follows.
(1) For any pixel positioned at (x, y) in the more stabilized frame F'(j), the pixel value is denoted by F'(x,y,j).
(2) The corresponding position (x', y') in current frame F(j) is determined as x' = a'*x + b'*y + c', y' = -b'*x + a'*y + d'.
(3) If x' and y' are integers, set F'(x, y, j) = F(x', y', j). Otherwise, calculate F'(x, y, j) through bi-linear interpolation using the pixels in F(j) around the position (x', y').
(4) If (x', y') is outside the current frame F(j), set F'(x, y, j) to a black pixel.
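The jitter-parameter computation and warping steps (1)-(4) can be sketched as follows. This is a slow, illustrative reference implementation: using `arctan2` in place of arctan(b/a), and treating the last row and column of the source as out-of-bounds for interpolation, are assumptions.

```python
import numpy as np

def jitter_params(p1, p2):
    """Jitter motion parameters (a', b', c', d') from the parameters
    (a1, b1, c1, d1) of T(j) and (a2, b2, c2, d2) of T'(j)."""
    a1, b1, c1, d1 = p1
    a2, b2, c2, d2 = p2
    t1, t2 = np.arctan2(b1, a1), np.arctan2(b2, a2)  # theta1, theta2
    return np.cos(t1 - t2), np.sin(t1 - t2), c1 - c2, d1 - d2

def warp(frame, a, b, c, d):
    """Warp a grayscale frame F(j) into the stabilized frame F'(j):
    for each output pixel (x, y), sample F(j) at x' = a*x + b*y + c,
    y' = -b*x + a*y + d with bi-linear interpolation; positions outside
    the frame stay black (step (4))."""
    h, w = frame.shape
    out = np.zeros((h, w))
    for y in range(h):
        for x in range(w):
            xs = a * x + b * y + c
            ys = -b * x + a * y + d
            x0, y0 = int(np.floor(xs)), int(np.floor(ys))
            if x0 < 0 or y0 < 0 or x0 + 1 >= w or y0 + 1 >= h:
                continue  # outside (or on the border): left black
            fx, fy = xs - x0, ys - y0
            out[y, x] = ((1 - fx) * (1 - fy) * frame[y0, x0]
                         + fx * (1 - fy) * frame[y0, x0 + 1]
                         + (1 - fx) * fy * frame[y0 + 1, x0]
                         + fx * fy * frame[y0 + 1, x0 + 1])
    return out
```

When x' and y' land on integer positions, the bilinear weights collapse to a direct pixel copy, which matches step (3)'s integer case.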
FIG. 3 provides a flow diagram of a process to improve video stabilization, in accordance with an embodiment. Block 302 includes performing frame size down scaling. For example, techniques described with regard to down-sampling block 202 may be used to perform frame size down scaling.
Block 304 includes performing block motion search to identify two or more local motion vectors. For example, techniques described with regard to block motion search block 206 may be used to identify the local motion vectors.
Block 306 includes determining dominant motion parameters. For example, techniques described with regard to iterative least squares block 208 may be used to determine dominant motion parameters.
Block 308 includes up-scaling dominant motion parameters. For example, techniques described with regard to up-scaling block 210 may be used to up-scale dominant motion parameters.
Block 310 includes determining a trajectory. For example, techniques described with regard to trajectory computation block 104 may be used to determine a trajectory.
Block 312 includes improving trajectory smoothness. For example, techniques described with regard to trajectory smoothing block 106 may be used to perform trajectory smoothing.
Block 314 includes performing jitter compensation by warping a current frame to provide a more stable version of the current frame. For example, techniques described with regard to jitter compensation block 108 may be used to reduce jitter.
FIG. 4 depicts a block diagram of a shot boundary detection system, in accordance with an embodiment. In various embodiments, some results from inter-frame dominant motion estimation block 102 used by video stabilization system 100 are also used by shot boundary detection system 400. For example, the same information available from any of down-sampling block 202, reference buffer 204, and block motion search block 206 can be used in either or both of video stabilization and shot boundary detection. In some embodiments, shot boundary detection system 400 detects abrupt scene transition (i.e., a
CUT scene). Shot boundary decision block 402 is to determine whether a frame is a scene change frame. For example, shot boundary decision block 402 may use a process described with regard to FIG. 5 to determine whether a current frame is a scene change frame.
FIG. 5 provides a process of a shot boundary decision scheme, in accordance with an embodiment. Blocks 502 and 504 are substantially similar to respective blocks 302 and
304.
Block 506 includes determining a mean sum of absolute difference (SAD) for the current frame. Note that the current frame is a down-scaled frame. For example, block
506 may include receiving a SAD for each macro block in the current frame from block motion search block 206 and determining the mean of the SADs of all macro-blocks in the
current frame.
Block 508 includes determining whether the mean SAD is less than a threshold, T0. T0 can be empirically set to approximately 1600 for a 16x16 block, although other values can be used. If the mean SAD is less than the threshold, then the frame is not a shot-boundary frame. If the mean SAD is not less than the threshold, then block 510 follows block 508.
Block 510 includes determining a number of blocks with a SAD larger than threshold T1. Threshold T1 can be empirically set to 4 times the mean SAD, although other values can be used.
Block 512 includes determining whether the number of blocks with a SAD larger than threshold T1 is less than another threshold, T2. Threshold T2 can be empirically set to two thirds of the total number of target blocks in a frame, although other values of T2 can be used. If the number of blocks with a SAD larger than threshold T1 is less than the threshold T2, then the current frame is not considered a shot boundary frame. If the number of blocks is equal to or greater than the threshold T2, then the current frame is considered a shot boundary frame.
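The decision scheme of blocks 506-512 can be sketched as follows. The text leaves open which mean SAD the T1 threshold is four times of, so this sketch lets the caller pass T1 explicitly and defaults it to 4 * T0; that default, and the function name, are assumptions.

```python
def is_shot_boundary(block_sads, t0=1600.0, t1=None, t2_fraction=2.0 / 3.0):
    """Shot-boundary decision of FIG. 5.  block_sads holds the per-block
    SADs of the down-scaled current frame.  If the mean SAD is below T0 the
    frame is not a boundary (block 508); otherwise the frame is a boundary
    when at least T2 (two thirds of the target blocks) have a SAD above T1
    (blocks 510-512)."""
    mean_sad = sum(block_sads) / len(block_sads)
    if mean_sad < t0:
        return False                                     # block 508
    if t1 is None:
        t1 = 4.0 * t0                                    # assumed default
    n_large = sum(1 for s in block_sads if s > t1)       # block 510
    return n_large >= t2_fraction * len(block_sads)      # block 512
```

A calm frame (low SADs everywhere) exits early at the mean-SAD test; an abrupt CUT, where nearly every block mismatches the reference, passes both tests.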
FIG. 6 depicts a block diagram of a system that is to perform video stabilization and shot boundary detection, in accordance with an embodiment. In various embodiments, frame down-sampling and block motion search operations are implemented in hardware. The frame down-sampling and block motion search operations are shared by both video stabilization and shot boundary detection applications. In various embodiments, for video stabilization (VS), trajectory computation, trajectory smoothing, jitter motion
determination, and jitter compensation operations are performed in software executed by a processor. In various embodiments, shot boundary detection (SBD) is performed in software executed by a processor, where the shot boundary detection uses results from the hardware-implemented frame down-sampling and block motion search operations. Other video or image processing techniques can make use of the results provided by down sampling or block motion search.
Processed images and video can be stored into any type of memory such as a transistor-based memory or magnetic memory.
The frame buffer may be a region in a memory. A memory can be implemented as a volatile memory device such as but not limited to a Random Access Memory (RAM), Dynamic Random Access Memory (DRAM), Static RAM (SRAM), or other type of
semiconductor-based memory or magnetic memory such as a magnetic storage device.
When designing a media processor with multiple video processing features, e.g., video encoding, de-interlacing, super-resolution, frame rate conversion, and so forth, hardware re-use can be a very efficient way to save cost and reduce the form factor. Various embodiments greatly reduce the complexity of implementing both video stabilization and video shot boundary detection features on the same media processor, especially when the media processor already supports a block motion estimation function.
The graphics and/or video processing techniques described herein may be implemented in various hardware architectures. For example, graphics and/or video functionality may be integrated within a chipset. Alternatively, a discrete graphics and/or video processor may be used. As still another embodiment, the graphics and/or video functions may be implemented by a general purpose processor, including a multi-core processor. In a further embodiment, the functions may be implemented in a consumer electronics device such as portable computers and mobile telephones with display devices capable of displaying still images or video. The consumer electronics devices may also include a network interface capable of connecting to any network such as the internet using any standards such as Ethernet (e.g., IEEE 802.3) or wireless standards (e.g., IEEE 802.11 or IEEE 802.16).
Embodiments of the present invention may be implemented as any or a
combination of: one or more microchips or integrated circuits interconnected using a motherboard, hardwired logic, software stored by a memory device and executed by a microprocessor, firmware, an application specific integrated circuit (ASIC), and/or a field programmable gate array (FPGA). The term "logic" may include, by way of example, software or hardware and/or combinations of software and hardware.
Embodiments of the present invention may be provided, for example, as a computer program product which may include one or more machine-readable media having stored thereon machine-executable instructions that, when executed by one or more machines such as a computer, network of computers, or other electronic devices, may result in the one or more machines carrying out operations in accordance with
embodiments of the present invention. A machine-readable medium may include, but is not limited to, floppy diskettes, optical disks, CD-ROMs (Compact Disc-Read Only Memories), and magneto-optical disks, ROMs (Read Only Memories), RAMs (Random Access Memories), EPROMs (Erasable Programmable Read Only Memories), EEPROMs
(Electrically Erasable Programmable Read Only Memories), magnetic or optical cards, flash memory, or other type of media / machine-readable medium suitable for storing machine-executable instructions.
The drawings and the foregoing description give examples of the present invention. Although depicted as a number of disparate functional items, those skilled in the art will appreciate that one or more of such elements may well be combined into single functional elements. Alternatively, certain elements may be split into multiple functional elements. Elements from one embodiment may be added to another embodiment. For example, orders of processes described herein may be changed and are not limited to the manner described herein. Moreover, the actions of any flow diagram need not be implemented in the order shown; nor do all of the acts necessarily need to be performed. Also, those acts that are not dependent on other acts may be performed in parallel with the other acts. The scope of the present invention, however, is by no means limited by these specific examples. Numerous variations, whether explicitly given in the specification or not, such as differences in structure, dimension, and use of material, are possible. The scope of the invention is at least as broad as given by the following claims.