PICTURE MASKING AND COMPOSITING IN THE FREQUENCY DOMAIN

BACKGROUND OF THE INVENTION
Field of the Invention
The present invention relates to video processing systems, and, in particular, to apparatuses and methods for performing picture masking and compositing in the DCT domain.
Description of the Related Art
Computer systems are frequently used to perform various types of video or image processing, such as picture masking and compositing. In masking, a specified fraction of certain pixels of a first image are retained in a new image, according to a provided mask. In compositing, pixels of two input images are combined or "blended" at a certain ratio, to form a new image.
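For purposes of illustration only, these two spatial-domain operations may be sketched as follows in Python (using the NumPy library); the image sizes and pixel values are hypothetical and form no part of the claimed invention:

```python
import numpy as np

# Hypothetical 4x4 grayscale "images" and a mask with values in [0, 1].
x0 = np.full((4, 4), 200.0)   # first input image (e.g., foreground)
x1 = np.full((4, 4), 50.0)    # second input image (e.g., background)
w = np.zeros((4, 4))
w[:, :2] = 1.0                # mask: left half from x0, right half from x1

# Masking: retain a specified fraction of each pixel of one image.
masked = w * x0

# Compositing: blend two input images pixel-by-pixel according to the mask.
composite = w * x0 + (1.0 - w) * x1

print(composite[0, 0], composite[0, 3])  # 200.0 50.0
```

In this sketch a mask value of one retains the pixel of the first image, and a mask value of zero retains the pixel of the second image.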
Such masking and compositing are important operations, for example in commercial video or image processing. For example, commonly used effects such as chroma keying, wipe, and overlaying are based on compositing pictures from two video sources. Masking and compositing are also frequently used in production of still images, for example, when generating magazine advertisements and posters.
Computer systems are also used for various data encoding purposes, such as video compression. Many video compression standards (including JPEG, MPEG-1, MPEG-2, H.261, and
H.263) are based on the discrete cosine transform (DCT). Because compressed pictures under these standards are represented in the DCT domain, it may be desirable to process such pictures directly in the DCT domain. However, image processing techniques like masking and compositing are typically designed to operate in the spatial domain, not the frequency, or DCT, domain. Thus, if the image processing of compressed video signals is done in the spatial domain, the input compressed video signals must be transformed into the spatial domain before being processed, and the processed signal must be transformed back into the DCT domain once more. Such transformation to the spatial domain and back into the frequency domain can be very computationally expensive and, therefore, undesirable. Moreover, conventional "brute force" convolutions performed directly in the frequency domain are also extremely computationally expensive.
SUMMARY
In the present invention, at least one image signal and a mask signal are received, wherein the image signal and mask signal are in the DCT domain. Masking of the image signal is performed in the DCT domain, in accordance with the mask signal, by representing the masking in terms of the DCT basis functions, to provide an output image signal.
BRIEF DESCRIPTION OF THE DRAWINGS
Fig. 1 shows a prior art spatial domain image processing system;
Fig. 2 is a block diagram of a DCT domain image processing system, in accordance with a preferred embodiment of the present invention; and
Fig. 3 depicts an exemplary processed image processed by the DCT domain image processing system of Fig. 2.
DESCRIPTION OF THE PREFERRED EMBODIMENT
As explained above, performing image processing techniques like masking and compositing in the spatial domain offers drawbacks when processing compressed video signals, which are in a frequency domain such as the DCT domain. Accordingly, in the present invention, there is provided an efficient method and associated apparatus for implementing picture masking and compositing in the DCT domain. As described in further detail below, the technique of the present invention is based on representing the masking function in terms of the DCT basis functions and computing the masking as a weighted sum of the results of masking by the DCT basis functions.
In the DCT domain, masking by the DCT basis functions has a relatively simple and efficient implementation. Because of the energy compaction property of the DCT, the weight of many of the functions is very small and can be dropped from the weighted sum without introducing noticeable artifacts. This leads to very efficient implementations for masking and compositing images in the DCT domain, typically requiring less than three multiplications per pixel. These and other features and advantages of the present invention are described in further detail below.
Spatial Domain Processing of DCT Images
Referring now to Fig. 1, there is shown a prior art spatial domain image processing system 100. As illustrated, spatial domain image processing system 100 includes three inverse DCT (IDCT) functional blocks 120, 121, 122, and a DCT functional block 130, as well as spatial domain processing functional block 110. As will be appreciated by those skilled in the art, each of these functional blocks may be implemented in hardware or software. For example, the IDCT and DCT operations of blocks 120, 121, 122, and 130, as well as the spatial domain processing of block 110, may be performed by a suitably programmed general-purpose or special-purpose microprocessor.
System 100 receives as input signals the mask signal and image signals x0 and x1, each of which is in the DCT domain. For example, image signals x0 and x1 may have been previously compressed with a process that utilizes the DCT. System 100 outputs output image signal y, which represents the compositing of image signals x0 and x1 in accordance with the mask signal. Output image signal y is also in the DCT domain. Since block 110 performs image processing in the spatial domain (e.g., with RGB or YUV spatial representations of image pixels), IDCT blocks 120, 121, and 122 are necessary in prior art systems to transform the input signals into the spatial domain.
Once the (spatial domain) input signals are processed, the processed output signal must be transformed back into the DCT domain, to provide signal y.
As will be appreciated, it is trivial for spatial domain processing unit 110 to implement spatial masking in the spatial domain by using spatial windowing. For an input picture x[m,n], masking (also referred to as windowing) with the mask, or window, w[m,n], is simply
y[m,n] = w[m,n] x x[m,n]    (1)
As will be appreciated, windowing in the spatial domain is equivalent to convolution in the frequency domain. The masking in (1) can, therefore, be implemented by DCT processing of DCT signals as
Y[k,l] = W[k,l] * X[k,l]    (2)

where X[k,l], Y[k,l], and W[k,l] are the frequency representations of x[m,n], y[m,n], and w[m,n], respectively, * is the convolution operator, m, n are the spatial domain indices, and k, l are the DCT or frequency domain indices. The approach in (2) is a "brute force" DCT domain processing implementation based on symmetric convolution. As will be appreciated, a symmetric convolution is achieved by making a symmetric extension of two finite-length signals and then convolving the extended signals together using circular convolution. If the frequency domain is the discrete Fourier transform (DFT) domain, the convolution in (2) is circular convolution. Further background on such techniques may be found in D.E. Dudgeon & R.M. Mersereau, Multidimensional Digital Signal Processing, Prentice-Hall, Englewood Cliffs, NJ, 1984. If the frequency domain is the DCT domain (or the domain of another discrete trigonometric transform), the convolution in (2) is a symmetric convolution. Further background on symmetric convolutions may be found in S.A. Martucci, Symmetric Convolution and the Discrete Sine and Cosine Transforms: Principles and Applications, PhD thesis, Georgia Institute of Technology, 1993. Spatial masking in the DCT domain can, therefore, be implemented by using symmetric convolution according to (2).
Masking can be used to implement compositing of two input pictures x0[m,n] and x1[m,n] according to
y[m,n] = x1[m,n] + w[m,n] x (x0[m,n] - x1[m,n])    (3)
In this case a mask value of one means that samples from x0[m,n] are used, while a mask value of zero means that samples from x1[m,n] are used. Mask values in the range from zero to one imply linear interpolation between the two signals x0 and x1. (Mask values outside the unit interval imply linear extrapolation of the two input pictures.)
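The equivalence between the compositing form (3) and a direct per-pixel blend can be verified numerically; the following is an illustrative Python sketch with hypothetical random data, not part of the claimed invention:

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.random((8, 8))   # first input picture (hypothetical values)
x1 = rng.random((8, 8))   # second input picture (hypothetical values)
w = rng.random((8, 8))    # mask with values in [0, 1]

# Compositing expressed as a single masking operation, per (3).
y = x1 + w * (x0 - x1)

# Algebraically equivalent blend form: w*x0 + (1-w)*x1.
assert np.allclose(y, w * x0 + (1.0 - w) * x1)

# A mask of all ones selects x0; a mask of all zeros selects x1.
assert np.allclose(x1 + 1.0 * (x0 - x1), x0)
assert np.allclose(x1 + 0.0 * (x0 - x1), x1)
```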
If the mask, w[m,n], is a separable signal, the convolution in (2) can be implemented as two separate one-dimensional (1-D) convolutions. As will be appreciated, a 2-D signal x[m,n] is separable if there exist two 1-D signals r[m] and s[n] such that x[m,n] = r[m]s[n] (so that operations on the signal can be implemented as a cascade of horizontal and vertical 1-D operations). In the separable case, the convolution may provide a reasonable approach to masking, since it requires, for example, only 16 multiplications per sample for an 8x8 DCT. For non-separable masks, however, the convolution approach to masking is not as feasible since, for example, masking for an 8x8 block DCT requires 64 multiplications and considerable data shuffling.
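The separability property may be illustrated as follows in Python; the particular 1-D windows r[m] and s[n] are hypothetical choices for the sketch only:

```python
import numpy as np

# Two arbitrary (hypothetical) 1-D windows of length 8.
r = np.cos(np.pi * (2 * np.arange(8) + 1) / 16)
s = np.linspace(0.0, 1.0, 8)

# Separable 2-D mask: w[m,n] = r[m] * s[n]. Such a signal has rank one.
w = np.outer(r, s)
assert np.linalg.matrix_rank(w) == 1

x = np.arange(64, dtype=float).reshape(8, 8)

# Masking by a separable mask equals a vertical 1-D masking followed by a
# horizontal 1-D masking (a cascade of row and column operations).
y_2d = w * x
y_cascade = (r[:, None] * x) * s[None, :]
assert np.allclose(y_2d, y_cascade)
```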
Accordingly, both spatial domain processing of DCT images and the brute force DCT approach often require an undesirably high amount of computation. As video compression resulting in DCT images becomes more common, it becomes more desirable to do picture processing on compressed image data without completely decoding or decompressing the image data. Some techniques in this regard are discussed in further detail in B.C. Smith & L.A. Rowe, "Algorithms for Manipulating Compressed Images," IEEE Computer Graphics & Applications, pp. 34-42, September 1993; S-F Chang & D.G. Messerschmitt, "A New Approach to Decoding and Compositing Motion-Compensated DCT-Based Images," ICASSP-93, pp. V421-V424, 1993; and
N. Merhav & V. Bhaskaran, "A Transform Domain Approach to Spatial Domain Image Scaling," ICASSP-96, pp. 2405-2409, 1996.
Frequency Domain Processing of DCT Images
In the present invention, compressed pictures are processed in the DCT domain with a technique based on representing the masking function in terms of the DCT basis functions and computing the masking as a weighted sum of the results of masking by the DCT basis functions, as described in further detail below. Such DCT domain processing makes it possible to reduce both the computational complexity and the latency of the processing, by eliminating the need for transforming signals from the DCT domain into the spatial domain and back.
As will be appreciated, if the desired processing (i.e., masking and compositing) is done in the DCT domain, the three IDCT transforms 120, 121, 122, and the DCT transform 130, required in spatial domain processing of DCT images can be eliminated. Referring now to Fig. 2, there is shown a block diagram of a DCT domain image processing system 200, in accordance with a preferred embodiment of the present invention. As shown, system 200 comprises DCT domain processor 210, but does not comprise nor require the three IDCT transforms and one DCT transform used in spatial domain processing. Instead, DCT domain processor 210 operates in the DCT domain, and is thus able to provide processing efficiencies relative to spatial domain processing.
In one embodiment, system 200 operates with respect to the two-dimensional (2-D) type-II DCT of 8x8 blocks, such as is used by the image and video compression standards JPEG, MPEG-1,
MPEG-2, H.261 and H.263. As will be appreciated, however, in alternative embodiments the present invention may be utilized with other types of DCTs and other block sizes.
The 8x8 type-II DCT is given by

X[k,l] = (1/4) η[k] η[l] Σ(m=0..7) Σ(n=0..7) x[m,n] cos(π(2m+1)k/16) cos(π(2n+1)l/16)    (4)
where η is a frequency-dependent DCT normalization coefficient which depends on the value of the DCT domain index (η[0] = 1/√2 and η[j] = 1 for j > 0). It should be noted that even though the 2-D DCT can be used to represent non-separable signals, the transform itself is separable, and the basis functions of the 2-D DCT are separable. DCT basis functions are discussed in further detail below, with reference to (15).
The 2-D DCT of each block can be implemented using matrix multiplications
X = C x C^T,    (5)

where X and x are matrix representations of X[k,l] and x[m,n], respectively, and C is the DCT transformation matrix (a unitary matrix, i.e., C C^T = I). For the 8x8 DCT, the matrices X, x, and C are all 8x8.
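For illustration, the transformation matrix C and the matrix-multiplication form (5) may be sketched in Python; the orthonormal normalization shown (consistent with η[0] = 1/√2) is assumed for the sketch:

```python
import numpy as np

N = 8

# Orthonormal type-II DCT transformation matrix C[k, m], per (4)/(5).
C = np.array([[(np.sqrt(1.0/N) if k == 0 else np.sqrt(2.0/N))
               * np.cos(np.pi * (2*m + 1) * k / (2*N))
               for m in range(N)] for k in range(N)])

# C is unitary (real orthogonal): C @ C.T = I.
assert np.allclose(C @ C.T, np.eye(N))

# 2-D DCT of a block via matrix multiplications, per (5), and its inverse.
x = np.random.default_rng(1).random((N, N))
X = C @ x @ C.T
assert np.allclose(C.T @ X @ C, x)
```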
Vertical spatial masking by any window vk[m] and horizontal masking by any window vl[n] can be implemented as the matrix multiplication
y = Vk x Vl    (6)

where

Vk = diag(vk[m])    (7)
Vl = diag(vl[n])    (8)

Based on the fact that C is unitary, the DCT of the masked signal can be derived as

Y = C y C^T = C Vk x Vl C^T    (9)
  = C Vk C^T C x C^T C Vl C^T    (10)
  = Ṽk X Ṽl    (11)

where

Ṽj = C Vj C^T    (12)
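The identity (9)-(11) may be verified numerically as follows; this Python sketch assumes the orthonormal DCT normalization, and the basis indices chosen are arbitrary:

```python
import numpy as np

N = 8
# Orthonormal type-II DCT transformation matrix C (so C @ C.T = I).
C = np.array([[(np.sqrt(1.0/N) if k == 0 else np.sqrt(2.0/N))
               * np.cos(np.pi * (2*m + 1) * k / (2*N))
               for m in range(N)] for k in range(N)])

# 1-D DCT basis windows v_k[m] = cos(pi*(2m+1)k/16), per (15).
def basis_window(k):
    return np.cos(np.pi * (2 * np.arange(N) + 1) * k / (2 * N))

k, l = 2, 5                      # arbitrary (hypothetical) basis indices
Vk = np.diag(basis_window(k))    # vertical window, per (7)
Vl = np.diag(basis_window(l))    # horizontal window, per (8)
Vtk = C @ Vk @ C.T               # DCT-domain windowing matrix, per (12)
Vtl = C @ Vl @ C.T

x = np.random.default_rng(2).random((N, N))
X = C @ x @ C.T

# Windowing in the spatial domain and then transforming equals applying
# the transformed windowing matrices directly in the DCT domain, per (9)-(11).
assert np.allclose(C @ (Vk @ x @ Vl) @ C.T, Vtk @ X @ Vtl)
```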
It should be noted that non-separable masking cannot be expressed in a simple matrix multiplication form similar to (11). However, a non-separable mask can be transformed by the 2-D DCT, which does have separable basis functions. The IDCT of the DCT domain representation of the non-separable mask, W[k,l], is given by
w[m,n] = (1/4) Σ(k=0..7) Σ(l=0..7) W[k,l] η[k] cos(π(2m+1)k/16) η[l] cos(π(2n+1)l/16)    (13)
Substituting (13) into (1) gives
y[m,n] = Σ(k=0..7) Σ(l=0..7) {((1/4) W[k,l] η[k] η[l]) vk[m] x[m,n] vl[n]}    (14)

where

vk[m] = cos(π(2m+1)k/16)    (15)
This implies that masking x[m,n] with w[m,n] is equivalent to a weighted sum of the masking of x[m,n] with the basis functions of the IDCT. It should be noted that the windowing in
(14) is separable and can, therefore, be written in terms of the matrix multiplications in (6), where both vk[m] and vl[n] are given by (15) for k,l ∈ {0,1,...,7}. Therefore, the DCT transform of the masked signal becomes
Y = Σ(k=0..7) Σ(l=0..7) {((1/4) W[k,l] η[k] η[l]) Ṽk X Ṽl}    (16)

where the numerical values of the DCT domain windowing matrices, Ṽj, can be evaluated according to (12).
Thus, a non-separable mask can be expressed as a weighted sum of separable functions (13), and masking by separable functions can be accomplished using the matrix notation of (11). Therefore, non-separable masks can be implemented as a weighted sum of separable masking operations (16). The separable masking operations in (16), by the DCT basis functions, turn out to have a simple and efficient implementation.
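The weighted sum (16) may be sketched end-to-end in Python as follows; the orthonormal DCT normalization and the random (non-separable) mask are assumptions of the sketch only:

```python
import numpy as np

N = 8
C = np.array([[(np.sqrt(1.0/N) if k == 0 else np.sqrt(2.0/N))
               * np.cos(np.pi * (2*m + 1) * k / (2*N))
               for m in range(N)] for k in range(N)])
eta = np.array([1.0/np.sqrt(2)] + [1.0] * (N - 1))   # eta[0]=1/sqrt(2), else 1

# DCT-domain windowing matrices Vt_j = C V_j C.T for the basis windows (12)/(15).
Vt = [C @ np.diag(np.cos(np.pi * (2*np.arange(N) + 1) * j / (2*N))) @ C.T
      for j in range(N)]

def mask_in_dct_domain(X, W):
    """Masking performed entirely in the DCT domain as the weighted sum (16)."""
    Y = np.zeros((N, N))
    for k in range(N):
        for l in range(N):
            weight = 0.25 * eta[k] * eta[l] * W[k, l]
            if weight != 0.0:
                Y += weight * (Vt[k] @ X @ Vt[l])
    return Y

rng = np.random.default_rng(3)
x = rng.random((N, N))
w = rng.random((N, N))             # non-separable mask (hypothetical values)
X, W = C @ x @ C.T, C @ w @ C.T    # DCT-domain inputs

# Agrees with spatial masking y[m,n] = w[m,n]*x[m,n] transformed by the DCT.
assert np.allclose(mask_in_dct_domain(X, W), C @ (w * x) @ C.T)
```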
As will be appreciated, the functions defined in (15) are the DCT basis functions (for the 1-D type-II DCT of size N=8). The DCT basis functions form an orthogonal basis that can represent all discrete functions of length N. The factor η normalizes the basis functions, so that η[k] times the basis function in (15) (i.e., η[k]vk[m]) forms an orthonormal (normalized orthogonal) basis for all functions of length N. Since the basis functions for the 2-D DCT are formed as the product of two 1-D basis functions vk[m], vl[n], the 2-D DCT basis functions are separable.
The windowing matrices for the DCT basis functions, Ṽj, are sparse and have a very regular structure. For example, for j = 4 we have
Ṽ4 = (1/2) x

     0    0    0    0   √2    0    0    0
     0    0    0    1    0    1    0    0
     0    0    1    0    0    0    1    0
     0    1    0    0    0    0    0    1
    √2    0    0    0    0    0    0    0
     0    1    0    0    0    0    0   -1
     0    0    1    0    0    0   -1    0
     0    0    0    1    0   -1    0    0        (17)

and the other windowing matrices have the same kind of structure.
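The sparsity and regular structure of these matrices can be checked numerically; the following Python sketch computes Ṽ4 from (12) under the orthonormal-DCT assumption:

```python
import numpy as np

N = 8
C = np.array([[(np.sqrt(1.0/N) if k == 0 else np.sqrt(2.0/N))
               * np.cos(np.pi * (2*m + 1) * k / (2*N))
               for m in range(N)] for k in range(N)])

# Spatial window for the j = 4 DCT basis function, and its DCT-domain form (12).
V4 = np.diag(np.cos(np.pi * (2 * np.arange(N) + 1) * 4 / (2 * N)))
Vt4 = C @ V4 @ C.T

# Sparse, regular structure as in (17): only 14 of 64 entries are nonzero.
assert (np.abs(Vt4) > 1e-12).sum() == 14

# Generic nonzero entries are +-1/2; entries (0,4) and (4,0) are sqrt(2)/2.
assert np.isclose(Vt4[0, 4], np.sqrt(2) / 2)
assert np.isclose(Vt4[1, 3], 0.5) and np.isclose(Vt4[1, 5], 0.5)
assert np.isclose(Vt4[5, 7], -0.5) and np.isclose(Vt4[7, 5], -0.5)

# The matrix is symmetric, as are the other windowing matrices of this form.
assert np.allclose(Vt4, Vt4.T)
```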
By incorporating the factor of 1/2 from each windowing matrix into the weighting coefficients, each matrix multiplication in (16) can be implemented using only one addition per sample and two multiplications (by √2) per 64 samples (for the 8x8 DCT). If the DCT coefficients are obtained from decoding JPEG or MPEG streams, the multiplications by √2 can be incorporated into the quantization matrices, reducing the computational complexity to only one addition per sample for each matrix multiplication in (16). In addition, there is one multiplication and one addition per pixel for each term in the weighted sum in (16). Therefore, the computational complexity of implementing masking according to (16) is approximately one multiplication and three additions per pixel for each term that is evaluated. Additionally, when the weighting coefficient, W[k,l], is zero, the whole term can be dropped and no computation is needed for that term.
As will be appreciated, the DCT approach is used in compression systems, such as JPEG and MPEG, because for most signals the energy is concentrated in relatively few DCT coefficients. In the present invention, this energy compaction property is exploited to save computation, since for most masks many weighting coefficients are zero or close to zero. That is, the representation of the masking in terms of the weighted sum allows computational complexity to be reduced by skipping all processing for weighting coefficients W[k,l] equal to zero (or, in one embodiment, for all weighting coefficients W[k,l] less than a predetermined threshold), without introducing noticeable artifacts.
As will be appreciated, by adjusting the threshold for choosing which coefficients are dropped, the quality of the masking operation can be traded for computational complexity in a similar manner as quality is traded for bit rate in encoding. The ability to trade off quality of the masking against computational complexity gives great flexibility in trading cost for quality. Accordingly, the frequency domain implementation of picture masking and compositing of the present invention can be very efficient.
Thus, as will be appreciated by those skilled in the art, in the present invention, the masking function is implemented in terms of the DCT basis functions. As will be appreciated, any necessary scaling is first performed, and may be incorporated into the quantization matrix in an inverse quantization. Next, a weighted sum of the blocks masked in this fashion is then implemented. In one embodiment, the masked block at this point is re-normalized, in accordance with the scaling done previously. (As will be appreciated, the initial scaling and re-normalization scaling may be incorporated into the quantization matrix if the input signal is dequantized and the output signal is quantized.)
In one embodiment, all processing for weighting coefficients W[k,l] equal to zero is skipped (where, for an input picture x[m,n], w[m,n] is the window used to mask the input picture, and W[k,l] is the frequency representation of w[m,n]). In an alternative embodiment, because of the aforementioned energy compaction property of the DCT, all processing is also skipped for weighting coefficients W[k,l] close to zero. In general, all processing is skipped for weighting coefficients W[k,l] less than or equal to a predetermined threshold value (where a threshold value of zero yields the former case). In one embodiment of the present invention, this threshold is selected in accordance with a desired tradeoff between the quality of the masking operation and computational complexity, where a higher threshold provides lower quality but greater savings in computational complexity, and vice versa.
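The coefficient-dropping strategy may be sketched as follows in Python; the smooth example mask is a hypothetical choice made so that its energy compacts into few DCT coefficients:

```python
import numpy as np

N = 8
C = np.array([[(np.sqrt(1.0/N) if k == 0 else np.sqrt(2.0/N))
               * np.cos(np.pi * (2*m + 1) * k / (2*N))
               for m in range(N)] for k in range(N)])
eta = np.array([1.0/np.sqrt(2)] + [1.0] * (N - 1))
Vt = [C @ np.diag(np.cos(np.pi * (2*np.arange(N) + 1) * j / (2*N))) @ C.T
      for j in range(N)]

def mask_in_dct_domain(X, W, threshold=0.0):
    """Weighted-sum masking (16), skipping terms with |W[k,l]| <= threshold."""
    Y = np.zeros((N, N))
    terms = 0
    for k in range(N):
        for l in range(N):
            if abs(W[k, l]) <= threshold:
                continue                 # drop negligible weighting coefficients
            terms += 1
            Y += 0.25 * eta[k] * eta[l] * W[k, l] * (Vt[k] @ X @ Vt[l])
    return Y, terms

# A smooth mask (values within [0, 1]) concentrates its energy in
# very few DCT coefficients.
m = np.arange(N)
w = (0.5 + 0.25 * np.cos(np.pi * (2*m[:, None] + 1) / 16)
         + 0.25 * np.cos(np.pi * (2*m[None, :] + 1) / 16))
x = np.random.default_rng(4).random((N, N))
X, W = C @ x @ C.T, C @ w @ C.T

exact, n_all = mask_in_dct_domain(X, W, threshold=0.0)
approx, n_few = mask_in_dct_domain(X, W, threshold=1e-6)

assert np.allclose(exact, C @ (w * x) @ C.T)
assert n_few == 3                 # only 3 significant terms remain
assert np.allclose(approx, exact)
```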
Compositing of two images can be implemented by use of masking, according to (3).
For a given block of the image (e.g., an 8x8 block), for example, the following steps may be taken by a suitably programmed processor to implement the present invention, in one embodiment. First, examine every DCT coefficient of the mask, W[k,l]; if the coefficient is "relevant" (i.e., either nonzero or, depending on the embodiment, greater in magnitude than a given threshold), then perform masking by the corresponding basis function, multiply each coefficient of the masked signal by the weighting coefficient, and add the result to the weighted sum for the block. This implements the weighted sum in (16).
The masking by the DCT basis functions can be implemented in terms of matrix multiplications as shown in (11) (and (16)). However, a more efficient implementation can be achieved by taking into account the regular structure of the windowing matrices as the example in
(17) shows. Several of these more efficient implementations are discussed in the detailed description above.
For processing of original signals already in the DCT domain, the frequency domain processing of the present invention requires less computation than both spatial domain processing and brute force DCT domain processing based on symmetric convolution. Through empirical testing and modeling, the inventors have found that the computational complexity involved in using the frequency domain processing of the present invention is approximately one to four multiplications per sample for most typical masking operations. As will be appreciated, by using an algorithm similar to rate control algorithms, the complexity of spatial masking in the DCT domain can be limited to only three multiplications per sample without any noticeable degradation of the masking quality.
By contrast, a single 2-D DCT takes about three multiplications per sample, and when implementing masking of JPEG or MPEG compressed pictures in the spatial domain, IDCTs must first be used to transform the DCT data into the spatial domain, and the DCT operation must then be used to transform the processed picture back into the DCT, or frequency, domain. Thus, when implementing picture compositing, there are at least two IDCTs (one for each input picture) and one DCT (for the composite picture) needed, in addition to the spatial processing. Therefore, there are at least nine multiplications needed for implementing picture compositing in the spatial domain, which is approximately three times what is needed for described embodiments of the DCT domain implementation of the present invention. In sum, therefore, for picture compositing of pictures in the DCT domain, the present invention, in one embodiment, requires about three times fewer multiplications per pixel than spatial domain processing, and about twenty times fewer multiplications than processing based on brute force convolution.
Referring now to Fig. 3, there is depicted an exemplary image 300 processed using DCT domain picture compositing performed in the frequency domain by image processing system 200.
Image 300 contains a head-and-shoulder portion 312, which is overlaid over a flower garden background 310, and a transparent logo "SARNOFF" 315, which was inserted in the top right hand corner of image 300. The picture compositing performed by system 200 to arrive at image 300 was performed, in one actual experiment, using only 1.8 multiplications per pixel.
As will be appreciated, although the embodiments of the present invention described above are implemented with respect to the DCT frequency domain, the present invention is also potentially applicable to other frequency domains in which the masking function may be represented in terms of the frequency domain's basis functions and in which the masking can then be computed as a weighted sum of the results of masking by these basis functions. For example, the present invention may be applicable to other frequency domains such as the DFT and discrete sine transform (DST) domains.
As will be understood, the present invention can be embodied in the form of computer-implemented processes and apparatuses for practicing those processes. The present invention can also be embodied in the form of computer program code embodied in tangible media, such as floppy diskettes, CD-ROMs, hard drives, or any other computer-readable storage medium, wherein, when the computer program code is loaded into and executed by a computer, the computer becomes an apparatus for practicing the invention. The present invention can also be embodied in the form of computer program code, whether stored in a storage medium, loaded into and/or executed by a computer, or transmitted over some transmission medium, such as over electrical wiring or cabling, through fiber optics, or via electromagnetic radiation, wherein, when the computer program code is loaded into and executed by a computer, the computer becomes an apparatus for practicing the invention. When implemented on a general-purpose microprocessor, the computer program code segments configure the microprocessor to create specific logic circuits.
It will be understood that various changes in the details, materials, and arrangements of the parts which have been described and illustrated above in order to explain the nature of this invention may be made by those skilled in the art without departing from the principle and scope of the invention as recited in the following claims.