A video encoder and method of video encoding
The invention relates to a video encoder and method of video encoding and in particular to video encoding using motion estimation.
In recent years, the use of digital storage and distribution of video signals has become increasingly prevalent. In order to reduce the bandwidth required to transmit digital video signals, it is well known to use efficient digital video encoding comprising video data compression whereby the data rate of a digital video signal may be substantially reduced. In order to ensure interoperability, video encoding standards have played a key role in facilitating the adoption of digital video in many professional and consumer applications. The most influential standards are traditionally developed by either the International Telecommunication Union (ITU-T) or the MPEG (Motion Pictures Experts Group) committee of the ISO/IEC (the International Organization for Standardization/the International Electrotechnical Commission). The ITU-T standards, known as recommendations, are typically aimed at real-time communications (e.g. videoconferencing), while most MPEG standards are optimized for storage (e.g. for Digital Versatile Disc (DVD)) and broadcast (e.g. for the Digital Video Broadcasting (DVB) standards). Currently, one of the most widely used video compression techniques is known as the MPEG-2 standard. MPEG-2 is a block based compression scheme wherein a frame is divided into a plurality of blocks each comprising eight vertical and eight horizontal pixels. For compression of luminance data, each block is individually compressed using a Discrete Cosine Transform (DCT) followed by quantization which reduces a significant number of the transformed data values to zero. Frames based only on intra-frame compression are known as Intra Frames (I-frames). In addition to intra-frame compression, MPEG-2 uses inter-frame compression to further reduce the data rate. Inter-frame compression includes generation of predicted frames (P-frames) based on previous I-frames.
In addition, I and P frames are typically interposed by Bidirectional predicted frames (B-frames), wherein compression is achieved by only transmitting the differences between the B-frame and surrounding I- and P-frames. In
addition, MPEG-2 uses motion estimation wherein macro-blocks of one frame that appear at different positions in subsequent frames are communicated simply by use of a motion vector. Motion estimation is performed to determine the parameters for the process of motion compensation or, equivalently, inter prediction. As a result of these compression techniques, video signals of standard TV studio broadcast quality level can be transmitted at data rates of around 2-4 Mbps. Recently, a new ITU-T standard, known as H.26L, has emerged. H.26L is becoming broadly recognized for its superior coding efficiency in comparison to the existing standards such as MPEG-2. Although the gain of H.26L generally decreases in proportion to the picture size, the potential for its deployment in a broad range of applications is undoubted. This potential has been recognized through formation of the Joint Video Team (JVT) forum, which is responsible for finalizing H.26L as a new joint ITU-T/MPEG standard. The new standard is known as H.264 or MPEG-4 AVC (Advanced Video Coding). Furthermore, H.264-based solutions are being considered in other standardization bodies, such as the DVB and DVD Forums. The H.264/AVC standard employs similar principles of block-based motion estimation as MPEG-2. However, H.264/AVC allows a much increased choice of encoding parameters. For example, it allows a more elaborate partitioning and manipulation of 16x16 macro-blocks whereby e.g. a motion compensation process can be performed on divisions of a macro-block as small as 4x4 in size. Another, and even more efficient, extension is the possibility of using variable block sizes for prediction of a macro-block. Accordingly, a macro-block (still 16x16 pixels) may be partitioned into a number of smaller blocks and each of these sub-blocks can be predicted separately. Hence, different sub-blocks can have different motion vectors and can be retrieved from different reference pictures.
Also, the selection process for motion compensated prediction of a sample block may involve a number of stored, previously-decoded frames (or images), instead of only the adjacent frames (or images). Also, the resulting prediction error following motion compensation may be transformed and quantized based on a 4x4 block size, instead of the traditional 8x8 size. Generally, existing encoding standards such as MPEG-2 and H.264/AVC exploit temporal correlation by a block based motion estimation and compensation. Thus, the motion estimation and compensation algorithms are based on the encoding blocks of the video standard. Although this provides for an efficient encoding of video signals, it is desirable to provide an even more efficient video encoding wherein a higher quality to data rate ratio can be achieved.
An option that promises improved encoding performance is to provide an image segment based motion estimation and compensation. For example, image segments corresponding to players in a sports arena may be determined and used for motion estimation. However, motion estimation based on image segments tends to have a number of disadvantages including the following:
- Motion estimation is typically based on detection of corresponding edges within different frames. However, as the borders of image segments typically coincide with edges in the picture, the segmentation has a tendency to remove the presence of edges useful for segment based motion estimation. Also, video encoding may introduce new edges within segments which may increase the probability of generating false vectors. For example, quantisation may introduce edges within an image segment caused by minor texture image data fluctuations.
- Image segments typically comprise several encoding blocks and the larger image area of image segments results in similar motion estimation inaccuracies being substantially more perceptible.
- Research into segment based motion estimation is significantly less advanced than for block based motion estimation. Specifically, there are fewer algorithms known for segment based motion estimation and these tend to have worse performance than block based algorithms.
- Image segments tend to have unknown and irregular shapes. In contrast, image blocks tend to have known and regular shapes thereby facilitating hardware implementation as dedicated hardware may be developed for fixed block size processing. Hence, dedicated hardware implementation tends to be more complex, costly and less efficient for segment based motion estimation than for block based motion estimation.
Hence, an improved system for video encoding would be advantageous and in particular a system enabling or facilitating the use of segment based estimation, improving the quality to data rate ratio, facilitating implementation, increasing performance and/or reducing complexity would be advantageous.
Accordingly, the invention preferably seeks to mitigate, alleviate or eliminate one or more of the above mentioned disadvantages singly or in any combination. According to a first aspect of the invention, there is provided a video encoder for encoding a video signal, the video encoder comprising: means for generating block
motion estimation data associated with a first frame and a reference frame by performing block motion estimation based on image blocks; means for segmenting at least one of the first frame and the reference frame into a plurality of image segments; means for determining segment motion data for at least one segment of the plurality of image segments in response to the block motion estimation; means for encoding the at least one segment in response to the reference frame and the segment motion data; and means for generating video data comprising the segment motion data. The invention may improve and/or facilitate video encoding performance by combining block based motion estimation and segment based motion compensation. Hence, the inventors have realised that the advantages of segment based motion compensation may be achieved while retaining the advantages of block based motion estimation. The invention may provide improved video encoding performance. For example, the invention may allow existing block based motion estimation algorithms to be used thereby providing an improved design choice and the possibility of improved performance. As another example, the invention may allow motion estimation to exploit edges existing in the original image prior to segmentation thereby providing for facilitated and/or more accurate motion estimation. More accurate motion estimation may furthermore improve the perceived quality when performing motion compensation for larger picture segments or objects. Also, the invention may provide for facilitated implementation as block based motion estimation typically is more suitable for hardware implementation. Preferably, the image blocks have a predetermined size. This may facilitate processing and practical implementation of the block motion estimation. Preferably, the image blocks have sizes selected from a set of possible block sizes, the set of block sizes being independent of the content of the video signal. 
In contrast, the at least one segment typically has a size dependent on the content of the video signal. This may facilitate processing and practical implementation of the block motion estimation. Preferably, the at least one segment comprises a plurality of image blocks. The invention may provide for an advantageous way of performing motion compensation on objects which are larger than the blocks used for motion estimation. This may facilitate encoding and/or improve the video quality to data rate ratio. Preferably, the means for determining segment motion data is operable to determine the segment motion data by selecting motion data associated with a subset of the plurality of image blocks.
This provides for a low complexity yet high performance means of determining segment motion data from block based motion data. The subset may comprise a single block and in particular, the selected block based motion data may correspond to the motion vector of a single block. Majority voting may preferably be used. Preferably, the means for determining segment motion data is operable to average motion data associated with the plurality of image blocks. This provides for a low complexity yet high performance means of determining segment motion data from block based motion data. The averaging may for example be a weighted average wherein the weight of individual motion data for each block is determined in accordance with a suitable algorithm or criterion. The means for determining segment motion data may determine the segment motion data by a combination of selection and averaging of motion estimation data for a plurality of blocks associated with the segment to be encoded. Preferably, the video encoder is a block based video encoder and the image blocks are encoding blocks. The video encoder may specifically be a video encoder which comprises a spatial frequency transform and the encoding blocks may correspond to the transform blocks. Preferably, the encoding blocks are Discrete Cosine Transform (DCT) blocks. This may facilitate video encoding as the same blocks are used. Preferably, the segment motion data comprises data associated with a motion model for the at least one segment. The data may comprise information for defining or identifying one or more aspects of a suitable model and/or may comprise parameter data used for the motion model. Using a more complex motion model may provide improved motion compensation. In particular, larger segments corresponding to objects in an image may be more accurately described by a more complex (e.g. affine) motion model than by a simple translational motion description.
Preferably, the motion model is a model for a two dimensional image of a three dimensional object. This provides for improved quality and/or reduced data rate when performing motion compensation on three dimensional objects moving in the image. According to a second aspect of the invention, there is provided a method of video encoding a video signal, the method comprising the steps of: generating block motion estimation data associated with a first frame and a reference frame by performing block motion estimation based on image blocks; segmenting at least one of the first frame and the
reference frame into a plurality of image segments; determining segment motion data for at least one segment of the plurality of image segments in response to the block motion estimation; encoding the at least one segment in response to the reference frame and the segment motion data; and generating video data comprising the segment motion data. These and other aspects, features and advantages of the invention will be apparent from and elucidated with reference to the embodiment(s) described hereinafter.
An embodiment of the invention will be described, by way of example only, with reference to the drawings, in which Fig. 1 is an illustration of a video encoder in accordance with an embodiment of the invention.
The following description focuses on a specific embodiment of the invention but it will be appreciated that the invention is not limited to this application. In accordance with the embodiment, a video encoder is described wherein block based motion estimation is combined with segment based motion compensation. In accordance with the embodiment, motion vectors are generated for a plurality of image blocks wherein each of the blocks has a size selected from a set of predetermined sizes. In addition, the frame to be encoded is analysed to generate a number of image segments. For a given image segment, the blocks comprised in the image segment are identified and the motion vectors of these blocks are processed to generate a single motion vector for the image segment. The image segment is then encoded using motion compensation and the resulting encoding data and segment motion vector data is combined in an output data stream. Fig. 1 is an illustration of a video encoder in accordance with an embodiment of the invention. The operation of the video encoder will be described for the specific situation where a first frame is encoded using motion estimation and compensation from a single reference frame but it will be appreciated that in other embodiments motion estimation for one frame may be based on any suitable frame or frames including for example future frame(s) and/or frame(s) having different temporal offsets from the first frame. The video encoder 100 comprises an input frame memory 101 which stores a frame to be encoded, henceforth referred to as the first frame. The video encoder further
comprises an encoding processor 103 which generates encoded data for the frames of the video signal. The encoding processor 103 is coupled to an output processor 105 which generates an output data stream from the video encoder 100. The output processor 105 combines encoding data from the different frames, adds motion vectors, auxiliary data, control information etc. as required for the specific video encoding protocol. The encoding processor 103 is coupled to a decoding processor 107 which performs a local decoding of an encoded frame received from the encoding processor 103. The decoding processor 107 operates similarly to a video decoder receiving the video data stream and accordingly generates a local frame which corresponds to the frame which will be generated at a receiving video decoder. In the embodiment, motion estimation is based on the locally decoded frame rather than on the original frame in order to more accurately reflect the processing and data of a receiving video decoder. In the specific example described, the decoding processor 107 decodes the frame immediately prior to the frame to be encoded and uses this to perform motion estimation and compensation. Thus, in the example, the input frame memory 101 will comprise the data corresponding to the original first frame and the decoding processor 107 will generate a reference frame by decoding the previously encoded frame. The decoding processor 107 is coupled to a motion estimation processor 109 which is fed the reference frame. The motion estimation processor 109 is furthermore coupled to the input frame memory 101 and receives the original first frame therefrom. The motion estimation processor 109 performs a block based motion estimation based on the reference frame and the first frame. The blocks used for the motion estimation are preferably the same blocks which are used by the encoding processor 103 in generating the encoded data.
For example, the image blocks may be encoding blocks which are processed as blocks by the encoder. Specifically, the encoder may comprise a DCT transform operating with a given block size and the motion estimation blocks may be the same blocks. The image segments are generated by the segmentation processor 111 and therefore may have an irregular shape and size dependent on the content of the image.
However, the image blocks have a size which is selected from a discrete set of possible sizes. For example, in one embodiment, all image blocks are 8x8 pixel blocks. In another embodiment, image blocks may be selected to be either 4x4 pixel, 8x8 pixel or 16x16 pixel blocks. Furthermore, the blocks are typically defined in a fixed grid which is independent of
the content of the image. In other words, whereas image segments have locations that depend on the content of the video signal, the image blocks may only be located at discrete locations. Typically, the entire image is divided into consecutive regular shaped image blocks which may then be processed by the motion estimation processor 109. In a simple embodiment, the block based motion estimation is performed by dividing the first frame into a plurality of relatively small square picture blocks and searching the reference frame for matching blocks. For example, the original first frame may be divided into 8x8 pixel blocks. Each of the 8x8 picture blocks is scanned across the reference frame and for each scan position a sum square value of the pixel value differences between the first frame and the reference frame is generated. If the sum square value is sufficiently low, a match is deemed to have occurred and a motion vector for the block may be generated as the relative difference in position between the two blocks. As a more complex example, motion estimation may be performed by a 3-Dimensional Recursive Search Block Matching unit. This motion estimation unit is designed to estimate motion vectors on the basis of a sequence of input images. The estimated motion vectors can e.g. be used to compute a predicted output image. A motion vector is related to the translation of a group of pixels of a first image of the sequence to a further group of pixels of a second image of the sequence. Typically the groups of pixels are blocks of pixels of e.g. 8*8 pixels. For each of these blocks a number of candidate vectors are tested. These candidate vectors are vectors from neighboring blocks. Some of those vectors get a random offset added to them. This random offset allows the motion estimation unit to track objects with deviating motion components. Furthermore, a vector from a neighboring block for the previous field is used.
This latter vector results in the recursive approach and therefore in relatively consistent motion vectors. From these candidate vectors the one with the closest match is chosen. A refinement step can be applied to groups of pixels that have no proper match. If so, the group of pixels can be subdivided into say 4 blocks of 4*4 pixels, and, if that still does not yield a proper match, into even smaller blocks. This will typically be the case at discontinuities of the motion vectors. It will be appreciated that any suitable algorithm for block based motion estimation may be used without detracting from the invention. The decoding processor 107 is furthermore coupled to a segmentation processor 111 which receives the reference frame. The segmentation processor 111 is operable to segment the reference frame into a plurality of image segments.
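Purely by way of illustration, the simple sum square block matching described above may be sketched as follows in Python. Frames are assumed to be grey-scale images stored as lists of pixel rows, and the function name and signature are illustrative only, not part of the described encoder:

```python
def block_match(first, ref, bx, by, size, search):
    """Exhaustive block matching: find the offset (dx, dy) minimising the
    sum of squared pixel differences between the block at (bx, by) in the
    first frame and a candidate block in the reference frame."""
    h, w = len(ref), len(ref[0])
    best, best_ssd = None, float("inf")
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            x, y = bx + dx, by + dy
            if x < 0 or y < 0 or x + size > w or y + size > h:
                continue  # candidate block falls outside the reference frame
            ssd = sum(
                (first[by + j][bx + i] - ref[y + j][x + i]) ** 2
                for j in range(size)
                for i in range(size)
            )
            if ssd < best_ssd:
                best_ssd, best = ssd, (dx, dy)
    return best, best_ssd
```

If the lowest sum square value is sufficiently low, the returned offset serves as the motion vector for the block, in line with the matching criterion described above.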
The encoder can also be configured in such a way that the segmentation processor 111 is operable to segment the first input image from the input frame memory 101. In this configuration the segmentation map should be transmitted to the decoder. The aim of image segmentation is to group pixels together into image segments which have similar movement characteristics, for example because they belong to the same object. A basic assumption is that object edges cause a sharp change of brightness or colour in the image. Pixels with similar brightness and/or colour are therefore grouped together resulting in brightness/colour edges between regions. In the preferred embodiment, picture segmentation thus comprises the process of a spatial grouping of pixels based on a common property. There exist several approaches to picture and video segmentation, and the effectiveness of each will generally depend on the application. It will be appreciated that any known method or algorithm for segmentation of a picture may be used without detracting from the invention. In the preferred embodiment, the segmentation includes detecting disjoint regions of the image in response to a common characteristic and subsequently tracking this object from one image or picture to the next. In one embodiment, the segmentation comprises grouping picture elements having similar brightness levels in the same image segment. Contiguous groups of picture elements having similar brightness levels tend to belong to the same underlying object. Similarly, contiguous groups of picture elements having similar colour levels also tend to belong to the same underlying object and the segmentation may alternatively or additionally comprise grouping picture elements having similar colours in the same segment. The segmentation processor 111 is coupled to a segment motion processor 113 which is fed the segmentation information derived by the segmentation processor 111.
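The grouping of contiguous picture elements with similar brightness levels described above may, purely as an illustrative sketch, be implemented as a flood fill over a grey-scale frame. The threshold value and the function name are assumptions for the purpose of the example, not part of the described encoder:

```python
def segment_by_brightness(frame, threshold):
    """Group 4-connected pixels whose brightness differs by at most
    `threshold` into the same image segment.  Returns a label map of
    the same shape as the frame; labels start at 0."""
    h, w = len(frame), len(frame[0])
    labels = [[-1] * w for _ in range(h)]
    next_label = 0
    for sy in range(h):
        for sx in range(w):
            if labels[sy][sx] != -1:
                continue
            # flood fill a new segment starting from this seed pixel
            stack = [(sx, sy)]
            labels[sy][sx] = next_label
            while stack:
                x, y = stack.pop()
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if (0 <= nx < w and 0 <= ny < h
                            and labels[ny][nx] == -1
                            and abs(frame[ny][nx] - frame[y][x]) <= threshold):
                        labels[ny][nx] = next_label
                        stack.append((nx, ny))
            next_label += 1
    return labels
```

The resulting label map associates every pixel with an image segment, so that segment boundaries fall along the brightness edges assumed to mark object borders.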
In addition, the segment motion processor 113 is coupled to the motion estimation processor 109 and is fed the block motion estimation data from this. The segment motion processor 113 determines segment motion data for at least one but preferably all of the determined image segments in response to the block motion estimation data from the motion estimation processor 109. In a simple embodiment, the segment motion processor 113 identifies all blocks which are fully comprised within a segment and retrieves the motion vector for each of these. A motion vector for the entire segment is then determined by performing a selection of a suitable vector. In the preferred embodiment, a majority selection is performed by
selecting a motion vector corresponding to the most frequent motion vector value of the image blocks. In other embodiments averaging of the motion vectors of the blocks comprised in the segment may be used. In some cases, it may be advantageous to perform a weighted averaging, for example by weighting motion vectors of blocks towards the inner regions of the image segment higher than motion vectors of blocks nearer the edges of the image segment. The segment motion processor 113 preferably repeats the operation for all image segments thereby generating one segment motion vector for each detected segment. It will be appreciated that other algorithms or criteria for associating image blocks with image segments may be used without detracting from the invention including for example selecting all blocks having more than a given number of pixels in common with the image segment. It will also be appreciated that any suitable algorithm or criterion for determining segment motion data from the block motion data may be used. For example, the segment motion processor 113 may simply select the motion vector of a segment as the motion vector of a single block, such as for example the block having the closest match or being the most central in the image segment. The video encoder 100 further comprises a motion compensation processor 115 which is coupled to the segment motion processor 113 and the segmentation processor 111. The motion compensation processor 115 receives the segment motion data, the segment information and the image data for the determined image segments. Specifically, the motion compensation processor 115 may receive the entire reference frame. The motion compensation processor 115 generates a motion compensation frame for the first frame in accordance with the segment information and the segment motion data.
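The majority selection and the weighted averaging of block motion vectors described above may be sketched as follows, purely by way of illustration. Motion vectors are assumed to be (dx, dy) tuples for the blocks fully comprised in a segment; the function names are illustrative only:

```python
from collections import Counter

def majority_vector(block_vectors):
    """Majority selection: return the most frequent motion vector
    among the blocks comprised in the segment."""
    return Counter(block_vectors).most_common(1)[0][0]

def average_vector(block_vectors, weights=None):
    """(Weighted) average of the block motion vectors; inner blocks
    may, for example, be given higher weights than edge blocks."""
    if weights is None:
        weights = [1.0] * len(block_vectors)
    total = sum(weights)
    return (
        sum(w * v[0] for w, v in zip(weights, block_vectors)) / total,
        sum(w * v[1] for w, v in zip(weights, block_vectors)) / total,
    )
```

Either function yields a single segment motion vector from the block motion estimation data, and the two may also be combined, e.g. by averaging only the vectors belonging to the majority cluster.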
For example, the motion compensation processor 115 may generate a motion compensation frame by taking the detected segments and displacing them in accordance with the segment motion data. Thus, if a given detected segment is a 40 by 60 rectangular area at a spatial location of (x,y) = (300, 200) in the reference frame and has an associated motion vector of (Δx,Δy) = (30, 120), the reference frame image data of the 40 by 60 segment is inserted in the motion compensation frame at a location of (x,y) = (330, 320). Thus, the motion compensation frame comprises one or more image segments of the reference frame offset in accordance with the corresponding motion vectors.
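The displacement of detected segments described above may be sketched as follows, purely by way of illustration. The frame, label map and motion dictionary representations are assumptions for the example; pixels not covered by any displaced segment are simply left at zero here, whereas a practical encoder would handle such holes explicitly:

```python
def motion_compensate(ref, labels, motion):
    """Build a motion compensation frame by copying each labelled
    segment of the reference frame to its position offset by that
    segment's motion vector (dx, dy)."""
    h, w = len(ref), len(ref[0])
    out = [[0] * w for _ in range(h)]
    for seg in sorted(motion):  # later segments overwrite earlier ones
        dx, dy = motion[seg]
        for y in range(h):
            for x in range(w):
                if labels[y][x] != seg:
                    continue
                nx, ny = x + dx, y + dy
                if 0 <= nx < w and 0 <= ny < h:
                    out[ny][nx] = ref[y][x]
    return out
```

This mirrors the example above: a segment at (x,y) = (300, 200) with motion vector (30, 120) has its reference frame image data inserted at (330, 320) in the motion compensation frame.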
The video encoder further comprises a subtracting element 117 which is coupled to the input frame memory 101 and the motion compensation processor 115. The subtracting element 117 receives the original first frame and the motion compensation frame and generates a relative frame to be encoded. Specifically, the subtracting element 117 may subtract the motion compensation frame from the original first frame on a pixel by pixel basis. As the motion estimation is aimed at finding matching image segments, the relative frame will have substantially reduced image data values within the segments. The subtracting element 117 is coupled to the encoding processor 103 which is fed the relative frame for encoding. Any suitable method of encoding may be used including for example encoding based on DCT, quantisation and coding as is known from MPEG-2 encoding. As the image data values are substantially reduced within the motion compensated segments, a significant reduction in the encoded data size is achieved. The encoded data is passed to the output processor 105 and the decoding processor 107. The video encoder then reads in the next frame to the input frame memory 101 and proceeds to encode this frame using the just encoded frame as a reference frame. It will be appreciated that any suitable algorithm or criterion for selecting reference frames and/or frames for encoding using motion estimation may be applied. As such, segment based motion compensation may be applied to only a subset of the frames while other frames are possibly encoded without motion compensation (e.g. intra frames) or using block based motion compensation. In the above described example, the segmentation processor 111 is fed the reference picture and performs the image segmentation on this frame. However, it will be appreciated that in other embodiments, image segmentation may alternatively or additionally be performed on the first frame. 
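The pixel by pixel subtraction performed by the subtracting element 117 may, as a minimal sketch, be expressed as follows (grey-scale frames as lists of rows; the function name is illustrative only):

```python
def residual_frame(first, compensated):
    """Pixel-by-pixel difference between the original first frame and
    the motion compensation frame; within well-matched segments the
    values are close to zero and therefore compress efficiently."""
    return [
        [f - c for f, c in zip(frow, crow)]
        for frow, crow in zip(first, compensated)
    ]
```

The resulting relative frame is what is passed to the encoding processor 103 for e.g. DCT, quantisation and coding.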
Thus, in accordance with the embodiment, motion compensation based on image segments is enabled while allowing motion estimation to be based on image blocks. The advantages of segment based motion compensation are combined with the advantages of block based motion estimation resulting in an improved encoding, reduced data rate, increased video quality, facilitated encoding and/or facilitated hardware implementation. In the described embodiment, the reference frame is generated by local decoding of encoded output frames. However, it will be appreciated that other means of generating a reference frame may be used and that the decoding processor is not essential to the invention. For example, a reference frame may be generated from the originally received frames and may specifically be an original frame stored in a suitable memory.
Similarly, it will be appreciated that the output processor 105 is not essential to the invention and that the encoding processor 103 may directly generate output data in any suitable form including but not limited to a single data stream. Also, the input frame memory 101 may be omitted in some embodiments and any necessary buffering or image data storage may be implemented as part of the functionality of the other functional modules. It will be appreciated that the motion compensation processor 115, subtracting element 117 and encoding processor 103 of the current embodiment illustrate a specific example of encoding at least one segment of the first frame in response to the reference frame and the segment motion data but that any suitable algorithm or functionality for achieving this may be used. For example, generation of relative image data for the segment may be combined with the encoding process. Furthermore, it will be appreciated that the described embodiment uses a simple motion compensation which performs a simple translational shift of the location of image segments between the reference frame and the first frame. As such, the motion data may be represented by a simple motion vector. However, in other embodiments more complex motion compensation and motion data may be used. For example, in some embodiments, the movement of one or more of the image segments may be described by a complex motion model and the segment motion data may comprise data associated with a motion model for the at least one segment. In a specific exemplary embodiment, the video encoder may associate an image segment with a three dimensional object. A motion model for the three dimensional model may be determined and the representation of the object in the two dimensional image is derived from this model. Accordingly, a more accurate description of how an image segment changes between frames can be achieved thereby improving the coding efficiency.
Due to the typically large size of image segments (and their typical correlation to three dimensional objects in the picture), this provides a particular advantage for segment based motion compensation. The motion estimation data may comprise information describing the model itself and/or may include parameter data to be applied to an already established or predefined model. The invention can be implemented in any suitable form including hardware, software, firmware or any combination of these. However, preferably, the invention is implemented as computer software running on one or more data processors and/or digital signal processors. The elements and components of an embodiment of the invention may be physically, functionally and logically implemented in any suitable way. Indeed the
functionality may be implemented in a single unit, in a plurality of units or as part of other functional units. As such, the invention may be implemented in a single unit or may be physically and functionally distributed between different units and processors. Although the present invention has been described in connection with the preferred embodiment, it is not intended to be limited to the specific form set forth herein. Rather, the scope of the present invention is limited only by the accompanying claims. In the claims, the term comprising does not exclude the presence of other elements or steps. Furthermore, although individually listed, a plurality of means, elements or method steps may be implemented by e.g. a single unit or processor. Additionally, although individual features may be included in different claims, these may possibly be advantageously combined, and the inclusion in different claims does not imply that a combination of features is not feasible and/or advantageous. In addition, singular references do not exclude a plurality. Thus references to "a", "an", "first", "second" etc do not preclude a plurality. Reference signs in the claims are provided merely as a clarifying example and shall not be construed as limiting the scope of the claims in any way.