Bit Rate Control for Video Compression
Technical Field
The present invention relates to a bit rate control for the compression of video data. It has particular, but not exclusive, application to the provision of video over a packet switched network such as the Internet.
Background Art
Bit rate control plays an important role in the provision of multimedia over communications networks, and has been widely studied by many researchers for various standards and applications, such as storage media and real-time transmission with MPEG-1 and MPEG-2, videoconferencing with H.261 and H.263, and video object coding with MPEG-4.
For different coding standards and applications, different coding parameters are emphasised and different mechanisms are applied. For example, in MPEG-2, the most influential coding parameter with regard to picture quality is the quantization parameter (QP) used for texture coding. This parameter can be selected for an entire frame of the video sequence or can change from macroblock to macroblock. In most implementations, it is selected on the basis of buffer fullness, so that the buffer occupancy is maintained at a given level. The H.263 coding scheme allows for variable frameskip, and due to the low bit-rate conditions which may be imposed upon the encoder, it is up to the rate control algorithm to make appropriate decisions on both spatial and temporal coding parameters. If the buffer is in danger of overflow, complete frames may be disregarded at the encoder to allow bits used for the previous frame to be transmitted out of the buffer, thereby reducing the buffer level and delay. In conjunction with this frame-skipping mechanism, the bit rate control algorithm must determine a suitable quantization parameter (QP) to obtain the desired bit rate.
Similarly to H.263, MPEG-4 bit rate control also considers spatial and temporal coding parameters. However, the encoder must also consider the significant amount of bits which are used to code shape information such that arbitrarily shaped objects can be coded. Also, although each video object may be encoded at a different frame rate, it is preferable that all of the objects are encoded at the same frame rate in order to yield better video quality. Further, additional coding parameters are introduced by MPEG-4 to control the amount of bits used to specify the shape of an object. It is the responsibility of the rate control scheme to incorporate these new parameter decisions along with other parameter decisions to ensure that the video objects are effectively coded.
In real-time video communications, the encoded bits are placed into an encoder buffer before they are transmitted through a network to a decoder. If the actual bit rate of the encoder is greater than the available channel bandwidth, the additional bits accumulate in the encoder buffer and increase buffer delay, which is the time needed to send the buffer bits remaining from the previously encoded frames. When the number of bits in the buffer is too high, the encoder usually skips some frames to reduce the buffer delay and avoid buffer overflow. This frame-skipping, however, produces undesirable motion discontinuity in the encoded video sequence. Conversely, if the buffer level is too low, there may be periods of time in which no bits are transmitted through the channel, and hence some channel bandwidth is wasted.
To overcome these two problems, a joint buffer control is usually used to maintain a buffer occupancy of about 50% of the buffer size after coding each frame. In order to do this, heuristic methods are usually employed, in which the target bit rate is increased if the current buffer level is less than half of the buffer size, and the target bit rate is decreased if the current buffer level is more than half of the current buffer size. Such schemes are disclosed in "Scalable Rate Control for MPEG-4 Video", H.J. Lee, T.H. Chiang and Y.Q. Zhang, IEEE Trans. Circuit Syst. Video Technol., 10:878-894, 2000, and in "MPEG-4 rate control for multiple video objects", A. Vetro, H. Sun and Y. Wang, IEEE Trans. Circuit Syst. Video Technol., 9:186-199, 1999. These schemes either encode video at a predefined fixed rate or at a predefined small set of fixed rates.
The existing schemes have problems when used in, for example, Internet applications and the streaming of video over the Internet. Due to the connectionless nature of the current Internet protocols and the routing mechanisms involved, the instantaneous bandwidth available to a particular user can vary widely in time and cannot in practice be known in advance. The existing bit rate control schemes cannot adapt themselves quickly enough to the variations of channel bandwidth, and are not effective enough to achieve the objectives of video over the Internet.
An aim of the present invention is to provide a bit rate control which provides for better video quality, especially but not exclusively in Internet applications.
Summary of the Invention
Viewed from a first aspect, the present invention provides a bit rate control system for the encoding of video data in which the encoded bits are placed in a buffer prior to transmission, and in which a target encoding bit rate is determined based on the fullness of the buffer, characterized in that the buffer is modelled by a fluid-flow traffic model preferably of the form:
$$B_c(n+1) = \max\{0,\; B_c(n) + T(n) - u(n)\}$$
where Bc(n) denotes the buffer level at time n;
T(n) is the actual encoding bit rate; and u(n) is the channel output rate.
The system of the present invention is able to keep the buffer occupancy closer to its target, which is preferably set at a predefined percentage (preferably about 50%) of a safety margin used to determine whether a frame of the video sequence to be encoded should be skipped. It is also able to adapt itself faster to variations of the channel bandwidth, and so will skip fewer frames at a low bandwidth. This therefore provides a higher overall video quality, and is attractive for video over the Internet.
Preferably, the target encoding bit rate is given by the equation:
where A is the channel output rate; y is a buffer safety margin; Bs is the buffer size; Bc(n) is the current buffer level; and 0 < γ < 1 is an adjustable parameter.
"A" may be equal to the number of bits available for encoding all of the inter-frames of a current group of frames being encoded divided by the number of inter-frames to be encoded in the current group of frames. Alternatively, when for example providing video over the Internet, "A" may be the actual bandwidth estimated by using the packet loss information. This allows the variation of the channel bandwidth to be directly incorporated into the buffer control, and allows the system to adapt itself in time.
Meanwhile, the target bit rate is preferably modified based on the remaining bits available for encoding and on the remaining frames to be encoded. It may thus be:
where 0 < β < 1 is an adjustable parameter;
Tr is the number of remaining bits available for encoding; Nr is the number of frames remaining to be encoded; and Hhdr(n-1) is the amount of overhead bits used for the previous frame.
After the target bit rate is determined, a rate-distortion model preferably of the following form is further applied to determine the corresponding quantization parameter:
$$R = c_2 \frac{\sigma}{Q^2} + c_1 \frac{\sigma}{Q} + H_{hdr}$$
where R is the total number of bits used to encode a frame; Q is the quantization parameter; c1 and c2 are first and second order coefficients; σ is an index of video coding complexity; and
Hhdr is the amount of overhead bits used.
Further preferably, the coefficients of the Rate-Distortion model are updated based upon data from a plurality of previous frames. The number of previous frames used is preferably determined by a sliding window mechanism, wherein the value of the current window size W(n) is given by:
$$W(n) = \min\{W(n-1) + 1,\; \varsigma(n) \cdot W_{max}\}$$
where Wmax is a preset constant;
σ(n) is the maximum absolute difference of the frame at time n.
Such a sliding window mechanism smoothes the impact of scene changes, and changes the window size gradually.
After the current frame is encoded, the total number of actual bits used to encode the current frame is added to the current buffer level. If the buffer is in danger of overflow, a switched frame skipping mechanism is preferably used to compute the number of skipped frames.
In one frame skip control, after the current frame is encoded, the next frame to be encoded will be skipped, if:
$$B_c(n+1) + T(n) - A \ge B_s \cdot y + TB(n)$$
where Bc(n+1) is the current buffer level;
T(n) is the actual number of bits used to encode the current frame; A is the channel output rate; Bs is the buffer size; y is a pre-determined buffer safety margin;
and T(S(j,n)) (1 ≤ j ≤ W(n)) denotes the total number of actual bits generated in the encoding of the previous W(n) frames.
In an alternative frame skip control, a frame skipping parameter Npost is set to skip the next Npost frames so that the following buffer condition is satisfied:
$$B_c(n+1) < y B_s$$
where
$$B_c(n+1) = \max\{0,\; B_c(n) + T(n) - A(N_{post} + 1)\}$$
Bc(n) is the buffer level at time n;
T(n) is the actual number of bits used to encode the current frame;
A is the channel output rate;
Bs is the buffer size; and y is a pre-determined buffer safety margin.
The first-mentioned skipping control is preferably provided as a predictive frame skipping control, the second-mentioned skipping control is preferably provided as a post-frame skipping control, and the skipping controls are preferably switched between one another based on the following switching law:
a) The predictive frame skipping control is switched to the post-skipping control if a frame is skipped; and
b) The post-skipping control is switched to the predictive frame skipping control if the current frame is not skipped.
The present invention also extends to a method for the encoding of a video sequence in accordance with the above system features, and to computer software for implementing the above system and method features.
It further extends to the use of the above features independently of one another, with for example the Rate-Distortion model defined above being in itself a new and advantageous model for use in bit rate control.
Brief Description of the Drawings
The present invention will hereinafter be described in greater detail by reference to the attached drawings which show an example form of the invention. It is to be understood that the particularity of the drawings does not supersede the generality of the preceding description of the invention.
Figure 1 is a diagram of the structure of a typical network over which video streaming may be provided; and
Figure 2 is a functional block diagram of a video encoder scheme according to an embodiment of the present invention.
Detailed Description of the Invention
Fig. 1 shows a typical Internet structure over which a video sequence may need to be transmitted from a source 1 to one or more receivers 2. Due to the amount of data in a video sequence, the data must be compressed, otherwise the required transmission bit-rate would be unachievably high.
Thus, an encoder 3 is provided at the source 1 in order to compress the video data, and decoders 4 are provided at the receivers 2 in order to decode the data and reconstruct the video sequence. In between the encoder 3 and decoders 4, the compressed data is routed through various servers 5 and over what may be many different types of transmission channel 6.
Various different encoding systems have been provided for the compression of video data, and, for example, MPEG video compression is often employed. The current MPEG standards are MPEG-1 and MPEG-2, which are similar in basic concept, and MPEG-4 which is able to provide a low-bandwidth multimedia format that can contain a mix of media (including recorded video images and sounds and their computer-generated counterparts), and uses the concept of "Video Objects" to transmit independent images of arbitrary shape.
In MPEG compression, a video sequence is broken into a number of Groups of Pictures (GOP), each of which comprises a number of picture frames. Each frame is broken into a series of slices, and each slice consists of a set of macroblocks comprising arrays of luminance pixels and associated chrominance pixels. The macroblocks are divided into 8x8 blocks for encoding. Each block undergoes a Discrete Cosine Transform (DCT) to provide an array of DCT coefficients that are then quantized to force various of the coefficients (generally higher frequency coefficients) to zero so as to reduce the amount of data to be transmitted. Quantization is carried out by multiplying the DCT coefficient array by a quantization matrix, each value in the matrix being scaled by a quantization parameter. The matrix and quantization parameter can be altered on a frame-by-frame and/or block-by-block basis to alter the amount of compression. The quantized coefficients then undergo further encoding to compress the transmission data still further.
The frames in a GOP comprise an Intra-frame (I frame) that is spatially compressed (in accordance with the above method), and Inter-frames (P and/or B frames) that are also temporally compressed in a motion-compensated prediction manner. Thus, each P frame in a sequence is predicted from the frame immediately preceding it, and each B frame is predicted from preceding and succeeding frames.
MPEG-4 also includes a Video Object layer between the frame layer and macroblock layer for specifying different independent objects within a scene.
In order to optimise video quality over a bit-rate range, e.g. in video-streaming to a number of receivers having different bandwidth capabilities, MPEG-4 also provides a Fine Granularity Scalability (FGS) scheme in which the coding of the video data is provided by a base layer and an enhancement layer, the base layer being designed to meet the lower bound of the bit rate range and the enhancement layer meeting the upper bound of the bit-rate range. The base layer is coded as discussed above, and the enhancement layer takes the original and reconstructed DCT coefficients of the base layer, and subtracts the reconstructed coefficients from the originals to provide a residue that is then encoded and transmitted with the base layer. The receivers of the data decode the base layer to provide a video signal based on the lowest bit rate range, and can improve the quality by decoding various amounts of the enhancement layer.
The present invention relates to a bit rate control scheme for the compression of video data, and may for example be used in encoding the base layer of an FGS scheme. It may especially be used in the FGS disclosed in the co-pending International PCT patent application filed in Singapore on 25 May 2001 and entitled "A Fine Granularity Scalability Scheme".
The present bit-rate control scheme consists of three layers, namely the GOP layer, the frame layer and the video object layer. The whole scheme is shown in Fig. 2.
The GOP layer rate control 1 is used to allocate bits to each GOP of the video sequence, each GOP being composed of one I frame and a number of P and B frames.
The total number of bits available for the video sequence will be:
TB = (duration of the video sequence) × R
where R is the bit rate for the sequence.
Assuming that the total number of I frames is N and that the numbers of P and B frames in the ith GOP are NP,i and NB,i, and that the frames have weightings of WI, WP and WB, then the number of bits allocated to the ith GOP is:
For the sake of the present embodiment and for simplicity, it is assumed that each GOP has the same structure, and so the GOP Layer Rate Control will allocate each GOP the following number of bits:
$$TB_i = \frac{TB}{N}$$
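By way of illustration only, the equal-structure GOP allocation above can be sketched as follows. The function and parameter names are hypothetical and do not appear in the specification; the total bit budget is taken as duration multiplied by bit rate, as stated above.

```python
def gop_bit_budget(duration_s: float, bit_rate_bps: float, num_gops: int) -> float:
    """Bits allocated to each GOP when every GOP has the same structure.

    duration_s and bit_rate_bps give the total budget TB = duration * R;
    num_gops is N, the total number of I frames (one per GOP).
    """
    if num_gops <= 0:
        raise ValueError("num_gops must be positive")
    total_bits = duration_s * bit_rate_bps  # TB
    return total_bits / num_gops            # TB_i = TB / N
```

For instance, under these assumptions a 10 second sequence at 64 kbit/s split into 20 GOPs gives each GOP 32000 bits.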
After the GOP layer rate control at block 1, the encoder carries out a buffer initialization at block 2, conducts the Intra-coding of the I-frame at block 3, updates a Rate-Distortion model at block 4 and checks whether the next frame must be skipped at a skip-frame block 5 (e.g. because of possible buffer overrun).
Inter-coding is then performed in which the encoder 3 performs a joint buffer control at block 6, a Frame Layer Target Bit Rate calculation at block 7 and a Quantization Parameter calculation at block 8, before carrying out the Inter-coding of the P or B frame at block 9. After encoding of the frame, the R-D model update and Frame-skip control are again carried out at blocks 4 and 5 before conducting the encoding of the next inter-frame through block 6, etc.
Where the encoder scheme is used in the Video Object layer, the encoder also conducts a Target Bit Rate Allocation at block 10, and calculates a shape threshold in block 8 along with the quantization parameter calculation.
The part of the bit rate control in the frame layer consists of three stages: the initialization, pre-encoding and post-encoding stages.
(a) Initialization Stage
In the initialization stage of block 2, the encoder carries out three main tasks with respect to the frame layer control, these being:
(i) initialization of the buffer size based on latency requirements;
(ii) subtraction of the bit count of the I-frame from the bit count of the ith GOP; and
(iii) initialization of the buffer fullness. If the first GOP is being encoded, the buffer fullness is set at 50% of a buffer safety margin (which will be 40% of the buffer size assuming a safety margin of 80%). Otherwise, the buffer fullness is set at the end level of the previous GOP.
The I-frame is quantized using an initial quantization value of Q0. The remaining available bits R0(i) for encoding all of the subsequent inter-frames can be calculated as:
$$R_0(i) = TB_i - K_i + (0.5 \cdot B_s \cdot y - B_c(i))$$
where TBi is the number of bits available to encode the ith group of frames; Ki is the number of bits used to encode the ith intra-frame; Bs is the buffer size; y is the buffer safety margin for skipping frames, having a typical value of 0.8; and Bc(i) is the buffer level at the start of encoding of the ith group of frames, with Bc(1) = 0.5 * Bs * y.
The channel output rate (the average number of bits to be drained from the buffer per frame encoding) is then R0(i)/NP,i.
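The initialization arithmetic may be sketched as below. This is a minimal sketch only: the function names are hypothetical, the safety margin defaults to the typical value of 0.8 given above, and the divisor is taken to be the number of inter-frames in the GOP as in the text.

```python
def remaining_inter_bits(tb_i: float, k_i: float, b_s: float,
                         b_c_start: float, margin: float = 0.8) -> float:
    """R0(i) = TB_i - K_i + (0.5 * B_s * y - B_c(i))."""
    return tb_i - k_i + (0.5 * b_s * margin - b_c_start)


def channel_output_rate(r0: float, n_inter_frames: int) -> float:
    """Average number of bits drained from the buffer per encoded frame."""
    if n_inter_frames <= 0:
        raise ValueError("n_inter_frames must be positive")
    return r0 / n_inter_frames
```

For the first GOP, where the buffer starts at 0.5 * Bs * y, the bracketed term vanishes and R0 reduces to the GOP budget minus the bits spent on the I-frame.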
(b) Pre-encoding Stage
The pre-encoding stage includes setting a target bit rate for the encoding of the next video frame in the GOP, and setting the quantization parameter for quantization of the DCT coefficients in accordance with the target bit rate.
When the number of bits in the buffer is too large (e.g. is predicted to exceed a safety margin), the encoder usually skips some frames to reduce the buffer delay and avoid buffer overflow. This however produces undesirable motion discontinuities in the encoded video sequence. Conversely, if the buffer level is too low, there may be periods of time in which no bits are transmitted through the channel, and channel bandwidth is wasted.
In order to overcome these problems, a frame level control is adopted which sets the target bit rate so as to attempt to maintain a buffer occupancy after the coding of each frame of about 50% of the buffer safety margin (i.e. about 40% of the buffer size for a 0.8 safety margin).
It should be noted that this differs from the prior art, which sets the target buffer fullness at the middle level of the buffer. The present scheme enables a low encoder buffer delay to be maintained and the total delay to be reduced.
In order to determine the target bit rate, the dynamics of the buffer are represented by a fluid-flow traffic model with Bc(n) denoting the buffer level at time n:
$$B_c(n+1) = \max\{0,\; B_c(n) + T(n) - u(n)\} \qquad (1)$$
where T(n) is the actual encoding bit rate; and u(n) is the channel output rate.
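As an illustration of equation (1), the following minimal sketch steps the buffer level forward frame by frame. The function names and the sample figures are hypothetical; only the update rule itself is taken from the model above.

```python
def next_buffer_level(b_c: float, t_n: float, u_n: float) -> float:
    """Fluid-flow update: B_c(n+1) = max{0, B_c(n) + T(n) - u(n)}."""
    return max(0.0, b_c + t_n - u_n)


def simulate_buffer(b_start, encoded_bits, channel_bits):
    """Return the buffer level before each frame and after the last one."""
    levels = [b_start]
    for t_n, u_n in zip(encoded_bits, channel_bits):
        levels.append(next_buffer_level(levels[-1], t_n, u_n))
    return levels
```

The clamp at zero reflects that the buffer cannot hold a negative number of bits: when the channel drains more than is available, the surplus bandwidth is simply unused.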
Using equation (1) and linear system control theory (see for example Chi-Tsong Chen, "Linear system theory and design", Holt, Rinehart and Winston, New York, 1984), the target bit rate is scaled based on the buffer size Bs, the current buffer level Bc(n) and the channel output rate R0(i)/NP,i, and is given by:
where 0 < γ < 1 is an adjustable parameter having a typical value of 0.75.
When calculating the bit rate for the frame, the number of remaining bits Tr allocated to the current GOP and the remaining number of frames Nr of the current GOP should also be taken into account to ensure that there are available bits for the remaining frames, and so the final frame bit rate is:
where 0 < β < 1 is an adjustable parameter having a typical value of 0.585; and
Hhdr(n-1) is the amount of bits used for overhead data, that is, the bits used for non-texture data, e.g. shape information, motion vector information and header information.
It should be noted that the above method of using a fluid-flow model departs from the prior art use of heuristic methods for determining the target bit rate, and enables the buffer occupancy to be kept much closer to the target, so that fewer frames are skipped.
The present model-based method may be used in any suitable video transmission system, and is especially attractive when MPEG-4 video is transported over the Internet where variations in bandwidth occur. Using the heuristic approach, adjustment of the joint buffer control has a delay of one step, and cannot adapt itself in time to the variations in channel bandwidth. However, with the present model-based method, when the channel bandwidth is time-varying, the term R0(i)/NP,i may be replaced by the estimated actual channel bandwidth, e.g. by using the packet loss information. Thus, the variation of the channel bandwidth can be incorporated into the present joint buffer control, and the scheme can adapt itself in time.
A further point to note is that the receiver synchronization of a continuous media stream must deal with delay differences and variations. Since the present frame-layer control keeps the buffer occupancy much closer to the target (50% of the safety margin (40% of the buffer size)), the playout buffer delay can be reduced, and so the total delay is further reduced.
Once the target bit rate is determined, the corresponding quantization parameter, Q, can be computed by using a Rate-Distortion model, which takes the form of the following quadratic model:
$$R = c_2 \frac{\sigma}{Q^2} + c_1 \frac{\sigma}{Q} + H_{hdr}$$
where R is the total number of bits used to encode a frame; Q is the quantization parameter; c1 and c2 are first and second order coefficients;
σ is the mean absolute difference of texture computed using the motion-compensated residual for the luminance component (an index of video coding complexity); and Hhdr is the amount of bits used for overhead data, that is, non-texture data, e.g. video/frame syntax, bits used for shape information, motion vector information and header information.
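One way the quantization parameter might be obtained from the quadratic model is to solve it for 1/Q. The sketch below is illustrative only: the function name, the fallback to the coarsest quantizer when no valid solution exists, and the clamping range of 1 to 31 are assumptions and are not taken from the specification.

```python
import math


def quantization_parameter(target_bits, header_bits, sigma, c1, c2,
                           q_min=1, q_max=31):
    """Solve R = c2*sigma/Q^2 + c1*sigma/Q + Hhdr for Q (assumed QP range)."""
    texture_bits = target_bits - header_bits
    if texture_bits <= 0 or sigma <= 0:
        return q_max  # no texture budget: use the coarsest quantizer
    a = c2 * sigma  # coefficient of x^2, with x = 1/Q
    b = c1 * sigma  # coefficient of x
    if abs(a) < 1e-12:
        if b <= 0:
            return q_max
        q = b / texture_bits  # first-order model only
    else:
        disc = b * b + 4.0 * a * texture_bits
        if disc < 0:
            return q_max
        x = (-b + math.sqrt(disc)) / (2.0 * a)
        if x <= 0:
            return q_max
        q = 1.0 / x
    return int(min(q_max, max(q_min, round(q))))
```

With sigma = 10, c1 = 100 and c2 = 200, a frame coded at Q = 8 would use 156.25 texture bits under the model, so a target of 206.25 bits with 50 overhead bits maps back to Q = 8.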
(c) Post-encoding Stage
The post-encoding stage includes the processes of updating the parameters c1 and c2 of the Rate-Distortion model and determining whether any frame-skipping is necessary to prevent possible buffer overflow.
The statistics of quantization parameter value and bit rate value, taken from a number of previously encoded frames including the immediately preceding frame, are used to provide improved parameters c1 and c2 for the R-D model by using a linear regression technique.
The number of frames to use is based on a sliding window mechanism, which is designed to smooth the impact that a scene change might have in the updating of the R-D model.
If the complexity changes significantly, i.e. in high motion scenes, a smaller window with more recent data points after the change is used. Otherwise, a window with more data points is used. To ensure that the window size is not varied too rapidly, the window size is increased gradually.
Thus, the value of the current window size W(n) is given by:
$$W(n) = \min\{W(n-1) + 1,\; \varsigma(n) \cdot \mathit{Max\_Sliding\_Window}\}$$
where Max_Sliding_Window is a preset constant, and may be set to e.g. 20.
The selected sample data points within the window W(n) are denoted as S(j,n) (1 ≤ j ≤ W(n)).
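The window update may be sketched as follows. The scaling factor ς(n) is passed in as a plain argument (here called `complexity_ratio`, a hypothetical name) because its definition is not reproduced in this passage; the truncation to an integer and the floor of one sample are also assumptions.

```python
def window_size(prev_window: int, complexity_ratio: float,
                max_window: int = 20) -> int:
    """W(n) = min{W(n-1) + 1, ratio * Max_Sliding_Window}, at least 1."""
    grown = prev_window + 1                  # window grows by one frame at most
    capped = complexity_ratio * max_window   # shrinks when complexity changes
    return max(1, int(min(grown, capped)))
```

Under these assumptions a ratio near 1 lets the window grow by one frame per step up to the maximum, while a small ratio (a large change in complexity) cuts the window at once.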
For the selected data points, the encoder collects the quantization parameter statistics Q(j) and the actual bit rate statistics T(j), and, using a linear regression technique, the parameters can be obtained by:
C3 - -0, c2 =
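As an illustration of how such a regression could be carried out, the sketch below fits c1 and c2 by ordinary least squares. It is not asserted to be the closed form used in the specification. It assumes that each sample supplies the texture bits already divided by the complexity index σ, so that the model reads y = c1/Q + c2/Q^2, and it falls back to a first-order fit when the samples cannot separate the two coefficients. All names are hypothetical.

```python
def fit_rd_coefficients(qs, ys):
    """Least-squares fit of y = c1/Q + c2/Q^2 over the window samples.

    qs: quantization parameters Q(j); ys: (T(j) - Hhdr) / sigma per sample.
    Returns (c1, c2).
    """
    xs = [1.0 / q for q in qs]
    s11 = sum(x ** 2 for x in xs)
    s12 = sum(x ** 3 for x in xs)
    s22 = sum(x ** 4 for x in xs)
    b1 = sum(x * y for x, y in zip(xs, ys))
    b2 = sum(x * x * y for x, y in zip(xs, ys))
    det = s11 * s22 - s12 * s12
    if abs(det) <= 1e-9 * s11 * s22:
        # Degenerate window (e.g. every sample used the same Q):
        # only a first-order coefficient can be estimated.
        return b1 / s11, 0.0
    c1 = (b1 * s22 - b2 * s12) / det
    c2 = (s11 * b2 - s12 * b1) / det
    return c1, c2
```

The fallback matters in practice because consecutive frames are often coded with the same quantization parameter, in which case the two normal equations are linearly dependent.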
After updating the R-D model, the total number of actual bits T(n) used to encode the current frame is added to the current buffer level, and a switched frame skip control is performed to prevent buffer overflow and overcome continuous frame skipping. The switched frame skipping control is composed of two basic controllers (a predictive frame skipping controller and a post frame skipping controller) and a corresponding switching law to determine the active controller.
In the predictive frame skip controller, a function TB is defined by:
where T(S(j,n)) (1 ≤ j ≤ W(n)) denotes the total number of actual bits generated in the encoding of the previous W(n) frames.
The next frame to be encoded will be skipped, if the current buffer level plus the estimated number of bits for the next frame is larger than the sum of TB(n) and some pre-determined threshold, called the safety margin, that is if:
$$B_c(n+1) + T(n) - A \ge B_s \cdot y + TB(n)$$
where Bc(n+1) is the current buffer level; T(n) is the actual number of bits used to encode the current frame;
A is the channel output rate (which may be R0(i)/NP,i or may be replaced by the estimated actual channel bandwidth);
Bs is the buffer size; and y is the pre-determined safety margin.
If skipping takes place, the current buffer level is reduced by the channel output rate.
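The predictive test and the buffer adjustment on a skip may be sketched as below. The function names are hypothetical, the margin defaults to the typical value of 0.8, and TB(n) is supplied by the caller as `tb_n` since this sketch does not compute it.

```python
def predictive_skip(b_c_next: float, t_n: float, a: float,
                    b_s: float, tb_n: float, margin: float = 0.8) -> bool:
    """True if B_c(n+1) + T(n) - A >= B_s * y + TB(n)."""
    return b_c_next + t_n - a >= b_s * margin + tb_n


def drain_on_skip(b_c: float, a: float) -> float:
    """When a frame is skipped, the buffer only drains by the channel rate."""
    return max(0.0, b_c - a)
```

Here T(n), the size of the frame just encoded, stands in as the estimate for the size of the next frame, as described above.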
In the post frame skipping controller, a frame skipping parameter Npost is increased from zero until the following buffer condition is satisfied; the next Npost frames are then skipped by the encoder:
$$B_c(n+1) < y B_s$$
where
$$B_c(n+1) = \max\{0,\; B_c(n) + T(n) - A(N_{post} + 1)\}.$$
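The search for Npost can be sketched as a simple loop. The function name and the input checks are assumptions added so that the loop is guaranteed to terminate; the margin defaults to the typical value of 0.8.

```python
def post_skip_count(b_c: float, t_n: float, a: float,
                    b_s: float, margin: float = 0.8) -> int:
    """Smallest N_post with max{0, B_c(n) + T(n) - A*(N_post + 1)} < y * B_s."""
    if a <= 0 or margin * b_s <= 0:
        raise ValueError("channel rate and margin * buffer size must be positive")
    n_post = 0
    while max(0.0, b_c + t_n - a * (n_post + 1)) >= margin * b_s:
        n_post += 1  # skip one more frame and let the channel drain the buffer
    return n_post
```

For example, with a 16000-bit buffer, a level of 14000 bits, a 4000-bit frame and a channel rate of 2000 bits per frame interval, two frames are skipped before the level falls below the 12800-bit margin.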
The predictive frame skipping control is initially used, and the switching law is:
a) The predictive frame skipping control is switched to the post-skipping control if a frame is skipped; and
b) The post-skipping control is switched to the predictive frame skipping control if the current frame is not skipped.
Instead of using the switched frame-skipping control, the predictive or post frame skipping control may be used by itself.
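The switching law can be pictured as a two-state controller. The class below is a hypothetical sketch: its name, its interface (taking both the level before and after the current frame is added) and its input checks are assumptions, and TB(n) for the predictive test is supplied by the caller.

```python
class SwitchedSkipController:
    """Alternates between predictive and post frame skipping control."""

    def __init__(self, buffer_size: float, margin: float = 0.8):
        self.threshold = margin * buffer_size  # y * B_s
        if self.threshold <= 0:
            raise ValueError("margin * buffer_size must be positive")
        self.mode = "predictive"  # the predictive control is used initially

    def frames_to_skip(self, b_c, b_c_next, t_n, a, tb_n=0.0):
        """b_c is B_c(n), b_c_next is B_c(n+1), t_n is T(n), a is the channel rate."""
        if a <= 0:
            raise ValueError("channel output rate must be positive")
        if self.mode == "predictive":
            skip = 1 if b_c_next + t_n - a >= self.threshold + tb_n else 0
        else:
            skip = 0
            while max(0.0, b_c + t_n - a * (skip + 1)) >= self.threshold:
                skip += 1
        # a) a skip hands control to the post controller;
        # b) no skip returns control to the predictive controller.
        self.mode = "post" if skip > 0 else "predictive"
        return skip
```

A single predictive skip thus hands over to the post controller, which can clear a backlog with several skips at once and then returns control once a frame is encoded without skipping.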
Besides being used for the frame layer rate control, the above method may also be used for the video object layer rate control.
In the video object rate control, the total target bit rate (as found in the frame layer control) is allocated to each video object according to its coding complexity, size and perceptual importance. Thus, for a given target bit rate, the target bit rate for an object i is given by:
τ∑ .{MOTiJX(n) + MOTijy(n))+ -τ)Pi + H"Λn 1} * ∑ ∑j (MOTljx (n) + MOTlJy (»))+ (1 -τ )∑ . Pj
where Pi is the size of the video object i; Hhdr(n-1) = Σi Hhdr,i(n-1);
MOTijx(n) and MOTijy(n) are the absolute values of the jth motion vector component within the object i at the time n; and τ is an adjustable parameter 0 < τ < 1.
Also, to avoid using excessive bits for motion and shape information instead of for texture, and to balance the bit usage without imposing additional noticeable distortion, the shape threshold values can be set dynamically based on the previous coding information.
In the adaptive threshold shape control, let
$$H_B(i) = \sum_{j=1}^{W_i(n-1)} H_{hdr,i}(S(j,n))$$
The threshold for the video object i, θi, is initially set to zero. If fi(n) is less than Hhdr,i(n-1) - 1.25 HB(i) in the previous frame, then it is increased by:
$$\theta_i = \min\{\theta_{max}(i),\; \theta_i + \theta_{step}(i)\}$$
where θstep(i) > 0 and θmax(i) > 0 are predefined.
If fi(n) is greater than Hhdr,i(n-1) + 1.25 HB(i), then it is decreased by:
$$\theta_i = \max\{0,\; \theta_i - \theta_{step}(i)\}$$
Otherwise, the threshold is not changed.
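The three-way threshold update may be sketched as follows. This is an illustration under stated assumptions: the increase is taken to be capped at θmax(i) and the decrease floored at zero, and the function and parameter names are hypothetical.

```python
def update_shape_threshold(theta: float, f_i: float, h_hdr_prev: float,
                           h_b: float, theta_step: float,
                           theta_max: float) -> float:
    """Raise, lower or keep the shape threshold of one video object.

    f_i is the object's target bits, h_hdr_prev its overhead bits in the
    previous frame, and h_b the quantity HB(i) defined above.
    """
    if f_i < h_hdr_prev - 1.25 * h_b:
        # Target is well below the overhead: allow coarser shape coding.
        return min(theta_max, theta + theta_step)
    if f_i > h_hdr_prev + 1.25 * h_b:
        # Target is comfortably above the overhead: refine the shape coding.
        return max(0.0, theta - theta_step)
    return theta  # inside the dead band: threshold unchanged
```

The dead band of plus or minus 1.25 HB(i) around the previous overhead keeps the threshold from oscillating when the target hovers near the overhead cost.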
When controlling the video object layer, the switched frame skipping control will preferably be used.
Besides controlling the frame layer and video object layer bit rates, the present scheme can also be applied to macroblock layer rate control. The method is thus scalable.
It is to be understood that various alterations, additions and/or modifications may be made to the parts previously described without departing from the ambit of the invention, and that, in the light of the teachings of the present invention, the control scheme may be implemented in software and/or hardware in a variety of manners.