WO2009152536A1 - Method for processing sport video sequences - Google Patents


Publication number
WO2009152536A1
Authority
WO
WIPO (PCT)
Prior art keywords
macroblocks
segment
audience
pixels
macroblock
Prior art date
Application number
PCT/AT2008/000224
Other languages
French (fr)
Inventor
Luca Superiori
Markus Rupp
Original Assignee
Mobilkom Austria Aktiengesellschaft
Priority date
Filing date
Publication date
Application filed by Mobilkom Austria Aktiengesellschaft filed Critical Mobilkom Austria Aktiengesellschaft
Priority to PCT/AT2008/000224 priority Critical patent/WO2009152536A1/en
Priority to ATA9462/2008A priority patent/AT509759B1/en
Publication of WO2009152536A1 publication Critical patent/WO2009152536A1/en

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/503Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N19/507Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction using conditional replenishment
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/187Segmentation; Edge detection involving region growing; involving region merging; involving connected component labelling
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/124Quantisation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/17Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/186Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being a colour or a chrominance component
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/20Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using video object coding
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/60Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding
    • H04N19/61Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding in combination with predictive coding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30221Sports video; Sports image

Definitions

  • the present invention refers to a method for processing sport video sequences for transmission through channels having limited transmission capacity, such as UMTS networks, said method comprising the steps of segmenting pictures of the video sequences to obtain different type segments, corresponding to regions with different contents, namely at least players and background, and encoding the obtained different segments separately, with the application of different coding strategies.
  • The invention is based on several perceptions, e.g. that in the case of sport video sequences, the attention of the customer is focussed on the ball and the players, and that the encoding of grandstands (audience, background) requires a considerable amount of bits compared with the players and ball.
  • soccer is one of the most transmitted contents in UMTS networks, and therefore, the present invention particularly aims at an optimized segmentation and encoding technique for transmitting such soccer video sequences.
  • Color characteristics are taken as a basis for generating the segments, or corresponding macroblock maps, which are then separately encoded for the intended wireless transmission.
  • Each frame containing a long angle shot is automatically segmented into three regions:
  • the encoding process is aware of the segmentation, and during encoding, the quality of the most important object (ball and players) is preserved.
  • the grandstands are encoded coarsely, and are refreshed periodically. It is preferred that the grandstands are not transmitted at all, and then reconstructed by means of camera movement compensation.
  • the three mentioned regions are encoded and stored in different packets. Then, a lower priority index may be associated to the packets containing the audience.
  • each picture is transformed from the RGB (Red-Green-Blue) color domain to the Hue (H) -Saturation (S) -Value (V) color space.
  • H Hue
  • S Saturation
  • V Value
  • the pixels of the picture that have a predetermined H component are counted, and dependent on the obtained number of pixels, the ranges of H, S and V are defined.
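The per-pixel conversion and hue counting can be sketched as follows; the default hue range for the field green is a hypothetical starting value, since the ranges of H, S and V are defined adaptively from the obtained pixel count as described above.

```python
import colorsys

def rgb_to_hsv_pixel(r, g, b):
    """Convert one 8-bit RGB pixel to (H in degrees, S in [0, 1], V in [0, 255])."""
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    return h * 360.0, s, v * 255.0

def count_hue_pixels(pixels, h_range=(70.0, 160.0)):
    """Count the pixels whose hue falls in h_range; the default green range
    is an assumed placeholder, to be adapted from the obtained count."""
    lo, hi = h_range
    return sum(1 for (r, g, b) in pixels
               if lo <= rgb_to_hsv_pixel(r, g, b)[0] <= hi)
```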
  • A preferred embodiment is characterized in that, for deciding whether pixels of the picture belong to a second, rather stationary segment, e.g. the audience, a region growing algorithm is used where at least one region seed is placed in respective corner macroblocks of the picture, and in the case that the number of pixels belonging to the first segment is less than a predetermined threshold, neighbouring macroblocks are checked in the same manner, so that a macroblock map of this second segment is determined.
  • The remaining macroblocks are decided to belong to the first segment, or to a third segment which, e.g., comprises players and ball; by checking whether the number of pixels belonging to the first segment exceeds a further predetermined threshold or not, it is decided whether the respective macroblock contains the first segment or the third segment.
  • A specific advantage is that the proposed segmentation allows encoding to be carried out with a tuning of the quantization parameters applied to the coefficients to be transmitted (in the case of discrete cosine transformation, DCT), so that different quantization parameters are used for different segments.
  • A high quantization parameter may be applied to the segment representing the audience.
  • the present invention also provides a system for optimized segmentation and encoding of sport video sequences.
  • Fig. 1 illustrates a diagram showing the size and average size of encoded macroblocks for different segments
  • Fig. 2 schematically illustrates a system for picture segmentation and video encoding, for carrying out the present method
  • Fig. 3 illustrates a more detailed block diagram showing the segmentation, encoding, and the decoding modules of the present technique
  • Fig. 4 schematically depicts an original frame (picture);
  • Fig. 5 illustrates this frame (picture) after schematic macroblock subdivision
  • Fig. 6 illustrates this frame (picture) after determination of the macroblocks associated to the audience
  • Fig. 7 depicts a flow chart illustrating different steps for segmentation
  • Fig. 8 illustrates a flow chart for explanation of the conversion of pixels of the frames from the RGB space to the HSV color space
  • Fig. 8A shows a schematic representation of the HSV color space
  • Fig. 9 illustrates a flow chart with respect to a possible postprocessing of the segments achieved, to associate players to the player regions instead of to the grandstand regions;
  • Figs. 10 and 11 show diagrams of macroblock and bit rate distribution illustrating the high bit rate with respect to audience macroblocks transmission without applying the present processing method
  • Fig. 12 depicts a flow chart illustrating the use of different quantization parameters for encoding the different segments
  • Fig. 13 shows a diagram illustrating the size of the code associated to a single frame (frame size versus frame index) when using a standard encoding mechanism
  • Fig. 14 illustrates, in a similar diagram, the frame size versus frame index, but now for the case of the specific transmission technique with respect to the audience macroblocks;
  • Figs. 15 and 16 illustrate diagrams showing, in Fig. 15, the normalized rate for field and grandstands depending on the QP settings and, in Fig. 16, the peak signal to noise ratio (PSNR) depending on the considered QPs; and
  • Fig. 17 illustrates a diagram showing MOS results.

Detailed description of preferred embodiments of the invention
  • the optimization of the video encoding will be defined as the optimal association of encoding rate to the contents of each single frame.
  • The video codec considered here is the state-of-the-art H.264/AVC, but most of the proposed concepts can be applied to common video codecs, such as H.263, MPEG-2 and MPEG-4. These standards belong to the family of the so-called hybrid block-based video codecs where the picture is subdivided into squares of 16 x 16 pixels called macroblocks.
  • the size of the encoded macroblocks of each picture has been the object of an investigation.
  • the distribution of the code between the three macroblock regions has been taken into account.
  • In Fig. 1, the average size of encoded macroblocks belonging to each region is shown; in particular, the size of the encoded macroblocks belonging to the grandstands is represented by graph 2, with the mean size being shown at 2'; graphs 1 and 1', respectively, illustrate the corresponding size and mean size of the players macroblocks, and graph 0 refers to the field macroblocks.
  • The size of the code 2 associated to the macroblocks containing the grandstands turns out to be the biggest, followed by the code 1 associated to the players, both much bigger than the code 0 associated to the field. This result can be interpreted as follows.
  • The size of the resulting code after the encoding of a macroblock strongly depends on the high frequency content of such macroblock.
  • The grandstands (R2) in wide-angle shots contain an irregular pattern consisting of a mixture of audience and other elements of the soccer stadium. This effect is further accentuated in low resolution sequences, by the spatial downsampling.
  • the encoding of such R2 macroblocks is made even more complex by a poor temporal prediction of the blocks.
  • The idea behind a segmentation of the pictures is to optimize the encoding of soccer (sport) video sequences by associating better quality and more rate to the elements that are more important from the user's point of view, and less rate and lower quality to other elements. It has to be specified that here, the terms "less rate" and "lower quality" are not directly related.
  • the grandstands can be considered, within a shot, as static elements of the video sequence, and it is therefore possible to refresh the grandstands whenever necessary (i.e. when the audience is celebrating a goal) .
  • a coding mechanism was to be found for the content providers, or a transcoding mechanism, lying between the content providers and the final users, for service providers.
  • a generic scheme of the present segmenting and encoding technique is shown in Fig. 2.
  • It is shown in Fig. 2 that on the basis of an original picture 10 comprising grandstand region R2, players and ball region R1, and field region R0, a picture segmentation is carried out in a module 11, whereafter video encoding follows in a module 12 which also uses picture information, cf. input 13 to the module 12; the result of this segmentation and encoding is an optimized video data stream 14.
  • The segmentation module 11 comprises an H (Hue) component analysis module 15 outputting field macroblocks at 16, and a region growing module 17 outputting grandstand macroblocks at 18; by difference forming, compare nodes 19, 20 in Fig. 3, player and ball macroblocks are obtained at 21.
  • The macroblocks 16, 18, 21 are then supplied to the encoding module 12, and combined with the original picture information at input 13, to obtain the final field macroblocks 16', grandstand macroblocks 18' and player and ball macroblocks 21'.
  • These macroblocks 16', 18', 21' are then encoded separately, cf. video coding layer 22 in Fig. 3, thereby applying different quantization parameters QP R0, QP R1 and QP R2.
  • The encoded macroblocks are then ready for transmission in the form of packets 23 (R0), 24 (R2) and 25 (R1).
  • The packets 23' (R0), 24' (R2) and 25' (R1) are received and decoded in a usual H.264/AVC decoder 26, and combined to obtain a reconstructed picture 27.
  • This segmentation process aims to associate each macroblock of the picture to one of the given regions R0, R1, R2.
  • The input of this segmentation process is given by each raw frame of the sequence (in raw, yuv or bmp format).
  • The purpose of the segmentation block is to output an association map, indicating for each macroblock the region it is associated to.
  • In Fig. 5, the picture of Fig. 4 with macroblock subdivision is shown.
  • The input images are in RGB (Red Green Blue) format. It is known that the components of this color format are highly correlated; therefore, it is preferred to convert the picture into HSV (Hue Saturation Value) format.
  • the "Hue” represents the color tone of the pixel, the “Saturation” the purity of the color (from gray to the pure tone) and the “Value” its luminance.
  • The main idea behind the segmentation algorithm is to consider the information about the color of the pixels representing the field. It is therefore possible to bound the tolerated values of Hue, Saturation and Value with respect to the field (R0), the remaining regions then being the regions R1 (players and ball) and R2 (grandstands). This principle will be discussed below in more detail on the basis of Fig. 7.
  • The macroblock regions R0, R1, R2 will be defined as an aggregation of macroblocks assumed to contain field elements, players, ball and field line elements, and audience (grandstands) elements.
  • the method is focused on wide angle shots.
  • the audience is located on the upper side of the frames.
  • the left (right, respectively) side of the picture may contain the audience.
  • the lower side of the frame may contain audience.
  • Seed macroblocks of the audience are placed on the four corner macroblocks of the picture.
  • One seed macroblock may belong to the audience or to the field, depending on its color characteristics. If it contains a number of green pixels (field pixels) not exceeding a given threshold, such seed is considered as the first macroblock of R2.
  • the surrounding macroblocks are then evaluated, and, depending on their color characteristics, may be attached to the R2 region (therefore, the region is "growing"), or they are discarded.
  • The process is terminated when all the border macroblocks of the audience region R2 have been examined.
  • the result thus obtained is schematically shown in Fig 6 where the macroblocks now determined and associated to the audience are shown at R2.
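The region growing over seed corner macroblocks can be sketched as follows; the threshold thr1 and the per-macroblock field-pixel counts are hypothetical inputs standing in for the color characteristics described above.

```python
from collections import deque

def grow_audience_region(field_count, rows, cols, thr1=50):
    """Region-growing sketch for the audience region R2.

    field_count[(r, c)] is the number of field (green) pixels in macroblock
    (r, c); thr1 is an assumed threshold. Seeds are the four corner
    macroblocks; a macroblock joins R2 when its field-pixel count is below
    thr1. Returns the set of macroblock coordinates assigned to R2.
    """
    seeds = [(0, 0), (0, cols - 1), (rows - 1, 0), (rows - 1, cols - 1)]
    queue = deque(s for s in seeds if field_count.get(s, 0) < thr1)
    audience = set(queue)
    while queue:
        r, c = queue.popleft()
        # Evaluate the four surrounding macroblocks; attach or discard them.
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and (nr, nc) not in audience
                    and field_count.get((nr, nc), 0) < thr1):
                audience.add((nr, nc))
                queue.append((nr, nc))
    return audience
```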
  • The remaining macroblocks are then the macroblocks belonging to the R0 region (field), and the macroblocks belonging to the R1 region (ball, players and field lines).
  • The macroblocks belonging to the field region R0 are the ones containing an amount of green pixels bigger than a predetermined threshold (depending on the characteristics of the picture).
  • The now remaining macroblocks belong to the players, ball and field lines region R1.
  • a refining step may still be advantageous.
  • Otherwise, the region growing algorithm would include the player(s) in the R2 region.
  • The contour of the R2 region is assumed not to contain isolated protrusions. Therefore, the macroblocks initially associated to R2 but surrounded at the left and right side by macroblocks belonging to R0 or R1 will be included in the R1 region (players etc.). In case of sidelong audience, an equivalent refinement is performed per column.
  • In Fig. 7, it is shown at step 30 that at the beginning of the segmentation, an RGB frame is taken which is then, at step 31, transformed from the RGB color domain to the HSV domain by means of an invertible transformation.
  • An example of the HSV color space is shown in Fig. 8A. It may be seen there that the tone, Hue H, is expressed as an angle, the saturation S in a scale from zero to one, and the value V in a scale from 0 to 255.
  • If a pixel fulfills the constraints on the ranges given in blocks 52, 53 and 54, this pixel is associated to the field (cf. block 55).
  • The field pixel detection (block 56) is finished (for the respective pixel), and the resulting map is then transformed to an equivalent one with the resolution of macroblocks, by counting the number of green pixels within each macroblock and comparing this with a threshold, cf. block 57 in Fig. 7.
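The reduction of the per-pixel field map to macroblock resolution (block 57) can be sketched as follows; the threshold value is a hypothetical placeholder.

```python
def pixel_map_to_mb_map(field_mask, thr=128):
    """Reduce a per-pixel field mask (rows x cols of 0/1, with dimensions
    multiples of 16) to macroblock resolution: a 16 x 16 macroblock is
    marked as field when its green-pixel count exceeds thr (assumed)."""
    mb_rows, mb_cols = len(field_mask) // 16, len(field_mask[0]) // 16
    mb_map = [[0] * mb_cols for _ in range(mb_rows)]
    for r in range(mb_rows):
        for c in range(mb_cols):
            # Count the green pixels within this macroblock.
            count = sum(field_mask[16 * r + i][16 * c + j]
                        for i in range(16) for j in range(16))
            mb_map[r][c] = 1 if count > thr else 0
    return mb_map
```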
  • the following step regards the detection of the audience macroblocks, according to block 58.
  • A region growing algorithm is used which is based on the color characteristics of the respective picture. Each region starts growing from a seed. The seeds are placed in the upper and lower corners of the picture (cf. block 59) since the audience elements are always placed at the border of the picture.
  • If the number of field pixels in the seed macroblocks is smaller than a given threshold thr1, as checked in step 60, then the pixels are considered as the beginning of an audience region, cf. block 61; otherwise, they are discarded, cf. block 62.
  • a map of the macroblocks associated to the audience region R2 is obtained at block 63.
  • The remaining macroblocks are the ones belonging to the field (R0) and to the players (R1), cf. block 65.
  • These macroblocks are then checked, cf. block 66: each macroblock whose number of field pixels exceeds a second given threshold thr2, cf. block 67, is considered to belong to the field region R0, cf. block 68, and is therefore associated to the field map, cf. block 69.
  • The remaining macroblocks contain the remaining elements (players, ball and possibly field lines), cf. block 70, and are associated to the player map according to block 71.
  • Each row of the audience map, cf. block 73 in Fig. 9, is examined by looking for isolated macroblocks in the row, cf. block 74. It is useful to define isolated audience macroblocks as the ones bordering left and right (cf. block 75) on field macroblocks (cf. block 76). In fact, the audience region cannot contain such isolated macroblocks; therefore, isolated macroblocks cannot belong to the audience but must belong to players, ball or field lines bordering on the audience and erroneously associated to the audience map. Such macroblocks are therefore removed from the audience map, cf. block 77, and associated to the players map, cf. block 78.
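The row-wise refinement can be sketched as follows; treating the picture borders as audience-adjacent is an assumption of this sketch, not stated in the description above.

```python
def refine_audience_map(audience, mb_cols):
    """Move isolated audience macroblocks to the player map: a macroblock
    leaves R2 when neither horizontal neighbour belongs to R2 (picture
    borders are treated as audience-adjacent - an assumption).
    audience is a set of (row, col) tuples; returns (audience, moved)."""
    moved = set()
    for (r, c) in sorted(audience):
        left_in = c == 0 or (r, c - 1) in audience
        right_in = c == mb_cols - 1 or (r, c + 1) in audience
        if not left_in and not right_in:
            moved.add((r, c))
    audience -= moved        # removed from the audience map ...
    return audience, moved   # ... and reported for the players map
```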
  • A description of the encoder 12 (Fig. 3) follows which, in principle, may be a usual H.264/AVC encoder but has some adaptations in view of the present segmentation technique and the possibilities issuing therefrom.
  • The encoded stream has to be segmented in packets having a maximal size, usually equal to the MTU (Maximum Transfer Unit) of the network in use. Since the size of an encoded macroblock depends on its characteristics, the number of macroblocks contained in a packet is not constant.
  • the macroblocks belonging to the same packet define a picture slice.
  • The macroblocks are usually read in raster scan order. Therefore, a slice contains the macroblocks [M, M+1, M+2, ..., N-2, N-1, N].
  • Using Flexible Macroblock Ordering (FMO), the slice groups will be defined using the allocation map obtained by the segmentation described above. This already increases the error resilience capabilities of the whole video stream since it is possible to associate different priorities to the packets belonging to the different regions R0, R1, R2.
  • the quantization parameters can be seen as scale factors defining how strongly the DCT coefficients have to be quantized: the smaller the quantization parameter, the finer the quantization.
  • a finer quantization means a more accurate reconstruction but also more information to be transmitted.
  • a bigger quantization parameter reduces the number of coefficients to be transmitted obtaining, at the decoder side, a less reliable reconstruction.
  • the coefficients the quantization parameter is applied to are the corrections that have to be applied to the available macroblock prediction.
  • Region R0 (field): Small quantization parameter, e.g. 26 to 30. Even if one thinks that the field can be coarsely encoded, subjective tests showed that a high quantization parameter results in blockiness of the field. The blockiness of the field turned out to be one of the most annoying artefacts.
  • Region R1 (players and ball): Small quantization parameter, e.g. 26 to 30. These segments carry the most valuable information for the viewer.
  • Region R2 (grandstands): High quantization parameter, e.g. 42. Since the grandstands contain mostly high frequency components, a high bit rate would be necessary for transmission but, within a shot, the high frequency components remain static. The high amount of information associated to them by the standard encoder is due mainly to an inefficient temporal prediction of the blocks. Moreover, the attention of the user is supposed not to be focused on the grandstands; therefore, a small degradation in quality can be tolerated.
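The per-region QP choice can be sketched as follows; the concrete values and the global PPS QP are assumptions following the ranges suggested above.

```python
PPS_QP = 28                                  # assumed global QP in the PPS
REGION_QP = {"R0": 28, "R1": 26, "R2": 42}   # assumed per-region choices

def slice_qp_delta(region):
    """Per-slice QP offset, signalled as a deviation from the global PPS QP:
    fine quantization for field and players, coarse for the grandstands."""
    return REGION_QP[region] - PPS_QP
```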
  • The encoder 12, when performing temporal prediction, will - for each macroblock - search its best prediction in the previous pictures. Because of downsampling, the high frequency pattern may vary significantly between two frames. Therefore, the efficiency of temporal prediction will suffer, resulting in a lot of high frequency residual to be transmitted. Even if the macroblocks associated to the audience, region R2, represent only about 21% of the pictures, their encoding requires 50% of the resulting bit rate, as shown in Figs. 10 (macroblock distribution) and 11 (bit rate distribution).
  • From a subjective point of view, however, the grandstands (region R2) do not change between the two frames. Therefore, the residual information sent by the encoder 12, mainly concerning high frequency components not appreciable by the human eye, can be reduced, namely by increasing the QP of the macroblocks associated to the audience, region R2.
  • the grandstands (region R2) remain nearly as a static background of the image. In case all the encoded macroblocks containing the grandstands are stored in the same packet, one can give to such packet a lower priority, if compared with the ones containing the players. In case the packet containing the audience is not received at the decoder side, it is possible to conceal the missing information by copying the audience from the previous picture while compensating the global camera motion exploiting, for instance, the movement of the field.
  • a refreshment picture has to be sent once in a while, whereas only the new macroblocks appearing because of the camera movement and not available in the current reference have to be encoded and transmitted. This may be explained in somewhat more detail as follows.
  • a macroblock is depicted at 80.
  • Its best motion compensated prediction is searched at block 81, namely by using a reference buffer 82. After calculating the difference between the original block and its prediction, the difference block in the pixel domain is transformed to the frequency domain by means of a horizontal and a vertical Discrete Cosine Transformation (DCT), cf. block 84.
  • the transformed residual block has then to be quantized.
  • The Quantization Parameter QP is chosen for each macroblock depending on the region the considered macroblock belongs to, cf. block 85. For the audience, a higher QP is chosen, causing the high frequency components to be rounded to zero. For the field and the players, a smaller QP is selected, therefore keeping more high frequency components during quantization, block 86, but calling for more bits to encode them by entropy encoding, block 87.
  • H.264/AVC is a hybrid block-based codec.
  • Each video frame is subdivided into blocks of 16 x 16 pixels, the macroblocks.
  • Such macroblocks are then encoded exploiting their spatial correlation with the neighboring ones (I frames) or their temporal correlation with the ones in the previously encoded images (P frames).
  • the best prediction (temporal or spatial, respectively) of the original macroblock (in the case of P frames) is evaluated.
  • a residual block is calculated as the elementwise difference between the best prediction of the macroblock and the original macroblock.
  • The difference block is then transformed into a transformed residual block t by means of two (horizontal and vertical) modified Discrete Cosine Transformations (DCT).
  • The element t(0, 0) represents the lowest frequency component of the transformed residual block (DC).
  • Higher row and column indexes are assigned to elements associated to increasing frequency components.
  • the block t is then scalarly quantized, obtaining a block q.
  • The quantization steps are indexed by the Quantization Parameter (QP). Incrementing the value of QP, more high frequency components are rounded to zero. This results in fewer elements to be entropy coded but, at the same time, in a lack of detail in the reconstructed block at the decoder side.
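The scalar quantization of the transformed residual block can be sketched as follows; the step-size formula is only an approximation of the H.264/AVC relation (the step roughly doubles every 6 QP increments), and truncation toward zero stands in for the standard's dead-zone rounding.

```python
def qstep(qp):
    """Approximate H.264/AVC quantization step: about 0.625 at QP 0,
    doubling every 6 QP increments (sketch, not the standard tables)."""
    return 0.625 * 2 ** (qp / 6)

def quantize_block(t, qp):
    """Scalar quantization of a transformed residual block t (list of lists
    of DCT coefficients), producing the block q of quantized levels."""
    step = qstep(qp)
    return [[int(coeff / step) for coeff in row] for row in t]
```

With a high QP (e.g. 42, as suggested for the grandstands) the small high frequency coefficients are rounded to zero and nothing is left for entropy coding.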
  • the encoding scheme is applied to all the macroblocks of the frame.
  • the macroblocks containing the field are characterized by their color tone, green, and the absence of high frequency patterns.
  • The player, ball and field line macroblocks are considered as the elements the attention of the observer is focused on.
  • Their movement is not consistent with the global camera movement, and their shape can vary in time.
  • the grandstands and advertisements basically remain as a static background according to the camera movement.
  • Figs. 10 and 11 show the results for a representative sequence of 134 frames.
  • Fig. 10 depicts the distribution of the 396 macroblocks over the three groups RO, Rl, R2.
  • Fig. 11 shows the resulting code associated to each group, normalized with respect to the total size of the frame.
  • The code associated to the macroblocks containing the field will on average be the smallest, due to the lack of high frequency details.
  • The macroblocks containing the audience, while representing only 15% to 16% of the total number of macroblocks, require 50% of the total bit rate. This behavior can be justified considering the content of the macroblocks belonging to this group R2.
  • The grandstands, particularly if crowded, are characterized by high frequency components. Even if imperceptible to the human visual system, such patterns vary in time, resulting in inefficient prediction and, therefore, high frequency transformed residuals. The reduced resolution accentuates this effect.
  • The selected QP strongly affects the size of the encoded stream as well as the quality of the decoded sequence.
  • the value of the quantization parameter is defined in the so called Picture Parameter Set (PPS) .
  • All macroblocks utilize the QP specified in the PPS pointed to by the frame they belong to.
  • A deviation from that QP can be defined at slice level, for a whole collection of macroblocks, or even at macroblock level for each single macroblock, at the cost of increasing signalling bits.
  • the present approach consists in the exploitation of the presented segmentation during the encoding.
  • The macroblocks are encoded in raster scan order. This strategy turns out to be inappropriate for the present method. Instead, it is preferred to exploit the flexible macroblock ordering (FMO), an error resilience tool comprised in the H.264/AVC baseline profile.
  • FMO allows the encoder 12 to group the macroblocks in slices, sorted according to some specific patterns (modes 1 to 5) or to an association map given as input (mode 6). This last opportunity has been selected for two different reasons. First, a single deviation from the global QP can be defined for each slice. Second, the different regions can be encoded and packetized separately, obtaining data partitions. If a priority index is associated to each packet, in case of network congestion, the least important packets can be dropped, reducing the impact on the perceived quality.
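The grouping implied by FMO mode 6 can be sketched as follows: the allocation map from the segmentation assigns each macroblock (in raster-scan order) to a region, and each region's macroblocks form one slice group.

```python
def slice_groups_from_map(region_map):
    """Group macroblock indices (raster-scan order) by region, as for FMO
    mode 6 with an explicit slice-group allocation map given as input."""
    groups = {}
    for mb_index, region in enumerate(region_map):
        groups.setdefault(region, []).append(mb_index)
    return groups
```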
  • a map containing the association between each macroblock and the region it belongs to is given as input to the H.264/AVC encoder 12, together with the frame to be encoded.
  • the considered codec is the Joint Model (JM) H.264/AVC baseline encoder.
  • the macroblocks belonging to a respective region can be separately encoded, using for each group an appropriate quantization parameter, and packetized.
  • the sequence is encoded using a modified Joint Model.
  • the main scope of the algorithm now is the reduction of the bits associated to the audience while preserving an acceptable quality, which is possible due to the strong correlation between two consecutive frames of the football sequence. While the movement of the players is hardly predictable, the elements containing the audience move coherently with the camera movement. Therefore, the movement of the whole region may be described by means of a single global motion vector.
  • a two pass encoding of the audience is proposed.
  • the audience is encoded using the common H.264/AVC encoding procedure.
  • the motion vectors obtained already point coherently in a single direction, representing the camera movement between the two shots.
  • the residuals are attributable to differences between the original block and the predicted one.
  • H.264/AVC calculates the difference in a discrete cosine transformed plane. Since the macroblocks associated to the audience mainly consist of high frequency components deriving from spatial downsampling, the prediction is in most cases ineffective.
  • the global motion vector will be calculated as the main component (if any) of the motion vector histogram. This motion vector is used for all the macroblocks belonging to the audience. In most cases, this will be obtained by skipping the macroblock. For isolated macroblocks, for which no prediction is possible, the global motion vector will be signaled again.
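This estimation step can be sketched as follows: the global motion vector is taken as the dominant bin of the motion-vector histogram of the audience macroblocks, provided that bin is sufficiently dominant. The names and the 50% dominance threshold are illustrative assumptions, not part of the Joint Model.

```python
from collections import Counter

def global_motion_vector(mvs, min_share=0.5):
    """Return the dominant motion vector of the audience macroblocks if it
    covers at least `min_share` of them, else None (no clear main component)."""
    if not mvs:
        return None
    (mv, count), = Counter(mvs).most_common(1)  # peak of the MV histogram
    return mv if count / len(mvs) >= min_share else None

mvs = [(-4, 0)] * 8 + [(-3, 0), (0, 0)]  # a mostly uniform leftward pan
print(global_motion_vector(mvs))          # (-4, 0)
```

When the function returns None (no main component, e.g. during a zoom), the two-pass shortcut is simply not applied.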
  • the macroblocks belonging to the newly appearing column (horizontal movement) or row (vertical movement) may be encoded as common P macroblocks.
  • the proposed method works properly as long as the shots do not include zoom.
  • the temporal prediction does not perform properly since the encoding mechanism is not able to compensate zoom without inserting residuals.
  • the variance of the motion vector increases.
  • a zoom detector based on the motion vector variance may be used.
  • the motion vectors are encoded with a resolution of one quarter of a pixel. If the value of the global motion vector is not an integer number of pixels (i.e. not a multiple of 4 in quarter-pel units), the frame would be affected by blurring because of the interpolation performed. Therefore, the movement is buffered and applied in multiples of a pixel.
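The buffering of fractional motion can be sketched as follows (an illustrative model with hypothetical names; motion is counted in quarter-pel units and only whole-pixel multiples, i.e. multiples of 4, are released per frame):

```python
class MotionBuffer:
    """Accumulate per-frame global motion (quarter-pel units) and release
    it only in whole-pixel steps, keeping the fractional part pending."""
    def __init__(self):
        self.pending = 0  # buffered motion in quarter-pel units

    def apply(self, qpel_motion):
        self.pending += qpel_motion
        whole_pixels, self.pending = divmod(self.pending, 4)
        return 4 * whole_pixels  # motion applied this frame (quarter-pel units)

buf = MotionBuffer()
print([buf.apply(3) for _ in range(4)])  # [0, 4, 4, 4]: 3 qpel/frame, no blur
```

For negative (leftward or upward) motion, `divmod` floors toward minus infinity, so whole-pixel steps are released in the movement's direction as well, and the cumulative applied motion still tracks the true motion to within one pixel.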
  • the performance of the present method may be shown in terms of bit rate saving and resulting subjective quality, measured by MOS (Mean Opinion Score).
  • in Fig. 14, the size of the code associated to a single frame is shown in terms of bit rate saving, compared with the standard encoding mechanism of Fig. 13. In both cases, the QP was set to 26 for all the regions.
  • the small peaks in the plot of Fig. 14 are caused by the additional rate due to the encoding of the border as indicated above with reference to camera movement.
  • the larger peaks in Fig. 14 are the consequence of the frame refreshment applied every 25 frames.
  • Fig. 16 shows the Peak Signal to Noise Ratio (PSNR) depending on the considered QPs.
  • the PSNR is not as sensitive to the QP modification as the resulting rate was.
  • the PSNR remains at about 80% of the original.
  • the objective distortion metric appears to be marginally dependent on the QP applied to the field. Therefore, the variations should only be attributed to the effect of the quantization on the grandstands.
  • the minor decrease in PSNR compared with the substantial one in terms of rate leads to the conclusion that, for an objective metric, the temporal prediction applied to the grandstands does not prove effective, even for low QPs.
  • a Mean Opinion Score was selected as the subjective metric.
  • the video sequences the test subjects had to evaluate consisted of five different football sequences, each encoded using nine different sets of QPs, plus the uncompressed versions, for a total of 50 sequences.
  • the order of the sequences was randomized.
  • the evaluation consisted in assigning to each displayed sequence a vote on a scale from 1 (bad) to 5 (excellent).
  • Fig. 17 depicts the results of a representative sequence considering different settings of the encoder compared to the uncompressed sequence.

Abstract

A method for processing sport video sequences for transmission through channels having limited transmission capacity, said method comprising the steps of segmenting pictures of the video sequences to obtain different type segments, corresponding to regions with different contents, namely at least players and background, and encoding the obtained different segments separately, with the application of different coding strategies, wherein the pictures of the video sequences are segmented on the basis of color characteristics, thereby deriving separate macroblocks for each segment.

Description

Method for processing sport video sequences
Field of invention
The present invention refers to a method for processing sport video sequences for transmission through channels having limited transmission capacity, such as UMTS networks, said method comprising the steps of segmenting pictures of the video sequences to obtain different type segments, corresponding to regions with different contents, namely at least players and background, and encoding the obtained different segments separately, with the application of different coding strategies.
Background art
From EP 0421186 B1 and EP 0959625 A2, it is known to segment video sequences, in particular with respect to the separation of players, e.g. tennis players, from an audience (background), or from a field. With such segmentation, the encoding process can take into account the contents, and thus the importance, of the respective picture segments so that different encoding qualities may be applied to different segments or objects; in particular, the best quality may be applied to the most important objects, namely players and ball. On the other hand, the game field and the audience are of less importance, so that a lower encoding quality is acceptable. The prior art methods rely on different, yet rather complicated segmentations where rather complex and time consuming algorithms are used. In particular, segmentation of pictures is based on the coding of edges, or is carried out on the basis of contours and of motion (characteristics/behaviour) of picture elements.
Summary of the invention
Accordingly, it is an object of the present invention to provide a method for processing sport video sequences, in particular soccer video sequences, with an improved segmentation and encoding technique, to obtain an encoding optimization of the video sequences. The invention is based on several perceptions, as e.g. that in the case of sport video sequences, the attention of the customer is focussed on the ball and the players, and that the encoding of grandstands (audience, background) requires a considerable amount of bits compared with the players and ball. In particular, soccer is one of the most transmitted contents in UMTS networks, and therefore, the present invention particularly aims at an optimized segmentation and encoding technique for transmitting such soccer video sequences.
Furthermore, it is to be taken into account that video compression as used for transmission in UMTS networks impairs the subjective quality.
To solve the problem given, the present invention provides for a method as defined in the independent claim. Preferred embodiments and further developments are subject matter of the dependent claims.
According to the present segmentation technique, color characteristics are taken as a basis for generating the segments, or corresponding macroblock maps, which then are separately encoded for the intended wireless transmission.
In a preferred embodiment, each frame containing a long angle shot is automatically segmented in three regions:
- field
- ball and players
- grandstands (audience).
The encoding process is aware of the segmentation, and during encoding, the quality of the most important object (ball and players) is preserved.
Furthermore, attention is paid to the fact that the field is not affected by blockiness. The grandstands are encoded coarsely, and are refreshed periodically. It is preferred that the grandstands are not transmitted at all, and are then reconstructed by means of camera movement compensation.
The three mentioned regions are encoded and stored in different packets. Then, a lower priority index may be associated to the packets containing the audience.
For performing highly efficient segmentation, it has turned out to be advantageous that, for segmentation, each picture is transformed from the RGB (Red-Green-Blue) color domain to the Hue (H)-Saturation (S)-Value (V) color space. Here, for deciding whether pixels of a picture belong to a given, rather stationary, first segment, e.g. a green soccer field, it is further advantageous to check whether H, S, V of these pixels are within a given range. Moreover, it is preferred that the pixels of the picture that have a predetermined H component are counted, and that, dependent on the obtained number of pixels, the ranges of H, S and V are defined. Furthermore, a preferred embodiment is characterized in that, for deciding whether pixels of the picture belong to a second, rather stationary segment, e.g. the audience, a region growing algorithm is used where at least one region seed is placed in a respective corner macroblock of the picture; in the case that the number of pixels belonging to the first segment is less than a predetermined threshold, neighbouring macroblocks are checked in the same manner, so that a map of the second segment macroblocks is determined. Here, it is further suitable if, after determination of the second segment macroblock map, the remaining macroblocks are decided to belong to the first segment, or to a third segment which, e.g., comprises players and ball; by checking whether the number of pixels belonging to the first segment exceeds a further predetermined threshold or not, it is decided whether the respective macroblock contains the first segment or the third segment.
To avoid that for instance players (third segment) are associated with the audience segment, it is then preferred that the rows of the audience map (second segment) are searched through for isolated macroblocks confined on the left and right side by field macroblocks, and that such isolated macroblocks are removed from the audience map and associated to the third segment, i.e. the players map. An equivalent investigation is performed per column.
To obtain a high compression rate, that means savings in the bit rate to be transmitted, it is preferred that in the case of macroblocks of a substantially stationary segment, e.g. the audience macroblocks, only new macroblocks appearing due to camera movement are continuously encoded and transmitted, and refreshment macroblocks for actualization are encoded and transmitted in greater time intervals only, depending on changes in camera shots.
A specific advantage is that the proposed segmentation allows encoding to be carried out with a tuning of the quantization parameters applied to the coefficients to be transmitted, in the case of discrete cosine transformation (DCT), so as to have different quantization parameters for different segments. In particular, a high quantization parameter may be applied to the segment representing the audience.
According to a further aspect, the present invention also provides a system for optimized segmentation and encoding of sport video sequences.
Brief description of the drawings
The invention will be described in more detail herebelow with reference to preferred embodiments to which, however, it should not be restricted, and with reference to the attached drawings. In the drawings,
Fig. 1 illustrates a diagram showing the size and average size of encoded macroblocks for different segments;
Fig. 2 schematically illustrates a system for picture segmentation and video encoding, for carrying out the present method;
Fig. 3 illustrates a more detailed block diagram showing the segmentation, encoding, and the decoding modules of the present technique;
Fig. 4 schematically depicts an original frame (picture);
Fig. 5 illustrates this frame (picture) after schematic macroblock subdivision;
Fig. 6 illustrates this frame (picture) after determination of the macroblocks associated to the audience;
Fig. 7 depicts a flow chart illustrating different steps for segmentation;
Fig. 8 illustrates a flow chart for explanation of the conversion of pixels of the frames from the RGB space to the HSV color space;
Fig. 8A shows a schematic representation of the HSV color space;
Fig. 9 illustrates a flow chart with respect to a possible postprocessing of the segments achieved, to associate players to the player regions instead of to the grandstand regions;
Figs. 10 and 11 show diagrams of macroblock and bit rate distribution illustrating the high bit rate with respect to audience macroblocks transmission without applying the present processing method;
Fig. 12 depicts a flow chart illustrating the use of different quantization parameters for encoding the different segments;
Fig. 13 shows a diagram illustrating the size of the code associated to a single frame (frame size versus frame index) when using a standard encoding mechanism;
Fig. 14 illustrates, in a similar diagram, the frame size versus frame index, but now for the case of the specific transmission technique with respect to the audience macroblocks;
Fig. 15 illustrates a diagram showing the normalized rate for field and grandstands depending on the QP settings, and Fig. 16 shows the peak signal to noise ratio (PSNR) depending on the considered QPs; and
Fig. 17 illustrates a diagram showing MOS results.
Detailed description of preferred embodiments of the invention
In the following, a brief overview of the proposed technique of video encoding will be offered. The optimization of the video encoding will be defined as the optimal association of encoding rate to the contents of each single frame. The video codec considered here is the state-of-the-art H.264/AVC, but most of the proposed concepts can be applied to common video codecs, such as H.263, MPEG-2 and MPEG-4. These standards belong to the family of the so-called hybrid block-based video codecs where the picture is subdivided into squares of 16 x 16 pixels called macroblocks.
Focusing on football video sequences, three regions are defined according to their content and their importance from the user perspective, namely:
  • Region 0 (R0): field
  • Region 1 (R1): players and ball
  • Region 2 (R2): grandstands (audience)
The size of the encoded macroblocks of each picture has been the object of an investigation. In particular, the distribution of the code between the three macroblock regions has been taken into account. In Fig. 1, the average size of encoded macroblocks belonging to each region is shown; in particular, the size of the encoded macroblocks belonging to the grandstands is represented by graph 2, with the mean size being shown at 2'; graphs 1 and 1', respectively, illustrate the corresponding size and mean size of the players macroblocks, and graph 0 refers to the field macroblocks.
The size of the code 2 associated to the macroblocks containing the grandstands turns out to be the biggest, followed by the code 1 associated to the players, both much bigger than the code 0 associated to the field. This result can be interpreted as follows.
The size of the resulting code after the encoding of a macroblock strongly depends on the high frequency content of that macroblock. The grandstands (R2), in wide-angle shots, contain an irregular pattern consisting of a mixture of audience and other elements of the soccer stadium. This effect is further accentuated in low resolution sequences by the spatial downsampling. Moreover, because of the high frequency components, the encoding of such R2 macroblocks is made even more complex by a poor temporal prediction of the blocks.
The idea behind a segmentation of the pictures is to optimize the encoding of soccer (sport) video sequences by associating better quality and more rate to the elements that are more important from the user's point of view, and less rate and lower quality to other elements. It has to be specified that here, the terms "less rate" and "lower quality" are not directly related. The grandstands can be considered, within a shot, as static elements of the video sequence, and it is therefore possible to refresh the grandstands whenever necessary (i.e. when the audience is celebrating a goal).
Therefore, a coding mechanism was to be found for content providers, or, for service providers, a transcoding mechanism lying between the content providers and the final users. A generic scheme of the present segmenting and encoding technique is shown in Fig. 2. In particular, it is shown in Fig. 2 that on the basis of an original picture 10 comprising grandstand region R2, players and ball region R1, and field region R0, a picture segmentation is carried out in a module 11, whereafter video encoding follows in a module 12 which also uses picture information, cf. input 13 to the module 12; the result of this segmentation and encoding is an optimized video data stream 14.
A more detailed scheme can be found in Fig. 3. In the following, the two main elements of the scheme will be discussed in detail: the segmentation process, and the modified H.264/AVC encoder.
With specific reference to Fig. 3, the segmentation module 11 comprises an H- (Hue) component analysis module 15 outputting field macroblocks at 16, and a region growing module 17 outputting grandstand macroblocks at 18; by difference forming, compare nodes 19, 20 in Fig. 3, player and ball macroblocks are obtained at 21. The macroblocks 16, 18, 21 are then supplied to the encoding module 12, and combined with the original picture information at input 13, to obtain the final field macroblocks 16', grandstand macroblocks 18' and player and ball macroblocks 21'. These macroblocks 16', 18', 21' are then encoded separately, cf. video coding layer 22 in Fig. 3, thereby applying different quantization parameters QP R0, QP R1 and QP R2.
The encoded macroblocks are then ready for transmission in the form of packets 23 (R0), 24 (R2) and 25 (R1). At the receiver side, the packets 23' (R0), 24' (R2) and 25' (R1) are received, decoded in a usual H.264/AVC decoder 26, and combined to obtain a reconstructed picture 27.
With particular reference to the segmentation, this segmentation process aims to associate each macroblock of the picture to one of the given regions R0, R1, R2. The input of the segmentation process is given by each raw frame of the sequence (in raw, yuv or bmp format).
At the output, a macroblock association map is produced in the form MBi -> Rj, where each macroblock MBi, with i = 0, ..., (pic_width x pic_height)/(16 x 16) - 1, is associated to its appropriate region Rj, with j = [0, 1, 2].
Given each frame of the uncompressed sequence, as e.g. the one schematically shown in Fig. 4, the scope of the segmentation block is to output an association map, indicating for each macroblock the region it is associated to. In Fig. 5, the picture of Fig. 4 comprising macroblock subdivision is shown.
The input images are in RGB (Red Green Blue) format. It is known that the components of this color format are highly correlated; therefore, it is preferred to convert the picture into HSV (Hue Saturation Value) format. The "Hue" represents the color tone of the pixel, the "Saturation" the purity of the color (from gray to the pure tone) and the "Value" its luminance. The main idea behind the segmentation algorithm is to consider the information about the color of the pixels representing the field. It is therefore possible to bound the tolerated values of Hue, Saturation and Value with respect to the field (R0), the remaining regions then being the regions R1 (players and ball) and R2 (grandstands). This principle will be discussed below in still more detail on the basis of Fig. 7.
Nevertheless, it is aimed at segmenting the frame at macroblock level.
It is not necessary to define the precise boundaries of the objects, but rather to define the region each macroblock belongs to. In the following, the macroblock regions R0, R1, R2 will be defined as aggregations of macroblocks assumed to contain field elements; players, ball and field line elements; and audience (grandstand) elements, respectively.
The method is focused on wide angle shots. In such sequences, the audience is located on the upper side of the frames. When approaching one of the two penalty areas of a soccer field, the left (or right, respectively) side of the picture may contain the audience. In some particular shots, also the lower side of the frame may contain audience. Under this assumption, it has been decided to use a region growing algorithm to highlight the macroblocks belonging to the region R2. Seed macroblocks of the audience are placed on the four corner macroblocks of the picture. One seed macroblock may belong to the audience or to the field, depending on its color characteristics. If it contains a number of green pixels (field pixels) not exceeding a given threshold, such seed is considered as the first macroblock of R2. The surrounding macroblocks are then evaluated, and, depending on their color characteristics, may be attached to the R2 region (therefore, the region is "growing"), or they are discarded. The process is terminated when all the border macroblocks of the audience region R2 have been examined. The result thus obtained is schematically shown in Fig. 6 where the macroblocks now determined and associated to the audience are shown at R2.
The remaining macroblocks then belong either to the R0 region (field) or to the R1 region (ball, players and field lines).
The macroblocks belonging to the field region R0 are the ones containing a number of green pixels larger than a predetermined threshold (depending on the characteristics of the picture). The now remaining macroblocks belong to the players, ball and field lines region R1.
Thereafter, a refining step may still be advantageous. When estimating the R2 region (audience), it may happen that one or more players overlap or border the audience. In that case, the region growing algorithm would include the player(s) in the R2 region. However, the R2 region cannot be concave or convex in contour. Therefore, the macroblocks initially associated to R2 but surrounded on the left and right side by macroblocks belonging to R0 or R1 will be included in the R1 region (players etc.). In case of sidelong audience, an equivalent refinement is performed per column.
With particular reference now to Fig. 7, it is shown there at step 30 that at the beginning of the segmentation, an RGB frame is taken which is then, at step 31, transformed from the RGB color domain to the HSV domain by means of an invertible transformation.
This pixel-wise RGB -> HSV transformation is illustrated in detail in Fig. 8. Accordingly, on the basis of the R, G and B contents of each pixel, illustrated at 32, 33, 34 in Fig. 8, the maximum of R, G, B, max(R, G, B), is determined at block 35, and the value V is set to V = max(R, G, B), compare output block 36. Moreover, the minimum of R, G, B, min(R, G, B), is determined at block 37, and the difference Δ between the maximum and the minimum of R, G, B is calculated at block 38 (Δ = max - min); thereafter, this difference Δ is divided by the maximum of R, G, B, i.e. Δ/max(R, G, B), at block 39, and the resulting quotient is taken as the saturation S, cf. block 40 in Fig. 8.
Furthermore, it is determined whether the maximum of R, G, B is R (block 41); or G (block 42); or B (block 43); and dependent on the outcome of these examinations, H (block 44) is determined as H = (G - B)/Δ (block 45); or as H = 2 + (B - R)/Δ (block 46); or as H = 4 + (R - G)/Δ (block 47).
An example of the HSV color space is shown in Fig. 8A. It may be seen there that the tone, Hue H, is expressed as an angle, the saturation S on a scale from zero to one, and the value V on a scale from 0 to 255.
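The blocks of Fig. 8 can be collected into a small routine. The sketch below is illustrative: the sextant value of blocks 45 to 47 is additionally scaled by 60 and wrapped into [0, 360) to obtain the angular Hue of Fig. 8A; that final scaling is an assumption, not shown in Fig. 8.

```python
def rgb_to_hsv(r, g, b):
    v = max(r, g, b)                  # block 36: Value
    delta = v - min(r, g, b)          # blocks 37-38
    s = delta / v if v else 0.0       # blocks 39-40: Saturation
    if delta == 0:
        h = 0.0                       # achromatic pixel: hue undefined, use 0
    elif v == r:
        h = (g - b) / delta           # block 45
    elif v == g:
        h = 2 + (b - r) / delta       # block 46
    else:
        h = 4 + (r - g) / delta       # block 47
    return (h * 60) % 360, s, v       # H in degrees, S in [0, 1], V in [0, 255]

print(rgb_to_hsv(50, 200, 50))  # a green pixel: (120.0, 0.75, 200)
```

The transformation is applied independently to every pixel of the frame before the field detection of Fig. 7 starts.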
Referring back to Fig. 7, where, at block 48, the resulting HSV frame is illustrated, the HSV picture is now analyzed in order to obtain a map of the pixels associated to the field (region R0). For each pixel, the histograms of H, S and V are built (block 49) in order to highlight the range of each component. The quantity of pixels having a Hue component comprised between the green tone limits (hue ∈ [40, 80]) is counted according to block 50, and is used to estimate the quantity of field present in the picture. This information is used to define the ranges of the components: the smaller the number of pixels belonging to the field (R0), the narrower the range considered when evaluating H (cf. blocks 51, 52), S (cf. block 53) and V (cf. block 54). If a pixel fulfills the constraints on the ranges given in blocks 52, 53 and 54, this pixel is associated to the field (cf. block 55). The field pixel detection (block 56) is then finished for the respective pixel, and the resulting map is transformed into an equivalent one at macroblock resolution, by counting the number of green pixels within each macroblock and comparing this number with a threshold, cf. block 57 in Fig. 7.
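Blocks 50 to 57 can be sketched as follows. This is an illustrative sketch: the hue bounds [40, 80] come from the document, while the saturation and value bounds and the 50% macroblock threshold are assumed placeholders for the adaptively chosen ranges.

```python
import numpy as np

def field_macroblock_map(h, s, v, h_range=(40, 80), s_min=0.2, v_min=60,
                         mb_thresh=0.5):
    """h, s, v: 2-D per-pixel component arrays (height x width, multiples
    of 16). Returns a boolean map marking macroblocks assigned to the field."""
    field = ((h >= h_range[0]) & (h <= h_range[1]) &
             (s >= s_min) & (v >= v_min))          # blocks 52-55: pixel test
    rows, cols = field.shape
    share = field.reshape(rows // 16, 16, cols // 16, 16).mean(axis=(1, 3))
    return share >= mb_thresh                      # block 57: macroblock test

# One 16 x 16 macroblock of uniformly green pixels is marked as field.
h = np.full((16, 16), 60.0)
s = np.full((16, 16), 0.5)
v = np.full((16, 16), 100.0)
print(field_macroblock_map(h, s, v))  # [[ True]]
```

The `reshape`/`mean` pair performs the per-macroblock green-pixel count of block 57 in a single vectorized step.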
The following step regards the detection of the audience macroblocks, according to block 58. As mentioned above, a region growing algorithm is used here which is based on the color characteristics of the respective picture. Each region starts growing from a seed. The seeds are placed in the upper and lower corners of the picture (cf. block 59) since the audience elements are always placed at the border of the picture. In case the number of field pixels in a seed macroblock is smaller than a given threshold thr1, as checked in step 60, then the seed is considered as the beginning of an audience region, cf. block 61; otherwise, it is discarded, cf. block 62.
Once all the seeds have been examined, the neighbor macroblocks of the remaining seeds are examined as described before. As a result, a map of the macroblocks associated to the audience region R2 is obtained at block 63. By means of difference, cf. block 64, the remaining macroblocks are the ones belonging to the field (R0) and to the players (R1), cf. block 65. Each of the remaining macroblocks is processed similarly, cf. block 66. Each macroblock whose number of field pixels exceeds a second given threshold thr2, cf. block 67, is considered to belong to the field region R0, cf. block 68, and is therefore associated to the field map, cf. block 69. The remaining macroblocks contain the remaining elements (players, ball and possibly field lines), cf. block 70, and are associated to the player map according to block 71.
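The region growing of blocks 58 to 63 may be sketched as follows. This is an illustrative breadth-first variant; the names and the 4-connected neighbourhood are assumptions not fixed by the document.

```python
from collections import deque

def grow_audience(field_pixel_count, thr1):
    """field_pixel_count: 2-D list with the number of field pixels per
    macroblock. Returns the set of (row, col) macroblocks assigned to R2."""
    rows, cols = len(field_pixel_count), len(field_pixel_count[0])
    seeds = [(0, 0), (0, cols - 1), (rows - 1, 0), (rows - 1, cols - 1)]
    audience, queue = set(), deque()
    for r, c in seeds:                           # block 60: accept/discard seeds
        if field_pixel_count[r][c] < thr1:
            audience.add((r, c)); queue.append((r, c))
    while queue:                                 # grow until the border is done
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and (nr, nc) not in audience
                    and field_pixel_count[nr][nc] < thr1):
                audience.add((nr, nc)); queue.append((nr, nc))
    return audience

# Top macroblock row has no field pixels (audience); bottom row is field.
print(grow_audience([[0, 0, 0], [200, 200, 200]], thr1=100))
```

Seeds falling on the field are discarded immediately, so the growth never crosses into R0.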
With the above processing, players, ball and field lines bordering the audience could be associated to the audience because of the region growing algorithm. In order to avoid this, a refinement algorithm is applied according to block 72, which is now discussed in more detail with reference to the refinement block diagram of Fig. 9.
According to Fig. 9, each row of the audience map, cf. block 73 in Fig. 9, is examined by looking for isolated macroblocks in the row, cf. block 74. It is useful to define isolated audience macroblocks as the ones bordered on the left and right (cf. block 75) by field macroblocks (cf. block 76). In fact, the audience cannot have the property of being convex; therefore, isolated macroblocks cannot belong to the audience but must belong to players, ball or field lines bordering the audience and erroneously associated to the audience map. Such macroblocks are therefore removed from the audience map, cf. block 77, and associated to the players map, cf. block 78.
Next, a description of the encoder 12 (Fig. 3) follows which, in principle, may be a usual H.264/AVC encoder but has some adaptations in view of the present segmentation technique and the possibilities issuing therefrom.
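The row-wise refinement pass of Fig. 9 (blocks 73 to 78) can be sketched as below; the equivalent column-wise pass is analogous and omitted. Names and region codes are illustrative.

```python
def refine_rows(region):
    """region: 2-D list with entries 0 (field), 1 (players), 2 (audience).
    Reassigns isolated audience macroblocks to the players map, in place."""
    for row in region:                          # block 73: each row
        for c in range(1, len(row) - 1):        # block 74: scan for isolates
            if row[c] == 2 and row[c - 1] == 0 and row[c + 1] == 0:
                row[c] = 1                      # blocks 77-78: move to players
    return region

# The lone audience block flanked by field blocks is actually a player.
print(refine_rows([[2, 2, 0, 2, 0]]))  # [[2, 2, 0, 1, 0]]
```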
In general, for transmission over packet-based networks, the encoded stream has to be segmented into packets having a maximal size, usually equal to the MTU (Maximum Transfer Unit) of the network in use. Since the size of an encoded macroblock depends on its characteristics, the number of macroblocks contained in a packet is not constant. The macroblocks belonging to the same packet define a picture slice. The macroblocks are usually read in raster scan order. Therefore, a slice contains the macroblocks [M, M+1, M+2, ..., N-2, N-1, N]. In H.264/AVC, a new error resilience tool was introduced, the so-called FMO (Flexible Macroblock Ordering); compare also US 2007/0201559 A1. It allows the definition of slice groups, where each slice group is a subset of the image. A macroblock belonging to a slice group will be encoded and packetized together with other macroblocks belonging to the same slice group.
In the present technique, the slice groups will be defined using the allocation map obtained by the segmentation described above. This already increases the error resilience capabilities of the whole video stream since it is possible to associate different priorities to the packets belonging to the different regions R0, R1, R2.
An optimization of the encoding is obtained by correspondingly tuning the QP (Quantization Parameter) of each slice group. In a few words, the quantization parameters can be seen as scale factors defining how strongly the DCT coefficients have to be quantized: the smaller the quantization parameter, the finer the quantization. A finer quantization means a more accurate reconstruction but also more information to be transmitted. On the other hand, a bigger quantization parameter reduces the number of coefficients to be transmitted, obtaining, at the decoder side, a less reliable reconstruction. The coefficients the quantization parameter is applied to are the corrections that have to be applied to the available macroblock prediction. A further description of the principle of this concept can be found in the literature (e.g. Iain E. G. Richardson, "H.264/AVC and MPEG-4 Video Compression (Video Coding for Next-generation Multimedia)", Wiley 2005; ITU-T Rec. H.264/ISO/IEC 11496-10, "Advanced Video Coding", Final Committee Draft, Document JVTE022, Sept. 2002).
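The leverage of the QP can be quantified with the well-known approximate H.264/AVC relation Qstep ≈ 2^((QP-4)/6), i.e. the quantizer step size doubles for every increase of QP by 6. The comparison of QP 26 and QP 42 below matches the values applied to the regions in this document; the small helper is illustrative.

```python
def qstep(qp):
    """Approximate H.264/AVC quantizer step size: doubles every 6 QP."""
    return 2 ** ((qp - 4) / 6)

# Grandstands at QP 42 vs. field/players at QP 26: a 16-QP gap.
print(round(qstep(42) / qstep(26), 1))  # 6.3
```

A roughly 6.3 times coarser quantization of the grandstands is what suppresses most of their high frequency residuals.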
Under these considerations, it has proven suitable to apply the following quantization parameters to the three defined regions R0, R1, R2:
- Region R0 (field): Small quantization parameter, e.g. 26 to 30. Even if one might think that the field can be coarsely encoded, subjective tests showed that a high quantization parameter results in blockiness of the field. The blockiness of the field proved to be one of the most annoying artefacts.
- Region R1 (players and ball): Small quantization parameter, e.g. 26 to 30. These segments carry the most valuable information for the viewer.
- Region R2 (grandstands): High quantization parameter, e.g. 42. Since the grandstands contain mostly high frequency components, a high bit rate would be necessary for transmission; but, within a shot, the high frequency components remain static. The large amount of information associated to them by the standard encoder is mainly due to an inefficient temporal prediction of the blocks. Moreover, the attention of the user is supposed not to be focused on the grandstands; therefore, a small degradation in quality can be tolerated.
The analysis performed confirmed that the encoding of the region R2 is the most expensive in terms of required bits. After the downsampling, the grandstands turned out to be a pattern of high frequency components.
The encoder 12, when performing temporal prediction, will - for each macroblock - search for its best prediction in the previous pictures. Because of downsampling, the high frequency pattern may vary significantly between two frames. Therefore, the efficiency of temporal prediction suffers, resulting in a large amount of high frequency residual to be transmitted. Even though the macroblocks associated with the audience, region R2, represent only about 21% of the pictures, their encoding requires 50% of the resulting bit rate, as shown in Figs. 10 (macroblock distribution) and 11 (bit rate distribution).
From a subjective point of view, however, the grandstands (region R2) do not change between the two frames. Therefore the residual information sent by the encoder 12, mainly concerning high frequency components not appreciable by the human eye, can be reduced, namely by increasing the QP of the macroblocks associated with the audience, region R2.
The advantages of using FMO (Flexible Macroblock Ordering) together with the proposed segmentation scheme can be summarized as follows:
- It is possible to apply different QPs to the macroblocks belonging to R0, R1, R2, depending on the region they are associated with. If all the macroblocks belonging to a region are encoded sequentially, one has to define only one QP value for the whole picture slice, instead of defining a QP for each single macroblock.
- Some parts of the image become more robust against packet loss. The grandstands (region R2) remain nearly a static background of the image. If all the encoded macroblocks containing the grandstands are stored in the same packet, such a packet can be given a lower priority compared with the ones containing the players. If the packet containing the audience is not received at the decoder side, it is possible to conceal the missing information by copying the audience from the previous picture while compensating the global camera motion exploiting, for instance, the movement of the field.
- Under this assumption, it is not necessary that the macroblocks of the audience are encoded at all; rather, they can be exempted from regular transmission and recovered as explained before. In this case, a refreshment picture has to be sent once in a while, whereas only the new macroblocks appearing because of the camera movement and not available in the current reference have to be encoded and transmitted. This may be explained in somewhat more detail as follows. First, however, still for general explanation, in Fig. 12, a macroblock is depicted at 80. For each macroblock of the picture, its best motion compensated prediction is searched at block 81, namely by using a reference buffer 82. After calculating the difference between the original block and its prediction, cf. block 83, the difference block in the pixel domain is transformed to the frequency domain by means of a horizontal and a vertical Discrete Cosine Transformation (DCT), cf. block 84. The transformed residual block then has to be quantized. The Quantization Parameter QP is chosen for each macroblock depending on the region the considered macroblock belongs to, cf. block 85. For the audience, a higher QP is chosen, causing the high frequency components to be rounded to zero. For the field and the players, a smaller QP is selected, therefore keeping more high frequency components during quantization, block 86, but requiring more bits to encode them by entropy encoding, block 87.
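The chain of blocks 83 to 86 can be sketched as follows. The orthonormal DCT-II matrix stands in for the integer transform the standard actually uses, the step-size law is an illustrative assumption, and the region-to-QP mapping follows the three slice groups proposed above:

```python
import numpy as np

def dct_matrix(n: int = 16) -> np.ndarray:
    # Orthonormal DCT-II basis; H.264/AVC actually uses a small
    # integer approximation, so this matrix is only a stand-in.
    k = np.arange(n)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)
    return c

def encode_residual(block: np.ndarray, prediction: np.ndarray, qp: int) -> np.ndarray:
    # Blocks 83-86: residual -> horizontal and vertical transform ->
    # scalar quantization with a region-dependent QP.
    d = dct_matrix(block.shape[0])
    residual = block.astype(float) - prediction.astype(float)
    t = d @ residual @ d.T
    step = 0.625 * 2.0 ** (qp / 6.0)  # illustrative step-size law
    return np.round(t / step).astype(int)

# Block 85: the QP is chosen from the region the macroblock belongs to.
QP_BY_REGION = {"R0": 26, "R1": 26, "R2": 42}

rng = np.random.default_rng(0)
block = rng.integers(0, 256, (16, 16))   # a noisy "audience"-like macroblock
prediction = np.zeros((16, 16))
q_audience = encode_residual(block, prediction, QP_BY_REGION["R2"])
q_players = encode_residual(block, prediction, QP_BY_REGION["R1"])
```

The audience QP leaves far fewer non-zero levels for the entropy coder (block 87) than the players' QP does, which is the source of the bit rate saving.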
Now, with specific reference to the encoding of audience region information, the following explanation may be useful.
Similarly to its predecessor, H.264/AVC is a hybrid block based codec. Each video frame is subdivided into blocks of 16 x 16 pixels, the macroblocks. Depending on the frame type, such macroblocks are then encoded exploiting their spatial correlation with the neighboring ones (I frames) or their temporal correlation with the ones in the previously encoded images (P frames). The best prediction (spatial or temporal, respectively) of the original macroblock is evaluated. A residual block is calculated as the elementwise difference between the best prediction of the macroblock and the original macroblock.
The difference block is then transformed into a transformed residual block t by means of two (horizontal and vertical) modified Discrete Cosine Transformations (DCT). The element t(0,0) represents the lowest frequency component of the transformed residual block (DC). Higher row and column indexes are assigned to elements associated with increasing frequency components. The block t is then scalarly quantized, obtaining a block q. The quantization steps are indexed by the Quantization Parameter (QP). As the value of QP is incremented, more high frequency components are rounded to zero. This results in fewer elements to be entropy coded but, at the same time, in a lack of detail in the reconstructed block at the decoder side. The encoding scheme is applied to all the macroblocks of the frame.
With respect to soccer video sequences, three different groups of scene components R0, R1, R2 have been defined above, distinguishing their specific features and their impact on the perceived quality. The macroblocks containing the field are characterized by their green color tone and the absence of high frequency patterns. The player, ball and field line macroblocks are considered as the elements the attention of the observer is focused on. Their movement is not consistent with the global camera movement, and their shape can vary in time. The grandstands and advertisements basically remain a static background moving according to the camera movement.
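A minimal sketch of the color-based classification of field macroblocks, using Python's standard colorsys module. The hue window, saturation/value thresholds and the 50% majority threshold are illustrative assumptions standing in for the adaptively derived ranges, not values taken from this document:

```python
import colorsys

def is_field_pixel(r: float, g: float, b: float,
                   h_range=(0.20, 0.45), s_min=0.25, v_min=0.15) -> bool:
    # RGB components in [0, 1] are converted to HSV; the hue window
    # roughly brackets the green tone of the field.
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    return h_range[0] <= h <= h_range[1] and s >= s_min and v >= v_min

def classify_macroblock(pixels, field_threshold=0.5) -> str:
    # A 16x16 macroblock is labelled as field (R0) when the share of
    # field-colored pixels exceeds the threshold; otherwise further
    # tests (e.g. region growing for the audience) would apply.
    share = sum(is_field_pixel(*p) for p in pixels) / len(pixels)
    return "R0" if share > field_threshold else "R1_or_R2"

grass = [(0.1, 0.8, 0.1)] * 200 + [(0.9, 0.9, 0.9)] * 56   # mostly green pixels
crowd = [(0.6, 0.5, 0.4), (0.2, 0.2, 0.3)] * 128           # no dominant green
```

In this sketch `classify_macroblock(grass)` yields the field label while `classify_macroblock(crowd)` does not; in the actual method the H, S, V ranges are derived from the picture content rather than fixed.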
An analysis of 20 different football sequences in CIF resolution was performed, aiming at the examination of the coding efficiency for the different macroblock groups. The analysis focuses on the temporally predicted (P) frames, both because the spatially predicted (I) frames require many more bits than the P frames and because soccer sequences are characterized by strong temporal correlation between consecutive frames. The results for a representative sequence of 134 frames are shown in Figs. 10 and 11. Fig. 10 depicts the distribution of the 396 macroblocks over the three groups R0, R1, R2. Fig. 11 shows the resulting code associated with each group, normalized with respect to the total size of the frame.
As expected, the code associated with the macroblocks containing the field is on average the smallest, due to the lack of high frequency details. Surprisingly, the macroblocks containing the audience, representing 15% or 16% of the total number of macroblocks, require 50% of the total bit rate. This behavior can be justified considering the content of the macroblocks belonging to this group R2. The grandstands, particularly if crowded, are characterized by high frequency components. Even if imperceptible to the human visual system, such patterns vary in time, resulting in inefficient prediction and, therefore, high frequency transformed residuals. The reduced resolution accentuates this effect.
From the observer's point of view, this configuration proves to be suboptimal. Most of the data rate is, in fact, allocated to the macroblocks containing the least useful information concerning the match. Moreover, the information contained in the grandstands and in the advertisements remains subjectively static in time. Thus, a significant amount of code is associated with details not perceptible by a human viewer.
As mentioned above, the selected QP strongly affects the size of the encoded stream as well as the quality of the decoded sequence. In an H.264/AVC encoded stream, the value of the quantization parameter is defined in the so-called Picture Parameter Set (PPS). Usually, all macroblocks utilize the QP specified in the PPS referenced by the frame they belong to. A deviation from that QP can be defined at slice level, for a whole collection of macroblocks, or even at macroblock level for each single macroblock, at the cost of additional signaling bits.
The present approach consists in the exploitation of the presented segmentation during the encoding. Traditionally, the macroblocks are encoded in raster scan order. This strategy proves inappropriate for the present method. Instead, it is preferred to exploit flexible macroblock ordering (FMO), an error resilience tool comprised in the H.264/AVC baseline profile. As mentioned above, FMO allows the encoder 12 to group the macroblocks in slices, sorted according to specific patterns (modes 1 to 5) or to an association map given as input (mode 6). This last option has been selected for two different reasons. First, a single deviation from the global QP can be defined for each slice. Second, the different regions can be encoded and packetized separately, obtaining data partitions. If a priority index is associated with each packet, in case of network congestion, the least important packets can be dropped, reducing the impact on the perceived quality.
Therefore, after segmentation, a map containing the association between each macroblock and the region it belongs to is given, together with the frame to be encoded, as input to the H.264/AVC encoder 12. The considered codec is the Joint Model (JM) H.264/AVC baseline encoder. According to the map, the macroblocks belonging to a respective region can be separately encoded, using for each group an appropriate quantization parameter, and packetized.
Once the association map is obtained from the segmentation algorithm, the sequence is encoded using a modified Joint Model. The main aim of the algorithm now is the reduction of the bits associated with the audience while preserving an acceptable quality, which is possible due to the strong correlation between two consecutive frames of the football sequence. While the movement of the players is hardly predictable, the elements containing the audience move coherently with the camera movement. Therefore, the movement of the whole region may be described by means of a single global motion vector.
In order to implement such an approach, it is possible to force the macroblocks of the audience to be skipped. In H.264/AVC a macroblock is skipped if its associated motion vector (MV) is equal to the predicted motion vector, i.e. the one derived from the motion vectors of the neighboring macroblocks. Furthermore, the macroblock pointed to by such a motion vector has to represent such a good approximation that no correction by means of residuals is necessary. It is possible to signal in the first macroblock the global motion vector representing the camera movement, and to use this to predict the motion vectors of the other macroblocks belonging to the audience.
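The skip condition can be sketched as below. H.264/AVC derives the motion vector predictor as the component-wise median of the left, top and top-right neighbors; the residual-energy tolerance `eps` is a hypothetical parameter introduced here for illustration:

```python
def predicted_mv(neighbor_mvs):
    # Component-wise median of the neighboring motion vectors
    # (left, top, top-right), as used for MV prediction in H.264/AVC.
    xs = sorted(mv[0] for mv in neighbor_mvs)
    ys = sorted(mv[1] for mv in neighbor_mvs)
    return (xs[len(xs) // 2], ys[len(ys) // 2])

def can_skip(mv, neighbor_mvs, residual_energy, eps=0.0):
    # A macroblock may be skipped when its MV equals the predicted MV
    # and its prediction needs no residual correction.
    return mv == predicted_mv(neighbor_mvs) and residual_energy <= eps
```

Signaling the global motion vector in the first audience macroblock makes the following audience macroblocks satisfy this condition, so they cost almost no bits.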
In order to estimate the global motion vector, a two-pass encoding of the audience is proposed. In the first pass, the audience is encoded using the common H.264/AVC encoding procedure. The motion vectors obtained already point coherently in a single direction, representing the camera movement between the two frames.
Additionally, several bits are associated with the residual encoded macroblocks. The residuals are attributable to differences between the original block and the predicted one, which H.264/AVC encodes in the discrete cosine transform domain. Since the macroblocks associated with the audience mainly consist of high frequency components deriving from spatial downsampling, the prediction proves ineffective in most cases.
One might think of keeping the motion vector and dropping the residuals. However, this would cause distortion at the decoder, since the image reference buffer at the encoder would not be updated accordingly. Therefore, a light second encoding pass is necessary. The global motion vector is calculated as the main component (if any) of the motion vector histogram. This motion vector is used for all the macroblocks belonging to the audience. In most cases, this is obtained by skipping the macroblock. For isolated macroblocks, for which no prediction is possible, the global motion vector is signaled again.
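Extracting the main component of the motion vector histogram can be sketched as follows; the 50% dominance threshold is an illustrative assumption:

```python
from collections import Counter

def global_motion_vector(mvs, min_share=0.5):
    # The most frequent motion vector is taken as the global camera
    # motion, but only if it dominates the histogram; otherwise no
    # reliable global motion exists (e.g. during a zoom).
    (mv, count), = Counter(mvs).most_common(1)
    return mv if count / len(mvs) >= min_share else None
```

Returning `None` when no bin dominates also gives a natural hook for the zoom detection discussed below, since zooming spreads the histogram.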
Due to the camera movement, video information not contained in the previous frame appears at the border of the picture. As soon as the camera movement exceeds a whole macroblock, the macroblocks belonging to the newly appeared column (horizontal movement) or row (vertical movement) may be encoded as common P macroblocks.
The proposed method works properly as long as the shots do not include zoom. In such a case, the temporal prediction does not perform properly, since the encoding mechanism is not able to compensate zoom without inserting residuals. Moreover, it has been observed that in case of zoom, the variance of the motion vectors increases. In order to cope with this effect, a zoom detector based on the motion vector variance may be used.
In H.264/AVC, the motion vectors are encoded with a resolution of one quarter of a pixel. If the value of the global motion vector is not an integer number of pixels (i.e. not a multiple of 4 quarter-pel units), the frame would be affected by blurring because of the interpolation performed. Therefore, the movement is buffered and applied in multiples of a pixel. The performance of the present method may be shown in terms of bit rate saving and resulting subjective quality, measured by MOS (Mean Opinion Score).
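The buffering of fractional movement described above can be sketched as follows, with motion expressed in quarter-pel units (4 units = 1 pixel); this is an illustrative sketch of the buffering idea, not the codec's actual bookkeeping:

```python
def apply_buffered_motion(buffer_qpel: int, mv_qpel: int):
    # Accumulate the quarter-pel motion and apply only the whole-pixel
    # part, so that no sub-pixel interpolation (and hence no blur) is
    # needed; the fractional remainder stays in the buffer.
    total = buffer_qpel + mv_qpel
    applied = (total // 4) * 4       # largest whole-pixel multiple
    return applied, total - applied  # (applied motion, new buffer)

applied1, buf1 = apply_buffered_motion(0, 3)     # 3/4 pel: nothing applied yet
applied2, buf2 = apply_buffered_motion(buf1, 3)  # 6/4 pel total: 1 pel applied
```

Over several frames the applied motion therefore tracks the true camera pan while always remaining a whole-pixel displacement.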
In Fig. 14 the size of the code associated with a single frame is shown in terms of bit rate saving, compared with the case of the standard encoding mechanism, Fig. 13. In both cases, the QP was set to 26 for all the regions. The small peaks in the plot of Fig. 14 are caused by the additional rate due to the encoding of the border, as indicated above with reference to camera movement. The bigger peaks in Fig. 14 are the consequence of frame refreshment applied every 25 frames.
In the following, a simulation setup as well as the obtained results will be described. Different sets of QPs were assigned to the three defined regions R0, R1, R2. The information associated with the macroblocks containing players, ball and lines was considered the most important one. Higher QPs were therefore assigned to the macroblocks containing the field and the grandstands. For the players, ball and lines, common QP values between 26 and 30 were used. For the field and the grandstands, a set of QPs varying from 26 to 42 was used. A training set of sequences was encoded covering all the possible permutations of QPs.
As a first analysis, the effect of the different quantization parameters was considered in terms of resulting rate, compared with the results obtained by encoding the whole picture with a QP of 26. The results are shown in Fig. 15, setting the QP of the players to 26. As expected, increasing the QP for the field does not provide significant improvements in terms of reduction of the resulting code, since the number of coefficients in the high frequency transformed residual is limited. On the contrary, the size of the encoded macroblocks associated with the grandstands can noticeably be adapted by modifying the quantization parameter.
Such results were then analyzed in terms of distortion. Fig. 16 shows the Peak Signal to Noise Ratio (PSNR) depending on the considered QPs. Surprisingly, the PSNR is not as sensitive to the QP modification as the resulting rate was. Even for the values (42,26,42), where the rate turned out to be about 25% of the original, the PSNR remains at about 80% of the original. As observed for the rate, the objective distortion metric also appears to be only marginally dependent on the QP applied to the field. Therefore, the variations should mainly be attributed to the effect of the quantization on the grandstands. The minor decrease in PSNR, compared with the substantial one in terms of rate, leads to the conclusion that, for an objective metric, the temporal prediction applied to the grandstands is not effective, even for low QPs.
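For reference, the PSNR used in Fig. 16 can be computed as below, using the standard definition for 8-bit video (peak value 255):

```python
import numpy as np

def psnr(original: np.ndarray, decoded: np.ndarray, peak: float = 255.0) -> float:
    # Peak Signal-to-Noise Ratio in dB between an original frame and
    # its reconstruction; higher values mean a closer reconstruction.
    mse = np.mean((original.astype(float) - decoded.astype(float)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```

Being a purely pixel-wise metric, PSNR weights errors in the grandstands exactly like errors on the field, which is one reason it misses the subjective effects discussed next.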
However, even if the prediction at the encoder is performed by minimizing an objective metric such as PSNR, the target is to optimize the encoding with respect to the subjective quality perceived by the observer. Exploiting the results of the previous analysis, a refined set of QP settings for different sequences has been defined. The field was encoded with moderate QPs, varying between 26 and 30. For the grandstands, higher QPs were analyzed, between 30 and 42. On average, the sequences were 135 frames long. The sequences consisted of an I frame at the beginning, encoded using QP 26 for all the macroblock groups in order to offer an accurate reference for the temporal prediction. All the following frames were P encoded.
A Mean Opinion Score (MOS) was selected as subjective metric. In order to reach a wide range of test subjects, a web page was realized. The video sequences the test subjects had to evaluate consisted of five different football sequences, encoded using nine different sets of QPs, plus the uncompressed ones, for a total of 50 sequences. The order of the sequences was randomized. The volunteers were asked to evaluate the sequences without knowing which were the five uncompressed ones. They were also not aware of the method behind the different compressed images. The evaluation consisted of assigning to each displayed sequence a vote on a scale from 1 (bad) to 5 (excellent). Fig. 17 depicts the results of a representative sequence considering different settings of the encoder compared to the uncompressed sequence. The results collected indicate the effectiveness of the method. The observers, indeed, proved to be only marginally annoyed by even strong compression of the grandstands. They were, however, extremely sensitive to even small increases of the QP used for encoding the field. This can be explained considering the different subjective response to a strong compression applied to the considered region. Even if the reconstruction of the grandstands is not assisted by the high frequency transformed residual, their predictions still contain high frequency components. Therefore, the error occurs in the range where the human visual system is less sensitive. Imperfections in the reconstruction of the field, on the contrary, affect blocks consisting mainly of low frequency components, causing noticeable and annoying blockiness. Moreover, the field surrounds the players and the ball. Since these are the objects on which the attention of the observer is focused, the user experience is further impaired.
In the above, a novel encoding strategy aiming at increasing the perceived user quality for soccer video streaming was proposed. Three groups of scene components were defined: the grandstands, the field, and one group comprising the ball, the players and the field lines. They present major differences both in terms of effects of compression and subjective importance. Such regions were identified by means of an image segmentation mechanism. The three groups of macroblocks were then separately encoded using different compression degrees. Subjective tests showed that, by reducing the amount of bits associated with the grandstands, the resulting code can be reduced by up to a factor of 2 compared to a standard encoded sequence, while affecting the perceived user quality only marginally.

Claims

1. A method for processing sport video sequences for transmission through channels having limited transmission capacity, such as UMTS networks, said method comprising the steps of segmenting pictures of the video sequences to obtain different type segments, corresponding to regions with different contents, namely at least players and background, and encoding the obtained different segments separately, with the application of different coding strategies, characterized in that the pictures of the video sequences are segmented on the basis of color characteristics, thereby deriving separate macroblocks for each segment.
2. The method according to claim 1, characterized in that for segmentation each picture is transformed from the RGB (Red-Green-Blue) color domain to the Hue (H), Saturation (S), Value (V) color space.
3. The method according to claim 2, characterized in that for deciding whether pixels of a picture belong to a given, rather stationary, first segment, e.g. a green soccer field, it is checked whether H, S, V of these pixels are within a given range.
4. The method according to claim 3, characterized in that the pixels of the picture that have a predetermined H component are counted, and dependent on the obtained number of pixels, the ranges of H, S and V are defined.
5. The method according to claim 3 or 4, characterized in that for deciding whether pixels of the picture belong to a second rather stationary segment, e.g. audience, a region growing algorithm is used where at least one region seed is placed in a respective corner of the corresponding macroblock, and in the case that the number of pixels belonging to the first segment is less than a predetermined threshold, neighboured pixels are checked in this manner so that a map of this second segment macroblock is determined.
6. The method according to claim 5, characterized in that after determination of the second segment macroblock map, the remaining macroblocks are decided to belong to the first segment, or a third segment which, e.g., comprises players and ball, whereafter, by checking whether the number of pixels belonging to the first segment exceeds a further predetermined threshold or not, it is decided that the respective macroblock contains the first segment or the third segment.
7. The method according to claim 6, characterized in that rows of the audience map are searched through for isolated macroblocks confined on the left and right side by field macroblocks, and in that such isolated macroblocks are removed from the audience map, and are associated to the third segment, i.e. the players map.
8. The method according to any one of claims 1 to 7, characterized in that in the case of macroblocks of a substantially stationary segment, e.g. the audience macroblocks, only new macroblocks appearing due to camera movement are continuously encoded and transmitted, and refreshment macroblocks for actualization are encoded and transmitted in greater time intervals only, depending on changes in camera shots.
9. The method according to any one of claims 1 to 8, characterized in that DCT encoding is applied to the macroblocks, and the encoding is carried out with a tuning of quantization parameters applied to coefficients to be transmitted, in accordance with the respective segments.
10. The method according to claim 9, characterized in that a high quantization parameter is applied to the segment representing audience.
PCT/AT2008/000224 2008-06-20 2008-06-20 Method for processing sport video sequences WO2009152536A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/AT2008/000224 WO2009152536A1 (en) 2008-06-20 2008-06-20 Method for processing sport video sequences
ATA9462/2008A AT509759B1 (en) 2008-06-20 2008-06-20 METHOD FOR PROCESSING SPORT VIDEOS RATES


Publications (1)

Publication Number Publication Date
WO2009152536A1 true WO2009152536A1 (en) 2009-12-23

Family

ID=40568378


Country Status (2)

Country Link
AT (1) AT509759B1 (en)
WO (1) WO2009152536A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105791825A (en) * 2016-03-11 2016-07-20 武汉大学 Screen image coding method based on H.264 and HSV color quantization

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113627363B (en) * 2021-08-13 2023-08-15 百度在线网络技术(北京)有限公司 Video file processing method, device, equipment and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040028130A1 (en) * 1999-05-24 2004-02-12 May Anthony Richard Video encoder


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
DHONDT Y ET AL: "Flexible macroblock ordering as a content adaptation tool in H.264/AVC", 24 October 2005, PROCEEDINGS OF THE SPIE, SPIE, BELLINGHAM, VA, US, PAGE(S) 601506/1 - 9, ISSN: 0277-786X, XP007901849 *
KEEWON SEO ET AL: "An Intelligent Display Scheme of Soccer Video on Mobile Devices", 1 October 2007, IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, IEEE SERVICE CENTER, PISCATAWAY, NJ, US, PAGE(S) 1395 - 1401, ISSN: 1051-8215, XP011193149 *
VANDENBROUCKE N., ET AL.: "Color image segmentation by pixel classification in an adapted hybrid color space. Application to soccer image analysis", COMPUTER VISION AND IMAGE UNDERSTANDING, no. 90, 2003, pages 190 - 216, XP002525817 *


Also Published As

Publication number Publication date
AT509759A2 (en) 2011-11-15
AT509759A3 (en) 2012-02-15
AT509759B1 (en) 2012-05-15


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 08756834; Country of ref document: EP; Kind code of ref document: A1)
ENP Entry into the national phase (Ref document number: 94622008; Country of ref document: AT; Kind code of ref document: A)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 08756834; Country of ref document: EP; Kind code of ref document: A1)