EP2380350A1

EP2380350A1 - Video encoding system and method

Info

Publication number: EP2380350A1
Application number: EP09768199A
Authority: EP
Inventors: Jean-Pierre Morard
Original assignee: Sagemcom Broadband SAS
Current assignee: Sagemcom Broadband SAS
Priority date: 2008-12-30
Filing date: 2009-11-16
Publication date: 2011-10-26
Also published as: FR2940736A1; WO2010076439A1; US8731060B2; BRPI0923824A2; FR2940736B1; US20110268191A1; CN102318344A; CN102318344B

Abstract

The invention relates to a video encoding system and method, and can be used in the field of video data broadcasting from a server to a client terminal. The video encoding system (100) for encoding consecutive images of a video sequence comprises an input data reception module (101) for receiving the current image (F_n) to be encoded, means (103) for dividing the current image (F_n) into macroblocks, a module (105) for estimating motion vectors and a motion compensation module (106). The data reception module (101) further receives a real motion vector of at least one displaced area of the current image (F_n), the encoding system (100) including means (104) for allocating said real motion vector to macroblocks belonging to said displaced area and means (118) for transmitting the real motion vector directly to said compensation module (106) without any estimation of the motion vectors by the estimation module (105) for the macroblocks that belong to the displaced area.

Description

Video coding system and method

Technical field of the invention

The present invention relates to a video coding system. It also relates to a video coding method. The invention applies to the field of video data broadcasting by a server to a client terminal. The server, usually a computer, is connected to the client terminal, for example a video decoder, by a network, for example HDMI ("High Definition Multimedia Interface"), WiFi or Ethernet. The computer screen can then be displayed by the client terminal on a television screen according to a "Remote Frame Buffer" type protocol, for example VNC ("Virtual Network Computing").

Technological background of the invention

In such an architecture, the server encodes, that is to say compresses, what it broadcasts before sending it to the client terminal. If the server only had to display the images it broadcasts on a screen of its own, it would not be necessary to compress them. To compress, the server captures its own display, encodes it and sends it over the network to the client terminal. Each image to be displayed is stored in a buffer of the server called the "framebuffer" and is generally coded in RGB ("Red Green Blue") format, which is the most direct way of coding images, the three planes corresponding to the three primary colors red, green and blue. The image is then generally transformed into a YUV (or luminance-chrominance) format. The first plane, called the luminance plane (Y), represents the luminous intensity of the pixels. The next two planes correspond to the chrominance (U, V) and carry the color information. There are essentially two YUV formats:

- the 4:2:0 format (also called YUV12), in which each of the two chrominance planes contains one sample for every four pixels,

- the 4:4:4 format, in which the three planes are the same size (i.e. there is one chrominance sample per pixel).

The encoding performed by the server is a spatio-temporal encoding such as H264. The H264 standard is a video coding standard jointly developed by VCEG (Video Coding Experts Group) and MPEG (Moving Picture Experts Group). This standard makes it possible to encode video streams at a bit rate less than half of that obtained with the MPEG2 standard for the same quality. A spatio-temporal encoding fully encodes only some of the images to be transmitted in order to reconstitute a video. The H264 standard contains the image types known and defined in the MPEG2 standard, namely:

- I (Intra) images, whose coding does not depend on any other image,

- P (Predictive) images, whose coding depends on previously received images,

- B (Bi-predictive) images, which depend on images received previously and/or later.

However, implementing such an encoding solution poses a number of difficulties when the display of the server has to be transferred to the client terminal in real time.

Such a coding mode is very expensive in terms of time and computing resources. To save bandwidth, the data must be compressed as much as possible. This high degree of compression makes the encoding very complex. The server must not only perform image compression but must also perform many calculations to determine the addresses and data to be encoded. This overconsumption makes it difficult to run other applications on the same server.

In this context, the present invention aims to provide a spatio-temporal video coding system that reduces the encoding effort, so that a client-server protocol can be used in real time while leaving enough resources on the server in charge of encoding to run other applications.

To this end, the invention proposes a video coding system for encoding successive images of a video sequence, the coding of at least one current image being performed relative to at least one previous and/or subsequent image of said video sequence, said coding system comprising:

- an input data reception module for receiving said current image to be coded;

- means for dividing said current image into macroblocks;

- a module for estimating motion vectors as a function of the macroblocks of said current image and of said at least one previous and/or subsequent image;

- a motion compensation module receiving motion vectors and providing at least one predicted zone;

said coding system being characterized in that said data reception module further receives a real motion vector of at least one displaced zone of said current image, said coding system comprising:

- means for allocating said real motion vector to the macroblocks belonging to said displaced zone;

- means for transmitting said real motion vector directly to said compensation module without estimation of motion vectors by said estimation module for said macroblocks belonging to said displaced zone.

The term macroblock denotes a rectangular elementary region of the image having a size between 4x4 and 16x16 pixels (with intermediate sizes such as 8x16, 8x8, ...). Each macroblock itself consists of luminance blocks and chrominance blocks. Motion estimation in the context of spatio-temporal coding is an operation that requires very considerable computing power. The system according to the invention makes it possible to dispense with part of this estimation by advantageously using an already existing motion vector. Thanks to the invention, providing the motion vector of a zone that has been displaced (typically a rectangle within a frame) makes it unnecessary to calculate the motion vectors for the macroblocks located in such a displaced zone. The real motion vector is injected directly at the input of the compensation module.

Thus, the encoding effort is significantly reduced compared to a conventional spatio-temporal encoding. The coding system finds a particularly interesting application in the case where the movement of the zone is initiated at a client terminal connected to a server via a VNC protocol, the rendering of the displacement being displayed on the screen of the terminal. The coding by the system according to the invention is performed at the server, and the real motion vector of the displaced zone is provided by a programming interface of the graphical environment of the server.

In addition to the reduced encoding effort, it will be noted that, thanks to the invention, the rendering will be better since the system works, at least in part, with real motion vectors rather than estimated ones. Typically, such a real motion vector can be obtained for a zone undergoing a displacement in an application context such as:

- horizontal or vertical scrolling in the displaced zone with a browser-type application;

- moving a graphical window of the operating system of the server;

- transition from one slide to another in the case of a slide show;

- Flash or Silverlight animation.

The system according to the invention may also have one or more of the following characteristics, considered individually or in any technically possible combination:

- the system according to the invention comprises means for transmitting only the macroblocks not belonging to said displaced zone to said motion vector estimation module;

- the system according to the invention comprises: a subtractor for computing the difference between the pixels of the current image and the predicted zone and providing a residual error corresponding to this difference; a frequency transform module applying a frequency transform to each macroblock processed by said estimation module as well as to said residual error; a module for quantizing data from said frequency transform module; an entropy coder for coding data from said quantization module.

The present invention also relates to a video coding method for coding successive images of a video sequence, the coding of at least one current image being performed relative to at least one previous and/or subsequent image of said video sequence, said method comprising the following steps:

- receiving said current image to be coded and a real motion vector of at least one displaced zone of said current image,

- dividing said current image into macroblocks,

- assigning said real motion vector to the macroblocks belonging to said displaced zone,

- estimating motion vectors as a function of the macroblocks of said current image and of said at least one previous and/or subsequent image, said estimation being made only from the macroblocks not belonging to said displaced zone,

said current image to be coded being transmitted from a server to a client terminal, the coding being performed at the server and said real motion vector of at least one displaced zone of said current image being provided by a programming interface of the graphical environment of said server.

The method according to the invention may also have one or more of the following characteristics, considered individually or in any technically possible combination:

- said video coding is an H264 spatio-temporal coding;

- the screen of said server is displayed by said client terminal on a screen according to an RFB ("Remote Frame Buffer") protocol such as the VNC ("Virtual Network Computing") protocol;

- said real motion vector of said displaced zone is determined in the following cases:
  o horizontal or vertical scrolling of said displaced zone with a browser-type application;
  o moving a graphical window of the operating system of said server;
  o transition from one slide to another in the case of a slide show;
  o Flash-type animation;

- said client terminal is a video decoder;

- said current image as well as said real motion vector are initially coded in an RGB format and then undergo a transformation into a YUV format;

- said real motion vector is a vector with two or three dimensions.

Brief description of the figures

Other characteristics and advantages of the invention will emerge clearly from the description given below, by way of indication and in no way limiting, with reference to the appended FIG. 1, which is a simplified schematic representation of a coding system according to the invention for implementing the coding method according to the invention.

Description of the preferred embodiments of the invention

FIG. 1 represents a coding system 100 according to the invention. The coding system 100 comprises:

- a module 101 for receiving input data,

- a module 102 for processing the input data,

- a motion estimation module 105 (also hereinafter referred to as a motion vector estimation module),

- a motion compensation module 106,

- a subtractor 109 and an adder 110,

- a frequency transform module 112 and an inverse frequency transform module 115,

- a quantization module 113 and an inverse quantization module 114,

- a filter 116,

- a buffer 111,

- a reordering module 108,

- an entropy coder 120.

The invention applies to the field of video data broadcasting by a server to a client terminal. The server, generally a computer, is connected to the client terminal, for example a video decoder, via a network, for example HDMI ("High Definition Multimedia Interface"), WiFi or Ethernet. The computer screen can then be displayed by the client terminal on a television screen according to a "Remote Frame Buffer" type protocol, for example VNC ("Virtual Network Computing"). The server encodes what it broadcasts before sending it to the client terminal. The encoding performed by the server is a spatio-temporal encoding such as H264: it is therefore the server that integrates the coding system 100 according to the invention.

The reception module 101 receives as input a predictive image F_n. F_n is the current image of the entire server screen. It should be noted that the invention relates only to the coding of predictive images, the intra-predictive coding of I images continuing to be done according to known techniques. Thus, to make the diagram clearer, the means necessary for intra-predictive coding have been deliberately omitted.

The image F_n is generally in YUV12 format after undergoing an RGB-to-YUV transformation.
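The RGB-to-YUV transformation mentioned here can be illustrated by the following minimal sketch. The conversion coefficients (classic BT.601-style weights with a +128 chrominance offset), the function name rgb_to_yuv420 and the averaging of each 2x2 group of chrominance samples are assumptions made for illustration; the description does not specify which conversion matrix or subsampling filter is used.

```python
def rgb_to_yuv420(rgb):
    """Convert an RGB image to three planes in a 4:2:0 layout.

    rgb is a list of rows, each row a list of (r, g, b) tuples; width and
    height are assumed to be even. Coefficients are an assumption
    (BT.601-style weights), not taken from the description.
    """
    h, w = len(rgb), len(rgb[0])
    y_plane = [[0.0] * w for _ in range(h)]
    u_full = [[0.0] * w for _ in range(h)]
    v_full = [[0.0] * w for _ in range(h)]
    for j in range(h):
        for i in range(w):
            r, g, b = rgb[j][i]
            y = 0.299 * r + 0.587 * g + 0.114 * b   # luminance
            y_plane[j][i] = y
            u_full[j][i] = 0.492 * (b - y) + 128    # blue-difference chrominance
            v_full[j][i] = 0.877 * (r - y) + 128    # red-difference chrominance

    def subsample(plane):
        # One chrominance sample per 2x2 group of pixels (4:2:0).
        return [[(plane[j][i] + plane[j][i + 1]
                  + plane[j + 1][i] + plane[j + 1][i + 1]) / 4
                 for i in range(0, w, 2)]
                for j in range(0, h, 2)]

    return y_plane, subsample(u_full), subsample(v_full)
```

For a 4x2 image, the luminance plane keeps 4x2 samples while each chrominance plane is reduced to 2x1 samples, i.e. one chrominance sample for four pixels.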

The reception module 101 also receives as input information on the zones that have undergone a displacement (also called displaced zones in the remainder of the description) in the image F_n. A displaced zone is a rectangular zone generally represented by a quadruplet (x, y, l, h): x and y represent respectively the abscissa and the ordinate of the top-left point of the zone, l represents the width of the rectangle and h its height. The information received by the server concerning each displaced zone consists of the real motion vector m = (mx, my)^T of this displaced zone, mx and my being the horizontal and vertical components of the real motion vector and T designating the transposition operator. Typically, this real vector can be obtained by the server via the programming interfaces of its graphical environment, also called APIs (Application Programming Interfaces) for the graphical user interface (GUI) of the software application running on the server and used by the client terminal, or of the operating system of the server, Windows™ for example.
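As an illustration of the input data described in this paragraph, a displaced zone and its real motion vector could be represented as follows. This is only a sketch: the class name DisplacedZone, the field names and the contains helper are hypothetical and do not come from the description, which only specifies the quadruplet (x, y, l, h) and the vector (mx, my).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DisplacedZone:
    """Hypothetical container for a displaced zone and its real motion vector."""
    x: int        # abscissa of the top-left point of the zone
    y: int        # ordinate of the top-left point of the zone
    width: int    # "l" in the quadruplet (x, y, l, h)
    height: int   # "h" in the quadruplet (x, y, l, h)
    mx: int       # horizontal component of the real motion vector, in pixels
    my: int       # vertical component of the real motion vector, in pixels

    def contains(self, px: int, py: int) -> bool:
        """Return True if pixel (px, py) lies inside the rectangle."""
        return (self.x <= px < self.x + self.width
                and self.y <= py < self.y + self.height)


# Example: a 64x48 zone whose top-left corner is at (16, 32), moved 8 pixels up.
zone = DisplacedZone(x=16, y=32, width=64, height=48, mx=0, my=-8)
```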

This real motion vector is known to the software application, since the latter initiates the displacement of the zone following an event (typically an event generated by a mouse click, a mouse movement or a keystroke) from the end user via the client terminal.

However, in order to have the scale of this vector and express it in number of pixels, it may be necessary to access the APIs of the lower software layers. The system 100 will therefore preferably rely on the operating system layer (Windows™) to recover the real vector in order to implement this software encoding accelerator, regardless of the applications that will benefit from it. As an example, one can use a JavaScript function of the window.scrollBy(x-coord, y-coord) type of the DOM window object, which will be called when the "Down arrow" key is pressed at the client terminal: the function can provide the modulus of the motion vector, ‖m‖ = √(mx² + my²), the direction of the vector being vertically downwards.

The size of the rectangle can also be obtained by functions of the "window.innerHeight" and "window.innerWidth" type.
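The way a scroll event and the viewport size could be combined into a displaced zone and a motion vector is sketched below. The function name displaced_zone_from_scroll, the origin parameter and the choice to reuse the scroll amounts directly as (mx, my) are assumptions for illustration; the description does not fix a sign convention or say where the viewport sits on the server screen.

```python
import math


def displaced_zone_from_scroll(dx, dy, inner_width, inner_height, origin=(0, 0)):
    """Build the quadruplet (x, y, l, h), the vector m and its modulus.

    dx and dy play the role of the scroll amounts in pixels, inner_width and
    inner_height the role of the viewport dimensions. The origin of the
    viewport on the server screen is a hypothetical parameter.
    """
    x, y = origin
    zone = (x, y, inner_width, inner_height)
    m = (dx, dy)                      # assumed convention: m equals the scroll amounts
    modulus = math.hypot(dx, dy)      # sqrt(mx^2 + my^2)
    return zone, m, modulus


# Example: a 40-pixel vertical scroll in a 1280x720 viewport.
zone, m, modulus = displaced_zone_from_scroll(0, 40, 1280, 720)
```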

In any case, the server can obtain values characterizing the real movement vector of the zone moved by the user via the client terminal.

Typically, it is possible, for example, to obtain such a real motion vector for a zone undergoing a displacement in an application context such as:

- horizontal or vertical scrolling in the displaced zone with a browser-type application;

- a displacement of a graphical window of the operating system of the server;

- a transition from one slide to another in the case of a slide show;

- an animation of the Flash or Silverlight type.

The motion vector m = (mx, my)^T, coded in RGB format, is also transformed into YUV12 format.

The input data processing module 102 comprises:

- means 103 for dividing the current image F_n into macroblocks,

- means 104 for allocating the real motion vector m to the macroblocks belonging to the displaced zone,

- means 118 for transmitting the real motion vector directly to the compensation module 106, without estimation of the motion vectors by the estimation module 105, for the macroblocks belonging to the displaced zone,

- means 119 for transmitting only the macroblocks not belonging to the displaced zone to the motion estimation module 105.

In this way, part of the motion vector calculation is saved for the macroblocks to which the means 104 have already allocated a real motion vector by virtue of their belonging to a displaced zone. Thus, each current image F_n to be encoded is divided by the means 103 into macroblocks, each corresponding to a rectangular elementary region of the image having a variable size between 4x4 and 16x16 pixels (with intermediate sizes such as 8x16, 8x8, ...).

The means 104, knowing the displaced zones of the image F_n as well as their real motion vectors, attribute the same real motion vector to all the macroblocks belonging to a given displaced zone. The means 119 therefore route only the macroblocks not affected by a displaced zone to the motion estimation module 105, the real motion vectors of the other macroblocks being transmitted directly to the motion compensation module 106 via the means 118.
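The routing performed by the means 103, 104, 118 and 119 can be sketched as follows. The function name route_macroblocks, the fixed 16x16 macroblock size and the rule that a macroblock belongs to a zone only when it lies entirely inside the rectangle are assumptions; the description does not say how macroblocks that partially overlap a displaced zone are handled.

```python
def route_macroblocks(width, height, zones, mb=16):
    """Split an image into macroblocks and route them.

    zones is a list of (x, y, l, h, (mx, my)) tuples. Returns a pair:
      - real_vectors: {(bx, by): (mx, my)} for macroblocks inside a displaced
        zone (sent directly to motion compensation),
      - to_estimate: [(bx, by), ...] for macroblocks that still need
        motion estimation.
    """
    real_vectors = {}
    to_estimate = []
    for by in range(0, height, mb):
        for bx in range(0, width, mb):
            for (x, y, l, h, m) in zones:
                # Assumption: the macroblock must be fully inside the zone.
                if x <= bx and bx + mb <= x + l and y <= by and by + mb <= y + h:
                    real_vectors[(bx, by)] = m
                    break
            else:
                to_estimate.append((bx, by))
    return real_vectors, to_estimate


# Example: a 64x32 image with one 32x32 zone moved 8 pixels up.
real_vectors, to_estimate = route_macroblocks(64, 32, [(16, 0, 32, 32, (0, -8))])
```

In this example, four of the eight macroblocks receive the real vector and bypass estimation, so only the remaining four are passed to the estimation step.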

The function of the motion estimation module 105 is to find a macroblock of the current image F_n in at least one previous image F_n-1 of the server screen in its entirety (it could also be a subsequent image in the case of a B image, or even a plurality of previous and/or subsequent images). When a part of a previous image is found that resembles the macroblock (according to a least-squares criterion, for example), a motion vector is deduced that corresponds to the difference between the position of the selected region and that of the macroblock.
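The search described in this paragraph can be illustrated by an exhaustive block-matching sketch using a sum of squared differences as the least-squares criterion. The exhaustive search strategy, the search radius and the function names ssd and estimate_vector are assumptions; real encoders generally use faster search patterns, and the description does not impose one.

```python
def ssd(cur, ref, bx, by, rx, ry, n):
    """Sum of squared differences between the n x n block of cur at (bx, by)
    and the n x n region of ref at (rx, ry)."""
    total = 0
    for j in range(n):
        for i in range(n):
            d = cur[by + j][bx + i] - ref[ry + j][rx + i]
            total += d * d
    return total


def estimate_vector(cur, ref, bx, by, n=4, search=2):
    """Exhaustive search within +/- search pixels around (bx, by).

    Returns ((dx, dy), cost), where (dx, dy) is the position of the selected
    region in ref minus the position of the macroblock in cur.
    """
    h, w = len(ref), len(ref[0])
    best, best_cost = (0, 0), ssd(cur, ref, bx, by, bx, by, n)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rx, ry = bx + dx, by + dy
            # Skip candidate regions that fall outside the reference image.
            if 0 <= rx and rx + n <= w and 0 <= ry and ry + n <= h:
                cost = ssd(cur, ref, bx, by, rx, ry, n)
                if cost < best_cost:
                    best, best_cost = (dx, dy), cost
    return best, best_cost
```

The nested loops make the cost of this step grow with the block size and the square of the search radius, which is the computation that is avoided for macroblocks that already carry a real motion vector.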

The motion vectors retained by the estimation module (in addition to the real motion vectors transmitted by the means 118) are transmitted to the motion compensation module 106. A prediction error then arises because the region retained in the past image is not exactly equal to the macroblock analyzed. At the output of the motion compensation module 106, a predicted image P is obtained. The subtractor 109 then calculates a residual error D_n between the pixels of F_n and the predicted image P.
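Motion compensation and the computation of the residual error can be sketched per macroblock as follows. The function names predict_block and residual_block and the block-wise formulation are assumptions for illustration; the vector is taken as the position of the reference region minus the position of the macroblock, whether it comes from estimation or is a real motion vector.

```python
def predict_block(ref, bx, by, vector, n):
    """Return the n x n predicted block for the macroblock at (bx, by),
    read from the reference image at the position shifted by vector."""
    dx, dy = vector
    return [[ref[by + dy + j][bx + dx + i] for i in range(n)]
            for j in range(n)]


def residual_block(cur, pred, bx, by, n):
    """Return the residual D_n = F_n - P for the macroblock at (bx, by)."""
    return [[cur[by + j][bx + i] - pred[j][i] for i in range(n)]
            for j in range(n)]


# Example: content shifted one pixel to the right between ref and cur,
# with one pixel of cur changed afterwards.
ref = [[j * 8 + i for i in range(8)] for j in range(8)]
cur = [[ref[j][max(i - 1, 0)] for i in range(8)] for j in range(8)]
cur[3][3] += 5
pred = predict_block(ref, 2, 2, (-1, 0), 4)
res = residual_block(cur, pred, 2, 2, 4)
```

Here the residual is zero everywhere except at the modified pixel, which is what makes the residual cheap to encode when the vector is accurate.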

A frequency transform (of the discrete cosine transform, DCT, type or of the Hadamard transform type) is applied via the frequency transform module 112 to each macroblock that has undergone motion estimation, as well as to the residual error D_n. This transform provides a frequency representation of the modified zones.
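A floating-point two-dimensional DCT on a square block can be sketched as below. This is a textbook orthonormal DCT-II given for illustration only; H264 encoders in practice use an integer approximation of this transform, and the function names are hypothetical.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II basis matrix of size n x n."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def dct2(block):
    """Forward 2D DCT of a square block: C * block * C^T."""
    c = dct_matrix(len(block))
    return matmul(matmul(c, block), transpose(c))


def idct2(coeffs):
    """Inverse 2D DCT: C^T * coeffs * C."""
    c = dct_matrix(len(coeffs))
    return matmul(matmul(transpose(c), coeffs), c)
```

For a uniform block, all the energy ends up in the single top-left (DC) coefficient, which illustrates why smooth residuals compress well after the transform.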

The data from the frequency transform module 112 are then quantized (i.e. coded on a limited number of bits) by the quantization module 113 to provide transformed and quantized parameters X. The function of the quantization module 113 is to define different quantization steps depending on whether or not certain components are judged to be visually significant; these quantization steps are defined in a quantization step table.
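Quantization with a step table can be sketched as a division of each coefficient by its step followed by rounding. The step values used in the example and the function names quantize and dequantize are purely illustrative assumptions; they are not the tables of any standard.

```python
def quantize(coeffs, steps):
    """Divide each coefficient by its quantization step and round."""
    return [[int(round(c / q)) for c, q in zip(crow, qrow)]
            for crow, qrow in zip(coeffs, steps)]


def dequantize(levels, steps):
    """Inverse operation: multiply each level by its quantization step."""
    return [[lvl * q for lvl, q in zip(lrow, qrow)]
            for lrow, qrow in zip(levels, steps)]


# Hypothetical 2x2 example: larger steps for the components assumed to be
# less visually significant.
coeffs = [[40.0, 7.0], [-9.0, 3.0]]
steps = [[8, 16], [16, 32]]
levels = quantize(coeffs, steps)
approx = dequantize(levels, steps)
```

The small coefficients are rounded to zero, and the dequantized values differ from the originals: this is where the loss of information occurs.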

The inverse quantization module 114 retrieves the transformed and quantized parameters X, which then pass through the inverse frequency transform module 115, which applies an inverse frequency transform to recover a quantized version D'_n of the residual error D_n; this quantized version D'_n is then added to the macroblocks of the predicted zone P by the adder 110; the image at the output of the adder 110 is then processed by the deblocking filter 116 to provide a reconstructed image F'_n corresponding to a set of reconstructed zones having the same position, the same width and the same height as the modified zones. F'_n is used internally by the coding system 100 to estimate the quality of the encoding.
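The addition performed by the adder 110 and a possible quality estimate can be sketched as follows. Clipping to the 8-bit range and the use of PSNR as the quality measure are assumptions: the description does not say which metric is used, and the deblocking filter is omitted from this sketch.

```python
import math


def reconstruct(pred, residual_q):
    """Add the quantized residual D'_n to the prediction P, clipping to 0..255."""
    return [[min(255, max(0, p + d)) for p, d in zip(prow, drow)]
            for prow, drow in zip(pred, residual_q)]


def psnr(original, reconstructed, peak=255):
    """Peak signal-to-noise ratio in dB (assumed quality metric)."""
    count = 0
    squared_error = 0
    for orow, rrow in zip(original, reconstructed):
        for o, r in zip(orow, rrow):
            squared_error += (o - r) ** 2
            count += 1
    if squared_error == 0:
        return math.inf
    return 10 * math.log10(peak * peak * count / squared_error)


# Example with a 2x2 block.
pred = [[100, 250], [0, 50]]
residual_q = [[5, 10], [-3, 0]]
recon = reconstruct(pred, residual_q)
```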

The quantized results X from the quantization module 1 13 are then reordered by the reordering module 108 to group together the non-zero coefficients so as to allow an efficient representation of the other coefficients having a zero value.
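One common way of grouping the non-zero coefficients is a zigzag scan of each block followed by a count of the trailing zeros. The zigzag order is an assumption borrowed from usual practice in block-based video coding; the description does not name the scan performed by the reordering module 108, and the function names are hypothetical.

```python
def zigzag_order(n):
    """Return the (row, col) positions of an n x n block in zigzag order."""
    order = []
    for s in range(2 * n - 1):
        diagonal = [(r, s - r) for r in range(n) if 0 <= s - r < n]
        # Alternate the direction of travel on each anti-diagonal.
        order.extend(diagonal if s % 2 else reversed(diagonal))
    return order


def zigzag_scan(block):
    """Flatten a square block into a list following the zigzag order."""
    return [block[r][c] for r, c in zigzag_order(len(block))]


def split_trailing_zeros(values):
    """Return (values without trailing zeros, number of trailing zeros)."""
    end = len(values)
    while end > 0 and values[end - 1] == 0:
        end -= 1
    return values[:end], len(values) - end


# Example: quantized 4x4 block with three non-zero low-frequency coefficients.
block = [[5, 3, 0, 0],
         [-1, 0, 0, 0],
         [0, 0, 0, 0],
         [0, 0, 0, 0]]
head, zeros = split_trailing_zeros(zigzag_scan(block))
```

After the scan, the three non-zero values come first and the thirteen zeros that follow can be represented by a single count.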

The data then undergo a final compression phase of entropy coding via the entropy coder 120. The function of the entropy coder is to re-encode the data differently in order to reduce the number of bits necessary for their encoding, coming as close as possible to the theoretical minimum number of bits (which is fixed by the entropy).
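The theoretical minimum mentioned here is given by the Shannon entropy of the symbols to be coded. The sketch below only computes this bound from observed symbol frequencies; it is not an entropy coder, and the function name entropy_bits_per_symbol is hypothetical.

```python
import math
from collections import Counter


def entropy_bits_per_symbol(symbols):
    """Shannon entropy of a sequence, in bits per symbol.

    Probabilities are estimated from the frequencies in the sequence itself.
    """
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Example: a sequence dominated by zeros needs far less than one byte per symbol.
levels = [0, 0, 0, 0, 0, 0, 5, 3]
bound = entropy_bits_per_symbol(levels) * len(levels)  # lower bound in bits
```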

The entropy encoder 120 constructs an output stream φ in a Network Abstraction Layer (NAL) format defined to allow the use of the same video syntax in many network environments.

Note that the means and modules described above can be either software or made with specific electronic circuits.

Of course, the invention is not limited to the embodiment just described. In particular, the invention has been more particularly described in the context of H264 coding, but it applies to any type of spatio-temporal coding: this is for example the case of MPEG2 coding or VC1 coding (a video compression standard of the SMPTE, "Society of Motion Picture and Television Engineers").

Note further that the motion vector has been described as a two-dimensional vector but it is also possible to use a three-dimensional motion vector, for example in the case of a graphical interface such as Aero ™ which is the graphical interface of Windows Vista ™ for displaying 3D effects.

Finally, any means can be replaced by equivalent means.

Claims

1. A video coding system (100) for coding successive images of a video sequence, the coding of at least one current image (F_n) being performed relative to at least one previous and/or subsequent image (F_n-1) of said video sequence, said coding system (100) comprising:

- a module (101) for receiving input data, for receiving said current image (F_n) to be coded,

- means (103) for dividing said current image (F_n) into macroblocks,

- a module (105) for estimating motion vectors as a function of the macroblocks of said current image (F_n) and of said at least one previous and/or subsequent image (F_n-1),

- a motion compensation module (106) receiving motion vectors and providing at least one predicted zone (P),

said coding system (100) being characterized in that said data reception module (101) further receives a non-estimated real motion vector of at least one displaced zone of said current image (F_n), said coding system (100) comprising:

- means (104) for allocating said non-estimated real motion vector to the macroblocks belonging to said displaced zone;

- means (118) for transmitting said non-estimated real motion vector directly to said compensation module (106) without estimation of the motion vectors by said estimation module (105) for said macroblocks belonging to said displaced zone.

2. Video coding system (100) according to the preceding claim, characterized in that it comprises means (119) for transmitting only the macroblocks not belonging to said displaced zone to said motion vector estimation module (105).

3. Video coding system (100) according to one of the preceding claims, characterized in that it comprises:

- a subtractor (109) for computing the difference between the pixels of the current image (F_n) and the predicted zone and providing a residual error (D_n) corresponding to this difference,

- a frequency transform module (112) applying a frequency transform to each macroblock processed by said estimation module (105) and to said residual error (D_n),

- a module (113) for quantizing data from said frequency transform module (112),

- an entropy coder (120) for coding data from said quantization module (113).

4. A video coding method for coding successive images of a video sequence, the coding of at least one current image (F_n) being performed relative to at least one previous and/or subsequent image (F_n-1) of said video sequence, said method comprising the steps of:

- receiving said current image (F_n) to be coded and a non-estimated real motion vector of at least one displaced zone of said current image (F_n),

- dividing said current image into macroblocks,

- assigning said non-estimated real motion vector to the macroblocks belonging to said displaced zone,

- estimating motion vectors as a function of the macroblocks of said current image and of said at least one previous and/or subsequent image, said estimation being made only from the macroblocks not belonging to said displaced zone,

said current image to be coded being transmitted from a server to a client terminal, the coding being performed at the server and said non-estimated real motion vector of at least one displaced zone of said current image being provided by a programming interface of the graphical environment of said server.

5. Method according to the preceding claim, characterized in that the screen of said server is displayed by said client terminal on a screen according to an RFB ("Remote Frame Buffer") protocol such as the VNC ("Virtual Network Computing") protocol.

6. Method according to one of claims 4 or 5, characterized in that said video coding is an H264 spatio-temporal coding.

7. Method according to one of claims 4 to 6, characterized in that said real motion vector of said displaced zone is determined in the following cases:

- horizontal or vertical scrolling of said displaced zone with a browser-type application;

- moving a graphical window of the operating system of said server;

- transition from one slide to another in the case of a slide show;

- Flash-type animation.

8. Method according to one of claims 4 to 7, characterized in that said client terminal is a video decoder.

9. Method according to one of claims 4 to 8, characterized in that said current image as well as said real motion vector are initially coded in an RGB format and then undergo a transformation into a YUV format.

10. Method according to one of claims 4 to 9, characterized in that said real motion vector is a vector with two or three dimensions.