EP1433333A1

EP1433333A1 - Method and device for coding a scene

Info

Publication number: EP1433333A1
Application number: EP02791510A
Authority: EP
Inventors: Paul Kerbiriou; Gwenaël Kervella; Laurent Blonde; Michel Kerdranvat
Original assignee: Thomson Licensing SAS
Current assignee: THOMSON LICENSING
Priority date: 2001-07-27
Filing date: 2002-07-24
Publication date: 2004-06-30
Also published as: FR2828054A1; US20040258148A1; JP2004537931A; FR2828054B1; WO2003013146A1

Abstract

The invention concerns a method for coding a scene consisting of objects whose textures are defined from images or parts of images derived from different video sources (1_1, ..., 1_n), characterised in that it comprises the following steps: spatial composition (2) of an image by dimensioning and positioning on an image said images or parts of images derived from different video sources, to obtain a composite image; coding (3) the composite image; calculating and coding auxiliary data (4) containing data concerning the composition of the composite image and data concerning the textures of the objects.

Description

METHOD AND DEVICE FOR CODING A SCENE

The invention relates to a method and a device for coding and decoding a scene composed of objects whose textures come from different video sources.

More and more multimedia applications require several items of video information to be exploited at the same instant.

Multimedia broadcasting systems are generally based on the transmission of video information, either via separate elementary streams, or via a transport stream multiplexing the different elementary streams, or a combination of the two.

This video information is received by a terminal or receiver made up of a set of elementary decoders simultaneously performing the decoding of each of the elementary streams received or demultiplexed. The final image is composed from the decoded information. This is for example the case of the transmission of streams of MPEG 4 coded video data.

This type of advanced multimedia system attempts to offer great flexibility to the end user by offering possibilities for composing several streams and for interactivity at the terminal level. The extra processing is actually quite significant if we consider the complete chain, from the generation of simple streams to the rendering of a final image. It concerns all the levels of the chain: coding, addition of inter-stream synchronization elements and packetization, multiplexing, demultiplexing, taking into account of inter-stream synchronization elements and depacketization, decoding.

Instead of a single video image, all the elements that will make up the final image must be transmitted, each in an elementary stream.

It is the composition system, upon reception, which produces the final image of the scene to be viewed according to the information defined by the content creator. A great complexity of management at the system level or at the processing level (preparation of the context and data, presentation of the results ...) is therefore generated.

Other systems are based on the generation of mosaics of images in post-production, that is to say before their transmission. This is for example the case for services such as program guides. The image thus obtained is coded and broadcast, for example in MPEG2 standard.

The first systems therefore require the management of numerous data streams at both the transmission and reception levels. It is not possible to achieve, in a simple way, a local composition or "scene" from several videos. Expensive devices such as decoders, and complex management of these decoders, must be put in place to exploit these streams. The number of decoders can depend on the different types of coding used for the data received on each of the streams, but also on the number of video objects that can compose the scene. The processing time of the received signals, due to centralized management of the decoders, is not optimized. The management and processing of the images obtained, because of their number, are complex.

As for the image mosaic technique on which the other systems are based, it offers few possibilities for composition and interaction at the terminal level and leads to too great rigidity.

The invention aims to overcome the aforementioned drawbacks.

Its subject is a method of coding a scene made up of objects whose textures are defined from images or parts of images from different video sources, characterized in that it comprises the steps:

- spatial composition of an image by sizing and positioning on an image, said images or parts of images from the different video sources, to obtain a composed image,

- coding of the composed image,

- calculation and coding of auxiliary data comprising information relating to the composition of the composed image and information relating to the textures of the objects.

According to a particular implementation, the composite image is obtained by spatial multiplexing of the images or parts of images.

According to a particular implementation, the video sources from which the images or parts of images composing the same composed image are selected, have the same coding standards. The composite image may also include a still image not from a video source. According to a particular implementation, the dimensioning is a reduction in size obtained by subsampling.
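The reduction in size by subsampling mentioned above can be illustrated with a minimal sketch. It assumes an image stored as a list of rows of pixel values and uses plain decimation by an integer factor; the function name `subsample` is hypothetical, and a practical encoder would normally low-pass filter before decimating to limit aliasing.

```python
def subsample(image, factor):
    """Reduce an image (list of rows of pixel values) by keeping
    one pixel out of `factor` in each direction (plain decimation)."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    return [row[::factor] for row in image[::factor]]


# Illustrative data: a 352x288 plane reduced to 176x144.
full_size = [[(x + y) % 256 for x in range(352)] for y in range(288)]
reduced = subsample(full_size, 2)
```

With a factor of 2, a 352x288 image becomes a 176x144 image, so that four such reduced images fit on one image of the original size.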

According to a particular implementation, the composed image is coded according to the MPEG 4 standard and the information relating to the composition of the image consists of the texture coordinates.

The invention also relates to a method for decoding a scene composed of objects, coded from a composite video image grouping images or parts of images from different video sources and from auxiliary data which are composition information of the composite video image and information relating to the textures of the objects, characterized in that it performs the steps of:

- decoding of the video image to obtain a decoded image,

- decoding of the auxiliary data,

- extraction of textures from the decoded image from the auxiliary image composition data,

- mapping of textures onto objects of the scene from the auxiliary data relating to the textures.

According to a particular implementation, the method is characterized in that the extraction of the textures is carried out by spatial demultiplexing of the decoded image.

According to a particular implementation, the method is characterized in that a texture is processed by oversampling and spatial interpolation to obtain the texture to be displayed in the final image viewing the scene.

The invention also relates to a device for coding a scene composed of objects whose textures are defined from images or parts of images from different video sources, characterized in that it comprises:

- a video editing circuit receiving the different video sources for dimensioning and positioning on an image, images or parts of images originating from these video sources, for producing a composite image,

- an auxiliary data generation circuit connected to the video editing circuit to supply information relating to the composition of the composed image and information relating to the textures of the objects,

- a coding circuit for the composed image,

- an auxiliary data coding circuit.

The invention also relates to a device for decoding a scene composed of objects, coded from a composite video image grouping together images or parts of images from different video sources and from auxiliary data which are composition information of the composite video image and information relating to the textures of the objects, characterized in that it comprises:

- a circuit for decoding the composed video image to obtain a decoded image,

- a circuit for decoding the auxiliary data,

- a processing circuit receiving the auxiliary data and the decoded image, for extracting textures from the decoded image from the auxiliary image composition data and for applying textures to objects of the scene from the auxiliary data relating to the textures.

The idea of the invention is to group, on an image, elements or texture elements which are images or parts of images coming from different video sources and necessary for the construction of the scene to be visualized, so as to "transport" this video information on a single image or a limited number of images. A spatial composition of these elements is therefore produced, and it is the overall composite image obtained which is coded, instead of separately coding each video image from the video sources. An overall scene, the construction of which usually requires several video streams, can be constructed from a more limited number of video streams and even from a single video stream transmitting the composed image.

Thanks to the transmission of an image composed in a simple manner and the transmission of associated data describing both this composition and the construction of the final scene, the decoding circuits are simplified and the construction of the scene is carried out in a more flexible manner. Taking a simple example, if, instead of separately coding and transmitting four images in QCIF format (Quarter Common Intermediate Format), that is to say coding and transmitting each of the four images in QCIF format on its own elementary stream, only one image in CIF format (Common Intermediate Format) grouping these four images is transmitted, processing at the coding and decoding levels is simplified and faster, for images of identical coding complexity.
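The four-QCIF-into-one-CIF example can be sketched as follows. This is only an illustration under stated assumptions: images are lists of rows of luminance values, the sub-images are all the same size and are laid out on a regular 2x2 grid, and the names `compose_mosaic` and `placements` are hypothetical. The resolutions 176x144 (QCIF) and 352x288 (CIF) are the standard luminance sizes of these formats.

```python
QCIF_W, QCIF_H = 176, 144   # QCIF luminance resolution
CIF_W, CIF_H = 352, 288     # CIF luminance resolution


def compose_mosaic(sub_images, cols, rows, tile_w, tile_h):
    """Place equally sized sub-images on a regular grid.

    Returns the composite image and, for each sub-image, the rectangle
    (x, y, width, height) it occupies in the composite. These rectangles
    are one possible form of the composition information.
    """
    if len(sub_images) > cols * rows:
        raise ValueError("too many sub-images for the grid")
    composite = [[0] * (cols * tile_w) for _ in range(rows * tile_h)]
    placements = []
    for index, tile in enumerate(sub_images):
        x0 = (index % cols) * tile_w
        y0 = (index // cols) * tile_h
        for y in range(tile_h):
            # Copy one row of the sub-image into its slot in the composite.
            composite[y0 + y][x0:x0 + tile_w] = tile[y]
        placements.append((x0, y0, tile_w, tile_h))
    return composite, placements


# Four flat "sources", each filled with its own value for easy checking.
sources = [[[value] * QCIF_W for _ in range(QCIF_H)] for value in (10, 20, 30, 40)]
composite, placements = compose_mosaic(sources, 2, 2, QCIF_W, QCIF_H)
```

The `placements` list is what a receiver would need in order to cut the sub-images back out of the decoded composite.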

On reception, the image is not simply presented. It is recomposed using transmitted composition information. This makes it possible to present the user with a less frozen image, potentially including an animation resulting from the composition, and to offer him further interactivity, each recomposed object being able to be active.

Management at the receiver is simplified, the data to be transmitted can be compressed further thanks to the grouping of video data on one image, and the number of circuits necessary for decoding is reduced. Optimizing the number of streams minimizes the resources required in relation to the content transmitted.

Other features and advantages of the invention will appear clearly in the following description, given by way of nonlimiting example and made with reference to the appended figures, which represent:

- FIG. 1, a coding device according to the invention,

- FIG. 2, a receiver according to the invention,

- FIG. 3, an example of a composite scene.

FIG. 1 represents a coding device according to the invention. The circuits 1_1 to 1_n symbolize the generation of the various video signals available to the encoder for the coding of a scene to be viewed by the receiver. These signals are transmitted to a composition circuit 2 whose function is to compose an overall image from those corresponding to the signals received. The overall image obtained is called the composite image or mosaic. This composition is defined on the basis of information exchanged with an auxiliary data generation circuit 4. This is the composition information making it possible to define the composed image and thus to extract, at the receiver, the various elements or sub-images composing this image, for example position and shape information in the image, such as the coordinates of the vertices of rectangles if the elements constituting the transmitted image are rectangular, or shape descriptors. This composition information makes it possible to extract textures, and it is thus possible to define a library of textures for the composition of the final scene.

These auxiliary data relate to the image composed by the circuit 2 but also to the final image representing the scene to be viewed at the receiver. This is then graphic information, for example relating to geometric shapes, appearances and the composition of the scene, making it possible to configure a scene represented by the final image. This information defines the elements to be associated with graphic objects for the mapping of textures. It also defines the possible interactivities making it possible to reconfigure the final image. The composition of the image to be transmitted can be optimized according to the textures necessary for the construction of the final scene.

The composite image generated by the composition circuit 2 is transmitted to a coding circuit 3 which performs coding of this image. It is for example an MPEG type coding of the overall image, which is then cut into macroblocks. Limitations can be provided for the motion estimation by reducing the search windows to the dimension of the sub-images, or to the inside of the zones in which the elements are positioned from one image to another, in order to force the motion vectors to point into the same sub-image or coding zone of the element. The auxiliary data from circuit 4 are transmitted to a coding circuit 5 which performs coding of these data.

The outputs of coding circuits 3 and 5 are transmitted to the inputs of a multiplexing circuit 6 which multiplexes the received data, i.e. the video data relating to the composed image and the auxiliary data. The output of the multiplexing circuit is transmitted to the input of a transmission circuit 7 for the transmission of the multiplexed data.
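The restriction of the motion-estimation search window to a sub-image can be sketched as below. This is an illustration only: it assumes square blocks, a symmetric search range in pixels, and rectangular sub-images given as (x, y, width, height); the function name `clamp_search_window` is hypothetical and does not correspond to any particular encoder's interface.

```python
def clamp_search_window(block_x, block_y, block_size, search_range, region):
    """Return the allowed motion-vector range for one block.

    `region` is the (x, y, width, height) rectangle of the sub-image that
    contains the block. The result is (dx_min, dx_max, dy_min, dy_max): the
    displacements for which the reference block stays inside that rectangle.
    """
    rx, ry, rw, rh = region
    dx_min = max(-search_range, rx - block_x)
    dx_max = min(search_range, rx + rw - block_size - block_x)
    dy_min = max(-search_range, ry - block_y)
    dy_max = min(search_range, ry + rh - block_size - block_y)
    return dx_min, dx_max, dy_min, dy_max


# A 16x16 block in the top-right corner of the first 176x144 sub-image:
# it may not search to the right of, or above, its own sub-image.
window = clamp_search_window(160, 0, 16, 16, (0, 0, 176, 144))
```

For the block above, the horizontal displacement is limited to the range -16 to 0 and the vertical displacement to the range 0 to 16, so no motion vector can point into a neighbouring sub-image.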

The composite image is produced from images or parts of images of any shape extracted from video sources but may also contain still images or, in general, any type of representation. Depending on the number of sub-images to be transmitted, one or more composed images can be produced for the same instant, that is to say for a final image of the scene. In the case where the video signals use different standards, these signals can be grouped by standard of the same type for the composition of a composite image. For example, a first composition is made from all the elements to be coded according to the MPEG-2 standard, a second composition from all the elements to be coded according to the MPEG-4 standard, another from the elements to be coded according to the JPEG standard or as GIF images or other, so that a single stream is emitted per type of coding and/or per media type.

The composed image may be a regular mosaic consisting for example of rectangles or sub-images of the same size, or else an irregular mosaic. The auxiliary stream transmits the data corresponding to the composition of the mosaic.
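The grouping of elements by coding standard, with one composite image (and hence one stream) per standard, might be organised as in the following sketch. The element names and the function `group_by_standard` are purely hypothetical; the sketch only shows the bookkeeping, not the composition or coding itself.

```python
from collections import defaultdict


def group_by_standard(elements):
    """Group (name, standard) pairs so that each coding standard
    gets its own list of elements, i.e. its own composite image."""
    groups = defaultdict(list)
    for name, standard in elements:
        groups[standard].append(name)
    return dict(groups)


# Hypothetical elements and the standard each is to be coded with.
elements = [("news", "MPEG-2"), ("sport", "MPEG-4"), ("logo", "JPEG"), ("film", "MPEG-2")]
groups = group_by_standard(elements)
```

Each key of `groups` then corresponds to one composite image to be built and coded with that standard.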

The composition circuit can perform the composition of the overall image from enclosing rectangles or limitation windows defining the elements. Thus a choice of the elements necessary for the final scene is made by the composer. These elements are extracted from images available to the composer from different video streams. A spatial composition is then produced from the selected elements by "placing" them on a global image constituting a single video. The information relating to the positioning of these various elements (coordinates, dimensions, etc.) is transmitted to the auxiliary data generation circuit, which processes it for transmission on the stream.

The composition circuit belongs to the known art. It is for example a professional video editing tool of the "Adobe Premiere" type (Adobe is a registered trademark). Thanks to such a circuit, objects can be extracted from video sources, for example by selecting parts of images, and the images of these objects can be resized and positioned on a global image. A spatial multiplexing is for example carried out to obtain the composite image.

The means of constructing a scene, from which a part of the auxiliary data is generated, also belong to the known art. For example, the MPEG-4 standard uses the VRML language (Virtual Reality Modeling Language) or more precisely the binary language BIFS (BInary Format for Scenes), which makes it possible to define the presentation of a scene, to change it and to update it. The BIFS description of a scene makes it possible to modify the properties of objects and to define their conditional behavior. It follows a hierarchical structure which is a tree description.

The data necessary for the description of a scene concern, among other things, the construction rules, the animation rules for an object, the interactivity rules for another object, and so on. They describe the final scenario. Some or all of this data constitutes the auxiliary data for the construction of the scene.

FIG. 2 represents a receiver for such a coded data stream.

The signal received at the input of the receiver 8 is transmitted to a demultiplexer 9 which separates the video stream from the auxiliary data. The video stream is transmitted to a video decoding circuit 10 which decodes the overall image as it was composed at the coder. The auxiliary data at the output of the demultiplexer 9 are transmitted to a decoding circuit 11 which performs decoding of the auxiliary data. Finally, a processing circuit 12 processes the video data and the auxiliary data coming respectively from the circuits 10 and 11 to extract the elements, the textures necessary for the scene, then to build this scene, the image representing it then being transmitted to the display 13. Either the elements constituting the composed image are systematically extracted from the image, to be used or not, or the construction information of the final scene designates the elements necessary for the construction of this final scene, the recomposition information then extracting only these elements from the composed image.

The elements are extracted, for example, by spatial demultiplexing. They are resized, if necessary, by oversampling and spatial interpolation.
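The extraction by spatial demultiplexing and the resizing by oversampling and spatial interpolation can be sketched as follows. The sketch assumes rectangular elements described by (x, y, width, height) in the decoded image and uses bilinear interpolation as one possible interpolation; the source does not specify which interpolation is used, and the function names are hypothetical.

```python
def extract_texture(decoded, rect):
    """Spatial demultiplexing: crop the rectangle (x, y, width, height)
    out of the decoded composite image."""
    x, y, w, h = rect
    return [row[x:x + w] for row in decoded[y:y + h]]


def upsample_bilinear(texture, out_w, out_h):
    """Oversample a texture to out_w x out_h with bilinear interpolation."""
    in_h, in_w = len(texture), len(texture[0])
    result = []
    for j in range(out_h):
        # Map the output row back to a fractional source row.
        fy = j * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(fy)
        y1 = min(y0 + 1, in_h - 1)
        wy = fy - y0
        row = []
        for i in range(out_w):
            # Map the output column back to a fractional source column.
            fx = i * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(fx)
            x1 = min(x0 + 1, in_w - 1)
            wx = fx - x0
            top = texture[y0][x0] * (1 - wx) + texture[y0][x1] * wx
            bottom = texture[y1][x0] * (1 - wx) + texture[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        result.append(row)
    return result


# Tiny decoded "composite": the right half holds the element of interest.
decoded = [[0, 0, 10, 20],
           [0, 0, 30, 40]]
texture = extract_texture(decoded, (2, 0, 2, 2))
enlarged = upsample_bilinear(texture, 3, 3)
```

Here the 2x2 element is cut out of the decoded image and enlarged to 3x3, the new pixels being weighted averages of their nearest neighbours in the extracted texture.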

The construction information therefore makes it possible to select only a part of the elements constituting the composed image. It also allows the user to "navigate" in the constructed scene in order to view objects of interest. The navigation information from the user is for example transmitted to an input of the circuit 12 (not shown in the figure), which modifies the composition of the scene accordingly.

Obviously, the textures transported by the composed image may not be used directly in the scene. They can, for example, be stored by the receiver for time-shifted use or for building a library used for the construction of the scene.

An application of the invention relates to the transmission of video data in the MPEG-4 standard corresponding to several programs from a single video stream or, more generally, the optimization of the number of streams in an MPEG-4 configuration, for example for a program guide application. Whereas, in a classic MPEG-4 configuration, it is necessary to transmit as many streams as there are videos that can be viewed at the terminal, the method described makes it possible to send a global image containing several videos and to use texture coordinates to build a new scene upon arrival.

FIG. 3 represents an example of a composite scene constructed from elements of a composite image. The global image 14, also called composite texture, is composed of several sub-images or elements or sub-textures 15, 16, 17, 18, 19. The image 20, at the bottom of the figure, corresponds to the scene to be viewed. The positioning of the objects to construct this scene corresponds to the graphic image 21 which represents the graphic objects.

In the case of MPEG-4 coding and according to the prior art, each video or still image corresponding to the elements 15 to 19 is transmitted in a video or still image stream. The graphic data is transmitted in the graphic stream.

In our invention, a global image is composed from the images relating to the different videos or still images to form the composite image 14 represented at the top of the figure. This global image is coded. Auxiliary data relating to the composition of the overall image and defining the geometric shapes (only two shapes 22 and 23 are shown in the figure) are transmitted in parallel, allowing the elements to be separated. The texture coordinates at the vertices, when these fields are used, allow these shapes to be textured from the composite image. Auxiliary data relating to the construction of the scene and defining the graphic image 21 are transmitted.

In the case of MPEG-4 coding of the composite image and according to the invention, the composite texture image is transmitted over the video stream. The elements are coded as video objects and their geometric shapes 22, 23 and texture coordinates at the vertices (in the composite image or the composite texture) are transmitted over the graphic stream. The texture coordinates are the composition information of the composed image. The stream which is transmitted can be coded to the MPEG-2 standard and in this case, it is possible to exploit the functionalities of the circuits of existing platforms integrating the receivers.
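The texture coordinates at the vertices can be derived from the position of an element in the composite image, as in the sketch below. It assumes a rectangular element given in pixels with the origin at the top-left of the composite image, and normalised coordinates whose origin is at the bottom-left, as is the convention for VRML texture coordinates; the function name and the corner ordering are assumptions made for the illustration.

```python
def texture_coordinates(rect, image_w, image_h):
    """Normalised (u, v) coordinates of the four corners of a rectangle.

    `rect` is (x, y, width, height) in pixels, origin at the top-left of
    the composite image. Assumption: v grows upwards from the bottom edge.
    """
    x, y, w, h = rect
    u0, u1 = x / image_w, (x + w) / image_w
    v_top, v_bottom = 1 - y / image_h, 1 - (y + h) / image_h
    # Corner order: bottom-left, bottom-right, top-right, top-left.
    return [(u0, v_bottom), (u1, v_bottom), (u1, v_top), (u0, v_top)]


# Top-right 176x144 element of a 352x288 composite image.
coords = texture_coordinates((176, 0, 176, 144), 352, 288)
```

For the top-right quarter of the composite image, this gives corners running from (0.5, 0.5) to (1.0, 1.0), which a renderer could attach to the vertices of the corresponding shape.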

In the case of a platform capable of decoding more than one MPEG-2 program at a given time, elements supplementing the main programs can be transmitted on an additional MPEG-2 or MPEG-4 video stream. This stream can contain several visual elements such as logos or advertising banners, animated or not, which can be combined with one or other of the programs broadcast, at the choice of the broadcaster. These items can also be displayed based on user preferences or profile. An associated interaction can be provided. Two decoding circuits are used, one for the program, one for the composite image and the auxiliary data. A spatial multiplexing of the program being broadcast with additional information coming from the composed image is then possible.

A single annex video stream can be used for a program package, to supplement several programs or several user profiles.

Claims

1 Method for coding a scene composed of objects whose textures are defined from images or parts of images from different video sources (1_1, ... 1_n), characterized in that it comprises the steps:

- spatial composition (2) of an image by sizing and positioning on an image, said images or parts of images from different video sources, to obtain a composed image,

- coding (3) of the composed image,

- calculation and coding of auxiliary data (4) comprising information relating to the composition of the composed image, the textures of the objects and the composition of the scene.

2 Method according to claim 1, characterized in that the composite image is obtained by spatial multiplexing of the images or parts of images.

3 Method according to claim 1, characterized in that the video sources from which the images or parts of images composing the same composite image are selected, have the same coding standards.

4 Method according to claim 1, characterized in that the composed image also comprises a still image not originating from a video source.

5 Method according to claim 1, characterized in that the dimensioning is a reduction in size obtained by subsampling.

6 Method according to claim 1, characterized in that the composed image is coded according to the MPEG 4 standard and in that the information relating to the composition of the image are the texture coordinates.

7 Method for decoding a scene composed of objects, coded from a composite video image grouping images or parts of images of different video sources and from auxiliary data which are composition information of the composed video image, information relating to the textures of the objects and to the composition of the scene, characterized in that it performs the steps of:

- decoding of the video image (10) to obtain a decoded image,

- decoding of the auxiliary data (11),

- extraction (12) of textures from the decoded image from the auxiliary image composition data,

- mapping of textures (12) onto objects of the scene from auxiliary data relating to the textures and the composition of the scene.

8 Decoding method according to claim 7, characterized in that the extraction of textures is carried out by spatial demultiplexing of the decoded image.

9 Decoding method according to claim 7, characterized in that a texture is processed by oversampling and spatial interpolation to obtain the texture to be displayed in the final image viewing the scene.

10 Device for coding a scene made up of objects whose textures are defined from images or parts of images from different video sources (1_1, ... 1_n), characterized in that it comprises:

- a video editing circuit (2) receiving the different video sources for dimensioning and positioning on an image, images or parts of images from these video sources, for producing a composite image,

- an auxiliary data generation circuit (4) connected to the video editing circuit (2) for providing information relating to the composition of the composed image, the textures of the objects and the composition of the scene,

- a coding circuit (3) of the composite image,

- a coding circuit (5) of the auxiliary data.

11 Device for decoding a scene composed of objects, coded from a composite video image grouping images or parts of images from different video sources and from auxiliary data which is composition information of the composite video image and information relating to the textures of the objects and to the composition of the scene, characterized in that it comprises:

- a circuit for decoding the composite video image to obtain a decoded image (10),

- a circuit for decoding the auxiliary data (11),

- a processing circuit (12) receiving the auxiliary data and the decoded image to extract textures from the decoded image from the auxiliary image composition data and to map textures on objects of the scene from auxiliary data relating to the textures and the composition of the scene.