APPARATUS AND METHOD FOR SEPARATING AUDIO OBJECTS FROM THE
COMBINED AUDIO STREAM
Description

Technical Field
The present invention relates to a terminal apparatus, i.e., a terminal, for separating audio objects from one elementary stream (ES) including the audio objects, and to a method thereof.
Background Art
MPEG-4, a data compression and restoration technology standard defined by the Motion Picture Experts Group (MPEG) to transmit motion pictures at a low transmission rate, makes it possible for a user to control audio and video contents by dividing the contents on an object basis and forming an audio-video (AV) scene. For this, MPEG-4 defines object descriptors (OD) for describing attributes of objects, elementary stream descriptors (ESD) for describing characteristics of compressed audio and video streams, and the Binary Format for Scenes (BIFS) for describing an audio scene to be formed. Herein, one elementary stream descriptor (ESD) can describe the characteristics of only one elementary stream (ES) including one audio or video object. Meanwhile, although an object descriptor can include two or more elementary stream descriptors, this occurs only in selective cases, such as scalable streams and multiple languages.
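The descriptor relationship above can be pictured with a simplified, non-normative model: one ESD describes exactly one elementary stream, while an OD may carry several ESDs only in the selective cases mentioned. The class and field names below are illustrative assumptions, not the normative MPEG-4 Systems syntax.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ElementaryStreamDescriptor:
    # One ESD describes the characteristics of exactly one elementary stream.
    es_id: int
    stream_type: str  # e.g. "audio" or "visual"

@dataclass
class ObjectDescriptor:
    # An OD may reference several ESDs, but only in selective cases
    # (scalable streams, multiple languages).
    od_id: int
    es_descriptors: List[ElementaryStreamDescriptor] = field(default_factory=list)

# A typical case: one object descriptor pointing at one audio elementary stream.
od = ObjectDescriptor(od_id=1)
od.es_descriptors.append(ElementaryStreamDescriptor(es_id=10, stream_type="audio"))
```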
Therefore, when two or more objects are included in one elementary stream, a stream that goes against the definitions of the object descriptor and the elementary stream descriptor in MPEG-4 is generated. For example, Binaural Cue Coding (BCC) multiplexes two or more audio objects into one stream by compressing them into one combined mono audio signal and additional binaural cue parameters. As described above, since one elementary stream includes two or more objects, which goes against the definitions of object descriptors and elementary stream descriptors in MPEG-4, there is a problem that an audio scene is not formed in a receiving terminal.
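The BCC principle described above can be sketched in a greatly simplified form: two audio objects are reduced to one mono downmix plus a per-object level cue, and the receiver re-synthesizes an approximation of each object from that pair. Real BCC operates on time-frequency tiles with several cue types; this toy version, an assumption for illustration only, uses whole signals and a single energy-share cue.

```python
def bcc_encode(obj_a, obj_b):
    # Mono downmix shared by both objects.
    downmix = [(a + b) / 2.0 for a, b in zip(obj_a, obj_b)]
    energy = lambda x: sum(s * s for s in x) or 1e-12
    # Crude level cue: object A's share of the total energy.
    ea, eb = energy(obj_a), energy(obj_b)
    cue = ea / (ea + eb)
    return downmix, cue

def bcc_decode(downmix, cue):
    # Re-synthesize approximations by scaling the shared downmix.
    obj_a = [2.0 * cue * s for s in downmix]
    obj_b = [2.0 * (1.0 - cue) * s for s in downmix]
    return obj_a, obj_b

mix, cue = bcc_encode([1.0, 0.5, 0.0], [0.0, 0.5, 1.0])
a, b = bcc_decode(mix, cue)
```

Note that the reconstruction is only approximate per object, but the sum of the decoded objects matches the sum of the originals, which is what the shared downmix preserves.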
Disclosure

Technical Problem
It is, therefore, an object of the present invention to suggest a node structure that can separate a plurality of audio objects compressed into one elementary stream by using the Binary Format for Scenes (BIFS), a terminal capable of separating the audio objects from the combined audio stream by using the suggested node structure, and a method thereof.
Other objects and advantages of the present invention can be understood from the following description and the embodiments of the present invention. Also, the objects and advantages of the present invention can be easily realized by the means as claimed and combinations thereof.
Technical Solution
In accordance with one aspect of the present invention, there is provided a terminal including a decoder for decoding an elementary stream having a plurality of audio objects compressed therein to thereby produce a decoded audio stream; and a compositor for separating the audio objects from the decoded audio stream by using an audio object separation node, and forming an audio scene by using the separated audio objects, wherein the audio object
separation node includes: a field for describing the number of audio objects to be separated; and a field for describing whether to perform object separation.
In accordance with another aspect of the present invention, there is provided a method for separating audio objects from a combined audio stream, including the steps of: a) decoding an elementary stream having a plurality of audio objects compressed therein to thereby produce a decoded audio stream; and b) separating the audio objects from the decoded audio stream by using an audio object separation node, wherein the audio object separation node includes: a field for describing the number of audio objects to be separated; and a field for describing whether to perform object separation.
Advantageous Effects
As described above, the present invention has an effect of using a small bandwidth efficiently, because it can process combined audio streams that go against the standard definitions of the Motion Picture Experts Group 4 (MPEG-4) specification by suggesting a Binary Format for Scenes (BIFS) node structure that can separate a plurality of audio objects compressed into one elementary stream.
Description of Drawings
The above and other objects and features of the present invention will become apparent from the following description of the preferred embodiments given in conjunction with the accompanying drawings, in which:
Fig. 1 is a diagram illustrating a node structure for separating audio objects in accordance with an embodiment of the present invention;
Fig. 2 is a diagram describing an Audio Binary Format for Scenes (Audio BIFS) sound scene graph formed by using an "AudioObjectSeparator" of Fig. 1 in accordance with an embodiment of the present invention;
Fig. 3 is a diagram showing an Audio BIFS sound scene graph formed by using an "AudioObjectSeparator" of Fig. 1 in accordance with another embodiment of the present invention; and
Fig. 4 is a block diagram illustrating a terminal that conforms to the Motion Picture Experts Group 4 (MPEG-4) Standard.
Best Mode for the Invention
The following description exemplifies only the principles of the present invention. Even if they are not described or illustrated clearly in the present specification, one of ordinary skill in the art can embody the principles of the present invention and invent various apparatuses within the concept and scope of the present invention. The conditional terms and embodiments presented in the present specification are intended only to make the concept of the present invention understood, and the invention is not limited to the embodiments and conditions mentioned in the specification. In addition, all detailed descriptions of the principles, viewpoints, and embodiments of the present invention should be understood to include structural and functional equivalents thereof. Such equivalents include not only currently known equivalents but also those to be developed in the future, that is, all devices invented to perform the same function, regardless of their structures.
For example, block diagrams of the present invention should be understood to show a conceptual viewpoint of an exemplary circuit that embodies the principles of the present invention. Similarly, all flowcharts, state transition diagrams, pseudocode, and the like can be expressed substantially on computer-readable media and, whether or not a computer or a processor is described distinctly, should be understood to express various processes operated by a computer or a processor.
Functions of the various devices illustrated in the drawings, including any functional block expressed as a processor or a similar concept, can be provided not only by hardware dedicated to those functions, but also by hardware capable of running proper software for the functions. When a function is provided by a processor, the function may be provided by a single dedicated processor, a single shared processor, or a plurality of individual processors, part of which can be shared.
The apparent use of the term 'processor', 'control', or a similar concept should not be understood to refer exclusively to a piece of hardware capable of running software; it should be understood to implicitly include a digital signal processor (DSP), hardware, and ROM, RAM, and non-volatile memory for storing software. Other known and commonly used hardware may be included therein as well.
Other objects, features and advantages of the invention will become apparent from the following description of the embodiments with reference to the accompanying drawings, which is set forth hereinafter. The same reference numeral is given to the same element, even when the element appears in different drawings. In addition, if a further detailed description of the related prior art is deemed to obscure the point of the present invention, that description is omitted. Hereafter, preferred embodiments of the present invention will be described in detail with reference to the drawings.
Fig. 1, which is a diagram illustrating a node structure for separating audio objects in accordance with an embodiment of the present invention, defines a node for separating audio objects as "AudioObjectSeparator". The node for separating audio objects, which is suggested in the present invention, can be added to the Motion Picture Experts Group 4 (MPEG-4) Audio Binary Format for Scenes (Audio BIFS) nodes.
As illustrated in Fig. 1, the "AudioObjectSeparator" node of the present invention includes an "addChildren" field, a "removeChildren" field, a "child" field, a "url" field, a "numObject" field and a "separate" field. Each of the fields is described as follows.
The "addChildren" field describes a list of nodes to be added to the "AudioObjectSeparator" node as its child nodes. The "removeChildren" field describes a list of nodes to be removed from the child nodes of the "AudioObjectSeparator" node.
The "child" field is used to connect sound samples stored in an "AudioBuffer" node. Only the "AudioObjectSeparator" node and an "AudioSource" node can have the "AudioBuffer" node as a child node.
The "url" field describes an identifier (ID) of an object descriptor (OD) of the audio stream to which the "AudioObjectSeparator" node is connected. The "numObject" field describes the number of objects to be separated, and the objects are separated only when the "separate" field is "TRUE."
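The fields listed above can be summarized as a simple, non-normative data structure. The Python types below are assumptions for illustration; the actual node is defined in BIFS binary syntax, not as a class.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class AudioObjectSeparator:
    # Lists of child nodes to add to / remove from this node.
    addChildren: List[object] = field(default_factory=list)
    removeChildren: List[object] = field(default_factory=list)
    # Connects sound samples stored in an "AudioBuffer" node.
    child: Optional[object] = None
    # OD identifier of the audio stream this node is connected to.
    url: str = ""
    # Number of audio objects to be separated.
    numObject: int = 0
    # Objects are separated only when this field is TRUE.
    separate: bool = False

    def should_separate(self) -> bool:
        return self.separate and self.numObject > 0
```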
Fig. 2 is a diagram describing an Audio Binary Format for Scenes (Audio BIFS) sound scene graph formed by using the "AudioObjectSeparator" of Fig. 1 in accordance with an embodiment of the present invention. Herein, two or more audio objects included in an elementary stream are encoded in a parametric multichannel audio encoding method, such as Binaural Cue Coding (BCC), and transmitted.

As illustrated in Fig. 2, when an elementary stream including two or more audio objects is decoded in a decoder 201, the decoded audio stream is connected to an audio subgraph by an "AudioSource" node 202. In short, when the elementary stream having a plurality of audio objects compressed in the BCC method is decoded in the decoder 201, each of the audio object streams is delivered to the "AudioSource" node 202 individually.
An "AudioObjectSeparator" node 203 separates the audio streams, which are transmitted in the separated state from the "AudioSource" node 202, according to each object and outputs the audio streams to a "Sound2D" node 204. In other words, the "AudioObjectSeparator" node 203 performs a passive role of separating the audio streams transmitted in the separated state according to each object. Herein, the "addChildren" field of the "AudioObjectSeparator" node 203 includes such child nodes as the "AudioSource" node, the "separate" field is defined to be "TRUE," and the number of audio objects to be separated is defined in the "numObject" field. The "Sound2D" node 204 defines attributes such as the two-dimensional spatial location of each audio object to thereby form an audio scene. Finally, a desired audio-video (AV) scene is formed by integrating a video scene and the audio scene in a "Transform2D" node 205.

Fig. 3 is a diagram showing an Audio BIFS sound scene graph formed by using the "AudioObjectSeparator" of Fig. 1 in accordance with another embodiment of the present invention.
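The "passive" role described above can be sketched as follows: the decoder already delivers per-object streams, so the separator merely fans them out, one stream per output, honoring the "separate" and "numObject" fields. The stream representation here is an illustrative stand-in, not an actual decoder interface.

```python
def passive_separate(decoded_streams, num_object, separate):
    # Passive role: streams arrive already separated per object; the node
    # only routes them onward, one per declared object.
    if not separate:
        return [decoded_streams]          # pass through as a single input
    return decoded_streams[:num_object]   # one output per declared object

# Three already-separated object streams arriving from the decoder.
streams = [["obj0-samples"], ["obj1-samples"], ["obj2-samples"]]
outputs = passive_separate(streams, num_object=2, separate=True)
```

With "separate" set to TRUE and "numObject" set to 2, the first two object streams are forwarded individually; with "separate" FALSE, the input is passed along untouched.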
In Fig. 3, an elementary stream including a plurality of audio objects is decoded in a decoder 301, and each of the decoded single audio streams is connected to an audio subgraph by an "AudioSource" node 302.
An "AudioObjectSeparator" node 303 of the present invention separates the single audio streams, which are transmitted from the "AudioSource" node 302, according to the desired number of objects by using a Blind Source Separation (BSS) method and outputs the audio streams to a "Sound2D" node 304. Herein, the "separate" field of the "AudioObjectSeparator" node 303 is defined to be "TRUE," and the number of audio objects to be separated is defined in the "numObject" field. In other words, the "AudioObjectSeparator" node 303 performs an active role of separating an audio stream into the desired number of objects based on the BSS method. The "Sound2D" node 304 defines attributes such as the two-dimensional spatial location of each audio object to thereby form an audio scene.
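As a toy stand-in for the "active" role above: real Blind Source Separation estimates a demixing matrix from the mixtures alone (e.g., via independent component analysis). To keep this sketch dependency-free, the 2x2 demixing matrix is computed from a known mixing matrix, which illustrates only the separator's input/output contract, not an actual blind estimation algorithm.

```python
def demix_2x2(mixtures, mixing):
    # Invert a known 2x2 mixing matrix and apply it to two mixture signals.
    # (In true BSS, this matrix would be estimated blindly from the mixtures.)
    (a, b), (c, d) = mixing
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    m0, m1 = mixtures
    s0 = [inv[0][0] * x + inv[0][1] * y for x, y in zip(m0, m1)]
    s1 = [inv[1][0] * x + inv[1][1] * y for x, y in zip(m0, m1)]
    return [s0, s1]

# Two source signals mixed by a known matrix, then demixed.
mixing = [[1.0, 0.5], [0.3, 1.0]]
src = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
mixes = [
    [mixing[0][0] * x + mixing[0][1] * y for x, y in zip(*src)],
    [mixing[1][0] * x + mixing[1][1] * y for x, y in zip(*src)],
]
recovered = demix_2x2(mixes, mixing)
```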
Finally, a desired AV scene is formed by integrating a video scene and the audio scene in a "Transform2D" node 305.
Fig. 4 is a block diagram illustrating a terminal that conforms to the MPEG-4 Standard. A multiplexed bit stream received in the terminal of Fig. 4 is separated into an object descriptor elementary stream (ES), a BIFS elementary stream, and object elementary streams. Herein, one of the object elementary streams includes two or more audio objects. In the present invention, it is assumed that two or more audio objects are multiplexed into a stream by being compressed into one combined mono audio signal and an additional binaural cue parameter in the BCC method in a transmitting part. Thus, the terminal of Fig. 4 includes a BCC decoder 434. However, the combined audio stream can be encoded based on diverse compression algorithms other than the BCC technique, and it is obvious that the terminal of Fig. 4 then includes a decoder corresponding to the used compression algorithm. The technology of the present invention can separate the audio objects compressed into one elementary stream by using the node structure suggested in Fig. 1, so that the audio objects can be controlled individually.
A terminal manager 420 analyzes an object descriptor from the object descriptor elementary stream among the demultiplexed elementary streams and inputs the object elementary streams outputted from a demultiplexer 410 into the corresponding decoders 434 and 436. In short, an object elementary stream including two or more audio objects encoded in the BCC method is input into the BCC decoder 434 and decoded. Meanwhile, a BIFS stream including scene description information is decoded in a scene decoder 432.
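The routing described above can be sketched as a simple dispatch by stream type: the BIFS stream goes to the scene decoder 432, the combined audio ES to the BCC decoder 434, and other object streams to the object decoder 436. The stream-type labels are illustrative assumptions, not MPEG-4 stream-type codes.

```python
def route_stream(stream_type):
    # Dispatch a demultiplexed elementary stream to its decoder,
    # mirroring the Fig. 4 terminal described in the text.
    routes = {
        "od": "terminal_manager_420",     # object descriptor ES, analyzed first
        "bifs": "scene_decoder_432",      # scene description stream
        "bcc_audio": "bcc_decoder_434",   # ES carrying two or more audio objects
        "object": "object_decoder_436",   # ordinary single-object ES
    }
    return routes[stream_type]

target = route_stream("bcc_audio")
```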
A compositor 440 generates a scene graph by using the decoded BIFS information and the objects decoded in an object decoder 436 and the BCC decoder 434. In particular, when an elementary stream including a plurality of audio objects is decoded in the BCC decoder 434, the decoded audio stream is connected to an audio subgraph by the "AudioSource" node, and the "AudioObjectSeparator" node of the present invention separates the audio stream transmitted from the "AudioSource" node according to each object and outputs the separated audio stream for each object to the "Sound2D" node.

Herein, the "AudioObjectSeparator" of the present embodiment performs the passive role of separating the audio streams, which are transmitted from the BCC decoder 434 in the separated state, according to each object, as shown in Fig. 2, but it can also perform the active role of separating a single audio stream into the desired number of objects by using the BSS method, as shown in Fig. 3. The "Sound2D" node forms an audio scene by defining attributes such as the two-dimensional spatial location of each audio object. Finally, a desired AV scene graph is formed by integrating a video scene and the audio scene in the "Transform2D" node. A renderer 450 plays the AV scene, audio data, and video data, which are transmitted from the compositor 440, by using a display unit or a speaker.
The method of the present invention can be realized as a program and stored in a computer-readable recording medium, such as a CD-ROM, RAM, ROM, floppy disk, hard disk, magneto-optical disk, and the like. Since this process can be easily implemented by those of ordinary skill in the art to which the present invention belongs, further description will not be provided herein.
While the present invention has been described with respect to certain preferred embodiments, it will be apparent to those skilled in the art that various changes and modifications may be made without departing from the scope of the invention as defined in the following claims.