KR101949756B1 - Apparatus and method for audio signal processing - Google Patents

Apparatus and method for audio signal processing

Info

Publication number
KR101949756B1
Authority
KR
South Korea
Prior art keywords
object
signal
channel
group
signal group
Prior art date
Application number
KR1020120084229A
Other languages
Korean (ko)
Other versions
KR20140017342A (en)
Inventor
오현오
송정욱
송명석
전세운
이태규
Original Assignee
인텔렉추얼디스커버리 주식회사
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 인텔렉추얼디스커버리 주식회사
Priority to KR1020120084229A priority Critical patent/KR101949756B1/en
Priority claimed from PCT/KR2013/006732 external-priority patent/WO2014021588A1/en
Publication of KR20140017342A publication Critical patent/KR20140017342A/en
Application granted granted Critical
Publication of KR101949756B1 publication Critical patent/KR101949756B1/en


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S7/00 Indicating arrangements; Control arrangements, e.g. balance control
    • H04S7/30 Control circuits for electronic adaptation of the sound field
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/008 Multichannel audio signal coding or decoding, i.e. using interchannel correlation to reduce redundancies, e.g. joint-stereo, intensity-coding, matrixing
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S2400/00 Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S2400/03 Aspects of down-mixing multi-channel audio to configurations with lower numbers of playback channels, e.g. 7.1 -> 5.1
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S2400/00 Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S2400/11 Positioning of individual sound objects, e.g. moving airplane, within a sound field
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S2400/00 Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S2400/13 Aspects of volume control, not necessarily automatic, in stereophonic sound systems

Abstract

The present invention relates to a method and apparatus for processing an object audio signal, and more particularly, to a method and apparatus for encoding and decoding an object audio signal or rendering the object audio signal in a three-dimensional space.
According to an aspect of the present invention, there is provided an audio signal processing method comprising: classifying a plurality of object signals into a first object signal group and a second object signal group according to a predetermined method; generating a first downmix signal for the first object signal group and a second downmix signal for the second object signal group; generating first object extraction information corresponding to the first downmix signal for the object signals included in the first object signal group; and generating second object extraction information corresponding to the second downmix signal for the object signals included in the second object signal group.

Description

APPARATUS AND METHOD FOR AUDIO SIGNAL PROCESSING

The present invention relates to a method and apparatus for processing an object audio signal, and more particularly, to a method and apparatus for encoding and decoding an object audio signal or rendering the object audio signal in a three-dimensional space.

3D audio collectively refers to a series of signal processing, transmission, encoding, and reproduction technologies for providing immersive sound in three-dimensional space by adding another dimension in the height direction to the horizontal (2D) sound scene provided by conventional surround audio. In particular, providing 3D audio requires rendering techniques that form a sound image at a virtual position where no speaker exists, whether a larger or a smaller number of speakers is used.

3D audio is expected to become the audio solution for future ultra-high-definition TVs (UHDTV), and to be applied to theater sound, personal 3DTVs, tablets, smartphones, and cloud gaming, as well as to vehicles evolving into high-quality infotainment spaces.

3D audio first requires the transmission of signals with far more channels, up to 22.2, than conventional compression and transmission techniques support. Conventional high-quality codecs such as MP3, AAC, DTS, and AC3 are optimized for transmitting no more than 5.1 channels.

In addition, reproducing a 22.2-channel signal requires a listening space fitted with a 24-speaker system, which is unlikely to spread quickly in the market. Several techniques are therefore needed: a technique for effectively reproducing a 22.2-channel signal in a space with fewer speakers; conversely, a technique for reproducing a conventional stereo or 5.1-channel sound source in an environment with more speakers, such as a 10.1-channel or 22.2-channel setup; a technique for providing the sound scene intended by the original source even outside the prescribed speaker positions and listening room; and a technique for enjoying 3D sound in a headphone listening environment. In this document such techniques are collectively referred to as rendering, and in detail as downmix, upmix, flexible rendering, binaural rendering, and the like.

Meanwhile, an object-based signal transmission scheme is needed as an alternative for efficiently transmitting such sound scenes. Depending on the sound source, transmitting on an object basis can be more advantageous than channel-based transmission, and object-based transmission also allows the user to control the playback level and position of each object interactively. Accordingly, an effective transmission method capable of compressing object signals at a high compression ratio is needed.

In addition, sound sources in which channel-based signals and object-based signals are mixed may exist, providing a new type of listening experience. Accordingly, a technique is needed for effectively transmitting channel signals and object signals together and rendering them effectively.

According to an aspect of the present invention, there is provided an audio signal processing method comprising: classifying a plurality of object signals into a first object signal group and a second object signal group according to a predetermined method; generating a first downmix signal for the first object signal group and a second downmix signal for the second object signal group; generating first object extraction information corresponding to the first downmix signal for the object signals included in the first object signal group; and generating second object extraction information corresponding to the second downmix signal for the object signals included in the second object signal group.

According to another aspect of the present invention, there is provided an audio signal processing method comprising: receiving a plurality of downmix signals including a first downmix signal and a second downmix signal; receiving first object extraction information for a first object signal group corresponding to the first downmix signal and second object extraction information for a second object signal group corresponding to the second downmix signal; generating the object signals belonging to the first object signal group using the first downmix signal and the first object extraction information; and generating the object signals belonging to the second object signal group using the second downmix signal and the second object extraction information.

According to the present invention, an audio signal can be effectively represented, encoded, transmitted and stored, and a high-quality audio signal can be reproduced through various reproduction environments and devices.

The effects of the present invention are not limited to the above-mentioned effects, and the effects not mentioned can be clearly understood by those skilled in the art from the present specification and the accompanying drawings.

FIG. 1 is a view for explaining viewing angles according to image size at the same viewing distance.
FIG. 2 is a diagram showing a 22.2-channel speaker arrangement.
FIG. 3 is a conceptual diagram showing the position of each sound object in the listening space in which a listener listens to 3D audio.
FIG. 4 is an exemplary diagram of object signal groups formed, using the grouping method according to the present invention, from the objects shown in FIG. 3.
FIG. 5 is a block diagram of an embodiment of an object audio signal encoder according to the present invention.
FIG. 6 is an exemplary configuration diagram of a decoding apparatus according to an embodiment of the present invention.
FIG. 7 illustrates an embodiment of a bit string generated by the encoding method according to the present invention.
FIG. 8 is a block diagram of an object and channel signal decoding system according to an embodiment of the present invention.
FIG. 9 is a block diagram of another object and channel signal decoding system according to the present invention.
FIG. 10 is a block diagram of an embodiment of a decoding system according to the present invention.
FIG. 11 is a view for explaining masking thresholds for a plurality of object signals according to the present invention.
FIG. 12 is a block diagram of an embodiment of an encoder that calculates a masking threshold for a plurality of object signals according to the present invention.
FIG. 13 is a diagram illustrating a 5.1-channel setup arranged according to the ITU-R recommendation and the same setup arranged at arbitrary positions.
FIG. 14 is a block diagram of an embodiment in which a decoder for an object bit stream and a flexible rendering system using the decoder are connected.
FIG. 15 is a block diagram of another embodiment implementing decoding and rendering of an object bit stream according to the present invention.
FIG. 16 is a diagram showing a structure for determining and transmitting a transmission plan between a decoder and a renderer.
FIG. 17 is a conceptual diagram explaining the reproduction, using surrounding channels, of the front speakers absent because of the display in a 22.2-channel system.
FIG. 18 is a flowchart illustrating a method of processing a sound source according to an embodiment of the present invention.
FIG. 19 is a diagram illustrating an example of mapping signals generated in each band to speakers disposed around a TV.
FIG. 20 is a diagram showing the relationship among products in which an audio signal processing apparatus according to an embodiment of the present invention is implemented.

According to an aspect of the present invention, there is provided an audio signal processing method comprising: classifying a plurality of object signals into a first object signal group and a second object signal group according to a predetermined method; generating a first downmix signal for the first object signal group and a second downmix signal for the second object signal group; generating first object extraction information corresponding to the first downmix signal for the object signals included in the first object signal group; and generating second object extraction information corresponding to the second downmix signal for the object signals included in the second object signal group.

Here, in the audio signal processing method, the first object signal group and the second object signal group may be mixed together to form one sound scene.

In the audio signal processing method, the first object signal group and the second object signal group may be composed of signals reproduced at the same time.

In the present invention, the first object signal group and the second object signal group can be encoded into one object signal bit stream.

Here, the generating of the first downmix signal may be performed by applying per-object downmix gain information to the object signals included in the first object signal group, and the per-object downmix gain information may be included in the first object extraction information.

Here, the audio signal processing method may further include encoding the first object extraction information and the second object extraction information.

In the present invention, the audio signal processing method may further include generating global gain information applied to the entire object signal, including both the first object signal group and the second object signal group, and the global gain information may be encoded into the object signal bit stream.

According to another aspect of the present invention, there is provided an audio signal processing method comprising: receiving a plurality of downmix signals including a first downmix signal and a second downmix signal; receiving first object extraction information for a first object signal group corresponding to the first downmix signal and second object extraction information for a second object signal group corresponding to the second downmix signal; generating the object signals belonging to the first object signal group using the first downmix signal and the first object extraction information; and generating the object signals belonging to the second object signal group using the second downmix signal and the second object extraction information.

Here, the audio signal processing method may further include generating an output audio signal using at least one object signal belonging to the first object signal group and at least one object signal belonging to the second object signal group.

Here, the first object extraction information and the second object extraction information may be received from one bit string.

Also, the audio signal processing method may include obtaining downmix gain information for at least one object signal belonging to the first object signal group from the first object extraction information, and the at least one object signal may be generated using the downmix gain information.

The audio signal processing method may further include receiving global gain information, and the global gain information may be a gain value applied to both the first object signal group and the second object signal group.

Also, at least one object signal belonging to the first object signal group and at least one object signal belonging to the second object signal group can be reproduced at the same time.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory, intended to illustrate rather than limit the invention, and that the scope of the invention should be interpreted to include modifications or variations that do not depart from its spirit.

The terms used herein and the accompanying drawings are provided to facilitate understanding of the present invention; the shapes shown in the drawings are exaggerated where necessary for clarity, and the present invention is not limited by these terms and drawings.

In the following description, a detailed description of known functions and configurations incorporated herein will be omitted when it may make the subject matter of the present invention rather unclear.

In the present invention, the following terms may be interpreted according to the criteria below, and terms not described here may be construed in the same spirit. Coding may be interpreted as encoding or decoding as the occasion demands, and information is a term encompassing values, parameters, coefficients, elements, and the like; its meaning may be construed differently depending on the case, but the present invention is not limited thereto.

Hereinafter, a method and apparatus for processing an object audio signal according to an embodiment of the present invention will be described.

FIG. 1 is a view for explaining viewing angles according to image size (e.g., UHDTV and HDTV) at the same viewing distance. As display technology has developed, image sizes have been increasing in response to consumer demand. As shown in FIG. 1, a UHDTV (7680*4320 pixel image, 110) is about 16 times larger than an HDTV (1920*1080 pixel image, 120). If an HDTV is installed on the living room wall and a viewer sits on the sofa at a certain viewing distance, the viewing angle may be about 30 degrees. However, when a UHDTV is installed at the same viewing distance, the viewing angle reaches about 100 degrees. When such a large, high-resolution screen is installed, it is desirable to provide sound with a sense of presence and impact matching the large-format content. To give the viewer nearly the same experience as being in the scene, one or two surround channel speakers may not be sufficient; thus a multi-channel audio environment with more speakers and channels may be required.

Beyond the home theater environment, multi-channel audio may also be applied to personal 3DTVs, smartphones, 22.2-channel audio program broadcasts, vehicles, 3D video, telepresence rooms, cloud-based gaming, and the like.

FIG. 2 is a diagram showing the 22.2-channel speaker arrangement as an example of a multi-channel environment. 22.2 channels is only one example for enhancing the sound field, and the present invention is not limited to a specific number of channels or a specific speaker arrangement. Referring to FIG. 2, a total of nine channels may be provided in the top layer 1010: three speakers at the front, three at the middle positions, and three at the surround positions. In the middle layer 1020, five speakers may be disposed at the front, two at the middle positions, and three at the surround positions. Of the five front speakers, the three at the center may fall within the TV screen area. The bottom layer 1030 may have three front channels, plus two LFE channels 1040.

Transmitting and reproducing multi-channel signals of up to several tens of channels in this manner may require a high computation amount, and a high compression ratio may be required in consideration of the communication environment. Moreover, most households do not have a multi-channel (e.g., 22.2-channel) speaker environment; many listeners have a 2-channel or 5.1-channel setup. If the signal sent to every user is one in which the multi-channel signal is separately encoded for 2 channels and for 5.1 channels, communication becomes inefficient, and storing 22.2-channel PCM signals in addition results in inefficient memory management.

FIG. 3 is a conceptual diagram showing the positions of the sound objects 120 constituting a three-dimensional sound scene in the listening space 130 where a listener 110 listens to 3D audio. In FIG. 3, each object 120 is shown as a point source for convenience of illustration; however, besides point sources, there may also be sound sources in the form of plane waves or ambient sources (sound spread across the entire perceptible space).

FIG. 4 shows object signal groups 410 and 420 formed, using the grouping method according to the present invention, from the objects illustrated in FIG. 3. According to the present invention, in encoding or processing object signals, object signal groups are formed and the signals are encoded or processed in units of grouped objects. Here, encoding includes discretely encoding each object as an individual signal, or parametrically encoding the object signals. In particular, according to the present invention, both the downmix signal used for parametric encoding of the object signals and the object parameter information corresponding to that downmix are generated in units of groups. In the conventional SAOC encoding technique, for example, all objects constituting the sound scene are represented by one downmix signal (which may be mono, i.e. 1 channel, or stereo, i.e. 2 channels) and the object parameter information corresponding to it. However, in the scenario considered in the present invention, where 20 or more, and possibly 200 or 500, objects may be used, representing them all as one downmix and its corresponding object parameter information makes upmixing and rendering of the desired sound quality practically impossible. Accordingly, the present invention groups the objects to be encoded and generates a downmix per group. During the group-wise downmix, a downmix gain may be applied to each object, and the applied per-object downmix gain is included as additional information in the bit string of each group. Meanwhile, to improve encoding efficiency and to control the overall gain effectively, a global gain applied across all groups and object group gains applied only to the objects of each group may be used; these are encoded, included in the bit string, and transmitted.

A first method of forming a group is to group nearby objects, considering the position of each object in the sound scene. The object groups 410 and 420 of FIG. 4 are examples formed by this method. This method aims to keep the crosstalk distortion between objects caused by the incompleteness of parametric coding, and the distortion caused when an object is moved to a third position or its level is changed, as inaudible to the listener 110 as possible: distortion in objects at the same location is likely to be inaudible because of spatial masking. For the same reason, even with discrete encoding, an efficiency gain can be expected from sharing additional information among objects located at similar positions.

FIG. 5 is a block diagram of an embodiment of an object audio signal encoder including the object grouping 550 and downmixing 520, 540 according to the present invention. Downmixing is performed for each group, and the parameters required to restore the downmixed objects are generated at the same stage (520, 540). The downmix signals generated for each group are further encoded by a waveform encoder 560, such as AAC or MP3, which encodes each channel's waveform; this is commonly referred to as a core codec. Encoding may also exploit coupling between the downmix signals. The signals generated by the respective encoders are formed into a single bit stream by the multiplexer 570 and transmitted. Thus, the bit stream produced through the downmix & parameter encoders 520 and 540 and the waveform encoder 560 encodes the objects constituting one sound scene. Moreover, object signals belonging to different object groups within the generated bit stream are encoded with the same time frame, so they can be reproduced at the same time. The grouping information generated by the object grouping unit may be encoded and transmitted to the receiving end.
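To make the group-wise downmixing concrete, the following Python sketch applies a per-object downmix gain within each group, then an object group gain and a global gain, and records the per-object gains as the extraction side information carried per group. It is illustrative only; the function names, signal shapes, and the choice of a mono downmix per group are assumptions, not part of the disclosure.

```python
import numpy as np

def encode_groups(objects, groups, downmix_gains, group_gains, global_gain=1.0):
    """Group-wise downmix sketch: 'objects' maps name -> 1-D signal,
    'groups' is a list of lists of object names, and 'downmix_gains'
    maps each name to the gain applied when it enters its group downmix."""
    length = len(next(iter(objects.values())))
    downmixes, side_info = [], []
    for gidx, members in enumerate(groups):
        dmx = np.zeros(length)
        for name in members:
            dmx += downmix_gains[name] * objects[name]   # per-object downmix gain
        dmx *= group_gains[gidx] * global_gain           # group gain, then global gain
        downmixes.append(dmx)
        # the applied per-object gains travel as side information per group
        side_info.append({name: downmix_gains[name] for name in members})
    return downmixes, side_info

# toy usage: four objects split into two spatial groups
rng = np.random.default_rng(0)
objs = {f"obj{i}": rng.standard_normal(1024) for i in range(4)}
dmx, info = encode_groups(objs,
                          [["obj0", "obj1"], ["obj2", "obj3"]],
                          {f"obj{i}": 1.0 for i in range(4)},
                          group_gains=[1.0, 0.8])
```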

FIG. 6 is a block diagram illustrating an embodiment of decoding the encoded and transmitted signal. In the decoding process, the plurality of downmix signals decoded by the waveform decoding 620 are input to the upmixer & parameter decoder together with their corresponding parameters. Since there are multiple downmixes, multiple parameter sets must be decoded.

If the transmitted bit stream includes a global gain and object group gains, they can be applied to restore the normal levels of the object signals. These gain values can also be controlled in the rendering or transcoding process: the level of the entire signal is controlled via the global gain, and the level of each group via its object group gain. For example, when object grouping is performed per playback speaker, adjusting the object group gain makes it easy to implement the flexible rendering described later.

Although a plurality of parameter encoders or decoders are illustrated here as operating in parallel for convenience of explanation, it is also possible to encode or decode the plurality of object groups sequentially through one system.

Another method of forming an object group is to group objects having low correlation with one another into one group. This takes into account a characteristic of parametric coding: objects with high mutual correlation are difficult to separate from the downmix again. Parameters such as the downmix gain may also be adjusted during downmixing so that the grouped objects are separated from one another more cleanly; the parameters used are preferably transmitted so that they can be used for signal restoration at decoding.

Another method of forming an object group is, conversely, to group objects having high correlation into one group. Although highly correlated objects are difficult to separate using the parameters, this method increases compression efficiency in applications where such separability is not heavily used. A core codec spends many bits on complex signals with varied spectra; therefore, when highly correlated objects are grouped and handled by a single core codec, encoding efficiency is high.

Another method of forming an object group is to determine whether objects mask one another and encode accordingly. For example, when object A masks object B, including both signals in one downmix and encoding them with the core codec may cause object B to be omitted in the encoding process. In that case, obtaining object B at the decoding end using the parameters yields large distortion, so objects A and B in such a relationship are preferably placed in separate downmixes. On the other hand, when objects A and B are in a masking relationship but there is no need to render them separately, or at least no need to process the masked object separately, it may conversely be effective to include them in the same downmix. The selection method may thus differ by application. For example, if a particular object is masked, or at least negligibly weak, in the desired sound scene, it may be excluded from the object list during encoding and absorbed into the masking object, or the two objects may be combined and expressed as one object.

Another way to form an object group is to separate sources that are not point sources, such as plane-wave sources or ambient sources, and group them on their own. Because their characteristics differ from those of point sources, such sources need different compression methods and parameters, and are therefore preferably processed separately.

The object information decoded for each group is restored to the original objects through object degrouping, by referring to the transmitted grouping information.

FIG. 7 shows an embodiment of a bit string generated by the encoding method according to the present invention. Referring to FIG. 7, the main bit stream 700 carrying the encoded channel or object data is arranged in the order of the channel groups 720, 730, 740 or the object groups 750, 760, 770. Because the header contains the channel group position information CHG_POS_INFO 711 and the object group position information OBJ_POS_INFO 712, which give the position of each group within the bit stream, the data of a desired group can be decoded first by referring to them. The decoder therefore generally decodes data in group units in order of arrival, but may change the decoding order for other policies or reasons. FIG. 7 also illustrates a sub-bit stream 701, alongside the main bit stream 700, containing metadata 703, 704 for each channel or object in addition to the main decoding-related information. The sub-bit stream may be transmitted intermittently while the main bit stream is transmitted, or over a separate transmission channel.

(A method of allocating bits for each object group)

When generating downmixes for a plurality of groups and performing independent parametric object coding for each group, the number of bits used by each group may differ. Criteria for allocating bits per group may include the number of objects in the group, the number of effective objects considering masking among the objects in the group, a weight according to position that reflects human spatial resolution, the sound pressure levels of the objects, the importance of the objects in the sound scene, and so on. For example, if there are three spatial object groups A, B, and C containing 3, 2, and 1 object signals respectively, the allocated bits may be 3·a1·(n−x), 2·a2·(n−y), and a3·n, where n is the base number of bits per object, x and y represent how many bits can be saved through the masking effect among the objects within each group, and a1, a2, and a3 are determined per group by the various factors mentioned above.
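The bit-allocation rule above can be illustrated with a small sketch; the proportional-scaling policy and the numeric weights are assumptions chosen for illustration, not values prescribed by the disclosure.

```python
def allocate_bits(total_bits, group_sizes, weights, masking_savings):
    """Share a bit budget across object groups in proportion to
    (object count) x (importance weight) x (per-object bits remaining
    after masking savings), mirroring the 3*a1*(n-x) style terms above."""
    raw = [size * w * max(1.0 - s, 0.0)      # s: fraction of bits saved by masking
           for size, w, s in zip(group_sizes, weights, masking_savings)]
    scale = total_bits / sum(raw)
    return [int(r * scale) for r in raw]

# groups A, B, C with 3, 2, 1 objects; C weighted as most important per object
print(allocate_bits(64000, [3, 2, 1], [1.0, 1.0, 1.5], [0.2, 0.1, 0.0]))
```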

(Main object and sub object position information encoding in object group)

Meanwhile, for object information, it is desirable to have a means of transmitting, as metadata, the position and level information of each object recommended according to the producer's intent, or mix information proposed by another user. In the present invention this is referred to as preset information for convenience. With preset position information, especially for a dynamic object whose position changes over time, the amount of information to transmit is not small: for example, transmitting position information that changes every frame for 1000 objects amounts to a very large quantity of data. The position information of objects should therefore be transmitted efficiently, and the present invention uses an efficient position-information encoding method based on the definitions of main object and sub-object.

A main object is an object whose position information is expressed as absolute coordinate values in three-dimensional space. A sub-object is an object whose position in three-dimensional space is expressed relative to the main object, so that it has relative position information; a sub-object therefore needs to know which object its main object is. In the case of grouping, particularly when grouping is performed based on position in space, one object in a group may serve as the main object and the rest as sub-objects, with their position information expressed accordingly. If there is no grouping for encoding, or if grouping is not advantageous for encoding sub-object position information, a separate set may be formed for position-information encoding. It is preferable that the objects belonging to such a group or set lie within a certain spatial range, so that expressing the sub-object positions relative to the main object is more advantageous than using absolute values.

Another position-information encoding method according to the present invention expresses positions relative to fixed speaker positions instead of relative to a main object. For example, the relative position information of an object is expressed with reference to the designated position values of the 22-channel speakers. The number of speakers used as references and their position values may be set based on values specified in the current content.

In another embodiment of the present invention, after the position information is expressed as an absolute or relative value, it is quantized, and the quantization step is varied based on the absolute position. For example, it is known that a listener's ability to discriminate position is significantly higher in the frontal area than to the sides or rear, so the quantization step is preferably set so that the frontal resolution is higher than the lateral resolution. Likewise, since human resolution for azimuth is higher than for elevation, it is preferable to quantize the azimuth angle more finely than the elevation angle.
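A minimal sketch of this direction-dependent quantization, combined with the relative coding of sub-object positions described above; the particular step sizes and angle thresholds are illustrative assumptions, not values from the disclosure.

```python
def azimuth_step(az_abs_deg):
    # finer steps where human discrimination is better (the frontal region)
    a = abs(az_abs_deg)
    return 2.0 if a <= 30 else (5.0 if a <= 90 else 10.0)

def quantize_position(az, el, main_pos=None):
    """Quantize a direction with a step that depends on the absolute
    azimuth; elevation gets a coarser step than azimuth, and sub-objects
    are coded relative to their main object's position."""
    az_step = azimuth_step(az)          # step chosen from the absolute direction
    el_step = 2.0 * az_step             # elevation resolution is lower than azimuth
    if main_pos is not None:            # sub-object: relative coding
        az, el = az - main_pos[0], el - main_pos[1]
    return round(az / az_step) * az_step, round(el / el_step) * el_step

print(quantize_position(12.3, 7.8))                         # main object, frontal
print(quantize_position(14.0, 9.0, main_pos=(12.0, 8.0)))   # sub-object
```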

According to another embodiment of the present invention, for a dynamic object whose position varies in time, the position may be expressed relative to the object's own previous position value instead of relative to a main object or another reference point. Accordingly, flag information should be transmitted alongside, to distinguish whether the position information of the dynamic object is referenced to a temporally preceding value or to a spatially neighboring reference point.

(Decoder overall architecture)

FIG. 8 is a block diagram of an object and channel signal decoding system according to the present invention. The system may receive an object signal 801, a channel signal 802, or a combination of the two, and each object or channel signal may be waveform-encoded (801, 802) or parametrically encoded (803, 804). The decoding system can be broadly divided into a 3DA decoding unit 860 and a 3DA rendering unit 870, and the 3DA rendering unit 870 may be any external system or solution. Accordingly, the 3DA decoding unit 860 and the 3DA rendering unit 870 preferably provide a standardized interface that is easily compatible with external components.

FIG. 9 is a block diagram of another object and channel signal decoding system according to the present invention. Similarly, this system may receive an object signal 901, a channel signal 902, or a combination of the two, and each may be waveform-encoded (901, 902) or parametrically encoded (903, 904). Compared with the system of FIG. 8, the differences are that the previously separate individual object decoder 810 and individual channel decoder 820, and likewise the parametric object decoder 830 and parametric channel decoder 840, are merged here into a single individual decoder 910 and a single parametric decoder 920, and that a 3DA rendering unit 940 and a renderer interface unit 930 providing a convenient, standardized interface are added. The renderer interface unit 930 receives the user environment information, renderer version, and the like from the 3DA renderer 940 located inside or outside, delivers a channel or object signal in a format compatible with it together with the accompanying metadata, and can transmit the data needed for reproduction. The 3DA renderer interface 930 may include the order controller 1630 described later.

The parametric decoder 920 requires a downmix signal to generate an object or channel signal, and the required downmix signal is decoded and input through the individual decoder 910. An encoder corresponding to this object and channel signal decoding system may take various forms; any encoder that can generate at least one of the bit strings 801, 802, 803, 804, 901, 902, 903, and 904 shown in FIGS. 8 and 9 can be regarded as a compatible encoder. Also, according to the present invention, the decoding systems shown in FIGS. 8 and 9 are designed to ensure compatibility with past systems and bit strings. For example, when an individual channel bit stream encoded with AAC is input, it can be decoded through the individual (channel) decoder and transmitted to the 3DA renderer. In the case of an MPS (MPEG Surround) bit stream, the downmixed, AAC-encoded downmix signal is first decoded through the individual (channel) decoder, and the result is passed to the parametric channel decoder, which then operates like an MPEG Surround decoder. A bit string encoded with SAOC (Spatial Audio Object Coding) is handled similarly. In the system of FIG. 8, SAOC operates as a transcoder, as in the prior art, and is then rendered to channels through MPEG Surround; for this purpose, the SAOC transcoder preferably receives the reproduction channel environment information and generates a channel signal optimized for it. Conventional SAOC bit streams can thus be received and decoded, while rendering specific to the user or playback environment is still performed. In the system of FIG. 9, when an SAOC bit string is input, it is converted directly into channel signals or into individual objects suitable for rendering, instead of being transcoded into an MPS bit string. The computational complexity is therefore lower than in the transcoding structure, which is also advantageous for sound quality. In FIG. 9 the output of the object decoder is shown only as channels, but it may also be passed to the renderer interface as individual object signals. Also, although shown only in FIG. 9, when a residual signal is included in the parametric bit stream, including the case of FIG. 8, it is decoded through the individual decoder.

(Individual channel coding, parametric coding, and residual)

FIG. 10 is a diagram illustrating the configuration of an encoder and a decoder according to another embodiment of the present invention.

FIG. 10 shows a structure for scalable coding when the speaker setup of the decoder is different.

The encoder includes a downmixing unit 210, and the decoder includes a demultiplexing unit 220 and at least one of a first decoding unit 230 to a third decoding unit 250.

The downmixing unit 210 generates a downmix signal DMX by downmixing an input signal CH_N corresponding to multiple channels. In this process, one or more of an upmix parameter UP and an upmix residual UR are generated. Then, by multiplexing the downmix signal DMX and the upmix parameter UP (and the upmix residual UR), one or more bit streams are generated and transmitted to the decoder.

Here, the upmix parameter UP is a parameter required for upmixing one or more channels into two or more channels, and may include a spatial parameter and an inter-channel phase difference (IPD).

The upmix residual signal UR corresponds to the residual signal, i.e., the difference between the original signal CH_N and the recovered signal, where the recovered signal is the one obtained by applying the upmix parameter UP to the downmix DMX. The downmix signal produced by the downmixing unit 210 may itself be a signal encoded in a discrete manner.

The demultiplexer 220 of the decoder extracts the downmix signal DMX and the upmix parameter UP from one or more bit streams, and may further extract the upmix residual UR. The residual signal can be encoded in a manner similar to discrete encoding of the downmix signal; therefore, in the systems shown in FIG. 8 or FIG. 9, decoding of the residual signal is performed through the individual (channel) decoder.

The decoder selectively includes one (or more) of the first decoding unit 230 to the third decoding unit 250 depending on its speaker setup environment, which varies with the type of device (smartphone, stereo TV, 5.1-channel home theater, 22.2-channel home theater, etc.). If a bit stream and decoder for generating a 22.2-channel multichannel signal offered no such choice, all 22.2 channels would have to be restored and then downmixed to match the speaker reproduction environment; in that case, not only is a very high computation amount required for restoration and downmixing, but delays may also occur.

However, according to another embodiment of the present invention, the disadvantage can be solved by selectively providing one (or more) of the first decoder to the third decoder according to the setup environment of each device.

The first decoder 230 decodes only the downmix signal DMX and does not involve an increase in the number of channels: it outputs a mono channel signal if the downmix is mono, and a stereo signal if it is stereo. The first decoder is suitable for devices with one or two speaker or headphone channels, such as smartphones and TVs.

Meanwhile, the second decoder 240 receives the downmix signal DMX and the upmix parameter UP, and generates a parametric M-channel signal PM from them. Compared with the first decoder, the number of output channels increases; when the upmix parameter UP contains only the parameters for upmixing to a total of M channels, the second decoder can output M channels, fewer than the original channel count. For example, the original signal input to the encoder may be a 22.2-channel signal, while the M channels may be 5.1 channels, 7.1 channels, or the like.

The third decoder 250 receives not only the downmix signal DMX and the upmix parameter UP but also the upmix residual UR. Whereas the second decoder generates the parametric M channels, the third decoder additionally applies the upmix residual signal UR to produce a reconstructed signal of the full N channels.

Each device selectively includes one or more of the first to third decoders, and selectively parses the upmix parameter UP and the upmix residual UR from the bit stream, directly generating a signal suited to its own speaker setup environment; this reduces complexity and the amount of computation.
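The selective decoding can be pictured as follows; the gain-matrix stand-in for the parametric upmix and the way the residual is applied are simplifying assumptions, not the actual UP/UR semantics of the disclosure.

```python
import numpy as np

def parametric_upmix(dmx, up_matrix):
    # toy stand-in for parametric upmixing: a (M_out x M_in) gain matrix
    # plays the role of the spatial parameters UP; dmx is (M_in x T)
    return up_matrix @ dmx

def decode_scalable(dmx, up_matrix=None, residual=None, out_channels=2):
    """Pick the decoding depth from the playback setup: downmix only (first
    decoder), parametric M channels (second), or parametric plus residual
    for the full reconstruction (third)."""
    if up_matrix is None or out_channels <= dmx.shape[0]:
        return dmx                                  # first decoder path
    pm = parametric_upmix(dmx, up_matrix)           # second decoder path
    if residual is not None:
        pm = pm + residual                          # third decoder path
    return pm

dmx = np.zeros((2, 1024))                           # stereo downmix
up = np.full((6, 2), 0.5)                           # assumed 5.1 upmix gains
print(decode_scalable(dmx, up, out_channels=6).shape)   # (6, 1024)
```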

(Object Waveform Coding Considering Masking)

Waveform encoding of an object according to the present invention (waveform encoding here refers to encoding a channel or object audio signal so that it can be independently decoded per channel or per object, a concept opposed to parametric encoding/decoding and also called discrete encoding/decoding) allocates bits in consideration of each object's position in the sound scene. This is based on the binaural masking level difference (BMLD) phenomenon of psychoacoustics and on the characteristics of object signal coding.

To explain the BMLD phenomenon, consider the MS (Mid-Side) stereo coding used in conventional audio coding. Masking in psychoacoustics is possible only when the masker producing the masking and the maskee being masked are in the same direction in space. In a stereo audio signal, if the correlation between the two channel signals is very high and their levels are equal, the sound image is formed at the center between the two speakers; if there is no correlation, independent sound images are formed at each speaker. If a maximally correlated input signal is encoded as dual mono, with each channel encoded independently, the quantization noise of the two channels is uncorrelated, so the audio signal images at the center while each channel's quantization noise images separately at its own speaker. The quantization noise that should have been masked is then not masked, owing to this spatial mismatch, and is heard by the listener as distortion. Sum-difference coding solves this problem by generating a mid signal (the sum of the two channel signals) and a side signal (their difference), running the psychoacoustic model on these, and quantizing with the resulting thresholds, so that the quantization noise is imaged at the same position as the sound image.
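The sum-difference transform at the heart of MS stereo coding is small enough to show directly; this sketch demonstrates only the invertible mid/side mapping, not the psychoacoustic model driven by it.

```python
import numpy as np

def ms_encode(left, right):
    # mid = sum, side = difference; the psychoacoustic model is then run on
    # mid/side so quantization noise stays co-located with the sound image
    return 0.5 * (left + right), 0.5 * (left - right)

def ms_decode(mid, side):
    return mid + side, mid - side      # perfectly invertible

l = np.array([1.0, 0.5])
r = np.array([0.9, 0.4])
m, s = ms_encode(l, r)
print(np.allclose(ms_decode(m, s), (l, r)))   # True
```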

In conventional channel coding, each channel is mapped to a speaker, and since the speaker positions are fixed and separated from one another, masking between channels cannot be exploited. However, when each object is encoded independently, whether an object is masked may change according to its position in the sound scene. It is therefore desirable to determine whether the object currently being encoded is masked by other objects and to allocate bits accordingly.

FIG. 11 shows the individual masking thresholds for object 1 (1110) and object 2 (1120), and the masking threshold 1130 obtainable from these signals together, i.e., for the signal combining object 1 and object 2. Assuming object 1 and object 2 are located at the same position relative to the listener, or at least within a range in which the BMLD problem does not arise, the region masked for the listener is given by 1130, and the S2 component contained in object 1 becomes a completely masked, inaudible signal. Therefore, when encoding object 1, it is preferable to do so taking the masking threshold of object 2 into account. Since masking thresholds are additive, the combined threshold can be obtained by adding the respective thresholds of object 1 and object 2. Alternatively, when computing each masking threshold is itself very expensive, it is also reasonable to calculate a single masking threshold from the signal formed by summing object 1 and object 2 in advance, and to use it for encoding both objects. FIG. 12 shows an embodiment of an encoder that calculates a masking threshold for a plurality of object signals according to the present invention.

In another masking threshold calculation method according to the present invention, when the positions of two object signals do not coincide within the audible-angle criterion, instead of simply adding the two objects' masking thresholds, each threshold's contribution to the other is attenuated according to the spatial separation of the two objects. That is, when the masking threshold for object 1 is M1(f) and the masking threshold for object 2 is M2(f), the final joint masking thresholds M1'(f) and M2'(f) used to encode each object are generated to satisfy the following relationship.

[Equation 1]

M1'(f) = M1(f) + A(f)·M2(f)

M2'(f) = A(f)·M1(f) + M2(f)

Here, A(f) is an attenuation factor determined by the spatial positions of and distance between the two objects and by their properties, and has the range 0.0 ≤ A(f) ≤ 1.0.

Human directional resolution deteriorates from the front toward the sides, and is worst toward the rear. Therefore, the absolute position of an object can serve as another factor in determining A(f).

According to another embodiment of the present invention, only one of the two objects uses its own masking threshold alone, while the other object's threshold additionally incorporates that of its counterpart; these are called the independent object and the dependent object, respectively. Since the independent object is encoded at high quality regardless of its counterpart, sound quality can be preserved even if it is rendered spatially apart from that object. If object 1 is the independent object and object 2 the dependent object, the masking thresholds can be expressed by the following equation.

[Equation 2]

M1'(f) = M1(f)

M2'(f) = A(f)·M1(f) + M2(f)

Whether each object is an independent object or a dependent object is preferably encoded as per-object additional information and transmitted to the decoder and renderer.
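Equations 1 and 2 can be combined into one helper; the clipping of A(f) into [0, 1] follows the stated range, while the array shapes are illustrative assumptions.

```python
import numpy as np

def joint_thresholds(m1, m2, attenuation, obj1_independent=False):
    """Combine per-object masking thresholds per Equations 1 and 2: A(f)
    in [0, 1] attenuates the cross-object term with spatial separation;
    an independent object keeps its own threshold unchanged (Eq. 2)."""
    a = np.clip(attenuation, 0.0, 1.0)
    m1_joint = m1 if obj1_independent else m1 + a * m2
    m2_joint = a * m1 + m2
    return m1_joint, m2_joint

m1 = np.array([0.2, 0.3])
m2 = np.array([0.1, 0.1])
print(joint_thresholds(m1, m2, attenuation=0.5))                         # Equation 1
print(joint_thresholds(m1, m2, attenuation=0.5, obj1_independent=True))  # Equation 2
```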

In another embodiment according to the present invention, when two objects are spatially close, the signals themselves may be combined into one object, instead of merely generating joint masking thresholds.

In another embodiment according to the present invention, particularly when parametric coding is performed, it is preferable to combine the two signals into one object in consideration of both the correlation between the two signals and their spatial positions.

(Transcoding feature)

In another embodiment according to the present invention, when transcoding a bit string containing coupled objects, and particularly when transcoding to a lower bit rate where the number of objects must be reduced to shrink the data size, it is preferable to represent a coupled object pair as a single object, i.e., to downmix the coupled objects into one object.

In the description of encoding through inter-object coupling above, only the coupling of two objects has been described as an example for convenience; coupling of more than two objects can be implemented in a similar manner.

(Flexible rendering required)

Among the technologies required for 3D audio, flexible rendering is one of the key problems to solve in order to maximize 3D audio quality. It is well known that the positions of 5.1-channel speakers are in practice highly irregular, depending on the structure of the living room and the arrangement of furniture. Even with speakers at such irregular positions, the sound scene intended by the content producer should still be provided. To achieve this, the renderer must know the speaker environment, which differs for each user, and a rendering technique is needed to compensate for the difference from the standard layout. In other words, a series of techniques is required that does not end the codec's role at decoding the transmitted bit stream, but optimizes and transforms the result for the user's reproduction environment.

FIG. 13 shows a 5.1-channel setup arranged according to the ITU-R recommendation (gray, 1310) and the same setup arranged at arbitrary positions (red, 1320). In an actual living room environment, both the direction angles and the distances of the speakers may differ from the ITU-R recommendation (though not shown, the speaker heights may differ as well). It is difficult to provide the ideal 3D sound scene when the original channel signals are reproduced from such displaced speaker positions.

(Flexible rendering)

With amplitude panning, which determines the direction of a sound source between two speakers from the relative signal levels, or with VBAP (Vector-Based Amplitude Panning), which is widely used to position a sound source in three-dimensional space using three speakers, flexible rendering can be implemented relatively conveniently for signals transmitted on a per-object basis. This is one of the advantages of transmitting object signals instead of channels.
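A basic VBAP computation for one speaker triplet, as referenced above; the speaker directions below and the clipping of negative gains (which proper VBAP avoids by choosing a triplet that encloses the source) are illustrative assumptions.

```python
import numpy as np

def vbap_gains(speaker_dirs, source_dir):
    """Basic VBAP for one speaker triplet: solve L^T g = p for the gain
    vector g, where the rows of L are speaker unit vectors and p is the
    source direction, then normalize g for constant power."""
    L = np.vstack([d / np.linalg.norm(d) for d in speaker_dirs])
    g = np.linalg.solve(L.T, source_dir / np.linalg.norm(source_dir))
    g = np.maximum(g, 0.0)   # proper VBAP picks a triplet enclosing the source
    return g / np.linalg.norm(g)

# assumed triplet: front-left, front-right, top-center (x = front, z = up)
spk = [np.array([1.0, 1.0, 0.0]),
       np.array([1.0, -1.0, 0.0]),
       np.array([1.0, 0.0, 1.0])]
print(vbap_gains(spk, np.array([1.0, 0.2, 0.3])))
```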

(Object Decoding and Rendering Structure)

FIG. 14 shows the structures of two embodiments (1400 and 1401) in which a decoder for an object bit stream and a flexible rendering system using the decoder are connected according to the present invention. As described above, objects have the advantage that each can be positioned as a sound source according to the desired sound scene. Here, the mixer 1420 receives position information expressed as a mixing matrix and first converts the objects into regular channel signals; that is, the position information of the sound scene is expressed relative to the speakers corresponding to the output channels. If the actual number and positions of the speakers do not match the prescribed positions, the signals must then be re-rendered using the actual speaker position information. As described below, rendering a channel signal into another form of channel signal is more difficult than rendering objects directly to the final channels.

FIG. 15 shows the structure of another embodiment implementing decoding and rendering of an object bit stream according to the present invention. Compared with the case of FIG. 14, flexible rendering 1510 matched to the final speaker environment is implemented directly together with decoding from the bit stream. That is, instead of the two stages of mixing to regular channels based on the mixing matrix and then rendering those regular channels to the flexible speaker layout, a rendering matrix or rendering parameters are generated 1520 from the mixing matrix and the speaker position information, and the object signals are rendered directly to the target speakers using them.
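The one-step rendering of FIG. 15 amounts to folding the two matrices into one; the toy matrices below are assumptions standing in for a real mixing matrix and a speaker mapping derived from measured positions.

```python
import numpy as np

def direct_render(objects, mixing_matrix, speaker_map):
    """One-step flexible rendering: fold the (regular channels x objects)
    mixing matrix and the (actual speakers x regular channels) mapping
    derived from speaker positions into a single rendering matrix."""
    rendering_matrix = speaker_map @ mixing_matrix
    return rendering_matrix @ objects              # (actual speakers x T)

objs = np.random.default_rng(1).standard_normal((3, 512))   # three objects
mix = np.array([[1.0, 0.0, 0.5],                            # to two regular channels
                [0.0, 1.0, 0.5]])
smap = np.array([[0.9, 0.1],                                # to two actual speakers
                 [0.1, 0.9]])
print(direct_render(objs, mix, smap).shape)                 # (2, 512)
```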

(Flexible rendering with channel attached)

On the other hand, when a channel signal is transmitted as input and the position of the speaker corresponding to a channel has moved to an arbitrary position, the panning techniques used for objects are difficult to apply directly, and a separate channel mapping process is required. Because the processes and solutions needed for rendering object signals and channel signals differ in this way, distortion of spatial alignment easily arises when an object signal and a channel signal are transmitted together and a sound scene mixing the two is desired. To solve this problem, according to another embodiment of the present invention, flexible rendering is not performed separately on the objects; instead, the objects are first mixed into the channel signals, and flexible rendering is then performed on the resulting channel signal. Rendering using HRTFs is preferably implemented in the same manner.

(Decoded downmix: parameter transmission or automatic generation)

In downmix rendering, i.e., reproducing multi-channel content through a smaller number of output channels, it has so far been common to use an M-by-N downmix matrix, where M is the number of input channels and N the number of output channels. That is, when 5.1-channel content is reproduced in stereo, the downmix is performed by a given formula. However, this downmix implementation has a computation problem: even though the user's playback environment may be only 5.1 channels, all the bit strings corresponding to the transmitted 22.2 channels must be decoded. If all 22.2 channel signals must be decoded merely to generate a stereo signal for a portable device, the computational burden is very high, and a huge waste of memory (storage for the decoded 22.2-channel audio signals) occurs.
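For the fixed M-by-N downmix mentioned above, a commonly cited ITU-style 5.1-to-stereo formula uses a 1/sqrt(2) coefficient for the center and surround channels; the exact coefficients in any given standard may differ, so treat this as an assumed example.

```python
import numpy as np

def downmix_5_1_to_stereo(ch):
    """Fixed M x N downmix: rows of 'ch' are [L, R, C, LFE, Ls, Rs];
    implements Lo = L + 0.707*C + 0.707*Ls and Ro = R + 0.707*C + 0.707*Rs
    (the LFE contribution is omitted in this sketch)."""
    k = 1.0 / np.sqrt(2.0)
    M = np.array([[1, 0, k, 0, k, 0],
                  [0, 1, k, 0, 0, k]], dtype=float)
    return M @ ch

print(downmix_5_1_to_stereo(np.ones((6, 4))).shape)   # (2, 4)
```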

(Transcoding as an alternative to downmix)

As an alternative, a method of converting the huge 22.2-channel original bit stream into a bit stream suited to the target device or target playback space through effective transcoding can be considered. For example, for 22.2-channel content stored on a cloud server, a scenario can be implemented in which the reproduction environment information is received from the client terminal and the content is converted accordingly before transmission.

(Decoding sequence or downmix sequence; sequence control unit)

In a scenario where the decoder and the renderer are separated, on the other hand, it may be necessary, for example, to decode 50 object signals in addition to a 22.2-channel audio signal and deliver them to the renderer. Since the decoded signals have a very high data rate, a very wide bandwidth is required between the decoder and the renderer. It is therefore undesirable to transmit such a large amount of data all at once; an effective transmission plan should be established, with the decoder deciding the decoding order and transmitting accordingly. FIG. 16 shows a structure for determining and transmitting such a transmission plan between the decoder and the renderer.

The sequence control unit 1630 receives the side information and metadata obtained by decoding the bit stream, together with reproduction environment and rendering information from the renderer 1620; it then determines control information such as the decoding order and the transmission order and units in which the decoded signals are delivered to the renderer 1620, and passes the determined control information back to the decoder 1610 and the renderer 1620. For example, if the renderer 1620 commands that a particular object be completely removed, that object need not be transmitted to the renderer 1620, nor even decoded. Alternatively, when particular objects are rendered only to a specific channel, the transmission bandwidth can be reduced by downmixing those objects into that channel in advance instead of transmitting them separately. In another embodiment, when the sound scene is grouped spatially and the signals needed for rendering are delivered group by group, the amount of signal that must wait unnecessarily in the renderer's internal buffer can be minimized. Moreover, the amount of data the renderer 1620 can accept at once may vary; this information can also be reported to the sequence control unit 1630, so that the decoder 1610 can determine the decoding timing and the transmission amount.
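A hypothetical policy for the sequence control unit; the command vocabulary ("remove", "fixed_channel") and the group-ordered transmission below are invented for illustration and are not defined by the disclosure.

```python
def plan_transmission(objects_meta, renderer_cmds):
    """Hypothetical order-controller policy: drop objects the renderer
    removes, pre-downmix objects pinned to one channel, and send the rest
    ordered by spatial group to keep the renderer's input buffer small."""
    plan = []
    for obj in objects_meta:                 # each: {'id', 'group', 'channel'}
        cmd = renderer_cmds.get(obj["id"], "render")
        if cmd == "remove":
            continue                         # neither decoded nor transmitted
        if cmd == "fixed_channel":
            plan.append(("pre_downmix", obj["id"], obj["channel"]))
        else:
            plan.append(("send", obj["id"], obj["group"]))
    plan.sort(key=lambda it: -1 if it[0] == "pre_downmix" else it[2])
    return plan

meta = [{"id": "o1", "group": 0, "channel": None},
        {"id": "o2", "group": 1, "channel": "FC"},
        {"id": "o3", "group": 0, "channel": None}]
print(plan_transmission(meta, {"o2": "fixed_channel", "o3": "remove"}))
```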

On the other hand, the decoding control by the sequence control unit 1630 may be transmitted further upstream to the encoding stage, so that the encoding process itself can be controlled; that is, unnecessary signals can be excluded from encoding, and the grouping of objects and channels can be determined, among other things.

(Voice highway)

On the other hand, an object corresponding to the voice of a bidirectional communication may be included in the bit stream. Since bidirectional communication, unlike other content, is very sensitive to time delay, such an object or channel signal should be transmitted to the renderer first when it is received. The corresponding object or channel signal can be indicated by a separate flag or the like. Unlike the other object/channel signals contained in the same frame, such a priority transport object has a presentation time independent of theirs.
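A minimal sketch of this priority path, assuming a hypothetical is_voice flag and per-object presentation time:

```python
def forward_frame(frame, renderer):
    """frame.objects: decoded object/channel signals of one frame;
    frame.pts: the frame's shared presentation time."""
    voice = [o for o in frame.objects if getattr(o, "is_voice", False)]
    others = [o for o in frame.objects if not getattr(o, "is_voice", False)]
    for obj in voice:                      # lowest-latency path first
        renderer.submit(obj, pts=obj.pts)  # independent presentation time
    for obj in others:
        renderer.submit(obj, pts=frame.pts)
```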

(AV Matching and Phantom Center)

One of the new problems to consider with ultra-high-definition TV (UHDTV) is what is often called the near field. That is, given the viewing distance of a typical user environment (living room), the distance from each reproduction speaker to the listener is shorter than the distance between the speakers, so each speaker operates as a point source. Moreover, in the absence of a speaker at the center, the spatial resolution of sound objects synchronized with the video must be very high to enable a high-quality 3D audio service.

At a conventional audiovisual angle of about 30 degrees, stereo speakers arranged on the left and right are not in a near-field situation and are sufficient to provide a sound scene matching the movement of an object on the screen (for example, a vehicle moving from left to right). However, in a UHDTV environment where the audiovisual angle reaches 100 degrees, not only left-right resolution but also additional resolution covering the top and bottom of the screen is required. For example, if there are two characters on the screen, having both voices appear to come from the same point is not a serious problem for realism on current HDTV; at UHDTV size, however, the mismatch will be perceived as a new form of distortion.

One of the solutions to this problem is a 22.2 channel speaker configuration. FIG. 2 is an example of a 22.2 channel arrangement. According to FIG. 2, a total of eleven speakers are arranged on the front side to increase the horizontal and vertical spatial resolution of the front stage: five speakers are placed in the middle layer, where there used to be three, and three speakers each are added in the upper and lower layers so that height variations in the sound can be adequately covered. With such an arrangement, the spatial resolution of the front face is higher than in the past, which is advantageous for matching with the video signal. However, in current TVs using display devices such as LCDs and OLEDs, the display occupies positions where speakers should be present. That is, unless the display itself emits sound or is acoustically transparent, there is the problem that a sound matched to each object position on the screen must be provided by speakers located outside the display area. In FIG. 2, the speakers corresponding to the absent positions FLc, FC, and FRc would be disposed at positions overlapping the display.

FIG. 17 is a conceptual diagram illustrating how, in the 22.2 channel system, the front speakers absent because of the display are reproduced using the surrounding channels. To cover the FLc, FC, and FRc components, additional speakers may also be arranged around the upper and lower edges of the display, as shown by the dotted circles. According to FIG. 17, there may be seven surrounding channels that can be used to generate FLc. Using these seven speakers, a sound corresponding to the position of the absent speaker can be reproduced by the principle of generating a virtual source.

Techniques and properties such as VBAP or the Haas effect (precedence effect) can be used to create virtual sources with the surrounding speakers. Alternatively, different panning techniques may be applied depending on the frequency band. Furthermore, changing the azimuth and adjusting the perceived height using HRTFs can be considered. For example, when replacing FC with BtFC, this can be implemented by applying an HRTF with a rising elevation characteristic to the FC channel signal and adding it to BtFC. What HRTF observation shows is that controlling the perceived height of a sound requires controlling the position of a specific null in the high-frequency band, which varies from person to person. To generalize across this person-to-person variation of the null, the height adjustment can instead be implemented by broadly increasing or decreasing the high-frequency band. If such a method is used, there is the disadvantage that the signal is distorted by the influence of the filter.
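The generalized broad high-frequency boost or cut can be sketched with a standard high-shelf biquad (the RBJ audio-EQ cookbook form); the corner frequency and gain below are illustrative values, not parameters fixed by this document:

```python
import numpy as np
from scipy.signal import lfilter

def high_shelf(x, fs, f0=8000.0, gain_db=4.0):
    """Broad high-frequency shelf: positive gain_db raises the perceived
    elevation, negative lowers it (at the cost of spectral distortion)."""
    A = 10.0 ** (gain_db / 40.0)
    w0 = 2.0 * np.pi * f0 / fs
    alpha = np.sin(w0) / 2.0 * np.sqrt(2.0)  # shelf slope S = 1
    cosw = np.cos(w0)
    b = np.array([A * ((A + 1) + (A - 1) * cosw + 2 * np.sqrt(A) * alpha),
                  -2 * A * ((A - 1) + (A + 1) * cosw),
                  A * ((A + 1) + (A - 1) * cosw - 2 * np.sqrt(A) * alpha)])
    a = np.array([(A + 1) - (A - 1) * cosw + 2 * np.sqrt(A) * alpha,
                  2 * ((A - 1) - (A + 1) * cosw),
                  (A + 1) - (A - 1) * cosw - 2 * np.sqrt(A) * alpha])
    return lfilter(b / a[0], a / a[0], x)
```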

A processing method according to the present invention for placing a sound source at the position of an absent speaker is shown in FIG. 18. Referring to FIG. 18, the channel signal corresponding to the phantom speaker position is taken as the input signal and passed through a subband filter unit 1810 that divides it into three bands. Alternatively, the signal may be divided into two bands instead of three, or divided into three bands with the upper two bands processed differently from each other. The first band, the lowest frequency band, is relatively insensitive to position and can therefore be reproduced through a large speaker such as a woofer or subwoofer instead. A time delay 1820 is applied to the first-band signal to exploit the precedence effect. This time delay is not meant to compensate for the filter delay incurred in processing the other bands; rather, it provides an additional delay so that the first band is reproduced later than the other band signals, thereby providing the precedence effect.
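A sketch of the band split and the precedence delay applied to the first band; the crossover frequencies and the delay are assumptions, as the text fixes no specific values:

```python
import numpy as np
from scipy.signal import butter, sosfilt

def split_three_bands(x, fs, f1=200.0, f2=2500.0):
    """Divide the phantom-channel signal into three bands (cf. unit 1810)."""
    low = sosfilt(butter(4, f1, btype="lowpass", fs=fs, output="sos"), x)
    mid = sosfilt(butter(4, [f1, f2], btype="bandpass", fs=fs, output="sos"), x)
    high = sosfilt(butter(4, f2, btype="highpass", fs=fs, output="sos"), x)
    return low, mid, high

def precedence_delay(band, fs, delay_ms=10.0):
    """Delay the first band (cf. unit 1820) so the woofer fires *after* the
    nearby speakers: an added delay for the Haas effect, not latency
    compensation."""
    n = int(fs * delay_ms / 1000.0)
    return np.concatenate([np.zeros(n), band])[: len(band)]
```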

The second band is a signal to be reproduced through speakers around the phantom speaker position (speakers disposed around the bezel of the TV display); it is divided across at least two speakers, and panning gains computed by a panning algorithm 1830 such as VBAP are generated and applied. To improve the panning effect, the number and positions (relative to the phantom speaker) of the speakers through which the second-band output is reproduced therefore need to be provided accurately. In this case, an HRTF-based filter may be applied in addition to VBAP panning, or different phase filters or time-delay filters may be applied to provide a time-panning effect. Another advantage of band-limiting before applying an HRTF is that the signal distortion caused by the HRTF can be confined to the processed band.
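Two-speaker VBAP gains for the second band can be computed as in the sketch below; the ±10 degree speaker angles in the comment are a hypothetical placement around the display:

```python
import numpy as np

def vbap_gains_2d(source_deg, spk1_deg, spk2_deg):
    """Solve g1*l1 + g2*l2 = p for the speaker pair, then power-normalize.
    Angles are measured in the plane of the speaker pair, in degrees."""
    def unit(deg):
        r = np.radians(deg)
        return np.array([np.cos(r), np.sin(r)])
    L = np.vstack([unit(spk1_deg), unit(spk2_deg)])  # rows = speaker vectors
    g = unit(source_deg) @ np.linalg.inv(L)          # tangent-law gains
    return g / np.linalg.norm(g)                     # constant total power

# e.g. phantom FC replaced by speakers 10 degrees above and below the
# display: vbap_gains_2d(0.0, 10.0, -10.0) -> roughly 0.707 for each.
```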

The third band is used to generate a signal to be reproduced through a speaker array, if one is present; an array signal processing technique 1840 for sound source virtualization through at least three speakers may be applied, or coefficients generated by WFS (Wave Field Synthesis) may be applied. The third band and the second band may in fact be the same band.
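As a simplified stand-in for WFS-derived coefficients, per-speaker delays and gains for a virtual source behind the array can be sketched as follows (speaker and source positions are assumed inputs):

```python
import numpy as np

def array_delays_gains(speaker_pos, virtual_src, fs, c=343.0):
    """Delay-and-sum sketch for a speaker array (cf. unit 1840).
    speaker_pos: (K, 2) speaker coordinates in metres;
    virtual_src: (2,) position of the virtual source."""
    d = np.linalg.norm(speaker_pos - virtual_src, axis=1)  # source-speaker distances
    delays = np.round((d - d.min()) / c * fs).astype(int)  # relative delays, samples
    gains = d.min() / d                                    # 1/r spreading loss
    return delays, gains
```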

FIG. 19 shows an embodiment in which the signal generated for each band is mapped to speakers disposed in the vicinity of the TV. According to FIG. 19, the speakers corresponding to the second and third bands must be located at relatively precisely defined positions, and their number and position information is preferably provided to the processing system of FIG. 18.

FIG. 20 is a diagram illustrating the relationship among products in which an audio signal processing apparatus according to an embodiment of the present invention is implemented. Referring to FIG. 20, a wired/wireless communication unit 310 receives a bitstream through a wired/wireless communication scheme. More specifically, the wired/wireless communication unit 310 may include at least one of a wired communication unit 310A, an infrared communication unit 310B, a Bluetooth unit 310C, and a wireless LAN communication unit 310D.

The user authentication unit 320 receives user information and performs user authentication. It may include at least one of a fingerprint recognition unit 320A, an iris recognition unit 320B, a face recognition unit 320C, and a voice recognition unit 320D, which receive fingerprint, iris, face contour, and voice information respectively, convert it into user information, and perform user authentication by determining whether the user information matches previously registered user data.

The input unit 330 is an input device through which a user inputs various kinds of commands, and may include at least one of a keypad unit 330A, a touch pad unit 330B, and a remote control unit 330C, although the present invention is not limited thereto.

The signal coding unit 340 performs encoding or decoding on the audio signal and/or the video signal received through the wired/wireless communication unit 310 and outputs an audio signal in the time domain. It includes an audio signal processing apparatus 345, which corresponds to the above-described embodiments of the present invention (i.e., the decoder 600 according to one embodiment and the encoder and decoder 1400 according to another embodiment); the audio signal processing apparatus 345 and the signal coding unit including it may be implemented by one or more processors.

The control unit 350 receives input signals from the input devices and controls all processes of the signal coding unit 340 and the output unit 360. The output unit 360 is a component for outputting the output signal generated by the signal coding unit 340, and may include a speaker unit 360A and a display unit 360B. When the output signal is an audio signal it is output through the speaker, and when it is a video signal it is output through the display.

The audio signal processing method according to the present invention may be implemented as a program to be executed by a computer and stored in a computer-readable recording medium, and multimedia data having the data structure according to the present invention may likewise be stored in a computer-readable recording medium. Computer-readable recording media include all kinds of storage devices in which data readable by a computer system is stored. Examples include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices, and media implemented in the form of carrier waves (for example, transmission over the Internet) are also included. In addition, the bit stream generated by the encoding method may be stored in a computer-readable recording medium or transmitted over a wired/wireless communication network.

While the present invention has been particularly shown and described with reference to exemplary embodiments thereof, the invention is not limited to the disclosed embodiments; it will be understood that various modifications and changes may be made without departing from the scope of the appended claims.

210: Upper layer
220: middle layer
230: bottom layer
240: LFE channel
410: first object signal group
420: second object signal group
430: Listener
520: Downmixer & parameter encoder 1
550: Object grouping unit
560: Waveform coder
620: Waveform Decoder
630: Upmixer & Parameter Decoder 1
670: Object degrouping unit
860: 3DA decoding unit
870: 3DA rendering unit
1110: Masking threshold curve of object 1
1120: masking threshold curve of object 2
1130: masking threshold curve of object 3
1230: Psychoacoustic model
1310: 5.1 channel loudspeakers placed according to the ITU-R recommendation
1320: 5.1 channel loudspeakers placed at arbitrary positions
1610: 3D audio decoder
1620: 3D Audio Renderer
1630: Sequence control unit
1810: Subband filter section
1820: Time delay filter unit
1830: Panning Algorithm
1840: speaker array control unit

Claims (13)

  1. A method for processing an audio signal, the method comprising: receiving a first signal for a first object signal group including a plurality of object signals and a second signal for a second object signal group including a plurality of object signals;
    Receiving first metadata for the first object signal group and second metadata for the second object signal group;
    reproducing an object signal belonging to the first object signal group using the first signal and the first metadata; and
    reproducing an object signal belonging to the second object signal group using the second signal and the second metadata,
    Wherein each of the metadata includes position information of an object corresponding to an object signal included in a corresponding object signal group,
    Wherein, if the object is a dynamic object whose position varies with time, the position information is a position value relative to the position value of the object at a preceding point in time,
    Wherein the first object signal group and the second object signal group are mixed to form a single sound scene.
  2. The method according to claim 1,
    further comprising: obtaining downmix gain information for at least one object signal belonging to the first object signal group from the first metadata; and reproducing the at least one object signal using the downmix gain information.
  3. The method according to claim 1,
    further comprising receiving global gain information, wherein the global gain information is a gain value applied to both the first object signal group and the second object signal group.
  4. The method according to claim 1,
    further comprising receiving information indicating that the position information of the dynamic object is a position value relative to the position value of the object at a preceding point in time.
  5. A method comprising: generating a first signal for a first object signal group including a plurality of object signals and a second signal for a second object signal group including a plurality of object signals;
    Generating first metadata for the first object signal group and second metadata for the second object signal group,
    wherein the object signal belonging to the first object signal group is reproduced using the first signal and the first metadata, and the object signal belonging to the second object signal group is reproduced using the second signal and the second metadata,
    Wherein each of the metadata includes position information of an object corresponding to an object signal included in a corresponding object signal group,
    Wherein, if the object is a dynamic object whose position varies with time, the position information is a position value relative to the position value of the object at a preceding point in time,
    Wherein the first object signal group and the second object signal group are mixed to form a single sound scene.
  6. delete
  7. delete
  8. delete
  9. delete
  10. delete
  11. delete
  12. delete
  13. delete
KR1020120084229A 2012-07-31 2012-07-31 Apparatus and method for audio signal processing KR101949756B1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020120084229A KR101949756B1 (en) 2012-07-31 2012-07-31 Apparatus and method for audio signal processing

Applications Claiming Priority (7)

Application Number Priority Date Filing Date Title
KR1020120084229A KR101949756B1 (en) 2012-07-31 2012-07-31 Apparatus and method for audio signal processing
PCT/KR2013/006732 WO2014021588A1 (en) 2012-07-31 2013-07-26 Method and device for processing audio signal
CN201380039768.3A CN104541524B (en) 2012-07-31 2013-07-26 A kind of method and apparatus for processing audio signal
EP13825888.4A EP2863657B1 (en) 2012-07-31 2013-07-26 Method and device for processing audio signal
JP2015523022A JP6045696B2 (en) 2012-07-31 2013-07-26 Audio signal processing method and apparatus
US14/414,910 US9564138B2 (en) 2012-07-31 2013-07-26 Method and device for processing audio signal
US15/383,293 US9646620B1 (en) 2012-07-31 2016-12-19 Method and device for processing audio signal

Publications (2)

Publication Number Publication Date
KR20140017342A KR20140017342A (en) 2014-02-11
KR101949756B1 true KR101949756B1 (en) 2019-04-25

Family

ID=50266004

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020120084229A KR101949756B1 (en) 2012-07-31 2012-07-31 Apparatus and method for audio signal processing

Country Status (1)

Country Link
KR (1) KR101949756B1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20180009750A (en) * 2015-06-17 2018-01-29 삼성전자주식회사 Method and apparatus for processing an internal channel for low computation format conversion
CN106373583B (en) * 2016-09-28 2019-05-21 北京大学 Multi-audio-frequency object coding and decoding method based on ideal soft-threshold mask IRM

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2010507114A (en) * 2006-10-16 2010-03-04 コーニンクレッカ フィリップス エレクトロニクス エヌ ヴィ Apparatus and method for multi-channel parameter conversion
US20120183148A1 (en) * 2011-01-14 2012-07-19 Korea Electronics Technology Institute System for multichannel multitrack audio and audio processing method thereof

Also Published As

Publication number Publication date
KR20140017342A (en) 2014-02-11

Legal Events

Date Code Title Description
A201 Request for examination
E902 Notification of reason for refusal
E701 Decision to grant or registration of patent right
GRNT Written decision to grant