CN117321680A - Apparatus and method for processing multi-channel audio signal

Publication number: CN117321680A
Authority: CN (China)
Legal status: Pending
Application number: CN202280035900.2A
Other languages: Chinese (zh)
Inventors: 孙允宰, 高祥铁, 南佑铉, 金敬来, 金正奎, 李泰美, 郑铉权, 黄盛熙
Current Assignee: Samsung Electronics Co Ltd
Original Assignee: Samsung Electronics Co Ltd
Application filed by Samsung Electronics Co Ltd
Priority claimed from KR1020210140581A (KR20220157848A)
Priority claimed from PCT/KR2022/006983 (WO2022245076A1)
Publication of CN117321680A

Abstract

An apparatus for processing audio comprising: at least one processor configured to obtain a downmix audio signal from a bitstream, obtain downmix related information from the bitstream, de-mix the downmix audio signal by using the downmix related information, and reconstruct an audio signal comprising at least one frame based on the de-mixed audio signal. The downmix related information is information generated in units of frames by using an audio scene type.

Description

Apparatus and method for processing multi-channel audio signal
Technical Field
The present disclosure relates to the field of processing multi-channel audio signals. More specifically, the present disclosure relates to the field of processing an audio signal of a lower channel layout (e.g., a three-dimensional (3D) audio channel layout in front of a listener) derived from a multi-channel audio signal. The present disclosure also relates to the field of performing a down-mixing process or an up-mixing process on a multi-channel audio signal according to an audio scene type. Furthermore, the present disclosure relates to the field of performing a down-mixing process or an up-mixing process on a multi-channel audio signal according to an energy value of an audio signal of a height channel.
Background
Audio signals have typically been two-dimensional (2D) audio signals, such as 2-channel, 5.1-channel, 7.1-channel, and 9.1-channel audio signals.
However, because a 2D audio signal lacks audio information in the height direction, it may be necessary to generate a three-dimensional (3D) audio signal (an n-channel audio signal or a multi-channel audio signal, where n is an integer greater than 2) from the 2D audio signal to provide a spatial 3D effect of sound.
In a conventional channel layout for 3D audio signals, channels are arranged omnidirectionally around a listener. However, with the expansion of over-the-top (OTT) services, the increase in television (TV) resolution, and the growth of electronic device screens such as tablet computers, viewers increasingly demand immersive, cinema-like sound in the home environment. Therefore, it is necessary to process an audio signal of a 3D audio channel layout in which channels are arranged in front of the listener (a 3D audio channel layout in front of the listener), in consideration of the sound-image representation of objects (sound sources) on a screen.
Furthermore, in a conventional 3D audio signal processing system, an independent audio signal has been encoded/decoded for each independent channel of a 3D audio signal. In particular, in order to reconstruct a two-dimensional (2D) audio signal such as a conventional stereo audio signal, the 3D audio signal must first be reconstructed and then downmixed.
Disclosure of Invention
Technical problem
Embodiments of the present disclosure provide for processing of multi-channel audio signals to support a three-dimensional (3D) audio channel layout in front of a listener.
Solution to the problem
According to one aspect of the present disclosure, a method of processing audio includes: identifying an audio scene type of an audio signal, the audio signal comprising at least one frame; determining, in units of frames, downmix related information corresponding to an audio scene type; down-mixing the audio signal by using the down-mixing related information; and transmitting the downmix audio signal and the downmix related information.
The identification of the audio scene type may include: obtaining a center channel audio signal from the audio signal; identifying a dialogue type from the obtained center channel audio signal; obtaining a front channel audio signal and a side channel audio signal from the audio signal; identifying a sound effect type based on the front channel audio signal and the side channel audio signal; and identifying an audio scene type based on at least one of the identified dialog type and the identified sound effect type.
The identification of the dialog type may include: identifying a dialog type by using a first neural network for identifying a dialog type; identifying the dialog type as a first dialog type when the probability value of the dialog type identified by using the first neural network is greater than a predetermined first probability value of the first dialog type; and identifying the dialog type as a default dialog type when the probability value of the dialog type identified by using the first neural network is less than or equal to a predetermined first probability value.
The identification of the sound effect type may include: identifying a sound effect type by using a second neural network for identifying the sound effect type; identifying the sound effect type as a first sound effect type when the probability value of the sound effect type identified by using the second neural network is greater than a predetermined second probability value of the first sound effect type; and identifying the sound effect type as a default sound effect type when the probability value of the sound effect type identified by using the second neural network is less than or equal to a predetermined second probability value.
Identifying the audio scene type based on at least one of the identified dialog type or the identified sound effect type may include: identifying the audio scene type as a first dialog type when the identified dialog type is the first dialog type; identifying the audio scene type as a first sound effect type when the identified sound effect type is the first sound effect type; and identifying the audio scene type as a default type when the identified dialog type is a default type and the identified sound effect type is a default type.
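For illustration only, the decision flow described above may be sketched as follows. The probability thresholds, the label names, and the precedence of dialog over sound effects in this Python sketch are assumptions made for the example, not values or rules fixed by this disclosure.

```python
# A minimal sketch of the scene-type decision above. Thresholds, labels,
# and the dialog-before-effect precedence are illustrative assumptions.

def identify_scene_type(p_dialog: float, p_effect: float,
                        dialog_threshold: float = 0.5,
                        effect_threshold: float = 0.5) -> str:
    dialog_type = "DIALOG" if p_dialog > dialog_threshold else "DEFAULT"
    effect_type = "EFFECT" if p_effect > effect_threshold else "DEFAULT"
    if dialog_type == "DIALOG":
        return "DIALOG"   # identified first dialog type -> dialog scene
    if effect_type == "EFFECT":
        return "EFFECT"   # identified first sound effect type -> effect scene
    return "DEFAULT"      # both classifiers returned the default type

print(identify_scene_type(0.82, 0.31))  # -> DIALOG
```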
The transmitted downmix related information may include index information indicating one of a plurality of audio scene types.
The method may further comprise: detecting a sound source object; and identifying additional weight parameters for mixing from the surround channels to the height channels based on the information about the detected sound source object, wherein the downmix related information further includes the additional weight parameters.
The method may further comprise: identifying an energy value of a high channel audio signal from the audio signal; identifying energy values of surround channel audio signals from the audio signals; and identifying additional weight parameters for mixing from the surround channels to the height channels based on the identified energy values of the height channel audio signal and the identified energy values of the surround channel audio signal, wherein the downmix related information further comprises the additional weight parameters.
The identification of the additional weight parameters may include: identifying an additional weight parameter as a first value when the energy value of the height channel audio signal is greater than a predetermined first value and the ratio of the energy value of the height channel audio signal to the energy value of the surround channel audio signal is greater than a predetermined second value; and identifying the additional weight parameter as a second value when the energy value of the height channel audio signal is less than or equal to a predetermined first value or the ratio is less than or equal to a predetermined second value.
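A minimal Python sketch of this two-condition rule follows; the absolute energy threshold, the ratio threshold, and the two returned weight values are placeholder assumptions, not values specified in this disclosure.

```python
# Sketch of the energy rule above: the additional weight parameter takes the
# "first value" only when both conditions hold. All constants are assumed.

def identify_additional_weight(height_energy: float,
                               surround_energy: float,
                               energy_threshold: float = 1e-3,
                               ratio_threshold: float = 0.5) -> float:
    ratio = height_energy / max(surround_energy, 1e-12)  # guard divide-by-zero
    if height_energy > energy_threshold and ratio > ratio_threshold:
        return 1.0   # "first value": height channels carry significant energy
    return 0.0       # "second value"
```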
The identification of the additional weight parameters may include: identifying a weight level of at least one time period of the audio signal based on a weight target ratio within the audio content of the audio signal; and identifying an additional weight parameter corresponding to the weight level, and wherein the weight of the boundary segment between the first time segment of the audio signal and the second time segment of the audio signal has a value between the weight of the remaining segment of the first time segment other than the boundary segment and the weight of the remaining segment of the second time segment other than the boundary segment.
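The boundary-segment behavior described above may be illustrated with a simple interpolation; the linear ramp and the segment length below are assumptions, since the text only requires the boundary weight to lie between the two periods' weights.

```python
# Sketch: weights applied across the boundary segment between two time
# periods, interpolated between the first and second periods' weights.

def boundary_segment_weights(w_first: float, w_second: float,
                             num_samples: int) -> list[float]:
    step = (w_second - w_first) / (num_samples + 1)
    return [w_first + step * (i + 1) for i in range(num_samples)]

print(boundary_segment_weights(1.0, 0.5, 4))
# -> values stepping from ~0.9 down to ~0.6, strictly between 1.0 and 0.5
```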
The down-mixing may include: identifying a downmix profile corresponding to an audio scene type; obtaining a downmix weight parameter for mixing a second audio signal from at least one first audio signal of a first channel to a second audio signal of a second channel according to a downmix profile; and down-mixing the audio signal based on the obtained down-mixing weight parameter, and the down-mixing weight parameter may correspond to a previously determined audio scene type.
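As a loose illustration of this per-frame profile selection, the mapping below pairs each scene type with a set of downmix weight parameters; the parameter names and numeric values are placeholders, not profiles defined in this disclosure.

```python
# Hypothetical scene-type-to-profile mapping; names and values are assumed.
DOWNMIX_PROFILES = {
    "DIALOG":  {"alpha": 0.707, "beta": 0.707, "gamma": 0.866},
    "EFFECT":  {"alpha": 1.0,   "beta": 0.866, "gamma": 0.866},
    "DEFAULT": {"alpha": 1.0,   "beta": 1.0,   "gamma": 0.707},
}

def downmix_weight_parameters(scene_type: str) -> dict:
    # Per-frame lookup: the profile, and thus the downmix weight parameters,
    # follows the audio scene type identified for the frame.
    return DOWNMIX_PROFILES.get(scene_type, DOWNMIX_PROFILES["DEFAULT"])
```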
The detection of the sound source object may include: identifying a movement of the sound source object and a direction of the sound source object based on a correlation and delay between channels of the audio signal; and identifying a type of sound source object and a characteristic of the sound source object from the audio signal by using an object estimation probability model based on the gaussian mixture model, wherein the information about the detected sound source object includes information about at least one of a movement of the sound source object, a direction of the sound source object, a type of the sound source object, or a characteristic of the sound source object, and wherein identifying the additional weight parameter includes identifying the additional weight parameter for mixing from the surround channel to the height channel based on at least one of the movement of the sound source object, the direction of the sound source object, the type of the sound source object, or the characteristic of the sound source object.
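A very loose sketch of a Gaussian-mixture-based estimate is given below, assuming scikit-learn's GaussianMixture. The feature vectors, the number of mixture components, and their mapping to object types are all placeholders; the disclosure does not specify the structure of this model.

```python
# Placeholder sketch of a GMM-based object-type estimate (scikit-learn).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
training_features = rng.normal(size=(200, 4))   # stand-in per-frame features
gmm = GaussianMixture(n_components=3, random_state=0).fit(training_features)

frame_features = rng.normal(size=(1, 4))
probs = gmm.predict_proba(frame_features)[0]    # soft score per "object type"
print(probs.argmax(), probs)
```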
According to one aspect of the present disclosure, a method of processing audio includes: obtaining a downmix audio signal from a bitstream; obtaining downmix related information from the bitstream, wherein the downmix related information is generated in units of frames by using an audio scene type; de-mixing the downmix audio signal by using the downmix related information; and reconstructing an audio signal comprising at least one frame based on the de-mixed audio signal.
The audio scene type may be identified based on at least one of a dialog type or a sound effect type.
The audio signals may comprise up-mix channel group audio signals, wherein the up-mix channel group audio signals comprise up-mix channel audio signals of at least one up-mix channel, and wherein the up-mix channel audio signals comprise second audio signals obtained by de-mixing from first audio signals of at least one first channel.
The downmix related information may further include information on additional weight parameters for unmixing from the high channel to the surround channel, and the reconstructing of the audio signal may include reconstructing the audio signal by using the downmix weight parameters and the information on the additional weight parameters.
According to one aspect of the disclosure, an apparatus for processing audio includes at least one processor configured to execute one or more instructions, wherein the at least one processor is further configured to identify an audio scene type of an audio signal, the audio signal comprising at least one frame; determining, in units of frames, downmix related information corresponding to an audio scene type; down-mixing the audio signal by using the down-mixing related information; and transmitting the downmix audio signal and the downmix related information.
According to one aspect of the disclosure, an apparatus for processing audio includes at least one processor configured to execute one or more instructions, wherein the at least one processor is further configured to obtain a downmix audio signal from a bitstream; obtain downmix related information from the bitstream, wherein the downmix related information is generated in units of frames by using an audio scene type; de-mix the downmix audio signal by using the downmix related information; and reconstruct an audio signal comprising at least one frame based on the de-mixed audio signal.
A method of processing audio according to an embodiment includes: identifying an audio scene type of an audio signal including at least one frame; determining downmix related information corresponding to the audio scene type; down-mixing the audio signal including at least one frame by using the downmix related information; generating flag information indicating whether the audio scene type of a previous frame is the same as the audio scene type of a current frame, based on the audio scene type of the previous frame and the audio scene type of the current frame; and transmitting at least one of the downmix audio signal, the flag information, or the downmix related information.
The transmitting may include: when the audio scene type of the previous frame is the same as the audio scene type of the current frame, transmitting flag information indicating that the two audio scene types are the same together with the downmix related information of the previous frame, in which case the downmix related information of the current frame may not be transmitted.
Alternatively, the transmitting may include: when the audio scene type of the previous frame is the same as the audio scene type of the current frame, transmitting the downmix audio signal and the downmix related information of the previous frame, in which case neither the flag information indicating that the two audio scene types are the same nor the downmix related information of the current frame may be transmitted.
According to an embodiment of the present disclosure, a method for processing audio includes: obtaining a downmix audio signal from a bitstream; obtaining, from the bitstream, flag information indicating whether the audio scene type of a previous frame and the audio scene type of the current frame are identical to each other; obtaining downmix related information of the current frame based on the flag information, wherein the downmix related information of the current frame is information generated by using the audio scene type of the current frame; de-mixing the downmix audio signal by using the downmix related information of the current frame; and reconstructing an audio signal including at least one frame based on the de-mixed audio signal.
The acquiring of the downmix related information of the current frame may include: when the flag information indicates that the audio scene type of the previous frame is the same as the audio scene type of the current frame, the downmix related information of the current frame is acquired based on the downmix related information of the previous frame.
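The flag-driven behavior above can be summarized in a short sketch shared conceptually by the encoder and the decoder: when the scene type is unchanged, the previous frame's downmix related information is reused rather than re-sent or re-parsed. The bitstream interface below is hypothetical.

```python
# Sketch of per-frame reuse of downmix related information; the
# parse_from_bitstream callable is a hypothetical stand-in for real parsing.

def downmix_info_for_current_frame(same_as_prev_flag: bool,
                                   prev_frame_info: dict,
                                   parse_from_bitstream) -> dict:
    if same_as_prev_flag:
        return prev_frame_info        # reuse; nothing parsed for this frame
    return parse_from_bitstream()     # scene type changed: parse fresh info
```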
A computer-readable recording medium may have recorded thereon a program for implementing the methods of the above aspects of the present disclosure.
Advantageous effects of the present disclosure
With the method and apparatus for processing a multi-channel audio signal according to the embodiments of the present disclosure, both an audio signal of a three-dimensional (3D) audio channel layout in front of a listener and an audio signal of a 3D audio channel layout surrounding the listener can be encoded while supporting backward compatibility with a conventional stereo (2-channel) audio signal.
However, effects achieved by the apparatus and method of processing a multi-channel audio signal according to the embodiments of the present disclosure are not limited to those described above, and other effects not mentioned will be clearly understood from the following description by those of ordinary skill in the art to which the present disclosure pertains.
Drawings
The foregoing and other aspects, features, and advantages of certain embodiments of the disclosure will become more apparent from the following description, taken in conjunction with the accompanying drawings, in which:
Fig. 1A is a view for describing a scalable channel layout structure according to an embodiment.
Fig. 1B is a view for describing an example of a detailed scalable audio channel layout structure.
Fig. 2A is a block diagram of an audio encoding apparatus according to an embodiment.
Fig. 2B is a block diagram of an audio encoding apparatus according to an embodiment.
Fig. 2C is a block diagram of a structure of a multi-channel audio signal processor according to an embodiment.
Fig. 2D is a view for describing an example of detailed operation of the audio signal classifier.
Fig. 3A is a block diagram of a structure of a multi-channel audio decoding apparatus according to an embodiment.
Fig. 3B is a block diagram of a structure of a multi-channel audio decoding apparatus according to an embodiment.
Fig. 3C is a block diagram of a structure of a multi-channel audio signal reconstructor according to an embodiment.
Fig. 3D is a block diagram of the structure of an up-mix channel group audio generator according to an embodiment.
Fig. 4A is a block diagram of an audio encoding apparatus according to an embodiment.
Fig. 4B is a block diagram of the structure of an error-cancellation-related information generator according to an embodiment.
Fig. 5A is a block diagram of a structure of an audio decoding apparatus according to an embodiment.
Fig. 5B is a block diagram of a structure of a multi-channel audio signal reconstructor according to an embodiment.
Fig. 6A is a view for describing the transmission order and rules of an audio stream in each channel group of the audio encoding apparatus according to the embodiment.
Fig. 6B and 6C illustrate examples of mechanisms for progressive downmixing according to an embodiment.
Fig. 7A is a block diagram of an audio encoding apparatus according to an embodiment.
Fig. 7B is a block diagram of an audio encoding apparatus according to an embodiment.
Fig. 8 is a block diagram of an audio encoding apparatus according to an embodiment.
Fig. 9A is a block diagram of a structure of a multi-channel audio decoding apparatus according to an embodiment.
Fig. 9B is a block diagram of an audio decoding apparatus according to an embodiment.
Fig. 10 is a block diagram of an audio decoding apparatus according to an embodiment.
Fig. 11 is a view for describing in detail a process of identifying the type of audio scene content by an audio encoding apparatus according to an embodiment.
Fig. 12 is a view for describing a first Deep Neural Network (DNN) for identifying a dialog type according to an embodiment.
Fig. 13 is a view for describing a second DNN for identifying a type of sound effect according to an embodiment.
Fig. 14 is a view for describing in detail a process, performed by an audio encoding apparatus according to an embodiment, of identifying additional weight parameters for mixing from a surround channel to a height channel.
Fig. 15 is a view for describing in detail another process, performed by an audio encoding apparatus according to an embodiment, of identifying additional weight parameters for mixing from a surround channel to a height channel.
Fig. 16 is a flowchart of an audio processing method according to an embodiment.
Fig. 17A is a flowchart of an audio processing method according to an embodiment.
Fig. 17B is a flowchart of an audio processing method according to an embodiment.
Fig. 17C is a flowchart of an audio processing method according to an embodiment.
Fig. 17D is a flowchart of an audio processing method according to an embodiment.
Fig. 18A is a flowchart of an audio processing method according to an embodiment.
Fig. 18B is a flowchart of an audio processing method according to an embodiment.
Fig. 18C is a flowchart of an audio processing method according to an embodiment.
Fig. 18D is a flowchart of an audio processing method according to an embodiment.
Detailed Description
Throughout this disclosure, the expression "at least one of a, b or c" means a only, b only, c only, both a and b, both a and c, both b and c, all of a, b, and c, or variants thereof.
The disclosure is capable of various modifications thereto and various embodiments of the disclosure, and therefore specific embodiments of the disclosure are illustrated in the drawings and described in detail in the detailed description. It should be understood, however, that there is no intention to limit the disclosure to the specific embodiments of the disclosure, and it is to be understood that the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the disclosure.
In describing the embodiments of the present disclosure, when it is determined that the detailed description of the related art may unnecessarily obscure the subject matter, the detailed description thereof will be omitted. Moreover, the numerals (e.g., first, second, etc.) used in describing embodiments of the present disclosure are merely identification symbols that distinguish one component from another.
Furthermore, in this document, when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or directly coupled to the other element, but it should be understood that the element can be connected or coupled to the other element via yet another element therebetween, unless otherwise indicated.
Further, for a component denoted by "… unit", "module", or the like, two or more components may be integrated into one component, or one component may be divided into two or more components for each specific function. Each component to be described below may additionally perform a function responsible for some or all of the functions of other components in addition to the main function of the component, and some of the main functions of the components may be dedicated to and performed by other components.
Here, the "Deep Neural Network (DNN)" is a representative example of an artificial neural network model that simulates a brain nerve, and is not limited to an artificial neural network model using a specific algorithm.
Here, the "parameter" may be a value used during the operation of each layer constituting the neural network, and may include, for example, a weight (and deviation) used when an input value is applied to a predetermined calculation formula. The parameters may be expressed in the form of a matrix. The parameter may be a value set as a result of training and may be updated by separate training data as needed.
Here, the "multi-channel audio signal" may refer to an audio signal of n channels (where n is an integer greater than 2). The "mono audio signal" may be a one-dimensional (1D) audio signal, the "stereo channel audio signal" may be a two-dimensional (2D) audio signal, and the "multi-channel audio signal" may be a three-dimensional (3D) audio signal.
Here, the "channel (speaker) layout" may represent a combination of at least one channel, and a spatial arrangement of channels (speakers) may be specified. The channel used herein is a channel through which an audio signal is actually output, and thus may be referred to as a rendering channel.
For example, the channel layout may be an "X.Y.Z channel layout". Here, X may be the number of surround channels, Y the number of subwoofer channels, and Z the number of height channels. The channel layout may specify the spatial positions of the surround channels, subwoofer channels, and height channels.
Examples of the "channel (speaker) layout" may include a 1.0.0 channel (or mono) layout, a 2.0.0 channel (or stereo) layout, a 5.1.0 channel layout, a 5.1.2 channel layout, a 5.1.4 channel layout, a 7.1.0 channel layout, a 7.1.2 channel layout, and a 3.1.2 channel layout, but the "channel layout" is not limited thereto, and various other channel layouts may exist.
Channels specified by the "channel (speaker) layout" may be referred to as various names, but may be collectively named for convenience of explanation.
The channels constituting the "channel (speaker) layout" may be named based on the respective spatial positions of the channels.
For example, the first surround channel of a 1.0.0 channel layout may be named mono. For a 2.0.0 channel layout, the first surround channel may be named the L2 channel and the second surround channel may be named the R2 channel.
Here, "L" represents a channel located on the left side of the listener, and "R" represents a channel located on the right side of the listener. "2" means that the number of surround channels is 2.
For a 5.1.0 channel layout, the first surround channel may be named the L5 channel, the second surround channel may be named the R5 channel, the third surround channel may be named the C channel, the fourth surround channel may be named the Ls5 channel, and the fifth surround channel may be named the Rs5 channel. Here, "C" denotes a channel located at the center with respect to the listener. "s" refers to the channel on the listener side. The first subwoofer channel of the 5.1.0 channel layout may be named the Low Frequency Effects (LFE) channel. Here, LFE may refer to a low frequency effect. In other words, the LFE channel may be a channel for outputting a low frequency sound effect.
The surround channels of the 5.1.2 channel layout and the 5.1.4 channel layout may be named the same as the surround channels of the 5.1.0 channel layout. Similarly, the subwoofer channels of the 5.1.2 channel layout and the 5.1.4 channel layout may be named the same as the subwoofer channels of the 5.1.0 channel layout.
The first high channel of the 5.1.2 channel layout may be named Hl5 channel. Here, H denotes a height channel. The second height channel may be named the Hr5 channel.
For a 5.1.4 channel layout, the first height channel may be named Hfl channel, the second height channel may be named Hfr channel, the third height channel may be named Hbl channel, and the fourth height channel may be named Hbr channel. Here, f denotes a front channel with respect to the listener, and b denotes a rear channel with respect to the listener.
For a 7.1.0 channel layout, the first surround channel may be named L channel, the second surround channel may be named R channel, the third surround channel may be named C channel, the fourth surround channel may be named Ls channel, the fifth surround channel may be named Rs channel, the sixth surround channel may be named Lb channel, and the seventh surround channel may be named Rb channel.
The individual surround channels of the 7.1.2 channel layout and the 7.1.4 channel layout may be named the same as the surround channels of the 7.1.0 channel layout. Similarly, the individual subwoofer channels of the 7.1.2 channel layout and the 7.1.4 channel layout may be named the same as the subwoofer channels of the 7.1.0 channel layout.
For a 7.1.2 channel layout, the first height channel may be named Hl7 channel and the second height channel may be named Hr7 channel.
For a 7.1.4 channel layout, the first height channel may be named Hfl channel, the second height channel may be named Hfr channel, the third height channel may be named Hbl channel, and the fourth height channel may be named Hbr channel.
For a 3.1.2 channel layout, the first surround channel may be named the L3 channel, the second surround channel may be named the R3 channel, and the third surround channel may be named the C channel. The first subwoofer channel of the 3.1.2 channel layout may be named LFE channel. For a 3.1.2 channel layout, the first height channel may be named Hfl3 channel (or Tl channel) and the second height channel may be named Hfr3 channel (or Tr channel).
Here, some channels may be named differently according to channel layout, but may represent the same channel. For example, the Hl5 channel and the Hl7 channel may be the same channel. Similarly, the Hr5 channel and the Hr7 channel may be the same channel.
Meanwhile, the channels are not limited to the above-described channel names, and various other channel names may be used.
For example, the L2 channel may be named the L″ channel, the R2 channel the R″ channel, the L3 channel the ML3 (L′) channel, the R3 channel the MR3 (R′) channel, the Hfl3 channel the MHL3 channel, the Hfr3 channel the MHR3 channel, the Ls5 channel the MSL5 (Ls′) channel, the Rs5 channel the MSR5 (Rs′) channel, the Hl5 channel the MHL5 (Hl′) channel, the Hr5 channel the MHR5 (Hr′) channel, and the C channel the MC channel.
The channels of each of the above channel layouts may be named as shown in Table 1.
TABLE 1

| Channel layout | Channel names |
|----------------|---------------|
| 1.0.0 | Mono |
| 2.0.0 | L2/R2 |
| 5.1.0 | L5/C/R5/Ls5/Rs5/LFE |
| 5.1.2 | L5/C/R5/Ls5/Rs5/Hl5/Hr5/LFE |
| 5.1.4 | L5/C/R5/Ls5/Rs5/Hfl/Hfr/Hbl/Hbr/LFE |
| 7.1.0 | L/C/R/Ls/Rs/Lb/Rb/LFE |
| 7.1.2 | L/C/R/Ls/Rs/Lb/Rb/Hl7/Hr7/LFE |
| 7.1.4 | L/C/R/Ls/Rs/Lb/Rb/Hfl/Hfr/Hbl/Hbr/LFE |
| 3.1.2 | L3/C/R3/Hfl3/Hfr3/LFE |
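For illustration, the X.Y.Z convention above can be parsed mechanically; the small helper below is a sketch (the function name and the absence of validation are assumptions).

```python
# Split an "X.Y.Z" layout string into (surround, subwoofer, height) counts.

def parse_channel_layout(layout: str) -> tuple[int, int, int]:
    surround, subwoofer, height = (int(part) for part in layout.split("."))
    return surround, subwoofer, height

assert parse_channel_layout("3.1.2") == (3, 1, 2)
assert parse_channel_layout("7.1.4") == (7, 1, 4)
```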
Meanwhile, the "transmission channel" is a channel for transmitting a compressed audio signal, and a part of the "transmission channel" may be the same as the "presentation channel", but is not limited thereto, and another part of the "transmission channel" may be a channel (mixed channel) of an audio signal in which the audio signal of the presentation channel is mixed. In other words, the "transmission channel" may be a channel containing an audio signal of the "presentation channel", but may also be a channel of which a part is the same as the presentation channel and the rest is a mixed channel different from the presentation channel.
The "transmission channel" may be named as distinguished from the "presentation channel". For example, when the transmission channel is an a/B channel, the a/B channel may contain an audio signal of an L2/R2 channel. When the transmission channel is a T/P/Q channel, the T/P/Q channel may contain audio signals of C/LFE/Hfl3 and Hfr3 channels. When the transmission channel is an S/U/V channel, the S/U/V channel may contain audio signals of L and R/Ls and Rs/Hfl and Hfr channels.
In the present disclosure, a "3D audio signal" may refer to an audio signal for detecting sound distribution and sound source position in a 3D space.
In the present disclosure, the "3D audio channel in front of the listener" may refer to a 3D audio channel based on a layout of audio channels arranged in front of the listener. The "3D audio channel in front of the listener" may be referred to as the "front 3D audio channel". In particular, the "3D audio channel in front of the listener" may be referred to as a "screen-centered 3D audio channel" because it is a 3D audio channel based on the layout of audio channels arranged around a screen located in front of the listener.
In this disclosure, a "listener-omni-directional 3D audio channel" may refer to a 3D audio channel based on a layout of audio channels arranged omni-directionally around the listener. The "listener omni-directional 3D audio channel" may be referred to as the "full 3D audio channel". Here, omni-directional may refer to all directions including forward, lateral, and rearward directions. In particular, the "listener omni-directional 3D audio channel" may also be referred to as a "listener centered 3D audio channel" because it is a 3D audio channel based on a layout of audio channels arranged omni-directionally around the listener.
In the present disclosure, a "channel group" as a kind of data unit may comprise a (compressed) audio signal of at least one channel. More specifically, the channel group may include at least one of a base channel group independent of another channel group or a dependent channel group dependent on at least one channel group. In this case, the target channel group on which the dependent channel group depends may be another dependent channel group, and may be a dependent channel group related to a lower channel layout. Alternatively, the channel group on which the dependent channel group depends may be a base channel group. The "channel group" contains one kind of data of the channel group, and thus the channel group may be referred to as "encoding code group". The sub channel group for further expanding the number of channels from the channels included in the base channel group may be referred to as a scalable channel group or an expanded channel group.
The audio signal of the "basic channel group" may include a mono audio signal or a stereo audio signal. The audio signal of the "basic channel group" may include, without being limited thereto, audio signals of 3D audio channels in front of the listener.
For example, the audio signal of the "sub-channel group" may include audio signals of 3D audio channels in front of the listener or audio signals of channels other than the audio signal of the "basic channel group" among audio signals of the listener omni-directional 3D audio channels. In this case, a part of the audio signal of the other channel may be an audio signal in which the audio signal of at least one channel is mixed (i.e., an audio signal of a mixed channel).
For example, the audio signal of the "basic channel group" may be a mono audio signal or a stereo audio signal. The "multi-channel audio signal" reconstructed based on the audio signals of the "base channel group" and the "dependent channel group" may be an audio signal of a 3D audio channel in front of the listener or an audio signal of a listener omni-directional 3D audio channel.
In the present disclosure, "up-mixing" may refer to an operation in which the number of presentation channels of an output audio signal is increased compared to the number of presentation channels of an input audio signal by unmixing.
In the present disclosure, "de-mixing" may refer to an operation of separating an audio signal of a specific channel from an audio signal in which audio signals of various channels are mixed (i.e., an audio signal of a mixed channel), and may refer to one of mixing operations. In this case, "unmixing" may be implemented as a calculation using a "unmixed matrix" (or a "downmix matrix" corresponding thereto), and the "unmixed" matrix may include at least one "unmixed weight parameter" (or a "downmix weight parameter" corresponding thereto) as a coefficient of the unmixed matrix (or a "downmix matrix" corresponding thereto). Alternatively, "unmixing" may be implemented as an arithmetic calculation based on a portion of the "unmixed matrix" (or a "downmix matrix" corresponding thereto), and may be implemented in various ways without being limited thereto. As described above, "unmixing" may be associated with "up-mixing".
"mixing" may refer to any operation of generating an audio signal of a new channel (i.e., a mixed channel) by adding values obtained by multiplying each of the audio signals of a plurality of channels by corresponding weights (i.e., by mixing the audio signals of a plurality of channels).
The "mixing" can be divided into "mixing" performed by the audio encoding apparatus and "unmixing" performed by the audio decoding apparatus in a narrow sense.
The "mixing" performed in the audio encoding apparatus may be implemented as a calculation using a (lower) mixing matrix, and the (lower) mixing matrix may include at least one (lower) mixing weight parameter "as a coefficient of the (lower) mixing matrix. Alternatively, the "(under) mixing" may be implemented as an arithmetic calculation based on a part of the "(under) mixing matrix", and may be implemented in various ways without being limited thereto.
In the present disclosure, an "up-mix channel group" may refer to a group including at least one up-mix channel, and an "up-mix channel" may refer to a de-mix channel separated by de-mixing of audio signals for encoding/decoding channels. The narrow "up-mix channel group" may include "up-mix channels". However, the "up-mix channel group" in a broad sense may also include "encoding/decoding channels" and "up-mix channels". Here, the "encoding/decoding channel" may refer to an independent channel of an audio signal encoded (compressed) and included in a bitstream or an independent channel of an audio signal obtained by decoding from a bitstream. In this case, a separate (de) mixing operation is not required in order to obtain an audio signal of an encoded/decoded channel.
The audio signal of the "up-mix channel group" in a broad sense may be a multi-channel audio signal, and the output multi-channel audio signal may be one of at least one multi-channel audio signal (i.e., an audio signal of at least one up-mix channel group or an up-mix channel audio signal) which is an audio signal output through a device such as a speaker.
In the present disclosure, "down-mixing" may refer to an operation in which the number of presentation channels of an output audio signal is reduced compared to the number of presentation channels of an input audio signal by mixing.
In the present disclosure, the "factor for error cancellation" (or error cancellation factor (ERF)) may be a factor for canceling an error of an audio signal occurring due to lossy codec.
Errors of an audio signal occurring due to lossy codec may include errors caused by quantization, more specifically, errors caused by coding (quantization) based on psychoacoustic characteristics, and the like. The "factor for error cancellation" may be referred to as a "codec error Cancellation (CER) factor" or an "error cancellation ratio", etc. In particular, the "error cancellation factor" may be referred to as a "scaling factor" because the error cancellation operation substantially corresponds to the scaling operation.
Hereinafter, embodiments of the present disclosure according to the technical spirit of the present disclosure will be described in detail in order.
Fig. 1A is a view for describing a scalable channel layout structure according to an embodiment of the present disclosure.
Conventional 3D audio decoding apparatuses receive compressed audio signals of independent channels of a specific channel layout from a bitstream. Conventional 3D audio decoding apparatuses reconstruct audio signals of a listener omni-directional 3D audio channel by using compressed audio signals of independent channels received from a bitstream. In this case, only the audio signal of a specific channel layout can be reconstructed.
Alternatively, the conventional 3D audio decoding apparatus receives compressed audio signals of independent channels (first independent channel group) of a specific channel layout from a bitstream. For example, the particular channel layout may be a 5.1 channel layout, and in this case, the compressed audio signals of the first independent channel group may be compressed audio signals of five surround channels and one subwoofer channel.
Here, in order to increase the number of channels, the conventional 3D audio decoding apparatus also receives compressed audio signals of other channels (second independent channel group) independent of the first independent channel group. For example, the compressed audio signals of the second independent channel group may be compressed audio signals of two height channels.
That is, the conventional 3D audio decoding apparatus reconstructs an audio signal of a listener omni-directional 3D audio channel by using a compressed audio signal of a second independent channel group received from a bitstream separately from a compressed audio signal of a first independent channel group received from the bitstream. Thus, an audio signal with an increased number of channels is reconstructed. Here, the audio signal of the listener omni-directional 3D audio channel may be an audio signal of 5.1.2 channels.
On the other hand, a conventional audio decoding apparatus supporting reproduction of only audio signals of stereo channels cannot properly process the compressed audio signals included in such a bitstream.
A conventional 3D audio decoding apparatus supporting reproduction of a 3D audio signal likewise must first decompress (decode) the compressed audio signals of the first independent channel group and the second independent channel group in order to reproduce an audio signal of stereo channels. Then, the conventional 3D audio decoding apparatus down-mixes the audio signals generated by decompression. That is, even to reproduce an audio signal of stereo channels, operations such as down-mixing must be performed.
Accordingly, there is a need for a scalable channel layout structure capable of processing a compressed audio signal in a conventional audio decoding apparatus. Further, in the audio decoding apparatuses 300 and 500 (see fig. 3A, 3B, 5A and 5B) supporting reproduction of 3D audio signals according to various embodiments of the present disclosure, there is a need for a scalable channel layout structure capable of processing compressed audio signals according to a 3D audio channel layout supporting reproduction. Here, the scalable channel layout structure may refer to a layout structure in which the number of channels can be freely increased from the basic channel layout.
The audio decoding apparatuses 300 and 500 according to various embodiments of the present disclosure may reconstruct an audio signal of a scalable channel layout structure from a bitstream. With the scalable channel layout structure according to an embodiment of the present disclosure, the number of channels may be increased from the stereo channel layout 100 to the 3D audio channel layout 110 in front of the listener. Furthermore, with a scalable channel layout structure, the number of channels may be increased from the 3D audio channel layout 110 in front of the listener to a 3D audio channel layout 120 that is omnidirectionally located around the listener (or a listener omnidirectional 3D audio channel layout 120). For example, the 3D audio channel layout 110 in front of the listener may be a 3.1.2 channel layout. The listener omni-directional 3D audio channel layout 120 may be a 5.1.2 or 7.1.2 channel layout. However, the scalable channel layout that may be implemented in the present disclosure is not limited thereto.
As a basic channel group, an audio signal of a conventional stereo channel may be compressed. The conventional audio decoding apparatus may decompress the compressed audio signals of the basic channel group from the bitstream, thereby smoothly reproducing the audio signals of the conventional stereo channels.
In addition, as a dependent channel group, the audio signals of channels of the multi-channel audio signal other than the audio signals of the conventional stereo channels may be compressed.
However, in increasing the number of channels, a part of the audio signals of the channel group may be audio signals of some independent channels among the audio signals mixed with the specific channel layout.
Accordingly, in the audio decoding apparatuses 300 and 500, a portion of the audio signals of the base channel group and a portion of the audio signals of the dependent channel group may be unmixed to generate an audio signal of an up-mix channel included in a specific channel layout.
Meanwhile, there may be one or more dependent channel groups. For example, among the audio signals of the 3D audio channel layout 110 in front of the listener, the audio signals of channels other than those of the stereo channels may be compressed into the audio signals of the first dependent channel group.
Among the audio signals of the listener omni-directional 3D audio channel layout 120, the audio signals of channels other than those reconstructed from the base channel group and the first dependent channel group may be compressed into the audio signals of the second dependent channel group.
The audio decoding apparatuses 300 and 500 according to embodiments of the present disclosure may support reproduction of audio signals of the listener omni-directional 3D audio channel layout 120.
Accordingly, the audio decoding apparatuses 300 and 500 according to embodiments of the present disclosure may reconstruct the audio signals of the listener omni-directional 3D audio channel layout 120 based on the audio signals of the base channel group and the audio signals of the first and second dependent channel groups.
A conventional audio signal processing apparatus may ignore the compressed audio signals of dependent channel groups that it cannot reconstruct from the bitstream, and may reproduce only the audio signals of the stereo channels reconstructed from the bitstream.
Similarly, the audio decoding apparatuses 300 and 500 may process compressed audio signals of the base channel group and the dependent channel group to reconstruct audio signals of a channel layout supportable in the scalable channel layout. The audio decoding apparatuses 300 and 500 may not reconstruct the compressed audio signal with respect to the unsupported higher channel layout from the bitstream. Accordingly, the audio signals of the supportable channel layouts may be reconstructed from the bitstream while the compressed audio signals related to the higher channel layouts not supported by the audio decoding apparatuses 300 and 500 are ignored.
Specifically, conventional audio encoding and decoding apparatuses compress and decompress audio signals of independent channels of a specific channel layout. Therefore, compression and decompression of audio signals of a limited channel layout is possible.
However, transmission and reconstruction of audio signals of stereo channels is possible by the audio encoding apparatuses 200 and 400 (see fig. 2A, 2B and 4A) and the audio decoding apparatuses 300 and 500 supporting the scalable channel layout according to various embodiments of the present disclosure. With the audio encoding apparatuses 200 and 400 and the audio decoding apparatuses 300 and 500 according to various embodiments of the present disclosure, transmission and reconstruction of audio signals of a 3D channel layout in front of a listener is possible. Further, with the audio encoding apparatuses 200 and 400 and the audio decoding apparatuses 300 and 500 according to the embodiments of the present disclosure, it is possible to transmit and reconstruct an audio signal of a 3D channel layout that is omnidirectionally surrounding a listener.
That is, the audio encoding apparatuses 200 and 400 and the audio decoding apparatuses 300 and 500 according to various embodiments of the present disclosure may transmit and reconstruct audio signals according to the layout of stereo channels. Further, the audio encoding apparatuses 200 and 400 and the audio decoding apparatuses 300 and 500 according to various embodiments of the present disclosure may freely convert an audio signal of a current channel layout into an audio signal of another channel layout. By mixing/de-mixing between audio signals of channels included in different channel layouts, a transition between channel layouts is possible. The audio encoding apparatuses 200 and 400 and the audio decoding apparatuses 300 and 500 according to various embodiments of the present disclosure may support conversion between various channel layouts, thereby transmitting and reproducing audio signals of various 3D channel layouts. That is, channel independence is not ensured between a channel layout in front of a listener and a listener omni-directional channel layout or between a stereo channel layout and a channel layout in front of a listener, but free conversion is possible by mixing/unmixing of audio signals.
The audio encoding apparatuses 200 and 400 and the audio decoding apparatuses 300 and 500 according to various embodiments of the present disclosure support processing of audio signals of a channel layout in front of a listener to transmit and reconstruct audio signals corresponding to speakers disposed around a screen, thereby improving the immersion of the listener.
Detailed operations of the audio encoding apparatuses 200 and 400 and the audio decoding apparatuses 300 and 500 according to various embodiments of the present disclosure will be described with reference to fig. 2A to 5B.
Fig. 1B is a view for describing an example of a detailed scalable audio channel layout structure. In this figure, each of numbered/directed edges (1) to (10) may represent a unmixed operation performed by the audio decoding apparatuses 300 and 500.
Referring to fig. 1B, in order to transmit the audio signal of the stereo channel layout 160, the audio encoding apparatuses 200 and 400 may generate the compressed audio signals (A/B signals) of the base channel group by compressing the L2/R2 signals.
In addition, in order to transmit an audio signal of the 3.1.2 channel layout 170, which is one of the 3D audio channel layouts in front of the listener, the audio encoding apparatuses 200 and 400 may generate the compressed audio signals of the first dependent channel group by compressing the C, LFE, Hfl3, and Hfr3 signals. The audio decoding apparatuses 300 and 500 may reconstruct the L2/R2 signals by decompressing the compressed audio signals of the base channel group. The audio decoding apparatuses 300 and 500 may reconstruct the C, LFE, Hfl3, and Hfr3 signals by decompressing the compressed audio signals of the first dependent channel group.
The audio decoding apparatuses 300 and 500 may reconstruct the L3 signal of the 3.1.2 channel layout 170 by (1) unmixing the L2 signal and the C signal. The audio decoding apparatuses 300 and 500 may reconstruct the R3 signal of the 3.1.2 channel layout 170 by (2) unmixing the R2 signal and the C signal.
As a result, the audio decoding apparatuses 300 and 500 can output the L3, R3, C, LFE, Hfl3, and Hfr3 signals as the audio signals of the 3.1.2 channel layout 170.
Meanwhile, in order to transmit the audio signal of the listener omni-directional 5.1.2 channel layout 180, the audio encoding apparatuses 200 and 400 may further compress the L5 and R5 signals to generate the compressed audio signals of the second dependent channel group.
As described above, the audio decoding apparatuses 300 and 500 may reconstruct the L2/R2 signals by decompressing the compressed audio signals of the base channel group, and may reconstruct the C, LFE, Hfl3, and Hfr3 signals by decompressing the compressed audio signals of the first dependent channel group. In addition, the audio decoding apparatuses 300 and 500 may reconstruct the L5 and R5 signals by decompressing the compressed audio signals of the second dependent channel group. In addition, as described above, the audio decoding apparatuses 300 and 500 may reconstruct the L3 and R3 signals by unmixing some of the decompressed audio signals.
In addition, the audio decoding apparatuses 300 and 500 may reconstruct the Ls5 signal by (3) unmixing the L3 and L5 signals. The audio decoding apparatuses 300 and 500 may reconstruct the Rs5 signal by (4) unmixing the R3 and R5 signals.
The audio decoding apparatuses 300 and 500 may reconstruct the Hl5 signal by (5) unmixing the Hfl3 and Ls5 signals. Hfl3 and Hl5 are front-left channels among the height channels.
The audio decoding apparatuses 300 and 500 may reconstruct the Hr5 signal by (6) unmixing the Hfr3 and Rs5 signals. Hfr3 and Hr5 are front-right channels among the height channels.
As a result, the audio decoding apparatuses 300 and 500 can output the L5, R5, C, Ls5, Rs5, Hl5, Hr5, and LFE signals as the audio signals of the 5.1.2 channel layout 180.
Meanwhile, in order to transmit the audio signal of the 7.1.4 channel layout 190, the audio encoding apparatuses 200 and 400 may further compress the Hfl, Hfr, Ls, and Rs signals as the audio signals of the third dependent channel group.
As described above, the audio decoding apparatuses 300 and 500 may decompress the compressed audio signals of the base channel group, the first dependent channel group, and the second dependent channel group, and may reconstruct the Hl5, Hr5, LFE, L5, R5, C, Ls5, and Rs5 signals through unmixing operations (1), (2), (3), (4), (5), and (6).
In addition, the audio decoding apparatuses 300 and 500 may reconstruct the Hfl, Hfr, Ls, and Rs signals by decompressing the compressed audio signals of the third dependent channel group. The audio decoding apparatuses 300 and 500 may reconstruct the Lb signal of the 7.1.4 channel layout 190 by (7) unmixing the Ls5 signal and the Ls signal.
The audio decoding apparatuses 300 and 500 may reconstruct the Rb signal of the 7.1.4 channel layout 190 by (8) unmixing the Rs5 signal and the Rs signal.
The audio decoding apparatuses 300 and 500 may reconstruct the Hbl signal of the 7.1.4 channel layout 190 by (9) unmixing the Hfl signal and the Hl5 signal.
The audio decoding apparatuses 300 and 500 may reconstruct the Hbr signal of the 7.1.4 channel layout 190 by (10) unmixing the Hfr signal and the Hr5 signal.
As a result, the audio decoding apparatuses 300 and 500 can output the Hfl, Hfr, LFE, C, L, R, Ls, Rs, Lb, Rb, Hbl, and Hbr signals as the audio signals of the 7.1.4 channel layout 190.
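The cascade of operations (1) to (10) can be condensed into a one-sample sketch. All de-mixing weights are set to 1.0 purely for readability; an actual stream carries downmix/de-mixing weight parameters, and the signal values below are made up.

```python
# One-sample sketch of de-mixing operations (1)-(10) of Fig. 1B.

def demix(mixed: float, transmitted: float, w: float = 1.0) -> float:
    return mixed - w * transmitted

# Decoded (decompressed) signals, with made-up values:
L2, R2 = 0.6, 0.4                               # base channel group
C, LFE, Hfl3, Hfr3 = 0.2, 0.1, 0.3, 0.25        # first dependent group
L5, R5 = 0.25, 0.15                             # second dependent group
Hfl, Hfr, Ls, Rs = 0.1, 0.1, 0.05, 0.05         # third dependent group

L3, R3 = demix(L2, C), demix(R2, C)             # (1), (2)
Ls5, Rs5 = demix(L3, L5), demix(R3, R5)         # (3), (4)
Hl5, Hr5 = demix(Hfl3, Ls5), demix(Hfr3, Rs5)   # (5), (6)
Lb, Rb = demix(Ls5, Ls), demix(Rs5, Rs)         # (7), (8)
Hbl, Hbr = demix(Hl5, Hfl), demix(Hr5, Hfr)     # (9), (10)
```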
Accordingly, the audio decoding apparatuses 300 and 500 may reconstruct the audio signals of the 3D audio channels in front of the listener and the audio signals of the listener omni-directional 3D audio channels as well as the audio signals of the conventional stereo channel layout by supporting the scalable channel layout in which the number of channels is increased through the unmixing operation.
The scalable channel layout structure described in detail above with reference to fig. 1B is only an example, and the channel layout structure may be scalably implemented to include various channel layouts.
Fig. 2A is a block diagram of an audio encoding apparatus according to an embodiment of the present disclosure.
The audio encoding apparatus 200 may include a memory 210 and a processor 230. The audio encoding apparatus 200 may be implemented as an apparatus capable of performing audio processing, such as a server, a Television (TV), a camera, a cellular phone, a tablet Personal Computer (PC), a laptop computer, or the like.
Although the memory 210 and the processor 230 are separately shown in fig. 2A, the memory 210 and the processor 230 may be implemented by one hardware module (e.g., chip).
Processor 230 may be implemented as a dedicated processor for neural network-based audio processing. Alternatively, the processor 230 may be implemented by a combination of a general-purpose processor, such as an Application Processor (AP), a Central Processing Unit (CPU), or a Graphics Processing Unit (GPU), and software. The special purpose processor may include a memory to implement embodiments of the present disclosure or a memory processor to use external memory.
Processor 230 may include multiple processors. In this case, the processor 230 may be implemented as a combination of dedicated processors or may be implemented by a combination of software and a plurality of general-purpose processors (such as an AP, a CPU, or a GPU).
Memory 210 may store one or more instructions for audio processing. In an embodiment of the present disclosure, the memory 210 may store a neural network. When the neural network is implemented in the form of a dedicated hardware chip for artificial intelligence or as part of an existing general purpose processor (e.g., CPU or AP) or a graphics-specific processor (e.g., GPU), the neural network may not be stored in the memory 210. The neural network may be implemented by an external device (e.g., a server), and in this case, the audio encoding apparatus 200 may request and receive result information based on the neural network from the external device.
Processor 230 may sequentially process successive frames and obtain successive encoded (compressed) frames according to instructions stored in memory 210. Successive frames may refer to frames that make up audio.
The processor 230 may perform an audio processing operation with the original audio signal as an input and output a bitstream including the compressed audio signal. In this case, the original audio signal may be a multi-channel audio signal. The compressed audio signal may be a multi-channel audio signal having a channel number less than or equal to the channel number of the original audio signal.
In this case, the bitstream may include a base channel group and, further, n dependent channel groups (where n is an integer greater than or equal to 1). Therefore, the number of channels can be freely increased according to the number of dependent channel groups.
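For illustration, the scalable structure of such a bitstream can be modeled as one base channel group plus a variable number of dependent channel groups. This is a minimal sketch; the class and field names are hypothetical and are not part of this disclosure.

```python
from dataclasses import dataclass, field

@dataclass
class ChannelGroup:
    """Compressed audio of one channel group (codec payload bytes)."""
    channels: tuple[str, ...]  # e.g. ("L2",) or ("Hfl3", "C", "LFE", "Hfr3")
    payload: bytes

@dataclass
class ScalableBitstream:
    base: ChannelGroup                                            # mono or stereo
    dependent: list[ChannelGroup] = field(default_factory=list)   # n >= 1 groups

    def channel_count(self) -> int:
        # Each appended dependent group increases the reconstructable layout.
        return len(self.base.channels) + sum(len(g.channels) for g in self.dependent)
```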
Fig. 2B is a block diagram of an audio encoding apparatus according to an embodiment of the present disclosure.
Referring to fig. 2B, the audio encoding apparatus 200 may include a multi-channel audio encoder 250, a bitstream generator 280, and an additional information generator 285. The multi-channel audio encoder 250 may include a multi-channel audio signal processor 260 and a compressor 270.
Referring back to fig. 2A, as described above, the audio encoding apparatus 200 may include the memory 210 and the processor 230, and instructions for implementing the components 250, 260, 270, 280, and 285 of fig. 2B may be stored in the memory 210 of fig. 2A. Processor 230 may execute instructions stored in memory 210.
The multi-channel audio signal processor 260 may obtain at least one audio signal of a base channel group and at least one audio signal of at least one dependent channel group from an original audio signal. For example, when the original audio signal is an audio signal of a 7.1.4-channel layout, the multi-channel audio signal processor 260 may obtain an audio signal of 2 channels (stereo channels) as an audio signal of a basic channel group among the audio signals of the 7.1.4-channel layout.
The multi-channel audio signal processor 260 may obtain, as the audio signals of the first dependent channel group, the audio signals of the channels of the 3.1.2 channel layout other than the 2-channel audio signal, so that the audio signal of the 3.1.2 channel layout, which is one of the 3D audio channel layouts in front of the listener, can be reconstructed. In this case, the audio signals of some channels of the first dependent channel group may be unmixed to generate the audio signals of unmixed channels.
The multi-channel audio signal processor 260 may obtain, as the audio signals of the second dependent channel group, the audio signals of channels other than those of the base channel group and the first dependent channel group from the audio signals of the 5.1.2 channel layout, which is one of the 3D audio channel layouts in front of and behind the listener, so that the audio signal of the 5.1.2 channel layout can be reconstructed. In this case, the audio signals of some channels of the second dependent channel group may be unmixed to generate the audio signals of unmixed channels.
The multi-channel audio signal processor 260 may obtain, as the audio signals of the third dependent channel group, the audio signals of channels other than those of the base channel group and the first and second dependent channel groups from the audio signals of the 7.1.4 channel layout, which is one of the listener-omnidirectional 3D audio channel layouts, so that the audio signal of the 7.1.4 channel layout can be reconstructed. Likewise, the audio signals of some channels of the third dependent channel group may be unmixed to obtain the audio signals of unmixed channels.
The detailed operation of the multi-channel audio signal processor 260 will be described later with reference to fig. 2C.
The compressor 270 may compress the audio signals of the base channel group and the audio signals of the dependent channel groups. That is, the compressor 270 may compress at least one audio signal of the base channel group to obtain at least one compressed audio signal of the base channel group. Here, compression may refer to compression based on various audio codecs. For example, compression may include transform and quantization processes.
Here, the audio signal of the base channel group may be a mono signal or a stereo signal. Alternatively, the audio signal of the base channel group may include an audio signal of a first channel generated by mixing the audio signal L of the left stereo channel with C_1. Here, C_1 may be the audio signal of the center channel in front of the listener, decompressed after compression (e.g., a decoded center channel audio signal). In the present disclosure, when an audio signal is described using a name of the form "X_Y", "X" may represent the name of a channel, and "Y" may represent a processing stage: decoded, up-mixed, applied with an error cancellation factor (i.e., scaled), or applied with an LFE gain. For example, a decoded signal may be denoted as "X_1", and a signal generated by up-mixing the decoded signal (an up-mix signal) may be denoted as "X_2". Alternatively, a signal in which an LFE gain is applied to the decoded LFE signal may also be denoted as "X_2". A signal in which an error cancellation factor is applied to the up-mix signal (i.e., a scaled signal) may be denoted as "X_3".
The audio signal of the base channel group may include an audio signal of a second channel generated by mixing the audio signal R of the right stereo channel with C_1.
The compressor 270 may obtain (e.g., generate) at least one compressed audio signal of the at least one dependent channel group by compressing at least one audio signal of the at least one dependent channel group.
The additional information generator 285 may generate additional information based on at least one of the original audio signal, the compressed audio signal of the base channel group, or the compressed audio signals of the dependent channel groups. In this case, the additional information may be information related to the multi-channel audio signal and may include various information for reconstructing the multi-channel audio signal.
For example, the additional information may include an audio object signal of a 3D audio channel in front of the listener, which indicates at least one of the audio signal, position, shape, area, or direction of an audio object (sound source). Alternatively, the additional information may include information about the total number of audio streams, including the base channel audio stream and the dependent channel audio streams. The additional information may include downmix gain information, channel map information, volume information, LFE gain information, Dynamic Range Control (DRC) information, and channel layout presentation information. The additional information may further include information about the number of coupled audio streams, information indicating a multi-channel layout, information about whether a dialogue exists in the audio signal and its dialogue level, information about whether the LFE is output, information about whether an audio object exists on the screen, information about the presence or absence of an audio signal of a continuous audio channel (a scene-based audio signal or a surround-sound audio signal), and information about the presence or absence of an audio signal of a discrete audio channel (an object-based audio signal or a spatial multi-channel audio signal). The additional information may include information about de-mixing, including at least one de-mixing weight parameter of a de-mixing matrix for reconstructing the multi-channel audio signal. Because de-mixing and (down-)mixing correspond to each other, the information about de-mixing may correspond to the information about (down-)mixing, and the information about de-mixing may include the information about (down-)mixing. For example, the information about de-mixing may include at least one (down-)mixing weight parameter of a (down-)mixing matrix. The de-mixing weight parameters may be obtained based on the (down-)mixing weight parameters.
The additional information may be various combinations of the above pieces of information. In other words, the additional information may include at least one of the aforementioned pieces of information.
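To make this correspondence concrete: if, for illustration, a down-mixing step is Ls5 = Ls7 + beta * Lb7 with a down-mixing weight beta (a hypothetical equation, not one prescribed by this disclosure), the matching de-mixing step is Lb7 = (Ls5 - Ls7) / beta, so the de-mixing weight follows directly from the signaled down-mixing weight.

```python
beta = 0.871               # hypothetical signaled down-mixing weight
demix_weight = 1.0 / beta  # de-mixing weight derived from it

# Encoder: ls5 = ls7 + beta * lb7
# Decoder: lb7 = (ls5 - ls7) * demix_weight
```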
When there is an audio signal of a dependent channel corresponding to at least one audio signal of the base channel group, the additional information generator 285 may generate dependent channel audio signal identification information indicating that the audio signal of the dependent channel is present.
The bitstream generator 280 may generate a bitstream including the compressed audio signal of the base channel group and the compressed audio signal of the dependent channel group. The bit stream generator 280 may generate a bit stream further including the additional information generated by the additional information generator 285.
More specifically, the bitstream generator 280 may generate a base channel audio stream and dependent channel audio streams. The base channel audio stream may include the compressed audio signals of the base channel group, and a dependent channel audio stream may include the compressed audio signals of a dependent channel group.
The bitstream generator 280 may generate a bitstream including the base channel audio stream and a plurality of dependent channel audio streams. The plurality of dependent channel audio streams may include n dependent channel audio streams (where n is an integer greater than 1). In this case, the base channel audio stream may include a compressed mono audio signal or a compressed stereo audio signal.
For example, in the channels of a first multi-channel layout reconstructed from the base channel audio stream and the first dependent channel audio stream, the number of surround channels may be S_{n-1}, the number of subwoofer channels may be W_{n-1}, and the number of height channels may be H_{n-1}. In a second multi-channel layout reconstructed from the base channel audio stream, the first dependent channel audio stream, and the second dependent channel audio stream, the number of surround channels may be S_n, the number of subwoofer channels may be W_n, and the number of height channels may be H_n.
In this case, S_{n-1} may be less than or equal to S_n, W_{n-1} may be less than or equal to W_n, and H_{n-1} may be less than or equal to H_n. Here, the case where S_{n-1} is equal to S_n, W_{n-1} is equal to W_n, and H_{n-1} is equal to H_n at the same time may be excluded.
That is, the number of surround channels of the second multi-channel layout needs to be greater than the number of surround channels of the first multi-channel layout. Alternatively or additionally, the number of subwoofer channels of the second multi-channel layout needs to be greater than the number of subwoofer channels of the first multi-channel layout. Alternatively or additionally, the number of height channels of the second multi-channel layout needs to be greater than the number of height channels of the first multi-channel layout.
Furthermore, the number of surround channels, the number of subwoofer channels, and the number of height channels of the second multi-channel layout may each be not smaller than the corresponding number in the first multi-channel layout.
Furthermore, the case where the number of surround channels, the number of subwoofer channels, and the number of height channels of the second multi-channel layout are all equal to those of the first multi-channel layout does not exist. That is, the second multi-channel layout is not identical in all channels to the first multi-channel layout.
More specifically, for example, when the first multi-channel layout is a 5.1.2-channel layout, the second multi-channel layout may be a 7.1.4-channel layout.
In addition, the bit stream generator 280 may generate metadata including additional information.
As a result, the bitstream generator 280 may generate a bitstream including the base channel audio stream, the sub channel audio stream, and the metadata.
The bitstream generator 280 may generate a bitstream in a form in which the number of channels may freely increase from the basic channel group.
That is, an audio signal of a base channel group may be reconstructed from a base channel audio stream, and a multi-channel audio signal in which the number of channels increases from the base channel group may be reconstructed from the base channel audio stream and a dependent channel audio stream.
Meanwhile, the bitstream generator 280 may generate a file stream having a plurality of audio tracks. The bitstream generator 280 may generate an audio stream of a first audio track including at least one compressed audio signal of the base channel group. The bitstream generator 280 may generate an audio stream of a second audio track including dependent channel audio signal identification information. In this case, the second audio track follows the first audio track and may be adjacent to it.
When there is a dependent channel audio signal corresponding to at least one audio signal of the base channel group, the bitstream generator 280 may generate the audio stream of the second audio track so as to include at least one compressed audio signal of the at least one dependent channel group.
Meanwhile, when there is no dependent channel audio signal corresponding to at least one audio signal of the base channel group, the bitstream generator 280 may generate the audio stream of the second audio track so as to include the next audio signal of the base channel group, following the audio signal of the base channel group carried in the first audio track.
Fig. 2C is a block diagram of the structure of the multi-channel audio signal processor 260 according to an embodiment of the present disclosure.
Referring to fig. 2C, the multi-channel audio signal processor 260 may include a channel layout identifier 261, a down-mix channel audio generator 262, and an audio signal classifier 266.
The channel layout identifier 261 may identify at least one channel layout from the original audio signal. In this case, the at least one channel layout may include a plurality of hierarchical channel layouts. The channel layout identifier 261 may identify the channel layout of the original audio signal, and may also identify channel layouts lower than that of the original audio signal. For example, when the original audio signal is an audio signal of the 7.1.4 channel layout, the channel layout identifier 261 may identify the 7.1.4 channel layout, and may identify the 5.1.2 channel layout, the 3.1.2 channel layout, the 2 channel layout, and so on, which are lower than the 7.1.4 channel layout. A higher channel layout refers to a layout in which at least one of the number of surround channels, the number of subwoofer channels, or the number of height channels is greater than in a lower channel layout. The higher/lower channel layout may be determined first according to whether the number of surround channels is larger or smaller; for the same number of surround channels, according to whether the number of subwoofer channels is larger or smaller; and, for the same number of surround channels and subwoofer channels, according to whether the number of height channels is larger or smaller.
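In other words, channel layouts are ordered lexicographically by their (surround, subwoofer, height) channel counts. A sketch of this comparison, writing the 2 channel layout as "2.0.0" for uniformity (an assumption of the sketch):

```python
def layout_rank(layout: str) -> tuple[int, int, int]:
    """Rank key for an 'S.W.H' layout string, e.g. '7.1.4' -> (7, 1, 4).

    Layouts compare by surround count first, then subwoofer count,
    then height count, matching the ordering described above.
    """
    s, w, h = (int(x) for x in layout.split("."))
    return (s, w, h)

layouts = ["7.1.4", "3.1.2", "5.1.2", "2.0.0"]
print(sorted(layouts, key=layout_rank))
# ['2.0.0', '3.1.2', '5.1.2', '7.1.4']
```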
Further, the identified channel layout may include a target channel layout. The target channel layout may refer to the highest channel layout of the audio signal included in the final output bitstream. The target channel layout may be a channel layout of the original audio signal or a channel layout lower than the channel layout of the original audio signal.
More specifically, the channel layout identified from the original audio signal may be determined hierarchically from the channel layout of the original audio signal. In this case, the channel layout identifier 261 may identify at least one channel layout among predetermined channel layouts. For example, the channel layout identifier 261 may identify, from the layout of the original audio signal, some of the predetermined channel layouts: the 7.1.4 channel layout, the 5.1.4 channel layout, the 5.1.2 channel layout, the 3.1.2 channel layout, and the 2 channel layout.
The channel layout identifier 261 may transmit a control signal to the specific down-mix channel audio generator corresponding to the identified at least one channel layout, based on the identified channel layout. The specific down-mix channel audio generator may be at least one of the first down-mix channel audio generator 263, the second down-mix channel audio generator 264, …, or the nth down-mix channel audio generator 265. The down-mix channel audio generator 262 may generate down-mix channel audio from the original audio signal based on the at least one channel layout identified by the channel layout identifier 261. The down-mix channel audio generator 262 may generate the down-mix channel audio from the original audio signal by using a down-mixing matrix including at least one down-mixing weight parameter.
For example, when the channel layout of the original audio signal is the nth channel layout in ascending order among the predetermined channel layouts, the down-mix channel audio generator 262 may generate, from the original audio signal, the down-mix channel audio of the (n-1)th channel layout, which is immediately lower than the channel layout of the original audio signal. By repeating this process, the down-mix channel audio generator 262 may generate down-mix channel audio of channel layouts lower than the current channel layout.
For example, the down-mix channel audio generator 262 may include a first down-mix channel audio generator 263, a second down-mix channel audio generator 264, …, and an (n-1)th down-mix channel audio generator. Here, (n-1) may be less than or equal to N.
In this case, the (n-1) -th down-mix channel audio generator may generate an audio signal of the (n-1) -th channel layout from the original audio signal. In addition, the (n-2) -th down-mix channel audio generator may generate an audio signal of the (n-2) -th channel layout from the original audio signal. In this way, the first downmix channel audio generator 263 may generate an audio signal of the first channel layout from the original audio signal. The first channel layout may be the first layout in a hierarchically ordered list, set or group of predetermined channel layouts. In this case, the audio signal of the first channel arrangement may be an audio signal of a basic channel group.
Meanwhile, the down-mix channel audio generators (e.g., the first down-mix channel audio generator 263, the second down-mix channel audio generator 264, …, and the nth down-mix channel audio generator 265) may be connected in a cascade manner. That is, they may be connected such that the output of a higher down-mix channel audio generator becomes the input of the next lower down-mix channel audio generator. For example, the audio signal of the (n-1)th channel layout may be output from the (n-1)th down-mix channel audio generator with the original audio signal as an input; the audio signal of the (n-1)th channel layout may then be input to the (n-2)th down-mix channel audio generator, which generates the audio signal of the (n-2)th channel layout. In this way, the down-mix channel audio generators may be connected so as to output an audio signal of each channel layout.
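For illustration, the cascade can be sketched as a chain of per-stage down-mixing functions, each consuming the output of the stage above it. The weights are hypothetical; in practice they correspond to the signaled (down-)mixing weight parameters.

```python
def fold(kept: float, folded: float, w: float) -> float:
    """Down-mix two channels into one; the de-mixer later inverts this step."""
    return kept + w * folded

def downmix_714_to_512(ch: dict) -> dict:
    """One cascade stage: 7.1.4 -> 5.1.2 (weights hypothetical)."""
    w_s, w_h = 0.871, 0.866
    return {
        "L": ch["L"], "R": ch["R"], "C": ch["C"], "LFE": ch["LFE"],
        "Ls5": fold(ch["Ls"], ch["Lb"], w_s),    # back surrounds folded in
        "Rs5": fold(ch["Rs"], ch["Rb"], w_s),
        "Hl5": fold(ch["Hfl"], ch["Hbl"], w_h),  # back heights folded in
        "Hr5": fold(ch["Hfr"], ch["Hbr"], w_h),
    }

# Further stages (5.1.2 -> 3.1.2 -> 2 channel -> mono) chain the same way:
# each stage consumes the dictionary produced by the stage above it.
```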
The audio signal classifier 266 may obtain the audio signal of the base channel group and the audio signal of the dependent channel group based on the audio signal of the at least one channel layout. In this case, the audio signal classifier 266 may mix audio signals of at least one channel included in the audio signals of the at least one channel layout through the mixer 267. The audio signal classifier 266 may classify the mixed audio signal as at least one of an audio signal of a base channel group or an audio signal of a dependent channel group.
Fig. 2D is a view for describing an example of detailed operation of the audio signal classifier.
Referring to fig. 2D, the downmix channel audio generator 262 of fig. 2C may obtain an audio signal of the 5.1.2 channel layout 291, an audio signal of the 3.1.2 channel layout 292, an audio signal of the 2 channel layout 293, and an audio signal of the mono layout 294, which are audio signals of the lower channel layout, from the original audio signal of the 7.1.4 channel layout 290. The down-mix channel audio generators (e.g., the first down-mix channel audio generator 263, the second down-mix channel audio generators 264, …, and the nth down-mix channel audio generator 265) of the down-mix channel audio generator 262 are connected in a cascade manner so that audio signals can be sequentially obtained from a current channel layout to a next lower channel layout.
The audio signal classifier 266 of fig. 2C may classify the audio signals of the mono layout 294 as audio signals of a basic channel group.
The audio signal classifier 266 may classify the audio signal of the L2 channel, which is a part of the audio signals of the 2 channel layout 293, as an audio signal of dependent channel group #1 296. Meanwhile, the audio signal of the L2 channel and the audio signal of the R2 channel are mixed to generate the audio signal of the mono layout 294, so the audio decoding apparatuses 300 and 500 may, conversely, reconstruct the audio signal of the R2 channel by unmixing the audio signal of the mono layout 294 with the audio signal of the L2 channel. Accordingly, the audio signal of the R2 channel need not be classified into a separate channel group.
The audio signal classifier 266 may classify the Hfl3 channel audio signal, the C channel audio signal, the LFE channel audio signal, and the Hfr3 channel audio signal among the audio signals of the 3.1.2 channel layout 292 as the audio signals of dependent channel group #2 297. The audio signal of the L2 channel is generated by mixing the audio signal of the L3 channel and the audio signal of the C channel, so that, conversely, the audio decoding apparatuses 300 and 500 can reconstruct the audio signal of the L3 channel of the 3.1.2 channel layout 292 by unmixing the audio signal of the L2 channel with the audio signal of the C channel.
Accordingly, the L3 channel audio signal in the audio signal of the 3.1.2 channel layout 292 may not be classified as an audio signal of a specific channel group.
For the same reason, the R3 channel may not be classified into an audio signal of a specific channel group.
The audio signal classifier 266 may classify the audio signal of the L channel and the audio signal of the R channel, which are the audio signals of some channels of the 5.1.2 channel layout 291, as the audio signals of dependent channel group #3 298, so that the audio signals of the 5.1.2 channel layout 291 can be transmitted. Meanwhile, the audio signals of the Ls5, Hl5, Rs5, and Hr5 channels are among the audio signals of the 5.1.2 channel layout 291, but may not be classified into a separate dependent channel group. This is because the signals of the Ls5, Hl5, Rs5, and Hr5 channels are not audio signals of channels in front of the listener, but signals in which the audio signals of at least one of the front, side, and rear audio channels of the 7.1.4 channel layout 290 are mixed. By compressing the audio signals of the audio channels in front of the listener taken from the original audio signal, instead of classifying such mixed signals into dependent channel groups and compressing them, the sound quality of the audio signals of the audio channels in front of the listener can be improved. As a result, the listener can perceive improved sound quality in the reproduced audio signal.
However, as the case may be, Ls5 or Hl5 may be classified as an audio signal of dependent channel group #3 298 instead of L, and Rs5 or Hr5 may be classified as an audio signal of dependent channel group #3 298 instead of R.
The audio signal classifier 266 may classify the Ls, Hfl, Rs, and Hfr channel audio signals among the audio signals of the 7.1.4 channel layout 290 as the audio signals of dependent channel group #4 299. In this case, Lb, Hbl, Rb, and Hbr may not be classified into dependent channel group #4 299. By compressing the audio signals of the side audio channels close to the front of the listener, instead of classifying the audio signals of the audio channels behind the listener among the audio signals of the 7.1.4 channel layout 290 into the channel group and compressing them, the sound quality of the audio signals of the side audio channels close to the front of the listener can be improved. Accordingly, the listener can perceive improved sound quality in the reproduced audio signal. However, as the case may be, Lb instead of Ls, Hbl instead of Hfl, Rb instead of Rs, and Hbr instead of Hfr may be classified as audio signals of dependent channel group #4 299.
As a result, the down-mix channel audio generator 262 of fig. 2C may generate a plurality of lower-layout audio signals (down-mix channel audio) based on the plurality of lower channel layouts identified from the layout of the original audio signal. The audio signal classifier 266 of fig. 2C may classify the audio signals of the base channel group and the audio signals of dependent channel groups #1, #2, #3, and #4. In this classification, for each channel layout, only the audio signals of some independent channels among the audio signals of that layout are classified into a channel group. The audio decoding apparatuses 300 and 500 may reconstruct the audio signals that were not classified by the audio signal classifier 266 through unmixing. Meanwhile, when the audio signal of a left channel with respect to the listener is classified into a specific channel group, the audio signal of the right channel corresponding to that left channel may be classified into the same channel group. That is, the audio signals of coupled channels are classified into one channel group.
When the audio signals of the stereo channel layout are classified as the audio signals of the base channel group, the audio signals of the coupled channels are both classified into one channel group. However, as described above with reference to fig. 2D, when the audio signal of the mono layout is classified as the audio signal of the base channel group, one of the audio signals of the stereo channels may additionally be classified as an audio signal of dependent channel group #1. However, the method of classifying the audio signals into channel groups may vary and is not limited to the description with reference to fig. 2D. That is, as long as the audio signals of the classified channel groups can be unmixed and the audio signals of the channels not classified into any channel group can be reconstructed from the unmixed audio signals, the audio signals may be classified into channel groups in various forms.
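Summarizing the example of fig. 2D, the transmitted channel groups and the channels recovered only by unmixing can be tabulated as follows (a sketch of this particular classification, not a normative mapping):

```python
# Channels transmitted per group, from the fig. 2D walk-through.
CHANNEL_GROUPS = {
    "base":        ("mono",),                     # mono layout 294
    "dependent_1": ("L2",),                       # R2 recovered by unmixing
    "dependent_2": ("Hfl3", "C", "LFE", "Hfr3"),  # L3/R3 recovered by unmixing
    "dependent_3": ("L", "R"),                    # Ls5/Hl5/Rs5/Hr5 recovered
    "dependent_4": ("Ls", "Hfl", "Rs", "Hfr"),    # Lb/Hbl/Rb/Hbr recovered
}
```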
Fig. 3A is a block diagram of a structure of a multi-channel audio decoding apparatus according to an embodiment of the present disclosure.
The audio decoding apparatus 300 may include a memory 310 and a processor 330. The audio decoding apparatus 300 may be implemented as an apparatus capable of audio processing, such as a server, a television, a camera, a mobile phone, a computer, a digital broadcasting terminal, a tablet PC, a laptop computer, or the like.
Although the memory 310 and the processor 330 are separately illustrated in fig. 3A, the memory 310 and the processor 330 may be implemented by one hardware module (e.g., chip).
The processor 330 may be implemented as a dedicated processor for neural-network-based audio processing. Alternatively, the processor 330 may be implemented through a combination of software and a general-purpose processor such as an AP, a CPU, or a GPU. The dedicated processor may include memory for implementing an embodiment of the present disclosure, or may include a memory processor for using an external memory.
Processor 330 may include a plurality of processors. In this case, the processor 330 may be implemented as a combination of dedicated processors or may be implemented by a combination of software and a plurality of general-purpose processors (such as an AP, a CPU, or a GPU).
Memory 310 may store one or more instructions for audio processing. According to an embodiment of the present disclosure, the memory 310 may store a neural network. When the neural network is implemented in the form of a dedicated hardware chip for AI or as part of an existing general-purpose processor (e.g., CPU or AP) or a graphics-specific processor (e.g., GPU), the neural network may not be stored in the memory 310. The neural network may be implemented as an external device (e.g., a server). In this case, the audio decoding apparatus 300 may request the neural network-based result information from the external apparatus and receive the neural network-based result information from the external apparatus.
Processor 330 may sequentially process successive frames according to instructions stored in memory 310 to obtain successive reconstructed frames. Successive frames may refer to frames that make up audio.
The processor 330 may output a multi-channel audio signal by performing an audio processing operation on an input bitstream. The bitstream may be implemented in a scalable form to increase the number of channels from the base channel group. For example, the processor 330 may obtain compressed audio signals of the base channel group from the bitstream, and may reconstruct audio signals of the base channel group (e.g., stereo channel audio signals) by decompressing the compressed audio signals of the base channel group. In addition, the processor 330 may reconstruct the audio signals of the dependent channel groups by decompressing the compressed audio signals of the dependent channel groups from the bitstream. The processor 330 may reconstruct the multi-channel audio signal based on the audio signals of the base channel group and the audio signals of the dependent channel group.
Meanwhile, the processor 330 may reconstruct the audio signals of the first dependent channel group by decompressing the compressed audio signals of the first dependent channel group from the bitstream. The processor 330 may reconstruct the audio signals of the second dependent channel group by decompressing the compressed audio signals of the second dependent channel group.
The processor 330 may reconstruct a multi-channel audio signal with an increased number of channels based on the audio signals of the base channel group and the corresponding audio signals of the first and second dependent channel groups. Similarly, the processor 330 may decompress the compressed audio signals of n dependent channel groups (where n is an integer greater than 2), and may reconstruct a multi-channel audio signal with a further increased number of channels based on the audio signals of the base channel group and the corresponding audio signals of the n dependent channel groups.
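The scalable reconstruction described above reduces to the following loop: decode the base channel group, then decode and de-mix as many dependent channel groups as the bitstream carries or the reproduction layout requires. This is a minimal sketch in which decode and demix_stage are assumed helper functions, not part of this disclosure.

```python
def reconstruct(bitstream, decode, demix_stage, num_groups: int):
    """Scalable decoding: each extra dependent channel group raises the
    output layout by one step.

    decode decompresses one channel group; demix_stage[i] lifts the
    current layout to the next one using the channels of group i.
    """
    audio = decode(bitstream.base)            # e.g. the stereo base group
    for i in range(num_groups):               # stop early for a lower layout
        group = decode(bitstream.dependent[i])
        audio = demix_stage[i](audio, group)  # e.g. -> 3.1.2 -> 5.1.2 -> 7.1.4
    return audio
```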
Fig. 3B is a block diagram of a structure of a multi-channel audio decoding apparatus according to an embodiment of the present disclosure.
Referring to fig. 3B, the audio decoding apparatus 300 may include an information acquirer 350 and a multi-channel audio decoder 360. The multi-channel audio decoder 360 may include a decompressor 370 and a multi-channel audio signal reconstructor 380.
The audio decoding apparatus 300 may include the memory 310 and the processor 330 of fig. 3A, and instructions for implementing the components 350, 360, 370, and 380 of fig. 3B may be stored in the memory 310. Processor 330 may execute instructions stored in memory 310.
The information acquirer 350 may acquire the compressed audio signal of the basic channel group from the bitstream. That is, the information acquirer 350 may classify the base channel audio stream including at least one compressed audio signal of the base channel group from the bitstream.
The information acquirer 350 may also acquire at least one compressed audio signal of at least one dependent channel group from the bitstream. That is, the information acquirer 350 may classify at least one dependent channel audio stream including at least one compressed audio signal of a dependent channel group from the bitstream.
Meanwhile, the bitstream may include a base channel audio stream and a plurality of dependent channel audio streams. The plurality of dependent channel audio streams may include a first dependent channel audio stream and a second dependent channel audio stream.
In this case, constraints on the channels of a multi-channel first audio signal reconstructed from the base channel audio stream and the first dependent channel audio stream, and of a multi-channel second audio signal reconstructed from the base channel audio stream, the first dependent channel audio stream, and the second dependent channel audio stream, will now be described.
For example, in the channels of a first multi-channel layout reconstructed from the base channel audio stream and the first dependent channel audio stream, the number of surround channels may be S_{n-1}, the number of subwoofer channels may be W_{n-1}, and the number of height channels may be H_{n-1}. In a second multi-channel layout reconstructed from the base channel audio stream, the first dependent channel audio stream, and the second dependent channel audio stream, the number of surround channels may be S_n, the number of subwoofer channels may be W_n, and the number of height channels may be H_n. In this case, S_{n-1} may be less than or equal to S_n, W_{n-1} may be less than or equal to W_n, and H_{n-1} may be less than or equal to H_n. Here, the case where S_{n-1} is equal to S_n, W_{n-1} is equal to W_n, and H_{n-1} is equal to H_n at the same time may be excluded.
That is, the number of surround channels of the second multi-channel layout needs to be greater than the number of surround channels of the first multi-channel layout. Alternatively or additionally, the number of subwoofer channels of the second multi-channel layout needs to be greater than the number of subwoofer channels of the first multi-channel layout. Alternatively or additionally, the number of height channels of the second multi-channel layout needs to be greater than the number of height channels of the first multi-channel layout.
Furthermore, the number of surround channels, the number of subwoofer channels, and the number of height channels of the second multi-channel layout may each be not smaller than the corresponding number in the first multi-channel layout.
Furthermore, the case where the number of surround channels, the number of subwoofer channels, and the number of height channels of the second multi-channel layout are all equal to those of the first multi-channel layout does not exist. That is, the second multi-channel layout is not identical in all channels to the first multi-channel layout.
More specifically, for example, when the first multi-channel layout is a 5.1.2-channel layout, the second multi-channel layout may be a 7.1.4-channel layout.
Meanwhile, the bitstream may include a file stream having a plurality of audio tracks, including a first audio track and a second audio track. The process by which the information acquirer 350 acquires at least one compressed audio signal of at least one dependent channel group according to the additional information included in an audio track will be described below.
The information acquirer 350 may acquire at least one compressed audio signal of the basic channel group from the first audio track.
The information acquirer 350 may acquire dependent channel audio signal identification information from the second audio track, adjacent to the first audio track.
When the dependent channel audio signal identification information indicates that a dependent channel audio signal exists in the second audio track, the information acquirer 350 may acquire at least one audio signal of at least one dependent channel group from the second audio track.
When the dependent channel audio signal identification information indicates that no dependent channel audio signal exists in the second audio track, the information acquirer 350 may acquire the next audio signal of the base channel group from the second audio track.
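The track-walking logic described above can be sketched as follows. The container accessors (AudioTrack and its fields) are hypothetical stand-ins for the actual file-stream syntax.

```python
from dataclasses import dataclass

@dataclass
class AudioTrack:
    """Hypothetical container view of one audio track."""
    compressed_audio: bytes
    dependent_channel_audio_present: bool  # identification information

def parse_tracks(tracks: list[AudioTrack]):
    """Track 1 carries the base channel group; each following track either
    carries a dependent channel group (when the identification information
    says so) or the next audio signal of the base channel group."""
    base, dependent = [tracks[0].compressed_audio], []
    for track in tracks[1:]:
        if track.dependent_channel_audio_present:
            dependent.append(track.compressed_audio)
        else:
            base.append(track.compressed_audio)
    return base, dependent
```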
The information acquirer 350 may acquire additional information related to the reconstruction of the multi-channel audio from the bitstream. That is, the information acquirer 350 may classify metadata including additional information from a bitstream and acquire the additional information from the classified metadata.
The decompressor 370 may reconstruct the audio signals of the base channel group by decompressing at least one compressed audio signal of the base channel group.
The decompressor 370 may reconstruct at least one audio signal of at least one dependent channel group by decompressing at least one compressed audio signal of the at least one dependent channel group.
In this case, the decompressor 370 may include separate first to nth decompressors for decoding the compressed audio signals of the respective channel groups (n channel groups), and the first to nth decompressors may operate in parallel with one another.
The multi-channel audio signal reconstructor 380 may reconstruct a multi-channel audio signal based on at least one audio signal of the base channel group and at least one audio signal of the at least one dependent channel group.
For example, when the audio signal of the base channel group is an audio signal of a stereo channel, the multi-channel audio signal reconstructor 380 may reconstruct an audio signal of a 3D audio channel in front of the listener based on the audio signal of the base channel group and the audio signal of the first sub channel group. For example, the 3D audio channel in front of the listener may be 3.1.2 channels.
Alternatively, the multi-channel audio signal reconstructor 380 may reconstruct the audio signals of the listener omni-directional audio channel based on the audio signals of the base channel group, the audio signals of the first slave channel group, and the audio signals of the second slave channel group. For example, the listener omni-directional 3D audio channel may be 5.1.2 channels or 7.1.4 channels.
The multi-channel audio signal reconstructor 380 may reconstruct a multi-channel audio signal based not only on the audio signal of the base channel group and the audio signal of the dependent channel group, but also on the additional information. In this case, the additional information may be additional information for reconstructing the multi-channel audio signal. The multi-channel audio signal reconstructor 380 may output the reconstructed at least one multi-channel audio signal.
The multi-channel audio signal reconstructor 380 according to an embodiment of the present disclosure may generate a first audio signal of a 3D audio channel in front of a listener from at least one audio signal of a base channel group and at least one audio signal of at least one dependent channel group. The multi-channel audio signal reconstructor 380 may reconstruct a multi-channel audio signal including a second audio signal of the 3D audio channel in front of the listener based on the first audio signal and the audio object signal of the 3D audio channel in front of the listener. In this case, the audio object signal may indicate at least one of an audio signal, shape, area, position, or direction of an audio object (e.g., a sound source), and may be obtained from the information acquirer 350.
The detailed operation of the multi-channel audio signal reconstructor 380 will now be described with reference to fig. 3C.
Fig. 3C is a block diagram of the structure of a multi-channel audio signal reconstructor 380 according to an embodiment of the present disclosure.
Referring to fig. 3C, the multi-channel audio signal reconstructor 380 may include an up-mix channel group audio generator 381 and a renderer 386.
The up-mix channel group audio generator 381 may generate the audio signals of an up-mix channel group based on the audio signals of the base channel group and the audio signals of the dependent channel groups. In this case, the audio signals of the up-mix channel group may be a multi-channel audio signal. In addition, the multi-channel audio signal may be generated based further on additional information (e.g., information about dynamic de-mixing weight parameters).
The up-mix channel group audio generator 381 may generate the audio signal of an up-mix channel by de-mixing the audio signals of the base channel group with some of the audio signals of the dependent channel groups. For example, by de-mixing the audio signals L and R of the base channel group with the audio signal C, which is a part of the audio signals of the dependent channel group, the audio signals L3 and R3 of the de-mixed channels (or up-mixed channels) may be generated.
The up-mix channel group audio generator 381 may generate the audio signals of some channels of the multi-channel audio signal by bypassing the de-mixing operation for some audio signals of the dependent channel groups. For example, the up-mix channel group audio generator 381 may generate the C, LFE, Hfl3, and Hfr3 channel audio signals of the multi-channel audio signal by bypassing the de-mixing operation for the audio signals of the C, LFE, Hfl3, and Hfr3 channels, which are some of the audio signals of the dependent channel groups.
As a result, the up-mix channel group audio generator 381 may generate the audio signals of the up-mix channel group based on the audio signals of the de-mixed channels and the audio signals of the dependent channel groups for which the de-mixing operation was bypassed. For example, the up-mix channel group audio generator 381 may generate the L3, R3, C, LFE, Hfl3, and Hfr3 channel audio signals, i.e., the 3.1.2 channel audio signals, based on the L3 and R3 channel audio signals, which are the audio signals of the de-mixed channels, and the C, LFE, Hfl3, and Hfr3 channel audio signals, which are the audio signals of the dependent channel groups.
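A sketch of this 3.1.2 reconstruction, assuming the encoder mixed L2 = L3 + w * C and R2 = R3 + w * C with a signaled weight w (the value used here is hypothetical):

```python
def upmix_312(base: dict, dep1: dict, w: float = 0.707) -> dict:
    """Rebuild the 3.1.2 layout: de-mix L3/R3, bypass the rest."""
    return {
        "L3": base["L2"] - w * dep1["C"],    # de-mixed (up-mixed) channel
        "R3": base["R2"] - w * dep1["C"],    # de-mixed (up-mixed) channel
        "C": dep1["C"], "LFE": dep1["LFE"],  # bypassed dependent-group channels
        "Hfl3": dep1["Hfl3"], "Hfr3": dep1["Hfr3"],
    }
```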
The detailed operation of the up-mix channel group audio generator 381 will be described later with reference to fig. 3D.
The renderer 386 may include a volume controller 388 and a limiter 389. The multi-channel audio signal input to the renderer 386 may be a multi-channel audio signal of at least one channel layout. The multi-channel audio signal input to the renderer 386 may be a Pulse Code Modulation (PCM) signal.
Meanwhile, the volume (loudness) of the audio signal of each channel may be measured based on ITU-R BS.1770 and may be signaled through the additional information of the bitstream.
The volume controller 388 may control the volume of the audio signal of each channel to a target volume (e.g., -24 LKFS) based on the volume information signaled through the bitstream.
Meanwhile, the true peak may also be measured based on ITU-R BS.1770.
The limiter 389 may limit the true peak level of the audio signal after volume control (e.g., to -1 dBTP).
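A minimal sketch of these two post-processing steps, assuming the integrated loudness measured per ITU-R BS.1770 is signaled in the bitstream (the measurement itself is omitted) and using a hard clip as a crude stand-in for a true-peak limiter:

```python
import numpy as np

def normalize_loudness(x: np.ndarray, measured_lkfs: float,
                       target_lkfs: float = -24.0) -> np.ndarray:
    """Apply the gain that moves the signaled loudness to the target."""
    gain_db = target_lkfs - measured_lkfs
    return x * (10.0 ** (gain_db / 20.0))

def limit_true_peak(x: np.ndarray, ceiling_dbtp: float = -1.0) -> np.ndarray:
    """Crude peak ceiling; a real limiter measures the true peak on an
    oversampled signal per BS.1770 and smooths the gain instead of clipping."""
    ceiling = 10.0 ** (ceiling_dbtp / 20.0)
    return np.clip(x, -ceiling, ceiling)

pcm = np.random.uniform(-0.5, 0.5, 48000)
out = limit_true_peak(normalize_loudness(pcm, measured_lkfs=-18.0))
```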
Although the post-processing components 388 and 389 included in the renderer 386 have been described, the configuration is not limited thereto; at least one component may be omitted, and the order of the components may be changed according to circumstances.
The multi-channel audio signal outputter 390 may receive the post-processed multi-channel audio signal and may output at least one multi-channel audio signal. For example, with the post-processed multi-channel audio signal as an input, the multi-channel audio signal outputter 390 may output an audio signal of each channel of the multi-channel audio signal to an audio output device corresponding to each channel according to a target channel layout. The audio output device may include various types of speakers.
Fig. 3D is a block diagram of the structure of an up-mix channel group audio generator according to an embodiment of the present disclosure.
Referring to fig. 3D, the up-mix channel group audio generator 381 may include a de-mixer 382. The de-mixer 382 may include a first de-mixer 383 and second to nth de-mixers 384 to 385.
The de-mixer 382 may obtain the audio signal of a new channel (an up-mixed channel or de-mixed channel) from the audio signals of the base channel group and the audio signals of some channels (decoded channels) of the dependent channel groups. That is, the de-mixer 382 may obtain the audio signal of one up-mix channel from at least one audio signal in which several channels are mixed. The de-mixer 382 may output an audio signal of a specific layout, including the audio signals of the up-mixed channels and the audio signals of the decoded channels.
For example, the de-mixing operation in the de-mixer 382 may be bypassed, so that the audio signals of the base channel group are output as the audio signals of the first channel layout.
With the audio signals of the base channel group and the audio signals of the first dependent channel group as inputs, the first de-mixer 383 may de-mix the audio signals of some channels. In this case, the audio signal of a de-mixed channel (or up-mixed channel) may be generated. The first de-mixer 383 may generate the audio signals of independent channels by bypassing the de-mixing operation for the audio signals of the other channels. The first de-mixer 383 may output the audio signal of the second channel layout, which includes the audio signal of the up-mixed channel and the audio signals of the independent channels.
The second de-mixer 384 may generate the audio signals of de-mixed channels (or up-mixed channels) by de-mixing the audio signals of some channels of the second channel layout with the audio signals of the second dependent channel group. The second de-mixer 384 may generate the audio signals of independent channels by bypassing the de-mixing operation for the audio signals of the other channels. The second de-mixer 384 may output the audio signal of the third channel layout, which includes the audio signals of the up-mixed channels and the audio signals of the independent channels.
Similarly to the operation of the second de-mixer 384, the nth de-mixer may output the audio signal of the nth channel layout based on the audio signal of the (n-1)th channel layout and the audio signals of the (n-1)th dependent channel group, where n may be less than or equal to N.
The nth de-mixer 385 may output the audio signal of the nth channel arrangement based on the audio signal of the (N-1) th channel arrangement and the audio signal of the (N-1) th dependent channel group.
Although the audio signals of the lower channel layouts are shown as being directly input to the respective de-mixers 383, 384, …, and 385, the audio signals of each channel layout output through the renderer 386 of fig. 3C may alternatively be input to each of the de-mixers 383, 384, …, and 385. That is, the post-processed audio signals of the lower channel layouts may be input to each of the de-mixers 383, 384, …, and 385.
With reference to fig. 3D, the de-mixers 383, 384, and 385 have been described as being connected in a cascade manner to output the audio signal of each channel layout.
However, an audio signal of a specific layout may also be output from the audio signals of the base channel group and at least one dependent channel group without connecting the de-mixers 383, 384, and 385 in a cascade manner.
Meanwhile, by using a down-mix gain for preventing clipping, an audio signal generated by mixing signals of several channels in the audio encoding apparatuses 200 and 400 may have a reduced level. The audio decoding apparatuses 300 and 500 may match the level of the audio signal with the level of the original audio signal based on the corresponding downmix gain for the signal generated by mixing.
Meanwhile, an operation based on the above-described down-mix gain may be performed for each channel or channel group. The audio encoding apparatuses 200 and 400 may signal information on the downmix gain through additional information on the bitstream of each channel or each channel group. Accordingly, the audio decoding apparatuses 300 and 500 may obtain information on a downmix gain from additional information on a bitstream of each channel or each channel group and perform the above-described operation based on the downmix gain.
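The gain round trip can be sketched as follows: the encoder attenuates the mixed signal to prevent clipping and signals the gain, and the decoder applies the inverse. Representing the gain in decibels is an assumption of this sketch.

```python
def apply_downmix_gain(mixed, gain_db: float):
    """Encoder side: attenuate the mixed signal (gain_db <= 0) to prevent
    clipping; gain_db is signaled as additional information."""
    return mixed * (10.0 ** (gain_db / 20.0))

def undo_downmix_gain(received, signaled_gain_db: float):
    """Decoder side: restore the level of the signal to that of the original."""
    return received * (10.0 ** (-signaled_gain_db / 20.0))
```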
Meanwhile, the de-mixer 382 may perform the de-mixing operation based on dynamic de-mixing weight parameters of the de-mixing matrix (or the corresponding down-mixing weight parameters of the down-mixing matrix). In this case, the audio encoding apparatuses 200 and 400 may signal the dynamic de-mixing weight parameters, or the dynamic down-mixing weight parameters corresponding thereto, through the additional information of the bitstream. Some de-mixing weight parameters may not be signaled and may instead have fixed values.
Accordingly, the audio decoding apparatuses 300 and 500 may obtain information on the dynamic downmix weight parameters (or information on the dynamic downmix weight parameters) from the additional information on the bitstream and perform a downmix operation based on the obtained information on the dynamic downmix weight parameters (or information on the dynamic downmix weight parameters).
Fig. 4A is a block diagram of an audio encoding apparatus according to an embodiment of the present disclosure.
Referring to fig. 4A, the audio encoding apparatus 400 may include a multi-channel audio encoder 450, a bitstream generator 480, and an error-cancellation-related information generator 490. The multi-channel audio encoder 450 may include a multi-channel audio signal processor 460 and a compressor 470.
The components 450, 460, 470, 480, and 490 of fig. 4A may be implemented by the memory 210 and the processor 230 of fig. 2A.
The operations of the multi-channel audio encoder 450, the multi-channel audio signal processor 460, the compressor 470, and the bitstream generator 480 of fig. 4A correspond to the operations of the multi-channel audio encoder 250, the multi-channel audio signal processor 260, the compressor 270, and the bitstream generator 280, respectively, and thus the detailed description thereof will be replaced by the description of fig. 2B.
The error-cancellation-related information generator 490 may be included in the additional information generator 285 of fig. 2B, but may also exist alone without limitation.
Error cancellation related information generator 490 may determine an error cancellation factor (e.g., a scaling factor) based on the first power value and the second power value. In this case, the first power value may be an energy value of one channel of the original audio signal or may be an energy value of an audio signal of one channel obtained by down-mixing from the original audio signal. The second power value may be a power value of an audio signal of an up-mix channel that is one of the audio signals of the up-mix channel group. The audio signal of the upmix channel group may be an audio signal obtained by unmixed the base channel reconstruction signal and the dependent channel reconstruction signal.
The error-cancellation related information generator 490 may determine an error-cancellation factor for each channel.
The error cancellation related information generator 490 may generate error cancellation related information including information about the determined error cancellation factor. The bitstream generator 480 may generate a bitstream further including the error cancellation related information. The detailed operation of the error cancellation related information generator 490 will now be described with reference to fig. 4B.
Fig. 4B is a block diagram of the structure of the error-cancellation-related information generator 490 according to an embodiment of the present disclosure.
Referring to fig. 4B, the error cancellation related information generator 490 may include a decompressor 492, a de-mixer 494, a Root Mean Square (RMS) value determiner 496, and an error cancellation factor determiner 498.
The decompressor 492 may generate a base channel reconstructed signal by decompressing the compressed audio signals of the base channel group. In addition, the decompressor 492 may generate a sub-channel reconstruction signal by decompressing the compressed audio signals of the sub-channel group.
The de-mixer 494 may de-mix the base channel reconstructed signals and the dependent channel reconstructed signals to generate the audio signals of an up-mix channel group. More specifically, the de-mixer 494 may generate the audio signal of an up-mix channel (or de-mixed channel) by de-mixing some of the audio signals of the base channel group and the dependent channel groups. The de-mixer 494 may bypass the de-mixing operation for some other audio signals of the base channel group and the dependent channel groups.
The de-mixer 494 may thereby obtain the audio signals of an up-mix channel group, including the audio signals of the up-mixed channels and the audio signals for which the de-mixing operation was bypassed.
RMS value determiner 496 may determine an RMS value of the first audio signal of one of the up-mix channels of the up-mix channel group. The RMS value determiner 496 may determine the RMS value of the second audio signal of one channel of the original audio signal or the RMS value of the second audio signal of one channel of the audio signal downmixed from the original audio signal. In this case, the channels of the first audio signal and the channels of the second audio signal may indicate the same channels in the channel layout.
The error cancellation factor determiner 498 may determine an error cancellation factor based on the RMS value of the first audio signal and the RMS value of the second audio signal. For example, a value generated by dividing the RMS value of the first audio signal by the RMS value of the second audio signal may be obtained as the value of the error cancellation factor. The error cancellation factor determiner 498 may generate information about the determined error cancellation factor. The error cancellation factor determiner 498 may output error cancellation related information including information about the error cancellation factor.
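The factor computation can be sketched as follows. Because the operands are named in different orders in the passages above, this sketch fixes one convention, the RMS of the original (or down-mixed original) channel divided by the RMS of the up-mixed channel, so that the decoder restores the original level by a single multiplication.

```python
import numpy as np

def rms(x: np.ndarray) -> float:
    return float(np.sqrt(np.mean(x ** 2)))

def error_cancellation_factor(original: np.ndarray, upmixed: np.ndarray) -> float:
    """Per-channel error cancellation (scaling) factor.

    The ratio convention (original over up-mixed, decoder multiplies)
    is an assumption of this sketch; the guard avoids division by zero.
    """
    return rms(original) / max(rms(upmixed), 1e-12)
```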
Fig. 5A is a block diagram of a structure of an audio decoding apparatus according to an embodiment of the present disclosure.
Referring to fig. 5A, the audio decoding apparatus 500 may include an information acquirer 550, a multi-channel audio decoder 560, a decompressor 570, a multi-channel audio signal reconstructor 580, and an error-cancellation-related information acquirer 555. The components 550, 555, 560, 570, and 580 of fig. 5A may be implemented by the memory 310 and the processor 330 of fig. 3A.
Instructions for implementing components 550, 555, 560, 570, and 580 of fig. 5A may be stored in memory 310 of fig. 3A. Processor 330 may execute instructions stored in memory 310.
The operations of the information acquirer 550, the decompressor 570, and the multi-channel audio signal reconstructor 580 of fig. 5A include the operations of the information acquirer 350, the decompressor 370, and the multi-channel audio signal reconstructor 380 of fig. 3B, respectively; redundant descriptions are therefore omitted. Hereinafter, only the descriptions that do not repeat those of fig. 3B will be provided.
The information acquirer 550 may acquire metadata from a bitstream.
The error-cancellation-related information acquirer 555 may acquire the error-cancellation-related information from metadata included in the bit stream. Here, the information on the error cancellation factor included in the error cancellation related information may be an error cancellation factor of an audio signal of one up-mix channel of the up-mix channel group. An error cancellation related information acquirer 555 may be included in the information acquirer 550.
The multi-channel audio signal reconstructor 580 may generate an audio signal of the up-mix channel group based on at least one audio signal of the base channel group and at least one audio signal of the at least one dependent channel group. The audio signal of the up-mix channel group may be a multi-channel audio signal. The multi-channel audio signal reconstructor 580 may reconstruct the audio signal of one up-mix channel included in the up-mix channel group by applying an error cancellation factor to the audio signal of that up-mix channel.
The multi-channel audio signal reconstructor 580 may output a multi-channel audio signal including the reconstructed audio signal of the one up-mix channel.
Fig. 5B is a block diagram of a structure of a multi-channel audio signal reconstructor according to an embodiment of the present disclosure.
The multi-channel audio signal reconstructor 580 may include an up-mix channel group audio generator 581 and a renderer 583. The renderer 583 may include an error canceller 584, a volume controller 585, a limiter 586, and a multi-channel audio signal output 587.
The up-mix channel group audio generator 581, the error canceller 584, the volume controller 585, the limiter 586, and the multi-channel audio signal outputter 587 of fig. 5B may include the operations of the up-mix channel group audio generator 381, the fader 388, the limiter 389, and the multi-channel audio signal outputter 390 of fig. 3C, and thus redundant descriptions are omitted. Hereinafter, only the portions that do not overlap with fig. 3C will be described.
The error canceller 584 may reconstruct an error-cancelled audio signal of a first channel based on the audio signal of a first up-mix channel of the up-mix channel group of the multi-channel audio signal and the error cancellation factor of the first up-mix channel. In this case, the error cancellation factor may be a value based on the RMS value of the audio signal of the first channel of the original audio signal (or of the audio signal down-mixed from the original audio signal) and the RMS value of the audio signal of the first up-mix channel of the up-mix channel group. The first channel and the first up-mix channel may indicate the same channel of the channel layout. The error canceller 584 may cancel an error caused by encoding by scaling the audio signal of the first up-mix channel of the current up-mix channel group so that its RMS value becomes the RMS value of the audio signal of the first channel of the original audio signal (or of the down-mixed audio signal).
Meanwhile, the error cancellation factor may differ between adjacent audio frames. In this case, an audible discontinuity may occur in the audio signal at the end portion of the previous frame and the beginning portion of the next frame because the error cancellation factors are discontinuous.
Accordingly, the error canceller 584 may determine the error cancellation factors used in the portions adjacent to a frame boundary by performing smoothing on the error cancellation factors. The portions adjacent to a frame boundary may refer to the end portion of the previous frame and the beginning portion of the next frame with respect to the boundary, and each portion may include a predetermined number of samples.
Here, smoothing may refer to an operation of converting discontinuous error cancellation factors between adjacent audio frames into continuous error cancellation factors in the portions adjacent to the frame boundary.
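One way to realize such smoothing, shown purely as a sketch (the linear cross-fade and the sample count are assumptions; the disclosure only requires that the factors become continuous over a predetermined number of samples around the boundary):

```python
import numpy as np

def smooth_boundary_factors(prev_factor: float, next_factor: float,
                            num_boundary_samples: int) -> np.ndarray:
    """Per-sample error cancellation factors for the frame-boundary region,
    ramping linearly from the previous frame's factor to the next frame's
    factor so that no discontinuity is applied to the audio signal."""
    return np.linspace(prev_factor, next_factor, num_boundary_samples)

# Example: scale the last samples of the previous frame and the first samples
# of the next frame with these per-sample factors instead of the two
# discontinuous per-frame factors.
per_sample = smooth_boundary_factors(prev_factor=0.8, next_factor=1.1,
                                     num_boundary_samples=64)
```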
The multi-channel audio signal outputter 587 may output a multi-channel audio signal including an error-canceled audio signal of one channel.
Meanwhile, at least one of the post-processing components 585 and 586 included in the renderer 583 may be omitted, and the order of the post-processing components 584, 585, and 586 including the error canceller 584 may be changed according to circumstances.
As described above, the audio encoding apparatuses 200 and 400 may generate a bitstream. The audio encoding apparatuses 200 and 400 may transmit the generated bitstream.
In this case, the bitstream may be generated in the form of a file stream. The audio decoding apparatuses 300 and 500 may receive the bitstream. The audio decoding apparatuses 300 and 500 may reconstruct the multi-channel audio signal based on information obtained from the received bitstream. In this case, the bitstream may be included in a predetermined file container. For example, the file container may be a Moving Picture Experts Group (MPEG)-4 media container for compressing various multimedia digital data, such as MPEG-4 part 14 (MP4), or the like.
Fig. 6A is a view for describing a transmission order and a rule of an audio stream in each channel group by the audio encoding apparatuses 200 and 400 according to an embodiment of the present disclosure.
In the scalable format, the transmission order and rules of the audio streams in each channel group may be as follows.
The audio encoding apparatuses 200 and 400 may first transmit the coupled stream and then transmit the uncoupled stream.
The audio encoding apparatuses 200 and 400 may first transmit the coupled stream for the surround channels and then transmit the coupled stream for the height channels.
The audio encoding apparatuses 200 and 400 may first transmit the coupled streams for the front channels and then transmit the coupled streams for the side channels or the rear channels.
For uncoupled streams, the audio encoding apparatuses 200 and 400 may first transmit a stream for the center channel and then transmit streams for the LFE channel and the other channel. Here, the other channel may exist when the base channel group includes a mono signal. In this case, the other channel may be one of the left channel L2 or the right channel R2 of the stereo channels.
The audio encoding apparatuses 200 and 400 may compress the audio signals of coupled channels as a pair. The audio encoding apparatuses 200 and 400 may first transmit a coupled stream including the audio signals compressed as a pair. For example, the coupled channels may refer to bilaterally symmetric channels such as the L/R, Ls/Rs, Lb/Rb, Hfl/Hfr, and Hbl/Hbr channels.
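The precedence implied by the rules above can be summarized with a sort key; the stream descriptor fields used below (coupled, height, position) are hypothetical names introduced only to make the ordering explicit:

```python
COUPLED_POSITION_ORDER = {"front": 0, "side": 1, "rear": 2}
UNCOUPLED_POSITION_ORDER = {"center": 0, "lfe": 1, "other": 2}

def transmission_sort_key(stream: dict) -> tuple:
    """Sort key mirroring the rules above: coupled streams precede uncoupled
    ones; among coupled streams, surround precedes height and front precedes
    side/rear; among uncoupled streams, center precedes the LFE channel and
    the other channel."""
    if stream["coupled"]:
        return (0, 1 if stream["height"] else 0,
                COUPLED_POSITION_ORDER.get(stream["position"], 3))
    return (1, 0, UNCOUPLED_POSITION_ORDER.get(stream["position"], 3))

def transmission_order(streams: list) -> list:
    return sorted(streams, key=transmission_sort_key)
```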
Hereinafter, according to the above-described transmission order and rule of the streams in each channel group, the stream configuration of each channel group in the bit stream 610 of case 1 will be described.
Referring to fig. 6A, for example, the audio encoding apparatuses 200 and 400 may compress L1 and R1 signals as 2-channel audio signals, and the compressed L1 and R1 signals may be included in a C1 bitstream of a Basic Channel Group (BCG).
Next to the basic channel group, the audio encoding apparatuses 200 and 400 may compress the 4-channel audio signal into the audio signal of the sub-channel group # 1.
The audio encoding apparatuses 200 and 400 may compress the Hfl3 signal and the Hfr3 signal, and the compressed Hfl3 signal and Hfr3 signal may be included in a C2 bitstream in the bitstream of the slave channel group # 1.
The audio encoding apparatuses 200 and 400 may compress the C signal, and the compressed C signal may be included in an M1 bit stream in the bit stream of the slave channel group # 1.
The audio encoding apparatuses 200 and 400 may compress the LFE signal, and the compressed LFE signal may be included in an M2 bitstream in the bitstream of the slave channel group # 1.
The audio decoding apparatuses 300 and 500 may reconstruct an audio signal of a 3.1.2 channel layout based on the compressed audio signals of the base channel group and the dependent channel group # 1.
Immediately after the sub-channel group #1, the audio encoding apparatuses 200 and 400 may compress the 6-channel audio signal into the audio signal of the sub-channel group # 2.
The audio encoding apparatuses 200 and 400 may first compress the L signal and the R signal, and the compressed L signal and R signal may be included in a C3 bitstream in the bitstream of the slave channel group # 2.
Next to the C3 bitstream, the audio encoding apparatuses 200 and 400 may compress the Ls signal and the Rs signal, and the compressed Ls signal and Rs signal may be included in the C4 bitstream in the bitstream of the slave channel group # 2.
Next to the C4 bitstream, the audio encoding apparatuses 200 and 400 may compress the Hfl signal and the Hfr signal, and the compressed Hfl and Hfr signals may be included in the C5 bitstream in the bitstream of the slave channel group # 2.
The audio decoding apparatuses 300 and 500 may reconstruct an audio signal of a 7.1.4 channel layout based on compressed audio signals of the base channel group, the sub channel group #1, and the sub channel group # 2.
Hereinafter, according to the above-described transmission order and rule of the streams in each channel group, the stream configuration of each channel group in the bit stream 620 of case 2 will be described.
The audio encoding apparatuses 200 and 400 may compress an L2 signal and an R2 signal, which are 2-channel audio signals, and the compressed L2 signal and R2 signal may be included in a C1 bitstream in a bitstream of a basic channel group.
Next to the basic channel group, the audio encoding apparatuses 200 and 400 may compress the 6-channel audio signal into the audio signal of the sub-channel group # 1.
The audio encoding apparatuses 200 and 400 may first compress the L signal and the R signal, and the compressed L signal and R signal may be included in a C2 bitstream in the bitstream of the slave channel group # 1.
The audio encoding apparatuses 200 and 400 may compress the Ls signal and the Rs signal, and the compressed Ls signal and Rs signal may be included in a C3 bitstream in the bitstream of the sub-channel group # 1.
The audio encoding apparatuses 200 and 400 may compress the C signal, and the compressed C signal may be included in an M1 bit stream in the bit stream of the slave channel group # 1.
The audio encoding apparatuses 200 and 400 may compress the LFE signal, and the compressed LFE signal may be included in an M2 bitstream in the bitstream of the slave channel group # 1.
The audio decoding apparatuses 300 and 500 may reconstruct an audio signal of a 7.1.0 channel layout based on the compressed audio signals of the base channel group and the dependent channel group #1.
Immediately after the sub-channel group #1, the audio encoding apparatuses 200 and 400 may compress the 4-channel audio signal into the audio signal of the sub-channel group # 2.
The audio encoding apparatuses 200 and 400 may compress the Hfl signal and the Hfr signal, and the compressed Hfl signal and Hfr signal may be included in a C4 bitstream in the bitstream of the slave channel group # 2.
The audio encoding apparatuses 200 and 400 may compress the Hbl signal and the Hbr signal, and the compressed Hbl and Hbr signals may be included in a C5 bitstream in the bitstream of the slave channel group #2.
The audio decoding apparatuses 300 and 500 may reconstruct an audio signal of a 7.1.4 channel layout based on compressed audio signals of the base channel group, the sub channel group #1, and the sub channel group # 2.
Hereinafter, according to the above-described transmission order and rule of the streams in each channel group, the stream configuration of each channel group in the bit stream 630 of case 3 will be described.
The audio encoding apparatuses 200 and 400 may compress an L2 signal and an R2 signal, which are 2-channel audio signals, and the compressed L2 signal and R2 signal may be included in a C1 bitstream in a bitstream of a basic channel group.
Next to the basic channel group, the audio encoding apparatuses 200 and 400 may compress the 10-channel audio signal into the audio signal of the sub-channel group # 1.
The audio encoding apparatuses 200 and 400 may first compress the L signal and the R signal, and the compressed L signal and R signal may be included in a C2 bitstream in the bitstream of the slave channel group # 1.
The audio encoding apparatuses 200 and 400 may compress the Ls signal and the Rs signal, and the compressed Ls signal and Rs signal may be included in a C3 bitstream in the bitstream of the sub-channel group # 1.
The audio encoding apparatuses 200 and 400 may compress the Hfl signal and the Hfr signal, and the compressed Hfl signal and Hfr signal may be included in a C4 bitstream in the bitstream of the slave channel group # 1.
The audio encoding apparatuses 200 and 400 may compress the Hbl signal and the Hbr signal, and the compressed Hbl and Hbr signals may be included in a C5 bitstream in the bitstream of the slave channel group #1.
The audio encoding apparatuses 200 and 400 may compress the C signal, and the compressed C signal may be included in an M1 bit stream in the bit stream of the slave channel group # 1.
The audio encoding apparatuses 200 and 400 may compress the LFE signal, and the compressed LFE signal may be included in an M2 bitstream in the bitstream of the slave channel group # 1.
The audio decoding apparatuses 300 and 500 may reconstruct an audio signal of a 7.1.4 channel layout based on the compressed audio signals of the base channel group and the dependent channel group #1.
Meanwhile, the audio decoding apparatuses 300 and 500 may perform the de-mixing in a stepwise manner by using at least one up-mixing unit. The unmixing may be performed based on audio signals of channels included in at least one channel group.
For example, the 1.X to 2.X up-mixing unit (first up-mixing unit) may de-mix the audio signal of the right channel from the audio signal of the mono channel, which corresponds to a mixed channel of the left and right channels.
Alternatively, the 2.X to 3.X up-mixing unit (second up-mixing unit) may de-mix the audio signal of the center channel from the audio signals of the L2 and R2 channels corresponding to the mixed center channel. Alternatively, the 2.X to 3.X up-mixing unit (second up-mixing unit) may de-mix the audio signal of the L3 channel and the audio signal of the R3 channel from the audio signals of the L2 and R2 channels, which correspond to the mixed L3 and R3 channels, and the audio signal of the C channel.
The 3.X to 5.X up-mixing unit (third up-mixing unit) may de-mix the audio signals of the Ls5 channel and the Rs5 channel from the audio signals of the L3, R3, L5, and R5 channels corresponding to the mixed Ls5/Rs5 channels.
The 5.X to 7.X up-mixing unit (fourth up-mixing unit) may de-mix the audio signal of the Lb channel and the audio signal of the Rb channel from the audio signals of the Ls5, Rs5, Ls7, and Rs7 channels corresponding to the mixed Lb/Rb channels.
The x.x.2 (FH) to x.x.2 (H) up-mixing unit (fourth up-mixing unit) may de-mix the audio signals of the Hl channel and the Hr channel from the audio signals of the Hfl3, Hfr3, L3, L5, R3, and R5 channels, where the Hfl3/Hfr3 channels correspond to the mixed Hl/Hr channels.
The x.x.2 (H) to x.x.4 up-mixing unit (fifth up-mixing unit) may de-mix the audio signals of the Hbl channel and the Hbr channel from the audio signals of the Hl, Hr, Hfl, and Hfr channels, where the Hl/Hr channels correspond to the mixed Hbl/Hbr channels.
For example, the audio decoding apparatuses 300 and 500 may perform the de-mixing to the 3.1.2 channel layout by using the first up-mixing unit.
The audio decoding apparatuses 300 and 500 may perform the de-mixing to the 7.1.4 channel layout by using the second and third up-mixing units for the surround channels and the fourth and fifth up-mixing units for the height channels.
Alternatively, the audio decoding apparatuses 300 and 500 may perform the de-mixing to the 7.1.0 channel layout by using the first up-mixing unit, the second up-mixing unit, and the third up-mixing unit. The audio decoding apparatuses 300 and 500 may not perform the de-mixing from the 7.1.0 channel layout to the 7.1.4 channel layout.
Alternatively, the audio decoding apparatuses 300 and 500 may perform the de-mixing to the 7.1.4 channel layout by using the first up-mixing unit, the second up-mixing unit, and the third up-mixing unit. In this case, the audio decoding apparatuses 300 and 500 may not perform the de-mixing on the height channels.
Hereinafter, a rule for generating a channel group by the audio encoding apparatuses 200 and 400 will be described. For a channel layout CLi in a scalable format (i is an integer from 0 to n, and CLi represents an Si.Wi.Hi layout), Si+Wi+Hi may be the number of channels of channel group #i. The number of channels of channel group #i may be greater than the number of channels of channel group #i-1.
The channel group #i may include as many original channels (display channels) of CLi as possible. The original channels may follow the priorities described below.
When Hi-1 is 0, the priority of the height channels may be higher than the priorities of the other channels. The priorities of the center channel and the LFE channel may precede those of the other channels.
The priority of the front height channels may precede the priorities of the side channels and the rear height channels.
The priority of the side channels may be before the priority of the rear channels. Further, the priority of the left channel may precede the priority of the right channel.
For example, when n is 4, CL0 is a stereo channel, CL1 is 3.1.2 channels, CL2 is 5.1.2 channels, and CL3 is 7.1.4 channels, a channel group can be generated as described below.
The audio encoding apparatuses 200 and 400 may generate a basic channel group including A (=L2) and B (=R2) signals. The audio encoding apparatuses 200 and 400 may generate the sub-channel group #1 including Q1 (=Hfl3), Q2 (=Hfr3), T (=C), and P (=LFE) signals. The audio encoding apparatuses 200 and 400 may generate the sub-channel group #2 including S1 (=L) and S2 (=R) signals.
The audio encoding apparatuses 200 and 400 may generate the sub-channel group #3 including V1 (Hfl), V2 (Hfr), U1 (Ls), and U2 (Rs) signals.
Meanwhile, the audio decoding apparatuses 300 and 500 may reconstruct an audio signal of 7.1.4 channels from the decompressed audio signal by using the down-mix matrix. In this case, the downmix matrix may include, for example, the downmix weight parameters in table 2 provided as follows.
TABLE 2
Here, cw denotes a center weight, which may be 0 when the channel layout of the base channel group is a 3.1.2 channel layout and 1 when the channel layout of the base channel group is a 2-channel layout. w denotes the surround-to-height mixing weight. α, β, γ, and δ denote the down-mix weight parameters and may be variable. The audio encoding apparatuses 200 and 400 may generate bitstreams including down-mix weight parameter information such as α, β, γ, δ, and w, and the audio decoding apparatuses 300 and 500 may obtain the down-mix weight parameter information from the bitstreams.
On the other hand, the weight parameter information on the down-mixing (or de-mixing) matrix may be in the form of an index. For example, the weight parameter information may be index information indicating one of a plurality of down-mixing (or de-mixing) weight parameter sets, and the down-mixing (or de-mixing) weight parameters corresponding to each set may exist in the form of a lookup table (LUT). That is, at least one of α, β, γ, δ, or w may be predefined in the LUT for each down-mixing (or de-mixing) weight parameter set. Accordingly, the audio decoding apparatuses 300 and 500 may obtain α, β, γ, δ, and w corresponding to the signaled weight parameter set.
The matrix for down-mixing from the first channel layout to the second channel layout may comprise a plurality of matrices. For example, the matrix may include a first matrix for down-mixing from the first channel layout to the third channel layout and a second matrix for down-mixing from the third channel layout to the second channel layout.
More specifically, for example, a matrix for downmixing from an audio signal of a 7.1.4 channel layout to an audio signal of a 3.1.2 channel layout may include a first matrix for downmixing from an audio signal of a 7.1.4 channel layout to an audio signal of a 5.1.4 channel layout and a second matrix for downmixing from an audio signal of a 5.1.4 channel layout to an audio signal of a 3.1.2 channel layout.
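As an illustrative sketch of such a cascade (the two matrices below are placeholders, not the actual coefficients of tables 3 and 4), the 7.1.4 to 5.1.4 to 3.1.2 down-mixing can be composed by ordinary matrix multiplication:

```python
import numpy as np

def cascade_downmix(x_714: np.ndarray,
                    m_714_to_514: np.ndarray,
                    m_514_to_312: np.ndarray) -> np.ndarray:
    """Apply a first down-mix matrix (shape (10, 12), 7.1.4 -> 5.1.4) and then
    a second one (shape (6, 10), 5.1.4 -> 3.1.2) to a (12, num_samples) block
    of channel signals; the channel counts and ordering are assumptions."""
    x_514 = m_714_to_514 @ x_714
    return m_514_to_312 @ x_514

# Equivalently, the combined 7.1.4 -> 3.1.2 matrix is m_514_to_312 @ m_714_to_514.
```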
Tables 3 and 4 show a first matrix and a second matrix for down-mixing audio signals from a 7.1.4 channel layout to a 3.1.2 channel layout based on content-based down-mix parameters and on surround to height weights.
TABLE 3
First matrix (7.1 to 5.1 down-mix matrix)
TABLE 4
Second matrix (5.1.4 to 3.1.2 down-mix matrix)
Here, α, β, γ, or δ represents one of the down-mix parameters, and w represents the surround to height weight.
For up-mixing (or de-mixing) from 5.X channels to 7.X channels, de-mixing weight parameters α and β may be used.
For up-mixing from the x.x.2 (H) channel to the x.x.4 channel, the de-mixing weight parameter γ may be used.
For up-mixing from 3.X channels to 5.X channels, a de-mixing weight parameter δ may be used.
For up-mixing from x.x.2 (FH) channels to x.x.2 (H) channels, the de-mixing weight parameters w and δ may be used.
For up-mixing from 2.X channels to 3.X channels, a -3 dB de-mixing weight parameter may be used. That is, the de-mixing weight parameter may be a fixed value and may not be signaled.
Furthermore, for up-mixing from 1.X channels to 2.X channels, a -6 dB de-mixing weight parameter may be used. That is, the de-mixing weight parameter may be a fixed value and may not be signaled.
Meanwhile, the de-mixing weight parameters used for de-mixing may belong to one of a plurality of types. For example, the type 1 de-mixing weight parameters α, β, γ, and δ may be 0 dB, 0 dB, -3 dB, and -3 dB, respectively. The type 2 de-mixing weight parameters α, β, γ, and δ may be -3 dB, -3 dB, -3 dB, and -3 dB, respectively. The type 3 de-mixing weight parameters α, β, γ, and δ may be 0 dB, -1.25 dB, -1.25 dB, and -1.25 dB, respectively. Type 1 may indicate a case where the audio signal is a normal audio signal, type 2 may indicate a case where a dialog is included in the audio signal (a dialog type), and type 3 may indicate a case where a sound effect exists in the audio signal (a sound effect type).
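The dB figures above correspond to linear gains through 10^(dB/20). A small sketch, assuming the per-type quadruples read as reconstructed above:

```python
def db_to_linear(db: float) -> float:
    return 10.0 ** (db / 20.0)

# De-mixing weight parameters (alpha, beta, gamma, delta) per type, in dB.
TYPE_PARAMS_DB = {
    1: (0.0, 0.0, -3.0, -3.0),      # normal audio signal
    2: (-3.0, -3.0, -3.0, -3.0),    # dialog type
    3: (0.0, -1.25, -1.25, -1.25),  # sound effect type
}

TYPE_PARAMS_LINEAR = {
    t: tuple(round(db_to_linear(v), 3) for v in vals)
    for t, vals in TYPE_PARAMS_DB.items()
}
# -3 dB maps to about 0.708 and -1.25 dB to about 0.866, matching the
# parameter values listed in table 6 below.
```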
The audio encoding apparatuses 200 and 400 may analyze the audio signal and determine one of the plurality of types according to the analysis result. The audio encoding apparatuses 200 and 400 may perform down-mixing on the original audio signal by using the weight parameters of the determined type to generate an audio signal of a lower channel layout.
The audio encoding apparatuses 200 and 400 may generate a bitstream including index information indicating one of a plurality of types. The audio decoding apparatuses 300 and 500 may obtain index information from the bitstream and identify one of a plurality of types based on the obtained index information. The audio decoding apparatuses 300 and 500 may up-mix the audio signals of the decompressed channel groups by using the recognized type of the de-mixing weight parameters to reconstruct the audio signals of the specific channel layout.
Alternatively, the audio signal generated by the down-mixing may be represented as equation 1 provided below. That is, the down-mixing may be performed based on operations in the form of first-degree (linear) polynomial equations, and each down-mixed audio signal may be generated accordingly.
[ equation 1]
Ls5=α×Ls7+β×Lb7
Rs5=α×Rs7+β×Rb7
L3=L5+δ×Ls5
R3=R5+δ×Rs5
L2=L3+p2×C
R2=R3+p2×C
Mono=p1×(L2+R2)
Hl=Hfl+γ×Hbl
Hr=Hfr+γ×Hbr
Hfl3=Hl+w′×δ×Ls5
Hfr3=Hr+w′×δ×Rs5
Here, p1 may be about 0.5 (i.e., -6 dB), and p2 may be about 0.707 (i.e., -3 dB). α and β may be values for down-mixing the number of surround channels from 7 channels to 5 channels. For example, α or β may be 1 (i.e., 0 dB), 0.866 (i.e., -1.25 dB), or 0.707 (i.e., -3 dB). γ may be a value for down-mixing the number of height channels from 4 channels to 2 channels. For example, γ may be one of 0.866 or 0.707. δ may be a value for down-mixing the number of surround channels from 5 channels to 3 channels. δ may be 0.866 or 0.707. w′ may be a value for down-mixing from H2 (e.g., the height channels of the 5.1.2 channel layout or the 7.1.2 channel layout) to Hf2 (the height channels of the 3.1.2 channel layout).
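A direct transcription of equation 1 into code, as a sketch (the channel dictionary and the default parameter values are illustrative; the Hfl3/Hfr3 lines follow the additive form implied by the de-mixing equations below):

```python
def downmix_eq1(ch: dict, alpha: float, beta: float, gamma: float,
                delta: float, w_prime: float,
                p1: float = 0.5, p2: float = 0.707) -> dict:
    """Step-by-step down-mixing following equation 1. `ch` maps channel names
    ('L5', 'R5', 'Ls7', 'Rs7', 'Lb7', 'Rb7', 'C', 'Hfl', 'Hfr', 'Hbl', 'Hbr')
    to sample arrays or values."""
    out = {}
    out["Ls5"] = alpha * ch["Ls7"] + beta * ch["Lb7"]
    out["Rs5"] = alpha * ch["Rs7"] + beta * ch["Rb7"]
    out["L3"] = ch["L5"] + delta * out["Ls5"]
    out["R3"] = ch["R5"] + delta * out["Rs5"]
    out["L2"] = out["L3"] + p2 * ch["C"]
    out["R2"] = out["R3"] + p2 * ch["C"]
    out["Mono"] = p1 * (out["L2"] + out["R2"])
    out["Hl"] = ch["Hfl"] + gamma * ch["Hbl"]
    out["Hr"] = ch["Hfr"] + gamma * ch["Hbr"]
    out["Hfl3"] = out["Hl"] + w_prime * delta * out["Ls5"]
    out["Hfr3"] = out["Hr"] + w_prime * delta * out["Rs5"]
    return out
```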
Similarly, an audio signal generated by the de-mixing may be expressed as equation 2. That is, the de-mixing may be performed in a stepwise manner based on operations in the form of first-degree (linear) polynomial equations (the operation of each equation corresponds to one de-mixing step), not limited to an operation using a de-mixing matrix, and each de-mixed audio signal may be generated accordingly.
[ equation 2]
R2=Mono/p1-L2
L3=L2-p2×C
R3=R2-p2×C
Ls5=(L3-L5)/δ
Rs5=(R3-R5)/δ
Lb7=(Ls5-α×Ls7)/β
Rb7=(Rs5-α×Rs7)/β
Hl=Hfl3-w′×(L3-L5)
Hr=Hfr3-w′×(R3-R5)
Hbl=(Hl-Hfl)/γ
Hbr=(Hr-Hfr)/γ
w' may be a value for down-mixing from H2 (e.g., height channels of 5.1.2 channel layout or 7.1.2 channel layout) to Hf2 (height channels of 3.1.2 channel layout) or for de-mixing from Hf2 (height channels of 3.1.2 channel layout) to H2 (e.g., height channels of 5.1.2 channel layout or 7.1.2 channel layout).
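The corresponding de-mixing steps can be sketched the same way (only the L3/R3 and Hl/Hr equations of equation 2 are transcribed here):

```python
def demix_height(ch: dict, w_prime: float, p2: float = 0.707) -> dict:
    """De-mixing per equation 2: recover L3/R3 from L2/R2 and C, then Hl/Hr
    from Hfl3/Hfr3 using the surround residues L3-L5 and R3-R5."""
    out = {}
    out["L3"] = ch["L2"] - p2 * ch["C"]
    out["R3"] = ch["R2"] - p2 * ch["C"]
    out["Hl"] = ch["Hfl3"] - w_prime * (out["L3"] - ch["L5"])
    out["Hr"] = ch["Hfr3"] - w_prime * (out["R3"] - ch["R5"])
    return out
```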
Meanwhile, sumw may be updated according to w, and the value of w′ corresponding to sumw may then be obtained. w may be -1 or 1 and may be transmitted for each frame.
For example, the initial value of sumw may be 0. When w is 1 for a frame, the value of sumw may be increased by 1, and when w is -1 for a frame, the value of sumw may be decreased by 1. When increasing or decreasing sumw would take its value outside the range of 0 to 10, the value of sumw may be kept at 0 or 10, respectively. Table 5, showing the relationship between w′ and sumw, may be as follows. That is, w′ may be gradually updated for each frame, and thus may be used for de-mixing from Hf2 to H2.
TABLE 5
sumw | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
w′ | 0 | 0.0179 | 0.0391 | 0.0658 | 0.1038 | 0.25 | 0.3962 | 0.4342 | 0.4609 | 0.4821 | 0.5
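A sketch of the per-frame update and lookup (the table values are those of table 5; the value 0.4342 for sumw = 7 is an inference from the symmetry of the neighboring entries, since it is illegible in the source, and the clamping to the range 0 to 10 follows the description):

```python
W_PRIME_TABLE = [0.0, 0.0179, 0.0391, 0.0658, 0.1038, 0.25,
                 0.3962, 0.4342, 0.4609, 0.4821, 0.5]

def update_sum_w(sum_w: int, w: int) -> int:
    """Update sum_w by the per-frame w (-1 or 1), keeping it within [0, 10]."""
    return max(0, min(10, sum_w + w))

def w_prime_for(sum_w: int) -> float:
    return W_PRIME_TABLE[sum_w]

# Example: starting from sum_w = 0, ten consecutive frames with w = 1 ramp
# w' gradually from 0 up to 0.5.
```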
Without being limited thereto, the de-mixing may be performed by combining a plurality of de-mixing steps. For example, the signals of the Ls5 channel and the Rs5 channel de-mixed from the two surround channels L2 and R2 may be represented as equation 3, which combines the second to fifth equations of equation 2.
[ equation 3]
Ls5=(L2-p2×C-L5)/δ
Rs5=(R2-p2×C-R5)/δ
The signals of the Hl channel and the Hr channel de-mixed from the two surround channels L2 and R2 may be represented as equation 4, which combines the second and third equations and the eighth and ninth equations of equation 2.
[ equation 4]
Hl=Hfl3-w′×(L2-p2×C-L5)
Hr=Hfr3-w′×(R2-p2×C-R5)
Fig. 6B and 6C illustrate examples of mechanisms for stepwise down-mixing according to an embodiment of the present disclosure. The stepwise down-mixing for the surround channels and the height channels may follow mechanisms such as those shown in fig. 6B and 6C.
The down-mixing related information (or de-mixing related information) may be index information indicating one of a plurality of modes, each mode corresponding to a preset combination of the five down-mixing (or de-mixing) weight parameters. For example, as shown in table 6, the down-mixing weight parameters corresponding to the plurality of modes may be predetermined.
TABLE 6
Mode | Down-mixing (or de-mixing) weight parameters (α, β, γ, δ, w)
1 | (1, 1, 0.707, 0.707, -1)
2 | (0.707, 0.707, 0.707, 0.707, -1)
3 | (1, 0.866, 0.866, 0.866, -1)
4 | (1, 1, 0.707, 0.707, 1)
5 | (0.707, 0.707, 0.707, 0.707, 1)
6 | (1, 0.866, 0.866, 0.866, 1)
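A sketch of the mode-indexed lookup, with the tuples copied from table 6:

```python
DOWNMIX_PARAM_MODES = {
    1: (1.0, 1.0, 0.707, 0.707, -1),
    2: (0.707, 0.707, 0.707, 0.707, -1),
    3: (1.0, 0.866, 0.866, 0.866, -1),
    4: (1.0, 1.0, 0.707, 0.707, 1),
    5: (0.707, 0.707, 0.707, 0.707, 1),
    6: (1.0, 0.866, 0.866, 0.866, 1),
}

def params_for_mode(mode: int) -> tuple:
    """(alpha, beta, gamma, delta, w) for the signaled mode index."""
    return DOWNMIX_PARAM_MODES[mode]
```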
Hereinafter, an audio encoding process and an audio decoding process for performing down-mixing or de-mixing based on an audio scene type will be described with reference to fig. 7A to 18D. Further, an audio encoding process and an audio decoding process for performing down-mixing or de-mixing based on energy analysis of an audio signal of a height channel (e.g., a height channel audio signal) or the like will be described.
Hereinafter, embodiments of the present disclosure according to the technical spirit of the present disclosure will be described in detail in order.
Fig. 7A is a block diagram of an audio encoding apparatus according to an embodiment of the present disclosure.
The audio encoding apparatus 700 may include a memory 710 and a processor 730. The audio encoding apparatus 700 may be implemented as an apparatus capable of performing audio processing, such as a server, a TV, a camera, a cellular phone, a tablet PC, a laptop computer, or the like.
Although the memory 710 and the processor 730 are separately shown in fig. 7A, the memory 710 and the processor 730 may be implemented by one hardware module (e.g., chip).
Processor 730 may be implemented as a dedicated processor for neural network based audio processing. Alternatively, processor 730 may be implemented by a combination of a general-purpose processor (e.g., an AP, CPU, or GPU) and software. The special purpose processor may include a memory to implement embodiments of the present disclosure or a memory processor to use external memory.
Processor 730 may include multiple processors. In this case, the processor 730 may be implemented as a combination of dedicated processors, or may be implemented through a combination of software and a plurality of general-purpose processors (such as an AP, a CPU, or a GPU).
Memory 710 may store one or more instructions for audio processing. In an embodiment of the present disclosure, the memory 710 may store a neural network. When the neural network is implemented in the form of a dedicated hardware chip for artificial intelligence or as part of an existing general-purpose processor (e.g., CPU or AP) or a graphics-specific processor (e.g., GPU), the neural network may not be stored in the memory 710. The neural network may be implemented by an external device (e.g., a server), and in this case, the audio encoding apparatus 700 may request and receive result information based on the neural network from the external device.
Processor 730 may sequentially process successive frames and obtain successive encoded (compressed) frames according to instructions stored in memory 710. Successive frames may refer to frames that make up audio.
The processor 730 may perform an audio processing operation with the original audio signal as an input and output a bitstream including the compressed audio signal. In this case, the original audio signal may be a multi-channel audio signal. The compressed audio signal may be a multi-channel audio signal having a channel number less than or equal to the channel number of the original audio signal. In this case, the bitstream may include compressed audio signals of the base channel group, and further, compressed audio signals of n dependent channel groups (n is an integer greater than or equal to 1). Therefore, the number of channels can be freely increased according to the number of the dependent channel groups.
Fig. 7B is a block diagram of an audio encoding apparatus according to an embodiment of the present disclosure.
Referring to fig. 7B, the audio encoding apparatus 700 may include a multi-channel audio encoder 740, a bitstream generator 780, and an additional information generator 785. The multi-channel audio encoder 740 may include a multi-channel audio signal processor 750 and a compressor 776.
Referring back to fig. 7A, as described above, the audio encoding apparatus 700 may include the memory 710 and the processor 730, and instructions for implementing the components 740, 750, 760, 765, 770, 775, 776, 780, and 785 of fig. 7B may be stored in the memory 710 of fig. 7A. Processor 730 may execute instructions stored in memory 710.
The multi-channel audio signal processor 750 may obtain (e.g., generate) at least one audio signal of the base channel group and at least one audio signal of the at least one slave channel group from the original audio signal.
The multi-channel audio signal processor 750 may include an audio scene type identifier 760, a downmix weight parameter identifier 765, a downmix channel audio generator 770, and an audio signal classifier 775.
The audio scene type identifier 760 may identify an audio scene type of the original audio signal. The audio scene type may be identified for each frame.
The audio scene type identifier 760 may downsample an original audio signal and identify an audio scene type based on the downsampled original audio signal.
The audio scene type identifier 760 may obtain the audio signal of the center channel from the original audio signal. The audio scene type identifier 760 may identify a dialog type from the obtained audio signal of the center channel. In this case, the audio scene type identifier 760 may identify the dialog type by using a first neural network for identifying the dialog type. More specifically, the audio scene type identifier 760 may identify the first dialog type as a dialog type when a probability value of the dialog type identified by using the first neural network is greater than a predetermined first probability value of the first dialog type.
The audio scene type identifier 760 may identify a default type (e.g., a default dialog type) as a dialog type when the probability of the dialog type identified by using the first neural network is less than or equal to a predetermined first probability value for the first dialog type.
The audio scene type identifier 760 may identify a type of sound effect from an original audio signal based on an audio signal of a front channel (e.g., a front channel audio signal) and an audio signal of a side channel (e.g., a side channel audio signal).
The audio scene type identifier 760 may identify the type of sound effect by using a second neural network for identifying the type of sound effect. More specifically, the audio scene type identifier 760 may identify a sound effect type as a first sound effect type when a probability value of the sound effect type identified by using the second neural network is greater than a predetermined second probability value of the first sound effect type.
The audio scene type identifier 760 may identify the sound effect type as a default type (e.g., a default sound effect type) when the probability value of the sound effect type identified by using the second neural network is less than or equal to a predetermined second probability value of the first sound effect type.
The audio scene type identifier 760 may identify an audio scene type based on at least one of the identified dialog type or the identified sound effect type. In other words, the audio scene type identifier 760 may identify one of a plurality of audio scene types. The process for identifying the audio scene type will be described in detail with reference to fig. 11.
The downmix weight parameter identifier 765 may identify a downmix profile corresponding to an audio scene type. The downmix weight parameter identifier 765 may obtain a downmix weight parameter for (down) mixing from a first audio signal of at least one first channel to a second audio signal of a second channel according to a downmix profile. A particular down-mix weight parameter corresponding to a particular audio scene type may be predetermined.
The downmix channel audio generator 770 may down-mix the original audio signal based on the obtained downmix weight parameters. The down-mix channel audio generator 770 may generate an audio signal of a predetermined channel layout as a result of the down-mixing.
The audio signal classifier 775 may generate at least one audio signal of the base channel group and at least one audio signal of the dependent channel group based on the audio signals of the predetermined channel layout.
The compressor 776 may compress the audio signals of the base channel group and the audio signals of the dependent channel group. That is, the compressor 776 may compress at least one audio signal of the basic channel set to obtain at least one compressed audio signal of the basic channel set. Here, compression may refer to compression based on various audio codecs. For example, compression may include transform and quantization processes.
The compressor 776 may obtain at least one compressed audio signal of the at least one slave channel group by compressing the at least one audio signal of the at least one slave channel group.
The additional information generator 785 may generate additional information including information regarding the type of audio scene.
The bitstream generator 780 may generate a bitstream including the compressed audio signal of the base channel group and the compressed audio signal of the dependent channel group.
The bitstream generator 780 may generate a bitstream further including the additional information generated by the additional information generator 785.
More specifically, the bitstream generator 780 may generate a primary audio stream and a secondary audio stream. The primary audio stream may comprise compressed audio signals of a primary channel group and the secondary audio stream may comprise compressed audio signals of a secondary channel group.
In addition, the bitstream generator 780 may generate metadata including additional information. As a result, the bitstream generator 780 may generate a bitstream including the primary audio stream, the secondary audio stream, and the metadata.
Fig. 8 is a block diagram of an audio encoding apparatus according to an embodiment of the present disclosure.
Referring to fig. 8, the audio encoding apparatus 800 may include a multi-channel audio encoder 840, a bitstream generator 880, and an additional information generator 885.
The multi-channel audio signal processor 850 may include a down-mix weight parameter identifier 855, an additional weight parameter identifier 860, a down-mix channel audio generator 870, and an audio signal classifier 875.
The downmix weight parameter identifier 855 may identify the downmix weight parameters.
As in the downmix weight parameter identifier 765 described with reference to fig. 7B, the downmix weight parameter identifier 855 may identify the downmix weight parameters based on the audio scene type. However, the example is not limited thereto, and the down-mix weight parameters may be identified in various ways.
The additional weight parameter identifier 860 may identify energy values of the audio signal of the high channel from the original audio signal. The additional weight parameter identifier 860 may identify energy values of the audio signals of the surround channels from the original audio signal. Meanwhile, the additional weight parameter identifier 860 may determine a range of additional weights or values of additional weight candidates (e.g., first weight and eighth weight) according to the audio scene type.
The additional weight parameter identifier 860 may identify additional weight parameters for mixing from the surround channels to the height channels based on the energy values of the audio signals of the identified height channels and the energy values of the identified surround channels. The energy value of the surround channel may be a moving average of the total power of the surround channel. More specifically, the energy value of the surround channel may be a Root Mean Square Energy (RMSE) value based on a long time window. The energy value of the height channel may be a short-term time power value with respect to the height channel. More specifically, the energy value of the height channel may be an RMSE value based on a short-term time window. The additional weight parameter identifier 860 may identify the additional weight parameter as a first value when the energy value of the height channel is greater than a predetermined first value, or when the ratio of the energy value of the height channel to the energy value of the surround channel is greater than a predetermined second value. For example, the first value may be 0.
The additional weight parameter identifier 860 may identify the additional weight parameter as a second value when the energy value of the height channel is less than or equal to a predetermined first value, or when the ratio of the energy value of the height channel to the energy value of the surround channel is less than or equal to a predetermined second value. The second value may be 1, but is not limited thereto, and may be a value greater than the first value, such as 0.5.
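A sketch of this decision rule (the window lengths and thresholds are illustrative assumptions; the disclosure fixes only the structure: a short-window RMSE for the height channels, a long-window RMSE for the surround channels, and the two comparisons):

```python
import numpy as np

def rmse(x: np.ndarray) -> float:
    return float(np.sqrt(np.mean(np.square(x))))

def additional_weight(height: np.ndarray, surround: np.ndarray,
                      short_win: int = 2048, long_win: int = 32768,
                      energy_thresh: float = 0.01, ratio_thresh: float = 0.5,
                      first_value: float = 0.0, second_value: float = 1.0) -> float:
    """Return the first value when the short-window height energy, or its
    ratio to the long-window surround energy, exceeds its threshold; return
    the second value otherwise."""
    e_height = rmse(height[-short_win:])
    e_surround = rmse(surround[-long_win:])
    if e_height > energy_thresh or (
            e_surround > 0.0 and e_height / e_surround > ratio_thresh):
        return first_value
    return second_value
```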
The additional weight parameter identifier 860 may identify a weight level of at least one time period of the original audio signal based on weight target ratios within the audio content of the audio signal. For example, when the target ratio of level 0 is 30%, the target ratio of level 1 is 60%, and the target ratio of level 2 is 10%, the additional weight parameter identifier 860 may identify the weight level of at least one time period according to the target ratios. In other words, the additional weight parameter identifier 860 may identify level 0 for a time period in the early portion of the content, level 1 for a time period in the middle portion of the content, and level 2 for a time period in the later portion of the content. In this case, the additional weight parameters corresponding to the respective levels may be identified. When the weight corresponding to each level is constant, a weight discontinuity may occur in the boundary segment between time segments.
The additional weight parameter identifier 860 may determine different weights in the boundary segments between time segments. More specifically, for the weight of the boundary segment between a first time segment and a second time segment, the additional weight parameter identifier 860 may identify a value between the weight of the remaining portion of the first time segment excluding the boundary segment and the weight of the remaining portion of the second time segment excluding the boundary segment. To minimize the weight discontinuity in the boundary segment, the additional weight parameter identifier 860 may identify values between the weights adjacent to the outside of the boundary segment as the weights of the boundary segment. For example, in the boundary segment between the early portion (level 0) and the middle portion (level 1), the value of the level may be increased for each sub-segment (for example, increased by 0.1), and the weight corresponding to the level (for example, the output of a level-based function) may be determined. In this case, the weight corresponding to a level between level 0 and level 1 may be a value between the weight of level 0 and the weight of level 1. As a result, the weight discontinuity may be minimized.
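A sketch of the stepped interpolation across a boundary segment (the sub-segment count is an assumption; the disclosure only requires values strictly between the two adjacent weights):

```python
def boundary_weights(w_prev: float, w_next: float,
                     num_sub_segments: int) -> list:
    """Weights for the sub-segments of a boundary segment: evenly stepped
    values between the weight of the earlier time segment (w_prev) and that
    of the later time segment (w_next), minimizing the discontinuity."""
    step = (w_next - w_prev) / (num_sub_segments + 1)
    return [w_prev + step * i for i in range(1, num_sub_segments + 1)]

# Example: between a level-0 segment (weight 0.0) and a level-1 segment
# (weight 1.0), ten sub-segments yield weights of approximately
# 0.09, 0.18, ..., 0.91.
```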
The down-mix channel audio generator 870 may down-mix the original audio signal according to a predetermined channel layout based on the obtained down-mix weight parameters and the additional weight parameters. The down-mix channel audio generator 870 may generate an audio signal of a predetermined channel layout as a result of the down-mixing.
The down-mix channel audio generator 870 may generate an audio signal of the height channel based on the down-mix weight parameters and the additional weight parameters for mixing from the surround channel to the height channel. In this case, the final weight parameter for mixing from the surround channel to the height channel may be expressed as a result obtained by multiplying the downmix weight parameter by the additional weight parameter.
The additional information generator 885 may generate additional information including additional weight parameters.
Fig. 9A is a block diagram of a structure of a multi-channel audio decoding apparatus according to an embodiment of the present disclosure.
The audio decoding apparatus 900 may include a memory 910 and a processor 930. The audio decoding apparatus 900 may be implemented as a device capable of audio processing, such as a server, TV, camera, mobile phone, tablet PC, laptop, etc.
Although the memory 910 and the processor 930 are separately shown in fig. 9A, the memory 910 and the processor 930 may be implemented by one hardware module (e.g., chip).
Processor 930 may be implemented as a dedicated processor for neural network-based audio processing. Alternatively, processor 930 may be implemented by a combination of a general-purpose processor (e.g., an AP, CPU, or GPU) and software. The special purpose processor may include a memory to implement embodiments of the present disclosure or a memory processor to use external memory.
Processor 930 may include multiple processors. In this case, the processor 930 may be implemented as a combination of dedicated processors, or may be implemented through a combination of software and a plurality of general-purpose processors (such as an AP, a CPU, or a GPU).
Memory 910 may store one or more instructions for audio processing. In an embodiment of the present disclosure, the memory 910 may store a neural network. When the neural network is implemented in the form of a dedicated hardware chip for artificial intelligence or as part of an existing general-purpose processor (e.g., CPU or AP) or a graphics-specific processor (e.g., GPU), the neural network may not be stored in the memory 910. The neural network may be implemented as an external device (e.g., a server). In this case, the audio decoding apparatus 900 may request the neural network-based result information from the external apparatus and receive the neural network-based result information from the external apparatus.
Processor 930 may sequentially process the successive frames according to instructions stored in memory 910 to obtain successive reconstructed frames. Successive frames may refer to frames that make up audio.
The processor 930 may output a multi-channel audio signal by performing an audio processing operation on the input bitstream. The bitstream may be implemented in a scalable form to increase the number of channels from the base channel group. For example, processor 930 may obtain compressed audio signals for a base channel group from a bitstream and may reconstruct audio signals for the base channel group (e.g., stereo channel audio signals) by decompressing the compressed audio signals for the base channel group. In addition, the processor 930 may reconstruct the audio signals of the slave channel group by decompressing the compressed audio signals of the slave channel group from the bitstream. Processor 930 may reconstruct the multi-channel audio signal based on the audio signals of the base channel group and the audio signals of the slave channel group.
Meanwhile, the processor 930 may reconstruct the audio signals of the first slave channel group by decompressing the compressed audio signals of the first slave channel group from the bitstream. Processor 930 may reconstruct the audio signals of the second set of slave channels by decompressing the compressed audio signals of the second set of slave channels.
Processor 930 may reconstruct the multi-channel audio signal for the increased number of channels based on the audio signals of the base channel group and the corresponding audio signals of the first and second slave channel groups. Similarly, the processor 930 may decompress the compressed audio signals of n slave channel groups (where n is an integer greater than 2), and may reconstruct a multi-channel audio signal of a further increased number of channels based on the audio signals of the base channel group and the corresponding audio signals of the n slave channel groups.
Fig. 9B is a block diagram of a structure of an audio decoding apparatus according to an embodiment of the present disclosure.
Referring to fig. 9B, the audio decoding apparatus 900 includes an information acquirer 950 and a multi-channel audio decoder 960. The multi-channel audio decoder 960 includes a decompressor 970 and a multi-channel audio signal reconstructor 980.
The audio decoding apparatus 900 may include the memory 910 and the processor 930 of fig. 9A, and instructions for implementing each of the components 950, 960, 970, 980, 985, 990, and 995 of fig. 9B may be stored in the memory 910. Processor 930 may execute instructions stored in memory 910.
The information acquirer 950 may acquire the basic audio stream and at least one auxiliary audio stream from the bitstream. The basic audio stream may include at least one compressed audio signal of the base channel group. The auxiliary audio stream may include at least one compressed audio signal of at least one slave channel group.
The information acquirer 950 may acquire metadata from the bitstream. The metadata may include additional information. For example, the metadata may include information about the audio scene type of the original audio signal. The information about the audio scene type may be index information indicating one of a plurality of audio scene types. The information about the audio scene type may be obtained for each frame, or may be obtained periodically for various data units. Alternatively, the information about the audio scene type may be obtained aperiodically at every scene change.
The decompressor 970 may obtain the audio signals of the base channel group by decompressing the at least one compressed audio signal of the base channel group included in the basic audio stream. The decompressor 970 may obtain at least one audio signal of the at least one slave channel group by decompressing the at least one compressed audio signal of the at least one slave channel group included in the auxiliary audio stream.
The de-mixing parameter identifier 990 may identify the de-mixing weight parameters based on the information about the audio scene type. That is, the de-mixing parameter identifier 990 may identify one audio scene type from among a plurality of audio scene types based on the index information about the audio scene type, and may identify the de-mixing weight parameters corresponding to the identified audio scene type. The de-mixing weight parameters respectively corresponding to the plurality of audio scene types may be predetermined and stored.
The up-mix channel group audio generator 985 may generate an up-mix channel group audio signal by unmixing at least one audio signal of the base channel group and at least one audio signal of the at least one slave channel group. In this case, the up-mix channel group audio signal may be a multi-channel audio signal.
The multi-channel audio signal outputter 995 may output at least one up-mix channel group audio signal.
Fig. 10 is a block diagram of a structure of an audio decoding apparatus according to an embodiment of the present disclosure.
The audio decoding apparatus 1000 may include an information acquirer 1050 and a multi-channel audio decoder 1060. The multi-channel audio decoder 1060 may include a decompressor 1070 and a multi-channel audio signal reconstructor 1075.
The information acquirer 1050, the decompressor 1070, and the multi-channel audio signal outputter 1095 of fig. 10 may perform the operations of the information acquirer 950, the decompressor 970, and the multi-channel audio signal outputter 995 described above with reference to fig. 9B, respectively. Therefore, descriptions that would repeat those of fig. 9B are omitted.
The information acquirer 1050 may acquire, from the bitstream, additional information including information about the additional de-mixing weight parameters.
The additional de-mixing parameter identifier 1090 may identify the additional de-mixing weight parameters based on the information about the additional de-mixing weight parameters. The additional de-mixing weight parameters may be de-mixing weight parameters corresponding to the weight parameters used for mixing from the surround channels to the height channels. That is, the additional de-mixing parameter identifier 1090 may identify weight parameters for de-mixing from the height channels to the surround channels. However, the present disclosure is not limited thereto, and the additional de-mixing parameter identifier 1090 may identify a range of the additional de-mixing weight parameters, or the values of additional de-mixing weight parameter candidates, based on the information about the audio scene type obtained from the bitstream. The additional de-mixing parameter identifier 1090 may then identify the additional de-mixing weight parameters based on that range or those candidate values. In this case, the information about the additional de-mixing weight parameters may be used.
The up-mix channel group audio generator 1080 may perform a de-mixing of the audio signal according to the de-mixing weight parameters and the additional de-mixing weight parameters. The unmixing may be performed on the audio signals of the base channel group and the audio signals of the dependent channel group. For example, the up-mix channel group audio generator 1080 may perform the de-mixing from the height channels to the surround channels according to the de-mixing weight parameters and the additional weight parameters from the height channels to the surround channels. In the case of unmixing to other channels, the up-mix channel group audio generator 1080 may perform the unmixing according to the unmixed weight parameters without additional weight parameters.
Fig. 11 is a view for describing in detail a process of identifying an audio scene type by the audio encoding apparatus 700 according to an embodiment of the present disclosure.
Referring to fig. 11, the audio encoding apparatus 700 may obtain (step 1100) an audio signal of a center channel from an original audio signal.
The audio encoding apparatus 700 may calculate a probability value of a category of at least one dialog type by using a first neural network for identifying a dialog type (step 1110). The first neural network may receive the audio signal of the center channel as an input.
The audio encoding apparatus 700 may identify (step 1120) whether the probability value Pdialog of the category of the first dialog type is greater than the threshold Thdialog of the first dialog type.
When the probability value Pdialog of the category of the first dialog type is greater than the threshold Thdialog of the category of the first dialog type, the audio encoding apparatus 700 may identify the first dialog type as the dialog type.
When the probability value Pdialog of the category of the first dialog type is less than or equal to the threshold Thdialog of the category of the first dialog type, the audio encoding apparatus 700 may proceed to identify the sound effect type. However, the present disclosure is not limited thereto, and the audio encoding apparatus 700 may compare the probability value of each category with the threshold value of each category and identify at least one dialog type from among a plurality of dialog type categories. In this case, one dialog type may be identified according to priority, or the dialog type having the highest probability value may be identified. When the dialog does not correspond to any of the plurality of dialog types (i.e., when the dialog is of the default type), the audio encoding apparatus 700 may then identify the sound effect type.
Hereinafter, a process in which the audio encoding apparatus 700 recognizes a sound effect type will be described.
The audio encoding apparatus 700 may obtain (step 1130) an audio signal of a front channel and an audio signal of a side channel from the original audio signal.
The audio encoding apparatus 700 may calculate a probability value of a category of at least one sound effect type by using a second neural network for identifying a sound effect type (step 1140). The second neural network may receive the audio signal of the front channel and the audio signal of the side channel as inputs. Sound effects may be included in audio content such as games or movies, and may be directional or spatially moving sounds.
The audio encoding apparatus 700 may identify (step 1150) whether a probability value P_effect of the category of the first sound effect type is greater than a threshold Th_effect of the first sound effect type.
When the probability value P_effect of the category of the first sound effect type is greater than the threshold Th_effect of the first sound effect type, the audio encoding apparatus 700 may identify the first sound effect type as the sound effect type.
When the probability value P_effect of the category of the first sound effect type is less than or equal to the threshold Th_effect of the first sound effect type, the audio encoding apparatus 700 may identify the default type. However, the present disclosure is not limited thereto, and the audio encoding apparatus 700 may compare the probability value of each category with the threshold of each category and identify at least one sound effect type among a plurality of sound effect type categories (e.g., a category of a first sound effect type, a category of a second sound effect type, ..., and a category of an n-th sound effect type).
In this case, one sound effect type may be identified, or the sound effect type of the highest probability value may be identified, according to the priority. When the sound effect does not correspond to any of the plurality of sound effect types, the audio encoding apparatus 700 may identify a default type.
However, the present disclosure is not limited thereto, and various audio scene types, such as a music type and a sports/crowd type, may be identified in addition to the dialog type and the sound effect type. The music type may be a type of audio scene having balanced sound among the audio channels. The sports/crowd type may be a type of audio scene containing the sound of a large crowd in a cheering atmosphere or clear commentary. Here, the default type may be a type that is identified when no specific audio scene type is recognized. The various audio scene types may be identified by using separate neural networks, and the neural network used to identify each audio scene type may be trained separately.
In fig. 11, the dialog type is first identified, and then the sound effect type is identified. However, the present disclosure is not limited thereto, and a sound effect type may be first identified, and then, a dialog type may be identified. When other audio scene types exist, the types of the respective audio scene types may be identified according to priorities among the audio scene types.
Fig. 12 is a view for describing a first Deep Neural Network (DNN) 1200 for identifying a dialog type according to an embodiment of the present disclosure.
The first DNN 1200 may include at least one convolution layer, a pooling layer, and a full connection layer. The convolution layer obtains feature data by processing the input data using a filter having a predefined size. The parameters of the filter of the convolutional layer may be optimized by a training process to be described below. The pooling layer may be a layer for selecting and outputting only the feature values of some samples from among the feature values of all samples of the feature data to reduce the size of the input data. The pooling layer may include a maximum pooling layer and an average pooling layer. A fully connected layer is a layer for classifying features, where each neuron of one layer is connected to each neuron of the next layer.
Referring to fig. 12, preprocessing is performed on the audio signal 1201 of the center channel (steps 1202-1204), and then the preprocessed audio signal 1205 of the center channel is input to the first DNN 1200.
First, root mean square (RMS) normalization is performed on the audio signal 1201 of the center channel (step 1202). Because the energy of each sound source is different, the energy value of the audio signal is normalized to a specific standard. When the number of samples is N, the audio signal 1201 of the center channel may be a one-dimensional signal of size N×1. For example, the audio signal 1201 of the center channel may be a one-dimensional signal of size 8640×1. To reduce the amount of computation, the audio signal 1201 of the center channel may be downsampled before the RMS normalization (step 1202) is performed.
Next, a short-time frequency transform is performed on the RMS-normalized audio signal (step 1203). The one-dimensional input signal in units of time is output as a two-dimensional signal in units of time and frequency. The two-dimensional signal in units of time and frequency may be a two-dimensional signal of size X×Y×1. For example, the audio signal of the center channel on which the short-time frequency transform has been performed may be a two-dimensional signal of size 68×127×1.
The output signal obtained by performing the short-time frequency transform is a complex signal (a+jb) having a real part and an imaginary part. Since it is difficult to use complex numbers directly, the absolute value of the complex signal, √(a²+b²), is used instead.
A Mel-scale transformation is then applied to the two-dimensional signal in units of time and frequency (step 1204). The mel scale reflects the characteristic that human hearing is cognitively sensitive to changes in low-frequency signals and relatively insensitive to changes in high-frequency signals; it is an operation of rescaling the data on the frequency axis so that the signal data to which humans are cognitively more sensitive is represented more finely. As a result, the output may be a two-dimensional signal of size X×Y′×1 with reduced frequency-axis data. For example, the Mel-scaled audio signal of the center channel may be a two-dimensional signal of size 68×68×1.
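Putting steps 1202-1204 together, a minimal Python sketch of the preprocessing chain is given below. The concrete numbers follow the examples above (8640 input samples, a 68×127 time-frequency signal, 68 mel bands), while the FFT size, hop length, sampling rate, and the use of the librosa library are assumptions chosen only so the shapes work out.

```python
import numpy as np
import librosa

def preprocess_center_channel(x: np.ndarray, sr: int = 16000) -> np.ndarray:
    """RMS-normalize, short-time-transform, and mel-scale a center-channel
    signal into the 2-D input expected by the first DNN 1200."""
    # Step 1202: RMS normalization, so every sound source has comparable energy.
    x = x / (np.sqrt(np.mean(x ** 2)) + 1e-12)

    # Step 1203: short-time frequency transform; only the magnitude
    # sqrt(a^2 + b^2) of the complex output (a + jb) is kept.
    spec = np.abs(librosa.stft(x, n_fft=252, hop_length=128))  # (127, ~68)

    # Step 1204: rescale the frequency axis to 68 mel bands, emphasizing the
    # low-frequency region that human hearing resolves more finely.
    mel_fb = librosa.filters.mel(sr=sr, n_fft=252, n_mels=68)  # (68, 127)
    mel_spec = mel_fb @ spec                                   # (68, ~68)
    return mel_spec[np.newaxis, ...]  # (1, 68, ~68): channel, freq, time
```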
Referring to fig. 12, the preprocessed signal 1205 of the center channel is input to the first DNN 1200. The pre-processed audio signal 1205 of the center channel includes samples divided in time and frequency. That is, the preprocessed audio signal 1205 of the center channel may be two-dimensional sample data. Each sample of the center channel's preprocessed audio signal 1205 has a characteristic value of a particular frequency at a particular time.
A first convolution layer 1220 including c filters of size a×b processes the preprocessed audio signal 1205 of the center channel. For example, as a result of the processing of the first convolution layer 1220, a first intermediate signal 1206 of size (68, c) may be obtained. In this case, the first convolution layer 1220 may include a plurality of convolution layers, and the input of a first layer and the output of a second layer may be connected to each other for training. The first layer and the second layer may be the same layer. However, the present disclosure is not limited thereto, and the second layer may be a layer subsequent to the first layer. When the second layer is a layer subsequent to the first layer, the activation function of the first layer may be a rectified linear unit (ReLU).
Pooling of the first intermediate signal 1206 may be performed by using a first pooling layer 1230. For example, as a result of the processing of the first pooling layer 1230, a second intermediate signal 1207 of size (34, c) may be obtained.
The second convolution layer 1240 processes the input signal with f filters of size d×e. As a result of the processing of the second convolution layer 1240, a third intermediate signal 1208 of size (17, f) may be obtained.
Pooling may be performed on the third intermediate signal 1208 by using the second pooling layer 1250. For example, as a result of the processing of the second pooling layer 1250, a fourth intermediate signal 1209 of size (9, f) may be obtained.
The first fully connected layer 1260 may output a one-dimensional feature signal by classifying the input feature signals. As a result of the processing of the first fully connected layer 1260, an audio feature signal 1210 of size (1, N) may be obtained. Here, N may represent the number of categories. The categories may correspond to respective dialog types.
The first DNN 1200 according to an embodiment of the present disclosure obtains an audio feature signal 1210 (e.g., a probability signal) from the audio signal 1201 of the center channel.
In fig. 12, the first DNN 1200 includes two convolution layers, two pooling layers, and one fully connected layer. However, this is merely an example, and the number of convolution layers, pooling layers, and fully connected layers included in the first DNN 1200 may be variously modified as long as the audio feature signal 1210 of N categories can be obtained from the audio signal 1201 of the center channel. Also, the number and size of the filters used in each convolution layer may be variously modified, and the connection method between the layers may be variously modified.
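For concreteness, a minimal PyTorch sketch with this overall shape (two convolution layers, two pooling layers, one fully connected layer) is shown below. The filter counts, kernel sizes, pooling type, and the use of a sigmoid to produce per-category probability values are assumptions for illustration, not the trained network of the present disclosure.

```python
import torch
import torch.nn as nn

class DialogTypeDNN(nn.Module):
    """Conv -> pool -> conv -> pool -> fully connected, as in Fig. 12."""

    def __init__(self, n_classes: int, c: int = 16, f: int = 32):
        super().__init__()
        self.conv1 = nn.Conv2d(1, c, kernel_size=3, padding=1)  # c filters (a x b)
        self.pool1 = nn.MaxPool2d(2)                            # 68 -> 34
        self.conv2 = nn.Conv2d(c, f, kernel_size=3, padding=1)  # f filters (d x e)
        self.pool2 = nn.MaxPool2d(2)                            # 34 -> 17
        self.fc = nn.LazyLinear(n_classes)                      # (1, N) feature signal
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, 68, 68) mel spectrogram of the center channel
        h = self.pool1(self.act(self.conv1(x)))
        h = self.pool2(self.act(self.conv2(h)))
        # Per-category probability values, each compared against its threshold.
        return torch.sigmoid(self.fc(h.flatten(start_dim=1)))

# Example: probs = DialogTypeDNN(n_classes=4)(torch.randn(1, 1, 68, 68))
```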
Fig. 13 is a view for describing a second DNN 1300 for identifying a sound effect type according to an embodiment of the present disclosure.
The second DNN 1300 may include at least one convolution layer, a pooling layer, and a full connection layer. The convolution layer obtains feature data by processing the input data with a filter of a predefined size. The parameters of the filter of the convolutional layer may be optimized by a training process to be described below. The pooling layer is a layer for selecting and outputting only the feature values of some samples from among the feature values of all samples of the feature data to reduce the size of the input data, and may include a maximum pooling layer and an average pooling layer. A fully connected layer is a layer for classifying features, where each neuron of one layer is connected to each neuron of the next layer.
Referring to fig. 13, preprocessing is performed on the front/side/height channel audio signal 1301 (steps 1302-1304), and then the preprocessed audio signal is input to the second DNN 1300. The preprocessing process of the front/side/height channel audio signal 1301 is similar to that of fig. 12, and thus, a detailed description thereof will be omitted.
Referring to fig. 13, the preprocessed audio signal 1305 of the front/side/height channels is input to the second DNN 1300. The preprocessed audio signal 1305 of the front/side/height channels includes samples divided in channel, time, and frequency. That is, the preprocessed audio signal 1305 of the front/side/height channels may be three-dimensional sample data. Each sample of the preprocessed audio signal 1305 of the front/side/height channels has a feature value of a specific frequency at a specific time.
The first convolution layer 1320 includes c filters of size a×b and processes the preprocessed audio signal 1305 of the front/side/height channels. For example, as a result of the processing of the first convolution layer 1320, a first intermediate signal 1306 of size (68, c) may be obtained. In this case, the first convolution layer 1320 may include a plurality of convolution layers, and the input of a first layer and the output of a second layer may be connected to each other for training. The first layer and the second layer may be the same layer, but are not limited thereto, and the second layer may be a layer subsequent to the first layer. When the second layer is a layer subsequent to the first layer, the activation function of the first layer may be a rectified linear unit (ReLU).
Pooling of the first intermediate signal 1306 may be performed by using a first pooling layer 1330. For example, as a result of the processing of the first pooling layer 1330, a second intermediate signal 1307 of size (34, c) may be obtained.
The second convolution layer 1340 processes the input signal with f filters of size d×e. As a result of the processing of the second convolution layer 1340, a third intermediate signal 1308 of size (17, f) may be obtained.
Pooling may be performed on the third intermediate signal 1308 by using the second pooling layer 1350. For example, as a result of the processing of the second pooling layer 1350, a fourth intermediate signal 1309 of size (9, f) may be obtained.
The first fully connected layer 1360 may output a one-dimensional feature signal by classifying the input feature signals. As a result of the processing of the first fully connected layer 1360, an audio feature signal 1310 of size (1, N) may be obtained. Here, N may represent the number of categories. The categories may correspond to respective sound effect types.
The second DNN 1300 according to an embodiment of the present disclosure obtains an audio feature signal 1310 (e.g., a probability signal) from the front/side/height channel audio signal 1301.
In fig. 13, the second DNN 1300 includes two convolution layers, two pooling layers, and one fully connected layer. However, this is only an example, and the number of convolution layers, pooling layers, and fully connected layers included in the second DNN 1300 may be variously modified as long as the audio feature signal 1310 of N categories can be obtained from the audio signal 1301 of the front/side/height channels. Also, the number and size of the filters used in each convolution layer may be variously modified, and the connection method between the layers may be variously modified.
Fig. 14 is a view for describing in detail a process of identifying, by the audio encoding apparatus 800, additional weight parameters for mixing from the surround channels to the height channels, according to an embodiment of the present disclosure.
Referring to fig. 14, the audio encoding apparatus 800 may obtain (step 1400) a height channel audio signal from an original audio signal. The audio encoding apparatus 800 may perform energy analysis on the audio signal of the height channel (step 1410).
The energy analysis may be performed by using a neural network for energy analysis (step 1410). In this case, additional weights (first weights) for mixing from the surround channels to the height channels may be identified by using a neural network for energy analysis based on the audio signal of the height channel.
The audio encoding apparatus 800 may identify (step 1420) whether a power value E_hgt of the audio signal of the height channel is greater than a threshold Th_hgt1. In this case, the power value is the RMS value of the signal, and may be a short-term power value (an average power value over a short time window).
When E_hgt is identified as being greater than the threshold Th_hgt1, the audio encoding apparatus 800 may identify the additional weight (first weight) for mixing from the surround channels to the height channels. For example, the first weight may be 0, but is not limited thereto, and the first weight may be a value less than 1.
When the power value E_hgt of the audio signal of the height channel is less than or equal to the threshold Th_hgt1, the audio encoding apparatus 800 may perform energy analysis on the audio signal of the surround channels (step 1440). The energy analysis may be performed by using a neural network for energy analysis (step 1440).
In this case, additional weights (first weights or second weights) for mixing from the surround channel to the height channel may be identified by using a neural network for energy analysis based on the audio signal of the height channel and the audio signal of the surround channel.
The audio encoding apparatus 800 may obtain (step 1430) an audio signal of a surround channel from the original audio signal. The audio encoding apparatus 800 may perform energy analysis on the audio signals of the surround channels (step 1440).
The audio encoding apparatus 800 may identify (step 1450) whether the difference between the power value E_hgt of the audio signal of the height channel and the power value E_srd of the audio signal of the surround channels is greater than a threshold Th_hgt2. In this case, the power value E_srd is an RMS value, and may be a moving average of the total power (an average power value over a long time window).
When the difference between the power value E_hgt of the audio signal of the height channel and the power value E_srd of the audio signal of the surround channels is greater than the threshold Th_hgt2, the audio encoding apparatus 800 may identify the additional weight (first weight) for mixing from the surround channels to the height channels.
When the difference between the power value E_hgt of the audio signal of the height channel and the power value E_srd of the audio signal of the surround channels is less than or equal to the threshold Th_hgt2, the audio encoding apparatus 800 may identify the additional weight (second weight) for mixing from the surround channels to the height channels. In this case, the second weight has a value greater than 0, and may have a value greater than the first weight. For example, the second weight may be one of 0.5, 0.75, and 1.
Above, the audio encoding apparatus 800 compares the difference between the power value E_hgt of the audio signal of the height channel and the power value E_srd of the audio signal of the surround channels with the threshold Th_hgt2. However, the present disclosure is not limited thereto, and this operation may be replaced with an operation of comparing the ratio of the power value E_hgt of the audio signal of the height channel to the power value E_srd of the audio signal of the surround channels with a threshold.
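The decision flow of Fig. 14 can be summarized by the following Python sketch. The plain RMS power estimate, the threshold names, and the concrete default weights are assumptions; the disclosure itself only fixes, for example, the first weight at 0 and the second weight at one of 0.5, 0.75, and 1.

```python
import numpy as np

def rms(x: np.ndarray) -> float:
    """Plain RMS power estimate (the patent uses short- and long-window
    averages; a single window is used here for brevity)."""
    return float(np.sqrt(np.mean(x ** 2)))

def additional_weight(height: np.ndarray, surround: np.ndarray,
                      th1: float, th2: float,
                      w1: float = 0.0, w2: float = 0.5) -> float:
    """Pick the additional surround-to-height mixing weight per Fig. 14."""
    e_hgt = rms(height)
    if e_hgt > th1:
        # Height channels already active: keep surround content out of them.
        return w1
    e_srd = rms(surround)
    if e_hgt - e_srd > th2:
        return w1
    # Height channels quiet relative to surround: mix some surround up.
    return w2
```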
Fig. 15 is a view for describing in detail another process of identifying, by the audio encoding apparatus 800, additional weight parameters for mixing from the surround channels to the height channels, according to an embodiment of the present disclosure.
Referring to fig. 15, the audio encoding apparatus 800 may obtain (step 1500) a height channel audio signal and an all channel audio signal from an original audio signal.
The audio encoding apparatus 800 may obtain a power value E_hgt by performing energy analysis on the audio signal of the height channel (step 1510). In addition, the audio encoding apparatus 800 may obtain a power value E_total by performing energy analysis on the audio signals of all channels (step 1510). Here, the power value E_hgt may be an average power value (RMS value) over a short time window, and E_total may be an average power value (RMS value) over a long time window.
The audio encoding apparatus 800 may identify (step 1520) whether the ratio (E_hgt/E_total) of the power value E_hgt of the audio signal of the height channel to the power value E_total of the audio signals of all channels is greater than a threshold Th_hgt1.
When the ratio (E_hgt/E_total) of the power value E_hgt of the audio signal of the height channel to the power value E_total of the audio signals of all channels is identified as being greater than the threshold Th_hgt1, the audio encoding apparatus 800 may identify the additional weight (first weight) for mixing from the surround channels to the height channels. For example, the first weight may be 0, but is not limited thereto, and may be less than 1.
When the ratio (E_hgt/E_total) of the power value E_hgt of the audio signal of the height channel to the power value E_total of the audio signals of all channels is identified as being less than or equal to the threshold Th_hgt1, the audio encoding apparatus 800 may perform energy analysis on the audio signal of the surround channels (step 1540). The energy analysis (step 1540) may be performed by using a neural network for energy analysis.
The audio encoding apparatus 800 may obtain (step 1530) an audio signal of a surround channel from an original audio signal. The audio encoding apparatus 800 may perform energy analysis on the audio signals of the surround channels (step 1540).
The audio encoding apparatus 800 may identify (step 1550) whether the ratio (E_hgt/E_srd) of the power value E_hgt of the audio signal of the height channel to the power value E_srd of the audio signal of the surround channels is greater than a threshold Th_hgt2. In this case, the power value E_srd is an RMS value, and may be a moving average of the total power (an average power value over a long time window).
When the ratio (E_hgt/E_srd) of the power value E_hgt of the audio signal of the height channel to the power value E_srd of the audio signal of the surround channels is greater than the threshold Th_hgt2, the audio encoding apparatus 800 may identify the additional weight (first weight) for mixing from the surround channels to the height channels.
When the ratio (E_hgt/E_srd) of the power value E_hgt of the audio signal of the height channel to the power value E_srd of the audio signal of the surround channels is less than or equal to the threshold Th_hgt2, the audio encoding apparatus 800 may identify the additional weight (second weight) for mixing from the surround channels to the height channels. In this case, the second weight may be greater than 0, and may be greater than the first weight.
Above, the audio encoding apparatus 800 performs an operation of comparing the ratio of the power value E_hgt of the audio signal of the height channel to the power value E_total of the audio signals of all channels with the threshold Th_hgt1, and an operation of comparing the ratio of the power value E_hgt of the audio signal of the height channel to the power value E_srd of the audio signal of the surround channels with the threshold Th_hgt2. However, the present disclosure is not limited thereto, and these operations may be replaced with operations of comparing a difference of power values, rather than a ratio of power values, with a threshold.
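For completeness, the ratio-based flow of Fig. 15 differs from the sketch above only in the comparisons; a condensed sketch under the same assumptions follows.

```python
def additional_weight_ratio(e_hgt: float, e_total: float, e_srd: float,
                            th1: float, th2: float,
                            w1: float = 0.0, w2: float = 0.5) -> float:
    """Fig. 15 variant of the sketch above: compare power ratios instead of
    a power difference. Inputs are pre-computed RMS power values."""
    eps = 1e-12  # guard against silent channels
    if e_hgt / (e_total + eps) > th1:  # height already prominent in the mix
        return w1
    if e_hgt / (e_srd + eps) > th2:    # height strong relative to surround
        return w1
    return w2                          # otherwise mix surround up with w2
```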
Fig. 16 is a flowchart of a method of processing audio according to an embodiment of the present disclosure.
In operation S1605, the audio encoding apparatus 800 may identify the movement and direction of the sound source object based on the correlation and delay between channels of the audio signal including at least one frame.
In operation S1610, the audio encoding apparatus 800 may identify the type and characteristics of the sound source object from the audio signal including at least one frame by using an object estimation probability model based on a Gaussian mixture model.
In operation S1615, the audio encoding apparatus 800 may identify additional weight parameters for mixing from the surround channels to the height channels based on at least one of the movement, direction, type, or characteristics of the sound source object.
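A rough sketch of these two analysis steps is given below. The use of cross-correlation for delay estimation, the use of scikit-learn's GaussianMixture with one pre-trained model per object type, and the feature layout are all assumptions for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def inter_channel_delay(a: np.ndarray, b: np.ndarray) -> int:
    """Estimate the delay in samples between two channels from the peak of
    their cross-correlation; together with the correlation itself, this
    hints at the movement and direction of a sound source (S1605)."""
    corr = np.correlate(a, b, mode="full")
    return int(np.argmax(corr)) - (len(b) - 1)

def classify_source(features: np.ndarray, gmms: dict) -> str:
    """Score frame features (shape: n_frames x n_features) against one
    pre-trained GaussianMixture per object type and return the most
    likely type (S1610)."""
    return max(gmms, key=lambda name: gmms[name].score(features))
```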
Fig. 17A is a flowchart of a method of processing audio according to an embodiment of the present disclosure.
In operation S1702, the audio encoding apparatus 700 may identify an audio scene type of an original audio signal.
In operation S1704, the audio encoding apparatus 700 may down-mix an original audio signal according to a predetermined channel layout based on the identified audio scene type.
In operation S1706, the audio encoding apparatus 700 may obtain at least one audio signal of a base channel group and at least one audio signal of at least one dependent channel group from audio signals of the predetermined channel layout.
In operation S1708, the audio encoding apparatus 700 may generate at least one compressed audio signal of the base channel group by compressing the at least one audio signal of the base channel group.
In operation S1710, the audio encoding apparatus 700 may generate at least one compressed audio signal of the at least one dependent channel group by compressing the at least one audio signal of the at least one dependent channel group.
In operation S1712, the audio encoding apparatus 700 may generate a bitstream including the at least one compressed audio signal of the base channel group and the at least one compressed audio signal of the at least one dependent channel group. The audio encoding apparatus 700 may generate a bitstream further including information about the audio scene type.
Fig. 17B is a flowchart of a method of processing audio according to an embodiment of the present disclosure.
In operation S1722, the audio encoding apparatus 800 may identify an energy value of a height channel from an original audio signal.
In operation S1724, the audio encoding apparatus 800 may identify energy values of the surround channels from the original audio signal.
In operation S1726, the audio encoding apparatus 800 may identify additional weights for mixing from the surround channel to the height channel based on the identified energy value of the height channel and the identified energy value of the surround channel.
In operation S1728, the audio encoding apparatus 800 may down-mix the original audio signal according to a predetermined channel layout based on the additional weights.
In operation S1730, the audio encoding apparatus 800 may obtain at least one audio signal of a base channel group and at least one audio signal of a dependent channel group from audio signals of the predetermined channel layout.
In operation S1732, the audio encoding apparatus 800 may generate at least one compressed audio signal of the base channel group by compressing the at least one audio signal of the base channel group.
In operation S1734, the audio encoding apparatus 800 may generate at least one compressed audio signal of the at least one dependent channel group by compressing the at least one audio signal of the at least one dependent channel group.
In operation S1736, the audio encoding apparatus 800 may generate a bitstream including the at least one compressed audio signal of the base channel group and the at least one compressed audio signal of the at least one dependent channel group. The audio encoding apparatus 800 may generate a bitstream further including information about the identified additional weights. More specifically, the audio encoding apparatus 800 may generate a bitstream further including weights for de-mixing that correspond to the additional weights for mixing. The weights for de-mixing may be weights for de-mixing from the height channels to the surround channels.
Fig. 17C is a flowchart of a method of processing audio according to an embodiment of the present disclosure.
In operation S1742, the audio encoding apparatus 700 may identify an audio scene type of an audio signal including at least one frame.
In operation S1744, the audio encoding apparatus 700 may determine the downmix related information to correspond to the audio scene type in units of frames.
In operation S1746, the audio encoding apparatus 700 may down-mix an audio signal including at least one frame by using the down-mix related information determined in units of frames.
In operation S1748, the audio encoding apparatus 700 may transmit the downmix audio signal and the downmix related information determined in units of frames.
Fig. 17D is a flowchart of a method of processing audio according to an embodiment of the present disclosure.
In operation S1752, the audio encoding apparatus 700 may identify an audio scene type of an audio signal including at least one frame.
In operation S1754, the audio encoding apparatus 700 may determine the downmix related information to correspond to the audio scene type in units of frames.
In operation S1756, the audio encoding apparatus 700 may down-mix an audio signal including at least one frame by using the down-mix related information.
In operation S1758, the audio encoding apparatus 700 may generate flag information indicating whether the audio scene type of the previous frame is the same as the audio scene type of the current frame, based on the audio scene type of the previous frame and the audio scene type of the current frame.
According to an embodiment, when the audio scene type of the previous frame is the same as the audio scene type of the current frame, the audio encoding apparatus 700 may generate flag information indicating that the audio scene type of the previous frame is the same as the audio scene type of the current frame.
When the audio scene type of the previous frame is different from the audio scene type of the current frame, the audio encoding apparatus 700 may not generate the flag information. Since the flag information is not generated, the flag information may not be transmitted.
According to the embodiment, when the audio scene type of the previous frame is the same as the audio scene type of the current frame, the audio encoding apparatus 700 may not generate flag information, and may not transmit the flag information because the flag information is not generated.
The audio encoding apparatus 700 may generate flag information when the audio scene type of the previous frame is different from the audio scene type of the current frame.
In operation S1760, the audio encoding apparatus 700 may transmit at least one of the downmix audio signal, the flag information, or the downmix related information.
According to an embodiment, when the audio scene type of the previous frame is the same as the audio scene type of the current frame, the audio encoding apparatus 700 may transmit the downmix audio signal and flag information indicating that the audio scene type of the previous frame is the same as the audio scene type of the current frame. In this case, the downmix related information of the current frame may not be additionally transmitted.
When the audio scene type of the previous frame is different from that of the current frame, the audio encoding apparatus 700 may transmit the downmix audio signal and the downmix related information of the current frame. The flag information may not be additionally transmitted.
In general, when the audio scene type of the current frame is the same as the audio scene type of the previous frame, the flag information and the downmix related information of the current frame may not be transmitted.
When the audio scene type of the current frame is different from the audio scene type of the previous frame, the flag information and the downmix related information of the current frame may be transmitted.
However, the present disclosure is not limited to an example of selectively transmitting flag information, and the audio encoding apparatus 700 may transmit flag information regardless of whether the audio scene type of the previous frame is the same as the audio scene type of the current frame.
Meanwhile, when all of the frames included in a data unit higher than the frame level have the same audio scene type, the flag information may be generated and transmitted for the higher data unit. In this case, the downmix related information is not transmitted for each frame; instead, the downmix related information may be transmitted for the higher data unit.
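The per-frame signaling described above amounts to the following encoder-side logic. The packet layout, the field names, and the callable helpers classify_scene and downmix_info_for are hypothetical names introduced only for this sketch.

```python
def signal_downmix_info(frames, classify_scene, downmix_info_for):
    """Per-frame signaling: when the audio scene type is unchanged, send only
    a 'same as previous frame' flag; otherwise send the new downmix related
    information for the current frame (and no flag)."""
    packets = []
    prev_type = None
    for frame in frames:
        scene_type = classify_scene(frame)
        if scene_type == prev_type:
            packets.append({"flag_same_as_prev": True})
        else:
            # Scene type changed (or first frame): transmit fresh info.
            packets.append({"downmix_info": downmix_info_for(scene_type)})
        prev_type = scene_type
    return packets
```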
Fig. 18A is a flowchart of a method of processing audio according to an embodiment of the present disclosure.
In operation S1802, the audio decoding apparatus 900 may obtain at least one compressed audio signal of a basic channel group from a bitstream.
In operation S1804, the audio decoding apparatus 900 may obtain at least one compressed audio signal of at least one dependent channel group from the bitstream.
In operation S1806, the audio decoding apparatus 900 may obtain information indicating the type of audio scene from the bitstream.
In operation S1808, the audio decoding apparatus 900 may reconstruct the audio signals of the basic channel group by decompressing at least one compressed audio signal of the basic channel group.
In operation S1810, the audio decoding apparatus 900 may reconstruct at least one audio signal of the at least one dependent channel group by decompressing the at least one compressed audio signal of the at least one dependent channel group.
In operation S1812, the audio decoding apparatus 900 may identify at least one down-mix weight parameter corresponding to an audio scene type.
In operation S1814, the audio decoding apparatus 900 may generate an audio signal of an up-mix channel group by using the at least one down-mix weight parameter, based on the at least one audio signal of the base channel group and the at least one audio signal of the at least one dependent channel group.
Fig. 18B is a flowchart of a method of processing audio according to an embodiment of the present disclosure.
In operation S1822, the audio decoding apparatus 1000 may obtain at least one compressed audio signal of a base channel group from a bitstream.
In operation S1824, the audio decoding apparatus 1000 may obtain at least one compressed audio signal of at least one dependent channel group from the bitstream.
In operation S1826, the audio decoding apparatus 1000 may obtain, from the bitstream, information regarding additional weights for de-mixing from the height channels to the surround channels.
In operation S1828, the audio decoding apparatus 1000 may reconstruct the audio signal of the base channel group by decompressing at least one compressed audio signal of the base channel group.
In operation S1830, the audio decoding apparatus 1000 may reconstruct at least one audio signal of the at least one dependent channel group by decompressing the at least one compressed audio signal of the at least one dependent channel group.
In operation S1832, the audio decoding apparatus 1000 may generate an audio signal of an up-mix channel group by using at least one down-mix weight parameter and information on additional weights based on at least one audio signal of a base channel group and at least one audio signal of at least one dependent channel group.
Fig. 18C is a flowchart of a method of processing audio according to an embodiment of the present disclosure.
In operation S1842, the audio decoding apparatus 900 may obtain a down-mixed audio signal from the bitstream.
In operation S1844, the audio decoding apparatus 900 may obtain the downmix related information from the bitstream. The downmix related information may be information generated in units of frames by using an audio scene type.
In operation S1846, the audio decoding apparatus 900 may unmixe the downmix audio signal by using the downmix related information generated in units of frames.
In operation S1848, the audio decoding apparatus 900 may reconstruct an audio signal including at least one frame based on the unmixed audio signal.
Fig. 18D is a flowchart of a method of processing audio according to an embodiment of the present disclosure.
In operation S1852, the audio decoding apparatus 900 may obtain a down-mixed audio signal from the bitstream.
In operation S1854, the audio decoding apparatus 900 may obtain flag information indicating whether the audio scene type of the previous frame is the same as the audio scene type of the current frame from the bitstream. According to circumstances, the audio decoding apparatus 900 may not obtain flag information from the bitstream and may derive the flag information.
In operation S1856, the audio decoding apparatus 900 may obtain downmix related information regarding the current frame based on the flag information.
For example, when the flag information indicates that the audio scene type of the previous frame is the same as the audio scene type of the current frame, the audio decoding apparatus 900 may obtain the downmix related information regarding the current frame based on the downmix related information regarding the previous frame. The audio decoding apparatus 900 may not obtain the downmix related information regarding the current frame from the bitstream.
When the flag information indicates that the audio scene type of the previous frame is not the same as the audio scene type of the current frame, the audio decoding apparatus 900 may obtain the downmix related information regarding the current frame from the bitstream.
In operation S1858, the audio decoding apparatus 900 may unmixe the downmix audio signal by using the downmix related information regarding the current frame.
In operation S1860, the audio decoding apparatus 900 may reconstruct an audio signal including at least one frame based on the unmixed audio signal.
Above, the audio decoding apparatuses 900 and 1000 perform the operation of de-mixing the downmix audio signal by using the downmix related information generated in units of frames. However, an audio signal in a channel layout higher than the output channel layout (e.g., a 7.1.4 channel layout) may be reconstructed first. That is, the audio signal in the output channel layout may not be reconstructed by de-mixing alone.
In this case, the audio decoding apparatuses 900 and 1000 may reconstruct the audio signal in the output channel layout by down-mixing the reconstructed audio signal in the higher channel layout using the downmix related information generated in units of frames. As a result, the downmix related information received from the audio encoding apparatuses 700 and 800 is not limited to being used in the de-mixing operation of the audio decoding apparatuses 900 and 1000, but may also be used in a down-mixing operation, depending on the case.
However, the flag information is not limited to being transmitted in units of frames, and the downmix related information may be signaled for a higher audio data unit (e.g., a parameter sampling unit) including k frames (where k is an integer greater than 1). In this case, the information about the size of the higher audio data unit and the downmix related information for the higher audio data unit may be signaled through the bitstream. The information about the size of the higher audio data unit may be information about the value of k.
When the downmix related information is signaled for the higher audio data unit, the downmix related information may not be obtained in units of the frames included in that data unit. For example, the downmix related information may be obtained in the first frame included in the higher audio data unit, but may not be obtained in the frames subsequent to the first frame of the higher audio data unit.
At the same time, the flag may be obtained in a frame following the first frame of the higher audio data unit.
Based on the flag, when it is recognized that the audio scene type of the previous frame is not identical to the audio scene type of the current frame, the downmix related information may be additionally obtained. The downmix related information updated by the flag can be used in frames subsequent to the frame in which the flag is obtained in the higher audio data unit.
Meanwhile, when the audio scene type of the current frame is the same as the audio scene type of the previous frame, the flag of the current frame is not obtained, and the previously obtained downmix related information may be used.
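Mirroring this on the decoding side, the flag-driven parsing of operations S1854-S1858, together with the reuse of previously obtained information, can be sketched as follows; the packet layout matches the hypothetical encoder sketch above and is likewise an assumption.

```python
def parse_downmix_info(packets):
    """Recover per-frame downmix related information: reuse the previous
    frame's info when the flag indicates the audio scene type is unchanged,
    otherwise read fresh info from the packet (the first packet always
    carries it in this sketch)."""
    infos, current = [], None
    for packet in packets:
        if not packet.get("flag_same_as_prev"):
            current = packet["downmix_info"]
        infos.append(current)
    return infos
```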
According to the embodiments of the present disclosure, the original sound effect can be maintained through an appropriate down-mixing or up-mixing process according to the type of audio scene.
According to embodiments of the present disclosure, audio signals may be dynamically mixed so that the audio of the surround channels and the audio of the height channels can be well presented on a large screen. That is, when the reproduced audio is concentrated in the surround channels, the audio signals of the surround channels Ls and Rs can be distributed not only to the L/R channels but also to the height channels, thereby maximizing the surround effect. Alternatively, by mixing the audio signals of the surround channels Ls and Rs to the L/R channels instead of the height channels, horizontal sound and vertical sound can be distinguished, so that the surround effect and the height effect can be expressed simultaneously in a balanced manner.
Meanwhile, the above-described embodiments of the present disclosure may be written as programs or instructions executable on a computer, and the programs or instructions may be stored in a storage medium.
The machine-readable storage medium may be provided in the form of a non-transitory storage medium. Here, the term "non-transitory storage medium" merely means that the storage medium is a tangible device and does not include a signal (e.g., an electromagnetic wave); the term does not distinguish between data being stored semi-permanently in the storage medium and data being stored temporarily in the storage medium. For example, a "non-transitory storage medium" may include a buffer that temporarily stores data.
According to embodiments of the present disclosure, methods according to various embodiments disclosed herein may be included and provided in a computer program product. The computer program product may be traded as a product between a seller and a buyer. The computer program product may be distributed in the form of a machine-readable storage medium (e.g., a compact disc read-only memory (CD-ROM)), distributed online (e.g., downloaded or uploaded) via an application store (e.g., PlayStore™), or distributed directly between two user devices (e.g., smartphones). When distributed online, at least a portion of the computer program product (e.g., a downloadable app) may be at least temporarily stored or generated on a machine-readable storage medium, such as a memory of a manufacturer's server, a server of an application store, or a relay server.
Meanwhile, the model associated with the above neural network may be implemented as a software module. When implemented as software modules (e.g., program modules including instructions), the neural network model may be stored on a computer-readable recording medium.
Furthermore, the neural network model may be integrated in the form of a hardware chip and may be part of the apparatus described above. For example, the neural network model may be manufactured in the form of a dedicated hardware chip for artificial intelligence, or as part of a conventional general purpose processor (e.g., CPU or AP) or a dedicated graphics processor (e.g., GPU).
Furthermore, the neural network model may be provided in the form of downloadable software. The computer program product may comprise a product in the form of a software program (e.g., a downloadable application) that is distributed electronically through a manufacturer or electronic marketplace. For electronic distribution, at least a portion of the software program may be stored in a storage medium or temporarily generated. In this case, the storage medium may be a server of a manufacturer or an electronic market, or a storage medium of a relay server.
The technical spirit of the present disclosure has been described in detail with reference to example embodiments, but is not limited to the above-described embodiments, and various changes and modifications may be made therein by one of ordinary skill in the art within the technical spirit of the present disclosure.

Claims (15)

1. A method of processing audio, the method comprising:
identifying an audio scene type of an audio signal, the audio signal comprising at least one frame;
determining, in frame units, downmix related information, the downmix related information corresponding to the audio scene type;
downmixing the audio signal by using the downmix related information; and
and transmitting the downmix audio signal and the downmix related information.
2. The method of claim 1, wherein identifying the audio scene type comprises:
obtaining a center channel audio signal from the audio signal;
identifying a dialogue type from the obtained center channel audio signal;
obtaining a front channel audio signal and a side channel audio signal from the audio signal;
identifying a sound effect type based on the front channel audio signal and the side channel audio signal; and
the audio scene type is identified based on at least one of the identified dialog type or the identified sound effect type.
3. The method of claim 2, wherein identifying the type of conversation comprises:
identifying a dialog type by using a first neural network for identifying the dialog type;
identifying the dialog type as a first dialog type when a probability value of the dialog type identified by using the first neural network is greater than a predetermined first probability value of the first dialog type; and
when the probability value of the dialog type identified by using the first neural network is less than or equal to the predetermined first probability value, the dialog type is identified as a default dialog type.
4. The method of claim 3, wherein identifying the sound effect type comprises:
identifying a sound effect type by using a second neural network for identifying the sound effect type;
identifying the sound effect type as a first sound effect type when a probability value of the sound effect type identified by using the second neural network is greater than a predetermined second probability value of the first sound effect type; and
when a probability value of the sound effect type identified by using the second neural network is less than or equal to the predetermined second probability value, the sound effect type is identified as a default sound effect type.
5. The method of claim 2, wherein identifying the audio scene type based on at least one of the identified dialog type or the identified sound effect type comprises:
Identifying the audio scene type as a first dialog type when the identified dialog type is the first dialog type;
identifying the audio scene type as a first sound effect type when the identified sound effect type is the first sound effect type; and
when the identified dialog type is a default type and the identified sound effect type is the default type, the audio scene type is identified as a default type.
6. The method of claim 1, further comprising:
detecting a sound source object; and
based on the information about the detected sound source object, additional weight parameters for mixing from surround channels to height channels are identified,
wherein the downmix related information further comprises the additional weight parameter.
7. The method of claim 1, further comprising:
identifying an energy value of a high channel audio signal from the audio signal;
identifying energy values of surround channel audio signals from the audio signals; and
identifying additional weight parameters for mixing from the surround channels to the height channels based on the identified energy values of the height channel audio signals and the identified energy values of the surround channel audio signals,
wherein the downmix related information further comprises the additional weight parameter.
8. The method of claim 7, wherein the identifying of the additional weight parameter comprises:
identifying the additional weight parameter as a first value when an energy value of the high channel audio signal is greater than a predetermined first value and a ratio of the energy value of the high channel audio signal to the energy value of the surround channel audio signal is greater than a predetermined second value; and
the additional weight parameter is identified as a second value when the energy value of the high channel audio signal is less than or equal to the predetermined first value or the ratio is less than or equal to the predetermined second value.
9. The method of claim 7, wherein the identifying of the additional weight parameter comprises:
identifying a weight level for at least one time period of the audio signal based on a weight target ratio within audio content of the audio signal; and
identifying additional weight parameters corresponding to the weight levels, and
wherein the weight of a boundary segment between a first time segment of the audio signal and a second time segment of the audio signal has a value between the weight of the remaining segments in the first time segment other than the boundary segment and the weight of the remaining segments in the second time segment other than the boundary segment.
10. The method of claim 6, wherein the detection of the sound source object comprises:
identifying a movement of the sound source object and a direction of the sound source object based on a correlation and a delay between channels of the audio signal; and
identifying the type of the sound source object and the characteristics of the sound source object from the audio signal by using an object estimation probability model based on a gaussian mixture model,
wherein the information about the detected sound source object includes information about at least one of movement of the sound source object, direction of the sound source object, type of the sound source object, or characteristics of the sound source object, and
wherein identifying the additional weight parameters includes identifying additional weight parameters for mixing from the surround channel to the height channel based on at least one of a movement of the sound source object, a direction of the sound source object, a type of the sound source object, or a characteristic of the sound source object.
11. A method of processing audio, the method comprising:
obtaining a down-mix audio signal from the bitstream;
obtaining downmix related information from the bitstream, wherein the downmix related information is generated in units of frames by using an audio scene type;
Unmixed the downmix audio signal by using the downmix related information; and
reconstructing an audio signal comprising at least one frame based on the unmixed audio signal.
12. The method of claim 11, wherein the audio scene type is identified based on at least one of a dialog type or a sound effect type.
13. The method of claim 12, wherein the audio signal comprises an up-mix channel group audio signal,
wherein the upmix channel group audio signal comprises upmix channel audio signals of at least one upmix channel, and
wherein the up-mix channel audio signal comprises a second audio signal obtained by de-mixing from the first audio signal of the at least one first channel.
14. The method of claim 11, wherein the downmix related information further includes information on additional weight parameters for unmixing from a height channel to a surround channel, and
wherein the reconstructing of the audio signal comprises reconstructing the audio signal by using a downmix weight parameter and information on the additional weight parameter.
15. A computer-readable recording medium having recorded thereon a program for implementing the method of any one of claims 1 to 10.

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
KR10-2021-0065662 2021-05-21
KR1020210140581A KR20220157848A (en) 2021-05-21 2021-10-20 Apparatus and method of processing multi-channel audio signal
KR10-2021-0140581 2021-10-20
PCT/KR2022/006983 WO2022245076A1 (en) 2021-05-21 2022-05-16 Apparatus and method for processing multi-channel audio signal

Publications (1)

Publication Number Publication Date
CN117321680A true CN117321680A (en) 2023-12-29

Family

ID=89285292

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202280035900.2A Pending CN117321680A (en) 2021-05-21 2022-05-16 Apparatus and method for processing multi-channel audio signal

Country Status (1)

Country Link
CN (1) CN117321680A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination