EP2720223A2 - Audio signal processing method, audio encoding apparatus, audio decoding apparatus, and terminal adopting the same - Google Patents

Audio signal processing method, audio encoding apparatus, audio decoding apparatus, and terminal adopting the same

Info

Publication number: EP2720223A2
Application number: EP12797100.0A
Authority: EP (European Patent Office)
Inventor: Nam-Suk Lee
Current Assignee: Samsung Electronics Co Ltd
Original Assignee: Samsung Electronics Co Ltd
Other languages: German (de), French (fr)
Legal status: Withdrawn

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04S: STEREOPHONIC SYSTEMS
    • H04S 3/00: Systems employing more than two channels, e.g. quadraphonic
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/008: Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing

Definitions

  • Apparatuses and methods consistent with exemplary embodiments relate to audio encoding/decoding, and more particularly, to an audio signal processing method capable of minimizing deterioration of sound quality when a multi-channel audio signal is restored, an audio encoding apparatus, an audio decoding apparatus, and a terminal adopting the same.
  • A multi-channel audio signal requires high-efficiency data compression depending on the transmission environment.
  • A spatial parameter is used to restore the multi-channel audio signal.
  • In the process of extracting the spatial parameter, distortion may occur due to the effect of a reverberation signal; deterioration of sound quality may then occur when the multi-channel audio signal is restored.
  • aspects of one or more exemplary embodiments provide an audio signal processing method capable of minimizing deterioration of sound quality when a multi-channel audio signal is restored, an audio encoding apparatus, an audio decoding apparatus, and a terminal adopting the same.
  • an audio signal processing method including: when a first plurality of input channels are down-mixed to a second plurality of output channels, comparing locations of the first plurality of input channels with locations of the second plurality of output channels; down-mixing channels of the first plurality of input channels, which have the same locations as those of the second plurality of output channels, to channels at the same locations among the second plurality of output channels; searching for at least one adjacent channel for each of the remaining channels among the first plurality of input channels; determining a weighting factor for the searched adjacent channel in consideration of at least one of a distance between channels, a correlation between signals, and an error during restoration; and down-mixing each of the remaining channels among the first plurality of input channels to the adjacent channel based on the determined weighting factor.
  • The present invention may allow various kinds of change or modification and various changes in form, and specific embodiments will be illustrated in the drawings and described in detail in the specification. However, it should be understood that the specific embodiments do not limit the present invention to a specific disclosed form but cover every modification, equivalent, or replacement within the spirit and technical scope of the present invention. In the following description, well-known functions or constructions are not described in detail so as not to obscure the invention with unnecessary detail.
  • FIG. 1 is a block diagram of an audio signal processing system 100 according to an exemplary embodiment.
  • The audio signal processing system 100 corresponds to a multimedia device and may be a dedicated voice communication terminal such as a telephone or a mobile phone, a dedicated broadcasting or music terminal such as a TV or an MP3 player, or a hybrid of the two types of terminals, but is not limited thereto.
  • the audio signal processing system 100 may be used as a client, a server, or a transducer disposed between a client and a server.
  • the audio signal processing system 100 includes an encoding apparatus 110 and a decoding apparatus 120.
  • the audio signal processing system 100 may include both the encoding apparatus 110 and the decoding apparatus 120, and according to another exemplary embodiment, the audio signal processing system 100 may include any one of the encoding apparatus 110 and the decoding apparatus 120.
  • the encoding apparatus 110 receives an original signal formed with a plurality of channels, i.e., a multi-channel audio signal, and generates a down-mixed audio signal by down-mixing the original signal.
  • the encoding apparatus 110 generates a prediction parameter and encodes the prediction parameter.
  • the prediction parameter is applied to restore the original signal from the down-mixed audio signal.
  • The prediction parameter is a value associated with the down-mix matrix used for down-mixing the original signal, such as each coefficient value included in the down-mix matrix.
  • the prediction parameter may include a spatial parameter.
  • the prediction parameter may vary according to a product specification, a design specification, and the like of the encoding apparatus 110 or the decoding apparatus 120 and may be set as an experimentally optimized value.
  • a channel may indicate a speaker.
  • the decoding apparatus 120 generates a restored signal corresponding to the original signal, i.e., the multi-channel audio signal, by up-mixing the down-mixed audio signal using the prediction parameter.
  • FIG. 2 is a block diagram of an audio encoding apparatus 200 according to an exemplary embodiment.
  • the audio encoding apparatus 200 may include a down-mixing unit 210, a side information generation unit 220, and an encoding unit 230.
  • the components may be integrated as at least one module and implemented as at least one processor (not shown).
  • the down-mixing unit 210 receives an N-channel audio signal and down-mixes the received N-channel audio signal.
  • The down-mixing unit 210 may generate a mono-channel audio signal or an M-channel audio signal (M < N) by down-mixing the N-channel audio signal.
  • the down-mixing unit 210 may generate a three-channel audio signal or six-channel audio signal by down-mixing a 10.2-channel audio signal so as to correspond to a 2.1-channel audio signal or a 5.1-channel audio signal.
  • the down-mixing unit 210 generates a first mono channel by selecting and down-mixing two of the N channels and generates a second mono channel by down-mixing the generated first mono channel and another channel.
  • a final mono-channel audio signal or the M-channel audio signal may be generated by repeating a process of down-mixing a mono channel generated as a down-mixing result and another channel.
  • the down-mixing unit 210 may down-mix a multi-channel audio signal at a relatively high compression ratio by down-mixing channels having a high correlation therebetween.
  • the side information generation unit 220 generates side information required to restore multiple channels from a down-mixed channel. Every time the down-mixing unit 210 sequentially down-mixes multiple channels, the side information generation unit 220 generates side information required to restore the multiple channels from the down-mixed channel. At this time, the side information generation unit 220 may generate information for determining intensities of two channels to be down-mixed and information for determining phases of the two channels.
  • the side information generation unit 220 generates information indicating which channels have been down-mixed.
  • the side information generation unit 220 may generate a down-mixing order of the channels as the side information.
  • the side information generation unit 220 repeats generation of information required to restore channels down-mixed to a mono channel every time down-mixing is performed. For example, if a mono channel is generated by sequentially down-mixing 12 channels 11 times, information on a down-mixing order, information for determining intensities of channels, and information for determining phases of the channels are generated 11 times. According to an exemplary embodiment, when information for determining intensities of channels and information for determining phases of the channels are generated for each of a plurality of frequency bands, if the number of frequency bands is k, 11*k pieces of information for determining intensities of channels may be generated, and 11*k pieces of information for determining phases of channels may be generated.
  • The encoding unit 230 may encode the mono-channel audio signal or the M-channel audio signal down-mixed and generated by the down-mixing unit 210. If the audio signal output from the down-mixing unit 210 is an analog signal, the analog signal is converted into a digital signal, and symbols are encoded according to a predetermined algorithm. The encoding algorithm is not limited, and any algorithm for generating a bitstream by encoding an audio signal may be used in the encoding unit 230. In addition, the encoding unit 230 may encode the side information generated by the side information generation unit 220 for restoring a multi-channel audio signal from a mono-channel audio signal.
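  • The sequential pairwise down-mixing and side-information generation described above can be sketched in Python as follows. This is a minimal illustration only: the simple averaging down-mix, the level-ratio intensity cue, and the sign-based phase cue are assumptions for the sketch, not the codec's actual definitions, and a real implementation would compute them per frequency band.

```python
import numpy as np

def downmix_pair(a, b):
    """Down-mix two channel signals to one and generate side information
    (a sketch: level ratio as the 'intensity' cue, correlation sign as a
    crude 'phase' cue; real codecs compute such cues per frequency band)."""
    mono = (a + b) / 2.0
    eps = 1e-12
    intensity = float(np.sqrt((a ** 2).sum() / ((b ** 2).sum() + eps)))
    phase = float(np.sign((a * b).sum()) or 1.0)
    return mono, {"intensity": intensity, "phase": phase}

def sequential_downmix(channels):
    """Repeatedly down-mix the running mono with the next channel, recording
    the down-mixing order and per-step side information."""
    mono = channels[0]
    order, side_info = [], []
    for idx in range(1, len(channels)):          # N channels -> N-1 down-mixes
        mono, params = downmix_pair(mono, channels[idx])
        order.append(idx)
        side_info.append(params)
    return mono, order, side_info

# 12 channels down-mixed 11 times -> 11 sets of side information
chans = [np.random.randn(1024) for _ in range(12)]
mono, order, info = sequential_downmix(chans)
assert len(info) == 11
```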
  • FIG. 3 is a block diagram of an audio decoding apparatus 300 according to an exemplary embodiment.
  • the audio decoding apparatus 300 may include an extraction unit 310, a decoding unit 320, and an up-mixing unit 330.
  • the components may be integrated as at least one module and implemented as at least one processor (not shown).
  • the extraction unit 310 extracts encoded audio and encoded side information from received audio data, i.e., a bitstream.
  • The encoded audio may be generated by down-mixing N channels to a mono channel or M channels (M < N) and encoding an audio signal according to a predetermined algorithm.
  • the decoding unit 320 decodes the encoded audio and the encoded side information extracted by the extraction unit 310. In this case, the decoding unit 320 decodes the encoded audio and the encoded side information by using the same algorithm as used for encoding. As a result of the audio decoding, a mono-channel audio signal or an M-channel audio signal is restored.
  • The up-mixing unit 330 restores the original N-channel audio signal by up-mixing the audio signal decoded by the decoding unit 320. At this time, the up-mixing unit 330 restores the N-channel audio signal based on the side information decoded by the decoding unit 320.
  • the up-mixing unit 330 up-mixes a down-mixed audio signal to a multi-channel audio signal by reversely performing a down-mixing process with reference to side information that is a spatial parameter.
  • channels are sequentially separated from a mono channel by referring to the side information in which information on a down-mixing order of the channels is included.
  • the channels may be sequentially separated from the mono channel by determining intensities and phases of channels, which have been down-mixed, according to information for determining the intensities and phases of the channels, which have been down-mixed.
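  • As a counterpart to the down-mixing sketch given earlier, the reverse separation might look as follows; it assumes the two mixed channels were fully correlated, which is the idealized case (real decoders additionally use decorrelation to restore ambience).

```python
def upmix_pair(mono, intensity, phase):
    """Split a mono signal back into two channels from the intensity (level
    ratio) and phase cues of the pairwise down-mix sketch; assumes fully
    correlated source channels, i.e. a ~= phase * intensity * b."""
    denom = 1.0 + phase * intensity
    if abs(denom) < 1e-6:                    # out-of-phase, equal-level corner case
        denom = 1e-6
    b = 2.0 * mono / denom                   # since mono = (a + b) / 2
    a = phase * intensity * b
    return a, b

def sequential_upmix(mono, side_info):
    """Undo the sequential down-mix by separating channels in the reverse
    order of down-mixing, as described above."""
    restored = [mono]
    for params in reversed(side_info):
        head, chan = upmix_pair(restored[0], params["intensity"], params["phase"])
        restored[0] = head                   # running mono of the earlier channels
        restored.insert(1, chan)             # the channel mixed in at this step
    return restored
```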
  • FIG. 4 illustrates channel matching between a 10.2-channel audio signal 410 and a 5.1-channel audio signal 420, according to an exemplary embodiment.
  • In FIG. 4, an input multi-channel audio signal is a 10.2-channel audio signal, and a multi-channel audio signal down-mixed to a smaller number of channels than the 10.2 channels, such as a 7.1-channel, 5.1-channel, or 2.0-channel audio signal, may be used as an output multi-channel audio signal.
  • Weighting factors of the FL and RL channels for the LW channel may be determined in consideration of a location, a correlation, or an error during restoration. According to an exemplary embodiment, if the weighting factor of the FL channel is determined to be 0 and the weighting factor of the RL channel is determined to be 1, the channel signal of the LW channel in the 10.2 channels may be down-mixed to the RL channel in the 5.1 channels.
  • L and Ls channels in the 10.2 channels may be allocated to the FL and RL channels in the 5.1 channels at the same locations, respectively.
  • FIG. 5 is a flowchart of a down-mixing method according to an exemplary embodiment.
  • the number and locations of input channels are checked from first layout information.
  • the first layout information is IC(1), IC(2), ..., IC(N), and locations of N input channels may be checked from the first layout information.
  • the number and locations of down-mixed channels are checked from second layout information.
  • The second layout information is DC(1), DC(2), ..., DC(M), and the locations of the M output channels (M < N) may be checked from the second layout information.
  • In operation 530, it is determined whether an output channel having the same output location as an input channel exists, starting from the first channel IC(1) of the input channels.
  • If such a channel exists, the channel signal of the corresponding input channel is allocated to the output channel at the same location. For example, if the output locations of an input channel IC(n) and an output channel DC(m) are the same, DC(m) may be updated to DC(m) + IC(n).
  • Otherwise, the channel signal of the input channel IC(n) is distributed to each of a plurality of adjacent channels by using a predetermined weighting factor corresponding to each of the plurality of adjacent channels. For example, if it is determined that DC(i), DC(j), and DC(k) of the output channels are adjacent channels of the input channel IC(n), weighting factors w_i, w_j, and w_k may be set for the pairs of the input channel IC(n) and the output channel DC(i), the input channel IC(n) and the output channel DC(j), and the input channel IC(n) and the output channel DC(k), respectively.
  • a weighting factor may be set by a method described below.
  • A weighting factor may be determined according to a relationship between the plurality of adjacent channels and the input channel IC(n). The relationship may be evaluated using at least one of the distance between each of the plurality of adjacent channels and the input channel IC(n), the correlation between the channel signal of each of the plurality of adjacent channels and the channel signal of the input channel IC(n), and the error during restoration in the plurality of adjacent channels.
  • A weighting factor may be determined as 0 or 1 according to a relationship between the plurality of adjacent channels and the input channel IC(n). For example, the weighting factor of the adjacent channel closest to the input channel IC(n) among the plurality of adjacent channels may be set to 1, and the weighting factors of the remaining adjacent channels may be set to 0. Alternatively, the weighting factor of the adjacent channel whose channel signal has the highest correlation with the channel signal of the input channel IC(n) may be set to 1, and those of the remaining adjacent channels may be set to 0. Alternatively, the weighting factor of the adjacent channel having the least error during restoration among the plurality of adjacent channels may be set to 1, and those of the remaining adjacent channels may be set to 0.
  • In operation 570, it is determined whether all the input channels have been checked; if all the input channels have not been checked, the method proceeds to operation 530 to repeat operations 530 to 560.
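  • The flow of operations 510 to 570 can be summarized with the following Python sketch. The layout representation (channel name mapped to a 2-D position) and the choice of a distance-based 0/1 weighting are illustrative assumptions; correlation- or error-based weightings, as named above, could be substituted.

```python
import math

def downmix_by_layout(inputs, in_layout, out_layout):
    """Sketch of FIG. 5: `inputs` maps input channel names to signals;
    `in_layout` / `out_layout` map channel names to (x, y) positions."""
    outputs = {name: 0.0 for name in out_layout}
    for ic, pos in in_layout.items():
        same = [oc for oc, opos in out_layout.items() if opos == pos]
        if same:
            # same location exists: allocate directly (DC(m) = DC(m) + IC(n))
            outputs[same[0]] = outputs[same[0]] + inputs[ic]
            continue
        # no same location: distribute over adjacent channels with weights;
        # here the nearest output channel gets weight 1, the rest weight 0
        dists = {oc: math.dist(pos, opos) for oc, opos in out_layout.items()}
        nearest = min(dists, key=dists.get)
        for oc in out_layout:
            outputs[oc] = outputs[oc] + (inputs[ic] if oc == nearest else 0.0)
    return outputs
```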
  • a frequency band is a unit of grouping samples of an audio spectrum and may have a uniform or non-uniform length by reflecting a threshold band.
  • One frame may be set so that the number of samples included in each frequency band gradually increases from the starting sample to the last sample. If multiple bitrates are supported, the number of samples included in each of the frequency bands corresponding to different bitrates may be set to be the same. The number of samples included in one frame or one frequency band may be determined in advance.
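  • For example, a non-uniform band layout in which each successive frequency band contains more samples than the previous one, reflecting a threshold (critical) band structure, could be laid out as below; the starting width and growth step are illustrative values.

```python
def band_boundaries(frame_len, first_width=8, growth=4):
    """Split `frame_len` spectral samples into bands whose widths gradually
    increase from the first sample to the last (illustrative rule)."""
    bounds, start, width = [], 0, first_width
    while start < frame_len:
        end = min(start + width, frame_len)
        bounds.append((start, end))
        start, width = end, width + growth   # each band wider than the last
    return bounds

# a 640-sample frame: narrow low-frequency bands, wide high-frequency bands
print(band_boundaries(640))
```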
  • a weighting factor used for channel down-mixing may be determined in correspondence with a layout of down-mixed channels and a layout of input channels. Accordingly, the down-mixing method may adaptively deal with various layouts, and a weighting factor may be determined in consideration of locations of channels, a correlation between channel signals, or an error during restoration, thereby improving sound quality.
  • Channels down-mixed in consideration of locations of channels, a correlation between channel signals, or an error during restoration are configured such that, if an audio decoding apparatus has the same number of channels as the down-mixed channels, a user cannot perceive subjective deterioration of sound quality even when listening to only the down-mixed channels without a separate up-mixing process.
  • FIG. 6 is a flowchart of an up-mixing method according to an exemplary embodiment.
  • In operation 610, configuration information and a corresponding spatial parameter of the down-mixed channels, which are generated through the process shown in FIG. 5, are received.
  • an input channel audio signal is restored by up-mixing the down-mixed channels using the configuration information and the corresponding spatial parameter of the down-mixed channels, which are received in operation 610.
  • FIG. 7 is a block diagram of a spatial parameter encoding apparatus 700 according to an exemplary embodiment, which may be included in the encoding unit 230 of FIG. 2 .
  • the spatial parameter encoding apparatus 700 may include an energy calculation unit 710, a quantization step determination unit 720, a quantization unit 730, and a multiplexing unit 740.
  • the components may be integrated as at least one module and implemented as at least one processor (not shown).
  • the energy calculation unit 710 receives a down-mixed channel signal provided from the down-mixing unit (refer to 210 of FIG. 2 ) and calculates an energy value in units of channels, frames, frequency bands, or frequency spectra.
  • an example of an energy value may be a norm value.
  • the quantization step determination unit 720 determines a quantization step by using the energy value calculated in units of channels, frames, frequency bands, or frequency spectra, which is provided from the energy calculation unit 710.
  • a quantization step may be small for a channel, a frame, a frequency band, or a frequency spectrum having a large energy value
  • a quantization step may be large for a channel, a frame, a frequency band, or a frequency spectrum having a small energy value.
  • two quantization steps may be set, and one of the two quantization steps may be selected according to a result of comparing an energy value with a predetermined threshold value.
  • a quantization step matching a distribution of energy values may be selected. Accordingly, bits to be allocated for quantization may be adjusted based on auditory importance, thereby improving sound quality. According to an exemplary embodiment, a total bitrate may be adjusted by variably changing a threshold frequency while maintaining a weighting factor allocated according to energy distribution of each down-mixed channel.
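  • A minimal sketch of this two-step scheme follows: a small quantization step for perceptually important bands (large energy, below the threshold frequency) and a large step elsewhere. The concrete step sizes and the way the two criteria are combined are assumptions for illustration.

```python
SMALL_STEP, LARGE_STEP = 0.25, 1.0           # assumed quantization step sizes

def choose_steps(band_energies, band_freqs, energy_threshold, threshold_freq):
    """Per band, select the small step when the energy is at or above the
    energy threshold and the band lies below the threshold frequency
    (cf. FIGS. 8A/8B and 10A-10C); otherwise select the large step."""
    return [SMALL_STEP if (e >= energy_threshold and f < threshold_freq)
            else LARGE_STEP
            for e, f in zip(band_energies, band_freqs)]

def quantize(params, steps):
    """Uniform quantization of per-band spatial parameters."""
    return [int(round(p / s)) for p, s in zip(params, steps)]

# raising threshold_freq enlarges the finely quantized region, increasing the
# total bitrate (FIG. 10B); lowering it decreases the bitrate (FIG. 10C)
```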
  • The quantization unit 730 quantizes a spatial parameter in units of channels, frames, frequency bands, or frequency spectra by using the quantization step determined by the quantization step determination unit 720 and lossless-encodes the quantized spatial parameter.
  • the multiplexing unit 740 generates a bitstream by multiplexing the lossless-encoded spatial parameter and a lossless-encoded down-mixed audio signal.
  • FIGS. 8A and 8B illustrate variable quantization steps according to energy values in frequency bands of each frame for a down-mixed channel, wherein a channel 1 and a channel 2 are down-mixed, and a channel 3 and a channel 4 are down-mixed.
  • d0 denotes energy values of a down-mixed channel of the channel 1 and the channel 2
  • d1 denotes energy values of a down-mixed channel of the channel 3 and the channel 4.
  • In FIGS. 8A and 8B, two quantization steps are set; a hatched portion corresponds to a frequency band having an energy value that is equal to or greater than a predetermined threshold value, and thus a small quantization step is set for the hatched portion.
  • FIG. 9 is a graph showing per-frequency band energy distribution of spectral data for all channels.
  • FIGS. 10A to 10C are graphs showing a total bitrate adjusted by changing a threshold frequency in consideration of energy distribution in a state where a weighting factor is allocated according to an energy value for each channel.
  • FIG. 10A shows an example in which a small quantization step is set to left parts, i.e., low-frequency regions 110a, 120a, and 130a less than an initial threshold frequency 100a, and a large quantization step is set to right parts, i.e., high-frequency bands 110b, 120b, and 130b greater than the initial threshold frequency 100a, based on the initial threshold frequency 100a.
  • FIG. 10B shows an example in which a threshold frequency 100b higher than the initial threshold frequency 100a is used to increase regions 140a, 150a, and 160a for which the small quantization step is set, thereby increasing a total bitrate.
  • FIG. 10C shows an example in which a threshold frequency 100c lower than the initial threshold frequency 100a is used to decrease regions 170a, 180a, and 190a for which the small quantization step is set, thereby decreasing a total bitrate.
  • FIG. 11 is a flowchart of a method of generating a spatial parameter, according to an exemplary embodiment, which may be performed by the encoding apparatus 200 of FIG. 2 .
  • In operation 1110, N angle parameters are generated.
  • (N-1) angle parameters among the N angle parameters are independently encoded.
  • the remaining one angle parameter is predicted from the (N-1) angle parameters.
  • The remaining one angle parameter is residual-encoded with respect to the predicted angle parameter to generate a residue of the remaining one angle parameter.
  • FIG. 12 is a flowchart of a method of generating a spatial parameter, according to another exemplary embodiment, which may be performed by the decoding apparatus 300 of FIG. 3.
  • the remaining one angle parameter is predicted from the (N-1) angle parameters.
  • the remaining one angle parameter is generated by adding the predicted angle parameter and a residue.
  • FIG. 13 is a flowchart of an audio signal processing method according to an exemplary embodiment.
  • In operation 1310, the first to nth channel signals ch1 to chn, which constitute a multi-channel signal, are down-mixed.
  • the first to nth channel signals ch1 to chn may be down-mixed to one mono signal DM.
  • Operation 1310 may be performed by the down-mixing unit 210.
  • In operation 1320, (n-1) channel signals among the first to nth channel signals ch1 to chn are summed, or all of the first to nth channel signals ch1 to chn are summed.
  • channel signals except for a reference channel signal among the first to nth channel signals ch1 to chn may be summed, and the summed signal becomes a first sum signal.
  • the first to nth channel signals ch1 to chn may be summed, and the summed signal becomes a second sum signal.
  • In operation 1330, a first spatial parameter may be generated using a correlation between the first sum signal generated in operation 1320 and the reference channel signal.
  • a second spatial parameter may be generated using a correlation between the second sum signal that is a signal generated in operation 1320 and the reference channel signal.
  • the reference channel signal may be each of the first to nth channel signals ch1 to chn. Therefore, the number of reference channel signals may be n, and n spatial parameters corresponding to the n reference channel signals may be generated.
  • operation 1330 may further include generating first to nth spatial parameters by setting each of the first to nth channel signals ch1 to chn as a reference channel signal.
  • Operations 1320 and 1330 may be performed by the down-mixing unit 210.
  • In operation 1340, the spatial parameter SP generated in operation 1330 is encoded and transmitted to the decoding apparatus (refer to 300 of FIG. 3).
  • the mono signal DM generated in operation 1310 is encoded and transmitted to the decoding apparatus (refer to 300 of FIG. 3 ).
  • the encoded spatial parameter SP and the encoded mono signal DM may be included in a transmission stream TS and transmitted to the decoding apparatus (refer to 300 of FIG. 3 ).
  • the spatial parameter SP included in the transmission stream TS indicates a spatial parameter set including the first to nth spatial parameters.
  • Operation 1340 may be performed by the encoding apparatus (refer to 200 of FIG. 2 ).
  • FIGS. 14A to 14C show an example for describing operation 1110 of FIG. 11 or operation 1330 of FIG. 13 .
  • an operation of generating the first sum signal and the first spatial parameter is described in detail with reference to FIGS. 14A to 14C .
  • FIGS. 14A to 14C illustrate a case where a multi-channel signal includes the first to third channel signals ch1, ch2, and ch3.
  • FIGS. 14A to 14C illustrate a vector sum of signals as a sum of signals, wherein the sum of signals indicates down mixing, and various down-mixing methods may be used instead of a vector sum method.
  • FIGS. 14A to 14C illustrate cases where a reference channel signal is the first channel signal ch1, the second channel signal ch2, and the third channel signal ch3, respectively.
  • When the reference channel signal is the first channel signal ch1, the side information generation unit (refer to 220 of FIG. 2) generates a sum signal 1410 by summing (ch2+ch3) the second and third channel signals ch2 and ch3, i.e., the channel signals except for the reference channel signal. Thereafter, the side information generation unit (refer to 220 of FIG. 2) generates a spatial parameter by using a correlation (ch1, ch2+ch3) between the first channel signal ch1, which is the reference channel signal, and the sum signal 1410.
  • the spatial parameter includes information indicating the correlation between the reference channel signal and the sum signal 1410 and information indicating a relative signal magnitude of the reference channel signal and the sum signal 1410.
  • When the reference channel signal is the second channel signal ch2, the side information generation unit (refer to 220 of FIG. 2) generates a sum signal 1420 by summing (ch1+ch3) the first and third channel signals ch1 and ch3, i.e., the channel signals except for the reference channel signal. Thereafter, the side information generation unit (refer to 220 of FIG. 2) generates a spatial parameter by using a correlation (ch2, ch1+ch3) between the second channel signal ch2, which is the reference channel signal, and the sum signal 1420.
  • When the reference channel signal is the third channel signal ch3, the side information generation unit (refer to 220 of FIG. 2) generates a sum signal 1430 by summing (ch1+ch2) the first and second channel signals ch1 and ch2, i.e., the channel signals except for the reference channel signal. Thereafter, the side information generation unit (refer to 220 of FIG. 2) generates a spatial parameter by using a correlation (ch3, ch1+ch2) between the third channel signal ch3, which is the reference channel signal, and the sum signal 1430.
  • When a multi-channel signal includes three channel signals, the number of reference channel signals is 3, and three spatial parameters may be generated.
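  • In code form, the per-reference-channel parameter generation of FIGS. 14A to 14C might be sketched as below; the normalized correlation and norm ratio stand in for the patent's "correlation" and "relative signal magnitude", which are not given concrete formulas here.

```python
import numpy as np

def spatial_parameters(channels):
    """For each reference channel ch_k: sum the remaining channels (the first
    sum signal) and compute a similarity cue and a relative-magnitude cue
    between the reference and the sum (cf. FIGS. 14A-14C)."""
    eps = 1e-12
    params = []
    for k, ref in enumerate(channels):
        others = sum(ch for i, ch in enumerate(channels) if i != k)
        corr = float((ref * others).sum()
                     / (np.linalg.norm(ref) * np.linalg.norm(others) + eps))
        level = float(np.linalg.norm(ref) / (np.linalg.norm(others) + eps))
        params.append({"similarity": corr, "relative_magnitude": level})
    return params

# three channel signals -> three reference channels -> three spatial parameters
assert len(spatial_parameters([np.random.randn(512) for _ in range(3)])) == 3
```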
  • the generated spatial parameters are encoded by the encoding apparatus (refer to 200 of FIG. 2 ) and transmitted to the decoding apparatus (refer to 300 of FIG. 3 ) via a network (not shown).
  • the decoding apparatus 300 receives and decodes first spatial parameters that are the spatial parameters described with reference to FIGS. 14A to 14C .
  • the decoding apparatus restores original channel signals by using the decoded mono signal and the decoded spatial parameters.
  • The spatial parameter generated with reference to FIG. 14A may include a parameter indicating a relative magnitude of the first channel signal ch1 and the sum signal 1410 (ch2+ch3) and a parameter indicating similarity between the first channel signal ch1 and the sum signal 1410 (ch2+ch3); thus, the first channel signal ch1 and the sum signal 1410 (ch2+ch3) may be restored by using the spatial parameter generated with reference to FIG. 14A and the mono signal DM.
  • the second channel signal ch2 and the sum signal 1420 (ch1+ch3), and the third channel signal ch3 and the sum signal 1430 (ch1+ch2) may be restored by using the spatial parameters generated with reference to FIGS. 14B and 14C , respectively. That is, the up-mixing unit (refer to 330 of FIG. 3 ) may restore all of the first to third channel signals ch1, ch2, and ch3.
  • FIG. 15 shows another example for describing operation 1110 of FIG. 11 or operation 1330 of FIG. 13 .
  • an operation of generating the second sum signal and the second spatial parameter is described in detail with reference to FIG. 15.
  • FIG. 15 illustrates a case where a multi-channel signal includes the first to third channel signals ch1, ch2, and ch3.
  • FIG. 15 illustrates a vector sum of signals as a sum of signals.
  • the second sum signal is a signal obtained by summing the first to third channel signals ch1, ch2, and ch3, and thus, a signal 1520 (ch1+ch2+ch3) obtained by adding the third channel signal ch3 to a signal 1510 obtained by summing the first and second channel signals ch1 and ch2 is the second sum signal.
  • a spatial parameter between the first channel signal ch1 and the second sum signal 1520 with the first channel signal ch1 as a reference channel signal is generated.
  • a spatial parameter including at least one of a first parameter and a second parameter may be generated by using a correlation (ch1, ch1+ch2+ch3) between the first channel signal ch1 and the second sum signal 1520.
  • a spatial parameter is generated by using a correlation (ch2, ch1+ch2+ch3) between the second channel signal ch2 and the second sum signal 1520 with the second channel signal ch2 as a reference channel signal.
  • a spatial parameter is generated by using a correlation (ch3, ch1+ch2+ch3) between the third channel signal ch3 and the second sum signal 1520 with the third channel signal ch3 as a reference channel signal.
  • the decoding apparatus receives and decodes second spatial parameters that are the spatial parameters described with reference to FIG. 15 . Thereafter, the decoding apparatus (refer to 300 of FIG. 3 ) restores original channel signals by using the decoded mono signal and the decoded spatial parameters.
  • the decoded mono signal corresponds to a signal (ch1+ch2+ch3) of summing multiple channel signals.
  • the first channel signal ch1 may be restored by using the spatial parameter, which is generated using the correlation (ch1, ch1+ch2+ch3) between the first channel signal ch1 and the second sum signal 1520, and the decoded mono signal.
  • the second channel signal ch2 may be restored by using the spatial parameter generated using the correlation (ch2, ch1+ch2+ch3) between the second channel signal ch2 and the second sum signal 1520.
  • the third channel signal ch3 may be restored by using the spatial parameter generated using the correlation (ch3, ch1+ch2+ch3) between the third channel signal ch3 and the second sum signal 1520.
  • FIGS. 16A to 16D show another example for describing operation 1110 of FIG. 11 or operation 1330 of FIG. 13 .
  • the spatial parameter generated by the side information generation unit 220 may include an angle parameter as a first parameter.
  • The angle parameter is a parameter indicating, as a predetermined angle value, a signal magnitude correlation between a reference channel signal, which is any one of the first to nth channel signals ch1 to chn, and the remaining channel signals except for the reference channel signal among the first to nth channel signals ch1 to chn.
  • the angle parameter may be named a global vector angle (GVA).
  • the angle parameter may be a parameter representing a relative magnitude of the reference channel signal and a first sum signal as an angle value.
  • the side information generation unit 220 may generate first to nth angle parameters with each of the first to nth channel signals ch1 to chn as a reference channel signal.
  • an angle parameter generated with a kth channel signal chk as a reference channel signal is referred to as a kth angle parameter.
  • FIG. 16A shows a case where a multi-channel signal received by the encoding apparatus includes the first to third channel signals ch1, ch2, and ch3.
  • FIGS. 16B, 16C, and 16D show cases where a reference channel signal is the first channel signal ch1, the second channel signal ch2, and the third channel signal ch3, respectively.
  • When a reference channel signal is the first channel signal ch1, the side information generation unit (refer to 220 of FIG. 2) sums (ch2+ch3) the second and third channel signals ch2 and ch3, which are the remaining channel signals except for the reference channel signal, and obtains a first angle parameter angle1 1622 that is an angle parameter between a sum signal 1620 and the first channel signal ch1.
  • the first angle parameter angle1 1622 may be obtained from inverse tangent of a value obtained by dividing an absolute value of the sum signal (ch2+ch3) 1620 by an absolute value of the first channel signal ch1.
  • a second angle parameter angle2 1632 with the second channel signal ch2 as a reference channel signal may be obtained from inverse tangent of a value obtained by dividing an absolute value of a sum signal (ch1+ch3) 1630 by an absolute value of the second channel signal ch2.
  • a third angle parameter angle3 1642 with the third channel signal ch3 as a reference channel signal may be obtained from inverse tangent of a value obtained by dividing an absolute value of a sum signal (ch1+ch2) 1640 by an absolute value of the third channel signal ch3.
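  • Following the inverse-tangent definition just given, the angle parameters can be sketched as follows; taking the "absolute value" of a whole signal as its norm is an assumption, and per-band computation is omitted.

```python
import numpy as np

def angle_parameters(channels):
    """kth angle parameter (GVA): arctan(|sum of the other channels| / |ch_k|),
    expressed in degrees (cf. FIGS. 16B-16D)."""
    angles = []
    for k, ref in enumerate(channels):
        others = sum(ch for i, ch in enumerate(channels) if i != k)
        ratio = np.linalg.norm(others) / (np.linalg.norm(ref) + 1e-12)
        angles.append(float(np.degrees(np.arctan(ratio))))
    return angles

# for typical three-channel content the three angles tend toward the
# roughly 180-degree total discussed with reference to FIG. 17
```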
  • FIG. 17 is a graph showing a total sum of angle parameters, wherein an x axis indicates an angle value, and a y axis indicates a distribution probability.
  • one unit corresponds to 6 degrees.
  • a value of 30 in the x axis indicates 180 degrees.
  • A total sum of the n angle parameters, calculated with each of the first to nth channel signals as a reference channel signal, converges to a predetermined value.
  • The converged predetermined value may vary according to the value of n and may be optimized through simulations or experiments. For example, when n is 3, the total sum of the three angle parameters converges to about 30 units, i.e., about 180 degrees 1710, as shown in FIG. 17.
  • The graph of FIG. 17 is obtained through a simulation or an experiment.
  • However, a total sum of the three angle parameters may converge to about 45 units, i.e., about 270 degrees 1720.
  • The total sum may converge to about 270 degrees 1720 because, when all three channel signals are silent, each angle parameter has a value of 90 degrees.
  • Otherwise, the total sum of the three angle parameters converges to about 180 degrees 1710.
  • In the silent case, the down-mixed mono signal also has a value of 0, and even if the mono signal is up-mixed and decoded, the result is 0. Therefore, even if the value of one angle parameter is changed to 0, the up-mixing and decoding result does not change; thus, any one of the three angle parameters may be changed to 0 without consequence.
  • FIG. 18 illustrates calculation of angle parameters, according to an exemplary embodiment, wherein a multi-channel signal includes the first to third channel signals ch1, ch2, and ch3.
  • A spatial parameter may be generated that includes the angle parameters except for a kth angle parameter among the first to nth angle parameters, together with a residue from which the kth angle parameter can be calculated.
  • When the first channel signal ch1 is a reference channel signal, a first angle parameter is calculated and encoded, and the encoded first angle parameter is included in a predetermined bit region 1810 and transmitted to the decoding apparatus (refer to 300 of FIG. 3).
  • When the second channel signal ch2 is a reference channel signal, a second angle parameter is calculated and encoded, and the encoded second angle parameter is included in a predetermined bit region 1830 and transmitted to the decoding apparatus (refer to 300 of FIG. 3).
  • a residue of the kth angle parameter may be obtained as below.
  • a value of the kth angle parameter may be obtained by subtracting values of the angle parameters except for the kth angle parameter among the n angle parameters from the predetermined value.
  • For example, when n is 3, the third angle parameter may be predicted using the correlation among the first to third angle parameters.
  • the side information generation unit (refer to 220 of FIG. 2 ) predicts a value of the kth angle parameter among the first to nth angle parameters.
  • a predetermined bit region 1870 indicates a data region in which the predicted value of the kth angle parameter is included.
  • a predetermined bit region 1850 indicates a data region in which a value of the third angle parameter angle3 1642 calculated with reference to FIG. 16D is included.
  • the side information generation unit (refer to 220 of FIG. 2 ) generates a difference between the predicted value 1870 of the kth angle parameter and the original value 1850 of the kth angle parameter as a residue of the kth angle parameter.
  • a predetermined bit region 1890 indicates a data region in which the residue of the kth angle parameter is included.
  • The encoding apparatus (refer to 200 of FIG. 2) encodes a spatial parameter including the angle parameters (parameters included in the data regions 1810 and 1830) except for the kth angle parameter among the first to nth angle parameters and the residue (parameter included in the data region 1890) of the kth angle parameter, and transmits the encoded spatial parameter to the decoding apparatus (refer to 300 of FIG. 3).
  • the decoding apparatus (refer to 300 of FIG. 3 ) receives a spatial parameter including the angle parameters except for the kth angle parameter among the first to nth angle parameters and the residue of the kth angle parameter.
  • The decoding unit (refer to 320 of FIG. 3) restores the kth angle parameter by using the received spatial parameter and a predetermined value.
  • The decoding unit (refer to 320 of FIG. 3) may generate the kth angle parameter by subtracting the values of the angle parameters except for the kth angle parameter among the first to nth angle parameters from the predetermined value and compensating the result with the residue of the kth angle parameter.
  • The residue of the kth angle parameter has a smaller data size than the value of the kth angle parameter. Therefore, when the spatial parameter including the angle parameters except for the kth angle parameter among the first to nth angle parameters and the residue of the kth angle parameter is transmitted to the decoding apparatus (refer to 300 of FIG. 3), the amount of data transmitted and received between the encoding apparatus (refer to 200 of FIG. 2) and the decoding apparatus (refer to 300 of FIG. 3) may be reduced.
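  • Putting the prediction and the residue together: assuming the sum of the n angle parameters converges to a known value (180 degrees for n = 3, per FIG. 17), the encoder can transmit n-1 angles plus a small residue, and the decoder can reconstruct the nth angle, as sketched below.

```python
def encode_angles(angles, converge=180.0):
    """Transmit all angles except the last, plus a residue for the last one
    (cf. FIG. 18: regions 1810/1830 carry angles, region 1890 the residue)."""
    sent = angles[:-1]
    predicted = converge - sum(sent)         # prediction from the other angles
    residue = angles[-1] - predicted         # small if the sum really converges
    return sent, residue

def decode_angles(sent, residue, converge=180.0):
    """Restore the omitted angle from the predetermined value and the residue."""
    predicted = converge - sum(sent)
    return list(sent) + [predicted + residue]

angles = [62.0, 55.0, 64.0]                  # total 181: near the convergence value
sent, res = encode_angles(angles)            # residue is only 1.0
assert decode_angles(sent, res) == angles
```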
  • D = A + B*3 + C*9 (range of D: 0 to 26)
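  • The formula above packs three ternary symbols A, B, C, each in {0, 1, 2}, into a single index D in the range 0 to 26; reading A, B, and C as quantized parameter indices is an assumption, since the surrounding text does not define them. A pack/unpack sketch:

```python
def pack(a, b, c):
    """D = A + B*3 + C*9 for A, B, C in {0, 1, 2}; D ranges over 0..26."""
    assert all(v in (0, 1, 2) for v in (a, b, c))
    return a + b * 3 + c * 9

def unpack(d):
    """Inverse mapping: recover (A, B, C) from D by base-3 digits."""
    return d % 3, (d // 3) % 3, d // 9

assert unpack(pack(2, 1, 0)) == (2, 1, 0)
assert pack(2, 2, 2) == 26                   # upper end of the 0..26 range
```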
  • FIG. 19 is a block diagram of an audio signal processing system 1900 integrating a multi-channel codec and a core codec, according to an exemplary embodiment.
  • the audio signal processing system 1900 shown in FIG. 19 includes an encoding apparatus 1910 and a decoding apparatus 1940.
  • the audio signal processing system 1900 may include both the encoding apparatus 1910 and the decoding apparatus 1940, and according to another exemplary embodiment, the audio signal processing system 1900 may include any one of the encoding apparatus 1910 and the decoding apparatus 1940.
  • The encoding apparatus 1910 may include a multi-channel encoder 1920 and a core encoder 1930, and the decoding apparatus 1940 may include a core decoder 1950 and a multi-channel decoder 1960.
  • Examples of a codec algorithm used in the core encoder 1930 and the core decoder 1950 may be AC-3, Enhanced AC-3, and AAC using modified discrete cosine transform (MDCT), but are not limited thereto.
  • FIG. 20 is a block diagram of an audio encoding apparatus 2000 according to an exemplary embodiment, which integrates a multi-channel encoder 2010 and a core encoder 2040.
  • the audio encoding apparatus 2000 shown in FIG. 20 includes the multi-channel encoder 2010 and the core encoder 2040, wherein the multi-channel encoder 2010 may include a transform unit 2020 and a down-mixing unit 2030, and the core encoder 2040 may include an envelope encoding unit 2050, a bit allocation unit 2060, a quantization unit 2070, and a bitstream formatting unit 2080.
  • the components may be integrated as at least one module and implemented as at least one processor (not shown).
  • The transform unit 2020 transforms a PCM input of a time domain into spectral data of a frequency domain.
  • the down-mixing unit 2030 extracts a spatial parameter from the spectral data provided from the transform unit 2020 and generates a down-mixed spectrum by down-mixing the spectral data.
  • the extracted spatial parameter is provided to the bitstream formatting unit 2080.
  • The envelope encoding unit 2050 acquires an envelope value in a predetermined frequency band unit from the MDCT transform coefficients of the down-mixed spectrum provided from the down-mixing unit 2030 and lossless-encodes the envelope value.
  • The envelope may be formed from any one of a power, an average amplitude, a norm value, and an average energy obtained in the predetermined frequency band unit.
  • the bit allocation unit 2060 generates bit allocation information required to encode a transform coefficient by using an envelope value obtained in each frequency band unit and normalizes the MDCT transform coefficients.
  • an envelope value quantized and lossless-encoded in each frequency band unit may be included in a bitstream and transmitted to a decoding apparatus (refer to 2100 of FIG. 21 ).
  • A dequantized envelope value may be used so that the encoding apparatus 2000 and the decoding apparatus (refer to 2100 of FIG. 21) use the same process.
  • When a norm value is used as an envelope value, a masking threshold value may be calculated using the norm value in each frequency band unit, and a required number of bits may be perceptually predicted using the masking threshold value.
  • the quantization unit 2070 generates a quantization index by quantizing the MDCT transform coefficients of the down-mixed spectrum based on the bit allocation information provided from the bit allocation unit 2060.
  • the bitstream formatting unit 2080 generates a bitstream by formatting the encoded spectral envelope, the quantization index of the down-mixed spectrum, and the spatial parameter.
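  • The core-encoder chain of FIG. 20 (per-band envelope, envelope-driven bit allocation, quantization of the normalized coefficients) can be sketched as follows; the norm-based envelope is one of the choices named above, while the log-proportional bit split and the uniform quantizer are illustrative assumptions.

```python
import numpy as np

def encode_core(mdct_coeffs, bands, total_bits=256):
    """Sketch of envelope coding, bit allocation, and quantization for one
    down-mixed spectrum; `bands` is a list of (start, end) index ranges."""
    eps = 1e-12
    # 1) envelope: one norm value per frequency band
    env = np.array([np.linalg.norm(mdct_coeffs[s:e]) + eps for s, e in bands])

    # 2) bit allocation driven by the envelope; a real codec uses the
    #    dequantized envelope so encoder and decoder run the same process
    weights = np.log2(env)
    weights = weights - weights.min() + 1e-3
    bits = np.floor(total_bits * weights / weights.sum()).astype(int)

    # 3) normalize each band by its envelope and quantize uniformly
    indices = []
    for (s, e), b, n in zip(bands, bits, env):
        per_sample = max(int(b) // max(e - s, 1), 1)   # bits per sample
        step = 2.0 / (2 ** per_sample)                 # step over [-1, 1]
        indices.append(np.round((mdct_coeffs[s:e] / n) / step).astype(int))
    return env, bits, indices
```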
  • FIG. 21 is a block diagram of an audio decoding apparatus 2100 according to an exemplary embodiment, which integrates a core decoder 2110 and a multi-channel decoder 2160.
  • The audio decoding apparatus 2100 shown in FIG. 21 includes the core decoder 2110 and the multi-channel decoder 2160, wherein the core decoder 2110 may include a bitstream parsing unit 2120, an envelope decoding unit 2130, a bit allocation unit 2140, and a dequantization unit 2150, and the multi-channel decoder 2160 may include an up-mixing unit 2170 and an inverse transform unit 2180.
  • the components may be integrated as at least one module and implemented as at least one processor (not shown).
  • the bitstream parsing unit 2120 extracts an encoded spectral envelope, a quantization index of a down-mixed spectrum, and a spatial parameter by parsing a bitstream transmitted via a network (not shown).
  • The envelope decoding unit 2130 lossless-decodes the encoded spectral envelope provided from the bitstream parsing unit 2120.
  • the bit allocation unit 2140 allocates bits required to decode a transform coefficient by using the encoded spectral envelope provided in each frequency band unit from the bitstream parsing unit 2120.
  • the bit allocation unit 2140 may operate the same as the bit allocation unit 2060 of the audio encoding apparatus 2000 of FIG. 20 .
  • the dequantization unit 2150 generates spectral data of an MDCT component by dequantizing the quantization index of the down-mixed spectrum provided from the bitstream parsing unit 2120 on the basis of bit allocation information provided from the bit allocation unit 2140.
  • the up-mixing unit 2170 up-mixes the spectral data of the MDCT component provided from the dequantization unit 2150 by using the spatial parameter provided from the bitstream parsing unit 2120 and inverse-normalizes the up-mixed spectrum by using the decoded spectral envelope provided from the envelope decoding unit 2130.
  • the inverse transform unit 2180 generates a pulse-code modulation (PCM) output of the time domain by inverse-transforming the up-mixed spectrum provided from the up-mixing unit 2170.
  • an inverse MODFT may be applied to correspond to the transform unit (refer to 2020 of FIG. 20 ).
  • spectral data of a modified discrete sine transform (MDST) component may be generated or predicted from the spectral data of the MDCT component.
  • the inverse MODFT may be applied by generating spectral data of an MODFT component using the spectral data of the MDCT component and the generated or predicted spectral data of the MDST component.
  • the inverse transform unit 2180 may apply an inverse MDCT to the spectral data of the MDCT component.
  • A parameter for compensating for an error generated during up-mixing in an MDCT domain may be transmitted from the audio encoding apparatus (refer to 2000 of FIG. 20).
  • multi-channel decoding may be performed in the MDCT domain.
  • an MODFT component may be generated by generating or predicting an MDST component from an MDCT component in a transient signal duration and be multi-channel-decoded in an MODFT domain.
  • Whether a current signal corresponds to a stationary signal duration or a non-stationary signal duration may be checked using flag information or window information added to a bitstream in a predetermined frequency band or frame unit.
  • When a short window is applied, the current signal may correspond to a non-stationary signal duration, and when a long window is applied, the current signal may correspond to a stationary signal duration.
  • Characteristics of a current signal may be checked by using blksw and AHT flag information when an Enhanced AC-3 algorithm is applied to the core codec, and by using blksw flag information when an AC-3 algorithm is applied to the core codec.
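  • Assuming the relevant flags have already been parsed from the bitstream, that check might be sketched as below; the exact combination rule for the Enhanced AC-3 case is an assumption.

```python
def is_transient(core_codec, blksw_flags, aht_in_use=False):
    """Stationary vs. non-stationary decision from core-codec flags: blksw
    (block switch, i.e., short windows) signals a transient; for Enhanced
    AC-3 the AHT flag is also consulted, AHT typically being enabled only
    for stationary signals (illustrative rule)."""
    if core_codec == "eac3" and aht_in_use:
        return False                         # AHT active: treat as stationary
    return any(blksw_flags)                  # short windows: treat as transient

# transient frames can then be handled via the MODFT domain and stationary
# frames directly in the MDCT domain, as described above
```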
  • The methods according to the embodiments can be written as computer-executable programs and can be implemented in general-purpose digital computers that execute the programs using a computer-readable recording medium.
  • Data structures, program instructions, or data files usable in the embodiments can be recorded on the computer-readable recording medium in various manners.
  • the computer-readable recording medium may include all types of storage devices in which data readable by a computer system is stored. Examples of the computer-readable recording medium include magnetic media, such as hard disks, floppy disks, and magnetic tapes, optical recording media, such as CD-ROMs and DVDs, magneto-optical media, such as floptical disks, and hardware devices, such as read only memory (ROM), random access memory (RAM), and flash memory, particularly configured to store and execute program instructions.
  • the computer-readable recording medium may be a transmission medium for transmitting a signal designating program instructions, a data structure, or the like.
  • Examples of the program instructions may include machine language codes created by a compiler and high-level language codes executable by a computer system using an interpreter or the like.


Abstract

An audio signal processing method includes: when a first plurality of input channels are down-mixed to a second plurality of output channels, comparing locations of the first plurality of input channels with locations of the second plurality of output channels; down-mixing channels of the first plurality of input channels, which have the same locations as those of the second plurality of output channels, to channels at the same locations among the second plurality of output channels; searching for at least one adjacent channel for each of the remaining channels among the first plurality of input channels; determining a weighting factor for the searched adjacent channel in consideration of at least one of a distance between channels, a correlation between signals, and an error during restoration; and down-mixing each of the remaining channels among the first plurality of input channels to the adjacent channel based on the determined weighting factor.

Description

    [Technical Field]
  • Apparatuses and methods consistent with exemplary embodiments relate to audio encoding/decoding, and more particularly, to an audio signal processing method capable of minimizing deterioration of sound quality when a multi-channel audio signal is restored, an audio encoding apparatus, an audio decoding apparatus, and a terminal adopting the same.
  • [Background Art]
  • Recently, along with the spread of multimedia content, demand from users desiring to experience a more realistic and rich sound environment has increased. To satisfy this demand, research on multi-channel audio has been actively conducted.
  • A multi-channel audio signal requires high-efficiency data compression depending on the transmission environment. Specifically, a spatial parameter is used to restore the multi-channel audio signal. In the process of extracting the spatial parameter, distortion may occur due to the effect of a reverberation signal, and deterioration of sound quality may then occur when the multi-channel audio signal is restored.
  • Therefore, a multi-channel audio codec technique capable of reducing or removing the deterioration of sound quality that may occur when a multi-channel audio signal is restored using a spatial parameter is required.
  • [Disclosure] [Technical Problem]
  • Aspects of one or more exemplary embodiments provide an audio signal processing method capable of minimizing deterioration of sound quality when a multi-channel audio signal is restored, an audio encoding apparatus, an audio decoding apparatus, and a terminal adopting the same.
  • [Technical Solution]
  • According to an aspect of one or more exemplary embodiments, there is provided an audio signal processing method including: when a first plurality of input channels are down-mixed to a second plurality of output channels, comparing locations of the first plurality of input channels with locations of the second plurality of output channels; down-mixing channels of the first plurality of input channels, which have the same locations as those of the second plurality of output channels, to channels at the same locations among the second plurality of output channels; searching for at least one adjacent channel for each of the remaining channels among the first plurality of input channels; determining a weighting factor for the searched adjacent channel in consideration of at least one of a distance between channels, a correlation between signals, and an error during restoration; and down-mixing each of the remaining channels among the first plurality of input channels to the adjacent channel based on the determined weighting factor.
  • [Description of Drawings]
    • FIG. 1 is a block diagram of an audio signal processing system according to an exemplary embodiment;
    • FIG. 2 is a block diagram of an audio encoding apparatus according to an exemplary embodiment;
    • FIG. 3 is a block diagram of an audio decoding apparatus according to an exemplary embodiment;
    • FIG. 4 illustrates channel matching between a 10.2-channel audio signal and a 5.1-channel audio signal, according to an exemplary embodiment;
    • FIG. 5 is a flowchart of a down-mixing method according to an exemplary embodiment;
    • FIG. 6 is a flowchart of an up-mixing method according to an exemplary embodiment;
    • FIG. 7 is a block diagram of a spatial parameter encoding apparatus according to an exemplary embodiment;
    • FIGS. 8A and 8B illustrate variable quantization steps according to energy values in frequency bands of each frame for a down-mixed channel;
    • FIG. 9 is a graph showing per-frequency band energy distribution of spectral data for all channels;
    • FIGS. 10A to 10C are graphs showing a total bitrate adjusted by changing a threshold frequency;
    • FIG. 11 is a flowchart of a method of generating a spatial parameter, according to an exemplary embodiment;
    • FIG. 12 is a flowchart of a method of generating a spatial parameter, according to another exemplary embodiment;
    • FIG. 13 is a flowchart of an audio signal processing method according to an exemplary embodiment;
    • FIGS. 14A to 14C show an example for describing operation 1110 of FIG. 11 or operation 1330 of FIG. 13;
    • FIG. 15 shows another example for describing operation 1110 of FIG. 11 or operation 1330 of FIG. 13;
    • FIGS. 16A to 16D show another example for describing operation 1110 of FIG. 11 or operation 1330 of FIG. 13;
    • FIG. 17 is a graph showing a total sum of angle parameters;
    • FIG. 18 is to describe calculation of angle parameters, according to an exemplary embodiment;
    • FIG. 19 is a block diagram of an audio signal processing system integrating a multi-channel codec and a core codec, according to an exemplary embodiment;
    • FIG. 20 is a block diagram of an audio encoding apparatus according to an exemplary embodiment; and
    • FIG. 21 is a block diagram of an audio decoding apparatus according to an exemplary embodiment;
    [Mode for Invention]
  • The present invention may allow various kinds of change or modification and various changes in form, and specific embodiments will be illustrated in the drawings and described in detail in the specification. However, it should be understood that the specific embodiments do not limit the present invention to a specific disclosed form but cover every modification, equivalent, or replacement within the spirit and technical scope of the present invention. In the following description, well-known functions or constructions are not described in detail so as not to obscure the invention with unnecessary detail.
• Although terms such as 'first' and 'second' can be used to describe various elements, the elements are not limited by these terms. These terms are used only to distinguish one element from another.
• The terminology used in the present invention is used only to describe specific embodiments and is not intended to limit the present invention. Although the terms used in the present invention are, as far as possible, selected from general terms currently in wide use while taking the functions in the present invention into account, they may vary according to an intention of one of ordinary skill in the art, judicial precedents, or the appearance of new technology. In addition, in specific cases, terms intentionally selected by the applicant may be used, and in that case, the meaning of the terms will be disclosed in the corresponding description of the invention. Accordingly, the terms used in the present invention should be defined not by their simple names but by their meaning and the content throughout the present invention.
  • The singular forms are intended to include the plural forms as well, unless the context clearly indicates otherwise. In the present invention, it should be understood that terms, such as 'include' and 'have', are used to indicate the existence of an implemented feature, number, step, operation, element, part, or a combination thereof without excluding in advance the possibility of the existence or addition of one or more other features, numbers, steps, operations, elements, parts, or combinations thereof.
  • The present invention will now be described more fully with reference to the accompanying drawings, in which exemplary embodiments of the present invention are shown. Like reference numerals in the drawings denote like elements, and thus their repetitive description will be omitted.
• FIG. 1 is a block diagram of an audio signal processing system 100 according to an exemplary embodiment. The audio signal processing system 100 corresponds to a multimedia device and may include a terminal dedicated to voice communication, such as a telephone or a mobile phone, a terminal dedicated to broadcasting or music, such as a TV or an MP3 player, or a hybrid of the two, but is not limited thereto. The audio signal processing system 100 may be used as a client, a server, or a transducer disposed between a client and a server.
  • Referring to FIG. 1, the audio signal processing system 100 includes an encoding apparatus 110 and a decoding apparatus 120. According to an exemplary embodiment, the audio signal processing system 100 may include both the encoding apparatus 110 and the decoding apparatus 120, and according to another exemplary embodiment, the audio signal processing system 100 may include any one of the encoding apparatus 110 and the decoding apparatus 120.
  • The encoding apparatus 110 receives an original signal formed with a plurality of channels, i.e., a multi-channel audio signal, and generates a down-mixed audio signal by down-mixing the original signal. The encoding apparatus 110 generates a prediction parameter and encodes the prediction parameter. The prediction parameter is applied to restore the original signal from the down-mixed audio signal. In detail, the prediction parameter is a value associated with a down-mix matrix used for down-mixing the original signal, each coefficient value included in the down-mix matrix, and the like. For example, the prediction parameter may include a spatial parameter. The prediction parameter may vary according to a product specification, a design specification, and the like of the encoding apparatus 110 or the decoding apparatus 120 and may be set as an experimentally optimized value. Herein, a channel may indicate a speaker.
  • The decoding apparatus 120 generates a restored signal corresponding to the original signal, i.e., the multi-channel audio signal, by up-mixing the down-mixed audio signal using the prediction parameter.
  • FIG. 2 is a block diagram of an audio encoding apparatus 200 according to an exemplary embodiment.
  • Referring to FIG. 2, the audio encoding apparatus 200 may include a down-mixing unit 210, a side information generation unit 220, and an encoding unit 230. The components may be integrated as at least one module and implemented as at least one processor (not shown).
  • The down-mixing unit 210 receives an N-channel audio signal and down-mixes the received N-channel audio signal. The down-mixing unit 210 may generate a mono-channel audio signal or an M-channel audio signal by down-mixing the N-channel audio signal (M<N). For example, the down-mixing unit 210 may generate a three-channel audio signal or six-channel audio signal by down-mixing a 10.2-channel audio signal so as to correspond to a 2.1-channel audio signal or a 5.1-channel audio signal.
  • According to an exemplary embodiment, the down-mixing unit 210 generates a first mono channel by selecting and down-mixing two of the N channels and generates a second mono channel by down-mixing the generated first mono channel and another channel. A final mono-channel audio signal or the M-channel audio signal may be generated by repeating a process of down-mixing a mono channel generated as a down-mixing result and another channel.
  • To down-mix the N-channel audio signal while minimizing entropy, it is preferable that similar channels are down-mixed. Therefore, the down-mixing unit 210 may down-mix a multi-channel audio signal at a relatively high compression ratio by down-mixing channels having a high correlation therebetween.
  • The side information generation unit 220 generates side information required to restore multiple channels from a down-mixed channel. Every time the down-mixing unit 210 sequentially down-mixes multiple channels, the side information generation unit 220 generates side information required to restore the multiple channels from the down-mixed channel. At this time, the side information generation unit 220 may generate information for determining intensities of two channels to be down-mixed and information for determining phases of the two channels.
  • In addition, every time down-mixing is performed, the side information generation unit 220 generates information indicating which channels have been down-mixed. When channels are down-mixed in an order based on a correlation calculation instead of a fixed order, the side information generation unit 220 may generate a down-mixing order of the channels as the side information.
  • The side information generation unit 220 repeats generation of information required to restore channels down-mixed to a mono channel every time down-mixing is performed. For example, if a mono channel is generated by sequentially down-mixing 12 channels 11 times, information on a down-mixing order, information for determining intensities of channels, and information for determining phases of the channels are generated 11 times. According to an exemplary embodiment, when information for determining intensities of channels and information for determining phases of the channels are generated for each of a plurality of frequency bands, if the number of frequency bands is k, 11*k pieces of information for determining intensities of channels may be generated, and 11*k pieces of information for determining phases of channels may be generated.
  • The encoding unit 230 may encode the mono-channel audio signal or the M-channel audio signal down-mixed and generated by the down-mixing unit 210. If the audio signal output from the down-mixing unit 210 is an analog signal, the analog signal is converted into a digital signal, and symbols are encoded according to a predetermined algorithm. An encoding algorithm is not limited, and all algorithms for generating a bitstream by encoding an audio signal may be used in the encoding unit 230. In addition, the encoding unit 230 may encode the side information generated to restore a multi-channel audio signal from a mono-channel audio signal by the side information generation unit 220.
  • FIG. 3 is a block diagram of an audio decoding apparatus 300 according to an exemplary embodiment.
  • Referring to FIG. 3, the audio decoding apparatus 300 may include an extraction unit 310, a decoding unit 320, and an up-mixing unit 330. The components may be integrated as at least one module and implemented as at least one processor (not shown).
  • The extraction unit 310 extracts encoded audio and encoded side information from received audio data, i.e., a bitstream. The encoded audio may be generated by down-mixing N channels to a mono channel or M channels (M<N) and encoding an audio signal according to a predetermined algorithm.
  • The decoding unit 320 decodes the encoded audio and the encoded side information extracted by the extraction unit 310. In this case, the decoding unit 320 decodes the encoded audio and the encoded side information by using the same algorithm as used for encoding. As a result of the audio decoding, a mono-channel audio signal or an M-channel audio signal is restored.
• The up-mixing unit 330 restores an N-channel audio signal before down-mixing by up-mixing the audio signal decoded by the decoding unit 320. At this time, the up-mixing unit 330 restores the N-channel audio signal based on the side information decoded by the decoding unit 320.
  • That is, the up-mixing unit 330 up-mixes a down-mixed audio signal to a multi-channel audio signal by reversely performing a down-mixing process with reference to side information that is a spatial parameter. At this time, channels are sequentially separated from a mono channel by referring to the side information in which information on a down-mixing order of the channels is included. The channels may be sequentially separated from the mono channel by determining intensities and phases of channels, which have been down-mixed, according to information for determining the intensities and phases of the channels, which have been down-mixed.
  • FIG. 4 illustrates channel matching between a 10.2-channel audio signal 410 and a 5.1-channel audio signal 420, according to an exemplary embodiment.
  • When an input multi-channel audio signal is a 10.2-channel audio signal, a multi-channel audio signal, such as a 7.1-channel audio signal, a 5.1-channel audio signal, or a 2.0-channel audio signal, down-mixed to a smaller number of channels than the 10.2 channels may be used as an output multi-channel audio signal.
  • As shown in FIG. 4, when the 10.2-channel audio signal 410 is down-mixed to the 5.1-channel audio signal 420, if FL and RL channels in the 5.1 channels are confirmed as adjacent channels of an LW channel in the 10.2 channels, weighting factors of the FL and RL channels may be determined in consideration of a location, a correlation or an error during restoration. According to an exemplary embodiment, if it is determined that a weighting factor of the FL channel is 0, and a weighting factor of the RL channel is 1, a channel signal of the LW channel in the 10.2 channels may be down-mixed to the RL channel in the 5.1 channels.
  • In addition, L and Ls channels in the 10.2 channels may be allocated to the FL and RL channels in the 5.1 channels at the same locations, respectively.
  • FIG. 5 is a flowchart of a down-mixing method according to an exemplary embodiment.
  • Referring to FIG. 5, in operation 510, the number and locations of input channels are checked from first layout information. For example, the first layout information is IC(1), IC(2), ..., IC(N), and locations of N input channels may be checked from the first layout information.
• In operation 520, the number and locations of down-mixed channels, i.e., output channels, are checked from second layout information. For example, the second layout information is DC(1), DC(2), ..., DC(M), and locations of M output channels (M<N) may be checked from the second layout information.
• In operation 530, starting from the first input channel IC(1), it is determined whether an output channel exists at the same location as the input channel.
  • In operation 540, if a channel having a same output location among the input channels and the output channels exists, a channel signal of a corresponding input channel is allocated to an output channel at the same location. For example, if output locations of an input channel IC(n) and an output channel DC(m) are the same, DC(m) may be DC(m)+IC(n).
• In operation 550, if no output channel exists at the same location as an input channel IC(n), it is determined whether any of the output channels is adjacent to the input channel IC(n), again proceeding from the first input channel IC(1).
  • In operation 560, if it is determined in operation 550 that a plurality of adjacent channels exist, a channel signal of the input channel IC(n) is distributed to each of the plurality of adjacent channels by using a predetermined weighting factor corresponding to each of the plurality of adjacent channels. For example, if it is determined that DC(i), DC(j), and DC(k) of the output channels are adjacent channels of the input channel IC(n), weighting factors wi, wj, and wk may be set for the input channel IC(n) and the output channel DC(i), the input channel IC(n) and the output channel DC(j), and the input channel IC(n) and the output channel DC(k), respectively. The channel signal of the input channel IC(n) may be distributed by using the set weighting factors wi, wj, and wk such that DC(i)=DC(i)+wi*IC(n), DC(j)=DC(j)+wj*IC(n), and DC(k)=DC(k)+wk*IC(n).
  • A weighting factor may be set by a method described below.
• According to an exemplary embodiment, a weighting factor may be determined according to a relationship between a plurality of adjacent channels and the input channel IC(n). As the relationship between the plurality of adjacent channels and the input channel IC(n), at least one of a distance between each of the plurality of adjacent channels and the input channel IC(n), a correlation between a channel signal of each of the plurality of adjacent channels and the channel signal of the input channel IC(n), and an error during restoration in the plurality of adjacent channels may be considered.
  • According to another exemplary embodiment, a weighting factor may be determined as 0 or 1 according to a relationship between a plurality of adjacent channels and the input channel IC(n). For example, an adjacent channel closest to the input channel IC(n) among the plurality of adjacent channels may be determined as 1, and the remaining adjacent channels may be determined as 0. Alternatively, an adjacent channel having a channel signal which has the highest correlation with the channel signal of the input channel IC(n) among channel signals of the plurality of adjacent channels may be determined as 1, and the remaining adjacent channels may be determined as 0. Alternatively, an adjacent channel having the least error during restoration among the plurality of adjacent channels may be determined as 1, and the remaining adjacent channels may be determined as 0.
• In operation 570, it is determined whether all the input channels have been checked, and if not, the method proceeds to operation 530 to repeat operations 530 to 560.
  • In operation 580, if all the input channels have been checked, configuration information and a corresponding spatial parameter of down-mixed channels having the signals allocated in operation 540 and the signals distributed in operation 560 are finally generated.
• The down-mixing method according to an exemplary embodiment may be performed in units of channels, frames, frequency bands, or frequency spectra, and thus, the accuracy of the performance improvement may be adjusted according to circumstances. Herein, a frequency band is a unit of grouping samples of an audio spectrum and may have a uniform or non-uniform length by reflecting a threshold band. In the case of a non-uniform length, one frame may be set so that the number of samples included in each frequency band gradually increases from the starting sample to the last sample. If multiple bitrates are supported, the number of samples included in each of the frequency bands corresponding to different bitrates may be set to be the same. The number of samples included in one frame or one frequency band may be determined in advance.
• In the down-mixing method according to an exemplary embodiment, a weighting factor used for channel down-mixing may be determined in correspondence with a layout of down-mixed channels and a layout of input channels. Accordingly, the down-mixing method may adaptively deal with various layouts, and a weighting factor may be determined in consideration of locations of channels, a correlation between channel signals, or an error during restoration, thereby improving sound quality. In addition, since the down-mixed channels are configured in consideration of locations of channels, a correlation between channel signals, or an error during restoration, if an audio decoding apparatus has the same number of channels as the down-mixed channels, the user may not perceive subjective deterioration of sound quality even when listening to only the down-mixed channels without a separate up-mixing process.
  • FIG. 6 is a flowchart of an up-mixing method according to an exemplary embodiment.
  • Referring to FIG. 6, in operation 610, configuration information and a corresponding spatial parameter of down-mixed channels, which are generated through the process as shown in FIG. 5, are received.
  • In operation 620, an input channel audio signal is restored by up-mixing the down-mixed channels using the configuration information and the corresponding spatial parameter of the down-mixed channels, which are received in operation 610.
  • FIG. 7 is a block diagram of a spatial parameter encoding apparatus 700 according to an exemplary embodiment, which may be included in the encoding unit 230 of FIG. 2.
• Referring to FIG. 7, the spatial parameter encoding apparatus 700 may include an energy calculation unit 710, a quantization step determination unit 720, a quantization and lossless encoding unit 730, and a multiplexing unit 740. The components may be integrated as at least one module and implemented as at least one processor (not shown).
  • The energy calculation unit 710 receives a down-mixed channel signal provided from the down-mixing unit (refer to 210 of FIG. 2) and calculates an energy value in units of channels, frames, frequency bands, or frequency spectra. Herein, an example of an energy value may be a norm value.
  • The quantization step determination unit 720 determines a quantization step by using the energy value calculated in units of channels, frames, frequency bands, or frequency spectra, which is provided from the energy calculation unit 710. For example, a quantization step may be small for a channel, a frame, a frequency band, or a frequency spectrum having a large energy value, and a quantization step may be large for a channel, a frame, a frequency band, or a frequency spectrum having a small energy value. In this case, two quantization steps may be set, and one of the two quantization steps may be selected according to a result of comparing an energy value with a predetermined threshold value. When a quantization step is adaptively allocated in correspondence with a distribution of energy values, a quantization step matching a distribution of energy values may be selected. Accordingly, bits to be allocated for quantization may be adjusted based on auditory importance, thereby improving sound quality. According to an exemplary embodiment, a total bitrate may be adjusted by variably changing a threshold frequency while maintaining a weighting factor allocated according to energy distribution of each down-mixed channel.
  • The quantization and lossless encoding unit 730 quantizes a spatial parameter in units of channels, frames, frequency bands, or frequency spectra by using the quantization step determined by the quantization step determination unit 720 and lossless-encodes the quantized spatial parameter.
  • The multiplexing unit 740 generates a bitstream by multiplexing the lossless-encoded spatial parameter and a lossless-encoded down-mixed audio signal.
  • FIGS. 8A and 8B illustrate variable quantization steps according to energy values in frequency bands of each frame for a down-mixed channel, wherein a channel 1 and a channel 2 are down-mixed, and a channel 3 and a channel 4 are down-mixed. In FIGS. 8A and 8B, d0 denotes energy values of a down-mixed channel of the channel 1 and the channel 2, and d1 denotes energy values of a down-mixed channel of the channel 3 and the channel 4.
  • FIGS. 8A and 8B indicate that two quantization steps are set, and a hatched portion corresponds to a frequency band having an energy value that is equal to or greater than a predetermined threshold value, and thus, a small quantization step is set for the hatched portion.
  • FIG. 9 is a graph showing per-frequency band energy distribution of spectral data for the whole channels, and FIGS. 10A to 10C are graphs showing a total bitrate adjusted by changing a threshold frequency in consideration of energy distribution in a state where a weighting factor is allocated according to an energy value for each channel.
• FIG. 10A shows an example in which a small quantization step is set for the left parts, i.e., low-frequency regions 110a, 120a, and 130a below an initial threshold frequency 100a, and a large quantization step is set for the right parts, i.e., high-frequency regions 110b, 120b, and 130b above the initial threshold frequency 100a. FIG. 10B shows an example in which a threshold frequency 100b higher than the initial threshold frequency 100a is used to enlarge regions 140a, 150a, and 160a for which the small quantization step is set, thereby increasing a total bitrate. FIG. 10C shows an example in which a threshold frequency 100c lower than the initial threshold frequency 100a is used to reduce regions 170a, 180a, and 190a for which the small quantization step is set, thereby decreasing a total bitrate.
  • FIG. 11 is a flowchart of a method of generating a spatial parameter, according to an exemplary embodiment, which may be performed by the encoding apparatus 200 of FIG. 2.
  • Referring to FIG. 11, in operation 1110, N angle parameters are generated.
  • In operation 1120, (N-1) angle parameters among the N angle parameters are independently encoded.
  • In operation 1130, the remaining one angle parameter is predicted from the (N-1) angle parameters.
• In operation 1140, a residue of the remaining one angle parameter is generated by residual-encoding the remaining one angle parameter against the predicted angle parameter, i.e., by taking the difference between its original value and the predicted value.
• FIG. 12 is a flowchart of a method of generating a spatial parameter, according to another exemplary embodiment, which may be performed by the decoding apparatus 300 of FIG. 3.
  • Referring to FIG. 12, in operation 1210, (N-1) angle parameters among N angle parameters are received.
  • In operation 1220, the remaining one angle parameter is predicted from the (N-1) angle parameters.
  • In operation 1230, the remaining one angle parameter is generated by adding the predicted angle parameter and a residue.
  • FIG. 13 is a flowchart of an audio signal processing method according to an exemplary embodiment.
  • Referring to FIG. 13, in operation 1310, first to nth channel signals ch1 to chn that are multiple channel signals are down-mixed. In detail, the first to nth channel signals ch1 to chn may be down-mixed to one mono signal DM. Operation 1310 may be performed by the down-mixing unit 210.
  • In operation 1320, (n-1) channel signals among the first to nth channel signals ch1 to chn are summed, or the first to nth channel signals ch1 to chn are summed. In detail, channel signals except for a reference channel signal among the first to nth channel signals ch1 to chn may be summed, and the summed signal becomes a first sum signal. Alternatively, the first to nth channel signals ch1 to chn may be summed, and the summed signal becomes a second sum signal.
  • In operation 1330, a first spatial parameter may be generated using a correlation between the first sum signal that is a signal generated in operation 1320 and the reference channel signal. Alternatively, in operation 1330, instead of generating the first spatial parameter, a second spatial parameter may be generated using a correlation between the second sum signal that is a signal generated in operation 1320 and the reference channel signal.
  • The reference channel signal may be each of the first to nth channel signals ch1 to chn. Therefore, the number of reference channel signals may be n, and n spatial parameters corresponding to the n reference channel signals may be generated.
  • Therefore, operation 1330 may further include generating first to nth spatial parameters by setting each of the first to nth channel signals ch1 to chn as a reference channel signal.
  • Operations 1320 and 1330 may be performed by the down-mixing unit 210.
  • In operation 1340, the spatial parameter SP generated in operation 1330 is encoded and transmitted to the decoding apparatus (refer to 300 of FIG. 3). In addition, the mono signal DM generated in operation 1310 is encoded and transmitted to the decoding apparatus (refer to 300 of FIG. 3). In detail, the encoded spatial parameter SP and the encoded mono signal DM may be included in a transmission stream TS and transmitted to the decoding apparatus (refer to 300 of FIG. 3). The spatial parameter SP included in the transmission stream TS indicates a spatial parameter set including the first to nth spatial parameters.
  • Operation 1340 may be performed by the encoding apparatus (refer to 200 of FIG. 2).
  • FIGS. 14A to 14C show an example for describing operation 1110 of FIG. 11 or operation 1330 of FIG. 13. Hereinafter, an operation of generating the first sum signal and the first spatial parameter is described in detail with reference to FIGS. 14A to 14C.
• FIGS. 14A to 14C illustrate a case where a multi-channel signal includes the first to third channel signals ch1, ch2, and ch3. In addition, FIGS. 14A to 14C illustrate a vector sum of signals as a sum of signals, wherein the sum of signals indicates down-mixing, and various down-mixing methods may be used instead of a vector sum method.
  • FIGS. 14A to 14C illustrate cases where a reference channel signal is the first channel signal ch1, the second channel signal ch2, and the third channel signal ch3, respectively.
  • Referring to FIG. 14A, when the reference channel signal is the first channel signal ch1, the side information generation unit (refer to 220 of FIG. 2) generates a sum signal 1410 by summing (ch2+ch3) the second and third channel signals ch2 and ch3 except for the reference channel signal. Thereafter, the side information generation unit (refer to 220 of FIG. 2) generates a spatial parameter by using a correlation (ch1, ch2+ch3) between the first channel signal ch1 that is the reference channel signal and the sum signal 1410. The spatial parameter includes information indicating the correlation between the reference channel signal and the sum signal 1410 and information indicating a relative signal magnitude of the reference channel signal and the sum signal 1410.
  • Referring to FIG. 14B, when the reference channel signal is the second channel signal ch2, the side information generation unit (refer to 220 of FIG. 2) generates a sum signal 1420 by summing (ch1+ch3) the first and third channel signals ch1 and ch3 except for the reference channel signal. Thereafter, the side information generation unit (refer to 220 of FIG. 2) generates a spatial parameter by using a correlation (ch2, ch1+ch3) between the second channel signal ch2 that is the reference channel signal and the sum signal 1420.
  • Referring to FIG. 14C, when the reference channel signal is the third channel signal ch3, the side information generation unit (refer to 220 of FIG. 2) generates a sum signal 1430 by summing (ch1+ch2) the first and second channel signals ch1 and ch2 except for the reference channel signal. Thereafter, the side information generation unit (refer to 220 of FIG. 2) generates a spatial parameter by using a correlation (ch3, ch1+ch2) between the third channel signal ch3 that is the reference channel signal and the sum signal 1430.
  • When a multi-channel signal includes three channel signals, the number of reference channel signals is 3, and three spatial parameters may be generated. The generated spatial parameters are encoded by the encoding apparatus (refer to 200 of FIG. 2) and transmitted to the decoding apparatus (refer to 300 of FIG. 3) via a network (not shown).
  • A mono signal DM obtained by down-mixing the first to third channel signals ch1, ch2, and ch3 is the same as a sum signal of the first to third channel signals ch1, ch2, and ch3 and may be represented by DM=ch1+ch2+ch3. Therefore, a relationship ch1=DM-(ch2+ch3) is valid.
  • The decoding apparatus 300 receives and decodes first spatial parameters that are the spatial parameters described with reference to FIGS. 14A to 14C. The decoding apparatus (refer to 300 of FIG. 3) restores original channel signals by using the decoded mono signal and the decoded spatial parameters. As described above, the relationship ch1=DM-(ch2+ch3) is valid, and the spatial parameter generated with reference to FIG. 14A may include a parameter indicating a relative magnitude of the first channel signal ch1 and the sum signal 1410 (ch2+ch3) and a parameter indicating similarity between the first channel signal ch1 and the sum signal 1410 (ch2+ch3), and thus, the first channel signal ch1 and the sum signal 1410 (ch2+ch3) may be restored by using the spatial parameter generated with reference to FIG. 14A and the mono signal DM. In the same way, the second channel signal ch2 and the sum signal 1420 (ch1+ch3), and the third channel signal ch3 and the sum signal 1430 (ch1+ch2) may be restored by using the spatial parameters generated with reference to FIGS. 14B and 14C, respectively. That is, the up-mixing unit (refer to 330 of FIG. 3) may restore all of the first to third channel signals ch1, ch2, and ch3.
  • FIG. 15 shows another example for describing operation 1110 of FIG. 11 or operation 1330 of FIG. 13. Hereinafter, an operation of generating the second sum signal and the second spatial parameter is described in detail with reference to FIG. 15. FIG. 15 illustrates a case where a multi-channel signal includes the first to third channel signals ch1, ch2, and ch3. In addition, FIG. 15 illustrates a vector sum of signals as a sum of signals.
  • Referring to FIG. 15, the second sum signal is a signal obtained by summing the first to third channel signals ch1, ch2, and ch3, and thus, a signal 1520 (ch1+ch2+ch3) obtained by adding the third channel signal ch3 to a signal 1510 obtained by summing the first and second channel signals ch1 and ch2 is the second sum signal.
  • First, a spatial parameter between the first channel signal ch1 and the second sum signal 1520 with the first channel signal ch1 as a reference channel signal is generated. In detail, a spatial parameter including at least one of a first parameter and a second parameter may be generated by using a correlation (ch1, ch1+ch2+ch3) between the first channel signal ch1 and the second sum signal 1520.
  • Next, a spatial parameter is generated by using a correlation (ch2, ch1+ch2+ch3) between the second channel signal ch2 and the second sum signal 1520 with the second channel signal ch2 as a reference channel signal. Finally, a spatial parameter is generated by using a correlation (ch3, ch1+ch2+ch3) between the third channel signal ch3 and the second sum signal 1520 with the third channel signal ch3 as a reference channel signal.
  • The decoding apparatus (refer to 300 of FIG. 3) receives and decodes second spatial parameters that are the spatial parameters described with reference to FIG. 15. Thereafter, the decoding apparatus (refer to 300 of FIG. 3) restores original channel signals by using the decoded mono signal and the decoded spatial parameters. The decoded mono signal corresponds to a signal (ch1+ch2+ch3) of summing multiple channel signals.
  • Therefore, the first channel signal ch1 may be restored by using the spatial parameter, which is generated using the correlation (ch1, ch1+ch2+ch3) between the first channel signal ch1 and the second sum signal 1520, and the decoded mono signal. Similarly, the second channel signal ch2 may be restored by using the spatial parameter generated using the correlation (ch2, ch1+ch2+ch3) between the second channel signal ch2 and the second sum signal 1520. In addition, the third channel signal ch3 may be restored by using the spatial parameter generated using the correlation (ch3, ch1+ch2+ch3) between the third channel signal ch3 and the second sum signal 1520.
  • FIGS. 16A to 16D show another example for describing operation 1110 of FIG. 11 or operation 1330 of FIG. 13.
• First, in the encoding apparatus 200 of FIG. 2, the spatial parameter generated by the side information generation unit 220 may include an angle parameter as a first parameter. The angle parameter represents, as a predetermined angle value, a signal magnitude relationship between a reference channel signal, which is any one of the first to nth channel signals ch1 to chn, and the remaining channel signals except for the reference channel signal among the first to nth channel signals ch1 to chn. The angle parameter may be named a global vector angle (GVA). In addition, the angle parameter may be a parameter representing a relative magnitude of the reference channel signal and a first sum signal as an angle value.
  • The side information generation unit 220 may generate first to nth angle parameters with each of the first to nth channel signals ch1 to chn as a reference channel signal. Hereinafter, an angle parameter generated with a kth channel signal chk as a reference channel signal is referred to as a kth angle parameter.
  • FIG. 16A shows a case where a multi-channel signal received by the encoding apparatus includes the first to third channel signals ch1, ch2, and ch3. FIGS. 16B, 16C, and 16D show cases where a reference channel signal is the first channel signal ch1, the second channel signal ch2, and the third channel signal ch3, respectively.
• Referring to FIG. 16B, when a reference channel signal is the first channel signal ch1, the side information generation unit (refer to 220 of FIG. 2) sums (ch2+ch3) the second and third channel signals ch2 and ch3, which are the remaining channel signals except for the reference channel signal, and obtains a first angle parameter angle1 1622 that is an angle parameter between a sum signal 1620 and the first channel signal ch1.
  • In detail, the first angle parameter angle1 1622 may be obtained from inverse tangent of a value obtained by dividing an absolute value of the sum signal (ch2+ch3) 1620 by an absolute value of the first channel signal ch1.
  • Referring to FIG. 16C, a second angle parameter angle2 1632 with the second channel signal ch2 as a reference channel signal may be obtained from inverse tangent of a value obtained by dividing an absolute value of a sum signal (ch1+ch3) 1630 by an absolute value of the second channel signal ch2.
  • Referring to FIG. 16D, a third angle parameter angle3 1642 with the third channel signal ch3 as a reference channel signal may be obtained from inverse tangent of a value obtained by dividing an absolute value of a sum signal (ch1+ch2) 1640 by an absolute value of the third channel signal ch3.
  • FIG. 17 is a graph showing a total sum of angle parameters, wherein an x axis indicates an angle value, and a y axis indicates a distribution probability. In addition, in the angle value, one unit corresponds to 6 degrees. For example, a value of 30 in the x axis indicates 180 degrees.
• In detail, a total sum of n angle parameters calculated with each of the first to nth channel signals as a reference channel signal converges to a predetermined value. The converged value may vary according to the value of n and may be optimized through simulations or experiments. For example, when n is 3, the converged value may be 180 degrees.
• Referring to FIG. 17, when n is 3, the total sum of the three angle parameters converges to about 30 units, i.e., about 180 degrees 1710, as shown in FIG. 17. The graph of FIG. 17 is obtained through a simulation or an experiment.
• Exceptionally, the total sum of the three angle parameters may converge to about 45 units, i.e., about 270 degrees 1720. The total sum converges to about 270 degrees 1720 when all three channel signals are silent, because each angle parameter then has a value of 90 degrees. In this exceptional case, if the value of any one of the three angle parameters is changed to 0, the total sum of the three angle parameters converges to about 180 degrees 1710. When all three channel signals are silent, the down-mixed mono signal also has a value of 0, and even though the mono signal is up-mixed and decoded, the result thereof is 0. Therefore, changing the value of one angle parameter to 0 does not change the up-mixing and decoding result, and thus, it does not matter that the value of any one of the three angle parameters is changed to 0.
• FIG. 18 is a diagram for describing calculation of angle parameters, according to an exemplary embodiment, wherein a multi-channel signal includes the first to third channel signals ch1, ch2, and ch3. According to an exemplary embodiment, a spatial parameter including the angle parameters except for a kth angle parameter among the first to nth angle parameters and a residue of the kth angle parameter, which is used to calculate the kth angle parameter, may be generated.
  • Referring to FIG. 18, when the first channel signal ch1 is a reference channel signal, a first angle parameter is calculated and encoded, and the encoded first angle parameter is included in a predetermined bit region 1810 and is transmitted to the decoding apparatus (refer to 300 of FIG. 3). When the second channel signal ch2 is a reference channel signal, a second angle parameter is calculated and encoded, and the encoded second angle parameter is included in a predetermined bit region 1830 and is transmitted to the decoding apparatus (refer to 300 of FIG. 3).
  • When a third angle parameter is the kth angle parameter described above, a residue of the kth angle parameter may be obtained as below.
  • Since a total sum of the n angle parameters is converged to a predetermined value, a value of the kth angle parameter may be obtained by subtracting values of the angle parameters except for the kth angle parameter among the n angle parameters from the predetermined value. In detail, when n is 3, if not all the three channel signals are silent, a total sum of the three angle parameters is converged to about 180 degrees. Therefore, a value of the third angle parameter = 180 degrees - (a value of the first angle parameter + a value of the second angle parameter). The third angle parameter may be predicted using a correlation between the first to third angle parameters.
  • In detail, the side information generation unit (refer to 220 of FIG. 2) predicts a value of the kth angle parameter among the first to nth angle parameters. A predetermined bit region 1870 indicates a data region in which the predicted value of the kth angle parameter is included.
  • Thereafter, the side information generation unit (refer to 220 of FIG. 2) compares the predicted value of the kth angle parameter with an original value of the kth angle parameter. A predetermined bit region 1850 indicates a data region in which a value of the third angle parameter angle3 1642 calculated with reference to FIG. 16D is included.
  • Thereafter, the side information generation unit (refer to 220 of FIG. 2) generates a difference between the predicted value 1870 of the kth angle parameter and the original value 1850 of the kth angle parameter as a residue of the kth angle parameter. A predetermined bit region 1890 indicates a data region in which the residue of the kth angle parameter is included.
• The encoding apparatus (refer to 200 of FIG. 2) encodes a spatial parameter including the angle parameters (parameters included in the data regions 1810 and 1830) except for the kth angle parameter among the first to nth angle parameters and the residue (parameter included in the data region 1890) of the kth angle parameter and transmits the encoded spatial parameter to the decoding apparatus (refer to 300 of FIG. 3).
  • The decoding apparatus (refer to 300 of FIG. 3) receives a spatial parameter including the angle parameters except for the kth angle parameter among the first to nth angle parameters and the residue of the kth angle parameter.
  • The decoding unit (refer to 320 of FIG. 3) in the decoding apparatus (refer to 300 of FIG. 3) restores the kth angle parameter by using the received spatial parameter and a predetermined value.
• In detail, the decoding unit (refer to 320 of FIG. 3) may generate the kth angle parameter by subtracting the values of the angle parameters except for the kth angle parameter among the first to nth angle parameters from the predetermined value and compensating the subtraction result with the residue of the kth angle parameter.
  • The residue of the kth angle parameter has a less data size than a value of the kth angle parameter. Therefore, when the spatial parameter including the angle parameters except for the kth angle parameter among the first to nth angle parameters and the residue of the kth angle parameter is transmitted to the decoding apparatus (refer to 300 of FIG. 3), a data amount transmitted and received between the encoding apparatus (refer to 200 of FIG. 2) and the decoding apparatus (refer to 300 of FIG. 3) may be reduced.
• When angle parameters are generated for, for example, three channels, the angle parameter of which channel has been residual-coded may be signaled by using values of 0, 1, and 2. That is, when the values for all three channels are independently encoded, 2 bits * 3 = 6 bits are required, but only 5 bits may be required according to the method described below.
• When D=A+B*3+C*9 (a range of D: 0∼26), if the value of D is known in decoding, A, B, and C may be obtained by C=floor(D/9); D'=mod(D,9); B=floor(D'/3); A=mod(D',3).
  • FIG. 19 is a block diagram of an audio signal processing system 1900 integrating a multi-channel codec and a core codec, according to an exemplary embodiment.
  • The audio signal processing system 1900 shown in FIG. 19 includes an encoding apparatus 1910 and a decoding apparatus 1940. According to an exemplary embodiment, the audio signal processing system 1900 may include both the encoding apparatus 1910 and the decoding apparatus 1940, and according to another exemplary embodiment, the audio signal processing system 1900 may include any one of the encoding apparatus 1910 and the decoding apparatus 1940.
• The encoding apparatus 1910 may include a multi-channel encoder 1920 and a core encoder 1930, and the decoding apparatus 1940 may include a core decoder 1950 and a multi-channel decoder 1960.
• Examples of a codec algorithm used in the core encoder 1930 and the core decoder 1950 may be AC-3, enhanced AC-3, and AAC using modified discrete cosine transform (MDCT) but are not limited thereto.
  • FIG. 20 is a block diagram of an audio encoding apparatus 2000 according to an exemplary embodiment, which integrates a multi-channel encoder 2010 and a core encoder 2040.
  • The audio encoding apparatus 2000 shown in FIG. 20 includes the multi-channel encoder 2010 and the core encoder 2040, wherein the multi-channel encoder 2010 may include a transform unit 2020 and a down-mixing unit 2030, and the core encoder 2040 may include an envelope encoding unit 2050, a bit allocation unit 2060, a quantization unit 2070, and a bitstream formatting unit 2080. The components may be integrated as at least one module and implemented as at least one processor (not shown).
• Referring to FIG. 20, the transform unit 2020 transforms a PCM input of the time domain into spectral data of the frequency domain. At this time, a modified odd discrete Fourier transform (MODFT) may be applied. Since an MDCT component is generated according to MODFT = MDCT + jMDST, the existing inverse transform part and the existing analysis filter bank part are not necessary. Furthermore, since the MODFT has complex values, a level/phase/correlation may be obtained more accurately than in the MDCT.
  • The down-mixing unit 2030 extracts a spatial parameter from the spectral data provided from the transform unit 2020 and generates a down-mixed spectrum by down-mixing the spectral data. The extracted spatial parameter is provided to the bitstream formatting unit 2080.
• The envelope encoding unit 2050 acquires an envelope value in a predetermined frequency band unit from MDCT transform coefficients of the down-mixed spectrum provided from the down-mixing unit 2030 and lossless-encodes the envelope value. Herein, the envelope value may be formed from any one of power, average amplitude, a norm value, and average energy obtained in the predetermined frequency band unit.
  • The bit allocation unit 2060 generates bit allocation information required to encode a transform coefficient by using an envelope value obtained in each frequency band unit and normalizes the MDCT transform coefficients. In this case, an envelope value quantized and lossless-encoded in each frequency band unit may be included in a bitstream and transmitted to a decoding apparatus (refer to 2100 of FIG. 21). In association with bit allocation using an envelope value of each frequency band, a dequantized envelope value may be used so that the encoding apparatus 2000 and the decoding apparatus (refer to 2100 of FIG. 21) use a same process. When a norm value is used as an envelope value, a masking threshold value may be calculated using a norm value in each frequency band unit, and a required number of bits may be perceptually predicted using the masking threshold value.
  • The quantization unit 2070 generates a quantization index by quantizing the MDCT transform coefficients of the down-mixed spectrum based on the bit allocation information provided from the bit allocation unit 2060.
  • The bitstream formatting unit 2080 generates a bitstream by formatting the encoded spectral envelope, the quantization index of the down-mixed spectrum, and the spatial parameter.
  • FIG. 21 is a block diagram of an audio decoding apparatus 2100 according to an exemplary embodiment, which integrates a core decoder 2110 and a multi-channel decoder 2160.
• The audio decoding apparatus 2100 shown in FIG. 21 includes the core decoder 2110 and the multi-channel decoder 2160, wherein the core decoder 2110 may include a bitstream parsing unit 2120, an envelope decoding unit 2130, a bit allocation unit 2140, and a dequantization unit 2150, and the multi-channel decoder 2160 may include an up-mixing unit 2170 and an inverse transform unit 2180. The components may be integrated as at least one module and implemented as at least one processor (not shown).
  • Referring to FIG. 21, the bitstream parsing unit 2120 extracts an encoded spectral envelope, a quantization index of a down-mixed spectrum, and a spatial parameter by parsing a bitstream transmitted via a network (not shown).
• The envelope decoding unit 2130 lossless-decodes the encoded spectral envelope provided from the bitstream parsing unit 2120.
  • The bit allocation unit 2140 allocates bits required to decode a transform coefficient by using the encoded spectral envelope provided in each frequency band unit from the bitstream parsing unit 2120. The bit allocation unit 2140 may operate the same as the bit allocation unit 2060 of the audio encoding apparatus 2000 of FIG. 20.
  • The dequantization unit 2150 generates spectral data of an MDCT component by dequantizing the quantization index of the down-mixed spectrum provided from the bitstream parsing unit 2120 on the basis of bit allocation information provided from the bit allocation unit 2140.
  • The up-mixing unit 2170 up-mixes the spectral data of the MDCT component provided from the dequantization unit 2150 by using the spatial parameter provided from the bitstream parsing unit 2120 and inverse-normalizes the up-mixed spectrum by using the decoded spectral envelope provided from the envelope decoding unit 2130.
• The inverse transform unit 2180 generates a pulse-code modulation (PCM) output of the time domain by inverse-transforming the up-mixed spectrum provided from the up-mixing unit 2170. At this time, an inverse MODFT may be applied to correspond to the transform unit (refer to 2020 of FIG. 20). To this end, spectral data of a modified discrete sine transform (MDST) component may be generated or predicted from the spectral data of the MDCT component. The inverse MODFT may be applied by generating spectral data of an MODFT component using the spectral data of the MDCT component and the generated or predicted spectral data of the MDST component. Alternatively, the inverse transform unit 2180 may apply an inverse MDCT to the spectral data of the MDCT component. To this end, a parameter for compensating for an error generated during up-mixing in the MDCT domain may be transmitted from the audio encoding apparatus (refer to 2000 of FIG. 20).
• According to an exemplary embodiment, for a stationary signal duration, multi-channel decoding may be performed in the MDCT domain. For a non-stationary, i.e., transient, signal duration, an MODFT component may be generated by generating or predicting an MDST component from the MDCT component, and multi-channel decoding may be performed in the MODFT domain.
  • Whether a current signal corresponds to a stationary signal duration or a non-stationary signal duration may be checked using flag information or window information added in a predetermined frequency band or frame unit to a bitstream.
  • For example, when a short window is applied, the current signal may correspond to a non-stationary signal duration, and when a long window is applied, the current signal may correspond to a stationary signal duration.
• In more detail, the characteristics of a current signal may be checked by using blksw and AHT flag information when an enhanced AC-3 algorithm is applied to the core codec, and by using blksw flag information when an AC-3 algorithm is applied to the core codec.
• According to FIGS. 20 and 21, by using the MODFT for time/frequency domain transform, even though a multi-channel codec and a core codec using different transform schemes are integrated, the complexity at the decoding end may be reduced. In addition, even though a multi-channel codec and a core codec using different transform schemes are integrated, the existing synthesis filter bank part and the existing transform part are not necessary, and thus, the overlap-add may be omitted, thereby preventing additional latency.
• The methods according to the embodiments can be written as computer-executable programs and can be implemented in general-use digital computers that execute the programs using a computer-readable recording medium. In addition, data structures, program instructions, or data files usable in the embodiments of the present invention can be recorded on the computer-readable recording medium in various manners. The computer-readable recording medium may include all types of storage devices in which data readable by a computer system is stored. Examples of the computer-readable recording medium include magnetic media, such as hard disks, floppy disks, and magnetic tapes, optical recording media, such as CD-ROMs and DVDs, magneto-optical media, such as floptical disks, and hardware devices, such as read-only memory (ROM), random access memory (RAM), and flash memory, particularly configured to store and execute program instructions. In addition, the computer-readable recording medium may be a transmission medium for transmitting a signal designating program instructions, a data structure, or the like. Examples of the program instructions may include machine language codes created by a compiler and high-level language codes executable by a computer system using an interpreter or the like.
  • Although exemplary embodiments of the present invention have been described in detail with reference to the attached drawings, the present invention is not limited to these embodiments. It is clear that various kinds of change or modification can be performed within the scope of the technical spirit disclosed in claims by those of ordinary skill in the art, and it is understood that those changes or modifications belong to the technical scope of the present invention.

Claims (1)

1. An audio signal processing method comprising:
when a first plurality of input channels are down-mixed to a second plurality of output channels, comparing locations of the first plurality of input channels with locations of the second plurality of output channels;
down-mixing channels of the first plurality of input channels, which have the same locations as those of the second plurality of output channels, to channels at the same locations among the second plurality of output channels;
searching for at least one adjacent channel for each of the remaining channels among the first plurality of input channels;
determining a weighting factor for the searched adjacent channel in consideration of at least one of a distance between channels, a correlation between signals, and an error during restoration; and
down-mixing each of the remaining channels among the first plurality of input channels to the adjacent channel based on the determined weighting factor.
EP12797100.0A 2011-06-07 2012-06-07 Audio signal processing method, audio encoding apparatus, audio decoding apparatus, and terminal adopting the same Withdrawn EP2720223A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201161494050P 2011-06-07 2011-06-07
PCT/KR2012/004508 WO2012169808A2 (en) 2011-06-07 2012-06-07 Audio signal processing method, audio encoding apparatus, audio decoding apparatus, and terminal adopting the same

Publications (1)

Publication Number Publication Date
EP2720223A2 true EP2720223A2 (en) 2014-04-16

Family

ID=47296608

Family Applications (1)

Application Number Title Priority Date Filing Date
EP12797100.0A Withdrawn EP2720223A2 (en) 2011-06-07 2012-06-07 Audio signal processing method, audio encoding apparatus, audio decoding apparatus, and terminal adopting the same

Country Status (4)

Country Link
EP (1) EP2720223A2 (en)
KR (1) KR20140037118A (en)
CN (1) CN103733256A (en)
WO (1) WO2012169808A2 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107749299B (en) * 2017-09-28 2021-07-09 瑞芯微电子股份有限公司 Multi-audio output method and device

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101053017B (en) * 2004-11-04 2012-10-10 皇家飞利浦电子股份有限公司 Encoding and decoding multi-channel audio signals
KR100682904B1 (en) * 2004-12-01 2007-02-15 삼성전자주식회사 Apparatus and method for processing multichannel audio signal using space information
FR2898725A1 (en) * 2006-03-15 2007-09-21 France Telecom DEVICE AND METHOD FOR GRADUALLY ENCODING A MULTI-CHANNEL AUDIO SIGNAL ACCORDING TO MAIN COMPONENT ANALYSIS
US8027479B2 (en) * 2006-06-02 2011-09-27 Coding Technologies Ab Binaural multi-channel decoder in the context of non-energy conserving upmix rules
CN101594186B (en) * 2008-05-28 2013-01-16 华为技术有限公司 Method and device generating single-channel signal in double-channel signal coding

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See references of WO2012169808A2 *

Also Published As

Publication number Publication date
WO2012169808A3 (en) 2013-03-07
KR20140037118A (en) 2014-03-26
CN103733256A (en) 2014-04-16
WO2012169808A2 (en) 2012-12-13

Similar Documents

Publication Publication Date Title
JP6170520B2 (en) Audio and / or speech signal encoding and / or decoding method and apparatus
JP6789365B2 (en) Voice coding device and method
RU2439718C1 (en) Method and device for sound signal processing
JP2019080347A (en) Method for parametric multi-channel encoding
JP6980871B2 (en) Signal coding method and its device, and signal decoding method and its device
EP2849180B1 (en) Hybrid audio signal encoder, hybrid audio signal decoder, method for encoding audio signal, and method for decoding audio signal
EP4336497A2 (en) Multisignal encoder, multisignal decoder, and related methods using signal whitening or signal post processing
KR20090122142A (en) A method and apparatus for processing an audio signal
EP3550563B1 (en) Encoder, decoder, encoding method, decoding method, and associated programs
WO2017206794A1 (en) Method and device for extracting inter-channel phase difference parameter
US20120163608A1 (en) Encoder, encoding method, and computer-readable recording medium storing encoding program
EP2690622B1 (en) Audio decoding device and audio decoding method
EP2720223A2 (en) Audio signal processing method, audio encoding apparatus, audio decoding apparatus, and terminal adopting the same
AU2014280256B2 (en) Apparatus and method for audio signal envelope encoding, processing and decoding by splitting the audio signal envelope employing distribution quantization and coding
US20100280830A1 (en) Decoder
KR20090122143A (en) A method and apparatus for processing an audio signal
KR20130012972A (en) Method of encoding audio/speech signal
KR20120089230A (en) Apparatus for decoding a signal

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 20140107

AK Designated contracting states

Kind code of ref document: A2

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

DAX Request for extension of the european patent (deleted)
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION HAS BEEN WITHDRAWN

18W Application withdrawn

Effective date: 20140905