WO2012006770A1

WO2012006770A1 - Audio signal generator

Info

Publication number: WO2012006770A1
Application number: PCT/CN2010/075107
Authority: WO
Inventors: Faller Christof; Yue Lang; Jianfeng Xu
Original assignee: Huawei Technologies Co., Ltd.
Priority date: 2010-07-12
Filing date: 2010-07-12
Publication date: 2012-01-19
Also published as: CN102986254A; CN102986254B

Abstract

The invention relates to an audio signal generator for generating a downmix audio signal from a multi-channel audio signal comprising a first audio channel signal and a second audio channel signal. The audio signal generator comprises a processor (103) for amending a phase of the first audio channel signal using a first phase shift coefficient, and/or for amending a phase of the second audio channel signal using a second phase shift coefficient to reduce signal cancellations when combining the resulting first and second audio channel signal, and a combiner (109) for combining the resulting first and second audio channel signal to obtain the downmix audio signal.

Description

Audio signal generator

BACKGROUND OF THE INVENTION

The present invention relates to mobile communications over communication networks.

In order to code a multi-channel audio signal, parametric stereo or multi-channel audio coding may be applied, as described in:

• C. Faller and F. Baumgarte, "Efficient representation of spatial audio using perceptual parametrization," in Proc. IEEE Workshop on Appl. of Sig. Proc. to Audio and Acoust., Oct. 2001, pp. 199-202;

• C. Faller and F. Baumgarte, "Binaural Cue Coding: A novel and efficient representation of spatial audio," in Proc. ICASSP, May 2002, vol. 2, pp. 1841-1844;

• E. Schuijers, W. Oomen, B. den Brinker, and J. Breebaart, "Advances in parametric coding for high-quality audio," in Preprint 114th Conv. Aud. Eng. Soc., Mar. 2003;

• F. Baumgarte and C. Faller, "Binaural Cue Coding - Part I: Psychoacoustic fundamentals and design principles," IEEE Trans. on Speech and Audio Proc., vol. 11, no. 6, pp. 509-519, Nov. 2003;

• C. Faller and F. Baumgarte, "Binaural Cue Coding - Part II: Schemes and applications," IEEE Trans. on Speech and Audio Proc., vol. 11, no. 6, pp. 520-531, Nov. 2003.

Conventional parametric stereo or multi-channel audio coding approaches apply downmixing to generate a downmix audio signal comprising fewer channels than the original multi-channel audio signal. These fewer channels may be waveform coded, and side information relating to the original signal channel relations may be added to the coded audio channels. The decoder may use this side information to regenerate the original number of audio channels based on the decoded waveform coded audio channels.

When the audio channels are independent, the downmix audio signal can be generated by summing the input audio channels. When, however, the audio channels are not independent, as is commonly the case for stereo and multi-channel audio signals, the summing operation may result in coloration of the sound due to time-varying inter-channel signal statistics. To mitigate this problem, e.g. a magnitude equalization may be deployed, as described in A. Baumgarte, C. Faller, and P. Kroon, "Audio coder enhancement using scalable binaural cue coding with equalized mixing," in Preprint 116th Conv. Aud. Eng. Soc., May 2004.

However, when there are delays between the original audio channels, the magnitude equalization may not always sufficiently correct the undesired signal cancellation which occurs when out-of-phase signals are added for downmix generation. This problem may occur when a sound engineer has mixed music using delays between channels, phase inversion, or spaced microphones for recording. When the parametric stereo or multi-channel audio coding is used for speech applications, i.e. telephony or voice-over-IP, the mentioned problems may occur when several microphones are used to pick up voice in a tele-conferencing scenario.

SUMMARY OF THE INVENTION

A goal to be achieved by the present invention is to provide a concept for more efficiently generating a downmix signal from a plurality of audio channels. The invention is based on the finding that a downmix audio signal may be generated more efficiently when a time-adaptive phase alignment is applied prior to summing the audio channel signals embodying the input audio channels. The phase alignment may reduce signal cancellations when combining the resulting audio channel signals to obtain a downmix signal, and may be performed frame by frame and/or upon the basis of an averaging process which is performed over a multiplicity of frames. Additionally, magnitude equalization may be applied.

According to a first aspect, the invention relates to an audio signal generator for generating a downmix audio signal from a multi-channel audio signal comprising a first audio channel signal and a second audio channel signal, the audio signal generator comprising a processor for amending a phase of the first audio channel signal using a first phase shift coefficient, and/or for amending a phase of the second audio channel signal using a second phase shift coefficient to reduce signal cancellations when combining the resulting first and second audio channel signal, and a combiner for combining the resulting first and second audio channel signal to obtain the downmix audio signal.

According to an implementation form of the first aspect, the processor is configured to amend the phase of the first audio channel signal and/or the phase of the second audio channel signal to match a phase of a reference signal. The reference signal may be e.g. a predetermined reference signal or may be generated from the first and the second audio signal.

According to an implementation form of the first aspect, the processor is configured to determine a mean value of a product of the first audio channel signal and the second audio channel signal to obtain the first phase shift factor and/or the second phase shift factor. The mean value may be determined upon the basis of an averaging process by summing such products e.g. over a plurality of frames.

According to an implementation form of the first aspect, the processor is configured to set the first phase shift coefficient or the second phase shift coefficient to one. Thus, a phase of only one audio channel may be amended.

According to an implementation form of the first aspect, the first phase shift coefficient is a complex-conjugated version of the second phase shift coefficient. In order to obtain the complex-conjugated version of either phase shift coefficient, the sign of the respective imaginary part may be inverted.

According to an implementation form of the first aspect, the processor is configured to determine the first phase shift coefficient P₁(k,i) and the second phase shift coefficient P₂(k,i), k denoting a time index, i denoting a frequency index, upon the basis of the following formulas:

$$P_1(k,i) = 1$$

$$P_2(k,i) = \frac{E\{X_1(k,i)\,X_2^*(k,i)\}}{\left|E\{X_1(k,i)\,X_2^*(k,i)\}\right|}$$

wherein X₁(k,i) and X₂(k,i) respectively denote the first audio channel signal and the second audio channel signal, and wherein E{.} denotes an averaging operation.

According to an implementation form of the first aspect, the processor is configured to determine the first phase shift coefficient P₁(k,i) and the second phase shift coefficient P₂(k,i), k denoting a time index, i denoting a frequency index, upon the basis of the following formulas:

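$$P_1(k,i) = P(k,i)^*$$

$$P_2(k,i) = P(k,i)$$

with

$$P(k,i) = \sqrt{\frac{E\{X_1(k,i)\,X_2^*(k,i)\}}{\left|E\{X_1(k,i)\,X_2^*(k,i)\}\right|}}$$

wherein X₁(k,i) and X₂(k,i) respectively denote the first audio channel signal and the second audio channel signal, and wherein E{.} denotes an averaging operation.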

According to an implementation form of the first aspect, the processor is configured to determine the first phase shift coefficient P₁(k,i) and the second phase shift coefficient P₂(k,i), k denoting a time index, i denoting a frequency index, upon the basis of the following formulas:

$$P_1(k,i) = \frac{E\{S(k,i)\,X_1^*(k,i)\}}{\left|E\{S(k,i)\,X_1^*(k,i)\}\right|}$$

$$P_2(k,i) = \frac{E\{S(k,i)\,X_2^*(k,i)\}}{\left|E\{S(k,i)\,X_2^*(k,i)\}\right|}$$

wherein S(k,i) = X₁(k,i) + X₂(k,i), or wherein S(k,i) is a reference signal whose phase is a weighted sum of the phases of both channels and whose magnitude is the sum or norm of the magnitudes of both channels, wherein X₁(k,i) and X₂(k,i) respectively denote the first audio channel signal and the second audio channel signal, and wherein E{.} denotes an averaging operation.

According to an implementation form of the first aspect, the processor is configured to weight the downmix signal by a power factor, in particular by a power factor which depends on a sum of powers of the first channel audio signal and the second channel audio signal. Thus, the power factor scales the downmix signal in order to adjust its power with regard to the first and second audio channel.

According to an implementation form of the first aspect, the combiner is configured to superimpose the first auxiliary signal and the second auxiliary signal to obtain the downmix signal. In order to superimpose the auxiliary signals, the combiner may be configured to sum up the auxiliary signals.

According to an implementation form of the first aspect, the processor is configured to multiply the first audio channel signal by the first phase shift coefficient, or to multiply the second audio channel signal by the second phase shift coefficient for phase amendment. The processor may comprise at least one multiplier to multiply the respective audio channel signal.

According to an implementation form of the first aspect, the audio signal generator further comprises a transformer for transforming a first time-domain signal into frequency domain to obtain the first audio channel signal, and for transforming a second time-domain signal into frequency domain to obtain the second audio channel signal. The transformer may be a Fourier transformer.

According to an implementation form of the first aspect, the downmix audio signal is a frequency domain signal, and the audio signal generator further comprises a transformer for transforming the downmix audio signal into time domain. The transformer may be e.g. an inverse Fourier transformer.

Furthermore, each implementation form of the first aspect may be combined with any other implementation form of the first aspect to obtain further implementation forms of the first aspect of the invention.

According to a second aspect, the invention relates to a method for generating a downmix audio signal from a multi-channel audio signal comprising a first audio channel signal and a second audio channel signal, the method comprising amending a phase of the first audio channel signal using a first phase shift coefficient, and/or amending a phase of the second audio channel signal using a second phase shift coefficient to reduce signal cancellations when combining the resulting first and second audio channel signal, and combining the resulting first and second audio channel signal to obtain the downmix audio signal.

According to some implementation forms of the second aspect or according to another aspect, a method is provided for generating a downmix signal of multiple input audio channels. The method may comprise the steps of receiving a plurality of input audio channels, converting the input audio channels to a plurality of subbands, estimating the phase difference between the input audio channels and a reference audio channel, modifying the phase of at least one input audio channel subband to match the phase of the corresponding reference audio channel subband, generating a sum of the modified input audio channel subbands to generate the downmix signal subbands, and converting the downmix signal subbands to the time-domain to generate the downmix output signal.
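The steps listed above can be illustrated with a minimal single-frame sketch. It is only an illustration under stated assumptions: a naive DFT stands in for the subband conversion, the left channel is taken as the reference, and the phase difference is estimated frame by frame without averaging, windowing or overlap. All function names are hypothetical.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform of one frame (O(N^2), for illustration)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * i * t / n) for t in range(n))
            for i in range(n)]


def idft(X):
    """Naive inverse DFT; returns complex samples."""
    n = len(X)
    return [sum(X[i] * cmath.exp(2j * math.pi * i * t / n) for i in range(n)) / n
            for t in range(n)]


def downmix_frame(left, right):
    """Downmix one stereo frame, using the left channel as the reference."""
    X1, X2 = dft(left), dft(right)
    out = []
    for a, b in zip(X1, X2):
        c = a * b.conjugate()            # phase of c = phase(a) - phase(b)
        p2 = c / abs(c) if abs(c) > 1e-12 else 1.0
        out.append(a + p2 * b)           # right subband rotated onto the left
    return [v.real for v in idft(out)]


left = [math.sin(2 * math.pi * 3 * t / 32) for t in range(32)]
right = [-v for v in left]               # phase-inverted copy of the left channel
plain_sum = [l + r for l, r in zip(left, right)]
aligned = downmix_frame(left, right)
```

In this example the plain sum cancels completely, whereas the phase-aligned downmix retains the signal.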

According to a third aspect, the invention relates to a computer program for performing the method for generating a downmix audio signal when run on a computer.

BRIEF DESCRIPTION OF THE DRAWINGS

Further embodiments of the invention will be described with respect to the following figures, in which:

Fig. 1 shows a block diagram of an audio signal generator; and

Fig. 2 shows a diagram of a method for generating a downmix signal.

DETAILED DESCRIPTION OF EMBODIMENTS OF THE INVENTION

Fig. 1 shows a block diagram of an audio signal generator according to an implementation form. For brevity, the following descriptions may refer to a stereo signal forming an embodiment of a multi-channel signal. Thus, the left and right channels of the stereo signal may form embodiments of the first and second audio channel signal of a multi-channel audio signal.

As shown in Fig. 1, the audio signal generator may comprise a transformer 101 for transforming a left time-domain channel x₁(n) of a stereo signal and a right time-domain channel x₂(n) of the stereo signal into frequency domain to obtain a first audio channel signal X₁(k, i) and a second audio channel signal X₂(k, i) in frequency domain. The first and second audio channel signals are provided to a processor 103 which is configured to amend a phase of the first audio channel signal using a first phase shift coefficient P₁(k, i) and/or to amend a phase of the second audio channel signal using a second phase shift coefficient P₂(k, i) to reduce signal cancellations when combining the resulting first and second audio channel signal after amendment. In order to amend the respective phase of the respective audio channel signal, the processor may comprise a first multiplier 105 for multiplying the first audio channel signal with the first phase shift coefficient, and a second multiplier 107 for multiplying the second audio channel signal with the second phase shift coefficient.

The outputs of the multipliers 105 and 107 may be provided to a combiner 109 for combining, e.g. superimposing, the resulting first and second audio channel signal to obtain the downmix audio signal. In order to determine the first and second phase shift coefficient, the processor 103 may comprise a downmix parameter computer 110 receiving the outputs of the transformer 101. The downmix parameter computer 110 may be configured to determine the first and second phase shift coefficient according to the principles and/or upon the basis of the formulas described herein.

Optionally, the audio signal generator may comprise a further multiplier 111 for weighting the output of the combiner 109 with a power factor M(k, i). Optionally, the processor 103 may be configured to weight the output of the combiner 109 with the power factor. At the output of the combiner 109 or at the output of the multiplier 111, a downmix audio signal X(k, i) in frequency domain may result. The downmix audio signal in frequency domain may be transformed into time domain using e.g. an inverse filter bank 113, which may be implemented as an inverse Fourier transform by way of example.

The transformer 101 may, correspondingly, comprise a first filter bank 115 for transforming the left channel to obtain the first audio channel signal, and a second filter bank 117 for transforming the right channel to obtain the second audio channel signal in frequency domain. The filter banks 115, 117 may be implemented as Fourier transformers.

Fig. 2 shows a diagram of a method for generating a downmix audio signal from a multi-channel audio signal which comprises a first audio channel signal and a second audio channel signal. The method comprises amending 201 a phase of the first audio channel signal using a first phase shift coefficient, and/or amending 203 a phase of the second audio channel signal using a second phase shift coefficient, and combining 205 the resulting first and second audio channel signal to obtain the downmix audio signal.

With respect to Fig. 1, the left and right time-domain channels of a stereo signal are denoted x₁(n) and x₂(n), where n is the discrete time index. For downmix processing, the signals are converted to a time-frequency representation. The left and right stereo signal channels in the time-frequency representation are denoted X₁(k, i) and X₂(k, i), where k is e.g. a downsampled time index (also referred to as frame index) and i is a frequency index. Without loss of generality, it may be assumed in the following that a complex-valued time-frequency representation is used.

The downmix signal is computed as

$$X(k,i) = M(k,i)\,\bigl(P_1(k,i)\,X_1(k,i) + P_2(k,i)\,X_2(k,i)\bigr)$$

where M(k, i) is an optional real-valued gain factor and P₁(k, i) and P₂(k, i) are complex left and right "phase alignment" factors with magnitude one. Fig. 1 shows the processing scheme which is applied to generate the downmix signal. The left and right signals, x₁(n) and x₂(n), are converted to a time-frequency domain by a transform or filterbank (FB). Downmix processing parameters are computed and applied prior to adding the left and right subband signals to generate the subband downmix signal. The subband downmix signal is converted back to time domain using an inverse filterbank/transform (IFB).
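For a single time-frequency tile, the downmix equation reduces to one complex multiply-add. The following sketch is a hedged illustration only; the function name `downmix_bin` and the example values are assumptions, and the phase factor is derived directly from the two samples rather than from an averaged estimate.

```python
import cmath


def downmix_bin(x1, x2, p1=1.0, p2=1.0, m=1.0):
    """Return X = M * (P1*X1 + P2*X2) for one time-frequency tile."""
    return m * (p1 * x1 + p2 * x2)


# Two subband samples of equal magnitude that are 180 degrees out of phase.
x1 = cmath.rect(1.0, 0.25 * cmath.pi)
x2 = cmath.rect(1.0, 1.25 * cmath.pi)

plain = downmix_bin(x1, x2)  # no phase alignment: the samples cancel

# Unit-magnitude factor that rotates x2 onto the phase of x1.
p2 = cmath.exp(1j * (cmath.phase(x1) - cmath.phase(x2)))
aligned = downmix_bin(x1, x2, p2=p2)  # the samples add in phase
```

With unit factors the two samples cancel; with the phase factor applied to the second channel they add constructively.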

The goal is to determine P₁(k, i) and P₂(k, i) such that the left and right channels add in phase to prevent potentially time-dependent signal cancellations.

Additionally, the real-valued factor M(k, i) is determined such that the power of X(k, i) is the same as or approximates the sum of the powers of X₁(k, i) and X₂(k, i).

One strategy is to align one channel, e.g. X₂(k,i), relative to the other channel, e.g. X₁(k,i). This may be achieved by choosing

$$P_1(k,i) = 1$$

$$P_2(k,i) = \frac{E\{X_1(k,i)\,X_2^*(k,i)\}}{\left|E\{X_1(k,i)\,X_2^*(k,i)\}\right|}$$

where E{.} is a short-time averaging operation, |.| is the absolute value of a complex number, and * denotes the complex conjugate. For the averaging operation, single-pole averaging with an 80 ms time constant may be chosen.

As mentioned above, M(k,i) may be computed such that the power of the downmix signal is the same as or approximates the sum of the powers of the left and right channels. This may be achieved by using

$$M(k,i) = \sqrt{\frac{E\{X_1(k,i)\,X_1^*(k,i)\} + E\{X_2(k,i)\,X_2^*(k,i)\}}{E\{\left|P_1(k,i)\,X_1(k,i) + P_2(k,i)\,X_2(k,i)\right|^2\}}}$$

To improve performance in terms of artifacts when M(k, i) becomes too large or too small, the range of M(k, i) may be limited to [0.5, 2], corresponding to ±6 dB.
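The averaging, the alignment of one channel and the limited gain can be combined per subband as in the following sketch. Several points are assumptions rather than details given in the text: a frame hop of 10 ms, the mapping from the 80 ms time constant to the smoothing weight, the gain being the square root of the power ratio, and the class names `OnePole` and `SubbandDownmixer`.

```python
import math


class OnePole:
    """Single-pole (exponential) average approximating E{.} for one subband."""

    def __init__(self, hop_s, tau_s=0.080):
        # Smoothing weight derived from the time constant (assumed mapping).
        self.alpha = 1.0 - math.exp(-hop_s / tau_s)
        self.value = 0.0

    def update(self, sample):
        self.value += self.alpha * (sample - self.value)
        return self.value


def clamp(v, lo, hi):
    """Limit v to the closed interval [lo, hi]."""
    return max(lo, min(hi, v))


class SubbandDownmixer:
    """Aligns X2 to X1 in one subband and equalizes the downmix power."""

    def __init__(self, hop_s=0.010):       # 10 ms frame hop is an assumption
        self.cross = OnePole(hop_s)        # E{X1 X2*}
        self.p_in = OnePole(hop_s)         # E{X1 X1*} + E{X2 X2*}
        self.p_out = OnePole(hop_s)        # E{|P1 X1 + P2 X2|^2}

    def process(self, x1, x2):
        c = self.cross.update(x1 * x2.conjugate())
        p2 = c / abs(c) if abs(c) > 1e-12 else 1.0   # P1 is fixed to one
        s = x1 + p2 * x2
        num = self.p_in.update(abs(x1) ** 2 + abs(x2) ** 2)
        den = self.p_out.update(abs(s) ** 2)
        m = math.sqrt(num / den) if den > 1e-12 else 1.0
        return clamp(m, 0.5, 2.0) * s      # gain limited to about +/- 6 dB


dm = SubbandDownmixer()
y = 0j
for _ in range(50):
    # Phase-inverted pair: a plain sum would be zero in every frame.
    y = dm.process(1 + 0j, -1 + 0j)
```

For this phase-inverted input of unit power per channel, the output power |y|² settles at 2, i.e. the sum of the two channel powers.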

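According to some embodiments, the following formulas may be used to obtain the phase shift coefficients:

$$P_1(k,i) = P(k,i)^*$$

$$P_2(k,i) = P(k,i)$$

with

$$P(k,i) = \sqrt{\frac{E\{X_1(k,i)\,X_2^*(k,i)\}}{\left|E\{X_1(k,i)\,X_2^*(k,i)\}\right|}}$$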
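A symmetric split of the phase correction can be sketched as follows. The sketch assumes that P(k,i) is the unit-magnitude principal square root of the normalized cross term, which is one way to apply half of the correction to each channel; the function name `split_alignment` is hypothetical and a single sample pair stands in for the averaged term E{X₁X₂*}.

```python
import cmath


def split_alignment(c):
    """Return (P1, P2) with P2 = sqrt(c/|c|) and P1 = conj(P2).

    c stands for the averaged cross term E{X1 X2*}.
    """
    if abs(c) < 1e-12:
        return 1.0 + 0j, 1.0 + 0j
    p = cmath.exp(0.5j * cmath.phase(c))   # principal square root of c/|c|
    return p.conjugate(), p


x1 = cmath.rect(1.0, 0.5)      # phase 0.5 rad
x2 = cmath.rect(1.0, -0.7)     # phase -0.7 rad
p1, p2 = split_alignment(x1 * x2.conjugate())
y1, y2 = p1 * x1, p2 * x2      # both rotated by 0.6 rad towards each other
```

Both rotated samples end up at the midpoint phase of -0.1 rad, so each channel is rotated by only half of the 1.2 rad phase difference.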

According to the above formulas, both audio channel signals, representing e.g. a right channel and a left channel, may be phase modified. As opposed to applying the whole phase correction to one channel, half of the phase correction may be applied to each of the two channels, which may have the advantage that the maximum audio waveform modification is smaller.

Alternatively, one may phase-align both audio channel signals, e.g. the left and right channel of a stereo signal, relative to the sum signal, i.e.

$$P_1(k,i) = \frac{E\{S(k,i)\,X_1^*(k,i)\}}{\left|E\{S(k,i)\,X_1^*(k,i)\}\right|}$$

$$P_2(k,i) = \frac{E\{S(k,i)\,X_2^*(k,i)\}}{\left|E\{S(k,i)\,X_2^*(k,i)\}\right|}$$

with S(k,i) = X₁(k,i) + X₂(k,i) forming an embodiment of a reference audio signal.

According to some embodiments, instead of using a sum signal, a reference signal may be used which has a phase which may be a weighted sum of the phases of both channels and a magnitude which is the sum or norm of the magnitudes of both channels. That is, the phase shift coefficients may be computed with this reference signal ("sum signal") in place of S(k,i).

Such a signal may have the following properties:

• The power spectrum is the sum of the left and right power spectra, such that during time-averaging operations the phase will be weighted by signal power.

• The phase is a weighted average of the phases of the left and right, i.e. first and second, channels. The weights may be chosen such that the phase of the stronger channel may dominate.

According to some implementation forms, the reference signal may be one of the first or second audio channel signals.

According to some implementation forms, the reference signal may be the sum of the first and second audio channel signal.
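Alignment to the sum signal can be sketched for a single time-frequency tile as follows. This is a simplified illustration: the averaging operation E{.} is omitted, so the factors are computed from instantaneous values, and the helper names `unit` and `align_to_sum` are hypothetical.

```python
import cmath


def unit(c, eps=1e-12):
    """Normalize a complex number to magnitude one (1 if it is close to zero)."""
    return c / abs(c) if abs(c) > eps else 1.0 + 0j


def align_to_sum(x1, x2):
    """Phase factors that rotate both channels onto S = X1 + X2."""
    s = x1 + x2
    p1 = unit(s * x1.conjugate())
    p2 = unit(s * x2.conjugate())
    return p1, p2


x1 = 2.0 + 0j           # stronger channel
x2 = 0.0 + 1.0j         # weaker channel, 90 degrees ahead
p1, p2 = align_to_sum(x1, x2)
downmix = p1 * x1 + p2 * x2
reference_phase = cmath.phase(x1 + x2)
```

Both rotated channels share the phase of S, so their magnitudes add and the downmix keeps the phase of the sum signal. If S is close to zero, the sketch falls back to unit factors and does not prevent the cancellation.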

According to some implementation forms, the reference signal may be a signal with a magnitude which is a combination of the input signal subband magnitudes, and a phase which is a combination of the input signal subband phases.

According to some implementation forms, a phase difference may be estimated using an averaging process over multiple frames.

According to some implementation forms, a gain factor may be applied to the downmix subband signals for magnitude equalization, after summation.

Claims

1. Audio signal generator for generating a downmix audio signal from a multi-channel audio signal comprising a first audio channel signal and a second audio channel signal, the audio signal generator comprising: a processor (103) for amending a phase of the first audio channel signal using a first phase shift coefficient, and/or for amending a phase of the second audio channel signal using a second phase shift coefficient to reduce signal cancellations when combining the resulting first and second audio channel signal; and a combiner (109) for combining the resulting first and second audio channel signal to obtain the downmix audio signal.

2. The audio signal generator of claim 1, wherein the processor (103) is configured to amend the phase of the first audio channel signal or the phase of the second audio channel signal to match a phase of a reference signal.

3. The audio signal generator of claim 1 or 2, wherein the processor (103) is configured to determine a mean value of a product of the first audio channel signal and the second audio channel signal to obtain the first phase shift factor or the second phase shift factor.

4. The audio signal generator of any of the preceding claims, wherein the processor (103) is configured to set the first phase shift coefficient or the second phase shift coefficient to one.

5. The audio signal generator of any of the preceding claims, wherein the first phase shift coefficient is a complex-conjugated version of the second phase shift coefficient.

6. The audio signal generator of any of the preceding claims, wherein the processor (103) is configured to determine the first phase shift coefficient P₁(k,i) and the second phase shift coefficient P₂(k,i), k denoting a time index, i denoting a frequency index, upon the basis of the following formulas:

$$P_1(k,i) = 1$$

$$P_2(k,i) = \frac{E\{X_1(k,i)\,X_2^*(k,i)\}}{\left|E\{X_1(k,i)\,X_2^*(k,i)\}\right|}$$

wherein X₁(k,i) and X₂(k,i) respectively denote the first audio channel signal and the second audio channel signal, and wherein E{.} denotes an averaging operation.

7. The audio signal generator of any of the preceding claims, wherein the processor (103) is configured to determine the first phase shift coefficient P₁(k,i) and the second phase shift coefficient P₂(k,i), k denoting a time index, i denoting a frequency index, upon the basis of the following formulas:

$$P_1(k,i) = P(k,i)^*$$

$$P_2(k,i) = P(k,i)$$

with

$$P(k,i) = \sqrt{\frac{E\{X_1(k,i)\,X_2^*(k,i)\}}{\left|E\{X_1(k,i)\,X_2^*(k,i)\}\right|}}$$

wherein X₁(k,i) and X₂(k,i) respectively denote the first audio channel signal and the second audio channel signal, and wherein E{.} denotes an averaging operation.

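8. The audio signal generator of any of the preceding claims, wherein the processor (103) is configured to determine the first phase shift coefficient P₁(k,i) and the second phase shift coefficient P₂(k,i), k denoting a time index, i denoting a frequency index, upon the basis of the following formulas:

$$P_1(k,i) = \frac{E\{S(k,i)\,X_1^*(k,i)\}}{\left|E\{S(k,i)\,X_1^*(k,i)\}\right|}$$

$$P_2(k,i) = \frac{E\{S(k,i)\,X_2^*(k,i)\}}{\left|E\{S(k,i)\,X_2^*(k,i)\}\right|}$$

wherein S(k,i) = X₁(k,i) + X₂(k,i), or wherein S(k,i) is a reference signal whose phase is a weighted sum of the phases of both channels and whose magnitude is the sum or norm of the magnitudes of both channels, wherein X₁(k,i) and X₂(k,i) respectively denote the first audio channel signal and the second audio channel signal, and wherein E{.} denotes an averaging operation.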

9. The audio signal generator of any of the preceding claims, wherein the processor (103) is configured to weight the downmix signal by a power factor, in particular by a power factor which depends on a sum of powers of the first channel audio signal and the second channel audio signal.

10. The audio signal generator of any of the preceding claims, wherein the combiner (109) is configured to superimpose the first auxiliary signal and the second auxiliary signal to obtain the downmix signal.

11. The audio signal generator of any of the preceding claims, wherein the processor (103) is configured to multiply the first audio channel signal by the first phase shift coefficient, or to multiply the second audio channel signal by the second phase shift coefficient for phase amendment.

12. The audio signal generator of any of the preceding claims, further comprising a transformer (101) for transforming a first time-domain signal into frequency domain to obtain the first audio channel signal, and for transforming a second time-domain signal into frequency domain to obtain the second audio channel signal.

13. The audio signal generator of any of the preceding claims, wherein the downmix audio signal is a frequency domain signal, and wherein the audio signal generator further comprises a transformer (113) for transforming the downmix audio signal into time-domain.

14. A method for generating a downmix audio signal from a multi-channel audio signal comprising a first audio channel signal and a second audio channel signal, the method comprising: amending (201) a phase of the first audio channel signal using a first phase shift coefficient to reduce signal cancellations when combining the resulting first and second audio channel signal; and/or amending (203) a phase of the second audio channel signal using a second phase shift coefficient to reduce signal cancellations when combining the resulting first and second audio channel signal; and combining (205) the resulting first and second audio channel signal to obtain the downmix audio signal.