WO2009125046A1 - Processing of signals


Info

Publication number
WO2009125046A1
Authority
WO
WIPO (PCT)
Prior art keywords
signal
combined
audio
correction factor
audio signal
Prior art date
Application number
PCT/FI2008/050182
Other languages
French (fr)
Inventor
Pasi Ojala
Original Assignee
Nokia Corporation
Priority date
Filing date
Publication date
Application filed by Nokia Corporation filed Critical Nokia Corporation
Priority to CN200880129124.2A priority Critical patent/CN102027535A/en
Priority to PCT/FI2008/050182 priority patent/WO2009125046A1/en
Publication of WO2009125046A1 publication Critical patent/WO2009125046A1/en

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/008 - Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing


Abstract

Disclosed are a method and an apparatus for processing audio signals. Two or more audio signals are input and analysed to form a set of parameters. At least two of said two or more audio signals are combined to form a combined audio signal. In the method the signal level of the combined audio signal is determined, and a correction factor is determined on the basis of a difference between the signal level of the combined audio signal and a signal level of at least one of the inputted audio signals. The correction factor can be used to reduce the difference between the signal level of the combined audio signal and the signal level of the inputted audio signal. There is also disclosed a method for synthesizing the audio signals from the combined audio signal. The parameters can be used in the synthesizing. Also disclosed is a computer program comprising program code means adapted to perform the processing of audio signals when the program is run on a processor.

Description

Processing of signals
Field of the Invention
The invention relates to binaural audio coding and the representation of multichannel audio sources. More particularly, it relates to a method and apparatus for forming a combined audio signal, and to a method and apparatus for reconstructing two or more audio signals from the combined audio signal.
Background Information
A spatial audio scene consists of audio sources and the ambience around a listener. Figure 1 presents an example situation with different sound sources 101, 102 around the listener 103 or a dummy head recording device 104a, 104b. In addition, there is ambient background noise caused by the room effect, i.e. the reverberation of the audio sources due to the properties of the space in which the audio sources are located. The audio image is perceived from the directions of arrival of sound from the audio sources as well as from the reverberation. A human being is able to capture the three-dimensional image using the signals from the left and the right ear. Hence, recording the audio image using microphones placed close to the ear drums is sufficient to capture the spatial audio image.
An efficient transmission and representation of spatial audio image using two channels may require a specific coding algorithm for the audio content. The spatial information may need to be conveyed efficiently to the receiver and representation device in which the captured scene is rendered.
Summary of Some Examples of the Invention
An example embodiment of the present invention provides a method in which signals from multiple sources are down-mixed to a smaller number of signals and information relating to the ambience is also formed. The down-mixed signals can be up-mixed to form multiple signals resembling at least some of the original signals and taking the ambience into consideration. The idea of an example embodiment of the invention is a binaural audio encoding algorithm taking into account one or more ambience components. The algorithm optionally comprises performing a time-to-frequency transform and/or analysis of binaural audio signals. The algorithm estimates the level and time difference between channels. This estimation may use the optional time-to-frequency coefficients. The algorithm also estimates an inter-channel level correction gain for the down-mixed signal to incorporate the ambient signal contribution. The inter-channel level and time difference and the ambient level correction cue information on one or more subbands can be transmitted and/or stored. The down-mixed signal can be encoded by an encoder which may be a speech/audio encoder. The binaural signal reconstruction in the receiving end can be performed by, for example, synthesising the ambience signal component using the level correction information, decoding the down-mixed signal with a decoder, time-to-frequency transforming and analysing the down-mixed signal, synthesising the multi-channel signals using the received inter-channel level and time differences in one or more subbands, and synthesising the ambient component by decorrelation of the binaural signal in one or more subbands using the ambient level correction cue.
According to a first aspect of the present invention there is provided a method comprising:
- inputting two or more audio signals;
- analysing the audio signals to form a set of parameters;
- combining at least two of said two or more audio signals to form a combined audio signal;
The method is characterised in that the analysing comprises
- determining the signal level of the combined audio signal;
- determining a correction factor on a basis of a difference between the signal level of the combined audio signal and a signal level of at least one of the inputted audio signals to reduce difference between the signal level of the combined audio signal and the signal level of the inputted audio signal.
According to a second aspect of the present invention there is provided a method comprising:
- inputting a combined audio signal and one or more parameters relating to the audio signals from which the combined audio signal has been formed;
- synthesizing two or more audio signals on the basis of the combined audio signal and said one or more parameters; and
- using the set of parameters to amend the synthesized audio signals to reconstruct the ambience in the audio signals.
The method is characterised in that said one or more parameters comprise a correction factor, and the method comprises using the correction factor in said synthesizing two or more audio signals.
According to a third aspect of the present invention there is provided an apparatus comprising:
- an input for inputting two or more audio signals;
- an analyser for analysing the audio signals to form a set of parameters;
- a combiner for combining at least two of said two or more audio signals to form a combined audio signal;
The apparatus is characterised in that the analyser comprises
- a level determinator for determining the signal level of the combined audio signal;
- a gain determinator for determining a correction factor on a basis of a difference between the signal level of the combined audio signal and a signal level of at least one of the inputted audio signals to reduce difference between the signal level of the combined audio signal and the signal level of the inputted audio signal.
According to a fourth aspect of the present invention there is provided an apparatus comprising:
- an input for inputting a combined audio signal and one or more parameters relating to the audio signals from which the combined audio signal has been formed;
- a synthesizer for synthesizing two or more audio signals on the basis of the combined audio signal and said one or more parameters.
The apparatus is characterised in that said one or more parameters comprise a correction factor, and the apparatus comprises a corrector for using the correction factor in said synthesizing two or more audio signals.
According to a fifth aspect of the present invention there is provided a computer program comprising program code means adapted to perform the following when the program is run on a processor:
- inputting two or more audio signals;
- analysing the audio signals to form a set of parameters;
- combining at least two of said two or more audio signals to form a combined audio signal;
The computer program is characterised in that it comprises program code means adapted to
- determine the signal level of the combined audio signal;
- determine a correction factor on a basis of a difference between the signal level of the combined audio signal and a signal level of at least one of the inputted audio signals to reduce difference between the signal level of the combined audio signal and the signal level of the inputted audio signal.
According to a sixth aspect of the present invention there is provided a computer program comprising program code means adapted to perform the following when the program is run on a processor:
- inputting a combined audio signal and one or more parameters relating to the audio signals from which the combined audio signal has been formed;
- synthesizing two or more audio signals on the basis of the combined audio signal and said one or more parameters;
The computer program is characterised in that said one or more parameters comprise a correction factor, and the computer program comprises program code means adapted to use the correction factor in said synthesizing two or more audio signals.
The developed concept can be applied to, for example, telepresence and audio/video conferencing services. Some examples of the invention relate to speech and audio coding, media adaptation, transmission of real time multimedia over packet switched network (e.g. Voice over IP), etc.
Description of the Drawings
In the following some example embodiments of the invention will be described in more detail with reference to the drawings, in which
Fig. 1 depicts an example of spatial audio image capture using two microphones,
Fig. 2 depicts an example of a binaural and multi-channel audio analysis functionality,
Fig. 3 depicts an example of determining inter-channel level difference, inter-channel time difference, and inter-channel coherence between channel pairs for different subbands and time instants,
Fig. 4 depicts an example of a binaural synthesis,
Fig. 5 depicts an example of a multi channel audio encoding and decoding algorithm,
Fig. 6 depicts an example embodiment of an encoder according to the present invention as a simplified block diagram,
Fig. 7 depicts an example embodiment of a decoder according to the present invention as a simplified block diagram,
Fig. 8a depicts an example embodiment of an encoding method according to the present invention as a simplified flow chart,
Fig. 8b depicts an example embodiment of an analysis phase according to the present invention as a simplified flow chart,
Fig. 9 depicts an example embodiment of a decoding method according to the present invention as a simplified flow chart,
Fig. 10 depicts an example of a device in which the invention can be applied, and
Fig. 11 depicts an example of a system in which the invention can be applied.
Detailed Description of Some Examples of the Invention
One method for spatial audio coding is the binaural cue coding (BCC) parameterisation, in which the input signal consisting of two or more channels is first transformed into the time-frequency domain using, for example, the Fourier transform or quadrature mirror filter bank (QMF) decomposition. In the transformation a temporal portion of the audio signal of a channel is transformed into the frequency domain, wherein the frequency domain representation of the signal comprises a number of subbands. Hence, for a certain time instant k there are a number of subband representations of the audio signal.
Figure 2 presents the basic idea of spatial audio coding. The audio scene 201 is analysed 202 in the transform domain 203 and the corresponding parameterisation is transmitted to a receiver. The scene parameters could also be used in down-mixing 204 a multi-channel sound, e.g. to remove the time difference between the channels. The down-mixed signal 205 can then be forwarded to e.g. a mono/stereo audio encoder.
BCC analysis
The BCC analysis consists of inter-channel level difference (ILD) and inter-channel time difference (ITD) parameters estimated within each transform domain time-frequency (time-subband) slot. In addition, the inter-channel coherence (IC) between each or some of the channel pairs can be determined. These parameters can also be called BCC cues or inter-channel cues. Figure 3 discloses an example of inter-channel level difference and inter-channel time difference estimation for multi-channel audio content. The inter-channel level difference and inter-channel time difference parameters are determined between each channel pair. The inter-channel coherence is typically determined individually for each channel. In the case of a binaural audio signal consisting of two channels, the BCC cues are determined between the decomposed left and right channels.
The inter-channel level difference (ILD) for each subband, $\Delta L_n$, is typically estimated in the logarithmic domain as follows:

$$\Delta L_n = 10\log_{10}\frac{s_n^{L\,T} s_n^L}{s_n^{R\,T} s_n^R}, \qquad (1)$$

where $s_n^L$ and $s_n^R$ are the time domain left and right channel signals in subband $n$, respectively.

The inter-channel time difference (ITD), i.e. the delay between the left channel and the right channel, is determined for each subband $n$ as follows:

$$\tau_n = \arg\max_d \{\Phi_n(d,k)\}, \qquad (2)$$

where $\Phi_n(d,k)$ is the normalised correlation

$$\Phi_n(d,k) = \frac{s_n^L(k-d_1)^T s_n^R(k-d_2)}{\sqrt{\left(s_n^L(k-d_1)^T s_n^L(k-d_1)\right)\left(s_n^R(k-d_2)^T s_n^R(k-d_2)\right)}}, \qquad (3)$$

where

$$d_1 = \max\{0, -d\}, \qquad d_2 = \max\{0, d\}. \qquad (4)$$
The normalised correlation is actually the inter-channel coherence (IC) parameter. It is typically utilised for capturing the ambient components that are decorrelated with the "dry" sound components represented by the phase and magnitude parameters in Equations (1) and (2). The dry sound components are the pure sound signals from the different audio sources, without the signals caused by reverberation of the sound sources, e.g. due to a room effect.
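As a concrete illustration of Equations (1)-(4), the time-domain cue estimation for one subband could be sketched in Python/NumPy as follows; the function name, the lag range and the epsilon guard are illustrative assumptions of the sketch, not details from the publication:

```python
import numpy as np

def time_domain_cues(s_l, s_r, max_lag=40, eps=1e-12):
    """ILD (dB), ITD (samples) and IC for one subband frame per
    Equations (1)-(4). A sketch: names, the lag range and the eps
    guard are illustrative choices, not taken from the publication."""
    # Equation (1): level difference in the logarithmic domain.
    ild = 10.0 * np.log10((s_l @ s_l + eps) / (s_r @ s_r + eps))

    best_phi, best_lag = -1.0, 0
    for d in range(-max_lag, max_lag + 1):
        d1, d2 = max(0, -d), max(0, d)      # Equation (4)
        n = min(len(s_l) - d1, len(s_r) - d2)
        x, y = s_l[d1:d1 + n], s_r[d2:d2 + n]
        # Equation (3): normalised cross-correlation at lag d.
        phi = (x @ y) / np.sqrt((x @ x) * (y @ y) + eps)
        if phi > best_phi:
            best_phi, best_lag = phi, d
    # Equation (2): the maximising lag is the ITD; the maximum itself
    # is the inter-channel coherence (IC).
    return ild, best_lag, best_phi
```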
BCC coefficients could as well be determined in a transform domain such as the discrete Fourier transform (DFT) domain. Using a windowed Short Time Fourier Transform (STFT), the subband signals above are converted to grouped transform coefficients. $S_n^L$ and $S_n^R$ are the spectral coefficient vectors of the left and right binaural signal for subband $n$ of the given analysis frame, respectively. The transform domain inter-channel level difference parameter can be determined correspondingly to Equation (1):

$$\Delta L_n = 10\log_{10}\frac{S_n^{L\,*} S_n^L}{S_n^{R\,*} S_n^R}, \qquad (5)$$

where $*$ denotes the complex conjugate.

The inter-channel time difference (ITD) is easier to handle as the inter-channel phase difference (ICPD):

$$\phi_n = \angle\left(S_n^{L\,*} S_n^R\right). \qquad (6)$$

The inter-channel coherence calculation is quite similar to the time domain calculation in Equation (3):

$$\Phi_n = \frac{\left|S_n^{L\,*} S_n^R\right|}{\sqrt{\left(S_n^{L\,*} S_n^L\right)\left(S_n^{R\,*} S_n^R\right)}}. \qquad (7)$$
The BCC determination in the discrete Fourier transform domain requires significantly less computation when the time domain inter-channel time difference estimation using correlation is replaced by inter-channel phase difference estimation on the discrete Fourier transform domain spectral coefficients.
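A corresponding transform domain sketch of Equations (5)-(7), again with illustrative names, could be:

```python
import numpy as np

def dft_domain_cues(S_l, S_r, eps=1e-12):
    """ILD, ICPD and IC for one subband from the spectral coefficient
    vectors, per Equations (5)-(7); a sketch with illustrative names."""
    e_l = np.vdot(S_l, S_l).real            # conj(S_l) . S_l
    e_r = np.vdot(S_r, S_r).real
    cross = np.vdot(S_l, S_r)               # conj(S_l) . S_r
    ild = 10.0 * np.log10((e_l + eps) / (e_r + eps))   # Equation (5)
    icpd = np.angle(cross)                              # Equation (6)
    ic = np.abs(cross) / np.sqrt(e_l * e_r + eps)       # Equation (7)
    return ild, icpd, ic
```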
The unified domain transform (UDT) could be considered as a special case of binaural cue coding. The UDT for binaural (two channel) audio consists of a rotation matrix describing the locations of sound sources. The rotation matrix in two dimensions, i.e. with two input channels, is

$$R = \begin{bmatrix} \cos\sigma & \sin\sigma \\ -\sin\sigma & \cos\sigma \end{bmatrix}, \qquad (8)$$

where the components of the rotation matrix are

$$\cos\sigma = \frac{S_n^L}{\sqrt{S_n^{L\,2} + S_n^{R\,2}}}, \qquad (9)$$

and

$$\sin\sigma = \frac{S_n^R}{\sqrt{S_n^{L\,2} + S_n^{R\,2}}}. \qquad (10)$$

Basically, in the case of a two dimensional matrix, the components can be understood as an amplitude panning of a stereo signal. When the signal phase is taken into account, the UDT domain signal can be calculated as

$$\begin{bmatrix} M_n \\ 0 \end{bmatrix} = R \begin{bmatrix} S_n^L e^{-j\phi_n^L} \\ S_n^R e^{-j\phi_n^R} \end{bmatrix}, \qquad (11)$$

where the complex valued $\phi_n^L$ and $\phi_n^R$ are the phases of the left input signal and the right input signal, respectively. $M_n = \sqrt{S_n^{L\,2} + S_n^{R\,2}}$ is basically the rotated down-mixed signal from which the phase has been removed.

Looking at the rotation matrix, it can be noticed that

$$\tan\sigma = \frac{S_n^R}{S_n^L}, \qquad (12)$$

which is actually related to the ILD value in Equation (5). Furthermore, the phase values could be conveyed as the phase difference, i.e. the ICPD. Hence, the unified domain transform is closely related to the BCC parameterisation.
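A minimal sketch of the rotation and phase removal described by Equations (8)-(12), assuming per-subband magnitudes stand in for the signal levels in the rotation matrix, could be:

```python
import numpy as np

def udt_downmix(S_l, S_r):
    """Rotation angle and phase-removed down-mix per Equations (8)-(12);
    a per-subband sketch, with magnitudes standing in for the signal
    levels used in the rotation matrix."""
    mag_l, mag_r = np.abs(S_l), np.abs(S_r)
    M = np.sqrt(mag_l ** 2 + mag_r ** 2)        # rotated, phase-removed M_n
    sigma = np.arctan2(mag_r, mag_l)            # Equations (9), (10), (12)
    icpd = np.angle(np.conj(S_l) * S_r)         # phase cue, as in Equation (6)
    return M, sigma, icpd
```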
The level and time/phase difference cues represent the dry surround sound components. They basically model the sound source locations in space; the ILD and ITD/ICPD cues essentially represent surround sound panning coefficients. The coherence cue, on the other hand, is supposed to cover the relation between coherent and decorrelated sounds. The level of late reverberation of the sound sources, e.g. due to the room effect, and the ambient sound distributed between the input channels may make a significant contribution to the spatial audio sensation. Therefore, a proper estimation and synthesis of the inter-channel cues is a matter of importance in binaural coding.
Principal component analysis (PCA) of binaural and multi-channel audio tries to separate the correlated directional sources and the ambient signals. It can be assumed that the surround sound consists of directional sources constructed of the source signals panned in different directions, and additive ambience. Hence, the eigenvalues of the covariance matrix of the surround sound depend on the panning gains, the variances of the directional sources and ambient signals, and the correlation of the ambience. This means that the determined eigenvectors are used to project the input binaural signal onto principal components. The highest eigenvalue corresponds to the directional components, while the remainder is considered as the ambience.
The ambience is also visible in the unified transform domain. In practice, when the rotation and phase removal are done according to Equation (11), the outcome is actually

$$\begin{bmatrix} M_n + A_n \\ 0 \end{bmatrix} = R \begin{bmatrix} S_n^L e^{-j\phi_n^L} \\ S_n^R e^{-j\phi_n^R} \end{bmatrix}, \qquad (13)$$

in which $A_n$ is the ambient signal. The phase cancellation as well as the rotation may not be absolutely correct, and the ambience may not be completely cancelled within the down-mixed signal with the given parameters.
The output of the encoder is thus the inter-channel level difference (ILD), i.e. the rotation matrix representing the stereo panning coefficients; the inter-channel phase difference (ICPD), i.e. the inter-channel time difference (ITD); the inter-channel coherence (IC); and the down-mixed audio signal.
It can be seen that this parameterisation does not represent the ambient signal level.
Down-mix
The down-mixed signal can be created, for example, by averaging the signals in the transform domain. In a two channel case (left and right channel) this can be expressed as

$$S_n = \frac{1}{2}\left(S_n^L + S_n^R\right). \qquad (14)$$
There are also other means to create the down-mixed signal, such as the principal component analysis and the unified domain transform mentioned above. In addition, the left and right channels could be weighted in such a manner that the energy of the signal is preserved, e.g. when one of the channels is close to zero. However, when the binaural synthesis is based on the level difference of the left and right input channels and the down-mixed signal, the down-mixing method should be predetermined. Otherwise, the conversion from a single ILD parameter to channel gains for the left and right channel may not be possible.
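A sketch of the averaging down-mix of Equation (14), with the energy-preserving weighting included as one possible, assumed variant, could be:

```python
import numpy as np

def average_downmix(S_l, S_r, preserve_energy=False, eps=1e-12):
    """Two-channel down-mix by averaging, Equation (14). The optional
    energy-preserving weighting is one possible variant mentioned in the
    text, not the publication's exact rule."""
    S = 0.5 * (S_l + S_r)
    if preserve_energy:
        e_in = np.vdot(S_l, S_l).real + np.vdot(S_r, S_r).real
        e_out = 2.0 * np.vdot(S, S).real
        S = S * np.sqrt(e_in / (e_out + eps))
    return S
```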
BCC synthesis
The binaural synthesis can also be conducted in a time-frequency domain. Figure 4 presents the basic structure for performing the binaural synthesis 401 in a time-frequency domain. The down-mixed mono speech/audio frame consisting of $N$ samples $s_0,\ldots,s_{N-1}$ is converted to $N$ spectral samples $S_0,\ldots,S_{N-1}$, e.g. with the discrete Fourier transform (DFT) or with another time-to-frequency transform method.
The inter-channel level difference and inter-channel time difference coefficients are now applied to create binaural audio. When the down-mixed signal is created according to Equation (14), and the inter-channel level difference is determined as the level difference of the left and right channel, the left and right channel signals are synthesised for each subband $n$, for example, as

$$S_n^L = a_L\, S_n\, e^{\,j\frac{2\pi n \tau_n}{2N}}, \qquad (15)$$

$$S_n^R = a_R\, S_n\, e^{\,-j\frac{2\pi n \tau_n}{2N}}, \qquad (16)$$

where $S_n$ is the spectral coefficient vector of the down-mixed signal according to Equation (14), $S_n^L$ and $S_n^R$ are the spectral coefficients of the left and right binaural signals, respectively, and the gains $a_L$ and $a_R$ follow from the inter-channel level difference.
It should be noted that the BCC synthesis using frequency dependent level and delay parameters creates only the dry surround sound component. The ambience is still missing and could be synthesised using the coherence parameter. A method for the synthesis of the coherence cue consists of, e.g., decorrelation of a signal to create a late reverberation signal. The implementation consists of filtering each output channel with a random-phase filter and adding the result into the output signal. When different filters with a delay are applied to each channel, a decorrelated signal is created.
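The dry synthesis of Equations (15)-(16) together with a simple delay-based decorrelation could be sketched as follows; the gain rule (chosen so that the channels average back to the down-mix), the delay pair and the phase convention are assumptions of this sketch:

```python
import numpy as np

def bcc_synthesis(S, ild_db, tau, N, b=0.0, delays=(17, 31)):
    """Dry binaural synthesis per Equations (15)-(16), plus a simple
    delay-based decorrelation for ambience. The gain rule, delay pair
    and phase convention are assumptions of this sketch."""
    r = 10.0 ** (ild_db / 20.0)                       # left/right amplitude ratio
    a_l, a_r = 2.0 * r / (1.0 + r), 2.0 / (1.0 + r)   # (a_l + a_r) / 2 == 1
    n = np.arange(len(S))
    shift = np.exp(1j * 2.0 * np.pi * n * tau / (2.0 * N))
    # Decorrelated ambience: the same signal through two different delays.
    amb_l = S * np.exp(-1j * 2.0 * np.pi * n * delays[0] / (2.0 * N))
    amb_r = S * np.exp(-1j * 2.0 * np.pi * n * delays[1] / (2.0 * N))
    return a_l * S * shift + b * amb_l, a_r * S / shift + b * amb_r
```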
Figure 5 presents generic multi-channel coding with a flexible channel configuration using BCC cues. The number of output audio channels/objects 504 does not need to be identical to the number of input channels/objects 501. For example, the output of the mixer 502/renderer 503 could be intended for any loudspeaker output configuration from stereo to an N channel output. The output could also be rendered into binaural format for headphone listening.
In the following, an encoder 1 according to an example embodiment of the present invention will be described with reference to the block diagram of Fig. 6 and the flow diagram of Fig. 8a. Although the signals presented in the following description relate to audio signals, the invention is not limited to processing audio signals. The encoder 1 comprises a first interface 1.1 for inputting a number of audio signals from a number of audio channels 2.1 — 2.m (block 801 in Fig. 8a). Although five audio channels are depicted in Fig. 6, it is obvious that the number of audio channels can also be two, three, four or more than five. The signal of one audio channel may comprise an audio signal from one audio source or from more than one audio source. The audio source can be a microphone, a radio, a TV, an MP3 player, a DVD player, a CDROM player, a synthesizer, a personal computer, a communication device, a music instrument, etc. In other words, the audio sources to be used with the present invention are not limited to a certain kind of audio source. It should also be noticed that the audio sources need not be similar to each other, but different combinations of different audio sources are possible. Signals from the audio sources 2.1 — 2.m are converted to digital samples in analog-to-digital converters 3.1 — 3.m (block 802). In this example embodiment there is one analog-to-digital converter for each audio source, but it is also possible to implement the analog-to-digital conversion by using fewer analog-to-digital converters than one for each audio source. It may even be possible to perform the analog-to-digital conversion of all the audio sources by using one analog-to-digital converter 3.1.
The samples formed by the analog-to-digital converters 3.1 — 3.m are stored, if necessary, in a memory 4. The memory 4 comprises a number of memory sections 4.1 — 4.m for the samples of each audio source. These memory sections 4.1 — 4.m can be implemented in the same memory device or in different memory devices. The memory or a part of it can also be a memory of a processor 6, for example.
In this example embodiment a time-to-frequency transformation is performed on the audio samples to represent the audio signals in a time-frequency domain (block 803). The time-to-frequency transformation can be performed, for example, by matched filters such as a quadrature mirror filter bank, by the discrete Fourier transform, etc. There can be separate time-to-frequency transformers for each audio source, or one time-to-frequency transformer 5 may be sufficient to make the time-to-frequency transformations for the signals of the different audio channels. The time-to-frequency transformation is performed by using a number of samples, i.e. a set of samples, at a time. Such sets of samples can also be called frames. In an example embodiment one frame of samples represents a 20 ms part of an audio signal in the time domain, but other lengths can also be used, for example 10 ms. After the time-to-frequency transformation the audio signal is divided into a number of subbands. The transformed signals on these subbands n at each time instant k can be represented by a number of transform coefficients.
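As an illustration of this framing and transformation step, a sketch (with an assumed sine window and uniform band edges, neither specified in the publication) could be:

```python
import numpy as np

def frame_to_subbands(frame, band_edges):
    """Transform one frame to the frequency domain and group the
    coefficients into subbands (block 803). The sine window and the
    uniform band edges are assumptions of this sketch."""
    w = np.sin(np.pi * (np.arange(len(frame)) + 0.5) / len(frame))
    spectrum = np.fft.rfft(w * frame)
    # One complex coefficient vector per subband n.
    return [spectrum[band_edges[i]:band_edges[i + 1]]
            for i in range(len(band_edges) - 1)]

# Example: a 20 ms frame at 16 kHz (320 samples) split into 8 subbands.
frame = np.random.randn(320)
subbands = frame_to_subbands(frame, np.linspace(0, 161, 9).astype(int))
```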
The analysis block 7 performs an inter-channel analysis for the subbands of the audio signals (block 804). In this example embodiment one reference channel is selected among the audio channels (block 804.1). Without loss of generality, the first audio channel 2.1 can be selected as the reference channel. Hence, the analysis is performed for the other channels with respect to the reference channel. For example, the analysis block 7 estimates, for each subband $n$ and time instant $k$ of the signal of the second audio channel 2.2, the inter-channel level difference (ILD) with respect to the reference channel 2.1 (block 804.2) by using, for example, the following equation:

$$\Delta L_n^x = 10\log_{10}\frac{s_n^{r\,T} s_n^r}{s_n^{x\,T} s_n^x}, \qquad (17)$$

where $s_n^r$ and $s_n^x$ are the time domain signals of the reference channel and the channel to be processed (2.2 — 2.m) in subband $n$, respectively. The values of the obtained inter-channel level difference parameters are stored into the memory 4, when necessary. The inter-channel level difference parameters are also calculated for the subbands of the other audio channels in a corresponding manner.
The analysis block 7 also estimates, for each subband $n$ and time instant $k$ of the signal of the second audio channel 2.2, the inter-channel time difference (ITD) with respect to the reference channel 2.1 (block 804.3) by using, for example, the following equation:

$$\tau_n^x = \arg\max_d\left\{\frac{s_n^r(k-d_1)^T s_n^x(k-d_2)}{\sqrt{\left(s_n^r(k-d_1)^T s_n^r(k-d_1)\right)\left(s_n^x(k-d_2)^T s_n^x(k-d_2)\right)}}\right\}, \qquad (18)$$

where

$$d_1 = \max\{0, -d\}, \qquad d_2 = \max\{0, d\}.$$
The equation (18) is derived from equations (2), (3) and (4). The values of the obtained inter-channel time difference parameters are stored into the memory 4, when necessary. The inter-channel time difference parameters are also calculated for the subbands of the other audio channels in a corresponding manner.
The inter-channel coherence (IC) parameter for the subbands of the second audio channel can be determined (block 804.4) on the basis of the factor in brackets in Equation (18), which corresponds to Equation (3):

$$c_n^x = \max_d\left\{\frac{s_n^r(k-d_1)^T s_n^x(k-d_2)}{\sqrt{\left(s_n^r(k-d_1)^T s_n^r(k-d_1)\right)\left(s_n^x(k-d_2)^T s_n^x(k-d_2)\right)}}\right\}. \qquad (19)$$
It is also possible to calculate the inter-channel level difference parameters and inter-channel time difference parameters in a transform domain. Hence, Equations (5), (6) and (7) could be used to perform the calculations in the transform domain. The combining block 8 combines two or more of the signals from different audio channels into one or more combined channels (block 806). This kind of operation can also be called down-mixing. Some non-limiting examples of down-mixing ratios are from two audio channels to one combined channel, from five audio channels to two combined channels, from five audio channels to one combined channel, from seven audio channels to two combined channels, and from seven audio channels to one combined channel. However, other down-mixing ratios can also be implemented in connection with the present invention. Generally, the down-mixing reduces the first number of channels, say M, to a second number of channels, say P, in such a way that P<M.
The combining block 8 performs the down-mixing in the time domain or in the transform domain. The down-mixing can be performed, for example, by averaging or summing the signals of different channels 2.1 — 2.m. Before the combination the phase difference between the channels to be combined may be removed e.g. using the information provided by the inter-channel time/phase difference parameter.
In a situation in which the number of the combined channels is greater than one, a down-mixing table (not shown) could be used to define how the signals of different audio channels should be combined. For example, if five channels should be down-mixed to two channels, it could be performed by averaging signals of a second channel, a third channel and half of a first channel into a first combined channel, and by averaging the signals of a fourth channel, a fifth channel and half of the first channel into a second combined channel. Table 1 shows an example of down-mix scaling factors for down-mixing a 5.1 surround content into two channels. The 5.1 surround content comprises, for example, a front left channel, a front right channel, a center channel, a surround left channel, a surround right channel and a low frequency effect (LFE) channel.
Table 1. Scaling factors for stereo down-mix (the numeric scaling factors themselves are presented as a table image in the original document).
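To illustrate how such a scaling table is applied, the following sketch uses hypothetical ITU-style coefficients; they are explicitly not the values of Table 1:

```python
import numpy as np

# Hypothetical scaling factors in the spirit of Table 1 (channel order:
# front left, front right, center, surround left, surround right, LFE).
# The values below are common ITU-style choices, NOT the patent's table.
DOWNMIX = np.array([
    [1.0, 0.0, 0.7071, 0.7071, 0.0,    0.7071],   # left output
    [0.0, 1.0, 0.7071, 0.0,    0.7071, 0.7071],   # right output
])

def downmix_5_1_to_stereo(channels):
    """channels: array of shape (6, n_samples) -> stereo (2, n_samples)."""
    return DOWNMIX @ channels
```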
The PCA and UDT approaches in binaural (and multi-channel) coding indicate that the ambient signal, including the late reverberation, is not described by the level and phase difference parameterisation intended for dry sounds.
The ambient signal visible in Equation (13) naturally affects the down-mixing in Equation (14). When the signal power of the input signal is compared against the power of the phase-removed down-mixed signal, it can be noticed that in some cases the signal powers of more than one of the individual channels are higher than the power of the down-mixed signal. For example, the power of all individual channels may be higher than the power of the down-mixed signal:
$$|S_n|^2 < |S_n^1|^2 \;\;\&\;\; |S_n|^2 < |S_n^2|^2 \;\;\&\;\cdots\;\&\;\; |S_n|^2 < |S_n^N|^2. \qquad (20)$$
The reason is the fact that there actually is an additional ambient component which is not visible in the down-mixing of dry signals. Phase-removed input signals may still have ambient components cancelling each other. Hence, for the two channel (binaural) case, the down-mixing in Equation (14) could be amended as follows:

$$S_n = \frac{1}{2}\left(S_n^L + S_n^R\right) + A_n. \qquad (21a)$$

A similar approach can be implemented for N channels down-mixed to one channel, for example as follows:

$$S_n = \frac{1}{N}\sum_{i=1}^{N} S_n^i + A_n. \qquad (21b)$$
The coherence information determined e.g. in Equation (7) gives some indication about the presence of ambience, but does not provide means to represent the additional ambience in Equations (21a) and (21b).
The ambient signal could be subtracted from the down-mixed signal in Equations (21a) and (21b) using the original input signals, but for the binaural coding only the spectral level of the ambience is needed. Therefore, to parameterise the ambient signal, only the level information (block 804.5) is sufficient.
First of all, according to an example embodiment of this invention, the level of the down-mixed signal is corrected to maintain the signal power after the phase difference removal. It may not be possible to incorporate the accurate ambient signal in the down-mixing. However, it is possible to correct the down-mixed signal level by taking into account the missing ambience.
$$\frac{1}{2}\left(S_n^L + S_n^R\right) + A_n \approx S_n + g_n S_n = \left(1 + g_n\right) S_n. \qquad (22)$$
By using the level correction factor $(1+g_n)$, the output signal level of the down-mix in the encoder can be substantially maintained at the same level as the input signal level. The correction gain is thus determined by calculating the difference between the down-mixed signal level and the phase-removed input signal level (block 804.6). This can be performed, for example, by the correction calculation block 9. The correction calculation block 9 outputs (block 805) the correction factor(s) to a multiplier 10, which multiplies the signal output by the combining block 8 with the correction factor(s) and produces the corrected output signal (block 807). It should be noted here that if there is more than one output signal from the combining block 8, the correction factor can be applied to all output signals. It may also be the case that the same correction factor cannot be used for all the output signals, wherein the correction calculation block 9 calculates a correction factor for each output channel, i.e. each output channel may have a down-mixed signal-specific correction factor.
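A sketch of the correction-gain determination and its application by the multiplier, with the exact level measure being an assumption of the sketch, could be:

```python
import numpy as np

def correction_gain(S_mix, S_in_phase_removed, eps=1e-12):
    """Per-subband level-correction gain (1 + g_n) in the spirit of
    Equation (22): the ratio of the phase-removed input level to the
    down-mix level. The exact level measure is an assumption here."""
    lvl_in = np.sqrt(np.vdot(S_in_phase_removed, S_in_phase_removed).real)
    lvl_mix = np.sqrt(np.vdot(S_mix, S_mix).real)
    return lvl_in / (lvl_mix + eps)

# The multiplier 10 (block 807) then scales the down-mixed subband:
# S_corrected = correction_gain(S_mix, S_in) * S_mix
```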
Due to the nature of the parameter, the correction gain $(1+g_n)$ represents the additional ambient level in each subband $n$. Hence, the parameter should also be used in the ambient signal synthesis. As the ambience level is corrected in the down-mixed signal, a similar correction is not needed in the decoder, but the information about the level of the ambience may be needed in the decorrelation of the synthesised sound. The correction factor value is mapped to the ambience level in the decoder. For example, when the correction factor exceeds a predetermined threshold value, the value of the ambience level information is changed. The decoder will accordingly increase the ambient level of the synthesised signal. There can also be more than one threshold value for controlling the ambience levels.
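Such a threshold mapping could be sketched, with purely illustrative threshold values, as:

```python
def ambience_level_index(gain, thresholds=(1.05, 1.2, 1.5)):
    """Map the correction gain (1 + g_n) to a coarse ambience-level index
    for the decoder; the threshold values are purely illustrative."""
    return sum(gain > t for t in thresholds)
```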
The analog-to-digital converters 3.1 — 3.m may be implemented as separate components or inside the processor 6, such as a digital signal processor (DSP), for example. The transform block 5, the analysis block 7, the combining block 8, the correction calculation block 9 and the multiplier 10 can also be implemented by hardware components or as computer code of the processor 6, or as a combination of hardware components and computer code. It is also possible that the other elements can be implemented in hardware or as computer code.
The computer code can be stored in a storage device such as a code memory 18, which can be part of the memory 4 or separate from the memory 4, or in another kind of data carrier. The code memory 18 or part of it can also be a memory of the processor 6. The computer code can be stored at a manufacturing phase of the device or separately, wherein the computer code can be delivered to the device by e.g. downloading from a network, or from a data carrier like a memory card, a CDROM or a DVD.
The analysis and down-mixing operations can also be implemented as a module which may be a hardware component, a programmable logic array, an application specific integrated circuit, a processor or another semiconductor chip or a chip set. The module may also comprise some of the other blocks of the encoder 1.
The encoder 1 may transmit the correction gain as such to the decoder, or may simply give an indication of the level of the correction gain. As explained earlier, the binaural synthesis does not need to apply the correction gain to amplify the decoded down-mixed signal or the synthesised binaural output, since the correction was already done in the encoder. However, an alternative implementation of the encoder 1 does not apply the correction gain to the down-mixed signal, but transmits the parameter as part of the cues and performs the level correction according to Equation (22) in the decoder 21.
If the corrected output signal is in transform domain the inverse transform block 11 performs an inverse transformation (block 808) of the down-mixed signal for the speech/audio encoder 12 for encoding the audio signal (block 809). However, in some embodiments the corrected output signal(s) can be provided to the speech/audio encoder 12 conducting the encoding of the corrected output signal in the transform domain. Hence, the inverse transformation may not be needed in the encoder 1. It is also possible that the encoder comprises a cue coder 13 for coding the cue information and possibly information of the correction factor(s) (block 810) before the audio and cue information are transmitted.
The encoded output signal from the speech/audio encoder 12, the cues and possibly the information about the correction gain(s) can be combined into a single bit stream by the multiplexer 14 (block 811), or they can be output as separate bit streams. The bit stream(s) can be encoded by the channel encoder 15 (block 812), when necessary, and transmitted (block 813) via a communication channel 17 to a receiver 20 by a transmitter 16.
It is not always necessary to transmit the audio signal, the cues and the information relating to the ambience after encoding; it is also possible to store the audio signal, the cues and the information relating to the ambience in a storage device such as a memory card, a memory chip, a DVD disk, a CDROM, etc., from which the information can later be provided to a decoder 21 for reconstruction of the audio signals and the ambience.
Next, the operations performed in a decoder 21 according to an example embodiment of the invention will be described with reference to the block diagram of Fig. 7 and the flow diagram of Fig. 9. The bit stream is received by the receiver 20 (block 901 in Fig. 9) and, if necessary, a channel decoder 22 performs channel decoding (block 902) to reconstruct the bit stream(s) carrying the one or more combined signals, the cues and possibly the information about the correction gain(s).
The combined signal, the cues and the information on the correction factor(s) can be separated from the reconstructed bit stream by the demultiplexer 23 (block 903), in case they were multiplexed into a single bit stream. In this example embodiment the reconstructed bit streams at the output of the optional channel decoder 22 contain the audio signal in encoded form. Hence, the bit stream is decoded by the audio decoder 24 to obtain a replica of the corrected audio signal in the time domain (block 904), i.e. the audio signal constructed by the inverse transform block 11. The output signal from the audio decoder 24 is provided to the up-mixing block 25 to form two or more audio signals (block 905). In case the encoding is conducted in a transform domain that is similar to the transform used for spatial parameter estimation and synthesis, the decoder does not need an inverse transform into the time domain before the spatial synthesis, i.e. before the up-mix operation. In an example embodiment the up-mixing block 25 forms as many output signals (channels) as were combined in the combining block 8, i.e. M channels are reconstructed. In another example embodiment the up-mixing block 25 forms fewer output signals than were combined in the combining block 8. In yet another example embodiment the up-mixing block 25 forms more output signals than the number P of combined input signals; more than M channels may even be reconstructed. For example, if five channels were combined to one channel, the up-mixing block 25 could form two, three, four, five or even more than five output signals. As a general rule, the up-mixing block forms Q channels from P combined channels, P < Q and P < M.
The decoder 21 may also comprise a cue decoder 27 to decode the optionally encoded cue information and/or the information on the correction factor(s) (block 906). The decoder 21 comprises a correction block 26 which takes into account the received cues and possibly the correction factor(s) to synthesize the audio signals with the ambience (block 907). The correction block 26 may comprise, for example, random-coefficient FIR filters modelling late reverberation, or simple comb filters for each reconstructed channel. The correction block 26 also comprises an input 26.1 to which the received parameters may be input to be used in the synthesis of the audio signals.
The decoder 21 can also comprise a processor 29 and a memory 28 for storing data and/or computer code.
The synthesis of the ambience in the decoder 21 utilises the correction gain, or the information about the level of the correction gain (one or more correction factors), when decorrelating the output signal. For example, random-coefficient FIR filters modelling late reverberation, or simple comb filters for each output channel, could be controlled with the ambience level information. In a two-channel case the synthesised first (e.g. left) and second (e.g. right) channel signals can be written as

$$S_n^1 = a_1\, S_n\, e^{\,j\frac{2\pi n \tau_n}{2N}} + b_1 S_{1,n}, \qquad (23)$$

$$S_n^2 = a_2\, S_n\, e^{\,-j\frac{2\pi n \tau_n}{2N}} + b_2 S_{2,n}. \qquad (24)$$
The first scale factors $a_1$, $a_2$ correspond to the inter-channel level difference, and the second scale factors $b_1$, $b_2$ correspond to the ambience level information. A low ambience level implies a low scaling factor. In an example embodiment the inter-channel level difference and the ambient level could be balanced so that the overall level of the output signal is not increased and the level difference between the left channel and the right channel remains substantially the same as in the corresponding input signals. The DFT domain signals $S_{1,n}$ and $S_{2,n}$ of the exemplary two-channel case are the decorrelated ambience signals in subband $n$:

$$S_{i,n} = H_{i,n} S_n, \qquad i = 1,2, \qquad (25)$$

in which $H_{i,n}$ is the decorrelation filter.
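A sketch of this two-channel synthesis with decorrelated ambience, in the spirit of Equations (23)-(25), could be as follows; the random-phase filter and the fixed seed are assumptions, not the publication's exact decorrelator:

```python
import numpy as np

def synthesize_two_channels(S, a, b, tau, N, seed=0):
    """Dry panning plus decorrelated ambience, after Equations (23)-(25);
    the random-phase filters H_i,n and the fixed seed are assumptions."""
    rng = np.random.default_rng(seed)
    n = np.arange(len(S))
    shift = np.exp(1j * 2.0 * np.pi * n * tau / (2.0 * N))
    outputs = []
    for a_i, b_i, sgn in ((a[0], b[0], 1.0), (a[1], b[1], -1.0)):
        H = np.exp(1j * 2.0 * np.pi * rng.random(len(S)))    # random-phase filter
        S_amb = H * S                                        # Equation (25)
        outputs.append(a_i * S * shift ** sgn + b_i * S_amb)  # Equations (23)/(24)
    return outputs
```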
A general multi-channel equation for the synthesized channels i can be derived from equations (23) and (24) e.g. as follows:
$$S_n^i = a_i\, S_n\, e^{\,j\frac{2\pi n \tau_n}{2N}} + b_i S_{i,n}, \qquad (26)$$

in which $i$ is the index of the synthesized channel. The synthesized audio signals can be provided to loudspeakers 30.1 — 30.q, for example, for listening (block 908). It is also possible to store the synthesized audio signals in a storage device such as the data memory 28.1 of the decoder, a memory card, a memory chip, a DVD disk, a CDROM, etc. Some elements of the decoder 21 can also be implemented in hardware or as computer code, and the computer code can be stored in a storage device such as a code memory 28.2, which can be part of the memory 28 or separate from the memory 28, or in another kind of data carrier. The code memory 28.2 or part of it can also be a memory of the processor 29 of the decoder 21. The computer code can be stored at a manufacturing phase of the device or separately, wherein the computer code can be delivered to the device by e.g. downloading from a network, or from a data carrier like a memory card, a CDROM or a DVD. The invention can be applied, for example, in the ITU-T (International Telecommunication Union - Telecommunication Standardization Sector) EV-VBR (Embedded Variable Bit Rate coding) stereo extension and in 3GPP EPS (Evolved Packet System) speech/audio coding. There are also other systems and environments in which the present invention can be implemented.
In Fig. 10 there is depicted an example of a device 110 in which the invention can be applied. The device can be, for example, an audio recording device, a wireless communication device, computer equipment such as a portable computer, etc. The device 110 comprises a processor 6 in which at least some of the operations of the invention can be implemented, a memory 4, a set of inputs 1.1 for inputting audio signals from a number of audio sources 2.1 — 2.m, one or more A/D converters for converting analog audio signals to digital audio signals, an audio encoder 12 for encoding the combined audio signal, and a transmitter 16 for transmitting information from the device 110.
In Fig. 11 there is depicted an example of a device 111 in which the invention can be applied. The device 111 can be, for example, an audio playing device such as an MP3 player, a CDROM player, a DVD player, etc. The device 111 can also be a wireless communication device, computer equipment such as a portable computer, etc. The device 111 comprises a processor 29 in which at least some of the operations of the invention can be implemented, a memory 28, a receiver 20 for receiving a combined audio signal and parameters relating to the combined audio signal from another device, an audio decoder 24 for decoding the combined audio signal, a synthesizer 26 for synthesizing a number of audio signals, and a number of outputs for outputting the synthesized audio signals to loudspeakers 30.1 — 30.q.
An apparatus of an example embodiment according to the present invention comprises means for inputting two or more audio signals; means for analysing the audio signals to form a set of parameters; means for combining at least two of said two or more audio signals to form a combined audio signal; means for determining the signal level of the combined audio signal; and means for determining a correction factor on a basis of a difference between the signal level of the combined audio signal and a signal level of at least one of the inputted audio signals to reduce difference between the signal level of the combined audio signal and the signal level of the inputted audio signal.
An apparatus of another example embodiment according to the present invention comprises means for inputting a combined audio signal and one or more parameters relating to the audio signals from which the combined audio signal has been formed; means for synthesizing two or more audio signals on the basis of the combined audio signal and said one or more parameters; said one or more parameters comprising a correction factor; and the apparatus further comprises means for using the correction factor in said synthesizing two or more audio signals.
The combinations of claim elements as stated in the claims can be changed in any number of different ways and still be within the scope of various embodiments of the invention.

Claims

Claims:
1. A method comprising:
- inputting two or more audio signals;
- analysing the audio signals to form a set of parameters;
- combining at least two of said two or more audio signals to form a combined audio signal;
characterised in that the analysing comprises
- determining the signal level of the combined audio signal;
- determining a correction factor on a basis of a difference between the signal level of the combined audio signal and a signal level of at least one of the inputted audio signals to reduce difference between the signal level of the combined audio signal and the signal level of the inputted audio signal.
2. The method according to claim 1, characterised by
- selecting a reference channel among the two or more input channels; and
- using the selected reference channel in determining the correction factor.
3. The method according to claim 2, characterised by dividing a frequency band of the audio signals into subbands.
4. The method according to claim 3, characterised by calculating the correction factor for a subband as follows:

$$\left(1 + g_n\right) S_n = \frac{1}{2}\left(S_n^r + S_n^x\right) + A_n,$$

in which $S_n$ is the signal level of the combined signal, $S_n^r$ is the signal level of the reference signal, $S_n^x$ is the signal level of the signal to be analysed, $A_n$ is an ambient signal, and $g_n$ is the correction factor.
5. The method according to any of the claims 1 to 4, characterised by amending the combined audio signal with the correction factor.
6. The method according to claim 5, characterised in that one or more combined signals are formed, and that each combined signal is amended by the same correction factor.
7. The method according to claim 5, characterised in that two or more combined signals are formed, that for each combined signal a down-mixed signal-specific correction factor is formed, and that each combined signal is amended by the down-mixed signal-specific correction factor.
8. The method according to claim 6 or 7, characterised in that the combined signal is amended by multiplying the combined signal with the correction factor.
9. The method according to any of the claims 1 to 8, characterised by transmitting the combined audio signal and the correction factor to a receiver.
10. The method according to any of the claims 1 to 9, characterised by:
- converting the audio signals from time domain to a transform domain;
- forming the combined audio signal in the transform domain;
- determining the correction factor in the transform domain; and
- converting the combined audio signal to the time domain.
11. A method comprising:
- inputting a combined audio signal and one or more parameters relating to the audio signals from which the combined audio signal has been formed;
- synthesizing two or more audio signals on the basis of the combined audio signal and said one or more parameters;
characterised in that said one or more parameters comprise a correction factor, and the method comprises using the correction factor in said synthesizing two or more audio signals.
12. The method according to claim 11, characterised by
- synthesizing each audio signal; and
- correcting each synthesized audio signal by using said correction factor.
13. The method according to claim 11 or 12, characterised in that said one or more parameters comprise ambience level information; and that an ambient component is synthesized by decorrelating said two or more audio signals using the ambient level information.
14. The method according to claim 13, characterised by
- performing said decorrelation by using filters, and
- controlling said filters by the ambient level information.
15. The method according to any of the claims 11 to 14, characterised in that a frequency band of the audio signals is divided into subbands, the method comprising:
- receiving a correction factor for each subband, and
- synthesizing each subband of the audio signal using the correction factor of the subband.
16. The method according to claim 15, characterised in that the synthesis of the output channels for each subband is performed by using the equation

$$S_n^i = a_i\, S_n\, e^{\,j\frac{2\pi n \tau_n}{2N}} + b_i S_{i,n},$$

in which $n$ is the subband, $a_i$ is the first scale factor corresponding to the inter-channel level difference, $b_i$ is the second scale factor corresponding to the ambience level, $N$ is the total number of synthesized channels, $\tau_n$ is the inter-channel time difference, and $i$ is the number of the synthesized channel.
17. An apparatus comprising:
- an input for inputting two or more audio signals;
- an analyser for analysing the audio signals to form a set of parameters;
- a combiner for combining at least two of said two or more audio signals to form a combined audio signal;
characterised in that the analyser comprises
- a level determinator for determining the signal level of the combined audio signal;
- a gain determinator for determining a correction factor on a basis of a difference between the signal level of the combined audio signal and a signal level of at least one of the inputted audio signals to reduce difference between the signal level of the combined audio signal and the signal level of the inputted audio signal.
18. The apparatus according to claim 17, characterised in that the apparatus comprises a selector for selecting a reference channel among the two or more input channels; and said gain determinator is configured to use the selected reference channel in determining the correction factor.
19. The apparatus according to claim 18, characterised in that the apparatus comprises a divider for dividing a frequency band of the audio signals into subbands.
20. The apparatus according to claim 19, characterised in that said gain determinator is configured to calculate the correction factor for a subband as follows:

$$\left(1 + g_n\right) S_n = \frac{1}{2}\left(S_n^r + S_n^x\right) + A_n,$$

in which $S_n$ is the signal level of the combined signal, $S_n^r$ is the signal level of the reference signal, $S_n^x$ is the signal level of the signal to be analysed, $A_n$ is an ambient signal, and $g_n$ is the correction factor.
21. The apparatus according to any of the claims 17 to 20, characterised in that the apparatus comprises a multiplier for amending the combined audio signal with the correction factor.
22. The apparatus according to claim 21, characterised in that said combiner is configured to form one or more combined signals, and that said multiplier is configured to amend each combined signal by the same correction factor.
23. The apparatus according to claim 21, characterised in that said combiner is configured to form two or more combined signals and to form for each combined signal a down-mixed signal-specific correction factor, and said multiplier is configured to amend each combined signal by the down-mixed signal-specific correction factor.
24. The apparatus according to claim 21, 22 or 23, characterised in that said combiner is configured to amend the combined signal by multiplying the combined signal with the correction factor.
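Claims 21 to 24 amount to an element-wise multiplication of each combined signal by a gain. A one-line sketch covering both the shared-gain case of claim 22 and the per-down-mix case of claim 23 (inputs assumed to be NumPy arrays):

```python
def amend_downmixes(downmixes, gains):
    """Claims 22-23, sketched: multiply each combined signal by its correction
    factor; pass the same gain for every down-mix to get the claim-22 case."""
    return [g * s for s, g in zip(downmixes, gains)]
```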
25. The apparatus according to any of the claims 17 to 24, characterised in that the apparatus comprises a transmitter for transmitting the combined audio signal and the correction factor to a receiver.
26. The apparatus according to any of the claims 17 to 25, characterised in that the apparatus comprises a converter for converting the audio signals from the time domain to a transform domain; said combiner is configured for forming the combined audio signal in the transform domain; said gain determinator is configured for determining the correction factor in the transform domain; and the apparatus further comprises an inverse-converter for converting the combined audio signal to the time domain.
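As a rough end-to-end sketch of the claim-26 signal path, the following uses a plain DFT in place of the unspecified transform; the 0.5 down-mix weight and the choice of the left channel as reference are assumptions, not taken from the claims.

```python
import numpy as np

def encode_frame(left, right, eps=1e-12):
    """Claim 26, sketched: transform, down-mix, determine the correction
    factor in the transform domain, apply it, and convert back to time domain."""
    L, R = np.fft.rfft(left), np.fft.rfft(right)
    S = 0.5 * (L + R)                  # combined (down-mixed) signal
    ref = np.abs(L)                    # assumed: left channel as reference level
    g = ref / (np.abs(S) + eps)        # correction factor (illustrative form)
    corrected = np.fft.irfft(g * S, n=len(left))
    return corrected, g
```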
27. An apparatus comprising:
- an input for inputting a combined audio signal and one or more parameters relating to the audio signals from which the combined audio signal has been formed;
- a synthesizer for synthesizing two or more audio signals on the basis of the combined audio signal and said one or more parameters;
characterised in that said one or more parameters comprise a correction factor, and the apparatus comprises a corrector for using the correction factor in said synthesizing two or more audio signals.
28. The apparatus according to claim 27, characterised in that said synthesizer is configured for synthesizing each audio signal; and said corrector is configured for correcting each synthesized audio signal by using said correction factor.
29. The apparatus according to claim 27 or 28, characterised in that said one or more parameters comprise ambience level information; and that said synthesizer comprises a decorrelator for decorrelating said two or more audio signals using the ambience level information.
30. The apparatus according to claim 29, characterised in that the decorrelator comprises filters, and a control input for controlling said filters by the ambience level information.
31. The apparatus according to any of the claims 27 to 30, characterised in that a frequency band of the audio signals is divided into subbands, and said input is configured to receive a correction factor for each subband, and said synthesizer is configured to synthesize each subband of the audio signal using the correction factor of the subband.
32. The apparatus according to claim 31, characterised in that said synthesizer is configured to synthesize the output channels for each subband by using the equation

$$S'_{i,n} = a_i\,S_n\,e^{\,j\frac{2\pi n\tau_n}{2N}} + b_i\,\hat{S}_{i,n},$$

in which n is the subband, $a_i$ is the first scale factor corresponding to the inter-channel level difference, $b_i$ is the second scale factor corresponding to the ambience level, N is the total number of synthesized channels, $\tau_n$ is the inter-channel time difference, and i is the number of the synthesized channel.
33. A computer program comprising program code means adapted to perform the following when the program is run on a processor:
- inputting two or more audio signals;
- analysing the audio signals to form a set of parameters;
- combining at least two of said two or more audio signals to form a combined audio signal;
characterised in that the computer program comprises program code means adapted to
- determine the signal level of the combined audio signal;
- determine a correction factor on the basis of a difference between the signal level of the combined audio signal and a signal level of at least one of the inputted audio signals, to reduce the difference between the signal level of the combined audio signal and the signal level of the inputted audio signal.
34. A computer program according to claim 33 comprising program code means adapted to perform the steps of any of the claims 1 to 9 when the program is run on a processor.
35. A computer program comprising program code means adapted to perform the following when the program is run on a processor:
- inputting a combined audio signal and one or more parameters relating to the audio signals from which the combined audio signal has been formed;
- synthesizing two or more audio signals on the basis of the combined audio signal and said one or more parameters;
characterised in that said one or more parameters comprise a correction factor, and the computer program comprises program code means adapted to use the correction factor in said synthesizing two or more audio signals.
36. A computer program according to claim 35 comprising program code means adapted to perform the steps of any of the claims 10 to 16 when the program is run on a processor.
PCT/FI2008/050182 2008-04-11 2008-04-11 Processing of signals WO2009125046A1 (en)

Priority Applications (2)

- CN200880129124.2A (CN102027535A), priority 2008-04-11, filed 2008-04-11: Processing of signals
- PCT/FI2008/050182 (WO2009125046A1), priority 2008-04-11, filed 2008-04-11: Processing of signals


Publications (1)

- WO2009125046A1

Family

ID=41161578

Family Applications (1)

- PCT/FI2008/050182 (WO2009125046A1), filed 2008-04-11: Processing of signals

Country Status (2)

- CN: CN102027535A
- WO: WO2009125046A1


Families Citing this family (6)

- US9100762B2 (GN ReSound A/S, published 2015-08-04): Hearing aid with improved localization
- CN104299615B (Huawei Technologies Co., Ltd., published 2017-11-17): Inter-channel level difference processing method and device
- CN105632505B (Beijing Tianlai Chuanyin Digital Technology Co., Ltd., published 2019-12-20): Encoding and decoding method and device for a principal component analysis (PCA) mapping model
- CN105682000B (Beijing Shidai Tuoling Technology Co., Ltd., published 2017-11-07): Audio processing method and system
- CN108174138B (Shanghai Wingtech Electronic Technology Co., Ltd., published 2021-02-19): Video shooting method, voice acquisition device and video shooting system
- GB2572650A (Nokia Technologies Oy, published 2019-10-09): Spatial audio parameters and associated spatial audio playback

Patent Citations (3)

- WO2005101370A1 (Coding Technologies AB, priority 2004-04-16, published 2005-10-27): Apparatus and method for generating a level parameter and apparatus and method for generating a multi-channel representation
- US20060083385A1 (Eric Allamanche, priority 2004-10-20, published 2006-04-20): Individual channel shaping for BCC schemes and the like
- WO2006111294A1 (Coding Technologies AB, priority 2005-04-19, published 2006-10-26): Energy dependent quantization for efficient coding of spatial audio parameters

Cited By (9)

- WO2017132082A1 (Dolby Laboratories Licensing Corporation, published 2017-08-03): Acoustic environment simulation
- KR20180108689A (Dolby Laboratories Licensing Corporation, published 2018-10-04): Acoustic environment simulation
- CN110223701A (Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V., published 2019-09-10): Decoder and method for generating an audio output signal from a down-mix signal
- US10614819B2 (Dolby Laboratories Licensing Corporation, published 2020-04-07): Acoustic environment simulation
- US11158328B2 (Dolby Laboratories Licensing Corporation, published 2021-10-26): Acoustic environment simulation
- US11343635B2 (Nokia Technologies Oy, published 2022-05-24): Stereo audio
- US11721348B2 (Dolby Laboratories Licensing Corporation, published 2023-08-08): Acoustic environment simulation
- KR102640940B1 (Dolby Laboratories Licensing Corporation, published 2024-02-26): Acoustic environment simulation
- CN110223701B (Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V., published 2024-04-09): Decoder and method for generating an audio output signal from a down-mix signal

Also Published As

- CN102027535A, published 2011-04-20


Legal Events

- WWE (WIPO information: entry into national phase): ref document number 200880129124.2; country of ref document: CN
- 121 (EP: the EPO has been informed by WIPO that EP was designated in this application): ref document number 08736831; country of ref document: EP; kind code of ref document: A1
- NENP (non-entry into the national phase): ref country code: DE
- 122 (EP: PCT application non-entry in European phase): ref document number 08736831; country of ref document: EP; kind code of ref document: A1