MX2008008424A - Decoding of binaural audio signals - Google Patents

Decoding of binaural audio signals

Info

Publication number
MX2008008424A
MX2008008424A MX/A/2008/008424A
Authority
MX
Mexico
Prior art keywords
signal
channel
audio
binaural
gain
Prior art date
Application number
MX/A/2008/008424A
Other languages
Spanish (es)
Inventor
Pasi Ojala
Julia Turku
Mauri Vaananen
Mikko Tammi
Original Assignee
Nokia Corporation
Pasi Ojala
Mikko Tammi
Julia Turku
Vaeaenaenen Mauri
Priority date
Filing date
Publication date
Application filed by Nokia Corporation, Pasi Ojala, Mikko Tammi, Julia Turku, Vaeaenaenen Mauri filed Critical Nokia Corporation
Publication of MX2008008424A publication Critical patent/MX2008008424A/en

Abstract

A method for synthesizing a binaural audio signal, the method comprising: inputting a parametrically encoded audio signal comprising at least one combined signal of a plurality of audio channels and one or more corresponding sets of side information describing a multi-channel sound image; and applying a predetermined set of head-related transfer function filters to the at least one combined signal in a proportion determined by said corresponding set of side information to synthesize a binaural audio signal.

Description

DECODING OF BINAURAL AUDIO SIGNALS

RELATED APPLICATIONS

This application claims priority from international application PCT/FI2006/050014, filed on January 9, 2006, United States application 11/334,041, filed on January 17, 2006, and United States application 11/354,211, filed on February 13, 2006.
FIELD OF THE INVENTION

The present invention relates to spatial audio coding, and more particularly to the decoding of binaural audio signals.
BACKGROUND OF THE INVENTION

In spatial audio coding, a two-channel or multi-channel audio signal is processed so that the audio signals to be reproduced on the different audio channels differ from one another, thus giving the listener an impression of a spatial effect around the audio source. The spatial effect can be created by recording the audio directly in a format suitable for multi-channel or binaural playback, or it can be created artificially in any two-channel or multi-channel audio signal, which is known as spatialization.
It is generally known that, for headphone reproduction, artificial spatialization can be performed by Head-Related Transfer Function (HRTF) filtering, which produces binaural signals for the listener's left and right ears. The sound source signals are filtered with filters derived from the HRTFs corresponding to their direction of origin. An HRTF is the transfer function measured from a sound source in the free field to the ear of a human or of an artificial head, divided by the transfer function to a microphone replacing the head and placed in the middle of the head. An artificial room effect (for example early reflections and/or late reverberation) can be added to the spatialized signals to improve externalization and the naturalness of the source.
As the variety of audio listening and interaction devices grows, compatibility becomes more important. Among spatial audio formats, compatibility is pursued through up-mixing and down-mixing techniques. It is generally known that there are algorithms for converting a multi-channel audio signal into a stereo format, such as Dolby Digital® and Dolby Surround®, and for then converting a stereo signal into a binaural signal. In this kind of processing, however, the spatial image of the original multi-channel audio signal cannot be fully reproduced. A better way to convert a multi-channel audio signal for headphone listening is to replace the original loudspeakers with virtual loudspeakers by means of HRTF filtering and to reproduce the loudspeaker channel signals through them (for example Dolby Headphone®). This approach has the disadvantage, however, that in order to generate a binaural signal a multi-channel mix is always needed first: the multi-channel signals (for example 5+1 channels) are first decoded and synthesized, and only then are the HRTFs applied to each signal to form the binaural signal. Computationally, this is a far more demanding procedure than decoding directly from the compressed multi-channel format to the binaural format.
Binaural Cue Coding (BCC) is a highly developed parametric spatial audio coding method. BCC represents a multi-channel spatial signal as one (or several) downmixed audio channels and a set of perceptually relevant inter-channel differences estimated as a function of frequency and time from the original signal. The method allows a spatial audio signal mixed for an arbitrary loudspeaker configuration to be converted for some other loudspeaker configuration, consisting of either the same or a different number of loudspeakers. BCC is therefore designed for multi-channel loudspeaker systems.
However, generating a binaural signal from a BCC-processed mono signal and its side information requires that a multi-channel representation first be synthesized from the mono signal and the side information; only then can a binaural signal for spatial headphone playback be generated from that multi-channel representation. Clearly, this procedure is likewise not optimal for the generation of a binaural signal.
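As context for the HRTF filtering described above, the following is a minimal sketch of conventional binaural spatialization, in which one source signal is convolved with a left/right pair of head-related impulse responses (HRIRs, the time-domain counterparts of HRTFs). The HRIR arrays and signal lengths are placeholders assumed for this sketch, not data from the patent.

```python
import numpy as np

def spatialize_source(source, hrir_left, hrir_right):
    """Convolve a mono source with a left/right HRIR pair to obtain a binaural signal.

    source      : 1-D array of mono samples
    hrir_left   : 1-D array, head-related impulse response for the left ear
    hrir_right  : 1-D array, head-related impulse response for the right ear
    Returns an (N, 2) array with left and right ear signals.
    """
    left = np.convolve(source, hrir_left)
    right = np.convolve(source, hrir_right)
    n = max(len(left), len(right))
    out = np.zeros((n, 2))
    out[:len(left), 0] = left
    out[:len(right), 1] = right
    return out

# Illustrative only: white-noise source and dummy 128-tap HRIRs.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    source = rng.standard_normal(44100)          # 1 s of noise at 44.1 kHz
    hrir_l = rng.standard_normal(128) * 0.01     # placeholder HRIRs
    hrir_r = rng.standard_normal(128) * 0.01
    binaural = spatialize_source(source, hrir_l, hrir_r)
    print(binaural.shape)
```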
SUMMARY OF THE INVENTION

An improved method, and technical equipment implementing the method, has now been invented, whereby a binaural signal can be generated directly from a parametrically encoded audio signal. Various aspects of the invention include a decoding method, a decoder, an apparatus and computer programs, which are characterized by what is described in detail below. Several embodiments of the invention are also described.
According to a first aspect, a method according to the invention is based on the idea of synthesizing a binaural audio signal such that first a parametrically encoded audio signal is input, comprising at least one combined signal of a plurality of audio channels and one or more corresponding sets of side information describing a multi-channel sound image. The combined signal or signals are divided into a plurality of subbands, and parameter values for the subbands are determined from the set of side information. Then a predetermined set of head-related transfer function (HRTF) filters is applied to the at least one combined signal in a proportion determined by the parameter values, to synthesize a binaural audio signal.
According to one embodiment, the parameter values are determined by interpolating a parameter value corresponding to a particular subband from the next and previous parameter values provided by the set of side information.
According to one embodiment, from the predetermined set of head-related transfer function filters, a left-right pair of HRTF filters corresponding to each loudspeaker direction of the original multi-channel loudspeaker configuration is chosen to be applied.
According to one embodiment, said set of side information comprises a set of gain estimates for the channel signals of the multi-channel audio describing the original sound image.
According to one embodiment, the gain estimates of the original multi-channel audio are determined as a function of time and frequency, and the gains for each loudspeaker channel are adjusted such that the sum of the squares of the gain values is equal to one.
According to one embodiment, the combined signal or signals are divided into one of the following types of subbands: a plurality of quadrature mirror filter (QMF) subbands; a plurality of Equivalent Rectangular Bandwidth (ERB) subbands; or a plurality of psychoacoustically motivated frequency bands.
According to one embodiment, the parameter values are gain values for at least one subband.
According to one embodiment, the step of determining the gain values for the subbands further comprises: determining gain values for each channel signal of the multi-channel audio describing the original sound image; and interpolating a single gain value for each subband from the gain values of the channel signals.
According to one embodiment, a frequency-domain representation of the binaural signal for a subband is determined by multiplying the combined signal or signals by at least one gain value and a head-related transfer function filter.
The arrangement according to the invention provides significant advantages. A major advantage is the simplicity and low computational complexity of the decoding process. The decoder is also flexible in the sense that it performs the binaural synthesis entirely on the basis of the spatial and coding parameters given by the encoder. In addition, spatiality equal to that of the original signal is maintained in the conversion.
As for the side information, a set of gain estimates of the original mix is sufficient. Most significantly, the invention makes improved use of the compressed intermediate state provided by parametric audio coding possible, improving transmission efficiency as well as audio storage. If the gain values are determined for the subbands from the side information, the quality of the binaural output signal can be improved by introducing smoother changes of the gain values from one frequency band to another, and the filtering can be simplified significantly. Further aspects of the invention include various apparatuses configured to carry out the inventive steps of the above methods.

BRIEF DESCRIPTION OF THE DRAWINGS

In the following, various embodiments of the invention will be described in greater detail with reference to the accompanying drawings, in which: Figure 1 shows a generic Binaural Cue Coding (BCC) scheme according to the prior art; Figure 2 shows the general structure of a BCC synthesis scheme according to the prior art; Figure 3 shows a block diagram of the binaural decoder according to an embodiment of the invention; and Figure 4 shows an electronic device according to an embodiment of the invention as a reduced block diagram.
DETAILED DESCRIPTION OF EMBODIMENTS OF THE INVENTION

In the following, the invention is illustrated with reference to Binaural Cue Coding (BCC) as an example platform for implementing the decoding scheme according to the embodiments. It is clear, however, that the invention is not limited to BCC-type spatial audio coding methods alone, but can be implemented in any audio coding scheme that provides at least one combined audio signal derived from the original set of one or more audio channels together with appropriate spatial side information.
Binaural Cue Coding (BCC) is a general concept for the parametric representation of spatial audio, delivering a multi-channel output with an arbitrary number of channels from a single audio channel plus some side information. Figure 1 illustrates this concept. Several (M) input audio channels are combined into a single output signal S ("sum") by a downmix process. In parallel, the most salient inter-channel cues describing the multi-channel sound image are extracted from the input channels and coded compactly as BCC side information. Both the sum signal and the side information are then transmitted to the receiver side, possibly using an appropriate low-bit-rate audio coding scheme for coding the sum signal. Finally, the BCC decoder generates a multi-channel (N) output signal for loudspeakers from the transmitted sum signal and the spatial cue information, by re-synthesizing channel output signals that carry the relevant inter-channel cues, such as Inter-channel Time Difference (ICTD), Inter-channel Level Difference (ICLD) and Inter-channel Coherence (ICC). Accordingly, the BCC side information, i.e. the inter-channel cues, is chosen with a view to optimizing the reconstruction of the multi-channel audio signal, particularly for loudspeaker reproduction.
There are two BCC schemes, namely BCC for flexible rendering (type I BCC), which is intended for transmitting a number of separate source signals so that they can be rendered at the receiver, and BCC for natural rendering (type II BCC), which is intended for transmitting a number of audio channels of a stereo or surround signal. BCC for flexible rendering takes separate audio source signals (e.g. speech signals, separately recorded instruments, multi-track recordings) as input. BCC for natural rendering, in turn, takes a "final mix" stereo or multi-channel signal as input (for example compact disc (CD) audio or digital versatile disc (DVD) surround sound). If these signals were conveyed by conventional coding techniques, the bit rate would scale proportionally, or at least nearly proportionally, to the number of audio channels; for example, transmitting the six audio channels of the 5.1 multi-channel system requires a bit rate nearly six times that of one audio channel. Both BCC schemes, however, result in a bit rate that is only slightly higher than the bit rate required for the transmission of one audio channel, since the BCC side information requires only a very low bit rate (for example 2 kbit/s).
Figure 2 shows the general structure of a BCC synthesis scheme. The transmitted mono signal ("sum") is first windowed in the time domain into frames and then mapped to a spectral representation of appropriate subbands by a Fast Fourier Transform (FFT) process and a filter bank FB.
In the general case of multiple playback channels, the ICLD and ICTD are considered in each subband between pairs of channels, i.e. for each channel relative to a reference channel. The subbands are selected such that a sufficiently high frequency resolution is achieved; for example, a subband bandwidth equal to twice the Equivalent Rectangular Bandwidth (ERB) is typically considered adequate. For each output channel to be generated, the individual time delays (ICTD) and level differences (ICLD) are imposed on the spectral coefficients, followed by a coherence synthesis process that reintroduces the most relevant aspects of coherence and/or correlation (ICC) between the synthesized audio channels. Finally, all synthesized output channels are converted back into a time-domain representation by an inverse FFT (IFFT) process, resulting in the multi-channel output. For a more detailed description of the BCC approach, reference is made to: F. Baumgarte and C. Faller: "Binaural Cue Coding - Part I: Psychoacoustic Fundamentals and Design Principles", IEEE Transactions on Speech and Audio Processing, Vol. 11, No. 6, November 2003, and to: C. Faller and F. Baumgarte: "Binaural Cue Coding - Part II: Schemes and Applications", IEEE Transactions on Speech and Audio Processing, Vol. 11, No. 6, November 2003.
BCC is one example of a coding scheme that provides a suitable platform for implementing the decoding scheme according to the embodiments. The binaural decoder according to one embodiment receives the downmixed mono signal and the side information as inputs. The idea is to replace each loudspeaker of the original mix with an HRTF pair corresponding to the direction of that loudspeaker with respect to the listener's position. Each frequency channel of the mono signal feeds each pair of filters implementing the HRTFs in the proportion dictated by a set of gain values, which can be computed on the basis of the side information. The process can therefore be thought of as implementing, in the binaural audio context, a set of virtual loudspeakers corresponding to the original ones. The invention thus adds value to BCC by allowing not only multi-channel audio signals for various loudspeaker configurations but also a binaural audio signal to be derived directly from the parametrically encoded spatial audio signal, without any intermediate BCC synthesis process.
Embodiments of the invention are illustrated below with reference to Figure 3, which shows a block diagram of the binaural decoder according to an aspect of the invention. The decoder (300) comprises a first input (302) for the mono signal and a second input (304) for the side information. The inputs (302), (304) are shown as separate inputs in order to illustrate the embodiments, but a skilled person will appreciate that in a practical implementation the mono signal and the side information can be supplied through the same input. According to one embodiment, the side information does not have to include the same inter-channel cues as in the BCC schemes, that is Inter-channel Time Difference (ICTD), Inter-channel Level Difference (ICLD) and Inter-channel Coherence (ICC), but rather only a set of gain estimates that define the distribution of sound pressure among the channels of the original mix in each frequency band of sufficient resolution.
In addition to the gain estimates, the side information preferably includes the number and positions of the loudspeakers of the original mix relative to the listener's position, as well as the frame length used. According to one embodiment, instead of transmitting the gain estimates as part of the side information from the encoder, the gain estimates are computed in the decoder from the inter-channel cues of the BCC schemes, for example from the ICLD. The decoder (300) further comprises a windowing unit (306), in which the mono signal is first divided into time frames of the frame length used, and the frames are then windowed appropriately, for example with a sinusoidal window. An appropriate frame length should be chosen such that the frames are long enough for the discrete Fourier transform (DFT) while at the same time being short enough to handle rapid variations in the signal. Experiments have shown that a suitable frame length is approximately 50 ms. Accordingly, if the sampling frequency of 44.1 kHz (commonly used in various audio coding schemes) is used, the frame may comprise, for example, 2048 samples, resulting in a frame length of 46.4 ms. The windowing is preferably performed such that adjacent windows overlap by 50% in order to smooth the transitions caused by the spectral modifications (level and delay). The windowed mono signal is then transformed into the frequency domain in an FFT unit (308). The processing is performed in the frequency domain in order to achieve computational efficiency. A skilled person will realize that the preceding signal-processing steps can be carried out outside the actual decoder (300), i.e. the windowing unit (306) and the FFT unit (308) may be implemented in the apparatus in which the decoder is included, and the mono signal to be processed is already windowed and transformed into the frequency domain when it is supplied to the decoder.
For efficient processing of the frequency-domain signal, the signal is fed to a filter bank (310), which divides the signal into psychoacoustically motivated frequency bands. According to one embodiment, the filter bank (310) is configured to divide the signal into 32 frequency bands following the commonly known Equivalent Rectangular Bandwidth (ERB) scale, resulting in the signal components x0, ..., x31 in the 32 frequency bands. The decoder (300) comprises a set of HRTFs (312), (314) as pre-stored information, from which a left-right HRTF pair corresponding to each loudspeaker direction is chosen. For purposes of illustration, two HRTF sets (312), (314) are shown in Figure 3, one for the left-side signal and one for the right-side signal, but it is evident that in a practical implementation a single HRTF set suffices. To adjust the chosen left-right HRTF pairs to correspond to the sound level of each loudspeaker channel, gain values G are preferably estimated. As mentioned above, the gain estimates can be included in the side information received from the encoder, or they can be computed in the decoder on the basis of the BCC side information.
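A minimal sketch of the analysis front end described above (sinusoidal windows with 50% overlap, FFT, and grouping of FFT bins into 32 ERB-like bands) is given below. The ERB band-edge computation is a common approximation assumed for this sketch; it is not a formula taken from the patent.

```python
import numpy as np

FRAME = 2048            # samples per frame (about 46.4 ms at 44.1 kHz)
HOP = FRAME // 2        # 50% overlap
FS = 44100

def sinusoidal_window(n):
    # sine window; squared sine windows at 50% overlap sum to one
    return np.sin(np.pi * (np.arange(n) + 0.5) / n)

def analysis_frames(mono):
    """Window the mono sum signal and transform each frame to the frequency domain."""
    win = sinusoidal_window(FRAME)
    starts = range(0, len(mono) - FRAME + 1, HOP)
    return [np.fft.rfft(mono[s:s + FRAME] * win) for s in starts]

def erb_band_edges(num_bands=32, fs=FS, nfft=FRAME):
    """Assumed ERB-rate spaced band edges mapped to FFT bin indices (sketch only)."""
    def hz_to_erb(f):
        return 21.4 * np.log10(1.0 + 0.00437 * f)
    def erb_to_hz(e):
        return (10 ** (e / 21.4) - 1.0) / 0.00437
    erb_edges = np.linspace(0.0, hz_to_erb(fs / 2.0), num_bands + 1)
    hz_edges = erb_to_hz(erb_edges)
    bins = np.round(hz_edges / (fs / nfft)).astype(int)
    return np.clip(bins, 0, nfft // 2 + 1)

def split_into_bands(spectrum, edges):
    """Return the list of per-band spectral components x0 ... x31."""
    return [spectrum[edges[b]:edges[b + 1]] for b in range(len(edges) - 1)]
```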
As noted, a gain is estimated for each loudspeaker channel as a function of time and frequency, and in order to preserve the gain level of the original mix, the gains for each loudspeaker channel are preferably adjusted such that the sum of the squares of the gain values equals one. This provides the advantage that, if N is the number of channels to be generated virtually, only N-1 gain estimates need to be transmitted from the encoder, and the missing gain value can be computed from the N-1 transmitted gain values. A skilled person will realize, however, that the operation of the invention does not require the sum of the squares of the gain values to equal one; the decoder can instead scale the squares of the gain values such that their sum equals one.
Each left-right pair of the HRTF filters (312), (314) is then adjusted in the proportion dictated by the gain set G, resulting in the adjusted HRTF filters (312'), (314'). Again, it should be noted that in practice only the magnitudes of the original HRTF filters (312), (314) are scaled according to the gain values; the "additional" HRTF sets (312'), (314') are shown in Figure 3 only for the purpose of illustrating the embodiments. For each frequency band, the mono-signal components x0, ..., x31 are fed to each left-right pair of the adjusted HRTF filters (312'), (314'). The filter outputs for the left-side signal and for the right-side signal are then summed in summing units (316), (318) for the two binaural channels. The summed binaural signals are again windowed with a sinusoidal window and transformed back into the time domain by an inverse FFT process carried out in the IFFT units (320), (322). In case the analysis filters do not sum to one, or their phase response is not linear, a suitable synthesis filter bank is preferably used to avoid distortion in the final binaural signals BL and BR. According to one embodiment, in order to improve the externalization, i.e. the out-of-head localization, of the binaural signal, a moderate room response can be added to the binaural signal. For that purpose, the decoder may comprise a reverberation unit, preferably located between the summing units (316), (318) and the IFFT units (320), (322).
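Returning to the gain handling and per-band filtering described above, the following sketch illustrates recovering the missing N-th gain from N-1 transmitted gains under the unit sum-of-squares constraint, and weighting pre-stored left/right HRTF responses per band before summing over the virtual loudspeakers. Array shapes and names are assumptions of this sketch.

```python
import numpy as np

def complete_gains(partial_gains):
    """Given N-1 transmitted gains, recover the missing one so that the sum of squares == 1."""
    partial = np.asarray(partial_gains, dtype=float)
    residual = 1.0 - np.sum(partial ** 2)
    missing = np.sqrt(max(residual, 0.0))
    return np.append(partial, missing)

def normalize_gains(gains):
    """Alternatively, rescale an arbitrary gain set so that its squares sum to one."""
    gains = np.asarray(gains, dtype=float)
    return gains / np.sqrt(np.sum(gains ** 2))

def synthesize_band(x_band, gains, hrtf_left, hrtf_right):
    """Weight each virtual loudspeaker's HRTF pair by its gain and sum the outputs.

    x_band     : complex spectrum of the mono sum signal in one frequency band
    gains      : (N,) per-channel gains for this band and time instant
    hrtf_left  : (N, len(x_band)) left-ear HRTF responses, one row per channel
    hrtf_right : (N, len(x_band)) right-ear HRTF responses, one row per channel
    """
    left = np.zeros_like(x_band)
    right = np.zeros_like(x_band)
    for g, h_l, h_r in zip(gains, hrtf_left, hrtf_right):
        left += x_band * (g * h_l)      # adjusted left-ear filter for this channel
        right += x_band * (g * h_r)     # adjusted right-ear filter for this channel
    return left, right
```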
The room response added by the reverberation unit mimics the effect of a room in a real listening situation. The required reverberation time is, however, short enough that the computational complexity does not increase noticeably.
The binaural decoder (300) described in Figure 3 also makes possible a special decoding case, a stereo downmix, in which the spatial image becomes narrow. The operation of the decoder (300) is modified such that each adjustable HRTF filter (312), (314), which in the previous embodiments was only scaled according to the gain values, is replaced by a predetermined gain. Accordingly, the mono signal is processed through constant "HRTF filters" consisting of a single gain, multiplied by a set of gain values computed from the side information. As a result, the spatial audio is downmixed into a stereo signal. This special case provides the advantage that a stereo signal can be created from the combined signal using the spatial side information without the need to decode the spatial audio, whereby the stereo decoding procedure is simpler than conventional BCC synthesis. The structure of the binaural decoder (300) remains the same as in Figure 3; only the adjustable HRTF filters (312), (314) are replaced by downmix filters having predetermined gains for the stereo downmix. If the binaural decoder comprises HRTF filters for, for example, a 5.1 surround audio configuration, then for the special case of stereo downmix decoding the constant gains of the HRTF filters may be, for example, as defined in Table 1.
Table 1. HRTF filters for stereo downmix

The arrangement according to the invention provides significant advantages. A major advantage is the simplicity and low computational complexity of the decoding process. The decoder is also flexible in the sense that it performs the binaural upmix based entirely on the coding and spatial parameters given by the encoder. In addition, the spatiality of the original signal is maintained in the conversion. As for the side information, a set of gain estimates of the original mix is sufficient. From the point of view of audio transmission or storage, the most significant advantage is gained through the improved efficiency of using the compressed intermediate state provided by parametric audio coding.
A skilled person will realize that, since HRTFs are highly individual and cannot be averaged, perfect re-spatialization could only be achieved by measuring the unique HRTF set of the listener himself or herself. Accordingly, the use of HRTFs inevitably colours the signal such that the quality of the processed audio is not equivalent to the original. However, since measuring each listener's HRTFs is an unrealistic option, the best achievable result is obtained by using either a modelled HRTF set or a set measured from a dummy head or from a person with a head of average size and good symmetry.
As stated above, according to one embodiment, the gain estimates can be included in the side information received from the encoder. Accordingly, one aspect of the invention relates to an encoder for a multi-channel spatial audio signal that estimates a gain for each loudspeaker channel as a function of frequency and time and includes the gain estimates in the side information to be transmitted along with the combined channel (or channels). The encoder can be, for example, a BCC encoder known as such, which is further configured to calculate the gain estimates, either in addition to or instead of the inter-channel cues ICTD, ICLD and ICC describing the multi-channel sound image. Both the sum signal and the side information, comprising at least the gain estimates, are then transmitted to the receiver side, preferably using an appropriate low-bit-rate audio coding scheme for coding the sum signal.
According to one embodiment, if the gain estimates are calculated in the encoder, the calculation is carried out by comparing the gain level of each individual channel with the accumulated gain level of the combined channel; that is, if the gain levels are denoted by X, the individual channels of the original loudspeaker configuration by m and the samples by k, the gain estimate for each channel is calculated from Xm(k) relative to the accumulated level of the combined channel. The gain estimates thus describe the proportional gain magnitude of each individual channel compared to the total gain magnitude of all channels.
According to one embodiment, if the gain estimates are calculated in the decoder on the basis of the BCC side information, the calculation can be carried out, for example, on the basis of the inter-channel level difference (ICLD) values. Thus, if N is the number of "loudspeakers" to be generated virtually, N-1 equations relating the unknown channel levels are first formed on the basis of the ICLD values.
The sum of the squares of the loudspeaker channel levels is then set equal to 1, whereby the gain estimate of one individual channel can be solved, and on the basis of that solved gain estimate the remaining gain estimates can be solved from the N-1 equations. For example, if the number of channels to be generated virtually is five (N = 5), the N-1 equations can be formed as follows: L2 = L1 + ICLD1, L3 = L1 + ICLD2, L4 = L1 + ICLD3 and L5 = L1 + ICLD4. The sum of their squares is then set equal to 1: L1² + (L1 + ICLD1)² + (L1 + ICLD2)² + (L1 + ICLD3)² + (L1 + ICLD4)² = 1. The value of L1 can then be solved, and on the basis of L1 the remaining gain level values L2-L5 can be solved.
According to a further embodiment, the basic idea of the invention, i.e. generating a binaural signal directly from a parametrically encoded audio signal without having to decode it first into a multi-channel format, can also be implemented such that, instead of using the set of gain estimates and applying them to each frequency subband, only the channel level information part (ICLD) of the side-information bit stream is used, together with the sum signal(s), to construct the binaural signal. Thus, instead of defining a set of gain estimates in the decoder or including the gain estimates in the BCC side information in the encoder, the channel level information (ICLD) of the conventional BCC side information of each original channel is appropriately processed as a function of time and frequency in the decoder. The original sum signal (or signals) is divided into appropriate frequency bins, and the gains for the frequency bins are derived from the channel level information. This makes it possible to improve the quality of the binaural output signal by introducing smoother changes of the gain values from one frequency band to another.
In this embodiment, the preliminary stages of the process are similar to those described above: the sum signal(s) (mono or stereo) and the side information are input to the decoder, and the sum signal is divided into time frames of the frame length used, which are then windowed appropriately, for example with a sinusoidal window. Again, 50% overlapping sinusoidal windows are used in the analysis, and an FFT is used to convert the time-domain signal efficiently into the frequency domain. Now, if the length of the analysis window is N samples and the windows overlap by 50%, there are N/2 frequency bins in the frequency domain. In this embodiment, instead of dividing the signal into psychoacoustically motivated frequency bands, such as subbands according to the ERB scale, the processing is applied to these frequency bins.
As described above, the side information of the BCC encoder provides information on how the sum signal(s) should be scaled to obtain each individual channel. Gain information is generally provided only for restricted time and frequency positions. In the time direction, gain values are given, for example, once per frame of 2048 samples. For the implementation of the present embodiment, gain values are needed in the middle of each sinusoidal window and for each frequency bin (i.e. N/2 gain values in the middle of each sinusoidal window). This is efficiently achieved by means of interpolation. Alternatively, the gain information can be provided at certain time instants in the side information, and the number of time instants within a frame can also be provided in the side information.
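Returning to the ICLD-based gain derivation illustrated above (the N = 5 example), the following is a minimal sketch of solving the channel levels under the unit sum-of-squares constraint. It assumes, as in the example, that the ICLD values are expressed on the same scale as the levels so that they can be added directly; if the ICLDs were given in dB they would first have to be converted.

```python
import numpy as np

def levels_from_icld(iclds):
    """Solve channel levels L1..LN from N-1 ICLD values, as in the N = 5 example.

    The equations are L_{c+1} = L1 + ICLD_c for c = 1..N-1, with the constraint
    L1^2 + (L1 + ICLD_1)^2 + ... + (L1 + ICLD_{N-1})^2 = 1.
    Expanding gives a quadratic in L1:  N*L1^2 + 2*(sum ICLD)*L1 + (sum ICLD^2 - 1) = 0.
    """
    d = np.asarray(iclds, dtype=float)
    n = len(d) + 1
    a = float(n)
    b = 2.0 * np.sum(d)
    c = np.sum(d ** 2) - 1.0
    disc = b * b - 4.0 * a * c
    if disc < 0:
        raise ValueError("no real solution for these ICLD values")
    l1 = (-b + np.sqrt(disc)) / (2.0 * a)   # take the larger root
    return np.concatenate(([l1], l1 + d))

# Example: four ICLD values -> five channel levels whose squares sum to one.
levels = levels_from_icld([0.1, -0.05, 0.0, 0.02])
print(levels, np.sum(levels ** 2))
```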
In the alternative implementation just mentioned, where the gain information is provided at certain time instants, the gain values are interpolated on the basis of knowledge of those time instants and of the number of time instants at which the gain values are updated. Assume that the BCC multi-channel encoder provides Ng gain values at time instants tm, m = 0, 1, 2, .... In relation to the current time instant tw (the centre of the current sinusoidal window), the previous and next sets of gain values provided by the BCC multi-channel encoder are looked up; let these instants be denoted tprev and tnext. Using, for example, linear interpolation, the Ng gain values are interpolated at the time instant tw such that the distances from tw to tprev and tnext are used as weighting factors in the interpolation. According to another embodiment, the set of gain values (at tprev or tnext) closer to the time instant tw is simply selected, which provides a more straightforward, though coarser, way of determining the gain values.
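A minimal sketch of the time-direction handling just described, assuming linear distance-weighted interpolation between the gain sets at tprev and tnext, with nearest-neighbour selection as the simpler alternative:

```python
import numpy as np

def interpolate_gains_in_time(t_w, t_prev, g_prev, t_next, g_next, method="linear"):
    """Interpolate a set of Ng gain values to the window centre t_w.

    g_prev, g_next : arrays of Ng gain values given at times t_prev <= t_w <= t_next
    method         : "linear" distance-weighted interpolation, or "nearest" selection
    """
    g_prev = np.asarray(g_prev, dtype=float)
    g_next = np.asarray(g_next, dtype=float)
    if method == "nearest":
        return g_prev if (t_w - t_prev) <= (t_next - t_w) else g_next
    if t_next == t_prev:
        return g_prev.copy()
    w = (t_w - t_prev) / (t_next - t_prev)   # 0 at t_prev, 1 at t_next
    return (1.0 - w) * g_prev + w * g_next

# Example: gains updated every 2048 samples, window centre at sample 1500.
g_at_tw = interpolate_gains_in_time(1500, 0, [0.5, 0.3, 0.8], 2048, [0.6, 0.2, 0.7])
print(g_at_tw)
```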
After a set of Ng gain values has been determined for the current time instant, these need to be interpolated in the frequency direction to obtain an individual gain value for each of the N/2 frequency bins. Simple linear interpolation can be used for this task, although for example sinc interpolation can also be used. Generally, the Ng gain values are given with higher resolution at low frequencies (the resolution can follow, for example, the ERB scale), which has to be taken into account in the interpolation. The interpolation can be performed in the linear or in the logarithmic domain. The total number of interpolated gain sets is equal to the number of output channels of the multi-channel decoder multiplied by the number of sum signals.
In addition, the HRTFs for the original loudspeaker directions are needed to construct the binaural signal. The HRTFs are also converted into the frequency domain. To make the frequency-domain processing more straightforward, the same frame length (N samples) is used in this conversion as is used for converting the time-domain sum signal(s) into the frequency domain (giving N/2 frequency bins).
Let Y1(n) and Y2(n) be the frequency-domain representations of the left and right binaural signals, respectively. In the case of one sum signal (i.e. a monophonic sum signal Xsum1(n)), the binaural output is constructed as follows:
Y1(n) = Xsum1(n) · Σ_{c=1..C} [ H1^c(n) · g1^c(n) ]
Y2(n) = Xsum1(n) · Σ_{c=1..C} [ H2^c(n) · g1^c(n) ]
where 0 ≤ n < N/2, C is the total number of channels in the BCC multi-channel encoder (for example, a 5.1 audio signal comprises 6 channels), and g1^c(n) is the interpolated gain value used to construct channel c from the mono sum signal at the current time instant tw. H1^c(n) and H2^c(n) are the DFT-domain representations of the HRTFs for the left and right ears for output channel c of the multi-channel encoder, i.e. the direction of each original channel has to be known.
When two sum signals (a stereo sum signal) are provided by the BCC multi-channel encoder, both sum signals (Xsum1(n) and Xsum2(n)) contribute to both binaural outputs as follows:
Y1(n) = Xsum1(n) · Σ_{c=1..C} [ H1^c(n) · g1^c(n) ] + Xsum2(n) · Σ_{c=1..C} [ H1^c(n) · g2^c(n) ]
Y2(n) = Xsum1(n) · Σ_{c=1..C} [ H2^c(n) · g1^c(n) ] + Xsum2(n) · Σ_{c=1..C} [ H2^c(n) · g2^c(n) ]
where 0 ≤ n < N/2. Here g1^c(n) and g2^c(n) represent the gains used for the left and right sum signals in the multi-channel encoder to construct output channel c as their sum. Again, the later stages of the process are similar to those described above: the Y1(n) and Y2(n) values are transformed back into the time domain by the IFFT process, the signals are once again windowed with a sinusoidal window, and the overlapping windows are added together.
The main advantage of the embodiment described above is that the gains do not change abruptly from one frequency bin to another, which can happen when ERB (or other) subbands are used. The quality of the binaural output signal is therefore generally better. In addition, by using summed DFT-domain representations of the HRTFs for the left and right ears instead of applying a particular left-right HRTF pair for each channel of the multi-channel audio, the filtering can be simplified significantly. In the embodiment described above, the binaural signal was constructed in the DFT domain, and the division of the signal into subbands according to the ERB scale with a filter bank can be omitted. Although this implementation advantageously needs no filter bank, a skilled person will realize that another related transform than the DFT, or appropriate filter-bank structures, can also be used, provided the frequency resolution is sufficiently high.
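The following sketch implements the DFT-domain construction above for the mono-sum case, together with the stereo-sum extension, using per-bin interpolated gains and DFT-domain HRTFs. The array shapes and variable names are assumptions of this sketch.

```python
import numpy as np

def binaural_bins_mono(x_sum, gains, hrtf_left, hrtf_right):
    """Construct Y1(n), Y2(n) from a mono sum spectrum.

    x_sum      : (Nbins,) complex spectrum of the mono sum signal
    gains      : (C, Nbins) interpolated gains g1^c(n), one row per encoder channel
    hrtf_left  : (C, Nbins) DFT-domain HRTFs H1^c(n) for the left ear
    hrtf_right : (C, Nbins) DFT-domain HRTFs H2^c(n) for the right ear
    """
    combined_left = np.sum(hrtf_left * gains, axis=0)    # sum over channels c
    combined_right = np.sum(hrtf_right * gains, axis=0)
    return x_sum * combined_left, x_sum * combined_right

def binaural_bins_stereo(x_sum1, x_sum2, g1, g2, hrtf_left, hrtf_right):
    """Stereo-sum case: both sum signals contribute to both binaural outputs."""
    y1a, y2a = binaural_bins_mono(x_sum1, g1, hrtf_left, hrtf_right)
    y1b, y2b = binaural_bins_mono(x_sum2, g2, hrtf_left, hrtf_right)
    return y1a + y1b, y2a + y2b
```

Note that the two combined filters (the sums over c) need to be formed only once per frame and then applied to the sum spectrum, which is the simplification over per-channel left-right HRTF pairs mentioned above.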
Where another transform or filter-bank structure is used instead of the DFT, the above construction equations for Y1(n) and Y2(n) must be modified such that the HRTF filtering is carried out according to the properties of the transform or filter bank in question. Thus, if for example a QMF filter bank is applied, the frequency resolution is defined by the QMF subbands. If the set of Ng gain values is smaller than the number of QMF subbands, the gain values are interpolated to obtain an individual gain for each subband. For example, 28 gain values corresponding to 28 frequency bands available in the side information for a given time instant can be mapped onto 105 QMF subbands by linear or non-linear interpolation, in order to avoid abrupt variations between adjacent narrow subbands. The equations described above for the frequency-domain representations of the left and right binaural signals (Y1(n), Y2(n)) then apply as well, with the exception that H1^c(n) and H2^c(n) are HRTF filters in the QMF domain in matrix form, and Xsum1(n) is a block of the mono signal. In the case of a stereo sum signal, the HRTF filters are in the form of a convolution matrix, and Xsum1(n) and Xsum2(n) are blocks of the two sum signals, respectively. An example of an actual filtering implementation in the QMF domain is described in IEEE document 0-7803-5041-3/99, C. A. Lanciani et al.: "Subband-domain filtering of MPEG audio signals".
For the sake of simplicity, the previous examples were described such that the input channels (M) are downmixed in the encoder to form a single combined channel (for example mono). The embodiments are, however, equally applicable in alternative implementations in which the multiple input channels (M) are downmixed to form two or more separate combined channels (S), depending on the particular audio processing application. If the downmix produces multiple combined channels, the combined channel data can be transmitted using conventional audio transmission techniques. For example, if two combined channels are generated, conventional stereo transmission techniques can be employed; in this case a BCC decoder can extract and use the BCC codes to synthesize a binaural signal from the two combined channels.
According to one embodiment, the number (N) of "loudspeakers" generated virtually in the synthesized binaural signal may be different (greater or smaller) from the number of input channels (M), depending on the particular application. For example, the input audio could correspond to 7.1 surround sound and the binaural output audio could be synthesized to correspond to 5.1 surround sound, or vice versa. The above embodiments can be generalized such that the embodiments of the invention allow M input audio channels to be converted into S combined audio channels and one or more corresponding sets of side information, where M > S, and N output audio channels to be generated from the S combined audio channels and the corresponding sets of side information, where N > S, and N may be the same as or different from M. Since the bit rate required for the transmission of a combined channel and the required side information is very low, the invention is especially applicable in systems where the available bandwidth is a scarce resource, such as in wireless communication systems.
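Returning to the QMF-domain variant discussed above, the sketch below maps a coarse set of per-band gains (28 values in the text's example) onto a larger number of QMF subbands (105 in the example) by interpolation. The placement of the coarse band centres is an assumption of this sketch; in practice it would follow the encoder's band layout, for example an ERB-like spacing.

```python
import numpy as np

def map_gains_to_qmf(band_gains, num_qmf_bands=105):
    """Interpolate a coarse per-band gain vector onto a finer set of QMF subbands.

    band_gains    : e.g. 28 gain values for one time instant from the side information
    num_qmf_bands : number of QMF subbands of the filter bank (105 in the example above)
    """
    band_gains = np.asarray(band_gains, dtype=float)
    # Assumption: coarse bands spread evenly over the QMF range for illustration only.
    coarse_centres = np.linspace(0.0, num_qmf_bands - 1, num=len(band_gains))
    qmf_indices = np.arange(num_qmf_bands)
    return np.interp(qmf_indices, coarse_centres, band_gains)

# Example: 28 coarse gains mapped onto 105 QMF subbands.
coarse = np.linspace(0.9, 0.2, 28)
print(map_gains_to_qmf(coarse).shape)   # (105,)
```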
Given the low bit rate required, the embodiments are especially applicable in mobile terminals or other portable devices that typically lack high-quality loudspeakers, and in which the characteristics of multi-channel surround sound can be conveyed through headphones by listening to the binaural audio signal according to the embodiments. A further field of viable applications includes teleconferencing services, where the participants of a teleconference can be easily distinguished by giving the listeners the impression that the participants speaking are located at different places in the conference room.
Figure 4 illustrates a simplified structure of a data processing device (TE) in which the binaural decoding system according to the invention can be implemented. The data processing device (TE) can be, for example, a mobile terminal, an MP3 player, a personal digital assistant (PDA) device or a personal computer (PC). The data processing device (TE) comprises input/output (I/O) means, a central processing unit (CPU) and memory (MEM). The memory (MEM) comprises a read-only memory (ROM) portion and a rewritable portion, such as random access memory (RAM) and flash memory (FLASH). Information used for communicating with different external parties, for example a read-only compact disc (CD-ROM), other devices and the user, is conveyed through the I/O means to/from the central processing unit (CPU). If the data processing device is implemented as a mobile station, it typically includes a transceiver Tx/Rx, which communicates with the wireless network, typically with a base transceiver station (BTS), by means of an antenna. The user interface (UI) equipment typically includes a display, a keypad, a microphone and connecting means for headphones. The data processing device may further comprise connecting means, such as a standard-form slot for multimedia cards (MMC), for various hardware modules or integrated circuits (IC), which may provide various applications to be run on the data processing device. Accordingly, the binaural decoding system according to the invention can be run in a central processing unit (CPU) or in a dedicated digital signal processor (DSP, acting as a parametric code processor) of the data processing device.
The data processing device thereby receives a parametrically encoded audio signal comprising at least one combined signal of a plurality of audio channels and one or more corresponding sets of side information describing a multi-channel sound image. The parametrically encoded audio signal may be received from memory means, for example a CD-ROM, or from a wireless network via the antenna and the Tx/Rx transceiver. The data processing device further comprises a suitable filter bank and a predetermined set of head-related transfer function filters, whereby the data processing device transforms the combined signal into the frequency domain and applies a suitable left-right pair of head-related transfer function filters to the combined signal in a proportion determined by the corresponding set of side information, to synthesize a binaural audio signal, which is then reproduced via the headphones.
Likewise, the encoding system according to the invention can be run in a central processing unit (CPU) or in a dedicated digital signal processor (DSP) of the data processing device, whereby the data processing device generates a parametrically encoded audio signal comprising at least one combined signal of a plurality of audio channels and one or more corresponding sets of side information including gain estimates for the channel signals of the multi-channel audio.
The functionalities of the invention can be implemented in a terminal device, such as a mobile station, also as a computer program which, when executed in a central processing unit (CPU) or in a dedicated digital signal processor (DSP), causes the terminal device to implement the methods of the invention. The functions of the computer program (SW) can be distributed over several separate program components that communicate with one another. The computer software can be stored on any memory medium, such as the hard disk of a PC or a CD-ROM disc, from which it can be loaded into the memory of the mobile terminal. The computer software can also be loaded over a network, for example using a Transmission Control Protocol/Internet Protocol (TCP/IP) stack.
It is also possible to use hardware solutions, or a combination of hardware and software solutions, to implement the inventive means. Accordingly, the above computer program product can be implemented at least partly as a hardware solution, for example as application-specific integrated circuits (ASICs) or field-programmable gate arrays (FPGAs), in a hardware module comprising connecting means for connecting the module to an electronic device, or as one or more integrated circuits (ICs), the hardware module or the ICs further including various means for performing said program code tasks, said means being implemented as hardware and/or software.
It will be apparent to a person skilled in the art that the present invention is not limited solely to the embodiments presented above, but that it can be modified within the scope of the appended claims.

Claims (33)

1. A method for synthesizing a binaural audio signal, the method comprising: inputting a parametrically encoded audio signal comprising at least one combined signal of a plurality of audio channels and one or more corresponding sets of side information describing a multi-channel sound image; dividing the combined signal or signals into a plurality of subbands; determining parameter values for the subbands from the set of side information; and applying a predetermined set of head-related transfer function filters to the combined signal or signals in a proportion determined by the parameter values to synthesize a binaural audio signal.
2. The method according to claim 1, wherein: the parameter values are determined by interpolating a parameter value corresponding to a particular subband from the next and previous parameter values provided by the set of side information.
3. The method according to claim 1 or 2, further comprising: applying, from the predetermined set of head-related transfer function filters, a left-right pair of head-related transfer function filters corresponding to each loudspeaker direction of the original multi-channel audio.
4. The method according to any one of the preceding claims, wherein: the set of side information comprises a set of gain estimates for the channel signals of the multi-channel audio describing the original sound image.
5. The method according to claim 4, wherein: the set of side information further comprises the number and positions of the loudspeakers of the original multi-channel sound image in relation to a listener position, and the frame length used.
6. The method according to claim 3, wherein: the set of side information comprises inter-channel cues used in a Binaural Cue Coding (BCC) scheme, such as Inter-channel Time Difference (ICTD), Inter-channel Level Difference (ICLD) and Inter-channel Coherence (ICC), the method further comprising: calculating a set of gain estimates of the original multi-channel audio on the basis of at least one of the inter-channel cues of the BCC scheme.
7. The method according to any of claims 4-6, further comprising: determining the set of gain estimates of the original multi-channel audio as a function of time and frequency; and adjusting the gains for each loudspeaker channel such that the sum of the squares of the gain values is equal to one.
8. The method according to claim 1, further comprising: dividing the combined signal or signals into one of the following types of subbands: a plurality of QMF subbands; a plurality of Equivalent Rectangular Bandwidth (ERB) subbands; or a plurality of psychoacoustically motivated frequency bands.
9. The method according to claim 8, further comprising: dividing the combined signal or signals in the frequency domain into 32 frequency bands complying with the Equivalent Rectangular Bandwidth (ERB) scale.
10. The method according to claim 9, further comprising: summing the outputs of the head-related transfer function filters for each of the frequency bands for a left-side signal and a right-side signal separately; and transforming the summed left-side signal and the summed right-side signal into the time domain to create a left-side component and a right-side component of a binaural audio signal.
11. The method according to claim 1, wherein: the parameter values are gain values for at least one subband.
12. The method according to claim 11, wherein: the gain values are determined by selecting the closest gain value provided by the set of side information.
13. The method according to claim 11 or 12, wherein the step of dividing the combined signal or signals into a plurality of subbands further comprises: dividing the combined signal or signals into frames comprising a predetermined number of samples, which frames are then windowed; and transforming the combined signal or signals into the frequency domain to create a plurality of frequency subbands.
14. The method according to any of claims 11-13, wherein the step of determining gain values for subbands further comprises: determining gain values for each channel signal of the multi-channel audio describing the original sound image; and interpolating a single gain value for the subbands from the gain values of the channel signals.
15. The method according to any of claims 11-14, further comprising: determining a frequency-domain representation of the binaural signal for the subbands by multiplying the combined signal or signals by at least one gain value and a predetermined head-related transfer function filter.
16. The method according to claim 15, wherein the frequency-domain representations of the binaural signals for each frequency bin are determined from a monophonic sum signal Xsum1(n) according to:
Y1(n) = Xsum1(n) · Σ_{c=1..C} [ H1^c(n) · g1^c(n) ]
Y2(n) = Xsum1(n) · Σ_{c=1..C} [ H2^c(n) · g1^c(n) ]
where 0 ≤ n < N/2, Y1(n) and Y2(n) are the frequency-domain representations of the left and right binaural signals, C is the number of encoder channels, g1^c(n) is the interpolated gain value for the mono sum signal for constructing channel c at a particular time instant tw, and H1^c(n) and H2^c(n) are subband-domain representations of the head-related transfer function filters for the left and right ears for output channel c of the encoder.
17. The method according to claim 15, wherein the frequency-domain representations of the binaural signals for each frequency bin are determined from stereo sum signals Xsum1(n) and Xsum2(n) according to:
Y1(n) = Xsum1(n) · Σ_{c=1..C} [ H1^c(n) · g1^c(n) ] + Xsum2(n) · Σ_{c=1..C} [ H1^c(n) · g2^c(n) ]
Y2(n) = Xsum1(n) · Σ_{c=1..C} [ H2^c(n) · g1^c(n) ] + Xsum2(n) · Σ_{c=1..C} [ H2^c(n) · g2^c(n) ]
where 0 ≤ n < N/2, Y1(n) and Y2(n) are the frequency-domain representations of the left and right binaural signals, C is the number of encoder channels, g1^c(n) and g2^c(n) are the interpolated gain values for the left and right sum signals for constructing channel c at a particular time instant tw, and H1^c(n) and H2^c(n) are subband-domain representations of the head-related transfer function filters for the left and right ears for output channel c of the encoder.
18. The method according to claim 11, wherein: the gain values are determined by interpolating each gain value corresponding to a particular frequency subband from gain values of adjacent frequency subbands provided by the set of side information.
19. A parametric audio decoder, comprising: a parametric code processor for processing a parametrically encoded audio signal comprising at least one combined signal of a plurality of audio channels and one or more corresponding sets of side information describing a multi-channel sound image; means for dividing the combined signal or signals into a plurality of subbands; means for determining parameter values for the subbands from the set of side information; and a synthesizer for applying a predetermined set of head-related transfer function filters to the combined signal or signals in a proportion determined by the parameter values to synthesize a binaural audio signal.
20.
The decoder according to claim 19, wherein: the parameter values are determined by interpolating each parameter value corresponding to a particular subband from the next and previous gain values provided by the set of side information.
21. The decoder according to claim 19 or 20, wherein: the synthesizer is configured to apply, from the predetermined set of head-related transfer function filters, a left-right pair of head-related transfer function filters corresponding to each loudspeaker direction of the original multi-channel audio.
22. The decoder according to any of claims 19-21, wherein: the set of side information comprises a set of gain estimates for the channel signals of the multi-channel audio describing the original sound image.
23. The decoder according to claim 21, wherein: the set of side information comprises inter-channel cues used in a Binaural Cue Coding (BCC) scheme, such as Inter-channel Time Difference (ICTD), Inter-channel Level Difference (ICLD) and Inter-channel Coherence (ICC), and the decoder is configured to calculate a set of gain estimates of the original multi-channel audio on the basis of at least one of the inter-channel cues of the BCC scheme.
24. The decoder according to claim 19, further comprising: means for dividing the combined signal or signals into one of the following types of subbands: a plurality of QMF subbands; a plurality of Equivalent Rectangular Bandwidth (ERB) subbands; or a plurality of psychoacoustically motivated frequency bands.
25. The decoder according to claim 24, wherein: the means for dividing the at least one combined signal in the frequency domain comprise a filter bank configured to divide the combined signal or signals into 32 frequency bands complying with the Equivalent Rectangular Bandwidth (ERB) scale.
26. The decoder according to claim 25, further comprising: a summing unit for summing the outputs of the head-related transfer function filters for each of the frequency bands for a left-side signal and a right-side signal separately; and a transformation unit for transforming the summed left-side signal and the summed right-side signal into the time domain to create a left-side component and a right-side component of a binaural audio signal.
27. The decoder according to claim 19, wherein: the parameter values are gain values for at least one subband.
28. The decoder according to claim 27, wherein: the gain values are determined by selecting the closest gain value provided by the set of side information.
29. The decoder according to claim 27 or 28, wherein the means for determining gain values for at least one subband are configured to: determine gain values for each channel signal of the multi-channel audio describing the original sound image; and interpolate a single gain value for at least one subband from the gain values of the channel signals.
30. The decoder according to any of claims 27-29, wherein the decoder is configured to: determine a frequency-domain representation of the binaural signal for at least one subband by multiplying the combined signal or signals by at least one gain value and a predetermined head-related transfer function filter.
31.
A computer program product, stored on a computer-readable medium and executable in a data processing device, for processing a parametrically encoded audio signal comprising at least one combined signal of a plurality of audio channels and one or more corresponding sets of side information describing a multi-channel sound image, the computer program product comprising: a computer program code section for dividing the combined signal or signals into a plurality of subbands; a computer program code section for determining parameter values for at least one subband from the set of side information; and a computer program code section for applying a predetermined set of head-related transfer function filters to the combined signal or signals in a proportion determined by the parameter values to synthesize a binaural audio signal.
32. An apparatus for synthesizing a binaural audio signal, the apparatus comprising: means for inputting a parametrically encoded audio signal comprising at least one combined signal of a plurality of audio channels and one or more corresponding sets of side information describing a multi-channel sound image; means for dividing the combined signal or signals into a plurality of subbands; means for determining parameter values for at least one subband from the set of side information; means for applying a predetermined set of head-related transfer function filters to the combined signal or signals in a proportion determined by the parameter values to synthesize a binaural audio signal; and means for supplying the binaural audio signal to audio reproduction means.
33. The apparatus according to claim 32, the apparatus being a mobile terminal, a personal digital assistant (PDA) device or a personal computer.
MX/A/2008/008424A 2006-01-09 2008-06-26 Decoding of binaural audio signals MX2008008424A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
PCT/FI2006/050014
US11334041 2006-01-17
US11354211 2006-02-13

Publications (1)

Publication Number Publication Date
MX2008008424A true MX2008008424A (en) 2008-10-03


Similar Documents

Publication Publication Date Title
EP1971978B1 (en) Controlling the decoding of binaural audio signals
US20070160219A1 (en) Decoding of binaural audio signals
US20200335115A1 (en) Audio encoding and decoding
JP4606507B2 (en) Spatial downmix generation from parametric representations of multichannel signals
WO2007080225A1 (en) Decoding of binaural audio signals
EP4294055A1 (en) Audio signal processing method and apparatus
CN111970630B (en) Audio decoder and decoding method
EP3808106A1 (en) Spatial audio capture, transmission and reproduction
RU2427978C2 (en) Audio coding and decoding
KR20080078907A (en) Controlling the decoding of binaural audio signals
WO2007080224A1 (en) Decoding of binaural audio signals
MX2008008424A (en) Decoding of binaural audio signals
MX2008008829A (en) Decoding of binaural audio signals
WO2022258876A1 (en) Parametric spatial audio rendering