MX2008008829A

MX2008008829A - Decoding of binaural audio signals

Info

Publication number: MX2008008829A
Application number: MX/A/2008/008829A
Authority: MX
Inventors: Pasi Ojala; Julia Turku; Mauri Vaananen
Original assignee: Nokia Corporation; Pasi Ojala; Julia Turku; Vaeaenaenen Mauri
Priority date: 2006-01-09
Filing date: 2008-07-08
Publication date: 2008-09-26

Abstract

A method for synthesizing a binaural audio signal, the method comprising: inputting a parametrically encoded audio signal comprising at least one combined signal of a plurality of audio channels and one or more corresponding sets of side information describing a multi-channel sound image; and applying a predetermined set of head-related transfer function filters to the at least one combined signal in proportion determined by the corresponding set of side information to synthesize a binaural audio signal. A corresponding parametric audio decoder, parametric audio encoder, computer program product, and apparatus for synthesizing a binaural audio signal are also described.

Description

DECODING OF BINAURAL AUDIO SIGNALS
RELATED APPLICATIONS
This application claims priority from international application PCT/FI2006/050014, filed on January 9, 2006, and from United States application 11/334,041, filed on January 17, 2006.

FIELD OF THE INVENTION
The present invention relates to spatial audio coding, and more particularly to the decoding of binaural audio signals.

BACKGROUND OF THE INVENTION
In spatial audio coding, a two-channel or multi-channel audio signal is processed such that the audio signals to be reproduced on different audio channels differ from one another, thereby giving listeners the impression of a spatial effect around the audio source. The spatial effect can be created by recording the audio directly in formats suitable for multi-channel or binaural reproduction, or it can be created artificially in any two-channel or multi-channel audio signal, which is known as spatialization.
It is generally known that, for headphone reproduction, artificial spatialization can be performed by Head-Related Transfer Function (HRTF) filtering, which produces binaural signals for the listener's left and right ears. Sound source signals are filtered with filters derived from the HRTFs corresponding to their direction of origin. An HRTF is the transfer function measured from a sound source in the free field to the ear of a human or an artificial head, divided by the transfer function to a microphone that replaces the head and is placed at the middle of the head. An artificial room effect (for example, early reflections and/or late reverberation) can be added to the spatialized signals to improve source externalization and naturalness.
As the variety of audio listening and interaction devices increases, compatibility becomes more important. Among spatial audio formats, compatibility is pursued through upmix and downmix techniques. It is generally known that algorithms exist for converting a multi-channel audio signal into a stereo format, such as Dolby Digital® and Dolby Surround®, and for further converting the stereo signal into a binaural signal. In this kind of processing, however, the spatial image of the original multi-channel audio signal cannot be fully reproduced.
A better way to convert a multi-channel audio signal for headphone listening is to replace the original loudspeakers with virtual loudspeakers by means of HRTF filtering and to play the loudspeaker channel signals through them (for example, Dolby Headphone®). This process, however, has the disadvantage that a multi-channel mix is always needed first in order to generate a binaural signal. That is, the multi-channel signals (for example, 5+1 channels) are first decoded and synthesized, and the HRTFs are then applied to each signal to form a binaural signal. This is computationally heavy compared to decoding directly from the compressed multi-channel format into the binaural format.
Binaural Cue Coding (BCC) is a highly developed parametric spatial audio coding method. BCC represents a spatial multi-channel signal as a single (or several) downmixed audio channel and a set of perceptually relevant inter-channel differences, estimated as a function of frequency and time from the original signal. The method allows a spatial audio signal mixed for an arbitrary loudspeaker layout to be converted for any other loudspeaker layout consisting of either the same or a different number of loudspeakers.
BCC is therefore designed for multi-channel loudspeaker systems. Generating a binaural signal from a BCC-processed mono signal and its side information, however, requires that a multi-channel representation first be synthesized from the mono signal and the side information; only then can a binaural signal for spatial headphone playback be generated from the multi-channel representation. This procedure is evidently not optimized for generating a binaural signal.

SUMMARY OF THE INVENTION
An improved method, and technical equipment implementing the method, has now been invented, by which a binaural signal can be generated directly from a parametrically encoded audio signal. Various aspects of the invention include a decoding method, a decoder, an apparatus, an encoding method, an encoder, and computer programs, which are characterized by what is stated in the independent claims. Various embodiments of the invention are disclosed in the dependent claims.
According to a first aspect, a method according to the invention is based on the idea of synthesizing a binaural audio signal such that a parametrically encoded audio signal is first input, the signal comprising at least one combined signal of a plurality of audio channels and one or more corresponding sets of side information describing a multi-channel sound image. A predetermined set of head-related transfer function filters is then applied to the at least one combined signal in a proportion determined by the corresponding set of side information, to synthesize a binaural audio signal.
According to one embodiment, a left-right pair of head-related transfer function filters corresponding to each loudspeaker direction of the original multi-channel loudspeaker layout is chosen from the predetermined set of head-related transfer function filters to be applied.
According to one embodiment, said set of side information comprises a set of gain estimates for the channel signals of the multi-channel audio, describing the original sound image.
According to one embodiment, the gain estimates of the original multi-channel audio are determined as a function of time and frequency, and the gains for each loudspeaker channel are adjusted such that the sum of the squares of the gain values equals one.
According to one embodiment, the at least one combined signal is divided into time frames of an employed frame length, and the frames are then windowed; the at least one combined signal is transformed into the frequency domain before the head-related transfer function filters are applied.
According to one embodiment, the at least one combined signal is divided in the frequency domain into a plurality of psychoacoustically motivated frequency bands, such that the frequency bands comply with the Equivalent Rectangular Bandwidth (ERB) scale, before the head-related transfer function filters are applied.
According to one embodiment, the outputs of the head-related transfer function filters for each frequency band are summed separately for a left-side signal and a right-side signal; the summed left-side signal and the summed right-side signal are then transformed into the time domain to create a left-side component and a right-side component of a binaural audio signal.
A second aspect provides a method for generating a parametrically encoded audio signal, the method comprising: inputting a multi-channel audio signal comprising a plurality of audio channels; generating at least one combined signal of the plurality of audio channels; and generating one or more corresponding sets of side information including gain estimates for the plurality of audio channels.
According to one embodiment, the gain estimates are calculated by comparing the gain level of each individual channel with the cumulated gain level of the combined signal or signals.
The arrangement according to the invention provides significant advantages. A major advantage is the simplicity and low computational complexity of the decoding process. The decoder is also flexible in that it performs the binaural synthesis entirely on the basis of the spatial and encoding parameters given by the encoder.
Furthermore, spatiality equal to that of the original signal is maintained in the conversion. As for the side information, a set of gain estimates of the original mix suffices. Most significantly, the invention enables improved utilization of the compressed intermediate state provided in parametric audio coding, improving the efficiency of audio transmission as well as storage.
Further aspects of the invention include various apparatuses arranged to carry out the inventive steps of the above methods.
BRIEF DESCRIPTION OF THE DRAWINGS
In the following, various embodiments of the invention will be described in more detail with reference to the accompanying drawings, in which:
Figure 1 shows a generic Binaural Cue Coding (BCC) scheme according to the prior art;
Figure 2 shows the general structure of a BCC synthesis scheme according to the prior art;
Figure 3 shows a block diagram of the binaural decoder according to an embodiment of the invention; and
Figure 4 shows an electronic device according to an embodiment of the invention in a reduced block diagram.

DESCRIPTION OF EMBODIMENTS
In the following, the invention is illustrated with reference to Binaural Cue Coding (BCC) as an exemplary platform for implementing the decoding scheme according to the embodiments. It is noted, however, that the invention is not limited to BCC-type spatial audio coding methods; it can be implemented in any audio coding scheme that provides at least one audio signal combined from the original set of one or more audio channels, together with appropriate spatial side information.
Binaural Cue Coding (BCC) is a general concept for the parametric representation of spatial audio, delivering multi-channel output with an arbitrary number of channels from a single audio channel plus some side information. Figure 1 illustrates this concept. Several (M) input audio channels are combined into a single output signal (S; "sum") by a downmix process. In parallel, the most salient inter-channel cues describing the multi-channel sound image are extracted from the input channels and coded compactly as BCC side information. Both the sum signal and the side information are then transmitted to the receiver side, possibly using an appropriate low-bit-rate audio coding scheme for coding the sum signal. Finally, the BCC decoder generates a multi-channel (N) output signal for loudspeakers from the transmitted sum signal and the spatial cue information by re-synthesizing channel output signals that carry the relevant inter-channel cues, such as Inter-channel Time Difference (ICTD), Inter-channel Level Difference (ICLD) and Inter-channel Coherence (ICC). Accordingly, the BCC side information, i.e. the inter-channel cues, is chosen with a view to optimizing the reconstruction of the multi-channel audio signal, particularly for loudspeaker playback.
There are two BCC schemes: BCC for Flexible Rendering (type I BCC), which is meant for the transmission of a number of separate source signals for the purpose of rendering at the receiver, and BCC for Natural Rendering (type II BCC), which is meant for the transmission of a number of audio channels of a stereo or surround signal. BCC for Flexible Rendering takes separate audio source signals (e.g. speech signals, separately recorded instruments, multitrack recordings) as input. BCC for Natural Rendering, in turn, takes a "final mix" stereo or multi-channel signal as input (for example, compact disc (CD) audio or digital versatile disc (DVD) surround sound). If these processes were carried out with conventional coding techniques, the bit rate would scale proportionally, or at least nearly proportionally, to the number of audio channels; for example, transmitting the six audio channels of a 5.1 multi-channel system requires a bit rate nearly six times that of one audio channel. Both BCC schemes, however, result in a bit rate only slightly higher than that required for transmitting one audio channel, since the BCC side information requires only a very low bit rate (for example, 2 kb/s).
Figure 2 shows the general structure of a BCC synthesis scheme. The transmitted mono signal ("sum") is first windowed in the time domain into frames and then mapped to a spectral representation of appropriate subbands by a Fast Fourier Transform (FFT) process and a filter bank (FB). Instead of the FFT and FB processes, a Quadrature Mirror Filter (QMF) filter bank can be used to decompose the signal. In the general case of playback channels, the ICLD and ICTD are considered in each subband between pairs of channels, i.e. for each channel relative to a reference channel.
The subbands are selected such that a sufficiently high frequency resolution is achieved; for example, a subband width equal to twice the Equivalent Rectangular Bandwidth (ERB) scale is typically considered suitable. For each output channel to be generated, individual time delays (ICTD) and level differences (ICLD) are imposed on the spectral coefficients, followed by a coherence synthesis process that re-introduces the most relevant aspects of coherence and/or correlation (ICC) between the synthesized audio channels. Finally, all synthesized output channels are converted back into a time-domain representation by an inverse FFT (IFFT) process, resulting in the multi-channel output. For a more detailed description of the BCC approach, reference is made to: F. Baumgarte and C. Faller: "Binaural Cue Coding - Part I: Psychoacoustic Fundamentals and Design Principles", IEEE Transactions on Speech and Audio Processing, Vol. 11, No. 6, November 2003, and to: C. Faller and F. Baumgarte: "Binaural Cue Coding - Part II: Schemes and Applications", IEEE Transactions on Speech and Audio Processing, Vol. 11, No. 6, November 2003.
BCC is one example of a coding scheme that provides a suitable platform for implementing the decoding scheme according to the embodiments. The binaural decoder according to one embodiment receives the monophonized signal and the side information as inputs. The idea is to replace each loudspeaker in the original mix with a pair of HRTFs corresponding to the direction of the loudspeaker relative to the listening position. Each frequency channel of the monophonized signal is fed to each pair of filters implementing the HRTFs in the proportion dictated by a set of gain values, which can be calculated on the basis of the side information. The process can thus be regarded as implementing a set of virtual loudspeakers, corresponding to the original ones, in the binaural audio scene.
Accordingly, the invention adds value to BCC by allowing a binaural audio signal, in addition to multi-channel audio signals for various loudspeaker layouts, to be derived directly from the parametrically encoded spatial audio signal without any intermediate BCC synthesis process.
Some embodiments of the invention are illustrated below with reference to Figure 3, which shows a block diagram of the binaural decoder according to an aspect of the invention. The decoder (300) comprises a first input (302) for the monophonized signal and a second input (304) for the side information. The inputs (302), (304) are shown as distinct inputs in order to illustrate the embodiments, but a skilled person will appreciate that in a practical implementation the monophonized signal and the side information can be supplied via the same input.
According to one embodiment, the side information need not include the same inter-channel cues as in the BCC schemes, i.e. Inter-channel Time Difference (ICTD), Inter-channel Level Difference (ICLD) and Inter-channel Coherence (ICC); rather, a set of gain estimates defining the distribution of sound pressure among the channels of the original mix in each frequency band suffices. In addition to the gain estimates, the side information preferably includes the number and positions of the loudspeakers of the original mix relative to the listening position, as well as the employed frame length. According to one embodiment, instead of transmitting the gain estimates as part of the side information from an encoder, the gain estimates are calculated in the decoder from the inter-channel cues of the BCC schemes, for example from the ICLD.
The decoder (300) further comprises a windowing unit (306), in which the monophonized signal is first divided into time frames of the employed frame length, and the frames are then appropriately windowed, for example with a sine window. The frame length should be chosen such that the frames are long enough for the discrete Fourier transform (DFT) while being short enough to follow rapid variations in the signal. Experiments have shown that a suitable frame length is around 50 ms. Accordingly, if a sampling frequency of 44.1 kHz (commonly used in various audio coding schemes) is used, the frame may comprise, for example, 2048 samples, which results in a frame length of 46.4 ms. The windowing is preferably done such that adjacent windows overlap by 50% in order to smooth the transitions caused by spectral modifications (level and delay).
Thereafter, the windowed monophonized signal is transformed into the frequency domain in an FFT unit (308). The processing is done in the frequency domain for the sake of efficient computation. A skilled person will appreciate that the preceding signal processing steps can be carried out outside the actual decoder (300), i.e. the windowing unit (306) and the FFT unit (308) can be implemented in the apparatus in which the decoder is included, so that the monophonized signal is already windowed and transformed into the frequency domain when it is supplied to the decoder.
For efficient computation of the frequency-domain signal, the signal is fed into a filter bank (310), which divides the signal into psychoacoustically motivated frequency bands.
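The framing and windowing performed by the windowing unit (306) can be illustrated with a minimal Python sketch. The function names and the exact sine-window formula (with a half-sample offset) are assumptions made for illustration; the text only specifies a sine window, a frame of for example 2048 samples at 44.1 kHz, and 50% overlap.

```python
import math


def frame_duration_ms(frame_len, sample_rate_hz):
    """Duration of one frame in milliseconds."""
    return 1000.0 * frame_len / sample_rate_hz


def sine_window(frame_len):
    """Sine window; the half-sample offset is an assumption of this sketch."""
    return [math.sin(math.pi * (n + 0.5) / frame_len) for n in range(frame_len)]


def split_into_frames(signal, frame_len):
    """Split a signal into sine-windowed frames that overlap by 50%."""
    hop = frame_len // 2  # 50% overlap between adjacent frames
    window = sine_window(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frames.append([signal[start + n] * window[n] for n in range(frame_len)])
    return frames


if __name__ == "__main__":
    # 2048 samples at 44.1 kHz, as in the example in the text
    print(round(frame_duration_ms(2048, 44100), 1))
```

With this particular window, the squares of two windows offset by half a frame sum to one, so a signal that is sine-windowed once before the transform and once again after processing can be overlap-added without amplitude ripple.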
According to one embodiment, the filter bank (310) is designed to divide the signal into 32 frequency bands complying with the commonly acknowledged Equivalent Rectangular Bandwidth (ERB) scale, resulting in signal components x0, ..., x31 on said 32 frequency bands. As an alternative to the blocks (306), (308) and (310), the time-frequency processing of the monophonized signal can be carried out in a QMF filter bank unit that decomposes the signal. A skilled person will appreciate that, besides FFT processing or QMF filter bank processing, any other suitable method can be used to carry out the desired time-frequency processing.
The decoder (300) comprises a set of HRTFs (312), (314) as pre-stored information, from which a left-right pair of HRTFs corresponding to each loudspeaker direction is chosen. For illustration, two sets of HRTFs (312), (314) are shown in Figure 3, one for the left-side signal and one for the right-side signal, but in a practical implementation a single HRTF set evidently suffices. To adjust the chosen left-right HRTF pairs to the sound level of each loudspeaker channel, the gain values G are preferably estimated. As mentioned above, the gain estimates can be included in the side information received from the encoder, or they can be calculated in the decoder on the basis of the BCC side information. Accordingly, a gain is estimated for each loudspeaker channel as a function of time and frequency, and in order to preserve the gain level of the original mix, the gains for each loudspeaker channel are preferably adjusted such that the sum of the squares of the gain values equals one. This provides the advantage that, if N is the number of channels to be virtually generated, only N-1 gain estimates need to be transmitted from the encoder; the missing gain value can be calculated from the N-1 transmitted gain values.
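The unit-energy constraint on the gains, and the recovery of the one gain that is not transmitted, can be sketched as follows. This is a minimal illustration only: the function names are hypothetical, and the sketch assumes non-negative gains for a single time-frequency tile.

```python
import math


def normalize_gains(gains):
    """Scale gains so that the sum of their squares equals one."""
    energy = sum(g * g for g in gains)
    if energy == 0.0:
        raise ValueError("cannot normalize an all-zero gain set")
    scale = 1.0 / math.sqrt(energy)
    return [g * scale for g in gains]


def recover_missing_gain(transmitted):
    """Recover the N-th gain from N-1 transmitted gains.

    Assumes the full set was normalized to unit energy and that the
    missing gain is non-negative (an assumption of this sketch).
    """
    remaining = 1.0 - sum(g * g for g in transmitted)
    if remaining < -1e-9:
        raise ValueError("transmitted gains already exceed unit energy")
    return math.sqrt(max(remaining, 0.0))


if __name__ == "__main__":
    full = normalize_gains([1.0, 2.0, 2.0, 4.0])
    print(full, recover_missing_gain(full[:-1]))
```

Here `normalize_gains` corresponds to the adjustment of the gains to unit energy, and `recover_missing_gain` shows why transmitting N-1 values is enough when that constraint holds.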
A skilled person will appreciate, however, that the operation of the invention does not require the transmitted gain values to be adjusted such that the sum of their squares equals one; the decoder can instead scale the squares of the gain values such that the sum equals one.
Each left-right pair of the HRTF filters (312), (314) is then adjusted in the proportion dictated by the set of gains G, resulting in adjusted HRTF filters (312'), (314'). Again, it is noted that in practice the magnitudes of the original HRTF filters (312), (314) are merely scaled according to the gain values, but for the sake of illustrating the embodiments, "additional" sets of HRTFs (312'), (314') are shown in Figure 3.
For each frequency band, the mono signal components x0, ..., x31 are fed to each left-right pair of the adjusted HRTF filters (312'), (314'). The filter outputs for the left-side signal and for the right-side signal are then summed in summing units (316), (318) for both binaural channels. The summed binaural signals are windowed again with a sine window and transformed back into the time domain by an inverse FFT process carried out in IFFT units (320), (322). If the analysis filters do not sum to one, or their phase response is not linear, a proper synthesis filter bank is preferably used to avoid distortion in the final binaural signals BR and BL. Likewise, if a QMF filter bank unit is used to decompose the signal as described above, the IFFT units (320), (322) are preferably replaced by inverse QMF (IQMF) filter bank units.
According to one embodiment, in order to enhance the externalization, i.e. out-of-the-head localization, of the binaural signal, a moderate room response can be added to the binaural signal. For that purpose, the decoder may comprise a reverberation unit, preferably located between the summing units (316), (318) and the IFFT units (320), (322). The added room response imitates the effect of the room in a listening situation.
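The gain-weighted HRTF filtering and the summation in units (316), (318) can be sketched in Python as below. This is a simplified illustration under stated assumptions: each band is represented by a single complex coefficient, each HRTF by one complex value per band, and all names (`synthesize_band`, `synthesize_all_bands`, the argument layout) are hypothetical rather than taken from the source.

```python
def synthesize_band(x_band, gains, hrtf_left, hrtf_right):
    """Binaural output of one frequency band.

    x_band     : complex coefficient of the mono signal in this band
    gains      : one gain per virtual loudspeaker for this band
    hrtf_left  : left-ear HRTF value per virtual loudspeaker for this band
    hrtf_right : right-ear HRTF value per virtual loudspeaker for this band
    """
    # Each HRTF is scaled by its loudspeaker gain, applied to the mono
    # component, and the results are summed over all virtual loudspeakers.
    left = sum(g * h * x_band for g, h in zip(gains, hrtf_left))
    right = sum(g * h * x_band for g, h in zip(gains, hrtf_right))
    return left, right


def synthesize_all_bands(x_bands, gains_per_band, hrtf_left, hrtf_right):
    """Apply synthesize_band to every band.

    gains_per_band[b][n] is the gain of loudspeaker n in band b;
    hrtf_left[n][b] and hrtf_right[n][b] are indexed by loudspeaker, then band.
    """
    left_out, right_out = [], []
    for b, x in enumerate(x_bands):
        left, right = synthesize_band(
            x,
            gains_per_band[b],
            [h[b] for h in hrtf_left],
            [h[b] for h in hrtf_right],
        )
        left_out.append(left)
        right_out.append(right)
    return left_out, right_out
```

The two output lists correspond to the left and right spectra that would subsequently be windowed and passed to the inverse transform.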
The required reverberation time is, however, short enough that the computational complexity does not increase noticeably.
The binaural decoder (300) depicted in Figure 3 also enables a special case: stereo downmix decoding, in which the spatial image is narrowed. The operation of the decoder (300) is modified such that each adjustable HRTF filter (312), (314), which in the above embodiments was merely scaled according to the gain values, is replaced by a predetermined gain. Accordingly, the monophonized signal is processed through constant HRTF filters, each consisting of a single gain multiplied by a set of gain values calculated on the basis of the side information. As a result, the spatial audio is downmixed into a stereo signal. This special case provides the advantage that a stereo signal can be created from the combined signal or signals using the spatial side information without the need to decode the spatial audio, so the stereo decoding procedure is simpler than conventional BCC synthesis. The structure of the binaural decoder (300) otherwise remains the same as in Figure 3; only the adjustable HRTF filters (312), (314) are replaced by downmix filters having predetermined gains for the stereo downmix.
If the binaural decoder comprises HRTF filters for, e.g., a 5.1 surround audio configuration, then for the special case of stereo downmix decoding the constant gains for the HRTF filters may be, for example, as defined in Table 1.

Table 1. HRTF filters for stereo downmix
The arrangement according to the invention provides significant advantages. A major advantage is the simplicity and low computational complexity of the decoding process. The decoder is also flexible in that it performs the binaural upmix entirely on the basis of the spatial and encoding parameters given by the encoder. Furthermore, spatiality equal to that of the original signal is maintained in the conversion. As for the side information, a set of gain estimates of the original mix suffices. From the point of view of transmitting or storing the audio, the most significant advantage is gained through the improved efficiency of using the compressed intermediate state provided in parametric audio coding.
A skilled person will appreciate that, since HRTFs are highly individual and averaging them is impossible, perfect re-spatialization could only be achieved by measuring the listener's own unique HRTF set. Accordingly, the use of HRTFs inevitably colors the signal such that the quality of the processed audio is not equivalent to the original. However, since measuring each listener's HRTFs is an unrealistic option, the best possible result is achieved by using either a modeled set or a set measured from a dummy head or from a person with a head of average size and remarkable symmetry.
As stated above, according to one embodiment the gain estimates can be included in the side information received from the encoder. Accordingly, an aspect of the invention relates to an encoder for a multi-channel spatial audio signal that estimates a gain for each loudspeaker channel as a function of frequency and time and includes the gain estimates in the side information to be transmitted along with the one (or more) combined channels.
The encoder can be, for example, a BCC encoder known as such, which is further arranged to calculate the gain estimates, either in addition to or instead of the inter-channel cues ICTD, ICLD and ICC describing the multi-channel sound image. Both the sum signal and the side information, comprising at least the gain estimates, are then transmitted to the receiver side, preferably using an appropriate low-bit-rate audio coding scheme for coding the sum signal.
According to one embodiment, if the gain estimates are calculated in the encoder, the calculation is carried out by comparing the gain level of each individual channel with the cumulated gain level of the combined channel; that is, if the gain levels are denoted by X, the individual channels of the original loudspeaker layout by "m" and the samples by "k", then for each channel the gain estimate is calculated as $X_m(k) / X_{sum}(k)$. Accordingly, the gain estimates determine the proportional gain magnitude of each individual channel in comparison with the total gain magnitude of all channels.
According to one embodiment, if the gain estimates are calculated in the decoder on the basis of the BCC side information, the calculation can be carried out, for example, on the basis of the values of the inter-channel level difference ICLD. If N is the number of "loudspeakers" to be virtually generated, N-1 equations comprising N-1 unknown variables are first composed from the ICLD values. The sum of the squares of each loudspeaker equation is then set equal to 1, whereby the gain estimate of one individual channel can be solved, and on the basis of the solved gain estimate, the remaining gain estimates can be solved from the N-1 equations.
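The encoder-side ratio of a channel's level to the cumulated level can be sketched as follows. The text does not define how a "gain level" is measured, so this sketch makes two explicit assumptions: the level of a channel is its RMS value over a frame, and the cumulated level is the square root of the summed channel energies, which makes the squared gains sum to one. The function names are hypothetical.

```python
import math


def rms(samples):
    """Root-mean-square level of one frame (assumed level measure)."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def estimate_gains(channel_frames):
    """Gain estimate per channel: channel level divided by cumulated level.

    channel_frames: list of per-channel sample lists for one frame.
    The cumulated level is taken as the root of the summed channel
    energies (an assumption of this sketch, not stated in the text).
    """
    levels = [rms(frame) for frame in channel_frames]
    total = math.sqrt(sum(level * level for level in levels))
    if total == 0.0:
        # Silent frame: returning zeros is an arbitrary choice for this sketch.
        return [0.0] * len(levels)
    return [level / total for level in levels]


if __name__ == "__main__":
    print(estimate_gains([[3.0, 3.0], [4.0, 4.0]]))
```

In a real encoder the same ratio would be evaluated per frequency band as well as per frame, since the gains are described as functions of both time and frequency.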

As an example of the ICLD-based calculation, if the number of channels to be virtually generated is five (N = 5), the N-1 equations can be formed as follows: $L_2 = L_1 + \mathrm{ICLD}_1$, $L_3 = L_1 + \mathrm{ICLD}_2$, $L_4 = L_1 + \mathrm{ICLD}_3$ and $L_5 = L_1 + \mathrm{ICLD}_4$. The sum of their squares is then set equal to 1: $L_1^2 + (L_1 + \mathrm{ICLD}_1)^2 + (L_1 + \mathrm{ICLD}_2)^2 + (L_1 + \mathrm{ICLD}_3)^2 + (L_1 + \mathrm{ICLD}_4)^2 = 1$. The value of $L_1$ can then be solved, and based on $L_1$, the remaining gain level values $L_2$ to $L_5$ can be solved.
For simplicity, the above examples are described such that the input channels (M) are downmixed in the encoder to form a single combined (e.g. mono) channel. The embodiments are, however, equally applicable in alternative implementations in which the multiple input channels (M) are downmixed to form two or more separate combined channels (S), depending on the particular audio processing application. If the downmix generates multiple combined channels, the combined channel data can be transmitted using conventional audio transmission techniques. For example, if two combined channels are generated, conventional stereo transmission techniques can be employed. In this case, a BCC decoder can extract and use the BCC codes to synthesize a binaural signal from the two combined channels.
According to one embodiment, the number (N) of virtually generated "loudspeakers" in the synthesized binaural signal may differ from (be greater or smaller than) the number of input channels (M), depending on the particular application. For example, the input audio could correspond to 7.1 surround sound and the binaural output audio could be synthesized to correspond to 5.1 surround sound, or vice versa.
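The ICLD-based solution of the gain levels reduces to a quadratic equation in $L_1$, which the following Python sketch solves. It follows the additive form of the equations literally (no decibel conversion is applied) and, as an assumption not stated in the text, picks the larger of the two roots. The function name is hypothetical.

```python
import math


def solve_levels_from_icld(icld):
    """Solve L1..LN from N-1 ICLD values with L_i = L_1 + ICLD_(i-1).

    Setting the sum of squares to one gives
        N*L1^2 + 2*S1*L1 + (S2 - 1) = 0,
    where S1 is the sum of the ICLD values and S2 the sum of their squares.
    The larger root is chosen (an assumption of this sketch).
    """
    n = len(icld) + 1
    s1 = sum(icld)
    s2 = sum(d * d for d in icld)
    disc = s1 * s1 - n * (s2 - 1.0)
    if disc < 0.0:
        raise ValueError("no real solution for these ICLD values")
    l1 = (-s1 + math.sqrt(disc)) / n
    return [l1] + [l1 + d for d in icld]


if __name__ == "__main__":
    levels = solve_levels_from_icld([0.1, -0.1, 0.2, -0.2])
    print(levels, sum(v * v for v in levels))
```

For N = 5 with all ICLD values equal to zero, every level comes out as one over the square root of five, i.e. the energy is split evenly across the five virtual loudspeakers.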
The above embodiments can be generalized such that the embodiments of the invention allow M input audio channels to be converted into S combined audio channels and one or more corresponding sets of side information, where M > S, and N output audio channels to be generated from the S combined audio channels and the corresponding sets of side information, where N > S, and N may be equal to or different from M.
Since the bit rate required for transmitting one combined channel and the necessary side information is very low, the invention is especially applicable in systems where available bandwidth is a scarce resource, such as wireless communication systems. Accordingly, the embodiments are especially applicable in mobile terminals or other portable devices that typically lack high-quality loudspeakers, where the features of multi-channel surround sound can be introduced through headphones playing the binaural audio signal according to the embodiments. A further field of viable applications includes teleconferencing services, where the participants can be easily distinguished by giving listeners the impression that the participants speaking in the conference are at different locations in the conference room.
Figure 4 illustrates a simplified structure of a data processing device (TE) in which the binaural decoding system according to the invention can be implemented. The data processing device (TE) can be, for example, a mobile terminal, a personal digital assistant (PDA) device or a personal computer (PC). The data processing device (TE) comprises input/output means (I/O), a central processing unit (CPU) and memory (MEM). The memory (MEM) comprises a read-only memory (ROM) portion and a rewritable portion, such as random access memory (RAM) and flash memory (FLASH).
The information used to communicate with different external parties, for example a CD-ROM, other devices and the user, is transmitted through the I/O means (I/O) to and from the central processing unit (CPU). If the data processing device is implemented as a mobile station, it typically includes a transceiver (Tx/Rx), which communicates with the wireless network, typically with a base transceiver station (BTS), through an antenna. User interface (UI) equipment typically includes a display, a keypad, a microphone and connecting means for headphones. The data processing device may further comprise connecting means (MMC), such as a standard form slot, for various hardware modules or integrated circuits (IC), which may provide various applications to be run in the data processing device.
Accordingly, the binaural decoding system according to the invention can be executed in a central processing unit (CPU) or in a dedicated digital signal processor (DSP) (a parametric code processor) of the data processing device, whereby the data processing device receives a parametrically encoded audio signal comprising at least one combined signal of a plurality of audio channels and one or more corresponding sets of side information describing a multi-channel sound image. The parametrically encoded audio signal may be received from memory means, for example a CD-ROM, or from a wireless network via the antenna and the transceiver (Tx/Rx).
The data processing device further comprises a suitable filter bank and a predetermined set of head-related transfer function filters, whereby the data processing device transforms the combined signal or signals into the frequency domain and applies an appropriate left-right pair of head-related transfer function filters to the combined signal or signals, in a proportion determined by the corresponding group of secondary information, to synthesize a binaural audio signal, which is then reproduced via the headphones. Likewise, the encoding system according to the invention can be executed in the central processing unit (CPU) or in a dedicated digital signal processor (DSP) of the data processing device, whereby the data processing device generates a parametrically encoded audio signal comprising at least one combined signal of a plurality of audio channels and one or more corresponding groups of secondary information including gain estimates for the channel signals of the multi-channel audio. The functionalities of the invention can also be implemented in a terminal device, such as a mobile station, as a computer program which, when executed in a central processing unit (CPU) or in a dedicated digital signal processor (DSP), causes the terminal device to implement the methods of the invention. The functions of the computer program (SW) can be distributed among several separate program components that communicate with each other. The computer software can be stored in any memory means, such as the hard disk of a PC or a CD-ROM disc, from where it can be loaded into the memory of the mobile terminal. The computer software can also be loaded over a network, for example using a TCP/IP protocol stack. It is also possible to use hardware solutions or a combination of hardware and software solutions to implement the inventive means.
Accordingly, the above computer program product can be implemented at least partially as a hardware solution, for example as application-specific integrated circuits (ASIC) or field-programmable gate arrays (FPGA), in a hardware module comprising connecting means for connecting the module to an electronic device, or as one or more integrated circuits (IC), the hardware module or the ICs further including various means for performing the program code tasks, said means being implemented as hardware and/or software. It is obvious that the present invention is not limited solely to the embodiments presented above, but can be modified within the scope of the appended claims.
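By way of illustration only, the decoding step described above, in which a left-right pair of head-related transfer function filters is applied to the combined signal in a proportion determined by the secondary information, might be sketched for a single frequency band as follows. The function names, the use of one complex value per band, and the weighted-sum formula are assumptions made for this sketch and are not taken from the description; the normalization follows the condition that the squares of the gains sum to one.

```python
import math


def normalize_gains(gains):
    """Scale the per-channel gains so that the sum of their squares equals one."""
    energy = sum(g * g for g in gains)
    if energy == 0.0:
        raise ValueError("at least one gain must be non-zero")
    scale = 1.0 / math.sqrt(energy)
    return [g * scale for g in gains]


def synthesize_band(combined, gains, hrtf_left, hrtf_right):
    """Return the (left, right) binaural values for one frequency band.

    combined   -- complex frequency-domain value of the combined signal
    gains      -- one gain estimate per original loudspeaker channel
    hrtf_left  -- left-ear HRTF value for each loudspeaker direction
    hrtf_right -- right-ear HRTF value for each loudspeaker direction

    Assumption: each HRTF pair is weighted by the normalized gain of its
    channel, and the weighted filters are summed before being applied.
    """
    g = normalize_gains(gains)
    left = sum(gi * hl for gi, hl in zip(g, hrtf_left)) * combined
    right = sum(gi * hr for gi, hr in zip(g, hrtf_right)) * combined
    return left, right
```

In a complete decoder under the same assumptions, this computation would be repeated for every frequency band of every windowed frame, after which the left and right band values would be transformed back into the time domain.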

Claims

1. A method for synthesizing a binaural audio signal, the method comprising: inputting a parametrically encoded audio signal comprising at least one combined signal of a plurality of audio channels and one or more corresponding groups of secondary information describing a multi-channel sound image; and applying a predetermined group of head-related transfer function filters to the at least one combined signal, in a proportion determined by the corresponding group of secondary information, in order to synthesize a binaural audio signal.
2. The method according to claim 1, further comprising: applying, from the predetermined group of head-related transfer function filters, a left-right pair of head-related transfer function filters corresponding to each loudspeaker direction of the original multi-channel audio.
3. The method according to claim 1 or 2, wherein the group of secondary information comprises a group of gain estimates for the channel signals of the multi-channel audio describing the original sound image.
4. The method according to claim 3, wherein the group of secondary information further comprises the number and positions of the loudspeakers of the original multi-channel sound image in relation to a listening position, and an employed frame length.
5. The method according to claim 1 or 2, wherein the group of secondary information comprises inter-channel cues used in a Binaural Cue Coding (BCC) scheme, such as Inter-channel Time Difference (ICTD), Inter-channel Level Difference (ICLD) and Inter-channel Coherence (ICC), the method further comprising: calculating a group of gain estimates of the original multi-channel audio based on at least one of the inter-channel cues of the BCC scheme.
6. The method according to any of claims 3-5, further comprising: determining the group of gain estimates of the original multi-channel audio as a function of time and frequency; and adjusting the gains for each loudspeaker channel such that the sum of the squares of the gain values equals one.
7. The method according to any preceding claim, further comprising: dividing the at least one combined signal into time frames of an employed frame length, which frames are then windowed; and transforming the at least one combined signal into the frequency domain before applying the head-related transfer function filters.
8. The method according to claim 7, further comprising: dividing the at least one combined signal in the frequency domain into a plurality of psychoacoustically motivated frequency bands before applying the head-related transfer function filters.
9. The method according to claim 8, further comprising: dividing the at least one combined signal in the frequency domain into 32 frequency bands complying with the Equivalent Rectangular Bandwidth (ERB) scale.
10. The method according to any of claims 7-9, wherein the step of transforming the at least one combined signal into the frequency domain is performed using QMF filters to decompose the at least one combined signal.
11. The method according to any of claims 8-10, further comprising: summing the outputs of the head-related transfer function filters for each of the frequency bands for a left-side signal and a right-side signal separately; and transforming the summed left-side signal and the summed right-side signal into the time domain to create a left-side component and a right-side component of a binaural audio signal.
12. A method for synthesizing a stereo audio signal, the method comprising: inputting a parametrically encoded audio signal comprising at least one combined signal of a plurality of audio channels and one or more corresponding groups of secondary information describing a multi-channel sound image; and applying a group of downmix filters having predetermined gain values to the at least one combined signal, in a proportion determined by the corresponding group of secondary information, to synthesize a stereo audio signal.
13. A parametric audio decoder, comprising: a parametric code processor for processing a parametrically encoded audio signal comprising at least one combined signal of a plurality of audio channels and one or more corresponding groups of secondary information describing a multi-channel sound image; and a synthesizer for applying a predetermined group of head-related transfer function filters to the at least one combined signal, in a proportion determined by the corresponding group of secondary information, in order to synthesize a binaural audio signal.
14. The decoder according to claim 13, wherein the synthesizer is configured to apply, from the predetermined group of head-related transfer function filters, a left-right pair of head-related transfer function filters corresponding to each loudspeaker direction of the original multi-channel audio.
15. The decoder according to claim 13 or 14, wherein the group of secondary information comprises a group of gain estimates for the channel signals of the multi-channel audio describing the original sound image.
16. The decoder according to claim 13 or 14, wherein the group of secondary information comprises inter-channel cues used in a Binaural Cue Coding (BCC) scheme, such as Inter-channel Time Difference (ICTD), Inter-channel Level Difference (ICLD) and Inter-channel Coherence (ICC), the decoder being configured to calculate a group of gain estimates of the original multi-channel audio based on at least one of the inter-channel cues of the BCC scheme.
17. The decoder according to any of claims 13-16, further comprising: means for dividing the at least one combined signal into time frames of an employed frame length; means for windowing the frames; and means for transforming the at least one combined signal into the frequency domain before applying the head-related transfer function filters.
18. The decoder according to claim 17, further comprising: means for dividing the at least one combined signal in the frequency domain into a plurality of psychoacoustically motivated frequency bands before applying the head-related transfer function filters.
19. The decoder according to claim 18, wherein the means for dividing the at least one combined signal in the frequency domain comprise a filter bank configured to divide the at least one combined signal into 32 frequency bands complying with the Equivalent Rectangular Bandwidth (ERB) scale.
20. The decoder according to any of claims 17-19, wherein the means for transforming the at least one combined signal into the frequency domain comprise QMF filters configured to decompose the at least one combined signal.
21. The decoder according to any of claims 17-20, further comprising: a summing unit for summing the outputs of the head-related transfer function filters for each of the frequency bands for a left-side signal and a right-side signal separately; and a transforming unit for transforming the summed left-side signal and the summed right-side signal into the time domain to create a left-side component and a right-side component of a binaural audio signal.
22. A parametric audio decoder, comprising: a parametric code processor for processing a parametrically encoded audio signal comprising at least one combined signal of a plurality of audio channels and one or more corresponding groups of secondary information describing a multi-channel sound image; and a synthesizer for applying a group of downmix filters having predetermined gain values to the at least one combined signal, in a proportion determined by the corresponding group of secondary information, to synthesize a stereo audio signal.
23. A computer program product, stored on a computer-readable medium and executable in a data processing device, for processing a parametrically encoded audio signal comprising at least one combined signal of a plurality of audio channels and one or more corresponding groups of secondary information describing a multi-channel sound image, the computer program product comprising: a computer program code section for controlling the transformation of the at least one combined signal into the frequency domain; and a computer program code section for applying a predetermined group of head-related transfer function filters to the at least one combined signal, in a proportion determined by the corresponding group of secondary information, to synthesize a binaural audio signal.
24. An apparatus for synthesizing a binaural audio signal, the apparatus comprising: means for inputting a parametrically encoded audio signal comprising at least one combined signal of a plurality of audio channels and one or more corresponding groups of secondary information describing a multi-channel sound image; means for applying a predetermined group of head-related transfer function filters to the at least one combined signal, in a proportion determined by the corresponding group of secondary information, in order to synthesize a binaural audio signal; and means for supplying the binaural audio signal to audio reproduction means.
25. The apparatus according to claim 24, wherein the apparatus is a mobile terminal, a personal digital assistant (PDA) or a personal computer.
26. A method for generating a parametrically encoded audio signal, the method comprising: inputting a multi-channel audio signal comprising a plurality of audio channels; generating at least one combined signal from the plurality of audio channels; and generating one or more corresponding groups of secondary information including gain estimates for the plurality of audio channels.
27. The method according to claim 26, further comprising: calculating the gain estimates by comparing the gain level of each individual channel with the cumulative gain level of the combined signal.
28. The method according to claim 26 or 27, wherein the group of secondary information further comprises the number and positions of the loudspeakers of an original multi-channel sound image in relation to a listening position, and an employed frame length.
29. The method according to any of claims 26-28, wherein the group of secondary information further comprises inter-channel cues used in a Binaural Cue Coding (BCC) scheme, such as Inter-channel Time Difference (ICTD), Inter-channel Level Difference (ICLD) and Inter-channel Coherence (ICC).
30. The method according to any of claims 26-29, further comprising: determining the group of gain estimates of the original multi-channel audio as a function of time and frequency; and adjusting the gains for each loudspeaker channel such that the sum of the squares of the gain values equals one.
31. A parametric audio encoder for generating a parametrically encoded audio signal, the encoder comprising: means for inputting a multi-channel audio signal comprising a plurality of audio channels; means for generating at least one combined signal from the plurality of audio channels; and means for generating one or more corresponding groups of secondary information including gain estimates for the plurality of audio channels.
32. The encoder according to claim 31, further comprising: means for calculating the gain estimates by comparing the gain level of each individual channel with the cumulative gain level of the combined signal.
33. A computer program product, stored on a computer-readable medium and executable in a data processing device, for generating a parametrically encoded audio signal, the computer program product comprising: a computer program code section for inputting a multi-channel audio signal comprising a plurality of audio channels; a computer program code section for generating at least one combined signal from the plurality of audio channels; and a computer program code section for generating one or more corresponding groups of secondary information including gain estimates for the plurality of audio channels.
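By way of illustration only, the gain estimation recited in the encoder claims, in which the level of each individual channel is compared with a cumulative level, might be sketched as follows. The function name, the use of frame energy as the level measure, the use of the total energy over all channels as the cumulative level, and the handling of a silent frame are all assumptions made for this sketch and do not form part of the claims.

```python
import math


def estimate_gains(channels):
    """Estimate one gain per channel for a single frame (or band).

    channels -- list of equal-length sample lists, one per audio channel

    Assumption: the level of a channel is its energy over the frame, and
    the cumulative level is the total energy over all channels, so the
    squares of the resulting gains sum to one.
    """
    energies = [sum(x * x for x in ch) for ch in channels]
    total = sum(energies)
    if total == 0.0:
        # Silent frame: spread the gain evenly (assumption for this sketch).
        n = len(channels)
        return [1.0 / math.sqrt(n)] * n
    return [math.sqrt(e / total) for e in energies]
```

Because each gain is the square root of an energy ratio, the estimates already satisfy the condition that the sum of the squares of the gain values equals one, so no separate adjustment step is needed in this sketch.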